Thanks for raising this, Patrick.

I'm CCing this to [email protected] and will respond there first about 
ways to use Daffodil to do this.


If we have to discuss adding features/APIs to Daffodil (which we might), then 
the dev list is the right place for that discussion.


There's no notion in DFDL of "splittable" that exactly matches the concept you 
want, but there are techniques to discuss.



________________________________
From: Patrick GRANDJEAN <[email protected]>
Sent: Tuesday, December 11, 2018 5:01:35 PM
To: [email protected]
Subject: DFDL & Input Formats

Hi!
My name is Patrick and I recently attended a Spark meetup in Boston where 
Mike Beckerle presented Apache Daffodil (incubating). I work for a company 
that has to deal with many data formats, both new (JSON, XML, YAML, Protobuf, 
etc.) and old (EDIFACT, IATA formats, etc.). More recently, we have started to 
process files in Hadoop and we have developed "input formats" for each data 
format. Basically, an input format tells Hadoop how a file can be split into 
smaller parts to be processed.

To give an example, let's consider a huge XML document with the following structure:

<root schemaVersion="1.2.3">
  <transaction>...</transaction>
  <transaction>...</transaction>
  ...
  <transaction>...</transaction>
</root>

Each transaction needs to be processed individually. An input format can split 
such a file into a list of valid XML documents, each containing a single <transaction>:

<root schemaVersion="1.2.3">
  <transaction>...</transaction>
</root>
<root schemaVersion="1.2.3">
  <transaction>...</transaction>
</root>
...

It is not necessary to completely parse the XML at that point, only to split 
it into smaller pieces. Therefore, parsing <transaction> can be bypassed, except 
to detect the closing tag </transaction>.
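
As a rough illustration of the splitting described above, here is a minimal Python sketch. It is not a Hadoop input format and not anything Daffodil provides; the function name split_records and its record_tag parameter are hypothetical. It assumes the whole input fits in memory, that records are never nested or self-closing, and that the closing tag text never appears inside CDATA or comments.

```python
import re
from typing import Iterator


def split_records(xml_text: str, record_tag: str = "transaction") -> Iterator[str]:
    """Yield one standalone XML document per <record_tag> element.

    Hypothetical helper for illustration only. Assumes in-memory input and
    records that are never nested, self-closing, or wrapped in CDATA/comments.
    """
    # First start tag in the input. "<?xml ...?>" and "<!-- ... -->" are
    # skipped because an element name must begin with a letter or underscore.
    root_open = re.search(r"<([A-Za-z_][\w.\-:]*)\b[^>]*>", xml_text)
    if root_open is None:
        raise ValueError("no root element found")
    root_close = f"</{root_open.group(1)}>"

    # Non-greedy match from a record's opening tag to its first closing tag.
    # The record body is copied verbatim and never parsed.
    tag = re.escape(record_tag)
    record = re.compile(rf"<{tag}(?:\s[^>]*)?>.*?</{tag}\s*>", re.DOTALL)

    # Start scanning after the root's opening tag and wrap every record in a
    # copy of that opening tag (attributes included) plus the closing tag.
    for match in record.finditer(xml_text, root_open.end()):
        yield f"{root_open.group(0)}{match.group(0)}{root_close}"
```

Calling list(split_records(text)) on the example above would give one small document per transaction, each repeating the root element and its schemaVersion attribute. A real input format would scan a byte stream incrementally instead of holding the file in memory.
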

I was wondering if DFDL has such a concept of a splittable file. If not, would it 
be interesting to add it? The main advantage I see is this: if DFDL can describe a 
data format and how to split it, then one could use a generic Hadoop input 
format to process files using DFDL. In other words, in addition to parsers and 
unparsers, users could have Hadoop input formats for (almost?) free.

Please let me know if this idea makes sense in the context of Apache Daffodil. 
I would love to discuss this further.

Kind Regards,
Patrick.