Patrick,

So, the ability to split quickly depends on the data format and on how the data 
behaves.


In the XML example you gave, you are depending on a fast scan for the terminator 
"</transaction>", with no chance of it being tripped up by quoting problems such 
as that string appearing within the data itself. That is usually a pretty safe 
assumption for XML.
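
A minimal sketch of that kind of terminator scan, in Python. It is not Daffodil 
or Hadoop code; the function name split_records and the chunk size are 
assumptions made for illustration. It reads the stream in chunks and cuts a 
record each time the terminator is seen, without parsing anything inside the 
record.

```python
import io
from typing import BinaryIO, Iterator

TERMINATOR = b"</transaction>"


def split_records(stream: BinaryIO,
                  terminator: bytes = TERMINATOR,
                  chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield each span of bytes up to and including a terminator.

    Hypothetical helper: it only scans for the terminator and never
    parses the record contents.
    """
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        start = 0
        while True:
            end = buf.find(terminator, start)
            if end < 0:
                break
            end += len(terminator)
            yield buf[start:end]
            start = end
        # Keep the unterminated tail; a terminator may straddle two chunks.
        buf = buf[start:]
    # Bytes after the last terminator (e.g. "</root>") are dropped here.


if __name__ == "__main__":
    data = b"<root><transaction>a</transaction><transaction>b</transaction></root>"
    for record in split_records(io.BytesIO(data), chunk_size=5):
        print(record)
```

The first record still carries the opening <root ...> tag, and the scan would 
be fooled by the terminator string appearing inside a comment or CDATA section, 
which is exactly the quoting risk mentioned above.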


Consider  https://github.com/DFDLSchemas/GeoNames


GeoNames is a really good candidate for exactly the sort of fast split-up you 
are talking about. It's a 2+ gigabyte compressed file, and the data stream could 
be rapidly split up.


So I modified the https://github.com/OpenDFDL/daffodil-spark example.

It now processes GeoNames data using Spark, and that should show you how to 
"fast parse" data. The GeoNames data is a file of "quasi-XML" data that needs 
to be massaged back into real XML form, and it's really big. The "test" for 
GeoNames here reads a compressed GeoNames data file (small sample included), 
and writes out a compressed Spark RDD as files.

I think it illustrates that you have to drive the parser sequentially to split 
the data, but then after that all subsequent processing (in this case, 
assembling the fragments of quasi-XML into an actual piece of well-behaved XML) 
is spark-parallel work.
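
The shape of that pipeline (one sequential splitter feeding independent 
per-record work) can be sketched in plain Python. This is not the 
daffodil-spark code: a thread pool stands in for Spark, a line reader stands in 
for the parser, and split_sequentially, assemble and process are hypothetical 
names.

```python
import io
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, TextIO


def split_sequentially(stream: TextIO) -> Iterator[str]:
    """Stand-in for the sequential parser: one raw fragment per non-empty line."""
    for line in stream:
        line = line.rstrip("\n")
        if line:
            yield line


def assemble(fragment: str) -> str:
    """Hypothetical per-record work: wrap a fragment in an enclosing element."""
    return f"<record>{fragment}</record>"


def process(stream: TextIO, workers: int = 4) -> List[str]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() pulls fragments from the generator in the calling thread
        # (the sequential part) and runs assemble() on the worker threads.
        # Results come back in input order.
        return list(pool.map(assemble, split_sequentially(stream)))


if __name__ == "__main__":
    print(process(io.StringIO("a\nb\n\nc\n"), workers=2))
```

Threads in CPython do not give real CPU parallelism for work like this; the 
pool is only there to show where the sequential step ends and the independent 
per-record step begins.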

Give it a look see. I'm not much of a spark expert but I *think* this is going 
to create parallel data as fast as daffodil can separate it off.

-mike beckerle
<https://github.com/DFDLSchemas/>



________________________________
From: Mike Beckerle <[email protected]>
Sent: Wednesday, December 12, 2018 11:50 AM
To: [email protected]; Patrick GRANDJEAN; [email protected]
Subject: Re: DFDL & Input Formats


Thanks for raising this Patrick.


I'm CCing this to [email protected] and will respond there first about 
ways to use Daffodil to do this.


If we have to discuss adding features/APIs to Daffodil (which we might), then 
that would make sense for here on the dev list.


There's no notion of "splittable" in DFDL that matches exactly the concept you 
want, but there are techniques to discuss.



________________________________
From: Patrick GRANDJEAN <[email protected]>
Sent: Tuesday, December 11, 2018 5:01:35 PM
To: [email protected]
Subject: DFDL & Input Formats

Hi!
My name is Patrick and I recently attended a Spark meetup in Boston where 
Mike Beckerle presented Apache Daffodil (incubating). I work for a company 
that has to deal with many data formats, both new (JSON, XML, YAML, Protobuf, 
etc) and old (EDIFACT, IATA formats, etc). More recently, we have started to 
process files in Hadoop and we have developed "input formats" for each data 
format. Basically, an input format tells Hadoop how a file can be split into 
smaller parts to be processed.

To give an example, let's consider a huge XML having the following structure:
<root schemaVersion="1.2.3">
  <transaction>...</transaction>
  <transaction>...</transaction>
  ...
  <transaction>...</transaction>
</root>

Each transaction needs to be processed individually. An input format can split 
such an XML file into a list of valid XML documents, each containing a single 
<transaction>:
<root schemaVersion="1.2.3">  <transaction>...</transaction></root>
<root schemaVersion="1.2.3">  <transaction>...</transaction></root>
...
It is not necessary to completely parse the XML at that moment, only to split 
into smaller pieces. Therefore, parsing <transaction> can be bypassed, except 
to detect the closing tag </transaction>.
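
A minimal Python sketch of the output shape described above, where each 
transaction is re-wrapped in a copy of the root start tag. The function name 
rewrap is hypothetical, and the sketch assumes the text begins directly with 
the root start tag (no XML declaration), that no attribute value contains ">", 
and that "</transaction>" never appears inside a transaction's data. It holds 
the whole document in memory, so it only illustrates the result, not a 
scalable input format.

```python
import re
from typing import Iterator

# Non-greedy match from an opening tag to the first closing tag.
TRANSACTION = re.compile(r"<transaction>.*?</transaction>", re.DOTALL)


def rewrap(document: str) -> Iterator[str]:
    """Yield one small XML document per <transaction> element."""
    # Copy the root start tag verbatim so its attributes carry over.
    root_open = document[: document.index(">") + 1]
    for match in TRANSACTION.finditer(document):
        yield f"{root_open}{match.group(0)}</root>"


if __name__ == "__main__":
    doc = ('<root schemaVersion="1.2.3"><transaction>a</transaction>\n'
           '<transaction>b</transaction></root>')
    for piece in rewrap(doc):
        print(piece)
```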
I was wondering if DFDL has such a concept of splittable file. If not, would it 
be interesting to add it? The main advantage I see is: if DFDL can describe a 
data format and how to split it, then one could use a generic Hadoop input 
format to process files using DFDL. In other words, in addition to parsers and 
unparsers, users could have Hadoop input formats for (almost?) free.
Please let me know if this idea makes sense in the context of Apache Daffodil. 
I would love to discuss this further.

Kind Regards,
Patrick.
