Thanks for raising this, Patrick.

I'm CCing this to [email protected] and will respond there first about ways to use Daffodil to do this. If we have to discuss adding features/APIs to Daffodil (which we might), then it would make sense to do that here on the dev list.

There's no notion in DFDL of "splittable" that exactly matches the concept you want, but there are techniques to discuss.

________________________________
From: Patrick GRANDJEAN <[email protected]>
Sent: Tuesday, December 11, 2018 5:01:35 PM
To: [email protected]
Subject: DFDL & Input Formats

Hi!

My name is Patrick, and I recently attended a Spark meetup in Boston where Mike Beckerle presented Apache Daffodil (incubating).

I work for a company that has to deal with many data formats, both new (JSON, XML, YAML, Protobuf, etc.) and old (EDIFACT, IATA formats, etc.). More recently, we have started to process files in Hadoop, and we have developed "input formats" for each data format. Basically, an input format tells Hadoop how a file can be split into smaller parts to be processed.

To give an example, let's consider a huge XML file with the following structure:

  <root schemaVersion="1.2.3">
    <transaction>...</transaction>
    <transaction>...</transaction>
    ...
    <transaction>...</transaction>
  </root>

Each transaction needs to be processed individually. An input format can split such a file into a list of valid XML documents, each containing a single <transaction>:

  <root schemaVersion="1.2.3"><transaction>...</transaction></root>
  <root schemaVersion="1.2.3"><transaction>...</transaction></root>
  ...

It is not necessary to completely parse the XML at that moment, only to split it into smaller pieces. Therefore, parsing <transaction> can be bypassed, except to detect the closing tag </transaction>.

I was wondering if DFDL has such a concept of a splittable file. If not, would it be interesting to add it? The main advantage I see is this: if DFDL can describe a data format and how to split it, then one could use a generic Hadoop input format to process files using DFDL. In other words, in addition to parsers and unparsers, users could have Hadoop input formats for (almost?) free.

Please let me know if this idea makes sense in the context of Apache Daffodil. I would love to discuss this further.

Kind Regards,
Patrick.
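
For readers following along, the splitting described in the quoted message can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Daffodil or Hadoop code: the function name split_records is hypothetical, the whole input is held in memory as a string, <transaction> elements are assumed not to nest or be self-closing, and the closing tag is assumed not to appear inside comments or CDATA. The record body is never parsed; the scan only looks for the opening and closing tags.

```python
import re
from typing import Iterator


def split_records(text: str, record_tag: str = "transaction") -> Iterator[str]:
    """Yield one small document per record element, wrapped in the root tags.

    Hypothetical helper for illustration; assumes records do not nest and
    that the closing tag never appears inside comments or CDATA.
    """
    # First tag that starts with a name character is taken as the root
    # opening tag (this skips "<?xml ...?>" and "<!-- ... -->").
    root_open = re.search(r"<([A-Za-z_][\w.-]*)\b[^>]*>", text)
    if root_open is None:
        return
    root_close = f"</{root_open.group(1)}>"

    # Scan only for the record's opening and closing tags; the content in
    # between is copied through without being parsed.
    tag = re.escape(record_tag)
    record = re.compile(rf"<{tag}\b[^>]*>.*?</{tag}>", re.DOTALL)
    for match in record.finditer(text, root_open.end()):
        yield f"{root_open.group(0)}{match.group(0)}{root_close}"


if __name__ == "__main__":
    sample = (
        '<root schemaVersion="1.2.3">'
        "<transaction>a</transaction>"
        "<transaction>b</transaction>"
        "</root>"
    )
    for piece in split_records(sample):
        print(piece)
```

A real Hadoop input format would additionally have to cope with split boundaries that fall in the middle of a record, which this sketch does not attempt.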
