Hi All,

Daffodil is an interesting project, as is the DFDLSchemas project. Thanks for
sharing!

An interesting challenge is how these libraries load data: what is their
internal format, and what API does an application use to consume the data?
I found this for Daffodil: it will "parse data into an infoset represented
as XML or JSON".

Drill is part of the "big data" ecosystem. Converting a 100GB file, say,
into XML, and then loading that XML into Drill, would be a bit cumbersome.
Better would be for the libraries to provide an API that Drill could
implement to receive the data and write it to vectors using, say, the new
row set framework that we've just added for CSV and will soon add for JSON.
Both JSON and XML offer push-style parsers (SAX, in the XML case) to which
the app provides a handler implementation; Drill uses this approach to
parse JSON.
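
To make that idea concrete, here is a hypothetical sketch of the kind of
push-style callback interface I have in mind. Every name below is invented
(this is not an existing Daffodil, Kaitai, or Drill API), and the
RowSetLoader calls are only loosely modeled on the row set framework:

  // RowSetLoader is from Drill's row set framework; the package path
  // and writer calls below are approximate.
  import org.apache.drill.exec.physical.rowSet.RowSetLoader;

  // Invented callback interface a format library could expose to
  // consumers: the parser drives, and the app listens.
  interface InfosetListener {
    void startRecord();
    void field(String name, String value);
    void endRecord();
  }

  // A Drill-side implementation that writes each parse event straight
  // into value vectors via row set writers.
  public class DrillInfosetListener implements InfosetListener {
    private final RowSetLoader rowWriter;

    public DrillInfosetListener(RowSetLoader rowWriter) {
      this.rowWriter = rowWriter;
    }

    @Override
    public void startRecord() {
      rowWriter.start();                        // begin a new row
    }

    @Override
    public void field(String name, String value) {
      rowWriter.scalar(name).setString(value);  // write one column value
    }

    @Override
    public void endRecord() {
      rowWriter.save();                         // commit the row to the batch
    }
  }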

Another issue is file splits: to store a large file on HDFS (yes, HDFS is
old, everyone uses S3 now), we want Drill to read each file block
separately. That means the file must be "splittable": there must be some
well-defined token that the scanner can search for at block boundaries. It
is not clear whether these parsers are designed for this big data model.
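
Hadoop's text input format handles this by having each reader scan forward
from its block boundary to the next newline; a splittable binary format
needs an analogous sync marker. A rough sketch of the idea (the marker
bytes and method names below are made up):

  // Sketch: align a reader to the first record at or after its split
  // start by scanning for a sync marker. The marker bytes are invented;
  // a real format must define one that cannot occur inside a record.
  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataInputStream;

  public class SplitAligner {
    private static final byte[] SYNC = {(byte) 0xDA, (byte) 0xFF, 0x0D, 0x11};

    // Returns the offset of the first sync marker at or after splitStart,
    // or -1 if none starts before splitEnd. (A real reader would also
    // handle a marker that straddles splitEnd.)
    public static long findRecordStart(FSDataInputStream in,
        long splitStart, long splitEnd) throws IOException {
      in.seek(splitStart);
      long pos = splitStart;
      int matched = 0;
      while (pos < splitEnd) {
        int b = in.read();
        if (b < 0) {
          return -1;                            // end of file
        }
        pos++;
        if (b == (SYNC[matched] & 0xFF)) {
          if (++matched == SYNC.length) {
            return pos - SYNC.length;           // marker found
          }
        } else {
          matched = (b == (SYNC[0] & 0xFF)) ? 1 : 0;
        }
      }
      return -1;                                // no record in this split
    }
  }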

For both projects, it would be good to read data into Arrow. Ideally, we'd
get a volunteer to port the row set mechanism to Arrow so that the same API
could write to both Arrow and Drill vectors (saving the entire world from
having to write their own vector-writing mechanisms).
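
Purely as a sketch of that layering (every name below is invented; neither
Drill nor Arrow defines these interfaces today), the idea is that format
readers would target one writer abstraction, with a vector-specific
implementation behind it for each engine:

  // Invented, vector-agnostic writer API. Format readers (Daffodil,
  // Kaitai Struct, etc.) would code against these interfaces only.
  interface ColumnWriter {
    void setString(String value);
    void setLong(long value);
  }

  interface RowWriter {
    void startRow();
    ColumnWriter column(String name);
    void saveRow();
  }

  // One implementation would back these with Drill value vectors, a
  // second with Arrow vectors, e.g. (both classes hypothetical):
  //   RowWriter drill = new DrillVectorRowWriter(batch);
  //   RowWriter arrow = new ArrowVectorRowWriter(allocator);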

Thanks,
- Paul

 

On Tuesday, April 2, 2019, 1:06:53 PM PDT, Ted Dunning
<[email protected]> wrote:

I have no idea how much uptake these would have, but if the library can
give all the formats all at once for modest effort, that would be great.

On Tue, Apr 2, 2019 at 9:22 AM Charles Givre <[email protected]> wrote:

> Hello everyone,
> I recently presented a talk at the ASF DC Roadshow (shameless plug [1]),
> where I heard a really good talk by a PMC member of the Apache Daffodil
> (incubating) project.  At its core, Daffodil is a collection of parsers
> that convert various data formats to a standard structure which can then
> be ingested into other tools.  Drill can already ingest some of these
> formats natively, such as PCAP and CSV, but many others it cannot, such
> as NACHA (bulk financial transactions), vCard, Shapefile, and more.
> Here is a brief presentation about Daffodil [2].
>
> The DFDLSchemas GitHub organization has a handful of DFDL schemas that
> are pretty good open source examples [3].
>
> On a related note, I stumbled on the Kaitai Struct library [4], which is
> another library that performs a similar function to Daffodil.  Would it
> be of interest to the community to incorporate these libraries into
> Drill?  My thought is that it would greatly increase the types of data
> that Drill can natively query and hence seriously increase Drill’s
> usefulness.  If there is interest (and honestly, even if there isn’t),
> I can start working on this for the next release of Drill.
>
>
> [1]: https://www.slideshare.net/cgivre/drilling-cyber-security-data-with-apache-drill
> [2]: https://www.slideshare.net/mbeckerle/tresys-dfdl-data-format-description-language-daffodil-open-source-public-overview-100432615
> [3]: https://github.com/DFDLSchemas
> [4]: http://formats.kaitai.io
>
>  
