[ https://issues.apache.org/jira/browse/DRILL-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803337#comment-17803337 ]

ASF GitHub Bot commented on DRILL-8474:
---------------------------------------

mbeckerle commented on PR #2836:
URL: https://github.com/apache/drill/pull/2836#issuecomment-1877814024

   Let me respond between the paragraphs....
   
   On Tue, Jan 2, 2024 at 11:49 PM Paul Rogers ***@***.***>
   wrote:
   
   > Hi Mike,
   >
   > Just jumping in with a random thought. Drill has accumulated a number of
   > schema systems: Parquet metadata cache, HMS, Drill's own metastore,
   > "provided schema", and now DFDL. All provide ways of defining data: be it
   > Parquet, JSON, CSV or whatever. One can't help but wonder, should some
   > future version try to reduce this variation somewhat? Maybe map all the
   > variations to DFDL? Map DFDL to Drill's own mechanisms?
   >
   > Well we can dream can't we :-)
   
   I can contribute the ideas in
   https://daffodil.apache.org/dev/design-notes/Proposed-DFDL-Standard-Profile.md,
   which is an effort to restrict the DFDL language so that schemas written in
   DFDL can work more smoothly with Drill, NiFi, Spark, Flink, Beam, etc.
   
   DFDL's data model is too restrictive to be "the model" for Drill since
   Drill wants to query even unstructured data like XML without schema. DFDL's
   data model is targeted only at structured data.
   
   Drill's data model and APIs seem optimized for streaming block-buffered
   top-level rows of data (the EVF API does, anyway). Top-level row sets are
   first-class citizens, as are the fields of those rows. Fields containing
   arrays of maps (possibly containing more arrays of maps, and so on, deeply
   nested) are not handled uniformly with the same block-buffered "row-like"
   mechanisms. The APIs are similar, but not polymorphic. I suspect that
   block-buffered data streaming in Drill happens only for top-level rows,
   because there is no test for whether you are allowed to create another
   array item the way there is a test for creating another row in a row-set
   writer. There is no control inversion where an adapter must hand control
   back to Drill in the middle of writing an array.
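   The asymmetry described above can be sketched with plain Java. This is a
   hedged illustration only: RowSetWriterSketch and ArrayWriterSketch are
   hypothetical stand-ins, not the real EVF types. The row-set writer exposes
   a "may I start another row?" test, which is the control-inversion point;
   the array writer has nothing analogous, so an array is always written in
   full.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a row-set writer: exposes a per-row capacity test.
class RowSetWriterSketch {
    private final int batchCapacity;
    private final List<int[]> rows = new ArrayList<>();

    RowSetWriterSketch(int batchCapacity) { this.batchCapacity = batchCapacity; }

    // Control inversion point: returns false when the batch is full, handing
    // control back to the engine before the adapter writes another row.
    boolean startRow() { return rows.size() < batchCapacity; }

    void saveRow(int[] row) { rows.add(row); }

    int rowCount() { return rows.size(); }
}

// Hypothetical stand-in for an array writer: no capacity test exists, so an
// adapter can never be paused mid-array.
class ArrayWriterSketch {
    private final List<Integer> items = new ArrayList<>();

    void writeItem(int v) { items.add(v); }

    int itemCount() { return items.size(); }
}

public class ControlInversionDemo {
    public static void main(String[] args) {
        RowSetWriterSketch rowWriter = new RowSetWriterSketch(2);
        int rowsWritten = 0;
        for (int i = 0; i < 5 && rowWriter.startRow(); i++) {
            rowWriter.saveRow(new int[] { i });
            rowsWritten++;
        }
        // Only 2 of the 5 rows fit; the adapter yields for the rest.
        System.out.println("rows written before yielding: " + rowsWritten);

        ArrayWriterSketch arrayWriter = new ArrayWriterSketch();
        for (int i = 0; i < 5; i++) {
            arrayWriter.writeItem(i); // no test, no yielding mid-array
        }
        System.out.println("array items written: " + arrayWriter.itemCount());
    }
}
```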
   
   The current Drill/Daffodil interface I've created doesn't cope with
   header-body* files (e.g., PCAP, a format with a header record followed by
   repeating packet records), as it has no way of returning just the body
   records as top-level rows. So while there exists a DFDL schema for PCAP,
   you really do want to use a dedicated PCAP Drill adapter that hands back
   rows, not Daffodil, which will parse the entire PCAP file into one huge
   row containing a monster sub-array of packets, where each packet is a map
   within the array of maps. This is OK for now, since many formats where
   DFDL is used are not like PCAP: they are just repeating records of one
   format with no special whole-file header. Eventually we will want to be
   able to supply a path telling the Drill/Daffodil interface that you want
   only the packet array as the output rows. (This is the unimplemented
   Daffodil "onPath(...)" API feature. We haven't needed it yet for DFDL work
   in cybersecurity, but it was anticipated 10+ years back as essential for
   data integration.)
   
   
   > Drill uses two kinds of metadata: schema definitions and file metadata
   > used
   > for scan pruning. Schema information could be used at plan time (to
   > provide
   > column types), but certainly at scan time (to "discover" the defined
   > schema.) File metadata is used primarily at plan time to work out how to
   > distribute work.
   
   
   DFDL has zero notion of file metadata. It doesn't know whether data even
   comes from a file or an open TCP socket; Daffodil/DFDL just sees a
   java.io.InputStream. The schema used for a given parse is specified by the
   API call. Daffodil does nothing itself to try to find or identify any
   schema.
   
   So we're a "blank slate" on this issue with DFDL.
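   That source-agnosticism can be sketched with plain JDK types. This is a
   hedged illustration, not the real Daffodil API: countRecords below is a
   made-up stand-in for a parse call. The point is the signature: it takes
   only a java.io.InputStream, so a byte array, a file, or a socket stream
   are indistinguishable to it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class InputStreamOnlyDemo {
    // Stand-in for a parse: counts newline-delimited records. Any InputStream
    // works; the method cannot see file names or other file metadata.
    static int countRecords(InputStream in) throws IOException {
        int count = 0;
        for (int b = in.read(); b != -1; b = in.read()) {
            if (b == '\n') count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "rec1\nrec2\nrec3\n".getBytes("UTF-8");
        // Could equally be new FileInputStream(path) or socket.getInputStream().
        System.out.println(countRecords(new ByteArrayInputStream(data)));
    }
}
```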
   
   
   >
   >
   > A bit of background on scan pruning. Back in the day, it was common to
   > have
   > thousands or millions of files in Hadoop to scan: this was why tools like
   > Drill were distributed: divide and conquer. And, of course, the fastest
   > scan is to skip files that we know can't contain the information we want.
   > File metadata captures this information outside of the files themselves.
   > HMS was the standard solution in the Hadoop days. (Amazon Glue, for S3, is
   > evidently based on HMS.)
   >
   > For example, Drill's Parquet metadata cache, the Drill metastore and HMS
   > all provide both schema and file metadata information. The schema
   > information mainly helped with schema evolution: over time, different
   > files
   > have different sets of columns. File metadata provides information *about*
   > the file, such as the data ranges stored in each file. For Parquet, we
   > might track that '2023-01-Boston.parquet' has data from the
   > office='Boston'
   > range. (So, no use scanning the file for office='Austin'.) And so on.
   >
   > With Hadoop HFS, it was customary to use directory structure as a partial
   > primary index: our file above would live in the /sales/2023/01 directory,
   > for example, and logic chooses the proper set of directories to scan. In
   > Drill, it is up to the user to add crufty conditionals on the path name.
   > In
   > Impala, and other HMS-aware tools, the user just says WHERE order_year =
   > 2023 AND order_month = 1, and HMS tells the tool that the order_year and
   > order_month columns translate to such-and-so directory paths. Would be
   > nice
   > if Drill could provide that feature as well, given the proper file
   > metadata: in this case, the mapping of column names to path directories
   > and
   > file names.
   
   
   The above all makes perfect sense to me, and DFDL schemas are completely
   orthogonal to it. If a file-naming convention tells *Drill* that it
   doesn't need to open and parse some data using Daffodil, great: then
   *Drill* simply will not invoke Daffodil.
   
   DFDL/Daffodil neither knows nor cares about any of this.
   
   >
   > Does DFDL provide only schema information? Does it support versioning so
   > that we know that "old.csv" lacks the "version" column, while "new.csv"
   > includes that column? Does it also include the kinds of file metadata
   > mentioned above?
   
   
   DFDL provides only structural schema information.
   
   Data formats handle versioning in a wide variety of ways, so DFDL can't
   take any position on how it is done. Many DFDL schemas do, however, adapt
   to multiple versions of the formats they describe, based on the presence
   of particular fields or the values in those fields. This can only work for
   formats whose data fields identify the version.
   
   But none of it is based on file metadata.
   
   
   >
   >
   > Or, perhaps DFDL is used in a different context in which the files have a
   > fixed schema and are small in number? This would fit well the "desktop
   > analytics" model that Charles and James suggested is where Drill is now
   > most commonly used.
   
   
   The cybersecurity use case is one of the prime motivators for DFDL work.
   
   Often the cyber gateways are file movers: files arrive spontaneously in
   various locations and are moved across the cyber boundary. These use cases
   continue to grow in scale, and some people use Apache NiFi with DFDL for
   large-scale file moving of this kind.
   
   Unlike Drill, these use cases all parse and then re-serialize the data
   after extensive validation and rule-based filtering.
   
   The same sort of file-metadata-based machinery (e.g., rules like "all
   files in directory X with extension .dat use schema S") applies in the
   cyber-gateway use case as well.
   
   Apache Daffodil itself, however, knows nothing about this cyber use case,
   nor about data integration generally. Daffodil is actually a quite narrow
   library. It stays in its lane.
   
   
   >
   >
   > The answers might suggest if DFDL can be the universal data description.
   > or
   > if DFDL applies just to individual file schemas, and Drill would still
   > need
   > a second system to track schema evolution and file metadata for large
   > deployments.
   
   
   Yeah, Drill needs a separate system for this; it's not at all a
   DFDL-specific issue. DFDL/Daffodil takes no position on schema evolution.
   
   However, to Daffodil devs, a DFDL schema is basically source code. We keep
   schemas in git. They have releases. We package them in jars and use
   managed-dependency tools to grab them from repositories the same way Java
   jars are grabbed by Maven.
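   As a sketch of what consuming a schema that way might look like, a
   published schema jar would be declared like any other dependency (these
   Maven coordinates are made up purely for illustration):

```xml
<!-- Hypothetical coordinates; a DFDL schema packaged as a jar is pulled
     from a repository exactly like a Java library dependency. -->
<dependency>
  <groupId>com.example.schemas</groupId>
  <artifactId>example-format-dfdl-schema</artifactId>
  <version>1.2.0</version>
</dependency>
```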
   
   One of my concerns about metadata repositories/registries is that they are
   not thought of as configuration management systems. But DFDL schemas are
   certainly large formal objects that require configuration management.
   
   For example, the VMF schema we have is over 180K lines of DFDL "code"
   spread over hundreds of files. It is actually an assembly composed of
   specific versions of four smaller DFDL schemas plus the large corpus of
   VMF-specific schema files, along with the documentation, analysis reports,
   etc. that go with it.
   
   So some sort of repository that makes specific schemas available to Drill
   makes sense, but it cannot be confused with the configuration-management
   system.
   
   I quite literally just got a Maven Central/Sonatype account yesterday so
   that I can push some DFDL schemas up to Maven Central, from which they can
   be reused via jars.
   
   
   >
   >
   > Further, if DFDL is kind of a stand-alone thing, with its own reader, then
   > we end up with more complexity: the Drill JSON reader and the DFDL JSON
   > reader. Same for CSV, etc. JSON is so complex that we'd find ourselves
   > telling people that the quirks work one way with the native reader,
   > another
   > way with DFDL. Plus, the DFDL readers might not handle file splits the
   > same
   > way,
   
   
   Daffodil has no concept of "file splits". It doesn't actually even know
   about files; it just sees an input byte stream, literally a
   java.io.InputStream.
   
   
   > or support the same set of formats that Drill's other readers support,
   > and so on. It would be nice to separate the idea of schema description
   > from
   > reader implementation, so that DFDL can be used as a source of schema for
   > any arbitrary reader: both at plan and scan times.
   
   
   The DFDL/Drill integration converts DFDL-described data directly to Drill
   with no intermediate form such as XML or JSON. One hop. E.g.,
   
      drillScalarWriter.setInt(daffodilInfosetElement.getInt());
   
   There is no notion of Daffodil "also" reading JSON. You typically wouldn't
   parse JSON with DFDL; you would use a JSON library and, hopefully, a JSON
   schema that describes the JSON.
   Ditto for XML, Google protocol buffers, Avro, etc.
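   A minimal self-contained sketch of that one-hop pattern follows.
   InfosetElement and ScalarColumnWriter here are hypothetical placeholder
   interfaces, not the real Daffodil or Drill types; the point is that the
   typed value moves from infoset to column writer in a single call, with no
   serialization in between.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Daffodil infoset element.
interface InfosetElement {
    String name();
    int getInt();
}

// Hypothetical stand-in for a typed Drill column writer.
interface ScalarColumnWriter {
    void setInt(int v);
}

// Toy writer backing store, so the sketch is runnable.
class IntColumn implements ScalarColumnWriter {
    final List<Integer> values = new ArrayList<>();
    public void setInt(int v) { values.add(v); }
}

public class OneHopDemo {
    // One hop: read the typed value from the infoset element, write it to the
    // typed column writer. No intermediate XML or JSON.
    static void copyInt(InfosetElement e, ScalarColumnWriter w) {
        w.setInt(e.getInt());
    }

    public static void main(String[] args) {
        IntColumn col = new IntColumn();
        InfosetElement elem = new InfosetElement() {
            public String name() { return "packetLength"; }
            public int getInt() { return 1500; }
        };
        copyInt(elem, col);
        System.out.println(col.values); // prints [1500]
    }
}
```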
   
   
   >
   > If DFDL uses its own readers, then we'd need DFDL reader representations in
   
   
   DFDL is a specific reader, so the notion of "its own readers" doesn't
   apply.
   
   
   >
   > Calcite, which would pick up DFDL schemas so that the schemas are reliably
   > serialized out to each node as part of the physical plan. This is
   > possible,
   > but it does send us down the two-readers-for-every-format path.
   
   
   
   
   > On the other hand, if DFDL mapped to Drill's existing schema description,
   > then DFDL could be used with our existing readers
   
   
   I don't get "DFDL used with existing readers"... by "with" do you mean
   "alongside" or "incorporating"?
   
   
   > and there would be just
   > one schema description sent to readers: Drill's existing provided schema
   > format that EVF can already consume. At present, just a few formats
   > support
   > provided schema in the Calcite layer: CSV for sure, maybe JSON?
   
   
   This is what we need. The Daffodil/Drill integration walks the DFDL
   metadata and creates Drill metadata 100% in advance, and this should, I
   think, automatically find its way to all the right places without anything
   else being needed beyond today's Drill behavior.
   
   But besides Drill's metadata, the Daffodil execution at each node needs to
   load the compiled DFDL schema. That object, which can be several
   megabytes, needs to find its way out to all the nodes that need it. This I
   have no idea how to make happen.
   
   
   >
   > Any thoughts on where this kind of thing might evolve with DFDL in the
   > picture?
   >
   > Thanks,
   >
   > - Paul
   >
   >
   > On Tue, Jan 2, 2024 at 8:00 AM Mike Beckerle ***@***.***>
   > wrote:
   >
   > > @cgivre <https://github.com/cgivre> yes, the next architectural-level
   > > issue is how to get a compiled DFDL schema out to everyplace Drill will
   > run
   > > a Daffodil parse. Every one of those JVMs needs to reload it.
   > >
   > > I'll do the various cleanups and such. The one issue I don't know how to
   > > fix is the "typed setter" vs. (set-object) issue, so if you could steer
   > me
   > > in the right direction on that it would help.
   > >
   >
   >
   




> Add Daffodil Format Plugin
> --------------------------
>
>                 Key: DRILL-8474
>                 URL: https://issues.apache.org/jira/browse/DRILL-8474
>             Project: Apache Drill
>          Issue Type: New Feature
>    Affects Versions: 1.21.1
>            Reporter: Charles Givre
>            Priority: Major
>             Fix For: 1.22.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
