paul-rogers commented on PR #2836:
URL: https://github.com/apache/drill/pull/2836#issuecomment-1874845274

   Hi Mike,
   
   Just jumping in with a random thought. Drill has accumulated a number of
   schema systems: the Parquet metadata cache, HMS, Drill's own metastore,
   "provided schema", and now DFDL. All provide ways of describing data, be
   it Parquet, JSON, CSV, or whatever. One can't help but wonder: should
   some future version try to reduce this variation somewhat? Maybe map all
   the variations to DFDL? Or map DFDL to Drill's own mechanisms?
   
   Drill uses two kinds of metadata: schema definitions and file metadata
   used for scan pruning. Schema information can be used at plan time (to
   provide column types), and is certainly used at scan time (to "discover"
   the defined schema). File metadata is used primarily at plan time to
   work out how to distribute work.
   
   A bit of background on scan pruning. Back in the day, it was common to
   have thousands or millions of files in Hadoop to scan; this is why tools
   like Drill were distributed: divide and conquer. And, of course, the
   fastest scan is the one that skips files we know can't contain the
   information we want. File metadata captures this information outside of
   the files themselves. HMS was the standard solution in the Hadoop days.
   (AWS Glue, for S3, is evidently based on HMS.)
   
   For example, Drill's Parquet metadata cache, the Drill metastore, and
   HMS all provide both schema and file metadata. The schema information
   mainly helps with schema evolution: over time, different files have
   different sets of columns. File metadata provides information *about*
   each file, such as the ranges of values stored in it. For Parquet, we
   might track that '2023-01-Boston.parquet' holds only office='Boston'
   data. (So, no use scanning that file for office='Austin'.) And so on.
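
   For reference, the Parquet metadata cache is built with a SQL command
   today. A minimal sketch, assuming a hypothetical dfs.`/sales` path (the
   command itself is Drill's existing syntax):

       -- Scan the directory tree once and cache per-file metadata
       -- (row counts, column value ranges) for plan-time pruning
       REFRESH TABLE METADATA dfs.`/sales`;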
   
   With Hadoop HDFS, it was customary to use the directory structure as a
   partial primary index: our file above would live in the /sales/2023/01
   directory, for example, and query logic chooses the proper set of
   directories to scan. In Drill, it is up to the user to add crufty
   conditionals on the path name. In Impala, and other HMS-aware tools, the
   user just says WHERE order_year = 2023 AND order_month = 1, and HMS
   tells the tool that the order_year and order_month columns translate to
   such-and-so directory paths. (See the sketch below.) It would be nice if
   Drill could provide that feature as well, given the proper file
   metadata: in this case, the mapping of column names to directory paths
   and file names.
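
   To make the contrast concrete, here is a sketch. The first query uses
   Drill's existing dir0/dir1 pseudo-columns against a hypothetical /sales
   layout; the second shows the HMS-style experience:

       -- Drill today: the user encodes the directory layout in the query
       SELECT * FROM dfs.`/sales`
       WHERE dir0 = '2023' AND dir1 = '01';

       -- HMS-aware tools: partition columns look like ordinary columns;
       -- the metastore maps them to directories behind the scenes
       SELECT * FROM sales
       WHERE order_year = 2023 AND order_month = 1;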
   
   Does DFDL provide only schema information? Does it support versioning so
   that we know that "old.csv" lacks the "version" column, while "new.csv"
   includes that column? Does it also include the kinds of file metadata
   mentioned above?
   
   Or, perhaps DFDL is used in a different context, in which the files have
   a fixed schema and are small in number? That would fit well with the
   "desktop analytics" model that Charles and James suggested is where
   Drill is now most commonly used.
   
   The answers might suggest whether DFDL can be the universal data
   description, or whether DFDL applies just to individual file schemas,
   in which case Drill would still need a second system to track schema
   evolution and file metadata for large deployments.
   
   Further, if DFDL is a kind of stand-alone thing, with its own readers,
   then we end up with more complexity: the Drill JSON reader and the DFDL
   JSON reader. Same for CSV, etc. JSON is so complex that we'd find
   ourselves telling people that the quirks work one way with the native
   reader, another way with DFDL. Plus, the DFDL readers might not handle
   file splits the same way, or support the same set of formats that
   Drill's other readers support, and so on. It would be nice to separate
   the idea of schema description from reader implementation, so that DFDL
   can be used as a source of schema for any arbitrary reader, at both
   plan and scan time.
   
   If DFDL uses its own readers, then we'd need DFDL reader representations in
   Calcite, which would pick up DFDL schemas so that the schemas are reliably
   serialized out to each node as part of the physical plan. This is possible,
   but it does send us down the two-readers-for-every-format path.
   
   On the other hand, if DFDL mapped to Drill's existing schema
   description, then DFDL could be used with our existing readers, and
   there would be just one schema description sent to readers: Drill's
   existing provided-schema format that EVF can already consume. (A sketch
   follows.) At present, just a few formats support provided schema in the
   Calcite layer: CSV for sure, maybe JSON?
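
   As a reminder of what that target looks like, here is a minimal sketch
   of today's provided-schema mechanism, using a hypothetical
   dfs.tmp.`old_new` text table; a DFDL-to-Drill translation could emit
   this same schema form for EVF to consume:

       -- Attach a provided schema to a table; EVF then uses the declared
       -- types instead of guessing them from the data
       CREATE OR REPLACE SCHEMA
           (`version` INT, `name` VARCHAR NOT NULL)
       FOR TABLE dfs.tmp.`old_new`;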
   
   Any thoughts on where this kind of thing might evolve with DFDL in the
   picture?
   
   Thanks,
   
   - Paul
   
   
   On Tue, Jan 2, 2024 at 8:00 AM Mike Beckerle ***@***.***>
   wrote:
   
   > @cgivre <https://github.com/cgivre> yes, the next architectural-level
   > issue is how to get a compiled DFDL schema out to everyplace Drill will run
   > a Daffodil parse. Every one of those JVMs needs to reload it.
   >
   > I'll do the various cleanups and such. The one issue I don't know how to
   > fix is the "typed setter" vs. (set-object) issue, so if you could steer me
   > in the right direction on that it would help.