paul-rogers commented on PR #2836: URL: https://github.com/apache/drill/pull/2836#issuecomment-1874845274
Hi Mike,

Just jumping in with a random thought. Drill has accumulated a number of schema systems: the Parquet metadata cache, HMS, Drill's own metastore, "provided schema", and now DFDL. All provide ways of defining data, be it Parquet, JSON, CSV or whatever. One can't help but wonder: should some future version try to reduce this variation somewhat? Maybe map all the variations to DFDL? Or map DFDL to Drill's own mechanisms?

Drill uses two kinds of metadata: schema definitions, and file metadata used for scan pruning. Schema information could be used at plan time (to provide column types), but certainly at scan time (to "discover" the defined schema). File metadata is used primarily at plan time to work out how to distribute work.

A bit of background on scan pruning. Back in the day, it was common to have thousands or millions of files in Hadoop to scan; this is why tools like Drill were distributed: divide and conquer. And, of course, the fastest scan is to skip files that we know can't contain the information we want. File metadata captures this information outside of the files themselves. HMS was the standard solution in the Hadoop days. (AWS Glue, for S3, is evidently based on HMS.) For example, Drill's Parquet metadata cache, the Drill metastore and HMS all provide both schema and file metadata.

The schema information mainly helps with schema evolution: over time, different files have different sets of columns. File metadata provides information *about* each file, such as the ranges of data stored in it. For Parquet, we might track that '2023-01-Boston.parquet' holds only office='Boston' data, so there is no use scanning that file for office='Austin'. And so on.

With Hadoop HDFS, it was customary to use the directory structure as a partial primary index: our file above would live in the /sales/2023/01 directory, for example, and logic chooses the proper set of directories to scan. In Drill, it is up to the user to add crufty conditionals on the path name. In Impala, and other HMS-aware tools, the user just says WHERE order_year = 2023 AND order_month = 1, and HMS tells the tool that the order_year and order_month columns translate to such-and-so directory paths. It would be nice if Drill could provide that feature as well, given the proper file metadata: in this case, the mapping of column names to directory paths and file names.
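To make that contrast concrete, here is a rough SQL sketch. The /sales layout and the order_year/order_month partition columns are invented for illustration; dir0 and dir1 are Drill's actual implicit directory columns.

```sql
-- Drill today: the user hand-encodes the directory layout using
-- the implicit dir0/dir1 columns for /sales/<year>/<month>/.
SELECT office, SUM(amount)
FROM dfs.`/sales`
WHERE dir0 = '2023' AND dir1 = '01'
GROUP BY office;

-- An HMS-aware engine: partition columns are declared in metadata,
-- so the engine maps them to directories and prunes automatically.
SELECT office, SUM(amount)
FROM sales
WHERE order_year = 2023 AND order_month = 1
GROUP BY office;
```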
Does DFDL provide only schema information? Does it support versioning, so that we know that "old.csv" lacks the "version" column while "new.csv" includes it? Does it also include the kinds of file metadata mentioned above? Or is DFDL perhaps used in a different context, in which the files have a fixed schema and are small in number? That would fit well with the "desktop analytics" model that Charles and James suggested is where Drill is now most commonly used. The answers might suggest whether DFDL can be the universal data description, or whether DFDL applies just to individual file schemas, so that Drill would still need a second system to track schema evolution and file metadata for large deployments.

Further, if DFDL is kind of a stand-alone thing, with its own reader, then we end up with more complexity: the Drill JSON reader and the DFDL JSON reader. Same for CSV, etc. JSON is so complex that we'd find ourselves telling people that the quirks work one way with the native reader and another way with DFDL. Plus, the DFDL readers might not handle file splits the same way, or support the same set of formats that Drill's other readers support, and so on.

It would be nice to separate the idea of schema description from reader implementation, so that DFDL can be used as a source of schema for any arbitrary reader, at both plan and scan time. If DFDL uses its own readers, then we'd need DFDL reader representations in Calcite, which would pick up DFDL schemas so that the schemas are reliably serialized out to each node as part of the physical plan. This is possible, but it does send us down the two-readers-for-every-format path.

On the other hand, if DFDL mapped to Drill's existing schema description, then DFDL could be used with our existing readers, and there would be just one schema description sent to readers: Drill's existing provided-schema format that EVF can already consume. At present, just a few formats support provided schema in the Calcite layer: CSV for sure, maybe JSON?

Any thoughts on where this kind of thing might evolve with DFDL in the picture?

Thanks,

- Paul

On Tue, Jan 2, 2024 at 8:00 AM Mike Beckerle ***@***.***> wrote:

> @cgivre <https://github.com/cgivre> yes, the next architectural-level
> issue is how to get a compiled DFDL schema out to everyplace Drill will run
> a Daffodil parse. Every one of those JVMs needs to reload it.
>
> I'll do the various cleanups and such. The one issue I don't know how to
> fix is the "typed setter" vs. (set-object) issue, so if you could steer me
> in the right direction on that it would help.
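P.S. For concreteness, the "provided schema" I mention above is Drill's existing CREATE SCHEMA command, whose output EVF-based readers can apply at scan time. A rough sketch (the workspace, table, and column names are invented):

```sql
-- Attach a provided schema to a directory of (say) CSV files.
-- Drill persists it as a hidden .drill.schema file in the table root,
-- and EVF applies it when reading.
CREATE OR REPLACE SCHEMA
  (order_id INT, office VARCHAR, amount DOUBLE)
FOR TABLE dfs.sales.`2023`;

-- Inspect what was stored:
DESCRIBE SCHEMA FOR TABLE dfs.sales.`2023`;
```

If DFDL compiled down to this representation (or to the TupleMetadata behind it), our existing readers could consume DFDL-derived schemas without a parallel reader stack.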