[GitHub] [drill] dzamo commented on pull request #2359: DRILL-8028: Add PDF Format Plugin

GitBox Tue, 09 Nov 2021 23:17:34 -0800


dzamo commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-964848311



   > The bigger implication of having a `columns` array vs `field_n` is when a 
user starts with `SELECT *` queries. It makes it harder for BI tools to gather 
schema metadata and it also is non-standard SQL. Now... querying PDF is also 
non-standard SQL... so maybe that's less important. But, it makes the discovery 
a little harder IMHO.
   
   @cgivre Oh right, that makes sense.  So should we put in support for both 
`columns[n]` and `field_n` as widely as possible, with a standardised option 
which lets users switch between each mode?  Maybe standardising on naming of 
`columns[n]` and `column_n` is a small saving on cognitive load for users here?
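
   To make the two modes concrete, here is a minimal sketch of what such a switch could look like. It is not Drill code: `name_fields`, the `mode` values and the `column` prefix are all hypothetical names, and zero-based numbering is an assumption.

```python
from typing import Dict, List, Union

# A row as the reader would expose it: column name -> value (or list of values).
Row = Dict[str, Union[str, List[str]]]


def name_fields(values: List[str], mode: str = "array", prefix: str = "column") -> Row:
    """Expose one row of unnamed values under the chosen naming mode.

    Hypothetical helper for illustration only; not part of Drill.
    """
    if mode == "array":
        # One array-typed column, addressed in SQL as columns[0], columns[1], ...
        return {"columns": list(values)}
    if mode == "flat":
        # One top-level column per value: column_0, column_1, ...
        # (zero-based numbering is an assumption made for this sketch)
        return {f"{prefix}_{i}": v for i, v in enumerate(values)}
    raise ValueError(f"unknown mode: {mode!r}")


row = ["2021-11-09", "drill", "42"]
print(name_fields(row, mode="array"))
print(name_fields(row, mode="flat"))
```

   In the flat mode a `SELECT *` yields one named column per value, which is the shape BI tools expect when they gather schema metadata; in the array mode the same query yields a single `columns` field.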
   
   @paul-rogers with apologies to this PR for saddling it with so much broader 
design chat, I wanted to share with you a last set of findings from talking 
with others, and then ask whether we might go over a couple of questions with 
you, away from this PR.
   
   1. I've polled some Drill devs and going after the "long tail" of formats 
and storage systems is mostly of interest to them.  @vvysotskyi even has an 
intriguing idea of a marketplace for these plugins, I guess something like the 
Eclipse plugin marketplace.
   2. I have developed a conviction that to go after the "long tail" and not 
produce a sprawling mess that neither developers nor users want to touch, we 
need to try to get strict (to the extent possible) about consistency in how 
plugins behave and how they are configured.  Today we are already not all that 
consistent (e.g. see the remarks on `columns[n]` vs `field_n` above, and on 
column `name` and `type` in the fixed width format).
   3. Those I've spoken with do also like the idea of splitting our distributed 
packages into "core" and "kitchen sink", or something like that, to put us in a 
better position to go after the "long tail".  It sounds like we're okay with 
our existing mono repo containing many plugins but end users should not have to 
download the kitchen sink to query e.g. just JSON or Parquet.  Drill startup 
times will probably be slow for the kitchen sink because the Java class loader 
will have a huge amount to scan.  And developer testing could get onerous if we 
cannot compile only a subset.
   4. By chance I saw that BigQuery, which for some reason I think of as the 
Rolls-Royce of Dremel-family engines even though I know little about it, can 
query Google Sheets.  So even they entertain some "small data" formats, 
although nothing like what we're imagining.  Just an anecdote.
   
   I would love to consult with you on 2 and 3 in a sort of "Very well, if you 
_must_ do a distribution of Drill with this long tail of formats, storage 
systems and UDFs in it then at least equip yourselves with the following 
practices" chat.  Perhaps in the upcoming community meetup, otherwise outside 
(if it's of any interest on your end of course).
   
   Thanks
   James


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

