dzamo commented on pull request #2359: URL: https://github.com/apache/drill/pull/2359#issuecomment-964848311
> The bigger implication of having a `columns` array vs `field_n` is when a user starts with `SELECT *` queries. It makes it harder for BI tools to gather schema metadata and it also is non-standard SQL. Now... querying PDF is also non-standard SQL... so maybe that's less important. But, it makes the discovery a little harder IMHO.

@cgivre Oh right, that makes sense. So should we support both `columns[n]` and `field_n` as widely as possible, with a standardised option that lets users switch between the two modes? Maybe standardising on the names `columns[n]` and `column_n` would save users a little cognitive load here?

@paul-rogers with apologies to this PR for saddling it with so much broader design chat, I wanted to share with you a last set of findings from talking with others, and finally to ask whether we might go over a couple of questions with you, away from this PR.

1. I've polled some Drill devs and going after the "long tail" of formats and storage systems is mostly of interest to them. @vvysotskyi even has an intriguing idea of a marketplace for these plugins, I guess something like the Eclipse plugin marketplace.
2. I have developed a conviction that to go after the "long tail" without producing a sprawling mess that neither developers nor users want to touch, we need to get strict (to the extent possible) about consistency in how plugins behave and how they are configured. Today we are already not all that consistent (e.g. see the remarks on `columns[n]` vs `field_n` above, and on column `name` and `type` in the fixed width format).
3. Those I've spoken with also like the idea of splitting our distributed packages into "core" and "kitchen sink", or something like that, to put us in a better position to go after the "long tail". It sounds like we're okay with our existing monorepo containing many plugins, but end users should not have to download the kitchen sink to query e.g. just JSON or Parquet.
Drill startup times will probably be slow for the kitchen sink because the Java class loader will have a huge amount to scan. And developer testing could get onerous if we cannot compile only a subset.
4. By chance I saw that BigQuery, which for some reason I've designated in my mind as a kind of Rolls Royce of Dremel-family engines even though I know little about it, can query Google Sheets. So even they entertain some "small data" formats, although nothing like what we're imagining. Just an anecdote.

I would love to consult with you on 2 and 3 in a sort of "Very well, if you _must_ do a distribution of Drill with this long tail of formats, storage systems and UDFs in it, then at least equip yourselves with the following practices" chat. Perhaps at the upcoming community meetup, otherwise outside it (if it's of any interest on your end, of course).

Thanks
James

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
