[GitHub] [drill] cgivre commented on pull request #2359: DRILL-8028: Add PDF Format Plugin

GitBox Tue, 09 Nov 2021 18:23:57 -0800


cgivre commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-964722438



   > > * Sometimes the PDF reader does not read tables perfectly and you get a 
mix of found headers and not found headers, so that's one reason I took that 
approach.
   > 
   > This consideration will only apply for `extractHeaders = true`, right? In 
this case all of the readers do split out columns so I think we're good.
   > 
   > > * I actually dislike the `columns` approach from the CSV readers because 
it increases the level of complexity of queries.  In theory, if someone is 
querying a table (doesn't matter from where) they will want that broken into 
columns and rows.  The columns array approach (IMHO) makes this a lot harder 
than it needs to be.
   > 
   > The columns array does allow text files to contain jagged arrays, which is 
perhaps valuable? I don't see a huge saving in typing `select field_0, field_1` 
over `select columns[0], columns[1]`, am I missing some other complication?
   > 
   > > * This actually follows the model used in the Excel reader.
   > 
   > My fear, and I don't know if this is real or not, is that if we 
proliferate plugin-specific quirks in how the schema is represented to the user 
then the promise of a standard SQL interface to the data gets tainted and 
developers and users will be put off.
   > 
   > "It's standard except every plugin presents its data the way that author 
prefers" "That fragment of code I sent you for the columns array won't work 
here because this plugin does it differently" "This SQL script makes for 
confusing reading because the plugins involved name their generated columns 
differently, sometimes 'column', sometimes 'field', sometimes 'var'" etc.
   > 
   > I don't feel all that strongly about `field_0` vs `columns[0]` (I mean, 
maybe we deprecate the columns array?) but I am finding myself thinking more 
and more about consistency.
   
   The bigger implication of having a `columns` array vs `field_n` shows up 
when a user starts with `SELECT *` queries.  A single `columns` array makes it 
harder for BI tools to gather schema metadata, and it is also non-standard SQL.  
Now... querying PDF is also non-standard SQL... so maybe that's less important.  
But it makes discovery a little harder IMHO.  
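   
   A minimal sketch of the difference, for illustration only.  The file paths 
are hypothetical, and the `field_n` names are assumed from the discussion 
above rather than taken from the plugin's final behavior:
   
   ```sql
   -- Hypothetical paths; `dfs` is Drill's file system storage plugin.

   -- Text reader without header extraction: SELECT * exposes one
   -- array-valued column named `columns`, so a tool inspecting the
   -- result schema sees a single field instead of one per value.
   SELECT * FROM dfs.`/tmp/example.csv`;
   SELECT columns[0], columns[1] FROM dfs.`/tmp/example.csv`;

   -- field_n approach (assumed naming): SELECT * exposes one column
   -- per field, which can be referenced directly by name.
   SELECT * FROM dfs.`/tmp/example.pdf`;
   SELECT field_0, field_1 FROM dfs.`/tmp/example.pdf`;
   ```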
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
