[GitHub] [drill] dzamo commented on pull request #2359: DRILL-8028: Add PDF Format Plugin

GitBox Mon, 08 Nov 2021 05:26:26 -0800


dzamo commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-963148567



   > * Sometimes the PDF reader does not read tables perfectly and you get a 
mix of found headers and not found headers, so that's one reason I took that 
approach.
   
   This consideration will only apply for `extractHeaders = true`, right?  In 
this case all of the readers do split out columns so I think we're good.
    
   > * I actually dislike the `columns` approach from the CSV readers because 
it increases the level of complexity of queries.  In theory, if someone is 
querying a table (doesn't matter from where) they will want that broken into 
columns and rows.  The columns array approach (IMHO) makes this a lot harder 
than it needs to be.
   
   The columns array does allow text files to contain jagged arrays, which is 
perhaps valuable?  I don't see a huge saving in typing `select field_0, 
field_1` over `select columns[0], columns[1]`.
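
   To make the jagged-array point concrete, here is a minimal sketch in Python (not Drill code) of the two row shapes. The sample rows, the helper names, and the pad-with-null / drop-extras policy for the `field_N` shape are all assumptions made for illustration; they are not a description of what any Drill reader actually does.

```python
# Hypothetical rows, as a text or PDF table reader might hand them over.
rows = [
    ["a", "b", "c"],
    ["d", "e"],            # jagged: one value short
    ["f", "g", "h", "i"],  # jagged: one value extra
]


def as_columns_array(row):
    """columns-array shape: a single array-valued column, so any row length fits."""
    return {"columns": list(row)}


def as_named_fields(row, width):
    """field_N shape: a fixed set of scalar columns.

    Assumed policy for this sketch only: short rows are padded with None
    and values beyond `width` are dropped.
    """
    padded = list(row[:width]) + [None] * (width - len(row))
    return {f"field_{i}": value for i, value in enumerate(padded)}


if __name__ == "__main__":
    for row in rows:
        print(as_columns_array(row), as_named_fields(row, 3))
```

   In the array shape every row keeps exactly the values it had. In the named-field shape the reader has to pick some policy for rows that don't match the width.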
   
   > * This actually follows the model used in the Excel reader.
   
   My fear, and I don't know if this is real or not, is that if we proliferate 
plugin-specific quirks in how the schema is represented to the user then the 
promise of a standard SQL interface to the data gets tainted and developers and 
users will be put off.  
   
   "It's standard except every plugin presents its data the way that author 
prefers"
   "That fragment of code I sent you for the columns array won't work here 
because this plugin does it differently"
   "This SQL script makes for confusing reading because the plugins involved 
name their generated columns differently, sometimes 'column', sometimes 
'field', sometimes 'var'"
   etc.
   
   I don't feel all that strongly about `field_0` vs `columns[0]` (I mean, 
maybe we deprecate the columns array?) but I am finding myself thinking more 
and more about consistency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
