[GitHub] [drill] paul-rogers commented on pull request #2359: DRILL-8028: Add PDF Format Plugin

GitBox Wed, 03 Nov 2021 18:57:47 -0700


paul-rogers commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-959821275



   Cool contribution. I'm not entirely convinced this is something that Drill 
should handle. There are too many variables for the very limited controls which 
Drill provides. It is likely that this will work for one or two limited use 
cases, but not the vast majority of PDF files. Using JSON plugin config files 
to specify the mapping is awkward. Probably each file will need its own config, 
which is not scalable.
   
   Drill's fundamental design is to run at scale. It is highly unlikely that 
someone will use PDF files to store GB of data. If they do, they have problems 
bigger than Drill can help them solve. Thus, this kind of plugin works only at 
the small scale: one or two files in, say, an embedded Drillbit with JDBC or 
SQLLine.
   
   A better choice would be to handle the extraction outside Drill: tinker with 
the PDF extraction, using whatever tools are available, to get the right mapping. 
Then, wrap that in a script that writes, say, CSV to stdout. Drill can then read 
that output.
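
   As a rough illustration of the script idea, here is a minimal sketch in 
Python. It assumes the PDF text has already been extracted by some external 
tool and arrives on stdin as lines whose columns are separated by two or more 
spaces. That layout, and the function names `parse_table_lines` and 
`write_csv`, are assumptions made for the example, not part of the PR or of 
Drill.

```python
import csv
import re
import sys


def parse_table_lines(lines):
    """Turn already-extracted text lines into rows of cells.

    Assumption: columns are separated by runs of two or more spaces.
    Real PDFs will usually need a per-file mapping instead.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(re.split(r"\s{2,}", line))
    return rows


def write_csv(rows, out):
    """Write rows as CSV; the csv module quotes cells that contain commas."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)


if __name__ == "__main__":
    # Read extracted text on stdin, emit CSV on stdout.
    write_csv(parse_table_lines(sys.stdin), sys.stdout)
```

   One possible use, with hypothetical file names: 
`pdftotext -layout report.pdf - | python pdf_table_to_csv.py > report.csv`, 
then query `report.csv` from a directory that Drill's file system plugin can 
see. All the per-file tinkering stays in the script rather than in a plugin 
config.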
   
   Such an approach enables all manner of ad-hoc, small scale data extraction.
   
   Or, maybe Drill should offer a "desktop edition" that is designed for small, 
ad-hoc projects based on local files, with some way to handle all the tinkering 
needed when reading PDF files, images, Word files, spreadsheets, Twitter feeds, 
Slack posts, and other formats popular with data scientists. Such features would 
not normally be part of the massive-scale deployments for which Drill is 
designed.
   
   Thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

