[GitHub] [drill] dzamo commented on pull request #2359: DRILL-8028: Add PDF Format Plugin

GitBox Sun, 07 Nov 2021 01:15:40 -0700


dzamo commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-962569638



   >  The problem is, a tool that tries to be both a dessert topping and a floor 
wax (let's see how old the readers are with this one), ends up being good at 
neither.
   
   @paul-rogers you got me with this idiom, but I like it!  The broader topic 
is super interesting.  If SQLite started adding the features needed to compete 
with Oracle Database 21c it would quickly fail at being SQLite.  If Linux tried 
to be an OS kernel for both TVs and supercomputers it would... continue to 
dominate both extremes!  There are some twists here!
   
   Pigeonholing formats into small-scale and large-scale is also a tricky 
business.  For example, we naturally want to declare PDF a desktop format, but 
I can easily imagine a conversation like the following.
   
   "Hey Bob, remember that we sent decades of paper archives from the basement 
out to that big scanning centre for digitisation?  They've come back as 
millions of pages of PDFs.  Someone just asked me if we can help them find all 
invoices containing a particular SKU, and pull out the price on that line.  The 
ERP system only has the last 10 years loaded into it and they want to go back 
further".
   
   "Chuck 'em in HDFS, we'll run a Drill query"
   
   "But PDF is a desktop publishing format, not a big data format!  Surely our 
big data cluster will want nothing to do with it!?"
   
   "Drill's got a plugin architecture, which has led to people adding support 
for all sorts of weird and wonderful formats.  Querying PDFs is a dubious 
business, but we'll know after ~10 lines of SQL if we can do this with Drill or 
not.  If not, miserable days or weeks of programming with a PDF library await 
one of our interns."
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
