paul-rogers commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-961327918


   @cgivre, @dzamo raise good points. So, what is Drill today? Is it still 
primarily used for distributed queries at scale? Or, as a handy desktop tool 
for data scientists? Probably both. The problem is that a tool that tries to be 
both a dessert topping and a floor wax (let's see how old the readers are with 
this one) ends up being good at neither.
   
   One approach, if we had resources, would be to create a Drill Desktop that 
is optimized for that case and encourages all kinds of specialized data 
connectors. Create an easy way to define those connectors (YAML files created 
by specialized web apps?). Ensure Drill has good integration with Jupyter and 
the other usual suspects.
   
   Another approach, if we had resources, is the oft-discussed idea of 
separating the less-common plugins from the Drill core. Work started on this: 
creating an extension mechanism that would make it possible. (Today, most plugins 
need quite a bit of Drill internal code.)
   
   So, no harm in adding the PDF reader, but I expect usage will be pretty 
limited just because, for the folks that need it, configuration will be too 
hard. Better would be a Python or Spark job that extracts the data into a CSV 
file, which can then be queried with Drill. Each step could be debugged easily. I 
can't imagine anyone will want to debug their PDF extraction using Drill's 
overly generous Java stack traces... 
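
   As a rough illustration of the extract-then-query flow suggested above, 
here is a minimal Python sketch. It is not code from this PR. It assumes some 
PDF text extractor has already produced whitespace-delimited lines; the sample 
lines, the helper names `parse_table_lines` and `write_csv`, and the output 
path mentioned afterwards are all hypothetical.

```python
import csv
import io


def parse_table_lines(lines):
    """Split whitespace-delimited text lines into rows, skipping blank lines.

    Assumption: the upstream PDF extractor (not shown) emits one table row
    per line with fields separated by whitespace.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if fields:
            rows.append(fields)
    return rows


def write_csv(rows, out):
    """Write rows to a file-like object as CSV with plain newline endings."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)


# Hypothetical extractor output, including a blank line to be dropped.
sample_lines = ["id name amount", "1 widget 9.50", "", "2 gadget 12.00"]

rows = parse_table_lines(sample_lines)

# An in-memory buffer stands in for a real output file here.
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

   Each stage (`rows`, then `csv_text`) can be inspected on its own before 
Drill is involved. Assuming the text were saved to a hypothetical path such as 
`/tmp/extracted.csvh` and the `dfs` storage plugin were configured to read 
that file type with a header row, a query along the lines of 
``SELECT * FROM dfs.`/tmp/extracted.csvh` `` would then read the result.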



