paul-rogers commented on pull request #2359: URL: https://github.com/apache/drill/pull/2359#issuecomment-961327918
@cgivre and @dzamo raise good points. So, what is Drill today? Is it still primarily used for distributed queries at scale? Or as a handy desktop tool for data scientists? Probably both. The problem is that a tool that tries to be both a dessert topping and a floor wax (let's see how old the readers are with this one) ends up being good at neither.

One approach, if we had resources, would be to create a Drill Desktop that is optimized for that case and encourages all kinds of specialized data connectors. Create an easy way to define those connectors (YAML files created by specialized web apps?). Ensure Drill has good integration with Jupyter and the other usual suspects.

Another approach, if we had resources, is the oft-discussed idea of separating the less-common plugins from the Drill core. Work started on this: creating an extension mechanism that would make it possible. (Today, most plugins need quite a bit of Drill internal code.)

So, no harm in adding the PDF reader, but I expect usage will be pretty limited just because, for the folks that need it, configuration will be too hard. A better approach would be a Python or Spark job that extracts the data into a CSV file, followed by a Drill query over that CSV file. Each step could be debugged easily. I can't imagine anyone will want to debug their PDF extraction using Drill's overly generous Java stack traces...
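To make the "extract to CSV first, then query with Drill" idea concrete, here is a minimal Python sketch of the extraction side. It is only an illustration under stated assumptions: `extract_rows_stub`, `rows_to_csv_text`, and `pdf_to_csv` are hypothetical names, and the stub returns fixed rows in place of a real PDF library, which a real job would have to supply.

```python
import csv
import io
from typing import Callable, Iterable, List, Sequence

Row = Sequence[str]


def extract_rows_stub(pdf_path: str) -> List[Row]:
    """Hypothetical stand-in for a real PDF table extractor.

    A real job would call a PDF library here. This stub ignores pdf_path and
    returns fixed rows so the rest of the pipeline can be run on its own.
    """
    return [["id", "name"], ["1", "alpha"], ["2", "beta, gamma"]]


def rows_to_csv_text(rows: Iterable[Row]) -> str:
    """Serialize rows as CSV text, quoting fields that contain commas."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def pdf_to_csv(
    pdf_path: str,
    csv_path: str,
    extract: Callable[[str], Iterable[Row]] = extract_rows_stub,
) -> int:
    """Step 1: extract rows. Step 2: write them to csv_path. Returns the row count."""
    rows = list(extract(pdf_path))
    # newline="" stops Python from translating the line endings we already wrote.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        f.write(rows_to_csv_text(rows))
    return len(rows)
```

Because extraction and serialization are separate functions, each can be inspected in isolation before Drill is involved at all. The resulting file can then be queried from Drill like any other CSV file.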