paul-rogers commented on pull request #2359: URL: https://github.com/apache/drill/pull/2359#issuecomment-959821275
Cool contribution. I'm not entirely convinced this is something that Drill should handle. There are too many variables for the very limited controls which Drill provides. It is likely that this will work for one or two limited use cases, but not for the vast majority of PDF files.

Using JSON plugin config files to specify the mapping is awkward. Probably each file will need its own config, which is not scalable.

Drill's fundamental design is to run at scale. It is highly unlikely that someone will use PDF files to store gigabytes of data. If they do, they have problems bigger than Drill can help them solve. Thus, this kind of plugin works only at the small scale: one or two files in, say, an embedded Drillbit with JDBC or SQLLine.

A better choice would be to do the extraction outside of Drill: tinker with the PDF extraction, using whatever tools are available, to get the right mapping. Then, wrap that in a script that writes, say, CSV to stdout. Drill can then read that output. Such an approach enables all manner of ad-hoc, small-scale data extraction.

Or, maybe Drill should offer a "desktop edition" that is designed for small, ad-hoc projects based on local files, with some way to handle all the tinkering needed when reading PDF files, images, Word files, spreadsheets, Twitter feeds, Slack posts, and other formats popular with data scientists. Such features would not normally be part of the massive-scale deployments for which Drill is designed.

Thoughts?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org