dzamo commented on pull request #2359: URL: https://github.com/apache/drill/pull/2359#issuecomment-962895343
@paul-rogers, right, okay, it's an expressiveness thing here rather than a scale thing. The expressiveness of Drill SQL ∪ Drill format config JSON falls well short of that of a general-purpose scripting language, and for reading fiddly unstructured data that shortfall might rapidly become uncomfortable. The format config for this particular plugin looks quite succinct: the plugin will either automagically get your data out, or it won't, and then you need to pack up and go and open the interpreter of your favourite scripting language. Making your resulting script scale to millions of pages, if that's needed, is left to the student. I quite like the Ray project for Python myself.

This thread has triggered some thoughts.

If we find ourselves starting to write long essays of JSON in format configs then we should probably be concerned. If we find ourselves trying to embed a miniature data processing DSL into format config JSON then we need to stop moving immediately and pray to the ancestors that we might be shown a path that will return us from the wilderness.

I want to revisit the draft fixed width format plugin with these ideas in mind. Its config allows setting names and types for columns, but for other formats we must do this in SQL. I think we should only ever do this in SQL.

I think we can do something on the packaging front. These format plugins live under contrib/ in the source tree and are compiled to their own jar files. If we simply change the final tarball-building stage of our Maven build to give us something like the following on our download page, would we not be in reasonable shape?

Package|Size|Description
--|--|--
drill-core|300MB|Drill with core storage layer libs only. Use this in a focussed big data environment to query standard formats like Parquet, CSV and JSON in HDFS or object storage with predictable results and performance. Supplement this with individual plugins listed below as needed.
drill-kitchen-sink|1.5GB|Drill core plus all 100+ storage and format plugins. Use this for maximum compatibility. Results and performance may vary across plugins.
drill-storage-jdbc|130KB|Plugin to query systems that provide a JDBC driver using a generic SQL dialect.
drill-format-pdf|90KB|Plugin to query tables scraped from PDF files.
...

P.S. We'd be persisting with a monolithic Git repo containing multiple "projects" here, but I personally don't mind mono repos.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org