[GitHub] [drill] dzamo commented on pull request #2359: DRILL-8028: Add PDF Format Plugin

GitBox Wed, 03 Nov 2021 22:35:50 -0700


dzamo commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-960472017



   @cgivre @paul-rogers, my 2c.  I guess some partial precedents for a format 
plugin like this are ones like format-image and format-esri (as noted), though 
those do only go after the explicitly structured content.  It would not 
surprise me if there are quite often cases of unfortunate people needing to 
scrape data out of 10^0 - 10^5 PDFs.  The purist in me agrees with Paul's 
thought: a Groovy script over whatever Java lib is used here could be employed 
instead.  That would not automatically be parallelised like a Drill query is, 
so a user with many PDFs _might_ be pushed all the way through GNU parallel to 
Spark.  All that is followed by this recurring thought: if Drill is disciplined 
and focussed only on SQL against standard big data formats then it finds itself 
trying to compete in an uncomfortably crowded space.  Probably fatally crowded.
   
   I do also note that the "big" and "small" data worlds are not disjoint.  I 
have in practice joined big data in Parquet with small reference data in Excel 
(actually EtherCalc).  Even in the big data regime reference data remains small 
and is maintained by humans in human formats rather than pumped out by 
machines in machine formats.
   
   This is getting long again.  Last thoughts.
   
   - I feel this really should be good at finding and parsing tables (working 
often rather than seldom) to merit inclusion.
   - We should consider a subproject to contain our long tail of "non-standard" 
data formats.  This would separate away plugins that should not be expected to 
run with the speed or reliability of the core data formats and keep the core 
distributable size down as we add format after format.  We could then start to 
distribute tarballs for `drill-core` and `drill-extra`.
   - This looks like a plugin that will benefit from Drill's optional schema.  
That means we might have some unique ability to compete here.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
