[GitHub] [drill] cgivre commented on a change in pull request #2359: DRILL-8028: Add PDF Format Plugin

GitBox Wed, 15 Dec 2021 07:57:33 -0800


cgivre commented on a change in pull request #2359:
URL: https://github.com/apache/drill/pull/2359#discussion_r769761345




##########
File path: contrib/format-pdf/README.md
##########
@@ -0,0 +1,67 @@
+# Format Plugin for PDF Table Reader
+One of the most annoying tasks is when you are working on a data science project and you get data that is in a PDF file. This plugin endeavours to enable you to query data in
+ PDF tables using Drill's SQL interface.
+
+## Data Model
+Since PDF files generally are not intended to be queried or read by machines, mapping the data to tables and rows is not a perfect process.  The PDF reader does support
+provided schema.
+
+### Merging Pages
+The PDF reader reads tables from PDF files on each page.  If your PDF file has tables that span multiple pages, you can set the `combinePages` parameter to `true` and Drill
+will merge all the tables in the PDF file.  You can also do this at query time with the `table()` function.
+
+## Configuration
+To configure the PDF reader, simply add the information below to the `formats` section of a file base storage plugin.
+
+```json
+"pdf": {
+  "type": "pdf",
+  "extensions": [
+    "pdf"
+  ],
+  "extractHeaders": true,
+  "combinePages": false
+}
+```
+The available options are:
+* `extractHeaders`: Extracts the first row of any tables as the header row.  If set to false, Drill will assign column names of `field_0`, `field_1` to each column.
+* `combinePages`: Merges multipage tables together.
+* `defaultTableIndex`:  Allows you to query different tables within the PDF file. Index starts at `0`.
+
+
+## Accessing Document Metadata Fields
+PDF files have a considerable amount of metadata which can be useful for analysis.  Drill will extract the following fields from every PDF file.  Note that these fields are not
+ projected in star queries and must be selected explicitly.  The document's creator populates these fields and some or all may be empty. With the exception of `_page_count
+ ` which is an `INT` and the two date fields, all the other fields are `VARCHAR` fields.
+ 
+ The fields are:
+ * `_page_count`
+ * `_author`
+ * `_title`
+ * `_keywords`
+ * `_creator`
+ * `_producer`
+ * `_creation_date`
+ * `_modification_date`
+ * `_trapped`
+ * `_table_count`

Review comment:
   > Here, I assume `_author` will repeat for every table row? That `_page_count` will increase and gives the page number? Or, is constant and will always be, say, 25 if that's the number of pages?
   
   Page count is the total number of pages in the document. All of these metadata fields are the same for every row. Drill does the same thing with its other file-based readers: it adds implicit fields, and if you query them, you get the same value on every row.
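
As an illustration of that behaviour, here is a minimal Python sketch of how document-level fields could be repeated on every row while staying out of star queries. It is a model of the semantics only, not Drill's reader code: the `project` function, the `IMPLICIT_FIELDS` subset, and the sample values are all hypothetical.

```python
# Hypothetical model of implicit metadata fields; not Drill's actual reader code.
IMPLICIT_FIELDS = {"_page_count", "_author", "_title"}  # subset of the README list


def project(table_rows, file_metadata, columns):
    """Return the projected rows for one PDF file.

    table_rows: list of dicts, one per extracted table row
    file_metadata: dict of document-level values
    columns: list of requested column names, or ["*"] for a star query
    """
    result = []
    for row in table_rows:
        if columns == ["*"]:
            # Star queries return only the table columns, never the metadata.
            result.append(dict(row))
            continue
        out = {}
        for name in columns:
            if name in IMPLICIT_FIELDS:
                # The same document-level value is attached to every row.
                out[name] = file_metadata.get(name)
            else:
                out[name] = row.get(name)
        result.append(out)
    return result


rows = [{"field_0": "a", "field_1": "1"}, {"field_0": "b", "field_1": "2"}]
meta = {"_page_count": 25, "_author": "Jane Doe", "_title": None}

star = project(rows, meta, ["*"])
explicit = project(rows, meta, ["field_0", "_page_count", "_author"])
```

In this model `star` contains only `field_0` and `field_1`, while every row of `explicit` carries the same `_page_count` and `_author`.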
   > 
   > Does `_table_count` stay fixed at the number of tables (5, say), or does it count tables (1, 2, 3, 4, 5)?
   
   It was supposed to be the total number of tables in the document. However, I ended up removing it because computing it required loading all the tables into memory.
   
   > 
   > What is `_trapped`? Is there some PDF standard we could point to where this stuff is defined?
   
   Yes, I'll add that to the docs. It has something to do with how a PDF document is printed.
   
   > 
   > Actually, there is a larger question. If I'm exploring a pile of PDFs, I may want to extract metadata, such as the above. If I'm extracting tables, I don't need the metadata repeated per table row.
   > 
   > Should the plugin allow two query types? Either metadata or tables? Can this be done with some kind of Calcite trickery? `FROM "foo.pdf"` means the tables, `FROM "foo.pdf.metadata"` means the metadata?
   
   It could, but that would be significantly more complex to implement.
   
   > 
   > Or, we check and if we see only metadata columns, we return one row per file rather than one row per table row?
   > 
   > Wouldn't something like this be useful to extract metadata from a PDF with no tables?
   
   I'm debating this point... stand by.
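
For reference, the "only metadata columns" suggestion could be sketched roughly as below. This is a hypothetical Python model of the idea being debated, not behaviour the plugin has: `read_pdf`, `is_metadata_only`, and the sample data are invented for illustration, and star queries are deliberately left out.

```python
# Hypothetical sketch of the reviewer's suggestion; the plugin under review does not do this.
METADATA_FIELDS = {
    "_page_count", "_author", "_title", "_keywords", "_creator",
    "_producer", "_creation_date", "_modification_date", "_trapped",
}


def is_metadata_only(columns):
    """True when every requested column is a document-level metadata field."""
    return bool(columns) and all(name in METADATA_FIELDS for name in columns)


def read_pdf(table_rows, file_metadata, columns):
    """Project explicitly named columns for one file (star queries are not modelled)."""
    if is_metadata_only(columns):
        # One row per file, even when the PDF contains no tables at all.
        return [{name: file_metadata.get(name) for name in columns}]
    # Otherwise one row per table row, with the metadata repeated on each.
    return [
        {
            name: file_metadata.get(name) if name in METADATA_FIELDS else row.get(name)
            for name in columns
        }
        for row in table_rows
    ]


meta = {"_page_count": 25, "_author": "Jane Doe"}
no_tables = read_pdf([], meta, ["_page_count", "_author"])
mixed = read_pdf([{"field_0": "a"}, {"field_0": "b"}], meta, ["field_0", "_page_count"])
```

Here `no_tables` is a single row of metadata for a file with no tables, while `mixed` still yields one row per table row. The trade-off is that the row count of a query would then depend on which columns are projected, which may surprise users.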




