Tim Allison created TIKA-1374:
---------------------------------

             Summary: Need to add code to look for OS-specific keys for 
embedded files within PDFs
                 Key: TIKA-1374
                 URL: https://issues.apache.org/jira/browse/TIKA-1374
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Tim Allison
            Assignee: Tim Allison
            Priority: Minor
             Fix For: 1.6


Embedded files in PDFs can be found under the general, all-purpose key we
currently use via PDFBox: "EF/F". However, embedded documents can also be
stored under OS-specific keys: "EF/DOS", "EF/Mac" and "EF/Unix".

[~lehmi] confirmed on the PDFBox users 
[list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e]
 that we might be missing embedded documents if we don't also try the
OS-specific keys. As Andreas points out, according to the spec the OS-specific
keys shouldn't be used any more, but I think we should still support
extraction from them.

My proposal is to pull all documents that are available under any of the four keys
(well, via getEmbeddedFile<OS>() in PDFBox). The code fix is trivial, and I'll
try to commit it today. However, it will take me a bit of time to generate a
test file that stores files under the OS-specific keys. So, if anyone has an
ASF-friendly file available or wants to take on the task of generating one,
please do.
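
To make the proposal concrete, here is a minimal sketch of the lookup logic. It does not use PDFBox: the "EF" dictionary is modeled as a plain Map, and the class name EmbeddedFileKeys and method collectAll are hypothetical names chosen for illustration. In the real fix, the map lookups would presumably be replaced by the getEmbeddedFile<OS>() accessors mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EmbeddedFileKeys {
    // Keys under the "EF" dictionary: the general key first,
    // then the OS-specific keys named in this issue.
    static final String[] KEYS = {"F", "DOS", "Mac", "Unix"};

    // Hypothetical stand-in for an "EF" dictionary: key -> embedded file bytes.
    // Returns every entry that is present, in KEYS order, so a file stored
    // only under an OS-specific key is not missed.
    public static Map<String, byte[]> collectAll(Map<String, byte[]> ef) {
        Map<String, byte[]> found = new LinkedHashMap<>();
        for (String key : KEYS) {
            byte[] file = ef.get(key);
            if (file != null) {
                found.put(key, file);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, byte[]> ef = new LinkedHashMap<>();
        // Assumed example: the file is stored only under "EF/Unix".
        ef.put("Unix", new byte[] {1, 2, 3});
        System.out.println(collectAll(ef).keySet());
    }
}
```

This sketch keeps one entry per key. If several keys point at the same embedded stream, the real code may want to de-duplicate before handing the documents to the embedded parser; that is left out here.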



--
This message was sent by Atlassian JIRA
(v6.2#6252)
