[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697
 ] 

Tim Allison edited comment on TIKA-1228 at 2/3/14 6:09 PM:
-----------------------------------------------------------

I won't have time to fix this for a week or so, but it looks like the client 
(Tika) needs to look through the kids of embeddedFiles recursively (well, in 
this file, just one level down) to get the non-null embeddedFileNames.

Something like this does pull out the .doc file:

{noformat}
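// Names stored directly on the root node (may be null)
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
// Names may instead live on the root's kids
List<PDNameTreeNode> kids = embeddedFiles.getKids();
if (kids != null){
    for (PDNameTreeNode n : kids){
        // distinct name, so the outer variable is not redeclared
        Map<String, COSObjectable> kidNames = n.getNames();
        processEmbedded(kidNames, embeddedExtractor);
        ....
    }
}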
{noformat}

where processEmbedded is shorthand for the existing code:
{noformat}
if (embeddedFileNames != null){
...
}
{noformat}
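
The snippet above only goes one level down. A fully recursive walk could look like the sketch below. It is not Tika or PDFBox code: NameNode is a hypothetical stand-in that only mimics the getNames()/getKids() shape of PDNameTreeNode (either may be null), the String values stand in for the file specification objects, and the attachment names are invented sample data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameTreeWalk {

    /** Hypothetical stand-in for a name-tree node; NOT the real PDFBox class. */
    static class NameNode {
        private final Map<String, String> names; // null on intermediate nodes
        private final List<NameNode> kids;       // null on leaf nodes

        NameNode(Map<String, String> names, List<NameNode> kids) {
            this.names = names;
            this.kids = kids;
        }

        Map<String, String> getNames() { return names; }

        List<NameNode> getKids() { return kids; }
    }

    /** Depth-first walk that copies every non-null names map into out. */
    static void collectNames(NameNode node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        Map<String, String> names = node.getNames();
        if (names != null) {
            out.putAll(names);
        }
        List<NameNode> kids = node.getKids();
        if (kids != null) {
            for (NameNode kid : kids) {
                collectNames(kid, out);
            }
        }
    }

    public static void main(String[] args) {
        // Invented sample data: the root has no Names entry, only Kids.
        Map<String, String> leafNames = new LinkedHashMap<>();
        leafNames.put("attached.doc", "file spec for the .doc");
        leafNames.put("attached.txt", "file spec for the text file");

        List<NameNode> kids = new ArrayList<>();
        kids.add(new NameNode(leafNames, null));
        NameNode root = new NameNode(null, kids);

        Map<String, String> found = new LinkedHashMap<>();
        collectNames(root, found);
        System.out.println(found.keySet());
    }
}
```

In the real parser, the putAll step would presumably be replaced by the existing null-checked processing that processEmbedded stands for. A production version might also want to guard against cyclic Kids references in malformed files.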

We can fix this at the Tika level in the short term.  I'm not sure if this is 
the expected behavior in PDFBox.  At the least, we might want to request that 
this line in the javadoc for PDDocumentNameDictionary ("The value in this name 
tree will be PDComplexFileSpecification objects.") be changed to "The value in 
this name tree or its children will be PDComplexFileSpecification objects."


> Embedded files not extracted properly from PDF
> ----------------------------------------------
>
>                 Key: TIKA-1228
>                 URL: https://issues.apache.org/jira/browse/TIKA-1228
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>         Environment: CentOS 6.5 VM
>            Reporter: Jason Sherman
>              Labels: easyfix
>         Attachments: pdf_with_doc_and_text_attached.pdf
>
>
> IAW pdfbox example here:
> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
> the PDF parser does not check for additional entries under Kids node when 
> Names node does not exist.



