[jira] [Commented] (TIKA-3332) Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree

Tilman Hausherr (Jira) Mon, 22 Mar 2021 10:15:26 -0700


    [ https://issues.apache.org/jira/browse/TIKA-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306412#comment-17306412 ]


Tilman Hausherr commented on TIKA-3332:
---------------------------------------

This segment
{code:java}
        Map<String, PDComplexFileSpecification> embeddedFileNames = efTree.getNames();
        //For now, try to get the embeddedFileNames out of embeddedFiles or its kids.
        //This code follows: pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
        //If there is a need we could add a fully recursive search to find a non-null
        //Map<String, COSObjectable> that contains the doc info.
        if (embeddedFileNames != null) {
            processEmbeddedDocNames(embeddedFileNames);
        } else {
            List<PDNameTreeNode<PDComplexFileSpecification>> kids = efTree.getKids();
            if (kids == null) {
                return;
            }
            for (PDNameTreeNode<PDComplexFileSpecification> node : kids) {
                embeddedFileNames = node.getNames();
                if (embeddedFileNames != null) {
                    processEmbeddedDocNames(embeddedFileNames);
                }
            }
        }
{code}
should be extracted into its own method, something like {{extractFilesFromEFTree(PDNameTreeNode efTree....)}}, and this segment
{code}
            for (PDNameTreeNode<PDComplexFileSpecification> node : kids) {
                embeddedFileNames = node.getNames();
                if (embeddedFileNames != null) {
                    processEmbeddedDocNames(embeddedFileNames);
                }
            }
{code}
should be changed to
{code}
            for (PDNameTreeNode<PDComplexFileSpecification> node : kids)
            {
                extractFilesFromEFTree(node, .....);
            }
{code}
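
To illustrate why the recursive call reaches names at any depth, here is a minimal, self-contained sketch. It does not use PDFBox: {{Node}} is a hypothetical stand-in for {{PDNameTreeNode}} that only mimics {{getNames()}} and {{getKids()}} (both may return null), the {{String}} values stand in for {{PDComplexFileSpecification}}, and the {{found}} map stands in for the call to {{processEmbeddedDocNames}}. The helper names {{leaf}} and {{buildSampleTree}} are also made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: Node is a hypothetical stand-in for PDFBox's PDNameTreeNode.
class EfTreeSketch {

    static final class Node {
        private final Map<String, String> names; // null on intermediate nodes
        private final List<Node> kids;           // null on leaf nodes

        Node(Map<String, String> names, List<Node> kids) {
            this.names = names;
            this.kids = kids;
        }

        Map<String, String> getNames() { return names; }
        List<Node> getKids() { return kids; }
    }

    // Same shape as the suggested refactoring: handle the names if this node
    // has any, otherwise recurse into each kid instead of reading only
    // kid.getNames().
    static void extractFilesFromEFTree(Node node, Map<String, String> found) {
        Map<String, String> names = node.getNames();
        if (names != null) {
            found.putAll(names); // stands in for processEmbeddedDocNames(names)
            return;
        }
        List<Node> kids = node.getKids();
        if (kids == null) {
            return;
        }
        for (Node kid : kids) {
            extractFilesFromEFTree(kid, found);
        }
    }

    // Builds a leaf node whose names map holds the given file names.
    static Node leaf(String... fileNames) {
        Map<String, String> names = new LinkedHashMap<>();
        for (String name : fileNames) {
            names.put(name, "spec:" + name);
        }
        return new Node(names, null);
    }

    // root -> intermediate -> leaves: the names sit two levels below the root,
    // as in the structure described in the report.
    static Node buildSampleTree() {
        List<Node> leaves = new ArrayList<>();
        leaves.add(leaf("a.txt", "b.txt"));
        leaves.add(leaf("c.txt"));
        Node intermediate = new Node(null, leaves);
        List<Node> rootKids = new ArrayList<>();
        rootKids.add(intermediate);
        return new Node(null, rootKids);
    }

    public static void main(String[] args) {
        Map<String, String> found = new LinkedHashMap<>();
        extractFilesFromEFTree(buildSampleTree(), found);
        System.out.println(found.keySet());
    }
}
```

With the sample tree above, a single-level loop over the root's kids would find nothing, because the only kid is an intermediate node without names; the recursive version collects all three entries.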


> Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-3332
>                 URL: https://issues.apache.org/jira/browse/TIKA-3332
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.25
>            Reporter: Ross Johnson
>            Priority: Major
>         Attachments: Screen Shot 2021-03-22 at 10.29.51 AM.png, Screenshot (5).png, image-2021-03-20-13-36-48-525.png
>
>
> I have come across some portfolio PDFs that have many attachments / embedded 
> files, but Tika is not detecting or extracting them as it does with some 
> other portfolio PDFs. The issue may be that these files have a multilevel 
> EmbeddedFiles name tree that is not being handled properly by PDFBox.
> Here is the EmbeddedFiles structure of one of the PDF portfolios in question. 
> Notice that the root EmbeddedFiles dictionary has a Kids array that only 
> consists of intermediate dictionaries, with the actual Names array being one 
> more level down.
> !image-2021-03-20-13-36-48-525.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
