[ https://issues.apache.org/jira/browse/TIKA-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306412#comment-17306412 ]
Tilman Hausherr commented on TIKA-3332:
---------------------------------------

This segment

{code:java}
Map<String, PDComplexFileSpecification> embeddedFileNames = efTree.getNames();
//For now, try to get the embeddedFileNames out of embeddedFiles or its kids.
//This code follows: pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
//If there is a need we could add a fully recursive search to find a non-null
//Map<String, COSObjectable> that contains the doc info.
if (embeddedFileNames != null) {
    processEmbeddedDocNames(embeddedFileNames);
} else {
    List<PDNameTreeNode<PDComplexFileSpecification>> kids = efTree.getKids();
    if (kids == null) {
        return;
    }
    for (PDNameTreeNode<PDComplexFileSpecification> node : kids) {
        embeddedFileNames = node.getNames();
        if (embeddedFileNames != null) {
            processEmbeddedDocNames(embeddedFileNames);
        }
    }
}
{code}

should be extracted into its own method, something like {{extractFilesFromEFTree(PDNameTreeNode efTree....)}}, and this segment

{code:java}
for (PDNameTreeNode<PDComplexFileSpecification> node : kids) {
    embeddedFileNames = node.getNames();
    if (embeddedFileNames != null) {
        processEmbeddedDocNames(embeddedFileNames);
    }
}
{code}

should be changed to

{code:java}
for (PDNameTreeNode<PDComplexFileSpecification> node : kids) {
    extractFilesFromEFTree(node, .....);
}
{code}

> Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-3332
>                 URL: https://issues.apache.org/jira/browse/TIKA-3332
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.25
>            Reporter: Ross Johnson
>            Priority: Major
>         Attachments: Screen Shot 2021-03-22 at 10.29.51 AM.png, Screenshot (5).png, image-2021-03-20-13-36-48-525.png
>
>
> I have come across some portfolio PDFs that have many attachments / embedded
> files, but Tika is not detecting or extracting them as it does with some
> other portfolio PDFs. The issue may be that these files have a multilevel
> EmbeddedFiles name tree that is not being handled properly by PDFBox.
> Here is the EmbeddedFiles structure of one of the PDF portfolios in question.
> Notice that the root EmbeddedFiles dictionary has a Kids array that only
> consists of intermediate dictionaries, with the actual Names array being one
> more level down.
> !image-2021-03-20-13-36-48-525.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
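
To illustrate the recursion suggested in the comment, here is a minimal, self-contained sketch. It deliberately does not use PDFBox: {{Node}} is a hypothetical stand-in for {{PDNameTreeNode}}, with plain fields in place of {{getNames()}} and {{getKids()}}, string values in place of {{PDComplexFileSpecification}}, and an {{out}} map in place of {{processEmbeddedDocNames}}. The second parameter of {{extractFilesFromEFTree}} is also an assumption, since the comment leaves the remaining arguments elided.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameTreeWalk {

    /** Hypothetical stand-in for a name tree node: it carries either names or kids. */
    static final class Node {
        final Map<String, String> names; // null on intermediate nodes
        final List<Node> kids;           // null on leaf nodes

        Node(Map<String, String> names, List<Node> kids) {
            this.names = names;
            this.kids = kids;
        }
    }

    /**
     * Collects the names stored at any depth of the tree rooted at node.
     * Same if/else shape as the original segment, except that the loop
     * recurses into each kid instead of reading only the kid's own names.
     */
    static void extractFilesFromEFTree(Node node, Map<String, String> out) {
        if (node.names != null) {
            out.putAll(node.names);
        } else {
            if (node.kids == null) {
                return;
            }
            for (Node kid : node.kids) {
                extractFilesFromEFTree(kid, out);
            }
        }
    }

    public static void main(String[] args) {
        // Build root -> intermediate -> leaf, i.e. names two levels below the root,
        // which is the layout described in the issue.
        Map<String, String> leafNames = new LinkedHashMap<>();
        leafNames.put("a.txt", "first attachment");
        leafNames.put("b.txt", "second attachment");
        Node leaf = new Node(leafNames, null);
        Node intermediate = new Node(null, Collections.singletonList(leaf));
        Node root = new Node(null, Collections.singletonList(intermediate));

        Map<String, String> found = new LinkedHashMap<>();
        extractFilesFromEFTree(root, found);
        System.out.println(found.keySet()); // prints [a.txt, b.txt]
    }
}
```

With the non-recursive loop, the same tree would yield nothing, because the root's only kid is an intermediate node with no names of its own.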