[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764225#comment-17764225 ]
Tim Allison edited comment on PDFBOX-5682 at 9/12/23 2:41 PM:
--------------------------------------------------------------

Thank you, [~lehmi]. In Tika, we initially copied PDFBox's ExtractEmbeddedFiles example, but we found that PDF writers can stuff attached files/file specs/associated files on pretty much anything (https://www.pdfa.org/wp-content/uploads/2018/10/PDF20_AN002-AF.pdf).

From what we can tell with publicly available corpora, it is rare to have an attachment that is neither in the name tree nor in an annotation on a page, but after making the change in TIKA-4012, we did find a few new attachments.

This may be a "won't fix" in 3.x. Perhaps we could allow users to turn off the "scan every object for an embedded file" step on the Tika side?

> Long/permanent hang in PDFBox 3.x
> ---------------------------------
>
>                 Key: PDFBOX-5682
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5682
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>
> I found two files in the regression tests where we're now getting timeouts at
> 3 minutes where we weren't before. Unfortunately, PDFBox's export:text works
> on both, so it is probably another structural feature, perhaps a problem in
> Tika?
> This file halts after printing out the header for Table 19 on page 46:
> https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf
> Pure PDFBox's export:text complains multiple times: "Page skipped due to an
> invalid or missing type null", but it does finish quickly.
> This file halts after extracting {{"854,793,592"}}:
> https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY
> Pure PDFBox's export:text processes this without problem.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
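
For illustration only, the "scan every object for an embedded file" idea from the comment, with a switch to turn it off, could be sketched roughly as below. This is not Tika's or PDFBox's code: the object graph is modelled with plain java.util maps and lists instead of COS objects, and the names EmbeddedFileScan, findFileSpecs, and scanEveryObject are hypothetical. The only PDF detail assumed is that a file specification dictionary carries an EF entry. The identity-based visited set is there because PDF object graphs contain cycles (for example, a page pointing back to its parent), and a walk without one would never terminate.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class EmbeddedFileScan {

    // Hypothetical rule for this model: a dictionary with an "EF" entry is a file spec.
    static boolean isFileSpec(Map<String, Object> dict) {
        return dict.containsKey("EF");
    }

    // Walks every reachable object once and collects file spec dictionaries.
    // When scanEveryObject is false the walk is skipped entirely, and the caller
    // would rely on the name tree and page annotations only.
    @SuppressWarnings("unchecked")
    static List<Map<String, Object>> findFileSpecs(Map<String, Object> root,
                                                   boolean scanEveryObject) {
        List<Map<String, Object>> found = new ArrayList<>();
        if (!scanEveryObject) {
            return found;
        }
        // Identity-based set: cyclic maps cannot be hashed by value safely.
        Set<Object> seen =
                Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        Deque<Object> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Object node = stack.pop();
            if (!seen.add(node)) {
                continue; // already visited, this is what breaks cycles
            }
            if (node instanceof Map) {
                Map<String, Object> dict = (Map<String, Object>) node;
                if (isFileSpec(dict)) {
                    found.add(dict);
                }
                for (Object child : dict.values()) {
                    if (child != null) {
                        stack.push(child);
                    }
                }
            } else if (node instanceof List) {
                for (Object child : (List<?>) node) {
                    if (child != null) {
                        stack.push(child);
                    }
                }
            }
        }
        return found;
    }
}
```

With a flag like this, a user who hits a long-running scan on a particular file could disable the full walk and still get the attachments found through the usual locations, at the cost of missing the rare ones attached elsewhere.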