[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874933#comment-17874933 ]

Tim Allison commented on PDFBOX-5868:
-------------------------------------

I don't mean to change the topic of this issue, but separately, is it worth temporarily putting some of the /ActualText logic into Tika for now?

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, EmptyActualText_reduced_poppler.txt, Main.java, PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, Tilman's_solution_out.txt, adobe_out.txt, content_diffs_with_exceptions-ActualText.xlsx, image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
> I downloaded the latest executable jar of pdfbox (3.0.3) for testing and used the export:text command line tool to obtain the results
> * the multilingual_test.pdf is the original pdf i made to test multilingual text extraction.
> * the pdfbox_out.txt is the text file produced by pdfbox
> * the adobe_out.txt is the text file created by adobe reader's save as text feature
>
> Observation:
> as you can see in the attachment the text file obtained by pdfbox shows weird unicodes for tamil and bengali (for hindi the characters are extracted but not overlapped; japanese seems fine to me). in contrast the text file obtained from adobe reader's save as text feature seems fine and copy pasting the text from my document viewer(evince) also works.
>
> Questions:
> # why are the outputs from pdfbox and adobe different?
> # what can i do to extract the text from a multilingual pdf correctly?
> # Is there a way to apply pattern matching to text in pdf file and declare matches without extracting the text first? (say if the problem is with fonts and glyphs)
> ----
> My Usecase fyi:
> i am trying to extract text from files and run pattern matching. I am using apache tika for parsing documents. I noticed problem with extracted PDF text (other filetypes parse fine). used executable pdfbox jar to conclude that the _problem is in pdfbox and not in tika._ tested with adobe reader's extract text to confirm the problem is not with the pdf. i want to extract these multilingual text to run pattern matching on them alone and do not need to display the content but only if the pattern is present or not (say if the problem is with fonts and glyphs)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766567#comment-17766567 ]

Tim Allison commented on PDFBOX-5682:
-------------------------------------

Wow. Thank you!

> Long/permanent hang in PDFBox 3.x
> ---
>
> Key: PDFBOX-5682
> URL: https://issues.apache.org/jira/browse/PDFBOX-5682
> Project: PDFBox
> Issue Type: Bug
> Reporter: Tim Allison
> Assignee: Andreas Lehmkühler
> Priority: Minor
> Fix For: 3.0.1 PDFBox, 4.0.0
>
> I found two files in the regression tests where we're now getting timeouts at 3 minutes where we weren't before. Unfortunately, PDFBox's export:text works on both, so it is probably another structural feature, perhaps a problem in Tika?
>
> This file halts after printing out the header for Table 19 on page 46:
> https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf
> Pure PDFBox's export:text complains multiple times: "Page skipped due to an invalid or missing type null", but it does finish quickly.
>
> This file halts after extracting {{"854,793,592"}}:
> https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY
> Pure PDFBox's export:text processes this without problem.
[jira] [Comment Edited] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764228#comment-17764228 ]

Tim Allison edited comment on PDFBOX-5682 at 9/12/23 2:41 PM:
--------------------------------------------------------------

This is the part from that document that is, erm, eye-opening:
{noformat}
4.2 AF entry not in the catalog
4.2.1 General
Most existing applications that take advantage of Associated Files use the AF entry in the document catalog as the place to make the association. However, the concept of Associated Files goes well beyond association only with the file as a whole, and also allows for defining relations between embedded files and certain pages, annotations, form fields, graphics objects, structure elements in the tagging structure, DParts or any other PDF object.
{noformat}
And, yes, the document goes on to say, PDF writers should do the traditional thing, but...

was (Author: talli...@mitre.org):
This is the part from that document that is, erm, eye-opening:
{noformat}
4.2 AF entry not in the catalog
4.2.1 General
Most existing applications that take advantage of Associated Files use the AF entry in the document catalog as the place to make the association. However, the concept of Associated Files goes well beyond association only with the file as a whole, and also allows for defining relations between embedded files and certain pages, annotations, form fields, graphics objects, structure elements in the tagging structure, DParts or any other PDF object.
{noformat}
[jira] [Comment Edited] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764225#comment-17764225 ]

Tim Allison edited comment on PDFBOX-5682 at 9/12/23 2:41 PM:
--------------------------------------------------------------

Thank you, [~lehmi]. In Tika, we initially copied PDFBox's ExtractEmbeddedFiles example, but we found that PDF writers can stuff attached files/file specs/associated files on pretty much anything (https://www.pdfa.org/wp-content/uploads/2018/10/PDF20_AN002-AF.pdf).

From what we can tell with publicly available corpora, it is rare to have an attachment not in the name tree and not in an annotation on a page, but after making the change in TIKA-4012, we did find a few new attachments.

This may be a "won't fix" in 3.x. Perhaps we allow users to turn off the "scan every object for an embedded file" on the Tika side?

was (Author: talli...@mitre.org):
Thank you, [~lehmi]. In Tika, we initially copied PDFBox's ExtractEmbeddedFiles example, but we found that PDF writers can stuff attached files/file specs/associated files on pretty much anything (https://www.pdfa.org/wp-content/uploads/2018/10/PDF20_AN002-AF.pdf).

From what we can tell with publicly available corpora, it is rare to have an attachment not in the name tree and not in an annotation on a page, but after making the change in TIKA-4012, we did find a few new attachments.

This may be a "won't fix" in 3.x.
[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764228#comment-17764228 ]

Tim Allison commented on PDFBOX-5682:
-------------------------------------

This is the part from that document that is, erm, eye-opening:
{noformat}
4.2 AF entry not in the catalog
4.2.1 General
Most existing applications that take advantage of Associated Files use the AF entry in the document catalog as the place to make the association. However, the concept of Associated Files goes well beyond association only with the file as a whole, and also allows for defining relations between embedded files and certain pages, annotations, form fields, graphics objects, structure elements in the tagging structure, DParts or any other PDF object.
{noformat}
[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764225#comment-17764225 ]

Tim Allison commented on PDFBOX-5682:
-------------------------------------

Thank you, [~lehmi]. In Tika, we initially copied PDFBox's ExtractEmbeddedFiles example, but we found that PDF writers can stuff attached files/file specs/associated files on pretty much anything (https://www.pdfa.org/wp-content/uploads/2018/10/PDF20_AN002-AF.pdf).

From what we can tell with publicly available corpora, it is rare to have an attachment not in the name tree and not in an annotation on a page, but after making the change in TIKA-4012, we did find a few new attachments.

This may be a "won't fix" in 3.x.
[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763903#comment-17763903 ]

Tim Allison commented on PDFBOX-5682:
-------------------------------------

Both files spend quite a bit of time in "parseObjectDynamically" when I call this:

PDDocument document = Loader.loadPDF(path.toFile());
List<COSObject> objs = document.getDocument().getObjectsByType(COSName.FILESPEC);
[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763904#comment-17763904 ]

Tim Allison commented on PDFBOX-5682:
-------------------------------------

It looks like that causes a full parse of the file?
[jira] [Updated] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated PDFBOX-5682:
-------------------------------
Summary: Long/permanent hang in PDFBox 3.x (was: Long/permanent hang i n PDFBox 3.x)
[jira] [Created] (PDFBOX-5682) Long/permanent hang i n PDFBox 3.x
Tim Allison created PDFBOX-5682:
-------------------------------

Summary: Long/permanent hang i n PDFBox 3.x
Key: PDFBOX-5682
URL: https://issues.apache.org/jira/browse/PDFBOX-5682
Project: PDFBox
Issue Type: Bug
Reporter: Tim Allison

I found two files in the regression tests where we're now getting timeouts at 3 minutes where we weren't before. Unfortunately, PDFBox's export:text works on both, so it is probably another structural feature, perhaps a problem in Tika?

This file halts after printing out the header for Table 19 on page 46:
https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf
Pure PDFBox's export:text complains multiple times: "Page skipped due to an invalid or missing type null", but it does finish quickly.

This file halts after extracting {{"854,793,592"}}:
https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY
Pure PDFBox's export:text processes this without problem.
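For readers unfamiliar with how a hang surfaces as a "timeout" in a test run, here is a minimal sketch of bounding a task with a time limit using only the JDK. It is an illustration, not a description of how the Tika regression harness actually enforces its 3-minute limit; the class name `TimeLimitedTask`, the method `runWithLimit`, and the fallback string are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeLimitedTask {

    /**
     * Runs the task on a worker thread and returns its result, or the
     * fallback value if the task does not finish within the limit.
     * (Hypothetical helper, not part of PDFBox or Tika.)
     */
    static String runWithLimit(Callable<String> task, long limitMillis, String fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(task);
        try {
            return future.get(limitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; a task that ignores interrupts would keep running.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String fast = runWithLimit(() -> "extracted text", 1_000, "TIMEOUT");
        String slow = runWithLimit(() -> {
            Thread.sleep(5_000); // stands in for a parse that hangs
            return "never returned";
        }, 100, "TIMEOUT");
        System.out.println(fast + " / " + slow);
    }
}
```

A limit like this only reports that the work did not finish in time. It does not say where the time went, which is why the later comments on this issue look at what the parser is doing.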
[jira] [Commented] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763759#comment-17763759 ]

Tim Allison commented on PDFBOX-5681:
-------------------------------------

When I run the demo code in PDFBox trunk with logging on, I see this in the log before the new exception. Further, when running debug in the PDFBox project, I can confirm that the xrefTable is somehow being modified during the iteration of the objects.

{noformat}
11.09.2023 10:49:07 ERROR cos.COSObject:126 - Can't dereference COSObject{5, 0}
java.io.IOException: Wrong type of referenced length object COSObject{6, 0}: COSDictionary
    at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:845) ~[classes/:?]
    at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:875) ~[classes/:?]
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:710) ~[classes/:?]
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:631) ~[classes/:?]
    at org.apache.pdfbox.pdfparser.COSParser.dereferenceCOSObject(COSParser.java:586) ~[classes/:?]
    at org.apache.pdfbox.cos.COSObject.getObject(COSObject.java:121) ~[classes/:?]
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:257) ~[classes/:?]
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:240) ~[classes/:?]
    at org.apache.pdfbox.TestConcurrentModification.oneOff(TestConcurrentModification.java:18) ~[test-classes/:?]
...
{noformat}

> ConcurrentModificationException in getObjectsByType() in 3.x
> ---
>
> Key: PDFBOX-5681
> URL: https://issues.apache.org/jira/browse/PDFBOX-5681
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 3.0.0 PDFBox
> Reporter: Tim Allison
> Priority: Minor
> Attachments: PDFBOX-3714-2.pdf
>
> [~tilman]'s regression testing turned up this exception when we integrate PDFBox 3.0.0 into Tika:
> {noformat}
> java.util.ConcurrentModificationException
>     at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
>     at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1620)
>     at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:254)
>     at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:240)
> {noformat}
> I can replicate this exception consistently on the attached file.
> With this code:
> {noformat}
> Path path = Paths.get("/.../PDFBOX-3714-2.pdf");
> PDDocument document = Loader.loadPDF(path.toFile());
> List<COSObject> objs = document.getDocument().getObjectsByType(COSName.FILESPEC);
> {noformat}
[jira] [Commented] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763754#comment-17763754 ]

Tim Allison commented on PDFBOX-5681:
-------------------------------------

I initially thought this was a threading issue, but it isn't. The exception can be thrown if any modification is made to the underlying collection while the iterator is iterating, even if in the same thread.

My guess is that the computeIfAbsent call in {{getObjectFromPool}} is somehow changing the xRefTable keyset that is being iterated over??? There may be another iteration + modification on a different collection during the parse. The triggering object {{5 0 R}} requires parsing numerous objects from an xrefstream.
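The single-thread behaviour described in this comment can be reproduced with plain JDK collections. The sketch below is not PDFBox code and makes no claim about the eventual fix: the class `SingleThreadCme`, its map contents, and the snapshot workaround are hypothetical illustrations. It only shows that adding a key through `computeIfAbsent` while a `HashMap` key set is being iterated makes the iterator fail fast, with no second thread involved.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleThreadCme {

    /** Builds a small map standing in for an xref-style table (hypothetical data). */
    static Map<Integer, String> sampleTable() {
        Map<Integer, String> table = new HashMap<>();
        table.put(1, "obj 1");
        table.put(2, "obj 2");
        table.put(3, "obj 3");
        return table;
    }

    /** Iterates the live key set and adds a key mid-iteration; true if the iterator fails fast. */
    static boolean failsWhenModifiedDuringIteration() {
        Map<Integer, String> table = sampleTable();
        try {
            for (Integer key : table.keySet()) {
                // Structural modification on the same thread while the iterator is live.
                table.computeIfAbsent(key + 100, k -> "added while iterating");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Iterates a snapshot of the keys, so additions to the map do not disturb the loop. */
    static int sizeAfterSnapshotIteration() {
        Map<Integer, String> table = sampleTable();
        List<Integer> snapshot = new ArrayList<>(table.keySet());
        for (Integer key : snapshot) {
            table.computeIfAbsent(key + 100, k -> "added while iterating");
        }
        return table.size();
    }

    public static void main(String[] args) {
        System.out.println("live key set throws: " + failsWhenModifiedDuringIteration());
        System.out.println("size after snapshot loop: " + sizeAfterSnapshotIteration());
    }
}
```

Iterating over a copy of the keys is one common way to avoid the exception. Whether that is appropriate for `getObjectsByType()` depends on whether objects added during the parse also need to be visited, which this sketch does not address.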
[jira] [Created] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x
Tim Allison created PDFBOX-5681:
-------------------------------

Summary: ConcurrentModificationException in getObjectsByType() in 3.x
Key: PDFBOX-5681
URL: https://issues.apache.org/jira/browse/PDFBOX-5681
Project: PDFBox
Issue Type: Task
Reporter: Tim Allison
Attachments: PDFBOX-3714-2.pdf

[~tilman]'s regression testing turned up this exception when we integrate PDFBox 3.0.0 into Tika:
{noformat}
java.util.ConcurrentModificationException
    at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
    at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1620)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:254)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:240)
{noformat}
I can replicate this exception consistently on this file:

With this code:
{noformat}
Path path = Paths.get("/.../PDFBOX-3714-2.pdf");
PDDocument document = Loader.loadPDF(path.toFile());
List<COSObject> objs = document.getDocument().getObjectsByType(COSName.FILESPEC);
{noformat}
[jira] [Updated] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated PDFBOX-5681:
-------------------------------
Affects Version/s: 3.0.0 PDFBox
[jira] [Updated] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated PDFBOX-5681:
-------------------------------
Description:
[~tilman]'s regression testing turned up this exception when we integrate PDFBox 3.0.0 into Tika:
{noformat}
java.util.ConcurrentModificationException
    at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
    at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1620)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:254)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:240)
{noformat}
I can replicate this exception consistently on the attached file.

With this code:
{noformat}
Path path = Paths.get("/.../PDFBOX-3714-2.pdf");
PDDocument document = Loader.loadPDF(path.toFile());
List<COSObject> objs = document.getDocument().getObjectsByType(COSName.FILESPEC);
{noformat}

was:
[~tilman]'s regression testing turned up this exception when we integrate PDFBox 3.0.0 into Tika:
{noformat}
java.util.ConcurrentModificationException
    at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
    at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1620)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:254)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:240)
{noformat}
I can replicate this exception consistently on this file:

With this code:
{noformat}
Path path = Paths.get("/.../PDFBOX-3714-2.pdf");
PDDocument document = Loader.loadPDF(path.toFile());
List<COSObject> objs = document.getDocument().getObjectsByType(COSName.FILESPEC);
{noformat}
[jira] [Updated] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x
[ https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated PDFBOX-5681:
-------------------------------
Issue Type: Bug (was: Task)
[jira] [Updated] (PDFBOX-5595) Slight regression on corrupt bug tracker file
[ https://issues.apache.org/jira/browse/PDFBOX-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated PDFBOX-5595:
-------------------------------
Description:
I'm not sure this is a regression, and apologies if you already dealt with this before the release of 2.0.28. Also, as a warning, this file is corrupt.

We used to get more text out of this file in 2.0.27 than we do now in 2.0.28:
[https://corpora.tika.apache.org/base/docs/bug_trackers/evince/evince-395-0.zip-0.pdf]

This file derived from the evince bug tracker, which now eventually links to this issue:
[https://gitlab.freedesktop.org/poppler/poppler/-/issues/323]

This image from the poppler issue shows what we get with PDFBox 2.0.28 on the left, and 2.0.27 on the right.

If the decision is "the file is corrupt -> not going to fix", I completely understand.

!https://gitlab.gnome.org/GNOME/evince/uploads/0bc2302dbafc0bbc2110f0d42951428e/evince.JPG!

was:
I'm not sure this is a regression, and apologies if you already dealt with this before the release of 2.0.28. Also, as a warning, this file is corrupt.

We used to get more text out of this file in 2.0.27 than we do now in 2.0.28:
[https://corpora.tika.apache.org/base/docs/bug_trackers/evince/evince-395-0.zip-0.pdf]

This file derived from the evince bug tracker, which now eventually links to this issue:
[https://gitlab.freedesktop.org/poppler/poppler/-/issues/323]

This image shows what we get with PDFBox 2.0.28 on the left, and 2.0.27 on the right.

If the decision is "the file is corrupt -> not going to fix", I completely understand.

!https://gitlab.gnome.org/GNOME/evince/uploads/0bc2302dbafc0bbc2110f0d42951428e/evince.JPG!

> Slight regression on corrupt bug tracker file
> ---
>
> Key: PDFBOX-5595
> URL: https://issues.apache.org/jira/browse/PDFBOX-5595
> Project: PDFBox
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> I'm not sure this is a regression, and apologies if you already dealt with this before the release of 2.0.28. Also, as a warning, this file is corrupt.
>
> We used to get more text out of this file in 2.0.27 than we do now in 2.0.28:
> [https://corpora.tika.apache.org/base/docs/bug_trackers/evince/evince-395-0.zip-0.pdf]
>
> This file derived from the evince bug tracker, which now eventually links to this issue:
> [https://gitlab.freedesktop.org/poppler/poppler/-/issues/323]
>
> This image from the poppler issue shows what we get with PDFBox 2.0.28 on the left, and 2.0.27 on the right.
>
> If the decision is "the file is corrupt -> not going to fix", I completely understand.
> !https://gitlab.gnome.org/GNOME/evince/uploads/0bc2302dbafc0bbc2110f0d42951428e/evince.JPG!
[jira] [Created] (PDFBOX-5595) Slight regression on corrupt bug tracker file
Tim Allison created PDFBOX-5595: --- Summary: Slight regression on corrupt bug tracker file Key: PDFBOX-5595 URL: https://issues.apache.org/jira/browse/PDFBOX-5595 Project: PDFBox Issue Type: Task Reporter: Tim Allison I'm not sure this is a regression, and apologies if you already dealt with this before the release of 2.0.28. Also, as a warning, this file is corrupt. We used to get more text out of this file in 2.0.27 than we do now in 2.0.28: [https://corpora.tika.apache.org/base/docs/bug_trackers/evince/evince-395-0.zip-0.pdf] This file derived from the evince bug tracker, which now eventually links to this issue: [https://gitlab.freedesktop.org/poppler/poppler/-/issues/323] This image shows what we get with PDFBox 2.0.28 on the left, and 2.0.27 on the right. If the decision is "the file is corrupt -> not going to fix", I completely understand. !https://gitlab.gnome.org/GNOME/evince/uploads/0bc2302dbafc0bbc2110f0d42951428e/evince.JPG! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5550) reduce number of open files
[ https://issues.apache.org/jira/browse/PDFBOX-5550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5550: Summary: reduce number of open files (was: redcuce number of open files) > reduce number of open files > --- > > Key: PDFBOX-5550 > URL: https://issues.apache.org/jira/browse/PDFBOX-5550 > Project: PDFBox > Issue Type: Improvement > Components: IO >Affects Versions: 3.0.0 PDFBox >Reporter: Andreas Lehmkühler >Assignee: Andreas Lehmkühler >Priority: Major > Fix For: 3.0.0 PDFBox > > > {{org.apache.pdfbox.io.RandomAccessReadBufferedFile}} creates a new instance > of {{org.apache.pdfbox.io.RandomAccessReadBufferedFile}}, which opens the > underlying file again, every time a new view is created. The view of a > COSStream most likely isn't closed until the entire PDF is closed. In the end > there are as many open files as COSStreams created, until the PDF is closed.
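To make the problem described in PDFBOX-5550 concrete, here is a minimal sketch of one way to avoid opening the file once per view: every view shares a single FileChannel and uses positional reads. This is an illustration only, not the change that was made in PDFBox; the names SharedFile and View are hypothetical and do not exist in the library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Hypothetical illustration: one open file handle shared by many views. */
final class SharedFile implements AutoCloseable {
    private final FileChannel channel;

    SharedFile(Path path) {
        try {
            // The file is opened exactly once, no matter how many views are created.
            channel = FileChannel.open(path, StandardOpenOption.READ);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a view over [offset, offset + length) without opening the file again. */
    View view(long offset, int length) {
        return new View(offset, length);
    }

    @Override
    public void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    final class View {
        private final long offset;
        private final int length;

        private View(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }

        /** Reads the whole view; positional reads do not move a shared cursor. */
        byte[] readAll() {
            ByteBuffer buf = ByteBuffer.allocate(length);
            try {
                while (buf.hasRemaining()) {
                    int n = channel.read(buf, offset + buf.position());
                    if (n < 0) {
                        break; // the view extends past the end of the file
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return Arrays.copyOf(buf.array(), buf.position());
        }
    }
}
```

With this shape, closing the single SharedFile releases the only file handle, regardless of how many views were handed out.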
[jira] [Commented] (PDFBOX-5540) export:text creates jibberish / malformed output
[ https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635337#comment-17635337 ] Tim Allison commented on PDFBOX-5540: - Should I kick that off now? > export:text creates jibberish / malformed output > > > Key: PDFBOX-5540 > URL: https://issues.apache.org/jira/browse/PDFBOX-5540 > Project: PDFBox > Issue Type: Bug > Components: Text extraction >Affects Versions: 2.0.16, 2.0.27, 3.0.0 PDFBox > Environment: Same on Windows, Linux and macOS >Reporter: Alfons >Assignee: Tilman Hausherr >Priority: Minor > Labels: regression > Fix For: 2.0.28, 3.0.0 PDFBox > > Attachments: PDFBOX-5540.pdf.txt, test.pdf, test.txt > > > Using PDFBox as part of Tika and having issues with some PDFs outputting > unreadable content. Copying text from Adobe / macOS Preview / Browsers works > as expected. > I have also tried "re-encoding" the PDF by editing and saving it with > Acrobat, thinking it could be an issue with their original PDF creator and > using pdfbox with different encodings, but output mostly remained unchanged. > I attached the PDF and text it produces. 
Running PDFBox via the CLI as follows: > {code:java} > root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf > Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap > WARNUNG: Invalid ToUnicode CMap in font > Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap > WARNUNG: Using predefined identity CMap instead > Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap > WARNUNG: Invalid ToUnicode CMap in font > Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap > WARNUNG: Using predefined identity CMap instead > Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap > WARNUNG: Invalid ToUnicode CMap in font > Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap > WARNUNG: Using predefined identity CMap instead > Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap > WARNUNG: Invalid ToUnicode CMap in font > Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap > WARNUNG: Using predefined identity CMap instead {code}
[jira] [Commented] (PDFBOX-5501) Jempbox is slow on xmp with large event histories
[ https://issues.apache.org/jira/browse/PDFBOX-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17602789#comment-17602789 ] Tim Allison commented on PDFBOX-5501: - Thank you! > Jempbox is slow on xmp with large event histories > - > > Key: PDFBOX-5501 > URL: https://issues.apache.org/jira/browse/PDFBOX-5501 > Project: PDFBox > Issue Type: Wish >Reporter: Tim Allison >Priority: Minor > Attachments: big.xmp.gz > > > In looking at the timeouts in a recent run against 8 million PDFs, I found > one file where the processing time was caused by extremely slow parsing of > the media management schema. > If I do enough subclassing and put a hard limit inside > getEventSequenceList(), the processing time is fairly quick. > I realize that Jempbox is not going to be supported going forward and > understand if this is a "do not fix". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Resolved] (PDFBOX-5501) Jempbox is slow on xmp with large event histories
[ https://issues.apache.org/jira/browse/PDFBOX-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved PDFBOX-5501. - Resolution: Not A Problem Y. I just also confirmed that this is fixed in 1.8.17-SNAPSHOT. Sorry about that. Thank you. Any plans for the 1.8.17 release? > Jempbox is slow on xmp with large event histories > - > > Key: PDFBOX-5501 > URL: https://issues.apache.org/jira/browse/PDFBOX-5501 > Project: PDFBox > Issue Type: Wish >Reporter: Tim Allison >Priority: Minor > Attachments: big.xmp.gz > > > In looking at the timeouts in a recent run against 8 million PDFs, I found > one file where the processing time was caused by extremely slow parsing of > the media management schema. > If I do enough subclassing and put a hard limit inside > getEventSequenceList(), the processing time is fairly quick. > I realize that Jempbox is not going to be supported going forward and > understand if this is a "do not fix". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Created] (PDFBOX-5501) Jempbox is slow on xmp with large event histories
Tim Allison created PDFBOX-5501: --- Summary: Jempbox is slow on xmp with large event histories Key: PDFBOX-5501 URL: https://issues.apache.org/jira/browse/PDFBOX-5501 Project: PDFBox Issue Type: Wish Reporter: Tim Allison Attachments: big.xmp.gz In looking at the timeouts in a recent run against 8 million PDFs, I found one file where the processing time was caused by extremely slow parsing of the media management schema. If I do enough subclassing and put a hard limit inside getEventSequenceList(), the processing time is fairly quick. I realize that Jempbox is not going to be supported going forward and understand if this is a "do not fix". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument
[ https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578904#comment-17578904 ] Tim Allison commented on PDFBOX-5490: - Y. Completely understand. I don't want to impede 3.0.0. Thank you! > Add reconstruction information to the PDDocument > > > Key: PDFBOX-5490 > URL: https://issues.apache.org/jira/browse/PDFBOX-5490 > Project: PDFBox > Issue Type: Wish > Components: Parsing >Reporter: Tim Allison >Priority: Minor > > When the xref has to be rebuilt or there are other anomalies in the parsing > of the PDDocument, the results are currently logged. In a multithreaded > environment it is not easy to reconstruct which documents had which problems. > It would be helpful if a PDF was able to be successfully loaded to include > information about what had to be fixed in order to load it successfully. > Certainly, rebuilding the xref table comes to mind, but any other info would > also be useful. > This is a wish for 3.x. I don't think I'll have time to contribute. :( -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument
[ https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578510#comment-17578510 ] Tim Allison commented on PDFBOX-5490: - My initial request would be to record whether or not the xref table had to be rebuilt, largely because I'm somewhat interested in that at the moment. Any info at the pre-DOM stage about what had to be guessed or assumed would also help, for example an alleged object stream length that does not match the actual object stream length. Information from the other places where PDFBox currently logs warnings after the DOM has been built (missing font, missing Unicode mappings, etc.) would also be useful. > Add reconstruction information to the PDDocument > > > Key: PDFBOX-5490 > URL: https://issues.apache.org/jira/browse/PDFBOX-5490 > Project: PDFBox > Issue Type: Wish > Components: Parsing >Reporter: Tim Allison >Priority: Minor > > When the xref has to be rebuilt or there are other anomalies in the parsing > of the PDDocument, the results are currently logged. In a multithreaded > environment it is not easy to reconstruct which documents had which problems. > It would be helpful if a PDF was able to be successfully loaded to include > information about what had to be fixed in order to load it successfully. > Certainly, rebuilding the xref table comes to mind, but any other info would > also be useful. > This is a wish for 3.x. I don't think I'll have time to contribute. :(
[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument
[ https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578129#comment-17578129 ] Tim Allison commented on PDFBOX-5490: - Oh, that looks great. > Add reconstruction information to the PDDocument > > > Key: PDFBOX-5490 > URL: https://issues.apache.org/jira/browse/PDFBOX-5490 > Project: PDFBox > Issue Type: Wish > Components: Parsing >Reporter: Tim Allison >Priority: Minor > > When the xref has to be rebuilt or there are other anomalies in the parsing > of the PDDocument, the results are currently logged. In a multithreaded > environment it is not easy to reconstruct which documents had which problems. > It would be helpful if a PDF was able to be successfully loaded to include > information about what had to be fixed in order to load it successfully. > Certainly, rebuilding the xref table comes to mind, but any other info would > also be useful. > This is a wish for 3.x. I don't think I'll have time to contribute. :( -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument
[ https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578055#comment-17578055 ] Tim Allison commented on PDFBOX-5490: - A Listener would be great. Any mechanism that would allow programmatic retrieval of problems encountered during the parse per file. > Add reconstruction information to the PDDocument > > > Key: PDFBOX-5490 > URL: https://issues.apache.org/jira/browse/PDFBOX-5490 > Project: PDFBox > Issue Type: Wish > Components: Parsing >Reporter: Tim Allison >Priority: Minor > > When the xref has to be rebuilt or there are other anomalies in the parsing > of the PDDocument, the results are currently logged. In a multithreaded > environment it is not easy to reconstruct which documents had which problems. > It would be helpful if a PDF was able to be successfully loaded to include > information about what had to be fixed in order to load it successfully. > Certainly, rebuilding the xref table comes to mind, but any other info would > also be useful. > This is a wish for 3.x. I don't think I'll have time to contribute. :( -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
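As a rough illustration of the listener idea in PDFBOX-5490, the sketch below shows how per-document problem reports could be collected programmatically instead of being read back out of a shared log. Everything here is an assumption made for illustration: ParseProblemListener, CollectingListener, FakeParser and the XREF_REBUILT code are hypothetical names, and none of them is a PDFBox API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical callback; nothing like this is confirmed to exist in PDFBox. */
interface ParseProblemListener {
    void onProblem(String code, String detail);
}

/** Collects problems for exactly one document, so results never mix across threads. */
final class CollectingListener implements ParseProblemListener {
    private final List<String> problems = new ArrayList<>();

    @Override
    public void onProblem(String code, String detail) {
        problems.add(code + ": " + detail);
    }

    List<String> problems() {
        return Collections.unmodifiableList(problems);
    }
}

/** Stand-in for a parser; a real one would call the listener where it currently logs. */
final class FakeParser {
    static void parse(boolean xrefBroken, ParseProblemListener listener) {
        if (xrefBroken) {
            // Hypothetical problem code for a rebuilt cross-reference table.
            listener.onProblem("XREF_REBUILT", "xref table had to be rebuilt");
        }
    }
}
```

A caller would create one CollectingListener per file, pass it to the parse, and inspect problems() afterwards. Because each listener belongs to one document, this works in a multithreaded environment without any log correlation.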
[jira] [Updated] (PDFBOX-5490) Add reconstruction information to the PDDocument
[ https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5490: Component/s: Parsing > Add reconstruction information to the PDDocument > > > Key: PDFBOX-5490 > URL: https://issues.apache.org/jira/browse/PDFBOX-5490 > Project: PDFBox > Issue Type: Wish > Components: Parsing >Reporter: Tim Allison >Priority: Minor > > When the xref has to be rebuilt or there are other anomalies in the parsing > of the PDDocument, the results are currently logged. In a multithreaded > environment it is not easy to reconstruct which documents had which problems. > It would be helpful if a PDF was able to be successfully loaded to include > information about what had to be fixed in order to load it successfully. > Certainly, rebuilding the xref table comes to mind, but any other info would > also be useful. > This is a wish for 3.x. I don't think I'll have time to contribute. :( -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Created] (PDFBOX-5490) Add reconstruction information to the PDDocument
Tim Allison created PDFBOX-5490: --- Summary: Add reconstruction information to the PDDocument Key: PDFBOX-5490 URL: https://issues.apache.org/jira/browse/PDFBOX-5490 Project: PDFBox Issue Type: Wish Reporter: Tim Allison When the xref has to be rebuilt or there are other anomalies in the parsing of the PDDocument, the results are currently logged. In a multithreaded environment it is not easy to reconstruct which documents had which problems. It would be helpful if a PDF was able to be successfully loaded to include information about what had to be fixed in order to load it successfully. Certainly, rebuilding the xref table comes to mind, but any other info would also be useful. This is a wish for 3.x. I don't think I'll have time to contribute. :( -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5431) New NPE in xmpbox parser in trunk
[ https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5431: Description: I noticed a new NPE in one of our test files on Tika when I recently built PDFBox's trunk. I've attached the file. If I don't set strict parsing to false, the parse works. {noformat} DomXmpParser xmpParser = new DomXmpParser(); xmpParser.setStrictParsing(false); Path p = Paths.get(".../metadata.xml"); try (InputStream is = Files.newInputStream(p)) { XMPMetadata metadata = xmpParser.parse(is); for (XMPSchema schema : metadata.getAllSchemas()) { for (AbstractField f : schema.getAllProperties()) { System.out.println(f); } } } {noformat} Stack {noformat} java.lang.NullPointerException at org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608) at org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529) at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487) at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352) at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319) at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248) at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201) at org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81) {noformat} was: I noticed a new NPE in one of our test files on Tika when I recently built PDFBox's trunk. I've attached the file. If I don't set strict parsing to false, the parse works.
{noformat} DomXmpParser xmpParser = new DomXmpParser(); xmpParser.setStrictParsing(false); Path p = Paths.get("/home/tallison/Desktop/tmp/META-INF/metadata.xml"); try (InputStream is = Files.newInputStream(p)) { XMPMetadata metadata = xmpParser.parse(is); for (XMPSchema schema : metadata.getAllSchemas()) { for (AbstractField f : schema.getAllProperties()) { System.out.println(f); } } } {noformat} Stack {noformat} java.lang.NullPointerException at org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608) at org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529) at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487) at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352) at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319) at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248) at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201) at org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81) {noformat} > New NPE in xmpbox parser in trunk > - > > Key: PDFBOX-5431 > URL: https://issues.apache.org/jira/browse/PDFBOX-5431 > Project: PDFBox > Issue Type: Task > Components: XmpBox >Affects Versions: 3.0.0 PDFBox >Reporter: Tim Allison >Priority: Major > Attachments: metadata.xml > > > I noticed a new NPE in one of our test files on Tika when I recently built > PDFBox's trunk. I've attached the file. > If I don't set strict parsing to false, the parse works.
> {noformat} > DomXmpParser xmpParser = new DomXmpParser(); > xmpParser.setStrictParsing(false); > Path p = Paths.get(".../metadata.xml"); > try (InputStream is = Files.newInputStream(p)) { > XMPMetadata metadata = xmpParser.parse(is); > for (XMPSchema schema : metadata.getAllSchemas()) { > for (AbstractField f : schema.getAllProperties()) { > System.out.println(f); > } > } > } > {noformat} > Stack > {noformat} > java.lang.NullPointerException > at > org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608) > at > org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529) > at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487) > at > org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352) > at > org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319) > at > org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248) > at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201) > at > org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81) > {noformat}
[jira] [Updated] (PDFBOX-5431) New NPE in xmpbox parser in trunk
[ https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5431: Component/s: XmpBox > New NPE in xmpbox parser in trunk > - > > Key: PDFBOX-5431 > URL: https://issues.apache.org/jira/browse/PDFBOX-5431 > Project: PDFBox > Issue Type: Task > Components: XmpBox >Affects Versions: 3.0.0 PDFBox >Reporter: Tim Allison >Priority: Major > Attachments: metadata.xml > > > I noticed a new NPE in one of our test files on Tika when I recently built > PDFBox's trunk. I've attached the file. > If I don't set strict parsing to false, the parse works. > {noformat} > DomXmpParser xmpParser = new DomXmpParser(); > xmpParser.setStrictParsing(false); > Path p = > Paths.get("/home/tallison/Desktop/tmp/META-INF/metadata.xml"); > try (InputStream is = Files.newInputStream(p)) { > XMPMetadata metadata = xmpParser.parse(is); > for (XMPSchema schema : metadata.getAllSchemas()) { > for (AbstractField f : schema.getAllProperties()) { > System.out.println(f); > } > } > } > {noformat} > Stack > {noformat} > java.lang.NullPointerException > at > org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608) > at > org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529) > at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487) > at > org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352) > at > org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319) > at > org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248) > at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201) > at > org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81) > {noformat}
[jira] [Updated] (PDFBOX-5431) New NPE in xmpbox parser in trunk
[ https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5431: Affects Version/s: 3.0.0 PDFBox > New NPE in xmpbox parser in trunk > - > > Key: PDFBOX-5431 > URL: https://issues.apache.org/jira/browse/PDFBOX-5431 > Project: PDFBox > Issue Type: Task >Affects Versions: 3.0.0 PDFBox >Reporter: Tim Allison >Priority: Major > Attachments: metadata.xml > > > I noticed a new NPE in one of our test files on Tika when I recently built > PDFBox's trunk. I've attached the file. > If I don't set strict parsing to false, the parse works. > {noformat} > DomXmpParser xmpParser = new DomXmpParser(); > xmpParser.setStrictParsing(false); > Path p = > Paths.get("/home/tallison/Desktop/tmp/META-INF/metadata.xml"); > try (InputStream is = Files.newInputStream(p)) { > XMPMetadata metadata = xmpParser.parse(is); > for (XMPSchema schema : metadata.getAllSchemas()) { > for (AbstractField f : schema.getAllProperties()) { > System.out.println(f); > } > } > } > {noformat} > Stack > {noformat} > java.lang.NullPointerException > at > org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608) > at > org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529) > at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487) > at > org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352) > at > org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319) > at > org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248) > at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201) > at > org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81) > {noformat}
[jira] [Created] (PDFBOX-5431) New NPE in xmpbox parser in trunk
Tim Allison created PDFBOX-5431: --- Summary: New NPE in xmpbox parser in trunk Key: PDFBOX-5431 URL: https://issues.apache.org/jira/browse/PDFBOX-5431 Project: PDFBox Issue Type: Task Reporter: Tim Allison Attachments: metadata.xml I noticed a new NPE in one of our test files on Tika when I recently built PDFBox's trunk. I've attached the file. If I don't set strict parsing to false, the parse works. {noformat} DomXmpParser xmpParser = new DomXmpParser(); xmpParser.setStrictParsing(false); Path p = Paths.get("/home/tallison/Desktop/tmp/META-INF/metadata.xml"); try (InputStream is = Files.newInputStream(p)) { XMPMetadata metadata = xmpParser.parse(is); for (XMPSchema schema : metadata.getAllSchemas()) { for (AbstractField f : schema.getAllProperties()) { System.out.println(f); } } } {noformat} Stack {noformat} java.lang.NullPointerException at org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608) at org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529) at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487) at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352) at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319) at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248) at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201) at org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81) {noformat}
[jira] [Commented] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf
[ https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522531#comment-17522531 ] Tim Allison commented on PDFBOX-5415: - An answer on the Tika side. Yes, parsing is dangerous and you’ll need to isolate at the process level; thread level isolation is not enough. See what we offer in Tika for robustness: https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=148647830#content/view/148647830 > Infinite loop in ExtractText in 2.x branch on a specific pdf > > > Key: PDFBOX-5415 > URL: https://issues.apache.org/jira/browse/PDFBOX-5415 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 2.0.26 >Reporter: Tim Allison >Priority: Major > Attachments: PDFBOX-5415-TIKA-3718-p10.pdf > > > [~DavidAvant] reported an infinite loop in Tika and provided an example file. > I can reproduce this with the latest PDFBox app 2.0.26-SNAPSHOT's > ExtractText. > File: https://issues.apache.org/jira/secure/attachment/13042292/map.pdf > Adobe and a slightly out of date pdftotext also have problems with this file. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
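The point about process-level isolation can be sketched as follows. This is a minimal illustration under stated assumptions, not how Tika implements its robustness features: IsolatedRun is a hypothetical helper, and the command passed to it is whatever launches the parse in a separate JVM. The key difference from thread-level isolation is that a child process stuck in an infinite loop can be killed reliably from outside.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: run a parse in a child process with a hard timeout. */
final class IsolatedRun {
    /**
     * Runs the command in a child process and kills it if it exceeds the timeout.
     * Returns true only if the child finished in time with exit code 0.
     */
    static boolean run(List<String> command, long timeoutMillis) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child can never block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                // Timed out (for example, an infinite loop): kill the whole process.
                p.destroyForcibly();
                return false;
            }
            return p.exitValue() == 0;
        } catch (IOException e) {
            return false; // the child could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In practice the command would start a separate JVM that runs text extraction on a single file and writes its result to disk, so that a hang or an out-of-memory error in the child never takes down the parent.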
[jira] [Commented] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf
[ https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521382#comment-17521382 ] Tim Allison commented on PDFBOX-5415: - Michael Demey's diagnosis: https://twitter.com/MyMilkedEek/status/1513990823511273472?s=20 > Infinite loop in ExtractText in 2.x branch on a specific pdf > > > Key: PDFBOX-5415 > URL: https://issues.apache.org/jira/browse/PDFBOX-5415 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 2.0.26 >Reporter: Tim Allison >Priority: Major > Attachments: PDFBOX-5415-TIKA-3718-p10.pdf > > > [~DavidAvant] reported an infinite loop in Tika and provided an example file. > I can reproduce this with the latest PDFBox app 2.0.26-SNAPSHOT's > ExtractText. > File: https://issues.apache.org/jira/secure/attachment/13042292/map.pdf > Adobe and a slightly out of date pdftotext also have problems with this file. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf
[ https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5415: Affects Version/s: 2.0.26 > Infinite loop in ExtractText in 2.x branch on a specific pdf > > > Key: PDFBOX-5415 > URL: https://issues.apache.org/jira/browse/PDFBOX-5415 > Project: PDFBox > Issue Type: Bug >Affects Versions: 2.0.26 >Reporter: Tim Allison >Priority: Major > > [~DavidAvant] reported an infinite loop in Tika and provided an example file. > I can reproduce this with the latest PDFBox app 2.0.26-SNAPSHOT's > ExtractText. > File: https://issues.apache.org/jira/secure/attachment/13042292/map.pdf > Adobe and a slightly out of date pdftotext also have problems with this file. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf
[ https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5415: Component/s: Parsing > Infinite loop in ExtractText in 2.x branch on a specific pdf > > > Key: PDFBOX-5415 > URL: https://issues.apache.org/jira/browse/PDFBOX-5415 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 2.0.26 >Reporter: Tim Allison >Priority: Major > > [~DavidAvant] reported an infinite loop in Tika and provided an example file. > I can reproduce this with the latest PDFBox app 2.0.26-SNAPSHOT's > ExtractText. > File: https://issues.apache.org/jira/secure/attachment/13042292/map.pdf > Adobe and a slightly out of date pdftotext also have problems with this file. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Created] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf
Tim Allison created PDFBOX-5415: --- Summary: Infinite loop in ExtractText in 2.x branch on a specific pdf Key: PDFBOX-5415 URL: https://issues.apache.org/jira/browse/PDFBOX-5415 Project: PDFBox Issue Type: Bug Reporter: Tim Allison [~DavidAvant] reported an infinite loop in Tika and provided an example file. I can reproduce this with the latest PDFBox app 2.0.26-SNAPSHOT's ExtractText. File: https://issues.apache.org/jira/secure/attachment/13042292/map.pdf Adobe and a slightly out of date pdftotext also have problems with this file. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Resolved] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set
[ https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved PDFBOX-5396. - Fix Version/s: 2.0.26 Resolution: Fixed > Add maven enforcer rule to ensure that JAVA_HOME is set > --- > > Key: PDFBOX-5396 > URL: https://issues.apache.org/jira/browse/PDFBOX-5396 > Project: PDFBox > Issue Type: Task >Affects Versions: 2.0.25 >Reporter: Tim Allison >Priority: Trivial > Fix For: 2.0.26 > > > I recently stubbed my toe on this one again. At least in the 2.x branch, the > module fontbox requires that the JAVA_HOME variable be set. If it isn't set, > the project build fails in fontbox without any meaningful indication as to > why, even with the -X option set in maven. > {noformat} > (default-compile) on project fontbox: Compilation failure -> [Help 1] > org.apache.maven.lifecycle.LifecycleExecutionException: Failed to > execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile > (default-compile) on project fontbox: Compilation failure > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > {noformat} > Also, on our website, there's no mention that JAVA_HOME should be set. And, > yes, I realize that it is set on most developers' systems. :D > One solution would be to add this rule to the maven-enforcer-plugin > configuration in the parent pom: > {code:java} > <requireEnvironmentVariable> > <variableName>JAVA_HOME</variableName> > <message>The JAVA_HOME environment variable must be set!</message> > </requireEnvironmentVariable> > {code} > If this is ok, I'll add this rule in 2.x and see if I get the same behavior > in trunk. > Side note: This was probably the cause of: > https://www.mail-archive.com/users@pdfbox.apache.org/msg11423.html and a few > other issues.
[jira] [Commented] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing
[ https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512474#comment-17512474 ] Tim Allison commented on PDFBOX-5401: - bq. Hi, I didn't test these samples on PDFBOX 2.0 Sorry, my comment above was a finding, not a question. > A carefully crafted pdf can trigger an infinite loop while parsing > -- > > Key: PDFBOX-5401 > URL: https://issues.apache.org/jira/browse/PDFBOX-5401 > Project: PDFBox > Issue Type: Bug > Components: Parsing, PDModel >Affects Versions: 3.0.0 PDFBox > Environment: Mac OS 12.1 & Ubuntu Linux 16.04 (4.15.0-163-generic) >Reporter: Xiaohan Zhang >Priority: Major > Attachments: verified.zip > > > Hi, I found a crafted pdf that can trigger an infinite loop while parsing > using PDFBOX. I have tested on the latest commit of PDFBOX on Github. > > This bug can be triggered by the following code. > ``` > File ff = new File("path/to/the/sample"); > PDDocument document = Loader.loadPDF(ff); > ``` > > I found that the root cause of this infinite loop resides in the while-loop > at line 321 of [COSParse.java|#L321].]. When parsing the provided PDF files, > the variable $prev is never changed during this loop.
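The failure mode described in the report, a loop that follows a "prev" value which never changes, is the general problem of walking a chain of back-pointers taken from untrusted input. One common guard is to remember every offset already visited and stop as soon as one repeats. The sketch below illustrates that idea only; it is not the PDFBox parser code and not necessarily how the issue was or will be fixed. `PrevChainWalker`, `walk`, and the map standing in for "offset to previous offset" are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; this is not the PDFBox parser code.
class PrevChainWalker {

    /**
     * Follows "previous offset" links starting at startOffset.
     * prevOf maps an offset to the offset it points back to; a missing entry ends the chain.
     * Returns the offsets in visiting order.
     * Throws IllegalStateException if any offset is reached twice, which would
     * otherwise make the loop run forever.
     */
    static Set<Long> walk(long startOffset, Map<Long, Long> prevOf) {
        Set<Long> visited = new LinkedHashSet<>();
        Long current = startOffset;
        while (current != null) {
            // add() returns false when the offset was already seen: a cycle
            if (!visited.add(current)) {
                throw new IllegalStateException("Cycle in prev chain at offset " + current);
            }
            current = prevOf.get(current);
        }
        return visited;
    }
}
```

The same check covers the degenerate case from the report, where an entry points back at itself, as well as longer cycles.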
[jira] [Comment Edited] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing
[ https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512397#comment-17512397 ] Tim Allison edited comment on PDFBOX-5401 at 3/25/22, 4:38 PM: --- I confirmed this behavior with the last 2.0.26-SNAPSHOT I used for regression tests (from earlier this week?) with 3 of the 4 files ({{bda2803...}} does not cause problems for me). was (Author: talli...@mitre.org): Can confirm behavior with the last 2.0.26-SNAPSHOT I used for regression tests (from earlier this week?) with 3 of the 4 files ({{bda2803...}} does not cause problems for me).
[jira] [Comment Edited] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing
[ https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512397#comment-17512397 ] Tim Allison edited comment on PDFBOX-5401 at 3/25/22, 2:07 PM: --- Can confirm behavior with the last 2.0.26-SNAPSHOT I used for regression tests (from earlier this week?) with 3 of the 4 files ({{bda2803...}} does not cause problems for me). was (Author: talli...@mitre.org): Can confirm behavior with the last 2.0.26-SNAPSHOT I used for regression tests with 3 of the 4 files ({{bda2803...}} does not cause problems for me.
[jira] [Commented] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing
[ https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512397#comment-17512397 ] Tim Allison commented on PDFBOX-5401: - Can confirm behavior with the last 2.0.26-SNAPSHOT I used for regression tests with 3 of the 4 files ({{bda2803...}} does not cause problems for me.
[jira] [Commented] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set
[ https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509892#comment-17509892 ] Tim Allison commented on PDFBOX-5396: - This is not a problem in trunk.
[jira] [Updated] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set
[ https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5396: Description: I recently stubbed my toe on this one again. At least in the 2.x branch, the module fontbox requires that the JAVA_HOME variable be set. If it isn't set, the project build fails in fontbox without any meaningful indication as to why, even with the -X option set in maven. {noformat} (default-compile) on project fontbox: Compilation failure -> [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile (default-compile) on project fontbox: Compilation failure at org.apache.maven.lifecycle.internal.MojoExecutor.execute {noformat} Also, on our website, there's no mention that JAVA_HOME should be set. And, yes, I realize that it is set on most developers' systems. :D One solution would be to add this rule to the maven-enforcer-plugin configuration in the parent pom: {code:java} JAVA_HOME The JAVA_HOME environment variable must be set! {code} If this is ok, I'll add this rule in 2.x and see if I get the same behavior in trunk. Side note: This was probably the cause of: https://www.mail-archive.com/users@pdfbox.apache.org/msg11423.html and a few other issues. was: I recently stubbed my toe on this one again. At least in the 2.x branch, the module fontbox requires that the JAVA_HOME variable be set. If it isn't set, the project build fails in fontbox without any meaningful indication as to why, even with the -X option set in maven. 
{noformat} (default-compile) on project fontbox: Compilation failure -> [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile (default-compile) on project fontbox: Compilation failure at org.apache.maven.lifecycle.internal.MojoExecutor.execute {noformat} Also, on our website, there's no mention that JAVA_HOME should be set. And, yes, I realize that it is set on most developers' systems. :D One solution would be to add this rule to the maven-enforcer-plugin configuration in the parent pom: {code:java} JAVA_HOME The JAVA_HOME environment variable must be set! {code} If this is ok, I'll add this rule in 2.x and see if I get the same behavior in trunk. Side note: This was probably the cause of: https://www.mail-archive.com/users@pdfbox.apache.org/msg11423.html and a few other issues. > Add maven enforcer rule to ensure that JAVA_HOME is set > --- > > Key: PDFBOX-5396 > URL: https://issues.apache.org/jira/browse/PDFBOX-5396 > Project: PDFBox > Issue Type: Task >Affects Versions: 2.0.25 >Reporter: Tim Allison >Priority: Trivial > > I recently stubbed my toe on this one again. At least in the 2.x branch, the > module fontbox requires that the JAVA_HOME variable be set. If it isn't set, > the project build fails in fontbox without any meaningful indication as to > why, even with the -X option set in maven. > {noformat} > (default-compile) on project fontbox: Compilation failure -> [Help 1] > org.apache.maven.lifecycle.LifecycleExecutionException: Failed to > execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile > (default-compile) on project fontbox: Compilation failure > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > {noformat} > Also, on our website, there's no mention that JAVA_HOME should be set. And, > yes, I realize that it is set on most developers' systems. 
:D > One solution would be to add this rule to the maven-enforcer-plugin > configuration in the parent pom: > {code:java} > > JAVA_HOME > The JAVA_HOME environment variable must be set! > > {code} > If this is ok, I'll add this rule in 2.x and see if I get the same behavior > in trunk. > Side note: This was probably the cause of: > https://www.mail-archive.com/users@pdfbox.apache.org/msg11423.html and a few > other issues. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Created] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set
Tim Allison created PDFBOX-5396: --- Summary: Add maven enforcer rule to ensure that JAVA_HOME is set Key: PDFBOX-5396 URL: https://issues.apache.org/jira/browse/PDFBOX-5396 Project: PDFBox Issue Type: Task Affects Versions: 2.0.25 Reporter: Tim Allison I recently stubbed my toe on this one again. At least in the 2.x branch, the module fontbox requires that the JAVA_HOME variable be set. If it isn't set, the project build fails in fontbox without any meaningful indication as to why, even with the -X option set in maven. {noformat} (default-compile) on project fontbox: Compilation failure -> [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile (default-compile) on project fontbox: Compilation failure at org.apache.maven.lifecycle.internal.MojoExecutor.execute {noformat} Also, on our website, there's no mention that JAVA_HOME should be set. And, yes, I realize that it is set on most developers' systems. :D One solution would be to add this rule to the maven-enforcer-plugin configuration in the parent pom: {code:java} <requireEnvironmentVariable> <variableName>JAVA_HOME</variableName> <message>The JAVA_HOME environment variable must be set!</message> </requireEnvironmentVariable> {code} If this is ok, I'll add this rule in 2.x and see if I get the same behavior in trunk. Side note: This was probably the cause of: https://www.mail-archive.com/users@pdfbox.apache.org/msg11423.html and a few other issues.
[jira] [Created] (PDFBOX-5358) Add support for UTF-8 in strings
Tim Allison created PDFBOX-5358: --- Summary: Add support for UTF-8 in strings Key: PDFBOX-5358 URL: https://issues.apache.org/jira/browse/PDFBOX-5358 Project: PDFBox Issue Type: Improvement Reporter: Tim Allison Attachments: Screen Shot 2022-01-06 at 9.18.09 AM.png Peter Wyatt recently published an article on UTF-8 strings in PDF 2.0: [https://www.pdfa.org/understanding-utf-8-in-pdf-2-0/] The article includes a link to a test file he created: [https://github.com/pdf-association/pdf20examples/blob/master/pdf20-utf8-test.pdf] Our debugger shows that we may need to add support for this (see attached). This was with PDFBox 2.0.25. I didn't have a chance to test with 3.x or the 2.x snapshot. I don't think we're necessarily covering all the changes yet in PDF 2.0, but I thought I'd open this issue for at least discussion.
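For discussion, here is roughly what the decoding decision looks like. In PDF 2.0 a text string may start with the UTF-8 byte order mark (EF BB BF) in addition to the older UTF-16BE marker (FE FF); anything without a marker is PDFDocEncoding. The sketch below is a minimal illustration under stated assumptions, not PDFBox code: `PdfTextStringDecoder` and `decode` are hypothetical names, and ISO-8859-1 is used only as a stand-in for PDFDocEncoding, which differs from it in a number of code points.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the PDFBox API.
class PdfTextStringDecoder {

    /** Decodes the raw bytes of a PDF text string by inspecting its leading marker bytes. */
    static String decode(byte[] bytes) {
        // UTF-8 marker EF BB BF (PDF 2.0)
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        // UTF-16BE marker FE FF
        if (bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 stands in for PDFDocEncoding here; a real
        // implementation needs the PDFDocEncoding table instead.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

The order of the checks does not matter for correctness, since the two markers cannot both match, but checking the marker before falling back is what keeps a UTF-8 string from being shown as mojibake.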
[jira] [Commented] (PDFBOX-5164) Create portable collection PDF
[ https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326042#comment-17326042 ] Tim Allison commented on PDFBOX-5164: - Thank you, [~tilman]! > Create portable collection PDF > -- > > Key: PDFBOX-5164 > URL: https://issues.apache.org/jira/browse/PDFBOX-5164 > Project: PDFBox > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.18 > Environment: java >Reporter: zhouxiaolong >Assignee: Tilman Hausherr >Priority: Major > Fix For: 2.0.24, 3.0.0 PDFBox > > Attachments: CreatePortableCollection.java, MakePackage.java, > PortableCollection.pdf, collection.pdf, image-2021-04-15-16-02-42-451.png, > screenshot-1.png, tika-output.json, viewfiles - 副本.pdf > > > !image-2021-04-15-16-02-42-451.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5164) Create portable collection PDF
[ https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325972#comment-17325972 ] Tim Allison commented on PDFBOX-5164: - Sorry to hijack this, but I wanted to confirm with [~zxltmj]...is this the output that you'd expect? This is the recursive parser wrapper from Tika, which uses PDFBox. I just want to confirm that we don't have to do anything else to handle portable collections.
[jira] [Updated] (PDFBOX-5164) Create portable collection PDF
[ https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5164: Attachment: tika-output.json
[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324082#comment-17324082 ] Tim Allison commented on PDFBOX-5166: - Ha, @bitsgalore has an example of subtype=Screen. Yay! https://twitter.com/_tallison/status/1383164998629924870?s=20 > Implement RichMedia annotation > -- > > Key: PDFBOX-5166 > URL: https://issues.apache.org/jira/browse/PDFBOX-5166 > Project: PDFBox > Issue Type: New Feature > Components: PDModel >Reporter: Tim Allison >Priority: Minor > Labels: Annotations > Attachments: testFlashInPDF.pdf > > > See TIKA-3359. The attached file has an embedded Flash/swf file. Tika is not > currently extracting the embedded file. > In the debugger, I can see the Annotation as a PDAnnotationUnknown. In the > COSDictionary, I can see the subtype is "RichMedia". If someone has the > time, it'd be great to implement this so that we can extract more attachments > in Tika... Obv, others may find use too. :D > Many thanks to Tyler Thorsted for the test file and many thanks to > @terminalboredom and @beet_keeper.
[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324048#comment-17324048 ] Tim Allison commented on PDFBOX-5166: - Are those also streams in subtype=RichMedia or do we need to look for other subtypes?
[jira] [Comment Edited] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324002#comment-17324002 ] Tim Allison edited comment on PDFBOX-5166 at 4/16/21, 6:07 PM: --- Extraction only, yes...for our purposes on Tika, we wouldn't have any need to add or modify. I'm ok with Tilman's example code for now, but I worry that we'll likely come across some required special handling that it would be better to have in PDFBox. This isn't high priority, and I don't see a need to backport to 2.x. Separate topic...I'm wondering now if there are other annotation types that might conceal embedded files? was (Author: talli...@mitre.org): Extraction only, yes...for our purposes on Tika, we wouldn't have any need to add or modify. I'm ok with Tilman's example code for now, but I worry that we'll likely come across some required special handling that'd it would be better to have in PDFBox. This isn't high priority, and I don't see a need to backport to 2.x. Separate topic...I'm wondering now if there are other annotation types that might conceal embedded files?
[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324002#comment-17324002 ] Tim Allison commented on PDFBOX-5166: - Extraction only, yes...for our purposes on Tika, we wouldn't have any need to add or modify. I'm ok with Tilman's example code for now, but I worry that we'll likely come across some required special handling that'd it would be better to have in PDFBox. This isn't high priority, and I don't see a need to backport to 2.x. Separate topic...I'm wondering now if there are other annotation types that might conceal embedded files?
[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5166: Issue Type: New Feature (was: Task)
[jira] [Comment Edited] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox
[ https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323831#comment-17323831 ] Tim Allison edited comment on PDFBOX-5165 at 4/16/21, 1:52 PM: --- Thank you for the quick fix! Unless there are needs on other projects, we have no immediate need on the Tika side. Let's wait a bit to see if anything else falls out of the regression tests with PDFBox 3.0.0-SNAPSHOT. At some point, it would be great to have an updated jempbox for this issue and also for the rare date/time concurrency issue. was (Author: talli...@mitre.org): Unless there are needs on other projects, we have no immediate need on the Tika side. Let's wait a bit to see if anything else falls out of the regression tests with PDFBox 3.0.0-SNAPSHOT. At some point, it would be great to have an updated jempbox for this issue and also for the rare date/time concurrency issue. > Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in > JempBox > --- > > Key: PDFBOX-5165 > URL: https://issues.apache.org/jira/browse/PDFBOX-5165 > Project: PDFBox > Issue Type: Task > Components: JempBox >Affects Versions: 1.8.16 >Reporter: Tim Allison >Assignee: Tilman Hausherr >Priority: Trivial > Labels: optimization > Fix For: 1.8.17
[jira] [Commented] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox
[ https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323831#comment-17323831 ] Tim Allison commented on PDFBOX-5165: - Unless there are needs on other projects, we have no immediate need on the Tika side. Let's wait a bit to see if anything else falls out of the regression tests with PDFBox 3.0.0-SNAPSHOT. At some point, it would be great to have an updated jempbox for this issue and also for the rare date/time concurrency issue.
[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5166: Priority: Minor (was: Major)
[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323809#comment-17323809 ] Tim Allison commented on PDFBOX-5166: - Completely unsurprisingly, [~tilman] has already shown how to extract these files on SO: https://stackoverflow.com/questions/45460027/what-is-the-best-way-to-extract-embedded-flash-file-from-a-pdf-using-the-pdfbox If this is a "not going to fix", no problem! I'm happy to put that code into Tika for now, and if a RichMedia annotation gets implemented in PDFBox, I can update our code accordingly. > Implement RichMedia annotation > -- > > Key: PDFBOX-5166 > URL: https://issues.apache.org/jira/browse/PDFBOX-5166 > Project: PDFBox > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: testFlashInPDF.pdf > > > See TIKA-3359. The attached file has an embedded Flash/swf file. Tika is not > currently extracting the embedded file. > In the debugger, I can see the Annotation as a PDAnnotationUnknown. In the > COSDictionary, I can see the subtype is "RichMedia". If someone has the > time, it'd be great to implement this so that we can extract more attachments > in Tika... Obv, others may find use too. :D > Many thanks to Tyler Thorsted for the test file and many thanks to > @terminalboredom and @beet_keeper.
[jira] [Created] (PDFBOX-5166) Implement RichMedia annotation
Tim Allison created PDFBOX-5166: --- Summary: Implement RichMedia annotation Key: PDFBOX-5166 URL: https://issues.apache.org/jira/browse/PDFBOX-5166 Project: PDFBox Issue Type: Task Reporter: Tim Allison Attachments: testFlashInPDF.pdf See TIKA-3359. The attached file has an embedded Flash/swf file. Tika is not currently extracting the embedded file. In the debugger, I can see the Annotation as a PDAnnotationUnknown. In the COSDictionary, I can see the subtype is "RichMedia". If someone has the time, it'd be great to implement this so that we can extract more attachments in Tika... Obv, others may find use too. :D Many thanks to Tyler Thorsted for the test file and many thanks to @terminalboredom and @beet_keeper.
[jira] [Commented] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox
[ https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322323#comment-17322323 ] Tim Allison commented on PDFBOX-5165: - I realize that JempBox is outdated, but we're still using it in Tika. I found a PDF with a large event list in the media management schema. Calling getHistory() on it takes a couple of minutes. :( Is there any simple fix available? The XMP is in this file: http://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/MR/MRWP762LL3DMIFWGZPVPZXJNFGUAGHML I tried to zip the extracted XMP and attach it here, with no luck. > Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in > JempBox > --- > > Key: PDFBOX-5165 > URL: https://issues.apache.org/jira/browse/PDFBOX-5165 > Project: PDFBox > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial
[jira] [Created] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement
Tim Allison created PDFBOX-5165: --- Summary: Exceedingly slow processing of XMPSchemaMediaManagement Key: PDFBOX-5165 URL: https://issues.apache.org/jira/browse/PDFBOX-5165 Project: PDFBox Issue Type: Task Reporter: Tim Allison
[jira] [Updated] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox
[ https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5165: Summary: Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox (was: Exceedingly slow processing of XMPSchemaMediaManagement) > Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in > JempBox > --- > > Key: PDFBOX-5165 > URL: https://issues.apache.org/jira/browse/PDFBOX-5165 > Project: PDFBox > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial
[jira] [Comment Edited] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/PDFBOX-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317514#comment-17317514 ] Tim Allison edited comment on PDFBOX-5158 at 4/9/21, 1:36 PM: -- Which in turn led me to find a bug in Tika's integration with 3.x: https://github.com/apache/tika/commit/f336c599a5536c7d3a7c0a0c94c71c8b695832ec :D Oh, dear, this bug is active in the wild in 1.26... TIKA-3350 :( Three cheers for collaboration across projects! was (Author: talli...@mitre.org): Which in turn led me to find a bug in Tika's integration with 3.x: https://github.com/apache/tika/commit/f336c599a5536c7d3a7c0a0c94c71c8b695832ec :D > Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT > > > Key: PDFBOX-5158 > URL: https://issues.apache.org/jira/browse/PDFBOX-5158 > Project: PDFBox > Issue Type: Bug >Affects Versions: 3.0.0 PDFBox >Reporter: Tim Allison >Assignee: Tilman Hausherr >Priority: Critical > Fix For: 3.0.0 PDFBox > > > I found a bunch of files that had a "read too many EOFs", which is a safety > check we now do in TikaInputStream to identify parsers that read an EOF > > 1000 times, which may be a sign of an infinite loop. > When I turn off this safety check in TikaInputStream, I get an infinite loop. > This is one of the triggering files: > https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W > It's a truncated file from Common Crawl. 
> The stacktrace when this is thrown is: > {noformat} > afterRead:809, TikaInputStream (org.apache.tika.io) > read:82, ProxyInputStream (org.apache.commons.io.input) > :113, RandomAccessReadBuffer (org.apache.pdfbox.io) > loadPDF:454, Loader (org.apache.pdfbox) > loadPDF:430, Loader (org.apache.pdfbox) > getPDDocument:189, PDFParser (org.apache.tika.parser.pdf) > parse:148, PDFParser (org.apache.tika.parser.pdf) > parse:288, CompositeParser (org.apache.tika.parser) > parse:288, CompositeParser (org.apache.tika.parser) > parse:150, AutoDetectParser (org.apache.tika.parser) > parse:157, RecursiveParserWrapper (org.apache.tika.parser) > getRecursiveMetadata:379, TikaTest (org.apache.tika) > getRecursiveMetadata:369, TikaTest (org.apache.tika) > getRecursiveMetadata:357, TikaTest (org.apache.tika) > getRecursiveMetadata:351, TikaTest (org.apache.tika) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/PDFBOX-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317514#comment-17317514 ] Tim Allison commented on PDFBOX-5158: - Which in turn led me to find a bug in Tika's integration with 3.x: https://github.com/apache/tika/commit/f336c599a5536c7d3a7c0a0c94c71c8b695832ec :D > Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT > > > Key: PDFBOX-5158 > URL: https://issues.apache.org/jira/browse/PDFBOX-5158 > Project: PDFBox > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I found a bunch of files that had a "read too many EOFs", which is a safety > check we now do in TikaInputStream to identify parsers that read an EOF > > 1000 times, which may be a sign of an infinite loop. > When I turn off this safety check in TikaInputStream, I get an infinite loop. > This is one of the triggering files: > https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W > It's a truncated file from Common Crawl. 
> The stacktrace when this is thrown is: > {noformat} > afterRead:809, TikaInputStream (org.apache.tika.io) > read:82, ProxyInputStream (org.apache.commons.io.input) > :113, RandomAccessReadBuffer (org.apache.pdfbox.io) > loadPDF:454, Loader (org.apache.pdfbox) > loadPDF:430, Loader (org.apache.pdfbox) > getPDDocument:189, PDFParser (org.apache.tika.parser.pdf) > parse:148, PDFParser (org.apache.tika.parser.pdf) > parse:288, CompositeParser (org.apache.tika.parser) > parse:288, CompositeParser (org.apache.tika.parser) > parse:150, AutoDetectParser (org.apache.tika.parser) > parse:157, RecursiveParserWrapper (org.apache.tika.parser) > getRecursiveMetadata:379, TikaTest (org.apache.tika) > getRecursiveMetadata:369, TikaTest (org.apache.tika) > getRecursiveMetadata:357, TikaTest (org.apache.tika) > getRecursiveMetadata:351, TikaTest (org.apache.tika) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/PDFBOX-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317509#comment-17317509 ] Tim Allison commented on PDFBOX-5158: - Yes, I get your stacktrace with a file, but I get an infinite loop with an InputStream.
{noformat}
Path path = Paths.get("OELHPKYAQPDNDWC535NE23Z6FKYRMN7W.pdf");
try (InputStream is = Files.newInputStream(path)) {
    Loader.loadPDF(is);
}
{noformat}
> Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT > > > Key: PDFBOX-5158 > URL: https://issues.apache.org/jira/browse/PDFBOX-5158 > Project: PDFBox > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I found a bunch of files that had a "read too many EOFs", which is a safety > check we now do in TikaInputStream to identify parsers that read an EOF > > 1000 times, which may be a sign of an infinite loop. > When I turn off this safety check in TikaInputStream, I get an infinite loop. > This is one of the triggering files: > https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W > It's a truncated file from Common Crawl. 
> The stacktrace when this is thrown is: > {noformat} > afterRead:809, TikaInputStream (org.apache.tika.io) > read:82, ProxyInputStream (org.apache.commons.io.input) > :113, RandomAccessReadBuffer (org.apache.pdfbox.io) > loadPDF:454, Loader (org.apache.pdfbox) > loadPDF:430, Loader (org.apache.pdfbox) > getPDDocument:189, PDFParser (org.apache.tika.parser.pdf) > parse:148, PDFParser (org.apache.tika.parser.pdf) > parse:288, CompositeParser (org.apache.tika.parser) > parse:288, CompositeParser (org.apache.tika.parser) > parse:150, AutoDetectParser (org.apache.tika.parser) > parse:157, RecursiveParserWrapper (org.apache.tika.parser) > getRecursiveMetadata:379, TikaTest (org.apache.tika) > getRecursiveMetadata:369, TikaTest (org.apache.tika) > getRecursiveMetadata:357, TikaTest (org.apache.tika) > getRecursiveMetadata:351, TikaTest (org.apache.tika) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/PDFBOX-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317499#comment-17317499 ] Tim Allison commented on PDFBOX-5158: - Hmmm...will try to replicate with pure PDFBox. Thank you! > Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT > > > Key: PDFBOX-5158 > URL: https://issues.apache.org/jira/browse/PDFBOX-5158 > Project: PDFBox > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I found a bunch of files that had a "read too many EOFs", which is a safety > check we now do in TikaInputStream to identify parsers that read an EOF > > 1000 times, which may be a sign of an infinite loop. > When I turn off this safety check in TikaInputStream, I get an infinite loop. > This is one of the triggering files: > https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W > It's a truncated file from Common Crawl. > The stacktrace when this is thrown is: > {noformat} > afterRead:809, TikaInputStream (org.apache.tika.io) > read:82, ProxyInputStream (org.apache.commons.io.input) > :113, RandomAccessReadBuffer (org.apache.pdfbox.io) > loadPDF:454, Loader (org.apache.pdfbox) > loadPDF:430, Loader (org.apache.pdfbox) > getPDDocument:189, PDFParser (org.apache.tika.parser.pdf) > parse:148, PDFParser (org.apache.tika.parser.pdf) > parse:288, CompositeParser (org.apache.tika.parser) > parse:288, CompositeParser (org.apache.tika.parser) > parse:150, AutoDetectParser (org.apache.tika.parser) > parse:157, RecursiveParserWrapper (org.apache.tika.parser) > getRecursiveMetadata:379, TikaTest (org.apache.tika) > getRecursiveMetadata:369, TikaTest (org.apache.tika) > getRecursiveMetadata:357, TikaTest (org.apache.tika) > getRecursiveMetadata:351, TikaTest (org.apache.tika) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/PDFBOX-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5158: Description: I found a bunch of files that had a "read too many EOFs", which is a safety check we now do in TikaInputStream to identify parsers that read an EOF > 1000 times, which may be a sign of an infinite loop. When I turn off this safety check in TikaInputStream, I get an infinite loop. This is one of the triggering files: https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W It's a truncated file from Common Crawl. The stacktrace when this is thrown is: {noformat} afterRead:809, TikaInputStream (org.apache.tika.io) read:82, ProxyInputStream (org.apache.commons.io.input) :113, RandomAccessReadBuffer (org.apache.pdfbox.io) loadPDF:454, Loader (org.apache.pdfbox) loadPDF:430, Loader (org.apache.pdfbox) getPDDocument:189, PDFParser (org.apache.tika.parser.pdf) parse:148, PDFParser (org.apache.tika.parser.pdf) parse:288, CompositeParser (org.apache.tika.parser) parse:288, CompositeParser (org.apache.tika.parser) parse:150, AutoDetectParser (org.apache.tika.parser) parse:157, RecursiveParserWrapper (org.apache.tika.parser) getRecursiveMetadata:379, TikaTest (org.apache.tika) getRecursiveMetadata:369, TikaTest (org.apache.tika) getRecursiveMetadata:357, TikaTest (org.apache.tika) getRecursiveMetadata:351, TikaTest (org.apache.tika) {noformat} was: I found a bunch of files that had a "read too many EOFs", which is a safety check we now do in TikaInputStream to identify parsers that read an EOF > 1000 times, which may be a sign of an infinite loop. When I turn off this safety check in TikaInputStream, I get an infinite loop. This is one of the triggering files: https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W It's a truncated file from Common Crawl. 
The stacktrace when this is thrown is: {noformat} afterRead:809, TikaInputStream (org.apache.tika.io) read:82, ProxyInputStream (org.apache.commons.io.input) :113, RandomAccessReadBuffer (org.apache.pdfbox.io) loadPDF:454, Loader (org.apache.pdfbox) loadPDF:430, Loader (org.apache.pdfbox) getPDDocument:189, PDFParser (org.apache.tika.parser.pdf) parse:148, PDFParser (org.apache.tika.parser.pdf) parse:288, CompositeParser (org.apache.tika.parser) parse:288, CompositeParser (org.apache.tika.parser) parse:150, AutoDetectParser (org.apache.tika.parser) parse:157, RecursiveParserWrapper (org.apache.tika.parser) getRecursiveMetadata:379, TikaTest (org.apache.tika) getRecursiveMetadata:369, TikaTest (org.apache.tika) getRecursiveMetadata:357, TikaTest (org.apache.tika) getRecursiveMetadata:351, TikaTest (org.apache.tika) {noformat} The stack > Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT > > > Key: PDFBOX-5158 > URL: https://issues.apache.org/jira/browse/PDFBOX-5158 > Project: PDFBox > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I found a bunch of files that had a "read too many EOFs", which is a safety > check we now do in TikaInputStream to identify parsers that read an EOF > > 1000 times, which may be a sign of an infinite loop. > When I turn off this safety check in TikaInputStream, I get an infinite loop. > This is one of the triggering files: > https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W > It's a truncated file from Common Crawl. 
> The stacktrace when this is thrown is: > {noformat} > afterRead:809, TikaInputStream (org.apache.tika.io) > read:82, ProxyInputStream (org.apache.commons.io.input) > :113, RandomAccessReadBuffer (org.apache.pdfbox.io) > loadPDF:454, Loader (org.apache.pdfbox) > loadPDF:430, Loader (org.apache.pdfbox) > getPDDocument:189, PDFParser (org.apache.tika.parser.pdf) > parse:148, PDFParser (org.apache.tika.parser.pdf) > parse:288, CompositeParser (org.apache.tika.parser) > parse:288, CompositeParser (org.apache.tika.parser) > parse:150, AutoDetectParser (org.apache.tika.parser) > parse:157, RecursiveParserWrapper (org.apache.tika.parser) > getRecursiveMetadata:379, TikaTest (org.apache.tika) > getRecursiveMetadata:369, TikaTest (org.apache.tika) > getRecursiveMetadata:357, TikaTest (org.apache.tika) > getRecursiveMetadata:351, TikaTest (org.apache.tika) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Created] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
Tim Allison created PDFBOX-5158: --- Summary: Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT Key: PDFBOX-5158 URL: https://issues.apache.org/jira/browse/PDFBOX-5158 Project: PDFBox Issue Type: Task Reporter: Tim Allison
I found a bunch of files that had a "read too many EOFs", which is a safety check we now do in TikaInputStream to identify parsers that read an EOF > 1000 times, which may be a sign of an infinite loop. When I turn off this safety check in TikaInputStream, I get an infinite loop.
This is one of the triggering files: https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W It's a truncated file from Common Crawl.
The stacktrace when this is thrown is:
{noformat}
afterRead:809, TikaInputStream (org.apache.tika.io)
read:82, ProxyInputStream (org.apache.commons.io.input)
<init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
loadPDF:454, Loader (org.apache.pdfbox)
loadPDF:430, Loader (org.apache.pdfbox)
getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
parse:148, PDFParser (org.apache.tika.parser.pdf)
parse:288, CompositeParser (org.apache.tika.parser)
parse:288, CompositeParser (org.apache.tika.parser)
parse:150, AutoDetectParser (org.apache.tika.parser)
parse:157, RecursiveParserWrapper (org.apache.tika.parser)
getRecursiveMetadata:379, TikaTest (org.apache.tika)
getRecursiveMetadata:369, TikaTest (org.apache.tika)
getRecursiveMetadata:357, TikaTest (org.apache.tika)
getRecursiveMetadata:351, TikaTest (org.apache.tika)
{noformat}
The stack
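For readers who have not seen the "read too many EOFs" check mentioned in this issue, here is a minimal sketch of the idea using only the JDK. It is not the TikaInputStream code; the class name EofGuardInputStream, the helper readsUntilTripped, and the exception message are hypothetical and exist only for illustration. The guard counts how often a caller is handed -1 and gives up once a limit is exceeded, which turns a parser that keeps reading past the end of a truncated file into a visible exception.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; this is not the TikaInputStream implementation.
final class EofGuardInputStream extends FilterInputStream {
    private final int maxEofReads; // EOF results tolerated before giving up
    private int eofReads;          // EOF results handed to the caller so far

    EofGuardInputStream(InputStream in, int maxEofReads) {
        super(in);
        this.maxEofReads = maxEofReads;
    }

    @Override
    public int read() throws IOException {
        return check(super.read());
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return check(super.read(b, off, len));
    }

    // Counts each -1 result and fails once the caller has exceeded the limit.
    private int check(int result) throws IOException {
        if (result == -1 && ++eofReads > maxEofReads) {
            throw new IOException("read too many EOFs: " + eofReads);
        }
        return result;
    }

    // Simulates a buggy parser that never stops reading and returns how many
    // read() calls completed before the guard threw.
    static int readsUntilTripped(byte[] data, int maxEofReads) {
        int completed = 0;
        try (InputStream in = new EofGuardInputStream(new ByteArrayInputStream(data), maxEofReads)) {
            while (true) {
                in.read();
                completed++;
            }
        } catch (IOException e) {
            return completed;
        }
    }

    public static void main(String[] args) {
        // Three data bytes, then EOF is tolerated 1000 times before the guard trips.
        System.out.println(readsUntilTripped(new byte[] {1, 2, 3}, 1000));
    }
}
```

Wrapping the stream handed to a parser in a guard like this does not fix the loop, but it makes it easy to find the files that trigger one when running against a large corpus.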
[jira] [Created] (PDFBOX-5153) New flatefilter exception on Tika unit test files with 3.0.0-RC1
Tim Allison created PDFBOX-5153: --- Summary: New flatefilter exception on Tika unit test files with 3.0.0-RC1 Key: PDFBOX-5153 URL: https://issues.apache.org/jira/browse/PDFBOX-5153 Project: PDFBox Issue Type: Task Reporter: Tim Allison
On TIKA-3347, we're integrating PDFBox 3.0.0-RC1. We're getting new flate filter exceptions on a set of files that I _think_ I created with PDFBox a while ago. Looks like we're also getting xref exceptions. I would not be surprised in the least to learn that I did something wrong in the creation of these files and that they are corrupt!
I can replicate this issue with {{java -jar pdfbox-app-3.0.0-RC1.jar export:text}}
{noformat}
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Error extracting text for document [IOException]: java.util.zip.DataFormatException: invalid block type
{noformat}
One of the files: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_no_extract_yes_accessibility_owner_user.pdf
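The DataFormatException in the log above is raised by java.util.zip rather than by PDFBox itself. The sketch below, which uses only the JDK, shows one way a damaged deflate stream surfaces as that exception. The class FlateProbe and its methods are hypothetical names made up for this illustration, and the corruption (overwriting the first block header so that it declares the reserved block type) is an assumption chosen for the demo, not a diagnosis of the attached test files.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
final class FlateProbe {

    // Compresses a small buffer with the default zlib settings.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] out = new byte[raw.length + 64]; // ample for the small inputs used here
        int n = deflater.deflate(out);
        deflater.end();
        return Arrays.copyOf(out, n);
    }

    // Returns null if the data inflates cleanly, otherwise a short description of the failure.
    static String probe(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return "truncated stream";
                }
            }
            return null;
        } catch (DataFormatException e) {
            return String.valueOf(e.getMessage());
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] good = deflate("some page content".getBytes(StandardCharsets.UTF_8));
        byte[] bad = good.clone();
        bad[2] = 0x07; // first byte after the 2-byte zlib header: final block, reserved block type
        System.out.println("good: " + probe(good));
        System.out.println("bad:  " + probe(bad));
    }
}
```

A probe like this can help decide whether a stream is corrupt on disk or whether something earlier in the pipeline, for example decryption, handed the filter the wrong bytes.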
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303522#comment-17303522 ] Tim Allison commented on PDFBOX-5128: - The process hasn't finished, but I'm dumping the files here: [https://corpora.tika.apache.org/base/xmps/] I'm roughly binning them by the file type of the container file, including: [https://corpora.tika.apache.org/base/xmps/pdf/] Let me know if I can do any processing on these or if I botched the extraction. > Support parsing non standardized XMP > - > > Key: PDFBOX-5128 > URL: https://issues.apache.org/jira/browse/PDFBOX-5128 > Project: PDFBox > Issue Type: Task > Components: XmpBox >Reporter: Maruan Sahyoun >Assignee: Maruan Sahyoun >Priority: Major > Attachments: PDFBOX.zip, image-2021-03-17-09-00-57-653.png > > > XMP currently only supports parsing known XMP schema as has been discussed. > That shall be extended to support arbitrary but valid XMP. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303391#comment-17303391 ] Tim Allison edited comment on PDFBOX-5128 at 3/17/21, 1:01 PM: --- Side note...I'm looking at the EOFs for my xmp byte scanner, and I notice that Oracle Outside In (at least back in 2011) didn't include a closing packet – PDFBOX-1192 !image-2021-03-17-09-00-57-653.png! -- was (Author: talli...@mitre.org): Side note...I'm looking at the EOFs for my xmp byte scanner, and I notice that Oracle Outsid !image-2021-03-17-09-00-57-653.png! e In (at least back in 2011) didn't include a closing packet – PDFBOX-1192 > Support parsing non standardized XMP > - > > Key: PDFBOX-5128 > URL: https://issues.apache.org/jira/browse/PDFBOX-5128 > Project: PDFBox > Issue Type: Task > Components: XmpBox >Reporter: Maruan Sahyoun >Assignee: Maruan Sahyoun >Priority: Major > Attachments: PDFBOX.zip, image-2021-03-17-09-00-57-653.png > > > XMP currently only supports parsing known XMP schema as has been discussed. > That shall be extended to support arbitrary but valid XMP. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
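The XMP byte scanner mentioned in this comment is not shown anywhere in the thread, so the following is only a guess at the general approach, written against the JDK alone. The class XmpPacketScanner and its methods are hypothetical. It looks for the packet header, then for the trailer, and falls back to the closing x:xmpmeta element when the trailer is missing, as in the Oracle Outside In case described above. It assumes an ASCII-compatible encoding and the conventional x: prefix, and it does not try to handle several packets in one file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; the scanner referred to in the comment is not available here.
final class XmpPacketScanner {
    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PI_CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] XMPMETA_CLOSE = "</x:xmpmeta>".getBytes(StandardCharsets.US_ASCII);

    // Naive byte search; returns the index of the first match at or after 'from', or -1.
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Returns the bytes of the first packet, or null if no usable packet is found.
    static byte[] extractFirstPacket(byte[] data) {
        int start = indexOf(data, BEGIN, 0);
        if (start < 0) {
            return null;
        }
        int trailer = indexOf(data, END, start);
        if (trailer >= 0) {
            int close = indexOf(data, PI_CLOSE, trailer);
            if (close >= 0) {
                return Arrays.copyOfRange(data, start, close + PI_CLOSE.length);
            }
        }
        // No closing packet: cut at the end of the xmpmeta element instead.
        int meta = indexOf(data, XMPMETA_CLOSE, start);
        return meta < 0 ? null : Arrays.copyOfRange(data, start, meta + XMPMETA_CLOSE.length);
    }

    public static void main(String[] args) {
        String noTrailer = "junk<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
                + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"></x:xmpmeta>junk";
        byte[] packet = extractFirstPacket(noTrailer.getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(packet, StandardCharsets.US_ASCII));
    }
}
```

Whether a fallback like this is wanted depends on the consumer: a strict scanner would report such files instead of extracting them.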
[jira] [Updated] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5128: Attachment: image-2021-03-17-09-00-57-653.png > Support parsing non standardized XMP > - > > Key: PDFBOX-5128 > URL: https://issues.apache.org/jira/browse/PDFBOX-5128 > Project: PDFBox > Issue Type: Task > Components: XmpBox >Reporter: Maruan Sahyoun >Assignee: Maruan Sahyoun >Priority: Major > Attachments: PDFBOX.zip, image-2021-03-17-09-00-57-653.png > > > XMP currently only supports parsing known XMP schema as has been discussed. > That shall be extended to support arbitrary but valid XMP.
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303391#comment-17303391 ] Tim Allison commented on PDFBOX-5128: - Side note...I'm looking at the EOFs for my xmp byte scanner, and I notice that Oracle Outside In (at least back in 2011) didn't include a closing packet – PDFBOX-1192 !image-2021-03-17-09-00-57-653.png! > Support parsing non standardized XMP > - > > Key: PDFBOX-5128 > URL: https://issues.apache.org/jira/browse/PDFBOX-5128 > Project: PDFBox > Issue Type: Task > Components: XmpBox >Reporter: Maruan Sahyoun >Assignee: Maruan Sahyoun >Priority: Major > Attachments: PDFBOX.zip, image-2021-03-17-09-00-57-653.png > > > XMP currently only supports parsing known XMP schema as has been discussed. > That shall be extended to support arbitrary but valid XMP.
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302946#comment-17302946 ] Tim Allison commented on PDFBOX-5128: - [~msahyoun] ... does the attached look about right? If so, I'll run against our full corpus and mirror the directory structure. > Support parsing non standardized XMP > - > > Key: PDFBOX-5128 > URL: https://issues.apache.org/jira/browse/PDFBOX-5128 > Project: PDFBox > Issue Type: Task > Components: XmpBox >Reporter: Maruan Sahyoun >Assignee: Maruan Sahyoun >Priority: Major > Attachments: PDFBOX.zip > > > XMP currently only supports parsing known XMP schema as has been discussed. > That shall be extended to support arbitrary but valid XMP.
[jira] [Updated] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5128: Attachment: PDFBOX.zip > Support parsing non standardized XMP > - > > Key: PDFBOX-5128 > URL: https://issues.apache.org/jira/browse/PDFBOX-5128 > Project: PDFBox > Issue Type: Task > Components: XmpBox >Reporter: Maruan Sahyoun >Assignee: Maruan Sahyoun >Priority: Major > Attachments: PDFBOX.zip > > > XMP currently only supports parsing known XMP schema as has been discussed. > That shall be extended to support arbitrary but valid XMP.
[jira] [Commented] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu
[ https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302784#comment-17302784 ] Tim Allison commented on PDFBOX-5133: - +1 that's how I got the rest of the build to work on Ubuntu. Thank you! > Failing testFlattenPDFBox2469Filled on Ubuntu > -- > > Key: PDFBOX-5133 > URL: https://issues.apache.org/jira/browse/PDFBOX-5133 > Project: PDFBox > Issue Type: Task > Components: AcroForm >Affects Versions: 2.0.22 >Reporter: Tim Allison >Assignee: Tilman Hausherr >Priority: Trivial > Fix For: 2.0.24, 3.0.0 PDFBox > > Attachments: in-testPDF_acroForm.pdf-7.png, out-testPDF_acroForm.pdf, > out-testPDF_acroForm.pdf-7.png, out-testPDF_acroForm.pdf-7.png-diff.png > > > I tried to build the 2.0.23 candidate, but I got a test failure on the above > test. This isn't worth respinning another candidate, but how can I help fix > this? > > {noformat} > Distributor ID: Ubuntu > Description: Ubuntu 20.04.2 LTS > Release: 20.04 > Codename: focal > {noformat} > > {noformat} > openjdk version "1.8.0_282" > OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_282-b08) > OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.282-b08, mixed mode) > {noformat} > > {noformat} > Files differ: > /home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/in/testPDF_acroForm.pdf-7.png > > /home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/out/testPDF_acroForm.pdf-7.png > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu
[ https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302700#comment-17302700 ] Tim Allison commented on PDFBOX-5133: - [~msahyoun] failed the build on Ubuntu. I had no problems with openjdk 11 on my Mac. openjdk version "11.0.4" 2019-07-16 OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.4+11) OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.4+11, mixed mode) > Failing testFlattenPDFBox2469Filled on Ubuntu > -- > > Key: PDFBOX-5133 > URL: https://issues.apache.org/jira/browse/PDFBOX-5133 > Project: PDFBox > Issue Type: Task > Components: AcroForm >Reporter: Tim Allison >Priority: Trivial > Attachments: in-testPDF_acroForm.pdf-7.png, out-testPDF_acroForm.pdf, > out-testPDF_acroForm.pdf-7.png, out-testPDF_acroForm.pdf-7.png-diff.png > > > I tried to build the 2.0.23 candidate, but I got a test failure on the above > test. This isn't worth respinning another candidate, but how can I help fix > this? > > {noformat} > Distributor ID: Ubuntu > Description: Ubuntu 20.04.2 LTS > Release: 20.04 > Codename: focal > {noformat} > > {noformat} > openjdk version "1.8.0_282" > OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_282-b08) > OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.282-b08, mixed mode) > {noformat} > > {noformat} > Files differ: > /home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/in/testPDF_acroForm.pdf-7.png > > /home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/out/testPDF_acroForm.pdf-7.png > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu
[ https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5133: Attachment: out-testPDF_acroForm.pdf-7.png-diff.png out-testPDF_acroForm.pdf-7.png > Failing testFlattenPDFBox2469Filled on Ubuntu > -- > > Key: PDFBOX-5133 > URL: https://issues.apache.org/jira/browse/PDFBOX-5133 > Project: PDFBox > Issue Type: Task > Components: AcroForm >Reporter: Tim Allison >Priority: Trivial > Attachments: in-testPDF_acroForm.pdf-7.png, > out-testPDF_acroForm.pdf-7.png, out-testPDF_acroForm.pdf-7.png-diff.png > > > I tried to build the 2.0.23 candidate, but I got a test failure on the above > test. This isn't worth respinning another candidate, but how can I help fix > this? > > {noformat} > Distributor ID: Ubuntu > Description: Ubuntu 20.04.2 LTS > Release: 20.04 > Codename: focal > {noformat} > > {noformat} > openjdk version "1.8.0_282" > OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_282-b08) > OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.282-b08, mixed mode) > {noformat} > > {noformat} > Files differ: > /home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/in/testPDF_acroForm.pdf-7.png > > /home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/out/testPDF_acroForm.pdf-7.png > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu
[ https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5133: Attachment: out-testPDF_acroForm.pdf
[jira] [Commented] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu
[ https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302596#comment-17302596 ] Tim Allison commented on PDFBOX-5133: - I _think_ I attached the right files to help with diagnosis. Please let me know if there's anything else I can do. Thank you!
[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu
[ https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5133: Attachment: in-testPDF_acroForm.pdf-7.png
[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu
[ https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5133: Attachment: (was: image-2021-03-16-10-57-14-639.png)
[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu
[ https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5133: Attachment: (was: image-2021-03-16-10-57-14-489.png)
[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu
[ https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5133: Attachment: (was: testPDF_acroForm.pdf-7.png)
[jira] [Created] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu
Tim Allison created PDFBOX-5133:
---
Summary: Failing testFlattenPDFBox2469Filled on Ubuntu
Key: PDFBOX-5133
URL: https://issues.apache.org/jira/browse/PDFBOX-5133
Project: PDFBox
Issue Type: Task
Components: AcroForm
Reporter: Tim Allison

I tried to build the 2.0.23 candidate, but I got a test failure on the above test. This isn't worth respinning another candidate, but how can I help fix this?

{noformat}
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal
{noformat}

{noformat}
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_282-b08)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.282-b08, mixed mode)
{noformat}

{noformat}
Files differ:
/home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/in/testPDF_acroForm.pdf-7.png
/home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/out/testPDF_acroForm.pdf-7.png
{noformat}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter
[ https://issues.apache.org/jira/browse/PDFBOX-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300589#comment-17300589 ] Tim Allison commented on PDFBOX-5127: - My personal pref would be to generate SimpleDateFormat objects as needed. The good news either way (maybe?) is that this is in an exception handling bit, and I don't think I've seen it before so it should be pretty rare??? > Multithreading issue in JempBox's DateConverter > --- > > Key: PDFBOX-5127 > URL: https://issues.apache.org/jira/browse/PDFBOX-5127 > Project: PDFBox > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > > [~tilman] recently found an exception thrown from here > ([https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L186)] > in one run of tika-eval but not in another. > > This is a multithreading issue caused by > [https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L43] > SimpleDateFormat is not threadsafe. I'm surprised we haven't seen this > earlier, but so it goes. > > Many, many thanks to Tilman for finding this! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
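The "generate SimpleDateFormat objects as needed" preference from the comment above could look roughly like the following. This is a minimal sketch, not the JempBox code: the class name `PerCallDateParser`, the two patterns, and the choice of UTC are assumptions made for illustration only. The point it shows is that only immutable pattern strings are shared between threads, while each call builds its own `SimpleDateFormat`.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper for illustration; this is not JempBox's DateConverter. */
final class PerCallDateParser {
    // Only immutable pattern strings are shared across threads (assumed patterns).
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd",
    };

    private PerCallDateParser() {
    }

    /** Tries each pattern in order with a freshly created formatter. */
    static Date parse(String text) throws ParseException {
        for (String pattern : PATTERNS) {
            // A new SimpleDateFormat per call: no mutable formatter state is shared.
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            format.setLenient(false);
            try {
                return format.parse(text);
            } catch (ParseException e) {
                // This pattern did not match; fall through to the next one.
            }
        }
        throw new ParseException("Unparseable date: " + text, 0);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parse("2021-03-12T10:15:30"));
    }
}
```

Creating a formatter per call costs some allocation, which may be acceptable if, as the comment suggests, this path is only reached during exception handling. A `ThreadLocal<SimpleDateFormat>` or `java.time.format.DateTimeFormatter` (immutable and thread-safe, Java 8+) would be alternatives where the Java baseline allows.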
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300365#comment-17300365 ] Tim Allison commented on PDFBOX-5128: - I’ll scrape xmp out of our regression corpus. I should retain the packet envelope? > Support parsing non standardized XMP > - > > Key: PDFBOX-5128 > URL: https://issues.apache.org/jira/browse/PDFBOX-5128 > Project: PDFBox > Issue Type: Task > Components: XmpBox >Reporter: Maruan Sahyoun >Assignee: Maruan Sahyoun >Priority: Major > > XMP currently only supports parsing known XMP schema as has been discussed. > That shall be extended to support arbitrary but valid XMP. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Created] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter
Tim Allison created PDFBOX-5127: --- Summary: Multithreading issue in JempBox's DateConverter Key: PDFBOX-5127 URL: https://issues.apache.org/jira/browse/PDFBOX-5127 Project: PDFBox Issue Type: Bug Reporter: Tim Allison [~tilman] recently found an exception thrown from here ([https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L186)] in one run of tika-eval but not in another. This is a multithreading issue caused by [https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L43] SimpleDateFormat is not threadsafe. I'm surprised we haven't seen this earlier, but so it goes. Many, many thanks to Tilman for finding this! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-3953) StackOverflowError in org.apache.pdfbox.pdmodel.PDPageTree.getKids
[ https://issues.apache.org/jira/browse/PDFBOX-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226417#comment-17226417 ] Tim Allison commented on PDFBOX-3953: - Related? > StackOverflowError in org.apache.pdfbox.pdmodel.PDPageTree.getKids > -- > > Key: PDFBOX-3953 > URL: https://issues.apache.org/jira/browse/PDFBOX-3953 > Project: PDFBox > Issue Type: Bug > Components: PDModel >Affects Versions: 2.0.7 >Reporter: Jorge Spinsanti >Priority: Major > > I got an StackOverflowError in > org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:135) > {code} > java.lang.StackOverflowError > at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:135) > at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:38) > at > org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:166) > at > org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) > at > org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) > at > org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) > at > org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) > at > org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) > at > org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) > at > org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) > at > org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) > at > org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) > at > org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Created] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow
Tim Allison created PDFBOX-5009: --- Summary: Corrupt PDF can lead to a StackOverflow Key: PDFBOX-5009 URL: https://issues.apache.org/jira/browse/PDFBOX-5009 Project: PDFBox Issue Type: Task Reporter: Tim Allison See TIKA-3224. I confirmed this with 2.0.21 by calling the app's ExtractText on the file posted on the Tika issue. cc [~dadoonet] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-4623) COSParser: Infinite recursion
[ https://issues.apache.org/jira/browse/PDFBOX-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037202#comment-17037202 ] Tim Allison commented on PDFBOX-4623: - Adding a page tree infinite loop. > COSParser: Infinite recursion > - > > Key: PDFBOX-4623 > URL: https://issues.apache.org/jira/browse/PDFBOX-4623 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 2.0.16 > Environment: java version "12" 2019-03-19 > Java(TM) SE Runtime Environment (build 12+33) > Java HotSpot(TM) 64-Bit Server VM (build 12+33, mixed mode, sharing) > MacOS Mojave >Reporter: Alex Rebert >Priority: Minor > Attachments: infinite-recursion.pdf, loop_in_page_tree.pdf > > > Parsing an invalid PDF can lead to an infinite recursion in COSParser, which > results in a StackOverflowError. > *Steps to repro* > # Download malformed PDF (attached) > # {{Run: java -jar pdfbox-app-2.0.16.jar ExtractText infinite-recursion.pdf}} > *Stacktrace* > {noformat} > Exception in thread "main" java.lang.StackOverflowError [1005/1916] > at java.base/sun.nio.cs.UTF_8.updatePositions(UTF_8.java:79) > at java.base/sun.nio.cs.UTF_8$Decoder.xflow(UTF_8.java:210) > at java.base/sun.nio.cs.UTF_8$Decoder.decodeArrayLoop(UTF_8.java:321) > at java.base/sun.nio.cs.UTF_8$Decoder.decodeLoop(UTF_8.java:414) > at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:578) > at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:801) > at org.apache.pdfbox.pdfparser.BaseParser.isValidUTF8(BaseParser.java:787) > at org.apache.pdfbox.pdfparser.BaseParser.parseCOSName(BaseParser.java:768) > at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:887) > at > org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154) > at > org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:283) > at > 
org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:216) > at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:867) > at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:912) > at > org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881) > at > org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801) > at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055) > at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114) > at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920) > at > org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881) > at > org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801) > at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055) > at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114) > at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920) > at > org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881) > at > org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801) > at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055) > at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114) > at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920) > at > org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881) > at > org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801) > at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055) > at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114) > ... > {noformat} > The file was generated by fuzzing and is (probably) not a valid PDF file. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
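Both the COSParser recursion above and the page tree loop attached in the comment come down to following object references that point back at something already being processed. A common guard, sketched below under stated assumptions, is to walk the structure iteratively and remember visited nodes by identity. `TreeNode` and `CycleSafeWalker` are hypothetical stand-ins, not PDFBox classes, and this is not a description of how PDFBox fixed either issue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a page tree node; not a PDFBox class. */
final class TreeNode {
    final String name;
    final List<TreeNode> kids = new ArrayList<>();

    TreeNode(String name) {
        this.name = name;
    }
}

/** Hypothetical walker that tolerates cycles in a malformed tree. */
final class CycleSafeWalker {
    private CycleSafeWalker() {
    }

    /** Returns leaf names in document order, visiting every node at most once. */
    static List<String> collectLeaves(TreeNode root) {
        List<String> leaves = new ArrayList<>();
        // Identity-based set: two distinct nodes are never confused, even if equal.
        Set<TreeNode> visited =
                Collections.newSetFromMap(new IdentityHashMap<TreeNode, Boolean>());
        // Explicit stack instead of recursion, so depth cannot overflow the call stack.
        Deque<TreeNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            TreeNode node = stack.pop();
            if (!visited.add(node)) {
                continue; // already seen: this reference closes a cycle, skip it
            }
            if (node.kids.isEmpty()) {
                leaves.add(node.name);
            } else {
                // Push in reverse so the first kid is processed first.
                for (int i = node.kids.size() - 1; i >= 0; i--) {
                    stack.push(node.kids.get(i));
                }
            }
        }
        return leaves;
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode("root");
        TreeNode a = new TreeNode("a");
        TreeNode b = new TreeNode("b");
        root.kids.add(a);
        root.kids.add(b);
        b.kids.add(root); // malformed: a kid pointing back at its ancestor
        System.out.println(collectLeaves(root));
    }
}
```

The visited set bounds the work by the number of distinct nodes, and the explicit stack means even a very deep but acyclic tree cannot trigger a StackOverflowError.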
[jira] [Comment Edited] (PDFBOX-4623) COSParser: Infinite recursion
[ https://issues.apache.org/jira/browse/PDFBOX-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037202#comment-17037202 ] Tim Allison edited comment on PDFBOX-4623 at 2/14/20 6:51 PM: -- Adding a page tree stackoverflow. was (Author: talli...@mitre.org): Adding a page tree infinite loop.
[jira] [Updated] (PDFBOX-4623) COSParser: Infinite recursion
[ https://issues.apache.org/jira/browse/PDFBOX-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-4623: Attachment: loop_in_page_tree.pdf
[jira] [Commented] (PDFBOX-4768) Unable to extract text from PDF
[ https://issues.apache.org/jira/browse/PDFBOX-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032556#comment-17032556 ] Tim Allison commented on PDFBOX-4768: - To complement Tilman's points...qpdf complains about this file:

{noformat}
WARNING: kst-31430-3-b3_unextractable.pdf: file is damaged
WARNING: kst-31430-3-b3_unextractable.pdf (offset 638658): xref not found
WARNING: kst-31430-3-b3_unextractable.pdf: Attempting to reconstruct cross-reference table
WARNING: kst-31430-3-b3_unextractable.pdf (object 123 0, offset 214900): expected endstream
WARNING: kst-31430-3-b3_unextractable.pdf (object 123 0, offset 211564): attempting to recover stream length
WARNING: kst-31430-3-b3_unextractable.pdf (object 123 0, offset 211564): recovered stream length: 13564
qpdf: operation succeeded with warnings; resulting file may have some problems
{noformat}

Tika's exception is:

{noformat}
Caused by: java.io.IOException: Unknown dir object c=')' cInt=41 peek=')' peekInt=41 at offset 8689
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:966)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:636)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:513)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:480)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:153)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:867)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
{noformat}

> Unable to extract text from PDF
> ---
>
> Key: PDFBOX-4768
> URL: https://issues.apache.org/jira/browse/PDFBOX-4768
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.18
> Reporter: Jan Vlug
> Priority: Major
> Attachments: kst-31430-3-b3_unextractable.pdf
>
> I have a PDF document (see attachment) that can be viewed in Evince, but tika text extraction does not work. I think that this is due to a crash in pdfbox.
> I'm also a bit puzzled by the message: "You do not have permission to extract text".
> Here the output of the ExtractText command:
> {{java -jar pdfbox-app-2.0.19-20200206.060243-86.jar ExtractText kst-31430-3-b3_unextractable.pdf tekst_jan.txt}}
> {{Feb 07, 2020 11:03:15 AM org.apache.pdfbox.pdfparser.COSParser validateStreamLength}}
> {{WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 211564, length: 3336, expected end position: 214900}}
> {{Feb 07, 2020 11:03:15 AM org.apache.pdfbox.pdfparser.COSParser parseCOSStream}}
> {{WARNING: stream ends with 'endobj' instead of 'endstream' at offset 225134}}
> {{Exception in thread "main" java.io.IOException: You do not have permission to extract text}}
> {{ at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:223)}}
> {{ at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)}}
> {{ at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)}}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-4737) Text extraction is gibberish
[ https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016209#comment-17016209 ]

Tim Allison commented on PDFBOX-4737:
-------------------------------------

The following reinforces points already made, I think.

>On the other hand of course a proper implementation of a strict mode will
>require quite a lot of work

+1

> and a half-hearted implementation is worthless.

Indications of specific types of wonkiness – e.g. missing fonts, missing Unicode mappings, missing/invalid xref, many other features – would be useful to some downstream processors, and if we did a "group by" on "producer/creator tool" for a given corpus like CommonCrawl, we might be able to shame software companies and projects into fixing specific issues. We could add these incrementally... and I see some benefit from even partial information (missing Unicode mappings).

As I and others point out, though, text can always be hosed, and there is no perfect "junk detector". You can try to use tika-eval's out of vocabulary statistic as an indicator that the text is not "languagey", but it will incorrectly categorize parts lists, ISBNs, duck phyla as "bad." More advanced machine learning (e.g. neural nets) may do a better job, but they will still be wrong some of the time. There's a reason Google is running OCR on at least some PDFs. :P

So, from an OS community perspective, I see two avenues of work:
# improving reporting of "nonstandard" features of the PDF – or helping developers understand what types of "nonstandard" features can currently be detected with PDFBox
# working together to improve a junk detector... a la Tika's

> Text extraction is gibberish
> ----------------------------
>
> Key: PDFBOX-4737
> URL: https://issues.apache.org/jira/browse/PDFBOX-4737
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.18
> Reporter: Jorge Spinsanti
> Priority: Major
> Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As it was discussed on https://issues.apache.org/jira/browse/PDFBOX-4549
> there are many PDFs where the text extraction is gibberish.
> Perhaps you can add two modes (strict/lax) to text extraction to avoid
> gibberish if not useful. Add a file to analyze the problem.
> [^noUnicodeMapping.pdf]
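The out-of-vocabulary idea mentioned in the comment above can be sketched roughly as follows. This is a minimal illustration only and is not tika-eval's implementation: the class name OovSketch, the toy word list VOCAB, and the method oovRatio are all hypothetical, and a usable detector would need real per-language word lists and a tuned threshold.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class OovSketch {

    // Hypothetical toy vocabulary; a real detector would load per-language
    // common-word lists, which this sketch does not attempt.
    public static final Set<String> VOCAB = new HashSet<>(Arrays.asList(
            "the", "of", "and", "to", "in", "is", "text", "from", "this", "pdf"));

    /** Fraction of alphabetic tokens that are not in the vocabulary (0.0 if there are none). */
    public static double oovRatio(String text, Set<String> vocab) {
        int total = 0;
        int oov = 0;
        // Split on anything that is not a letter, so digits and punctuation are ignored.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (!vocab.contains(token)) {
                oov++;
            }
        }
        return total == 0 ? 0.0 : (double) oov / total;
    }

    public static void main(String[] args) {
        System.out.println(oovRatio("This is the text from the PDF", VOCAB)); // 0.0
        System.out.println(oovRatio("Xq zvv kjh wqp", VOCAB));                // 1.0
    }
}
```

A high ratio only suggests that the extracted text is not "languagey". As the comment notes, legitimate content such as parts lists or ISBNs would score badly too, so the number is a hint and not a verdict.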
[jira] [Commented] (PDFBOX-4549) No Unicode mapping
[ https://issues.apache.org/jira/browse/PDFBOX-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010747#comment-17010747 ]

Tim Allison commented on PDFBOX-4549:
-------------------------------------

And then there's this gem on content masking attacks: [https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/markwood]. Many thanks to Peter Wyatt for bringing Markwood et al.'s work to my attention.

> No Unicode mapping
> ------------------
>
> Key: PDFBOX-4549
> URL: https://issues.apache.org/jira/browse/PDFBOX-4549
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.15
> Reporter: Sergey Makarov
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.16, 3.0.0 PDFBox
>
> Attachments: XO_Thames.zip, our_star_wars.pdf
>
>
> Hello, if I try to get text from the attached PDF, I get empty output and
> many warnings. The font is attached as well.
> Acrobat Reader opens the file successfully; I can select and copy text and save as text.
> My code:
> {code:java}
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.io.MemoryUsageSetting;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> private static void parseOne(String path) throws IOException {
>     File file = new File(path);
>     PDFTextStripper tStripper = new PDFTextStripper();
>     // Mixed main-memory/temp-file buffering; values as in the original report
>     MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMixed(0, 5)
>             .setTempDir(new File("/home/user/pdfBoxTest/newFiles/"));
>     // try-with-resources closes the document even if extraction throws
>     try (PDDocument document = PDDocument.load(file, memUsageSetting)) {
>         if (!document.isEncrypted()) {
>             String pdfFileInText = tStripper.getText(document);
>             System.out.print(pdfFileInText);
>         }
>     }
> }{code}
> Error:
> {code:java}
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont
> WARNING: Invalid ToUnicode CMap in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+97 (97) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+114 (114) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+87 (87) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+115 (115) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont
> WARNING: Invalid ToUnicode CMap in font HPDFAB+DejaVuSansMono,Book
> {code}
[jira] [Commented] (PDFBOX-4549) No Unicode mapping
[ https://issues.apache.org/jira/browse/PDFBOX-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010741#comment-17010741 ]

Tim Allison commented on PDFBOX-4549:
-------------------------------------

These are good points [~mkl]. See e.g. [http://www.vintasoft.com/forums/viewtopic.php?t=2320] for willful/intentional obfuscation of text.

Note that Google is running OCR on at least some PDFs. See slides 50-51: [https://github.com/tballison/share/blob/master/slides/activate19/Activate2019_tika_tallison_20190911.pptx]

And even OCR can be gamed: [https://arxiv.org/abs/1802.05385] :(
[jira] [Commented] (PDFBOX-4549) No Unicode mapping
[ https://issues.apache.org/jira/browse/PDFBOX-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009133#comment-17009133 ]

Tim Allison commented on PDFBOX-4549:
-------------------------------------

Perhaps tika-eval's out of vocabulary statistic? Or implement your own from: [https://dl.acm.org/doi/10.1145/1600193.1600237]
[jira] [Commented] (PDFBOX-4715) Need to add release version for maven-compiler-plugin
[ https://issues.apache.org/jira/browse/PDFBOX-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1763#comment-1763 ]

Tim Allison commented on PDFBOX-4715:
-------------------------------------

{noformat}
[ERROR] error: release version 6 not supported
{noformat}
I'm not seeing that in mine with Maven 3.6.3. That's the useful info I was hoping for!

> Need to add release version for maven-compiler-plugin
> -----------------------------------------------------
>
> Key: PDFBOX-4715
> URL: https://issues.apache.org/jira/browse/PDFBOX-4715
> Project: PDFBox
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Blocker
> Fix For: 2.0.18
>
>
> If I build PDFBox with a Java version newer than 8, but then try to run it
> via Tika with Java 8, I get:
> {noformat}
> java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.font.PDType1Font
> BatchProcess: at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> BatchProcess: at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> BatchProcess: at org.apache.tika.batch.BatchProcess.mainLoop(BatchProcess.java:206)
> BatchProcess: at org.apache.tika.batch.BatchProcess.call(BatchProcess.java:166)
> BatchProcess: at org.apache.tika.batch.BatchProcess.call(BatchProcess.java:52)
> BatchProcess: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> BatchProcess: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> BatchProcess: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> BatchProcess: at java.lang.Thread.run(Thread.java:748)
> BatchProcess:Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.font.PDType1Font
> BatchProcess: at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
> BatchProcess: at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> BatchProcess: at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
> BatchProcess: at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:875)
> BatchProcess: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:509)
> BatchProcess: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
> BatchProcess: at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
> BatchProcess: at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> BatchProcess: at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> BatchProcess: at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
> BatchProcess: at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:867)
> {noformat}
> and
> {noformat}
> java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
> BatchProcess: at org.apache.fontbox.type1.Type1Lexer.readToken(Type1Lexer.java:184)
> BatchProcess: at org.apache.fontbox.type1.Type1Lexer.<init>(Type1Lexer.java:64)
> BatchProcess: at org.apache.fontbox.type1.Type1Parser.parseASCII(Type1Parser.java:86)
> BatchProcess: at org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:61)
> BatchProcess: at org.apache.fontbox.type1.Type1Font.createWithPFB(Type1Font.java:56)
> BatchProcess: at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getType1Font(FileSystemFontProvider.java:259)
> BatchProcess: at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:131)
> {noformat}
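For context on the NoSuchMethodError above: since Java 9, ByteBuffer overrides position(int) with a ByteBuffer return type, so code compiled against a newer JDK class library without the compiler's release option ends up referencing a method descriptor that Java 8 does not have. Setting the release version in maven-compiler-plugin, as the issue title asks, addresses this at build time. A source-level workaround that is sometimes used instead is sketched below. It is an illustration only and is not necessarily what was done in fontbox; the class name BufferPositionSketch and the method seek are hypothetical.

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;

public class BufferPositionSketch {

    /** Moves the buffer position through the Buffer supertype and returns the new position. */
    public static int seek(ByteBuffer buffer, int newPosition) {
        // Casting to Buffer makes javac emit a call to Buffer.position(int),
        // which has the same descriptor on Java 8 and on later versions.
        ((Buffer) buffer).position(newPosition);
        return buffer.position();
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        System.out.println(seek(buffer, 4)); // 4
    }
}
```

The cast avoids the linkage error regardless of which JDK compiled the class, but it has to be applied at every affected call site, which is why a build-level release setting is usually the more robust choice.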