[COMPRESS and Tika/PDFBox/POI] files from bug trackers
All,

I recently downloaded attachments from the following bug trackers: COMPRESS, TIKA, PDFBox, POI, Open Office, Libre Office and ghostscript:

http://162.242.228.174/docs/bugtrackers/

I then unpacked/uncompressed all of the package/compressed files, so:

COMPRESS-115-1.zip is the second file attached to COMPRESS-115
COMPRESS-115-1.zip-0.txt is the first text file in that zip file.

I just kicked off Tika against the files to see if anything interesting turns up.

Let me know if this is of any interest to you and/or if there are other bug trackers we should add.

Cheers,

Tim
[jira] [Commented] (PDFBOX-4774) Add AWS Lambda support to FontFileFinder
[ https://issues.apache.org/jira/browse/PDFBOX-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037274#comment-17037274 ]

Ben Manes commented on PDFBOX-4774:
-----------------------------------

The final iteration of this workaround was to create a Lambda layer, which is the zip containing the fonts. This way they do not need to be packaged directly into the task itself. I also use a prebuilt cache file for a faster cold start, which has to reside in a writable path.

{code:java}
// Discover the extra fonts by hacking the search of $HOME/.fonts (PDFBOX-4774)
System.setProperty("user.home", "/opt/pdfbox");
try {
  // Use a prebuilt cache manifest
  System.setProperty("pdfbox.fontcache", System.getProperty("java.io.tmpdir"));
  // tempDir() and logger are defined elsewhere and are not shown in this snippet
  Files.copy(Path.of("/opt/pdfbox/.pdfbox.cache"), tempDir().resolve(".pdfbox.cache"));
} catch (IOException e) {
  logger.error("Failed to copy prebuilt font cache", e);
}
{code}

> Add AWS Lambda support to FontFileFinder
> ----------------------------------------
>
> Key: PDFBOX-4774
> URL: https://issues.apache.org/jira/browse/PDFBOX-4774
> Project: PDFBox
> Issue Type: Improvement
> Components: FontBox
> Affects Versions: 2.0.18
> Reporter: Ben Manes
> Priority: Major
> Attachments: fixed_page.jpg, original.pdf, rendered_page.jpg
>
>
> The font directory finder is hard coded based on the operating system and is
> not directly extensible. Instead, if I understand correctly, the fonts have
> to be explicitly declared in a {{PDFBox_External_Fonts.properties}} file.
> AWS Lambda includes only minimal fonts in its Linux distribution. For some
> documents this is too limiting, so on our EC2 instances we install
> {{msttcorefonts}}, {{ttf-aenigma}}, and {{fonts-tuffy}}. These go into
> {{/usr/share/fonts}} which the {{UnixFontDirFinder}} inspects.
> AWS Lambda unzips the distribution into {{/var/task}}, unzips layers
> into {{/opt}}, and otherwise only allows tasks to write to {{/tmp}}. The common
> recommendation for fonts is to include them in the lambda, reference them at
> {{/var/task/fonts}}, and set {{FONTCONFIG_PATH}} to that path for headless
> tasks like Chrome's print-to-pdf.
> Since PDFBox does not use font-config this solution does not work. Ideally it
> would be nice if one could specify a custom {{FontDirFinder}} or add custom
> directories (such as by a system property if not API). Alternatively, PDFBox
> could include reasonable default locations if the environment variable
> {{LAMBDA_TASK_ROOT}} is set.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
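The last suggestion in the issue (default font locations when {{LAMBDA_TASK_ROOT}} is set) might look roughly like the sketch below. This is not PDFBox code: the class name LambdaFontDirs and the method candidateDirs are hypothetical, and /opt/fonts is only an assumed layout for fonts shipped in a layer. The fonts directory under the task root, /opt and LAMBDA_TASK_ROOT are taken from the issue text.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: proposes extra font directories
// when the process appears to be running inside AWS Lambda.
final class LambdaFontDirs {
    private LambdaFontDirs() {}

    // The environment is passed in (instead of calling System.getenv() here)
    // so the decision logic can be exercised without a Lambda runtime.
    static List<Path> candidateDirs(Map<String, String> env) {
        List<Path> dirs = new ArrayList<>();
        String taskRoot = env.get("LAMBDA_TASK_ROOT");
        if (taskRoot == null || taskRoot.isEmpty()) {
            return dirs; // not on Lambda: contribute nothing
        }
        dirs.add(Path.of(taskRoot, "fonts")); // fonts bundled with the function, e.g. /var/task/fonts
        dirs.add(Path.of("/opt", "fonts"));   // assumed location for fonts delivered by a layer
        return dirs;
    }

    public static void main(String[] args) {
        System.out.println(candidateDirs(System.getenv()));
    }
}
```

Returning an empty list outside Lambda keeps the existing operating-system-specific lookup untouched, so a change along these lines would only add directories and never replace the current ones.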
[jira] [Commented] (PDFBOX-4623) COSParser: Infinite recursion
[ https://issues.apache.org/jira/browse/PDFBOX-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037202#comment-17037202 ]

Tim Allison commented on PDFBOX-4623:
-------------------------------------

Adding a page tree infinite loop.

> COSParser: Infinite recursion
> -----------------------------
>
> Key: PDFBOX-4623
> URL: https://issues.apache.org/jira/browse/PDFBOX-4623
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.16
> Environment: java version "12" 2019-03-19
> Java(TM) SE Runtime Environment (build 12+33)
> Java HotSpot(TM) 64-Bit Server VM (build 12+33, mixed mode, sharing)
> MacOS Mojave
> Reporter: Alex Rebert
> Priority: Minor
> Attachments: infinite-recursion.pdf, loop_in_page_tree.pdf
>
>
> Parsing an invalid PDF can lead to an infinite recursion in COSParser, which
> results in a StackOverflowError.
> *Steps to repro*
> # Download malformed PDF (attached)
> # {{Run: java -jar pdfbox-app-2.0.16.jar ExtractText infinite-recursion.pdf}}
> *Stacktrace*
> {noformat}
> Exception in thread "main" java.lang.StackOverflowError [1005/1916]
> at java.base/sun.nio.cs.UTF_8.updatePositions(UTF_8.java:79)
> at java.base/sun.nio.cs.UTF_8$Decoder.xflow(UTF_8.java:210)
> at java.base/sun.nio.cs.UTF_8$Decoder.decodeArrayLoop(UTF_8.java:321)
> at java.base/sun.nio.cs.UTF_8$Decoder.decodeLoop(UTF_8.java:414)
> at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:578)
> at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:801)
> at org.apache.pdfbox.pdfparser.BaseParser.isValidUTF8(BaseParser.java:787)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSName(BaseParser.java:768)
> at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:887)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:283)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:216)
> at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:867)
> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:912)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
> at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
> at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
> at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
> at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
> at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
> at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
> at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
> at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
> ...
> {noformat}
> The file was generated by fuzzing and is (probably) not a valid PDF file.
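The repeating frames in the trace (getLength, parseObjectDynamically, parseFileObject, parseCOSStream, then getLength again) suggest a stream whose length is an indirect reference that leads back to a stream already being parsed. One common guard against this kind of loop is to follow the references iteratively and remember which objects have been visited. The sketch below shows that idea on a toy model. It is not PDFBox code and not necessarily how PDFBox addresses the issue: LengthResolver, resolveLength and the two maps are made up for illustration, and a real parser would more likely report the problem as an IOException.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model, not PDFBox code. Object numbers stand in for PDF objects:
// directLengths holds objects whose length is known outright, and
// indirectLengths maps an object to the object its length entry refers to.
final class LengthResolver {
    private LengthResolver() {}

    static int resolveLength(long start,
                             Map<Long, Integer> directLengths,
                             Map<Long, Long> indirectLengths) {
        Set<Long> seen = new HashSet<>();
        long current = start;
        // Iterative walk: the visited set bounds the work, so a reference
        // cycle is reported instead of exhausting the call stack.
        while (seen.add(current)) {
            Integer length = directLengths.get(current);
            if (length != null) {
                return length;
            }
            Long next = indirectLengths.get(current);
            if (next == null) {
                throw new IllegalStateException("Object " + current + " has no usable length");
            }
            current = next;
        }
        throw new IllegalStateException("Cycle in length references at object " + current);
    }
}
```

With a chain 1 -> 2 -> 3 where object 3 has a direct length, the walk returns that length. With 1 -> 2 -> 1 it stops on the second visit to object 1 and throws, rather than recursing until the stack overflows.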
[jira] [Comment Edited] (PDFBOX-4623) COSParser: Infinite recursion
[ https://issues.apache.org/jira/browse/PDFBOX-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037202#comment-17037202 ]

Tim Allison edited comment on PDFBOX-4623 at 2/14/20 6:51 PM:
--------------------------------------------------------------

Adding a page tree stackoverflow.

was (Author: talli...@mitre.org):
Adding a page tree infinite loop.
[jira] [Updated] (PDFBOX-4623) COSParser: Infinite recursion
[ https://issues.apache.org/jira/browse/PDFBOX-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated PDFBOX-4623:
--------------------------------
    Attachment: loop_in_page_tree.pdf
[jira] [Commented] (PDFBOX-4776) OutofMemory for more than 300 input data
[ https://issues.apache.org/jira/browse/PDFBOX-4776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037198#comment-17037198 ]

Tilman Hausherr commented on PDFBOX-4776:
-----------------------------------------

Please share a destination file. Also make sure to reuse one font object for the whole document, i.e. don't create a new font object for each page. The same goes for images if they are used several times.

> OutofMemory for more than 300 input data
> ----------------------------------------
>
> Key: PDFBOX-4776
> URL: https://issues.apache.org/jira/browse/PDFBOX-4776
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.8
> Reporter: Tejas
> Priority: Major
> Attachments: Capture.PNG
>
>
> Creating a large PDF from XML (i.e. a large data file) causes an
> out-of-memory error.
>
> Tested with more than 300 MB of XML data. Works like a charm with files under
> 300 MB and takes under 8 minutes to generate the PDF.
>
> Need to run at least a 500 MB file.
[jira] [Commented] (PDFBOX-4776) OutofMemory for more than 300 input data
[ https://issues.apache.org/jira/browse/PDFBOX-4776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037124#comment-17037124 ]

Maruan Sahyoun commented on PDFBOX-4776:
----------------------------------------

[~TP3099] this is a very generic question without detailed information. Which memory settings do you have? Do you parse the XML into a DOM or do you use a streaming parser? Are there a lot of images? How do you generate the PDF content?

In the end, it could be that you have to give the JVM more memory.

Also, it might be worth looking at Apache FOP if you generate PDFs from XML, as it provides typesetting capabilities, headers and footers, templates, tables ...

Please also note that this is a bug tracker. Generic questions are better asked on the users mailing list. None of the information you provide shows that there is a bug in PDFBox. Having said that, we are happy to help, but without any specifics it's impossible.

BR
Maruan
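For readers unfamiliar with the DOM versus streaming distinction raised above: a DOM parser builds the whole document tree in memory, while a streaming parser hands over one event at a time, so memory use does not grow with the size of the input. The sketch below uses the JDK's StAX API (javax.xml.stream) to walk a document record by record. The element name "record" and the class StreamingRecords are assumptions for illustration only, since the reporter's actual XML layout is not shown in the issue.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: counts <record> elements without building a DOM tree.
final class StreamingRecords {
    private StreamingRecords() {}

    static int countRecords(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // Only the current parse event is held; earlier ones can be collected.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    count++; // a real job would write this record's PDF content here
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }
}
```

Whether this helps in the reported case depends on the answers to the questions above. If the memory is held by the generated PDF content rather than by the XML, a streaming parser alone would not be enough.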
[jira] [Updated] (PDFBOX-4776) OutofMemory for more than 300 input data
[ https://issues.apache.org/jira/browse/PDFBOX-4776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-4776:
-----------------------------------
    Flags:   (was: Patch,Important)
[jira] [Created] (PDFBOX-4776) OutofMemory for more than 300 input data
Tejas created PDFBOX-4776:
-----------------------------

Summary: OutofMemory for more than 300 input data
Key: PDFBOX-4776
URL: https://issues.apache.org/jira/browse/PDFBOX-4776
Project: PDFBox
Issue Type: Bug
Components: Rendering
Affects Versions: 2.0.8
Reporter: Tejas
Attachments: Capture.PNG


Creating a large PDF from XML (i.e. a large data file) causes an out-of-memory error.

Tested with more than 300 MB of XML data. Works like a charm with files under 300 MB and takes under 8 minutes to generate the PDF.

Need to run at least a 500 MB file.