Re: [VOTE] Release Apache PDFBox 1.8.7
+1, thanks for managing the release process

Timo

On 15.09.2014 at 20:49, Andreas Lehmkuehler wrote:
> Hi,
>
> a candidate for the PDFBox 1.8.7 release is available at:
> http://people.apache.org/~lehmi/pdfbox/1.8.7/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/pdfbox/tags/1.8.7/
>
> The SHA1 checksum of the archive is ba7f83a1db9e697bcd0d3613571e1b397968daf6.
>
> Please vote on releasing this package as Apache PDFBox 1.8.7. The vote is open for the next 72 hours and passes if a majority of at least three +1 PDFBox PMC votes are cast.
>
> [ ] +1 Release this package as Apache PDFBox 1.8.7
> [ ] -1 Do not release this package because...
>
> Here is my +1
>
> BR
> Andreas Lehmkühler

--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com

OntoChem GmbH
Managing director: Dr. Lutz Weber
Registered office: Halle / Saale
Court of registration: Stendal
Registration number: HRB 215461
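For anyone checking the candidate before voting, the published SHA1 can be compared against a locally computed digest. The following is a minimal sketch using only the JDK; the class and method names (Sha1Check, sha1Hex, matches) are made up for illustration and are not part of PDFBox or the release tooling.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of PDFBox.
class Sha1Check {
    /** Returns the lowercase hex SHA-1 digest of the given bytes. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Compares the computed digest with a published one, ignoring case. */
    static boolean matches(byte[] data, String publishedSha1) {
        return sha1Hex(data).equalsIgnoreCase(publishedSha1.trim());
    }
}
```

For the real archive, the bytes would come from something like Files.readAllBytes on the downloaded zip (the local file name is up to you), and the second argument would be the checksum quoted in the vote mail.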
[jira] [Commented] (PDFBOX-2350) Type1 Parser hangs indefinitely
[ https://issues.apache.org/jira/browse/PDFBOX-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135091#comment-14135091 ]

Daniel Scheibe commented on PDFBOX-2350:

Thanks Tilman for your feedback. What I currently do is:

_pdDocument = PDDocument.load(_inputStream);
if (_pdDocument.isEncrypted()) {
    _logger.warn("Document is encrypted, trying to decrypt without password");
    _pdDocument.decrypt("");
}
// 2.0.0-SNAPSHOT
_pdRenderer = new PDFRenderer(_pdDocument, true);
// ...

So I guess what you said about an additional call to decrypt is already in place and should work?

> Type1 Parser hangs indefinitely
>
> Key: PDFBOX-2350
> URL: https://issues.apache.org/jira/browse/PDFBOX-2350
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.0
> Environment: Windows 7, JDK 1.7.0_51-b13
> Reporter: Daniel Scheibe
>
> When rendering the first page of my PDF document, the Type1Parser (org.apache.fontbox.type1.Type1Parser) hangs in a loop in {{parseBinary(byte[] bytes) throws IOException}} and kills our rendering pipeline. Please find the loop that hangs below:
>
> // find /Private dict
> while (!lexer.peekToken().getText().equals("Private")) {
>     lexer.nextToken();
> }
>
> There is never a token named Private in the list of returned tokens (they are empty all the time). Going deeper into the source code, it seems the class reading the tokens (Type1Lexer) never advances the buffer position and always returns an empty name token from the readToken(Token prevToken) method. Looking at the decrypted buffer, I cannot get anything useful out of it based on my current understanding.
> Unfortunately I cannot provide the PDF in question as it contains confidential data. Acrobat Reader XI Version 11.0.08 renders the document just fine. In addition, it seems the PDF was encrypted (40-bit RC4) with an empty password and reports PDF version 1.5.
> Does this provide enough information, or can I do anything else to help nail this one down?
> I guess this might be a PDF document structure/feature that is not yet supported completely, but at least PDFBox should throw an exception instead of failing silently...

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
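The request that the parser "throw an exception instead of failing silently" can be illustrated with a guarded scan loop. This is only a sketch under stated assumptions: it uses a plain Iterator of token texts as a stand-in for the lexer, and the names BoundedTokenScan, skipTo, canReach and the maxEmptyRun limit are hypothetical. It is not the FontBox Type1Lexer API and not the fix that went into PDFBox.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a token scan; not FontBox code.
class BoundedTokenScan {
    /**
     * Consumes tokens until one equals {@code wanted} and returns how many
     * tokens were consumed. Throws instead of spinning when the input ends
     * or when more than {@code maxEmptyRun} empty tokens arrive in a row.
     */
    static int skipTo(Iterator<String> tokens, String wanted, int maxEmptyRun)
            throws IOException {
        int consumed = 0;
        int emptyRun = 0;
        while (tokens.hasNext()) {
            String text = tokens.next();
            consumed++;
            if (wanted.equals(text)) {
                return consumed;
            }
            emptyRun = text.isEmpty() ? emptyRun + 1 : 0;
            if (emptyRun > maxEmptyRun) {
                throw new IOException("Lexer keeps returning empty tokens while looking for /" + wanted);
            }
        }
        throw new IOException("Reached end of input without finding /" + wanted);
    }

    /** Convenience wrapper: true if the token is found, false if the scan gave up. */
    static boolean canReach(List<String> tokens, String wanted) {
        try {
            skipTo(tokens.iterator(), wanted, 3);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

The design choice here is simply to turn "no progress" into a checked failure, which matches the existing `throws IOException` on parseBinary; the threshold of three empty tokens is arbitrary.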
[jira] [Commented] (PDFBOX-2353) ArrayIndexOutOfBoundsException in Type2CharString.drawAlternatingCurve
[ https://issues.apache.org/jira/browse/PDFBOX-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135114#comment-14135114 ]

simon steiner commented on PDFBOX-2353:

Similar to PDFBOX-2177

> ArrayIndexOutOfBoundsException in Type2CharString.drawAlternatingCurve
>
> Key: PDFBOX-2353
> URL: https://issues.apache.org/jira/browse/PDFBOX-2353
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
>
> The file from PDFBOX-2348 fails with this exception:
> {code}
> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 3
>     at java.util.Vector.get(Vector.java:744)
>     at org.apache.fontbox.cff.Type2CharString.drawAlternatingCurve(Type2CharString.java:333)
>     at org.apache.fontbox.cff.Type2CharString.handleCommand(Type2CharString.java:181)
>     at org.apache.fontbox.cff.Type2CharString.access$000(Type2CharString.java:32)
>     at org.apache.fontbox.cff.Type2CharString$1.handleCommand(Type2CharString.java:104)
>     at org.apache.fontbox.cff.CharStringHandler.handleSequence(CharStringHandler.java:45)
>     at org.apache.fontbox.cff.Type2CharString.convertType1ToType2(Type2CharString.java:107)
>     at org.apache.fontbox.cff.Type2CharString.<init>(Type2CharString.java:58)
>     at org.apache.fontbox.cff.CIDKeyedType2CharString.<init>(CIDKeyedType2CharString.java:46)
>     at org.apache.fontbox.cff.CFFCIDFont.getType2CharString(CFFCIDFont.java:233)
>     at org.apache.pdfbox.pdmodel.font.PDCIDFontType0.getType2CharString(PDCIDFontType0.java:210)
>     at org.apache.pdfbox.rendering.font.CIDType0Glyph2D.getPathForCharacterCode(CIDType0Glyph2D.java:63)
>     at org.apache.pdfbox.rendering.PageDrawer.drawGlyph2D(PageDrawer.java:431)
> {code}
[DISCUSS] move documentation and examples to git
Hi there,

in order to make it easier for people to contribute to the documentation and examples, I thought about the potential benefits of moving these to a git-based repository instead of svn. The main idea is to allow people to contribute via GitHub, opening another channel of communication and making it easier to contribute. Proposed names are pdfbox-docs and pdfbox-examples. Take a look at https://github.com/apache/cordova-docs for an example of that.

I haven't thought through all potential implications and necessary changes yet, but wanted to get first feedback on support for the idea before putting more effort into it.

WDYT?

Maruan
Re: [DISCUSS] move documentation and examples to git
Hi,

On 16 September 2014 at 10:21, Maruan Sahyoun sahy...@fileaffairs.de wrote:
> Hi there,
>
> in order to make it easier for people to contribute to the documentation and examples I thought about the potential benefits of moving these to a git based repository instead of svn. The main idea behind that is to allow people to contribute via github opening another channel of communication and making it easier to contribute. Proposed names are pdfbox-docs and pdfbox-examples. Take a look at https://github.com/apache/cordova-docs for an example of that.
>
> I haven't thought about all potential implications and changes necessary yet but wanted to get a first feedback about support for that idea before putting more effort into that.
>
> WDYT?

Good idea, but I'm not sure if a split repo configuration (svn/git) is supported by infra. So maybe this is only possible if we migrate the whole project to git.

> Maruan

BR
Andreas Lehmkühler
Re: [DISCUSS] move documentation and examples to git
What about having extra repos for pdfbox-docs and pdfbox-examples?

Maruan

On 16.09.2014 at 11:43, Andreas Lehmkühler andr...@lehmi.de wrote:
> Good idea, but I'm not sure if a splitted repo configuration (svn/git) is supported by infra. So maybe this is only possible if we migrate the whole project to git.
>
> BR
> Andreas Lehmkühler
Re: [DISCUSS] move documentation and examples to git
On 16 September 2014 at 11:46, Maruan Sahyoun sahy...@fileaffairs.de wrote:
> what about having extra repos for pdfbox-docs and pdfbox-examples?

Hmm, I'm a little bit puzzled. Your original proposal was already about extra git repos for docs and examples, wasn't it?

Andreas
Re: [DISCUSS] move documentation and examples to git
OK - I see what you mean, got your question wrong. We can check with infra, but I don't see a reason why pdfbox-docs and pdfbox-examples can't live in new git-based repos while pdfbox stays in the old svn one. They would behave just like 'different' projects. So if it's possible, shall we do it?

Moving the whole project to git is a different story. I'd see the same benefit applying to pdfbox, but the impact is larger. So moving the docs and examples might also be a good test case.

Maruan

On 16.09.2014 at 11:55, Andreas Lehmkühler andr...@lehmi.de wrote:
> Hmm, I'm a little bit puzzled. Your origin proposal was already about extra git-repos for docs and examples, wasn't it?
>
> Andreas
Re: [DISCUSS] move documentation and examples to git
On 16 September 2014 at 12:06, Maruan Sahyoun sahy...@fileaffairs.de wrote:
> OK - I see what you mean, got your question wrong. We can check with infra but I don't see a reason why pdfbox-docs and pdfbox-examples can't exist in new repos and there is pdfbox in the old one and the new repos being git based. Would behave just like 'different' projects.

Technically yes, but we should ask infra whether it's possible from the organizational point of view.

> So if it's possible shall we do it?

+1. We have to split the build if we move the examples to a git repo and concatenate them.

> Moving the whole project to git is a different story. I'd see the same benefit applying to pdfbox but the impact is larger. So moving the docs and examples might also be a good test case.

Yes, that would be a perfect opportunity.

> Maruan

Andreas
[jira] [Created] (PDFBOX-2354) DataFormatException: incorrect header check
simon steiner created PDFBOX-2354:

Summary: DataFormatException: incorrect header check
Key: PDFBOX-2354
URL: https://issues.apache.org/jira/browse/PDFBOX-2354
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.0
Reporter: simon steiner

java -jar ~/pdf-box-svn/app/target/pdfbox-app-2.0.0-SNAPSHOT.jar WriteDecodedDoc -nonSeq 601501018.pdf

java.util.zip.DataFormatException: incorrect header check
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
[jira] [Updated] (PDFBOX-2354) DataFormatException: incorrect header check
[ https://issues.apache.org/jira/browse/PDFBOX-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

simon steiner updated PDFBOX-2354:

Description:
http://svn.apache.org/viewvc/incubator/pdfbox/trunk/test/input/601501018.pdf?revision=682412&view=co&pathrev=793348

java -jar ~/pdf-box-svn/app/target/pdfbox-app-2.0.0-SNAPSHOT.jar WriteDecodedDoc -nonSeq 601501018.pdf

java.util.zip.DataFormatException: incorrect header check
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)

was:
java -jar ~/pdf-box-svn/app/target/pdfbox-app-2.0.0-SNAPSHOT.jar WriteDecodedDoc -nonSeq 601501018.pdf

java.util.zip.DataFormatException: incorrect header check
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)

> DataFormatException: incorrect header check
>
> Key: PDFBOX-2354
> URL: https://issues.apache.org/jira/browse/PDFBOX-2354
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.0
> Reporter: simon steiner
Re: [DISCUSS] move documentation and examples to git
Hi,

On 16 September 2014 at 14:35, Maruan Sahyoun sahy...@fileaffairs.de wrote:
> On 16.09.2014 at 14:27, Andreas Lehmkühler andr...@lehmi.de wrote:
>> On 16 September 2014 at 14:23, Maruan Sahyoun sahy...@fileaffairs.de wrote:
>>> On 16.09.2014 at 14:08, Andreas Lehmkühler andr...@lehmi.de wrote:
>>>> On 16 September 2014 at 12:06, Maruan Sahyoun sahy...@fileaffairs.de wrote:
>>>>> OK - I see what you mean, got your question wrong. We can check with infra but I don't see a reason why pdfbox-docs and pdfbox-examples can't exist in new repos and there is pdfbox in the old one and the new repos being git based. Would behave just like 'different' projects.
>>>>
>>>> Technically yes, but we should asked infra if it's possible from the organizational point of view.
>>>
>>> You or me going to ask?
>>
>> Be my guest ;-)
>
> Thank you - looking forward to your feedback. In the meanwhile I'll start with the changes for the content.

Done, I simply created a JIRA ticket. Let's see what happens:
https://issues.apache.org/jira/browse/INFRA-8357

BR
Andreas
[jira] [Closed] (PDFBOX-2353) ArrayIndexOutOfBoundsException in Type2CharString.drawAlternatingCurve
[ https://issues.apache.org/jira/browse/PDFBOX-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-2353.

Resolution: Duplicate

> ArrayIndexOutOfBoundsException in Type2CharString.drawAlternatingCurve
>
> Key: PDFBOX-2353
> URL: https://issues.apache.org/jira/browse/PDFBOX-2353
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
[jira] [Commented] (PDFBOX-2350) Type1 Parser hangs indefinitely
[ https://issues.apache.org/jira/browse/PDFBOX-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135708#comment-14135708 ]

Tilman Hausherr commented on PDFBOX-2350:

Please also try
{code}
PDDocument.loadNonSeq(new File(pdfFilename), "");
{code}
which does the decryption if needed. Also, the correct way to decrypt with the old parser is
{code}
if( document.isEncrypted() )
{
    try
    {
        // empty password
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
        document.openProtection(sdm);
    }
    catch( InvalidPasswordException e )
    {
        System.err.println( "Error: The document is encrypted." );
    }
}
{code}
I'm not saying that this will solve your problems, but it is worth a try. If it still doesn't work, please save the decrypted byte array (in the parseBinary method) to a file and post it here.

> Type1 Parser hangs indefinitely
>
> Key: PDFBOX-2350
> URL: https://issues.apache.org/jira/browse/PDFBOX-2350
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.0
> Environment: Windows 7, JDK 1.7.0_51-b13
> Reporter: Daniel Scheibe
[jira] [Closed] (PDFBOX-2321) java.lang.ExceptionInInitializerError in PDFRenderer.renderImageWithDPI
[ https://issues.apache.org/jira/browse/PDFBOX-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-2321.

Resolution: Incomplete

Closing, please reopen and attach the file as described. We won't register with scribd to get the PDF.

> java.lang.ExceptionInInitializerError in PDFRenderer.renderImageWithDPI
>
> Key: PDFBOX-2321
> URL: https://issues.apache.org/jira/browse/PDFBOX-2321
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.0
> Environment: Windows 8.1 x64
> Reporter: Marino
>
> An unhandled exception of type 'java.lang.ExceptionInInitializerError' occurs when calling the method with the following pdf and 96 dpi.
> renderImageWithDPI(i, 96);
[jira] [Closed] (PDFBOX-2354) DataFormatException: incorrect header check
[ https://issues.apache.org/jira/browse/PDFBOX-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-2354.

Resolution: Invalid

The file is most probably broken and here's why:
- I first traced to find out which object has the problem. It is 97 0 obj.
- I then traced to see whether the object is decrypted. Yes, it is.
- I traced to see which other objects have the problem. Yes: 108, 115, 123, 129, 133, 171, 196, 204, 223. All these objects are streams with length 7.
- I then ran qpdf with the file. It brings this error message:
  (file position 160266): error decoding stream data for object 108 0: stream inflate: inflate: data: incorrect header check
  stream inflate: inflate: data: incorrect header check

> DataFormatException: incorrect header check
>
> Key: PDFBOX-2354
> URL: https://issues.apache.org/jira/browse/PDFBOX-2354
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.0
> Reporter: simon steiner
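As background to the "incorrect header check" message: a Flate stream in zlib format (RFC 1950) starts with two header bytes that must name the deflate method and satisfy a checksum, and both qpdf and java.util.zip reject data that does not. The sketch below shows that check using only the JDK. The class and method names (ZlibHeaderCheck, hasZlibHeader, inflates, deflate) are hypothetical, and the byte values used to exercise it are made up; they are not taken from 601501018.pdf.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not PDFBox code.
class ZlibHeaderCheck {
    /** True if the first two bytes form a valid zlib (RFC 1950) deflate header. */
    static boolean hasZlibHeader(byte[] data) {
        if (data.length < 2) {
            return false;
        }
        int cmf = data[0] & 0xff;
        int flg = data[1] & 0xff;
        boolean deflateMethod = (cmf & 0x0f) == 8;   // CM must be 8
        boolean windowOk = (cmf >> 4) <= 7;          // CINFO must be at most 7
        boolean checkOk = ((cmf << 8) | flg) % 31 == 0; // FCHECK rule
        return deflateMethod && windowOk && checkOk;
    }

    /** Tries to inflate the data; false if java.util.zip rejects it or it is truncated. */
    static boolean inflates(byte[] data) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            byte[] out = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(out);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return false;
                }
            }
            return true;
        } catch (DataFormatException e) {
            return false;
        } finally {
            inflater.end();
        }
    }

    /** Compresses a small input in zlib format, for comparison. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[raw.length + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }
}
```

A seven-byte stream of arbitrary bytes will usually fail the header test, which is consistent with the diagnosis that the length-7 streams in the file are broken rather than PDFBox misreading them.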
[jira] [Updated] (PDFBOX-2301) RandomAccessBuffer consumes too much memory.
[ https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gee updated PDFBOX-2301:

Attachment: clone2.diff

Although this patch needs Java 1.7, it should fix the issues you raised earlier. Now adding/removing an element to the shallow-cloned list changes internal data.

> RandomAccessBuffer consumes too much memory.
>
> Key: PDFBOX-2301
> URL: https://issues.apache.org/jira/browse/PDFBOX-2301
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 1.8.6, 2.0.0
> Reporter: gee
> Assignee: Andreas Lehmkühler
> Fix For: 2.0.0
> Attachments: clone.diff, clone2.diff
>
> RandomAccessBuffer holds the uncompressed image during operation, because that is exactly what the PDFBox ExtractImages tool does. But holding the uncompressed image in memory instead of the compressed one consumes too much memory, and the same applies to the many PDF XObjects that can use a filter to compress themselves.
> It would be good if PDFBox provided an option to revert to the COSObject state just before the RandomAccess object was created (the state in which the PDF XObject stream has been parsed and the COSDictionary objects have not yet been created because the user has not requested them via the get() method).
> This is a crucial feature for letting PDFBox analyze huge PDF files (100 MB).
> In the current source, one must close a COSStream unless it is required (and I know a closed stream cannot be reopened).
>
> Class Name                                                               | Shallow Heap | Retained Heap
> org.apache.pdfbox.cos.COSObject @ 0x5ad4940                              |           24 |     8,187,264
> |- class class org.apache.pdfbox.cos.COSObject @ 0x58c4020               |            0 |             0
> |- generationNumber org.apache.pdfbox.cos.COSInteger @ 0x5ad0080         |           24 |            24
> |- baseObject org.apache.pdfbox.cos.COSStream @ 0x5b25ea0                |           32 |     8,187,216
> |  |- class class org.apache.pdfbox.cos.COSStream @ 0x58c3e00            |            8 |             8
> |  |- items java.util.LinkedHashMap @ 0x5b2a0f0                          |           56 |           552
> |  |- file org.apache.pdfbox.io.RandomAccessBuffer @ 0x5b2a128           |           48 |     8,186,528
> |  |  |- class class org.apache.pdfbox.io.RandomAccessBuffer @ 0x5ad2b00 |
Re: [DISCUSS] move documentation and examples to git
Please don't send me mail.

On 16 Sep 2014 13:52, Maruan Sahyoun sahy...@fileaffairs.de wrote:
> Hi there,
>
> in order to make it easier for people to contribute to the documentation and examples I thought about the potential benefits of moving these to a git based repository instead of svn. The main idea behind that is to allow people to contribute via github opening another channel of communication and making it easier to contribute. Proposed names are pdfbox-docs and pdfbox-examples. Take a look at https://github.com/apache/cordova-docs for an example of that.
>
> I haven't thought about all potential implications and changes necessary yet but wanted to get a first feedback about support for that idea before putting more effort into that.
>
> WDYT?
>
> Maruan
[jira] [Updated] (PDFBOX-2301) RandomAccessBuffer consumes too much memory.
[ https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gee updated PDFBOX-2301:

Attachment: clone3.diff

clone2.diff is now invalid. This patch fills the remaining hole in the previous patch. Now every write operation ensures that changes are not visible to any other cloned object.

> RandomAccessBuffer consumes too much memory.
>
> Key: PDFBOX-2301
> URL: https://issues.apache.org/jira/browse/PDFBOX-2301
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 1.8.6, 2.0.0
> Reporter: gee
> Assignee: Andreas Lehmkühler
> Fix For: 2.0.0
> Attachments: clone.diff, clone2.diff, clone3.diff
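The property described for clone3.diff, that a write through one clone is never seen by another, is what a copy-on-write buffer provides. The sketch below illustrates the general technique only; it is not the content of clone.diff, clone2.diff or clone3.diff, and the class CowBuffer, its chunk size and its method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical copy-on-write chunk buffer; not RandomAccessBuffer and not the attached patches.
class CowBuffer {
    private static final int CHUNK = 1024;
    private final List<byte[]> chunks = new ArrayList<byte[]>();
    // owned.get(i) is true when chunk i is referenced by this buffer only.
    private final List<Boolean> owned = new ArrayList<Boolean>();

    CowBuffer(int chunkCount) {
        for (int i = 0; i < chunkCount; i++) {
            chunks.add(new byte[CHUNK]);
            owned.add(Boolean.TRUE);
        }
    }

    private CowBuffer() {
    }

    /** Cheap clone: shares every chunk and marks it as shared on both sides. */
    CowBuffer share() {
        CowBuffer copy = new CowBuffer();
        for (int i = 0; i < chunks.size(); i++) {
            copy.chunks.add(chunks.get(i));
            copy.owned.add(Boolean.FALSE);
            owned.set(i, Boolean.FALSE);
        }
        return copy;
    }

    /** Copies the affected chunk first if it may still be shared, then writes. */
    void write(int pos, byte value) {
        int idx = pos / CHUNK;
        if (!owned.get(idx)) {
            chunks.set(idx, chunks.get(idx).clone());
            owned.set(idx, Boolean.TRUE);
        }
        chunks.get(idx)[pos % CHUNK] = value;
    }

    byte read(int pos) {
        return chunks.get(pos / CHUNK)[pos % CHUNK];
    }
}
```

The trade-off in this simple variant is that the last holder of a shared chunk may copy it once unnecessarily; a reference count per chunk would avoid that at the cost of more bookkeeping.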
[jira] [Created] (PDFBOX-2356) Error Validating PDF Archive Document
Cetra Free created PDFBOX-2356:

Summary: Error Validating PDF Archive Document
Key: PDFBOX-2356
URL: https://issues.apache.org/jira/browse/PDFBOX-2356
Project: PDFBox
Issue Type: Bug
Components: Preflight
Affects Versions: 1.8.6, 1.8.5, 1.8.4
Reporter: Cetra Free

When trying to validate a PDF archive file (attached to this ticket) we get the following error:
{code}
7.2 - Error on MetaData, ModificationDate present in the document catalog dictionary doesn't match with XMP information
{code}
This is because the Modification Date in the Dictionary is parsed differently from the XMP Metadata. The XMP Metadata is correct, but the Date from the Dictionary appends an extra 30 minutes.

The following is the raw COSObject from the PDF File
{code}
COSString{D:20140917122850+09'30'}
{code}
The Long value should be *141092273*

The *org.apache.pdfbox.util.DateConverter* *parseDate* method returns the Date with Long *141092453*, which is 30 minutes ahead. The XMP Modification Date is parsed differently and returns the correct date. This means that validation will fail for PDF Archives.

My suggestion would be to refactor the parseDate function to use the Standard Java library. Here's an example class which will be compatible with the PDF Specification:
{code}
static class DateParser {

    // Maps pattern length (4, 6, 8, ... 15) to the matching date format.
    private Map<Integer, SimpleDateFormat> formats = new HashMap<Integer, SimpleDateFormat>();

    public DateParser() {
        String expr = "";
        for (String part : Arrays.asList("yyyy", "MM", "dd", "HH", "mm", "ss", "Z")) {
            expr = expr + part;
            formats.put(expr.length(), new SimpleDateFormat(expr));
        }
    }

    public Calendar parseDate(String expr) {
        try {
            expr = expr.replace("D:", "").replace("'", "").replace("Z", "+");
            Date date = formats.get(Math.min(expr.length(), 15)).parse(expr);
            Calendar calendar = Calendar.getInstance();
            calendar.setTime(date);
            return calendar;
        } catch (ParseException e) {
            return null;
        }
    }
}
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
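The 30-minute discrepancy can be reproduced with plain arithmetic on the date string from the report. The sketch below is a minimal illustration using java.time, assuming a complete 14-digit date followed by an optional offset; the class PdfDateSketch and the method toEpochMillis are hypothetical names, and this is neither DateConverter nor the DateParser proposed in the ticket.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical helper, not PDFBox code. Assumes input like D:YYYYMMDDHHmmSS+HH'mm'.
class PdfDateSketch {
    /** Converts a full PDF date string to milliseconds since the epoch. */
    static long toEpochMillis(String pdfDate) {
        String s = pdfDate.startsWith("D:") ? pdfDate.substring(2) : pdfDate;
        LocalDateTime local = LocalDateTime.of(
                Integer.parseInt(s.substring(0, 4)),    // year
                Integer.parseInt(s.substring(4, 6)),    // month
                Integer.parseInt(s.substring(6, 8)),    // day
                Integer.parseInt(s.substring(8, 10)),   // hour
                Integer.parseInt(s.substring(10, 12)),  // minute
                Integer.parseInt(s.substring(12, 14))); // second
        // "+09'30'" becomes "+0930", a form ZoneOffset.of accepts.
        String tz = s.substring(14).replace("'", "");
        ZoneOffset offset = (tz.isEmpty() || tz.equals("Z"))
                ? ZoneOffset.UTC
                : ZoneOffset.of(tz);
        return local.toInstant(offset).toEpochMilli();
    }
}
```

With the string from the ticket, D:20140917122850+09'30', this yields 1410922730000. Dropping the minutes of the offset (treating it as +09'00') yields 1410924530000, exactly 30 minutes later. The two values quoted in the report look like these numbers with the trailing zeros lost, though that reading is an inference rather than something the ticket states.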
[jira] [Updated] (PDFBOX-2356) Error Validating PDF Archive Document
[ https://issues.apache.org/jira/browse/PDFBOX-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cetra Free updated PDFBOX-2356:

Attachment: pdfafile.pdf

> Error Validating PDF Archive Document
>
> Key: PDFBOX-2356
> URL: https://issues.apache.org/jira/browse/PDFBOX-2356
> Project: PDFBox
> Issue Type: Bug
> Components: Preflight
> Affects Versions: 1.8.4, 1.8.5, 1.8.6
> Reporter: Cetra Free
> Attachments: pdfafile.pdf