[jira] [Commented] (PDFBOX-2694) Evaluate twelvemonkeys for JPEG
[ https://issues.apache.org/jira/browse/PDFBOX-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344698#comment-14344698 ]

Andreas Lehmkühler commented on PDFBOX-2694:

Looks promising. How about the TIFF support, will you have a look at that too?

Based on our past experiences, it seems to be a good idea to get rid of all problematic JRE dependencies such as ImageIO, especially if the developer is as responsive as Harald is. :-)

> Evaluate twelvemonkeys for JPEG
> -------------------------------
>
> Key: PDFBOX-2694
> URL: https://issues.apache.org/jira/browse/PDFBOX-2694
> Project: PDFBox
> Issue Type: Task
> Components: Parsing
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Priority: Minor
> Labels: jpeg, twelvemonkeys
> Attachments: 176936-p154-2.jpg, 176936-p154.pdf, 485945.pdf, 573636.pdf
>
> While working on PDFBOX-2128 I decided to try twelvemonkeys for JPEG reading and the first impression is excellent. It seems that the author is making a big effort in handling even the most broken JPEG files (similar to what we do with PDFs). This issue is to collect problem files and discuss all experiences and decide whether we should bundle twelvemonkeys with PDFBox or rather just recommend it as an optional solution.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Resolved] (PDFBOX-2695) Iterate PDOutlineNode children
[ https://issues.apache.org/jira/browse/PDFBOX-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-2695.
    Resolution: Fixed
    Assignee: Tilman Hausherr

Good idea - thanks!

> Iterate PDOutlineNode children
> ------------------------------
>
> Key: PDFBOX-2695
> URL: https://issues.apache.org/jira/browse/PDFBOX-2695
> Project: PDFBox
> Issue Type: Improvement
> Components: PDModel
> Affects Versions: 2.0.0
> Reporter: Andrea Vacondio
> Assignee: Tilman Hausherr
> Priority: Minor
> Labels: outline
> Fix For: 2.0.0
> Attachments: iterable_children.diff
>
> Given an outline item, I need to walk through all its children. ??The items at each level of the hierarchy form a linked list, chained together through their Prev and Next entries and accessed through the First and Last entries in the parent item?? so I created a simple patch to allow this kind of code:
> {code}
> if (node != null) {
>     for (PDOutlineItem current : node.children()) {
>         //do something with the
>     }
> }
> {code}
> Given an item, PDOutlineNode.children returns an Iterable that walks through the children until there is no NEXT or NEXT is equal to the starting element.
[jira] [Commented] (PDFBOX-2695) Iterate PDOutlineNode children
[ https://issues.apache.org/jira/browse/PDFBOX-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343787#comment-14343787 ] ASF subversion and git services commented on PDFBOX-2695: - Commit 1663436 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1663436 ] PDFBOX-2695: Iterate PDOutlineNode children, by Andrea Vacondio > Iterate PDOutlineNode children > -- > > Key: PDFBOX-2695 > URL: https://issues.apache.org/jira/browse/PDFBOX-2695 > Project: PDFBox > Issue Type: Improvement > Components: PDModel >Affects Versions: 2.0.0 >Reporter: Andrea Vacondio >Assignee: Tilman Hausherr >Priority: Minor > Labels: outline > Fix For: 2.0.0 > > Attachments: iterable_children.diff > > > Give an outline item, I need to walk through all its children. ??The items at > each level of the hierarchy form a linked list, chained together through > their Prev and Next entries and accessed through the First and Last entries > in the parent item?? so I created a simple patch to allow this kind of code: > {code} > if(node !=null){ >for (PDOutlineItem current : node.children()) { > //do something with the >} > } > {code} > Given an item, PDOutlineNode.children returns an Iterable that walks through > the children until there is no NEXT or NEXT is equals to the starting element. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-1130) ExtractText -html doesn't always close the tags it opens
[ https://issues.apache.org/jira/browse/PDFBOX-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343741#comment-14343741 ] Hudson commented on PDFBOX-1130: SUCCESS: Integrated in tika-trunk-jdk1.7 #524 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/524/]) TIKA-758 clean up after remembering PDFBOX-1130 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1663424) * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java > ExtractText -html doesn't always close the tags it opens > > > Key: PDFBOX-1130 > URL: https://issues.apache.org/jira/browse/PDFBOX-1130 > Project: PDFBox > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Andreas Lehmkühler >Priority: Minor > Fix For: 1.8.0 > > Attachments: 86.pdf, PDFBOX-1130.patch > > > I have a test document (same one on PDFBOX-1129), which when run through > ExtractText -html, extracts the page number for each page, however in each > case the page number looks like: > NText of page N... > Ie, the tag for the page number wasn't closed. > Maybe related: if I run ExtractText without html, there is not space after > the page number and before the next word, ie I see words like 1Massachusetts, > 2Course, 3also, 4the. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Resolved] (PDFBOX-2436) Parsing error
[ https://issues.apache.org/jira/browse/PDFBOX-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-2436.
    Resolution: Fixed
    Fix Version/s: 2.0.0

Works fine using the non-sequential parser after solving PDFBOX-2515. The fix is limited to the trunk version.

Thanks for the report

> Parsing error
> -------------
>
> Key: PDFBOX-2436
> URL: https://issues.apache.org/jira/browse/PDFBOX-2436
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.7
> Environment: Java 8
> Reporter: Jan Vomlel
> Assignee: Andreas Lehmkühler
> Priority: Critical
> Fix For: 2.0.0
> Attachments: h1.pdf
>
> PDDocument.load method returns without exception, but document model is incomplete.
> You can try it by this code on attached file:
> {code}
> PDDocument document = PDDocument.load(new File(inFN), null);
> int size = document.getSignatureDictionaries().size();
> System.out.println("Signatures count:" +size);
> {code}
> Output is 1, but there are two signatures in PDF document.
> PDFParser.class produces IOException and ignores it on line 196. Rest of the document is ignored.
> loadNoSeq method works, but I cannot use it, because I want to attach a new signature.
[jira] [Resolved] (PDFBOX-2527) IOException: Negative seek offset in NonSequentialPDFParser
[ https://issues.apache.org/jira/browse/PDFBOX-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-2527.
    Resolution: Fixed

I'm finished at this point. I discontinue the work on rebuilding a corrupt file which is encrypted as it is far more complicated than expected. We can open a new issue if someone comes up with a real sample (I've created mine by manipulating a well-formed one).

Thanks to everybody for the help/input/report

> IOException: Negative seek offset in NonSequentialPDFParser
> -----------------------------------------------------------
>
> Key: PDFBOX-2527
> URL: https://issues.apache.org/jira/browse/PDFBOX-2527
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.8, 2.0.0
> Reporter: Tilman Hausherr
> Assignee: Andreas Lehmkühler
> Priority: Minor
> Fix For: 2.0.0
> Attachments: PDFBOX-2527-069020.pdf
>
> {code}
> Exception in thread "main" java.io.IOException: Negative seek offset
>     at java.io.RandomAccessFile.seek(Native Method)
>     at org.apache.pdfbox.io.RandomAccessBufferedFileInputStream.seek(RandomAccessBufferedFileInputStream.java:116)
>     at org.apache.pdfbox.io.PushBackInputStream.seek(PushBackInputStream.java:234)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:492)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:1013)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:951)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:897)
>     at org.apache.pdfbox.tools.PDFReader.parseDocument(PDFReader.java:375)
>     at org.apache.pdfbox.tools.PDFReader.openPDFFile(PDFReader.java:340)
>     at org.apache.pdfbox.tools.PDFReader.main(PDFReader.java:326)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:80)
> {code}
> This happens with several malformed PDFs from the test set in TIKA-1442. These files (303385, 069020, 303385, 742141, 982996) all have some trash at the end.
[jira] [Commented] (PDFBOX-2527) IOException: Negative seek offset in NonSequentialPDFParser
[ https://issues.apache.org/jira/browse/PDFBOX-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343555#comment-14343555 ] ASF subversion and git services commented on PDFBOX-2527: - Commit 1663394 from [~lehmi] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1663394 ] PDFBOX-2527: removed encryption dictionary detection > IOException: Negative seek offset in NonSequentialPDFParser > --- > > Key: PDFBOX-2527 > URL: https://issues.apache.org/jira/browse/PDFBOX-2527 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 1.8.8, 2.0.0 >Reporter: Tilman Hausherr >Assignee: Andreas Lehmkühler >Priority: Minor > Fix For: 2.0.0 > > Attachments: PDFBOX-2527-069020.pdf > > > {code} > Exception in thread "main" java.io.IOException: Negative seek offset > at java.io.RandomAccessFile.seek(Native Method) > at > org.apache.pdfbox.io.RandomAccessBufferedFileInputStream.seek(RandomAccessBufferedFileInputStream.java:116) > at > org.apache.pdfbox.io.PushBackInputStream.seek(PushBackInputStream.java:234) > at > org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:492) > at > org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:1013) > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:951) > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:897) > at org.apache.pdfbox.tools.PDFReader.parseDocument(PDFReader.java:375) > at org.apache.pdfbox.tools.PDFReader.openPDFFile(PDFReader.java:340) > at org.apache.pdfbox.tools.PDFReader.main(PDFReader.java:326) > at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:80) > {code} > This happens with several malformed PDFs from the test set in TIKA-1442. > These files (303385, 069020, 303385, 742141, 982996) all have some trash at > the end. 
[jira] [Commented] (PDFBOX-2301) RandomAccessBuffer consumes too much memory.
[ https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343505#comment-14343505 ] Andreas Lehmkühler commented on PDFBOX-2301: {quote} The analysis was not correct. RandomAccessBuffer allocated 16KB without condition when it needed to use just some hundred bytes for small content stream object. {quote} I've decreased the chunk size to 1024 > RandomAccessBuffer consumes too much memory. > > > Key: PDFBOX-2301 > URL: https://issues.apache.org/jira/browse/PDFBOX-2301 > Project: PDFBox > Issue Type: Bug > Components: PDModel >Affects Versions: 1.8.6, 2.0.0 >Reporter: gee >Assignee: Andreas Lehmkühler >Priority: Blocker > Fix For: 2.0.0 > > Attachments: clone.diff, clone2.diff, clone3.diff, clone4.diff > > > RandomAccessBuffer holds uncompressed image during operation because it is > what exactly pdfbox ExtractImages do. > but holding uncompressed image instead of compressed one in memory consumes > too much memory, not excluding many PDF XObjects that can use filter to > compress itself. It would be good if pdfbox provides option that reverts to > COSObject state just before the RandomAccess object created(the state that > pdf XObject stream parsed and COSDictionary objects haven't created because > user doesn't requested it using get() method.) It is crucial feature so > that pdfbox can analyze huge pdf file(>100MB). > In current source, one must close COSStream unless required(and I know closed > stream cannot reopened again.) 
> Class Name                                                        | Shallow Heap | Retained Heap
> --------------------------------------------------------------------------------------------------
> org.apache.pdfbox.cos.COSObject @ 0x5ad4940                       |           24 |     8,187,264
> |- class org.apache.pdfbox.cos.COSObject @ 0x58c4020              |            0 |             0
> |- generationNumber org.apache.pdfbox.cos.COSInteger @ 0x5ad0080  |           24 |            24
> |- baseObject org.apache.pdfbox.cos.COSStream @ 0x5b25ea0         |           32 |     8,187,216
> |  |- class org.apache.pdfbox.cos.COSStream @ 0x58c3e00           |            8 |             8
> |  |- items java.util.LinkedHashMap @ 0x5b2a0f0                   |           56 |           552
> |  |- file org.apache.pdfbox.io.RandomAccessBuffer @ 0x5b2a128    |           48 |     8,186,528
> |  |  |- class org.apache.pdfbox.io.RandomAccessBuffer @ 0x5ad2b00
[jira] [Commented] (PDFBOX-2301) RandomAccessBuffer consumes too much memory.
[ https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343502#comment-14343502 ] ASF subversion and git services commented on PDFBOX-2301: - Commit 1663378 from [~lehmi] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1663378 ] PDFBOX-2301: use 1024 instead of 16384 bytes as chunk size > RandomAccessBuffer consumes too much memory. > > > Key: PDFBOX-2301 > URL: https://issues.apache.org/jira/browse/PDFBOX-2301 > Project: PDFBox > Issue Type: Bug > Components: PDModel >Affects Versions: 1.8.6, 2.0.0 >Reporter: gee >Assignee: Andreas Lehmkühler >Priority: Blocker > Fix For: 2.0.0 > > Attachments: clone.diff, clone2.diff, clone3.diff, clone4.diff > > > RandomAccessBuffer holds uncompressed image during operation because it is > what exactly pdfbox ExtractImages do. > but holding uncompressed image instead of compressed one in memory consumes > too much memory, not excluding many PDF XObjects that can use filter to > compress itself. It would be good if pdfbox provides option that reverts to > COSObject state just before the RandomAccess object created(the state that > pdf XObject stream parsed and COSDictionary objects haven't created because > user doesn't requested it using get() method.) It is crucial feature so > that pdfbox can analyze huge pdf file(>100MB). > In current source, one must close COSStream unless required(and I know closed > stream cannot reopened again.) 
> Class Name                                                        | Shallow Heap | Retained Heap
> --------------------------------------------------------------------------------------------------
> org.apache.pdfbox.cos.COSObject @ 0x5ad4940                       |           24 |     8,187,264
> |- class org.apache.pdfbox.cos.COSObject @ 0x58c4020              |            0 |             0
> |- generationNumber org.apache.pdfbox.cos.COSInteger @ 0x5ad0080  |           24 |            24
> |- baseObject org.apache.pdfbox.cos.COSStream @ 0x5b25ea0         |           32 |     8,187,216
> |  |- class org.apache.pdfbox.cos.COSStream @ 0x58c3e00           |            8 |             8
> |  |- items java.util.LinkedHashMap @ 0x5b2a0f0                   |           56 |           552
> |  |- file org.apache.pdfbox.io.RandomAccessBuffer @ 0x5b2a128    |           48 |     8,186,528
> |  |  |- class org.apache.pdfbox.io.RandomAccessBuffer @ 0x5ad2b00
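The chunk-size change discussed in this thread (1024 instead of 16384 bytes) can be illustrated with a minimal sketch. It assumes a buffer that allocates its first chunk up front and grows one whole chunk at a time, which is what the quoted comment describes; the class `ChunkedBufferSketch` and its methods are hypothetical and are not PDFBox's RandomAccessBuffer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical chunked in-memory buffer; a simplified illustration, not PDFBox's RandomAccessBuffer. */
class ChunkedBufferSketch {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private long size;

    ChunkedBufferSketch(int chunkSize) {
        this.chunkSize = chunkSize;
        // Assumption: the first chunk is allocated unconditionally, even for tiny content.
        chunks.add(new byte[chunkSize]);
    }

    /** Appends the data, allocating a further whole chunk whenever the current one is full. */
    void write(byte[] data) {
        for (byte b : data) {
            int chunkIndex = (int) (size / chunkSize);
            if (chunkIndex == chunks.size()) {
                chunks.add(new byte[chunkSize]);
            }
            chunks.get(chunkIndex)[(int) (size % chunkSize)] = b;
            size++;
        }
    }

    /** Number of bytes actually written. */
    long size() {
        return size;
    }

    /** Number of bytes reserved on the heap, always a multiple of the chunk size. */
    long allocatedBytes() {
        return (long) chunks.size() * chunkSize;
    }
}
```

Under these assumptions, a content stream of a few hundred bytes reserves a full 16384 bytes with the old chunk size but only 1024 bytes with the new one; the trade-off is that large streams need more, smaller chunks.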
[jira] [Comment Edited] (PDFBOX-2694) Evaluate twelvemonkeys for JPEG
[ https://issues.apache.org/jira/browse/PDFBOX-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341500#comment-14341500 ]

Tilman Hausherr edited comment on PDFBOX-2694 at 3/2/15 5:16 PM:

- 176936.pdf, p. 154, second file: ArrayIndexOutOfBoundsException only with twelvemonkeys, issue opened: https://github.com/haraldk/TwelveMonkeys/issues/102
  Same stack trace for 258980.pdf, 307454.pdf, 410598.pdf, 452570.pdf, 464989.pdf, 465440.pdf, 592024.pdf, 701637.pdf, 709032.pdf, 736239.pdf, 751004.pdf.
- 573636.pdf cannot be rendered with the sun jpeg reader: "Numbers of source Raster bands and source color space components do not match". Can be rendered with twelvemonkeys.
- 485945.pdf fails with both, but twelvemonkeys might fail gracefully in the future. https://github.com/haraldk/TwelveMonkeys/issues/101
- Same for 178360.pdf, however that file is really badly damaged. https://github.com/haraldk/TwelveMonkeys/issues/103
- Kevin's confidential file fails with the sun reader: "CMMException: Invalid image format", and succeeds with twelvemonkeys.

was (Author: tilman):
- 176936.pdf, p. 154, second file: ArrayIndexOutOfBoundsException only with twelvemonkeys, issue opened: https://github.com/haraldk/TwelveMonkeys/issues/102
  Same stack trace for 258980.pdf, 307454.pdf, 410598.pdf, 452570.pdf, 464989.pdf, 465440.pdf
- 573636.pdf cannot be rendered with the sun jpeg reader: "Numbers of source Raster bands and source color space components do not match". Can be rendered with twelvemonkeys.
- 485945.pdf fails with both, but twelvemonkeys might fail gracefully in the future. https://github.com/haraldk/TwelveMonkeys/issues/101
- Same for 178360.pdf, however that file is really badly damaged. https://github.com/haraldk/TwelveMonkeys/issues/103
- Kevin's confidential file fails with the sun reader: "CMMException: Invalid image format", and succeeds with twelvemonkeys.

> Evaluate twelvemonkeys for JPEG > --- > > Key: PDFBOX-2694 > URL: https://issues.apache.org/jira/browse/PDFBOX-2694 > Project: PDFBox > Issue Type: Task > Components: Parsing >Affects Versions: 2.0.0 >Reporter: Tilman Hausherr >Priority: Minor > Labels: jpeg, twelvemonkeys > Attachments: 176936-p154-2.jpg, 176936-p154.pdf, 485945.pdf, > 573636.pdf > > > While working on PDFBOX-2128 I decided to try twelvemonkeys for JPEG reading > and the first impression is excellent. It seems that the author is making a > big effort in handling even the most broken JPEG files (similar to what we do > with PDFs). This issue is to collect problem files and discuss all > experiences and decide whether we should bundle twelvemonkeys with PDFBox or > rather just recommend it as an optional solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
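When comparing the sun jpeg reader with twelvemonkeys as in the list above, it helps to know which JPEG readers are actually registered in the running JVM. The following is a minimal sketch using the standard `javax.imageio` API; it assumes that twelvemonkeys registers itself as an ImageIO plugin once its jars are on the classpath. The class name `JpegReaderListing` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

/** Hypothetical helper that lists the ImageIO readers available for JPEG. */
class JpegReaderListing {

    /** Returns the class names of all ImageIO readers registered for the "jpeg" format. */
    static List<String> jpegReaderClassNames() {
        List<String> names = new ArrayList<>();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName("jpeg");
        while (readers.hasNext()) {
            names.add(readers.next().getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints one reader class per line, in the order ImageIO returns them.
        for (String name : jpegReaderClassNames()) {
            System.out.println(name);
        }
    }
}
```

Running this with and without the additional jars on the classpath shows whether a second JPEG reader has been picked up; which reader ImageIO hands out first is a separate question that this sketch does not settle.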
[jira] [Commented] (PDFBOX-2694) Evaluate twelvemonkeys for JPEG
[ https://issues.apache.org/jira/browse/PDFBOX-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343405#comment-14343405 ] Tilman Hausherr commented on PDFBOX-2694: - Harald Kuhr has fixed all three issues this morning! I will now rerun all the preflight mass tests, and also test the jpeg files from the digitalcorpora site. > Evaluate twelvemonkeys for JPEG > --- > > Key: PDFBOX-2694 > URL: https://issues.apache.org/jira/browse/PDFBOX-2694 > Project: PDFBox > Issue Type: Task > Components: Parsing >Affects Versions: 2.0.0 >Reporter: Tilman Hausherr >Priority: Minor > Labels: jpeg, twelvemonkeys > Attachments: 176936-p154-2.jpg, 176936-p154.pdf, 485945.pdf, > 573636.pdf > > > While working on PDFBOX-2128 I decided to try twelvemonkeys for JPEG reading > and the first impression is excellent. It seems that the author is making a > big effort in handling even the most broken JPEG files (similar to what we do > with PDFs). This issue is to collect problem files and discuss all > experiences and decide whether we should bundle twelvemonkeys with PDFBox or > rather just recommend it as an optional solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-2695) Iterate PDOutlineNode children
[ https://issues.apache.org/jira/browse/PDFBOX-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrea Vacondio updated PDFBOX-2695: Description: Give an outline item, I need to walk through all its children. ??The items at each level of the hierarchy form a linked list, chained together through their Prev and Next entries and accessed through the First and Last entries in the parent item?? so I created a simple patch to allow this kind of code: {code} if(node !=null){ for (PDOutlineItem current : node.children()) { //do something with the } } {code} Given an item, PDOutlineNode.children returns an Iterable that walks through the children until there is no NEXT or NEXT is equals to the starting element. was: Give an outline item, I need to walk through all its children. ??The items at each level of the hierarchy form a linked list, chained together through their Prev and Next entries and accessed through the First and Last entries in the parent item?? so I created a simple patch to allow this kind of code: {code} if(node !=null){ for (PDOutlineItem current : node.children()) { //do something with the } } {code} Given an item, PDOutlineNode.children returns an iterator that walks through the children until there is no NEXT or NEXT is equals to the starting element. > Iterate PDOutlineNode children > -- > > Key: PDFBOX-2695 > URL: https://issues.apache.org/jira/browse/PDFBOX-2695 > Project: PDFBox > Issue Type: Improvement > Components: PDModel >Affects Versions: 2.0.0 >Reporter: Andrea Vacondio >Priority: Minor > Labels: outline > Fix For: 2.0.0 > > Attachments: iterable_children.diff > > > Give an outline item, I need to walk through all its children. ??The items at > each level of the hierarchy form a linked list, chained together through > their Prev and Next entries and accessed through the First and Last entries > in the parent item?? 
so I created a simple patch to allow this kind of code: > {code} > if(node !=null){ >for (PDOutlineItem current : node.children()) { > //do something with the >} > } > {code} > Given an item, PDOutlineNode.children returns an Iterable that walks through > the children until there is no NEXT or NEXT is equals to the starting element. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-2695) Iterate PDOutlineNode children
[ https://issues.apache.org/jira/browse/PDFBOX-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrea Vacondio updated PDFBOX-2695: Attachment: iterable_children.diff > Iterate PDOutlineNode children > -- > > Key: PDFBOX-2695 > URL: https://issues.apache.org/jira/browse/PDFBOX-2695 > Project: PDFBox > Issue Type: Improvement > Components: PDModel >Affects Versions: 2.0.0 >Reporter: Andrea Vacondio >Priority: Minor > Labels: outline > Fix For: 2.0.0 > > Attachments: iterable_children.diff > > > Give an outline item, I need to walk through all its children. ??The items at > each level of the hierarchy form a linked list, chained together through > their Prev and Next entries and accessed through the First and Last entries > in the parent item?? so I created a simple patch to allow this kind of code: > {code} > if(node !=null){ >for (PDOutlineItem current : node.children()) { > //do something with the >} > } > {code} > Given an item, PDOutlineNode.children returns an iterator that walks through > the children until there is no NEXT or NEXT is equals to the starting element. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Created] (PDFBOX-2695) Iterate PDOutlineNode children
Andrea Vacondio created PDFBOX-2695:

Summary: Iterate PDOutlineNode children
Key: PDFBOX-2695
URL: https://issues.apache.org/jira/browse/PDFBOX-2695
Project: PDFBox
Issue Type: Improvement
Components: PDModel
Affects Versions: 2.0.0
Reporter: Andrea Vacondio
Priority: Minor
Fix For: 2.0.0

Given an outline item, I need to walk through all its children. ??The items at each level of the hierarchy form a linked list, chained together through their Prev and Next entries and accessed through the First and Last entries in the parent item?? so I created a simple patch to allow this kind of code:
{code}
if (node != null) {
    for (PDOutlineItem current : node.children()) {
        //do something with the
    }
}
{code}
Given an item, PDOutlineNode.children returns an iterator that walks through the children until there is no NEXT or NEXT is equal to the starting element.
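The traversal described in this issue (follow the Next links from the first child, and stop when there is no next item or the chain loops back to the starting element) can be sketched independently of PDFBox. The class `OutlineNodeSketch` and its fields are hypothetical stand-ins; this is not the attached patch and not the PDFBox API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for an outline node; not the PDFBox class. */
class OutlineNodeSketch {
    final String title;
    OutlineNodeSketch first; // first child, playing the role of the First entry
    OutlineNodeSketch next;  // next sibling, playing the role of the Next entry

    OutlineNodeSketch(String title) {
        this.title = title;
    }

    /** Appends a child at the end of the sibling chain. */
    OutlineNodeSketch add(OutlineNodeSketch child) {
        if (first == null) {
            first = child;
        } else {
            OutlineNodeSketch last = first;
            while (last.next != null) {
                last = last.next;
            }
            last.next = child;
        }
        return this;
    }

    /** Walks the children until there is no next sibling or the chain returns to the first child. */
    Iterable<OutlineNodeSketch> children() {
        final OutlineNodeSketch start = first;
        return () -> new Iterator<OutlineNodeSketch>() {
            private OutlineNodeSketch current = start;

            @Override
            public boolean hasNext() {
                return current != null;
            }

            @Override
            public OutlineNodeSketch next() {
                if (current == null) {
                    throw new NoSuchElementException();
                }
                OutlineNodeSketch result = current;
                OutlineNodeSketch following = current.next;
                // Stop when the chain would return to the starting element.
                current = (following == start) ? null : following;
                return result;
            }
        };
    }

    /** Collects the titles of all direct children, in chain order. */
    static List<String> titles(OutlineNodeSketch parent) {
        List<String> out = new ArrayList<>();
        for (OutlineNodeSketch child : parent.children()) {
            out.add(child.title);
        }
        return out;
    }
}
```

Returning an `Iterable` rather than a bare iterator is what makes the for-each loop from the description possible. Note that the guard only catches a chain that loops back to the first child; a loop that re-enters the chain at a later sibling would still iterate forever in this sketch.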
[jira] [Closed] (PDFBOX-1109) Data corruption related to scratch file use
[ https://issues.apache.org/jira/browse/PDFBOX-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-1109.

This won't happen any more: starting with 2.0.0, PDFBox doesn't use a given scratch file but its own one if needed.

> Data corruption related to scratch file use
> -------------------------------------------
>
> Key: PDFBOX-1109
> URL: https://issues.apache.org/jira/browse/PDFBOX-1109
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing, PDModel
> Affects Versions: 1.8.7, 2.0.0
> Reporter: Stefan Mücke
> Assignee: Andreas Lehmkühler
> Priority: Critical
> Fix For: 2.0.0
> Attachments: COSDocument.java, PagedMultiRandomAccessFile.java, PagedMultiRandomAccessFileTest.java
>
> PDFBox uses a scratch file to reduce memory consumption. However, there is no mechanism that prevents two PDStreams from writing to the scratch file at the same time. When this happens, the resulting PDF contains garbage in some streams. This problem occurred several times to me (e.g. when writing to an image stream while constructing a page).
>
> Reproducing the bug
> *******************
> One can easily reproduce the bug. Open file AddImageToPDF.java and move the following line:
>     PDPageContentStream contentStream =
>         new PDPageContentStream(doc, page, true, true);
> immediately after the line in which the PDPage object is fetched:
>     PDPage page =
>         (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 );
> With this modification, one will still get a PDF file, but Acrobat Reader will report that the image could not be processed. BTW, the files AddImageToPDF.java and ImageToPDF.java are almost identical. One of them should be deleted.
>
> Bug-Fix
> *******
> The problem can be solved by using a scratch file that is divided into pages (e.g. of 4 KB). Each PDStream in the scratch file is then associated with a list of pages. This list grows as more data is written to the stream.
> The bug fix requires minimal changes to the existing code. The very nice RandomAccess interface made this very easy.
> Here is what needs to be changed:
> - Add the attached "PagedMultiRandomAccessFile.java" to the I/O package
> - Change COSDocument.getScratchFile() to return a RandomAccess instance provided by PagedMultiRandomAccessFile:
>     private PagedMultiRandomAccessFile scratchFile = null;
>     [...]
>     public COSDocument(File scratchDir) throws IOException {
>         tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir);
>         scratchFile = new PagedMultiRandomAccessFile(
>             new RandomAccessFile(tmpFile, "rw"));
>     }
>     public COSDocument(RandomAccess file) {
>         // scratchFile = file;
>         throw new RuntimeException("Not yet implemented."); //$NON-NLS-1$
>     }
>     [...]
>     /**
>      * Returns a new scratch file.
>      *
>      * @return the newly created scratch file
>      */
>     public RandomAccess getScratchFile() {
>         return scratchFile.getNewRandomAcess();
>     }
> One of the COSDocument constructors takes a RandomAccess file. This constructor is only called in a single location, namely, in method PDFParser.parse(). I am not sure if the RandomAccess parameter provided here is really a scratch file. Someone will have to decide what to do with this one.
> The code has been thoroughly tested and has been used in the production of several books without any problems.
> In the attachment please find the code. There is also a JUnit test that was used to debug my code. I have added an Apache license header and adopted PDFBox's code style. Feel free to make any desired changes.
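The paging idea proposed in this issue (one shared store divided into fixed-size pages, with each stream owning a growing list of pages) can be sketched as follows. This is only an illustration under simplifying assumptions: the shared store is an in-memory list of byte arrays rather than a file, and the names `PagedStoreSketch` and `PagedStream` are hypothetical. It is not the attached PagedMultiRandomAccessFile.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical paged store shared by several streams; illustrates the idea only. */
class PagedStoreSketch {
    static final int PAGE_SIZE = 4096; // 4 KB pages, as suggested in the issue

    // Stands in for the scratch file: page i would live at file offset i * PAGE_SIZE.
    private final List<byte[]> pages = new ArrayList<>();

    /** One logical stream: it owns a growing list of page indices in the shared store. */
    final class PagedStream {
        private final List<Integer> pageIndices = new ArrayList<>();
        private long length;

        /** Appends data, claiming a fresh page from the shared store whenever the last one is full. */
        void write(byte[] data) {
            for (byte b : data) {
                int offsetInPage = (int) (length % PAGE_SIZE);
                if (offsetInPage == 0) {
                    pages.add(new byte[PAGE_SIZE]);
                    pageIndices.add(pages.size() - 1);
                }
                int page = pageIndices.get(pageIndices.size() - 1);
                pages.get(page)[offsetInPage] = b;
                length++;
            }
        }

        /** Reads the stream back by visiting its own pages in order. */
        byte[] readAll() {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            long remaining = length;
            for (int page : pageIndices) {
                int n = (int) Math.min(PAGE_SIZE, remaining);
                out.write(pages.get(page), 0, n);
                remaining -= n;
            }
            return out.toByteArray();
        }
    }

    /** Creates a new, empty stream backed by this store. */
    PagedStream newStream() {
        return new PagedStream();
    }
}
```

Because every stream writes only into pages it has claimed for itself, two streams can be written alternately without overwriting each other, which is the failure mode described above. The sketch is not thread-safe and never frees pages.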
[jira] [Resolved] (PDFBOX-1109) Data corruption related to scratch file use
[ https://issues.apache.org/jira/browse/PDFBOX-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler resolved PDFBOX-1109. Resolution: Invalid > Data corruption related to scratch file use > --- > > Key: PDFBOX-1109 > URL: https://issues.apache.org/jira/browse/PDFBOX-1109 > Project: PDFBox > Issue Type: Bug > Components: Parsing, PDModel >Affects Versions: 1.8.7, 2.0.0 >Reporter: Stefan Mücke >Assignee: Andreas Lehmkühler >Priority: Critical > Fix For: 2.0.0 > > Attachments: COSDocument.java, PagedMultiRandomAccessFile.java, > PagedMultiRandomAccessFileTest.java > > > PDFBox uses a scratch file to reduce memory consumption. However, there is no > mechanism that prevents two PDStreams from writing to the scratch file at the > same time. When this happens, the resulting PDF contains garbage in some > streams. This problem occurred several times to me (e.g. when writing to an > image stream while constructing a page). > Reproducing the bug > *** > One can easily reproduce the bug. Open file AddImageToPDF.java and move the > following line: > PDPageContentStream contentStream = > new PDPageContentStream(doc, page, true, true); > immediately after the line in which the PDPage object is fetched: > PDPage page = > (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 ); > > With this modification, one will still get a PDF file, but Acrobat Reader > will report that the image could not be processed. BTW, the files > AddImageToPDF.java and ImageToPDF.java are almost identical. One of them > should be deleted. > Bug-Fix > *** > The problem can be solved by using a scratch file that is divided into pages > (e.g. of 4 KB). Each PDStream in the scratch file is then associated with a > list of pages. This list grows as more data is written to the stream. > The bug fix requires minimal changes to the existing code. The very nice > RandomAccess interface made this very easy. 
> Here is what needs to be changed: > - Add the attached "PagedMultiRandomAccessFile.java" to the I/O package > - Change COSDocument.getScratchFile() to return a RandomAccess > instance provided by PagedMultiRandomAccessFile: > private PagedMultiRandomAccessFile scratchFile = null; > [...] > public COSDocument(File scratchDir) throws IOException { > tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir); > scratchFile = new PagedMultiRandomAccessFile( > new RandomAccessFile(tmpFile, "rw")); > } > public COSDocument(RandomAccess file) { > // scratchFile = file; > throw new RuntimeException("Not yet implemented."); > //$NON-NLS-1$ > } > > [...] > /** >* Returns a new scratch file. >* >* @return the newly created scratch file >*/ > public RandomAccess getScratchFile() { > return scratchFile.getNewRandomAcess(); > } > One of the COSDocument constructors takes a RandomAccess file. This > constructor is only called in a single location, namely, in method > PDFParser.parse(). I am not sure if the RandomAccess parameter provided here > is really a scratch file. Someone will have to decide what to do with this > one. > The code has been throughly tested and has been used in the production of > several books without any problems. > In the attachment please find the code. There is also a JUnit test that was > used to debug my code. I have added an Apache license header and adopted > PDFBox's code style. Feel free to make any desired changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Resolved] (PDFBOX-1822) Signature byte range is Invalid
[ https://issues.apache.org/jira/browse/PDFBOX-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-1822.
    Resolution: Fixed

> Signature byte range is Invalid
> -------------------------------
>
> Key: PDFBOX-1822
> URL: https://issues.apache.org/jira/browse/PDFBOX-1822
> Project: PDFBox
> Issue Type: Bug
> Components: Signing
> Affects Versions: 1.8.3, 1.8.4, 2.0.0
> Reporter: vakhtang koroghlishvili
> Assignee: Andreas Lehmkühler
> Priority: Blocker
> Fix For: 2.0.0
> Attachments: SignatureFileSet-PDFBOX-1.8.2_TO_1.8.4-SNAPSHOT_SEQ_AND_NONSEQ.zip, araxis-merge - compare two document.jpg, damaged-sig.jpg, unsigned-signed.pdf, unsigned.pdf, unsigned_signed_fix.pdf
>
> Someone sent me an unsigned PDF document that he wanted to have signed. When I try to sign it (using PDFBox), I run into a problem: after signing, Adobe Reader tells me "The signature byte range is invalid".
> I will attach the original and the signed document.
> I think it is a PDFBox parser error; other signature libraries sign the document without problems. I'm searching for the problem at the moment, in order to fix it.