date:20150302

[jira] [Commented] (PDFBOX-2694) Evaluate twelvemonkeys for JPEG

2015-03-02 Thread JIRA


[ 
https://issues.apache.org/jira/browse/PDFBOX-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344698#comment-14344698
 ] 

Andreas Lehmkühler commented on PDFBOX-2694:


Looks promising. How about the TIFF support, will you have a look too? Based on 
our past experiences it seems to be a good idea to get rid of all problematic 
JRE-dependencies such as ImageIO, especially if the developer is as responsive 
as Harald is. :-)

> Evaluate twelvemonkeys for JPEG
> ---
>
> Key: PDFBOX-2694
> URL: https://issues.apache.org/jira/browse/PDFBOX-2694
> Project: PDFBox
>  Issue Type: Task
>  Components: Parsing
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>Priority: Minor
>  Labels: jpeg, twelvemonkeys
> Attachments: 176936-p154-2.jpg, 176936-p154.pdf, 485945.pdf, 
> 573636.pdf
>
>
> While working on PDFBOX-2128 I decided to try twelvemonkeys for JPEG reading 
> and the first impression is excellent. It seems that the author is making a 
> big effort in handling even the most broken JPEG files (similar to what we do 
> with PDFs). This issue is to collect problem files and discuss all 
> experiences and decide whether we should bundle twelvemonkeys with PDFBox or 
> rather just recommend it as an optional solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

[jira] [Resolved] (PDFBOX-2695) Iterate PDOutlineNode children

2015-03-02 Thread Tilman Hausherr (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-2695.
-
Resolution: Fixed
  Assignee: Tilman Hausherr

Good idea - thanks!

> Iterate PDOutlineNode children
> --
>
> Key: PDFBOX-2695
> URL: https://issues.apache.org/jira/browse/PDFBOX-2695
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: Andrea Vacondio
>Assignee: Tilman Hausherr
>Priority: Minor
>  Labels: outline
> Fix For: 2.0.0
>
> Attachments: iterable_children.diff
>
>
> Given an outline item, I need to walk through all its children. ??The items at 
> each level of the hierarchy form a linked list, chained together through 
> their Prev and Next entries and accessed through the First and Last entries 
> in the parent item?? so I created a simple patch to allow this kind of code:
> {code}
> if (node != null) {
>     for (PDOutlineItem current : node.children()) {
>         //do something with the
>     }
> }
> {code}
> Given an item, PDOutlineNode.children returns an Iterable that walks through 
> the children until there is no NEXT or NEXT equals the starting element.
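
The traversal described in the issue can be sketched roughly as follows. This is only an illustration under stated assumptions: the Item class and its first/next fields are hypothetical stand-ins, not the PDFBox classes, and the actual change is the one in iterable_children.diff. The sketch stops when there is no next sibling or when the chain points back to the starting element; it does not try to detect other kinds of cycles.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an outline item; not the PDFBox class.
class Item {
    final String title;
    Item next;   // next sibling, or null at the end of the list
    Item first;  // first child, or null if there are no children

    Item(String title) {
        this.title = title;
    }

    // Walks the sibling chain starting at 'first'. Iteration ends when there
    // is no next sibling or when the chain loops back to the starting element.
    Iterable<Item> children() {
        final Item start = first;
        return () -> new Iterator<Item>() {
            private Item current = start;

            @Override
            public boolean hasNext() {
                return current != null;
            }

            @Override
            public Item next() {
                if (current == null) {
                    throw new NoSuchElementException();
                }
                Item result = current;
                Item candidate = current.next;
                // Guard against a list that points back to its first element.
                current = (candidate == start) ? null : candidate;
                return result;
            }
        };
    }
}

class OutlineWalk {
    // Collects the titles of all direct children of 'parent'.
    static List<String> titles(Item parent) {
        List<String> out = new ArrayList<>();
        for (Item child : parent.children()) {
            out.add(child.title);
        }
        return out;
    }

    public static void main(String[] args) {
        Item root = new Item("root");
        Item a = new Item("A");
        Item b = new Item("B");
        root.first = a;
        a.next = b;
        b.next = a; // malformed: loops back to the first child
        System.out.println(titles(root)); // prints [A, B]
    }
}
```

Returning an Iterable rather than an Iterator is what makes the for-each form in the issue possible.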




[jira] [Commented] (PDFBOX-2695) Iterate PDOutlineNode children

2015-03-02 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343787#comment-14343787
 ] 

ASF subversion and git services commented on PDFBOX-2695:
-

Commit 1663436 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1663436 ]

PDFBOX-2695: Iterate PDOutlineNode children, by Andrea Vacondio

> Iterate PDOutlineNode children
> --
>
> Key: PDFBOX-2695
> URL: https://issues.apache.org/jira/browse/PDFBOX-2695
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: Andrea Vacondio
>Assignee: Tilman Hausherr
>Priority: Minor
>  Labels: outline
> Fix For: 2.0.0
>
> Attachments: iterable_children.diff
>
>
> Given an outline item, I need to walk through all its children. ??The items at 
> each level of the hierarchy form a linked list, chained together through 
> their Prev and Next entries and accessed through the First and Last entries 
> in the parent item?? so I created a simple patch to allow this kind of code:
> {code}
> if (node != null) {
>     for (PDOutlineItem current : node.children()) {
>         //do something with the
>     }
> }
> {code}
> Given an item, PDOutlineNode.children returns an Iterable that walks through 
> the children until there is no NEXT or NEXT equals the starting element.




[jira] [Commented] (PDFBOX-1130) ExtractText -html doesn't always close the tags it opens

2015-03-02 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343741#comment-14343741
 ] 

Hudson commented on PDFBOX-1130:


SUCCESS: Integrated in tika-trunk-jdk1.7 #524 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/524/])
TIKA-758 clean up after remembering PDFBOX-1130 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1663424)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java


> ExtractText -html doesn't always close the  tags it opens
> 
>
> Key: PDFBOX-1130
> URL: https://issues.apache.org/jira/browse/PDFBOX-1130
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Andreas Lehmkühler
>Priority: Minor
> Fix For: 1.8.0
>
> Attachments: 86.pdf, PDFBOX-1130.patch
>
>
> I have a test document (same one on PDFBOX-1129), which when run through 
> ExtractText -html, extracts the page number for each page, however in each 
> case the page number looks like:
> NText of page N...
> Ie, the  tag for the page number wasn't closed.
> Maybe related: if I run ExtractText without html, there is no space after 
> the page number and before the next word, i.e. I see words like 1Massachusetts, 
> 2Course, 3also, 4the.




[jira] [Resolved] (PDFBOX-2436) Parsing error

2015-03-02 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-2436.

   Resolution: Fixed
Fix Version/s: 2.0.0

Works fine using the non-sequential parser after solving PDFBOX-2515. The fix 
is limited to the trunk version.

Thanks for the report.

> Parsing error
> -
>
> Key: PDFBOX-2436
> URL: https://issues.apache.org/jira/browse/PDFBOX-2436
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.7
> Environment: Java 8
>Reporter: Jan Vomlel
>Assignee: Andreas Lehmkühler
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: h1.pdf
>
>
> The PDDocument.load method returns without an exception, but the document 
> model is incomplete.
> You can reproduce it by running this code on the attached file:
> {code}
> PDDocument document = PDDocument.load(new File(inFN), null);
> int size = document.getSignatureDictionaries().size();
> System.out.println("Signatures count:" +size);
> {code}
> The output is 1, but there are two signatures in the PDF document.
> PDFParser.class produces an IOException and ignores it on line 196. The rest 
> of the document is ignored.
> The loadNoSeq method works, but I cannot use it, because I want to attach a new 
> signature.
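
The failure mode reported here (an exception swallowed mid-parse, leaving a silently incomplete model) can be illustrated with a small self-contained sketch. It does not use PDFBox at all: parseEntry, parseIgnoringErrors and parseStrict are hypothetical names, and an empty string simply stands in for a malformed part of the file.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SwallowedError {
    // Hypothetical parser step: fails on a malformed (here: empty) entry.
    static String parseEntry(String raw) throws IOException {
        if (raw.isEmpty()) {
            throw new IOException("malformed entry");
        }
        return raw.trim();
    }

    // Mirrors the reported behaviour: the first error ends parsing silently,
    // so the caller receives a partial result without any warning.
    static List<String> parseIgnoringErrors(List<String> raw) {
        List<String> parsed = new ArrayList<>();
        try {
            for (String r : raw) {
                parsed.add(parseEntry(r));
            }
        } catch (IOException ignored) {
            // Error dropped; everything after the bad entry is lost.
        }
        return parsed;
    }

    // Alternative: propagate the failure so the caller can react to it.
    static List<String> parseStrict(List<String> raw) throws IOException {
        List<String> parsed = new ArrayList<>();
        for (String r : raw) {
            parsed.add(parseEntry(r));
        }
        return parsed;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("sig1", "", "sig2");
        System.out.println("Entries parsed: " + parseIgnoringErrors(raw).size());
        try {
            parseStrict(raw);
        } catch (IOException e) {
            System.out.println("Strict parse failed: " + e.getMessage());
        }
    }
}
```

With the lenient variant only one of the two valid entries is returned, which matches the "1 instead of 2 signatures" symptom in the report.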




[jira] [Resolved] (PDFBOX-2527) IOException: Negative seek offset in NonSequentialPDFParser

2015-03-02 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-2527.

Resolution: Fixed

I'm finished at this point. I am discontinuing the work on rebuilding corrupt 
encrypted files, as it is far more complicated than expected. We can open a 
new issue if someone comes up with a real sample (I've created mine by 
manipulating a well-formed one).
Thanks to everybody for the help/input/report.

> IOException: Negative seek offset in NonSequentialPDFParser
> ---
>
> Key: PDFBOX-2527
> URL: https://issues.apache.org/jira/browse/PDFBOX-2527
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.8, 2.0.0
>Reporter: Tilman Hausherr
>Assignee: Andreas Lehmkühler
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: PDFBOX-2527-069020.pdf
>
>
> {code}
> Exception in thread "main" java.io.IOException: Negative seek offset
>     at java.io.RandomAccessFile.seek(Native Method)
>     at org.apache.pdfbox.io.RandomAccessBufferedFileInputStream.seek(RandomAccessBufferedFileInputStream.java:116)
>     at org.apache.pdfbox.io.PushBackInputStream.seek(PushBackInputStream.java:234)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:492)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:1013)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:951)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:897)
>     at org.apache.pdfbox.tools.PDFReader.parseDocument(PDFReader.java:375)
>     at org.apache.pdfbox.tools.PDFReader.openPDFFile(PDFReader.java:340)
>     at org.apache.pdfbox.tools.PDFReader.main(PDFReader.java:326)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:80)
> {code}
> This happens with several malformed PDFs from the test set in TIKA-1442. 
> These files (303385, 069020, 303385, 742141, 982996) all have some trash at 
> the end.




[jira] [Commented] (PDFBOX-2527) IOException: Negative seek offset in NonSequentialPDFParser

2015-03-02 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343555#comment-14343555
 ] 

ASF subversion and git services commented on PDFBOX-2527:
-

Commit 1663394 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1663394 ]

PDFBOX-2527: removed encryption dictionary detection

> IOException: Negative seek offset in NonSequentialPDFParser
> ---
>
> Key: PDFBOX-2527
> URL: https://issues.apache.org/jira/browse/PDFBOX-2527
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.8, 2.0.0
>Reporter: Tilman Hausherr
>Assignee: Andreas Lehmkühler
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: PDFBOX-2527-069020.pdf
>
>
> {code}
> Exception in thread "main" java.io.IOException: Negative seek offset
>     at java.io.RandomAccessFile.seek(Native Method)
>     at org.apache.pdfbox.io.RandomAccessBufferedFileInputStream.seek(RandomAccessBufferedFileInputStream.java:116)
>     at org.apache.pdfbox.io.PushBackInputStream.seek(PushBackInputStream.java:234)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:492)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:1013)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:951)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:897)
>     at org.apache.pdfbox.tools.PDFReader.parseDocument(PDFReader.java:375)
>     at org.apache.pdfbox.tools.PDFReader.openPDFFile(PDFReader.java:340)
>     at org.apache.pdfbox.tools.PDFReader.main(PDFReader.java:326)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:80)
> {code}
> This happens with several malformed PDFs from the test set in TIKA-1442. 
> These files (303385, 069020, 303385, 742141, 982996) all have some trash at 
> the end.




[jira] [Commented] (PDFBOX-2301) RandomAccessBuffer consumes too much memory.

2015-03-02 Thread JIRA


[ 
https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343505#comment-14343505
 ] 

Andreas Lehmkühler commented on PDFBOX-2301:


{quote}
The analysis was not correct. RandomAccessBuffer allocated 16KB without 
condition when it needed to use just some hundred bytes for small content 
stream object.
{quote}
I've decreased the chunk size to 1024
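
To see why the chunk size matters for small content streams, here is a rough arithmetic sketch. It assumes a buffer that grows in fixed-size chunks and always allocates the first chunk up front; the class and method names are hypothetical and are not taken from RandomAccessBuffer itself.

```java
class ChunkMath {
    // Bytes reserved by a buffer that grows in fixed-size chunks and always
    // allocates at least one chunk (an assumption made for this sketch).
    static long allocatedBytes(long contentLength, int chunkSize) {
        long chunks = Math.max(1, (contentLength + chunkSize - 1) / chunkSize);
        return chunks * chunkSize;
    }

    public static void main(String[] args) {
        long small = 300; // "some hundred bytes" of content stream data
        System.out.println(allocatedBytes(small, 16384)); // prints 16384
        System.out.println(allocatedBytes(small, 1024));  // prints 1024
    }
}
```

Under these assumptions a 300-byte stream reserves 16 times less memory with the smaller chunk size, at the cost of more chunks for large streams.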

> RandomAccessBuffer consumes too much memory.
> 
>
> Key: PDFBOX-2301
> URL: https://issues.apache.org/jira/browse/PDFBOX-2301
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.8.6, 2.0.0
>Reporter: gee
>Assignee: Andreas Lehmkühler
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: clone.diff, clone2.diff, clone3.diff, clone4.diff
>
>
> RandomAccessBuffer holds the uncompressed image during operation, because 
> that is exactly what pdfbox ExtractImages does.
> But holding the uncompressed image in memory instead of the compressed one 
> consumes too much memory, and the same applies to the many PDF XObjects that 
> can use a filter to compress themselves. It would be good if pdfbox provided 
> an option that reverts to the COSObject state just before the RandomAccess 
> object was created (the state in which the PDF XObject stream has been parsed 
> and the COSDictionary objects haven't been created because the user hasn't 
> requested them using the get() method). This is a crucial feature so that 
> pdfbox can analyze huge PDF files (>100MB).
> In the current source, one must close the COSStream unless it is required 
> (and I know a closed stream cannot be reopened).
> Class Name                                                         | Shallow Heap | Retained Heap
> -------------------------------------------------------------------------------------------------
> org.apache.pdfbox.cos.COSObject @ 0x5ad4940                        |           24 |     8,187,264
> |- class org.apache.pdfbox.cos.COSObject @ 0x58c4020               |            0 |             0
> |- generationNumber org.apache.pdfbox.cos.COSInteger @ 0x5ad0080   |           24 |            24
> |- baseObject org.apache.pdfbox.cos.COSStream @ 0x5b25ea0          |           32 |     8,187,216
> |  |- class org.apache.pdfbox.cos.COSStream @ 0x58c3e00            |            8 |             8
> |  |- items java.util.LinkedHashMap @ 0x5b2a0f0                    |           56 |           552
> |  |- file org.apache.pdfbox.io.RandomAccessBuffer @ 0x5b2a128     |           48 |     8,186,528
> |  |  |- class org.apache.pdfbox.io.RandomAccessBuffer @ 0x5ad2b00
>

[jira] [Commented] (PDFBOX-2301) RandomAccessBuffer consumes too much memory.

2015-03-02 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343502#comment-14343502
 ] 

ASF subversion and git services commented on PDFBOX-2301:
-

Commit 1663378 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1663378 ]

PDFBOX-2301: use 1024 instead of 16384 bytes as chunk size

> RandomAccessBuffer consumes too much memory.
> 
>
> Key: PDFBOX-2301
> URL: https://issues.apache.org/jira/browse/PDFBOX-2301
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 1.8.6, 2.0.0
>Reporter: gee
>Assignee: Andreas Lehmkühler
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: clone.diff, clone2.diff, clone3.diff, clone4.diff
>
>
> RandomAccessBuffer holds the uncompressed image during operation, because 
> that is exactly what pdfbox ExtractImages does.
> But holding the uncompressed image in memory instead of the compressed one 
> consumes too much memory, and the same applies to the many PDF XObjects that 
> can use a filter to compress themselves. It would be good if pdfbox provided 
> an option that reverts to the COSObject state just before the RandomAccess 
> object was created (the state in which the PDF XObject stream has been parsed 
> and the COSDictionary objects haven't been created because the user hasn't 
> requested them using the get() method). This is a crucial feature so that 
> pdfbox can analyze huge PDF files (>100MB).
> In the current source, one must close the COSStream unless it is required 
> (and I know a closed stream cannot be reopened).
> Class Name                                                         | Shallow Heap | Retained Heap
> -------------------------------------------------------------------------------------------------
> org.apache.pdfbox.cos.COSObject @ 0x5ad4940                        |           24 |     8,187,264
> |- class org.apache.pdfbox.cos.COSObject @ 0x58c4020               |            0 |             0
> |- generationNumber org.apache.pdfbox.cos.COSInteger @ 0x5ad0080   |           24 |            24
> |- baseObject org.apache.pdfbox.cos.COSStream @ 0x5b25ea0          |           32 |     8,187,216
> |  |- class org.apache.pdfbox.cos.COSStream @ 0x58c3e00            |            8 |             8
> |  |- items java.util.LinkedHashMap @ 0x5b2a0f0                    |           56 |           552
> |  |- file org.apache.pdfbox.io.RandomAccessBuffer @ 0x5b2a128     |           48 |     8,186,528
> |  |  |- class org.apache.pdfbox.io.RandomAccessBuffer @ 0x5ad2b00
>

[jira] [Comment Edited] (PDFBOX-2694) Evaluate twelvemonkeys for JPEG

2015-03-02 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341500#comment-14341500
 ] 

Tilman Hausherr edited comment on PDFBOX-2694 at 3/2/15 5:16 PM:
-

- 176936.pdf, p. 154, second file: ArrayIndexOutOfBoundsException only with 
twelvemonkeys, issue opened:
https://github.com/haraldk/TwelveMonkeys/issues/102
same stack trace for 258980.pdf, 307454.pdf, 410598.pdf, 452570.pdf, 
464989.pdf, 465440.pdf, 592024.pdf, 701637.pdf, 709032.pdf, 736239.pdf, 
751004.pdf.
- 573636.pdf cannot be rendered with the sun jpeg reader: "Numbers of source 
Raster bands and source color space components do not match". Can be rendered 
with twelvemonkeys.
- 485945.pdf fails with both - but twelvemonkeys might fail gracefully in the 
future.
https://github.com/haraldk/TwelveMonkeys/issues/101
- same for 178360.pdf, however that file is really badly damaged
https://github.com/haraldk/TwelveMonkeys/issues/103
- Kevin's confidential file fails with the sun reader: "CMMException: Invalid 
image format", and succeeds with twelvemonkeys.


was (Author: tilman):
- 176936.pdf, p. 154, second file: ArrayIndexOutOfBoundsException only with 
twelvemonkeys, issue opened:
https://github.com/haraldk/TwelveMonkeys/issues/102
same stack trace for 258980.pdf, 307454.pdf, 410598.pdf, 452570.pdf, 
464989.pdf, 465440.pdf
- 573636.pdf cannot be rendered with the sun jpeg reader: "Numbers of source 
Raster bands and source color space components do not match". Can be rendered 
with twelvemonkeys.
- 485945.pdf fails with both - but twelvemonkeys might fail gracefully in the 
future.
https://github.com/haraldk/TwelveMonkeys/issues/101
- same for 178360.pdf, however that file is really badly damaged
https://github.com/haraldk/TwelveMonkeys/issues/103
- Kevin's confidential file fails with the sun reader: "CMMException: Invalid 
image format", and succeeds with twelvemonkeys.

> Evaluate twelvemonkeys for JPEG
> ---
>
> Key: PDFBOX-2694
> URL: https://issues.apache.org/jira/browse/PDFBOX-2694
> Project: PDFBox
>  Issue Type: Task
>  Components: Parsing
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>Priority: Minor
>  Labels: jpeg, twelvemonkeys
> Attachments: 176936-p154-2.jpg, 176936-p154.pdf, 485945.pdf, 
> 573636.pdf
>
>
> While working on PDFBOX-2128 I decided to try twelvemonkeys for JPEG reading 
> and the first impression is excellent. It seems that the author is making a 
> big effort in handling even the most broken JPEG files (similar to what we do 
> with PDFs). This issue is to collect problem files and discuss all 
> experiences and decide whether we should bundle twelvemonkeys with PDFBox or 
> rather just recommend it as an optional solution.




[jira] [Commented] (PDFBOX-2694) Evaluate twelvemonkeys for JPEG

2015-03-02 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343405#comment-14343405
 ] 

Tilman Hausherr commented on PDFBOX-2694:
-

Harald Kuhr has fixed all three issues this morning!

I will now rerun all the preflight mass tests, and also test the jpeg files 
from the digitalcorpora site.

> Evaluate twelvemonkeys for JPEG
> ---
>
> Key: PDFBOX-2694
> URL: https://issues.apache.org/jira/browse/PDFBOX-2694
> Project: PDFBox
>  Issue Type: Task
>  Components: Parsing
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>Priority: Minor
>  Labels: jpeg, twelvemonkeys
> Attachments: 176936-p154-2.jpg, 176936-p154.pdf, 485945.pdf, 
> 573636.pdf
>
>
> While working on PDFBOX-2128 I decided to try twelvemonkeys for JPEG reading 
> and the first impression is excellent. It seems that the author is making a 
> big effort in handling even the most broken JPEG files (similar to what we do 
> with PDFs). This issue is to collect problem files and discuss all 
> experiences and decide whether we should bundle twelvemonkeys with PDFBox or 
> rather just recommend it as an optional solution.




[jira] [Updated] (PDFBOX-2695) Iterate PDOutlineNode children

2015-03-02 Thread Andrea Vacondio (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrea Vacondio updated PDFBOX-2695:

Description: 
Given an outline item, I need to walk through all its children. ??The items at 
each level of the hierarchy form a linked list, chained together through their 
Prev and Next entries and accessed through the First and Last entries in the 
parent item?? so I created a simple patch to allow this kind of code:
{code}
if (node != null) {
    for (PDOutlineItem current : node.children()) {
        //do something with the
    }
}
{code}

Given an item, PDOutlineNode.children returns an Iterable that walks through 
the children until there is no NEXT or NEXT equals the starting element.

  was:
Given an outline item, I need to walk through all its children. ??The items at 
each level of the hierarchy form a linked list, chained together through their 
Prev and Next entries and accessed through the First and Last entries in the 
parent item?? so I created a simple patch to allow this kind of code:
{code}
if (node != null) {
    for (PDOutlineItem current : node.children()) {
        //do something with the
    }
}
{code}

Given an item, PDOutlineNode.children returns an iterator that walks through 
the children until there is no NEXT or NEXT equals the starting element.


> Iterate PDOutlineNode children
> --
>
> Key: PDFBOX-2695
> URL: https://issues.apache.org/jira/browse/PDFBOX-2695
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: Andrea Vacondio
>Priority: Minor
>  Labels: outline
> Fix For: 2.0.0
>
> Attachments: iterable_children.diff
>
>
> Given an outline item, I need to walk through all its children. ??The items at 
> each level of the hierarchy form a linked list, chained together through 
> their Prev and Next entries and accessed through the First and Last entries 
> in the parent item?? so I created a simple patch to allow this kind of code:
> {code}
> if (node != null) {
>     for (PDOutlineItem current : node.children()) {
>         //do something with the
>     }
> }
> {code}
> Given an item, PDOutlineNode.children returns an Iterable that walks through 
> the children until there is no NEXT or NEXT equals the starting element.




[jira] [Updated] (PDFBOX-2695) Iterate PDOutlineNode children

2015-03-02 Thread Andrea Vacondio (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrea Vacondio updated PDFBOX-2695:

Attachment: iterable_children.diff

> Iterate PDOutlineNode children
> --
>
> Key: PDFBOX-2695
> URL: https://issues.apache.org/jira/browse/PDFBOX-2695
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: Andrea Vacondio
>Priority: Minor
>  Labels: outline
> Fix For: 2.0.0
>
> Attachments: iterable_children.diff
>
>
> Given an outline item, I need to walk through all its children. ??The items at 
> each level of the hierarchy form a linked list, chained together through 
> their Prev and Next entries and accessed through the First and Last entries 
> in the parent item?? so I created a simple patch to allow this kind of code:
> {code}
> if (node != null) {
>     for (PDOutlineItem current : node.children()) {
>         //do something with the
>     }
> }
> {code}
> Given an item, PDOutlineNode.children returns an iterator that walks through 
> the children until there is no NEXT or NEXT equals the starting element.




[jira] [Created] (PDFBOX-2695) Iterate PDOutlineNode children

2015-03-02 Thread Andrea Vacondio (JIRA)

Andrea Vacondio created PDFBOX-2695:
---

 Summary: Iterate PDOutlineNode children
 Key: PDFBOX-2695
 URL: https://issues.apache.org/jira/browse/PDFBOX-2695
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 2.0.0
Reporter: Andrea Vacondio
Priority: Minor
 Fix For: 2.0.0


Given an outline item, I need to walk through all its children. ??The items at 
each level of the hierarchy form a linked list, chained together through their 
Prev and Next entries and accessed through the First and Last entries in the 
parent item?? so I created a simple patch to allow this kind of code:
{code}
if (node != null) {
    for (PDOutlineItem current : node.children()) {
        //do something with the
    }
}
{code}

Given an item, PDOutlineNode.children returns an iterator that walks through 
the children until there is no NEXT or NEXT equals the starting element.




[jira] [Closed] (PDFBOX-1109) Data corruption related to scratch file use

2015-03-02 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler closed PDFBOX-1109.
--

This won't happen any more: starting with 2.0.0, PDFBox no longer uses a given 
scratch file but creates its own one if needed.

> Data corruption related to scratch file use
> ---
>
> Key: PDFBOX-1109
> URL: https://issues.apache.org/jira/browse/PDFBOX-1109
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing, PDModel
>Affects Versions: 1.8.7, 2.0.0
>Reporter: Stefan Mücke
>Assignee: Andreas Lehmkühler
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: COSDocument.java, PagedMultiRandomAccessFile.java, 
> PagedMultiRandomAccessFileTest.java
>
>
> PDFBox uses a scratch file to reduce memory consumption. However, there is no 
> mechanism that prevents two PDStreams from writing to the scratch file at the 
> same time. When this happens, the resulting PDF contains garbage in some 
> streams. This problem occurred several times to me (e.g. when writing to an 
> image stream while constructing a page).
> Reproducing the bug
> ***
> One can easily reproduce the bug. Open file AddImageToPDF.java and move the 
> following line:
>     PDPageContentStream contentStream =
>         new PDPageContentStream(doc, page, true, true);
> immediately after the line in which the PDPage object is fetched:
>     PDPage page =
>         (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 );
> 
> With this modification, one will still get a PDF file, but Acrobat Reader 
> will report that the image could not be processed. BTW, the files 
> AddImageToPDF.java and ImageToPDF.java are almost identical. One of them 
> should be deleted.
> Bug-Fix
> ***
> The problem can be solved by using a scratch file that is divided into pages 
> (e.g. of 4 KB). Each PDStream in the scratch file is then associated with a 
> list of pages. This list grows as more data is written to the stream.
> The bug fix requires minimal changes to the existing code. The very nice 
> RandomAccess interface made this very easy.
> Here is what needs to be changed:
> - Add the attached "PagedMultiRandomAccessFile.java" to the I/O package
> - Change COSDocument.getScratchFile() to return a RandomAccess
>   instance provided by PagedMultiRandomAccessFile:
>   private PagedMultiRandomAccessFile scratchFile = null;
>   [...]
>   public COSDocument(File scratchDir) throws IOException {
>       tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir);
>       scratchFile = new PagedMultiRandomAccessFile(
>               new RandomAccessFile(tmpFile, "rw"));
>   }
>   public COSDocument(RandomAccess file) {
>       // scratchFile = file;
>       throw new RuntimeException("Not yet implemented."); //$NON-NLS-1$
>   }
>
>   [...]
>   /**
>    * Returns a new scratch file.
>    *
>    * @return the newly created scratch file
>    */
>   public RandomAccess getScratchFile() {
>       return scratchFile.getNewRandomAcess();
>   }
> One of the COSDocument constructors takes a RandomAccess file. This 
> constructor is only called in a single location, namely, in method 
> PDFParser.parse(). I am not sure if the RandomAccess parameter provided here 
> is really a scratch file. Someone will have to decide what to do with this 
> one.
> The code has been thoroughly tested and has been used in the production of 
> several books without any problems.
> In the attachment please find the code. There is also a JUnit test that was 
> used to debug my code. I have added an Apache license header and adopted 
> PDFBox's code style. Feel free to make any desired changes.
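
The paging idea from the bug report (a shared scratch area divided into fixed-size pages, with each stream owning a growing list of pages) can be sketched in memory as below. This is a simplified illustration only: PagedScratch and ScratchStream are hypothetical names, the pages live in a list instead of a file, and none of it is taken from the attached PagedMultiRandomAccessFile.java.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a paged scratch file.
class PagedScratch {
    static final int PAGE_SIZE = 4096; // "e.g. of 4 KB"
    private final List<byte[]> pages = new ArrayList<>();

    // Hands out a fresh page and returns its index in the shared pool.
    private int allocatePage() {
        pages.add(new byte[PAGE_SIZE]);
        return pages.size() - 1;
    }

    ScratchStream newStream() {
        return new ScratchStream();
    }

    // One logical stream: an ordered list of page indices plus a length.
    class ScratchStream {
        private final List<Integer> pageIndex = new ArrayList<>();
        private long length = 0;

        void write(byte[] data) {
            for (byte b : data) {
                int offset = (int) (length % PAGE_SIZE);
                if (offset == 0) {
                    // The page list grows as more data is written.
                    pageIndex.add(allocatePage());
                }
                int last = pageIndex.get(pageIndex.size() - 1);
                pages.get(last)[offset] = b;
                length++;
            }
        }

        byte[] readAll() {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            long remaining = length;
            for (int index : pageIndex) {
                int n = (int) Math.min(PAGE_SIZE, remaining);
                out.write(pages.get(index), 0, n);
                remaining -= n;
            }
            return out.toByteArray();
        }
    }
}

class PagedScratchDemo {
    public static void main(String[] args) {
        PagedScratch scratch = new PagedScratch();
        PagedScratch.ScratchStream image = scratch.newStream();
        PagedScratch.ScratchStream page = scratch.newStream();
        // Interleaved writes, as when an image stream is written while a
        // page content stream is still open.
        image.write("img-1".getBytes());
        page.write("page".getBytes());
        image.write("img-2".getBytes());
        System.out.println(new String(image.readAll())); // prints img-1img-2
    }
}
```

Because each stream only ever touches its own pages, interleaved writes cannot overwrite each other, which is the property the report says the single shared scratch file was missing.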




[jira] [Resolved] (PDFBOX-1109) Data corruption related to scratch file use

2015-03-02 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-1109.

Resolution: Invalid

> Data corruption related to scratch file use
> ---
>
> Key: PDFBOX-1109
> URL: https://issues.apache.org/jira/browse/PDFBOX-1109
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing, PDModel
>Affects Versions: 1.8.7, 2.0.0
>Reporter: Stefan Mücke
>Assignee: Andreas Lehmkühler
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: COSDocument.java, PagedMultiRandomAccessFile.java, 
> PagedMultiRandomAccessFileTest.java
>
>
> PDFBox uses a scratch file to reduce memory consumption. However, there is no 
> mechanism that prevents two PDStreams from writing to the scratch file at the 
> same time. When this happens, the resulting PDF contains garbage in some 
> streams. This problem occurred several times to me (e.g. when writing to an 
> image stream while constructing a page).
> Reproducing the bug
> ***
> One can easily reproduce the bug. Open file AddImageToPDF.java and move the 
> following line:
>     PDPageContentStream contentStream =
>         new PDPageContentStream(doc, page, true, true);
> immediately after the line in which the PDPage object is fetched:
>     PDPage page =
>         (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 );
> 
> With this modification, one will still get a PDF file, but Acrobat Reader 
> will report that the image could not be processed. BTW, the files 
> AddImageToPDF.java and ImageToPDF.java are almost identical. One of them 
> should be deleted.
> Bug-Fix
> ***
> The problem can be solved by using a scratch file that is divided into pages 
> (e.g. of 4 KB). Each PDStream in the scratch file is then associated with a 
> list of pages. This list grows as more data is written to the stream.
> The bug fix requires minimal changes to the existing code. The very nice 
> RandomAccess interface made this very easy.
> Here is what needs to be changed:
> - Add the attached "PagedMultiRandomAccessFile.java" to the I/O package
> - Change COSDocument.getScratchFile() to return a RandomAccess
>   instance provided by PagedMultiRandomAccessFile:
>   private PagedMultiRandomAccessFile scratchFile = null;
>   [...]
>   public COSDocument(File scratchDir) throws IOException {
>       tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir);
>       scratchFile = new PagedMultiRandomAccessFile(
>               new RandomAccessFile(tmpFile, "rw"));
>   }
>   public COSDocument(RandomAccess file) {
>       // scratchFile = file;
>       throw new RuntimeException("Not yet implemented."); //$NON-NLS-1$
>   }
>
>   [...]
>   /**
>    * Returns a new scratch file.
>    *
>    * @return the newly created scratch file
>    */
>   public RandomAccess getScratchFile() {
>       return scratchFile.getNewRandomAcess();
>   }
> One of the COSDocument constructors takes a RandomAccess file. This 
> constructor is only called in a single location, namely, in method 
> PDFParser.parse(). I am not sure if the RandomAccess parameter provided here 
> is really a scratch file. Someone will have to decide what to do with this 
> one.
> The code has been thoroughly tested and has been used in the production of 
> several books without any problems.
> In the attachment please find the code. There is also a JUnit test that was 
> used to debug my code. I have added an Apache license header and adopted 
> PDFBox's code style. Feel free to make any desired changes.
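
The paged layout described in the report (fixed-size pages, each stream owning
a growing list of pages) can be sketched roughly as follows. This is only an
illustration of the idea, not the attached PagedMultiRandomAccessFile and not
PDFBox API: the names PagedScratchSketch, PagedStream, newStream, write and
readAll are hypothetical, and the 4 KB page size is taken from the example
figure in the report.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one backing file shared by several independent streams. */
final class PagedScratchSketch implements AutoCloseable {
    static final int PAGE_SIZE = 4096; // assumed page size (the report suggests 4 KB)

    private final RandomAccessFile file;
    private int nextFreePage = 0;

    PagedScratchSketch(File backing) throws IOException {
        file = new RandomAccessFile(backing, "rw");
    }

    /** Creates a stream that starts with no pages of its own. */
    PagedStream newStream() {
        return new PagedStream();
    }

    /** Hands out a page index that no other stream owns. */
    private synchronized int allocatePage() {
        return nextFreePage++;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    /** A logical stream whose bytes live in the pages listed in {@code pages}. */
    final class PagedStream {
        private final List<Integer> pages = new ArrayList<>();
        private int length = 0;

        /** Appends data, claiming a fresh page whenever the last one is full. */
        void write(byte[] data) throws IOException {
            int off = 0;
            while (off < data.length) {
                if (length == pages.size() * PAGE_SIZE) {
                    pages.add(allocatePage());
                }
                int inPage = length % PAGE_SIZE;
                int n = Math.min(PAGE_SIZE - inPage, data.length - off);
                int page = pages.get(length / PAGE_SIZE);
                long pos = (long) page * PAGE_SIZE + inPage;
                synchronized (file) { // seek and write must not be interleaved
                    file.seek(pos);
                    file.write(data, off, n);
                }
                off += n;
                length += n;
            }
        }

        /** Reads the stream back by walking its page list in order. */
        byte[] readAll() throws IOException {
            byte[] out = new byte[length];
            int off = 0;
            while (off < length) {
                int inPage = off % PAGE_SIZE;
                int n = Math.min(PAGE_SIZE - inPage, length - off);
                int page = pages.get(off / PAGE_SIZE);
                long pos = (long) page * PAGE_SIZE + inPage;
                synchronized (file) {
                    file.seek(pos);
                    file.readFully(out, off, n);
                }
                off += n;
            }
            return out;
        }
    }
}
```

Because every page belongs to exactly one stream, two streams can be written
alternately without overwriting each other, which is the situation the
reproduction steps above provoke. The sketch never frees or reuses pages; a
real implementation would presumably need to.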




[jira] [Resolved] (PDFBOX-1822) Signature byte range is Invalid

2015-03-02 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-1822.

Resolution: Fixed

> Signature byte range is Invalid
> ---
>
> Key: PDFBOX-1822
> URL: https://issues.apache.org/jira/browse/PDFBOX-1822
> Project: PDFBox
>  Issue Type: Bug
>  Components: Signing
>Affects Versions: 1.8.3, 1.8.4, 2.0.0
>Reporter: vakhtang koroghlishvili
>Assignee: Andreas Lehmkühler
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: 
> SignatureFileSet-PDFBOX-1.8.2_TO_1.8.4-SNAPSHOT_SEQ_AND_NONSEQ.zip, 
> araxis-merge - compare two document.jpg, damaged-sig.jpg, 
> unsigned-signed.pdf, unsigned.pdf, unsigned_signed_fix.pdf
>
>
> One person sent me an unsigned PDF document. He wanted to sign it. When I 
> tried to sign it (using PDFBox), I ran into a problem.
> After signing, Adobe Reader tells me "The signature byte range is invalid".  
> I will attach the original and the signed document.
> I think it is a PDFBox parser error; other signature libraries sign the 
> document without problems. I am investigating the problem at the moment in 
> order to fix it.



