date:20160329

[jira] [Commented] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread JIRA


[ 
https://issues.apache.org/jira/browse/PDFBOX-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217505#comment-15217505
 ] 

Andreas Lehmkühler commented on PDFBOX-3295:


[~tilman] Thanks again for the first analysis.

I guess it's related to incremental updates; I somehow missed that point. I'm 
going to check that ...

> Improve parsing performance of object streams
> -
>
> Key: PDFBOX-3295
> URL: https://issues.apache.org/jira/browse/PDFBOX-3295
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing
>Affects Versions: 1.8.11, 2.0.0, 2.1.0
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
> Fix For: 1.8.12, 2.0.1, 2.1.0
>
>
> About a year ago, [~torakiki] posted a comment about some xref 
> refactoring on the dev list:
> {quote}
> A few days ago I was profiling PDFBox when loading medium/large size
> documents and I think I found something.
> If you try loading the document
> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
> it takes quite some time and that's mostly spent in the
> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
> an object contained in an unparsed object stream is found, the
> XrefTrailerResolver performs a full scan of the xref entries found in the
> document, in this case hundreds of thousands. If the object streams are
> many (like in the given doc), it performs many full scans resulting in poor
> performance.
> {quote}
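
To illustrate the cost described in the quoted profiling note, here is a
minimal Java sketch. It is not the change committed for PDFBOX-3295 and it
does not use any PDFBox classes: XrefIndexSketch, containedByScan and
groupByStream are hypothetical names, and the xref table is modelled as a
plain map from object number to the number of the object stream containing
it. The first method rescans every entry for each object stream; the second
groups all entries in a single pass.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an xref table (assumption, not PDFBox code):
// key = object number, value = number of the object stream containing it.
final class XrefIndexSketch {

    // Slow variant: one full scan of all xref entries per object stream.
    static List<Long> containedByScan(Map<Long, Long> xref, long streamNumber) {
        List<Long> result = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : xref.entrySet()) {
            if (entry.getValue() == streamNumber) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Faster variant: a single pass groups all entries by object stream,
    // so each later lookup is a plain map access.
    static Map<Long, List<Long>> groupByStream(Map<Long, Long> xref) {
        Map<Long, List<Long>> byStream = new HashMap<>();
        for (Map.Entry<Long, Long> entry : xref.entrySet()) {
            byStream.computeIfAbsent(entry.getValue(), k -> new ArrayList<>())
                    .add(entry.getKey());
        }
        return byStream;
    }
}
```

With S object streams and N xref entries, the scan variant performs roughly
S * N comparisons, while the grouped variant performs one pass over the N
entries followed by S map lookups.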



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

Re: Custom glyphlist for text extraction

2016-03-29 Thread John Hewson



> On 30 Mar 2016, at 01:59, John Hewson  wrote:
> 
> 
> 
> -- John
> 
>> On 29 Mar 2016, at 21:31, Daniel Persson  wrote:
>> 
>> Hi Maruan
>> 
>> I extended the class to override that. Then again I extended the
>> PDFStreamEngine because I required more extensive changes but the principle
>> should be sound.
> 
> That's right but subclasses of PDFTextStreamEngine such as PDFTextStripper 
> don't have access to that. So yes, we've lost that capability for 
> PDFTextStripper.
> 
> What's needed is for the glyphList in PDFTextStripper to be overridden, 
> either by making it protected or adding a getter/setter (the latter is 
> probably a bit easier for users). Note that GlyphLists are immutable and may 
> be arbitrarily chained by wrapping with another GlyphList, as the constructor 
> of PDFTextStripper does.

Correction: "as the constructor of PDFTextStreamEngine does".

-- John

> 
> -- John
> 
>> best regards
>> Daniel
>> 
>>> On Tue, Mar 29, 2016, 20:12 Maruan Sahyoun  wrote:
>>> 
>>> Hi,
>>> 
>>> I was wondering if we lost the capability to supply a custom glyph list
>>> file as discussed here:
>>> http://stackoverflow.com/questions/35972788/how-to-read-control-characters-in-a-pdf-using-java/36034529#36034529
>>> 
>>> PDFTextStreamEngine seems to have it hardcoded
>>> ["org/apache/pdfbox/resources/glyphlist/additional.txt"] and I couldn't
>>> find a way to override that.
>>> 
>>> Am I missing something?
>>> 
>>> BR
>>> Maruan


Re: Custom glyphlist for text extraction

2016-03-29 Thread John Hewson

-- John

> On 29 Mar 2016, at 21:31, Daniel Persson  wrote:
> 
> Hi Maruan
> 
> I extended the class to override that. Then again I extended the
> PDFStreamEngine because I required more extensive changes but the principle
> should be sound.

That's right but subclasses of PDFTextStreamEngine such as PDFTextStripper 
don't have access to that. So yes, we've lost that capability for 
PDFTextStripper.

What's needed is for the glyphList in PDFTextStripper to be overridden, either 
by making it protected or adding a getter/setter (the latter is probably a bit 
easier for users). Note that GlyphLists are immutable and may be arbitrarily 
chained by wrapping with another GlyphList, as the constructor of 
PDFTextStripper does.
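
A rough Java sketch of the getter/setter idea together with chaining of
immutable glyph lists. Everything here is hypothetical and only models the
behaviour described above: ChainedGlyphList and StripperSketch are made-up
names, not PDFBox classes, and the lookup rule (own entries first, then the
wrapped list) is an assumption for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of an immutable, chainable glyph list.
// This is NOT the PDFBox GlyphList class.
final class ChainedGlyphList {
    private final ChainedGlyphList parent;            // wrapped list, may be null
    private final Map<String, String> nameToUnicode;  // never modified after construction

    ChainedGlyphList(ChainedGlyphList parent, Map<String, String> entries) {
        this.parent = parent;
        // Defensive copy so later changes to 'entries' cannot leak in.
        this.nameToUnicode = Collections.unmodifiableMap(new HashMap<>(entries));
    }

    // Look in this list first, then fall back to the wrapped list.
    String toUnicode(String glyphName) {
        String value = nameToUnicode.get(glyphName);
        if (value != null) {
            return value;
        }
        return parent == null ? null : parent.toUnicode(glyphName);
    }
}

// Hypothetical text stripper exposing the suggested getter/setter.
class StripperSketch {
    private ChainedGlyphList glyphList;

    StripperSketch(ChainedGlyphList defaults) {
        this.glyphList = defaults;
    }

    ChainedGlyphList getGlyphList() {
        return glyphList;
    }

    void setGlyphList(ChainedGlyphList glyphList) {
        this.glyphList = glyphList;
    }
}
```

A user would then wrap the current list instead of replacing it, e.g.
stripper.setGlyphList(new ChainedGlyphList(stripper.getGlyphList(), extra)),
so the default mappings stay available as a fallback.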

-- John

> best regards
> Daniel
> 
>> On Tue, Mar 29, 2016, 20:12 Maruan Sahyoun  wrote:
>> 
>> Hi,
>> 
>> I was wondering if we lost the capability to supply a custom glyph list
>> file as discussed here:
>> http://stackoverflow.com/questions/35972788/how-to-read-control-characters-in-a-pdf-using-java/36034529#36034529
>> 
>> PDFTextStreamEngine seems to have it hardcoded
>> ["org/apache/pdfbox/resources/glyphlist/additional.txt"] and I couldn't
>> find a way to override that.
>> 
>> Am I missing something?
>> 
>> BR
>> Maruan


[jira] [Comment Edited] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216890#comment-15216890
 ] 

Tilman Hausherr edited comment on PDFBOX-3295 at 3/29/16 9:46 PM:
--

The build failed, and there are problems with several files:
- PDFBOX-1907, which has 2 pages, now has 159 pages, file is at 
http://www.filedropper.com/pdfbox-1907
- PDFBOX-2001 the text "ich bin der Bieter gmbh" is missing
- 
https://issues.apache.org/jira/secure/attachment/12678455/testPDF_acroForm.pdf 
page 7 the form contents are missing (e.g. Springfield)
- PDFBOX-2163-322313.pdf and PDFBOX-2163-662062.pdf now fail with an OOM error, 
which did not occur before



was (Author: tilman):
The build failed, and there are problems with several files:
- PDFBOX-1907, which has 2 pages, now has 159 pages, file is at 
http://www.filedropper.com/pdfbox-1907
- PDFBOX-2001 the text "ich bin der Bieter gmbh" is missing
- 
https://issues.apache.org/jira/secure/attachment/12678455/testPDF_acroForm.pdf 
page 7 the form contents are missing (e.g. Springfield)




[jira] [Comment Edited] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216890#comment-15216890
 ] 

Tilman Hausherr edited comment on PDFBOX-3295 at 3/29/16 9:42 PM:
--

The build failed, and there are problems with several files:
- PDFBOX-1907, which has 2 pages, now has 159 pages, file is at 
http://www.filedropper.com/pdfbox-1907
- PDFBOX-2001 the text "ich bin der Bieter gmbh" is missing
- 
https://issues.apache.org/jira/secure/attachment/12678455/testPDF_acroForm.pdf 
page 7 the form contents are missing (e.g. Springfield)




was (Author: tilman):
The build failed, and there are problems with several files:
- PDFBOX-1907, which has 2 pages, now has 159 pages, file is at 
http://www.filedropper.com/pdfbox-1907
- PDFBOX-2001 the text "ich bin der Bieter gmbh" is missing



[jira] [Reopened] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread Tilman Hausherr (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened PDFBOX-3295:
-

The build failed, and there are problems with several files:
- PDFBOX-1907, which has 2 pages, now has 159 pages, file is at 
http://www.filedropper.com/pdfbox-1907
- PDFBOX-2001 the text "ich bin der Bieter gmbh" is missing



Jenkins build became unstable: PDFBox 2.0.x » Apache PDFBox #7

2016-03-29 Thread Apache Jenkins Server

See 




Jenkins build became unstable: PDFBox 2.0.x #7

2016-03-29 Thread Apache Jenkins Server

See 



Jenkins build is back to normal : PDFBox 1.8.x » PDFBox parent #562

2016-03-29 Thread Apache Jenkins Server

See 




Jenkins build is back to normal : PDFBox 1.8.x #562

2016-03-29 Thread Apache Jenkins Server

See 



Jenkins build became unstable: PDFBox-trunk » Apache PDFBox #2802

2016-03-29 Thread Apache Jenkins Server

See 




Jenkins build became unstable: PDFBox-trunk #2802

2016-03-29 Thread Apache Jenkins Server

See 



[jira] [Commented] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread JIRA


[ 
https://issues.apache.org/jira/browse/PDFBOX-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216808#comment-15216808
 ] 

Andreas Lehmkühler commented on PDFBOX-3295:


After eliminating the mentioned method call, the parser is about 8 times faster 
(using the file from PDFBOX-3284 with the trunk).

[~torakiki] Thanks for sharing your profiling results!


[jira] [Resolved] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/PDFBOX-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-3295.

Resolution: Fixed


[jira] [Updated] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/PDFBOX-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-3295:
---
Affects Version/s: 1.8.11


[jira] [Updated] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/PDFBOX-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-3295:
---
Fix Version/s: 1.8.12


[jira] [Commented] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216804#comment-15216804
 ] 

ASF subversion and git services commented on PDFBOX-3295:
-

Commit 1737047 from [~lehmi] in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1737047 ]

PDFBOX-3295: avoid using XrefTrailerResolver#getContainedObjectNumbers as 
proposed by Andrea Vacondio to speed up the parsing of object streams


[jira] [Commented] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216799#comment-15216799
 ] 

ASF subversion and git services commented on PDFBOX-3295:
-

Commit 1737044 from [~lehmi] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1737044 ]

PDFBOX-3295: avoid using XrefTrailerResolver#getContainedObjectNumbers as 
proposed by Andrea Vacondio to speed up the parsing of object streams


[jira] [Commented] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216795#comment-15216795
 ] 

ASF subversion and git services commented on PDFBOX-3295:
-

Commit 1737043 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1737043 ]

PDFBOX-3295: avoid using XrefTrailerResolver#getContainedObjectNumbers as 
proposed by Andrea Vacondio to speed up the parsing of object streams


[jira] [Created] (PDFBOX-3295) Improve parsing performance of object streams

2016-03-29 Thread JIRA

Andreas Lehmkühler created PDFBOX-3295:
--

 Summary: Improve parsing performance of object streams
 Key: PDFBOX-3295
 URL: https://issues.apache.org/jira/browse/PDFBOX-3295
 Project: PDFBox
  Issue Type: Improvement
  Components: Parsing
Affects Versions: 2.0.0, 2.1.0
Reporter: Andreas Lehmkühler
Assignee: Andreas Lehmkühler
 Fix For: 2.0.1, 2.1.0


About a year ago, [~torakiki] posted a comment about some xref 
refactoring on the dev list:
{quote}
A few days ago I was profiling PDFBox when loading medium/large size
documents and I think I found something.
If you try loading the document
http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
it takes quite some time and that's mostly spent in the
XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
an object contained in an unparsed object stream is found, the
XrefTrailerResolver performs a full scan of the xref entries found in the
document, in this case hundreds of thousands. If the object streams are
many (like in the given doc), it performs many full scans resulting in poor
performance.
{quote}




Re: Custom glyphlist for text extraction

2016-03-29 Thread Daniel Persson

Hi Maruan

I extended the class to override that. Then again I extended the
PDFStreamEngine because I required more extensive changes but the principle
should be sound.

best regards
Daniel

On Tue, Mar 29, 2016, 20:12 Maruan Sahyoun  wrote:

> Hi,
>
> I was wondering if we lost the capability to supply a custom glyph list
> file as discussed here:
> http://stackoverflow.com/questions/35972788/how-to-read-control-characters-in-a-pdf-using-java/36034529#36034529
>
> PDFTextStreamEngine seems to have it hardcoded
> ["org/apache/pdfbox/resources/glyphlist/additional.txt"] and I couldn't
> find a way to override that.
>
> Am I missing something?
>
> BR
> Maruan

Custom glyphlist for text extraction

2016-03-29 Thread Maruan Sahyoun

Hi,

I was wondering if we lost the capability to supply a custom glyph list file as 
discussed here: 
http://stackoverflow.com/questions/35972788/how-to-read-control-characters-in-a-pdf-using-java/36034529#36034529

PDFTextStreamEngine seems to have it hardcoded 
["org/apache/pdfbox/resources/glyphlist/additional.txt"] and I couldn't find a 
way to override that.

Am I missing something?

BR
Maruan

[jira] [Closed] (PDFBOX-3294) After loading pdf through PDFRenderer(pdf), We are trying to take the first page and convert that into preview image, but it is sometimes gets out of memory.

2016-03-29 Thread Tilman Hausherr (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-3294.
---
Resolution: Not A Problem

Closing (you can still comment), I was able to successfully run PDFToImage with 
-Xmx2g.

> After loading pdf through PDFRenderer(pdf), We are trying to take the first 
> page and convert that into preview image, but it is sometimes gets out of 
> memory.
> -
>
> Key: PDFBOX-3294
> URL: https://issues.apache.org/jira/browse/PDFBOX-3294
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.0
> Environment: Windows 8.1 OS, 8 GB ram, I5 processor
>Reporter: Vijay
>Priority: Blocker
> Attachments: VFEpRJNMpdC1SCElMVN_ew__93_1_93_0.pdf, 
> originalVFEpRJNMpdC1SCElMVN_ew__93_1_93_0.pdf_VFQZ-4975.jpg, screenshot-1.png
>
>
> Scenario 1. 
> After loading a PDF through PDFRenderer(pdf), we try to take the first 
> page and convert it into a preview image, but calling 
> "pdfRenderer.renderImageWithDPI(0, 300, ImageType.RGB)" sometimes runs out 
> of memory.
> Scenario 2.
> After loading a PDF through PDFRenderer(pdf), we try to take the first 
> page and convert it into a preview image, so we call 
> "pdfRenderer.renderImageWithDPI(0, 300, ImageType.RGB)"; it takes a long time 
> to return the BufferedImage. 
> Afterwards, calling ImageIOUtil.writeImage(bufferedImage, fileName, 175) takes 
> a long time to save the file locally, and the resulting image is very large. 
> For example: 
> total size of the PDF: 1.7 MB
> size of the first-page image: 11 MB.
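
For context on why a 300 DPI render needs a lot of heap, a back-of-the-envelope
estimate in Java. This is only a sketch under stated assumptions: the page
size (US Letter, 612 x 792 points) and the 4 bytes per pixel are not taken
from the report, and RenderSizeEstimate, pixels and rasterBytes are
hypothetical helper names, not PDFBox API.

```java
// Estimates the in-memory raster size for one page rendered at a given DPI.
// Assumptions (not from the report): page size in PDF points (1/72 inch)
// and a fixed number of bytes per pixel.
final class RenderSizeEstimate {
    static final double POINTS_PER_INCH = 72.0;

    // Converts a length in points to pixels at the given resolution.
    static int pixels(double points, double dpi) {
        return (int) Math.round(points / POINTS_PER_INCH * dpi);
    }

    // Width x height x bytes per pixel, computed in long to avoid overflow.
    static long rasterBytes(double widthPt, double heightPt, double dpi, int bytesPerPixel) {
        return (long) pixels(widthPt, dpi) * pixels(heightPt, dpi) * bytesPerPixel;
    }
}
```

For the assumed Letter page, rasterBytes(612, 792, 300, 4) corresponds to
2550 x 3300 pixels and 33,660,000 bytes (about 32 MiB) for the raster alone,
before any temporary buffers; at 150 DPI it is a quarter of that.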




[jira] [Updated] (PDFBOX-3293) Font glyphs with overlapping paths not rendered correctly

2016-03-29 Thread Tilman Hausherr (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3293:

Attachment: fontforge.png
PDFBOX-3293.ttf
PDFBOX-3293_reduced.pdf

Sadly, it happens at any resolution. Here's a reduced PDF, the subsetted font, 
and a screenshot of fontforge. Fontforge displays it badly, I've opened an 
issue at https://github.com/fontforge/fontforge/issues/2657 . Same for two 
commercial java competitors. Same for PDFRenderer. GS does it right. PDFBox 
1.8.12 displays it properly, but it uses awt. BATIK (of which PDFBox uses some 
code for font display) displays it badly. PDF.js displays it properly.

> Font glyphs with overlapping paths not rendered correctly
> -
>
> Key: PDFBOX-3293
> URL: https://issues.apache.org/jira/browse/PDFBOX-3293
> Project: PDFBox
>  Issue Type: Bug
>  Components: FontBox, Rendering
>Affects Versions: 2.0.0
>Reporter: Pei-Tang Huang
>Priority: Critical
> Attachments: PDFBOX-3293.ttf, PDFBOX-3293_reduced.pdf, fontforge.png, 
> sample.pdf, sample_rendered.tif
>
>
> Font glyphs with overlapping paths may be rendered incorrectly, especially 
> when the font size is small.
> Sadly, the Traditional Chinese edition of Windows bundled fonts 細明體&新細明體 
> (MingLiU & PMingLiU) and 標楷體 (DFKai-SB) all suffer from this problem.
> See attached sample.pdf and the rendered sample_rendered.tif.




RE: shading/relocating 1.8.x?

2016-03-29 Thread Allison, Timothy B.

Got it.  That's what I had assumed.

I'll hold off on opening truncated file issue(s) on PDFBox's JIRA...  I opened 
TIKA-1912 to track this on our side.

Thank you, again!

Best,

  Tim

-Original Message-
From: Andreas Lehmkühler [mailto:andr...@lehmi.de] 
Sent: Tuesday, March 29, 2016 7:12 AM
To: dev@pdfbox.apache.org
Subject: RE: shading/relocating 1.8.x?

> "Allison, Timothy B." wrote on 28 March 2016 at 21:02:
> 
> 
> Oh, wow, so it really might be possible without too much work?  I'm 
> more than happy to supply examples. :)
Oops, it isn't as simple as it sounds. If we simply swallow the exception, PDFBox 
most likely runs into an NPE. IMHO we have to implement some sort of on-demand 
parser which is able to handle null values for specific parts of a PDF without 
throwing any exception.

> Should I open an issue?
Thanks, but I'm going to do that soon, as some other things should be done as 
well.

BR
Andreas

RE: shading/relocating 1.8.x?

2016-03-29 Thread Andreas Lehmkühler

> "Allison, Timothy B." wrote on 28 March 2016 at 21:02:
> 
> 
> Oh, wow, so it really might be possible without too much work?  I'm more than
> happy to supply examples. :) 
Oops, it isn't as simple as it sounds. If we simply swallow the exception, PDFBox
most likely runs into an NPE. IMHO we have to implement some sort of on-demand
parser which is able to handle null values for specific parts of a PDF without
throwing any exception.
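
To make the idea a bit more concrete, here is a rough sketch of what "on demand" and "null-tolerant" could look like. It is purely hypothetical and not PDFBox code: LenientSection and its Parser interface are made-up names, and the only assumption is that a broken part of the file surfaces as an IOException when it is first accessed.

```java
import java.io.IOException;
import java.util.Optional;

/**
 * Hypothetical sketch: one part of a document that is parsed lazily and
 * reported as "absent" instead of throwing when it turns out to be truncated.
 */
final class LenientSection<T> {

    /** Parses one part of the file; may fail on truncated input. */
    interface Parser<T> {
        T parse() throws IOException;
    }

    private final Parser<T> parser;
    private boolean attempted;
    private T value; // stays null if parsing failed

    LenientSection(Parser<T> parser) {
        this.parser = parser;
    }

    /** Parses on first access; a failure is remembered as an empty result. */
    Optional<T> get() {
        if (!attempted) {
            attempted = true;
            try {
                value = parser.parse();
            } catch (IOException e) {
                value = null; // swallow, but callers must handle the empty case
            }
        }
        return Optional.ofNullable(value);
    }
}
```

The point of returning an Optional is that callers are forced to deal with the missing part explicitly, which is exactly where a plain "swallow the exception" approach would otherwise end in an NPE.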

> Should I open an issue?
Thanks, but I'm going to do that soon, as some other things should be done as
well.

BR
Andreas
> 
> 
> -Original Message-
> From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
> Sent: Monday, March 28, 2016 10:58 AM
> To: dev@pdfbox.apache.org
> Subject: Re: shading/relocating 1.8.x?
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
> >
> >> On 23 Mar 2016, at 06:20, Allison, Timothy B.  wrote:
> >>
> >> All,
> >>   We've upgraded to 2.0.0 on Tika.  Many thanks again!
> >>   One of our users is interested in continuing to use the
> >> classic/SequentialParser, or at least having it available as a back-off
> >> parser for corrupt pdfs [0].
> >
> > Using the old parser really isn’t a good idea, it’s known to be pretty
> > broken. I think that we would be much better off making sure the new parser
> > can handle truncated files. We already do a lot of repair in the new parser,
> > so this doesn’t seem like too much work? Maybe Andreas can comment further?
> The biggest issue here is the truncated stream or dictionary. The current
> version simply throws an exception when running into such cases. We
> have to implement some algorithm to ignore such incomplete parts of a PDF if
> possible.
> 
> BR
> Andreas
> 
> >
> > Do we have some JIRA issues which identify some of these cases?
> >
> > — John
> >
> >>   Would you be willing to distribute a shaded/relocated 1.8.x app so that
> >> we could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or,
> >> is there a better solution?
> >
> > I wouldn’t recommend doing that, because you’re going to be stuck with using
> > 1.8 for everything, not just parsing, at least as far as corrupt/truncated
> > files are concerned.
> >
> > — John
> >
> >>   Thank you!
> >>
> >>   Cheers,
> >>
> >>  Tim
> >>
> >> [0]
> >> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360

[jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

2016-03-29 Thread JIRA


[ 
https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215842#comment-15215842
 ] 

Andreas Lehmkühler commented on PDFBOX-922:
---

[~fibe] First of all, please don't use already closed tickets for bug reports; 
open a new one instead.

With regard to your issue, PDF doesn't support control characters like "tab", 
"linefeed", "carriage return" etc. Even the space character is seldom used, as 
all characters/text chunks of a PDF have to be positioned absolutely using 
specific coordinates. If spaces are supported at all, only "simple" spaces are; 
there is no distinction between different kinds of spaces. That said, your dirty 
work-around is the way to go.
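
For reference, the work-around can be as small as the sketch below. It uses only the JDK; the class and method names are made up for illustration, and it assumes the text is sanitized right before it is handed to the font for measuring or drawing.

```java
/**
 * Hypothetical helper: replaces non-breaking spaces (U+00A0) with ordinary
 * spaces so the text only contains characters the encoding can represent.
 */
final class NbspWorkaround {

    /** The non-breaking space that triggers the error quoted in this thread. */
    static final char NBSP = '\u00A0';

    /** Returns a copy of the text with every U+00A0 replaced by U+0020. */
    static String replaceNbsp(String text) {
        return text.replace(NBSP, ' ');
    }

    public static void main(String[] args) {
        String original = "10" + NBSP + "km";
        System.out.println(replaceNbsp(original));
    }
}
```

If the non-breaking semantics matter for your own line-breaking logic, one option is to keep the original string for deciding where lines may break and to pass only the sanitized copy to the font.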

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> 
>
> Key: PDFBOX-922
> URL: https://issues.apache.org/jira/browse/PDFBOX-922
> Project: PDFBox
>  Issue Type: New Feature
>  Components: Writing
>Affects Versions: 1.3.1
> Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>Reporter: Thanos Agelatos
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it 
> creates, making it impossible to create PDFs in any language apart from 
> English and ones supported in WinAnsiEncoding. This behaviour is caused 
> because method PDTrueTypeFont.loadTTF has hardcoded WinAnsiEncoding inside, 
> and there is no Identity-H or Identity-V Encoding classes provided (to set 
> afterwards via PDFont.setFont() )
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese 
> The PDF created contains garbled characters and/or squares.
> Simple test case:
> {code}
> PDDocument doc = null;
>   try {
>   doc = new PDDocument();
>   PDPage page = new PDPage();
>   doc.addPage(page);
>   // extract fonts for fields
>   byte[] arialNorm = extractFont("arial.ttf");
>   //byte[] arialBold = extractFont("arialbd.ttf"); 
>   //PDFont font = PDType1Font.HELVETICA;
>   PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
>   
>   PDPageContentStream contentStream = new PDPageContentStream(doc, page);
>   contentStream.beginText();
>   contentStream.setFont(font, 12);
>   contentStream.moveTextPositionByAmount(100, 700);
>   // text here may appear garbled; insert any text in Greek, Bulgarian or Maltese
>   contentStream.drawString("Hello world from PDFBox ελληνικά");
>   contentStream.endText();
>   contentStream.close();
>   doc.save("pdfbox.pdf");
>   System.out.println(" created!");
>   } catch (Exception ioe) {
>   ioe.printStackTrace();
>   } finally {
>   if (doc != null) {
>   try { doc.close(); } catch (Exception e) {}
>   }
>   }
> {code}



Re: PDFBox 2.1

2016-03-29 Thread Andreas Lehmkühler

> Maruan Sahyoun wrote on 29 March 2016 at 12:28:
> 
> 
> Hi,
> 
> Now that PDFBox 2.0 is out, what about collecting ideas for 2.1? We could put
> them on our website the same way we had the old ideas published.
Good idea!

> From my perspective:
> - simplify creation of AcroForm fields
> - appearance generation for new AcroForm fields
> - rework/enhancement to the plain text formatter.
> - Java 1.7
> - incremental parsing i.e. page by page
> - discussion/decision on XMP (shall we enhance XMPBox, restore Jempbox, base
> on Adobe's XMP library, join forces with the FOP project …)
Most likely some of these ideas will come with breaking changes, so we have
to think about the correct target: 2.1 vs. 3.0

> 
> 
> BR
> Maruan

BR
Andreas

PDFBox 2.1

2016-03-29 Thread Maruan Sahyoun

Hi,

Now that PDFBox 2.0 is out, what about collecting ideas for 2.1? We could put 
them on our website the same way we had the old ideas published.

From my perspective:
- simplify creation of AcroForm fields
- appearance generation for new AcroForm fields
- rework/enhancement to the plain text formatter.
- Java 1.7
- incremental parsing i.e. page by page
- discussion/decision on XMP (shall we enhance XMPBox, restore Jempbox, base on 
Adobe's XMP library, join forces with the FOP project …)


BR
Maruan



[jira] [Commented] (PDFBOX-3284) Big Pdf parsing to text - Out of memory

2016-03-29 Thread Nicolas Daniels (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215729#comment-15215729
 ] 

Nicolas Daniels commented on PDFBOX-3284:
-

Thanks for the investigations.

I think it is good to try to reduce the memory footprint of parsing the whole 
document, but I really think PDFBox should have a way to parse the document page 
by page, for instance (and so keep only part of the document in memory). 
Such a feature would ensure that any PDF could be parsed, however big it 
is.

FYI, I tried to parse the same PDF using iText. I can get the text with less 
than 200MB:
{code:borderStyle=solid}
PdfReader reader = new PdfReader(
        new RandomAccessFileOrArray(new RandomAccessSourceFactory().createSource(inputStream)),
        null);

try {
    for (int page = 1; page <= reader.getNumberOfPages(); page++) {
        textWriter.write(PdfTextExtractor.getTextFromPage(reader, page));
    }
} finally {
    reader.close();
}
{code}

Rgds,

> Big Pdf parsing to text - Out of memory
> ---
>
> Key: PDFBOX-3284
> URL: https://issues.apache.org/jira/browse/PDFBOX-3284
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.10, 1.8.11, 2.0.0, 2.1.0
>Reporter: Nicolas Daniels
> Attachments: massparse-stat.txt
>
>
> I'm trying to parse a fairly big PDF (26MB) and transform it to text; however, 
> I'm facing huge memory consumption leading to an out-of-memory error. Running 
> my test with -Xmx768M always fails. I have to increase it to 1500M to make it 
> work. 
> The resulting text is only 3MB, so I don't understand why it is taking so much 
> memory.
> I've tested this code with 1.8.10, 1.8.11 & 2.0.0 with the same result.
> The pdf can be found 
> [here|https://www2.swift.com/uhbonline/books/public/en_uk/clr_3_0_stdsmx_msg_def_rpt_sch/sr2015_mx_clearing_3dot0_mdr2_solution.pdf]
> My code:
> {code:title=Test.java|borderStyle=solid}
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>     try {
>         StringWriter writer = new StringWriter();
>         FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"));
>         PDFTextStripper pdfTextStripper = new PDFTextStripper();
>         pdfTextStripper.writeText(PDDocument.load(inputStream), fileWriter);
>         fileWriter.close();
>     } finally {
>         inputStream.close();
>     }
> }
> {code}



[jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

2016-03-29 Thread Filip Bellander (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215705#comment-15215705
 ] 

Filip Bellander commented on PDFBOX-922:


I tried to update to 2.0.0 today. Now suddenly my tests no longer work, failing 
with the following error: {noformat}U+00A0 ('nbspace') is not available in this 
font's encoding: WinAnsiEncoding{noformat}

Worth noting here is that I don't have any of the Type1 fonts installed on my 
machine (I'm on a Linux box and just haven't installed them). This results in 
the following information being printed before the tests are run (i.e., when I 
start using PDFBox):

{noformat}
10:41:30.174 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Times-Roman
10:41:30.178 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Times-Bold
10:41:30.178 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Times-Italic
10:41:30.179 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Times-BoldItalic
10:41:30.179 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Helvetica
10:41:30.180 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Helvetica-Bold
10:41:30.180 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Helvetica-Oblique
10:41:30.181 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Helvetica-BoldOblique
10:41:30.181 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Courier
10:41:30.182 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Courier-Bold
10:41:30.183 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Courier-Oblique
10:41:30.184 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for base font Courier-BoldOblique
10:41:30.199 [main] DEBUG o.a.p.p.font.FileSystemFontProvider - Loaded StandardSymL from /usr/share/fonts/Type1/s05l.pfb
10:41:30.215 [main] DEBUG o.a.p.p.font.FileSystemFontProvider - Loaded Dingbats from /usr/share/fonts/Type1/d05l.pfb
10:41:30.422 [main] WARN  o.a.pdfbox.pdmodel.font.PDType1Font - Using fallback font LiberationSans for Helvetica
Tests run: 10, Failures: 0, Errors: 6, Skipped: 0, Time elapsed: 0.968 sec <<< FAILURE!
{noformat}

This problem was not present in 1.8.11, so I'm wondering what's really going on 
here.
From what I can tell, this gets triggered when you do something like
{code:java}
pdFont.getStringWidth(StringEscapeUtils.unescapeHtml4("&nbsp;"));
{code}
That is at least what it fails on for me.
A dirty work-around would be to replace all non-breaking spaces with breaking 
spaces, but that defeats the purpose of having non-breaking ones.
Any suggestions on how this might be solved?

[jira] [Commented] (PDFBOX-3294) After loading pdf through PDFRenderer(pdf), We are trying to take the first page and convert that into preview image, but it is sometimes gets out of memory.

2016-03-29 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215592#comment-15215592
 ] 

Tilman Hausherr commented on PDFBOX-3294:
-

That is because your PDF is huge. The media box is 3000 x 5135. An A4 page is 
595 x 841.

Solution: use a higher -Xmx value and use a lower dpi.

If possible, talk to the creator of the PDF. It can be much smaller if they use 
vector graphics instead of two images inside.
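
For a sense of scale, the figures above can be turned into a rough memory estimate. This is only a sketch: it assumes the page is rendered at 300 dpi as in the report, that media box units are points (1/72 inch), and that the raster takes 4 bytes per pixel (as an int-packed RGB image would). The class and method names are made up for illustration and are not part of PDFBox.

```java
/** Hypothetical helper for estimating the size of a rendered page raster. */
final class RenderMemoryEstimate {

    /** PDF user space units per inch (1 pt = 1/72 inch). */
    static final double POINTS_PER_INCH = 72.0;

    /** Pixels along a side that is {@code points} long when rendered at {@code dpi}. */
    static long pixels(double points, double dpi) {
        return (long) Math.ceil(points * dpi / POINTS_PER_INCH);
    }

    /** Rough raster size in bytes for the given page size, dpi and pixel depth. */
    static long rasterBytes(double widthPt, double heightPt, double dpi, int bytesPerPixel) {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi) * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Media box from this issue: 3000 x 5135 points.
        long at300 = rasterBytes(3000, 5135, 300, 4);
        long at72 = rasterBytes(3000, 5135, 72, 4);
        System.out.println("300 dpi: " + at300 + " bytes");
        System.out.println(" 72 dpi: " + at72 + " bytes");
    }
}
```

Under those assumptions the page comes out at 12500 x 21396 pixels, roughly 1 GB of raster data at 300 dpi, against about 62 MB at 72 dpi, which is why lowering the dpi helps so much.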

> After loading pdf through PDFRenderer(pdf), We are trying to take the first 
> page and convert that into preview image, but it is sometimes gets out of 
> memory.
> -
>
> Key: PDFBOX-3294
> URL: https://issues.apache.org/jira/browse/PDFBOX-3294
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.0
> Environment: Windows 8.1 OS, 8 GB ram, I5 processor
>Reporter: Vijay
>Priority: Blocker
> Attachments: VFEpRJNMpdC1SCElMVN_ew__93_1_93_0.pdf, 
> originalVFEpRJNMpdC1SCElMVN_ew__93_1_93_0.pdf_VFQZ-4975.jpg, screenshot-1.png
>
>
> Scenario 1. 
> After loading the PDF through PDFRenderer(pdf), we try to take the first 
> page and convert it into a preview image, but it sometimes runs out of 
> memory when calling "pdfRenderer.renderImageWithDPI(0, 300, ImageType.RGB)".
> Scenario 2.
> After loading the PDF through PDFRenderer(pdf), we try to take the first 
> page and convert it into a preview image, so we call 
> "pdfRenderer.renderImageWithDPI(0, 300, ImageType.RGB)"; it takes a long time 
> to return the BufferedImage. 
> Afterwards we call ImageIOUtil.writeImage(bufferedImage, fileName, 175), which 
> takes a long time to save the image locally, and the resulting image is very large. 
> For example: 
> total size of the PDF - 1.7 MB
> first page image size - 11 MB.


