date:20181127

Re: Fwd: 1.20?

2018-11-27 Thread Andreas Lehmkuehler

Sounds good enough for me. I'm going to cut the release this evening, about
10-11 hours from now.


Thanks Tim!

Andreas


On 28.11.18 at 03:40, Tim Allison wrote:

Looks good.  One file has fewer pages: govdocs1/229/229205.pdf

Based on the content diffs, it looks like I compared 2.0.13-SNAPSHOT
against 2.0.11 (/data4/batch_runs/tika_1_19-rc1).  We upgraded to
2.0.12 in 1.19.1.  Given that I don't see any differences, I think
we're good.  However, I'm happy to re-run with 2.0.12 as the baseline.

The reports are here:
http://162.242.228.174/reports/reports_pdfbox_2_0_13-pre-rc.tgz

On Tue, Nov 27, 2018 at 1:22 PM Andreas Lehmkuehler wrote:


On 25.11.18 at 11:01, Andreas Lehmkuehler wrote:

On 24.11.18 at 10:15, Tilman Hausherr wrote:

On 24.11.2018 at 08:53, Andreas Lehmkuehler wrote:

How about cutting a release next week or a week later?

I'm going to cut the release next Tuesday the 27th.

I'm going to postpone the preparations until we have hopefully positive test
results from Tim

Andreas



If there are any objections, I'll postpone the release to the following Monday, the
3rd of December.


Andreas

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org





Re: Fwd: 1.20?

2018-11-27 Thread Tim Allison

Looks good.  One file has fewer pages: govdocs1/229/229205.pdf

Based on the content diffs, it looks like I compared 2.0.13-SNAPSHOT
against 2.0.11 (/data4/batch_runs/tika_1_19-rc1).  We upgraded to
2.0.12 in 1.19.1.  Given that I don't see any differences, I think
we're good.  However, I'm happy to re-run with 2.0.12 as the baseline.

The reports are here:
http://162.242.228.174/reports/reports_pdfbox_2_0_13-pre-rc.tgz

On Tue, Nov 27, 2018 at 1:22 PM Andreas Lehmkuehler wrote:
>
> On 25.11.18 at 11:01, Andreas Lehmkuehler wrote:
> > On 24.11.18 at 10:15, Tilman Hausherr wrote:
> >> On 24.11.2018 at 08:53, Andreas Lehmkuehler wrote:
> >>> How about cutting a release next week or a week later?
> > I'm going to cut the release next Tuesday the 27th.
> I'm going to postpone the preparations until we have hopefully positive test
> results from Tim
>
> Andreas
>
> >
> > If there are any objections, I'll postpone the release to the following Monday,
> > the 3rd of December.
> >
> >
> > Andreas

[jira] [Commented] (PDFBOX-3017) Improve document signing

2018-11-27 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700847#comment-16700847
 ] 

ASF subversion and git services commented on PDFBOX-3017:
-

Commit 1847577 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847577 ]

PDFBOX-3017: include mention of mkl comment so people don't take this example 
blindly

> Improve document signing
> 
>
> Key: PDFBOX-3017
> URL: https://issues.apache.org/jira/browse/PDFBOX-3017
> Project: PDFBox
>  Issue Type: Improvement
>  Components: AcroForm, Signing
>Affects Versions: 2.0.0, 3.0.0 PDFBox
>Reporter: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: PDFBOX-3017_certificate_chain.diff, 
> PDFBOX-3017_certificate_chain_Screenshot.png, QV_RCA1_RCA3_CPCPS_V4_11.pdf, 
> pdfa_signed_insivible.pdf
>
>
> Improve signing code:
> - incremental save only works for signatures and doesn't respect certificates 
> such as Adobe Extended Usage Rights
> - -{{prepareNonVisualSignature}} clears the AcroForm DR 
> {{acroForm.setDefaultResources(null)}} which is not good if there are other 
> form fields-
> - visual/nonVisualSignature should move into the {{interactive.forms}} 
> package and be handled within the signature field
> - -verify signature (to have tests that go full circle)- done June 2016
> - document or refactor / rewrite visible labyrinthine signature code
> - why is it not possible to pass only the signatureField to addSignature, 
> instead having to create a COSDocument with a page and annotations that has 
> the signature field, and that must be searched for in 
> {{prepareVisibleSignature()}}?
> - -support rotated pages (see 
> https://stackoverflow.com/questions/34012293/pdfbox-sign-landscape-file-error/34359956#34359956
>  )- done in PDFBOX-3671
> - -make sure that signed PDF/A files are still PDF/A (see 
> http://www.pdfa.org/wp-content/uploads/2011/08/tn0006_digital_signatures_in_pdfa-1_2008-03-14.pdf
>  ); /ID possibly not OK; /Annots is possibly required ([~tilman] removed this 
> for invisible signatures); test signed files with PDF-Tools and with 
> preflight- tested, they are OK with PDF-Tools and preflight
> - test whether "bad" signatures are detected by preflight (search in old 
> issues)
> - -PDFBOX-3363 - why is the stream cached in a file? Should it be done in 
> memory?- done on July 15, 2016
> - remove {{setVisualSignature(PDVisibleSigProperties 
> visSignatureProperties)}} from SignatureOptions.java, all it does is to call 
> {{visSignatureProperties.getVisibleSignature()}} which returns an 
> {{InputStream}}, and this is already available
> - {{checkSignatureField}} violates the "do one thing" rule
> - -decide whether the whole certificate chain should be passed in the sample 
> code, instead of only the first one- yes the whole chain is stored
> - -check certificate chain, revocation lists, etc,- only if needed by users, 
> code 
> [here|https://svn.apache.org/repos/asf/cxf/tags/cxf-2.4.1/distribution/src/main/release/samples/sts_issue_operation/src/main/java/demo/sts/provider/cert/]
> - deprecate / remove all PDVisibleSignDesigner constructors except those with 
> a PDDocument object, to avoid a file being opened twice
> - ... your ideas...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Fwd: 1.20?

2018-11-27 Thread Andreas Lehmkuehler


On 25.11.18 at 11:01, Andreas Lehmkuehler wrote:

On 24.11.18 at 10:15, Tilman Hausherr wrote:

On 24.11.2018 at 08:53, Andreas Lehmkuehler wrote:
How about cutting a release next week or a week later? 

I'm going to cut the release next Tuesday the 27th.
I'm going to postpone the preparations until we have hopefully positive test 
results from Tim


Andreas



If there are any objections, I'll postpone the release to the following Monday, the
3rd of December.



Andreas


[jira] [Commented] (PDFBOX-3017) Improve document signing

2018-11-27 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700830#comment-16700830
 ] 

ASF subversion and git services commented on PDFBOX-3017:
-

Commit 1847575 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1847575 ]

PDFBOX-3017: pass signing date to OCSPHelper


[jira] [Commented] (PDFBOX-3017) Improve document signing

2018-11-27 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700831#comment-16700831
 ] 

ASF subversion and git services commented on PDFBOX-3017:
-

Commit 1847576 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847576 ]

PDFBOX-3017: pass signing date to OCSPHelper


[jira] [Commented] (PDFBOX-3017) Improve document signing

2018-11-27 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700826#comment-16700826
 ] 

ASF subversion and git services commented on PDFBOX-3017:
-

Commit 1847573 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1847573 ]

PDFBOX-3017: pass signing date to OCSPHelper to compare revocation date with 
sign date; check revocation of OCSP responder
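
The comparison this commit message describes can be sketched in a few lines. This is only an illustration of the idea, not the actual OCSPHelper code: the class name, the method name, and the handling of equal timestamps (treated here as revoked) are all assumptions.

```java
import java.util.Date;

// Hypothetical helper, not part of PDFBox.
final class RevocationCheck {
    /**
     * Returns true if the certificate counts as not revoked at signing time:
     * either it was never revoked (revocationDate == null) or the revocation
     * happened strictly after the signing date.
     */
    static boolean wasValidAtSigning(Date signingDate, Date revocationDate) {
        if (revocationDate == null) {
            return true; // no revocation recorded
        }
        // Assumption: a revocation at exactly the signing time counts as revoked.
        return revocationDate.after(signingDate);
    }
}
```

With this rule, a signature made before the certificate was revoked would still be accepted, while one made afterwards would not.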


[jira] [Commented] (PDFBOX-3017) Improve document signing

2018-11-27 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700827#comment-16700827
 ] 

ASF subversion and git services commented on PDFBOX-3017:
-

Commit 1847574 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847574 ]

PDFBOX-3017: pass signing date to OCSPHelper to compare revocation date with 
sign date; check revocation of OCSP responder


[jira] [Resolved] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-11-27 Thread Tim Allison (JIRA)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved PDFBOX-4184.
-
Resolution: Fixed

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Writing
>Affects Versions: 2.0.9
>Reporter: Emmeran Seehuber
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0 PDFBox, 2.0.12
>
> Attachments: 032163.jpg, 16bit.png, LoadGovdocs.java, 
> fix_profile_use.patch, fix_profile_use3.patch, fix_profile_use4.patch, 
> images.zip, lossless_predictor_based_imageencoding.patch, 
> lossless_predictor_based_imageencoding_v2.patch, 
> lossless_predictor_based_imageencoding_v3.patch, 
> lossless_predictor_based_imageencoding_v4.patch, 
> lossless_predictor_based_imageencoding_v5.patch, 
> lossless_predictor_based_imageencoding_v6.patch, 
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf, 
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf, 
> size_compare.txt
>
>
> The attached patch adds support for writing 16 bit per component images
> correctly. I've integrated a test for this here:
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT, but this
> is what you usually get when you read a 16 bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as
> the images are currently not efficiently encoded. For example, you could use PNG
> encoding to get better compression (by adding a COSName.DECODE_PARMS with
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is
> something for a later patch. It would also need another API, as there is a
> tradeoff between speed and compression ratio.
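
To illustrate the PNG predictor idea from the description above: with a PNG predictor, each image row is prefixed with a filter type byte and the pixel bytes are replaced by differences, which usually compress better with Flate. The sketch below shows only the PNG "Sub" filter (type 1) for a single row. It is not the code from any of the attached patches; the class and method names are made up for illustration.

```java
// Hypothetical helper, not part of PDFBox.
final class PngSubFilter {
    /**
     * Encodes one image row with the PNG "Sub" filter (filter type 1).
     * Each output byte is the input byte minus the corresponding byte of the
     * pixel to its left (modulo 256); the first pixel is left unchanged.
     *
     * @param row           raw bytes of one image row
     * @param bytesPerPixel e.g. 6 for 16 bit RGB, 3 for 8 bit RGB
     * @return the filter type byte followed by the filtered row
     */
    static byte[] encodeRow(byte[] row, int bytesPerPixel) {
        byte[] out = new byte[row.length + 1];
        out[0] = 1; // filter type byte: Sub
        for (int i = 0; i < row.length; i++) {
            int left = i >= bytesPerPixel ? row[i - bytesPerPixel] & 0xFF : 0;
            out[i + 1] = (byte) ((row[i] & 0xFF) - left);
        }
        return out;
    }
}
```

A real encoder would pick a filter per row (PNG also defines None, Up, Average, and Paeth) and would have to describe the row layout in the decode parameters; choosing the filter is where the speed versus compression tradeoff mentioned above comes in.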




[jira] [Closed] (PDFBOX-4386) Incorrect encoding during pdf file reading

2018-11-27 Thread Tilman Hausherr (JIRA)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-4386.
---
Resolution: Not A Bug

Closing as "not a bug". Text extraction is working as designed, which is why 
Adobe Reader shows the same problem. You can still comment.

> Incorrect encoding during pdf file reading
> --
>
> Key: PDFBOX-4386
> URL: https://issues.apache.org/jira/browse/PDFBOX-4386
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.12
>Reporter: Oleksandr Skoryi
>Priority: Major
> Attachments: Test2.pdf, image-2018-11-26-21-06-57-022.png
>
>
> Hello everybody, I use PDFBox for scraping text from the attached PDF.
> The issue is the double "ff" in "Kaffee-Pads".
> I downloaded the PDF debugger and found that it is a symbol with the 31st Unicode
> code point; however, I think it is a bug. Sincerely waiting for your reply.
> !image-2018-11-26-21-06-57-022.png!




[jira] [Closed] (PDFBOX-4387) Parsing typographic ligatures

2018-11-27 Thread Tilman Hausherr (JIRA)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-4387.
---
Resolution: Duplicate

Closing as duplicate of PDFBOX-4386. The problem with this file is that some
non-ligature glyphs don't have Unicode values. The file was produced with ilovepdf.com.

> Parsing typographic ligatures
> -
>
> Key: PDFBOX-4387
> URL: https://issues.apache.org/jira/browse/PDFBOX-4387
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.9, 2.0.12
>Reporter: Oleksandr Skoryi
>Priority: Major
> Attachments: test.pdf
>
>
> Hello everybody. I tried to parse the following PDF, but have a problem with
> ligatures: PDFBox adds an extra space after each of them.
> The attached PDF shows the issue in the word "flüssig" under "Persil powder";
> however, other ligatures are affected too.




[jira] [Commented] (PDFBOX-4386) Incorrect encoding during pdf file reading

2018-11-27 Thread Tilman Hausherr (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700777#comment-16700777
 ] 

Tilman Hausherr commented on PDFBOX-4386:
-

You could use OCR, then maybe compare the text extraction with the OCR output and
correct the differences using services like Amazon Mechanical Turk.

Another idea would be to detect such fonts and adjust the Unicode when missing. 
But you're on your own there, it will probably be several days of work. You'd 
need to connect the glyph name ("f_f") with the Unicode entry.

But this will not be the only problem you may have with text extraction. Some
PDF files may yield nothing, or completely garbled text.
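
As a rough sketch of the "connect the glyph name with the Unicode entry" idea: glyph names that follow the Adobe Glyph List naming convention join the components of a ligature with underscores, so "f_f" can be decomposed into "f" and "f". The code below is a minimal illustration under that assumption. The class name and the tiny lookup table are hypothetical, and a real fix would have to hook into the font handling and use a full glyph list.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
final class LigatureGlyphNames {
    // Tiny illustrative subset of a glyph-name-to-Unicode table.
    private static final Map<String, String> BASE = new HashMap<>();
    static {
        BASE.put("f", "f");
        BASE.put("i", "i");
        BASE.put("l", "l");
        BASE.put("t", "t");
    }

    /**
     * Returns the Unicode text for a glyph name such as "f_f",
     * or null if any component is unknown.
     */
    static String toUnicode(String glyphName) {
        // Drop any suffix after the first period, e.g. "f_f.alt" -> "f_f".
        int dot = glyphName.indexOf('.');
        String base = dot >= 0 ? glyphName.substring(0, dot) : glyphName;
        StringBuilder sb = new StringBuilder();
        for (String component : base.split("_")) {
            String unicode = BASE.get(component);
            if (unicode == null) {
                return null; // unknown component, give up
            }
            sb.append(unicode);
        }
        return sb.toString();
    }
}
```

For the file in this issue, `toUnicode("f_f")` would give "ff", which is the text missing from "Kaffee-Pads". Glyphs with arbitrary or missing names would still need OCR or manual correction.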


[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-11-27 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700762#comment-16700762
 ] 

ASF subversion and git services commented on PDFBOX-4184:
-

Commit 1847570 from [~talli...@apache.org] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1847570 ]

PDFBOX-4184 -- pull test file from JIRA rather than internet archive


[jira] [Reopened] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-11-27 Thread Tim Allison (JIRA)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened PDFBOX-4184:
-

Reopening...let's pull govdocs1 test file from JIRA rather than the internet 
archive.

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Writing
>Affects Versions: 2.0.9
>Reporter: Emmeran Seehuber
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 2.0.12, 3.0.0 PDFBox
>
> Attachments: 032163.jpg, 16bit.png, LoadGovdocs.java, 
> fix_profile_use.patch, fix_profile_use3.patch, fix_profile_use4.patch, 
> images.zip, lossless_predictor_based_imageencoding.patch, 
> lossless_predictor_based_imageencoding_v2.patch, 
> lossless_predictor_based_imageencoding_v3.patch, 
> lossless_predictor_based_imageencoding_v4.patch, 
> lossless_predictor_based_imageencoding_v5.patch, 
> lossless_predictor_based_imageencoding_v6.patch, 
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf, 
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf, 
> size_compare.txt
>
>
> The attached patch adds support for writing 16-bit-per-component images 
> correctly. I've integrated a test for this here: 
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT, but this 
> is what you usually get when you read a 16-bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as 
> the images are currently not efficiently encoded. For example, you could use 
> PNG encoding to get better compression (by adding a COSName.DECODE_PARMS with 
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is 
> something for a later patch. It would also need another API, as there is a 
> tradeoff between speed and compression ratio.
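
To make the predictor idea in the description concrete: with PNG prediction, every image row is stored with a leading filter-type byte followed by the filtered sample bytes, and a Predictor value of 15 in the decode parameters tells the reader that the filter may differ from row to row. The sketch below applies only the simplest PNG filter, "Sub" (type 1), to one row of 16-bit RGB samples. It is an illustration under those assumptions, not the code from the attached patches; the class and method names are invented for this example.

```java
import java.util.Arrays;

public class PngSubFilterSketch {
    /** Applies the PNG "Sub" filter (type 1) to one row of raw sample bytes. */
    public static byte[] subFilterRow(byte[] row, int bytesPerPixel) {
        byte[] out = new byte[row.length + 1];
        out[0] = 1; // filter-type byte that precedes every row
        for (int i = 0; i < row.length; i++) {
            // the byte at the same position in the previous pixel, or 0 for the first pixel
            int left = i >= bytesPerPixel ? row[i - bytesPerPixel] & 0xFF : 0;
            out[i + 1] = (byte) ((row[i] & 0xFF) - left);
        }
        return out;
    }

    /** Reverses subFilterRow, as a decoder would. */
    public static byte[] unfilterRow(byte[] filtered, int bytesPerPixel) {
        byte[] row = new byte[filtered.length - 1];
        for (int i = 0; i < row.length; i++) {
            int left = i >= bytesPerPixel ? row[i - bytesPerPixel] & 0xFF : 0;
            row[i] = (byte) ((filtered[i + 1] & 0xFF) + left);
        }
        return row;
    }

    public static void main(String[] args) {
        // two 16-bit RGB pixels, big-endian samples, 6 bytes per pixel
        byte[] row = {0x10, 0x00, 0x20, 0x00, 0x30, 0x00,
                      0x10, 0x01, 0x20, 0x02, 0x30, 0x03};
        byte[] filtered = subFilterRow(row, 6);
        System.out.println(Arrays.toString(filtered));
        System.out.println(Arrays.equals(row, unfilterRow(filtered, 6)));
    }
}
```

Neighbouring pixels that differ only slightly turn into rows of small values and zeros, which the subsequent Flate compression handles much better than the raw samples. Trying several filters per row and keeping the best one costs encoding time, which is the speed versus compression tradeoff the description mentions.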




[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-11-27 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700760#comment-16700760
 ] 

ASF subversion and git services commented on PDFBOX-4184:
-

Commit 1847569 from [~talli...@apache.org] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847569 ]

PDFBOX-4184 -- switch to download test file from jira rather than the internet 
archive.

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Writing
>Affects Versions: 2.0.9
>Reporter: Emmeran Seehuber
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 2.0.12, 3.0.0 PDFBox
>
> Attachments: 032163.jpg, 16bit.png, LoadGovdocs.java, 
> fix_profile_use.patch, fix_profile_use3.patch, fix_profile_use4.patch, 
> images.zip, lossless_predictor_based_imageencoding.patch, 
> lossless_predictor_based_imageencoding_v2.patch, 
> lossless_predictor_based_imageencoding_v3.patch, 
> lossless_predictor_based_imageencoding_v4.patch, 
> lossless_predictor_based_imageencoding_v5.patch, 
> lossless_predictor_based_imageencoding_v6.patch, 
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf, 
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf, 
> size_compare.txt
>
>
> The attached patch adds support for writing 16-bit-per-component images 
> correctly. I've integrated a test for this here: 
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT, but this 
> is what you usually get when you read a 16-bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as 
> the images are currently not efficiently encoded. For example, you could use 
> PNG encoding to get better compression (by adding a COSName.DECODE_PARMS with 
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is 
> something for a later patch. It would also need another API, as there is a 
> tradeoff between speed and compression ratio.




[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory

2018-11-27 Thread Vincenzo Mangiapanello (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700726#comment-16700726
 ] 

Vincenzo Mangiapanello commented on PDFBOX-4384:


Thanks a lot.

Tomorrow morning we'll test the new snapshot and let you know the result.

> PDF/A Document Validation out of memory
> ---
>
> Key: PDFBOX-4384
> URL: https://issues.apache.org/jira/browse/PDFBOX-4384
> Project: PDFBox
>  Issue Type: Bug
>  Components: Preflight
>Affects Versions: 2.0.8, 2.0.12
>Reporter: Vincenzo Mangiapanello
>Priority: Major
>
> Hi everyone,
> validating a customer PDF file, using
> {code:java}
> document.validate(){code}
> we noticed that if the file has an enormous number of validation errors, 
> the process runs out of memory and we eventually get the GC error.
> In our case the file has more than 550,000 errors, so we cannot go ahead with 
> the conversion to PDF/A.
> To avoid this kind of error, it would be useful to configure a maximum number 
> of validation errors and stop the process once this value has been reached.
> We cannot attach the original document, because it is a customer's file.




[jira] [Commented] (PDFBOX-4388) Update several tests to cache downloaded test files with download-maven-plugin

2018-11-27 Thread Tim Allison (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700690#comment-16700690
 ] 

Tim Allison commented on PDFBOX-4388:
-

This patch is based on 2.x.  I'll commit it to trunk, 2.x and 1.x in a few days 
unless there are recommendations/objections.

> Update several tests to cache downloaded test files with download-maven-plugin
> --
>
> Key: PDFBOX-4388
> URL: https://issues.apache.org/jira/browse/PDFBOX-4388
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Attachments: PDFBOX-4388.patch
>
>
> On PDFBOX-3974, [~tilman] added the use of {{download-maven-plugin}} to 
> download test files.  This allows for caching of files, and it appropriately 
> applies corporate proxy settings.
> There are a few tests that could be updated: PDButtonTest, 
> MergeAnnotationsTest and MergeAcroFormsTest.




[jira] [Comment Edited] (PDFBOX-4384) PDF/A Document Validation out of memory

2018-11-27 Thread Tilman Hausherr (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700645#comment-16700645
 ] 

Tilman Hausherr edited comment on PDFBOX-4384 at 11/27/18 4:43 PM:
---

This now uses a different strategy: the page tree validation process is aborted 
once too many errors have been found. It is still possible that other errors 
appear after that. It is also possible that it doesn't abort because there is 
a single super-complicated page. This is too difficult to change.

To configure, call {{document.getContext().getConfig().setMaxErrors(xxx)}}.

(snapshot available at link previously mentioned)
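
The strategy described in this comment can be illustrated with a small, self-contained sketch. It does not use or reproduce the Preflight code; the only API name taken from the thread is setMaxErrors, and everything below (MaxErrorsSketch, BoundedErrors, validatePages) is hypothetical. The sketch assumes the cap is checked between pages, which is one way to read the remark that a single very complicated page may not be interrupted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MaxErrorsSketch {
    /** Collects validation errors and reports when a cap is reached (hypothetical helper). */
    public static final class BoundedErrors {
        private final int maxErrors;
        private final List<String> errors = new ArrayList<>();

        public BoundedErrors(int maxErrors) {
            this.maxErrors = maxErrors;
        }

        public void add(String error) {
            errors.add(error);
        }

        public boolean limitReached() {
            return errors.size() >= maxErrors;
        }

        public List<String> errors() {
            return errors;
        }
    }

    /** Walks the pages in order and stops once the cap has been reached. */
    public static BoundedErrors validatePages(List<List<String>> errorsPerPage, int maxErrors) {
        BoundedErrors result = new BoundedErrors(maxErrors);
        for (List<String> pageErrors : errorsPerPage) {
            if (result.limitReached()) {
                break; // abort the page loop, remaining pages are not validated
            }
            // a page is always processed completely, so the total can exceed the cap
            for (String error : pageErrors) {
                result.add(error);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c", "d", "e"),
                Arrays.asList("f"));
        BoundedErrors result = validatePages(pages, 3);
        System.out.println(result.errors());
    }
}
```

Because the check only runs between pages, memory use stays bounded by the cap plus whatever a single page can produce, and errors found by later validation stages can still be added afterwards.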


was (Author: tilman):
This is now a different strategy, the page tree validation process is aborted. 
It is still possible that other errors appear after that. It is also possible 
that it doesn't abort because there is a super-complicated page. This is too 
difficult to change.

To configure, call {{document.getContext().getConfig().setMaxErrors(xxx)}}.

(snapshot will be available soon)

> PDF/A Document Validation out of memory
> ---
>
> Key: PDFBOX-4384
> URL: https://issues.apache.org/jira/browse/PDFBOX-4384
> Project: PDFBox
>  Issue Type: Bug
>  Components: Preflight
>Affects Versions: 2.0.8, 2.0.12
>Reporter: Vincenzo Mangiapanello
>Priority: Major
>
> Hi everyone,
> validating a customer PDF file, using
> {code:java}
> document.validate(){code}
> we noticed that if the file has an enormous number of validation errors, 
> the process runs out of memory and we eventually get the GC error.
> In our case the file has more than 550,000 errors, so we cannot go ahead with 
> the conversion to PDF/A.
> To avoid this kind of error, it would be useful to configure a maximum number 
> of validation errors and stop the process once this value has been reached.
> We cannot attach the original document, because it is a customer's file.




[jira] [Updated] (PDFBOX-4388) Update several tests to cache downloaded test files with download-maven-plugin

2018-11-27 Thread Tim Allison (JIRA)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-4388:

Attachment: PDFBOX-4388.patch

> Update several tests to cache downloaded test files with download-maven-plugin
> --
>
> Key: PDFBOX-4388
> URL: https://issues.apache.org/jira/browse/PDFBOX-4388
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Attachments: PDFBOX-4388.patch
>
>
> On PDFBOX-3974, [~tilman] added the use of {{download-maven-plugin}} to 
> download test files.  This allows for caching of files, and it appropriately 
> applies corporate proxy settings.
> There are a few tests that could be updated: PDButtonTest, 
> MergeAnnotationsTest and MergeAcroFormsTest.




[jira] [Assigned] (PDFBOX-4388) Update several tests to cache downloaded test files with download-maven-plugin

2018-11-27 Thread Tim Allison (JIRA)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned PDFBOX-4388:
---

Assignee: Tim Allison

> Update several tests to cache downloaded test files with download-maven-plugin
> --
>
> Key: PDFBOX-4388
> URL: https://issues.apache.org/jira/browse/PDFBOX-4388
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
>
> On PDFBOX-3974, [~tilman] added the use of {{download-maven-plugin}} to 
> download test files.  This allows for caching of files, and it appropriately 
> applies corporate proxy settings.
> There are a few tests that could be updated: PDButtonTest, 
> MergeAnnotationsTest and MergeAcroFormsTest.




[jira] [Created] (PDFBOX-4388) Update several tests to cache downloaded test files with download-maven-plugin

2018-11-27 Thread Tim Allison (JIRA)

Tim Allison created PDFBOX-4388:
---

 Summary: Update several tests to cache downloaded test files with 
download-maven-plugin
 Key: PDFBOX-4388
 URL: https://issues.apache.org/jira/browse/PDFBOX-4388
 Project: PDFBox
  Issue Type: Task
Reporter: Tim Allison


On PDFBOX-3974, [~tilman] added the use of {{download-maven-plugin}} to 
download test files.  This allows for caching of files, and it appropriately 
applies corporate proxy settings.

There are a few tests that could be updated: PDButtonTest, MergeAnnotationsTest 
and MergeAcroFormsTest.
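
As a rough illustration of the caching behaviour described here, the following sketch downloads a test file only when it is not already present in a local cache directory. It is not how download-maven-plugin is implemented and it does not deal with Maven proxy settings; the class name CachedDownloadSketch and the method fetchCached are invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CachedDownloadSketch {
    /** Returns the cached copy of a file, downloading it first if it is missing. */
    public static Path fetchCached(URI source, Path cacheDir, String fileName)
            throws IOException {
        Path target = cacheDir.resolve(fileName);
        if (Files.exists(target)) {
            return target; // cache hit, no network access needed
        }
        Files.createDirectories(cacheDir);
        // download to a temporary file first so a failed transfer never
        // leaves a truncated file under the final name
        Path tmp = Files.createTempFile(cacheDir, fileName, ".part");
        try (InputStream in = source.toURL().openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // a local file stands in for the remote test file in this demo
        Path remote = Files.createTempFile("remote", ".bin");
        Files.write(remote, "test data".getBytes(StandardCharsets.UTF_8));
        Path cacheDir = Files.createTempDirectory("cache");

        Path first = fetchCached(remote.toUri(), cacheDir, "sample.bin");
        Files.delete(remote); // the second call must not need the source
        Path second = fetchCached(remote.toUri(), cacheDir, "sample.bin");

        System.out.println(first.equals(second));
        System.out.println(new String(Files.readAllBytes(second), StandardCharsets.UTF_8));
    }
}
```

The benefit for a test suite is that repeated builds, or builds without network access, keep working once the files have been fetched.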




[jira] [Resolved] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-11-27 Thread Tim Allison (JIRA)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved PDFBOX-4184.
-
Resolution: Fixed
  Assignee: Tilman Hausherr  (was: Tim Allison)

Re-assigning to [~tilman] who did all of the work. :)

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Writing
>Affects Versions: 2.0.9
>Reporter: Emmeran Seehuber
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0 PDFBox, 2.0.12
>
> Attachments: 032163.jpg, 16bit.png, LoadGovdocs.java, 
> fix_profile_use.patch, fix_profile_use3.patch, fix_profile_use4.patch, 
> images.zip, lossless_predictor_based_imageencoding.patch, 
> lossless_predictor_based_imageencoding_v2.patch, 
> lossless_predictor_based_imageencoding_v3.patch, 
> lossless_predictor_based_imageencoding_v4.patch, 
> lossless_predictor_based_imageencoding_v5.patch, 
> lossless_predictor_based_imageencoding_v6.patch, 
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf, 
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf, 
> size_compare.txt
>
>
> The attached patch adds support for writing 16-bit-per-component images 
> correctly. I've integrated a test for this here: 
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT, but this 
> is what you usually get when you read a 16-bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as 
> the images are currently not efficiently encoded. For example, you could use 
> PNG encoding to get better compression (by adding a COSName.DECODE_PARMS with 
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is 
> something for a later patch. It would also need another API, as there is a 
> tradeoff between speed and compression ratio.




[jira] [Comment Edited] (PDFBOX-4384) PDF/A Document Validation out of memory

2018-11-27 Thread Tilman Hausherr (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700645#comment-16700645
 ] 

Tilman Hausherr edited comment on PDFBOX-4384 at 11/27/18 4:18 PM:
---

This now uses a different strategy: the page tree validation process is aborted 
once too many errors have been found. It is still possible that other errors 
appear after that. It is also possible that it doesn't abort because there is 
a single super-complicated page. This is too difficult to change.

To configure, call {{document.getContext().getConfig().setMaxErrors(xxx)}}.

(snapshot will be available soon)


was (Author: tilman):
This is now a different strategy, the page tree validation process is aborted. 
It is still possible that other errors appear after that. It is also possible 
that it doesn't abort because there is a super-complicated page. This is too 
difficult to change.

To configure, call {{document.getContext().getConfig().setMaxErrors(xxx)}}.

> PDF/A Document Validation out of memory
> ---
>
> Key: PDFBOX-4384
> URL: https://issues.apache.org/jira/browse/PDFBOX-4384
> Project: PDFBox
>  Issue Type: Bug
>  Components: Preflight
>Affects Versions: 2.0.8, 2.0.12
>Reporter: Vincenzo Mangiapanello
>Priority: Major
>
> Hi everyone,
> validating a customer PDF file, using
> {code:java}
> document.validate(){code}
> we noticed that if the file has an enormous number of validation errors, 
> the process runs out of memory and we eventually get the GC error.
> In our case the file has more than 550,000 errors, so we cannot go ahead with 
> the conversion to PDF/A.
> To avoid this kind of error, it would be useful to configure a maximum number 
> of validation errors and stop the process once this value has been reached.
> We cannot attach the original document, because it is a customer's file.




[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory

2018-11-27 Thread Tilman Hausherr (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700645#comment-16700645
 ] 

Tilman Hausherr commented on PDFBOX-4384:
-

This now uses a different strategy: the page tree validation process is aborted 
once too many errors have been found. It is still possible that other errors 
appear after that. It is also possible that it doesn't abort because there is 
a single super-complicated page. This is too difficult to change.

To configure, call {{document.getContext().getConfig().setMaxErrors(xxx)}}.

> PDF/A Document Validation out of memory
> ---
>
> Key: PDFBOX-4384
> URL: https://issues.apache.org/jira/browse/PDFBOX-4384
> Project: PDFBox
>  Issue Type: Bug
>  Components: Preflight
>Affects Versions: 2.0.8, 2.0.12
>Reporter: Vincenzo Mangiapanello
>Priority: Major
>
> Hi everyone,
> validating a customer PDF file, using
> {code:java}
> document.validate(){code}
> we noticed that if the file has an enormous number of validation errors, 
> the process runs out of memory and we eventually get the GC error.
> In our case the file has more than 550,000 errors, so we cannot go ahead with 
> the conversion to PDF/A.
> To avoid this kind of error, it would be useful to configure a maximum number 
> of validation errors and stop the process once this value has been reached.
> We cannot attach the original document, because it is a customer's file.




[jira] [Reopened] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-11-27 Thread Tim Allison (JIRA)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened PDFBOX-4184:
-
  Assignee: Tim Allison  (was: Tilman Hausherr)

Re-opening to add attachment

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Writing
>Affects Versions: 2.0.9
>Reporter: Emmeran Seehuber
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0.12, 3.0.0 PDFBox
>
> Attachments: 032163.jpg, 16bit.png, LoadGovdocs.java, 
> fix_profile_use.patch, fix_profile_use3.patch, fix_profile_use4.patch, 
> images.zip, lossless_predictor_based_imageencoding.patch, 
> lossless_predictor_based_imageencoding_v2.patch, 
> lossless_predictor_based_imageencoding_v3.patch, 
> lossless_predictor_based_imageencoding_v4.patch, 
> lossless_predictor_based_imageencoding_v5.patch, 
> lossless_predictor_based_imageencoding_v6.patch, 
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf, 
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf, 
> size_compare.txt
>
>
> The attached patch adds support for writing 16-bit-per-component images 
> correctly. I've integrated a test for this here: 
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT, but this 
> is what you usually get when you read a 16-bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as 
> the images are currently not efficiently encoded. For example, you could use 
> PNG encoding to get better compression (by adding a COSName.DECODE_PARMS with 
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is 
> something for a later patch. It would also need another API, as there is a 
> tradeoff between speed and compression ratio.




[jira] [Comment Edited] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-11-27 Thread Tim Allison (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700641#comment-16700641
 ] 

Tim Allison edited comment on PDFBOX-4184 at 11/27/18 4:15 PM:
---

Re-opening to add literal govdocs1 test file 032163.jpg


was (Author: talli...@mitre.org):
Re-opening to add attachment

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Writing
>Affects Versions: 2.0.9
>Reporter: Emmeran Seehuber
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0.12, 3.0.0 PDFBox
>
> Attachments: 032163.jpg, 16bit.png, LoadGovdocs.java, 
> fix_profile_use.patch, fix_profile_use3.patch, fix_profile_use4.patch, 
> images.zip, lossless_predictor_based_imageencoding.patch, 
> lossless_predictor_based_imageencoding_v2.patch, 
> lossless_predictor_based_imageencoding_v3.patch, 
> lossless_predictor_based_imageencoding_v4.patch, 
> lossless_predictor_based_imageencoding_v5.patch, 
> lossless_predictor_based_imageencoding_v6.patch, 
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf, 
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf, 
> size_compare.txt
>
>
> The attached patch adds support for writing 16-bit-per-component images 
> correctly. I've integrated a test for this here: 
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT, but this 
> is what you usually get when you read a 16-bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as 
> the images are currently not efficiently encoded. For example, you could use 
> PNG encoding to get better compression (by adding a COSName.DECODE_PARMS with 
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is 
> something for a later patch. It would also need another API, as there is a 
> tradeoff between speed and compression ratio.




[jira] [Updated] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images

2018-11-27 Thread Tim Allison (JIRA)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-4184:

Attachment: 032163.jpg

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Writing
>Affects Versions: 2.0.9
>Reporter: Emmeran Seehuber
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0.12, 3.0.0 PDFBox
>
> Attachments: 032163.jpg, 16bit.png, LoadGovdocs.java, 
> fix_profile_use.patch, fix_profile_use3.patch, fix_profile_use4.patch, 
> images.zip, lossless_predictor_based_imageencoding.patch, 
> lossless_predictor_based_imageencoding_v2.patch, 
> lossless_predictor_based_imageencoding_v3.patch, 
> lossless_predictor_based_imageencoding_v4.patch, 
> lossless_predictor_based_imageencoding_v5.patch, 
> lossless_predictor_based_imageencoding_v6.patch, 
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf, 
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf, 
> size_compare.txt
>
>
> The attached patch adds support for writing 16-bit-per-component images 
> correctly. I've integrated a test for this here: 
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT, but this 
> is what you usually get when you read a 16-bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as 
> the images are currently not efficiently encoded. For example, you could use 
> PNG encoding to get better compression (by adding a COSName.DECODE_PARMS with 
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is 
> something for a later patch. It would also need another API, as there is a 
> tradeoff between speed and compression ratio.




[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory

2018-11-27 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700635#comment-16700635
 ] 

ASF subversion and git services commented on PDFBOX-4384:
-

Commit 1847564 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1847564 ]

PDFBOX-4384: abort page tree validation process if too many errors + make it 
configurable; use foreach loop (also speeds up when many pages)

> PDF/A Document Validation out of memory
> ---
>
> Key: PDFBOX-4384
> URL: https://issues.apache.org/jira/browse/PDFBOX-4384
> Project: PDFBox
>  Issue Type: Bug
>  Components: Preflight
>Affects Versions: 2.0.8, 2.0.12
>Reporter: Vincenzo Mangiapanello
>Priority: Major
>
> Hi everyone,
> validating a customer PDF file, using
> {code:java}
> document.validate(){code}
> we noticed that if the file has an enormous number of validation errors, 
> the process runs out of memory and we eventually get the GC error.
> In our case the file has more than 550,000 errors, so we cannot go ahead with 
> the conversion to PDF/A.
> To avoid this kind of error, it would be useful to configure a maximum number 
> of validation errors and stop the process once this value has been reached.
> We cannot attach the original document, because it is a customer's file.




[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory

2018-11-27 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700620#comment-16700620
 ] 

ASF subversion and git services commented on PDFBOX-4384:
-

Commit 1847560 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1847560 ]

PDFBOX-4384: revert last two commits, will use different strategy

> PDF/A Document Validation out of memory
> ---
>
> Key: PDFBOX-4384
> URL: https://issues.apache.org/jira/browse/PDFBOX-4384
> Project: PDFBox
>  Issue Type: Bug
>  Components: Preflight
>Affects Versions: 2.0.8, 2.0.12
>Reporter: Vincenzo Mangiapanello
>Priority: Major
>
> Hi everyone,
> validating a customer PDF file, using
> {code:java}
> document.validate(){code}
> we noticed that if the file has an enormous number of validation errors, 
> the process runs out of memory and we eventually get the GC error.
> In our case the file has more than 550,000 errors, so we cannot go ahead with 
> the conversion to PDF/A.
> To avoid this kind of error, it would be useful to configure a maximum number 
> of validation errors and stop the process once this value has been reached.
> We cannot attach the original document, because it is a customer's file.




[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory

2018-11-27 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700628#comment-16700628
 ] 

ASF subversion and git services commented on PDFBOX-4384:
-

Commit 1847562 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847562 ]

PDFBOX-4384: abort page tree validation process if too many errors + make it 
configurable; use foreach loop (also speeds up when many pages)

> PDF/A Document Validation out of memory
> ---
>
> Key: PDFBOX-4384
> URL: https://issues.apache.org/jira/browse/PDFBOX-4384
> Project: PDFBox
>  Issue Type: Bug
>  Components: Preflight
>Affects Versions: 2.0.8, 2.0.12
>Reporter: Vincenzo Mangiapanello
>Priority: Major
>
> Hi everyone,
> validating a customer PDF file, using
> {code:java}
> document.validate(){code}
> we noticed that if the file has an enormous number of validation errors, 
> the process runs out of memory and we eventually get the GC error.
> In our case the file has more than 550,000 errors, so we cannot go ahead with 
> the conversion to PDF/A.
> To avoid this kind of error, it would be useful to configure a maximum number 
> of validation errors and stop the process once this value has been reached.
> We cannot attach the original document, because it is a customer's file.




[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory

2018-11-27 Thread ASF subversion and git services (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700621#comment-16700621
 ] 

ASF subversion and git services commented on PDFBOX-4384:
-

Commit 1847561 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847561 ]

PDFBOX-4384: revert last two commits, will use different strategy

> PDF/A Document Validation out of memory
> ---
>
> Key: PDFBOX-4384
> URL: https://issues.apache.org/jira/browse/PDFBOX-4384
> Project: PDFBox
>  Issue Type: Bug
>  Components: Preflight
>Affects Versions: 2.0.8, 2.0.12
>Reporter: Vincenzo Mangiapanello
>Priority: Major
>
> Hi everyone,
> validating a customer PDF file, using
> {code:java}
> document.validate(){code}
> we noticed that if the file has an enormous number of validation errors, 
> the process runs out of memory and we eventually get the GC error.
> In our case the file has more than 550,000 errors, so we cannot go ahead with 
> the conversion to PDF/A.
> To avoid this kind of error, it would be useful to configure a maximum number 
> of validation errors and stop the process once this value has been reached.
> We cannot attach the original document, because it is a customer's file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

[jira] [Comment Edited] (PDFBOX-4387) Parsing typographic ligatures

2018-11-27 Thread Oleksandr Skoryi (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700358#comment-16700358
 ] 

Oleksandr Skoryi edited comment on PDFBOX-4387 at 11/27/18 12:41 PM:
-

[~tilman]

Could you advise me of any workarounds?


was (Author: alexfaster):
[~tilman]

Could you tell me any workarounds?

> Parsing typographic ligatures
> -
>
> Key: PDFBOX-4387
> URL: https://issues.apache.org/jira/browse/PDFBOX-4387
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.9, 2.0.12
>Reporter: Oleksandr Skoryi
>Priority: Major
> Attachments: test.pdf
>
>
> Hello everybody. I tried to parse the following PDF, but have a problem with 
> ligatures: PDFBox adds an extra space after each of them.
> In the attached PDF the issue appears in the word "flüssig" under "Persil powder";
> however, other ligatures are affected too.
>  




[jira] [Comment Edited] (PDFBOX-4387) Parsing typographic ligatures

2018-11-27 Thread Oleksandr Skoryi (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700358#comment-16700358
 ] 

Oleksandr Skoryi edited comment on PDFBOX-4387 at 11/27/18 12:41 PM:
-

[~tilman]

Could you tell me any workarounds?


was (Author: alexfaster):
[~tilman]

Any workarounds ?


[jira] [Commented] (PDFBOX-4387) Parsing typographic ligatures

2018-11-27 Thread Oleksandr Skoryi (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700358#comment-16700358
 ] 

Oleksandr Skoryi commented on PDFBOX-4387:
--

[~tilman]

Any workarounds?


[jira] [Commented] (PDFBOX-4387) Parsing typographic ligatures

2018-11-27 Thread Tilman Hausherr (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700275#comment-16700275
 ] 

Tilman Hausherr commented on PDFBOX-4387:
-

Because text extraction and rendering are separate things. A glyph can be 
displayed correctly but still be mapped to the wrong Unicode value. 
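
As a possible post-processing step for the workaround question, here is a small sketch that expands Unicode ligature code points in already extracted text. It assumes the extracted string really contains ligature characters such as U+FB02 (which has not been confirmed for test.pdf), it does not remove the extra space described in the issue, and LigatureNormalizer is a hypothetical name. It uses only java.text.Normalizer from the JDK, not any PDFBox API.

```java
import java.text.Normalizer;

// Hypothetical helper (not a PDFBox class).
final class LigatureNormalizer {

    // NFKC replaces compatibility characters, including the Latin ligatures
    // U+FB00..U+FB06, with their plain-letter equivalents.
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "\uFB02" is the single "fl" ligature character, "\u00FC" is u-umlaut.
        String extracted = "\uFB02\u00FCssig";
        System.out.println(expandLigatures(extracted));
    }
}
```

Note that NFKC also rewrites other compatibility characters (for example superscript digits and full-width forms), so it may change more than just ligatures.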




[jira] [Commented] (PDFBOX-4386) Incorrect encoding during pdf file reading

2018-11-27 Thread Oleksandr Skoryi (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700176#comment-16700176
 ] 

Oleksandr Skoryi commented on PDFBOX-4386:
--

[~tilman]

Do you have any suggestions on how to fix that, or a possible workaround?

> Incorrect encoding during pdf file reading
> --
>
> Key: PDFBOX-4386
> URL: https://issues.apache.org/jira/browse/PDFBOX-4386
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.12
>Reporter: Oleksandr Skoryi
>Priority: Major
> Attachments: Test2.pdf, image-2018-11-26-21-06-57-022.png
>
>
> Hello everybody, I use PDFBox for scraping text from the attached PDF.
> The issue is with the double "ff" in "Kaffee-Pads".
> I downloaded the PDF debugger and found that it is a symbol with Unicode code 31; 
> however, I think it is a bug. Sincerely waiting for your reply.
> !image-2018-11-26-21-06-57-022.png!




[jira] [Commented] (PDFBOX-4386) Incorrect encoding during pdf file reading

2018-11-27 Thread Tilman Hausherr (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700166#comment-16700166
 ] 

Tilman Hausherr commented on PDFBOX-4386:
-

Because text extraction and rendering are separate things. A glyph can be 
displayed correctly but still be mapped to the wrong Unicode value. 
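
To find such mismatches in extracted text, a diagnostic along the following lines might help. It is only a sketch: SuspiciousCodePoints is a hypothetical name, the input is assumed to be the string obtained from text extraction, and the "code 31" from the report is assumed to be decimal, i.e. U+001F. It flags control characters (other than tab and line breaks), private-use code points and the replacement character U+FFFD, which usually indicate a font with a missing or wrong Unicode mapping rather than real text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper (not a PDFBox class).
final class SuspiciousCodePoints {

    // Returns one entry per suspicious code point, e.g. "U+001F at index 2".
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean ordinaryWhitespace = cp == '\n' || cp == '\r' || cp == '\t';
            boolean control = Character.isISOControl(cp) && !ordinaryWhitespace;
            boolean privateUse = Character.getType(cp) == Character.PRIVATE_USE;
            if (control || privateUse || cp == 0xFFFD) {
                hits.add(String.format("U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return hits;
    }

    public static void main(String[] args) {
        // Simulated extraction result where "ff" came out as U+001F.
        System.out.println(find("Ka\u001Fee-Pads"));
    }
}
```

The reported positions can then be compared with the rendered page to decide which replacement (here presumably "ff") to apply in post-processing.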




[jira] [Comment Edited] (PDFBOX-4386) Incorrect encoding during pdf file reading

2018-11-27 Thread Oleksandr Skoryi (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700109#comment-16700109
 ] 

Oleksandr Skoryi edited comment on PDFBOX-4386 at 11/27/18 9:25 AM:


[~tilman]

So the PDF is broken? But then how is the symbol displayed so precisely in 
PDF viewers?


was (Author: alexfaster):
So the PDF is broken? But then how is the symbol displayed so precisely in 
PDF viewers?


[jira] [Commented] (PDFBOX-4387) Parsing typographic ligatures

2018-11-27 Thread Oleksandr Skoryi (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700108#comment-16700108
 ] 

Oleksandr Skoryi commented on PDFBOX-4387:
--

[~tilman]

So the PDF is broken? But then how is the symbol displayed so precisely in 
PDF viewers?


[jira] [Commented] (PDFBOX-4386) Incorrect encoding during pdf file reading

2018-11-27 Thread Oleksandr Skoryi (JIRA)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700109#comment-16700109
 ] 

Oleksandr Skoryi commented on PDFBOX-4386:
--

So the PDF is broken? But then how is the symbol displayed so precisely in 
PDF viewers?

