Re: Fwd: 1.20?
Sounds good enough for me. I'm going to cut the release this evening, in about 10-11 hours from now.

Thanks Tim!

Andreas

On 28.11.18 at 03:40, Tim Allison wrote:
> Looks good. One file has fewer pages: govdocs1/229/229205.pdf
>
> Based on the content diffs, it looks like I compared 2.0.13-SNAPSHOT against 2.0.11 (/data4/batch_runs/tika_1_19-rc1). We upgraded to 2.0.12 in 1.19.1. Given that I don't see any differences, I think we're good. However, I'm happy to re-run with 2.0.12 as the baseline.
>
> The reports are here: http://162.242.228.174/reports/reports_pdfbox_2_0_13-pre-rc.tgz
>
> On Tue, Nov 27, 2018 at 1:22 PM Andreas Lehmkuehler wrote:
> > On 25.11.18 at 11:01, Andreas Lehmkuehler wrote:
> > > On 24.11.18 at 10:15, Tilman Hausherr wrote:
> > >> On 24.11.2018 at 08:53, Andreas Lehmkuehler wrote:
> > >>> How about cutting a release next week or a week later?
> > > I'm going to cut the release next Tuesday the 27th.
> > I'm going to postpone the preparations until we have hopefully positive test results from Tim
> >
> > Andreas
> >
> > > If there are any objections I postpone the release to the following Monday the 3rd of December.
> > >
> > > Andreas

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
Re: Fwd: 1.20?
Looks good. One file has fewer pages: govdocs1/229/229205.pdf

Based on the content diffs, it looks like I compared 2.0.13-SNAPSHOT against 2.0.11 (/data4/batch_runs/tika_1_19-rc1). We upgraded to 2.0.12 in 1.19.1. Given that I don't see any differences, I think we're good. However, I'm happy to re-run with 2.0.12 as the baseline.

The reports are here: http://162.242.228.174/reports/reports_pdfbox_2_0_13-pre-rc.tgz

On Tue, Nov 27, 2018 at 1:22 PM Andreas Lehmkuehler wrote:
> On 25.11.18 at 11:01, Andreas Lehmkuehler wrote:
> > On 24.11.18 at 10:15, Tilman Hausherr wrote:
> >> On 24.11.2018 at 08:53, Andreas Lehmkuehler wrote:
> >>> How about cutting a release next week or a week later?
> > I'm going to cut the release next Tuesday the 27th.
> I'm going to postpone the preparations until we have hopefully positive test results from Tim
>
> Andreas
>
> > If there are any objections I postpone the release to the following Monday the 3rd of December.
> >
> > Andreas
[jira] [Commented] (PDFBOX-3017) Improve document signing
[ https://issues.apache.org/jira/browse/PDFBOX-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700847#comment-16700847 ]

ASF subversion and git services commented on PDFBOX-3017:
-
Commit 1847577 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847577 ]

PDFBOX-3017: include mention of mkl comment so people don't take this example blindly

> Improve document signing
>
> Key: PDFBOX-3017
> URL: https://issues.apache.org/jira/browse/PDFBOX-3017
> Project: PDFBox
> Issue Type: Improvement
> Components: AcroForm, Signing
> Affects Versions: 2.0.0, 3.0.0 PDFBox
> Reporter: Tilman Hausherr
> Priority: Major
> Fix For: 3.0.0 PDFBox
> Attachments: PDFBOX-3017_certificate_chain.diff, PDFBOX-3017_certificate_chain_Screenshot.png, QV_RCA1_RCA3_CPCPS_V4_11.pdf, pdfa_signed_insivible.pdf
>
> Improve signing code:
> - incremental save only works for signatures and doesn't respect certificates such as Adobe Extended Usage Rights
> - -{{prepareNonVisualSignature}} clears the AcroForm DR {{acroForm.setDefaultResources(null)}} which is not good if there are other form fields-
> - visual/nonVisualSignature should move into the {{interactive.forms}} package and be handled within the signature field
> - -verify signature (to have tests that go full circle)- done June 2016
> - document or refactor / rewrite visible labyrinthine signature code
> - why is it not possible to pass only the signatureField to addSignature, instead having to create a COSDocument with a page and annotations that has the signature field, and that must be searched for in {{prepareVisibleSignature()}}?
> - -support rotated pages (see https://stackoverflow.com/questions/34012293/pdfbox-sign-landscape-file-error/34359956#34359956 )- done in PDFBOX-3671
> - -make sure that signed PDF/A files are still PDF/A (see http://www.pdfa.org/wp-content/uploads/2011/08/tn0006_digital_signatures_in_pdfa-1_2008-03-14.pdf ); /ID possibly not OK; /Annots is possibly required ([~tilman] removed this for invisible signatures); test signed files with PDF-Tools and with preflight- tested, they are OK with PDF-Tools and preflight
> - test whether "bad" signatures are detected by preflight (search in old issues)
> - -PDFBOX-3363 - why is the stream cached in a file? Should it be done in memory?- done on July 15, 2016
> - remove {{setVisualSignature(PDVisibleSigProperties visSignatureProperties)}} from SignatureOptions.java, all it does is to call {{visSignatureProperties.getVisibleSignature()}} which returns an {{InputStream}}, and this is already available
> - {{checkSignatureField}} violates the "do one thing" rule
> - -decide whether the whole certificate chain should be passed in the sample code, instead of only the first one- yes the whole chain is stored
> - -check certificate chain, revocation lists, etc,- only if needed by users, code [here|https://svn.apache.org/repos/asf/cxf/tags/cxf-2.4.1/distribution/src/main/release/samples/sts_issue_operation/src/main/java/demo/sts/provider/cert/]
> - deprecate / remove all PDVisibleSignDesigner constructors except those with a PDDocument object, to avoid a file being opened twice
> - ... your ideas...

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Fwd: 1.20?
On 25.11.18 at 11:01, Andreas Lehmkuehler wrote:
> On 24.11.18 at 10:15, Tilman Hausherr wrote:
>> On 24.11.2018 at 08:53, Andreas Lehmkuehler wrote:
>>> How about cutting a release next week or a week later?
> I'm going to cut the release next Tuesday the 27th.

I'm going to postpone the preparations until we have hopefully positive test results from Tim

Andreas

> If there are any objections I postpone the release to the following Monday the 3rd of December.
>
> Andreas
[jira] [Commented] (PDFBOX-3017) Improve document signing
[ https://issues.apache.org/jira/browse/PDFBOX-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700830#comment-16700830 ]

ASF subversion and git services commented on PDFBOX-3017:
-
Commit 1847575 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1847575 ]

PDFBOX-3017: pass signing date to OCSPHelper
[jira] [Commented] (PDFBOX-3017) Improve document signing
[ https://issues.apache.org/jira/browse/PDFBOX-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700831#comment-16700831 ]

ASF subversion and git services commented on PDFBOX-3017:
-
Commit 1847576 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847576 ]

PDFBOX-3017: pass signing date to OCSPHelper
[jira] [Commented] (PDFBOX-3017) Improve document signing
[ https://issues.apache.org/jira/browse/PDFBOX-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700826#comment-16700826 ]

ASF subversion and git services commented on PDFBOX-3017:
-
Commit 1847573 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1847573 ]

PDFBOX-3017: pass signing date to OCSPHelper to compare revocation date with sign date; check revocation of OCSP responder
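The comparison named in the commit message above (revocation date against sign date) can be sketched roughly as follows. This is only an illustration under an assumed rule: a revocation counts against a signature when the revocation time is not later than the signing time. The class `RevocationCheck` and the method `revokedAtSigningTime` are hypothetical names and are not part of PDFBox or its OCSPHelper example.

```java
import java.util.Date;

// Hypothetical helper for illustration only; not PDFBox API.
class RevocationCheck {
    /**
     * Returns true if the certificate had already been revoked when the
     * document was signed (assumed rule: revocation time <= signing time).
     */
    public static boolean revokedAtSigningTime(Date revocationDate, Date signDate) {
        // A revocation that happened after signing does not affect
        // a signature that was created earlier.
        return !revocationDate.after(signDate);
    }
}
```

With this rule, a certificate revoked one day after signing would still be accepted for that signature, while one revoked the day before would be rejected.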
[jira] [Commented] (PDFBOX-3017) Improve document signing
[ https://issues.apache.org/jira/browse/PDFBOX-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700827#comment-16700827 ]

ASF subversion and git services commented on PDFBOX-3017:
-
Commit 1847574 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847574 ]

PDFBOX-3017: pass signing date to OCSPHelper to compare revocation date with sign date; check revocation of OCSP responder
[jira] [Resolved] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved PDFBOX-4184.
-
Resolution: Fixed

> [PATCH]: Support simple lossless compression of 16 bit RGB images
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
> Issue Type: Improvement
> Components: Writing
> Affects Versions: 2.0.9
> Reporter: Emmeran Seehuber
> Assignee: Tilman Hausherr
> Priority: Minor
> Fix For: 3.0.0 PDFBox, 2.0.12
> Attachments: 032163.jpg, 16bit.png, LoadGovdocs.java, fix_profile_use.patch, fix_profile_use3.patch, fix_profile_use4.patch, images.zip, lossless_predictor_based_imageencoding.patch, lossless_predictor_based_imageencoding_v2.patch, lossless_predictor_based_imageencoding_v3.patch, lossless_predictor_based_imageencoding_v4.patch, lossless_predictor_based_imageencoding_v5.patch, lossless_predictor_based_imageencoding_v6.patch, pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf, png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf, size_compare.txt
>
> The attached patch adds support for writing 16 bit per component images correctly. I've integrated a test for this here:
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-Bit TYPE_CUSTOM with DataType == USHORT images - but this is what you usually get when you read a 16 bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as the images are currently not efficiently encoded. For example, you could use PNG encodings to get better compression (by adding a COSName.DECODE_PARMS with a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is something for a later patch. It would also need another API, as there is a tradeoff between speed and compression ratio.
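To illustrate the PNG-predictor idea from the issue description: with a PNG predictor in the decode parameters, every image row is stored with a leading filter-type byte, followed by the filtered row bytes, before the stream is deflated. The sketch below applies only the PNG "Sub" filter (type 1) to a single row; it is not the code from the attached patches, and `PngSubFilter`, `encodeRow` and `decodeRow` are hypothetical names. For 16 bit RGB, `bytesPerPixel` would be 6.

```java
// Hypothetical illustration of the PNG "Sub" filter; not PDFBox API.
class PngSubFilter {
    /** Filters one row: each byte becomes the difference to the byte one pixel to the left. */
    public static byte[] encodeRow(byte[] row, int bytesPerPixel) {
        byte[] out = new byte[row.length + 1];
        out[0] = 1; // filter-type byte: 1 = Sub
        for (int i = 0; i < row.length; i++) {
            int left = i >= bytesPerPixel ? row[i - bytesPerPixel] & 0xFF : 0;
            out[i + 1] = (byte) ((row[i] & 0xFF) - left); // difference modulo 256
        }
        return out;
    }

    /** Reverses encodeRow for a row that was filtered with the Sub filter. */
    public static byte[] decodeRow(byte[] filtered, int bytesPerPixel) {
        byte[] row = new byte[filtered.length - 1];
        for (int i = 0; i < row.length; i++) {
            int left = i >= bytesPerPixel ? row[i - bytesPerPixel] & 0xFF : 0;
            row[i] = (byte) ((filtered[i + 1] & 0xFF) + left); // sum modulo 256
        }
        return row;
    }
}
```

Neighbouring pixels with similar values turn into runs of small numbers, which usually deflate better. Choosing the best filter per row costs extra time, which is the speed versus compression tradeoff the reporter mentions.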
[jira] [Closed] (PDFBOX-4386) Incorrect encoding during pdf file reading
[ https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-4386.
---
Resolution: Not A Bug

Closing as "not a bug". Text extraction is working as designed, which is why Adobe Reader shows the same problem. You can still comment.

> Incorrect encoding during pdf file reading
>
> Key: PDFBOX-4386
> URL: https://issues.apache.org/jira/browse/PDFBOX-4386
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.12
> Reporter: Oleksandr Skoryi
> Priority: Major
> Attachments: Test2.pdf, image-2018-11-26-21-06-57-022.png
>
> Hello everybody, I use PDFBox for scraping text from the attached pdf.
> The issue is in the double ff in Kaffee-Pads.
> I downloaded the pdf debugger and found that it is a symbol with the 31st Unicode code; however, I think it is a bug. Sincerely waiting for your reply
> !image-2018-11-26-21-06-57-022.png!
[jira] [Closed] (PDFBOX-4387) Parsing typographic ligatures
[ https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-4387.
---
Resolution: Duplicate

Closing as duplicate of PDFBOX-4386. The problem with this file is that some non-ligature glyphs don't have Unicode. The file was produced with ilovepdf.com.

> Parsing typographic ligatures
>
> Key: PDFBOX-4387
> URL: https://issues.apache.org/jira/browse/PDFBOX-4387
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.9, 2.0.12
> Reporter: Oleksandr Skoryi
> Priority: Major
> Attachments: test.pdf
>
> Hello everybody. I tried to parse the following pdf, but have a problem with ligatures. PDFBox adds an extra space after each of them.
> The attached pdf has the issue in the word flüssig under Persil powder;
> however, other ligatures are affected too.
[jira] [Commented] (PDFBOX-4386) Incorrect encoding during pdf file reading
[ https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700777#comment-16700777 ]

Tilman Hausherr commented on PDFBOX-4386:
-
You could use OCR. And then maybe compare the text extraction with the OCR and then correct using services like Amazon Mechanical Turk.

Another idea would be to detect such fonts and adjust the Unicode when missing. But you're on your own there, it will probably be several days of work. You'd need to connect the glyph name ("f_f") with the Unicode entry.

But this will not be the only problem you may have with text extraction. Some PDF files may bring nothing, or completely garbled text.
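As a rough sketch of what "connect the glyph name with the Unicode entry" could look like: under the Adobe glyph naming convention, a name such as "f_f" is a sequence of components joined by underscores, and any suffix after the first period is ignored. The helper below handles only two kinds of component, single ASCII letters and `uniXXXX` names; anything else would need a full glyph list lookup. `LigatureNames` and `toUnicode` are hypothetical names, not PDFBox API, and wiring the result into text extraction is not shown.

```java
// Hypothetical helper for illustration only; not PDFBox API.
class LigatureNames {
    /** Maps a glyph name such as "f_f" to its text, or returns null if a component is unknown. */
    public static String toUnicode(String glyphName) {
        // Drop a suffix such as ".alt": "f_f.alt" is treated like "f_f".
        int dot = glyphName.indexOf('.');
        String base = dot >= 0 ? glyphName.substring(0, dot) : glyphName;
        StringBuilder sb = new StringBuilder();
        for (String part : base.split("_")) {
            if (part.matches("uni[0-9A-F]{4}")) {
                // "uniXXXX" names carry the code point directly.
                sb.appendCodePoint(Integer.parseInt(part.substring(3), 16));
            } else if (part.matches("[A-Za-z]")) {
                // Single ASCII letters name themselves.
                sb.append(part);
            } else {
                return null; // unknown component: a real glyph list would be needed
            }
        }
        return sb.toString();
    }
}
```

For the file in this issue, `toUnicode("f_f")` would yield the two letters "ff", so "Kaffee-Pads" could be reconstructed, assuming the font really uses that glyph name.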
[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700762#comment-16700762 ]

ASF subversion and git services commented on PDFBOX-4184:
-
Commit 1847570 from [~talli...@apache.org] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1847570 ]

PDFBOX-4184 -- pull test file from JIRA rather than internet archive
[jira] [Reopened] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reopened PDFBOX-4184:
---------------------------------

Reopening: let's pull the govdocs1 test file from JIRA rather than from the Internet Archive.

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -----------------------------------------------------------------
>
>                 Key: PDFBOX-4184
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4184
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Writing
>    Affects Versions: 2.0.9
>            Reporter: Emmeran Seehuber
>            Assignee: Tilman Hausherr
>            Priority: Minor
>             Fix For: 2.0.12, 3.0.0 PDFBox
>
>         Attachments: 032163.jpg, 16bit.png, LoadGovdocs.java,
> fix_profile_use.patch, fix_profile_use3.patch, fix_profile_use4.patch,
> images.zip, lossless_predictor_based_imageencoding.patch,
> lossless_predictor_based_imageencoding_v2.patch,
> lossless_predictor_based_imageencoding_v3.patch,
> lossless_predictor_based_imageencoding_v4.patch,
> lossless_predictor_based_imageencoding_v5.patch,
> lossless_predictor_based_imageencoding_v6.patch,
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf,
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf,
> size_compare.txt
>
>
> The attached patch adds support for writing 16 bit per component images
> correctly. I've integrated a test for this here:
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT, but this
> is what you usually get when you read a 16 bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as
> the images are currently not efficiently encoded. For example, you could use
> PNG encodings to get better compression (by adding a COSName.DECODE_PARMS with
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is
> something for a later patch. It would also need another API, as there is a
> tradeoff between speed and compression ratio.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
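The issue description's suggestion (a DecodeParms entry with Predictor 15, plus PNG-style row filtering before compression) can be illustrated with a small self-contained sketch. This is not the code from any of the attached patches: the class PngSubFilter and its methods are hypothetical names chosen for illustration. It shows only the PNG "Sub" filter on a single row, using 6 bytes per pixel for 16-bit RGB. In a real PNG-predicted stream each row additionally starts with a filter-type byte and the filtered rows are then deflated; both steps are left out here.

```java
/** Illustration only: PNG "Sub" filtering of one image row (hypothetical helper, not PDFBox code). */
final class PngSubFilter {

    /** Replaces each byte by its difference to the byte one pixel to the left (modulo 256). */
    static byte[] encodeRow(byte[] row, int bytesPerPixel) {
        byte[] out = new byte[row.length];
        for (int i = 0; i < row.length; i++) {
            // Bytes of the first pixel have no left neighbour; PNG treats it as 0.
            int left = i >= bytesPerPixel ? row[i - bytesPerPixel] & 0xFF : 0;
            out[i] = (byte) ((row[i] & 0xFF) - left);
        }
        return out;
    }

    /** Reverses encodeRow by adding the already reconstructed left neighbour back. */
    static byte[] decodeRow(byte[] filtered, int bytesPerPixel) {
        byte[] out = new byte[filtered.length];
        for (int i = 0; i < filtered.length; i++) {
            int left = i >= bytesPerPixel ? out[i - bytesPerPixel] & 0xFF : 0;
            out[i] = (byte) ((filtered[i] & 0xFF) + left);
        }
        return out;
    }
}
```

For a row whose neighbouring pixels are equal or similar, the filtered bytes are mostly zeros or small values, which is what lets the subsequent deflate step compress better. Choosing the best filter per row costs extra time, which matches the speed versus compression tradeoff mentioned in the description.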
[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700760#comment-16700760 ]

ASF subversion and git services commented on PDFBOX-4184:
----------------------------------------------------------

Commit 1847569 from [~talli...@apache.org] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847569 ]

PDFBOX-4184 -- switch to download test file from jira rather than the internet archive.
[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory
[ https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700726#comment-16700726 ]

Vincenzo Mangiapanello commented on PDFBOX-4384:
------------------------------------------------

Thanks a lot. Tomorrow morning we'll test the new snapshot and let you know the result.

> PDF/A Document Validation out of memory
> ---------------------------------------
>
>                 Key: PDFBOX-4384
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4384
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Preflight
>    Affects Versions: 2.0.8, 2.0.12
>            Reporter: Vincenzo Mangiapanello
>            Priority: Major
>
> Hi everyone,
> validating a customer PDF file, using
> {code:java}
> document.validate(){code}
> we noticed that if the file has an enormous number of validation
> errors, the process runs out of memory and in the end we get the GC
> error.
> In our case the file has more than 550.000 errors, so we cannot go ahead with
> the conversion to PDF/A.
> To avoid this kind of error it would be useful to be able to configure a
> maximum number of validation errors, so that the process stops once this
> value has been reached.
> We cannot attach the original document, because it is a customer's file.
[jira] [Commented] (PDFBOX-4388) Update several tests to cache downloaded test files with download-maven-plugin
[ https://issues.apache.org/jira/browse/PDFBOX-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700690#comment-16700690 ]

Tim Allison commented on PDFBOX-4388:
-------------------------------------

This patch is based on 2.x. I'll commit it to trunk, 2.x and 1.x in a few days unless there are recommendations/objections.

> Update several tests to cache downloaded test files with download-maven-plugin
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4388
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4388
>             Project: PDFBox
>          Issue Type: Task
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Trivial
>         Attachments: PDFBOX-4388.patch
>
>
> On PDFBOX-3974, [~tilman] added the use of {{download-maven-plugin}} to
> download test files. This allows for caching of files, and it appropriately
> applies corporate proxy settings.
> There are a few tests that could be updated: PDButtonTest,
> MergeAnnotationsTest and MergeAcroFormsTest.
[jira] [Comment Edited] (PDFBOX-4384) PDF/A Document Validation out of memory
[ https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700645#comment-16700645 ]

Tilman Hausherr edited comment on PDFBOX-4384 at 11/27/18 4:43 PM:
-------------------------------------------------------------------

This is now a different strategy, the page tree validation process is aborted. It is still possible that other errors appear after that. It is also possible that it doesn't abort because there is a super-complicated page. This is too difficult to change.

To configure, call {{document.getContext().getConfig().setMaxErrors(xxx)}}.

(snapshot available at link previously mentioned)

was (Author: tilman):
This is now a different strategy, the page tree validation process is aborted. It is still possible that other errors appear after that. It is also possible that it doesn't abort because there is a super-complicated page. This is too difficult to change.

To configure, call {{document.getContext().getConfig().setMaxErrors(xxx)}}.

(snapshot will be available soon)
[jira] [Updated] (PDFBOX-4388) Update several tests to cache downloaded test files with download-maven-plugin
[ https://issues.apache.org/jira/browse/PDFBOX-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated PDFBOX-4388:
--------------------------------
    Attachment: PDFBOX-4388.patch
[jira] [Assigned] (PDFBOX-4388) Update several tests to cache downloaded test files with download-maven-plugin
[ https://issues.apache.org/jira/browse/PDFBOX-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reassigned PDFBOX-4388:
-----------------------------------

    Assignee: Tim Allison
[jira] [Created] (PDFBOX-4388) Update several tests to cache downloaded test files with download-maven-plugin
Tim Allison created PDFBOX-4388:
-----------------------------------

             Summary: Update several tests to cache downloaded test files with download-maven-plugin
                 Key: PDFBOX-4388
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4388
             Project: PDFBox
          Issue Type: Task
            Reporter: Tim Allison


On PDFBOX-3974, [~tilman] added the use of {{download-maven-plugin}} to download test files. This allows for caching of files, and it appropriately applies corporate proxy settings.

There are a few tests that could be updated: PDButtonTest, MergeAnnotationsTest and MergeAcroFormsTest.
[jira] [Resolved] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved PDFBOX-4184.
---------------------------------
    Resolution: Fixed
      Assignee: Tilman Hausherr  (was: Tim Allison)

Re-assigning to [~tilman] who did all of the work. :)
[jira] [Comment Edited] (PDFBOX-4384) PDF/A Document Validation out of memory
[ https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700645#comment-16700645 ]

Tilman Hausherr edited comment on PDFBOX-4384 at 11/27/18 4:18 PM:
-------------------------------------------------------------------

This is now a different strategy, the page tree validation process is aborted. It is still possible that other errors appear after that. It is also possible that it doesn't abort because there is a super-complicated page. This is too difficult to change.

To configure, call {{document.getContext().getConfig().setMaxErrors(xxx)}}.

(snapshot will be available soon)

was (Author: tilman):
This is now a different strategy, the page tree validation process is aborted. It is still possible that other errors appear after that. It is also possible that it doesn't abort because there is a super-complicated page. This is too difficult to change.

To configure, call {{document.getContext().getConfig().setMaxErrors(xxx)}}.
[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory
[ https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700645#comment-16700645 ]

Tilman Hausherr commented on PDFBOX-4384:
-----------------------------------------

This is now a different strategy, the page tree validation process is aborted. It is still possible that other errors appear after that. It is also possible that it doesn't abort because there is a super-complicated page. This is too difficult to change.

To configure, call {{document.getContext().getConfig().setMaxErrors(xxx)}}.
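The behaviour described in this comment can be pictured with a small plain-Java sketch. It does not use the preflight API, and PageTreeAbortSketch with its validate method is a hypothetical name. It assumes, going only by the wording of the comment, that the error count is compared with the limit between pages rather than inside a page, which would explain why a single very complicated page can still push the total past the configured maximum.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: aborting a page-by-page validation once too many errors were collected. */
final class PageTreeAbortSketch {

    /**
     * errorsPerPage holds the (pretend) validation errors of each page in page order.
     * Returns the errors collected before the walk over the pages was aborted.
     */
    static List<String> validate(List<List<String>> errorsPerPage, int maxErrors) {
        List<String> collected = new ArrayList<>();
        for (List<String> pageErrors : errorsPerPage) {
            // A page is validated as a whole, so all of its errors are recorded.
            collected.addAll(pageErrors);
            // Assumption: the limit is only checked after a page has been finished.
            if (collected.size() >= maxErrors) {
                break;
            }
        }
        return collected;
    }
}
```

With pages reporting 2, 2 and 1 errors and a limit of 3, the walk stops after the second page with 4 errors collected: the limit bounds the work done, but it is not an exact cap on the number of reported errors.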
[jira] [Reopened] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reopened PDFBOX-4184:
---------------------------------
      Assignee: Tim Allison  (was: Tilman Hausherr)

Re-opening to add attachment
[jira] [Comment Edited] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700641#comment-16700641 ]

Tim Allison edited comment on PDFBOX-4184 at 11/27/18 4:15 PM:
---------------------------------------------------------------

Re-opening to add literal govdocs1 test file 032163.jpg

was (Author: talli...@mitre.org):
Re-opening to add attachment
[jira] [Updated] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated PDFBOX-4184:
--------------------------------
    Attachment: 032163.jpg
[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory
[ https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700635#comment-16700635 ]

ASF subversion and git services commented on PDFBOX-4384:
----------------------------------------------------------

Commit 1847564 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1847564 ]

PDFBOX-4384: abort page tree validation process if too many errors + make it configurable; use foreach loop (also speeds up when many pages)
[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory
[ https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700620#comment-16700620 ]

ASF subversion and git services commented on PDFBOX-4384:
----------------------------------------------------------

Commit 1847560 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1847560 ]

PDFBOX-4384: revert last two commits, will use different strategy
[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory
[ https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700628#comment-16700628 ]

ASF subversion and git services commented on PDFBOX-4384:
----------------------------------------------------------

Commit 1847562 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847562 ]

PDFBOX-4384: abort page tree validation process if too many errors + make it configurable; use foreach loop (also speeds up when many pages)
[jira] [Commented] (PDFBOX-4384) PDF/A Document Validation out of memory
[ https://issues.apache.org/jira/browse/PDFBOX-4384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700621#comment-16700621 ]

ASF subversion and git services commented on PDFBOX-4384:
----------------------------------------------------------

Commit 1847561 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1847561 ]

PDFBOX-4384: revert last two commits, will use different strategy
[jira] [Comment Edited] (PDFBOX-4387) Parsing typographic ligatures
[ https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700358#comment-16700358 ]

Oleksandr Skoryi edited comment on PDFBOX-4387 at 11/27/18 12:41 PM:
---------------------------------------------------------------------

[~tilman] Could you advise me of any workarounds?

was (Author: alexfaster):
[~tilman] Could you tell me any workarounds?

> Parsing typographic ligatures
> -----------------------------
>
>                 Key: PDFBOX-4387
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4387
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9, 2.0.12
>            Reporter: Oleksandr Skoryi
>            Priority: Major
>         Attachments: test.pdf
>
>
> Hello everybody. I tried to parse the attached PDF, but have a problem with
> ligatures: PDFBox adds an extra space after each of them.
> The attached PDF shows the issue in the word "flüssig" under "Persil powder";
> other ligatures are affected too.
[jira] [Comment Edited] (PDFBOX-4387) Parsing typographic ligatures
[ https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700358#comment-16700358 ]

Oleksandr Skoryi edited comment on PDFBOX-4387 at 11/27/18 12:41 PM:
---------------------------------------------------------------------

[~tilman] Could you tell me any workarounds?

was (Author: alexfaster):
[~tilman] Any workarounds?
[jira] [Commented] (PDFBOX-4387) Parsing typographic ligatures
[ https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700358#comment-16700358 ]

Oleksandr Skoryi commented on PDFBOX-4387:

[~tilman] Any workarounds?
[jira] [Commented] (PDFBOX-4387) Parsing typographic ligatures
[ https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700275#comment-16700275 ]

Tilman Hausherr commented on PDFBOX-4387:

Because text extraction and rendering are separate things. A glyph can have a correct visual display but a wrong Unicode value.
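The distinction in the comment above can be checked on the extracted string itself: a viewer draws glyph outlines, while text extraction depends on the Unicode values the PDF maps those glyphs to. Below is a minimal Java sketch, not part of PDFBox, that lists the code points of a string so that a stray space or an unexpected value becomes visible. The class name `CodePointDump` and the sample string are hypothetical; the sample only stands in for whatever the text stripper actually returns for test.pdf.

```java
import java.util.StringJoiner;

// Hypothetical diagnostic helper (not a PDFBox class): lists the Unicode
// code points of a string in U+XXXX notation, separated by spaces.
final class CodePointDump {

    static String dump(String text) {
        StringJoiner out = new StringJoiner(" ");
        text.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed stand-in for extracted text: the "fl" ligature (U+FB02),
        // an unwanted space, then the rest of the word.
        String sample = "\uFB02 \u00FCssig";
        // prints: U+FB02 U+0020 U+00FC U+0073 U+0073 U+0069 U+0067
        System.out.println(dump(sample));
    }
}
```

Running the same dump on the real output of the text stripper would show whether the space is an actual U+0020 in the extracted text and which code point the ligature was mapped to.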
[jira] [Commented] (PDFBOX-4386) Incorrect encoding during pdf file reading
[ https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700176#comment-16700176 ]

Oleksandr Skoryi commented on PDFBOX-4386:

[~tilman] Do you have any suggestion for how to fix that? Or a possible workaround?

> Incorrect encoding during pdf file reading
>
> Key: PDFBOX-4386
> URL: https://issues.apache.org/jira/browse/PDFBOX-4386
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.12
> Reporter: Oleksandr Skoryi
> Priority: Major
> Attachments: Test2.pdf, image-2018-11-26-21-06-57-022.png
>
> Hello everybody, I use PDFBox for scraping text from the attached PDF.
> The issue is in the double ff in Kaffee-Pads.
> I downloaded the PDF debugger and found that it is a symbol with the 31st Unicode value;
> however, I think it is a bug. Sincerely waiting for your reply.
> !image-2018-11-26-21-06-57-022.png!
[jira] [Commented] (PDFBOX-4386) Incorrect encoding during pdf file reading
[ https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700166#comment-16700166 ]

Tilman Hausherr commented on PDFBOX-4386:

Because text extraction and rendering are separate things. A glyph can have a correct visual display but a wrong Unicode value.
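If the PDF itself maps the glyph to the wrong value, one possible workaround is to post-process the extracted text. The Java sketch below is only an illustration under stated assumptions, not a PDFBox feature and not a fix proposed in this thread. It assumes that the reported "31st" value means code point 31 (U+001F) and that, in this particular document, that code point always stands for "ff". The class `ExtractedTextCleanup` and its replacement table are hypothetical. As a second step it applies NFKC normalization, which expands the Unicode ligature code points (for example U+FB01) into plain letters.

```java
import java.text.Normalizer;
import java.util.Collections;
import java.util.Map;

// Hypothetical post-processing step for text returned by a PDF text extractor.
final class ExtractedTextCleanup {

    // Assumption for one specific document: U+001F was produced where the
    // "ff" glyph is displayed. Other documents would need their own table.
    static final Map<Integer, String> REPLACEMENTS =
            Collections.singletonMap(0x1F, "ff");

    static String clean(String extracted) {
        StringBuilder sb = new StringBuilder();
        extracted.codePoints().forEach(cp -> {
            String replacement = REPLACEMENTS.get(cp);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        // NFKC also rewrites other compatibility characters (superscripts,
        // full-width forms), so it may be too aggressive for some uses.
        return Normalizer.normalize(sb, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // prints: Kaffee-Pads
        System.out.println(clean("Ka\u001Fee-Pads"));
    }
}
```

A table like this has to be built per document or per font, because the wrong value comes from the file rather than from a general rule. It also does nothing about extra spaces after ligatures.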
[jira] [Comment Edited] (PDFBOX-4386) Incorrect encoding during pdf file reading
[ https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700109#comment-16700109 ]

Oleksandr Skoryi edited comment on PDFBOX-4386 at 11/27/18 9:25 AM:

[~tilman] So the PDF is broken? But then how is the symbol displayed so precisely in PDF viewers?

was (Author: alexfaster):
So the PDF is broken? But then how is the symbol displayed so precisely in PDF viewers?
[jira] [Commented] (PDFBOX-4387) Parsing typographic ligatures
[ https://issues.apache.org/jira/browse/PDFBOX-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700108#comment-16700108 ]

Oleksandr Skoryi commented on PDFBOX-4387:

[~tilman] So the PDF is broken? But then how is the symbol displayed so precisely in PDF viewers?
[jira] [Commented] (PDFBOX-4386) Incorrect encoding during pdf file reading
[ https://issues.apache.org/jira/browse/PDFBOX-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700109#comment-16700109 ]

Oleksandr Skoryi commented on PDFBOX-4386:

So the PDF is broken? But then how is the symbol displayed so precisely in PDF viewers?