[jira] [Updated] (PDFBOX-2441) Improve XRef self healing mechanism when more than one xref table
[ https://issues.apache.org/jira/browse/PDFBOX-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2441: Summary: Improve XRef self healing mechanism when more than one xref table (was: mprove XRef self healing mechanism when more than one xref table) Improve XRef self healing mechanism when more than one xref table - Key: PDFBOX-2441 URL: https://issues.apache.org/jira/browse/PDFBOX-2441 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.7, 1.8.8, 2.0.0 Reporter: Tilman Hausherr This is a follow-up issue to PDFBOX-2250: {quote} the xref repair algorithm simply searches for the nearest offset, which may fail if more than one xref table is present ... Once we have a sample pdf which can't be parsed with the simple algorithm, we can open a new issue. {quote} And here's one: {code} Exception in thread main java.io.IOException: Error: Expected a long type at offset 1180, instead got '50/Filter/FlateDecode/DecodeParms' at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1690) {code} That file does have more than one xref table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PDFBOX-2441) mprove XRef self healing mechanism when more than one xref table
Tilman Hausherr created PDFBOX-2441: --- Summary: mprove XRef self healing mechanism when more than one xref table Key: PDFBOX-2441 URL: https://issues.apache.org/jira/browse/PDFBOX-2441 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.7, 1.8.8, 2.0.0 Reporter: Tilman Hausherr This is a follow-up issue to PDFBOX-2250: {quote} the xref repair algorithm simply searches for the nearest offset, which may fail if more than one xref table is present ... Once we have a sample pdf which can't be parsed with the simple algorithm, we can open a new issue. {quote} And here's one: {code} Exception in thread main java.io.IOException: Error: Expected a long type at offset 1180, instead got '50/Filter/FlateDecode/DecodeParms' at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1690) {code} That file does have more than one xref table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2441) Improve XRef self healing mechanism when more than one xref table
[ https://issues.apache.org/jira/browse/PDFBOX-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2441: Attachment: 260105.pdf Improve XRef self healing mechanism when more than one xref table - Key: PDFBOX-2441 URL: https://issues.apache.org/jira/browse/PDFBOX-2441 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.7, 1.8.8, 2.0.0 Reporter: Tilman Hausherr Attachments: 260105.pdf
[jira] [Assigned] (PDFBOX-2441) Improve XRef self healing mechanism when more than one xref table
[ https://issues.apache.org/jira/browse/PDFBOX-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler reassigned PDFBOX-2441: Assignee: Andreas Lehmkühler Improve XRef self healing mechanism when more than one xref table - Key: PDFBOX-2441 URL: https://issues.apache.org/jira/browse/PDFBOX-2441 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.7, 1.8.8, 2.0.0 Reporter: Tilman Hausherr Assignee: Andreas Lehmkühler Attachments: 260105.pdf
[GitHub] pdfbox pull request: Reapplied changes by patric42
GitHub user anti43 opened a pull request: https://github.com/apache/pdfbox/pull/9 Reapplied changes by patric42 You can merge this pull request into a Git repository by running: $ git pull https://github.com/anti43/pdfbox apache-trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/pdfbox/pull/9.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9 commit 8e6df3802bf045d1ef268ac2f14fbdb4324a9517 Author: Patric Bechtel p.bech...@oashi.com Date: 2014-04-22T09:06:15Z use java-image-scaling for high quality scaling of images. commit 1345f028429b29a48fc440db30485f7d58d62807 Author: Tilman Hausherr til...@apache.org Date: 2014-04-30T16:07:57Z PDFBOX-2034: refactoring per DRY git-svn-id: https://svn.apache.org/repos/asf/pdfbox/trunk@1591375 13f79535-47bb-0310-9956-ffa450edef68 commit 5f03db1caeb0c108628e851d698ab71d59327db4 Author: Patric Bechtel p.bech...@oashi.com Date: 2014-07-18T08:58:04Z re-enabled the hq-scaling again. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] pdfbox pull request: Reapplied changes by patric42
Github user anti43 closed the pull request at: https://github.com/apache/pdfbox/pull/9
[jira] [Commented] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read
[ https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178121#comment-14178121 ]

Ralf Hauser commented on PDFBOX-2403:

today, I see

java.lang.NullPointerException
    at org.apache.fontbox.cff.CharStringRenderer.rrcurveTo(CharStringRenderer.java:433)
    at org.apache.fontbox.cff.CharStringRenderer.rrCurveTo(CharStringRenderer.java:424)
    at org.apache.fontbox.cff.CharStringRenderer.handleCommandType2(CharStringRenderer.java:154)
    at org.apache.fontbox.cff.CharStringRenderer.handleCommand(CharStringRenderer.java:90)
    at org.apache.fontbox.cff.CharStringHandler.handleSequence(CharStringHandler.java:53)
    at org.apache.fontbox.cff.CharStringRenderer.render(CharStringRenderer.java:75)
    at org.apache.fontbox.cff.CFFFontROS.getWidth(CFFFontROS.java:173)
    at org.apache.pdfbox.preflight.font.container.CIDType0Container.getFontProgramWidth(CIDType0Container.java:83)
    at org.apache.pdfbox.preflight.font.container.Type0Container.getFontProgramWidth(Type0Container.java:46)
    at org.apache.pdfbox.preflight.font.container.FontContainer.checkGlyphWith(FontContainer.java:115)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.validText(ContentStreamWrapper.java:373)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.validStringArray(ContentStreamWrapper.java:297)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.validStringArray(ContentStreamWrapper.java:293)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.checkShowTextOperators(ContentStreamWrapper.java:209)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.processOperator(ContentStreamWrapper.java:181)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:258)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:205)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.validPageContentStream(ContentStreamWrapper.java:76)
    at org.apache.pdfbox.preflight.process.reflect.SinglePageValidationProcess.validateContent(SinglePageValidationProcess.java:179)
    at org.apache.pdfbox.preflight.process.reflect.SinglePageValidationProcess.validate(SinglePageValidationProcess.java:87)
    at org.apache.pdfbox.preflight.utils.ContextHelper.callValidation(ContextHelper.java:73)
    at org.apache.pdfbox.preflight.utils.ContextHelper.validateElement(ContextHelper.java:52)
    at org.apache.pdfbox.preflight.process.PageTreeValidationProcess.validatePage(PageTreeValidationProcess.java:58)
    at org.apache.pdfbox.preflight.process.PageTreeValidationProcess.validate(PageTreeValidationProcess.java:47)
    at org.apache.pdfbox.preflight.utils.ContextHelper.callValidation(ContextHelper.java:73)
    at org.apache.pdfbox.preflight.utils.ContextHelper.validateElement(ContextHelper.java:88)
    at org.apache.pdfbox.preflight.PreflightDocument.validate(PreflightDocument.java:169)

false negative? Font damaged, The FontFile can't be read
Key: PDFBOX-2403
URL: https://issues.apache.org/jira/browse/PDFBOX-2403
Project: PDFBox
Issue Type: Bug
Components: Preflight
Affects Versions: 2.0.0
Environment: deb7, java 7
Reporter: Ralf Hauser
Fix For: 2.0.0
Attachments: Konformität mit PDF_A-1b prüfen.pdf, Problems_pdfa1b.pdf_07.10.2014_001.pdf, patch2403JavaDoc.txt, patchBetterErrorMessages.txt, patchPDFBOX-2403.txt, patchPDFBOX-2403Type1.txt, pdfA_Validation_Report.eml, pdfa1b.pdf, pdfa1b_againstPDFA1a_report, pdfa1b_againstPDFA1b_report, pdfa1b_summary_0001.pdf, report, reportforfile_pdfa1b, validation_report.xml

- 1: 3.2.1 : Font damaged, The FontFile can't be read
- 2: 3.2.1 : Font damaged, The FontFile can't be read
- 3: 3.1.6 : Invalid Font definition, Width of the character 48 in the font program SURPPV+HeiseiMaruGoStd-W8-Identity-H is inconsistent with the width in the PDF dictionary.
- 4: 3.1.6 : Invalid Font definition, Width of the character 36 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the width in the PDF dictionary.
- 5: 3.3.1 : Glyph error, The character 74 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is missing from the Charater Encoding.
- 6: 3.1.6 : Invalid Font definition, Width of the character 80 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the width in the PDF dictionary.
- 7: 3.1.6 : Invalid Font definition, Width of the character 420 in the font program
[jira] [Issue Comment Deleted] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read
[ https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ralf Hauser updated PDFBOX-2403: Comment: was deleted (was: today, I see java.lang.NullPointerException at org.apache.fontbox.cff.CharStringRenderer.rrcurveTo(CharStringRenderer.java:433), followed by the same stack trace as in the preceding comment notification) false negative? Font damaged, The FontFile can't be read -- Key: PDFBOX-2403 URL: https://issues.apache.org/jira/browse/PDFBOX-2403 Project: PDFBox Issue Type: Bug Components: Preflight Affects Versions: 2.0.0 Environment: deb7, java 7 Reporter: Ralf Hauser Fix For: 2.0.0 Attachments: Konformität mit PDF_A-1b prüfen.pdf, Problems_pdfa1b.pdf_07.10.2014_001.pdf, patch2403JavaDoc.txt, patchBetterErrorMessages.txt, patchPDFBOX-2403.txt, patchPDFBOX-2403Type1.txt, pdfA_Validation_Report.eml, pdfa1b.pdf, pdfa1b_againstPDFA1a_report, pdfa1b_againstPDFA1b_report, pdfa1b_summary_0001.pdf, report, reportforfile_pdfa1b, validation_report.xml - 1: 3.2.1 : Font damaged, The FontFile can't be read - 2: 3.2.1 : Font damaged, The FontFile can't be read - 3: 3.1.6 : Invalid Font definition, Width of the character 48 in the font program SURPPV+HeiseiMaruGoStd-W8-Identity-H is inconsistent with the width in the PDF dictionary.
- 4: 3.1.6 : Invalid Font definition, Width of the character 36 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the width in the PDF dictionary. - 5: 3.3.1 : Glyph error, The character 74 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is missing from the Charater Encoding. - 6: 3.1.6 : Invalid Font definition, Width of the character 80 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the width in the PDF dictionary. - 7: 3.1.6 : Invalid Font definition, Width of the character 420 in the font program RRATCX+MathematicalPiLTStd-Identity-H is
[jira] [Created] (PDFBOX-2442) false negative? 3.1.6 : Invalid Font definition, Width (633.0) of the character 60 in the font program BNGLNN+LucidaMath-Symbol is inconsistent with the width (0.0)
Ralf Hauser created PDFBOX-2442:

Summary: false negative? 3.1.6 : Invalid Font definition, Width (633.0) of the character 60 in the font program BNGLNN+LucidaMath-Symbol is inconsistent with the width (0.0) in the PDF dictionary.
Key: PDFBOX-2442
URL: https://issues.apache.org/jira/browse/PDFBOX-2442
Project: PDFBox
Issue Type: Bug
Components: Preflight
Affects Versions: 2.0.0
Environment: java7 deb7
Reporter: Ralf Hauser

org.apache.pdfbox.preflight.font.util.GlyphException: Width (633.0) of the character 60 in the font program BNGLNN+LucidaMath-Symbol is inconsistent with the width (0.0) in the PDF dictionary.
    at org.apache.pdfbox.preflight.font.container.FontContainer.checkWidthsConsistency(FontContainer.java:181)
    at org.apache.pdfbox.preflight.font.container.FontContainer.checkGlyphWidth(FontContainer.java:130)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.validText(PreflightContentStream.java:342)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.validStringArray(PreflightContentStream.java:276)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.validStringArray(PreflightContentStream.java:272)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.checkShowTextOperators(PreflightContentStream.java:190)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.processOperator(PreflightContentStream.java:155)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:226)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:196)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:152)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.validPageContentStream(PreflightContentStream.java:76)
    at org.apache.pdfbox.preflight.process.reflect.SinglePageValidationProcess.validateContent(SinglePageValidationProcess.java:184)
    at org.apache.pdfbox.preflight.process.reflect.SinglePageValidationProcess.validate(SinglePageValidationProcess.java:87)
    at org.apache.pdfbox.preflight.utils.ContextHelper.callValidation(ContextHelper.java:73)
    at org.apache.pdfbox.preflight.utils.ContextHelper.validateElement(ContextHelper.java:52)
    at org.apache.pdfbox.preflight.process.PageTreeValidationProcess.validatePage(PageTreeValidationProcess.java:56)
    at org.apache.pdfbox.preflight.process.PageTreeValidationProcess.validate(PageTreeValidationProcess.java:45)
    at org.apache.pdfbox.preflight.utils.ContextHelper.callValidation(ContextHelper.java:73)
    at org.apache.pdfbox.preflight.utils.ContextHelper.validateElement(ContextHelper.java:88)
    at org.apache.pdfbox.preflight.PreflightDocument.validate(PreflightDocument.java:168)
[jira] [Updated] (PDFBOX-2442) false negative? 3.1.6 : Invalid Font definition, Width (633.0) of the character 60 in the font program BNGLNN+LucidaMath-Symbol is inconsistent with the width (0.0)
[ https://issues.apache.org/jira/browse/PDFBOX-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ralf Hauser updated PDFBOX-2442: Attachment: adobe7pie.pdf false negative? 3.1.6 : Invalid Font definition, Width (633.0) of the character 60 in the font program BNGLNN+LucidaMath-Symbol is inconsistent with the width (0.0) in the PDF dictionary. --- Key: PDFBOX-2442 URL: https://issues.apache.org/jira/browse/PDFBOX-2442 Project: PDFBox Issue Type: Bug Components: Preflight Affects Versions: 2.0.0 Environment: java7 deb7 Reporter: Ralf Hauser Attachments: adobe7pie.pdf
[jira] [Updated] (PDFBOX-2441) Improve XRef self healing mechanism when more than one xref table
[ https://issues.apache.org/jira/browse/PDFBOX-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-2441: Fix Version/s: 2.0.0 1.8.8 Improve XRef self healing mechanism when more than one xref table - Key: PDFBOX-2441 URL: https://issues.apache.org/jira/browse/PDFBOX-2441 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.7, 1.8.8, 2.0.0 Reporter: Tilman Hausherr Assignee: Andreas Lehmkühler Fix For: 1.8.8, 2.0.0 Attachments: 260105.pdf
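The failure mode quoted in this issue (searching for the nearest offset when several xref tables exist) can be pictured with a small sketch. This is not PDFBox's repair code: the class `NearestXref`, the method `nearest`, and all offsets are hypothetical, chosen only to show how a nearest-distance search can settle on the wrong table.

```java
import java.util.Arrays;
import java.util.List;

public class NearestXref {

    /**
     * Returns the candidate offset closest to the expected (possibly corrupt) offset.
     * Hypothetical helper for illustration; not a PDFBox method.
     */
    static long nearest(List<Long> candidates, long expected) {
        long best = -1;
        long bestDistance = Long.MAX_VALUE;
        for (long candidate : candidates) {
            long distance = Math.abs(candidate - expected);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed scenario: scanning found two xref tables, an older one at
        // offset 1200 and the most recent one at offset 5000.
        List<Long> found = Arrays.asList(1200L, 5000L);
        // The corrupt pointer says 3000. The nearest table is the older one,
        // so a pure distance search would pick 1200 even if 5000 was intended.
        System.out.println(nearest(found, 3000L)); // prints 1200
    }
}
```

With a single xref table the nearest candidate is the only candidate, which is why the simple search was sufficient until a file with several tables turned up.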
RE: 2.0
Been too busy over in Tika-land...just noticing this now. Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq). I won't have time to integrate 2.0 into our Tika PDFParser any time soon (Jeremy Anderson on TIKA-1285 has already started this), but I could easily write a lightweight wrapper around PDFBox's TextStripper + metadata inside of the tika-batch/tika-eval framework.

Cheers, Tim

From: Andreas Lehmkühler [andr...@lehmi.de]
Sent: Wednesday, October 15, 2014 6:20 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Hi,

Maruan Sahyoun sahy...@fileaffairs.de wrote on 15 October 2014 at 09:32:

What about keeping both for the 2.0 release and phase the old one out for 3 but making the NonSequential the default parser. Would also give us some time to work with Tim (TIKA) on the test suite.

I agree, that's the only thing we can manage in a timely manner.

Maybe we could simplify the variations of PDDocument.load to something like PDDocument.load(input, raf, enforce, useLegacyParser) or PDDocument.load(input, raf, enforce, withSignatureSupport) … and introduce PDDocument.load(input) to use the NonSequential WDYT?

Good idea, I've already created PDFBOX-2430 for this.

Maruan

BR Andreas Lehmkühler

On 15.10.2014 at 09:18, Timo Boehme timo.boe...@ontochem.com wrote:

Hi, the difference between the parsers stems from the fact that the old parser can cope with a completely broken xref table because it uses the objects as it finds them on its sequential way. What we need (as I proposed before) is a repair mechanism scanning the file for object start/end to be used for re-creating the xref table. I will see if I can find some time to do this. The only other stopper is, as Andreas has pointed out, the signing. I'm not familiar with this and don't know what needs to be done here.

Best, Timo

On 14.10.2014 at 21:18, Tilman Hausherr wrote:

Here are some: 055/055794.pdf 082/082463.pdf 108/108362.pdf 113/113223.pdf 115/115458.pdf 115/115463.pdf 122/122393.pdf 129/129416.pdf 133/133423.pdf 148/148020.pdf 152/152012.pdf 161/161466.pdf to be found here: http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/

Tilman

On 14.10.2014 at 21:06, John Hewson wrote:

Unless somebody provides us with a list of those files, then I think this is an unreasonable request. As long as we continue to leave the old parser in PDFBox, we won’t get the bug reports which we need to fix the new parser, and the situation will never resolve itself. Falling back to the old parser is just as bad - we won’t get bug reports.

-- John

On 14 Oct 2014, at 07:39, Tilman Hausherr thaush...@t-online.de wrote:

I prefer that the old parser not be removed, because there are many files that can only be parsed by the old parser. This came out in a large scale test with TIKA. The best idea (in my current opinion) is to use the nonSeq parser first, and the old parser if there is an exception.

Tilman

On 14.10.2014 at 09:45, Timo Boehme wrote:

Hi,

On 14.10.2014 at 07:22, John Hewson wrote:

Hi,

John Hewson j...@jahewson.com wrote on 10 October 2014 at 20:05:

- Parsing (Andreas?)

I guess we won't get a complete new parser in 2.0, but I try to improve the XRef and the COSStream stuff

It would be great if we could get rid of the old parser and switch to the non-sequential parser, WDYT?

I would also propose to completely remove the old parser. That way we are more flexible in parsing streams etc. since parts of the non-sequential parser are a compromise to work side-by-side with the old parser. Possibly there are a small number of functions for which the old parser is still needed - e.g. signing?

Best, Timo

-- Timo Boehme, OntoChem GmbH, H.-Damerow-Str. 4, 06120 Halle/Saale, T: +49 345 4780474, F: +49 345 4780471, timo.boe...@ontochem.com
OntoChem GmbH, Managing Director: Dr. Lutz Weber, Registered office: Halle / Saale, Registry court: Stendal, Registration number: HRB 215461
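The "non-sequential parser first, old parser on exception" idea from this thread can be sketched as follows. The code deliberately avoids real PDFBox classes: `Parser`, `loadWithFallback`, and `demo` are hypothetical names, and the two lambdas merely stand in for the two parsers. The primary failure is attached to any later failure rather than discarded, which is one possible answer to the concern that a silent fallback hides bugs in the new parser.

```java
import java.io.IOException;

public class FallbackLoader {

    /** Hypothetical stand-in for a parser entry point; not a PDFBox type. */
    interface Parser<T> {
        T parse(byte[] input) throws IOException;
    }

    /**
     * Tries the primary parser and falls back to the second one only if the
     * first throws. If both fail, the primary failure is kept as a suppressed
     * exception so that it can still be reported.
     */
    static <T> T loadWithFallback(byte[] input, Parser<T> primary, Parser<T> fallback)
            throws IOException {
        try {
            return primary.parse(input);
        } catch (IOException primaryFailure) {
            try {
                return fallback.parse(input);
            } catch (IOException fallbackFailure) {
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }

    /** Small demonstration with stub parsers: the first always fails, the second succeeds. */
    static String demo() {
        Parser<String> nonSequential = in -> {
            throw new IOException("xref table broken");
        };
        Parser<String> legacy = in -> "parsed by legacy parser";
        try {
            return loadWithFallback(new byte[0], nonSequential, legacy);
        } catch (IOException e) {
            return "both parsers failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "parsed by legacy parser"
    }
}
```

A variant that also logs the primary failure when the fallback succeeds would preserve the bug reports mentioned in the thread; that logging is left out here to keep the sketch short.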
[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178325#comment-14178325 ]

Tilman Hausherr commented on PDFBOX-2370:

When rendering, many files have their resources missing (fonts, images, shadings, forms) or throw an NPE. Some examples:
- PDFBOX-1169.pdf images missing
- tracemonkey NPE
- PDFBOX-1452.pdf question mark image missing
- CIB-coons-vs-tensormesh.pdf NPE (but CIB-coonsmesh.pdf is ok)
- PDFBOX-2265-igalia.pdf NPE
and many many more :-(

Move caching outside of PDResources
Key: PDFBOX-2370
URL: https://issues.apache.org/jira/browse/PDFBOX-2370
Project: PDFBox
Issue Type: Improvement
Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Priority: Critical
Fix For: 2.0.0

*Note:* This issue is based on a discussion which occurred regarding PDFBOX-2301 but is actually a separate issue.

Currently we cache the page resources in PDResources which belongs to a specific PDPage. This causes two problems: 1) users who want to hold many PDPage objects in memory will have high memory use (but this is often by accident*). 2) By caching resources in PDPage we only get to keep that cache for the lifetime of the page, which e.g. in PDFRenderer is a single page only. That means that a font which appears on 40 pages has to be parsed 40 times, which causes slow running times, but also memory thrashing as objects are destroyed frequently only to be re-created.

What PDFRenderer really needs is not page-wide caching but document-wide caching, so that it can cache fonts, cmaps, color profiles, etc. only once. But that won't work for images, because they're too large. What we're beginning to realise is that caching is use-case specific and probably shouldn't be built-in to PDFBox's pdmodel. Instead we should remove resource caching from PDPage/PDResources and implement custom caching in PDFRenderer and other downstream classes such as PDFTextStripper. I'll happily volunteer myself. The existing high-level PDFBox APIs will continue to just work and power users will get a level of control that they appreciate.

This strategy could be enhanced by removing memory-hungry methods on PDResources such as getFonts() and getXObjects() which force all resources of a particular type to be loaded, whether or not they are needed, or actually used in the content stream. They would be replaced by methods to retrieve a single resource, e.g. getFont(name).

---

\* There probably isn't a legitimate use case for 1) any more, we've solved the issues which we used to have with image caching (in fact, the clearCache() method actually no longer needs to be called by PDFRenderer, though it currently is). The real problem is that it's easy to accidentally retain PDPage objects: the PDDocument#getDocumentCatalog().getAllPages() method is dangerous, as looping over it will cause pages to be retained during processing, like so:

{code}
for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
{
    // ... this is idiomatic in PDFBox 1.8
}
// List returned by getAllPages() kept in scope until here (bad)
{code}

I added a couple of methods a while ago to avoid this by fetching each PDPage one at a time, and this is now used internally in PDFBox to avoid the memory problems we used to have:

{code}
for (int i = 0; i < document.getNumberOfPages(); i++)
{
    PDPage page = document.getPage(i);
    // ... this is the new 2.0 way
    // current page falls out of scope here (good)
}
{code}

To solve this problem, we could change getAllPages() so that instead of returning a List it returns an Iterator<PDPage>, which would provide a nicer API than getPage(int), and most existing code will continue to work. This is also an opportunity to fix type safety issues due to PDPageNode and incorrect handling of the page tree (this is similar to the issue we had recently with the acroform field tree).

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
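The "parsed 40 times" point can be made concrete with a minimal sketch of a document-wide cache that lives outside any page object. Everything here is an assumption for illustration: `DocumentFontCache` is not a PDFBox class, fonts are represented as plain strings, and the cache key is a simple name. A real cache would need a key that identifies the underlying font object, since resource names are scoped to a resource dictionary and the same name can refer to different fonts on different pages.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class DocumentFontCache {

    // One map per document, shared by all pages (hypothetical design).
    private final Map<String, String> parsedFonts = new HashMap<>();
    // Stand-in for the expensive font parsing step.
    private final Function<String, String> parser;
    private int parseCount = 0;

    DocumentFontCache(Function<String, String> parser) {
        this.parser = parser;
    }

    /** Returns the cached font, parsing it only on the first request. */
    String getFont(String key) {
        return parsedFonts.computeIfAbsent(key, k -> {
            parseCount++;
            return parser.apply(k);
        });
    }

    int getParseCount() {
        return parseCount;
    }

    /** Requests the same font once per page for 40 pages and reports how often it was parsed. */
    static int demo() {
        DocumentFontCache cache = new DocumentFontCache(key -> "parsed:" + key);
        for (int page = 0; page < 40; page++) {
            cache.getFont("font-object-12");
        }
        return cache.getParseCount();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 1
    }
}
```

Because the cache is owned by the caller (for example a renderer), its lifetime and eviction policy can be chosen per use case, which is the flexibility the issue argues for. Large resources such as images would need a different policy, as the issue text notes.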
[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178332#comment-14178332 ]

Tilman Hausherr commented on PDFBOX-2370:

popping the resource stack in processSubStream() helps

Move caching outside of PDResources
---
Key: PDFBOX-2370
URL: https://issues.apache.org/jira/browse/PDFBOX-2370
Project: PDFBox
Issue Type: Improvement
Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Priority: Critical
Fix For: 2.0.0

*Note:* This issue is based on a discussion which occurred regarding PDFBOX-2301 but is actually a separate issue.

Currently we cache the page resources in PDResources, which belongs to a specific PDPage. This causes two problems:

1) Users who want to hold many PDPage objects in memory will have high memory use (but this is often by accident*).

2) By caching resources in PDPage we only get to keep that cache for the lifetime of the page, which e.g. in PDFRenderer is a single page only. That means that a font which appears on 40 pages has to be parsed 40 times, which causes slow running times, but also memory thrashing as objects are destroyed frequently only to be re-created.

What PDFRenderer really needs is not page-wide caching but document-wide caching, so that it can cache fonts, cmaps, color profiles, etc. only once. But that won't work for images, because they're too large. What we're beginning to realise is that caching is use-case specific and probably shouldn't be built into PDFBox's pdmodel. Instead we should remove resource caching from PDPage/PDResources and implement custom caching in PDFRenderer and other downstream classes such as PDFTextStripper. I'll happily volunteer myself. The existing high-level PDFBox APIs will continue to just work, and power users will get a level of control that they appreciate.

This strategy could be enhanced by removing memory-hungry methods on PDResources such as getFonts() and getXObjects(), which force all resources of a particular type to be loaded, whether or not they are needed or actually used in the content stream. They would be replaced by methods to retrieve a single resource, e.g. getFont(name).

---

\* There probably isn't a legitimate use case for 1) any more: we've solved the issues which we used to have with image caching (in fact, the clearCache() method no longer needs to be called by PDFRenderer, though it currently is). The real problem is that it's easy to accidentally retain PDPage objects. The PDDocument#getDocumentCatalog().getAllPages() method is dangerous, as looping over it will cause pages to be retained during processing, like so:

{code}
for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
{
    // ... this is idiomatic in PDFBox 1.8
}
// List returned by getAllPages() kept in scope until here (bad)
{code}

I added a couple of methods a while ago to avoid this by fetching each PDPage one at a time, and this is now used internally in PDFBox to avoid the memory problems we used to have:

{code}
for (int i = 0; i < document.getNumberOfPages(); i++)
{
    PDPage page = document.getPage(i);
    // ... this is the new 2.0 way
    // current page falls out of scope here (good)
}
{code}

To solve this problem, we could change getAllPages() so that instead of returning a List it returns an Iterator<PDPage>, which would provide a nicer API than getPage(int), and most existing code will continue to work. This is also an opportunity to fix type safety issues due to PDPageNode and incorrect handling of the page tree (this is similar to the issue we had recently with the acroform field tree).
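To illustrate the difference between page-wide and document-wide caching described in this issue, here is a small self-contained sketch. It is not the PDFBox implementation: `DocumentCacheSketch`, `ResourceCache` and the `parser` function are hypothetical, and strings stand in for parsed font objects.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class DocumentCacheSketch {

    /** Hypothetical document-wide cache: each resource is parsed at most once. */
    static final class ResourceCache<K, V> {
        private final Map<K, V> cache = new HashMap<>();
        private final Function<K, V> parser; // assumption: the expensive parse step
        private int parseCount = 0;

        ResourceCache(Function<K, V> parser) {
            this.parser = parser;
        }

        V get(K key) {
            V value = cache.get(key);
            if (value == null) {
                // Cache miss: parse once and remember the result for later pages.
                value = parser.apply(key);
                parseCount++;
                cache.put(key, value);
            }
            return value;
        }

        int parseCount() {
            return parseCount;
        }
    }

    public static void main(String[] args) {
        ResourceCache<String, String> fonts = new ResourceCache<>(name -> "parsed:" + name);
        // The same font is requested on 40 pages but parsed only once.
        for (int page = 0; page < 40; page++) {
            fonts.get("Helvetica");
        }
        System.out.println(fonts.parseCount()); // prints 1
    }
}
```

A cache like this would live in the renderer or text stripper rather than in the page object, which is the design direction the issue argues for; large resources such as images would need a different policy.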
[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism
[ https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178547#comment-14178547 ]

Tilman Hausherr commented on PDFBOX-2250:

ignore the last commit message (wrong issue)

Improve XRef self healing mechanism
---
Key: PDFBOX-2250
URL: https://issues.apache.org/jira/browse/PDFBOX-2250
Project: PDFBox
Issue Type: Improvement
Components: Parsing
Affects Versions: 1.8.6, 1.8.7, 2.0.0
Reporter: Andreas Lehmkühler
Assignee: Andreas Lehmkühler
Fix For: 1.8.8, 2.0.0
Attachments: 055794.pdf, 113223.pdf, PDFBOX-2250-107425-empty-xref.pdf, PDFBOX-2250-110264-xref-zeronumber.pdf, PDFBOX-2250-229205.pdf, PDFBOX-2250-233566.pdf

PDFBOX-1769 introduced a self-healing mechanism to repair corrupt XRef offsets. But that one was just a starter, and a lot of issues remain to be solved. I'm planning to solve at least some of them. All fixes and improvements target the non-sequential parser, and I won't port those changes to the old parser.
[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178546#comment-14178546 ]

Tilman Hausherr commented on PDFBOX-2370:

done in [ https://svn.apache.org/r1633401 ]
Build failed in Jenkins: PDFBox-trunk » Apache PDFBox #1356
See https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox/1356/changes Changes: [tilman] PDFBOX-2370: restore pop resource stack -- [...truncated 58 lines...] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec - in org.apache.pdfbox.cos.TestCOSInteger Running org.apache.pdfbox.cos.TestCOSFloat Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec - in org.apache.pdfbox.cos.TestCOSFloat Running org.apache.pdfbox.io.TestRandomAccessBuffer Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec - in org.apache.pdfbox.io.TestRandomAccessBuffer Running org.apache.pdfbox.io.TestIOUtils Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec - in org.apache.pdfbox.io.TestIOUtils Running org.apache.pdfbox.io.TestRandomAccessFileOutputStream Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec - in org.apache.pdfbox.io.TestRandomAccessFileOutputStream Running org.apache.pdfbox.encoding.PDFDocEncodingCharsetTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec - in org.apache.pdfbox.encoding.PDFDocEncodingCharsetTest Running org.apache.pdfbox.util.TestLayerUtility Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.145 sec - in org.apache.pdfbox.util.TestLayerUtility Running org.apache.pdfbox.util.PDFCloneUtilityTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec - in org.apache.pdfbox.util.PDFCloneUtilityTest Running org.apache.pdfbox.util.TestMatrix Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec - in org.apache.pdfbox.util.TestMatrix Running org.apache.pdfbox.util.TestQuickSort Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec - in org.apache.pdfbox.util.TestQuickSort Running org.apache.pdfbox.util.PDFMergerUtilityTest Oct 21, 2014 4:01:45 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 
'Helvetica' Oct 21, 2014 4:01:45 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Times-Roman' Oct 21, 2014 4:01:45 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:45 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Times-Roman' Oct 21, 2014 4:01:45 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Times-Roman' Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Times-Roman' Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Times-Roman' Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Times-Roman' Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Times-Roman' Oct 21, 2014 4:01:47 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:47 PM 
org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Times-Roman' Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.352 sec - in org.apache.pdfbox.util.PDFMergerUtilityTest Running org.apache.pdfbox.util.PageExtractorTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.036 sec - in org.apache.pdfbox.util.PageExtractorTest Running org.apache.pdfbox.util.TestTextStripper Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts
Build failed in Jenkins: PDFBox-trunk #1356
See https://builds.apache.org/job/PDFBox-trunk/1356/changes Changes: [tilman] PDFBOX-2370: restore pop resource stack -- [...truncated 472 lines...] Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM 
org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica-Bold' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica-Bold' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica-Bold' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica-Bold' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica-Bold' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM 
org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica-Bold' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica-Bold' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica-Bold' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica-BoldOblique' Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts getTrueTypeFallbackFont SEVERE: No TTF fallback font for 'Helvetica-BoldOblique' Oct 21, 2014
[jira] [Created] (PDFBOX-2443) About to return NULL from unhandled branch when constructing a PDJpeg
Tilman Hausherr created PDFBOX-2443: --- Summary: About to return NULL from unhandled branch when constructing a PDJpeg Key: PDFBOX-2443 URL: https://issues.apache.org/jira/browse/PDFBOX-2443 Project: PDFBox Issue Type: Bug Components: PDModel Affects Versions: 1.8.7, 1.8.8 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Priority: Minor Fix For: 1.8.8 The INFO About to return NULL from unhandled branch appears when creating a PDJpeg from a stream. Although the message is an INFO and not a WARNING or an ERROR, it scares users. The message happens because getRGBImage() calls getColorSpace() although the colorspace isn't known yet, it is determined after the call to getRGBImage(), which loads the image. The image objects were completely redesigned in 2.0, so it makes no sense to waste time for a real solution to this. I am setting the message to DEBUG instead, -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-2443) About to return NULL from unhandled branch when constructing a PDJpeg
[ https://issues.apache.org/jira/browse/PDFBOX-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178647#comment-14178647 ]

ASF subversion and git services commented on PDFBOX-2443:

Commit 1633414 from [~tilman] in branch 'pdfbox/branches/1.8' [ https://svn.apache.org/r1633414 ]

PDFBOX-2443: change scary info message to debug and make it less scary; change javadoc too
[jira] [Updated] (PDFBOX-2443) About to return NULL from unhandled branch when constructing a PDJpeg
[ https://issues.apache.org/jira/browse/PDFBOX-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2443:

Description: The INFO About to return NULL from unhandled branch appears when creating a PDJpeg from a stream. Although the message is an INFO and not a WARNING or an ERROR, it scares users. The message happens because getRGBImage() calls getColorSpace() although the colorspace isn't known yet, it is determined after the call to getRGBImage(), which loads the image. The image objects were completely redesigned in 2.0, so it makes no sense to waste time for a real solution to this. I am setting the message to DEBUG instead, and make it less scary.

was: The INFO About to return NULL from unhandled branch appears when creating a PDJpeg from a stream. Although the message is an INFO and not a WARNING or an ERROR, it scares users. The message happens because getRGBImage() calls getColorSpace() although the colorspace isn't known yet, it is determined after the call to getRGBImage(), which loads the image. The image objects were completely redesigned in 2.0, so it makes no sense to waste time for a real solution to this. I am setting the message to DEBUG instead,
[jira] [Resolved] (PDFBOX-2443) About to return NULL from unhandled branch when constructing a PDJpeg
[ https://issues.apache.org/jira/browse/PDFBOX-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-2443.

Resolution: Fixed
Re: 2.0
Hi Tim,

2.0 doesn't seem to be released soon... What might be useful again is a comparison between seq v non-seq: Andreas recently resolved an issue (PDFBOX-2250) that improves the nonSeq parser a lot. This isn't fully done, though; a follow-up issue, PDFBOX-2441 https://issues.apache.org/jira/browse/PDFBOX-2441, has been opened which will improve a few more complex files.

Tilman

On 21.10.2014 at 13:00, Allison, Timothy B. wrote:

Been too busy over in Tika-land...just noticing this now. Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq). I won't have time to integrate 2.0 into our Tika PDFParser any time soon (Jeremy Anderson on TIKA-1285 has already started this), but I could easily write a lightweight wrapper around PDFBox's TextStripper + metadata inside of the tika-batch/tika-eval framework.

Cheers, Tim

From: Andreas Lehmkühler [andr...@lehmi.de]
Sent: Wednesday, October 15, 2014 6:20 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Hi,

Maruan Sahyoun sahy...@fileaffairs.de wrote on 15 October 2014 at 09:32:

What about keeping both for the 2.0 release and phase the old one out for 3 but making the NonSequential the default parser. Would also give us some time to work with Tim (TIKA) on the test suite.

I agree, that's the only thing we can manage in a timely manner.

Maybe we could simplify the variations of PDDocument.load to something like PDDocument.load(input, raf, enforce, useLegacyParser) or PDDocument.load(input, raf, enforce, withSignatureSupport) … and introduce PDDocument.load(input) to use the NonSequential. WDYT?

Good idea, I've already created PDFBOX-2430 for this.

Maruan

BR
Andreas Lehmkühler

On 15.10.2014 at 09:18, Timo Boehme timo.boe...@ontochem.com wrote:

Hi,

the difference between the parsers stems from the fact that the old parser can cope with a completely broken xref table because it uses the objects as it finds them on its sequential way. What we need (as I proposed before) is a repair mechanism scanning the file for object start/end to be used for re-creating the xref table. I will see if I can find some time to do this.

The only other stopper is, as Andreas has pointed out, the signing. I'm not familiar with this and don't know what needs to be done here.

Best, Timo

On 14.10.2014 at 21:18, Tilman Hausherr wrote:

Here are some:

055/055794.pdf
082/082463.pdf
108/108362.pdf
113/113223.pdf
115/115458.pdf
115/115463.pdf
122/122393.pdf
129/129416.pdf
133/133423.pdf
148/148020.pdf
152/152012.pdf
161/161466.pdf

to be found here: http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/

Tilman

On 14.10.2014 at 21:06, John Hewson wrote:

Unless somebody provides us with a list of those files, then I think this is an unreasonable request. As long as we continue to leave the old parser in PDFBox, we won’t get the bug reports which we need to fix the new parser, and the situation will never resolve itself. Falling back to the old parser is just as bad - we won’t get bug reports.

-- John

On 14 Oct 2014, at 07:39, Tilman Hausherr thaush...@t-online.de wrote:

I prefer that the old parser not be removed, because there are many files that can only be parsed by the old parser. This came out in a large scale test with TIKA. The best idea (in my current opinion) is to use the nonSeq parser first, and the old parser if there is an exception.

Tilman

On 14.10.2014 at 09:45, Timo Boehme wrote:

Hi,

On 14.10.2014 at 07:22, John Hewson wrote:

Hi,

John Hewson j...@jahewson.com wrote on 10 October 2014 at 20:05:

- Parsing (Andreas?)

I guess we won't get a complete new parser in 2.0, but I try to improve the XRef and the COSStream stuff

It would be great if we could get rid of the old parser and switch to the non-sequential parser, WDYT?

I would also propose to completely remove the old parser. That way we are more flexible in parsing streams etc. since parts of the non-sequential parser are a compromise to work side-by-side with the old parser. Possibly there are a small number of functions for which the old parser is still needed - e.g. signing?

Best, Timo

--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com
_
OntoChem GmbH
Managing Director: Dr. Lutz Weber
Registered office: Halle / Saale
Register court: Stendal
Registration number: HRB 215461
_
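The repair mechanism Timo proposes in this thread, scanning the file for object starts in order to re-create the xref table, might be sketched as follows. This is only an illustration under simplifying assumptions, not the PDFBox parser: `XrefRebuildSketch` and `scanObjectOffsets` are hypothetical names, the whole file is assumed to fit in memory, and a real implementation would also have to skip binary stream data, where the byte pattern can occur by chance.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XrefRebuildSketch {

    // Matches "<objectNumber> <generationNumber> obj" when not preceded by another digit.
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![0-9])(\\d+)\\s+(\\d+)\\s+obj\\b");

    /**
     * Scans raw PDF bytes and maps "objNr genNr" to the byte offset of the object start.
     * If an object key occurs more than once, the last occurrence in the file wins.
     */
    static Map<String, Long> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so char index == byte offset.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String sample = "%PDF-1.4\n1 0 obj\n<<>>\nendobj\n2 0 obj\n<<>>\nendobj\n";
        Map<String, Long> offsets = scanObjectOffsets(sample.getBytes(StandardCharsets.ISO_8859_1));
        offsets.forEach((key, offset) -> System.out.println(key + " -> " + offset));
    }
}
```

Letting the last occurrence win is a guess at how files with incremental updates could be handled; files with more than one xref table (the PDFBOX-2441 case) would likely need more care than this.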
[jira] [Created] (PDFBOX-2444) Add radial shading example
Tilman Hausherr created PDFBOX-2444: --- Summary: Add radial shading example Key: PDFBOX-2444 URL: https://issues.apache.org/jira/browse/PDFBOX-2444 Project: PDFBox Issue Type: Improvement Components: Utilities Affects Versions: 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Priority: Minor Fix For: 2.0.0 Add radial shading to the example created in PDFBOX-2211. Use both methods of adding a shading that emerged from PDFBOX-2370. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-2444) Add radial shading example
[ https://issues.apache.org/jira/browse/PDFBOX-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178756#comment-14178756 ]

ASF subversion and git services commented on PDFBOX-2444:

Commit 1633427 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633427 ]

PDFBOX-2444, PDFBOX-2370: add radial shading; use both methods of adding a shading to the resources

Labels: shading
[jira] [Resolved] (PDFBOX-2444) Add radial shading example
[ https://issues.apache.org/jira/browse/PDFBOX-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-2444.

Resolution: Fixed
[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178763#comment-14178763 ]

ASF subversion and git services commented on PDFBOX-2370:

Commit 1633428 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633428 ]

PDFBOX-2370: use sh instead of cs1 as prefix for shading objects
[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178766#comment-14178766 ]

Tilman Hausherr commented on PDFBOX-2370:

I've done just a minimal restore re: popping the resource stack, so that the tests work again. But why has the try... finally part been removed? This would make sure that the stack is popped if an exception happens.
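The try... finally question raised in Tilman Hausherr's PDFBOX-2370 comment comes down to a general pattern, sketched below. This is not the PDFBox code in question: ResourceStackDemo and its members are hypothetical names, and plain strings stand in for resource dictionaries.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a stream processor that keeps a stack of resources. */
final class ResourceStackDemo {
    private final Deque<String> resourceStack = new ArrayDeque<>();

    /** Current nesting depth; 0 means the stack is balanced. */
    int depth() {
        return resourceStack.size();
    }

    /** Runs body with the given resources pushed, and always pops them again. */
    void process(String resources, Runnable body) {
        resourceStack.push(resources);
        try {
            body.run();
        } finally {
            // Executed on normal return and when body throws, so a failing
            // stream cannot leave stale resources on the stack.
            resourceStack.pop();
        }
    }
}
```

Without the finally block, an exception thrown while processing one stream would leave its resources on the stack, and whatever is processed next would resolve names against the wrong resources.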
[jira] [Commented] (PDFBOX-1980) TestCOSFloat is non-deterministic
[ https://issues.apache.org/jira/browse/PDFBOX-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178847#comment-14178847 ]

ASF subversion and git services commented on PDFBOX-1980:

Commit 1633435 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633435 ]
PDFBOX-1980: fix javadoc format error

TestCOSFloat is non-deterministic
Key: PDFBOX-1980
URL: https://issues.apache.org/jira/browse/PDFBOX-1980
Project: PDFBox
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: John Hewson
Priority: Minor
Fix For: 2.0.0

TestCOSFloat generates random numbers for testing, which means that it is non-deterministic. Testing COSFloat on random data doesn't achieve much, because we know what numbers look like. Even taking into account the discussion in PDFBOX-1977, I suggest that it would be better to create a set of representative data with interesting edge-cases.
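The "set of representative data with interesting edge-cases" suggested in PDFBOX-1980 could start from a fixed table such as the one below. This is a sketch only: FloatEdgeCases is a hypothetical name, the chosen values are an assumption about what counts as interesting, and the check uses the JDK's own string round trip rather than COSFloat so that it runs without PDFBox.

```java
/** Hypothetical fixed test data: the same values on every run, unlike random input. */
final class FloatEdgeCases {
    static final float[] VALUES = {
        0f, -0f,                // zero and negative zero
        1f, -1f, 0.5f,          // small exactly representable values
        1e-6f, 123456.78f,      // values that need rounding in decimal form
        Float.MIN_VALUE,        // smallest positive (denormal) value
        Float.MAX_VALUE, -Float.MAX_VALUE
    };

    /** True if the value survives a write-to-text / parse-from-text round trip. */
    static boolean roundTrips(float value) {
        return Float.parseFloat(Float.toString(value)) == value;
    }
}
```

A failure on a fixed table can be reproduced on the next run, which is the property the random version lacks.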
Jenkins build is back to normal : PDFBox-trunk » Apache PDFBox #1357
See https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox/1357/changes
Jenkins build is back to normal : PDFBox-trunk #1357
See https://builds.apache.org/job/PDFBox-trunk/1357/changes
download link broken
https://pdfbox.apache.org/download.cgi shows this:

#!/bin/sh
# Wrapper script around mirrors.cgi script
# (we must change to that directory in order for python to pick up the
# python includes correctly)
cd /www/www.apache.org/dyn/mirrors
/www/www.apache.org/dyn/mirrors/mirrors.cgi $*
RE: 2.0
Maruan,

Sounds good. I'll add it to my todo list to write the wrapper... probably be good for me to start moving to 2.0 anyways. :)

-----Original Message-----
From: Maruan Sahyoun [mailto:sahy...@fileaffairs.de]
Sent: Tuesday, October 21, 2014 1:50 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Tim,

first, many thanks for the offer. I'd add that a comparison between 1.8 and 2.0 would be useful too, to detect differences, whether they are due to enhancements or regressions.

BR
Maruan

On 21.10.2014 at 19:42, Tilman Hausherr <thaush...@t-online.de> wrote:

Hi Tim,

2.0 doesn't seem likely to be released soon... what might be useful again is a comparison between seq v non-seq. Andreas recently resolved an issue (PDFBOX-2250) that improves the nonSeq parser a lot. This isn't fully done, though: a follow-up issue PDFBOX-2441 https://issues.apache.org/jira/browse/PDFBOX-2441 has been opened which will improve a few more complex files.

Tilman

On 21.10.2014 at 13:00, Allison, Timothy B. wrote:

Been too busy over in Tika-land... just noticing this now. Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq). I won't have time to integrate 2.0 into our Tika PDFParser any time soon (Jeremy Anderson on TIKA-1285 has already started this), but I could easily write a lightweight wrapper around PDFBox's TextStripper + metadata inside of the tika-batch/tika-eval framework.

Cheers,
Tim

From: Andreas Lehmkühler [andr...@lehmi.de]
Sent: Wednesday, October 15, 2014 6:20 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Hi,

Maruan Sahyoun <sahy...@fileaffairs.de> wrote on 15 October 2014 at 09:32:

What about keeping both for the 2.0 release and phasing the old one out for 3, but making the NonSequential the default parser? Would also give us some time to work with Tim (TIKA) on the test suite.

I agree, that's the only thing we can manage in a timely manner.

Maybe we could simplify the variations of PDDocument.load to something like PDDocument.load(input, raf, enforce, useLegacyParser) or PDDocument.load(input, raf, enforce, withSignatureSupport), and introduce PDDocument.load(input) to use the NonSequential. WDYT?

Good idea, I've already created PDFBOX-2430 for this.

Maruan

BR
Andreas Lehmkühler

On 15.10.2014 at 09:18, Timo Boehme <timo.boe...@ontochem.com> wrote:

Hi,

the difference between the parsers stems from the fact that the old parser can cope with a completely broken xref table, because it uses the objects as it finds them on its sequential way. What we need (as I proposed before) is a repair mechanism scanning the file for object start/end to be used for re-creating the xref table. I will see if I can find some time to do this.

The only other stopper is, as Andreas has pointed out, the signing. I'm not familiar with this and don't know what needs to be done here.

Best,
Timo

On 14.10.2014 at 21:18, Tilman Hausherr wrote:

Here are some:

055/055794.pdf 082/082463.pdf 108/108362.pdf 113/113223.pdf 115/115458.pdf 115/115463.pdf 122/122393.pdf 129/129416.pdf 133/133423.pdf 148/148020.pdf 152/152012.pdf 161/161466.pdf

to be found here: http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/

Tilman

On 14.10.2014 at 21:06, John Hewson wrote:

Unless somebody provides us with a list of those files, I think this is an unreasonable request. As long as we continue to leave the old parser in PDFBox, we won't get the bug reports which we need to fix the new parser, and the situation will never resolve itself. Falling back to the old parser is just as bad - we won't get bug reports.

-- John

On 14 Oct 2014, at 07:39, Tilman Hausherr <thaush...@t-online.de> wrote:

I prefer that the old parser not be removed, because there are many files that can only be parsed by the old parser. This came out in a large scale test with TIKA. The best idea (in my current opinion) is to use the nonSeq parser first, and the old parser if there is an exception.

Tilman

On 14.10.2014 at 09:45, Timo Boehme wrote:

Hi,

On 14.10.2014 at 07:22, John Hewson wrote:

Hi,

John Hewson <j...@jahewson.com> wrote on 10 October 2014 at 20:05:

- Parsing (Andreas?)

I guess we won't get a complete new parser in 2.0, but I'll try to improve the XRef and the COSStream stuff.

It would be great if we could get rid of the old parser and switch to the non-sequential parser, WDYT?

I would also propose to completely remove the old parser. That way we are more flexible in parsing streams etc., since parts of the non-sequential parser are a compromise to work side-by-side with the old parser. Possibly there are a small number of functions for which the old parser is still needed - e.g. signing?

Best,
Timo

--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
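Timo Boehme's proposal in this thread, a repair mechanism that scans the file for object starts in order to re-create the xref table, can be sketched as follows. This is a rough illustration and not the PDFBox implementation: XrefScanSketch and scan are hypothetical names, and the regex-based search is an assumption chosen for brevity. A real repair would also have to skip stream contents (which can contain text that looks like an object header) and handle objects stored in object streams.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: rebuild "object number + generation -> byte offset" by brute force. */
final class XrefScanSketch {
    // "<object number> <generation> obj" at the start of a line
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(?m)^(\\d+)\\s+(\\d+)\\s+obj\\b");

    static Map<String, Integer> scan(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so a char index in
        // the decoded text equals the byte offset in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Integer> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            // A later definition of the same object overwrites an earlier one,
            // mirroring how incremental updates supersede older objects.
            offsets.put(m.group(1) + " " + m.group(2), m.start());
        }
        return offsets;
    }
}
```

Because the scan ignores the xref tables entirely, it is unaffected by how many of them a file contains, at the cost of reading the whole file.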
[jira] [Comment Edited] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read
[ https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179171#comment-14179171 ]

John Hewson edited comment on PDFBOX-2403 at 10/21/14 9:44 PM:

I've updated my version of Adobe Preflight to 11.0.9, which is the same version you have, and I get no errors when verifying the pdfa1b.pdf file for compliance with PDF/A-1b. The fonts in the file are all embedded, so the errors which you're seeing just don't match the file. I'd start by double checking that the file attached to this issue is the same as the one you're testing.

was (Author: jahewson):
I've updated by version of Adobe Preflight to 11.0.9, which is the same version you have and I get no errors when verifying the pdfa1b.pdf file for compliance with PDF/A-1b. The fonts in the file are all embedded, so the errors which you're seeing just don't match the file. I'd start by double checking that the file attached to this issue is the same as the one you're testing.

false negative? Font damaged, The FontFile can't be read
Key: PDFBOX-2403
URL: https://issues.apache.org/jira/browse/PDFBOX-2403
Project: PDFBox
Issue Type: Bug
Components: Preflight
Affects Versions: 2.0.0
Environment: deb7, java 7
Reporter: Ralf Hauser
Fix For: 2.0.0
Attachments: Konformität mit PDF_A-1b prüfen.pdf, Problems_pdfa1b.pdf_07.10.2014_001.pdf, patch2403JavaDoc.txt, patchBetterErrorMessages.txt, patchPDFBOX-2403.txt, patchPDFBOX-2403Type1.txt, pdfA_Validation_Report.eml, pdfa1b.pdf, pdfa1b_againstPDFA1a_report, pdfa1b_againstPDFA1b_report, pdfa1b_summary_0001.pdf, report, reportforfile_pdfa1b, validation_report.xml

- 1: 3.2.1 : Font damaged, The FontFile can't be read
- 2: 3.2.1 : Font damaged, The FontFile can't be read
- 3: 3.1.6 : Invalid Font definition, Width of the character 48 in the font program SURPPV+HeiseiMaruGoStd-W8-Identity-H is inconsistent with the width in the PDF dictionary.
- 4: 3.1.6 : Invalid Font definition, Width of the character 36 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the width in the PDF dictionary.
- 5: 3.3.1 : Glyph error, The character 74 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is missing from the Charater Encoding.
- 6: 3.1.6 : Invalid Font definition, Width of the character 80 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the width in the PDF dictionary.
- 7: 3.1.6 : Invalid Font definition, Width of the character 420 in the font program RRATCX+MathematicalPiLTStd-Identity-H is inconsistent with the width in the PDF dictionary.

possibly related to PDFBOX-2299?
[jira] [Commented] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read
[ https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179171#comment-14179171 ]

John Hewson commented on PDFBOX-2403:

I've updated my version of Adobe Preflight to 11.0.9, which is the same version you have, and I get no errors when verifying the pdfa1b.pdf file for compliance with PDF/A-1b. The fonts in the file are all embedded, so the errors which you're seeing just don't match the file. I'd start by double checking that the file attached to this issue is the same as the one you're testing.
[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179177#comment-14179177 ]

John Hewson commented on PDFBOX-2370:

Thanks Tilman, I'll take a look at the problem files. The finally and stack popping removal was a mistake; I had been experimenting with those lines.
[jira] [Comment Edited] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179177#comment-14179177 ]

John Hewson edited comment on PDFBOX-2370 at 10/21/14 9:48 PM:

Thanks Tilman, I'll take a look at the problem files. The finally and stack popping removal was a mistake, I had been experimenting with those lines.

was (Author: jahewson):
Thanks Tilman, I'll take a look at the problem files. The finally and stack popping removal was an mistake, I had been experimenting with those lines.
[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179180#comment-14179180 ]

ASF subversion and git services commented on PDFBOX-2370:

Commit 1633472 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633472 ]
PDFBOX-2370: Fix, pop stack in finally
[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179237#comment-14179237 ]

Tilman Hausherr commented on PDFBOX-2370:

You don't have to look at the problem files anymore; their problem was caused by the non-popping.
[jira] [Commented] (PDFBOX-2391) Use an enum for RenderingIntent
[ https://issues.apache.org/jira/browse/PDFBOX-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179295#comment-14179295 ]

John Hewson commented on PDFBOX-2391:

I can't reproduce that exception.

Use an enum for RenderingIntent
Key: PDFBOX-2391
URL: https://issues.apache.org/jira/browse/PDFBOX-2391
Project: PDFBox
Issue Type: Improvement
Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Minor
Fix For: 2.0.0

The rendering intent in the graphics state is currently a String; we should replace it with a RenderingIntent enum.
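What PDFBOX-2391 asks for, a typed value in place of the String, might look like the sketch below. It is an assumption about the shape of such an enum, not the class that was committed: RenderingIntentSketch, stringValue and fromString are illustrative names. The four constants are the standard rendering intent names from the PDF specification, which also says that an unrecognised name is treated as RelativeColorimetric.

```java
/** Hypothetical sketch of a rendering intent enum; not the actual PDFBox class. */
enum RenderingIntentSketch {
    ABSOLUTE_COLORIMETRIC("AbsoluteColorimetric"),
    RELATIVE_COLORIMETRIC("RelativeColorimetric"),
    SATURATION("Saturation"),
    PERCEPTUAL("Perceptual");

    private final String pdfName;

    RenderingIntentSketch(String pdfName) {
        this.pdfName = pdfName;
    }

    /** The name as it is written in a PDF file. */
    String stringValue() {
        return pdfName;
    }

    /** Maps a name read from a PDF to a constant. */
    static RenderingIntentSketch fromString(String name) {
        for (RenderingIntentSketch intent : values()) {
            if (intent.pdfName.equals(name)) {
                return intent;
            }
        }
        // Unknown or missing names fall back to the specification's default.
        return RELATIVE_COLORIMETRIC;
    }
}
```

Callers can then switch over the constants instead of comparing strings, and a misspelled intent becomes a compile error rather than a silent mismatch.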
[jira] [Closed] (PDFBOX-1329) Update PDPage to enum
[ https://issues.apache.org/jira/browse/PDFBOX-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-1329.
Resolution: Fixed

Rather than making the page sizes an enum, I moved them to PDRectangle.

Update PDPage to enum
Key: PDFBOX-1329
URL: https://issues.apache.org/jira/browse/PDFBOX-1329
Project: PDFBox
Issue Type: Improvement
Components: PDModel
Affects Versions: 1.8.0
Environment: Linux, UBUNTU 12.04, openjdk-7
Reporter: Jens Kapitza
Priority: Minor
Fix For: 2.0.0
Attachments: change_pdpage.diff
Original Estimate: 1h
Remaining Estimate: 1h
[jira] [Commented] (PDFBOX-1329) Update PDPage to enum
[ https://issues.apache.org/jira/browse/PDFBOX-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179361#comment-14179361 ] ASF subversion and git services commented on PDFBOX-1329: - Commit 1633490 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633490 ] PDFBOX-1329: Move page size constants from PDPage to PDRectangle, and clean up PDPage.
[jira] [Commented] (PDFBOX-1329) Update PDPage to enum
[ https://issues.apache.org/jira/browse/PDFBOX-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179367#comment-14179367 ] ASF subversion and git services commented on PDFBOX-1329: - Commit 1633492 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633492 ] PDFBOX-1329: Removed comment
[jira] [Updated] (PDFBOX-2423) Page tree handling needs rewriting
[ https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2423: Priority: Blocker (was: Critical) Page tree handling needs rewriting -- Key: PDFBOX-2423 URL: https://issues.apache.org/jira/browse/PDFBOX-2423 Project: PDFBox Issue Type: Bug Components: PDModel Affects Versions: 1.8.7, 2.0.0 Reporter: John Hewson Priority: Blocker Fix For: 2.0.0 The way in which PDFBox handles the Page tree needs to be rewritten, preferably from scratch. Currently the document catalog returns the raw objects from the page tree, wrapped in either a PDPage or PDPageNode. We need to abstract over the page tree and get rid of PDPageNode, we should provide methods which can add/remove PDPage objects *only*. The existing low-level access to the page tree is not needed at the PD-level. Inheritance of page properties such as crop box, resources, and rotation should be reimplemented to use whatever new page tree abstraction we invent. We can finally remove the old broken methods which didn't look up the inheritance tree when retrieving these values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
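The inheritance lookup mentioned in the description (crop box, resources, rotation) can be illustrated with a small sketch. PageTreeNodeSketch is a hypothetical stand-in, not a PDFBox class, and it assumes a well-formed tree with no parent cycles: a lookup checks the page itself and then walks up through its ancestors until a value is found.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a page-tree node; not a PDFBox class. */
class PageTreeNodeSketch
{
    private final PageTreeNodeSketch parent;
    private final Map<String, Object> entries = new HashMap<>();

    /** Pass null as the parent for the root of the tree. */
    PageTreeNodeSketch(PageTreeNodeSketch parent)
    {
        this.parent = parent;
    }

    void put(String key, Object value)
    {
        entries.put(key, value);
    }

    /**
     * Looks up an inheritable attribute: this node is checked first,
     * then each ancestor in turn. Returns null if no node defines it.
     * Assumes the parent chain contains no cycles.
     */
    Object getInheritable(String key)
    {
        for (PageTreeNodeSketch node = this; node != null; node = node.parent)
        {
            Object value = node.entries.get(key);
            if (value != null)
            {
                return value;
            }
        }
        return null;
    }
}
```

A page that defines its own value overrides anything set higher up; a page that defines nothing picks up the nearest ancestor's value.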
[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2370: Priority: Blocker (was: Critical) Move caching outside of PDResources --- Key: PDFBOX-2370 URL: https://issues.apache.org/jira/browse/PDFBOX-2370 Project: PDFBox Issue Type: Improvement Components: PDModel Affects Versions: 2.0.0 Reporter: John Hewson Priority: Blocker Fix For: 2.0.0 *Note:* This issue is based on a discussion which occurred regarding PDFBOX-2301 but is actually a separate issue. Currently we cache the page resources in PDResources, which belongs to a specific PDPage. This causes two problems: 1) Users who want to hold many PDPage objects in memory will have high memory use (but this is often by accident*). 2) By caching resources in PDPage we only get to keep that cache for the lifetime of the page, which e.g. in PDFRenderer is a single page only. That means that a font which appears on 40 pages has to be parsed 40 times, which causes slow running times, but also memory thrashing as objects are destroyed frequently only to be re-created. What PDFRenderer really needs is not page-wide caching but document-wide caching, so that it can cache fonts, cmaps, color profiles, etc. only once. But that won't work for images, because they're too large. What we're beginning to realise is that caching is use-case specific and probably shouldn't be built into PDFBox's pdmodel. Instead we should remove resource caching from PDPage/PDResources and implement custom caching in PDFRenderer and other downstream classes such as PDFTextStripper. I'll happily volunteer myself. The existing high-level PDFBox APIs will continue to just work and power users will get a level of control that they appreciate.
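As a rough illustration of the document-wide caching idea, the sketch below keeps one cache per document so that a resource shared by many pages is loaded once. Both type names are hypothetical and nothing here is PDFBox API; the key might be an indirect object reference and the value a parsed font, but that choice is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical loader callback, e.g. "parse the font behind this key". */
interface ResourceLoaderSketch<K, V>
{
    V load(K key);
}

/**
 * Hypothetical document-scoped cache: each key is loaded at most once
 * for the lifetime of the cache, however many pages ask for it.
 * Not thread-safe; a loader that returns null is called again next time.
 */
class DocumentResourceCacheSketch<K, V>
{
    private final Map<K, V> cache = new HashMap<>();
    private final ResourceLoaderSketch<K, V> loader;

    DocumentResourceCacheSketch(ResourceLoaderSketch<K, V> loader)
    {
        this.loader = loader;
    }

    /** Returns the cached value, loading and storing it on first use. */
    V get(K key)
    {
        V value = cache.get(key);
        if (value == null)
        {
            value = loader.load(key);
            cache.put(key, value);
        }
        return value;
    }

    /** Drops everything, e.g. when the document is closed. */
    void clear()
    {
        cache.clear();
    }
}
```

A renderer could own one such cache for fonts and color profiles while deliberately leaving images out, which matches the point that caching policy depends on the use case.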
This strategy could be enhanced by removing memory-hungry methods on PDResources such as getFonts() and getXObjects(), which force all resources of a particular type to be loaded, whether or not they are needed or actually used in the content stream. They would be replaced by methods to retrieve a single resource, e.g. getFont(name).

---

\* There probably isn't a legitimate use case for 1) any more: we've solved the issues which we used to have with image caching (in fact, the clearCache() method no longer needs to be called by PDFRenderer, though it currently is). The real problem is that it's easy to accidentally retain PDPage objects. The PDDocument#getDocumentCatalog().getAllPages() method is dangerous, as looping over it will cause pages to be retained during processing, like so:

{code}
for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
{
    // ... this is idiomatic in PDFBox 1.8
} // List returned by getAllPages() kept in scope until here (bad)
{code}

I added a couple of methods a while ago to avoid this by fetching each PDPage one at a time, and this is now used internally in PDFBox to avoid the memory problems we used to have:

{code}
for (int i = 0; i < document.getNumberOfPages(); i++)
{
    PDPage page = document.getPage(i);
    // ... this is the new 2.0 way
    // current page falls out of scope here (good)
}
{code}

To solve this problem, we could change getAllPages() so that instead of returning a List it returns an Iterator<PDPage>, which would provide a nicer API than getPage(int), and most existing code would continue to work. This is also an opportunity to fix type safety issues due to PDPageNode and incorrect handling of the page tree (this is similar to the issue we had recently with the acroform field tree).
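The iterator proposal can be sketched without PDFBox. The interface and class below are hypothetical; PageSourceSketch only mirrors the getNumberOfPages()/getPage(int) pair named in the message. One assumption to flag: the sketch implements Iterable rather than returning a bare Iterator, because Java's enhanced for loop requires an Iterable, and that is what would let existing for-each loops keep compiling.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for the document's page accessors. */
interface PageSourceSketch<P>
{
    int getNumberOfPages();

    P getPage(int index);
}

/**
 * Hypothetical lazy view over a document's pages: each page is fetched
 * only when the loop reaches it, and no list of all pages is built.
 */
class LazyPagesSketch<P> implements Iterable<P>
{
    private final PageSourceSketch<P> source;

    LazyPagesSketch(PageSourceSketch<P> source)
    {
        this.source = source;
    }

    @Override
    public Iterator<P> iterator()
    {
        return new Iterator<P>()
        {
            private int next = 0;

            @Override
            public boolean hasNext()
            {
                return next < source.getNumberOfPages();
            }

            @Override
            public P next()
            {
                if (!hasNext())
                {
                    throw new NoSuchElementException();
                }
                return source.getPage(next++);
            }

            @Override
            public void remove()
            {
                // Removing pages through the iterator is out of scope here.
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

Callers would write `for (P page : new LazyPagesSketch<P>(source))`, and each page becomes unreachable as soon as the loop body moves on, unless the caller stores it elsewhere.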
[jira] [Assigned] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson reassigned PDFBOX-2370: --- Assignee: John Hewson
[jira] [Assigned] (PDFBOX-2423) Page tree handling needs rewriting
[ https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson reassigned PDFBOX-2423: --- Assignee: John Hewson
[jira] [Assigned] (PDFBOX-2428) An error occured when reading table hmtx
[ https://issues.apache.org/jira/browse/PDFBOX-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson reassigned PDFBOX-2428: --- Assignee: John Hewson An error occured when reading table hmtx Key: PDFBOX-2428 URL: https://issues.apache.org/jira/browse/PDFBOX-2428 Project: PDFBox Issue Type: Bug Components: FontBox Affects Versions: 1.8.8 Reporter: simon steiner Assignee: John Hewson Attachments: ttsubset_pdfa.pdf java -cp pdfbox/preflight/target/preflight-1.8.8-SNAPSHOT.jar:pdfbox/app/target/pdfbox-app-1.8.8-SNAPSHOT.jar:pdfbox/xmpbox/target/xmpbox-1.8.8-SNAPSHOT.jar:lib/commons-io-1.3.1.jar org.apache.pdfbox.preflight.Validator_A1b ttsubset_pdfa.pdf SEVERE: An error occured when reading table hmtx java.io.EOFException at org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:139) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-2333) Overhaul the apperance generation for PDF forms
[ https://issues.apache.org/jira/browse/PDFBOX-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179388#comment-14179388 ] ASF subversion and git services commented on PDFBOX-2333: - Commit 1633495 from [~msahyoun] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633495 ] PDFBOX-2333 enhance alignment for single line text fields Overhaul the apperance generation for PDF forms --- Key: PDFBOX-2333 URL: https://issues.apache.org/jira/browse/PDFBOX-2333 Project: PDFBox Issue Type: Improvement Components: AcroForm Affects Versions: 2.0.0 Reporter: Maruan Sahyoun Priority: Critical Fix For: 2.0.0 Attachments: AcroForms-SimpleTextFields.1.8.7.pdf, AcroForms-SimpleTextFields.1.8.7.png, AcroForms-SimpleTextFields.pdf The appearance handling for forms in 1.x is limited and does not reflect all settings possible for form fields. In addition the current code is not very modular and does not follow the box model used for form fields. Unfortunately only the basics of form handling are defined in the PDF spec. The details like padding of boxes, text placement etc. have to be determined by looking at how Adobe forms are generated. Update: The file from PDFBOX-2310 has bad rendering which might be related? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting
[ https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179389#comment-14179389 ] ASF subversion and git services commented on PDFBOX-2423: - Commit 1633496 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633496 ] PDFBOX-2423: Clean up PDDocumentCatalog formatting
[jira] [Updated] (PDFBOX-2333) Overhaul the apperance generation for PDF forms
[ https://issues.apache.org/jira/browse/PDFBOX-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maruan Sahyoun updated PDFBOX-2333: --- Attachment: AlignmentTests-pre1633495.pdf Test file filled prior to rev. 1633495. The upper fields are filled by PDFBox; the lower fields (...-Filled) are filled by Acrobat.
[jira] [Updated] (PDFBOX-2333) Overhaul the apperance generation for PDF forms
[ https://issues.apache.org/jira/browse/PDFBOX-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maruan Sahyoun updated PDFBOX-2333: --- Attachment: AlignmentTests-post1633495.pdf Testfile filled after rev. 1633495.
[jira] [Commented] (PDFBOX-2333) Overhaul the apperance generation for PDF forms
[ https://issues.apache.org/jira/browse/PDFBOX-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179411#comment-14179411 ] Maruan Sahyoun commented on PDFBOX-2333:

I've enhanced the alignment of single-line text fields, because:
- left-aligned text had too much padding applied on the left
- centered text had padding applied on the left, making it no longer centered
- right-aligned text had no padding applied, making it overlap with the borders

In addition, I added some special handling for corner cases to match Acrobat's behavior. This needs to be verified with additional files.
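One way to picture the three alignment cases is as a horizontal start offset for the text inside the field. The sketch below is an assumption for illustration only: the class, the method, and the formulas are hypothetical and are not taken from commit 1633495, and the padding Acrobat actually uses is not stated in the message.

```java
/** Hypothetical helper; not the code from the commit. */
class TextAlignmentSketch
{
    /** Horizontal alignment of text in a form field. */
    enum Quadding
    {
        LEFT, CENTERED, RIGHT
    }

    /**
     * Returns the x offset at which the text should start, measured
     * from the left edge of the field.
     *
     * @param fieldWidth width of the field
     * @param padding    gap to keep between the text and a border
     * @param textWidth  rendered width of the text
     */
    static float textStartX(Quadding quadding, float fieldWidth,
                            float padding, float textWidth)
    {
        switch (quadding)
        {
            case CENTERED:
                // Centering ignores padding, so the text stays centered.
                return (fieldWidth - textWidth) / 2f;
            case RIGHT:
                // Keep the padding between the text and the right border.
                return fieldWidth - padding - textWidth;
            case LEFT:
            default:
                // Apply the padding exactly once on the left.
                return padding;
        }
    }
}
```

For a field 100 units wide with padding 2 and text 40 units wide, these formulas give start offsets of 2, 30 and 58 for left, centered and right alignment.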
[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting
[ https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179423#comment-14179423 ] ASF subversion and git services commented on PDFBOX-2423: - Commit 1633501 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633501 ] PDFBOX-2423: Made page mode and layout constants in PDDocumentCatalog into enums
[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting
[ https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179424#comment-14179424 ] ASF subversion and git services commented on PDFBOX-2423: - Commit 1633502 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633502 ] PDFBOX-2423: Replaced calls to PDDocumentCatalog#getCOSDictionary with getCOSObject
[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting
[ https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179428#comment-14179428 ] ASF subversion and git services commented on PDFBOX-2423: - Commit 1633503 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633503 ] PDFBOX-2423: Fix bug with AcroForm caching
[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting
[ https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179436#comment-14179436 ] ASF subversion and git services commented on PDFBOX-2423: - Commit 1633505 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633505 ] PDFBOX-2423: More cleaning up of PDDocumentCatalog
[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting
[ https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179448#comment-14179448 ] ASF subversion and git services commented on PDFBOX-2423: - Commit 1633506 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1633506 ] PDFBOX-2423: Clean up PDPageNode