[jira] [Updated] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read
[ https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schreiber updated PDFBOX-2403:
    Attachment: reportforfile_pdfa1b

Report Preflight for file pdfa1b.pdf, Preflight version 11.0.09 (119)

false negative? Font damaged, The FontFile can't be read

    Key: PDFBOX-2403
    URL: https://issues.apache.org/jira/browse/PDFBOX-2403
    Project: PDFBox
    Issue Type: Bug
    Components: Preflight
    Affects Versions: 2.0.0
    Environment: deb7, java 7
    Reporter: Ralf Hauser
    Attachments: Problems_pdfa1b.pdf_07.10.2014_001.pdf, patchBetterErrorMessages.txt, patchPDFBOX-2403.txt, patchPDFBOX-2403Type1.txt, pdfA_Validation_Report.eml, pdfa1b.pdf, pdfa1b_summary_0001.pdf, report, reportforfile_pdfa1b, validation_report.xml

- 1: 3.2.1 : Font damaged, The FontFile can't be read
- 2: 3.2.1 : Font damaged, The FontFile can't be read
- 3: 3.1.6 : Invalid Font definition, Width of the character 48 in the font program SURPPV+HeiseiMaruGoStd-W8-Identity-H is inconsistent with the width in the PDF dictionary.
- 4: 3.1.6 : Invalid Font definition, Width of the character 36 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the width in the PDF dictionary.
- 5: 3.3.1 : Glyph error, The character 74 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is missing from the Charater Encoding.
- 6: 3.1.6 : Invalid Font definition, Width of the character 80 in the font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the width in the PDF dictionary.
- 7: 3.1.6 : Invalid Font definition, Width of the character 420 in the font program RRATCX+MathematicalPiLTStd-Identity-H is inconsistent with the width in the PDF dictionary.

possibly related to PDFBOX-2299?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read
[ https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166467#comment-14166467 ]

Schreiber edited comment on PDFBOX-2403 at 10/10/14 6:50 AM:

Here is the Preflight report for file pdfa1b.pdf, Preflight version 11.0.09 (119)

was (Author: csch1):
Report Preflight fpr file pdfa1b.pdf, Preflight version 11.0.09 (119)
[jira] [Created] (PDFBOX-2422) PDFont.getStringWidth results in stackoverflow
Cornelis Hoeflake created PDFBOX-2422:

    Summary: PDFont.getStringWidth results in stackoverflow
    Key: PDFBOX-2422
    URL: https://issues.apache.org/jira/browse/PDFBOX-2422
    Project: PDFBox
    Issue Type: Bug
    Affects Versions: 2.0.0
    Reporter: Cornelis Hoeflake

Loading a TrueType font and calling getStringWidth("é") results in a stack overflow. Calling the method with a 'regular' character is ok.

{code:title=Example code}
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.font.PDTrueTypeFont;

PDDocument doc = new PDDocument();
// load a font which is in PDFBox
PDTrueTypeFont font = PDTrueTypeFont.loadTTF(doc, getClass().getResourceAsStream("/org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf"));
// non-ASCII characters trigger the stack overflow
font.getStringWidth("éé");
{code}
GSoC2015
Some ideas for GSoC2015:

- improved PDFDebugger (because of the difficulty of seeing the different sequence in PDFBOX-2401 and because the product shown at https://www.youtube.com/watch?v=g-QcU9B4qMc is better)
  - hex view
  - view of non-printable characters
  - saving streams
  - color marking of PDF operators
  - show images that are streams
  - show PDIndexed gradient
  - show PDSeparation color
  - edit fields and streams
  - save altered PDF
- improved PDF Viewer (zoom, drag and drop, resize view). This could possibly be a candidate for Google Code-in 2014, although I'm not sure whether Apache participates. I saw a message from 2013 that suggested it does not.
- a working TIFF decoder
- a working JPX decoder
- the text extraction test suite for TIKA that Tim mentioned some time ago

Tilman

PS: No, I won't participate in the Semester of Code because I don't have a project idea, and I want to relax somewhat. The work on GSoC2014 has been pretty intense, i.e. reviewing code and making tests.
[jira] [Commented] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read
[ https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167173#comment-14167173 ]

John Hewson commented on PDFBOX-2403:

The latest XML report contains the error:
{quote}
Ungültiger Wert für PDF/A-Konformitätslevel (muss B sein)
{quote}
In English:
{quote}
Invalid value for PDF / A conformance level (must be B)
{quote}
Which I think means you're running Acrobat's PDF/A-1a validation against this file, but this file is PDF/A-1b.
[jira] [Comment Edited] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read
[ https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167173#comment-14167173 ]

John Hewson edited comment on PDFBOX-2403 at 10/10/14 5:41 PM:

The latest XML report contains the error:
{quote}
Ungültiger Wert für PDF/A-Konformitätslevel (muss B sein)
{quote}
In English:
{quote}
Invalid value for PDF / A conformance level (must be B)
{quote}
Which I think means you're running Acrobat's PDF/A-1a validation against this file, but this file is actually PDF/A-1b.

was (Author: jahewson):
The latest XML report contains the error:
{quote}
Ungültiger Wert für PDF/A-Konformitätslevel (muss B sein)
{quote}
In English:
{quote}
Invalid value for PDF / A conformance level (must be B)
{quote}
Which I think means you're running Acrobat's PDF/A-1a validation against this file, but this file is PDF/A-1b.
[jira] [Comment Edited] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read
[ https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167173#comment-14167173 ]

John Hewson edited comment on PDFBOX-2403 at 10/10/14 5:42 PM:

The latest XML report contains the error:
{quote}
Ungültiger Wert für PDF/A-Konformitätslevel (muss B sein)
{quote}
In English:
{quote}
Invalid value for PDF / A conformance level (must be B)
{quote}
Which I think means you're running Acrobat's PDF/A-1a validation against this file, but this file is actually PDF/A-1b, so you need to run that validation instead.

was (Author: jahewson):
The latest XML report contains the error:
{quote}
Ungültiger Wert für PDF/A-Konformitätslevel (muss B sein)
{quote}
In English:
{quote}
Invalid value for PDF / A conformance level (must be B)
{quote}
Which I think means you're running Acrobat's PDF/A-1a validation against this file, but this file is actually PDF/A-1b.
[jira] [Assigned] (PDFBOX-2422) PDFont.getStringWidth results in stackoverflow
[ https://issues.apache.org/jira/browse/PDFBOX-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson reassigned PDFBOX-2422:
    Assignee: John Hewson
[jira] [Commented] (PDFBOX-2422) PDFont.getStringWidth results in stackoverflow
[ https://issues.apache.org/jira/browse/PDFBOX-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167197#comment-14167197 ]

John Hewson commented on PDFBOX-2422:

This code works fine for me with the latest trunk. Can you make sure you're using the latest version from SVN or the latest snapshot jar?
Re: 2.0
Simon,

Andreas has the best handle on this, but off the top of my head what we need is to finish making breaking API changes and for the code to have been stable for a while before making a 2.0 release.

Improvements and fixes which still need breaking API changes include:

- Pattern rendering
- Pages resource caching (significant memory usage issues)
- Font embedding (particularly TTF)
- Parsing (Andreas?)
- Page Tree (needs completely re-writing)
- Text extraction on Java 8 (this might end up being a breaking change to the sort)

There’s probably more, such as work on Acroforms, and we need to have much better example code for 2.0 due to all the changes.

This seems like a good time to explicitly try to make sure that we have JIRA issues open for all outstanding tasks, so that we can track how close 2.0 is to being ready. The stability of the code is a pretty good indicator - we’re not there yet.

I’m going to open some JIRA issues. Andreas, Tilman - please open issues for any 2.0 features which you think we need.

Thanks

-- John

On 10 Oct 2014, at 08:08, Simon Steiner simonsteiner1...@gmail.com wrote:

Hi,

Could you set a target date for 2.0 release. What's missing to make a release?

Thanks
Re: 2.0
Andreas - can we create a new “Later” version in JIRA so that we can assign issues that we’ve decided to defer until after 2.0? That way we can have a definitive list of what does and doesn’t need attention.

-- John
[jira] [Created] (PDFBOX-2423) Page tree handling needs rewriting
John Hewson created PDFBOX-2423:

    Summary: Page tree handling needs rewriting
    Key: PDFBOX-2423
    URL: https://issues.apache.org/jira/browse/PDFBOX-2423
    Project: PDFBox
    Issue Type: Bug
    Components: PDModel
    Affects Versions: 1.8.7, 2.0.0
    Reporter: John Hewson

The way in which PDFBox handles the page tree needs to be rewritten, preferably from scratch. Currently the document catalog returns the raw objects from the page tree, wrapped in either a PDPage or PDPageNode.

We need to abstract over the page tree and get rid of PDPageNode. We should provide methods which can add/remove PDPage objects *only*. The existing low-level access to the page tree is not needed at the PD-level.

Inheritance of page properties such as crop box, resources, and rotation should be reimplemented to use whatever new page tree abstraction we invent. We can finally remove the old broken methods which didn't look up the inheritance tree when retrieving these values.
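The inheritance lookup described in this issue can be illustrated with a small sketch. This is not PDFBox code: the `Node` class, its `attributes` map and the `inherited` method are hypothetical names chosen for illustration, and the only assumption taken from the PDF format is that a page may omit a value such as `Rotate` or `CropBox` and pick it up from an ancestor in the page tree.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class PageTreeInheritanceSketch {
    /** Hypothetical tree node (not a PDFBox class): a page or an intermediate node. */
    static final class Node {
        final Node parent; // null for the root of the page tree
        final Map<String, Object> attributes = new HashMap<>();

        Node(Node parent) {
            this.parent = parent;
        }

        /** Returns the nearest value for key, searching this node first, then its ancestors. */
        Optional<Object> inherited(String key) {
            for (Node n = this; n != null; n = n.parent) {
                Object value = n.attributes.get(key);
                if (value != null) {
                    return Optional.of(value);
                }
            }
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Node root = new Node(null);
        root.attributes.put("Rotate", 90);   // set once, high in the tree
        Node intermediate = new Node(root);
        Node page = new Node(intermediate);  // the page itself sets nothing

        System.out.println(page.inherited("Rotate").orElse(0));    // 90, found on the root
        System.out.println(page.inherited("CropBox").isPresent()); // false, nobody defines it
    }
}
```

A getter that only reads the page's own dictionary would miss the value set on the root here, which is the kind of broken behaviour the issue wants removed.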
[jira] [Updated] (PDFBOX-2423) Page tree handling needs rewriting
[ https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2423:
    Fix Version/s: 2.0.0
[jira] [Commented] (PDFBOX-2400) Insert page
[ https://issues.apache.org/jira/browse/PDFBOX-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167252#comment-14167252 ]

John Hewson commented on PDFBOX-2400:

The page tree in 1.8 is fundamentally broken and probably shouldn't receive any more attention. However, in 2.0 we're planning on fixing this, see PDFBOX-2423, after which we could probably add an insertPage method with ease.

Insert page

    Key: PDFBOX-2400
    URL: https://issues.apache.org/jira/browse/PDFBOX-2400
    Project: PDFBox
    Issue Type: New Feature
    Components: PDModel
    Reporter: Patrick Tucker
    Priority: Minor

It would be nice if PDDocument had an insertPage function similar to addPage, but takes a number to indicate where to add the new page in the current set of pages.
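As a rough illustration of the requested behaviour, the sketch below keeps pages in a flat ordered list and inserts at a zero-based index. `PageList`, `insertPage` and `view` are hypothetical names, not PDFBox API, and the flat list is an assumption; a real implementation would also have to keep the underlying page tree consistent.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class InsertPageSketch {
    /** Hypothetical flat page list; P stands in for a page type. */
    static final class PageList<P> {
        private final List<P> pages = new ArrayList<>();

        /** Appends a page at the end, like the existing addPage. */
        void addPage(P page) {
            pages.add(page);
        }

        /** Inserts page so that it ends up at the given zero-based index. */
        void insertPage(int index, P page) {
            if (index < 0 || index > pages.size()) {
                throw new IndexOutOfBoundsException("index " + index + ", size " + pages.size());
            }
            pages.add(index, page);
        }

        /** Read-only view of the current page order. */
        List<P> view() {
            return Collections.unmodifiableList(pages);
        }
    }

    public static void main(String[] args) {
        PageList<String> doc = new PageList<>();
        doc.addPage("cover");
        doc.addPage("appendix");
        doc.insertPage(1, "chapter 1");
        System.out.println(doc.view()); // [cover, chapter 1, appendix]
    }
}
```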
[jira] [Updated] (PDFBOX-2400) Insert page
[ https://issues.apache.org/jira/browse/PDFBOX-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2400:
    Fix Version/s: 2.0.0
[jira] [Updated] (PDFBOX-2400) Insert page
[ https://issues.apache.org/jira/browse/PDFBOX-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2400:
    Affects Version/s: 2.0.0
                       1.8.7
[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2370:
    Component/s: PDModel

Move caching outside of PDResources

    Key: PDFBOX-2370
    URL: https://issues.apache.org/jira/browse/PDFBOX-2370
    Project: PDFBox
    Issue Type: Improvement
    Components: PDModel
    Affects Versions: 2.0.0
    Reporter: John Hewson
    Fix For: 2.0.0

*Note:* This issue is based on a discussion which occurred regarding PDFBOX-2301 but is actually a separate issue.

Currently we cache the page resources in PDResources, which belongs to a specific PDPage. This causes two problems:

1) Users who want to hold many PDPage objects in memory will have high memory use (but this is often by accident*).
2) By caching resources in PDPage we only get to keep that cache for the lifetime of the page, which e.g. in PDFRenderer is a single page only. That means that a font which appears on 40 pages has to be parsed 40 times, which causes slow running times, but also memory thrashing as objects are destroyed frequently only to be re-created.

What PDFRenderer really needs is not page-wide caching but document-wide caching, so that it can cache fonts, cmaps, color profiles, etc. only once. But that won't work for images, because they're too large.

What we're beginning to realise is that caching is use-case specific and probably shouldn't be built into PDFBox's pdmodel. Instead we should remove resource caching from PDPage/PDResources and implement custom caching in PDFRenderer and other downstream classes such as PDFTextStripper. I'll happily volunteer myself. The existing high-level PDFBox APIs will continue to just work and power users will get a level of control that they appreciate.
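To make the page-wide versus document-wide distinction concrete, here is a minimal sketch of a cache that outlives individual pages. It is not PDFBox code: `ResourceCache`, its `get` method and the load counter are hypothetical, and keying by a stable identifier such as an object number (rather than the page-local resource name) is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class DocumentCacheSketch {
    /** Hypothetical document-scoped cache, owned by a renderer rather than by a page. */
    static final class ResourceCache<K, V> {
        private final Map<K, V> entries = new HashMap<>();
        private int loads = 0;

        /** Returns the cached value for key, running the expensive loader only on the first request. */
        V get(K key, Function<K, V> loader) {
            return entries.computeIfAbsent(key, k -> {
                loads++;
                return loader.apply(k);
            });
        }

        /** Number of times a loader actually ran. */
        int loadCount() {
            return loads;
        }
    }

    public static void main(String[] args) {
        ResourceCache<Long, String> fonts = new ResourceCache<>();
        // The same font (identified here by a made-up object number 12) is used on 40 pages.
        for (int page = 0; page < 40; page++) {
            fonts.get(12L, id -> "parsed font " + id);
        }
        System.out.println(fonts.loadCount()); // 1: parsed once, not 40 times
    }
}
```

Because the cache belongs to the caller, a renderer could choose to cache fonts and color profiles this way while deliberately leaving large images out.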
This strategy could be enhanced by removing memory-hungry methods on PDResources such as getFonts() and getXObjects(), which force all resources of a particular type to be loaded, whether or not they are needed or actually used in the content stream. They would be replaced by methods to retrieve a single resource, e.g. getFont(name).

---

\* There probably isn't a legitimate use case for 1) any more: we've solved the issues which we used to have with image caching (in fact, the clearCache() method no longer needs to be called by PDFRenderer, though it currently is). The real problem is that it's easy to accidentally retain PDPage objects. The PDDocument#getDocumentCatalog().getAllPages() method is dangerous, as looping over it will cause pages to be retained during processing, like so:

{code}
for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
{
    // ... this is idiomatic in PDFBox 1.8
}
// List returned by getAllPages() kept in scope until here (bad)
{code}

I added a couple of methods a while ago to avoid this by fetching each PDPage one at a time, and this is now used internally in PDFBox to avoid the memory problems we used to have:

{code}
for (int i = 0; i < document.getNumberOfPages(); i++)
{
    PDPage page = document.getPage(i);
    // ... this is the new 2.0 way
    // current page falls out of scope here (good)
}
{code}

To solve this problem, we could change getAllPages() so that instead of returning a List it returns an Iterator<PDPage>, which would provide a nicer API than getPage(int), and most existing code will continue to work. This is also an opportunity to fix type safety issues due to PDPageNode and incorrect handling of the page tree (this is similar to the issue we had recently with the acroform field tree).
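The iterator idea at the end of this issue can be sketched as follows. The `lazyPages` helper is hypothetical, and it returns an `Iterable` rather than a bare `Iterator` so that a for-each loop still works; that choice is an assumption for this example, not something the issue specifies. Each element is fetched only when the loop reaches it, so no list of all pages is held for the duration of the loop.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

class LazyPagesSketch {
    /** Returns an Iterable that fetches element i only when iteration reaches it. */
    static <P> Iterable<P> lazyPages(int count, IntFunction<P> fetch) {
        return () -> new Iterator<P>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < count;
            }

            @Override
            public P next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return fetch.apply(index++); // nothing is cached, so earlier pages can be collected
            }
        };
    }

    public static void main(String[] args) {
        // The fetch function stands in for a call like document.getPage(i).
        for (String page : lazyPages(3, i -> "page " + i)) {
            System.out.println(page);
        }
    }
}
```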
[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2370:
    Affects Version/s: 2.0.0
[jira] [Updated] (PDFBOX-2400) Add insertPage() method
[ https://issues.apache.org/jira/browse/PDFBOX-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2400:
    Summary: Add insertPage() method (was: Insert page)
[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources
[ https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2370: Fix Version/s: 2.0.0 Move caching outside of PDResources --- Key: PDFBOX-2370 URL: https://issues.apache.org/jira/browse/PDFBOX-2370 Project: PDFBox Issue Type: Improvement Components: PDModel Affects Versions: 2.0.0 Reporter: John Hewson Fix For: 2.0.0 *Note:* This issue is based on a discussion which occurred regarding PDFBOX-2301 but is actually a separate issue. Currently we cache the page resources in PDResources which belongs to a specific PDPage. This causes two problems, 1) users who want to hold many PDPage objects in memory will have high memory use (but this is often by accident*). 2) By caching resources in PDPage we only get to keep that cache for the lifetime of the page, which e.g. in PDFRenderer is a single page only. That means that a font which appears on 40 pages has to be parsed 40 times, which causes slow running times, but also memory thrashing as objects are destroyed frequently only to be re-created. What PDFRenderer really needs is not page-wide caching but document-wide caching, so that it can cache fonts, cmaps, color profiles, etc. only once. But that won't work for images, because they're too large. What we're beginning to realise is that caching is use-case specific and probably shouldn't be built-in to PDFBox's pdmodel. Instead we should removing resource caching from PDPage/PDResources and implement custom caching in PDFRenderer and other downstream classes such as PDFTextStripper. I'll happily volunteer myself. The existing high-level PDFBox APIs will continue to just work and power users will get a level of control that they appreciate. 
This strategy could be enhanced by removing memory-hungry methods on PDResources such as getFonts() and getXObjects() which force all resources of a particular type to be loaded, whether or not they are needed, or actually used in the content stream. They would be replaced by methods to retrieve a single resource, e.g. getFont(name).
---
\* There probably isn't a legitimate use case for 1) any more; we've solved the issues which we used to have with image caching (in fact, the clearCache() method actually no longer needs to be called by PDFRenderer, though it currently is). The real problem is that it's easy to accidentally retain PDPage objects: the PDDocument#getDocumentCatalog().getAllPages() method is dangerous as looping over it will cause pages to be retained during processing, like so:
{code}
for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
{
    // ... this is idiomatic in PDFBox 1.8
}
// List returned by getAllPages() kept in scope until here (bad)
{code}
I added a couple of methods a while ago to avoid this by fetching each PDPage one at a time, and this is now used internally in PDFBox to avoid the memory problems we used to have:
{code}
for (int i = 0; i < document.getNumberOfPages(); i++)
{
    PDPage page = document.getPage(i);
    // ... this is the new 2.0 way
    // current page falls out of scope here (good)
}
{code}
To solve this problem, we could change getAllPages() so that instead of returning a List it returns an Iterator<PDPage>, which would provide a nicer API than getPage(int) and most existing code will continue to work. This is also an opportunity to fix type safety issues due to PDPageNode and incorrect handling of the page tree (this is similar to the issue we had recently with the acroform field tree). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
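The document-wide caching idea described above could be sketched roughly as follows. This is a minimal, self-contained illustration and not PDFBox code: `DocumentFontCache`, `ParsedFont`, and `getParseCount` are hypothetical names, and constructing a `ParsedFont` merely stands in for real font parsing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an expensive-to-build resource such as a parsed font.
final class ParsedFont {
    final String name;

    ParsedFont(String name) {
        this.name = name;
    }
}

// Hypothetical cache owned by a renderer, so it outlives any single page.
final class DocumentFontCache {
    private final Map<String, ParsedFont> fonts = new HashMap<>();
    private int parseCount = 0;

    // Returns the cached font, running the expensive step only on first request.
    ParsedFont getFont(String name) {
        return fonts.computeIfAbsent(name, key -> {
            parseCount++; // counts how often the expensive step actually runs
            return new ParsedFont(key);
        });
    }

    int getParseCount() {
        return parseCount;
    }
}
```

With a cache like this held by the renderer, a font requested on 40 pages is built once rather than 40 times. Images are deliberately left out of the sketch, since the issue notes they are too large to cache this way.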
[jira] [Updated] (PDFBOX-2340) Overhaul PDFBox Documentation
[ https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2340: Fix Version/s: 2.0.0 Overhaul PDFBox Documentation - Key: PDFBOX-2340 URL: https://issues.apache.org/jira/browse/PDFBOX-2340 Project: PDFBox Issue Type: Improvement Components: Documentation Affects Versions: 2.0.0 Reporter: Maruan Sahyoun Fix For: 2.0.0 Attachments: Mockup-20140912.png, Mockup_Documentation.png In order to make it easier for users of PDFBox to work with the library, there shall be enhanced documentation consisting of an introduction, API references, and more well-documented examples and code snippets (Cookbook). To make it easier to contribute, the Cookbook shall be built automatically from the examples/snippet 'repository'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2366) Improve high-level font API
[ https://issues.apache.org/jira/browse/PDFBOX-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2366: Fix Version/s: 2.0.0 Improve high-level font API --- Key: PDFBOX-2366 URL: https://issues.apache.org/jira/browse/PDFBOX-2366 Project: PDFBox Issue Type: Improvement Components: PDModel Reporter: John Hewson Assignee: John Hewson Priority: Minor Fix For: 2.0.0 The PDFont and Type1Equivalent APIs could expose some higher-level details, such as a consistent way to get names and Type1Equivalent instances. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2340) Overhaul PDFBox Documentation
[ https://issues.apache.org/jira/browse/PDFBOX-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2340: Affects Version/s: 2.0.0 Overhaul PDFBox Documentation - Key: PDFBOX-2340 URL: https://issues.apache.org/jira/browse/PDFBOX-2340 Project: PDFBox Issue Type: Improvement Components: Documentation Affects Versions: 2.0.0 Reporter: Maruan Sahyoun Fix For: 2.0.0 Attachments: Mockup-20140912.png, Mockup_Documentation.png In order to make it easier for users of PDFBox to work with the library, there shall be enhanced documentation consisting of an introduction, API references, and more well-documented examples and code snippets (Cookbook). To make it easier to contribute, the Cookbook shall be built automatically from the examples/snippet 'repository'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2337) Add an example for highlighting text based on a string
[ https://issues.apache.org/jira/browse/PDFBOX-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2337: Affects Version/s: 2.0.0 1.8.7 Add an example for highlighting text based on a string --- Key: PDFBOX-2337 URL: https://issues.apache.org/jira/browse/PDFBOX-2337 Project: PDFBox Issue Type: New Feature Components: Utilities Affects Versions: 1.8.7, 2.0.0 Reporter: Joël Kuiper Fix For: 1.8.7 An often-heard request is to be able to highlight a certain text within a PDF programmatically, similar to the highlight functionality in Acrobat or Preview.app. The actual implementation of this functionality is trickier than it appears, since it requires the calculation of bounding boxes from TextPositions. An example class may help people with implementing this (common) functionality. (see for example this discussion https://mail-archives.apache.org/mod_mbox/pdfbox-users/201409.mbox/%3CC8340BB9-E299-4A76-A50B-6155504A0D5B%40joelkuiper.eu%3E) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
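A minimal sketch of the bounding-box step mentioned above, assuming each matched glyph has already been reduced to a rectangle in page coordinates. `HighlightBounds` and `unionOf` are hypothetical names; the sketch uses only `java.awt.geom.Rectangle2D` and does not touch PDFBox's TextPosition API.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

final class HighlightBounds {
    // Returns the smallest rectangle enclosing all glyph boxes, or null if there are none.
    static Rectangle2D unionOf(List<? extends Rectangle2D> glyphBoxes) {
        Rectangle2D result = null;
        for (Rectangle2D box : glyphBoxes) {
            if (result == null) {
                result = (Rectangle2D) box.clone(); // copy so the input is not modified
            } else {
                result.add(box); // grows result to the union of both rectangles
            }
        }
        return result;
    }
}
```

A match that wraps across lines would presumably need one such rectangle per line rather than a single union, which is part of why the real implementation is trickier than it looks.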
[jira] [Updated] (PDFBOX-2337) Add an example for highlighting text based on a string
[ https://issues.apache.org/jira/browse/PDFBOX-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2337: Fix Version/s: 1.8.7 Add an example for highlighting text based on a string --- Key: PDFBOX-2337 URL: https://issues.apache.org/jira/browse/PDFBOX-2337 Project: PDFBox Issue Type: New Feature Components: Utilities Affects Versions: 1.8.7, 2.0.0 Reporter: Joël Kuiper Fix For: 1.8.7 An often-heard request is to be able to highlight a certain text within a PDF programmatically, similar to the highlight functionality in Acrobat or Preview.app. The actual implementation of this functionality is trickier than it appears, since it requires the calculation of bounding boxes from TextPositions. An example class may help people with implementing this (common) functionality. (see for example this discussion https://mail-archives.apache.org/mod_mbox/pdfbox-users/201409.mbox/%3CC8340BB9-E299-4A76-A50B-6155504A0D5B%40joelkuiper.eu%3E) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2335) NPE in DictionaryEncoding constructor
[ https://issues.apache.org/jira/browse/PDFBOX-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2335: Affects Version/s: 2.0.0 NPE in DictionaryEncoding constructor - Key: PDFBOX-2335 URL: https://issues.apache.org/jira/browse/PDFBOX-2335 Project: PDFBox Issue Type: Bug Components: FontBox Affects Versions: 2.0.0 Reporter: Tilman Hausherr Assignee: John Hewson Fix For: 2.0.0 Attachments: PDFBOX-2335-203040-p17.pdf I get an NPE with the attached file:
{code}
Sep 09, 2014 9:16:57 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
WARNUNG: Using fallback font 'TimesNewRomanPSMT' for 'ZapfDingbats'
Exception in thread "main" java.lang.NullPointerException
    at org.apache.pdfbox.encoding.DictionaryEncoding.<init>(DictionaryEncoding.java:91)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.readEncoding(PDSimpleFont.java:126)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:256)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:65)
    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:171)
    at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:556)
    at org.apache.pdfbox.util.operator.text.SetTextFont.process(SetTextFont.java:48)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2333) Overhaul the appearance generation for PDF forms
[ https://issues.apache.org/jira/browse/PDFBOX-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2333: Affects Version/s: 2.0.0 Overhaul the appearance generation for PDF forms --- Key: PDFBOX-2333 URL: https://issues.apache.org/jira/browse/PDFBOX-2333 Project: PDFBox Issue Type: Improvement Components: AcroForm Affects Versions: 2.0.0 Reporter: Maruan Sahyoun Fix For: 2.0.0 Attachments: AcroForms-SimpleTextFields.1.8.7.pdf, AcroForms-SimpleTextFields.1.8.7.png, AcroForms-SimpleTextFields.pdf The appearance handling for forms in 1.x is limited and does not reflect all settings possible for form fields. In addition, the current code is not very modular and does not follow the box model used for form fields. Unfortunately, only the basics of form handling are defined in the PDF spec. The details, such as padding of boxes and text placement, have to be determined by looking at how Adobe forms are generated. Update: The file from PDFBOX-2310 has bad rendering which might be related? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2289) New Example:
[ https://issues.apache.org/jira/browse/PDFBOX-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2289: Summary: New Example: (was: provide an example how to set PDDocumentCatalog.PAGE_MODE_USE_ATTACHMENTS to an aes encrypted pdf) New Example: - Key: PDFBOX-2289 URL: https://issues.apache.org/jira/browse/PDFBOX-2289 Project: PDFBox Issue Type: Wish Reporter: Ralf Hauser Provide an example how to set PDDocumentCatalog.PAGE_MODE_USE_ATTACHMENTS to an AES encrypted pdf such that it shows the attachment section after entering the decryption password in Acrobat Viewer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2289) provide an example how to set PDDocumentCatalog.PAGE_MODE_USE_ATTACHMENTS to an aes encrypted pdf
[ https://issues.apache.org/jira/browse/PDFBOX-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2289: Description: Provide an example how to set PDDocumentCatalog.PAGE_MODE_USE_ATTACHMENTS to an AES encrypted pdf such that it shows the attachment section after entering the decryption password in Acrobat Viewer (was: such that it shows the attachment section after entering the decryption password in Acrobat Viewer) provide an example how to set PDDocumentCatalog.PAGE_MODE_USE_ATTACHMENTS to an aes encrypted pdf - Key: PDFBOX-2289 URL: https://issues.apache.org/jira/browse/PDFBOX-2289 Project: PDFBox Issue Type: Wish Reporter: Ralf Hauser Provide an example how to set PDDocumentCatalog.PAGE_MODE_USE_ATTACHMENTS to an AES encrypted pdf such that it shows the attachment section after entering the decryption password in Acrobat Viewer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2289) Example: Attachments with AES encrypted PDF
[ https://issues.apache.org/jira/browse/PDFBOX-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2289: Summary: Example: Attachments with AES encrypted PDF (was: New Example: ) Example: Attachments with AES encrypted PDF --- Key: PDFBOX-2289 URL: https://issues.apache.org/jira/browse/PDFBOX-2289 Project: PDFBox Issue Type: Wish Reporter: Ralf Hauser Provide an example how to set PDDocumentCatalog.PAGE_MODE_USE_ATTACHMENTS to an AES encrypted pdf such that it shows the attachment section after entering the decryption password in Acrobat Viewer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PDFBOX-2289) Example: Attachments with AES encrypted PDF
[ https://issues.apache.org/jira/browse/PDFBOX-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-2289. - Resolution: Won't Fix This isn't really an example but a how-to question, which would be best asked on the mailing list. If we had a concrete example we could add it to PDFBox's examples suite later. Example: Attachments with AES encrypted PDF --- Key: PDFBOX-2289 URL: https://issues.apache.org/jira/browse/PDFBOX-2289 Project: PDFBox Issue Type: Wish Reporter: Ralf Hauser Provide an example how to set PDDocumentCatalog.PAGE_MODE_USE_ATTACHMENTS to an AES encrypted pdf such that it shows the attachment section after entering the decryption password in Acrobat Viewer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PDFBOX-2232) Is there difference between character \n and character space(32) in pdf stream
[ https://issues.apache.org/jira/browse/PDFBOX-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-2232. - Resolution: Cannot Reproduce There doesn't seem to be any information to go on here, or any real indication that this is a bug. Is there difference between character \n and character space(32) in pdf stream -- Key: PDFBOX-2232 URL: https://issues.apache.org/jira/browse/PDFBOX-2232 Project: PDFBox Issue Type: Bug Components: Text extraction Reporter: huangchangan When extracting text from PDF files with PDFTextStripper, I get a space (32) at the end of each paragraph or table cell, while in the text copied from Adobe Reader the end character is \n. I wonder whether PDFBox converts the character \n to a space (32). I checked the function processEncodedText in PDFStreamEngine and got no useful information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-2168) Different behavior of Undo feature when form was pre filled by PDFBox
[ https://issues.apache.org/jira/browse/PDFBOX-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167330#comment-14167330 ] John Hewson commented on PDFBOX-2168: - Can you add the Affects Version/s for this issue? Different behavior of Undo feature when form was pre filled by PDFBox - Key: PDFBOX-2168 URL: https://issues.apache.org/jira/browse/PDFBOX-2168 Project: PDFBox Issue Type: Bug Components: AcroForm Reporter: Maruan Sahyoun Priority: Minor Labels: Appearance Attachments: formtemplate-filled-pdfbox.pdf, formtemplate-filled-reader.pdf, formtemplate.pdf When a form is pre-filled by PDFBox, the Undo feature in Adobe Reader will reset the field value but not change the visible appearance of the field, i.e. the old value will still be visible. The same form filled by Adobe Reader/Acrobat behaves correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2142) some /ICCBased colorspaces not rendered correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2142: Fix Version/s: 2.0.0 some /ICCBased colorspaces not rendered correctly - Key: PDFBOX-2142 URL: https://issues.apache.org/jira/browse/PDFBOX-2142 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 2.0.0 Reporter: Tilman Hausherr Fix For: 2.0.0 Attachments: PDFBOX-2142.pdf, PDFBOX-2142.pdf-1.png, PDFBOX-2142.ps I have created a test file from PostScript to show that -CIELAB and XYZ- some colors are different when rendered by PDFBox. Btw the RGB colors in the file have no meaning, nor do the colors have any relationship to each other, i.e. they do not have to look identical to any other color anywhere. The PostScript file was created based on files by [James Cloos|http://jhcloos.com/PostScript/]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2424) ClassCastException in getMetaData if no real meta data
[ https://issues.apache.org/jira/browse/PDFBOX-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2424: Component/s: Parsing ClassCastException in getMetaData if no real meta data -- Key: PDFBOX-2424 URL: https://issues.apache.org/jira/browse/PDFBOX-2424 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.7, 1.8.8, 2.0.0 Reporter: Tilman Hausherr Here's an exception from [~talli...@apache.org] latest TIKA test (too lazy to test it myself, the cause is obvious) with the attached file:
{code}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:137)
    at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:120)
    at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:153)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:96)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:38)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
    at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getMetadata(PDDocumentCatalog.java:312)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:181)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    ... 13 more
{code}
Here's the excerpt in the PDF:
{code}
241 0 obj
<< /Type /Metadata /Subtype /XML >>
endobj
{code}
The current code is:
{code}
COSStream stream = (COSStream)root.getDictionaryObject( COSName.METADATA );
{code}
Shall we keep it that way, or rather put out a warning if the meta data is not a stream and return null? Adobe Reader does nothing when looking for the properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
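The alternative raised above, warning and returning null when the metadata entry is not a stream, could look roughly like the following. The `Cos*` classes are hypothetical stand-ins so that the sketch is self-contained; they are not PDFBox's classes, and this is not necessarily the fix that was eventually committed.

```java
import java.util.logging.Logger;

// Hypothetical stand-ins for the COS object model; not the real PDFBox types.
class CosBase { }
class CosDictionary extends CosBase { }
class CosStream extends CosDictionary { }

final class MetadataLookup {
    private static final Logger LOG = Logger.getLogger(MetadataLookup.class.getName());

    // Returns the metadata stream, or null (with a warning) if the entry is not a stream.
    static CosStream asMetadataStream(CosBase entry) {
        if (entry instanceof CosStream) {
            return (CosStream) entry;
        }
        if (entry != null) {
            // A plain dictionary, as in the excerpt above, ends up here instead of throwing.
            LOG.warning("Metadata entry is not a stream, ignoring: "
                    + entry.getClass().getSimpleName());
        }
        return null;
    }
}
```

Checking with `instanceof` before casting keeps a malformed file from aborting the whole parse, at the cost of silently treating the document as having no metadata.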
[jira] [Updated] (PDFBOX-2366) Improve high-level font API
[ https://issues.apache.org/jira/browse/PDFBOX-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2366: Affects Version/s: 2.0.0 Improve high-level font API --- Key: PDFBOX-2366 URL: https://issues.apache.org/jira/browse/PDFBOX-2366 Project: PDFBox Issue Type: Improvement Components: PDModel Affects Versions: 2.0.0 Reporter: John Hewson Assignee: John Hewson Priority: Minor Fix For: 2.0.0 The PDFont and Type1Equivalent APIs could expose some higher-level details, such as a consistent way to get names and Type1Equivalent instances. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PDFBOX-2424) ClassCastException in getMetaData if no real meta data
Tilman Hausherr created PDFBOX-2424: --- Summary: ClassCastException in getMetaData if no real meta data Key: PDFBOX-2424 URL: https://issues.apache.org/jira/browse/PDFBOX-2424 Project: PDFBox Issue Type: Bug Reporter: Tilman Hausherr Here's an exception from [~talli...@apache.org] latest TIKA test (too lazy to test it myself, the cause is obvious) with the attached file: {code} org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:137) at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:120) at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:153) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:96) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:38) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getMetadata(PDDocumentCatalog.java:312) at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:181) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158) at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) ... 13 more {code} here's the excerpt in the PDF: {code} 241 0 obj /Type /Metadata /Subtype /XML endobj {code} the current code is {code} COSStream stream = (COSStream)root.getDictionaryObject( COSName.METADATA ); {code} shall we keep it that way or rather put out a warning if the meta data is not a stream and return null? Adobe Reader does nothing when looking for the properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2424) ClassCastException in getMetaData if no real meta data
[ https://issues.apache.org/jira/browse/PDFBOX-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2424: Affects Version/s: 2.0.0 1.8.8 1.8.7 ClassCastException in getMetaData if no real meta data -- Key: PDFBOX-2424 URL: https://issues.apache.org/jira/browse/PDFBOX-2424 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.7, 1.8.8, 2.0.0 Reporter: Tilman Hausherr Here's an exception from [~talli...@apache.org] latest TIKA test (too lazy to test it myself, the cause is obvious) with the attached file: {code} org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:137) at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:120) at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:153) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:96) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:38) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream at 
org.apache.pdfbox.pdmodel.PDDocumentCatalog.getMetadata(PDDocumentCatalog.java:312) at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:181) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) ... 13 more {code} here's the excerpt in the PDF: {code} 241 0 obj /Type /Metadata /Subtype /XML endobj {code} the current code is {code} COSStream stream = (COSStream)root.getDictionaryObject( COSName.METADATA ); {code} shall we keep it that way or rather put out a warning if the meta data is not a stream and return null? Adobe Reader does nothing when looking for the properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2142) some /ICCBased colorspaces not rendered correctly
[ https://issues.apache.org/jira/browse/PDFBOX-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2142: Affects Version/s: 2.0.0 some /ICCBased colorspaces not rendered correctly - Key: PDFBOX-2142 URL: https://issues.apache.org/jira/browse/PDFBOX-2142 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 2.0.0 Reporter: Tilman Hausherr Fix For: 2.0.0 Attachments: PDFBOX-2142.pdf, PDFBOX-2142.pdf-1.png, PDFBOX-2142.ps I have created a test file from PostScript to show that -CIELAB and XYZ- some colors are different when rendered by PDFBox. Btw the RGB colors in the file have no meaning, nor do the colors have any relationship to each other, i.e. they do not have to look identical to any other color anywhere. The PostScript file was created based on files by [James Cloos|http://jhcloos.com/PostScript/]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-2130) PDAnnotationLinks are empty after saving as in Acrobat
[ https://issues.apache.org/jira/browse/PDFBOX-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167332#comment-14167332 ] John Hewson commented on PDFBOX-2130: - What are the Affects Version/s? PDAnnotationLinks are empty after saving as in Acrobat -- Key: PDFBOX-2130 URL: https://issues.apache.org/jira/browse/PDFBOX-2130 Project: PDFBox Issue Type: Bug Components: Writing Reporter: Andreas Weiss Hello dear pdfbox team, Do you have any idea how to fix the problem of links no longer working after "saving as" in Acrobat? This concerns PDAnnotationLinks with a goToPage action and a PDPageDestination (PDPageFitHeightDestination or any other, it doesn't matter): if you create a copy of the document using the "save as" option, then the links in the new document are empty, with no properties and no destinations. It seems that Acrobat overrides them. Only the links created with Acrobat remain. Besides, the COSArrays of the destinations 'created in Acrobat manually' and 'automatically using pdfbox' are slightly different. It's a big problem if you cannot save the document under another name without the links being destroyed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2424) ClassCastException in getMetaData if no real meta data
[ https://issues.apache.org/jira/browse/PDFBOX-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2424: Attachment: 333472.pdf ClassCastException in getMetaData if no real meta data -- Key: PDFBOX-2424 URL: https://issues.apache.org/jira/browse/PDFBOX-2424 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.7, 1.8.8, 2.0.0 Reporter: Tilman Hausherr Attachments: 333472.pdf Here's an exception from [~talli...@apache.org] latest TIKA test (too lazy to test it myself, the cause is obvious) with the attached file: {code} org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:137) at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:120) at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:153) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:96) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:38) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream at 
org.apache.pdfbox.pdmodel.PDDocumentCatalog.getMetadata(PDDocumentCatalog.java:312) at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:181) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) ... 13 more {code} here's the excerpt in the PDF: {code} 241 0 obj /Type /Metadata /Subtype /XML endobj {code} the current code is {code} COSStream stream = (COSStream)root.getDictionaryObject( COSName.METADATA ); {code} shall we keep it that way or rather put out a warning if the meta data is not a stream and return null? Adobe Reader does nothing when looking for the properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1979) TypeTestingHelper is non-deterministic
[ https://issues.apache.org/jira/browse/PDFBOX-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1979: Fix Version/s: 2.0.0 TypeTestingHelper is non-deterministic -- Key: PDFBOX-1979 URL: https://issues.apache.org/jira/browse/PDFBOX-1979 Project: PDFBox Issue Type: Bug Components: XmpBox Affects Versions: 1.8.7, 2.0.0 Reporter: John Hewson Assignee: Guillaume Bailleul Fix For: 2.0.0 Attachments: nd_test.patch TypeTestingHelper generates random calendar data and random UUIDs for testing, which means that it is non-deterministic. As discussed in PDFBOX-1977, we should alter this test to make sure that it has deterministic (regression test) functionality as well as the existing non-deterministic (fuzz test) functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2168) Different behavior of Undo feature when form was pre filled by PDFBox
[ https://issues.apache.org/jira/browse/PDFBOX-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-2168: Fix Version/s: 2.0.0 Different behavior of Undo feature when form was pre filled by PDFBox - Key: PDFBOX-2168 URL: https://issues.apache.org/jira/browse/PDFBOX-2168 Project: PDFBox Issue Type: Bug Components: AcroForm Reporter: Maruan Sahyoun Priority: Minor Labels: Appearance Fix For: 2.0.0 Attachments: formtemplate-filled-pdfbox.pdf, formtemplate-filled-reader.pdf, formtemplate.pdf When a form is pre-filled by PDFBox, the Undo feature in Adobe Reader will reset the field value but not change the visible appearance of the field, i.e. the old value will still be visible. The same form filled by Adobe Reader/Acrobat behaves correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1987) Provide a PDF Lexer as a base for PDF parsing
[ https://issues.apache.org/jira/browse/PDFBOX-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1987: Affects Version/s: 2.0.0 Provide a PDF Lexer as a base for PDF parsing - Key: PDFBOX-1987 URL: https://issues.apache.org/jira/browse/PDFBOX-1987 Project: PDFBox Issue Type: Improvement Components: Parsing Affects Versions: 2.0.0 Reporter: Maruan Sahyoun Priority: Minor Fix For: 2.0.0 Attachments: src.zip In order to enhance the parsing process and as a foundation for a combination of the different parsers a PDF lexer should be provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1979) TypeTestingHelper is non-deterministic
[ https://issues.apache.org/jira/browse/PDFBOX-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1979: Affects Version/s: 2.0.0 1.8.7 TypeTestingHelper is non-deterministic -- Key: PDFBOX-1979 URL: https://issues.apache.org/jira/browse/PDFBOX-1979 Project: PDFBox Issue Type: Bug Components: XmpBox Affects Versions: 1.8.7, 2.0.0 Reporter: John Hewson Assignee: Guillaume Bailleul Attachments: nd_test.patch TypeTestingHelper generates random calendar data and random UUIDs for testing, which means that it is non-deterministic. As discussed in PDFBOX-1977, we should alter this test to make sure that it has deterministic (regression test) functionality as well as the existing non-deterministic (fuzz test) functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1978) Type1FontUtilTest is non-deterministic
[ https://issues.apache.org/jira/browse/PDFBOX-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1978: Affects Version/s: 2.0.0 1.8.7 Type1FontUtilTest is non-deterministic -- Key: PDFBOX-1978 URL: https://issues.apache.org/jira/browse/PDFBOX-1978 Project: PDFBox Issue Type: Bug Components: FontBox Affects Versions: 1.8.7, 2.0.0 Reporter: John Hewson Fix For: 2.0.0 Type1FontUtilTest uses java.util.Random to generate random test data, which means that it is non-deterministic. As discussed in PDFBOX-1977, we should alter this test to make sure that it has deterministic (regression test) functionality as well as the existing non-deterministic (fuzz test) functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1978) Type1FontUtilTest is non-deterministic
[ https://issues.apache.org/jira/browse/PDFBOX-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1978: Fix Version/s: 2.0.0 Type1FontUtilTest is non-deterministic -- Key: PDFBOX-1978 URL: https://issues.apache.org/jira/browse/PDFBOX-1978 Project: PDFBox Issue Type: Bug Components: FontBox Affects Versions: 1.8.7, 2.0.0 Reporter: John Hewson Fix For: 2.0.0 Type1FontUtilTest uses java.util.Random to generate random test data, which means that it is non-deterministic. As discussed in PDFBOX-1977, we should alter this test to make sure that it has deterministic (regression test) functionality as well as the existing non-deterministic (fuzz test) functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
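The deterministic versus fuzz distinction described in PDFBOX-1978 and PDFBOX-1979 can be illustrated with a seeded generator. The following is only a sketch under assumptions: the class `SeededTestData` and its methods are hypothetical and are not part of PDFBox, FontBox or XmpBox; only `java.util.Random`, `java.util.UUID` and `java.util.GregorianCalendar` are real JDK classes.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Random;
import java.util.TimeZone;
import java.util.UUID;

// Hypothetical helper, not PDFBox code. All "random" test data is derived
// from a single Random, so a fixed seed gives a repeatable regression test
// and a fresh seed gives a fuzz test.
class SeededTestData {
    private final Random random;

    SeededTestData(long seed) {
        this.random = new Random(seed);
    }

    // Random bytes, for example as input data for a round-trip test.
    byte[] nextBytes(int length) {
        byte[] data = new byte[length];
        random.nextBytes(data);
        return data;
    }

    // A UUID built from two longs of the seeded generator
    // (UUID.randomUUID() cannot be seeded).
    UUID nextUuid() {
        return new UUID(random.nextLong(), random.nextLong());
    }

    // A UTC calendar with a time between the epoch and 2100-01-01.
    Calendar nextCalendar() {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        long millis = (long) (random.nextDouble() * 4_102_444_800_000L);
        cal.setTimeInMillis(millis);
        return cal;
    }
}
```

A regression test would construct the helper with a constant seed, while a fuzz run could pass `System.nanoTime()` and log the seed so that a failure can be replayed.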
[jira] [Commented] (PDFBOX-1873) NoninvertibleTransformException if form field isn't set with Scroll long text option
[ https://issues.apache.org/jira/browse/PDFBOX-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167336#comment-14167336 ] John Hewson commented on PDFBOX-1873: - The Affects Version/s were not given, so I'm guessing this is a 1.8 bug and should be fixed in at least 2.0, if not 1.8. NoninvertibleTransformException if form field isn't set with Scroll long text option -- Key: PDFBOX-1873 URL: https://issues.apache.org/jira/browse/PDFBOX-1873 Project: PDFBox Issue Type: Bug Components: AcroForm Affects Versions: 1.8.7, 2.0.0 Reporter: Álison Fernandes Priority: Critical Fix For: 2.0.0 Creating a PDF with a form field in Adobe Acrobat X Pro, if the field doesn't have the Scroll long text option set (which can be set in Field's Properties Options tab), PDFBox will throw an extensive list of the same exception (probably one for each char being drawn). Exception: Jan 31, 2014 10:31:59 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont writeFont SEVERE: Error in org.apache.pdfbox.pdmodel.font.PDType1Font.writeFont java.awt.geom.NoninvertibleTransformException: Determinant is 0 at java.awt.geom.AffineTransform.createInverse(AffineTransform.java:2707) at org.apache.pdfbox.pdmodel.font.PDSimpleFont.writeFont(PDSimpleFont.java:339) at org.apache.pdfbox.pdmodel.font.PDSimpleFont.drawString(PDSimpleFont.java:147) at org.apache.pdfbox.pdfviewer.PageDrawer.processTextPosition(PageDrawer.java:246) at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:496) at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45) at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:156) at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801) at 
org.xpandit.vvp.signedpdf.PdfboxUtils.renderToPanel(PdfboxUtils.java:186) at org.xpandit.vvp.signedpdf.PDFApplet.repaintPDF(PDFApplet.java:843) at org.xpandit.vvp.signedpdf.PDFApplet.actionPerformed(PDFApplet.java:925) at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018) at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2341) at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:402) at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:259) at javax.swing.plaf.basic.BasicButtonListener.mouseReleased(BasicButtonListener.java:252) at java.awt.Component.processMouseEvent(Component.java:6505) at javax.swing.JComponent.processMouseEvent(JComponent.java:3320) at java.awt.Component.processEvent(Component.java:6270) at java.awt.Container.processEvent(Container.java:2229) at java.awt.Component.dispatchEventImpl(Component.java:4861) at java.awt.Container.dispatchEventImpl(Container.java:2287) at java.awt.Component.dispatchEvent(Component.java:4687) at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4832) at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4492) at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4422) at java.awt.Container.dispatchEventImpl(Container.java:2273) at java.awt.Component.dispatchEvent(Component.java:4687) at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:735) at java.awt.EventQueue.access$200(EventQueue.java:103) at java.awt.EventQueue$3.run(EventQueue.java:694) at java.awt.EventQueue$3.run(EventQueue.java:692) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76) at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:87) at java.awt.EventQueue$4.run(EventQueue.java:708) at java.awt.EventQueue$4.run(EventQueue.java:706) at java.security.AccessController.doPrivileged(Native Method) at 
java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76) at java.awt.EventQueue.dispatchEvent(EventQueue.java:705) at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:242) at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:161) at
[jira] [Updated] (PDFBOX-1873) NoninvertibleTransformException if form field isn't set with Scroll long text option
[ https://issues.apache.org/jira/browse/PDFBOX-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1873: Fix Version/s: 2.0.0 NoninvertibleTransformException if form field isn't set with Scroll long text option -- Key: PDFBOX-1873 URL: https://issues.apache.org/jira/browse/PDFBOX-1873 Project: PDFBox Issue Type: Bug Components: AcroForm Affects Versions: 1.8.7, 2.0.0 Reporter: Álison Fernandes Priority: Critical Fix For: 2.0.0 Creating a PDF with a form field in Adobe Acrobat X Pro, if the field doesn't have the Scroll long text option set (which can be set in Field's Properties Options tab), PDFBox will throw an extensive list of the same exception (probably one for each char being drawn). Exception: Jan 31, 2014 10:31:59 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont writeFont SEVERE: Error in org.apache.pdfbox.pdmodel.font.PDType1Font.writeFont java.awt.geom.NoninvertibleTransformException: Determinant is 0 at java.awt.geom.AffineTransform.createInverse(AffineTransform.java:2707) at org.apache.pdfbox.pdmodel.font.PDSimpleFont.writeFont(PDSimpleFont.java:339) at org.apache.pdfbox.pdmodel.font.PDSimpleFont.drawString(PDSimpleFont.java:147) at org.apache.pdfbox.pdfviewer.PageDrawer.processTextPosition(PageDrawer.java:246) at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:496) at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45) at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:156) at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801) at org.xpandit.vvp.signedpdf.PdfboxUtils.renderToPanel(PdfboxUtils.java:186) at org.xpandit.vvp.signedpdf.PDFApplet.repaintPDF(PDFApplet.java:843) at 
org.xpandit.vvp.signedpdf.PDFApplet.actionPerformed(PDFApplet.java:925) at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018) at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2341) at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:402) at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:259) at javax.swing.plaf.basic.BasicButtonListener.mouseReleased(BasicButtonListener.java:252) at java.awt.Component.processMouseEvent(Component.java:6505) at javax.swing.JComponent.processMouseEvent(JComponent.java:3320) at java.awt.Component.processEvent(Component.java:6270) at java.awt.Container.processEvent(Container.java:2229) at java.awt.Component.dispatchEventImpl(Component.java:4861) at java.awt.Container.dispatchEventImpl(Container.java:2287) at java.awt.Component.dispatchEvent(Component.java:4687) at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4832) at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4492) at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4422) at java.awt.Container.dispatchEventImpl(Container.java:2273) at java.awt.Component.dispatchEvent(Component.java:4687) at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:735) at java.awt.EventQueue.access$200(EventQueue.java:103) at java.awt.EventQueue$3.run(EventQueue.java:694) at java.awt.EventQueue$3.run(EventQueue.java:692) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76) at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:87) at java.awt.EventQueue$4.run(EventQueue.java:708) at java.awt.EventQueue$4.run(EventQueue.java:706) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76) at java.awt.EventQueue.dispatchEvent(EventQueue.java:705) at 
java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:242) at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:161) at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:150) at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:146) at
[jira] [Updated] (PDFBOX-1873) NoninvertibleTransformException if form field isn't set with Scroll long text option
[ https://issues.apache.org/jira/browse/PDFBOX-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1873: Affects Version/s: 2.0.0 1.8.7 NoninvertibleTransformException if form field isn't set with Scroll long text option -- Key: PDFBOX-1873 URL: https://issues.apache.org/jira/browse/PDFBOX-1873 Project: PDFBox Issue Type: Bug Components: AcroForm Affects Versions: 1.8.7, 2.0.0 Reporter: Álison Fernandes Priority: Critical Fix For: 2.0.0 Creating a PDF with a form field in Adobe Acrobat X Pro, if the field doesn't have the Scroll long text option set (which can be set in Field's Properties Options tab), PDFBox will throw an extensive list of the same exception (probably one for each char being drawn). Exception: Jan 31, 2014 10:31:59 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont writeFont SEVERE: Error in org.apache.pdfbox.pdmodel.font.PDType1Font.writeFont java.awt.geom.NoninvertibleTransformException: Determinant is 0 at java.awt.geom.AffineTransform.createInverse(AffineTransform.java:2707) at org.apache.pdfbox.pdmodel.font.PDSimpleFont.writeFont(PDSimpleFont.java:339) at org.apache.pdfbox.pdmodel.font.PDSimpleFont.drawString(PDSimpleFont.java:147) at org.apache.pdfbox.pdfviewer.PageDrawer.processTextPosition(PageDrawer.java:246) at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:496) at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45) at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:156) at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801) at org.xpandit.vvp.signedpdf.PdfboxUtils.renderToPanel(PdfboxUtils.java:186) at org.xpandit.vvp.signedpdf.PDFApplet.repaintPDF(PDFApplet.java:843) 
at org.xpandit.vvp.signedpdf.PDFApplet.actionPerformed(PDFApplet.java:925) at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018) at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2341) at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:402) at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:259) at javax.swing.plaf.basic.BasicButtonListener.mouseReleased(BasicButtonListener.java:252) at java.awt.Component.processMouseEvent(Component.java:6505) at javax.swing.JComponent.processMouseEvent(JComponent.java:3320) at java.awt.Component.processEvent(Component.java:6270) at java.awt.Container.processEvent(Container.java:2229) at java.awt.Component.dispatchEventImpl(Component.java:4861) at java.awt.Container.dispatchEventImpl(Container.java:2287) at java.awt.Component.dispatchEvent(Component.java:4687) at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4832) at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4492) at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4422) at java.awt.Container.dispatchEventImpl(Container.java:2273) at java.awt.Component.dispatchEvent(Component.java:4687) at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:735) at java.awt.EventQueue.access$200(EventQueue.java:103) at java.awt.EventQueue$3.run(EventQueue.java:694) at java.awt.EventQueue$3.run(EventQueue.java:692) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76) at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:87) at java.awt.EventQueue$4.run(EventQueue.java:708) at java.awt.EventQueue$4.run(EventQueue.java:706) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76) at java.awt.EventQueue.dispatchEvent(EventQueue.java:705) at 
java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:242) at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:161) at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:150) at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:146) at
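The trace above ends in `AffineTransform.createInverse`, which rejects a transform whose determinant is 0. The sketch below reproduces that situation with the JDK alone and shows one possible guard. It is an illustration under assumptions: `SafeInverse` is a hypothetical helper, not PDFBox code, and a zero scale factor is just one way to obtain a singular transform; whether that is the cause in PDFBOX-1873 is not established here.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.NoninvertibleTransformException;

// Illustration only: SafeInverse is a hypothetical helper, not PDFBox code.
class SafeInverse {
    // Returns the inverse transform, or null if the transform is singular.
    static AffineTransform invertOrNull(AffineTransform at) {
        if (at.getDeterminant() == 0.0) {
            return null; // singular: createInverse() would throw
        }
        try {
            return at.createInverse();
        } catch (NoninvertibleTransformException e) {
            // createInverse() can also reject determinants extremely close to zero
            return null;
        }
    }

    public static void main(String[] args) {
        AffineTransform zeroScale = AffineTransform.getScaleInstance(0, 0);
        try {
            zeroScale.createInverse();
        } catch (NoninvertibleTransformException e) {
            System.out.println("Caught: " + e.getMessage());
        }
        System.out.println(invertOrNull(zeroScale));
        System.out.println(invertOrNull(AffineTransform.getScaleInstance(2, 4)) != null);
    }
}
```

Checking the determinant once before drawing would avoid logging the same exception for every character, at the cost of silently skipping text that cannot be positioned.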
[jira] [Updated] (PDFBOX-1842) Warn if command-line pdf encryption destroys a pre-existing signature
[ https://issues.apache.org/jira/browse/PDFBOX-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1842: Fix Version/s: 2.0.0 Warn if command-line pdf encryption destroys a pre-existing signature - Key: PDFBOX-1842 URL: https://issues.apache.org/jira/browse/PDFBOX-1842 Project: PDFBox Issue Type: Bug Components: Utilities Affects Versions: 1.8.7, 2.0.0 Reporter: Ralf Hauser Priority: Minor Fix For: 2.0.0 see also PDFBOX-1594 , PDFBOX-912 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1863) Can't resize PDFPagePanel render
[ https://issues.apache.org/jira/browse/PDFBOX-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1863: Affects Version/s: 2.0.0 1.8.7 Can't resize PDFPagePanel render Key: PDFBOX-1863 URL: https://issues.apache.org/jira/browse/PDFBOX-1863 Project: PDFBox Issue Type: Bug Components: Swing GUI Affects Versions: 1.8.7, 2.0.0 Reporter: Álison Fernandes I tried to use PDFPagePanel to render a PDF to an applet, but I had to change my implementation because PDFPagePanel could not resize the rendering to make it bigger. I've checked the source code (of pdfbox-1.8.2 and the SVN trunk): the Dimension drawDimension variable that sets the rendering size isn't accessible from outside, and it will draw using the dimension of the PDF CropBox. My current implementation to bypass this is: - Create a JPanel - Render the page to an image using PDPage.convertToImage(...) - Add the image to the JPanel using JLabel picLabel = new JLabel(new ImageIcon(page.convertToImage(...))); - Repeat for all the pages - Set the panel as a viewport in a JScrollPane Unfortunately, this method takes way too much time if you have to render things multiple times (~1 second for more complex pages with forms). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
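The workaround listed in PDFBOX-1863 can be sketched with plain Swing as follows. This is a minimal sketch under assumptions: `PagePanelSketch` and its methods are hypothetical names, and the page images are taken as an already rendered list so that the example does not depend on PDFBox; in the reporter's setup each image would come from `PDPage.convertToImage(...)`.

```java
import java.awt.image.BufferedImage;
import java.util.List;
import javax.swing.BoxLayout;
import javax.swing.ImageIcon;
import javax.swing.JLabel;
import javax.swing.JPanel;
import javax.swing.JScrollPane;

// Hypothetical sketch of the reporter's workaround, not PDFBox code.
class PagePanelSketch {
    // Adds one JLabel per already rendered page image, stacked vertically.
    static JPanel buildPagePanel(List<BufferedImage> pageImages) {
        JPanel panel = new JPanel();
        panel.setLayout(new BoxLayout(panel, BoxLayout.Y_AXIS));
        for (BufferedImage image : pageImages) {
            panel.add(new JLabel(new ImageIcon(image)));
        }
        return panel;
    }

    // Makes the panel the viewport view of a scroll pane.
    static JScrollPane wrapInScrollPane(JPanel panel) {
        return new JScrollPane(panel);
    }
}
```

Because the rendering cost reported in the issue lies in producing the images, keeping the rendered images and rebuilding only the panel is one way to avoid repeating that cost.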
[jira] [Updated] (PDFBOX-1842) Warn if command-line pdf encryption destroys a pre-existing signature
[ https://issues.apache.org/jira/browse/PDFBOX-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1842: Affects Version/s: 2.0.0 1.8.7 Warn if command-line pdf encryption destroys a pre-existing signature - Key: PDFBOX-1842 URL: https://issues.apache.org/jira/browse/PDFBOX-1842 Project: PDFBox Issue Type: Bug Components: Utilities Affects Versions: 1.8.7, 2.0.0 Reporter: Ralf Hauser Priority: Minor Fix For: 2.0.0 see also PDFBOX-1594 , PDFBOX-912 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1833) BaseParser tidy up
[ https://issues.apache.org/jira/browse/PDFBOX-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1833: Affects Version/s: 2.0.0 BaseParser tidy up -- Key: PDFBOX-1833 URL: https://issues.apache.org/jira/browse/PDFBOX-1833 Project: PDFBox Issue Type: Improvement Components: Parsing Affects Versions: 2.0.0 Reporter: Jens Kapitza Priority: Minor Fix For: 2.0.0 Attachments: baseparser.patch Original Estimate: 0.5h Remaining Estimate: 0.5h Tidy up logic (should not change the parsing result). Character.isWhitespace(c) is the only point which may have side effects (but I assume there is no file separator in parseCOSHexString), so this should pass as it passed before. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1833) BaseParser tidy up
[ https://issues.apache.org/jira/browse/PDFBOX-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1833: Fix Version/s: 2.0.0 BaseParser tidy up -- Key: PDFBOX-1833 URL: https://issues.apache.org/jira/browse/PDFBOX-1833 Project: PDFBox Issue Type: Improvement Components: Parsing Affects Versions: 2.0.0 Reporter: Jens Kapitza Priority: Minor Fix For: 2.0.0 Attachments: baseparser.patch Original Estimate: 0.5h Remaining Estimate: 0.5h Tidy up logic (should not change the parsing result). Character.isWhitespace(c) is the only point which may have side effects (but I assume there is no file separator in parseCOSHexString), so this should pass as it passed before. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
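The side effect mentioned in PDFBOX-1833 comes from `Character.isWhitespace` using Java's definition of whitespace, which includes the file separator U+001C, rather than the PDF definition. The sketch below lists the byte values where the two definitions disagree. It is an illustration only: `WhitespaceCheck` and `isPdfWhitespace` are hypothetical names, not the BaseParser code or the attached patch, and the PDF whitespace set is taken from the PDF specification (NUL, HT, LF, FF, CR, SPACE).

```java
// Illustration only: WhitespaceCheck is a hypothetical class, not BaseParser.
class WhitespaceCheck {
    // White-space characters as listed in the PDF specification:
    // NUL, HT, LF, FF, CR and SPACE.
    static boolean isPdfWhitespace(int c) {
        return c == 0x00 || c == 0x09 || c == 0x0A
            || c == 0x0C || c == 0x0D || c == 0x20;
    }

    public static void main(String[] args) {
        // Print every byte value where the two definitions disagree.
        for (int c = 0; c < 256; c++) {
            boolean javaWs = Character.isWhitespace(c);
            boolean pdfWs = isPdfWhitespace(c);
            if (javaWs != pdfWs) {
                System.out.printf("0x%02X java=%b pdf=%b%n", c, javaWs, pdfWs);
            }
        }
    }
}
```

A dedicated check such as `isPdfWhitespace` keeps the parser independent of Java's broader definition, which matters only if such bytes can actually occur at that point in the input.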
[jira] [Resolved] (PDFBOX-1788) [PATCH] Show warning if system font not found
[ https://issues.apache.org/jira/browse/PDFBOX-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-1788. - Resolution: Won't Fix We're no longer using AWT fonts in 2.0, so this patch no longer applies. [PATCH] Show warning if system font not found - Key: PDFBOX-1788 URL: https://issues.apache.org/jira/browse/PDFBOX-1788 Project: PDFBox Issue Type: Bug Components: Rendering Reporter: simon steiner Attachments: warnmissingfonts.patch If you process a PDF which doesn't embed a font, PDFBox will try to use a system font, but that font may not exist, so we should print a warning. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1710) PDF structure report as XML
[ https://issues.apache.org/jira/browse/PDFBOX-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1710: Affects Version/s: 1.8.2 PDF structure report as XML --- Key: PDFBOX-1710 URL: https://issues.apache.org/jira/browse/PDFBOX-1710 Project: PDFBox Issue Type: New Feature Components: Utilities Affects Versions: 1.8.2 Reporter: Axel Rose Priority: Minor Attachments: PdfAnalysis.java I wrote a utility to get an XML report of a PDF input file. Please see the attached source code and check if it can be incorporated into a pdfbox release. I'm happy to hear about problems in my code and tasks to correct this. Test on a command line like this java -cp pdfbox-1.8.2.jar:commons-logging-1.0.4.jar:bcprov-jdk16-145.jar:. PdfAnalysis file.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (PDFBOX-1579) add logging if /FontFile2 entry is missing and a system font is used instead
[ https://issues.apache.org/jira/browse/PDFBOX-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson closed PDFBOX-1579. --- Resolution: Won't Fix We're no longer using AWT fonts in 2.0, so this patch no longer applies. add logging if /FontFile2 entry is missing and a system font is used instead Key: PDFBOX-1579 URL: https://issues.apache.org/jira/browse/PDFBOX-1579 Project: PDFBox Issue Type: Improvement Components: PDModel Reporter: Luis Bernardo Priority: Trivial Attachments: pdfbox-1579.patch This issue became apparent when output from a PDF that had a non embedded TTF (/FontFile2 entry missing) was different in different machines. After investigation I found out PDFBox was using a system font, but in one machine the font was installed and in the other it wasn't. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-1573) pdf text highlighting
[ https://issues.apache.org/jira/browse/PDFBOX-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167345#comment-14167345 ] John Hewson commented on PDFBOX-1573: - Didn't we add an example to 1.8 which does this? pdf text highlighting - Key: PDFBOX-1573 URL: https://issues.apache.org/jira/browse/PDFBOX-1573 Project: PDFBox Issue Type: New Feature Components: Utilities Reporter: Arun Try to add a method which returns the List of coordinates of the given text: find all locations of the text and determine x/y coordinates and width/height. The feature is similar to https://pdfclown.wordpress.com/tag/text-highlighting/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1573) pdf text highlighting
[ https://issues.apache.org/jira/browse/PDFBOX-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1573: Priority: Minor (was: Major) pdf text highlighting - Key: PDFBOX-1573 URL: https://issues.apache.org/jira/browse/PDFBOX-1573 Project: PDFBox Issue Type: New Feature Components: Utilities Affects Versions: 1.8.6, 2.0.0 Reporter: Arun Priority: Minor Try to add a method which returns the List of coordinates of the given text: find all locations of the text and determine x/y coordinates and width/height. The feature is similar to https://pdfclown.wordpress.com/tag/text-highlighting/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1573) pdf text highlighting
[ https://issues.apache.org/jira/browse/PDFBOX-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1573: Affects Version/s: 2.0.0 1.8.6 pdf text highlighting - Key: PDFBOX-1573 URL: https://issues.apache.org/jira/browse/PDFBOX-1573 Project: PDFBox Issue Type: New Feature Components: Utilities Affects Versions: 1.8.6, 2.0.0 Reporter: Arun Try to add a method which returns the List of coordinates of the given text: find all locations of the text and determine x/y coordinates and width/height. The feature is similar to https://pdfclown.wordpress.com/tag/text-highlighting/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (PDFBOX-1537) [PATCH] Java crash, Type 2 CID Fonts and image alpha channels not properly handled in Imported PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson closed PDFBOX-1537. --- Resolution: Won't Fix 1.8's font handling isn't going to receive any further attention. [PATCH] Java crash, Type 2 CID Fonts and image alpha channels not properly handled in Imported PDFs --- Key: PDFBOX-1537 URL: https://issues.apache.org/jira/browse/PDFBOX-1537 Project: PDFBox Issue Type: Bug Components: Rendering Reporter: simon steiner Attachments: pdfinpstrunk.patch, simontest.pdf, test.fo Running from FOP (fop test.fo -ps out.ps), the CID font causes a crash; if I disable CID, the font is wrong and the image is inverted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (PDFBOX-1534) Graphics2D to create PDF
[ https://issues.apache.org/jira/browse/PDFBOX-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson closed PDFBOX-1534. --- Resolution: Won't Fix No. Graphics2D to create PDF Key: PDFBOX-1534 URL: https://issues.apache.org/jira/browse/PDFBOX-1534 Project: PDFBox Issue Type: New Feature Components: Writing Reporter: Samuel Pfitzer Priority: Minor Apache FOP has a PDFGraphics2D. It is used for drawing into a pdf document. Drawing into an image is not an alternative because the size of the document gets too big. Is it planned to have this feature for PDFBox? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1450) document how to encrypt with AES 256
[ https://issues.apache.org/jira/browse/PDFBOX-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1450: Fix Version/s: 2.0.0 document how to encrypt with AES 256 Key: PDFBOX-1450 URL: https://issues.apache.org/jira/browse/PDFBOX-1450 Project: PDFBox Issue Type: Wish Components: Documentation Affects Versions: 2.0.0 Reporter: Ralf Hauser Priority: Minor Fix For: 2.0.0 Please add a Java code sample showing how to do this to the web site and link to it from http://pdfbox.apache.org/commandlineutilities/Encrypt.html See also PDFBOX-953 and PDFBOX-135. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1450) document how to encrypt with AES 256
[ https://issues.apache.org/jira/browse/PDFBOX-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1450: Affects Version/s: 2.0.0 document how to encrypt with AES 256 Key: PDFBOX-1450 URL: https://issues.apache.org/jira/browse/PDFBOX-1450 Project: PDFBox Issue Type: Wish Components: Documentation Affects Versions: 2.0.0 Reporter: Ralf Hauser Priority: Minor Fix For: 2.0.0 Please add a Java code sample showing how to do this to the web site and link to it from http://pdfbox.apache.org/commandlineutilities/Encrypt.html See also PDFBOX-953 and PDFBOX-135. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PDFBOX-1409) Create Preflight documentation
[ https://issues.apache.org/jira/browse/PDFBOX-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-1409. - Resolution: Fixed Close enough. Create Preflight documentation -- Key: PDFBOX-1409 URL: https://issues.apache.org/jira/browse/PDFBOX-1409 Project: PDFBox Issue Type: Task Components: Preflight Reporter: Eric Leleu Priority: Minor Add documentation about the preflight module. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PDFBOX-1386) Proposal for classes to handle optional contents
[ https://issues.apache.org/jira/browse/PDFBOX-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-1386. - Resolution: Won't Fix We already have classes for using optional content in the org.apache.pdfbox.pdmodel.graphics.optionalcontent package. Proposal for classes to handle optional contents Key: PDFBOX-1386 URL: https://issues.apache.org/jira/browse/PDFBOX-1386 Project: PDFBox Issue Type: New Feature Components: PDModel Reporter: Dominic Tubach Priority: Minor Attachments: DTCOSName.java, DTPDContentUsageDictionary.java, DTPDContentUsageDictionaryTest.java, DTPDOptionalContentConfiguration.java, DTPDOptionalContentConfigurationTest.java, DTPDOptionalContentGroup.java, DTPDOptionalContentGroupTest.java, DTPDOptionalContentMembershipDictionary.java, DTPDOptionalContentMembershipDictionaryTest.java, DTPDOptionalContentProperties.java, DTPDOptionalContentPropertiesTest.java, DTPDUsageApplicationDictionary.java, DTPDUsageApplicationDictionaryTest.java Attached are classes as a proposal to handle optional contents. They require the classes in the issues #PDFBOX-1383 and #PDFBOX-1385. They require Java 1.6. (It might be enough to remove the @Override annotations for Java 1.5 compatibility.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (PDFBOX-1385) Proposal for a PD tree that represents a tree based on arrays.
[ https://issues.apache.org/jira/browse/PDFBOX-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson closed PDFBOX-1385. --- Resolution: Won't Fix 2.0 has taken a different direction with the handling of trees. Proposal for a PD tree that represents a tree based on arrays. -- Key: PDFBOX-1385 URL: https://issues.apache.org/jira/browse/PDFBOX-1385 Project: PDFBox Issue Type: New Feature Components: PDModel Reporter: Dominic Tubach Priority: Minor Attachments: DTPDTreeIntermediateNode.java, DTPDTreeLabeledNode.java, DTPDTreeLeafNode.java, DTPDTreeNode.java, DTPDTreeNodeTest.java Attached is a proposal for a PD tree that represents a tree that is based on arrays (such as RBGroups). The required COSArrayList and COSBaseConverter can be found in issue #PDFBOX-1383. It requires Java 1.6. (It might be enough to remove the @Override annotations for Java 1.5 compatibility.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1383) Proposal for a new COSArrayList
[ https://issues.apache.org/jira/browse/PDFBOX-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1383: Affects Version/s: 2.0.0 1.8.7 Proposal for a new COSArrayList --- Key: PDFBOX-1383 URL: https://issues.apache.org/jira/browse/PDFBOX-1383 Project: PDFBox Issue Type: Improvement Components: PDModel Affects Versions: 1.8.7, 2.0.0 Reporter: Dominic Tubach Priority: Minor Fix For: 2.0.0 Attachments: DTCOSArrayList.java, DTCOSArrayListTest.java, DTCOSBaseConverter.java, DefaultDTCOSBaseConverter.java, DefaultDTCOSBaseConverterTest.java Attached is a proposal for a new COSArrayList. Main differences to the existing COSArrayList: - type safety through generics. - it's always clear which types of objects the array holds. - flexible loading of objects from a dictionary through COSBaseConverter (see below). - correct updating of dictionary entry, no matter whether it is optional, a single value is allowed, or it is required. - listener interface. However there are some drawbacks: - it allows only classes/interfaces that implement/extend COSObjectable. - DualCOSObjectables are not possible. (Would require an extra class.) - no Java types such as String or Float (I see this as advantage as I was a bit confused when I expected an Array with COSNames, but got Strings. By the way adding a String in that case would not add a COSName as one might expect, but a COSString.) - replacing the existing COSArrayList would require changes in existing code. - requires (as of now) Java 1.6 (It might be enough to remove the @Override annotations for Java 1.5 compatibility.) Now to the COSBaseConverter. The COSBaseConverter is just an interface that defines a conversion method to convert a COSBase object to a class that implements COSObjectable. The default implementation tries to find a fitting constructor to instantiate the object. If the destination class is an Enum it tries to find a fitting static valueOf method to create the object. 
(To avoid a conflict with the existing COSArrayList, I prefixed everything with my initials.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
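The converter behaviour described in this proposal (look for a fitting constructor, or a static valueOf method when the destination is an enum) could look roughly like the sketch below. This is not the attached DefaultDTCOSBaseConverter: the types SketchBase, SketchName and SketchObjectable are hypothetical stand-ins for COSBase, COSName and COSObjectable, and the lookup rules are an assumption based only on the description above.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Locale;

// Hypothetical stand-ins for COSBase, COSName and COSObjectable.
class SketchBase {}

class SketchName extends SketchBase {
    final String name;
    SketchName(String name) { this.name = name; }
}

interface SketchObjectable {
    SketchBase getBase();
}

// A wrapper class with a "fitting constructor" that takes the low-level object.
class SketchColorSpace implements SketchObjectable {
    private final SketchName base;
    public SketchColorSpace(SketchName base) { this.base = base; }
    public SketchBase getBase() { return base; }
}

// An enum offering a static valueOf overload for the low-level object.
enum SketchMode implements SketchObjectable {
    NORMAL, MULTIPLY;
    public static SketchMode valueOf(SketchName n) {
        return valueOf(n.name.toUpperCase(Locale.ROOT));
    }
    public SketchBase getBase() { return new SketchName(name()); }
}

final class SketchConverter {
    // Converts a low-level object to the requested wrapper type via reflection.
    static <T extends SketchObjectable> T convert(SketchBase base, Class<T> type) {
        try {
            if (type.isEnum()) {
                // Enum destination: look for a static one-argument valueOf that accepts 'base'.
                for (Method m : type.getMethods()) {
                    Class<?>[] params = m.getParameterTypes();
                    if (m.getName().equals("valueOf") && Modifier.isStatic(m.getModifiers())
                            && params.length == 1 && params[0].isInstance(base)) {
                        return type.cast(m.invoke(null, base));
                    }
                }
            } else {
                // Class destination: look for a public one-argument constructor that accepts 'base'.
                for (Constructor<?> c : type.getConstructors()) {
                    Class<?>[] params = c.getParameterTypes();
                    if (params.length == 1 && params[0].isInstance(base)) {
                        return type.cast(c.newInstance(base));
                    }
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Conversion to " + type.getName() + " failed", e);
        }
        throw new IllegalArgumentException("No fitting constructor or valueOf in " + type.getName());
    }
}
```

With a converter of this shape, a typed list only needs the element class to turn each array entry into a wrapper, which is presumably what makes the loading "flexible" in the sense of the proposal.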
[jira] [Updated] (PDFBOX-1383) Proposal for a new COSArrayList
[ https://issues.apache.org/jira/browse/PDFBOX-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1383: Priority: Major (was: Minor) Proposal for a new COSArrayList --- Key: PDFBOX-1383 URL: https://issues.apache.org/jira/browse/PDFBOX-1383 Project: PDFBox Issue Type: Improvement Components: PDModel Affects Versions: 1.8.7, 2.0.0 Reporter: Dominic Tubach Fix For: 2.0.0 Attachments: DTCOSArrayList.java, DTCOSArrayListTest.java, DTCOSBaseConverter.java, DefaultDTCOSBaseConverter.java, DefaultDTCOSBaseConverterTest.java Attached is a proposal for a new COSArrayList. Main differences to the existing COSArrayList: - type safety through generics. - it's always clear which types of objects the array holds. - flexible loading of objects from a dictionary through COSBaseConverter (see below). - correct updating of dictionary entry, no matter whether it is optional, a single value is allowed, or it is required. - listener interface. However there are some drawbacks: - it allows only classes/interfaces that implement/extend COSObjectable. - DualCOSObjectables are not possible. (Would require an extra class.) - no Java types such as String or Float (I see this as advantage as I was a bit confused when I expected an Array with COSNames, but got Strings. By the way adding a String in that case would not add a COSName as one might expect, but a COSString.) - replacing the existing COSArrayList would require changes in existing code. - requires (as of now) Java 1.6 (It might be enough to remove the @Override annotations for Java 1.5 compatibility.) Now to the COSBaseConverter. The COSBaseConverter is just an interface that defines a conversion method to convert a COSBase object to a class that implements COSObjectable. The default implementation tries to find a fitting constructor to instantiate the object. If the destination class is an Enum it tries to find a fitting static valueOf method to create the object. 
(To avoid a conflict with the existing COSArrayList, I prefixed everything with my initials.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (PDFBOX-1317) PDFBox giving AcroFields size zero for some pdf document.
[ https://issues.apache.org/jira/browse/PDFBOX-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson closed PDFBOX-1317. --- Resolution: Cannot Reproduce The PDF file in the link is no longer available. PDFBox giving AcroFields size zero for some pdf document. - Key: PDFBOX-1317 URL: https://issues.apache.org/jira/browse/PDFBOX-1317 Project: PDFBox Issue Type: Bug Components: AcroForm Reporter: Manoj Patel I am working on a PDF form-fill utility and found that some PDFs return a blank AcroField array. Download the PDF document from the link below: https://skydrive.live.com/redir?resid=C420713A859E927D!118authkey=!AJqh1odSC8MqMrE It will give a blank list of AcroFields. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PDFBOX-1301) Wrong characters in HTML/TXT file from PDF containing scanned pages/images
[ https://issues.apache.org/jira/browse/PDFBOX-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-1301. - Resolution: Fixed Fix Version/s: 2.0.0 This is fixed in 2.0 Wrong characters in HTML/TXT file from PDF containing scanned pages/images -- Key: PDFBOX-1301 URL: https://issues.apache.org/jira/browse/PDFBOX-1301 Project: PDFBox Issue Type: Bug Components: Text extraction Environment: Windows XP, java version 1.6.0_29 Reporter: Jan Divis Fix For: 2.0.0 Attachments: 54391-scan.pdf, converted-wrong-chars.html, correct-chars-when-converted-splitted-page.html
When trying to extract text/html from the attached PDF file, there are some wrong characters (instead of characters with diacritics):
Pro úþely tohoto Protokolu mohou bêt sdělení ]asílána prostřednictvím elektronickêch nebo Makêchkoli Minêch prostředkĤ
instead of
Pro účely tohoto Protokolu mohou být sdělení zasílána prostřednictvím elektronických nebo jakýchkoli jiných prostředků
or, in the HTML output,
Pro &#250;&#254;ely tohoto Protokolu mohou b&#234;t sd&#283;len&#237; ]as&#237;l&#225;na prost&#345;ednictv&#237;m elektronick&#234;ch nebo Mak&#234;chkoli Min&#234;ch prost&#345;edk&#292;
instead of
Pro &#250;&#269;ely tohoto Protokolu mohou b&#253;t sd&#283;len&#237; zas&#237;l&#225;na prost&#345;ednictv&#237;m elektronick&#253;ch nebo jak&#253;chkoli jin&#253;ch prost&#345;edk&#367;
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1234) NPE at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.calculateFontSize(PDAppearance.java:551)
[ https://issues.apache.org/jira/browse/PDFBOX-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1234: Affects Version/s: 2.0.0 1.8.4 NPE at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.calculateFontSize(PDAppearance.java:551) --- Key: PDFBOX-1234 URL: https://issues.apache.org/jira/browse/PDFBOX-1234 Project: PDFBox Issue Type: Bug Components: AcroForm Affects Versions: 1.8.4, 2.0.0 Reporter: Christer Palm Fix For: 2.0.0 Attachments: 200221.pdf, SetPDFFieldValueTest.java, fw8bene--dft.pdf
Using SVN trunk revision 1291094 (2012-02-18). Getting the following stack trace when trying to call PDField.setValue() on an AcroForm field in the attached document:
java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.calculateFontSize(PDAppearance.java:551)
at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.insertGeneratedAppearance(PDAppearance.java:371)
at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.setAppearanceValue(PDAppearance.java:281)
at org.apache.pdfbox.pdmodel.interactive.form.PDVariableText.setValue(PDVariableText.java:131)
The reason seems to be that PDAppearance.getFontAndUpdateResources() returns null, in turn because the font dictionary for the DA of the field (/Cour 11 Tf 0 g) is not present in the document. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
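To illustrate the failure mode in this report: the DA string names a font (/Cour) that has to be looked up in the form's font resources, and the lookup can come back empty. The sketch below is not PDFBox code; DefaultAppearanceSketch and its methods are hypothetical, and the font resources are modelled as a plain Map. It only shows how an explicit check could turn the NullPointerException into a descriptive error.

```java
import java.util.Map;

final class DefaultAppearanceSketch {
    // Returns {fontName, fontSize} from a DA string such as "/Cour 11 Tf 0 g",
    // or null if the string contains no "Tf" operator with a name operand.
    static String[] parseFont(String da) {
        String[] tokens = da.trim().split("\\s+");
        for (int i = 2; i < tokens.length; i++) {
            if (tokens[i].equals("Tf") && tokens[i - 2].startsWith("/")) {
                return new String[] { tokens[i - 2].substring(1), tokens[i - 1] };
            }
        }
        return null;
    }

    // Looks the DA font up in the given resources (hypothetical model: name -> font object)
    // and fails with a clear message instead of returning null.
    static <F> F resolveFont(String da, Map<String, F> fontResources) {
        String[] font = parseFont(da);
        if (font == null) {
            throw new IllegalStateException("No Tf operator in default appearance: " + da);
        }
        F resolved = fontResources.get(font[0]);
        if (resolved == null) {
            throw new IllegalStateException("Font /" + font[0] + " named in DA is missing from the font resources");
        }
        return resolved;
    }
}
```

Whether the right fix is to fail early like this or to fall back to a default font is a design decision the report leaves open.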
[jira] [Updated] (PDFBOX-1234) NPE at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.calculateFontSize(PDAppearance.java:551)
[ https://issues.apache.org/jira/browse/PDFBOX-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1234: Fix Version/s: 2.0.0 NPE at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.calculateFontSize(PDAppearance.java:551) --- Key: PDFBOX-1234 URL: https://issues.apache.org/jira/browse/PDFBOX-1234 Project: PDFBox Issue Type: Bug Components: AcroForm Affects Versions: 1.8.4, 2.0.0 Reporter: Christer Palm Fix For: 2.0.0 Attachments: 200221.pdf, SetPDFFieldValueTest.java, fw8bene--dft.pdf
Using SVN trunk revision 1291094 (2012-02-18). Getting the following stack trace when trying to call PDField.setValue() on an AcroForm field in the attached document:
java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.calculateFontSize(PDAppearance.java:551)
at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.insertGeneratedAppearance(PDAppearance.java:371)
at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.setAppearanceValue(PDAppearance.java:281)
at org.apache.pdfbox.pdmodel.interactive.form.PDVariableText.setValue(PDVariableText.java:131)
The reason seems to be that PDAppearance.getFontAndUpdateResources() returns null, in turn because the font dictionary for the DA of the field (/Cour 11 Tf 0 g) is not present in the document. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1176) Watermark Annotations
[ https://issues.apache.org/jira/browse/PDFBOX-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1176: Summary: Watermark Annotations (was: Watermark) Watermark Annotations - Key: PDFBOX-1176 URL: https://issues.apache.org/jira/browse/PDFBOX-1176 Project: PDFBox Issue Type: Wish Components: Writing Reporter: Rubesh MX Labels: Watermark Original Estimate: 24h Remaining Estimate: 24h I am checking whether watermarks can be added to a PDF document and removed again in the same way. So far I could not find any option to do that with PDFBox; it would be better if there were an option to add and remove a watermark. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1176) Watermark Annotations
[ https://issues.apache.org/jira/browse/PDFBOX-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1176: Affects Version/s: 2.0.0 1.8.7 Watermark Annotations - Key: PDFBOX-1176 URL: https://issues.apache.org/jira/browse/PDFBOX-1176 Project: PDFBox Issue Type: Wish Components: Writing Affects Versions: 1.8.7, 2.0.0 Reporter: Rubesh MX Labels: Watermark Fix For: 2.0.0 Original Estimate: 24h Remaining Estimate: 24h I am checking whether watermarks can be added to a PDF document and removed again in the same way. So far I could not find any option to do that with PDFBox; it would be better if there were an option to add and remove a watermark. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1176) Watermark Annotations
[ https://issues.apache.org/jira/browse/PDFBOX-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1176: Fix Version/s: 2.0.0 Watermark Annotations - Key: PDFBOX-1176 URL: https://issues.apache.org/jira/browse/PDFBOX-1176 Project: PDFBox Issue Type: Wish Components: Writing Affects Versions: 1.8.7, 2.0.0 Reporter: Rubesh MX Labels: Watermark Fix For: 2.0.0 Original Estimate: 24h Remaining Estimate: 24h I am checking whether watermarks can be added to a PDF document and removed again in the same way. So far I could not find any option to do that with PDFBox; it would be better if there were an option to add and remove a watermark. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1155) setSuppressDuplicateOverlappingText sometimes removes characters that it shouldn't
[ https://issues.apache.org/jira/browse/PDFBOX-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1155: Fix Version/s: 2.0.0 setSuppressDuplicateOverlappingText sometimes removes characters that it shouldn't -- Key: PDFBOX-1155 URL: https://issues.apache.org/jira/browse/PDFBOX-1155 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.8.7, 2.0.0 Reporter: Michael McCandless Priority: Minor Fix For: 2.0.0 Attachments: 000527.pdf, dedup.diffs.txt The duplicate detection (in PDFTextStripper.java) checks whether the same character was placed nearish to where we are about to place another and de-dups it if so; this is to catch documents that rewind and overwrite in order to bold word(s). But in some cases I see it removing valid characters (that were not dups). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
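The kind of duplicate detection described in this report can be sketched as follows. This is not the PDFTextStripper implementation: DedupSketch, its fixed tolerance and the coordinates are hypothetical, chosen only to show how a "same character placed nearby" rule catches bold-by-overprinting and can also drop legitimate characters.

```java
import java.util.ArrayList;
import java.util.List;

final class DedupSketch {
    // One glyph that has already been accepted, with its position.
    private static final class Placed {
        final char ch;
        final float x;
        final float y;
        Placed(char ch, float x, float y) { this.ch = ch; this.x = x; this.y = y; }
    }

    private final List<Placed> seen = new ArrayList<Placed>();
    private final StringBuilder out = new StringBuilder();
    private final float tolerance; // hypothetical fixed tolerance in text-space units

    DedupSketch(float tolerance) { this.tolerance = tolerance; }

    // Returns true if the glyph was kept, false if it was suppressed as a duplicate.
    boolean place(char ch, float x, float y) {
        for (Placed p : seen) {
            if (p.ch == ch && Math.abs(p.x - x) < tolerance && Math.abs(p.y - y) < tolerance) {
                return false; // same character "nearish": treated as an overprint
            }
        }
        seen.add(new Placed(ch, x, y));
        out.append(ch);
        return true;
    }

    String text() { return out.toString(); }
}
```

With a tolerance of 3, a second 'B' drawn 0.3 units to the right of the first is dropped as intended, but so is the second 'l' of a narrow "ll" pair whose glyphs are only 2 units apart. A false positive of that kind would be consistent with the behaviour reported here, though the report does not identify the actual cause.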
[jira] [Updated] (PDFBOX-1155) setSuppressDuplicateOverlappingText sometimes removes characters that it shouldn't
[ https://issues.apache.org/jira/browse/PDFBOX-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1155: Affects Version/s: 2.0.0 1.8.7 setSuppressDuplicateOverlappingText sometimes removes characters that it shouldn't -- Key: PDFBOX-1155 URL: https://issues.apache.org/jira/browse/PDFBOX-1155 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.8.7, 2.0.0 Reporter: Michael McCandless Priority: Minor Fix For: 2.0.0 Attachments: 000527.pdf, dedup.diffs.txt The duplicate detection (in PDFTextStripper.java) checks whether the same character was placed nearish to where we are about to place another and de-dups it if so; this is to catch documents that rewind and overwrite in order to bold word(s). But in some cases I see it removing valid characters (that were not dups). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1143) PDFTextStripper doesn't process text annotations
[ https://issues.apache.org/jira/browse/PDFBOX-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1143: Affects Version/s: 1.7.0 PDFTextStripper doesn't process text annotations Key: PDFBOX-1143 URL: https://issues.apache.org/jira/browse/PDFBOX-1143 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.7.0 Reporter: Michael McCandless Priority: Minor Fix For: 2.0.0 Users are able to add annotations (comments) to a PDF, and PDFBox processes them correctly: you can retrieve them via PDPage.getAnnotations. But PDFTextStripper currently doesn't extract the text from annotations. I think it [optionally] should? I think we'd add a boolean (shouldProcessAnnotations?), and if enabled, we'd visit the annotations that have sub-type FreeText, and extract what text we can (Subject, TitlePopup, Contents, maybe RichContents?), associate the .getRectangle with the text to make a TextPosition, and then somehow associate that with the right article (so that annotations over a given article are rendered with it). Alternatively we just put all annotations into their own article? I'm not familiar enough with PDF text positioning nor PDFTextStripper to work out a real patch here... but I think this approach should work? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1143) PDFTextStripper doesn't process text annotations
[ https://issues.apache.org/jira/browse/PDFBOX-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1143: Fix Version/s: 2.0.0 PDFTextStripper doesn't process text annotations Key: PDFBOX-1143 URL: https://issues.apache.org/jira/browse/PDFBOX-1143 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.7.0 Reporter: Michael McCandless Priority: Minor Fix For: 2.0.0 Users are able to add annotations (comments) to a PDF, and PDFBox processes them correctly: you can retrieve them via PDPage.getAnnotations. But PDFTextStripper currently doesn't extract the text from annotations. I think it [optionally] should? I think we'd add a boolean (shouldProcessAnnotations?), and if enabled, we'd visit the annotations that have sub-type FreeText, and extract what text we can (Subject, TitlePopup, Contents, maybe RichContents?), associate the .getRectangle with the text to make a TextPosition, and then somehow associate that with the right article (so that annotations over a given article are rendered with it). Alternatively we just put all annotations into their own article? I'm not familiar enough with PDF text positioning nor PDFTextStripper to work out a real patch here... but I think this approach should work? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PDFBOX-1121) PDF Fields becomes non editable
[ https://issues.apache.org/jira/browse/PDFBOX-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-1121. - Resolution: Invalid This issue is so old, that it's almost certainly no longer valid. PDF Fields becomes non editable --- Key: PDFBOX-1121 URL: https://issues.apache.org/jira/browse/PDFBOX-1121 Project: PDFBox Issue Type: Bug Components: AcroForm Reporter: Rubesh MX Priority: Minor Labels: .NET, newbie Original Estimate: 8h Remaining Estimate: 8h Hi, I am new to PDFBox, so apologies if this is not a bug. I am using the .NET version of PDFBox. I have a PDF file with editable fields, and I am trying to read all the field values and write them where necessary. Reading is fine, but when I write values to the fields and save the document programmatically, the fields become non-editable. Could you please tell me what is going wrong? I even set the permission to canFillinForm, but to no avail; my PDF is not password protected. Also, when I open a different PDF file, I see this error during load: expected='startxref' actual='3281' org.pdfbox.io.PushBackInputStream@329c933 Could you please advise me on the above issues? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1109) Data corruption related to scratch file use
[ https://issues.apache.org/jira/browse/PDFBOX-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1109: Fix Version/s: 2.0.0 Data corruption related to scratch file use --- Key: PDFBOX-1109 URL: https://issues.apache.org/jira/browse/PDFBOX-1109 Project: PDFBox Issue Type: Bug Affects Versions: 1.8.7, 2.0.0 Reporter: Stefan Mücke Assignee: Andreas Lehmkühler Priority: Critical Fix For: 2.0.0 Attachments: COSDocument.java, PagedMultiRandomAccessFile.java, PagedMultiRandomAccessFileTest.java PDFBox uses a scratch file to reduce memory consumption. However, there is no mechanism that prevents two PDStreams from writing to the scratch file at the same time. When this happens, the resulting PDF contains garbage in some streams. This problem occurred several times to me (e.g. when writing to an image stream while constructing a page). Reproducing the bug *** One can easily reproduce the bug. Open file AddImageToPDF.java and move the following line: PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true); immediately after the line in which the PDPage object is fetched: PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 ); With this modification, one will still get a PDF file, but Acrobat Reader will report that the image could not be processed. BTW, the files AddImageToPDF.java and ImageToPDF.java are almost identical. One of them should be deleted. Bug-Fix *** The problem can be solved by using a scratch file that is divided into pages (e.g. of 4 KB). Each PDStream in the scratch file is then associated with a list of pages. This list grows as more data is written to the stream. The bug fix requires minimal changes to the existing code. The very nice RandomAccess interface made this very easy. 
Here is what needs to be changed:
- Add the attached PagedMultiRandomAccessFile.java to the I/O package.
- Change COSDocument.getScratchFile() to return a RandomAccess instance provided by PagedMultiRandomAccessFile:

    private PagedMultiRandomAccessFile scratchFile = null;
    [...]
    public COSDocument(File scratchDir) throws IOException {
        tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir);
        scratchFile = new PagedMultiRandomAccessFile(new RandomAccessFile(tmpFile, "rw"));
    }

    public COSDocument(RandomAccess file) {
        // scratchFile = file;
        throw new RuntimeException("Not yet implemented."); //$NON-NLS-1$
    }
    [...]
    /**
     * Returns a new scratch file.
     *
     * @return the newly created scratch file
     */
    public RandomAccess getScratchFile() {
        return scratchFile.getNewRandomAcess();
    }

One of the COSDocument constructors takes a RandomAccess file. This constructor is only called in a single location, namely in method PDFParser.parse(). I am not sure if the RandomAccess parameter provided here is really a scratch file. Someone will have to decide what to do with this one. The code has been thoroughly tested and has been used in the production of several books without any problems. The code is attached, together with a JUnit test that was used to debug it. I have added an Apache license header and adopted PDFBox's code style. Feel free to make any desired changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
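The paging idea proposed in this report (a shared scratch area divided into fixed-size pages, with each stream owning a growing list of pages) can be illustrated with a small in-memory sketch. This is not the attached PagedMultiRandomAccessFile: PagedScratchSketch and its methods are hypothetical, the "file" is a list of byte arrays rather than a RandomAccessFile, and the sketch is not thread-safe.

```java
import java.util.ArrayList;
import java.util.List;

final class PagedScratchSketch {
    static final int PAGE_SIZE = 4096; // 4 KB pages, as suggested in the report

    // Stand-in for the shared scratch file: page i lives at pages.get(i).
    private final List<byte[]> pages = new ArrayList<byte[]>();

    // One logical stream; it only ever touches the pages recorded in its own list.
    final class Stream {
        private final List<Integer> pageIndexes = new ArrayList<Integer>();
        private int length;

        void write(byte[] data) {
            for (byte b : data) {
                int offset = length % PAGE_SIZE;
                if (offset == 0) {
                    // Current page is full (or this is the first byte): claim a fresh page.
                    pages.add(new byte[PAGE_SIZE]);
                    pageIndexes.add(pages.size() - 1);
                }
                int lastPage = pageIndexes.get(pageIndexes.size() - 1);
                pages.get(lastPage)[offset] = b;
                length++;
            }
        }

        byte[] readAll() {
            byte[] result = new byte[length];
            for (int i = 0; i < length; i++) {
                result[i] = pages.get(pageIndexes.get(i / PAGE_SIZE))[i % PAGE_SIZE];
            }
            return result;
        }
    }

    Stream newStream() { return new Stream(); }

    int pageCount() { return pages.size(); }
}
```

Because each stream appends only to pages it has claimed, interleaved writes from two streams end up on disjoint pages and cannot overwrite each other, which is the property the unpaged scratch file lacks according to the report.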
[jira] [Updated] (PDFBOX-1109) Data corruption related to scratch file use
[ https://issues.apache.org/jira/browse/PDFBOX-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1109: Affects Version/s: 2.0.0 1.8.7 Data corruption related to scratch file use --- Key: PDFBOX-1109 URL: https://issues.apache.org/jira/browse/PDFBOX-1109 Project: PDFBox Issue Type: Bug Affects Versions: 1.8.7, 2.0.0 Reporter: Stefan Mücke Assignee: Andreas Lehmkühler Priority: Critical Attachments: COSDocument.java, PagedMultiRandomAccessFile.java, PagedMultiRandomAccessFileTest.java PDFBox uses a scratch file to reduce memory consumption. However, there is no mechanism that prevents two PDStreams from writing to the scratch file at the same time. When this happens, the resulting PDF contains garbage in some streams. This problem occurred several times to me (e.g. when writing to an image stream while constructing a page). Reproducing the bug *** One can easily reproduce the bug. Open file AddImageToPDF.java and move the following line: PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true); immediately after the line in which the PDPage object is fetched: PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 ); With this modification, one will still get a PDF file, but Acrobat Reader will report that the image could not be processed. BTW, the files AddImageToPDF.java and ImageToPDF.java are almost identical. One of them should be deleted. Bug-Fix *** The problem can be solved by using a scratch file that is divided into pages (e.g. of 4 KB). Each PDStream in the scratch file is then associated with a list of pages. This list grows as more data is written to the stream. The bug fix requires minimal changes to the existing code. The very nice RandomAccess interface made this very easy. 
Here is what needs to be changed:
- Add the attached PagedMultiRandomAccessFile.java to the I/O package.
- Change COSDocument.getScratchFile() to return a RandomAccess instance provided by PagedMultiRandomAccessFile:

    private PagedMultiRandomAccessFile scratchFile = null;
    [...]
    public COSDocument(File scratchDir) throws IOException {
        tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir);
        scratchFile = new PagedMultiRandomAccessFile(new RandomAccessFile(tmpFile, "rw"));
    }

    public COSDocument(RandomAccess file) {
        // scratchFile = file;
        throw new RuntimeException("Not yet implemented."); //$NON-NLS-1$
    }
    [...]
    /**
     * Returns a new scratch file.
     *
     * @return the newly created scratch file
     */
    public RandomAccess getScratchFile() {
        return scratchFile.getNewRandomAcess();
    }

One of the COSDocument constructors takes a RandomAccess file. This constructor is only called in a single location, namely in method PDFParser.parse(). I am not sure if the RandomAccess parameter provided here is really a scratch file. Someone will have to decide what to do with this one. The code has been thoroughly tested and has been used in the production of several books without any problems. The code is attached, together with a JUnit test that was used to debug it. I have added an Apache license header and adopted PDFBox's code style. Feel free to make any desired changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PDFBOX-1086) Error when decoding CCITT compressed data that contains EOLs, fill bits etc.
[ https://issues.apache.org/jira/browse/PDFBOX-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-1086. - Resolution: Fixed Fix Version/s: 2.0.0 This is good enough to count as fixed. Error when decoding CCITT compressed data that contains EOLs, fill bits etc. Key: PDFBOX-1086 URL: https://issues.apache.org/jira/browse/PDFBOX-1086 Project: PDFBox Issue Type: Bug Components: Parsing Reporter: Jeremias Maerki Assignee: Jeremias Maerki Labels: CCITTFaxDecode, ccitt Fix For: 2.0.0 The TIFFFaxDecoder class (originally coming from JAI via XML Graphics Commons) does not handle cases like EOLs between lines and in front. But the PDF CCITTFaxDecode filter needs to allow many different variants of the encoding. Apparently, TIFF has a relatively restricted way of encoding CCITT data, so TIFFFaxDecoder was not written to be as flexible as we need it. Ideally, PDFBox should handle anything that gets thrown at it. It appears that it would be rather difficult to retrofit TIFFFaxDecoder with the necessary flexibility. So, new decoders for T.4 and T.6 should probably be written. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1000) Conforming parser
[ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1000: Fix Version/s: 2.0.0 Conforming parser - Key: PDFBOX-1000 URL: https://issues.apache.org/jira/browse/PDFBOX-1000 Project: PDFBox Issue Type: New Feature Components: Parsing Affects Versions: 1.6.0 Reporter: Adam Nichols Assignee: Adam Nichols Fix For: 1.7.0, 2.0.0 Attachments: COSUnread.java, ConformingPDDocument.java, ConformingPDFParser.java, ConformingPDFParserTest.java, PDFLexer.java, PDFLexer.java, PDFStreamConstants.java, PDFStreamConstants.java, XrefEntry.java, conforming-parser.patch, gdb-refcard.pdf A conforming parser will start at the end of the file and read backward until it has read the EOF marker, the xref location, and trailer[1]. Once this is read, it will read in the xref table so it can locate other objects and revisions. This also allows skipping objects which have been rendered obsolete (per the xref table)[2]. It also allows the minimum amount of information to be read when the file is loaded, and then subsequent information will be loaded if and when it is requested. This is all laid out in the official PDF specification, ISO 32000-1:2008. Existing code will be re-used where possible, but this will require new classes in order to accommodate the lazy reading which is a very different paradigm from the existing parser. Using separate classes will also eliminate the possibility of regression bugs from making their way into the PDDocument or BaseParser classes. Changes to existing classes will be kept to a minimum in order to prevent regression bugs. [1] Section 7.5.5 Conforming readers should read a PDF file from its end [2] Section 7.5.4 the entire file need not be read to locate any particular object -- This message was sent by Atlassian JIRA (v6.3.4#6332)
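The first step described in this issue (start at the end of the file, find the EOF marker and the xref location) can be sketched in a few lines. This is not ConformingPDFParser: TailReaderSketch is hypothetical, the whole file is assumed to be in memory as a byte array, and the 1024-byte search window is an assumption rather than something stated in the issue.

```java
import java.nio.charset.StandardCharsets;

final class TailReaderSketch {
    private static final int TAIL_WINDOW = 1024; // assumed search window at the end of the file

    // Returns the byte offset that follows the last "startxref" keyword,
    // or -1 if no well-formed "startxref <number> %%EOF" tail is found.
    static long findXrefOffset(byte[] pdf) {
        int tailStart = Math.max(0, pdf.length - TAIL_WINDOW);
        // ISO-8859-1 maps every byte to one char, so string indexes equal byte offsets.
        String tail = new String(pdf, tailStart, pdf.length - tailStart, StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        int keyword = tail.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            return -1;
        }
        String number = tail.substring(keyword + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

Only the tail of the file is examined here; a lazy parser in the spirit of this issue would then seek to the returned offset, read the xref table and trailer, and load individual objects on request.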
[jira] [Updated] (PDFBOX-1000) Conforming parser
[ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1000: Affects Version/s: 1.6.0 Conforming parser - Key: PDFBOX-1000 URL: https://issues.apache.org/jira/browse/PDFBOX-1000 Project: PDFBox Issue Type: New Feature Components: Parsing Affects Versions: 1.6.0 Reporter: Adam Nichols Assignee: Adam Nichols Fix For: 1.7.0, 2.0.0 Attachments: COSUnread.java, ConformingPDDocument.java, ConformingPDFParser.java, ConformingPDFParserTest.java, PDFLexer.java, PDFLexer.java, PDFStreamConstants.java, PDFStreamConstants.java, XrefEntry.java, conforming-parser.patch, gdb-refcard.pdf A conforming parser will start at the end of the file and read backward until it has read the EOF marker, the xref location, and trailer[1]. Once this is read, it will read in the xref table so it can locate other objects and revisions. This also allows skipping objects which have been rendered obsolete (per the xref table)[2]. It also allows the minimum amount of information to be read when the file is loaded, and then subsequent information will be loaded if and when it is requested. This is all laid out in the official PDF specification, ISO 32000-1:2008. Existing code will be re-used where possible, but this will require new classes in order to accommodate the lazy reading which is a very different paradigm from the existing parser. Using separate classes will also eliminate the possibility of regression bugs from making their way into the PDDocument or BaseParser classes. Changes to existing classes will be kept to a minimum in order to prevent regression bugs. [1] Section 7.5.5 Conforming readers should read a PDF file from its end [2] Section 7.5.4 the entire file need not be read to locate any particular object -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-1000) Conforming parser
[ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson updated PDFBOX-1000: Fix Version/s: 1.7.0 Conforming parser - Key: PDFBOX-1000 URL: https://issues.apache.org/jira/browse/PDFBOX-1000 Project: PDFBox Issue Type: New Feature Components: Parsing Affects Versions: 1.6.0 Reporter: Adam Nichols Assignee: Adam Nichols Fix For: 1.7.0, 2.0.0 Attachments: COSUnread.java, ConformingPDDocument.java, ConformingPDFParser.java, ConformingPDFParserTest.java, PDFLexer.java, PDFLexer.java, PDFStreamConstants.java, PDFStreamConstants.java, XrefEntry.java, conforming-parser.patch, gdb-refcard.pdf A conforming parser will start at the end of the file and read backward until it has read the EOF marker, the xref location, and trailer[1]. Once this is read, it will read in the xref table so it can locate other objects and revisions. This also allows skipping objects which have been rendered obsolete (per the xref table)[2]. It also allows the minimum amount of information to be read when the file is loaded, and then subsequent information will be loaded if and when it is requested. This is all laid out in the official PDF specification, ISO 32000-1:2008. Existing code will be re-used where possible, but this will require new classes in order to accommodate the lazy reading which is a very different paradigm from the existing parser. Using separate classes will also eliminate the possibility of regression bugs from making their way into the PDDocument or BaseParser classes. Changes to existing classes will be kept to a minimum in order to prevent regression bugs. [1] Section 7.5.5 Conforming readers should read a PDF file from its end [2] Section 7.5.4 the entire file need not be read to locate any particular object -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-1000) Conforming parser
[ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167409#comment-14167409 ] John Hewson commented on PDFBOX-1000: - This issue has been open for 3 years, despite ConformingPDFParser being introduced in PDFBox 1.7.0. Can we close this issue now? Any further changes should be new issues. Conforming parser - Key: PDFBOX-1000 URL: https://issues.apache.org/jira/browse/PDFBOX-1000 Project: PDFBox Issue Type: New Feature Components: Parsing Affects Versions: 1.6.0 Reporter: Adam Nichols Assignee: Adam Nichols Fix For: 1.7.0, 2.0.0 Attachments: COSUnread.java, ConformingPDDocument.java, ConformingPDFParser.java, ConformingPDFParserTest.java, PDFLexer.java, PDFLexer.java, PDFStreamConstants.java, PDFStreamConstants.java, XrefEntry.java, conforming-parser.patch, gdb-refcard.pdf A conforming parser will start at the end of the file and read backward until it has read the EOF marker, the xref location, and trailer[1]. Once this is read, it will read in the xref table so it can locate other objects and revisions. This also allows skipping objects which have been rendered obsolete (per the xref table)[2]. It also allows the minimum amount of information to be read when the file is loaded, and then subsequent information will be loaded if and when it is requested. This is all laid out in the official PDF specification, ISO 32000-1:2008. Existing code will be re-used where possible, but this will require new classes in order to accommodate the lazy reading which is a very different paradigm from the existing parser. Using separate classes will also eliminate the possibility of regression bugs from making their way into the PDDocument or BaseParser classes. Changes to existing classes will be kept to a minimum in order to prevent regression bugs. 
[1] Section 7.5.5 Conforming readers should read a PDF file from its end [2] Section 7.5.4 the entire file need not be read to locate any particular object -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-830) Setting of logical page numbers
[ https://issues.apache.org/jira/browse/PDFBOX-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-830:

Affects Version/s: 1.3.1

Setting of logical page numbers

Key: PDFBOX-830
URL: https://issues.apache.org/jira/browse/PDFBOX-830
Project: PDFBox
Issue Type: New Feature
Components: PDModel
Affects Versions: 1.3.1
Environment: JDK 1.6.0_21, PDFBox 1.3.0-snapshot
Reporter: MH

When viewing PDFs processed with PDFBox, Acrobat Reader / Foxit Reader show logical page numbers. I guess PDFBox is somehow generating such logical page numbers. However, the current automatic logical page numbering is not always as expected or wished, so an API to change or set these logical page numbers would be useful.
[jira] [Updated] (PDFBOX-830) Setting of logical page numbers
[ https://issues.apache.org/jira/browse/PDFBOX-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-830:

Fix Version/s: 2.0.0

Setting of logical page numbers

Key: PDFBOX-830
URL: https://issues.apache.org/jira/browse/PDFBOX-830
Project: PDFBox
Issue Type: New Feature
Components: PDModel
Affects Versions: 1.3.1
Environment: JDK 1.6.0_21, PDFBox 1.3.0-snapshot
Reporter: MH
Fix For: 2.0.0

When viewing PDFs processed with PDFBox, Acrobat Reader / Foxit Reader show logical page numbers. I guess PDFBox is somehow generating such logical page numbers. However, the current automatic logical page numbering is not always as expected or wished, so an API to change or set these logical page numbers would be useful.
[jira] [Updated] (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files
[ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-800:

Affects Version/s: 1.7.0

Wrong text extract from vertical textboxes in pdf files

Key: PDFBOX-800
URL: https://issues.apache.org/jira/browse/PDFBOX-800
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.7.0
Environment: Windows 7, VS 2010 C#, Tika Library
Reporter: Sandor Dj
Fix For: 2.0.0
Attachments: problemdoc.doc, problemdoc.pdf

Vertical textboxes in PDF files are not extracted correctly (using the Tika library in C#). For example, if there is a vertical textbox "hello" in a PDF file (!WITHOUT! line breaks):

H E L L O

the parser returns 5 strings, each with a single letter, even though there is NO line break after every letter. Is there an option to avoid this problem?
[jira] [Closed] (PDFBOX-824) Support for PDF/A (long-term archiving)
[ https://issues.apache.org/jira/browse/PDFBOX-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-824.

Resolution: Won't Fix

There's nothing to stop you generating PDF/A with PDFBox. Modifying all the APIs to stop you doing anything invalid in PDF/A is not viable.

Support for PDF/A (long-term archiving)

Key: PDFBOX-824
URL: https://issues.apache.org/jira/browse/PDFBOX-824
Project: PDFBox
Issue Type: New Feature
Components: PDModel
Reporter: MH

Apache FOP already supports PDF/A by setting a renderer option pdf-a-mode, PDF/A-1b. It would be a useful feature for PDFBox to also support this (and other PDF/A derivatives).
[jira] [Updated] (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files
[ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-800:

Fix Version/s: 2.0.0

Wrong text extract from vertical textboxes in pdf files

Key: PDFBOX-800
URL: https://issues.apache.org/jira/browse/PDFBOX-800
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.7.0
Environment: Windows 7, VS 2010 C#, Tika Library
Reporter: Sandor Dj
Fix For: 2.0.0
Attachments: problemdoc.doc, problemdoc.pdf

Vertical textboxes in PDF files are not extracted correctly (using the Tika library in C#). For example, if there is a vertical textbox "hello" in a PDF file (!WITHOUT! line breaks):

H E L L O

the parser returns 5 strings, each with a single letter, even though there is NO line break after every letter. Is there an option to avoid this problem?
[jira] [Resolved] (PDFBOX-720) Inconsistency in parsing PDFs between Windows and Linux
[ https://issues.apache.org/jira/browse/PDFBOX-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson resolved PDFBOX-720.

Resolution: Not a Problem

Having heard of no issues for 3 years, I presume this is no longer a problem.

Inconsistency in parsing PDFs between Windows and Linux

Key: PDFBOX-720
URL: https://issues.apache.org/jira/browse/PDFBOX-720
Project: PDFBox
Issue Type: Bug
Components: Parsing
Environment: Windows Vista 32-bit, Sun JDK 1.5.0_06, PDFBox HEAD tag (revision 941073) vs. Red Hat Linux, 2.6.9-67.ELsmp kernel, Java 1.5.0_06, PDFBox HEAD tag (revision 941073)
Reporter: Adam Nichols
Attachments: 238_Page_Report.pdf

Run this same code using the same PDF and you'll get different results on Linux than on Windows. Regardless of which one you consider correct, it should be consistent.

    doc = PDDocument.load(inputFile);
    PDDocumentOutline outline = doc.getDocumentCatalog().getDocumentOutline();
    if(outline == null)
        System.out.println("Document outline was null");
    else
        System.out.println("Document outline was not null");

Some interesting notes about this PDF: it seems that Acrobat Distiller 8.1.0 basically just concatenated two PDFs into one. There are two trailers, and they both refer to object 1600 0 as the root. 1600 0 appears multiple times: one time it doesn't have Outlines in the dictionary, the other time it has Outlines 1667 0. Windows picks up the latter and shows the outline correctly. Linux picks up the former and thus returns null for the outline.

I tried debugging through PDFParser and BaseParser, but I'm not really sure how that code works and I quickly got lost.
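The ambiguity described above (object 1600 0 defined twice) can be resolved deterministically if xref sections are applied oldest first, so that a later definition always replaces an earlier one. The sketch below illustrates only that rule. It is not what PDFParser or BaseParser did on either platform, and the class name XrefMerge, the string keys, and the offsets in the usage note are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration (not PDFBox code): later xref sections override earlier ones. */
final class XrefMerge {
    private XrefMerge() {}

    /**
     * Merges xref sections given in file order, oldest first.
     * Keys are "objectNumber generation" strings, values are byte offsets.
     * When the same key appears more than once, the newest section wins.
     */
    static Map<String, Long> merge(List<Map<String, Long>> sectionsOldestFirst) {
        Map<String, Long> resolved = new LinkedHashMap<>();
        for (Map<String, Long> section : sectionsOldestFirst) {
            // putAll replaces any existing entry, so newer offsets overwrite older ones.
            resolved.putAll(section);
        }
        return resolved;
    }
}
```

With an older section mapping "1600 0" to offset 100 and a newer one mapping it to offset 5000, merge returns 5000 for "1600 0" on every platform, because the result depends only on the order of the input list.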
[jira] [Closed] (PDFBOX-577) TextPosition should expose its bounding box
[ https://issues.apache.org/jira/browse/PDFBOX-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-577.

Resolution: Invalid

The Ascent and Descent values in the PDF dictionary are **not** used when computing glyph positions. In fact, it's common for these values to be missing or invalid. In any case, the BBox value is actually what is wanted, but that suffers from the same problem.

If somebody wants to tackle this problem in the future, it can be fairly easily done in 2.0 with the new APIs provided by PDFont, which can extract the BBox from the embedded or substituted font, or even compute exact bounds from the glyph outlines. A new issue or patch addressing this is welcome.

TextPosition should expose its bounding box

Key: PDFBOX-577
URL: https://issues.apache.org/jira/browse/PDFBOX-577
Project: PDFBox
Issue Type: Improvement
Components: PDModel
Reporter: Villu Ruusmann
Attachments: 0001-PDFont.java-Add-methods-to-retreive-the-Ascent-and-D.patch, AFM-getHeight.png, AFM-getUpperRightY.png, textposition-randombg.zip

It does not seem to be possible to calculate the bounding box of a TextPosition. IIUC, TextPosition#getY is the baseline of the text and TextPosition#getHeight is the absolute height of the text. When I subtract the latter from the former I get a top line, but this is only correct if the text does not contain descender characters.

Below is a screenshot (AFM-getHeight.png) which shows the bounding boxes of TextPositions calculated as {#getX(), #getY() - #getHeight, #getWidth, #getHeight} painted in random colors. For example, the bounding boxes of parentheses are severely misplaced, which makes the line-by-line text extraction impossible.

Right now I've solved the problem by tweaking AFM FontMetrics code so that it returns BoundingBox#getUpperRightY instead of BoundingBox#getHeight when queried via PDSimpleFont#getFontHeight(byte[], int, int). Another screenshot (AFM-getUpperRightY.png) shows how this restores the previously broken text extraction ability.

It seems like a good idea to rework TextPosition so that it would be aware of its bounding box:
*) Replace methods PDSimpleFont#getFontWidth(byte[], int, int) and PDSimpleFont#getFontHeight(byte[], int, int) with a single method PDSimpleFont#getFontBoundingBox(byte[], int, int)
*) Replace the constructor TextPosition(Matrix, Matrix) with TextPosition(Matrix, BoundingBox)
*) Add new methods TextPosition#getBoundingBox, TextPosition#getBoundingBoxDir.

This shouldn't affect existing application clients, because TextPosition#getY and TextPosition#getHeight remain in place.
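The difference between the two box calculations discussed in this issue can be sketched without any PDFBox types. The class GlyphBox and its methods are hypothetical. The sketch assumes a coordinate system in which y grows downward (so the top of a box has a smaller y than the baseline) and treats ascent and descent as positive distances above and below the baseline.

```java
/** Hypothetical value type (not a PDFBox class): an axis-aligned box with y growing downward. */
final class GlyphBox {
    final double x;
    final double y;      // top edge
    final double width;
    final double height;

    GlyphBox(double x, double y, double width, double height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** The calculation from the report: {x, baseline - height, width, height}. Ignores descenders. */
    static GlyphBox naive(double x, double baselineY, double width, double height) {
        return new GlyphBox(x, baselineY - height, width, height);
    }

    /** Box spanning from the ascent above the baseline to the descent below it. */
    static GlyphBox fromBaseline(double x, double baselineY, double width,
                                 double ascent, double descent) {
        return new GlyphBox(x, baselineY - ascent, width, ascent + descent);
    }

    /** Bottom edge of the box. */
    double bottom() {
        return y + height;
    }
}
```

For a baseline at y = 100 with ascent 8 and descent 2, fromBaseline yields a box from y = 92 to y = 102, while naive with height 8 ends exactly at the baseline and cuts off anything drawn below it. As the closing comment notes, the ascent and descent values stored in the PDF dictionary are often missing or invalid, so real code would need to take them from the font program instead.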
[jira] [Updated] (PDFBOX-566) PDChoiceField does not handle some valid PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-566:

Summary: PDChoiceField does not handle some valid PDFs (was: PDChoiceField does not handle some valid PDF's )

PDChoiceField does not handle some valid PDFs

Key: PDFBOX-566
URL: https://issues.apache.org/jira/browse/PDFBOX-566
Project: PDFBox
Issue Type: Bug
Components: AcroForm
Reporter: Yonas Jongkind
Attachments: PDChoiceField.java, PDChoiceField.java.diff

The problem is that the value is sometimes an array and sometimes a singleton. The attached fix allows it to work smoothly for either form and for mixed cases. It may also be more efficient. See attached diff and corrected source file.
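The usual way to cope with a value that may be either an array or a singleton is to normalize it to a list before use. The sketch below shows that idea with plain Java objects. It is not the attached PDChoiceField fix, and the class name ChoiceValues is hypothetical. In PDFBox the two cases would roughly correspond to a COSArray and a single COS value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not the attached patch): treats a singleton and an array alike. */
final class ChoiceValues {
    private ChoiceValues() {}

    /**
     * Returns the entries of {@code value} as a list of strings.
     * A null value gives an empty list, a List gives one string per element,
     * and any other object gives a one-element list.
     */
    static List<String> normalize(Object value) {
        if (value == null) {
            return Collections.emptyList();
        }
        if (value instanceof List) {
            List<String> out = new ArrayList<>();
            for (Object item : (List<?>) value) {
                out.add(String.valueOf(item));
            }
            return out;
        }
        return Collections.singletonList(String.valueOf(value));
    }
}
```

Callers can then iterate over ChoiceValues.normalize(value) without checking which form the document used.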
[jira] [Updated] (PDFBOX-448) Columns in text not extracted separately
[ https://issues.apache.org/jira/browse/PDFBOX-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-448:

Summary: Columns in text not extracted separately (was: Columns in text not extracted separately. )

Columns in text not extracted separately

Key: PDFBOX-448
URL: https://issues.apache.org/jira/browse/PDFBOX-448
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Reporter: Brian Carrier
Attachments: WBPaper3120.pdf

The paper that is attached to PDFBOX-80 has two columns of text, but the extracted text is not separated by column. Instead it combines the text in each column on each line. PDFTextStripper has a notion of columns and articles / beads, but they are not being used with this file.
[jira] [Closed] (PDFBOX-566) PDChoiceField does not handle some valid PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-566.

Resolution: Invalid

As far as I can tell this no longer applies, at least not in 2.0.

PDChoiceField does not handle some valid PDFs

Key: PDFBOX-566
URL: https://issues.apache.org/jira/browse/PDFBOX-566
Project: PDFBox
Issue Type: Bug
Components: AcroForm
Reporter: Yonas Jongkind
Attachments: PDChoiceField.java, PDChoiceField.java.diff

The problem is that the value is sometimes an array and sometimes a singleton. The attached fix allows it to work smoothly for either form and for mixed cases. It may also be more efficient. See attached diff and corrected source file.
[jira] [Updated] (PDFBOX-448) Columns in text not extracted separately
[ https://issues.apache.org/jira/browse/PDFBOX-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-448:

Fix Version/s: 2.0.0

Columns in text not extracted separately

Key: PDFBOX-448
URL: https://issues.apache.org/jira/browse/PDFBOX-448
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.7, 2.0.0
Reporter: Brian Carrier
Fix For: 2.0.0
Attachments: WBPaper3120.pdf

The paper that is attached to PDFBOX-80 has two columns of text, but the extracted text is not separated by column. Instead it combines the text in each column on each line. PDFTextStripper has a notion of columns and articles / beads, but they are not being used with this file.