[jira] [Updated] (PDFBOX-2441) Improve XRef self healing mechanism when more than one xref table

2014-10-21 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2441:

Summary: Improve XRef self healing mechanism when more than one xref table  
(was: mprove XRef self healing mechanism when more than one xref table)

 Improve XRef self healing mechanism when more than one xref table
 -

 Key: PDFBOX-2441
 URL: https://issues.apache.org/jira/browse/PDFBOX-2441
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr

 This is a follow-up issue to PDFBOX-2250:
 {quote}
 the xref repair algorithm simply searches for the nearest offset, which may 
 fail if more than one xref table is present
 ...
 Once we have a sample pdf which can't be parsed with the simple algorithm, we 
 can open a new issue.
 {quote}
 And here's one:
 {code}
 Exception in thread "main" java.io.IOException: Error: Expected a long type 
 at offset 1180, instead got '50/Filter/FlateDecode/DecodeParms'
     at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1690)
 {code}
 That file does have more than one xref table.
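
 To make the failure mode concrete, here is a minimal sketch of a
 nearest-offset search. It is an illustration only, not the PDFBox code: the
 class and method names are hypothetical, and it assumes the candidate
 offsets of the "xref" keywords have already been collected by some other
 scan.

```java
import java.util.List;

/** Hypothetical illustration of a nearest-offset repair; not the PDFBox implementation. */
final class NearestOffsetSketch {

    /** Returns the candidate offset closest to the declared one, or -1 if there are none. */
    static long nearest(long declared, List<Long> candidates) {
        long best = -1;
        long bestDistance = Long.MAX_VALUE;
        for (long candidate : candidates) {
            long distance = Math.abs(candidate - declared);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }
}
```

 With a single xref table the only candidate always wins, so the heuristic is
 harmless. With two tables, a damaged pointer that was meant for one table
 can lie closer to the other, and the parser then reads the wrong data.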



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PDFBOX-2441) mprove XRef self healing mechanism when more than one xref table

2014-10-21 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-2441:
---

 Summary: mprove XRef self healing mechanism when more than one 
xref table
 Key: PDFBOX-2441
 URL: https://issues.apache.org/jira/browse/PDFBOX-2441
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr







[jira] [Updated] (PDFBOX-2441) Improve XRef self healing mechanism when more than one xref table

2014-10-21 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2441:

Attachment: 260105.pdf

 Improve XRef self healing mechanism when more than one xref table
 -

 Key: PDFBOX-2441
 URL: https://issues.apache.org/jira/browse/PDFBOX-2441
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
 Attachments: 260105.pdf







[jira] [Assigned] (PDFBOX-2441) Improve XRef self healing mechanism when more than one xref table

2014-10-21 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler reassigned PDFBOX-2441:
--

Assignee: Andreas Lehmkühler

 Improve XRef self healing mechanism when more than one xref table
 -

 Key: PDFBOX-2441
 URL: https://issues.apache.org/jira/browse/PDFBOX-2441
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Andreas Lehmkühler
 Attachments: 260105.pdf







[GitHub] pdfbox pull request: Reapplied changes by patric42

2014-10-21 Thread anti43
GitHub user anti43 opened a pull request:

https://github.com/apache/pdfbox/pull/9

Reapplied changes by patric42



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/anti43/pdfbox apache-trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/pdfbox/pull/9.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9


commit 8e6df3802bf045d1ef268ac2f14fbdb4324a9517
Author: Patric Bechtel p.bech...@oashi.com
Date:   2014-04-22T09:06:15Z

use java-image-scaling for high quality scaling of images.

commit 1345f028429b29a48fc440db30485f7d58d62807
Author: Tilman Hausherr til...@apache.org
Date:   2014-04-30T16:07:57Z

PDFBOX-2034: refactoring per DRY

git-svn-id: https://svn.apache.org/repos/asf/pdfbox/trunk@1591375 
13f79535-47bb-0310-9956-ffa450edef68

commit 5f03db1caeb0c108628e851d698ab71d59327db4
Author: Patric Bechtel p.bech...@oashi.com
Date:   2014-07-18T08:58:04Z

re-enabled the hq-scaling again.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] pdfbox pull request: Reapplied changes by patric42

2014-10-21 Thread anti43
Github user anti43 closed the pull request at:

https://github.com/apache/pdfbox/pull/9




[jira] [Commented] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read

2014-10-21 Thread Ralf Hauser (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178121#comment-14178121
 ] 

Ralf Hauser commented on PDFBOX-2403:
-

today, I see
java.lang.NullPointerException
    at org.apache.fontbox.cff.CharStringRenderer.rrcurveTo(CharStringRenderer.java:433)
    at org.apache.fontbox.cff.CharStringRenderer.rrCurveTo(CharStringRenderer.java:424)
    at org.apache.fontbox.cff.CharStringRenderer.handleCommandType2(CharStringRenderer.java:154)
    at org.apache.fontbox.cff.CharStringRenderer.handleCommand(CharStringRenderer.java:90)
    at org.apache.fontbox.cff.CharStringHandler.handleSequence(CharStringHandler.java:53)
    at org.apache.fontbox.cff.CharStringRenderer.render(CharStringRenderer.java:75)
    at org.apache.fontbox.cff.CFFFontROS.getWidth(CFFFontROS.java:173)
    at org.apache.pdfbox.preflight.font.container.CIDType0Container.getFontProgramWidth(CIDType0Container.java:83)
    at org.apache.pdfbox.preflight.font.container.Type0Container.getFontProgramWidth(Type0Container.java:46)
    at org.apache.pdfbox.preflight.font.container.FontContainer.checkGlyphWith(FontContainer.java:115)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.validText(ContentStreamWrapper.java:373)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.validStringArray(ContentStreamWrapper.java:297)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.validStringArray(ContentStreamWrapper.java:293)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.checkShowTextOperators(ContentStreamWrapper.java:209)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.processOperator(ContentStreamWrapper.java:181)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:258)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:205)
    at org.apache.pdfbox.preflight.content.ContentStreamWrapper.validPageContentStream(ContentStreamWrapper.java:76)
    at org.apache.pdfbox.preflight.process.reflect.SinglePageValidationProcess.validateContent(SinglePageValidationProcess.java:179)
    at org.apache.pdfbox.preflight.process.reflect.SinglePageValidationProcess.validate(SinglePageValidationProcess.java:87)
    at org.apache.pdfbox.preflight.utils.ContextHelper.callValidation(ContextHelper.java:73)
    at org.apache.pdfbox.preflight.utils.ContextHelper.validateElement(ContextHelper.java:52)
    at org.apache.pdfbox.preflight.process.PageTreeValidationProcess.validatePage(PageTreeValidationProcess.java:58)
    at org.apache.pdfbox.preflight.process.PageTreeValidationProcess.validate(PageTreeValidationProcess.java:47)
    at org.apache.pdfbox.preflight.utils.ContextHelper.callValidation(ContextHelper.java:73)
    at org.apache.pdfbox.preflight.utils.ContextHelper.validateElement(ContextHelper.java:88)
    at org.apache.pdfbox.preflight.PreflightDocument.validate(PreflightDocument.java:169)

 false negative? Font damaged, The FontFile can't be read
 --

 Key: PDFBOX-2403
 URL: https://issues.apache.org/jira/browse/PDFBOX-2403
 Project: PDFBox
  Issue Type: Bug
  Components: Preflight
Affects Versions: 2.0.0
 Environment: deb7, java 7
Reporter: Ralf Hauser
 Fix For: 2.0.0

 Attachments: Konformität mit PDF_A-1b prüfen.pdf, 
 Problems_pdfa1b.pdf_07.10.2014_001.pdf, patch2403JavaDoc.txt, 
 patchBetterErrorMessages.txt, patchPDFBOX-2403.txt, 
 patchPDFBOX-2403Type1.txt, pdfA_Validation_Report.eml, pdfa1b.pdf, 
 pdfa1b_againstPDFA1a_report, pdfa1b_againstPDFA1b_report, 
 pdfa1b_summary_0001.pdf, report, reportforfile_pdfa1b, validation_report.xml


 - 1: 3.2.1 : Font damaged, The FontFile can't be read
  - 2: 3.2.1 : Font damaged, The FontFile can't be read
  - 3: 3.1.6 : Invalid Font definition, Width of the character 48 in the 
 font program SURPPV+HeiseiMaruGoStd-W8-Identity-H is inconsistent with the 
 width in the PDF dictionary.
  - 4: 3.1.6 : Invalid Font definition, Width of the character 36 in the 
 font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the 
 width in the PDF dictionary.
  - 5: 3.3.1 : Glyph error, The character 74 in the font program 
 OIZFRF+KozMinProVI-Regular-Identity-H is missing from the Charater Encoding.
  - 6: 3.1.6 : Invalid Font definition, Width of the character 80 in the 
 font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the 
 width in the PDF dictionary.
  - 7: 3.1.6 : Invalid Font definition, Width of the character 420 in the 
 font program 

[jira] [Issue Comment Deleted] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read

2014-10-21 Thread Ralf Hauser (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ralf Hauser updated PDFBOX-2403:

Comment: was deleted

(was: today, I see java.lang.NullPointerException at
org.apache.fontbox.cff.CharStringRenderer.rrcurveTo(CharStringRenderer.java:433),
with the same stack trace as in the preceding comment)

 false negative? Font damaged, The FontFile can't be read
 --

 Key: PDFBOX-2403
 URL: https://issues.apache.org/jira/browse/PDFBOX-2403
 Project: PDFBox
  Issue Type: Bug
  Components: Preflight
Affects Versions: 2.0.0
 Environment: deb7, java 7
Reporter: Ralf Hauser
 Fix For: 2.0.0

 Attachments: Konformität mit PDF_A-1b prüfen.pdf, 
 Problems_pdfa1b.pdf_07.10.2014_001.pdf, patch2403JavaDoc.txt, 
 patchBetterErrorMessages.txt, patchPDFBOX-2403.txt, 
 patchPDFBOX-2403Type1.txt, pdfA_Validation_Report.eml, pdfa1b.pdf, 
 pdfa1b_againstPDFA1a_report, pdfa1b_againstPDFA1b_report, 
 pdfa1b_summary_0001.pdf, report, reportforfile_pdfa1b, validation_report.xml


 - 1: 3.2.1 : Font damaged, The FontFile can't be read
  - 2: 3.2.1 : Font damaged, The FontFile can't be read
  - 3: 3.1.6 : Invalid Font definition, Width of the character 48 in the 
 font program SURPPV+HeiseiMaruGoStd-W8-Identity-H is inconsistent with the 
 width in the PDF dictionary.
  - 4: 3.1.6 : Invalid Font definition, Width of the character 36 in the 
 font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the 
 width in the PDF dictionary.
  - 5: 3.3.1 : Glyph error, The character 74 in the font program 
 OIZFRF+KozMinProVI-Regular-Identity-H is missing from the Charater Encoding.
  - 6: 3.1.6 : Invalid Font definition, Width of the character 80 in the 
 font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the 
 width in the PDF dictionary.
  - 7: 3.1.6 : Invalid Font definition, Width of the character 420 in the 
 font program RRATCX+MathematicalPiLTStd-Identity-H is 

[jira] [Created] (PDFBOX-2442) false negative? 3.1.6 : Invalid Font definition, Width (633.0) of the character 60 in the font program BNGLNN+LucidaMath-Symbol is inconsistent with the width (0.0)

2014-10-21 Thread Ralf Hauser (JIRA)
Ralf Hauser created PDFBOX-2442:
---

 Summary: false negative? 3.1.6 : Invalid Font definition, Width 
(633.0) of the character 60 in the font program BNGLNN+LucidaMath-Symbol is 
inconsistent with the width (0.0) in the PDF dictionary.
 Key: PDFBOX-2442
 URL: https://issues.apache.org/jira/browse/PDFBOX-2442
 Project: PDFBox
  Issue Type: Bug
  Components: Preflight
Affects Versions: 2.0.0
 Environment: java7 deb7
Reporter: Ralf Hauser


org.apache.pdfbox.preflight.font.util.GlyphException: Width (633.0) of the 
character 60 in the font program BNGLNN+LucidaMath-Symbol is inconsistent 
with the width (0.0) in the PDF dictionary.
    at org.apache.pdfbox.preflight.font.container.FontContainer.checkWidthsConsistency(FontContainer.java:181)
    at org.apache.pdfbox.preflight.font.container.FontContainer.checkGlyphWidth(FontContainer.java:130)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.validText(PreflightContentStream.java:342)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.validStringArray(PreflightContentStream.java:276)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.validStringArray(PreflightContentStream.java:272)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.checkShowTextOperators(PreflightContentStream.java:190)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.processOperator(PreflightContentStream.java:155)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:226)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:196)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:152)
    at org.apache.pdfbox.preflight.content.PreflightContentStream.validPageContentStream(PreflightContentStream.java:76)
    at org.apache.pdfbox.preflight.process.reflect.SinglePageValidationProcess.validateContent(SinglePageValidationProcess.java:184)
    at org.apache.pdfbox.preflight.process.reflect.SinglePageValidationProcess.validate(SinglePageValidationProcess.java:87)
    at org.apache.pdfbox.preflight.utils.ContextHelper.callValidation(ContextHelper.java:73)
    at org.apache.pdfbox.preflight.utils.ContextHelper.validateElement(ContextHelper.java:52)
    at org.apache.pdfbox.preflight.process.PageTreeValidationProcess.validatePage(PageTreeValidationProcess.java:56)
    at org.apache.pdfbox.preflight.process.PageTreeValidationProcess.validate(PageTreeValidationProcess.java:45)
    at org.apache.pdfbox.preflight.utils.ContextHelper.callValidation(ContextHelper.java:73)
    at org.apache.pdfbox.preflight.utils.ContextHelper.validateElement(ContextHelper.java:88)
    at org.apache.pdfbox.preflight.PreflightDocument.validate(PreflightDocument.java:168)





[jira] [Updated] (PDFBOX-2442) false negative? 3.1.6 : Invalid Font definition, Width (633.0) of the character 60 in the font program BNGLNN+LucidaMath-Symbol is inconsistent with the width (0.0)

2014-10-21 Thread Ralf Hauser (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ralf Hauser updated PDFBOX-2442:

Attachment: adobe7pie.pdf

 false negative? 3.1.6 : Invalid Font definition, Width (633.0) of the 
 character 60 in the font program BNGLNN+LucidaMath-Symbol is inconsistent 
 with the width (0.0) in the PDF dictionary.
 ---

 Key: PDFBOX-2442
 URL: https://issues.apache.org/jira/browse/PDFBOX-2442
 Project: PDFBox
  Issue Type: Bug
  Components: Preflight
Affects Versions: 2.0.0
 Environment: java7 deb7
Reporter: Ralf Hauser
 Attachments: adobe7pie.pdf







[jira] [Updated] (PDFBOX-2441) Improve XRef self healing mechanism when more than one xref table

2014-10-21 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2441:
---
Fix Version/s: 2.0.0
   1.8.8

 Improve XRef self healing mechanism when more than one xref table
 -

 Key: PDFBOX-2441
 URL: https://issues.apache.org/jira/browse/PDFBOX-2441
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Andreas Lehmkühler
 Fix For: 1.8.8, 2.0.0

 Attachments: 260105.pdf







RE: 2.0

2014-10-21 Thread Allison, Timothy B.
Been too busy over in Tika-land...just noticing this now.

Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq). 
 I won't have time to integrate 2.0 into our Tika PDFParser any time soon 
(Jeremy Anderson on TIKA-1285 has already started this), but I could easily 
write a lightweight wrapper around PDFBox's TextStripper + metadata inside of 
the tika-batch/tika-eval framework.

Cheers,

  Tim

From: Andreas Lehmkühler [andr...@lehmi.de]
Sent: Wednesday, October 15, 2014 6:20 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Hi,


 On 15 October 2014 at 09:32, Maruan Sahyoun sahy...@fileaffairs.de wrote:


 What about keeping both for the 2.0 release and phasing the old one out for 3,
 but making the NonSequential parser the default?
 Would also give us some time to work with Tim (TIKA) on the test suite.
I agree, that's the only thing we can manage in a timely manner.


 Maybe we could simplify the variations of PDDocument.load to something like

 PDDocument.load(input, raf, enforce, useLegacyParser) or
 PDDocument.load(input, raf, enforce, withSignatureSupport) …

 and introduce PDDocument.load(input) to use the NonSequential


 WDYT?
Good idea, I've already created PDFBOX-2430 for this.


 Maruan


BR
Andreas Lehmkühler

 On 15.10.2014 at 09:18, Timo Boehme timo.boe...@ontochem.com wrote:

  Hi,
 
  the difference between the parsers stems from the fact that the old parser
  can cope with a completely broken xref table because it uses the objects as
  it finds them on its sequential way. What we need (as I proposed before) is
  a repair mechanism scanning the file for object start/end to be used for
  re-creating the xref table.
  I will see if I can find some time to do this.
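
  The scan proposed above could look roughly like the following sketch. It
  is an assumption-laden illustration, not an existing PDFBox class: the
  names are hypothetical, it only looks for "N G obj" headers, and it ignores
  cases a real repair would have to handle, such as object streams or the
  keyword appearing inside strings and stream data.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of rebuilding an xref map by scanning for "N G obj" headers. */
final class XrefScanSketch {

    // An object header: object number, generation number, keyword "obj".
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(?<![0-9])(\\d+)\\s+(\\d+)\\s+obj\\b");

    /** Maps "objNumber generation" to the byte offset of the object header. */
    static Map<String, Long> scan(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so match indices are byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            // A later definition of the same object replaces an earlier one,
            // as it would for an incremental update appended to the file.
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start());
        }
        return offsets;
    }
}
```

  The resulting map plays the role of the xref table: it records where each
  object actually starts, regardless of what the damaged table claims.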
 
  The only other blocker is, as Andreas has pointed out, the signing. I'm not
  familiar with this and don't know what needs to be done here.
 
 
  Best,
  Timo
 
 
  On 14.10.2014 at 21:18, Tilman Hausherr wrote:
  Here are some:
 
  055/055794.pdf
  082/082463.pdf
  108/108362.pdf
  113/113223.pdf
  115/115458.pdf
  115/115463.pdf
  122/122393.pdf
  129/129416.pdf
  133/133423.pdf
  148/148020.pdf
  152/152012.pdf
  161/161466.pdf
 
  to be found here:
  http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
 
  Tilman
 
  On 14.10.2014 at 21:06, John Hewson wrote:
  Unless somebody provides us with a list of those files, then I think
  this is an unreasonable request. As long as we continue to leave the
  old parser in PDFBox, we won’t get the bug reports which we need to
  fix the new parser, and the situation will never resolve itself.
  Falling back to the old parser is just as bad - we won’t get bug reports.
 
  -- John
 
  On 14 Oct 2014, at 07:39, Tilman Hausherr thaush...@t-online.de wrote:
 
  I prefer that the old parser not be removed, because there are many
  files that can only be parsed by the old parser. This came out in a
  large scale test with TIKA.
 
  The best idea (in my current opinion) is to use the nonSeq parser
  first, and the old parser if there is an exception.
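
  As a rough sketch of that fallback idea (the Parser interface below is a
  hypothetical stand-in, not a PDFBox type, and the method name is invented
  for illustration):

```java
import java.io.IOException;

/** Hypothetical sketch of "try the new parser first, fall back to the old one". */
final class FallbackParseSketch {

    /** Stand-in for a parser; not a PDFBox type. */
    interface Parser<T> {
        T parse(byte[] input) throws IOException;
    }

    static <T> T parseWithFallback(byte[] input, Parser<T> primary, Parser<T> fallback)
            throws IOException {
        try {
            return primary.parse(input);
        } catch (IOException primaryFailure) {
            try {
                return fallback.parse(input);
            } catch (IOException fallbackFailure) {
                // Keep the first failure attached so it can still be reported as a bug.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

  Note that a successful fallback swallows the primary failure here. Logging
  it at that point would be one way to address the concern that falling back
  hides the bug reports needed to fix the new parser.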
 
  Tilman
 
  On 14.10.2014 at 09:45, Timo Boehme wrote:
  Hi,
 
  On 14.10.2014 at 07:22, John Hewson wrote:
  Hi,
  On 10 October 2014 at 20:05, John Hewson j...@jahewson.com wrote:
 
 
     - Parsing (Andreas?)
  I guess we won't get a complete new parser in 2.0, but I try to
  improve the XRef
  and the COSStream stuff
  It would be great if we could get rid of the old parser and switch
  to the non-sequential
  parser, WDYT?
  I would also propose to completely remove the old parser. That way
  we are more flexible in parsing streams etc. since parts of the
  non-sequential parser are a compromise to work side-by-side with the
  old parser.
  Possibly there are a small number of functions for which the old
  parser is still needed - e.g. signing?
 
 
  Best,
  Timo
 
 
 
 
 
 
  --
 
  Timo Boehme
  OntoChem GmbH
  H.-Damerow-Str. 4
  06120 Halle/Saale
  T: +49 345 4780474
  F: +49 345 4780471
  timo.boe...@ontochem.com
 
  _
 
  OntoChem GmbH
  Geschäftsführer: Dr. Lutz Weber
  Sitz: Halle / Saale
  Registergericht: Stendal
  Registernummer: HRB 215461
  _
 



[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178325#comment-14178325
 ] 

Tilman Hausherr commented on PDFBOX-2370:
-

When rendering, many files have resources missing (fonts, images, 
shadings, forms) or throw an NPE. Some examples:
- PDFBOX-1169.pdf images missing
- tracemonkey NPE
- PDFBOX-1452.pdf question mark image missing
- CIB-coons-vs-tensormesh.pdf NPE (but CIB-coonsmesh.pdf is ok)
- PDFBOX-2265-igalia.pdf NPE
and many many more :-(

 Move caching outside of PDResources
 ---

 Key: PDFBOX-2370
 URL: https://issues.apache.org/jira/browse/PDFBOX-2370
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Priority: Critical
 Fix For: 2.0.0


 *Note:* This issue is based on a discussion which occurred regarding 
 PDFBOX-2301 but is actually a separate issue.
 Currently we cache the page resources in PDResources which belongs to a 
 specific PDPage. This causes two problems, 1) users who want to hold many 
 PDPage objects in memory will have high memory use (but this is often by 
 accident*). 2) By caching resources in PDPage we only get to keep that cache 
 for the lifetime of the page, which e.g. in PDFRenderer is a single page 
 only. That means that a font which appears on 40 pages has to be parsed 40 
 times, which causes slow running times, but also memory thrashing as objects 
 are destroyed frequently only to be re-created.
 What PDFRenderer really needs is not page-wide caching but document-wide 
 caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
 But that won't work for images, because they're too large. What we're 
 beginning to realise is that caching is use-case specific and probably 
 shouldn't be built into PDFBox's pdmodel. Instead we should remove 
 resource caching from PDPage/PDResources and implement custom caching in 
 PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
 happily volunteer myself. The existing high-level PDFBox APIs will continue 
 to just work and power users will get a level of control that they 
 appreciate.
 This strategy could be enhanced by removing memory-hungry methods on 
 PDResources such as getFonts() and getXObjects() which force all resources of 
 a particular type to be loaded, whether or not they are needed, or actually 
 used in the content stream. They would be replaced by methods to retrieve a 
 single resource, e.g. getFont(name).
 ---
 \* There probably isn't a legitimate use case for 1) any more, we've solved 
 the issues which we used to have with image caching (in fact, the 
 clearCache() method actually no longer needs to be called by PDFRenderer, 
 though it currently is). The real problem is that it's easy to accidentally 
 retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() 
 method is dangerous as looping over it will cause pages to be retained during 
 processing, like so:
 {code}
 for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
 {
  // ... this is idiomatic in PDFBox 1.8
 } 
 // List returned by getAllPages() kept in scope until here (bad)
 {code}
 I added a couple of methods a while ago to avoid this by fetching each 
 PDPage one at a time, and this is now used internally in PDFBox to avoid the 
 memory problems we used to have:
 {code}
 for (int i = 0; i < document.getNumberOfPages(); i++)
 {
 PDPage page = document.getPage(i);
 // ... this is the new 2.0 way
 // current page falls out of scope here (good)
 }
 {code}
 To solve this problem, we could change getAllPages() so that instead of 
 returning a List it returns an IteratorPDPage, which would provide a nicer 
 API than getPage(int) and most existing code will continue to work. This is 
 also an opportunity to also fix type safety issues due to PDPageNode and 
 incorrect handling of the page tree (this is similar to the issue we had 
 recently with the acroform field tree).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178332#comment-14178332
 ] 

Tilman Hausherr commented on PDFBOX-2370:
-

popping the resource stack in processSubStream() helps

 Move caching outside of PDResources
 ---

 Key: PDFBOX-2370
 URL: https://issues.apache.org/jira/browse/PDFBOX-2370
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Priority: Critical
 Fix For: 2.0.0


 *Note:* This issue is based on a discussion which occurred regarding 
 PDFBOX-2301 but is actually a separate issue.
 Currently we cache the page resources in PDResources which belongs to a 
 specific PDPage. This causes two problems, 1) users who want to hold many 
 PDPage objects in memory will have high memory use (but this is often by 
 accident*). 2) By caching resources in PDPage we only get to keep that cache 
 for the lifetime of the page, which e.g. in PDFRenderer is a single page 
 only. That means that a font which appears on 40 pages has to be parsed 40 
 times, which causes slow running times, but also memory thrashing as objects 
 are destroyed frequently only to be re-created.
 What PDFRenderer really needs is not page-wide caching but document-wide 
 caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
 But that won't work for images, because they're too large. What we're 
 beginning to realise is that caching is use-case specific and probably 
 shouldn't be built into PDFBox's pdmodel. Instead we should remove 
 resource caching from PDPage/PDResources and implement custom caching in 
 PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
 happily volunteer myself. The existing high-level PDFBox APIs will continue 
 to just work and power users will get a level of control that they 
 appreciate.
 This strategy could be enhanced by removing memory-hungry methods on 
 PDResources such as getFonts() and getXObjects() which force all resources of 
 a particular type to be loaded, whether or not they are needed, or actually 
 used in the content stream. They would be replaced by methods to retrieve a 
 single resource, e.g. getFont(name).
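 A minimal sketch of the document-wide caching idea described above. The class name DocumentFontCache and the caller-supplied loader function are hypothetical and are not PDFBox API; the type parameter F merely stands in for a parsed font object. The point is only that a font is parsed once per document, and only when it is first requested by name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical document-wide cache: one instance per document, shared by all pages.
// F stands in for a parsed font object; the loader stands in for the expensive parse.
class DocumentFontCache<F>
{
    private final Map<String, F> fonts = new HashMap<>();
    private final Function<String, F> loader;
    private int loads = 0;

    DocumentFontCache(Function<String, F> loader)
    {
        this.loader = loader;
    }

    // Retrieves a single font by name, parsing it only on first use.
    F getFont(String name)
    {
        F font = fonts.get(name);
        if (font == null)
        {
            font = loader.apply(name);
            loads++;
            fonts.put(name, font);
        }
        return font;
    }

    // Number of times the loader actually ran (for illustration only).
    int getLoadCount()
    {
        return loads;
    }
}
```

 If a renderer held a cache along these lines, a font that appears on 40 pages would be parsed once rather than 40 times. Images would presumably be kept out of it, or given a bounded cache of their own, since they are too large to hold for the lifetime of the document.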
 ---
 \* There probably isn't a legitimate use case for 1) any more, as we've solved
 the issues which we used to have with image caching (in fact, the
 clearCache() method actually no longer needs to be called by PDFRenderer,
 though it currently is). The real problem is that it's easy to accidentally
 retain PDPage objects: the PDDocument#getDocumentCatalog().getAllPages()
 method is dangerous, as looping over it will cause pages to be retained during
 processing, like so:
 {code}
 // getAllPages() returns a java.util.List
 for (PDPage page : document.getDocumentCatalog().getAllPages())
 {
     // ... this is idiomatic in PDFBox 1.8
 }
 // List returned by getAllPages() kept in scope until here (bad)
 {code}
 I added a couple of methods a while ago to avoid this by fetching each
 PDPage one at a time, and this is now used internally in PDFBox to avoid the
 memory problems we used to have:
 {code}
 for (int i = 0; i < document.getNumberOfPages(); i++)
 {
     PDPage page = document.getPage(i);
     // ... this is the new 2.0 way
     // current page falls out of scope here (good)
 }
 {code}
 To solve this problem, we could change getAllPages() so that instead of
 returning a List it returns an Iterator<PDPage>, which would provide a nicer
 API than getPage(int), and most existing code would continue to work. This is
 also an opportunity to fix type safety issues due to PDPageNode and
 incorrect handling of the page tree (this is similar to the issue we had
 recently with the acroform field tree).
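 The iterator proposal above could look roughly like the following sketch. LazyPages and its page loader are hypothetical names, and P stands in for PDPage. Wrapping the iterator in an Iterable is an assumption made here so that existing for-each loops would keep compiling; the issue itself only mentions an Iterator.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

// Hypothetical lazy page sequence: pages are fetched one at a time by index,
// so no list of all pages is ever held. P stands in for PDPage.
class LazyPages<P> implements Iterable<P>
{
    private final int count;
    private final IntFunction<P> pageLoader;

    LazyPages(int count, IntFunction<P> pageLoader)
    {
        this.count = count;
        this.pageLoader = pageLoader;
    }

    @Override
    public Iterator<P> iterator()
    {
        return new Iterator<P>()
        {
            private int next = 0;

            @Override
            public boolean hasNext()
            {
                return next < count;
            }

            @Override
            public P next()
            {
                if (next >= count)
                {
                    throw new NoSuchElementException();
                }
                // only the current page is created here; the iterator keeps no
                // reference to pages it has already handed out
                return pageLoader.apply(next++);
            }
        };
    }
}
```

 A caller would then write a plain for-each loop over the returned object, and each page would become unreachable as soon as the loop body moved on, provided the caller does not store it.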



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PDFBOX-2250) Improve XRef self healing mechanism

2014-10-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178547#comment-14178547
 ] 

Tilman Hausherr commented on PDFBOX-2250:
-

ignore the last commit message (wrong issue)

 Improve XRef self healing mechanism
 ---

 Key: PDFBOX-2250
 URL: https://issues.apache.org/jira/browse/PDFBOX-2250
 Project: PDFBox
  Issue Type: Improvement
  Components: Parsing
Affects Versions: 1.8.6, 1.8.7, 2.0.0
Reporter: Andreas Lehmkühler
Assignee: Andreas Lehmkühler
 Fix For: 1.8.8, 2.0.0

 Attachments: 055794.pdf, 113223.pdf, 
 PDFBOX-2250-107425-empty-xref.pdf, PDFBOX-2250-110264-xref-zeronumber.pdf, 
 PDFBOX-2250-229205.pdf, PDFBOX-2250-233566.pdf


 PDFBOX-1769 introduced a self healing mechanism to repair corrupt XRef 
 offsets. But that one was just a starter and there remain a lot of issues to 
 be solved. I'm planning to solve at least some of them.
 All fixes and improvements are targeting the non-sequential parser and I 
 won't port those changes to the old parser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178546#comment-14178546
 ] 

Tilman Hausherr commented on PDFBOX-2370:
-

done in [ https://svn.apache.org/r1633401 ]


 Move caching outside of PDResources
 ---

 Key: PDFBOX-2370
 URL: https://issues.apache.org/jira/browse/PDFBOX-2370
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Priority: Critical
 Fix For: 2.0.0


 *Note:* This issue is based on a discussion which occurred regarding 
 PDFBOX-2301 but is actually a separate issue.
 Currently we cache the page resources in PDResources which belongs to a 
 specific PDPage. This causes two problems, 1) users who want to hold many 
 PDPage objects in memory will have high memory use (but this is often by 
 accident*). 2) By caching resources in PDPage we only get to keep that cache 
 for the lifetime of the page, which e.g. in PDFRenderer is a single page 
 only. That means that a font which appears on 40 pages has to be parsed 40 
 times, which causes slow running times, but also memory thrashing as objects 
 are destroyed frequently only to be re-created.
 What PDFRenderer really needs is not page-wide caching but document-wide 
 caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
 But that won't work for images, because they're too large. What we're 
 beginning to realise is that caching is use-case specific and probably 
 shouldn't be built into PDFBox's pdmodel. Instead we should remove 
 resource caching from PDPage/PDResources and implement custom caching in 
 PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
 happily volunteer myself. The existing high-level PDFBox APIs will continue 
 to just work and power users will get a level of control that they 
 appreciate.
 This strategy could be enhanced by removing memory-hungry methods on 
 PDResources such as getFonts() and getXObjects() which force all resources of 
 a particular type to be loaded, whether or not they are needed, or actually 
 used in the content stream. They would be replaced by methods to retrieve a 
 single resource, e.g. getFont(name).
 ---
 \* There probably isn't a legitimate use case for 1) any more, as we've solved
 the issues which we used to have with image caching (in fact, the
 clearCache() method actually no longer needs to be called by PDFRenderer,
 though it currently is). The real problem is that it's easy to accidentally
 retain PDPage objects: the PDDocument#getDocumentCatalog().getAllPages()
 method is dangerous, as looping over it will cause pages to be retained during
 processing, like so:
 {code}
 // getAllPages() returns a java.util.List
 for (PDPage page : document.getDocumentCatalog().getAllPages())
 {
     // ... this is idiomatic in PDFBox 1.8
 }
 // List returned by getAllPages() kept in scope until here (bad)
 {code}
 I added a couple of methods a while ago to avoid this by fetching each
 PDPage one at a time, and this is now used internally in PDFBox to avoid the
 memory problems we used to have:
 {code}
 for (int i = 0; i < document.getNumberOfPages(); i++)
 {
     PDPage page = document.getPage(i);
     // ... this is the new 2.0 way
     // current page falls out of scope here (good)
 }
 {code}
 To solve this problem, we could change getAllPages() so that instead of
 returning a List it returns an Iterator<PDPage>, which would provide a nicer
 API than getPage(int), and most existing code would continue to work. This is
 also an opportunity to fix type safety issues due to PDPageNode and
 incorrect handling of the page tree (this is similar to the issue we had
 recently with the acroform field tree).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Build failed in Jenkins: PDFBox-trunk » Apache PDFBox #1356

2014-10-21 Thread Apache Jenkins Server
See 
https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox/1356/changes

Changes:

[tilman] PDFBOX-2370: restore pop resource stack

--
[...truncated 58 lines...]
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec - in 
org.apache.pdfbox.cos.TestCOSInteger
Running org.apache.pdfbox.cos.TestCOSFloat
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec - in 
org.apache.pdfbox.cos.TestCOSFloat
Running org.apache.pdfbox.io.TestRandomAccessBuffer
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec - in 
org.apache.pdfbox.io.TestRandomAccessBuffer
Running org.apache.pdfbox.io.TestIOUtils
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec - in 
org.apache.pdfbox.io.TestIOUtils
Running org.apache.pdfbox.io.TestRandomAccessFileOutputStream
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec - in 
org.apache.pdfbox.io.TestRandomAccessFileOutputStream
Running org.apache.pdfbox.encoding.PDFDocEncodingCharsetTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec - in 
org.apache.pdfbox.encoding.PDFDocEncodingCharsetTest
Running org.apache.pdfbox.util.TestLayerUtility
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.145 sec - in 
org.apache.pdfbox.util.TestLayerUtility
Running org.apache.pdfbox.util.PDFCloneUtilityTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec - in 
org.apache.pdfbox.util.PDFCloneUtilityTest
Running org.apache.pdfbox.util.TestMatrix
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec - in 
org.apache.pdfbox.util.TestMatrix
Running org.apache.pdfbox.util.TestQuickSort
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec - in 
org.apache.pdfbox.util.TestQuickSort
Running org.apache.pdfbox.util.PDFMergerUtilityTest
Oct 21, 2014 4:01:45 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:45 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Times-Roman'
Oct 21, 2014 4:01:45 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:45 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Times-Roman'
Oct 21, 2014 4:01:45 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Times-Roman'
Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Times-Roman'
Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Times-Roman'
Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Times-Roman'
Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:46 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Times-Roman'
Oct 21, 2014 4:01:47 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:47 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Times-Roman'
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.352 sec - in 
org.apache.pdfbox.util.PDFMergerUtilityTest
Running org.apache.pdfbox.util.PageExtractorTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.036 sec - in 
org.apache.pdfbox.util.PageExtractorTest
Running org.apache.pdfbox.util.TestTextStripper
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 

Build failed in Jenkins: PDFBox-trunk #1356

2014-10-21 Thread Apache Jenkins Server
See https://builds.apache.org/job/PDFBox-trunk/1356/changes

Changes:

[tilman] PDFBOX-2370: restore pop resource stack

--
[...truncated 472 lines...]
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica-Bold'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica-Bold'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica-Bold'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica-Bold'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica-Bold'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica-Bold'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica-Bold'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica-Bold'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica-BoldOblique'
Oct 21, 2014 4:01:51 PM org.apache.pdfbox.pdmodel.font.ExternalFonts 
getTrueTypeFallbackFont
SEVERE: No TTF fallback font for 'Helvetica-BoldOblique'
Oct 21, 2014 

[jira] [Created] (PDFBOX-2443) About to return NULL from unhandled branch when constructing a PDJpeg

2014-10-21 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-2443:
---

 Summary: About to return NULL from unhandled branch when 
constructing a PDJpeg
 Key: PDFBOX-2443
 URL: https://issues.apache.org/jira/browse/PDFBOX-2443
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7, 1.8.8
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Priority: Minor
 Fix For: 1.8.8


The INFO message "About to return NULL from unhandled branch" appears when 
creating a PDJpeg from a stream. Although the message is an INFO and not a 
WARNING or an ERROR, it scares users.

The message happens because getRGBImage() calls getColorSpace() although the 
colorspace isn't known yet; it is determined after the call to getRGBImage(), 
which loads the image.

The image objects were completely redesigned in 2.0, so it makes no sense to 
waste time for a real solution to this. I am setting the message to DEBUG 
instead, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PDFBOX-2443) About to return NULL from unhandled branch when constructing a PDJpeg

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178647#comment-14178647
 ] 

ASF subversion and git services commented on PDFBOX-2443:
-

Commit 1633414 from [~tilman] in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1633414 ]

PDFBOX-2443: change scary info message to debug and make it less scary; change 
javadoc too

 About to return NULL from unhandled branch when constructing a PDJpeg
 -

 Key: PDFBOX-2443
 URL: https://issues.apache.org/jira/browse/PDFBOX-2443
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7, 1.8.8
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Priority: Minor
 Fix For: 1.8.8


 The INFO message "About to return NULL from unhandled branch" appears when 
 creating a PDJpeg from a stream. Although the message is an INFO and not a 
 WARNING or an ERROR, it scares users.
 The message happens because getRGBImage() calls getColorSpace() although the 
 colorspace isn't known yet; it is determined after the call to getRGBImage(), 
 which loads the image.
 The image objects were completely redesigned in 2.0, so it makes no sense to 
 waste time for a real solution to this. I am setting the message to DEBUG 
 instead, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PDFBOX-2443) About to return NULL from unhandled branch when constructing a PDJpeg

2014-10-21 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2443:

Description: 
The INFO message "About to return NULL from unhandled branch" appears when 
creating a PDJpeg from a stream. Although the message is an INFO and not a 
WARNING or an ERROR, it scares users.

The message happens because getRGBImage() calls getColorSpace() although the 
colorspace isn't known yet; it is determined after the call to getRGBImage(), 
which loads the image.

The image objects were completely redesigned in 2.0, so it makes no sense to 
waste time on a real solution to this. I am setting the message to DEBUG 
instead, and making it less scary.

  was:
The INFO message "About to return NULL from unhandled branch" appears when 
creating a PDJpeg from a stream. Although the message is an INFO and not a 
WARNING or an ERROR, it scares users.

The message happens because getRGBImage() calls getColorSpace() although the 
colorspace isn't known yet; it is determined after the call to getRGBImage(), 
which loads the image.

The image objects were completely redesigned in 2.0, so it makes no sense to 
waste time for a real solution to this. I am setting the message to DEBUG 
instead, 


 About to return NULL from unhandled branch when constructing a PDJpeg
 -

 Key: PDFBOX-2443
 URL: https://issues.apache.org/jira/browse/PDFBOX-2443
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7, 1.8.8
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Priority: Minor
 Fix For: 1.8.8


 The INFO message "About to return NULL from unhandled branch" appears when 
 creating a PDJpeg from a stream. Although the message is an INFO and not a 
 WARNING or an ERROR, it scares users.
 The message happens because getRGBImage() calls getColorSpace() although the 
 colorspace isn't known yet; it is determined after the call to getRGBImage(), 
 which loads the image.
 The image objects were completely redesigned in 2.0, so it makes no sense to 
 waste time on a real solution to this. I am setting the message to DEBUG 
 instead, and making it less scary.
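 To illustrate why lowering the level hides the message from most users, here is a small sketch using java.util.logging. That choice is an assumption made to keep the example self-contained and is not necessarily the logging API PDFBox itself uses. The class name and message texts are hypothetical. FINE is used as the java.util.logging counterpart of DEBUG.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Illustration only: shows which messages survive a given logger level.
class LogLevelSketch
{
    // Logs one INFO and one FINE message and returns those that were emitted.
    static List<String> emittedAt(Level loggerLevel)
    {
        final List<String> seen = new ArrayList<>();
        Logger log = Logger.getAnonymousLogger();
        log.setUseParentHandlers(false);
        log.setLevel(loggerLevel);

        // handler that simply collects whatever the logger lets through
        Handler collector = new Handler()
        {
            @Override
            public void publish(LogRecord record)
            {
                seen.add(record.getLevel() + ": " + record.getMessage());
            }

            @Override
            public void flush()
            {
            }

            @Override
            public void close()
            {
            }
        };
        collector.setLevel(Level.ALL);
        log.addHandler(collector);

        log.info("hypothetical old message, logged at INFO");
        log.fine("hypothetical new message, logged at debug level");
        return seen;
    }
}
```

 With the logger at INFO only the first message is collected; the second appears only once the level is lowered to FINE or below, which is the effect the change aims for.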



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PDFBOX-2443) About to return NULL from unhandled branch when constructing a PDJpeg

2014-10-21 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-2443.
-
Resolution: Fixed

 About to return NULL from unhandled branch when constructing a PDJpeg
 -

 Key: PDFBOX-2443
 URL: https://issues.apache.org/jira/browse/PDFBOX-2443
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7, 1.8.8
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Priority: Minor
 Fix For: 1.8.8


 The INFO message "About to return NULL from unhandled branch" appears when 
 creating a PDJpeg from a stream. Although the message is an INFO and not a 
 WARNING or an ERROR, it scares users.
 The message happens because getRGBImage() calls getColorSpace() although the 
 colorspace isn't known yet; it is determined after the call to getRGBImage(), 
 which loads the image.
 The image objects were completely redesigned in 2.0, so it makes no sense to 
 waste time on a real solution to this. I am setting the message to DEBUG 
 instead, and making it less scary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: 2.0

2014-10-21 Thread Tilman Hausherr

Hi Tim,

2.0 doesn't seem likely to be released soon... what might be useful again is a 
comparison between seq v non-seq. Andreas recently resolved an issue 
(PDFBOX-2250) that improves the nonSeq parser a lot. That work isn't 
fully done, though: a follow-up issue, PDFBOX-2441 
(https://issues.apache.org/jira/browse/PDFBOX-2441), has been opened, 
which should improve handling of a few more complex files.


Tilman



On 21.10.2014 at 13:00, Allison, Timothy B. wrote:

Been too busy over in Tika-land...just noticing this now.

Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v non-seq). 
 I won't have time to integrate 2.0 into our Tika PDFParser any time soon 
(Jeremy Anderson on TIKA-1285 has already started this), but I could easily 
write a lightweight wrapper around PDFBox's TextStripper + metadata inside of 
the tika-batch/tika-eval framework.

Cheers,

   Tim

From: Andreas Lehmkühler [andr...@lehmi.de]
Sent: Wednesday, October 15, 2014 6:20 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Hi,



On 15 October 2014 at 09:32, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:


What about keeping both for the 2.0 release and phasing the old one out for 3,
but making the NonSequential parser the default?
Would also give us some time to work with Tim (TIKA) on the test suite.

I agree, that's the only thing we can manage in a timely manner.



Maybe we could simplify the variations of PDDocument.load to something like

PDDocument.load(input, raf, enforce, useLegacyParser) or
PDDocument.load(input, raf, enforce, withSignatureSupport) …

and introduce PDDocument.load(input) to use the NonSequential


WDYT?

Good idea, I've already created PDFBOX-2430 for this.


Maruan


BR
Andreas Lehmkühler

On 15.10.2014 at 09:18, Timo Boehme <timo.boe...@ontochem.com> wrote:


Hi,

the difference between the parsers stems from the fact that the old parser
can cope with a completely broken xref table because it uses the objects as
it finds them on its sequential way. What we need (as I proposed before) is
a repair mechanism scanning the file for object start/end to be used for
re-creating the xref table.
I will see if I can find some time to do this.
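
A rough sketch of the scanning step proposed above, assuming a brute-force search for "<number> <generation> obj" markers over the raw bytes. XrefScanSketch and its scan method are hypothetical names, not PDFBox code. A real repair mechanism would also have to deal with false matches inside stream data and with objects stored in object streams, which this sketch ignores.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical brute-force scanner: finds "<number> <generation> obj" markers
// and records the byte offset at which each object starts.
class XrefScanSketch
{
    // object number, generation number, then the keyword "obj";
    // the lookbehind makes sure the match starts at the first digit of the number
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![0-9])(\\d+)\\s+(\\d+)\\s+obj\\b");

    // Returns a map from "objNumber generation" to byte offset.
    static Map<String, Long> scan(byte[] pdf)
    {
        // ISO-8859-1 maps every byte to exactly one char, so char index == byte offset
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find())
        {
            // a later definition of the same object replaces an earlier one
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start(1));
        }
        return offsets;
    }
}
```

The resulting map could then be compared with, or substituted for, the offsets read from a broken xref table.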

The only other stopper is, as Andreas has pointed out, the signing. I'm not
familiar with this and don't know what needs to be done here.


Best,
Timo


On 14.10.2014 at 21:18, Tilman Hausherr wrote:

Here are some:

055/055794.pdf
082/082463.pdf
108/108362.pdf
113/113223.pdf
115/115458.pdf
115/115463.pdf
122/122393.pdf
129/129416.pdf
133/133423.pdf
148/148020.pdf
152/152012.pdf
161/161466.pdf

to be found here:
http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/

Tilman

On 14.10.2014 at 21:06, John Hewson wrote:

Unless somebody provides us with a list of those files, then I think
this is an unreasonable request. As long as we continue to leave the
old parser in PDFBox, we won’t get the bug reports which we need to
fix the new parser, and the situation will never resolve itself.
Falling back to the old parser is just as bad - we won’t get bug reports.

-- John

On 14 Oct 2014, at 07:39, Tilman Hausherr <thaush...@t-online.de> wrote:


I prefer that the old parser not be removed, because there are many
files that can only be parsed by the old parser. This came out in a
large scale test with TIKA.

The best idea (in my current opinion) is to use the nonSeq parser
first, and the old parser if there is an exception.
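
The fallback idea above might be sketched as follows. DocumentParser and FallbackParsing are hypothetical names introduced only for illustration; T stands in for the parsed document type, and the two parser arguments stand in for the nonSeq and old parsers.

```java
import java.io.IOException;

// Hypothetical parser abstraction; T stands in for the parsed document type.
interface DocumentParser<T>
{
    T parse(byte[] input) throws IOException;
}

class FallbackParsing
{
    // Tries the primary parser first and falls back to the secondary parser
    // only if the primary one throws an IOException.
    static <T> T parseWithFallback(byte[] input,
                                   DocumentParser<T> primary,
                                   DocumentParser<T> fallback) throws IOException
    {
        try
        {
            return primary.parse(input);
        }
        catch (IOException primaryFailure)
        {
            try
            {
                return fallback.parse(input);
            }
            catch (IOException fallbackFailure)
            {
                // keep the first failure visible so it can still be reported
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

One caveat with this design is that a silent fallback hides failures of the primary parser, so the primary exception would probably need to be logged or reported in some way for bug reports to keep coming in.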

Tilman

On 14.10.2014 at 09:45, Timo Boehme wrote:

Hi,

On 14.10.2014 at 07:22, John Hewson wrote:

Hi,

On 10 October 2014 at 20:05, John Hewson <j...@jahewson.com> wrote:


 - Parsing (Andreas?)

I guess we won't get a completely new parser in 2.0, but I'll try to
improve the XRef and the COSStream stuff

It would be great if we could get rid of the old parser and switch
to the non-sequential
parser, WDYT?

I would also propose to completely remove the old parser. That way
we are more flexible in parsing streams etc. since parts of the
non-sequential parser are a compromise to work side-by-side with the
old parser.
Possibly there are a small number of functions for which the old
parser is still needed - e.g. signing?


Best,
Timo




--

Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com

_

OntoChem GmbH
Managing Director: Dr. Lutz Weber
Registered office: Halle / Saale
Register court: Stendal
Registration number: HRB 215461
_





[jira] [Created] (PDFBOX-2444) Add radial shading example

2014-10-21 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-2444:
---

 Summary: Add radial shading example
 Key: PDFBOX-2444
 URL: https://issues.apache.org/jira/browse/PDFBOX-2444
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Priority: Minor
 Fix For: 2.0.0


Add radial shading to the example created in PDFBOX-2211. Use both methods of 
adding a shading that emerged from PDFBOX-2370.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PDFBOX-2444) Add radial shading example

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178756#comment-14178756
 ] 

ASF subversion and git services commented on PDFBOX-2444:
-

Commit 1633427 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633427 ]

PDFBOX-2444, PDFBOX-2370: add radial shading; use both methods of adding a 
shading to the resources

 Add radial shading example
 --

 Key: PDFBOX-2444
 URL: https://issues.apache.org/jira/browse/PDFBOX-2444
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Priority: Minor
  Labels: shading
 Fix For: 2.0.0


 Add radial shading to the example created in PDFBOX-2211. Use both methods of 
 adding a shading that emerged from PDFBOX-2370.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PDFBOX-2444) Add radial shading example

2014-10-21 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-2444.
-
Resolution: Fixed



[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178763#comment-14178763
 ] 

ASF subversion and git services commented on PDFBOX-2370:
-

Commit 1633428 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633428 ]

PDFBOX-2370: use sh instead of cs1 as prefix for shading objects

 Move caching outside of PDResources
 ---

 Key: PDFBOX-2370
 URL: https://issues.apache.org/jira/browse/PDFBOX-2370
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Priority: Critical
 Fix For: 2.0.0


 *Note:* This issue is based on a discussion which occurred regarding 
 PDFBOX-2301 but is actually a separate issue.
 Currently we cache the page resources in PDResources which belongs to a 
 specific PDPage. This causes two problems, 1) users who want to hold many 
 PDPage objects in memory will have high memory use (but this is often by 
 accident*). 2) By caching resources in PDPage we only get to keep that cache 
 for the lifetime of the page, which e.g. in PDFRenderer is a single page 
 only. That means that a font which appears on 40 pages has to be parsed 40 
 times, which causes slow running times, but also memory thrashing as objects 
 are destroyed frequently only to be re-created.
 What PDFRenderer really needs is not page-wide caching but document-wide 
 caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
 But that won't work for images, because they're too large. What we're 
 beginning to realise is that caching is use-case specific and probably 
 shouldn't be built into PDFBox's pdmodel. Instead we should remove 
 resource caching from PDPage/PDResources and implement custom caching in 
 PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
 happily volunteer myself. The existing high-level PDFBox APIs will continue 
 to just work and power users will get a level of control that they 
 appreciate.
 This strategy could be enhanced by removing memory-hungry methods on 
 PDResources such as getFonts() and getXObjects() which force all resources of 
 a particular type to be loaded, whether or not they are needed, or actually 
 used in the content stream. They would be replaced by methods to retrieve a 
 single resource, e.g. getFont(name).
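
A minimal sketch of the document-wide caching idea described above. This is not PDFBox code: DocumentFontCache, ParsedFont and getParseCount() are hypothetical names, and the expensive font parse is only simulated by a counter. It merely illustrates that a cache scoped to the document parses a font once, even when 40 pages reference it.

```java
import java.util.HashMap;
import java.util.Map;

public class DocumentFontCache {
    /** Stand-in for an expensive-to-parse font object (hypothetical type). */
    static final class ParsedFont {
        final String name;
        ParsedFont(String name) { this.name = name; }
    }

    private final Map<String, ParsedFont> fonts = new HashMap<>();
    private int parseCount = 0;

    /** Returns the cached font, "parsing" it only on the first request. */
    ParsedFont getFont(String name) {
        return fonts.computeIfAbsent(name, n -> {
            parseCount++;                 // simulated expensive parse
            return new ParsedFont(n);
        });
    }

    int getParseCount() { return parseCount; }

    public static void main(String[] args) {
        DocumentFontCache cache = new DocumentFontCache();
        for (int page = 0; page < 40; page++) {
            cache.getFont("F1");          // same font referenced on every page
        }
        System.out.println("parses: " + cache.getParseCount());
    }
}
```

Because the cache object lives outside any page, its lifetime (and therefore its memory cost) is decided by the caller, which is the use-case specific control the description asks for.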
 ---
 \* There probably isn't a legitimate use case for 1) any more, we've solved 
 the issues which we used to have with image caching (in fact, the 
 clearCache() method actually no longer needs to be called by PDFRenderer, 
 though it currently is). The real problem is that it's easy to accidentally 
 retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() 
 method is dangerous as looping over it will cause pages to be retained during 
 processing, like so:
 {code}
 for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
 {
  // ... this is idiomatic in PDFBox 1.8
 } 
 // List returned by getAllPages() kept in scope until here (bad)
 {code}
 I added a couple of methods a while ago to avoid this by fetching each 
 PDPage one at a time, and this is now used internally in PDFBox to avoid the 
 memory problems we used to have:
 {code}
 for (int i = 0; i < document.getNumberOfPages(); i++)
 {
     PDPage page = document.getPage(i);
     // ... this is the new 2.0 way
     // current page falls out of scope here (good)
 }
 {code}
 To solve this problem, we could change getAllPages() so that instead of 
 returning a List it returns an Iterator<PDPage>, which would provide a nicer 
 API than getPage(int) and most existing code will continue to work. This is 
 also an opportunity to fix type safety issues due to PDPageNode and 
 incorrect handling of the page tree (this is similar to the issue we had 
 recently with the acroform field tree).
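
A rough sketch of the lazy page iteration proposed above, under stated assumptions: LazyPages is a hypothetical class, not an existing PDFBox type, and it implements Iterable rather than a bare Iterator so that existing for-each loops would still compile. The loader function stands in for something like getPage(int); each page is fetched only when the loop reaches it, so no list of all pages is held.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

public class LazyPages<T> implements Iterable<T> {
    private final int count;
    private final IntFunction<T> loader;   // hypothetical stand-in for getPage(int)

    public LazyPages(int count, IntFunction<T> loader) {
        this.count = count;
        this.loader = loader;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < count;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return loader.apply(next++);   // page is loaded on demand
            }
        };
    }

    public static void main(String[] args) {
        LazyPages<String> pages = new LazyPages<>(3, i -> "page " + i);
        for (String page : pages) {
            System.out.println(page);          // previous page is unreachable here
        }
    }
}
```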





[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178766#comment-14178766
 ] 

Tilman Hausherr commented on PDFBOX-2370:
-

I've done just a minimal restore re: popping the resource stack, so that the 
tests work again. 

But why has the try... finally part been removed? This would make sure that 
the stack is popped if an exception happens.
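
To illustrate the point about try/finally, here is a small sketch. ResourceStackDemo, processStream() and depth() are made-up names for illustration and are not the actual PDFBox stream engine code; the sketch only shows that popping in a finally block keeps the stack balanced when the body throws.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ResourceStackDemo {
    private final Deque<String> resources = new ArrayDeque<>();

    /** Pushes the stream's resources, runs the body, and always pops again. */
    void processStream(String streamResources, Runnable body) {
        resources.push(streamResources);
        try {
            body.run();
        } finally {
            resources.pop();   // runs on normal return and when body throws
        }
    }

    int depth() {
        return resources.size();
    }

    public static void main(String[] args) {
        ResourceStackDemo demo = new ResourceStackDemo();
        try {
            demo.processStream("page resources", () -> {
                throw new IllegalStateException("simulated parse error");
            });
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println("stack depth after failure: " + demo.depth());
    }
}
```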



[jira] [Commented] (PDFBOX-1980) TestCOSFloat is non-deterministic

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178847#comment-14178847
 ] 

ASF subversion and git services commented on PDFBOX-1980:
-

Commit 1633435 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633435 ]

PDFBOX-1980: fix javadoc format error

 TestCOSFloat is non-deterministic
 -

 Key: PDFBOX-1980
 URL: https://issues.apache.org/jira/browse/PDFBOX-1980
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: John Hewson
Priority: Minor
 Fix For: 2.0.0


 TestCOSFloat generates random numbers for testing which means that it is 
 non-deterministic.
 Testing COSFloat on random data doesn't achieve much, because we know what 
 numbers look like. Even taking into account the discussion in PDFBOX-1977, I 
 suggest that it would be better to create a set of representative data with 
 interesting edge-cases.
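
A possible shape for such a fixed data set, as a sketch only: FloatEdgeCases, CASES and roundTrips() are hypothetical names, and the check uses plain java.lang.Float string conversion as a stand-in, since the actual COSFloat assertions are not shown in this issue. The point is that the inputs are listed explicitly, so every run tests the same values.

```java
public class FloatEdgeCases {
    /** Fixed, representative inputs instead of random numbers. */
    static final float[] CASES = {
        0f, -0f, 1f, -1f, 0.5f, 1e-6f, 123456.79f,
        Float.MIN_VALUE, Float.MAX_VALUE
    };

    /** True if the value survives a float -> text -> float round trip bit for bit. */
    static boolean roundTrips(float value) {
        float parsed = Float.parseFloat(Float.toString(value));
        return Float.floatToIntBits(parsed) == Float.floatToIntBits(value);
    }

    public static void main(String[] args) {
        for (float value : CASES) {
            System.out.println(value + " round-trips: " + roundTrips(value));
        }
    }
}
```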





Jenkins build is back to normal : PDFBox-trunk » Apache PDFBox #1357

2014-10-21 Thread Apache Jenkins Server
See 
https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox/1357/changes



Jenkins build is back to normal : PDFBox-trunk #1357

2014-10-21 Thread Apache Jenkins Server
See https://builds.apache.org/job/PDFBox-trunk/1357/changes



download link broken

2014-10-21 Thread Tilman Hausherr

https://pdfbox.apache.org/download.cgi  shows this:

#!/bin/sh
# Wrapper script around mirrors.cgi script
# (we must change to that directory in order for python to pick up the
#  python includes correctly)
cd /www/www.apache.org/dyn/mirrors
/www/www.apache.org/dyn/mirrors/mirrors.cgi $*




RE: 2.0

2014-10-21 Thread Allison, Timothy B.
Maruan,
  Sounds good.  I'll add it to my todo list to write the wrapper...probably be 
good for me to start moving to 2.0 anyways. :)

-Original Message-
From: Maruan Sahyoun [mailto:sahy...@fileaffairs.de] 
Sent: Tuesday, October 21, 2014 1:50 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0

Tim, 

first, many thanks for the offer. I'd add that a comparison between 1.8 and 2.0 
would also be useful to detect differences, whether they stem from enhancements 
or regressions.

BR
Maruan


On 21.10.2014 at 19:42, Tilman Hausherr <thaush...@t-online.de> wrote:

 Hi Tim,
 
 2.0 doesn't seem to be released soon... what might be useful again is a 
 comparison between seq v non-seq, Andreas recently resolved an issue 
 (PDFBOX-2250) that improves the nonSeq parser a lot. Although this isn't 
 fully done, a follow-up issue PDFBOX-2441 
 https://issues.apache.org/jira/browse/PDFBOX-2441 has been opened which 
 will improve a few more complex files.
 
 Tilman
 
 
 
 On 21.10.2014 at 13:00, Allison, Timothy B. wrote:
 Been too busy over in Tika-land...just noticing this now.
 
 Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v 
 non-seq).  I won't have time to integrate 2.0 into our Tika PDFParser any 
 time soon (Jeremy Anderson on TIKA-1285 has already started this), but I 
 could easily write a lightweight wrapper around PDFBox's TextStripper + 
 metadata inside of the tika-batch/tika-eval framework.
 
 Cheers,
 
   Tim
 
 From: Andreas Lehmkühler [andr...@lehmi.de]
 Sent: Wednesday, October 15, 2014 6:20 AM
 To: dev@pdfbox.apache.org
 Subject: Re: 2.0
 
 Hi,
 
 
 Maruan Sahyoun <sahy...@fileaffairs.de> wrote on 15 October 2014 at 09:32:
 
 
 What about keeping both for the 2.0 release, making the NonSequential parser
 the default, and phasing the old one out for 3?
 Would also give us some time to work with Tim (TIKA) on the test suite.
 I agree, that's the only thing we can manage in a timely manner.
 
 
 Maybe we could simplify the variations of PDDocument.load to something like
 
 PDDocument.load(input, raf, enforce, useLegacyParser) or
 PDDocument.load(input, raf, enforce, withSignatureSupport) .
 
 and introduce PDDocument.load(input) to use the NonSequential
 
 
 WDYT?
 Good idea, I've already created PDFBOX-2430 for this.
 
 Maruan
 
 BR
 Andreas Lehmkühler
 On 15.10.2014 at 09:18, Timo Boehme <timo.boe...@ontochem.com> wrote:
 
 Hi,
 
 the difference between the parsers stems from the fact that the old parser
 can cope with a completely broken xref table because it uses the objects as
 it finds them on its sequential way. What we need (as I proposed before) is
 a repair mechanism scanning the file for object start/end to be used for
 re-creating the xref table.
 I will see if I can find some time to do this.
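
A very rough sketch of the scanning idea, under loud assumptions: XrefScanSketch and scan() are hypothetical names, the regex only looks for "<number> <generation> obj" markers, and it ignores complications such as marker-like bytes inside stream data, object streams, and "endobj" validation. It is not the repair code that ended up in PDFBox, just an illustration of rebuilding object offsets without trusting the xref table.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XrefScanSketch {
    // "<objNr> <genNr> obj" at the start of the data or directly after whitespace
    private static final Pattern OBJ_START =
            Pattern.compile("(?<!\\S)(\\d+)\\s+(\\d+)\\s+obj\\b");

    /** Maps "objNr genNr" to the byte offset where that object starts. */
    static Map<String, Integer> scan(byte[] pdf) {
        // ISO-8859-1 maps one byte to one char, so string indices are byte offsets
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Integer> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            // a later definition of the same object replaces an earlier one
            offsets.put(m.group(1) + " " + m.group(2), m.start(1));
        }
        return offsets;
    }

    public static void main(String[] args) {
        String sample = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n";
        System.out.println(scan(sample.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Letting later definitions win mirrors how incremental updates append newer versions of an object at the end of the file.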
 
 The only other stopper is, as Andreas has pointed out, the signing. I'm not
 familiar with this and don't know what needs to be done here.
 
 
 Best,
 Timo
 
 
 On 14.10.2014 at 21:18, Tilman Hausherr wrote:
 Here are some:
 
 055/055794.pdf
 082/082463.pdf
 108/108362.pdf
 113/113223.pdf
 115/115458.pdf
 115/115463.pdf
 122/122393.pdf
 129/129416.pdf
 133/133423.pdf
 148/148020.pdf
 152/152012.pdf
 161/161466.pdf
 
 to be found here:
 http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
 
 Tilman
 
 On 14.10.2014 at 21:06, John Hewson wrote:
 Unless somebody provides us with a list of those files, then I think
 this is an unreasonable request. As long as we continue to leave the
 old parser in PDFBox, we won't get the bug reports which we need to
 fix the new parser, and the situation will never resolve itself.
 Falling back to the old parser is just as bad - we won't get bug reports.
 
 -- John
 
 On 14 Oct 2014, at 07:39, Tilman Hausherr thaush...@t-online.de wrote:
 
 I prefer that the old parser not be removed, because there are many
 files that can only be parsed by the old parser. This came out in a
 large scale test with TIKA.
 
 The best idea (in my current opinion) is to use the nonSeq parser
 first, and the old parser if there is an exception.
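
The fallback idea could be sketched like this. FallbackParsing, Parser and parseWithFallback() are hypothetical names for illustration; they are not part of the PDFBox API, and the two parsers are passed in as plain functions. The first failure is kept as a suppressed exception so that it is not lost if the fallback fails as well.

```java
import java.io.IOException;

public class FallbackParsing {
    /** Minimal parser abstraction (hypothetical, for illustration only). */
    interface Parser<T> {
        T parse(byte[] data) throws IOException;
    }

    /** Tries the primary parser and falls back to the second one on IOException. */
    static <T> T parseWithFallback(Parser<T> primary, Parser<T> fallback, byte[] data)
            throws IOException {
        try {
            return primary.parse(data);
        } catch (IOException primaryFailure) {
            try {
                return fallback.parse(data);
            } catch (IOException fallbackFailure) {
                fallbackFailure.addSuppressed(primaryFailure);   // keep both causes
                throw fallbackFailure;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Parser<String> nonSeq = data -> { throw new IOException("broken xref"); };
        Parser<String> legacy = data -> "parsed sequentially";
        System.out.println(parseWithFallback(nonSeq, legacy, new byte[0]));
    }
}
```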
 
 Tilman
 
 On 14.10.2014 at 09:45, Timo Boehme wrote:
 Hi,
 
 On 14.10.2014 at 07:22, John Hewson wrote:
 Hi,
 John Hewson <j...@jahewson.com> wrote on 10 October 2014 at 20:05:
 
 
 - Parsing (Andreas?)
 I guess we won't get a complete new parser in 2.0, but I try to
 improve the XRef
 and the COSStream stuff
 It would be great if we could get rid of the old parser and switch
 to the non-sequential
 parser, WDYT?
 I would also propose to completely remove the old parser. That way
 we are more flexible in parsing streams etc. since parts of the
 non-sequential parser are a compromise to work side-by-side with the
 old parser.
 Possibly there are a small number of functions for which the old
 parser is still needed - e.g. signing?
 
 
 Best,
 Timo
 
 
 
 --
 
 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 

[jira] [Comment Edited] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read

2014-10-21 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179171#comment-14179171
 ] 

John Hewson edited comment on PDFBOX-2403 at 10/21/14 9:44 PM:
---

I've updated my version of Adobe Preflight to 11.0.9, which is the same version 
you have and I get no errors when verifying the pdfa1b.pdf file for compliance 
with PDF/A-1b. The fonts in the file are all embedded, so the errors which 
you're seeing just don't match the file. I'd start by double checking that the 
file attached to this issue is the same as the one you're testing.


was (Author: jahewson):
I've updated by version of Adobe Preflight to 11.0.9, which is the same version 
you have and I get no errors when verifying the pdfa1b.pdf file for compliance 
with PDF/A-1b. The fonts in the file are all embedded, so the errors which 
you're seeing just don't match the file. I'd start by double checking that the 
file attached to this issue is the same as the one you're testing.

 false negative? Font damaged, The FontFile can't be read
 --

 Key: PDFBOX-2403
 URL: https://issues.apache.org/jira/browse/PDFBOX-2403
 Project: PDFBox
  Issue Type: Bug
  Components: Preflight
Affects Versions: 2.0.0
 Environment: deb7, java 7
Reporter: Ralf Hauser
 Fix For: 2.0.0

 Attachments: Konformität mit PDF_A-1b prüfen.pdf, 
 Problems_pdfa1b.pdf_07.10.2014_001.pdf, patch2403JavaDoc.txt, 
 patchBetterErrorMessages.txt, patchPDFBOX-2403.txt, 
 patchPDFBOX-2403Type1.txt, pdfA_Validation_Report.eml, pdfa1b.pdf, 
 pdfa1b_againstPDFA1a_report, pdfa1b_againstPDFA1b_report, 
 pdfa1b_summary_0001.pdf, report, reportforfile_pdfa1b, validation_report.xml


 - 1: 3.2.1 : Font damaged, The FontFile can't be read
  - 2: 3.2.1 : Font damaged, The FontFile can't be read
  - 3: 3.1.6 : Invalid Font definition, Width of the character 48 in the 
 font program SURPPV+HeiseiMaruGoStd-W8-Identity-H is inconsistent with the 
 width in the PDF dictionary.
  - 4: 3.1.6 : Invalid Font definition, Width of the character 36 in the 
 font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the 
 width in the PDF dictionary.
  - 5: 3.3.1 : Glyph error, The character 74 in the font program 
 OIZFRF+KozMinProVI-Regular-Identity-H is missing from the Charater Encoding.
  - 6: 3.1.6 : Invalid Font definition, Width of the character 80 in the 
 font program OIZFRF+KozMinProVI-Regular-Identity-H is inconsistent with the 
 width in the PDF dictionary.
  - 7: 3.1.6 : Invalid Font definition, Width of the character 420 in the 
 font program RRATCX+MathematicalPiLTStd-Identity-H is inconsistent with the 
 width in the PDF dictionary.
 possibly related to PDFBOX-2299?





[jira] [Commented] (PDFBOX-2403) false negative? Font damaged, The FontFile can't be read

2014-10-21 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179171#comment-14179171
 ] 

John Hewson commented on PDFBOX-2403:
-

I've updated my version of Adobe Preflight to 11.0.9, which is the same version 
you have and I get no errors when verifying the pdfa1b.pdf file for compliance 
with PDF/A-1b. The fonts in the file are all embedded, so the errors which 
you're seeing just don't match the file. I'd start by double checking that the 
file attached to this issue is the same as the one you're testing.



[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179177#comment-14179177
 ] 

John Hewson commented on PDFBOX-2370:
-

Thanks Tilman, I'll take a look at the problem files. The finally and stack 
popping removal was a mistake, I had been experimenting with those lines.



[jira] [Comment Edited] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179177#comment-14179177
 ] 

John Hewson edited comment on PDFBOX-2370 at 10/21/14 9:48 PM:
---

Thanks Tilman, I'll take a look at the problem files. The finally and stack 
popping removal was a mistake, I had been experimenting with those lines.


was (Author: jahewson):
Thanks Tilman, I'll take a look at the problem files. The finally and stack 
popping removal was an mistake, I had been experimenting with those lines.



[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179180#comment-14179180
 ] 

ASF subversion and git services commented on PDFBOX-2370:
-

Commit 1633472 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633472 ]

PDFBOX-2370: Fix, pop stack in finally

 Move caching outside of PDResources
 ---

 Key: PDFBOX-2370
 URL: https://issues.apache.org/jira/browse/PDFBOX-2370
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Priority: Critical
 Fix For: 2.0.0


 *Note:* This issue is based on a discussion which occurred regarding 
 PDFBOX-2301 but is actually a separate issue.
 Currently we cache the page resources in PDResources which belongs to a 
 specific PDPage. This causes two problems, 1) users who want to hold many 
 PDPage objects in memory will have high memory use (but this is often by 
 accident*). 2) By caching resources in PDPage we only get to keep that cache 
 for the lifetime of the page, which e.g. in PDFRenderer is a single page 
 only. That means that a font which appears on 40 pages has to be parsed 40 
 times, which causes slow running times, but also memory thrashing as objects 
 are destroyed frequently only to be re-created.
 What PDFRenderer really needs is not page-wide caching but document-wide 
 caching, so that it can cache fonts, cmaps, color profiles, etc. only once. 
 But that won't work for images, because they're too large. What we're 
 beginning to realise is that caching is use-case specific and probably 
 shouldn't be built-in to PDFBox's pdmodel. Instead we should removing 
 resource caching from PDPage/PDResources and implement custom caching in 
 PDFRenderer and other downstream classes such as PDFTextStripper. I'll 
 happily volunteer myself. The existing high-level PDFBox APIs will continue 
 to just work and power users will get a level of control that they 
 appreciate.
 This strategy could be enhanced by removing memory-hungry methods on 
 PDResources such as getFonts() and getXObjects() which force all resources of 
 a particular type to be loaded, whether or not they are needed, or actually 
 used in the content stream. They would be replaced by methods to retrieve a 
 single resource, e.g. getFont(name).
 ---
 \* There probably isn't a legitimate use case for 1) any more: we've solved 
 the issues which we used to have with image caching (in fact, the 
 clearCache() method no longer needs to be called by PDFRenderer, 
 though it currently is). The real problem is that it's easy to accidentally 
 retain PDPage objects. The PDDocument#getDocumentCatalog().getAllPages() 
 method is dangerous, as looping over it will cause pages to be retained during 
 processing, like so:
 {code}
 // getAllPages() returns a java.util.List
 for (PDPage page : document.getDocumentCatalog().getAllPages())
 {
     // ... this is idiomatic in PDFBox 1.8
 }
 // List returned by getAllPages() kept in scope until here (bad)
 {code}
 I added a couple of methods a while ago to avoid this by fetching each 
 PDPage one at a time, and this is now used internally in PDFBox to avoid the 
 memory problems we used to have:
 {code}
 for (int i = 0; i < document.getNumberOfPages(); i++)
 {
     PDPage page = document.getPage(i);
     // ... this is the new 2.0 way
     // current page falls out of scope here (good)
 }
 {code}
 To solve this problem, we could change getAllPages() so that instead of 
 returning a List it returns an Iterator<PDPage>, which would provide a nicer 
 API than getPage(int) and most existing code will continue to work. This is 
 also an opportunity to fix type safety issues due to PDPageNode and 
 incorrect handling of the page tree (this is similar to the issue we had 
 recently with the acroform field tree).
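To make the iterator proposal concrete, here is a minimal sketch of lazy page iteration. LazyPages is a hypothetical stand-in, not the PDFBox API, and the IntFunction plays the role of getPage(int). The sketch implements Iterable rather than returning a bare Iterator, on the assumption that existing for-each loops should keep compiling (Java's enhanced for loop accepts an Iterable or an array); whether the real API takes that route is not established here.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

/** Hypothetical stand-in for a page collection; not the PDFBox API. */
class LazyPages<P> implements Iterable<P>
{
    private final int count;
    private final IntFunction<P> pageLoader;

    LazyPages(int count, IntFunction<P> pageLoader)
    {
        this.count = count;
        this.pageLoader = pageLoader;
    }

    @Override
    public Iterator<P> iterator()
    {
        return new Iterator<P>()
        {
            private int next = 0;

            @Override
            public boolean hasNext()
            {
                return next < count;
            }

            @Override
            public P next()
            {
                if (!hasNext())
                {
                    throw new NoSuchElementException();
                }
                // The page is created on demand and never stored by the iterator,
                // so earlier pages become unreachable once the caller drops them.
                return pageLoader.apply(next++);
            }
        };
    }
}
```

A caller would write `for (P page : pages) { ... }` exactly as before, but no List holding every page is kept alive for the duration of the loop.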



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179237#comment-14179237
 ] 

Tilman Hausherr commented on PDFBOX-2370:
-

You don't have to look at the problem files anymore; their problem was caused 
by the non-popping.

 Move caching outside of PDResources
 ---

 Key: PDFBOX-2370
 URL: https://issues.apache.org/jira/browse/PDFBOX-2370
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Priority: Critical
 Fix For: 2.0.0




[jira] [Commented] (PDFBOX-2391) Use an enum for RenderingIntent

2014-10-21 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179295#comment-14179295
 ] 

John Hewson commented on PDFBOX-2391:
-

I can't reproduce that exception.

 Use an enum for RenderingIntent
 ---

 Key: PDFBOX-2391
 URL: https://issues.apache.org/jira/browse/PDFBOX-2391
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Minor
 Fix For: 2.0.0


 The rendering intent in the graphics state is currently a String; we should 
 replace it with a RenderingIntent enum.
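A minimal sketch of what such an enum could look like follows. It is an illustration only, and the enum that ends up in PDFBox may differ. The four names are the standard rendering intents from the PDF specification. The fromString and pdfName helpers are assumptions, as is the fallback to RelativeColorimetric for unrecognized names, which reflects my reading of the specification's default.

```java
/** Sketch only; the enum PDFBox actually ships may differ. */
enum RenderingIntent
{
    ABSOLUTE_COLORIMETRIC("AbsoluteColorimetric"),
    RELATIVE_COLORIMETRIC("RelativeColorimetric"),
    SATURATION("Saturation"),
    PERCEPTUAL("Perceptual");

    private final String pdfName;

    RenderingIntent(String pdfName)
    {
        this.pdfName = pdfName;
    }

    /** The name as it would be written in the PDF graphics state. */
    String pdfName()
    {
        return pdfName;
    }

    /** Maps a PDF name to a constant; unknown names fall back to RELATIVE_COLORIMETRIC. */
    static RenderingIntent fromString(String value)
    {
        for (RenderingIntent intent : values())
        {
            if (intent.pdfName.equals(value))
            {
                return intent;
            }
        }
        return RELATIVE_COLORIMETRIC; // assumed default for unrecognized intents
    }
}
```

Compared with a raw String, callers can switch over the constants and the compiler catches misspelled intent names.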





[jira] [Closed] (PDFBOX-1329) Update PDPage to enum

2014-10-21 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson closed PDFBOX-1329.
---
Resolution: Fixed

Rather than making the page sizes an enum, I moved them to PDRectangle.

 Update PDPage to enum
 -

 Key: PDFBOX-1329
 URL: https://issues.apache.org/jira/browse/PDFBOX-1329
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 1.8.0
 Environment: Linux, UBUNTU 12.04, openjdk-7
Reporter: Jens Kapitza
Priority: Minor
 Fix For: 2.0.0

 Attachments: change_pdpage.diff

   Original Estimate: 1h
  Remaining Estimate: 1h







[jira] [Commented] (PDFBOX-1329) Update PDPage to enum

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179361#comment-14179361
 ] 

ASF subversion and git services commented on PDFBOX-1329:
-

Commit 1633490 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633490 ]

PDFBOX-1329: Move page size constants from PDPage to PDRectangle, and clean up 
PDPage.

 Update PDPage to enum
 -

 Key: PDFBOX-1329
 URL: https://issues.apache.org/jira/browse/PDFBOX-1329
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 1.8.0
 Environment: Linux, UBUNTU 12.04, openjdk-7
Reporter: Jens Kapitza
Priority: Minor
 Fix For: 2.0.0

 Attachments: change_pdpage.diff

   Original Estimate: 1h
  Remaining Estimate: 1h







[jira] [Commented] (PDFBOX-1329) Update PDPage to enum

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179367#comment-14179367
 ] 

ASF subversion and git services commented on PDFBOX-1329:
-

Commit 1633492 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633492 ]

PDFBOX-1329: Removed comment

 Update PDPage to enum
 -

 Key: PDFBOX-1329
 URL: https://issues.apache.org/jira/browse/PDFBOX-1329
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 1.8.0
 Environment: Linux, UBUNTU 12.04, openjdk-7
Reporter: Jens Kapitza
Priority: Minor
 Fix For: 2.0.0

 Attachments: change_pdpage.diff

   Original Estimate: 1h
  Remaining Estimate: 1h







[jira] [Updated] (PDFBOX-2423) Page tree handling needs rewriting

2014-10-21 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2423:

Priority: Blocker  (was: Critical)

 Page tree handling needs rewriting
 --

 Key: PDFBOX-2423
 URL: https://issues.apache.org/jira/browse/PDFBOX-2423
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7, 2.0.0
Reporter: John Hewson
Priority: Blocker
 Fix For: 2.0.0


 The way in which PDFBox handles the page tree needs to be rewritten, 
 preferably from scratch. Currently the document catalog returns the raw 
 objects from the page tree, wrapped in either a PDPage or a PDPageNode.
 We need to abstract over the page tree and get rid of PDPageNode; we should 
 provide methods which add/remove PDPage objects *only*. The existing 
 low-level access to the page tree is not needed at the PD-level.
 Inheritance of page properties such as crop box, resources, and rotation 
 should be reimplemented to use whatever new page tree abstraction we invent. 
 We can finally remove the old broken methods which didn't look up the 
 inheritance tree when retrieving these values.
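The inheritance lookup described above can be sketched as a walk from a page towards the root of the tree. TreeNode and getInheritable are hypothetical names used for illustration, and the plain Map stands in for a page or page-tree-node dictionary. This is not the abstraction PDFBox ends up with, only the general technique.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical page tree node; PDFBox's own classes are not used here. */
class TreeNode
{
    private final TreeNode parent;
    private final Map<String, Object> entries = new HashMap<>();

    TreeNode(TreeNode parent)
    {
        this.parent = parent;
    }

    /** Sets an entry on this node only; returns this for chaining. */
    TreeNode put(String key, Object value)
    {
        entries.put(key, value);
        return this;
    }

    /** Walks towards the root until some node defines the key; null if none does. */
    Object getInheritable(String key)
    {
        for (TreeNode node = this; node != null; node = node.parent)
        {
            Object value = node.entries.get(key);
            if (value != null)
            {
                return value;
            }
        }
        return null;
    }
}
```

For example, a page that defines its own CropBox but no Rotate would return its own crop box, while Rotate would be taken from the nearest ancestor that sets it.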





[jira] [Updated] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2370:

Priority: Blocker  (was: Critical)

 Move caching outside of PDResources
 ---

 Key: PDFBOX-2370
 URL: https://issues.apache.org/jira/browse/PDFBOX-2370
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Priority: Blocker
 Fix For: 2.0.0




[jira] [Assigned] (PDFBOX-2370) Move caching outside of PDResources

2014-10-21 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson reassigned PDFBOX-2370:
---

Assignee: John Hewson

 Move caching outside of PDResources
 ---

 Key: PDFBOX-2370
 URL: https://issues.apache.org/jira/browse/PDFBOX-2370
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0




[jira] [Assigned] (PDFBOX-2423) Page tree handling needs rewriting

2014-10-21 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson reassigned PDFBOX-2423:
---

Assignee: John Hewson

 Page tree handling needs rewriting
 --

 Key: PDFBOX-2423
 URL: https://issues.apache.org/jira/browse/PDFBOX-2423
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7, 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0




[jira] [Assigned] (PDFBOX-2428) An error occured when reading table hmtx

2014-10-21 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson reassigned PDFBOX-2428:
---

Assignee: John Hewson

 An error occured when reading table hmtx
 

 Key: PDFBOX-2428
 URL: https://issues.apache.org/jira/browse/PDFBOX-2428
 Project: PDFBox
  Issue Type: Bug
  Components: FontBox
Affects Versions: 1.8.8
Reporter: simon steiner
Assignee: John Hewson
 Attachments: ttsubset_pdfa.pdf


 java -cp 
 pdfbox/preflight/target/preflight-1.8.8-SNAPSHOT.jar:pdfbox/app/target/pdfbox-app-1.8.8-SNAPSHOT.jar:pdfbox/xmpbox/target/xmpbox-1.8.8-SNAPSHOT.jar:lib/commons-io-1.3.1.jar
  org.apache.pdfbox.preflight.Validator_A1b ttsubset_pdfa.pdf
 SEVERE: An error occured when reading table hmtx
 java.io.EOFException
   at 
 org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:139)





[jira] [Commented] (PDFBOX-2333) Overhaul the apperance generation for PDF forms

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179388#comment-14179388
 ] 

ASF subversion and git services commented on PDFBOX-2333:
-

Commit 1633495 from [~msahyoun] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633495 ]

PDFBOX-2333 enhance alignment for single line text fields

 Overhaul the apperance generation for PDF forms
 ---

 Key: PDFBOX-2333
 URL: https://issues.apache.org/jira/browse/PDFBOX-2333
 Project: PDFBox
  Issue Type: Improvement
  Components: AcroForm
Affects Versions: 2.0.0
Reporter: Maruan Sahyoun
Priority: Critical
 Fix For: 2.0.0

 Attachments: AcroForms-SimpleTextFields.1.8.7.pdf, 
 AcroForms-SimpleTextFields.1.8.7.png, AcroForms-SimpleTextFields.pdf


 The appearance handling for forms in 1.x is limited and does not reflect all 
 settings possible for form fields. In addition the current code is not very 
 modular and does not follow the box model used for form fields. 
 Unfortunately only the basics of form handling are defined in the PDF spec. 
 The details like padding of boxes, text placement etc. have to be determined 
 by looking at how Adobe forms are generated.
 Update: The file from PDFBOX-2310 has bad rendering which might be related?





[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179389#comment-14179389
 ] 

ASF subversion and git services commented on PDFBOX-2423:
-

Commit 1633496 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633496 ]

PDFBOX-2423: Clean up PDDocumentCatalog formatting

 Page tree handling needs rewriting
 --

 Key: PDFBOX-2423
 URL: https://issues.apache.org/jira/browse/PDFBOX-2423
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7, 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0




[jira] [Updated] (PDFBOX-2333) Overhaul the apperance generation for PDF forms

2014-10-21 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-2333:
---
Attachment: AlignmentTests-pre1633495.pdf

Test file filled prior to rev. 1633495. The upper fields are filled by PDFBox; 
the lower fields (...-Filled) are filled by Acrobat.

 Overhaul the apperance generation for PDF forms
 ---

 Key: PDFBOX-2333
 URL: https://issues.apache.org/jira/browse/PDFBOX-2333
 Project: PDFBox
  Issue Type: Improvement
  Components: AcroForm
Affects Versions: 2.0.0
Reporter: Maruan Sahyoun
Priority: Critical
 Fix For: 2.0.0

 Attachments: AcroForms-SimpleTextFields.1.8.7.pdf, 
 AcroForms-SimpleTextFields.1.8.7.png, AcroForms-SimpleTextFields.pdf, 
 AlignmentTests-pre1633495.pdf




[jira] [Updated] (PDFBOX-2333) Overhaul the apperance generation for PDF forms

2014-10-21 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-2333:
---
Attachment: AlignmentTests-post1633495.pdf

Test file filled after rev. 1633495.

 Overhaul the apperance generation for PDF forms
 ---

 Key: PDFBOX-2333
 URL: https://issues.apache.org/jira/browse/PDFBOX-2333
 Project: PDFBox
  Issue Type: Improvement
  Components: AcroForm
Affects Versions: 2.0.0
Reporter: Maruan Sahyoun
Priority: Critical
 Fix For: 2.0.0

 Attachments: AcroForms-SimpleTextFields.1.8.7.pdf, 
 AcroForms-SimpleTextFields.1.8.7.png, AcroForms-SimpleTextFields.pdf, 
 AlignmentTests-post1633495.pdf, AlignmentTests-pre1633495.pdf




[jira] [Commented] (PDFBOX-2333) Overhaul the apperance generation for PDF forms

2014-10-21 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179411#comment-14179411
 ] 

Maruan Sahyoun commented on PDFBOX-2333:


I’ve enhanced the alignment of single line text fields, because
 - left aligned text had too much padding applied on the left
 - centered text had padding applied on the left, so it was no longer centered
 - right aligned text had no padding applied, so it overlapped with borders

In addition I added some special handling for corner cases to match Acrobat's 
behavior. This needs to be verified with additional files.
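To illustrate the three cases listed above, here is a minimal sketch of computing the horizontal start position of a single line of text. It is not the code from rev. 1633495. TextAlignment and startX are hypothetical names, the padding model is an assumption, and Acrobat's corner cases are not covered. The constants follow the quadding values used for form fields in the PDF specification (0 left, 1 centered, 2 right).

```java
/** Hypothetical helper; names and the padding model are assumptions. */
class TextAlignment
{
    static final int LEFT = 0;
    static final int CENTER = 1;
    static final int RIGHT = 2;

    /**
     * Returns the x offset, measured from the left edge of the field box,
     * at which the text should start.
     */
    static float startX(int quadding, float boxWidth, float textWidth, float padding)
    {
        switch (quadding)
        {
            case CENTER:
                // Centered text ignores padding, so it stays symmetric in the box.
                return (boxWidth - textWidth) / 2f;
            case RIGHT:
                // Padding keeps the text clear of the right border.
                return boxWidth - padding - textWidth;
            default:
                // Left-aligned text is offset by the padding exactly once.
                return padding;
        }
    }
}
```

With a box 100 units wide, text 40 units wide and a padding of 2, this gives start positions of 2 (left), 30 (centered) and 58 (right).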

 Overhaul the apperance generation for PDF forms
 ---

 Key: PDFBOX-2333
 URL: https://issues.apache.org/jira/browse/PDFBOX-2333
 Project: PDFBox
  Issue Type: Improvement
  Components: AcroForm
Affects Versions: 2.0.0
Reporter: Maruan Sahyoun
Priority: Critical
 Fix For: 2.0.0

 Attachments: AcroForms-SimpleTextFields.1.8.7.pdf, 
 AcroForms-SimpleTextFields.1.8.7.png, AcroForms-SimpleTextFields.pdf, 
 AlignmentTests-post1633495.pdf, AlignmentTests-pre1633495.pdf




[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179423#comment-14179423
 ] 

ASF subversion and git services commented on PDFBOX-2423:
-

Commit 1633501 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633501 ]

PDFBOX-2423: Made page mode and layout constants in PDDocumentCatalog into enums

 Page tree handling needs rewriting
 --

 Key: PDFBOX-2423
 URL: https://issues.apache.org/jira/browse/PDFBOX-2423
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7, 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0




[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179424#comment-14179424
 ] 

ASF subversion and git services commented on PDFBOX-2423:
-

Commit 1633502 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633502 ]

PDFBOX-2423: Replaced calls to PDDocumentCatalog#getCOSDictionary with 
getCOSObject

 Page tree handling needs rewriting
 --

 Key: PDFBOX-2423
 URL: https://issues.apache.org/jira/browse/PDFBOX-2423
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7, 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0




[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179428#comment-14179428
 ] 

ASF subversion and git services commented on PDFBOX-2423:
-

Commit 1633503 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633503 ]

PDFBOX-2423: Fix bug with AcroForm caching

 Page tree handling needs rewriting
 --

 Key: PDFBOX-2423
 URL: https://issues.apache.org/jira/browse/PDFBOX-2423
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7, 2.0.0
Reporter: John Hewson
Assignee: John Hewson
Priority: Blocker
 Fix For: 2.0.0




[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179436#comment-14179436
 ] 

ASF subversion and git services commented on PDFBOX-2423:
-

Commit 1633505 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633505 ]

PDFBOX-2423: More cleaning up of PDDocumentCatalog



[jira] [Commented] (PDFBOX-2423) Page tree handling needs rewriting

2014-10-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179448#comment-14179448
 ] 

ASF subversion and git services commented on PDFBOX-2423:
-

Commit 1633506 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1633506 ]

PDFBOX-2423: Clean up PDPageNode
