[jira] [Commented] (PDFBOX-2397) Running within an Applet throws an AccessControlException

2014-12-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237584#comment-14237584
 ] 

Andreas Lehmkühler commented on PDFBOX-2397:


[~tilman] Any updates on this topic, or should we simply postpone this issue?

 Running within an Applet throws an AccessControlException
 -

 Key: PDFBOX-2397
 URL: https://issues.apache.org/jira/browse/PDFBOX-2397
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7
 Environment: JRE 7u67 or JRE 6u45 (Windows 7 SP1 64bit)
Reporter: Bertrand Gillis
Assignee: Tilman Hausherr
 Fix For: 1.8.8


 As soon as PDFBox is embedded in a signed applet, the following exception is 
 thrown when I try to print a PDF document through PDFBox:
 {code}
 Caused by: java.security.AccessControlException: access denied (java.util.PropertyPermission org.apache.pdfbox.ICC_override_color read)
   at java.security.AccessControlContext.checkPermission(Unknown Source)
   at java.security.AccessController.checkPermission(Unknown Source)
   at java.lang.SecurityManager.checkPermission(Unknown Source)
   at sun.plugin2.applet.AWTAppletSecurityManager.checkPermission(Unknown Source)
   at java.lang.SecurityManager.checkPropertyAccess(Unknown Source)
   at java.lang.System.getProperty(Unknown Source)
   at java.lang.Integer.getInteger(Unknown Source)
   at java.lang.Integer.getInteger(Unknown Source)
   at java.awt.Color.getColor(Unknown Source)
   at java.awt.Color.getColor(Unknown Source)
   at org.apache.pdfbox.pdmodel.graphics.color.PDColorState.<clinit>(PDColorState.java:50)
 {code}
 The same issue existed in earlier PDFBox versions for the following statement:
 {code:title=BaseParser.java}
 FORCE_PARSING = Boolean.getBoolean("org.apache.pdfbox.forceParsing");
 {code}
 It was fixed in later versions:
 {code:title=BaseParser.java}
   static {
     try {
       FORCE_PARSING = Boolean.getBoolean("org.apache.pdfbox.forceParsing");
     }
     catch (SecurityException e) {
       // ignored: the property cannot be read inside the sandbox
     }
   }
 {code}
 Unfortunately, this fix has not been applied to the current property:
 {code:title=PDColorState.java}
 private static volatile Color iccOverrideColor = 
     Color.getColor("org.apache.pdfbox.ICC_override_color");
 {code}
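
 A minimal sketch of how the same guard might look for the colour property, 
 assuming the fix simply mirrors the BaseParser one. The class and method names 
 (IccOverrideColorSketch, readOverrideColor) are hypothetical and not PDFBox 
 code; only Color.getColor and SecurityException are taken from the report.

```java
import java.awt.Color;

// Hypothetical helper, not part of PDFBox: it only illustrates the guard pattern.
final class IccOverrideColorSketch {

    // Returns the colour stored in the given system property, or null when the
    // property is unset, cannot be parsed as an integer, or may not be read
    // (for example inside an applet sandbox).
    static Color readOverrideColor(String propertyName) {
        try {
            return Color.getColor(propertyName);
        } catch (SecurityException e) {
            // Same idea as the BaseParser fix: treat "not allowed" like "not set".
            return null;
        }
    }

    public static void main(String[] args) {
        // Prints null unless the property has been set on the command line.
        System.out.println(readOverrideColor("org.apache.pdfbox.ICC_override_color"));
    }
}
```

 With such a helper the static field initializer would no longer fail class 
 loading when the security manager denies the property read; whether a null 
 override is acceptable to the rest of PDColorState is not verified here.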



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PDFBOX-2397) Running within an Applet throws an AccessControlException

2014-12-08 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2397:

Fix Version/s: (was: 1.8.8)

 Running within an Applet throws an AccessControlException
 -

 Key: PDFBOX-2397
 URL: https://issues.apache.org/jira/browse/PDFBOX-2397
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7
 Environment: JRE 7u67 or JRE 6u45 (Windows 7 SP1 64bit)
Reporter: Bertrand Gillis
Assignee: Tilman Hausherr






[jira] [Commented] (PDFBOX-2397) Running within an Applet throws an AccessControlException

2014-12-08 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237593#comment-14237593
 ] 

Tilman Hausherr commented on PDFBOX-2397:
-

Postpone, for the reason I mentioned on Nov. 3rd, until either [~bgillis] comes 
back or somebody else comes along who is willing to run an applet with the 
modified code.

 Running within an Applet throws an AccessControlException
 -

 Key: PDFBOX-2397
 URL: https://issues.apache.org/jira/browse/PDFBOX-2397
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7
 Environment: JRE 7u67 or JRE 6u45 (Windows 7 SP1 64bit)
Reporter: Bertrand Gillis
Assignee: Tilman Hausherr






[jira] [Resolved] (PDFBOX-2512) OutOfMemory while signing large documents

2014-12-08 Thread Thomas Chojecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Chojecki resolved PDFBOX-2512.
-
   Resolution: Fixed
Fix Version/s: 1.8.8

There is still one point open, but with the workaround mentioned in the 
comment, this issue is resolved.

 OutOfMemory while signing large documents
 -

 Key: PDFBOX-2512
 URL: https://issues.apache.org/jira/browse/PDFBOX-2512
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing, Signing
Affects Versions: 1.8.7
Reporter: Thomas Chojecki
Assignee: Thomas Chojecki
 Fix For: 1.8.8

 Attachments: keystore.p12


 While working with large documents, we found some memory issues.
 1. The method close() in COSDocument clones the object pool and does not 
 clean it up properly. The cloning in getObjects() causes an OutOfMemoryError.
 2. The COSWriter copies the whole PDF into memory for signing and does not 
 use a BufferedInputStream for the FileInputStream, which also has a big 
 performance impact. (PDFBOX-1798)
 3. The cloning of COSStreams causes an OutOfMemoryError.
 I used the CreateSignature example with a document of about 150 MB from here:
 https://cdn-reichelt.de/bilder/downloads/reichelt_01-2015_DE_B_HQ.pdf
 Additionally, I added a RandomAccessFile to the PDDocument.load call in the 
 CreateSignature class:
 PDDocument doc = PDDocument.load(document, new RandomAccessFile(new 
 File("d:\\temp.bin"), "rw")); (this prevents the OOM in the third case)
 Using a BufferedInputStream in case two reduces the signing time 
 from more than 5 minutes to less than 1 minute. 
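
 To illustrate case two, here is a small sketch of streaming a file through a 
 BufferedInputStream instead of loading it into memory. It is not the COSWriter 
 code: the class and method names are hypothetical, the SHA-256 digest is only a 
 stand-in for the signing pass, and it assumes the slowdown came from many small 
 reads against an unbuffered FileInputStream, which the report does not state.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical example, not PDFBox code.
final class BufferedDigestSketch {

    // Hashes a stream byte by byte. The BufferedInputStream turns the many
    // one-byte reads into a few large reads on the underlying stream, and the
    // content is never held in memory as a whole.
    static String sha256Hex(InputStream raw) {
        try (InputStream in = new BufferedInputStream(raw, 64 * 1024)) {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            int b;
            while ((b = in.read()) != -1) {
                digest.update((byte) b);
            }
            StringBuilder hex = new StringBuilder();
            for (byte value : digest.digest()) {
                hex.append(String.format("%02x", value & 0xff));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }
}
```

 For a file on disk the call would look like 
 BufferedDigestSketch.sha256Hex(new FileInputStream("large.pdf")), where 
 large.pdf is a placeholder name.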



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PDFBOX-2512) OutOfMemory while signing large documents

2014-12-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237747#comment-14237747
 ] 

Andreas Lehmkühler commented on PDFBOX-2512:


Are these changes limited to the 1.8-branch or should we add them to the trunk 
as well?

 OutOfMemory while signing large documents
 -

 Key: PDFBOX-2512
 URL: https://issues.apache.org/jira/browse/PDFBOX-2512
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing, Signing
Affects Versions: 1.8.7
Reporter: Thomas Chojecki
Assignee: Thomas Chojecki
 Fix For: 1.8.8

 Attachments: keystore.p12







[jira] [Commented] (PDFBOX-2512) OutOfMemory while signing large documents

2014-12-08 Thread Thomas Chojecki (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237806#comment-14237806
 ] 

Thomas Chojecki commented on PDFBOX-2512:
-

If we can port it, we should. These are only small changes that improve 
performance and solve the OOM problem.

 OutOfMemory while signing large documents
 -

 Key: PDFBOX-2512
 URL: https://issues.apache.org/jira/browse/PDFBOX-2512
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing, Signing
Affects Versions: 1.8.7
Reporter: Thomas Chojecki
Assignee: Thomas Chojecki
 Fix For: 1.8.8

 Attachments: keystore.p12







[jira] [Commented] (PDFBOX-1351) False paragraph caused by superscript (1.7 regression)

2014-12-08 Thread Merijn Wijngaard (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238010#comment-14238010
 ] 

Merijn Wijngaard commented on PDFBOX-1351:
--

This problem still persists in pdfbox 1.8.7. Using superscript doesn't sound 
like a rare use case to me, so it would be nice if this could be fixed. 
Inlining the superscript for text output seems like the best solution to me.

 False paragraph caused by superscript (1.7 regression)
 --

 Key: PDFBOX-1351
 URL: https://issues.apache.org/jira/browse/PDFBOX-1351
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.7.0
Reporter: Daniel Bonniot de Ruisselet
 Attachments: PDFParaTest.java, superscript.pdf


 On the attached minimal example document, text extraction seems to be 
 confused by the superscript, and generates three paragraphs where there is 
 only one.
 Note that 1.6 processes this case correctly:
 {noformat}
 $ java -jar /dev/shm/pdfbox-app-1.6.0.jar ExtractText /tmp/superscript.pdf
 Jun 29, 2012 4:52:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
 WARNING: expected='%%EOF' actual='5 0 obj '
 $ cat /tmp/superscript.txt 
   
 Multiple synthetic routes have been described by R. Filler et al.11 regarding 
 1,3-
 Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
  
  
 $ java -jar /dev/shm/pdfbox-app-1.7.0.jar ExtractText /tmp/superscript.pdf 
 Jun 29, 2012 4:52:39 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
 WARNING: expected='%%EOF' actual='5 0 obj '
 $ cat /tmp/superscript.txt 
   
 Multiple synthetic routes have been described by R. Filler et al.
 11
  regarding 1,3-
 Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
  
  
 {noformat}





[jira] [Created] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread JIRA
Matthias Bösinger created PDFBOX-2548:
-

 Summary: problems with character extraction (OpenType, dense 
printed Text)
 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Test
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows JavaSE8 Eclipse
Reporter: Matthias Bösinger
Priority: Minor


I have a PDF document whose font is OpenType (Garamond OpenType). With this 
font, PDFBox text extraction can also extract special characters (for example 
small capital letters), which caused problems when the underlying font was a 
simple Type1 font.

However, the text extraction now causes another type of problem. In my case, 
when the character sequences "fi" or "fl" occur in the text, 
PDFTextStripper#getText(PDDocument doc) extracts each of them as a single 
character ('ﬁ' and 'ﬂ') and puts a space character on their right side.

(Surprisingly, if I access the list of characters of a page via the 
charactersByArticle field of PDFTextStripper or via the 
PDFTextStripper#processText(TextPosition pos) method, the same characters show 
up as normal single characters: f i / f l.)

My assumption is that the advantage of the underlying OpenType font turns into 
this particular disadvantage, because PDFTextStripper recognizes the 
character sequences f i / f l as the special characters ﬁ / ﬂ (which might have 
to do with the fact that the getText() method calculates things like whitespace 
characters from distances / positional placements).

Background: The given document is a wordbook text with very densely printed text.

My question: is there anything I can do to avoid this problem?

Thanks in advance ...
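
One possible post-processing workaround, sketched under the assumption that the 
single characters in the output are the Unicode ligatures U+FB01 and U+FB02: 
compatibility normalization (NFKC) from the JDK expands them back into plain 
letters. The class name LigatureSketch is hypothetical, and this does not 
remove the extra space reported above.

```java
import java.text.Normalizer;

// Hypothetical post-processing step applied to the extracted text, not PDFBox code.
final class LigatureSketch {

    // NFKC replaces compatibility characters such as the fi ligature (U+FB01)
    // and the fl ligature (U+FB02) with their plain-letter equivalents.
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(expandLigatures("spezi\uFB01sch"));
    }
}
```

Note that NFKC also rewrites other compatibility characters (superscript 
digits, full-width forms), so it may change more of the text than the ligatures 
alone.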






[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Attachment: test.pdf

 problems with character extraction (OpenType, dense printed Text)
 -

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Test
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows JavaSE8 Eclipse
Reporter: Matthias Bösinger
Priority: Minor
  Labels: newbie
 Attachments: test.pdf







[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Description: 
I have a PDF document whose font is OpenType (Garamond OpenType). With this 
font, PDFBox text extraction can also extract special characters (for example 
small capital letters), which caused problems when the underlying font was a 
simple Type1 font.

However, the text extraction now causes another type of problem. In my case, 
when the character sequences "fi" or "fl" occur in the text, 
PDFTextStripper#getText(PDDocument doc) extracts each of them as a single 
character ('ﬁ' and 'ﬂ') and puts a space character on their right side.

(Surprisingly, if I access the list of characters of a page via the 
charactersByArticle field of PDFTextStripper or via the 
PDFTextStripper#processText(TextPosition pos) method, the same characters show 
up as normal single characters: f i / f l.)

My assumption is that the advantage of the underlying OpenType font turns into 
this particular disadvantage, because PDFTextStripper recognizes the 
character sequences f i / f l as the special characters ﬁ / ﬂ (which might have 
to do with the fact that the getText() method calculates things like whitespace 
characters from distances / positional placements).

Background: The given document is a wordbook text with very densely printed text.

See this link for code and output:
http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox

My question: is there anything I can do to avoid this problem?

Thanks in advance ...





 problems with character extraction (OpenType, dense printed Text)
 -

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Test
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows JavaSE8 Eclipse
Reporter: Matthias Bösinger
Priority: Minor
  Labels: newbie
 Attachments: test.pdf



[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Environment: Windows7Professional JavaSE8 EclipseKepler  (was: Windows 
JavaSE8 EclipseKepler)

 problems with character extraction (OpenType, dense printed Text)
 -

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Test
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
  Labels: newbie
 Attachments: test.pdf







[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Environment: Windows JavaSE8 EclipseKepler  (was: Windows JavaSE8 Eclipse)

 problems with character extraction (OpenType, dense printed Text)
 -

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Test
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
  Labels: newbie
 Attachments: test.pdf







[jira] [Created] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported

2014-12-08 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-2549:
---

 Summary: TIFF-Predictor with 16 bits per component not supported
 Key: PDFBOX-2549
 URL: https://issues.apache.org/jira/browse/PDFBOX-2549
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr


The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test suite 
is not displayed; PDFBox throws the mentioned exception. One open source and 
one closed source product display an X, but gswin renders the image properly. 

The upcoming patch handles the 16-bit case. I won't implement 1, 2 or 4 bpc 
because I don't have test images.

I'll add my patch to 1.8 after the cut.
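
For readers unfamiliar with the TIFF predictor, here is a rough sketch of what 
undoing it means for 16 bits per component. This is not the upcoming patch: the 
class and method names are hypothetical, and the code assumes TIFF Predictor 2 
(horizontal differencing) with samples stored high byte first and rows that 
contain only whole samples.

```java
// Hypothetical illustration, not the PDFBox patch.
final class TiffPredictor16Sketch {

    // Reverses horizontal differencing in place for one row of 16-bit samples.
    // Each sample after the first pixel is stored as the difference to the
    // sample of the same colour component in the pixel to its left.
    static void undoRow(byte[] row, int colors) {
        int bytesPerPixel = colors * 2;
        for (int i = bytesPerPixel; i + 1 < row.length; i += 2) {
            int left = ((row[i - bytesPerPixel] & 0xff) << 8)
                    | (row[i - bytesPerPixel + 1] & 0xff);
            int delta = ((row[i] & 0xff) << 8) | (row[i + 1] & 0xff);
            int sample = (left + delta) & 0xffff; // wrap around at 16 bits
            row[i] = (byte) (sample >> 8);
            row[i + 1] = (byte) sample;
        }
    }
}
```

For a CMYK image such as the test file, colors would be 4, so every sample is 
added to the sample eight bytes earlier in the row.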





[jira] [Updated] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported

2014-12-08 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2549:

Attachment: GWG181_16Bit_CMYK_X4.pdf

 TIFF-Predictor with 16 bits per component not supported
 ---

 Key: PDFBOX-2549
 URL: https://issues.apache.org/jira/browse/PDFBOX-2549
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
  Labels: Predictor
 Attachments: GWG181_16Bit_CMYK_X4.pdf







[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Attachment: test2.pdf

 problems with character extraction (OpenType, dense printed Text)
 -

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Test
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
  Labels: newbie
 Attachments: test.pdf, test2.pdf





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238120#comment-14238120
 ] 

Matthias Bösinger commented on PDFBOX-2548:
---

I added a second test page, from an earlier volume of the same wordbook. For this 
volume, a Type1 font was used. I chose a page where the two words 
begrifflich and spezifisch occur (they cause problems, as you can see in the 
first test). As you can see/test, the described error doesn't occur when 
extracting the text of this second page! This strengthens my assumption that the 
OpenType format is the reason for the error.

 problems with character extraction (OpenType, dense printed Text)
 -

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Test
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
  Labels: newbie
 Attachments: test.pdf, test2.pdf




[jira] [Commented] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported

2014-12-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238201#comment-14238201
 ] 

ASF subversion and git services commented on PDFBOX-2549:
-

Commit 1643881 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1643881 ]

PDFBOX-2549: TIFF SUB predictor for 16bpc

 TIFF-Predictor with 16 bits per component not supported
 ---

 Key: PDFBOX-2549
 URL: https://issues.apache.org/jira/browse/PDFBOX-2549
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
  Labels: Predictor
 Attachments: GWG181_16Bit_CMYK_X4.pdf


 The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test 
 suite is not displayed, PDFBox throws the mentioned exception. One open 
 source and one closed source product display an X, but gswin renders the 
 image properly. 
 The upcoming patch handles the 16bit case. I won't implement 1, 2 or 4 bpc 
 because I don't have test images.
 I'll add my patch 1.8 after the cut.
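
For readers unfamiliar with the TIFF predictor (PDF Predictor 2), the 16 bits per component case can be sketched as follows. This is only a minimal illustration of horizontal differencing, not the code from commit r1643881; the class name TiffPredictor16 and the method decodeRow are hypothetical, and the sketch assumes big-endian 16-bit samples and one row per call.

```java
public final class TiffPredictor16 {

    private TiffPredictor16() {
    }

    /**
     * Undoes horizontal differencing in place for one row of 16-bit samples.
     * Each stored value is the difference to the sample of the same colour
     * component in the pixel to the left; sums wrap around modulo 65536.
     *
     * @param row    one image row, two bytes per sample, high byte first (assumed)
     * @param colors number of colour components per pixel, e.g. 4 for CMYK
     */
    public static void decodeRow(byte[] row, int colors) {
        int bytesPerPixel = colors * 2;
        // The first pixel is stored as-is, so start at the second pixel.
        for (int i = bytesPerPixel; i + 1 < row.length; i += 2) {
            int delta = ((row[i] & 0xFF) << 8) | (row[i + 1] & 0xFF);
            int left = ((row[i - bytesPerPixel] & 0xFF) << 8)
                    | (row[i - bytesPerPixel + 1] & 0xFF);
            int sample = (delta + left) & 0xFFFF;
            row[i] = (byte) (sample >> 8);
            row[i + 1] = (byte) sample;
        }
    }

    public static void main(String[] args) {
        // One component per pixel: 0x0100, then +1, then +0xFFFF (i.e. -1).
        byte[] row = {0x01, 0x00, 0x00, 0x01, (byte) 0xFF, (byte) 0xFF};
        decodeRow(row, 1);
        System.out.println(java.util.Arrays.toString(row));
    }
}
```

The 1, 2 and 4 bpc cases mentioned above would need bit-level unpacking instead of the two-byte stride used here.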





[jira] [Updated] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported

2014-12-08 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2549:

Description: 
The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test suite 
is not displayed, PDFBox throws the mentioned exception. One open source and 
one closed source product display an X, but gswin renders the image properly. 

The upcoming patch handles the 16bit case. I won't implement 1, 2 or 4 bpc 
because I don't have test images.

I'll add my patch to 1.8 after the cut.

  was:
The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test suite 
is not displayed, PDFBox throws the mentioned exception. One open source and 
one closed source product display an X, but gswin renders the image properly. 

The upcoming patch handles the 16bit case. I won't implement 1, 2 or 4 bpc 
because I don't have test images.

I'll add my patch 1.8 after the cut.


 TIFF-Predictor with 16 bits per component not supported
 ---

 Key: PDFBOX-2549
 URL: https://issues.apache.org/jira/browse/PDFBOX-2549
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
  Labels: Predictor
 Attachments: GWG181_16Bit_CMYK_X4.pdf




[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2548:

Issue Type: Bug  (was: Test)

 problems with character extraction (OpenType, dense printed Text)
 -

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: test.pdf, test2.pdf




[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2548:

Labels:   (was: newbie)

 problems with character extraction (OpenType, dense printed Text)
 -

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: test.pdf, test2.pdf




[jira] [Commented] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238282#comment-14238282
 ] 

John Hewson commented on PDFBOX-2548:
-

Neither of these PDFs contain OpenType fonts, instead they contain embedded 
Type 1 fonts. It is common for PDF generating software to perform such format 
conversions when embedding fonts.

 problems with character extraction (OpenType, dense printed Text)
 -

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: test.pdf, test2.pdf




[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2548:

Summary: Problems with character extraction (fi ligature)  (was: problems 
with character extraction (OpenType, dense printed Text))

 Problems with character extraction (fi ligature)
 

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: test.pdf, test2.pdf




[jira] [Comment Edited] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238282#comment-14238282
 ] 

John Hewson edited comment on PDFBOX-2548 at 12/8/14 7:14 PM:
--

Neither of these PDFs contain OpenType fonts, instead they contain embedded 
Type 1 fonts. It is common for PDF generating software to perform such format 
conversions when embedding fonts.

If you open the file in Adobe Reader and go to File > Properties > Fonts, then 
you can see a list of the fonts which are embedded and their format.


was (Author: jahewson):
Neither of these PDFs contain OpenType fonts, instead they contain embedded 
Type 1 fonts. It is common for PDF generating software to perform such format 
conversions when embedding fonts.

 Problems with character extraction (fi ligature)
 

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: test.pdf, test2.pdf




[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2548:

Attachment: preflight.png

 Problems with character extraction (fi ligature)
 

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: preflight.png, test.pdf, test2.pdf




[jira] [Commented] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238369#comment-14238369
 ] 

John Hewson commented on PDFBOX-2548:
-

The embedded text in this PDF really does contain spaces after some of the 
ligatures, e.g. "Spezifi zierung", and Adobe Acrobat extracts the text with those 
spaces, exactly as PDFBox does. Foxit does the same, but OS X Preview strips 
the space, which gives the correct result: "Spezifizierung".

Here are the text drawing commands for "Spezifi zierung" as shown in Adobe 
Preflight's PDF structure viewer:
!preflight.png!

These commands have the following meaning:

0: Draw text "Spezifi"
1: Subtract 305.505 units from the x-position (move _backwards_ approx. 0.3em, 
roughly the width of a space)
2: Draw text " " (space)
3: Subtract -20.3063 units from the x-position (move _forwards_ approx. 0.02em, 
this is a kern)
4: Draw text "zierung des logisch-historischen"

So the space is overlaid on top of the fi ligature. Needless to say, this is 
a very unusual technique which does not result in proper text embedding.

Given that Acrobat produces the same result, and that I don't see any simple 
way to fix this (one could imagine some complex solution), I'm going to close 
this issue as not a problem.
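
The arithmetic behind steps 1 to 3 can be sketched as below. This is an illustration only: the class TjOverlapSketch is hypothetical, and the end position of "Spezifi" (3.0 em) and the space width (0.3 em) are assumed values, not measurements from the PDF. The only facts taken from the list above are the two adjustments and that they are subtracted from the x-position in thousandths of an em.

```java
public final class TjOverlapSketch {

    private TjOverlapSketch() {
    }

    /** Applies a positioning adjustment given in thousandths of an em. */
    public static double applyAdjustment(double xEm, double adjustment) {
        return xEm - adjustment / 1000.0;
    }

    /** A glyph overlaps the preceding text if it starts before that text ends. */
    public static boolean overlapsPrevious(double previousEndEm, double glyphStartEm) {
        return glyphStartEm < previousEndEm;
    }

    public static void main(String[] args) {
        double endOfSpezifi = 3.0; // assumed end of step 0, in em
        double spaceWidth = 0.3;   // assumed width of the space glyph, in em

        double spaceStart = applyAdjustment(endOfSpezifi, 305.505);           // step 1
        double afterSpace = spaceStart + spaceWidth;                          // step 2
        double zierungStart = applyAdjustment(afterSpace, -20.3063);          // step 3

        System.out.println("space starts at      " + spaceStart);
        System.out.println("space overlaps text: " + overlapsPrevious(endOfSpezifi, spaceStart));
        System.out.println("zierung starts at    " + zierungStart);
    }
}
```

Under these assumptions the space begins about 0.3 em before the end of "Spezifi", so it sits on top of the ligature while still being present in the text stream.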

 Problems with character extraction (fi ligature)
 

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: preflight.png, test.pdf, test2.pdf




[jira] [Closed] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson closed PDFBOX-2548.
---

 Problems with character extraction (fi ligature)
 

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: preflight.png, test.pdf, test2.pdf




[jira] [Resolved] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson resolved PDFBOX-2548.
-
Resolution: Not a Problem

 Problems with character extraction (fi ligature)
 

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: preflight.png, test.pdf, test2.pdf




[jira] [Commented] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238382#comment-14238382
 ] 

John Hewson commented on PDFBOX-2548:
-

{quote}
(Surprisingly, if I access the list of characters of a page via the 
charactersByArticle field of PDFTextStripper / via the 
PDFTextStripper#processText(TextPosition pos) method, the same characters show 
up as 'normal-single' characters f i / f l).
{quote}

This is by design: certain text in charactersByArticle undergoes [NFKC 
normalization|http://www.unicode.org/reports/tr15/], which includes mapping ﬁ 
-> f i.
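
The effect of NFKC normalization on ligatures can be reproduced with the JDK alone. The sketch below uses java.text.Normalizer and does not touch PDFBox; the class name NfkcLigatureDemo is hypothetical, and where exactly PDFBox applies the normalization is not shown here.

```java
import java.text.Normalizer;

public final class NfkcLigatureDemo {

    private NfkcLigatureDemo() {
    }

    /** Returns the NFKC form of the input, which expands compatibility ligatures. */
    public static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 is the fi ligature, U+FB04 the ffl ligature.
        System.out.println(nfkc("Spezi\uFB01zierung"));
        System.out.println(nfkc("begri\uFB04ich"));
    }
}
```

Applying the same call to the string returned by getText() is one possible way to get plain f i / f l there as well, although it does not remove the extra space discussed in the previous comment.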

 Problems with character extraction (fi ligature)
 

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: preflight.png, test.pdf, test2.pdf




[jira] [Commented] (PDFBOX-2547) maybe encoding error

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238407#comment-14238407
 ] 

John Hewson commented on PDFBOX-2547:
-

Text extraction of this PDF does not produce good results with Acrobat 
either, although the problems are not as bad as with PDFBox. Acrobat extracts 
nothing for 'ę' and 'ą', but 'na przykład miłe' is extracted correctly.

Calling setSpacingTolerance(0.3f) on PDFTextStripper seems to produce better 
results.
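
Why a lower spacing tolerance can help with 'naprzykładmiłe' may be easier to see with a simplified model. The sketch below is not PDFBox's actual algorithm: it assumes that a word separator is emitted whenever the horizontal gap between two glyphs exceeds tolerance times the width of a space, and the class SpacingToleranceSketch and all numbers in it are made up for illustration. In real code the only change would be the call stripper.setSpacingTolerance(0.3f) before getText().

```java
public final class SpacingToleranceSketch {

    private SpacingToleranceSketch() {
    }

    /**
     * Simplified rule (an assumption, not the library's code): a gap counts
     * as a word break if it is wider than tolerance * spaceWidth.
     */
    public static boolean insertsSpace(double gap, double spaceWidth, double tolerance) {
        return gap > spaceWidth * tolerance;
    }

    public static void main(String[] args) {
        double spaceWidth = 250.0; // hypothetical width of the space glyph
        double gap = 100.0;        // hypothetical narrow gap between two words

        // With a tolerance of 0.5 the narrow gap is missed and the words run together.
        System.out.println("tolerance 0.5: " + insertsSpace(gap, spaceWidth, 0.5));
        // With 0.3 the same gap is treated as a word break.
        System.out.println("tolerance 0.3: " + insertsSpace(gap, spaceWidth, 0.3));
    }
}
```

A lower tolerance also raises the risk of spurious spaces inside words, so the value is a trade-off that depends on the document.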

 maybe encoding error
 

 Key: PDFBOX-2547
 URL: https://issues.apache.org/jira/browse/PDFBOX-2547
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
Reporter: Michał
Priority: Minor

 Hi,
 I just downloaded a PDF from this page:
 http://download.jw.org/files/media_books/32/es15_P.pdf
 and want to extract the text from this document.
 I use the command:
 java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf 
 resultFile-UTF-8.txt
 But I see some problems, for example:
 1. In the text file I see 'STX' and 'ETX' instead of 'ę' and 'ą'.
 2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe' 
 (page 4, line 6).
 Maybe these are only small problems.





[jira] [Updated] (PDFBOX-2547) maybe encoding error

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2547:

Affects Version/s: 2.0.0

 maybe encoding error
 

 Key: PDFBOX-2547
 URL: https://issues.apache.org/jira/browse/PDFBOX-2547
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7, 2.0.0
Reporter: Michał
Priority: Minor

 Hi,
 I just downloaded a PDF from this page:
 http://download.jw.org/files/media_books/32/es15_P.pdf
 and want to extract text from this document.
 I use the command:
 java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf resultFile-UTF-8.txt
 But I see some problems, for example:
 1. In the text file I see 'STX' and 'ETX' instead of 'ę' and 'ą'.
 2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe' 
 (page 4, line 6).
 Maybe these are just small problems.





[jira] [Commented] (PDFBOX-2546) IllegalArgumentException: resourceDictionary is null in PDFMerger

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238412#comment-14238412
 ] 

John Hewson commented on PDFBOX-2546:
-

Well, this is a fun bug :(

 IllegalArgumentException: resourceDictionary is null in PDFMerger
 -

 Key: PDFBOX-2546
 URL: https://issues.apache.org/jira/browse/PDFBOX-2546
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 2.0.0
Reporter: Tilman Hausherr

 This was first mentioned on the user mailing list by [~giladd]:
 When merging the PDF 1.7 spec with another PDF file this exception appears:
 {code}
 Exception in thread "main" java.lang.IllegalArgumentException: resourceDictionary is null
   at org.apache.pdfbox.pdmodel.PDResources.<init>(PDResources.java:68)
   at org.apache.pdfbox.util.PDFMergerUtility.appendDocument(PDFMergerUtility.java:448)
   at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:190)
   at org.apache.pdfbox.tools.PDFMerger.merge(PDFMerger.java:70)
   at org.apache.pdfbox.tools.PDFMerger.main(PDFMerger.java:46)
   at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
 {code}
 I did some debugging: it happens on the very first page. The resources 
 dictionary is indeed null, but it exists when viewed with PDFDebugger.





[jira] [Assigned] (PDFBOX-2546) IllegalArgumentException: resourceDictionary is null in PDFMerger

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson reassigned PDFBOX-2546:
---

Assignee: John Hewson

 IllegalArgumentException: resourceDictionary is null in PDFMerger
 -

 Key: PDFBOX-2546
 URL: https://issues.apache.org/jira/browse/PDFBOX-2546
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: John Hewson

 This was first mentioned on the user mailing list by [~giladd]:
 When merging the PDF 1.7 spec with another PDF file this exception appears:
 {code}
 Exception in thread "main" java.lang.IllegalArgumentException: resourceDictionary is null
   at org.apache.pdfbox.pdmodel.PDResources.<init>(PDResources.java:68)
   at org.apache.pdfbox.util.PDFMergerUtility.appendDocument(PDFMergerUtility.java:448)
   at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:190)
   at org.apache.pdfbox.tools.PDFMerger.merge(PDFMerger.java:70)
   at org.apache.pdfbox.tools.PDFMerger.main(PDFMerger.java:46)
   at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
 {code}
 I did some debugging: it happens on the very first page. The resources 
 dictionary is indeed null, but it exists when viewed with PDFDebugger.





[jira] [Assigned] (PDFBOX-2542) IllegalArgumentException: root must be of type Pages

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson reassigned PDFBOX-2542:
---

Assignee: John Hewson

 IllegalArgumentException: root must be of type Pages
 

 Key: PDFBOX-2542
 URL: https://issues.apache.org/jira/browse/PDFBOX-2542
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: John Hewson
 Attachments: 249776.pdf


 {code}
 java.lang.IllegalArgumentException: root must be of type Pages
   at org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:66)
   at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:125)
   at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:1175)
 {code}
 The cause is this:
 {code}
 <<
 /Count 11
 /Kids [ 100 0 R 141 0 R ]
 >>
 endobj
 {code}
 /Type /Pages is missing.





[jira] [Commented] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238438#comment-14238438
 ] 

John Hewson commented on PDFBOX-2532:
-

It's very common to need to extract the Encoding from Type1C fonts, so Acrobat 
must be doing something other than just ignoring the encoding. Either it's a 
bug in Acrobat (which happens to produce good behaviour for this file) or they 
have some sort of heuristic. The CharSet entry can't be the deciding factor, 
because it is optional, and its entries are unordered, so it provides no help 
in identifying a jumbled encoding (i.e. two encodings with the same 
characters have the same CharSet, even if their order is different).
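
The argument about unordered CharSet entries can be made concrete with a small sketch. Plain string arrays stand in for a code-to-glyph-name encoding here; the data and the charSetOf() helper are hypothetical and not taken from any font or from PDFBox.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustration only: arrays stand in for a font's code-to-glyph-name encoding.
class CharSetSketch {
    /** Collects the glyph names of an encoding into an unordered set (a CharSet-like view). */
    static Set<String> charSetOf(String[] encoding) {
        return new HashSet<>(Arrays.asList(encoding));
    }

    public static void main(String[] args) {
        // Index = character code, value = glyph name (made-up example data).
        String[] intended = {"a", "b", "c"};
        String[] jumbled = {"c", "a", "b"};

        // The encodings differ, so text decoded with the wrong one is garbled ...
        System.out.println(Arrays.equals(intended, jumbled));               // false
        // ... yet the unordered sets of names are identical.
        System.out.println(charSetOf(intended).equals(charSetOf(jumbled))); // true
    }
}
```

Because the set comparison cannot tell the two encodings apart, a CharSet on its own cannot reveal that the codes were shuffled.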

 Text extraction fails due to the usage of the internal font mapping
 ---

 Key: PDFBOX-2532
 URL: https://issues.apache.org/jira/browse/PDFBOX-2532
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 2.0.0
Reporter: Andreas Lehmkühler
 Fix For: 2.0.0

 Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, 
 PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, 
 PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png


 If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode 
 mapping) we have to decide where to get a suitable mapping ourselves. We 
 can't use the internal font mapping of the type1C font as it doesn't work in 
 every case, see PDFBOX-2377 which provides a solution for the 1.8-branch.





[jira] [Comment Edited] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238438#comment-14238438
 ] 

John Hewson edited comment on PDFBOX-2532 at 12/8/14 8:49 PM:
--

It's very common to need to extract the Encoding from Type1C fonts, so Acrobat 
must be doing something other than just ignoring the encoding. Either it's a 
bug in Acrobat (which happens to produce good behaviour for this file) or they 
have some sort of heuristic.

The CharSet entry can't be the deciding factor, because it is optional, and its 
entries are unordered, so it provides no help in identifying a jumbled 
encoding (i.e. two encodings with the same characters have the same CharSet, 
even if their order is different).


was (Author: jahewson):
It's very common to need to extract the Encoding from Type1C fonts, so Acrobat 
must be doing something other than just ignoring the encoding. Either it's a 
bug in Acrobat (which happens to produce good behaviour for this file) or they 
have some sort of heuristic. The CharSet entry can't be the deciding factor, 
because it is optional, and its entries are unordered, so it provides no help 
in identifying a jumbled encoding (i.e. two encodings with the same 
characters have the same CharSet, even if their order is different).

 Text extraction fails due to the usage of the internal font mapping
 ---

 Key: PDFBOX-2532
 URL: https://issues.apache.org/jira/browse/PDFBOX-2532
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 2.0.0
Reporter: Andreas Lehmkühler
 Fix For: 2.0.0

 Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, 
 PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, 
 PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png


 If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode 
 mapping) we have to decide where to get a suitable mapping ourselves. We 
 can't use the internal font mapping of the type1C font as it doesn't work in 
 every case, see PDFBOX-2377 which provides a solution for the 1.8-branch.





[jira] [Comment Edited] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238438#comment-14238438
 ] 

John Hewson edited comment on PDFBOX-2532 at 12/8/14 8:50 PM:
--

It's very common to need to extract the Encoding from Type1C fonts, so Acrobat 
must be doing something other than just ignoring the encoding. Either it's a 
bug in Acrobat (which happens to produce good behaviour for this file) or they 
have some sort of heuristic.

The CharSet entry can't be the deciding factor, because it is optional, and its 
entries are unordered, so it provides no help in identifying a jumbled 
encoding. Two different encodings which contain the same characters will have 
the same CharSet, even if their order is different.


was (Author: jahewson):
It's very common to need to extract the Encoding from Type1C fonts, so Acrobat 
must be doing something other than just ignoring the encoding. Either it's a 
bug in Acrobat (which happens to produce good behaviour for this file) or they 
have some sort of heuristic.

The CharSet entry can't be the deciding factor, because it is optional, and its 
entries are unordered, so it provides no help in identifying a jumbled 
encoding. Two different encodings which contain the same characters will have 
the same CharSet, even if their order is different).

 Text extraction fails due to the usage of the internal font mapping
 ---

 Key: PDFBOX-2532
 URL: https://issues.apache.org/jira/browse/PDFBOX-2532
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 2.0.0
Reporter: Andreas Lehmkühler
 Fix For: 2.0.0

 Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, 
 PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, 
 PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png


 If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode 
 mapping) we have to decide where to get a suitable mapping ourselves. We 
 can't use the internal font mapping of the type1C font as it doesn't work in 
 every case, see PDFBOX-2377 which provides a solution for the 1.8-branch.





[jira] [Comment Edited] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238438#comment-14238438
 ] 

John Hewson edited comment on PDFBOX-2532 at 12/8/14 8:49 PM:
--

It's very common to need to extract the Encoding from Type1C fonts, so Acrobat 
must be doing something other than just ignoring the encoding. Either it's a 
bug in Acrobat (which happens to produce good behaviour for this file) or they 
have some sort of heuristic.

The CharSet entry can't be the deciding factor, because it is optional, and its 
entries are unordered, so it provides no help in identifying a jumbled 
encoding. Two different encodings which contain the same characters will have 
the same CharSet, even if their order is different).


was (Author: jahewson):
It's very common to need to extract the Encoding from Type1C fonts, so Acrobat 
must be doing something other than just ignoring the encoding. Either it's a 
bug in Acrobat (which happens to produce good behaviour for this file) or they 
have some sort of heuristic.

The CharSet entry can't be the deciding factor, because it is optional, and its 
entries are unordered, so it provides no help in identifying a jumbled 
encoding (i.e. two encodings with the same characters have the same CharSet, 
even if their order is different).

 Text extraction fails due to the usage of the internal font mapping
 ---

 Key: PDFBOX-2532
 URL: https://issues.apache.org/jira/browse/PDFBOX-2532
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 2.0.0
Reporter: Andreas Lehmkühler
 Fix For: 2.0.0

 Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, 
 PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, 
 PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png


 If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode 
 mapping) we have to decide where to get a suitable mapping ourselves. We 
 can't use the internal font mapping of the type1C font as it doesn't work in 
 every case, see PDFBOX-2377 which provides a solution for the 1.8-branch.





[jira] [Commented] (PDFBOX-2542) IllegalArgumentException: root must be of type Pages

2014-12-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238450#comment-14238450
 ] 

ASF subversion and git services commented on PDFBOX-2542:
-

Commit 1643915 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1643915 ]

PDFBOX-2542: Removed check for Type of page tree root
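
As a rough illustration of what tolerating a missing /Type entry can look like, the sketch below compares a strict check with a lenient one. It is not the code of the commit above: a Map stands in for a parsed PDF dictionary, and isPageTreeStrict() and isPageTreeLenient() are hypothetical helpers. Falling back to /Kids is an assumption made for this example only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: a Map plays the role of a parsed PDF dictionary.
class PageTreeRootSketch {
    /** Strict check: rejects a root without an explicit /Type /Pages entry. */
    static boolean isPageTreeStrict(Map<String, Object> dict) {
        return "Pages".equals(dict.get("Type"));
    }

    /** Lenient check: if /Type is missing, accept the node when it has a /Kids array. */
    static boolean isPageTreeLenient(Map<String, Object> dict) {
        Object type = dict.get("Type");
        if (type != null) {
            return "Pages".equals(type);
        }
        return dict.get("Kids") instanceof List;
    }

    public static void main(String[] args) {
        // Mirrors the dictionary from the report: /Count and /Kids, but no /Type.
        Map<String, Object> root = new HashMap<>();
        root.put("Count", 11);
        root.put("Kids", Arrays.asList("100 0 R", "141 0 R"));

        System.out.println(isPageTreeStrict(root));  // false: the strict check rejects it
        System.out.println(isPageTreeLenient(root)); // true: the lenient check accepts it
    }
}
```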

 IllegalArgumentException: root must be of type Pages
 

 Key: PDFBOX-2542
 URL: https://issues.apache.org/jira/browse/PDFBOX-2542
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: John Hewson
 Fix For: 2.0.0

 Attachments: 249776.pdf


 {code}
 java.lang.IllegalArgumentException: root must be of type Pages
   at org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:66)
   at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:125)
   at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:1175)
 {code}
 The cause is this:
 {code}
 <<
 /Count 11
 /Kids [ 100 0 R 141 0 R ]
 >>
 endobj
 {code}
 /Type /Pages is missing.





[jira] [Resolved] (PDFBOX-2542) IllegalArgumentException: root must be of type Pages

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson resolved PDFBOX-2542.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

 IllegalArgumentException: root must be of type Pages
 

 Key: PDFBOX-2542
 URL: https://issues.apache.org/jira/browse/PDFBOX-2542
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: John Hewson
 Fix For: 2.0.0

 Attachments: 249776.pdf


 {code}
 java.lang.IllegalArgumentException: root must be of type Pages
   at org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:66)
   at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:125)
   at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:1175)
 {code}
 The cause is this:
 {code}
 <<
 /Count 11
 /Kids [ 100 0 R 141 0 R ]
 >>
 endobj
 {code}
 /Type /Pages is missing.





[jira] [Commented] (PDFBOX-2546) IllegalArgumentException: resourceDictionary is null in PDFMerger

2014-12-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238492#comment-14238492
 ] 

ASF subversion and git services commented on PDFBOX-2546:
-

Commit 1643933 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1643933 ]

PDFBOX-2546: PageIterator should be recursive
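
The difference between a flat and a recursive walk over a nested page tree can be sketched as follows. This is a hypothetical model, not the PDFBox classes or the code of the commit above; it assumes the failure arose because an intermediate /Pages node was handed on as if it were a page, which the issue itself does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a page tree; Node, directKids and leafPages are made-up names.
class PageTreeWalkSketch {
    static final class Node {
        final String name;
        final boolean isPage;  // true for a leaf page, false for an intermediate node
        final List<Node> kids;

        Node(String name, boolean isPage, Node... kids) {
            this.name = name;
            this.isPage = isPage;
            this.kids = Arrays.asList(kids);
        }
    }

    /** Flat walk: returns the direct kids only, so intermediate nodes leak out as "pages". */
    static List<String> directKids(Node root) {
        List<String> out = new ArrayList<>();
        for (Node kid : root.kids) {
            out.add(kid.name);
        }
        return out;
    }

    /** Recursive walk: descends through intermediate nodes and collects leaf pages in order. */
    static List<String> leafPages(Node node) {
        List<String> out = new ArrayList<>();
        collect(node, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        if (node.isPage) {
            out.add(node.name);
            return;
        }
        for (Node kid : node.kids) {
            collect(kid, out);
        }
    }

    /** Builds a root with two intermediate nodes holding three pages in total. */
    static Node sampleTree() {
        Node a = new Node("Pages A", false, new Node("Page 1", true), new Node("Page 2", true));
        Node b = new Node("Pages B", false, new Node("Page 3", true));
        return new Node("Root", false, a, b);
    }

    public static void main(String[] args) {
        System.out.println(directKids(sampleTree())); // [Pages A, Pages B]
        System.out.println(leafPages(sampleTree()));  // [Page 1, Page 2, Page 3]
    }
}
```

Under that assumption, the flat walk would explain why the "first page" appeared to have no resources although a viewer that follows the tree shows them.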

 IllegalArgumentException: resourceDictionary is null in PDFMerger
 -

 Key: PDFBOX-2546
 URL: https://issues.apache.org/jira/browse/PDFBOX-2546
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: John Hewson
 Fix For: 2.0.0


 This was first mentioned on the user mailing list by [~giladd]:
 When merging the PDF 1.7 spec with another PDF file this exception appears:
 {code}
 Exception in thread "main" java.lang.IllegalArgumentException: resourceDictionary is null
   at org.apache.pdfbox.pdmodel.PDResources.<init>(PDResources.java:68)
   at org.apache.pdfbox.util.PDFMergerUtility.appendDocument(PDFMergerUtility.java:448)
   at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:190)
   at org.apache.pdfbox.tools.PDFMerger.merge(PDFMerger.java:70)
   at org.apache.pdfbox.tools.PDFMerger.main(PDFMerger.java:46)
   at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
 {code}
 I did some debugging: it happens on the very first page. The resources 
 dictionary is indeed null, but it exists when viewed with PDFDebugger.





[jira] [Resolved] (PDFBOX-2546) IllegalArgumentException: resourceDictionary is null in PDFMerger

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson resolved PDFBOX-2546.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

 IllegalArgumentException: resourceDictionary is null in PDFMerger
 -

 Key: PDFBOX-2546
 URL: https://issues.apache.org/jira/browse/PDFBOX-2546
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: John Hewson
 Fix For: 2.0.0


 This was first mentioned on the user mailing list by [~giladd]:
 When merging the PDF 1.7 spec with another PDF file this exception appears:
 {code}
 Exception in thread "main" java.lang.IllegalArgumentException: resourceDictionary is null
   at org.apache.pdfbox.pdmodel.PDResources.<init>(PDResources.java:68)
   at org.apache.pdfbox.util.PDFMergerUtility.appendDocument(PDFMergerUtility.java:448)
   at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:190)
   at org.apache.pdfbox.tools.PDFMerger.merge(PDFMerger.java:70)
   at org.apache.pdfbox.tools.PDFMerger.main(PDFMerger.java:46)
   at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
 {code}
 I did some debugging: it happens on the very first page. The resources 
 dictionary is indeed null, but it exists when viewed with PDFDebugger.





[jira] [Commented] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping

2014-12-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238494#comment-14238494
 ] 

Andreas Lehmkühler commented on PDFBOX-2532:


{quote}
It's very common to need to extract the Encoding from Type1C fonts, so Acrobat 
must be doing something other than just ignoring the encoding. Either it's a 
bug in Acrobat (which happens to produce good behaviour for this file) or they 
have some sort of heuristic.
{quote}
It has to be a new bug, as it worked with older Acrobat versions.

{quote}
The CharSet entry can't be the deciding factor, because it is optional, and its 
entries are unordered, so it provides no help in identifying a jumbled 
encoding. Two different encodings which contain the same characters will have 
the same CharSet, even if their order is different.
{quote}
I know the specs. Anyway, in all the cases I know of, it was a good indicator 
of broken fonts.


 Text extraction fails due to the usage of the internal font mapping
 ---

 Key: PDFBOX-2532
 URL: https://issues.apache.org/jira/browse/PDFBOX-2532
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 2.0.0
Reporter: Andreas Lehmkühler
 Fix For: 2.0.0

 Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, 
 PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, 
 PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png


 If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode 
 mapping) we have to decide where to get a suitable mapping ourselves. We 
 can't use the internal font mapping of the type1C font as it doesn't work in 
 every case, see PDFBOX-2377 which provides a solution for the 1.8-branch.





[jira] [Comment Edited] (PDFBOX-2539) [PATCH] Allow non static FontProvider

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238495#comment-14238495
 ] 

John Hewson edited comment on PDFBOX-2539 at 12/8/14 9:39 PM:
--

Your updated patch does not compile, it is missing the method 
PDFStreamEngine#getFontProvider().


was (Author: jahewson):
Your updated does not compile, it is missing the method 
PDFStreamEngine#getFontProvider().

 [PATCH] Allow non static FontProvider
 -

 Key: PDFBOX-2539
 URL: https://issues.apache.org/jira/browse/PDFBOX-2539
 Project: PDFBox
  Issue Type: Bug
  Components: FontBox
Affects Versions: 2.0.0
Reporter: simon steiner
 Attachments: fontProvider.patch


 I would like to use multiple instances of FontProvider in a thread-safe way.





[jira] [Commented] (PDFBOX-2539) [PATCH] Allow non static FontProvider

2014-12-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238495#comment-14238495
 ] 

John Hewson commented on PDFBOX-2539:
-

Your updated does not compile, it is missing the method 
PDFStreamEngine#getFontProvider().

 [PATCH] Allow non static FontProvider
 -

 Key: PDFBOX-2539
 URL: https://issues.apache.org/jira/browse/PDFBOX-2539
 Project: PDFBox
  Issue Type: Bug
  Components: FontBox
Affects Versions: 2.0.0
Reporter: simon steiner
 Attachments: fontProvider.patch


 I would like to use multiple instances of FontProvider in a thread-safe way.





[jira] [Created] (PDFBOX-2550) ClassCastException in PDAnnotation.getColour

2014-12-08 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-2550:
---

 Summary: ClassCastException in PDAnnotation.getColour
 Key: PDFBOX-2550
 URL: https://issues.apache.org/jira/browse/PDFBOX-2550
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr


{code}
java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray
   at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644)
   at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134)
{code}






[jira] [Updated] (PDFBOX-2550) ClassCastException in PDAnnotation.getColour

2014-12-08 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2550:

Description: 
{code}
java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray
   at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644)
   at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134)
{code}
The cause is this:
{code}
/C 19 0 R
{code}
The current code doesn't expect it to be an indirect object.

  was:
{code}
java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray
   at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644)
   at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134)
{code}



 ClassCastException in PDAnnotation.getColour
 

 Key: PDFBOX-2550
 URL: https://issues.apache.org/jira/browse/PDFBOX-2550
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
  Labels: Annotations

 {code}
 java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray
   at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644)
   at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134)
 {code}
 The cause is this:
 {code}
 /C 19 0 R
 {code}
 The current code doesn't expect it to be an indirect object.





[jira] [Updated] (PDFBOX-2550) ClassCastException in PDAnnotation.getColour

2014-12-08 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2550:

Attachment: 176622.pdf

 ClassCastException in PDAnnotation.getColour
 

 Key: PDFBOX-2550
 URL: https://issues.apache.org/jira/browse/PDFBOX-2550
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
  Labels: Annotations
 Attachments: 176622.pdf


 {code}
 java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray
   at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644)
   at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134)
 {code}
 The cause is this:
 {code}
 /C 19 0 R
 {code}
 The current code doesn't expect it to be an indirect object.





[jira] [Commented] (PDFBOX-2550) ClassCastException in PDAnnotation.getColour

2014-12-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239096#comment-14239096
 ] 

ASF subversion and git services commented on PDFBOX-2550:
-

Commit 1643996 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1643996 ]

PDFBOX-2550: allow indirect object and avoid ClassCastException in getColour()
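
The general idea of resolving an indirect reference before casting can be sketched with a miniature object model. Everything here is hypothetical (IndirectRef, resolve() and this getColour() are stand-ins written for this example); it is not the code of the commit above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature object model; names only loosely mirror PDF concepts.
class IndirectColourSketch {
    /** An indirect reference such as "19 0 R", wrapping the object it points to. */
    static final class IndirectRef {
        final Object target;

        IndirectRef(Object target) {
            this.target = target;
        }
    }

    /** Unwraps an indirect reference; direct objects pass through unchanged. */
    static Object resolve(Object value) {
        return (value instanceof IndirectRef) ? ((IndirectRef) value).target : value;
    }

    /** Returns the /C colour components, or null when the entry is absent or not an array. */
    static float[] getColour(Map<String, Object> annotation) {
        Object c = resolve(annotation.get("C"));
        // Check the type instead of casting blindly, so no ClassCastException can occur.
        return (c instanceof float[]) ? (float[]) c : null;
    }

    public static void main(String[] args) {
        Map<String, Object> annotation = new HashMap<>();
        // /C stored as an indirect reference, as in the reported file.
        annotation.put("C", new IndirectRef(new float[] {1f, 0f, 0f}));
        System.out.println(getColour(annotation).length); // 3
    }
}
```

The design choice in this sketch is to resolve first and type-check second, so direct and indirect /C entries take the same path and an unexpected type yields null rather than an exception.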

 ClassCastException in PDAnnotation.getColour
 

 Key: PDFBOX-2550
 URL: https://issues.apache.org/jira/browse/PDFBOX-2550
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
  Labels: Annotations
 Attachments: 176622.pdf


 {code}
 java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray
   at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644)
   at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134)
 {code}
 The cause is this:
 {code}
 /C 19 0 R
 {code}
 The current code doesn't expect it to be an indirect object.





[jira] [Commented] (PDFBOX-2550) ClassCastException in PDAnnotation.getColour

2014-12-08 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239099#comment-14239099
 ] 

Tilman Hausherr commented on PDFBOX-2550:
-

Will do 1.8 after the cut.

 ClassCastException in PDAnnotation.getColour
 

 Key: PDFBOX-2550
 URL: https://issues.apache.org/jira/browse/PDFBOX-2550
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
  Labels: Annotations
 Attachments: 176622.pdf


 {code}
 java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray
   at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644)
   at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134)
 {code}
 The cause is this:
 {code}
 /C 19 0 R
 {code}
 The current code doesn't expect it to be an indirect object.


