date:20141208

[jira] [Commented] (PDFBOX-2397) Running within an Applet throws an AccessControlException

2014-12-08 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237584#comment-14237584
 ] 

Andreas Lehmkühler commented on PDFBOX-2397:


[~tilman] Any updates on this topic, or should we simply postpone this issue?

 Running within an Applet throws an AccessControlException
 -

 Key: PDFBOX-2397
 URL: https://issues.apache.org/jira/browse/PDFBOX-2397
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.7
 Environment: JRE 7u67 or JRE 6u45 (Windows 7 SP1 64bit)
Reporter: Bertrand Gillis
Assignee: Tilman Hausherr
 Fix For: 1.8.8


 As soon as PDFBox is embedded in a signed applet, the following exception is 
 thrown when I try to print a PDF document through PDFBox:
 {code}
 Caused by: java.security.AccessControlException: access denied 
 (java.util.PropertyPermission org.apache.pdfbox.ICC_override_color read)
   at java.security.AccessControlContext.checkPermission(Unknown Source)
   at java.security.AccessController.checkPermission(Unknown Source)
   at java.lang.SecurityManager.checkPermission(Unknown Source)
   at sun.plugin2.applet.AWTAppletSecurityManager.checkPermission(Unknown Source)
   at java.lang.SecurityManager.checkPropertyAccess(Unknown Source)
   at java.lang.System.getProperty(Unknown Source)
   at java.lang.Integer.getInteger(Unknown Source)
   at java.lang.Integer.getInteger(Unknown Source)
   at java.awt.Color.getColor(Unknown Source)
   at java.awt.Color.getColor(Unknown Source)
   at org.apache.pdfbox.pdmodel.graphics.color.PDColorState.<clinit>(PDColorState.java:50)
 {code}
 The same problem existed in earlier PDFBox versions for the following statement:
 {code:title=BaseParser.java}
 FORCE_PARSING = Boolean.getBoolean("org.apache.pdfbox.forceParsing");
 {code}
 But it was fixed in later versions:
 {code:title=BaseParser.java}
 static {
     try {
         FORCE_PARSING = Boolean.getBoolean("org.apache.pdfbox.forceParsing");
     }
     catch (SecurityException e) {}
 }
 {code}
 Unfortunately, this fix has not been applied to the property read here:
 {code:title=PDColorState.java}
 private static volatile Color iccOverrideColor = 
 Color.getColor("org.apache.pdfbox.ICC_override_color");
 {code}
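
 The guard that the description shows for BaseParser could plausibly be carried
 over to the color property. The following is only a minimal sketch under that
 assumption, not the change that was committed: the class name
 SafeColorProperty and the method readColorProperty are hypothetical and do not
 exist in PDFBox. It relies only on java.awt.Color.getColor(String), which
 returns null when the property is unset or cannot be parsed.

```java
import java.awt.Color;

// Hypothetical helper, not part of PDFBox.
final class SafeColorProperty {

    /**
     * Reads a color from a system property. Returns null when the property is
     * unset or unparsable, or when a SecurityManager denies the read (for
     * example the AccessControlException seen in an applet sandbox).
     */
    static Color readColorProperty(String name) {
        try {
            return Color.getColor(name);
        } catch (SecurityException e) {
            // AccessControlException is a SecurityException; fall back to "no override".
            return null;
        }
    }

    public static void main(String[] args) {
        // "example.color" is an illustrative property name.
        System.setProperty("example.color", "0xFF0000");
        System.out.println(readColorProperty("example.color"));
        System.out.println(readColorProperty("example.missing"));
    }
}
```

 With such a helper the static field would be initialized from
 readColorProperty("org.apache.pdfbox.ICC_override_color"), so a denied
 property read no longer aborts class initialization.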



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (PDFBOX-2397) Running within an Applet throws an AccessControlException

2014-12-08 Thread Tilman Hausherr (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2397:

Fix Version/s: (was: 1.8.8)


[jira] [Commented] (PDFBOX-2397) Running within an Applet throws an AccessControlException

2014-12-08 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237593#comment-14237593
 ] 

Tilman Hausherr commented on PDFBOX-2397:
-

Postpone, for the reason I mentioned on Nov. 3rd, until either [~bgillis] comes 
back or somebody else comes along who is willing to run an applet with the 
modified code.


[jira] [Resolved] (PDFBOX-2512) OutOfMemory while signing large documents

2014-12-08 Thread Thomas Chojecki (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Thomas Chojecki resolved PDFBOX-2512.
-
Resolution: Fixed
Fix Version/s: 1.8.8

There is still one point open, but with the workaround mentioned in the
comment, this issue is resolved.

OutOfMemory while signing large documents
-

Key: PDFBOX-2512
URL: https://issues.apache.org/jira/browse/PDFBOX-2512
Project: PDFBox
Issue Type: Bug
Components: Parsing, Signing
Affects Versions: 1.8.7
Reporter: Thomas Chojecki
Assignee: Thomas Chojecki
Fix For: 1.8.8

Attachments: keystore.p12

While working with large documents, we found some memory issues.
1. The close() method in COSDocument clones the object pool and does not
clean it up properly. The cloning in getObjects() causes an OutOfMemory exception.
2. The COSWriter copies the whole PDF into memory for signing and does not
use a BufferedInputStream for the FileInputStream, which also has a big
performance impact. (PDFBOX-1798)
3. The cloning of COSStreams causes an OutOfMemory exception.
I used the CreateSignature example with a document of about 150 MB from here:
https://cdn-reichelt.de/bilder/downloads/reichelt_01-2015_DE_B_HQ.pdf
Additionally, I added a RandomAccessFile to the PDDocument.load call in the
CreateSignature class:
PDDocument doc = PDDocument.load(document, new RandomAccessFile(new
File("d:\\temp.bin"), "rw")); (this prevents the OOM in the third case)
Using a BufferedInputStream in the second case reduces the signing time
from more than 5 minutes to less than 1 minute.
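
The second point can be illustrated without PDFBox. The sketch below is an
assumption about the general technique, not the code that was committed: it
digests a file byte by byte through a BufferedInputStream, so the file is never
held in memory as a whole and the per-byte read() calls are served from the
buffer instead of hitting the file each time. The class name StreamingDigest
and the choice of SHA-256 are illustrative only.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of PDFBox.
final class StreamingDigest {

    /** Computes a SHA-256 digest of a file without loading the file into memory. */
    static byte[] sha256(File file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        // Without the BufferedInputStream, every read() below would be a separate
        // call into the underlying FileInputStream.
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                digest.update((byte) b);
            }
        }
        return digest.digest();
    }
}
```

Reading in larger chunks with read(byte[]) would reduce the call overhead
further; the single-byte loop is used here because it is the case where
buffering matters most.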


[jira] [Commented] (PDFBOX-2512) OutOfMemory while signing large documents

2014-12-08 Thread Andreas Lehmkühler (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237747#comment-14237747
 ] 

Andreas Lehmkühler commented on PDFBOX-2512:


Are these changes limited to the 1.8-branch or should we add them to the trunk 
as well?


[jira] [Commented] (PDFBOX-2512) OutOfMemory while signing large documents

2014-12-08 Thread Thomas Chojecki (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237806#comment-14237806
]

Thomas Chojecki commented on PDFBOX-2512:
-

If we can port it, we should do it. These are only small changes that improve
performance and solve the OOM problem.


[jira] [Commented] (PDFBOX-1351) False paragraph caused by superscript (1.7 regression)

2014-12-08 Thread Merijn Wijngaard (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238010#comment-14238010
 ] 

Merijn Wijngaard commented on PDFBOX-1351:
--

This problem still persists in pdfbox 1.8.7. Using superscript doesn't sound 
like a rare use case to me, so it would be nice if this could be fixed. 
Inlining the superscript for text output seems like the best solution to me.

 False paragraph caused by superscript (1.7 regression)
 --

 Key: PDFBOX-1351
 URL: https://issues.apache.org/jira/browse/PDFBOX-1351
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.7.0
Reporter: Daniel Bonniot de Ruisselet
 Attachments: PDFParaTest.java, superscript.pdf


 On the attached minimal example document, text extraction seems to be 
 confused by the superscript, and generates three paragraphs where there is 
 only one.
 Note that 1.6 handles this case correctly:
 {noformat}
 $ java -jar /dev/shm/pdfbox-app-1.6.0.jar ExtractText /tmp/superscript.pdf
 Jun 29, 2012 4:52:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
 WARNING: expected='%%EOF' actual='5 0 obj '
 $ cat /tmp/superscript.txt 
   
 Multiple synthetic routes have been described by R. Filler et al.11 regarding 
 1,3-
 Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
  
  
 $ java -jar /dev/shm/pdfbox-app-1.7.0.jar ExtractText /tmp/superscript.pdf 
 Jun 29, 2012 4:52:39 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
 WARNING: expected='%%EOF' actual='5 0 obj '
 $ cat /tmp/superscript.txt 
   
 Multiple synthetic routes have been described by R. Filler et al.
 11
  regarding 1,3-
 Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
  
  
 {noformat}




[jira] [Created] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread Matthias Bösinger (JIRA)

Matthias Bösinger created PDFBOX-2548:
-

 Summary: problems with character extraction (OpenType, dense 
printed Text)
 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Test
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows JavaSE8 Eclipse
Reporter: Matthias Bösinger
Priority: Minor


I have a PDF document whose font type is OpenType (Garamond OpenType). Because 
of this, the PDFBox text extraction can also extract special characters (for 
example small capital letters), which caused problems when the underlying font 
was a simple Type1 font.

However, the text extraction now causes another type of problem. In my case, 
when the character sequences "fi" or "fl" occur in the text, 
PDFTextStripper#getText(PDDocument doc) extracts them as the single characters 
'ﬁ' and 'ﬂ' and puts a space character on their right side.

(Surprisingly, if I access the list of characters of a page via the 
charactersByArticle field of PDFTextStripper / via the 
PDFTextStripper#processText(TextPosition pos) method, the same characters show 
up as 'normal' single characters f i / f l.)

My assumption is that the advantage of the underlying OpenType font turns into 
this particular disadvantage, because the PDFTextStripper recognizes the 
character sequence f i / f l as the special characters ﬁ / ﬂ (which might have 
to do with the fact that the getText() method calculates things like whitespace 
characters from distances / positional placements).

Background: The given document is a dictionary with very densely printed text.

My question: is there anything I can do to avoid this problem?

Thanks in advance ...
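
One possible post-processing step for the ligature part of this report, offered
only as a sketch and not as PDFBox behaviour or as a fix for the issue: the
characters 'ﬁ' (U+FB01) and 'ﬂ' (U+FB02) are Unicode compatibility characters,
so java.text.Normalizer with form NFKC expands them to "fi" and "fl". The class
name LigatureCleanup is hypothetical. The sketch does not remove the extra
space reported after the ligature, and NFKC also rewrites other compatibility
characters (superscript digits, for example), which may or may not be wanted.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox.
final class LigatureCleanup {

    /** Expands compatibility ligatures such as U+FB01 and U+FB02 into plain letters. */
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "spezi" + U+FB01 + "sch"
        System.out.println(expandLigatures("spezi\uFB01sch")); // prints "spezifisch"
    }
}
```

It would be applied to the string returned by PDFTextStripper#getText.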





[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread Matthias Bösinger (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Attachment: test.pdf

 problems with character extraction (OpenType, dense printed Text)
 -

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Test
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows JavaSE8 Eclipse
Reporter: Matthias Bösinger
Priority: Minor
  Labels: newbie
 Attachments: test.pdf



[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread Matthias Bösinger (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Description: 
I have a PDF document whose font type is OpenType (Garamond OpenType). Because 
of this, the PDFBox text extraction can also extract special characters (for 
example small capital letters), which caused problems when the underlying font 
was a simple Type1 font.

However, the text extraction now causes another type of problem. In my case, 
when the character sequences "fi" or "fl" occur in the text, 
PDFTextStripper#getText(PDDocument doc) extracts them as the single characters 
'ﬁ' and 'ﬂ' and puts a space character on their right side.

(Surprisingly, if I access the list of characters of a page via the 
charactersByArticle field of PDFTextStripper / via the 
PDFTextStripper#processText(TextPosition pos) method, the same characters show 
up as 'normal' single characters f i / f l.)

My assumption is that the advantage of the underlying OpenType font turns into 
this particular disadvantage, because the PDFTextStripper recognizes the 
character sequence f i / f l as the special characters ﬁ / ﬂ (which might have 
to do with the fact that the getText() method calculates things like whitespace 
characters from distances / positional placements).

Background: The given document is a dictionary with very densely printed text.

See this link for code and output:
http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox

My question: is there anything I can do to avoid this problem?

Thanks in advance ...



[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread Matthias Bösinger (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Environment: Windows7Professional JavaSE8 EclipseKepler  (was: Windows 
JavaSE8 EclipseKepler)


[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread Matthias Bösinger (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Environment: Windows JavaSE8 EclipseKepler  (was: Windows JavaSE8 Eclipse)


[jira] [Created] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported

2014-12-08 Thread Tilman Hausherr (JIRA)

Tilman Hausherr created PDFBOX-2549:
---

 Summary: TIFF-Predictor with 16 bits per component not supported
 Key: PDFBOX-2549
 URL: https://issues.apache.org/jira/browse/PDFBOX-2549
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr


The attached file GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test suite 
is not displayed; PDFBox throws the exception named in the summary. One open 
source and one closed source product display an X, but gswin renders the image 
properly. 

The upcoming patch handles the 16-bit case. I won't implement 1, 2 or 4 bpc 
because I don't have test images.

I'll add my patch to 1.8 after the cut.
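
For readers unfamiliar with the predictor: the TIFF predictor (Predictor 2 in
PDF) stores each sample as the difference from the same colour component of the
pixel to its left. The sketch below shows how one row with 16 bits per
component could be decoded. It is an illustration under stated assumptions, not
the patch announced here: the class and method names are hypothetical, samples
are taken to be big-endian as elsewhere in PDF, and the sum wraps modulo 2^16.

```java
// Hypothetical helper, not part of PDFBox.
final class TiffPredictor16 {

    /**
     * Undoes TIFF horizontal differencing in place for one row of 16-bit
     * samples stored big-endian.
     *
     * @param row    one row of the decoded stream, 2 bytes per sample
     * @param colors samples per pixel, e.g. 4 for CMYK
     */
    static void decodeRow(byte[] row, int colors) {
        int bytesPerPixel = colors * 2;
        // The first pixel is stored as-is; every later sample is a difference.
        for (int i = bytesPerPixel; i + 1 < row.length; i += 2) {
            int left = ((row[i - bytesPerPixel] & 0xFF) << 8)
                    | (row[i - bytesPerPixel + 1] & 0xFF);
            int diff = ((row[i] & 0xFF) << 8) | (row[i + 1] & 0xFF);
            int sample = (left + diff) & 0xFFFF; // addition wraps modulo 2^16
            row[i] = (byte) (sample >> 8);
            row[i + 1] = (byte) sample;
        }
    }
}
```

Because the row is decoded left to right in place, the "left" sample read in
each step is already the reconstructed value, which is what the predictor
requires.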




[jira] [Updated] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported

2014-12-08 Thread Tilman Hausherr (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2549:

Attachment: GWG181_16Bit_CMYK_X4.pdf

 TIFF-Predictor with 16 bits per component not supported
 ---

 Key: PDFBOX-2549
 URL: https://issues.apache.org/jira/browse/PDFBOX-2549
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
  Labels: Predictor
 Attachments: GWG181_16Bit_CMYK_X4.pdf



[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread Matthias Bösinger (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Attachment: test2.pdf


[jira] [Commented] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

2014-12-08 Thread Matthias Bösinger (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238120#comment-14238120
 ] 

Matthias Bösinger commented on PDFBOX-2548:
---

I added a second test page, from an earlier volume of the same dictionary. For 
this volume, a Type1 font was used. I chose a page where the two words 
"begrifflich" and "spezifisch" occur (they cause problems, as you can see in the 
first test). As you can see/test, the described error doesn't occur when 
extracting the text of this second page! This strengthens my assumption that the 
OpenType format is the reason for the error.

 problems with character extraction (OpenType, dense printed Text)
 -

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Test
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
  Labels: newbie
 Attachments: test.pdf, test2.pdf


 I have a PDF document whose font type is OpenType (Garamond OpenType). So the 
 PDFBox text extraction can also extract special characters (for example small 
 capital letters), which caused problems when the underlying font was a 
 simple Type1 font.
 However, the text extraction now causes another type of problem. In my case, 
 when the character sequences fi or fl occur in the text, 
 PDFTextStripper#getText(PDDocument doc) extracts them as single characters, 
 'ﬁ' and 'ﬂ', and sets a space character on their right side.
 (Surprisingly, if I access the list of characters of a page via the 
 charactersByArticle field of PDFTextStripper / via the 
 PDFTextStripper#processText(TextPosition pos) method, the same characters 
 show up as 'normal-single' characters f i / f l).
 My assumption is that the advantage of the underlying OpenType font turns 
 into this particular disadvantage, because the PDFTextStripper recognizes the 
 character sequence f i / f l as the special characters ﬁ / ﬂ (which might have 
 to do with the fact that the getText() method calculates things like 
 whitespace characters from distances / positional placements).
 Background: The given document is a wordbook with very densely printed text.
 See this link for code and output:
 http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox
 My question: is there anything I can do to avoid this problem?
 Thanks in advance ...
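
 One possible workaround, independent of PDFBox itself, is to post-process the 
 extracted string with Unicode compatibility normalization, which maps ligature 
 code points such as U+FB01 (ﬁ), U+FB02 (ﬂ) and U+FB04 (ﬄ) back to plain 
 letters. This is only a sketch: the class and method names are hypothetical, 
 NFKC also rewrites other compatibility characters, and it does nothing about 
 the extra space that getText() places after the ligature.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: expands ligature code points
// in text that has already been extracted (e.g. by PDFTextStripper#getText).
final class LigatureNormalizer {

    // NFKC applies compatibility decomposition, so U+FB01 becomes "fi",
    // U+FB02 becomes "fl", U+FB04 becomes "ffl", and so on.
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The two words from the test pages, written with ligature code points.
        String raw = "begri\uFB04ich spezi\uFB01sch";
        System.out.println(expandLigatures(raw));
    }
}
```

 Normalizing after extraction keeps the stripper untouched, at the cost of 
 losing the information that the PDF used a ligature glyph.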




[jira] [Commented] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported

2014-12-08 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238201#comment-14238201
 ] 

ASF subversion and git services commented on PDFBOX-2549:
-

Commit 1643881 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1643881 ]

PDFBOX-2549: TIFF SUB predictor for 16bpc

 TIFF-Predictor with 16 bits per component not supported
 ---

 Key: PDFBOX-2549
 URL: https://issues.apache.org/jira/browse/PDFBOX-2549
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
  Labels: Predictor
 Attachments: GWG181_16Bit_CMYK_X4.pdf


 The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test 
 suite is not displayed, PDFBox throws the mentioned exception. One open 
 source and one closed source product display an X, but gswin renders the 
 image properly. 
 The upcoming patch handles the 16bit case. I won't implement 1, 2 or 4 bpc 
 because I don't have test images.
 I'll add my patch to 1.8 after the cut.
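
 For readers unfamiliar with the TIFF predictor (Predictor 2, horizontal 
 differencing), the 16-bit case can be sketched as below. This is not the code 
 from commit r1643881; the class and method names are hypothetical, and the 
 sketch assumes one decoded row of big-endian 16-bit samples with no padding. 
 Each sample stores the difference to the same component of the pixel on its 
 left, so decoding adds the left neighbour back, modulo 65536.

```java
// Hypothetical sketch, not the PDFBox implementation.
final class TiffPredictor16 {

    // Undoes horizontal differencing in place on one row of 16-bit samples.
    // 'colors' is the number of components per pixel (e.g. 4 for CMYK).
    static void decodeRow(byte[] row, int colors) {
        int bytesPerPixel = colors * 2; // two bytes per 16-bit component
        // The first pixel is stored as-is; start with the second pixel.
        for (int i = bytesPerPixel; i + 1 < row.length; i += 2) {
            // Same component of the pixel to the left (already decoded).
            int left = ((row[i - bytesPerPixel] & 0xFF) << 8)
                    | (row[i - bytesPerPixel + 1] & 0xFF);
            // Stored difference for the current sample.
            int delta = ((row[i] & 0xFF) << 8) | (row[i + 1] & 0xFF);
            int sum = (left + delta) & 0xFFFF; // wrap around at 16 bits
            row[i] = (byte) (sum >> 8);        // high-order byte first
            row[i + 1] = (byte) sum;
        }
    }
}
```

 Working in place is safe here because each sample only depends on a value 
 further left in the row, which has already been restored.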




[jira] [Updated] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported

2014-12-08 Thread Tilman Hausherr (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tilman Hausherr updated PDFBOX-2549:

Description:
The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test suite
is not displayed, PDFBox throws the mentioned exception. One open source and
one closed source product display an X, but gswin renders the image properly.

The upcoming patch handles the 16bit case. I won't implement 1, 2 or 4 bpc
because I don't have test images.

I'll add my patch to 1.8 after the cut.

was:
The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test suite
is not displayed, PDFBox throws the mentioned exception. One open source and
one closed source product display an X, but gswin renders the image properly.

The upcoming patch handles the 16bit case. I won't implement 1, 2 or 4 bpc
because I don't have test images.

I'll add my patch 1.8 after the cut.

TIFF-Predictor with 16 bits per component not supported
---

Key: PDFBOX-2549
URL: https://issues.apache.org/jira/browse/PDFBOX-2549
Project: PDFBox
Issue Type: Bug
Components: Rendering
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Labels: Predictor
Attachments: GWG181_16Bit_CMYK_X4.pdf

The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test
suite is not displayed, PDFBox throws the mentioned exception. One open
source and one closed source product display an X, but gswin renders the
image properly.
The upcoming patch handles the 16bit case. I won't implement 1, 2 or 4 bpc
because I don't have test images.
I'll add my patch to 1.8 after the cut.

