[jira] [Commented] (PDFBOX-2397) Running within an Applet throws an AccessControlException
[ https://issues.apache.org/jira/browse/PDFBOX-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237584#comment-14237584 ]

Andreas Lehmkühler commented on PDFBOX-2397:

[~tilman] Any updates on this topic, or should we simply postpone this issue?

Running within an Applet throws an AccessControlException

    Key: PDFBOX-2397
    URL: https://issues.apache.org/jira/browse/PDFBOX-2397
    Project: PDFBox
    Issue Type: Bug
    Components: PDModel
    Affects Versions: 1.8.7
    Environment: JRE 7u67 or JRE 6u45 (Windows 7 SP1 64bit)
    Reporter: Bertrand Gillis
    Assignee: Tilman Hausherr
    Fix For: 1.8.8

As soon as PDFBox is embedded in a signed applet, the following exception is thrown when I try to print a PDF document through PDFBox:

{code}
Caused by: java.security.AccessControlException: access denied (java.util.PropertyPermission org.apache.pdfbox.ICC_override_color read)
    at java.security.AccessControlContext.checkPermission(Unknown Source)
    at java.security.AccessController.checkPermission(Unknown Source)
    at java.lang.SecurityManager.checkPermission(Unknown Source)
    at sun.plugin2.applet.AWTAppletSecurityManager.checkPermission(Unknown Source)
    at java.lang.SecurityManager.checkPropertyAccess(Unknown Source)
    at java.lang.System.getProperty(Unknown Source)
    at java.lang.Integer.getInteger(Unknown Source)
    at java.lang.Integer.getInteger(Unknown Source)
    at java.awt.Color.getColor(Unknown Source)
    at java.awt.Color.getColor(Unknown Source)
    at org.apache.pdfbox.pdmodel.graphics.color.PDColorState.<clinit>(PDColorState.java:50)
{code}

The same issue existed in earlier PDFBox versions for the following statement:

{code:title=BaseParser.java}
FORCE_PARSING = Boolean.getBoolean("org.apache.pdfbox.forceParsing");
{code}

It was fixed in later versions:

{code:title=BaseParser.java}
static
{
    try
    {
        FORCE_PARSING = Boolean.getBoolean("org.apache.pdfbox.forceParsing");
    }
    catch (SecurityException e)
    {
        // ignore: the property cannot be read under a restrictive security manager
    }
}
{code}

Unfortunately, this fix has not been applied to the following property:

{code:title=PDColorState.java}
private static volatile Color iccOverrideColor = Color.getColor("org.apache.pdfbox.ICC_override_color");
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
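The guard that the report describes for BaseParser could be applied to the color property in the same way. The following is only a minimal sketch under that assumption, not the change that was eventually committed to PDFBox; the class name `SafeColorProperty` and the method `read` are hypothetical. Only `Color.getColor(String)` and `SecurityException` are real JDK API.

```java
import java.awt.Color;

/** Hypothetical helper (not PDFBox code): reads a Color system property defensively. */
final class SafeColorProperty {
    private SafeColorProperty() {
    }

    /**
     * Returns the color stored in the named system property, or null if the
     * property is unset, cannot be parsed, or may not be read in this sandbox.
     */
    static Color read(String propertyName) {
        try {
            // Color.getColor reads the property through Integer.getInteger,
            // which is the call that fails in the stack trace above.
            return Color.getColor(propertyName);
        } catch (SecurityException e) {
            // AccessControlException is a SecurityException; treat it as "no override".
            return null;
        }
    }
}
```

A static field could then be initialised with `SafeColorProperty.read("org.apache.pdfbox.ICC_override_color")` so that class initialisation no longer fails when the property is not readable.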
[jira] [Updated] (PDFBOX-2397) Running within an Applet throws an AccessControlException
[ https://issues.apache.org/jira/browse/PDFBOX-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2397:

    Fix Version/s: (was: 1.8.8)
[jira] [Commented] (PDFBOX-2397) Running within an Applet throws an AccessControlException
[ https://issues.apache.org/jira/browse/PDFBOX-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237593#comment-14237593 ]

Tilman Hausherr commented on PDFBOX-2397:

Postpone, for the reason I mentioned on Nov. 3rd, until either [~bgillis] comes back or somebody else comes along who is willing to run an applet with the modified code.
[jira] [Resolved] (PDFBOX-2512) OutOfMemory while signing large documents
[ https://issues.apache.org/jira/browse/PDFBOX-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Chojecki resolved PDFBOX-2512.

    Resolution: Fixed
    Fix Version/s: 1.8.8

There is still one point open, but with the workaround mentioned in the comment, this issue is resolved.

OutOfMemory while signing large documents

    Key: PDFBOX-2512
    URL: https://issues.apache.org/jira/browse/PDFBOX-2512
    Project: PDFBox
    Issue Type: Bug
    Components: Parsing, Signing
    Affects Versions: 1.8.7
    Reporter: Thomas Chojecki
    Assignee: Thomas Chojecki
    Fix For: 1.8.8
    Attachments: keystore.p12

While working with large documents, we found several memory issues:

1. The close() method in COSDocument clones the object pool and does not clean it up properly. The cloning in getObjects() causes an OutOfMemoryError.
2. The COSWriter copies the whole PDF into memory for signing and does not wrap the FileInputStream in a BufferedInputStream, which also has a big performance impact (PDFBOX-1798).
3. The cloning of COSStreams causes an OutOfMemoryError.

I used the CreateSignature example with a document of about 150 MB from here: https://cdn-reichelt.de/bilder/downloads/reichelt_01-2015_DE_B_HQ.pdf

Additionally, I passed a RandomAccessFile to PDDocument.load in the CreateSignature class:

PDDocument doc = PDDocument.load(document, new RandomAccessFile(new File("d:\\temp.bin"), "rw"));

(This prevents the OOM in the third case.)

Using a BufferedInputStream in the second case reduces the signing time from more than 5 minutes to less than 1 minute.
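The speed-up reported for the second point is consistent with how buffering reduces the number of reads that reach the underlying stream. The sketch below illustrates only that general effect. It does not reproduce COSWriter, and the assumption that the unbuffered code path issues many small reads is ours, not something the report states. `BufferedReadDemo` and `CountingInputStream` are hypothetical names, and a ByteArrayInputStream stands in for the FileInputStream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical demo: counts how many reads reach the underlying stream. */
final class BufferedReadDemo {

    /** Wrapper that counts every read request forwarded to the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long calls;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /**
     * Reads the data one byte at a time, optionally through a BufferedInputStream,
     * and returns the number of read calls that reached the underlying stream.
     */
    static long underlyingCalls(byte[] data, boolean buffered) {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        InputStream in = buffered ? new BufferedInputStream(source, 8192) : source;
        try {
            while (in.read() != -1) {
                // consume byte by byte, as a naive copy loop would
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.calls;
    }
}
```

With a real file, each call that reaches the underlying stream is typically a system call, which is why wrapping the FileInputStream can matter so much for byte-wise reads.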
[jira] [Commented] (PDFBOX-2512) OutOfMemory while signing large documents
[ https://issues.apache.org/jira/browse/PDFBOX-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237747#comment-14237747 ]

Andreas Lehmkühler commented on PDFBOX-2512:

Are these changes limited to the 1.8-branch or should we add them to the trunk as well?
[jira] [Commented] (PDFBOX-2512) OutOfMemory while signing large documents
[ https://issues.apache.org/jira/browse/PDFBOX-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237806#comment-14237806 ]

Thomas Chojecki commented on PDFBOX-2512:

If we can port it, we should. These are only small changes that improve performance and fix the OOM problem.
[jira] [Commented] (PDFBOX-1351) False paragraph caused by superscript (1.7 regression)
[ https://issues.apache.org/jira/browse/PDFBOX-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238010#comment-14238010 ]

Merijn Wijngaard commented on PDFBOX-1351:

This problem still persists in PDFBox 1.8.7. Superscript does not sound like a rare use case to me, so it would be nice if this could be fixed. Inlining the superscript in the text output seems like the best solution to me.

False paragraph caused by superscript (1.7 regression)

    Key: PDFBOX-1351
    URL: https://issues.apache.org/jira/browse/PDFBOX-1351
    Project: PDFBox
    Issue Type: Bug
    Components: Text extraction
    Affects Versions: 1.7.0
    Reporter: Daniel Bonniot de Ruisselet
    Attachments: PDFParaTest.java, superscript.pdf

On the attached minimal example document, text extraction seems to be confused by the superscript and generates three paragraphs where there is only one. Note that 1.6 processes this case correctly:

{noformat}
$ java -jar /dev/shm/pdfbox-app-1.6.0.jar ExtractText /tmp/superscript.pdf
Jun 29, 2012 4:52:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
$ cat /tmp/superscript.txt
Multiple synthetic routes have been described by R. Filler et al.11 regarding 1,3- Bis(perfluorophenyl)propane-1,3-dione. The synthesis and
$ java -jar /dev/shm/pdfbox-app-1.7.0.jar ExtractText /tmp/superscript.pdf
Jun 29, 2012 4:52:39 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
$ cat /tmp/superscript.txt
Multiple synthetic routes have been described by R. Filler et al. 11 regarding 1,3- Bis(perfluorophenyl)propane-1,3-dione. The synthesis and
{noformat}
[jira] [Created] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)
Matthias Bösinger created PDFBOX-2548:

    Summary: problems with character extraction (OpenType, dense printed Text)
    Key: PDFBOX-2548
    URL: https://issues.apache.org/jira/browse/PDFBOX-2548
    Project: PDFBox
    Issue Type: Test
    Components: Text extraction
    Affects Versions: 1.8.7
    Environment: Windows JavaSE8 Eclipse
    Reporter: Matthias Bösinger
    Priority: Minor

I have a PDF document whose font is OpenType (Garamond OpenType). As a result, the PDFBox text extraction can also extract special characters (for example small capital letters), which caused problems when the underlying font was a simple Type1 font.

However, the text extraction now causes another kind of problem. In my case, when the character sequences "fi" or "fl" occur in the text, PDFTextStripper#getText(PDDocument doc) extracts each of them as a single ligature character ('ﬁ' and 'ﬂ') and puts a space character to its right. (Surprisingly, if I access the list of characters of a page via the charactersByArticle field of PDFTextStripper, or via the PDFTextStripper#processText(TextPosition pos) method, the same characters show up as normal single characters: f i / f l.)

My assumption is that the advantage of the underlying OpenType font turns into this particular disadvantage, because PDFTextStripper recognizes the character sequences f i / f l as the special characters ﬁ / ﬂ (which might have to do with the fact that getText() derives things like whitespace characters from distances and positional placement).

Background: the given document is a wordbook with very densely printed text.

My question: is there anything I can do to avoid this problem? Thanks in advance.
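One possible workaround on the caller's side, independent of PDFBox, is to expand ligature characters after extraction. The sketch below assumes the single characters in question are the Unicode ligatures U+FB01 and U+FB02, which the report does not state explicitly. The class `LigatureCleanup` is hypothetical; `java.text.Normalizer` is standard JDK API. The sketch does not remove the extra space that the report describes.

```java
import java.text.Normalizer;

/** Hypothetical post-processing step for extracted text (not part of PDFBox). */
final class LigatureCleanup {
    private LigatureCleanup() {
    }

    /**
     * Expands compatibility characters such as the "fi" ligature (U+FB01) and
     * the "fl" ligature (U+FB02) into their plain letter sequences using NFKC.
     */
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }
}
```

NFKC also folds other compatibility characters (for example superscript digits and non-breaking spaces), so for a text that relies on such characters a targeted replacement of just the ligature code points may be the safer choice.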
[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Bösinger updated PDFBOX-2548:

    Attachment: test.pdf
[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Bösinger updated PDFBOX-2548:

    Description: added the sentence "see this link for code and output: http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox" before the closing question; the rest of the description is unchanged.
[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Bösinger updated PDFBOX-2548:

    Environment: Windows7Professional JavaSE8 EclipseKepler (was: Windows JavaSE8 EclipseKepler)
[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Bösinger updated PDFBOX-2548:

    Environment: Windows JavaSE8 EclipseKepler (was: Windows JavaSE8 Eclipse)
[jira] [Created] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported
Tilman Hausherr created PDFBOX-2549:

    Summary: TIFF-Predictor with 16 bits per component not supported
    Key: PDFBOX-2549
    URL: https://issues.apache.org/jira/browse/PDFBOX-2549
    Project: PDFBox
    Issue Type: Bug
    Components: Rendering
    Affects Versions: 1.8.7, 1.8.8, 2.0.0
    Reporter: Tilman Hausherr
    Assignee: Tilman Hausherr

The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test suite is not displayed; PDFBox throws the mentioned exception. One open source and one closed source product display an X, but gswin renders the image properly.

The upcoming patch handles the 16-bit case. I won't implement 1, 2 or 4 bpc because I don't have test images. I'll add my patch to 1.8 after the cut.
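For readers unfamiliar with the TIFF predictor, the 16-bit case amounts to undoing horizontal differencing on two-byte samples. The following is a minimal sketch of that idea, not the patch referred to above. It assumes big-endian sample order and processes a single row in place; the class `TiffPredictor16` and its method are hypothetical names.

```java
/** Hypothetical sketch of TIFF predictor decoding for 16 bits per component. */
final class TiffPredictor16 {
    private TiffPredictor16() {
    }

    /**
     * Undoes horizontal differencing in place for one row of 16-bit samples.
     *
     * @param row    row data, two bytes per sample, high byte first (assumed)
     * @param colors number of components per pixel, e.g. 4 for CMYK
     */
    static void decodeRow(byte[] row, int colors) {
        int bytesPerPixel = colors * 2;
        // The first pixel is stored as-is; every later sample is a delta to the
        // same component of the pixel on its left, which is already decoded.
        for (int i = bytesPerPixel; i + 1 < row.length; i += 2) {
            int left = ((row[i - bytesPerPixel] & 0xFF) << 8) | (row[i - bytesPerPixel + 1] & 0xFF);
            int delta = ((row[i] & 0xFF) << 8) | (row[i + 1] & 0xFF);
            int value = (left + delta) & 0xFFFF; // arithmetic wraps modulo 65536
            row[i] = (byte) (value >>> 8);
            row[i + 1] = (byte) value;
        }
    }
}
```

For example, a single-component row encoded as 0x0100, 0x00FF, 0xFE01 decodes to 0x0100, 0x01FF, 0x0000, the last value wrapping around modulo 65536.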
[jira] [Updated] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported
[ https://issues.apache.org/jira/browse/PDFBOX-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2549:

    Attachment: GWG181_16Bit_CMYK_X4.pdf
    Labels: Predictor
[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Bösinger updated PDFBOX-2548:

    Attachment: test2.pdf
[jira] [Commented] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238120#comment-14238120 ]

Matthias Bösinger commented on PDFBOX-2548:

I added a second test page, from an earlier volume of the same wordbook. That volume was set in a Type1 font. I chose a page where the two words "begrifflich" and "spezifisch" occur (they cause problems, as you can see in the first test). As you can verify by testing, the described error does not occur when extracting the text of this second page. This strengthens my assumption that the OpenType format is the cause of the error.
[jira] [Commented] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported
[ https://issues.apache.org/jira/browse/PDFBOX-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238201#comment-14238201 ]

ASF subversion and git services commented on PDFBOX-2549:

Commit 1643881 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1643881 ]

PDFBOX-2549: TIFF SUB predictor for 16bpc

TIFF-Predictor with 16 bits per component not supported

Key: PDFBOX-2549
URL: https://issues.apache.org/jira/browse/PDFBOX-2549
Project: PDFBox
Issue Type: Bug
Components: Rendering
Affects Versions: 1.8.7, 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Labels: Predictor
Attachments: GWG181_16Bit_CMYK_X4.pdf

The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test suite is not displayed, PDFBox throws the mentioned exception. One open source and one closed source product display an X, but gswin renders the image properly.

The upcoming patch handles the 16bit case. I won't implement 1, 2 or 4 bpc because I don't have test images. I'll add my patch 1.8 after the cut.
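For readers unfamiliar with the TIFF predictor, the following is a minimal sketch of what undoing horizontal differencing for 16 bits per component could look like. It is not the code from commit r1643881; the class and method names are hypothetical. It assumes two bytes per sample in big-endian order, and that each sample is stored as the difference from the same color component of the pixel to its left.

```java
public class TiffPredictor16Sketch {

    /**
     * Undoes horizontal differencing in place for one row of 16-bit samples.
     *
     * @param row    bytes of a single row, two bytes per sample, big-endian (assumption)
     * @param colors number of components per pixel, e.g. 4 for CMYK
     */
    public static void decodeRow16(byte[] row, int colors) {
        int bytesPerPixel = colors * 2;
        // The first pixel is stored as-is; every later sample is a difference.
        for (int i = bytesPerPixel; i + 1 < row.length; i += 2) {
            int left = ((row[i - bytesPerPixel] & 0xFF) << 8) | (row[i - bytesPerPixel + 1] & 0xFF);
            int diff = ((row[i] & 0xFF) << 8) | (row[i + 1] & 0xFF);
            int value = (left + diff) & 0xFFFF; // addition wraps modulo 65536
            row[i] = (byte) (value >> 8);
            row[i + 1] = (byte) value;
        }
    }

    public static void main(String[] args) {
        // One component per pixel: 0x0100, then +1, then -1 (stored as 0xFFFF).
        byte[] row = {0x01, 0x00, 0x00, 0x01, (byte) 0xFF, (byte) 0xFF};
        decodeRow16(row, 1);
        for (int i = 0; i < row.length; i += 2) {
            int sample = ((row[i] & 0xFF) << 8) | (row[i + 1] & 0xFF);
            System.out.println(Integer.toHexString(sample));
        }
    }
}
```

The only difference from the 8-bit case in this sketch is that the addition is carried out on whole 16-bit samples rather than on single bytes, so a carry from the low byte reaches the high byte.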
[jira] [Updated] (PDFBOX-2549) TIFF-Predictor with 16 bits per component not supported
[ https://issues.apache.org/jira/browse/PDFBOX-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2549:
    Description:
The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test suite is not displayed, PDFBox throws the mentioned exception. One open source and one closed source product display an X, but gswin renders the image properly.
The upcoming patch handles the 16bit case. I won't implement 1, 2 or 4 bpc because I don't have test images. I'll add my patch to 1.8 after the cut.

  was:
The attached image GWG181_16Bit_CMYK_X4.pdf from the Ghent Workgroup test suite is not displayed, PDFBox throws the mentioned exception. One open source and one closed source product display an X, but gswin renders the image properly.
The upcoming patch handles the 16bit case. I won't implement 1, 2 or 4 bpc because I don't have test images. I'll add my patch 1.8 after the cut.
[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2548:
    Issue Type: Bug (was: Test)
[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2548:
    Labels: (was: newbie)
[jira] [Commented] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238282#comment-14238282 ]

John Hewson commented on PDFBOX-2548:

Neither of these PDFs contains OpenType fonts; instead they contain embedded Type 1 fonts. It is common for PDF generating software to perform such format conversions when embedding fonts.
[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2548:
    Summary: Problems with character extraction (fi ligature) (was: problems with character extraction (OpenType, dense printed Text))
[jira] [Comment Edited] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238282#comment-14238282 ]

John Hewson edited comment on PDFBOX-2548 at 12/8/14 7:14 PM:

Neither of these PDFs contains OpenType fonts; instead they contain embedded Type 1 fonts. It is common for PDF generating software to perform such format conversions when embedding fonts.

If you open the file in Adobe Reader and go to File > Properties > Fonts, you can see a list of the embedded fonts and their formats.

was (Author: jahewson):
Neither of these PDFs contain OpenType fonts, instead they contain embedded Type 1 fonts. It is common for PDF generating software to perform such format conversions when embedding fonts.
[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2548:
    Attachment: preflight.png
[jira] [Commented] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238369#comment-14238369 ]

John Hewson commented on PDFBOX-2548:

The embedded text in this PDF really does contain spaces after some of the ligatures, e.g. "Speziﬁ zierung", and Adobe Acrobat extracts the text with those spaces, exactly as PDFBox does. Foxit does the same, but OS X Preview strips the space, which gives the correct result: "Spezifizierung".

Here are the text drawing commands for "Speziﬁ zierung" shown in Adobe Preflight's PDF structure viewer:

!preflight.png!

These commands have the meaning:

0: Draw text "Speziﬁ"
1: Subtract 305.505 units from the x-position (move _backwards_ approx 0.3em, roughly the width of a space)
2: Draw text " " (space)
3: Subtract -20.3063 units from the x-position (move _forwards_ approx 0.02em, this is a kern)
4: Draw text "zierung des logisch-historischen"

So the space is overlaid on top of the ﬁ ligature. Needless to say, this is a very unusual technique which does not result in proper text embedding.

Given that Acrobat produces the same result, and that I don't see any simple way to fix this (one could imagine some complex solution), I'm going to close this issue as not a problem.
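The arithmetic behind the position adjustments listed in the comment above can be sketched as follows. This is only an illustration, not PDFBox code: the class and method names are hypothetical, and it assumes the usual PDF convention that a number inside a TJ array is expressed in thousandths of a text space unit, is subtracted from the current position, and is scaled by the font size (horizontal scaling of 100% is also assumed).

```java
public class TjAdjustmentSketch {

    /**
     * Horizontal displacement caused by a number inside a TJ array.
     * A positive number moves the position backwards, a negative one forwards.
     *
     * @param tjNumber the adjustment as written in the array (thousandths of a unit)
     * @param fontSize the current font size; use 1.0 to read the result in em
     */
    public static double displacement(double tjNumber, double fontSize) {
        return -tjNumber / 1000.0 * fontSize;
    }

    public static void main(String[] args) {
        double fontSize = 1.0; // results are then directly in em
        // Command 1: moves back by roughly 0.3 em, about the width of a space.
        System.out.println(displacement(305.505, fontSize));
        // Command 3: moves forward by roughly 0.02 em, a small kern.
        System.out.println(displacement(-20.3063, fontSize));
    }
}
```

With these two values the space drawn in command 2 starts about 0.3 em before the end of the ligature, which is why it ends up on top of the glyph instead of after it.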
[jira] [Closed] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-2548.
[jira] [Resolved] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson resolved PDFBOX-2548.
    Resolution: Not a Problem
[jira] [Commented] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238382#comment-14238382 ]

John Hewson commented on PDFBOX-2548:

{quote}
(Surprisingly, if I access the list of characters of a page via the charactersByArticle field of PDFTextStripper / via the PDFTextStripper#processText(TextPosition pos) method, the same characters show up as 'normal-single' characters f i / f l).
{quote}

This is by design: certain text in charactersByArticle undergoes [NFKC normalization|http://www.unicode.org/reports/tr15/], which includes mapping ﬁ -> f i.
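The effect of NFKC normalization on the ligature can be reproduced with the JDK alone. The sketch below uses java.text.Normalizer and does not show how PDFBox applies normalization internally; the class name and the sample word are chosen for illustration only.

```java
import java.text.Normalizer;

public class LigatureNormalizationSketch {

    /** Applies Unicode NFKC normalization, which decomposes compatibility ligatures. */
    public static String nfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 is the single-character "fi" ligature.
        String extracted = "Spezi\uFB01zierung";
        System.out.println(extracted.length());       // 13 characters, ligature counts as one
        System.out.println(nfkc(extracted).length()); // 14 characters, ligature split into f + i
        System.out.println(nfkc(extracted));
    }
}
```

Callers who want plain f + i in the output of getText() could apply the same normalization to the returned string, although this does nothing about the extra space discussed in the earlier comment.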
[jira] [Commented] (PDFBOX-2547) maybe encoding error
[ https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238407#comment-14238407 ]

John Hewson commented on PDFBOX-2547:

Text extraction of this PDF does not produce good results with Acrobat either, although the problems are not as bad as with PDFBox. Acrobat extracts nothing for 'ę' and 'ą' but 'na przykład miłe' is extracted correctly.

Calling setSpacingTolerance(0.3) on PDFTextStripper seems to produce better results.

maybe encoding error

Key: PDFBOX-2547
URL: https://issues.apache.org/jira/browse/PDFBOX-2547
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.7
Reporter: Michał
Priority: Minor

Hi,
I just downloaded a PDF from the page http://download.jw.org/files/media_books/32/es15_P.pdf and want to extract text from this document.

I use the command:
java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf resultFile-UTF-8.txt

But I see some problems, for example:
1. In the text file I see 'STX' and 'ETX' instead of 'ę' and 'ą'.
2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe' (page 4, line 6).

Maybe these are just small problems.
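Why a lower spacing tolerance can bring back missing word breaks may be easier to see with a deliberately simplified model. The sketch below is not PDFTextStripper's actual algorithm, which takes more factors into account; the class, the method and all numbers are hypothetical. It assumes only that a word separator is emitted when the gap between two glyphs exceeds the width of a space multiplied by the tolerance.

```java
public class SpacingToleranceSketch {

    /**
     * Simplified decision rule (assumption): insert a space when the gap between
     * two consecutive glyphs is larger than tolerance times the width of a space.
     */
    public static boolean insertsSpace(double gap, double spaceWidth, double tolerance) {
        return gap > spaceWidth * tolerance;
    }

    public static void main(String[] args) {
        double spaceWidth = 0.25; // hypothetical width of a space, in em
        double gap = 0.10;        // hypothetical narrow gap between two words, in em

        // With a tolerance of 0.5 the threshold is 0.125 em, so the gap is ignored.
        System.out.println(insertsSpace(gap, spaceWidth, 0.5));
        // With a tolerance of 0.3 the threshold drops to 0.075 em, so a space is emitted.
        System.out.println(insertsSpace(gap, spaceWidth, 0.3));
    }
}
```

Under this model a tightly set text such as 'na przykład miłe' loses its word breaks when the gaps fall below the threshold, and lowering the tolerance moves the threshold below those gaps again.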
[jira] [Updated] (PDFBOX-2547) maybe encoding error
[ https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2547:
    Affects Version/s: 2.0.0
[jira] [Commented] (PDFBOX-2546) IllegalArgumentException: resourceDictionary is null in PDFMerger
[ https://issues.apache.org/jira/browse/PDFBOX-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238412#comment-14238412 ]

John Hewson commented on PDFBOX-2546:

Well, this is a fun bug :(

IllegalArgumentException: resourceDictionary is null in PDFMerger

Key: PDFBOX-2546
URL: https://issues.apache.org/jira/browse/PDFBOX-2546
Project: PDFBox
Issue Type: Bug
Components: Utilities
Affects Versions: 2.0.0
Reporter: Tilman Hausherr

This was first mentioned on the user mailing list by [~giladd]:

When merging the PDF 1.7 spec with another PDF file this exception appears:
{code}
Exception in thread "main" java.lang.IllegalArgumentException: resourceDictionary is null
    at org.apache.pdfbox.pdmodel.PDResources.<init>(PDResources.java:68)
    at org.apache.pdfbox.util.PDFMergerUtility.appendDocument(PDFMergerUtility.java:448)
    at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:190)
    at org.apache.pdfbox.tools.PDFMerger.merge(PDFMerger.java:70)
    at org.apache.pdfbox.tools.PDFMerger.main(PDFMerger.java:46)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
{code}

I did some debugging, it happens on the very first page. The resources dictionary is indeed null, but it exists when viewing with PDFDebugger.
[jira] [Assigned] (PDFBOX-2546) IllegalArgumentException: resourceDictionary is null in PDFMerger
[ https://issues.apache.org/jira/browse/PDFBOX-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson reassigned PDFBOX-2546:
    Assignee: John Hewson
[jira] [Assigned] (PDFBOX-2542) IllegalArgumentException: root must be of type Pages
[ https://issues.apache.org/jira/browse/PDFBOX-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson reassigned PDFBOX-2542:
-----------------------------------
Assignee: John Hewson

IllegalArgumentException: root must be of type Pages
----------------------------------------------------
Key: PDFBOX-2542
URL: https://issues.apache.org/jira/browse/PDFBOX-2542
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: John Hewson
Attachments: 249776.pdf

{code}
java.lang.IllegalArgumentException: root must be of type Pages
    at org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:66)
    at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:125)
    at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:1175)
{code}
The cause is this
{code}
/Count 11
/Kids [ 100 0 R 141 0 R ]
endobj
{code}
/Type /Pages is missing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping
[ https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238438#comment-14238438 ] John Hewson commented on PDFBOX-2532: - It's very common to need to extract the Encoding from Type1C fonts, so Acrobat must be doing something other than just ignoring the encoding. Either it's a bug in Acrobat (which happens to produce good behaviour for this file) or they have some sort of heuristic. The CharSet entry can't be the deciding factor, because it is optional, and its entries are unordered, so it provides no help in identifying a jumbled encoding (i.e. two encodings with the same characters have the same CharSet, even if their order is different). Text extraction fails due to the usage of the internal font mapping --- Key: PDFBOX-2532 URL: https://issues.apache.org/jira/browse/PDFBOX-2532 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 2.0.0 Reporter: Andreas Lehmkühler Fix For: 2.0.0 Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode mapping) we have to decide where to get a suitable mapping ourselves. We can't use the internal font mapping of the type1C font as it doesn't work in every case, see PDFBOX-2377 which provides a solution for the 1.8-branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping
[ https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238438#comment-14238438 ] John Hewson edited comment on PDFBOX-2532 at 12/8/14 8:49 PM: -- It's very common to need to extract the Encoding from Type1C fonts, so Acrobat must be doing something other than just ignoring the encoding. Either it's a bug in Acrobat (which happens to produce good behaviour for this file) or they have some sort of heuristic. The CharSet entry can't be the deciding factor, because it is optional, and its entries are unordered, so it provides no help in identifying a jumbled encoding (i.e. two encodings with the same characters have the same CharSet, even if their order is different). was (Author: jahewson): It's very common to need to extract the Encoding from Type1C fonts, so Acrobat must be doing something other than just ignoring the encoding. Either it's a bug in Acrobat (which happens to produce good behaviour for this file) or they have some sort of heuristic. The CharSet entry can't be the deciding factor, because it is optional, and its entries are unordered, so it provides no help in identifying a jumbled encoding (i.e. two encodings with the same characters have the same CharSet, even if their order is different). Text extraction fails due to the usage of the internal font mapping --- Key: PDFBOX-2532 URL: https://issues.apache.org/jira/browse/PDFBOX-2532 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 2.0.0 Reporter: Andreas Lehmkühler Fix For: 2.0.0 Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode mapping) we have to decide where to get a suitable mapping ourselves. 
We can't use the internal font mapping of the type1C font as it doesn't work in every case, see PDFBOX-2377 which provides a solution for the 1.8-branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping
[ https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238438#comment-14238438 ] John Hewson edited comment on PDFBOX-2532 at 12/8/14 8:50 PM: -- It's very common to need to extract the Encoding from Type1C fonts, so Acrobat must be doing something other than just ignoring the encoding. Either it's a bug in Acrobat (which happens to produce good behaviour for this file) or they have some sort of heuristic. The CharSet entry can't be the deciding factor, because it is optional, and its entries are unordered, so it provides no help in identifying a jumbled encoding. Two different encodings which contain the same characters will have the same CharSet, even if their order is different. was (Author: jahewson): It's very common to need to extract the Encoding from Type1C fonts, so Acrobat must be doing something other than just ignoring the encoding. Either it's a bug in Acrobat (which happens to produce good behaviour for this file) or they have some sort of heuristic. The CharSet entry can't be the deciding factor, because it is optional, and its entries are unordered, so it provides no help in identifying a jumbled encoding. Two different encodings which contain the same characters will have the same CharSet, even if their order is different). Text extraction fails due to the usage of the internal font mapping --- Key: PDFBOX-2532 URL: https://issues.apache.org/jira/browse/PDFBOX-2532 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 2.0.0 Reporter: Andreas Lehmkühler Fix For: 2.0.0 Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode mapping) we have to decide where to get a suitable mapping ourselves. 
We can't use the internal font mapping of the type1C font as it doesn't work in every case, see PDFBOX-2377 which provides a solution for the 1.8-branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping
[ https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238438#comment-14238438 ] John Hewson edited comment on PDFBOX-2532 at 12/8/14 8:49 PM: -- It's very common to need to extract the Encoding from Type1C fonts, so Acrobat must be doing something other than just ignoring the encoding. Either it's a bug in Acrobat (which happens to produce good behaviour for this file) or they have some sort of heuristic. The CharSet entry can't be the deciding factor, because it is optional, and its entries are unordered, so it provides no help in identifying a jumbled encoding. Two different encodings which contain the same characters will have the same CharSet, even if their order is different). was (Author: jahewson): It's very common to need to extract the Encoding from Type1C fonts, so Acrobat must be doing something other than just ignoring the encoding. Either it's a bug in Acrobat (which happens to produce good behaviour for this file) or they have some sort of heuristic. The CharSet entry can't be the deciding factor, because it is optional, and its entries are unordered, so it provides no help in identifying a jumbled encoding (i.e. two encodings with the same characters have the same CharSet, even if their order is different). Text extraction fails due to the usage of the internal font mapping --- Key: PDFBOX-2532 URL: https://issues.apache.org/jira/browse/PDFBOX-2532 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 2.0.0 Reporter: Andreas Lehmkühler Fix For: 2.0.0 Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode mapping) we have to decide where to get a suitable mapping ourselves. 
We can't use the internal font mapping of the type1C font as it doesn't work in every case, see PDFBOX-2377 which provides a solution for the 1.8-branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-2542) IllegalArgumentException: root must be of type Pages
[ https://issues.apache.org/jira/browse/PDFBOX-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238450#comment-14238450 ] ASF subversion and git services commented on PDFBOX-2542: - Commit 1643915 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1643915 ] PDFBOX-2542: Removed check for Type of page tree root IllegalArgumentException: root must be of type Pages Key: PDFBOX-2542 URL: https://issues.apache.org/jira/browse/PDFBOX-2542 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 2.0.0 Reporter: Tilman Hausherr Assignee: John Hewson Fix For: 2.0.0 Attachments: 249776.pdf {code} java.lang.IllegalArgumentException: root must be of type Pages at org.apache.pdfbox.pdmodel.PDPageTree.init(PDPageTree.java:66) at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:125) at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:1175) {code} The cause is this {code} /Count 11 /Kids [ 100 0 R 141 0 R ] endobj {code} /Type /Pages is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PDFBOX-2542) IllegalArgumentException: root must be of type Pages
[ https://issues.apache.org/jira/browse/PDFBOX-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-2542. - Resolution: Fixed Fix Version/s: 2.0.0 IllegalArgumentException: root must be of type Pages Key: PDFBOX-2542 URL: https://issues.apache.org/jira/browse/PDFBOX-2542 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 2.0.0 Reporter: Tilman Hausherr Assignee: John Hewson Fix For: 2.0.0 Attachments: 249776.pdf {code} java.lang.IllegalArgumentException: root must be of type Pages at org.apache.pdfbox.pdmodel.PDPageTree.init(PDPageTree.java:66) at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:125) at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:1175) {code} The cause is this {code} /Count 11 /Kids [ 100 0 R 141 0 R ] endobj {code} /Type /Pages is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
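The failing root dictionary above has /Count and /Kids but no /Type entry. The following is a minimal sketch of the difference between a strict and a lenient root check. It uses a plain `Map` as a hypothetical stand-in for a PDF dictionary and does not use PDFBox's own classes; the class and method names are assumptions made for illustration. The actual commit simply removed the type check, so the /Kids heuristic here is only one possible lenient rule, not the project's code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LenientPageTreeRoot {

    /** Strict rule: the root must explicitly declare /Type /Pages. */
    static boolean isStrictPagesRoot(Map<String, Object> dict) {
        return "Pages".equals(dict.get("Type"));
    }

    /**
     * Lenient rule (an assumption for this sketch): if /Type is present it must
     * be Pages; if it is missing, a /Kids list is taken as sufficient evidence.
     */
    static boolean isLenientPagesRoot(Map<String, Object> dict) {
        Object type = dict.get("Type");
        if (type != null) {
            return "Pages".equals(type);
        }
        return dict.get("Kids") instanceof List;
    }

    public static void main(String[] args) {
        // Mirrors the dictionary from the report: /Count and /Kids, no /Type.
        Map<String, Object> root = new HashMap<String, Object>();
        root.put("Count", 11);
        root.put("Kids", Arrays.asList("100 0 R", "141 0 R"));

        System.out.println("strict:  " + isStrictPagesRoot(root));
        System.out.println("lenient: " + isLenientPagesRoot(root));
    }
}
```

With the strict rule the dictionary from the report is rejected, which corresponds to the exception in the stack trace; the lenient rule accepts it.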
[jira] [Commented] (PDFBOX-2546) IllegalArgumentException: resourceDictionary is null in PDFMerger
[ https://issues.apache.org/jira/browse/PDFBOX-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238492#comment-14238492 ] ASF subversion and git services commented on PDFBOX-2546: - Commit 1643933 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1643933 ] PDFBOX-2546: PageIterator should be recursive IllegalArgumentException: resourceDictionary is null in PDFMerger - Key: PDFBOX-2546 URL: https://issues.apache.org/jira/browse/PDFBOX-2546 Project: PDFBox Issue Type: Bug Components: Utilities Affects Versions: 2.0.0 Reporter: Tilman Hausherr Assignee: John Hewson Fix For: 2.0.0 This was first mentioned on the user mailing list by [~giladd]: When merging the PDF 1.7 spec with another PDF file this exception appears: {code} Exception in thread main java.lang.IllegalArgumentException: resourceDictionary is null at org.apache.pdfbox.pdmodel.PDResources.init(PDResources.java:68) at org.apache.pdfbox.util.PDFMergerUtility.appendDocument(PDFMergerUtility.java:448) at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:190) at org.apache.pdfbox.tools.PDFMerger.merge(PDFMerger.java:70) at org.apache.pdfbox.tools.PDFMerger.main(PDFMerger.java:46) at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76) {code} I did some debugging, it happens on the very first page. The resources is indeed null, but it exists when viewing with PDFDebugger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PDFBOX-2546) IllegalArgumentException: resourceDictionary is null in PDFMerger
[ https://issues.apache.org/jira/browse/PDFBOX-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-2546. - Resolution: Fixed Fix Version/s: 2.0.0 IllegalArgumentException: resourceDictionary is null in PDFMerger - Key: PDFBOX-2546 URL: https://issues.apache.org/jira/browse/PDFBOX-2546 Project: PDFBox Issue Type: Bug Components: Utilities Affects Versions: 2.0.0 Reporter: Tilman Hausherr Assignee: John Hewson Fix For: 2.0.0 This was first mentioned on the user mailing list by [~giladd]: When merging the PDF 1.7 spec with another PDF file this exception appears: {code} Exception in thread main java.lang.IllegalArgumentException: resourceDictionary is null at org.apache.pdfbox.pdmodel.PDResources.init(PDResources.java:68) at org.apache.pdfbox.util.PDFMergerUtility.appendDocument(PDFMergerUtility.java:448) at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:190) at org.apache.pdfbox.tools.PDFMerger.merge(PDFMerger.java:70) at org.apache.pdfbox.tools.PDFMerger.main(PDFMerger.java:46) at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76) {code} I did some debugging, it happens on the very first page. The resources is indeed null, but it exists when viewing with PDFDebugger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
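The commit message for this issue says the page iterator should be recursive. One plausible reading is that a page tree can nest intermediate /Pages nodes, so an iteration that only looks at the root's direct kids can hand back an intermediate node where a page was expected. The sketch below illustrates that idea only. `Node` is a hypothetical stand-in for a page tree entry, not a PDFBox class, and the code is not taken from the commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecursivePageWalk {

    /** Hypothetical page tree entry: a node without kids is treated as a leaf page. */
    static final class Node {
        final String name;
        final List<Node> kids;

        Node(String name, Node... kids) {
            this.name = name;
            this.kids = Arrays.asList(kids);
        }

        boolean isLeaf() {
            return kids.isEmpty();
        }
    }

    /** Non-recursive walk: returns the root's direct kids, even if they are intermediate nodes. */
    static List<String> shallowKids(Node root) {
        List<String> out = new ArrayList<String>();
        for (Node kid : root.kids) {
            out.add(kid.name);
        }
        return out;
    }

    /** Recursive depth-first walk: descends through intermediate nodes and returns only leaves. */
    static List<String> allPages(Node root) {
        List<String> out = new ArrayList<String>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        if (node.isLeaf()) {
            out.add(node.name);
            return;
        }
        for (Node kid : node.kids) {
            collect(kid, out);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("root",
                new Node("pagesA", new Node("page1"), new Node("page2")),
                new Node("page3"));
        System.out.println(shallowKids(root)); // contains the intermediate node pagesA
        System.out.println(allPages(root));    // contains only leaf pages, in document order
    }
}
```

An intermediate node typically has no resources of its own to copy, which would be consistent with the "resourceDictionary is null" symptom on the very first entry, although the report itself does not confirm this.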
[jira] [Commented] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping
[ https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238494#comment-14238494 ]

Andreas Lehmkühler commented on PDFBOX-2532:
--------------------------------------------
{quote}
It's very common to need to extract the Encoding from Type1C fonts, so Acrobat must be doing something other than just ignoring the encoding. Either it's a bug in Acrobat (which happens to produce good behaviour for this file) or they have some sort of heuristic.
{quote}
It has to be a new bug, as it worked with older Acrobat versions.

{quote}
The CharSet entry can't be the deciding factor, because it is optional, and its entries are unordered, so it provides no help in identifying a jumbled encoding. Two different encodings which contain the same characters will have the same CharSet, even if their order is different.
{quote}
I know the specs. Anyway, in all the cases I know of, it was a good indicator for broken fonts.

Text extraction fails due to the usage of the internal font mapping
-------------------------------------------------------------------
Key: PDFBOX-2532
URL: https://issues.apache.org/jira/browse/PDFBOX-2532
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.0
Reporter: Andreas Lehmkühler
Fix For: 2.0.0
Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png

If a pdf doesn't provide any mapping (neither an encoding nor a toUnicode mapping) we have to decide where to get a suitable mapping ourselves. We can't use the internal font mapping of the type1C font as it doesn't work in every case, see PDFBOX-2377 which provides a solution for the 1.8-branch.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
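The argument quoted above, that an unordered CharSet cannot distinguish a correct encoding from a jumbled one, can be illustrated with a small sketch. An encoding is modelled here as a list of glyph names indexed by character code, and the CharSet as the set of those names. This model and all names in it are assumptions made for illustration; it is not how fonts are represented in PDFBox or FontBox.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CharSetVersusEncoding {

    /** Reduces an encoding (glyph names indexed by code) to its unordered set of names. */
    static Set<String> charSetOf(List<String> encoding) {
        return new HashSet<String>(encoding);
    }

    /** True if both encodings contain the same glyph names, regardless of order. */
    static boolean sameCharSet(List<String> a, List<String> b) {
        return charSetOf(a).equals(charSetOf(b));
    }

    public static void main(String[] args) {
        // Code 0 maps to "A" in the first encoding but to "C" in the second.
        List<String> intended = Arrays.asList("A", "B", "C");
        List<String> jumbled = Arrays.asList("C", "A", "B");

        System.out.println("same encoding: " + intended.equals(jumbled));
        System.out.println("same CharSet:  " + sameCharSet(intended, jumbled));
    }
}
```

The two encodings differ as ordered lists but are indistinguishable as sets, which is the point made in the quoted comment. It does not contradict the reply that, in practice, the CharSet has been a useful indicator for broken fonts.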
[jira] [Comment Edited] (PDFBOX-2539) [PATCH] Allow non static FontProvider
[ https://issues.apache.org/jira/browse/PDFBOX-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238495#comment-14238495 ] John Hewson edited comment on PDFBOX-2539 at 12/8/14 9:39 PM: -- Your updated patch does not compile, it is missing the method PDFStreamEngine#getFontProvider(). was (Author: jahewson): Your updated does not compile, it is missing the method PDFStreamEngine#getFontProvider(). [PATCH] Allow non static FontProvider - Key: PDFBOX-2539 URL: https://issues.apache.org/jira/browse/PDFBOX-2539 Project: PDFBox Issue Type: Bug Components: FontBox Affects Versions: 2.0.0 Reporter: simon steiner Attachments: fontProvider.patch I would like to use multiple instances of fontprovider in a thread-safe way. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-2539) [PATCH] Allow non static FontProvider
[ https://issues.apache.org/jira/browse/PDFBOX-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238495#comment-14238495 ] John Hewson commented on PDFBOX-2539: - Your updated patch does not compile, it is missing the method PDFStreamEngine#getFontProvider(). [PATCH] Allow non static FontProvider - Key: PDFBOX-2539 URL: https://issues.apache.org/jira/browse/PDFBOX-2539 Project: PDFBox Issue Type: Bug Components: FontBox Affects Versions: 2.0.0 Reporter: simon steiner Attachments: fontProvider.patch I would like to use multiple instances of fontprovider in a thread-safe way. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PDFBOX-2550) ClassCastException in PDAnnotation.getColour
Tilman Hausherr created PDFBOX-2550:
------------------------------------
Summary: ClassCastException in PDAnnotation.getColour
Key: PDFBOX-2550
URL: https://issues.apache.org/jira/browse/PDFBOX-2550
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 1.8.8, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr

{code}
java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray
    at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644)
    at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134)
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (PDFBOX-2550) ClassCastException in PDAnnotation.getColour
[ https://issues.apache.org/jira/browse/PDFBOX-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2550: Description: {code} java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644) at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134) {code} The cause is this: {code} /C 19 0 R {code} The current code doesn't expect it to be an indirect object. was: {code} java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644) at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134) {code} ClassCastException in PDAnnotation.getColour Key: PDFBOX-2550 URL: https://issues.apache.org/jira/browse/PDFBOX-2550 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.8, 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Labels: Annotations {code} java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644) at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134) {code} The cause is this: {code} /C 19 0 R {code} The current code doesn't expect it to be an indirect object. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2550) ClassCastException in PDAnnotation.getColour
[ https://issues.apache.org/jira/browse/PDFBOX-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2550: Attachment: 176622.pdf ClassCastException in PDAnnotation.getColour Key: PDFBOX-2550 URL: https://issues.apache.org/jira/browse/PDFBOX-2550 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.8, 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Labels: Annotations Attachments: 176622.pdf {code} java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644) at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134) {code} The cause is this: {code} /C 19 0 R {code} The current code doesn't expect it to be an indirect object. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-2550) ClassCastException in PDAnnotation.getColour
[ https://issues.apache.org/jira/browse/PDFBOX-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239096#comment-14239096 ] ASF subversion and git services commented on PDFBOX-2550: - Commit 1643996 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1643996 ] PDFBOX-2550: allow indirect object and avoid ClassCastException in getColour() ClassCastException in PDAnnotation.getColour Key: PDFBOX-2550 URL: https://issues.apache.org/jira/browse/PDFBOX-2550 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.8, 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Labels: Annotations Attachments: 176622.pdf {code} java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644) at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134) {code} The cause is this: {code} /C 19 0 R {code} The current code doesn't expect it to be an indirect object. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
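The description above says the /C entry was an indirect reference (`19 0 R`) where the code expected an array directly. The following sketch shows the general pattern of resolving a reference before checking the type, instead of casting blindly. `Value`, `ArrayValue` and `Reference` are hypothetical stand-ins invented for this example; they are not the PDFBox COS classes, and the code is not taken from commit r1643996.

```java
public class IndirectColour {

    /** Hypothetical marker type for any parsed PDF value. */
    interface Value {
    }

    /** Hypothetical array value, e.g. the colour components of an annotation. */
    static final class ArrayValue implements Value {
        final float[] components;

        ArrayValue(float... components) {
            this.components = components;
        }
    }

    /** Hypothetical indirect reference that points at another value. */
    static final class Reference implements Value {
        final Value target;

        Reference(Value target) {
            this.target = target;
        }
    }

    /**
     * Resolves one level of indirection, then returns the array,
     * or null if the value is missing or not an array.
     */
    static ArrayValue asArray(Value value) {
        if (value instanceof Reference) {
            value = ((Reference) value).target;
        }
        return value instanceof ArrayValue ? (ArrayValue) value : null;
    }

    public static void main(String[] args) {
        ArrayValue red = new ArrayValue(1f, 0f, 0f);
        Value direct = red;                    // like /C [1 0 0]
        Value indirect = new Reference(red);   // like /C 19 0 R

        System.out.println(asArray(direct) == red);
        System.out.println(asArray(indirect) == red);
        System.out.println(asArray(null) == null);
    }
}
```

Returning null for an unexpected type, rather than throwing, is a design choice of this sketch; a caller such as a validator could then report the problem itself.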
[jira] [Commented] (PDFBOX-2550) ClassCastException in PDAnnotation.getColour
[ https://issues.apache.org/jira/browse/PDFBOX-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239099#comment-14239099 ] Tilman Hausherr commented on PDFBOX-2550: - Will do 1.8 after the cut. ClassCastException in PDAnnotation.getColour Key: PDFBOX-2550 URL: https://issues.apache.org/jira/browse/PDFBOX-2550 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.8, 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Labels: Annotations Attachments: 176622.pdf {code} java.lang.ClassCastException: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSArray at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.getColour(PDAnnotation.java:644) at org.apache.pdfbox.preflight.annotation.AnnotationValidator.checkColors(AnnotationValidator.java:134) {code} The cause is this: {code} /C 19 0 R {code} The current code doesn't expect it to be an indirect object. -- This message was sent by Atlassian JIRA (v6.3.4#6332)