[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-17 Thread Craig Strong (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938606#comment-13938606
 ] 

Craig Strong commented on PDFBOX-1988:
--

I tested the fix on the 2.0.0 build and it worked.  Thanks again.

> PDFBox ExtractText issue of PDF with no embedded fonts
> --
>
>                 Key: PDFBOX-1988
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1988
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering, Text extraction
>    Affects Versions: 1.8.4
>         Environment: Windows 7; also PASE on IBM i
>            Reporter: Craig Strong
>              Labels: patch
>             Fix For: 1.8.5, 2.0.0
>
>         Attachments: Test1.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have been using PDFBox 1.8.4 to extract text from several different PDF
> files without problems, using the latest PDFBox app and the ExtractText
> command line tool. There is one PDF from which PDFBox (and iText) fails to
> extract any text, even though I can extract the text with Adobe Reader and
> with pdftotext.exe, which is part of XPdf. The command I run is
> "java -jar pdfbox-app-1.8.4.jar ExtractText Test1.pdf Out.txt". I don't want
> to rely on running pdftotext.exe from a PC, since this is part of an
> automated application. I think the error relates to an unknown font type and
> having to use the few fonts bundled in the jar file. I tried calling the API
> classes directly and forcing a font from a specific location, but I still got
> errors. I thought I had loaded the font with the loadTTF method, but I don't
> know whether that had any effect on the font. I would really like this to
> work straight from the ExtractText class anyway.
> Here are the errors I am getting. I tried this from both a Windows 7 PC and
> our IBM i in the PASE environment, and I get the same errors. The section
> starting at processEncodedText repeats a few times, so I included only the
> first entries.
>  
> Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
> WARNING: Substituting TrueType for unknown font subtype=
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> WARNING: java.lang.NullPointerException
> Throwable occurred: java.lang.NullPointerException
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
>     at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
>     at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
>     at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
>     at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
>     at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>     at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
> WARNING: java.lang.NullPointerException
> Throwable occurred: java.lang.NullPointerException
>     at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
>     at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)

[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-17 Thread Craig Strong (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937894#comment-13937894
 ] 

Craig Strong commented on PDFBOX-1988:
--

Thank you John and Tilman.  That was very quick and effective work.


[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935713#comment-13935713
 ] 

Tilman Hausherr commented on PDFBOX-1988:
-

Yeah, works great!


[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935699#comment-13935699
 ] 

John Hewson commented on PDFBOX-1988:
-

Tilman, I checked with my fix in revision 1577725 and 2.0 rendering is now good.


[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935693#comment-13935693
 ] 

John Hewson commented on PDFBOX-1988:
-

{quote}
adding a line for Courier-Bold in the PDFBox_External_Fonts.properties file.
{quote}

Courier-Bold is one of the Standard 14 fonts; it's not an external font.


[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935533#comment-13935533
 ] 

Tilman Hausherr commented on PDFBOX-1988:
-

Rendering is also bad.


[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935427#comment-13935427
 ] 

John Hewson commented on PDFBOX-1988:
-

The attached file uses the "Courier Bold" font, which is one of the "standard 
14" fonts that do not need to be embedded, so this looks like a bug in PDFBox 
rather than a font embedding issue.
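
To illustrate the point about the standard 14 fonts, here is a minimal sketch of how a reader could check whether a font's BaseFont name is one of them. The class name Standard14Fonts and the method isStandard14 are hypothetical and are not part of the PDFBox API; only the fourteen font names themselves come from the PDF specification. The sketch assumes the name is passed without a subset prefix.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper for illustration only (not a PDFBox class).
 * Holds the PostScript names of the 14 standard PDF fonts, which a
 * PDF may reference without embedding a font program.
 */
final class Standard14Fonts {

    private static final Set<String> NAMES = Collections.unmodifiableSet(
        new HashSet<String>(Arrays.asList(
            "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
            "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
            "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
            "Symbol", "ZapfDingbats")));

    private Standard14Fonts() {
        // static helper, no instances
    }

    /**
     * Returns true if baseFont exactly matches one of the 14 standard names.
     * Assumption: the caller passes the plain BaseFont name; aliases and
     * subset-prefixed names are not handled in this sketch.
     */
    static boolean isStandard14(String baseFont) {
        return baseFont != null && NAMES.contains(baseFont);
    }

    public static void main(String[] args) {
        System.out.println(isStandard14("Courier-Bold")); // true
        System.out.println(isStandard14("Arial"));        // false
    }
}
```

A font that passes this check can legitimately appear in a PDF with no embedded font program, which is why a failure on such a file points at the library rather than at the document.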
