[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts
[ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938606#comment-13938606 ]

Craig Strong commented on PDFBOX-1988:
--------------------------------------

I tested the fix on the 2.0.0 build and it worked. Thanks again.

> PDFBox ExtractText issue of PDF with no embedded fonts
> ------------------------------------------------------
>
>                 Key: PDFBOX-1988
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1988
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering, Text extraction
>    Affects Versions: 1.8.4
>         Environment: Windows 7
>                      Also, PASE on IBM i
>            Reporter: Craig Strong
>              Labels: patch
>             Fix For: 1.8.5, 2.0.0
>
>         Attachments: Test1.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have been using PDFBox 1.8.4 to extract text from several different PDF
> files fine. I use the latest PDFBox app with ExtractText command line.
> There is one PDF that PDFBox (and iText) fails to extract any text even
> though I can extract the text with Adobe Reader and also pdftotext.exe part
> of XPdf. "java -jar pdfbox-app-1.8.4.jar ExtractText Test1.pdf Out.txt". I
> don't want to have to rely on using pdftotext.exe from a PC since this is
> part of an automated application. I think the error relates to an unknown
> font type and having to use the few fonts installed in the jar file. I tried
> running the API classes and trying to force a font from a certain location
> but I still got errors. I thought I loaded the font with the loadTTF method
> but I don't know if that did anything with the font. I would really like to
> have this working straight from the ExtractText class anyway.
> Here are the errors I am getting. I tried this from both a Windows 7 PC and
> our IBM i in the PASE environment but I get the same errors. The section
> starting processEncodedText and on repeats a few times so I just included the
> first entries.
>
> Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
> WARNING: Substituting TrueType for unknown font subtype=
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> WARNING: java.lang.NullPointerException
> Throwable occurred: java.lang.NullPointerException
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
>     at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
>     at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
>     at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
>     at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
>     at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>     at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
> WARNING: java.lang.NullPointerException
> Throwable occurred: java.lang.NullPointerException
>     at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
>     at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts
[ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937894#comment-13937894 ]

Craig Strong commented on PDFBOX-1988:
--------------------------------------

Thank you John and Tilman. That was very quick and effective work.
[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts
[ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935713#comment-13935713 ]

Tilman Hausherr commented on PDFBOX-1988:
-----------------------------------------

Yeah, works great!
[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts
[ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935699#comment-13935699 ]

John Hewson commented on PDFBOX-1988:
-------------------------------------

Tilman, I checked with my fix in revision 1577725 and 2.0 rendering is now good.
[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts
[ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935693#comment-13935693 ]

John Hewson commented on PDFBOX-1988:
-------------------------------------

{quote}
adding a line for Courier-Bold in the PDFBox_External_Fonts.properties file.
{quote}

Courier-Bold is one of the Standard 14 fonts, it's not an external font.
[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts
[ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935533#comment-13935533 ]

Tilman Hausherr commented on PDFBOX-1988:
-----------------------------------------

Rendering is also bad.

> PDFBox ExtractText issue of PDF with no embedded fonts
> ------------------------------------------------------
>
>                 Key: PDFBOX-1988
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1988
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering, Text extraction
>    Affects Versions: 1.8.4
>         Environment: Windows 7
>                      Also, PASE on IBM i
>            Reporter: Craig Strong
>              Labels: patch
>             Fix For: 1.8.5
>
>         Attachments: Test1.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts
[ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935427#comment-13935427 ]

John Hewson commented on PDFBOX-1988:
-------------------------------------

The attached file uses the "Courier Bold" font which is one of the "standard 14" fonts which do not need to be embedded, so this looks like a bug in PDFBox rather than a font embedding issue.

> PDFBox ExtractText issue of PDF with no embedded fonts
> ------------------------------------------------------
>
>                 Key: PDFBOX-1988
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1988
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.4
>         Environment: Windows 7
>                      Also, PASE on IBM i
>            Reporter: Craig Strong
>              Labels: patch
>             Fix For: 1.8.5
>
>         Attachments: Test1.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
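
For readers following the "standard 14" point in the comment above, here is a minimal sketch of the idea: a PDF may name one of fourteen standard Type 1 base fonts without embedding it, so a font such as Courier-Bold should be recognised by name rather than treated as missing or external. This is not PDFBox code and not the fix from revision 1577725; the class name Standard14Fonts and the method isStandard14 are hypothetical names chosen for illustration, and only exact base font names are matched.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: checks a base font name
// against the fourteen standard Type 1 font names.
final class Standard14Fonts {

    // The standard 14 base font names; these need not be embedded in a PDF.
    private static final Set<String> NAMES = Collections.unmodifiableSet(
        new HashSet<String>(Arrays.asList(
            "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
            "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
            "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
            "Symbol", "ZapfDingbats")));

    private Standard14Fonts() {
    }

    // Returns true only for an exact match; null and unknown names yield false.
    static boolean isStandard14(String baseFontName) {
        return baseFontName != null && NAMES.contains(baseFontName);
    }

    public static void main(String[] args) {
        System.out.println(isStandard14("Courier-Bold")); // the font used by Test1.pdf
        System.out.println(isStandard14("Arial"));        // not a standard 14 name
    }
}
```

A lookup like this only answers whether a name is on the standard list; how a library then supplies metrics or glyphs for such a font is a separate matter that the sketch does not cover.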