[ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935533#comment-13935533 ]
Tilman Hausherr edited comment on PDFBOX-1988 at 3/14/14 8:15 PM: ------------------------------------------------------------------ Text extraction always works with the 2.0 version. Rendering is bad with the 2.0 version. I'm able to render the file properly with 2.0 by making two changes: - adding {code} dic.setName( COSName.TYPE, COSName.TRUE_TYPE.toString() ); {code} after the "Substituting TrueType for unknown font subtype=" warning (I'm not committing this because I prefer to have another opinion. The idea is that if we are to fake a TT font, that we should attach the type) - adding a line for Courier-Bold in the PDFBox_External_Fonts.properties file. was (Author: tilman): Rendering is also bad. > PDFBox ExtractText issue of PDF with no embedded fonts > ------------------------------------------------------ > > Key: PDFBOX-1988 > URL: https://issues.apache.org/jira/browse/PDFBOX-1988 > Project: PDFBox > Issue Type: Bug > Components: Rendering, Text extraction > Affects Versions: 1.8.4 > Environment: Windows 7 > Also, PASE on IBM i > Reporter: Craig Strong > Labels: patch > Fix For: 1.8.5 > > Attachments: Test1.pdf > > Original Estimate: 120h > Remaining Estimate: 120h > > I have been using PDFBox 1.8.4 to extract text from several different PDF > files fine. I use the latest PDFBox app with ExtractText command line. > There is one PDF that PDFBox (and iText) fails to extract any text even > though I can extract the text with Adobe Reader and also pdftotext.exe part > of XPdf. "java -jar pdfbox-app-1.8.4.jar ExtractText Test1.pdf Out.txt". I > don't want to have to rely on using pdftotext.exe from a PC since this is > part of an automated application. I think the error relates to an unknown > font type and having to use the few fonts installed in the jar file. I tried > running the API classes and trying to force a font from a certain location > but I still got errors. I thought I loaded the font with the loadTTF method > but I don't know if that did anything with the font. I would really like to > have this working straight from the ExtractText class anyway. > Here are the errors I am getting. I tried this from both a Windows 7 PC and > our IBM i in the PASE environment but I get the same errors. The section > starting processEncodedText and on repeats a few times so I just included the > first entries. > > Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory > createFont > WARNING: Substituting TrueType for unknown font subtype= > > Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine > processOperator > WARNING: java.lang.NullPointerException > > Throwable occurred: java.lang.NullPointerException > > at > org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375) > at > org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221) > > at > org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119) > > at > org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121) > > at > org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204) > > at > org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604) > > at > org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54) > > at > org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554) > > at > org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268) > at > org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) > at > org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215) > > at > org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456) > > at > org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381) > > at > org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340) > > at > org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275) > > at org.apache.pdfbox.ExtractText.main(ExtractText.java:85) > > at org.apache.pdfbox.PDFBox.main(PDFBox.java:58) > > Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine > processEncodedText > WARNING: java.lang.NullPointerException > > Throwable occurred: java.lang.NullPointerException > > at > org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355) > at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45) > > at > org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554) > > at > org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268) > > at > org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) > > at > org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215) > > at > org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456) > > at > org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381) > > at > org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340) > > at > org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275) > > at org.apache.pdfbox.ExtractText.main(ExtractText.java:85) > > at org.apache.pdfbox.PDFBox.main(PDFBox.java:58) > > Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine > processOperator > WARNING: java.lang.NullPointerException > > Throwable occurred: java.lang.NullPointerException > > at > org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:364) > at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45) > > at > org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554) > > at > org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268) > > at > org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) > > at > org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215) > > at > org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456) > > at > org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381) > > at > org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340) > > at > org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275) > > at org.apache.pdfbox.ExtractText.main(ExtractText.java:85) > > at org.apache.pdfbox.PDFBox.main(PDFBox.java:58) > > Thanks, > Craig Strong -- This message was sent by Atlassian JIRA (v6.2#6252)