I found a solution to my issue. I was able to install the latest XPdf RPM file for AIX, so I can now run pdftotext from PASE on the IBM i. I can also adjust font handling on the fly with a configuration file. This converts the PDF to text on the same system, which PDFBox could not do for this file, and I no longer have to rely on running pdftotext from a PC. The -layout option is also nice: it inserts spacing that approximates the layout of the PDF, which makes parsing easier. The PDFBox pdfsplit function will have some use later. Just to be clear, I still like the functionality of PDFBox and also iText. I appreciate everyone's assistance.
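For anyone who wants to drive pdftotext from Java as part of an automated job, a minimal sketch might look like the following. It assumes pdftotext is on the PATH of the environment running the JVM; the class name, method names, and file names are hypothetical, and only the -layout option mentioned above is passed.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PdfToTextRunner {

    // Builds the argument list: <executable> -layout <input.pdf> <output.txt>
    // The executable name/path is an assumption; adjust it for your system.
    static List<String> buildCommand(String executable, String pdfPath, String txtPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add(executable);
        cmd.add("-layout"); // keep spacing similar to the PDF layout
        cmd.add(pdfPath);
        cmd.add(txtPath);
        return cmd;
    }

    // Runs the command, forwards its output to this process, and returns the exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToTextRunner <input.pdf> <output.txt>");
            return;
        }
        int exitCode = run(buildCommand("pdftotext", args[0], args[1]));
        System.out.println("pdftotext exited with code " + exitCode);
    }
}
```

Keeping the command construction separate from the process launch makes it easy to check the arguments without needing pdftotext installed.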
Thanks,
Craig Strong

----- Forwarded Message -----
From: Craig Strong <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Monday, March 10, 2014 4:19 PM
Subject: Extracting text from PDF with no embedded fonts

I have been using PDFBox to extract text from several different PDF files fine. I use the latest PDFBox app with ExtractText class. There is one PDF that PDFBox (and iText) fails to extract any text even though I can extract the text with Adobe Reader and also pdftotext.exe part of XPdf. I don't want to have to rely on using pdftotext.exe from a PC since this is part of an automated application.

I think the error relates to an unknown font type and having to use the few fonts installed in the jar file. I tried running the API classes and trying to force a font from a certain location but I still got errors. I thought I loaded the font with the loadTTF method but I don't know if that did anything with the font. I would really like to have this working straight from the ExtractText class anyway. I'm thinking I might have to build my own after putting a bunch of Windows fonts somewhere and changing a properties file but I really don't know if that is the right direction I should be taking and I am new to PDFBox. Any ideas?

Here are the errors I am getting. I tried this from both a Windows PC and our system but I get the same errors. The section starting processEncodedText and on repeats a few times so I just included the first entries.
Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
WARNING: Substituting TrueType for unknown font subtype=
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
    at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
    at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
    at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:364)
    at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)

Thanks,
Craig Strong

