I found a solution to my issue.  I was able to install the latest XPdf RPM for
AIX, so I can now run pdftotext from PASE on the IBM i.  I can also adjust font
handling on the fly with a configuration file.  pdftotext converts this PDF to
text on the same system, which PDFBox can't do, so I no longer have to rely on
running pdftotext from a PC.  The -layout option is also nice: it pads the
output with spaces to approximate the PDF's layout, which makes parsing easier.
The PDFBox pdfsplit function will be useful later.  Just to be clear, I still
like the functionality of PDFBox and iText.
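
For anyone scripting the same thing, here is a minimal Java sketch of how the
pdftotext call might be driven from a job on the same system.  It is only an
illustration, not what I actually run: the class and method names are
hypothetical, the binary and configuration file paths are placeholders, it
assumes Java 7 or later, and it assumes the -cfg flag is available as in XPdf's
pdftotext (check the usage output of the installed version).  The splitColumns
helper shows one way the extra spaces from -layout could be used for parsing,
by splitting on runs of two or more spaces.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; none of these names come from XPdf or PDFBox.
class PdfToTextSketch {

    // Builds the pdftotext command line with -layout.
    // binary and cfgFile are placeholder paths; pass null for cfgFile
    // to let pdftotext use its default configuration.
    static List<String> buildCommand(String binary, String cfgFile,
                                     String pdf, String txt) {
        List<String> cmd = new ArrayList<>();
        cmd.add(binary);
        cmd.add("-layout");
        if (cfgFile != null) {
            cmd.add("-cfg");   // assumption: XPdf-style config file flag
            cmd.add(cfgFile);
        }
        cmd.add(pdf);
        cmd.add(txt);
        return cmd;
    }

    // Runs the command, echoes anything it prints to stderr,
    // and returns the process exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true); // merge the child's stderr into its stdout
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.err.println(line);
            }
        }
        return p.waitFor();
    }

    // Splits one line of -layout output into columns on runs of two or
    // more spaces, so single spaces inside a value are preserved.
    static List<String> splitColumns(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split(" {2,}"));
    }

    // Usage: java PdfToTextSketch <pdftotext binary> <input.pdf> <output.txt>
    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: PdfToTextSketch <binary> <in.pdf> <out.txt>");
            return;
        }
        int rc = run(buildCommand(args[0], null, args[1], args[2]));
        System.out.println("pdftotext exit code: " + rc);
    }
}
```

Splitting on two or more spaces is a simple heuristic; columns that -layout
separates by only a single space would need a fixed-position approach instead.
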
I appreciate everyone's assistance.

Thanks,
Craig Strong
----- Forwarded Message -----
From: Craig Strong <[email protected]>
To: "[email protected]" <[email protected]> 
Sent: Monday, March 10, 2014 4:19 PM
Subject: Extracting text from PDF with no embedded fonts
 

I have been using PDFBox to extract text from several different PDF files
without problems.  I use the latest PDFBox app with the ExtractText class.
There is one PDF from which PDFBox (and iText) fails to extract any text, even
though I can extract the text with Adobe Reader and also with pdftotext.exe,
which is part of XPdf.  I don't want to have to rely on running pdftotext.exe
from a PC since this is part of an automated application.

I think the error relates to an unknown font type and having to use the few
fonts installed in the jar file.  I tried running the API classes and forcing a
font from a certain location, but I still got errors.  I thought I loaded the
font with the loadTTF method, but I don't know if that did anything with the
font.  I would really like to have this working straight from the ExtractText
class anyway.  I'm thinking I might have to build my own after putting a bunch
of Windows fonts somewhere and changing a properties file, but I really don't
know if that is the right direction to take, and I am new to PDFBox.  Any
ideas?

Here are the errors I am getting.  I tried this from both a Windows PC and our
system, and I get the same errors on both.  The section starting at
processEncodedText repeats a few times, so I have included only the first
entries.
 
Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
WARNING: Substituting TrueType for unknown font subtype=
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
        at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
        at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
        at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
        at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
        at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:364)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
        at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)

Thanks,
Craig Strong
