Craig Strong created PDFBOX-1988:
------------------------------------

             Summary: PDFBox ExtractText issue of PDF with no embedded fonts
                 Key: PDFBOX-1988
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1988
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.4
         Environment: Windows 7; also PASE on IBM i
            Reporter: Craig Strong
             Fix For: 1.8.5


I have been using PDFBox 1.8.4 to extract text from several different PDF files 
without problems, running the ExtractText command from the latest PDFBox app 
jar:

"java -jar pdfbox-app-1.8.4.jar ExtractText Test1.pdf Out.txt"

There is one PDF, however, from which PDFBox (and iText) fail to extract any 
text, even though I can extract the text with Adobe Reader and with 
pdftotext.exe (part of Xpdf).  I don't want to rely on running pdftotext.exe 
from a PC, since the extraction is part of an automated application.

I think the error relates to an unknown font type, so that PDFBox has to fall 
back on the few fonts bundled in the jar file.  I also tried calling the API 
classes directly and forcing a font from a specific location, but I still got 
errors.  I believe I loaded the font with the loadTTF method, but I don't know 
whether that had any effect.  In any case, I would really like this to work 
straight from the ExtractText class.
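
For reference, here is a minimal sketch of how an automated job might launch 
the ExtractText command line quoted in this report from Java.  The class 
ExtractTextCommand and its build/run methods are hypothetical names used only 
for illustration and are not part of PDFBox; only the command arguments are 
taken from the report.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: wraps the command line from this report.
class ExtractTextCommand {

    // Builds: java -jar <appJar> ExtractText <inputPdf> <outputTxt>
    static List<String> build(String appJar, String inputPdf, String outputTxt) {
        return Arrays.asList("java", "-jar", appJar, "ExtractText", inputPdf, outputTxt);
    }

    // Runs the command, forwarding the child's stdout/stderr (where the
    // warnings and stack traces appear) to this process, and returns the exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call run(...) where the jar and the PDF exist.
        System.out.println(String.join(" ",
                build("pdfbox-app-1.8.4.jar", "Test1.pdf", "Out.txt")));
    }
}
```

Because the failures in this report are logged as warnings rather than 
reported as a fatal error, checking whether Out.txt is empty is probably a 
more reliable signal than the exit code.
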
Here are the errors I am getting.  I tried this from both a Windows 7 PC and 
our IBM i in the PASE environment, and I get the same errors on both.  The 
section starting at processEncodedText repeats several times, so I have 
included only the first occurrences.

 

Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
WARNING: Substituting TrueType for unknown font subtype=
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
        at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
        at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
        at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
        at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
        at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:364)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
        at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)


Thanks,

Craig Strong



--
This message was sent by Atlassian JIRA
(v6.2#6252)
