Hi,
I'm new to PDF and PDFBox, and I'm trying to see whether I can use PDFBox to
extract text and its positions from PDF files.
I ran into a NullPointerException which seems to be caused by the fact
that COSDictionary.getDictionaryObject(COSName.ENCODING) returns null. This
happens with some PDFs.
This is what I did. I created a fairly simple application to start with:
PDFTextStripper printer = new PDFTextStripper();
printer.writeText(document, new OutputStreamWriter(System.out));
This gives me the NullPointerException:
java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:136)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:408)
    at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at TestRun.main(TestRun.java:79)
Line 136 in PDSimpleFont.java refers to the 'encoding' property, which seems
to be null because PDFont.getEncoding() returns null. I added some debug
lines to the PDFBox code, and it appears
that COSDictionary.getDictionaryObject(COSName.ENCODING), which is called
from PDFont.getEncoding(), occasionally returns null. This causes the
NullPointerException above.
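To illustrate what I think is going on, here is a minimal sketch that uses stand-in classes only. It is not the actual PDFBox code: EncodingStub, FontStub and all of their methods are names I made up, and the "height" calculation is a dummy. It only shows how a dictionary lookup that returns null for a missing Encoding entry turns into a NullPointerException further up, and how a null check with a fallback value would avoid it.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in classes for illustration only; none of this is PDFBox code.
class EncodingNullSketch {

    // Hypothetical stand-in for an encoding object.
    static class EncodingStub {
        String getName(int code) {
            return "glyph" + code;
        }
    }

    // Hypothetical stand-in for a font backed by a dictionary.
    static class FontStub {
        private final Map<String, EncodingStub> dictionary = new HashMap<>();

        // Adds an Encoding entry, as a well-formed font would have.
        FontStub withEncoding() {
            dictionary.put("Encoding", new EncodingStub());
            return this;
        }

        // Returns null when the dictionary has no Encoding entry.
        EncodingStub getEncoding() {
            return dictionary.get("Encoding");
        }

        // Unguarded: dereferences the encoding without checking it,
        // so it throws NullPointerException when the entry is missing.
        float unguardedHeight(int code) {
            return getEncoding().getName(code).length();
        }

        // Guarded: returns the caller's fallback when the entry is missing.
        float guardedHeight(int code, float fallback) {
            EncodingStub encoding = getEncoding();
            if (encoding == null) {
                return fallback;
            }
            return encoding.getName(code).length();
        }
    }

    public static void main(String[] args) {
        FontStub withoutEncoding = new FontStub();
        System.out.println(withoutEncoding.guardedHeight(65, 0f));
        try {
            withoutEncoding.unguardedHeight(65);
        } catch (NullPointerException e) {
            System.out.println("unguarded lookup threw NullPointerException");
        }
    }
}
```

Whether a fallback like this would be acceptable inside PDFBox itself, or whether the missing encoding should be resolved some other way, is exactly what I don't know.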
Any clues? Can I fix this?
Wouter