Hello everyone,

I'm new, so please be gentle with me.

We are using PDFBox to extract text from a large number of PDFs (approx. 80,000) in preparation for indexing in Solr/Lucene.

To do this, we call org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages() to get the page count, then iterate over the pages and extract the text with PDFTextStripper one page at a time.

The vast majority are fine, but approx. 0.8% fail with a NullPointerException at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102).

I'm currently working from the trunk after seeing a similar problem in the archives (<http://mail-archives.apache.org/mod_mbox/incubator-pdfbox-dev/200809.mbox/%3cof15421546.54f415dc-on862574ba.006a9e36-862574ba.006ad...@uscmail.uscourts.gov%3e>) but unfortunately it hasn't solved the issue.

The stack trace is:

Caused by: java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102)
    at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:754)
    at com.semantico.depp.extractor.PDFBoxPdfExtractor.writeText(PDFBoxPdfExtractor.java:71)
    at com.semantico.depp.extractor.PDFBoxPdfExtractor.extractText(PDFBoxPdfExtractor.java:56)
    at com.semantico.depp.task.JobTask.doJob(JobTask.java:129)

Having delved into the code, I found that the "page" variable is null when:

page.getDictionaryObject( COSName.COUNT )).intValue()

is called in PDPageNode.getCount() (PDPageNode.java:102).
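
To illustrate the kind of guard that would avoid the exception, here is a minimal sketch. It does not use the real PDFBox classes: a plain Map stands in for the page dictionary, and the names SafePageCount and count are hypothetical. Returning 0 for a missing dictionary is my assumption about a sensible fallback, not something taken from the PDFBox source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: a plain Map plays the role of the page-tree dictionary.
class SafePageCount {
    static final String COUNT = "Count";

    // Returns the stored count, or 0 when the dictionary or the entry is missing.
    static int count(Map<String, Object> page) {
        if (page == null) {
            return 0; // assumed fallback instead of a NullPointerException
        }
        Object value = page.get(COUNT);
        return (value instanceof Number) ? ((Number) value).intValue() : 0;
    }

    public static void main(String[] args) {
        Map<String, Object> page = new HashMap<String, Object>();
        page.put(COUNT, 12);
        System.out.println(count(page));
        System.out.println(count(null));
    }
}
```

Whether treating such a document as having zero pages is acceptable, or whether it should be reported as malformed, is a separate question.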

I understand that not all PDFs can be supported, and to be honest I think 99.2% is amazing. I just thought I would post this in the hopes that someone has come across it before.
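
In the meantime, one way to keep a batch of this size running is to catch the failure per document and record the file for later inspection. The following is only a rough sketch of that idea: Extractor is a hypothetical stand-in for our PDFBox-based extractor, and BatchRunner and run are names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class BatchRunner {
    // Hypothetical stand-in for the real extractor; not a PDFBox type.
    interface Extractor {
        String extractText(String path) throws Exception;
    }

    // Runs the extractor over every path and returns the paths that failed,
    // so one bad document does not stop the rest of the batch.
    static List<String> run(List<String> paths, Extractor extractor) {
        List<String> failed = new ArrayList<String>();
        for (String path : paths) {
            try {
                extractor.extractText(path);
            } catch (Exception e) { // also covers NullPointerException
                failed.add(path);
            }
        }
        return failed;
    }
}
```

The returned list gives a set of problem files that can be examined or attached to a bug report.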

Thanks for any help.

Regards,

Declan
