Hi Declan,

there are a lot of unresolved issues that were reported without a sample document, which makes them difficult or nearly impossible to reproduce. Yours is one of them. If possible, please attach a sample file to https://issues.apache.org/jira/browse/PDFBOX-289.
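If it helps to narrow down which of the files to attach, one option might be to catch the exception per document and record the names of the files that fail, so the rest of the batch keeps running. The following is only a rough sketch under assumptions: `FailingPdfFinder`, `findFailures` and the `extractor` callback are hypothetical names and not part of PDFBox, and the callback merely stands in for whatever code loads the file and calls `getNumberOfPages()` and `PDFTextStripper`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper, not part of PDFBox: runs an extraction callback over
// each document name and records the ones that throw, so that one bad file
// does not stop the whole batch.
final class FailingPdfFinder {

    // Returns a map from document name to the runtime exception it raised,
    // in the order the documents were processed.
    static Map<String, RuntimeException> findFailures(List<String> names,
                                                      Consumer<String> extractor) {
        Map<String, RuntimeException> failures = new LinkedHashMap<>();
        for (String name : names) {
            try {
                extractor.accept(name);
            } catch (RuntimeException e) {
                // A NullPointerException such as the one from getCount() lands here.
                failures.put(name, e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Stand-in extractor (assumption): real code would open the PDF and
        // strip its text here, wrapping any checked exception as unchecked.
        Consumer<String> extractor = name -> {
            if (name.startsWith("broken")) {
                throw new NullPointerException("simulated failure");
            }
        };

        Map<String, RuntimeException> failures =
                findFailures(Arrays.asList("ok.pdf", "broken.pdf"), extractor);

        // Print the candidates for attaching to the issue.
        failures.forEach((name, e) -> System.out.println(name + " failed: " + e));
    }
}
```

The files listed at the end would be the candidates for attaching to the issue; a single small one that still triggers the exception is usually enough.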
Andreas

> Hello everyone,
>
> I'm new, so please be gentle with me.
>
> We are using PDFBox to extract text from a large amount of PDFs (approx.
> 80,000) in preparation for indexing in Solr/Lucene.
>
> In order to do this, we use the
> org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages() method in order
> to iterate over the pages and strip the contents using the
> PDFTextStripper a page at a time.
>
> The vast majority are fine, but approx. 0.8% suffer from a
> NullPointerException when it reaches
> org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102)
>
> I'm currently working from the trunk after seeing a similar problem in
> the archives
> (<http://mail-archives.apache.org/mod_mbox/incubator-pdfbox-dev/200809.mbox/%3cof15421546.54f415dc-on862574ba.006a9e36-862574ba.006ad...@uscmail.uscourts.gov%3E>)
> but unfortunately it hasn't solved the issue.
>
> The stack trace is:
>
> Caused by: java.lang.NullPointerException
> : at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102)
> : at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:754)
> : at com.semantico.depp.extractor.PDFBoxPdfExtractor.writeText(PDFBoxPdfExtractor.java:71)
> : at com.semantico.depp.extractor.PDFBoxPdfExtractor.extractText(PDFBoxPdfExtractor.java:56)
> : at com.semantico.depp.task.JobTask.doJob(JobTask.java:129)
>
> Having delved into the code, the "page" variable is null when:
>
> page.getDictionaryObject( COSName.COUNT )).intValue()
>
> is called in PDPageNode.getCount(PDPageNode)
>
> I understand that not all PDFs can be supported, and to be honest I
> think 99.2% is amazing. I just thought I would post this in the hopes
> that someone has come across it before.
>
> Thanks for any help.
>
> Regards,
>
> Declan
