Hi,
I'm having trouble extracting text from a PDF. The goal is to let users upload
PDFs and add their text to our Lucene index.
This code has been working fine for 95% of the PDFs our users upload.
Unfortunately, the PDF causing this error is quite large, so I didn't attach it.
Anyway, this is the error I'm getting:
Caused by: java.io.IOException: Unknown colorspace type:null
at org.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace(PDColorSpaceFactory.java:121)
at org.pdfbox.pdmodel.PDResources.getColorSpaces(PDResources.java:264)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:196)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:177)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
And this is how I use PDFTextStripper in my code:
InputStream is = DocumentServiceUtil.getFileAsStream(...);
PDDocument document = PDDocument.load(is);
int nbrPages = document.getNumberOfPages();
if (nbrPages > 0) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setLineSeparator(" ");
    stripper.setPageSeparator(" ");
    List<IndexedCatalogPage> pages = new ArrayList<IndexedCatalogPage>();
    for (int i = 1; i <= nbrPages; i++) {
        stripper.setStartPage(i);
        stripper.setEndPage(i);
        String text = stripper.getText(document);
        IndexedCatalogPage page = new IndexedCatalogPage();
        page.setPageNumber(i);
        page.setText(text);
        pages.add(page);
    }
    ...
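As a stopgap, something like the following might let the remaining pages be indexed when a single page fails. It is only a sketch under assumptions: PageTextSource and extractAll are hypothetical names (PageTextSource stands in for the per-page stripper.getText(document) call above), it assumes the failure surfaces as an IOException from that call, and it does not address the colorspace error itself.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class PerPageExtraction {

    /** Hypothetical stand-in for the per-page stripper.getText(document) call. */
    interface PageTextSource {
        String textOfPage(int pageNumber) throws IOException;
    }

    /**
     * Extracts pages 1..nbrPages in order. A page whose extraction throws
     * IOException is recorded in failedPages and contributes an empty string,
     * so the returned list always has nbrPages entries.
     */
    static List<String> extractAll(PageTextSource source, int nbrPages,
                                   List<Integer> failedPages) {
        List<String> texts = new ArrayList<String>();
        for (int i = 1; i <= nbrPages; i++) {
            try {
                texts.add(source.textOfPage(i));
            } catch (IOException e) {
                // Skip only this page; later pages are still attempted.
                failedPages.add(i);
                texts.add("");
            }
        }
        return texts;
    }
}
```

In the loop above, the source would set the start and end page to i and return stripper.getText(document); the failedPages list could then be logged so the problem PDFs can be found later.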
Any help is greatly appreciated!
Best Regards,
Kim