Following Jukka's first suggestion to change the PDF parser, I modified
the Tika 1.1 class org.apache.tika.parser.pdf.PDFParser to load the
document as follows, starting at line 100:
TemporaryResources tmp2 = new TemporaryResources();
try {
    TikaInputStream tstream = TikaInputStream.get(stream, tmp2);
    File tsFile = tstream.getFile();
    RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
    pdfDocument = PDDocument.loadNonSeq(tsFile, scratchFile);
    // PDFBox can process entirely in memory, or can use a temp file
    // for unpacked / processed resources
    // Decide which to do based on if we're reading from a file or not already
    // TikaInputStream tstream = TikaInputStream.cast(stream);
    // if (tstream != null && tstream.hasFile()) {
    //     // File based, take that as a cue to use a temporary file
    //     RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
    //     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
    // } else {
    //     // Go for the normal, stream based in-memory parsing
    //     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
    // }
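
For anyone following along: the non-sequential parser works from a File
rather than a stream, which is why the change above goes through
TikaInputStream.getFile() (it spools the stream to a temporary file when
the input is not already file-backed). Below is a rough, standard-library-only
sketch of that spooling idea. The class SpoolSketch and the method
spoolToTempFile are hypothetical names made up for illustration; they are
not part of Tika or PDFBox, and this is not how Tika implements it.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only; not Tika or PDFBox code.
class SpoolSketch {

    // Copy the entire stream into a temporary file and return that file,
    // so a parser that needs random access can seek within it.
    static File spoolToTempFile(InputStream in) throws IOException {
        File tmp = File.createTempFile("spool-", ".bin");
        tmp.deleteOnExit();
        // createTempFile already created the file, so allow overwriting it.
        Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 placeholder bytes".getBytes("US-ASCII");
        File spooled = spoolToTempFile(new ByteArrayInputStream(data));

        // With a file on disk we can jump straight to the end, which a
        // plain forward-only InputStream cannot do.
        try (RandomAccessFile raf = new RandomAccessFile(spooled, "r")) {
            byte[] tail = new byte[5];
            raf.seek(raf.length() - tail.length);
            raf.readFully(tail);
            System.out.println(new String(tail, "US-ASCII"));
        }
    }
}
```

In the real change, TemporaryResources takes care of cleaning up the
temporary file; the sketch just relies on deleteOnExit().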
Tika builds and passes all the unit tests using loadNonSeq() :-)
Now I will move on to my own testing.
Thanks again to Jukka for pointing me in the right direction!
Best Regards,
Steve Deal
"...I will choose a path that's clear, I will choose free will" - Rush