Re: Tika and PDFBox NonSequentialPDFParser class

Steve Deal Wed, 16 May 2012 09:31:24 -0700

Following Jukka's first suggestion to change the PDF parser, I modified
the Tika 1.1 class org.apache.tika.parser.pdf.PDFParser to load the
document as follows:
Starting at line 100:

        TemporaryResources tmp2 = new TemporaryResources();
        try {
            TikaInputStream tstream = TikaInputStream.get(stream, tmp2);
            File tsFile = tstream.getFile();
            RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
            pdfDocument = PDDocument.loadNonSeq(tsFile, scratchFile);
            // PDFBox can process entirely in memory, or can use a temp file
            //  for unpacked / processed resources
            // Decide which to do based on if we're reading from a file or not already
//            TikaInputStream tstream = TikaInputStream.cast(stream);
//            if (tstream != null && tstream.hasFile()) {
//               // File based, take that as a cue to use a temporary file
//               RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
//               pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
//            } else {
//               // Go for the normal, stream based in-memory parsing
//               pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
//            }
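
For context, a rough sketch of why loadNonSeq() is handed a File rather
than a stream: the non-sequential parser works from the cross-reference
information near the end of the PDF, which needs random access. The
standalone example below uses only the JDK, not Tika or PDFBox. The class
and method names (SpoolToTempFile, spool, tail) are made up for
illustration, and this is not how TikaInputStream.getFile() is actually
implemented; it only shows the general idea of spooling a stream to a
temporary file and then seeking in it.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only; not part of Tika or PDFBox.
public class SpoolToTempFile {

    /** Copies the whole stream into a new temporary file and returns that file. */
    public static File spool(InputStream in) throws IOException {
        File tmp = File.createTempFile("spool-", ".tmp");
        tmp.deleteOnExit();
        // createTempFile() already created the file, so overwrite it.
        Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Reads the last n bytes of the file by seeking straight to them. */
    public static String tail(File file, int n) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long start = Math.max(0, raf.length() - n);
            raf.seek(start);
            byte[] buf = new byte[(int) (raf.length() - start)];
            raf.readFully(buf);
            return new String(buf, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes shaped like a PDF; not a valid document.
        byte[] fake = "%PDF-1.4\n...body...\nstartxref\n1234\n%%EOF"
                .getBytes(StandardCharsets.US_ASCII);
        File file = spool(new ByteArrayInputStream(fake));
        try {
            // A plain InputStream would have to be read to the end to get here.
            System.out.println(tail(file, 5)); // prints "%%EOF"
        } finally {
            file.delete();
        }
    }
}
```

Once the bytes are in a file, jumping to the end is a single seek, which
is the kind of access a plain InputStream cannot offer.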


Tika builds and passes all the unit tests using loadNonSeq()    :-)
Now I will move on to my own testing.

Thanks again to Jukka for pointing me in the right direction!

Best Regards,
Steve Deal
"...I will choose a path that's clear, I will choose free will" - Rush
