Re: Parsing incomplete PDF and Office files

Jonathan Koren Thu, 13 Nov 2008 16:30:40 -0800

On a related note, does Tika support full text extraction of PDFs?


On Nov 13, 2008, at 1:52 PM, Jukka Zitting wrote:

Hi,

On Thu, Nov 13, 2008 at 9:04 PM, Milos Kovacevic <[EMAIL PROTECTED]> wrote:
I would like to download just a few kilobytes of a PDF (or doc) file and extract the text from it. I do not want to download the whole file and then parse it, just the truncated first N KB. Is that possible with Tika or not? If
not, how should I do that?

That's currently not possible, but AFAIK there is support for
page-by-page streaming in PDFBox (for PDF documents that support that,
not all of them do). It would be nice if Tika could leverage that
functionality in PDFBox.

However, I'm not sure how well that would work with truncated streams.
I guess the reasonable approach would be to stream as much text as can
be parsed, and then fail with a TikaException if the input stream ends
unexpectedly. Your application would then need to be aware of this
error condition and handle it appropriately.
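
To illustrate the "stream as much as can be parsed, then fail" idea, here is a minimal sketch in plain Java. It does not use Tika or PDFBox: the record format (one length byte followed by that many ASCII bytes), the class PartialExtraction, and the methods parse and extractAvailable are all hypothetical stand-ins, and EOFException plays the role that a TikaException would play in the approach described above. The point is only the pattern: the parser writes text into a sink as it goes, so the caller still has the partial text when the exception arrives.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class PartialExtraction {

    /** Best-effort outcome: the text recovered, and whether the input ended unexpectedly. */
    static final class Result {
        final String text;
        final boolean endedEarly;

        Result(String text, boolean endedEarly) {
            this.text = text;
            this.endedEarly = endedEarly;
        }
    }

    /**
     * Toy streaming parser (hypothetical format, not a real PDF parser).
     * Each record is one length byte followed by that many ASCII bytes.
     * Every complete record is appended to the sink immediately; if the
     * stream stops in the middle of a record, an EOFException is thrown.
     */
    static void parse(InputStream in, StringBuilder sink) throws IOException {
        int len;
        while ((len = in.read()) != -1) {
            byte[] body = in.readNBytes(len); // Java 11+
            if (body.length < len) {
                throw new EOFException("input ended inside a record");
            }
            sink.append(new String(body, StandardCharsets.US_ASCII));
        }
    }

    /** Parses whatever bytes have been downloaded and keeps the text emitted before any failure. */
    static Result extractAvailable(byte[] downloaded) {
        StringBuilder sink = new StringBuilder();
        try {
            parse(new ByteArrayInputStream(downloaded), sink);
            return new Result(sink.toString(), false);
        } catch (EOFException e) {
            // The application-level handling: accept the partial text, flag the truncation.
            return new Result(sink.toString(), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Given only the first 9 bytes of a 12-byte input holding the records "Hello" and "World", extractAvailable returns the text "Hello" with endedEarly set, since the second record is cut off. Whether a real PDF or Office parser can emit useful text before reaching the end of the file depends on the format and the parser, so this shows the error-handling shape rather than a working solution.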

BR,

Jukka Zitting


--
Jonathan Koren
[EMAIL PROTECTED]
http://www.soe.ucsc.edu/~jonathan/
