On a related note, does Tika support full text extraction of PDFs?

On Nov 13, 2008, at 1:52 PM, Jukka Zitting wrote:

Hi,

On Thu, Nov 13, 2008 at 9:04 PM, Milos Kovacevic <[EMAIL PROTECTED]> wrote:
I would like to download just the first few kilobytes of a PDF (doc) file and extract the text from it. I do not want to download the whole file and then parse it, only the truncated first N KB. Is this possible with Tika? If not, how should I do it?

That's currently not possible, but AFAIK there is support for
page-by-page streaming in PDFBox (for PDF documents that support that,
not all of them do). It would be nice if Tika could leverage that
functionality in PDFBox.

However, I'm not sure how well that would work with truncated streams.
I guess the reasonable approach would be to stream as much text as can
be parsed, and then fail with a TikaException if the input stream ends
unexpectedly. Your application would then need to be aware of this
error condition and handle it appropriately.
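
To make the suggested behaviour concrete, here is a minimal sketch in plain Java. It does not use the Tika or PDFBox APIs. The toy input format (a length byte followed by that many bytes of UTF-8 text per "page"), the class TruncatedExtraction, and the exception TruncatedInputException are all hypothetical stand-ins, the last one playing the role a TikaException would play. The only point is the pattern: emit text as soon as it can be parsed, fail when the stream ends unexpectedly, and let the caller decide to keep the partial result.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class TruncatedExtraction {

    /** Hypothetical stand-in for TikaException; not a real Tika class. */
    static class TruncatedInputException extends Exception {
        TruncatedInputException(String message) {
            super(message);
        }
    }

    /** Builds toy input: one length byte per page, then the page's UTF-8 bytes. */
    static byte[] encode(String... pages) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (String page : pages) {
            byte[] bytes = page.getBytes(StandardCharsets.UTF_8);
            buf.write(bytes.length); // toy format: each page must be under 256 bytes
            buf.write(bytes, 0, bytes.length);
        }
        return buf.toByteArray();
    }

    /**
     * Streams page text into 'out' as each page is read, and fails if the
     * stream ends in the middle of a page. Text appended before the failure
     * stays in 'out'.
     */
    static void extract(InputStream in, StringBuilder out)
            throws IOException, TruncatedInputException {
        int length;
        while ((length = in.read()) != -1) {
            byte[] page = in.readNBytes(length); // Java 11+
            if (page.length < length) {
                throw new TruncatedInputException("input ended inside a page");
            }
            out.append(new String(page, StandardCharsets.UTF_8));
        }
    }

    /** Caller-side handling: treat truncation as "keep what was parsed so far". */
    static String extractAvailableText(InputStream in) {
        StringBuilder text = new StringBuilder();
        try {
            extract(in, text);
        } catch (TruncatedInputException e) {
            // Expected when only the first N KB were downloaded; keep the partial text.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

With this sketch, passing only the first 10 bytes of encode("Hello ", "world") to extractAvailableText yields the first page and silently drops the incomplete second one. Whether a real PDF parser can recover anything useful from a truncated file is a separate question, as noted above.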

BR,

Jukka Zitting

--
Jonathan Koren
[EMAIL PROTECTED]
http://www.soe.ucsc.edu/~jonathan/

