Re: LazyTextExtractorField and background text extraction

Jukka Zitting Thu, 16 Jul 2009 02:33:51 -0700

Hi,

On Thu, Jul 16, 2009 at 11:04 AM, Marcel Reutegger wrote:
> hmm, even if the conversion from reader to string is done in a
> separate thread as part of the extractor job, there remains the issue
> when the reader is used as is.


As far as I can tell from the code, this is currently not the case as
all the binary values get wrapped into LazyTextExtractorFields.

> we'd have to change the way how the indexer finds out whether the
> extractor times out.

Would it help if we added an unbounded buffering mechanism (backed by
temporary files as needed) to the Readers, so that if the indexer gets
blocked extracting text from one document, all the other pending
documents can automatically continue text extraction in parallel? This
might cause occasional blocking in the indexer, but on average it
should do about as well as maintaining an explicit indexing queue.

In fact if we did this in Tika, we could avoid the extra buffering
entirely for things like plain text documents and other formats where
the parsing overhead is negligible.
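
To make the buffering idea concrete, here is a rough sketch of what such
a Reader could look like. It is only an illustration under my own
assumptions: the SpoolingReader class and its pump() helper are
hypothetical names, not existing Jackrabbit or Tika code, and for brevity
it always spools to a temporary file instead of switching to one only
when an in-memory buffer fills up.

```java
import java.io.*;

/** Hypothetical sketch: drains a slow Reader into a temp file on a background thread. */
class SpoolingReader extends Reader {
    private final File spool;
    private final DataInputStream back; // reads back what has been spooled
    private long written = 0;           // chars spooled so far (guarded by this)
    private long consumed = 0;          // chars already handed to the consumer
    private boolean done = false;       // set once the source is exhausted
    private IOException failure;        // extraction error, if any

    SpoolingReader(Reader source) throws IOException {
        spool = File.createTempFile("extract", ".spool");
        spool.deleteOnExit();
        back = new DataInputStream(new FileInputStream(spool));
        Thread pump = new Thread(() -> pump(source), "text-extraction-spool");
        pump.setDaemon(true);
        pump.start();
    }

    /** Background side: copy the source to the spool file, two bytes per char. */
    private void pump(Reader source) {
        char[] chunk = new char[4096];
        try (Reader src = source; OutputStream out = new FileOutputStream(spool)) {
            for (int n; (n = src.read(chunk)) != -1; ) {
                byte[] bytes = new byte[n * 2];
                for (int i = 0; i < n; i++) {
                    bytes[2 * i] = (byte) (chunk[i] >> 8);
                    bytes[2 * i + 1] = (byte) chunk[i];
                }
                out.write(bytes);
                synchronized (this) { written += n; notifyAll(); }
            }
        } catch (IOException e) {
            synchronized (this) { failure = e; }
        } finally {
            synchronized (this) { done = true; notifyAll(); }
        }
    }

    /** Consumer side: blocks only when no spooled text is available yet. */
    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n;
        synchronized (this) {
            try {
                while (written == consumed && !done) wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new InterruptedIOException("interrupted while waiting for text");
            }
            if (written == consumed) {
                if (failure != null) throw failure;
                return -1; // source exhausted and everything consumed
            }
            n = (int) Math.min(len, written - consumed);
        }
        byte[] bytes = new byte[n * 2];
        back.readFully(bytes); // these bytes are already in the file
        for (int i = 0; i < n; i++) {
            cbuf[off + i] = (char) (((bytes[2 * i] & 0xff) << 8) | (bytes[2 * i + 1] & 0xff));
        }
        synchronized (this) { consumed += n; }
        return n;
    }

    /** Releases the spool file; cancelling the background thread is omitted here. */
    @Override
    public void close() throws IOException {
        back.close();
        spool.delete();
    }
}
```

The indexer would wrap the Reader of every pending document as soon as
the document is queued, so each extraction runs to completion in its own
thread while the indexer is stuck on a slow one. A real version would
want a shared thread pool and a way to cancel extraction on close().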

BR,

Jukka Zitting
