Hi,

On Thu, Jul 16, 2009 at 11:04 AM, Marcel Reutegger<[email protected]> wrote:
> hmm, even if the conversion from reader to string is done in a
> separate thread as part of the extractor job, there remains the issue
> when the reader is used as is.

As far as I can tell from the code, this is currently not the case as
all the binary values get wrapped into LazyTextExtractorFields.

> we'd have to change the way how the indexer finds out whether the
> extractor times out.

Would it help if we added an unlimited buffering mechanism (backed by
temporary files as needed) to the Readers so that if the indexer gets
blocked extracting text from one document, all the other pending
documents can automatically continue text extraction in parallel?

This might cause occasional blocking in the indexer, but on the average
it should do about as well as maintaining an explicit indexing queue.
In fact if we did this in Tika, we could avoid the extra buffering
entirely for things like plain text documents and other formats where
the parsing overhead is negligible.

BR,

Jukka Zitting
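
PS. A rough sketch of the buffering idea, in case it helps the
discussion. Everything here is hypothetical: SpoolingReader, its
constructor arguments and the memoryLimit threshold are made-up names
for illustration, not existing Jackrabbit or Tika API. It is also
simplified: the consumer waits until the source has been fully drained
instead of streaming from a partially filled buffer.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.InterruptedIOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Hypothetical sketch: drains a source Reader in the background. */
class SpoolingReader extends Reader {
    private final Future<Reader> spooled;
    private Reader delegate;

    /** Starts draining the source right away on the given executor. */
    SpoolingReader(Reader source, int memoryLimit, ExecutorService executor) {
        Callable<Reader> task = () -> spool(source, memoryLimit);
        this.spooled = executor.submit(task);
    }

    /** Copies the source into memory, spilling to a temp file past the limit. */
    private static Reader spool(Reader source, int memoryLimit) throws IOException {
        StringBuilder memory = new StringBuilder();
        Path path = null;
        Writer file = null;
        try (Reader in = source) {
            char[] chunk = new char[4096];
            for (int n; (n = in.read(chunk)) != -1; ) {
                if (file == null && memory.length() + n > memoryLimit) {
                    // Too big for memory: move what we have to a temp file.
                    path = Files.createTempFile("spool", ".txt");
                    file = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
                    file.append(memory);
                    memory.setLength(0);
                }
                if (file != null) {
                    file.write(chunk, 0, n);
                } else {
                    memory.append(chunk, 0, n);
                }
            }
        } finally {
            if (file != null) {
                file.close();
            }
        }
        if (path == null) {
            return new StringReader(memory.toString());
        }
        final Path spoolFile = path;
        // Delete the temp file once the consumer closes the reader.
        return new FilterReader(Files.newBufferedReader(spoolFile, StandardCharsets.UTF_8)) {
            @Override
            public void close() throws IOException {
                try {
                    super.close();
                } finally {
                    Files.deleteIfExists(spoolFile);
                }
            }
        };
    }

    /** Blocks until the background copy is done, then returns the spool. */
    private synchronized Reader delegate() throws IOException {
        if (delegate == null) {
            try {
                delegate = spooled.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new InterruptedIOException("interrupted while waiting for text");
            } catch (ExecutionException e) {
                throw new IOException("text extraction failed", e.getCause());
            }
        }
        return delegate;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return delegate().read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        spooled.cancel(true); // no effect if the copy has already finished
        if (!spooled.isCancelled()) {
            delegate().close();
        }
    }
}
```

The indexer would wrap each extractor Reader in one of these when the
document is queued, so a slow document only blocks its own read() call
while the other documents keep draining on the executor. For cheap
formats like plain text the wrapper could simply be skipped.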
