Re: About text extraction for index

Vikas Saurabh Fri, 23 Aug 2019 11:16:52 -0700

> but I am having a problem: the thread that processes the pdf file keeps
> running, creating images and performing OCR. Is this supposed to happen?


TL;DR: yes, because there is no safe way to kill a thread

Yes, that's supposed to happen. This feature was implemented because, in
most cases, text extraction should finish within a reasonable time. But
at times, due to a bad file or a bug in the parser, the extraction
process keeps on running, and that used to hold up indexing for the whole
setup. The assumption with a timed-out extraction is that Tika, or
whichever parser is in play, might be stuck. Since Thread.stop could
leave things in an incorrect state, potentially affecting subsequent
operations, the extraction thread is left running rather than being
killed.
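
To illustrate the behaviour described above, here is a minimal sketch in
plain java.util.concurrent. It is not the actual indexing code: the class
TimedExtraction and the helper extractWithTimeout are hypothetical names,
and the "parser" is just a task that sleeps. The caller stops waiting
after the timeout, but the worker thread carries on until the task
returns by itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; not the real extraction code.
class TimedExtraction {

    // Runs the task on the pool and waits at most timeoutMillis for its result.
    // Returns null when the wait times out.
    static String extractWithTimeout(ExecutorService pool,
                                     Callable<String> task,
                                     long timeoutMillis) throws Exception {
        Future<String> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // We only stop waiting here. The worker thread is not stopped:
            // Thread.stop is unsafe, and future.cancel(true) would merely
            // interrupt it, which a stuck parser may never notice.
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicBoolean finished = new AtomicBoolean(false);

        // Stand-in for a slow parser: takes far longer than the timeout.
        Callable<String> slowParser = () -> {
            Thread.sleep(300);
            finished.set(true);
            return "extracted text";
        };

        String text = extractWithTimeout(pool, slowParser, 50);
        System.out.println("result after timeout: " + text);

        // The task was never killed, so it still runs to completion.
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("worker finished anyway: " + finished.get());
    }
}
```

With a parser that is genuinely stuck, the worker in this sketch would
never return, which matches the thread you see still creating images and
running OCR after the timeout.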

-Vikas
(sent from mobile)
