In my case, it’s a non-searchable PDF, so I assume that Tika converts each page to a TIFF and then OCRs it.

Peter Kronenberg | Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
5250 W 116th Pl, Suite 200., Leawood, KS 66211
WWW.TORCH.AI<http://www.torch.ai/>

From: Tim Allison <talli...@apache.org>
Sent: Wednesday, January 19, 2022 7:54 PM
To: lfcnas...@gmail.com
Cc: user@tika.apache.org
Subject: Re: TesseractOCRParser timeout

Sorry. You’re right. I think tesseract is supposed to handle multi-page TIFFs on its own.

On Wed, Jan 19, 2022 at 7:39 PM Luís Filipe Nassif <lfcnas...@gmail.com> wrote:

Hi Tim,

I'm sure Tika does that for PDFs, but I couldn't find that logic in the code base for TIFFs. Could you point me to the class that does that?

Luis

On Wed, Jan 19, 2022 at 3:00 PM Tim Allison <talli...@apache.org> wrote:

Yes. Exactly right. Tika spawns a process per page/image.

On Wed, Jan 19, 2022 at 11:30 AM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:

I believe that Tika just OCRs one page at a time. My guess is that it spawns a process for each page.

From: Luís Filipe Nassif <lfcnas...@gmail.com>
Sent: Wednesday, January 19, 2022 11:11 AM
To: user@tika.apache.org
Subject: Re: TesseractOCRParser timeout

Just a guess: if you are OCRing multipage TIF files, that may be the reason. I "think" Tika sends the whole TIF to tesseract, and that could take a long time if there are lots of pages, triggering timeouts. In our project, we send one TIF page at a time to tesseract and restart the timeout counter for each page to avoid this.

Best regards,
Luís Nassif

On Tue, Jan 18, 2022 at 10:51 PM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:

Unrelated to my previous questions. I’m getting some sort of timeout in Tika in TesseractOCRParser.runOCRProcess. It’s one of the errors that say ‘TesseractOCRParser timeout’. What exactly is it doing here? Does it spawn a separate process to do the OCR? We’re having some performance issues, so in a way, this doesn’t come as a surprise.

Just trying to understand a little more what’s going on:

    private void runOCRProcess(Process process, int timeout) throws IOException, TikaException {
        process.getOutputStream().close();
        InputStream out = process.getInputStream();
        InputStream err = process.getErrorStream();
        StringBuilder outBuilder = new StringBuilder();
        StringBuilder errBuilder = new StringBuilder();
        Thread outThread = this.logStream(out, outBuilder);
        Thread errThread = this.logStream(err, errBuilder);
        outThread.start();
        errThread.start();
        int exitValue = -2147483648;
        try {
            boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
            if (!finished) {
                throw new TikaException("TesseractOCRParser timeout");
            }
            exitValue = process.exitValue();
        } catch (InterruptedException var12) {
            Thread.currentThread().interrupt();
            throw new TikaException("TesseractOCRParser interrupted", var12);
        } catch (IllegalThreadStateException var13) {
            throw new TikaException("TesseractOCRParser timeout");
        }
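
For readers following the thread, here is a minimal sketch of the pattern under discussion: spawn one external process per page and give each its own timeout, so a long document does not run into a single overall limit. This is not Tika's implementation. The names `OcrTimeoutSketch`, `runWithTimeout`, `runEachWithFreshTimeout`, and `CommandTimeoutException` are hypothetical, and the demo uses `java -version` as a stand-in for a per-page OCR command so that it runs without tesseract installed. It assumes Java 9 or later (for `ProcessBuilder.Redirect.DISCARD` and `List.of`).

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names come from Tika.
final class OcrTimeoutSketch {

    /** Signals that one external command exceeded its time budget. */
    static final class CommandTimeoutException extends Exception {
        CommandTimeoutException(String message) {
            super(message);
        }
    }

    /** Runs one command, waits at most the given time, and returns its exit code. */
    static int runWithTimeout(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException, CommandTimeoutException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Discard stdout/stderr so the child cannot stall on a full pipe buffer.
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process = builder.start();
        try {
            process.getOutputStream().close(); // the child sees EOF on stdin
            if (!process.waitFor(timeout, unit)) {
                throw new CommandTimeoutException(
                        "timed out after " + timeout + " " + unit + ": " + command);
            }
            return process.exitValue();
        } finally {
            // Kills the child on timeout or interrupt; harmless if it already exited.
            process.destroyForcibly();
        }
    }

    /**
     * Runs the commands one after another. Each command gets its own full
     * timeout, so the budget restarts for every page. Returns the number of
     * commands that exited with a non-zero code.
     */
    static int runEachWithFreshTimeout(List<List<String>> commands, long timeout, TimeUnit unit)
            throws IOException, InterruptedException, CommandTimeoutException {
        int failures = 0;
        for (List<String> command : commands) {
            if (runWithTimeout(command, timeout, unit) != 0) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a per-page OCR command: the running JVM's own launcher.
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        List<String> perPage = List.of(javaBin, "-version");
        int failures = runEachWithFreshTimeout(List.of(perPage, perPage), 30, TimeUnit.SECONDS);
        System.out.println("commands with non-zero exit: " + failures);
    }
}
```

In a real pipeline, each command in the list would be the OCR invocation for a single page image. The point of the design is that a timeout then reflects one slow page rather than the total page count, which is the behavior Luís describes for his project.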