In my case, it’s a non-searchable PDF, so I assume that Tika renders each 
page to a TIFF and then OCRs it.

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
5250 W 116th Pl, Suite 200., Leawood, KS 66211
WWW.TORCH.AI<http://www.torch.ai/>


From: Tim Allison <talli...@apache.org>
Sent: Wednesday, January 19, 2022 7:54 PM
To: lfcnas...@gmail.com
Cc: user@tika.apache.org
Subject: Re: TesseractOCRParser timeout


Sorry. You’re right. I think tesseract is supposed to handle multi-page TIFFs 
on its own.

On Wed, Jan 19, 2022 at 7:39 PM Luís Filipe Nassif <lfcnas...@gmail.com> wrote:
Hi Tim,

I'm sure Tika does that for PDFs, but I couldn't find that logic in the code 
base for TIFFs. Could you point me to the class that does that?

Luis

On Wed, Jan 19, 2022 at 15:00, Tim Allison <talli...@apache.org> wrote:
Yes.  Exactly right.  Tika spawns a process per page/image.

On Wed, Jan 19, 2022 at 11:30 AM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
I believe that Tika just OCRs one page at a time.  My guess is that it spawns 
a process for each page.



From: Luís Filipe Nassif <lfcnas...@gmail.com>
Sent: Wednesday, January 19, 2022 11:11 AM
To: user@tika.apache.org
Subject: Re: TesseractOCRParser timeout

Just a guess: if you are OCRing multi-page TIFF files, that may be the reason. I 
"think" Tika sends the whole TIFF to tesseract, which could take a long time 
if there are lots of pages and so trigger timeouts. In our project, we send 
the TIFF to tesseract one page at a time and restart the timeout counter for 
each page to avoid this.
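
A minimal sketch of that per-page approach, using only the JDK. It is not the code from the project mentioned above and not Tika code: the class and method names (PerPageOcr, splitPages, ocrPages) are hypothetical, and it assumes Java 11+ (ImageIO has shipped a TIFF plugin since Java 9) and a tesseract binary on the PATH. Each page is written to its own PNG, and each tesseract call gets a fresh timeout.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper (not part of Tika): OCR a multi-page TIFF one page at a time.
class PerPageOcr {

    // Writes each page of the TIFF to its own PNG and returns the paths in page order.
    static List<Path> splitPages(File tiff, Path outDir) throws IOException {
        List<Path> pages = new ArrayList<>();
        try (ImageInputStream in = ImageIO.createImageInputStream(tiff)) {
            ImageReader reader = ImageIO.getImageReadersByFormatName("tiff").next();
            try {
                reader.setInput(in);
                int count = reader.getNumImages(true); // scans the file to count pages
                for (int i = 0; i < count; i++) {
                    BufferedImage page = reader.read(i);
                    Path out = outDir.resolve("page-" + i + ".png");
                    ImageIO.write(page, "png", out.toFile());
                    pages.add(out);
                }
            } finally {
                reader.dispose();
            }
        }
        return pages;
    }

    // Runs "tesseract <page> stdout" once per page; the timeout applies to each
    // page separately, so a long document does not exhaust a single shared budget.
    static List<String> ocrPages(List<Path> pages, long timeoutSeconds)
            throws IOException, InterruptedException {
        List<String> texts = new ArrayList<>();
        for (Path page : pages) {
            Path txt = Files.createTempFile("ocr-", ".txt");
            Process p = new ProcessBuilder("tesseract", page.toString(), "stdout")
                    .redirectOutput(txt.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly(); // assumption: abandon the page rather than wait longer
                throw new IOException("OCR timed out on " + page);
            }
            texts.add(Files.readString(txt));
            Files.delete(txt);
        }
        return texts;
    }
}
```

Usage would be along the lines of PerPageOcr.ocrPages(PerPageOcr.splitPages(tiffFile, tempDir), 120), where the 120-second per-page limit is an arbitrary illustrative value.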

Best regards,
Luís Nassif

On Tue, Jan 18, 2022 at 22:51, Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
Unrelated to my previous questions.  I’m getting a timeout in Tika in 
TesseractOCRParser.runOCRProcess; the error message is 
‘TesseractOCRParser timeout’.  What exactly is it doing here?  Does it spawn a 
separate process to do the OCR?  We’re having some performance issues, so in a 
way this doesn’t come as a surprise.  I’m just trying to understand a little 
more about what’s going on.

private void runOCRProcess(Process process, int timeout) throws IOException, TikaException {
    process.getOutputStream().close();
    InputStream out = process.getInputStream();
    InputStream err = process.getErrorStream();
    StringBuilder outBuilder = new StringBuilder();
    StringBuilder errBuilder = new StringBuilder();
    Thread outThread = this.logStream(out, outBuilder);
    Thread errThread = this.logStream(err, errBuilder);
    outThread.start();
    errThread.start();
    int exitValue = -2147483648;

    try {
        boolean finished = process.waitFor((long)timeout, TimeUnit.SECONDS);
        if (!finished) {
            throw new TikaException("TesseractOCRParser timeout");
        }

        exitValue = process.exitValue();
    } catch (InterruptedException var12) {
        Thread.currentThread().interrupt();
        throw new TikaException("TesseractOCRParser interrupted", var12);
    } catch (IllegalThreadStateException var13) {
        throw new TikaException("TesseractOCRParser timeout");
    }
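
The excerpt above waits on an already-started process with a deadline and throws if the deadline passes. A minimal JDK-only sketch of the same pattern follows. ProcessWithTimeout and run are hypothetical names, not Tika code, and killing the child process on timeout is an assumption about what a caller would usually want, not a description of what Tika does. It requires Java 9+ for Redirect.DISCARD.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Tika): run one external command with a deadline.
class ProcessWithTimeout {

    // Starts the command, waits up to `timeout`, and returns the exit code.
    // Throws TimeoutException if the process is still running at the deadline.
    static int run(List<String> command, Duration timeout)
            throws IOException, InterruptedException, TimeoutException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)                        // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // output is not needed here
                .start();
        process.getOutputStream().close(); // the child gets no stdin, as in the excerpt

        boolean finished;
        try {
            finished = process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            process.destroyForcibly(); // assumption: do not leave an orphaned child behind
            throw e;
        }
        if (!finished) {
            process.destroyForcibly();
            throw new TimeoutException("command exceeded " + timeout);
        }
        return process.exitValue();
    }

    public static void main(String[] args) throws Exception {
        // Uses the running JVM's own launcher so the demo needs no extra tools.
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        int code = run(List.of(javaBin, "-version"), Duration.ofSeconds(30));
        System.out.println("exit code: " + code);
    }
}
```

If tesseract is slow on a given page, waitFor returns false at the deadline, which corresponds to the "TesseractOCRParser timeout" branch in the excerpt.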





