OK, so I have been testing this with different files. What I have noticed 
is that even with extremely small files, such as a 224 KB test PDF 
containing 3 blank pages, OCR processing still takes 31 seconds. It seems 
almost as if the Tesseract processes are blocked for an extended period 
before they can execute (or possibly after they execute, while they are 
trying to close). In the older production version of the code, files this 
small frequently take 6 seconds or less. Again, the way in which I call 
Tesseract as an external process hasn't really changed between the old and 
new versions of the code, as far as I can tell, aside from the fact that I 
now call Parallel.For instead of Parallel.ForEach. I checked the .NET 
source code and it seems Parallel.ForEach resolves to the same worker 
method as Parallel.For, so I doubt that is the issue.
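
One way to narrow this down might be to time the same Tesseract command at the same degree of parallelism from outside .NET. If the pages finish quickly there, the delay is more likely on the .NET side (process start or cleanup) than in Tesseract itself. Below is a minimal sketch in Python, not the service's actual code. The helper names (time_command, time_parallel, tesseract_commands) and the page file names are hypothetical, and it assumes the tesseract binary is on the PATH and is invoked as "tesseract <image> stdout".

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def time_command(cmd):
    """Run one child process; return (exit code, wall-clock seconds)."""
    start = time.perf_counter()
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode, time.perf_counter() - start


def time_parallel(cmds, workers=8):
    """Run the commands with at most `workers` child processes at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with cmds.
        return list(pool.map(time_command, cmds))


def tesseract_commands(pages):
    """Build one command per page image (assumed invocation, see above)."""
    return [["tesseract", page, "stdout"] for page in pages]
```

For example, time_parallel(tesseract_commands(["page-1.png", "page-2.png", "page-3.png"])) would return one (exit code, seconds) pair per page, which can then be compared against the per-page times the service logs.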

On Wednesday, March 4, 2020 at 9:26:24 AM UTC-6, Lucas L. wrote:
>
> Hello, I apologize in advance if this is the wrong place to post 
> this. It is Tesseract-related, but the fault may lie more with .NET 
> than with Tesseract. However, I have found almost no one else 
> who has this particular issue and I'm running out of options.
>
> The issue and code are described in detail here: 
> https://stackoverflow.com/questions/60456829/why-does-calling-the-tesseract-process-cause-this-service-to-crash-randomly
>
> I will summarize in order to avoid simply copy-pasting everything from my 
> SO post. We run Tesseract 4.00 in multiple threads on an Ubuntu 18.04 VM, 
> and it is called as an external process from a .NET Core 2.1 application (I 
> have also tried upgrading to 3.1, but that did not seem to make a 
> difference). I am aware of the "OMP_THREAD_LIMIT" variable, but we want to 
> process multiple pages from a split document file at once, so we call 
> Tesseract on multiple threads (currently, the degree of parallelism is 
> set to 8). This didn't cause any issues in the past, but recently I have 
> been making changes to reduce the number of reads/writes to disk in the 
> service, and now it crashes at random while processing a file, with the 
> message "Error while reaping child". The stack trace is in the SO post. 
> Occasionally it doesn't happen at all, but usually it does (more likely on 
> larger files, since the processes need to run more often). It can 
> occur at the very start of processing a document or at the very end.
>
> I have tried using the prerelease of the API wrapper found here 
> https://github.com/charlesw/tesseract which uses a recent version of 
> Tesseract, but it does not seem to handle multithreading very well (I 
> suppose I could just be using it wrong, but it does not allow me to process 
> multiple pages simultaneously without disposing the first page).
>
> It seems like an issue with the Process class in .NET cleaning up the 
> child resources when a process ends. Tesseract is a child process to the 
> dotnet process when it is called. However, I'm really not sure what I can 
> do to make .NET clean up the children without throwing an error. I was 
> reading the .NET Core source code and it mentions that a global lock 
> must be taken in order to add/remove process references (
> https://github.com/dotnet/runtime/blob/master/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/ProcessWaitState.Unix.cs).
>  
> I'm wondering if there is some interaction between multithreading, possibly 
> the GC, and this global ref table that causes an issue.
>
