Re: [tesseract-ocr] Re: pytesseract speed improvement?

TheComplete BookOfMormon Fri, 23 May 2025 04:35:50 -0700

In C# I would use Tesseract.net and create an engine once (that has a
cost), then I would process each page using the already-created engine.
That should at least save *some* processing time.


You'd need a long-running process, so it would either need to watch a
folder for incoming images or serve HTTP requests.
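
A minimal Python sketch of the HTTP variant, using only the standard
library. The `recognize` callable is a hypothetical placeholder for whatever
wraps your already-created engine; the host, port, and the idea of POSTing
the raw image bytes are assumptions for illustration, not anything
Tesseract provides. A single-threaded server is used on the assumption that
one engine instance should not be shared across threads.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def make_ocr_server(recognize: Callable[[bytes], str],
                    host: str = "127.0.0.1",
                    port: int = 8000) -> HTTPServer:
    """Build a server that hands each POSTed image body to `recognize`.

    `recognize` is a hypothetical callable: it should close over an engine
    that was created once, before the server starts handling requests.
    """

    class OcrHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read exactly the number of bytes the client announced.
            length = int(self.headers.get("Content-Length", 0))
            image_bytes = self.rfile.read(length)

            # The engine is reused here; nothing is initialised per request.
            body = recognize(image_bytes).encode("utf-8")

            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # suppress per-request logging to keep the sketch quiet

    return HTTPServer((host, port), OcrHandler)
```

To run it, create the engine first, wrap it in a function that takes image
bytes and returns text, pass that function to `make_ocr_server`, and call
`serve_forever()` on the result.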

To start with, I'd test the speed difference by writing code that does this:
1: Discover all files in a folder
2: Create the engine
3: Start a timer
4: Process each file
5: Stop the timer, and output the elapsed time

Then try creating the engine per file (as part of step 4) and see how that
affects the total time. Then decide if it's worth making the change or not.
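
In Python, the comparison described above might be sketched as follows.
This is only a timing harness under stated assumptions: `make_engine` and
`recognize` are hypothetical callables you would supply yourself (for
example, wrapping a binding that keeps the Tesseract API object alive
between calls), and the file patterns are placeholders.

```python
import time
from pathlib import Path
from typing import Any, Callable, Iterable, List


def discover_images(folder: str,
                    patterns: Iterable[str] = ("*.jpg", "*.png")) -> List[Path]:
    """Step 1: collect the image files in a folder, sorted for repeatable runs."""
    root = Path(folder)
    files: List[Path] = []
    for pattern in patterns:
        files.extend(root.glob(pattern))
    return sorted(files)


def time_shared_engine(files: List[Path],
                       make_engine: Callable[[], Any],
                       recognize: Callable[[Any, Path], str]) -> float:
    """Steps 2-5: create the engine once, then time processing of every file."""
    engine = make_engine()              # created before the timer starts
    start = time.perf_counter()
    for path in files:
        recognize(engine, path)
    return time.perf_counter() - start


def time_engine_per_file(files: List[Path],
                         make_engine: Callable[[], Any],
                         recognize: Callable[[Any, Path], str]) -> float:
    """Variant: create a fresh engine for each file, inside the timed loop."""
    start = time.perf_counter()
    for path in files:
        engine = make_engine()          # engine creation is now part of step 4
        recognize(engine, path)
    return time.perf_counter() - start
```

Calling both functions with the same file list and the same two callables
gives two elapsed times in seconds; the gap between them is a rough estimate
of what engine creation costs per file.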




On Thu, 22 May 2025 at 21:36, Jean-Marc Spaggiari <[email protected]>
wrote:

> Hi Tom,
>
> Thanks for having a look at this. The challenge is that I don't know which
> of those languages the title is using.
>
> Let me remove pytesseract from the picture.
>
> If I run tesseract title.jpg stdout --psm 7 --oem 1 -l
> eng+fra+spa+deu+ita+por+jpn+kor+rus+chi_sim+chi_tra it takes 0.9 seconds
> and returns the right title ("Advance Scout").
>
> The title is in English.
>
> If I run *tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu*
> it's faster (0.3 s) and the title is still correct.
> If I run *tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu*
> it's even faster (0.25 s) but the title is wrong ("AVEO Segue").
> If I run *tesseract title.jpg stdout --psm 7 --oem 1 -l eng* it's crazy
> fast (0.09 s) but the title is wrong again ("clyzinee Segue").
> If I use just "deu" it's super fast and correct.
>
> I can't batch the pictures as the client is waiting for the reply before
> sending the next one.
>
> So I was thinking about running each of them in parallel. I'm able to get
> a reply in 300 ms! That's 3 times faster, and it gives me this:
> clyzinee Segue
> ANVanee Scout
> AVEO EU
> Advance Scout:
> YAVanicc Sco
> Advance So ui
> eV2pe22)らの016
> 여00200606 20600ㄷ
> Ао\алее Эсодиь
> 二司多5
> 和NOU2COCOUUE
>
> But then I don't know which one I should take from those. I see the one
> from DEU is the good one. But I don't have a way to confirm that in the
> script.
>
> So multiple questions here.
> - Can tesseract work like a shell? I send a picture, I get the text. I send
> a picture, I get the text. Without ever closing tesseract?
> - Can I get the "confidence" level for each of those predictions? It might
> help to figure out which one is the most probable.
>
> Thanks,
>
> JMS
>
>
>
>
>
> On Thu, 22 May 2025 at 15:48, Tom Morris <[email protected]> wrote:
>
>> On Wednesday, May 21, 2025 at 12:28:52 PM UTC-4 [email protected]
>> wrote:
>>
>> I'm using tesseract to convert a small picture containing a title into a
>> string. It runs in about one second.
>> Here is the command line I'm using:
>> pytesseract.image_to_string(cropped_image, nice=-10, config='--psm 7
>> --oem 1 -l eng+fra+spa+deu+ita+por+jpn+kor+rus+chi_sim+chi_tra')
>>
>>
>> A small semantic distinction - tesseract and pytesseract are two
>> different things, maintained by different teams.
>>
>>
>> I tried to remove the -l parameter and it's way faster (98 ms), but
>> then the title is totally wrong. I'm wondering whether the time is spent
>> loading those dictionaries, in which case I could pre-load them and keep
>> them in memory, or whether it's mostly processing time.
>>
>>
>> Certainly every language model that you add is going to increase
>> processing time, so you only want to load the ones that you really need,
>> but I don't think you have the granularity of control with pytesseract to
>> save significantly on initialization time. It appears to just use command
>> line tesseract running in a subprocess.
>>
>> One thing which may cut down on overhead is collecting a batch of images,
>> saving them in a multi-image file format, and then having Tesseract
>> process that.
>>
>> Tom
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion visit
>> https://groups.google.com/d/msgid/tesseract-ocr/77af7499-6271-4135-982b-4b2fd1ee27d9n%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/77af7499-6271-4135-982b-4b2fd1ee27d9n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

