Re: [tesseract-ocr] Re: pytesseract speed improvement?

Jean-Marc Spaggiari Thu, 22 May 2025 13:35:58 -0700

Hi Tom,

Thanks for having a look at this. The challenge is that I don't know which
of those languages the title is using.


Let me remove pytesseract from the picture.

If I run tesseract title.jpg stdout --psm 7 --oem 1 -l
eng+fra+spa+deu+ita+por+jpn+kor+rus+chi_sim+chi_tra it takes 0.9 seconds and
returns the right title ("Advance Scout").

The title is in English.

If I run *tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu*
it's faster (0.3 s) and the title is still correct.
If I run *tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu*
it's even faster (0.25 s) but the title is wrong ("AVEO Segue").
If I run *tesseract title.jpg stdout --psm 7 --oem 1 -l eng* it's crazy
fast (0.09 s) but the title is wrong again ("clyzinee Segue").
If I use just "deu" it's super fast and correct.

I can't batch the pictures as the client is waiting for the reply before
sending the next one.

So I was thinking about running each language in parallel. I'm able to get a
reply in 300 ms! That's three times faster, and it gives me this:
clyzinee Segue
ANVanee Scout
AVEO EU
Advance Scout:
YAVanicc Sco
Advance So ui
eV2pe22)らの016
여00200606 20600ㄷ
Ао\алее Эсодиь
二司多5
和NOU2COCOUUE

But then I don't know which one I should take from those. I see the one
from DEU is the good one. But I don't have a way to confirm that in the
script.
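
For reference, the parallel run described above could be sketched roughly
like this. It is only an illustration: the function names (build_command,
run_tesseract, ocr_all_languages) are made up for this example, and it
assumes the tesseract binary and all listed traineddata files are installed.
The command line is the same one used above, with one language per process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Languages from the original command line, one tesseract process per language.
LANGS = ["eng", "fra", "spa", "deu", "ita", "por",
         "jpn", "kor", "rus", "chi_sim", "chi_tra"]


def build_command(image_path, lang):
    """Build the single-language tesseract command (hypothetical helper)."""
    return ["tesseract", image_path, "stdout",
            "--psm", "7", "--oem", "1", "-l", lang]


def run_tesseract(image_path, lang):
    """Run tesseract once and return the recognized text."""
    proc = subprocess.run(build_command(image_path, lang),
                          capture_output=True, text=True,
                          encoding="utf-8", check=True)
    return proc.stdout.strip()


def ocr_all_languages(image_path, langs=LANGS, runner=run_tesseract):
    """Run one OCR job per language concurrently; return {lang: text}.

    Threads are enough here because the work happens in child processes.
    The runner argument exists only so the function can be tried without
    tesseract installed.
    """
    with ThreadPoolExecutor(max_workers=len(langs)) as pool:
        futures = {lang: pool.submit(runner, image_path, lang)
                   for lang in langs}
        return {lang: fut.result() for lang, fut in futures.items()}
```

Calling ocr_all_languages("title.jpg") would return one candidate string per
language, which still leaves the problem of choosing among them.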

So multiple questions here.
- Can tesseract work like a shell? I send a picture, I get the text. I send
another picture, I get the text. Without ever closing tesseract?
- Can I get the "confidence" level for each of those predictions? It might
help to figure out which one is the most probable.
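
On the confidence question, one possible approach (a sketch, not something
confirmed in this thread) is to ask tesseract for TSV output by appending
"tsv" to the command line, e.g. tesseract title.jpg stdout --psm 7 --oem 1
-l deu tsv, and average the per-word "conf" column. The helper names below
are hypothetical, and it is an assumption that mean confidences from
different language models are comparable enough to rank candidates.

```python
import csv
import io


def mean_word_confidence(tsv_text):
    """Return (joined_text, mean_conf) from tesseract TSV output.

    Rows with conf < 0 are layout rows (page, block, line), not words,
    so they are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words, confs = [], []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue
        if text and conf >= 0:
            words.append(text)
            confs.append(conf)
    if not confs:
        return "", 0.0
    return " ".join(words), sum(confs) / len(confs)


def pick_best(tsv_by_lang):
    """Pick the language whose output has the highest mean word confidence.

    tsv_by_lang maps a language code to the TSV text tesseract printed
    for that language. Returns (lang, text, mean_conf).
    """
    scored = {lang: mean_word_confidence(tsv)
              for lang, tsv in tsv_by_lang.items()}
    best = max(scored, key=lambda lang: scored[lang][1])
    return best, scored[best][0], scored[best][1]
```

If this ranking turns out to be unreliable across scripts, a threshold on the
mean confidence could at least filter out the obviously wrong candidates.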

Thanks,

JMS





On Thu, May 22, 2025 at 15:48, Tom Morris <[email protected]> wrote:

> On Wednesday, May 21, 2025 at 12:28:52 PM UTC-4 [email protected]
> wrote:
>
> I'm using tesseract to convert a small picture containing a title into a
> string. It runs in about one second.
> Here is the command line I'm using:
> pytesseract.image_to_string(cropped_image, nice=-10, config='--psm 7 --oem
> 1 -l eng+fra+spa+deu+ita+por+jpn+kor+rus+chi_sim+chi_tra')
>
>
> A small semantic distinction - tesseract and pytesseract are two different
> things, maintained by different teams.
>
>
> I tried to remove the -l parameter and it's way faster (98 ms), but then
> the title is totally wrong. I'm wondering whether the time is spent loading
> those dictionaries, in which case I could pre-load them and keep them in
> memory, or whether it is spent on the processing itself.
>
>
> Certainly every language model that you add is going to increase
> processing time, so you only want to load the ones that you really need,
> but I don't think you have the granularity of control with pytesseract to
> save significantly on initialization time. It appears to just use command
> line tesseract running in a subprocess.
>
> One thing which may cut down on overhead is collecting a batch of images,
> saving them in a multi-image file format, and then having Tesseract process
> that.
>
> Tom
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion visit
> https://groups.google.com/d/msgid/tesseract-ocr/77af7499-6271-4135-982b-4b2fd1ee27d9n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/77af7499-6271-4135-982b-4b2fd1ee27d9n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

