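Use tesserocr (https://github.com/sirfz/tesserocr). It binds directly to Tesseract's C++ API, so the engine can stay loaded between images, and it exposes confidence values for the recognized text.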
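
A minimal sketch of how tesserocr could address both questions from the thread (keeping Tesseract loaded between images, and choosing between per-language results by confidence). It assumes tesserocr and the relevant traineddata files are installed. The helper names `open_engines` and `best_reading` are hypothetical, not part of tesserocr. Picking the highest mean confidence is only a heuristic: a wrong language model can still report a high score, so treat this as a starting point rather than a guarantee.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple

# A recognizer takes an image path and returns (text, mean confidence 0..100).
Recognizer = Callable[[str], Tuple[str, int]]


def open_engines(langs: Iterable[str]) -> Dict[str, Recognizer]:
    """Load one Tesseract engine per language and keep it in memory.

    Hypothetical helper: call it once at startup, then reuse the returned
    recognizers for every incoming image.
    """
    # Imported lazily so best_reading() below stays usable without tesserocr.
    from tesserocr import OEM, PSM, PyTessBaseAPI

    engines: Dict[str, Recognizer] = {}
    for lang in langs:
        # Equivalent of: --psm 7 --oem 1 -l <lang>
        api = PyTessBaseAPI(lang=lang, psm=PSM.SINGLE_LINE, oem=OEM.LSTM_ONLY)

        def recognize(path: str, api: PyTessBaseAPI = api) -> Tuple[str, int]:
            api.SetImageFile(path)
            text = api.GetUTF8Text().strip()
            # MeanTextConf() is the average confidence of the last recognition.
            return text, api.MeanTextConf()

        engines[lang] = recognize
    return engines


def best_reading(engines: Dict[str, Recognizer], image_path: str) -> Tuple[str, str, int]:
    """Run every recognizer on the image and return (lang, text, confidence)
    for the result with the highest mean confidence."""
    if not engines:
        raise ValueError("at least one engine is required")
    # One thread per engine, so each engine is only ever used by one thread
    # at a time within a call.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {lang: pool.submit(rec, image_path) for lang, rec in engines.items()}
        results = {lang: future.result() for lang, future in futures.items()}
    lang = max(results, key=lambda name: results[name][1])
    text, conf = results[lang]
    return lang, text, conf
```

With this layout, `engines = open_engines(["eng", "fra", "spa", "deu"])` would run once when the service starts, and each client request would call `best_reading(engines, "title.jpg")`. Only the recognition itself is paid per image, not the model loading. Threads are used on the assumption that tesserocr releases the GIL during recognition, as its README describes.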
Zdenko

On Thu, May 22, 2025 at 22:35, Jean-Marc Spaggiari <[email protected]> wrote:

> Hi Tom,
>
> Thanks for having a look at this. The challenge is that I don't know which
> of those languages the title is using.
>
> Let me remove pytesseract from the picture.
>
> If I run
> tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu+ita+por+jpn+kor+rus+chi_sim+chi_tra
> it takes 0.9 seconds and returns the right title ("Advance Scout").
>
> The title is in English.
>
> If I run *tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu*
> it's faster (0.3 s) and the title is still correct.
> If I run *tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu*
> it's even faster (0.25 s) but the title is wrong ("AVEO Segue").
> If I run *tesseract title.jpg stdout --psm 7 --oem 1 -l eng* it's crazy
> fast (0.09 s), but the title is wrong again ("clyzinee Segue").
> If I use just "deu" it's super fast and correct.
>
> I can't batch the pictures, as the client is waiting for the reply before
> sending the next one.
>
> So I was thinking about running each language in parallel. I'm able to get
> a reply in 300 ms! That's 3 times faster, and it gives me this:
>
> clyzinee Segue
> ANVanee Scout
> AVEO EU
> Advance Scout:
> YAVanicc Sco
> Advance So ui
> eV2pe22)らの016
> 여00200606 20600ㄷ
> Ао\алее Эсодиь
> 二司多5
> 和NOU2COCOUUE
>
> But then I don't know which one I should take from those. I can see that
> the one from DEU is the good one, but I don't have a way to confirm that in
> the script.
>
> So, multiple questions here:
> - Can tesseract work like a shell? I send a picture, I get the text; I send
>   another picture, I get the text, without ever closing tesseract?
> - Can I get the "confidence" level for each of those predictions? It might
>   help to figure out which one is the most probable.
>
> Thanks,
>
> JMS
>
> On Thu, May 22, 2025 at 15:48, Tom Morris <[email protected]> wrote:
>
>> On Wednesday, May 21, 2025 at 12:28:52 PM UTC-4 [email protected]
>> wrote:
>>
>>> I'm using tesseract to convert a small picture containing a title into a
>>> string. It runs in about one second.
>>> Here is the command line I'm using:
>>> pytesseract.image_to_string(cropped_image, nice=-10, config='--psm 7 --oem 1 -l eng+fra+spa+deu+ita+por+jpn+kor+rus+chi_sim+chi_tra')
>>
>> A small semantic distinction - tesseract and pytesseract are two
>> different things, maintained by different teams.
>>
>>> I tried to remove the -l parameter and it's way faster (98 ms), but
>>> then the title is totally wrong. I'm wondering whether the time goes into
>>> loading those dictionaries, in which case I could pre-load them and keep
>>> them in memory, or into the processing itself.
>>
>> Certainly every language model that you add is going to increase
>> processing time, so you only want to load the ones that you really need,
>> but I don't think you have the granularity of control with pytesseract to
>> save significantly on initialization time. It appears to just use
>> command-line tesseract running in a subprocess.
>>
>> One thing which may cut down on overhead is collecting a batch of images,
>> saving them in a multi-image file format, and then having Tesseract
>> process that.
>>
>> Tom
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion visit
>> https://groups.google.com/d/msgid/tesseract-ocr/77af7499-6271-4135-982b-4b2fd1ee27d9n%40googlegroups.com

