*Dear all,*
I'm currently trying to use the Python wrapper for Tesseract (pytesseract)
to correct the rotation, in multiples of 90 degrees, of images of Tamil
newspapers. Specifically, I want to use
pytesseract.image_to_osd(binary, config = '--oem 0 -l tam --psm 0') to get
the orientation (OSD) data for each image so that I can correct it. I
tried --oem 0, 1, 2 and 3, and none of them worked, even when the legacy
engine was requested.
Error for --oem 0 and 2:
File
"C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pytesseract\pytesseract.py",
line 284, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, "Warning, detects only
orientation with -l tam Error: Tesseract (legacy) engine requested, but
components are not present in C:\\Program
Files\\Tesseract-OCR\\tessdata/tam.traineddata!! Failed loading language
'tam' Tesseract couldn't load any languages! Could not initialize
tesseract.")
Error for --oem 1 and 3:
File
"C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pytesseract\pytesseract.py",
line 284, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Warning, detects only
orientation with -l tam Error, OSD requires a model for the legacy engine')
Indeed, a legacy engine model for Tamil is needed for this task, and I used
the tam.traineddata from the legacy+LSTM repository
<https://github.com/tesseract-ocr/tessdata>. However, as stated at the
bottom of that page, "The legacy tesseract models (--oem 0) have been
removed for Indic and Arabic script language files."
The legacy fra and eng packs work perfectly when I run
pytesseract.image_to_osd(binary, config = '--oem 0 -l fra --psm 0')
pytesseract.image_to_osd(binary, config = '--oem 2 -l fra --psm 0')
and
pytesseract.image_to_osd(binary, config = '--oem 0 -l eng --psm 0')
pytesseract.image_to_osd(binary, config = '--oem 2 -l eng --psm 0')
The output looks like this:
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 0.89
Script: Latin
Script confidence: 8.38
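
For context, this is roughly how I intend to read the rotation from that output. It is only a minimal sketch: parse_osd is a hypothetical helper of my own, not part of pytesseract, and it assumes the OSD text always comes back as "Key: value" lines like the ones above.

```python
import re

# Sample OSD text, copied from the output shown above.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 0.89
Script: Latin
Script confidence: 8.38
"""


def parse_osd(osd_text):
    """Parse 'Key: value' lines of OSD text into a dict.

    Hypothetical helper (not a pytesseract function). Values that look
    like integers or floats are converted; everything else stays a string.
    """
    result = {}
    for line in osd_text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(.+?)\s*$", line)
        if not match:
            continue  # skip blank or unexpected lines
        key, value = match.group(1).strip(), match.group(2)
        try:
            result[key] = int(value)
        except ValueError:
            try:
                result[key] = float(value)
            except ValueError:
                result[key] = value
    return result


osd = parse_osd(SAMPLE_OSD)
rotate = osd["Rotate"]  # angle I would then apply to correct the image
```

The "Rotate" value is the one I would use for the correction. I have not yet confirmed which rotation direction it implies for the imaging library I use.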
I guess the legacy Tamil pack was removed because the Tamil legacy engine
performed poorly. However, since I'm only trying to get the orientation of
text in binarized images, would it be possible for you to give me access
to the legacy model? If this is not possible, do you have any other
suggestions for my case?
Thank you for taking the time to read this email, and have a great day!
*Sincerely,*
*Siyou*