olo company

i am trying to ocr an old (1963) moroccan arabic - english dictionary

i have tried jTessBoxEditor and somehow managed to follow the info on the 
net,
but at the very end tesseract failed to produce the final _traineddata_ file

my problem is:
the book (dictionary) is basically in english, so i used the eng file 
for ocr-ing,
but there is also transliteration text, which includes characters that are 
not present in english, although they are latin script.
i tried to train tesseract for those characters, for example by following 
this video, but failed:

https://www.youtube.com/watch?v=8GdcyknL1ls

the other info i could find is also a bit confusing

the characters i was trying to train are letters

g z d h r t s l - with dots below and above, plus
š ž and a weird semi question mark
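
to make the list above concrete, here is a small python sketch that builds the unicode forms of those letters, so a training text or a check of the ocr output can cover all of them. it is only an illustration: the function name `dotted_variants` is made up for this example, and U+0295 is my guess for the "semi question mark" (it could equally be U+0294 or another glyph in the book).

```python
import unicodedata

BASE_LETTERS = "gzdhrtsl"   # the letters listed above
DOT_BELOW = "\u0323"        # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"        # COMBINING DOT ABOVE


def dotted_variants(letters=BASE_LETTERS):
    """Return each letter with a dot below and a dot above, NFC-normalized.

    NFC gives a single precomposed code point where Unicode has one;
    otherwise the result stays as base letter + combining mark.
    """
    variants = []
    for letter in letters:
        for mark in (DOT_BELOW, DOT_ABOVE):
            variants.append(unicodedata.normalize("NFC", letter + mark))
    return variants


# ASSUMPTION: U+0295 stands in for the "weird semi question mark".
EXTRA_CHARS = ["\u0161", "\u017e", "\u0295"]  # š, ž, and the guessed glyph


def describe(text):
    """Spell out the code points that make up one character form."""
    return " + ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text)


if __name__ == "__main__":
    for form in dotted_variants() + EXTRA_CHARS:
        print(form, "=", describe(form))
```

some of these forms have no precomposed code point and stay as two code points (letter plus combining mark), which is worth knowing when box files or ground-truth text are typed by hand.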

the transliteration text is also set in _italic_

with the help of libre office writer and some trial & error i also managed to 
identify a close approximation of the transliteration font (Latin Modern 
Roman Unslanted)

can somebody versed in tesseract-ocr training help me train (or do the ocr) 
for those letters/characters?

attached are:
- my train script / font image (font - latin modern roman unslanted)
- a page from a dictionary which includes most of the characters i am 
trying to ocr

the dictionary has 500+ pages, half is english - moroccan arabic, the other half 
is moroccan arabic - english, so proper ocr would be truly appreciated

thank you for your help

have fun

aum

