I first tried training from an image. I found an image online containing all the Modi script characters and tried to create a box file for it. In the box file I can see that each character is being treated as an English character. *My question is how to make it recognise these as Modi characters.*
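A minimal sketch, in Python, of how the box file could be checked for glyphs that are not Modi. It assumes the usual box file layout of one glyph followed by four coordinates and a page number per line, and that the expected characters lie in the Unicode Modi block (U+11600 to U+1165F). The function names and the sample data are hypothetical. As far as I understand, a box file produced by running Tesseract with an existing model can only contain characters that model already knows, which would explain the English letters.

```python
from collections import Counter

# Unicode Modi block (U+11600 to U+1165F).
MODI_FIRST, MODI_LAST = 0x11600, 0x1165F


def is_modi(ch: str) -> bool:
    """Return True if the single character ch lies in the Modi block."""
    return MODI_FIRST <= ord(ch) <= MODI_LAST


def box_glyphs(box_text: str):
    """Yield the glyph field of each box line.

    Assumes lines of the form '<glyph> <left> <bottom> <right> <top> <page>'.
    """
    for line in box_text.splitlines():
        # Split off the five numeric fields from the right so that the
        # glyph field is kept whole even if it is a multi-character cluster.
        parts = line.rsplit(" ", 5)
        if len(parts) == 6:
            yield parts[0]


def non_modi_glyphs(box_text: str) -> Counter:
    """Count glyphs that contain no Modi character (whitespace is skipped)."""
    counts = Counter()
    for glyph in box_glyphs(box_text):
        if glyph.strip() and not any(is_modi(ch) for ch in glyph):
            counts[glyph] += 1
    return counts


# Hypothetical two-line box file: a Latin 'm' and MODI LETTER A (U+11600).
sample = "m 10 20 30 40 0\n\U00011600 35 20 55 40 0\n"
print(non_modi_glyphs(sample))
```

For the hypothetical sample only the Latin `m` is reported. On a real box file, any glyph listed here would need to be corrected by hand to the intended Modi character before the file is used for training.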
Then I tried to use tesstrain.sh as below:

    src/training/tesstrain.sh \
      --fonts_dir /usr/share/fonts \
      --fontlist MarathiCursiveT \
      --lang mar \
      --linedata_only \
      --noextract_font_properties \
      --langdata_dir ../tesstutorial/langdata \
      --tessdata_dir ../tesstutorial/tesseract/tessdata \
      --training_text ../tesstutorial/langdata/mar/mar.modi.training_text \
      --output_dir ../tesstutorial/moditrain

I built (by running make) the MarathiCursiveT TrueType Unicode Modi font from https://github.com/MihailJP/MarathiCursive, the link mentioned in response to my query. I kept that file at /usr/share/fonts/truetype/MarathiCursiveT.

I created mar.modi.training_text by copying the content of the Marathi training text file into the Aksharamukha app and taking the output text in Modi.

*For tesstrain.sh I am getting the error: Could not find font named 'MarathiCursiveT'. Pango suggested font 'MarthiCursiveT Medium'*

Please advise on both queries. Thanks in advance.

On Monday, January 27, 2020 at 3:22:17 AM UTC-5, shree wrote:
>
> For LSTM training punc, numbers, wordlist are NOT required. You can add
> them if you like. Unicharset is generated from the training text.
>
> Are you planning to train from text or images?
>
> On Mon, Jan 27, 2020 at 2:19 AM 'Nilambari Joshi' via tesseract-ocr <
> tesser...@googlegroups.com> wrote:
>
>> Thanks for your response. I will work as suggested. Please also clarify
>> whether I need to create separate language directory for Modi similar to
>> Marathi with all files like number, punc wordlist included and a separate
>> unicharset file as well?
>> Thanks in advance.
>>
>> On Sunday, January 26, 2020 at 12:26:51 PM UTC-5, shree wrote:
>>>
>>> Thanks for the link to Modi Unicode font.
>>>
>>> I would convert the Marathi training text to Modi script (use
>>> Aksharamukha) and then train using the unicode font.
>>>
>>> On Sun, Jan 26, 2020 at 10:28 PM Patrick CHEW <patri...@gmail.com>
>>> wrote:
>>>
>>>> On Jan 26, 2020, at 08:16, Shree Devi Kumar <shree...@gmail.com> wrote:
>>>>
>>>> Is there a Unicode font for modi script?
>>>>
>>>> https://github.com/MihailJP/MarathiCursive
>>>>
>>>> On Sun, Jan 26, 2020, 21:22 'Nilambari Joshi' via tesseract-ocr <
>>>> tesser...@googlegroups.com> wrote:
>>>>
>>>>> Hi... I want to create Modi script (Marathi language) traineddata in
>>>>> tesseract for OCR. Can somebody guide what steps should I follow.
>>>>> I referred to
>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>>> but stuckup at a stage of creating box files.
>>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesser...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/EB77DC11-4EBA-498C-A8AE-E728C3F82A4D%40gmail.com
>>>> .
>>>
>>> --
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com