I tried using MarathiCursiveT Medium as the font in --fontlist and it worked. Thanks for that. It created the traineddata and unicharset files in the output folder. I hope I can now continue with the further instructions at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
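In case it helps anyone hitting the same Pango error, here is a rough sketch of the corrected call. It reuses the paths from my earlier command quoted further down and only changes --fontlist to the full font name. This is a sketch, not a log of what I ran: DRY_RUN and the helper names run, list_fonts and train_modi are made up for illustration, and by default the script only prints the commands.

```sh
#!/bin/sh
# Sketch only. Paths come from the earlier command in this thread.
# DRY_RUN, run, list_fonts and train_modi are hypothetical helpers.

FONT_NAME="MarathiCursiveT Medium"   # full name, including the style
FONTS_DIR="/usr/share/fonts"
DRY_RUN="${DRY_RUN:-1}"              # 1 = print the command, 0 = execute it

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Ask text2image which font names Pango can see in FONTS_DIR.
# The exact name printed here is what --fontlist expects.
list_fonts() {
  run text2image --fonts_dir="$FONTS_DIR" --list_available_fonts
}

# Same tesstrain.sh call as before, with the full font name quoted
# so that the space in it survives as a single argument.
train_modi() {
  run src/training/tesstrain.sh \
    --fonts_dir "$FONTS_DIR" \
    --fontlist "$FONT_NAME" \
    --lang mar \
    --linedata_only \
    --noextract_font_properties \
    --langdata_dir ../tesstutorial/langdata \
    --tessdata_dir ../tesstutorial/tesseract/tessdata \
    --training_text ../tesstutorial/langdata/mar/mar.modi.training_text \
    --output_dir ../tesstutorial/moditrain
}

list_fonts
train_modi
```

Run it with DRY_RUN=0 from the tesseract source directory to actually execute the two commands. The dry-run output does not show the quoting, but the font name is passed to tesstrain.sh as one argument.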
The box file is created using the command *tesseract A.png A lstmbox*, where A.png is the image with Modi characters.

On Tue, Jan 28, 2020 at 12:28 PM Shree Devi Kumar <shreesh...@gmail.com> wrote:

> *MarathiCursiveT Medium*
> *Use the above as the font with tesstrain.sh*
>
> *How are you creating the box file for the image?*
>
> On Tue, Jan 28, 2020, 21:56 'Nilambari Joshi' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote:
>
>> I was trying to do it with an image. I got one image online with all Modi script characters and tried to create a box file for that image.
>> In the box file I can see that it is treating each character as an English character.
>> *My question is how to make it recognise that it should treat each one as a Modi character.*
>>
>> Then I tried to use tesstrain.sh as below:
>>
>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --fontlist MarathiCursiveT --lang mar --linedata_only --noextract_font_properties --langdata_dir ../tesstutorial/langdata --tessdata_dir ../tesstutorial/tesseract/tessdata --training_text ../tesstutorial/langdata/mar/mar.modi.training_text --output_dir ../tesstutorial/moditrain
>>
>> I got the MarathiCursiveT TrueType Unicode Modi font (by running make) from https://github.com/MihailJP/MarathiCursive, the link mentioned in response to my query.
>> I kept that file at /usr/share/fonts/truetype/MarathiCursiveT
>>
>> I created mar.modi.training_text by pasting the content of the Marathi training text file into the Aksharamukha app and taking the output text in Modi.
>>
>> *For tesstrain.sh I am getting the error: Could not find font named 'MarathiCursiveT'. Pango suggested font 'MarathiCursiveT Medium'*
>>
>> Please advise on both queries. Thanks in advance.
>>
>> On Monday, January 27, 2020 at 3:22:17 AM UTC-5, shree wrote:
>>>
>>> For LSTM training punc, numbers, wordlist are NOT required. You can add them if you like. Unicharset is generated from the training text.
>>>
>>> Are you planning to train from text or images?
>>>
>>> On Mon, Jan 27, 2020 at 2:19 AM 'Nilambari Joshi' via tesseract-ocr <tesser...@googlegroups.com> wrote:
>>>
>>>> Thanks for your response. I will work as suggested. Please also clarify whether I need to create a separate language directory for Modi, similar to Marathi, with all files like number, punc and wordlist included, and a separate unicharset file as well.
>>>> Thanks in advance.
>>>>
>>>> On Sunday, January 26, 2020 at 12:26:51 PM UTC-5, shree wrote:
>>>>>
>>>>> Thanks for the link to the Modi Unicode font.
>>>>>
>>>>> I would convert the Marathi training text to Modi script (use Aksharamukha) and then train using the Unicode font.
>>>>>
>>>>> On Sun, Jan 26, 2020 at 10:28 PM Patrick CHEW <patri...@gmail.com> wrote:
>>>>>
>>>>>> On Jan 26, 2020, at 08:16, Shree Devi Kumar <shree...@gmail.com> wrote:
>>>>>>
>>>>>> Is there a Unicode font for Modi script?
>>>>>>
>>>>>> https://github.com/MihailJP/MarathiCursive
>>>>>>
>>>>>> On Sun, Jan 26, 2020, 21:22 'Nilambari Joshi' via tesseract-ocr <tesser...@googlegroups.com> wrote:
>>>>>>
>>>>>>> Hi... I want to create Modi script (Marathi language) traineddata in Tesseract for OCR. Can somebody guide me on what steps I should follow?
>>>>>>> I referred to
>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>>>>> but got stuck at the stage of creating box files.
>>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to tesser...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/EB77DC11-4EBA-498C-A8AE-E728C3F82A4D%40gmail.com
>>>>>
>>>>> --
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com