Re: [tesseract-ocr] recognising roman with sanskrit diacritics
eng+iast-plus-3600 => no diacritics at all
Latin+iast-plus-3600 => only macrons none other

On Thursday, July 12, 2018 at 1:12:25 AM UTC+5:30, shree wrote:
> What about ocr with
>
> eng+iast
>
> On Wed 11 Jul, 2018, 7:44 PM yajva wrote:
>> shree
>> namaste
>>
>> I am trying to OCR the attached image. Getting not so good results. Even
>> for text which is apparently clear. Eg. in the first line, B is recognized
>> as H, under dot for 't' in 'most' 4th line etc. The image has warping but
>> still best/Latin and Google OCR produce better results. Is it possible to
>> add diacritics to Latin? Can you help in any way?
>>
>> regards
>> Venkatesh
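One way to put numbers on observations such as "no diacritics at all" or "only macrons" is to count which combining marks survive in each OCR output. The following is a minimal sketch, not part of the original thread: the function name `diacritic_profile` and the sample string are illustrative assumptions, and only the Python standard library is used.

```python
import unicodedata
from collections import Counter


def diacritic_profile(text: str) -> Counter:
    """Count combining marks in `text` by Unicode name.

    The text is decomposed (NFD) first, so precomposed letters such as
    'ā' or 'ṭ' are split into a base letter plus a combining mark.
    """
    counts: Counter = Counter()
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):  # non-zero class means a combining mark
            counts[unicodedata.name(ch, "UNKNOWN")] += 1
    return counts


if __name__ == "__main__":
    # Illustrative sample taken from the ground-truth text in this thread.
    sample = "Pṛshṭhavaṁçe ca nābhyāṁ ca dhṛtaṁ"
    for name, n in diacritic_profile(sample).most_common():
        print(f"{n:3d}  {name}")
```

Running this on the output of each language combination and comparing the profiles against the ground truth would show at a glance which marks (macron, dot below, dot above, cedilla) a given traineddata file is dropping.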
Re: [tesseract-ocr] recognising roman with sanskrit diacritics
Many thanks. Downloaded and using.
Will wait for next ver.

On Sunday, July 1, 2018 at 12:21:19 AM UTC+5:30, shree wrote:
> I have uploaded a new version of traineddata file at
>
> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata
>
> Attached is the OCRed output for pages 13-24 of dark pdf with it.
>
> I am still training a different variation.
>
> On Wed, Jun 27, 2018 at 6:46 PM Shree Devi Kumar wrote:
>> ok. I will take a look.
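For readers who want to try a downloaded traineddata file from the command line, the sketch below assembles a tesseract invocation. It is an illustration under stated assumptions: the file is assumed to be saved as `iast-layer-18003.traineddata` in a local folder (here called `tessdata`, a hypothetical name), and `page.png` / `page` are hypothetical input and output names. `-l` and `--tessdata-dir` are standard tesseract options; the helper `tesseract_command` is not from the thread. The command is only built and printed, not executed.

```python
from typing import List, Sequence


def tesseract_command(image: str, output_base: str,
                      languages: Sequence[str], tessdata_dir: str) -> List[str]:
    """Build an argument list for the tesseract CLI.

    `languages` are traineddata names without the .traineddata suffix;
    several names are joined with '+' as in 'eng+iast'.
    """
    if not languages:
        raise ValueError("at least one language is required")
    return [
        "tesseract", image, output_base,
        "--tessdata-dir", tessdata_dir,   # folder holding the .traineddata files
        "-l", "+".join(languages),
    ]


if __name__ == "__main__":
    # Hypothetical paths; adjust to where the files actually live.
    cmd = tesseract_command("page.png", "page", ["iast-layer-18003"], "tessdata")
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list makes it easy to loop over several language combinations and compare their outputs.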
Re: [tesseract-ocr] recognising roman with sanskrit diacritics
Checked with both light & dark pdfs. The results are very good. Thanks.

A few concerns. E is consistently missed in both. J is missed consistently in darker image but recognized as T in dark image. ṝ is recognized as ṛ consistently. Can these be addressed ?
I am using tesseract 4 alpha windows build from command line.

Are the dev files in repos ?

On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote:
> I had used ghostview to convert PDF to tif or png.
>
> You can ocr PDF directly with gimagereader using the traineddata file I
> sent.
>
> See links for new windows binaries in msg below.
>
> At last, here are some fresh builds:
>
> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
>
> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe
>
> I'd be also interested in testing of the tessdata manager, which should
> now also properly handle script tessdatas
>
> On Tue 26 Jun, 2018, 10:59 PM yajva wrote:
>> The doc is diff ver of the same text. Here's the doc used for the first
>> png. This is slightly darker, but the one sent earlier is cleaner. Let me
>> know which is more amenable for OCRing. I use PDF Shaper to extract images
>> and convert to png using xnview.
>>
>> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>>
>>> How did you create the test png from the pdf? I am not getting as good
>>> quality, tried various settings with irfanview.
>>>
>>> On Tue, Jun 26, 2018 at 4:58 PM yajva wrote:
>>>> Sorry for the delay, my system was down.
>>>>
>>>> I am getting "Page not Found" for the link given. Can you pl re-check?
>>>>
>>>> Here's the doc I am trying to OCR
>>>>
>>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>> Please test with traineddata file from
>>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>>
>>>>> Need to check that it is not overfitted.
>>>>>
>>>>> Please share a couple more images which I can use for testing.
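Reports like "E is consistently missed" or "ṝ is recognized as ṛ" can be tallied automatically by aligning OCR output with the corrected ground truth. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the function name `confusions` and the sample strings are illustrative assumptions, not data from the thread.

```python
import unicodedata
from collections import Counter
from difflib import SequenceMatcher
from typing import Tuple


def confusions(truth: str, ocr: str) -> Counter:
    """Count (expected, recognized) pairs where OCR differs from ground truth.

    An empty second element means the expected text was dropped;
    an empty first element means the OCR inserted something extra.
    """
    # Normalize so precomposed and decomposed diacritics compare equal.
    truth = unicodedata.normalize("NFC", truth)
    ocr = unicodedata.normalize("NFC", ocr)
    counts: Counter = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pair: Tuple[str, str] = (truth[i1:i2], ocr[j1:j2])
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Made-up example strings for illustration only.
    for (expected, got), n in confusions("Evam pitṝn", "vam pitṛn").most_common():
        print(f"{n}x  expected {expected!r}  got {got!r}")
```

Sorting the resulting counter over a whole page gives a ranked list of the most frequent character errors, which is a convenient summary to send back when asking for retraining.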
Re: [tesseract-ocr] recognising roman with sanskrit diacritics
one more correction.

On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
> done

Çrīgaheçāya namaḥ.

I.

Athāto Gobhiloktānām anyeshāṁ caiva karmaṇām aspashṭānāṁ vidhiṁ samyag darçayishye pradīpavat | 1. |
Trīvṛd ūrdhvavṛtṃ kāryaṁ tantutrayam adhovṛtam trivṛt tac copavītaṁ syāt tasyaiko granthir ishyate | 2. |
Pṛshṭhavaṁçe ca nābhyāṁ ca dhṛtaṁ yad vindate kaṭim tad dhāryam upavītaṁ syān nātolambaṃ na cocchritam | 3. |
Sadopavītinā bhāvyaṁ sadā baddhaçikhena ca viçikho vyupavītaç ca yat karoti na tat kṛtam | 4. |
Triḥ prāçyāpo dvir unmṛjya mukham etāny upaspṛçet āsyanāsākṣhikarṇāṁç ca nābhivakṣhaḥçiroṁsakān | 5. |
Aṅgushṭhena pradeçinyā ghrāṇaṁ caivam upaspṛçet aṅgushṭhānāmikābhyāṁ ca cakṣhuḥ çrotraṃ punaḥ punaḥ | 6. |
Kanishṭhāṅgushṭhayor nābhiṁ hṛdayaṁ tu talena vai sarvābhis tu çiraḥ paçcād bāhū cāgreṇa saṁspṛçet | 7. |
Yatropadiçyate karma kartur aṅgaṁ na tūcyate dakṣhiṇas tatra vijñeyaḥ karmaṇāṁ pāragaḥ karaḥ | 8. |
Yatra diṅniyamo na syāj japahomādikarmasu tisras tatra diçaḥ proktā aindrīsaumyāparājitāḥ | 9. |
Tishṭhann āsīnaḥ prahvo vā niyamo yatra nedṛçaḥ tadāsīnena kartavyaṁ na prahveṇa na tishṭhatā | 10. |
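When a ground-truth file goes through several rounds of correction, it can help to list exactly which tokens changed between two revisions. The sketch below is one possible approach using the Python standard library; the helper name `changed_tokens` is an assumption, and the two short strings are excerpts modelled on the verse-number fix ("83." versus "3.") visible between the two corrected texts posted in this thread.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def changed_tokens(old: str, new: str) -> List[Tuple[str, str]]:
    """Return (old, new) token runs that differ between two revisions.

    Both texts are split on whitespace, so line wrapping differences
    between revisions do not show up as changes.
    """
    a, b = old.split(), new.split()
    changes: List[Tuple[str, str]] = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


if __name__ == "__main__":
    earlier = "na cocchritam | 83. | Sadopavītinā bhāvyaṁ"
    later = "na cocchritam | 3. | Sadopavītinā bhāvyaṁ"
    for old, new in changed_tokens(earlier, later):
        print(f"{old!r} -> {new!r}")
```

Comparing at the token level rather than line by line keeps the report short even when the text has no stable line breaks.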
Re: [tesseract-ocr] recognising roman with sanskrit diacritics
done

On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
> I am attaching the OCRed text. Please correct it so that I can use as
> groundtruth for further training and testing.
>
> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar wrote:
>> I had done a training for sanskrit for both devanagari and IAST but it
>> does not include cedilla for Sh
>>
>> I will add it and let you know.
>>
>> On Wed 20 Jun, 2018, 1:17 AM yajva wrote:
>>> I have tried Google OCR for recognizing Sanskrit text in Roman with
>>> diacritics (IAST). It recognizes above macron but not dots below also
>>> joining grave and accent. Is there any traineddata available for tesseract
>>> that can do this with good accuracy ? Attached a sample page that I am
>>> interested in.

Çrīgaheçāya namaḥ.

I.

Athāto Gobhiloktānām anyeshāṁ caiva karmaṇām aspashṭānāṁ vidhiṁ samyag darçayishye pradīpavat | 1. |
Trīvṛd ūrdhvavṛtṃ kāryaṁ tantutrayam adhovṛtam trivṛt tac copavītaṁ syāt tasyaiko granthir ishyate | 2. |
Pṛshṭhavaṁçe ca nābhyāṁ ca dhṛtaṁ yad vindate kaṭim tad dhāryam upavītaṁ syān nātolambaṃ na cocchritam | 83. |
Sadopavītinā bhāvyaṁ sadā baddhaçikhena ca viçikho vyupavītaç ca yat karoti na tat kṛtam | 4. |
Triḥ prāçyāpo dvir unmṛjya mukham etāny upaspṛçet āsyanāsākṣhikarṇāṁç ca nābhivakṣhaḥçiroṁsakān | 5. |
Aṅgushṭhena pradeçinyā ghrāṇaṁ caivam upaspṛçet aṅgushṭhānāmikābhyāṁ ca cakṣhuḥ çrotraṃ punaḥ punaḥ | 6. |
Kanishṭhāṅgushṭhayor nābhiṁ hṛdayaṁ tu talena vai sarvābhis tu çiraḥ paçcād bāhū cāgreṇa saṁspṛçet | 7. |
Yatropadiçyate karma kartur aṅgaṁ na tūcyate dakṣhiṇas tatra vijñeyaḥ karmaṇāṁ pāragaḥ karaḥ | 8. |
Yatra diṅniyamo na syāj japahomādikarmasu tisras tatra diçaḥ proktā aindrīsaumyāparājitāḥ | 9. |
Tishṭhann āsīnaḥ prahvo vā niyamo yatra nedṛçaḥ tadāsīnena kartavyaṁ na prahveṇa na tishṭhatā | 10. |