Checked with both light & dark pdfs. The results are very good. Thanks.
A few concerns: E is consistently missed in both. J is missed consistently in darker image but recognized as T in dark image. ṝ is consistently recognized as ṛ. Can these be addressed? I am using the tesseract 4 alpha Windows build from the command line. Are the dev files in the repos?

On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote:
>
> I had used ghostview to convert PDF to tif or png.
>
> You can ocr PDF directly with gimagereader using the traineddata file I
> sent.
>
> See links for new windows binaries in msg below.
>
>
> At last, here are some fresh builds:
>
>
> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
>
> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe
>
> I'd be also interested in testing of the tessdata manager, which should
> now also properly handle script tessdatas
>
> On Tue 26 Jun, 2018, 10:59 PM yajva, <nsvnar...@gmail.com> wrote:
>
>> The doc is diff ver of the same text. Here's the doc used for the first.
>> png. This is slightly darker, but the one sent earlier is cleaner. Let me
>> know which is more amenable for OCRing. I use PDF Shaper to extract images
>> and convert to png using xnview.
>>
>> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>>
>>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>>
>>> How did you create the test png from the pdf? I am not getting as good
>>> quality, tried various settings with irfanview.
>>>
>>>
>>>
>>> On Tue, Jun 26, 2018 at 4:58 PM yajva <nsvnar...@gmail.com> wrote:
>>>
>>>> Sorry for the delay, my system was down.
>>>>
>>>> I am getting "Page not Found" for the link given. Can you pl re-check?
>>>>
>>>> Here's the doc I am trying to OCR
>>>>
>>>>
>>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>>
>>>>> Please test with traineddata file from
>>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>>
>>>>> Need to check that is it not overfitted.
>>>>>
>>>>> Please share a couple more images which I can use for testing.
>>>>>
>>>>>
>>>>> On Thu, Jun 21, 2018 at 11:38 PM yajva <nsvnar...@gmail.com> wrote:
>>>>>
>>>>>> one more correction.
>>>>>>
>>>>>>
>>>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>>>>>>>
>>>>>>> done
>>>>>>>
>>>>>>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>>>>>>>>
>>>>>>>> I am attaching the OCRed text. Please correct it so that I can use
>>>>>>>> as groundtruth for further training and testing.
>>>>>>>>
>>>>>>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar <shree...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I had done a training for sanskrit for both devanagari and IAST
>>>>>>>>> but it does not include cedilla for Sh
>>>>>>>>>
>>>>>>>>> I will add it and let you know.
>>>>>>>>>
>>>>>>>>> On Wed 20 Jun, 2018, 1:17 AM yajva, <nsvnar...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I have tried Google OCR for recognizing Sanskrit text in Roman
>>>>>>>>>> with diacritics (IAST). It recognizes above macron but not dots
>>>>>>>>>> below also joining grave and accent. Is there any traineddata
>>>>>>>>>> available for tesseract that can do this with good accuracy ?
>>>>>>>>>> Attached a sample page that I am interested in.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com.
>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> ____________________________________________________________
>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com