Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-07-12 Thread yajva
eng+iast-plus-3600 => no diacritics at all
Latin+iast-plus-3600 => only macrons, no other diacritics
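
To see which diacritics a given language combination actually produces, one option is to tally the combining marks in the OCR output and compare them with the same tally for the corrected text. The following is only a minimal sketch: the function name and the sample strings are hypothetical, and it uses nothing beyond Python's standard `unicodedata` module.

```python
import unicodedata
from collections import Counter


def diacritic_counts(text):
    """Tally combining marks found in text after NFD decomposition."""
    counts = Counter()
    for ch in unicodedata.normalize("NFD", text):
        # combining() is non-zero for marks such as macron or dot below
        if unicodedata.combining(ch):
            counts[unicodedata.name(ch)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical lines for illustration: "kāryaṁ" as ground truth,
    # and an OCR result that kept the macron but lost the dot above.
    for label, line in [("truth", "k\u0101rya\u1e41"), ("ocr", "k\u0101ryam")]:
        print(label, dict(diacritic_counts(line)))
```

A mark that shows up in the tally for the corrected text but never in the tally for the OCR output is one the model is not producing at all.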



On Thursday, July 12, 2018 at 1:12:25 AM UTC+5:30, shree wrote:
>
> What about OCR with eng+iast?
>
>
>
> On Wed 11 Jul, 2018, 7:44 PM yajva wrote:
>
>> shree
>> namaste
>>
>> I am trying to OCR the attached image but am getting poor results, even 
>> for text which is apparently clear. E.g. B is recognized as H in the first 
>> line, and there is a problem with the underdot for 't' in 'most' in the 4th 
>> line. The image has warping, but best/Latin and Google OCR still produce 
>> better results. Is it possible to add diacritics to Latin? Can you help in 
>> any way?
>>
>> regards
>> Venkatesh
>>
>>

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-07-02 Thread yajva
Many thanks. Downloaded and using.
Will wait for next ver.


On Sunday, July 1, 2018 at 12:21:19 AM UTC+5:30, shree wrote:
>
> I have uploaded a new version of traineddata file at 
>
> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata
>
> Attached is the OCRed output for pages 13-24 of dark pdf with it.
>
> I am still training a different variation.
>
>
>
> On Wed, Jun 27, 2018 at 6:46 PM Shree Devi Kumar wrote:
>
>> ok. I will take a look.
>>

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-27 Thread yajva
Checked with both light & dark pdfs. The results are very good. Thanks.

A few concerns. E is consistently missed in both. J is consistently missed 
in the darker image but recognized as T in the dark image. ṝ is consistently 
recognized as ṛ. Can these be addressed?
I am using the tesseract 4 alpha Windows build from the command line.

Are the dev files in the repos?
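
Confusions like the ones listed above (a missed E, J read as T, ṝ read as ṛ) can be counted automatically once a corrected text is available. The following is a rough sketch using Python's standard `difflib`. The function name and the sample words are hypothetical, and a plain character diff is only an approximation of a proper OCR evaluation.

```python
import difflib
import unicodedata
from collections import Counter


def char_confusions(truth, ocr):
    """Count (expected, recognized) pairs where the OCR text differs."""
    # Normalize to NFC so that precomposed letters such as U+1E5D
    # are compared as single characters.
    truth = unicodedata.normalize("NFC", truth)
    ocr = unicodedata.normalize("NFC", ocr)
    counts = Counter()
    matcher = difflib.SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty second element means the character was dropped.
            counts[(truth[i1:i2], ocr[j1:j2])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical word pairs: r with dot below and macron (U+1E5D)
    # recognized as r with dot below only (U+1E5B), and J read as T.
    print(char_confusions("pit\u1e5dn", "pit\u1e5bn"))
    print(char_confusions("Jaya", "Taya"))
```

Sorting the resulting counter by frequency gives a quick list of the characters that most need more training samples.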


On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote:
>
> I had used ghostview to convert PDF to tif or png.
>
> You can ocr PDF directly with gimagereader using the traineddata file I 
> sent.
>
> See links for new windows binaries in msg below.
>
>
> At last, here are some fresh builds:
>
>
> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
>
> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe
>
> I'd be also interested in testing of the tessdata manager, which should 
> now also properly handle script tessdatas
>
> On Tue 26 Jun, 2018, 10:59 PM yajva wrote:
>
>> The doc is a different version of the same text. Here's the doc used for 
>> the first png. This is slightly darker, but the one sent earlier is cleaner. 
>> Let me know which is more amenable to OCR. I use PDF Shaper to extract the 
>> images and convert them to png using xnview.
>>
>> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>>
>>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>>
>>> How did you create the test png from the pdf? I am not getting as good 
>>> quality, tried various settings with irfanview.
>>>
>>>
>>>
>>> On Tue, Jun 26, 2018 at 4:58 PM yajva  wrote:
>>>
>>>> Sorry for the delay, my system was down.
>>>>
>>>> I am getting "Page not Found" for the link given. Can you pl re-check?
>>>>
>>>> Here's the doc I am trying to OCR
>>>>
>>>>
>>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>>
>>>>> Please test with traineddata file from 
>>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>>
>>>>> Need to check that it is not overfitted.
>>>>>
>>>>> Please share a couple more images which I can use for testing.
>>>>>
>>>>>
>>>>> On Thu, Jun 21, 2018 at 11:38 PM yajva  wrote:
>>>>>
>>>>>> one more correction.
>>>>>>
>>>>>>
>>>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>>>>>>>
>>>>>>> done
>>>>>>>
>>>>>>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>>>>>>>>
>>>>>>>> I am attaching the OCRed text. Please correct it so that I can use 
>>>>>>>> it as ground truth for further training and testing.
>>>>>>>>
>>>>>>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar <
>>>>>>>> shree...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I had done a training for sanskrit for both devanagari and IAST 
>>>>>>>>> but it does not include cedilla for Sh 
>>>>>>>>>
>>>>>>>>> I will add it and let you know.
>>>>>>>>>
>>>>>>>>> On Wed 20 Jun, 2018, 1:17 AM yajva,  wrote:
>>>>>>>>>
>>>>>>>>>> I have tried Google OCR for recognizing Sanskrit text in Roman 
>>>>>>>>>> with diacritics (IAST). It recognizes above macron but not dots 
>>>>>>>>>> below also joining grave and accent. Is there any traineddata 
>>>>>>>>>> available for tesseract that can do this with good accuracy? 
>>>>>>>>>> Attached a sample page that I am interested in.
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>> Google Groups "t

Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-21 Thread yajva
one more correction.


On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>
> done
>
> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>>
>> I am attaching the OCRed text. Please correct it so that I can use it as 
>> ground truth for further training and testing.
>>
>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar  
>> wrote:
>>
>>> I had done a training for sanskrit for both devanagari and IAST but it 
>>> does not include cedilla for Sh 
>>>
>>> I will add it and let you know.
>>>
>>> On Wed 20 Jun, 2018, 1:17 AM yajva,  wrote:
>>>
>>>> I have tried Google OCR for recognizing Sanskrit text in Roman with 
>>>> diacritics (IAST). It recognizes above macron but not dots below also 
>>>> joining grave and accent. Is there any traineddata available for tesseract 
>>>> that can do this with good accuracy ? Attached a sample page that I am 
>>>> interested in.
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>> -- 
>>
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/a7bdf637-7f17-4eb3-8fa8-297018633bfa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Çrīgaheçāya namaḥ.

I.
Athāto Gobhiloktānām anyeshāṁ caiva karmaṇām
aspashṭānāṁ vidhiṁ samyag darçayishye pradīpavat | 1. |
Trīvṛd ūrdhvavṛtṃ kāryaṁ tantutrayam adhovṛtam
trivṛt tac copavītaṁ syāt tasyaiko granthir ishyate | 2. |
Pṛshṭhavaṁçe ca nābhyāṁ ca dhṛtaṁ yad vindate kaṭim
tad dhāryam upavītaṁ syān nātolambaṃ na cocchritam | 3. |
Sadopavītinā bhāvyaṁ sadā baddhaçikhena ca
viçikho vyupavītaç ca yat karoti na tat kṛtam | 4. |
Triḥ prāçyāpo dvir unmṛjya mukham etāny upaspṛçet
āsyanāsākṣhikarṇāṁç ca nābhivakṣhaḥçiroṁsakān | 5. |
Aṅgushṭhena pradeçinyā ghrāṇaṁ caivam upaspṛçet
aṅgushṭhānāmikābhyāṁ ca cakṣhuḥ çrotraṃ punaḥ punaḥ | 6. |
Kanishṭhāṅgushṭhayor nābhiṁ hṛdayaṁ tu talena vai
sarvābhis tu çiraḥ paçcād bāhū cāgreṇa saṁspṛçet | 7. |
Yatropadiçyate karma kartur aṅgaṁ na tūcyate
dakṣhiṇas tatra vijñeyaḥ karmaṇāṁ pāragaḥ karaḥ | 8. |
Yatra diṅniyamo na syāj japahomādikarmasu
tisras tatra diçaḥ proktā aindrīsaumyāparājitāḥ | 9. |
Tishṭhann āsīnaḥ prahvo vā niyamo yatra nedṛçaḥ
tadāsīnena kartavyaṁ na prahveṇa na tishṭhatā | 10. |


Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-06-21 Thread yajva
done

On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>
> I am attaching the OCRed text. Please correct it so that I can use it as 
> ground truth for further training and testing.
>
> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar wrote:
>
>> I had done a training for sanskrit for both devanagari and IAST but it 
>> does not include cedilla for Sh 
>>
>> I will add it and let you know.
>>
>> On Wed 20 Jun, 2018, 1:17 AM yajva wrote:
>>
>>> I have tried Google OCR for recognizing Sanskrit text in Roman with 
>>> diacritics (IAST). It recognizes above macron but not dots below also 
>>> joining grave and accent. Is there any traineddata available for tesseract 
>>> that can do this with good accuracy ? Attached a sample page that I am 
>>> interested in.
>>>
>>
>

Çrīgaheçāya namaḥ.

I.
Athāto Gobhiloktānām anyeshāṁ caiva karmaṇām
aspashṭānāṁ vidhiṁ samyag darçayishye pradīpavat | 1. |
Trīvṛd ūrdhvavṛtṃ kāryaṁ tantutrayam adhovṛtam
trivṛt tac copavītaṁ syāt tasyaiko granthir ishyate | 2. |
Pṛshṭhavaṁçe ca nābhyāṁ ca dhṛtaṁ yad vindate kaṭim
tad dhāryam upavītaṁ syān nātolambaṃ na cocchritam | 83. |
Sadopavītinā bhāvyaṁ sadā baddhaçikhena ca
viçikho vyupavītaç ca yat karoti na tat kṛtam | 4. |
Triḥ prāçyāpo dvir unmṛjya mukham etāny upaspṛçet
āsyanāsākṣhikarṇāṁç ca nābhivakṣhaḥçiroṁsakān | 5. |
Aṅgushṭhena pradeçinyā ghrāṇaṁ caivam upaspṛçet
aṅgushṭhānāmikābhyāṁ ca cakṣhuḥ çrotraṃ punaḥ punaḥ | 6. |
Kanishṭhāṅgushṭhayor nābhiṁ hṛdayaṁ tu talena vai
sarvābhis tu çiraḥ paçcād bāhū cāgreṇa saṁspṛçet | 7. |
Yatropadiçyate karma kartur aṅgaṁ na tūcyate
dakṣhiṇas tatra vijñeyaḥ karmaṇāṁ pāragaḥ karaḥ | 8. |
Yatra diṅniyamo na syāj japahomādikarmasu
tisras tatra diçaḥ proktā aindrīsaumyāparājitāḥ | 9. |
Tishṭhann āsīnaḥ prahvo vā niyamo yatra nedṛçaḥ
tadāsīnena kartavyaṁ na prahveṇa na tishṭhatā | 10. |
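
The two attached versions of the corrected text differ only in the marker of the third verse ("83." in the earlier attachment, "3." in the later one). A small check along the following lines could flag that kind of slip before a file is used as ground truth. This is a sketch under assumptions: the function name is hypothetical, and it assumes verse markers always have the form "| N. |" and count up from 1.

```python
import re

# Matches verse markers such as "| 3. |" and captures the number.
VERSE_MARK = re.compile(r"\|\s*(\d+)\.\s*\|")


def out_of_sequence(text, start=1):
    """Return (position, found, expected) for markers that break the count."""
    problems = []
    numbers = [int(n) for n in VERSE_MARK.findall(text)]
    for offset, found in enumerate(numbers):
        expected = start + offset
        if found != expected:
            problems.append((offset + 1, found, expected))
    return problems


if __name__ == "__main__":
    # Shortened stand-in for the attachment; only the markers matter here.
    sample = "... | 1. |\n... | 2. |\n... | 83. |\n... | 4. |"
    print(out_of_sequence(sample))  # [(3, 83, 3)]
```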