Re: [tesseract-ocr] Adding Modi Script to Tesseract

2020-01-28 Thread Shree Devi Kumar
The default language that tesseract uses when none are specified is eng.
Hence you get box file with English characters.

There is currently no `Modi` traineddata, so you can't use that. You could
use `-l mar` to use Marathi, but the recognition will obviously not be
correct.

I suggest that you use `wordstrbox` instead of `lstmbox` - it will make it
easier to correct the box files.
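
To make the difference between the two invocations concrete, here is a minimal sketch that only builds the command line, written in Python so the argument list is explicit. The helper name `box_command` is made up for this illustration; `lstmbox` and `wordstrbox` are the config names mentioned above, and `mar` is the stopgap language from this reply. Actually running the command assumes `tesseract` and `mar.traineddata` are installed.

```python
import shlex


def box_command(image, output_base, lang="mar", config="wordstrbox"):
    """Build (not run) a tesseract command line that writes a box file.

    `box_command` is a hypothetical helper used for illustration only.
    """
    if config not in ("lstmbox", "wordstrbox"):
        raise ValueError("expected 'lstmbox' or 'wordstrbox'")
    # Without -l, tesseract falls back to eng, hence the English characters.
    return ["tesseract", image, output_base, "-l", lang, config]


if __name__ == "__main__":
    print(shlex.join(box_command("A.png", "A")))
    print(shlex.join(box_command("A.png", "A", config="lstmbox")))
```

Either way the resulting box file still has to be corrected by hand so that it contains the Modi text.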

Have you looked at the tesstrain repo for training from images?

On Wed, Jan 29, 2020 at 12:10 AM 'Nilambari Joshi' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

>
> The box file is created using the command *tesseract A.png A lstmbox*,
> where A.png is the image with Modi characters.
>
> On Tue, Jan 28, 2020, 21:56 'Nilambari Joshi' via tesseract-ocr <
>> tesseract-ocr@googlegroups.com> wrote:
>>
>>> I was trying to train from an image. I found an image online with all the
>>> Modi script characters and tried to create a box file for it.
>>> In the box file I can see that each character is treated as an
>>> English character.
>>> *My question is how to make it recognize them as Modi characters.*
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVp6OR_GT2XcgHwWVYn5-UZsjb4td%3DnRyTstp7pU-QYkQ%40mail.gmail.com.


[tesseract-ocr] lstmtraining creates an unuseable .traineddata file

2020-01-28 Thread Amory Kisch
I followed the instructions for Fine Tuning in the "TrainingTesseract 4.00" 
tutorial. The first time I did this process, it worked fine; I ended up 
with a new model that improved performance. However, whenever I have 
subsequently tried to train a new model, after running through the process 
I get "Error opening data file". I have tried this several times with the 
same result. Is there a way I can find out what is causing this error? Does 
it have something to do with the training true text? Should I start over 
with a fresh install of Tesseract? I'd appreciate any ideas.
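
One thing that could be ruled out first is whether the file tesseract is asked to open is really where it is expected to be. The sketch below is a hypothetical diagnostic, not part of Tesseract: `check_traineddata`, the fallback directory and the model name `mymodel` are all assumptions. It only checks that `<tessdata>/<lang>.traineddata` exists, is non-empty and is readable. It is also worth confirming that the training checkpoint was converted to a `.traineddata` file with `lstmtraining --stop_training`, since a checkpoint by itself is not a usable model.

```python
import os


def check_traineddata(tessdata_dir, lang):
    """Return a list of problems found with <tessdata_dir>/<lang>.traineddata.

    Hypothetical helper; an empty list means nothing obvious is wrong.
    """
    path = os.path.join(tessdata_dir, lang + ".traineddata")
    problems = []
    if not os.path.isdir(tessdata_dir):
        problems.append("tessdata directory does not exist: " + tessdata_dir)
    elif not os.path.isfile(path):
        problems.append("no such file: " + path)
    elif os.path.getsize(path) == 0:
        problems.append("file is empty: " + path)
    elif not os.access(path, os.R_OK):
        problems.append("file is not readable: " + path)
    return problems


if __name__ == "__main__":
    # Both values are assumptions; substitute the directory and model in use.
    tessdata = os.environ.get("TESSDATA_PREFIX", "/usr/share/tessdata")
    for problem in check_traineddata(tessdata, "mymodel"):
        print(problem)
```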



[tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-01-28 Thread shree
Hi Wincent,

Thank you for sharing these tools. I find create-dictdata to be very useful.

I wanted to know whether you have modified any of the ocr-evaluation tools to
handle high Unicode ranges (code points beyond the Basic Multilingual Plane),
such as the one needed for the Akkadian language.

I was testing with Modi script (range U+11600..U+1165F, 96 code points) and
found that `ocrevalutf8 accuracy` does not work well for it. Any
suggestions?
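
As an independent cross-check for such scripts, a character error rate can be computed directly over Unicode code points. The following is a minimal sketch, not the method used by ocrevalutf8 or pytesstrain; the function names are made up for this example. It relies on Python 3 strings being sequences of code points, so a Modi character counts as one character. It does not group combining sequences into grapheme clusters.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: strings of code points)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate, counting one character per Unicode code point."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def in_modi_block(ch):
    """True if ch lies in the Modi block U+11600..U+1165F."""
    return 0x11600 <= ord(ch) <= 0x1165F


if __name__ == "__main__":
    ref = "\U00011600\U00011601\U00011602"  # three Modi code points
    hyp = "\U00011600\U00011603\U00011602"  # one substitution
    print(len(ref), cer(ref, hyp), all(in_modi_block(c) for c in ref))
```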

Shree

On Sunday, January 5, 2020 at 2:22:50 AM UTC+5:30, Wincent Balin wrote:
>
> Hi all,
>
> I would like to announce pytesstrain, a collection of Tesseract training 
> tools, as well as the underlying library. The tools were created while 
> training Tesseract to recognise Akkadian language (stay tuned for more 
> posts!), to solve the problems that emerged in the process.
>
> You can install it with pip install pytesstrain.
>
> The PyPI page for the package is https://pypi.org/project/pytesstrain/. 
> The GitHub project page is https://github.com/wincentbalin/pytesstrain.
>
> This package contains the tools to create dictionary data (wordlist, bi- 
> and unigram lists, etc.), rewrap lines in text files to the specified 
> length, collect most frequent recognition errors and dump them into 
> unicharambigs file, and to perform recognition metrics (WER and CER). It 
> also contains the run_test() function, which creates an image file from 
> the given string and performs OCR on it afterwards, as well as its 
> parallelised version, run_tests(), which can be used in future tools.
>
> Feedback, suggestions, etc would be most welcome.
>
> Yours truly,
>
> Wincent
>
>



Re: [tesseract-ocr] Pros and cons of .tiff vs .png

2020-01-28 Thread Thad Guidry
There are a few wiki pages that cover some of this.
You can find the pages that mention "png" by searching on GitHub
and then filtering on Wiki (instead of the default, Code).
Here are the filtered results from the wiki for "png":

https://github.com/tesseract-ocr/tesseract/search?q=png&type=Wikis

Thad
https://www.linkedin.com/in/thadguidry/


On Tue, Jan 28, 2020 at 5:13 PM teksts  wrote:

> Hi all,
>
> I'm fairly new to tesseract (and to programming work in general), and am
> trying to get my bearings. Almost everything I have seen recommends/assumes
> that I feed .tiff files into tesseract to be ocr'd, but I recently came
> across some posts suggesting that .png is less finicky, and might be better
> to use. What are the pros and cons of each filetype? Does it make much of a
> difference? Can I use .png for training purposes as well?
>
> Thanks for your time
>


[tesseract-ocr] Pros and cons of .tiff vs .png

2020-01-28 Thread teksts
Hi all,

I'm fairly new to tesseract (and to programming work in general), and am 
trying to get my bearings. Almost everything I have seen recommends/assumes 
that I feed .tiff files into tesseract to be ocr'd, but I recently came 
across some posts suggesting that .png is less finicky, and might be better 
to use. What are the pros and cons of each filetype? Does it make much of a 
difference? Can I use .png for training purposes as well?

Thanks for your time



Re: [tesseract-ocr] Adding Modi Script to Tesseract

2020-01-28 Thread 'Nilambari Joshi' via tesseract-ocr
I tried using MarathiCursiveT Medium as font in fontlist and it worked.
Thanks for that.
It created traineddata and unicharset files in the destination folder.
I hope now I can continue with further instructions as mentioned at
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
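
For anyone repeating this, the font name contains a space, so it has to reach tesstrain.sh as a single argument. The sketch below rebuilds the command from this thread with the font name as a parameter; `tesstrain_command` is a made-up helper, the paths are the ones quoted in the thread, and the font name is the one reported above as working. Use whatever exact name Pango reports on your system.

```python
import shlex


def tesstrain_command(fontname):
    """Build (not run) the tesstrain.sh invocation quoted in this thread.

    All paths are the ones from the thread; only the font name varies.
    """
    return [
        "src/training/tesstrain.sh",
        "--fonts_dir", "/usr/share/fonts",
        "--fontlist", fontname,  # one argument, even if it contains a space
        "--lang", "mar",
        "--linedata_only",
        "--noextract_font_properties",
        "--langdata_dir", "../tesstutorial/langdata",
        "--tessdata_dir", "../tesstutorial/tesseract/tessdata",
        "--training_text", "../tesstutorial/langdata/mar/mar.modi.training_text",
        "--output_dir", "../tesstutorial/moditrain",
    ]


if __name__ == "__main__":
    # shlex.join adds the shell quoting that the space in the name requires.
    print(shlex.join(tesstrain_command("MarathiCursiveT Medium")))
```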

The box file is created using the command *tesseract A.png A lstmbox*,
where A.png is the image with Modi characters.


On Tue, Jan 28, 2020 at 12:28 PM Shree Devi Kumar 
wrote:

>
> *MarthiCursiveT Medium*
> *Use the above as the font with tesstrain.sh*
>
> *How are you creating the box file for the image?*
>
>

Re: [tesseract-ocr] Adding Modi Script to Tesseract

2020-01-28 Thread Shree Devi Kumar
*MarthiCursiveT Medium*
*Use the above as the font with tesstrain.sh*

*How are you creating the box file for the image?*


On Tue, Jan 28, 2020, 21:56 'Nilambari Joshi' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> I was trying to train from an image. I found an image online with all the
> Modi script characters and tried to create a box file for it.
> In the box file I can see that each character is treated as an English
> character.
> *My question is how to make it recognize them as Modi characters.*
>
> Then I tried to use tesstrain.sh as below:
>
>     src/training/tesstrain.sh --fonts_dir /usr/share/fonts \
>       --fontlist MarathiCursiveT --lang mar --linedata_only \
>       --noextract_font_properties \
>       --langdata_dir ../tesstutorial/langdata \
>       --tessdata_dir ../tesstutorial/tesseract/tessdata \
>       --training_text ../tesstutorial/langdata/mar/mar.modi.training_text \
>       --output_dir ../tesstutorial/moditrain
>
> I built (by running make) the MarathiCursiveT TrueType Unicode Modi font
> from https://github.com/MihailJP/MarathiCursive, the link mentioned in
> response to my query.
> I put that file at /usr/share/fonts/truetype/MarathiCursiveT
>
> I created mar.modi.training_text by pasting the content of the Marathi
> training text file into the Aksharamukha app and taking the output text in
> Modi.
>
> *With tesstrain.sh I get the error: Could not find font named
> 'MarathiCursiveT'. Pango suggested font 'MarthiCursiveT Medium'*
>
> Please advise on both queries. Thanks in advance.
>

Re: [tesseract-ocr] Incremental Training Tesseract 4.0+ for fraktur

2020-01-28 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/tesstrain/wiki

There are already newly trained models by @stweil for Fraktur.

On Tue, Jan 28, 2020, 22:46 Val LNB  wrote:

> *How to perform incremental training on Tesseract 4.0+?*
>
>
> I want to improve the existing fraktur (frk) model with some 6000 hand
> curated lines from our library.
>
> Ground truth for these lines has 10 new unicode characters not present in
> German fraktur model.
>
>
> How can I continue training from the existing German fraktur model without
> full retraining?
>
>
> Progress so far:
>
>
>    - I am following the information on
>      https://github.com/tesseract-ocr/tesstrain
>    - My script created the .tif and gt.txt files based on the examples
>      provided in
>      https://github.com/tesseract-ocr/tesstrain/blob/master/ocrd-testset.zip
>    - The Makefile
>      https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile now has
>      a place for START_MODEL
>
>
> What, if anything, do I enter into START_MODEL?
>
>
> It would be fantastic to see an example CLI command used for your
> incremental training. :)
>


[tesseract-ocr] Incremental Training Tesseract 4.0+ for fraktur

2020-01-28 Thread Val LNB
*How to perform incremental training on Tesseract 4.0+?*


I want to improve the existing Fraktur (frk) model with some 6000
hand-curated lines from our library.

Ground truth for these lines has 10 new unicode characters not present in 
German fraktur model.
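
To pin down exactly which characters are new, the ground truth can be compared against the character set of the existing model. The sketch below is illustrative only: `new_characters` is a made-up helper and the sample data is a toy. In practice the lines would come from the gt.txt files, and the known characters from the model's unicharset (which, as far as I know, can be unpacked from the traineddata with `combine_tessdata -u`).

```python
def new_characters(ground_truth_lines, known_characters):
    """Return the sorted characters used in the ground truth but not already known."""
    seen = set()
    for line in ground_truth_lines:
        # Whitespace is ignored; everything else counts as a character in use.
        seen.update(ch for ch in line if not ch.isspace())
    return sorted(seen - set(known_characters))


if __name__ == "__main__":
    # Toy data standing in for the gt.txt lines and the model's character set.
    lines = ["Daſs iſt", "ein Teſt ꝛc."]
    known = "Dadeinstſ.T"
    for ch in new_characters(lines, known):
        print(f"U+{ord(ch):04X} {ch}")
```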


How can I continue training from the existing German fraktur model without 
full retraining?


Progress so far:


   - I am following the information on
     https://github.com/tesseract-ocr/tesstrain
   - My script created the .tif and gt.txt files based on the examples
     provided in
     https://github.com/tesseract-ocr/tesstrain/blob/master/ocrd-testset.zip
   - The Makefile
     https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile now has
     a place for START_MODEL


What, if anything, do I enter into START_MODEL?


It would be fantastic to see an example CLI command used for your 
incremental training. :)
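
A rough sketch of what such a call might look like follows, assembled in Python so the individual variables are visible. It is an assumption-laden illustration rather than a confirmed recipe: `finetune_command` is a made-up helper, the model name `frk_lnb` and the tessdata path are placeholders, and the variable names (`MODEL_NAME`, `START_MODEL`, `TESSDATA`, `MAX_ITERATIONS`) are how I understand the tesstrain Makefile, so please check them against the Makefile itself. `START_MODEL` would be the name of the existing model without the `.traineddata` extension, and fine-tuning needs the float model from tessdata_best rather than the integer "fast" one.

```python
import shlex


def finetune_command(model_name, start_model, tessdata_dir, max_iterations=None):
    """Build (not run) a `make training` call for the tesstrain Makefile."""
    cmd = [
        "make", "training",
        "MODEL_NAME=" + model_name,    # name of the new model (made up here)
        "START_MODEL=" + start_model,  # existing model to continue from
        "TESSDATA=" + tessdata_dir,    # directory holding <START_MODEL>.traineddata
    ]
    if max_iterations is not None:
        cmd.append("MAX_ITERATIONS=%d" % max_iterations)
    return cmd


if __name__ == "__main__":
    # Placeholder values; adjust to the local setup.
    print(shlex.join(finetune_command("frk_lnb", "frk", "/path/to/tessdata_best", 10000)))
```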















Re: [tesseract-ocr] Adding Modi Script to Tesseract

2020-01-28 Thread 'Nilambari Joshi' via tesseract-ocr
I was trying to train from an image. I found an image online with all the Modi
script characters and tried to create a box file for it.
In the box file I can see that each character is treated as an English
character.
*My question is how to make it recognize them as Modi characters.*

Then I tried to use tesstrain.sh as below:

    src/training/tesstrain.sh --fonts_dir /usr/share/fonts \
      --fontlist MarathiCursiveT --lang mar --linedata_only \
      --noextract_font_properties \
      --langdata_dir ../tesstutorial/langdata \
      --tessdata_dir ../tesstutorial/tesseract/tessdata \
      --training_text ../tesstutorial/langdata/mar/mar.modi.training_text \
      --output_dir ../tesstutorial/moditrain

I built (by running make) the MarathiCursiveT TrueType Unicode Modi font from
https://github.com/MihailJP/MarathiCursive, the link mentioned in response to
my query.
I put that file at /usr/share/fonts/truetype/MarathiCursiveT

I created mar.modi.training_text by pasting the content of the Marathi training
text file into the Aksharamukha app and taking the output text in Modi.

*With tesstrain.sh I get the error: Could not find font named
'MarathiCursiveT'. Pango suggested font 'MarthiCursiveT Medium'*

Please advise on both queries. Thanks in advance.

On Monday, January 27, 2020 at 3:22:17 AM UTC-5, shree wrote:
>
> For LSTM training punc, numbers, wordlist are NOT required. You can add 
> them if you like. Unicharset is generated from the training text.
>
> Are you planning to train from text or images?
>
> On Mon, Jan 27, 2020 at 2:19 AM 'Nilambari Joshi' via tesseract-ocr <
> tesser...@googlegroups.com > wrote:
>
>> Thanks for your response. I will work as suggested. Please also clarify 
>> whether I need to create separate language directory for Modi similar to 
>> Marathi with all files like number, punc wordlist included and a separate 
>> unicharset file as well?  
>> Thanks in advance.
>>
>> On Sunday, January 26, 2020 at 12:26:51 PM UTC-5, shree wrote:
>>>
>>> Thanks for the link to Modi Unicode font.
>>>
>>> I would convert the Marathi training text to Modi script (use 
>>> Aksharamukha) and then train using the unicode font.
>>>
>>> On Sun, Jan 26, 2020 at 10:28 PM Patrick CHEW  
>>> wrote:
>>>

>>>> On Jan 26, 2020, at 08:16, Shree Devi Kumar wrote:
>>>>
>>>> Is there a Unicode font for Modi script?
>>>>
>>>>
>>>> https://github.com/MihailJP/MarathiCursive
>>>>
>>>> On Sun, Jan 26, 2020, 21:22 'Nilambari Joshi' via tesseract-ocr <
>>>> tesser...@googlegroups.com> wrote:
>>>>
>>>>> Hi... I want to create Modi script (Marathi language) traineddata in
>>>>> tesseract for OCR. Can somebody advise what steps I should follow?
>>>>> I referred to
>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>>> but got stuck at the stage of creating box files.
>>>>>
>>>
>>>
>>> -- 
>>>
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>
>
> -- 
>
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>



Re: [tesseract-ocr] Re: How to make training for Arabic in Tesseract 4.0

2020-01-28 Thread Shree Devi Kumar
Please see https://github.com/Shreeshrii/tesstrain-ckb

This is for finetune training from script/Arabic, using text and fonts.

You would need to do steps similar to

https://github.com/Shreeshrii/tesstrain-ckb/blob/master/0-setup.sh
https://github.com/Shreeshrii/tesstrain-ckb/blob/master/2-txt2img.sh
https://github.com/Shreeshrii/tesstrain-ckb/blob/master/3-img2lstmf.sh
https://github.com/Shreeshrii/tesstrain-ckb/blob/master/4-train-layer.sh
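
For orientation, the general shape of a text-and-font fine-tuning run can be sketched as a sequence of command lines. This is not the content of the scripts linked above: `finetune_pipeline` is a hypothetical helper, every path and name is a placeholder argument, the iteration count is arbitrary, and the sketch shows plain fine-tuning rather than the layer replacement that the last script name suggests.

```python
def finetune_pipeline(text_file, font, out_base, start_traineddata, listfile):
    """Return the command lines (as lists) for a text-and-font fine-tuning run.

    Hypothetical helper: every path and name is a placeholder argument.
    """
    suffix = ".traineddata"
    if not start_traineddata.endswith(suffix):
        raise ValueError("start_traineddata must end in .traineddata")
    start_lstm = start_traineddata[:-len(suffix)] + ".lstm"
    return [
        # 1. Render the training text to a .tif plus a matching .box file.
        ["text2image", "--text=" + text_file, "--outputbase=" + out_base,
         "--font=" + font, "--fonts_dir=/usr/share/fonts"],
        # 2. Turn the image/box pair into an .lstmf training file.
        ["tesseract", out_base + ".tif", out_base, "--psm", "6", "lstm.train"],
        # 3. Extract the LSTM network from the starting model.
        ["combine_tessdata", "-e", start_traineddata, start_lstm],
        # 4. Continue training from that network; listfile names the .lstmf files.
        ["lstmtraining", "--continue_from", start_lstm,
         "--traineddata", start_traineddata,
         "--train_listfile", listfile,
         "--model_output", out_base + "_tuned",
         "--max_iterations", "400"],  # arbitrary placeholder value
    ]


if __name__ == "__main__":
    for step in finetune_pipeline("ara.txt", "Some Arabic Font",
                                  "out/ara.font.exp0", "Arabic.traineddata",
                                  "out/list.txt"):
        print(" ".join(step))
```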




On Tue, Jan 28, 2020 at 12:08 PM manu pranay 
wrote:

> Shree,
> can you please help me work out how to perform Arabic training on Tesseract 4?
>
> thank you
>
>
> On Thursday, May 4, 2017 at 3:22:42 PM UTC+5:30, shree wrote:
>>
>> Ibr,
>>
>> You are incorrect in your description of LSTM training.
>>
>> What you are doing will use the ara.traineddata provided in the repo,
>> there will be no change in output.
>>
>> Once lstmf files are created, you have to run lstmtraining which will run
>> for days/weeks  to give you a good result.
>>
>> Please read about LSTM training on wiki.
>>
>> On May 4, 2017 2:58 PM, "Ibr"  wrote:
>>
>>> If you are referring to tesseract 4.00alpha with leptonica 1.74.1, and
>>> if you compiled them correctly and got the binaries that you need
>>> for creating lstmf training files, then I recommend following the
>>> suggestion made by the tesseract devs: once you create an .lstmf file for a
>>> certain font (that can be used for Arabic writing), get the official
>>> ara.traineddata file from GitHub, paste it into the tessdata folder, put
>>> the lstmf file in the tesseract folder, and run the command
>>> tesseract text_image result_text -l ara --oem 1
>>> Which Arabic characters exactly are you trying to improve the accuracy
>>> for?
>>>
>>> On Saturday, April 8, 2017 at 11:52:25 AM UTC+3, Ahmad Moawad wrote:
>>>
>>>> Hello All,
>>>>
>>>>
>>>> I want to train Tesseract 4.0 for the Arabic language. The results of
>>>> this version are great but still need some tuning, so I got
>>>> jTessBoxEditor 2.0 beta.
>>>> I tried to correct the wrong characters and build ara.traineddata.
>>>> After copying the ara.traineddata to
>>>> /usr/share/tesseract-ocr/4.00/tessdata, I got random characters when I ran
>>>> tesseract on the image.
>>>> Any suggestions on how to do training for version 4.0? I already know
>>>> that the Cube engine of the last 3.0x version is not included in 4.0
>>>> LSTM. Or should I wait until Ray makes another updated ara.traineddata?
>>>>
>>>> Thanks.



-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
