Re: [tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-02-10 Thread Shree Devi Kumar
Hello Wincent,

Thanks for the new version of the package.
There are no font-related errors now, and it is no longer slow.

Tested on Ubuntu.



-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduW_H7-wKW1csCMU1S_grTxmV8noo6Dd5q_KCC%2BBH-apTQ%40mail.gmail.com.


Re: [tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-02-09 Thread Shree Devi Kumar
Regarding the maximum number of threads, please see
https://github.com/tesseract-ocr/tesseract/issues/263#issuecomment-455614504

I will test the new scripts later and report back.



-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-02-09 Thread Wincent Balin
Hello Shree,

I just uploaded a new version of the package. About the fixes:

1. --fonts_dir: I added a default value for the fonts directory on each
supported platform.

2. Number of threads: I also capped the maximum number of threads at the
number of CPUs.
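
For readers following along, the two fixes could be sketched roughly as below. This is only an illustration under stated assumptions: the function names default_fonts_dir and capped_workers and the directory paths are hypothetical and are not taken from the pytesstrain source.

```python
import os
import sys
from pathlib import Path


def default_fonts_dir(platform: str = sys.platform) -> Path:
    """Return a plausible default fonts directory for the given platform.

    The paths are common conventions, assumed here for illustration only.
    """
    if platform.startswith("win"):
        return Path(os.environ.get("WINDIR", r"C:\Windows")) / "Fonts"
    if platform == "darwin":
        return Path("/Library/Fonts")
    return Path("/usr/share/fonts")


def capped_workers(requested: int) -> int:
    """Cap the requested number of workers at the number of CPUs."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, cpus))
```

With a cap like this, asking for 20 workers on a 4-CPU machine would start only 4.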

Would you like to re-test it, please?






Re: [tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-02-04 Thread Shree Devi Kumar
> By the way, I added a create_ground_truth utility, which creates .gt.txt
> files as well as the associated .tif files for every specified font, to
> the package. I think it could be useful for anyone who does not have a
> ground truth collection yet.

Thanks, I tried it with the latest Tesseract code.

1. It fails with an error when --fonts_dir is not specified; it works fine
when it is specified.

2. It is very slow (about 10 minutes): it started 20 text2image processes in
parallel for a training_text with 20 lines.

 create_ground_truth --fonts_dir ~/.fonts --fonts "Arial Unicode MS" corpora ground-truth
2020-02-04 11:01:19,135 INFO Processing .txt files
2020-02-04 11:01:19,137 INFO Generating .tif files
2020-02-04 11:10:24,855 INFO Done

It is much faster (about 1 second) after setting OMP_THREAD_LIMIT=1:

 export OMP_THREAD_LIMIT=1
 create_ground_truth --fonts_dir ~/.fonts --fonts "Arial Unicode MS" corpora ground-truth
2020-02-04 11:12:18,713 INFO Processing .txt files
2020-02-04 11:12:18,715 INFO Generating .tif files
2020-02-04 11:12:19,398 INFO Done

You can update the documentation.
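
If the tool is launched from a Python script rather than from a shell, the same setting can be applied to the child process only. The following is a minimal sketch; the helper name run_single_threaded is hypothetical, and it assumes the launched program honours the standard OpenMP variable OMP_THREAD_LIMIT.

```python
import os
import subprocess
import sys


def run_single_threaded(cmd):
    """Run cmd with OMP_THREAD_LIMIT=1 set in the child's environment only."""
    # Copy the current environment and override a single variable, so the
    # parent process and other children are not affected.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    return subprocess.run(cmd, env=env, check=True, capture_output=True, text=True)
```

For example, passing the create_ground_truth command line from the report above as a list of arguments would run it with one OpenMP thread per process.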





Re: [tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-02-04 Thread Shree Devi Kumar
Thanks, Wincent.
I will try out the tools added by you.

I found a Unicode version of the ISRI evaluation tools at
https://github.com/eddieantonio/ocreval which also handles high-range
Unicode code points. See
https://github.com/Shreeshrii/tesstrain-modi/blob/master/reports/modi-eval-modiLayer_1.017_157724_324000/report_modiLayer_1.017_157724_324000-modi-ALL.txt
for an example.
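
To illustrate why counting by code point matters for scripts such as Modi, here is a small sketch of a character error rate (CER) computed over Unicode code points. It is not the method used by ocreval or pytesstrain; the function names levenshtein and cer are chosen for illustration. It relies on the fact that Python 3 strings are sequences of code points, so a character above U+FFFF counts as one unit.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # Deletion, insertion, or substitution/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Note that code points are not the same as user-perceived characters; text with combining marks may need normalisation or grapheme segmentation first.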

Do you have a workflow for Tesseract training using your tools? If so, I
would like to add it or refer to it in the Tesseract documentation.




On Tue, Feb 4, 2020 at 2:06 AM Wincent Balin 
wrote:

> Hi Shree,
>
> I am glad you find the package already useful :-) .
>
> As to your question: I did not use the ocr-evaluation tools, only the
> language_metrics utility. So, regrettably, I cannot help you here. But
> maybe you could try the same utility too?
>
> By the way, I added a create_ground_truth utility, which creates .gt.txt
> files as well as the associated .tif files for every specified font, to
> the package. I think it could be useful for anyone who does not have a
> ground truth collection yet.
>
> Kind regards,
>
> Wincent
>
>
> On Wednesday, January 29, 2020 at 06:47:01 UTC+1, shree wrote:
>>
>> Hi Wincent,
>>
>> Thank you for sharing these tools. I find create-dictdata to be very
>> useful.
>>
>> I wanted to know whether you have modified any ocr-evaluation tools to
>> handle the high Unicode range, such as for the Akkadian language.
>>
>> I was trying to test with the Modi script (range: U+11600..U+1165F,
>> 96 code points) and found that `ocrevalutf8 accuracy` does not work
>> well for it. Any suggestions?
>>
>> Shree
>>
>> On Sunday, January 5, 2020 at 2:22:50 AM UTC+5:30, Wincent Balin wrote:
>>>
>>> Hi all,
>>>
>>> I would like to announce pytesstrain, a collection of Tesseract
>>> training tools, as well as the underlying library. The tools were created
>>> while training Tesseract to recognise Akkadian language (stay tuned for
>>> more posts!), to solve the problems that emerged in the process.
>>>
>>> You can install it with pip install pytesstrain.
>>>
>>> The PyPI page for the package is https://pypi.org/project/pytesstrain/.
>>> The GitHub project page is https://github.com/wincentbalin/pytesstrain.
>>>
>>> This package contains tools to create dictionary data (wordlist, bi-
>>> and unigram lists, etc.), rewrap lines in text files to a specified
>>> length, collect the most frequent recognition errors and dump them into
>>> a unicharambigs file, and compute recognition metrics (WER and CER). It
>>> also contains the run_test() function, which creates an image file from
>>> the given string and then performs OCR on it, as well as its
>>> parallelised version, run_tests(), which can be used in future tools.
>>>
>>> Feedback, suggestions, etc would be most welcome.
>>>
>>> Yours truly,
>>>
>>> Wincent
>>>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
