Re: [tesseract-ocr] accuracy problem after trained in fine-tune

Des Bw Sun, 10 Sep 2023 23:50:08 -0700

Thank you so much for putting out these brilliant scripts. They make the 
process much more efficient.


I have one more question on the other script that you use to train. 

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)
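
As an aside, the same call can be written without shell=True by passing the make variables as a list and TESSDATA_PREFIX through the environment. This is only a minimal sketch, assuming it is run from the tesstrain directory; the helper names build_training_command and run_training are hypothetical, and the variable values are the ones from the script above.

```python
import os
import subprocess


def build_training_command(model_name, start_model="ben",
                           tessdata="../tesseract/tessdata",
                           max_iterations=10000):
    """Return the argument list for `make training` (hypothetical helper)."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]


def run_training(model_name, tessdata="../tesseract/tessdata"):
    """Run `make training` for one model name (hypothetical helper)."""
    # TESSDATA_PREFIX goes into the environment instead of the shell string.
    env = dict(os.environ, TESSDATA_PREFIX=tessdata)
    command = build_training_command(model_name, tessdata=tessdata)
    # check=True raises CalledProcessError if make exits with an error.
    return subprocess.run(command, env=env, check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch is safe to try.
    print(" ".join(build_training_command("ben")))
```

Calling run_training("ben") would then start the actual training run.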

Do you keep the font names in a file in the same (root) directory?
If you don't mind sharing, how do you set up the font names in that 
file?
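
For anyone following along: the quoted reply below suggests that the font names live in a small Python module next to the script. Here is a minimal sketch of that layout, assuming the module is saved as FontList.py (so that `from FontList import FontList` resolves) and using three of the font names quoted in this thread. The helper base_name is hypothetical and only illustrates how the per-font file names are built.

```python
# FontList.py (sketch; the font names are examples taken from this thread)
class FontList:
    def __init__(self):
        # One string per font, each followed by a comma. If a comma is
        # missing, Python joins the two adjacent names into one string.
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
        ]


def base_name(text_stem, line_index, font_name):
    """Build a per-line, per-font file base name (hypothetical helper)."""
    return f"{text_stem}_{line_index}_{font_name.replace(' ', '_')}"


if __name__ == "__main__":
    for font in FontList().fonts:
        print(base_name("eng", 0, font))
```

The names presumably have to match what text2image reports for the installed fonts; running text2image with --list_available_fonts (and --fonts_dir pointing at the font directory) should show them.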
On Monday, September 11, 2023 at 4:27:27 AM UTC+3 [email protected] 
wrote:

> You can use the new script below; it is better than the previous two 
> scripts. It creates the *tif, gt.txt, and .box files* for multiple fonts. 
> It also supports a breakpoint: if VS Code closes or anything else 
> interrupts the run while the *tif, gt.txt, and .box files* are being 
> created, you can use the checkpoint to resume from where it stopped.
>
> Script for creating the *tif, gt.txt, and .box files*:
>
>
> import os
> import pathlib
> import subprocess
> import argparse
> from FontList import FontList
>
> def create_training_data(training_text_file, font_list, output_directory,
>                          start_line=None, end_line=None):
>     # Read every line of the training text.
>     with open(training_text_file, 'r') as input_file:
>         lines = input_file.readlines()
>
>     # Create the output directory (and any missing parents).
>     os.makedirs(output_directory, exist_ok=True)
>
>     if start_line is None:
>         start_line = 0
>
>     # Clamp the end index so it never runs past the last line.
>     if end_line is None:
>         end_line = len(lines) - 1
>     else:
>         end_line = min(end_line, len(lines) - 1)
>
>     for font_name in font_list.fonts:
>         for line_index in range(start_line, end_line + 1):
>             line = lines[line_index].strip()
>
>             training_text_file_name = pathlib.Path(training_text_file).stem
>
>             line_serial = f"{line_index:d}"
>             font_tag = font_name.replace(" ", "_")
>
>             # Base name shared by the .gt.txt, .tif and .box files
>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_tag}'
>
>             # Ground-truth text file for this line and font
>             line_gt_text = os.path.join(output_directory,
>                                         f'{file_base_name}.gt.txt')
>
>             with open(line_gt_text, 'w') as output_file:
>                 output_file.writelines([line])
>
>             subprocess.run([
>                 'text2image',
>                 f'--font={font_name}',
>                 f'--text={line_gt_text}',
>                 f'--outputbase={output_directory}/{file_base_name}',
>                 '--max_pages=1',
>                 '--strip_unrenderable_words',
>                 '--leading=36',
>                 '--xsize=3600',
>                 '--ysize=330',
>                 '--char_spacing=1.0',
>                 '--exposure=0',
>                 '--unicharset_file=langdata/eng.unicharset',
>             ])
>
> if __name__ == "__main__":
>     parser = argparse.ArgumentParser()
>     parser.add_argument('--start', type=int,
>                         help='Starting line count (inclusive)')
>     parser.add_argument('--end', type=int,
>                         help='Ending line count (inclusive)')
>     args = parser.parse_args()
>
>     training_text_file = 'langdata/eng.training_text'
>     output_directory = 'tesstrain/data/eng-ground-truth'
>
>     font_list = FontList()
>
>     create_training_data(training_text_file, font_list, output_directory,
>                          args.start, args.end)
>
>
>
> Then create a file called "FontList.py" in the root directory and paste 
> the following into it.
>
>
>
> class FontList:
>     def __init__(self):
>         self.fonts = [
>             "Gerlick",
>             "Sagar Medium",
>             "Ekushey Lohit Normal",
>             "Charukola Round Head Regular, weight=433",
>             "Charukola Round Head Bold, weight=443",
>             "Ador Orjoma Unicode",
>         ]
>
>
>
> Then import it in the script above.
>
>
> *Breakpoint command:*
>
>
> sudo python3 split_training_text.py --start 0 --end 11
>
>
>
> Change the --start and --end values (0 and 11 here) to match your 
> checkpoint.
>
> *The training checkpoint works as you already know.*
>
>
> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 [email protected] 
> wrote:
>
>> Hi mhalidu, 
>> the script you posted here seems much more extensive than the one you 
>> posted before: 
>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>> .
>>
>> I have been using your earlier script. It is magical. How is this one 
>> different from the earlier one?
>>
>> Thank you for posting these scripts, by the way. They have saved me 
>> countless hours by running multiple fonts in one sweep. I was not able to 
>> find any instructions on how to train for multiple fonts. The official 
>> manual is also unclear. Your script helped me get started. 
>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 [email protected] 
>> wrote:
>>
>>> OK, I will try what you suggested.
>>> One more thing: what role do the trained_text lines play? I have seen 
>>> that the Bengali text consists of long lines of words, so I want to know 
>>> how many words or characters per line would be the better choice for 
>>> training. 
>>> And should '--xsize=3600', '--ysize=350' be set according to the length 
>>> of the lines?
>>>
>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>>
>>>> Include the default fonts also in your fine-tuning list of fonts and 
>>>> see if that helps.
>>>>
>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> wrote:
>>>>
>>>>> I have fine-tuned Tesseract 5 on some new fonts for the Bengali 
>>>>> language, using the official trained_text, tessdata_best, and the other 
>>>>> standard resources. Everything works, but there is one problem: text in 
>>>>> the default font, which the model was trained on before, is no longer 
>>>>> converted as well as it was previously, while my new fonts work well. I 
>>>>> don't understand why this is happening. I am sharing the code below to 
>>>>> help show what is going on.
>>>>>
>>>>>
>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>> import os
>>>>> import random
>>>>> import pathlib
>>>>> import subprocess
>>>>> import argparse
>>>>> from FontList import FontList
>>>>>
>>>>> def read_line_count():
>>>>>     if os.path.exists('line_count.txt'):
>>>>>         with open('line_count.txt', 'r') as file:
>>>>>             return int(file.read())
>>>>>     return 0
>>>>>
>>>>> def write_line_count(line_count):
>>>>>     with open('line_count.txt', 'w') as file:
>>>>>         file.write(str(line_count))
>>>>>
>>>>> def create_training_data(training_text_file, font_list, 
>>>>> output_directory, start_line=None, end_line=None):
>>>>>     lines = []
>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>         for line in input_file.readlines():
>>>>>             lines.append(line.strip())
>>>>>     
>>>>>     if not os.path.exists(output_directory):
>>>>>         os.mkdir(output_directory)
>>>>>     
>>>>>     random.shuffle(lines)
>>>>>     
>>>>>     if start_line is None:
>>>>>         # Set the starting line_count from the file
>>>>>         line_count = read_line_count()
>>>>>     else:
>>>>>         line_count = start_line
>>>>>     
>>>>>     if end_line is None:
>>>>>         end_line_count = len(lines) - 1  # Set the ending line_count
>>>>>     else:
>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>     
>>>>>     # Iterate through all the fonts in the font_list
>>>>>     for font in font_list.fonts:
>>>>>         font_serial = 1
>>>>>         for line in lines:
>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>             
>>>>>             # Generate a unique serial number for each line
>>>>>             line_serial = f"{line_count:d}"
>>>>>             
>>>>>             # GT (Ground Truth) text filename
>>>>>             line_gt_text = os.path.join(
>>>>>                 output_directory,
>>>>>                 f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>                 output_file.writelines([line])
>>>>>             
>>>>>             # Image filename (unique filename for each font)
>>>>>             file_base_name = f'ben_{line_serial}'
>>>>>             subprocess.run([
>>>>>                 'text2image',
>>>>>                 f'--font={font}',
>>>>>                 f'--text={line_gt_text}',
>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>                 '--max_pages=1',
>>>>>                 '--strip_unrenderable_words',
>>>>>                 '--leading=36',
>>>>>                 '--xsize=3600',
>>>>>                 '--ysize=350',
>>>>>                 '--char_spacing=1.0',
>>>>>                 '--exposure=0',
>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>             ])
>>>>>             
>>>>>             line_count += 1
>>>>>             font_serial += 1
>>>>>         
>>>>>         # Reset font_serial for the next font iteration
>>>>>         font_serial = 1
>>>>>     
>>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>>
>>>>> if __name__ == "__main__":
>>>>>     parser = argparse.ArgumentParser()
>>>>>     parser.add_argument('--start', type=int,
>>>>>                         help='Starting line count (inclusive)')
>>>>>     parser.add_argument('--end', type=int,
>>>>>                         help='Ending line count (inclusive)')
>>>>>     args = parser.parse_args()
>>>>>     
>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>     
>>>>>     # Create an instance of the FontList class
>>>>>     font_list = FontList()
>>>>>      
>>>>>     create_training_data(training_text_file, font_list, 
>>>>> output_directory, args.start, args.end)
>>>>>
>>>>>
>>>>> *and for training code:*
>>>>>
>>>>> import subprocess
>>>>>
>>>>> # List of font names
>>>>> font_names = ['ben']
>>>>>
>>>>> for font in font_names:
>>>>>     command = (
>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>         f"MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata "
>>>>>         "MAX_ITERATIONS=10000 LANG_TYPE=Indic"
>>>>>     )
>>>>>     subprocess.run(command, shell=True)
>>>>>
>>>>>
>>>>> Any suggestions for identifying the problem?
>>>>> Thanks, everyone.
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>>> .
>>>>>
>>>>
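
The breakpoint idea in the quoted reply (restarting with --start and --end after an interruption) could also be automated by checking which images already exist. This is a rough sketch only, assuming the file naming of the newer script (text stem, line index, font name with spaces replaced by underscores) and that text2image writes a .tif next to each output base. The function name missing_items is hypothetical.

```python
import os
import tempfile


def missing_items(line_count, fonts, output_directory, text_stem):
    """Return (line_index, font_name) pairs whose .tif is not on disk yet."""
    missing = []
    for font_name in fonts:
        font_tag = font_name.replace(" ", "_")
        for line_index in range(line_count):
            tif_name = f"{text_stem}_{line_index}_{font_tag}.tif"
            if not os.path.exists(os.path.join(output_directory, tif_name)):
                missing.append((line_index, font_name))
    return missing


if __name__ == "__main__":
    # Demo in a temporary directory: pretend line 0 was already rendered.
    with tempfile.TemporaryDirectory() as out_dir:
        open(os.path.join(out_dir, "eng_0_Sagar_Medium.tif"), "w").close()
        print(missing_items(2, ["Sagar Medium"], out_dir, "eng"))
```

Only the pairs returned here would then need to be passed to text2image again.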

