I will try some changes. thx

On Thursday, 14 September 2023 at 2:46:36 pm UTC+6, [email protected] wrote:
I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by changing "r" to "utf8" in the script. I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <[email protected]> wrote:

Have you faced this error, "Can't encode transcription"? If you have, how did you solve it?

On Thursday, 14 September 2023 at 10:51:52 am UTC+6, [email protected] wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <[email protected]> wrote:

Are you training from Tesseract's default text data or your own collected text data?

On Thursday, 14 September 2023 at 12:19:53 am UTC+6, [email protected] wrote:

I have now got to 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3, [email protected] wrote:

After Tesseract recognizes text from images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September 2023 at 5:06:12 pm UTC+6, [email protected] wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf

> EasyOCR, I think, is best for ID cards and similar images; but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3, [email protected] wrote:

I know what you mean, but in some cases it helps me.
I have faced specific characters and words that are consistently not recognized by Tesseract, so I use these regexes to replace those characters and words when they come out wrong.

See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
"জ্া": "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
"য": "য",
" া": "া",
"আা": "আ",
"ম্ি": "মি",
"স্ু": "সু",
"হূ ": "হূ",
" ণ": "ণ",
"র্্": "র",
"চিন্ত ": "চিন্তা ",
"ন্া": "না",
"সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September 2023 at 4:18:22 pm UTC+6, [email protected] wrote:

The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original training of the English data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases, it replaces it with closely similar letters such as /v/ and /w/; in other cases, it completely removes it. That is what is happening in my case: those characters are sometimes completely removed; other times, they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3, [email protected] wrote:

If some specific characters or words are always missing from the OCR result, then you can apply logic with the regular-expression method in your application.
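A replacement table like the one above can be applied as a simple post-processing pass over the OCR output. A minimal sketch (the mapping entries are taken from the list above; `OCR_FIXES` and `apply_fixes` are illustrative names):

```python
# Minimal post-OCR replacement pass. The mapping entries are examples
# taken from the list above; extend the dict with your own pairs.
OCR_FIXES = {
    "জ্া": "জা",
    "আা": "আ",
    "সম ূর্ন": "সম্পূর্ণ",
}

def apply_fixes(text: str) -> str:
    # Plain substring replacement is enough for fixed strings;
    # re.sub would only be needed for pattern-based rules.
    for wrong, right in OCR_FIXES.items():
        text = text.replace(wrong, right)
    return text
```

Run it over each OCR output line before writing the final text.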
After OCR, these specific characters or words will be replaced by the correct characters or words that you defined in your application with regular expressions. It can solve some major problems.

On Wednesday, 13 September 2023 at 3:51:29 pm UTC+6, [email protected] wrote:

The characters are getting missed even after fine-tuning. I never made any progress. I tried many different ways; some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3, [email protected] wrote:

EasyOCR, I think, is best for ID cards and similar images; but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added words to the dictionaries, but the result is the same.

What kind of problem did you face when fine-tuning for a few new characters, as you said ("but I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September 2023 at 3:33:48 pm UTC+6, [email protected] wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.
Another area we need to explore is the usage of dictionaries, actually. Maybe adding millions of words into the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3, [email protected] wrote:

How is your training going for Bengali? Mine was nearly good, but I faced spacing problems between words: some words have spaces, but most of them have none. I think the problem is in the dataset, but I used the default Bengali training dataset from Tesseract, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Actually, training from scratch is harder than fine-tuning, so you can use different datasets to explore. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September 2023 at 1:13:43 pm UTC+6, [email protected] wrote:

How is your training going for Bengali?
I have been trying to train from scratch.
I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run the OCR using the .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3, [email protected] wrote:

Yes, he doesn't mention all fonts, only one font. That is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files which are created under MODEL_NAME (I mean the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every filename starts with MODEL_NAME. This MODEL_NAME is what we selected in the training script for looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September 2023 at 9:42:13 pm UTC+6, [email protected] wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font, following Garcia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant, and very useful. It is all working for me.
The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.

The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train on the fonts (even if the fonts have been included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3, [email protected] wrote:

import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This is the training command, which I have named 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders.
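Based on that description, the layout would look roughly like this (the tree is illustrative, assembled only from the paths and file names mentioned in this thread):

```
training-root/
├── langdata/                   # ben.training_text, ben.unicharset
├── tesseract/
│   └── tessdata/               # start models (.traineddata)
├── tesstrain/
│   ├── tesseract_training.py   # the training loop above
│   └── data/
│       └── ben-ground-truth/   # generated .tif, .gt.txt, .box files
├── split_training_text.py
└── FontList.py
```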
3. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training, and the FontList.py file is in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file, like mine.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the tesstrain folder expanded.

[image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September 2023 at 12:50:03 pm UTC+6, [email protected] wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question about the other script that you use to train.
import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3, [email protected] wrote:

You can use the new script below; it's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else happens) while creating the tif, gt.txt, and .box files, you can use the checkpoint to resume from where VS Code closed.
Command for the tif, gt.txt, and .box files:

import os
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this:
class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the code above.

For the breakpoint, the command is:

sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint values --start 0 --end 11 according to your progress.

And the training checkpoint works as you know already.

On Monday, 11 September 2023 at 1:22:34 am UTC+6, [email protected] wrote:

Hi mhalidu,
the script you posted here seems much more extensive than the one you posted before:
https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instruction on how to train for multiple fonts. The official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3, [email protected] wrote:

OK, I will try as you said. One more thing: what should the training_text lines look like? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the words per line?

On Thursday, 10 August 2023 at 1:10:14 am UTC+6, shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, and I used all the official training_text, tessdata_best, and the other things as well. Everything is good, but the problem is that the default fonts that were trained before no longer convert text as they did previously, while my new fonts work well. I don't understand why it's happening.
I share the code below so you can understand what is going on.

Code for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Set the starting line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(
                output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename
            file_base_name = f'ben_{line_serial}'  # Unique filename for each font
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count in the file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)

Any suggestions to help identify the problem?
Thanks, everyone.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7efa6de5-980f-422d-a5da-54b16a35ff26n%40googlegroups.com.

