I also faced that issue on Windows. Apparently, the issue is related to Unicode: you can try your luck by changing the plain "r" file mode so the script opens the file with "utf8" encoding. I ended up installing Ubuntu because I was having too many errors on Windows.
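For reference, the change being suggested is to pass an explicit encoding when the script opens the training text, instead of relying on Windows' default code page. A minimal sketch, assuming a UTF-8 training text (the file name here is made up):

```python
import os
import tempfile

def read_lines_utf8(path):
    # A bare open(path, 'r') uses the platform default encoding (often
    # cp1252 on Windows), which cannot decode Bengali text; request
    # UTF-8 explicitly so a bad file fails loudly instead of silently.
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

# Demonstration with a throwaway file containing non-ASCII text.
path = os.path.join(tempfile.mkdtemp(), 'sample.training_text')
with open(path, 'w', encoding='utf-8') as f:
    f.write('বাংলা লাইন\n')

print(read_lines_utf8(path))  # ['বাংলা লাইন']
```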
On Thu, Sep 14, 2023, 9:33 AM Ali hussain <[email protected]> wrote:

Have you faced this error, "Can't encode transcription"? If you have, how did you solve it?

On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 [email protected] wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <[email protected]> wrote:

Are you training from Tesseract's default text data or from your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 [email protected] wrote:

I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 [email protected] wrote:

After Tesseract recognizes text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR, or with ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 [email protected] wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) -> ScanTailor -> Tesseract -> pdf.

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 [email protected] wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently not recognized by Tesseract, so I use these regex rules to replace those characters and words when they come out wrong.
See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
জ্া: "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
য: "য",
" া": "া",
আা: "আ",
ম্ি: "মি",
স্ু: "সু",
"হূ ": "হূ",
" ণ": "ণ",
র্্: "র",
"চিন্ত ": "চিন্তা ",
ন্া: "না",
"সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 [email protected] wrote:

The problem with regex is that Tesseract is not consistent in its substitutions. Suppose the original English training data didn't contain the letter /u/: what does Tesseract do when it encounters /u/ in actual processing? In some cases it replaces it with closely similar letters such as /v/ and /w/; in other cases it removes it completely. That is what is happening in my case: those characters are sometimes removed entirely, and other times replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 [email protected] wrote:

If some specific characters or words are always missing from the OCR result, you can apply regular-expression logic in your application: after OCR, those specific characters or words are replaced with the correct ones you defined. That can solve some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 [email protected] wrote:

The characters are still getting missed, even after fine-tuning. I never made any progress.
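For what it's worth, a substitution table like the one above can be applied to the OCR output with a few lines of Python. A minimal sketch, using made-up Latin substitutions in place of the Bengali pairs (re.escape protects keys that contain regex metacharacters):

```python
import re

# Hypothetical post-OCR fixups; in practice the keys would be the
# misrecognized Bengali sequences listed in the table above.
SUBSTITUTIONS = {
    "rn": "m",      # OCR sometimes splits 'm' into 'r' + 'n'
    "teh": "the",
}

# One compiled alternation; longer keys come first so longer matches win.
_pattern = re.compile("|".join(
    re.escape(key) for key in sorted(SUBSTITUTIONS, key=len, reverse=True)))

def fix_ocr(text):
    # Each match is looked up in the table and replaced in one pass.
    return _pattern.sub(lambda m: SUBSTITUTIONS[m.group(0)], text)

print(fix_ocr("teh rnodel"))  # the model
```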
I tried many different ways, and some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 [email protected] wrote:

EasyOCR, I think, is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning in a few new characters, as you said ("but I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 [email protected] wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try.
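One way to make "specific characters are always missing" measurable is to diff the character sets of a ground-truth line and the matching OCR output: whatever appears in the first but not the second was dropped or substituted. A minimal, self-contained sketch with invented strings (in practice the inputs would be a .gt.txt line and the corresponding OCR text):

```python
def missing_chars(ground_truth, ocr_output):
    """Characters present in the ground truth but absent from the OCR result."""
    return sorted(set(ground_truth) - set(ocr_output))

# Invented example: the OCR output dropped every 'u'.
gt = "quick brown fox"
ocr = "qick brown fox"
print(missing_chars(gt, ocr))  # ['u']
```

Run over a whole ground-truth set, a tally of these diffs shows whether the same characters disappear consistently, which is the pattern being described above.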
Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 [email protected] wrote:

> How is your training going for Bengali?

It was nearly good, but I ran into spacing problems between words: some words have a space between them, but most have none. I think the problem is in the dataset, but I used the default training dataset from Tesseract for Bengali, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can explore different datasets. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 [email protected] wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping for reasonable accuracy; unfortunately, when I run OCR with the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 [email protected] wrote:

Yes, he doesn't mention all the fonts, only one font.
That is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we teach all the tif, gt.txt, and .box files that were created under MODEL_NAME (I mean the eng, ben, or oro flag or language code), because when we first create the tif, gt.txt, and .box files, every filename starts with MODEL_NAME. That MODEL_NAME is what we select in the training script for looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 [email protected] wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have done a number of fine-tuning runs with a single font following Gracia's video, but your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful, and it is all working for me.

The only part I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.

The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 [email protected] wrote:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This is the training command; I have named the script 'tesseract_training.py' and put it inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; FontList.py sits in the main path alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts into your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file, as I did.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the expanded tesstrain folder.

[image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 [email protected] wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question about the other script that you use to train:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 [email protected] wrote:

You can use the new script below; it's better than the previous two scripts.
You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else interrupts it) while the tif, gt.txt, and .box files are being created, you can use the checkpoint to resume from where it stopped.

The script for the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this:
class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",  # a comma was missing here; without it Python joins this with the next string
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the code above.

For the breakpoint command:

sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range with --start 0 --end 11 as needed. And the training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 [email protected] wrote:

Hi mhalidu,
The script you posted here seems much more extensive than the one you posted before:
https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep.
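One stumbling block with a font list like this: each entry must match a family name that fontconfig actually reports (e.g. via `fc-list : family`), spaces and capitalization included, or text2image will not find the font. A small sketch of a pre-flight check, using a made-up fc-list output so it is self-contained:

```python
# A real check would capture the output of
#   subprocess.run(['fc-list', ':', 'family'], capture_output=True, text=True)
# this invented sample stands in for it here.
FC_LIST_OUTPUT = """\
Sagar Medium
Ekushey Lohit Normal
DejaVu Sans,DejaVu Sans Book
"""

def unknown_fonts(wanted, fc_list_output):
    # fc-list can print several comma-separated aliases per line.
    installed = set()
    for line in fc_list_output.splitlines():
        installed.update(name.strip() for name in line.split(','))
    return [font for font in wanted if font not in installed]

print(unknown_fonts(["Sagar Medium", "Gerlick"], FC_LIST_OUTPUT))  # ['Gerlick']
```

Running such a check before generating thousands of line images catches a misspelled font name early instead of producing blank or fallback-font pages.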
I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 [email protected] wrote:

OK, I will try as you said.
One more thing: what role do the training_text lines play? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and everything else. Everything is good, except that the default fonts that were trained before no longer convert text as well as they used to, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can see what is going on.
Code for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Set the starting line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(
                output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename
            file_base_name = f'ben_{line_serial}'  # Unique filename for each font
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count in the file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)
Any suggestions for identifying the problem? Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.

