I also faced that issue on Windows. Apparently, the issue is related to Unicode: you can try your luck by changing the plain "r" file mode so the script opens the file with "utf8" encoding. I ended up installing Ubuntu because I was having too many errors on Windows.
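For reference, the change being suggested is to pass an explicit encoding when the script opens the training text, instead of relying on Windows' default code page. A minimal sketch, assuming a UTF-8 training text (the file name here is made up):

```python
import os
import tempfile

def read_lines_utf8(path):
    # A bare open(path, 'r') uses the platform default encoding (often
    # cp1252 on Windows), which cannot decode Bengali text; request
    # UTF-8 explicitly so a bad file fails loudly instead of silently.
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

# Demonstration with a throwaway file containing non-ASCII text.
path = os.path.join(tempfile.mkdtemp(), 'sample.training_text')
with open(path, 'w', encoding='utf-8') as f:
    f.write('বাংলা লাইন\n')

print(read_lines_utf8(path))  # ['বাংলা লাইন']
```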
On Thu, Sep 14, 2023, 9:33 AM Ali hussain <[email protected]> wrote:

Have you faced this error, "Can't encode transcription"? If you have, how did you solve it?

On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 [email protected] wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <[email protected]> wrote:

Are you training from Tesseract's default text data or from your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 [email protected] wrote:

I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 [email protected] wrote:

After Tesseract recognizes text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR, or with ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 [email protected] wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) -> ScanTailor -> Tesseract -> pdf.

> EasyOCR I think is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 [email protected] wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently not recognized by Tesseract, so I use these regex rules to replace those characters and words when they come out wrong.
See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
জ্া: "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
য: "য",
" া": "া",
আা: "আ",
ম্ি: "মি",
স্ু: "সু",
"হূ ": "হূ",
" ণ": "ণ",
র্্: "র",
"চিন্ত ": "চিন্তা ",
ন্া: "না",
"সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 [email protected] wrote:

The problem with regex is that Tesseract is not consistent in its substitutions. Suppose the original English training data didn't contain the letter /u/: what does Tesseract do when it encounters /u/ in actual processing? In some cases it replaces it with closely similar letters such as /v/ and /w/; in other cases it removes it completely. That is what is happening in my case: those characters are sometimes removed entirely, and other times replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 [email protected] wrote:

If some specific characters or words are always missing from the OCR result, you can apply regular-expression logic in your application: after OCR, those specific characters or words are replaced with the correct ones you defined. That can solve some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 [email protected] wrote:

The characters are still getting missed, even after fine-tuning. I never made any progress.
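For what it's worth, a substitution table like the one above can be applied to the OCR output with a few lines of Python. A minimal sketch, using made-up Latin substitutions in place of the Bengali pairs (re.escape protects keys that contain regex metacharacters):

```python
import re

# Hypothetical post-OCR fixups; in practice the keys would be the
# misrecognized Bengali sequences listed in the table above.
SUBSTITUTIONS = {
    "rn": "m",      # OCR sometimes splits 'm' into 'r' + 'n'
    "teh": "the",
}

# One compiled alternation; longer keys come first so longer matches win.
_pattern = re.compile("|".join(
    re.escape(key) for key in sorted(SUBSTITUTIONS, key=len, reverse=True)))

def fix_ocr(text):
    # Each match is looked up in the table and replaced in one pass.
    return _pattern.sub(lambda m: SUBSTITUTIONS[m.group(0)], text)

print(fix_ocr("teh rnodel"))  # the model
```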
I tried many different ways, and some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 [email protected] wrote:

EasyOCR, I think, is best for ID cards or similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning in a few new characters, as you said ("but I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 [email protected] wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try.
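One way to make "specific characters are always missing" measurable is to diff the character sets of a ground-truth line and the matching OCR output: whatever appears in the first but not the second was dropped or substituted. A minimal, self-contained sketch with invented strings (in practice the inputs would be a .gt.txt line and the corresponding OCR text):

```python
def missing_chars(ground_truth, ocr_output):
    """Characters present in the ground truth but absent from the OCR result."""
    return sorted(set(ground_truth) - set(ocr_output))

# Invented example: the OCR output dropped every 'u'.
gt = "quick brown fox"
ocr = "qick brown fox"
print(missing_chars(gt, ocr))  # ['u']
```

Run over a whole ground-truth set, a tally of these diffs shows whether the same characters disappear consistently, which is the pattern being described above.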
Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 [email protected] wrote:

> How is your training going for Bengali?

It was nearly good, but I ran into spacing problems between words: some words have a space between them, but most have none. I think the problem is in the dataset, but I used the default training dataset from Tesseract for Bengali, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Training from scratch is actually harder than fine-tuning, so you can explore different datasets. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 [email protected] wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping for reasonable accuracy; unfortunately, when I run OCR with the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 [email protected] wrote:

Yes, he doesn't mention all the fonts, only one font.
That is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we teach all the tif, gt.txt, and .box files that were created under MODEL_NAME (I mean the eng, ben, or oro flag or language code), because when we first create the tif, gt.txt, and .box files, every filename starts with MODEL_NAME. That MODEL_NAME is what we select in the training script for looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 [email protected] wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have done a number of fine-tuning runs with a single font following Gracia's video, but your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful, and it is all working for me.

The only part I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.

The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 [email protected] wrote:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This is the training command; I have named the script 'tesseract_training.py' and put it inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; FontList.py sits in the main path alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts into your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file, as I did.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the expanded tesstrain folder.

[image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 [email protected] wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question about the other script that you use to train:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 [email protected] wrote:

You can use the new script below; it's better than the previous two scripts.
You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else interrupts it) while the tif, gt.txt, and .box files are being created, you can use the checkpoint to resume from where it stopped.

The script for the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this:
class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",  # a comma was missing here; without it Python joins this with the next string
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the code above.

For the breakpoint command:

sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range with --start 0 --end 11 as needed. And the training checkpoint works as you already know.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 [email protected] wrote:

Hi mhalidu,
The script you posted here seems much more extensive than the one you posted before:
https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep.
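One stumbling block with a font list like this: each entry must match a family name that fontconfig actually reports (e.g. via `fc-list : family`), spaces and capitalization included, or text2image will not find the font. A small sketch of a pre-flight check, using a made-up fc-list output so it is self-contained:

```python
# A real check would capture the output of
#   subprocess.run(['fc-list', ':', 'family'], capture_output=True, text=True)
# this invented sample stands in for it here.
FC_LIST_OUTPUT = """\
Sagar Medium
Ekushey Lohit Normal
DejaVu Sans,DejaVu Sans Book
"""

def unknown_fonts(wanted, fc_list_output):
    # fc-list can print several comma-separated aliases per line.
    installed = set()
    for line in fc_list_output.splitlines():
        installed.update(name.strip() for name in line.split(','))
    return [font for font in wanted if font not in installed]

print(unknown_fonts(["Sagar Medium", "Gerlick"], FC_LIST_OUTPUT))  # ['Gerlick']
```

Running such a check before generating thousands of line images catches a misspelled font name early instead of producing blank or fallback-font pages.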
I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 [email protected] wrote:

OK, I will try as you said.
One more thing: what role do the training_text lines play? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and everything else. Everything is good, except that the default fonts that were trained before no longer convert text as well as they used to, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can see what is going on.
Code for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Set the starting line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(
                output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename
            file_base_name = f'ben_{line_serial}'  # Unique filename for each font
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count in the file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)
Any suggestions for identifying the problem? Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.

