I will try some changes. thx

On Thursday, 14 September 2023 at 2:46:36 pm UTC+6, [email protected] wrote:
I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by changing "r" to "utf8" in the script. I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <[email protected]> wrote:

Have you faced this error, "Can't encode transcription"? If you have, how did you solve it?

On Thursday, 14 September 2023 at 10:51:52 am UTC+6, [email protected] wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <[email protected]> wrote:

Are you training from Tesseract's default text data or your own collected text data?

On Thursday, 14 September 2023 at 12:19:53 am UTC+6, [email protected] wrote:

I have now got to 200,000 iterations, and the error rate is stuck at 0.46. The result is absolute trash: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3, [email protected] wrote:

After Tesseract recognizes text from images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September 2023 at 5:06:12 pm UTC+6, [email protected] wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf

> EasyOCR, I think, is best for ID cards and similar images; but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3, [email protected] wrote:

I know what you mean, but in some cases it helps me.
I have faced specific characters and words that are consistently not recognized by Tesseract, so I use these regexes to replace those characters and words when they come out wrong.

See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
"জ্া": "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
"য": "য",
" া": "া",
"আা": "আ",
"ম্ি": "মি",
"স্ু": "সু",
"হূ ": "হূ",
" ণ": "ণ",
"র্্": "র",
"চিন্ত ": "চিন্তা ",
"ন্া": "না",
"সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September 2023 at 4:18:22 pm UTC+6, [email protected] wrote:

The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original training of the English data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases, it replaces it with closely similar letters such as /v/ and /w/; in other cases, it completely removes it. That is what is happening in my case: those characters are sometimes completely removed; other times, they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3, [email protected] wrote:

If some specific characters or words are always missing from the OCR result, then you can apply logic with the regular-expression method in your application.
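A replacement table like the one above can be applied as a simple post-processing pass over the OCR output. A minimal sketch (the mapping entries are taken from the list above; `OCR_FIXES` and `apply_fixes` are illustrative names):

```python
# Minimal post-OCR replacement pass. The mapping entries are examples
# taken from the list above; extend the dict with your own pairs.
OCR_FIXES = {
    "জ্া": "জা",
    "আা": "আ",
    "সম ূর্ন": "সম্পূর্ণ",
}

def apply_fixes(text: str) -> str:
    # Plain substring replacement is enough for fixed strings;
    # re.sub would only be needed for pattern-based rules.
    for wrong, right in OCR_FIXES.items():
        text = text.replace(wrong, right)
    return text
```

Run it over each OCR output line before writing the final text.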
After OCR, these specific characters or words will be replaced by the correct characters or words that you defined in your application with regular expressions. It can solve some major problems.

On Wednesday, 13 September 2023 at 3:51:29 pm UTC+6, [email protected] wrote:

The characters are getting missed even after fine-tuning. I never made any progress. I tried many different ways; some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3, [email protected] wrote:

EasyOCR, I think, is best for ID cards and similar images; but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added words to the dictionaries, but the result is the same.

What kind of problem did you face when fine-tuning for a few new characters, as you said ("but I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September 2023 at 3:33:48 pm UTC+6, [email protected] wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.
Another area we need to explore is the usage of dictionaries, actually. Maybe adding millions of words into the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3, [email protected] wrote:

How is your training going for Bengali? Mine was nearly good, but I faced spacing problems between words: some words have spaces, but most of them have none. I think the problem is in the dataset, but I used the default Bengali training dataset from Tesseract, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Actually, training from scratch is harder than fine-tuning, so you can use different datasets to explore. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September 2023 at 1:13:43 pm UTC+6, [email protected] wrote:

How is your training going for Bengali?
I have been trying to train from scratch.
I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run the OCR using the .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3, [email protected] wrote:

Yes, he doesn't mention all fonts, only one font. That is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the tif, gt.txt, and .box files which are created under MODEL_NAME (I mean the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every filename starts with MODEL_NAME. This MODEL_NAME is what we selected in the training script for looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September 2023 at 9:42:13 pm UTC+6, [email protected] wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font, following Garcia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant, and very useful. It is all working for me.
The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.

The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train on the fonts (even if the fonts have been included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3, [email protected] wrote:

import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This is the training command, which I have named 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders.
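Based on that description, the layout would look roughly like this (the tree is illustrative, assembled only from the paths and file names mentioned in this thread):

```
training-root/
├── langdata/                   # ben.training_text, ben.unicharset
├── tesseract/
│   └── tessdata/               # start models (.traineddata)
├── tesstrain/
│   ├── tesseract_training.py   # the training loop above
│   └── data/
│       └── ben-ground-truth/   # generated .tif, .gt.txt, .box files
├── split_training_text.py
└── FontList.py
```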
3. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training, and the FontList.py file is in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file, like mine.

I have attached two pictures of my folder structure: the first is the main structure, and the second is the tesstrain folder expanded.

[image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September 2023 at 12:50:03 pm UTC+6, [email protected] wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question about the other script that you use to train.
import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3, [email protected] wrote:

You can use the new script below; it's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else happens) while creating the tif, gt.txt, and .box files, you can use the checkpoint to resume from where VS Code closed.
Command for the tif, gt.txt, and .box files:

import os
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this:
class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the code above.

For the breakpoint, the command is:

sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint values --start 0 --end 11 according to your progress.

And the training checkpoint works as you know already.

On Monday, 11 September 2023 at 1:22:34 am UTC+6, [email protected] wrote:

Hi mhalidu,
the script you posted here seems much more extensive than the one you posted before:
https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instruction on how to train for multiple fonts. The official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3, [email protected] wrote:

OK, I will try as you said. One more thing: what should the training_text lines look like? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the words per line?

On Thursday, 10 August 2023 at 1:10:14 am UTC+6, shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, and I used all the official training_text, tessdata_best, and the other things as well. Everything is good, but the problem is that the default fonts that were trained before no longer convert text as they did previously, while my new fonts work well. I don't understand why it's happening.
I share the code below so you can understand what is going on.

Code for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Set the starting line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(
                output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename
            file_base_name = f'ben_{line_serial}'  # Unique filename for each font
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count in the file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)

Any suggestions to help identify the problem?
Thanks, everyone.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7efa6de5-980f-422d-a5da-54b16a35ff26n%40googlegroups.com.

