Hi, I'm new to Tesseract and OCR in general, and I need some help training Tesseract.
Config

Platform: Mac OS X 10.13.3
Tesseract Version: 4.0.0-beta.1
leptonica: 1.75.3
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

Images used

kor.AppleMyungjo.exp1.tif <https://lh3.googleusercontent.com/-HfEWwZudKjE/WsUtd-CH2iI/AAAAAAAAHig/u_gQpXArU4cU4jREJJegB2dIjo3tqv3lwCLcBGAs/s1600/kor.AppleMyungjo.exp1.tif>
kor.AppleMyungjo.exp0.tif <https://lh3.googleusercontent.com/-OGn-qgzxBgE/WsUr2NKqeBI/AAAAAAAAHiQ/aZ7PnPiB7qwHvyXTGb-wHVyGJ4Gs-N9GwCLcBGAs/s1600/kor.AppleMyungjo.exp0.tif>

Step by step

I'm trying to train (fine-tune) Tesseract to better detect commas (") and dots (.) in Korean, but I'm getting some errors. Here is what I have done so far:

1 - Got the images. I'm using 2 .tif images (each image has only 1 line and a few characters).

2 - Renamed the images to kor.AppleMyungjo.exp0.tif and kor.AppleMyungjo.exp1.tif.

3 - Created the .box file for each image (one of them came out empty):
```
tesseract [language].[fontname].exp[samplenumber].tif [language].[fontname].exp[samplenumber] -l [language] batch.nochop makebox
```

4 - Corrected the .box files using the site https://pp19dd.com/tesseract-ocr-chopper/ (I just pasted the positions into the file).

5 - Created the .tr file for each image (both images produced an empty .tr file):
```
tesseract kor.AppleMyungjo.exp0.tif kor.AppleMyungjo.exp0 -l kor box.train
```

6 - Created the unicharset file:
```
unicharset_extractor [box file 0] [box file 1]...
```

7 - Created the font_properties file, which contains only ```AppleMyungjo 0 0 1 0 0```.

8 - Cloned the tesseract repo to my Mac, path ```~/projects/tesseract```.

9 - Cloned the langdata repo to my Mac, path ```~/projects/langdata```.

10 - Found the folder where brew installed my Tesseract data, path ```/usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata```.

11 - Executed the ```~/projects/tesseract/training/tesstrain.sh``` file:
```
sudo ~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang kor \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0" \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir /usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata \
  --output_dir ~/tesstutorial/kor \
  --fontlist "AppleMyungjo"
```

and got the error:
```
=== Starting training for language 'kor'
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix
[Wed Apr 4 13:26:24 -03 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=AppleMyungjo --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=
Fontconfig error: Cannot load default config file
=== Phase I: Generating training images ===
Rendering using AppleMyungjo
[Wed Apr 4 13:26:25 -03 2018] /usr/local/bin/text2image --fontconfig_tmpdir= --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0 --max_pages=3 --font=AppleMyungjo --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
Fontconfig error: Cannot load default config file
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
```

I found that the ```Fontconfig error: Cannot load default config file``` message was caused by the mktemp call failing on Mac. I fixed it by changing this line in training/tesstrain_utils.sh:
```diff
- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
```

After running the same command again I get:
```
=== Starting training for language 'kor'
[Wed Apr 4 14:13:38 -03 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=AppleMyungjo --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt --text=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs
=== Phase I: Generating training images ===
Rendering using AppleMyungjo
[Wed Apr 4 14:13:40 -03 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0 --max_pages=3 --font=AppleMyungjo --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
```

So I'm stuck on these 2 errors. I do have this file in the folder where I'm running the command, ```~/projects/ocr/trainning/```, but what can I do to make it work?

Thanks for reading all this text and for your time.
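
P.S. For anyone hitting the same mktemp error, here is a minimal sketch of a call that should work with both GNU and BSD/macOS mktemp. The helper name `make_tmp_dir` is made up for illustration and is not part of tesstrain_utils.sh. The idea is to pass a full template path instead of using `--tmpdir` or `-t`, because `-t` means different things in the two implementations. On macOS it treats its argument as a prefix, which is why the directory in my second log still contains the literal XXXXXXXXXX.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesstrain_utils.sh): create a temporary
# directory using a call form accepted by both GNU and BSD/macOS mktemp.
make_tmp_dir() {
    prefix="$1"
    # A full template path avoids the GNU-only --tmpdir option and the
    # differing meaning of -t; falls back to /tmp when TMPDIR is unset.
    mktemp -d "${TMPDIR:-/tmp}/${prefix}.XXXXXXXXXX"
}

# Same variable that tesstrain_utils.sh exports for the fontconfig cache.
FONT_CONFIG_CACHE=$(make_tmp_dir font_tmp)
export FONT_CONFIG_CACHE
echo "fontconfig cache dir: ${FONT_CONFIG_CACHE}"
```

This only addresses the temporary directory part. It does not explain the missing .box file, which is still my open question.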