Hi, I'm new to Tesseract and OCR in general, and I need some help training 
Tesseract.

Config
Platform: Mac OS X 10.13.3
Tesseract Version: 4.0.0-beta.1
leptonica: 1.75.3
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

Images used

kor.AppleMyungjo.exp1.tif

<https://lh3.googleusercontent.com/-HfEWwZudKjE/WsUtd-CH2iI/AAAAAAAAHig/u_gQpXArU4cU4jREJJegB2dIjo3tqv3lwCLcBGAs/s1600/kor.AppleMyungjo.exp1.tif>


kor.AppleMyungjo.exp0.tif

<https://lh3.googleusercontent.com/-OGn-qgzxBgE/WsUr2NKqeBI/AAAAAAAAHiQ/aZ7PnPiB7qwHvyXTGb-wHVyGJ4Gs-N9GwCLcBGAs/s1600/kor.AppleMyungjo.exp0.tif>


Step by step
I'm trying to fine-tune Tesseract to better detect commas (") and dots (.) 
in Korean, but I'm getting some errors. Here is what I have done so far:

1 - Got the images. I'm using 2 .tif images (both images have only 1 line 
and a few characters)
2 - Renamed the images to kor.AppleMyungjo.exp0.tif and 
kor.AppleMyungjo.exp1.tif
3 - Created the .box file for each image with ```tesseract 
[language].[fontname].exp[samplenumber].tif 
[language].[fontname].exp[samplenumber] -l [language] batch.nochop 
makebox``` (one of them came out empty)
4 - Corrected the .box files using the site 
https://pp19dd.com/tesseract-ocr-chopper/ (I just pasted the positions into 
the file)
5 - Created the .tr file for each image with ```tesseract 
kor.AppleMyungjo.exp0.tif kor.AppleMyungjo.exp0 -l kor box.train``` (both 
images produced an empty .tr file)
6 - Created the unicharset file with ```unicharset_extractor [box file 0] [box 
file 1]...```
7 - Created the font_properties file, which contains only ```AppleMyungjo 0 0 1 0 0```
8 - Cloned the tesseract repo to my Mac, path ```~/projects/tesseract```
9 - Cloned the langdata repo to my Mac, path ```~/projects/langdata```
10 - Found the folder where brew installed my tesseract, path 
```/usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata```
11 - Ran the ```~/projects/tesseract/training/tesstrain.sh``` script:


```
sudo ~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts  \
  --lang kor \
  --linedata_only  \
  --noextract_font_properties  \
  --exposures "0"    \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir /usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata \
  --output_dir ~/tesstutorial/kor \
  --fontlist "AppleMyungjo"
```
and got the error:
```
=== Starting training for language 'kor'
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix
[Wed Apr 4 13:26:24 -03 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=AppleMyungjo --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=
Fontconfig error: Cannot load default config file

=== Phase I: Generating training images ===
Rendering using AppleMyungjo
[Wed Apr 4 13:26:25 -03 2018] /usr/local/bin/text2image --fontconfig_tmpdir= --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0 --max_pages=3 --font=AppleMyungjo --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
Fontconfig error: Cannot load default config file
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
```

I found that the ```Fontconfig error: Cannot load default config file``` 
was caused by mktemp on the Mac, which rejects the ```--tmpdir``` option 
(see the ```mktemp: illegal option -- -``` line above). I fixed it by 
changing this line:

training/tesstrain_utils.sh
```diff
- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
```
After running the same command again, I get:

```
=== Starting training for language 'kor'
[Wed Apr 4 14:13:38 -03 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=AppleMyungjo --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt --text=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs

=== Phase I: Generating training images ===
Rendering using AppleMyungjo
[Wed Apr 4 14:13:40 -03 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0 --max_pages=3 --font=AppleMyungjo --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
```

So I'm stuck on these 2 errors. I do have a kor.AppleMyungjo.exp0.box file 
in the folder where I'm running the command, ```~/projects/ocr/trainning/```, 
but the script looks for it in its own temporary folder. What can I do to 
make it work?
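In case it helps with reproducing this, below is a small sketch of how the missing or empty .box and .tr files from steps 3 and 5 can be spotted. The function name `check_nonempty` is made up for illustration; it only uses the standard `-e` and `-s` file tests.

```shell
# Hypothetical helper: report which of the given files are missing or empty.
# Returns 0 only if every file exists and has a size greater than zero.
check_nonempty() {
  status=0
  for f in "$@"; do
    if [ ! -e "$f" ]; then
      echo "missing: $f"
      status=1
    elif [ ! -s "$f" ]; then
      echo "empty: $f"
      status=1
    else
      echo "ok: $f"
    fi
  done
  return "$status"
}

# Example call (run from the folder that holds the training files):
#   check_nonempty kor.AppleMyungjo.exp0.box kor.AppleMyungjo.exp0.tr
```

This only tells me which files are unusable; it does not explain why tesstrain.sh cannot find the .box file in its temporary folder.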


Thanks for reading all this text and for your time
