Re: [tesseract-ocr] Training a non-Unicode font with tesseract

2018-04-04 Thread ShreeDevi Kumar
Training tesseract is supported only with Unicode fonts.
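
As a rough illustration of what this means in practice: legacy (non-Unicode) fonts often place their glyphs on Latin code points, so text typed for them is stored as plain ASCII or Latin-1 bytes rather than as the code points of the actual script. One quick sanity check on a training text file is to count the bytes outside the ASCII range. The function name `count_non_ascii_bytes` and the way the result is interpreted are assumptions made for this sketch; nothing here is provided by tesseract itself.

```shell
#!/usr/bin/env bash
# Count the bytes in a file that fall outside the 7-bit ASCII range.
# UTF-8 text in a non-Latin script (e.g. Gujarati or Korean) contains many
# such bytes; text typed for a legacy font that reuses ASCII code points
# contains few or none. This is only a heuristic, not a validity check.
count_non_ascii_bytes() {
  # Delete every byte from NUL (octal 000) to DEL (octal 177), count the rest.
  LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d '[:space:]'
}
```

If the count for a training text is 0, the text is pure ASCII and is probably not real Unicode text in the target script.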

On Thu 5 Apr, 2018, 12:25 AM gopal bhalala wrote:

> Hi, I am new to tesseract-ocr. I want to train a non-Unicode font using
> tesseract. I tried to train it with jTessBoxEditor, but did not have any
> success.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/dc1825db-ef94-4bfd-bb3e-9e98d11faf07%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>



Re: [tesseract-ocr] Error at training 4.0

2018-04-04 Thread ShreeDevi Kumar
Training tesseract 4.0.0 is different from the process for 3.0x.

Training using images is not supported for tesseract 4.0.0.

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
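
For orientation, the fine-tuning flow described on that wiki page can be sketched roughly as below. This is a hedged outline under stated assumptions, not a tested recipe: the `TESSDATA` path, the output names, the `run` dry-run helper and the iteration count are placeholders chosen for illustration, while the tool names and flags are the ones used in the wiki's fine-tuning example. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Dry-run sketch of fine-tuning the Korean LSTM model for tesseract 4.0.
# All paths are placeholders; adjust them to your own setup.
DRY_RUN="${DRY_RUN:-1}"
OUT="${OUT:-$HOME/tesstutorial/kor}"
# Hypothetical location; the wiki example fine-tunes from the float models
# in tessdata_best, so kor.traineddata should come from there.
TESSDATA="${TESSDATA:-/path/to/tessdata_best}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

finetune_kor() {
  # 1. Render training lines from langdata text and a font (no scanned images).
  run "$HOME/projects/tesseract/training/tesstrain.sh" \
    --fonts_dir /Library/Fonts --lang kor --linedata_only \
    --noextract_font_properties --langdata_dir "$HOME/projects/langdata" \
    --tessdata_dir "$TESSDATA" --output_dir "$OUT" --fontlist "AppleMyungjo"
  # 2. Extract the LSTM component from the existing traineddata.
  run combine_tessdata -e "$TESSDATA/kor.traineddata" "$OUT/kor.lstm"
  # 3. Continue training from that model on the rendered lines.
  run lstmtraining --model_output "$OUT/kor_tuned" \
    --continue_from "$OUT/kor.lstm" \
    --traineddata "$TESSDATA/kor.traineddata" \
    --train_listfile "$OUT/kor.training_files.txt" \
    --max_iterations 400
}
```

Calling `finetune_kor` prints the three commands for review; setting `DRY_RUN=0` first would execute them instead.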

On Thu 5 Apr, 2018, 1:36 AM Fanatico wrote:

> Hi, I'm new to tesseract and ocr in general, and need some help to train
> my tesseract.
>
> Config
> Platform: Mac OS X 10.13.3
> Tesseract Version: 4.0.0-beta.1
> leptonica: 1.75.3
>   libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
>
> images used
>
> kor.AppleMyungjo.exp1.tif
> kor.AppleMyungjo.exp0.tif
>
> Step by step
> I'm trying to train (fine-tune) my tesseract to better detect commas (")
> and dots (.) in Korean, but I'm getting some errors. Here is what I have
> done so far:
>
> 1 - Got the images: I'm using 2 .tif images (both images have only 1 line
> and a few characters)
> 2 - Renamed the images to kor.AppleMyungjo.exp0.tif and
> kor.AppleMyungjo.exp1.tif
> 3 - Created the .box file for each image ```tesseract
> [language].[fontname].exp[samplenumber].tif
> [language].[fontname].exp[samplenumber] -l [language] batch.nochop
> makebox``` (one of them came out empty)
> 4 - Corrected the .box files using the site
> https://pp19dd.com/tesseract-ocr-chopper/ (I just pasted the positions
> into the file)
> 5 - Created the .tr files for each image ```tesseract
> kor.AppleMyungjo.exp0.tif kor.AppleMyungjo.exp0 -l kor box.train``` (both
> images produced an empty .tr file)
> 6 - Created the unicharset file ```unicharset_extractor [box file 0] [box
> file 1]...```
> 7 - Created the font_properties file, which contains only
> ```AppleMyungjo 0 0 1 0 0```
> 8 - Cloned the tesseract repo to my Mac, path ```~/projects/tesseract```
> 9 - Cloned the langdata repo to my Mac, path ```~/projects/langdata```
> 10 - Found the folder where brew installed my tesseract, path
> ```/usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata```
> 11 - Executed the ```~/projects/tesseract/training/tesstrain.sh``` file
>
>
> ```
> sudo ~/projects/tesseract/training/tesstrain.sh \
>   --fonts_dir /Library/Fonts  \
>   --lang kor \
>   --linedata_only  \
>   --noextract_font_properties  \
>   --exposures "0"\
>   --langdata_dir ~/projects/langdata \
>   --tessdata_dir /usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata \
>   --output_dir ~/tesstutorial/kor \
>   --fontlist "AppleMyungjo"
> ```
> and got the error:
> ```
> === Starting training for language 'kor'
> mktemp: illegal option -- -
> usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
>    mktemp [-d] [-q] [-u] -t prefix
> [Wed Apr 4 13:26:24 -03 2018] /usr/local/bin/text2image
> --fonts_dir=/Library/Fonts --font=AppleMyungjo
> --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=
> Fontconfig error: Cannot load default config file
>
> === Phase I: Generating training images ===
> Rendering using AppleMyungjo
> [Wed Apr 4 13:26:25 -03 2018] /usr/local/bin/text2image
> --fontconfig_tmpdir= --fonts_dir=/Library/Fonts --strip_unrenderable_words
> --leading=32 --char_spacing=0.0 --exposure=0
> --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0
> --max_pages=3 --font=AppleMyungjo
> --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
> Fontconfig error: Cannot load default config file
> ERROR:
> /var/folders/zz/zyxvpxvq6csfxvn_n0/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box
> does not exist or is not readable
> ERROR:
> /var/folders/zz/zyxvpxvq6csfxvn_n0/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box
> does not exist or is not readable
> ```
>
> I found that the ```Fontconfig error: Cannot load default config file```
> error was caused by mktemp on the Mac. I fixed it by replacing this code:
>
> training/tesstrain_utils.sh
> ```diff
> - export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XX)
> + export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XX)
> ```
> After re-running the same tesstrain.sh command I get:
>
> ```
> === Starting training for language 'kor'
> [Wed Apr 4 14:13:38 -03 2018] /usr/local/bin/text2image
> --fonts_dir=/Library/Fonts --font=AppleMyungjo
> --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/font_tmp.XX.X52wexDs/sample_text.txt
> --text=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/font_tmp.XX.X52wexDs/sample_text.txt
> --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/font_tmp.XX.X52wexDs
>
> === Phase I: Generating training images ===
> Rendering using AppleMyungjo
> [Wed Apr 4 14:13:40 -03 2018] /usr/local/bin/text2image
> --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/font_tmp.XX.X52wexDs
> --fonts_d

[tesseract-ocr] Error at training 4.0

2018-04-04 Thread Fanatico
Hi, I'm new to tesseract and ocr in general, and need some help to train my 
tesseract.

Config
Platform: Mac OS X 10.13.3
Tesseract Version: 4.0.0-beta.1
leptonica: 1.75.3
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

images used

kor.AppleMyungjo.exp1.tif
kor.AppleMyungjo.exp0.tif

Step by step
I'm trying to train (fine-tune) my tesseract to better detect commas (")
and dots (.) in Korean, but I'm getting some errors. Here is what I have
done so far:

1 - Got the images: I'm using 2 .tif images (both images have only 1 line
and a few characters)
2 - Renamed the images to kor.AppleMyungjo.exp0.tif and
kor.AppleMyungjo.exp1.tif
3 - Created the .box file for each image ```tesseract
[language].[fontname].exp[samplenumber].tif
[language].[fontname].exp[samplenumber] -l [language] batch.nochop
makebox``` (one of them came out empty)
4 - Corrected the .box files using the site
https://pp19dd.com/tesseract-ocr-chopper/ (I just pasted the positions into
the file)
5 - Created the .tr files for each image ```tesseract
kor.AppleMyungjo.exp0.tif kor.AppleMyungjo.exp0 -l kor box.train``` (both
images produced an empty .tr file)
6 - Created the unicharset file ```unicharset_extractor [box file 0] [box
file 1]...```
7 - Created the font_properties file, which contains only
```AppleMyungjo 0 0 1 0 0```
8 - Cloned the tesseract repo to my Mac, path ```~/projects/tesseract```
9 - Cloned the langdata repo to my Mac, path ```~/projects/langdata```
10 - Found the folder where brew installed my tesseract, path
```/usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata```
11 - Executed the ```~/projects/tesseract/training/tesstrain.sh``` file


```
sudo ~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts  \
  --lang kor \
  --linedata_only  \
  --noextract_font_properties  \
  --exposures "0"\
  --langdata_dir ~/projects/langdata \
  --tessdata_dir /usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata \
  --output_dir ~/tesstutorial/kor \
  --fontlist "AppleMyungjo"
```
and got the error:
```
=== Starting training for language 'kor'
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
   mktemp [-d] [-q] [-u] -t prefix
[Wed Apr 4 13:26:24 -03 2018] /usr/local/bin/text2image 
--fonts_dir=/Library/Fonts --font=AppleMyungjo 
--outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=
Fontconfig error: Cannot load default config file

=== Phase I: Generating training images ===
Rendering using AppleMyungjo
[Wed Apr 4 13:26:25 -03 2018] /usr/local/bin/text2image 
--fontconfig_tmpdir= --fonts_dir=/Library/Fonts --strip_unrenderable_words 
--leading=32 --char_spacing=0.0 --exposure=0 
--outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0
--max_pages=3 --font=AppleMyungjo
--text=/Users/fernandogot/projects/langdata/kor/kor.training_text
Fontconfig error: Cannot load default config file
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
```

I found that the ```Fontconfig error: Cannot load default config file```
error was caused by mktemp on the Mac. I fixed it by replacing this code:

training/tesstrain_utils.sh
```diff
- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XX)
```
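
The `mktemp: illegal option -- -` message in the first log shows that the BSD `mktemp` shipped with macOS rejects the GNU-style `--tmpdir` option, so the command substitution returned an empty string (hence `--fontconfig_tmpdir=` and `--outputbase=/sample_text.txt`). A minimal sketch of a form that should work with both GNU and BSD `mktemp` is to pass a full template path instead of `--tmpdir` or `-t`. The function name `make_font_cache_dir` is made up for this example and is not part of tesstrain_utils.sh.

```shell
#!/usr/bin/env bash
# Create a temporary directory in a way both GNU and BSD (macOS) mktemp accept:
# give mktemp an explicit template path that ends in X characters.
make_font_cache_dir() {
  local base="${TMPDIR:-/tmp}"
  local dir
  # ${base%/} drops a trailing slash, which macOS's TMPDIR usually has.
  dir="$(mktemp -d "${base%/}/font_tmp.XXXXXX")" || return 1
  # Fail loudly instead of continuing with an empty or missing path.
  if [ -z "$dir" ] || [ ! -d "$dir" ]; then
    echo "could not create temporary directory" >&2
    return 1
  fi
  printf '%s\n' "$dir"
}
```

A caller could then write `FONT_CONFIG_CACHE="$(make_font_cache_dir)" || exit 1` followed by `export FONT_CONFIG_CACHE`, keeping the assignment separate from `export` so that a failure is not masked by the exit status of `export`.
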
After re-running the same tesstrain.sh command I get:

```
=== Starting training for language 'kor'
[Wed Apr 4 14:13:38 -03 2018] /usr/local/bin/text2image
--fonts_dir=/Library/Fonts --font=AppleMyungjo
--outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/font_tmp.XX.X52wexDs/sample_text.txt
--text=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/font_tmp.XX.X52wexDs/sample_text.txt
--fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/font_tmp.XX.X52wexDs

=== Phase I: Generating training images ===
Rendering using AppleMyungjo
[Wed Apr 4 14:13:40 -03 2018] /usr/local/bin/text2image
--fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/font_tmp.XX.X52wexDs
--fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32
--char_spacing=0.0 --exposure=0
--outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0
--max_pages=3 --font=AppleMyungjo
--text=/Users/fernandogot/projects/langdata/kor/kor.training_text
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box does not exist or is not

[tesseract-ocr] Training a non-Unicode font with tesseract

2018-04-04 Thread gopal bhalala
Hi, I am new to tesseract-ocr. I want to train a non-Unicode font using
tesseract. I tried to train it with jTessBoxEditor, but did not have any
success.



LMG-ARUN.TTF
Description: application/font-ttf