Re: [tesseract-ocr] Compute CTC targets failed while training

2018-09-26 Thread Khosrobeigy.zohreh
No, I always train from scratch.
The best and fast traineddata files do not recognize English and Persian,
and the accuracy is too low with some fonts.
I want to solve this problem.
Can fine-tuning use a different unicharset? As I read in the Tesseract
wiki, the unicharset size is the number of LSTM output classes. So if
Mr. Smith trained with, for example, 120 unichars, can I fine-tune with 160?
As far as I know, the number of classes in an LSTM cannot change.
All the English, Persian, and punctuation characters together come to
around 164.
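
One way to compare the two character sets before deciding, as a rough sketch only. It assumes the plain-text unicharset format, whose first line holds the number of entries; the function names and any file paths passed to them are made up for illustration.

```shell
# Print the entry count stored on the first line of a unicharset file.
# Assumption: the file is a plain-text unicharset whose first line is the count.
unicharset_size() {
  # $1 = path to a unicharset file
  head -n 1 "$1" | tr -d '[:space:]'
  echo
}

# Report whether two unicharset files declare the same number of entries.
compare_unicharsets() {
  # $1 = unicharset of the existing model, $2 = unicharset of the new data
  old_size=$(unicharset_size "$1")
  new_size=$(unicharset_size "$2")
  if [ "$old_size" -eq "$new_size" ]; then
    echo "same size: $old_size"
  else
    echo "different size: $old_size vs $new_size"
  fi
}
```

Calling compare_unicharsets with the unicharset extracted from the existing model and the one generated for the new training text shows at a glance whether the class counts (for example 120 versus 164) differ.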

 --
 You received this message because you are subscribed to the Google
 Groups "tesseract-ocr" group.
 To unsubscribe from this group and stop receiving emails from it, send
 an email to tesseract-ocr+unsubscr...@googlegroups.com.
 To post to this group, send email to tesseract-ocr@googlegroups.com.
 Visit this group at https://groups.google.com/group/tesseract-ocr.
 To view this discussion on the web visit
 https://groups.google.com/d/msgid/tesseract-ocr/04872dc6-7d92-4f95-9f65-8bb0cbf87c8c%40googlegroups.com.
 For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Compute CTC targets failed while training

2018-09-26 Thread Shree Devi Kumar
>By version alpha, I trained about 1000 line and it is not so bad

You must have only fine-tuned an existing model then, and now you are
trying to train from scratch.


Re: [tesseract-ocr] Compute CTC targets failed while training

2018-09-26 Thread Khosrobeigy.zohreh
I know; I actually know LSTMs well. I want to resolve all the errors first
and then train on a large text.
With the alpha version, I trained on about 1000 lines and the result was not
so bad. But with version beta.4 I get many errors.
In alpha, I used this config:
# Use LSTM
tessedit_ocr_engine_mode 1
tessedit_pageseg_mode 6

# Arabic page layout variables
segment_nonalphabetic_script 1

# Avoid dropping rows
textord_noise_rowratio 20.0
textord_noise_syfract 0.6

textord_min_linesize 2.5

# Avoid over-estimating intra-word spacing at both row and
# block levels when using old to method
tosp_old_to_method T
tosp_old_to_constrain_sp_kn T
tosp_old_sp_kn_th_factor 4.0

tosp_only_small_gaps_for_kern T
tosp_use_pre_chopping T
I used all of these, but now my model doesn't learn.
Has anything changed in beta.4, for example in text2image?



-- 
Zohreh Khosrobeygi
University of Tehran, 2016
Tel: +989196042887
khosrobeygi.zo...@ut.ac.ir 


Re: [tesseract-ocr] Compute CTC targets failed while training

2018-09-25 Thread Shree Devi Kumar
  --fontlist "Arial"

Does that have good coverage for Farsi?


--max_iterations 5000

You are trying to train from scratch with 18000 lines of text and only 5000
iterations. That will not work.

Ray has trained on hundreds of thousands of lines of text and millions of
iterations.
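
To put those two numbers side by side, here is a rough sketch. It assumes one lstmtraining iteration consumes one training line, which is an assumption and not something stated in this thread; the function name passes_over_data is made up for illustration.

```shell
# Rough estimate of how many times training sees each line.
# Assumption: one iteration processes exactly one training line.
passes_over_data() {
  # $1 = max iterations, $2 = number of training lines
  awk -v iters="$1" -v lines="$2" 'BEGIN { printf "%.2f\n", iters / lines }'
}

passes_over_data 5000 18000   # prints 0.28
```

Under that assumption, 5000 iterations do not even cover a third of the 18000 lines once.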




[tesseract-ocr] Compute CTC targets failed while training

2018-09-25 Thread Zohreh Khosrobeygi
Hi, I use this:
tesseract 4.0.0-beta.4
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8

 Found AVX2
 Found AVX
 Found SSE
I've trained on about 18000 lines for the Persian language. I use this command:

bash -x tesstrain.sh --fonts_dir /usr/share/fonts --lang fas \
  --training_text /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/fas.training_text.txt \
  --wordlist /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/fas.wordlist.txt \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir /home/zohreh/Desktop/tesseract-master/src/training/langdata \
  --tessdata_dir /home/zohreh/Desktop/tesseract-master/tessdata \
  --fontlist "Arial" \
  --output_dir /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2
and then run this:
sudo /home/zohreh/Desktop/tesseract-master/src/training/lstmtraining \
  --traineddata /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2/fas/fas.traineddata \
  --net_spec '[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]' \
  --model_output /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Out/base \
  --learning_rate 0.001 \
  --train_listfile /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2/fas.training_files.txt \
  --eval_listfile /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/v/fas.training_files.txt \
  --max_iterations 5000 \
  &>/home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Out/basetrain.log
but it always shows "Compute CTC targets failed", and the model does not
perform well at all.
I normalized my text, and each line of the text has at most 20 tokens.
Could you please help me?
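
A small sketch for gauging how often the error occurs, assuming the message appears in the log with the same wording as quoted above. count_ctc_failures is a hypothetical helper name, and the log path is whatever file the lstmtraining output was redirected to.

```shell
# Count the log lines that mention the CTC error.
count_ctc_failures() {
  # $1 = path to the lstmtraining log file
  # grep -c exits non-zero when nothing matches, so ignore its status.
  grep -c 'Compute CTC targets failed' "$1" || true
}
```

Running it on the basetrain.log written by the command above gives a single number that can be compared against the iteration count, to tell whether the failure hits a few lines or most of them.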
 