After running tesseract with various debug switches enabled, I've found that 
it thinks there are two characters in the image and runs OCR on each of them.

Changing the page segmentation mode (PSM) changes the output:
- PSM 6 (single uniform block of text): outputs garbage plus the correct character
- PSM 7 (single text line): works correctly
- PSM 8 (single word): works correctly
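For anyone who wants to repeat the comparison, here is a rough Python sketch that builds the same command line I used for the debug run, once per PSM. The helper names (build_command, run_sweep) and the "_psm<N>" output naming are just my own choices for illustration; the flags are the ones from my command line.

```python
import shutil
import subprocess


def build_command(image, out_base, lang, psm, dpi=72):
    """Build one tesseract invocation for the given page segmentation mode.

    The output base gets a "_psm<N>" suffix (an arbitrary naming choice)
    so the runs do not overwrite each other.
    """
    return [
        "tesseract", image, f"{out_base}_psm{psm}",
        "-l", lang,
        "--dpi", str(dpi),
        "--psm", str(psm),
        "-c", "classify_debug_level=1",
    ]


def run_sweep(image, out_base, lang, psm_modes=(6, 7, 8)):
    """Run tesseract once per PSM and return {psm: recognized text}."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    results = {}
    for psm in psm_modes:
        subprocess.run(build_command(image, out_base, lang, psm), check=True)
        # tesseract writes plain-text output to <output base>.txt
        with open(f"{out_base}_psm{psm}.txt", encoding="utf-8") as fh:
            results[psm] = fh.read().strip()
    return results
```

Calling run_sweep("sub0089w.png", "debugOut", "ARYuanB5-MD") should give one result per mode, which makes it easy to diff the outputs side by side.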

The debug output is below.

This raises a new issue: the input data (TV subtitles) is a mixture of one- and 
two-line text blocks, and a one-line block might consist of a single character, 
as in this case.

So the ideal page segmentation mode should be 6, no? But looking at the debug 
output for PSM 6, it thinks there are two characters in the input image...

That doesn't sound like a training issue but rather some problem with how it 
identifies glyphs in the input image...


-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/28D1A233-AD19-4541-95E3-F31422000F67%40mac.com.

> $ tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c classify_debug_level=1
> 
> Processing word with lang ARYuanB5-MD at:Bounding box=(3,45)->(33,56)
> Trying word using lang ARYuanB5-MD, oem 1
> Best choice: accepted=0, adaptable=0, done=1 : Lang result : Ll : R=10.3645, C=-11.8365, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
> str   L       l
> 1 new words better than 0 old words: r: 10.3645 v 0 c: -11.8365 v 0 valid dict: 0 v 0
> 
> Processing word with lang ARYuanB5-MD at:Bounding box=(3,3)->(56,58)
> Trying word using lang ARYuanB5-MD, oem 1
> Best choice: accepted=1, adaptable=0, done=1 : Lang result : 對 : R=3.09071, C=-1.8713, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
> str   對
> state:        1
> 1 new words better than 0 old words: r: 3.09071 v 0 c: -1.8713 v 0 valid dict: 0 v 0
> 
> $ cat debugOut.txt
> Ll
> 對
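
One thing that stands out in the debug output above: the bounding box of the bogus "Ll" word lies entirely inside the bounding box of the correctly recognized 對. A small Python sketch to check this from the log text follows. The function names are my own, and I'm assuming the four numbers are (left, bottom)->(right, top); the containment test itself does not depend on which way the y axis points.

```python
import re

# Matches lines such as "Bounding box=(3,45)->(33,56)" in the debug output.
BOX_RE = re.compile(r"Bounding box=\((\d+),(\d+)\)->\((\d+),(\d+)\)")


def parse_boxes(debug_text):
    """Return every bounding box in the log as an (x0, y0, x1, y1) tuple."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_RE.finditer(debug_text)]


def contains(outer, inner):
    """True if `inner` lies completely within `outer` (edges may touch)."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def nested_boxes(boxes):
    """List (inner, outer) pairs where one word box sits inside another."""
    return [
        (inner, outer)
        for inner in boxes
        for outer in boxes
        if inner != outer and contains(outer, inner)
    ]


# The two "Processing word" lines copied from the debug output.
DEBUG_SAMPLE = """\
Processing word with lang ARYuanB5-MD at:Bounding box=(3,45)->(33,56)
Processing word with lang ARYuanB5-MD at:Bounding box=(3,3)->(56,58)
"""

for inner, outer in nested_boxes(parse_boxes(DEBUG_SAMPLE)):
    print(f"{inner} is inside {outer}")
```

If that reading is right, the extra "word" looks like a piece of the same glyph that got split off during segmentation, rather than a separate character.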


> On 16 Oct 2023, at 09:08, 'Danny Wilson' via tesseract-ocr 
> <tesseract-ocr@googlegroups.com> wrote:
> 
> I guess I am the author... ARYuanB5-MD is the font.
> 
> For further background, the stock tessdata_best/chi_tra.traineddata did not 
> do a good job at all on the text I'm trying to recognize.
> 
> So I retrained:
> - copied the existing Chinese wordlist and added additional characters and 
> sentences (47,000 lines in total)
> - rendered ground truth images (with the special font) and box files
> - used lang data from "chi_tra" (config, unicharset, Han.xx, Latin.xx, 
> radical-stroke etc)
> - ran lstmtraining with 30,000 iterations
