Re: Tess v3 not recognising accented Esperanto characters.

Donaldo Thu, 04 Oct 2012 03:09:05 -0700

On Tuesday, 2 October 2012 22:33:35 UTC+13, I wrote:
>
> Results: 1.5% character errors. Most accented letters recognised.
> Frequent errors: l → I, e → c, il → ü, li → h, o → O

I have redone the tests with my new epo.word-dawg (492000+ words) and 
epo.freq-dawg (200 words) and with an epo.unicharambigs that I put together 
myself. This is it (27 rules):


v1
1    c    1    e    0
2    t î    2    f i    0
1    ü    2    i i    0
1    ü    2    i l    0
1    â    1    i    0
1    î    1    i    0
1    l    1    i    0
2    l d    2    k l    0
2    I (    1    K    0
3    1 < -    1    k    0
2    l <    1    k    0
2    1 <    1    k    0
1    ü    2    l i    0
1    h    2    l i    0
1    I    1    l    0
1    1    1    l    0
1    t    1    l    0
2    r n    1    m    0
1    O    1    o    0
1    0    1    o    0
2    n n    2    r m    0
1    :    1    r    0
2    l 0    2    1 0    0
2    0 l    2    0 1    0
1    ô    1    6    0
3    ° / o    1    %    1
2    , ,    1    "    1

(The tabs don't seem to translate well here).
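
Since the tabs get mangled in transit, here is a minimal sketch of how a file in this layout could be generated so that the separators are real tab characters. It assumes the v1 layout shown above (source count, space-separated source characters, target count, space-separated target characters, type flag, all tab-separated). The names RULES, format_rule and format_unicharambigs are made up for illustration, and only three of the 27 rules are repeated here.

```python
# Sketch only: helper names are hypothetical, layout follows the v1 listing above.
# Each rule is (source characters, target characters, type flag).
RULES = [
    (["c"], ["e"], 0),
    (["r", "n"], ["m"], 0),
    (["°", "/", "o"], ["%"], 1),
]


def format_rule(source, target, rule_type):
    """Return one rule line with genuine tab separators between the five fields."""
    fields = [
        str(len(source)),   # number of characters in the misread sequence
        " ".join(source),   # the misread characters, space-separated
        str(len(target)),   # number of characters in the intended sequence
        " ".join(target),   # the intended characters, space-separated
        str(rule_type),     # type flag, copied from the last column
    ]
    return "\t".join(fields)


def format_unicharambigs(rules):
    """Return the whole file as text: the version line, then one line per rule."""
    lines = ["v1"]
    lines.extend(format_rule(src, tgt, kind) for src, tgt, kind in rules)
    return "\n".join(lines) + "\n"


ambigs_text = format_unicharambigs(RULES)
```

The resulting text would then be saved as epo.unicharambigs with UTF-8 encoding, so that the accented characters survive.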

This time I got 0.3% character errors (cf. 1.3% before). The text is set in a 
serif font with bold sans headings, scanned at 600 dpi and saved as TIFF. 
Here are some of the main errors:

    10 -> l0
    0,1 % → O,l %
    0,0000001 -> 0,000000l  [It apparently is stuck on the USA decimal 
        separator, so it thinks that this is alphabetic?]
    6 -> ô
    tro -> tro Ŭ  [strange character at end of line.]

Tesseract sometimes loses the dot at the end of a sentence, and sometimes 
loses a space between words, e.g. Do pro -> Doghro.
An exclamation mark is sometimes read as a letter, e.g. Domage! -> Domagel
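
For anyone wanting to compare figures, a character error percentage of this kind can be estimated as the edit distance between the correct text and the OCR output, divided by the length of the correct text. The sketch below is only one way of doing it and not necessarily how the percentages in this post were obtained. The function names levenshtein and char_error_rate are made up for illustration.

```python
# Sketch only: a plain Levenshtein distance, not tied to any Tesseract tool.
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, ocr_output):
    """Edit distance as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(reference, ocr_output) / len(reference)
```

With the examples above, levenshtein("Domage!", "Domagel") counts one error and levenshtein("Do pro", "Doghro") counts two.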

I scanned (600 dpi) a double page from another magazine. This one uses a 
sans font only. This gave about 0.2% errors. Some errors are rather bizarre. 

sofo → soĵho, Do pro → Doghro. Sometimes spaces are lost, sometimes added. 
Sometimes a sentence-initial I changes to lower case, maybe because it found 
it in the dictionary?

I had only one instance of underlined text. It was misinterpreted:
*mi* → _r_n_j 
Is Tesseract supposed to be able to cope with underlined text, or should we 
train it on an underlined font (which I don't have)?

Any comments?

Regards
Donaldo
