On Tuesday, 2 October 2012 22:33:35 UTC+13, I wrote:
> Results: 1.5% character errors. Most accented letters recognised.
> Frequent errors: l → I, e → c, il → ü, li → h, o → O

I have redone the tests with my new epo.word-dawg (492000+ words) and epo.freq-dawg (200 words), and with an epo.unicharambigs that I put together myself. This is it (27 rules):
v1
1 c 1 e 0
2 t î 2 f i 0
1 ü 2 i i 0
1 ü 2 i l 0
1 â 1 i 0
1 î 1 i 0
1 l 1 i 0
2 l d 2 k l 0
2 I ( 1 K 0
3 1 < - 1 k 0
2 l < 1 k 0
2 1 < 1 k 0
1 ü 2 l i 0
1 h 2 l i 0
1 I 1 l 0
1 1 1 l 0
1 t 1 l 0
2 r n 1 m 0
1 O 1 o 0
1 0 1 o 0
2 n n 2 r m 0
1 : 1 r 0
2 l 0 2 1 0 0
2 0 l 2 0 1 0
1 ô 1 6 0
3 ° / o 1 % 1
2 , , 1 " 1

(The tabs don't seem to translate well here.)

This time I got 0.3% character errors (compared with 1.3% before). This is a serif font text with sans bold headings, scanned at 600 dpi and saved as TIFF. Here are some of the main errors:

10 → l0
0,1 % → O,l %
0,0000001 → 0,000000l [It apparently is stuck on the USA decimal separator, so it thinks that this is alphabetic?]
6 → ô
tro → tro Ŭ [strange character at end of line]

Tesseract sometimes loses the dot at the end of a sentence and sometimes loses a space between words, e.g. Do pro → Doghro. An exclamation mark is sometimes read as a letter, e.g. Domage! → Domagel

I also scanned (600 dpi) a double page from another magazine. This one uses a sans font only. This gave about 0.2% errors. Some errors are rather bizarre: sofo → soĵho, Do pro → Doghro. Sometimes spaces are lost, sometimes added. Sometimes a sentence-initial I changes to lower case, maybe because Tesseract found the lower-case form in the dictionary?

I had only one instance of underlined text. It was misinterpreted: * mi* → _r_n_j

Is Tesseract supposed to be able to cope with underlined text, or should we train it on an underlined font (which I don't have)?

Any comments?

Regards,
Donaldo
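
P.S. Since the tabs in the ambiguity rules above got mangled in transit, here is a minimal Python sketch for sanity-checking such a file before handing it to the training tools. It assumes the v1 unicharambigs layout as I understand it from the Tesseract 3.x training documentation: a first line reading v1, then one rule per line with five tab-separated fields (source count, source unichars separated by single spaces, target count, target unichars, and a type flag of 0 or 1). The function names parse_ambig_line and check_ambigs are made up for this example and are not part of Tesseract.

```python
def parse_ambig_line(line):
    """Parse one v1 rule into (source_unichars, target_unichars, rule_type).

    Assumed layout: count <TAB> unichars <TAB> count <TAB> unichars <TAB> type
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}: {line!r}")
    n_src, src, n_dst, dst, rule_type = fields
    source = src.split(" ")
    target = dst.split(" ")
    # The declared counts must match the number of unichars actually listed.
    if int(n_src) != len(source):
        raise ValueError(f"source count {n_src} does not match {source!r}")
    if int(n_dst) != len(target):
        raise ValueError(f"target count {n_dst} does not match {target!r}")
    if rule_type not in ("0", "1"):
        raise ValueError(f"type flag should be 0 or 1, got {rule_type!r}")
    return source, target, int(rule_type)


def check_ambigs(text):
    """Validate a whole file's text and return the list of parsed rules."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "v1":
        raise ValueError("first line should be the version tag 'v1'")
    # Blank lines are skipped; every other line must be a well-formed rule.
    return [parse_ambig_line(line) for line in lines[1:] if line.strip()]


if __name__ == "__main__":
    sample = "v1\n2\tr n\t1\tm\t0\n3\t° / o\t1\t%\t1\n"
    for source, target, rule_type in check_ambigs(sample):
        print(" ".join(source), "->", " ".join(target), "type", rule_type)
```

A rule whose declared count disagrees with the unichars listed, or whose fields are separated by spaces instead of tabs, raises ValueError with the offending line, which is the kind of damage a mail client can do silently.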

