Hello,

I'm trying to train Tesseract to recognize patterns printed on tickets. Each ticket has a unique pattern in a predetermined place, and that pattern determines the ticket's value. Since these patterns are not Unicode characters, I assigned them the characters 'a' to 'f'.

I created a .tif image with six patterns:

bil.pat.exp0.tif <https://drive.google.com/file/d/0B7CfYFzWHQDAYWU4M3hIQXUyOWs/view?usp=sharing>

and the corresponding box file:

bil.pat.exp0.box <https://drive.google.com/file/d/0B7CfYFzWHQDAVkJlZ3lreEdpaXc/view?usp=sharing>

a 32 692 165 958 0
b 221 734 354 958 0
c 32 446 165 628 0
d 221 488 354 628 0
e 32 275 165 373 0
f 221 317 277 373 0
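As a sanity check on the box file, here is a minimal Python sketch. It is not part of Tesseract, and the names `Box`, `parse_box_line`, and `check_boxes` are made up for illustration. It assumes the usual 3.0x box layout `<char> <left> <bottom> <right> <top> <page>` with the origin at the bottom-left corner of the image, and it only verifies that every coordinate is non-negative and that each box has positive width and height.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One line of a box file (assumed layout: char left bottom right top page)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Split one box-file line into its six fields."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def check_boxes(lines: List[str]) -> List[str]:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        box = parse_box_line(line)
        if min(box.left, box.bottom, box.right, box.top) < 0:
            problems.append(f"line {number}: negative coordinate in {box}")
        if box.left >= box.right or box.bottom >= box.top:
            problems.append(f"line {number}: empty or inverted box {box}")
    return problems


# The six boxes from bil.pat.exp0.box, copied from the message.
BOX_TEXT = """\
a 32 692 165 958 0
b 221 734 354 958 0
c 32 446 165 628 0
d 221 488 354 628 0
e 32 275 165 373 0
f 221 317 277 373 0
"""

if __name__ == "__main__":
    found = check_boxes(BOX_TEXT.splitlines())
    print(found if found else "box file looks well-formed")
```

On the six boxes above, this check reports no problems.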
Then I ran:

$ tesseract bil.pat.exp0.tif bil.pat.exp0 box.train

and the output was:

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
APPLY_BOXES:
Boxes read from boxfile: 6
APPLY_BOXES: Unlabelled word at :Bounding box=(-958,221)->(-734,277)
APPLY_BOXES: Unlabelled word at :Bounding box=(-628,221)->(-488,277)
APPLY_BOXES: Unlabelled word at :Bounding box=(-958,32)->(-734,88)
APPLY_BOXES: Unlabelled word at :Bounding box=(-628,32)->(-488,88)
APPLY_BOXES: Unlabelled word at :Bounding box=(-373,32)->(-317,88)
Found 6 good blobs.
5 remaining unlabelled words deleted.
Generated training data for 6 words

The negative coordinates in these bounding boxes look wrong to me. Despite this, I tried to keep going.

My font_properties is:

bil.pat.box 0 0 1 0 0

bil.words_list is:

a
b
c
d
e
f

Then I ran:

$ unicharset_extractor bil.pat.exp0.box
Extracting unicharset from bil.pat.exp0.box
Wrote unicharset file ./unicharset.

but the unicharset file contains:

9
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # a [61 ]
b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # b [62 ]
c 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # c [63 ]
d 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # d [64 ]
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # e [65 ]
f 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # f [66 ]

Then I ran:

$ mftraining -F font_properties -U unicharset -O bil.unicharset bil.pat.exp0.tr
Read shape table shapetable of 0 shapes
Reading bil.pat.exp0.tr ...
Bad properties for index 3, char a: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char b: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 5, char c: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 6, char d: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 7, char e: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 8, char f: 0,255 0,255 0,0 0,0 0,0
Warning: no protos/configs for Joined in CreateIntTemplates()
Warning: no protos/configs for |Broken|0|1 in CreateIntTemplates()
Warning: no protos/configs for a in CreateIntTemplates()
Warning: no protos/configs for b in CreateIntTemplates()
Warning: no protos/configs for c in CreateIntTemplates()
Warning: no protos/configs for d in CreateIntTemplates()
Warning: no protos/configs for e in CreateIntTemplates()
Warning: no protos/configs for f in CreateIntTemplates()
Done!

What am I doing wrong?

I am on Debian, with:

tesseract 3.04.00
leptonica-1.72
libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.4.0) : libpng 1.2.50 : libtiff 4.0.5 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0

Thank you very much in advance!