Re: [tesseract-ocr] Train tesseract 3.04 for recognition of six patterns no existents in UTF-8

Juan Pablo Aveggio Sun, 27 Sep 2015 14:19:05 -0700

Hi Dmitri Silaev

Thanks for your useful help. Actually, I have made almost no progress in terms 
of image preprocessing: I just convert the image to grayscale before applying 
OCR. And I could not get good training data. The test code is as follows: 

#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <iostream>


using namespace cv;

int main(int, char**)
{
    VideoCapture cap(0); // open the default camera
    if(!cap.isOpened())  // check if we succeeded
        return -1;
    int c;
    Mat gray;
    namedWindow("gray", 1);
    tesseract::TessBaseAPI tess;
    // Init() returns 0 on success, so stop here if the "bil" data cannot be loaded
    if(tess.Init("/usr/share/tesseract-ocr/tessdata/", "bil",
                 tesseract::OEM_DEFAULT) != 0)
        return -1;
    tess.SetPageSegMode(tesseract::PSM_SINGLE_WORD);

    for(;;)
    {
        Mat frame;
        cap >> frame;
        cvtColor(frame, gray, CV_BGR2GRAY);

        c = waitKey(30);
        if(c == 27) break;
        else if(c > 0) {
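            // gray.step is the row stride in bytes; it can differ from gray.cols
            tess.SetImage((uchar*)gray.data, gray.cols, gray.rows, 1,
                          (int)gray.step);
            // recognize before drawing, so the rectangles cannot end up in
            // the pixels that tesseract reads
            char* out = tess.GetUTF8Text();
            if(out != NULL) {
                std::cout << out << std::endl;
                delete [] out; // the caller owns the returned string
            }
            Boxa* boxes = tess.GetComponentImages(tesseract::RIL_WORD,
                                                  true, NULL, NULL);
            if(boxes != NULL) {
                for(int i=0; i < boxes->n; i++){
                    BOX* box = boxaGetBox(boxes, i, L_CLONE);
                    rectangle(gray, Point(box->x, box->y),
                              Point(box->x+box->w, box->y+box->h),
                              Scalar(255, 0, 0));
                    boxDestroy(&box); // release the clone
                }
                boxaDestroy(&boxes);
            }
            imshow("gray", gray);
            waitKey(4000);
        }else imshow("gray", gray);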
    }
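    // End() releases tesseract's resources; calling the destructor by hand
    // would make it run twice, because it runs again when tess goes out of scope
    tess.End();
    // the camera will be deinitialized automatically in VideoCapture destructor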
    return 0;
}

With this code and my training data, the only thing I have achieved is 
drawing a rectangle around the pattern, and only when I hold the bill close 
enough to the camera and exactly horizontal. It has yielded some results, 
but with a very low hit rate. I also tested the PSM_SINGLE_CHAR page 
segmentation mode, with similar results.
I thought all this could be due to the errors thrown during the training 
process, which resulted in bad training data. Now I understand that it is 
because my characters are disconnected and isolated, while tesseract is 
designed to detect horizontal lines of text made of words, mostly of several 
characters.
So could I solve this problem with OpenCV alone? The hardest part seems to 
be finding the region of the bill where the pattern is, and its rotation. 
Once I have this information, I could straighten the region and deal with 
this small subimage more easily.
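
On estimating the rotation: the following is a minimal sketch, not something proposed in this thread, of one common way to get the angle of a region once it has been thresholded into a binary mask. It uses the second-order central moments of the foreground pixels. The function name principalAxisAngle and the row-major mask layout are assumptions made for illustration, and the code uses only the C++ standard library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an OpenCV or tesseract function).
// mask: row-major binary image, nonzero = pattern pixel.
// Returns the angle in radians between the x axis and the principal axis of
// the nonzero pixels, in (-pi/2, pi/2], with y pointing down as in images.
double principalAxisAngle(const std::vector<unsigned char>& mask,
                          int width, int height)
{
    assert(width > 0 && height > 0);
    assert(mask.size() == static_cast<std::size_t>(width) * height);

    // Centroid of the foreground pixels.
    double count = 0.0, sumX = 0.0, sumY = 0.0;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (mask[static_cast<std::size_t>(y) * width + x]) {
                count += 1.0;
                sumX += x;
                sumY += y;
            }
        }
    }
    if (count == 0.0) return 0.0;  // empty mask: no orientation to report
    const double cx = sumX / count;
    const double cy = sumY / count;

    // Second-order central moments around the centroid.
    double mu20 = 0.0, mu02 = 0.0, mu11 = 0.0;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (mask[static_cast<std::size_t>(y) * width + x]) {
                const double dx = x - cx;
                const double dy = y - cy;
                mu20 += dx * dx;
                mu02 += dy * dy;
                mu11 += dx * dy;
            }
        }
    }
    return 0.5 * std::atan2(2.0 * mu11, mu20 - mu02);
}
```

In OpenCV, cv::moments reports the same mu20, mu02 and mu11 values, and cv::getRotationMatrix2D (which takes the angle in degrees) together with cv::warpAffine could then be used to straighten the subimage. Whether this angle is stable enough for the bill patterns is untested here.
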
I do not have much experience with OpenCV, but I'm willing to learn. I 
imagine that I will have to apply an algorithm to detect edges or corners, 
to try to get the contour of the bill. I have to consider that the bill 
might be only partially captured by the camera. It could even be showing 
its reverse side, so that the pattern will not be found. It looks quite 
difficult, but it's a good challenge.
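
As a first step toward the partial-capture check, here is a rough sketch under a strong assumption: a light bill lying on a darker background. It thresholds the grayscale frame, takes the bounding box of the bright pixels, and reports whether that box reaches the edge of the frame. The names Bounds, brightRegionBounds and touchesBorder are hypothetical, and a real contour search (for example cv::findContours on the thresholded image) would be more robust than a plain bounding box.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration only.
struct Bounds {
    bool found;                    // false when no pixel passed the threshold
    int left, top, right, bottom;  // inclusive pixel coordinates
};

// Bounding box of all pixels brighter than `threshold` in a row-major
// 8-bit grayscale image. Assumes a light bill on a darker background.
Bounds brightRegionBounds(const std::vector<unsigned char>& gray,
                          int width, int height, unsigned char threshold)
{
    assert(width > 0 && height > 0);
    assert(gray.size() == static_cast<std::size_t>(width) * height);

    Bounds b = {false, width, height, -1, -1};
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (gray[static_cast<std::size_t>(y) * width + x] > threshold) {
                b.found = true;
                if (x < b.left)   b.left = x;
                if (x > b.right)  b.right = x;
                if (y < b.top)    b.top = y;
                if (y > b.bottom) b.bottom = y;
            }
        }
    }
    return b;
}

// True when the region reaches an image edge, which suggests that the bill
// is only partially inside the frame.
bool touchesBorder(const Bounds& b, int width, int height)
{
    return b.found && (b.left == 0 || b.top == 0 ||
                       b.right == width - 1 || b.bottom == height - 1);
}
```

When touchesBorder returns true, the app could ask the user to move the bill instead of attempting recognition on an incomplete image.
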

Finally, I would note that I selected this pattern because I thought it 
would be easier to detect. New bills are also being issued, with many 
different typefaces and designs, but the pattern has not changed. Any 
suggestions are welcome.

Thank you very much for your interest.
Best regards
Juan Pablo Aveggio


El sábado, 26 de septiembre de 2015, 11:36:18 (UTC-3), Dmitri Silaev 
escribió:
>
> Hi Juan Pablo,
>
> The problem cannot be solved by Tesseract as is. Even given images as 
> perfect as the ones you've shown, Tesseract would fail since your 
> "characters" are too disjointed, have no meaningful baseline and only 
> occur as singletons. 
>
> However, a simple and robust recognition can be implemented without 
> Tesseract using common sense and a bit of programming. As for image 
> processing operations, you would only need trivial thresholding. However, 
> some more involved image preprocessing is required to convert the image to 
> a form close to what you've demonstrated in your sample images. 
>
> That preprocessing would be needed anyway, even if Tesseract worked for 
> your "characters". Tell me what you have already done in this direction 
> so I can share more details about the above method, if you wish. 
>
> -Dmitri
>
> Hi Dmitri Silaev.
>
> Thanks for the reply. They are bills, sorry for the mistranslation. You 
> can see examples here:
> 2 <http://k43.kn3.net/AE7DA8C86.jpg>
> 5 <http://k36.kn3.net/43DDCD402.jpg>
> 10 <http://radioalfa971.com.ar.elserver.com/wp-content/uploads/2013/12/Billete-Argentino-10-pesos.jpg>
> 20
> 50 <http://rafaelanoticias.com/privados/subidas/fotos/3783_billete_de_cincuenta_50_pesos.jpg>
> 100 <http://k02.kn3.net/58A05701F.jpg>
>
> These patterns are embossed for blind people, but they wear down and stop 
> being useful. So I'm working on an Android app to detect the value 
> and speak it to the user.
>
>
> El miércoles, 23 de septiembre de 2015, 18:26:43 (UTC-3), Dmitri Silaev 
> escribió:
>>
>> Hi Juan Pablo,
>>
>> The problem seems interesting. However, I'm not sure you can use Tesseract 
>> for that. Could you show one or more example tickets?
>>
>> Best regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>>
>>
>>
>>
>> On Tue, Sep 22, 2015 at 2:17 AM, Juan Pablo Aveggio <jpav...@gmail.com> 
>> wrote:
>>
>>> Hello
>>> I'm trying to train tesseract to recognize the patterns present on 
>>> tickets. Each ticket has a unique pattern in a predetermined place 
>>> which determines its value. As these patterns are not Unicode 
>>> characters, I assigned them the characters 'a' to 'f'.
>>> I created a .tif image with six patterns:
>>> bil.pat.exp0.tif 
>>> <https://drive.google.com/file/d/0B7CfYFzWHQDAYWU4M3hIQXUyOWs/view?usp=sharing>
>>> and the corresponding box file:
>>> bil.pat.exp0.box 
>>> <https://drive.google.com/file/d/0B7CfYFzWHQDAVkJlZ3lreEdpaXc/view?usp=sharing>
>>> a 32 692 165 958 0 
>>> b 221 734 354 958 0 
>>> c 32 446 165 628 0 
>>> d 221 488 354 628 0 
>>> e 32 275 165 373 0 
>>> f 221 317 277 373 0
>>>
>>> Then I ran:
>>> tesseract bil.pat.exp0.tif bil.pat.exp0 box.train
>>> and the output was:
>>> Tesseract Open Source OCR Engine v3.04.00 with Leptonica 
>>> Page 1 
>>> APPLY_BOXES: 
>>>    Boxes read from boxfile:       6 
>>> APPLY_BOXES: Unlabelled word at :Bounding box=(-958,221)->(-734,277) 
>>> APPLY_BOXES: Unlabelled word at :Bounding box=(-628,221)->(-488,277) 
>>> APPLY_BOXES: Unlabelled word at :Bounding box=(-958,32)->(-734,88) 
>>> APPLY_BOXES: Unlabelled word at :Bounding box=(-628,32)->(-488,88) 
>>> APPLY_BOXES: Unlabelled word at :Bounding box=(-373,32)->(-317,88) 
>>>    Found 6 good blobs. 
>>>    5 remaining unlabelled words deleted. 
>>> Generated training data for 6 words
>>> I don't see how the coordinates can be negative. Despite this, I tried 
>>> to keep going.
>>> My font_properties is:
>>> bil.pat.box 0 0 1 0 0
>>> bil.words_list is:
>>> a 
>>> b 
>>> c 
>>> d 
>>> e 
>>> f 
>>>
>>> then I ran:
>>> $ unicharset_extractor bil.pat.exp0.box
>>> Extracting unicharset from bil.pat.exp0.box 
>>> Wrote unicharset file ./unicharset.
>>> but the unicharset file has:
>>> 9 
>>> NULL 0 NULL 0 
>>> Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0     # Joined [4a 6f 69 6e 65 64 ] 
>>> |Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # Broken 
>>> a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # a [61 ] 
>>> b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # b [62 ] 
>>> c 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # c [63 ] 
>>> d 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # d [64 ] 
>>> e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # e [65 ] 
>>> f 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # f [66 ]
>>> Then I ran:
>>> $ mftraining -F font_properties -U unicharset -O bil.unicharset bil.pat.exp0.tr  
>>> Read shape table shapetable of 0 shapes 
>>> Reading bil.pat.exp0.tr ... 
>>> Bad properties for index 3, char a: 0,255 0,255 0,0 0,0 0,0 
>>> Bad properties for index 4, char b: 0,255 0,255 0,0 0,0 0,0 
>>> Bad properties for index 5, char c: 0,255 0,255 0,0 0,0 0,0 
>>> Bad properties for index 6, char d: 0,255 0,255 0,0 0,0 0,0 
>>> Bad properties for index 7, char e: 0,255 0,255 0,0 0,0 0,0 
>>> Bad properties for index 8, char f: 0,255 0,255 0,0 0,0 0,0 
>>> Warning: no protos/configs for Joined in CreateIntTemplates() 
>>> Warning: no protos/configs for |Broken|0|1 in CreateIntTemplates() 
>>> Warning: no protos/configs for a in CreateIntTemplates() 
>>> Warning: no protos/configs for b in CreateIntTemplates() 
>>> Warning: no protos/configs for c in CreateIntTemplates() 
>>> Warning: no protos/configs for d in CreateIntTemplates() 
>>> Warning: no protos/configs for e in CreateIntTemplates() 
>>> Warning: no protos/configs for f in CreateIntTemplates() 
>>> Done!
>>> What am I doing wrong?
>>> I am on Debian.
>>> tesseract 3.04.00 
>>>  leptonica-1.72 
>>>   libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.4.0) : libpng 1.2.50 : 
>>> libtiff 4.0.5 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>>> Thank you very much in advance!
>>>
>>>
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to tesseract-oc...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/a619104a-79d5-40ec-8a08-a6a9941ec292%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/tesseract-ocr/a619104a-79d5-40ec-8a08-a6a9941ec292%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>

