Re: Disable Special characters?

MARTIN Pierre Sun, 18 Apr 2010 14:30:11 -0700

Dear NGuyenQ,

> From the page http://www.pixel-technology.com/freeware/tessnet2/
> tessnet2.Tesseract ocr = new tessnet2.Tesseract();
> ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // If digit only
This is brilliant advice you just gave him. It is very effective: I just tested 
it on a document containing only digits and a few special characters.
Since I'm working with C++ only (no .NET wrapper), here is what I recommend 
doing:


        // Init your tess API.
        _tessApi        = new tesseract::TessBaseAPI();
        // Set up the current directory and language prefix.
        _tessApi->Init("./", "cst");
        // This is only important if you'll be parsing pictures with only one
        // line of text (which is my case).
        _tessApi->SetPageSegMode(tesseract::PSM_SINGLE_LINE);
        // Here is the trick as explained and pointed out by NGuyenQ:
        _tessApi->SetVariable("tessedit_char_whitelist", "<0123456789");

        // Then, in a loop for each of my documents, here is the idea:
        PIX     *pix    = pixReadMemTiff(
            (const l_uint8*)buffer.buffer().constData(), buffer.size(), 0);
        _tessApi->SetImage(pix);
        // Run the recognition; the returned buffer must be freed with delete [].
        char    *text   = _tessApi->GetUTF8Text();
        doc.setRecognizedData("OCRLine", QString(text).trimmed());
        delete []       text;
        // pixDestroy() frees the image and sets pix to NULL; don't delete it too.
        pixDestroy(&pix);

        // Release everything.
        _tessApi->Clear();
        _tessApi->End();
        delete _tessApi;

The very interesting part is that before, I was getting "D" and "O" instead 
of zeros, sometimes even "A" for "4", and "[]" and "[)" instead of zeros, 
despite my disambiguation file. Now I'm getting everything correct, which 
means the whitelist / blacklist are not just post-processing filters, but 
real "recognition clues".
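
To make the difference concrete, here is a toy sketch. It is NOT Tesseract's 
actual code: the Candidates type, the function names and the scores are all 
made up for illustration. It only shows why restricting the candidates during 
recognition can give a better result than filtering the output afterwards.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical model: for one glyph, each candidate character and its score.
using Candidates = std::map<char, double>;

// Pick the best-scoring candidate. If 'allowed' is non-empty, only characters
// contained in it may compete; an empty 'allowed' means no restriction.
// Returns '\0' when no candidate qualifies.
char pickBest(const Candidates &cands, const std::string &allowed) {
    char   best      = '\0';
    double bestScore = -1.0;
    for (const auto &kv : cands) {
        const bool ok = allowed.empty()
                     || allowed.find(kv.first) != std::string::npos;
        if (ok && kv.second > bestScore) {
            best      = kv.first;
            bestScore = kv.second;
        }
    }
    return best;
}

// Approach A: recognise without restriction, then drop every character
// that is not in the whitelist (a pure post-processing filter).
std::string postFilter(const std::vector<Candidates> &glyphs,
                       const std::string &whitelist) {
    std::string out;
    for (const auto &g : glyphs) {
        const char c = pickBest(g, "");
        if (c != '\0' && whitelist.find(c) != std::string::npos)
            out += c;
    }
    return out;
}

// Approach B: the whitelist takes part in the decision, so the best
// *allowed* candidate wins (a "recognition clue").
std::string constrained(const std::vector<Candidates> &glyphs,
                        const std::string &whitelist) {
    std::string out;
    for (const auto &g : glyphs) {
        const char c = pickBest(g, whitelist);
        if (c != '\0')
            out += c;
    }
    return out;
}
```

With three glyphs scored as {O: 0.6, 0: 0.4}, {A: 0.55, 4: 0.45} and 
{7: 0.9, T: 0.1}, and the whitelist "<0123456789", postFilter() keeps only 
"7" while constrained() returns "047". The second behaviour matches what I 
observed, though how Tesseract really applies the whitelist internally is 
something I have not checked.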

I recommend everyone take note (well... I'm discovering this feature and 
its real consequences, maybe you're not :D).

Pierre.
