Dear NGuyenQ,
> From the page http://www.pixel-technology.com/freeware/tessnet2/
> tessnet2.Tesseract ocr = new tessnet2.Tesseract();
> ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // If digit only
This is brilliant advice you just gave him. It is very effective; I just tested
it on documents containing only digits and a few special characters.
Since I'm working with C++ only (no .NET wrapper), here is what I recommend
doing:
// Initialise the Tesseract API.
_tessApi = new tesseract::TessBaseAPI();
// Set the data path (the current directory) and the language prefix.
_tessApi->Init("./", "cst");
// This only matters if you are parsing pictures that contain a single
// line of text (which is my case).
_tessApi->SetPageSegMode(tesseract::PSM_SINGLE_LINE);
// Here is the trick, as explained and pointed out by NGuyenQ:
_tessApi->SetVariable("tessedit_char_whitelist", "<0123456789");
// Then, in a loop over my documents, here is the idea:
PIX *pix = pixReadMemTiff(
    (const l_uint8*)buffer.buffer().constData(), buffer.size(), 0);
_tessApi->SetImage(pix);
// Run recognition; the returned buffer must be freed with delete [].
char *text = _tessApi->GetUTF8Text();
doc.setRecognizedData("OCRLine", QString(text).trimmed());
// pixDestroy() frees the image and sets pix to NULL, so no "delete pix".
pixDestroy(&pix);
delete [] text;
// After the loop, release everything.
_tessApi->Clear();
_tessApi->End();
delete _tessApi;
The really interesting part is that before, I was getting "D" and "O"
instead of zeros, sometimes even "A" for "4", and "[]" or "[)" instead of
zeros, despite my disambiguation file. Now I'm getting everything correct,
which means the whitelist / blacklist are not just post-processing filters, but
real "recognition clues".
I recommend everyone take note (well... I'm discovering this feature and
its real consequences, maybe you're not :D).
Pierre.