Re: Customising Tesseract for character recognition

Saurabh Gandhi Fri, 18 Feb 2011 01:56:12 -0800

You can set the whitelist in your program just after the call to Init (the
blacklist works the same way through tessedit_char_blacklist):


    api.Init(argv[0], lang, &(argv[arg]), argc-arg, false);
    api.SetVariable("tessedit_char_whitelist",
                    "ABCDEFGHIJKLMNOPQRSTUVWXYZ.0123456789 ");
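
If you would rather build that string in code than type it out, something like the following might do. This is only a sketch: MakeWhitelist and KeepWhitelisted are hypothetical helper names of mine, not part of Tesseract, and the second one is just an optional sanity check on the text that comes back from the engine.

```cpp
#include <string>

// Build the whitelist discussed in this thread: A-Z, dot, 0-9 and space.
// (Hypothetical helper, not a Tesseract function.)
std::string MakeWhitelist() {
    std::string wl;
    for (char c = 'A'; c <= 'Z'; ++c) wl += c;
    wl += '.';
    for (char c = '0'; c <= '9'; ++c) wl += c;
    wl += ' ';
    return wl;
}

// Drop every character of `text` that is not in `whitelist`.
// (Hypothetical post-check on OCR output, not a Tesseract function.)
std::string KeepWhitelisted(const std::string& text,
                            const std::string& whitelist) {
    std::string out;
    for (char c : text) {
        if (whitelist.find(c) != std::string::npos) out += c;
    }
    return out;
}
```

The result of MakeWhitelist() would then be passed to the engine as api.SetVariable("tessedit_char_whitelist", wl.c_str()), and KeepWhitelisted("KA-01 AB.1234!", wl) gives "KA01 AB.1234".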

--
Regards,
Saurabh Gandhi




On Fri, Feb 18, 2011 at 3:21 PM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:

> Customise the tesseract engine to recognize only the characters from
> A-Z,0-9,.(dot), (space) by setting the character white-list
>
> Kindly tell me the name of the folder in which the whitelist and the
> blacklist are kept. I want to use the same for Kannada scripts.
> -sriranga(78yrs)
>
>
> On Fri, Feb 18, 2011 at 11:57 AM, Ray Smith <theraysm...@gmail.com> wrote:
>
>> From all this, I have identified the following ways of improving the
>> results:
>>
>>    1. Customise the tesseract engine to recognize only the characters
>>    from A-Z,0-9,.(dot), (space) by setting the character white-list. My
>>    understanding is that the white-list is the list of characters that are
>>    going to be recognized. I was curious to know what the blacklist is
>>    meant to do?
>>    Just the opposite of whitelist. You can disable specific characters
>>    from the usual set.
>>    2. A lot of times I have seen fairly good number plate images being
>>    OCRed inaccurately. This could possibly be due to the word recognition
>>    stage. Has anyone found a way to disable the dictionary / word
>>    recognition?
>>    Play with segment_penalty_dict_*
>>    3. Then there are some page segmentation modes
>>    (PSM_AUTO, PSM_SINGLE_BLOCK, PSM_CHAR etc). Does PSM_CHAR imply that it
>>    will consider the input image as a single character and run the
>>    algorithm accordingly without attempting word recognition?
>>    Yes.
>>    4. Another important configuration macro that I have seen within the
>>    code was AVS_FASTEST = 0,  AVS_MOST_ACCURATE = 100. However, I could not
>>    find the same being used anywhere in the code. Does this have any impact
>>    on the character recognition accuracy?
>>    This control is dead in 3.01. Replaced by ocr_engine_mode. It just
>>    controls the combination of tesseract vs cube. Cube increases the accuracy
>>    slightly, but adds a lot of compute time.
>>    5. Finally, I also plan to use the confidence level data. Are there
>>    any indicators of confidence for characters as well? There is word
>>    confidence data which can be found in
>>    TessBaseAPI::AllWordConfidences().
>>    Yes, and they are exposed in the new ResultIterator in 3.01, otherwise
>>    you have to go down into the guts of the data structures.
>>
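
On point 5, here is a rough sketch of how the word confidences could be used to flag doubtful words. It assumes the array layout I understand AllWordConfidences() to return (values between 0 and 100, terminated by -1). LowConfidenceWords is a hypothetical helper name and the threshold is whatever suits your images.

```cpp
#include <cstddef>
#include <vector>

// Return the indices of the words whose confidence is below `threshold`.
// `conf` is assumed to be a -1 terminated array, which is the layout
// TessBaseAPI::AllWordConfidences() is understood to return.
// (Hypothetical helper, not a Tesseract function.)
std::vector<std::size_t> LowConfidenceWords(const int* conf, int threshold) {
    std::vector<std::size_t> low;
    if (conf == nullptr) return low;  // nothing recognised
    for (std::size_t i = 0; conf[i] != -1; ++i) {
        if (conf[i] < threshold) low.push_back(i);
    }
    return low;
}
```

With the real API this would be called on the pointer returned by api.AllWordConfidences(), which as far as I know the caller has to release with delete [] afterwards.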
>>  --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> To unsubscribe from this group, send email to
>> tesseract-ocr+unsubscr...@googlegroups.com.
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>
>

