Unicharset has Incorrect Character Properties

matthew christy Fri, 20 Sep 2013 08:05:53 -0700

Hi all,

I was recently reading through the TrainingTesseract3 document again and 
came across some stuff in the "Compute the Character Set" section that I 
had not really looked at before.


Basically:

Tesseract needs to have access to character properties isalpha, isdigit,
isupper, islower, ispunctuation. This data must be encoded in the unicharset
data file. Each line of this file corresponds to one character. The character
in UTF-8 is followed by a hexadecimal number representing a binary mask that
encodes the properties. Each bit corresponds to a property. If the bit is
set to 1, it means that the property is true. The bit ordering is (from
least significant bit to most significant bit): isalpha, islower, isupper,
isdigit.
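
To make the bit ordering concrete, here is a minimal Python sketch that decodes 
such a mask. It assumes only what the passage above states: a hexadecimal 
number whose four lowest bits are isalpha, islower, isupper, isdigit, in that 
order. The passage does not give a bit position for ispunctuation, so the 
sketch ignores any higher bits. The function names are made up for 
illustration and are not part of Tesseract.

```python
# Bit order as listed in the quoted passage, least significant bit first.
PROPERTY_BITS = ("isalpha", "islower", "isupper", "isdigit")


def decode_properties(hex_mask):
    """Return the names of the properties whose bit is set in hex_mask."""
    value = int(hex_mask, 16)
    return [name for bit, name in enumerate(PROPERTY_BITS) if value & (1 << bit)]


def parse_line(line):
    """Split a 'character mask' line into the character and its properties.

    Assumes whitespace-separated fields with the mask in the second field;
    any further fields are ignored.
    """
    fields = line.split()
    return fields[0], decode_properties(fields[1])
```

Under these assumptions a lowercase letter should carry the mask 3 (isalpha 
plus islower), an uppercase letter 5, and a digit 8.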


I always had other things to worry about, so I didn't pay too much attention 
to this. But now that I'm getting close to producing usable data with 
Tesseract and am concerned about getting the most accuracy out of it, 
I am paying more attention. I looked at a unicharset file I recently 
generated and saw that many characters have been encoded with incorrect 
character properties.

I believe I can correct these by hand (can anyone confirm that changing the 
value in the unicharset file is all I need to do?), but I'm going to be 
generating quite a lot of training files and I don't want to have to do this 
every time. I can't find any additional information on how these character 
property values are determined and assigned to the characters listed in 
the unicharset file. Does anyone know? 
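
One way to avoid checking every file by hand might be a small script that 
recomputes the expected mask for each character and reports the lines that 
disagree. The sketch below is only that, a sketch: it uses Python's own 
Unicode string predicates as the reference, which may not match how Tesseract 
derives the values, and it assumes each data line is whitespace-separated 
with the hexadecimal mask in the second field. It compares only the four 
bits described in the quoted passage. The function names are hypothetical.

```python
def expected_mask(ch):
    """Build the four-bit mask from Python's Unicode string predicates."""
    mask = 0
    if ch.isalpha():
        mask |= 1  # isalpha: least significant bit
    if ch.islower():
        mask |= 2  # islower
    if ch.isupper():
        mask |= 4  # isupper
    if ch.isdigit():
        mask |= 8  # isdigit
    return mask


def find_mismatches(lines):
    """Return (character, recorded mask, expected mask) for disagreeing lines."""
    mismatches = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip lines without a mask field, e.g. a count line
        try:
            recorded = int(fields[1], 16)
        except ValueError:
            continue  # second field is not a hexadecimal number
        wanted = expected_mask(fields[0])
        # Compare only the four low bits described in the quoted passage.
        if (recorded & 0xF) != wanted:
            mismatches.append((fields[0], recorded, wanted))
    return mismatches
```

Running find_mismatches over the lines of a generated unicharset file would 
then list only the entries worth inspecting.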

Does anyone have any ideas on how to make this property assignment more 
accurate? And, frankly, does it even really matter what values are assigned 
here?

Thanks,
Matt

