>> 3) If it's fonts you are looking for, is it unicode fonts you want.  The 
>> reason I'm asking is that the bulk of the literature (older than the last 20 
>> years or so)  typeset by computer would be in non-unicode fonts.
> We really are interested in eveyrthing, while unicode is the best we might 
> invest time to make other fonts work if needed.

Actually, for training an OCR, it doesn’t matter whether the fonts are Unicode 
or not, because they work as input and you are interested in the shapes in 
them, not to which numbers they are assigned. It needs a bit of extra work to 
map the shapes to the correct classes, but in general the more shapes you can 
get, the better. I would encourage people to share non-Unicode fonts too. The 
output of OCR is text, which will presumably be text in Unicode, so any Unicode 
font would work to display them.

You could theoretically use OCR to convert the non-Unicode literature to 
Unicode one, but I would argue that is very convoluted, slow and error-prone 
approach – you have to convert text to image and then the OCR has to “guess” 
how to split it back and what letters they are. You will get faster and more 
correct results if you convert them as text to text, as it already contains the 
information about which shapes are used.

Best Regards,
Jan Kučera
ल Institute of South and Central Asia Students, Prague


_______________________________________________
INDOLOGY mailing list
[email protected]
https://list.indology.info/mailman/listinfo/indology

Reply via email to