From: "Towheed Chowdhury" <[EMAIL PROTECTED]>
> How bangla ocr can be developed using current unicode?

ISO/IEC 10646 and Unicode are just standards for character encoding, not for rendering or presentation. OCR is a difficult problem, but it has nothing in common with character encoding, as it is an analysis of glyphs. Generally, good OCR is difficult to automate without specific fonts that use simplified or slightly altered (but still readable) glyphs. This is not a problem of Unicode.

What Unicode has done is only to add some characters that were used in the OCR context, such as the symbols on checks that were created and printed specially for OCR systems but had no prior meaning in linguistic or plain-text use. In Unicode these special glyphs are coded as distinct symbols with their own code points.

OCR already has difficulty recognizing accents on modern Latin, Greek or Cyrillic letters, and it does not work well with other scripts. It works with unpointed Hebrew, but fails with Arabic because of the complex joining behavior and the very small differences between glyphs in the most widely used typographic variants of the Arabic script. I don't know whether there have been attempts to recognize Devanagari in India. Hiragana and Katakana may work well in OCR, but Japanese texts generally contain lots of Han ideographs, which are very difficult to recognize with OCR due to their graphic complexity. Maybe there is OCR that works with basic Hangul jamos (written linearly instead of in syllabic squares).

In all these cases, the target encoding when parsing a scanned image of a text is not the issue: the difficulty is in recognizing abstract characters from many distinct glyph shapes that will always exhibit slight variations when scanned from printed paper. So you may want to find out whether there is existing work in India on reading printed Devanagari texts with OCR. Devanagari is difficult to parse too, like Arabic, because glyphs are most often joined, which makes it difficult to separate letters or letter parts.
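
To make the character/glyph distinction concrete, here is a small Python sketch (Python and the helper name `named_chars` are my own choices for illustration, not something from the question). It uses the standard `unicodedata` module to list the symbols in the Optical Character Recognition block, which starts at U+2440, and then shows a Bengali conjunct that is encoded as three code points even though a font will typically draw it as a single joined shape. The conjunct is just an example I picked; the encoding tells you nothing about how many shapes an OCR engine would have to recognize on paper.

```python
import unicodedata


def named_chars(first, last):
    """Return (code point, name) pairs for the assigned characters in a range.

    Hypothetical helper for illustration only.
    """
    result = []
    for cp in range(first, last + 1):
        # unicodedata.name() returns the default (None) for unassigned code points
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            result.append((cp, name))
    return result


# Symbols encoded for OCR use (the block covers U+2440..U+245F).
ocr_symbols = named_chars(0x2440, 0x245F)
for cp, name in ocr_symbols:
    print(f"U+{cp:04X}  {name}")

# One Bengali conjunct as encoded text: KA + VIRAMA (hasant) + SSA.
# The encoding is three abstract characters; the joined glyph is a font matter.
conjunct = "\u0995\u09CD\u09B7"
for ch in conjunct:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

The second loop is the important part for Bangla: whatever an OCR engine recognizes on the page, it would still have to output a sequence like this one, and mapping joined shapes back to that sequence is the hard part, not the choice of encoding.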

