> For example, in [1] the letter ব ("ba") is used frequently, but is written with a fancier script where it has an extra line through it.
I just realized that my interpretation here was wrong; that letter is actually the Assamese ৰ, and the document is in Assamese. I assume the OCR algorithm isn't aware of Assamese-only characters in this case.

-Manish

On Tue, Feb 7, 2017 at 9:38 PM, Manish Goregaokar <man...@mozilla.com> wrote:

> The very first one কিী (0995 09BF 09C0) had 1090 hits and shows up in a
> book of short stories:
>
> That's bad OCR: that's an apostrophe, a Ka, and an E, with the apostrophe
> being interpreted as a matra somehow.
>
> I bet there are only a couple of OCR algorithms out there handling Bangla.
> Indic scripts aren't something you can OCR glyph by glyph in such a
> straightforward way due to ligatures, so these algorithms are probably
> noticing components of a character and producing it. It sees a preceding
> line and the curve above, and interprets that as an I. It also sees the
> following line and curve above, and interprets that as an EE. It then just
> puts the two together. It shouldn't, but it does.
>
> Given a small set of OCR algorithms I think it's reasonable to assume that
> such aberrations would be common across outputs -- so hundreds of hits for
> a typo doesn't sound out of the ordinary to me.
>
> Tried a random one: ঘিা (0998 09BF 09BE)
>
> I went through the results for ঘিা (0998 09BF 09BE). Most occurrences are
> actually ঘন্টা (0998 09A8 09CD 099F 09BE), "ghanta", which can mean "hour"
> or "bell". Reasonably common word. These documents don't look scanned --
> the text isn't garbled or anything, but it could be a cleaned-up scanned
> document because I copied out some more of the text and there were similar
> aberrations all over the place. For example, in [1] the letter ব ("ba") is
> used frequently, but is written with a fancier script where it has an extra
> line through it. Many occurrences of it have been interpreted as sequences
> of vowel diacritics. The last line of the second-last stanza on page 5 has
> an absolutely ridiculous number of consecutive diacritics in the PDF text.
>
> [1]: http://yousigma.com/religionandphilosophy/poojasloka/Sri%20Hari%20Kathamruta%20Sara%20Datta%20Swatantrya%20Sandhi%20(Sri%20Jagannatha%20Vittala%20Dasaru)%20-%20Assamese.pdf
>
> -Manish
>
> On Tue, Feb 7, 2017 at 7:53 PM, Asmus Freytag <asm...@ix.netcom.com>
> wrote:
>
>> On 2/7/2017 10:08 AM, Eric Muller wrote:
>>
>> In looking at the wiki{pedia,book.source,tionary} corpus for Bengali, I
>> see a relatively large number of syllables with <... 09BF 09BE> or <...
>> 09BF 09C0>. I checked a couple of sources, and I did not find them listed
>> anywhere as being normally used.
>>
>> Are they in normal use or are those all typos?
>>
>> Tried a random one: ঘিা (0998 09BF 09BE) and got 385 hits in Google.
>> Would surprise me if all of these were typos.
>>
>> The very first one কিী (0995 09BF 09C0) had 1090 hits and shows up in a
>> book of short stories:
>>
>> where it starts a paragraph.
>>
>> A./
>>
>> I did not find any occurrence in the Assamese corpus.
>>
>> Thanks,
>> Eric.
>>
>> The syllables (o is the number of occurrences):
>>
>> <string s='কিী' o='198'/>
>> <string s='ক্তিা' o='262'/>
>> <string s='ক্রিা' o='447'/>
>> <string s='ক্রিী' o='77'/>
>> <string s='ক্লিা' o='245'/>
>> <string s='ক্ষিী' o='161'/>
>> <string s='ক্সিা' o='138'/>
>> <string s='খিা' o='949'/>
>> <string s='গিা' o='2671'/>
>> <string s='গিী' o='250'/>
>> <string s='গ্নিা' o='57'/>
>> <string s='গ্নিী' o='110'/>
>> <string s='গ্রিা' o='143'/>
>> <string s='ঘিা' o='83'/>
>> <string s='ঙ্কিা' o='403'/>
>> <string s='ঙ্গিা' o='267'/>
>> <string s='ঙ্গিী' o='150'/>
>> <string s='চিা' o='905'/>
>> <string s='চিী' o='135'/>
>> <string s='চ্চিা' o='91'/>
>> <string s='চ্ছিা' o='323'/>
>> <string s='ছিা' o='712'/>
>> <string s='ছিী' o='61'/>
>> <string s='জিা' o='527'/>
>> <string s='জিী' o='140'/>
>> <string s='জ্জিা' o='56'/>
>> <string s='ঝিা' o='81'/>
>> <string s='ঞিা' o='71'/>
>> <string s='ঞ্চিা' o='175'/>
>> <string s='ঞ্জিা' o='270'/>
>> <string s='ঞ্জিী' o='316'/>
>> <string s='টিা' o='807'/>
>> <string s='টিী' o='586'/>
>> <string s='ঠিা' o='549'/>
>> <string s='ঠিী' o='89'/>
>> <string s='ড়িা' o='1361'/>
>> <string s='ড়িী' o='135'/>
>> <string s='ডিা' o='257'/>
>> <string s='ঢ়িা' o='71'/>
>> <string s='ণিা' o='354'/>
>> <string s='তিী' o='270'/>
>> <string s='তি্যু' o='75'/>
>> <string s='ত্তিা' o='143'/>
>> <string s='ত্তিী' o='144'/>
>> <string s='ত্ত্বিা' o='54'/>
>> <string s='ত্বিা' o='72'/>
>> <string s='ত্মিা' o='161'/>
>> <string s='ত্যিা' o='129'/>
>> <string s='ত্রিা' o='217'/>
>> <string s='ত্রিী' o='264'/>
>> <string s='ত্ৰিা' o='102'/>
>> <string s='থিা' o='290'/>
>> <string s='থিী' o='127'/>
>> <string s='দিী' o='514'/>
>> <string s='দ্ধিা' o='228'/>
>> <string s='দ্বিা' o='505'/>
>> <string s='দ্বিী' o='121'/>
>> <string s='দ্যিা' o='53'/>
>> <string s='ধিী' o='235'/>
>> <string s='নিী' o='551'/>
>> <string s='ন্তিা' o='100'/>
>> <string s='ন্ত্রিা' o='93'/>
>> <string s='ন্ত্রিী' o='171'/>
>> <string s='ন্দিা' o='102'/>
>> <string s='ন্দ্রিা' o='238'/>
>> <string s='ন্দ্রিী' o='79'/>
>> <string s='ন্ধিা' o='109'/>
>> <string s='ন্মিা' o='98'/>
>> <string s='পিা' o='1199'/>
>> <string s='প্তিা' o='67'/>
>> <string s='প্রিা' o='203'/>
>> <string s='ফিা' o='174'/>
>> <string s='ফ্রিা' o='60'/>
>> <string s='বিী' o='715'/>
>> <string s='ব্রিা' o='87'/>
>> <string s='ভিা' o='908'/>
>> <string s='ভিী' o='80'/>
>> <string s='মিী' o='373'/>
>> <string s='ম্পিা' o='55'/>
>> <string s='ম্বিা' o='117'/>
>> <string s='ম্মিা' o='67'/>
>> <string s='যিা' o='204'/>
>> <string s='রিা' o='4703'/>
>> <string s='র্ণিা' o='55'/>
>> <string s='র্তিী' o='56'/>
>> <string s='র্বিা' o='105'/>
>> <string s='র্মিা' o='68'/>
>> <string s='র্মিী' o='70'/>
>> <string s='র্ষিা' o='65'/>
>> <string s='লিী' o='419'/>
>> <string s='ল্পিী' o='113'/>
>> <string s='শিী' o='216'/>
>> <string s='শ্বিা' o='145'/>
>> <string s='ষিা' o='376'/>
>> <string s='ষ্টিা' o='269'/>
>> <string s='ষ্ট্যিা' o='75'/>
>> <string s='ষ্ঠিী' o='99'/>
>> <string s='সিা' o='760'/>
>> <string s='সিী' o='117'/>
>> <string s='স্কিা' o='106'/>
>> <string s='স্ট্রিী' o='157'/>
>> <string s='স্তিা' o='311'/>
>> <string s='স্তিী' o='50'/>
>> <string s='স্থিা' o='1946'/>
>> <string s='স্বিা' o='97'/>
>> <string s='স্মিা' o='74'/>
>> <string s='হিী' o='424'/>
>> <string s='হ্যিা' o='89'/>
>> <string s='ৰিী' o='204'/>
>> <string s='ৰ্ত্তিা' o='125'/>
>> <string s='ৰ্ত্তিী' o='118'/>
>> <string s='ৰ্ম্মিা' o='58'/>
>> <string s='ৱিা' o='264'/>
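
For anyone who wants to reproduce this kind of count on their own text, here is a minimal sketch of a scan for syllables containing <... 09BF 09BE> or <... 09BF 09C0>. It is not the tool that produced the list above; the names `SUSPECT` and `count_suspect_syllables` are made up for illustration, and the syllable model is a deliberate simplification (a consonant cluster joined by virama, with an optional nukta, followed directly by the two vowel signs).

```python
import re
from collections import Counter

# Assumption: a "consonant" is anything in U+0995..U+09B9, the precomposed
# nukta letters U+09DC, U+09DD, U+09DF, or the Assamese letters U+09F0, U+09F1.
CONSONANT = "[\u0995-\u09B9\u09DC\u09DD\u09DF\u09F0\u09F1]"
NUKTA = "\u09BC?"    # optional BENGALI SIGN NUKTA
VIRAMA = "\u09CD"    # BENGALI SIGN VIRAMA (hasanta)

# Consonant cluster, then VOWEL SIGN I (U+09BF) immediately followed by
# VOWEL SIGN AA (U+09BE) or VOWEL SIGN II (U+09C0).
SUSPECT = re.compile(
    f"(?:{CONSONANT}{NUKTA}{VIRAMA})*{CONSONANT}{NUKTA}\u09BF[\u09BE\u09C0]"
)

def count_suspect_syllables(text):
    """Count each syllable that carries I+AA or I+II (hypothetical helper)."""
    return Counter(match.group(0) for match in SUSPECT.finditer(text))

if __name__ == "__main__":
    sample = (
        "\u0998\u09BF\u09BE "                  # gha + I + AA (suspect)
        "\u0995\u09BF\u09C0 "                  # ka + I + II (suspect)
        "\u0998\u09A8\u09CD\u099F\u09BE "      # "ghanta", a normal word
        "\u0995\u09CD\u09B0\u09BF\u09BE "      # ka + virama + ra + I + AA (suspect)
        "\u0998\u09BF\u09BE"                   # gha + I + AA again
    )
    for syllable, occurrences in count_suspect_syllables(sample).most_common():
        print(f"<string s='{syllable}' o='{occurrences}'/>")
```

A scan like this only says that the sequence occurs; it cannot tell a typo from an OCR artifact, so the hits still have to be checked against the source documents as was done above.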