On Tue, 2007-06-26 at 22:05 +0100, Vivek Rai wrote:
> > (c) Vastly improve existing spell-checking dictionaries, as these
> >     rules are not of much use without an adequate dictionary. One
> >     way is to have someone type in dictionaries that are out of
> >     copyright.
> I understand aspell is based on a list of valid words, right? How about
> this as a quick way to generate a starting version of such a list?
> a) Store any public domain text in Hindi/Oriya in a text file. Select
> inputs with reliable spellings.
> b) Create a script that splits the file into words on separate lines,
> runs sort -u, and so generates a sorted list of unique words.
> c) Each of us runs this script on any public domain local language text
> that we find, and uploads our lists to the indlinux website.
> A master list can then be built from these?

We have done broadly what you suggest on several texts, and along with
input from other sources, we now have a pretty comprehensive list of
Hindi words, which should number about 30-40K by now. The problem is
to have it proof-read. At the time of proof-reading, we should also
have people add affix information for aspell. I will post a note about
this soon, and we can talk about a web interface to let people easily
do the proof-reading, and add affix information.
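
For anyone who wants to repeat the word-list step on a new text, here
is a minimal sketch in Python. It is not the script we actually used:
the names WORD_RE and build_word_list are made up for illustration, and
it assumes Hindi only, i.e. the Devanagari block U+0900 to U+097F with
the dandas, digits and abbreviation sign left out. Oriya would need its
own range (U+0B00 to U+0B7F).

```python
import re
import unicodedata

# Assumption: a "word" is a run of Devanagari letters and signs.
# U+0964/U+0965 (dandas), U+0966-U+096F (digits) and U+0970-U+0971
# (abbreviation sign, high spacing dot) are deliberately excluded.
WORD_RE = re.compile("[\u0900-\u0963\u0972-\u097F]+")


def build_word_list(text):
    """Return the unique Devanagari words in text, sorted by code point."""
    # Normalise first so the same word is not counted twice just
    # because two sources encode it differently.
    text = unicodedata.normalize("NFC", text)
    return sorted(set(WORD_RE.findall(text)))
```

The result is ordered by code point, which is roughly what sort -u
gives in the C locale. Writing one word per line produces a file that
can be merged with other people's lists by running sort -u again.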

The other thing I am having a summer intern work on is building a page
scraper in Python that will crawl web pages, and grab text within a
specified Unicode range. As this would have to be proof-read for
validity, I see this as being more beneficial for (a) getting an idea
of common mis-spellings, (b) building a corpus in various domains, and
(c) as a snapshot of how the language evolves, and how new words come
in. Maybe tie this to Newsrack (http://newsrack.in).
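
The core of such a scraper, grabbing text within a Unicode range, might
look roughly like the sketch below. This is not the intern's code; the
names TextExtractor and grab_text_in_range are hypothetical, the default
range is the Devanagari block, and crawling (fetching pages, following
links) is left out entirely. It uses only the Python standard library.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def grab_text_in_range(html, start=0x0900, end=0x097F):
    """Return runs of characters whose code points lie in [start, end]."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Build a character class covering the requested code point range.
    pattern = re.compile(
        "[%s-%s]+" % (re.escape(chr(start)), re.escape(chr(end)))
    )
    runs = []
    for chunk in parser.chunks:
        runs.extend(pattern.findall(chunk))
    return runs
```

Note that the default range includes punctuation such as the danda, so
the output is raw material for proof-reading rather than a clean word
list.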

> This wouldn't be perfect, and we will still have to keep adding to it
> manually, weeding out any misspellings from our sources etc., but I
> think this could still be a good start.
> My further suggestion is that we now start conversing more in Hindi on
> this mailing list, so that we have as much Hindi text available to us
> as possible.

You are right, but there are several difficulties with this. For one,
my Hindi is weak, so it takes me longer to write in Hindi. Many people
have even more trouble with it than I do. So my suggestion would be
that people write in whichever language they prefer, and that replies
may come in a different language too.


ilugd mailinglist -- ilugd@lists.linux-delhi.org
Archives at: http://news.gmane.org/gmane.user-groups.linux.delhi 
