Hello,

Over the last couple of months, I have been working with some people who are implementing an Indian language search engine at http://raftaar.org (the online keyboard on the site presently works only in Microsoft IE, but will soon be ported to Firefox and other open-source browsers). The site uses aspell to provide suggestions for misspelled words. There are still significant issues in making aspell work properly with Indian languages. Here are some thoughts based on the work that has been done on aspell 0.60.3:

1. I would like to volunteer to write a proper C++ interface to aspell. This would include a public interface that exposes only the normal spellchecking facilities in a class, as well as a testing interface that provides access to internals like the scores, weights, and even the costs used for computing edit distance. I already have something that makes the testing part available, but it is rather hacked up. If we can discuss what kind of interface could get accepted into aspell, I would be glad to work on it.

2. I have done some more work on making bindings to aspell available in other programming languages. At present, Python, Perl and C# bindings are available through SWIG. What I would like to do is first build a C++ class-based interface, and use that as the basis for a consistent interface across all languages. Besides the bindings, this would include example programs for using them, as well as GUI implementations in at least one language that provide a front-end to spellchecking and to the testing framework.

3. I see some major stumbling blocks in making aspell work properly with Indian languages. Perhaps the most significant one is that in Indian languages it makes sense to deal with syllables (a clump of consonants, possibly with vowel modifiers) rather than with individual characters. Thus, for example, edit distance operations should work on syllables.
This is a little difficult, though not impossible, to do with the present, non-Unicode, internal functioning of aspell. One way would be to have a function inside score_list() that reconverts to Unicode and works on syllables. However, it seems silly to do this rather than having Unicode throughout. I am aware of Kevin's arguments for retaining the 128-character space used by aspell, but I do not see a clean mechanism for handling complex scripts within such a framework. Comments on this would be appreciated.

4. There are other niceties that would improve spellchecking in Indian languages, such as the use of a morphological analyser to identify the type of the word, and also its gender if it is a noun. This can, however, probably be handled by pre- or post-filters around normal aspell checking.
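
To make the syllable idea in point 3 a bit more concrete, here is a rough Python sketch of what I mean by computing edit distance over syllables rather than characters. It is not aspell code and does not touch score_list(). The names syllables() and edit_distance() are made up for illustration, the segmentation rule is a simplified one for Devanagari only (consonant cluster joined by viramas, plus an optional vowel sign and modifiers), and the costs are plain unit costs rather than aspell's weights.

```python
import re

# Simplified Devanagari classes (an assumption, not a complete analysis).
_CONS = "[\u0915-\u0939\u0958-\u095F]\u093C?"   # consonant, optional nukta
_VIRAMA = "\u094D"
_MATRA = "[\u093E-\u094C\u0962\u0963]"          # dependent vowel signs
_MOD = "[\u0900-\u0903]"                        # candrabindu, anusvara, visarga
_VOWEL = "[\u0904-\u0914\u0960\u0961]"          # independent vowels

# One syllable is either a consonant cluster with optional vowel sign
# (or a trailing virama) and modifiers, or an independent vowel with
# modifiers. Any other character is passed through as its own unit.
_SYLLABLE = re.compile(
    f"(?:{_CONS}(?:{_VIRAMA}{_CONS})*(?:{_VIRAMA}|{_MATRA})?{_MOD}*"
    f"|{_VOWEL}{_MOD}*"
    f"|.)",
    re.DOTALL,
)


def syllables(word):
    """Split a word into a list of syllable-like units."""
    return _SYLLABLE.findall(word)


def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    right, wrong = "कर्म", "कम"
    print(syllables(right), syllables(wrong))
    print("by character:", edit_distance(right, wrong))
    print("by syllable: ", edit_distance(syllables(right), syllables(wrong)))
```

With this segmentation, a misspelling that drops part of a conjunct counts as a single substituted syllable instead of two deleted characters, which is closer to how a reader perceives the error. A real implementation would need per-script rules and weighted costs.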

Please comment on the above issues. I will start preparing a requirements specification for the work outlined above on a Wiki page, and invite people to join in.

Regards,
Gora

_______________________________________________
Aspell-devel mailing list
http://lists.gnu.org/mailman/listinfo/aspell-devel
