Hi!

I was actually going to answer Laurent's message about an affix file generator for Hunspell, but then I thought it might be good to give a little background about who I am and what is going on around Hunspell and Finnish spellchecking these days.
I am a physics student (so I do not have any linguistic expertise) who first got involved in the Finnish localization of OOo by packaging Soikko (the closed-source but freely distributable spellchecker and hyphenator that has traditionally been the only usable alternative for Finnish) into a UNO component that could be installed into OOo version 2.0. We have long hoped for a truly free alternative to Soikko, and several groups have tried and failed to produce such a program during the past few years.

In August 2005 we started a project [1] to create Finnish dictionary and affix files for Hunspell. Originally we tried to write the affix file by hand, but for various reasons that turned out to be impractical. Therefore, a few months ago I started writing a Python script that would make our job easier. The current version of the script is available on SourceForge [2] (module tools) under the GPL for anyone who is interested. It is very much unfinished, and I do not think it will be of much use for anything other than Finnish or other languages that need to deal with certain complications in word inflection, such as consonant gradation and vowel harmony. So it most likely is not what Laurent is looking for. But it is able to generate an affix file that works quite well with 90% of Finnish non-compound nouns, and the remaining 10% would also work if the data files were complete.

Less than a month ago a surprising thing happened that made a lot of that work irrelevant for us: we got a code donation from a guy who had quietly been working for over three years on formulating Finnish word formation rules using a research tool called Malaga [3]. This code was donated under the GPL (which is the licence of Malaga as well), which of course leads to certain complications that the LGPL does not have.
But since the Malaga-based implementation is so much more advanced than our previous Hunspell-based work, it seems sensible to finish that and then, if there still is any need or interest, return to reimplement the same functionality using Hunspell. Also, one thing that OOo's linguistic tools could not have provided in their current form was hyphenation. This is because hyphenating Finnish requires morphological analysis of the word (there are even words that can be hyphenated differently depending on their interpretation), and therefore hyphenation and spellchecking cannot sensibly be done using separate programs. So the fact that we will have to build a separate spellchecker/hyphenator for our language is not necessarily a bad thing.

But the important thing I wanted to write about is a tool for dictionary creation. I was recently granted funding to work on such a tool for three months next summer (this is something along the lines of Google's Summer of Code, but the funding is from Finnish companies to Finnish students). This tool is needed because we still do not have a high-quality word list for Finnish. And this list is a rather complicated thing to build, because for each word we need to collect certain metadata that describes how the word is inflected. This is a non-trivial job even for Finnish-speaking people and will require proofreading the dictionary several times by people from different areas of Finland, because we have different dialects that affect people's opinions about the correct inflections.

I would like to know if there are others who would benefit from such a program. The plan is to have a web interface to the vocabulary database that would allow browsing and controlled, concurrent editing of the word list and related metadata, and some sort of "voting system" for people to mark words that they think are incorrectly spelled or that should be moved to another inflection class.
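To make the idea a bit more concrete, here is a rough Python sketch of what one entry in such a word list could look like. This is purely illustrative and not a planned design: the class `WordEntry`, the field names, the class labels such as "noun-1" and the `MISSPELLED` marker are all made up for this example. The only outside fact it relies on is that a Hunspell .dic line has the form word/flags.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical marker for "I think this word is spelled incorrectly".
MISSPELLED = "MISSPELLED"


@dataclass
class WordEntry:
    """One word in the shared word list (hypothetical structure)."""
    word: str
    inflection_class: str      # made-up class label, e.g. "noun-1"
    affix_flags: str = ""      # Hunspell affix flags, if any are used
    # voter name -> proposed inflection class, or MISSPELLED
    votes: dict = field(default_factory=dict)

    def vote(self, voter, proposal):
        """Record one vote per voter; a later vote replaces the earlier one."""
        self.votes[voter] = proposal

    def disputed(self, threshold=2):
        """Return the proposals that at least `threshold` voters agree on."""
        counts = Counter(self.votes.values())
        return {p: n for p, n in counts.items() if n >= threshold}

    def to_dic_line(self):
        """Format the entry as a Hunspell .dic line: 'word' or 'word/flags'."""
        if self.affix_flags:
            return f"{self.word}/{self.affix_flags}"
        return self.word


entry = WordEntry("talo", "noun-1", affix_flags="A")
entry.vote("anna", "noun-2")
entry.vote("pekka", "noun-2")
entry.vote("liisa", MISSPELLED)
print(entry.disputed())      # proposals with at least two supporters
print(entry.to_dic_line())
```

A real tool would of course keep this in a database and handle concurrent edits there; the sketch only shows how the inflection metadata, the votes and the Hunspell flags could sit side by side in one record.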
We already have a tool that people can use to suggest missing words to be added to the dictionary. That could (and will) be integrated into this new tool too. So if you know language teams that would like to use something like this, please let me know. This would be most useful for languages that are still missing a proper word list and/or need to have complicated metadata stored within it. But it could be used for simpler things as well. For example, Hunspell affix flags could be stored directly in the database.

Harri

[1] http://www.hunspell-fi.org/
[2] http://cvs.sourceforge.net/viewcvs.py/hunspell-fi/
[3] http://home.arcor.de/bjoern-beutel/malaga/
