Hi!

I was actually going to answer Laurent's message about an affix file
generator for Hunspell, but then I thought that it might be good to give a little
background about who I am and what is going on around Hunspell and Finnish 
spellchecking these days.

I am a physics student (so I do not have any linguistic expertise) who first 
got involved in the Finnish localization of OOo by packaging Soikko (the 
closed-source but freely distributable spellchecker and hyphenator that has 
traditionally been the only usable alternative for Finnish) into a UNO 
component that could be installed into OOo version 2.0. But we have long
hoped to have a truly free alternative to Soikko, and several groups have
tried and failed to produce such a program during the past few years. In
August 2005 we started a project [1] to create Finnish dictionary
and affix files for Hunspell. Originally we tried to write the affix file by 
hand but for various reasons that turned out to be impractical. Therefore a 
few months ago I started writing a Python script that would make our job 
easier.

The current version of the script is available on SourceForge [2] (module
tools) under the GPL for anyone who is interested. It is very much unfinished
and I do not think that it will be of much use for anything other than
Finnish or other languages that need to deal with certain complications in
word inflection, such as consonant gradation and vowel harmony. So it most
likely is not the thing Laurent is looking for. But it is able to generate an
affix file that works quite well with 90% of Finnish non-compound nouns, and
the remaining 10% would also work if the data files were complete.
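
To give an idea of what vowel harmony means for suffix generation, here is a
minimal Python sketch. It is not taken from the script on SourceForge; the
names harmonize and inessive are made up for illustration, and the rule is
deliberately simplified: it assumes a native, non-compound word and ignores
consonant gradation and stem changes entirely.

```python
# Simplified Finnish vowel harmony: a suffix uses the back vowel "a" if the
# word contains any back vowel (a, o, u) and the front vowel "ä" otherwise.
# Assumption: native non-compound words only; loanwords and compounds need
# more careful handling than this.
BACK_VOWELS = set("aou")


def harmonize(suffix_template: str, word: str) -> str:
    """Replace the placeholder 'A' in the template with 'a' or 'ä'."""
    has_back_vowel = any(ch in BACK_VOWELS for ch in word.lower())
    return suffix_template.replace("A", "a" if has_back_vowel else "ä")


def inessive(word: str) -> str:
    """Attach the inessive ending -ssa/-ssä (no gradation is applied)."""
    return word + harmonize("ssA", word)


if __name__ == "__main__":
    for w in ("talo", "kylä", "tie"):
        print(w, "->", inessive(w))
```

A real generator has to emit both variants of every suffix as separate affix
rules and combine them with the gradation patterns, which is where writing
the affix file by hand becomes impractical.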

Less than a month ago a surprising thing happened that made a lot of that
work irrelevant for us: we got a code donation from a guy who had quietly
been working on formulating Finnish word formation rules by using a research
tool called Malaga [3] for over three years. This code was donated under the
GPL (which is the licence of Malaga as well), which of course leads to
certain complications that the LGPL does not have. But since the Malaga-based
implementation is so much more advanced than our previous Hunspell-based work,
it seems sensible to finish that and then, if there still is any need or
interest, return to reimplement the same functionality using Hunspell. Also
one thing that OOo's linguistic tools could not have provided in their 
current form was hyphenation. This is because hyphenating Finnish requires 
morphological analysis of the word (there are even words that can be
hyphenated differently depending on their interpretation) and therefore 
hyphenation and spellchecking cannot be sensibly done using separate 
programs. So the fact that we will have to build a separate 
spellchecker/hyphenator for our language is not necessarily a bad thing.

But the important thing I wanted to write is about a tool for dictionary 
creation. I was recently granted funding to work on such a tool for three
months next summer (this is something along the lines of Google's
Summer of Code but the funding is from Finnish companies to Finnish 
students). This tool is needed because we still do not have a high quality 
word list for Finnish. And this list is a rather complicated thing to build
because we need to collect certain metadata about each word that describes
how the word is inflected. This is a non-trivial job even for
Finnish-speaking people and will require proofreading the dictionary several
times by people from different areas of Finland, because we have different
dialects that affect people's opinions about the correct inflections. I would
like to know if there are others who would benefit from such a program. The plan is to
have a web interface to the vocabulary database that would allow browsing and
controlled, concurrent editing of the word list and related metadata, plus
some sort of "voting system" for people to mark words that they think are
incorrectly spelled or that should be moved to another inflection class. We
already have a tool that people can use to suggest missing words to be added
to the dictionary. That could (and will) be integrated into this new tool too.
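
As a rough illustration of the voting idea, the sketch below shows one way
the per-word data could be modelled in Python. Nothing here is designed yet:
the class WordEntry, its fields, the inflection class names and the
two-vote threshold are all hypothetical, chosen only to make the idea
concrete.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WordEntry:
    """One dictionary word with its metadata and the votes cast on it."""
    word: str
    inflection_class: str  # hypothetical class label, e.g. "class-1"
    misspelling_votes: set = field(default_factory=set)  # voter ids
    class_votes: dict = field(default_factory=dict)  # voter id -> class

    def flag_misspelled(self, voter: str) -> None:
        # A set ensures each voter is counted at most once.
        self.misspelling_votes.add(voter)

    def propose_class(self, voter: str, new_class: str) -> None:
        # A later vote by the same voter overwrites the earlier one.
        self.class_votes[voter] = new_class

    def leading_proposal(self, min_votes: int = 2):
        """Return the most-voted class if it differs from the current one
        and has at least min_votes supporters, otherwise None."""
        counts = Counter(self.class_votes.values())
        if not counts:
            return None
        candidate, votes = counts.most_common(1)[0]
        if votes >= min_votes and candidate != self.inflection_class:
            return candidate
        return None
```

Keying the votes by voter keeps dialect disagreements visible: a proofreader
can see who proposed which inflection class instead of only a total.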

So if you know of language teams that would like to use something like this,
please let me know. This would be most useful for languages that are still
missing a proper word list and/or need to have complicated metadata stored
with it. But it could be used for simpler things as well. For example,
Hunspell affix flags could be stored directly in the database.
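
For the simple case, exporting such a database to a Hunspell dictionary file
could look roughly like the Python sketch below. It relies only on the basic
.dic layout (a word count on the first line, then one word per line with
optional flags after a slash). The function name to_dic_lines and the flag
letters in the example are made up, and the database is stood in for by a
plain dict.

```python
def to_dic_lines(entries: dict) -> list:
    """Turn {word: flags} into the lines of a Hunspell .dic file.

    The first line is the number of entries; each following line is
    either "word/FLAGS" or just "word" when no flags are set.
    """
    lines = [str(len(entries))]
    for word, flags in sorted(entries.items()):
        lines.append(f"{word}/{flags}" if flags else word)
    return lines


if __name__ == "__main__":
    # Hypothetical rows; "A" and "B" are placeholder affix flags.
    rows = {"talo": "AB", "kissa": ""}
    print("\n".join(to_dic_lines(rows)))
```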

Harri

[1] http://www.hunspell-fi.org/
[2] http://cvs.sourceforge.net/viewcvs.py/hunspell-fi/
[3] http://home.arcor.de/bjoern-beutel/malaga/
