Harri,

Thanks for the clarification.

You wrote:
>>
But the important thing I wanted to write is about a tool for dictionary 
creation. I was recently granted funding to work on such tool for three 
months during next summer (this is something along the lines of Google's 
Summer of Code but the funding is from Finnish companies to Finnish 
students). This tool is needed because we still do not have a high quality 
word list for Finnish. And this list is rather complicated thing to build 
because we need to collect certain metadata about each word that describes 
how the words are inflected. 
<<

There is a very simple and effective method for doing that.
1. First, build a web corpus.
   That is, identify large pages that contain text in Finnish. The easily
downloadable Wikipedia is a good place to start.
2. Create a word list from the downloaded pages using scripts such as awk/perl,
sort, etc. The hard part is the de-affixing, which can also be done with
scripts such as awk/perl.
3. Using your existing word collection, build a collection of Finnish compound
words from your word list. To do that you must use the compounding
capabilities of hunspell, and your word list must contain almost all the
non-compound Finnish words. That is probably around 100 thousand words, which
is not so many.
4. Check the word list with search engines; they help you a lot with the dirty
work.
5. Finally, check the word list manually, looking separately at the words that
are good according to the search engine and at the ones that are bad according
to it.

Now you have a comprehensive and good quality Finnish word collection.
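
To make step 2 a bit more concrete, here is a minimal sketch in Python rather
than awk/perl. It counts the words in a piece of text and strips a handful of
case endings. The names (count_words, strip_suffix, candidate_stems) and the
tiny SUFFIXES list are only my assumptions for illustration; real de-affixing
for Finnish needs many more endings and has to deal with consonant gradation,
so treat this as a starting point, not a finished tool.

```python
import re
from collections import Counter

# Runs of letters only, including the Finnish letters å, ä and ö.
WORD_RE = re.compile(r"[a-zåäö]+")

# Hypothetical, deliberately tiny sample of Finnish case endings.
# A real list would be much longer; this one is for illustration only.
SUFFIXES = ("ssa", "ssä", "sta", "stä", "lla", "llä", "lta", "ltä", "lle")


def count_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    return Counter(WORD_RE.findall(text.lower()))


def strip_suffix(word, min_stem=3):
    """Remove the first matching ending if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def candidate_stems(text):
    """Merge the frequencies of word forms that reduce to the same stem."""
    stems = Counter()
    for word, n in count_words(text).items():
        stems[strip_suffix(word)] += n
    return stems
```

Sorting the result of candidate_stems by frequency gives a list where the
rare entries (often typos or foreign words) end up at the bottom, which is
where the manual checking pays off most.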

I have a lot of experience with that, you can ask at any time if you need 
details.

Regards: Eleonora
