Harri,

Thanks for the clarification.

You wrote:

>> But the important thing I wanted to write is about a tool for dictionary creation. I was recently granted funding to work on such tool for three months during next summer (this is something along the lines of Google's Summer of Code but the funding is from Finnish companies to Finnish students). This tool is needed because we still do not have a high quality word list for Finnish. And this list is rather complicated thing to build because we need to collect certain metadata about each word that describes how the words are inflected. <<

There is a very simple and effective method to do that.

1. First, build a web corpus. That is, identify large pages that contain Finnish text. The easily downloadable Wikipedia is a good first source.

2. Create a word list from the downloaded pages using scripts (awk/perl, sort, etc.). The hard part is the de-affixing, which can also be done with awk/perl scripts.

3. Using your existing word collection, build a collection of Finnish compound words from your word list. To do that you must use the compounding capabilities of Hunspell, and your word list must contain almost all the non-compound Finnish words. That is probably around 100 thousand words, which is not that many.

4. Check the word list with search engines. They help you a lot with the dirty work.

5. Finally, check the word list manually, separating the words that are good according to the search engine from those that are bad according to it.

Now you have a comprehensive, good-quality Finnish word collection. I have a lot of experience with this, so you can ask at any time if you need details.

Regards:
Eleonora
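
P.S. To make step 2 more concrete, here is a minimal sketch of the word list creation. It is written in Python rather than the awk/perl mentioned above, purely for illustration. The names `WORD_RE`, `build_word_list` and `min_count` are my own assumptions, as is the tokenization rule (runs of letters, optionally joined by a hyphen or apostrophe). It does not attempt the de-affixing.

```python
import re
from collections import Counter

# Assumed tokenization rule: runs of Unicode letters, optionally joined
# by a single hyphen or apostrophe. Digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def build_word_list(text, min_count=1):
    """Return (word, count) pairs, most frequent first, ties sorted alphabetically."""
    counts = Counter(m.group(0).lower() for m in WORD_RE.finditer(text))
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Tiny made-up sample standing in for the downloaded corpus text.
sample = "Talo on suuri. Talossa on kissa, ja kissa on talon katolla."
word_list = build_word_list(sample)

for word, n in word_list:
    print(n, word)
```

Note that "talo", "talon" and "talossa" come out as three separate entries. They are inflected forms of the same word, which is exactly why the de-affixing is the hard part. Raising `min_count` is one simple way to drop rare typos before the manual check.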
