On 10:29 Sat 18 Mar, Laurent Godard wrote:
> > You probably recall that I have web-crawled corpora and hence
> > frequency lists for 200+ languages as part of the gramadoir project -
> > if you get something up and running I can do some testing with these.
>
> Great! I'm actually thinking about the evaluation part of the result.
> Does any metric exist that would evaluate the quality of an affix file?
> Is compression ratio a good one? Are there any more linguistic ones?
I don't know of such a metric, but I'll think about it a bit. As a start,
you can probably do well by comparing your generated affix files with the
30+ hand-written ones that exist. If you can automatically produce
something that approximates these, then I would trust it for new
languages.

> > Note that the important thing here (in my view) is to get
> > something *linguistically* meaningful - if the goal is to merely
> > compress the word list one can just run munchlist to find candidate
> > affixes.
>
> Hmm. Does this "munchlist" exist?

I misspoke - I meant "findaffix", not "munchlist". Both programs are
distributed as part of the standard ispell package. You can find their
manpages online by Googling either term.

"findaffix" takes a raw word list and outputs a list of potential
prefixes or suffixes - it is, in a sense, the "easy" part of what you are
trying to do. It doesn't attempt to gather these candidate affixes
together into classes that can be placed under a single affix flag. This
second step is a matter of analyzing which affixes are applicable to
similar collections of words.

> Something linguistically meaningful for a general tool that doesn't
> depend on the language will be difficult, no?

Yes, definitely. But this is, after all, the stated goal of the
Linguistica project and probably not completely out of reach. At least
one would hope you can produce something that a human could post-edit
into a good, linguistically meaningful affix file.

> One extension, if there is time, would be to guess some new words from
> the affix file, check whether these words exist (by querying Google,
> for example), and then propose them as new words for the spellchecker.
> But this will probably be a second part of the job.

Yes, this is more or less the code I've got built into the web crawler
already. All I need is an affix file as input. The issue with using
Google directly (vs. my corpora) is that you can never be sure you're
getting hits for a word in your target language.
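The two steps described above - extracting candidate affixes, then gathering those that apply to similar collections of words into one class - can be sketched roughly as follows. This is a toy approximation, not findaffix itself: `candidate_suffixes` and `affix_classes` are invented names, and it only handles suffixes whose stripped stem is itself a word in the list.

```python
# Step 1: propose candidate suffixes by splitting each word at every
# point where the remaining stem is itself a word in the list.
# Step 2: collect suffixes that attach to the same set of stems into
# one class (a crude stand-in for a shared affix flag).
from collections import defaultdict

def candidate_suffixes(words, min_count=2):
    wordset = set(words)
    stems_by_suffix = defaultdict(set)
    for w in words:
        for i in range(1, len(w)):
            stem, suffix = w[:i], w[i:]
            if stem in wordset:          # stem occurs as a bare word
                stems_by_suffix[suffix].add(stem)
    # keep only suffixes seen on several stems
    return {s: st for s, st in stems_by_suffix.items()
            if len(st) >= min_count}

def affix_classes(stems_by_suffix):
    classes = defaultdict(list)
    for suffix, stems in stems_by_suffix.items():
        classes[frozenset(stems)].append(suffix)
    return [(sorted(stems), sorted(sfx))
            for stems, sfx in classes.items()]

words = ["walk", "walks", "walked", "talk", "talks", "talked"]
cands = candidate_suffixes(words)
# "s" and "ed" both attach to {"walk", "talk"}, so they land in
# one class together:
print(affix_classes(cands))  # -> [(['talk', 'walk'], ['ed', 's'])]
```

Real affix induction has to cope with stem changes (e.g. "try"/"tries") and stems that never occur bare, which this sketch ignores entirely.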
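The extension discussed above - generating unseen forms from the affix file and keeping only the ones attested somewhere - might be sketched like this. The affix-class format and the `propose_new_words` name are invented for illustration, and a frequency dictionary stands in for the corpora or Google hit counts.

```python
# Expand a toy affix class over its stems, then keep only the
# generated forms that (a) are not already in the dictionary and
# (b) are attested in a corpus frequency list.

def propose_new_words(known_words, affix_class, corpus_freq, min_freq=1):
    stems, suffixes = affix_class
    proposals = []
    for stem in stems:
        for sfx in suffixes:
            form = stem + sfx
            if form not in known_words and corpus_freq.get(form, 0) >= min_freq:
                proposals.append(form)
    return proposals

known = {"walk", "walks", "talk", "talked"}
cls = (["walk", "talk"], ["s", "ed"])         # one toy affix class
freq = {"walked": 40, "talks": 12}            # corpus attestations
print(propose_new_words(known, cls, freq))    # -> ['walked', 'talks']
```

With real corpora one would also want a frequency threshold well above 1, since web-crawled lists contain plenty of typos that happen to look like valid inflections.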
-Kevin
