On Sat 18 Mar at 10:29, Laurent Godard wrote:
> > You probably recall that I have web-crawled corpora and hence
> > frequency lists for 200+ languages as part of the gramadoir project -
> > if you get something up and running I can do some testing with these.
> >
> 
> great! I'm currently thinking about how to evaluate the result.
> Does any metric exist that would evaluate the quality of an affix
> file? Is compression ratio a good one? Any more linguistic ones?

I don't know of such a metric, but I'll think about it a bit.

As a start, you can probably do well by comparing your generated
affix files with the 30+ hand-written ones that exist.
If you can automatically produce something that approximates these,
then I would trust it for new languages.

> > Note that the important thing here (in my view) is to get
> > something *linguistically* meaningful - if the goal is to merely
> > compress the word list one can just run munchlist to find candidate
> > affixes.   
> 
> hmm. Does this "munchlist" exist?

I misspoke - I meant "findaffix", not "munchlist". 
Both programs are distributed as part of the standard ispell
package.  You can find their manpages online by
Googling either term.   

"findaffix" takes a raw word list and outputs a list
of potential prefixes or suffixes - it is, in a sense,
the "easy" part of what you are trying to do.  
It doesn't attempt to gather these candidate affixes together
into classes that can be placed under a single affix flag.
This second step is a matter of analyzing which affixes are applicable to
similar collections of words.
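As a toy sketch of both steps in Python - candidate extraction (the
findaffix-like "easy" part) followed by grouping stems that take the
same set of endings into candidate flag classes. The word list,
thresholds, and function names here are all made up for illustration;
the real findaffix is considerably more sophisticated:

```python
from collections import defaultdict

def candidate_suffixes(words, min_stem=3, min_count=2):
    """Count how often each ending occurs after a plausible stem."""
    counts = defaultdict(int)
    for w in words:
        for i in range(min_stem, len(w)):
            counts[w[i:]] += 1
    return {s: n for s, n in counts.items() if n >= min_count}

def affix_classes(words, suffixes, min_stem=3):
    """Group stems by the exact set of suffixes they take; each group
    of endings is a candidate for a single affix flag."""
    stem_sufs = defaultdict(set)
    for w in words:
        for s in suffixes:
            if w.endswith(s) and len(w) - len(s) >= min_stem:
                stem_sufs[w[:len(w) - len(s)]].add(s)
    classes = defaultdict(list)
    for stem, sufs in stem_sufs.items():
        classes[frozenset(sufs)].append(stem)
    return classes

words = ["walks", "walked", "walking", "talks", "talked", "talking", "cat"]
sufs = candidate_suffixes(words)
for cls, stems in affix_classes(words, sufs).items():
    print(sorted(cls), sorted(stems))
```

With this toy input, "walk" and "talk" end up in the same class
because they take the same endings {s, ed, ing} - that shared class
is what would go under one affix flag. The output is noisy (spurious
short stems also form classes), which is exactly why the second step
needs real analysis rather than raw counting.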

> Getting something linguistically meaningful from a general tool that
> doesn't depend on the language will be difficult, no?

Yes, definitely.  But this is, after all, the stated goal of
the Linguistica project and probably not completely out of reach. 
At least one would hope you can produce something that a human could
post-edit into a good, linguistically meaningful affix file.

> One extension, if time permits, would be to use the affix file to
> guess some new words, check whether these words exist (by querying
> Google, for example), and then propose them as new words for the
> spellchecker. But this will probably be a second part of the job.

Yes, this is more or less the code I've got built into the
web crawler already.   All I need is an affix file as input.

The issue with using Google directly (vs. my corpora) is that you
can never be sure you're getting hits for a word in your target language.
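A minimal sketch of that checking step in Python, using a corpus
frequency list rather than Google. The rule format, data, and helper
names are invented for illustration (a real ispell/myspell affix
entry carries more fields and conditions than this):

```python
def apply_suffix(stems, strip, add, condition):
    """Apply one suffix rule: if a stem ends with `condition`,
    strip `strip` characters and append `add`."""
    guesses = []
    for stem in stems:
        if stem.endswith(condition):
            guesses.append(stem[:len(stem) - strip] + add)
    return guesses

def attested(guesses, freq_list, min_freq=5):
    """Keep only guesses that occur often enough in the crawled
    corpus, so rare junk hits don't become dictionary entries."""
    return [w for w in guesses if freq_list.get(w, 0) >= min_freq]

stems = ["carry", "try", "walk"]
freq = {"carries": 120, "tries": 80, "walkies": 1}  # toy frequency list
guesses = apply_suffix(stems, strip=1, add="ies", condition="y")
print(attested(guesses, freq))  # "walk" never matched the condition
```

The frequency threshold is the corpus-side stand-in for "does this
word get Google hits", with the advantage that every count is known
to come from the target language.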

-Kevin

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
