Sami Siren wrote:
I notice that there are two loops in the getSimilarity() method.
But I don't really understand why you use two loops, Sami. (In fact, I don't understand why you first compare anotherProfile to currentProfile, and then compare currentProfile to anotherProfile.)


This was implemented to make the similarity calculation symmetric, a.getSimilarity(b) == b.getSimilarity(a), but I guess this is not really a requirement and it might slow things down slightly.
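
To make the two-loop idea concrete, here is a minimal sketch of why a second pass gives symmetry. It is not the actual getSimilarity() code: the class name, the onePass() helper, and the use of a Map of normalized frequencies are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class SymmetricSimilarity {

    // Hypothetical helper: one directed pass. For every n-gram in `from`,
    // add the absolute difference to its frequency in `to`
    // (an n-gram missing from `to` counts as frequency 0).
    // N-grams that occur only in `to` are never visited here.
    static double onePass(Map<String, Double> from, Map<String, Double> to) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : from.entrySet()) {
            double other = to.getOrDefault(e.getKey(), 0.0);
            sum += Math.abs(e.getValue() - other);
        }
        return sum;
    }

    // Two passes, a -> b and b -> a, so that n-grams unique to either
    // profile contribute and distance(a, b) == distance(b, a).
    // Lower values mean more similar profiles.
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        return onePass(a, b) + onePass(b, a);
    }
}
```

With a single pass the result depends on which profile is iterated, because n-grams present only in the other profile are skipped. Dropping the second loop roughly halves the work, at the cost of that asymmetry.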

I implemented very similar code during my PhD, so I have a little experience with language identification using n-grams.
I think it really makes sense to use thresholds, because the most relevant n-grams are the first ones, and in most cases they could be sufficient to identify the language (perhaps there will be a need to normalize the n-gram files against each other by removing n-grams duplicated in many files?).


I think we must run some experiments on a baseline corpus and perform some comparative benchmarks.
Sami, if you need some help...


Help is always appreciated!

I think the most time-consuming part of the language identifier is splitting the text into n-grams, and probably the biggest optimization could be made there.
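
For reference, a naive version of the splitting step might look like the sketch below. The class and method names are hypothetical and this is not the code in the language identifier; it only shows where the cost comes from: one substring allocation and one map update per position and per n-gram length.

```java
import java.util.HashMap;
import java.util.Map;

public class NGramSplitter {

    // Counts every character n-gram of length minN..maxN in the text
    // using a sliding window. The work grows with
    // text.length() * (maxN - minN + 1).
    static Map<String, Integer> countNGrams(String text, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
            for (int n = minN; n <= maxN && i + n <= text.length(); n++) {
                // substring() allocates a new String for every window
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For example, countNGrams("abab", 2, 3) yields the counts ab=2, ba=1, aba=1, bab=1.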

Perhaps there could be a configurable variable to set the maximum text length to be analyzed. A minimum limit could also be defined, because with a small number of n-grams the performance (as in quality) is very poor.
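
A rough sketch of such limits, under stated assumptions: the class name and both parameters are hypothetical, and the minimum is expressed in characters as a simple proxy for "too few n-grams".

```java
import java.util.Optional;

public class AnalysisLimits {

    private final int minChars; // below this, skip identification entirely
    private final int maxChars; // above this, analyze only a prefix

    AnalysisLimits(int minChars, int maxChars) {
        if (minChars < 0 || maxChars < minChars) {
            throw new IllegalArgumentException("need 0 <= minChars <= maxChars");
        }
        this.minChars = minChars;
        this.maxChars = maxChars;
    }

    // Returns the part of the text to analyze, or empty when the text is
    // too short for the result to be trusted.
    Optional<String> prepare(String text) {
        if (text.length() < minChars) {
            return Optional.empty();
        }
        if (text.length() > maxChars) {
            return Optional.of(text.substring(0, maxChars));
        }
        return Optional.of(text);
    }
}
```

The caller would run n-gram extraction only on the returned prefix, so the cost of splitting is bounded by the configured maximum no matter how long the document is.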

I'll also do some experiments to see how speed could be improved.

I remember reading somewhere about n-gram language detection that taking the first 512 characters of the text is usually sufficient, but I can't recall where I read it... That process used n-gram profiles built from 2- to 5-grams, and each profile was limited to the 300 most frequent n-grams.
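
The profile cutoff could be sketched roughly as follows. This is only an illustration of "keep the N most frequent n-grams": the class and method names are hypothetical, the input is assumed to be a map of raw n-gram counts (taken from the first 512 characters, if that limit is applied), and the alphabetical tie-break is my own choice to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ProfileCutoff {

    // Returns at most `limit` n-grams, most frequent first.
    // Ties are broken alphabetically so the output is stable.
    static List<String> topNGrams(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

With the numbers above, the call would be topNGrams(counts, 300). A smaller profile also means fewer entries to visit in each similarity loop.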




--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


