Stefan Groschupf wrote:
Hi Sami, Hi all,
I like the language identifier very much, but we noticed that it slows
the indexing process down by a factor of 3.
This may be a problem for people who index very large segments.
I have a set of questions:
+ Can you tell me which corpus you used to generate the ngram files?
http://people.csail.mit.edu/people/koehn/publications/europarl/
plus some hand-collected text for languages not available there.
+ Are there any plans to improve speed by fine tuning the implementation?
I will run some experiments to see how it could best be optimized.
+ Why use vectors instead of array lists?
No particular reason; either would work.
+ Do you think it makes sense to use thresholds? For example, instead of
generating a score for the complete profile, use only the top 10 ngrams
and check against a threshold whether there is a clear best profile. If
the result isn't clear, use 10 more ngrams, and so on.
I think we should try something more basic first.
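
For reference, the threshold idea from the question above could be sketched roughly as follows. This is only an illustration, not the language identifier's actual code: it is written in Python for brevity, and the names (ngrams, build_profile, identify), the frequency-sum scoring, and the step and margin parameters are all assumptions made for the example. It expects at least two profiles.

```python
from collections import Counter


def ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max in text."""
    # Words are joined with "_" so that word boundaries become part of the n-grams.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def build_profile(text):
    """Map each n-gram to its relative frequency in the training text."""
    counts = ngrams(text)
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def identify(text, profiles, step=10, margin=0.2):
    """Score the document's most frequent n-grams in batches of `step`.

    Stops as soon as the best profile leads the runner-up by at least
    `margin` (relative to the best score); otherwise scores `step` more
    n-grams. Returns (language, number_of_ngrams_used).
    Hypothetical scoring: sum of profile frequencies of the n-grams seen.
    """
    ranked = [gram for gram, _ in ngrams(text).most_common()]
    scores = {lang: 0.0 for lang in profiles}
    used = 0
    while used < len(ranked):
        for gram in ranked[used:used + step]:
            for lang, profile in profiles.items():
                scores[lang] += profile.get(gram, 0.0)
        used = min(used + step, len(ranked))
        best, second = sorted(scores.values(), reverse=True)[:2]
        if best > 0 and (best - second) / best >= margin:
            break  # clear winner, no need to look at more n-grams
    return max(scores, key=scores.get), used
```

With toy profiles such as build_profile("aaaa aaaa") and build_profile("bbbb bbbb"), a call like identify("aaa aa", profiles, step=2) stops after the first batch because one profile is already clearly ahead. Whether this saves time in practice depends on how expensive full-profile scoring is compared to extracting and ranking the n-grams.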
--
Sami Siren
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers