Hi all,

I usually use Nutch for this but, just for fun, I tried to create a language
identifier based on Lucene only.

I had a really small set of "training data": 10 files (roughly 2M each) for
10 languages. I indexed those files using an NGram analyzer.
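For anyone curious, the approach as I understand it is: index one character-n-gram document per language, then run the unknown text as a query and take the language of the top hit, relying on Lucene's vector-space scoring. Here's a minimal Lucene-free sketch of the same idea (character trigrams plus cosine similarity standing in for Lucene's VSM; the sample texts and function names are just made up for illustration):

```python
from collections import Counter
import math

def ngrams(text, n=3):
    # Character n-grams, the same unit a Lucene NGram analyzer would emit.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_profiles(training):
    # training: {language: sample text}; one term-frequency vector per
    # language, standing in for the per-language indexed document.
    return {lang: Counter(ngrams(text)) for lang, text in training.items()}

def identify(text, profiles):
    # Score the query against each language profile with cosine
    # similarity, mirroring Lucene's VSM ranking; highest score wins.
    q = Counter(ngrams(text))
    qnorm = math.sqrt(sum(v * v for v in q.values()))
    best, best_score = None, -1.0
    for lang, p in profiles.items():
        dot = sum(q[g] * p[g] for g in q)
        pnorm = math.sqrt(sum(v * v for v in p.values()))
        score = dot / (qnorm * pnorm) if qnorm and pnorm else 0.0
        if score > best_score:
            best, best_score = lang, score
    return best

# Toy "training data" -- real profiles would come from the 2M files.
profiles = build_profiles({
    "en": "the quick brown fox jumps over the lazy dog and then runs away",
    "it": "la volpe veloce salta sopra il cane pigro e poi scappa via",
})
print(identify("the dog runs over the fox", profiles))
```

With a real index, the cosine loop would simply be a Lucene query over the n-grammed language documents, with tf-idf weighting doing a bit of extra work for free.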

I have to say that I was not expecting much...but the results seem amazing!

What do you think of this approach?

Aside from the good results, which could easily be false positives given the
small training set, I wonder if maybe I am misunderstanding Lucene's VSM!

I'll try increasing the "training" data and see what happens when I change
the NGram size too! I'll keep you posted; meanwhile, let me know what you think.

Thanks,
Luca
