Hi all, I usually use Nutch for this but, just for fun, I tried to create a language identifier based on Lucene only.
I had a really small set of "training data": 10 files (roughly 2 MB each), one per language, for 10 languages. I indexed those files using an NGram analyzer. I have to say I was not expecting much... but the results seem amazing! What do you think of this approach? Besides the good results, which could easily be false positives given the small training set, I wonder if maybe I am misunderstanding Lucene's VSM!

I'll try to increase the training data and see what happens when I change the n-gram size too. I'll keep you posted; meanwhile, let me know what you think.

Thanks,
Luca
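P.S. For anyone curious about why this works at all, here is a minimal sketch of the underlying idea — character n-gram profiles compared with cosine similarity, which is roughly what Lucene's vector space model computes over the indexed n-gram terms. This is plain Python rather than Lucene, and the tiny inline "profiles" are stand-ins for the ~2 MB-per-language training files:

```python
from collections import Counter
import math

def ngrams(text, n=3):
    """Character n-gram profile of a text, as a term-frequency Counter."""
    text = " " + text.lower() + " "   # pad so word boundaries become n-grams
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram profiles (the VSM score)."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy "training" profiles -- real ones would come from the 2 MB files.
profiles = {
    "en": ngrams("the quick brown fox jumps over the lazy dog and runs away"),
    "it": ngrams("la volpe veloce salta sopra il cane pigro e poi scappa via"),
}

def identify(text):
    """Return the language whose profile is closest to the input text."""
    return max(profiles, key=lambda lang: cosine(profiles[lang], ngrams(text)))

print(identify("the brown dog runs away"))   # prints: en
```

In the Lucene version, each training file becomes one document in the index, the NGram analyzer produces the terms, and the text to identify is turned into a query against that index — the top-scoring document's language is the answer.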