Mead, it just depends on how many languages you load into the system!
Collecting the data is not a huge problem: I'm using news websites in 19 languages! The quality of the content is usually high and they "talk" a lot!

Watch out, though: the real problem is the encoding. You want to be sure every document ends up in the same one!

Hope this helps,
Luca

On Mon, Oct 24, 2011 at 3:29 AM, Mead Lai <laiqi...@gmail.com> wrote:

> Luca,
>
> I would like to know: how many languages can your system identify?
> In my view, the difficult part of your system is how to collect so many
> languages/characters in the world for *one person*...
>
> Regards,
> Mead
>
>
> On Sun, Oct 23, 2011 at 1:27 AM, Petite Abeille <petite_abei...@me.com> wrote:
>
> >
> > On Oct 22, 2011, at 2:49 AM, Luca Rondanini wrote:
> >
> > > I usually use Nutch for this but, just for fun, I tried to create a
> > > language identifier based on Lucene only.
> >
> > Talking of which:
> >
> > Google's Compact Language Detector
> >
> > http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
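
On the encoding point above, here is a minimal sketch of what "everything using the same encoding" could look like in practice. It is an illustration only, not the code used in the system discussed in this thread. The function name `to_unicode`, the `declared` parameter (e.g. the charset a page advertises in its HTTP headers), and the fallback order are all assumptions made for the example.

```python
import unicodedata
from typing import Optional


def to_unicode(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw page bytes into one canonical form (hypothetical helper).

    Tries the declared charset first (if any), then UTF-8, then cp1252.
    Falls back to latin-1, which accepts every byte value.
    The result is normalized to Unicode NFC so that equivalent
    character sequences compare equal.
    """
    candidates = [declared] if declared else []
    candidates += ["utf-8", "cp1252"]  # assumed fallback order

    for encoding in candidates:
        try:
            text = raw.decode(encoding)
            break
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown charset name: try the next candidate.
            continue
    else:
        text = raw.decode("latin-1")  # last resort, cannot fail

    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    page = "caffè".encode("cp1252")  # bytes from a legacy-encoded page
    print(to_unicode(page))
```

Decoding everything to one internal form before indexing means the language profiles are built from the same character sequences regardless of how each site served its pages. The fallback guesses can be wrong for charsets that share byte ranges, so a declared charset should be preferred whenever one is available.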