e w wrote:
> If someone could explain the reasoning/motivation behind the original

The current n-gram identifier in Nutch works reasonably well for most
Western languages. It is also a very simple and quite fast way of
identifying a document's language. However, if the charset of the
document is not detected correctly, the results are not that good.
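
To make the idea concrete, here is a minimal sketch of character n-gram
identification. It is not the Nutch code: the rank-order "out-of-place"
distance, the tiny training samples, and all names (ngram_profile,
out_of_place, identify, SAMPLES, PROFILES) are assumptions chosen purely
for illustration. It also shows why a wrong charset hurts: text decoded
with the wrong charset produces n-grams that the language profile has
never seen.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Return the most frequent character n-grams (1..n_max), best first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by count (descending), then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; unseen n-grams cost a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )

def identify(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Hypothetical, far too small training samples (illustration only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the "
          "language of the text that we want to identify",
    "fi": "tämä on suomenkielinen teksti jonka kieli halutaan tunnistaa "
          "ja se sisältää paljon ääkkösiä",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

def shared_with(text, lang):
    """Count how many of the text's n-grams the language profile knows."""
    return len(set(ngram_profile(text)) & set(PROFILES[lang]))

# UTF-8 bytes wrongly decoded as Latin-1: every "ä" turns into two junk chars.
clean = "sisältää paljon ääkkösiä"
garbled = clean.encode("utf-8").decode("latin-1")
```

With these samples, identify("the language of the text", PROFILES) selects
the English profile, and shared_with(garbled, "fi") is lower than
shared_with(clean, "fi"), since no n-gram that contains a mis-decoded
character can match the Finnish profile.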

> identification method that would be helpful. Otherwise, I'd be happy to
> contribute my pseudo-NB hack and maybe even implement the correct version.

Go ahead and attach it to JIRA. I am sure there are plenty of people
interested in such a thing.

--
 Sami Siren


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general