e w wrote:
> If someone could explain the reasoning/motivation behind the original
> identification method that would be helpful.

The current n-gram identifier in Nutch works reasonably well for most Western languages. It is also a very simple and quite fast way of identifying a document's language. However, if the document's charset is not detected correctly, the results are not that good.

> Otherwise, I'd be happy to
> contribute my pseudo-NB hack and maybe even implement the correct version.

Go ahead and attach it to JIRA. I am sure there are plenty of people interested in such a thing.

--
Sami Siren
