Yes. But how can we be sure that the first 20 or 512 characters of a document are in the same language as the whole document?
I think the language identifier must process the whole document to clearly identify its main language.
This seems like it would be a good configuration option. Folks who want to do a better job of language identification can set it higher, so that more text is analyzed.
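To make the idea concrete, here is a rough sketch of what such an option might look like. It is not existing Nutch code: the property name `lang.analyze.max.length`, the class name, and the default of 512 are all assumptions made for illustration. A non-positive value is taken to mean "analyze the whole document", which would cover the case raised above.

```java
import java.util.Properties;

public class LanguageSampleConfig {

    // Hypothetical property name, not an existing Nutch setting.
    static final String ANALYZE_LENGTH_KEY = "lang.analyze.max.length";

    // Assumed default, taken from the 512-character figure discussed above.
    static final int DEFAULT_ANALYZE_LENGTH = 512;

    // Read the configured limit, falling back to the default when the
    // property is missing or is not a valid integer.
    static int analyzeLength(Properties conf) {
        String raw = conf.getProperty(ANALYZE_LENGTH_KEY);
        if (raw == null) {
            return DEFAULT_ANALYZE_LENGTH;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_ANALYZE_LENGTH;
        }
    }

    // Return the part of the text that would be handed to the identifier.
    // A non-positive limit means the whole document is analyzed.
    static String sample(String text, int maxLength) {
        if (maxLength <= 0 || text.length() <= maxLength) {
            return text;
        }
        return text.substring(0, maxLength);
    }
}
```

With something like this, a site that cares more about accuracy than speed could raise the limit, or set it to 0 to have the whole document analyzed.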
Doug
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
