I just uploaded a patch that adds a language identifier plugin to Nutch.

http://sourceforge.net/tracker/index.php?func=detail&aid=982263&group_id=59548&atid=491356

Identification tries the following sources in order and stops at the first one that yields a language:

1. html element (HTML only, HTML 4.0 "lang" attribute)
2. meta tags (HTML only: http-equiv, dc.language)
3. HTTP header (Content-Language)
4. if all of the above fail, "statistical analysis"

Steps 1 and 2 run during the fetching phase; steps 3 and 4 run during the indexing phase.
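
The fallback order above can be sketched roughly as follows. This is only an illustration and not the code from the patch: the class, the method names and the StatisticalIdentifier interface are all hypothetical, and the sketch collapses the fetch-time and index-time steps into a single call for brevity.

```java
import java.util.Optional;

// Hypothetical names throughout; this is NOT the plugin's actual code.
final class LanguageCascade {

    /** Stand-in for the statistical identifier (assumed interface). */
    interface StatisticalIdentifier {
        String identify(String text);
    }

    // A null or blank value means "this source did not declare a language".
    private static Optional<String> declared(String value) {
        if (value == null) {
            return Optional.empty();
        }
        String trimmed = value.trim();
        return trimmed.isEmpty() ? Optional.empty() : Optional.of(trimmed);
    }

    /**
     * Returns the first declared language, checking the "lang" attribute,
     * then the meta tags, then the Content-Language header, and only then
     * falling back to statistical analysis of the page text.
     */
    static String identify(String htmlLangAttr,
                           String metaLanguage,
                           String contentLanguageHeader,
                           String text,
                           StatisticalIdentifier fallback) {
        Optional<String> found = declared(htmlLangAttr);
        if (!found.isPresent()) {
            found = declared(metaLanguage);
        }
        if (!found.isPresent()) {
            found = declared(contentLanguageHeader);
        }
        // The fallback is only invoked when no source declared a language.
        return found.orElseGet(() -> fallback.identify(text));
    }
}
```

The point of the ordering is that the cheap, explicitly declared values win, and the statistical step only runs when nothing was declared.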

Currently supported languages (in "statistical analysis") are
da, de, el, en, es, fi, fr, it, nl, sv and pt. The corpus was taken from
http://www.isi.edu/~koehn/europarl/ and the profiles were built with
the tool supplied in the patch.
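
To give an idea of what a per-language profile can look like, here is a minimal sketch. It assumes character trigram frequencies, which is one common technique for this kind of task; this message does not say which statistical method the patch uses, so treat the approach and every name below (TrigramProfile, score, best) as hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Assumption: character trigram frequencies. The patch may work differently.
final class TrigramProfile {
    // Relative frequency of each trigram seen in the training text.
    private final Map<String, Double> frequency = new HashMap<>();

    TrigramProfile(String trainingText) {
        Map<String, Integer> counts = count(trainingText);
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            frequency.put(e.getKey(), e.getValue() / (double) total);
        }
    }

    // Counts overlapping three-character windows of the lowercased text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Higher means the text shares more frequent trigrams with the profile.
    double score(String text) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : count(text).entrySet()) {
            sum += e.getValue() * frequency.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    // Returns the key of the best-scoring profile, or null if there are none.
    // On a tie, whichever profile is iterated first is kept.
    static String best(Map<String, TrigramProfile> profiles, String text) {
        String bestLang = null;
        double bestScore = -1.0;
        for (Map.Entry<String, TrigramProfile> e : profiles.entrySet()) {
            double s = e.getValue().score(text);
            if (s > bestScore) {
                bestScore = s;
                bestLang = e.getKey();
            }
        }
        return bestLang;
    }
}
```

In practice each profile would be trained on a large corpus per language rather than on a sentence or two, and short inputs will be unreliable with a scheme this simple.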

After indexing, the language can be found in the field named "lang".

It's not 100% accurate, but it's a start.

--
Sami Siren

