Sami, very cool! Great work!

I think it is a good idea to store the language in the index as well. Just as a note: I hope we can add unlimited metadata in general very soon.
I think Doug is working on it. (?)
Sorry to ask, but do you plan to contribute JUnit tests for your code as well? ;)


Stefan

On 29.06.2004 at 23:17, Sami Siren wrote:

I just uploaded a patch that adds a language identifier plugin to Nutch.

http://sourceforge.net/tracker/index.php?func=detail&aid=982263&group_id=59548&atid=491356

The process of identification is as follows:

1. html (html only, HTML 4.0 "lang" attribute)
2. meta tags (html only, http-equiv, dc.language)
3. http header (Content-Language)
4. if all of the above fail, "statistical analysis"

Steps 1 and 2 run during the fetching phase; steps 3 and 4 run during the indexing phase.
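
The fallback order above can be sketched roughly as follows. This is only an illustration, not the code from the patch: the class name LanguageChain, the method names, and the normalisation of values such as "en-US" to "en" are all assumptions made for the example, and the statistical step is passed in as a plain function.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical sketch of the lookup order; not the plugin's actual classes.
public class LanguageChain {

    // Normalise a raw hint to its primary language code, or return null if unusable.
    // Assumption: "en-US" and "en, fr" are both reduced to "en".
    static String clean(String raw) {
        if (raw == null) {
            return null;
        }
        String trimmed = raw.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return null;
        }
        String first = trimmed.split("[-_,;\\s]")[0];
        return first.isEmpty() ? null : first;
    }

    // Try the hints in the order given in the list, then fall back to statistics.
    public static String identify(String htmlLang,
                                  String metaLang,
                                  String httpContentLanguage,
                                  String text,
                                  Function<String, String> statistical) {
        // Steps 1-3: html "lang" attribute, meta tags, Content-Language header.
        for (String candidate : new String[] {htmlLang, metaLang, httpContentLanguage}) {
            String code = clean(candidate);
            if (code != null) {
                return code;
            }
        }
        // Step 4: only reached when every declared hint is missing or empty.
        return statistical.apply(text);
    }
}
```

In this sketch a declared language always wins over the statistical guess, which matches the order of the list; whether the patch also cross-checks declared values against the text is not stated here.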

Currently supported languages (in "statistical analysis") are
da, de, el, en, es, fi, fr, it, nl, sv and pt. The corpus used was taken from
http://www.isi.edu/~koehn/europarl/ and the profiles were built with
the tool supplied in the patch.
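
The message does not spell out which statistic is used, so the following is only a guess at the general idea: one common approach is to build a character n-gram frequency profile per language from a corpus and pick the profile that overlaps most with the document. The class NGramSketch, the trigram size, and the scoring function are all assumptions for illustration, not the tool from the patch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of profile building and matching with character trigrams.
public class NGramSketch {

    // Build a relative-frequency profile of character trigrams for a text.
    static Map<String, Double> profile(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
            total++;
        }
        Map<String, Double> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), e.getValue() / (double) total);
        }
        return freq;
    }

    // Overlap score: sum of products of the frequencies of shared trigrams.
    static double score(Map<String, Double> language, Map<String, Double> document) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : document.entrySet()) {
            sum += e.getValue() * language.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    // Return the best-matching language code, or null if no profile overlaps at all.
    static String identify(Map<String, Map<String, Double>> profiles, String text) {
        Map<String, Double> document = profile(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double sc = score(e.getValue(), document);
            if (sc > bestScore) {
                bestScore = sc;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

A real profile would be built from a large corpus such as the one mentioned above, one profile per language code; the tiny strings used to try this sketch out are far too small to give reliable results.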

After indexing, the language can be found in the field named "lang".

It's not 100% accurate, but it's a start.

--
Sami Siren


-------------------------------------------------------
This SF.Net email sponsored by Black Hat Briefings & Training.
Attend Black Hat Briefings & Training, Las Vegas July 24-29 - digital self defense, top technical experts, no vendor pitches, unmatched networking opportunities. Visit www.blackhat.com
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers



---------------------------------------------------------------
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:           http://www.weta-group.net
open discussion:    http://www.text-mining.org


