Sami, very cool! Great work!
I think it is a good idea to store the language in the index as well. Just as a note, though: I hope we can add unlimited metadata in general very soon.
I think Doug is working on it. (?)
Sorry to ask, but do you plan to contribute JUnit tests for your code as well? ;)
Stefan
On 29.06.2004 at 23:17, Sami Siren wrote:
I just uploaded a patch that adds a language identifier plugin to Nutch.
http://sourceforge.net/tracker/index.php?func=detail&aid=982263&group_id=59548&atid=491356
The process of identification is as follows:
1. html (html only, HTML 4.0 "lang" attribute)
2. meta tags (html only, http-equiv, dc.language)
3. http header (Content-Language)
4. if all of the above fail, "statistical analysis"
Steps 1 and 2 run during the fetching phase; steps 3 and 4 run during the indexing phase.
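The fallback order above can be sketched roughly as follows. This is only an illustration of the ordering, not code from the patch: the class name LanguageFallback, the method identify, and its parameters are hypothetical, and the statistical step is passed as a Supplier so that it is evaluated only when the first three sources yield nothing.

```java
import java.util.function.Supplier;

// Hypothetical sketch of the four-step fallback; names are NOT from the patch.
final class LanguageFallback {

    // A source counts as present only if it is non-null and not blank.
    private static boolean present(String s) {
        return s != null && !s.trim().isEmpty();
    }

    // Returns the first available value in the order:
    // html "lang" attribute, meta tags, Content-Language header, statistical analysis.
    static String identify(String htmlLang, String metaLang, String headerLang,
                           Supplier<String> statistical) {
        if (present(htmlLang)) return htmlLang.trim();
        if (present(metaLang)) return metaLang.trim();
        if (present(headerLang)) return headerLang.trim();
        // Assumed to be the expensive step, so it is called last and lazily.
        return statistical.get();
    }

    public static void main(String[] args) {
        // The meta tag wins over the header when no "lang" attribute is present.
        System.out.println(identify(null, "de", "en", () -> "fi"));
        // Blank values are skipped, so the statistical guess is used.
        System.out.println(identify(null, " ", null, () -> "fi"));
    }
}
```

In the plugin itself the first two values would be collected while fetching and the last two while indexing; the sketch collapses that into a single call for brevity.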
Currently supported languages (in "statistical analysis") are da, de, el, en, es, fi, fr, it, nl, sv and pt. The corpus used was taken from http://www.isi.edu/~koehn/europarl/ and the profiles were built with the tool supplied in the patch.
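The message does not say how the profiles are built or compared. One common technique for this kind of identification is to compare character n-gram frequencies, and the sketch below assumes that approach. The class TrigramProfile, its methods, and the tiny sample texts are hypothetical and are not taken from the patch; a real profile would be built from a large corpus such as the one mentioned above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: character-trigram profiles compared by cosine similarity.
final class TrigramProfile {
    private final Map<String, Double> freq = new HashMap<>();

    // Builds a relative-frequency profile of character trigrams from sample text.
    static TrigramProfile of(String text) {
        TrigramProfile p = new TrigramProfile();
        String s = text.toLowerCase(Locale.ROOT);
        int total = Math.max(0, s.length() - 2);
        for (int i = 0; i + 3 <= s.length(); i++) {
            p.freq.merge(s.substring(i, i + 3), 1.0 / total, Double::sum);
        }
        return p;
    }

    // Cosine similarity between two profiles; 0 if either profile is empty.
    double similarity(TrigramProfile other) {
        double dot = 0, a = 0, b = 0;
        for (Map.Entry<String, Double> e : freq.entrySet()) {
            a += e.getValue() * e.getValue();
            dot += e.getValue() * other.freq.getOrDefault(e.getKey(), 0.0);
        }
        for (double v : other.freq.values()) {
            b += v * v;
        }
        return (a == 0 || b == 0) ? 0 : dot / Math.sqrt(a * b);
    }

    // Returns the code of the profile most similar to the text, or null if none given.
    static String guess(String text, Map<String, TrigramProfile> profiles) {
        TrigramProfile doc = of(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, TrigramProfile> e : profiles.entrySet()) {
            double score = doc.similarity(e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, TrigramProfile> profiles = new HashMap<>();
        profiles.put("en", of("the quick brown fox jumps over the lazy dog and the cat"));
        profiles.put("de", of("der schnelle braune fuchs springt ueber den faulen hund und die katze"));
        System.out.println(guess("the dog and the fox", profiles));
    }
}
```

With profiles this small the guess is only reliable for text that overlaps the samples, which fits the caveat that the result is not always accurate.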
After indexing, the language can be found in the field named "lang".
It's not 100% accurate, but it's a start.
-- Sami Siren
-------------------------------------------------------
This SF.Net email sponsored by Black Hat Briefings & Training.
Attend Black Hat Briefings & Training, Las Vegas July 24-29 - digital self defense, top technical experts, no vendor pitches, unmatched networking opportunities. Visit www.blackhat.com
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
---------------------------------------------------------------
enterprise information technology consulting
open technology: http://www.media-style.com
open source: http://www.weta-group.net
open discussion: http://www.text-mining.org
