[Nutch-dev] Re: lang identifier and nutch analyzer in trunk

Andrzej Bialecki Tue, 24 Jan 2006 03:43:01 -0800

Jérôme Charron wrote:

We're going back to the old discussion - most web pages out there either
don't have these tags at all or, if they do, the tags contain wrong
values ... so I think this policy is not going to give the best results.


Yes, I know, Andrzej; it was just to explain to Jack how it actually works.

Ok.

IMHO we should always try to guess the language if we have enough text,
unless we can be sure that we are dealing with properly marked documents
(not such an uncommon case on intranets).


I think we should have something like the MimeType detection:
If metadata is found, check that its value is plausible given the score
of that language (statistical analysis).
If the score is too low, or no metadata is found, perform a full
statistical analysis.
No?

Yes :-)
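
For illustration, the policy agreed on above could be sketched roughly as
follows. This is only a sketch in Python, not Nutch code: choose_language,
the two callables score_for and identify, and both threshold values are
hypothetical and do not appear in this thread. The statistical scorer itself
is left abstract.

```python
from typing import Callable, Optional

# Hypothetical thresholds; the thread does not give any concrete values.
MIN_TEXT_LENGTH = 200     # below this, treat the text as too short to guess from
MIN_CONFIRM_SCORE = 0.5   # score the declared language needs to be trusted


def choose_language(
    declared: Optional[str],
    text: str,
    score_for: Callable[[str, str], float],
    identify: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Pick a language: verify the metadata first, else detect statistically.

    declared  -- language taken from the page metadata, or None if absent
    score_for -- assumed scorer: (text, language) -> score in [0, 1]
    identify  -- assumed full statistical detection: text -> language or None
    """
    # Too little text for statistics: the metadata (if any) is all we have.
    if len(text) < MIN_TEXT_LENGTH:
        return declared

    # Metadata found: accept it only if the statistics agree well enough.
    if declared is not None and score_for(text, declared) >= MIN_CONFIRM_SCORE:
        return declared

    # Score too low, or no metadata at all: run the full analysis.
    guessed = identify(text)
    return guessed if guessed is not None else declared
```

Checking the declared language first keeps the common, correctly tagged case
cheap, while wrongly tagged or untagged pages still go through full detection.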


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




