Doug Cutting wrote:

Language id can already be reasonably done by an indexing filter. I don't see any advantage to moving it here. Am I missing something?

And stemming and other word-based operations should be performed in Lucene Analyzers, while indexing. Nutch does not yet permit a plugin here, but this might eventually make sense.

When I was thinking about localized stemming (and actually implemented something), I did it with a TokenFilter. But the point is that the language was known before indexing (before creating the Document or adding the fields). Isn't that (adding fields) the point where analyzers/filters are applied?


But if the language is identified in an indexing filter, how can that information be used to
select a localized stemmer (or other language-specific data) for the fields already added?
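To illustrate the timing problem, here is a minimal, self-contained sketch (hypothetical class and method names, toy suffix-stripping rules standing in for real Snowball stemmers): the stemmer must be chosen from the detected language *before* field values are analyzed, which is exactly what an indexing filter running after the fields are added cannot do.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of per-language stemmer selection. In a real
// setup the chosen stemmer would sit inside a Lucene TokenFilter;
// here a plain String function stands in for it.
public class LocalizedStemmerSelector {

    // Toy "stemmers" for illustration only -- they just strip one
    // well-known suffix each, unlike a real stemming algorithm.
    private static final Map<String, UnaryOperator<String>> STEMMERS = Map.of(
        "en", token -> token.endsWith("ing") ? token.substring(0, token.length() - 3) : token,
        "fi", token -> token.endsWith("ssa") ? token.substring(0, token.length() - 3) : token
    );

    // The language code must already be known when this is called,
    // i.e. before the Document's fields are analyzed.
    public static UnaryOperator<String> stemmerFor(String lang) {
        // Unknown languages fall back to a no-op stemmer.
        return STEMMERS.getOrDefault(lang, UnaryOperator.identity());
    }

    public static void main(String[] args) {
        System.out.println(stemmerFor("en").apply("running")); // prints "runn"
        System.out.println(stemmerFor("xx").apply("talossa")); // prints "talossa"
    }
}
```

The point of the sketch is the call order: `stemmerFor(lang)` runs before any token is processed, so language identification done afterwards, in an indexing filter, comes too late to influence it.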


So is whitespace normalization alone enough to justify this? I wonder if instead parser implementations might just use a utility class that removes excess whitespace...

No, it is not justified :)

--
Sami Siren


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
