Doug Cutting wrote:

Language id can already be reasonably done by an indexing filter. I don't see any advantage to moving it here. Am I missing something?

And stemming and other word-based operations should be performed in Lucene Analyzers, while indexing. Nutch does not yet permit a plugin here, but this might eventually make sense.

When I was thinking about localized stemming (and actually implemented something), I did it with a TokenFilter. But the point is that the language was known before indexing (before creating the Document or adding the fields). Isn't that (adding fields) the point where analyzers/filters are applied?


But if the language is identified in an indexing filter, how can that information be used to
select a localized stemmer (or other language-specific data) for the fields already added?
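To illustrate the timing problem, here is a minimal, self-contained sketch (hypothetical class and method names, toy suffix-stripping rules standing in for real Snowball stemmers): the stemmer must be chosen from the detected language *before* field values are analyzed, which is exactly what an indexing filter running after the fields are added cannot do.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of per-language stemmer selection. In a real
// setup the chosen stemmer would sit inside a Lucene TokenFilter;
// here a plain String function stands in for it.
public class LocalizedStemmerSelector {

    // Toy "stemmers" for illustration only -- they just strip one
    // well-known suffix each, unlike a real stemming algorithm.
    private static final Map<String, UnaryOperator<String>> STEMMERS = Map.of(
        "en", token -> token.endsWith("ing") ? token.substring(0, token.length() - 3) : token,
        "fi", token -> token.endsWith("ssa") ? token.substring(0, token.length() - 3) : token
    );

    // The language code must already be known when this is called,
    // i.e. before the Document's fields are analyzed.
    public static UnaryOperator<String> stemmerFor(String lang) {
        // Unknown languages fall back to a no-op stemmer.
        return STEMMERS.getOrDefault(lang, UnaryOperator.identity());
    }

    public static void main(String[] args) {
        System.out.println(stemmerFor("en").apply("running")); // prints "runn"
        System.out.println(stemmerFor("xx").apply("talossa")); // prints "talossa"
    }
}
```

The point of the sketch is the call order: `stemmerFor(lang)` runs before any token is processed, so language identification done afterwards, in an indexing filter, comes too late to influence it.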


So is whitespace normalization alone enough to justify this? I wonder if instead parser implementations might just use a utility class that removes excess whitespace...

No, it is not justified :)

--
Sami Siren


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
