Re: Comments? (Re: [Nutch-dev] [ nutch-Bugs-989511 ] Patch to reduce whitespace in Summary)

Sami Siren Wed, 14 Jul 2004 05:55:10 -0700

Sami Siren wrote:

Andrzej Bialecki wrote:
I re-worked and applied the patch according to your suggestions. I also created some additional tests, among others to test for whitespace processing. Please check out the latest CVS version and see if it works for you.

Works like a dream!


The only drawback is that it works for HTML only. There are now at least two
cases where an additional extension point would give Nutch a boost in the parsing
(or post-parsing) phase.

It would be great if a new or modified extension point allowed us to add
filters which have access to the textual content of a document, no matter whether
the original was HTML, PDF, DOC or whatever.

Whitespace removal could be done with one plugin for all (text) formats. Another use case would be the language identifier (or some other sort of categorizer). It would be possible to do n-gram language identification already at that point, which in turn would open up the possibility of using localized stop-word, profanity or other lists, stemmers etc. at later stages.
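
To make the idea concrete, here is a minimal sketch of what such a format-independent filter could look like. The names TextFilter, WhitespaceFilter and applyAll are hypothetical and are not part of the existing Nutch plugin API; the sketch only assumes that every parser (HTML, PDF, DOC, ...) hands over its extracted text as a plain String.

```java
import java.util.ArrayList;
import java.util.List;

public class TextFilterSketch {

    /** Hypothetical extension point: sees only the extracted plain text,
     *  regardless of the original document format. Not a real Nutch interface. */
    interface TextFilter {
        String filter(String text);
    }

    /** Example plugin: collapses each run of whitespace into a single space
     *  and trims leading and trailing whitespace. */
    static class WhitespaceFilter implements TextFilter {
        public String filter(String text) {
            return text.replaceAll("\\s+", " ").trim();
        }
    }

    /** Runs the registered filters in order; each one receives the output
     *  of the previous one. */
    static String applyAll(List<TextFilter> filters, String text) {
        String result = text;
        for (TextFilter f : filters) {
            result = f.filter(result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<TextFilter> filters = new ArrayList<>();
        filters.add(new WhitespaceFilter());

        // Stand-in for text produced by any parser plugin.
        String parsed = "  Nutch \n\n is   a\tsearch engine  ";
        System.out.println(applyAll(filters, parsed));
    }
}
```

A language identifier or other categorizer could plug into the same chain, although it would probably need to return metadata alongside the text rather than only a String.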

What do you gentlemen think about this?

--
Sami Siren
