Sami Siren wrote:
Andrzej Bialecki wrote:
works like a dream!
I re-worked and applied the patch according to your suggestions. I also created some additional tests, among others to test for whitespace processing. Please check out the latest CVS version and see if it works for you.
The only drawback is that it works for HTML only. There are now at least two cases where an additional(?) extension point would give Nutch a boost in the parsing (or post-parsing) phase.
It would be great if a new or modified extension point allowed us to add filters which have access to the textual content of a document, no matter whether the original was HTML, PDF, DOC or whatever.
Whitespace removal could then be done with one plugin for all (text) formats.
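To make the idea a bit more concrete, here is a rough sketch of what such a format-independent filter could look like. All names in it (TextFilter, WhitespaceNormalizer, TextFilterChain) are hypothetical and are not existing Nutch interfaces; it only assumes that every parser hands over its extracted text as a plain String together with the original content type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point, NOT an existing Nutch API: a filter that sees
// only the extracted text, whatever the original document format was.
interface TextFilter {
    String filter(String text, String contentType);
}

// One plugin for all text formats: collapses every run of whitespace
// into a single space and trims both ends.
class WhitespaceNormalizer implements TextFilter {
    @Override
    public String filter(String text, String contentType) {
        return text.replaceAll("\\s+", " ").trim();
    }
}

// Applies the registered filters in order, after the format-specific parser.
class TextFilterChain {
    private final List<TextFilter> filters = new ArrayList<>();

    TextFilterChain add(TextFilter filter) {
        filters.add(filter);
        return this;
    }

    String apply(String text, String contentType) {
        String result = text;
        for (TextFilter filter : filters) {
            result = filter.filter(result, contentType);
        }
        return result;
    }
}
```

With something like this, the same chain could be run on the output of the HTML, PDF or DOC parser without the filters knowing which one produced the text.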
Another use case would be the language identifier (or some other sort of categorizer). It would be possible to do n-gram language identification already at that point, which in turn would open up the possibility of using localized stop-word, profanity or other lists, stemmers etc. at later stages.
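As an illustration of the n-gram idea only, a toy sketch follows. It assumes character trigrams and a simple "count the shared trigrams" score; the class name NGramLanguageGuesser and its methods are made up for this example and do not describe any existing Nutch code. A real identifier would use larger profiles and a proper ranking or probability measure.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy character-trigram language guesser (hypothetical, for illustration only).
class NGramLanguageGuesser {
    // language code -> trigram counts built from a training sample
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Lower-cases the text, normalizes whitespace and counts every 3-character window.
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    void train(String lang, String sample) {
        profiles.put(lang, trigrams(sample));
    }

    // Returns the language whose profile shares the most distinct trigrams
    // with the text, or null if no profile has been trained.
    String guess(String text) {
        Map<String, Integer> doc = trigrams(text);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> entry : profiles.entrySet()) {
            int score = 0;
            for (String gram : doc.keySet()) {
                if (entry.getValue().containsKey(gram)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Because it only needs plain text, a guesser like this could run right after parsing, and later stages could pick stop-word lists or stemmers based on the language it reports.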
What do you gentlemen think about this?
-- Sami Siren
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
