I'd like to solicit more comments on the impact of this solution, before going forward. I can apply the other simple whitespace-related change, though...


Why don't we remove the whitespace once, at the parsing stage (for HTML the right place is perhaps DomContentUtils), instead of over and over again when creating fragments, as Doug(?) pointed out earlier? The getText method in DomContentUtils also has another problem: it sometimes incorrectly concatenates strings (words) when they are separated only by HTML tags and no whitespace.
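
To make the idea concrete, here is a rough sketch of what I mean. It is not the actual DomContentUtils code: the class name TextExtractSketch and its getText/parse helpers are made up for illustration, and it uses the JDK's XML parser on a well-formed snippet rather than the real HTML parser. It collapses whitespace once while walking the DOM and puts a single space between adjacent text nodes, so words separated only by tags are not glued together.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TextExtractSketch {

    // Hypothetical helper, NOT Nutch's DomContentUtils.getText:
    // returns the text under root with whitespace already normalized.
    static String getText(Node root) {
        StringBuilder sb = new StringBuilder();
        collect(root, sb);
        return sb.toString();
    }

    private static void collect(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // Collapse runs of whitespace once, here, at extraction time.
            String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                // Separate this text node from the previous one so that
                // "<p>first</p><p>second</p>" does not become "firstsecond".
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
            return;
        }
        // Recurse into children; comments and other leaf nodes add nothing.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, sb);
        }
    }

    // Assumption: well-formed XML input, parsed with the JDK parser for the demo.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<div><p>first</p><p>second   word</p></div>");
        System.out.println(getText(doc.getDocumentElement()));
    }
}
```

One caveat with this sketch: it always inserts a space between text nodes, so inline markup in the middle of a word (e.g. foo<b>bar</b>) would be split. A real fix would probably need to add the separator only around block-level elements.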


--
Sami Siren


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
