I have just been looking at Jakarta Commons HttpClient (http://jakarta.apache.org/commons/httpclient/), as I would like to refactor HttpResponse to use it (it will make the isTruncate flag easier to implement for this fetcher).
This is what Heritrix uses too. Perhaps Nutch should switch to it.
I vote for that with every vote I have. ;-)
From my point of view, we should concentrate on things that nobody else has implemented yet.
The archive.org people understand how to crawl, and their knowledge is built into Heritrix. Furthermore, it is cleanly implemented and uses Maven, CruiseControl and JMX. ;-)
At least it allows plugging in custom crawl content processors, so we should plug our post-processing in there.
I guess the archive.org people would be interested in collaborating, and both projects would take a step forward.
Stefan
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
