Doug Cutting wrote:

Stefan Groschupf wrote:

However, if it is known to be buggy, we should perhaps not make it the default http protocol plugin, as it is today.


+1

I have found protocol-http to be more reliable for large crawls than protocol-httpclient and would be in favor of switching the default back to protocol-http. When folks need advanced features then they can switch to protocol-httpclient. Thoughts?


Hmm... I'm not saying it's flawless; there were surely some mysterious things going on with it. Was the large crawl you mention run with release 3.0 (recently updated in Nutch)? What were the issues?

The main advantage of protocol-http is that it's so simple that few things can go wrong, but this also means it's relatively unsophisticated, and adding more advanced features, namely support for https, cookies and authentication, could mean a lot of work.

A related issue is that these two plugins replicate a lot of code. At some point we should try to fix that. See:

http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html


Yes.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
