[ https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000981#comment-14000981 ]
Hudson commented on NUTCH-1613:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2630 (See [https://builds.apache.org/job/Nutch-trunk/2630/])
NUTCH-1613 Timeouts in protocol-httpclient when crawling same host with >2 threads (jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1593951)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java


> Timeouts in protocol-httpclient when crawling same host with >2 threads and
> added cookie strings for both http protocols
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1613
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1613
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 2.2.1
>            Reporter: Brian
>            Priority: Minor
>              Labels: patch
>             Fix For: 2.3, 1.9
>
>         Attachments: NUTCH-1613.patch
>
>
> 1.) When using protocol-httpclient to crawl a single website (the same host)
> I would always get a bunch of timeout errors during fetching and the pages
> with errors would not be fetched. E.g.:
> 2013-07-09 17:57:13,717 WARN fetcher.FetcherJob - fetch of http://www....
> failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException:
> Timeout waiting for connection
> 2013-07-09 17:57:13,718 INFO fetcher.FetcherJob - fetching http://www....
> (queue crawl delay=0ms)
> 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following
> error:
> org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting
> for connection
>     at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
>     at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
>     at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
>     at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>     at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
>     at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
>     at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
>     at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
>     at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)
> This is because by default the connection pool manager only allows 2
> connections per host, so if more than 2 threads are used the others will tend
> to time out waiting to get a connection. The code previously set the total
> max connections correctly but not the max connections per host.
> 2.) At the same time I also added simple modifications to both protocol-http
> and protocol-httpclient to allow specifying a cookie string in the conf file
> to include in request headers.
> I use this to crawl site content requiring authentication - it is better for
> me to specify the cookie string for the authentication than to go through the
> whole authentication process and specify login info.
> The nutch-site.xml property is the following:
> <property>
>   <name>http.cookie_string</name>
>   <value>XX_AL=authorization_value_goes_here</value>
>   <description>String to use as the cookie value for HTTP
>   requests</description>
> </property>
> Although I use it for authentication it can be used to specify any single
> cookie string for the crawl (httpclient does support different cookies for
> different hosts but I did not get into that).



--
This message was sent by Atlassian JIRA
(v6.2#6252)
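
For readers following item 1 of the quoted description, here is a minimal, standard-library-only sketch of why a per-host limit of 2 produces timeouts once more than 2 fetcher threads target the same host. It is a simulation, not the patch and not Commons HttpClient code: the class `PerHostLimitDemo`, the method `countTimeouts`, and all timing values are hypothetical, and a `Semaphore` stands in for the per-host slots of the connection pool. In Commons HttpClient 3.x the corresponding knobs are, as far as I know, `HttpConnectionManagerParams.setDefaultMaxConnectionsPerHost(int)` and `setMaxTotalConnections(int)`; whether the committed change uses exactly those calls is not stated in this message.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simulation: a Semaphore models the per-host connection slots.
final class PerHostLimitDemo {

    /**
     * Starts fetcherThreads workers that all want a connection to the same host.
     * Each worker waits at most waitMillis for a slot and, if it gets one,
     * holds it for holdMillis (the simulated fetch). Returns how many workers
     * gave up waiting, which corresponds to a pool timeout in the real client.
     */
    static int countTimeouts(int maxPerHost, int fetcherThreads,
                             long holdMillis, long waitMillis) {
        Semaphore hostSlots = new Semaphore(maxPerHost);
        AtomicInteger timeouts = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService fetchers = Executors.newFixedThreadPool(fetcherThreads);

        for (int i = 0; i < fetcherThreads; i++) {
            fetchers.execute(() -> {
                try {
                    start.await(); // release all workers at the same moment
                    if (hostSlots.tryAcquire(waitMillis, TimeUnit.MILLISECONDS)) {
                        try {
                            Thread.sleep(holdMillis); // simulated page fetch
                        } finally {
                            hostSlots.release();
                        }
                    } else {
                        timeouts.incrementAndGet(); // no slot became free in time
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        start.countDown();
        fetchers.shutdown();
        try {
            if (!fetchers.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("simulation did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return timeouts.get();
    }

    public static void main(String[] args) {
        // Six threads against a per-host limit of 2, then against a limit of 6.
        System.out.println("limit 2: " + countTimeouts(2, 6, 600, 100) + " timeouts");
        System.out.println("limit 6: " + countTimeouts(6, 6, 600, 100) + " timeouts");
    }
}
```

In this model, raising the per-host limit to match the number of fetcher threads removes the waiting entirely, which is the behaviour the description attributes to setting connections per host in addition to the total.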