[ 
https://issues.apache.org/jira/browse/NUTCH-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280131#comment-14280131
 ] 

Markus Jelsma commented on NUTCH-1919:
--------------------------------------

+1 indeed! 

> Getting timeout when server returns Content-Length: 0 
> ------------------------------------------------------
>
>                 Key: NUTCH-1919
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1919
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>            Reporter: Julien Nioche
>             Fix For: 1.10
>
>         Attachments: NUTCH-1919.patch
>
>
> This has been investigated and fixed in Storm-Crawler 
> [https://github.com/DigitalPebble/storm-crawler/issues/48].
> {quote}
> curl -I "http://www.dailynewslosangeles.com/";
> HTTP/1.1 301 Moved Permanently
> Location: http://www.dailynews.com
> Connection: close
> Content-Length: 0
> Content-Type: text/html; charset=UTF-8
> {quote}
> When fetching with Nutch, we get a timeout exception:
> {quote}
> ./nutch parsechecker -D http.agent.name="PebbleCrawler" 
> "http://www.dailynewslosangeles.com/";
> fetching: http://www.dailynewslosangeles.com/
> Fetch failed with protocol status: exception(16), lastModified=0: 
> java.net.SocketTimeoutException: Read timed out
> {quote}
> The reason is that we try to read from the stream even though we already 
> know that the content length is 0, so the read blocks until the socket 
> times out.
> The attached patch fixes the issue. 
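
For illustration only, here is a minimal sketch of the idea described above: skip the read entirely when the declared content length is 0. This is not the code from NUTCH-1919.patch; the class name ContentLengthSketch and the helper readBody are hypothetical, and a negative length is assumed here to mean "length unknown".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ContentLengthSketch {

    // Hypothetical helper, not part of the Nutch API.
    // contentLength < 0 is assumed to mean "no Content-Length header".
    public static byte[] readBody(InputStream in, long contentLength) {
        // Content-Length: 0 means there is nothing to read. Calling read()
        // here could block until the socket timeout if the server keeps
        // the connection open.
        if (contentLength == 0) {
            return new byte[0];
        }
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            long remaining = contentLength;
            while (remaining != 0) {
                // Never ask for more bytes than the server declared.
                int want = remaining > 0
                        ? (int) Math.min(buf.length, remaining)
                        : buf.length;
                int n = in.read(buf, 0, want);
                if (n == -1) {
                    break; // end of stream
                }
                out.write(buf, 0, n);
                if (remaining > 0) {
                    remaining -= n;
                }
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A stream that fails if anyone reads from it, standing in for a
        // socket that would otherwise block.
        InputStream mustNotBeRead = new InputStream() {
            @Override
            public int read() {
                throw new IllegalStateException("stream should not be read");
            }
        };
        System.out.println(readBody(mustNotBeRead, 0).length);
        System.out.println(
                readBody(new ByteArrayInputStream("hello".getBytes()), 5).length);
    }
}
```

The early return is the whole point: with a declared length of 0 the stream is never touched, so a server that does not close the connection promptly cannot trigger a read timeout.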



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
