[ https://issues.apache.org/jira/browse/NUTCH-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280131#comment-14280131 ]
Markus Jelsma commented on NUTCH-1919:
--------------------------------------

+1 indeed!

> Getting timeout when server returns Content-Length: 0
> ------------------------------------------------------
>
>                 Key: NUTCH-1919
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1919
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>            Reporter: Julien Nioche
>             Fix For: 1.10
>
>         Attachments: NUTCH-1919.patch
>
>
> This has been investigated and fixed in Storm-Crawler
> [https://github.com/DigitalPebble/storm-crawler/issues/48].
> {quote}
> curl -I "http://www.dailynewslosangeles.com/"
> HTTP/1.1 301 Moved Permanently
> Location: http://www.dailynews.com
> Connection: close
> Content-Length: 0
> Content-Type: text/html; charset=UTF-8
> {quote}
> When fetching with Nutch we get a timeout exception:
> {quote}
> ./nutch parsechecker -D http.agent.name="PebbleCrawler"
> "http://www.dailynewslosangeles.com/"
> fetching: http://www.dailynewslosangeles.com/
> Fetch failed with protocol status: exception(16), lastModified=0:
> java.net.SocketTimeoutException: Read timed out
> {quote}
> The reason for this is that we are trying to read from the stream even though
> we know that the content length is 0.
> The patch attached fixes the issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
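
The cause described in the issue (reading from the stream although the content length is known to be 0) can be illustrated with a small Java sketch. This is not the attached NUTCH-1919.patch and not Nutch's protocol code; the class `ContentLengthZeroSketch`, the helper `readBody` and the `StalledStream` test double are hypothetical names, and the convention that a negative `contentLength` means "header absent" is an assumption made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

public class ContentLengthZeroSketch {

    /**
     * Hypothetical helper (not the Nutch API). Reads a response body.
     * Assumption: contentLength < 0 means the header was absent.
     */
    static byte[] readBody(InputStream in, int contentLength) throws IOException {
        // Content-Length: 0 means there is no body, so never touch the stream.
        // Reading here would block until the socket read timeout fires.
        if (contentLength == 0) {
            return new byte[0];
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int total = 0;
        while (contentLength < 0 || total < contentLength) {
            // Never ask for more bytes than the declared length.
            int want = contentLength < 0
                    ? buf.length
                    : Math.min(buf.length, contentLength - total);
            int n = in.read(buf, 0, want);
            if (n == -1) {
                break; // end of stream
            }
            out.write(buf, 0, n);
            total += n;
        }
        return out.toByteArray();
    }

    /** Simulates a socket with nothing to read: every read times out. */
    static class StalledStream extends InputStream {
        @Override
        public int read() throws IOException {
            throw new SocketTimeoutException("Read timed out");
        }
    }

    public static void main(String[] args) throws IOException {
        // With the guard in place the stalled stream is never read.
        byte[] body = readBody(new StalledStream(), 0);
        System.out.println("body length: " + body.length);
    }
}
```

Running `main` prints `body length: 0`. Without the early return for a zero length, the same call would throw the `SocketTimeoutException` shown in the report above.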