I think you have narrowed it down, and most probably it's some bug/incompatibility in the HTTP library which Nutch uses to talk to the server. Were both the servers where you hosted the URL running IIS 6.0? If yes, then there is more to it :)
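One way to confirm that would be to replay, outside of Nutch, roughly the request that protocol-http sends and count how many bytes come back before the server closes the socket. The sketch below is only an approximation: if I remember the lib-http code correctly it issues a plain HTTP/1.0 GET, but the exact headers (and the User-Agent) will differ from your build, so treat them as placeholders and compare against what you see in WireShark. The path is the pageID=12 one from your parsechecker output; adjust it to the page you are actually fetching.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;

    // Standalone test, not Nutch code: send a bare HTTP/1.0 GET (an
    // approximation of what lib-http sends) and count the bytes returned
    // before the server closes the connection.
    public class RawFetchTest {
      public static void main(String[] args) throws Exception {
        String host = "www.friedfrank.com";       // host from this thread
        String path = "/index.cfm?pageID=12";     // adjust to the real page
        Socket socket = new Socket(host, 80);
        socket.setSoTimeout(60000);               // same as your http.timeout
        OutputStream out = socket.getOutputStream();
        out.write(("GET " + path + " HTTP/1.0\r\n"
            + "Host: " + host + "\r\n"
            + "User-Agent: Mozilla/5.0 (test)\r\n"  // placeholder; try your http.agent.name too
            + "Accept: */*\r\n"
            + "\r\n").getBytes("US-ASCII"));
        out.flush();

        InputStream in = socket.getInputStream();
        byte[] buf = new byte[4096];
        long total = 0;
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
          total += n;   // same basic read pattern as the HttpResponse loop
        }
        System.out.println("Read " + total + " bytes before the stream closed");
        socket.close();
      }
    }

If this also stops after a couple of hundred bytes of body (note the raw count here includes the response headers, which Nutch strips before your filter sees the content), the server is cutting the response off for this particular request shape, and you can add or remove headers (Accept-Encoding, User-Agent, HTTP/1.1 plus Host) one at a time until you find what triggers it. If it returns the full 597KB, then the problem sits inside the HTTP library itself and we should dig into lib-http.
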
Thanks,
Tejas

On Mon, Dec 9, 2013 at 3:32 PM, Iain Lopata <ilopa...@hotmail.com> wrote:

> Out of ideas at this point.
>
> I can retrieve the page with Curl
> I can retrieve the page with Wget
> I can view the page in my browser
> I can retrieve the page by opening a socket from a PHP script
> I can retrieve the page with Nutch if I move the page to another host
>
> But
>
> Any page I try and fetch from www.friedfrank.com with Nutch reads just
> 198 bytes and then closes the stream.
>
> Debug code inserted in HttpResponse and WireShark both show that this
> is the case.
>
> Could someone else please try and fetch a page from this host from your
> config?
>
> My suspicion is that it is related to this host being on IIS 6.0, with
> this problem being a potential cause:
> http://support.microsoft.com/kb/919797
>
> -----Original Message-----
> From: Iain Lopata [mailto:ilopa...@hotmail.com]
> Sent: Monday, December 09, 2013 7:36 AM
> To: user@nutch.apache.org
> Subject: RE: Unsuccessful fetch/parse of large page with many outlinks
>
> Parses 652 outlinks from the ebay url without any difficulty.
>
> Didn't want to change the title and thereby break this thread, but at
> this point, and as stated in my last post, I am reasonably confident
> that for some reason the InputReader in HttpResponse.java sees the
> stream as closed after reading only 198 bytes. Why, I do not know.
>
> -----Original Message-----
> From: S.L [mailto:simpleliving...@gmail.com]
> Sent: Sunday, December 08, 2013 11:44 PM
> To: user@nutch.apache.org
> Subject: Re: Unsuccessful fetch/parse of large page with many outlinks
>
> I faced a similar problem with this page
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 when I was
> running Nutch from within Eclipse. I was able to crawl all the outlinks
> successfully when I ran Nutch as a jar outside of Eclipse; at that
> point it was considered to be an issue with running it in Eclipse.
>
> Can you please try this URL with your setup? It has at least 600+
> outlinks.
>
>
> On Sun, Dec 8, 2013 at 10:07 PM, Iain Lopata <ilopa...@hotmail.com> wrote:
>
> > Some further analysis - no solution.
> >
> > The pages in question do not return a Content-Length header.
> >
> > Since the http.content.limit is set to -1, http-protocol sets the
> > maximum read length to 2147483647.
> >
> > At line 231 of HttpResponse.java the loop:
> >
> > for (int i = in.read(bytes); i != -1 && length + i <= contentLength; i = in.read(bytes))
> >
> > executes once and once only and returns a stream of just 198 bytes.
> > No exceptions are thrown.
> >
> > So, I think, the question becomes: why would this connection close
> > before the end of the stream? It certainly seems to be server
> > specific, since I can retrieve the file successfully from a different
> > host domain.
> >
> > -----Original Message-----
> > From: Tejas Patil [mailto:tejas.patil...@gmail.com]
> > Sent: Sunday, December 08, 2013 2:29 PM
> > To: user@nutch.apache.org
> > Subject: Re: Unsuccessful fetch/parse of large page with many outlinks
> >
> > > debug code that I have inserted in a custom filter shows that the
> > > file that was retrieved is only 198 bytes long.
> >
> > I am assuming that this code did not hinder the crawler. A better way
> > to see the content would be to take a segment dump [0] and then
> > analyse it.
> > Also, turn on DEBUG mode of log4j for the http protocol classes and
> > the fetcher class.
> >
> > > attempted to crawl it from that site and it works fine, retrieving
> > > all 597KB and parsing it successfully.
> >
> > You mean that you ran a Nutch crawl with the problematic url as a
> > seed and used the EXACT same config on both machines. One machine
> > gave perfect content and the other one did not. Note that using the
> > EXACT same config over these 2 runs is important.
> >
> > > the page has about 350 characters of LineFeeds, CarriageReturns and
> > > spaces
> >
> > No way. The HTTP request gets a byte stream as response. Also, had it
> > been the case that LF or CR chars create a problem, then it would hit
> > Nutch irrespective of which machine you run Nutch from... but that's
> > not what your experiments suggest.
> >
> > [0] : http://wiki.apache.org/nutch/bin/nutch_readseg
> >
> >
> > On Sun, Dec 8, 2013 at 11:23 AM, Iain Lopata <ilopa...@hotmail.com> wrote:
> >
> > > I do not know whether this would be a factor, but I have noticed
> > > that the page has about 350 characters of LineFeeds, CarriageReturns
> > > and spaces before the <!DOCTYPE> declaration. Could this be causing
> > > a problem for http-protocol in some way? However, I can't explain
> > > why the same file with the same LF, CR and whitespace would read
> > > correctly from a different host.
> > >
> > > -----Original Message-----
> > > From: Iain Lopata [mailto:ilopa...@hotmail.com]
> > > Sent: Sunday, December 08, 2013 12:06 PM
> > > To: user@nutch.apache.org
> > > Subject: Unsuccessful fetch/parse of large page with many outlinks
> > >
> > > I am running Nutch 1.6 on Ubuntu Server.
> > >
> > > I am experiencing a problem with one particular webpage.
> > >
> > > If I use parsechecker against the problem url the output shows (host
> > > name changed to example.com):
> > >
> > > ================================================================
> > > fetching: http://www.example.com/index.cfm?pageID=12
> > > text/html
> > > parsing: http://www.example.com/index.cfm?pageID=12
> > > contentType: text/html
> > > signature: a9c640626fcad48caaf3ad5f94bea446
> > > ---------
> > > Url
> > > ---------------
> > > http://www.example.com/index.cfm?pageID=12
> > > ---------
> > > ParseData
> > > ---------
> > > Version: 5
> > > Status: success(1,0)
> > > Title:
> > > Outlinks: 0
> > > Content Metadata: Date=Sun, 08 Dec 2013 17:32:33 GMT
> > >   Set-Cookie=CFTOKEN=96208061;path=/ Content-Type=text/html;
> > >   charset=UTF-8 Connection=close X-Powered-By=ASP.NET
> > >   Server=Microsoft-IIS/6.0
> > > Parse Metadata: CharEncodingForConversion=utf-8
> > >   OriginalCharEncoding=utf-8
> > > ================================================================
> > >
> > > However, this page has 3775 outlinks.
> > >
> > > If I run a crawl with this page as a seed, the log file shows that
> > > the file is fetched successfully, but debug code that I have
> > > inserted in a custom filter shows that the file that was retrieved
> > > is only 198 bytes long. For some reason the file would seem to be
> > > truncated or otherwise corrupted.
> > >
> > > I can retrieve the file with wget and can see that the file is 597KB.
> > >
> > > I copied the file that I retrieved with wget to another web server
> > > and attempted to crawl it from that site and it works fine,
> > > retrieving all 597KB and parsing it successfully.
> > > This would suggest that my current configuration does not have a
> > > problem processing this large file.
> > >
> > > I have checked the robots.txt file on the original host and it
> > > allows retrieval of this web page.
> > >
> > > Other relevant configuration settings may be:
> > >
> > > <property>
> > >   <name>http.content.limit</name>
> > >   <value>-1</value>
> > > </property>
> > >
> > > <property>
> > >   <name>http.timeout</name>
> > >   <value>60000</value>
> > >   <description></description>
> > > </property>
> > >
> > > Any ideas on what to check next?
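
Coming back to the HttpResponse.java loop quoted above: a second comparison worth making is to fetch the same URL with the JDK's own HttpURLConnection, which speaks HTTP/1.1 and handles chunked transfer encoding and content length for you. Again just a sketch, using the same placeholder URL and User-Agent as the raw-socket test above; nothing here is Nutch-specific.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Comparison test: fetch the same page with the JDK's HTTP/1.1 client
    // and count the body bytes it manages to read.
    public class UrlConnectionTest {
      public static void main(String[] args) throws Exception {
        // Placeholder URL: host from this thread plus the pageID=12 path
        // from the parsechecker output above -- adjust as needed.
        URL url = new URL("http://www.friedfrank.com/index.cfm?pageID=12");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(60000);
        conn.setReadTimeout(60000);
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (test)"); // placeholder
        int status = conn.getResponseCode();      // sends the request
        InputStream in = conn.getInputStream();
        byte[] buf = new byte[4096];
        long total = 0;
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
          total += n;   // body bytes only; headers are handled by the client
        }
        in.close();
        System.out.println("HTTP " + status + ", read " + total + " bytes of body");
        conn.disconnect();
      }
    }

If the raw HTTP/1.0 request stops short while this one returns the full 597KB, that would support the idea that the server's behaviour depends on how the request is made, and the workaround is more likely on the lib-http side (or trying the protocol-httpclient plugin instead of protocol-http, if that is an option in your setup) than in your configuration.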