It parses 652 outlinks from the eBay URL without any difficulty. I didn't want to change the subject line and thereby break this thread, but at this point, and as stated in my last post, I am reasonably confident that for some reason the InputReader in HttpResponse.java sees the stream as closed after reading only 198 bytes. Why, I do not know.
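In case it helps to rule Nutch out, here is a minimal standalone sketch (plain HttpURLConnection, nothing from Nutch; the URL is only a placeholder for the problem page) that reads the response and counts bytes until read() returns -1. If this also stops at around 198 bytes, the early close is coming from the server side rather than from anything in HttpResponse.java:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Standalone check: count how many bytes arrive before read() returns -1.
public class ReadCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder URL; substitute the problematic page.
    URL url = new URL("http://www.example.com/index.cfm?pageID=12");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("User-Agent", "read-check");
    conn.setConnectTimeout(60000);
    conn.setReadTimeout(60000);

    byte[] buf = new byte[4096];
    long total = 0;
    int reads = 0;
    try (InputStream in = conn.getInputStream()) {
      for (int n = in.read(buf); n != -1; n = in.read(buf)) {
        total += n;
        reads++;
      }
    }
    System.out.println("HTTP " + conn.getResponseCode()
        + ", reads=" + reads + ", total bytes=" + total);
  }
}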
-----Original Message-----
From: S.L [mailto:simpleliving...@gmail.com]
Sent: Sunday, December 08, 2013 11:44 PM
To: user@nutch.apache.org
Subject: Re: Unsuccessful fetch/parse of large page with many outlinks

I faced a similar problem with the page http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 when I was running Nutch from within Eclipse. I was able to crawl all the outlinks successfully when I ran Nutch as a jar outside of Eclipse, and at that point it was considered to be an issue with running it in Eclipse. Can you please try this URL with your setup? It has at least 600+ outlinks.

On Sun, Dec 8, 2013 at 10:07 PM, Iain Lopata <ilopa...@hotmail.com> wrote:
> Some further analysis - no solution.
>
> The pages in question do not return a Content-Length header.
>
> Since http.content.limit is set to -1, http-protocol sets the
> maximum read length to 2147483647.
>
> At line 231 of HttpResponse.java the loop:
>
> for (int i = in.read(bytes); i != -1 && length + i <= contentLength; i = in.read(bytes))
>
> executes once and once only and returns a stream of just 198 bytes.
> No exceptions are thrown.
>
> So, I think, the question becomes: why would this connection close
> before the end of the stream? It certainly seems to be server
> specific, since I can retrieve the file successfully from a different
> host domain.
>
> -----Original Message-----
> From: Tejas Patil [mailto:tejas.patil...@gmail.com]
> Sent: Sunday, December 08, 2013 2:29 PM
> To: user@nutch.apache.org
> Subject: Re: Unsuccessful fetch/parse of large page with many outlinks
>
> > debug code that I have inserted in a custom filter shows that the
> > file that was retrieved is only 198 bytes long.
> I am assuming that this code did not hinder the crawler. A better way
> to see the content would be to take a segment dump [0] and then
> analyse it. Also, turn on DEBUG mode of log4j for the http protocol
> classes and the fetcher class.
>
> > attempted to crawl it from that site and it works fine, retrieving
> > all 597KB and parsing it successfully.
> You mean that you ran a Nutch crawl with the problematic url as a seed
> and used the EXACT same config on both machines. One machine gave
> perfect content and the other one did not. Note that using the EXACT
> same config over these 2 runs is important.
>
> > the page has about 350 characters of LineFeeds, CarriageReturns and
> > spaces
> No way. The HTTP request gets a byte stream as a response. Also, had it
> been the case that LF or CR chars create a problem, it would hit
> Nutch irrespective of which machine you run Nutch from... but that's
> not what your experiments suggest.
>
> [0] : http://wiki.apache.org/nutch/bin/nutch_readseg
>
>
> On Sun, Dec 8, 2013 at 11:23 AM, Iain Lopata <ilopa...@hotmail.com> wrote:
>
> > I do not know whether this would be a factor, but I have noticed
> > that the page has about 350 characters of LineFeeds, CarriageReturns
> > and spaces before the <!DOCTYPE> declaration. Could this be causing
> > a problem for http-protocol in some way? However, I can't explain
> > why the same file with the same LF, CR and whitespace would read
> > correctly from a different host.
> >
> > -----Original Message-----
> > From: Iain Lopata [mailto:ilopa...@hotmail.com]
> > Sent: Sunday, December 08, 2013 12:06 PM
> > To: user@nutch.apache.org
> > Subject: Unsuccessful fetch/parse of large page with many outlinks
> >
> > I am running Nutch 1.6 on Ubuntu Server.
> >
> > I am experiencing a problem with one particular webpage.
> >
> > If I use parsechecker against the problem url, the output shows
> > (host name changed to example.com):
> >
> > ================================================================
> > fetching: http://www.example.com/index.cfm?pageID=12
> > text/html
> > parsing: http://www.example.com/index.cfm?pageID=12
> > contentType: text/html
> > signature: a9c640626fcad48caaf3ad5f94bea446
> > ---------
> > Url
> > ---------------
> > http://www.example.com/index.cfm?pageID=12
> > ---------
> > ParseData
> > ---------
> > Version: 5
> > Status: success(1,0)
> > Title:
> > Outlinks: 0
> > Content Metadata: Date=Sun, 08 Dec 2013 17:32:33 GMT
> > Set-Cookie=CFTOKEN=96208061;path=/ Content-Type=text/html;
> > charset=UTF-8 Connection=close X-Powered-By=ASP.NET
> > Server=Microsoft-IIS/6.0
> > Parse Metadata: CharEncodingForConversion=utf-8
> > OriginalCharEncoding=utf-8
> > ================================================================
> >
> > However, this page has 3775 outlinks.
> >
> > If I run a crawl with this page as a seed, the log file shows that
> > the file is fetched successfully, but debug code that I have
> > inserted in a custom filter shows that the file that was retrieved
> > is only 198 bytes long. For some reason the file would seem to be
> > truncated or otherwise corrupted.
> >
> > I can retrieve the file with wget and can see that the file is 597KB.
> >
> > I copied the file that I retrieved with wget to another web server
> > and attempted to crawl it from that site, and it works fine,
> > retrieving all 597KB and parsing it successfully. This would
> > suggest that my current configuration does not have a problem
> > processing this large file.
> >
> > I have checked the robots.txt file on the original host and it
> > allows retrieval of this web page.
> >
> > Other relevant configuration settings may be:
> >
> > <property>
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> > </property>
> > <property>
> >   <name>http.timeout</name>
> >   <value>60000</value>
> >   <description></description>
> > </property>
> >
> > Any ideas on what to check next?
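To make the failure mode discussed above concrete, here is a simplified sketch in the same shape as the loop quoted from line 231 of HttpResponse.java (the class and method names here are illustrative, not Nutch's actual code). If the peer closes the connection after the first chunk, the second read() returns -1, the loop exits quietly, and the caller ends up with only the bytes from the first read (198 in this case), with no exception to flag the truncation:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Simplified sketch of the accumulation loop quoted above (not Nutch's
// actual class). A premature EOF looks exactly like a normal end of stream.
public class ContentReader {
  static byte[] readContent(InputStream in, int contentLength) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] bytes = new byte[8192];
    int length = 0;
    // Same shape as the loop at line 231: stop on -1 (EOF / closed
    // connection) or when the configured content limit would be exceeded.
    for (int i = in.read(bytes); i != -1 && length + i <= contentLength; i = in.read(bytes)) {
      out.write(bytes, 0, i);
      length += i;
    }
    return out.toByteArray(); // may be only 198 bytes if EOF came early
  }
}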