I think you have narrowed it down, and most probably it's some bug/incompatibility in the HTTP library which Nutch uses to talk to the server. Were both the servers where you hosted the URL running IIS 6.0? If yes, then there is more to it :)
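One way to confirm that would be to replay, outside of Nutch, roughly the request that protocol-http sends and count how many bytes come back before the server closes the socket. The sketch below is only an approximation: if I remember the lib-http code correctly it issues a plain HTTP/1.0 GET, but the exact headers (and the User-Agent) will differ from your build, so treat them as placeholders and compare against what you see in WireShark. The path is the pageID=12 one from your parsechecker output; adjust it to the page you are actually fetching.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;

    // Standalone test, not Nutch code: send a bare HTTP/1.0 GET (an
    // approximation of what lib-http sends) and count the bytes returned
    // before the server closes the connection.
    public class RawFetchTest {
      public static void main(String[] args) throws Exception {
        String host = "www.friedfrank.com";       // host from this thread
        String path = "/index.cfm?pageID=12";     // adjust to the real page
        Socket socket = new Socket(host, 80);
        socket.setSoTimeout(60000);               // same as your http.timeout
        OutputStream out = socket.getOutputStream();
        out.write(("GET " + path + " HTTP/1.0\r\n"
            + "Host: " + host + "\r\n"
            + "User-Agent: Mozilla/5.0 (test)\r\n"  // placeholder; try your http.agent.name too
            + "Accept: */*\r\n"
            + "\r\n").getBytes("US-ASCII"));
        out.flush();

        InputStream in = socket.getInputStream();
        byte[] buf = new byte[4096];
        long total = 0;
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
          total += n;   // same basic read pattern as the HttpResponse loop
        }
        System.out.println("Read " + total + " bytes before the stream closed");
        socket.close();
      }
    }

If this also stops after a couple of hundred bytes of body (note the raw count here includes the response headers, which Nutch strips before your filter sees the content), the server is cutting the response off for this particular request shape, and you can add or remove headers (Accept-Encoding, User-Agent, HTTP/1.1 plus Host) one at a time until you find what triggers it. If it returns the full 597KB, then the problem sits inside the HTTP library itself and we should dig into lib-http.
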
Thanks,
Tejas

On Mon, Dec 9, 2013 at 3:32 PM, Iain Lopata <ilopa...@hotmail.com> wrote:

> Out of ideas at this point.
>
> I can retrieve the page with Curl
> I can retrieve the page with Wget
> I can view the page in my browser
> I can retrieve the page by opening a socket from a PHP script
> I can retrieve the page with Nutch if I move the page to another host
>
> But
>
> Any page I try and fetch from www.friedfrank.com with Nutch reads just
> 198 bytes and then closes the stream.
>
> Debug code inserted in HttpResponse and WireShark both show that this
> is the case.
>
> Could someone else please try and fetch a page from this host from your
> config?
>
> My suspicion is that it is related to this host being on IIS 6.0, with
> this problem being a potential cause:
> http://support.microsoft.com/kb/919797
>
> -----Original Message-----
> From: Iain Lopata [mailto:ilopa...@hotmail.com]
> Sent: Monday, December 09, 2013 7:36 AM
> To: user@nutch.apache.org
> Subject: RE: Unsuccessful fetch/parse of large page with many outlinks
>
> Parses 652 outlinks from the ebay url without any difficulty.
>
> Didn't want to change the title and thereby break this thread, but at
> this point, and as stated in my last post, I am reasonably confident
> that for some reason the InputReader in HttpResponse.java sees the
> stream as closed after reading only 198 bytes. Why, I do not know.
>
> -----Original Message-----
> From: S.L [mailto:simpleliving...@gmail.com]
> Sent: Sunday, December 08, 2013 11:44 PM
> To: user@nutch.apache.org
> Subject: Re: Unsuccessful fetch/parse of large page with many outlinks
>
> I faced a similar problem with this page
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 when I was
> running Nutch from within Eclipse. I was able to crawl all the outlinks
> successfully when I ran Nutch as a jar outside of Eclipse; at that
> point it was considered to be an issue with running it in Eclipse.
>
> Can you please try this URL with your setup? It has at least 600+
> outlinks.
>
>
> On Sun, Dec 8, 2013 at 10:07 PM, Iain Lopata <ilopa...@hotmail.com> wrote:
>
> > Some further analysis - no solution.
> >
> > The pages in question do not return a Content-Length header.
> >
> > Since the http.content.limit is set to -1, http-protocol sets the
> > maximum read length to 2147483647.
> >
> > At line 231 of HttpResponse.java the loop:
> >
> > for (int i = in.read(bytes); i != -1 && length + i <= contentLength; i = in.read(bytes))
> >
> > executes once and once only and returns a stream of just 198 bytes.
> > No exceptions are thrown.
> >
> > So, I think, the question becomes: why would this connection close
> > before the end of the stream? It certainly seems to be server
> > specific, since I can retrieve the file successfully from a different
> > host domain.
> >
> > -----Original Message-----
> > From: Tejas Patil [mailto:tejas.patil...@gmail.com]
> > Sent: Sunday, December 08, 2013 2:29 PM
> > To: user@nutch.apache.org
> > Subject: Re: Unsuccessful fetch/parse of large page with many outlinks
> >
> > > debug code that I have inserted in a custom filter shows that the
> > > file that was retrieved is only 198 bytes long.
> >
> > I am assuming that this code did not hinder the crawler. A better way
> > to see the content would be to take a segment dump [0] and then
> > analyse it.
> > Also, turn on DEBUG mode of log4j for the http protocol classes and
> > the fetcher class.
> >
> > > attempted to crawl it from that site and it works fine, retrieving
> > > all 597KB and parsing it successfully.
> >
> > You mean that you ran a Nutch crawl with the problematic url as a
> > seed and used the EXACT same config on both machines. One machine
> > gave perfect content and the other one did not. Note that using the
> > EXACT same config over these 2 runs is important.
> >
> > > the page has about 350 characters of LineFeeds, CarriageReturns and
> > > spaces
> >
> > No way. The HTTP request gets a byte stream as response. Also, had it
> > been the case that LF or CR chars create a problem, then it would hit
> > Nutch irrespective of which machine you run Nutch from... but that's
> > not what your experiments suggest.
> >
> > [0] : http://wiki.apache.org/nutch/bin/nutch_readseg
> >
> >
> > On Sun, Dec 8, 2013 at 11:23 AM, Iain Lopata <ilopa...@hotmail.com> wrote:
> >
> > > I do not know whether this would be a factor, but I have noticed
> > > that the page has about 350 characters of LineFeeds, CarriageReturns
> > > and spaces before the <!DOCTYPE> declaration. Could this be causing
> > > a problem for http-protocol in some way? However, I can't explain
> > > why the same file with the same LF, CR and whitespace would read
> > > correctly from a different host.
> > >
> > > -----Original Message-----
> > > From: Iain Lopata [mailto:ilopa...@hotmail.com]
> > > Sent: Sunday, December 08, 2013 12:06 PM
> > > To: user@nutch.apache.org
> > > Subject: Unsuccessful fetch/parse of large page with many outlinks
> > >
> > > I am running Nutch 1.6 on Ubuntu Server.
> > >
> > > I am experiencing a problem with one particular webpage.
> > >
> > > If I use parsechecker against the problem url the output shows (host
> > > name changed to example.com):
> > >
> > > ================================================================
> > > fetching: http://www.example.com/index.cfm?pageID=12
> > > text/html
> > > parsing: http://www.example.com/index.cfm?pageID=12
> > > contentType: text/html
> > > signature: a9c640626fcad48caaf3ad5f94bea446
> > > ---------
> > > Url
> > > ---------------
> > > http://www.example.com/index.cfm?pageID=12
> > > ---------
> > > ParseData
> > > ---------
> > > Version: 5
> > > Status: success(1,0)
> > > Title:
> > > Outlinks: 0
> > > Content Metadata: Date=Sun, 08 Dec 2013 17:32:33 GMT
> > >   Set-Cookie=CFTOKEN=96208061;path=/ Content-Type=text/html;
> > >   charset=UTF-8 Connection=close X-Powered-By=ASP.NET
> > >   Server=Microsoft-IIS/6.0
> > > Parse Metadata: CharEncodingForConversion=utf-8
> > >   OriginalCharEncoding=utf-8
> > > ================================================================
> > >
> > > However, this page has 3775 outlinks.
> > >
> > > If I run a crawl with this page as a seed, the log file shows that
> > > the file is fetched successfully, but debug code that I have
> > > inserted in a custom filter shows that the file that was retrieved
> > > is only 198 bytes long. For some reason the file would seem to be
> > > truncated or otherwise corrupted.
> > >
> > > I can retrieve the file with wget and can see that the file is 597KB.
> > >
> > > I copied the file that I retrieved with wget to another web server
> > > and attempted to crawl it from that site and it works fine,
> > > retrieving all 597KB and parsing it successfully.
> > > This would suggest that my current configuration does not have a
> > > problem processing this large file.
> > >
> > > I have checked the robots.txt file on the original host and it
> > > allows retrieval of this web page.
> > >
> > > Other relevant configuration settings may be:
> > >
> > > <property>
> > >   <name>http.content.limit</name>
> > >   <value>-1</value>
> > > </property>
> > >
> > > <property>
> > >   <name>http.timeout</name>
> > >   <value>60000</value>
> > >   <description></description>
> > > </property>
> > >
> > > Any ideas on what to check next?
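
Coming back to the HttpResponse.java loop quoted above: a second comparison worth making is to fetch the same URL with the JDK's own HttpURLConnection, which speaks HTTP/1.1 and handles chunked transfer encoding and content length for you. Again just a sketch, using the same placeholder URL and User-Agent as the raw-socket test above; nothing here is Nutch-specific.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Comparison test: fetch the same page with the JDK's HTTP/1.1 client
    // and count the body bytes it manages to read.
    public class UrlConnectionTest {
      public static void main(String[] args) throws Exception {
        // Placeholder URL: host from this thread plus the pageID=12 path
        // from the parsechecker output above -- adjust as needed.
        URL url = new URL("http://www.friedfrank.com/index.cfm?pageID=12");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(60000);
        conn.setReadTimeout(60000);
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (test)"); // placeholder
        int status = conn.getResponseCode();      // sends the request
        InputStream in = conn.getInputStream();
        byte[] buf = new byte[4096];
        long total = 0;
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
          total += n;   // body bytes only; headers are handled by the client
        }
        in.close();
        System.out.println("HTTP " + status + ", read " + total + " bytes of body");
        conn.disconnect();
      }
    }

If the raw HTTP/1.0 request stops short while this one returns the full 597KB, that would support the idea that the server's behaviour depends on how the request is made, and the workaround is more likely on the lib-http side (or trying the protocol-httpclient plugin instead of protocol-http, if that is an option in your setup) than in your configuration.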