Hi Tony

As Coskun said, you can set http.content.limit to -1 (the default is 65536).
Note it is the http.content.limit property, not file.content.limit.

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
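
For example, to disable truncation you could override the property in
conf/nutch-site.xml like this (a sketch; the description text is optional):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>No length limit for downloaded HTTP content.</description>
</property>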



On Mon, Jun 17, 2013 at 7:58 PM, Tony Mullins <[email protected]> wrote:

> The html in my previous email was incorrect (I was trying a different DNS,
> thinking it was due to a bad internet connection)...
> but in short, I am getting an incomplete html response....
>
> Is there any property in Nutch which could make it wait for the complete
> html to load?
>
> Thanks,
> Tony
>
>
> On Mon, Jun 17, 2013 at 4:43 PM, Tony Mullins <[email protected]> wrote:
>
> > I have modified these values as
> >
> > <property>
> >   <name>http.timeout</name>
> >   <value>20000</value>
> >   <description>The default network timeout, in
> milliseconds.</description>
> > </property>
> >
> > <property>
> >   <name>file.content.limit</name>
> >   <value>-1</value>
> >
> >   <description>The length limit for downloaded content using the file
> >    protocol, in bytes. If this value is nonnegative (>=0), content longer
> >    than it will be truncated; otherwise, no truncation at all. Do not
> >    confuse this setting with the http.content.limit setting.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.max.delays</name>
> >   <value>200</value>
> >   <description>The number of times a thread will delay when trying to
> >   fetch a page.  Each time it finds that a host is busy, it will wait
> >   fetcher.server.delay.  After http.max.delays attempts, it will give
> >   up on the page for now.</description>
> > </property>
> >
> > And I am getting the html for the page
> > http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
> > like this
> >
> > INFO  nutch.selector - page html is <!DOCTYPE HTML>
> > <html>
> > <head>
> >
> >   <title>Squarespace - Domain Not Claimed</title>
> >   <meta http-equiv="X-UA-Compatible" content="chrome=1">
> >
> >   <script type="text/javascript" src="//
> >
> static.squarespace.com/universal/scripts-v6/061620131943271011/yui-seed.js
> > "></script>
> >
> >   <script>
> >
> >     Y = YUI(YUI_CONFIG).use("squarespace-util", "squarespace-ui-base",
> > "squarespace-configuration-css",  function(Y) {
> >
> >       Y.on("domready", function() {
> >
> >         var lb = new Y.Squarespace.Lightbox({
> >           disableNormalClose: true,
> >           clickAnywhereToExit: false,
> >           content: '<div class="bigtext"><div class="title">Domain Not
> > Claimed</div><div class="description">This domain has been mapped to
> > Squarespace, but it has not yet been claimed by a website.  If this is
> your
> > domain, claim it in the Domains tab of your website
> manager.</div></div>',
> >           margin: 100,
> >           noHeightConstrain: true
> >         });
> >
> >         lb.show();
> >
> >         lb.getContentEl().on("click", function(e) {
> >           if (e.target.ancestor(".login-button", true)) {
> >             document.location.href = '/config/';
> >           }
> >         });
> >
> >       });
> >
> >     });
> >
> >   </script>
> >
> >
> > </head>
> > <body class="squarespace-config squarespace-system-page">
> >
> >   <div class="minimal-logo">&nbsp;</div>
> >
> > </body>
> > </html>
> >
> > So as you can see, it's not loading the complete page....
> >
> > Is there any other property that I need to modify ?
> >
> > Thanks
> > Tony.
> >
> >
> >
> > On Mon, Jun 17, 2013 at 4:13 PM, H. Coskun Gunduz <
> > [email protected]> wrote:
> >
> >> Hi Tony,
> >>
> >> You may need to add http.content.limit parameter in nutch-site.xml file.
> >>
> >> for size-unlimited crawling:
> >>
> >> <property>
> >>   <name>http.content.limit</name>
> >>   <value>-1</value>
> >>   <description>The length limit for downloaded content using the http
> >>   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >>   than it will be truncated; otherwise, no truncation at all. Do not
> >>   confuse this setting with the file.content.limit setting.
> >>   </description>
> >> </property>
> >>
> >>
> >> Please refer to: http://wiki.apache.org/nutch/nutch-default.xml
> >>
> >> Kind regards..
> >> coskun...
> >>
> >>
> >> On 06/17/2013 02:05 PM, Tony Mullins wrote:
> >>
> >>> Hi ,
> >>>
> >>> I am trying to crawl this url
> >>> http://www.amazon.com/Levis-Mens-550-Relaxed-Jean/dp/B0018OKX68
> >>> and getting the crawled page content in my ParseFilter plugin like this
> >>> String html = new String(webPage.getContent().array());
> >>> Then I am using this html to extract my required information....
> >>>
> >>> But it's not returning the complete html of the page. I have logged the
> >>> 'html' and I can see that the log file contains incomplete html for the
> >>> above link....
> >>>
> >>> Is there a size limit on a page's content? Or am I doing something
> >>> wrong here?
> >>>
> >>> Thanks,
> >>> Tony.
> >>>
> >>>
> >>
> >
>



-- 
Don't Grow Old, Grow Up... :-)
