I have modified these values as:

<property>
  <name>http.timeout</name>
  <value>20000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>

<property>
  <name>http.max.delays</name>
  <value>200</value>
  <description>The number of times a thread will delay when trying to
  fetch a page. Each time it finds that a host is busy, it will wait
  fetcher.server.delay. After http.max.delays attempts, it will give up
  on the page for now.</description>
</property>

And I am getting the HTML for the page
http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
like this:

INFO nutch.selector - page html is
<!DOCTYPE HTML>
<html>
<head>
<title>Squarespace - Domain Not Claimed</title>
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<script type="text/javascript" src="//static.squarespace.com/universal/scripts-v6/061620131943271011/yui-seed.js"></script>
<script>
Y = YUI(YUI_CONFIG).use("squarespace-util", "squarespace-ui-base", "squarespace-configuration-css", function(Y) {
  Y.on("domready", function() {
    var lb = new Y.Squarespace.Lightbox({
      disableNormalClose: true,
      clickAnywhereToExit: false,
      content: '<div class="bigtext"><div class="title">Domain Not Claimed</div><div class="description">This domain has been mapped to Squarespace, but it has not yet been claimed by a website. If this is your domain, claim it in the Domains tab of your website manager.</div></div>',
      margin: 100,
      noHeightConstrain: true
    });
    lb.show();
    lb.getContentEl().on("click", function(e) {
      if (e.target.ancestor(".login-button", true)) {
        document.location.href = '/config/';
      }
    });
  });
});
</script>
</head>
<body class="squarespace-config squarespace-system-page">
<div class="minimal-logo">
</div>
</body>
</html>

So as you can see, it is not loading the complete page.
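For reference, overrides like these normally live together in conf/nutch-site.xml, which takes the same Hadoop-style configuration format as nutch-default.xml. A minimal consolidated sketch, using only the values quoted in this thread:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
<configuration>
  <!-- Network timeout, in milliseconds -->
  <property>
    <name>http.timeout</name>
    <value>20000</value>
  </property>
  <!-- -1 disables truncation for content fetched via the file protocol -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Number of delays on a busy host before giving up on a page -->
  <property>
    <name>http.max.delays</name>
    <value>200</value>
  </property>
</configuration>
```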
Is there any other property that I need to modify?

Thanks,
Tony.

On Mon, Jun 17, 2013 at 4:13 PM, H. Coskun Gunduz <[email protected]> wrote:

> Hi Tony,
>
> You may need to add the http.content.limit parameter in the
> nutch-site.xml file.
>
> For size-unlimited crawling:
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> Please refer to:
> http://wiki.apache.org/nutch/nutch-default.xml
>
> Kind regards,
> coskun
>
> On 06/17/2013 02:05 PM, Tony Mullins wrote:
>
>> Hi,
>>
>> I am trying to crawl this URL
>> http://www.amazon.com/Levis-Mens-550-Relaxed-Jean/dp/B0018OKX68
>> and getting the crawled page content in my ParseFilter plugin like this:
>>
>> String html = new String(webPage.getContent().array());
>>
>> Then I am using this html to extract my required information.
>>
>> But it is not returning me the complete HTML of the page. I have logged
>> the 'html' and I can see that the log file contains incomplete HTML for
>> the above link.
>>
>> Is there any size limit on a page's content? Or am I doing something
>> wrong here?
>>
>> Thanks,
>> Tony.
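A side note on the content-reading line quoted above: `webPage.getContent()` returns a ByteBuffer, and `new String(buffer.array())` decodes the entire backing array with the platform default charset, ignoring the buffer's position and limit. That can pad or garble the decoded text even when the fetch itself was complete. A minimal sketch of a safer conversion (plain Java, no Nutch dependency; the ByteBuffer here just stands in for the one Nutch returns):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentDecode {

    // Decode only the valid region [position, limit) of the buffer,
    // not the whole backing array, and use an explicit charset.
    static String decode(ByteBuffer content) {
        byte[] bytes = new byte[content.remaining()];
        // duplicate() so reading does not disturb the original buffer's position
        content.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate an oversized backing array, as a fetcher's buffer may have:
        // 64 bytes allocated, only the first 15 hold real content.
        byte[] page = "<html>ok</html>".getBytes(StandardCharsets.UTF_8);
        byte[] backing = new byte[64];
        System.arraycopy(page, 0, backing, 0, page.length);

        ByteBuffer buf = ByteBuffer.wrap(backing, 0, page.length);
        System.out.println(decode(buf)); // prints <html>ok</html>
        // new String(buf.array()) would instead yield a 64-char string
        // with 49 trailing NUL bytes appended to the markup.
    }
}
```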

