Hi Tony,

As Coskun said, you can set http.content.limit to -1 (the default is 65536). Note that it is http.content.limit you need here, not the file.content.limit property you modified:
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

On Mon, Jun 17, 2013 at 7:58 PM, Tony Mullins <[email protected]> wrote:

> The html in my previous email was incorrect (I was trying a different DNS,
> thinking it was due to bad internet)...
> but in short, I am getting an incomplete html response....
>
> Is there any property which could let Nutch wait for the complete html
> to load?
>
> Thanks,
> Tony
>
>
> On Mon, Jun 17, 2013 at 4:43 PM, Tony Mullins <[email protected]> wrote:
>
> > I have modified these values as
> >
> > <property>
> >   <name>http.timeout</name>
> >   <value>20000</value>
> >   <description>The default network timeout, in milliseconds.</description>
> > </property>
> >
> > <property>
> >   <name>file.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content using the file
> >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >   than it will be truncated; otherwise, no truncation at all. Do not
> >   confuse this setting with the http.content.limit setting.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.max.delays</name>
> >   <value>200</value>
> >   <description>The number of times a thread will delay when trying to
> >   fetch a page. Each time it finds that a host is busy, it will wait
> >   fetcher.server.delay. After http.max.delays attempts, it will give
> >   up on the page for now.</description>
> > </property>
> >
> > And I am getting the html for the page
> > http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
> > like this:
> >
> > INFO nutch.selector - page html is <!DOCTYPE HTML>
> > <html>
> > <head>
> >
> > <title>Squarespace - Domain Not Claimed</title>
> > <meta http-equiv="X-UA-Compatible" content="chrome=1">
> >
> > <script type="text/javascript" src="//static.squarespace.com/universal/scripts-v6/061620131943271011/yui-seed.js"></script>
> >
> > <script>
> >
> > Y = YUI(YUI_CONFIG).use("squarespace-util", "squarespace-ui-base",
> > "squarespace-configuration-css", function(Y) {
> >
> >   Y.on("domready", function() {
> >
> >     var lb = new Y.Squarespace.Lightbox({
> >       disableNormalClose: true,
> >       clickAnywhereToExit: false,
> >       content: '<div class="bigtext"><div class="title">Domain Not
> > Claimed</div><div class="description">This domain has been mapped to
> > Squarespace, but it has not yet been claimed by a website. If this is your
> > domain, claim it in the Domains tab of your website manager.</div></div>',
> >       margin: 100,
> >       noHeightConstrain: true
> >     });
> >
> >     lb.show();
> >
> >     lb.getContentEl().on("click", function(e) {
> >       if (e.target.ancestor(".login-button", true)) {
> >         document.location.href = '/config/';
> >       }
> >     });
> >
> >   });
> >
> > });
> >
> > </script>
> >
> > </head>
> > <body class="squarespace-config squarespace-system-page">
> >
> > <div class="minimal-logo"> </div>
> >
> > </body>
> > </html>
> >
> > So as you can see, it's not loading the complete page....
> >
> > Is there any other property that I need to modify?
> >
> > Thanks,
> > Tony.
> >
> >
> > On Mon, Jun 17, 2013 at 4:13 PM, H. Coskun Gunduz <[email protected]> wrote:
> >
> >> Hi Tony,
> >>
> >> You may need to add the http.content.limit parameter in your
> >> nutch-site.xml file.
> >>
> >> For size-unlimited crawling:
> >>
> >> <property>
> >>   <name>http.content.limit</name>
> >>   <value>-1</value>
> >>   <description>The length limit for downloaded content using the file
> >>   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >>   than it will be truncated; otherwise, no truncation at all. Do not
> >>   confuse this setting with the http.content.limit setting.
> >>   </description>
> >> </property>
> >>
> >> Please refer to: http://wiki.apache.org/nutch/nutch-default.xml
> >>
> >> Kind regards..
> >> coskun...
> >>
> >>
> >> On 06/17/2013 02:05 PM, Tony Mullins wrote:
> >>
> >>> Hi,
> >>>
> >>> I am trying to crawl this url
> >>> http://www.amazon.com/Levis-Mens-550-Relaxed-Jean/dp/B0018OKX68
> >>> and getting the crawled page content in my ParseFilter plugin like this:
> >>> String html = new String(webPage.getContent().array());
> >>> Then I am using this html to extract my required information....
> >>>
> >>> But it's not returning me the complete html of the page. I have logged
> >>> the 'html' and I can see that the log file contains incomplete html for
> >>> the above link....
> >>>
> >>> Is there any size limit on a page's content? Or am I doing something
> >>> wrong here?
> >>>
> >>> Thanks,
> >>> Tony.
> >>>
> >>

--
Don't Grow Old, Grow Up... :-)
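The truncation rule that the http.content.limit description states can be sketched in a few lines of Java. This is only an illustration of the rule as described in the thread (nonnegative limit truncates, negative limit disables truncation), not Nutch's actual fetcher code; the class and method names are made up:

```java
import java.util.Arrays;

public class ContentLimit {
    // Illustration of the documented rule: a nonnegative limit truncates
    // content longer than the limit; a negative value (e.g. -1) means no
    // truncation at all.
    static byte[] applyLimit(byte[] content, int limit) {
        if (limit < 0 || content.length <= limit) {
            return content;
        }
        return Arrays.copyOf(content, limit);
    }

    public static void main(String[] args) {
        byte[] page = new byte[100_000];
        System.out.println(applyLimit(page, 65536).length);  // 65536: default limit truncates
        System.out.println(applyLimit(page, -1).length);     // 100000: -1 disables truncation
    }
}
```

This is why a 100 KB Amazon product page comes back cut off under the default settings: everything past byte 65536 is simply dropped before parsing.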

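Separately from the content limit, the decoding line in the original question, `new String(webPage.getContent().array())`, can itself mangle content: `ByteBuffer.array()` exposes the entire backing array (including any bytes past the buffer's limit), and the `String(byte[])` constructor uses the platform default charset. A minimal sketch of a safer decode, assuming the Nutch 2.x API where `webPage.getContent()` returns a `java.nio.ByteBuffer` (the class name `ContentDecode` here is hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentDecode {
    // Decode only the readable region of the buffer, with an explicit charset.
    // ByteBuffer.array() returns the whole backing array, which may be larger
    // than the actual fetched payload.
    static String decode(ByteBuffer content) {
        byte[] bytes = new byte[content.remaining()];
        content.duplicate().get(bytes); // duplicate() leaves the caller's position untouched
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a fetch buffer whose backing array is larger than the payload.
        byte[] backing = new byte[16];
        byte[] payload = "<html/>".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(payload, 0, backing, 0, payload.length);
        ByteBuffer buf = ByteBuffer.wrap(backing, 0, payload.length);

        System.out.println(decode(buf));                       // <html/>
        System.out.println(new String(buf.array()).length());  // 16: trailing NULs included
    }
}
```

This would not have caused the Squarespace placeholder page in the thread (that was a DNS/placeholder issue, as Tony noted), but it is worth fixing alongside the http.content.limit change.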
