The HTML in my previous email was incorrect (I was trying different DNS servers, thinking the problem was a bad internet connection)... but in short, I am getting an incomplete HTML response....
Is there any property in Nutch that could make it wait for the complete HTML of a page to load?

Thanks,
Tony

On Mon, Jun 17, 2013 at 4:43 PM, Tony Mullins <[email protected]> wrote:

> I have modified these values as:
>
> <property>
>   <name>http.timeout</name>
>   <value>20000</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
>
> <property>
>   <name>file.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the file
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the http.content.limit setting.
>   </description>
> </property>
>
> <property>
>   <name>http.max.delays</name>
>   <value>200</value>
>   <description>The number of times a thread will delay when trying to
>   fetch a page. Each time it finds that a host is busy, it will wait
>   fetcher.server.delay. After http.max.delays attempts, it will give
>   up on the page for now.</description>
> </property>
>
> And I am getting the HTML for the page
> http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
> like this:
>
> INFO nutch.selector - page html is <!DOCTYPE HTML>
> <html>
> <head>
>
> <title>Squarespace - Domain Not Claimed</title>
> <meta http-equiv="X-UA-Compatible" content="chrome=1">
>
> <script type="text/javascript" src="//static.squarespace.com/universal/scripts-v6/061620131943271011/yui-seed.js"></script>
>
> <script>
>
> Y = YUI(YUI_CONFIG).use("squarespace-util", "squarespace-ui-base",
> "squarespace-configuration-css", function(Y) {
>
>   Y.on("domready", function() {
>
>     var lb = new Y.Squarespace.Lightbox({
>       disableNormalClose: true,
>       clickAnywhereToExit: false,
>       content: '<div class="bigtext"><div class="title">Domain Not Claimed</div><div class="description">This domain has been mapped to Squarespace, but it has not yet been claimed by a website. If this is your domain, claim it in the Domains tab of your website manager.</div></div>',
>       margin: 100,
>       noHeightConstrain: true
>     });
>
>     lb.show();
>
>     lb.getContentEl().on("click", function(e) {
>       if (e.target.ancestor(".login-button", true)) {
>         document.location.href = '/config/';
>       }
>     });
>
>   });
>
> });
>
> </script>
>
> </head>
> <body class="squarespace-config squarespace-system-page">
>
> <div class="minimal-logo"> </div>
>
> </body>
> </html>
>
> So as you can see, it is not loading the complete page....
>
> Is there any other property that I need to modify?
>
> Thanks,
> Tony.
>
>
> On Mon, Jun 17, 2013 at 4:13 PM, H. Coskun Gunduz <[email protected]> wrote:
>
>> Hi Tony,
>>
>> You may need to add the http.content.limit parameter in the nutch-site.xml file.
>>
>> For size-unlimited crawling:
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>-1</value>
>>   <description>The length limit for downloaded content using the http
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>>
>> Please refer to:
>> http://wiki.apache.org/nutch/nutch-default.xml
>>
>> Kind regards,
>> coskun
>>
>>
>> On 06/17/2013 02:05 PM, Tony Mullins wrote:
>>
>>> Hi,
>>>
>>> I am trying to crawl this URL
>>> http://www.amazon.com/Levis-Mens-550-Relaxed-Jean/dp/B0018OKX68
>>> and getting the crawled page content in my ParseFilter plugin like this:
>>>
>>>     String html = new String(webPage.getContent().array());
>>>
>>> Then I am using this html to extract my required information....
>>>
>>> But it is not returning the complete HTML of the page. I have logged the
>>> 'html' and I can see that the log file contains incomplete HTML for the
>>> above link....
>>>
>>> Is there any size limit on a page's content? Or am I doing something
>>> wrong here?
>>>
>>> Thanks,
>>> Tony.
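[Editor's note] For reference, a cleaned-up version of the nutch-site.xml fragment Coskun recommends is below. Note that the quoted property in the thread carried the description of file.content.limit; the property relevant to HTTP fetches (and to the truncation Tony is seeing) is http.content.limit, whose default in nutch-default.xml is a 64 KB cap:

```xml
<!-- nutch-site.xml: disable truncation of content fetched over HTTP.
     The default http.content.limit is 65536 bytes; -1 means no limit. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all.
  </description>
</property>
```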
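[Editor's note] Separately from the fetch limit, the line `new String(webPage.getContent().array())` in the original question can itself corrupt the extracted HTML: in Nutch 2.x, `getContent()` returns a `java.nio.ByteBuffer`, and `.array()` exposes the entire backing array (which may be larger than the valid region), while `new String(byte[])` decodes with the platform default charset. A minimal sketch of a safer decode, assuming UTF-8 content (plain Java, independent of Nutch; the `decode` helper name is hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentDecode {

    // Decode only the valid region [position, limit) of the buffer, rather
    // than the whole backing array, and use an explicit charset instead of
    // the platform default.
    public static String decode(ByteBuffer content) {
        byte[] bytes = new byte[content.remaining()];
        content.duplicate().get(bytes); // duplicate() leaves the caller's position untouched
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.wrap("<html>ok</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(page)); // prints <html>ok</html>
    }
}
```

In a real plugin the page's declared charset (from the Content-Type header or a meta tag) should be used rather than hard-coded UTF-8; this sketch only shows why decoding the raw backing array is unreliable.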

