I've experienced a similar issue on my development machine running Mac OS X 10.8, but the same code worked perfectly on my server VM running Ubuntu, so no JIRA issue was filed in the end. Also, in my case I was fetching image files rather than HTML content, and the files were hosted locally, so no connection problem was involved.
----- Original Message -----
From: "feng lu" <[email protected]>
To: [email protected]
Sent: Monday, June 17, 2013 10:10:49
Subject: Re: Incomplete HTML content of a crawled Page in ParseFilter ?

Hi Tony

As Coskun said, you can set http.content.limit to -1 (the default is 65536). It is the http.content.limit property you need, not file.content.limit:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

On Mon, Jun 17, 2013 at 7:58 PM, Tony Mullins <[email protected]> wrote:

> The html in my previous email was incorrect (I was trying different DNS
> servers, thinking it was due to a bad internet connection)...
> but in short I am getting an incomplete html response....
>
> Is there any property which could make Nutch wait for the complete html
> of a webpage to load?
>
> Thanks,
> Tony
>
>
> On Mon, Jun 17, 2013 at 4:43 PM, Tony Mullins <[email protected]> wrote:
>
> > I have modified these values as follows:
> >
> > <property>
> >   <name>http.timeout</name>
> >   <value>20000</value>
> >   <description>The default network timeout, in
> >   milliseconds.</description>
> > </property>
> >
> > <property>
> >   <name>file.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content using the file
> >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >   than it will be truncated; otherwise, no truncation at all. Do not
> >   confuse this setting with the http.content.limit setting.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>http.max.delays</name>
> >   <value>200</value>
> >   <description>The number of times a thread will delay when trying to
> >   fetch a page. Each time it finds that a host is busy, it will wait
> >   fetcher.server.delay. After http.max.delays attempts, it will give
> >   up on the page for now.</description>
> > </property>
> >
> > And I am getting the html for the page
> > http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
> > like this:
> >
> > INFO nutch.selector - page html is <!DOCTYPE HTML>
> > <html>
> > <head>
> >
> > <title>Squarespace - Domain Not Claimed</title>
> > <meta http-equiv="X-UA-Compatible" content="chrome=1">
> >
> > <script type="text/javascript" src="//static.squarespace.com/universal/scripts-v6/061620131943271011/yui-seed.js"></script>
> >
> > <script>
> >
> > Y = YUI(YUI_CONFIG).use("squarespace-util", "squarespace-ui-base",
> > "squarespace-configuration-css", function(Y) {
> >
> >   Y.on("domready", function() {
> >
> >     var lb = new Y.Squarespace.Lightbox({
> >       disableNormalClose: true,
> >       clickAnywhereToExit: false,
> >       content: '<div class="bigtext"><div class="title">Domain Not
> > Claimed</div><div class="description">This domain has been mapped to
> > Squarespace, but it has not yet been claimed by a website. If this is your
> > domain, claim it in the Domains tab of your website manager.</div></div>',
> >       margin: 100,
> >       noHeightConstrain: true
> >     });
> >
> >     lb.show();
> >
> >     lb.getContentEl().on("click", function(e) {
> >       if (e.target.ancestor(".login-button", true)) {
> >         document.location.href = '/config/';
> >       }
> >     });
> >
> >   });
> >
> > });
> >
> > </script>
> >
> > </head>
> > <body class="squarespace-config squarespace-system-page">
> >
> > <div class="minimal-logo"> </div>
> >
> > </body>
> > </html>
> >
> > So as you can see, it is not loading the complete page....
> >
> > Is there any other property that I need to modify?
> >
> > Thanks
> > Tony.
> >
> >
> > On Mon, Jun 17, 2013 at 4:13 PM, H. Coskun Gunduz <[email protected]> wrote:
> >
> >> Hi Tony,
> >>
> >> You may need to add the http.content.limit parameter in your nutch-site.xml file.
> >>
> >> For size-unlimited crawling:
> >>
> >> <property>
> >>   <name>http.content.limit</name>
> >>   <value>-1</value>
> >>   <description>The length limit for downloaded content using the http://
> >>   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >>   than it will be truncated; otherwise, no truncation at all. Do not
> >>   confuse this setting with the file.content.limit setting.
> >>   </description>
> >> </property>
> >>
> >> Please refer to: http://wiki.apache.org/nutch/nutch-default.xml
> >>
> >> Kind regards..
> >> coskun...
> >>
> >>
> >> On 06/17/2013 02:05 PM, Tony Mullins wrote:
> >>
> >>> Hi,
> >>>
> >>> I am trying to crawl this url
> >>> http://www.amazon.com/Levis-Mens-550-Relaxed-Jean/dp/B0018OKX68
> >>> and I am getting the crawled page content in my ParseFilter plugin like this:
> >>> String html = new String(webPage.getContent().array());
> >>> Then I am using this html to extract my required information....
> >>>
> >>> But it is not returning me the complete html of the page. I have logged
> >>> the 'html' and I can see that the log file contains incomplete html for
> >>> the above link....
> >>>
> >>> Is there any size limit on a page's content? Or am I doing something
> >>> wrong here?
> >>>
> >>> Thanks,
> >>> Tony.


-- 
Don't Grow Old, Grow Up... :-)
http://www.uci.cu
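[Editor's note] To summarize the resolution of the thread: the truncation was caused by the default http.content.limit of 65536 bytes, and the fix is to override it in nutch-site.xml (not file.content.limit, which only governs the file:// protocol). A minimal sketch of the override, drawn from the property definitions quoted above; note that with no limit, very large responses are fetched and stored in full, so memory and storage use can grow accordingly:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Disable truncation of content fetched over http://.
       Any negative value means "no limit"; the shipped default
       in nutch-default.xml is 65536 bytes. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

Properties set in nutch-site.xml override the same-named properties in nutch-default.xml, so this one fragment is sufficient; the fetcher then passes the untruncated bytes through to ParseFilter plugins.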

