Thanks very much. The version is 0.9.

2009/4/1 Alejandro Gonzalez <[email protected]>
> Yeah, I thought that at first, but I've been having a look into those
> websites and they have some normal links. I'm going to deploy a Nutch
> and try them. Which version are you running?
>
> 2009/4/1 陈琛 <[email protected]>
>
> > Thanks, but I do not think this is a timeout problem.
> >
> > I think they are special websites; perhaps their links come from other
> > sources, like some JavaScript?
> >
> > So I do not know which URLs Nutch can actually fetch...
> >
> > 2009/4/1 Alejandro Gonzalez <[email protected]>
> >
> > > Strange, strange :). Maybe you got a timeout error? Have you changed
> > > this property in nutch-site or nutch-default?
> > >
> > > <property>
> > >   <name>http.timeout</name>
> > >   <value>10000</value>
> > >   <description>The default network timeout, in milliseconds.</description>
> > > </property>
> > >
> > > 2009/4/1 陈琛 <[email protected]>
> > >
> > > > Thanks very much ;)
> > > >
> > > > The Cygwin log (out.txt) and the Nutch log (hadoop.log) are attached.
> > > >
> > > > I cannot find any clues in them.
> > > >
> > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > >
> > > > > Send me the log of the crawl if possible. For sure there are some
> > > > > clues in it.
> > > > >
> > > > > 2009/4/1 陈琛 <[email protected]>
> > > > >
> > > > > > Yes, the depth is 10 and topN is 2000...
> > > > > >
> > > > > > So strange... the other URLs behave normally, but not these 4...
> > > > > >
> > > > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > > > >
> > > > > > > Seems strange. Have you tried to start a crawl with just these
> > > > > > > 4 seed pages?
> > > > > > >
> > > > > > > Are you setting the topN parameter?
> > > > > > >
> > > > > > > 2009/4/1 陈琛 <[email protected]>
> > > > > > >
> > > > > > > > Thanks. I have a collection of URLs, and only these four fail
> > > > > > > > to fetch any subset of their pages.
> > > > > > > >
> > > > > > > > The URLs and the crawl-urlfilter are in the attachment.
> > > > > > > >
> > > > > > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > > > > > >
> > > > > > > > > Is your crawl-urlfilter OK? Are you sure it's fetching them
> > > > > > > > > properly? Maybe it's not getting the content of the pages,
> > > > > > > > > so it cannot extract links to fetch at the next level (I
> > > > > > > > > suppose you have set the crawl depth to more than just the
> > > > > > > > > seed level).
> > > > > > > > >
> > > > > > > > > So either your filters are skipping the seeds (I suppose
> > > > > > > > > that's not the case, since you say the URLs arrive at the
> > > > > > > > > Fetcher), or the fetching is not going OK (network issues?).
> > > > > > > > > Take a look at that.
> > > > > > > > >
> > > > > > > > > 2009/4/1 陈琛 <[email protected]>
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > > I have four URLs, like this:
> > > > > > > > > > http://www.lao-indochina.com
> > > > > > > > > > http://www.nuol.edu.la
> > > > > > > > > > http://www.corninc.com.la
> > > > > > > > > > http://www.vientianecollege.laopdr.com
> > > > > > > > > >
> > > > > > > > > > Only the homepage is fetched. Why? The sub-pages are not
> > > > > > > > > > fetched...
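
For reference, the depth of 10 and topN of 2000 discussed above correspond to
the one-step crawl command from the Nutch 0.9 tutorial. A minimal sketch,
assuming a urls/ seed directory and a crawl/ output directory (both names are
illustrative, not taken from the thread):

    # one-step crawl: inject seeds, then generate/fetch/parse/update for
    # 10 rounds, fetching at most the 2000 top-scoring URLs per round
    bin/nutch crawl urls -dir crawl -depth 10 -topN 2000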
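
When only homepages come back, conf/crawl-urlfilter.txt is the usual suspect:
the one-step crawl applies it to every outlink, and the stock file ends with a
catch-all "-." rule that drops any URL no "+" pattern has matched. A hedged
sketch of per-host entries for the four problem sites, in the same regex style
as the stock tutorial file (the exact patterns are illustrative, not the
poster's attachment):

    # accept anything hosted under the four seed domains
    +^http://([a-z0-9]*\.)*lao-indochina.com/
    +^http://([a-z0-9]*\.)*nuol.edu.la/
    +^http://([a-z0-9]*\.)*corninc.com.la/
    +^http://([a-z0-9]*\.)*vientianecollege.laopdr.com/

    # skip everything else
    -.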
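
On the timeout suggestion: overrides belong in conf/nutch-site.xml, which
takes precedence over nutch-default.xml, rather than editing the defaults
file. A minimal sketch that raises the timeout for slow hosts (the 30000 ms
value is an assumption, not from the thread):

    <?xml version="1.0"?>
    <configuration>
      <!-- override the 10000 ms default from nutch-default.xml -->
      <property>
        <name>http.timeout</name>
        <value>30000</value>
        <description>The default network timeout, in milliseconds.</description>
      </property>
    </configuration>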
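
One way to test the JavaScript-links theory is to check what actually reached
the crawldb and what the parser extracted from the fetched homepages. A
sketch, assuming the crawl/ directory from above; the segment name is a
placeholder, and the readseg command may vary between Nutch versions:

    # status counts: if only the 4 seeds ever show up as fetched,
    # no outlinks survived parsing and filtering
    bin/nutch readdb crawl/crawldb -stats

    # dump a fetched segment to inspect parsed text and extracted outlinks
    bin/nutch readseg -dump crawl/segments/20090401000000 segdump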
