I think the URL http://www.vientianecollege.laopdr.com/ should be changed to http://www.vientianecollege.com/. With that URL the crawl works and the sub-pages are fetched.

http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la
For these three, still only the home page is fetched.

2009/4/1 陈琛 <[email protected]>

> It fetches the other urls, but not the sub-pages...
>
> 2009/4/1 Alejandro Gonzalez <[email protected]>
>
>> try using this as filter in crawl-urlfilter.txt and comment the other
>> + lines
>>
>> +^http://([a-z0-9]*\.)*
>>
>> 2009/4/1 Alejandro Gonzalez <[email protected]>
>>
>> > yeah i thought of it first, but i've been having a look into those websites
>> > and they have some normal links. i'm gonna deploy a nutch and try'em.
>> > which version are u running?
>> >
>> > 2009/4/1 陈琛 <[email protected]>
>> >
>> >> thanks, but I do not think this is the timeout problem
>> >>
>> >> i think they are special websites. Perhaps their links come from other
>> >> sources,
>> >>
>> >> like some javascripts?
>> >>
>> >> so i do not know which url is the right one for nutch to fetch..
>> >>
>> >> 2009/4/1 Alejandro Gonzalez <[email protected]>
>> >>
>> >> > strange strange :). maybe you got a timeout error? have u changed this
>> >> > property in the nutch-site or nutch-default?
>> >> >
>> >> > <property>
>> >> > <name>http.timeout</name>
>> >> > <value>10000</value>
>> >> > <description>The default network timeout, in milliseconds.</description>
>> >> > </property>
>> >> >
>> >> > 2009/4/1 陈琛 <[email protected]>
>> >> >
>> >> > > thanks very much ;)
>> >> > >
>> >> > > the log in the cygwin (out.txt)
>> >> > > and the nutch log (hadoop.log)
>> >> > >
>> >> > > i cannot find any clues
>> >> > >
>> >> > > 2009/4/1 Alejandro Gonzalez <[email protected]>
>> >> > >
>> >> > >> send me the log of the crawling if possible. for sure there are some
>> >> > >> clues in it
>> >> > >>
>> >> > >> 2009/4/1 陈琛 <[email protected]>
>> >> > >>
>> >> > >> > yes, the depth is 10 and topN is 2000...
>> >> > >> >
>> >> > >> > So strange....the other urls are normal..but not these 4 urls..
>> >> > >> >
>> >> > >> > 2009/4/1 Alejandro Gonzalez <[email protected]>
>> >> > >> >
>> >> > >> > > seems strange. have u tried to start a crawl just with these 4 seed
>> >> > >> > > pages?
>> >> > >> > >
>> >> > >> > > Are you setting the topN parameter?
>> >> > >> > >
>> >> > >> > > 2009/4/1 陈琛 <[email protected]>
>> >> > >> > >
>> >> > >> > > > thanks, i have a collection of urls. Only for these four can i not
>> >> > >> > > > search any of their sub-pages
>> >> > >> > > >
>> >> > >> > > > the urls and crawl-urlfilter are in the attachment
>> >> > >> > > >
>> >> > >> > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
>> >> > >> > > >
>> >> > >> > > >> is your crawl-urlfilter ok? are u sure it's fetching them properly?
>> >> > >> > > >> maybe it's not getting the content of the pages and so it cannot
>> >> > >> > > >> extract links to fetch in the next level (i suppose you have set
>> >> > >> > > >> the crawl depth just for the seeds level).
>> >> > >> > > >>
>> >> > >> > > >> So either your filters are skipping the seeds (i suppose it's not
>> >> > >> > > >> the case cause you say that urls arrive to Fetcher), or the
>> >> > >> > > >> fetching is not going ok (network issues?). take a look at that
>> >> > >> > > >>
>> >> > >> > > >> 2009/4/1 陈琛 <[email protected]>
>> >> > >> > > >>
>> >> > >> > > >> > HI, all
>> >> > >> > > >> > I have four urls, like this:
>> >> > >> > > >> > http://www.lao-indochina.com
>> >> > >> > > >> > http://www.nuol.edu.la
>> >> > >> > > >> > http://www.corninc.com.la
>> >> > >> > > >> > http://www.vientianecollege.laopdr.com
>> >> > >> > > >> >
>> >> > >> > > >> > only the home page is fetched, why? The sub-pages are not fetched...
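
A quick way to sanity-check the suggested filter line outside of Nutch is to run the same regular expression over the seed urls. The sketch below is only an approximation: it assumes the filter tries the rules top to bottom, lets the first matching pattern decide (+ accepts, - rejects), and drops a url that matches no rule. The names RULES, SEEDS and accepts are made up for this example and are not part of Nutch.

```python
import re

# Hypothetical rule list: only the single "+" line suggested in the thread.
# Each entry is (sign, pattern); "+" accepts, "-" rejects.
RULES = [
    ("+", r"^http://([a-z0-9]*\.)*"),
]

# The four seed urls from the original question.
SEEDS = [
    "http://www.lao-indochina.com",
    "http://www.nuol.edu.la",
    "http://www.corninc.com.la",
    "http://www.vientianecollege.laopdr.com",
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: a url that matches no rule at all is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for seed in SEEDS:
        print(seed, accepts(seed))
```

Because the group ([a-z0-9]*\.)* may match zero times, this pattern accepts any url that starts with http://, including hosts with a hyphen such as www.lao-indochina.com. So with this as the only rule, the filter itself is unlikely to be what drops the sub-pages, and the page content and the links extracted from it are the next thing to look at.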
