I've tried with these 3 sites, and this is what my Nutch got:
crawl started in: crawl-20090401170024
rootUrlDir = urls
threads = 5
depth = 3
topN = 30
Injector: starting
Injector: crawlDb: crawl-20090401170024/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170027
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170027
Fetcher: threads: 5
fetching http://www.corninc.com.la/
fetching http://www.nuol.edu.la/
fetching http://www.lao-indochina.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-20090401170024/crawldb
CrawlDb update: segments: [crawl-20090401170024/segments/20090401170027]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170042
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170042
Fetcher: threads: 5
fetching http://www.allgamerentals.com/
fetching http://www.mahasan.com/
fetching http://www.amazingcounters.com/
fetching http://www.alldvdrentals.com/video-game-rentals.html
fetching http://www.allgamerentals.com/rental-services.php
fetching http://www.corninc.com.la/_pgtres/stm31.js

Are you sure it is not a network issue? Because the only strange thing I've noticed is the slower fetching...
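For reference, a crawl with the parameters shown in that log would be started with something like the command below; urls is the seed directory from the log, and crawl-20090401170024 is just the timestamped output directory that Nutch names by default when no -dir option is given:

  bin/nutch crawl urls -threads 5 -depth 3 -topN 30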
2009/4/1 陈琛 <[email protected]>

> this url, http://www.vientianecollege.laopdr.com/, I think should change to
> http://www.vientianecollege.com/
> that one is right, Nutch can fetch its sub-pages
>
> http://www.lao-indochina.com
> http://www.nuol.edu.la
> http://www.corninc.com.la
> these also only fetch the home page
>
> 2009/4/1 陈琛 <[email protected]>
>
> > it fetches other urls, but not the sub-pages...
> >
> > 2009/4/1 Alejandro Gonzalez <[email protected]>
> >
> > > try using this as the filter in crawl-urlfilter.txt and comment out the
> > > other + lines:
> > >
> > > +^http://([a-z0-9]*\.)*
> > >
> > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > >
> > > > yeah, I thought that at first, but I've been having a look at those
> > > > websites and they have some normal links. I'm going to deploy a Nutch
> > > > and try them. Which version are you running?
> > > >
> > > > 2009/4/1 陈琛 <[email protected]>
> > > >
> > > > > thanks, but I do not think this is a timeout problem
> > > > >
> > > > > I think they are special websites; perhaps their links come from
> > > > > other sources, like some JavaScript?
> > > > >
> > > > > so I do not know which urls Nutch can actually fetch..
> > > > >
> > > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > > >
> > > > > > strange, strange :). maybe you got a timeout error? have you
> > > > > > changed this property in nutch-site or nutch-default?
> > > > > >
> > > > > > <property>
> > > > > >   <name>http.timeout</name>
> > > > > >   <value>10000</value>
> > > > > >   <description>The default network timeout, in
> > > > > >   milliseconds.</description>
> > > > > > </property>
> > > > > >
> > > > > > 2009/4/1 陈琛 <[email protected]>
> > > > > >
> > > > > > > thanks very much ;)
> > > > > > >
> > > > > > > here are the log from Cygwin (out.txt) and the Nutch log
> > > > > > > (hadoop.log)
> > > > > > >
> > > > > > > I cannot find any clues in them
> > > > > > >
> > > > > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > > > > >
> > > > > > > > send me the log of the crawl if possible; for sure there are
> > > > > > > > some clues in it
> > > > > > > >
> > > > > > > > 2009/4/1 陈琛 <[email protected]>
> > > > > > > >
> > > > > > > > > yes, the depth is 10 and topN is 2000...
> > > > > > > > >
> > > > > > > > > so strange... the other urls are normal, but these 4 urls...
> > > > > > > > >
> > > > > > > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > > > > > > >
> > > > > > > > > > seems strange. have you tried to start a crawl with just
> > > > > > > > > > these 4 seed pages?
> > > > > > > > > >
> > > > > > > > > > are you setting the topN parameter?
> > > > > > > > > >
> > > > > > > > > > 2009/4/1 陈琛 <[email protected]>
> > > > > > > > > >
> > > > > > > > > > > thanks, I have a collection of urls; only these four
> > > > > > > > > > > cannot fetch a subset of their pages
> > > > > > > > > > >
> > > > > > > > > > > the urls and crawl-urlfilter are in the attachment
> > > > > > > > > > >
> > > > > > > > > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > > > > > > > > >
> > > > > > > > > > > > is your crawl-urlfilter ok? are you sure it's fetching
> > > > > > > > > > > > them properly? maybe it's not getting the content of
> > > > > > > > > > > > the pages, so it cannot extract links to fetch at the
> > > > > > > > > > > > next level (I suppose you have set the crawl depth to
> > > > > > > > > > > > go beyond just the seeds level).
> > > > > > > > > > > >
> > > > > > > > > > > > so either your filters are skipping the seeds (I
> > > > > > > > > > > > suppose that's not the case, since you say the urls
> > > > > > > > > > > > arrive at the Fetcher), or the fetching is not going
> > > > > > > > > > > > ok (network issues?). take a look at that
> > > > > > > > > > > >
> > > > > > > > > > > > 2009/4/1 陈琛 <[email protected]>
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi, all
> > > > > > > > > > > > > I have four urls, like this:
> > > > > > > > > > > > > http://www.lao-indochina.com
> > > > > > > > > > > > > http://www.nuol.edu.la
> > > > > > > > > > > > > http://www.corninc.com.la
> > > > > > > > > > > > > http://www.vientianecollege.laopdr.com
> > > > > > > > > > > > >
> > > > > > > > > > > > > they only fetch the home page. why? the sub-pages
> > > > > > > > > > > > > are not fetched...
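The +^http://([a-z0-9]*\.)* line suggested above accepts any http url. A minimal crawl-urlfilter.txt built around it might look roughly like this sketch, modeled on the default file shipped with Nutch (the skip rules here are the stock ones and may need adjusting):

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):

  # skip image and other suffixes we can't parse
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

  # skip urls containing certain characters, as probable queries
  -[?*!@=]

  # accept any http url, as suggested above
  +^http://([a-z0-9]*\.)*

  # skip everything else
  -.

Note that the stock -[?*!@=] rule also skips dynamic urls with query strings, which by itself can stop sub-pages from ever being fetched; and links generated purely by JavaScript, as suspected earlier in the thread, are not extracted by the HTML parser in any case.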
