Thanks very much ;) The Cygwin console log (out.txt) and the Nutch log (hadoop.log) are attached.
I cannot find any clues in them.

2009/4/1 Alejandro Gonzalez <[email protected]>

> Send me the log of the crawling if possible. For sure there are some clues in it.
>
> 2009/4/1 陈琛 <[email protected]>
>
> > Yes, the depth is 10 and topN is 2000...
> >
> > So strange... the other URLs are normal, but not these 4 URLs.
> >
> > 2009/4/1 Alejandro Gonzalez <[email protected]>
> >
> > > Seems strange. Have you tried to start a crawl with just these 4 seed
> > > pages?
> > >
> > > Are you setting the topN parameter?
> > >
> > > 2009/4/1 陈琛 <[email protected]>
> > >
> > > > Thanks. I have a collection of URLs; only these four fail to get any
> > > > of their sub-pages.
> > > >
> > > > The URLs and the crawl-urlfilter are in the attachment.
> > > >
> > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > >
> > > > > Is your crawl-urlfilter OK? Are you sure it's fetching them
> > > > > properly? Maybe it's not getting the content of the pages, so it
> > > > > cannot extract links to fetch at the next level (I suppose you have
> > > > > set the crawl depth to more than just the seed level).
> > > > >
> > > > > So either your filters are skipping the seeds (I suppose that's not
> > > > > the case, since you say the URLs reach the Fetcher), or the fetching
> > > > > is not going OK (network issues?). Take a look at that.
> > > > >
> > > > > 2009/4/1 陈琛 <[email protected]>
> > > > >
> > > > > > Hi all,
> > > > > > I have four URLs, like this:
> > > > > > http://www.lao-indochina.com
> > > > > > http://www.nuol.edu.la
> > > > > > http://www.corninc.com.la
> > > > > > http://www.vientianecollege.laopdr.com
> > > > > >
> > > > > > Only the homepage is fetched. Why? The sub-pages are not fetched...
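(For reference: the crawl-urlfilter attachment is not reproduced here, but with Nutch's default regex URL filter, sub-pages are only kept if an accept rule matches each seed host. A minimal sketch, assuming the stock conf/crawl-urlfilter.txt layout; the host patterns below are illustrative, not the actual attachment:

# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# skip image and other binary suffixes
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|ppt|mpg|xls|gz|exe|jpeg|JPEG|bmp|BMP)$
# accept pages from the four seed hosts
+^http://([a-z0-9]*\.)*lao-indochina.com/
+^http://([a-z0-9]*\.)*nuol.edu.la/
+^http://([a-z0-9]*\.)*corninc.com.la/
+^http://([a-z0-9]*\.)*vientianecollege.laopdr.com/
# skip everything else
-.

Note that the stock filter also ships a "-[?*!@=]" rule, which rejects any URL containing a query string; a site whose internal links all carry "?" parameters would then yield no sub-pages at all.)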
crawl started in: new_crawl
rootUrlDir = urls
threads = 10
depth = 10
topN = 2000
Injector: starting
Injector: crawlDb: new_crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: new_crawl/segments/20090401201826
Generator: filtering: false
Generator: topN: 2000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: new_crawl/segments/20090401201826
Fetcher: threads: 10
fetching http://www.corninc.com.la/
fetching http://www.vientianecollege.laopdr.com/
fetching http://www.nuol.edu.la/
fetching http://www.lao-indochina.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: new_crawl/crawldb
CrawlDb update: segments: [new_crawl/segments/20090401201826]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: new_crawl/segments/20090401201931
Generator: filtering: false
Generator: topN: 2000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: new_crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: new_crawl/segments/20090401201826
LinkDb: done
Indexer: starting
Indexer: linkdb: new_crawl/linkdb
Indexer: adding segment: new_crawl/segments/20090401201826
Indexing [http://www.corninc.com.la/] with analyzer org.apache.nutch.analysis.nutchdocumentanaly...@dec8b3 (null)
Indexing [http://www.lao-indochina.com/] with analyzer org.apache.nutch.analysis.nutchdocumentanaly...@dec8b3 (null)
maxFieldLength 10000 reached, ignoring following tokens
Indexing [http://www.nuol.edu.la/] with analyzer org.apache.nutch.analysis.nutchdocumentanaly...@dec8b3 (null)
Indexing [http://www.vientianecollege.laopdr.com/] with analyzer org.apache.nutch.analysis.nutchdocumentanaly...@dec8b3 (null)
Optimizing index.
merging segments _ram_0 (1 docs) _ram_1 (1 docs) _ram_2 (1 docs) _ram_3 (1 docs) into _0 (4 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: new_crawl/indexes
Dedup: done
merging indexes to: new_crawl/index
Adding new_crawl/indexes/part-00000
done merging
crawl finished: new_crawl
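The second Generate pass selects 0 records ("Stopping at depth=1 - no more URLs to fetch"), so no outlinks from the four homepages survived parsing and filtering. A sketch of the usual checks, using the segment and db paths from the log above (the dump directory names are arbitrary examples):

# crawldb statistics - more than 4 URLs here would mean outlinks were accepted
bin/nutch readdb new_crawl/crawldb -stats

# dump the fetched segment and look at ParseData to see which outlinks, if any, were extracted per page
bin/nutch readseg -dump new_crawl/segments/20090401201826 seg_dump

# dump the inverted link database
bin/nutch readlinkdb new_crawl/linkdb -dump linkdb_dump

If ParseData shows outlinks but the crawldb stays at 4 URLs, the URL filters are rejecting them; if there are no outlinks at all, the pages were probably not parsed correctly (for example, truncated content or markup the parser cannot handle).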
