Thanks very much ;) The Cygwin console log (out.txt) and the Nutch log (hadoop.log) are attached.
I cannot find any clues in them.

2009/4/1 Alejandro Gonzalez <[email protected]>

> Send me the log of the crawling if possible. For sure there are some clues in it.
>
> 2009/4/1 陈琛 <[email protected]>
>
> > Yes, the depth is 10 and topN is 2000...
> >
> > So strange... the other URLs are normal, but not these 4 URLs.
> >
> > 2009/4/1 Alejandro Gonzalez <[email protected]>
> >
> > > Seems strange. Have you tried to start a crawl with just these 4 seed
> > > pages?
> > >
> > > Are you setting the topN parameter?
> > >
> > > 2009/4/1 陈琛 <[email protected]>
> > >
> > > > Thanks. I have a collection of URLs; only these four fail to get any
> > > > of their sub-pages.
> > > >
> > > > The URLs and the crawl-urlfilter are in the attachment.
> > > >
> > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > >
> > > > > Is your crawl-urlfilter OK? Are you sure it's fetching them
> > > > > properly? Maybe it's not getting the content of the pages, so it
> > > > > cannot extract links to fetch at the next level (I suppose you have
> > > > > set the crawl depth to more than just the seed level).
> > > > >
> > > > > So either your filters are skipping the seeds (I suppose that's not
> > > > > the case, since you say the URLs reach the Fetcher), or the fetching
> > > > > is not going OK (network issues?). Take a look at that.
> > > > >
> > > > > 2009/4/1 陈琛 <[email protected]>
> > > > >
> > > > > > Hi all,
> > > > > > I have four URLs, like this:
> > > > > > http://www.lao-indochina.com
> > > > > > http://www.nuol.edu.la
> > > > > > http://www.corninc.com.la
> > > > > > http://www.vientianecollege.laopdr.com
> > > > > >
> > > > > > Only the homepage is fetched. Why? The sub-pages are not fetched...
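(For reference: the crawl-urlfilter attachment is not reproduced here, but with Nutch's default regex URL filter, sub-pages are only kept if an accept rule matches each seed host. A minimal sketch, assuming the stock conf/crawl-urlfilter.txt layout; the host patterns below are illustrative, not the actual attachment:

# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# skip image and other binary suffixes
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|ppt|mpg|xls|gz|exe|jpeg|JPEG|bmp|BMP)$
# accept pages from the four seed hosts
+^http://([a-z0-9]*\.)*lao-indochina.com/
+^http://([a-z0-9]*\.)*nuol.edu.la/
+^http://([a-z0-9]*\.)*corninc.com.la/
+^http://([a-z0-9]*\.)*vientianecollege.laopdr.com/
# skip everything else
-.

Note that the stock filter also ships a "-[?*!@=]" rule, which rejects any URL containing a query string; a site whose internal links all carry "?" parameters would then yield no sub-pages at all.)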
crawl started in: new_crawl
rootUrlDir = urls
threads = 10
depth = 10
topN = 2000
Injector: starting
Injector: crawlDb: new_crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: new_crawl/segments/20090401201826
Generator: filtering: false
Generator: topN: 2000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: new_crawl/segments/20090401201826
Fetcher: threads: 10
fetching http://www.corninc.com.la/
fetching http://www.vientianecollege.laopdr.com/
fetching http://www.nuol.edu.la/
fetching http://www.lao-indochina.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: new_crawl/crawldb
CrawlDb update: segments: [new_crawl/segments/20090401201826]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: new_crawl/segments/20090401201931
Generator: filtering: false
Generator: topN: 2000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: new_crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: new_crawl/segments/20090401201826
LinkDb: done
Indexer: starting
Indexer: linkdb: new_crawl/linkdb
Indexer: adding segment: new_crawl/segments/20090401201826
Indexing [http://www.corninc.com.la/] with analyzer org.apache.nutch.analysis.nutchdocumentanaly...@dec8b3 (null)
Indexing [http://www.lao-indochina.com/] with analyzer org.apache.nutch.analysis.nutchdocumentanaly...@dec8b3 (null)
maxFieldLength 10000 reached, ignoring following tokens
Indexing [http://www.nuol.edu.la/] with analyzer org.apache.nutch.analysis.nutchdocumentanaly...@dec8b3 (null)
Indexing [http://www.vientianecollege.laopdr.com/] with analyzer org.apache.nutch.analysis.nutchdocumentanaly...@dec8b3 (null)
Optimizing index.
merging segments _ram_0 (1 docs) _ram_1 (1 docs) _ram_2 (1 docs) _ram_3 (1 docs) into _0 (4 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: new_crawl/indexes
Dedup: done
merging indexes to: new_crawl/index
Adding new_crawl/indexes/part-00000
done merging
crawl finished: new_crawl
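The second Generate pass selects 0 records ("Stopping at depth=1 - no more URLs to fetch"), so no outlinks from the four homepages survived parsing and filtering. A sketch of the usual checks, using the segment and db paths from the log above (the dump directory names are arbitrary examples):

# crawldb statistics - more than 4 URLs here would mean outlinks were accepted
bin/nutch readdb new_crawl/crawldb -stats

# dump the fetched segment and look at ParseData to see which outlinks, if any, were extracted per page
bin/nutch readseg -dump new_crawl/segments/20090401201826 seg_dump

# dump the inverted link database
bin/nutch readlinkdb new_crawl/linkdb -dump linkdb_dump

If ParseData shows outlinks but the crawldb stays at 4 URLs, the URL filters are rejecting them; if there are no outlinks at all, the pages were probably not parsed correctly (for example, truncated content or markup the parser cannot handle).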
