I've tried with these 3 sites and this is what my Nutch got:

crawl started in: crawl-20090401170024
rootUrlDir = urls
threads = 5
depth = 3
topN = 30
Injector: starting
Injector: crawlDb: crawl-20090401170024/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170027
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170027
Fetcher: threads: 5
fetching http://www.corninc.com.la/
fetching http://www.nuol.edu.la/
fetching http://www.lao-indochina.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-20090401170024/crawldb
CrawlDb update: segments: [crawl-20090401170024/segments/20090401170027]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170042
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170042
Fetcher: threads: 5
fetching http://www.allgamerentals.com/
fetching http://www.mahasan.com/
fetching http://www.amazingcounters.com/
fetching http://www.alldvdrentals.com/video-game-rentals.html
fetching http://www.allgamerentals.com/rental-services.php
fetching http://www.corninc.com.la/_pgtres/stm31.js
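
For reference, a command matching the parameters at the top of that log would be something like this (the "urls" seed directory is the one shown as rootUrlDir):

bin/nutch crawl urls -threads 5 -depth 3 -topN 30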


Are you sure it is not a network issue? Because the only strange thing I've
noticed is the slower fetching...
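
In case it is just slow servers, one thing to try is raising the timeout in your conf/nutch-site.xml (the value below is only an illustration):

<property>
 <name>http.timeout</name>
 <value>30000</value>
 <description>The default network timeout, in milliseconds.</description>
</property>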

2009/4/1 陈琛 <[email protected]>

> I think the URL http://www.vientianecollege.laopdr.com/ should change to
> http://www.vientianecollege.com/
> That one is right; it can fetch sub-pages.
>
>
> http://www.lao-indochina.com
> http://www.nuol.edu.la
> http://www.corninc.com.la
> These also only fetch the home page.
>
>
> 2009/4/1 陈琛 <[email protected]>
>
> > It fetches other URLs, but not the sub-pages...
> >
> >   2009/4/1 Alejandro Gonzalez <[email protected]>
> >
> >> Try using this as the filter in crawl-urlfilter.txt and comment out the
> >> other + lines:
> >>
> >> +^http://([a-z0-9]*\.)*
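> >>
> >> For example, the file could end up looking something like this
> >> (simplified; your default file may differ):
> >>
> >> # skip file:, ftp:, and mailto: urls
> >> -^(file|ftp|mailto):
> >> # skip image and other suffixes we can't parse
> >> -\.(gif|jpg|png|ico|css|exe|zip|gz)$
> >> # skip URLs containing certain characters as probable queries, etc.
> >> -[?*!@=]
> >> # accept everything else over http
> >> +^http://([a-z0-9]*\.)*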
> >>
> >> 2009/4/1 Alejandro Gonzalez <[email protected]>
> >>
> >> > Yeah, I thought that at first, but I've been having a look at those
> >> > websites and they have some normal links. I'm going to deploy a Nutch
> >> > and try them. Which version are you running?
> >> >
> >> >
> >> >
> >> > 2009/4/1 陈琛 <[email protected]>
> >> >
> >> >> Thanks, but I do not think this is the timeout problem.
> >> >>
> >> >> I think they are special websites; perhaps their links come from
> >> >> other sources,
> >> >>
> >> >> like some JavaScript?
> >> >>
> >> >> So I do not know which URLs can be fetched by Nutch...
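> >> >>
> >> >> If the links really are generated by JavaScript, the default HTML
> >> >> parser will not see them as plain anchors. Maybe it is worth checking
> >> >> whether the parse-js plugin is listed in the plugin.includes property
> >> >> of conf/nutch-site.xml (the value below is just an illustration):
> >> >>
> >> >> <property>
> >> >>  <name>plugin.includes</name>
> >> >>  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
> >> >> </property>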
> >> >>
> >> >> 2009/4/1 Alejandro Gonzalez <[email protected]>
> >> >>
> >> >> > Strange, strange :). Maybe you got a timeout error? Have you changed
> >> >> > this property in nutch-site or nutch-default?
> >> >> >
> >> >> > <property>
> >> >> >  <name>http.timeout</name>
> >> >> >  <value>10000</value>
> >> >> >  <description>The default network timeout, in milliseconds.</description>
> >> >> > </property>
> >> >> >
> >> >> >
> >> >> >
> >> >> > 2009/4/1 陈琛 <[email protected]>
> >> >> >
> >> >> > >
> >> >> > > Thanks very much ;)
> >> >> > >
> >> >> > > The log from Cygwin (out.txt) and the Nutch log (hadoop.log) are
> >> >> > > attached.
> >> >> > >
> >> >> > > I cannot find any clues in them.
> >> >> > >
> >> >> > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> >> >> > >
> >> >> > >> Send me the log of the crawl if possible; for sure there are
> >> >> > >> some clues in it.
> >> >> > >>
> >> >> > >> 2009/4/1 陈琛 <[email protected]>
> >> >> > >>
> >> >> > >> > Yes, the depth is 10 and topN is 2000...
> >> >> > >> >
> >> >> > >> > So strange... for the other URLs it is normal, but for these 4 URLs...
> >> >> > >> >
> >> >> > >> >
> >> >> > >> >
> >> >> > >> > 2009/4/1 Alejandro Gonzalez <[email protected]>
> >> >> > >> >
> >> >> > >> > > Seems strange. Have you tried to start a crawl with just
> >> >> > >> > > these 4 seed pages?
> >> >> > >> > >
> >> >> > >> > > Are you setting the topN parameter?
> >> >> > >> > >
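> >> >> > >> > > For example, something like this (directory name and numbers
> >> >> > >> > > are just an illustration):
> >> >> > >> > >
> >> >> > >> > > bin/nutch crawl urls -dir crawl-test -threads 5 -depth 3 -topN 50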
> >> >> > >> > >
> >> >> > >> > > 2009/4/1 陈琛 <[email protected]>
> >> >> > >> > >
> >> >> > >> > > >
> >> >> > >> > > > Thanks. I have a collection of URLs, and only these four
> >> >> > >> > > > cannot fetch a subset of their pages.
> >> >> > >> > > >
> >> >> > >> > > > The URLs and crawl-urlfilter are in the attachment.
> >> >> > >> > > >
> >> >> > >> > > >
> >> >> > >> > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> >> >> > >> > > >
> >> >> > >> > > >> Is your crawl-urlfilter OK? Are you sure it's fetching them
> >> >> > >> > > >> properly? Maybe it's not getting the content of the pages,
> >> >> > >> > > >> and so it cannot extract links to fetch at the next level
> >> >> > >> > > >> (I suppose you have set the crawl depth beyond just the
> >> >> > >> > > >> seeds level).
> >> >> > >> > > >>
> >> >> > >> > > >> So either your filters are skipping the seeds (I suppose
> >> >> > >> > > >> that's not the case, since you say the URLs arrive at the
> >> >> > >> > > >> Fetcher), or the fetching is not going OK (network
> >> >> > >> > > >> issues?). Take a look at that.
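> >> >> > >> > > >>
> >> >> > >> > > >> One quick way to check what was actually fetched (the
> >> >> > >> > > >> segment path here is just an example) is to dump a segment:
> >> >> > >> > > >>
> >> >> > >> > > >> bin/nutch readseg -dump crawl/segments/20090401170027 seg-dump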
> >> >> > >> > > >>
> >> >> > >> > > >> 2009/4/1 陈琛 <[email protected]>
> >> >> > >> > > >>
> >> >> > >> > > >> > Hi all,
> >> >> > >> > > >> >       I have four URLs, like this:
> >> >> > >> > > >> >       http://www.lao-indochina.com
> >> >> > >> > > >> >       http://www.nuol.edu.la
> >> >> > >> > > >> >       http://www.corninc.com.la
> >> >> > >> > > >> >       http://www.vientianecollege.laopdr.com
> >> >> > >> > > >> >
> >> >> > >> > > >> > Only the home page is fetched. Why? The sub-pages are not fetched...
