Thanks very much. The version is 0.9.

2009/4/1 Alejandro Gonzalez <[email protected]>
> Yeah, I thought that at first, but I've been having a look into those
> websites and they have some normal links. I'm going to deploy a Nutch
> and try them. Which version are you running?
>
> 2009/4/1 陈琛 <[email protected]>
>
> > Thanks, but I do not think this is a timeout problem.
> >
> > I think they are special websites; perhaps their links come from other
> > sources, like some JavaScript?
> >
> > So I do not know which URLs Nutch can actually fetch...
> >
> > 2009/4/1 Alejandro Gonzalez <[email protected]>
> >
> > > Strange, strange :). Maybe you got a timeout error? Have you changed
> > > this property in nutch-site or nutch-default?
> > >
> > > <property>
> > >   <name>http.timeout</name>
> > >   <value>10000</value>
> > >   <description>The default network timeout, in milliseconds.</description>
> > > </property>
> > >
> > > 2009/4/1 陈琛 <[email protected]>
> > >
> > > > Thanks very much ;)
> > > >
> > > > The Cygwin log (out.txt) and the Nutch log (hadoop.log) are attached.
> > > >
> > > > I cannot find any clues in them.
> > > >
> > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > >
> > > > > Send me the log of the crawl if possible. For sure there are some
> > > > > clues in it.
> > > > >
> > > > > 2009/4/1 陈琛 <[email protected]>
> > > > >
> > > > > > Yes, the depth is 10 and topN is 2000...
> > > > > >
> > > > > > So strange... the other URLs behave normally, but not these 4...
> > > > > >
> > > > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > > > >
> > > > > > > Seems strange. Have you tried to start a crawl with just these
> > > > > > > 4 seed pages?
> > > > > > >
> > > > > > > Are you setting the topN parameter?
> > > > > > >
> > > > > > > 2009/4/1 陈琛 <[email protected]>
> > > > > > >
> > > > > > > > Thanks. I have a collection of URLs, and only these four fail
> > > > > > > > to fetch any subset of their pages.
> > > > > > > >
> > > > > > > > The URLs and the crawl-urlfilter are in the attachment.
> > > > > > > >
> > > > > > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
> > > > > > > >
> > > > > > > > > Is your crawl-urlfilter OK? Are you sure it's fetching them
> > > > > > > > > properly? Maybe it's not getting the content of the pages,
> > > > > > > > > so it cannot extract links to fetch at the next level (I
> > > > > > > > > suppose you have set the crawl depth to more than just the
> > > > > > > > > seed level).
> > > > > > > > >
> > > > > > > > > So either your filters are skipping the seeds (I suppose
> > > > > > > > > that's not the case, since you say the URLs arrive at the
> > > > > > > > > Fetcher), or the fetching is not going OK (network issues?).
> > > > > > > > > Take a look at that.
> > > > > > > > >
> > > > > > > > > 2009/4/1 陈琛 <[email protected]>
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > > I have four URLs, like this:
> > > > > > > > > > http://www.lao-indochina.com
> > > > > > > > > > http://www.nuol.edu.la
> > > > > > > > > > http://www.corninc.com.la
> > > > > > > > > > http://www.vientianecollege.laopdr.com
> > > > > > > > > >
> > > > > > > > > > Only the homepage is fetched. Why? The sub-pages are not
> > > > > > > > > > fetched...
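
For reference, the depth of 10 and topN of 2000 discussed above correspond to
the one-step crawl command from the Nutch 0.9 tutorial. A minimal sketch,
assuming a urls/ seed directory and a crawl/ output directory (both names are
illustrative, not taken from the thread):

    # one-step crawl: inject seeds, then generate/fetch/parse/update for
    # 10 rounds, fetching at most the 2000 top-scoring URLs per round
    bin/nutch crawl urls -dir crawl -depth 10 -topN 2000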
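
When only homepages come back, conf/crawl-urlfilter.txt is the usual suspect:
the one-step crawl applies it to every outlink, and the stock file ends with a
catch-all "-." rule that drops any URL no "+" pattern has matched. A hedged
sketch of per-host entries for the four problem sites, in the same regex style
as the stock tutorial file (the exact patterns are illustrative, not the
poster's attachment):

    # accept anything hosted under the four seed domains
    +^http://([a-z0-9]*\.)*lao-indochina.com/
    +^http://([a-z0-9]*\.)*nuol.edu.la/
    +^http://([a-z0-9]*\.)*corninc.com.la/
    +^http://([a-z0-9]*\.)*vientianecollege.laopdr.com/

    # skip everything else
    -.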
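
On the timeout suggestion: overrides belong in conf/nutch-site.xml, which
takes precedence over nutch-default.xml, rather than editing the defaults
file. A minimal sketch that raises the timeout for slow hosts (the 30000 ms
value is an assumption, not from the thread):

    <?xml version="1.0"?>
    <configuration>
      <!-- override the 10000 ms default from nutch-default.xml -->
      <property>
        <name>http.timeout</name>
        <value>30000</value>
        <description>The default network timeout, in milliseconds.</description>
      </property>
    </configuration>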
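
One way to test the JavaScript-links theory is to check what actually reached
the crawldb and what the parser extracted from the fetched homepages. A
sketch, assuming the crawl/ directory from above; the segment name is a
placeholder, and the readseg command may vary between Nutch versions:

    # status counts: if only the 4 seeds ever show up as fetched,
    # no outlinks survived parsing and filtering
    bin/nutch readdb crawl/crawldb -stats

    # dump a fetched segment to inspect parsed text and extracted outlinks
    bin/nutch readseg -dump crawl/segments/20090401000000 segdump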
