http://www.vientianecollege.laopdr.com/ : I think this url should be changed to
http://www.vientianecollege.com/
With the corrected url it works, and the sub-pages can be fetched.


http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la
still fetch only the home page.
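
For reference, a minimal conf/crawl-urlfilter.txt along the lines Alejandro suggests further down in the thread might look like the sketch below. The per-domain + lines are only guesses based on the seed urls above (using the corrected vientianecollege.com host), and the final -. rule is the usual default that skips everything else:

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# skip urls with characters that usually mean queries or session ids
-[?*!@=]
# accept only the seed hosts and their subdomains
+^http://([a-z0-9]*\.)*lao-indochina.com/
+^http://([a-z0-9]*\.)*nuol.edu.la/
+^http://([a-z0-9]*\.)*corninc.com.la/
+^http://([a-z0-9]*\.)*vientianecollege.com/
# skip everything else
-.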


2009/4/1 陈琛 <[email protected]>

> it fetches other urls, but not the sub-pages......
>
>   2009/4/1 Alejandro Gonzalez <[email protected]>
>
>> try using this as the filter in crawl-urlfilter.txt and comment out the other + lines
>>
>> +^http://([a-z0-9]*\.)*
>>
>> 2009/4/1 Alejandro Gonzalez <[email protected]>
>>
>> > yeah, I thought that at first, but I've been having a look into those websites
>> > and they have some normal links. I'm going to deploy a Nutch and try them. Which
>> > version are you running?
>> >
>> >
>> >
>> > 2009/4/1 陈琛 <[email protected]>
>> >
>> >> thanks, but I do not think this is a timeout problem
>> >>
>> >> I think they are special websites; perhaps their links come from other sources,
>> >>
>> >> like some JavaScript?
>> >>
>> >> so I do not know which urls can be fetched correctly by Nutch..
>> >>
>> >> 2009/4/1 Alejandro Gonzalez <[email protected]>
>> >>
>> >> > strange strange :). maybe you got a timeout error? have you changed this
>> >> > property in nutch-site or nutch-default?
>> >> >
>> >> > <property>
>> >> >  <name>http.timeout</name>
>> >> >  <value>10000</value>
>> >> >  <description>The default network timeout, in milliseconds.</description>
>> >> > </property>
>> >> >
>> >> >
>> >> >
>> >> > 2009/4/1 陈琛 <[email protected]>
>> >> >
>> >> > >
>> >> > > thanks very much ;)
>> >> > >
>> >> > > the cygwin console log (out.txt)
>> >> > > and the nutch log (hadoop.log)
>> >> > >
>> >> > >
>> >> > > I cannot find any clues in them
>> >> > >
>> >> > > 2009/4/1 Alejandro Gonzalez <[email protected]>
>> >> > >
>> >> > >> send me the log of the crawling if possible. for sure there are some
>> >> > >> clues in it
>> >> > >>
>> >> > >> 2009/4/1 陈琛 <[email protected]>
>> >> > >>
>> >> > >> > yes, the depth is 10 and topN is 2000...
>> >> > >> >
>> >> > >> > So strange.... the other urls are normal, but these 4 urls..
>> >> > >> >
>> >> > >> >
>> >> > >> >
>> >> > >> > 2009/4/1 Alejandro Gonzalez <[email protected]>
>> >> > >> >
>> >> > >> > > seems strange. have you tried to start a crawl with just these 4 seed
>> >> > >> > > pages?
>> >> > >> > >
>> >> > >> > > Are you setting the topN parameter?
>> >> > >> > >
>> >> > >> > >
>> >> > >> > > 2009/4/1 陈琛 <[email protected]>
>> >> > >> > >
>> >> > >> > > >
>> >> > >> > > > thanks, I have a collection of urls. Only these four cannot fetch a
>> >> > >> > > > subset of their pages
>> >> > >> > > >
>> >> > >> > > > the urls and crawl-urlfilter are in the attachment
>> >> > >> > > >
>> >> > >> > > >
>> >> > >> > > > 2009/4/1 Alejandro Gonzalez <[email protected]>
>> >> > >> > > >
>> >> > >> > > >> is your crawl-urlfilter ok? are you sure it's fetching them
>> >> > >> > > >> properly? maybe it's not getting the content of the pages and so it
>> >> > >> > > >> cannot extract links to fetch in the next level (i suppose you have
>> >> > >> > > >> set the crawl depth just for the seed level).
>> >> > >> > > >>
>> >> > >> > > >> So either your filters are skipping the seeds (i suppose that's not
>> >> > >> > > >> the case, since you say the urls arrive at the Fetcher), or the
>> >> > >> > > >> fetching is not going ok (network issues?). take a look at that
>> >> > >> > > >>
>> >> > >> > > >> 2009/4/1 陈琛 <[email protected]>
>> >> > >> > > >>
>> >> > >> > > >> > Hi all,
>> >> > >> > > >> >       I have four urls, like these:
>> >> > >> > > >> >       http://www.lao-indochina.com
>> >> > >> > > >> >       http://www.nuol.edu.la
>> >> > >> > > >> >       http://www.corninc.com.la
>> >> > >> > > >> >       http://www.vientianecollege.laopdr.com
>> >> > >> > > >> >
>> >> > >> > > >> > only the home page is fetched. why? the sub-pages are not fetched...
>> >> > >> > > >> >
>> >> > >> > > >>
>> >> > >> > > >
>> >> > >> > > >
>> >> > >> > >
>> >> > >> >
>> >> > >>
>> >> > >
>> >> > >
>> >> >
>> >>
>> >
>> >
>>
>
>
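
For anyone hitting the same problem later: the http.timeout property quoted above is overridden by putting it in conf/nutch-site.xml rather than editing nutch-default.xml. A rough sketch, with an arbitrary 30000 ms value (not something anyone in this thread actually used):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.timeout</name>
    <value>30000</value>
    <description>The default network timeout, in milliseconds.</description>
  </property>
</configuration>

and the depth and topN values mentioned above are passed on the crawl command line, for example (the urls seed directory and crawl output directory names are just placeholders):

bin/nutch crawl urls -dir crawl -depth 10 -topN 2000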
