fetching http://www.corninc.com.la/_pgtres/stm31.js

I think I will filter out the ".js*" URLs.

2009/4/1 陈琛 <[email protected]>:

Yes, I got the same result, but I want to limit the crawl to those three URLs. Fetches like these:

fetching http://www.allgamerentals.com/
fetching http://www.mahasan.com/
fetching http://www.amazingcounters.com/
fetching http://www.alldvdrentals.com/video-game-rentals.html
fetching http://www.allgamerentals.com/rental-services.php

have nothing to do with the three sites; I do not need to fetch them.

One URL does look right:

fetching http://www.corninc.com.la/_pgtres/stm31.js

is a sub-page of http://www.corninc.com.la.

2009/4/1 Alejandro Gonzalez <[email protected]>:

I've tried with these 3 sites and this is what my Nutch got:

crawl started in: crawl-20090401170024
rootUrlDir = urls
threads = 5
depth = 3
topN = 30
Injector: starting
Injector: crawlDb: crawl-20090401170024/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170027
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170027
Fetcher: threads: 5
fetching http://www.corninc.com.la/
fetching http://www.nuol.edu.la/
fetching http://www.lao-indochina.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-20090401170024/crawldb
CrawlDb update: segments: [crawl-20090401170024/segments/20090401170027]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170042
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170042
Fetcher: threads: 5
fetching http://www.allgamerentals.com/
fetching http://www.mahasan.com/
fetching http://www.amazingcounters.com/
fetching http://www.alldvdrentals.com/video-game-rentals.html
fetching http://www.allgamerentals.com/rental-services.php
fetching http://www.corninc.com.la/_pgtres/stm31.js

Are you sure it is not a network issue? The only strange thing I've noticed is the slower fetching...

2009/4/1 陈琛 <[email protected]>:

http://www.vientianecollege.laopdr.com/ should, I think, be changed to http://www.vientianecollege.com/. That one works; its sub-pages are fetched.

http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la

still only fetch the home page.

2009/4/1 陈琛 <[email protected]>:

It fetches other URLs, not the sub-pages...
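A crawl-urlfilter.txt along these lines would keep the crawl on the three hosts and skip the JavaScript files; this is only a sketch built from the hosts and the ".js" remark in the messages above, not a tested configuration (the .js rule could equally be added to the default suffix-skip line):

# skip JavaScript files
-\.js$

# accept only the three hosts under discussion
+^http://([a-z0-9]*\.)*lao-indochina\.com
+^http://([a-z0-9]*\.)*nuol\.edu\.la
+^http://([a-z0-9]*\.)*corninc\.com\.la

# skip everything else
-.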
2009/4/1 Alejandro Gonzalez <[email protected]>:

Try using this as the filter in crawl-urlfilter.txt and comment out the other "+" lines:

+^http://([a-z0-9]*\.)*

2009/4/1 Alejandro Gonzalez <[email protected]>:

Yeah, I thought that at first, but I've been having a look at those websites and they do have some normal links. I'm going to deploy a Nutch and try them. Which version are you running?

2009/4/1 陈琛 <[email protected]>:

Thanks, but I do not think this is a timeout problem. I think they are special websites; perhaps the links come from other sources, like some JavaScript? So I do not know which URLs Nutch can actually fetch.

2009/4/1 Alejandro Gonzalez <[email protected]>:

Strange, strange :). Maybe you got a timeout error? Have you changed this property in nutch-site or nutch-default?

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

2009/4/1 陈琛 <[email protected]>:

Thanks very much ;) Attached are the log from the Cygwin console (out.txt) and the Nutch log (hadoop.log). I cannot find any clues in them.

2009/4/1 Alejandro Gonzalez <[email protected]>:

Send me the log of the crawl if possible. For sure there are some clues in it.

2009/4/1 陈琛 <[email protected]>:

Yes, the depth is 10 and topN is 2000... So strange... the other URLs are normal, but not these 4.

2009/4/1 Alejandro Gonzalez <[email protected]>:

Seems strange. Have you tried to start a crawl with just these 4 seed pages? Are you setting the topN parameter?

2009/4/1 陈琛 <[email protected]>:

Thanks. I have a collection of URLs; only these four fail to fetch any of their sub-pages. The URLs and my crawl-urlfilter are in the attachment.
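If a timeout really were the cause, the http.timeout default quoted above would normally be overridden in conf/nutch-site.xml rather than edited in nutch-default.xml. A minimal sketch, with a purely illustrative value:

<configuration>
  <property>
    <name>http.timeout</name>
    <!-- raise the 10000 ms default; 30000 is only an example value -->
    <value>30000</value>
  </property>
</configuration>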
2009/4/1 Alejandro Gonzalez <[email protected]>:

Is your crawl-urlfilter OK? Are you sure it is fetching them properly? Maybe it is not getting the content of the pages, and so it cannot extract links to fetch at the next level (assuming you have set the crawl depth to go beyond just the seed level). So either your filters are skipping the seeds (I suppose that is not the case, since you say the URLs arrive at the Fetcher), or the fetching is not going OK (network issues?). Take a look at that.

2009/4/1 陈琛 <[email protected]>:

Hi all,
I have four URLs, like this:

http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la
http://www.vientianecollege.laopdr.com

Only the home page is fetched. Why are the sub-pages not fetched?
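For reference, the kind of run discussed in this thread corresponds roughly to the one-step crawl tool invocation below; the seed directory, depth, topN, and thread count mirror the log earlier in the thread, while the -dir name is just an example:

bin/nutch crawl urls -dir crawl -depth 3 -topN 30 -threads 5

With -depth 1 only the injected seeds themselves are fetched, so a depth of at least 2 is needed before any sub-pages can show up at all; beyond that, the URL filters decide which of the extracted links survive into the next fetch round.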
