Thanks... I also have those filters, but I still don't get the normal pages, like http://www.corninc.com.la/faqs.htm.
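One quick check is whether any outlinks are being parsed out of those home pages at all. A rough sketch, reusing the segment name from the crawl log further down in this thread; the segdump output directory is just a placeholder, and the exact readseg options may differ between Nutch versions, so check bin/nutch readseg for the usage on yours:

bin/nutch readseg -dump crawl-20090401170024/segments/20090401170027 segdump -nocontent -nofetch -nogenerate

The dump lands in a text file under segdump; if the ParseData entry for a home page shows no outlinks, the problem is in parsing rather than in the URL filters.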
2009/4/1 Alejandro Gonzalez <[email protected]>:

Yes, I've launched a crawl with these filters:

+^http://([a-z0-9]*\.)*corninc.com.la/*
+^http://([a-z0-9]*\.)*lao-indochina.com/*
+^http://([a-z0-9]*\.)*nuol.edu.la/*

and the only link it gets within those domains is that .js file. On the others I got timeouts.
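For reference, a complete conf/crawl-urlfilter.txt restricted to these hosts might look roughly like the sketch below. The suffix-skip line, the query-character skip, and the final catch-all reject follow the layout of the stock filter file and are assumptions here, not something quoted in this thread; .js is added to the suffix list per the discussion in the next message, and the trailing /* on the accept rules above is redundant in regex terms and can simply be dropped.

# skip URLs ending in common non-HTML suffixes (.js added so scripts are dropped)
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept only the three hosts under discussion
+^http://([a-z0-9]*\.)*corninc\.com\.la/
+^http://([a-z0-9]*\.)*lao-indochina\.com/
+^http://([a-z0-9]*\.)*nuol\.edu\.la/

# reject everything else
-.

With the final "-." line in place, off-domain links such as http://www.allgamerentals.com/ should be rejected at the filtering step instead of being fetched.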
2009/4/1 陈琛 <[email protected]>:

fetching http://www.corninc.com.la/_pgtres/stm31.js

I think I should filter out the ".js*" URLs above.

2009/4/1 陈琛 <[email protected]>:

Yes, I got the same result, but I want to limit the crawl to these three URLs.

fetching http://www.allgamerentals.com/
fetching http://www.mahasan.com/
fetching http://www.amazingcounters.com/
fetching http://www.alldvdrentals.com/video-game-rentals.html
fetching http://www.allgamerentals.com/rental-services.php

These URLs have nothing to do with the three above; I do not need to fetch them.

One URL seems right:

fetching http://www.corninc.com.la/_pgtres/stm31.js

which is a sub-page of http://www.corninc.com.la.

2009/4/1 Alejandro Gonzalez <[email protected]>:

I've tried with these 3 sites and this is what my Nutch got:

crawl started in: crawl-20090401170024
rootUrlDir = urls
threads = 5
depth = 3
topN = 30
Injector: starting
Injector: crawlDb: crawl-20090401170024/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170027
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170027
Fetcher: threads: 5
fetching http://www.corninc.com.la/
fetching http://www.nuol.edu.la/
fetching http://www.lao-indochina.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-20090401170024/crawldb
CrawlDb update: segments: [crawl-20090401170024/segments/20090401170027]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-20090401170024/segments/20090401170042
Generator: filtering: false
Generator: topN: 30
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-20090401170024/segments/20090401170042
Fetcher: threads: 5
fetching http://www.allgamerentals.com/
fetching http://www.mahasan.com/
fetching http://www.amazingcounters.com/
fetching http://www.alldvdrentals.com/video-game-rentals.html
fetching http://www.allgamerentals.com/rental-services.php
fetching http://www.corninc.com.la/_pgtres/stm31.js

Are you sure it is not a network issue? The only strange thing I've noticed is the slower fetching...

2009/4/1 陈琛 <[email protected]>:

I think the URL http://www.vientianecollege.laopdr.com/ should be changed to http://www.vientianecollege.com/. That one is right; its sub-pages can be fetched.

http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la

also only fetch the home page.

2009/4/1 陈琛 <[email protected]>:

It fetches other URLs, not sub-pages...

2009/4/1 Alejandro Gonzalez <[email protected]>:

Try using this as the filter in crawl-urlfilter.txt, and comment out the other + lines:

+^http://([a-z0-9]*\.)*

2009/4/1 Alejandro Gonzalez <[email protected]>:

Yeah, I thought that at first, but I've had a look at those websites and they have some normal links. I'm going to deploy a Nutch and try them. Which version are you running?

2009/4/1 陈琛 <[email protected]>:

Thanks, but I do not think this is a timeout problem.

I think these are special websites; perhaps the links come from other sources, like some JavaScript?

So I do not know which URLs Nutch can actually fetch...

2009/4/1 Alejandro Gonzalez <[email protected]>:

Strange, strange :). Maybe you got a timeout error? Have you changed this property in nutch-site or nutch-default?

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
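If a longer timeout is worth testing, the usual place to override it is conf/nutch-site.xml rather than editing nutch-default.xml. A minimal sketch, with the 30000 ms value picked arbitrarily for illustration:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.timeout</name>
    <value>30000</value>
    <description>Network timeout in milliseconds, raised from the 10000 default for slow sites.</description>
  </property>
</configuration>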
2009/4/1 陈琛 <[email protected]>:

Thanks very much ;)

Here are the Cygwin console log (out.txt) and the Nutch log (hadoop.log). I cannot find any clues in them.

2009/4/1 Alejandro Gonzalez <[email protected]>:

Send me the log of the crawling if possible; for sure there are some clues in it.

2009/4/1 陈琛 <[email protected]>:

Yes, the depth is 10 and topN is 2000...

So strange... the other URLs are normal, but these 4 URLs...

2009/4/1 Alejandro Gonzalez <[email protected]>:

Seems strange. Have you tried to start a crawl with just these 4 seed pages?

Are you setting the topN parameter?

2009/4/1 陈琛 <[email protected]>:

Thanks. I have a collection of URLs; only these four fail to fetch a subset of their pages.

The URLs and crawl-urlfilter are in the attachment.

2009/4/1 Alejandro Gonzalez <[email protected]>:

Is your crawl-urlfilter OK? Are you sure it's fetching them properly? Maybe it's not getting the content of the pages, and so it cannot extract links to fetch at the next level (I assume you have set the crawl depth to more than just the seed level).

So either your filters are skipping the seeds (I suppose that's not the case, since you say the URLs arrive at the Fetcher), or the fetching is not going OK (network issues?). Take a look at that.

2009/4/1 陈琛 <[email protected]>:

Hi all,

I have four URLs, like this:

http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la
http://www.vientianecollege.laopdr.com

Only the home page is fetched. Why are the sub-pages not fetched?
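For an isolated test along the lines suggested above, the 0.9/1.0-era one-step crawl command can be run against a seed file containing only the four URLs. The directory and file names here (urls/seed.txt, crawl-test) are placeholders, and depth/topN are picked small just for the test:

Contents of urls/seed.txt:

http://www.lao-indochina.com
http://www.nuol.edu.la
http://www.corninc.com.la
http://www.vientianecollege.laopdr.com

Run a small crawl (a depth greater than 1 gives the sub-pages a chance to be generated), then check how many URLs actually made it into the crawl db:

bin/nutch crawl urls -dir crawl-test -depth 3 -topN 50 -threads 5
bin/nutch readdb crawl-test/crawldb -stats

If the crawldb stats show only the four seeds after a depth-3 crawl, no outlinks were accepted, which points at either parsing or the URL filters rather than the fetch itself.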
