Yes, I'm using the crawl command:

nutch crawl urls -dir new_crawl -depth 10 -topN 200
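As a quick sanity check on the url filter discussed in this thread, the "accept" rule from crawl-urlfilter.txt can be tried outside Nutch. Nutch applies Java regexes, but Python's re module behaves the same for this pattern; the URLs below are just the example hosts from the thread (the .la one is made up for contrast):

```python
import re

# "Accept" rule from crawl-urlfilter.txt, with the leading '+' stripped.
# Nutch uses Java regexes, but Python's re behaves identically here.
accept = re.compile(r"^http://([a-z0-9]*\.)*com/")

# Example hosts from this thread; the .la URL is hypothetical, for contrast.
urls = [
    "http://www.aaa.com/",
    "http://www.bbb.com/some/page",
    "http://www.ccc.com/",
    "http://www.ddd.la/",
]

for url in urls:
    verdict = "accepted" if accept.match(url) else "rejected"
    print(url, verdict)
```

All three .com hosts pass this pattern, while the earlier `*la/` pattern would have rejected them, which is exactly why www.bbb.com and www.ccc.com never showed up in the crawl.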
2009/3/5 Alexander Aristov <[email protected]>

> So I assume you use the crawl command.
>
> Then what is the depth you set? This value controls how many times the
> crawler will generate-fetch-update pages.
>
> Try setting it higher, say to 3, so that the crawler can reach your ccc.com:
>
> on the first attempt it would fetch only aaa.com,
> in the second it will do bbb.com,
> and finally it will do ccc.com.
>
> 2009/3/5 Yves Yu <[email protected]>
>
> > Sorry, my crawl-urlfilter.txt contains
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*com/
> >
> > not
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*la/
> >
> > 2009/3/5 Yves Yu <[email protected]>
> >
> > > Yes. Let me summarize my questions so that I can express myself more
> > > clearly.
> > >
> > > I set only www.aaa.com in urls.txt.
> > >
> > > The page www.aaa.com has a link to www.bbb.com,
> > > www.bbb.com has a link to www.ccc.com, and
> > > www.ccc.com contains the word: big
> > >
> > > If I search for "big", can I get a result like www.ccc.com?
> > >
> > > -------------------------------------------------------------------
> > > my urls.txt is
> > >
> > > www.aaa.com
> > >
> > > -------------------------------------------------------------------
> > > my crawl-urlfilter.txt is
> > >
> > > # skip file:, ftp:, & mailto: urls
> > > -^(file|ftp|mailto):
> > >
> > > # skip image and other suffixes we can't yet parse
> > > -\.(js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> > >
> > > # skip URLs containing certain characters as probable queries, etc.
> > > #-...@=]
> > >
> > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > -.*(/.+?)/.*?\1/.*?\1/
> > >
> > > # accept hosts in MY.DOMAIN.NAME
> > > +^http://([a-z0-9]*\.)*la/
> > >
> > > # skip everything else
> > > -.
> > > -------------------------------------------------------------------
> > > my nutch-site.xml contains:
> > >
> > > <property>
> > >   <name>http.agent.url</name>
> > >   <value>http://www.aaa.com/</value>
> > >   <description>http://www.aaa.com/</description>
> > > </property>
> > >
> > > -------------------------------------------------------------------
> > > my nutch-default.xml contains the following, and I didn't override
> > > them in my nutch-site.xml:
> > >
> > > <property>
> > >   <name>db.update.additions.allowed</name>
> > >   <value>true</value>
> > >   <description>If true, updatedb will add newly discovered URLs; if
> > >   false, only already existing URLs in the CrawlDb will be updated
> > >   and no new URLs will be added.
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>db.ignore.external.links</name>
> > >   <value>false</value>
> > >   <description>If true, outlinks leading from a page to external
> > >   hosts will be ignored. This is an effective way to limit the crawl
> > >   to include only initially injected hosts, without creating complex
> > >   URLFilters.
> > >   </description>
> > > </property>
> > >
> > > -------------------------------------------------------------------
> > > With these configurations I can crawl pages on www.aaa.com, like
> > > www.aaa.com/xxxxx, but I cannot crawl pages like www.bbb.com/xxxx or
> > > www.ccc.com/xxxx. Is that right?
> > >
> > > If I want to crawl pages like www.bbb.com/xxx or www.ccc.com/xxx,
> > > what should I do?
> > >
> > > Thank you for taking the time to answer my question.
> > >
> > > Yves
> > >
> > > 2009/3/5 Alexander Aristov <[email protected]>
> > >
> > > > If you have taken Nutch from trunk then these settings are already
> > > > there; you should just check them. If they are not there, you can
> > > > add them.
> > > >
> > > > See also the nutch-default.xml file, which contains all Nutch
> > > > settings.
> > > > You may copy what you need from this file and customize it in
> > > > nutch-site.xml.
> > > >
> > > > 2009/3/5 Yves Yu <[email protected]>
> > > >
> > > > > Thanks, that would be very useful for me.
> > > > > Do I just copy these two lines into the nutch-site.xml file?
> > > > > If not, would you provide a template?
> > > > >
> > > > > 2009/3/5 Alexander Aristov <[email protected]>
> > > > >
> > > > > > In the nutch-site.xml file, check the settings
> > > > > >
> > > > > > db.update.additions.allowed
> > > > > > db.ignore.external.links
> > > > > >
> > > > > > Alexander
> > > > > >
> > > > > > 2009/3/5 Yves Yu <[email protected]>
> > > > > >
> > > > > > > Thank you.
> > > > > > > My urls.txt is www.aaa.com
> > > > > > > Must I add www.bbb.com and www.ccc.com here?
> > > > > > >
> > > > > > > My urlfilter is +^http://([a-z0-9]*\.)*com/
> > > > > > >
> > > > > > > By the way, how do I check the Nutch settings to see whether
> > > > > > > I allow adding outward links?
> > > > > > >
> > > > > > > 2009/3/5 Alexander Aristov <[email protected]>
> > > > > > >
> > > > > > > > I would suggest checking the URL filters. If you use the
> > > > > > > > crawl command, then it is the crawl url filter; otherwise
> > > > > > > > it is regex-urlfilter.
> > > > > > > >
> > > > > > > > And check the Nutch settings to see whether you allow
> > > > > > > > adding outward links.
> > > > > > > >
> > > > > > > > 2009/3/5 Yves Yu <[email protected]>
> > > > > > > >
> > > > > > > > > Yes, I'm using Luke now, and I see there is no
> > > > > > > > > www.bbb.com and no www.ccc.com in the crawling procedure.
> > > > > > > > > It can only crawl www.aaa.com, www.aaa.com/xxx/xxx, and
> > > > > > > > > the like.
> > > > > > > > > Do you know what the problem is?
> > > > > > > > > 2009/3/4 Jasper Kamperman <[email protected]>
> > > > > > > > >
> > > > > > > > > > Oh, and the documentation also specifies a depth
> > > > > > > > > > parameter that says how far afield the crawler may go.
> > > > > > > > > > I think the default is 10, but I'm not sure.
> > > > > > > > > >
> > > > > > > > > > Sent from my iPhone
> > > > > > > > > >
> > > > > > > > > > On Mar 3, 2009, at 12:53 PM, Yves Yu <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > You mean we can do this without additional
> > > > > > > > > > > configuration? How about 10 depth, like this? How
> > > > > > > > > > > can I set it? Thanks.
> > > > > > > > > > >
> > > > > > > > > > > 2009/3/4 Jasper Kamperman <[email protected]>
> > > > > > > > > > >
> > > > > > > > > > > > Could be a lot of reasons. I'd start by
> > > > > > > > > > > > investigating the index with Luke to see if ccc
> > > > > > > > > > > > made it into the index and if I can search out the
> > > > > > > > > > > > page with the word "big". From what I find out
> > > > > > > > > > > > with Luke I'd work my way back to the root cause.
> > > > > > > > > > > >
> > > > > > > > > > > > Sent from my iPhone
> > > > > > > > > > > >
> > > > > > > > > > > > On Mar 3, 2009, at 7:40 AM, Yves Yu <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi, all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > For example:
> > > > > > > > > > > > >
> > > > > > > > > > > > > The page www.aaa.com has a link to www.bbb.com,
> > > > > > > > > > > > > www.bbb.com has a link to www.ccc.com, and
> > > > > > > > > > > > > www.ccc.com contains the word: big
> > > > > > > > > > > > >
> > > > > > > > > > > > > It seems I cannot find "big" in www.ccc.com. Is
> > > > > > > > > > > > > it possible?
> > > > > > > > > > > > > How can I set the configurations?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks in advance!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yves
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best Regards
> > > > > > > > Alexander Aristov
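To pull the thread's advice together: a nutch-site.xml override covering the two properties discussed might look like the sketch below. The property names are the ones quoted from nutch-default.xml above; the values simply restate the defaults that were already in effect (new URLs added, external links followed), so the depth and the url filter pattern are the actual things to fix:

```xml
<!-- Sketch of nutch-site.xml overrides; property names and defaults are
     taken from the nutch-default.xml excerpt quoted in this thread. -->
<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>Allow updatedb to add newly discovered URLs.</description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>Follow outlinks to external hosts such as www.bbb.com
  and www.ccc.com.</description>
</property>
```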
