So I assume you are using the crawl command.

Then what depth did you set?
This value controls how many generate-fetch-update rounds the crawler
runs.

Try setting it higher, say to 3, so that the crawler can reach your ccc.com:


on the first round it would fetch only aaa.com,
on the second it will do bbb.com,
and finally it will do ccc.com
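
If you use the one-shot crawl command, the depth is passed on the
command line. A minimal sketch (the seed and output names are only
examples, adjust them to your setup):

bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

-topN caps how many of the top-scoring pages are fetched in each
round; drop it if you want everything.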


2009/3/5 Yves Yu <[email protected]>

> sorry,
> my crawl-urlfilter.txt contains
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*com/
>
> not
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*la/
>
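Note by the way that +^http://([a-z0-9]*\.)*com/ accepts any .com
host. If you ever want to limit the crawl to just the hosts from your
example, a line along these lines should work instead (only a sketch;
aaa/bbb/ccc stand in for your real hosts):

+^http://(www\.)?(aaa|bbb|ccc)\.com/
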
> 2009/3/5 Yves Yu <[email protected]>
>
> > yes.
> > maybe I should summarize my questions so that I can express myself
> > more clearly.
> >
> > I set only www.aaa.com to urls.txt.
> >
> > The page www.aaa.com has a link to www.bbb.com
> > www.bbb.com has a link to www.ccc.com
> > www.ccc.com has a word: big
> >
> > if I search "big", can I get a result like www.ccc.com?
> >
> >
> > -------------------------------------------------------------------------------------------
> > my urls.txt is
> > www.aaa.com
> >
> >
> > -------------------------------------------------------------------------------------------
> > my crawl-urlfilter.txt is
> > # skip file:, ftp:, & mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> >
> >
> > -\.(js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break
> > loops
> > -.*(/.+?)/.*?\1/.*?\1/
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*la/
> >
> > # skip everything else
> > -.
> >
> >
> > -------------------------------------------------------------------------------------------
> > my nutch-site.xml contains:
> > <property>
> >   <name>http.agent.url</name>
> >   <value>http://www.aaa.com/</value>
> >   <description>http://www.aaa.com/</description>
> > </property>
> >
> >
> > -------------------------------------------------------------------------------------------
> > my nutch-default.xml contains the following, and I didn't change them in
> > my nutch-site.xml:
> >
> > <property>
> >   <name>db.update.additions.allowed</name>
> >   <value>true</value>
> >   <description>If true, updatedb will add newly discovered URLs, if false
> >   only already existing URLs in the CrawlDb will be updated and no new
> >   URLs will be added.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>db.ignore.external.links</name>
> >   <value>false</value>
> >   <description>If true, outlinks leading from a page to external hosts
> >   will be ignored. This is an effective way to limit the crawl to include
> >   only initially injected hosts, without creating complex URLFilters.
> >   </description>
> > </property>
> >
> >
> > -------------------------------------------------------------------------------------------
> > with these configurations,
> > I can crawl pages in www.aaa.com like www.aaa.com/xxxxx,
> > but I cannot crawl pages like www.bbb.com/xxxx or www.ccc.com/xxxx.
> > is that right?
> >
> > if I want to crawl pages like www.bbb.com/xxx or www.ccc.com/xxx, what
> > should I do?
> >
> > thank you for taking time to answer my question.
> >
> > Yves
> >
> > 2009/3/5 Alexander Aristov <[email protected]>
> >
> >> If you have taken nutch from trunk then these settings are already
> >> there; you should just check them. If they are not, you can add them.
> >>
> >> See also the nutch-default.xml file, which contains all nutch settings.
> >> You may copy the necessary ones from this file and customize them in
> >> nutch-site.xml.
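For example, to keep following outlinks to other hosts, you would end
up with something like this in nutch-site.xml (only a sketch, and
false is already the default value anyway):

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>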
> >>
> >> 2009/3/5 Yves Yu <[email protected]>
> >>
> >> > thanks, that would be very useful for me.
> >> > just copy these two lines to nutch-site.xml file?
> >> > if not, would you like to provide a template?
> >> >
> >> > 2009/3/5 Alexander Aristov <[email protected]>
> >> >
> >> > > in the nutch-site.xml file check settings
> >> > >
> >> > > db.update.additions.allowed
> >> > > db.ignore.external.links
> >> > >
> >> > > Alexander
> >> > >
> >> > >
> >> > > 2009/3/5 Yves Yu <[email protected]>
> >> > >
> >> > > > thank you
> >> > > > my urls.txt is www.aaa.com
> >> > > > must I add www.bbb.com and www.ccc.com here?
> >> > > >
> >> > > > my urlfilter is +^http://([a-z0-9]*\.)*com/
> >> > > >
> >> > > > by the way, how do I check whether the nutch settings allow
> >> > > > adding outward links?
> >> > > >
> >> > > > 2009/3/5 Alexander Aristov <[email protected]>
> >> > > >
> >> > > > > I would suggest checking the url filters. If you use the crawl
> >> > > > > command then it is the crawl url filter (crawl-urlfilter.txt),
> >> > > > > otherwise it is regex-urlfilter (regex-urlfilter.txt).
> >> > > > >
> >> > > > > And check the nutch settings to see whether you allow adding
> >> > > > > outward links.
> >> > > > >
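A quick way to check what is currently set (assuming you run this from
the Nutch install directory) is something like:

grep -A 2 db.ignore.external.links conf/nutch-*.xml

and the same for db.update.additions.allowed.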
> >> > > > > 2009/3/5 Yves Yu <[email protected]>
> >> > > > >
> >> > > > > > yes, I'm using Luke now, and I see there is no www.bbb.com and
> >> > > > > > no www.ccc.com in the crawl. It can only crawl www.aaa.com and
> >> > > > > > pages under it like www.aaa.com/xxx/xxx.
> >> > > > > > do you know what the problem is?
> >> > > > > >
> >> > > > > > 2009/3/4 Jasper Kamperman <[email protected]>
> >> > > > > >
> >> > > > > > > Oh, and the documentation also specifies a depth parameter
> >> > > > > > > that says how far afield the crawler may go. I think the
> >> > > > > > > default is 10 but I'm not sure.
> >> > > > > > >
> >> > > > > > > Sent from my iPhone
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Mar 3, 2009, at 12:53 PM, Yves Yu <[email protected]> wrote:
> >> > > > > > >
> >> > > > > >> you mean, we can do this without additional configuration?
> >> > > > > >> how about a depth of 10, like this? how can I set it? thanks.
> >> > > > > > >>
> >> > > > > > >> 2009/3/4 Jasper Kamperman <
> [email protected]
> >> >
> >> > > > > > >>
> >> > > > > > >>  Could be a lot of reasons. I'd start by investigating the
> >> index
> >> > > > with
> >> > > > > > Luke
> >> > > > > > >>> to see if ccc made it into the index and if I can search
> out
> >> > the
> >> > > > page
> >> > > > > > >>> with
> >> > > > > > >>> the word "big". From what I find out with Luke I'd work my
> >> way
> >> > > back
> >> > > > > to
> >> > > > > > >>> the
> >> > > > > > >>> root cause
> >> > > > > > >>>
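For anyone following along who has not used Luke: it is a standalone
Lucene index browser. Assuming the usual all-in-one jar, something like

java -jar lukeall.jar

starts it, and you can then point it at your crawl/index directory.
The exact jar file name varies by version.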
> >> > > > > > >>> Sent from my iPhone
> >> > > > > > >>>
> >> > > > > > >>>
> >> > > > > > >>> On Mar 3, 2009, at 7:40 AM, Yves Yu <[email protected]> wrote:
> >> > > > > > >>>
> >> > > > > > >>>> Hi, all,
> >> > > > > > >>>>
> >> > > > > > >>>> for example,
> >> > > > > > >>>>
> >> > > > > > >>>> The page www.aaa.com has a link to www.bbb.com
> >> > > > > > >>>> www.bbb.com has a link to www.ccc.com
> >> > > > > > >>>> www.ccc.com has a word: big
> >> > > > > > >>>>
> >> > > > > > >>>> It seems I cannot find "big" in www.ccc.com. Is it
> >> > > > > > >>>> possible? How can I set the configurations?
> >> > > > > > >>>>
> >> > > > > > >>>> Thanks in advance!
> >> > > > > > >>>>
> >> > > > > > >>>> Yves
> >> > > > > > >>>>
> >> > > > > > >>>>
> >> > > > > > >>>
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Best Regards
> >> > > > > Alexander Aristov
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Best Regards
> >> > > Alexander Aristov
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Best Regards
> >> Alexander Aristov
> >>
> >
> >
>



-- 
Best Regards
Alexander Aristov
