Yes. Let me summarize my questions so that I can express myself more
clearly.

I put only www.aaa.com in urls.txt.

The page www.aaa.com has a link to www.bbb.com,
www.bbb.com has a link to www.ccc.com, and
www.ccc.com contains the word "big".

If I search for "big", will I get a result like www.ccc.com?
-------------------------------------------------------------------------------------------
my urls.txt is
www.aaa.com
-------------------------------------------------------------------------------------------
my crawl-urlfilter.txt is
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
#-...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*la/

# skip everything else
-.
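
By the way, I notice the accept rule +^http://([a-z0-9]*\.)*la/ only matches
hosts whose last label is "la". If I wanted the filter to also accept the
other hosts, I guess the accept section might need one rule per domain,
something like this (using the placeholder names from my example above;
the real domains would go here):

```
# accept the example hosts (placeholder names; substitute real domains)
+^http://([a-z0-9]*\.)*aaa\.com/
+^http://([a-z0-9]*\.)*bbb\.com/
+^http://([a-z0-9]*\.)*ccc\.com/
```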
-------------------------------------------------------------------------------------------
my nutch-site.xml contains:
<property>
<name>http.agent.url</name>
<value>http://www.aaa.com/</value>
<description>http://www.aaa.com/</description>
</property>
-------------------------------------------------------------------------------------------
my nutch-default.xml contains the following, and I didn't override them in
my nutch-site.xml:

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
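
If I understand correctly, overriding one of these just means repeating the
property in nutch-site.xml, which takes precedence over nutch-default.xml.
Something like this (keeping the default value of false so external links
are followed; the description text is my own guess at the intent):

```
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>false = follow outlinks to external hosts such as
  www.bbb.com, as long as the URL filters also accept them.</description>
</property>
```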
-------------------------------------------------------------------------------------------
With these configurations I can crawl pages under www.aaa.com, such as
www.aaa.com/xxxxx, but I cannot crawl pages like www.bbb.com/xxxx or
www.ccc.com/xxxx. Is that right?

If I want to crawl pages like www.bbb.com/xxx or www.ccc.com/xxx, what
should I do?
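
Also, about the depth parameter mentioned earlier in this thread: if I
understand right, it is passed on the crawl command line, something like
this (the directory names and numbers here are just my guesses):

```
bin/nutch crawl urls.txt -dir crawl -depth 3 -topN 1000
```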

Thank you for taking the time to answer my questions.

Yves

2009/3/5 Alexander Aristov <[email protected]>

> If you have taken nutch from trunk then these settings are already there.
> You should just check them. If not, you can add them.
>
> See also the nutch-default.xml file, which contains all nutch settings. You
> may copy the necessary ones from this file and customize them in
> nutch-site.xml.
>
> 2009/3/5 Yves Yu <[email protected]>
>
> > thanks, that would be very useful for me.
> > just copy these two lines to nutch-site.xml file?
> > if not, would you like to provide a template?
> >
> > 2009/3/5 Alexander Aristov <[email protected]>
> >
> > > in the nutch-site.xml file check settings
> > >
> > > db.update.additions.allowed
> > > db.ignore.external.links
> > >
> > > Alexander
> > >
> > >
> > > 2009/3/5 Yves Yu <[email protected]>
> > >
> > > > thank you
> > > > my urls.txt is www.aaa.com
> > > > must I add www.bbb.com and www.ccc.com here?
> > > >
> > > > my urlfilter is +^http://([a-z0-9]*\.)*com/
> > > >
> > > > by the way, how do I check the nutch settings to see whether I allow
> > > > adding outward links?
> > > >
> > > > 2009/3/5 Alexander Aristov <[email protected]>
> > > >
> > > > > I would suggest checking the url filters. If you use the crawl
> > > > > command then it is the crawl url filter; otherwise it is
> > > > > regex-urlfilter.
> > > > >
> > > > >
> > > > > And check the nutch settings to see if you allow adding outward links.
> > > > >
> > > > > 2009/3/5 Yves Yu <[email protected]>
> > > > >
> > > > > > yes, I'm using Luke now, and I see there is no www.bbb.com and no
> > > > > > www.ccc.com in the crawling procedure. It can only crawl
> > > > > > www.aaa.com and pages like www.aaa.com/xxx/xxx.
> > > > > > do you know what the problem is?
> > > > > >
> > > > > > 2009/3/4 Jasper Kamperman <[email protected]>
> > > > > >
> > > > > > > Oh, and the documentation also specifies a depth parameter that
> > > > > > > says how far afield the crawler may go. I think the default is 10
> > > > > > > but I'm not sure.
> > > > > > >
> > > > > > > Sent from my iPhone
> > > > > > >
> > > > > > >
> > > > > > > On Mar 3, 2009, at 12:53 PM, Yves Yu <[email protected]> wrote:
> > > > > > >
> > > > > > >> you mean we can do this without additional configuration? How
> > > > > > >> about a depth of 10? How can I set it? Thanks.
> > > > > > >>
> > > > > > >> 2009/3/4 Jasper Kamperman <[email protected]>
> > > > > > >>
> > > > > > >>> Could be a lot of reasons. I'd start by investigating the index
> > > > > > >>> with Luke to see if ccc made it into the index and if I can
> > > > > > >>> search out the page with the word "big". From what I find out
> > > > > > >>> with Luke I'd work my way back to the root cause.
> > > > > > >>>
> > > > > > >>> Sent from my iPhone
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> On Mar 3, 2009, at 7:40 AM, Yves Yu <[email protected]> wrote:
> > > > > > >>>
> > > > > > >>> Hi, all,
> > > > > > >>>
> > > > > > >>>> for example,
> > > > > > >>>>
> > > > > > >>>> The page www.aaa.com has a link www.bbb.com
> > > > > > >>>> www.bbb.com has a link www.ccc.com
> > > > > > >>>> www.ccc.com has a word: big
> > > > > > >>>>
> > > > > > >>>> It seems I cannot find "big" in www.ccc.com. Is that possible?
> > > > > > >>>> How can I set the configurations?
> > > > > > >>>>
> > > > > > >>>> Thanks in advance!
> > > > > > >>>>
> > > > > > >>>> Yves
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards
> > > > > Alexander Aristov
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best Regards
> > > Alexander Aristov
> > >
> >
>
>
>
> --
> Best Regards
> Alexander Aristov
>
