I got it.
In nutch-default.xml:

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

I didn't change either of them, so I don't think they're the cause of my problem...
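
For reference, my understanding of the nutch-site.xml template asked about below: you copy the whole <property> blocks (not just the two name lines) into conf/nutch-site.xml, which takes precedence over nutch-default.xml. A minimal sketch, just repeating the default values shown above (adjust the values as needed):

<?xml version="1.0"?>
<configuration>

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>

</configuration>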
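
And on the URL filter suggestion from the thread below: my understanding is that when using the crawl command it is conf/crawl-urlfilter.txt that applies rather than regex-urlfilter.txt, and the stock file only accepts the seed domain. A rough sketch of accept rules that would also let www.bbb.com and www.ccc.com through (the hostnames are just the examples from this thread, and I haven't verified this against my install):

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# accept the hosts discussed in this thread
+^http://([a-z0-9]*\.)*aaa\.com/
+^http://([a-z0-9]*\.)*bbb\.com/
+^http://([a-z0-9]*\.)*ccc\.com/

# skip everything else
-.

For the depth question further down, I believe it is the -depth argument to the crawl command, e.g. bin/nutch crawl urls -dir crawl -depth 10.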

2009/3/5 Yves Yu <[email protected]>

> Thanks, that would be very useful for me.
> Do I just copy these two lines into the nutch-site.xml file?
> If not, could you provide a template?
>
> 2009/3/5 Alexander Aristov <[email protected]>
>
>> In the nutch-site.xml file, check these settings:
>>
>> db.update.additions.allowed
>> db.ignore.external.links
>>
>> Alexander
>>
>>
>> 2009/3/5 Yves Yu <[email protected]>
>>
>> > Thank you.
>> > My urls.txt contains www.aaa.com
>> > Must I add www.bbb.com and www.ccc.com there as well?
>> >
>> > My urlfilter is +^http://([a-z0-9]*\.)*com/
>> >
>> > By the way, how do I check the Nutch settings to see whether adding outward links is allowed?
>> >
>> > 2009/3/5 Alexander Aristov <[email protected]>
>> >
>> > > I would suggest checking the URL filters. If you use the crawl
>> > > command, then it is the crawl-urlfilter; otherwise it is the
>> > > regex-urlfilter.
>> > >
>> > > And check the Nutch settings to see whether you allow adding outward
>> > > links.
>> > >
>> > > 2009/3/5 Yves Yu <[email protected]>
>> > >
>> > > > Yes, I'm using Luke now, and I see there is no www.bbb.com and no
>> > > > www.ccc.com in the crawl; it can only crawl www.aaa.com,
>> > > > www.aaa.com/xxx/xxx, and the like.
>> > > > Do you know what the problem is?
>> > > >
>> > > > 2009/3/4 Jasper Kamperman <[email protected]>
>> > > >
>> > > > > Oh, and the documentation also specifies a depth parameter that
>> > > > > says how far afield the crawler may go. I think the default is 10
>> > > > > but I'm not sure.
>> > > > >
>> > > > > Sent from my iPhone
>> > > > >
>> > > > >
>> > > > > On Mar 3, 2009, at 12:53 PM, Yves Yu <[email protected]> wrote:
>> > > > >
>> > > > >> You mean we can do this without additional configuration? How
>> > > > >> about a depth of 10, like this? How can I set it? Thanks.
>> > > > >>
>> > > > >> 2009/3/4 Jasper Kamperman <[email protected]>
>> > > > >>
>> > > > >>> Could be a lot of reasons. I'd start by investigating the index
>> > > > >>> with Luke to see if ccc made it into the index and if I can
>> > > > >>> search out the page with the word "big". From what I find out
>> > > > >>> with Luke I'd work my way back to the root cause.
>> > > > >>>
>> > > > >>> Sent from my iPhone
>> > > > >>>
>> > > > >>>
>> > > > >>> On Mar 3, 2009, at 7:40 AM, Yves Yu <[email protected]> wrote:
>> > > > >>>
>> > > > >>>> Hi, all,
>> > > > >>>>
>> > > > >>>> for example,
>> > > > >>>>
>> > > > >>>> The page www.aaa.com has a link www.bbb.com
>> > > > >>>> www.bbb.com has a link www.ccc.com
>> > > > >>>> www.ccc.com has a word: big
>> > > > >>>>
>> > > > >>>> It seems I cannot find "big" in www.ccc.com. Is that possible?
>> > > > >>>> How can I set the configurations?
>> > > > >>>>
>> > > > >>>> Thanks in advance!
>> > > > >>>>
>> > > > >>>> Yves
>> > > > >>>>
>> > > > >>>>
>> > > > >>>
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Best Regards
>> > > Alexander Aristov
>> > >
>> >
>>
>>
>>
>> --
>> Best Regards
>> Alexander Aristov
>>
>
>
