I got it. In nutch-default.xml:

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.</description>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.</description>
</property>

I didn't change it, so I don't think it's the cause of my problem...

2009/3/5 Yves Yu <[email protected]>

> thanks, that would be very useful for me.
> just copy these two lines to the nutch-site.xml file?
> if not, would you like to provide a template?
>
> 2009/3/5 Alexander Aristov <[email protected]>
>
>> in the nutch-site.xml file check the settings
>>
>> db.update.additions.allowed
>> db.ignore.external.links
>>
>> Alexander
>>
>> 2009/3/5 Yves Yu <[email protected]>
>>
>> > thank you
>> > my urls.txt is www.aaa.com
>> > must I add www.bbb.com and www.ccc.com here?
>> >
>> > my urlfilter is +^http://([a-z0-9]*\.)*com/
>> >
>> > by the way, how can I check the Nutch settings to see whether I allow
>> > adding outward links?
>> >
>> > 2009/3/5 Alexander Aristov <[email protected]>
>> >
>> > > I would suggest checking the URL filters. If you use the crawl
>> > > command then it is the crawl url filter, otherwise it is
>> > > regex-urlfilter.
>> > >
>> > > And check the Nutch settings to see whether you allow adding outward
>> > > links.
>> > >
>> > > 2009/3/5 Yves Yu <[email protected]>
>> > >
>> > > > yes, I'm using Luke now, and I see there is no www.bbb.com and no
>> > > > www.ccc.com in the crawl. It only crawls www.aaa.com and pages
>> > > > under it, like www.aaa.com/xxx/xxx.
>> > > > do you know what the problem is?
>> > > >
>> > > > 2009/3/4 Jasper Kamperman <[email protected]>
>> > > >
>> > > > > Oh, and the documentation also specifies a depth parameter that
>> > > > > says how far afield the crawler may go. I think the default is
>> > > > > 10 but I'm not sure.
>> > > > >
>> > > > > Sent from my iPhone
>> > > > >
>> > > > > On Mar 3, 2009, at 12:53 PM, Yves Yu <[email protected]> wrote:
>> > > > >
>> > > > >> you mean we can do this without additional configuration? How
>> > > > >> about a depth of 10, like this? How can I set it? Thanks.
>> > > > >>
>> > > > >> 2009/3/4 Jasper Kamperman <[email protected]>
>> > > > >>
>> > > > >>> Could be a lot of reasons. I'd start by investigating the
>> > > > >>> index with Luke to see if ccc made it into the index and if I
>> > > > >>> can search out the page with the word "big". From what I find
>> > > > >>> out with Luke I'd work my way back to the root cause.
>> > > > >>>
>> > > > >>> Sent from my iPhone
>> > > > >>>
>> > > > >>> On Mar 3, 2009, at 7:40 AM, Yves Yu <[email protected]> wrote:
>> > > > >>>
>> > > > >>>> Hi, all,
>> > > > >>>>
>> > > > >>>> for example,
>> > > > >>>>
>> > > > >>>> The page www.aaa.com has a link to www.bbb.com
>> > > > >>>> www.bbb.com has a link to www.ccc.com
>> > > > >>>> www.ccc.com has a word: big
>> > > > >>>>
>> > > > >>>> It seems I cannot find "big" on www.ccc.com. Is that
>> > > > >>>> possible? How can I set the configurations?
>> > > > >>>>
>> > > > >>>> Thanks in advance!
>> > > > >>>>
>> > > > >>>> Yves
>> > >
>> > > --
>> > > Best Regards
>> > > Alexander Aristov
>>
>> --
>> Best Regards
>> Alexander Aristov
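[Editor's note] The thread asks for a nutch-site.xml template for the two settings. A minimal sketch of an override file follows; the property names are the ones shown in nutch-default.xml above, while the values are one reasonable choice for this crawl (follow external links, allow new URLs), not a fix confirmed in the thread:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Let updatedb add newly discovered URLs to the CrawlDb. -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>true</value>
  </property>
  <!-- Keep false so outlinks to external hosts (e.g. www.bbb.com,
       www.ccc.com) are followed; true would restrict the crawl to the
       initially injected hosts. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>
</configuration>
```

Properties placed in nutch-site.xml override the defaults in nutch-default.xml, so only the settings being changed need to appear here.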
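[Editor's note] The urlfilter line quoted in the thread, +^http://([a-z0-9]*\.)*com/, can be sanity-checked outside Nutch. A quick sketch using plain Python's re module (Nutch's regex-urlfilter plugin uses Java regexes, but this particular pattern behaves the same in both):

```python
import re

# The accept pattern from the thread, minus the leading "+" marker that
# regex-urlfilter.txt uses to mean "accept".
pattern = re.compile(r"^http://([a-z0-9]*\.)*com/")

# All three hosts from the example pass this filter, so the regex itself
# is unlikely to be what blocks www.bbb.com / www.ccc.com.
for url in ("http://www.aaa.com/",
            "http://www.bbb.com/",
            "http://www.ccc.com/"):
    print(url, "accepted" if pattern.match(url) else "rejected")
```

If all three print "accepted", the problem is more likely the crawl depth or the db.* settings than the filter.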
