In the nutch-site.xml file, check these settings:

db.update.additions.allowed
db.ignore.external.links
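
As a rough sketch, this is how those two properties might look in nutch-site.xml if the crawl should follow links from www.aaa.com out to other hosts. The values shown are an assumption about the behavior you want, so compare them with the descriptions in the nutch-default.xml that ships with your version:

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: entries here override the defaults in nutch-default.xml -->
<configuration>

  <!-- Assumed value: true lets updatedb add newly discovered URLs to the
       CrawlDb. With false, only URLs already in the CrawlDb are updated. -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>true</value>
  </property>

  <!-- Assumed value: false keeps outlinks that point to other hosts.
       With true, links leading to external hosts are ignored. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>

</configuration>
```

If either value is the other way around in your file, pages on www.bbb.com and www.ccc.com would not be fetched even when the URL filter accepts them.
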
Alexander

2009/3/5 Yves Yu <[email protected]>

> thank you
> my urls.txt is www.aaa.com
> must I add www.bbb.com and www.ccc.com here?
>
> my urlfilter is +^http://([a-z0-9]*\.)*com/
>
> by the way, how to check nutch settings if I allow adding outward links?
>
> 2009/3/5 Alexander Aristov <[email protected]>
>
> > I would suggest to check url filters. If you use the crawl command then it
> > is the crawl url filter otherwise it is regex-urlfilter
> >
> > And check nutch settings if you allow adding outward links.
> >
> > 2009/3/5 Yves Yu <[email protected]>
> >
> > > yes, I'm using Luke now, and I see there is no www.bbb.com and no
> > > www.ccc.com in crawling procedure. it only can crawling www.aaa.com,
> > > www.aaa.com\xxx\xxx, like these
> > > do you know what the problem is?
> > >
> > > 2009/3/4 Jasper Kamperman <[email protected]>
> > >
> > > > Oh and the documentation also specifies a depth parameter that says how
> > > > far afield the crawler may go. I think default is 10 but not sure.
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Mar 3, 2009, at 12:53 PM, Yves Yu <[email protected]> wrote:
> > > >
> > > >> you mean, we can do this without additional configuration? how about 10
> > > >> depth like this? how can I set it? thanks.
> > > >>
> > > >> 2009/3/4 Jasper Kamperman <[email protected]>
> > > >>
> > > >>> Could be a lot of reasons. I'd start by investigating the index with
> > > >>> Luke to see if ccc made it into the index and if I can search out the
> > > >>> page with the word "big". From what I find out with Luke I'd work my
> > > >>> way back to the root cause
> > > >>>
> > > >>> Sent from my iPhone
> > > >>>
> > > >>> On Mar 3, 2009, at 7:40 AM, Yves Yu <[email protected]> wrote:
> > > >>>
> > > >>>> Hi, all,
> > > >>>>
> > > >>>> for example,
> > > >>>>
> > > >>>> The page www.aaa.com has a link www.bbb.com
> > > >>>> www.bbb.com has a link www.ccc.com
> > > >>>> www.ccc.com has a word: big
> > > >>>>
> > > >>>> It seems I cannot find "big" in www.ccc.com, is it possible? How can
> > > >>>> I set the configurations?
> > > >>>>
> > > >>>> Thanks in advance!
> > > >>>>
> > > >>>> Yves
> >
> > --
> > Best Regards
> > Alexander Aristov

--
Best Regards
Alexander Aristov
