Yes, I'm using the crawl command:

nutch crawl urls -dir new_crawl -depth 10 -topN 200
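As a quick sanity check on the url filter discussed in this thread, the "accept" rule from crawl-urlfilter.txt can be tried outside Nutch. Nutch applies Java regexes, but Python's re module behaves the same for this pattern; the URLs below are just the example hosts from the thread (the .la one is made up for contrast):

```python
import re

# "Accept" rule from crawl-urlfilter.txt, with the leading '+' stripped.
# Nutch uses Java regexes, but Python's re behaves identically here.
accept = re.compile(r"^http://([a-z0-9]*\.)*com/")

# Example hosts from this thread; the .la URL is hypothetical, for contrast.
urls = [
    "http://www.aaa.com/",
    "http://www.bbb.com/some/page",
    "http://www.ccc.com/",
    "http://www.ddd.la/",
]

for url in urls:
    verdict = "accepted" if accept.match(url) else "rejected"
    print(url, verdict)
```

All three .com hosts pass this pattern, while the earlier `*la/` pattern would have rejected them, which is exactly why www.bbb.com and www.ccc.com never showed up in the crawl.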
2009/3/5 Alexander Aristov <[email protected]>

> So I assume you use the crawl command.
>
> Then what is the depth you set? This value controls how many times the
> crawler will generate-fetch-update pages.
>
> Try setting it higher, say to 3, so that the crawler can reach your ccc.com:
>
> on the first attempt it would fetch only aaa.com,
> in the second it will do bbb.com,
> and finally it will do ccc.com.
>
> 2009/3/5 Yves Yu <[email protected]>
>
> > Sorry, my crawl-urlfilter.txt contains
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*com/
> >
> > not
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*la/
> >
> > 2009/3/5 Yves Yu <[email protected]>
> >
> > > Yes. Let me summarize my questions so that I can express myself more
> > > clearly.
> > >
> > > I set only www.aaa.com in urls.txt.
> > >
> > > The page www.aaa.com has a link to www.bbb.com,
> > > www.bbb.com has a link to www.ccc.com, and
> > > www.ccc.com contains the word: big
> > >
> > > If I search for "big", can I get a result like www.ccc.com?
> > >
> > > -------------------------------------------------------------------
> > > my urls.txt is
> > >
> > > www.aaa.com
> > >
> > > -------------------------------------------------------------------
> > > my crawl-urlfilter.txt is
> > >
> > > # skip file:, ftp:, & mailto: urls
> > > -^(file|ftp|mailto):
> > >
> > > # skip image and other suffixes we can't yet parse
> > > -\.(js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> > >
> > > # skip URLs containing certain characters as probable queries, etc.
> > > #-...@=]
> > >
> > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > -.*(/.+?)/.*?\1/.*?\1/
> > >
> > > # accept hosts in MY.DOMAIN.NAME
> > > +^http://([a-z0-9]*\.)*la/
> > >
> > > # skip everything else
> > > -.
> > > -------------------------------------------------------------------
> > > my nutch-site.xml contains:
> > >
> > > <property>
> > >   <name>http.agent.url</name>
> > >   <value>http://www.aaa.com/</value>
> > >   <description>http://www.aaa.com/</description>
> > > </property>
> > >
> > > -------------------------------------------------------------------
> > > my nutch-default.xml contains the following, and I didn't override
> > > them in my nutch-site.xml:
> > >
> > > <property>
> > >   <name>db.update.additions.allowed</name>
> > >   <value>true</value>
> > >   <description>If true, updatedb will add newly discovered URLs; if
> > >   false, only already existing URLs in the CrawlDb will be updated
> > >   and no new URLs will be added.
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>db.ignore.external.links</name>
> > >   <value>false</value>
> > >   <description>If true, outlinks leading from a page to external
> > >   hosts will be ignored. This is an effective way to limit the crawl
> > >   to include only initially injected hosts, without creating complex
> > >   URLFilters.
> > >   </description>
> > > </property>
> > >
> > > -------------------------------------------------------------------
> > > With these configurations I can crawl pages on www.aaa.com, like
> > > www.aaa.com/xxxxx, but I cannot crawl pages like www.bbb.com/xxxx or
> > > www.ccc.com/xxxx. Is that right?
> > >
> > > If I want to crawl pages like www.bbb.com/xxx or www.ccc.com/xxx,
> > > what should I do?
> > >
> > > Thank you for taking the time to answer my question.
> > >
> > > Yves
> > >
> > > 2009/3/5 Alexander Aristov <[email protected]>
> > >
> > > > If you have taken Nutch from trunk then these settings are already
> > > > there; you should just check them. If they are not there, you can
> > > > add them.
> > > >
> > > > See also the nutch-default.xml file, which contains all Nutch
> > > > settings.
> > > > You may copy what you need from this file and customize it in
> > > > nutch-site.xml.
> > > >
> > > > 2009/3/5 Yves Yu <[email protected]>
> > > >
> > > > > Thanks, that would be very useful for me.
> > > > > Do I just copy these two lines into the nutch-site.xml file?
> > > > > If not, would you provide a template?
> > > > >
> > > > > 2009/3/5 Alexander Aristov <[email protected]>
> > > > >
> > > > > > In the nutch-site.xml file, check the settings
> > > > > >
> > > > > > db.update.additions.allowed
> > > > > > db.ignore.external.links
> > > > > >
> > > > > > Alexander
> > > > > >
> > > > > > 2009/3/5 Yves Yu <[email protected]>
> > > > > >
> > > > > > > Thank you.
> > > > > > > My urls.txt is www.aaa.com
> > > > > > > Must I add www.bbb.com and www.ccc.com here?
> > > > > > >
> > > > > > > My urlfilter is +^http://([a-z0-9]*\.)*com/
> > > > > > >
> > > > > > > By the way, how do I check the Nutch settings to see whether
> > > > > > > I allow adding outward links?
> > > > > > >
> > > > > > > 2009/3/5 Alexander Aristov <[email protected]>
> > > > > > >
> > > > > > > > I would suggest checking the URL filters. If you use the
> > > > > > > > crawl command, then it is the crawl url filter; otherwise
> > > > > > > > it is regex-urlfilter.
> > > > > > > >
> > > > > > > > And check the Nutch settings to see whether you allow
> > > > > > > > adding outward links.
> > > > > > > >
> > > > > > > > 2009/3/5 Yves Yu <[email protected]>
> > > > > > > >
> > > > > > > > > Yes, I'm using Luke now, and I see there is no
> > > > > > > > > www.bbb.com and no www.ccc.com in the crawling procedure.
> > > > > > > > > It can only crawl www.aaa.com, www.aaa.com/xxx/xxx, and
> > > > > > > > > the like.
> > > > > > > > > Do you know what the problem is?
> > > > > > > > > 2009/3/4 Jasper Kamperman <[email protected]>
> > > > > > > > >
> > > > > > > > > > Oh, and the documentation also specifies a depth
> > > > > > > > > > parameter that says how far afield the crawler may go.
> > > > > > > > > > I think the default is 10, but I'm not sure.
> > > > > > > > > >
> > > > > > > > > > Sent from my iPhone
> > > > > > > > > >
> > > > > > > > > > On Mar 3, 2009, at 12:53 PM, Yves Yu <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > You mean we can do this without additional
> > > > > > > > > > > configuration? How about 10 depth, like this? How
> > > > > > > > > > > can I set it? Thanks.
> > > > > > > > > > >
> > > > > > > > > > > 2009/3/4 Jasper Kamperman <[email protected]>
> > > > > > > > > > >
> > > > > > > > > > > > Could be a lot of reasons. I'd start by
> > > > > > > > > > > > investigating the index with Luke to see if ccc
> > > > > > > > > > > > made it into the index and if I can search out the
> > > > > > > > > > > > page with the word "big". From what I find out
> > > > > > > > > > > > with Luke I'd work my way back to the root cause.
> > > > > > > > > > > >
> > > > > > > > > > > > Sent from my iPhone
> > > > > > > > > > > >
> > > > > > > > > > > > On Mar 3, 2009, at 7:40 AM, Yves Yu <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi, all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > For example:
> > > > > > > > > > > > >
> > > > > > > > > > > > > The page www.aaa.com has a link to www.bbb.com,
> > > > > > > > > > > > > www.bbb.com has a link to www.ccc.com, and
> > > > > > > > > > > > > www.ccc.com contains the word: big
> > > > > > > > > > > > >
> > > > > > > > > > > > > It seems I cannot find "big" in www.ccc.com. Is
> > > > > > > > > > > > > it possible?
> > > > > > > > > > > > > How can I set the configurations?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks in advance!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yves
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best Regards
> > > > > > > > Alexander Aristov
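To pull the thread's advice together: a nutch-site.xml override covering the two properties discussed might look like the sketch below. The property names are the ones quoted from nutch-default.xml above; the values simply restate the defaults that were already in effect (new URLs added, external links followed), so the depth and the url filter pattern are the actual things to fix:

```xml
<!-- Sketch of nutch-site.xml overrides; property names and defaults are
     taken from the nutch-default.xml excerpt quoted in this thread. -->
<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>Allow updatedb to add newly discovered URLs.</description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>Follow outlinks to external hosts such as www.bbb.com
  and www.ccc.com.</description>
</property>
```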
