So that's the problem: you have to replace MY.DOMAIN.NAME with the domains you want to crawl. For your situation, that line should read:

+^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/

Check it out.
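If you want to sanity-check a rule like that before re-running the crawl, here is a minimal Python sketch. It only approximates the filter: it assumes the leading "+" marks an accept rule and that the rest of the line is a regular expression searched against the URL. The helper name accepts() and the test URLs are made up for illustration; this is not Nutch's own code.

```python
import re

# Rule text as it would appear in crawl-urlfilter.txt (copied from the line above).
RULE = r"+^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/"


def accepts(rule: str, url: str) -> bool:
    """Return True if an accept ('+') rule matches the URL.

    Assumption: the first character is the sign and the remainder is a
    regex searched against the URL. This approximates the filter only.
    """
    sign, pattern = rule[0], rule[1:]
    if sign != "+":
        raise ValueError("this sketch only handles accept rules")
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    # Hypothetical URLs, chosen only to exercise the rule.
    for url in (
        "http://www.yahoo.com/",
        "http://news.google.com/index.html",
        "http://www.example.com/",
        "http://www.yahoo.com",
    ):
        print(url, accepts(RULE, url))
```

Note that the pattern ends in "/", so in this sketch a URL written without a slash after the host does not match. The sketch does not model any URL normalization Nutch may apply before filtering.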
----- Original Message -----
From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Saturday, April 07, 2007 9:02 AM
Subject: Re: Trying to setup Nutch

> On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> Have you checked your crawl-urlfilter.txt file?
>> Make sure you have replaced your accepted domain.
>>
>
> I have this in my crawl-urlfilter.txt
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
>
> but let's say I have
> yahoo, cnn, amazon, msn, google
> in my 'urls' files, what should my accepted domain be?
>
>
>> ----- Original Message -----
>> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
>> To: <[email protected]>
>> Sent: Saturday, April 07, 2007 8:54 AM
>> Subject: Re: Trying to setup Nutch
>>
>> > On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> >> After setup, you should put the urls you want to crawl into the HDFS
>> >> by the command:
>> >> $bin/hadoop dfs -put urls urls
>> >>
>> >> Maybe that's something you forgot to do and I hope it helps :)
>> >>
>> >
>> > I tried your command, but I get this error:
>> > $ bin/hadoop dfs -put urls urls
>> > put: Target urls already exists
>> >
>> >
>> > I just have 1 line in my file 'urls':
>> > $ more urls
>> > http://www.yahoo.com
>> >
>> > Thanks for any help.
>> >
>> >
>> >> ----- Original Message -----
>> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
>> >> To: <[email protected]>
>> >> Sent: Saturday, April 07, 2007 3:08 AM
>> >> Subject: Trying to setup Nutch
>> >>
>> >> > Hi,
>> >> >
>> >> > I am trying to set up Nutch.
>> >> > I set up 1 site in my urls file:
>> >> > http://www.yahoo.com
>> >> >
>> >> > And then I start the crawl using this command:
>> >> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> >
>> >> > But I get this "No URLs to fetch", can you please tell me what I am
>> >> > missing?
>> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> > crawl started in: crawl
>> >> > rootUrlDir = urls
>> >> > threads = 10
>> >> > depth = 1
>> >> > topN = 5
>> >> > Injector: starting
>> >> > Injector: crawlDb: crawl/crawldb
>> >> > Injector: urlDir: urls
>> >> > Injector: Converting injected urls to crawl db entries.
>> >> > Injector: Merging injected urls into crawl db.
>> >> > Injector: done
>> >> > Generator: Selecting best-scoring urls due for fetch.
>> >> > Generator: starting
>> >> > Generator: segment: crawl/segments/20070406140513
>> >> > Generator: filtering: false
>> >> > Generator: topN: 5
>> >> > Generator: jobtracker is 'local', generating exactly one partition.
>> >> > Generator: 0 records selected for fetching, exiting ...
>> >> > Stopping at depth=0 - no more URLs to fetch.
>> >> > No URLs to fetch - check your seed list and URL filters.
>> >> > crawl finished: crawl
>> >> >
>> >>
>> >
>>
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
