Thanks. I followed the rest of the tutorial, and it said: To fetch, we first generate a fetchlist from the database:
bin/nutch generate crawl/crawldb crawl/segments

This generates a fetchlist for all of the pages due to be fetched. The
fetchlist is placed in a newly created segment directory. The segment
directory is named by the time it's created. We save the name of this
segment in the shell variable s1:

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

But how can I see the list of the URLs (in human-readable format) before I
actually fetch it? I did a more on that file, but it is not readable:

crawl/segments/20070406202200/crawl_generate
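One possible way to inspect it, sketched against the Nutch 0.8 command set (the output directory name "fetchlist_dump" is just an example, and the exact option names may differ in other versions): the segment reader can dump a segment to plain text, and the -no* options below suppress everything except the generated fetchlist, so only crawl_generate entries should end up in the dump file.

$ s1=`ls -d crawl/segments/2* | tail -1`
$ bin/nutch readseg -dump $s1 fetchlist_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
$ more fetchlist_dump/dump

The crawl database as a whole can also be dumped to text in a similar way, e.g. "bin/nutch readdb crawl/crawldb -dump crawldb_dump", which lists every known URL and its fetch status.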
On 4/6/07, Xiangyu Zhang <[EMAIL PROTECTED]> wrote:
> Actually that command is for the distributed configuration on multiple machines.
> The tutorial you referred to is for entry-level users, who typically don't
> need the distributed setup.
>
> According to your description, I guess you're using Nutch on a single
> machine, which makes that command useless to you.
>
> But when you decide to deploy Nutch to multiple machines to do something
> big, you have much more to do than that tutorial tells you, including that
> command :)
>
> ----- Original Message -----
> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Saturday, April 07, 2007 9:12 AM
> Subject: Re: Trying to setup Nutch
>
>
> On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> So that's the problem: you have to replace MY.DOMAIN.NAME with the domains
> >> you want to crawl.
> >> For your situation, that line should read:
> >> +^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/
> >> Check it out.
> >>
> >
> > Thanks for your help.
> > But according to the documentation at
> > http://lucene.apache.org/nutch/tutorial8.html, I don't need to do this:
> >
> > $bin/hadoop dfs -put urls urls
> >
> > but I should do this for crawling:
> >
> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >
> > Why do I need to do this, and what is it for?
> >
> > $bin/hadoop dfs -put urls urls
> >
> >> ----- Original Message -----
> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
> >> To: <[email protected]>
> >> Sent: Saturday, April 07, 2007 9:02 AM
> >> Subject: Re: Trying to setup Nutch
> >>
> >> > On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> >> Have you checked your crawl-urlfilter.txt file?
> >> >> Make sure you have replaced your accepted domain.
> >> >>
> >> >
> >> > I have this in my crawl-urlfilter.txt:
> >> >
> >> > # accept hosts in MY.DOMAIN.NAME
> >> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >> >
> >> > But let's say I have
> >> > yahoo, cnn, amazon, msn, google
> >> > in my 'urls' file; what should my accepted domains be?
> >> >
> >> >
> >> >> ----- Original Message -----
> >> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
> >> >> To: <[email protected]>
> >> >> Sent: Saturday, April 07, 2007 8:54 AM
> >> >> Subject: Re: Trying to setup Nutch
> >> >>
> >> >> > On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> >> >> After setup, you should put the urls you want to crawl into the
> >> >> >> HDFS by the command:
> >> >> >> $bin/hadoop dfs -put urls urls
> >> >> >>
> >> >> >> Maybe that's something you forgot to do, and I hope it helps :)
> >> >> >>
> >> >> >
> >> >> > I tried your command, but I get this error:
> >> >> > $ bin/hadoop dfs -put urls urls
> >> >> > put: Target urls already exists
> >> >> >
> >> >> > I just have 1 line in my file 'urls':
> >> >> > $ more urls
> >> >> > http://www.yahoo.com
> >> >> >
> >> >> > Thanks for any help.
> >> >> >
> >> >> >> ----- Original Message -----
> >> >> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
> >> >> >> To: <[email protected]>
> >> >> >> Sent: Saturday, April 07, 2007 3:08 AM
> >> >> >> Subject: Trying to setup Nutch
> >> >> >>
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > I am trying to set up Nutch.
> >> >> >> > I set up 1 site in my urls file:
> >> >> >> > http://www.yahoo.com
> >> >> >> >
> >> >> >> > And then I start the crawl using this command:
> >> >> >> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> >> >> >
> >> >> >> > But I get "No URLs to fetch"; can you please tell me what I am
> >> >> >> > missing?
> >> >> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> >> >> > crawl started in: crawl
> >> >> >> > rootUrlDir = urls
> >> >> >> > threads = 10
> >> >> >> > depth = 1
> >> >> >> > topN = 5
> >> >> >> > Injector: starting
> >> >> >> > Injector: crawlDb: crawl/crawldb
> >> >> >> > Injector: urlDir: urls
> >> >> >> > Injector: Converting injected urls to crawl db entries.
> >> >> >> > Injector: Merging injected urls into crawl db.
> >> >> >> > Injector: done
> >> >> >> > Generator: Selecting best-scoring urls due for fetch.
> >> >> >> > Generator: starting
> >> >> >> > Generator: segment: crawl/segments/20070406140513
> >> >> >> > Generator: filtering: false
> >> >> >> > Generator: topN: 5
> >> >> >> > Generator: jobtracker is 'local', generating exactly one partition.
> >> >> >> > Generator: 0 records selected for fetching, exiting ...
> >> >> >> > Stopping at depth=0 - no more URLs to fetch.
> >> >> >> > No URLs to fetch - check your seed list and URL filters.
> >> >> >> > crawl finished: crawl
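On the crawl-urlfilter.txt question quoted above, a minimal sketch of the accept line for several seed domains (this assumes the stock filter file layout; escaping the dots is optional but makes the patterns stricter):

# accept hosts in the seed domains
+^http://([a-z0-9]*\.)*(yahoo\.com|cnn\.com|amazon\.com|msn\.com|google\.com)/
# skip everything else
-.

The filter rules are applied in order and the first matching rule wins, so the accept line has to appear before the final "-." catch-all.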
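As for the "Target urls already exists" error quoted above: on a single-machine setup the dfs -put step is not needed at all, but if the urls file really does have to go into the DFS again, the existing copy can be removed first. A sketch, assuming the Hadoop shell bundled with Nutch at the time (-rm for a file, -rmr for a directory):

$ bin/hadoop dfs -rm urls
$ bin/hadoop dfs -put urls urls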
