Thanks. I followed the rest of the tutorial, and it said: To fetch, we first generate a fetchlist from the database:
bin/nutch generate crawl/crawldb crawl/segments

This generates a fetchlist for all of the pages due to be fetched. The
fetchlist is placed in a newly created segment directory. The segment
directory is named by the time it's created. We save the name of this
segment in the shell variable s1:

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

But how can I see the list of the URLs (in human-readable format) before I
actually fetch it? I did a more on that file, but it is not readable:

crawl/segments/20070406202200/crawl_generate
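One possible way to inspect it, sketched against the Nutch 0.8 command set (the output directory name "fetchlist_dump" is just an example, and the exact option names may differ in other versions): the segment reader can dump a segment to plain text, and the -no* options below suppress everything except the generated fetchlist, so only crawl_generate entries should end up in the dump file.

$ s1=`ls -d crawl/segments/2* | tail -1`
$ bin/nutch readseg -dump $s1 fetchlist_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
$ more fetchlist_dump/dump

The crawl database as a whole can also be dumped to text in a similar way, e.g. "bin/nutch readdb crawl/crawldb -dump crawldb_dump", which lists every known URL and its fetch status.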
On 4/6/07, Xiangyu Zhang <[EMAIL PROTECTED]> wrote:
> Actually that command is for the distributed configuration on multiple machines.
> The tutorial you referred to is for entry-level users, who typically don't
> need the distributed setup.
>
> According to your description, I guess you're using Nutch on a single
> machine, which makes that command useless to you.
>
> But when you decide to deploy Nutch to multiple machines to do something
> big, you have much more to do than that tutorial tells you, including that
> command :)
>
> ----- Original Message -----
> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Saturday, April 07, 2007 9:12 AM
> Subject: Re: Trying to setup Nutch
>
>
> On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> So that's the problem: you have to replace MY.DOMAIN.NAME with the domains
> >> you want to crawl.
> >> For your situation, that line should read:
> >> +^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/
> >> Check it out.
> >>
> >
> > Thanks for your help.
> > But according to the documentation at
> > http://lucene.apache.org/nutch/tutorial8.html, I don't need to do this:
> >
> > $bin/hadoop dfs -put urls urls
> >
> > but I should do this for crawling:
> >
> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >
> > Why do I need to do this, and what is it for?
> >
> > $bin/hadoop dfs -put urls urls
> >
> >> ----- Original Message -----
> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
> >> To: <[email protected]>
> >> Sent: Saturday, April 07, 2007 9:02 AM
> >> Subject: Re: Trying to setup Nutch
> >>
> >> > On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> >> Have you checked your crawl-urlfilter.txt file?
> >> >> Make sure you have replaced your accepted domain.
> >> >>
> >> >
> >> > I have this in my crawl-urlfilter.txt:
> >> >
> >> > # accept hosts in MY.DOMAIN.NAME
> >> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >> >
> >> > But let's say I have
> >> > yahoo, cnn, amazon, msn, google
> >> > in my 'urls' file; what should my accepted domains be?
> >> >
> >> >
> >> >> ----- Original Message -----
> >> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
> >> >> To: <[email protected]>
> >> >> Sent: Saturday, April 07, 2007 8:54 AM
> >> >> Subject: Re: Trying to setup Nutch
> >> >>
> >> >> > On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> >> >> After setup, you should put the urls you want to crawl into the
> >> >> >> HDFS by the command:
> >> >> >> $bin/hadoop dfs -put urls urls
> >> >> >>
> >> >> >> Maybe that's something you forgot to do, and I hope it helps :)
> >> >> >>
> >> >> >
> >> >> > I tried your command, but I get this error:
> >> >> > $ bin/hadoop dfs -put urls urls
> >> >> > put: Target urls already exists
> >> >> >
> >> >> > I just have 1 line in my file 'urls':
> >> >> > $ more urls
> >> >> > http://www.yahoo.com
> >> >> >
> >> >> > Thanks for any help.
> >> >> >
> >> >> >> ----- Original Message -----
> >> >> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
> >> >> >> To: <[email protected]>
> >> >> >> Sent: Saturday, April 07, 2007 3:08 AM
> >> >> >> Subject: Trying to setup Nutch
> >> >> >>
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > I am trying to set up Nutch.
> >> >> >> > I set up 1 site in my urls file:
> >> >> >> > http://www.yahoo.com
> >> >> >> >
> >> >> >> > And then I start the crawl using this command:
> >> >> >> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> >> >> >
> >> >> >> > But I get "No URLs to fetch"; can you please tell me what I am
> >> >> >> > missing?
> >> >> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> >> >> > crawl started in: crawl
> >> >> >> > rootUrlDir = urls
> >> >> >> > threads = 10
> >> >> >> > depth = 1
> >> >> >> > topN = 5
> >> >> >> > Injector: starting
> >> >> >> > Injector: crawlDb: crawl/crawldb
> >> >> >> > Injector: urlDir: urls
> >> >> >> > Injector: Converting injected urls to crawl db entries.
> >> >> >> > Injector: Merging injected urls into crawl db.
> >> >> >> > Injector: done
> >> >> >> > Generator: Selecting best-scoring urls due for fetch.
> >> >> >> > Generator: starting
> >> >> >> > Generator: segment: crawl/segments/20070406140513
> >> >> >> > Generator: filtering: false
> >> >> >> > Generator: topN: 5
> >> >> >> > Generator: jobtracker is 'local', generating exactly one partition.
> >> >> >> > Generator: 0 records selected for fetching, exiting ...
> >> >> >> > Stopping at depth=0 - no more URLs to fetch.
> >> >> >> > No URLs to fetch - check your seed list and URL filters.
> >> >> >> > crawl finished: crawl
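On the crawl-urlfilter.txt question quoted above, a minimal sketch of the accept line for several seed domains (this assumes the stock filter file layout; escaping the dots is optional but makes the patterns stricter):

# accept hosts in the seed domains
+^http://([a-z0-9]*\.)*(yahoo\.com|cnn\.com|amazon\.com|msn\.com|google\.com)/
# skip everything else
-.

The filter rules are applied in order and the first matching rule wins, so the accept line has to appear before the final "-." catch-all.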
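As for the "Target urls already exists" error quoted above: on a single-machine setup the dfs -put step is not needed at all, but if the urls file really does have to go into the DFS again, the existing copy can be removed first. A sketch, assuming the Hadoop shell bundled with Nutch at the time (-rm for a file, -rmr for a directory):

$ bin/hadoop dfs -rm urls
$ bin/hadoop dfs -put urls urls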
