Actually, that command is for a distributed configuration on multiple machines.
The tutorial you referred to is for entry-level users, who typically don't
need the distributed setup.
From your description, I guess you're running Nutch on a single
machine, which makes that command unnecessary for you.
But when you decide to deploy Nutch across multiple machines to do something
big, you'll have much more to do than that tutorial tells you, including that
command :)
----- Original Message -----
From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Saturday, April 07, 2007 9:12 AM
Subject: Re: Trying to setup Nutch
On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
So that's the problem: you have to replace MY.DOMAIN.NAME with the domains
you want to crawl.
For your situation, that line should read:
+^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/
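To be explicit, the accept rule in conf/crawl-urlfilter.txt would then look
roughly like this (the "skip everything else" rule shown after it is the
stock catch-all that rejects anything the accept line doesn't match; it is
included here only for context):

# accept hosts in the listed domains
+^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/

# skip everything else
-.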
Check it out.
Thanks for your help.
But from the documentation at
http://lucene.apache.org/nutch/tutorial8.html, I don't need to do
this:
$bin/hadoop dfs -put urls urls
but I should do this for crawling:
$bin/nutch crawl urls -dir crawl -depth 1 -topN 5
Why do I need to do this, and what is that for?
$bin/hadoop dfs -put urls urls
----- Original Message -----
From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Saturday, April 07, 2007 9:02 AM
Subject: Re: Trying to setup Nutch
> On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> Have you checked your crawl-urlfilter.txt file?
>> Make sure you have replaced MY.DOMAIN.NAME with your accepted domain.
>>
>
> I have this in my crawl-urlfilter.txt
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
>
> but let's say I have
> yahoo, cnn, amazon, msn, google
> in my 'urls' file, what should my accepted domains be?
>
>
>> ----- Original Message -----
>> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
>> To: <[email protected]>
>> Sent: Saturday, April 07, 2007 8:54 AM
>> Subject: Re: Trying to setup Nutch
>>
>> > On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> >> After setup, you should put the urls you want to crawl into HDFS
>> >> with the command:
>> >> $bin/hadoop dfs -put urls urls
>> >>
>> >> Maybe that's something you forgot to do and I hope it helps :)
>> >>
>> >
>> > I tried your command, but I get this error:
>> > $ bin/hadoop dfs -put urls urls
>> > put: Target urls already exists
>> >
>> >
>> > I just have 1 line in my file 'urls':
>> > $ more urls
>> > http://www.yahoo.com
>> >
>> > Thanks for any help.
>> >
>> >
>> >> ----- Original Message -----
>> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
>> >> To: <[email protected]>
>> >> Sent: Saturday, April 07, 2007 3:08 AM
>> >> Subject: Trying to setup Nutch
>> >>
>> >> > Hi,
>> >> >
>> >> > I am trying to set up Nutch.
>> >> > I set up 1 site in my urls file:
>> >> > http://www.yahoo.com
>> >> >
>> >> > And then I start crawl using this command:
>> >> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> >
>> >> > But I get this "No URLs to fetch", can you please tell me what am
>> >> > I missing?
>> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> > crawl started in: crawl
>> >> > rootUrlDir = urls
>> >> > threads = 10
>> >> > depth = 1
>> >> > topN = 5
>> >> > Injector: starting
>> >> > Injector: crawlDb: crawl/crawldb
>> >> > Injector: urlDir: urls
>> >> > Injector: Converting injected urls to crawl db entries.
>> >> > Injector: Merging injected urls into crawl db.
>> >> > Injector: done
>> >> > Generator: Selecting best-scoring urls due for fetch.
>> >> > Generator: starting
>> >> > Generator: segment: crawl/segments/20070406140513
>> >> > Generator: filtering: false
>> >> > Generator: topN: 5
>> >> > Generator: jobtracker is 'local', generating exactly one
>> >> > partition.
>> >> > Generator: 0 records selected for fetching, exiting ...
>> >> > Stopping at depth=0 - no more URLs to fetch.
>> >> > No URLs to fetch - check your seed list and URL filters.
>> >> > crawl finished: crawl
>> >> >
>> >>
>> >
>>
>