On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
So that's the problem: you have to replace MY.DOMAIN.NAME with the domains
you want to crawl.
For your situation, that line should read:
+^http://([a-z0-9]*\.)*(yahoo\.com|cnn\.com|amazon\.com|msn\.com|google\.com)/
Check it out.
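
A quick way to sanity-check that rule, assuming the seed URL has already been
normalized to end in a slash (which Nutch's URL normalizer should do for a
bare host like http://www.yahoo.com), is to run it through grep:

$ echo "http://www.yahoo.com/" | grep -E '^http://([a-z0-9]*\.)*(yahoo\.com|cnn\.com|amazon\.com|msn\.com|google\.com)/'
http://www.yahoo.com/

If grep prints the URL back, the + rule accepts it; if it prints nothing, the
filter rejects it and the Generator will select 0 records for fetching.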


Thanks for your help.
but according to the documentation at
http://lucene.apache.org/nutch/tutorial8.html, I don't need to do
this:
$bin/hadoop dfs -put urls urls

but I should do this for crawling:

$bin/nutch crawl urls -dir crawl -depth 1 -topN 5
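
For reference: in that command, -dir names the output directory, -depth is
how many levels of links to follow out from the seeds, and -topN caps the
number of top-scoring pages fetched per level. With -depth 1 only the seed
URLs themselves are fetched; a slightly broader test crawl (illustrative
numbers only) would be:

$bin/nutch crawl urls -dir crawl -depth 2 -topN 50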

Why do I need to do this, and what is that for?
$bin/hadoop dfs -put urls urls
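
A likely answer, assuming the default single-process ("local") setup: the
-put step only matters when Nutch runs on top of a real Hadoop DFS, where the
crawler reads its input out of HDFS; in the local configuration the crawl
command reads the urls directory straight from the local filesystem, which is
why the tutorial skips that step. The earlier "Target urls already exists"
error just means a copy is already in HDFS; to replace it, remove it first
(use -rmr if it is a directory) and put it again:

$ bin/hadoop dfs -rm urls
$ bin/hadoop dfs -put urls urls
$ bin/hadoop dfs -ls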

----- Original Message -----
From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Saturday, April 07, 2007 9:02 AM
Subject: Re: Trying to setup Nutch

> On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> Have you checked your crawl-urlfilter.txt file?
>> Make sure you have replaced MY.DOMAIN.NAME with your accepted domain.
>>
>
> I have this in my crawl-urlfilter.txt
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
>
> but let's say I have
> yahoo, cnn, amazon, msn, google
> in my 'urls' file; what should my accepted domain be?
>
>
>> ----- Original Message -----
>> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
>> To: <[email protected]>
>> Sent: Saturday, April 07, 2007 8:54 AM
>> Subject: Re: Trying to setup Nutch
>>
>> > On 4/6/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> >> After setup, you should put the urls you want to crawl into the HDFS
>> >> by the command:
>> >> $bin/hadoop dfs -put urls urls
>> >>
>> >> Maybe that's something you forgot to do and I hope it helps :)
>> >>
>> >
>> > I tried your command, but I get this error:
>> > $ bin/hadoop dfs -put urls urls
>> > put: Target urls already exists
>> >
>> >
>> > I just have 1 line in my file 'urls':
>> > $ more urls
>> > http://www.yahoo.com
>> >
>> > Thanks for any help.
>> >
>> >
>> >> ----- Original Message -----
>> >> From: "Meryl Silverburgh" <[EMAIL PROTECTED]>
>> >> To: <[email protected]>
>> >> Sent: Saturday, April 07, 2007 3:08 AM
>> >> Subject: Trying to setup Nutch
>> >>
>> >> > Hi,
>> >> >
>> >> > I am trying to set up Nutch.
>> >> > I setup 1 site in my urls file:
>> >> > http://www.yahoo.com
>> >> >
>> >> > And then I start crawl using this command:
>> >> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> >
>> >> > But I get "No URLs to fetch"; can you please tell me what I am
>> >> > missing?
>> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> > crawl started in: crawl
>> >> > rootUrlDir = urls
>> >> > threads = 10
>> >> > depth = 1
>> >> > topN = 5
>> >> > Injector: starting
>> >> > Injector: crawlDb: crawl/crawldb
>> >> > Injector: urlDir: urls
>> >> > Injector: Converting injected urls to crawl db entries.
>> >> > Injector: Merging injected urls into crawl db.
>> >> > Injector: done
>> >> > Generator: Selecting best-scoring urls due for fetch.
>> >> > Generator: starting
>> >> > Generator: segment: crawl/segments/20070406140513
>> >> > Generator: filtering: false
>> >> > Generator: topN: 5
>> >> > Generator: jobtracker is 'local', generating exactly one partition.
>> >> > Generator: 0 records selected for fetching, exiting ...
>> >> > Stopping at depth=0 - no more URLs to fetch.
>> >> > No URLs to fetch - check your seed list and URL filters.
>> >> > crawl finished: crawl
>> >> >
>> >>
>> >
>>
>
