Re: Nutch - Hadoop Help

Manikandan Saravanan Mon, 03 Feb 2014 23:06:18 -0800

How do I run the crawl script on Hadoop?
-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople

On 4 February 2014 at 1:28:39 am, Lewis John Mcgibbney 
([email protected]) wrote:

Hi Manikandan,  

On Mon, Feb 3, 2014 at 3:45 PM, <[email protected]> wrote:  

> And then, I'm running this:  
> $HADOOP_HOME/bin/hadoop jar /usr/local/nutch/nutch.job  
> org.apache.nutch.crawl.Crawler dmoz -dir /user/hduser/crawl -depth 3 -topN  
> 5000  
>  

You're using the Crawler class. This is not advised, and the class is now
deprecated. There is no point in downloading the crawl script if you are
going to use the Crawler class. I suggest you use the crawl script instead.

>  
> org.apache.gora.memory.store.MemStore as the Gora storage class.  
>  

Please don't use MemStore. Its implementation in Gora 0.3 is not thread safe
and is meant only for trivial tests. Please see the 2.x tutorial on the
Nutch wiki for details of how to configure the supported Gora persistent
data stores.

Once you've configured your Nutch deployment job file and run the crawl
script, please get back to us with your results.
Remember that you always need to regenerate your Nutch job file after you
make configuration changes to your Nutch deployment.
hth  
Thanks
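
A rough sketch of the two steps mentioned above: regenerating the job file after a configuration change, then starting the crawl script from the deploy runtime so that it submits to Hadoop. Everything here is an assumption to adapt, not a confirmed setup: that /usr/local/nutch is a Nutch source checkout, the crawl ID, the Solr URL, and the crawl script argument order (seed dir, crawl ID, Solr URL, number of rounds), which differs between Nutch versions. Running bin/crawl with no arguments prints the usage for your version. The sketch only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: rebuild the Nutch job file, then launch the crawl script
# from runtime/deploy. All paths and arguments below are assumptions.

NUTCH_SRC="${NUTCH_SRC:-/usr/local/nutch}"          # assumed source checkout
SEED_DIR="${SEED_DIR:-dmoz}"                        # seed directory named in the thread
CRAWL_ID="${CRAWL_ID:-crawl}"                       # hypothetical crawl ID
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"  # hypothetical Solr URL
ROUNDS="${ROUNDS:-3}"                               # stands in for the old -depth 3

# Print each command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Rebuild runtime/deploy (including the .job file) so that edits under
# conf/ are packaged into the job that Hadoop runs.
rebuild_job() {
    run ant -f "$NUTCH_SRC/build.xml" runtime
}

# Start the crawl script from the deploy runtime rather than calling the
# deprecated Crawler class through "hadoop jar".
start_crawl() {
    run "$NUTCH_SRC/runtime/deploy/bin/crawl" \
        "$SEED_DIR" "$CRAWL_ID" "$SOLR_URL" "$ROUNDS"
}

rebuild_job
start_crawl
```

Once the printed commands look right for your installation, run the script again with RUN=1. The rebuild step needs repeating after every change under conf/.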
