How do I run the crawl script on Hadoop?

--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
On 4 February 2014 at 1:28:39 am, Lewis John Mcgibbney ([email protected]) wrote:

Hi Manikandan,

On Mon, Feb 3, 2014 at 3:45 PM, <[email protected]> wrote:

> And then, I'm running this:
> $HADOOP_HOME/bin/hadoop jar /usr/local/nutch/nutch.job
> org.apache.nutch.crawl.Crawler dmoz -dir /user/hduser/crawl -depth 3 -topN
> 5000

You're using the Crawler class. This is not advised at all, and the class is now deprecated. There is no point in downloading the crawl script if you are going to use the Crawler class. I would suggest using the crawl script instead.

> org.apache.gora.memory.store.MemStore as the Gora storage class.

Please don't use MemStore: its implementation in Gora 0.3 is not thread safe and is only meant for trivial tests. Please see the 2.x tutorial on the Nutch wiki for details of how to configure the supported Gora persistent data stores.

Once you've used the crawl script and configured your Nutch deployment job file, please get back to us with your results. Remember that you will always need to regenerate your Nutch job file if you make configuration changes to your Nutch deployment.

hth
Thanks
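
For anyone following along, the two steps described above (regenerate the job file, then run the crawl script in deploy mode) might look roughly like the sketch below. This is not taken from the thread: the source path, crawl ID, Solr URL and the argument order of bin/crawl are all assumptions based on the 2.x crawl script, so run bin/crawl with no arguments to see the usage for your version. The script only prints the commands so they can be reviewed before running anything.

```shell
#!/bin/sh
# Sketch only: every value below is an assumption, not something confirmed in the thread.
NUTCH_SRC="${NUTCH_SRC:-/usr/local/nutch}"   # assumed Nutch source checkout
SEED_DIR="dmoz"                              # seed directory name used in the thread
CRAWL_ID="crawl1"                            # hypothetical crawl ID
SOLR_URL="http://localhost:8983/solr/"       # hypothetical Solr URL
ROUNDS=3                                     # mirrors the -depth 3 from the old command

# Print each step instead of executing it, so the plan can be checked first.
plan() {
  # Step 1: rebuild so that configuration changes end up in the job file.
  echo "cd $NUTCH_SRC && ant runtime"
  # Step 2: run the crawl script from the deploy directory (assumed argument order).
  echo "cd $NUTCH_SRC/runtime/deploy && bin/crawl $SEED_DIR $CRAWL_ID $SOLR_URL $ROUNDS"
}

plan
```

Running the printed commands by hand assumes the hadoop binary is on the PATH and that the seed directory already exists in HDFS.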

