Re: Best practice for Nutch 2.x on AWS?

Michael Chen Tue, 15 Aug 2017 17:38:51 -0700

Hi Sebastian,

Thanks for the reply. I do have to use 2.x for some functionality, so I guess I might have to stick to HDFS for now...

I set up a 5-node Hadoop cluster with HBase and Solr services using Cloudera Manager (and it still took me a while...), and I've installed Nutch on all nodes. I'm a bit confused about how to deploy the job to the cluster. Do I only interact with the master node, setting the configuration and seeds, and Hadoop will manage the cluster?


Is there a good reference for Nutch2 deployment?

Thank you!

Michael


On 08/15/2017 02:49 AM, Sebastian Nagel wrote:

Hi Michael,

> Will I be able to use S3 as data storage so that I can keep the data when EC2 instance stops?

I don't know whether this is easily possible for 2.x and HBase. But Nutch 1.x can read and write data directly from S3 (via the S3A file system [1]). Only operations on the CrawlDb need a little modification: they move the current folder to old and the temp folder to current, and S3 does not support moves. But this is easy to work around by copying between S3 and HDFS.
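
A minimal sketch of that copy-based workaround, assuming the CrawlDb is staged on HDFS for the update and synced back to S3 afterwards. The bucket name, the HDFS path, and the helper `crawldb_sync_commands` are hypothetical; the sketch only builds `hadoop distcp` command lines and does not run them.

```python
import shlex


def crawldb_sync_commands(s3_crawldb: str, hdfs_crawldb: str) -> dict:
    """Build distcp commands to stage a CrawlDb on HDFS and sync it back to S3.

    Both arguments are assumed to be CrawlDb root folders that contain a
    "current" subfolder, as in a Nutch 1.x crawl directory.
    """
    s3_current = s3_crawldb.rstrip("/") + "/current"
    hdfs_current = hdfs_crawldb.rstrip("/") + "/current"
    return {
        # Before the update: copy the CrawlDb from S3 to HDFS, where the
        # current -> old and temp -> current moves are supported.
        "pull": ["hadoop", "distcp", "-update", s3_current, hdfs_current],
        # After the update: copy the new CrawlDb back to S3 and delete
        # files on S3 that no longer exist on HDFS.
        "push": ["hadoop", "distcp", "-update", "-delete", hdfs_current, s3_current],
    }


if __name__ == "__main__":
    commands = crawldb_sync_commands(
        "s3a://example-bucket/crawl/crawldb",   # hypothetical bucket
        "hdfs:///user/nutch/crawl/crawldb",     # hypothetical HDFS path
    )
    for step, argv in commands.items():
        print(step + ":", " ".join(shlex.quote(arg) for arg in argv))
```

The CrawlDb update itself would run between the two steps, against the HDFS copy. Segments do not need this treatment, since only the CrawlDb operations rely on moves.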

Best,
Sebastian

[1] https://wiki.apache.org/hadoop/AmazonS3

On 08/06/2017 02:29 AM, Michael Chen wrote:

Hi,

I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if anyone knows of a "best setup" for it. The Hadoop and HBase versions in current EMR releases don't seem to work with Nutch 2.x. Does it sound like a good idea to manually set up Hadoop clusters and then run Nutch on them? Will I be able to use S3 as data storage so that I can keep the data when the EC2 instance stops?

Any suggestions would be very helpful!

Thanks in advance,

Michael
