Hi Sebastian,

Thanks for the reply. I do need 2.x for some of its functionality, so I guess I'll have to stick with HDFS for now...

I set up a 5-node Hadoop cluster with HBase and Solr services using Cloudera Manager (and it still took me a while...), and I've installed Nutch on all nodes. I'm a bit confused about how to deploy the job to the cluster. Do I only interact with the master node, setting the configuration and seeds there, and let Hadoop manage the rest of the cluster?

Is there a good reference for Nutch 2.x deployment?

Thank you!

Michael


On 08/15/2017 02:49 AM, Sebastian Nagel wrote:
Hi Michael,

> Will I be able to use S3 as data storage so that I can keep the data when EC2 instance stops?

I don't know whether this is easily possible for 2.x and HBase. But Nutch 1.x can read and write data directly from S3 (via the S3A file system [1]). Only operations on the CrawlDb need a little modification: they move the current data to old and the temp folder to current, and S3 does not support moves. But this is easily worked around by copying between S3 and HDFS.
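
The copy workaround could be sketched roughly as below. This is only an illustration, not part of Nutch: it builds `hadoop distcp` command lines to pull the CrawlDb from S3 onto HDFS before an update and push it back afterwards. The bucket name, the paths, and the helper names (`distcp_command`, `crawldb_round_trip`, `run`) are all hypothetical, and the sketch assumes the destination path does not exist yet when each copy runs.

```python
import subprocess


def distcp_command(src, dst):
    """Build a `hadoop distcp` command line that copies src to dst."""
    return ["hadoop", "distcp", src, dst]


def crawldb_round_trip(s3_crawldb, hdfs_crawldb):
    """Return the (pull, push) commands for one CrawlDb update cycle.

    pull copies the CrawlDb from S3 to HDFS, where moves are supported;
    push copies the updated CrawlDb from HDFS back to S3.
    """
    pull = distcp_command(s3_crawldb, hdfs_crawldb)
    push = distcp_command(hdfs_crawldb, s3_crawldb)
    return pull, push


def run(command, dry_run=True):
    """Print the command (dry run) or execute it; return the exit code."""
    if dry_run:
        print(" ".join(command))
        return 0
    return subprocess.run(command, check=True).returncode


if __name__ == "__main__":
    # Hypothetical locations; replace with your own bucket and HDFS path.
    pull, push = crawldb_round_trip(
        "s3a://example-bucket/crawl/crawldb",
        "hdfs:///user/nutch/crawl/crawldb",
    )
    run(pull)
    # ... run the CrawlDb update against the HDFS copy here ...
    run(push)
```

With `dry_run=True` the script only prints the two commands, so they can be checked before anything is copied.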

Best,
Sebastian

[1] https://wiki.apache.org/hadoop/AmazonS3

On 08/06/2017 02:29 AM, Michael Chen wrote:
Hi,

I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if anyone knows of a "best setup" for it. The Hadoop and HBase versions in current EMR releases don't seem to work with Nutch 2.x. Does it sound like a good idea to manually set up Hadoop clusters and then run Nutch on them? Will I be able to use S3 as data storage so that I can keep the data when the EC2 instance stops?

Any suggestions would be very helpful!

Thanks in advance,

Michael

