Hi Sebastian,
Thanks for the reply. I do have to use 2.x for some functionality, so
I guess I'll have to stick with HDFS for now...
I set up a 5-node Hadoop cluster with HBase and Solr services using
Cloudera Manager (and it still took me a while...), and I've installed
Nutch on all nodes. I'm a bit confused about how to deploy the job to
the cluster. Do I only interact with the master node, setting the
configuration and seeds there, and Hadoop will manage the rest of the
cluster?
Is there a good reference for Nutch2 deployment?
Thank you!
Michael
On 08/15/2017 02:49 AM, Sebastian Nagel wrote:
Hi Michael,
Will I be able to use S3 as data storage so that I can keep the data when EC2
instance stops?
I don't know whether this is easily possible for 2.x and HBase. But
Nutch 1.x can read and write data directly from and to S3 (via the S3A
file system [1]). Only operations on the CrawlDb need a little
modification: updating it moves the current folder to old and the temp
folder to current, and S3 does not support moves.
But this is easily worked around by copying between S3 and HDFS.
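A minimal sketch of that copy step, assuming the copies are done with
`hadoop distcp`. The bucket name and HDFS path are hypothetical, and the
helper names are made up for illustration. The script only builds and
prints the two commands; it does not run them. How distcp lays out files
depends on whether the target already exists, so check its documentation
before relying on this.

```python
import shlex


def distcp_command(src, dst):
    """Build a `hadoop distcp` invocation that copies src to dst."""
    return ["hadoop", "distcp", src, dst]


def crawldb_sync_plan(s3_crawldb, hdfs_crawldb):
    """Return the copy commands around a CrawlDb update.

    "pull" copies the CrawlDb from S3 to HDFS before the update,
    "push" copies the updated CrawlDb from HDFS back to S3.
    """
    return {
        "pull": distcp_command(s3_crawldb, hdfs_crawldb),
        "push": distcp_command(hdfs_crawldb, s3_crawldb),
    }


if __name__ == "__main__":
    # Hypothetical locations; replace with your own bucket and paths.
    plan = crawldb_sync_plan(
        "s3a://my-bucket/crawl/crawldb",
        "hdfs:///user/nutch/crawl/crawldb",
    )
    for step, cmd in plan.items():
        print(step + ":", shlex.join(cmd))
```

The CrawlDb update itself would run between the two copies, against the
HDFS path, where moves are supported.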
Best,
Sebastian
[1] https://wiki.apache.org/hadoop/AmazonS3
On 08/06/2017 02:29 AM, Michael Chen wrote:
Hi,
I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering
if anyone knows of a "best setup" for it. The Hadoop and HBase versions
in current EMR releases don't seem to work with Nutch 2.x. Does it sound
like a good idea to manually set up Hadoop clusters and then run Nutch
on them?
Will I be able to use S3 as data storage so that I can keep the data when EC2
instance stops?
Any suggestions would be very helpful!
Thanks in advance,
Michael