Hi Michael,

except for HBase and Solr nothing has to be deployed or installed.

On the Hadoop master node:
- build Nutch via
    ant runtime
- point NUTCH_HOME to the directory where the job file is placed
    export NUTCH_HOME=.../runtime/deploy
- run
    $NUTCH_HOME/bin/nutch ....
  the Hadoop job is then launched via
    "hadoop jar $NUTCH_HOME/*.job ..."

Of course, the executable "hadoop" must be on your path,
but that should be the case on the master node.

> Is there a good reference for Nutch2 deployment?

I do not know of one, but I haven't searched for one.
The tutorials in the Nutch wiki need an update, esp. for
distributed mode in combination with 2.x.

Best,
Sebastian

On 08/16/2017 02:38 AM, Michael Chen wrote:
> Hi Sebastian,
>
> Thanks for the reply. I do have to use 2.x for some functionalities, so I guess I
> might have to stick to HDFS for now...
>
> I set up a 5-node hadoop cluster with HBase and Solr services by Cloudera Manager
> (and it still took me a while...), and I've installed Nutch on all nodes. I'm a bit
> confused about how to deploy the job to the cluster. Do I only interact with the
> master node, setting configuration and seeds, and hadoop will manage the cluster?
>
> Is there a good reference for Nutch2 deployment?
>
> Thank you!
>
> Michael
>
>
> On 08/15/2017 02:49 AM, Sebastian Nagel wrote:
>> Hi Michael,
>>
>>> Will I be able to use S3 as data storage so that I can keep the data when
>>> EC2 instance stops?
>> I don't know whether this is easily possible for 2.x and HBase. But Nutch 1.x can
>> read and write data directly from S3 (via S3A file system [1]). Only operations on
>> the CrawlDb need a little modification: they move data from current to old, resp.
>> the temp folder to current, and S3 does not support moves.
>> But this is easily worked around by copying between S3 and HDFS.
>>
>> Best,
>> Sebastian
>>
>> [1] https://wiki.apache.org/hadoop/AmazonS3
>>
>> On 08/06/2017 02:29 AM, Michael Chen wrote:
>>> Hi,
>>>
>>> I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if anyone
>>> knows of a "best setup" for it.
>>> The Hadoop and HBase versions in current EMR releases don't seem to work with
>>> Nutch 2.x. Does it sound like a good idea to manually set up Hadoop clusters and
>>> then run Nutch on them? Will I be able to use S3 as data storage so that I can
>>> keep the data when the EC2 instance stops?
>>>
>>> Any suggestions would be very helpful!
>>>
>>> Thanks in advance,
>>>
>>> Michael
>>>