Re: Best practice for Nutch 2.x on AWS?

Sebastian Nagel Fri, 18 Aug 2017 05:33:08 -0700

Hi Michael,

Except for HBase and Solr, nothing has to be deployed or installed.


On the Hadoop master node:
- build Nutch via
   ant runtime
- point NUTCH_HOME to the directory where the job file is placed
   export NUTCH_HOME=.../runtime/deploy
- run
   $NUTCH_HOME/bin/nutch ....
  the Hadoop job is then launched via "hadoop jar $NUTCH_HOME/*.job ..."
  Of course, the executable "hadoop" must be on your path, but that should
  be the case on the master node.
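
A minimal sketch of a pre-flight check for the steps above, assuming only what
is stated there: "hadoop" must be on the path and the job file must sit in
$NUTCH_HOME. The function name nutch_deploy_ready is made up for illustration
and is not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Nutch): checks the two preconditions
# for running bin/nutch in deploy mode on the Hadoop master node.
# Usage: nutch_deploy_ready <path-to-runtime/deploy>
# Returns 0 if both checks pass, 1 otherwise.
nutch_deploy_ready() {
    nutch_home=$1

    # bin/nutch launches the job via "hadoop jar", so hadoop must be on PATH.
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop executable not found on PATH" >&2
        return 1
    fi

    # "ant runtime" is expected to place a *.job file in this directory.
    for f in "$nutch_home"/*.job; do
        if [ -f "$f" ]; then
            return 0
        fi
    done

    echo "no .job file found in $nutch_home" >&2
    return 1
}
```

It could be called as nutch_deploy_ready "$NUTCH_HOME" before invoking
$NUTCH_HOME/bin/nutch, so that a missing build or a missing hadoop executable
is reported before any job is submitted.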

> Is there a good reference for Nutch2 deployment?

I don't know of one, though I haven't searched for one.
The tutorials in the Nutch wiki need an update, especially for distributed mode
in combination with 2.x.

Best,
Sebastian

On 08/16/2017 02:38 AM, Michael Chen wrote:
> Hi Sebastian,
> 
> Thanks for the reply. I do have to use 2.x for some functionalities, so I guess I might have to
> stick to HDFS for now...
> 
> I set up a 5-node Hadoop cluster with HBase and Solr services using Cloudera Manager (and it still
> took me a while...), and I've installed Nutch on all nodes. I'm a bit confused about how to deploy
> the job to the cluster. Do I only interact with the master node, setting the configuration and
> seeds, and Hadoop will manage the cluster?
> 
> Is there a good reference for Nutch2 deployment?
> 
> Thank you!
> 
> Michael
> 
> 
> On 08/15/2017 02:49 AM, Sebastian Nagel wrote:
>> Hi Michael,
>>
>>> Will I be able to use S3 as data storage so that I can keep the data when 
>>> EC2 instance stops?
>> I don't know whether this is easily possible for 2.x and HBase. But Nutch 1.x can read and write
>> data directly from S3 (via the S3A file system [1]). Only operations on the CrawlDb need a little
>> modification: they move the "current" folder to "old" and the temp folder to "current", and S3
>> does not support moves. But this is easily worked around by copying between S3 and HDFS.
>>
>> Best,
>> Sebastian
>>
>> [1] https://wiki.apache.org/hadoop/AmazonS3
>>
>> On 08/06/2017 02:29 AM, Michael Chen wrote:
>>> Hi,
>>>
>>> I'm trying to set up Nutch 2.x on AWS EC2 clusters, and I was wondering if anyone knows of a
>>> "best setup" for it. The Hadoop and HBase versions in current EMR releases don't seem to work
>>> with Nutch 2.x. Does it sound like a good idea to manually set up a Hadoop cluster and then run
>>> Nutch on it? Will I be able to use S3 as data storage so that I can keep the data when the EC2
>>> instance stops?
>>>
>>> Any suggestions would be very much helpful!
>>>
>>> Thanks in advance,
>>>
>>> Michael
>>>
>
