Hi Marc,

Sounds like you went through a successful learning exercise. 


>> KeeperException$NoNodeException, IOException

Resource starvation usually manifests first in ZooKeeper. This is why
our EC2 scripts run the quorum as a separate ensemble, so its I/O and
CPU loading are independent of the rest of the cluster.
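
In concrete terms it is a couple of knobs, roughly like this (a
minimal sketch; the quorum host names are placeholders):

  # hbase-env.sh: tell HBase not to manage its own ZooKeeper
  export HBASE_MANAGES_ZK=false

  <!-- hbase-site.xml: point HBase at the external ensemble -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>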

>    - Large instances in EC2 -- meaning more memory and more cores.

c1.medium for ZooKeeper and c1.xlarge for Hadoop/HBase master and
slaves are really the minimum you can consider. I have tried other
types, including m1.xlarge, with unsatisfactory results.
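
In the EC2 scripts this amounts to setting per-role instance types in
the environment file. A rough sketch (the variable names here are
illustrative, not necessarily the actual ones in src/contrib/ec2):

  # Illustrative per-role instance types
  ZOOKEEPER_INSTANCE_TYPE=c1.medium   # quorum peers
  MASTER_INSTANCE_TYPE=c1.xlarge      # NameNode + HBase master
  SLAVE_INSTANCE_TYPE=c1.xlarge       # DataNode / RegionServer slaves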

>    - Xcievers setting for HDFS and ulimit -n increase

Mandatory for every installation, no matter where it runs. Also,
thanks again to Robert @ VF for the excellent tip regarding
RedHat-derived systems in this regard. That is something I had
overlooked for a long time.
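
For anyone following along, this is roughly what is meant. The values
are illustrative starting points; tune for your load:

  <!-- hdfs-site.xml on every DataNode; note the property name
       really is spelled "xcievers" -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2047</value>
  </property>

  # /etc/security/limits.conf, for whichever user runs the
  # daemons ("hadoop" here is an example)
  hadoop  soft  nofile  32768
  hadoop  hard  nofile  32768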

>    - Originally I had both the Hadoop Master and HBase master on the
>       same instance. Now they are separate.

Keeping them together would have been OK. The HBase master normally
does very little. Our EC2 scripts set up the HDFS NameNode and the
HBase master on the same instance. We do not try to engineer around
the NameNode single point of failure, so we might as well colocate a
single HBase master there as well. I do have an issue open about
engineering high availability for HBase EC2 clusters. See
https://issues.apache.org/jira/browse/HBASE-2098

>       - I also killed the TaskTrackers on the Hadoop Master and the
>         HBase master

Good move. Running TaskTrackers there is not good at all. User tasks
can swamp CPU and memory, and NameNode and HBase master liveness will
naturally degrade, perhaps fatally. The NameNode in particular must
have resources reserved for it alone.

This is the single largest factor in the improvements you are seeing.
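
The easy way to enforce that with the stock start scripts is to keep
the master hosts out of conf/slaves, so start-dfs.sh and
start-mapred.sh never launch DataNodes or TaskTrackers there (host
names below are placeholders):

  # conf/slaves: worker nodes only; the NameNode and HBase
  # master hosts are deliberately absent
  slave1.example.com
  slave2.example.com
  slave3.example.com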

  - Andy



----- Original Message ----
> From: Marc Limotte <[email protected]>
> To: [email protected]
> Sent: Fri, January 8, 2010 10:24:48 AM
> Subject: Re: Seeing errors after loading a fair amount of data.  
> KeeperException$NoNodeException, IOException
> 
> Thanks for your suggestions, Ryan, Robert, Andy.
> 
> I was able to get it to run all the way through now, and loaded up 300
> million rows, which is the volume I wanted in order to do some performance
> testing for my application.
> 
> I'm not entirely sure what made the difference. At first I was trying one
> thing at a time and getting inconsistent results (but always an eventual
> failure). Then I just reworked the whole thing and got it to work. 
[...]
