Hi Marc,

Sounds like you went through a successful learning exercise.
>> KeeperException$NoNodeException, IOException

Resource starvation first manifests with ZooKeeper. This is why our EC2
scripts run a separate quorum ensemble, so I/O and CPU loading are
independent.

> - Large instances in EC2 -- meaning more memory and more cores.

c1.medium for ZooKeeper and c1.xlarge for Hadoop/HBase master and slaves
are really the minimum you can consider. I have tried other instance
types, including m1.xlarge, with unsatisfactory results.

> - Xcievers setting for HDFS and ulimit -n increase

Mandatory for every installation, no matter where it runs. Also, thanks
again to Robert @ VF for the excellent tip regarding RedHat-derived
systems in this regard. That is something I had overlooked for a long
time. (Rough sketches of both the separate-ensemble and xcievers/ulimit
settings follow in the P.S. after the quoted message below.)

> - Originally I had both the Hadoop Master and HBase master on the
> same instance. Now they are separate.

That would be ok. The HBase master normally does very little. Our EC2
scripts set up the HDFS NameNode and HBase master on the same instance.
We do not try to engineer around the NameNode single point of failure,
so we might as well colocate the single HBase master instance there too.
I do have an issue open about engineering high availability for HBase
EC2 clusters. See https://issues.apache.org/jira/browse/HBASE-2098

> - I also killed the TaskTrackers on the Hadoop Master and the
> HBase master

Running TaskTrackers on those nodes is not good at all: user tasks can
swamp CPU and memory, and NameNode and Master liveness will naturally
degrade, perhaps fatally. The NameNode in particular must be given its
own reserved resources. Killing those TaskTrackers is the single largest
factor in the improvements you are seeing.

   - Andy


----- Original Message ----
> From: Marc Limotte <[email protected]>
> To: [email protected]
> Sent: Fri, January 8, 2010 10:24:48 AM
> Subject: Re: Seeing errors after loading a fair amount of data.
> KeeperException$NoNodeException, IOException
>
> Thanks for your suggestions, Ryan, Robert, Andy.
>
> I was able to get it to run all the way through now, and loaded up 300
> million rows, which is the volume I wanted in order to do some performance
> testing for my application.
>
> I'm not entirely sure what made the difference. At first I was trying one
> thing at a time and getting inconsistent results (but always an eventual
> failure). Then I just reworked the whole thing and got it to work. [...]
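P.S. For concreteness, here is a minimal sketch of the separate
ZooKeeper ensemble setup I described above. The hostnames zk1/zk2/zk3
are placeholders for your dedicated quorum instances:

  hbase-env.sh:

    # Don't let HBase manage its own ZooKeeper; we run a
    # dedicated ensemble instead
    export HBASE_MANAGES_ZK=false

  hbase-site.xml:

    <!-- Point HBase at the dedicated quorum -->
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>zk1,zk2,zk3</value>
    </property>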
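And the xcievers / open files settings. The values below are the ones
commonly recommended, not something tuned to your load, and "hadoop" is
a placeholder for whatever user runs the daemons:

  hdfs-site.xml:

    <!-- Raise the DataNode transceiver limit; note the property
         name really is spelled "xcievers" -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

  /etc/security/limits.conf:

    # Raise the open file descriptor limit for the daemon user
    hadoop  -  nofile  32768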
