> So the setup starts by recommending rolling your own Hadoop (pain in the
> ass) OR using a beta ( :( ).
CDH3 is not in beta. The latest version is a release, CDH3U1. I think most
people at this point will just use CDH, so all of that about rolling your
own compile of the Hadoop sources -- that is hard? ("ant") -- is a
non-issue.

> First, you have to learn:
> 1) Linux HA
> 2) DRBD
>
> Right out of the gate just to have a redundant name node.

Likewise the HA namenode. Most don't do that, I suspect. However, we did.
Having a modicum of Linux system administration experience, we were
already familiar with DRBD and the RHEL Cluster Suite, so this was not
anything we had not seen before.

Maybe you are arguing Cassandra is easier for noobs to set up? I guess
that's great. But I would not want such a person running my production,
and I can't see how any serious person would.

> *FUD ALARM* "Cassandra is rife with cascading cluster failure
> scenarios."
> ....and hbase never has issues apparently. (remember I am on both lists)

What Ryan said regarding this, I agree with completely. I've had occasion
over the years to wrangle both master-slave and peer-to-peer systems in
various failure modes. In many cases a master gives you a single point of
control to regain control of an errant system. There is no such thing in
a P2P system; you have to shut down everything and reinitialize.

However, refer to my response to the mail that started this thread.
Whether a master-slave or P2P architecture is appropriate for a given use
case involves a series of trade-offs. There is no simple answer. Neither
is superior to the other.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

----- Original Message -----
> From: Edward Capriolo <edlinuxg...@gmail.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Friday, September 2, 2011 1:53 AM
> Subject: Re: HBase and Cassandra on StackOverflow
>
> On Wed, Aug 31, 2011 at 1:34 AM, Time Less <timelessn...@gmail.com> wrote:
>
>> Most of your points are dead-on.
>>
>> > Cassandra is no less complex than HBase.
>> > All of this complexity is "hidden" in the sense that with
>> > Hadoop/HBase the layering is obvious -- HDFS, HBase, etc. -- but the
>> > Cassandra internals are no less layered.
>> >
>> > Operationally, however, HBase is more complex. Admins have to
>> > configure and manage ZooKeeper, HDFS, and HBase. Could this be
>> > improved?
>>
>> I strongly disagree with the premise[1]. Having personally been
>> involved in the Digg Cassandra rollout, having been until a couple of
>> months ago in part-time weekly contact with the Digg Cassandra
>> administrator, and having very close ties to the SimpleGeo Cassandra
>> admin, I know it is a fickle beast. Having also spent a good amount of
>> time at StumbleUpon and Mozilla (and now Riot Games), I also see
>> first-hand that HBase is far more stable and -- dare I say it? --
>> operationally more simple.
>>
>> So okay, HBase is "harder to set up" if following a step-by-step guide
>> on a wiki is "hard,"[2] but it's FAR easier to administer. Cassandra is
>> rife with cascading cluster failure scenarios. I would not recommend
>> running Cassandra in a highly-available, high-volume data scenario, but
>> don't hesitate to do so for HBase.
>>
>> I do not know if this is a guaranteed (provable due to architecture)
>> result, or just the result of the Cassandra community being... how
>> shall I say... hostile to administrators. But then, to me it doesn't
>> matter. Results do.
>>
>> --
>> Tim Ellis
>> Data Architect, Riot Games
>> [1] That said, the other part of your statement is spot-on, too. It's
>> surely possible to improve the HBase architecture or simplify it.
>> [2] I went from having never set up HBase nor ever used Chef to having
>> functional Chef recipes that installed a functional HBase/HDFS cluster
>> in about 2 weeks.
>> From my POV, the biggest stumbling point was that HDFS by default
>> stores critical data in the underlying filesystem's /tmp directory,
>> which is, for lack of a better word, insane. If I had to suggest how to
>> simplify "HBase installation," I'd ask for sane HDFS config files that
>> are extremely common and difficult to ignore.
>
> Why are you quoting "harder"? What was said was "more complex". Setting
> up N things is more complex than setting up a single thing.
>
> First, you have to learn:
> 1) Linux HA
> 2) DRBD
>
> Right out of the gate, just to have a redundant name node.
>
> This is not easy, fast, or simple. In fact it is quite a pain.
> http://docs.google.com/viewer?a=v&q=cache:9rnx-eRzi1AJ:files.meetup.com/1228907/Hadoop%2520Namenode%2520High%2520Availability.pptx+linux+ha+namenode&hl=en&gl=us&pid=bl&srcid=ADGEESig5aJNVAXbLgBwyc311sPSd88jUJbKHx4z2PQtDKHnmM1FuCJpg2IUyqi5JrmUL3RbCb8QRYsjHnP74YuKQfOQXoUZxnhrCy6N1kVpiG1jNi4zhqoKlUTmoDaqS1NegCFb6-WM&sig=AHIEtbQbjN1Olwxui5JmywdWzhqv4Hq3tw&pli=1
>
> Doing it properly involves setting up physical wires between servers or
> link aggregation groups. You can't script having someone physically run
> crossover cables. You need your switching engineer to set up LAGs.
> Also, you may notice that everyone who describes this setup describes it
> using Linux-HA v1, which has been deprecated for over 2 years. That also
> demonstrates how complicated the process is: people tend to touch it
> once and never touch it again because of how fragile it is.
>
> You are also implying that following the wiki is easy. Personally, I
> find that the wiki has fine detail, but it is confusing. Here is why:
>
> "1.3.1.2. hadoop
>
> This version of HBase will only run on Hadoop 0.20.x. It will not run on
> hadoop 0.21.x (nor 0.22.x). HBase will lose data unless it is running on
> an HDFS that has a durable sync. Currently only the branch-0.20-append
> branch has this attribute[1].
> No official releases have been made from this branch up to now, so you
> will have to build your own Hadoop from the tip of this branch. Michael
> Noll has written a detailed blog, Building an Hadoop 0.20.x version for
> HBase 0.90.2, on how to build an Hadoop from branch-0.20-append.
> Recommended.
>
> Or rather than build your own, you could use Cloudera's CDH3. CDH has
> the 0.20-append patches needed to add a durable sync (CDH3 betas will
> suffice; b2, b3, or b4)."
>
> So the setup starts by recommending rolling your own Hadoop (pain in the
> ass) OR using a beta ( :( ).
>
> Then, when it gets to HBase, it branches into "Standalone HBase" and
> Section 1.3.2.2, "Distributed". Then it branches into
> "pseudo-distributed" and "fully distributed", and then the ZooKeeper
> section offers you two options: "1.3.2.2.2.2. ZooKeeper" and
> "1.3.2.2.2.2.1. Using existing ZooKeeper ensemble".
>
> Not to say this is hard or impossible, but it is a lot of information to
> digest, and all the branching decisions are hard for a first-time user
> to understand.
>
> Uppercasing the word FAR does not prove to me that HBase is easier to
> administer, nor does your employment history or second-hand stories from
> unnamed people you know. I can tell you why I think Cassandra is easier
> to manage:
>
> 1) There is only one log file: /var/log/cassandra/system.log
> 2) There is only one configuration folder, /usr/local/cassandra/conf,
>    holding cassandra.yaml and cassandra-env.sh
> 3) I do not need to keep a chart or post-it notes recording where all
>    these one-off components live: zk server list, hbase master server
>    list, namenode
> 4) There is no need to configure auxiliary stuff such as DRBD or
>    Linux-HA
>
> *FUD ALARM* "Cassandra is rife with cascading cluster failure
> scenarios."
> ....and hbase never has issues apparently. (remember I am on both lists)
>
> Also...
>> [2] I went from having never set up HBase nor ever used Chef to having
>> functional Chef recipes that installed a functional HBase/HDFS cluster
>> in about 2 weeks.
>
> It took me about one hour to accomplish the same result with Puppet +
> Cassandra.
> http://www.jointhegrid.com/highperfcassandra/?p=62
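[Editor's note: Tim's footnote earlier in the thread complains that HDFS
stores critical data under /tmp by default. That default comes from
Hadoop's hadoop.tmp.dir property, which the 0.20-era namenode and datanode
directories (dfs.name.dir, dfs.data.dir) derive from unless overridden. A
minimal sketch of the overrides he is asking for -- the /data/hadoop paths
are placeholders chosen for illustration, not anything from this thread:]

```xml
<!-- core-site.xml: move everything derived from hadoop.tmp.dir off /tmp,
     so an OS tmp-cleaner or reboot cannot wipe the filesystem metadata.
     /data/hadoop is a placeholder path, not from the thread. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>
</property>

<!-- hdfs-site.xml: pin the namenode image and datanode block directories
     explicitly instead of relying on the ${hadoop.tmp.dir} defaults. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/data</value>
</property>
```

[With these set, the namenode image no longer lives under /tmp, which is
the failure mode Tim calls "insane".]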