I am not with StumbleUpon, but I can tell you how one of my clients does it.
We are serving the website from HBase. We used to run M/R jobs on the same cluster, but quickly found that this was a bad idea. We now do only very minimal Hadoop M/R work on the web cluster. Every time we run an M/R job on the web cluster, we see increased latency in the web app. It makes sense: if you do more work on the cluster, it will not be able to respond as quickly. This isn't a new idea; traditionally, data warehousing and deep analytics tasks are separated from OLTP processing.

We ended up splitting the cluster into two clusters: a web cluster and a compute cluster. The longer jobs that run on the compute cluster have quite a few steps. The first step pulls data from the HBase cluster, and the final step puts the results back to the HBase cluster.

We manage our own indexes. The only jobs that run on the HBase cluster are indexing jobs. Anything that does any sort of analytics runs on the compute cluster.

-Matthew

On Sep 23, 2010, at 10:11 PM, Bishal Acharya wrote:

> I am running a 20 node cluster with hadoop/hbase. Currently what I am doing
> is that, I run the MR jobs in the cluster and at the same time I am serving
> my web application directly from Hbase in the same cluster. What happens is,
> when I am not running any MR jobs the applications are running perfectly
> fine, But when I run MR jobs at the same time as I am browsing my
> application, I am faced with this increase in latency while browsing. How
> could I properly manage my cluster so that I don't have to face the added
> latency due to cluster being saturated by MR jobs. I wanted to know
> specifically how this is done in companies using Hbase for front serving for
> example at StumbleUpon ? How do they manage this issue ?
>
> --
>
> Sincerely,
>
> *Bishal Acharya*
>
> /Software Engineer | D2HawkeyeServices Pvt. Ltd.
> | Subsidiary of Verisk Health, USA/
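
To make the pull / compute / write-back flow from the reply concrete, here is a minimal sketch. It is only an illustration under stated assumptions: plain Python dicts stand in for the HBase table on the web cluster, and the names web_table, extract, analyze, and write_back are hypothetical, not part of HBase or Hadoop. The point is just that the serving store is touched once at the start and once at the end, while the intermediate steps work on a separate copy.

```python
# Hypothetical stand-in for an HBase table on the web cluster:
# row key -> column values. A real job would read from HBase instead.
web_table = {
    "page:1": {"views": ["alice", "bob", "alice"]},
    "page:2": {"views": ["carol"]},
}


def extract(table):
    """First step: copy the rows out, so later steps never touch the web cluster."""
    return {row: dict(cols) for row, cols in table.items()}


def analyze(snapshot):
    """Middle steps: run only on the copy (here, count unique viewers per row)."""
    return {row: len(set(cols["views"])) for row, cols in snapshot.items()}


def write_back(table, results):
    """Final step: put the computed results back into the serving table."""
    for row, unique in results.items():
        table[row]["unique_viewers"] = unique


snapshot = extract(web_table)     # one read pass against the serving store
results = analyze(snapshot)       # all heavy work happens on the copy
write_back(web_table, results)    # one short write pass at the end
```

In a real deployment the copy would live on the compute cluster and the middle step would be a chain of M/R jobs; the sketch only shows where the serving store is and is not involved.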
