A high rowkey collision rate means high contention for the row locks: writes that target the same row have to wait for one another. This results in poor write performance.
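To put a number on it: 4 million inserts landing on about 5000 rows is roughly 800 puts per row on average. Below is a minimal Python sketch of how you could measure that on the client side, and of merging cells that share a rowkey so that each row is written once per batch. It is not HBase client code; the function names, the `cf:` column names and the sample cells are made up for illustration, and whether merging helps in your case depends on how your loader builds its puts.

```python
from collections import Counter, defaultdict


def collision_stats(row_keys):
    """Return (number of puts, number of distinct rows, average puts per row)."""
    counts = Counter(row_keys)
    puts = sum(counts.values())
    rows = len(counts)
    return puts, rows, (puts / rows if rows else 0.0)


def coalesce_by_row(cells):
    """Group (row_key, column, value) cells by row key.

    Each row then needs only one write per batch. In this sketch a later
    value for the same column overwrites an earlier one.
    """
    merged = defaultdict(dict)
    for row_key, column, value in cells:
        merged[row_key][column] = value
    return dict(merged)


# Hypothetical sample data: four cells that hit only two distinct rows.
cells = [
    ("user1", "cf:a", 1),
    ("user1", "cf:b", 2),
    ("user2", "cf:a", 3),
    ("user1", "cf:c", 4),
]

puts, rows, avg = collision_stats(key for key, _, _ in cells)
batch = coalesce_by_row(cells)

print(puts, rows, avg)  # puts, distinct rows, average puts per row
print(len(batch))       # number of row-level writes after merging
```

Since your colliding rows differed only in their columns, the merged form carries the same data with one write per row instead of one write per cell.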
Cheers

On May 11, 2013, at 7:14 PM, Pal Konyves <paul.kony...@gmail.com> wrote:

> Hi,
>
> I decided not to make any tuning, because my whole project is about
> experimenting with HBase (it's a school project). However it turned out that
> my sample data generated lots of rowkey collisions. 4 million inserts only
> resulted in about 5000 rows. The data were different though in the columns.
> When I changed my sample dataset to have no collisions in the rowkey, the
> performance increased with a magnitude of 10. Why is that?
>
> Thanks,
> Pal
>
>
> On Thu, May 9, 2013 at 2:32 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>
>> What I am saying is that by default, you get two mappers per node.
>> x4large can run HBase w more mapred slots, so you will want to tune the
>> defaults based on machine size. Not just mapred, but also HBase stuff too.
>> You need to do this on startup of EMR cluster though...
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On May 9, 2013, at 2:39 AM, Pal Konyves <paul.kony...@gmail.com> wrote:
>>
>>> Principally I chose to use Amazon, because they are supposedly high
>>> performance, and what more important is: HBase is already set up if I chose
>>> it as an EMR Workflow. I wanted to save up the time setting up the cluster
>>> manually on EC2 instances.
>>>
>>> Are you saying I will reach higher performance when I set up the HBase on
>>> the cluster manually, instead of the default Amazon HBase distribution? Or
>>> is it worth to tune the Amazon distribution with a bootstrap action? How
>>> long does it take, to set up the cluster with HDFS manually?
>>>
>>> I will also try larger instance types.
>>>
>>>
>>> On Thu, May 9, 2013 at 6:47 AM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>
>>>> With respect to EMR, you can run HBase fairly easily.
>>>> You can't run MapR w HBase on EMR stick w Amazon's release.
>>>>
>>>> And you can run it but you will want to know your tuning parameters up
>>>> front when you instantiate it.
>>>>
>>>>
>>>> Sent from a remote device. Please excuse any typos...
>>>>
>>>> Mike Segel
>>>>
>>>> On May 8, 2013, at 9:04 PM, Andrew Purtell <apurt...@apache.org> wrote:
>>>>
>>>>> M7 is not Apache HBase, or any HBase. It is a proprietary NoSQL datastore
>>>>> with (I gather) an Apache HBase compatible Java API.
>>>>>
>>>>> As for running HBase on EC2, we recently discussed some particulars, see
>>>>> the latter part of this thread: http://search-hadoop.com/m/rI1HpK90gu where
>>>>> I hijack it. I wouldn't recommend launching HBase as part of an EMR flow
>>>>> unless you want to use it only for temporary random access storage, and in
>>>>> which case use m2.2xlarge/m2.4xlarge instance types. Otherwise, set up a
>>>>> dedicated HBase backed storage service on high I/O instance types. The
>>>>> fundamental issue is IO performance on the EC2 platform is fair to poor.
>>>>>
>>>>> I have also noticed a large difference in baseline block device latency if
>>>>> using an old Amazon Linux AMI (< 2013) or the latest AMIs from this year.
>>>>> Use the new ones, they cut the latency long tail in half. There were some
>>>>> significant kernel level improvements I gather.
>>>>>
>>>>>
>>>>> On Wed, May 8, 2013 at 10:42 AM, Marcos Luis Ortiz Valmaseda <
>>>>> marcosluis2...@gmail.com> wrote:
>>>>>
>>>>>> I think that you when you are talking about RMap, you are referring to
>>>>>> MapR´s distribution.
>>>>>> I think that MapR´s team released a very good version of its Hadoop
>>>>>> distribution focused on HBase called M7.
>>>>>> You can see its overview here:
>>>>>> http://www.mapr.com/products/mapr-editions/m7-edition
>>>>>>
>>>>>> But this release was under beta testing, and I see that it´s not included
>>>>>> in the Amazon Marketplace yet:
>>>>>> https://aws.amazon.com/marketplace/seller-profile?id=802b0a25-877e-4b57-9007-a3fd284815a5
>>>>>>
>>>>>>
>>>>>> 2013/5/7 Pal Konyves <paul.kony...@gmail.com>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Has anyone got some recommendations about running HBase on EC2? I am
>>>>>>> testing it, and so far I am very disappointed with it. I did not change
>>>>>>> anything about the default 'Amazon distribution' installation. It has one
>>>>>>> MasterNode and two slave nodes, and write performance is around 2500 small
>>>>>>> rows per sec at most, but I expected it to be way better. Oh, and this is
>>>>>>> with batch put operations with autocommit turned off, where each batch
>>>>>>> contains about 500-1000 rows... When I do it with autocommit, it does not
>>>>>>> even reach the 1000 rows per sec.
>>>>>>>
>>>>>>> All nodes were m1.Large ones.
>>>>>>>
>>>>>>> Any experiences, suggestions? Is it worth to try the RMap distribution
>>>>>>> instead of the amazon one?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Pal
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Marcos Ortiz Valmaseda
>>>>>> Product Manager at PDVSA
>>>>>> http://about.me/marcosortiz
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>>
>>>>> - Andy
>>>>>
>>>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>>>> (via Tom White)