Re: EC2 Elastic MapReduce HBase install recommendations

Ted Yu Sat, 11 May 2013 19:26:08 -0700

A high collision rate means high contention for the row locks, because
every write to the same row has to take that row's lock.
This results in poor write performance.
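
To make the collision point concrete, here is a minimal sketch (not from the
thread, and not HBase code) for checking how concentrated a sample dataset's
writes are before loading it. The function name rowkey_collision_stats and the
"row-N" key format are hypothetical, chosen only for illustration. The sample
is scaled down to 4,000 writes over 5 keys, which keeps the same ratio as the
4 million inserts over about 5000 rows described below.

```python
from collections import Counter


def rowkey_collision_stats(rowkeys):
    """Return (total_writes, distinct_rows, max_writes_to_one_row).

    A large gap between total_writes and distinct_rows means many writes
    target the same row and would queue up behind one another.
    """
    counts = Counter(rowkeys)
    total = sum(counts.values())
    distinct = len(counts)
    hottest = max(counts.values()) if counts else 0
    return total, distinct, hottest


# Hypothetical sample: 4,000 writes that land on only 5 distinct row keys.
sample = [f"row-{i % 5}" for i in range(4000)]

total, distinct, hottest = rowkey_collision_stats(sample)
print(f"{total} writes -> {distinct} rows, hottest row gets {hottest} writes")
```

If the distinct count is far below the write count, as in this sample, most
writes are competing for a small number of rows.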


Cheers

On May 11, 2013, at 7:14 PM, Pal Konyves <paul.kony...@gmail.com> wrote:

> Hi,
> 
> I decided not to do any tuning, because my whole project is about
> experimenting with HBase (it's a school project). However, it turned out that
> my sample data generated lots of rowkey collisions. 4 million inserts only
> resulted in about 5000 rows. The data in the columns were different, though.
> When I changed my sample dataset to have no collisions in the rowkey, the
> performance increased by a factor of 10. Why is that?
> 
> Thanks,
> Pal
> 
> 
> On Thu, May 9, 2013 at 2:32 PM, Michel Segel <michael_se...@hotmail.com> wrote:
> 
>> What I am saying is that by default, you get two mappers per node.
>> A 4xlarge instance can run HBase with more mapred slots, so you will want to
>> tune the defaults based on machine size. Not just mapred, but the HBase
>> settings too. You need to do this at startup of the EMR cluster, though...
>> 
>> Sent from a remote device. Please excuse any typos...
>> 
>> Mike Segel
>> 
>> On May 9, 2013, at 2:39 AM, Pal Konyves <paul.kony...@gmail.com> wrote:
>> 
>>> Principally I chose to use Amazon, because they are supposedly high
>>> performance, and more importantly, HBase is already set up if I choose
>>> it as an EMR workflow. I wanted to save the time of setting up the
>>> cluster manually on EC2 instances.
>>> 
>>> Are you saying I will reach higher performance if I set up HBase on
>>> the cluster manually, instead of using the default Amazon HBase
>>> distribution? Or is it worth tuning the Amazon distribution with a
>>> bootstrap action? How long does it take to set up the cluster with
>>> HDFS manually?
>>> 
>>> I will also try larger instance types.
>>> 
>>> 
>>> On Thu, May 9, 2013 at 6:47 AM, Michel Segel <michael_se...@hotmail.com> wrote:
>>> 
>>>> With respect to EMR, you can run HBase fairly easily.
>>>> You can't run MapR with HBase on EMR; stick with Amazon's release.
>>>> 
>>>> And you can run it, but you will want to know your tuning parameters up
>>>> front when you instantiate it.
>>>> 
>>>> 
>>>> 
>>>> Sent from a remote device. Please excuse any typos...
>>>> 
>>>> Mike Segel
>>>> 
>>>> On May 8, 2013, at 9:04 PM, Andrew Purtell <apurt...@apache.org> wrote:
>>>> 
>>>>> M7 is not Apache HBase, or any HBase. It is a proprietary NoSQL datastore
>>>>> with (I gather) an Apache HBase compatible Java API.
>>>>> 
>>>>> As for running HBase on EC2, we recently discussed some particulars, see
>>>>> the latter part of this thread: http://search-hadoop.com/m/rI1HpK90gu
>>>>> where I hijack it. I wouldn't recommend launching HBase as part of an EMR
>>>>> flow unless you want to use it only for temporary random access storage,
>>>>> in which case use m2.2xlarge/m2.4xlarge instance types. Otherwise, set up
>>>>> a dedicated HBase backed storage service on high I/O instance types. The
>>>>> fundamental issue is that IO performance on the EC2 platform is fair to
>>>>> poor.
>>>>> 
>>>>> I have also noticed a large difference in baseline block device latency
>>>>> between an old Amazon Linux AMI (< 2013) and the latest AMIs from this
>>>>> year. Use the new ones; they cut the latency long tail in half. There were
>>>>> some significant kernel level improvements, I gather.
>>>>> 
>>>>> 
>>>>> On Wed, May 8, 2013 at 10:42 AM, Marcos Luis Ortiz Valmaseda <
>>>>> marcosluis2...@gmail.com> wrote:
>>>>> 
>>>>>> I think that when you are talking about RMap, you are referring to
>>>>>> MapR's distribution.
>>>>>> I think that MapR's team released a very good version of its Hadoop
>>>>>> distribution focused on HBase called M7. You can see its overview here:
>>>>>> http://www.mapr.com/products/mapr-editions/m7-edition
>>>>>> 
>>>>>> But this release was under beta testing, and I see that it's not included
>>>>>> in the Amazon Marketplace yet:
>>>>>> https://aws.amazon.com/marketplace/seller-profile?id=802b0a25-877e-4b57-9007-a3fd284815a5
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 2013/5/7 Pal Konyves <paul.kony...@gmail.com>
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Has anyone got some recommendations about running HBase on EC2? I am
>>>>>>> testing it, and so far I am very disappointed with it. I did not change
>>>>>>> anything about the default 'Amazon distribution' installation. It has one
>>>>>>> master node and two slave nodes, and write performance is around 2500
>>>>>>> small rows per sec at most, but I expected it to be way better. Oh, and
>>>>>>> this is with batch put operations with autocommit turned off, where each
>>>>>>> batch contains about 500-1000 rows... When I do it with autocommit, it
>>>>>>> does not even reach 1000 rows per sec.
>>>>>>> 
>>>>>>> All nodes were m1.large instances.
>>>>>>> 
>>>>>>> Any experiences, suggestions? Is it worth trying the RMap distribution
>>>>>>> instead of the Amazon one?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Pal
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Marcos Ortiz Valmaseda
>>>>>> Product Manager at PDVSA
>>>>>> http://about.me/marcosortiz
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Best regards,
>>>>> 
>>>>> - Andy
>>>>> 
>>>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>>>> (via Tom White)
>>
