Hi Esteban,

Thanks for sharing your ideas.

We are on HBase 0.96 and Java 1.6. I have enabled short-circuit reads,
and the heap size is around 16G for each region server. We have about
20 of them.

The list of rowkeys that I need to process is about 10M. I am already
using batch gets, with a batch size of ~2000 gets.
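For concreteness, a minimal sketch of the check-then-put step with batched
gets against the 0.96 client API; the "f"/"q" column names and the
placeholder value are illustrative, not our real schema:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedCheckThenPut {
        // Checks one batch of rowkeys with a single multi-get and creates
        // the rows that are missing. "f"/"q" are placeholder column names.
        static void processBatch(HTable table, List<byte[]> rowkeys)
                throws IOException {
            List<Get> gets = new ArrayList<Get>(rowkeys.size());
            for (byte[] row : rowkeys) {
                gets.add(new Get(row));
            }
            Result[] results = table.get(gets);  // one batched read call
            List<Put> puts = new ArrayList<Put>();
            for (int i = 0; i < results.length; i++) {
                if (results[i].isEmpty()) {      // row not present yet
                    Put p = new Put(rowkeys.get(i));
                    p.add(Bytes.toBytes("f"), Bytes.toBytes("q"),
                          Bytes.toBytes("1"));
                    puts.add(p);
                }
            }
            if (!puts.isEmpty()) {
                table.put(puts);                 // batched write, missing rows
            }
        }
    }

Note that the window between the get and the put is not atomic; if another
writer can race on the same rows, checkAndPut (sketched below) is the safe
option.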

thomas

On Thu, Aug 14, 2014 at 11:01 AM, Esteban Gutierrez
<este...@cloudera.com> wrote:
> Hello Thomas,
>
> What version of HBase are you using? Sorting and grouping the rows based on
> their regions is going to help for sure. I don't think you should focus too
> much on the locality side of the problem unless your HDFS input set is very
> large (100s or 1000s of MBs per task); otherwise it might be faster to load
> the input dataset in memory and do the batched calls. As discussed on this
> mailing list recently, there are many factors that can affect performance:
> number of threads or tasks, row size, RS resources, configuration, etc., so
> any additional info would be very helpful.
>
> cheers,
> esteban.
>
> --
> Cloudera, Inc.
>
> On Thu, Aug 14, 2014 at 10:32 AM, Thomas Kwan <thomas.k...@manage.com>
> wrote:
>
>> Hi there
>>
>> I have a use case where I need to do a read to check whether an HBase
>> entry is present, and then do a put to create the entry when it is not
>> there.
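Worth noting: for this create-if-absent pattern, HBase also offers
checkAndPut, which folds the existence check and the put into one atomic
server-side call; passing null as the expected value turns the check into
a non-existence test. It costs one RPC per row, so it trades batching for
atomicity. A minimal sketch, again with hypothetical column names:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreateIfAbsent {
        // Creates the row only if column f:q does not exist yet. A null
        // expected value makes checkAndPut test for non-existence.
        static boolean createIfAbsent(HTable table, byte[] row)
                throws IOException {
            Put put = new Put(row);
            put.add(Bytes.toBytes("f"), Bytes.toBytes("q"),
                    Bytes.toBytes("1"));
            return table.checkAndPut(row, Bytes.toBytes("f"),
                    Bytes.toBytes("q"), null, put);  // true if applied
        }
    }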
>>
>> I have a script that gets a list of rowkeys from Hive and puts them in an
>> HDFS directory. Then an MR job reads the rowkeys and does batch reads. I
>> am getting around 1.5K requests per second.
>>
>> To attempt to make this faster, I am wondering if I can
>>
>> - sort and group the rowkeys based on regions
>> - make the MR jobs run on regions that have the data locally
>>
>> Scan or TableInputFormat must have some code that does something similar,
>> right?
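For what it's worth, TableInputFormat does something close to this: it
creates one split per region and reports the region's host as the split's
preferred location. Outside of TableInputFormat, the same bucketing can be
sketched with the client API; the helper below is illustrative and assumes
the 0.96 HTable.getRegionLocation call:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GroupByRegion {
        // Buckets rowkeys by the start key of the hosting region, so each
        // bucket can go to one task and each multi-get stays in one region.
        static Map<String, List<byte[]>> groupByRegion(HTable table,
                List<byte[]> rowkeys) throws IOException {
            Map<String, List<byte[]>> buckets =
                    new TreeMap<String, List<byte[]>>();
            for (byte[] row : rowkeys) {
                HRegionLocation loc = table.getRegionLocation(row);
                String start = Bytes.toStringBinary(
                        loc.getRegionInfo().getStartKey());
                List<byte[]> bucket = buckets.get(start);
                if (bucket == null) {
                    bucket = new ArrayList<byte[]>();
                    buckets.put(start, bucket);
                }
                bucket.add(row);
            }
            return buckets;
        }
    }

HRegionLocation.getHostname() gives the region server for each bucket,
which could serve as a locality hint when placing the MR tasks.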
>>
>> thanks
>> thomas
>>
