Benchmarking and improvement of HBase's performance for a common bulk data workload

Atri Sharma Fri, 26 Apr 2013 22:21:20 -0700

Hi all,

I have been discussing the following style of workload with Priyank
sir, and whether we can improve HBase's performance in this area.
The use case is as follows:


1) Bulk load data.
2) Query the data multiple times (mostly read access, and no real-time writes).

This is a common workload, and I am quite interested in benchmarking
HBase's performance in this area, as well as in improving it further.
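
To make the two phases concrete, here is a minimal sketch of what such a
benchmark would time: one bulk load, followed by several read-only passes.
It is only an illustration. InMemoryStore is a hypothetical stand-in and
not an HBase client, and run_benchmark, the row keys and the pass count are
all assumptions made for the example.

```python
import time


class InMemoryStore:
    """Hypothetical stand-in for a table; this is NOT an HBase client."""

    def __init__(self):
        self._rows = {}

    def bulk_load(self, rows):
        # Phase 1: load every row in one go.
        self._rows.update(rows)

    def get(self, key):
        # Phase 2: point read by row key.
        return self._rows.get(key)


def run_benchmark(store, rows, read_keys, passes=3):
    """Time one bulk load, then `passes` read-only sweeps over read_keys."""
    start = time.perf_counter()
    store.bulk_load(rows)
    load_seconds = time.perf_counter() - start

    read_seconds = []
    hits = 0
    for _ in range(passes):
        start = time.perf_counter()
        for key in read_keys:
            if store.get(key) is not None:
                hits += 1
        read_seconds.append(time.perf_counter() - start)

    return {
        "load_seconds": load_seconds,
        "read_seconds": read_seconds,
        "hits": hits,
    }


# Illustrative data: 10,000 rows loaded, the first 1,000 keys read 3 times.
rows = {f"row{i:06d}": f"value{i}" for i in range(10000)}
result = run_benchmark(InMemoryStore(), rows, list(rows)[:1000], passes=3)
```

Keeping the load time and the per-pass read times separate makes it easier
to see which phase a later change actually affects.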

Please advise me on how I can proceed with benchmarking. Specifically,
how should I set up an HBase cluster, and are there any specific
requirements of the cluster for this type of testing?


I worked on a patch to improve performance for a similar use case in
PostgreSQL. The case is pretty similar: bulk load of data, a large
number of mostly read-only queries, and then deletion of the data.

The optimization I targeted was the cost of writes to disk.
Specifically, setting flags (hint bits) to track the commit status of
the inserting/deleting transaction was causing a write overhead.
I tried to mitigate this with a cache that holds the transaction ID
for the workload described above, which avoids the cost of those
writes.
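
The caching idea can be sketched roughly as follows. This is a simplified
illustration and not the actual PostgreSQL patch: CommitStatusCache,
VisibilityChecker, the LRU eviction policy and the dictionary standing in
for the commit log are all assumptions made for the example. The point is
that after a bulk load many rows share one transaction ID, so a small cache
can answer most commit-status checks without writing a flag back to each
row.

```python
from collections import OrderedDict


class CommitStatusCache:
    """Small LRU cache: transaction ID -> committed (True/False)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._entries = OrderedDict()

    def lookup(self, xid):
        # Return the cached status, or None if the ID is not cached.
        if xid in self._entries:
            self._entries.move_to_end(xid)
            return self._entries[xid]
        return None

    def remember(self, xid, committed):
        self._entries[xid] = committed
        self._entries.move_to_end(xid)
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)


class VisibilityChecker:
    """Checks commit status through the cache instead of per-row flags."""

    def __init__(self, commit_log, cache):
        self.commit_log = commit_log  # stand-in for the authoritative log
        self.cache = cache
        self.log_lookups = 0          # counts the slow lookups

    def is_committed(self, xid):
        cached = self.cache.lookup(xid)
        if cached is not None:
            return cached
        # Cache miss: consult the log once, then remember the answer.
        self.log_lookups += 1
        committed = self.commit_log[xid]
        self.cache.remember(xid, committed)
        return committed


# 1,000 rows inserted by the same (committed) transaction 100.
checker = VisibilityChecker({100: True}, CommitStatusCache())
visible = [checker.is_committed(100) for _ in range(1000)]
```

In this example the log is consulted once for all 1,000 rows and no row is
modified during the reads.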

I will start benchmarking once I have the system set up and then start
thinking of tests. Once I have an outline in my mind, I shall post it
on the list.

I will need a lot of guidance from the community on this.

Thoughts/Comments/Advice please?

Regards,

Atri

--
Regards,

Atri
l'apprenant
