Hi all,
I have a general idea I'd like to get your feedback on. A short
description of the problem we are facing: during MapReduce jobs run
over an HBase cluster, we very often see large disparities in the run
times of different map tasks (some tasks finish in minutes or even
seconds, while others may take hours). This causes the job to run
inefficiently and the whole cluster to be underutilized - reducers have
to wait until all the map tasks finish, at least before starting the
sort phase. The number of long-running map tasks is usually low, so the
whole cluster basically waits until a few machines finish their work.
We tried to work around this by sampling the regions and creating some
statistics (one statistic per MapReduce job), which we then used to
tune the input format splits to make the distribution of running times
more even. This seems to work (although for the time being it may cause
some issues with data locality, which we think we can solve).
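To make the idea concrete, here is a minimal sketch of the kind of
split packing I mean (all class and method names are made up for
illustration): given a sorted, contiguous list of row ranges with
sampled weights, we greedily pack consecutive ranges into splits of
roughly equal total weight.

import java.util.ArrayList;
import java.util.List;

public class EvenSplitter {

  // One sampled row range [startRow, endRow) and its estimated scan cost.
  public static class RangeWeight {
    final byte[] startRow;
    final byte[] endRow;
    final long weight;
    public RangeWeight(byte[] startRow, byte[] endRow, long weight) {
      this.startRow = startRow;
      this.endRow = endRow;
      this.weight = weight;
    }
  }

  // A resulting input split boundary.
  public static class Split {
    final byte[] startRow;
    final byte[] endRow;
    public Split(byte[] startRow, byte[] endRow) {
      this.startRow = startRow;
      this.endRow = endRow;
    }
  }

  // Greedily packs consecutive ranges into numSplits splits, cutting a
  // new split whenever the accumulated weight reaches total / numSplits.
  // Assumes ranges is non-empty, sorted, and contiguous.
  public static List<Split> pack(List<RangeWeight> ranges, int numSplits) {
    long total = 0;
    for (RangeWeight r : ranges) {
      total += r.weight;
    }
    long target = Math.max(1, total / numSplits);

    List<Split> splits = new ArrayList<Split>();
    byte[] splitStart = ranges.get(0).startRow;
    long acc = 0;
    for (RangeWeight r : ranges) {
      acc += r.weight;
      if (acc >= target && splits.size() < numSplits - 1) {
        splits.add(new Split(splitStart, r.endRow));
        splitStart = r.endRow;
        acc = 0;
      }
    }
    splits.add(new Split(splitStart, ranges.get(ranges.size() - 1).endRow));
    return splits;
  }
}

In the real input format, the resulting (startRow, endRow) pairs would
be turned into TableSplits, ideally with locations chosen to preserve
data locality where possible.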
Now, the question is: would it be possible to calculate some statistics
during major compactions and store them in the region directory on
HDFS? What I mean by these statistics: I think it would be possible to
store, for some reasonable ranges of rows (so that each region would
contain on the order of hundreds of such ranges):
* total number of rows within the range
* total number of KeyValues
* amount of data stored on disk
These statistics could be calculated per column family and then used in
an InputFormat to tune the splits to match an even distribution as
closely as possible.
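For concreteness, a per-range record could be as simple as the
following sketch (the field set mirrors the list above; the class name
and layout are just an illustration), serialized as a Hadoop Writable
so that hundreds of records per region can be written into a small
stats file under the region directory during major compaction:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Writable;

// One statistics record per row range and column family. Records are
// written sequentially; each range implicitly starts where the
// previous one ended.
public class RangeStats implements Writable {
  private byte[] endRow = new byte[0]; // exclusive end row of the range
  private long rowCount;               // total number of rows in the range
  private long keyValueCount;          // total number of KeyValues
  private long bytesOnDisk;            // amount of data stored on disk

  public void write(DataOutput out) throws IOException {
    Bytes.writeByteArray(out, endRow);
    out.writeLong(rowCount);
    out.writeLong(keyValueCount);
    out.writeLong(bytesOnDisk);
  }

  public void readFields(DataInput in) throws IOException {
    endRow = Bytes.readByteArray(in);
    rowCount = in.readLong();
    keyValueCount = in.readLong();
    bytesOnDisk = in.readLong();
  }
}

The InputFormat would then read these records for each region and run
the same kind of weight-based packing as in the sketch above, using
rowCount, keyValueCount, or bytesOnDisk (whichever correlates best with
the job's per-row cost) as the weight.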
Is anyone else interested in this? Does anyone have another solution to
the problem I have described? I know we could, say, manually split the
regions that take a long time to process, but first, these regions are
job-specific (different jobs have different slow regions), and second,
I'm ideally looking for an automated solution.
Thanks for any replies,
Jan