Hi all,
I have a general idea I'd like to get your feedback on. A short
description of the problem we are facing: during MapReduce jobs run
over an HBase cluster, we very often see large disparities in the run
times of different map tasks (some tasks finish in minutes or even
seconds, while others may take hours). This causes the job to run
inefficiently and the whole cluster to be underutilized - reducers have
to wait until all the map tasks finish, at least before starting the
sort phase. The number of long-running map tasks is usually low, so the
whole cluster basically waits until a few machines finish their work.
We tried to work around this by sampling the regions and creating some
statistics (one statistic per MapReduce job), which we then used to
tune the input format splits to make the distribution of running times
more even. This seems to work (although for the time being it may cause
some issues with data locality, which we think we can solve).
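To make the idea concrete, here is a minimal sketch of the kind of
split packing I mean (all class and method names are made up for
illustration): given a sorted, contiguous list of row ranges with
sampled weights, we greedily pack consecutive ranges into splits of
roughly equal total weight.

import java.util.ArrayList;
import java.util.List;

public class EvenSplitter {

  // One sampled row range [startRow, endRow) and its estimated scan cost.
  public static class RangeWeight {
    final byte[] startRow;
    final byte[] endRow;
    final long weight;
    public RangeWeight(byte[] startRow, byte[] endRow, long weight) {
      this.startRow = startRow;
      this.endRow = endRow;
      this.weight = weight;
    }
  }

  // A resulting input split boundary.
  public static class Split {
    final byte[] startRow;
    final byte[] endRow;
    public Split(byte[] startRow, byte[] endRow) {
      this.startRow = startRow;
      this.endRow = endRow;
    }
  }

  // Greedily packs consecutive ranges into numSplits splits, cutting a
  // new split whenever the accumulated weight reaches total / numSplits.
  // Assumes ranges is non-empty, sorted, and contiguous.
  public static List<Split> pack(List<RangeWeight> ranges, int numSplits) {
    long total = 0;
    for (RangeWeight r : ranges) {
      total += r.weight;
    }
    long target = Math.max(1, total / numSplits);

    List<Split> splits = new ArrayList<Split>();
    byte[] splitStart = ranges.get(0).startRow;
    long acc = 0;
    for (RangeWeight r : ranges) {
      acc += r.weight;
      if (acc >= target && splits.size() < numSplits - 1) {
        splits.add(new Split(splitStart, r.endRow));
        splitStart = r.endRow;
        acc = 0;
      }
    }
    splits.add(new Split(splitStart, ranges.get(ranges.size() - 1).endRow));
    return splits;
  }
}

In the real input format, the resulting (startRow, endRow) pairs would
be turned into TableSplits, ideally with locations chosen to preserve
data locality where possible.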
Now, the question is: would it be possible to calculate some statistics
during major compactions and store them in the region directory on
HDFS? What I mean by these statistics: I think it would be possible to
store, for some reasonable ranges of rows (so that each region would
contain on the order of hundreds of such ranges):
* total number of rows within the range
* total number of KeyValues
* amount of data stored on disk
These statistics could be calculated per column family and then used in
an InputFormat to tune the splits to match an even distribution as
closely as possible.
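For concreteness, a per-range record could be as simple as the
following sketch (the field set mirrors the list above; the class name
and layout are just an illustration), serialized as a Hadoop Writable
so that hundreds of records per region can be written into a small
stats file under the region directory during major compaction:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Writable;

// One statistics record per row range and column family. Records are
// written sequentially; each range implicitly starts where the
// previous one ended.
public class RangeStats implements Writable {
  private byte[] endRow = new byte[0]; // exclusive end row of the range
  private long rowCount;               // total number of rows in the range
  private long keyValueCount;          // total number of KeyValues
  private long bytesOnDisk;            // amount of data stored on disk

  public void write(DataOutput out) throws IOException {
    Bytes.writeByteArray(out, endRow);
    out.writeLong(rowCount);
    out.writeLong(keyValueCount);
    out.writeLong(bytesOnDisk);
  }

  public void readFields(DataInput in) throws IOException {
    endRow = Bytes.readByteArray(in);
    rowCount = in.readLong();
    keyValueCount = in.readLong();
    bytesOnDisk = in.readLong();
  }
}

The InputFormat would then read these records for each region and run
the same kind of weight-based packing as in the sketch above, using
rowCount, keyValueCount, or bytesOnDisk (whichever correlates best with
the job's per-row cost) as the weight.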
Is anyone else interested in this? Does anyone have another solution to
the problem I have described? I know we could, say, manually split the
regions that take a long time to process, but first, these regions are
job-specific (different jobs have different slow regions), and second,
I'm ideally looking for an automated solution.
Thanks for any replies,
Jan