[
https://issues.apache.org/jira/browse/HBASE-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070922#comment-13070922
]
Ming Ma commented on HBASE-4114:
--------------------------------
Thanks, Stack, Ted.
1. In the experiment table above, "total number of HDFS blocks that can be
retrieved locally by the region server" and "total number of HDFS blocks
for all HFiles" are both defined at the whole-cluster level. The external
program also calculates locality information per hfile, per region, and per
region server. It queries the HDFS namenode, and the calculation is independent
of any map reduce jobs.
2. As for how we can calculate this metric inside hbase, we can do it in two
steps: the first is to calculate the metric independent of map reduce jobs;
the second is to calculate it at the per map reduce job level.
3. Calculating the locality index, independent of map reduce jobs.
a. It will first be calculated at the hfile level { total # of HDFS blocks,
total # of local HDFS blocks }; the data then gets aggregated at the region
level, and finally at the region server level.
b. Impact on namenode. There are 2 RPC calls to the NN to get block info for
each hfile. If we assume 100 regions per RS, 10 hfiles per region, and 500 RSs,
we will have 1M RPC hits to the NN (500 x 100 x 10 x 2). Most of the time that
won't be an issue if we only calculate the hfile locality index when an hfile
is created or when a region is first loaded by the RS. But because HDFS can
still move HDFS blocks around without hbase knowing it, we still need to
refresh the value periodically.
c. The computation can be done in the RS or in HMaster. The RS seems better in
terms of design (only the store knows the HDFS path of an hfile; HMaster
doesn't) and extensibility (to calculate the locality index per map reduce
job). The locality index can be part of HServerLoad and RegionLoad for the load
balancer to use. The RS will rotate through all regions periodically in its
main thread. The calculation interval defined by
"hbase.regionserver.msginterval" might be too short for this scenario if we
want to minimize the load on the NN for a large cluster (20 NN RPCs per RS per
3 sec).
d. The locality index can be a new RS metric. We can also put it on table.jsp
for each region.
e. HRegionInfo is fairly static: it doesn't change over time, whereas the
locality index of a given region does. Maybe
ClusterStatus/HServerInfo/HServerLoad/RegionLoad are better places for it?
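
Items 3a and 3b above can be sketched as follows. This is only an illustration
in Python, not HBase code: the BlockCounts type, the roll_up and namenode_rpcs
helpers, and the sample data are all hypothetical, and the per-hfile block
counts are assumed to have already been fetched from the NN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockCounts:
    """Hypothetical per-hfile record: total HDFS blocks and local HDFS blocks."""
    total: int
    local: int

    def __add__(self, other: "BlockCounts") -> "BlockCounts":
        return BlockCounts(self.total + other.total, self.local + other.local)

    @property
    def locality_index(self) -> float:
        # local blocks / total blocks; 0.0 when there are no blocks at all.
        return self.local / self.total if self.total else 0.0


def roll_up(counts):
    """Sum block counts. Ratios are never averaged, so larger files weigh more."""
    return sum(counts, BlockCounts(0, 0))


def namenode_rpcs(region_servers, regions_per_rs, hfiles_per_region, rpcs_per_hfile=2):
    """NN RPCs needed for one full pass over every hfile in the cluster."""
    return region_servers * regions_per_rs * hfiles_per_region * rpcs_per_hfile


# Made-up sample: one region server hosting two regions.
region_server = {
    "region-a": [BlockCounts(total=4, local=4), BlockCounts(total=6, local=3)],
    "region-b": [BlockCounts(total=10, local=1)],
}

# hfile level -> region level -> region server level
per_region = {name: roll_up(hfiles) for name, hfiles in region_server.items()}
per_server = roll_up(per_region.values())

for name, counts in per_region.items():
    print(name, f"{counts.locality_index:.0%}")
print("region server", f"{per_server.locality_index:.0%}")
print("NN RPCs per full pass:", namenode_rpcs(500, 100, 10))
```

With this sample data the two regions come out at 7/10 and 1/10, and the
region server at 8/20 = 40%. Summing counts at each level, rather than
averaging the per-hfile ratios, keeps the region server value consistent with
the cluster-level definition in the issue description.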
4. Locality index calculation for scan / map reduce job.
a. The original scenario is for full table scans only. Providing an accurate
locality index for an arbitrary scan / map reduce job could be tricky, given
that i) a map reduce job can have start/end keys and filters such as a time
range; ii) the block cache can be used, so an hfile shouldn't be counted when
there is a cache hit; iii) the data volume read from an HDFS block is also a
factor: reading a smaller buffer is different from reading a bigger one.
b. One useful scenario is finding out why map jobs sometimes run slower. For
that, it would be useful to have the metric available on the map reduce job
status page. We can estimate it by using the ganglia page to get the locality
index value for the RSs at the time the map reduce job ran.
c. To provide more accurate data, we can modify TableInputFormat to a) call
HBaseAdmin.getClusterStatus to get the locality index info for each region, and
b) calculate the intersection between the scan specification and ClusterStatus
based on key range as well as column family. It isn't 100% accurate, but it
might be good enough.
d. To be really accurate, the region server needs to return the locality index
for each scan operation to the client.
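
The key-range part of item 4c can be sketched as follows. This is a rough
Python illustration under stated assumptions, not HBase code: RegionLocality,
overlaps and estimate_scan_locality are hypothetical names, the per-region
block counts are assumed to be available from cluster status as proposed in
3c, and the column family dimension is left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RegionLocality:
    """Hypothetical per-region record a client might build from cluster status."""
    start_key: bytes   # inclusive; b"" means "from the beginning of the table"
    end_key: bytes     # exclusive; b"" means "to the end of the table"
    total_blocks: int
    local_blocks: int


def overlaps(region: RegionLocality, scan_start: bytes, scan_stop: bytes) -> bool:
    """True if the scan range [scan_start, scan_stop) intersects the region."""
    starts_before_region_ends = region.end_key == b"" or scan_start < region.end_key
    stops_after_region_starts = scan_stop == b"" or scan_stop > region.start_key
    return starts_before_region_ends and stops_after_region_starts


def estimate_scan_locality(
    regions: Iterable[RegionLocality],
    scan_start: bytes = b"",
    scan_stop: bytes = b"",
) -> Optional[float]:
    """Locality index over all regions the scan touches, or None if it touches none.

    Whole-region counts are used even when the scan covers only part of a
    region, so this is an estimate rather than an exact value.
    """
    total = local = 0
    for region in regions:
        if overlaps(region, scan_start, scan_stop):
            total += region.total_blocks
            local += region.local_blocks
    return local / total if total else None


# Made-up table with three regions.
regions = [
    RegionLocality(b"", b"g", total_blocks=10, local_blocks=9),
    RegionLocality(b"g", b"p", total_blocks=10, local_blocks=5),
    RegionLocality(b"p", b"", total_blocks=20, local_blocks=2),
]

full_table = estimate_scan_locality(regions)
partial = estimate_scan_locality(regions, scan_start=b"a", scan_stop=b"h")
print(full_table, partial)
```

In this sample the full table scan covers all three regions (16 of 40 blocks
local), while the scan from "a" to "h" touches only the first two (14 of 20).
Filters, block cache hits and read sizes from 4a are not modelled, which is
why the result is only an approximation.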
> Metrics for HFile HDFS block locality
> -------------------------------------
>
> Key: HBASE-4114
> URL: https://issues.apache.org/jira/browse/HBASE-4114
> Project: HBase
> Issue Type: Improvement
> Components: metrics, regionserver
> Reporter: Ming Ma
> Assignee: Ming Ma
>
> Normally, when we put hbase and HDFS in the same cluster ( e.g., region
> server runs on the datanode ), we have reasonably good data locality, as
> explained by Lars. Work has also been done by Jonathan to address the startup
> situation.
> There are scenarios where regions can be on a different machine from the
> machines that hold the underlying HFile blocks, at least for some period of
> time. This will have a performance impact on whole-table scan operations and
> map reduce jobs during that time.
> 1. After the load balancer moves a region and before a compaction on that
> region (which generates the HFile on the new region server), HDFS blocks can
> be remote.
> 2. When a machine is added or removed, HBase's region assignment
> policy is different from HDFS's block reassignment policy.
> 3. Even if there is not much hbase activity, HDFS can rebalance HFile
> blocks as other non-hbase applications push other data to HDFS.
> A lot has been or will be done in the load balancer, as summarized by Ted. I
> am curious whether HFile HDFS block locality should be used as another factor
> here. I have done some experiments on how HDFS block locality can impact map
> reduce latency. First we need to define a metric to measure HFile data
> locality.
> Metric definition:
> For a given table, or a region server, or a region, we can define the
> following. The higher the value, the more local the HFiles are from the
> region server's point of view.
> HFile locality index = ( Total number of HDFS blocks that can be retrieved
> locally by the region server ) / ( Total number of HDFS blocks for all HFiles
> )
> Test Results:
> This shows how HFile locality can impact latency. It is based on a
> table with 1M rows, 36KB per row; regions are evenly balanced. The map
> job is RowCounter.
> HFile Locality Index    Map job latency (in sec)
> 28%                     157
> 36%                     150
> 47%                     142
> 61%                     133
> 73%                     122
> 89%                     103
> 99%                     95
> So the first suggestion is to expose the HFile locality index as a new region
> server metric. It would be ideal if we could somehow measure the HFile
> locality index at the per map job level.
> Whether and when we should include it as another factor for the load balancer
> is a separate work item. It is unclear how the load balancer can take
> various factors into account to come up with the best balancing strategy.