[ 
https://issues.apache.org/jira/browse/HDFS-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485649#comment-15485649
 ] 

Erik Krogen edited comment on HDFS-10475 at 9/12/16 11:47 PM:
--------------------------------------------------------------

To get a mapping of operation -> lock time metrics we propose the following:
1. Move the logging/metrics logic into FSNamesystemLock rather than 
FSNamesystem to centralize logic and tracking. 
2. Add new methods, {{(read|write)Unlock(operation)}}, which take a name for 
the current operation as you unlock (for metrics collection the name is only 
needed on unlock). If no operation is specified, a catch-all 'default' or 
'other' operation would be used. We would manually add the operation name to 
the unlock call for those operations which we think are likely to contribute 
significantly to the overall lock hold time. This has to be manual because 
otherwise we would need to capture a stack trace (to find the method name) on 
each call to {{unlock}}, which may be prohibitively expensive.
3. Add a map of OperationName -> MutableRate metrics to FSNamesystemLock, all 
of which are also registered in a MetricsRegistry. Each time a lock is 
released we look up the corresponding MutableRate and add a value for the lock 
hold time. We do not use the map within MetricsRegistry because it is 
synchronized and we do not want contention on this map to cause slowness around 
the FSNamesystem lock. 
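
As a rough illustration of points 2 and 3, the sketch below shows one way an operation-named unlock could attribute hold time to a per-operation rate. It is not the actual FSNamesystemLock code: {{HoldTimeRate}} is a hypothetical stand-in for MutableRate, {{NamedUnlockSketch}} and the string operation names are made up for the example, and only the write path is shown (the read path would additionally need a per-thread start time, since several readers can hold the lock at once).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a MutableRate: counts samples and sums their values. */
class HoldTimeRate {
  private final AtomicLong samples = new AtomicLong();
  private final AtomicLong totalNanos = new AtomicLong();

  void add(long nanos) {
    samples.incrementAndGet();
    totalNanos.addAndGet(nanos);
  }

  long samples() { return samples.get(); }
  long totalNanos() { return totalNanos.get(); }
}

/** Sketch of a lock wrapper that attributes write-lock hold time to a named operation. */
class NamedUnlockSketch {
  static final String OTHER = "other"; // catch-all bucket (assumed name)

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, HoldTimeRate> rates;
  // Written and read only by the thread that holds the write lock.
  private long writeLockAcquiredNanos;

  NamedUnlockSketch(String... operations) {
    Map<String, HoldTimeRate> m = new HashMap<String, HoldTimeRate>();
    m.put(OTHER, new HoldTimeRate());
    for (String op : operations) {
      m.put(op, new HoldTimeRate());
    }
    rates = Collections.unmodifiableMap(m); // never modified after construction
  }

  void writeLock() {
    lock.writeLock().lock();
    if (lock.getWriteHoldCount() == 1) { // start the clock on the outermost acquisition
      writeLockAcquiredNanos = System.nanoTime();
    }
  }

  /** Unlock without naming an operation; the time goes to the catch-all bucket. */
  void writeUnlock() {
    writeUnlock(OTHER);
  }

  /** Unlock and record the hold time under the given operation name. */
  void writeUnlock(String operation) {
    boolean outermost = lock.getWriteHoldCount() == 1;
    long heldNanos = System.nanoTime() - writeLockAcquiredNanos;
    lock.writeLock().unlock();
    if (outermost) { // record after releasing, so metrics work is outside the lock
      HoldTimeRate rate = rates.get(operation);
      (rate != null ? rate : rates.get(OTHER)).add(heldNanos);
    }
  }

  /** Returns the rate for an operation, or null if the name was never registered. */
  HoldTimeRate rate(String operation) {
    return rates.get(operation);
  }
}
```

A caller would pair {{writeLock()}} with {{writeUnlock("mkdirs")}} in the usual try/finally pattern; unknown or omitted names fall through to the "other" bucket, so existing call sites keep working unchanged.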

Choosing the best type of map to hold the MutableRate metrics within 
FSNamesystemLock is tricky. Ideally we would use a Java 8 ConcurrentHashMap, 
using {{computeIfAbsent}} to create new MutableRate metrics objects and insert 
them into the registry whenever a new operation is encountered. However, this 
functionality is not available in Java 7 and we would like to support older 
versions. Thus we propose using a regular HashMap (wrapped within a call to 
{{Collections.unmodifiableMap}}) which is initialized with all of the different 
operations at the time the FSNamesystemLock is created. Because the map is 
never modified after construction, it can be read without locking, but this 
requires that we have a list of all the possible operations up front. So we 
suggest an Enum, e.g. FSNamesystemLockMetricOp, which lists all of the 
operations that can be supplied to the {{(read|write)Unlock}} calls. This would 
probably be a few dozen operations that we expect to be relatively expensive 
lock holders. Operations not listed in this Enum would be recorded as 
"other"/"default". 
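
To make the tradeoff concrete, here is a minimal sketch contrasting the two map choices. All names are hypothetical: {{LockMetricOpSketch}} stands in for the proposed FSNamesystemLockMetricOp enum (its entries are invented for illustration), and {{RateStandIn}} stands in for MutableRate. Registration with a MetricsRegistry is left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical operation list; the real enum would hold a few dozen entries. */
enum LockMetricOpSketch { OTHER, MKDIRS, DELETE, GET_LISTING }

/** Hypothetical stand-in for MutableRate. */
class RateStandIn {
  final AtomicLong samples = new AtomicLong();
  final AtomicLong totalNanos = new AtomicLong();

  void add(long nanos) {
    samples.incrementAndGet();
    totalNanos.addAndGet(nanos);
  }
}

/** Java 8+ alternative: create a rate lazily, the first time an operation is seen. */
class LazyRates {
  private final ConcurrentHashMap<String, RateStandIn> rates = new ConcurrentHashMap<>();

  RateStandIn rateFor(String operation) {
    // Atomically inserts a new rate if the key is absent, else returns the existing one.
    return rates.computeIfAbsent(operation, op -> new RateStandIn());
  }
}

/** Java 7-compatible proposal: populate once from the enum, then only read. */
class PrepopulatedRates {
  // Final field assigned in the constructor, so other threads see a fully built map.
  private final Map<LockMetricOpSketch, RateStandIn> rates;

  PrepopulatedRates() {
    Map<LockMetricOpSketch, RateStandIn> m = new HashMap<>();
    for (LockMetricOpSketch op : LockMetricOpSketch.values()) {
      m.put(op, new RateStandIn());
    }
    rates = Collections.unmodifiableMap(m);
  }

  RateStandIn rateFor(LockMetricOpSketch op) {
    return rates.get(op); // plain read, no locking or mutation
  }
}
```

The lazy variant needs no up-front list of operations but depends on Java 8; the prepopulated variant works on Java 7 at the cost of maintaining the enum by hand.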

We believe this is the right tradeoff between granularity of metrics, 
performance, and developer effort, but it is certainly not ideal in terms of 
manual effort required. We would be interested to hear any other ideas about 
how to make the metrics collection require less manual intervention. 


> Adding metrics for long FSNamesystem read and write locks
> ---------------------------------------------------------
>
>                 Key: HDFS-10475
>                 URL: https://issues.apache.org/jira/browse/HDFS-10475
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Xiaoyu Yao
>            Assignee: Erik Krogen
>
> This is a follow-up to the comment on HADOOP-12916 and 
> [here|https://issues.apache.org/jira/browse/HDFS-9924?focusedCommentId=15310837&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15310837]
> to add more metrics and WARN/DEBUG logs for long FSD/FSN locking operations 
> on the namenode, similar to what we have for slow write/network WARN/metrics 
> on the datanode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
