[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently
[ https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045096#comment-16045096 ] Chen Liang edited comment on HDFS-11907 at 6/9/17 9:59 PM: --- Thanks [~arpitagarwal] for pointing out offline that, {{checkAvailableResources}} has more than one callers. Post v006 patch to move the measurement into this method to capture all the code paths. was (Author: vagarychen): Thanks [~arpitagarwal] for pointing out offline that, {{checkAvailableResources}} has more than one callers. Post v006 patch to move the measurement into this method to capture all the callers. > NameNodeResourceChecker should avoid calling df.getAvailable too frequently > --- > > Key: HDFS-11907 > URL: https://issues.apache.org/jira/browse/HDFS-11907 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang > Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, > HDFS-11907.003.patch, HDFS-11907.004.patch, HDFS-11907.005.patch, > HDFS-11907.006.patch > > > Currently, {{HealthMonitor#doHealthChecks}} invokes > {{NameNode#monitorHealth}} which ends up invoking > {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per > second by default. And NameNodeResourceChecker#isResourceAvailable invokes > {{df.getAvailable();}} every time it is called. > Since available space information should rarely be changing dramatically at > the pace of per second. A cached value should be sufficient. i.e. only try to > get the updated value when the cached value is too old. otherwise simply > return the cached value. This way df.getAvailable() gets invoked less. > Thanks [~arpitagarwal] for the offline discussion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently
[ https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044987#comment-16044987 ] Chen Liang edited comment on HDFS-11907 at 6/9/17 8:37 PM: --- Thanks [~andrew.wang] for the comments. Based on Andrew's comments, instead of making the change to cache the value, post v005 patch to add a metric to keep track of available resource check time. was (Author: vagarychen): Thanks [~andrew.wang] for the comments. Based on Andrew's comments, instead of making the cache the value, post v005 patch to add a metric to keep track of available resource check time. > NameNodeResourceChecker should avoid calling df.getAvailable too frequently > --- > > Key: HDFS-11907 > URL: https://issues.apache.org/jira/browse/HDFS-11907 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang > Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, > HDFS-11907.003.patch, HDFS-11907.004.patch, HDFS-11907.005.patch > > > Currently, {{HealthMonitor#doHealthChecks}} invokes > {{NameNode#monitorHealth}} which ends up invoking > {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per > second by default. And NameNodeResourceChecker#isResourceAvailable invokes > {{df.getAvailable();}} every time it is called. > Since available space information should rarely be changing dramatically at > the pace of per second. A cached value should be sufficient. i.e. only try to > get the updated value when the cached value is too old. otherwise simply > return the cached value. This way df.getAvailable() gets invoked less. > Thanks [~arpitagarwal] for the offline discussion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently
[ https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043335#comment-16043335 ] Arpit Agarwal edited comment on HDFS-11907 at 6/8/17 8:10 PM: -- Hi [~andrew.wang], you are right that it's expected to be a cheap call, but calling it once per second per volume seems excessive. Do you see any benefit to querying {{df}} once per second? We can make the caching interval configurable and leave the default at 1 second if you prefer. This is not the same as changing the health check interval as Chen mentioned. Keeping the health check interval at 1 second lets us detect process failure faster and we don't want to change that. Also the v4 patch has a couple of issues I missed earlier. [~vagarychen] can you please take a look at these? # availableSpace and availableSpaceTimeStamp should be members of checkedVolume. # The test case failure in TestNameNodeResourceChecker needs to be addressed. An easy fix is to check all volumes instead of trying to query a specific one. was (Author: arpitagarwal): Hi [~andrew.wang], you are right that it's expected to be a cheap call, but calling it once per second per volume seems excessive. Do you see any benefit to querying {{df}} once per second? We can make the caching interval configurable and leave the default at 1 second if you prefer. This is not the same as changing the health check interval as Chen mentioned. Keeping the health check interval at 1 second lets us detect process failure faster and we don't want to change that. Also the v4 patch has a couple of issues I missed earlier. # availableSpace and availableSpaceTimeStamp should be members of checkedVolume. # The test case failure in TestNameNodeResourceChecker needs to be addressed. An easy fix is to check all volumes instead of trying to query a specific one. > NameNodeResourceChecker should avoid calling df.getAvailable too frequently > --- > > Key: HDFS-11907 > URL: https://issues.apache.org/jira/browse/HDFS-11907 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang > Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, > HDFS-11907.003.patch, HDFS-11907.004.patch > > > Currently, {{HealthMonitor#doHealthChecks}} invokes > {{NameNode#monitorHealth}} which ends up invoking > {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per > second by default. And NameNodeResourceChecker#isResourceAvailable invokes > {{df.getAvailable();}} every time it is called. > Since available space information should rarely be changing dramatically at > the pace of per second. A cached value should be sufficient. i.e. only try to > get the updated value when the cached value is too old. otherwise simply > return the cached value. This way df.getAvailable() gets invoked less. > Thanks [~arpitagarwal] for the offline discussion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently
[ https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035112#comment-16035112 ] Chen Liang edited comment on HDFS-11907 at 6/2/17 5:51 PM: --- Thanks [~andrew.wang] for the reply! df does seem to be a fairly cheap operation in general, but we've seen cases where we suspect it was this call being slow under certain conditions, which we are still doing analysis. About changing monitorHealth check interval, since we still want ZKFC process to try to contact NameNode frequently enough to detect process failures ASAP, we probably don't want to lower the frequency from caller's side. was (Author: vagarychen): Thanks [~andrew.wang] for the reply! df does seem to be a fairly cheap operation in general, but we've seen cases where we suspect it was this call being slow under certain conditions, which we are still doing analysis. About changing monitorHealth check interval, since we still want ZKFC process to try to contact NameNode frequently enough to detect failures ASAP, we probably don't want to lower the frequency from caller's side. > NameNodeResourceChecker should avoid calling df.getAvailable too frequently > --- > > Key: HDFS-11907 > URL: https://issues.apache.org/jira/browse/HDFS-11907 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang > Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, > HDFS-11907.003.patch, HDFS-11907.004.patch > > > Currently, {{HealthMonitor#doHealthChecks}} invokes > {{NameNode#monitorHealth}} which ends up invoking > {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per > second by default. And NameNodeResourceChecker#isResourceAvailable invokes > {{df.getAvailable();}} every time it is called. > Since available space information should rarely be changing dramatically at > the pace of per second. A cached value should be sufficient. i.e. only try to > get the updated value when the cached value is too old. otherwise simply > return the cached value. This way df.getAvailable() gets invoked less. > Thanks [~arpitagarwal] for the offline discussion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently
[ https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035112#comment-16035112 ] Chen Liang edited comment on HDFS-11907 at 6/2/17 5:51 PM: --- Thanks [~andrew.wang] for the reply! df does seem to be a fairly cheap operation in general, but we've seen cases where we suspect it was this call being slow under certain conditions, which we are still doing analysis. About changing monitorHealth check interval, since we still want ZKFC process to try to contact NameNode frequently enough to detect process crash ASAP, we probably don't want to lower the frequency from caller's side. was (Author: vagarychen): Thanks [~andrew.wang] for the reply! df does seem to be a fairly cheap operation in general, but we've seen cases where we suspect it was this call being slow under certain conditions, which we are still doing analysis. About changing monitorHealth check interval, since we still want ZKFC process to try to contact NameNode frequently enough to detect process failures ASAP, we probably don't want to lower the frequency from caller's side. > NameNodeResourceChecker should avoid calling df.getAvailable too frequently > --- > > Key: HDFS-11907 > URL: https://issues.apache.org/jira/browse/HDFS-11907 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang > Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, > HDFS-11907.003.patch, HDFS-11907.004.patch > > > Currently, {{HealthMonitor#doHealthChecks}} invokes > {{NameNode#monitorHealth}} which ends up invoking > {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per > second by default. And NameNodeResourceChecker#isResourceAvailable invokes > {{df.getAvailable();}} every time it is called. > Since available space information should rarely be changing dramatically at > the pace of per second. A cached value should be sufficient. i.e. only try to > get the updated value when the cached value is too old. otherwise simply > return the cached value. This way df.getAvailable() gets invoked less. > Thanks [~arpitagarwal] for the offline discussion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently
[ https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032274#comment-16032274 ] Arpit Agarwal edited comment on HDFS-11907 at 6/1/17 12:48 AM: --- Thanks for updating the patch [~vagarychen]. A few minor comments: # Typo: availabeSpaceTimeStamp --> availableSpaceTimeStamp (also the getter). # Typo: DF_STALE_THREASHOLD_MS --> DF_STALE_THRESHOLD_MS # Looks like the two threshold values were swapped from v2 to v3 patch. I think the v2 values were better. # getAvailabeSpaceTimeStamp can be package private. Also the new constructor. # in the test case, can you add an assert for the {{// first call guarantees an update}} e.g. {code} // first call guarantees an update long val0 = nb.getAvailabeSpaceTimeStamp(); volume.isResourceAvailable(); long val1 = nb.getAvailabeSpaceTimeStamp(); assertEquals(val0 + 5000, val1); timer.advance(2000); {code} Looks good otherwise! was (Author: arpitagarwal): Thanks for updating the patch [~vagarychen]. A few minor comments: # Typo: availabeSpaceTimeStamp --> availableSpaceTimeStamp # Typo: DF_STALE_THREASHOLD_MS --> DF_STALE_THRESHOLD_MS # Looks like the two threshold values were swapped from v2 to v3 patch. I think the v2 values were better. # getAvailabeSpaceTimeStamp can be package private. Also the new constructor. # in the test case, can you add an assert for the {{// first call guarantees an update}} e.g. {code} // first call guarantees an update long val0 = nb.getAvailabeSpaceTimeStamp(); volume.isResourceAvailable(); long val1 = nb.getAvailabeSpaceTimeStamp(); assertEquals(val0 + 5000, val1); timer.advance(2000); {code} Looks good otherwise! > NameNodeResourceChecker should avoid calling df.getAvailable too frequently > --- > > Key: HDFS-11907 > URL: https://issues.apache.org/jira/browse/HDFS-11907 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang > Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, > HDFS-11907.003.patch > > > Currently, {{HealthMonitor#doHealthChecks}} invokes > {{NameNode#monitorHealth}} which ends up invoking > {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per > second by default. And NameNodeResourceChecker#isResourceAvailable invokes > {{df.getAvailable();}} every time it is called. > Since available space information should rarely be changing dramatically at > the pace of per second. A cached value should be sufficient. i.e. only try to > get the updated value when the cached value is too old. otherwise simply > return the cached value. This way df.getAvailable() gets invoked less. > Thanks [~arpitagarwal] for the offline discussion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently
[ https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032251#comment-16032251 ] Chen Liang edited comment on HDFS-11907 at 6/1/17 12:24 AM: Thanks [~hanishakoneru] for the review, and thanks [~arpitagarwal] for the comments! all addressed in v003 patch. was (Author: vagarychen): Thanks [~arpitagarwal] for the comments! all addressed in v003 patch. > NameNodeResourceChecker should avoid calling df.getAvailable too frequently > --- > > Key: HDFS-11907 > URL: https://issues.apache.org/jira/browse/HDFS-11907 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang > Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, > HDFS-11907.003.patch > > > Currently, {{HealthMonitor#doHealthChecks}} invokes > {{NameNode#monitorHealth}} which ends up invoking > {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per > second by default. And NameNodeResourceChecker#isResourceAvailable invokes > {{df.getAvailable();}} every time it is called. > Since available space information should rarely be changing dramatically at > the pace of per second. A cached value should be sufficient. i.e. only try to > get the updated value when the cached value is too old. otherwise simply > return the cached value. This way df.getAvailable() gets invoked less. > Thanks [~arpitagarwal] for the offline discussion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently
[ https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032154#comment-16032154 ] Arpit Agarwal edited comment on HDFS-11907 at 5/31/17 11:22 PM: Thanks for this improvement Chen! A few comments: # We should use Time#monotonicNow instead of System#currentTimeMillis in both files. Time#monotonicNow also returns a millisecond value, but it is guaranteed to be monotonically increasing. # Instead of initializing availableSpeceTimeStamp to zero, we should initialize it to (Time#monotonicNow - 5000) since 0 can be a valid timestamp returned by nanoTime. # You can also log the IP address of the client that issued the request to aid debugging. It can be retrieved in an RPC call by calling Server.getRemoteIp()* -Server.getIpAddr()- # Typo: availabeSpeceTimeStamp --> availableSpaceTimeStamp. # Let's Replace 5000 and 3000 with static final ints. # See if you can write an isolated unit test for NameNodeResourceChecker. e.g. the first call to isResourceAvailable should update availableSpaceTimeStamp, subsequent calls immediately should not. Then if you advance the timer (see FakeTimer) and call isResourceAvailable again, availableSpaceTimeStamp should be updated. *thanks for the correction Chen. was (Author: arpitagarwal): Thanks for this improvement Chen! A few comments: # We should use Time#monotonicNow instead of System#currentTimeMillis in both files. Time#monotonicNow also returns a millisecond value, but it is guaranteed to be monotonically increasing. # Instead of initializing availableSpeceTimeStamp to zero, we should initialize it to (Time#monotonicNow - 5000) since 0 can be a valid timestamp returned by nanoTime. # You can also log the IP address of the client that issued the request to aid debugging. It can be retrieved in an RPC call by calling Server.getIpAddr(). # Typo: availabeSpeceTimeStamp --> availableSpaceTimeStamp. # Let's Replace 5000 and 3000 with static final ints. # See if you can write an isolated unit test for NameNodeResourceChecker. e.g. the first call to isResourceAvailable should update availableSpaceTimeStamp, subsequent calls immediately should not. Then if you advance the timer (see FakeTimer) and call isResourceAvailable again, availableSpaceTimeStamp should be updated. > NameNodeResourceChecker should avoid calling df.getAvailable too frequently > --- > > Key: HDFS-11907 > URL: https://issues.apache.org/jira/browse/HDFS-11907 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang > Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch > > > Currently, {{HealthMonitor#doHealthChecks}} invokes > {{NameNode#monitorHealth}} which ends up invoking > {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per > second by default. And NameNodeResourceChecker#isResourceAvailable invokes > {{df.getAvailable();}} every time it is called. > Since available space information should rarely be changing dramatically at > the pace of per second. A cached value should be sufficient. i.e. only try to > get the updated value when the cached value is too old. otherwise simply > return the cached value. This way df.getAvailable() gets invoked less. > Thanks [~arpitagarwal] for the offline discussion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently
[ https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031765#comment-16031765 ] Chen Liang edited comment on HDFS-11907 at 5/31/17 7:13 PM: Post v001 patch to only call {{availableSpace = df.getAvailable()}} to update the value at most every 5 seconds. was (Author: vagarychen): Post v001 patch to only update call {{availableSpace = df.getAvailable()}} at most every 5 seconds. > NameNodeResourceChecker should avoid calling df.getAvailable too frequently > --- > > Key: HDFS-11907 > URL: https://issues.apache.org/jira/browse/HDFS-11907 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang > Attachments: HDFS-11907.001.patch > > > Currently, {{HealthMonitor#doHealthChecks}} invokes > {{NameNode#monitorHealth}} which ends up invoking > {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per > second by default. And NameNodeResourceChecker#isResourceAvailable invokes > {{df.getAvailable();}} every time it is called. > Since available space information should rarely be changing dramatically at > the pace of per second. A cached value should be sufficient. i.e. only try to > get the updated value when the cached value is too old. otherwise simply > return the cached value. This way df.getAvailable() gets invoked less. > Thanks [~arpitagarwal] for the offline discussion. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org