[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently

2017-06-09 Thread Chen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045096#comment-16045096
 ] 

Chen Liang edited comment on HDFS-11907 at 6/9/17 9:59 PM:
---

Thanks [~arpitagarwal] for pointing out offline that, 
{{checkAvailableResources}} has more than one callers. Post v006 patch to move 
the measurement into this method to capture all the code paths.


was (Author: vagarychen):
Thanks [~arpitagarwal] for pointing out offline that, 
{{checkAvailableResources}} has more than one callers. Post v006 patch to move 
the measurement into this method to capture all the callers.

> NameNodeResourceChecker should avoid calling df.getAvailable too frequently
> ---
>
> Key: HDFS-11907
> URL: https://issues.apache.org/jira/browse/HDFS-11907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
> Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, 
> HDFS-11907.003.patch, HDFS-11907.004.patch, HDFS-11907.005.patch, 
> HDFS-11907.006.patch
>
>
> Currently, {{HealthMonitor#doHealthChecks}} invokes 
> {{NameNode#monitorHealth}} which ends up invoking 
> {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per 
> second by default. And NameNodeResourceChecker#isResourceAvailable invokes 
> {{df.getAvailable();}} every time it is called.
> Since available space information should rarely be changing dramatically at 
> the pace of per second. A cached value should be sufficient. i.e. only try to 
> get the updated value when the cached value is too old. otherwise simply 
> return the cached value. This way df.getAvailable() gets invoked less.
> Thanks [~arpitagarwal] for the offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently

2017-06-09 Thread Chen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044987#comment-16044987
 ] 

Chen Liang edited comment on HDFS-11907 at 6/9/17 8:37 PM:
---

Thanks [~andrew.wang] for the comments. Based on Andrew's comments, instead of 
making the change to cache the value, post v005 patch to add a metric to keep 
track of available resource check time.


was (Author: vagarychen):
Thanks [~andrew.wang] for the comments. Based on Andrew's comments, instead of 
making the cache the value, post v005 patch to add a metric to keep track of 
available resource check time.

> NameNodeResourceChecker should avoid calling df.getAvailable too frequently
> ---
>
> Key: HDFS-11907
> URL: https://issues.apache.org/jira/browse/HDFS-11907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
> Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, 
> HDFS-11907.003.patch, HDFS-11907.004.patch, HDFS-11907.005.patch
>
>
> Currently, {{HealthMonitor#doHealthChecks}} invokes 
> {{NameNode#monitorHealth}} which ends up invoking 
> {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per 
> second by default. And NameNodeResourceChecker#isResourceAvailable invokes 
> {{df.getAvailable();}} every time it is called.
> Since available space information should rarely be changing dramatically at 
> the pace of per second. A cached value should be sufficient. i.e. only try to 
> get the updated value when the cached value is too old. otherwise simply 
> return the cached value. This way df.getAvailable() gets invoked less.
> Thanks [~arpitagarwal] for the offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently

2017-06-08 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043335#comment-16043335
 ] 

Arpit Agarwal edited comment on HDFS-11907 at 6/8/17 8:10 PM:
--

Hi [~andrew.wang], you are right that it's expected to be a cheap call, but 
calling it once per second per volume seems excessive. Do you see any benefit 
to querying {{df}} once per second? We can make the caching interval 
configurable and leave the default at 1 second if you prefer.

This is not the same as changing the health check interval as Chen mentioned. 
Keeping the health check interval at 1 second lets us detect process failure 
faster and we don't want to change that.

Also the v4 patch has a couple of issues I missed earlier. [~vagarychen] can 
you please take a look at these?
# availableSpace and availableSpaceTimeStamp should be members of checkedVolume.
# The test case failure in TestNameNodeResourceChecker needs to be addressed. 
An easy fix is to check all volumes instead of trying to query a specific one.


was (Author: arpitagarwal):
Hi [~andrew.wang], you are right that it's expected to be a cheap call, but 
calling it once per second per volume seems excessive. Do you see any benefit 
to querying {{df}} once per second? We can make the caching interval 
configurable and leave the default at 1 second if you prefer.

This is not the same as changing the health check interval as Chen mentioned. 
Keeping the health check interval at 1 second lets us detect process failure 
faster and we don't want to change that.

Also the v4 patch has a couple of issues I missed earlier.
# availableSpace and availableSpaceTimeStamp should be members of checkedVolume.
# The test case failure in TestNameNodeResourceChecker needs to be addressed. 
An easy fix is to check all volumes instead of trying to query a specific one.

> NameNodeResourceChecker should avoid calling df.getAvailable too frequently
> ---
>
> Key: HDFS-11907
> URL: https://issues.apache.org/jira/browse/HDFS-11907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
> Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, 
> HDFS-11907.003.patch, HDFS-11907.004.patch
>
>
> Currently, {{HealthMonitor#doHealthChecks}} invokes 
> {{NameNode#monitorHealth}} which ends up invoking 
> {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per 
> second by default. And NameNodeResourceChecker#isResourceAvailable invokes 
> {{df.getAvailable();}} every time it is called.
> Since available space information should rarely be changing dramatically at 
> the pace of per second. A cached value should be sufficient. i.e. only try to 
> get the updated value when the cached value is too old. otherwise simply 
> return the cached value. This way df.getAvailable() gets invoked less.
> Thanks [~arpitagarwal] for the offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently

2017-06-02 Thread Chen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035112#comment-16035112
 ] 

Chen Liang edited comment on HDFS-11907 at 6/2/17 5:51 PM:
---

Thanks [~andrew.wang] for the reply!

df does seem to be a fairly cheap operation in general, but we've seen cases 
where we suspect it was this call being slow under certain conditions, which we 
are still doing analysis. About changing monitorHealth check interval, since we 
still want ZKFC process to try to contact NameNode frequently enough to detect 
process failures ASAP, we probably don't want to lower the frequency from 
caller's side.


was (Author: vagarychen):
Thanks [~andrew.wang] for the reply!

df does seem to be a fairly cheap operation in general, but we've seen cases 
where we suspect it was this call being slow under certain conditions, which we 
are still doing analysis. About changing monitorHealth check interval, since we 
still want ZKFC process to try to contact NameNode frequently enough to detect 
failures ASAP, we probably don't want to lower the frequency from caller's side.

> NameNodeResourceChecker should avoid calling df.getAvailable too frequently
> ---
>
> Key: HDFS-11907
> URL: https://issues.apache.org/jira/browse/HDFS-11907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
> Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, 
> HDFS-11907.003.patch, HDFS-11907.004.patch
>
>
> Currently, {{HealthMonitor#doHealthChecks}} invokes 
> {{NameNode#monitorHealth}} which ends up invoking 
> {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per 
> second by default. And NameNodeResourceChecker#isResourceAvailable invokes 
> {{df.getAvailable();}} every time it is called.
> Since available space information should rarely be changing dramatically at 
> the pace of per second. A cached value should be sufficient. i.e. only try to 
> get the updated value when the cached value is too old. otherwise simply 
> return the cached value. This way df.getAvailable() gets invoked less.
> Thanks [~arpitagarwal] for the offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently

2017-06-02 Thread Chen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035112#comment-16035112
 ] 

Chen Liang edited comment on HDFS-11907 at 6/2/17 5:51 PM:
---

Thanks [~andrew.wang] for the reply!

df does seem to be a fairly cheap operation in general, but we've seen cases 
where we suspect it was this call being slow under certain conditions, which we 
are still doing analysis. About changing monitorHealth check interval, since we 
still want ZKFC process to try to contact NameNode frequently enough to detect 
process crash ASAP, we probably don't want to lower the frequency from caller's 
side.


was (Author: vagarychen):
Thanks [~andrew.wang] for the reply!

df does seem to be a fairly cheap operation in general, but we've seen cases 
where we suspect it was this call being slow under certain conditions, which we 
are still doing analysis. About changing monitorHealth check interval, since we 
still want ZKFC process to try to contact NameNode frequently enough to detect 
process failures ASAP, we probably don't want to lower the frequency from 
caller's side.

> NameNodeResourceChecker should avoid calling df.getAvailable too frequently
> ---
>
> Key: HDFS-11907
> URL: https://issues.apache.org/jira/browse/HDFS-11907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
> Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, 
> HDFS-11907.003.patch, HDFS-11907.004.patch
>
>
> Currently, {{HealthMonitor#doHealthChecks}} invokes 
> {{NameNode#monitorHealth}} which ends up invoking 
> {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per 
> second by default. And NameNodeResourceChecker#isResourceAvailable invokes 
> {{df.getAvailable();}} every time it is called.
> Since available space information should rarely be changing dramatically at 
> the pace of per second. A cached value should be sufficient. i.e. only try to 
> get the updated value when the cached value is too old. otherwise simply 
> return the cached value. This way df.getAvailable() gets invoked less.
> Thanks [~arpitagarwal] for the offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently

2017-05-31 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032274#comment-16032274
 ] 

Arpit Agarwal edited comment on HDFS-11907 at 6/1/17 12:48 AM:
---

Thanks for updating the patch [~vagarychen]. A few minor comments:
# Typo: availabeSpaceTimeStamp --> availableSpaceTimeStamp (also the getter).
# Typo: DF_STALE_THREASHOLD_MS --> DF_STALE_THRESHOLD_MS
# Looks like the two threshold values were swapped from v2 to v3 patch. I think 
the v2 values were better.
# getAvailabeSpaceTimeStamp can be package private. Also the new constructor.
# in the test case, can you add an assert for the {{// first call guarantees an 
update}} e.g.
{code}
// first call guarantees an update
long val0 = nb.getAvailabeSpaceTimeStamp();
volume.isResourceAvailable();
long val1 = nb.getAvailabeSpaceTimeStamp();
assertEquals(val0 + 5000, val1);
timer.advance(2000);
{code}

Looks good otherwise!


was (Author: arpitagarwal):
Thanks for updating the patch [~vagarychen]. A few minor comments:
# Typo: availabeSpaceTimeStamp --> availableSpaceTimeStamp
# Typo: DF_STALE_THREASHOLD_MS --> DF_STALE_THRESHOLD_MS
# Looks like the two threshold values were swapped from v2 to v3 patch. I think 
the v2 values were better.
# getAvailabeSpaceTimeStamp can be package private. Also the new constructor.
# in the test case, can you add an assert for the {{// first call guarantees an 
update}} e.g.
{code}
// first call guarantees an update
long val0 = nb.getAvailabeSpaceTimeStamp();
volume.isResourceAvailable();
long val1 = nb.getAvailabeSpaceTimeStamp();
assertEquals(val0 + 5000, val1);
timer.advance(2000);
{code}

Looks good otherwise!

> NameNodeResourceChecker should avoid calling df.getAvailable too frequently
> ---
>
> Key: HDFS-11907
> URL: https://issues.apache.org/jira/browse/HDFS-11907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
> Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, 
> HDFS-11907.003.patch
>
>
> Currently, {{HealthMonitor#doHealthChecks}} invokes 
> {{NameNode#monitorHealth}} which ends up invoking 
> {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per 
> second by default. And NameNodeResourceChecker#isResourceAvailable invokes 
> {{df.getAvailable();}} every time it is called.
> Since available space information should rarely be changing dramatically at 
> the pace of per second. A cached value should be sufficient. i.e. only try to 
> get the updated value when the cached value is too old. otherwise simply 
> return the cached value. This way df.getAvailable() gets invoked less.
> Thanks [~arpitagarwal] for the offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently

2017-05-31 Thread Chen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032251#comment-16032251
 ] 

Chen Liang edited comment on HDFS-11907 at 6/1/17 12:24 AM:


Thanks [~hanishakoneru] for the review, and thanks [~arpitagarwal] for the 
comments! all addressed in v003 patch.


was (Author: vagarychen):
Thanks [~arpitagarwal] for the comments! all addressed in v003 patch.

> NameNodeResourceChecker should avoid calling df.getAvailable too frequently
> ---
>
> Key: HDFS-11907
> URL: https://issues.apache.org/jira/browse/HDFS-11907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
> Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch, 
> HDFS-11907.003.patch
>
>
> Currently, {{HealthMonitor#doHealthChecks}} invokes 
> {{NameNode#monitorHealth}} which ends up invoking 
> {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per 
> second by default. And NameNodeResourceChecker#isResourceAvailable invokes 
> {{df.getAvailable();}} every time it is called.
> Since available space information should rarely be changing dramatically at 
> the pace of per second. A cached value should be sufficient. i.e. only try to 
> get the updated value when the cached value is too old. otherwise simply 
> return the cached value. This way df.getAvailable() gets invoked less.
> Thanks [~arpitagarwal] for the offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently

2017-05-31 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032154#comment-16032154
 ] 

Arpit Agarwal edited comment on HDFS-11907 at 5/31/17 11:22 PM:


Thanks for this improvement Chen! A few comments:
# We should use Time#monotonicNow instead of System#currentTimeMillis in both 
files. Time#monotonicNow also returns a millisecond value, but it is guaranteed 
to be monotonically increasing.
# Instead of initializing availableSpeceTimeStamp to zero, we should initialize 
it to (Time#monotonicNow - 5000) since 0 can be a valid timestamp returned by 
nanoTime.
# You can also log the IP address of the client that issued the request to aid 
debugging. It can be retrieved in an RPC call by calling Server.getRemoteIp()* 
-Server.getIpAddr()-
# Typo: availabeSpeceTimeStamp --> availableSpaceTimeStamp.
# Let's Replace 5000 and 3000 with static final ints.
# See if you can write an isolated unit test for NameNodeResourceChecker. e.g. 
the first call to isResourceAvailable should update availableSpaceTimeStamp, 
subsequent calls immediately should not. Then if you advance the timer (see 
FakeTimer) and call isResourceAvailable again, availableSpaceTimeStamp should 
be updated.

*thanks for the correction Chen.


was (Author: arpitagarwal):
Thanks for this improvement Chen! A few comments:
# We should use Time#monotonicNow instead of System#currentTimeMillis in both 
files. Time#monotonicNow also returns a millisecond value, but it is guaranteed 
to be monotonically increasing.
# Instead of initializing availableSpeceTimeStamp to zero, we should initialize 
it to (Time#monotonicNow - 5000) since 0 can be a valid timestamp returned by 
nanoTime.
# You can also log the IP address of the client that issued the request to aid 
debugging. It can be retrieved in an RPC call by calling Server.getIpAddr().
# Typo: availabeSpeceTimeStamp --> availableSpaceTimeStamp.
# Let's Replace 5000 and 3000 with static final ints.
# See if you can write an isolated unit test for NameNodeResourceChecker. e.g. 
the first call to isResourceAvailable should update availableSpaceTimeStamp, 
subsequent calls immediately should not. Then if you advance the timer (see 
FakeTimer) and call isResourceAvailable again, availableSpaceTimeStamp should 
be updated.


> NameNodeResourceChecker should avoid calling df.getAvailable too frequently
> ---
>
> Key: HDFS-11907
> URL: https://issues.apache.org/jira/browse/HDFS-11907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
> Attachments: HDFS-11907.001.patch, HDFS-11907.002.patch
>
>
> Currently, {{HealthMonitor#doHealthChecks}} invokes 
> {{NameNode#monitorHealth}} which ends up invoking 
> {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per 
> second by default. And NameNodeResourceChecker#isResourceAvailable invokes 
> {{df.getAvailable();}} every time it is called.
> Since available space information should rarely be changing dramatically at 
> the pace of per second. A cached value should be sufficient. i.e. only try to 
> get the updated value when the cached value is too old. otherwise simply 
> return the cached value. This way df.getAvailable() gets invoked less.
> Thanks [~arpitagarwal] for the offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11907) NameNodeResourceChecker should avoid calling df.getAvailable too frequently

2017-05-31 Thread Chen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031765#comment-16031765
 ] 

Chen Liang edited comment on HDFS-11907 at 5/31/17 7:13 PM:


Post v001 patch to only call {{availableSpace = df.getAvailable()}} to update 
the value at most every 5 seconds.


was (Author: vagarychen):
Post v001 patch to only update call {{availableSpace = df.getAvailable()}} at 
most every 5 seconds.

> NameNodeResourceChecker should avoid calling df.getAvailable too frequently
> ---
>
> Key: HDFS-11907
> URL: https://issues.apache.org/jira/browse/HDFS-11907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
> Attachments: HDFS-11907.001.patch
>
>
> Currently, {{HealthMonitor#doHealthChecks}} invokes 
> {{NameNode#monitorHealth}} which ends up invoking 
> {{NameNodeResourceChecker#isResourceAvailable}}, at the frequency of once per 
> second by default. And NameNodeResourceChecker#isResourceAvailable invokes 
> {{df.getAvailable();}} every time it is called.
> Since available space information should rarely be changing dramatically at 
> the pace of per second. A cached value should be sufficient. i.e. only try to 
> get the updated value when the cached value is too old. otherwise simply 
> return the cached value. This way df.getAvailable() gets invoked less.
> Thanks [~arpitagarwal] for the offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org