[kudu-CR] KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9523 ) Change subject: KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias .. KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias This changes the timing for the collection of stack samples into the metrics log to avoid two issues: 1) Inflexible configuration: in many cases we might want to sample stacks more or less frequently than metrics. The previous implementation used the same schedule for both. 2) Sampling bias: if we collect stack samples on a fixed schedule, we risk unintended correlation with background tasks that might run on a similar fixed schedule. For example, if we collect stacks exactly once a minute, and the user has some monitoring software which polls the HTTP server once a minute for some data, we might either line up perfectly with the polling (in which case we'd overestimate its impact) or line of perfectly away from the polling (in which case we'd never see its effects). This patch changes the wakeups for stacks and metrics to be decoupled. In addition, it adds randomness to the stack sampling. The configuration now represents the mean sampling rate rather than an exact sampling rate. I manually ran a master with -diagnostics-log-stack-traces-interval-ms=2000 and verified that the stack sample times were between 0 and 4 seconds apart while the metric dumps were still exactly one minute apart. I also plan to modify some local test clusters to configure this to be relatively frequent to try to suss out any races or bugs which might occur in a real workload. Change-Id: I538eb59f95a695fa2da50f24ed82148715b15e44 Reviewed-on: http://gerrit.cloudera.org:8080/9523 Tested-by: Todd LipconReviewed-by: Adar Dembo --- M src/kudu/server/diagnostics_log.cc M src/kudu/server/diagnostics_log.h 2 files changed, 77 insertions(+), 49 deletions(-) Approvals: Todd Lipcon: Verified Adar Dembo: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/9523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I538eb59f95a695fa2da50f24ed82148715b15e44 Gerrit-Change-Number: 9523 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/9523 ) Change subject: KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/9523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I538eb59f95a695fa2da50f24ed82148715b15e44 Gerrit-Change-Number: 9523 Gerrit-PatchSet: 3 Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 08 Mar 2018 18:49:42 + Gerrit-HasComments: No
[kudu-CR] KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias
Todd Lipcon has removed a vote on this change. Change subject: KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias .. Removed Verified-1 by Kudu Jenkins (120) -- To view, visit http://gerrit.cloudera.org:8080/9523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: deleteVote Gerrit-Change-Id: I538eb59f95a695fa2da50f24ed82148715b15e44 Gerrit-Change-Number: 9523 Gerrit-PatchSet: 3 Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/9523 ) Change subject: KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias .. Patch Set 3: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/9523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I538eb59f95a695fa2da50f24ed82148715b15e44 Gerrit-Change-Number: 9523 Gerrit-PatchSet: 3 Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 08 Mar 2018 18:10:14 + Gerrit-HasComments: No
[kudu-CR] KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias
Hello Mike Percy, Kudu Jenkins, Adar Dembo, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9523 to look at the new patch set (#3). Change subject: KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias .. KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias This changes the timing for the collection of stack samples into the metrics log to avoid two issues: 1) Inflexible configuration: in many cases we might want to sample stacks more or less frequently than metrics. The previous implementation used the same schedule for both. 2) Sampling bias: if we collect stack samples on a fixed schedule, we risk unintended correlation with background tasks that might run on a similar fixed schedule. For example, if we collect stacks exactly once a minute, and the user has some monitoring software which polls the HTTP server once a minute for some data, we might either line up perfectly with the polling (in which case we'd overestimate its impact) or line of perfectly away from the polling (in which case we'd never see its effects). This patch changes the wakeups for stacks and metrics to be decoupled. In addition, it adds randomness to the stack sampling. The configuration now represents the mean sampling rate rather than an exact sampling rate. I manually ran a master with -diagnostics-log-stack-traces-interval-ms=2000 and verified that the stack sample times were between 0 and 4 seconds apart while the metric dumps were still exactly one minute apart. I also plan to modify some local test clusters to configure this to be relatively frequent to try to suss out any races or bugs which might occur in a real workload. Change-Id: I538eb59f95a695fa2da50f24ed82148715b15e44 --- M src/kudu/server/diagnostics_log.cc M src/kudu/server/diagnostics_log.h 2 files changed, 77 insertions(+), 49 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/9523/3 -- To view, visit http://gerrit.cloudera.org:8080/9523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I538eb59f95a695fa2da50f24ed82148715b15e44 Gerrit-Change-Number: 9523 Gerrit-PatchSet: 3 Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/9523 ) Change subject: KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias .. Patch Set 2: (8 comments) http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.h File src/kudu/server/diagnostics_log.h: http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.h@59 PS2, Line 59: enum WakeupType { > enum class is the new hotness. Done http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc File src/kudu/server/diagnostics_log.cc: http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@75 PS2, Line 75: If this is set to " : "a non-positive number, stack traces will be not be periodically logged. > How about making this a uint and using 0 to disable? Done http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@169 PS2, Line 169: Random rng(GetRandomSeed32()); > Why create a new PRNG each time? seemed easier than making it a class member, and this isn't perf critical enough that the seed computation matters much http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@182 PS2, Line 182: __builtin_unreachable(); > Fancy. If you're feeling generous, there are quite a few places in the code I suppose it coudl be. This isn't the first use of it, though. It's not supported by MSVC but I think we have so many Linux-isms that this is the least of our problems :) FWIW Clang on windows is also now pretty feasible (Chrome just switched to it) http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@204 PS2, Line 204: } else if (MonoTime::Now() > next_log) { > >= is more correct, no? Done http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@208 PS2, Line 208: wakeups.emplace(ComputeNextWakeup(what), what); > what what what Ack ack ack http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@214 PS2, Line 214: // Unlock the mutex while actually logging metrics or stacks since it's somewhat > This is much cleaner, thanks for the cleanup. Ack http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@222 PS2, Line 222: WARN_NOT_OK(LogStacks(reason), "Unable to collect stacks to diagnostics log"); > So error no longer feds into the next logging time? Didn't find it useful? yea, at first I tried to maintain it, but then it just added complexity and I couldn't really see the value relative to the extra 5-10 lines of code -- To view, visit http://gerrit.cloudera.org:8080/9523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I538eb59f95a695fa2da50f24ed82148715b15e44 Gerrit-Change-Number: 9523 Gerrit-PatchSet: 2 Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 08 Mar 2018 03:49:37 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/9523 ) Change subject: KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias .. Patch Set 2: (8 comments) http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.h File src/kudu/server/diagnostics_log.h: http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.h@59 PS2, Line 59: enum WakeupType { enum class is the new hotness. http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc File src/kudu/server/diagnostics_log.cc: http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@75 PS2, Line 75: If this is set to " : "a non-positive number, stack traces will be not be periodically logged. How about making this a uint and using 0 to disable? http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@169 PS2, Line 169: Random rng(GetRandomSeed32()); Why create a new PRNG each time? http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@182 PS2, Line 182: __builtin_unreachable(); Fancy. If you're feeling generous, there are quite a few places in the codebase where we add a comment like // unreachable instead of actually using this, and they could be fixed up. Also, should this be abstracted away by something in gutil/port.h? Or do you expect it to be the same everywhere, including in compilers we might not support yet (like MSVC)? http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@204 PS2, Line 204: } else if (MonoTime::Now() > next_log) { >= is more correct, no? http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@208 PS2, Line 208: wakeups.emplace(ComputeNextWakeup(what), what); what what what http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@214 PS2, Line 214: // Unlock the mutex while actually logging metrics or stacks since it's somewhat This is much cleaner, thanks for the cleanup. http://gerrit.cloudera.org:8080/#/c/9523/2/src/kudu/server/diagnostics_log.cc@222 PS2, Line 222: WARN_NOT_OK(LogStacks(reason), "Unable to collect stacks to diagnostics log"); So error no longer feds into the next logging time? Didn't find it useful? -- To view, visit http://gerrit.cloudera.org:8080/9523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I538eb59f95a695fa2da50f24ed82148715b15e44 Gerrit-Change-Number: 9523 Gerrit-PatchSet: 2 Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Comment-Date: Wed, 07 Mar 2018 20:22:48 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias
Hello Mike Percy, Adar Dembo, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/9523 to review the following change. Change subject: KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias .. KUDU-2297 (part 6): enable periodic stack sampling by default, avoid bias This changes the timing for the collection of stack samples into the metrics log to avoid two issues: 1) Inflexible configuration: in many cases we might want to sample stacks more or less frequently than metrics. The previous implementation used the same schedule for both. 2) Sampling bias: if we collect stack samples on a fixed schedule, we risk unintended correlation with background tasks that might run on a similar fixed schedule. For example, if we collect stacks exactly once a minute, and the user has some monitoring software which polls the HTTP server once a minute for some data, we might either line up perfectly with the polling (in which case we'd overestimate its impact) or line of perfectly away from the polling (in which case we'd never see its effects). This patch changes the wakeups for stacks and metrics to be decoupled. In addition, it adds randomness to the stack sampling. The configuration now represents the mean sampling rate rather than an exact sampling rate. I manually ran a master with -diagnostics-log-stack-traces-interval-ms=2000 and verified that the stack sample times were between 0 and 4 seconds apart while the metric dumps were still exactly one minute apart. I also plan to modify some local test clusters to configure this to be relatively frequent to try to suss out any races or bugs which might occur in a real workload. Change-Id: I538eb59f95a695fa2da50f24ed82148715b15e44 --- M src/kudu/server/diagnostics_log.cc M src/kudu/server/diagnostics_log.h 2 files changed, 77 insertions(+), 49 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/9523/1 -- To view, visit http://gerrit.cloudera.org:8080/9523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I538eb59f95a695fa2da50f24ed82148715b15e44 Gerrit-Change-Number: 9523 Gerrit-PatchSet: 1 Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Mike Percy