Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12069 )
Change subject: IMPALA-7694: Add host resource usage metrics to profile ...................................................................... Patch Set 7: (24 comments) Thanks for the reviews, please see PS8. http://gerrit.cloudera.org:8080/#/c/12069/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12069/7//COMMIT_MSG@7 PS7, Line 7: IMPALA-7694: Add host resource usage metrics to profile > Could you mention that you added a mechanism for periodic counter updates? Done http://gerrit.cloudera.org:8080/#/c/12069/7//COMMIT_MSG@15 PS7, Line 15: This mechanism adds a new time series counter class that collects all : measured values and does not re-sample them. It will re-sample values : when printing them into a string profile to a max of 64 values, but : Thrift profiles will contain the full list of values. > How hard is it to factor this subset of changes out as a patch itself ? It Medium I'd say, there shouldn't be a technical reason but I'd expect the required git-work to be quite painful. Have a look at the next PS and if it's too painful to review I can try to split the change. http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/exec-env.cc File be/src/runtime/exec-env.cc: http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/exec-env.cc@463 PS7, Line 463: std::bind(&SystemStateInfo::CaptureSystemStateSnapshot, system_state_info_.get()); > I read in effective C++ that I should prefer lambdas to std::bind. Well, I Done, I think we're mixing them in the codebase and using a lambda was slightly more concise (but requires C++14 for the named capture). http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/query-state.cc@296 PS7, Line 296: host_profile_->ClearChunkedTimeSeriesCounters(); > Do you know what protects this being a race where you're clearing something Thanks for catching this, added proper locking and tracking of the last reported value, and a test. http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/service/impala-server.cc@1051 PS7, Line 1051: if (rand() < trace_ratio * (RAND_MAX + 1L)) query_ctx->__set_trace_resource_usage(true); > Seems helpful to have a log statement that tracing is enabled. I'm not sure I'm following the idea. What scenarios would the log statement be helping with? http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/service/impala-server.cc@1051 PS7, Line 1051: if (rand() < trace_ratio * (RAND_MAX + 1L)) query_ctx->__set_trace_resource_usage(true); > I think rand() isn't thread safe. (And do we initialize it with srand() som Done http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/periodic-counter-updater.h File be/src/util/periodic-counter-updater.h: http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/periodic-counter-updater.h@33 PS7, Line 33: /// Singleton utility class that updates counter values. This is used to sample some : /// metric (e.g. memory used) at regular intervals. The samples can be summarized in : /// a few ways (e.g. averaged, stored as histogram, kept as a time series data, etc). : /// This class has one thread that will wake up at a regular period and update all : /// the registered counters. : /// Typically, the counter updates should be stopped as early as possible to prevent : /// future stale samples from polluting the useful values. > Should this be updated to include the role of UpdateFn() below ? Done http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/pretty-printer.h File be/src/util/pretty-printer.h: http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/pretty-printer.h@139 PS7, Line 139: case TUnit::BYTE_FRACTION: { > What is a BYTE_FRACTION? Removed http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile-counters.h File be/src/util/runtime-profile-counters.h: http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile-counters.h@453 PS7, Line 453: class RuntimeProfile::ChunkedTimeSeriesCounter : : public RuntimeProfile::TimeSeriesCounter { > The inheritance in this file is getting complicated. Could you add a diagra Added a comment at the top of runtime-profile.h with a brief explanation of what each counter does. I couldn't think of a nicer way to model the inheritance without duplicating some of the data members. http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile-counters.h@474 PS7, Line 474: : virtual const int64_t* GetSamples( : int* num_samples, int* period, SpinLock** lock = nullptr) const override; > Even though this is private, it seems like we should document what lock doe Done, and simplified the whole interface a bit. http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.h File be/src/util/runtime-profile.h: http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.h@386 PS7, Line 386: const std::string& name, TUnit::type unit, DerivedCounterFunction sample_fn); > What makes sample_fn a "DerivedCounterFunction"? i.e., Derived is unclear t I think it was a historic leftover because it was introduced for the DerivedCounter. It is just a function returning an int64 so I renamed it. http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.cc File be/src/util/runtime-profile.cc: http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.cc@779 PS7, Line 779: // SampliSamplingTimeSeriesCounter. > typo Done http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.cc@788 PS7, Line 788: stream << endl; > Consider printing something here that indicates that we actually have more Added more info. I'm reluctant to print everything, because a list of thousands of values for long running queries might clutter the profile and I think it'll be good to have some (arbitrary) cap. Should we bump it to something longer, e.g. 512? http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info-test.cc File be/src/util/system-state-info-test.cc: http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info-test.cc@37 PS7, Line 37: TEST_F(SystemStateInfoTest, ReadProcStat) { > I think I'd expect a test here that takes a string (which represents the va Done http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info-test.cc@54 PS7, Line 54: EXPECT_GT(r.user + r.system + r.iowait, 0); These values only return ratios, and the whole ratio computation works without considering the unit of the values in /proc/stat. They are USER_HZ which is 1/100th of a second on most systems and we can use sysconf(_SC_CLK_TCK) to determine it here. Even if we do so, we might have to consider "steal" which the man-page describes as. > (8) Stolen time, which is the time spent in other operating systems when > running in a virtualized environment Additionally we'd need to add idle time to this to get close to 100% / number of CPUs. I added a test that uses explicit values to make sure that the computation works correctly. Let me know if you would like me to add more tests around real system numbers. http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.h File be/src/util/system-state-info.h: http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.h@43 PS7, Line 43: struct CpuUsageRatios { > What does "Ratio" mean here? What are the units of these? Ratio means a value between 0 and 1 inclusive. I added a comment, let me know if you'd like me to rename the struct. http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.h@64 PS7, Line 64: enum PROC_STAT_CPU_FIELDS { : CPU_USER, : CPU_NICE, : CPU_SYSTEM, : CPU_IDLE, : CPU_IOWAIT : }; > I think there could be more fields here (from man proc). Yes, for this change I extracted the ones that I thought are interesting and useful. Should we add more/all of them? http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc File be/src/util/system-state-info.cc: http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@33 PS7, Line 33: memset(&cpu_ratios_, 0, sizeof(cpu_ratios_)); > I'm not an expert on this stuff, but can we use array initialization here? Yes, struct initialization would work. I used this way because it requires no change when adding more fields to the struct. http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@65 PS7, Line 65: for (int i = 0; i < NUM_CPU_FIELDS; ++i) proc_file >> next_values[i]; > Is there any error handling that needs to happen here? I added a unit test to make sure that this parses a valid line correctly. I also added a DCHECK and a safety net to clamp invalid values to 0 in release builds. http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@75 PS7, Line 75: if (total_tics == 0) { > Comment? Done http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@80 PS7, Line 80: cpu_ratios_.user = ((cur[CPU_USER] - old[CPU_USER]) << 8) / total_tics; : cpu_ratios_.system = ((cur[CPU_SYSTEM] - old[CPU_SYSTEM]) << 8) / total_tics; : cpu_ratios_.iowait = ((cur[CPU_IOWAIT] - old[CPU_IOWAIT]) << 8) / total_tics; > Comment what this is doing? Done http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/Metrics.thrift File common/thrift/Metrics.thrift: http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/Metrics.thrift@32 PS7, Line 32: // Fraction between 0 and 1 with single byte precision [0, 255]. : // Pretty-printed as a percentage [0, 100]. : BYTE_FRACTION, > This is a weird unit. I guess we're constrained to int64 here, so that's ho I wanted to express that this is a fraction [0,1] encoded in a single byte. I'm also good with base points or percent, and giving up the byte limitation (the rest of the patch doesn't use it yet anyways). http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/RuntimeProfile.thrift File common/thrift/RuntimeProfile.thrift: http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/RuntimeProfile.thrift@73 PS7, Line 73: 5: required i64 start_index > By marking this required, I think you make Thrift fail to understand an old Done http://gerrit.cloudera.org:8080/#/c/12069/7/tests/query_test/test_observability.py File tests/query_test/test_observability.py: http://gerrit.cloudera.org:8080/#/c/12069/7/tests/query_test/test_observability.py@422 PS7, Line 422: query_opts = {'resource_trace_ratio': 1.0} > Do we have a test that asserts that resource_trace_ratio = 0 also works cor There's test_query_profile_host_cpu_usage_off_by_default above that checks that it is off by default, I extended it to make sure that it can also be turned off explicitly. -- To view, visit http://gerrit.cloudera.org:8080/12069 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3aedc20c553ab8d7ed50f72a1a936eba151487d9 Gerrit-Change-Number: 12069 Gerrit-PatchSet: 7 Gerrit-Owner: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com> Gerrit-Comment-Date: Wed, 19 Dec 2018 01:00:13 +0000 Gerrit-HasComments: Yes