Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12069 )

Change subject: IMPALA-7694: Add host resource usage metrics to profile
......................................................................


Patch Set 7:

(24 comments)

Thanks for the reviews, please see PS8.

http://gerrit.cloudera.org:8080/#/c/12069/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12069/7//COMMIT_MSG@7
PS7, Line 7: IMPALA-7694: Add host resource usage metrics to profile
> Could you mention that you added a mechanism for periodic counter updates?
Done


http://gerrit.cloudera.org:8080/#/c/12069/7//COMMIT_MSG@15
PS7, Line 15: This mechanism adds a new time series counter class that collects 
all
            : measured values and does not re-sample them. It will re-sample 
values
            : when printing them into a string profile to a max of 64 values, 
but
            : Thrift profiles will contain the full list of values.
> How hard is it to factor this subset of changes out as a patch itself ? It
Medium I'd say, there shouldn't be a technical reason but I'd expect the 
required git-work to be quite painful. Have a look at the next PS and if it's 
too painful to review I can try to split the change.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/exec-env.cc@463
PS7, Line 463:       std::bind(&SystemStateInfo::CaptureSystemStateSnapshot, 
system_state_info_.get());
> I read in effective C++ that I should prefer lambdas to std::bind. Well, I 
Done, I think we're mixing them in the codebase and using a lambda was slightly 
more concise (but requires C++14 for the named capture).


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/query-state.cc@296
PS7, Line 296:   host_profile_->ClearChunkedTimeSeriesCounters();
> Do you know what protects this being a race where you're clearing something
Thanks for catching this, added proper locking and tracking of the last 
reported value, and a test.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/service/impala-server.cc@1051
PS7, Line 1051:   if (rand() < trace_ratio * (RAND_MAX + 1L)) 
query_ctx->__set_trace_resource_usage(true);
> Seems helpful to have a log statement that tracing is enabled.
I'm not sure I'm following the idea. What scenarios would the log statement be 
helping with?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/service/impala-server.cc@1051
PS7, Line 1051:   if (rand() < trace_ratio * (RAND_MAX + 1L)) 
query_ctx->__set_trace_resource_usage(true);
> I think rand() isn't thread safe. (And do we initialize it with srand() som
Done


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/periodic-counter-updater.h
File be/src/util/periodic-counter-updater.h:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/periodic-counter-updater.h@33
PS7, Line 33: /// Singleton utility class that updates counter values. This is 
used to sample some
            : /// metric (e.g. memory used) at regular intervals. The samples 
can be summarized in
            : /// a few ways (e.g. averaged, stored as histogram, kept as a 
time series data, etc).
            : /// This class has one thread that will wake up at a regular 
period and update all
            : /// the registered counters.
            : /// Typically, the counter updates should be stopped as early as 
possible to prevent
            : /// future stale samples from polluting the useful values.
> Should this be updated to include the role of UpdateFn() below ?
Done


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/pretty-printer.h
File be/src/util/pretty-printer.h:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/pretty-printer.h@139
PS7, Line 139:       case TUnit::BYTE_FRACTION: {
> What is a BYTE_FRACTION?
Removed


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile-counters.h@453
PS7, Line 453: class RuntimeProfile::ChunkedTimeSeriesCounter
             :     : public RuntimeProfile::TimeSeriesCounter {
> The inheritance in this file is getting complicated. Could you add a diagra
Added a comment at the top of runtime-profile.h with a brief explanation of 
what each counter does.

I couldn't think of a nicer way to model the inheritance without duplicating 
some of the data members.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile-counters.h@474
PS7, Line 474:
             :   virtual const int64_t* GetSamples(
             :       int* num_samples, int* period, SpinLock** lock = nullptr) 
const override;
> Even though this is private, it seems like we should document what lock doe
Done, and simplified the whole interface a bit.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.h
File be/src/util/runtime-profile.h:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.h@386
PS7, Line 386:       const std::string& name, TUnit::type unit, 
DerivedCounterFunction sample_fn);
> What makes sample_fn a "DerivedCounterFunction"? i.e., Derived is unclear t
I think it was a historic leftover because it was introduced for the 
DerivedCounter. It is just a function returning an int64 so I renamed it.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.cc@779
PS7, Line 779:         // SampliSamplingTimeSeriesCounter.
> typo
Done


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.cc@788
PS7, Line 788:         stream << endl;
> Consider printing something here that indicates that we actually have more
Added more info. I'm reluctant to print everything, because a list of thousands 
of values for long running queries might clutter the profile and I think it'll 
be good to have some (arbitrary) cap. Should we bump it to something longer, 
e.g. 512?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info-test.cc
File be/src/util/system-state-info-test.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info-test.cc@37
PS7, Line 37: TEST_F(SystemStateInfoTest, ReadProcStat) {
> I think I'd expect a test here that takes a string (which represents the va
Done


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info-test.cc@54
PS7, Line 54:     EXPECT_GT(r.user + r.system + r.iowait, 0);
These values only return ratios, and the whole ratio computation works without 
considering the unit of the values in /proc/stat. They are USER_HZ which is 
1/100th of a second on most systems and we can use sysconf(_SC_CLK_TCK) to 
determine it here. Even if we do so, we might have to consider "steal" which 
the man-page describes as.

> (8) Stolen time, which is the time spent in other  operating systems when 
> running in a virtualized environment

Additionally we'd need to add idle time to this to get close to 100% / number 
of CPUs.

I added a test that uses explicit values to make sure that the computation 
works correctly. Let me know if you would like me to add more tests around real 
system numbers.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.h
File be/src/util/system-state-info.h:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.h@43
PS7, Line 43:   struct CpuUsageRatios {
> What does "Ratio" mean here? What are the units of these?
Ratio means a value between 0 and 1 inclusive. I added a comment, let me know 
if you'd like me to rename the struct.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.h@64
PS7, Line 64:   enum PROC_STAT_CPU_FIELDS {
            :     CPU_USER,
            :     CPU_NICE,
            :     CPU_SYSTEM,
            :     CPU_IDLE,
            :     CPU_IOWAIT
            :   };
> I think there could be more fields here (from man proc).
Yes, for this change I extracted the ones that I thought are interesting and 
useful. Should we add more/all of them?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc
File be/src/util/system-state-info.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@33
PS7, Line 33:   memset(&cpu_ratios_, 0, sizeof(cpu_ratios_));
> I'm not an expert on this stuff, but can we use array initialization here?
Yes, struct initialization would work. I used this way because it requires no 
change when adding more fields to the struct.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@65
PS7, Line 65:   for (int i = 0; i < NUM_CPU_FIELDS; ++i) proc_file >> 
next_values[i];
> Is there any error handling that needs to happen here?
I added a unit test to make sure that this parses a valid line correctly. I 
also added a DCHECK and a safety net to clamp invalid values to 0 in release 
builds.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@75
PS7, Line 75:   if (total_tics == 0) {
> Comment?
Done


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@80
PS7, Line 80:   cpu_ratios_.user = ((cur[CPU_USER] - old[CPU_USER]) << 8) / 
total_tics;
            :   cpu_ratios_.system = ((cur[CPU_SYSTEM] - old[CPU_SYSTEM]) << 8) 
/ total_tics;
            :   cpu_ratios_.iowait = ((cur[CPU_IOWAIT] - old[CPU_IOWAIT]) << 8) 
/ total_tics;
> Comment what this is doing?
Done


http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/Metrics.thrift
File common/thrift/Metrics.thrift:

http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/Metrics.thrift@32
PS7, Line 32:   // Fraction between 0 and 1 with single byte precision [0, 255].
            :   // Pretty-printed as a percentage [0, 100].
            :   BYTE_FRACTION,
> This is a weird unit. I guess we're constrained to int64 here, so that's ho
I wanted to express that this is a fraction [0,1] encoded in a single byte. I'm 
also good with base points or percent, and giving up the byte limitation (the 
rest of the patch doesn't use it yet anyways).


http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/RuntimeProfile.thrift
File common/thrift/RuntimeProfile.thrift:

http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/RuntimeProfile.thrift@73
PS7, Line 73:   5: required i64 start_index
> By marking this required, I think you make Thrift fail to understand an old
Done


http://gerrit.cloudera.org:8080/#/c/12069/7/tests/query_test/test_observability.py
File tests/query_test/test_observability.py:

http://gerrit.cloudera.org:8080/#/c/12069/7/tests/query_test/test_observability.py@422
PS7, Line 422:     query_opts = {'resource_trace_ratio': 1.0}
> Do we have a test that asserts that resource_trace_ratio = 0 also works cor
There's test_query_profile_host_cpu_usage_off_by_default above that checks that 
it is off by default, I extended it to make sure that it can also be turned off 
explicitly.



-- 
To view, visit http://gerrit.cloudera.org:8080/12069
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3aedc20c553ab8d7ed50f72a1a936eba151487d9
Gerrit-Change-Number: 12069
Gerrit-PatchSet: 7
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Comment-Date: Wed, 19 Dec 2018 01:00:13 +0000
Gerrit-HasComments: Yes

Reply via email to