Will Berkeley has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12261 )


Change subject: [spark] Add write duration histograms
......................................................................

[spark] Add write duration histograms

This adds an additional accumulator metric to KuduContext writes: write
duration histograms. These histograms show up in the web UI and in
the driver logs, making it easier to track how much time is spent writing
to Kudu in Spark stages and tasks. Log messages on the driver look like:

19/01/23 11:13:34 INFO kudu.KuduContext: completed insert ops: duration histogram: 25.0%: 14ms, 25.0%: 14ms, 75.0%: 17ms, 75.0%: 17ms, 75.0%: 17ms, 100.0%: 66ms, 100.0%: 66ms

The funny repeated values are an artifact of having a cluster with only
3 executors executing 4 tasks. Log messages on executors look like:

19/01/23 11:13:34 INFO kudu.KuduContext: applied 69 inserts to table 'impala::default.aaa' in 14ms

HdrHistograms need to be shipped between executors and the driver, so
their (serialized) size is relevant. Spark users differ in how they
serialize, so I didn't put much effort into estimating the serialized
size, but based on the conservative formula in [1] the in-memory size of
a histogram with 3 significant value digits and storing longs is 4MiB or
so. That only happens if the histogram is storing values from 1 to the
max trackable long value, which is Long.MAX_VALUE / 2. More realistically,
the values in the duration histogram should be at most 86400 * 1000, the
number of milliseconds in a day, and usually much, much smaller. For
that range of values, the max footprint is 1MiB. That should be a safe
amount of data to ship semi-frequently along with all the Kudu
data (and I'm not counting potential compression as part of
serialization).
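
For reference, the footprint arithmetic can be sketched in a few lines of
Scala. This is a minimal sketch, assuming the formula in [1] reads as
512 + (counter size / 2) * (log2RoundedUp(highestTrackableValue / subBucketSize) + 2) * subBucketSize,
with subBucketSize being 2 * 10^digits rounded up to a power of two. The
object and method names (FootprintEstimate, estimateBytes, and the helpers)
are hypothetical and are not part of this change or of HdrHistogram.

```scala
// Hypothetical helper: a transcription of the conservative footprint
// estimate described in the HdrHistogram README, not library code.
object FootprintEstimate {

  // Smallest power of two that is >= n, for n >= 1.
  def roundUpToPowerOf2(n: Long): Long = {
    val high = java.lang.Long.highestOneBit(n)
    if (high == n) n else high << 1
  }

  // ceil(log2(n)) for n >= 1.
  def log2RoundedUp(n: Long): Int =
    if (n <= 1) 0 else 64 - java.lang.Long.numberOfLeadingZeros(n - 1)

  // counterBytes is the size of one count slot (8 for long counts).
  def estimateBytes(
      highestTrackableValue: Long,
      significantDigits: Int,
      counterBytes: Int = 8): Long = {
    val largestValueWithSingleUnitResolution =
      2L * math.pow(10, significantDigits).toLong
    val subBucketSize = roundUpToPowerOf2(largestValueWithSingleUnitResolution)
    // Ceiling division, so the estimate stays on the conservative side.
    val quotient = highestTrackableValue / subBucketSize +
      (if (highestTrackableValue % subBucketSize != 0) 1 else 0)
    val buckets = log2RoundedUp(quotient) + 2
    512L + (counterBytes / 2) * buckets.toLong * subBucketSize
  }

  def main(args: Array[String]): Unit = {
    val oneDayMillis = 86400L * 1000L
    println(s"full range: ${estimateBytes(Long.MaxValue / 2, 3)} bytes")
    println(s"one day:    ${estimateBytes(oneDayMillis, 3)} bytes")
  }
}
```

Under this reading of the formula, both cases come out in the hundreds of
KiB (roughly 425KiB for the full range and 145KiB for one day of
milliseconds), which is below the 4MiB and 1MiB figures above. If the
transcription is right, those figures can be treated as comfortable upper
bounds.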

[1]: https://github.com/HdrHistogram/HdrHistogram#footprint-estimation

Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
---
M java/gradle/dependencies.gradle
M java/kudu-spark/build.gradle
A java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
4 files changed, 114 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/12261/1
--
To view, visit http://gerrit.cloudera.org:8080/12261

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wdberke...@gmail.com>