Will Berkeley has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................

[spark] Add write duration histograms

This adds an additional accumulator metric to KuduContext writes: write
duration histograms. These histograms will show up on the web UI and in
the driver logs, so it's easier to track how much time is spent writing
to Kudu in Spark stages and tasks.

Log messages on the driver look like:

  19/01/23 11:13:34 INFO kudu.KuduContext: completed insert ops: duration histogram: 25.0%: 14ms, 25.0%: 14ms, 75.0%: 17ms, 75.0%: 17ms, 75.0%: 17ms, 100.0%: 66ms, 100.0%: 66ms

The funny repeated values are an artifact of having a cluster with only
3 executors executing 4 tasks.

Log messages on executors look like:

  19/01/23 11:13:34 INFO kudu.KuduContext: applied 69 inserts to table 'impala::default.aaa' in 14ms

HdrHistograms need to be shipped between executors and the driver, so
their (serialized) size is relevant. Spark users differ in how they
serialize, so I didn't put much effort into estimating the serialized
size, but based on the conservative formula in [1], the in-memory size of
a histogram with 3 significant value digits and storing longs is 4MiB or
so. That only happens if the histogram is storing values from 1 to the
max trackable long value, which is Long.MAX / 2. More realistically, the
values in the duration histogram should be at most 86400 * 1000, the
number of milliseconds in a day, and usually much, much smaller. For
that range of values, the max footprint is 1MiB. That should be a safe
amount of data to ship semi-frequently along with all the Kudu data (and
I'm not counting potential compression as part of serialization).
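
To make the record-on-executor, merge-on-driver flow described above concrete, here is a minimal sketch in plain Java. It is only an illustration: DurationHistogramSketch, record, merge, and valueAtPercentile are hypothetical names, the class keeps exact counts in a TreeMap rather than using HdrHistogram, and it is not the HdrHistogramAccumulator added by this change. The four sample durations are assumed values chosen to be consistent with the driver log line; the actual per-task durations are not given here.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a mergeable duration histogram. It is not
// HdrHistogram and not the accumulator class added by this change.
class DurationHistogramSketch {
    // Exact count per recorded duration in milliseconds, kept sorted by duration.
    private final TreeMap<Long, Long> counts = new TreeMap<>();
    private long total = 0;

    // Would be called on an executor after a write completes.
    public void record(long durationMs) {
        counts.merge(durationMs, 1L, Long::sum);
        total++;
    }

    // Would be called on the driver to fold one task's histogram into the total.
    public void merge(DurationHistogramSketch other) {
        for (Map.Entry<Long, Long> e : other.counts.entrySet()) {
            counts.merge(e.getKey(), e.getValue(), Long::sum);
        }
        total += other.total;
    }

    // Nearest-rank percentile: the smallest recorded value v such that at
    // least `percentile` percent of the samples are <= v.
    public long valueAtPercentile(double percentile) {
        if (total == 0) {
            throw new IllegalStateException("empty histogram");
        }
        long rank = Math.max(1L, (long) Math.ceil(percentile / 100.0 * total));
        long seen = 0;
        for (Map.Entry<Long, Long> e : counts.entrySet()) {
            seen += e.getValue();
            if (seen >= rank) {
                return e.getKey();
            }
        }
        return counts.lastKey();
    }

    public static void main(String[] args) {
        // Assumed durations for four tasks, one write each.
        long[] taskDurationsMs = {14, 17, 17, 66};
        DurationHistogramSketch driverSide = new DurationHistogramSketch();
        for (long d : taskDurationsMs) {
            DurationHistogramSketch perTask = new DurationHistogramSketch();
            perTask.record(d);
            driverSide.merge(perTask);
        }
        for (double p : new double[] {25.0, 75.0, 100.0}) {
            System.out.println(p + "%: " + driverSide.valueAtPercentile(p) + "ms");
        }
    }
}
```

The point of the sketch is that merging is a plain sum of counts, so per-task histograms can be combined in any order. The exact-count map is only reasonable for a handful of samples; HdrHistogram trades exactness for the bounded footprint discussed above.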

[1]: https://github.com/HdrHistogram/HdrHistogram#footprint-estimation

Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
---
M java/gradle/dependencies.gradle
M java/kudu-spark/build.gradle
A java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
4 files changed, 114 insertions(+), 1 deletion(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/12261/1
--
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wdberke...@gmail.com>