Will Berkeley has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12261 )
Change subject: [spark] Add write duration histograms
......................................................................

[spark] Add write duration histograms

This adds an additional accumulator metric to KuduContext writes:
write duration histograms. These histograms will show up on the webui
and in the driver logs, so it's easier to track how much time is spent
writing to Kudu in Spark stages and tasks.

Log messages on the driver look like:

  19/01/23 11:13:34 INFO kudu.KuduContext: completed insert ops: duration histogram: 25.0%: 14ms, 25.0%: 14ms, 75.0%: 17ms, 75.0%: 17ms, 75.0%: 17ms, 100.0%: 66ms, 100.0%: 66ms

The funny repeated values are an artifact of having a cluster with only
3 executors executing 4 tasks.

Log messages on executors look like:

  19/01/23 11:13:34 INFO kudu.KuduContext: applied 69 inserts to table 'impala::default.aaa' in 14ms

HdrHistograms need to be shipped between executors and the driver, so
their (serialized) size is relevant. Spark users differ in how they
serialize, so I didn't put much effort into estimating the serialized
size, but based on the conservative formula in [1] the in-memory size of
a histogram with 3 significant value digits and storing longs is 4MiB
or so. That only happens if the histogram is storing values from 1 to
the max trackable long value, which is Long.MAX / 2. More realistically,
the values in the duration histogram should be at most 86400 * 1000, the
number of milliseconds in a day, and usually much, much smaller. For
that range of values, the max footprint is 1MiB. That should be a safe
amount of data to ship around semi-frequently along with all the Kudu
data (and I'm not counting potential compression as part of
serialization).

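The footprint discussion above relies on the conservative formula in [1]. Below is a minimal sketch of that arithmetic, assuming the formula reads `512 + (bytesPerCount / 2) * (log2RoundedUp(highestTrackableValue / subBucketSize) + 2) * subBucketSize`, where `subBucketSize` is `2 * 10^digits` rounded up to a power of two. The class and method names (`FootprintEstimate`, `estimateBytes`) are hypothetical and are not part of this change or of HdrHistogram. Treat the output as a rough bound only; it depends on how the formula is read and may not match the rounded figures quoted above.

```java
// Hypothetical helper: a stdlib-only reading of the conservative footprint
// formula referenced as [1]. Not part of the Kudu change or of HdrHistogram.
final class FootprintEstimate {

    /** Smallest power of two that is >= v (v must be positive). */
    static long roundUpToPowerOfTwo(long v) {
        long highest = Long.highestOneBit(v);
        return highest == v ? v : highest << 1;
    }

    /** Ceiling of log2(v); returns 0 for v <= 1. */
    static int log2RoundedUp(long v) {
        return v <= 1 ? 0 : 64 - Long.numberOfLeadingZeros(v - 1);
    }

    /**
     * Estimated in-memory footprint in bytes.
     *
     * @param significantDigits     number of significant value digits (e.g. 3)
     * @param bytesPerCount         size of one counter (8 for long counts)
     * @param highestTrackableValue largest value the histogram must track
     */
    static long estimateBytes(int significantDigits, int bytesPerCount,
                              long highestTrackableValue) {
        long largestValueWithSingleUnitResolution =
            2 * (long) Math.pow(10, significantDigits);
        long subBucketSize =
            roundUpToPowerOfTwo(largestValueWithSingleUnitResolution);
        long buckets = log2RoundedUp(highestTrackableValue / subBucketSize) + 2;
        return 512 + (bytesPerCount / 2) * buckets * subBucketSize;
    }

    public static void main(String[] args) {
        long msPerDay = 86400L * 1000L;
        // Durations capped at one day, in milliseconds.
        System.out.println("one day of ms: "
            + estimateBytes(3, 8, msPerDay) + " bytes");
        // Worst case mentioned in the message: values up to Long.MAX / 2.
        System.out.println("Long.MAX / 2:  "
            + estimateBytes(3, 8, Long.MAX_VALUE / 2) + " bytes");
    }
}
```

For a histogram that already exists, HdrHistogram's own `getEstimatedFootprintInBytes()` reports the estimate directly, which avoids re-deriving the formula by hand.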
[1]: https://github.com/HdrHistogram/HdrHistogram#footprint-estimation

Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Reviewed-on: http://gerrit.cloudera.org:8080/12261
Reviewed-by: Grant Henke <granthe...@apache.org>
Tested-by: Will Berkeley <wdberke...@gmail.com>
---
M java/gradle/dependencies.gradle
M java/kudu-spark/build.gradle
A java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
4 files changed, 113 insertions(+), 1 deletion(-)

Approvals:
  Grant Henke: Looks good to me, approved
  Will Berkeley: Verified

--
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wdberke...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com>