[jira] [Commented] (KUDU-636) optimization: we spend a lot of time in alloc/free
[ https://issues.apache.org/jira/browse/KUDU-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155725#comment-17155725 ]

ASF subversion and git services commented on KUDU-636:
-------------------------------------------------------

Commit a600f386aa2c341522638acb9af53fd45c469431 in kudu's branch refs/heads/master from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=a600f38 ]

KUDU-636. Use Arena for EncodedKeys

This updates EncodedKeyBuilder, RowSetKeyProbe, and EncodedKey to always
allocate from an Arena instead of from the heap. This reduces allocator
contention on the write path significantly and improves memory locality.

I measured by running a tserver under 'perf stat' while using perf loadgen
to insert 80M rows total using 8 client threads. The CPU time on the
tserver was reduced by about 20%.

Before:

 Performance counter stats for './build/latest/bin/kudu tserver run -fs_wal_dir /tmp/ts':

         269853.10 msec task-clock                #    6.862 CPUs utilized
            293066      context-switches          #    0.001 M/sec
             44541      cpu-migrations            #    0.165 K/sec
           2846435      page-faults               #    0.011 M/sec
     1110190206891      cycles                    #    4.114 GHz                      (83.33%)
      201895623339      stalled-cycles-frontend   #   18.19% frontend cycles idle     (83.33%)
      137095475307      stalled-cycles-backend    #   12.35% backend cycles idle      (83.32%)
      894201276095      instructions              #    0.81  insn per cycle
                                                  #    0.23  stalled cycles per insn  (83.33%)
      159095264762      branches                  #  589.562 M/sec                    (83.35%)
         639216492      branch-misses             #    0.40% of all branches          (83.35%)

       255.178068000 seconds user
        14.913394000 seconds sys

After:

 Performance counter stats for './build/latest/bin/kudu tserver run -fs_wal_dir /tmp/ts':

         227730.62 msec task-clock                #    6.212 CPUs utilized
            263824      context-switches          #    0.001 M/sec
             45470      cpu-migrations            #    0.200 K/sec
           3165436      page-faults               #    0.014 M/sec
      931840588715      cycles                    #    4.092 GHz                      (83.25%)
      183214671009      stalled-cycles-frontend   #   19.66% frontend cycles idle     (83.40%)
      111864991317      stalled-cycles-backend    #   12.00% backend cycles idle      (83.35%)
      832636863971      instructions              #    0.89  insn per cycle
                                                  #    0.22  stalled cycles per insn  (83.40%)
      148228107120      branches                  #  650.892 M/sec                    (83.24%)
         563344647      branch-misses             #    0.38% of all branches          (83.35%)

       211.361472000 seconds user
        16.635265000 seconds sys

Change-Id: Ib46d0e2c31e03a7f319ceb0bf742e08ff74d7683
Reviewed-on: http://gerrit.cloudera.org:8080/16162
Reviewed-by: Alexey Serbin
Tested-by: Todd Lipcon

> optimization: we spend a lot of time in alloc/free
> --------------------------------------------------
>
>                 Key: KUDU-636
>                 URL: https://issues.apache.org/jira/browse/KUDU-636
>             Project: Kudu
>          Issue Type: Improvement
>          Components: perf
>    Affects Versions: Public beta
>            Reporter: Todd Lipcon
>            Priority: Major
>
> Looking at a workload in the cluster, several of the top 10 lines of perf
> report are tcmalloc-related. It seems like we don't do a good job of making
> use of the per-thread free-lists, and we end up in a lot of contention on the
> central free list. There are a few low-hanging fruit things we could do to
> improve this for a likely perf boost.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (KUDU-636) optimization: we spend a lot of time in alloc/free
[ https://issues.apache.org/jira/browse/KUDU-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154923#comment-17154923 ]

ASF subversion and git services commented on KUDU-636:
-------------------------------------------------------

Commit 9287bc2095bfea2713cd743dc3f5bb4cd0f41476 in kudu's branch refs/heads/master from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=9287bc2 ]

KUDU-636. Use protobuf arenas for CommitMsgs

This commit optimizes the write path by allowing protobuf arenas to be used
to construct the operation results protobuf and the CommitMsg that contains
it. The operation result for a large batch of writes has one or more
embedded protobufs per inserted row, so using a protobuf arena for
allocation is much more efficient than calling into the system allocator
for each object.

In order to accomplish this, I had to simplify the Log interface a bit.
Previously, the Log code constructed a LogEntryPB and passed that through
to the log's appender thread, even though it had already performed all of
the serialization on the submitting thread. Doing that required that the
log entry retain references to all of the embedded protobufs, which
complicated lifetime quite a bit. The new Log interface performs all of the
serialization and analysis (including extracting the OpIds of the replicate
messages in the batch) inline in the submission path instead of doing any
such work on the Append thread. With this, the interface now just takes a
const protobuf reference instead of a unique_ptr, which means that the
caller has a simpler model around its lifetime.

With the above accomplished, it was straightforward to add a protobuf Arena
to the OpState structure and allocate the CommitMsg and its constituent
sub-messages from that Arena.

The performance benefits are substantial.
I benchmarked on a local machine using:

$ kudu perf loadgen localhost -num_rows_per_thread=1000 -num_threads=8

and ran the tserver under `perf stat` to collect counters:

Without patch:

INSERT report
    rows total: 8000
    time total: 35860.9 ms
    time per row: 0.000448261 ms

 Performance counter stats for './build/latest/bin/kudu tserver run -fs_wal_dir /tmp/ts':

         378784.92 msec task-clock                #    8.453 CPUs utilized
           1429039      context-switches          #    0.004 M/sec
            132930      cpu-migrations            #    0.351 K/sec
           3128091      page-faults               #    0.008 M/sec
     1553122880821      cycles                    #    4.100 GHz                      (83.24%)
      313764365792      stalled-cycles-frontend   #   20.20% frontend cycles idle     (83.33%)
      166769392663      stalled-cycles-backend    #   10.74% backend cycles idle      (83.39%)
      943534760864      instructions              #    0.61  insn per cycle
                                                  #    0.33  stalled cycles per insn  (83.34%)
      170465210875      branches                  #  450.032 M/sec                    (83.39%)
         834101556      branch-misses             #    0.49% of all branches          (83.32%)

       357.520042000 seconds user
        21.770448000 seconds sys

With patch:

INSERT report
    rows total: 8000
    time total: 32701 ms
    time per row: 0.000408763 ms

 Performance counter stats for './build/latest/bin/kudu tserver run -fs_wal_dir /tmp/ts':

         272393.27 msec task-clock                #    4.915 CPUs utilized
            300768      context-switches          #    0.001 M/sec
             44879      cpu-migrations            #    0.165 K/sec
           2861143      page-faults               #    0.011 M/sec
     1126891932279      cycles                    #    4.137 GHz                      (83.28%)
      209167186469      stalled-cycles-frontend   #   18.56% frontend cycles idle     (83.42%)
      144156173079      stalled-cycles-backend    #   12.79% backend cycles idle      (83.34%)
      925439690437      instructions              #    0.82  insn per cycle
                                                  #    0.23  stalled cycles per insn  (83.28%)
      163672508297      branches                  #  600.868 M/sec                    (83.33%)
         655509045      branch-misses             #    0.40% of all branches          (83.34%)

       257.52199 seconds user
        15.112482000 seconds sys

Summary:
 * 9.6% throughput increase
 * 39% reduction in tserver cycles

Change-Id: I78698d4cb4944bddd8dabd6cbbf1e3a064056199
Reviewed-on: http://gerrit.cloudera.org:8080/16147
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin
[jira] [Commented] (KUDU-636) optimization: we spend a lot of time in alloc/free
[ https://issues.apache.org/jira/browse/KUDU-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154526#comment-17154526 ]

Grant Henke commented on KUDU-636:
-----------------------------------

Protobuf Arenas are now used as of
https://github.com/apache/kudu/commit/fe15af8496af524b6de3392772edeec88dc8a626
[jira] [Commented] (KUDU-636) optimization: we spend a lot of time in alloc/free
[ https://issues.apache.org/jira/browse/KUDU-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367912#comment-16367912 ]

Todd Lipcon commented on KUDU-636:
-----------------------------------

Brain dump update on some of the points above:

1) The threadpool is now LIFO as of 1.7, which should help keep a smaller number of threads with "hot" caches.
2) We've removed almost all of the per-tablet threads, and they now use shared threadpools.
3) We should still look into thread cache tuning. Perhaps we could even have Kudu automatically tune at runtime: allow large thread caches while the process memory is under its soft threshold, then tune them down as it approaches its high threshold. Then users wouldn't need to concern themselves with this tuning parameter.
4) We're on PB3 now but haven't yet switched on arena support.