[jira] [Commented] (KUDU-636) optimization: we spend a lot of time in alloc/free

2020-07-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155725#comment-17155725
 ] 

ASF subversion and git services commented on KUDU-636:
--

Commit a600f386aa2c341522638acb9af53fd45c469431 in kudu's branch 
refs/heads/master from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=a600f38 ]

KUDU-636. Use Arena for EncodedKeys

This updates EncodedKeyBuilder, RowSetKeyProbe, and EncodedKey to always
allocate from an Arena instead of from the heap. This reduces allocator
contention on the write path significantly and improves memory locality.
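
The idea of allocating keys from an arena rather than the heap can be sketched
as below. This is a minimal illustration only: SimpleArena and CopyKeyToArena
are hypothetical names made up for this example and are not Kudu's Arena,
EncodedKey, or EncodedKeyBuilder classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>
#include <vector>

// Hypothetical bump-pointer arena (not Kudu's Arena class). Memory is carved
// out of large blocks and released all at once when the arena is destroyed,
// so individual keys never go through the general-purpose allocator.
class SimpleArena {
 public:
  explicit SimpleArena(size_t block_size = 4096) : block_size_(block_size) {}

  // Returns 'n' bytes owned by the arena. The bytes are not specially
  // aligned, which is sufficient for raw key data.
  char* Allocate(size_t n) {
    if (blocks_.empty() || used_ + n > cap_) {
      // Start a new block, oversized if the request does not fit a normal one.
      cap_ = n > block_size_ ? n : block_size_;
      blocks_.push_back(std::make_unique<char[]>(cap_));
      used_ = 0;
    }
    char* p = blocks_.back().get() + used_;
    used_ += n;
    return p;
  }

  size_t num_blocks() const { return blocks_.size(); }

 private:
  size_t block_size_;
  size_t cap_ = 0;   // capacity of the current block
  size_t used_ = 0;  // bytes already handed out from the current block
  std::vector<std::unique_ptr<char[]>> blocks_;
};

// Copies 'key' into the arena and returns a view over the arena-owned copy.
std::string_view CopyKeyToArena(SimpleArena* arena, std::string_view key) {
  assert(arena != nullptr);
  char* dst = arena->Allocate(key.size());
  std::memcpy(dst, key.data(), key.size());
  return std::string_view(dst, key.size());
}
```

In this sketch, consecutive keys land next to each other in the same block,
which is where the locality benefit comes from, and no per-key free is needed.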

I measured by running a tserver under 'perf stat' while using 'kudu perf
loadgen' to insert 80M rows total using 8 client threads. The CPU time on the
tserver was reduced by about 20%.

Before:

 Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir /tmp/ts':

         269853.10 msec task-clock                #    6.862 CPUs utilized
            293066      context-switches          #    0.001 M/sec
             44541      cpu-migrations            #    0.165 K/sec
           2846435      page-faults               #    0.011 M/sec
     1110190206891      cycles                    #    4.114 GHz                      (83.33%)
      201895623339      stalled-cycles-frontend   #   18.19% frontend cycles idle     (83.33%)
      137095475307      stalled-cycles-backend    #   12.35% backend cycles idle      (83.32%)
      894201276095      instructions              #    0.81  insn per cycle
                                                  #    0.23  stalled cycles per insn  (83.33%)
      159095264762      branches                  #  589.562 M/sec                    (83.35%)
         639216492      branch-misses             #    0.40% of all branches          (83.35%)

     255.178068000 seconds user
      14.913394000 seconds sys

After:

 Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir /tmp/ts':

         227730.62 msec task-clock                #    6.212 CPUs utilized
            263824      context-switches          #    0.001 M/sec
             45470      cpu-migrations            #    0.200 K/sec
           3165436      page-faults               #    0.014 M/sec
      931840588715      cycles                    #    4.092 GHz                      (83.25%)
      183214671009      stalled-cycles-frontend   #   19.66% frontend cycles idle     (83.40%)
      111864991317      stalled-cycles-backend    #   12.00% backend cycles idle      (83.35%)
      832636863971      instructions              #    0.89  insn per cycle
                                                  #    0.22  stalled cycles per insn  (83.40%)
      148228107120      branches                  #  650.892 M/sec                    (83.24%)
         563344647      branch-misses             #    0.38% of all branches          (83.35%)

     211.361472000 seconds user
      16.635265000 seconds sys

Change-Id: Ib46d0e2c31e03a7f319ceb0bf742e08ff74d7683
Reviewed-on: http://gerrit.cloudera.org:8080/16162
Reviewed-by: Alexey Serbin 
Tested-by: Todd Lipcon 


> optimization: we spend a lot of time in alloc/free
> --
>
> Key: KUDU-636
> URL: https://issues.apache.org/jira/browse/KUDU-636
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Affects Versions: Public beta
>Reporter: Todd Lipcon
>Priority: Major
>
> Looking at a workload in the cluster, several of the top 10 lines of perf 
> report are tcmalloc-related. It seems like we don't do a good job of making 
> use of the per-thread free-lists, and we end up in a lot of contention on the 
> central free list. There are a few low-hanging fruit things we could do to 
> improve this for a likely perf boost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-636) optimization: we spend a lot of time in alloc/free

2020-07-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154923#comment-17154923
 ] 

ASF subversion and git services commented on KUDU-636:
--

Commit 9287bc2095bfea2713cd743dc3f5bb4cd0f41476 in kudu's branch 
refs/heads/master from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=9287bc2 ]

KUDU-636. Use protobuf arenas for CommitMsgs

This commit optimizes the write path by allowing protobuf arenas to be
used to construct the operation results protobuf and the CommitMsg that
contains it. The operation result for a large batch of writes has one or
more embedded protobuf per inserted row, so using a protobuf arena for
allocation is much more efficient than calling into the system allocator
for each object.
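
The allocation pattern described above (many small per-row objects that all
die together) can be illustrated without protobuf. The sketch below uses the
C++17 std::pmr::monotonic_buffer_resource as a stand-in for a protobuf arena;
BuildResults and the nested-vector "result" type are hypothetical and are not
the actual operation result or CommitMsg protobufs.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <vector>

// Stand-in for "one embedded result object per inserted row". Each inner
// vector plays the role of a small per-row sub-message.
using RowResult = std::pmr::vector<int64_t>;
using BatchResult = std::pmr::vector<RowResult>;

// Builds one small result object per row. Every allocation, including those
// made by the nested per-row vectors, is served by 'arena'.
BatchResult BuildResults(std::pmr::memory_resource* arena, size_t num_rows) {
  BatchResult results(arena);
  results.reserve(num_rows);
  for (size_t i = 0; i < num_rows; ++i) {
    // polymorphic_allocator uses uses-allocator construction, so the nested
    // vector created here inherits the same arena.
    RowResult& row = results.emplace_back();
    row.push_back(static_cast<int64_t>(i));
  }
  return results;
}
```

With a monotonic resource, each allocation is a pointer bump and deallocation
is a no-op until the whole resource is released, which is the same trade-off a
protobuf arena makes for a batch of messages with a shared lifetime.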

In order to accomplish this, I had to simplify the Log interface a bit.
Previously, the Log code constructed a LogEntryPB and passed that
through to the log's appender thread, even though it had already
performed all of the serialization on the submitting thread. Doing that
required that the log entry retain references to all of the embedded
protobufs, which complicated lifetime quite a bit.

The new Log interface performs all of the serialization and analysis
(including extracting the OpIds of the replicate messages in the batch)
inline in the submission path instead of doing any such work on the
Append thread. With this, the interface now just takes a const protobuf
reference instead of a unique_ptr, which means that the
caller has a simpler model around its lifetime.
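
The lifetime simplification can be sketched as follows. This is an
illustration under assumptions, not the real Log API: Entry, SerializedBatch,
and SimpleLog are hypothetical types, and the queue here is a plain vector
where the real code would hand work to an appender thread.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical entry type standing in for a log entry protobuf.
struct Entry {
  int64_t op_index;
  std::string payload;
  std::string Serialize() const {
    return std::to_string(op_index) + ":" + payload;
  }
};

// What the submitting thread leaves behind for the appender: already
// serialized bytes plus the metadata extracted during submission.
struct SerializedBatch {
  std::string bytes;
  std::vector<int64_t> op_indexes;
};

class SimpleLog {
 public:
  // Serializes and analyzes 'entry' inline. No pointer or reference to
  // 'entry' is retained, so the caller may destroy it (or the arena it lives
  // in) as soon as this call returns.
  void Append(const Entry& entry) {
    SerializedBatch batch;
    batch.bytes = entry.Serialize();
    batch.op_indexes.push_back(entry.op_index);
    queue_.push_back(std::move(batch));
  }

  const std::vector<SerializedBatch>& queue() const { return queue_; }

 private:
  std::vector<SerializedBatch> queue_;
};
```

Because only bytes and extracted indexes cross the thread boundary in this
sketch, the entry itself can be allocated wherever is cheapest for the caller.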

With the above accomplished, it was straightforward to add a protobuf
Arena to the OpState structure and allocate the CommitMsg and its
constituent sub-messages from that Arena.

The performance benefits are substantial. I benchmarked on a local
machine using:

  $ kudu perf loadgen localhost -num_rows_per_thread=10000000 -num_threads=8

and ran the tserver under `perf stat` to collect counters:

Without patch:

  INSERT report
  rows total: 80000000
  time total: 35860.9 ms
time per row: 0.000448261 ms

   Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir /tmp/ts':

           378784.92 msec task-clock                #    8.453 CPUs utilized
             1429039      context-switches          #    0.004 M/sec
              132930      cpu-migrations            #    0.351 K/sec
             3128091      page-faults               #    0.008 M/sec
       1553122880821      cycles                    #    4.100 GHz                      (83.24%)
        313764365792      stalled-cycles-frontend   #   20.20% frontend cycles idle     (83.33%)
        166769392663      stalled-cycles-backend    #   10.74% backend cycles idle      (83.39%)
        943534760864      instructions              #    0.61  insn per cycle
                                                    #    0.33  stalled cycles per insn  (83.34%)
        170465210875      branches                  #  450.032 M/sec                    (83.39%)
           834101556      branch-misses             #    0.49% of all branches          (83.32%)

       357.520042000 seconds user
        21.770448000 seconds sys

With patch:

  INSERT report
  rows total: 80000000
  time total: 32701 ms
time per row: 0.000408763 ms

   Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir /tmp/ts':

           272393.27 msec task-clock                #    4.915 CPUs utilized
              300768      context-switches          #    0.001 M/sec
               44879      cpu-migrations            #    0.165 K/sec
             2861143      page-faults               #    0.011 M/sec
       1126891932279      cycles                    #    4.137 GHz                      (83.28%)
        209167186469      stalled-cycles-frontend   #   18.56% frontend cycles idle     (83.42%)
        144156173079      stalled-cycles-backend    #   12.79% backend cycles idle      (83.34%)
        925439690437      instructions              #    0.82  insn per cycle
                                                    #    0.23  stalled cycles per insn  (83.28%)
        163672508297      branches                  #  600.868 M/sec                    (83.33%)
           655509045      branch-misses             #    0.40% of all branches          (83.34%)

       257.52199 seconds user
        15.112482000 seconds sys

Summary:
* 9.6% throughput increase
* 39% reduction in tserver cycles

Change-Id: I78698d4cb4944bddd8dabd6cbbf1e3a064056199
Reviewed-on: http://gerrit.cloudera.org:8080/16147
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin 
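
For readers who want to re-derive the summary figures from the quoted
counters, the arithmetic can be sketched as below. The helper names are
hypothetical; the constants are copied from the "Without patch" and "With
patch" runs above.

```cpp
// Percentage by which 'after' is lower than 'before'.
constexpr double PercentReduction(double before, double after) {
  return 100.0 * (before - after) / before;
}

// Percentage throughput gain when the same amount of work takes 'after_ms'
// instead of 'before_ms'.
constexpr double PercentThroughputGain(double before_ms, double after_ms) {
  return 100.0 * (before_ms / after_ms - 1.0);
}

// Figures copied from the two runs quoted in the commit message.
constexpr double kTimeBeforeMs = 35860.9;
constexpr double kTimeAfterMs = 32701.0;
constexpr double kCyclesBefore = 1553122880821.0;
constexpr double kCyclesAfter = 1126891932279.0;
```

PercentThroughputGain(kTimeBeforeMs, kTimeAfterMs) comes to roughly 9.7, in
line with the throughput figure in the summary. PercentReduction(kCyclesBefore,
kCyclesAfter) comes to roughly 27.4 (equivalently, the unpatched run used about
38% more cycles than the patched one), so the 39% in the summary does not
follow directly from these two counters and may reflect a different measure or
run.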



[jira] [Commented] (KUDU-636) optimization: we spend a lot of time in alloc/free

2020-07-09 Thread Grant Henke (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154526#comment-17154526
 ] 

Grant Henke commented on KUDU-636:
--

Protobuf Arenas are now used as of 
https://github.com/apache/kudu/commit/fe15af8496af524b6de3392772edeec88dc8a626



[jira] [Commented] (KUDU-636) optimization: we spend a lot of time in alloc/free

2018-02-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367912#comment-16367912
 ] 

Todd Lipcon commented on KUDU-636:
--

Brain dump update on some of the points above:
1) The threadpool is LIFO as of 1.7, which should help keep a smaller number of
threads with "hot" caches.
2) We've removed almost all of the per-tablet threads; they now use shared
threadpools.
3) We should still look into thread cache tuning. Perhaps Kudu could even tune
it automatically at runtime, allowing large thread caches while the process
memory is under its soft threshold and shrinking them as it approaches its high
threshold. Users would then not need to concern themselves with this tuning
parameter.
4) We're on PB3 now but haven't yet switched on arena support.
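
The runtime tuning idea in point 3 could look something like the policy
sketched below. Everything here is an assumption made for illustration: the
function name, the linear interpolation, and the parameters are hypothetical
and do not describe existing Kudu behavior. Applying the result would go
through the allocator's own tuning hook (with gperftools tcmalloc, possibly the
"tcmalloc.max_total_thread_cache_bytes" property), which is not shown.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical policy: allow the largest thread cache while memory usage is
// at or below the soft threshold, the smallest at or above the high
// threshold, and interpolate linearly in between.
int64_t TargetThreadCacheBytes(int64_t used_bytes,
                               int64_t soft_threshold_bytes,
                               int64_t high_threshold_bytes,
                               int64_t min_cache_bytes,
                               int64_t max_cache_bytes) {
  assert(soft_threshold_bytes < high_threshold_bytes);
  assert(min_cache_bytes <= max_cache_bytes);
  if (used_bytes <= soft_threshold_bytes) return max_cache_bytes;
  if (used_bytes >= high_threshold_bytes) return min_cache_bytes;
  // Fraction of the way from the soft threshold to the high threshold.
  const double frac =
      static_cast<double>(used_bytes - soft_threshold_bytes) /
      static_cast<double>(high_threshold_bytes - soft_threshold_bytes);
  const double shrink =
      frac * static_cast<double>(max_cache_bytes - min_cache_bytes);
  return max_cache_bytes - static_cast<int64_t>(shrink);
}
```

A background task could call such a function periodically with current memory
usage and push the result to the allocator, so that operators never have to
set the cache size by hand.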



