empiredan opened a new issue #889:
URL: https://github.com/apache/incubator-pegasus/issues/889


   # 1. Motivation
   
   In the old metrics system, to be precise, the counter is not defined 
independently as a metric type, though it could be implemented by 
`dsn::perf_counter_number_atomic` or `dsn::perf_counter_volatile_number_atomic`.
   
   Using thread ID as the hash code,  `dsn::perf_counter_number_atomic` has 
tried to distribute the atomic increment evenly over the `DIVIDE_CONTAINER`. 
Each element in `DIVIDE_CONTAINER` is a 8-byte `std::atomic<int64_t>`. The 
final value of the counter can be got by summing all the elements of 
`DIVIDE_CONTAINER`.
   
   However, threads run on different cpus are likely to update the data within 
the same cache line  (typically 64 bytes for x86). This will lead to [false 
sharing](https://en.wikipedia.org/wiki/False_sharing) problem which will 
degrade performance. In comparison with a single `std::atomic<int64_t>` as the 
counter, it consumes more memory while sometimes getting even worse performance.
   
   Therefore, in [new metrics 
system](https://github.com/apache/incubator-pegasus/issues/756), we need new 
implementations that eliminate false sharing, to improve the performance and 
consume less memory, if possible.
   
   # 2. Implementation
   
   ## 2.1 concurrent_long_adder
   
   To eliminate false sharing, first we can introduce a new struct that aligns 
a `std::atomic<int64_t>` with a cache line. Then, allocate an array of the new 
structs in an aligned region of memory. The size of this array is equal to the 
number of cpus.
   
   Since threads run on different cpus are more likely to be hashed to the 
different cache line of this array, the probability of false sharing can be 
significantly reduced. The implementation based on this idea is called 
`concurrent_long_adder`.
   
   `concurrent_long_adder` achieves high performance at the cost of more 
memory. The more cpu cores the machines has, the more memory it will consumed. 
The relationship between the cpu cores and memory usage is as below: 
   |The number of cpus|Memory usage|
   | :----: | :----: |
   |2|128 bytes|
   |4|256 bytes|
   |8|512 bytes|
   |16|1024 bytes|
   |32|2048 bytes|
   
   Since `dsn::perf_counter_number_atomic` will consume 856 bytes, if the 
machine has less than 16 cores, `concurrent_long_adder` will consume less 
memory than `dsn::perf_counter_number_atomic`.
   
   Nevertheless, due to the memory consumption, `concurrent_long_adder` is 
recommended to be used under the scenario that a counter is updated very 
frequently.
   
   ## 2.2 striped_long_adder
   
   Inspired by 
https://github.com/apache/kudu/commit/01233222e594c9e460e541a29875998e1e3b873c, 
`striped_long_adder` is implemented to trade off between the performance and 
memory usage.
   
   The idea of this long adder is that it has a *base* 8-byte 
`std::atomic<int64_t>`. If the long adder is not updated frequently, it will 
not consume extra memory; Otherwise, once collision is detected, it will 
allocate the same memory bytes that `concurrent_long_adder` consumes in case 
the performance is degraded.
   
   Since more instructions are used to detect the collision, 
`striped_long_adder` is slower than `concurrent_long_adder`. However, it can 
balance between the performance and memory usage. Therefore, 
`striped_long_adder` is used as the default long adder and is appropriate for 
the scenario that a counter is not updated frequently.
   
   # 3. Usage
   
   For the reason that virtual function will consume extra memory and slow down 
the execution, template is used to wrap the long adder to provide the unified 
API. For details please see `long_adder_wrapper` in 
`include/dsn/utility/long_adder.h`.
   
   # 4. Performance
   
   ## 4.1 Hardware configurations
   ```
   cpu: Intel Xeon Processor (Cascadelake)
   cores: 4
   memory: 32GB
   ```
   
   ## 4.2 Test results
   
   The performance test is quite simple: a number of threads is started, and 
each thread execute a number of atomic increment executions. Besides 
`striped_long_adder` and `concurrent_long_adder`, another 2 kinds of long 
adders are introduced to compare the execution performance.
   
   `simple_long_adder` is just a wrapper of a single `std::atomic<int64_t>`. 
`divided_long_adder` is nearly the same with `dsn::perf_counter_number_atomic` 
except that virtual functions are removed.
   
   ## 4.2.1 4 threads with each 1,000,000,000 operations
   
   |The type of long adder||Duration(seconds)|
   | :----: | :----: |
   |simple_long_adder|94.9675|
   |divided_long_adder|101.167|
   |striped_long_adder|36.1651|
   |concurrent_long_adder|12.3893|
   
   ## 4.2.2 8 threads with each 1,000,000,000 operations
   |The type of long adder||Duration(seconds)|
   | :----: | :----: |
   |simple_long_adder|183.897|
   |divided_long_adder|185.945|
   |striped_long_adder|70.5208|
   |concurrent_long_adder|47.1702|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to