Amit, welcome and thank you for contributing the results from your test and opening this discussion.

I don’t think anyone is arguing that the database shouldn’t take advantage of available hardware.

A few things are important to keep in mind when considering a patch like this:

- Where the actual bottleneck in the database will be for increased write throughput. As Bowen and Benedict mentioned, the amount of work performed by the commitlog versus the accrued cost of integrating flushed SSTables into the LSM tree is dramatically weighted toward compaction. A multi-day benchmark that allows the database to accrue and incorporate a sizable amount of data is much more likely to produce measurements that approximate what users of Cassandra may experience in production use.

- Making something multi-threaded doesn’t reduce the amount of work done; it redistributes it. In a saturated system, this means resources are allocated in an environment of trade-offs. Allocating additional resources to the front door will reduce the resources available to compaction, live serving, etc. in environments where cores are not limitless and free. This is why the holistic view of performance others are speaking to is important.

- How such a change alters the balance of the database’s threading model, and where the bottleneck moves to. Users who overrun the commitlog’s capability today are likely to be even more negatively impacted by compaction overhead if backpressure is lost at the front door. The meta-point to consider is “how does this change affect the performance characteristics of a live database?”

- We also need to balance complexity and correctness in the implementation. If the patch is straightforward, has a well-defined locking scheme, and ideally a suite of randomized tests, that can help mitigate concerns related to this.

It sounds like several people would welcome such a patch for review. I just want to signpost that the gains and trade-offs aren’t always clear-cut, especially in cases where the improvement is a rebalancing of the database’s threading model rather than a reduction in the amount of work performed.

The second item you mentioned - a direct I/O path for commitlog writes - sounds like an interesting potential addition.

One thing that may be useful to post along with your patch is a result from an extended tlp-stress run that exercises both the live write path and the deferred compaction of the data written.

- Scott

On Jul 22, 2022, at 9:14 AM, Pawar, Amit <amit.pa...@amd.com> wrote:




 

Hi Benedict,

 

The whole point is that Cassandra, as software, should take advantage of the hardware wherever possible. So reducing the Commitlog bottleneck may help some workloads, though not all. I am already working on trunk now and will share the patch. If the changes look good and are not very complex, then please give your feedback. Your input might help reduce the complexity of the change, and the patch can possibly be accepted.

 

Thanks,

Amit

 

From: Benedict <bened...@apache.org>
Sent: Friday, July 22, 2022 3:56 PM
To: dev@cassandra.apache.org
Cc: Bowen Song <bo...@bso.ng>; Raghavendra, Prakash <prakash.raghaven...@amd.com>
Subject: Re: [DISCUSS] Improve Commitlog write path

 


Hi Amit,

 

I am inclined to agree with Bowen Song, in that benchmarks from an initially empty cluster tend to lean more heavily on memtable and commit log bottlenecks than a real-world long-running cluster does, as the algorithmic complexity of LSMTs begins to bite much later while the cost of the commit log and memtable stays fairly constant. The more data you have, the less commit log and memtable performance directly matter, and memtable size becomes much more important, along with compaction efficiency.

 

That said, reducing bottlenecks is still a good thing if the additional complexity is not severe - and this is still an unfortunately common way that we benchmark changes today, anyway.

 

 

On 22 Jul 2022, at 11:20, Pawar, Amit <amit.pa...@amd.com> wrote:




 

Thank you, Bowen, for your reply. It took some time to respond due to a testing issue.

 

I tested the multi-threaded feature again with record counts from 260 million to 2 billion, and the improvement still holds at around 80% of the Ramdisk score. It is still possible that compaction becomes the new bottleneck, which could be a new opportunity to fix. I am a newbie here, so it is possible that I failed to understand your suggestion completely. At least in this testing, the multi-threading benefit is reflected in the score.

 

Do you think multi-threading is good to have now? If not, please suggest whether I need to test further.

 

Thanks,

Amit

 

From: Bowen Song via dev <dev@cassandra.apache.org>
Sent: Wednesday, July 20, 2022 4:13 PM
To: dev@cassandra.apache.org
Subject: Re: [DISCUSS] Improve Commitlog write path

 


From my past experience, the bottleneck for an insert-heavy workload is likely to be compaction, not the commit log. You may initially see the commit log as the bottleneck while the table size is relatively small, but as the table size increases, compaction will likely take its place and become the new bottleneck.

On 20/07/2022 11:11, Pawar, Amit wrote:


 

Hi all,

 

(My previous mail did not appear on the mailing list, so I am resending it after 2 days.)

 

I am Amit, working at AMD Bangalore, India. I am new to Cassandra and need to do Cassandra testing on large-core systems. Such testing should usually be done on multi-node Cassandra, but I started with single-node testing to understand how Cassandra scales with increasing core counts.

 

Test details:

Operation: Insert > 90% (insert heavy)

Operation: Scan < 10%

Cassandra: 3.11.10 and trunk

Benchmark: TPCx-IOT (similar to YCSB)

 

Results show that scaling is almost linear up to 16 cores but poor beyond that. The following common settings helped to get better scores:

1. Memtable heap allocation: offheap_objects

2. memtable_flush_writers > 4

3. Java heap: 8-32GB with survivor ratio tuning

4. Separate storage space for Commitlog and Data.

 

Many online blogs suggest adding a new Cassandra node when the cluster is unable to take high write rates. But a large system should be able to take high write rates easily thanks to its many cores; the need here was to improve scaling with more cores, so this suggestion didn’t help. After many rounds of testing, it was observed that the current implementation uses a single thread for the Commitlog syncing activity. Commitlog files are mapped using the mmap system call, and changes are written back with msync (a minimal sketch of this write path follows the list below). Watching the periodic syncing thread with the JVisualVM tool shows that:

1. the thread is not 100% busy when a Ramdisk is used for Commitlog storage, and scaling improves on large systems; Ramdisk scores are > 2x the NVMe score.

2. the thread becomes 100% busy when an NVMe disk is used for the Commitlog, and the score does not improve much beyond 16 cores.
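
For readers unfamiliar with this write path, here is a minimal, self-contained Java sketch of the mmap-plus-msync pattern described above; it is illustrative only and not Cassandra's actual CommitLogSegment code:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Minimal sketch of an mmap-based log write path: writes land as dirty 4K
// pages via the mapping, and force() (msync under the hood) makes them
// durable. The single syncing thread is the one performing the force() step.
public class MmapSyncSketch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("segment.log", "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer segment = ch.map(FileChannel.MapMode.READ_WRITE, 0, 32 << 20);
            segment.put("mutation bytes".getBytes()); // cheap: dirties pages in memory
            segment.force(); // expensive: msync flushes the dirty pages to disk
        }
    }
}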

 

The Linux kernel uses 4K pages for memory mapped with the mmap system call. So, to understand this further, disk I/O testing was done using the fio tool (a rough Java analogue of the experiment is sketched after the list), and the results show that:

1. NVMe 4K random R/W throughput is very low with a single thread and improves with multiple threads.

2. Ramdisk 4K random R/W throughput is good even with a single thread, and better still with multiple threads.
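
To make the single- versus multi-threaded contrast concrete, below is a crude, hypothetical Java analogue of that fio experiment. One caveat: unlike fio with --direct=1, these writes go through the page cache, so absolute numbers will differ; it only illustrates the shape of the test.

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical 4K random-write micro-benchmark: N threads issue positional
// 4K writes at random offsets into a shared file, then aggregate MiB/s is
// reported. These writes pass through the page cache, unlike fio --direct=1.
public class RandomWriteBench {
    static final int BLOCK = 4096, WRITES_PER_THREAD = 10_000;
    static final long FILE_SIZE = 1L << 30; // spread offsets over 1 GiB

    public static void main(String[] args) throws Exception {
        int threads = args.length > 0 ? Integer.parseInt(args[0]) : 8;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                ByteBuffer block = ByteBuffer.allocateDirect(BLOCK);
                try (FileChannel ch = FileChannel.open(Paths.get("bench.dat"),
                        StandardOpenOption.WRITE, StandardOpenOption.CREATE)) {
                    ThreadLocalRandom rnd = ThreadLocalRandom.current();
                    for (int i = 0; i < WRITES_PER_THREAD; i++) {
                        long offset = rnd.nextLong(FILE_SIZE / BLOCK) * BLOCK;
                        block.clear();
                        ch.write(block, offset); // positional 4K write
                    }
                    return null;
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        double secs = (System.nanoTime() - start) / 1e9;
        long bytes = (long) threads * WRITES_PER_THREAD * BLOCK;
        System.out.printf("%d threads: %.1f MiB/s%n", threads, bytes / secs / (1 << 20));
    }
}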

 

Based on the fio test results, the following two ideas were tested for Commitlog files with the Cassandra-3.11.10 sources:

1. Enable the Direct I/O feature for Commitlog files (similar to [CASSANDRA-14466] Enable Direct I/O - ASF JIRA (apache.org)).

2. Enable multi-threaded syncing for Commitlog files.

 

The first one needs to be retested. Interestingly, the second one helped to improve the score with the NVMe disk: the NVMe configuration score is within 80-90% of the Ramdisk score and 2 times that of the single-threaded implementation. Multithreading was enabled by adding a new thread pool in the “AbstractCommitLogSegmentManager” class, and the existing syncing thread was changed into a manager thread for this new pool to take care of synchronization (a rough sketch of the idea follows below). This was only tested with Cassandra-3.11.10 and needs complete testing, but the change is working in my test environment. I tried these few experiments so that I could discuss them here and seek your valuable suggestions to identify the right fix for insert-heavy workloads.
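
Since the patch is not attached here yet, the following is only a rough, hypothetical sketch of the idea (class and method names are illustrative, not the actual change): the manager thread fans the msync work out to a small pool instead of calling force() on every dirty segment itself.

import java.nio.MappedByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of multi-threaded Commitlog syncing: each force()
// call (an msync underneath) runs on a pool thread, while the manager
// thread waits for all of them, preserving the "synced up to this point"
// barrier that callers rely on.
class ParallelCommitLogSync {
    private final ExecutorService syncPool = Executors.newFixedThreadPool(4);

    void syncAll(List<MappedByteBuffer> dirtySegments) throws Exception {
        List<Future<?>> pending = new ArrayList<>();
        for (MappedByteBuffer segment : dirtySegments)
            pending.add(syncPool.submit(segment::force)); // msync in parallel
        for (Future<?> f : pending)
            f.get(); // block until every segment is durable
    }
}

How the pool size should interact with the rest of the threading model is exactly the kind of trade-off raised earlier in this thread.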

 

1. Is it a good idea to convert the single-threaded syncing into a multi-threaded implementation to improve disk I/O?

2. Direct I/O throughput is high even with a single thread, and it is a good fit for the Commitlog case due to the file sizes involved. This would improve writes on small to large systems. Would it be good to bring this support to Commitlog files? (A hedged sketch of such a write path follows below.)
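
As a reference for discussion only, here is a hedged sketch of what such a direct I/O write could look like on JDK 10+ using com.sun.nio.file.ExtendedOpenOption.DIRECT; the 4K alignment value is an assumption and would need to match the device's logical block size:

import com.sun.nio.file.ExtendedOpenOption;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hedged sketch of a direct-I/O log write: O_DIRECT bypasses the page cache,
// but requires the buffer address, file offset, and transfer size to all be
// aligned to the block size (assumed to be 4K here).
public class DirectIoSketch {
    public static void main(String[] args) throws Exception {
        final int ALIGN = 4096;
        try (FileChannel ch = FileChannel.open(Paths.get("commitlog.seg"),
                StandardOpenOption.WRITE, StandardOpenOption.CREATE,
                ExtendedOpenOption.DIRECT)) {
            // Over-allocate, then carve out a block-aligned slice.
            ByteBuffer buf = ByteBuffer.allocateDirect(ALIGN * 2).alignedSlice(ALIGN);
            buf.put("mutation bytes".getBytes());
            buf.position(0).limit(ALIGN); // must write a whole aligned block
            ch.write(buf, 0);             // skips the page cache (note: this alone
                                          // does not flush the device's write cache)
        }
    }
}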

 

Please suggest.

 

Thanks,

Amit Pawar
