Quick poll on content
Hi everyone,

Yesterday I did a live stream on "GenAI for Cassandra Teams"; you can watch it on YouTube[1]. I love creating content that helps you work through problems or new things. GenAI has been hitting Cassandra teams with requests for new app features, and there are a lot of topics I could cover there.

I put together a quick poll and would love your feedback: https://www.surveymonkey.com/r/S2XLR7B

It's just one question, "What kind of content would be helpful for you?", with multiple checkboxes. If you don't see what you are looking for, add something in the "Other" box.

Thanks for your time!

Patrick

1: https://www.youtube.com/live/k7EBhN_xXHA?si=H5iN27qUinx-bH6b
Re: Trouble with using group commitlog_sync
Okay, that proves I was wrong about the client-side bottleneck.

On 24/04/2024 17:55, Nathan Marz wrote:

I tried running two client processes in parallel and the numbers were unchanged. The max throughput is still a single client doing 10 in-flight BatchStatements, each containing 100 inserts.

On Tue, Apr 23, 2024 at 10:24 PM Bowen Song via user wrote:

You might have run into the bottleneck of the driver's IO thread. Try increasing the driver's connections-per-server limit to 2 or 3 if you've only got 1 server in the cluster. Alternatively, run two client processes in parallel.

On 24/04/2024 07:19, Nathan Marz wrote:

Tried it again with one more client thread, and that had no effect on performance. This is unsurprising as there are only 2 CPUs on this node and they were already at 100%. These were good ideas, but I'm still unable to even match the performance of batch commit mode with group commit mode.

On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user wrote:

To achieve 10k loop iterations per second, each iteration must take 0.1 milliseconds or less. Considering that each iteration needs to lock and unlock the semaphore (two syscalls) and make network requests (more syscalls), that's a lot of context switches. It may be a bit too much to ask of a single thread. I would suggest trying multi-threading or multi-processing to see if the combined insert rate is higher.

I should also note that executeAsync() has implicit limits on the number of in-flight requests, which default to 1024 requests per connection and 1 connection per server. See https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/

On 23/04/2024 23:18, Nathan Marz wrote:

It's using the async API, so why would it need multiple threads? Using the exact same approach I'm able to get 38k / second with periodic commitlog_sync. For what it's worth, I do see 100% CPU utilization in every single one of these tests.

On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user wrote:

Have you checked the thread CPU utilisation on the client side? You will likely need more than one thread doing insertions in a loop to achieve tens of thousands of inserts per second.

On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms, concurrent_writes at 512, and doing 1000 individual inserts at a time with the same loop + semaphore approach. This only nets 9k / second.

I got much higher throughput for the other modes with BatchStatement of 100 inserts rather than 100x more individual inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user wrote:

I suspect you are abusing batch statements. Batch statements should only be used where atomicity or isolation is needed. Using batch statements won't make inserting into multiple partitions faster. In fact, it will often make it slower.

Also, the linear relationship between commitlog_sync_group_window and write throughput is expected. That's because the max number of uncompleted writes is limited by the write concurrency, and a write is not considered "complete" before it is synced to disk when commitlog sync is in group or batch mode. That means within each interval, only a limited number of writes can be done. The ways to increase that include: adding more nodes, syncing the commitlog at shorter intervals, and allowing more concurrent writes.

On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and set commitlog_sync_group_window to 20ms. This causes a single execute of a BatchStatement containing 100 inserts to succeed. However, the throughput I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatements concurrently using the semaphore + loop approach I showed in my first message. So as requests complete, more are sent out such that there are 10 in-flight at a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730 inserts / second. Again, with periodic mode I see 38k / second and with batch I see 14k / second. My expectation was that group commit mode throughput would be somewhere between
Re: Trouble with using group commitlog_sync
I tried running two client processes in parallel and the numbers were unchanged. The max throughput is still a single client doing 10 in-flight BatchStatements, each containing 100 inserts.

On Tue, Apr 23, 2024 at 10:24 PM Bowen Song via user <user@cassandra.apache.org> wrote:

> You might have run into the bottleneck of the driver's IO thread. Try increasing the driver's connections-per-server limit to 2 or 3 if you've only got 1 server in the cluster. Alternatively, run two client processes in parallel.
>
> On 24/04/2024 07:19, Nathan Marz wrote:
>
> Tried it again with one more client thread, and that had no effect on performance. This is unsurprising as there are only 2 CPUs on this node and they were already at 100%. These were good ideas, but I'm still unable to even match the performance of batch commit mode with group commit mode.
>
> On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user <user@cassandra.apache.org> wrote:
>
>> To achieve 10k loop iterations per second, each iteration must take 0.1 milliseconds or less. Considering that each iteration needs to lock and unlock the semaphore (two syscalls) and make network requests (more syscalls), that's a lot of context switches. It may be a bit too much to ask of a single thread. I would suggest trying multi-threading or multi-processing to see if the combined insert rate is higher.
>>
>> I should also note that executeAsync() has implicit limits on the number of in-flight requests, which default to 1024 requests per connection and 1 connection per server. See https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/
>>
>> On 23/04/2024 23:18, Nathan Marz wrote:
>>
>> It's using the async API, so why would it need multiple threads? Using the exact same approach I'm able to get 38k / second with periodic commitlog_sync. For what it's worth, I do see 100% CPU utilization in every single one of these tests.
>>
>> On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>>
>>> Have you checked the thread CPU utilisation on the client side? You will likely need more than one thread doing insertions in a loop to achieve tens of thousands of inserts per second.
>>>
>>> On 23/04/2024 21:55, Nathan Marz wrote:
>>>
>>> Thanks for the explanation.
>>>
>>> I tried again with commitlog_sync_group_window at 2ms, concurrent_writes at 512, and doing 1000 individual inserts at a time with the same loop + semaphore approach. This only nets 9k / second.
>>>
>>> I got much higher throughput for the other modes with BatchStatement of 100 inserts rather than 100x more individual inserts.
>>>
>>> On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>>>
>>>> I suspect you are abusing batch statements. Batch statements should only be used where atomicity or isolation is needed. Using batch statements won't make inserting into multiple partitions faster. In fact, it will often make it slower.
>>>>
>>>> Also, the linear relationship between commitlog_sync_group_window and write throughput is expected. That's because the max number of uncompleted writes is limited by the write concurrency, and a write is not considered "complete" before it is synced to disk when commitlog sync is in group or batch mode. That means within each interval, only a limited number of writes can be done. The ways to increase that include: adding more nodes, syncing the commitlog at shorter intervals, and allowing more concurrent writes.
>>>>
>>>> On 23/04/2024 20:43, Nathan Marz wrote:
>>>>
>>>> Thanks. I raised concurrent_writes to 128 and set commitlog_sync_group_window to 20ms. This causes a single execute of a BatchStatement containing 100 inserts to succeed. However, the throughput I'm seeing is atrocious.
>>>>
>>>> With these settings, I'm executing 10 BatchStatements concurrently using the semaphore + loop approach I showed in my first message. So as requests complete, more are sent out such that there are 10 in-flight at a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730 inserts / second. Again, with periodic mode I see 38k / second and with batch I see 14k / second. My expectation was that group commit mode throughput would be somewhere between those two.
>>>>
>>>> If I set commitlog_sync_group_window to 100ms, the throughput drops to 14 / second.
>>>>
>>>> If I set commitlog_sync_group_window to 10ms, the throughput increases to 1587 / second.
>>>>
>>>> If I set commitlog_sync_group_window to 5ms, the throughput increases to 3200 / second.
>>>>
>>>> If I set commitlog_sync_group_window to 1ms, the throughput increases to 13k / second, which is slightly less than batch commit mode.
>>>>
>>>> Is group commit mode supposed to have better performance than batch mode?
>>>>
>>>> On Tue, Apr 23,
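The group-window figures quoted in this message can be sanity-checked with a little arithmetic. The sketch below is not from the thread: it multiplies each reported throughput by its window to get the average number of inserts completed per sync window. The helper name inserts_per_window is made up for illustration, and "13k" is taken as 13000. For the 1 ms to 20 ms settings the product stays in a narrow band (roughly 13 to 16 inserts per window), which is the window-bound behaviour described in the quoted replies; the 100 ms figure gives only 1.4 and does not fit that pattern.

```python
# Measurements quoted in the thread:
# (commitlog_sync_group_window in ms, observed inserts per second).
measurements = [
    (100, 14),
    (20, 730),
    (10, 1587),
    (5, 3200),
    (1, 13000),  # reported as "13k"; exact figure is an assumption
]


def inserts_per_window(window_ms, inserts_per_second):
    """Average number of inserts completed during one group window."""
    return inserts_per_second * window_ms / 1000


if __name__ == "__main__":
    for window_ms, rate in measurements:
        per_window = inserts_per_window(window_ms, rate)
        print(f"{window_ms:>4} ms  {rate:>6}/s  {per_window:6.2f} inserts per window")
```

If throughput were limited by CPU or by the client, halving the window would not be expected to roughly double the rate the way it does between 20 ms and 5 ms.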
Re: Mixed Cluster 4.0 and 4.1
Hi Paul,

IMO, if they are truly risk-averse, they should follow the tested and proven best practices instead of doing things in a less tested way that is also known to endanger data correctness.

If they must do this over a long period of time, then they may need to temporarily increase gc_grace_seconds on all tables, and ensure that no DDL or repair is run before the upgrade completes. It is unknown whether this route is safe, because it's a less tested way to upgrade a cluster.

Please be aware that if they do deletes frequently, increasing gc_grace_seconds may cause some reads to fail due to the elevated number of tombstones.

Cheers,
Bowen

On 24/04/2024 17:25, Paul Chandler wrote:

Hi Bowen,

Thanks for your quick reply.

Sorry, I used the wrong term there; it is a maintenance window rather than an outage. This is a key system, and its vital nature means that the customer is rightly very risk-averse, so we will only ever get permission to upgrade one DC per night via a rolling upgrade, meaning this will always take more than a week.

So we can’t shorten the time the cluster is in mixed mode, but I am concerned about having a schema mismatch for this long.

Should I be concerned, or have others upgraded in a similar way?

Thanks

Paul

On 24 Apr 2024, at 17:02, Bowen Song via user wrote:

Hi Paul,

You don't need to plan for or introduce an outage for a rolling upgrade, which is the preferred route. It isn't advisable to take down an entire DC to do an upgrade.

You should aim to complete upgrading the entire cluster and finish a full repair within the shortest gc_grace_seconds (default 10 days) of all tables. Failing to do that may cause data resurrection.

During the rolling upgrade, you should not run repair or any DDL query (such as ALTER TABLE, TRUNCATE, etc.).

You don't need to do the rolling upgrade node by node. You can do it rack by rack. Stopping all nodes in a single rack and upgrading them concurrently is much faster. The number of nodes doesn't matter that much to the time required to complete a rolling upgrade; it's the number of DCs and racks that matters.

Cheers,
Bowen

On 24/04/2024 16:16, Paul Chandler wrote:

Hi all,

We have some large clusters (1000+ nodes) spread across multiple datacenters.

When we perform upgrades, we would normally upgrade one DC at a time during a planned outage for that DC. This means that a cluster might be in mixed mode with multiple versions for a week or two.

We have noticed during our testing that upgrading to 4.1 causes a schema mismatch due to the new tables added to the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several weeks? I assume that running any DDL during that time would be a bad idea; are there any other issues to look out for?

Thanks

Paul Chandler
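The deadline discussed in this message (finish the upgrade and a full repair within the shortest gc_grace_seconds) can be checked against a schedule with simple date arithmetic. This is a minimal sketch, not something from the thread: the function name fits_in_gc_grace and the example figures (8 DCs, one night each, 3 days of repair) are hypothetical, and only the 10-day default (864000 seconds) comes from the discussion.

```python
from datetime import timedelta

# Cassandra's default gc_grace_seconds is 864000 seconds, i.e. 10 days.
DEFAULT_GC_GRACE = timedelta(seconds=864_000)


def fits_in_gc_grace(dc_count, nights_per_dc, repair_days, gc_grace=DEFAULT_GC_GRACE):
    """Return (fits, total) for an upgrade done one DC at a time,
    followed by a full repair once every node is upgraded."""
    total = timedelta(days=dc_count * nights_per_dc + repair_days)
    return total <= gc_grace, total


if __name__ == "__main__":
    # Hypothetical plan: 8 DCs, one night per DC, then 3 days of repair.
    fits, total = fits_in_gc_grace(dc_count=8, nights_per_dc=1, repair_days=3)
    print(f"plan takes {total.days} days, fits in gc_grace_seconds: {fits}")
```

With the table's smallest gc_grace_seconds passed as gc_grace, the same check shows how far the value would have to be raised temporarily for a longer schedule.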
Re: Mixed Cluster 4.0 and 4.1
Hi Bowen,

Thanks for your quick reply.

Sorry, I used the wrong term there; it is a maintenance window rather than an outage. This is a key system, and its vital nature means that the customer is rightly very risk-averse, so we will only ever get permission to upgrade one DC per night via a rolling upgrade, meaning this will always take more than a week.

So we can’t shorten the time the cluster is in mixed mode, but I am concerned about having a schema mismatch for this long.

Should I be concerned, or have others upgraded in a similar way?

Thanks

Paul

> On 24 Apr 2024, at 17:02, Bowen Song via user wrote:
>
> Hi Paul,
>
> You don't need to plan for or introduce an outage for a rolling upgrade, which is the preferred route. It isn't advisable to take down an entire DC to do an upgrade.
>
> You should aim to complete upgrading the entire cluster and finish a full repair within the shortest gc_grace_seconds (default 10 days) of all tables. Failing to do that may cause data resurrection.
>
> During the rolling upgrade, you should not run repair or any DDL query (such as ALTER TABLE, TRUNCATE, etc.).
>
> You don't need to do the rolling upgrade node by node. You can do it rack by rack. Stopping all nodes in a single rack and upgrading them concurrently is much faster. The number of nodes doesn't matter that much to the time required to complete a rolling upgrade; it's the number of DCs and racks that matters.
>
> Cheers,
> Bowen
>
> On 24/04/2024 16:16, Paul Chandler wrote:
>> Hi all,
>>
>> We have some large clusters (1000+ nodes) spread across multiple datacenters.
>>
>> When we perform upgrades, we would normally upgrade one DC at a time during a planned outage for that DC. This means that a cluster might be in mixed mode with multiple versions for a week or two.
>>
>> We have noticed during our testing that upgrading to 4.1 causes a schema mismatch due to the new tables added to the system keyspace.
>>
>> Is this going to be an issue if this schema mismatch lasts for maybe several weeks? I assume that running any DDL during that time would be a bad idea; are there any other issues to look out for?
>>
>> Thanks
>>
>> Paul Chandler
Re: Mixed Cluster 4.0 and 4.1
Hi Paul,

You don't need to plan for or introduce an outage for a rolling upgrade, which is the preferred route. It isn't advisable to take down an entire DC to do an upgrade.

You should aim to complete upgrading the entire cluster and finish a full repair within the shortest gc_grace_seconds (default 10 days) of all tables. Failing to do that may cause data resurrection.

During the rolling upgrade, you should not run repair or any DDL query (such as ALTER TABLE, TRUNCATE, etc.).

You don't need to do the rolling upgrade node by node. You can do it rack by rack. Stopping all nodes in a single rack and upgrading them concurrently is much faster. The number of nodes doesn't matter that much to the time required to complete a rolling upgrade; it's the number of DCs and racks that matters.

Cheers,
Bowen

On 24/04/2024 16:16, Paul Chandler wrote:

Hi all,

We have some large clusters (1000+ nodes) spread across multiple datacenters.

When we perform upgrades, we would normally upgrade one DC at a time during a planned outage for that DC. This means that a cluster might be in mixed mode with multiple versions for a week or two.

We have noticed during our testing that upgrading to 4.1 causes a schema mismatch due to the new tables added to the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several weeks? I assume that running any DDL during that time would be a bad idea; are there any other issues to look out for?

Thanks

Paul Chandler
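To illustrate why the rack count matters more than the node count, here is a rough sketch that counts restart rounds for the two strategies. It is an illustration only: restart_rounds and the example topology (4 DCs, 3 racks each, 100 nodes per rack) are hypothetical, and it assumes every round takes about the same time regardless of how many nodes restart in it.

```python
def restart_rounds(topology, rack_by_rack):
    """Count sequential restart rounds for a rolling upgrade.

    topology maps a DC name to a dict of rack name -> node count.
    Node by node, every node is its own round; rack by rack, all nodes
    in one rack are stopped and upgraded together in a single round.
    """
    if rack_by_rack:
        return sum(len(racks) for racks in topology.values())
    return sum(sum(racks.values()) for racks in topology.values())


# Hypothetical cluster: 4 DCs, 3 racks per DC, 100 nodes per rack.
example_topology = {
    f"dc{d}": {f"rack{r}": 100 for r in range(1, 4)} for d in range(1, 5)
}

if __name__ == "__main__":
    print("node by node:", restart_rounds(example_topology, rack_by_rack=False))
    print("rack by rack:", restart_rounds(example_topology, rack_by_rack=True))
```

Under these assumptions the rack-by-rack count depends only on the number of DCs and racks, so adding nodes to existing racks does not lengthen the upgrade.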
Mixed Cluster 4.0 and 4.1
Hi all,

We have some large clusters (1000+ nodes) spread across multiple datacenters.

When we perform upgrades, we would normally upgrade one DC at a time during a planned outage for that DC. This means that a cluster might be in mixed mode with multiple versions for a week or two.

We have noticed during our testing that upgrading to 4.1 causes a schema mismatch due to the new tables added to the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several weeks? I assume that running any DDL during that time would be a bad idea; are there any other issues to look out for?

Thanks

Paul Chandler
Re: Trouble with using group commitlog_sync
You might have run into the bottleneck of the driver's IO thread. Try increasing the driver's connections-per-server limit to 2 or 3 if you've only got 1 server in the cluster. Alternatively, run two client processes in parallel.

On 24/04/2024 07:19, Nathan Marz wrote:

Tried it again with one more client thread, and that had no effect on performance. This is unsurprising as there are only 2 CPUs on this node and they were already at 100%. These were good ideas, but I'm still unable to even match the performance of batch commit mode with group commit mode.

On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user wrote:

To achieve 10k loop iterations per second, each iteration must take 0.1 milliseconds or less. Considering that each iteration needs to lock and unlock the semaphore (two syscalls) and make network requests (more syscalls), that's a lot of context switches. It may be a bit too much to ask of a single thread. I would suggest trying multi-threading or multi-processing to see if the combined insert rate is higher.

I should also note that executeAsync() has implicit limits on the number of in-flight requests, which default to 1024 requests per connection and 1 connection per server. See https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/

On 23/04/2024 23:18, Nathan Marz wrote:

It's using the async API, so why would it need multiple threads? Using the exact same approach I'm able to get 38k / second with periodic commitlog_sync. For what it's worth, I do see 100% CPU utilization in every single one of these tests.

On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user wrote:

Have you checked the thread CPU utilisation on the client side? You will likely need more than one thread doing insertions in a loop to achieve tens of thousands of inserts per second.

On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms, concurrent_writes at 512, and doing 1000 individual inserts at a time with the same loop + semaphore approach. This only nets 9k / second.

I got much higher throughput for the other modes with BatchStatement of 100 inserts rather than 100x more individual inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user wrote:

I suspect you are abusing batch statements. Batch statements should only be used where atomicity or isolation is needed. Using batch statements won't make inserting into multiple partitions faster. In fact, it will often make it slower.

Also, the linear relationship between commitlog_sync_group_window and write throughput is expected. That's because the max number of uncompleted writes is limited by the write concurrency, and a write is not considered "complete" before it is synced to disk when commitlog sync is in group or batch mode. That means within each interval, only a limited number of writes can be done. The ways to increase that include: adding more nodes, syncing the commitlog at shorter intervals, and allowing more concurrent writes.

On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and set commitlog_sync_group_window to 20ms. This causes a single execute of a BatchStatement containing 100 inserts to succeed. However, the throughput I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatements concurrently using the semaphore + loop approach I showed in my first message. So as requests complete, more are sent out such that there are 10 in-flight at a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730 inserts / second. Again, with periodic mode I see 38k / second and with batch I see 14k / second. My expectation was that group commit mode throughput would be somewhere between those two.

If I set commitlog_sync_group_window to 100ms, the throughput drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the throughput increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput increases to 13k / second, which is slightly less than batch commit mode.

Is group commit mode supposed to have better performance than batch mode?

On Tue, Apr 23, 2024 at 8:46 AM
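The loop + semaphore pattern discussed in this thread can be sketched without a cluster. The code below is an assumption-laden illustration in Python's asyncio rather than the Java driver used in the thread: fake_send only sleeps and stands in for a call such as executeAsync(), and run_bounded and estimated_throughput are names invented here. It shows a single loop keeping a fixed number of requests in flight, plus the Little's law estimate that throughput is roughly concurrency divided by per-request latency.

```python
import asyncio


async def run_bounded(total_requests, max_in_flight, send):
    """Issue total_requests calls to send(), keeping at most
    max_in_flight of them outstanding. Returns the observed peak."""
    semaphore = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def one_request(i):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        try:
            await send(i)
        finally:
            in_flight -= 1
            semaphore.release()  # lets the issuing loop send another request

    tasks = []
    for i in range(total_requests):
        # Waits here whenever max_in_flight requests are already outstanding.
        await semaphore.acquire()
        tasks.append(asyncio.create_task(one_request(i)))
    await asyncio.gather(*tasks)
    return peak


async def fake_send(i, latency_s=0.002):
    # Stand-in for a real driver call; it only simulates latency.
    await asyncio.sleep(latency_s)


def estimated_throughput(max_in_flight, latency_s):
    """Little's law: throughput is about concurrency divided by latency."""
    return max_in_flight / latency_s


if __name__ == "__main__":
    peak = asyncio.run(run_bounded(100, 10, fake_send))
    print("peak in-flight:", peak)
    print("estimated ceiling:", estimated_throughput(10, 0.002), "requests/s")
```

In this model, when every request has to wait for the next commit log sync, latency is tied to the sync window, so a fixed in-flight limit caps throughput no matter how fast the issuing loop runs.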
Re: Trouble with using group commitlog_sync
Tried it again with one more client thread, and that had no effect on performance. This is unsurprising as there are only 2 CPUs on this node and they were already at 100%. These were good ideas, but I'm still unable to even match the performance of batch commit mode with group commit mode.

On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user <user@cassandra.apache.org> wrote:

> To achieve 10k loop iterations per second, each iteration must take 0.1 milliseconds or less. Considering that each iteration needs to lock and unlock the semaphore (two syscalls) and make network requests (more syscalls), that's a lot of context switches. It may be a bit too much to ask of a single thread. I would suggest trying multi-threading or multi-processing to see if the combined insert rate is higher.
>
> I should also note that executeAsync() has implicit limits on the number of in-flight requests, which default to 1024 requests per connection and 1 connection per server. See https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/
>
> On 23/04/2024 23:18, Nathan Marz wrote:
>
> It's using the async API, so why would it need multiple threads? Using the exact same approach I'm able to get 38k / second with periodic commitlog_sync. For what it's worth, I do see 100% CPU utilization in every single one of these tests.
>
> On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>
>> Have you checked the thread CPU utilisation on the client side? You will likely need more than one thread doing insertions in a loop to achieve tens of thousands of inserts per second.
>>
>> On 23/04/2024 21:55, Nathan Marz wrote:
>>
>> Thanks for the explanation.
>>
>> I tried again with commitlog_sync_group_window at 2ms, concurrent_writes at 512, and doing 1000 individual inserts at a time with the same loop + semaphore approach. This only nets 9k / second.
>>
>> I got much higher throughput for the other modes with BatchStatement of 100 inserts rather than 100x more individual inserts.
>>
>> On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>>
>>> I suspect you are abusing batch statements. Batch statements should only be used where atomicity or isolation is needed. Using batch statements won't make inserting into multiple partitions faster. In fact, it will often make it slower.
>>>
>>> Also, the linear relationship between commitlog_sync_group_window and write throughput is expected. That's because the max number of uncompleted writes is limited by the write concurrency, and a write is not considered "complete" before it is synced to disk when commitlog sync is in group or batch mode. That means within each interval, only a limited number of writes can be done. The ways to increase that include: adding more nodes, syncing the commitlog at shorter intervals, and allowing more concurrent writes.
>>>
>>> On 23/04/2024 20:43, Nathan Marz wrote:
>>>
>>> Thanks. I raised concurrent_writes to 128 and set commitlog_sync_group_window to 20ms. This causes a single execute of a BatchStatement containing 100 inserts to succeed. However, the throughput I'm seeing is atrocious.
>>>
>>> With these settings, I'm executing 10 BatchStatements concurrently using the semaphore + loop approach I showed in my first message. So as requests complete, more are sent out such that there are 10 in-flight at a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730 inserts / second. Again, with periodic mode I see 38k / second and with batch I see 14k / second. My expectation was that group commit mode throughput would be somewhere between those two.
>>>
>>> If I set commitlog_sync_group_window to 100ms, the throughput drops to 14 / second.
>>>
>>> If I set commitlog_sync_group_window to 10ms, the throughput increases to 1587 / second.
>>>
>>> If I set commitlog_sync_group_window to 5ms, the throughput increases to 3200 / second.
>>>
>>> If I set commitlog_sync_group_window to 1ms, the throughput increases to 13k / second, which is slightly less than batch commit mode.
>>>
>>> Is group commit mode supposed to have better performance than batch mode?
>>>
>>> On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>>>
>>>> The default commitlog_sync_group_window is very long for SSDs. Try reducing it if you are using SSD-backed storage for the commit log. 10-15 ms is a good starting point. You may also want to increase the value of concurrent_writes; consider at least doubling or quadrupling the default. You'll need even higher write concurrency for a longer commitlog_sync_group_window.
>>>>
>>>> On 23/04/2024 19:26, Nathan Marz wrote:
>>>>
>>>> "batch" mode works fine. I'm having trouble with "group" mode. The only config for that is "commitlog_sync_group_window",
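The explanation quoted above (uncompleted writes are capped by the write concurrency, and a write only completes at the next sync) suggests a simple upper bound: at most concurrent_writes writes can complete per group window. The sketch below works that bound out for the two settings mentioned in the thread. It is a deliberately simplified model, not Cassandra's actual behaviour: group_commit_ceiling is a name invented here, and whether a 100-insert batch counts as one write or many in this model is left open.

```python
def group_commit_ceiling(concurrent_writes, window_ms):
    """Upper bound on completed writes per second, assuming each of the
    concurrent_writes slots completes at most one write per sync window."""
    return concurrent_writes * 1000 / window_ms


# Settings reported in the thread: (concurrent_writes, window in ms).
settings = [
    (128, 20),
    (512, 2),
]

if __name__ == "__main__":
    for concurrent_writes, window_ms in settings:
        ceiling = group_commit_ceiling(concurrent_writes, window_ms)
        print(f"concurrent_writes={concurrent_writes}, window={window_ms} ms: "
              f"at most {ceiling:.0f} writes/s")
```

Under this model the ceilings (6400 and 256000 writes per second) sit well above the 730 and 9k inserts per second reported for those settings, so the simple bound alone would not account for the measured numbers.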