[jira] [Commented] (CASSANDRA-19477) Do not go to disk to get HintsStore.getTotalFileSize
[ https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830308#comment-17830308 ]

Jon Haddad commented on CASSANDRA-19477:

Here are some more fun graphs. Read latency, write latency, and load average are all significantly improved.

!image-2024-03-24-18-16-50-370.png|width=645,height=205!

!image-2024-03-24-18-20-07-734.png|width=723,height=229!

!image-2024-03-24-18-17-48-334.png|width=653,height=210!

> Do not go to disk to get HintsStore.getTotalFileSize
>
> Key: CASSANDRA-19477
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19477
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Hints
> Reporter: Jon Haddad
> Assignee: Stefan Miklosovic
> Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: flame-cassandra0-patched-2024-03-25_00-40-47.html, flame-cassandra0-release-2024-03-25_00-16-44.html, flamegraph.cpu.html, image-2024-03-24-17-57-32-560.png, image-2024-03-24-18-08-36-918.png, image-2024-03-24-18-16-50-370.png, image-2024-03-24-18-17-48-334.png, image-2024-03-24-18-20-07-734.png
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> When testing a cluster with more requests than it could handle, I noticed significant CPU time (25%) spent in HintsStore.getTotalFileSize. Here's what I'm seeing from profiling:
>
> 10% of CPU time is spent in HintsDescriptor.fileName, which only does this:
>
> {noformat}
> return String.format("%s-%s-%s.hints", hostId, timestamp, version);
> {noformat}
>
> At a bare minimum here we should create this string up front with the host and version and eliminate 2 of the 3 substitutions, but I think it's probably faster to use a StringBuilder and avoid the underlying regular expression altogether.
>
> 12% of the time is spent in org.apache.cassandra.io.util.File.length. It looks like this is called once for each hint file on disk for each host we're hinting to. In the case of an overloaded cluster, this is significant. It would be better if we were to track the file size in memory for each hint file and reference that rather than go to the filesystem.
>
> These fairly small changes should make Cassandra more reliable when under load spikes.
>
> CPU Flame graph attached.
>
> I only tested this in 4.1 but it looks like this is present up to trunk.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
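The StringBuilder suggestion in the issue description can be sketched roughly as follows. This is only an illustration of the idea, not the committed patch: the class `HintFileNameSketch` and its method names are hypothetical stand-ins, and the real `HintsDescriptor` has more fields and behaviour than shown here.

```java
import java.util.UUID;

/** Hypothetical stand-in for a hints descriptor; not Cassandra's actual class. */
final class HintFileNameSketch {
    // Computed once in the constructor so repeated lookups do no formatting work.
    private final String fileName;

    HintFileNameSketch(UUID hostId, long timestamp, int version) {
        this.fileName = buildFileName(hostId, timestamp, version);
    }

    /** Builds "<hostId>-<timestamp>-<version>.hints" without going through String.format. */
    static String buildFileName(UUID hostId, long timestamp, int version) {
        return new StringBuilder(64)
                .append(hostId).append('-')
                .append(timestamp).append('-')
                .append(version).append(".hints")
                .toString();
    }

    String fileName() {
        return fileName;
    }

    public static void main(String[] args) {
        UUID host = UUID.fromString("00000000-0000-0000-0000-000000000001");
        HintFileNameSketch descriptor = new HintFileNameSketch(host, 1711325807000L, 2);
        // The cached name should match what the original String.format call produces.
        String viaFormat = String.format("%s-%s-%s.hints", host, 1711325807000L, 2);
        if (!descriptor.fileName().equals(viaFormat)) {
            throw new AssertionError("names differ: " + descriptor.fileName());
        }
        System.out.println(descriptor.fileName());
    }
}
```

Caching the name in a final field assumes the three inputs never change after construction; if they could, the name would have to be rebuilt on change instead.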
[jira] [Updated] (CASSANDRA-19477) Do not go to disk to get HintsStore.getTotalFileSize
[ https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jon Haddad updated CASSANDRA-19477:

Attachment: image-2024-03-24-18-20-07-734.png
[jira] [Updated] (CASSANDRA-19477) Do not go to disk to get HintsStore.getTotalFileSize
[ https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jon Haddad updated CASSANDRA-19477:

Attachment: image-2024-03-24-18-16-50-370.png
[jira] [Updated] (CASSANDRA-19477) Do not go to disk to get HintsStore.getTotalFileSize
[ https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jon Haddad updated CASSANDRA-19477:

Attachment: image-2024-03-24-18-17-48-334.png
[jira] [Comment Edited] (CASSANDRA-19477) Do not go to disk to get HintsStore.getTotalFileSize
[ https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830307#comment-17830307 ]

Stefan Miklosovic edited comment on CASSANDRA-19477 at 3/25/24 1:15 AM:

nice! It was all the joined effort really, [~aleksey] helped me to improve and polish the idea so big kudos to him!

was (Author: smiklosovic):
nice! It was all the joind effort really, [~aleksey] helped me to improve and polish the idea so big kudos to him!
[jira] [Commented] (CASSANDRA-19477) Do not go to disk to get HintsStore.getTotalFileSize
[ https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830307#comment-17830307 ]

Stefan Miklosovic commented on CASSANDRA-19477:

nice! It was all the joined effort really, [~aleksey] helped me to improve and polish the idea so big kudos to him!
[jira] [Commented] (CASSANDRA-19477) Do not go to disk to get HintsStore.getTotalFileSize
[ https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830306#comment-17830306 ]

Jon Haddad commented on CASSANDRA-19477:

I've set up a 3-node cluster, loaded 15GB of data, then took down a node and let hints accumulate. I switched one node to use the 4.1 patch branch above, and let the other node remain on release 4.1, then ran this:

{noformat}
easy-cass-stress run RandomPartitionAccess --workload.rows=1000 --rate 5k -d 2h -t 4
{noformat}

Here's the 4.1 release flame graph.

[^flame-cassandra0-release-2024-03-25_00-16-44.html]

StorageProxy.mutate is taking up 17% of CPU time, with shouldHint taking up almost 7% of CPU time.

Here's the 4.1 + patch flame graph:

[^flame-cassandra0-patched-2024-03-25_00-40-47.html]

StorageProxy.mutate is only taking up 10% of CPU time now, with shouldHint taking up 0.26% of CPU time.

In the graph below you can see that 172.31.36.176 is using less CPU overall.

!image-2024-03-24-17-57-32-560.png|width=857,height=270!

Here's the same setup with additional load.

{noformat}
easy-cass-stress run RandomPartitionAccess --workload.rows=1000 --rate 30k -d 2h -t 4
{noformat}

!image-2024-03-24-18-08-36-918.png|width=749,height=302!

The improvement in this patch is fantastic, really nice work [~smiklosovic]. I'm +1 with regard to performance, but deferring to [~aleksey] to judge correctness.
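The issue description proposes tracking hint file sizes in memory rather than asking the filesystem on each call. A minimal sketch of that idea, under stated assumptions, might look like the following. The class `HintSizeTrackerSketch` and its methods are hypothetical and are not taken from the patch or from Cassandra's `HintsStore`; the sketch also assumes a file's size is recorded once it is closed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-host size tracker; names are illustrative, not Cassandra's API. */
final class HintSizeTrackerSketch {
    private final Map<String, Long> sizeByFile = new ConcurrentHashMap<>();
    private final AtomicLong totalBytes = new AtomicLong();

    /** Records (or updates) the size of a hint file, e.g. when its writer is closed. */
    void onFileClosed(String fileName, long sizeBytes) {
        Long previous = sizeByFile.put(fileName, sizeBytes);
        // Adjust the running total by the difference from any earlier recorded size.
        totalBytes.addAndGet(sizeBytes - (previous == null ? 0L : previous));
    }

    /** Forgets a hint file, e.g. after it has been dispatched and deleted. */
    void onFileDeleted(String fileName) {
        Long previous = sizeByFile.remove(fileName);
        if (previous != null) {
            totalBytes.addAndGet(-previous);
        }
    }

    /** Reads the running total from memory; no filesystem call is made. */
    long totalFileSize() {
        return totalBytes.get();
    }

    public static void main(String[] args) {
        HintSizeTrackerSketch tracker = new HintSizeTrackerSketch();
        tracker.onFileClosed("host-1-2.hints", 100L);
        tracker.onFileClosed("host-2-2.hints", 50L);
        tracker.onFileDeleted("host-1-2.hints");
        System.out.println(tracker.totalFileSize());
    }
}
```

The map update and the counter update are two separate steps, so a concurrent reader may briefly see a total that lags the map. That is usually acceptable for a threshold check, but whether it is acceptable for hints is a correctness question this sketch does not settle.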
[jira] [Updated] (CASSANDRA-19477) Do not go to disk to get HintsStore.getTotalFileSize
[ https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jon Haddad updated CASSANDRA-19477:

Attachment: image-2024-03-24-18-08-36-918.png
[jira] [Updated] (CASSANDRA-19477) Do not go to disk to get HintsStore.getTotalFileSize
[ https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jon Haddad updated CASSANDRA-19477:

Attachment: image-2024-03-24-17-57-32-560.png
[jira] [Updated] (CASSANDRA-19477) Do not go to disk to get HintsStore.getTotalFileSize
[ https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jon Haddad updated CASSANDRA-19477:

Attachment: flame-cassandra0-patched-2024-03-25_00-40-47.html
[jira] [Updated] (CASSANDRA-19477) Do not go to disk to get HintsStore.getTotalFileSize
[ https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jon Haddad updated CASSANDRA-19477:

Attachment: flame-cassandra0-release-2024-03-25_00-16-44.html
[jira] [Commented] (CASSANDRA-19429) Remove lock contention generated by getCapacity function in SSTableReader
[ https://issues.apache.org/jira/browse/CASSANDRA-19429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830297#comment-17830297 ]

Jon Haddad commented on CASSANDRA-19429:

Thanks, I appreciate the offer. I won't have time to look at this in the next several days, but can probably look in early April.

> Remove lock contention generated by getCapacity function in SSTableReader
>
> Key: CASSANDRA-19429
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19429
> Project: Cassandra
> Issue Type: Bug
> Components: Local/SSTable
> Reporter: Dipietro Salvatore
> Assignee: Dipietro Salvatore
> Priority: Normal
> Fix For: 4.0.x, 4.1.x
>
> Attachments: Screenshot 2024-02-26 at 10.27.10.png, Screenshot 2024-02-27 at 11.29.41.png, Screenshot 2024-03-19 at 15.22.50.png, asprof_cass4.1.3__lock_20240216052912lock.html, image-2024-03-08-15-51-30-439.png, image-2024-03-08-15-52-07-902.png
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Profiling Cassandra 4.1.3 on large AWS instances, a high number of lock acquires is measured in the `getCapacity` function from `org/apache/cassandra/cache/InstrumentingCache` (1.9M lock acquires per 60 seconds). Based on our tests on r8g.24xlarge instances (using Ubuntu 22.04), this limits the CPU utilization of the system to under 50% when testing at full load and therefore limits the achieved throughput.
>
> Removing the lock contention from the SSTableReader.java file by replacing the call to `getCapacity` with `size` achieves up to 2.95x increase in throughput on r8g.24xlarge and 2x on r7i.24xlarge:
>
> |Instance type|Cass 4.1.3|Cass 4.1.3 patched|
> |r8g.24xlarge|168k ops|496k ops (2.95x)|
> |r7i.24xlarge|153k ops|304k ops (1.98x)|
>
> Instructions to reproduce:
>
> {code:java}
> ## Requirements for Ubuntu 22.04
> sudo apt install -y ant git openjdk-11-jdk
> ## Build and run
> CASSANDRA_USE_JDK11=true ant realclean && CASSANDRA_USE_JDK11=true ant jar && CASSANDRA_USE_JDK11=true ant stress-build && rm -rf data && bin/cassandra -f -R
> # Run
> bin/cqlsh -e 'drop table if exists keyspace1.standard1;' && \
> bin/cqlsh -e 'drop keyspace if exists keyspace1;' && \
> bin/nodetool clearsnapshot --all && tools/bin/cassandra-stress write n=1000 cl=ONE -rate threads=384 -node 127.0.0.1 -log file=cload.log -graph file=cload.html && \
> bin/nodetool compact keyspace1 && sleep 30s && \
> tools/bin/cassandra-stress mixed ratio\(write=10,read=90\) duration=10m cl=ONE -rate threads=406 -node localhost -log file=result.log -graph file=graph.html
> {code}
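The locking difference described in the issue can be illustrated with a toy class. This is a hedged sketch only: `ToyCacheSketch`, its lock, and its fields are invented for illustration and do not reflect how `InstrumentingCache` or the underlying cache library is implemented. It shows one accessor that acquires a lock on every call and another that reads an atomic counter without locking.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

/** Toy cache used only to contrast a locked read with a lock-free read. */
final class ToyCacheSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final AtomicLong entries = new AtomicLong();
    private final long capacity;

    ToyCacheSketch(long capacity) {
        this.capacity = capacity;
    }

    /** Acquires the lock on every call, so concurrent callers serialize here. */
    long getCapacity() {
        lock.lock();
        try {
            return capacity;
        } finally {
            lock.unlock();
        }
    }

    /** Lock-free read of an atomic counter; callers do not block each other. */
    long size() {
        return entries.get();
    }

    /** Simulates inserting one entry. */
    void put() {
        entries.incrementAndGet();
    }

    public static void main(String[] args) {
        ToyCacheSketch cache = new ToyCacheSketch(1024);
        cache.put();
        cache.put();
        System.out.println(cache.getCapacity() + " " + cache.size());
    }
}
```

Capacity and size answer different questions, so whether one can stand in for the other depends on what the calling code needs; the sketch makes no claim about that and only shows why a hot path that calls the locked accessor can become a bottleneck under many threads.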
[jira] [Commented] (CASSANDRA-19260) org.apache.cassandra.tcm.ClusterMetadataService#commit does not catch up when rejected
[ https://issues.apache.org/jira/browse/CASSANDRA-19260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830293#comment-17830293 ]

Michael Semb Wever commented on CASSANDRA-19260:

Missing results attachment?

> org.apache.cassandra.tcm.ClusterMetadataService#commit does not catch up when rejected
>
> Key: CASSANDRA-19260
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19260
> Project: Cassandra
> Issue Type: Bug
> Components: Transactional Cluster Metadata
> Reporter: David Capwell
> Assignee: Alex Petrov
> Priority: Normal
> Fix For: 5.1
>
> Attachments: ci_summary.html, ci_summary.json
>
> This was found in the cep-15-accord branch (CASSANDRA-18804). The test that found this was a simple benchmark test.
> 1) deploy a 6 node cluster
> 2) create a table
> 3) in parallel launch many accord transactions
>
> When accord gets a transaction it needs to make sure the table is “managed” by accord, which uses TCM for this bookkeeping; this is just a List in ClusterMetadata. We found that we detect that the table isn’t managed so we try to add it, we get a reject, and the TCM epoch has not moved forward!
>
> Debugging this, it looks like org.apache.cassandra.tcm.RemoteProcessor#commit is the root cause, as it only seems to try to catch up if there is a messaging error and not a TCM rejection! Given that the caller to TCM is not able to find the epoch to “wait” on, I feel that this is a TCM issue, as TCM normally tries to make sure success/rejects are blocking, but in this one case it appears not to be so.
[jira] [Commented] (CASSANDRA-19150) Align values in rows in CQLSH right for numbers, left for text
[ https://issues.apache.org/jira/browse/CASSANDRA-19150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830288#comment-17830288 ]

Brad Schoening commented on CASSANDRA-19150:

[~arkn98] ah, yes, the tests will have to be fixed. Manually adding whitespace is one way. Using a matching regex with \s+ could be another.

> Align values in rows in CQLSH right for numbers, left for text
>
> Key: CASSANDRA-19150
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19150
> Project: Cassandra
> Issue Type: Improvement
> Components: CQL/Interpreter
> Reporter: Stefan Miklosovic
> Assignee: Arun Ganesh
> Priority: Low
> Fix For: 5.x
>
> Attachments: Screenshot 2023-12-04 at 00.38.16.png, Screenshot 2023-12-09 at 16.58.25.png, signature.asc
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> *Updated* Jan 17 2024 after dev discussion
>
> Change CQLSH to left-align text while continuing to right-align numbers. This will match how the Postgres shell and Excel treat alignment of text and numbers.
>
> *Original*
>
> We need to make this [https://github.com/apache/cassandra/blob/trunk/pylib/cqlshlib/cqlshmain.py#L1101] configurable so values in columns are either all on the left or all on the right side of the column (basically change col.rjust to col.ljust).
>
> By default, it would be like it is now, but there would be a configuration property in cqlsh for that as well as a corresponding CQLSH command (optional), something like
>
> {code:java}
> ALIGNMENT LEFT|RIGHT
> {code}
>
> cc [~bschoeni]
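The alignment rule and the `\s+` idea for tests can be sketched together. cqlsh itself is Python, so the Java below is purely illustrative: `AlignSketch`, `formatCell`, and the ` | ` separator are assumptions for this example, not cqlsh code. It right-aligns numbers, left-aligns everything else, and then checks a row with a whitespace-tolerant regex so the assertion does not depend on exact padding.

```java
/** Illustrative only; cqlsh's real formatting lives in Python (cqlshlib). */
final class AlignSketch {
    /** Pads to the given width: numbers on the right, other values on the left. */
    static String formatCell(Object value, int width) {
        String text = String.valueOf(value);
        String pattern = (value instanceof Number) ? "%" + width + "s" : "%-" + width + "s";
        return String.format(pattern, text);
    }

    /** Joins two cells with a column separator (the separator is an assumption). */
    static String formatRow(Object left, Object right, int width) {
        return formatCell(left, width) + " | " + formatCell(right, width);
    }

    public static void main(String[] args) {
        String row = formatRow("ab", 42, 5);
        System.out.println("[" + row + "]");
        // A test can match with \s+ instead of hard-coding the exact number of spaces.
        if (!row.matches("ab\\s+\\|\\s+42")) {
            throw new AssertionError("unexpected row: " + row);
        }
    }
}
```

Matching with `\s+` keeps tests stable if column widths change later, at the cost of no longer verifying the alignment itself; a test that cares about alignment would still need to compare exact padding.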