[jira] [Commented] (CASSANDRA-13851) Allow existing nodes to use all peers in shadow round
[ https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573088#comment-17573088 ] Daniel Cranford commented on CASSANDRA-13851:
--

[~brandon.williams] ticket added, cf CASSANDRA-17786.

> Allow existing nodes to use all peers in shadow round
> -----------------------------------------------------
>
> Key: CASSANDRA-13851
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13851
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Startup and Shutdown
> Reporter: Kurt Greaves
> Assignee: Kurt Greaves
> Priority: Normal
> Fix For: 3.11.3, 4.0-alpha1, 4.0
>
> In CASSANDRA-10134 we made collision checks necessary on every startup. A
> side-effect was introduced that then requires a node's seeds to be contacted
> on every startup. Prior to this change, an existing node could start up
> regardless of whether it could contact a seed node (because
> checkForEndpointCollision() was only called for bootstrapping nodes).
> Now, if a node's seeds are removed/deleted/fail, it will no longer be able to
> start up until live seeds are configured (or it is itself made a seed), even
> though it already knows about the rest of the ring. This is inconvenient for
> operators and has the potential to cause some nasty surprises and increase
> downtime.
> One solution would be to use all of a node's existing peers as seeds in the
> shadow round. Not a Gossip guru though, so not sure of the implications.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-17786) Seed docs are out of date
Daniel Cranford created CASSANDRA-17786:
---

Summary: Seed docs are out of date
Key: CASSANDRA-17786
URL: https://issues.apache.org/jira/browse/CASSANDRA-17786
Project: Cassandra
Issue Type: Bug
Reporter: Daniel Cranford

The [FAQ|https://cassandra.apache.org/doc/latest/cassandra/faq/index.html#are-seeds-SPOF] states {quote}The ring can operate or boot without a seed{quote} This has not been true since Cassandra 3.6, when CASSANDRA-10134 required nodes to complete a "shadow" gossip round or specify the undocumented `cassandra.allow_unsafe_join` property. AFAICT this "shadow" round is not documented anywhere outside the code implementing it.

CASSANDRA-13851 improved things by allowing other nodes that are not themselves booting to release a node from the shadow round and let it boot successfully. However, this still means a booting node must contact a seed or a peer that is not itself booting in order to start, making seeds more crucial to booting than the docs imply. In particular, a full cluster bounce is not supported when there are no reachable seeds, since the non-seed peers required to release a node from the shadow round will themselves be trapped in the shadow round.
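The boot rule described above can be sketched as a tiny predicate. This is an illustrative model under names of our own choosing, not Cassandra source: a booting node is released from the shadow round only by a live seed or by a peer that has itself finished booting, which is why a full cluster bounce with no reachable seeds deadlocks.

```java
// Illustrative model, not Cassandra code: the shadow-round release rule as
// described above. The class and method names are our own.
public class ShadowRoundModel {
    // A booting node is released by a live seed, or (since CASSANDRA-13851)
    // by a peer that is not itself still in its own shadow round.
    static boolean canCompleteShadowRound(boolean reachedLiveSeed,
                                          boolean reachedSettledPeer) {
        return reachedLiveSeed || reachedSettledPeer;
    }

    public static void main(String[] args) {
        // Rolling restart with seeds down: settled peers can release the node.
        System.out.println("seeds down, settled peer reachable: "
                + canCompleteShadowRound(false, true));
        // Full cluster bounce with seeds down: every peer is itself booting,
        // so no node can release any other -- the unsupported case above.
        System.out.println("seeds down, all peers booting: "
                + canCompleteShadowRound(false, false));
    }
}
```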
[jira] [Commented] (CASSANDRA-13851) Allow existing nodes to use all peers in shadow round
[ https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570152#comment-17570152 ] Daniel Cranford commented on CASSANDRA-13851:
--

I hope it is clear that I/we don't care that the behavior of seeds has changed, but rather that this behavior change was not made public (it is "hidden" in source code and bug trackers), and it took us a significant amount of skilled man-hours to track down what had changed, why, and what our potential workarounds were. Just a simple update to the seed docs would have helped us immensely.
[jira] [Commented] (CASSANDRA-13851) Allow existing nodes to use all peers in shadow round
[ https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570118#comment-17570118 ] Daniel Cranford commented on CASSANDRA-13851:
--

Respectfully, the shadow round isn't documented at all (outside the source code), and gossip is barely documented. My ops guys are going to see `Unable to gossip with any peers` and assume there's a network issue preventing a node from talking to any of its peers, not "all my peers are also stuck in this undocumented thing called the shadow round".
[jira] [Commented] (CASSANDRA-13851) Allow existing nodes to use all peers in shadow round
[ https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570067#comment-17570067 ] Daniel Cranford commented on CASSANDRA-13851:
--

[~samt], sorry, I misspoke. I appreciate the material improvement in behavior this ticket has provided. What I intended to say was {quote}A node will not start unless it can contact a seed node *or* another node not also performing the shadow round{quote}

Background: my operations guys routinely perform a full cluster bounce to ensure everything is starting from a clean state. Up until Cassandra 3.6 this worked fine. Unfortunately, due to the details of our hardware, nodes sometimes take longer to come up than usual (eg 5 minutes instead of 30 seconds). If the slow nodes happen to be the seed node/nodes, it is game over - the cluster will not start. The only way my ops guys were able to resolve this was to give me the stack trace of the error, which I had to correlate with the source code and use `git blame` to find CASSANDRA-10134 and this ticket.

I would not consider a bug tracker to be appropriate documentation for the semantics of a seed node, especially when the public docs state {quote}The ring can operate or boot without a seed; however, you will not be able to add new nodes to the cluster.{quote}

My ops guys have worked around this behavior by begrudgingly setting `cassandra.allow_unsafe_join=true` - an undocumented workaround I found by inspecting the source code. After we upgraded from 3.9 to 3.11, I was eager to see if this ticket allowed us to remove the workaround. Unfortunately it does not: a full cluster bounce will still fail, since only seed nodes and nodes not themselves in the shadow round can release a node from the shadow round. If anything, the error message in this version is worse, since it is now incorrect.

{code:java}
if (!isSeed)
    throw new RuntimeException("Unable to gossip with any peers");
{code}

Actually, the node was unable to gossip with any seeds and with any peers not themselves in the shadow round. Peers may be alive but themselves trapped in the shadow round.
[jira] [Commented] (CASSANDRA-13851) Allow existing nodes to use all peers in shadow round
[ https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569515#comment-17569515 ] Daniel Cranford commented on CASSANDRA-13851:
--

This is still an *undocumented* regression in the definition of a "seed" node. A node *will not start* unless it can contact at least one seed node - a detail that still hasn't made it into the [documentation|https://cassandra.apache.org/doc/latest/cassandra/faq/index.html#what-are-seeds]
[jira] [Comment Edited] (CASSANDRA-17237) Pathological interaction between Cassandra and readahead, particularly on Centos 7 VMs
[ https://issues.apache.org/jira/browse/CASSANDRA-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469355#comment-17469355 ] Daniel Cranford edited comment on CASSANDRA-17237 at 1/5/22, 2:49 PM:
--

{quote}since mmap was clearly the superior mode, and likely still is with sane readahead settings{quote}

Perhaps. In our testing, standard IO has a 10% performance advantage over mmap with sane readahead values. Of course, a lot of this is going to boil down to hardware specifics, eg what the IO seek penalty and bandwidth are, and what the syscall latency vs page fault latency is. It certainly doesn't help mmap that half the IO bandwidth is wasted compared to standard IO and that there is no way to issue smaller reads for the index file. It is worth noting that Linus is on record saying mmap is [not necessarily always a win|https://marc.info/?l=linux-kernel&m=95496636207616&w=2] - the TLB miss and page fault mechanism can be more expensive than people realize.

I've looked through the Cassandra git history, and there is the appearance that the standard IO path was unfairly penalized by suboptimal behavior, which may explain some of the observed benefit of mmap (eg https://issues.apache.org/jira/browse/CASSANDRA-8894).

I'm certainly not arguing that standard IO should be the default. But since it really is faster in our tests (with sane readahead values), perhaps it should still be a documented tunable.

> Pathological interaction between Cassandra and readahead, particularly on
> Centos 7 VMs
> --------------------------------------------------------------------------
>
> Key: CASSANDRA-17237
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17237
> Project: Cassandra
> Issue Type: Improvement
> Components: Local/Config
> Reporter: Daniel Cranford
> Priority: Normal
> Fix For: 4.x
>
> Cassandra defaults to using mmap for IO, except on 32 bit systems. The config
> value `disk_access_mode` that controls this isn't even included or
> documented in cassandra.yaml.
> While this may be a reasonable default config for Cassandra, we've noticed a
> pathological interplay between the way Linux implements readahead for mmap
> and Cassandra's IO patterns, particularly on vanilla Centos 7 VMs.
> A read that misses all levels of cache in Cassandra is (typically) going to
> involve 2 IOs: one into the index file and one into the data file. These IOs
> will both be effectively random given the nature of the murmur3 hash partitioner.
> The amount of data read from the index file IO will be relatively small,
> perhaps 4-8kb, compared to the data file IO, which (assuming the entire
> partition fits in a single compressed chunk and a compression ratio of 1/2)
> will require 32kb.
> However, applications using `mmap()` have no way to tell the OS the desired
> IO size - they can only tell the OS the desired IO location, by reading from
> the mapped address and triggering a page fault. This is unlike `read()`, where
> the application provides both the size and location to the OS. So for
> `mmap()` the OS has to guess how large the IO submitted to the backing device
> should be and whether the application is performing sequential or random IO,
> unless the application provides hints (eg `fadvise()`, `madvise()`,
> `readahead()`).
> This is how Linux determines the size of IO for mmap during a page fault:
> * Outside of hints (eg FADV_RANDOM), the default IO size is the maximum readahead
> value with the faulting address in the middle of the IO, ie an IO requested for
> the range [fault_addr - max_readahead / 2, fault_addr + max_readahead / 2]
[jira] [Updated] (CASSANDRA-17237) Pathological interaction between Cassandra and readahead, particularly on Centos 7 VMs
[ https://issues.apache.org/jira/browse/CASSANDRA-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Cranford updated CASSANDRA-17237:

Description:

Cassandra defaults to using mmap for IO, except on 32 bit systems. The config value `disk_access_mode` that controls this isn't even included or documented in cassandra.yaml.

While this may be a reasonable default config for Cassandra, we've noticed a pathological interplay between the way Linux implements readahead for mmap and Cassandra's IO patterns, particularly on vanilla Centos 7 VMs.

A read that misses all levels of cache in Cassandra is (typically) going to involve 2 IOs: one into the index file and one into the data file. These IOs will both be effectively random given the nature of the murmur3 hash partitioner. The amount of data read from the index file IO will be relatively small, perhaps 4-8kb, compared to the data file IO, which (assuming the entire partition fits in a single compressed chunk and a compression ratio of 1/2) will require 32kb.

However, applications using `mmap()` have no way to tell the OS the desired IO size - they can only tell the OS the desired IO location, by reading from the mapped address and triggering a page fault. This is unlike `read()`, where the application provides both the size and location to the OS. So for `mmap()` the OS has to guess how large the IO submitted to the backing device should be and whether the application is performing sequential or random IO, unless the application provides hints (eg `fadvise()`, `madvise()`, `readahead()`).

This is how Linux determines the size of IO for mmap during a page fault:
* Outside of hints (eg FADV_RANDOM), the default IO size is the maximum readahead value with the faulting address in the middle of the IO, ie an IO requested for the range [fault_addr - max_readahead / 2, fault_addr + max_readahead / 2]. This is sometimes referred to as "read around" (ie reading around the faulting address). See [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2989]
* The kernel maintains a cache miss counter for the file. Every time the kernel submits an IO for a page fault, this counts as a miss. Every time the application faults in a page that is already in the page cache (presumably from a previous page fault's IO), this counts as a cache hit and decrements the counter. If the miss counter exceeds a threshold, the kernel stops inflating the IOs to the max readahead and falls back to reading a *single* 4k page for each page fault. See summary [here|https://www.quora.com/What-heuristics-does-the-adaptive-readahead-implementation-in-the-Linux-kernel-use/answer/Robert-Love-1] and implementation [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2955] and [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L3005]
* This means an application that, on average, references more than one 4k page around the initial page fault will consistently have page fault IOs inflated to the maximum readahead value. Note, there is no ramping up of a window the way there is with standard IO. The kernel only submits IOs of 1 page or max_readahead, as far as I can tell.

Observations:
* mmap'ed IO on Linux wastes half the IO bandwidth. This may or may not be a big deal depending on your setup.
* Cassandra will always have IOs inflated to the maximum readahead, because more than 1 page is referenced for the data file and (depending on the size and cardinality of your keys) more than one page is referenced from the index file.
* The device's readahead is a crude, system-wide knob for controlling IO size. Cassandra cannot perform smaller IOs for the index file (unless your keyset is such that only 1 page from the index file needs to be referenced).

Centos 7 VMs:
* The default readahead for Centos 7 VMs is 4MB (as opposed to the default readahead for non-VM Centos 7, which is 128kb).
* Even though this is reduced by the kernel (cf `max_sane_readahead()`) to something around 450k, it is still far too large for an average Cassandra read.
* Even once this readahead is reduced to the recommended 64kb, standard IO still has a 10% performance advantage in our tests, likely because the readahead algorithm for standard IO is more flexible and converges on smaller reads from the index file and larger reads from the data file.
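The fault heuristic in the bullets above can be sketched as a toy simulation. This is our own model, not kernel code: the threshold value, the counter update rule, and all names are invented for illustration. It shows why an access pattern that touches one extra already-cached page per fault (as Cassandra's data-file reads do) keeps the miss counter pinned down, so every fault IO stays inflated to max_readahead.

```java
// Toy model of the Linux mmap read-around heuristic described above.
// Not kernel code: MISS_THRESHOLD and the counter update rule are simplified.
public class MmapReadaheadModel {
    static final int PAGE = 4096;
    static final int MISS_THRESHOLD = 100; // invented for illustration

    // IO size the model kernel issues for one faulting access.
    static int faultIoSize(int missCounter, int maxReadahead) {
        return missCounter > MISS_THRESHOLD ? PAGE : maxReadahead;
    }

    public static void main(String[] args) {
        int maxReadahead = 448 * 1024; // ~max_sane_readahead() of the 4MB VM default
        int missCounter = 0;
        long total = 0;
        int faults = 1000;
        for (int i = 0; i < faults; i++) {
            total += faultIoSize(missCounter, maxReadahead);
            missCounter++; // the fault IO itself counts as a miss...
            missCounter--; // ...but referencing one neighbouring cached page is
                           // a hit, so the counter never climbs past the threshold
        }
        // Every fault IO stays at max_readahead (458752 bytes in this model).
        System.out.println("average IO per fault: " + total / faults + " bytes");
    }
}
```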
[jira] [Created] (CASSANDRA-17237) Pathological interaction between Cassandra and readahead, particularly on CentOS 7 VMs
Daniel Cranford created CASSANDRA-17237: --- Summary: Pathological interaction between Cassandra and readahead, particularly on CentOS 7 VMs Key: CASSANDRA-17237 URL: https://issues.apache.org/jira/browse/CASSANDRA-17237 Project: Cassandra Issue Type: Bug Reporter: Daniel Cranford Cassandra defaults to using mmap for IO, except on 32-bit systems. The config value `disk_access_mode` that controls this isn't even included in or documented in cassandra.yaml. While this may be a reasonable default config for Cassandra, we've noticed a pathological interplay between the way Linux implements readahead for mmap, and Cassandra's IO patterns, particularly on vanilla CentOS 7 VMs. A read that misses all levels of cache in Cassandra is (typically) going to involve two IOs: one into the index file and one into the data file. These IOs will both be effectively random given the nature of the murmur3 hash partitioner. The amount of data read from the index file IO will be relatively small, perhaps 4-8kb, compared to the data file IO which (assuming the entire partition fits in a single compressed chunk and a compression ratio of 1/2) will require 32kb. However, applications using `mmap()` have no way to tell the OS the desired IO size - they can only tell the OS the desired IO location - by reading from the mapped address and triggering a page fault. This is unlike `read()` where the application provides both the size and location to the OS. So for `mmap()` the OS has to guess how large the IO submitted to the backing device should be and whether the application is performing sequential or random IO unless the application provides hints (eg `fadvise()`, `madvise()`, `readahead()`).
This is how Linux determines the size of IO for mmap during a page fault:
* Outside of hints (eg FADV_RANDOM), the default IO size is the maximum readahead value with the faulting address in the middle of the IO, eg IO requested for range [fault_addr - max_readahead / 2, fault_addr + max_readahead / 2]. This is sometimes referred to as "read around" (ie read around the faulting address). See [here](https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2989)
* The kernel maintains a cache miss counter for the file. Every time the kernel submits an IO for a page fault, this counts as a miss. Every time the application faults in a page that is already in the page cache (presumably from a previous page fault's IO), that counts as a cache hit and decrements the counter. If the miss counter exceeds a threshold, the kernel stops inflating the IOs to the max readahead and falls back to reading a *single* 4k page for each page fault. See summary [here](https://www.quora.com/What-heuristics-does-the-adaptive-readahead-implementation-in-the-Linux-kernel-use/answer/Robert-Love-1) and implementation [here](https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2955) and [here](https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L3005)
* This means an application that, on average, references more than one 4k page around the initial page fault will consistently have page fault IOs inflated to the maximum readahead value. Note that there is no ramping up of a window the way there is with standard IO. The kernel only submits IOs of 1 page and max_readahead as far as I can tell.
Observations:
* mmap'ed IO on Linux wastes half the IO bandwidth. This may or may not be a big deal depending on your setup.
* Cassandra will always have IOs inflated to the maximum readahead because more than one page is referenced for the data file and (depending on the size and cardinality of your keys) more than one page is referenced from the index file
* The device's readahead is a crude system-wide knob for controlling IO size. Cassandra cannot perform smaller IOs for the index file (unless your keyset is such that only one page from the index file needs to be referenced).
CentOS 7 VMs:
* The default readahead for CentOS 7 VMs is 4MB (as opposed to the default readahead for non-VM CentOS 7, which is 128kb).
* Even though this is reduced by the kernel (cf `max_sane_readahead()`) to something around 450k, it is still far too large for an average Cassandra read.
* Even once this readahead is reduced to the recommended 64kb, standard IO still has a 10% performance advantage in our tests, likely because the readahead algorithm for standard IO is more flexible and converges on smaller reads from the index file and larger reads from the data file. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
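The read-around behavior described above can be sketched with a toy model. This is a deliberate simplification for illustration only, not the kernel's actual code (the real logic lives in mm/filemap.c, linked above); the page size, readahead value, and class names are assumptions:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model: a fault on an uncached page triggers one IO of max_readahead
// bytes centered on the faulting address ("read around"), so half the bytes
// land behind the fault and are wasted by a forward-scanning reader.
public class ReadAroundModel {
    static final int PAGE = 4096;
    static final int MAX_READAHEAD = 32 * PAGE; // 128kb, a common default

    final Set<Long> pageCache = new HashSet<>(); // cached page indexes
    long ioBytes = 0;                            // total bytes submitted to disk

    void fault(long addr) {
        if (pageCache.contains(addr / PAGE)) return; // cache hit, no IO
        long start = Math.max(0, addr - MAX_READAHEAD / 2) / PAGE;
        long end = (addr + MAX_READAHEAD / 2) / PAGE;
        for (long p = start; p < end; p++) pageCache.add(p);
        ioBytes += (end - start) * PAGE;
    }

    public static void main(String[] args) {
        ReadAroundModel m = new ReadAroundModel();
        m.fault(10_000_000);           // random fault: one full-readahead IO
        System.out.println(m.ioBytes); // 131072
        m.fault(10_000_000 + PAGE);    // next page forward: already cached
        System.out.println(m.ioBytes); // still 131072 -- but the ~64kb read
                                       // behind the first fault is never used
    }
}
```

The second fault being free is the "cache hit" that decrements the kernel's miss counter; an access pattern that keeps hitting like this keeps IOs inflated to max_readahead.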
[jira] [Commented] (CASSANDRA-10134) Always require replace_address to replace existing address
[ https://issues.apache.org/jira/browse/CASSANDRA-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211198#comment-16211198 ] Daniel Cranford commented on CASSANDRA-10134: - This fix changes the definition of a "seed" node and invalidates the description provided in the [FAQ page|http://cassandra.apache.org/doc/latest/faq/index.html#what-are-seeds]. Because the shadow round only talks to seeds and the shadow round is now performed on every startup (not just bootstrap), a node will not boot unless at least one seed is alive. > Always require replace_address to replace existing address > -- > > Key: CASSANDRA-10134 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10134 > Project: Cassandra > Issue Type: Improvement > Components: Distributed Metadata >Reporter: Tyler Hobbs >Assignee: Sam Tunnicliffe > Labels: docs-impacting > Fix For: 3.6 > > > Normally, when a node is started from a clean state with the same address as > an existing down node, it will fail to start with an error like this: > {noformat} > ERROR [main] 2015-08-19 15:07:51,577 CassandraDaemon.java:554 - Exception > encountered during startup > java.lang.RuntimeException: A node with address /127.0.0.3 already exists, > cancelling join. Use cassandra.replace_address if you want to replace this > node. 
> at > org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:543) > ~[main/:na] > at > org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:783) > ~[main/:na] > at > org.apache.cassandra.service.StorageService.initServer(StorageService.java:720) > ~[main/:na] > at > org.apache.cassandra.service.StorageService.initServer(StorageService.java:611) > ~[main/:na] > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378) > [main/:na] > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:537) > [main/:na] > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:626) > [main/:na] > {noformat} > However, if {{auto_bootstrap}} is set to false or the node is in its own seed > list, it will not throw this error and will start normally. The new node > then takes over the host ID of the old node (even if the tokens are > different), and the only message you will see is a warning in the other > nodes' logs: > {noformat} > logger.warn("Changing {}'s host ID from {} to {}", endpoint, storedId, > hostId); > {noformat} > This could cause an operator to accidentally wipe out the token information > for a down node without replacing it. To fix this, we should check for an > endpoint collision even if {{auto_bootstrap}} is false or the node is a seed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13940) Fix stress seed multiplier
[ https://issues.apache.org/jira/browse/CASSANDRA-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192830#comment-16192830 ] Daniel Cranford commented on CASSANDRA-13940: - See [this comment|https://issues.apache.org/jira/browse/CASSANDRA-12744?focusedCommentId=16192820&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16192820] for a full explanation of the problem. > Fix stress seed multiplier > -- > > Key: CASSANDRA-13940 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13940 > Project: Cassandra > Issue Type: Bug > Components: Stress >Reporter: Daniel Cranford > Attachments: 0001-Fixing-seed-multiplier.patch > > > CASSANDRA-12744 attempted to fix a problem with partition key generation, but > is generally broken. E.G. > {noformat} > cassandra-stress -insert visits=fixed\(100\) revisit=uniform\(1..100\) ... > {noformat} > sends cassandra-stress into an infinite loop. Here's a better fix. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12744) Randomness of stress distributions is not good
[ https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192827#comment-16192827 ] Daniel Cranford commented on CASSANDRA-12744: - Created CASSANDRA-13940 to fix this. > Randomness of stress distributions is not good > -- > > Key: CASSANDRA-12744 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12744 > Project: Cassandra > Issue Type: Bug > Components: Tools >Reporter: T Jake Luciani >Assignee: Ben Slater >Priority: Minor > Labels: stress > Fix For: 4.0 > > Attachments: CASSANDRA_12744_SeedManager_changes-trunk.patch > > > The randomness of our distributions is pretty bad. We are using the > JDKRandomGenerator() but in testing of uniform(1..3) we see for 100 > iterations it's only outputting 3. If you bump it to 10k it hits all 3 > values. > I made a change to just use the default commons math random generator and now > see all 3 values for n=10 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-13940) Fix stress seed multiplier
Daniel Cranford created CASSANDRA-13940: --- Summary: Fix stress seed multiplier Key: CASSANDRA-13940 URL: https://issues.apache.org/jira/browse/CASSANDRA-13940 Project: Cassandra Issue Type: Bug Components: Stress Reporter: Daniel Cranford Attachments: 0001-Fixing-seed-multiplier.patch CASSANDRA-12744 attempted to fix a problem with partition key generation, but is generally broken. E.G. {noformat} cassandra-stress -insert visits=fixed\(100\) revisit=uniform\(1..100\) ... {noformat} sends cassandra-stress into an infinite loop. Here's a better fix. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12744) Randomness of stress distributions is not good
[ https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192820#comment-16192820 ] Daniel Cranford commented on CASSANDRA-12744: - Some more thoughts: The generation of partition keys has been "broken" since CASSANDRA-7519. The Linear Congruential Generators (LCGs) used in java.util.Random and by extension JDKRandomGenerator generate good random number sequences, but similar seeds result in similar sequences. Using the lcg update function {{lcg\(x) = a*x + c}} like
random ~1~ = lcg(1)
random ~2~ = lcg(2)
random ~3~ = lcg(3)
...
random ~n~ = lcg\(n)
does not generate a good random sequence; this is a misuse of the LCG. LCGs are supposed to be used like
random ~1~ = lcg(1)
random ~2~ = lcg(lcg(1))
random ~3~ = lcg(lcg(lcg(1)))
...
random ~n~ = lcg ^n^ (1)
I say "broken" in quotes because the misuse of LCGs ends up not mattering. {{new java.util.Random(seed).nextDouble()}} will always differ from {{new java.util.Random(seed + 1).nextDouble()}} by more than 1/100,000,000,000. Thus with the default partition key population (=UNIFORM(1..100B)), seeds that differ by 1 will generate distinct partition keys. The only thing that matters about partition keys is how many distinct values there are (and how large their lexical value is). The number of partition key components doesn't matter. The cardinality of each partition key component doesn't matter. The distribution of values in the lexical partition key space doesn't matter. At the end of the day, all the partition key components get concatenated and the resulting bit vector is hashed, resulting in a uniformly distributed 64 bit token that determines where the data will be stored. The easiest "fix" is to not use the partition key population to define the number of partition keys. Take advantage of the fact that the only thing that matters from a performance standpoint is the number of distinct partitions.
Leave the partition key distribution at uniform(1..100B), and use the n= parameter to define the number of partitions. An ideal fix would update the way partition keys are generated to use the LCG generator properly. However, this seems difficult since LCGs don't support random access (i.e., the only way to calculate the nth item in an LCG sequence is to first calculate the n-1 preceding items), and all three seed generation modes rely on the ability to randomly jump around in the seed sequence. This could be worked around by using a PRNG that supports random access to the nth item in the sequence (e.g. something like JDK 1.8's SplittableRandom could be easily extended to support this). A more workable fix is to spread the generated seeds (typically drawn from a smallish range of integers) around in the 2 ^64^ values a long can take before seeding the LCG. An additional caveat: whatever function is used for spreading the seeds needs to be invertible, since LookbackableWriteGenerator's implementation relies on the properties of the sequence it generates to perform internal bookkeeping. Multiplication by an odd integer happens to be an invertible function (although integer division is NOT the inverse operation; multiplication by the modular inverse is). So the initial implementation (although broken) is not actually that bad an idea. I propose fixing things by picking a static integer as the multiplier and using multiplication by its modular inverse to invert it for LookbackableWriteGenerator. > Randomness of stress distributions is not good > -- > > Key: CASSANDRA-12744 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12744 > Project: Cassandra > Issue Type: Bug > Components: Tools >Reporter: T Jake Luciani >Assignee: Ben Slater >Priority: Minor > Labels: stress > Fix For: 4.0 > > Attachments: CASSANDRA_12744_SeedManager_changes-trunk.patch > > > The randomness of our distributions is pretty bad.
We are using the > JDKRandomGenerator() but in testing of uniform(1..3) we see for 100 > iterations it's only outputting 3. If you bump it to 10k it hits all 3 > values. > I made a change to just use the default commons math random generator and now > see all 3 values for n=10 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
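The closing proposal in the comment above - a fixed odd multiplier undone by multiplication with its modular inverse - can be sketched as follows. The constant and names here are illustrative assumptions, not what any actual patch uses; any odd 64-bit value works because it is invertible mod 2^64:

```java
public class SeedSpread {
    // Illustrative odd constant (the golden-ratio mixing constant); odd => invertible mod 2^64.
    static final long MULTIPLIER = 0x9E3779B97F4A7C15L;

    // Newton-Hensel iteration: for odd a, x = a is already correct in the low
    // 3 bits (odd^2 == 1 mod 8), and each step x = x * (2 - a*x) doubles the
    // number of correct low-order bits, so 5 steps cover all 64.
    static long modInverse(long a) {
        long x = a;
        for (int i = 0; i < 5; i++) x *= 2 - a * x;
        return x;
    }

    static long spread(long seed) { return seed * MULTIPLIER; }            // scatter seed over 2^64
    static long unspread(long v)  { return v * modInverse(MULTIPLIER); }   // exact inverse

    public static void main(String[] args) {
        long seed = 12345L;
        // Note integer division would NOT recover the seed; the modular inverse does.
        System.out.println(unspread(spread(seed)) == seed); // true
    }
}
```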
[jira] [Updated] (CASSANDRA-13932) Stress write order and seed order should be different
[ https://issues.apache.org/jira/browse/CASSANDRA-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Cranford updated CASSANDRA-13932: Summary: Stress write order and seed order should be different (was: Write order and seed order should be different) > Stress write order and seed order should be different > - > > Key: CASSANDRA-13932 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13932 > Project: Cassandra > Issue Type: Bug > Components: Tools >Reporter: Daniel Cranford > Labels: stress > Attachments: 0001-Initial-implementation-cassandra-3.11.patch, > vmtouch-after.txt, vmtouch-before.txt > > > Read tests get an unrealistic boost in performance because they read data > from a set of partitions that was written sequentially. > I ran into this while running a timed read test against a large data set (250 > million partition keys) {noformat}cassandra-stress read > duration=30m{noformat} While the test was running, I noticed one node was > performing zero IO after an initial period. > I discovered each node in the cluster only had blocks from a single SSTable > loaded in the FS cache. {noformat}vmtouch -v /path/to/sstables{noformat} > For the node that was performing zero IO, the SSTable in question was small > enough to fit into the FS cache. > I realized that when a read test is run for a duration or until rate > convergence, the default population for the seeds is a GAUSSIAN distribution > over the first million seeds. Because of the way compaction works, partitions > that are written sequentially will (with high probability) always live in the > same SSTable. That means that while the first million seeds will generate > partition keys that will be randomly distributed in the token space, they > will most likely all live in the same SSTable. When this SSTable is small > enough to fit into the FS cache, you get unbelievably good results for a read > test.
Consider that a dataset 4x the size of the FS cache will have almost > 1/2 the data in SSTables small enough to fit into the FS cache. > Adjusting the population of seeds used during the read test to be the entire > 250 million seeds used to load the cluster does not fix the > problem.{noformat}cassandra-stress read duration=30m -pop > dist=gaussian(1..250M){noformat} > or (same population, larger sample) {noformat}cassandra-stress read > n=250M{noformat} > Any distribution other than the uniform distribution has one or more modes, > and the mode(s) of such a distribution will cluster reads around a certain > seed range which corresponds to a certain set of sequential writes which > corresponds to (with high probability) a single SSTable. > My patch against cassandra-3.11 fixes this by shuffling the sequence of > generated seeds. Each seed value will still be generated once and only once. > The old behavior of sequential seed generation (ie seed(n+1) = seed(n) + 1) > may be selected by using the no-shuffle flag. e.g. {noformat}cassandra-stress > read duration=30m -pop no-shuffle{noformat} > Results: In [^vmtouch-before.txt] only pages from a single SSTable are > present in the FS cache while in [^vmtouch-after.txt] an equal proportion of > all SSTables are present in the FS cache. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-13932) Write order and seed order should be different
Daniel Cranford created CASSANDRA-13932: --- Summary: Write order and seed order should be different Key: CASSANDRA-13932 URL: https://issues.apache.org/jira/browse/CASSANDRA-13932 Project: Cassandra Issue Type: Bug Components: Tools Reporter: Daniel Cranford Attachments: 0001-Initial-implementation-cassandra-3.11.patch, vmtouch-after.txt, vmtouch-before.txt Read tests get an unrealistic boost in performance because they read data from a set of partitions that was written sequentially. I ran into this while running a timed read test against a large data set (250 million partition keys) {noformat}cassandra-stress read duration=30m{noformat} While the test was running, I noticed one node was performing zero IO after an initial period. I discovered each node in the cluster only had blocks from a single SSTable loaded in the FS cache. {noformat}vmtouch -v /path/to/sstables{noformat} For the node that was performing zero IO, the SSTable in question was small enough to fit into the FS cache. I realized that when a read test is run for a duration or until rate convergence, the default population for the seeds is a GAUSSIAN distribution over the first million seeds. Because of the way compaction works, partitions that are written sequentially will (with high probability) always live in the same SSTable. That means that while the first million seeds will generate partition keys that will be randomly distributed in the token space, they will most likely all live in the same SSTable. When this SSTable is small enough to fit into the FS cache, you get unbelievably good results for a read test. Consider that a dataset 4x the size of the FS cache will have almost 1/2 the data in SSTables small enough to fit into the FS cache.
Adjusting the population of seeds used during the read test to be the entire 250 million seeds used to load the cluster does not fix the problem.{noformat}cassandra-stress read duration=30m -pop dist=gaussian(1..250M){noformat} or (same population, larger sample) {noformat}cassandra-stress read n=250M{noformat} Any distribution other than the uniform distribution has one or more modes, and the mode(s) of such a distribution will cluster reads around a certain seed range which corresponds to a certain set of sequential writes which corresponds to (with high probability) a single SSTable. My patch against cassandra-3.11 fixes this by shuffling the sequence of generated seeds. Each seed value will still be generated once and only once. The old behavior of sequential seed generation (ie seed(n+1) = seed(n) + 1) may be selected by using the no-shuffle flag. e.g. {noformat}cassandra-stress read duration=30m -pop no-shuffle{noformat} Results: In [^vmtouch-before.txt] only pages from a single SSTable are present in the FS cache while in [^vmtouch-after.txt] an equal proportion of all SSTables are present in the FS cache. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
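For the "each seed generated once, but scattered" property described above, one well-known trick is that multiplication by an odd constant modulo a power of two is a bijection. This sketch only illustrates that idea; it is an assumption, not the mechanism of the attached patch:

```java
import java.util.HashSet;
import java.util.Set;

public class SeedShuffle {
    // Map counter i to a scattered value in [0, 2^bits). Because the odd
    // multiplier is invertible mod 2^bits, this is a permutation: every value
    // is produced exactly once, just not in write order.
    static long shuffle(long i, int bits) {
        long mask = (1L << bits) - 1;
        return (i * 0x9E3779B97F4A7C15L) & mask; // illustrative odd constant
    }

    public static void main(String[] args) {
        int bits = 16;
        Set<Long> seen = new HashSet<>();
        for (long i = 0; i < (1L << bits); i++) seen.add(shuffle(i, bits));
        System.out.println(seen.size() == (1 << bits)); // true: a bijection
    }
}
```

Sequential writes i, i+1, i+2 thus map to seeds far apart, so a read test over any seed window samples many SSTables instead of one.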
[jira] [Commented] (CASSANDRA-12744) Randomness of stress distributions is not good
[ https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16190085#comment-16190085 ] Daniel Cranford commented on CASSANDRA-12744: - As I've thought about how to fix the seed multiplier, I've come to the conclusion that it is impossible to use an adaptive multiplier without breaking existing functionality or changing the command line interface. One of the key reasons you can specify how the seeds get generated is so that you can partition the seed space and run multiple cassandra-stress processes on different machines in parallel so the cassandra-stress client doesn't become the bottleneck. E.G. to write 200 partitions from two client machines, you'd run {noformat}cassandra-stress write n=100 -pop seq=1..100{noformat} on one client machine and {noformat}cassandra-stress write n=100 -pop seq=101..200{noformat} on the other client machine. An adaptive multiplier that attempts to scale the seed sequence so that its range is 10^22 (or better, Long.MAX_VALUE since seeds are 64 bit longs) would generate the same multiplier for both client processes, resulting in seed sequence overlaps. To correctly generate an adaptive multiplier, you need global knowledge of the entire range of seeds being generated by all cassandra-stress processes. This information cannot be supplied via the current command line interface. The command line interface would have to be updated in a breaking fashion to support an adaptive multiplier. Using a hardcoded static multiplier is safe, but would reduce the allowable range of seed values (and thus reduce the maximum number of distinct partition keys). This probably isn't a big deal since nobody wants to write 2^64 partitions. But it would need to be chosen with care so that the number of distinct seeds (and thus the number of distinct partitions) doesn't become too small.
> Randomness of stress distributions is not good > -- > > Key: CASSANDRA-12744 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12744 > Project: Cassandra > Issue Type: Bug > Components: Tools >Reporter: T Jake Luciani >Assignee: Ben Slater >Priority: Minor > Labels: stress > Fix For: 4.0 > > Attachments: CASSANDRA_12744_SeedManager_changes-trunk.patch > > > The randomness of our distributions is pretty bad. We are using the > JDKRandomGenerator() but in testing of uniform(1..3) we see for 100 > iterations it's only outputting 3. If you bump it to 10k it hits all 3 > values. > I made a change to just use the default commons math random generator and now > see all 3 values for n=10 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12744) Randomness of stress distributions is not good
[ https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184418#comment-16184418 ] Daniel Cranford commented on CASSANDRA-12744: - I think the math on this is broken slightly. The seed multiplier is intended to scale all seeds to the 10^22 magnitude. However, seeds (and the multiplier) are all stored in 64 bit integers and the math performed on them is 64-bit math. 10^22 is not representable as a long, which has range {noformat}[-(2^63) : 2^63 - 1] = [-9,223,372,036,854,775,808 : 9,223,372,036,854,775,807]{noformat} Consider that for sample sizes under 1084, the line that calculates the sample multiplier {noformat}this.sampleMultiplier = 1 + Math.round(Math.pow(10D, 22 - Math.log10(sampleSize)));{noformat} will result in a multiplier of Long.MIN_VALUE, which when multiplied by any long will result in 0 or Long.MIN_VALUE, reducing your seeds to two distinct values. I think using 18 instead of 22 as the target exponent should resolve this issue. Additionally, I think the seed population size is being incorrectly calculated as the range of the revisit distribution (which defaults to uniform(1..1M)). However, when running in the default sequential seed mode (without revisits), eg {noformat}cassandra-stress write n=100{noformat}, the size of the seed population is actually the length of the seed sequence (in this case 100). And when running with seeds generated from a distribution, eg {noformat}cassandra-stress read -pop dist=gaussian(1..250M){noformat} the size of the seed population is actually the range of the seed distribution (in this case 250 million).
> Randomness of stress distributions is not good > -- > > Key: CASSANDRA-12744 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12744 > Project: Cassandra > Issue Type: Bug > Components: Tools >Reporter: T Jake Luciani >Assignee: Ben Slater >Priority: Minor > Labels: stress > Fix For: 4.0 > > Attachments: CASSANDRA_12744_SeedManager_changes-trunk.patch > > > The randomness of our distributions is pretty bad. We are using the > JDKRandomGenerator() but in testing of uniform(1..3) we see for 100 > iterations it's only outputting 3. If you bump it to 10k it hits all 3 > values. > I made a change to just use the default commons math random generator and now > see all 3 values for n=10 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
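The saturate-then-wrap arithmetic described in the comment above is easy to reproduce directly:

```java
public class MultiplierOverflow {
    public static void main(String[] args) {
        long sampleSize = 1000; // any sample size under 1084
        // Math.pow(10, 22 - log10(1000)) = 1e19, which exceeds Long.MAX_VALUE
        // (~9.22e18). Math.round saturates to Long.MAX_VALUE, and the +1
        // then wraps the result around to Long.MIN_VALUE.
        long multiplier = 1 + Math.round(Math.pow(10D, 22 - Math.log10(sampleSize)));
        System.out.println(multiplier == Long.MIN_VALUE); // true

        // Long.MIN_VALUE is 2^63, so seed * multiplier mod 2^64 is either
        // 0 (even seed) or Long.MIN_VALUE (odd seed): two distinct values.
        System.out.println(2 * multiplier); // 0
        System.out.println(3 * multiplier); // -9223372036854775808
    }
}
```

The crossover at 1084 follows from log10(Long.MAX_VALUE) being about 18.965: for sampleSize below ~10^3.035 the exponent exceeds it and the round saturates.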
[jira] [Updated] (CASSANDRA-13879) cassandra-stress sleeps for entire duration even when errors halt progress
[ https://issues.apache.org/jira/browse/CASSANDRA-13879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Cranford updated CASSANDRA-13879: Attachment: 0001-Fixing-bug.patch Here's a patch that fixes this bug. > cassandra-stress sleeps for entire duration even when errors halt progress > -- > > Key: CASSANDRA-13879 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13879 > Project: Cassandra > Issue Type: Bug >Reporter: Daniel Cranford >Priority: Minor > Attachments: 0001-Fixing-bug.patch > > > If cassandra-stress is run with a duration parameter, eg > {noformat} > cassandra-stress read duration=30s > {noformat} > then, the process will sleep for the entire duration, even when errors have > killed all the Consumer threads responsible for executing queries. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-13879) cassandra-stress sleeps for entire duration even when errors halt progress
[ https://issues.apache.org/jira/browse/CASSANDRA-13879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Cranford updated CASSANDRA-13879: Priority: Minor (was: Major) > cassandra-stress sleeps for entire duration even when errors halt progress > -- > > Key: CASSANDRA-13879 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13879 > Project: Cassandra > Issue Type: Bug >Reporter: Daniel Cranford >Priority: Minor > > If cassandra-stress is run with a duration parameter, eg > {noformat} > cassandra-stress read duration=30s > {noformat} > then, the process will sleep for the entire duration, even when errors have > killed all the Consumer threads responsible for executing queries. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-13879) cassandra-stress sleeps for entire duration even when errors halt progress
Daniel Cranford created CASSANDRA-13879: --- Summary: cassandra-stress sleeps for entire duration even when errors halt progress Key: CASSANDRA-13879 URL: https://issues.apache.org/jira/browse/CASSANDRA-13879 Project: Cassandra Issue Type: Bug Reporter: Daniel Cranford If cassandra-stress is run with a duration parameter, eg {noformat} cassandra-stress read duration=30s {noformat} then, the process will sleep for the entire duration, even when errors have killed all the Consumer threads responsible for executing queries. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
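One way the fix could work (a hypothetical sketch under assumed names, not necessarily what the attached patch does): have each consumer thread count down a latch when it exits, and replace the unconditional sleep with a timed wait on that latch, so the process returns as soon as all workers are dead:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class BoundedRun {
    public static void main(String[] args) throws InterruptedException {
        int consumers = 4;
        CountDownLatch done = new CountDownLatch(consumers);

        for (int i = 0; i < consumers; i++) {
            new Thread(() -> {
                try {
                    // ...run queries; here every worker fails immediately,
                    // standing in for "errors killed all Consumer threads".
                    throw new RuntimeException("simulated query error");
                } catch (RuntimeException e) {
                    // error handling elided; the thread exits
                } finally {
                    done.countDown();
                }
            }).start();
        }

        // Instead of Thread.sleep(30_000): wakes as soon as all workers have
        // exited, or after the full duration, whichever comes first.
        boolean allDead = done.await(30, TimeUnit.SECONDS);
        System.out.println(allDead); // true, almost immediately
    }
}
```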
[jira] [Updated] (CASSANDRA-13871) cassandra-stress user command misbehaves when retrying operations
[ https://issues.apache.org/jira/browse/CASSANDRA-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Cranford updated CASSANDRA-13871: Attachment: 0001-Fixing-cassandra-stress-user-operations-retry.patch Here's a patch that fixes the problem in trunk. > cassandra-stress user command misbehaves when retrying operations > - > > Key: CASSANDRA-13871 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13871 > Project: Cassandra > Issue Type: Bug >Reporter: Daniel Cranford > Attachments: 0001-Fixing-cassandra-stress-user-operations-retry.patch > > > o.a.c.stress.Operation will retry queries a configurable number of times. > When the "user" command is invoked the o.a.c.stress.operations.userdefined > SchemaInsert and SchemaQuery operations are used. > When SchemaInsert and SchemaQuery are retried (eg after a Read/WriteTimeout > exception), they advance the PartitionIterator used to generate the keys to > insert/query (SchemaInsert.java:85 SchemaQuery.java:129) This means each > retry will use a different set of keys. > The predefined set of operations avoid this problem by packaging up the > arguments to bind to the query into the RunOp object so that retrying the > operation results in exactly the same query with the same arguments being run. > This problem was introduced by CASSANDRA-7964. Prior to CASSANDRA-7964 the > PartitionIterator (Partition.RowIterator before the change) was reinitialized > prior to each query retry, thus generating the same set of keys each time. > This problem is reported rather confusingly. 
The only error that shows up in > a log file (specified with -log file=foo.log) is the unhelpful > {noformat} > java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: > (NoSuchElementException) > at org.apache.cassandra.stress.Operation.error(Operation.java:136) > at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114) > at > org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158) > at > org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459) > {noformat} > Standard error is only slightly more helpful, displaying the ignorable > initial read/write error, and confusing java.util.NoSuchElementException > lines (caused by PartitionIterator exhaustion) followed by the above > IOException with stack trace, eg > {noformat} > com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout > during read query > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: > (NoSuchElementException) > at org.apache.cassandra.stress.Operation.error(Operation.java:136) > at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114) > at > org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158) > at > org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
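The fix pattern described above (bind the keys once, outside the retry loop, so every retry issues an identical query, as the predefined RunOp-based operations already do) can be reduced to a small sketch. This is an illustrative simplification with hypothetical names, not the actual SchemaQuery/SchemaInsert code:

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;

public class RetrySketch {
    // Generic retry loop in the spirit of Operation.timeWithRetry: the
    // operation itself must not advance any shared iterator, so a retry
    // re-runs exactly the same work.
    static <T> T timeWithRetry(int maxTries, Callable<T> op) throws Exception {
        Exception last = null;
        for (int i = 0; i < maxTries; i++) {
            try { return op.call(); }
            catch (Exception e) { last = e; } // e.g. a read/write timeout
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        Iterator<String> partitionKeys = List.of("k1", "k2").iterator();
        String key = partitionKeys.next();           // key drawn ONCE, before retries
        final int[] attempts = {0};
        String result = timeWithRetry(3, () -> {
            if (++attempts[0] < 3) throw new RuntimeException("timeout");
            return "read " + key;                    // same key on every attempt
        });
        System.out.println(result); // read k1
    }
}
```

The bug in the report is the opposite shape: `partitionKeys.next()` sat inside the retried operation, so each retry consumed a fresh key and eventually exhausted the iterator, producing the NoSuchElementException noise above.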
[jira] [Commented] (CASSANDRA-13871) cassandra-stress user command misbehaves when retrying operations
[ https://issues.apache.org/jira/browse/CASSANDRA-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166886#comment-16166886 ] Daniel Cranford commented on CASSANDRA-13871: - *Note* This problem can be worked around by ignoring errors and disabling retries {noformat}-errors ignore retries=0{noformat} > cassandra-stress user command misbehaves when retrying operations > - > > Key: CASSANDRA-13871 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13871 > Project: Cassandra > Issue Type: Bug >Reporter: Daniel Cranford > > o.a.c.stress.Operation will retry queries a configurable number of times. > When the "user" command is invoked the o.a.c.stress.operations.userdefined > SchemaInsert and SchemaQuery operations are used. > When SchemaInsert and SchemaQuery are retried (eg after a Read/WriteTimeout > exception), they advance the PartitionIterator used to generate the keys to > insert/query (SchemaInsert.java:85 SchemaQuery.java:129) This means each > retry will use a different set of keys. > The predefined set of operations avoid this problem by packaging up the > arguments to bind to the query into the RunOp object so that retrying the > operation results in exactly the same query with the same arguments being run. > This problem was introduced by CASSANDRA-7964. Prior to CASSANDRA-7964 the > PartitionIterator (Partition.RowIterator before the change) was reinitialized > prior to each query retry, thus generating the same set of keys each time. > This problem is reported rather confusingly. 
The only error that shows up in > a log file (specified with -log file=foo.log) is the unhelpful > {noformat} > java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: > (NoSuchElementException) > at org.apache.cassandra.stress.Operation.error(Operation.java:136) > at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114) > at > org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158) > at > org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459) > {noformat} > Standard error is only slightly more helpful, displaying the ignorable > initial read/write error, and confusing java.util.NoSuchElementException > lines (caused by PartitionIterator exhaustion) followed by the above > IOException with stack trace, eg > {noformat} > com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout > during read query > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.util.NoSuchElementException > java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: > (NoSuchElementException) > at org.apache.cassandra.stress.Operation.error(Operation.java:136) > at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114) > at > org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158) > at > org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-13871) cassandra-stress user command misbehaves when retrying operations
[ https://issues.apache.org/jira/browse/CASSANDRA-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Cranford updated CASSANDRA-13871: Description: o.a.c.stress.Operation will retry queries a configurable number of times. When the "user" command is invoked the o.a.c.stress.operations.userdefined SchemaInsert and SchemaQuery operations are used. When SchemaInsert and SchemaQuery are retried (eg after a Read/WriteTimeout exception), they advance the PartitionIterator used to generate the keys to insert/query (SchemaInsert.java:85 SchemaQuery.java:129) This means each retry will use a different set of keys. The predefined set of operations avoid this problem by packaging up the arguments to bind to the query into the RunOp object so that retrying the operation results in exactly the same query with the same arguments being run. This problem was introduced by CASSANDRA-7964. Prior to CASSANDRA-7964 the PartitionIterator (Partition.RowIterator before the change) was reinitialized prior to each query retry, thus generating the same set of keys each time. This problem is reported rather confusingly. 
The only error that shows up in a log file (specified with -log file=foo.log) is the unhelpful {noformat} java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: (NoSuchElementException) at org.apache.cassandra.stress.Operation.error(Operation.java:136) at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114) at org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158) at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459) {noformat} Standard error is only slightly more helpful, displaying the ignorable initial read/write error, and confusing java.util.NoSuchElementException lines (caused by PartitionIterator exhaustion) followed by the above IOException with stack trace, eg {noformat} com.datastax.drive.core.exceptions.ReadTimeoutException: Cassandra timeout during read query java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: (NoSuchElementException) at org.apache.cassandra.stress.Operation.error(Operation.java:136) at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114) at org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158) at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459) {noformat} was: o.a.c.stress.Operation will retry queries a configurable number of times. When the "user" command is invoked the o.a.c.stress.operations.userdefined SchemaInsert and SchemaQuery operations are used. 
When SchemaInsert and SchemaQuery are retried (eg after a Read/WriteTimeout exception), they advance the PartitionIterator used to generate the keys to insert/query (SchemaInsert.java:85 SchemaQuery.java:129) This means each retry will use a different set of keys. The predefined set of operations avoid this problem by packaging up the arguments to bind to the query into the RunOp object so that retrying the operation results in exactly the same query with the same arguments being run. This problem was introduced by CASSANDRA-7964. Prior to CASSANDRA-7964 the PartitionIterator (Partition.RowIterator before the change) was reinitialized prior to each query retry, thus generating the same set of keys each time. This problem is reported rather confusingly. The only error that shows up in a log file (specified with -log file=foo.log) is the unhelpful {{{ java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: (NoSuchElementException) at org.apache.cassandra.stress.Operation.error(Operation.java:136) at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114) at org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158) at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459) }}} Standard error is only slightly more helpful, displaying the ignorable initial read/write error, and confusing java.util.NoSuchElementException lines (caused by PartitionIterator exhaustion) followed by the above IOException with stack trace, eg {{{ com.datastax.drive.core.exceptions.ReadTimeoutException: Cassandra timeout during read query java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.
[jira] [Created] (CASSANDRA-13871) cassandra-stress user command misbehaves when retrying operations
Daniel Cranford created CASSANDRA-13871: --- Summary: cassandra-stress user command misbehaves when retrying operations Key: CASSANDRA-13871 URL: https://issues.apache.org/jira/browse/CASSANDRA-13871 Project: Cassandra Issue Type: Bug Reporter: Daniel Cranford o.a.c.stress.Operation will retry queries a configurable number of times. When the "user" command is invoked the o.a.c.stress.operations.userdefined SchemaInsert and SchemaQuery operations are used. When SchemaInsert and SchemaQuery are retried (eg after a Read/WriteTimeout exception), they advance the PartitionIterator used to generate the keys to insert/query (SchemaInsert.java:85 SchemaQuery.java:129) This means each retry will use a different set of keys. The predefined set of operations avoid this problem by packaging up the arguments to bind to the query into the RunOp object so that retrying the operation results in exactly the same query with the same arguments being run. This problem was introduced by CASSANDRA-7964. Prior to CASSANDRA-7964 the PartitionIterator (Partition.RowIterator before the change) was reinitialized prior to each query retry, thus generating the same set of keys each time. This problem is reported rather confusingly. 
The only error that shows up in a log file (specified with -log file=foo.log) is the unhelpful {{{ java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: (NoSuchElementException) at org.apache.cassandra.stress.Operation.error(Operation.java:136) at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114) at org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158) at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459) }}} Standard error is only slightly more helpful, displaying the ignorable initial read/write error, and confusing java.util.NoSuchElementException lines (caused by PartitionIterator exhaustion) followed by the above IOException with stack trace, eg {{{ com.datastax.drive.core.exceptions.ReadTimeoutException: Cassandra timeout during read query java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.util.NoSuchElementException java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: (NoSuchElementException) at org.apache.cassandra.stress.Operation.error(Operation.java:136) at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114) at org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158) at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459) }}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-8735) Batch log replication is not randomized when there are only 2 racks
[ https://issues.apache.org/jira/browse/CASSANDRA-8735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119979#comment-16119979 ] Daniel Cranford edited comment on CASSANDRA-8735 at 8/14/17 2:24 PM: - [~iamaleksey] Great, I didn't see any activity yet on CASSANDRA-12884, so I attached a patch there. was (Author: daniel.cranford): [~iamaleksey] Great, I didn't see any activity yet on CASSANDRA-12844, so I attached a patch there. > Batch log replication is not randomized when there are only 2 racks > --- > > Key: CASSANDRA-8735 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8735 > Project: Cassandra > Issue Type: Bug >Reporter: Yuki Morishita >Assignee: Mihai Suteu >Priority: Minor > Fix For: 2.1.9, 2.2.1, 3.0 alpha 1 > > Attachments: 8735-v2.patch, CASSANDRA-8735.patch > > > Batch log replication is not randomized and the same 2 nodes can be picked up > when there are only 2 racks in the cluster. > https://github.com/apache/cassandra/blob/cassandra-2.0.11/src/java/org/apache/cassandra/service/BatchlogEndpointSelector.java#L72-73 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches
[ https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125719#comment-16125719 ] Daniel Cranford commented on CASSANDRA-12884: - Oh, and not to nitpick, but any reason to prefer {{otherRack.subList(2, otherRack.size()).clear(); return otherRack;}} to {{return otherRack.subList(0, 2);}} ? > Batch logic can lead to unbalanced use of system.batches > > > Key: CASSANDRA-12884 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12884 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Adam Hattrell >Assignee: Daniel Cranford > Fix For: 3.0.x, 3.11.x > > Attachments: 0001-CASSANDRA-12884.patch > > > It looks as though there are some odd edge cases in how we distribute the > copies in system.batches. > The main issue is in the filter method for > org.apache.cassandra.batchlog.BatchlogManager > {code:java} > if (validated.size() - validated.get(localRack).size() >= 2) > { > // we have enough endpoints in other racks > validated.removeAll(localRack); > } > if (validated.keySet().size() == 1) > { >// we have only 1 `other` rack >Collection otherRack = > Iterables.getOnlyElement(validated.asMap().values()); > > return Lists.newArrayList(Iterables.limit(otherRack, 2)); > } > {code} > So with one or two racks we just return the first 2 entries in the list. > There's no shuffle or randomisation here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches
[ https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125710#comment-16125710 ] Daniel Cranford commented on CASSANDRA-12884: - I had originally considered using sublist to avoid creating a second ArrayList, but decided against it because the sublist version throws an exception in the degenerate case where there is only 1 element in otherRack. But now that I trace through the code, I think that otherRack is guaranteed to have at least 2 elements. If otherRack is the local rack and only has 1 element, {{if(validated.size() <= 2)}} would have been true, and the filter() function would have already returned. If otherRack was the single non-local rack, and had size 1, then {{if(validated.size() - validated.get(localRack).size() >= 2)}} would be false and the whole single-other-rack block wouldn't run. It's probably worth a comment stating that otherRack is guaranteed to have at least 2 elements. Looks good! > Batch logic can lead to unbalanced use of system.batches > > > Key: CASSANDRA-12884 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12884 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Adam Hattrell >Assignee: Daniel Cranford > Fix For: 3.0.x, 3.11.x > > Attachments: 0001-CASSANDRA-12884.patch > > > It looks as though there are some odd edge cases in how we distribute the > copies in system.batches. 
> The main issue is in the filter method for > org.apache.cassandra.batchlog.BatchlogManager > {code:java} > if (validated.size() - validated.get(localRack).size() >= 2) > { > // we have enough endpoints in other racks > validated.removeAll(localRack); > } > if (validated.keySet().size() == 1) > { >// we have only 1 `other` rack >Collection otherRack = > Iterables.getOnlyElement(validated.asMap().values()); > > return Lists.newArrayList(Iterables.limit(otherRack, 2)); > } > {code} > So with one or two racks we just return the first 2 entries in the list. > There's no shuffle or randomisation here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
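For reference, the two spellings being compared behave identically on lists of two or more elements; the degenerate single-element case is where {{subList(0, 2)}} would throw, which the comment thread concludes cannot actually occur in filter(). A standalone illustration (not the BatchlogManager code itself):

```java
import java.util.ArrayList;
import java.util.List;

public class SublistDemo {
    public static void main(String[] args) {
        List<String> otherRack = new ArrayList<>(List.of("a", "b", "c"));

        // Patch version: truncate the list in place, then return it.
        List<String> truncated = new ArrayList<>(otherRack);
        truncated.subList(2, truncated.size()).clear(); // subList is a live view,
                                                        // clearing it mutates the list
        // Suggested alternative: a 2-element view, no mutation.
        List<String> view = otherRack.subList(0, 2);

        System.out.println(truncated); // [a, b]
        System.out.println(view);      // [a, b]

        // The degenerate case: subList(0, 2) on a 1-element list throws.
        try {
            List.of("only").subList(0, 2);
        } catch (IndexOutOfBoundsException expected) {
            System.out.println("size-1 list throws IndexOutOfBoundsException");
        }
    }
}
```

Since the discussion establishes that otherRack always holds at least two elements at that point, either spelling is safe there; the view-based form just avoids the extra truncation step.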
[jira] [Commented] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches
[ https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122335#comment-16122335 ] Daniel Cranford commented on CASSANDRA-12884: - Technically, if efficiency is key, we could implement something like a Durstenfeld/Knuth shuffle, eg https://stackoverflow.com/a/35278327 > Batch logic can lead to unbalanced use of system.batches > > > Key: CASSANDRA-12884 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12884 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Adam Hattrell >Assignee: Daniel Cranford > Fix For: 3.0.x, 3.11.x > > Attachments: 0001-CASSANDRA-12884.patch > > > It looks as though there are some odd edge cases in how we distribute the > copies in system.batches. > The main issue is in the filter method for > org.apache.cassandra.batchlog.BatchlogManager > {code:java} > if (validated.size() - validated.get(localRack).size() >= 2) > { > // we have enough endpoints in other racks > validated.removeAll(localRack); > } > if (validated.keySet().size() == 1) > { >// we have only 1 `other` rack >Collection otherRack = > Iterables.getOnlyElement(validated.asMap().values()); > > return Lists.newArrayList(Iterables.limit(otherRack, 2)); > } > {code} > So with one or two racks we just return the first 2 entries in the list. > There's no shuffle or randomisation here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
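The partial Durstenfeld/Knuth shuffle mentioned above stops after k swap steps of a full Fisher-Yates shuffle, which is all that is needed to sample k elements uniformly without replacement. A generic sketch of the technique (not the patch itself):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PartialShuffle {
    // Durstenfeld/Knuth partial shuffle: to draw k elements without
    // replacement, perform only the first k steps of a full shuffle and
    // take the prefix. O(k) swaps instead of O(n).
    static <T> List<T> sample(List<T> items, int k, Random rng) {
        List<T> copy = new ArrayList<>(items);
        for (int i = 0; i < k; i++) {
            int j = i + rng.nextInt(copy.size() - i); // uniform over [i, size)
            T tmp = copy.get(i);
            copy.set(i, copy.get(j));
            copy.set(j, tmp);
        }
        return copy.subList(0, k);
    }

    public static void main(String[] args) {
        List<Integer> endpoints = List.of(1, 2, 3, 4, 5);
        List<Integer> picked = sample(endpoints, 2, new Random());
        System.out.println(picked); // two distinct endpoints, uniformly chosen
    }
}
```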
[jira] [Commented] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches
[ https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122288#comment-16122288 ] Daniel Cranford commented on CASSANDRA-12884: - 1) BatchlogManager::shuffle is stubbed out so the unit test can provide a deterministic override. The unit test has been expanded to provide a test which catches this regression. (the existing code used the same pattern for getRandomInt which is overridden to be non-random in the unit test) 2) getRandomInt could return the same value twice (sampling with replacement) resulting in the same replica being chosen. The existing code uses the shuffle+take head pattern, eg in BatchlogManager.java line 545 {{shuffle((List) racks);}} and line 550 {{for (String rack : Iterables.limit(racks, 2))}} to perform sampling without replacement. > Batch logic can lead to unbalanced use of system.batches > > > Key: CASSANDRA-12884 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12884 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Adam Hattrell >Assignee: Daniel Cranford > Fix For: 3.0.x, 3.11.x > > Attachments: 0001-CASSANDRA-12884.patch > > > It looks as though there are some odd edge cases in how we distribute the > copies in system.batches. > The main issue is in the filter method for > org.apache.cassandra.batchlog.BatchlogManager > {code:java} > if (validated.size() - validated.get(localRack).size() >= 2) > { > // we have enough endpoints in other racks > validated.removeAll(localRack); > } > if (validated.keySet().size() == 1) > { >// we have only 1 `other` rack >Collection otherRack = > Iterables.getOnlyElement(validated.asMap().values()); > > return Lists.newArrayList(Iterables.limit(otherRack, 2)); > } > {code} > So with one or two racks we just return the first 2 entries in the list. > There's no shuffle or randomisation here. 
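The shuffle-then-take-head pattern referenced in the comment (a full shuffle followed by keeping the first two entries) is the other standard way to sample without replacement. Sketched here with standard collections rather than the Guava Iterables.limit used in BatchlogManager:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShuffleTakeHead {
    // Shuffle the whole list, then take the head: every element is equally
    // likely to land in the first two slots, and the two picks are
    // necessarily distinct (sampling without replacement), which a pair of
    // independent getRandomInt calls cannot guarantee.
    static <T> List<T> pickTwo(List<T> items) {
        List<T> copy = new ArrayList<>(items);
        Collections.shuffle(copy);
        return copy.subList(0, Math.min(2, copy.size()));
    }

    public static void main(String[] args) {
        List<String> replicas = List.of("ep1", "ep2", "ep3");
        System.out.println(pickTwo(replicas)); // two distinct replicas
    }
}
```

The trade-off versus the partial shuffle is simplicity against cost: shuffling the whole list is O(n), while the partial shuffle does only k swaps.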
[jira] [Commented] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches
[ https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119980#comment-16119980 ] Daniel Cranford commented on CASSANDRA-12884: - Same bug. Regression. > Batch logic can lead to unbalanced use of system.batches > > > Key: CASSANDRA-12884 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12884 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Adam Hattrell >Assignee: Joshua McKenzie > Fix For: 3.0.x, 3.11.x > > Attachments: 0001-CASSANDRA-12884.patch > > > It looks as though there are some odd edge cases in how we distribute the > copies in system.batches. > The main issue is in the filter method for > org.apache.cassandra.batchlog.BatchlogManager > {code:java} > if (validated.size() - validated.get(localRack).size() >= 2) > { > // we have enough endpoints in other racks > validated.removeAll(localRack); > } > if (validated.keySet().size() == 1) > { >// we have only 1 `other` rack >Collection otherRack = > Iterables.getOnlyElement(validated.asMap().values()); > > return Lists.newArrayList(Iterables.limit(otherRack, 2)); > } > {code} > So with one or two racks we just return the first 2 entries in the list. > There's no shuffle or randomisation here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches
[ https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119980#comment-16119980 ] Daniel Cranford edited comment on CASSANDRA-12884 at 8/9/17 2:26 PM: - Same bug as CASSANDRA-8735. Regression. was (Author: daniel.cranford): Same bug. Regression. > Batch logic can lead to unbalanced use of system.batches > > > Key: CASSANDRA-12884 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12884 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Adam Hattrell >Assignee: Joshua McKenzie > Fix For: 3.0.x, 3.11.x > > Attachments: 0001-CASSANDRA-12884.patch > > > It looks as though there are some odd edge cases in how we distribute the > copies in system.batches. > The main issue is in the filter method for > org.apache.cassandra.batchlog.BatchlogManager > {code:java} > if (validated.size() - validated.get(localRack).size() >= 2) > { > // we have enough endpoints in other racks > validated.removeAll(localRack); > } > if (validated.keySet().size() == 1) > { >// we have only 1 `other` rack >Collection otherRack = > Iterables.getOnlyElement(validated.asMap().values()); > > return Lists.newArrayList(Iterables.limit(otherRack, 2)); > } > {code} > So with one or two racks we just return the first 2 entries in the list. > There's no shuffle or randomisation here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-8735) Batch log replication is not randomized when there are only 2 racks
[ https://issues.apache.org/jira/browse/CASSANDRA-8735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119979#comment-16119979 ] Daniel Cranford commented on CASSANDRA-8735: [~iamaleksey] Great, I didn't see any activity yet on CASSANDRA-12884, so I attached a patch there. > Batch log replication is not randomized when there are only 2 racks > --- > > Key: CASSANDRA-8735 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8735 > Project: Cassandra > Issue Type: Bug >Reporter: Yuki Morishita >Assignee: Mihai Suteu >Priority: Minor > Fix For: 2.1.9, 2.2.1, 3.0 alpha 1 > > Attachments: 8735-v2.patch, CASSANDRA-8735.patch > > > Batch log replication is not randomized and the same 2 nodes can be picked up > when there are only 2 racks in the cluster. > https://github.com/apache/cassandra/blob/cassandra-2.0.11/src/java/org/apache/cassandra/service/BatchlogEndpointSelector.java#L72-73 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches
[ https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Cranford updated CASSANDRA-12884: Attachment: 0001-CASSANDRA-12884.patch Fix + improved unit tests. > Batch logic can lead to unbalanced use of system.batches > > > Key: CASSANDRA-12884 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12884 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Adam Hattrell >Assignee: Joshua McKenzie > Fix For: 3.0.x, 3.11.x > > Attachments: 0001-CASSANDRA-12884.patch > > > It looks as though there are some odd edge cases in how we distribute the > copies in system.batches. > The main issue is in the filter method for > org.apache.cassandra.batchlog.BatchlogManager > {code:java} > if (validated.size() - validated.get(localRack).size() >= 2) > { > // we have enough endpoints in other racks > validated.removeAll(localRack); > } > if (validated.keySet().size() == 1) > { >// we have only 1 `other` rack >Collection otherRack = > Iterables.getOnlyElement(validated.asMap().values()); > > return Lists.newArrayList(Iterables.limit(otherRack, 2)); > } > {code} > So with one or two racks we just return the first 2 entries in the list. > There's no shuffle or randomisation here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-8735) Batch log replication is not randomized when there are only 2 racks
[ https://issues.apache.org/jira/browse/CASSANDRA-8735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114984#comment-16114984 ] Daniel Cranford commented on CASSANDRA-8735: Looks like it was reverted/overwritten (seemingly unintentionally) by the fix for CASSANDRA-7237. We've seen this in production with Cassandra 3.9. One- or two-rack DCs always select the same two hosts for batch log replication. > Batch log replication is not randomized when there are only 2 racks > --- > > Key: CASSANDRA-8735 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8735 > Project: Cassandra > Issue Type: Bug >Reporter: Yuki Morishita >Assignee: Mihai Suteu >Priority: Minor > Fix For: 2.1.9, 2.2.1, 3.0 alpha 1 > > Attachments: 8735-v2.patch, CASSANDRA-8735.patch > > > Batch log replication is not randomized and the same 2 nodes can be picked up > when there are only 2 racks in the cluster. > https://github.com/apache/cassandra/blob/cassandra-2.0.11/src/java/org/apache/cassandra/service/BatchlogEndpointSelector.java#L72-73 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org