[jira] [Resolved] (CASSANDRA-7377) Should be an option to fail startup if corrupt SSTable found
[ https://issues.apache.org/jira/browse/CASSANDRA-7377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low resolved CASSANDRA-7377. Resolution: Duplicate > Should be an option to fail startup if corrupt SSTable found > > > Key: CASSANDRA-7377 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7377 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Low > Labels: proposed-wontfix > > We had a server that crashed and when it came back, some SSTables were > corrupted. Cassandra happily started, but we then realised the corrupt > SSTable contained some tombstones and a few keys were resurrected. This means > corruption on a single replica can bring back data even if you run repairs at > least every gc_grace. > There should be an option, probably controlled by the disk failure policy, to > catch this and stop node startup. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
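An illustrative sketch of the option proposed above: a startup scan that consults the disk failure policy before the node serves traffic. The policy names mirror cassandra.yaml's disk_failure_policy, but the scanning helper is a hypothetical stand-in, not Cassandra's actual startup code:
{code}
import java.io.File;

// Hypothetical sketch only: fail startup on a corrupt SSTable when the
// operator has chosen a strict disk failure policy.
enum DiskFailurePolicy { IGNORE, BEST_EFFORT, STOP, DIE }

final class StartupSSTableCheck
{
    static void checkAll(Iterable<File> sstables, DiskFailurePolicy policy)
    {
        for (File sstable : sstables)
        {
            if (!checksumOk(sstable)) // full checksum scan, elided here
            {
                if (policy == DiskFailurePolicy.STOP || policy == DiskFailurePolicy.DIE)
                    throw new RuntimeException("Corrupt SSTable found at startup: " + sstable);
                // otherwise log and continue, accepting the resurrection risk
            }
        }
    }

    private static boolean checksumOk(File sstable) { return true; /* placeholder */ }
}
{code}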
[jira] [Commented] (CASSANDRA-7377) Should be an option to fail startup if corrupt SSTable found
[ https://issues.apache.org/jira/browse/CASSANDRA-7377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287100#comment-16287100 ] Richard Low commented on CASSANDRA-7377: SGTM > Should be an option to fail startup if corrupt SSTable found > > > Key: CASSANDRA-7377 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7377 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Low > Labels: proposed-wontfix > > We had a server that crashed and when it came back, some SSTables were > corrupted. Cassandra happily started, but we then realised the corrupt > SSTable contained some tombstones and a few keys were resurrected. This means > corruption on a single replica can bring back data even if you run repairs at > least every gc_grace. > There should be an option, probably controlled by the disk failure policy, to > catch this and stop node startup. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219759#comment-16219759 ] Richard Low commented on CASSANDRA-10726: - Background read repair is quite different. This foreground read repair is required to be blocking as the discussion at the beginning of the ticket shows. Now I understand it, I think this is an important guarantee and people would be very surprised if this behaviour changed. So I'm strongly in favour of 1, although the title of the ticket may be misleading :) > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low >Assignee: Xiaolong Jiang > Fix For: 4.x > > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
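To make the trade-off concrete, an illustrative sketch (not Cassandra's actual classes) of why the blocking wait turns a replica that drops writes into read timeouts:
{code}
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only: the coordinator fires repair writes at stale replicas,
// then blocks the read until every write is acknowledged.
final class BlockingReadRepair
{
    static void repairAndWait(List<Runnable> repairWrites, long timeoutMs) throws InterruptedException
    {
        CountDownLatch acks = new CountDownLatch(repairWrites.size());
        for (Runnable send : repairWrites)
            send.run(); // in the real system an ack callback counts the latch down

        // If a stale replica is dropping writes, this wait expires and the
        // *read* fails with a timeout, even though the data was assembled.
        if (!acks.await(timeoutMs, TimeUnit.MILLISECONDS))
            throw new RuntimeException("Read timed out waiting for repair write acks");
    }
}
{code}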
[jira] [Commented] (CASSANDRA-8502) Static columns returning null for pages after first
[ https://issues.apache.org/jira/browse/CASSANDRA-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645878#comment-15645878 ] Richard Low commented on CASSANDRA-8502: Doesn't Dave's patch change behaviour though? We think we're seeing this in 2.0.17 but not 2.1. > Static columns returning null for pages after first > --- > > Key: CASSANDRA-8502 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8502 > Project: Cassandra > Issue Type: Bug >Reporter: Flavien Charlon >Assignee: Tyler Hobbs > Fix For: 2.0.16, 2.1.6, 2.2.0 rc1 > > Attachments: 8502-2.0-v2.txt, 8502-2.0.txt, 8502-2.1-v2.txt, > null-static-column.txt > > > When paging is used for a query containing a static column, the first page > contains the right value for the static column, but subsequent pages have > null for the static column instead of the expected value. > Repro steps: > - Create a table with a static column > - Create a partition with 500 cells > - Using cqlsh, query that partition > Actual result: > - You will see that first, the static column appears as expected, but if you > press a key after "---MORE---", the static columns will appear as null. > See the attached file for a repro of the output. > I am using a single node cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8502) Static columns returning null for pages after first
[ https://issues.apache.org/jira/browse/CASSANDRA-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645846#comment-15645846 ] Richard Low commented on CASSANDRA-8502: Can someone remove 2.0.16 from the fix versions, since the above change was only applied in 2.1? > Static columns returning null for pages after first > --- > > Key: CASSANDRA-8502 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8502 > Project: Cassandra > Issue Type: Bug >Reporter: Flavien Charlon >Assignee: Tyler Hobbs > Fix For: 2.0.16, 2.1.6, 2.2.0 rc1 > > Attachments: 8502-2.0-v2.txt, 8502-2.0.txt, 8502-2.1-v2.txt, > null-static-column.txt > > > When paging is used for a query containing a static column, the first page > contains the right value for the static column, but subsequent pages have > null for the static column instead of the expected value. > Repro steps: > - Create a table with a static column > - Create a partition with 500 cells > - Using cqlsh, query that partition > Actual result: > - You will see that first, the static column appears as expected, but if you > press a key after "---MORE---", the static columns will appear as null. > See the attached file for a repro of the output. > I am using a single node cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427335#comment-15427335 ] Richard Low commented on CASSANDRA-8523: Are you waiting for me to review the dtest PR? > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416328#comment-15416328 ] Richard Low commented on CASSANDRA-8523: +1 on the 3.9 version too. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399949#comment-15399949 ] Richard Low commented on CASSANDRA-8523: I'll review the 3.9 version. I'm very much in favour of putting this in 2.2 and 3.0 as this hurts us badly and no doubt others suffer too. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398667#comment-15398667 ] Richard Low commented on CASSANDRA-8523: +1 patch looks good. Really like the dtests. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9641) Occasional timeouts with blockFor=all for LOCAL_QUORUM query
[ https://issues.apache.org/jira/browse/CASSANDRA-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392096#comment-15392096 ] Richard Low commented on CASSANDRA-9641: I couldn't find the root cause. I'll look to see if it's happening on 2.1. > Occasional timeouts with blockFor=all for LOCAL_QUORUM query > > > Key: CASSANDRA-9641 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9641 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Richard Low > Fix For: 2.1.x, 2.2.x, 3.0.x > > > We have a keyspace using NetworkTopologyStrategy with options DC1:3, DC2:3. > Our tables have > read_repair_chance = 0.0 > dclocal_read_repair_chance = 0.1 > speculative_retry = ’99.0PERCENTILE' > and all reads are at LOCAL_QUORUM. On 2.0.11, we occasionally see this > timeout: > Cassandra timeout during read query at consistency ALL (6 responses were > required but only 5 replica responded) > (sometimes only 4 respond). The ALL is probably due to CASSANDRA-7947 if this > occurs during a digest mismatch, but what is interesting is it is expecting 6 > responses i.e. blockFor is set to all replicas. I can’t see how this should > happen. From the code it should never set blockFor to more than 4 (although 4 > is still wrong - I'll make a separate JIRA for that). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12043) Syncing most recent commit in CAS across replicas can cause all CAS queries in the CQL partition to fail
[ https://issues.apache.org/jira/browse/CASSANDRA-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358348#comment-15358348 ] Richard Low commented on CASSANDRA-12043: - Nice detective work [~kohlisankalp] > Syncing most recent commit in CAS across replicas can cause all CAS queries > in the CQL partition to fail > > > Key: CASSANDRA-12043 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12043 > Project: Cassandra > Issue Type: Bug >Reporter: sankalp kohli >Assignee: Sylvain Lebresne > Fix For: 2.1.15, 2.2.7, 3.0.9, 3.9 > > > We update the most recent commit on requiredParticipant replicas if out of > sync during the prepare round in the beginAndRepairPaxos method. We keep doing > this in a loop until the requiredParticipant replicas have the same most > recent commit or we hit the timeout. > Say we have 3 machines A, B and C and gc grace on the table is 10 days. We do > a CAS write at time 0 and it went to A and B but not to C. C will get the > hint later but will not update the most recent commit in the paxos table. This is > how CAS hints work. > In the paxos table, whose gc_grace=0, most_recent_commit in A and B will be > inserted with timestamp 0 and with a TTL of 10 days. After 10 days, this > insert will become a tombstone at time 0 till it is compacted away since > gc_grace=0. > Do a CAS read after, say, 1 day on the same CQL partition and this time the prepare > phase involved A and C. most_recent_commit on C for this CQL partition is > empty. A sends the most_recent_commit to C with a timestamp of 0 and with a > TTL of 10 days. This most_recent_commit on C will expire on the 11th day since it > is inserted after 1 day. > most_recent_commit is now in sync on A, B and C; however, A and B's > most_recent_commit will expire on the 10th day whereas for C it will expire on the > 11th day since it was inserted one day later. > Do another CAS read after 10 days when most_recent_commit on A and B has > expired and is treated as a tombstone till compacted. In this CAS read, say A > and C are involved in the prepare phase. most_recent_commit will not match > between them since it is expired in A and is still there on C. This will > cause most_recent_commit to be applied to A with a timestamp of 0 and TTL of > 10 days. If A has not compacted away the original most_recent_commit which > has expired, this new write to most_recent_commit won't be visible on reads > since there is a tombstone with the same timestamp (delete wins over data with the > same timestamp). > Another round of prepare will follow and again A would say it does not know > about the most_recent_write (covered by the original write which is not a tombstone) > and C will again try to send the write to A. This can keep going on until the > request times out or only A and B are involved in the prepare phase. > When A’s original most_recent_commit, which is now a tombstone, is compacted, > all the inserts which it was covering will come live. This will in turn again > get played to another replica. This ping pong can keep going on for a long > time. > The issue is that most_recent_commit is expiring at different times across > replicas. When it gets replayed to a replica to bring it in sync, we again > set the TTL from that point. > During the CAS read which timed out, most_recent_commit was being sent to > another replica in a loop. Even in successful requests, it may loop > a couple of times if A and C are involved and then, when the replicas which > respond are A and B, it will succeed. So this will have an impact on latencies > as well.
> These timeouts get worse when a machine is down, as no progress can be made > because the machine with the unexpired commit is always involved in the CAS prepare > round. Also, with range movements, the new machine gaining the range has an empty > most recent commit and gets the commit at a later time, causing the same issue. > Repro steps: > 1. Paxos TTL is max(3 hours, gc_grace) as defined in > SystemKeyspace.paxosTtl(). Change this method to not put a minimum TTL of 3 > hours. > Method SystemKeyspace.paxosTtl() will look like return > metadata.getGcGraceSeconds(); instead of return Math.max(3 * 3600, > metadata.getGcGraceSeconds()); > We are doing this so that we don't need to wait for 3 hours. > Create a 3 node cluster with the code change suggested above with machines > A, B and C > CREATE KEYSPACE test WITH REPLICATION = { 'class' : 'SimpleStrategy', > 'replication_factor' : 3 }; > use test; > CREATE TABLE users (a int PRIMARY KEY, b int); > alter table users WITH gc_grace_seconds=120; > consistency QUORUM; > bring down machine C > INSERT INTO users (a, b) VALUES (1, 1) IF NOT EXISTS; > Nodetool flush on machine A and B > Bring up
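For reference, a sketch of the repro's code change, paraphrasing the two return statements quoted above; the CFMetaData parameter is reduced to a plain int so the sketch stands alone:
{code}
// Paraphrase of SystemKeyspace.paxosTtl() for the repro; not copied from a
// specific release. gcGraceSeconds plays the role of metadata.getGcGraceSeconds().
public static int paxosTtl(int gcGraceSeconds)
{
    // original behaviour: never TTL paxos state faster than 3 hours
    // return Math.max(3 * 3600, gcGraceSeconds);

    // repro variant: honour gc_grace_seconds directly (120s in the repro),
    // so the expiry ping-pong shows up without waiting 3 hours
    return gcGraceSeconds;
}
{code}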
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337328#comment-15337328 ] Richard Low commented on CASSANDRA-8523: It's not those writes that matter - the replacement will get those writes from the other nodes during streaming. The hints that you might care about are writes dropped during the replacement on the replacing node. But those should be extremely rare, and a price well worth paying for getting the vast majority of the writes vs none today. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337290#comment-15337290 ] Richard Low commented on CASSANDRA-8523: Thanks! I'm happy to review. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311526#comment-15311526 ] Richard Low commented on CASSANDRA-8523: Without understanding the FD details, this sounds good. Losing hints isn't an issue, as you say. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11903) Serial reads should not include pending endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304341#comment-15304341 ] Richard Low commented on CASSANDRA-11903: - The write CL is increased by 1 for a pending endpoint so that a quorum of non-pending endpoints are written to. That means a regular quorum read is guaranteed to see the write and detect if any in progress paxos writes are committed. > Serial reads should not include pending endpoints > - > > Key: CASSANDRA-11903 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11903 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Richard Low > > A serial read uses pending endpoints in beginAndRepairPaxos, although the > read itself does not. I don't think the pending endpoints are necessary and > including them unnecessarily increases paxos work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
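In sketch form, the arithmetic described above (simplified and standalone, not Cassandra's actual handler code):
{code}
// Standalone sketch of the rule above: each pending endpoint raises the
// required ack count by one, so a full quorum of non-pending replicas
// must still acknowledge the write.
final class PendingWriteMath
{
    static int blockForWrite(int replicationFactor, int pendingEndpoints)
    {
        int quorum = replicationFactor / 2 + 1;
        return quorum + pendingEndpoints; // e.g. RF=3 with 1 pending -> 3 acks
    }
}
{code}
With RF=3 and one pending endpoint the write needs 3 acks, so a later quorum read of the non-pending replicas is guaranteed to overlap at least one node that saw the write, which is exactly the guarantee the comment relies on.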
[jira] [Created] (CASSANDRA-11903) Serial reads should not include pending endpoints
Richard Low created CASSANDRA-11903: --- Summary: Serial reads should not include pending endpoints Key: CASSANDRA-11903 URL: https://issues.apache.org/jira/browse/CASSANDRA-11903 Project: Cassandra Issue Type: Bug Components: Coordination Reporter: Richard Low A serial read uses pending endpoints in beginAndRepairPaxos, although the read itself does not. I don't think the pending endpoints are necessary and including them unnecessarily increases paxos work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11746) Add per partition rate limiting
[ https://issues.apache.org/jira/browse/CASSANDRA-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284899#comment-15284899 ] Richard Low commented on CASSANDRA-11746: - I think per table via DDL or backed by a system table will be best. It will be easier in the server to coordinate across the cluster than in the driver. > Add per partition rate limiting > --- > > Key: CASSANDRA-11746 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11746 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Low > > In a similar spirit to the tombstone fail threshold, Cassandra could protect > itself against rogue clients issuing too many requests to the same partition > by rate limiting. Nodes could keep a sliding window of requests per partition > and immediately reject requests if the threshold has been reached. This could > stop hotspots from taking down a replica set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-11746) Add per partition rate limiting
Richard Low created CASSANDRA-11746: --- Summary: Add per partition rate limiting Key: CASSANDRA-11746 URL: https://issues.apache.org/jira/browse/CASSANDRA-11746 Project: Cassandra Issue Type: Improvement Reporter: Richard Low In a similar spirit to the tombstone fail threshold, Cassandra could protect itself against rogue clients issuing too many requests to the same partition by rate limiting. Nodes could keep a sliding window of requests per partition and immediately reject requests if the threshold has been reached. This could stop hotspots from taking down a replica set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
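A sketch of the sliding-window idea from the description; names and granularity are illustrative, not a proposed Cassandra API:
{code}
import java.util.ArrayDeque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative per-partition sliding-window limiter: reject a request once
// a partition has exceeded maxRequests within the trailing window.
final class PartitionRateLimiter
{
    private final long windowMillis;
    private final int maxRequests;
    private final Map<String, ArrayDeque<Long>> requests = new ConcurrentHashMap<>();

    PartitionRateLimiter(long windowMillis, int maxRequests)
    {
        this.windowMillis = windowMillis;
        this.maxRequests = maxRequests;
    }

    boolean allow(String partitionKey, long nowMillis)
    {
        ArrayDeque<Long> window = requests.computeIfAbsent(partitionKey, k -> new ArrayDeque<>());
        synchronized (window)
        {
            while (!window.isEmpty() && nowMillis - window.peekFirst() > windowMillis)
                window.pollFirst(); // drop requests that fell out of the window
            if (window.size() >= maxRequests)
                return false;       // immediately reject, protecting the replica set
            window.addLast(nowMillis);
            return true;
        }
    }
}
{code}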
[jira] [Created] (CASSANDRA-11745) Add bytes limit to queries and paging
Richard Low created CASSANDRA-11745: --- Summary: Add bytes limit to queries and paging Key: CASSANDRA-11745 URL: https://issues.apache.org/jira/browse/CASSANDRA-11745 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low For some data models, values may be of very different sizes. When querying data, limit by count doesn’t work well and leads to timeouts. It would be much better to limit by size of the response, probably by stopping at the first row that goes above the limit. This applies to paging too so you can safely page through such data without timeout worries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
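A sketch of the byte-capped page boundary described above; the Row type and its size accounting are assumptions for illustration:
{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative only: build a page until the first row that pushes the page
// over the byte budget, then stop; paging resumes from the next row.
final class ByteLimitedPager
{
    static final class Row
    {
        final byte[] payload;
        Row(byte[] payload) { this.payload = payload; }
    }

    static List<Row> nextPage(Iterator<Row> rows, int maxBytes)
    {
        List<Row> page = new ArrayList<>();
        int bytes = 0;
        while (rows.hasNext())
        {
            Row row = rows.next();
            page.add(row);               // include the row that crosses the limit,
            bytes += row.payload.length; // so even one oversized row makes progress
            if (bytes >= maxBytes)
                break;
        }
        return page;
    }
}
{code}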
[jira] [Commented] (CASSANDRA-11547) Add background thread to check for clock drift
[ https://issues.apache.org/jira/browse/CASSANDRA-11547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253057#comment-15253057 ] Richard Low commented on CASSANDRA-11547: - Given how critical clocks are to Cassandra I think it is definitely Cassandra's business to report on this. It's not actually doing anything, just warning. You'd need to have a 5 minute GC pause for it to fire spuriously with the default. > Add background thread to check for clock drift > -- > > Key: CASSANDRA-11547 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11547 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Jason Brown >Assignee: Jason Brown >Priority: Minor > Labels: clocks, time > > The system clock has the potential to drift while a system is running. As a > simple way to check if this occurs, we can run a background thread that wakes > up every n seconds, reads the system clock, and checks to see if, indeed, n > seconds have passed. > * If the clock's current time is less than the last recorded time (captured n > seconds in the past), we know the clock has jumped backward. > * If n seconds have not elapsed, we know the system clock is running slow or > has moved backward (by a value less than n) > * If (n + a small offset) seconds have elapsed, we can assume we are within > an acceptable window of clock movement. Reasons for including an offset are > the clock checking thread might not have been scheduled on time, or garbage > collection, and so on. > * If the clock is greater than (n + a small offset) seconds, we can assume > the clock jumped forward. > In the unhappy cases, we can write a message to the log and increment some > metric that the user's monitoring systems can trigger/alert on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
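A sketch of the background check as described in the ticket; the interval, tolerance and warning hook are illustrative placeholders:
{code}
// Illustrative clock-drift watchdog: sleep n seconds, then compare the
// clock's apparent elapsed time against n plus an offset for scheduling/GC.
final class ClockDriftChecker implements Runnable
{
    private static final long INTERVAL_MS = 10_000;  // n
    private static final long OFFSET_MS = 300_000;   // slack before "jumped forward"

    @Override
    public void run()
    {
        long last = System.currentTimeMillis();
        while (!Thread.currentThread().isInterrupted())
        {
            try { Thread.sleep(INTERVAL_MS); } catch (InterruptedException e) { return; }
            long now = System.currentTimeMillis();
            long elapsed = now - last;
            if (elapsed < 0)
                warn("clock jumped backward by " + (-elapsed) + "ms");
            else if (elapsed < INTERVAL_MS)
                warn("clock running slow or moved backward: only " + elapsed + "ms elapsed");
            else if (elapsed > INTERVAL_MS + OFFSET_MS)
                warn("clock jumped forward: " + elapsed + "ms elapsed");
            last = now;
        }
    }

    // in practice this would log and increment a metric for alerting
    private static void warn(String msg) { System.err.println("[clock-drift] " + msg); }
}
{code}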
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15222330#comment-15222330 ] Richard Low commented on CASSANDRA-11349: - I'm also not sure how this is meant to fix it. Special casing validation compaction may fix repairs but you'd still get the digest mismatches on reads. > MerkleTree mismatch when multiple range tombstones exists for the same > partition and interval > - > > Key: CASSANDRA-11349 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11349 > Project: Cassandra > Issue Type: Bug >Reporter: Fabien Rousseau >Assignee: Stefan Podkowinski > Labels: repair > Fix For: 2.1.x, 2.2.x > > Attachments: 11349-2.1.patch > > > We observed that repair, for some of our clusters, streamed a lot of data and > many partitions were "out of sync". > Moreover, the read repair mismatch ratio is around 3% on those clusters, > which is really high. > After investigation, it appears that, if two range tombstones exist for a > partition for the same range/interval, they're both included in the merkle > tree computation. > But, if for some reason, on another node, the two range tombstones were > already compacted into a single range tombstone, this will result in a merkle > tree difference. > Currently, this is clearly bad because MerkleTree differences are dependent > on compactions (and if a partition is deleted and created multiple times, the > only way to ensure that repair "works correctly"/"doesn't overstream data" is > to major compact before each repair... which is not really feasible). > Below is a list of steps to easily reproduce this case: > {noformat} > ccm create test -v 2.1.13 -n 2 -s > ccm node1 cqlsh > CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': 2}; > USE test_rt; > CREATE TABLE IF NOT EXISTS table1 ( > c1 text, > c2 text, > c3 float, > c4 float, > PRIMARY KEY ((c1), c2) > ); > INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2); > DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b'; > ctrl ^d > # now flush only one of the two nodes > ccm node1 flush > ccm node1 cqlsh > USE test_rt; > INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3); > DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b'; > ctrl ^d > ccm node1 repair > # now grep the log and observe that there were some inconsistencies detected > between nodes (while it shouldn't have detected any) > ccm node1 showlog | grep "out of sync" > {noformat} > Consequences of this are a costly repair, accumulating many small SSTables > (up to thousands for a rather short period of time when using VNodes, the > time for compaction to absorb those small files), but also an increased size > on disk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-10643) Implement compaction for a specific token range
[ https://issues.apache.org/jira/browse/CASSANDRA-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-10643: Reviewer: Jason Brown Status: Patch Available (was: Open) > Implement compaction for a specific token range > --- > > Key: CASSANDRA-10643 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10643 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Vishy Kasar >Assignee: Vishy Kasar > Attachments: 10643-trunk-REV01.txt > > > We see repeated cases in production (using LCS) where a small number of users > generate a large number of repeated updates or tombstones. Reading the data of such > users brings large amounts of data into the java process. Apart from the read > itself being slow for the user, the excessive GC affects other users as well. > Our solution so far is to move from LCS to SCS and back. This takes a long time and > is overkill if the number of outliers is small. For such cases, we can > implement point compaction of a token range. We make nodetool compact > take a starting and ending token range and compact all the SSTables that fall > within that range. We can refuse to compact if the number of sstables is > beyond a max_limit. > Example: > nodetool -st 3948291562518219268 -et 3948291562518219269 compact keyspace > table -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181335#comment-15181335 ] Richard Low commented on CASSANDRA-10726: - What do you think [~jbellis] [~slebresne]? > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122431#comment-15122431 ] Richard Low commented on CASSANDRA-10726: - Actually, isn't the real problem here that speculative retry doesn't include the RR write? We should give up waiting for the write to complete and retry on another replica. A slow RR insert is just as bad as a slow read. > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-11091) Insufficient disk space in memtable flush should trigger disk fail policy
Richard Low created CASSANDRA-11091: --- Summary: Insufficient disk space in memtable flush should trigger disk fail policy Key: CASSANDRA-11091 URL: https://issues.apache.org/jira/browse/CASSANDRA-11091 Project: Cassandra Issue Type: Bug Reporter: Richard Low If there's insufficient disk space to flush, DiskAwareRunnable.getWriteDirectory throws and the flush fails. The commitlogs then grow indefinitely because the latch is never counted down. This should be an FSError so the disk fail policy is triggered. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
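A sketch of the proposed behaviour: surface the out-of-space condition as an FSError-style failure so the disk failure policy runs. Names echo the ticket, but the code is illustrative, not the actual DiskAwareRunnable:
{code}
import java.io.File;
import java.io.IOException;

// Illustrative: pick a flush directory, and if none has room, raise an
// Error (as Cassandra's FSWriteError does) instead of a plain exception,
// so the disk failure policy triggers rather than the flush quietly failing
// and the commitlog latch never being counted down.
final class FlushDirectoryPicker
{
    static File getWriteDirectory(Iterable<File> dataDirs, long bytesNeeded)
    {
        for (File dir : dataDirs)
            if (dir.getUsableSpace() >= bytesNeeded)
                return dir;
        throw new Error(new IOException("Insufficient disk space to flush " + bytesNeeded + " bytes"));
    }
}
{code}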
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092150#comment-15092150 ] Richard Low commented on CASSANDRA-10726: - It would lose a guarantee (which admittedly I didn't know existed), but most people who care about what happens when there's a write timeout will use CAS read and write. Would a reasonable halfway house be to keep the write as blocking but return success in the case of a write timeout? Then almost always the behaviour will be the same, but it would avoid the timeouts caused by a single broken replica. > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081929#comment-15081929 ] Richard Low commented on CASSANDRA-10726: - +1 on the option to disable. > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10887) Pending range calculator gives wrong pending ranges for moves
[ https://issues.apache.org/jira/browse/CASSANDRA-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15064904#comment-15064904 ] Richard Low commented on CASSANDRA-10887: - Is your key 'jdoe' stored on node1? > Pending range calculator gives wrong pending ranges for moves > - > > Key: CASSANDRA-10887 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10887 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Richard Low >Assignee: Branimir Lambov >Priority: Critical > > My understanding is the PendingRangeCalculator is meant to calculate who > should receive extra writes during range movements. However, it adds the > wrong ranges for moves. An extreme example of this can be seen in the > following reproduction. Create a 5 node cluster (I did this on 2.0.16 and > 2.2.4) and a keyspace with RF=3 and a simple table. Then start moving a node and > immediately kill -9 it. Now you see a node as down and moving in the ring. > Try a quorum write for a partition that is stored on that node - it will fail > with a timeout. Further, all CAS reads or writes fail immediately with an > unavailable exception because they attempt to include the moving node twice. > This is likely to be the cause of CASSANDRA-10423. > In my example I had this ring: > 127.0.0.1 rack1 Up Normal 170.97 KB 20.00% > -9223372036854775808 > 127.0.0.2 rack1 Up Normal 124.06 KB 20.00% > -5534023222112865485 > 127.0.0.3 rack1 Down Moving 108.7 KB 40.00% > 1844674407370955160 > 127.0.0.4 rack1 Up Normal 142.58 KB 0.00% > 1844674407370955161 > 127.0.0.5 rack1 Up Normal 118.64 KB 20.00% > 5534023222112865484 > Node 3 was moving to -1844674407370955160. I added logging to print the > pending and natural endpoints. For ranges owned by node 3, node 3 appeared in > pending and natural endpoints. The blockFor is increased to 3 so we’re > effectively doing CL.ALL operations. This manifests as write timeouts and CAS > unavailables when the node is down. > The correct pending range for this scenario is that node 1 gains the range > (-1844674407370955160, 1844674407370955160). So node 1 should be added as a > destination for writes and CAS for this range, not node 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-10887) Pending range calculator gives wrong pending ranges for moves
Richard Low created CASSANDRA-10887: --- Summary: Pending range calculator gives wrong pending ranges for moves Key: CASSANDRA-10887 URL: https://issues.apache.org/jira/browse/CASSANDRA-10887 Project: Cassandra Issue Type: Bug Components: Coordination Reporter: Richard Low Priority: Critical My understanding is the PendingRangeCalculator is meant to calculate who should receive extra writes during range movements. However, it adds the wrong ranges for moves. An extreme example of this can be seen in the following reproduction. Create a 5 node cluster (I did this on 2.0.16 and 2.2.4) and a keyspace with RF=3 and a simple table. Then start moving a node and immediately kill -9 it. Now you see a node as down and moving in the ring. Try a quorum write for a partition that is stored on that node - it will fail with a timeout. Further, all CAS reads or writes fail immediately with an unavailable exception because they attempt to include the moving node twice. This is likely to be the cause of CASSANDRA-10423. In my example I had this ring: 127.0.0.1 rack1 Up Normal 170.97 KB 20.00% -9223372036854775808 127.0.0.2 rack1 Up Normal 124.06 KB 20.00% -5534023222112865485 127.0.0.3 rack1 Down Moving 108.7 KB 40.00% 1844674407370955160 127.0.0.4 rack1 Up Normal 142.58 KB 0.00% 1844674407370955161 127.0.0.5 rack1 Up Normal 118.64 KB 20.00% 5534023222112865484 Node 3 was moving to -1844674407370955160. I added logging to print the pending and natural endpoints. For ranges owned by node 3, node 3 appeared in pending and natural endpoints. The blockFor is increased to 3 so we’re effectively doing CL.ALL operations. This manifests as write timeouts and CAS unavailables when the node is down. The correct pending range for this scenario is that node 1 gains the range (-1844674407370955160, 1844674407370955160). So node 1 should be added as a destination for writes and CAS for this range, not node 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058576#comment-15058576 ] Richard Low commented on CASSANDRA-10726: - How does it violate consistency? The replica was already inconsistent enough to require a read repair insert, so returning before completing the write can't make it any worse. > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-10726) Read repair inserts should not be blocking
Richard Low created CASSANDRA-10726: --- Summary: Read repair inserts should not be blocking Key: CASSANDRA-10726 URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 Project: Cassandra Issue Type: Improvement Components: Coordination Reporter: Richard Low Today, if there’s a digest mismatch in a foreground read repair, the insert to update out of date replicas is blocking. This means, if it fails, the read fails with a timeout. If a node is dropping writes (maybe it is overloaded or the mutation stage is backed up for some other reason), all reads to a replica set could fail. Further, replicas dropping writes get more out of sync so will require more read repair. The comment on the code for why the writes are blocking is: {code} // wait for the repair writes to be acknowledged, to minimize impact on any replica that's // behind on writes in case the out-of-sync row is read multiple times in quick succession {code} but the bad side effect is that reads timeout. Either the writes should not be blocking or we should return success for the read even if the write times out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10366) Added gossip states can shadow older unseen states
[ https://issues.apache.org/jira/browse/CASSANDRA-10366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804315#comment-14804315 ] Richard Low commented on CASSANDRA-10366: - +1 > Added gossip states can shadow older unseen states > -- > > Key: CASSANDRA-10366 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10366 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Brandon Williams >Assignee: Brandon Williams >Priority: Critical > Fix For: 2.0.17, 3.0.0 rc1, 2.1.10, 2.2.2 > > Attachments: 10336.txt > > > In CASSANDRA-6135 we added cloneWithHigherVersion to ensure that if another > thread added states to gossip while we were notifying we would increase our > version to ensure the existing states wouldn't get shadowed. This, however, > was not entirely perfect since it's possible that after the clone, but before > the addition, another thread will insert an even newer state, thus shadowing > the others. A common case (of this rare one) is when STATUS and TOKENS are > added a bit later in SS.setGossipTokens, where something in another thread > injects a new state (likely SEVERITY) just before the addition after the > clone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
[ https://issues.apache.org/jira/browse/CASSANDRA-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-9434: --- Fix Version/s: (was: 2.0.x) 2.2.x 2.1.x If a node loses schema_columns SSTables it could delete all secondary indexes from the schema - Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Aleksey Yeschenko Fix For: 2.1.x, 2.2.x It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
[ https://issues.apache.org/jira/browse/CASSANDRA-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14702418#comment-14702418 ] Richard Low commented on CASSANDRA-9434: Thanks for the explanation! With 2.0 EOL I don't think I can object much... I updated the fix versions. If a node loses schema_columns SSTables it could delete all secondary indexes from the schema - Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Aleksey Yeschenko Fix For: 2.1.x, 2.2.x It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
[ https://issues.apache.org/jira/browse/CASSANDRA-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700697#comment-14700697 ] Richard Low commented on CASSANDRA-9434: Thanks Aleksey. So it sounds like we should close this as behaves correctly? If a node loses schema_columns SSTables it could delete all secondary indexes from the schema - Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Aleksey Yeschenko Fix For: 2.0.x It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9753) LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch
[ https://issues.apache.org/jira/browse/CASSANDRA-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642825#comment-14642825 ] Richard Low commented on CASSANDRA-9753: I agree with Sankalp. This is with read_repair_chance = 0. LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch --- Key: CASSANDRA-9753 URL: https://issues.apache.org/jira/browse/CASSANDRA-9753 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low When there is a digest mismatch during the initial read, a data read request is sent to all replicas involved in the initial read. This can be more than the initial blockFor if read repair was done and if speculative retry kicked in. E.g. for RF 3 in two DCs, the number of reads could be 4: 2 for LOCAL_QUORUM, 1 for read repair and 1 for speculative read if one replica was slow. If there is then a digest mismatch, Cassandra will issue the data read to all 4 and set blockFor=4. Now the read query is blocked on cross-DC latency. The digest mismatch read blockFor should be capped at RF for the local DC when using CL.LOCAL_*. You can reproduce this behaviour by creating a keyspace with NetworkTopologyStrategy, RF 3 per DC, dc_local_read_repair=1.0 and ALWAYS for speculative read. If you force a digest mismatch (e.g. by deleting a replica's SSTables and restarting) you can see in tracing that it is blocking for 4 responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
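To make the proposed cap concrete, here is a minimal sketch assuming 2.0-era names; contactedReplicas, strategy and localDc are stand-ins, not the actual patch:
{code}
// Hedged sketch, not the committed fix: when re-reading after a digest
// mismatch, never block on more responses than the local DC can supply.
int blockFor = contactedReplicas.size(); // may include read-repair and speculative targets
if (consistencyLevel.isDatacenterLocal() && strategy instanceof NetworkTopologyStrategy)
{
    int localRf = ((NetworkTopologyStrategy) strategy).getReplicationFactor(localDc);
    blockFor = Math.min(blockFor, localRf);
}
{code}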
[jira] [Commented] (CASSANDRA-9753) LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch
[ https://issues.apache.org/jira/browse/CASSANDRA-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642881#comment-14642881 ] Richard Low commented on CASSANDRA-9753: Using a remote replica for an eager retry is fine, but blocking on it for a later digest mismatch read is not. LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch --- Key: CASSANDRA-9753 URL: https://issues.apache.org/jira/browse/CASSANDRA-9753 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low When there is a digest mismatch during the initial read, a data read request is sent to all replicas involved in the initial read. This can be more than the initial blockFor if read repair was done and if speculative retry kicked in. E.g. for RF 3 in two DCs, the number of reads could be 4: 2 for LOCAL_QUORUM, 1 for read repair and 1 for speculative read if one replica was slow. If there is then a digest mismatch, Cassandra will issue the data read to all 4 and set blockFor=4. Now the read query is blocked on cross-DC latency. The digest mismatch read blockFor should be capped at RF for the local DC when using CL.LOCAL_*. You can reproduce this behaviour by creating a keyspace with NetworkTopologyStrategy, RF 3 per DC, dc_local_read_repair=1.0 and ALWAYS for speculative read. If you force a digest mismatch (e.g. by deleting a replica's SSTables and restarting) you can see in tracing that it is blocking for 4 responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9753) LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch
[ https://issues.apache.org/jira/browse/CASSANDRA-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642883#comment-14642883 ] Richard Low commented on CASSANDRA-9753: Actually I take that back. It's not fine, since it could violate local DC consistency. So fixing by avoiding any reads to remote DCs for eager retries would fix this too. LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch --- Key: CASSANDRA-9753 URL: https://issues.apache.org/jira/browse/CASSANDRA-9753 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low When there is a digest mismatch during the initial read, a data read request is sent to all replicas involved in the initial read. This can be more than the initial blockFor if read repair was done and if speculative retry kicked in. E.g. for RF 3 in two DCs, the number of reads could be 4: 2 for LOCAL_QUORUM, 1 for read repair and 1 for speculative read if one replica was slow. If there is then a digest mismatch, Cassandra will issue the data read to all 4 and set blockFor=4. Now the read query is blocked on cross-DC latency. The digest mismatch read blockFor should be capped at RF for the local DC when using CL.LOCAL_*. You can reproduce this behaviour by creating a keyspace with NetworkTopologyStrategy, RF 3 per DC, dc_local_read_repair=1.0 and ALWAYS for speculative read. If you force a digest mismatch (e.g. by deleting a replica's SSTables and restarting) you can see in tracing that it is blocking for 4 responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
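A minimal sketch of that alternative, assuming the 2.0-era snitch API (the surrounding variables are stand-ins):
{code}
// Hedged sketch: only consider replicas in the coordinator's DC for eager
// retries, so a later digest-mismatch read can never block cross-DC.
IEndpointSnitch snitch = DatabaseDescriptor.getEndpointSnitch();
String localDc = snitch.getDatacenter(FBUtilities.getBroadcastAddress());
List<InetAddress> retryCandidates = new ArrayList<InetAddress>();
for (InetAddress replica : allReplicas)
    if (localDc.equals(snitch.getDatacenter(replica)))
        retryCandidates.add(replica);
{code}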
[jira] [Commented] (CASSANDRA-9827) Add consistency level to tracing output
[ https://issues.apache.org/jira/browse/CASSANDRA-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14643012#comment-14643012 ] Richard Low commented on CASSANDRA-9827: +1. Can someone commit? Add consistency level to tracing output -- Key: CASSANDRA-9827 URL: https://issues.apache.org/jira/browse/CASSANDRA-9827 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Alec Grieser Priority: Minor Fix For: 2.1.9 Attachments: cassandra-9827-v1.diff To help get a better view of expected behavior of queries, it would be helpful if each query's consistency level (and, where applicable, the serial consistency level) were included in the tracing output. The proposed location would be within the session's row in the sessions table as an additional key-value pair included in the parameters column (along with the query string, for example). Having it here would easily allow the user to group each particular query with the actual consistency level used and thus compare it to expectation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
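For illustration, a sketch of where the level could be recorded; the parameter plumbing here is assumed, not copied from the attached diff:
{code}
// Hedged sketch: add the consistency level (and serial CL, when set) to the
// key-value parameters stored in the trace session row.
ImmutableMap.Builder<String, String> params = ImmutableMap.builder();
params.put("query", queryString);
params.put("consistency_level", options.getConsistency().name());
ConsistencyLevel serial = options.getSerialConsistency();
if (serial != null)
    params.put("serial_consistency_level", serial.name());
Tracing.instance.begin("Execute CQL3 query", params.build());
{code}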
[jira] [Commented] (CASSANDRA-5901) Bootstrap should also make the data consistent on the new node
[ https://issues.apache.org/jira/browse/CASSANDRA-5901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625756#comment-14625756 ] Richard Low commented on CASSANDRA-5901: It would be much better to not have to run repair. Repair can take a long time and increases the time that your cluster is running with a node down, increasing your chance of another failure before the replacement completes. Bootstrap should also make the data consistent on the new node -- Key: CASSANDRA-5901 URL: https://issues.apache.org/jira/browse/CASSANDRA-5901 Project: Cassandra Issue Type: Improvement Components: Core Reporter: sankalp kohli Assignee: Yuki Morishita Priority: Minor Currently when we are bootstrapping a new node, it might bootstrap from a node which does not have the most up-to-date data. Because of this, we need to run a repair after that. Most people will always run the repair so it would help if we could provide a parameter to bootstrap to run the repair once the bootstrap has finished. It can also stop the node from responding to reads till repair has finished. This could be another param as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
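A sketch of the flow being discussed, with the flag and helper names purely hypothetical:
{code}
// Hedged sketch of the original proposal (all names hypothetical):
// stream as today, then repair the new node's ranges before serving reads.
bootstrap(tokens);                                          // existing bootstrap streaming
if (Boolean.getBoolean("cassandra.repair_after_bootstrap")) // hypothetical -D flag
    repairLocalRanges();                                    // hypothetical: repair owned ranges only
finishJoiningRing();                                        // only now accept reads
{code}
The comment above argues the better fix is for bootstrap itself to stream consistent data, so the repair step, and the extended window with a node down, disappears entirely.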
[jira] [Commented] (CASSANDRA-9765) checkForEndpointCollision fails for legitimate collisions
[ https://issues.apache.org/jira/browse/CASSANDRA-9765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622507#comment-14622507 ] Richard Low commented on CASSANDRA-9765: Yes I can. checkForEndpointCollision fails for legitimate collisions - Key: CASSANDRA-9765 URL: https://issues.apache.org/jira/browse/CASSANDRA-9765 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Stefania Fix For: 2.0.17 Since CASSANDRA-7939, checkForEndpointCollision no longer catches a legitimate collision. Without CASSANDRA-7939, wiping a node and starting it again fails with 'A node with address %s already exists', but with it the node happily enters joining state, potentially streaming from the wrong place and violating consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9765) checkForEndpointCollision fails for legitimate collisions
Richard Low created CASSANDRA-9765: -- Summary: checkForEndpointCollision fails for legitimate collisions Key: CASSANDRA-9765 URL: https://issues.apache.org/jira/browse/CASSANDRA-9765 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Fix For: 2.0.17 Since CASSANDRA-7939, checkForEndpointCollision no longer catches a legitimate collision. Without CASSANDRA-7939, wiping a node and starting it again fails with 'A node with address %s already exists', but with it the node happily enters joining state, potentially streaming from the wrong place and violating consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
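For reference, a sketch of the check the ticket wants restored, using approximate 2.0-era gossip API names (isReplacingSameAddress is a stand-in):
{code}
// Hedged illustration of the pre-CASSANDRA-7939 behaviour: during the shadow
// round, refuse to join if a live node already claims this address and we are
// not explicitly replacing it.
InetAddress self = FBUtilities.getBroadcastAddress();
EndpointState epState = Gossiper.instance.getEndpointStateForEndpoint(self);
if (epState != null && !Gossiper.instance.isDeadState(epState) && !isReplacingSameAddress())
    throw new RuntimeException(String.format("A node with address %s already exists", self));
{code}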
[jira] [Created] (CASSANDRA-9753) LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch
Richard Low created CASSANDRA-9753: -- Summary: LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch Key: CASSANDRA-9753 URL: https://issues.apache.org/jira/browse/CASSANDRA-9753 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low When there is a digest mismatch during the initial read, a data read request is sent to all replicas involved in the initial read. This can be more than the initial blockFor if read repair was done and if speculative retry kicked in. E.g. for RF 3 in two DCs, the number of reads could be 4: 2 for LOCAL_QUORUM, 1 for read repair and 1 for speculative read if one replica was slow. If there is then a digest mismatch, Cassandra will issue the data read to all 4 and set blockFor=4. Now the read query is blocked on cross-DC latency. The digest mismatch read blockFor should be capped at RF for the local DC when using CL.LOCAL_*. You can reproduce this behaviour by creating a keyspace with NetworkTopologyStrategy, RF 3 per DC, dc_local_read_repair=1.0 and ALWAYS for speculative read. If you force a digest mismatch (e.g. by deleting a replica's SSTables and restarting) you can see in tracing that it is blocking for 4 responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9641) Occasional timeouts with blockFor=all for LOCAL_QUORUM query
Richard Low created CASSANDRA-9641: -- Summary: Occasional timeouts with blockFor=all for LOCAL_QUORUM query Key: CASSANDRA-9641 URL: https://issues.apache.org/jira/browse/CASSANDRA-9641 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low We have a keyspace using NetworkTopologyStrategy with options DC1:3, DC2:3. Our tables have read_repair_chance = 0.0, dclocal_read_repair_chance = 0.1, speculative_retry = '99.0PERCENTILE' and all reads are at LOCAL_QUORUM. On 2.0.11, we occasionally see this timeout: Cassandra timeout during read query at consistency ALL (6 responses were required but only 5 replica responded) (sometimes only 4 respond). The ALL is probably due to CASSANDRA-7947 if this occurs during a digest mismatch, but what is interesting is it is expecting 6 responses i.e. blockFor is set to all replicas. I can’t see how this should happen. From the code it should never set blockFor to more than 4 (although 4 is still wrong - I'll make a separate JIRA for that). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
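Rough accounting of why 6 looks impossible here (reasoning from the description, not traced through the code):
{code}
// LOCAL_QUORUM over RF 3:                       blockFor = 3/2 + 1 = 2
// + 1 extra data read (dclocal read repair):    up to 3
// + 1 extra read (99th-percentile speculation): up to 4
// Observed blockFor = 6 = every replica in both DCs, which should be unreachable.
{code}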
[jira] [Created] (CASSANDRA-9596) Tombstone timestamps aren't used to skip SSTables while they are still in the memtable
Richard Low created CASSANDRA-9596: -- Summary: Tombstone timestamps aren't used to skip SSTables while they are still in the memtable Key: CASSANDRA-9596 URL: https://issues.apache.org/jira/browse/CASSANDRA-9596 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Fix For: 2.0.x If you have one SSTable containing a partition level tombstone at timestamp t and all other SSTables have cells with timestamp < t, Cassandra will skip all the other SSTables and return nothing quickly. However, if the partition tombstone is still in the memtable it doesn’t skip any SSTables. It should use the same timestamp logic to skip all SSTables. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
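A sketch of the uniform skip being asked for, with approximate 2.0-era names (the read-path call at the end is a stand-in):
{code}
// Hedged sketch: use the partition tombstone's timestamp to skip SSTables,
// whether the tombstone came from an SSTable or is still in the memtable.
long tombstonedAt = container.deletionInfo().getTopLevelDeletion().markedForDeleteAt;
for (SSTableReader sstable : candidateSSTables)
{
    if (sstable.getMaxTimestamp() < tombstonedAt)
        continue; // every cell in this sstable is shadowed by the tombstone
    iterators.add(makeIterator(sstable, key, filter)); // stand-in for the read path
}
{code}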
[jira] [Created] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
Richard Low created CASSANDRA-9434: -- Summary: If a node loses schema_columns SSTables it could delete all secondary indexes from the schema Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
[ https://issues.apache.org/jira/browse/CASSANDRA-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551601#comment-14551601 ] Richard Low commented on CASSANDRA-9434: cc [~iamaleksey] If a node loses schema_columns SSTables it could delete all secondary indexes from the schema - Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
[ https://issues.apache.org/jira/browse/CASSANDRA-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-9434: --- Description: It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. was: It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. If a node loses schema_columns SSTables it could delete all secondary indexes from the schema - Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541008#comment-14541008 ] Richard Low commented on CASSANDRA-9183: Is it possible to get this in 2.1 too? Failure detector should detect and ignore local pauses -- Key: CASSANDRA-9183 URL: https://issues.apache.org/jira/browse/CASSANDRA-9183 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 2.2 beta 1 Attachments: 9183-v2.txt, 9183.txt A local node can be paused for many reasons such as GC, and if the pause is long enough when it recovers it will think all the other nodes are dead until it gossips, causing UAE to be thrown to clients trying to use it as a coordinator. Instead, the FD can track the current time, and if the gap there becomes too large, skip marking the nodes down (reset the FD data perhaps) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8336) Add shutdown gossip state to prevent timeouts during rolling restarts
[ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538198#comment-14538198 ] Richard Low commented on CASSANDRA-8336: How big is your largest QA cluster? I did extensive manual tests to verify this fixes the issue in a large cluster. Add shutdown gossip state to prevent timeouts during rolling restarts - Key: CASSANDRA-8336 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336 Project: Cassandra Issue Type: Bug Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 2.0.15, 2.1.5 Attachments: 8336-v2.txt, 8336-v3.txt, 8336-v4.txt, 8336.txt, 8366-v5.txt In CASSANDRA-3936 we added a gossip shutdown announcement. The problem here is that this isn't sufficient; you can still get TOEs and have to wait on the FD to figure things out. This happens due to gossip propagation time and variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, it can initiate gossip with Y and thus mark X alive again. I propose quarantining to solve this, however I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533030#comment-14533030 ] Richard Low commented on CASSANDRA-9183: +1. Very minor comment: it would be slightly clearer to set lastInterpret immediately after the diff calculation rather than in both cases. Failure detector should detect and ignore local pauses -- Key: CASSANDRA-9183 URL: https://issues.apache.org/jira/browse/CASSANDRA-9183 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 3.x Attachments: 9183-v2.txt, 9183.txt A local node can be paused for many reasons such as GC, and if the pause is long enough when it recovers it will think all the other nodes are dead until it gossips, causing UAE to be thrown to clients trying to use it as a coordinator. Instead, the FD can track the current time, and if the gap there becomes too large, skip marking the nodes down (reset the FD data perhaps) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
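For context, the shape of the patch under review as a hedged sketch (constant and field names approximate), with the nit above applied so lastInterpret is set once, straight after the diff:
{code}
// Hedged sketch: treat an unexpectedly large gap since the last interpret
// round as a local pause (GC, VM freeze) and skip convicting peers once.
long now = System.nanoTime();
long diff = now - lastInterpret;
lastInterpret = now; // per the review comment: set once, right after the diff
if (diff > MAX_LOCAL_PAUSE_IN_NANOS)
{
    logger.warn("Not marking nodes down due to local pause of {} ns", diff);
    return;
}
{code}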
[jira] [Updated] (CASSANDRA-9280) Streaming connections should bind to the broadcast_address of the node
[ https://issues.apache.org/jira/browse/CASSANDRA-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-9280: --- Since Version: 1.2.0 Streaming connections should bind to the broadcast_address of the node -- Key: CASSANDRA-9280 URL: https://issues.apache.org/jira/browse/CASSANDRA-9280 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Yuki Morishita Priority: Minor Currently, if you have multiple interfaces on a server, a node receiving a stream may show the stream as coming from the wrong IP in e.g. nodetool netstats. The IP is taken as the source of the socket, which may not be the same as the node’s broadcast_address. The outgoing socket should be explicitly bound to the broadcast_address. It seems like this was fixed a long time ago in CASSANDRA-737 but has since broken. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
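A minimal sketch of the fix being requested, with port and timeout handling elided (values assumed):
{code}
// Hedged sketch: bind the local end of the outgoing stream socket to
// broadcast_address before connecting, so the receiving node reports the
// correct peer IP in nodetool netstats.
Socket socket = new Socket();
socket.bind(new InetSocketAddress(FBUtilities.getBroadcastAddress(), 0));
socket.connect(new InetSocketAddress(peer, DatabaseDescriptor.getStoragePort()), 2000); // timeout ms assumed
{code}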
[jira] [Created] (CASSANDRA-9280) Streaming connections should bind to the broadcast_address of the node
Richard Low created CASSANDRA-9280: -- Summary: Streaming connections should bind to the broadcast_address of the node Key: CASSANDRA-9280 URL: https://issues.apache.org/jira/browse/CASSANDRA-9280 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Priority: Minor Currently, if you have multiple interfaces on a server, a node receiving a stream may show the stream as coming from the wrong IP in e.g. nodetool netstats. The IP is taken as the source of the socket, which may not be the same as the node’s broadcast_address. The outgoing socket should be explicitly bound to the broadcast_address. It seems like this was fixed a long time ago in CASSANDRA-737 but has since broken. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message
[ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493345#comment-14493345 ] Richard Low commented on CASSANDRA-8336: +1, thanks Brandon! Quarantine nodes after receiving the gossip shutdown message Key: CASSANDRA-8336 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336 Project: Cassandra Issue Type: Bug Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 2.0.15 Attachments: 8336-v2.txt, 8336-v3.txt, 8336-v4.txt, 8336.txt, 8366-v5.txt In CASSANDRA-3936 we added a gossip shutdown announcement. The problem here is that this isn't sufficient; you can still get TOEs and have to wait on the FD to figure things out. This happens due to gossip propagation time and variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, it can initiate gossip with Y and thus mark X alive again. I propose quarantining to solve this, however I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message
[ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341222#comment-14341222 ] Richard Low commented on CASSANDRA-8336: Here it is: {code} ERROR [main] 2015-02-27 18:11:57,584 CassandraDaemon.java (line 513) Exception encountered during startup java.lang.RuntimeException: Unable to gossip with any seeds at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1270) at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:459) at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:673) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:625) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:517) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) INFO [StorageServiceShutdownHook] 2015-02-27 18:11:57,605 Gossiper.java (line 1370) Announcing shutdown ERROR [StorageServiceShutdownHook] 2015-02-27 18:11:57,607 CassandraDaemon.java (line 199) Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.AssertionError at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1339) at org.apache.cassandra.gms.Gossiper.stop(Gossiper.java:1371) at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:586) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:745) {code} Quarantine nodes after receiving the gossip shutdown message Key: CASSANDRA-8336 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336 Project: Cassandra Issue Type: Bug Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 2.0.13 Attachments: 8336-v2.txt, 8336-v3.txt, 8336.txt In CASSANDRA-3936 we added a gossip shutdown announcement. The problem here is that this isn't sufficient; you can still get TOEs and have to wait on the FD to figure things out. This happens due to gossip propagation time and variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, it can initiate gossip with Y and thus mark X alive again. I propose quarantining to solve this, however I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
Richard Low created CASSANDRA-8829: -- Summary: Add extra checks to catch SSTable ref counting bugs Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Fix For: 2.0.13, 2.1.4 There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
[ https://issues.apache.org/jira/browse/CASSANDRA-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-8829: --- Attachment: 8829-2.0.patch Add extra checks to catch SSTable ref counting bugs --- Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Fix For: 2.0.13, 2.1.4 Attachments: 8829-2.0.patch There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
[ https://issues.apache.org/jira/browse/CASSANDRA-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326321#comment-14326321 ] Richard Low commented on CASSANDRA-8829: Attached patch for 2.0. Add extra checks to catch SSTable ref counting bugs --- Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Fix For: 2.0.13, 2.1.4 Attachments: 8829-2.0.patch There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
[ https://issues.apache.org/jira/browse/CASSANDRA-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326427#comment-14326427 ] Richard Low commented on CASSANDRA-8829: Agreed that the check in releaseReference will break SSTableLoader. What do you think about the assert in markReferenced? Add extra checks to catch SSTable ref counting bugs --- Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Richard Low Fix For: 2.0.13, 2.1.4 Attachments: 8829-2.0.patch There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
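A sketch of the markReferenced-side assert under discussion; the 2.0 reference count is an AtomicInteger, but the exact method and field names vary between branches:
{code}
// Hedged sketch: fail fast if a reference is taken on an sstable whose
// count already dropped to zero (i.e. it was released while still readable).
public void markReferenced()
{
    int n = references.incrementAndGet();
    assert n > 1 : "referenced a released sstable " + getFilename();
}
{code}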
[jira] [Commented] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
[ https://issues.apache.org/jira/browse/CASSANDRA-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326749#comment-14326749 ] Richard Low commented on CASSANDRA-8829: Attached v2 patch with releaseReference check removed and your containsAll suggestion. Add extra checks to catch SSTable ref counting bugs --- Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Richard Low Fix For: 2.0.13, 2.1.4 Attachments: 8829-2.0-v2.patch, 8829-2.0.patch There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
[ https://issues.apache.org/jira/browse/CASSANDRA-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-8829: --- Attachment: 8829-2.0-v2.patch Add extra checks to catch SSTable ref counting bugs --- Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Richard Low Fix For: 2.0.13, 2.1.4 Attachments: 8829-2.0-v2.patch, 8829-2.0.patch There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message
[ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325170#comment-14325170 ] Richard Low commented on CASSANDRA-8336: v3 works well and I was able to do a full cluster bounce with zero timeouts. Here are a few minor points: * The shutting down node might as well set the version of the shutdown state to Integer.MAX_VALUE since receiving nodes will blindly use that. * Why does it increment the generation number? We call Gossiper.instance.start with a new generation number set to the current time so it would make sense to use that. * If it hits 'Unable to gossip with any seeds' on replace, it shuts down the gossiper. This throws an AssertionError in addLocalApplicationState since the local epState is null. Quarantine nodes after receiving the gossip shutdown message Key: CASSANDRA-8336 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336 Project: Cassandra Issue Type: Bug Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 2.0.13 Attachments: 8336-v2.txt, 8336-v3.txt, 8336.txt In CASSANDRA-3936 we added a gossip shutdown announcement. The problem here is that this isn't sufficient; you can still get TOEs and have to wait on the FD to figure things out. This happens due to gossip propagation time and variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, it can initiate gossip with Y and thus mark X alive again. I propose quarantining to solve this, however I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
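A sketch of the first point, with gossip API names approximate for the 2.0 branch (the version-forcing helper is assumed):
{code}
// Hedged sketch: publish SHUTDOWN with the highest possible version so any
// peer accepts it regardless of the gossip version it currently holds for us.
EndpointState local = Gossiper.instance.getEndpointStateForEndpoint(FBUtilities.getBroadcastAddress());
local.addApplicationState(ApplicationState.STATUS, StorageService.instance.valueFactory.shutdown(true));
local.getHeartBeatState().forceHighestPossibleVersionUnsafe(); // assumed helper
{code}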
[jira] [Commented] (CASSANDRA-8815) Race in sstable ref counting during streaming failures
[ https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325007#comment-14325007 ] Richard Low commented on CASSANDRA-8815: I think we should add some assertions that would avoid the bad effect of this. I'll prepare a patch and put it here. Race in sstable ref counting during streaming failures Key: CASSANDRA-8815 URL: https://issues.apache.org/jira/browse/CASSANDRA-8815 Project: Cassandra Issue Type: Bug Components: Core Reporter: sankalp kohli Assignee: Benedict Fix For: 2.0.13 Attachments: 8815.txt We have seen a machine in Prod where all read threads are blocked (spinning) on trying to acquire the reference lock on sstables. There are also some stream sessions which are doing the same. On looking at the heap dump, we could see that a live sstable which is part of the View has a ref count = 0. This sstable is also not compacting, nor is it part of any failed compaction. On looking through the code, we could see that if the ref goes to zero and the sstable is part of the View, all reader threads will spin forever. On further looking through the streaming code, we could see that if StreamTransferTask.complete is called after closeSession has been called due to an error in OutgoingMessageHandler, it will double-decrement the ref count of an sstable. This race can happen, and we can see from exceptions in the logs that closeSession was triggered by OutgoingMessageHandler. The fix for this is very simple, I think. In StreamTransferTask.abort, we can remove a file from "files" before decrementing the ref count. This will avoid this race. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
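A sketch of that ordering fix, with field and type names approximate for the 2.0 streaming code:
{code}
// Hedged sketch of StreamTransferTask.abort(): remove each entry from the
// shared "files" map before releasing its reference, so a racing complete()
// can never release the same reference a second time.
Iterator<Map.Entry<Integer, OutgoingFileMessage>> iter = files.entrySet().iterator();
while (iter.hasNext())
{
    OutgoingFileMessage msg = iter.next().getValue();
    iter.remove();                  // make it invisible to complete() first
    msg.sstable.releaseReference(); // then drop our reference exactly once
}
{code}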
[jira] [Commented] (CASSANDRA-7968) permissions_validity_in_ms should be settable via JMX
[ https://issues.apache.org/jira/browse/CASSANDRA-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299269#comment-14299269 ] Richard Low commented on CASSANDRA-7968: How is this meant to work? The MBean is never registered so how do I call it? permissions_validity_in_ms should be settable via JMX - Key: CASSANDRA-7968 URL: https://issues.apache.org/jira/browse/CASSANDRA-7968 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Priority: Minor Fix For: 2.0.11, 2.1.1 Attachments: 7968.txt Oftentimes people don't think about auth problems and just run with the default of RF=2 and 2000ms until it's too late, and at that point doing a rolling restart to change the permissions cache can be a bit painful vs setting it via JMX everywhere and then updating the yaml for future restarts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
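For illustration, the registration step whose absence is being questioned, a sketch only; the object name and bean variable are assumed, not taken from the patch:
{code}
// Hedged sketch: without an explicit registration like this at startup,
// the JMX setter is unreachable from tools like jconsole or jmxterm.
MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
mbs.registerMBean(permissionsMBean, new ObjectName("org.apache.cassandra.auth:type=PermissionsCache"));
{code}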
[jira] [Commented] (CASSANDRA-7968) permissions_validity_in_ms should be settable via JMX
[ https://issues.apache.org/jira/browse/CASSANDRA-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299285#comment-14299285 ] Richard Low commented on CASSANDRA-7968: As Benedict says, there is at least one user who cares :) permissions_validity_in_ms should be settable via JMX - Key: CASSANDRA-7968 URL: https://issues.apache.org/jira/browse/CASSANDRA-7968 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Priority: Minor Fix For: 2.0.11, 2.1.1 Attachments: 7968.txt Oftentimes people don't think about auth problems and just run with the default of RF=2 and 2000ms until it's too late, and at that point doing a rolling restart to change the permissions cache can be a bit painful vs setting it via JMX everywhere and then updating the yaml for future restarts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278204#comment-14278204 ] Richard Low commented on CASSANDRA-8414: I tested this on some real workload SSTables and got a 2x speedup on force compaction! Also the output was the same as before. Can someone commit the patch? Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.0.12, 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt, cassandra-2.0-8414-4.txt, cassandra-2.0-8414-5.txt, cassandra-2.1-8414-5.txt, cassandra-2.1-8414-6.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-8414: --- Fix Version/s: 2.0.12 Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.0.12, 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt, cassandra-2.0-8414-4.txt, cassandra-2.0-8414-5.txt, cassandra-2.1-8414-5.txt, cassandra-2.1-8414-6.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274593#comment-14274593 ] Richard Low commented on CASSANDRA-8414: Only minor nit is that the BitSet can be initialized with size rather than cells.length, but otherwise +1. Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt, cassandra-2.0-8414-4.txt, cassandra-2.0-8414-5.txt, cassandra-2.1-8414-5.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
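For readers following along, a hedged sketch of the single-pass approach the patch series converged on, with the nit applied (BitSet sized by the live cell count); the GC condition is approximate:
{code}
// Hedged sketch: mark GCable cells in a BitSet, then compact the backing
// array in one O(n) pass instead of O(n^2) iterator.remove() calls.
BitSet removedCells = new BitSet(size); // per review: size, not cells.length
for (int i = 0; i < size; i++)
    if (cells[i].getLocalDeletionTime() < gcBefore) // approximate GCable test
        removedCells.set(i);
int out = 0;
for (int i = 0; i < size; i++)
    if (!removedCells.get(i))
        cells[out++] = cells[i];
size = out;
{code}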
[jira] [Commented] (CASSANDRA-5913) Nodes with no gossip STATUS shown as UN by nodetool:status
[ https://issues.apache.org/jira/browse/CASSANDRA-5913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273930#comment-14273930 ] Richard Low commented on CASSANDRA-5913: I have seen this on 1.2.19 and 2.0.9. I suspect the root cause is CASSANDRA-6125. Nodes with no gossip STATUS shown as UN by nodetool:status Key: CASSANDRA-5913 URL: https://issues.apache.org/jira/browse/CASSANDRA-5913 Project: Cassandra Issue Type: Bug Components: Core Environment: 1.2.8 Reporter: Chris Burroughs Priority: Minor I have no idea if this is a valid situation or a larger problem, but either way nodetool status should not make it look like everything is a-okay. From nt:gossipinfo: {noformat} /64.215.255.182 RACK:NOP NET_VERSION:6 HOST_ID:4f3b214b-b03e-46eb-8214-5fab2662a06b RELEASE_VERSION:1.2.8 DC:IAD INTERNAL_IP:10.15.2.182 SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f RPC_ADDRESS:0.0.0.0 {noformat} {noformat} $ ./bin/nt.sh status | grep -i 4055109d-800d-4743-8efa-4ecfff883463 UN 64.215.255.182 63.84 GB 256 2.5% 4055109d-800d-4743-8efa-4ecfff883463 NOP {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8515) Hang at startup when no commitlog space
[ https://issues.apache.org/jira/browse/CASSANDRA-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271668#comment-14271668 ] Richard Low commented on CASSANDRA-8515: I think #5737 was marked as invalid because it was thought to be a bug outside of Cassandra. But understanding the cause means we can do something about it, and I think logging and stopping would be the right approach, as you say. Hang at startup when no commitlog space --- Key: CASSANDRA-8515 URL: https://issues.apache.org/jira/browse/CASSANDRA-8515 Project: Cassandra Issue Type: Bug Reporter: Richard Low Fix For: 2.0.12 If the commit log directory has no free space, Cassandra hangs on startup. The main thread is waiting: {code} main prio=9 tid=0x7fefe400f800 nid=0x1303 waiting on condition [0x00010b9c1000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007dc8c5fc8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:137) at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:299) at org.apache.cassandra.db.commitlog.CommitLog.<init>(CommitLog.java:73) at org.apache.cassandra.db.commitlog.CommitLog.<clinit>(CommitLog.java:53) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:360) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:339) at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:699) at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:208) at org.apache.cassandra.db.SystemKeyspace.updateSchemaVersion(SystemKeyspace.java:390) - locked 0x0007de2f2ce0 (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace) at org.apache.cassandra.config.Schema.updateVersion(Schema.java:384) at org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:532) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) {code} but COMMIT-LOG-ALLOCATOR is RUNNABLE: {code} COMMIT-LOG-ALLOCATOR prio=9 tid=0x7fefe5402800 nid=0x7513 in Object.wait() [0x000118252000] java.lang.Thread.State: RUNNABLE at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:116) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:745) {code} but making no progress. This behaviour has changed since 1.2 (see CASSANDRA-5737). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
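A sketch of the "log and stop" behaviour suggested above, with helper names approximate for the 2.0 commit log code:
{code}
// Hedged sketch: surface the allocation failure and apply the disk failure
// policy instead of blocking forever in fetchSegment().
try
{
    CommitLogSegment segment = CommitLogSegment.freshSegment();
    segmentQueue.add(segment); // stand-in for the allocator's internal queue
}
catch (FSWriteError e)
{
    logger.error("Failed to allocate commit log segment; is the commitlog volume full?", e);
    FileUtils.handleFSError(e); // stop or die per disk_failure_policy
}
{code}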
[jira] [Comment Edited] (CASSANDRA-8515) Hang at startup when no commitlog space
[ https://issues.apache.org/jira/browse/CASSANDRA-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271668#comment-14271668 ] Richard Low edited comment on CASSANDRA-8515 at 1/9/15 6:38 PM: I think CASSANDRA-5737 was marked as invalid because it was thought to be a bug outside of Cassandra. But understanding the cause means we can do something about it, and I think logging and stopping would be the right approach, as you say. was (Author: rlow): I think #5737 was marked as invalid because it was thought to be a bug outside of Cassandra. But understanding the cause means we can do something about it, and I think logging and stopping would be the right approach, as you say. Hang at startup when no commitlog space --- Key: CASSANDRA-8515 URL: https://issues.apache.org/jira/browse/CASSANDRA-8515 Project: Cassandra Issue Type: Bug Reporter: Richard Low Fix For: 2.0.12 If the commit log directory has no free space, Cassandra hangs on startup. The main thread is waiting: {code} main prio=9 tid=0x7fefe400f800 nid=0x1303 waiting on condition [0x00010b9c1000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007dc8c5fc8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:137) at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:299) at org.apache.cassandra.db.commitlog.CommitLog.<init>(CommitLog.java:73) at org.apache.cassandra.db.commitlog.CommitLog.<clinit>(CommitLog.java:53) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:360) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:339) at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:699) at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:208) at org.apache.cassandra.db.SystemKeyspace.updateSchemaVersion(SystemKeyspace.java:390) - locked 0x0007de2f2ce0 (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace) at org.apache.cassandra.config.Schema.updateVersion(Schema.java:384) at org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:532) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) {code} but COMMIT-LOG-ALLOCATOR is RUNNABLE: {code} COMMIT-LOG-ALLOCATOR prio=9 tid=0x7fefe5402800 nid=0x7513 in Object.wait() [0x000118252000] java.lang.Thread.State: RUNNABLE at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:116) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:745) {code} but making no progress. This behaviour has changed since 1.2 (see CASSANDRA-5737). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8515) Hang at startup when no commitlog space
[ https://issues.apache.org/jira/browse/CASSANDRA-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-8515: --- Description: If the commit log directory has no free space, Cassandra hangs on startup. The main thread is waiting: {code} main prio=9 tid=0x7fefe400f800 nid=0x1303 waiting on condition [0x00010b9c1000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007dc8c5fc8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:137) at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:299) at org.apache.cassandra.db.commitlog.CommitLog.<init>(CommitLog.java:73) at org.apache.cassandra.db.commitlog.CommitLog.<clinit>(CommitLog.java:53) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:360) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:339) at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:699) at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:208) at org.apache.cassandra.db.SystemKeyspace.updateSchemaVersion(SystemKeyspace.java:390) - locked 0x0007de2f2ce0 (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace) at org.apache.cassandra.config.Schema.updateVersion(Schema.java:384) at org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:532) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) {code} but COMMIT-LOG-ALLOCATOR is RUNNABLE: {code} COMMIT-LOG-ALLOCATOR prio=9 tid=0x7fefe5402800 nid=0x7513 in Object.wait() [0x000118252000] java.lang.Thread.State: RUNNABLE at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:116) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:745) {code} but making no progress. This behaviour has changed since 1.2 (see CASSANDRA-5737). was: If the commit log directory has no free space, Cassandra hangs on startup. 
The main thread is waiting: {code} main prio=9 tid=0x7fefe400f800 nid=0x1303 waiting on condition [0x00010b9c1000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007dc8c5fc8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:137) at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:299) at org.apache.cassandra.db.commitlog.CommitLog.<init>(CommitLog.java:73) at org.apache.cassandra.db.commitlog.CommitLog.<clinit>(CommitLog.java:53) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:360) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:339) at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:699) at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:208) at org.apache.cassandra.db.SystemKeyspace.updateSchemaVersion(SystemKeyspace.java:390) - locked 0x0007de2f2ce0 (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace) at org.apache.cassandra.config.Schema.updateVersion(Schema.java:384) at org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:532) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) {code} but COMMIT-LOG-ALLOCATOR is RUNNABLE: {code} COMMIT-LOG-ALLOCATOR prio=9 tid=0x7fefe5402800 nid=0x7513 in Object.wait() [0x000118252000] java.lang.Thread.State: RUNNABLE at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:116) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:745) {code} but making no progress. This behaviour has changed since 1.2 (see CASSANDRA-5737). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268075#comment-14268075 ] Richard Low commented on CASSANDRA-8414: +1 on 2.0 v5. Do you have a 2.1 version? Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt, cassandra-2.0-8414-4.txt, cassandra-2.0-8414-5.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8515) Hang at startup when no commitlog space
Richard Low created CASSANDRA-8515: -- Summary: Hang at startup when no commitlog space Key: CASSANDRA-8515 URL: https://issues.apache.org/jira/browse/CASSANDRA-8515 Project: Cassandra Issue Type: Bug Reporter: Richard Low If the commit log directory has no free space, Cassandra hangs on startup. The main thread is waiting:
{code}
main prio=9 tid=0x7fefe400f800 nid=0x1303 waiting on condition [0x00010b9c1000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for 0x0007dc8c5fc8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:137)
at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:299)
at org.apache.cassandra.db.commitlog.CommitLog.<init>(CommitLog.java:73)
at org.apache.cassandra.db.commitlog.CommitLog.<clinit>(CommitLog.java:53)
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:360)
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:339)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211)
at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:699)
at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:208)
at org.apache.cassandra.db.SystemKeyspace.updateSchemaVersion(SystemKeyspace.java:390)
- locked 0x0007de2f2ce0 (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace)
at org.apache.cassandra.config.Schema.updateVersion(Schema.java:384)
at org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:532)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
{code}
but COMMIT-LOG-ALLOCATOR is RUNNABLE:
{code}
COMMIT-LOG-ALLOCATOR prio=9 tid=0x7fefe5402800 nid=0x7513 in Object.wait() [0x000118252000]
java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:116)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
{code}
but making no progress. This behaviour has changed since 1.2 (see CASSANDRA-5737). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
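The deadlock shape is worth spelling out: the main thread parks in LinkedBlockingQueue.take() waiting for a segment that the allocator thread, stuck on the full disk, will never enqueue. Below is a minimal, self-contained Java sketch of a fail-fast alternative (a bounded poll); the class and method names are illustrative, not Cassandra's actual CommitLogAllocator API.
{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch: bound the wait for a fresh segment so a stuck allocator surfaces
// as an error instead of an indefinite hang of the main thread.
public class FailFastFetch
{
    private final LinkedBlockingQueue<String> segments = new LinkedBlockingQueue<>();

    public String fetchSegmentOrFail(long timeout, TimeUnit unit) throws InterruptedException
    {
        String segment = segments.poll(timeout, unit); // take() would park forever
        if (segment == null)
            throw new IllegalStateException("no commit log segment became available after "
                                            + timeout + " " + unit + "; is the commit log disk full?");
        return segment;
    }

    public static void main(String[] args) throws InterruptedException
    {
        // Nothing is ever enqueued, simulating an allocator that cannot create
        // segments: this throws after two seconds rather than hanging startup.
        new FailFastFetch().fetchSegmentOrFail(2, TimeUnit.SECONDS);
    }
}
{code}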
[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-8414: --- Reviewer: Richard Low Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-5737) CassandraDaemon - recent unsafe memory access operation in compiled Java code
[ https://issues.apache.org/jira/browse/CASSANDRA-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251018#comment-14251018 ] Richard Low commented on CASSANDRA-5737: I get exactly this error when the disk is full on Linux. It must be some poor handling of the disk full error by the JVM. CassandraDaemon - recent unsafe memory access operation in compiled Java code - Key: CASSANDRA-5737 URL: https://issues.apache.org/jira/browse/CASSANDRA-5737 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.2.6 Environment: Amazon EC2, XLarge instance. Ubuntu 12.04.2 LTS Raid 0 disks, with ext4 Reporter: Glyn Davies I'm using 1.2.6 on Ubuntu AWS m1.xlarge instances with the Datastax Community package and have tried using Java versions jdk1.7.0_25 and jre1.6.0_45, also testing with and without libjna-java (i.e. the JNA jar). However, something has triggered a bug in the CassandraDaemon:
ERROR [COMMIT-LOG-ALLOCATOR] 2013-07-05 15:00:51,663 CassandraDaemon.java (line 192) Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main]
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:126)
at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:81)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.createFreshSegment(CommitLogAllocator.java:250)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.access$500(CommitLogAllocator.java:48)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:104)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Unknown Source)
This brought two nodes down out of a three-node cluster – using QUORUM write with 3 replicas. Restarting the node replays this error, so I have the system in a 'stable' unstable state – which is probably a good place for troubleshooting. Presumably something a client wrote triggered this situation, and the other third node was to be the final replication point – and is thus still up. Subsequently discovered that only a reboot will allow that node to come back up. A Java bug was raised with Oracle after a crash dump indicated a SIGBUS. http://bugs.sun.com/view_bug.do?bug_id=9004953 At this point, I'm thinking that there is potentially a Linux kernel bug being triggered? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246961#comment-14246961 ] Richard Low commented on CASSANDRA-8414: Thanks for writing the patch! A few comments:
- The v3 patch is missing the BatchIterator interface.
- There are some unnecessary formatting changes and import-order switches.
- The remove method should throw IllegalStateException if called twice on the same element, to adhere to the Iterator contract.
- Calling commit twice will remove the wrong elements; it should either throw IllegalStateException when called more than once or be made idempotent.
- Could add 'assert test = src;' to the copy method to enforce the comment.
Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237987#comment-14237987 ] Richard Low commented on CASSANDRA-8414: Yes, I can review this week. Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8414) Compaction is O(n^2) when deleting lots of tombstones
Richard Low created CASSANDRA-8414: -- Summary: Compaction is O(n^2) when deleting lots of tombstones Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
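For readers following along, the cost claim is easy to demonstrate: Iterator.remove() on an ArrayList shifts the tail of the backing array on every call, so removing most of n elements is O(n^2), while copying the survivors into a fresh list is O(n). A self-contained Java sketch (not Cassandra code) that makes the difference measurable:
{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class RemoveCost
{
    public static void main(String[] args)
    {
        int n = 50_000;
        List<Integer> columns = new ArrayList<>();
        for (int i = 0; i < n; i++)
            columns.add(i);

        // O(n^2): in-place removal, the pattern removeDeletedStandard hits when
        // most columns are GCable tombstones (here: remove every even element).
        List<Integer> inPlace = new ArrayList<>(columns);
        long t0 = System.nanoTime();
        for (Iterator<Integer> it = inPlace.iterator(); it.hasNext(); )
            if (it.next() % 2 == 0)
                it.remove();
        long inPlaceMs = (System.nanoTime() - t0) / 1_000_000;

        // O(n): copy the survivors instead of deleting in place.
        t0 = System.nanoTime();
        List<Integer> copied = new ArrayList<>(columns.size());
        for (Integer c : columns)
            if (c % 2 != 0)
                copied.add(c);
        long copiedMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.printf("in-place remove: %d ms, copy: %d ms%n", inPlaceMs, copiedMs);
    }
}
{code}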
[jira] [Created] (CASSANDRA-8416) AssertionError 'Incoherent new size -1' during hints compaction
Richard Low created CASSANDRA-8416: -- Summary: AssertionError 'Incoherent new size -1' during hints compaction Key: CASSANDRA-8416 URL: https://issues.apache.org/jira/browse/CASSANDRA-8416 Project: Cassandra Issue Type: Bug Reporter: Richard Low I've seen the error on 2.0.9: java.lang.AssertionError: Incoherent new size -1 replacing [SSTableReader(path='/cassandra/d1/data/system/hints/system-hints-jb-24386-Data.db')] by [] in View(pending_count=0, sstables=[], compacting=[SSTableReader(path='/cassandra/d1/data/system/hints/system-hints-jb-24386-Data.db')]) in logs during hints compaction. It looks like there are 2 concurrent compactions of the same file - just before this error the logs say:
INFO [CompactionExecutor:220316] 2014-11-19 22:53:54,650 CompactionTask.java (line 115) Compacting [SSTableReader(path='/cassandra/d1/data/system/hints/system-hints-jb-24386-Data.db')]
INFO [CompactionExecutor:220315] 2014-11-19 22:53:54,651 CompactionTask.java (line 115) Compacting [SSTableReader(path='/cassandra/d1/data/system/hints/system-hints-jb-24386-Data.db')]
The assertion is:
int newSSTablesSize = sstables.size() - oldSSTables.size() + Iterables.size(replacements);
assert newSSTablesSize >= Iterables.size(replacements) : String.format("Incoherent new size %d replacing %s by %s in %s", newSSTablesSize, oldSSTables, replacements, this);
So if the first compaction completes, the second one has sstables=[] (as seen in the assertion failure print), so newSSTablesSize = 0 - 1 + 0 = -1 and we get the error. It is possible the root cause is the same as CASSANDRA-7145. Does anyone know how to tell? The error happens very rarely, so it is hard to tell from testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
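A tiny Java replay of that arithmetic under the suspected race, using the values visible in the log above:
{code}
// The first compaction finishes and removes the hints sstable from the view;
// the second compaction, handed the same SSTableReader, then computes:
public class IncoherentSize
{
    public static void main(String[] args)
    {
        int sstables = 0;     // View(sstables=[]) after the first compaction completed
        int oldSSTables = 1;  // the second compaction still references the same file
        int replacements = 0; // the hints compaction produced no output sstable
        System.out.println(sstables - oldSSTables + replacements); // -1, tripping the assertion
    }
}
{code}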
[jira] [Created] (CASSANDRA-8121) Audit acquire/release SSTable references
Richard Low created CASSANDRA-8121: -- Summary: Audit acquire/release SSTable references Key: CASSANDRA-8121 URL: https://issues.apache.org/jira/browse/CASSANDRA-8121 Project: Cassandra Issue Type: Task Components: Core Reporter: Richard Low There are instances where SSTable references are not guaranteed to be released (e.g. CompactionTask.runWith) because there is no try/finally around the reference acquire/release. We should audit all places where SSTable references are acquired and wrap them appropriately. Leaked references cause junk files to build up on disk and on a restart can lead to data resurrection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
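A sketch of the shape the audit should enforce, with an illustrative reference-counted class rather than Cassandra's SSTableReader API: the release goes in a finally block immediately after a successful acquire, so no exception path can leak the reference.
{code}
import java.util.concurrent.atomic.AtomicInteger;

public class RefCounted
{
    private final AtomicInteger refs = new AtomicInteger(1); // creator holds one reference

    public boolean acquire()
    {
        while (true)
        {
            int n = refs.get();
            if (n <= 0)
                return false; // already fully released; backing files may be gone
            if (refs.compareAndSet(n, n + 1))
                return true;
        }
    }

    public void release()
    {
        if (refs.decrementAndGet() == 0)
            System.out.println("last reference released: safe to delete files");
    }

    public static void main(String[] args)
    {
        RefCounted sstable = new RefCounted();
        if (sstable.acquire())
        {
            try
            {
                // ... compact, stream, or read from the sstable ...
            }
            finally
            {
                sstable.release(); // runs even if the task above throws
            }
        }
        sstable.release(); // drop the creator's reference
    }
}
{code}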
[jira] [Created] (CASSANDRA-8113) Gossip should ignore generation numbers too far in the future
Richard Low created CASSANDRA-8113: -- Summary: Gossip should ignore generation numbers too far in the future Key: CASSANDRA-8113 URL: https://issues.apache.org/jira/browse/CASSANDRA-8113 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low If a node sends corrupted gossip, it could set the generation numbers for other nodes to arbitrarily large values. This is dangerous since one bad node (e.g. with bad memory) could in theory bring down the cluster. Nodes should refuse to accept generation numbers that are too far in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
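A hedged sketch of such a guard; the constant name and the one-year bound are illustrative assumptions, not what Cassandra ships:
{code}
public class GenerationGuard
{
    // Generations are epoch seconds; allow generous clock drift but nothing absurd.
    static final long MAX_GENERATION_DRIFT_SECONDS = 365L * 24 * 3600;

    static boolean acceptGeneration(long proposed, long localEpochSeconds)
    {
        if (proposed > localEpochSeconds + MAX_GENERATION_DRIFT_SECONDS)
        {
            System.err.printf("ignoring generation %d: more than %d seconds ahead of local time%n",
                              proposed, MAX_GENERATION_DRIFT_SECONDS);
            return false;
        }
        return true;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis() / 1000;
        System.out.println(acceptGeneration(now + 60, now));           // true: normal restart
        System.out.println(acceptGeneration(Long.MAX_VALUE / 2, now)); // false: corrupt gossip
    }
}
{code}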
[jira] [Commented] (CASSANDRA-4206) AssertionError: originally calculated column size of 629444349 but now it is 588008950
[ https://issues.apache.org/jira/browse/CASSANDRA-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107031#comment-14107031 ] Richard Low commented on CASSANDRA-4206: The root cause of this in 1.2 is CASSANDRA-7808. AssertionError: originally calculated column size of 629444349 but now it is 588008950 -- Key: CASSANDRA-4206 URL: https://issues.apache.org/jira/browse/CASSANDRA-4206 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.9 Environment: Debian Squeeze Linux, kernel 2.6.32, sun-java6-bin 6.26-0squeeze1 Reporter: Patrik Modesto I have a 4-node cluster of Cassandra 1.0.9. There is a rfTest3 keyspace with RF=3 and one CF with two secondary indexes. I'm importing data into this CF using a Hadoop MapReduce job; each row has less than 10 columns. From JMX: MaxRowSize: 1597 MeanRowSize: 369 And there are some tens of millions of rows. It's write-heavy usage and there is heavy pressure on each node, with quite a few dropped mutations on each node. After ~12 hours of inserting I see these assertion exceptions on 3 out of 4 nodes:
{noformat}
ERROR 06:25:40,124 Fatal exception in thread Thread[HintedHandoff:1,1,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.AssertionError: originally calculated column size of 629444349 but now it is 588008950
at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:388)
at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:256)
at org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:84)
at org.apache.cassandra.db.HintedHandOffManager$3.runMayThrow(HintedHandOffManager.java:437)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: java.lang.AssertionError: originally calculated column size of 629444349 but now it is 588008950
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:384)
... 7 more
Caused by: java.lang.AssertionError: originally calculated column size of 629444349 but now it is 588008950
at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:124)
at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:161)
at org.apache.cassandra.db.compaction.CompactionManager$7.call(CompactionManager.java:380)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
... 3 more
{noformat}
A few lines regarding hints from the output.log:
{noformat}
INFO 06:21:26,202 Compacting large row system/HintsColumnFamily:7000 (1712834057 bytes) incrementally
INFO 06:22:52,610 Compacting large row system/HintsColumnFamily:1000 (2616073981 bytes) incrementally
INFO 06:22:59,111 flushing high-traffic column family CFS(Keyspace='system', ColumnFamily='HintsColumnFamily') (estimated 305147360 bytes)
INFO 06:22:59,813 Enqueuing flush of Memtable-HintsColumnFamily@833933926(3814342/305147360 serialized/live bytes, 7452 ops)
INFO 06:22:59,814 Writing Memtable-HintsColumnFamily@833933926(3814342/305147360 serialized/live bytes, 7452 ops)
{noformat}
I think the problem may be somehow connected to an IntegerType secondary index. I had a different problem with a CF with two secondary indexes, the first UTF8Type, the second IntegerType. After a few hours of inserting data in the afternoon and a midnight repair+compact, the next day I couldn't find any row using the IntegerType secondary index. The output was like this:
{noformat}
[default@rfTest3] get IndexTest where col1 = '3230727:http://zaskolak.cz/download.php';
---
RowKey: 3230727:8383582:http://zaskolak.cz/download.php
=> (column=col1, value=3230727:http://zaskolak.cz/download.php, timestamp=1335348630332000)
=> (column=col2, value=8383582, timestamp=1335348630332000)
---
RowKey:
[jira] [Created] (CASSANDRA-7808) LazilyCompactedRow incorrectly handles row tombstones
Richard Low created CASSANDRA-7808: -- Summary: LazilyCompactedRow incorrectly handles row tombstones Key: CASSANDRA-7808 URL: https://issues.apache.org/jira/browse/CASSANDRA-7808 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low LazilyCompactedRow doesn’t handle row tombstones correctly, leading to an AssertionError (CASSANDRA-4206) in some cases, and the row tombstone being incorrectly dropped in others. It looks like this was introduced by CASSANDRA-5677. To reproduce an AssertionError:
1. Hack a really small return value for DatabaseDescriptor.getInMemoryCompactionLimit() like 10 bytes to force large row compaction
2. Create a column family with gc_grace = 10
3. Insert a few columns in one row
4. Call nodetool flush
5. Delete the row
6. Call nodetool flush
7. Wait 10 seconds
8. Call nodetool compact and it will fail
To reproduce the row tombstone being dropped, do the same except, after the delete (in step 5), insert a column that sorts before the ones you inserted in step 3. E.g. if you inserted b, c, d in step 3, insert a now. After the compaction, which now succeeds, the full row will be visible, rather than just a. The problem is twofold. Firstly, LazilyCompactedRow.Reducer.reduce() and getReduce() incorrectly call container.clear(). This clears the columns (as intended) but also removes the deletion times from container. This means no further columns are deleted if they are annihilated by the row tombstone. Secondly, after the second pass, LazilyCompactedRow.isEmpty() is called which calls {{ColumnFamilyStore.removeDeletedCF(emptyColumnFamily, controller.gcBefore(key.getToken()))}} which unfortunately removes the last deleted time from emptyColumnFamily if it is earlier than gcBefore. Since this is only called after the second pass, the second pass doesn’t remove any columns that are removed by the row tombstone whereas the first pass removes just the first one. This is pretty serious - no large rows can ever be compacted and row tombstones can go missing. -- This message was sent by Atlassian JIRA (v6.2#6252)
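To make the first half concrete, here is a toy container (illustrative names, not Cassandra's ColumnFamily) showing why clearing the deletion info along with the columns breaks tombstone handling across reduce batches:
{code}
import java.util.TreeMap;

public class Container
{
    final TreeMap<String, Long> columns = new TreeMap<>(); // column name -> timestamp
    long deletedAt = Long.MIN_VALUE;                        // row tombstone time

    boolean annihilatedByTombstone(long ts)
    {
        return ts <= deletedAt;
    }

    // Buggy shape: also forgets the row tombstone, so columns in later batches
    // that the tombstone should annihilate survive the compaction.
    void clearEverything()
    {
        columns.clear();
        deletedAt = Long.MIN_VALUE;
    }

    // Fix shape: forget only the emitted columns, keep the deletion info.
    void clearColumnsOnly()
    {
        columns.clear();
    }

    public static void main(String[] args)
    {
        Container c = new Container();
        c.deletedAt = 100;
        c.clearColumnsOnly();
        System.out.println(c.annihilatedByTombstone(50)); // true: tombstone retained
        c.clearEverything();
        System.out.println(c.annihilatedByTombstone(50)); // false: tombstone lost
    }
}
{code}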
[jira] [Updated] (CASSANDRA-7808) LazilyCompactedRow incorrectly handles row tombstones
[ https://issues.apache.org/jira/browse/CASSANDRA-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-7808: --- Attachment: 7808-v1.diff LazilyCompactedRow incorrectly handles row tombstones - Key: CASSANDRA-7808 URL: https://issues.apache.org/jira/browse/CASSANDRA-7808 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Attachments: 7808-v1.diff LazilyCompactedRow doesn’t handle row tombstones correctly, leading to an AssertionError (CASSANDRA-4206) in some cases, and the row tombstone being incorrectly dropped in others. It looks like this was introduced by CASSANDRA-5677. To reproduce an AssertionError:
1. Hack a really small return value for DatabaseDescriptor.getInMemoryCompactionLimit() like 10 bytes to force large row compaction
2. Create a column family with gc_grace = 10
3. Insert a few columns in one row
4. Call nodetool flush
5. Delete the row
6. Call nodetool flush
7. Wait 10 seconds
8. Call nodetool compact and it will fail
To reproduce the row tombstone being dropped, do the same except, after the delete (in step 5), insert a column that sorts before the ones you inserted in step 3. E.g. if you inserted b, c, d in step 3, insert a now. After the compaction, which now succeeds, the full row will be visible, rather than just a. The problem is twofold. Firstly, LazilyCompactedRow.Reducer.reduce() and getReduce() incorrectly call container.clear(). This clears the columns (as intended) but also removes the deletion times from container. This means no further columns are deleted if they are annihilated by the row tombstone. Secondly, after the second pass, LazilyCompactedRow.isEmpty() is called which calls {{ColumnFamilyStore.removeDeletedCF(emptyColumnFamily, controller.gcBefore(key.getToken()))}} which unfortunately removes the last deleted time from emptyColumnFamily if it is earlier than gcBefore. Since this is only called after the second pass, the second pass doesn’t remove any columns that are removed by the row tombstone whereas the first pass removes just the first one. This is pretty serious - no large rows can ever be compacted and row tombstones can go missing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7808) LazilyCompactedRow incorrectly handles row tombstones
[ https://issues.apache.org/jira/browse/CASSANDRA-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104877#comment-14104877 ] Richard Low commented on CASSANDRA-7808: I attached a patch which I think fixes this. LazilyCompactedRow incorrectly handles row tombstones - Key: CASSANDRA-7808 URL: https://issues.apache.org/jira/browse/CASSANDRA-7808 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Fix For: 1.2.19 Attachments: 7808-v1.diff LazilyCompactedRow doesn’t handle row tombstones correctly, leading to an AssertionError (CASSANDRA-4206) in some cases, and the row tombstone being incorrectly dropped in others. It looks like this was introduced by CASSANDRA-5677. To reproduce an AssertionError:
1. Hack a really small return value for DatabaseDescriptor.getInMemoryCompactionLimit() like 10 bytes to force large row compaction
2. Create a column family with gc_grace = 10
3. Insert a few columns in one row
4. Call nodetool flush
5. Delete the row
6. Call nodetool flush
7. Wait 10 seconds
8. Call nodetool compact and it will fail
To reproduce the row tombstone being dropped, do the same except, after the delete (in step 5), insert a column that sorts before the ones you inserted in step 3. E.g. if you inserted b, c, d in step 3, insert a now. After the compaction, which now succeeds, the full row will be visible, rather than just a. The problem is twofold. Firstly, LazilyCompactedRow.Reducer.reduce() and getReduce() incorrectly call container.clear(). This clears the columns (as intended) but also removes the deletion times from container. This means no further columns are deleted if they are annihilated by the row tombstone. Secondly, after the second pass, LazilyCompactedRow.isEmpty() is called which calls {{ColumnFamilyStore.removeDeletedCF(emptyColumnFamily, controller.gcBefore(key.getToken()))}} which unfortunately removes the last deleted time from emptyColumnFamily if it is earlier than gcBefore. Since this is only called after the second pass, the second pass doesn’t remove any columns that are removed by the row tombstone whereas the first pass removes just the first one. This is pretty serious - no large rows can ever be compacted and row tombstones can go missing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7808) LazilyCompactedRow incorrectly handles row tombstones
[ https://issues.apache.org/jira/browse/CASSANDRA-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105018#comment-14105018 ] Richard Low commented on CASSANDRA-7808: Sorry, I got my scope wrong. The AssertionError is likely caused by CASSANDRA-5677 but the clearing in the Reducer existed a long time before that. LazilyCompactedRow incorrectly handles row tombstones - Key: CASSANDRA-7808 URL: https://issues.apache.org/jira/browse/CASSANDRA-7808 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Fix For: 1.2.19 Attachments: 7808-v1.diff LazilyCompactedRow doesn’t handle row tombstones correctly, leading to an AssertionError (CASSANDRA-4206) in some cases, and the row tombstone being incorrectly dropped in others. It looks like this was introduced by CASSANDRA-5677. To reproduce an AssertionError:
1. Hack a really small return value for DatabaseDescriptor.getInMemoryCompactionLimit() like 10 bytes to force large row compaction
2. Create a column family with gc_grace = 10
3. Insert a few columns in one row
4. Call nodetool flush
5. Delete the row
6. Call nodetool flush
7. Wait 10 seconds
8. Call nodetool compact and it will fail
To reproduce the row tombstone being dropped, do the same except, after the delete (in step 5), insert a column that sorts before the ones you inserted in step 3. E.g. if you inserted b, c, d in step 3, insert a now. After the compaction, which now succeeds, the full row will be visible, rather than just a. The problem is twofold. Firstly, LazilyCompactedRow.Reducer.reduce() and getReduce() incorrectly call container.clear(). This clears the columns (as intended) but also removes the deletion times from container. This means no further columns are deleted if they are annihilated by the row tombstone. Secondly, after the second pass, LazilyCompactedRow.isEmpty() is called which calls {{ColumnFamilyStore.removeDeletedCF(emptyColumnFamily, controller.gcBefore(key.getToken()))}} which unfortunately removes the last deleted time from emptyColumnFamily if it is earlier than gcBefore. Since this is only called after the second pass, the second pass doesn’t remove any columns that are removed by the row tombstone whereas the first pass removes just the first one. This is pretty serious - no large rows can ever be compacted and row tombstones can go missing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7663) Removing a seed causes previously removed seeds to reappear
[ https://issues.apache.org/jira/browse/CASSANDRA-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086574#comment-14086574 ] Richard Low commented on CASSANDRA-7663: +1 on v2, thanks! Removing a seed causes previously removed seeds to reappear --- Key: CASSANDRA-7663 URL: https://issues.apache.org/jira/browse/CASSANDRA-7663 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Brandon Williams Fix For: 1.2.19, 2.0.10, 2.1.0 Attachments: 7663-v2.txt, 7663.txt When you remove a seed from a cluster, Gossiper.removeEndpoint ensures it is removed from the seed list. However, it also resets the seed list to be the original list, which would bring back any previously removed seeds. What is the reasoning for having the call to buildSeedsList()? If it wasn’t there then I think the problem would be solved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7663) Removing a seed causes previously removed seeds to reappear
[ https://issues.apache.org/jira/browse/CASSANDRA-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082613#comment-14082613 ] Richard Low commented on CASSANDRA-7663: Thanks! Removing a seed causes previously removed seeds to reappear --- Key: CASSANDRA-7663 URL: https://issues.apache.org/jira/browse/CASSANDRA-7663 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Brandon Williams Fix For: 1.2.19, 2.0.10, 2.1.0 Attachments: 7663.txt When you remove a seed from a cluster, Gossiper.removeEndpoint ensures it is removed from the seed list. However, it also resets the seed list to be the original list, which would bring back any previously removed seeds. What is the reasoning for having the call to buildSeedsList()? If it wasn’t there then I think the problem would be solved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7663) Removing a seed causes previously removed seeds to reappear
[ https://issues.apache.org/jira/browse/CASSANDRA-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14083258#comment-14083258 ] Richard Low commented on CASSANDRA-7663: Actually there's a potential problem with this. It now requires that the yaml is still present, whereas before this patch the yaml was only needed on startup. Depending on how people deploy updated yamls, the old yaml may not be there. In that case, it would throw AssertionError and bad things will happen in gossip. Maybe it could fall back on the original behaviour if the yaml can't be read? Removing a seed causes previously removed seeds to reappear --- Key: CASSANDRA-7663 URL: https://issues.apache.org/jira/browse/CASSANDRA-7663 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Brandon Williams Fix For: 1.2.19, 2.0.10, 2.1.0 Attachments: 7663.txt When you remove a seed from a cluster, Gossiper.removeEndpoint ensures it is removed from the seed list. However, it also resets the seed list to be the original list, which would bring back any previously removed seeds. What is the reasoning for having the call to buildSeedsList()? If it wasn’t there then I think the problem would be solved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7663) Removing a seed causes previously removed seeds to reappear
Richard Low created CASSANDRA-7663: -- Summary: Removing a seed causes previously removed seeds to reappear Key: CASSANDRA-7663 URL: https://issues.apache.org/jira/browse/CASSANDRA-7663 Project: Cassandra Issue Type: Bug Reporter: Richard Low When you remove a seed from a cluster, Gossiper.removeEndpoint ensures it is removed from the seed list. However, it also resets the seed list to be the original list, which would bring back any previously removed seeds. What is the reasoning for having the call to buildSeedsList()? If it wasn’t there then I think the problem would be solved. -- This message was sent by Atlassian JIRA (v6.2#6252)
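The fix shape being argued for, as an illustrative sketch (not Gossiper's actual code): removal mutates the live seed set, and nothing inside removeEndpoint re-reads the configured list.
{code}
import java.net.InetAddress;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SeedList
{
    private final Set<InetAddress> seeds = ConcurrentHashMap.newKeySet();

    public SeedList(Set<InetAddress> configured)
    {
        seeds.addAll(configured); // read the configuration once, at startup
    }

    public void removeEndpoint(InetAddress ep)
    {
        // Remove only. Rebuilding from the original list here (the
        // buildSeedsList() call in question) would resurrect every seed
        // removed since startup.
        seeds.remove(ep);
    }

    public boolean isSeed(InetAddress ep)
    {
        return seeds.contains(ep);
    }

    public static void main(String[] args) throws Exception
    {
        SeedList list = new SeedList(Set.of(InetAddress.getByName("127.0.0.1"),
                                            InetAddress.getByName("127.0.0.2")));
        list.removeEndpoint(InetAddress.getByName("127.0.0.2"));
        System.out.println(list.isSeed(InetAddress.getByName("127.0.0.2"))); // false, and stays false
    }
}
{code}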
[jira] [Created] (CASSANDRA-7591) Add read and write metrics for each consistency level
Richard Low created CASSANDRA-7591: -- Summary: Add read and write metrics for each consistency level Key: CASSANDRA-7591 URL: https://issues.apache.org/jira/browse/CASSANDRA-7591 Project: Cassandra Issue Type: Improvement Reporter: Richard Low It would be helpful to have read and write metrics for each consistency level to help clients track query rates per consistency level. It's quite common to forget to set e.g. a config param in the client and use the wrong consistency level. Right now the only way to find out from Cassandra is to use tracing. It would also be helpful to track latencies for different consistency levels to estimate the cost of e.g. switching from LOCAL_QUORUM to QUORUM reads. -- This message was sent by Atlassian JIRA (v6.2#6252)
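A minimal sketch of what such metrics could look like using only the JDK; the ConsistencyLevel enum below is a stand-in for Cassandra's, and the counter layout is an assumption rather than the implementation that eventually landed:
{code}
import java.util.EnumMap;
import java.util.concurrent.atomic.LongAdder;

public class PerCLMetrics
{
    enum ConsistencyLevel { ONE, LOCAL_QUORUM, QUORUM, ALL }

    private final EnumMap<ConsistencyLevel, LongAdder> reads = new EnumMap<>(ConsistencyLevel.class);

    public PerCLMetrics()
    {
        for (ConsistencyLevel cl : ConsistencyLevel.values())
            reads.put(cl, new LongAdder()); // one cheap concurrent counter per level
    }

    public void markRead(ConsistencyLevel cl)
    {
        reads.get(cl).increment(); // called on the coordinator per read request
    }

    public static void main(String[] args)
    {
        PerCLMetrics m = new PerCLMetrics();
        m.markRead(ConsistencyLevel.LOCAL_QUORUM);
        m.markRead(ConsistencyLevel.QUORUM);
        m.markRead(ConsistencyLevel.QUORUM);
        m.reads.forEach((cl, n) -> System.out.println(cl + ": " + n.sum()));
    }
}
{code}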
[jira] [Updated] (CASSANDRA-7591) Add per consistency level read and write metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-7591: --- Summary: Add per consistency level read and write metrics (was: Add read and write metrics for each consistency level) Add per consistency level read and write metrics Key: CASSANDRA-7591 URL: https://issues.apache.org/jira/browse/CASSANDRA-7591 Project: Cassandra Issue Type: Improvement Reporter: Richard Low It would be helpful to have read and write metrics for each consistency level to help clients track query rates per consistency level. It's quite common to forget to set e.g. a config param in the client and use the wrong consistency level. Right now the only way to find out from Cassandra is to use tracing. It would also be helpful to track latencies for different consistency levels to estimate the cost of e.g. switching from LOCAL_QUORUM to QUORUM reads. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (CASSANDRA-7591) Add per consistency level read and write metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low resolved CASSANDRA-7591. Resolution: Duplicate Dupe of CASSANDRA-7384. Add per consistency level read and write metrics Key: CASSANDRA-7591 URL: https://issues.apache.org/jira/browse/CASSANDRA-7591 Project: Cassandra Issue Type: Improvement Reporter: Richard Low It would be helpful to have read and write metrics for each consistency level to help clients track query rates per consistency level. It's quite common to forget to set e.g. a config param in the client and use the wrong consistency level. Right now the only way to find out from Cassandra is to use tracing. It would also be helpful to track latencies for different consistency levels to estimate the cost of e.g. switching from LOCAL_QUORUM to QUORUM reads. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7591) Add per consistency level read and write metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070941#comment-14070941 ] Richard Low commented on CASSANDRA-7591: And I've just found CASSANDRA-7384. Sorry for the noise. Add per consistency level read and write metrics Key: CASSANDRA-7591 URL: https://issues.apache.org/jira/browse/CASSANDRA-7591 Project: Cassandra Issue Type: Improvement Reporter: Richard Low It would be helpful to have read and write metrics for each consistency level to help clients track query rates per consistency level. It's quite common to forget to set e.g. a config param in the client and use the wrong consistency level. Right now the only way to find out from Cassandra is to use tracing. It would also be helpful to track latencies for different consistency levels to estimate the cost of e.g. switching from LOCAL_QUORUM to QUORUM reads. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7384) Collect metrics on queries by consistency level
[ https://issues.apache.org/jira/browse/CASSANDRA-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070960#comment-14070960 ] Richard Low commented on CASSANDRA-7384: Having relative counts per consistency level helps clients to check rates are as they expect. Also having the relative latencies would help to estimate the cost/benefit of changing consistency levels. I think it would be helpful. Collect metrics on queries by consistency level --- Key: CASSANDRA-7384 URL: https://issues.apache.org/jira/browse/CASSANDRA-7384 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Vishy Kasar Assignee: sankalp kohli Priority: Minor We had cases where cassandra client users thought that they were doing queries at one consistency level but turned out to be not correct. It will be good to collect metrics on number of queries done at various consistency level on the server. See the equivalent JIRA on java driver: https://datastax-oss.atlassian.net/browse/JAVA-354 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-5901) Bootstrap should also make the data consistent on the new node
[ https://issues.apache.org/jira/browse/CASSANDRA-5901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070988#comment-14070988 ] Richard Low commented on CASSANDRA-5901: In particular, host replacement can violate consistency, even with CASSANDRA-2434. For example, if you always do quorum reads and writes and have replicas A, B, C, you'll always read the latest value, even if any one replica has missed a write. Suppose A did miss a write and then B fails and is replaced by D. D chooses where to stream from - there is no 'right' answer - so it can stream from A. Now A and D have old values and only C has the latest value. A quorum read that chooses A and D will give back stale data and violate expected consistency. If a repair was run on A and C after B had failed but before it was replaced with D, the consistency problem is eliminated. Bootstrap should also make the data consistent on the new node -- Key: CASSANDRA-5901 URL: https://issues.apache.org/jira/browse/CASSANDRA-5901 Project: Cassandra Issue Type: Improvement Components: Core Reporter: sankalp kohli Priority: Minor Currently when we are bootstrapping a new node, it might bootstrap from a node which does not have the most up-to-date data. Because of this, we need to run a repair after that. Most people will always run the repair, so it would help if we could provide a parameter to bootstrap to run the repair once the bootstrap has finished. It could also stop the node from responding to reads until the repair has finished. This could be another param as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7592) Ownership changes can violate consistency
Richard Low created CASSANDRA-7592: -- Summary: Ownership changes can violate consistency Key: CASSANDRA-7592 URL: https://issues.apache.org/jira/browse/CASSANDRA-7592 Project: Cassandra Issue Type: Improvement Reporter: Richard Low CASSANDRA-2434 goes a long way to avoiding consistency violations when growing a cluster. However, there is still a window when consistency can be violated when switching ownership of a range. Suppose you have replication factor 3 and all reads and writes at quorum. The first part of the ring looks like this:
Z: 0
A: 100
B: 200
C: 300
Choose two random coordinators, C1 and C2. Then you bootstrap node X at token 50. Consider the token range 0-50. Before bootstrap, this is stored on A, B, C. During bootstrap, writes go to X, A, B, C (and must succeed on 3) and reads choose two from A, B, C. After bootstrap, the range is on X, A, B. When the bootstrap completes, suppose C1 processes the ownership change at t1 and C2 at t4. Then the following can give an inconsistency:
t1: C1 switches ownership.
t2: C1 performs a write, so sends the write to X, A, B. A is busy and drops the write, but it succeeds because X and B return.
t3: C2 performs a read. It hasn’t done the switch and chooses A and C. Neither got the write at t2 so null is returned.
t4: C2 switches ownership.
This could be solved by continuing writes to the old replica for some time (maybe ring delay) after the ownership changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
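A sketch of that mitigation with illustrative names (this is not Cassandra's pending-range machinery): coordinators keep the outgoing replica in the write set for a grace period after the switch, so a coordinator that still reads from the old replica set finds a quorum that intersects recent writes. Extra writes to the old replica are harmless.
{code}
import java.util.ArrayList;
import java.util.List;

public class WriteTargets
{
    static final long RING_DELAY_MS = 30_000; // illustrative grace period

    static List<String> writeTargets(List<String> current, List<String> outgoing,
                                     long switchedAtMs, long nowMs)
    {
        List<String> targets = new ArrayList<>(current);
        if (nowMs - switchedAtMs < RING_DELAY_MS)
            for (String old : outgoing)
                if (!targets.contains(old))
                    targets.add(old); // keep writing to the outgoing replica for a while
        return targets;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis();
        // Range 0-50 just moved from (A, B, C) to (X, A, B); C keeps receiving
        // writes until the grace period ends.
        System.out.println(writeTargets(List.of("X", "A", "B"), List.of("C"), now, now));
        System.out.println(writeTargets(List.of("X", "A", "B"), List.of("C"), now - 60_000, now));
    }
}
{code}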
[jira] [Commented] (CASSANDRA-6751) Setting -Dcassandra.fd_initial_value_ms Results in NPE
[ https://issues.apache.org/jira/browse/CASSANDRA-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068158#comment-14068158 ] Richard Low commented on CASSANDRA-6751: This issue also affects the 2.0 branch and was fixed in 2.0.8. Setting -Dcassandra.fd_initial_value_ms Results in NPE -- Key: CASSANDRA-6751 URL: https://issues.apache.org/jira/browse/CASSANDRA-6751 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Dave Brosius Priority: Minor Fix For: 1.2.17 Attachments: 6751.txt Start Cassandra with {{-Dcassandra.fd_initial_value_ms=1000}} and you'll get the following stacktrace:
{noformat}
INFO [main] 2014-02-21 14:45:57,731 StorageService.java (line 617) Starting up server gossip
ERROR [main] 2014-02-21 14:45:57,736 CassandraDaemon.java (line 464) Exception encountered during startup
java.lang.ExceptionInInitializerError
at org.apache.cassandra.gms.Gossiper.<init>(Gossiper.java:178)
at org.apache.cassandra.gms.Gossiper.<clinit>(Gossiper.java:71)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:618)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:583)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:480)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
Caused by: java.lang.NullPointerException
at org.apache.cassandra.gms.FailureDetector.getInitialValue(FailureDetector.java:81)
at org.apache.cassandra.gms.FailureDetector.<clinit>(FailureDetector.java:48)
... 8 more
ERROR [StorageServiceShutdownHook] 2014-02-21 14:45:57,754 CassandraDaemon.java (line 191) Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NoClassDefFoundError: Could not initialize class org.apache.cassandra.gms.Gossiper
at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:550)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:724)
{noformat}
Glancing at the code, this is because the FailureDetector logger isn't initialized when the static initialization of {{INITIAL_VALUE}} happens. -- This message was sent by Atlassian JIRA (v6.2#6252)
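The bug class is easy to reproduce outside Cassandra: Java runs static field initializers in declaration order, so a static method invoked by an earlier field's initializer sees later static fields as null. A minimal reproduction (java.util.logging stands in for the real logger); the fix is to declare the logger before any field whose initializer uses it, or to fetch it lazily:
{code}
public class StaticInitOrder
{
    // This initializer runs first (declaration order) while LOGGER is still null.
    static final long INITIAL_VALUE = getInitialValue();

    static final java.util.logging.Logger LOGGER =
        java.util.logging.Logger.getLogger(StaticInitOrder.class.getName());

    static long getInitialValue()
    {
        LOGGER.info("computing initial value"); // NullPointerException here
        return 1000;
    }

    public static void main(String[] args)
    {
        // Never reached: class initialization throws ExceptionInInitializerError,
        // mirroring the stacktrace in the ticket.
        System.out.println(INITIAL_VALUE);
    }
}
{code}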
[jira] [Commented] (CASSANDRA-7307) New nodes mark dead nodes as up for 10 minutes
[ https://issues.apache.org/jira/browse/CASSANDRA-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14036328#comment-14036328 ] Richard Low commented on CASSANDRA-7307: bq. For bootstrap? Let me be clear, the problem with replace is not related to streaming. It's refusing to replace a live node, because the FD takes so long to report it as down upon first discovery. Actually, most of the time the problem is streaming. It is happy during replacement (which surprises me, since it clearly lists it as UP), but it then requests to stream from the dead node, which fails. We've seen this where it happily streams from other nodes, but then ultimately fails because the stream from the dead node fails. However, we also see a problem where it fails to replace because it thinks the node is live. This happens less often but I expect it has the same root cause. New nodes mark dead nodes as up for 10 minutes -- Key: CASSANDRA-7307 URL: https://issues.apache.org/jira/browse/CASSANDRA-7307 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Brandon Williams Fix For: 1.2.17, 2.0.9, 2.1 rc2 When doing a node replacement while other nodes are down, we see the down nodes marked as up for about 10 minutes. This means requests are routed to the dead nodes, causing timeouts. It also means replacing a node when multiple nodes from a replica set are down is extremely difficult - the node usually tries to stream from a dead node and the replacement fails. This isn't limited to host replacement. I did a simple test:
1. Create a 2-node cluster
2. Kill node 2
3. Start a 3rd node with a unique token (I used auto_bootstrap=false but I don't think this is significant)
The 3rd node lists node 2 (127.0.0.2) as up for almost 10 minutes:
{code}
INFO [main] 2014-05-27 14:28:24,753 CassandraDaemon.java (line 119) Logging initialized
INFO [GossipStage:1] 2014-05-27 14:28:31,492 Gossiper.java (line 843) Node /127.0.0.2 is now part of the cluster
INFO [GossipStage:1] 2014-05-27 14:28:31,495 Gossiper.java (line 809) InetAddress /127.0.0.2 is now UP
INFO [GossipTasks:1] 2014-05-27 14:37:44,526 Gossiper.java (line 823) InetAddress /127.0.0.2 is now DOWN
{code}
I reproduced on 1.2.15 and 1.2.16. -- This message was sent by Atlassian JIRA (v6.2#6252)