[jira] [Resolved] (CASSANDRA-7377) Should be an option to fail startup if corrupt SSTable found
[ https://issues.apache.org/jira/browse/CASSANDRA-7377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low resolved CASSANDRA-7377. Resolution: Duplicate > Should be an option to fail startup if corrupt SSTable found > > > Key: CASSANDRA-7377 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7377 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Low > Labels: proposed-wontfix > > We had a server that crashed and when it came back, some SSTables were > corrupted. Cassandra happily started, but we then realised the corrupt > SSTable contained some tombstones and a few keys were resurrected. This means > corruption on a single replica can bring back data even if you run repairs at > least every gc_grace. > There should be an option, probably controlled by the disk failure policy, to > catch this and stop node startup. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
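An illustrative sketch of the option proposed above: a startup scan that consults the disk failure policy before the node serves traffic. The policy names mirror cassandra.yaml's disk_failure_policy, but the scanning helper is a hypothetical stand-in, not Cassandra's actual startup code:
{code}
import java.io.File;

// Hypothetical sketch only: fail startup on a corrupt SSTable when the
// operator has chosen a strict disk failure policy.
enum DiskFailurePolicy { IGNORE, BEST_EFFORT, STOP, DIE }

final class StartupSSTableCheck
{
    static void checkAll(Iterable<File> sstables, DiskFailurePolicy policy)
    {
        for (File sstable : sstables)
        {
            if (!checksumOk(sstable)) // full checksum scan, elided here
            {
                if (policy == DiskFailurePolicy.STOP || policy == DiskFailurePolicy.DIE)
                    throw new RuntimeException("Corrupt SSTable found at startup: " + sstable);
                // otherwise log and continue, accepting the resurrection risk
            }
        }
    }

    private static boolean checksumOk(File sstable) { return true; /* placeholder */ }
}
{code}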
[jira] [Commented] (CASSANDRA-7377) Should be an option to fail startup if corrupt SSTable found
[ https://issues.apache.org/jira/browse/CASSANDRA-7377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287100#comment-16287100 ] Richard Low commented on CASSANDRA-7377: SGTM > Should be an option to fail startup if corrupt SSTable found > > > Key: CASSANDRA-7377 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7377 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Low > Labels: proposed-wontfix > > We had a server that crashed and when it came back, some SSTables were > corrupted. Cassandra happily started, but we then realised the corrupt > SSTable contained some tombstones and a few keys were resurrected. This means > corruption on a single replica can bring back data even if you run repairs at > least every gc_grace. > There should be an option, probably controlled by the disk failure policy, to > catch this and stop node startup. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219759#comment-16219759 ] Richard Low commented on CASSANDRA-10726: - Background read repair is quite different. This foreground read repair is required to be blocking as the discussion at the beginning of the ticket shows. Now I understand it, I think this is an important guarantee and people would be very surprised if this behaviour changed. So I'm strongly in favour of 1, although the title of the ticket may be misleading :) > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low >Assignee: Xiaolong Jiang > Fix For: 4.x > > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
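To make the trade-off concrete, an illustrative sketch (not Cassandra's actual classes) of why the blocking wait turns a replica that drops writes into read timeouts:
{code}
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only: the coordinator fires repair writes at stale replicas,
// then blocks the read until every write is acknowledged.
final class BlockingReadRepair
{
    static void repairAndWait(List<Runnable> repairWrites, long timeoutMs) throws InterruptedException
    {
        CountDownLatch acks = new CountDownLatch(repairWrites.size());
        for (Runnable send : repairWrites)
            send.run(); // in the real system an ack callback counts the latch down

        // If a stale replica is dropping writes, this wait expires and the
        // *read* fails with a timeout, even though the data was assembled.
        if (!acks.await(timeoutMs, TimeUnit.MILLISECONDS))
            throw new RuntimeException("Read timed out waiting for repair write acks");
    }
}
{code}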
[jira] [Commented] (CASSANDRA-8502) Static columns returning null for pages after first
[ https://issues.apache.org/jira/browse/CASSANDRA-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645878#comment-15645878 ] Richard Low commented on CASSANDRA-8502: Doesn't Dave's patch change behaviour though? We think we're seeing this in 2.0.17 but not 2.1. > Static columns returning null for pages after first > --- > > Key: CASSANDRA-8502 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8502 > Project: Cassandra > Issue Type: Bug >Reporter: Flavien Charlon >Assignee: Tyler Hobbs > Fix For: 2.0.16, 2.1.6, 2.2.0 rc1 > > Attachments: 8502-2.0-v2.txt, 8502-2.0.txt, 8502-2.1-v2.txt, > null-static-column.txt > > > When paging is used for a query containing a static column, the first page > contains the right value for the static column, but subsequent pages have > null for the static column instead of the expected value. > Repro steps: > - Create a table with a static column > - Create a partition with 500 cells > - Using cqlsh, query that partition > Actual result: > - You will see that first, the static column appears as expected, but if you > press a key after "---MORE---", the static columns will appear as null. > See the attached file for a repro of the output. > I am using a single node cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8502) Static columns returning null for pages after first
[ https://issues.apache.org/jira/browse/CASSANDRA-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645846#comment-15645846 ] Richard Low commented on CASSANDRA-8502: Can someone remove 2.0.16 from the fix versions, since the above change was only applied in 2.1? > Static columns returning null for pages after first > --- > > Key: CASSANDRA-8502 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8502 > Project: Cassandra > Issue Type: Bug >Reporter: Flavien Charlon >Assignee: Tyler Hobbs > Fix For: 2.0.16, 2.1.6, 2.2.0 rc1 > > Attachments: 8502-2.0-v2.txt, 8502-2.0.txt, 8502-2.1-v2.txt, > null-static-column.txt > > > When paging is used for a query containing a static column, the first page > contains the right value for the static column, but subsequent pages have > null for the static column instead of the expected value. > Repro steps: > - Create a table with a static column > - Create a partition with 500 cells > - Using cqlsh, query that partition > Actual result: > - You will see that first, the static column appears as expected, but if you > press a key after "---MORE---", the static columns will appear as null. > See the attached file for a repro of the output. > I am using a single node cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427335#comment-15427335 ] Richard Low commented on CASSANDRA-8523: Are you waiting for me to review the dtest PR? > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416328#comment-15416328 ] Richard Low commented on CASSANDRA-8523: +1 on the 3.9 version too. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399949#comment-15399949 ] Richard Low commented on CASSANDRA-8523: I'll review the 3.9 version. I'm very much in favour of putting this in 2.2 and 3.0 as this hurts us badly and no doubt others suffer too. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398667#comment-15398667 ] Richard Low commented on CASSANDRA-8523: +1 patch looks good. Really like the dtests. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9641) Occasional timeouts with blockFor=all for LOCAL_QUORUM query
[ https://issues.apache.org/jira/browse/CASSANDRA-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392096#comment-15392096 ] Richard Low commented on CASSANDRA-9641: I couldn't find the root cause. I'll look to see if it's happening on 2.1. > Occasional timeouts with blockFor=all for LOCAL_QUORUM query > > > Key: CASSANDRA-9641 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9641 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Richard Low > Fix For: 2.1.x, 2.2.x, 3.0.x > > > We have a keyspace using NetworkTopologyStrategy with options DC1:3, DC2:3. > Our tables have > read_repair_chance = 0.0 > dclocal_read_repair_chance = 0.1 > speculative_retry = ’99.0PERCENTILE' > and all reads are at LOCAL_QUORUM. On 2.0.11, we occasionally see this > timeout: > Cassandra timeout during read query at consistency ALL (6 responses were > required but only 5 replica responded) > (sometimes only 4 respond). The ALL is probably due to CASSANDRA-7947 if this > occurs during a digest mismatch, but what is interesting is it is expecting 6 > responses i.e. blockFor is set to all replicas. I can’t see how this should > happen. From the code it should never set blockFor to more than 4 (although 4 > is still wrong - I'll make a separate JIRA for that). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12043) Syncing most recent commit in CAS across replicas can cause all CAS queries in the CQL partition to fail
[ https://issues.apache.org/jira/browse/CASSANDRA-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358348#comment-15358348 ] Richard Low commented on CASSANDRA-12043: - Nice detective work [~kohlisankalp] > Syncing most recent commit in CAS across replicas can cause all CAS queries > in the CQL partition to fail > > > Key: CASSANDRA-12043 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12043 > Project: Cassandra > Issue Type: Bug >Reporter: sankalp kohli >Assignee: Sylvain Lebresne > Fix For: 2.1.15, 2.2.7, 3.0.9, 3.9 > > > We update the most recent commit on requiredParticipant replicas if out of > sync during the prepare round in the beginAndRepairPaxos method. We keep doing > this in a loop until the requiredParticipant replicas have the same most > recent commit or we hit the timeout. > Say we have 3 machines A, B and C and gc grace on the table is 10 days. We do > a CAS write at time 0 and it went to A and B but not to C. C will get the > hint later but will not update the most recent commit in the paxos table. This is > how CAS hints work. > In the paxos table, whose gc_grace=0, most_recent_commit in A and B will be > inserted with timestamp 0 and with a TTL of 10 days. After 10 days, this > insert will become a tombstone at time 0 till it is compacted away since > gc_grace=0. > Do a CAS read after, say, 1 day on the same CQL partition and this time the prepare > phase involved A and C. most_recent_commit on C for this CQL partition is > empty. A sends the most_recent_commit to C with a timestamp of 0 and with a > TTL of 10 days. This most_recent_commit on C will expire on the 11th day since it > is inserted after 1 day. > most_recent_commit is now in sync on A, B and C; however, A and B's > most_recent_commit will expire on the 10th day whereas for C it will expire on the > 11th day since it was inserted one day later. > Do another CAS read after 10 days when most_recent_commit on A and B has > expired and is treated as a tombstone till compacted. In this CAS read, say A > and C are involved in the prepare phase. most_recent_commit will not match > between them since it is expired in A and is still there on C. This will > cause most_recent_commit to be applied to A with a timestamp of 0 and TTL of > 10 days. If A has not compacted away the original most_recent_commit which > has expired, this new write to most_recent_commit won't be visible on reads > since there is a tombstone with the same timestamp (delete wins over data with the > same timestamp). > Another round of prepare will follow and again A would say it does not know > about the most_recent_write (covered by the original write which is not a tombstone) > and C will again try to send the write to A. This can keep going on until the > request times out or only A and B are involved in the prepare phase. > When A’s original most_recent_commit, which is now a tombstone, is compacted, > all the inserts which it was covering will come live. This will in turn again > get played to another replica. This ping pong can keep going on for a long > time. > The issue is that most_recent_commit is expiring at different times across > replicas. When it gets replayed to a replica to bring it in sync, we again > set the TTL from that point. > During the CAS read which timed out, most_recent_commit was being sent to > another replica in a loop. Even in successful requests, it may loop > a couple of times if A and C are involved and then, when the replicas which > respond are A and B, it will succeed. So this will have an impact on latencies > as well.
> These timeouts get worse when a machine is down, as no progress can be made > because the machine with the unexpired commit is always involved in the CAS prepare > round. Also, with range movements, the new machine gaining the range has an empty > most recent commit and gets the commit at a later time, causing the same issue. > Repro steps: > 1. Paxos TTL is max(3 hours, gc_grace) as defined in > SystemKeyspace.paxosTtl(). Change this method to not put a minimum TTL of 3 > hours. > Method SystemKeyspace.paxosTtl() will look like return > metadata.getGcGraceSeconds(); instead of return Math.max(3 * 3600, > metadata.getGcGraceSeconds()); > We are doing this so that we don't need to wait for 3 hours. > Create a 3 node cluster with the code change suggested above with machines > A, B and C > CREATE KEYSPACE test WITH REPLICATION = { 'class' : 'SimpleStrategy', > 'replication_factor' : 3 }; > use test; > CREATE TABLE users (a int PRIMARY KEY, b int); > alter table users WITH gc_grace_seconds=120; > consistency QUORUM; > bring down machine C > INSERT INTO users (a, b) VALUES (1, 1) IF NOT EXISTS; > Nodetool flush on machine A and B > Bring up
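For reference, a sketch of the repro's code change, paraphrasing the two return statements quoted above; the CFMetaData parameter is reduced to a plain int so the sketch stands alone:
{code}
// Paraphrase of SystemKeyspace.paxosTtl() for the repro; not copied from a
// specific release. gcGraceSeconds plays the role of metadata.getGcGraceSeconds().
public static int paxosTtl(int gcGraceSeconds)
{
    // original behaviour: never TTL paxos state faster than 3 hours
    // return Math.max(3 * 3600, gcGraceSeconds);

    // repro variant: honour gc_grace_seconds directly (120s in the repro),
    // so the expiry ping-pong shows up without waiting 3 hours
    return gcGraceSeconds;
}
{code}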
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337328#comment-15337328 ] Richard Low commented on CASSANDRA-8523: It's not those writes that matter - the replacement will get those writes from the other nodes during streaming. The hints that you might care about are writes dropped during the replacement on the replacing node. But those should be extremely rare, and a price well worth paying for getting the vast majority of the writes vs none today. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337290#comment-15337290 ] Richard Low commented on CASSANDRA-8523: Thanks! I'm happy to review. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311526#comment-15311526 ] Richard Low commented on CASSANDRA-8523: Without understanding the FD details, this sounds good. Losing hints isn't an issue, as you say. > Writes should be sent to a replacement node while it is streaming in data > - > > Key: CASSANDRA-8523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8523 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Wagner >Assignee: Paulo Motta > Fix For: 2.1.x > > > In our operations, we make heavy use of replace_address (or > replace_address_first_boot) in order to replace broken nodes. We now realize > that writes are not sent to the replacement nodes while they are in hibernate > state and streaming in data. This runs counter to what our expectations were, > especially since we know that writes ARE sent to nodes when they are > bootstrapped into the ring. > It seems like Cassandra should arrange to send writes to a node that is in > the process of replacing another node, just like it does for nodes that are > bootstrapping. I hesitate to phrase this as "we should send writes to a node > in hibernate" because the concept of hibernate may be useful in other > contexts, as per CASSANDRA-8336. Maybe a new state is needed here? > Among other things, the fact that we don't get writes during this period > makes subsequent repairs more expensive, proportional to the number of writes > that we miss (and depending on the amount of data that needs to be streamed > during replacement and the time it may take to rebuild secondary indexes, we > could miss many many hours worth of writes). It also leaves us more exposed > to consistency violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11903) Serial reads should not include pending endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304341#comment-15304341 ] Richard Low commented on CASSANDRA-11903: - The write CL is increased by 1 for a pending endpoint so that a quorum of non-pending endpoints are written to. That means a regular quorum read is guaranteed to see the write and detect if any in progress paxos writes are committed. > Serial reads should not include pending endpoints > - > > Key: CASSANDRA-11903 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11903 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Richard Low > > A serial read uses pending endpoints in beginAndRepairPaxos, although the > read itself does not. I don't think the pending endpoints are necessary and > including them unnecessarily increases paxos work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
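In sketch form, the arithmetic described above (simplified and standalone, not Cassandra's actual handler code):
{code}
// Standalone sketch of the rule above: each pending endpoint raises the
// required ack count by one, so a full quorum of non-pending replicas
// must still acknowledge the write.
final class PendingWriteMath
{
    static int blockForWrite(int replicationFactor, int pendingEndpoints)
    {
        int quorum = replicationFactor / 2 + 1;
        return quorum + pendingEndpoints; // e.g. RF=3 with 1 pending -> 3 acks
    }
}
{code}
With RF=3 and one pending endpoint the write needs 3 acks, so a later quorum read of the non-pending replicas is guaranteed to overlap at least one node that saw the write, which is exactly the guarantee the comment relies on.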
[jira] [Created] (CASSANDRA-11903) Serial reads should not include pending endpoints
Richard Low created CASSANDRA-11903: --- Summary: Serial reads should not include pending endpoints Key: CASSANDRA-11903 URL: https://issues.apache.org/jira/browse/CASSANDRA-11903 Project: Cassandra Issue Type: Bug Components: Coordination Reporter: Richard Low A serial read uses pending endpoints in beginAndRepairPaxos, although the read itself does not. I don't think the pending endpoints are necessary and including them unnecessarily increases paxos work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11746) Add per partition rate limiting
[ https://issues.apache.org/jira/browse/CASSANDRA-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284899#comment-15284899 ] Richard Low commented on CASSANDRA-11746: - I think per table via DDL or backed by a system table will be best. It will be easier in the server to coordinate across the cluster than in the driver. > Add per partition rate limiting > --- > > Key: CASSANDRA-11746 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11746 > Project: Cassandra > Issue Type: Improvement >Reporter: Richard Low > > In a similar spirit to the tombstone fail threshold, Cassandra could protect > itself against rogue clients issuing too many requests to the same partition > by rate limiting. Nodes could keep a sliding window of requests per partition > and immediately reject requests if the threshold has been reached. This could > stop hotspots from taking down a replica set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-11746) Add per partition rate limiting
Richard Low created CASSANDRA-11746: --- Summary: Add per partition rate limiting Key: CASSANDRA-11746 URL: https://issues.apache.org/jira/browse/CASSANDRA-11746 Project: Cassandra Issue Type: Improvement Reporter: Richard Low In a similar spirit to the tombstone fail threshold, Cassandra could protect itself against rogue clients issuing too many requests to the same partition by rate limiting. Nodes could keep a sliding window of requests per partition and immediately reject requests if the threshold has been reached. This could stop hotspots from taking down a replica set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
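A sketch of the sliding-window idea from the description; names and granularity are illustrative, not a proposed Cassandra API:
{code}
import java.util.ArrayDeque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative per-partition sliding-window limiter: reject a request once
// a partition has exceeded maxRequests within the trailing window.
final class PartitionRateLimiter
{
    private final long windowMillis;
    private final int maxRequests;
    private final Map<String, ArrayDeque<Long>> requests = new ConcurrentHashMap<>();

    PartitionRateLimiter(long windowMillis, int maxRequests)
    {
        this.windowMillis = windowMillis;
        this.maxRequests = maxRequests;
    }

    boolean allow(String partitionKey, long nowMillis)
    {
        ArrayDeque<Long> window = requests.computeIfAbsent(partitionKey, k -> new ArrayDeque<>());
        synchronized (window)
        {
            while (!window.isEmpty() && nowMillis - window.peekFirst() > windowMillis)
                window.pollFirst(); // drop requests that fell out of the window
            if (window.size() >= maxRequests)
                return false;       // immediately reject, protecting the replica set
            window.addLast(nowMillis);
            return true;
        }
    }
}
{code}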
[jira] [Created] (CASSANDRA-11745) Add bytes limit to queries and paging
Richard Low created CASSANDRA-11745: --- Summary: Add bytes limit to queries and paging Key: CASSANDRA-11745 URL: https://issues.apache.org/jira/browse/CASSANDRA-11745 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low For some data models, values may be of very different sizes. When querying data, limit by count doesn’t work well and leads to timeouts. It would be much better to limit by size of the response, probably by stopping at the first row that goes above the limit. This applies to paging too so you can safely page through such data without timeout worries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
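A sketch of the byte-capped page boundary described above; the Row type and its size accounting are assumptions for illustration:
{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative only: build a page until the first row that pushes the page
// over the byte budget, then stop; paging resumes from the next row.
final class ByteLimitedPager
{
    static final class Row
    {
        final byte[] payload;
        Row(byte[] payload) { this.payload = payload; }
    }

    static List<Row> nextPage(Iterator<Row> rows, int maxBytes)
    {
        List<Row> page = new ArrayList<>();
        int bytes = 0;
        while (rows.hasNext())
        {
            Row row = rows.next();
            page.add(row);               // include the row that crosses the limit,
            bytes += row.payload.length; // so even one oversized row makes progress
            if (bytes >= maxBytes)
                break;
        }
        return page;
    }
}
{code}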
[jira] [Commented] (CASSANDRA-11547) Add background thread to check for clock drift
[ https://issues.apache.org/jira/browse/CASSANDRA-11547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253057#comment-15253057 ] Richard Low commented on CASSANDRA-11547: - Given how critical clocks are to Cassandra I think it is definitely Cassandra's business to report on this. It's not actually doing anything, just warning. You'd need to have a 5 minute GC pause for it to fire spuriously with the default. > Add background thread to check for clock drift > -- > > Key: CASSANDRA-11547 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11547 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Jason Brown >Assignee: Jason Brown >Priority: Minor > Labels: clocks, time > > The system clock has the potential to drift while a system is running. As a > simple way to check if this occurs, we can run a background thread that wakes > up every n seconds, reads the system clock, and checks to see if, indeed, n > seconds have passed. > * If the clock's current time is less than the last recorded time (captured n > seconds in the past), we know the clock has jumped backward. > * If n seconds have not elapsed, we know the system clock is running slow or > has moved backward (by a value less than n) > * If (n + a small offset) seconds have elapsed, we can assume we are within > an acceptable window of clock movement. Reasons for including an offset are > the clock checking thread might not have been scheduled on time, or garbage > collection, and so on. > * If the clock is greater than (n + a small offset) seconds, we can assume > the clock jumped forward. > In the unhappy cases, we can write a message to the log and increment some > metric that the user's monitoring systems can trigger/alert on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
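A sketch of the background check as described in the ticket; the interval, tolerance and warning hook are illustrative placeholders:
{code}
// Illustrative clock-drift watchdog: sleep n seconds, then compare the
// clock's apparent elapsed time against n plus an offset for scheduling/GC.
final class ClockDriftChecker implements Runnable
{
    private static final long INTERVAL_MS = 10_000;  // n
    private static final long OFFSET_MS = 300_000;   // slack before "jumped forward"

    @Override
    public void run()
    {
        long last = System.currentTimeMillis();
        while (!Thread.currentThread().isInterrupted())
        {
            try { Thread.sleep(INTERVAL_MS); } catch (InterruptedException e) { return; }
            long now = System.currentTimeMillis();
            long elapsed = now - last;
            if (elapsed < 0)
                warn("clock jumped backward by " + (-elapsed) + "ms");
            else if (elapsed < INTERVAL_MS)
                warn("clock running slow or moved backward: only " + elapsed + "ms elapsed");
            else if (elapsed > INTERVAL_MS + OFFSET_MS)
                warn("clock jumped forward: " + elapsed + "ms elapsed");
            last = now;
        }
    }

    // in practice this would log and increment a metric for alerting
    private static void warn(String msg) { System.err.println("[clock-drift] " + msg); }
}
{code}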
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15222330#comment-15222330 ] Richard Low commented on CASSANDRA-11349: - I'm also not sure how this is meant to fix it. Special casing validation compaction may fix repairs but you'd still get the digest mismatches on reads. > MerkleTree mismatch when multiple range tombstones exists for the same > partition and interval > - > > Key: CASSANDRA-11349 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11349 > Project: Cassandra > Issue Type: Bug >Reporter: Fabien Rousseau >Assignee: Stefan Podkowinski > Labels: repair > Fix For: 2.1.x, 2.2.x > > Attachments: 11349-2.1.patch > > > We observed that repair, for some of our clusters, streamed a lot of data and > many partitions were "out of sync". > Moreover, the read repair mismatch ratio is around 3% on those clusters, > which is really high. > After investigation, it appears that, if two range tombstones exist for a > partition for the same range/interval, they're both included in the merkle > tree computation. > But, if for some reason, on another node, the two range tombstones were > already compacted into a single range tombstone, this will result in a merkle > tree difference. > Currently, this is clearly bad because MerkleTree differences are dependent > on compactions (and if a partition is deleted and created multiple times, the > only way to ensure that repair "works correctly"/"doesn't overstream data" is > to major compact before each repair... which is not really feasible). > Below is a list of steps to easily reproduce this case: > {noformat} > ccm create test -v 2.1.13 -n 2 -s > ccm node1 cqlsh > CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': 2}; > USE test_rt; > CREATE TABLE IF NOT EXISTS table1 ( > c1 text, > c2 text, > c3 float, > c4 float, > PRIMARY KEY ((c1), c2) > ); > INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2); > DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b'; > ctrl ^d > # now flush only one of the two nodes > ccm node1 flush > ccm node1 cqlsh > USE test_rt; > INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3); > DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b'; > ctrl ^d > ccm node1 repair > # now grep the log and observe that there were some inconsistencies detected > between nodes (while it shouldn't have detected any) > ccm node1 showlog | grep "out of sync" > {noformat} > Consequences of this are a costly repair, accumulating many small SSTables > (up to thousands for a rather short period of time when using VNodes, the > time for compaction to absorb those small files), but also an increased size > on disk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-10643) Implement compaction for a specific token range
[ https://issues.apache.org/jira/browse/CASSANDRA-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-10643: Reviewer: Jason Brown Status: Patch Available (was: Open) > Implement compaction for a specific token range > --- > > Key: CASSANDRA-10643 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10643 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Vishy Kasar >Assignee: Vishy Kasar > Attachments: 10643-trunk-REV01.txt > > > We see repeated cases in production (using LCS) where a small number of users > generate a large number of repeated updates or tombstones. Reading the data of such > users brings large amounts of data into the java process. Apart from the read > itself being slow for the user, the excessive GC affects other users as well. > Our solution so far is to move from LCS to SCS and back. This takes a long time and > is overkill if the number of outliers is small. For such cases, we can > implement point compaction of a token range. We make nodetool compact > take a starting and ending token range and compact all the SSTables that fall > within that range. We can refuse to compact if the number of sstables is > beyond a max_limit. > Example: > nodetool -st 3948291562518219268 -et 3948291562518219269 compact keyspace > table -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181335#comment-15181335 ] Richard Low commented on CASSANDRA-10726: - What do you think [~jbellis] [~slebresne]? > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122431#comment-15122431 ] Richard Low commented on CASSANDRA-10726: - Actually, isn't the real problem here that speculative retry doesn't include the RR write? We should give up waiting for the write to complete and retry on another replica. A slow RR insert is just as bad as a slow read. > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-11091) Insufficient disk space in memtable flush should trigger disk fail policy
Richard Low created CASSANDRA-11091: --- Summary: Insufficient disk space in memtable flush should trigger disk fail policy Key: CASSANDRA-11091 URL: https://issues.apache.org/jira/browse/CASSANDRA-11091 Project: Cassandra Issue Type: Bug Reporter: Richard Low If there's insufficient disk space to flush, DiskAwareRunnable.getWriteDirectory throws and the flush fails. The commitlogs then grow indefinitely because the latch is never counted down. This should be an FSError so the disk fail policy is triggered. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
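A sketch of the proposed behaviour: surface the out-of-space condition as an FSError-style failure so the disk failure policy runs. Names echo the ticket, but the code is illustrative, not the actual DiskAwareRunnable:
{code}
import java.io.File;
import java.io.IOException;

// Illustrative: pick a flush directory, and if none has room, raise an
// Error (as Cassandra's FSWriteError does) instead of a plain exception,
// so the disk failure policy triggers rather than the flush quietly failing
// and the commitlog latch never being counted down.
final class FlushDirectoryPicker
{
    static File getWriteDirectory(Iterable<File> dataDirs, long bytesNeeded)
    {
        for (File dir : dataDirs)
            if (dir.getUsableSpace() >= bytesNeeded)
                return dir;
        throw new Error(new IOException("Insufficient disk space to flush " + bytesNeeded + " bytes"));
    }
}
{code}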
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092150#comment-15092150 ] Richard Low commented on CASSANDRA-10726: - It would lose a guarantee (which admittedly I didn't know existed), but most people who care about what happens when there's a write timeout will use CAS read and write. Would a reasonable halfway house be to keep the write as blocking but return success in the case of a write timeout? Then almost always the behaviour will be the same, but it would avoid the timeouts caused by a single broken replica. > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081929#comment-15081929 ] Richard Low commented on CASSANDRA-10726: - +1 on the option to disable. > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10887) Pending range calculator gives wrong pending ranges for moves
[ https://issues.apache.org/jira/browse/CASSANDRA-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15064904#comment-15064904 ] Richard Low commented on CASSANDRA-10887: - Is your key 'jdoe' stored on node1? > Pending range calculator gives wrong pending ranges for moves > - > > Key: CASSANDRA-10887 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10887 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Richard Low >Assignee: Branimir Lambov >Priority: Critical > > My understanding is the PendingRangeCalculator is meant to calculate who > should receive extra writes during range movements. However, it adds the > wrong ranges for moves. An extreme example of this can be seen in the > following reproduction. Create a 5 node cluster (I did this on 2.0.16 and > 2.2.4) and a keyspace with RF=3 and a simple table. Then start moving a node and > immediately kill -9 it. Now you see a node as down and moving in the ring. > Try a quorum write for a partition that is stored on that node - it will fail > with a timeout. Further, all CAS reads or writes fail immediately with an > unavailable exception because they attempt to include the moving node twice. > This is likely to be the cause of CASSANDRA-10423. > In my example I had this ring: > 127.0.0.1 rack1 Up Normal 170.97 KB 20.00% > -9223372036854775808 > 127.0.0.2 rack1 Up Normal 124.06 KB 20.00% > -5534023222112865485 > 127.0.0.3 rack1 Down Moving 108.7 KB 40.00% > 1844674407370955160 > 127.0.0.4 rack1 Up Normal 142.58 KB 0.00% > 1844674407370955161 > 127.0.0.5 rack1 Up Normal 118.64 KB 20.00% > 5534023222112865484 > Node 3 was moving to -1844674407370955160. I added logging to print the > pending and natural endpoints. For ranges owned by node 3, node 3 appeared in > pending and natural endpoints. The blockFor is increased to 3 so we’re > effectively doing CL.ALL operations. This manifests as write timeouts and CAS > unavailables when the node is down. > The correct pending range for this scenario is that node 1 gains the range > (-1844674407370955160, 1844674407370955160). So node 1 should be added as a > destination for writes and CAS for this range, not node 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-10887) Pending range calculator gives wrong pending ranges for moves
Richard Low created CASSANDRA-10887: --- Summary: Pending range calculator gives wrong pending ranges for moves Key: CASSANDRA-10887 URL: https://issues.apache.org/jira/browse/CASSANDRA-10887 Project: Cassandra Issue Type: Bug Components: Coordination Reporter: Richard Low Priority: Critical My understanding is the PendingRangeCalculator is meant to calculate who should receive extra writes during range movements. However, it adds the wrong ranges for moves. An extreme example of this can be seen in the following reproduction. Create a 5 node cluster (I did this on 2.0.16 and 2.2.4) and a keyspace with RF=3 and a simple table. Then start moving a node and immediately kill -9 it. Now you see a node as down and moving in the ring. Try a quorum write for a partition that is stored on that node - it will fail with a timeout. Further, all CAS reads or writes fail immediately with an unavailable exception because they attempt to include the moving node twice. This is likely to be the cause of CASSANDRA-10423. In my example I had this ring: 127.0.0.1 rack1 Up Normal 170.97 KB 20.00% -9223372036854775808 127.0.0.2 rack1 Up Normal 124.06 KB 20.00% -5534023222112865485 127.0.0.3 rack1 Down Moving 108.7 KB 40.00% 1844674407370955160 127.0.0.4 rack1 Up Normal 142.58 KB 0.00% 1844674407370955161 127.0.0.5 rack1 Up Normal 118.64 KB 20.00% 5534023222112865484 Node 3 was moving to -1844674407370955160. I added logging to print the pending and natural endpoints. For ranges owned by node 3, node 3 appeared in pending and natural endpoints. The blockFor is increased to 3 so we’re effectively doing CL.ALL operations. This manifests as write timeouts and CAS unavailables when the node is down. The correct pending range for this scenario is that node 1 gains the range (-1844674407370955160, 1844674407370955160). So node 1 should be added as a destination for writes and CAS for this range, not node 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10726) Read repair inserts should not be blocking
[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058576#comment-15058576 ] Richard Low commented on CASSANDRA-10726: - How does it violate consistency? The replica was already inconsistent enough to require a read repair insert, so returning before completing the write can't make it any worse. > Read repair inserts should not be blocking > -- > > Key: CASSANDRA-10726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Richard Low > > Today, if there’s a digest mismatch in a foreground read repair, the insert > to update out of date replicas is blocking. This means, if it fails, the read > fails with a timeout. If a node is dropping writes (maybe it is overloaded or > the mutation stage is backed up for some other reason), all reads to a > replica set could fail. Further, replicas dropping writes get more out of > sync so will require more read repair. > The comment on the code for why the writes are blocking is: > {code} > // wait for the repair writes to be acknowledged, to minimize impact on any > replica that's > // behind on writes in case the out-of-sync row is read multiple times in > quick succession > {code} > but the bad side effect is that reads timeout. Either the writes should not > be blocking or we should return success for the read even if the write times > out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-10726) Read repair inserts should not be blocking
Richard Low created CASSANDRA-10726: --- Summary: Read repair inserts should not be blocking Key: CASSANDRA-10726 URL: https://issues.apache.org/jira/browse/CASSANDRA-10726 Project: Cassandra Issue Type: Improvement Components: Coordination Reporter: Richard Low Today, if there’s a digest mismatch in a foreground read repair, the insert to update out of date replicas is blocking. This means, if it fails, the read fails with a timeout. If a node is dropping writes (maybe it is overloaded or the mutation stage is backed up for some other reason), all reads to a replica set could fail. Further, replicas dropping writes get more out of sync so will require more read repair. The comment on the code for why the writes are blocking is: {code} // wait for the repair writes to be acknowledged, to minimize impact on any replica that's // behind on writes in case the out-of-sync row is read multiple times in quick succession {code} but the bad side effect is that reads timeout. Either the writes should not be blocking or we should return success for the read even if the write times out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10366) Added gossip states can shadow older unseen states
[ https://issues.apache.org/jira/browse/CASSANDRA-10366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804315#comment-14804315 ] Richard Low commented on CASSANDRA-10366: - +1 > Added gossip states can shadow older unseen states > -- > > Key: CASSANDRA-10366 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10366 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Brandon Williams >Assignee: Brandon Williams >Priority: Critical > Fix For: 2.0.17, 3.0.0 rc1, 2.1.10, 2.2.2 > > Attachments: 10336.txt > > > In CASSANDRA-6135 we added cloneWithHigherVersion to ensure that if another > thread added states to gossip while we were notifying we would increase our > version to ensure the existing states wouldn't get shadowed. This, however, > was not entirely perfect since it's possible that after the clone, but before > the addition, another thread will insert an even newer state, thus shadowing > the others. A common case (of this rare one) is when STATUS and TOKENS are > added a bit later in SS.setGossipTokens, where something in another thread > injects a new state (likely SEVERITY) just before the addition after the > clone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
[ https://issues.apache.org/jira/browse/CASSANDRA-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-9434: --- Fix Version/s: (was: 2.0.x) 2.2.x 2.1.x If a node loses schema_columns SSTables it could delete all secondary indexes from the schema - Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Aleksey Yeschenko Fix For: 2.1.x, 2.2.x It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
[ https://issues.apache.org/jira/browse/CASSANDRA-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14702418#comment-14702418 ] Richard Low commented on CASSANDRA-9434: Thanks for the explanation! With 2.0 EOL I don't think I can object much... I updated the fix versions. If a node loses schema_columns SSTables it could delete all secondary indexes from the schema - Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Aleksey Yeschenko Fix For: 2.1.x, 2.2.x It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
[ https://issues.apache.org/jira/browse/CASSANDRA-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700697#comment-14700697 ] Richard Low commented on CASSANDRA-9434: Thanks Aleksey. So it sounds like we should close this as behaves correctly? If a node loses schema_columns SSTables it could delete all secondary indexes from the schema - Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Aleksey Yeschenko Fix For: 2.0.x It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9753) LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch
[ https://issues.apache.org/jira/browse/CASSANDRA-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642825#comment-14642825 ] Richard Low commented on CASSANDRA-9753: I agree with Sankalp. This is with read_repair_chance = 0. LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch --- Key: CASSANDRA-9753 URL: https://issues.apache.org/jira/browse/CASSANDRA-9753 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low When there is a digest mismatch during the initial read, a data read request is sent to all replicas involved in the initial read. This can be more than the initial blockFor if read repair was done and if speculative retry kicked in. E.g. for RF 3 in two DCs, the number of reads could be 4: 2 for LOCAL_QUORUM, 1 for read repair and 1 for speculative read if one replica was slow. If there is then a digest mismatch, Cassandra will issue the data read to all 4 and set blockFor=4. Now the read query is blocked on cross-DC latency. The digest mismatch read blockFor should be capped at RF for the local DC when using CL.LOCAL_*. You can reproduce this behaviour by creating a keyspace with NetworkTopologyStrategy, RF 3 per DC, dc_local_read_repair=1.0 and ALWAYS for speculative read. If you force a digest mismatch (e.g. by deleting a replica's SSTables and restarting) you can see in tracing that it is blocking for 4 responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
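To make the proposed cap concrete, here is a minimal sketch assuming 2.0-era names; contactedReplicas, strategy and localDc are stand-ins, not the actual patch:
{code}
// Hedged sketch, not the committed fix: when re-reading after a digest
// mismatch, never block on more responses than the local DC can supply.
int blockFor = contactedReplicas.size(); // may include read-repair and speculative targets
if (consistencyLevel.isDatacenterLocal() && strategy instanceof NetworkTopologyStrategy)
{
    int localRf = ((NetworkTopologyStrategy) strategy).getReplicationFactor(localDc);
    blockFor = Math.min(blockFor, localRf);
}
{code}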
[jira] [Commented] (CASSANDRA-9753) LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch
[ https://issues.apache.org/jira/browse/CASSANDRA-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642881#comment-14642881 ] Richard Low commented on CASSANDRA-9753: Using a remote replica for an eager retry is fine, but blocking on it for a later digest mismatch read is not. LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch --- Key: CASSANDRA-9753 URL: https://issues.apache.org/jira/browse/CASSANDRA-9753 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low When there is a digest mismatch during the initial read, a data read request is sent to all replicas involved in the initial read. This can be more than the initial blockFor if read repair was done and if speculative retry kicked in. E.g. for RF 3 in two DCs, the number of reads could be 4: 2 for LOCAL_QUORUM, 1 for read repair and 1 for speculative read if one replica was slow. If there is then a digest mismatch, Cassandra will issue the data read to all 4 and set blockFor=4. Now the read query is blocked on cross-DC latency. The digest mismatch read blockFor should be capped at RF for the local DC when using CL.LOCAL_*. You can reproduce this behaviour by creating a keyspace with NetworkTopologyStrategy, RF 3 per DC, dc_local_read_repair=1.0 and ALWAYS for speculative read. If you force a digest mismatch (e.g. by deleting a replica's SSTables and restarting) you can see in tracing that it is blocking for 4 responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9753) LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch
[ https://issues.apache.org/jira/browse/CASSANDRA-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642883#comment-14642883 ] Richard Low commented on CASSANDRA-9753: Actually I take that back. It's not fine, since it could violate local DC consistency. So fixing by avoiding any reads to remote DCs for eager retries would fix this too. LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch --- Key: CASSANDRA-9753 URL: https://issues.apache.org/jira/browse/CASSANDRA-9753 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low When there is a digest mismatch during the initial read, a data read request is sent to all replicas involved in the initial read. This can be more than the initial blockFor if read repair was done and if speculative retry kicked in. E.g. for RF 3 in two DCs, the number of reads could be 4: 2 for LOCAL_QUORUM, 1 for read repair and 1 for speculative read if one replica was slow. If there is then a digest mismatch, Cassandra will issue the data read to all 4 and set blockFor=4. Now the read query is blocked on cross-DC latency. The digest mismatch read blockFor should be capped at RF for the local DC when using CL.LOCAL_*. You can reproduce this behaviour by creating a keyspace with NetworkTopologyStrategy, RF 3 per DC, dc_local_read_repair=1.0 and ALWAYS for speculative read. If you force a digest mismatch (e.g. by deleting a replica's SSTables and restarting) you can see in tracing that it is blocking for 4 responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
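A minimal sketch of that alternative, assuming the 2.0-era snitch API (the surrounding variables are stand-ins):
{code}
// Hedged sketch: only consider replicas in the coordinator's DC for eager
// retries, so a later digest-mismatch read can never block cross-DC.
IEndpointSnitch snitch = DatabaseDescriptor.getEndpointSnitch();
String localDc = snitch.getDatacenter(FBUtilities.getBroadcastAddress());
List<InetAddress> retryCandidates = new ArrayList<InetAddress>();
for (InetAddress replica : allReplicas)
    if (localDc.equals(snitch.getDatacenter(replica)))
        retryCandidates.add(replica);
{code}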
[jira] [Commented] (CASSANDRA-9827) Add consistency level to tracing output
[ https://issues.apache.org/jira/browse/CASSANDRA-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14643012#comment-14643012 ] Richard Low commented on CASSANDRA-9827: +1. Can someone commit? Add consistency level to tracing output -- Key: CASSANDRA-9827 URL: https://issues.apache.org/jira/browse/CASSANDRA-9827 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Alec Grieser Priority: Minor Fix For: 2.1.9 Attachments: cassandra-9827-v1.diff To help get a better view of expected behavior of queries, it would be helpful if each query's consistency level (and, where applicable, the serial consistency level) were included in the tracing output. The proposed location would be within the session's row in the sessions table as an additional key-value pair included in the parameters column (along with the query string, for example). Having it here would easily allow the user to group each particular query with the actual consistency level used and thus compare it to expectation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
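For illustration, a sketch of where the level could be recorded; the parameter plumbing here is assumed, not copied from the attached diff:
{code}
// Hedged sketch: add the consistency level (and serial CL, when set) to the
// key-value parameters stored in the trace session row.
ImmutableMap.Builder<String, String> params = ImmutableMap.builder();
params.put("query", queryString);
params.put("consistency_level", options.getConsistency().name());
ConsistencyLevel serial = options.getSerialConsistency();
if (serial != null)
    params.put("serial_consistency_level", serial.name());
Tracing.instance.begin("Execute CQL3 query", params.build());
{code}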
[jira] [Commented] (CASSANDRA-5901) Bootstrap should also make the data consistent on the new node
[ https://issues.apache.org/jira/browse/CASSANDRA-5901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625756#comment-14625756 ] Richard Low commented on CASSANDRA-5901: It would be much better to not have to run repair. Repair can take a long time and increases the time that your cluster is running with a node down, increasing your chance of another failure before the replacement completes. Bootstrap should also make the data consistent on the new node -- Key: CASSANDRA-5901 URL: https://issues.apache.org/jira/browse/CASSANDRA-5901 Project: Cassandra Issue Type: Improvement Components: Core Reporter: sankalp kohli Assignee: Yuki Morishita Priority: Minor Currently when we are bootstrapping a new node, it might bootstrap from a node which does not have the most up-to-date data. Because of this, we need to run a repair after that. Most people will always run the repair so it would help if we could provide a parameter to bootstrap to run the repair once the bootstrap has finished. It can also stop the node from responding to reads till repair has finished. This could be another param as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
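A sketch of the flow being discussed, with the flag and helper names purely hypothetical:
{code}
// Hedged sketch of the original proposal (all names hypothetical):
// stream as today, then repair the new node's ranges before serving reads.
bootstrap(tokens);                                          // existing bootstrap streaming
if (Boolean.getBoolean("cassandra.repair_after_bootstrap")) // hypothetical -D flag
    repairLocalRanges();                                    // hypothetical: repair owned ranges only
finishJoiningRing();                                        // only now accept reads
{code}
The comment above argues the better fix is for bootstrap itself to stream consistent data, so the repair step, and the extended window with a node down, disappears entirely.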
[jira] [Commented] (CASSANDRA-9765) checkForEndpointCollision fails for legitimate collisions
[ https://issues.apache.org/jira/browse/CASSANDRA-9765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622507#comment-14622507 ] Richard Low commented on CASSANDRA-9765: Yes I can. checkForEndpointCollision fails for legitimate collisions - Key: CASSANDRA-9765 URL: https://issues.apache.org/jira/browse/CASSANDRA-9765 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Stefania Fix For: 2.0.17 Since CASSANDRA-7939, checkForEndpointCollision no longer catches a legitimate collision. Without CASSANDRA-7939, wiping a node and starting it again fails with 'A node with address %s already exists', but with it the node happily enters joining state, potentially streaming from the wrong place and violating consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9765) checkForEndpointCollision fails for legitimate collisions
Richard Low created CASSANDRA-9765: -- Summary: checkForEndpointCollision fails for legitimate collisions Key: CASSANDRA-9765 URL: https://issues.apache.org/jira/browse/CASSANDRA-9765 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Fix For: 2.0.17 Since CASSANDRA-7939, checkForEndpointCollision no longer catches a legitimate collision. Without CASSANDRA-7939, wiping a node and starting it again fails with 'A node with address %s already exists', but with it the node happily enters joining state, potentially streaming from the wrong place and violating consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
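For reference, a sketch of the check the ticket wants restored, using approximate 2.0-era gossip API names (isReplacingSameAddress is a stand-in):
{code}
// Hedged illustration of the pre-CASSANDRA-7939 behaviour: during the shadow
// round, refuse to join if a live node already claims this address and we are
// not explicitly replacing it.
InetAddress self = FBUtilities.getBroadcastAddress();
EndpointState epState = Gossiper.instance.getEndpointStateForEndpoint(self);
if (epState != null && !Gossiper.instance.isDeadState(epState) && !isReplacingSameAddress())
    throw new RuntimeException(String.format("A node with address %s already exists", self));
{code}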
[jira] [Created] (CASSANDRA-9753) LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch
Richard Low created CASSANDRA-9753: -- Summary: LOCAL_QUORUM reads can block cross-DC if there is a digest mismatch Key: CASSANDRA-9753 URL: https://issues.apache.org/jira/browse/CASSANDRA-9753 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low When there is a digest mismatch during the initial read, a data read request is sent to all replicas involved in the initial read. This can be more than the initial blockFor if read repair was done and if speculative retry kicked in. E.g. for RF 3 in two DCs, the number of reads could be 4: 2 for LOCAL_QUORUM, 1 for read repair and 1 for speculative read if one replica was slow. If there is then a digest mismatch, Cassandra will issue the data read to all 4 and set blockFor=4. Now the read query is blocked on cross-DC latency. The digest mismatch read blockFor should be capped at RF for the local DC when using CL.LOCAL_*. You can reproduce this behaviour by creating a keyspace with NetworkTopologyStrategy, RF 3 per DC, dc_local_read_repair=1.0 and ALWAYS for speculative read. If you force a digest mismatch (e.g. by deleting a replica's SSTables and restarting) you can see in tracing that it is blocking for 4 responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9641) Occasional timeouts with blockFor=all for LOCAL_QUORUM query
Richard Low created CASSANDRA-9641: -- Summary: Occasional timeouts with blockFor=all for LOCAL_QUORUM query Key: CASSANDRA-9641 URL: https://issues.apache.org/jira/browse/CASSANDRA-9641 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low We have a keyspace using NetworkTopologyStrategy with options DC1:3, DC2:3. Our tables have read_repair_chance = 0.0, dclocal_read_repair_chance = 0.1, speculative_retry = '99.0PERCENTILE' and all reads are at LOCAL_QUORUM. On 2.0.11, we occasionally see this timeout: Cassandra timeout during read query at consistency ALL (6 responses were required but only 5 replica responded) (sometimes only 4 respond). The ALL is probably due to CASSANDRA-7947 if this occurs during a digest mismatch, but what is interesting is it is expecting 6 responses i.e. blockFor is set to all replicas. I can’t see how this should happen. From the code it should never set blockFor to more than 4 (although 4 is still wrong - I'll make a separate JIRA for that). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
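Rough accounting of why 6 looks impossible here (reasoning from the description, not traced through the code):
{code}
// LOCAL_QUORUM over RF 3:                       blockFor = 3/2 + 1 = 2
// + 1 extra data read (dclocal read repair):    up to 3
// + 1 extra read (99th-percentile speculation): up to 4
// Observed blockFor = 6 = every replica in both DCs, which should be unreachable.
{code}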
[jira] [Created] (CASSANDRA-9596) Tombstone timestamps aren't used to skip SSTables while they are still in the memtable
Richard Low created CASSANDRA-9596: -- Summary: Tombstone timestamps aren't used to skip SSTables while they are still in the memtable Key: CASSANDRA-9596 URL: https://issues.apache.org/jira/browse/CASSANDRA-9596 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Fix For: 2.0.x If you have one SSTable containing a partition level tombstone at timestamp t and all other SSTables have cells with timestamp < t, Cassandra will skip all the other SSTables and return nothing quickly. However, if the partition tombstone is still in the memtable it doesn’t skip any SSTables. It should use the same timestamp logic to skip all SSTables. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
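A sketch of the uniform skip being asked for, with approximate 2.0-era names (the read-path call at the end is a stand-in):
{code}
// Hedged sketch: use the partition tombstone's timestamp to skip SSTables,
// whether the tombstone came from an SSTable or is still in the memtable.
long tombstonedAt = container.deletionInfo().getTopLevelDeletion().markedForDeleteAt;
for (SSTableReader sstable : candidateSSTables)
{
    if (sstable.getMaxTimestamp() < tombstonedAt)
        continue; // every cell in this sstable is shadowed by the tombstone
    iterators.add(makeIterator(sstable, key, filter)); // stand-in for the read path
}
{code}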
[jira] [Created] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
Richard Low created CASSANDRA-9434: -- Summary: If a node loses schema_columns SSTables it could delete all secondary indexes from the schema Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
[ https://issues.apache.org/jira/browse/CASSANDRA-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551601#comment-14551601 ] Richard Low commented on CASSANDRA-9434: cc [~iamaleksey] If a node loses schema_columns SSTables it could delete all secondary indexes from the schema - Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9434) If a node loses schema_columns SSTables it could delete all secondary indexes from the schema
[ https://issues.apache.org/jira/browse/CASSANDRA-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-9434: --- Description: It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. was: It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. If a node loses schema_columns SSTables it could delete all secondary indexes from the schema - Key: CASSANDRA-9434 URL: https://issues.apache.org/jira/browse/CASSANDRA-9434 Project: Cassandra Issue Type: Bug Reporter: Richard Low It is possible that a single bad node can delete all secondary indexes if it restarts and cannot read its schema_columns SSTables. Here's a reproduction: * Create a 2 node cluster (we saw it on 2.0.11) * Create the schema: {code} create keyspace myks with replication = {'class':'SimpleStrategy', 'replication_factor':1}; use myks; create table mytable (a text, b text, c text, PRIMARY KEY (a, b) ); create index myindex on mytable(b); {code} NB index must be on clustering column to repro * Kill one node * Wipe its commitlog and system/schema_columns sstables. * Start it again * Run on this node select index_name from system.schema_columns where keyspace_name = 'myks' and columnfamily_name = 'mytable' and column_name = 'b'; and you'll see the index is null. * Run 'describe schema' on the other node. Sometimes it will not show the index, but you might need to bounce for it to disappear. I think the culprit is SystemKeyspace.copyAllAliasesToColumnsProper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541008#comment-14541008 ] Richard Low commented on CASSANDRA-9183: Is it possible to get this in 2.1 too? Failure detector should detect and ignore local pauses -- Key: CASSANDRA-9183 URL: https://issues.apache.org/jira/browse/CASSANDRA-9183 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 2.2 beta 1 Attachments: 9183-v2.txt, 9183.txt A local node can be paused for many reasons such as GC, and if the pause is long enough when it recovers it will think all the other nodes are dead until it gossips, causing UAE to be thrown to clients trying to use it as a coordinator. Instead, the FD can track the current time, and if the gap there becomes too large, skip marking the nodes down (reset the FD data perhaps) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8336) Add shutdown gossip state to prevent timeouts during rolling restarts
[ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538198#comment-14538198 ] Richard Low commented on CASSANDRA-8336: How big is your largest QA cluster? I did extensive manual tests to verify this fixes the issue in a large cluster. Add shutdown gossip state to prevent timeouts during rolling restarts - Key: CASSANDRA-8336 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336 Project: Cassandra Issue Type: Bug Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 2.0.15, 2.1.5 Attachments: 8336-v2.txt, 8336-v3.txt, 8336-v4.txt, 8336.txt, 8366-v5.txt In CASSANDRA-3936 we added a gossip shutdown announcement. The problem here is that this isn't sufficient; you can still get TOEs and have to wait on the FD to figure things out. This happens due to gossip propagation time and variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, it can initiate gossip with Y and thus mark X alive again. I propose quarantining to solve this, however I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533030#comment-14533030 ] Richard Low commented on CASSANDRA-9183: +1. Very minor comment: it would be slightly clearer to set lastInterpret immediately after the diff calculation rather than in both cases. Failure detector should detect and ignore local pauses -- Key: CASSANDRA-9183 URL: https://issues.apache.org/jira/browse/CASSANDRA-9183 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 3.x Attachments: 9183-v2.txt, 9183.txt A local node can be paused for many reasons such as GC, and if the pause is long enough when it recovers it will think all the other nodes are dead until it gossips, causing UAE to be thrown to clients trying to use it as a coordinator. Instead, the FD can track the current time, and if the gap there becomes too large, skip marking the nodes down (reset the FD data perhaps) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
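For context, the shape of the patch under review as a hedged sketch (constant and field names approximate), with the nit above applied so lastInterpret is set once, straight after the diff:
{code}
// Hedged sketch: treat an unexpectedly large gap since the last interpret
// round as a local pause (GC, VM freeze) and skip convicting peers once.
long now = System.nanoTime();
long diff = now - lastInterpret;
lastInterpret = now; // per the review comment: set once, right after the diff
if (diff > MAX_LOCAL_PAUSE_IN_NANOS)
{
    logger.warn("Not marking nodes down due to local pause of {} ns", diff);
    return;
}
{code}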
[jira] [Updated] (CASSANDRA-9280) Streaming connections should bind to the broadcast_address of the node
[ https://issues.apache.org/jira/browse/CASSANDRA-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-9280: --- Since Version: 1.2.0 Streaming connections should bind to the broadcast_address of the node -- Key: CASSANDRA-9280 URL: https://issues.apache.org/jira/browse/CASSANDRA-9280 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Yuki Morishita Priority: Minor Currently, if you have multiple interfaces on a server, a node receiving a stream may show the stream as coming from the wrong IP in e.g. nodetool netstats. The IP is taken as the source of the socket, which may not be the same as the node’s broadcast_address. The outgoing socket should be explicitly bound to the broadcast_address. It seems like this was fixed a long time ago in CASSANDRA-737 but has since broken. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
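A minimal sketch of the fix being requested, with port and timeout handling elided (values assumed):
{code}
// Hedged sketch: bind the local end of the outgoing stream socket to
// broadcast_address before connecting, so the receiving node reports the
// correct peer IP in nodetool netstats.
Socket socket = new Socket();
socket.bind(new InetSocketAddress(FBUtilities.getBroadcastAddress(), 0));
socket.connect(new InetSocketAddress(peer, DatabaseDescriptor.getStoragePort()), 2000); // timeout ms assumed
{code}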
[jira] [Created] (CASSANDRA-9280) Streaming connections should bind to the broadcast_address of the node
Richard Low created CASSANDRA-9280: -- Summary: Streaming connections should bind to the broadcast_address of the node Key: CASSANDRA-9280 URL: https://issues.apache.org/jira/browse/CASSANDRA-9280 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Priority: Minor Currently, if you have multiple interfaces on a server, a node receiving a stream may show the stream as coming from the wrong IP in e.g. nodetool netstats. The IP is taken as the source of the socket, which may not be the same as the node’s broadcast_address. The outgoing socket should be explicitly bound to the broadcast_address. It seems like this was fixed a long time ago in CASSANDRA-737 but has since broken. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message
[ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493345#comment-14493345 ] Richard Low commented on CASSANDRA-8336: +1, thanks Brandon! Quarantine nodes after receiving the gossip shutdown message Key: CASSANDRA-8336 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336 Project: Cassandra Issue Type: Bug Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 2.0.15 Attachments: 8336-v2.txt, 8336-v3.txt, 8336-v4.txt, 8336.txt, 8366-v5.txt In CASSANDRA-3936 we added a gossip shutdown announcement. The problem here is that this isn't sufficient; you can still get TOEs and have to wait on the FD to figure things out. This happens due to gossip propagation time and variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, it can initiate gossip with Y and thus mark X alive again. I propose quarantining to solve this, however I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message
[ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341222#comment-14341222 ] Richard Low commented on CASSANDRA-8336: Here it is: {code} ERROR [main] 2015-02-27 18:11:57,584 CassandraDaemon.java (line 513) Exception encountered during startup java.lang.RuntimeException: Unable to gossip with any seeds at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1270) at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:459) at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:673) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:625) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:517) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) INFO [StorageServiceShutdownHook] 2015-02-27 18:11:57,605 Gossiper.java (line 1370) Announcing shutdown ERROR [StorageServiceShutdownHook] 2015-02-27 18:11:57,607 CassandraDaemon.java (line 199) Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.AssertionError at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1339) at org.apache.cassandra.gms.Gossiper.stop(Gossiper.java:1371) at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:586) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:745) {code} Quarantine nodes after receiving the gossip shutdown message Key: CASSANDRA-8336 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336 Project: Cassandra Issue Type: Bug Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 2.0.13 Attachments: 8336-v2.txt, 8336-v3.txt, 8336.txt In CASSANDRA-3936 we added a gossip shutdown announcement. The problem here is that this isn't sufficient; you can still get TOEs and have to wait on the FD to figure things out. This happens due to gossip propagation time and variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, it can initiate gossip with Y and thus mark X alive again. I propose quarantining to solve this, however I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
Richard Low created CASSANDRA-8829: -- Summary: Add extra checks to catch SSTable ref counting bugs Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Fix For: 2.0.13, 2.1.4 There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
[ https://issues.apache.org/jira/browse/CASSANDRA-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-8829: --- Attachment: 8829-2.0.patch Add extra checks to catch SSTable ref counting bugs --- Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Fix For: 2.0.13, 2.1.4 Attachments: 8829-2.0.patch There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
[ https://issues.apache.org/jira/browse/CASSANDRA-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326321#comment-14326321 ] Richard Low commented on CASSANDRA-8829: Attached patch for 2.0. Add extra checks to catch SSTable ref counting bugs --- Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Fix For: 2.0.13, 2.1.4 Attachments: 8829-2.0.patch There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
[ https://issues.apache.org/jira/browse/CASSANDRA-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326427#comment-14326427 ] Richard Low commented on CASSANDRA-8829: Agreed that the check in releaseReference will break SSTableLoader. What do you think about the assert in markReferenced? Add extra checks to catch SSTable ref counting bugs --- Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Richard Low Fix For: 2.0.13, 2.1.4 Attachments: 8829-2.0.patch There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
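A sketch of the markReferenced-side assert under discussion; the 2.0 reference count is an AtomicInteger, but the exact method and field names vary between branches:
{code}
// Hedged sketch: fail fast if a reference is taken on an sstable whose
// count already dropped to zero (i.e. it was released while still readable).
public void markReferenced()
{
    int n = references.incrementAndGet();
    assert n > 1 : "referenced a released sstable " + getFilename();
}
{code}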
[jira] [Commented] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
[ https://issues.apache.org/jira/browse/CASSANDRA-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326749#comment-14326749 ] Richard Low commented on CASSANDRA-8829: Attached v2 patch with releaseReference check removed and your containsAll suggestion. Add extra checks to catch SSTable ref counting bugs --- Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Richard Low Fix For: 2.0.13, 2.1.4 Attachments: 8829-2.0-v2.patch, 8829-2.0.patch There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8829) Add extra checks to catch SSTable ref counting bugs
[ https://issues.apache.org/jira/browse/CASSANDRA-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-8829: --- Attachment: 8829-2.0-v2.patch Add extra checks to catch SSTable ref counting bugs --- Key: CASSANDRA-8829 URL: https://issues.apache.org/jira/browse/CASSANDRA-8829 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Richard Low Fix For: 2.0.13, 2.1.4 Attachments: 8829-2.0-v2.patch, 8829-2.0.patch There have been some bad effects from ref counting bugs (see e.g. CASSANDRA-7704). We should add extra checks so we can more easily diagnose any future problems and avoid some of the side effects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message
[ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325170#comment-14325170 ] Richard Low commented on CASSANDRA-8336: v3 works well and I was able to do a full cluster bounce with zero timeouts. Here are a few minor points: * The shutting down node might as well set the version of the shutdown state to Integer.MAX_VALUE since receiving nodes will blindly use that. * Why does it increment the generation number? We call Gossiper.instance.start with a new generation number set to the current time so it would make sense to use that. * If it hits 'Unable to gossip with any seeds' on replace, it shuts down the gossiper. This throws an AssertionError in addLocalApplicationState since the local epState is null. Quarantine nodes after receiving the gossip shutdown message Key: CASSANDRA-8336 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336 Project: Cassandra Issue Type: Bug Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 2.0.13 Attachments: 8336-v2.txt, 8336-v3.txt, 8336.txt In CASSANDRA-3936 we added a gossip shutdown announcement. The problem here is that this isn't sufficient; you can still get TOEs and have to wait on the FD to figure things out. This happens due to gossip propagation time and variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, it can initiate gossip with Y and thus mark X alive again. I propose quarantining to solve this, however I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
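A sketch of the first point, with gossip API names approximate for the 2.0 branch (the version-forcing helper is assumed):
{code}
// Hedged sketch: publish SHUTDOWN with the highest possible version so any
// peer accepts it regardless of the gossip version it currently holds for us.
EndpointState local = Gossiper.instance.getEndpointStateForEndpoint(FBUtilities.getBroadcastAddress());
local.addApplicationState(ApplicationState.STATUS, StorageService.instance.valueFactory.shutdown(true));
local.getHeartBeatState().forceHighestPossibleVersionUnsafe(); // assumed helper
{code}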
[jira] [Commented] (CASSANDRA-8815) Race in sstable ref counting during streaming failures
[ https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325007#comment-14325007 ] Richard Low commented on CASSANDRA-8815: I think we should add some assertions that would avoid the bad effect of this. I'll prepare a patch and put it here. Race in sstable ref counting during streaming failures Key: CASSANDRA-8815 URL: https://issues.apache.org/jira/browse/CASSANDRA-8815 Project: Cassandra Issue Type: Bug Components: Core Reporter: sankalp kohli Assignee: Benedict Fix For: 2.0.13 Attachments: 8815.txt We have seen a machine in Prod where all read threads are blocked (spinning) on trying to acquire the reference lock on sstables. There are also some stream sessions which are doing the same. On looking at the heap dump, we could see that a live sstable which is part of the View has a ref count = 0. This sstable is also not compacting, nor is it part of any failed compaction. On looking through the code, we could see that if the ref goes to zero and the sstable is part of the View, all reader threads will spin forever. On further looking through the streaming code, we could see that if StreamTransferTask.complete is called after closeSession has been called due to an error in OutgoingMessageHandler, it will double-decrement the ref count of an sstable. This race can happen, and we can see from exceptions in the logs that closeSession was triggered by OutgoingMessageHandler. The fix for this is very simple, I think. In StreamTransferTask.abort, we can remove a file from "files" before decrementing the ref count. This will avoid this race. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
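A sketch of that ordering fix, with field and type names approximate for the 2.0 streaming code:
{code}
// Hedged sketch of StreamTransferTask.abort(): remove each entry from the
// shared "files" map before releasing its reference, so a racing complete()
// can never release the same reference a second time.
Iterator<Map.Entry<Integer, OutgoingFileMessage>> iter = files.entrySet().iterator();
while (iter.hasNext())
{
    OutgoingFileMessage msg = iter.next().getValue();
    iter.remove();                  // make it invisible to complete() first
    msg.sstable.releaseReference(); // then drop our reference exactly once
}
{code}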
[jira] [Commented] (CASSANDRA-7968) permissions_validity_in_ms should be settable via JMX
[ https://issues.apache.org/jira/browse/CASSANDRA-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299269#comment-14299269 ] Richard Low commented on CASSANDRA-7968: How is this meant to work? The MBean is never registered so how do I call it? permissions_validity_in_ms should be settable via JMX - Key: CASSANDRA-7968 URL: https://issues.apache.org/jira/browse/CASSANDRA-7968 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Priority: Minor Fix For: 2.0.11, 2.1.1 Attachments: 7968.txt Oftentimes people don't think about auth problems and just run with the default of RF=2 and 2000ms until it's too late, and at that point doing a rolling restart to change the permissions cache can be a bit painful vs setting it via JMX everywhere and then updating the yaml for future restarts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
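For illustration, the registration step whose absence is being questioned, a sketch only; the object name and bean variable are assumed, not taken from the patch:
{code}
// Hedged sketch: without an explicit registration like this at startup,
// the JMX setter is unreachable from tools like jconsole or jmxterm.
MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
mbs.registerMBean(permissionsMBean, new ObjectName("org.apache.cassandra.auth:type=PermissionsCache"));
{code}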
[jira] [Commented] (CASSANDRA-7968) permissions_validity_in_ms should be settable via JMX
[ https://issues.apache.org/jira/browse/CASSANDRA-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299285#comment-14299285 ] Richard Low commented on CASSANDRA-7968: As Benedict says, there is at least one user who cares :) permissions_validity_in_ms should be settable via JMX - Key: CASSANDRA-7968 URL: https://issues.apache.org/jira/browse/CASSANDRA-7968 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Priority: Minor Fix For: 2.0.11, 2.1.1 Attachments: 7968.txt Oftentimes people don't think about auth problems and just run with the default of RF=2 and 2000ms until it's too late, and at that point doing a rolling restart to change the permissions cache can be a bit painful vs setting it via JMX everywhere and then updating the yaml for future restarts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278204#comment-14278204 ] Richard Low commented on CASSANDRA-8414: I tested this on some real workload SSTables and got a 2x speedup on force compaction! Also the output was the same as before. Can someone commit the patch? Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.0.12, 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt, cassandra-2.0-8414-4.txt, cassandra-2.0-8414-5.txt, cassandra-2.1-8414-5.txt, cassandra-2.1-8414-6.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-8414: --- Fix Version/s: 2.0.12 Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.0.12, 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt, cassandra-2.0-8414-4.txt, cassandra-2.0-8414-5.txt, cassandra-2.1-8414-5.txt, cassandra-2.1-8414-6.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274593#comment-14274593 ] Richard Low commented on CASSANDRA-8414: Only minor nit is that the BitSet can be initialized with size rather than cells.length, but otherwise +1. Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt, cassandra-2.0-8414-4.txt, cassandra-2.0-8414-5.txt, cassandra-2.1-8414-5.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
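For readers following along, a hedged sketch of the single-pass approach the patch series converged on, with the nit applied (BitSet sized by the live cell count); the GC condition is approximate:
{code}
// Hedged sketch: mark GCable cells in a BitSet, then compact the backing
// array in one O(n) pass instead of O(n^2) iterator.remove() calls.
BitSet removedCells = new BitSet(size); // per review: size, not cells.length
for (int i = 0; i < size; i++)
    if (cells[i].getLocalDeletionTime() < gcBefore) // approximate GCable test
        removedCells.set(i);
int out = 0;
for (int i = 0; i < size; i++)
    if (!removedCells.get(i))
        cells[out++] = cells[i];
size = out;
{code}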
[jira] [Commented] (CASSANDRA-5913) Nodes with no gossip STATUS shown as UN by nodetool:status
[ https://issues.apache.org/jira/browse/CASSANDRA-5913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273930#comment-14273930 ] Richard Low commented on CASSANDRA-5913: I have seen this on 1.2.19 and 2.0.9. I suspect the root cause is CASSANDRA-6125. Nodes with no gossip STATUS shown as UN by nodetool:status Key: CASSANDRA-5913 URL: https://issues.apache.org/jira/browse/CASSANDRA-5913 Project: Cassandra Issue Type: Bug Components: Core Environment: 1.2.8 Reporter: Chris Burroughs Priority: Minor I have no idea if this is a valid situation or a larger problem, but either way nodetool status should not make it look like everything is a-okay. From nt:gossipinfo: {noformat} /64.215.255.182 RACK:NOP NET_VERSION:6 HOST_ID:4f3b214b-b03e-46eb-8214-5fab2662a06b RELEASE_VERSION:1.2.8 DC:IAD INTERNAL_IP:10.15.2.182 SCHEMA:59adb24e-f3cd-3e02-97f0-5b395827453f RPC_ADDRESS:0.0.0.0 {noformat} {noformat} $ ./bin/nt.sh status | grep -i 4055109d-800d-4743-8efa-4ecfff883463 UN 64.215.255.182 63.84 GB 256 2.5% 4055109d-800d-4743-8efa-4ecfff883463 NOP {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8515) Hang at startup when no commitlog space
[ https://issues.apache.org/jira/browse/CASSANDRA-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271668#comment-14271668 ] Richard Low commented on CASSANDRA-8515: I think #5737 was marked as invalid because it was thought to be a bug outside of Cassandra. But understanding the cause means we can do something about it, and I think logging and stopping would be the right approach, as you say. Hang at startup when no commitlog space --- Key: CASSANDRA-8515 URL: https://issues.apache.org/jira/browse/CASSANDRA-8515 Project: Cassandra Issue Type: Bug Reporter: Richard Low Fix For: 2.0.12 If the commit log directory has no free space, Cassandra hangs on startup. The main thread is waiting: {code} main prio=9 tid=0x7fefe400f800 nid=0x1303 waiting on condition [0x00010b9c1000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007dc8c5fc8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:137) at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:299) at org.apache.cassandra.db.commitlog.CommitLog.<init>(CommitLog.java:73) at org.apache.cassandra.db.commitlog.CommitLog.<clinit>(CommitLog.java:53) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:360) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:339) at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:699) at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:208) at org.apache.cassandra.db.SystemKeyspace.updateSchemaVersion(SystemKeyspace.java:390) - locked 0x0007de2f2ce0 (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace) at org.apache.cassandra.config.Schema.updateVersion(Schema.java:384) at org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:532) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) {code} but COMMIT-LOG-ALLOCATOR is RUNNABLE: {code} COMMIT-LOG-ALLOCATOR prio=9 tid=0x7fefe5402800 nid=0x7513 in Object.wait() [0x000118252000] java.lang.Thread.State: RUNNABLE at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:116) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:745) {code} but making no progress. This behaviour has changed since 1.2 (see CASSANDRA-5737). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
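A sketch of the "log and stop" behaviour suggested above, with helper names approximate for the 2.0 commit log code:
{code}
// Hedged sketch: surface the allocation failure and apply the disk failure
// policy instead of blocking forever in fetchSegment().
try
{
    CommitLogSegment segment = CommitLogSegment.freshSegment();
    segmentQueue.add(segment); // stand-in for the allocator's internal queue
}
catch (FSWriteError e)
{
    logger.error("Failed to allocate commit log segment; is the commitlog volume full?", e);
    FileUtils.handleFSError(e); // stop or die per disk_failure_policy
}
{code}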
[jira] [Comment Edited] (CASSANDRA-8515) Hang at startup when no commitlog space
[ https://issues.apache.org/jira/browse/CASSANDRA-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271668#comment-14271668 ] Richard Low edited comment on CASSANDRA-8515 at 1/9/15 6:38 PM: I think CASSANDRA-5737 was marked as invalid because it was thought to be a bug outside of Cassandra. But understanding the cause means we can do something about it, and I think logging and stopping would be the right approach, as you say. was (Author: rlow): I think #5737 was marked as invalid because it was thought to be a bug outside of Cassandra. But understanding the cause means we can do something about it, and I think logging and stopping would be the right approach, as you say. Hang at startup when no commitlog space --- Key: CASSANDRA-8515 URL: https://issues.apache.org/jira/browse/CASSANDRA-8515 Project: Cassandra Issue Type: Bug Reporter: Richard Low Fix For: 2.0.12 If the commit log directory has no free space, Cassandra hangs on startup. The main thread is waiting: {code} main prio=9 tid=0x7fefe400f800 nid=0x1303 waiting on condition [0x00010b9c1000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007dc8c5fc8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:137) at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:299) at org.apache.cassandra.db.commitlog.CommitLog.<init>(CommitLog.java:73) at org.apache.cassandra.db.commitlog.CommitLog.<clinit>(CommitLog.java:53) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:360) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:339) at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:699) at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:208) at org.apache.cassandra.db.SystemKeyspace.updateSchemaVersion(SystemKeyspace.java:390) - locked 0x0007de2f2ce0 (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace) at org.apache.cassandra.config.Schema.updateVersion(Schema.java:384) at org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:532) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) {code} but COMMIT-LOG-ALLOCATOR is RUNNABLE: {code} COMMIT-LOG-ALLOCATOR prio=9 tid=0x7fefe5402800 nid=0x7513 in Object.wait() [0x000118252000] java.lang.Thread.State: RUNNABLE at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:116) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:745) {code} but making no progress. This behaviour has changed since 1.2 (see CASSANDRA-5737). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8515) Hang at startup when no commitlog space
[ https://issues.apache.org/jira/browse/CASSANDRA-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-8515: --- Description: If the commit log directory has no free space, Cassandra hangs on startup. The main thread is waiting: {code} main prio=9 tid=0x7fefe400f800 nid=0x1303 waiting on condition [0x00010b9c1000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007dc8c5fc8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:137) at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:299) at org.apache.cassandra.db.commitlog.CommitLog.<init>(CommitLog.java:73) at org.apache.cassandra.db.commitlog.CommitLog.<clinit>(CommitLog.java:53) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:360) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:339) at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:699) at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:208) at org.apache.cassandra.db.SystemKeyspace.updateSchemaVersion(SystemKeyspace.java:390) - locked 0x0007de2f2ce0 (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace) at org.apache.cassandra.config.Schema.updateVersion(Schema.java:384) at org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:532) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) {code} but COMMIT-LOG-ALLOCATOR is RUNNABLE: {code} COMMIT-LOG-ALLOCATOR prio=9 tid=0x7fefe5402800 nid=0x7513 in Object.wait() [0x000118252000] java.lang.Thread.State: RUNNABLE at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:116) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:745) {code} but making no progress. This behaviour has changed since 1.2 (see CASSANDRA-5737). was: If the commit log directory has no free space, Cassandra hangs on startup. 
The main thread is waiting: {code} main prio=9 tid=0x7fefe400f800 nid=0x1303 waiting on condition [0x00010b9c1000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007dc8c5fc8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:137) at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:299) at org.apache.cassandra.db.commitlog.CommitLog.<init>(CommitLog.java:73) at org.apache.cassandra.db.commitlog.CommitLog.<clinit>(CommitLog.java:53) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:360) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:339) at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:699) at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:208) at org.apache.cassandra.db.SystemKeyspace.updateSchemaVersion(SystemKeyspace.java:390) - locked 0x0007de2f2ce0 (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace) at org.apache.cassandra.config.Schema.updateVersion(Schema.java:384) at org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:532) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) {code} but COMMIT-LOG-ALLOCATOR is RUNNABLE: {code} COMMIT-LOG-ALLOCATOR prio=9 tid=0x7fefe5402800 nid=0x7513 in Object.wait() [0x000118252000] java.lang.Thread.State: RUNNABLE at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:116) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:745) {code} but making no progress. This behaviour has changed since 1.2 (see CASSANDRA-5737). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268075#comment-14268075 ] Richard Low commented on CASSANDRA-8414: +1 on 2.0 v5. Do you have a 2.1 version? Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt, cassandra-2.0-8414-4.txt, cassandra-2.0-8414-5.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8515) Hang at startup when no commitlog space
Richard Low created CASSANDRA-8515: -- Summary: Hang at startup when no commitlog space Key: CASSANDRA-8515 URL: https://issues.apache.org/jira/browse/CASSANDRA-8515 Project: Cassandra Issue Type: Bug Reporter: Richard Low If the commit log directory has no free space, Cassandra hangs on startup. The main thread is waiting:
{code}
main prio=9 tid=0x7fefe400f800 nid=0x1303 waiting on condition [0x00010b9c1000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for 0x0007dc8c5fc8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:137)
at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:299)
at org.apache.cassandra.db.commitlog.CommitLog.<init>(CommitLog.java:73)
at org.apache.cassandra.db.commitlog.CommitLog.<clinit>(CommitLog.java:53)
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:360)
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:339)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:211)
at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:699)
at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:208)
at org.apache.cassandra.db.SystemKeyspace.updateSchemaVersion(SystemKeyspace.java:390)
- locked 0x0007de2f2ce0 (a java.lang.Class for org.apache.cassandra.db.SystemKeyspace)
at org.apache.cassandra.config.Schema.updateVersion(Schema.java:384)
at org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:532)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
{code}
but COMMIT-LOG-ALLOCATOR is RUNNABLE:
{code}
COMMIT-LOG-ALLOCATOR prio=9 tid=0x7fefe5402800 nid=0x7513 in Object.wait() [0x000118252000]
java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:116)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
{code}
but making no progress. This behaviour has changed since 1.2 (see CASSANDRA-5737). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
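The deadlock shape is worth spelling out: the main thread parks in LinkedBlockingQueue.take() waiting for a segment that the allocator thread, stuck on the full disk, will never enqueue. Below is a minimal, self-contained Java sketch of a fail-fast alternative (a bounded poll); the class and method names are illustrative, not Cassandra's actual CommitLogAllocator API.
{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch: bound the wait for a fresh segment so a stuck allocator surfaces
// as an error instead of an indefinite hang of the main thread.
public class FailFastFetch
{
    private final LinkedBlockingQueue<String> segments = new LinkedBlockingQueue<>();

    public String fetchSegmentOrFail(long timeout, TimeUnit unit) throws InterruptedException
    {
        String segment = segments.poll(timeout, unit); // take() would park forever
        if (segment == null)
            throw new IllegalStateException("no commit log segment became available after "
                                            + timeout + " " + unit + "; is the commit log disk full?");
        return segment;
    }

    public static void main(String[] args) throws InterruptedException
    {
        // Nothing is ever enqueued, simulating an allocator that cannot create
        // segments: this throws after two seconds rather than hanging startup.
        new FailFastFetch().fetchSegmentOrFail(2, TimeUnit.SECONDS);
    }
}
{code}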
[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-8414: --- Reviewer: Richard Low Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-5737) CassandraDaemon - recent unsafe memory access operation in compiled Java code
[ https://issues.apache.org/jira/browse/CASSANDRA-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251018#comment-14251018 ] Richard Low commented on CASSANDRA-5737: I get exactly this error when the disk is full on Linux. It must be some poor handling of the disk full error by the JVM. CassandraDaemon - recent unsafe memory access operation in compiled Java code - Key: CASSANDRA-5737 URL: https://issues.apache.org/jira/browse/CASSANDRA-5737 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.2.6 Environment: Amazon EC2, XLarge instance. Ubuntu 12.04.2 LTS Raid 0 disks, with ext4 Reporter: Glyn Davies I'm using 1.2.6 on Ubuntu AWS m1.xlarge instances with the Datastax Community package and have tried using Java versions jdk1.7.0_25 and jre1.6.0_45, also testing with and without libjna-java (i.e. the JNA jar). However, something has triggered a bug in the CassandraDaemon:
ERROR [COMMIT-LOG-ALLOCATOR] 2013-07-05 15:00:51,663 CassandraDaemon.java (line 192) Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main]
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:126)
at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:81)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.createFreshSegment(CommitLogAllocator.java:250)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.access$500(CommitLogAllocator.java:48)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:104)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Unknown Source)
This brought two nodes down out of a three-node cluster – using QUORUM write with 3 replicas. Restarting the node replays this error, so I have the system in a 'stable' unstable state – which is probably a good place for troubleshooting. Presumably something a client wrote triggered this situation, and the other third node was to be the final replication point – and is thus still up. Subsequently discovered that only a reboot will allow that node to come back up. A Java bug was raised with Oracle after a crash dump indicated a SIGBUS. http://bugs.sun.com/view_bug.do?bug_id=9004953 At this point, I'm thinking that there is potentially a Linux kernel bug being triggered? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246961#comment-14246961 ] Richard Low commented on CASSANDRA-8414: Thanks for writing the patch! A few comments:
- The v3 patch is missing the BatchIterator interface.
- There are some unnecessary formatting changes and import-order switches.
- The remove method should throw IllegalStateException if called twice on the same element, to adhere to the Iterator contract.
- Calling commit twice will remove the wrong elements; it should either throw IllegalStateException when called more than once or be made idempotent.
- Could add 'assert test = src;' to the copy method to enforce the comment.
Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237987#comment-14237987 ] Richard Low commented on CASSANDRA-8414: Yes, I can review this week. Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8414) Compaction is O(n^2) when deleting lots of tombstones
Richard Low created CASSANDRA-8414: -- Summary: Compaction is O(n^2) when deleting lots of tombstones Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
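For readers following along, the cost claim is easy to demonstrate: Iterator.remove() on an ArrayList shifts the tail of the backing array on every call, so removing most of n elements is O(n^2), while copying the survivors into a fresh list is O(n). A self-contained Java sketch (not Cassandra code) that makes the difference measurable:
{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class RemoveCost
{
    public static void main(String[] args)
    {
        int n = 50_000;
        List<Integer> columns = new ArrayList<>();
        for (int i = 0; i < n; i++)
            columns.add(i);

        // O(n^2): in-place removal, the pattern removeDeletedStandard hits when
        // most columns are GCable tombstones (here: remove every even element).
        List<Integer> inPlace = new ArrayList<>(columns);
        long t0 = System.nanoTime();
        for (Iterator<Integer> it = inPlace.iterator(); it.hasNext(); )
            if (it.next() % 2 == 0)
                it.remove();
        long inPlaceMs = (System.nanoTime() - t0) / 1_000_000;

        // O(n): copy the survivors instead of deleting in place.
        t0 = System.nanoTime();
        List<Integer> copied = new ArrayList<>(columns.size());
        for (Integer c : columns)
            if (c % 2 != 0)
                copied.add(c);
        long copiedMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.printf("in-place remove: %d ms, copy: %d ms%n", inPlaceMs, copiedMs);
    }
}
{code}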
[jira] [Created] (CASSANDRA-8416) AssertionError 'Incoherent new size -1' during hints compaction
Richard Low created CASSANDRA-8416: -- Summary: AssertionError 'Incoherent new size -1' during hints compaction Key: CASSANDRA-8416 URL: https://issues.apache.org/jira/browse/CASSANDRA-8416 Project: Cassandra Issue Type: Bug Reporter: Richard Low I've seen the error on 2.0.9: java.lang.AssertionError: Incoherent new size -1 replacing [SSTableReader(path='/cassandra/d1/data/system/hints/system-hints-jb-24386-Data.db')] by [] in View(pending_count=0, sstables=[], compacting=[SSTableReader(path='/cassandra/d1/data/system/hints/system-hints-jb-24386-Data.db')]) in logs during hints compaction. It looks like there are 2 concurrent compactions of the same file - just before this error the logs say:
INFO [CompactionExecutor:220316] 2014-11-19 22:53:54,650 CompactionTask.java (line 115) Compacting [SSTableReader(path='/cassandra/d1/data/system/hints/system-hints-jb-24386-Data.db')]
INFO [CompactionExecutor:220315] 2014-11-19 22:53:54,651 CompactionTask.java (line 115) Compacting [SSTableReader(path='/cassandra/d1/data/system/hints/system-hints-jb-24386-Data.db')]
The assertion is:
int newSSTablesSize = sstables.size() - oldSSTables.size() + Iterables.size(replacements);
assert newSSTablesSize >= Iterables.size(replacements) : String.format("Incoherent new size %d replacing %s by %s in %s", newSSTablesSize, oldSSTables, replacements, this);
So if the first compaction completes, the second one has sstables=[] (as seen in the assertion failure print), so newSSTablesSize = 0 - 1 + 0 = -1 and we get the error. It is possible the root cause is the same as CASSANDRA-7145. Does anyone know how to tell? The error happens very rarely, so it is hard to tell from testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
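A tiny Java replay of that arithmetic under the suspected race, using the values visible in the log above:
{code}
// The first compaction finishes and removes the hints sstable from the view;
// the second compaction, handed the same SSTableReader, then computes:
public class IncoherentSize
{
    public static void main(String[] args)
    {
        int sstables = 0;     // View(sstables=[]) after the first compaction completed
        int oldSSTables = 1;  // the second compaction still references the same file
        int replacements = 0; // the hints compaction produced no output sstable
        System.out.println(sstables - oldSSTables + replacements); // -1, tripping the assertion
    }
}
{code}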
[jira] [Created] (CASSANDRA-8121) Audit acquire/release SSTable references
Richard Low created CASSANDRA-8121: -- Summary: Audit acquire/release SSTable references Key: CASSANDRA-8121 URL: https://issues.apache.org/jira/browse/CASSANDRA-8121 Project: Cassandra Issue Type: Task Components: Core Reporter: Richard Low There are instances where SSTable references are not guaranteed to be released (e.g. CompactionTask.runWith) because there is no try/finally around the reference acquire/release. We should audit all places where SSTable references are acquired and wrap them appropriately. Leaked references cause junk files to build up on disk and on a restart can lead to data resurrection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
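A sketch of the shape the audit should enforce, with an illustrative reference-counted class rather than Cassandra's SSTableReader API: the release goes in a finally block immediately after a successful acquire, so no exception path can leak the reference.
{code}
import java.util.concurrent.atomic.AtomicInteger;

public class RefCounted
{
    private final AtomicInteger refs = new AtomicInteger(1); // creator holds one reference

    public boolean acquire()
    {
        while (true)
        {
            int n = refs.get();
            if (n <= 0)
                return false; // already fully released; backing files may be gone
            if (refs.compareAndSet(n, n + 1))
                return true;
        }
    }

    public void release()
    {
        if (refs.decrementAndGet() == 0)
            System.out.println("last reference released: safe to delete files");
    }

    public static void main(String[] args)
    {
        RefCounted sstable = new RefCounted();
        if (sstable.acquire())
        {
            try
            {
                // ... compact, stream, or read from the sstable ...
            }
            finally
            {
                sstable.release(); // runs even if the task above throws
            }
        }
        sstable.release(); // drop the creator's reference
    }
}
{code}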
[jira] [Created] (CASSANDRA-8113) Gossip should ignore generation numbers too far in the future
Richard Low created CASSANDRA-8113: -- Summary: Gossip should ignore generation numbers too far in the future Key: CASSANDRA-8113 URL: https://issues.apache.org/jira/browse/CASSANDRA-8113 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Richard Low If a node sends corrupted gossip, it could set the generation numbers for other nodes to arbitrarily large values. This is dangerous since one bad node (e.g. with bad memory) could in theory bring down the cluster. Nodes should refuse to accept generation numbers that are too far in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
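A hedged sketch of such a guard; the constant name and the one-year bound are illustrative assumptions, not what Cassandra ships:
{code}
public class GenerationGuard
{
    // Generations are epoch seconds; allow generous clock drift but nothing absurd.
    static final long MAX_GENERATION_DRIFT_SECONDS = 365L * 24 * 3600;

    static boolean acceptGeneration(long proposed, long localEpochSeconds)
    {
        if (proposed > localEpochSeconds + MAX_GENERATION_DRIFT_SECONDS)
        {
            System.err.printf("ignoring generation %d: more than %d seconds ahead of local time%n",
                              proposed, MAX_GENERATION_DRIFT_SECONDS);
            return false;
        }
        return true;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis() / 1000;
        System.out.println(acceptGeneration(now + 60, now));           // true: normal restart
        System.out.println(acceptGeneration(Long.MAX_VALUE / 2, now)); // false: corrupt gossip
    }
}
{code}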
[jira] [Commented] (CASSANDRA-4206) AssertionError: originally calculated column size of 629444349 but now it is 588008950
[ https://issues.apache.org/jira/browse/CASSANDRA-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107031#comment-14107031 ] Richard Low commented on CASSANDRA-4206: The root cause of this in 1.2 is CASSANDRA-7808. AssertionError: originally calculated column size of 629444349 but now it is 588008950 -- Key: CASSANDRA-4206 URL: https://issues.apache.org/jira/browse/CASSANDRA-4206 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.9 Environment: Debian Squeeze Linux, kernel 2.6.32, sun-java6-bin 6.26-0squeeze1 Reporter: Patrik Modesto I have a 4-node cluster of Cassandra 1.0.9. There is a rfTest3 keyspace with RF=3 and one CF with two secondary indexes. I'm importing data into this CF using a Hadoop MapReduce job; each row has less than 10 columns. From JMX: MaxRowSize: 1597 MeanRowSize: 369 And there are some tens of millions of rows. It's write-heavy usage and there is heavy pressure on each node, with quite a few dropped mutations on each node. After ~12 hours of inserting I see these assertion exceptions on 3 out of 4 nodes:
{noformat}
ERROR 06:25:40,124 Fatal exception in thread Thread[HintedHandoff:1,1,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.AssertionError: originally calculated column size of 629444349 but now it is 588008950
at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:388)
at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:256)
at org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:84)
at org.apache.cassandra.db.HintedHandOffManager$3.runMayThrow(HintedHandOffManager.java:437)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: java.lang.AssertionError: originally calculated column size of 629444349 but now it is 588008950
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:384)
... 7 more
Caused by: java.lang.AssertionError: originally calculated column size of 629444349 but now it is 588008950
at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:124)
at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:161)
at org.apache.cassandra.db.compaction.CompactionManager$7.call(CompactionManager.java:380)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
... 3 more
{noformat}
A few lines regarding hints from the output.log:
{noformat}
INFO 06:21:26,202 Compacting large row system/HintsColumnFamily:7000 (1712834057 bytes) incrementally
INFO 06:22:52,610 Compacting large row system/HintsColumnFamily:1000 (2616073981 bytes) incrementally
INFO 06:22:59,111 flushing high-traffic column family CFS(Keyspace='system', ColumnFamily='HintsColumnFamily') (estimated 305147360 bytes)
INFO 06:22:59,813 Enqueuing flush of Memtable-HintsColumnFamily@833933926(3814342/305147360 serialized/live bytes, 7452 ops)
INFO 06:22:59,814 Writing Memtable-HintsColumnFamily@833933926(3814342/305147360 serialized/live bytes, 7452 ops)
{noformat}
I think the problem may be somehow connected to an IntegerType secondary index. I had a different problem with a CF with two secondary indexes, the first UTF8Type, the second IntegerType. After a few hours of inserting data in the afternoon and a midnight repair+compact, the next day I couldn't find any row using the IntegerType secondary index. The output was like this:
{noformat}
[default@rfTest3] get IndexTest where col1 = '3230727:http://zaskolak.cz/download.php';
---
RowKey: 3230727:8383582:http://zaskolak.cz/download.php
=> (column=col1, value=3230727:http://zaskolak.cz/download.php, timestamp=1335348630332000)
=> (column=col2, value=8383582, timestamp=1335348630332000)
---
RowKey:
[jira] [Created] (CASSANDRA-7808) LazilyCompactedRow incorrectly handles row tombstones
Richard Low created CASSANDRA-7808: -- Summary: LazilyCompactedRow incorrectly handles row tombstones Key: CASSANDRA-7808 URL: https://issues.apache.org/jira/browse/CASSANDRA-7808 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low LazilyCompactedRow doesn’t handle row tombstones correctly, leading to an AssertionError (CASSANDRA-4206) in some cases, and the row tombstone being incorrectly dropped in others. It looks like this was introduced by CASSANDRA-5677. To reproduce an AssertionError:
1. Hack a really small return value for DatabaseDescriptor.getInMemoryCompactionLimit() like 10 bytes to force large row compaction
2. Create a column family with gc_grace = 10
3. Insert a few columns in one row
4. Call nodetool flush
5. Delete the row
6. Call nodetool flush
7. Wait 10 seconds
8. Call nodetool compact and it will fail
To reproduce the row tombstone being dropped, do the same except, after the delete (in step 5), insert a column that sorts before the ones you inserted in step 3. E.g. if you inserted b, c, d in step 3, insert a now. After the compaction, which now succeeds, the full row will be visible, rather than just a. The problem is twofold. Firstly, LazilyCompactedRow.Reducer.reduce() and getReduce() incorrectly call container.clear(). This clears the columns (as intended) but also removes the deletion times from container. This means no further columns are deleted if they are annihilated by the row tombstone. Secondly, after the second pass, LazilyCompactedRow.isEmpty() is called which calls {{ColumnFamilyStore.removeDeletedCF(emptyColumnFamily, controller.gcBefore(key.getToken()))}} which unfortunately removes the last deleted time from emptyColumnFamily if it is earlier than gcBefore. Since this is only called after the second pass, the second pass doesn’t remove any columns that are removed by the row tombstone whereas the first pass removes just the first one. This is pretty serious - no large rows can ever be compacted and row tombstones can go missing. -- This message was sent by Atlassian JIRA (v6.2#6252)
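To make the first half concrete, here is a toy container (illustrative names, not Cassandra's ColumnFamily) showing why clearing the deletion info along with the columns breaks tombstone handling across reduce batches:
{code}
import java.util.TreeMap;

public class Container
{
    final TreeMap<String, Long> columns = new TreeMap<>(); // column name -> timestamp
    long deletedAt = Long.MIN_VALUE;                        // row tombstone time

    boolean annihilatedByTombstone(long ts)
    {
        return ts <= deletedAt;
    }

    // Buggy shape: also forgets the row tombstone, so columns in later batches
    // that the tombstone should annihilate survive the compaction.
    void clearEverything()
    {
        columns.clear();
        deletedAt = Long.MIN_VALUE;
    }

    // Fix shape: forget only the emitted columns, keep the deletion info.
    void clearColumnsOnly()
    {
        columns.clear();
    }

    public static void main(String[] args)
    {
        Container c = new Container();
        c.deletedAt = 100;
        c.clearColumnsOnly();
        System.out.println(c.annihilatedByTombstone(50)); // true: tombstone retained
        c.clearEverything();
        System.out.println(c.annihilatedByTombstone(50)); // false: tombstone lost
    }
}
{code}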
[jira] [Updated] (CASSANDRA-7808) LazilyCompactedRow incorrectly handles row tombstones
[ https://issues.apache.org/jira/browse/CASSANDRA-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-7808: --- Attachment: 7808-v1.diff LazilyCompactedRow incorrectly handles row tombstones - Key: CASSANDRA-7808 URL: https://issues.apache.org/jira/browse/CASSANDRA-7808 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Attachments: 7808-v1.diff LazilyCompactedRow doesn’t handle row tombstones correctly, leading to an AssertionError (CASSANDRA-4206) in some cases, and the row tombstone being incorrectly dropped in others. It looks like this was introduced by CASSANDRA-5677. To reproduce an AssertionError:
1. Hack a really small return value for DatabaseDescriptor.getInMemoryCompactionLimit() like 10 bytes to force large row compaction
2. Create a column family with gc_grace = 10
3. Insert a few columns in one row
4. Call nodetool flush
5. Delete the row
6. Call nodetool flush
7. Wait 10 seconds
8. Call nodetool compact and it will fail
To reproduce the row tombstone being dropped, do the same except, after the delete (in step 5), insert a column that sorts before the ones you inserted in step 3. E.g. if you inserted b, c, d in step 3, insert a now. After the compaction, which now succeeds, the full row will be visible, rather than just a. The problem is twofold. Firstly, LazilyCompactedRow.Reducer.reduce() and getReduce() incorrectly call container.clear(). This clears the columns (as intended) but also removes the deletion times from container. This means no further columns are deleted if they are annihilated by the row tombstone. Secondly, after the second pass, LazilyCompactedRow.isEmpty() is called which calls {{ColumnFamilyStore.removeDeletedCF(emptyColumnFamily, controller.gcBefore(key.getToken()))}} which unfortunately removes the last deleted time from emptyColumnFamily if it is earlier than gcBefore. Since this is only called after the second pass, the second pass doesn’t remove any columns that are removed by the row tombstone whereas the first pass removes just the first one. This is pretty serious - no large rows can ever be compacted and row tombstones can go missing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7808) LazilyCompactedRow incorrectly handles row tombstones
[ https://issues.apache.org/jira/browse/CASSANDRA-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104877#comment-14104877 ] Richard Low commented on CASSANDRA-7808: I attached a patch which I think fixes this. LazilyCompactedRow incorrectly handles row tombstones - Key: CASSANDRA-7808 URL: https://issues.apache.org/jira/browse/CASSANDRA-7808 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Fix For: 1.2.19 Attachments: 7808-v1.diff LazilyCompactedRow doesn’t handle row tombstones correctly, leading to an AssertionError (CASSANDRA-4206) in some cases, and the row tombstone being incorrectly dropped in others. It looks like this was introduced by CASSANDRA-5677. To reproduce an AssertionError:
1. Hack a really small return value for DatabaseDescriptor.getInMemoryCompactionLimit() like 10 bytes to force large row compaction
2. Create a column family with gc_grace = 10
3. Insert a few columns in one row
4. Call nodetool flush
5. Delete the row
6. Call nodetool flush
7. Wait 10 seconds
8. Call nodetool compact and it will fail
To reproduce the row tombstone being dropped, do the same except, after the delete (in step 5), insert a column that sorts before the ones you inserted in step 3. E.g. if you inserted b, c, d in step 3, insert a now. After the compaction, which now succeeds, the full row will be visible, rather than just a. The problem is twofold. Firstly, LazilyCompactedRow.Reducer.reduce() and getReduce() incorrectly call container.clear(). This clears the columns (as intended) but also removes the deletion times from container. This means no further columns are deleted if they are annihilated by the row tombstone. Secondly, after the second pass, LazilyCompactedRow.isEmpty() is called which calls {{ColumnFamilyStore.removeDeletedCF(emptyColumnFamily, controller.gcBefore(key.getToken()))}} which unfortunately removes the last deleted time from emptyColumnFamily if it is earlier than gcBefore. Since this is only called after the second pass, the second pass doesn’t remove any columns that are removed by the row tombstone whereas the first pass removes just the first one. This is pretty serious - no large rows can ever be compacted and row tombstones can go missing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7808) LazilyCompactedRow incorrectly handles row tombstones
[ https://issues.apache.org/jira/browse/CASSANDRA-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105018#comment-14105018 ] Richard Low commented on CASSANDRA-7808: Sorry, I got my scope wrong. The AssertionError is likely caused by CASSANDRA-5677 but the clearing in the Reducer existed a long time before that. LazilyCompactedRow incorrectly handles row tombstones - Key: CASSANDRA-7808 URL: https://issues.apache.org/jira/browse/CASSANDRA-7808 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Fix For: 1.2.19 Attachments: 7808-v1.diff LazilyCompactedRow doesn’t handle row tombstones correctly, leading to an AssertionError (CASSANDRA-4206) in some cases, and the row tombstone being incorrectly dropped in others. It looks like this was introduced by CASSANDRA-5677. To reproduce an AssertionError:
1. Hack a really small return value for DatabaseDescriptor.getInMemoryCompactionLimit() like 10 bytes to force large row compaction
2. Create a column family with gc_grace = 10
3. Insert a few columns in one row
4. Call nodetool flush
5. Delete the row
6. Call nodetool flush
7. Wait 10 seconds
8. Call nodetool compact and it will fail
To reproduce the row tombstone being dropped, do the same except, after the delete (in step 5), insert a column that sorts before the ones you inserted in step 3. E.g. if you inserted b, c, d in step 3, insert a now. After the compaction, which now succeeds, the full row will be visible, rather than just a. The problem is twofold. Firstly, LazilyCompactedRow.Reducer.reduce() and getReduce() incorrectly call container.clear(). This clears the columns (as intended) but also removes the deletion times from container. This means no further columns are deleted if they are annihilated by the row tombstone. Secondly, after the second pass, LazilyCompactedRow.isEmpty() is called which calls {{ColumnFamilyStore.removeDeletedCF(emptyColumnFamily, controller.gcBefore(key.getToken()))}} which unfortunately removes the last deleted time from emptyColumnFamily if it is earlier than gcBefore. Since this is only called after the second pass, the second pass doesn’t remove any columns that are removed by the row tombstone whereas the first pass removes just the first one. This is pretty serious - no large rows can ever be compacted and row tombstones can go missing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7663) Removing a seed causes previously removed seeds to reappear
[ https://issues.apache.org/jira/browse/CASSANDRA-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086574#comment-14086574 ] Richard Low commented on CASSANDRA-7663: +1 on v2, thanks! Removing a seed causes previously removed seeds to reappear --- Key: CASSANDRA-7663 URL: https://issues.apache.org/jira/browse/CASSANDRA-7663 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Brandon Williams Fix For: 1.2.19, 2.0.10, 2.1.0 Attachments: 7663-v2.txt, 7663.txt When you remove a seed from a cluster, Gossiper.removeEndpoint ensures it is removed from the seed list. However, it also resets the seed list to be the original list, which would bring back any previously removed seeds. What is the reasoning for having the call to buildSeedsList()? If it wasn’t there then I think the problem would be solved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7663) Removing a seed causes previously removed seeds to reappear
[ https://issues.apache.org/jira/browse/CASSANDRA-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082613#comment-14082613 ] Richard Low commented on CASSANDRA-7663: Thanks! Removing a seed causes previously removed seeds to reappear --- Key: CASSANDRA-7663 URL: https://issues.apache.org/jira/browse/CASSANDRA-7663 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Brandon Williams Fix For: 1.2.19, 2.0.10, 2.1.0 Attachments: 7663.txt When you remove a seed from a cluster, Gossiper.removeEndpoint ensures it is removed from the seed list. However, it also resets the seed list to be the original list, which would bring back any previously removed seeds. What is the reasoning for having the call to buildSeedsList()? If it wasn’t there then I think the problem would be solved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7663) Removing a seed causes previously removed seeds to reappear
[ https://issues.apache.org/jira/browse/CASSANDRA-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14083258#comment-14083258 ] Richard Low commented on CASSANDRA-7663: Actually there's a potential problem with this. It now requires that the yaml is still present, whereas before this patch the yaml was only needed on startup. Depending on how people deploy updated yamls, the old yaml may not be there. In that case, it would throw AssertionError and bad things will happen in gossip. Maybe it could fall back on the original behaviour if the yaml can't be read? Removing a seed causes previously removed seeds to reappear --- Key: CASSANDRA-7663 URL: https://issues.apache.org/jira/browse/CASSANDRA-7663 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Brandon Williams Fix For: 1.2.19, 2.0.10, 2.1.0 Attachments: 7663.txt When you remove a seed from a cluster, Gossiper.removeEndpoint ensures it is removed from the seed list. However, it also resets the seed list to be the original list, which would bring back any previously removed seeds. What is the reasoning for having the call to buildSeedsList()? If it wasn’t there then I think the problem would be solved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7663) Removing a seed causes previously removed seeds to reappear
Richard Low created CASSANDRA-7663: -- Summary: Removing a seed causes previously removed seeds to reappear Key: CASSANDRA-7663 URL: https://issues.apache.org/jira/browse/CASSANDRA-7663 Project: Cassandra Issue Type: Bug Reporter: Richard Low When you remove a seed from a cluster, Gossiper.removeEndpoint ensures it is removed from the seed list. However, it also resets the seed list to be the original list, which would bring back any previously removed seeds. What is the reasoning for having the call to buildSeedsList()? If it wasn’t there then I think the problem would be solved. -- This message was sent by Atlassian JIRA (v6.2#6252)
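The fix shape being argued for, as an illustrative sketch (not Gossiper's actual code): removal mutates the live seed set, and nothing inside removeEndpoint re-reads the configured list.
{code}
import java.net.InetAddress;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SeedList
{
    private final Set<InetAddress> seeds = ConcurrentHashMap.newKeySet();

    public SeedList(Set<InetAddress> configured)
    {
        seeds.addAll(configured); // read the configuration once, at startup
    }

    public void removeEndpoint(InetAddress ep)
    {
        // Remove only. Rebuilding from the original list here (the
        // buildSeedsList() call in question) would resurrect every seed
        // removed since startup.
        seeds.remove(ep);
    }

    public boolean isSeed(InetAddress ep)
    {
        return seeds.contains(ep);
    }

    public static void main(String[] args) throws Exception
    {
        SeedList list = new SeedList(Set.of(InetAddress.getByName("127.0.0.1"),
                                            InetAddress.getByName("127.0.0.2")));
        list.removeEndpoint(InetAddress.getByName("127.0.0.2"));
        System.out.println(list.isSeed(InetAddress.getByName("127.0.0.2"))); // false, and stays false
    }
}
{code}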
[jira] [Created] (CASSANDRA-7591) Add read and write metrics for each consistency level
Richard Low created CASSANDRA-7591: -- Summary: Add read and write metrics for each consistency level Key: CASSANDRA-7591 URL: https://issues.apache.org/jira/browse/CASSANDRA-7591 Project: Cassandra Issue Type: Improvement Reporter: Richard Low It would be helpful to have read and write metrics for each consistency level to help clients track query rates per consistency level. It's quite common to forget to set e.g. a config param in the client and use the wrong consistency level. Right now the only way to find out from Cassandra is to use tracing. It would also be helpful to track latencies for different consistency levels to estimate the cost of e.g. switching from LOCAL_QUORUM to QUORUM reads. -- This message was sent by Atlassian JIRA (v6.2#6252)
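A minimal sketch of what such metrics could look like using only the JDK; the ConsistencyLevel enum below is a stand-in for Cassandra's, and the counter layout is an assumption rather than the implementation that eventually landed:
{code}
import java.util.EnumMap;
import java.util.concurrent.atomic.LongAdder;

public class PerCLMetrics
{
    enum ConsistencyLevel { ONE, LOCAL_QUORUM, QUORUM, ALL }

    private final EnumMap<ConsistencyLevel, LongAdder> reads = new EnumMap<>(ConsistencyLevel.class);

    public PerCLMetrics()
    {
        for (ConsistencyLevel cl : ConsistencyLevel.values())
            reads.put(cl, new LongAdder()); // one cheap concurrent counter per level
    }

    public void markRead(ConsistencyLevel cl)
    {
        reads.get(cl).increment(); // called on the coordinator per read request
    }

    public static void main(String[] args)
    {
        PerCLMetrics m = new PerCLMetrics();
        m.markRead(ConsistencyLevel.LOCAL_QUORUM);
        m.markRead(ConsistencyLevel.QUORUM);
        m.markRead(ConsistencyLevel.QUORUM);
        m.reads.forEach((cl, n) -> System.out.println(cl + ": " + n.sum()));
    }
}
{code}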
[jira] [Updated] (CASSANDRA-7591) Add per consistency level read and write metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low updated CASSANDRA-7591: --- Summary: Add per consistency level read and write metrics (was: Add read and write metrics for each consistency level) Add per consistency level read and write metrics Key: CASSANDRA-7591 URL: https://issues.apache.org/jira/browse/CASSANDRA-7591 Project: Cassandra Issue Type: Improvement Reporter: Richard Low It would be helpful to have read and write metrics for each consistency level to help clients track query rates per consistency level. It's quite common to forget to set e.g. a config param in the client and use the wrong consistency level. Right now the only way to find out from Cassandra is to use tracing. It would also be helpful to track latencies for different consistency levels to estimate the cost of e.g. switching from LOCAL_QUORUM to QUORUM reads. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (CASSANDRA-7591) Add per consistency level read and write metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Low resolved CASSANDRA-7591. Resolution: Duplicate Dupe of CASSANDRA-7384. Add per consistency level read and write metrics Key: CASSANDRA-7591 URL: https://issues.apache.org/jira/browse/CASSANDRA-7591 Project: Cassandra Issue Type: Improvement Reporter: Richard Low It would be helpful to have read and write metrics for each consistency level to help clients track query rates per consistency level. It's quite common to forget to set e.g. a config param in the client and use the wrong consistency level. Right now the only way to find out from Cassandra is to use tracing. It would also be helpful to track latencies for different consistency levels to estimate the cost of e.g. switching from LOCAL_QUORUM to QUORUM reads. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7591) Add per consistency level read and write metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070941#comment-14070941 ] Richard Low commented on CASSANDRA-7591: And I've just found CASSANDRA-7384. Sorry for the noise. Add per consistency level read and write metrics Key: CASSANDRA-7591 URL: https://issues.apache.org/jira/browse/CASSANDRA-7591 Project: Cassandra Issue Type: Improvement Reporter: Richard Low It would be helpful to have read and write metrics for each consistency level to help clients track query rates per consistency level. It's quite common to forget to set e.g. a config param in the client and use the wrong consistency level. Right now the only way to find out from Cassandra is to use tracing. It would also be helpful to track latencies for different consistency levels to estimate the cost of e.g. switching from LOCAL_QUORUM to QUORUM reads. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7384) Collect metrics on queries by consistency level
[ https://issues.apache.org/jira/browse/CASSANDRA-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070960#comment-14070960 ] Richard Low commented on CASSANDRA-7384: Having relative counts per consistency level helps clients to check rates are as they expect. Also having the relative latencies would help to estimate the cost/benefit of changing consistency levels. I think it would be helpful. Collect metrics on queries by consistency level --- Key: CASSANDRA-7384 URL: https://issues.apache.org/jira/browse/CASSANDRA-7384 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Vishy Kasar Assignee: sankalp kohli Priority: Minor We had cases where cassandra client users thought that they were doing queries at one consistency level but turned out to be not correct. It will be good to collect metrics on number of queries done at various consistency level on the server. See the equivalent JIRA on java driver: https://datastax-oss.atlassian.net/browse/JAVA-354 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-5901) Bootstrap should also make the data consistent on the new node
[ https://issues.apache.org/jira/browse/CASSANDRA-5901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070988#comment-14070988 ] Richard Low commented on CASSANDRA-5901: In particular, host replacement can violate consistency, even with CASSANDRA-2434. For example, if you always do quorum reads and writes and have replicas A, B, C, you'll always read the latest value, even if any one replica has missed a write. Suppose A did miss a write and then B fails and is replaced by D. D chooses where to stream from - there is no 'right' answer - so it can stream from A. Now A and D have old values and only C has the latest value. A quorum read that chooses A and D will give back stale data and violate expected consistency. If a repair was run on A and C after B had failed but before it was replaced with D, the consistency problem is eliminated. Bootstrap should also make the data consistent on the new node -- Key: CASSANDRA-5901 URL: https://issues.apache.org/jira/browse/CASSANDRA-5901 Project: Cassandra Issue Type: Improvement Components: Core Reporter: sankalp kohli Priority: Minor Currently when we are bootstrapping a new node, it might bootstrap from a node which does not have the most up-to-date data. Because of this, we need to run a repair after that. Most people will always run the repair, so it would help if we could provide a parameter to bootstrap to run the repair once the bootstrap has finished. It could also stop the node from responding to reads until the repair has finished. This could be another param as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7592) Ownership changes can violate consistency
Richard Low created CASSANDRA-7592: -- Summary: Ownership changes can violate consistency Key: CASSANDRA-7592 URL: https://issues.apache.org/jira/browse/CASSANDRA-7592 Project: Cassandra Issue Type: Improvement Reporter: Richard Low CASSANDRA-2434 goes a long way to avoiding consistency violations when growing a cluster. However, there is still a window when consistency can be violated when switching ownership of a range. Suppose you have replication factor 3 and all reads and writes at quorum. The first part of the ring looks like this:
Z: 0
A: 100
B: 200
C: 300
Choose two random coordinators, C1 and C2. Then you bootstrap node X at token 50. Consider the token range 0-50. Before bootstrap, this is stored on A, B, C. During bootstrap, writes go to X, A, B, C (and must succeed on 3) and reads choose two from A, B, C. After bootstrap, the range is on X, A, B. When the bootstrap completes, suppose C1 processes the ownership change at t1 and C2 at t4. Then the following can give an inconsistency:
t1: C1 switches ownership.
t2: C1 performs a write, so sends the write to X, A, B. A is busy and drops the write, but it succeeds because X and B return.
t3: C2 performs a read. It hasn’t done the switch and chooses A and C. Neither got the write at t2 so null is returned.
t4: C2 switches ownership.
This could be solved by continuing writes to the old replica for some time (maybe ring delay) after the ownership changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
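A sketch of that mitigation with illustrative names (this is not Cassandra's pending-range machinery): coordinators keep the outgoing replica in the write set for a grace period after the switch, so a coordinator that still reads from the old replica set finds a quorum that intersects recent writes. Extra writes to the old replica are harmless.
{code}
import java.util.ArrayList;
import java.util.List;

public class WriteTargets
{
    static final long RING_DELAY_MS = 30_000; // illustrative grace period

    static List<String> writeTargets(List<String> current, List<String> outgoing,
                                     long switchedAtMs, long nowMs)
    {
        List<String> targets = new ArrayList<>(current);
        if (nowMs - switchedAtMs < RING_DELAY_MS)
            for (String old : outgoing)
                if (!targets.contains(old))
                    targets.add(old); // keep writing to the outgoing replica for a while
        return targets;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis();
        // Range 0-50 just moved from (A, B, C) to (X, A, B); C keeps receiving
        // writes until the grace period ends.
        System.out.println(writeTargets(List.of("X", "A", "B"), List.of("C"), now, now));
        System.out.println(writeTargets(List.of("X", "A", "B"), List.of("C"), now - 60_000, now));
    }
}
{code}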
[jira] [Commented] (CASSANDRA-6751) Setting -Dcassandra.fd_initial_value_ms Results in NPE
[ https://issues.apache.org/jira/browse/CASSANDRA-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068158#comment-14068158 ] Richard Low commented on CASSANDRA-6751: This issue also affects the 2.0 branch and was fixed in 2.0.8. Setting -Dcassandra.fd_initial_value_ms Results in NPE -- Key: CASSANDRA-6751 URL: https://issues.apache.org/jira/browse/CASSANDRA-6751 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Dave Brosius Priority: Minor Fix For: 1.2.17 Attachments: 6751.txt Start Cassandra with {{-Dcassandra.fd_initial_value_ms=1000}} and you'll get the following stacktrace:
{noformat}
INFO [main] 2014-02-21 14:45:57,731 StorageService.java (line 617) Starting up server gossip
ERROR [main] 2014-02-21 14:45:57,736 CassandraDaemon.java (line 464) Exception encountered during startup
java.lang.ExceptionInInitializerError
at org.apache.cassandra.gms.Gossiper.<init>(Gossiper.java:178)
at org.apache.cassandra.gms.Gossiper.<clinit>(Gossiper.java:71)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:618)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:583)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:480)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
Caused by: java.lang.NullPointerException
at org.apache.cassandra.gms.FailureDetector.getInitialValue(FailureDetector.java:81)
at org.apache.cassandra.gms.FailureDetector.<clinit>(FailureDetector.java:48)
... 8 more
ERROR [StorageServiceShutdownHook] 2014-02-21 14:45:57,754 CassandraDaemon.java (line 191) Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NoClassDefFoundError: Could not initialize class org.apache.cassandra.gms.Gossiper
at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:550)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:724)
{noformat}
Glancing at the code, this is because the FailureDetector logger isn't initialized when the static initialization of {{INITIAL_VALUE}} happens. -- This message was sent by Atlassian JIRA (v6.2#6252)
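The bug class is easy to reproduce outside Cassandra: Java runs static field initializers in declaration order, so a static method invoked by an earlier field's initializer sees later static fields as null. A minimal reproduction (java.util.logging stands in for the real logger); the fix is to declare the logger before any field whose initializer uses it, or to fetch it lazily:
{code}
public class StaticInitOrder
{
    // This initializer runs first (declaration order) while LOGGER is still null.
    static final long INITIAL_VALUE = getInitialValue();

    static final java.util.logging.Logger LOGGER =
        java.util.logging.Logger.getLogger(StaticInitOrder.class.getName());

    static long getInitialValue()
    {
        LOGGER.info("computing initial value"); // NullPointerException here
        return 1000;
    }

    public static void main(String[] args)
    {
        // Never reached: class initialization throws ExceptionInInitializerError,
        // mirroring the stacktrace in the ticket.
        System.out.println(INITIAL_VALUE);
    }
}
{code}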
[jira] [Commented] (CASSANDRA-7307) New nodes mark dead nodes as up for 10 minutes
[ https://issues.apache.org/jira/browse/CASSANDRA-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14036328#comment-14036328 ] Richard Low commented on CASSANDRA-7307: bq. For bootstrap? Let me be clear, the problem with replace is not related to streaming. It's refusing to replace a live node, because the FD takes so long to report it as down upon first discovery. Actually, most of the time the problem is streaming. It is happy during replacement (which surprises me, since it clearly lists it as UP), but it then requests to stream from the dead node, which fails. We've seen this where it happily streams from other nodes, but then ultimately fails because the stream from the dead node fails. However, we also see a problem where it fails to replace because it thinks the node is live. This happens less often but I expect it has the same root cause. New nodes mark dead nodes as up for 10 minutes -- Key: CASSANDRA-7307 URL: https://issues.apache.org/jira/browse/CASSANDRA-7307 Project: Cassandra Issue Type: Bug Reporter: Richard Low Assignee: Brandon Williams Fix For: 1.2.17, 2.0.9, 2.1 rc2 When doing a node replacement while other nodes are down, we see the down nodes marked as up for about 10 minutes. This means requests are routed to the dead nodes, causing timeouts. It also means replacing a node when multiple nodes from a replica set are down is extremely difficult - the node usually tries to stream from a dead node and the replacement fails. This isn't limited to host replacement. I did a simple test:
1. Create a 2-node cluster
2. Kill node 2
3. Start a 3rd node with a unique token (I used auto_bootstrap=false but I don't think this is significant)
The 3rd node lists node 2 (127.0.0.2) as up for almost 10 minutes:
{code}
INFO [main] 2014-05-27 14:28:24,753 CassandraDaemon.java (line 119) Logging initialized
INFO [GossipStage:1] 2014-05-27 14:28:31,492 Gossiper.java (line 843) Node /127.0.0.2 is now part of the cluster
INFO [GossipStage:1] 2014-05-27 14:28:31,495 Gossiper.java (line 809) InetAddress /127.0.0.2 is now UP
INFO [GossipTasks:1] 2014-05-27 14:37:44,526 Gossiper.java (line 823) InetAddress /127.0.0.2 is now DOWN
{code}
I reproduced on 1.2.15 and 1.2.16. -- This message was sent by Atlassian JIRA (v6.2#6252)