Re: Re[6]: Modify keyspace replication strategy and rebalance the nodes

2017-09-18 Thread kurt greaves
I haven't completely thought this through, so don't just go ahead and do
it; definitely test first. Also, if anyone sees something terribly wrong,
don't be afraid to say so.

Seeing as you're only using SimpleStrategy, which doesn't care about racks,
you could change to SimpleSnitch, or to GossipingPropertyFileSnitch with just
1 rack, by changing the snitch/rack details and setting
-Dcassandra.ignore_rack=true.

Do this for all nodes until you have 40 nodes in 1 rack, then change to NTS
with RF=3. Then do a DC migration to a new DC with the correct
configuration of racks (ideally >=3 racks).

Unfortunately DC migration is currently the only way to change RF and rack
topology without downtime.
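
As a rough sketch of the per-node change (assuming GossipingPropertyFileSnitch,
with placeholder DC/keyspace names; test on a throwaway cluster first):

  # cassandra.yaml
  endpoint_snitch: GossipingPropertyFileSnitch

  # conf/cassandra-rackdc.properties -- the same dc/rack values on every node
  dc=dc1
  rack=rack1

  # conf/cassandra-env.sh -- let the node accept the rack change on restart
  JVM_OPTS="$JVM_OPTS -Dcassandra.ignore_rack=true"

Once all nodes report the same rack, the keyspace could then be switched to NTS
(the DC name must match whatever nodetool status reports):

  cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication =
    {'class': 'NetworkTopologyStrategy', 'dc1': 3};"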


Re: ConsitencyLevel and Mutations : Behaviour if the update of the commitlog fails

2017-09-18 Thread kurt greaves
> ​Does the coordinator "cancel" the mutation on the "committed" nodes (and
> how)?

No. Those mutations are applied on those nodes.

>  Is it a heuristic case where two nodes have the data whereas they
> shouldn't, and we hope that HintedHandoff will replay the mutation?

Yes. But really you should make sure you recover from this error in your
client. Hinted handoff might work, but you have no way of knowing whether it
has taken place, so if ALL is important you should retry/resolve the failed
query accordingly.


Re: Multi-node repair fails after upgrading to 3.0.14

2017-09-18 Thread kurt greaves
https://issues.apache.org/jira/browse/CASSANDRA-13153 implies that full
repairs still trigger anti-compaction on non-repaired SSTables (if I'm
reading that right), so you might need to make sure you don't run multiple
repairs at the same time across your nodes (if you're using vnodes);
otherwise you could still end up trying to run anti-compaction on the same
SSTable from two repairs.

Anyone else feel free to jump in and correct me if my interpretation is
wrong.
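
If it helps, a minimal way to serialise those repairs across the ring (host,
keyspace and table names below are placeholders):

  for host in node1 node2 node3; do
    ssh "$host" nodetool repair --full -pr my_keyspace my_table
  done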

On 18 September 2017 at 17:11, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Jeff,
>
>
>
> what should be the expected outcome when running with 3.0.14:
>
>
>
> nodetool repair --full -pr keyspace cfs
>
>
>
> - Should --full trigger anti-compaction?
>
> - Should this be the same operation as nodetool repair -pr
> keyspace cfs in 2.1?
>
> - Should I be able to run this on several nodes in parallel as
> in 2.1 without troubles, where incremental repair was not the default?
>
>
>
> Still confused if I’m missing something obvious. Sorry about that. :)
>
>
>
> Thanks,
>
> Thomas
>
>
>
> *From:* Jeff Jirsa [mailto:jji...@gmail.com]
> *Sent:* Montag, 18. September 2017 16:10
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Multi-node repair fails after upgrading to 3.0.14
>
>
>
> Sorry I may be wrong about the cause - didn't see -full
>
>
>
> Mea culpa, it's early here and I'm not awake
>
>
>
> --
>
> Jeff Jirsa
>
>
>
>
> On Sep 18, 2017, at 7:01 AM, Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
> Hi Jeff,
>
>
>
> understood. That’s quite a change then coming from 2.1 from an operational
> POV.
>
>
>
> Thanks again.
>
>
>
> Thomas
>
>
>
> *From:* Jeff Jirsa [mailto:jji...@gmail.com ]
> *Sent:* Montag, 18. September 2017 15:56
> *To:* user@cassandra.apache.org
> *Subject:* Re: Multi-node repair fails after upgrading to 3.0.14
>
>
>
> The command you're running will cause anticompaction and the range borders
> for all instances at the same time
>
>
>
> Since only one repair session can anticompact any given sstable, it's
> almost guaranteed to fail
>
>
>
> Run it on one instance at a time
>
>
>
> --
>
> Jeff Jirsa
>
>
>
>
> On Sep 18, 2017, at 1:11 AM, Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
> Hi Alex,
>
>
>
> I now ran nodetool repair --full -pr keyspace cfs on all nodes in parallel
> and this may pop up now:
>
>
>
> 0.176.38.128 (progress: 1%)
>
> [2017-09-18 07:59:17,145] Some repair failed
>
> [2017-09-18 07:59:17,151] Repair command #3 finished in 0 seconds
>
> error: Repair job has failed with the error message: [2017-09-18
> 07:59:17,145] Some repair failed
>
> -- StackTrace --
>
> java.lang.RuntimeException: Repair job has failed with the error message:
> [2017-09-18 07:59:17,145] Some repair failed
>
> at org.apache.cassandra.tools.RepairRunner.progress(
> RepairRunner.java:115)
>
> at org.apache.cassandra.utils.progress.jmx.
> JMXNotificationProgressListener.handleNotification(
> JMXNotificationProgressListener.java:77)
>
> at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.
> dispatchNotification(ClientNotifForwarder.java:583)
>
> at com.sun.jmx.remote.internal.ClientNotifForwarder$
> NotifFetcher.doRun(ClientNotifForwarder.java:533)
>
> at com.sun.jmx.remote.internal.ClientNotifForwarder$
> NotifFetcher.run(ClientNotifForwarder.java:452)
>
> at com.sun.jmx.remote.internal.ClientNotifForwarder$
> LinearExecutor$1.run(ClientNotifForwarder.java:108)
>
>
>
> 2017-09-18 07:59:17 repair finished
>
>
>
>
>
> If running the above nodetool call sequentially on all nodes, repair
> finishes without printing a stack trace.
>
>
>
> The error message and stack trace isn’t really useful here. Any further
> ideas/experiences?
>
>
>
> Thanks,
>
> Thomas
>
>
>
> *From:* Alexander Dejanovski [mailto:a...@thelastpickle.com
> ]
> *Sent:* Freitag, 15. September 2017 11:30
> *To:* user@cassandra.apache.org
> *Subject:* Re: Multi-node repair fails after upgrading to 3.0.14
>
>
>
> Right, you should indeed add the "--full" flag to perform full repairs,
> and you can then keep the "-pr" flag.
>
>
>
> I'd advise to monitor the status of your SSTables as you'll probably end
> up with a pool of SSTables marked as repaired, and another pool marked as
> unrepaired which won't be compacted together (hence the suggestion of
> running subrange repairs).
>
> Use sstablemetadata to check on the "Repaired at" value for each. 0 means
> unrepaired and any other value (a timestamp) means the SSTable has been
> repaired.
>
> I've had behaviors in the past where running "-pr" on the whole cluster
> would still not mark all SSTables as repaired, but I can't say if that
> behavior has changed in latest versions.
>
>
>
> Having separate pools of SStables that cannot be compacted means that you
> might have tombstones that don't get evicted due to partitions living in
> both states (repaired/unrepaired).
>
>
>

Re: Modify keyspace replication strategy and rebalance the nodes

2017-09-18 Thread Jeff Jirsa
No worries, that makes both of us; my first contribution to this thread was
similarly going too fast and trying to remember things I don't use often (I
originally thought SimpleStrategy would consult the EC2 snitch, but it
doesn't).

- Jeff

On Mon, Sep 18, 2017 at 1:56 PM, Jon Haddad wrote:

> Sorry, you’re right.  This is what happens when you try to do two things
> at once.  Google too quickly, look like an idiot.  Thanks for the
> correction.
>
>
> On Sep 18, 2017, at 1:37 PM, Jeff Jirsa  wrote:
>
> For what it's worth, the problem isn't the snitch, it's the replication
> strategy - he's using the right snitch but SimpleStrategy ignores it
>
> That's the same reason that adding a new DC doesn't work - the replication
> strategy is DC agnostic and changing it safely IS the problem
>
>
>
> --
> Jeff Jirsa
>
>
> On Sep 18, 2017, at 11:46 AM, Jon Haddad 
> wrote:
>
> For those of you who like trivia, SimpleSnitch is hard coded to report
> every node as being in DC “datacenter1” and rack “rack1”; there’s no way
> around it.  https://github.com/apache/cassandra/blob/
> 8b3a60b9a7dbefeecc06bace617279612ec7092d/src/java/org/
> apache/cassandra/locator/SimpleSnitch.java#L28
>
> I would do this by setting up a new DC; trying to do it with the existing
> one is going to leave you in a state where most queries will return
> incorrect results (2/3 of queries at ONE and 1/2 of queries at QUORUM)
> until you finish repair.
>
> On Sep 18, 2017, at 11:41 AM, Jeff Jirsa  wrote:
>
> The hard part here is nobody's going to be able to tell you exactly what's
> involved in fixing this because nobody sees your ring
>
> And since you're using vnodes and have a nontrivial number of instances,
> sharing that ring (and doing anything actionable with it) is nontrivial.
>
> If you weren't using vnodes, you could just fix the distribution and decom
> extra nodes afterward.
>
> I thought - but don't have time or energy to check - that the ec2snitch
> would be rack aware even when using simple strategy - if that's not the
> case (as you seem to indicate), then you're in a weird spot - you can't go
> to NTS trivially because doing so will reassign your replicas to be rack/AZ
> aware, certainly violating your consistency guarantees.
>
> If you can change your app to temporarily write with ALL and read with
> ALL, and then run repair, then immediately ALTER the keyspace, then run
> repair again, then drop back to whatever consistency you're using, you can
> probably get through it. The challenge is that ALL gets painful if you lose
> any instance.
>
> But please test in a lab, and note that this is inherently dangerous, I'm
> not advising you to do it, though I do believe it can be made to work.
>
>
>
>
>
> --
> Jeff Jirsa
>
>
> On Sep 18, 2017, at 11:18 AM, Dominik Petrovic  INVALID> wrote:
>
> @jeff what do you think is the best approach here to fix this problem?
> Thank you all for helping me.
>
>
> Thursday, September 14, 2017 3:28 PM -07:00 from kurt greaves <
> k...@instaclustr.com>:
>
> Sorry, that only applies if you're using NTS. You're right that simple
> strategy won't work very well in this case. To migrate you'll likely need
> to do a DC migration to ensure no downtime, as replica placement will
> change even if RF stays the same.
>
> On 15 Sep. 2017 08:26, "kurt greaves"  wrote:
>
> If you have racks configured and lose nodes you should replace the node
> with one from the same rack. You then need to repair, and definitely don't
> decommission until you do.
>
> Also 40 nodes with 256 vnodes is not a fun time for repair.
>
> On 15 Sep. 2017 03:36, "Dominik Petrovic" 
> wrote:
>
> @jeff,
> I'm using 3 availability zones, during the life of the cluster we lost
> nodes, retired others and we end up having some of the data
> written/replicated on a single availability zone. We saw it with nodetool
> getendpoints.
> Regards
>
>
> Thursday, September 14, 2017 9:23 AM -07:00 from Jeff Jirsa <
> jji...@gmail.com>:
>
> With one datacenter/region, what did you discover in an outage you think
> you'll solve with network topology strategy? It should be equivalent for a
> single D.C.
>
> --
> Jeff Jirsa
>
>
> On Sep 14, 2017, at 8:47 AM, Dominik Petrovic  INVALID> wrote:
>
> Thank you for the replies!
>
> @jeff my current cluster details are:
> 1 datacenter
> 40 nodes, with vnodes=256
> RF=3
> What is your advice? is it a production cluster, so I need to be very
> careful about it.
> Regards
>
>
> Thu, 14 Sep 2017 -2:47:52 -0700 from Jeff Jirsa :
>
> The token distribution isn't going to change - the way Cassandra maps
> replicas will change.
>
> How many data centers/regions will you have when you're done? What's your
> RF now? You definitely need to run repair before you ALTER, but you've got
> a bit of a race here 

Re: Re[6]: Modify keyspace replication strategy and rebalance the nodes

2017-09-18 Thread Jeff Jirsa
Using CL:ALL basically forces you to always include the first replica in
the query.

The first replica will be the same for both SimpleStrategy/SimpleSnitch and
NetworkTopologyStrategy/EC2Snitch.

It's basically the only way we can guarantee we're not going to lose a row
that was only written to the second and third replicas while the first
replica was down, in case the second and third replicas change to different
hosts (racks / availability zones) during the ALTER.
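
For spot checks while the migration is in flight, the same read-at-ALL idea can
be exercised from cqlsh against a few known partitions (keyspace, table and key
below are placeholders):

  cqlsh> CONSISTENCY ALL;
  cqlsh> SELECT * FROM my_keyspace.my_table WHERE pk = 'some-key';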






On Mon, Sep 18, 2017 at 1:57 PM, Myron A. Semack wrote:

> How would setting the consistency to ALL help?  Wouldn’t that just cause
> EVERY read/write to fail after the ALTER until the repair is complete?
>
>
>
> Sincerely,
>
> Myron A. Semack
>
>
>
> *From:* Jeff Jirsa [mailto:jji...@gmail.com]
> *Sent:* Monday, September 18, 2017 2:42 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Re[6]: Modify keyspace replication strategy and rebalance
> the nodes
>
>
>
> The hard part here is nobody's going to be able to tell you exactly what's
> involved in fixing this because nobody sees your ring
>
>
>
> And since you're using vnodes and have a nontrivial number of instances,
> sharing that ring (and doing anything actionable with it) is nontrivial.
>
>
>
> If you weren't using vnodes, you could just fix the distribution and decom
> extra nodes afterward.
>
>
>
> I thought - but don't have time or energy to check - that the ec2snitch
> would be rack aware even when using simple strategy - if that's not the
> case (as you seem to indicate), then you're in a weird spot - you can't go
> to NTS trivially because doing so will reassign your replicas to be rack/as
> aware, certainly violating your consistency guarantees.
>
>
>
> If you can change your app to temporarily write with ALL and read with
> ALL, and then run repair, then immediately ALTER the keyspace, then run
> repair again, then drop back to whatever consistency you're using, you can
> probably get through it. The challenge is that ALL gets painful if you lose
> any instance.
>
>
>
> But please test in a lab, and note that this is inherently dangerous, I'm
> not advising you to do it, though I do believe it can be made to work.
>
>
>
>
>
>
>
>
>
>
> --
>
> Jeff Jirsa
>
>
>
>
> On Sep 18, 2017, at 11:18 AM, Dominik Petrovic  INVALID> wrote:
>
> @jeff what do you think is the best approach here to fix this problem?
> Thank you all for helping me.
>
> Thursday, September 14, 2017 3:28 PM -07:00 from kurt greaves <
> k...@instaclustr.com>:
>
> Sorry that only applies our you're using NTS. You're right that simple
> strategy won't work very well in this case. To migrate you'll likely need
> to do a DC migration to ensuite no downtime, as replica placement will
> change even if RF stays the same.
>
>
>
> On 15 Sep. 2017 08:26, "kurt greaves"  wrote:
>
> If you have racks configured and lose nodes you should replace the node
> with one from the same rack. You then need to repair, and definitely don't
> decommission until you do.
>
>
>
> Also 40 nodes with 256 vnodes is not a fun time for repair.
>
>
>
> On 15 Sep. 2017 03:36, "Dominik Petrovic" 
> wrote:
>
> @jeff,
> I'm using 3 availability zones, during the life of the cluster we lost
> nodes, retired others and we end up having some of the data
> written/replicated on a single availability zone. We saw it with nodetool
> getendpoints.
> Regards
>
> Thursday, September 14, 2017 9:23 AM -07:00 from Jeff Jirsa <
> jji...@gmail.com>:
>
> With one datacenter/region, what did you discover in an outage you think
> you'll solve with network topology strategy? It should be equivalent for a
> single D.C.
>
> --
>
> Jeff Jirsa
>
>
>
>
> On Sep 14, 2017, at 8:47 AM, Dominik Petrovic  INVALID> wrote:
>
> Thank you for the replies!
>
> @jeff my current cluster details are:
> 1 datacenter
> 40 nodes, with vnodes=256
> RF=3
> What is your advice? is it a production cluster, so I need to be very
> careful about it.
> Regards
>
> Thu, 14 Sep 2017 -2:47:52 -0700 from Jeff Jirsa :
>
> The token distribution isn't going to change - the way Cassandra maps
> replicas will change.
>
>
>
> How many data centers/regions will you have when you're done? What's your
> RF now? You definitely need to run repair before you ALTER, but you've got
> a bit of a race here between the repairs and the ALTER, which you MAY be
> able to work around if we know more about your cluster.
>
>
>
> How many nodes
>
> How many regions
>
> How many replicas per region when you're done?
>
>
>
>
>
>
>
> --
>
> Jeff Jirsa
>
>
>
>
> On Sep 13, 2017, at 2:04 PM, Dominik Petrovic  INVALID> wrote:
>
> Dear community,
> I'd like to receive additional info on how to modify a keyspace
> replication strategy.
>
> My Cassandra cluster is on AWS, Cassandra 2.1.15 using vnodes, the
> cluster's snitch is configured to 

ConsitencyLevel and Mutations : Behaviour if the update of the commitlog fails

2017-09-18 Thread Leleu Eric
Hi Cassandra users,


I have a question about the ConsistencyLevel and the MUTATION operation.
According to the write path documentation, the first action executed by a
replica node is to write the mutation into the commitlog; the mutation is
ACKed only if this action is performed.

I suppose that this commitlog write may fail for one node (even if this
node is seen as Up and Nominal by the coordinator).

So my question is: what happens if, with an RF of 3 and CL=ALL, a commitlog
write fails on one node and the 2 others succeed? Does the coordinator "cancel" the
mutation on the "committed" nodes (and how)? Is it a heuristic case where
two nodes have the data whereas they shouldn't, and we hope that
HintedHandoff will replay the mutation?



Thank you in advance for your answers; they will help improve my Cassandra
understanding :)

Regards,
Eric


RE: Re[6]: Modify keyspace replication strategy and rebalance the nodes

2017-09-18 Thread Myron A. Semack
How would setting the consistency to ALL help?  Wouldn’t that just cause EVERY 
read/write to fail after the ALTER until the repair is complete?

Sincerely,
Myron A. Semack


From: Jeff Jirsa [mailto:jji...@gmail.com]
Sent: Monday, September 18, 2017 2:42 PM
To: user@cassandra.apache.org
Subject: Re: Re[6]: Modify keyspace replication strategy and rebalance the nodes

The hard part here is nobody's going to be able to tell you exactly what's 
involved in fixing this because nobody sees your ring

And since you're using vnodes and have a nontrivial number of instances, 
sharing that ring (and doing anything actionable with it) is nontrivial.

If you weren't using vnodes, you could just fix the distribution and decom 
extra nodes afterward.


I thought - but don't have time or energy to check - that the ec2snitch would 
be rack aware even when using simple strategy - if that's not the case (as you 
seem to indicate), then you're in a weird spot - you can't go to NTS trivially 
because doing so will reassign your replicas to be rack/as aware, certainly 
violating your consistency guarantees.

If you can change your app to temporarily write with ALL and read with ALL, and 
then run repair, then immediately ALTER the keyspace, then run repair again, 
then drop back to whatever consistency you're using, you can probably get 
through it. The challenge is that ALL gets painful if you lose any instance.

But please test in a lab, and note that this is inherently dangerous, I'm not 
advising you to do it, though I do believe it can be made to work.





--
Jeff Jirsa


On Sep 18, 2017, at 11:18 AM, Dominik Petrovic 
> 
wrote:
@jeff what do you think is the best approach here to fix this problem?
Thank you all for helping me.

Thursday, September 14, 2017 3:28 PM -07:00 from kurt greaves 
>:
Sorry, that only applies if you're using NTS. You're right that simple strategy 
won't work very well in this case. To migrate you'll likely need to do a DC 
migration to ensure no downtime, as replica placement will change even if RF 
stays the same.

On 15 Sep. 2017 08:26, "kurt greaves" 
> wrote:
If you have racks configured and lose nodes you should replace the node with 
one from the same rack. You then need to repair, and definitely don't 
decommission until you do.

Also 40 nodes with 256 vnodes is not a fun time for repair.

On 15 Sep. 2017 03:36, "Dominik Petrovic" 
.invalid> wrote:
@jeff,
I'm using 3 availability zones, during the life of the cluster we lost nodes, 
retired others and we end up having some of the data written/replicated on a 
single availability zone. We saw it with nodetool getendpoints.
Regards

Thursday, September 14, 2017 9:23 AM -07:00 from Jeff Jirsa 
>:
With one datacenter/region, what did you discover in an outage you think you'll 
solve with network topology strategy? It should be equivalent for a single D.C.

--
Jeff Jirsa


On Sep 14, 2017, at 8:47 AM, Dominik Petrovic 
> 
wrote:
Thank you for the replies!

@jeff my current cluster details are:
1 datacenter
40 nodes, with vnodes=256
RF=3
What is your advice? is it a production cluster, so I need to be very careful 
about it.
Regards

Thu, 14 Sep 2017 -2:47:52 -0700 from Jeff Jirsa 
>:
The token distribution isn't going to change - the way Cassandra maps replicas 
will change.

How many data centers/regions will you have when you're done? What's your RF 
now? You definitely need to run repair before you ALTER, but you've got a bit 
of a race here between the repairs and the ALTER, which you MAY be able to work 
around if we know more about your cluster.

How many nodes
How many regions
How many replicas per region when you're done?




--
Jeff Jirsa


On Sep 13, 2017, at 2:04 PM, Dominik Petrovic 
> 
wrote:
Dear community,
I'd like to receive additional info on how to modify a keyspace replication 
strategy.

My Cassandra cluster is on AWS, Cassandra 2.1.15 using vnodes, the cluster's 
snitch is configured to Ec2Snitch, but the keyspace the developers created has 
replication class SimpleStrategy = 3.

During an outage last week we realized the discrepancy between the 
configuration and we would now fix the issue using NetworkTopologyStrategy.

What are the suggested steps to perform?
For Cassandra 2.1 I found only this doc: 
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsChangeKSStrategy.html
that does not mention anything about repairing the cluster

For Cassandra 3 I found this other doc: 

Re: Modify keyspace replication strategy and rebalance the nodes

2017-09-18 Thread Jeff Jirsa
For what it's worth, the problem isn't the snitch, it's the replication strategy 
- he's using the right snitch but SimpleStrategy ignores it

That's the same reason that adding a new DC doesn't work - the replication 
strategy is DC agnostic and changing it safely IS the problem
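
One way to see this concretely is to check where a given partition's replicas
land (keyspace, table and key below are placeholders); with SimpleStrategy the
endpoints returned can all sit in the same rack/AZ, which is exactly what the
move to NTS is meant to fix:

  nodetool getendpoints my_keyspace my_table some_partition_key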



-- 
Jeff Jirsa


> On Sep 18, 2017, at 11:46 AM, Jon Haddad  wrote:
> 
> For those of you who like trivia, SimpleSnitch is hard coded to report every 
> node as being in DC “datacenter1” and rack “rack1”; there’s no way around it.  
> https://github.com/apache/cassandra/blob/8b3a60b9a7dbefeecc06bace617279612ec7092d/src/java/org/apache/cassandra/locator/SimpleSnitch.java#L28
> 
> I would do this by setting up a new DC; trying to do it with the existing one 
> is going to leave you in a state where most queries will return incorrect 
> results (2/3 of queries at ONE and 1/2 of queries at QUORUM) until you finish 
> repair.
> 
>> On Sep 18, 2017, at 11:41 AM, Jeff Jirsa  wrote:
>> 
>> The hard part here is nobody's going to be able to tell you exactly what's 
>> involved in fixing this because nobody sees your ring
>> 
>> And since you're using vnodes and have a nontrivial number of instances, 
>> sharing that ring (and doing anything actionable with it) is nontrivial. 
>> 
>> If you weren't using vnodes, you could just fix the distribution and decom 
>> extra nodes afterward. 
>> 
>> I thought - but don't have time or energy to check - that the ec2snitch 
>> would be rack aware even when using simple strategy - if that's not the case 
>> (as you seem to indicate), then you're in a weird spot - you can't go to NTS 
>> trivially because doing so will reassign your replicas to be rack/as aware, 
>> certainly violating your consistency guarantees.
>> 
>> If you can change your app to temporarily write with ALL and read with ALL, 
>> and then run repair, then immediately ALTER the keyspace, then run repair 
>> again, then drop back to whatever consistency you're using, you can probably 
>> get through it. The challenge is that ALL gets painful if you lose any 
>> instance.
>> 
>> But please test in a lab, and note that this is inherently dangerous, I'm 
>> not advising you to do it, though I do believe it can be made to work.
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Jeff Jirsa
>> 
>> 
>>> On Sep 18, 2017, at 11:18 AM, Dominik Petrovic 
>>>  wrote:
>>> 
>>> @jeff what do you think is the best approach here to fix this problem?
>>> Thank you all for helping me.
>>> 
>>> 
>>> Thursday, September 14, 2017 3:28 PM -07:00 from kurt greaves 
>>> :
>>> 
>>> Sorry that only applies our you're using NTS. You're right that simple 
>>> strategy won't work very well in this case. To migrate you'll likely need 
>>> to do a DC migration to ensuite no downtime, as replica placement will 
>>> change even if RF stays the same.
>>> 
>>> On 15 Sep. 2017 08:26, "kurt greaves"  wrote:
>>> If you have racks configured and lose nodes you should replace the node 
>>> with one from the same rack. You then need to repair, and definitely don't 
>>> decommission until you do.
>>> 
>>> Also 40 nodes with 256 vnodes is not a fun time for repair.
>>> 
>>> On 15 Sep. 2017 03:36, "Dominik Petrovic" 
>>>  wrote:
>>> @jeff,
>>> I'm using 3 availability zones, during the life of the cluster we lost 
>>> nodes, retired others and we end up having some of the data 
>>> written/replicated on a single availability zone. We saw it with nodetool 
>>> getendpoints.
>>> Regards 
>>> 
>>> 
>>> Thursday, September 14, 2017 9:23 AM -07:00 from Jeff Jirsa 
>>> :
>>> 
>>> With one datacenter/region, what did you discover in an outage you think 
>>> you'll solve with network topology strategy? It should be equivalent for a 
>>> single D.C. 
>>> 
>>> -- 
>>> Jeff Jirsa
>>> 
>>> 
 On Sep 14, 2017, at 8:47 AM, Dominik Petrovic 
  wrote:
 
 Thank you for the replies!
 
 @jeff my current cluster details are:
 1 datacenter
 40 nodes, with vnodes=256
 RF=3
 What is your advice? is it a production cluster, so I need to be very 
 careful about it.
 Regards
 
 
 Thu, 14 Sep 2017 -2:47:52 -0700 from Jeff Jirsa :
 
 The token distribution isn't going to change - the way Cassandra maps 
 replicas will change. 
 
 How many data centers/regions will you have when you're done? What's your 
 RF now? You definitely need to run repair before you ALTER, but you've got 
 a bit of a race here between the repairs and the ALTER, which you MAY be 
 able to work around if we know more about your cluster.
 
 How many nodes
 How many regions
 How many replicas per region when you're done?
 
 
 
 
 -- 
 Jeff Jirsa
 
 
> On Sep 13, 2017, at 2:04 PM, Dominik Petrovic 

Re: Re[6]: Modify keyspace replication strategy and rebalance the nodes

2017-09-18 Thread Jeff Jirsa
The hard part here is nobody's going to be able to tell you exactly what's 
involved in fixing this because nobody sees your ring

And since you're using vnodes and have a nontrivial number of instances, 
sharing that ring (and doing anything actionable with it) is nontrivial. 

If you weren't using vnodes, you could just fix the distribution and decom 
extra nodes afterward. 

I thought - but don't have time or energy to check - that the ec2snitch would 
be rack aware even when using simple strategy - if that's not the case (as you 
seem to indicate), then you're in a weird spot - you can't go to NTS trivially 
because doing so will reassign your replicas to be rack/AZ aware, certainly 
violating your consistency guarantees.

If you can change your app to temporarily write with ALL and read with ALL, and 
then run repair, then immediately ALTER the keyspace, then run repair again, 
then drop back to whatever consistency you're using, you can probably get 
through it. The challenge is that ALL gets painful if you lose any instance.
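
As an outline only (keyspace and DC names are placeholders, and the DC name has
to match what the snitch reports in nodetool status), the sequence would look
roughly like:

  # 1. switch the application to read and write at CL=ALL
  # 2. full repair while still on SimpleStrategy (run on each node in turn)
  nodetool repair --full my_keyspace
  # 3. change the replication strategy
  cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication =
    {'class': 'NetworkTopologyStrategy', 'us-east': 3};"
  # 4. full repair again so the newly assigned replicas receive the data
  nodetool repair --full my_keyspace
  # 5. drop the application back to its normal consistency level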

But please test in a lab, and note that this is inherently dangerous, I'm not 
advising you to do it, though I do believe it can be made to work.





-- 
Jeff Jirsa


> On Sep 18, 2017, at 11:18 AM, Dominik Petrovic 
>  wrote:
> 
> @jeff what do you think is the best approach here to fix this problem?
> Thank you all for helping me.
> 
> 
> Thursday, September 14, 2017 3:28 PM -07:00 from kurt greaves 
> :
> 
> Sorry, that only applies if you're using NTS. You're right that simple
> strategy won't work very well in this case. To migrate you'll likely need
> to do a DC migration to ensure no downtime, as replica placement will
> change even if RF stays the same.
> 
> On 15 Sep. 2017 08:26, "kurt greaves"  wrote:
> If you have racks configured and lose nodes you should replace the node with 
> one from the same rack. You then need to repair, and definitely don't 
> decommission until you do.
> 
> Also 40 nodes with 256 vnodes is not a fun time for repair.
> 
> On 15 Sep. 2017 03:36, "Dominik Petrovic"  
> wrote:
> @jeff,
> I'm using 3 availability zones, during the life of the cluster we lost nodes, 
> retired others and we end up having some of the data written/replicated on a 
> single availability zone. We saw it with nodetool getendpoints.
> Regards 
> 
> 
> Thursday, September 14, 2017 9:23 AM -07:00 from Jeff Jirsa 
> :
> 
> With one datacenter/region, what did you discover in an outage you think 
> you'll solve with network topology strategy? It should be equivalent for a 
> single D.C. 
> 
> -- 
> Jeff Jirsa
> 
> 
>> On Sep 14, 2017, at 8:47 AM, Dominik Petrovic 
>>  wrote:
>> 
>> Thank you for the replies!
>> 
>> @jeff my current cluster details are:
>> 1 datacenter
>> 40 nodes, with vnodes=256
>> RF=3
>> What is your advice? is it a production cluster, so I need to be very 
>> careful about it.
>> Regards
>> 
>> 
>> Thu, 14 Sep 2017 -2:47:52 -0700 from Jeff Jirsa :
>> 
>> The token distribution isn't going to change - the way Cassandra maps 
>> replicas will change. 
>> 
>> How many data centers/regions will you have when you're done? What's your RF 
>> now? You definitely need to run repair before you ALTER, but you've got a 
>> bit of a race here between the repairs and the ALTER, which you MAY be able 
>> to work around if we know more about your cluster.
>> 
>> How many nodes
>> How many regions
>> How many replicas per region when you're done?
>> 
>> 
>> 
>> 
>> -- 
>> Jeff Jirsa
>> 
>> 
>>> On Sep 13, 2017, at 2:04 PM, Dominik Petrovic 
>>>  wrote:
>>> 
>>> Dear community,
>>> I'd like to receive additional info on how to modify a keyspace replication 
>>> strategy.
>>> 
>>> My Cassandra cluster is on AWS, Cassandra 2.1.15 using vnodes, the 
>>> cluster's snitch is configured to Ec2Snitch, but the keyspace the 
>>> developers created has replication class SimpleStrategy = 3.
>>> 
>>> During an outage last week we realized the discrepancy between the 
>>> configuration and we would now fix the issue using NetworkTopologyStrategy. 
>>> 
>>> What are the suggested steps to perform?
>>> For Cassandra 2.1 I found only this doc: 
>>> http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsChangeKSStrategy.html
>>>  
>>> that does not mention anything about repairing the cluster
>>> 
>>> For Cassandra 3 I found this other doc: 
>>> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsChangeKSStrategy.html
>>>  
>>> That involves also the cluster repair operation.
>>> 
>>> On a test cluster I tried the steps for Cassandra 2.1 but the token 
>>> distribution in the ring didn't change, so I'm assuming that wasn't the 
>>> right thing to do.
>>> I also perform a nodetool repair -pr but nothing changed as well.
>>> Some advice?
>>> 

Re[6]: Modify keyspace replication strategy and rebalance the nodes

2017-09-18 Thread Dominik Petrovic
@jeff what do you think is the best approach here to fix this problem?
Thank you all for helping me.


>Thursday, September 14, 2017 3:28 PM -07:00 from kurt greaves 
>:
>
>Sorry, that only applies if you're using NTS. You're right that simple 
>strategy won't work very well in this case. To migrate you'll likely need to 
>do a DC migration to ensure no downtime, as replica placement will change 
>even if RF stays the same.
>
>On 15 Sep. 2017 08:26, "kurt greaves" < k...@instaclustr.com > wrote:
>>If you have racks configured and lose nodes you should replace the node with 
>>one from the same rack. You then need to repair, and definitely don't 
>>decommission until you do.
>>
>>Also 40 nodes with 256 vnodes is not a fun time for repair.
>>
>>On 15 Sep. 2017 03:36, "Dominik Petrovic" < dominik.petro...@mail.ru 
>>.invalid> wrote:
>>>@jeff,
>>>I'm using 3 availability zones, during the life of the cluster we lost 
>>>nodes, retired others and we end up having some of the data 
>>>written/replicated on a single availability zone. We saw it with nodetool 
>>>getendpoints.
>>>Regards 
>>>
>>>
Thursday, September 14, 2017 9:23 AM -07:00 from Jeff Jirsa < 
jji...@gmail.com >:

With one datacenter/region, what did you discover in an outage you think 
you'll solve with network topology strategy? It should be equivalent for a 
single D.C. 

-- 
Jeff Jirsa


On Sep 14, 2017, at 8:47 AM, Dominik Petrovic < 
dominik.petro...@mail.ru.INVALID > wrote:

>Thank you for the replies!
>
>@jeff my current cluster details are:
>1 datacenter
>40 nodes, with vnodes=256
>RF=3
>What is your advice? is it a production cluster, so I need to be very 
>careful about it.
>Regards
>
>
>>Thu, 14 Sep 2017 -2:47:52 -0700 from Jeff Jirsa < jji...@gmail.com >:
>>
>>The token distribution isn't going to change - the way Cassandra maps 
>>replicas will change. 
>>
>>How many data centers/regions will you have when you're done? What's your 
>>RF now? You definitely need to run repair before you ALTER, but you've 
>>got a bit of a race here between the repairs and the ALTER, which you MAY 
>>be able to work around if we know more about your cluster.
>>
>>How many nodes
>>How many regions
>>How many replicas per region when you're done?
>>
>>
>>
>>
>>-- 
>>Jeff Jirsa
>>
>>
>>On Sep 13, 2017, at 2:04 PM, Dominik Petrovic < 
>>dominik.petro...@mail.ru.INVALID > wrote:
>>
>>>Dear community,
>>>I'd like to receive additional info on how to modify a keyspace 
>>>replication strategy.
>>>
>>>My Cassandra cluster is on AWS, Cassandra 2.1.15 using vnodes, the 
>>>cluster's snitch is configured to Ec2Snitch, but the keyspace the 
>>>developers created has replication class SimpleStrategy = 3.
>>>
>>>During an outage last week we realized the discrepancy between the 
>>>configuration and we would now fix the issue using 
>>>NetworkTopologyStrategy. 
>>>
>>>What are the suggested steps to perform?
>>>For Cassandra 2.1 I found only this doc:  
>>>http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsChangeKSStrategy.html
>>>  
>>>that does not mention anything about repairing the cluster
>>>
>>>For Cassandra 3 I found this other doc:  
>>>https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsChangeKSStrategy.html
>>>  
>>>That involves also the cluster repair operation.
>>>
>>>On a test cluster I tried the steps for Cassandra 2.1 but the token 
>>>distribution in the ring didn't change, so I'm assuming that wasn't the 
>>>right thing to do.
>>>I also perform a nodetool repair -pr but nothing changed as well.
>>>Some advice?
>>>
>>>-- 
>>>Dominik Petrovic
>
>-- 
>Dominik Petrovic
>>>
>>>
>>>-- 
>>>Dominik Petrovic


-- 
Dominik Petrovic


Re: Multi-node repair fails after upgrading to 3.0.14

2017-09-18 Thread Jeff Jirsa
Sorry I may be wrong about the cause - didn't see -full

Mea culpa, it's early here and I'm not awake


-- 
Jeff Jirsa


> On Sep 18, 2017, at 7:01 AM, Steinmaurer, Thomas 
>  wrote:
> 
> Hi Jeff,
>  
> understood. That’s quite a change then coming from 2.1 from an operational 
> POV.
>  
> Thanks again.
>  
> Thomas
>  
> From: Jeff Jirsa [mailto:jji...@gmail.com] 
> Sent: Montag, 18. September 2017 15:56
> To: user@cassandra.apache.org
> Subject: Re: Multi-node repair fails after upgrading to 3.0.14
>  
> The command you're running will cause anticompaction and the range borders 
> for all instances at the same time
>  
> Since only one repair session can anticompact any given sstable, it's almost 
> guaranteed to fail
>  
> Run it on one instance at a time
> 
> 
> -- 
> Jeff Jirsa
>  
> 
> On Sep 18, 2017, at 1:11 AM, Steinmaurer, Thomas 
>  wrote:
> 
> Hi Alex,
>  
> I now ran nodetool repair –full –pr keyspace cfs on all nodes in parallel and 
> this may pop up now:
>  
> 0.176.38.128 (progress: 1%)
> [2017-09-18 07:59:17,145] Some repair failed
> [2017-09-18 07:59:17,151] Repair command #3 finished in 0 seconds
> error: Repair job has failed with the error message: [2017-09-18 
> 07:59:17,145] Some repair failed
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error message: 
> [2017-09-18 07:59:17,145] Some repair failed
> at 
> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:115)
> at 
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
> at 
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
> at 
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
> at 
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
> at 
> com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>  
> 2017-09-18 07:59:17 repair finished
>  
>  
> If running the above nodetool call sequentially on all nodes, repair finishes 
> without printing a stack trace.
>  
> The error message and stack trace isn’t really useful here. Any further 
> ideas/experiences?
>  
> Thanks,
> Thomas
>  
> From: Alexander Dejanovski [mailto:a...@thelastpickle.com] 
> Sent: Freitag, 15. September 2017 11:30
> To: user@cassandra.apache.org
> Subject: Re: Multi-node repair fails after upgrading to 3.0.14
>  
> Right, you should indeed add the "--full" flag to perform full repairs, and 
> you can then keep the "-pr" flag.
>  
> I'd advise to monitor the status of your SSTables as you'll probably end up 
> with a pool of SSTables marked as repaired, and another pool marked as 
> unrepaired which won't be compacted together (hence the suggestion of running 
> subrange repairs).
> Use sstablemetadata to check on the "Repaired at" value for each. 0 means 
> unrepaired and any other value (a timestamp) means the SSTable has been 
> repaired.
> I've had behaviors in the past where running "-pr" on the whole cluster would 
> still not mark all SSTables as repaired, but I can't say if that behavior has 
> changed in latest versions.
>  
> Having separate pools of SStables that cannot be compacted means that you 
> might have tombstones that don't get evicted due to partitions living in both 
> states (repaired/unrepaired).
>  
> To sum up the recommendations : 
> - Run a full repair with both "--full" and "-pr" and check that SSTables are 
> properly marked as repaired
> - Use a tight repair schedule to avoid keeping partitions for too long in 
> both repaired and unrepaired state
> - Switch to subrange repair if you want to fully avoid marking SSTables as 
> repaired (which you don't need anyway since you're not using incremental 
> repairs). If you wish to do this, you'll have to mark back all your sstables 
> to unrepaired, using nodetool sstablerepairedset.
>  
> Cheers,
>  
> On Fri, Sep 15, 2017 at 10:27 AM Steinmaurer, Thomas 
>  wrote:
> Hi Alex,
>  
> thanks a lot. Somehow missed that incremental repairs are the default now.
>  
> We have been happy with full repair so far, cause data what we currently 
> manually invoke for being prepared is a small (~1GB or even smaller).
>  
> So I guess with full repairs across all nodes, we still can stick with the 
> partition range (-pr) option, but with 3.0 we additionally have to provide 
> the –full option, right?
>  
> Thanks again,
> Thomas
>  
> From: Alexander Dejanovski [mailto:a...@thelastpickle.com] 
> Sent: Freitag, 15. September 2017 09:45
> To: user@cassandra.apache.org
> Subject: Re: Multi-node repair fails after upgrading to 3.0.14
>  
> Hi Thomas,
>  
> in 2.1.18, the default repair mode was full repair while since 

RE: Multi-node repair fails after upgrading to 3.0.14

2017-09-18 Thread Steinmaurer, Thomas
Hi Jeff,

understood. That’s quite a change then coming from 2.1 from an operational POV.

Thanks again.

Thomas

From: Jeff Jirsa [mailto:jji...@gmail.com]
Sent: Montag, 18. September 2017 15:56
To: user@cassandra.apache.org
Subject: Re: Multi-node repair fails after upgrading to 3.0.14

The command you're running will cause anticompaction and the range borders for 
all instances at the same time

Since only one repair session can anticompact any given sstable, it's almost 
guaranteed to fail

Run it on one instance at a time


--
Jeff Jirsa


On Sep 18, 2017, at 1:11 AM, Steinmaurer, Thomas 
> 
wrote:
Hi Alex,

I now ran nodetool repair --full -pr keyspace cfs on all nodes in parallel and 
this may pop up now:

0.176.38.128 (progress: 1%)
[2017-09-18 07:59:17,145] Some repair failed
[2017-09-18 07:59:17,151] Repair command #3 finished in 0 seconds
error: Repair job has failed with the error message: [2017-09-18 07:59:17,145] 
Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: 
[2017-09-18 07:59:17,145] Some repair failed
at 
org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:115)
at 
org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)

2017-09-18 07:59:17 repair finished


If running the above nodetool call sequentially on all nodes, repair finishes 
without printing a stack trace.

The error message and stack trace isn’t really useful here. Any further 
ideas/experiences?

Thanks,
Thomas

From: Alexander Dejanovski [mailto:a...@thelastpickle.com]
Sent: Freitag, 15. September 2017 11:30
To: user@cassandra.apache.org
Subject: Re: Multi-node repair fails after upgrading to 3.0.14

Right, you should indeed add the "--full" flag to perform full repairs, and you 
can then keep the "-pr" flag.

I'd advise to monitor the status of your SSTables as you'll probably end up 
with a pool of SSTables marked as repaired, and another pool marked as 
unrepaired which won't be compacted together (hence the suggestion of running 
subrange repairs).
Use sstablemetadata to check on the "Repaired at" value for each. 0 means 
unrepaired and any other value (a timestamp) means the SSTable has been 
repaired.
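
For example, a quick way to eyeball that state across a table's SSTables
(default data directory and placeholder keyspace/table names assumed):

  for f in /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db; do
    echo "$f"; sstablemetadata "$f" | grep "Repaired at"
  done
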
I've had behaviors in the past where running "-pr" on the whole cluster would 
still not mark all SSTables as repaired, but I can't say if that behavior has 
changed in latest versions.

Having separate pools of SStables that cannot be compacted means that you might 
have tombstones that don't get evicted due to partitions living in both states 
(repaired/unrepaired).

To sum up the recommendations :
- Run a full repair with both "--full" and "-pr" and check that SSTables are 
properly marked as repaired
- Use a tight repair schedule to avoid keeping partitions for too long in both 
repaired and unrepaired state
- Switch to subrange repair if you want to fully avoid marking SSTables as 
repaired (which you don't need anyway since you're not using incremental 
repairs). If you wish to do this, you'll have to mark back all your sstables to 
unrepaired, using nodetool 
sstablerepairedset.

Cheers,

On Fri, Sep 15, 2017 at 10:27 AM Steinmaurer, Thomas 
> 
wrote:
Hi Alex,

thanks a lot. Somehow missed that incremental repairs are the default now.

We have been happy with full repair so far, because the data for which we 
currently manually invoke repair is small (~1GB or even smaller).

So I guess with full repairs across all nodes, we still can stick with the 
partition range (-pr) option, but with 3.0 we additionally have to provide the 
--full option, right?

Thanks again,
Thomas

From: Alexander Dejanovski 
[mailto:a...@thelastpickle.com]
Sent: Freitag, 15. September 2017 09:45
To: user@cassandra.apache.org
Subject: Re: Multi-node repair fails after upgrading to 3.0.14

Hi Thomas,

in 2.1.18, the default repair mode was full repair while since 2.2 it is 
incremental repair.
So running "nodetool repair -pr" since your upgrade to 3.0.14 doesn't trigger 
the same operation.

Incremental repair cannot run on more than one node at a time on a cluster, 
because you risk to have 

Re: Multi-node repair fails after upgrading to 3.0.14

2017-09-18 Thread Jeff Jirsa
The command you're running will cause anticompaction at the range borders for 
all instances at the same time

Since only one repair session can anticompact any given sstable, it's almost 
guaranteed to fail

Run it on one instance at a time


-- 
Jeff Jirsa


> On Sep 18, 2017, at 1:11 AM, Steinmaurer, Thomas 
>  wrote:
> 
> Hi Alex,
>  
> I now ran nodetool repair –full –pr keyspace cfs on all nodes in parallel and 
> this may pop up now:
>  
> 0.176.38.128 (progress: 1%)
> [2017-09-18 07:59:17,145] Some repair failed
> [2017-09-18 07:59:17,151] Repair command #3 finished in 0 seconds
> error: Repair job has failed with the error message: [2017-09-18 
> 07:59:17,145] Some repair failed
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error message: 
> [2017-09-18 07:59:17,145] Some repair failed
> at 
> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:115)
> at 
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
> at 
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
> at 
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
> at 
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
> at 
> com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>  
> 2017-09-18 07:59:17 repair finished
>  
>  
> If running the above nodetool call sequentially on all nodes, repair finishes 
> without printing a stack trace.
>  
> The error message and stack trace isn’t really useful here. Any further 
> ideas/experiences?
>  
> Thanks,
> Thomas
>  
> From: Alexander Dejanovski [mailto:a...@thelastpickle.com] 
> Sent: Freitag, 15. September 2017 11:30
> To: user@cassandra.apache.org
> Subject: Re: Multi-node repair fails after upgrading to 3.0.14
>  
> Right, you should indeed add the "--full" flag to perform full repairs, and 
> you can then keep the "-pr" flag.
>  
> I'd advise to monitor the status of your SSTables as you'll probably end up 
> with a pool of SSTables marked as repaired, and another pool marked as 
> unrepaired which won't be compacted together (hence the suggestion of running 
> subrange repairs).
> Use sstablemetadata to check on the "Repaired at" value for each. 0 means 
> unrepaired and any other value (a timestamp) means the SSTable has been 
> repaired.
> I've had behaviors in the past where running "-pr" on the whole cluster would 
> still not mark all SSTables as repaired, but I can't say if that behavior has 
> changed in latest versions.
>  
> Having separate pools of SStables that cannot be compacted means that you 
> might have tombstones that don't get evicted due to partitions living in both 
> states (repaired/unrepaired).
>  
> To sum up the recommendations : 
> - Run a full repair with both "--full" and "-pr" and check that SSTables are 
> properly marked as repaired
> - Use a tight repair schedule to avoid keeping partitions for too long in 
> both repaired and unrepaired state
> - Switch to subrange repair if you want to fully avoid marking SSTables as 
> repaired (which you don't need anyway since you're not using incremental 
> repairs). If you wish to do this, you'll have to mark back all your sstables 
> to unrepaired, using nodetool sstablerepairedset.
>  
> Cheers,
>  
> On Fri, Sep 15, 2017 at 10:27 AM Steinmaurer, Thomas 
>  wrote:
> Hi Alex,
>  
> thanks a lot. Somehow missed that incremental repairs are the default now.
>  
> We have been happy with full repair so far, cause data what we currently 
> manually invoke for being prepared is a small (~1GB or even smaller).
>  
> So I guess with full repairs across all nodes, we still can stick with the 
> partition range (-pr) option, but with 3.0 we additionally have to provide 
> the –full option, right?
>  
> Thanks again,
> Thomas
>  
> From: Alexander Dejanovski [mailto:a...@thelastpickle.com] 
> Sent: Freitag, 15. September 2017 09:45
> To: user@cassandra.apache.org
> Subject: Re: Multi-node repair fails after upgrading to 3.0.14
>  
> Hi Thomas,
>  
> in 2.1.18, the default repair mode was full repair while since 2.2 it is 
> incremental repair.
> So running "nodetool repair -pr" since your upgrade to 3.0.14 doesn't trigger 
> the same operation.
>  
> Incremental repair cannot run on more than one node at a time on a cluster, 
> because you risk having conflicts with sessions trying to anticompact and 
> run validation compactions on the same SSTables (which will make the 
> validation phase fail, like your logs are showing).
> Furthermore, you should never use "-pr" with incremental repair because it is 
> useless in that mode, and won't properly perform anticompaction on 

RE: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

2017-09-18 Thread Steinmaurer, Thomas
Hello again,

dug a bit further, comparing 1hr flight recording sessions for both 2.1 and 
3.0 with the same incoming simulated load from our loadtest environment.

We are heavily write bound rather than read bound in this environment/scenario, 
and it looks like there is a noticeable/measurable difference in 3.0 in what is 
happening underneath org.apache.cassandra.cql3.statements.BatchStatement.execute 
in both JFR/JMC areas, Code and Memory (allocation rate / object churn).

E.g. for org.apache.cassandra.cql3.statements.BatchStatement.execute, JFR 
reports a total TLAB size of 59,35 GB for the 1hr session on 2.1 versus 246,12 GB 
on Cassandra 3.0, so if this is trustworthy, that is roughly a 4 times higher 
allocation rate in the BatchStatement.execute code path, which would explain the 
increased GC suspension since upgrading.

Is anybody aware of any write-bound benchmarks of the storage engine in 3.0 in 
the context of CPU/GC rather than disk savings?
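
For reference, recordings like the ones compared above can be captured with
jcmd, assuming Flight Recorder is enabled on the Cassandra JVM (the PID and
output path are placeholders):

  jcmd <cassandra-pid> JFR.start duration=1h filename=/tmp/cassandra-write-load.jfr settings=profile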

Thanks,
Thomas


From: Steinmaurer, Thomas [mailto:thomas.steinmau...@dynatrace.com]
Sent: Freitag, 15. September 2017 13:51
To: user@cassandra.apache.org
Subject: RE: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

Hi Jeff,

we are using native (CQL3) via Java DataStax driver (3.1). We also have 
OpsCenter running (to be removed soon) via Thrift, if I remember correctly.

As said, the write request latency for our keyspace hasn’t really changed, so 
perhaps another one (system related, OpsCenter …?) is affected or perhaps the 
JMX metric is reporting something differently now. ☺ So not a real issue for 
now hopefully, just popping up in our monitoring, wondering what this may be.

Regarding compression metadata memory usage drop. Right, storage engine 
re-write could be a reason. Thanks.

Still wondering about the GC/CPU increase.

Thanks!

Thomas



From: Jeff Jirsa [mailto:jji...@gmail.com]
Sent: Freitag, 15. September 2017 13:14
To: user@cassandra.apache.org
Subject: Re: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

Most people find 3.0 slightly slower than 2.1. The only thing that really 
stands out in your email is the huge change in 95% latency - that's atypical. 
Are you using thrift or native (9042)?  The decrease in compression metadata 
offheap usage is likely due to the increased storage efficiency of the storage 
engine (see CASSANDRA-8099).


--
Jeff Jirsa


On Sep 15, 2017, at 2:37 AM, Steinmaurer, Thomas 
> 
wrote:
Hello,

we have a test (regression) environment hosted in AWS, which is used for auto 
deploying our software on a daily basis and attach constant load across all 
deployments. Basically to allow us to detect any regressions in our software on 
a daily basis.

On the Cassandra-side, this is single-node in AWS, m4.xlarge, EBS gp2, 8G heap, 
CMS. The environment has also been upgraded from Cassandra 2.1.18 to 3.0.14 at 
a certain point in time. Without running upgradesstables so far. We have not 
made any additional JVM/GC configuration change when going from 2.1.18 to 
3.0.14 on our own, thus, any self-made configuration changes (e.g. new gen heap 
size) for 2.1.18 are also in place with 3.0.14.

What we see after a time-frame of ~ 7 days (so, e.g. should not be caused by 
some sort of spiky compaction pattern) is an AVG increase in GC/CPU (most 
likely correlating):

- CPU: ~ 12% => ~ 17%

- GC Suspension: ~ 1,7% => 3,29%

In this environment this is not a big deal, but relatively we have a CPU increase 
of ~ 50% (with increased GC most likely contributing). This is something we will 
have to deal with when going into production (going into larger, multi-node 
loadtest environments first though).

Besides the CPU/GC shift, we also see the following noticeable changes 
(don’t know if they somehow correlate with the CPU/GC shift above):

- Increased AVG Write Client Requests Latency (95th Percentile), 
org.apache.cassandra.metrics.ClientRequest.Latency.Write: 6,05ms => 29,2ms, but 
almost constant (no change in) write client request latency for our particular 
keyspace only, org.apache.cassandra.metrics.Keyspace.ruxitdb.WriteLatency

- Compression metadata memory usage drop, 
org.apache.cassandra.metrics.Keyspace.XXX.CompressionMetadataOffHeapMemoryUsed: 
~218MB => ~105MB => Good or bad? Known?

I know, looks all a bit vague, but perhaps someone else has seen something 
similar when upgrading to 3.0.14 and can share their thoughts/ideas. Especially 
the (relative) CPU/GC increase is something we are curious about.

Thanks a lot.

Thomas
Compaction task not available in dcos-cassandra-service

2017-09-18 Thread Akshit Jain
Hi, there isn't a compaction task feature in
mesosphere/dcos-cassandra-service the way there is for repair and cleanup.
Is anybody working on it, or is there any plan to add it in later releases?
Regards


Re: Multi-node repair fails after upgrading to 3.0.14

2017-09-18 Thread Alexander Dejanovski
You could dig a bit more in the logs to see what precisely failed.
I suspect anticompaction is still responsible for conflicts with
validation compaction (so you should see validation failures on some nodes).

The only way to fully disable anticompaction will be to run subrange
repairs.
The two easy options for that are cassandra_range_repair and Reaper.

Reaper will offer better orchestration, as it considers the whole token ring
and not just a single node at a time, and it already includes a scheduler. It
also checks for pending compactions and slows down repairs if there are too
many (i.e. your repair job may be generating a lot of new SSTables, which
can put you in a very, very bad place...).
That said, you may find that cassandra_range_repair better suits your
scheduling/running habits.
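
For completeness, a manually issued subrange repair looks roughly like the
following; the token values are placeholders and would normally come from
whatever tool is doing the orchestration:

  nodetool repair --full -st <start_token> -et <end_token> my_keyspace my_table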

Cheers,

On Mon, Sep 18, 2017 at 10:11 AM Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Hi Alex,
>
>
>
> I now ran nodetool repair –full –pr keyspace cfs on all nodes in parallel
> and this may pop up now:
>
>
>
> 0.176.38.128 (progress: 1%)
>
> [2017-09-18 07:59:17,145] Some repair failed
>
> [2017-09-18 07:59:17,151] Repair command #3 finished in 0 seconds
>
> error: Repair job has failed with the error message: [2017-09-18
> 07:59:17,145] Some repair failed
>
> -- StackTrace --
>
> java.lang.RuntimeException: Repair job has failed with the error message:
> [2017-09-18 07:59:17,145] Some repair failed
>
> at
> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:115)
>
> at
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
>
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
>
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
>
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>
>
>
> 2017-09-18 07:59:17 repair finished
>
>
>
>
>
> If running the above nodetool call sequentially on all nodes, repair
> finishes without printing a stack trace.
>
>
>
> The error message and stack trace isn’t really useful here. Any further
> ideas/experiences?
>
>
>
> Thanks,
>
> Thomas
>
>
>
> *From:* Alexander Dejanovski [mailto:a...@thelastpickle.com]
> *Sent:* Friday, 15 September 2017 11:30
>
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Multi-node repair fails after upgrading to 3.0.14
>
>
>
> Right, you should indeed add the "--full" flag to perform full repairs,
> and you can then keep the "-pr" flag.
>
>
>
> I'd advise to monitor the status of your SSTables as you'll probably end
> up with a pool of SSTables marked as repaired, and another pool marked as
> unrepaired which won't be compacted together (hence the suggestion of
> running subrange repairs).
>
> Use sstablemetadata to check on the "Repaired at" value for each. 0 means
> unrepaired and any other value (a timestamp) means the SSTable has been
> repaired.
>
> I've had behaviors in the past where running "-pr" on the whole cluster
> would still not mark all SSTables as repaired, but I can't say if that
> behavior has changed in latest versions.
>
>
>
> Having separate pools of SSTables that cannot be compacted means that you
> might have tombstones that don't get evicted due to partitions living in
> both states (repaired/unrepaired).
>
>
>
> To sum up the recommendations :
>
> - Run a full repair with both "--full" and "-pr" and check that SSTables
> are properly marked as repaired
>
> - Use a tight repair schedule to avoid keeping partitions for too long in
> both repaired and unrepaired state
>
> - Switch to subrange repair if you want to fully avoid marking SSTables as
> repaired (which you don't need anyway since you're not using incremental
> repairs). If you wish to do this, you'll have to mark back all your
> sstables to unrepaired, using nodetool sstablerepairedset.
>
>
>
> Cheers,
>
>
>
> On Fri, Sep 15, 2017 at 10:27 AM Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
> Hi Alex,
>
>
>
> thanks a lot. Somehow missed that incremental repairs are the default now.
>
>
>
> We have been happy with full repair so far, because the data we currently
> repair manually is small (~1GB or even smaller).
>
>
>
> So I guess with full repairs across all nodes, we still can stick with the
> partition range (-pr) option, but with 3.0 we additionally have to provide
> the –full option, right?
>
>
>
> Thanks again,
>
> Thomas
>
>
>
> *From:* Alexander Dejanovski [mailto:a...@thelastpickle.com]
> *Sent:* 

RE: Multi-node repair fails after upgrading to 3.0.14

2017-09-18 Thread Steinmaurer, Thomas
Hi Alex,

I now ran nodetool repair –full –pr keyspace cfs on all nodes in parallel and 
this may pop up now:

0.176.38.128 (progress: 1%)
[2017-09-18 07:59:17,145] Some repair failed
[2017-09-18 07:59:17,151] Repair command #3 finished in 0 seconds
error: Repair job has failed with the error message: [2017-09-18 07:59:17,145] 
Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: 
[2017-09-18 07:59:17,145] Some repair failed
at 
org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:115)
at 
org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at 
com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)

2017-09-18 07:59:17 repair finished


If running the above nodetool call sequentially on all nodes, repair finishes 
without printing a stack trace.

The error message and stack trace isn’t really useful here. Any further 
ideas/experiences?

Thanks,
Thomas

From: Alexander Dejanovski [mailto:a...@thelastpickle.com]
Sent: Friday, 15 September 2017 11:30
To: user@cassandra.apache.org
Subject: Re: Multi-node repair fails after upgrading to 3.0.14

Right, you should indeed add the "--full" flag to perform full repairs, and you 
can then keep the "-pr" flag.

I'd advise to monitor the status of your SSTables as you'll probably end up 
with a pool of SSTables marked as repaired, and another pool marked as 
unrepaired which won't be compacted together (hence the suggestion of running 
subrange repairs).
Use sstablemetadata to check on the "Repaired at" value for each. 0 means 
unrepaired and any other value (a timestamp) means the SSTable has been 
repaired.
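For example, something like this gives a quick per-SSTable overview (the data
directory layout is an assumption, adjust to your setup):

  for f in /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db; do
    echo "$f: $(sstablemetadata "$f" | grep 'Repaired at')"
  done
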
I've had behaviors in the past where running "-pr" on the whole cluster would 
still not mark all SSTables as repaired, but I can't say if that behavior has 
changed in latest versions.

Having separate pools of SSTables that cannot be compacted means that you might 
have tombstones that don't get evicted due to partitions living in both states 
(repaired/unrepaired).

To sum up the recommendations :
- Run a full repair with both "--full" and "-pr" and check that SSTables are 
properly marked as repaired
- Use a tight repair schedule to avoid keeping partitions for too long in both 
repaired and unrepaired state
- Switch to subrange repair if you want to fully avoid marking SSTables as 
repaired (which you don't need anyway since you're not using incremental 
repairs). If you wish to do this, you'll have to mark back all your sstables to 
unrepaired, using nodetool sstablerepairedset.
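(Note that sstablerepairedset actually ships as a standalone tool in the Cassandra
tools/bin directory rather than as a nodetool subcommand, and it is meant to be run
while the node is stopped. A minimal sketch, with placeholder paths:

  # with the node down, mark all SSTables of the table back to unrepaired
  sstablerepairedset --really-set --is-unrepaired /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db

then start the node again and verify with sstablemetadata as above.)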

Cheers,

On Fri, Sep 15, 2017 at 10:27 AM Steinmaurer, Thomas <thomas.steinmau...@dynatrace.com> wrote:
Hi Alex,

thanks a lot. Somehow missed that incremental repairs are the default now.

We have been happy with full repair so far, because the data we currently repair 
manually is small (~1GB or even smaller).

So I guess with full repairs across all nodes, we still can stick with the 
partition range (-pr) option, but with 3.0 we additionally have to provide the 
–full option, right?

Thanks again,
Thomas

From: Alexander Dejanovski 
[mailto:a...@thelastpickle.com]
Sent: Friday, 15 September 2017 09:45
To: user@cassandra.apache.org
Subject: Re: Multi-node repair fails after upgrading to 3.0.14

Hi Thomas,

in 2.1.18, the default repair mode was full repair while since 2.2 it is 
incremental repair.
So running "nodetool repair -pr" since your upgrade to 3.0.14 doesn't trigger 
the same operation.
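To make that concrete (the keyspace name is a placeholder):

  # 2.1.x: this runs a full repair of the node's primary ranges
  nodetool repair -pr my_keyspace

  # 2.2+ / 3.0.x: the same command now runs an incremental repair;
  # the old behaviour has to be requested explicitly
  nodetool repair --full -pr my_keyspace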

Incremental repair cannot run on more than one node at a time on a cluster, 
because you risk to have conflicts with sessions trying to anticompact and run 
validation compactions on the same SSTables (which will make the validation 
phase fail, like your logs are showing).
Furthermore, you should never use "-pr" with incremental repair because it is 
useless in that mode, and won't properly perform anticompaction on all nodes.

If you were happy with full repairs in 2.1.18, I'd suggest to stick with those 
in 3.0.14 as well because there are still too many caveats with incremental 
repairs that should hopefully be fixed in 4.0+.
Note that full repair will also trigger anticompaction and mark SSTables as 
repaired in your release of Cassandra; full subrange repairs are the only flavor 
that will skip anticompaction.

Re: Wide rows splitting

2017-09-18 Thread Stefano Ortolani
You might find this interesting:
https://medium.com/@foundev/synthetic-sharding-in-cassandra-to-deal-with-large-partitions-2124b2fd788b
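The core idea there, as a rough CQL sketch (table and column names are made up for
illustration): add a bucket component to the partition key so one logical wide row is
spread over a fixed number of partitions, with the writer picking the bucket
deterministically, e.g. hash(inlink) mod bucket_count:

  CREATE TABLE inlinks (
      url text,
      bucket int,
      inlink text,
      PRIMARY KEY ((url, bucket), inlink)
  );

  -- readers fan out over all buckets of a URL, e.g.
  -- SELECT inlink FROM inlinks WHERE url = ? AND bucket IN (0, 1, 2, 3);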

Cheers,
Stefano

On Mon, Sep 18, 2017 at 5:07 AM, Adam Smith  wrote:

> Dear community,
>
> I have a table with inlinks to URLs, i.e. many URLs point to
> http://google.com, fewer URLs point to http://somesmallweb.page.
>
> It has both very wide and very skinny rows - the distribution follows a
> power law. I do not know a priori how many columns a row has, and I can't
> identify a schema that would give good partitioning.
>
> Currently, I am thinking about introducing splits: the pk would be (URL,
> splitnumber), where splitnumber is initially 1 and hash(URL) mod
> splitnumber would determine the splitnumber on insert. I would need a
> separate table to maintain the splitnumber, and a spark-cassandra-connector
> job would count the columns and increase/double the number of splits on
> demand. This means I would then have to move e.g. (URL1,0) -> (URL1,1)
> when splitnumber becomes 2.
>
> Would you do the same? Is there a better way?
>
> Thanks!
> Adam
>


Re: Maturity and Stability of Enabling CDC

2017-09-18 Thread Michael Fong
Thanks Jeff!

On Mon, Sep 18, 2017 at 9:31 AM, Jeff Jirsa  wrote:

> Haven't tried out CDC, but the answer based on the design doc is yes - you
> have to manually dedup CDC at the consumer level
>
>
>
>
> --
> Jeff Jirsa
>
>
> On Sep 17, 2017, at 6:21 PM, Michael Fong  wrote:
>
> Thanks for your reply.
>
>
> If anyone has tried out this new feature, perhaps he/she could answer this
> question: would multiple copies of CDC data be sent downstream (i.e., to Kafka)
> when all nodes have enabled cdc?
>
> Regards,
>
> On Mon, Sep 18, 2017 at 6:59 AM, kurt greaves 
> wrote:
>
>> I don't believe it's used by many, if any. It certainly hasn't had enough
>> attention to be considered production ready, nor has it been out long enough
>> for many people to be on a version where CDC is available. FWIW I've never
>> even seen any inquiries about using it.
>>
>> On 17 Sep. 2017 13:18, "Michael Fong"  wrote:
>>
>> anyone?
>>
>> On Wed, Sep 13, 2017 at 5:10 PM, Michael Fong 
>> wrote:
>>
>>> Hi, all,
>>>
>>> We've noticed there is a new feature for streaming changed data to another
>>> streaming service. Doc:
>>> http://cassandra.apache.org/doc/latest/operating/cdc.html
>>>
>>> We are evaluating the stability (and maturity) of this feature, and are
>>> possibly integrating it with Kafka (with its associated connector). Has
>>> anyone adopted this in production for a real application?
>>>
>>> Regards,
>>>
>>> Michael
>>>
>>
>>
>>
>
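
For context on what enabling CDC on a node and table involves, here is a minimal
sketch based on the linked documentation (property names as of 3.11, paths are
placeholders):

  # cassandra.yaml
  cdc_enabled: true
  cdc_raw_directory: /var/lib/cassandra/cdc_raw

  -- CQL, per table
  ALTER TABLE my_keyspace.my_table WITH cdc = true;

Since every replica with CDC enabled links its own commit log segments into the CDC
directory, a consumer reading from all nodes will see up to RF copies of each
mutation, which is why deduplication has to happen at the consumer, as noted above.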