Re: Replication factor, LOCAL_QUORUM write consistency and materialized views

2024-05-17 Thread Gábor Auth
Hi,

On Fri, May 17, 2024 at 6:18 PM Jon Haddad  wrote:

> I strongly suggest you don't use materialized views at all.  There are
> edge cases that in my opinion make them unsuitable for production, both in
> terms of cluster stability as well as data integrity.
>

Oh, there is already an open and fresh Jira ticket about it:
https://issues.apache.org/jira/browse/CASSANDRA-19383

Bye,
Gábor AUTH


Re: Replication factor, LOCAL_QUORUM write consistency and materialized views

2024-05-17 Thread Gábor Auth
Hi,

On Fri, May 17, 2024 at 6:18 PM Jon Haddad  wrote:

> I strongly suggest you don't use materialized views at all.  There are
> edge cases that in my opinion make them unsuitable for production, both in
> terms of cluster stability as well as data integrity.
>

I totally agree with you on that. But it looks like a strange and
interesting issue... the affected table has only ~1300 rows and less than
200 kB of data. :)

Also, I found the same issue here:
https://dba.stackexchange.com/questions/325140/single-node-failure-in-cassandra-4-0-7-causes-cluster-to-run-into-high-cpu

Bye,
Gábor AUTH


> On Fri, May 17, 2024 at 8:58 AM Gábor Auth  wrote:
>
>> Hi,
>>
>> I know, I know, the materialized view is experimental... :)
>>
>> So, I ran into a strange error. Among others, I have a very small 4-nodes
>> cluster, with very minimal data (~100 MB at all), the keyspace's
>> replication factor is 3, everything is works fine... except: if I restart a
>> node, I get a lot of errors with materialized views and consistency level
>> ONE, but only for those tables for which there is more than one
>> materialized view.
>>
>> Tables without materialized view don't have it, works fine.
>> Tables that have it, but only one materialized view, also works fine.
>> But, a table with more than one materialized view, whoops, the cluster
>> crashes temporarily, I can also see on the calling side (Java backend) that
>> no nodes are responding:
>>
>> Caused by: com.datastax.driver.core.exceptions.WriteFailureException:
>> Cassandra failure during write query at consistency LOCAL_QUORUM (2
>> responses were required but only 1 replica responded, 2 failed)
>>
>> I am surprised by this behavior, because there is so little data
>> involved, and it occurs when there is more than one materialized view only,
>> so it might be a concurrency issue under the hood.
>>
>> Have you seen an issue like this?
>>
>> Here is a stack trace on the Cassandra's side:
>>
>> [cassandra-dc03-1] ERROR [MutationStage-1] 2024-05-17 08:51:47,425
>> Keyspace.java:652 - Unknown exception caught while attempting to update
>> MaterializedView! pope.unit
>> [cassandra-dc03-1] org.apache.cassandra.exceptions.UnavailableException:
>> Cannot achieve consistency level ONE
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.exceptions.UnavailableException.create(UnavailableException.java:37)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicas(ReplicaPlans.java:170)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicasForWrite(ReplicaPlans.java:113)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:354)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:345)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:339)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.service.StorageProxy.wrapViewBatchResponseHandler(StorageProxy.java:1312)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.service.StorageProxy.mutateMV(StorageProxy.java:1004)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.db.view.TableViews.pushViewReplicaUpdates(TableViews.java:167)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:647)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:477)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:210)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:58)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:432)
>> [cassandra-dc03-1]  at
>> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown
>> Source)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137)
>> [cassandra-dc03-1]  at
>> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
>> [cassandra-dc03-1]  at
>> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>> [cassandra-dc03-1]  at java.base/java.lang.Thread.run(Unknown Source)
>>
>> --
>> Bye,
>> Gábor AUTH
>>
>


Re: Replication factor, LOCAL_QUORUM write consistency and materialized views

2024-05-17 Thread Jon Haddad
I strongly suggest you don't use materialized views at all.  There are edge
cases that in my opinion make them unsuitable for production, both in terms
of cluster stability as well as data integrity.

Jon

On Fri, May 17, 2024 at 8:58 AM Gábor Auth  wrote:

> Hi,
>
> I know, I know, the materialized view is experimental... :)
>
> So, I ran into a strange error. Among others, I have a very small 4-nodes
> cluster, with very minimal data (~100 MB at all), the keyspace's
> replication factor is 3, everything is works fine... except: if I restart a
> node, I get a lot of errors with materialized views and consistency level
> ONE, but only for those tables for which there is more than one
> materialized view.
>
> Tables without materialized view don't have it, works fine.
> Tables that have it, but only one materialized view, also works fine.
> But, a table with more than one materialized view, whoops, the cluster
> crashes temporarily, I can also see on the calling side (Java backend) that
> no nodes are responding:
>
> Caused by: com.datastax.driver.core.exceptions.WriteFailureException:
> Cassandra failure during write query at consistency LOCAL_QUORUM (2
> responses were required but only 1 replica responded, 2 failed)
>
> I am surprised by this behavior, because there is so little data involved,
> and it occurs when there is more than one materialized view only, so it
> might be a concurrency issue under the hood.
>
> Have you seen an issue like this?
>
> Here is a stack trace on the Cassandra's side:
>
> [cassandra-dc03-1] ERROR [MutationStage-1] 2024-05-17 08:51:47,425
> Keyspace.java:652 - Unknown exception caught while attempting to update
> MaterializedView! pope.unit
> [cassandra-dc03-1] org.apache.cassandra.exceptions.UnavailableException:
> Cannot achieve consistency level ONE
> [cassandra-dc03-1]  at
> org.apache.cassandra.exceptions.UnavailableException.create(UnavailableException.java:37)
> [cassandra-dc03-1]  at
> org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicas(ReplicaPlans.java:170)
> [cassandra-dc03-1]  at
> org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicasForWrite(ReplicaPlans.java:113)
> [cassandra-dc03-1]  at
> org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:354)
> [cassandra-dc03-1]  at
> org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:345)
> [cassandra-dc03-1]  at
> org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:339)
> [cassandra-dc03-1]  at
> org.apache.cassandra.service.StorageProxy.wrapViewBatchResponseHandler(StorageProxy.java:1312)
> [cassandra-dc03-1]  at
> org.apache.cassandra.service.StorageProxy.mutateMV(StorageProxy.java:1004)
> [cassandra-dc03-1]  at
> org.apache.cassandra.db.view.TableViews.pushViewReplicaUpdates(TableViews.java:167)
> [cassandra-dc03-1]  at
> org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:647)
> [cassandra-dc03-1]  at
> org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:477)
> [cassandra-dc03-1]  at
> org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:210)
> [cassandra-dc03-1]  at
> org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:58)
> [cassandra-dc03-1]  at
> org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
> [cassandra-dc03-1]  at
> org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
> [cassandra-dc03-1]  at
> org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
> [cassandra-dc03-1]  at
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:432)
> [cassandra-dc03-1]  at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown
> Source)
> [cassandra-dc03-1]  at
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165)
> [cassandra-dc03-1]  at
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137)
> [cassandra-dc03-1]  at
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
> [cassandra-dc03-1]  at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> [cassandra-dc03-1]  at java.base/java.lang.Thread.run(Unknown Source)
>
> --
> Bye,
> Gábor AUTH
>


Replication factor, LOCAL_QUORUM write consistency and materialized views

2024-05-17 Thread Gábor Auth
Hi,

I know, I know, materialized views are experimental... :)

So, I ran into a strange error. Among others, I have a very small 4-node
cluster with very little data (~100 MB in total); the keyspace's replication
factor is 3 and everything works fine... except: if I restart a node, I get a
lot of errors about materialized views and consistency level ONE, but only for
those tables that have more than one materialized view.

Tables without a materialized view don't show the problem and work fine.
Tables with only one materialized view also work fine.
But with a table that has more than one materialized view, whoops, the cluster
crashes temporarily, and I can also see on the calling side (Java backend) that
no nodes are responding:

Caused by: com.datastax.driver.core.exceptions.WriteFailureException:
Cassandra failure during write query at consistency LOCAL_QUORUM (2
responses were required but only 1 replica responded, 2 failed)

I am surprised by this behavior because there is so little data involved, and
it occurs only when there is more than one materialized view, so it might be a
concurrency issue under the hood.

Have you seen an issue like this?

Here is a stack trace from the Cassandra side:

[cassandra-dc03-1] ERROR [MutationStage-1] 2024-05-17 08:51:47,425
Keyspace.java:652 - Unknown exception caught while attempting to update
MaterializedView! pope.unit
[cassandra-dc03-1] org.apache.cassandra.exceptions.UnavailableException:
Cannot achieve consistency level ONE
[cassandra-dc03-1]  at
org.apache.cassandra.exceptions.UnavailableException.create(UnavailableException.java:37)
[cassandra-dc03-1]  at
org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicas(ReplicaPlans.java:170)
[cassandra-dc03-1]  at
org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicasForWrite(ReplicaPlans.java:113)
[cassandra-dc03-1]  at
org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:354)
[cassandra-dc03-1]  at
org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:345)
[cassandra-dc03-1]  at
org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:339)
[cassandra-dc03-1]  at
org.apache.cassandra.service.StorageProxy.wrapViewBatchResponseHandler(StorageProxy.java:1312)
[cassandra-dc03-1]  at
org.apache.cassandra.service.StorageProxy.mutateMV(StorageProxy.java:1004)
[cassandra-dc03-1]  at
org.apache.cassandra.db.view.TableViews.pushViewReplicaUpdates(TableViews.java:167)
[cassandra-dc03-1]  at
org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:647)
[cassandra-dc03-1]  at
org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:477)
[cassandra-dc03-1]  at
org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:210)
[cassandra-dc03-1]  at
org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:58)
[cassandra-dc03-1]  at
org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
[cassandra-dc03-1]  at
org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
[cassandra-dc03-1]  at
org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
[cassandra-dc03-1]  at
org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:432)
[cassandra-dc03-1]  at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown
Source)
[cassandra-dc03-1]  at
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165)
[cassandra-dc03-1]  at
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137)
[cassandra-dc03-1]  at
org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
[cassandra-dc03-1]  at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[cassandra-dc03-1]  at java.base/java.lang.Thread.run(Unknown Source)

-- 
Bye,
Gábor AUTH


Re: write on ONE node vs replication factor

2023-07-16 Thread Anurag Bisht
Thank you Dipan, it makes sense now.

Cheers,
Anurag

On Sun, Jul 16, 2023 at 12:43 AM Dipan Shah  wrote:

> Hello Anurag,
>
> In Cassandra, Strong consistency is guaranteed when "R + W > N" where R is
> Read consistency, W is Write consistency and N is the Replication Factor.
>
> So in your case, R(2) + W(1) = 3 which is NOT greater than your
> replication factor(3) so you will not be able to guarantee strong
> consistency. This is because you will write to 1 replica but your immediate
> read can go to the other 2(quorum) replicas and they might not be updated
> yet.
>
> On Sun, Jul 16, 2023 at 8:06 AM Anurag Bisht 
> wrote:
>
>> thank you Jeff,
>> it makes more sense now. How about I write with ONE consistency,
>> replication factor = 3 and read consistency is QUORUM. I am guessing in
>> that case, I will not have the empty read even if it is happened
>> immediately after the write request, let me know your thoughts ?
>>
>> Cheers,
>> Anurag
>>
>> On Sat, Jul 15, 2023 at 7:28 PM Jeff Jirsa  wrote:
>>
>>> Consistency level controls when queries acknowledge/succeed
>>>
>>> Replication factor is where data lives / how many copies
>>>
>>> If you write at consistency ONE and replication factor 3, the query
>>> finishes successfully when the write is durable on one of the 3 copies.
>>>
>>> It will get sent to all 3, but it’ll return when it’s durable on one.
>>>
>>> If you write at ONE and it goes to the first replica, and you read at
>>> ONE and it reads from the last replica, it may return without the data:
>>> you may not see a given write right away.
>>>
>>> > On Jul 15, 2023, at 7:05 PM, Anurag Bisht 
>>> wrote:
>>> >
>>> > 
>>> > Hello Users,
>>> >
>>> > I am new to Cassandra and trying to understand the architecture of it.
>>> If I write to ONE node for a particular key and have a replication factor
>>> of 3, would the written key will get replicated to the other two nodes ?
>>> Let  me know if I am thinking incorrectly.
>>> >
>>> > Thanks,
>>> > Anurag
>>>
>>
>
> --
>
> Thanks,
>
> *Dipan Shah*
>
> *Data Engineer*
>
>
>
> 3 Washington Circle NW, Suite 301
>
> Washington, D.C. 20037
>
>
> *Check out our **blog* <https://blog.anant.us/>*!*
>
>
>


Re: write on ONE node vs replication factor

2023-07-16 Thread Dipan Shah
Hello Anurag,

In Cassandra, strong consistency is guaranteed when "R + W > N", where R is
the read consistency, W is the write consistency and N is the replication factor.

So in your case, R(2) + W(1) = 3, which is NOT greater than your replication
factor (3), so you will not be able to guarantee strong consistency. This is
because you will write to 1 replica but your immediate read can go to the
other 2 (quorum) replicas and they might not be updated yet.
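
For example, with RF = 3 the arithmetic works out as:

  W=ONE(1)    + R=ONE(1)    = 2  (not > 3)  -> a read may miss the write
  W=ONE(1)    + R=QUORUM(2) = 3  (not > 3)  -> still not guaranteed
  W=QUORUM(2) + R=QUORUM(2) = 4  (> 3)      -> read and write replica sets must overlap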

On Sun, Jul 16, 2023 at 8:06 AM Anurag Bisht 
wrote:

> thank you Jeff,
> it makes more sense now. How about I write with ONE consistency,
> replication factor = 3 and read consistency is QUORUM. I am guessing in
> that case, I will not have the empty read even if it is happened
> immediately after the write request, let me know your thoughts ?
>
> Cheers,
> Anurag
>
> On Sat, Jul 15, 2023 at 7:28 PM Jeff Jirsa  wrote:
>
>> Consistency level controls when queries acknowledge/succeed
>>
>> Replication factor is where data lives / how many copies
>>
>> If you write at consistency ONE and replication factor 3, the query
>> finishes successfully when the write is durable on one of the 3 copies.
>>
>> It will get sent to all 3, but it’ll return when it’s durable on one.
>>
>> If you write at ONE and it goes to the first replica, and you read at ONE
>> and it reads from the last replica, it may return without the data:  you
>> may not see a given write right away.
>>
>> > On Jul 15, 2023, at 7:05 PM, Anurag Bisht 
>> wrote:
>> >
>> > 
>> > Hello Users,
>> >
>> > I am new to Cassandra and trying to understand the architecture of it.
>> If I write to ONE node for a particular key and have a replication factor
>> of 3, would the written key will get replicated to the other two nodes ?
>> Let  me know if I am thinking incorrectly.
>> >
>> > Thanks,
>> > Anurag
>>
>

-- 

Thanks,

*Dipan Shah*

*Data Engineer*



3 Washington Circle NW, Suite 301

Washington, D.C. 20037


*Check out our **blog* <https://blog.anant.us/>*!*




Re: write on ONE node vs replication factor

2023-07-15 Thread Anurag Bisht
Thank you, Jeff,
it makes more sense now. What if I write with ONE consistency, replication
factor = 3, and read consistency QUORUM? I am guessing that in that case I will
not get an empty read even if the read happens immediately after the write
request; let me know your thoughts.

Cheers,
Anurag

On Sat, Jul 15, 2023 at 7:28 PM Jeff Jirsa  wrote:

> Consistency level controls when queries acknowledge/succeed
>
> Replication factor is where data lives / how many copies
>
> If you write at consistency ONE and replication factor 3, the query
> finishes successfully when the write is durable on one of the 3 copies.
>
> It will get sent to all 3, but it’ll return when it’s durable on one.
>
> If you write at ONE and it goes to the first replica, and you read at ONE
> and it reads from the last replica, it may return without the data:  you
> may not see a given write right away.
>
> > On Jul 15, 2023, at 7:05 PM, Anurag Bisht 
> wrote:
> >
> > 
> > Hello Users,
> >
> > I am new to Cassandra and trying to understand the architecture of it.
> If I write to ONE node for a particular key and have a replication factor
> of 3, would the written key will get replicated to the other two nodes ?
> Let  me know if I am thinking incorrectly.
> >
> > Thanks,
> > Anurag
>


Re: write on ONE node vs replication factor

2023-07-15 Thread Jeff Jirsa
Consistency level controls when queries acknowledge/succeed

Replication factor is where data lives / how many copies 

If you write at consistency ONE and replication factor 3, the query finishes 
successfully when the write is durable on one of the 3 copies.

It will get sent to all 3, but it’ll return when it’s durable on one.

If you write at ONE and it goes to the first replica, and you read at ONE and 
it reads from the last replica, it may return without the data:  you may not 
see a given write right away. 
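
For illustration, here is a minimal sketch of where that knob sits in client
code, using the DataStax Java driver 3.x API (the same com.datastax.driver.core
classes that appear elsewhere in this archive); the keyspace and table names
are made up for the example:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class ConsistencyExample {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("my_ks")) {
                // Acknowledged as soon as ONE of the RF=3 replicas has the write durable;
                // the coordinator still sends the mutation to all three replicas.
                session.execute(new SimpleStatement(
                        "INSERT INTO users (id, name) VALUES (1, 'anurag')")
                        .setConsistencyLevel(ConsistencyLevel.ONE));

                // Reading at QUORUM (2 of 3) instead of ONE makes a stale or empty read
                // much less likely, but with a ONE write it is still not strictly
                // guaranteed (see the rest of the thread).
                session.execute(new SimpleStatement(
                        "SELECT name FROM users WHERE id = 1")
                        .setConsistencyLevel(ConsistencyLevel.QUORUM));
            }
        }
    }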

> On Jul 15, 2023, at 7:05 PM, Anurag Bisht  wrote:
> 
> 
> Hello Users,
> 
> I am new to Cassandra and trying to understand the architecture of it. If I 
> write to ONE node for a particular key and have a replication factor of 3, 
> would the written key will get replicated to the other two nodes ? Let  me 
> know if I am thinking incorrectly.
> 
> Thanks,
> Anurag


write on ONE node vs replication factor

2023-07-15 Thread Anurag Bisht
Hello Users,

I am new to Cassandra and trying to understand its architecture. If I
write to ONE node for a particular key and have a replication factor of 3,
will the written key get replicated to the other two nodes? Let me
know if I am thinking incorrectly.

Thanks,
Anurag


RE: Trouble After Changing Replication Factor

2021-10-13 Thread Isaeed Mohanna
Hi again
I did run repair -full without any parameters, which I understood runs repair for
all keyspaces, but I do not recall seeing validation tasks running on one of my two
main keyspaces, the one with most of the data. Maybe it failed or didn't run.
Anyhow, I tested with a small app on a small table that I have: the app would fail
before the repair, and after running repair -full on that specific table it runs
fine. So I am now running a full repair on the problematic keyspace; hopefully all
will be fine when the repair is done.
I am left wondering, though, why Cassandra allows this to happen. Most other
operations are somewhat guarded, and one would expect the RF change operation not
to complete without the actual changes having been carried out. I was surprised that
CL1 reads fail and that it could cause serious data inconsistencies. Maybe it is not
realistic to wait for the changes on large datasets, but I think the documentation
should warn that reads with CL1 will fail until a full repair is completed.
Thanks everyone for the help,
Isaeed Mohanna


From: Jeff Jirsa 
Sent: Tuesday, October 12, 2021 4:59 PM
To: cassandra 
Subject: Re: Trouble After Changing Replication Factor

The most likely explanation is that repair failed and you didnt notice.
Or that you didnt actually repair every host / every range.

Which version are you using?
How did you run repair?


On Tue, Oct 12, 2021 at 4:33 AM Isaeed Mohanna  wrote:
Hi
Yes I am sacrificing consistency to gain higher availability and faster speed, 
but my problem is not with newly inserted data that is not there for a very 
short period of time, my problem is the data that was there before the RF 
change, still do not exist in all replicas even after repair.
It looks like my cluster configuration is RF3 but the data itself is still 
using RF2 and when the data is requested from the 3rd (new) replica, it is not 
there and an empty record is returned with read CL1.
What can I do to force this data to be synced to all replicas as it should? So 
read CL1 request will actually return a correct result?

Thanks

From: Bowen Song 
Sent: Monday, October 11, 2021 5:13 PM
To: user@cassandra.apache.org
Subject: Re: Trouble After Changing Replication Factor


You have RF=3 and both read & write CL=1, which means you are asking Cassandra 
to give up strong consistency in order to gain higher availability and perhaps 
slight faster speed, and that's what you get. If you want to have strong 
consistency, you will need to make sure (read CL + write CL) > RF.
On 10/10/2021 11:55, Isaeed Mohanna wrote:
Hi
We had a cluster with 3 Nodes with Replication Factor 2 and we were using read 
with consistency Level One.
We recently added a 4th node and changed the replication factor to 3, once this 
was done apps reading from DB with CL1 would receive an empty record, Looking 
around I was surprised to learn that upon changing the replication factor if 
the read request is sent to a node the should own the record according to the 
new replication factor while it still doesn’t have it yet then an empty record 
will be returned because of CL1, the record will be written to that node after 
the repair operation is over.
We ran the repair operation which took days in our case (we had to change apps 
to CL2 to avoid serious data inconsistencies).
Now the repair operations are over and if I revert to CL1 we are still getting 
errors that records do not exist in DB while they do, using CL2 again it works 
fine.
Any ideas what I am missing?
Is there a way to validate that the repairs task has actually done what is 
needed and that the data is actually now replicated RF3 ?
Could it it be a Cassandra Driver issue? Since if I issue the request in cqlsh 
I do get the record but I cannot know if I am hitting the replica that doesn’t 
hold the record
Thanks for your help


Re: Trouble After Changing Replication Factor

2021-10-12 Thread Jeff Jirsa
The most likely explanation is that repair failed and you didn't notice.
Or that you didn't actually repair every host / every range.

Which version are you using?
How did you run repair?


On Tue, Oct 12, 2021 at 4:33 AM Isaeed Mohanna  wrote:

> Hi
>
> Yes I am sacrificing consistency to gain higher availability and faster
> speed, but my problem is not with newly inserted data that is not there for
> a very short period of time, my problem is the data that was there before
> the RF change, still do not exist in all replicas even after repair.
>
> It looks like my cluster configuration is RF3 but the data itself is still
> using RF2 and when the data is requested from the 3rd (new) replica, it
> is not there and an empty record is returned with read CL1.
>
> What can I do to force this data to be synced to all replicas as it
> should? So read CL1 request will actually return a correct result?
>
>
>
> Thanks
>
>
>
> *From:* Bowen Song 
> *Sent:* Monday, October 11, 2021 5:13 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Trouble After Changing Replication Factor
>
>
>
> You have RF=3 and both read & write CL=1, which means you are asking
> Cassandra to give up strong consistency in order to gain higher
> availability and perhaps slight faster speed, and that's what you get. If
> you want to have strong consistency, you will need to make sure (read CL +
> write CL) > RF.
>
> On 10/10/2021 11:55, Isaeed Mohanna wrote:
>
> Hi
>
> We had a cluster with 3 Nodes with Replication Factor 2 and we were using
> read with consistency Level One.
>
> We recently added a 4th node and changed the replication factor to 3,
> once this was done apps reading from DB with CL1 would receive an empty
> record, Looking around I was surprised to learn that upon changing the
> replication factor if the read request is sent to a node the should own the
> record according to the new replication factor while it still doesn’t have
> it yet then an empty record will be returned because of CL1, the record
> will be written to that node after the repair operation is over.
>
> We ran the repair operation which took days in our case (we had to change
> apps to CL2 to avoid serious data inconsistencies).
>
> Now the repair operations are over and if I revert to CL1 we are still
> getting errors that records do not exist in DB while they do, using CL2
> again it works fine.
>
> Any ideas what I am missing?
>
> Is there a way to validate that the repairs task has actually done what is
> needed and that the data is actually now replicated RF3 ?
>
> Could it it be a Cassandra Driver issue? Since if I issue the request in
> cqlsh I do get the record but I cannot know if I am hitting the replica
> that doesn’t hold the record
>
> Thanks for your help
>
>


Re: Trouble After Changing Replication Factor

2021-10-12 Thread Dmitry Saprykin
Hi,

You could try to run a full repair over a short subrange containing the data
missing from one replica. It should take just a couple of minutes and will
prove whether your repair failed to finish.
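
For reference, a subrange full repair of that shape would look roughly like the
following (the token values, keyspace and table names are placeholders):

    nodetool repair -full -st <start_token> -et <end_token> <keyspace> <table>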

Dmitrii Saprykin

On Tue, Oct 12, 2021 at 7:54 AM Bowen Song  wrote:

> I see. In that case, I suspect the repair wasn't fully successful. Try
> repair the new joined node again, and make sure it actually finishes
> successfully.
> On 12/10/2021 12:23, Isaeed Mohanna wrote:
>
> Hi
>
> Yes I am sacrificing consistency to gain higher availability and faster
> speed, but my problem is not with newly inserted data that is not there for
> a very short period of time, my problem is the data that was there before
> the RF change, still do not exist in all replicas even after repair.
>
> It looks like my cluster configuration is RF3 but the data itself is still
> using RF2 and when the data is requested from the 3rd (new) replica, it
> is not there and an empty record is returned with read CL1.
>
> What can I do to force this data to be synced to all replicas as it
> should? So read CL1 request will actually return a correct result?
>
>
>
> Thanks
>
>
>
> *From:* Bowen Song  
> *Sent:* Monday, October 11, 2021 5:13 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Trouble After Changing Replication Factor
>
>
>
> You have RF=3 and both read & write CL=1, which means you are asking
> Cassandra to give up strong consistency in order to gain higher
> availability and perhaps slight faster speed, and that's what you get. If
> you want to have strong consistency, you will need to make sure (read CL +
> write CL) > RF.
>
> On 10/10/2021 11:55, Isaeed Mohanna wrote:
>
> Hi
>
> We had a cluster with 3 Nodes with Replication Factor 2 and we were using
> read with consistency Level One.
>
> We recently added a 4th node and changed the replication factor to 3,
> once this was done apps reading from DB with CL1 would receive an empty
> record, Looking around I was surprised to learn that upon changing the
> replication factor if the read request is sent to a node the should own the
> record according to the new replication factor while it still doesn’t have
> it yet then an empty record will be returned because of CL1, the record
> will be written to that node after the repair operation is over.
>
> We ran the repair operation which took days in our case (we had to change
> apps to CL2 to avoid serious data inconsistencies).
>
> Now the repair operations are over and if I revert to CL1 we are still
> getting errors that records do not exist in DB while they do, using CL2
> again it works fine.
>
> Any ideas what I am missing?
>
> Is there a way to validate that the repairs task has actually done what is
> needed and that the data is actually now replicated RF3 ?
>
> Could it it be a Cassandra Driver issue? Since if I issue the request in
> cqlsh I do get the record but I cannot know if I am hitting the replica
> that doesn’t hold the record
>
> Thanks for your help
>
>


Re: Trouble After Changing Replication Factor

2021-10-12 Thread Bowen Song
I see. In that case, I suspect the repair wasn't fully successful. Try
repairing the newly joined node again, and make sure it actually finishes
successfully.


On 12/10/2021 12:23, Isaeed Mohanna wrote:


Hi

Yes I am sacrificing consistency to gain higher availability and 
faster speed, but my problem is not with newly inserted data that is 
not there for a very short period of time, my problem is the data that 
was there before the RF change, still do not exist in all replicas 
even after repair.


It looks like my cluster configuration is RF3 but the data itself is 
still using RF2 and when the data is requested from the 3rd (new) 
replica, it is not there and an empty record is returned with read CL1.


What can I do to force this data to be synced to all replicas as it 
should? So read CL1 request will actually return a correct result?


Thanks

*From:* Bowen Song 
*Sent:* Monday, October 11, 2021 5:13 PM
*To:* user@cassandra.apache.org
*Subject:* Re: Trouble After Changing Replication Factor

You have RF=3 and both read & write CL=1, which means you are asking 
Cassandra to give up strong consistency in order to gain higher 
availability and perhaps slight faster speed, and that's what you get. 
If you want to have strong consistency, you will need to make sure 
(read CL + write CL) > RF.


On 10/10/2021 11:55, Isaeed Mohanna wrote:

Hi

We had a cluster with 3 Nodes with Replication Factor 2 and we
were using read with consistency Level One.

We recently added a 4th node and changed the replication factor
to 3, once this was done apps reading from DB with CL1 would
receive an empty record, Looking around I was surprised to learn
that upon changing the replication factor if the read request is
sent to a node the should own the record according to the new
replication factor while it still doesn’t have it yet then an
empty record will be returned because of CL1, the record will be
written to that node after the repair operation is over.

We ran the repair operation which took days in our case (we had to
change apps to CL2 to avoid serious data inconsistencies).

Now the repair operations are over and if I revert to CL1 we are
still getting errors that records do not exist in DB while they
do, using CL2 again it works fine.

Any ideas what I am missing?

Is there a way to validate that the repairs task has actually done
what is needed and that the data is actually now replicated RF3 ?

Could it it be a Cassandra Driver issue? Since if I issue the
request in cqlsh I do get the record but I cannot know if I am
hitting the replica that doesn’t hold the record

Thanks for your help


RE: Trouble After Changing Replication Factor

2021-10-12 Thread Isaeed Mohanna
Hi
Yes, I am sacrificing consistency to gain higher availability and faster speed,
but my problem is not with newly inserted data that is missing for a very short
period of time; my problem is that data that was there before the RF change still
does not exist on all replicas, even after repair.
It looks like my cluster configuration is RF3 but the data itself is still
effectively at RF2: when the data is requested from the 3rd (new) replica, it is
not there and an empty record is returned with read CL1.
What can I do to force this data to be synced to all replicas as it should be,
so that a read CL1 request will actually return a correct result?

Thanks

From: Bowen Song 
Sent: Monday, October 11, 2021 5:13 PM
To: user@cassandra.apache.org
Subject: Re: Trouble After Changing Replication Factor


You have RF=3 and both read & write CL=1, which means you are asking Cassandra 
to give up strong consistency in order to gain higher availability and perhaps 
slight faster speed, and that's what you get. If you want to have strong 
consistency, you will need to make sure (read CL + write CL) > RF.
On 10/10/2021 11:55, Isaeed Mohanna wrote:
Hi
We had a cluster with 3 Nodes with Replication Factor 2 and we were using read 
with consistency Level One.
We recently added a 4th node and changed the replication factor to 3, once this 
was done apps reading from DB with CL1 would receive an empty record, Looking 
around I was surprised to learn that upon changing the replication factor if 
the read request is sent to a node the should own the record according to the 
new replication factor while it still doesn’t have it yet then an empty record 
will be returned because of CL1, the record will be written to that node after 
the repair operation is over.
We ran the repair operation which took days in our case (we had to change apps 
to CL2 to avoid serious data inconsistencies).
Now the repair operations are over and if I revert to CL1 we are still getting 
errors that records do not exist in DB while they do, using CL2 again it works 
fine.
Any ideas what I am missing?
Is there a way to validate that the repairs task has actually done what is 
needed and that the data is actually now replicated RF3 ?
Could it it be a Cassandra Driver issue? Since if I issue the request in cqlsh 
I do get the record but I cannot know if I am hitting the replica that doesn’t 
hold the record
Thanks for your help


Re: Trouble After Changing Replication Factor

2021-10-11 Thread Bowen Song
You have RF=3 and both read & write CL=1, which means you are asking 
Cassandra to give up strong consistency in order to gain higher 
availability and perhaps slight faster speed, and that's what you get. 
If you want to have strong consistency, you will need to make sure (read 
CL + write CL) > RF.


On 10/10/2021 11:55, Isaeed Mohanna wrote:


Hi

We had a cluster with 3 Nodes with Replication Factor 2 and we were 
using read with consistency Level One.


We recently added a 4th node and changed the replication factor to 3, 
once this was done apps reading from DB with CL1 would receive an 
empty record, Looking around I was surprised to learn that upon 
changing the replication factor if the read request is sent to a node 
the should own the record according to the new replication factor 
while it still doesn’t have it yet then an empty record will be 
returned because of CL1, the record will be written to that node after 
the repair operation is over.


We ran the repair operation which took days in our case (we had to 
change apps to CL2 to avoid serious data inconsistencies).


Now the repair operations are over and if I revert to CL1 we are still 
getting errors that records do not exist in DB while they do, using 
CL2 again it works fine.


Any ideas what I am missing?

Is there a way to validate that the repairs task has actually done 
what is needed and that the data is actually now replicated RF3 ?


Could it it be a Cassandra Driver issue? Since if I issue the request 
in cqlsh I do get the record but I cannot know if I am hitting the 
replica that doesn’t hold the record


Thanks for your help


Trouble After Changing Replication Factor

2021-10-10 Thread Isaeed Mohanna
Hi
We had a cluster with 3 nodes and replication factor 2, and we were reading with
consistency level ONE.
We recently added a 4th node and changed the replication factor to 3. Once this
was done, apps reading from the DB with CL1 would receive an empty record.
Looking around, I was surprised to learn that after changing the replication
factor, if a read request is sent to a node that should own the record according
to the new replication factor but does not have it yet, an empty record is
returned because of CL1; the record is only written to that node once the repair
operation is over.
We ran the repair operation, which took days in our case (we had to change the
apps to CL2 to avoid serious data inconsistencies).
Now the repair operations are over, but if I revert to CL1 we still get errors
that records do not exist in the DB while they do; with CL2 it works fine again.
Any ideas what I am missing?
Is there a way to validate that the repair task has actually done what is needed
and that the data is now actually replicated at RF3?
Could it be a Cassandra driver issue? If I issue the request in cqlsh I do get
the record, but I cannot know whether I am hitting the replica that doesn't hold
the record.
Thanks for your help


Re: Anti-entropy repair with a 4 node cluster replication factor 4

2020-10-27 Thread manish khandelwal
If you run a full repair then it should be fine, since all the replicas are
present on all the nodes. If you are using the -pr option then you need to run
it on all the nodes.
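
Concretely, the two forms would be roughly as follows (the keyspace name is a
placeholder):

    # full repair; with RF equal to the cluster size, as above, one node covers all data
    nodetool repair -full <keyspace>

    # primary-range-only repair; must be run on every node to cover the whole ring
    nodetool repair -full -pr <keyspace>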

On Tue, Oct 27, 2020 at 4:11 PM Fred Al  wrote:

> Hello!
> Running Cassandra 2.2.9 with a 4 node cluster with replication factor 4.
> When running anti-entropy repair is it required to run repair on all 4
> nodes or is it sufficient to run it on only one node?
> Since all data is replicated on all nodes i.m.o. only one node would need
> to be repaired to repair all data in the cluster.
> If it's required to run on all nodes, please give an explanation as to why?
>
> Regards
> Fredrik
>


Anti-entropy repair with a 4 node cluster replication factor 4

2020-10-27 Thread Fred Al
Hello!
Running Cassandra 2.2.9 with a 4 node cluster with replication factor 4.
When running anti-entropy repair is it required to run repair on all 4
nodes or is it sufficient to run it on only one node?
Since all data is replicated on all nodes i.m.o. only one node would need
to be repaired to repair all data in the cluster.
If it's required to run on all nodes, please give an explanation as to why?

Regards
Fredrik


Re: any risks with changing replication factor on live production cluster without downtime and service interruption?

2020-05-27 Thread Leena Ghatpande

Nothing complex. All we do is retry the read x number of times (a configurable
parameter) and, if it still fails, flag an alert.

But I agree with the solution that Jeff provided and will use that approach.

Thanks for all the responses.


From: Reid Pinchback 
Sent: Tuesday, May 26, 2020 11:33 PM
To: user@cassandra.apache.org 
Subject: Re: any risks with changing replication factor on live production 
cluster without downtime and service interruption?


By retry logic, I’m going to guess you are doing some kind of version 
consistency trick where you have a non-key column managing a visibility horizon 
to simulate a transaction, and you poll for a horizon value >= some threshold 
that the app is keeping aware of.

Note that these assorted variations on trying to do battle with eventual 
consistency can generate a lot of load on the cluster, unless there is enough 
latency in the progress of the logical flow at the app level that the 
optimistic concurrency hack almost always succeeds the first time anyways.

If this generates the degree of java garbage collection that I suspect, then 
the advice to upgrade C* becomes even more significant.  Repairs themselves can 
generate substantial memory load, and you could have a node or two drop out on 
you if they OOM. I’d definitely take Jeff’s advice about switching your reads 
to LOCAL_QUORUM until you’re done to buffer yourself from that risk.

From: Leena Ghatpande 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, May 26, 2020 at 1:20 PM
To: "user@cassandra.apache.org" 
Subject: Re: any risks with changing replication factor on live production 
cluster without downtime and service interruption?

Thank you for the response. Will follow the recommendation for the update. So 
with Read=LOCAL_QUORUM we should see some latency, but not failures during RF 
change right?

We do mitigate the issue of not seeing writes when set to Local_one, by having 
a Retry logic in the app

From: Leena Ghatpande 
Sent: Friday, May 22, 2020 11:51 AM
To: cassandra cassandra 
Subject: any risks with changing replication factor on live production cluster 
without downtime and service interruption?

We are on Cassandra 3.7 and have a 12 node cluster , 2DC, with 6 nodes in each 
DC. RF=3

We have around 150M rows across tables.

We are planning to add more nodes to the cluster, and thinking of changing the 
replication factor to 5 for each DC.

Our application uses the below consistency level
 read-level: LOCAL_ONE
 write-level: LOCAL_QUORUM

if we change the RF=5 on live cluster, and run full repairs, would we see 
read/write errors while data is being replicated?

if so, This is not something that we can afford in production, so how would we 
avoid this?


Re: any risks with changing replication factor on live production cluster without downtime and service interruption?

2020-05-26 Thread Reid Pinchback
By retry logic, I’m going to guess you are doing some kind of version 
consistency trick where you have a non-key column managing a visibility horizon 
to simulate a transaction, and you poll for a horizon value >= some threshold 
that the app is keeping aware of.

Note that these assorted variations on trying to do battle with eventual 
consistency can generate a lot of load on the cluster, unless there is enough 
latency in the progress of the logical flow at the app level that the 
optimistic concurrency hack almost always succeeds the first time anyways.

If this generates the degree of java garbage collection that I suspect, then 
the advice to upgrade C* becomes even more significant.  Repairs themselves can 
generate substantial memory load, and you could have a node or two drop out on 
you if they OOM. I’d definitely take Jeff’s advice about switching your reads 
to LOCAL_QUORUM until you’re done to buffer yourself from that risk.


From: Leena Ghatpande 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, May 26, 2020 at 1:20 PM
To: "user@cassandra.apache.org" 
Subject: Re: any risks with changing replication factor on live production 
cluster without downtime and service interruption?

Thank you for the response. Will follow the recommendation for the update. So 
with Read=LOCAL_QUORUM we should see some latency, but not failures during RF 
change right?

We do mitigate the issue of not seeing writes when set to Local_one, by having 
a Retry logic in the app



From: Leena Ghatpande 
Sent: Friday, May 22, 2020 11:51 AM
To: cassandra cassandra 
Subject: any risks with changing replication factor on live production cluster 
without downtime and service interruption?

We are on Cassandra 3.7 and have a 12 node cluster , 2DC, with 6 nodes in each 
DC. RF=3
We have around 150M rows across tables.

We are planning to add more nodes to the cluster, and thinking of changing the 
replication factor to 5 for each DC.

Our application uses the below consistency level
 read-level: LOCAL_ONE
 write-level: LOCAL_QUORUM

if we change the RF=5 on live cluster, and run full repairs, would we see 
read/write errors while data is being replicated?
if so, This is not something that we can afford in production, so how would we 
avoid this?


Re: any risks with changing replication factor on live production cluster without downtime and service interruption?

2020-05-26 Thread Leena Ghatpande
Thank you for the response. We will follow the recommendation for the update. So
with read=LOCAL_QUORUM we should see some extra latency, but not failures, during
the RF change, right?

We do mitigate the issue of not seeing writes when set to LOCAL_ONE by having
retry logic in the app.



From: Leena Ghatpande 
Sent: Friday, May 22, 2020 11:51 AM
To: cassandra cassandra 
Subject: any risks with changing replication factor on live production cluster 
without downtime and service interruption?

We are on Cassandra 3.7 and have a 12 node cluster , 2DC, with 6 nodes in each 
DC. RF=3
We have around 150M rows across tables.

We are planning to add more nodes to the cluster, and thinking of changing the 
replication factor to 5 for each DC.

Our application uses the below consistency level
 read-level: LOCAL_ONE
 write-level: LOCAL_QUORUM

if we change the RF=5 on live cluster, and run full repairs, would we see 
read/write errors while data is being replicated?
if so, This is not something that we can afford in production, so how would we 
avoid this?


Re: any risks with changing replication factor on live production cluster without downtime and service interruption?

2020-05-25 Thread Oleksandr Shulgin
On Fri, May 22, 2020 at 9:51 PM Jeff Jirsa  wrote:

> With those consistency levels it’s already possible you don’t see your
> writes, so you’re already probably seeing some of what would happen if you
> went to RF=5 like that - just less common
>
> If you did what you describe you’d have a 40% chance on each read of not
> seeing any data (or not seeing the most recent data) until repair runs.
>
> Alternatively:
> - you change the app to read at local quorum
> - you change RF from 3 to 4
> - run repair
> - change RF from 4 to 5
> - run repair
> - change the app to read local_one
>
> Then you’re back to status quo where you probably see most writes but it’s
> not strictly guaranteed
>

The very first step to consider is to upgrade to the latest 3.11.  Running
3.7 in production is, ugh, an odd choice of version in 2020.

--
Alex


Re: any risks with changing replication factor on live production cluster without downtime and service interruption?

2020-05-22 Thread Jeff Jirsa
With those consistency levels it’s already possible you don’t see your writes, 
so you’re already probably seeing some of what would happen if you went to RF=5 
like that - just less common

If you did what you describe you’d have a 40% chance on each read of not seeing 
any data (or not seeing the most recent data) until repair runs.

Alternatively:
- you change the app to read at local quorum
- you change RF from 3 to 4
- run repair
- change RF from 4 to 5
- run repair 
- change the app to read local_one 

Then you’re back to status quo where you probably see most writes but it’s not 
strictly guaranteed 

> On May 22, 2020, at 8:51 AM, Leena Ghatpande  wrote:
> 
> 
> We are on Cassandra 3.7 and have a 12 node cluster , 2DC, with 6 nodes in 
> each DC. RF=3
> We have around 150M rows across tables.
> 
> We are planning to add more nodes to the cluster, and thinking of changing 
> the replication factor to 5 for each DC. 
> 
> Our application uses the below consistency level
>  read-level: LOCAL_ONE
>  write-level: LOCAL_QUORUM
> 
> if we change the RF=5 on live cluster, and run full repairs, would we see 
> read/write errors while data is being replicated? 
> if so, This is not something that we can afford in production, so how would 
> we avoid this?


any risks with changing replication factor on live production cluster without downtime and service interruption?

2020-05-22 Thread Leena Ghatpande
We are on Cassandra 3.7 and have a 12-node cluster, 2 DCs, with 6 nodes in each
DC. RF=3.
We have around 150M rows across tables.

We are planning to add more nodes to the cluster, and thinking of changing the 
replication factor to 5 for each DC.

Our application uses the below consistency level
 read-level: LOCAL_ONE
 write-level: LOCAL_QUORUM

If we change to RF=5 on the live cluster and run full repairs, would we see
read/write errors while data is being replicated?
If so, this is not something that we can afford in production, so how would we
avoid this?


Re: system_auth keyspace replication factor

2018-11-26 Thread Sam Tunnicliffe
> I suspect some of the intermediate queries (determining role, etc) happen at 
> quorum in 2.2+, but I don’t have time to go read the code and prove it. 

This isn’t true. Aside from when using the default superuser, only 
CRM::getAllRoles reads at QUORUM (because the resultset would include the 
default superuser if present). This is only called during execution of a LIST 
ROLES statement and isn’t on the login path.

From the driver log you can see that the actual authentication exchange happens 
quickly, so I’d say that the problem described in CSHARP-436 is a more likely 
candidate. 

> Sadly, this recommendation is out of date / incorrect.  For `system_auth` we 
> are mostly using a formula like `RF=min(num_dc_nodes, 5)` and see no issues.

+1 to that, RF=N is way over the top. 

Thanks,
Sam


> On 26 Nov 2018, at 09:44, Oleksandr Shulgin  
> wrote:
> 
> On Fri, Nov 23, 2018 at 5:38 PM Vitali Dyachuk  wrote:
> 
> We have recently met a problem when we added 60 nodes in 1 region to the 
> cluster
> and set an RF=60 for the system_auth ks, following this documentation 
> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html 
> 
> 
> Sadly, this recommendation is out of date / incorrect.  For `system_auth` we 
> are mostly using a formula like `RF=min(num_dc_nodes, 5)` and see no issues.
> 
> Is there a chance to correct the documentation @datastax?
> 
> Regards,
> --
> Alex
> 



Re: system_auth keyspace replication factor

2018-11-26 Thread Oleksandr Shulgin
On Fri, Nov 23, 2018 at 5:38 PM Vitali Dyachuk  wrote:

>
> We have recently met a problem when we added 60 nodes in 1 region to the
> cluster
> and set an RF=60 for the system_auth ks, following this documentation
> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
>

Sadly, this recommendation is out of date / incorrect.  For `system_auth`
we are mostly using a formula like `RF=min(num_dc_nodes, 5)` and see no
issues.
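
For a setup like the one described earlier in the thread (4 DCs), that guideline
would translate into something roughly like the following, with the DC names as
placeholders and each count capped at min(nodes_in_dc, 5):

    ALTER KEYSPACE system_auth WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc1': 5, 'dc2': 5, 'dc3': 5, 'dc4': 5};

    $ nodetool repair -full system_auth    (on each node, after the change)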

Is there a chance to correct the documentation @datastax?

Regards,
--
Alex


Re: system_auth keyspace replication factor

2018-11-23 Thread Vitali Dyachuk
Attaching the runner log snippet, where we can see that "Rebuilding token
map" took most of the time.
getAllRoles is using QUORUM, though I don't know whether it is used during login:
https://github.com/apache/cassandra/blob/cc12665bb7645d17ba70edcf952ee6a1ea63127b/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L260

Vitali Djatsuk,
On Fri, Nov 23, 2018 at 8:32 PM Jeff Jirsa  wrote:

> I suspect some of the intermediate queries (determining role, etc) happen
> at quorum in 2.2+, but I don’t have time to go read the code and prove it.
>
> In any case, RF > 10 per DC is probably excessive
>
> Also want to crank up the validity times so it uses cached info longer
>
>
> --
> Jeff Jirsa
>
>
> On Nov 23, 2018, at 10:18 AM, Vitali Dyachuk  wrote:
>
> no its not a cassandra user and as i understood all other users login
> local_one.
>
> On Fri, 23 Nov 2018, 19:30 Jonathan Haddad 
>> Any chance you’re logging in with the Cassandra user? It uses quorum
>> reads.
>>
>>
>> On Fri, Nov 23, 2018 at 11:38 AM Vitali Dyachuk 
>> wrote:
>>
>>> Hi,
>>> We have recently met a problem when we added 60 nodes in 1 region to the
>>> cluster
>>> and set an RF=60 for the system_auth ks, following this documentation
>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
>>> However we've started to see increased login latencies in the cluster 5x
>>> bigger than before changing RF of system_auth ks.
>>> We have casandra runner written is csharp, running against the cluster,
>>> when analyzing the logs we notices that   Rebuilding token map  is
>>> taking most of the time ~20s.
>>> When we changed RF to 3 the issue has resolved.
>>> We are using C* 3.0.17 , 4 DC, system_auth RF=3, "CassandraCSharpDriver"
>>> version="3.2.1"
>>> I've found somehow related to my problem ticket
>>> https://datastax-oss.atlassian.net/browse/CSHARP-436 but it says in the
>>> related tickets, that the issue with the token map rebuild time has been
>>> fixed in the previous versions of the driver.
>>> So my question is what is the best recommendation of the setting
>>> system_auth ks RF?
>>>
>>> Regards,
>>> Vitali Djatsuk.
>>>
>>>
>>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> twitter: rustyrazorblade
>>
>
ControlConnection: 11/22/2018 10:30:32.170 +00:00 : Trying to connect the ControlConnection
TcpSocket: 11/22/2018 10:30:32.170 +00:00 Socket connected, starting SSL client authentication
TcpSocket: 11/22/2018 10:30:32.170 +00:00 Starting SSL authentication
TcpSocket: 11/22/2018 10:30:32.217 +00:00 SSL authentication successful
Connection: 11/22/2018 10:30:32.217 +00:00 Sending #0 for StartupRequest to node1:9042
Connection: 11/22/2018 10:30:32.217 +00:00 Received #0 from node1:9042
Connection: 11/22/2018 10:30:32.217 +00:00 Sending #0 for AuthResponseRequest to node1:9042
Connection: 11/22/2018 10:30:32.329 +00:00 Received #0 from node1:9042
ControlConnection: 11/22/2018 10:30:32.329 +00:00 : Connection established to node1:9042
Connection: 11/22/2018 10:30:32.329 +00:00 Sending #0 for RegisterForEventRequest to node1:9042
Connection: 11/22/2018 10:30:32.329 +00:00 Received #0 from node1:9042
ControlConnection: 11/22/2018 10:30:32.329 +00:00 : Refreshing node list
Connection: 11/22/2018 10:30:32.329 +00:00 Sending #0 for QueryRequest to node1:9042
Connection: 11/22/2018 10:30:32.342 +00:00 Received #0 from node1:9042
Connection: 11/22/2018 10:30:32.342 +00:00 Sending #0 for QueryRequest to node1:9042
Connection: 11/22/2018 10:30:32.373 +00:00 Received #0 from node1:9042
ControlConnection: 11/22/2018 10:30:32.389 +00:00 : Node list retrieved successfully
ControlConnection: 11/22/2018 10:30:32.389 +00:00 : Retrieving keyspaces metadata
Connection: 11/22/2018 10:30:32.389 +00:00 Sending #0 for QueryRequest to node1:9042
Connection: 11/22/2018 10:30:32.389 +00:00 Received #0 from node1:9042
ControlConnection: 11/22/2018 10:30:32.389 +00:00 : Rebuilding token map
Cluster: 11/22/2018 10:30:55.233 +00:00 : Cluster Connected using binary protocol version: [4]


Re: system_auth keyspace replication factor

2018-11-23 Thread Jeff Jirsa
I suspect some of the intermediate queries (determining role, etc) happen at 
quorum in 2.2+, but I don’t have time to go read the code and prove it. 

In any case, RF > 10 per DC is probably excessive

Also want to crank up the validity times so it uses cached info longer


-- 
Jeff Jirsa


> On Nov 23, 2018, at 10:18 AM, Vitali Dyachuk  wrote:
> 
> No, it's not the cassandra user, and as I understood it all other users log
> in at LOCAL_ONE.
> 
> On Fri, 23 Nov 2018, 19:30 Jonathan Haddad wrote:
>> Any chance you’re logging in with the Cassandra user? It uses quorum reads.
>> 
>> 
>>> On Fri, Nov 23, 2018 at 11:38 AM Vitali Dyachuk  wrote:
>>> Hi,
>>> We recently ran into a problem when we added 60 nodes in one region to the
>>> cluster and set RF=60 for the system_auth keyspace, following this
>>> documentation:
>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
>>> However, we've started to see login latencies in the cluster 5x higher than
>>> before changing the RF of the system_auth keyspace.
>>> We have a Cassandra runner written in C# running against the cluster; when
>>> analyzing its logs we noticed that "Rebuilding token map" is taking most of
>>> the time (~20s).
>>> When we changed the RF back to 3, the issue was resolved.
>>> We are using C* 3.0.17, 4 DCs, system_auth RF=3, "CassandraCSharpDriver"
>>> version="3.2.1".
>>> I've found a ticket somewhat related to my problem,
>>> https://datastax-oss.atlassian.net/browse/CSHARP-436, but its related
>>> tickets say that the token map rebuild time issue was fixed in previous
>>> versions of the driver.
>>> So my question is: what is the recommended replication factor for the
>>> system_auth keyspace?
>>> 
>>> Regards,
>>> Vitali Djatsuk.
>>> 
>>> 
>> -- 
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> twitter: rustyrazorblade


Re: system_auth keyspace replication factor

2018-11-23 Thread Vitali Dyachuk
No, it's not the cassandra user, and as I understood it all other users log
in at LOCAL_ONE.

On Fri, 23 Nov 2018, 19:30 Jonathan Haddad wrote:
> Any chance you’re logging in with the Cassandra user? It uses quorum
> reads.
>
>
> On Fri, Nov 23, 2018 at 11:38 AM Vitali Dyachuk 
> wrote:
>
>> Hi,
>> We recently ran into a problem when we added 60 nodes in one region to the
>> cluster and set RF=60 for the system_auth keyspace, following this
>> documentation:
>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
>> However, we've started to see login latencies in the cluster 5x higher than
>> before changing the RF of the system_auth keyspace.
>> We have a Cassandra runner written in C# running against the cluster; when
>> analyzing its logs we noticed that "Rebuilding token map" is taking most of
>> the time (~20s).
>> When we changed the RF back to 3, the issue was resolved.
>> We are using C* 3.0.17, 4 DCs, system_auth RF=3, "CassandraCSharpDriver"
>> version="3.2.1".
>> I've found a ticket somewhat related to my problem,
>> https://datastax-oss.atlassian.net/browse/CSHARP-436, but its related
>> tickets say that the token map rebuild time issue was fixed in previous
>> versions of the driver.
>> So my question is: what is the recommended replication factor for the
>> system_auth keyspace?
>>
>> Regards,
>> Vitali Djatsuk.
>>
>>
>> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
>


Re: system_auth keyspace replication factor

2018-11-23 Thread Jonathan Haddad
Any chance you’re logging in with the Cassandra user? It uses quorum reads.


On Fri, Nov 23, 2018 at 11:38 AM Vitali Dyachuk  wrote:

> Hi,
> We recently ran into a problem when we added 60 nodes in one region to the
> cluster and set RF=60 for the system_auth keyspace, following this
> documentation:
> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
> However, we've started to see login latencies in the cluster 5x higher than
> before changing the RF of the system_auth keyspace.
> We have a Cassandra runner written in C# running against the cluster; when
> analyzing its logs we noticed that "Rebuilding token map" is taking most of
> the time (~20s).
> When we changed the RF back to 3, the issue was resolved.
> We are using C* 3.0.17, 4 DCs, system_auth RF=3, "CassandraCSharpDriver"
> version="3.2.1".
> I've found a ticket somewhat related to my problem,
> https://datastax-oss.atlassian.net/browse/CSHARP-436, but its related
> tickets say that the token map rebuild time issue was fixed in previous
> versions of the driver.
> So my question is: what is the recommended replication factor for the
> system_auth keyspace?
>
> Regards,
> Vitali Djatsuk.
>
>
> --
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade


system_auth keyspace replication factor

2018-11-23 Thread Vitali Dyachuk
Hi,
We recently ran into a problem when we added 60 nodes in one region to the
cluster and set RF=60 for the system_auth keyspace, following this
documentation:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
However, we've started to see login latencies in the cluster 5x higher than
before changing the RF of the system_auth keyspace.
We have a Cassandra runner written in C# running against the cluster; when
analyzing its logs we noticed that "Rebuilding token map" is taking most of
the time (~20s).
When we changed the RF back to 3, the issue was resolved.
We are using C* 3.0.17, 4 DCs, system_auth RF=3, "CassandraCSharpDriver"
version="3.2.1".
I've found a ticket somewhat related to my problem,
https://datastax-oss.atlassian.net/browse/CSHARP-436, but its related
tickets say that the token map rebuild time issue was fixed in previous
versions of the driver.
So my question is: what is the recommended replication factor for the
system_auth keyspace?

Regards,
Vitali Djatsuk.


Re: Tuning Replication Factor - All, Consistency ONE

2018-07-11 Thread Jürgen Albersdorfer
And by all means, do not treat Cassandra as a relational database - beware
of the limitations of CQL in contrast to SQL.
I don't want to argue against Cassandra, because I like it for what it was
primarily designed for - horizontal scalability for HUGE amounts of data.
It is good for accessing your data by key, but not for searching. High
availability is a nice giveaway here.
If you end up having only one table in C*, maybe something like Redis would
work for your needs, too.

Some hints from my own experience with it - if you choose to use Cassandra:
Have at least as many racks as the replication factor - a replication factor
of 3 means you want at least 3 racks.
Choose your partitioning wisely - it starts becoming relevant from 10 million
records onwards.
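
For illustration only, a minimal access-by-key sketch in CQL (table and
column names are made up, not from this thread):

    CREATE TABLE users_by_id (
        user_id uuid PRIMARY KEY,   -- the key you always look up by
        password_hash text,
        enabled boolean
    );

    -- SELECT * FROM users_by_id WHERE user_id = ?;   <- the access pattern C* is good at
    -- arbitrary searches on other columns are the part to avoid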

regards,
Jürgen

Am Di., 10. Juli 2018 um 18:18 Uhr schrieb Jeff Jirsa :

>
>
> On Tue, Jul 10, 2018 at 8:29 AM, Code Wiget  wrote:
>
>> Hi,
>>
>> I have been tasked with picking and setting up a database with the
>> following characteristics:
>>
>>- Ultra-high availability - The real requirement is uptime - our
>>whole platform becomes inaccessible without a “read” from the database. We
>>need the read to authenticate users. Databases will never be spread across
>>multiple networks.
>>
>>
> Sooner or later life will happen and you're going to have some
> unavailability - may be worth taking the time to make it fail gracefully
> (cache auth responses, etc).
>
>>
>>- Reasonably quick access speeds
>>- Very low data storage - The data storage is very low - for 10
>>million users, we would have around 8GB of storage total.
>>
>> Having done a bit of research on Cassandra, I think the optimal approach
>> for my use-case would be to replicate the data on *ALL* nodes possible,
>> but require reads to only have a consistency level of one. So, in the case
>> that a node goes down, we can still read/write to other nodes. It is not
>> very important that a read be unanimously agreed upon, as long as Cassandra
>> is eventually consistent, within around 1s, then there shouldn’t be an
>> issue.
>>
>
> Seems like a reasonably good fit, but there's no 1s guarantee - it'll
> USUALLY happen within milliseconds, but the edge cases don't have a strict
> guarantee at all (imagine two hosts in adjacent racks, the link between the
> two racks goes down, but both are otherwise functional - a query at ONE in
> either rack would be able to read and write data, but it would diverge
> between the two racks for some period of time).
>
>
>>
>> When I go to set up the database though, I am required to set a
>> replication factor to a number - 1,2,3,etc. So I can’t just say “ALL” and
>> have it replicate to all nodes.
>>
>
> That option doesn't exist. It's been proposed (and exists in Datastax
> Enterprise, which is a proprietary fork), but reportedly causes quite a bit
> of pain when misused, so people have successfully lobbied against its
> inclusion in OSS Apache Cassandra. You could (assuming some basic java
> knowledge) extend NetworkTopologyStrategy to have it accomplish this, but I
> imagine you don't REALLY want this unless you're frequently auto-scaling
> nodes in/out of the cluster. You should probably just pick a high RF and
> you'll be OK with it.
>
>
>> Right now, I have a 2 node cluster with replication factor 3. Will this
>> cause any issues, having a RF > #nodes? Or is there a way to just have it
>> copy to *all* nodes?
>>
>
> It's obviously not the intended config, but I don't think it'll cause many
> problems.
>
>
>> Is there any way that I can tune Cassandra to be more read-optimized?
>>
>>
> Yes - definitely use leveled compaction instead of STCS (the default), and
> definitely take the time to tune the JVM args - read path generates a lot
> of short lived java objects, so a larger eden will help you (maybe up to
> 40-50% of max heap size).
>
>
>> Finally, I have some misgivings about how well Cassandra fits my
>> use-case. Please, if anyone has a suggestion as to why or why not it is a
>> good fit, I would really appreciate your input! If this could be done with
>> a simple SQL database and this is overkill, please let me know.
>>
>> Thanks for your input!
>>
>>
>


Re: Tuning Replication Factor - All, Consistency ONE

2018-07-10 Thread Jeff Jirsa
On Tue, Jul 10, 2018 at 8:29 AM, Code Wiget  wrote:

> Hi,
>
> I have been tasked with picking and setting up a database with the
> following characteristics:
>
>- Ultra-high availability - The real requirement is uptime - our whole
>platform becomes inaccessible without a “read” from the database. We need
>the read to authenticate users. Databases will never be spread across
>multiple networks.
>
>
Sooner or later life will happen and you're going to have some
unavailability - may be worth taking the time to make it fail gracefully
(cache auth responses, etc).

>
>- Reasonably quick access speeds
>- Very low data storage - The data storage is very low - for 10
>million users, we would have around 8GB of storage total.
>
> Having done a bit of research on Cassandra, I think the optimal approach
> for my use-case would be to replicate the data on *ALL* nodes possible,
> but require reads to only have a consistency level of one. So, in the case
> that a node goes down, we can still read/write to other nodes. It is not
> very important that a read be unanimously agreed upon, as long as Cassandra
> is eventually consistent, within around 1s, then there shouldn’t be an
> issue.
>

Seems like a reasonably good fit, but there's no 1s guarantee - it'll
USUALLY happen within milliseconds, but the edge cases don't have a strict
guarantee at all (imagine two hosts in adjacent racks, the link between the
two racks goes down, but both are otherwise functional - a query at ONE in
either rack would be able to read and write data, but it would diverge
between the two racks for some period of time).


>
> When I go to set up the database though, I am required to set a
> replication factor to a number - 1,2,3,etc. So I can’t just say “ALL” and
> have it replicate to all nodes.
>

That option doesn't exist. It's been proposed (and exists in Datastax
Enterprise, which is a proprietary fork), but reportedly causes quite a bit
of pain when misused, so people have successfully lobbied against its
inclusion in OSS Apache Cassandra. You could (assuming some basic java
knowledge) extend NetworkTopologyStrategy to have it accomplish this, but I
imagine you don't REALLY want this unless you're frequently auto-scaling
nodes in/out of the cluster. You should probably just pick a high RF and
you'll be OK with it.
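
For example, a minimal sketch of a high-RF keyspace (keyspace and DC names
below are made up, and 5 is just an illustrative value):

    CREATE KEYSPACE auth_data
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 5};

    -- and have the application read/write at CL ONE (or LOCAL_ONE)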


> Right now, I have a 2 node cluster with replication factor 3. Will this
> cause any issues, having a RF > #nodes? Or is there a way to just have it
> copy to *all* nodes?
>

It's obviously not the intended config, but I don't think it'll cause many
problems.


> Is there any way that I can tune Cassandra to be more read-optimized?
>
>
Yes - definitely use leveled compaction instead of STCS (the default), and
definitely take the time to tune the JVM args - read path generates a lot
of short lived java objects, so a larger eden will help you (maybe up to
40-50% of max heap size).
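
As a sketch of the compaction change (the table name here is made up):

    ALTER TABLE auth_data.users_by_id
      WITH compaction = {'class': 'LeveledCompactionStrategy'};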


> Finally, I have some misgivings about how well Cassandra fits my use-case.
> Please, if anyone has a suggestion as to why or why not it is a good fit, I
> would really appreciate your input! If this could be done with a simple SQL
> database and this is overkill, please let me know.
>
> Thanks for your input!
>
>


Tuning Replication Factor - All, Consistency ONE

2018-07-10 Thread Code Wiget
Hi,

I have been tasked with picking and setting up a database with the following 
characteristics:

• Ultra-high availability - The real requirement is uptime - our whole platform 
becomes inaccessible without a “read” from the database. We need the read to 
authenticate users. Databases will never be spread across multiple networks.
• Reasonably quick access speeds
• Very low data storage - The data storage is very low - for 10 million users, 
we would have around 8GB of storage total.

Having done a bit of research on Cassandra, I think the optimal approach for my 
use-case would be to replicate the data on ALL nodes possible, but require 
reads to only have a consistency level of one. So, in the case that a node goes 
down, we can still read/write to other nodes. It is not very important that a 
read be unanimously agreed upon, as long as Cassandra is eventually consistent, 
within around 1s, then there shouldn’t be an issue.

When I go to set up the database though, I am required to set a replication 
factor to a number - 1,2,3,etc. So I can’t just say “ALL” and have it replicate 
to all nodes. Right now, I have a 2 node cluster with replication factor 3. 
Will this cause any issues, having a RF > #nodes? Or is there a way to just 
have it copy to all nodes? Is there any way that I can tune Cassandra to be 
more read-optimized?

Finally, I have some misgivings about how well Cassandra fits my use-case. 
Please, if anyone has a suggestion as to why or why not it is a good fit, I 
would really appreciate your input! If this could be done with a simple SQL 
database and this is overkill, please let me know.

Thanks for your input!



Re: Reducing the replication factor

2018-01-09 Thread Jeff Jirsa
Run repair first to ensure the data is properly replicated, then cleanup.
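
Roughly, the sequence looks like this (keyspace name and strategy below are
placeholders - adjust to your schema):

    1. nodetool repair <keyspace>      (on each node, so the remaining replicas are complete)
    2. in cqlsh:
       ALTER KEYSPACE <keyspace>
         WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
    3. nodetool cleanup <keyspace>     (on each node, to drop the copies that are no longer needed)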


-- 
Jeff Jirsa


> On Jan 9, 2018, at 9:36 AM, Alessandro Pieri <siri...@gmail.com> wrote:
> 
> Dear Everyone,
> 
> We are running Cassandra v2.0.15 on our production cluster.
> 
> We would like to reduce the replication factor from 3 to 2 but we are not 
> sure if it is a safe operation. We would like to get some feedback from you 
> guys. 
> 
> Has anybody tried to shrink the replication factor?
> 
> Does "nodetool cleanup" get rid of the replicated data no longer needed?
> 
> Thanks in advance for your support.
> 
> Regards,
> Alessandro
> 
> 




Reducing the replication factor

2018-01-09 Thread Alessandro Pieri
Dear Everyone,

We are running Cassandra v2.0.15 on our production cluster.

We would like to reduce the replication factor from 3 to 2 but we are not
sure if it is a safe operation. We would like to get some feedback from you
guys.

Has anybody tried to shrink the replication factor?

Does "nodetool cleanup" get rid of the replicated data no longer needed?

Thanks in advance for your support.

Regards,
Alessandro


Cassandra Replication Factor change from 2 to 3 for each data center

2017-12-15 Thread Harika Vangapelli -T (hvangape - AKRAYA INC at Cisco)
This is just a basic question to ask, but it is worth asking.

We changed Replication factor from 2 to 3 in our production cluster. We have 2 
data centers.

Is running nodetool repair -dcpar from a single node in one data center
sufficient for the whole replication change to take effect? Please confirm.

Do I need to run it from each node?

Harika Vangapelli
Engineer - IT
hvang...@cisco.com<mailto:hvang...@cisco.com>
Tel:

Cisco Systems, Inc.



United States
cisco.com






Re: system_auth replication factor in Cassandra 2.1

2017-08-30 Thread Nate McCall
Regardless, if you are not modifying users frequently (with five you most
likely are not), make sure you turn the permission cache way up.

In 2.1 that is just: permissions_validity_in_ms (default is 2000 or 2
seconds). Feel free to set it to 1 day or some such. The corresponding
async update parameter (permissions_update_interval_in_ms) can be set to a
slightly smaller value. If you really need to, you can drop the cache via
the "invalidate" operation on the
"org.apache.cassandra.auth:type=PermissionsCache" mbean (on each node) to
revoke a user for example.

In later versions, you would have to do the same with:
- roles_validity_in_ms
- credentials_validity_in_ms
and their corresponding 'interval' parameters.
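
As a sketch (the values here are only illustrative - pick them based on how
long you can tolerate a revoked permission still working):

    # cassandra.yaml
    permissions_validity_in_ms: 86400000          # cache entries valid for 1 day
    permissions_update_interval_in_ms: 3600000    # refresh asynchronously in the background
    # in later versions, the analogous roles_* and credentials_* settings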


Re: system_auth replication factor in Cassandra 2.1

2017-08-30 Thread Erick Ramirez
Pool-Worker-1] | 2017-08-30
> 10:51:25.003000 | xx.xx.xx.116 |178
>
>
> Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30
> 10:51:25.003000 | xx.xx.xx.116 |186
>
>
>Read 1 live and 0 tombstone cells [SharedPool-Worker-1] |
> 2017-08-30 10:51:25.003000 | xx.xx.xx.116 |191
>
>
> Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30
> 10:51:25.003000 | xx.xx.xx.116 |194
>
>
> Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30
> 10:51:25.004000 | xx.xx.xx.116 |198
>
>
> Scanned 5 rows and matched 5 [SharedPool-Worker-1] | 2017-08-30
> 10:51:25.004000 | xx.xx.xx.116 |224
>
>
> Enqueuing response to /xx.xx.xx.113 [SharedPool-Worker-1] |
> 2017-08-30 10:51:25.004000 | xx.xx.xx.116 |240
>
>  Sending REQUEST_RESPONSE message to
> /xx.xx.xx.113 [MessagingService-Outgoing-/xx.xx.xx.113] | 2017-08-30
> 10:51:25.004000 | xx.xx.xx.116 |302
>
>
> Enqueuing request to /xx.xx.xx.116 [SharedPool-Worker-2] | 2017-08-30
> 10:51:25.014000 | xx.xx.xx.113 | 601103
>
>  Submitted 1 concurrent
> range requests covering 63681 ranges [SharedPool-Worker-2] | 2017-08-30
> 10:51:25.014000 | xx.xx.xx.113 | 601120
>
>   Sending PAGED_RANGE message to
> /xx.xx.xx.116 [MessagingService-Outgoing-/xx.xx.xx.116] | 2017-08-30
> 10:51:25.015000 | xx.xx.xx.113 | 601190
>
>   REQUEST_RESPONSE message received from
> /xx.xx.xx.116 [MessagingService-Incoming-/xx.xx.xx.116] | 2017-08-30
> 10:51:25.015000 | xx.xx.xx.113 | 601771
>
>
> Processing response from /xx.xx.xx.116 [SharedPool-Worker-1] | 2017-08-30
> 10:51:25.015000 | xx.xx.xx.113 | 601824
>
>
>  Request complete |
> 2017-08-30 10:51:25.014874 | xx.xx.xx.113 | 601874
>
>
>
>
>
> *From: *Oleksandr Shulgin <oleksandr.shul...@zalando.de>
> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Date: *Wednesday, August 30, 2017 at 10:42 AM
> *To: *User <user@cassandra.apache.org>
> *Subject: *Re: system_auth replication factor in Cassandra 2.1
>
>
>
> On Wed, Aug 30, 2017 at 6:40 PM, Chuck Reynolds <creyno...@ancestry.com>
> wrote:
>
> How many users do you have (or expect to be found in system_auth.users)?
>
>   5 users.
>
> What are the current RF for system_auth and consistency level you are
> using in cqlsh?
>
>  135 in one DC and 227 in the other DC.  Consistency level one
>
>
>
> Still very surprising...
>
>
>
> Did you try to obtain a trace of a timing-out query (with TRACING ON)?
>
> Tracing timeout even though I increased it to 120 seconds.
>
>
>
> Even if cqlsh doesn't print the trace because of timeout, you should be
> still able to find something in system_traces.
>
>
>
> --
>
> Alex
>
>
>


Re: system_auth replication factor in Cassandra 2.1

2017-08-30 Thread kurt greaves
For that many nodes mixed with vnodes you probably want a lower RF than N
per datacenter. 5 or 7 would be reasonable. The only down side is that auth
queries may take slightly longer as they will often have to go to other
nodes to be resolved, but in practice this is likely not a big deal as the
data will be cached anyway.


Re: system_auth replication factor in Cassandra 2.1

2017-08-30 Thread Chuck Reynolds
 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 
| xx.xx.xx.116 |178
   Read 
1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 
| xx.xx.xx.116 |186
   Read 
1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 
| xx.xx.xx.116 |191
   Read 
1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 
| xx.xx.xx.116 |194
   Read 
1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.004000 
| xx.xx.xx.116 |198

Scanned 5 rows and matched 5 [SharedPool-Worker-1] | 2017-08-30 10:51:25.004000 
| xx.xx.xx.116 |224

Enqueuing response to /xx.xx.xx.113 [SharedPool-Worker-1] | 2017-08-30 
10:51:25.004000 | xx.xx.xx.116 |240
 Sending REQUEST_RESPONSE message to 
/xx.xx.xx.113 [MessagingService-Outgoing-/xx.xx.xx.113] | 2017-08-30 
10:51:25.004000 | xx.xx.xx.116 |302
 
Enqueuing request to /xx.xx.xx.116 [SharedPool-Worker-2] | 2017-08-30 
10:51:25.014000 | xx.xx.xx.113 | 601103
 Submitted 1 concurrent range 
requests covering 63681 ranges [SharedPool-Worker-2] | 2017-08-30 
10:51:25.014000 | xx.xx.xx.113 | 601120
  Sending PAGED_RANGE message to 
/xx.xx.xx.116 [MessagingService-Outgoing-/xx.xx.xx.116] | 2017-08-30 
10:51:25.015000 | xx.xx.xx.113 | 601190
  REQUEST_RESPONSE message received from 
/xx.xx.xx.116 [MessagingService-Incoming-/xx.xx.xx.116] | 2017-08-30 
10:51:25.015000 | xx.xx.xx.113 | 601771
 Processing 
response from /xx.xx.xx.116 [SharedPool-Worker-1] | 2017-08-30 10:51:25.015000 
| xx.xx.xx.113 | 601824

  Request complete | 2017-08-30 10:51:25.014874 
| xx.xx.xx.113 | 601874


From: Oleksandr Shulgin <oleksandr.shul...@zalando.de>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, August 30, 2017 at 10:42 AM
To: User <user@cassandra.apache.org>
Subject: Re: system_auth replication factor in Cassandra 2.1

On Wed, Aug 30, 2017 at 6:40 PM, Chuck Reynolds 
<creyno...@ancestry.com<mailto:creyno...@ancestry.com>> wrote:
How many users do you have (or expect to be found in system_auth.users)?
  5 users.
What are the current RF for system_auth and consistency level you are using in 
cqlsh?
 135 in one DC and 227 in the other DC.  Consistency level one

Still very surprising...

Did you try to obtain a trace of a timing-out query (with TRACING ON)?
Tracing timeout even though I increased it to 120 seconds.

Even if cqlsh doesn't print the trace because of timeout, you should be still 
able to find something in system_traces.

--
Alex



Re: system_auth replication factor in Cassandra 2.1

2017-08-30 Thread Oleksandr Shulgin
On Wed, Aug 30, 2017 at 6:40 PM, Chuck Reynolds 
wrote:

> How many users do you have (or expect to be found in system_auth.users)?
>
>   5 users.
>
> What are the current RF for system_auth and consistency level you are
> using in cqlsh?
>
>  135 in one DC and 227 in the other DC.  Consistency level one
>

Still very surprising...

Did you try to obtain a trace of a timing-out query (with TRACING ON)?
>
> Tracing timeout even though I increased it to 120 seconds.
>

Even if cqlsh doesn't print the trace because of timeout, you should be
still able to find something in system_traces.

--
Alex


Re: system_auth replication factor in Cassandra 2.1

2017-08-30 Thread Chuck Reynolds
How many users do you have (or expect to be found in system_auth.users)?
  5 users.
What are the current RF for system_auth and consistency level you are using in 
cqlsh?
 135 in one DC and 227 in the other DC.  Consistency level one
Did you try to obtain a trace of a timing-out query (with TRACING ON)?
Tracing timeout even though I increased it to 120 seconds.

From: Oleksandr Shulgin <oleksandr.shul...@zalando.de>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, August 30, 2017 at 10:19 AM
To: User <user@cassandra.apache.org>
Subject: Re: system_auth replication factor in Cassandra 2.1

On Wed, Aug 30, 2017 at 5:50 PM, Chuck Reynolds 
<creyno...@ancestry.com<mailto:creyno...@ancestry.com>> wrote:
So I’ve read that if you’re using authentication in Cassandra 2.1, your
replication factor should match the number of nodes in your datacenter.

Is that true?

I have two datacenter cluster, 135 nodes in datacenter 1 & 227 nodes in an AWS 
datacenter.

Why do I want to replicate the system_auth table that many times?

What are the benefits and disadvantages of matching the number of nodes as 
opposed to the standard replication factor of 3?


The reason I’m asking the question is because it seems like I’m getting a lot 
of authentication errors now and they seem to happen more under load.

Also, querying the system_auth table from cqlsh to get the users seems to now 
timeout.

This is surprising.

How many users do you have (or expect to be found in system_auth.users)?   What 
are the current RF for system_auth and consistency level you are using in 
cqlsh?  Did you try to obtain a trace of a timing-out query (with TRACING ON)?

Regards,
--
Oleksandr "Alex" Shulgin | Database Engineer | Zalando SE | Tel: +49 176 
127-59-707



Re: system_auth replication factor in Cassandra 2.1

2017-08-30 Thread Oleksandr Shulgin
On Wed, Aug 30, 2017 at 6:20 PM, Chuck Reynolds 
wrote:

> So I tried to run a repair with the following on one of the servers:
>
> nodetool repair system_auth -pr --local
>
>
>
> After two hours it hadn’t finished.  I had to kill the repair because of
> another issue and haven’t tried again.
>
>
>
> *Why would such a small table take so long to repair?*
>

It could be the overhead of that many nodes having to communicate with each
other (times the number of vnodes).  Even on small clusters (3-5 nodes) I
think it takes a few minutes to run a repair on a small/empty keyspace.

*Also what would happen if I set the RF back to a lower number like 5?*
>

You should still run a repair afterwards, but I would expect it to finish
in a reasonable time.

--
Alex


Re: system_auth replication factor in Cassandra 2.1

2017-08-30 Thread Chuck Reynolds
So I tried to run a repair with the following on one of the servers:
nodetool repair system_auth -pr --local

After two hours it hadn’t finished.  I had to kill the repair because of 
another issue and haven’t tried again.

Why would such a small table take so long to repair?

Also what would happen if I set the RF back to a lower number like 5?


Thanks
From: <li...@beobal.com> on behalf of Sam Tunnicliffe <s...@beobal.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, August 30, 2017 at 10:10 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: system_auth replication factor in Cassandra 2.1

It's a better rule of thumb to use an RF of 3 to 5 per DC and this is what the 
docs now suggest: 
http://cassandra.apache.org/doc/latest/operating/security.html#authentication
Out of the box, the system_auth keyspace is setup with SimpleStrategy and RF=1 
so that it works on any new system including dev & test clusters, but obviously 
that's no use for a production system.

Regarding the increased rate of authentication errors: did you run repair after 
changing the RF? Auth queries are done at CL.LOCAL_ONE, so if you haven't 
repaired, the data for the user logging in will probably not be where it should 
be. The exception to this is the default "cassandra" user, queries for that 
user are done at CL.QUORUM, which will indeed lead to timeouts and 
authentication errors with a very high RF. It's recommended to only use that 
default user to bootstrap the setup of your own users & superusers, the link 
above also has info on this.

Thanks,
Sam


On 30 August 2017 at 16:50, Chuck Reynolds 
<creyno...@ancestry.com<mailto:creyno...@ancestry.com>> wrote:
So I’ve read that if you’re using authentication in Cassandra 2.1, your
replication factor should match the number of nodes in your datacenter.

Is that true?

I have two datacenter cluster, 135 nodes in datacenter 1 & 227 nodes in an AWS 
datacenter.

Why do I want to replicate the system_auth table that many times?

What are the benefits and disadvantages of matching the number of nodes as 
opposed to the standard replication factor of 3?


The reason I’m asking the question is because it seems like I’m getting a lot 
of authentication errors now and they seem to happen more under load.

Also, querying the system_auth table from cqlsh to get the users seems to now 
timeout.


Any help would be greatly appreciated.

Thanks



Re: system_auth replication factor in Cassandra 2.1

2017-08-30 Thread Oleksandr Shulgin
On Wed, Aug 30, 2017 at 5:50 PM, Chuck Reynolds <creyno...@ancestry.com>
wrote:

> So I’ve read that if you’re using authentication in Cassandra 2.1, your
> replication factor should match the number of nodes in your datacenter.
>
>
>
> *Is that true?*
>
>
>
> I have two datacenter cluster, 135 nodes in datacenter 1 & 227 nodes in an
> AWS datacenter.
>
>
>
> *Why do I want to replicate the system_auth table that many times?*
>
>
>
> *What are the benefits and disadvantages of matching the number of nodes
> as opposed to the standard replication factor of 3? *
>
>
>
>
>
> The reason I’m asking the question is because it seems like I’m getting a
> lot of authentication errors now and they seem to happen more under load.
>
>
>
> Also, querying the system_auth table from cqlsh to get the users seems to
> now timeout.
>

This is surprising.

How many users do you have (or expect to be found in system_auth.users)?
What are the current RF for system_auth and consistency level you are using
in cqlsh?  Did you try to obtain a trace of a timing-out query (with
TRACING ON)?

Regards,
-- 
Oleksandr "Alex" Shulgin | Database Engineer | Zalando SE | Tel: +49 176
127-59-707


Re: system_auth replication factor in Cassandra 2.1

2017-08-30 Thread Sam Tunnicliffe
It's a better rule of thumb to use an RF of 3 to 5 per DC and this is what
the docs now suggest:
http://cassandra.apache.org/doc/latest/operating/security.html#authentication

Out of the box, the system_auth keyspace is setup with SimpleStrategy and
RF=1 so that it works on any new system including dev & test clusters, but
obviously that's no use for a production system.

Regarding the increased rate of authentication errors: did you run repair
after changing the RF? Auth queries are done at CL.LOCAL_ONE, so if you
haven't repaired, the data for the user logging in will probably not be
where it should be. The exception to this is the default "cassandra" user,
queries for that user are done at CL.QUORUM, which will indeed lead to
timeouts and authentication errors with a very high RF. It's recommended to
only use that default user to bootstrap the setup of your own users &
superusers, the link above also has info on this.
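
For reference, the change itself is just (the DC names below are placeholders
for your own datacenter names):

    ALTER KEYSPACE system_auth
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};

followed by "nodetool repair system_auth" across the nodes so the auth data
actually lands on the new replicas.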

Thanks,
Sam


On 30 August 2017 at 16:50, Chuck Reynolds <creyno...@ancestry.com> wrote:

> So I’ve read that if you’re using authentication in Cassandra 2.1, your
> replication factor should match the number of nodes in your datacenter.
>
>
>
> *Is that true?*
>
>
>
> I have two datacenter cluster, 135 nodes in datacenter 1 & 227 nodes in an
> AWS datacenter.
>
>
>
> *Why do I want to replicate the system_auth table that many times?*
>
>
>
> *What are the benefits and disadvantages of matching the number of nodes
> as opposed to the standard replication factor of 3? *
>
>
>
>
>
> The reason I’m asking the question is because it seems like I’m getting a
> lot of authentication errors now and they seem to happen more under load.
>
>
>
> Also, querying the system_auth table from cqlsh to get the users seems to
> now timeout.
>
>
>
>
>
> Any help would be greatly appreciated.
>
>
>
> Thanks
>


RE: system_auth replication factor in Cassandra 2.1

2017-08-30 Thread Jonathan Baynes

I recently came across an issue whereby my user keyspace was replicated at
RF=3 (I have 3 nodes) but my system_auth keyspace was left at the default of
1. We also use authentication; I then lost 2 of my nodes and, because the
authentication data wasn’t replicated, I couldn’t log in.

Once I resolved the issue and got the nodes back up, I could log back in. I
asked the community what was going on and was pointed to this:

http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/sec/secConfSysAuthKeyspRepl.html

It clearly states the following:

Attention: To prevent a potential problem logging into a secure cluster, set 
the replication factor of the system_auth and dse_security keyspaces to a value 
that is greater than 1. In a multi-node cluster, using the default of 1 
prevents logging into any node when the node that stores the user data is down.



From: Chuck Reynolds [mailto:creyno...@ancestry.com]
Sent: 30 August 2017 16:51
To: user@cassandra.apache.org
Subject: system_auth replication factor in Cassandra 2.1

So I’ve read that if you’re using authentication in Cassandra 2.1, your
replication factor should match the number of nodes in your datacenter.

Is that true?

I have two datacenter cluster, 135 nodes in datacenter 1 & 227 nodes in an AWS 
datacenter.

Why do I want to replicate the system_auth table that many times?

What are the benefits and disadvantages of matching the number of nodes as 
opposed to the standard replication factor of 3?


The reason I’m asking the question is because it seems like I’m getting a lot 
of authentication errors now and they seem to happen more under load.

Also, querying the system_auth table from cqlsh to get the users seems to now 
timeout.


Any help would be greatly appreciated.

Thanks





system_auth replication factor in Cassandra 2.1

2017-08-30 Thread Chuck Reynolds
So I’ve read that if you’re using authentication in Cassandra 2.1, your
replication factor should match the number of nodes in your datacenter.

Is that true?

I have two datacenter cluster, 135 nodes in datacenter 1 & 227 nodes in an AWS 
datacenter.

Why do I want to replicate the system_auth table that many times?

What are the benefits and disadvantages of matching the number of nodes as 
opposed to the standard replication factor of 3?


The reason I’m asking the question is because it seems like I’m getting a lot 
of authentication errors now and they seem to happen more under load.

Also, querying the system_auth table from cqlsh to get the users seems to now 
timeout.


Any help would be greatly appreciated.

Thanks


Re: Dropping down replication factor

2017-08-15 Thread Erick Ramirez
I would discourage dropping to RF=2 because if you're using CL=*QUORUM, it
won't be able to tolerate a node outage.

You mentioned a couple of days ago that there's an index file that is
corrupted on 10.40.17.114. Could you try moving out the sstable set
associated with that corrupt file and try again? Though I echo Jeff's
comments and I'm concerned you have a hardware issue on that node since
OpsCenter tables got corrupted too. The replace method certainly sounds
like a good idea.
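
(For the arithmetic: quorum is floor(RF/2) + 1, so with RF=2 quorum is 2 and a
single node outage already blocks quorum reads/writes, while RF=3 keeps quorum
at 2 with one replica to spare.)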

On Sun, Aug 13, 2017 at 7:58 AM, <brian.spind...@gmail.com> wrote:

> Hi folks, hopefully a quick one:
>
> We are running a 12 node cluster (2.1.15) in AWS with Ec2Snitch.  It's all
> in one region but spread across 3 availability zones.  It was nicely
> balanced with 4 nodes in each.
>
> But with a couple of failures and subsequent provisions to the wrong az we
> now have a cluster with :
>
> 5 nodes in az A
> 5 nodes in az B
> 2 nodes in az C
>
> Not sure why, but when adding a third node in AZ C it fails to stream
> after getting all the way to completion and no apparent error in logs.
> I've looked at a couple of bugs referring to scrubbing and possible OOM
> bugs due to metadata writing at end of streaming (sorry don't have ticket
> handy).  I'm worried I might not be able to do much with these since the
> disk space usage is high and they are under a lot of load given the small
> number of them for this rack.
>
> Rather than troubleshoot this further, what I was thinking about doing was:
> - drop the replication factor on our keyspace to two
> - hopefully this would reduce load on these two remaining nodes
> - run repairs/cleanup across the cluster
> - then shoot these two nodes in the 'c' rack
> - run repairs/cleanup across the cluster
>
> Would this work with minimal/no disruption?
> Should I update their "rack" before hand or after ?
> What else am I not thinking about?
>
> My main goal atm is to get back to where the cluster is in a clean
> consistent state that allows nodes to properly bootstrap.
>
> Thanks for your help in advance.
>
>


Re: Dropping down replication factor

2017-08-13 Thread Brian Spindler
Thanks Kurt.

We had one sstable from a cf of ours.  I am actually running a repair on
that cf now and then plan to try and join the additional nodes as you
suggest.  I deleted the opscenter corrupt sstables as well but will not
bother repairing that before adding capacity.

Been keeping an eye across all nodes for corrupt exceptions - so far no new
occurrences.

Thanks again.

-B

On Sun, Aug 13, 2017 at 17:52 kurt greaves  wrote:

>
>
> On 14 Aug. 2017 00:59, "Brian Spindler"  wrote:
>
> Do you think with the setup I've described I'd be ok doing that now to
> recover this node?
>
> The node died trying to run the scrub; I've restarted it but I'm not sure
> it's going to get past a scrub/repair, this is why I deleted the other
> files as a brute force method.  I think I might have to do the same here
> and then kick off a repair if I can't just replace it?
>
> is it just opscenter keyspace that has corrupt sstables? if so I wouldn't
> worry about repairing too much. If you can get that third node in C to join
> I'd say your best bet is to just do that until you have enough nodes in C.
> Dropping and increasing RF is pretty risky on a live system.
>
> It sounds to me like you stand a good chance of getting the new nodes in C
> to join so I'd pursue that before trying anything more complicated
>
>
> Doing the repair on the node that had the corrupt data deleted should be
> ok?
>
> Yes. as long as you also deleted corrupt SSTables on any other nodes that
> had them.
>
-- 
-Brian


Re: Dropping down replication factor

2017-08-13 Thread kurt greaves
On 14 Aug. 2017 00:59, "Brian Spindler"  wrote:

Do you think with the setup I've described I'd be ok doing that now to
recover this node?

The node died trying to run the scrub; I've restarted it but I'm not sure
it's going to get past a scrub/repair, this is why I deleted the other
files as a brute force method.  I think I might have to do the same here
and then kick off a repair if I can't just replace it?

is it just opscenter keyspace that has corrupt sstables? if so I wouldn't
worry about repairing too much. If you can get that third node in C to join
I'd say your best bet is to just do that until you have enough nodes in C.
Dropping and increasing RF is pretty risky on a live system.

It sounds to me like you stand a good chance of getting the new nodes in C
to join so I'd pursue that before trying anything more complicated


Doing the repair on the node that had the corrupt data deleted should be
ok?

Yes. as long as you also deleted corrupt SSTables on any other nodes that
had them.


Re: Dropping down replication factor

2017-08-13 Thread Brian Spindler
Do you think with the setup I've described I'd be ok doing that now to
recover this node?

The node died trying to run the scrub; I've restarted it but I'm not sure
it's going to get past a scrub/repair, this is why I deleted the other
files as a brute force method.  I think I might have to do the same here
and then kick off a repair if I can't just replace it?

Doing the repair on the node that had the corrupt data deleted should be
ok?

On Sun, Aug 13, 2017 at 10:29 AM Jeff Jirsa <jji...@gmail.com> wrote:

> Running repairs when you have corrupt sstables can spread the corruption
>
> In 2.1.15, corruption is almost certainly from something like a bad disk
> or bad RAM
>
> One way to deal with corruption is to stop the node and replace it (with
> -Dcassandra.replace_address) so you restream data from neighbors. The
> challenge here is making sure you have a healthy replica to stream from.
>
> Please make sure you have backups and snapshots if you have corruption
> popping up
>
> If you're using vnodes, once you get rid of the corruption you may
> consider adding another rack C node with fewer vnodes to try to get it
> joined faster with less data.
>
>
> --
> Jeff Jirsa
>
>
> On Aug 13, 2017, at 7:11 AM, Brian Spindler <brian.spind...@gmail.com>
> wrote:
>
> Hi Jeff, I ran the scrub online and that didn't help.  I went ahead and
> stopped the node, deleted all the corrupted data files --*.db
> files and planned on running a repair when it came back online.
>
> Unrelated I believe, now another CF is corrupted!
>
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
> /ephemeral/cassandra/data/OpsCenter/rollups300-45c85324387b35238d056678f8fa8b0f/OpsCenter-rollups300-ka-100672-Data.db
> Caused by: org.apache.cassandra.io.compress.CorruptBlockException:
> (/ephemeral/cassandra/data/OpsCenter/rollups300-45c85324387b35238d056678f8fa8b0f/OpsCenter-rollups300-ka-100672-Data.db):
> corruption detected, chunk at 101500 of length 26523398.
>
> Few days ago when troubleshooting this I did change the OpsCenter keyspace
> RF == 2 from 3 since I thought that would help reduce load.  Did that cause
> this corruption?
>
> running *'nodetool scrub OpsCenter rollups300'* on that node now
>
> And now I also see this when running nodetool status:
>
> *"Note: Non-system keyspaces don't have the same replication settings,
> effective ownership information is meaningless"*
>
> What to do?
>
> I still can't stream to this new node cause of this corruption.  Disk
> space is getting low on these nodes ...
>
> On Sat, Aug 12, 2017 at 9:51 PM Brian Spindler <brian.spind...@gmail.com>
> wrote:
>
>> nothing in logs on the node that it was streaming from.
>>
>> however, I think I found the issue on the other node in the C rack:
>>
>> ERROR [STREAM-IN-/10.40.17.114] 2017-08-12 16:48:53,354
>> StreamSession.java:512 - [Stream #08957970-7f7e-11e7-b2a2-a31e21b877e5]
>> Streaming error occurred
>> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
>> /ephemeral/cassandra/data/...
>>
>> I did a 'cat /var/log/cassandra/system.log|grep Corrupt'  and it seems
>> it's a single Index.db file and nothing on the other node.
>>
>> I think nodetool scrub or offline sstablescrub might be in order but with
>> the current load I'm not sure I can take it offline for very long.
>>
>> Thanks again for the help.
>>
>>
>> On Sat, Aug 12, 2017 at 9:38 PM Jeffrey Jirsa <jji...@gmail.com> wrote:
>>
>>> Compaction is backed up – that may be normal write load (because of the
>>> rack imbalance), or it may be a secondary index build. Hard to say for
>>> sure. ‘nodetool compactionstats’ if you’re able to provide it. The jstack
>>> probably not necessary, streaming is being marked as failed and it’s
>>> turning itself off. Not sure why streaming is marked as failing, though,
>>> anything on the sending sides?
>>>
>>>
>>>
>>>
>>>
>>> From: Brian Spindler <brian.spind...@gmail.com>
>>> Reply-To: <user@cassandra.apache.org>
>>> Date: Saturday, August 12, 2017 at 6:34 PM
>>> To: <user@cassandra.apache.org>
>>> Subject: Re: Dropping down replication factor
>>>
>>> Thanks for replying Jeff.
>>>
>>> Responses below.
>>>
>>> On Sat, Aug 12, 2017 at 8:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> Answers inline
>>>>
>>>> --
>>>> Jeff Jirsa
>>>>
>>>>
>>>> > On Aug 12, 2017, at 2:58 PM, brian.spind...@gmail.com wrote:

Re: Dropping down replication factor

2017-08-13 Thread Jeff Jirsa
Running repairs when you have corrupt sstables can spread the corruption

In 2.1.15, corruption is almost certainly from something like a bad disk or bad 
RAM

One way to deal with corruption is to stop the node and replace it (with
-Dcassandra.replace_address) so you restream data from neighbors. The challenge
here is making sure you have a healthy replica to stream from.

Please make sure you have backups and snapshots if you have corruption popping 
up

If you're using vnodes, once you get rid of the corruption you may consider
adding another rack C node with fewer vnodes to try to get it joined faster
with less data.
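
The replace flow, roughly (on the replacement node only, before its first
start; the IP is the dead node's own address):

    # cassandra-env.sh
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<ip.of.node.being.replaced>"

    # start Cassandra, let it finish streaming/joining, then remove the flag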


-- 
Jeff Jirsa


> On Aug 13, 2017, at 7:11 AM, Brian Spindler <brian.spind...@gmail.com> wrote:
> 
> Hi Jeff, I ran the scrub online and that didn't help.  I went ahead and 
> stopped the node, deleted all the corrupted data files --*.db files 
> and planned on running a repair when it came back online.  
> 
> Unrelated I believe, now another CF is corrupted!  
> 
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /ephemeral/cassandra/data/OpsCenter/rollups300-45c85324387b35238d056678f8fa8b0f/OpsCenter-rollups300-ka-100672-Data.db
> Caused by: org.apache.cassandra.io.compress.CorruptBlockException: 
> (/ephemeral/cassandra/data/OpsCenter/rollups300-45c85324387b35238d056678f8fa8b0f/OpsCenter-rollups300-ka-100672-Data.db):
>  corruption detected, chunk at 101500 of length 26523398.
> 
> Few days ago when troubleshooting this I did change the OpsCenter keyspace RF 
> == 2 from 3 since I thought that would help reduce load.  Did that cause this 
> corruption? 
> 
> running 'nodetool scrub OpsCenter rollups300' on that node now 
> 
> And now I also see this when running nodetool status: 
> 
> "Note: Non-system keyspaces don't have the same replication settings, 
> effective ownership information is meaningless"
> 
> What to do?  
> 
> I still can't stream to this new node cause of this corruption.  Disk space 
> is getting low on these nodes ... 
> 
>> On Sat, Aug 12, 2017 at 9:51 PM Brian Spindler <brian.spind...@gmail.com> 
>> wrote:
>> nothing in logs on the node that it was streaming from.  
>> 
>> however, I think I found the issue on the other node in the C rack: 
>> 
>> ERROR [STREAM-IN-/10.40.17.114] 2017-08-12 16:48:53,354 
>> StreamSession.java:512 - [Stream #08957970-7f7e-11e7-b2a2-a31e21b877e5] 
>> Streaming error occurred
>> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
>> /ephemeral/cassandra/data/...
>> 
>> I did a 'cat /var/log/cassandra/system.log|grep Corrupt'  and it seems it's 
>> a single Index.db file and nothing on the other node.  
>> 
>> I think nodetool scrub or offline sstablescrub might be in order but with 
>> the current load I'm not sure I can take it offline for very long.  
>> 
>> Thanks again for the help. 
>> 
>> 
>>> On Sat, Aug 12, 2017 at 9:38 PM Jeffrey Jirsa <jji...@gmail.com> wrote:
>>> Compaction is backed up – that may be normal write load (because of the 
>>> rack imbalance), or it may be a secondary index build. Hard to say for 
>>> sure. ‘nodetool compactionstats’ if you’re able to provide it. The jstack 
>>> probably not necessary, streaming is being marked as failed and it’s 
>>> turning itself off. Not sure why streaming is marked as failing, though, 
>>> anything on the sending sides?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> From: Brian Spindler <brian.spind...@gmail.com>
>>> Reply-To: <user@cassandra.apache.org>
>>> Date: Saturday, August 12, 2017 at 6:34 PM
>>> To: <user@cassandra.apache.org>
>>> Subject: Re: Dropping down replication factor
>>> 
>>> Thanks for replying Jeff. 
>>> 
>>> Responses below. 
>>> 
>>>> On Sat, Aug 12, 2017 at 8:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>> Answers inline
>>>> 
>>>> --
>>>> Jeff Jirsa
>>>> 
>>>> 
>>>> > On Aug 12, 2017, at 2:58 PM, brian.spind...@gmail.com wrote:
>>>> >
>>>> > Hi folks, hopefully a quick one:
>>>> >
>>>> > We are running a 12 node cluster (2.1.15) in AWS with Ec2Snitch.  It's 
>>>> > all in one region but spread across 3 availability zones.  It was nicely 
>>>> > balanced with 4 nodes in each.
>>>> >
>>>> > But with a couple of failures and subsequent provisions to the wrong az 
>>>> > we now have a cluster with :
>>>> >
>>>> > 5 

Re: Dropping down replication factor

2017-08-13 Thread Brian Spindler
Hi Jeff, I ran the scrub online and that didn't help.  I went ahead and
stopped the node, deleted all the corrupted data files --*.db
files and planned on running a repair when it came back online.

Unrelated I believe, now another CF is corrupted!

org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
/ephemeral/cassandra/data/OpsCenter/rollups300-45c85324387b35238d056678f8fa8b0f/OpsCenter-rollups300-ka-100672-Data.db
Caused by: org.apache.cassandra.io.compress.CorruptBlockException:
(/ephemeral/cassandra/data/OpsCenter/rollups300-45c85324387b35238d056678f8fa8b0f/OpsCenter-rollups300-ka-100672-Data.db):
corruption detected, chunk at 101500 of length 26523398.

Few days ago when troubleshooting this I did change the OpsCenter keyspace
RF == 2 from 3 since I thought that would help reduce load.  Did that cause
this corruption?

running *'nodetool scrub OpsCenter rollups300'* on that node now

And now I also see this when running nodetool status:

*"Note: Non-system keyspaces don't have the same replication settings,
effective ownership information is meaningless"*

What to do?

I still can't stream to this new node cause of this corruption.  Disk space
is getting low on these nodes ...

On Sat, Aug 12, 2017 at 9:51 PM Brian Spindler <brian.spind...@gmail.com>
wrote:

> nothing in logs on the node that it was streaming from.
>
> however, I think I found the issue on the other node in the C rack:
>
> ERROR [STREAM-IN-/10.40.17.114] 2017-08-12 16:48:53,354
> StreamSession.java:512 - [Stream #08957970-7f7e-11e7-b2a2-a31e21b877e5]
> Streaming error occurred
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
> /ephemeral/cassandra/data/...
>
> I did a 'cat /var/log/cassandra/system.log|grep Corrupt'  and it seems
> it's a single Index.db file and nothing on the other node.
>
> I think nodetool scrub or offline sstablescrub might be in order but with
> the current load I'm not sure I can take it offline for very long.
>
> Thanks again for the help.
>
>
> On Sat, Aug 12, 2017 at 9:38 PM Jeffrey Jirsa <jji...@gmail.com> wrote:
>
>> Compaction is backed up – that may be normal write load (because of the
>> rack imbalance), or it may be a secondary index build. Hard to say for
>> sure. ‘nodetool compactionstats’ if you’re able to provide it. The jstack
>> probably not necessary, streaming is being marked as failed and it’s
>> turning itself off. Not sure why streaming is marked as failing, though,
>> anything on the sending sides?
>>
>>
>>
>>
>>
>> From: Brian Spindler <brian.spind...@gmail.com>
>> Reply-To: <user@cassandra.apache.org>
>> Date: Saturday, August 12, 2017 at 6:34 PM
>> To: <user@cassandra.apache.org>
>> Subject: Re: Dropping down replication factor
>>
>> Thanks for replying Jeff.
>>
>> Responses below.
>>
>> On Sat, Aug 12, 2017 at 8:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Answers inline
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> > On Aug 12, 2017, at 2:58 PM, brian.spind...@gmail.com wrote:
>>> >
>>> > Hi folks, hopefully a quick one:
>>> >
>>> > We are running a 12 node cluster (2.1.15) in AWS with Ec2Snitch.  It's
>>> all in one region but spread across 3 availability zones.  It was nicely
>>> balanced with 4 nodes in each.
>>> >
>>> > But with a couple of failures and subsequent provisions to the wrong
>>> az we now have a cluster with :
>>> >
>>> > 5 nodes in az A
>>> > 5 nodes in az B
>>> > 2 nodes in az C
>>> >
>>> > Not sure why, but when adding a third node in AZ C it fails to stream
>>> after getting all the way to completion and no apparent error in logs.
>>> I've looked at a couple of bugs referring to scrubbing and possible OOM
>>> bugs due to metadata writing at end of streaming (sorry don't have ticket
>>> handy).  I'm worried I might not be able to do much with these since the
>>> disk space usage is high and they are under a lot of load given the small
>>> number of them for this rack.
>>>
>>> You'll definitely have higher load on az C instances with rf=3 in this
>>> ratio
>>
>>
>>> Streaming should still work - are you sure it's not busy doing
>>> something? Like building secondary index or similar? jstack thread dump
>>> would be useful, or at least nodetool tpstats
>>>
>>> Only other thing might be a backup.  We do incrementals x1hr and
>> snapshots x24h; they are shipped to s3 then links are cleaned up.  The
>> error

Re: Dropping down replication factor

2017-08-12 Thread Brian Spindler
nothing in logs on the node that it was streaming from.

however, I think I found the issue on the other node in the C rack:

ERROR [STREAM-IN-/10.40.17.114] 2017-08-12 16:48:53,354
StreamSession.java:512 - [Stream #08957970-7f7e-11e7-b2a2-a31e21b877e5]
Streaming error occurred
org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
/ephemeral/cassandra/data/...

I did a 'cat /var/log/cassandra/system.log|grep Corrupt'  and it seems it's
a single Index.db file and nothing on the other node.

I think nodetool scrub or offline sstablescrub might be in order but with
the current load I'm not sure I can take it offline for very long.

Thanks again for the help.


On Sat, Aug 12, 2017 at 9:38 PM Jeffrey Jirsa <jji...@gmail.com> wrote:

> Compaction is backed up – that may be normal write load (because of the
> rack imbalance), or it may be a secondary index build. Hard to say for
> sure. ‘nodetool compactionstats’ if you’re able to provide it. The jstack
> probably not necessary, streaming is being marked as failed and it’s
> turning itself off. Not sure why streaming is marked as failing, though,
> anything on the sending sides?
>
>
>
>
>
> From: Brian Spindler <brian.spind...@gmail.com>
> Reply-To: <user@cassandra.apache.org>
> Date: Saturday, August 12, 2017 at 6:34 PM
> To: <user@cassandra.apache.org>
> Subject: Re: Dropping down replication factor
>
> Thanks for replying Jeff.
>
> Responses below.
>
> On Sat, Aug 12, 2017 at 8:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Answers inline
>>
>> --
>> Jeff Jirsa
>>
>>
>> > On Aug 12, 2017, at 2:58 PM, brian.spind...@gmail.com wrote:
>> >
>> > Hi folks, hopefully a quick one:
>> >
>> > We are running a 12 node cluster (2.1.15) in AWS with Ec2Snitch.  It's
>> all in one region but spread across 3 availability zones.  It was nicely
>> balanced with 4 nodes in each.
>> >
>> > But with a couple of failures and subsequent provisions to the wrong az
>> we now have a cluster with :
>> >
>> > 5 nodes in az A
>> > 5 nodes in az B
>> > 2 nodes in az C
>> >
>> > Not sure why, but when adding a third node in AZ C it fails to stream
>> after getting all the way to completion and no apparent error in logs.
>> I've looked at a couple of bugs referring to scrubbing and possible OOM
>> bugs due to metadata writing at end of streaming (sorry don't have ticket
>> handy).  I'm worried I might not be able to do much with these since the
>> disk space usage is high and they are under a lot of load given the small
>> number of them for this rack.
>>
>> You'll definitely have higher load on az C instances with rf=3 in this
>> ratio
>
>
>> Streaming should still work - are you sure it's not busy doing something?
>> Like building secondary index or similar? jstack thread dump would be
>> useful, or at least nodetool tpstats
>>
>> Only other thing might be a backup.  We do incrementals x1hr and
> snapshots x24h; they are shipped to s3 then links are cleaned up.  The
> error I get on the node I'm trying to add to rack C is:
>
> ERROR [main] 2017-08-12 23:54:51,546 CassandraDaemon.java:583 - Exception
> encountered during startup
> java.lang.RuntimeException: Error during boostrap: Stream failed
> at
> org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:87)
> ~[apache-cassandra-2.1.15.jar:2.1.15]
> at
> org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1166)
> ~[apache-cassandra-2.1.15.jar:2.1.15]
> at
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:944)
> ~[apache-cassandra-2.1.15.jar:2.1.15]
> at
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:740)
> ~[apache-cassandra-2.1.15.jar:2.1.15]
> at
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:617)
> ~[apache-cassandra-2.1.15.jar:2.1.15]
> at
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:391)
> [apache-cassandra-2.1.15.jar:2.1.15]
> at
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:566)
> [apache-cassandra-2.1.15.jar:2.1.15]
> at
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:655)
> [apache-cassandra-2.1.15.jar:2.1.15]
> Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
> at
> org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
> ~[apache-cassandra-2.1.15.jar:2.1.15]
> at
> com.goog

Re: Dropping down replication factor

2017-08-12 Thread Jeffrey Jirsa
Compaction is backed up – that may be normal write load (because of the rack
imbalance), or it may be a secondary index build. Hard to say for sure.
'nodetool compactionstats' if you're able to provide it. The jstack probably
not necessary, streaming is being marked as failed and it's turning itself
off. Not sure why streaming is marked as failing, though, anything on the
sending sides?





From:  Brian Spindler <brian.spind...@gmail.com>
Reply-To:  <user@cassandra.apache.org>
Date:  Saturday, August 12, 2017 at 6:34 PM
To:  <user@cassandra.apache.org>
Subject:  Re: Dropping down replication factor

Thanks for replying Jeff.

Responses below. 

On Sat, Aug 12, 2017 at 8:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
> Answers inline
> 
> --
> Jeff Jirsa
> 
> 
>> > On Aug 12, 2017, at 2:58 PM, brian.spind...@gmail.com wrote:
>> >
>> > Hi folks, hopefully a quick one:
>> >
>> > We are running a 12 node cluster (2.1.15) in AWS with Ec2Snitch.  It's all
>> in one region but spread across 3 availability zones.  It was nicely balanced
>> with 4 nodes in each.
>> >
>> > But with a couple of failures and subsequent provisions to the wrong az we
>> now have a cluster with :
>> >
>> > 5 nodes in az A
>> > 5 nodes in az B
>> > 2 nodes in az C
>> >
>> > Not sure why, but when adding a third node in AZ C it fails to stream after
>> getting all the way to completion and no apparent error in logs.  I've looked
>> at a couple of bugs referring to scrubbing and possible OOM bugs due to
>> metadata writing at end of streaming (sorry don't have ticket handy).  I'm
>> worried I might not be able to do much with these since the disk space usage
>> is high and they are under a lot of load given the small number of them for
>> this rack.
> 
> You'll definitely have higher load on az C instances with rf=3 in this ratio
> 
> Streaming should still work - are you sure it's not busy doing something? Like
> building secondary index or similar? jstack thread dump would be useful, or at
> least nodetool tpstats
> 
Only other thing might be a backup.  We do incrementals x1hr and snapshots
x24h; they are shipped to s3 then links are cleaned up.  The error I get on
the node I'm trying to add to rack C is:

ERROR [main] 2017-08-12 23:54:51,546 CassandraDaemon.java:583 - Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
        at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:87) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1166) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:944) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:740) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:617) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:391) [apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:566) [apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:655) [apache-cassandra-2.1.15.jar:2.1.15]
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
        at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
        at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
        at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.jar:na]
        at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.jar:na]
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.jar:na]
        at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:209) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:185) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:413) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:700

Re: Dropping down replication factor

2017-08-12 Thread Brian Spindler
run(Thread.java:745) ~[na:1.8.0_112]
WARN  [StorageServiceShutdownHook] 2017-08-12 23:54:51,582
Gossiper.java:1462 - No local state or state is in silent shutdown, not
announcing shutdown
INFO  [StorageServiceShutdownHook] 2017-08-12 23:54:51,582
MessagingService.java:734 - Waiting for messaging service to quiesce
INFO  [ACCEPT-/10.40.17.114] 2017-08-12 23:54:51,583
MessagingService.java:1020 - MessagingService has terminated the accept()
thread

And I got this on the same node when it was bootstrapping; I ran 'nodetool
netstats' just before it shut down:

Receiving 377 files, 161928296443 bytes total. Already received 377
files, 161928296443 bytes total

TPStats on host that was streaming the data to this node:

Pool NameActive   Pending  Completed   Blocked  All
time blocked
MutationStage 1 1 4488289014 0
0
ReadStage 0 0   24486526 0
0
RequestResponseStage  0 0 3038847374 0
0
ReadRepairStage   0 01601576 0
0
CounterMutationStage  0 0  68403 0
0
MiscStage 0 0  0 0
0
AntiEntropySessions   0 0  0 0
0
HintedHandoff 0 0 18 0
0
GossipStage   0 02786892 0
0
CacheCleanupExecutor  0 0  0 0
0
InternalResponseStage 0 0  61115 0
0
CommitLogArchiver 0 0  0 0
0
CompactionExecutor483 304167 0
0
ValidationExecutor0 0  78249 0
0
MigrationStage0 0  94201 0
0
AntiEntropyStage  0 0 160505 0
0
PendingRangeCalculator0 0 30 0
0
Sampler   0 0  0 0
0
MemtableFlushWriter   0 0  71270 0
0
MemtablePostFlush 0 0 175209 0
0
MemtableReclaimMemory 0 0  81222 0
0
Native-Transport-Requests 2 0 1983565628 0
  9405444

Message type   Dropped
READ   218
RANGE_SLICE 15
_TRACE   0
MUTATION   2949001
COUNTER_MUTATION 0
BINARY   0
REQUEST_RESPONSE 0
PAGED_RANGE  0
READ_REPAIR       8571

I can get a jstack if needed.
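
If it helps, the two things being asked for could be grabbed with something like this (the pgrep pattern is just one way of finding the PID):

    nodetool compactionstats                # pending compactions / per-table progress
    jstack -l $(pgrep -f CassandraDaemon) > /tmp/cassandra-threads.txt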

>
> >
> > Rather than troubleshoot this further, what I was thinking about doing
> was:
> > - drop the replication factor on our keyspace to two
>
> Repair before you do this, or you'll lose your consistency guarantees
>

Given the load on the 2 nodes in rack C I'm hoping a repair will succeed.


> > - hopefully this would reduce load on these two remaining nodes
>
> It should, rack awareness guarantees one replica per rack if rf==num
> racks, so right now those 2 c machines have 2.5x as much data as the
> others. This will drop that requirement and drop the load significantly
>
> > - run repairs/cleanup across the cluster
> > - then shoot these two nodes in the 'c' rack
>
> Why shoot the c instances? Why not drop RF and then add 2 more C
> instances, then increase RF back to 3, run repair, then Decom the extra
> instances in a and b?
>
>
> Fair point.  I was considering staying at RF two but I think with your
points below, I should reconsider.


> > - run repairs/cleanup across the cluster
> >
> > Would this work with minimal/no disruption?
>
> The big risk of running rf=2 is that quorum==all - any gc pause or node
> restarting will make you lose HA or strong consistency guarantees.
>
> > Should I update their "rack" before hand or after ?
>
> You can't change a node's rack once it's in the cluster, it SHOULD refuse
> to start if you do that
>
> Got it.


> > What else am I not thinking about?
> >
> > My main goal atm is to get back to where the cluster is in a clean
> consistent state that allows nodes to properly bootstrap.
> >
> > Thanks for your help in advance.
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Dropping down replication factor

2017-08-12 Thread Jeff Jirsa
Answers inline

-- 
Jeff Jirsa


> On Aug 12, 2017, at 2:58 PM, brian.spind...@gmail.com wrote:
> 
> Hi folks, hopefully a quick one:
> 
> We are running a 12 node cluster (2.1.15) in AWS with Ec2Snitch.  It's all in 
> one region but spread across 3 availability zones.  It was nicely balanced 
> with 4 nodes in each.
> 
> But with a couple of failures and subsequent provisions to the wrong az we 
> now have a cluster with : 
> 
> 5 nodes in az A
> 5 nodes in az B
> 2 nodes in az C
> 
> Not sure why, but when adding a third node in AZ C it fails to stream after 
> getting all the way to completion and no apparent error in logs.  I've looked 
> at a couple of bugs referring to scrubbing and possible OOM bugs due to 
> metadata writing at end of streaming (sorry don't have ticket handy).  I'm 
> worried I might not be able to do much with these since the disk space usage 
> is high and they are under a lot of load given the small number of them for 
> this rack.

You'll definitely have higher load on az C instances with rf=3 in this ratio

Streaming should still work - are you sure it's not busy doing something? Like 
building secondary index or similar? jstack thread dump would be useful, or at 
least nodetool tpstats


> 
> Rather than troubleshoot this further, what I was thinking about doing was:
> - drop the replication factor on our keyspace to two

Repair before you do this, or you'll lose your consistency guarantees
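
For reference, the sequence being discussed would look roughly like this (keyspace name, DC name and RF value are placeholders for your own):

    # on every node, before touching RF:
    nodetool repair -pr

    -- then in cqlsh, once, from any node:
    ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 2};

    # afterwards, on every node, to drop the replicas it no longer owns:
    nodetool cleanup my_ks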

> - hopefully this would reduce load on these two remaining nodes 

It should, rack awareness guarantees one replica per rack if rf==num racks, so
right now those 2 c machines have 2.5x as much data as the others. This will 
drop that requirement and drop the load significantly 
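
(Rough arithmetic behind the 2.5x: with rack awareness each rack holds one full copy of the data, so each of the 2 nodes in C carries about 1/2 of a copy while each of the 5 nodes in A or B carries about 1/5 of a copy, and 1/2 divided by 1/5 is 2.5.)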

> - run repairs/cleanup across the cluster 
> - then shoot these two nodes in the 'c' rack

Why shoot the c instances? Why not drop RF and then add 2 more C instances, 
then increase RF back to 3, run repair, then Decom the extra instances in a and 
b?


> - run repairs/cleanup across the cluster
> 
> Would this work with minimal/no disruption? 

The big risk of running rf=2 is that quorum==all - any gc pause or node 
restarting will make you lose HA or strong consistency guarantees.
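
(The arithmetic: quorum = floor(RF/2) + 1, so RF=3 gives quorum 2 of 3 and tolerates one replica being down, while RF=2 gives quorum 2 of 2, i.e. every replica has to answer.)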

> Should I update their "rack" before hand or after ?

You can't change a node's rack once it's in the cluster, it SHOULD refuse to 
start if you do that

> What else am I not thinking about? 
> 
> My main goal atm is to get back to where the cluster is in a clean consistent 
> state that allows nodes to properly bootstrap.
> 
> Thanks for your help in advance.
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Dropping down replication factor

2017-08-12 Thread brian . spindler
Hi folks, hopefully a quick one:

We are running a 12 node cluster (2.1.15) in AWS with Ec2Snitch.  It's all in 
one region but spread across 3 availability zones.  It was nicely balanced with 
4 nodes in each.

But with a couple of failures and subsequent provisions to the wrong az we now 
have a cluster with : 

5 nodes in az A
5 nodes in az B
2 nodes in az C

Not sure why, but when adding a third node in AZ C it fails to stream after 
getting all the way to completion and no apparent error in logs.  I've looked 
at a couple of bugs referring to scrubbing and possible OOM bugs due to 
metadata writing at end of streaming (sorry don't have ticket handy).  I'm 
worried I might not be able to do much with these since the disk space usage is 
high and they are under a lot of load given the small number of them for this 
rack.

Rather than troubleshoot this further, what I was thinking about doing was:
- drop the replication factor on our keyspace to two
- hopefully this would reduce load on these two remaining nodes 
- run repairs/cleanup across the cluster 
- then shoot these two nodes in the 'c' rack
- run repairs/cleanup across the cluster

Would this work with minimal/no disruption? 
Should I update their "rack" before hand or after ?
What else am I not thinking about? 

My main goal atm is to get back to where the cluster is in a clean consistent 
state that allows nodes to properly bootstrap.

Thanks for your help in advance.
-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



RE: Question about replica and replication factor

2016-09-20 Thread Jun Wu
Great explanation!
For the single partition read, it makes sense to read data from only one 
replica. 
Thank you so much Ben!
Jun

From: ben.sla...@instaclustr.com
Date: Tue, 20 Sep 2016 05:30:43 +
Subject: Re: Question about replica and replication factor
To: wuxiaomi...@hotmail.com
CC: user@cassandra.apache.org

“replica” here means “a node that has a copy of the data for a given 
partition”. The scenario being discussed here is CL > 1. In this case, rather 
than using up network and processing capacity sending all the data from all the 
nodes required to meet the consistency level, Cassandra gets the full data from 
one replica and  checksums from the others. Only if the checksums don’t match 
the full data does Cassandra need to get full data from all the relevant 
replicas.
I think the other point here is, conceptually, you should think of the 
coordinator as splitting up any query that hits multiple partitions into a set 
of queries, one per partition (there might be some optimisations that make this 
not quite physically correct but conceptually it’s about right). Discussion 
such as the one you quote above tend to be considering a single partition read 
(which is the most common kind of read in most uses of Cassandra).
Cheers
Ben
On Tue, 20 Sep 2016 at 15:18 Jun Wu <wuxiaomi...@hotmail.com> wrote:


Yes, I think for my case, at least two nodes need to be contacted to get the 
full set of data.
But another thing comes up about dynamic snitch. It's the wrapped snitch and 
enabled by default and it'll choose the fastest/closest node to read data from. 
Another post is about 
this.http://www.datastax.com/dev/blog/dynamic-snitching-in-cassandra-past-present-and-future
 
The thing is why it still emphasizes reading data from only one replica. Below 
is from the post:
To begin, let’s first answer the most obvious question: what is dynamic 
snitching? To understand this, we’ll first recall what a snitch does. A 
snitch’s function is to determine which datacenters and racks are both written 
to and read from. So, why would that be ‘dynamic?’ This comes into play on the 
read side only (there’s nothing to be done for writes since we send them all 
and then block to until the consistency level is achieved.) When doing reads 
however, Cassandra only asks one node for the actual data, and, depending on 
consistency level and read repair chance, it asks the remaining replicas for 
checksums only. This means that it has a choice of however many replicas exist 
to ask for the actual data, and this is where the dynamic snitch goes to 
work.Since only one replica is sending the full data we need, we need to chose 
the best possible replica to ask, since if all we get back is checksums we have 
nothing useful to return to the user. The dynamic snitch handles this task by 
monitoring the performance of reads from the various replicas and choosing the 
best one based on this history.
Sent from my iPad
On Sep 20, 2016, at 00:03, Ben Slater 
<ben.sla...@instaclustr.com> wrote:

If your read operation requires data from multiple partitions and the 
partitions are spread across multiple nodes then the coordinator has the job of 
contacting the multiple nodes to get the data and return to the client. So, in 
your scenario, if you did a select * from table (with no where clause) the 
coordinator would need to contact and execute a read on at least one other node 
to satisfy the query.
Cheers
Ben
On Tue, 20 Sep 2016 at 14:50 Jun Wu <wuxiaomi...@hotmail.com> wrote:



Hi Ben,
Thanks for the quick response. 
It's clear about the example for single row/partition. However, normally 
data are not single row. Then for this case, I'm still confused. 
http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/architectureClientRequestsRead_c.html
The link above gives an example of 10 nodes cluster with RF = 3. But the 
figure and the words in the post shows that the coordinator only contact/read 
data from one replica, and operate read repair for the left replicas. 
Also, how could a read go across all nodes in the cluster? 
Thanks!
Jun


From: ben.sla...@instaclustr.com
Date: Tue, 20 Sep 2016 04:18:59 +
Subject: Re: Question about replica and replication factor
To: user@cassandra.apache.org

Each individual read (where a read is a single row or single partition) will 
read from one node (ignoring read repairs) as each partition will be contained 
entirely on a single node. To read the full set of data,  reads would hit at 
least two nodes (in practice, reads would likely end up being distributed 
across all the nodes in your cluster).
Cheers
Ben
On Tue, 20 Sep 2016 at 14:09 Jun Wu <wuxiaomi...@hotmail.com> wrote:



Hi there,
I have a question about the replica and replication factor. 
For example, I have a cluster of 6 nodes in the same data center. 
Replication factor RF is set to 3  and the consistency level is default 1. 
According to this calculator http://www.ecyrd.com/

Re: Question about replica and replication factor

2016-09-19 Thread Ben Slater
“replica” here means “a node that has a copy of the data for a given
partition”. The scenario being discussed here is CL > 1. In this case,
rather than using up network and processing capacity sending all the data
from all the nodes required to meet the consistency level, Cassandra gets
the full data from one replica and  checksums from the others. Only if the
checksums don’t match the full data does Cassandra need to get full data
from all the relevant replicas.

I think the other point here is, conceptually, you should think of the
coordinator as splitting up any query that hits multiple partitions into a
set of queries, one per partition (there might be some optimisations that
make this not quite physically correct but conceptually it’s about right).
Discussion such as the one you quote above tend to be considering a single
partition read (which is the most common kind of read in most uses of
Cassandra).
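
You can watch this happen in cqlsh with tracing turned on (keyspace, table and key below are just placeholders):

    cqlsh> CONSISTENCY QUORUM;
    cqlsh> TRACING ON;
    cqlsh> SELECT * FROM my_ks.my_table WHERE id = 42;

In the trace output you'll typically see one replica performing the data read while the other replica(s) are only sent digest requests.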

Cheers
Ben

On Tue, 20 Sep 2016 at 15:18 Jun Wu <wuxiaomi...@hotmail.com> wrote:

>
>
> Yes, I think for my case, at least two nodes need to be contacted to get
> the full set of data.
>
> But another thing comes up about dynamic snitch. It's the wrapped snitch
> and enabled by default and it'll choose the fastest/closest node to read
> data from. Another post is about this.
>
> http://www.datastax.com/dev/blog/dynamic-snitching-in-cassandra-past-present-and-future
>
>
> The thing is why it still emphasizes reading data from only one replica.
> Below is from the post:
>
> To begin, let’s first answer the most obvious question: what is dynamic
> snitching? To understand this, we’ll first recall what a snitch does. A
> snitch’s function is to determine which datacenters and racks are both
> written to and read from. So, why would that be ‘dynamic?’ This comes into
> play on the read side only (there’s nothing to be done for writes since we
> send them all and then block to until the consistency level is achieved.)
> When doing reads however, Cassandra only asks one node for the actual data,
> and, depending on consistency level and read repair chance, it asks the
> remaining replicas for checksums only. This means that it has a choice of
> however many replicas exist to ask for the actual data, and this is where
> the dynamic snitch goes to work.
>
> Since only one replica is sending the full data we need, we need to chose
> the best possible replica to ask, since if all we get back is checksums we
> have nothing useful to return to the user. The dynamic snitch handles this
> task by monitoring the performance of reads from the various replicas and
> choosing the best one based on this history.
>
> Sent from my iPad
> On Sep 20, 2016, at 00:03, Ben Slater <ben.sla...@instaclustr.com> wrote:
>
> If your read operation requires data from multiple partitions and the
> partitions are spread across multiple nodes then the coordinator has the
> job of contacting the multiple nodes to get the data and return to the
> client. So, in your scenario, if you did a select * from table (with no
> where clause) the coordinator would need to contact and execute a read on
> at least one other node to satisfy the query.
>
> Cheers
> Ben
>
> On Tue, 20 Sep 2016 at 14:50 Jun Wu <wuxiaomi...@hotmail.com> wrote:
>
>> Hi Ben,
>>
>> Thanks for the quick response.
>>
>> It's clear about the example for single row/partition. However,
>> normally data are not single row. Then for this case, I'm still confused.
>> http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/architectureClientRequestsRead_c.html
>>
>> The link above gives an example of 10 nodes cluster with RF = 3. But
>> the figure and the words in the post shows that the coordinator only
>> contact/read data from one replica, and operate read repair for the left
>> replicas.
>>
>> Also, how could a read go across all nodes in the cluster?
>>
>> Thanks!
>>
>> Jun
>>
>>
>> From: ben.sla...@instaclustr.com
>> Date: Tue, 20 Sep 2016 04:18:59 +
>> Subject: Re: Question about replica and replication factor
>> To: user@cassandra.apache.org
>>
>>
>> Each individual read (where a read is a single row or single partition)
>> will read from one node (ignoring read repairs) as each partition will be
>> contained entirely on a single node. To read the full set of data,  reads
>> would hit at least two nodes (in practice, reads would likely end up being
>> distributed across all the nodes in your cluster).
>>
>> Cheers
>> Ben
>>
>> On Tue, 20 Sep 2016 at 14:09 Jun Wu <wuxiaomi...@hotmail.com> wrote:
>>
>> Hi there,
>>
>> I have a question about the re

Re: Question about replica and replication factor

2016-09-19 Thread Jun Wu


Yes, I think for my case, at least two nodes need to be contacted to get the 
full set of data.

But another thing comes up about the dynamic snitch. It's a wrapper around the 
configured snitch, enabled by default, and it'll choose the fastest/closest 
replica to read data from. Another post is about this:
http://www.datastax.com/dev/blog/dynamic-snitching-in-cassandra-past-present-and-future
 

The thing is why it still emphasizes reading data from only one replica. Below 
is from the post:

To begin, let’s first answer the most obvious question: what is dynamic 
snitching? To understand this, we’ll first recall what a snitch does. A 
snitch’s function is to determine which datacenters and racks are both written 
to and read from. So, why would that be ‘dynamic?’ This comes into play on the 
read side only (there’s nothing to be done for writes since we send them all 
and then block to until the consistency level is achieved.) When doing reads 
however, Cassandra only asks one node for the actual data, and, depending on 
consistency level and read repair chance, it asks the remaining replicas for 
checksums only. This means that it has a choice of however many replicas exist 
to ask for the actual data, and this is where the dynamic snitch goes to work.

Since only one replica is sending the full data we need, we need to chose the 
best possible replica to ask, since if all we get back is checksums we have 
nothing useful to return to the user. The dynamic snitch handles this task by 
monitoring the performance of reads from the various replicas and choosing the 
best one based on this history.


Sent from my iPad
> On Sep 20, 2016, at 00:03, Ben Slater <ben.sla...@instaclustr.com> wrote:
> 
> If your read operation requires data from multiple partitions and the 
> partitions are spread across multiple nodes then the coordinator has the job 
> of contacting the multiple nodes to get the data and return to the client. 
> So, in your scenario, if you did a select * from table (with no where clause) 
> the coordinator would need to contact and execute a read on at least one 
> other node to satisfy the query.
> 
> Cheers
> Ben
> 
>> On Tue, 20 Sep 2016 at 14:50 Jun Wu <wuxiaomi...@hotmail.com> wrote:
>> Hi Ben,
>> 
>> Thanks for the quick response. 
>> 
>> It's clear about the example for single row/partition. However, normally 
>> data are not single row. Then for this case, I'm still confused. 
>> http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/architectureClientRequestsRead_c.html
>> 
>> The link above gives an example of 10 nodes cluster with RF = 3. But the 
>> figure and the words in the post shows that the coordinator only 
>> contact/read data from one replica, and operate read repair for the left 
>> replicas. 
>> 
>> Also, how could a read go across all nodes in the cluster? 
>> 
>>     Thanks!
>> 
>> Jun
>> 
>> 
>> From: ben.sla...@instaclustr.com
>> Date: Tue, 20 Sep 2016 04:18:59 +
>> Subject: Re: Question about replica and replication factor
>> To: user@cassandra.apache.org
>> 
>> 
>> Each individual read (where a read is a single row or single partition) will 
>> read from one node (ignoring read repairs) as each partition will be 
>> contained entirely on a single node. To read the full set of data,  reads 
>> would hit at least two nodes (in practice, reads would likely end up being 
>> distributed across all the nodes in your cluster).
>> 
>> Cheers
>> Ben
>> 
>> On Tue, 20 Sep 2016 at 14:09 Jun Wu <wuxiaomi...@hotmail.com> wrote:
>> Hi there,
>> 
>> I have a question about the replica and replication factor. 
>> 
>> For example, I have a cluster of 6 nodes in the same data center. 
>> Replication factor RF is set to 3  and the consistency level is default 1. 
>> According to this calculator http://www.ecyrd.com/cassandracalculator/, 
>> every node will store 50% of the data.
>> 
>> When I want to read all data from the cluster, how many nodes should I 
>> read from, 2 or 1? Is it 2, because each node has half data? But in the 
>> calculator it show 1: You are really reading from 1 node every time.
>> 
>>Any suggestions? Thanks!
>> 
>> Jun
>> -- 
>> 
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
> 
> -- 
> 
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798


RE: Question about replica and replication factor

2016-09-19 Thread Jun Wu
Hi Ben,
Thanks for the quick response. 
It's clear about the example for single row/partition. However, normally 
data are not single row. Then for this case, I'm still confused. 
http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/architectureClientRequestsRead_c.html
The link above gives an example of 10 nodes cluster with RF = 3. But the 
figure and the words in the post shows that the coordinator only contact/read 
data from one replica, and operate read repair for the left replicas. 
Also, how could a read go across all nodes in the cluster? 
Thanks!
Jun


From: ben.sla...@instaclustr.com
Date: Tue, 20 Sep 2016 04:18:59 +
Subject: Re: Question about replica and replication factor
To: user@cassandra.apache.org

Each individual read (where a read is a single row or single partition) will 
read from one node (ignoring read repairs) as each partition will be contained 
entirely on a single node. To read the full set of data,  reads would hit at 
least two nodes (in practice, reads would likely end up being distributed 
across all the nodes in your cluster).
Cheers
Ben
On Tue, 20 Sep 2016 at 14:09 Jun Wu <wuxiaomi...@hotmail.com> wrote:



Hi there,
I have a question about the replica and replication factor. 
For example, I have a cluster of 6 nodes in the same data center. 
Replication factor RF is set to 3  and the consistency level is default 1. 
According to this calculator http://www.ecyrd.com/cassandracalculator/, every 
node will store 50% of the data.
When I want to read all data from the cluster, how many nodes should I read 
from, 2 or 1? Is it 2, because each node has half data? But in the calculator 
it show 1: You are really reading from 1 node every time.
   Any suggestions? Thanks!
Jun
-- 
Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798

Re: Question about replica and replication factor

2016-09-19 Thread Ben Slater
Each individual read (where a read is a single row or single partition)
will read from one node (ignoring read repairs) as each partition will be
contained entirely on a single node. To read the full set of data,  reads
would hit at least two nodes (in practice, reads would likely end up being
distributed across all the nodes in your cluster).
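
(For the numbers in your example: with RF=3 on 6 nodes each node ends up holding roughly RF/N = 3/6 = 50% of the total data, which is what the calculator shows; its "reading from 1 node" figure refers to a single-partition read, not to scanning the whole data set.)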

Cheers
Ben

On Tue, 20 Sep 2016 at 14:09 Jun Wu <wuxiaomi...@hotmail.com> wrote:

> Hi there,
>
> I have a question about the replica and replication factor.
>
> For example, I have a cluster of 6 nodes in the same data center.
> Replication factor RF is set to 3  and the consistency level is default 1.
> According to this calculator http://www.ecyrd.com/cassandracalculator/,
> every node will store 50% of the data.
>
> When I want to read all data from the cluster, how many nodes should I
> read from, 2 or 1? Is it 2, because each node has half data? But in the
> calculator it show 1: You are really reading from 1 node every time.
>
>Any suggestions? Thanks!
>
> Jun
>
-- 

Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798


Question about replica and replication factor

2016-09-19 Thread Jun Wu
Hi there,
I have a question about the replica and replication factor. 
For example, I have a cluster of 6 nodes in the same data center. 
Replication factor RF is set to 3  and the consistency level is default 1. 
According to this calculator http://www.ecyrd.com/cassandracalculator/, every 
node will store 50% of the data.
When I want to read all data from the cluster, how many nodes should I read 
from, 2 or 1? Is it 2, because each node has half data? But in the calculator 
it show 1: You are really reading from 1 node every time.
   Any suggestions? Thanks!
Jun   

Re: Increasing replication factor and repair doesn't seem to work

2016-05-25 Thread Luke Jolly
lly" <l...@getadmiral.com> wrote:
>>>>>
>>>>>> So I guess the problem may have been with the initial addition of the
>>>>>> 10.128.0.20 node because when I added it in it never synced data I
>>>>>> guess?  It was at around 50 MB when it first came up and transitioned to
>>>>>> "UN". After it was in I did the 1->2 replication change and tried repair
>>>>>> but it didn't fix it.  From what I can tell all the data on it is stuff
>>>>>> that has been written since it came up.  We never delete data ever so we
>>>>>> should have zero tombstones.
>>>>>>
>>>>>> If I am not mistaken, only two of my nodes actually have all the
>>>>>> data, 10.128.0.3 and 10.142.0.14 since they agree on the data amount.
>>>>>> 10.142.0.13 is almost a GB lower and then of course 10.128.0.20
>>>>>> which is missing over 5 GB of data.  I tried running nodetool -local on
>>>>>> both DCs and it didn't fix either one.
>>>>>>
>>>>>> Am I running into a bug of some kind?
>>>>>>
>>>>>> On Tue, May 24, 2016 at 4:06 PM Bhuvan Rawal <bhu1ra...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Luke,
>>>>>>>
>>>>>>> You mentioned that replication factor was increased from 1 to 2. In
>>>>>>> that case was the node bearing ip 10.128.0.20 carried around 3GB data
>>>>>>> earlier?
>>>>>>>
>>>>>>> You can run nodetool repair with option -local to initiate repair
>>>>>>> local datacenter for gce-us-central1.
>>>>>>>
>>>>>>> Also you may suspect that if a lot of data was deleted while the
>>>>>>> node was down it may be having a lot of tombstones which is not needed 
>>>>>>> to
>>>>>>> be replicated to the other node. In order to verify the same, you can 
>>>>>>> issue
>>>>>>> a select count(*) query on column families (With the amount of data you
>>>>>>> have it should not be an issue) with tracing on and with consistency
>>>>>>> local_all by connecting to either 10.128.0.3  or 10.128.0.20 and
>>>>>>> store it in a file. It will give you a fair amount of idea about how 
>>>>>>> many
>>>>>>> deleted cells the nodes have. I tried searching for reference if 
>>>>>>> tombstones
>>>>>>> are moved around during repair, but I didnt find evidence of it. 
>>>>>>> However I
>>>>>>> see no reason to because if the node didnt have data then streaming
>>>>>>> tombstones does not make a lot of sense.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Bhuvan
>>>>>>>
>>>>>>> On Tue, May 24, 2016 at 11:06 PM, Luke Jolly <l...@getadmiral.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Here's my setup:
>>>>>>>>
>>>>>>>> Datacenter: gce-us-central1
>>>>>>>> ===
>>>>>>>> Status=Up/Down
>>>>>>>> |/ State=Normal/Leaving/Joining/Moving
>>>>>>>> --  Address  Load   Tokens   Owns (effective)  Host ID
>>>>>>>>   Rack
>>>>>>>> UN  10.128.0.3   6.4 GB 256  100.0%
>>>>>>>>  3317a3de-9113-48e2-9a85-bbf756d7a4a6  default
>>>>>>>> UN  10.128.0.20  943.08 MB  256  100.0%
>>>>>>>>  958348cb-8205-4630-8b96-0951bf33f3d3  default
>>>>>>>> Datacenter: gce-us-east1
>>>>>>>> 
>>>>>>>> Status=Up/Down
>>>>>>>> |/ State=Normal/Leaving/Joining/Moving
>>>>>>>> --  Address  Load   Tokens   Owns (effective)  Host ID
>>>>>>>>   Rack
>>>>>>>> UN  10.142.0.14  6.4 GB 256  100.0%
>>>>>>>>  c3a5c39d-e1c9-4116-903d-b6d1b23fb652  default
>>>>>>>> UN  10.142.0.13  5.55 GB256  100.0%
>>>>>>>>  d0d9c30e-1506-4b95-be64-3dd4d78f0583  default
>>>>>>>>
>>>>>>>> And my replication settings are:
>>>>>>>>
>>>>>>>> {'class': 'NetworkTopologyStrategy', 'aws-us-west': '2',
>>>>>>>> 'gce-us-central1': '2', 'gce-us-east1': '2'}
>>>>>>>>
>>>>>>>> As you can see 10.128.0.20 in the gce-us-central1 DC only has a
>>>>>>>> load of 943 MB even though it's supposed to own 100% and should have 
>>>>>>>> 6.4
>>>>>>>> GB.  Also 10.142.0.13 seems also not to have everything as it only
>>>>>>>> has a load of 5.55 GB.
>>>>>>>>
>>>>>>>> On Mon, May 23, 2016 at 7:28 PM, kurt Greaves <k...@instaclustr.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Do you have 1 node in each DC or 2? If you're saying you have 1
>>>>>>>>> node in each DC then a RF of 2 doesn't make sense. Can you clarify on 
>>>>>>>>> what
>>>>>>>>> your set up is?
>>>>>>>>>
>>>>>>>>> On 23 May 2016 at 19:31, Luke Jolly <l...@getadmiral.com> wrote:
>>>>>>>>>
>>>>>>>>>> I am running 3.0.5 with 2 nodes in two DCs, gce-us-central1 and
>>>>>>>>>> gce-us-east1.  I increased the replication factor of gce-us-central1 
>>>>>>>>>> from 1
>>>>>>>>>> to 2.  Then I ran 'nodetool repair -dc gce-us-central1'.  The
>>>>>>>>>> "Owns" for the node switched to 100% as it should but the Load 
>>>>>>>>>> showed that
>>>>>>>>>> it didn't actually sync the data.  I then ran a full 'nodetool 
>>>>>>>>>> repair' and
>>>>>>>>>> it didn't fix it still.  This scares me as I thought 'nodetool 
>>>>>>>>>> repair' was
>>>>>>>>>> a way to assure consistency and that all the nodes were synced but it
>>>>>>>>>> doesn't seem to be.  Outside of that command, I have no idea how I 
>>>>>>>>>> would
>>>>>>>>>> assure all the data was synced or how to get the data correctly 
>>>>>>>>>> synced
>>>>>>>>>> without decommissioning the node and re-adding it.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Kurt Greaves
>>>>>>>>> k...@instaclustr.com
>>>>>>>>> www.instaclustr.com
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>>>>
>>>> --
>>>> Kurt Greaves
>>>> k...@instaclustr.com
>>>> www.instaclustr.com
>>>>
>>>
>>>
>>


Re: Increasing replication factor and repair doesn't seem to work

2016-05-25 Thread Luke Jolly
So I figured out the main cause of the problem.  The seed node was itself.
That's what got it in a weird state.  The second part was that I didn't
know the default repair is incremental, as I was accidentally looking at the
wrong version of the documentation.  After running a repair -full, the 3 other
nodes are synced correctly it seems as they have identical loads.
Strangely, now the problem 10.128.0.20 node has 10 GB of load (the others
have 6 GB).  Since I now know I started it off in a very weird state, I'm
going to just decommission it and add it back in from scratch.  When I
added it, all working folders were cleared.

I feel Cassandra should throw an error if the seed node is set to itself
and fail to bootstrap / join?
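
For anyone hitting the same thing, the quickest check is the seed list on the joining node (the path below assumes a package install; adjust to wherever cassandra.yaml lives):

    grep -A 4 "seed_provider" /etc/cassandra/cassandra.yaml

The "seeds:" line should point at one or two already-running nodes (e.g. "10.128.0.3,10.142.0.14"), not at the address of the node being bootstrapped.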

On Wed, May 25, 2016 at 2:37 AM Mike Yeap <wkk1...@gmail.com> wrote:

> Hi Luke, I've encountered similar problem before, could you please advise
> on following?
>
> 1) when you add 10.128.0.20, what are the seeds defined in cassandra.yaml?
>
> 2) when you add 10.128.0.20, were the data and cache directories in
> 10.128.0.20 empty?
>
>- /var/lib/cassandra/data
>- /var/lib/cassandra/saved_caches
>
> 3) if you do a compact in 10.128.0.3, what is the size shown in "Load"
> column in "nodetool status "?
>
> 4) when you do the full repair, did you use "nodetool repair" or "nodetool
> repair -full"? I'm asking this because Incremental Repair is the default
> for Cassandra 2.2 and later.
>
>
> Regards,
> Mike Yeap
>
> On Wed, May 25, 2016 at 8:01 AM, Bryan Cheng <br...@blockcypher.com>
> wrote:
>
>> Hi Luke,
>>
>> I've never found nodetool status' load to be useful beyond a general
>> indicator.
>>
>> You should expect some small skew, as this will depend on your current
>> compaction status, tombstones, etc. IIRC repair will not provide
>> consistency of intermediate states nor will it remove tombstones, it only
>> guarantees consistency in the final state. This means, in the case of
>> dropped hints or mutations, you will see differences in intermediate
>> states, and therefore storage footprint, even in fully repaired nodes. This
>> includes intermediate UPDATE operations as well.
>>
>> Your one node with sub 1GB sticks out like a sore thumb, though. Where
>> did you originate the nodetool repair from? Remember that repair will only
>> ensure consistency for ranges held by the node you're running it on. While
>> I am not sure if missing ranges are included in this, if you ran nodetool
>> repair only on a machine with partial ownership, you will need to complete
>> repairs across the ring before data will return to full consistency.
>>
>> I would query some older data using consistency = ONE on the affected
>> machine to determine if you are actually missing data.  There are a few
>> outstanding bugs in the 2.1.x  and older release families that may result
>> in tombstone creation even without deletes, for example CASSANDRA-10547,
>> which impacts updates on collections in pre-2.1.13 Cassandra.
>>
>> You can also try examining the output of nodetool ring, which will give
>> you a breakdown of tokens and their associations within your cluster.
>>
>> --Bryan
>>
>> On Tue, May 24, 2016 at 3:49 PM, kurt Greaves <k...@instaclustr.com>
>> wrote:
>>
>>> Not necessarily considering RF is 2 so both nodes should have all
>>> partitions. Luke, are you sure the repair is succeeding? You don't have
>>> other keyspaces/duplicate data/extra data in your cassandra data directory?
>>> Also, you could try querying on the node with less data to confirm if it
>>> has the same dataset.
>>>
>>> On 24 May 2016 at 22:03, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>>>
>>>> For the other DC, it can be acceptable because partition reside on one
>>>> node, so say  if you have a large partition, it may skew things a bit.
>>>> On May 25, 2016 2:41 AM, "Luke Jolly" <l...@getadmiral.com> wrote:
>>>>
>>>>> So I guess the problem may have been with the initial addition of the
>>>>> 10.128.0.20 node because when I added it in it never synced data I
>>>>> guess?  It was at around 50 MB when it first came up and transitioned to
>>>>> "UN". After it was in I did the 1->2 replication change and tried repair
>>>>> but it didn't fix it.  From what I can tell all the data on it is stuff
>>>>> that has been written since it came up.  We never delete data ever so we
>>>>> should have zero tombstones.
>>>>>
>>>>> 

Re: Increasing replication factor and repair doesn't seem to work

2016-05-25 Thread Mike Yeap
Hi Luke, I've encountered similar problem before, could you please advise
on following?

1) when you add 10.128.0.20, what are the seeds defined in cassandra.yaml?

2) when you add 10.128.0.20, were the data and cache directories in
10.128.0.20 empty?

   - /var/lib/cassandra/data
   - /var/lib/cassandra/saved_caches

3) if you do a compact in 10.128.0.3, what is the size shown in "Load"
column in "nodetool status "?

4) when you do the full repair, did you use "nodetool repair" or "nodetool
repair -full"? I'm asking this because Incremental Repair is the default
for Cassandra 2.2 and later.
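
For example (keyspace name is a placeholder):

    nodetool repair -full my_keyspace   # full repair, the pre-2.2 behaviour of plain "nodetool repair"
    nodetool repair my_keyspace         # incremental repair, the default from 2.2 onwards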


Regards,
Mike Yeap

On Wed, May 25, 2016 at 8:01 AM, Bryan Cheng <br...@blockcypher.com> wrote:

> Hi Luke,
>
> I've never found nodetool status' load to be useful beyond a general
> indicator.
>
> You should expect some small skew, as this will depend on your current
> compaction status, tombstones, etc. IIRC repair will not provide
> consistency of intermediate states nor will it remove tombstones, it only
> guarantees consistency in the final state. This means, in the case of
> dropped hints or mutations, you will see differences in intermediate
> states, and therefore storage footprint, even in fully repaired nodes. This
> includes intermediate UPDATE operations as well.
>
> Your one node with sub 1GB sticks out like a sore thumb, though. Where did
> you originate the nodetool repair from? Remember that repair will only
> ensure consistency for ranges held by the node you're running it on. While
> I am not sure if missing ranges are included in this, if you ran nodetool
> repair only on a machine with partial ownership, you will need to complete
> repairs across the ring before data will return to full consistency.
>
> I would query some older data using consistency = ONE on the affected
> machine to determine if you are actually missing data.  There are a few
> outstanding bugs in the 2.1.x  and older release families that may result
> in tombstone creation even without deletes, for example CASSANDRA-10547,
> which impacts updates on collections in pre-2.1.13 Cassandra.
>
> You can also try examining the output of nodetool ring, which will give
> you a breakdown of tokens and their associations within your cluster.
>
> --Bryan
>
> On Tue, May 24, 2016 at 3:49 PM, kurt Greaves <k...@instaclustr.com>
> wrote:
>
>> Not necessarily considering RF is 2 so both nodes should have all
>> partitions. Luke, are you sure the repair is succeeding? You don't have
>> other keyspaces/duplicate data/extra data in your cassandra data directory?
>> Also, you could try querying on the node with less data to confirm if it
>> has the same dataset.
>>
>> On 24 May 2016 at 22:03, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>>
>>> For the other DC, it can be acceptable because partition reside on one
>>> node, so say  if you have a large partition, it may skew things a bit.
>>> On May 25, 2016 2:41 AM, "Luke Jolly" <l...@getadmiral.com> wrote:
>>>
>>>> So I guess the problem may have been with the initial addition of the
>>>> 10.128.0.20 node because when I added it in it never synced data I
>>>> guess?  It was at around 50 MB when it first came up and transitioned to
>>>> "UN". After it was in I did the 1->2 replication change and tried repair
>>>> but it didn't fix it.  From what I can tell all the data on it is stuff
>>>> that has been written since it came up.  We never delete data ever so we
>>>> should have zero tombstones.
>>>>
>>>> If I am not mistaken, only two of my nodes actually have all the data,
>>>> 10.128.0.3 and 10.142.0.14 since they agree on the data amount. 10.142.0.13
>>>> is almost a GB lower and then of course 10.128.0.20 which is missing
>>>> over 5 GB of data.  I tried running nodetool -local on both DCs and it
>>>> didn't fix either one.
>>>>
>>>> Am I running into a bug of some kind?
>>>>
>>>> On Tue, May 24, 2016 at 4:06 PM Bhuvan Rawal <bhu1ra...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Luke,
>>>>>
>>>>> You mentioned that replication factor was increased from 1 to 2. In
>>>>> that case was the node bearing ip 10.128.0.20 carried around 3GB data
>>>>> earlier?
>>>>>
>>>>> You can run nodetool repair with option -local to initiate repair
>>>>> local datacenter for gce-us-central1.
>>>>>
>>>>> Also you may suspect that if a lot of data was deleted while the no

Re: Increasing replication factor and repair doesn't seem to work

2016-05-24 Thread Bryan Cheng
Hi Luke,

I've never found nodetool status' load to be useful beyond a general
indicator.

You should expect some small skew, as this will depend on your current
compaction status, tombstones, etc. IIRC repair will not provide
consistency of intermediate states nor will it remove tombstones, it only
guarantees consistency in the final state. This means, in the case of
dropped hints or mutations, you will see differences in intermediate
states, and therefore storage footprint, even in fully repaired nodes. This
includes intermediate UPDATE operations as well.

Your one node with sub 1GB sticks out like a sore thumb, though. Where did
you originate the nodetool repair from? Remember that repair will only
ensure consistency for ranges held by the node you're running it on. While
I am not sure if missing ranges are included in this, if you ran nodetool
repair only on a machine with partial ownership, you will need to complete
repairs across the ring before data will return to full consistency.

I would query some older data using consistency = ONE on the affected
machine to determine if you are actually missing data.  There are a few
outstanding bugs in the 2.1.x  and older release families that may result
in tombstone creation even without deletes, for example CASSANDRA-10547,
which impacts updates on collections in pre-2.1.13 Cassandra.

You can also try examining the output of nodetool ring, which will give you
a breakdown of tokens and their associations within your cluster.
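
Concretely, something along these lines (keyspace, table and key are placeholders; pick a key you know was written well before the RF change):

    cqlsh 10.128.0.20
    cqlsh> CONSISTENCY ONE;
    cqlsh> SELECT * FROM my_ks.my_table WHERE id = 42;

    nodetool -h 10.128.0.20 ring my_ks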

--Bryan

On Tue, May 24, 2016 at 3:49 PM, kurt Greaves <k...@instaclustr.com> wrote:

> Not necessarily considering RF is 2 so both nodes should have all
> partitions. Luke, are you sure the repair is succeeding? You don't have
> other keyspaces/duplicate data/extra data in your cassandra data directory?
> Also, you could try querying on the node with less data to confirm if it
> has the same dataset.
>
> On 24 May 2016 at 22:03, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>
>> For the other DC, it can be acceptable because partition reside on one
>> node, so say  if you have a large partition, it may skew things a bit.
>> On May 25, 2016 2:41 AM, "Luke Jolly" <l...@getadmiral.com> wrote:
>>
>>> So I guess the problem may have been with the initial addition of the
>>> 10.128.0.20 node because when I added it in it never synced data I
>>> guess?  It was at around 50 MB when it first came up and transitioned to
>>> "UN". After it was in I did the 1->2 replication change and tried repair
>>> but it didn't fix it.  From what I can tell all the data on it is stuff
>>> that has been written since it came up.  We never delete data ever so we
>>> should have zero tombstones.
>>>
>>> If I am not mistaken, only two of my nodes actually have all the data,
>>> 10.128.0.3 and 10.142.0.14 since they agree on the data amount. 10.142.0.13
>>> is almost a GB lower and then of course 10.128.0.20 which is missing
>>> over 5 GB of data.  I tried running nodetool -local on both DCs and it
>>> didn't fix either one.
>>>
>>> Am I running into a bug of some kind?
>>>
>>> On Tue, May 24, 2016 at 4:06 PM Bhuvan Rawal <bhu1ra...@gmail.com>
>>> wrote:
>>>
>>>> Hi Luke,
>>>>
>>>> You mentioned that replication factor was increased from 1 to 2. In
>>>> that case was the node bearing ip 10.128.0.20 carried around 3GB data
>>>> earlier?
>>>>
>>>> You can run nodetool repair with option -local to initiate repair local
>>>> datacenter for gce-us-central1.
>>>>
>>>> Also you may suspect that if a lot of data was deleted while the node
>>>> was down it may be having a lot of tombstones which is not needed to be
>>>> replicated to the other node. In order to verify the same, you can issue a
>>>> select count(*) query on column families (With the amount of data you have
>>>> it should not be an issue) with tracing on and with consistency local_all
>>>> by connecting to either 10.128.0.3  or 10.128.0.20 and store it in a
>>>> file. It will give you a fair amount of idea about how many deleted cells
>>>> the nodes have. I tried searching for reference if tombstones are moved
>>>> around during repair, but I didnt find evidence of it. However I see no
>>>> reason to because if the node didnt have data then streaming tombstones
>>>> does not make a lot of sense.
>>>>
>>>> Regards,
>>>> Bhuvan
>>>>
>>>> On Tue, May 24, 2016 at 11:06 PM, Luke Jolly <l...@getadmiral.com>
>>>> wrote:
>>>>
>

Re: Increasing replication factor and repair doesn't seem to work

2016-05-24 Thread kurt Greaves
Not necessarily considering RF is 2 so both nodes should have all
partitions. Luke, are you sure the repair is succeeding? You don't have
other keyspaces/duplicate data/extra data in your cassandra data directory?
Also, you could try querying on the node with less data to confirm if it
has the same dataset.

On 24 May 2016 at 22:03, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:

> For the other DC, it can be acceptable because partition reside on one
> node, so say  if you have a large partition, it may skew things a bit.
> On May 25, 2016 2:41 AM, "Luke Jolly" <l...@getadmiral.com> wrote:
>
>> So I guess the problem may have been with the initial addition of the
>> 10.128.0.20 node because when I added it in it never synced data I
>> guess?  It was at around 50 MB when it first came up and transitioned to
>> "UN". After it was in I did the 1->2 replication change and tried repair
>> but it didn't fix it.  From what I can tell all the data on it is stuff
>> that has been written since it came up.  We never delete data ever so we
>> should have zero tombstones.
>>
>> If I am not mistaken, only two of my nodes actually have all the data,
>> 10.128.0.3 and 10.142.0.14 since they agree on the data amount. 10.142.0.13
>> is almost a GB lower and then of course 10.128.0.20 which is missing
>> over 5 GB of data.  I tried running nodetool -local on both DCs and it
>> didn't fix either one.
>>
>> Am I running into a bug of some kind?
>>
>> On Tue, May 24, 2016 at 4:06 PM Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>>
>>> Hi Luke,
>>>
>>> You mentioned that replication factor was increased from 1 to 2. In that
>>> case was the node bearing ip 10.128.0.20 carried around 3GB data earlier?
>>>
>>> You can run nodetool repair with option -local to initiate repair local
>>> datacenter for gce-us-central1.
>>>
>>> Also you may suspect that if a lot of data was deleted while the node
>>> was down it may be having a lot of tombstones which is not needed to be
>>> replicated to the other node. In order to verify the same, you can issue a
>>> select count(*) query on column families (With the amount of data you have
>>> it should not be an issue) with tracing on and with consistency local_all
>>> by connecting to either 10.128.0.3  or 10.128.0.20 and store it in a
>>> file. It will give you a fair amount of idea about how many deleted cells
>>> the nodes have. I tried searching for reference if tombstones are moved
>>> around during repair, but I didnt find evidence of it. However I see no
>>> reason to because if the node didnt have data then streaming tombstones
>>> does not make a lot of sense.
>>>
>>> Regards,
>>> Bhuvan
>>>
>>> On Tue, May 24, 2016 at 11:06 PM, Luke Jolly <l...@getadmiral.com>
>>> wrote:
>>>
>>>> Here's my setup:
>>>>
>>>> Datacenter: gce-us-central1
>>>> ===
>>>> Status=Up/Down
>>>> |/ State=Normal/Leaving/Joining/Moving
>>>> --  Address  Load   Tokens   Owns (effective)  Host ID
>>>>   Rack
>>>> UN  10.128.0.3   6.4 GB 256  100.0%
>>>>  3317a3de-9113-48e2-9a85-bbf756d7a4a6  default
>>>> UN  10.128.0.20  943.08 MB  256  100.0%
>>>>  958348cb-8205-4630-8b96-0951bf33f3d3  default
>>>> Datacenter: gce-us-east1
>>>> 
>>>> Status=Up/Down
>>>> |/ State=Normal/Leaving/Joining/Moving
>>>> --  Address  Load   Tokens   Owns (effective)  Host ID
>>>>   Rack
>>>> UN  10.142.0.14  6.4 GB 256  100.0%
>>>>  c3a5c39d-e1c9-4116-903d-b6d1b23fb652  default
>>>> UN  10.142.0.13  5.55 GB256  100.0%
>>>>  d0d9c30e-1506-4b95-be64-3dd4d78f0583  default
>>>>
>>>> And my replication settings are:
>>>>
>>>> {'class': 'NetworkTopologyStrategy', 'aws-us-west': '2',
>>>> 'gce-us-central1': '2', 'gce-us-east1': '2'}
>>>>
>>>> As you can see 10.128.0.20 in the gce-us-central1 DC only has a load
>>>> of 943 MB even though it's supposed to own 100% and should have 6.4 GB.
>>>> Also 10.142.0.13 seems also not to have everything as it only has a
>>>> load of 5.55 GB.
>>>>
>>>> On Mon, May 23, 2016 at 7:28 PM, kurt Greaves <k...@instaclustr.com>
>>>> wrote:
>>>

Re: Increasing replication factor and repair doesn't seem to work

2016-05-24 Thread Bhuvan Rawal
For the other DC it can be acceptable, because each partition resides on one
node, so if you have a large partition it may skew things a bit.
On May 25, 2016 2:41 AM, "Luke Jolly" <l...@getadmiral.com> wrote:

> So I guess the problem may have been with the initial addition of the
> 10.128.0.20 node because when I added it in it never synced data I
> guess?  It was at around 50 MB when it first came up and transitioned to
> "UN". After it was in I did the 1->2 replication change and tried repair
> but it didn't fix it.  From what I can tell all the data on it is stuff
> that has been written since it came up.  We never delete data ever so we
> should have zero tombstones.
>
> If I am not mistaken, only two of my nodes actually have all the data,
> 10.128.0.3 and 10.142.0.14 since they agree on the data amount. 10.142.0.13
> is almost a GB lower and then of course 10.128.0.20 which is missing over
> 5 GB of data.  I tried running nodetool -local on both DCs and it didn't
> fix either one.
>
> Am I running into a bug of some kind?
>
> On Tue, May 24, 2016 at 4:06 PM Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>
>> Hi Luke,
>>
>> You mentioned that replication factor was increased from 1 to 2. In that
>> case was the node bearing ip 10.128.0.20 carried around 3GB data earlier?
>>
>> You can run nodetool repair with option -local to initiate repair local
>> datacenter for gce-us-central1.
>>
>> Also you may suspect that if a lot of data was deleted while the node was
>> down it may be having a lot of tombstones which is not needed to be
>> replicated to the other node. In order to verify the same, you can issue a
>> select count(*) query on column families (With the amount of data you have
>> it should not be an issue) with tracing on and with consistency local_all
>> by connecting to either 10.128.0.3  or 10.128.0.20 and store it in a
>> file. It will give you a fair amount of idea about how many deleted cells
>> the nodes have. I tried searching for reference if tombstones are moved
>> around during repair, but I didnt find evidence of it. However I see no
>> reason to because if the node didnt have data then streaming tombstones
>> does not make a lot of sense.
>>
>> Regards,
>> Bhuvan
>>
>> On Tue, May 24, 2016 at 11:06 PM, Luke Jolly <l...@getadmiral.com> wrote:
>>
>>> Here's my setup:
>>>
>>> Datacenter: gce-us-central1
>>> ===
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address  Load   Tokens   Owns (effective)  Host ID
>>> Rack
>>> UN  10.128.0.3   6.4 GB 256  100.0%
>>>  3317a3de-9113-48e2-9a85-bbf756d7a4a6  default
>>> UN  10.128.0.20  943.08 MB  256  100.0%
>>>  958348cb-8205-4630-8b96-0951bf33f3d3  default
>>> Datacenter: gce-us-east1
>>> 
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address  Load   Tokens   Owns (effective)  Host ID
>>> Rack
>>> UN  10.142.0.14  6.4 GB 256  100.0%
>>>  c3a5c39d-e1c9-4116-903d-b6d1b23fb652  default
>>> UN  10.142.0.13  5.55 GB256  100.0%
>>>  d0d9c30e-1506-4b95-be64-3dd4d78f0583  default
>>>
>>> And my replication settings are:
>>>
>>> {'class': 'NetworkTopologyStrategy', 'aws-us-west': '2',
>>> 'gce-us-central1': '2', 'gce-us-east1': '2'}
>>>
>>> As you can see 10.128.0.20 in the gce-us-central1 DC only has a load of
>>> 943 MB even though it's supposed to own 100% and should have 6.4 GB.  Also 
>>> 10.142.0.13
>>> seems also not to have everything as it only has a load of 5.55 GB.
>>>
>>> On Mon, May 23, 2016 at 7:28 PM, kurt Greaves <k...@instaclustr.com>
>>> wrote:
>>>
>>>> Do you have 1 node in each DC or 2? If you're saying you have 1 node in
>>>> each DC then a RF of 2 doesn't make sense. Can you clarify on what your set
>>>> up is?
>>>>
>>>> On 23 May 2016 at 19:31, Luke Jolly <l...@getadmiral.com> wrote:
>>>>
>>>>> I am running 3.0.5 with 2 nodes in two DCs, gce-us-central1 and
>>>>> gce-us-east1.  I increased the replication factor of gce-us-central1 from 
>>>>> 1
>>>>> to 2.  Then I ran 'nodetool repair -dc gce-us-central1'.  The "Owns"
>>>>> for the node switched to 100% as it should but the Load showed that it
>>>>> didn't actually sync the data.  I then ran a full 'nodetool repair' and it
>>>>> didn't fix it still.  This scares me as I thought 'nodetool repair' was a
>>>>> way to assure consistency and that all the nodes were synced but it 
>>>>> doesn't
>>>>> seem to be.  Outside of that command, I have no idea how I would assure 
>>>>> all
>>>>> the data was synced or how to get the data correctly synced without
>>>>> decommissioning the node and re-adding it.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Kurt Greaves
>>>> k...@instaclustr.com
>>>> www.instaclustr.com
>>>>
>>>
>>>
>>


Re: Increasing replication factor and repair doesn't seem to work

2016-05-24 Thread Luke Jolly
So I guess the problem may have been with the initial addition of the
10.128.0.20 node, because when I added it, it never synced data. It was at
around 50 MB when it first came up and transitioned to "UN". After it was in,
I did the 1->2 replication change and tried repair, but that didn't fix it.
From what I can tell, all the data on it is data that has been written since
it came up. We never delete data, so we should have zero tombstones.

If I am not mistaken, only two of my nodes actually have all the data,
10.128.0.3 and 10.142.0.14, since they agree on the data amount. 10.142.0.13
is almost a GB lower, and then of course there is 10.128.0.20, which is
missing over 5 GB of data. I tried running nodetool repair -local on both DCs
and it didn't fix either one.

Am I running into a bug of some kind?
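
Before concluding it is a bug, one way to check what each node should hold
after a replication change is to ask Cassandra directly. A quick sketch
(my_keyspace, my_table and 'some_key' below are placeholders):

    # ownership view computed against this keyspace's replication settings
    nodetool status my_keyspace
    # list the replicas that are supposed to hold a given partition key
    nodetool getendpoints my_keyspace my_table 'some_key'
    # watch streaming progress while a repair is running
    nodetool netstats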

On Tue, May 24, 2016 at 4:06 PM Bhuvan Rawal <bhu1ra...@gmail.com> wrote:

> Hi Luke,
>
> You mentioned that replication factor was increased from 1 to 2. In that
> case was the node bearing ip 10.128.0.20 carried around 3GB data earlier?
>
> You can run nodetool repair with option -local to initiate repair local
> datacenter for gce-us-central1.
>
> Also you may suspect that if a lot of data was deleted while the node was
> down it may be having a lot of tombstones which is not needed to be
> replicated to the other node. In order to verify the same, you can issue a
> select count(*) query on column families (With the amount of data you have
> it should not be an issue) with tracing on and with consistency local_all
> by connecting to either 10.128.0.3  or 10.128.0.20 and store it in a
> file. It will give you a fair amount of idea about how many deleted cells
> the nodes have. I tried searching for reference if tombstones are moved
> around during repair, but I didnt find evidence of it. However I see no
> reason to because if the node didnt have data then streaming tombstones
> does not make a lot of sense.
>
> Regards,
> Bhuvan
>
> On Tue, May 24, 2016 at 11:06 PM, Luke Jolly <l...@getadmiral.com> wrote:
>
>> Here's my setup:
>>
>> Datacenter: gce-us-central1
>> ===
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address  Load   Tokens   Owns (effective)  Host ID
>> Rack
>> UN  10.128.0.3   6.4 GB 256  100.0%
>>  3317a3de-9113-48e2-9a85-bbf756d7a4a6  default
>> UN  10.128.0.20  943.08 MB  256  100.0%
>>  958348cb-8205-4630-8b96-0951bf33f3d3  default
>> Datacenter: gce-us-east1
>> 
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address  Load   Tokens   Owns (effective)  Host ID
>> Rack
>> UN  10.142.0.14  6.4 GB 256  100.0%
>>  c3a5c39d-e1c9-4116-903d-b6d1b23fb652  default
>> UN  10.142.0.13  5.55 GB256  100.0%
>>  d0d9c30e-1506-4b95-be64-3dd4d78f0583  default
>>
>> And my replication settings are:
>>
>> {'class': 'NetworkTopologyStrategy', 'aws-us-west': '2',
>> 'gce-us-central1': '2', 'gce-us-east1': '2'}
>>
>> As you can see 10.128.0.20 in the gce-us-central1 DC only has a load of
>> 943 MB even though it's supposed to own 100% and should have 6.4 GB.  Also 
>> 10.142.0.13
>> seems also not to have everything as it only has a load of 5.55 GB.
>>
>> On Mon, May 23, 2016 at 7:28 PM, kurt Greaves <k...@instaclustr.com>
>> wrote:
>>
>>> Do you have 1 node in each DC or 2? If you're saying you have 1 node in
>>> each DC then a RF of 2 doesn't make sense. Can you clarify on what your set
>>> up is?
>>>
>>> On 23 May 2016 at 19:31, Luke Jolly <l...@getadmiral.com> wrote:
>>>
>>>> I am running 3.0.5 with 2 nodes in two DCs, gce-us-central1 and
>>>> gce-us-east1.  I increased the replication factor of gce-us-central1 from 1
>>>> to 2.  Then I ran 'nodetool repair -dc gce-us-central1'.  The "Owns"
>>>> for the node switched to 100% as it should but the Load showed that it
>>>> didn't actually sync the data.  I then ran a full 'nodetool repair' and it
>>>> didn't fix it still.  This scares me as I thought 'nodetool repair' was a
>>>> way to assure consistency and that all the nodes were synced but it doesn't
>>>> seem to be.  Outside of that command, I have no idea how I would assure all
>>>> the data was synced or how to get the data correctly synced without
>>>> decommissioning the node and re-adding it.
>>>>
>>>
>>>
>>>
>>> --
>>> Kurt Greaves
>>> k...@instaclustr.com
>>> www.instaclustr.com
>>>
>>
>>
>


Re: Increasing replication factor and repair doesn't seem to work

2016-05-24 Thread Bhuvan Rawal
Hi Luke,

You mentioned that the replication factor was increased from 1 to 2. In that
case, was the node bearing IP 10.128.0.20 carrying around 3 GB of data earlier?

You can run nodetool repair with the -local option to initiate a repair of the
local datacenter, gce-us-central1.

You may also suspect that, if a lot of data was deleted while the node was
down, it is holding a lot of tombstones that do not need to be replicated to
the other node. To verify this, you can issue a select count(*) query on the
column families (with the amount of data you have it should not be an issue)
with tracing on and consistency ALL, connecting to either 10.128.0.3 or
10.128.0.20, and store the output in a file. It will give you a fair idea of
how many deleted cells the nodes have. I tried searching for a reference on
whether tombstones are moved around during repair, but I didn't find evidence
of it. However, I see no reason why they would be, because if the node didn't
have the data, streaming tombstones does not make a lot of sense.

Regards,
Bhuvan
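
For reference, a minimal cqlsh sketch of the tombstone check described above
(my_keyspace and my_table are placeholders; ALL is used so that every replica
is consulted):

    CONSISTENCY ALL;
    TRACING ON;
    SELECT count(*) FROM my_keyspace.my_table;
    -- the trace output typically reports, per replica, how many live cells
    -- and tombstone cells were read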

On Tue, May 24, 2016 at 11:06 PM, Luke Jolly <l...@getadmiral.com> wrote:

> Here's my setup:
>
> Datacenter: gce-us-central1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address  Load   Tokens   Owns (effective)  Host ID
>   Rack
> UN  10.128.0.3   6.4 GB 256  100.0%
>  3317a3de-9113-48e2-9a85-bbf756d7a4a6  default
> UN  10.128.0.20  943.08 MB  256  100.0%
>  958348cb-8205-4630-8b96-0951bf33f3d3  default
> Datacenter: gce-us-east1
> 
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address  Load   Tokens   Owns (effective)  Host ID
>   Rack
> UN  10.142.0.14  6.4 GB 256  100.0%
>  c3a5c39d-e1c9-4116-903d-b6d1b23fb652  default
> UN  10.142.0.13  5.55 GB256  100.0%
>  d0d9c30e-1506-4b95-be64-3dd4d78f0583  default
>
> And my replication settings are:
>
> {'class': 'NetworkTopologyStrategy', 'aws-us-west': '2',
> 'gce-us-central1': '2', 'gce-us-east1': '2'}
>
> As you can see 10.128.0.20 in the gce-us-central1 DC only has a load of
> 943 MB even though it's supposed to own 100% and should have 6.4 GB.  Also 
> 10.142.0.13
> seems also not to have everything as it only has a load of 5.55 GB.
>
> On Mon, May 23, 2016 at 7:28 PM, kurt Greaves <k...@instaclustr.com>
> wrote:
>
>> Do you have 1 node in each DC or 2? If you're saying you have 1 node in
>> each DC then a RF of 2 doesn't make sense. Can you clarify on what your set
>> up is?
>>
>> On 23 May 2016 at 19:31, Luke Jolly <l...@getadmiral.com> wrote:
>>
>>> I am running 3.0.5 with 2 nodes in two DCs, gce-us-central1 and
>>> gce-us-east1.  I increased the replication factor of gce-us-central1 from 1
>>> to 2.  Then I ran 'nodetool repair -dc gce-us-central1'.  The "Owns"
>>> for the node switched to 100% as it should but the Load showed that it
>>> didn't actually sync the data.  I then ran a full 'nodetool repair' and it
>>> didn't fix it still.  This scares me as I thought 'nodetool repair' was a
>>> way to assure consistency and that all the nodes were synced but it doesn't
>>> seem to be.  Outside of that command, I have no idea how I would assure all
>>> the data was synced or how to get the data correctly synced without
>>> decommissioning the node and re-adding it.
>>>
>>
>>
>>
>> --
>> Kurt Greaves
>> k...@instaclustr.com
>> www.instaclustr.com
>>
>
>


Re: Increasing replication factor and repair doesn't seem to work

2016-05-24 Thread Luke Jolly
Here's my setup:

Datacenter: gce-us-central1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load   Tokens   Owns (effective)  Host ID
  Rack
UN  10.128.0.3   6.4 GB 256  100.0%
 3317a3de-9113-48e2-9a85-bbf756d7a4a6  default
UN  10.128.0.20  943.08 MB  256  100.0%
 958348cb-8205-4630-8b96-0951bf33f3d3  default
Datacenter: gce-us-east1

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load   Tokens   Owns (effective)  Host ID
  Rack
UN  10.142.0.14  6.4 GB 256  100.0%
 c3a5c39d-e1c9-4116-903d-b6d1b23fb652  default
UN  10.142.0.13  5.55 GB256  100.0%
 d0d9c30e-1506-4b95-be64-3dd4d78f0583  default

And my replication settings are:

{'class': 'NetworkTopologyStrategy', 'aws-us-west': '2', 'gce-us-central1':
'2', 'gce-us-east1': '2'}

As you can see, 10.128.0.20 in the gce-us-central1 DC only has a load of 943
MB even though it's supposed to own 100% and should have 6.4 GB. 10.142.0.13
also seems not to have everything, as it only has a load of 5.55 GB.

On Mon, May 23, 2016 at 7:28 PM, kurt Greaves <k...@instaclustr.com> wrote:

> Do you have 1 node in each DC or 2? If you're saying you have 1 node in
> each DC then a RF of 2 doesn't make sense. Can you clarify on what your set
> up is?
>
> On 23 May 2016 at 19:31, Luke Jolly <l...@getadmiral.com> wrote:
>
>> I am running 3.0.5 with 2 nodes in two DCs, gce-us-central1 and
>> gce-us-east1.  I increased the replication factor of gce-us-central1 from 1
>> to 2.  Then I ran 'nodetool repair -dc gce-us-central1'.  The "Owns" for
>> the node switched to 100% as it should but the Load showed that it didn't
>> actually sync the data.  I then ran a full 'nodetool repair' and it didn't
>> fix it still.  This scares me as I thought 'nodetool repair' was a way to
>> assure consistency and that all the nodes were synced but it doesn't seem
>> to be.  Outside of that command, I have no idea how I would assure all the
>> data was synced or how to get the data correctly synced without
>> decommissioning the node and re-adding it.
>>
>
>
>
> --
> Kurt Greaves
> k...@instaclustr.com
> www.instaclustr.com
>


Re: Increasing replication factor and repair doesn't seem to work

2016-05-23 Thread kurt Greaves
Do you have 1 node in each DC or 2? If you're saying you have 1 node in each
DC, then an RF of 2 doesn't make sense. Can you clarify what your setup is?

On 23 May 2016 at 19:31, Luke Jolly <l...@getadmiral.com> wrote:

> I am running 3.0.5 with 2 nodes in two DCs, gce-us-central1 and
> gce-us-east1.  I increased the replication factor of gce-us-central1 from 1
> to 2.  Then I ran 'nodetool repair -dc gce-us-central1'.  The "Owns" for
> the node switched to 100% as it should but the Load showed that it didn't
> actually sync the data.  I then ran a full 'nodetool repair' and it didn't
> fix it still.  This scares me as I thought 'nodetool repair' was a way to
> assure consistency and that all the nodes were synced but it doesn't seem
> to be.  Outside of that command, I have no idea how I would assure all the
> data was synced or how to get the data correctly synced without
> decommissioning the node and re-adding it.
>



-- 
Kurt Greaves
k...@instaclustr.com
www.instaclustr.com


Increasing replication factor and repair doesn't seem to work

2016-05-23 Thread Luke Jolly
I am running 3.0.5 with 2 nodes in two DCs, gce-us-central1 and
gce-us-east1.  I increased the replication factor of gce-us-central1 from 1
to 2.  Then I ran 'nodetool repair -dc gce-us-central1'.  The "Owns" for
the node switched to 100% as it should but the Load showed that it didn't
actually sync the data.  I then ran a full 'nodetool repair' and it didn't
fix it still.  This scares me as I thought 'nodetool repair' was a way to
assure consistency and that all the nodes were synced but it doesn't seem
to be.  Outside of that command, I have no idea how I would assure all the
data was synced or how to get the data correctly synced without
decommissioning the node and re-adding it.
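
For reference, the repair variants discussed in this thread, as shell
commands (my_keyspace is a placeholder; the -full flag matters on 2.2+/3.x,
where repair defaults to incremental):

    nodetool repair -full my_keyspace                        # repair every range this node replicates
    nodetool repair -full -dc gce-us-central1 my_keyspace    # only involve replicas in that DC
    nodetool repair -full -local my_keyspace                 # only involve replicas in the local DC

Note that the "Owns" column in nodetool status updates as soon as the
replication settings change, while Load only grows once repair has actually
streamed data to the new replicas.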


Re: Replication Factor Change

2015-11-05 Thread Yulian Oifa
Hello
OK, I got it, so I should set CL to ALL for reads; otherwise data may be
retrieved from nodes that do not yet have the current record.
Thanks for help.
Yulian Oifa

On Thu, Nov 5, 2015 at 5:33 PM, Eric Stevens <migh...@gmail.com> wrote:

> If you switch reads to CL=LOCAL_ALL, you should be able to increase RF,
> then run repair, and after repair is complete, go back to your old
> consistency level.  However, while you're operating at ALL consistency, you
> have no tolerance for a node failure (but at RF=1 you already have no
> tolerance for a node failure, so that doesn't really change your
> availability model).
>
> On Thu, Nov 5, 2015 at 8:01 AM Yulian Oifa <oifa.yul...@gmail.com> wrote:
>
>> Hello to all.
>> I am planning to change replication factor from 1 to 3.
>> Will it cause data read errors in time of nodes repair?
>>
>> Best regards
>> Yulian Oifa
>>
>


Re: Replication Factor Change

2015-11-05 Thread Eric Stevens
If you switch reads to CL=ALL, you should be able to increase RF,
then run repair, and after repair is complete, go back to your old
consistency level.  However, while you're operating at ALL consistency, you
have no tolerance for a node failure (but at RF=1 you already have no
tolerance for a node failure, so that doesn't really change your
availability model).
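
A minimal sketch of that sequence, assuming a keyspace named my_keyspace in a
single DC named dc1 (both placeholders; cqlsh's CONSISTENCY command stands in
for however your application driver sets its read consistency):

    ALTER KEYSPACE my_keyspace
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
    CONSISTENCY ALL;   -- ALL queries every replica, so the replica that
                       -- already has each row is always consulted
    -- run "nodetool repair my_keyspace" on every node, then switch back, e.g.
    CONSISTENCY LOCAL_QUORUM;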

On Thu, Nov 5, 2015 at 8:01 AM Yulian Oifa <oifa.yul...@gmail.com> wrote:

> Hello to all.
> I am planning to change replication factor from 1 to 3.
> Will it cause data read errors in time of nodes repair?
>
> Best regards
> Yulian Oifa
>


RE: Replication Factor Change

2015-11-05 Thread aeljami.ext
Hello,

If the current CL is ONE, be careful in production at the time of the
replication factor change: reads may be served by replicas that do not yet
have the data, so you can get data read errors!
From: Yulian Oifa [mailto:oifa.yul...@gmail.com]
Sent: Thursday, 5 November 2015 16:02
To: user@cassandra.apache.org
Subject: Replication Factor Change

Hello to all.
I am planning to change replication factor from 1 to 3.
Will it cause data read errors in time of nodes repair?
Best regards
Yulian Oifa




Replication Factor Change

2015-11-05 Thread Yulian Oifa
Hello to all.
I am planning to change replication factor from 1 to 3.
Will it cause data read errors in time of nodes repair?

Best regards
Yulian Oifa


Re: Re : will Unsafeassaniate a dead node maintain the replication factor

2015-11-01 Thread sai krishnam raju potturi
Thanks Surbhi. Will try that out.

On Sat, Oct 31, 2015 at 6:52 PM, Surbhi Gupta <surbhi.gupt...@gmail.com>
wrote:

> If it is using vnodes then just run nodetool repair . It should fix the
> issue related to data if any.
>
> And then run nodetool cleanup
>
> Sent from my iPhone
>
> On Oct 31, 2015, at 3:12 PM, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
> yes Surbhi.
>
> On Sat, Oct 31, 2015 at 1:13 PM, Surbhi Gupta <surbhi.gupt...@gmail.com>
> wrote:
>
>> Is the cluster using vnodes?
>>
>> Sent from my iPhone
>>
>> On Oct 31, 2015, at 9:16 AM, sai krishnam raju potturi <
>> pskraj...@gmail.com> wrote:
>>
>> yes Surbhi.
>>
>> On Sat, Oct 31, 2015 at 12:10 PM, Surbhi Gupta <surbhi.gupt...@gmail.com>
>> wrote:
>>
>>> So have you already done unsafe assassination ?
>>>
>>> On 31 October 2015 at 08:37, sai krishnam raju potturi <
>>> pskraj...@gmail.com> wrote:
>>>
>>>> it's dead; and we had to do unsafeassassinate as other 2 methods did
>>>> not work
>>>>
>>>> On Sat, Oct 31, 2015 at 11:30 AM, Surbhi Gupta <
>>>> surbhi.gupt...@gmail.com> wrote:
>>>>
>>>>> Whether the node is down or up which you want to decommission?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Oct 31, 2015, at 8:24 AM, sai krishnam raju potturi <
>>>>> pskraj...@gmail.com> wrote:
>>>>>
>>>>> Thanks Surabhi. Decommission nor removenode did not work. We did not
>>>>> capture the tokens of the dead node. Any way we could make sure the
>>>>> replication of 3 is maintained?
>>>>>
>>>>> On Sat, Oct 31, 2015, 11:14 Surbhi Gupta <surbhi.gupt...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> You have to do few things before unsafe as sanitation . First run the
>>>>>> nodetool decommission if the node is up and wait till streaming happens .
>>>>>> You can check is the streaming is completed by nodetool netstats . If
>>>>>> streaming is completed you can do unsafe assanitation .
>>>>>>
>>>>>> To answer your question unsafe assanitation will not take care of
>>>>>> replication factor .
>>>>>> It is like forcing a node out from the cluster .
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> > On Oct 31, 2015, at 5:12 AM, sai krishnam raju potturi <
>>>>>> pskraj...@gmail.com> wrote:
>>>>>> >
>>>>>> > hi;
>>>>>> >would unsafeassasinating a dead node maintain the replication
>>>>>> factor like decommission process or removenode process?
>>>>>> >
>>>>>> > thanks
>>>>>> >
>>>>>> >
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re : will Unsafeassaniate a dead node maintain the replication factor

2015-10-31 Thread sai krishnam raju potturi
hi;
   would unsafe-assassinating a dead node maintain the replication factor
like the decommission or removenode process does?

thanks


Re: Re : will Unsafeassaniate a dead node maintain the replication factor

2015-10-31 Thread Surbhi Gupta
Is the node you want to decommission down or up?

Sent from my iPhone

> On Oct 31, 2015, at 8:24 AM, sai krishnam raju potturi <pskraj...@gmail.com> 
> wrote:
> 
> Thanks Surabhi. Decommission nor removenode did not work. We did not capture 
> the tokens of the dead node. Any way we could make sure the replication of 3 
> is maintained?
> 
> 
>> On Sat, Oct 31, 2015, 11:14 Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
>> You have to do few things before unsafe as sanitation . First run the 
>> nodetool decommission if the node is up and wait till streaming happens . 
>> You can check is the streaming is completed by nodetool netstats . If 
>> streaming is completed you can do unsafe assanitation .
>> 
>> To answer your question unsafe assanitation will not take care of 
>> replication factor .
>> It is like forcing a node out from the cluster .
>> 
>> Hope this helps.
>> 
>> Sent from my iPhone
>> 
>> > On Oct 31, 2015, at 5:12 AM, sai krishnam raju potturi 
>> > <pskraj...@gmail.com> wrote:
>> >
>> > hi;
>> >would unsafeassasinating a dead node maintain the replication factor 
>> > like decommission process or removenode process?
>> >
>> > thanks
>> >
>> >


Re: Re : will Unsafeassaniate a dead node maintain the replication factor

2015-10-31 Thread sai krishnam raju potturi
it's dead, and we had to do unsafeassassinate as the other 2 methods did not
work

On Sat, Oct 31, 2015 at 11:30 AM, Surbhi Gupta <surbhi.gupt...@gmail.com>
wrote:

> Whether the node is down or up which you want to decommission?
>
> Sent from my iPhone
>
> On Oct 31, 2015, at 8:24 AM, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
> Thanks Surabhi. Decommission nor removenode did not work. We did not
> capture the tokens of the dead node. Any way we could make sure the
> replication of 3 is maintained?
>
> On Sat, Oct 31, 2015, 11:14 Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
>
>> You have to do few things before unsafe as sanitation . First run the
>> nodetool decommission if the node is up and wait till streaming happens .
>> You can check is the streaming is completed by nodetool netstats . If
>> streaming is completed you can do unsafe assanitation .
>>
>> To answer your question unsafe assanitation will not take care of
>> replication factor .
>> It is like forcing a node out from the cluster .
>>
>> Hope this helps.
>>
>> Sent from my iPhone
>>
>> > On Oct 31, 2015, at 5:12 AM, sai krishnam raju potturi <
>> pskraj...@gmail.com> wrote:
>> >
>> > hi;
>> >would unsafeassasinating a dead node maintain the replication factor
>> like decommission process or removenode process?
>> >
>> > thanks
>> >
>> >
>>
>


Re: Re : will Unsafeassaniate a dead node maintain the replication factor

2015-10-31 Thread sai krishnam raju potturi
Thanks Surbhi. Neither decommission nor removenode worked. We did not
capture the tokens of the dead node. Is there any way we could make sure the
replication factor of 3 is maintained?

On Sat, Oct 31, 2015, 11:14 Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:

> You have to do few things before unsafe as sanitation . First run the
> nodetool decommission if the node is up and wait till streaming happens .
> You can check is the streaming is completed by nodetool netstats . If
> streaming is completed you can do unsafe assanitation .
>
> To answer your question unsafe assanitation will not take care of
> replication factor .
> It is like forcing a node out from the cluster .
>
> Hope this helps.
>
> Sent from my iPhone
>
> > On Oct 31, 2015, at 5:12 AM, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
> >
> > hi;
> >would unsafeassasinating a dead node maintain the replication factor
> like decommission process or removenode process?
> >
> > thanks
> >
> >
>


Re: Re : will Unsafeassaniate a dead node maintain the replication factor

2015-10-31 Thread Surbhi Gupta
So have you already done unsafe assassination ?

On 31 October 2015 at 08:37, sai krishnam raju potturi <pskraj...@gmail.com>
wrote:

> it's dead; and we had to do unsafeassassinate as other 2 methods did not
> work
>
> On Sat, Oct 31, 2015 at 11:30 AM, Surbhi Gupta <surbhi.gupt...@gmail.com>
> wrote:
>
>> Whether the node is down or up which you want to decommission?
>>
>> Sent from my iPhone
>>
>> On Oct 31, 2015, at 8:24 AM, sai krishnam raju potturi <
>> pskraj...@gmail.com> wrote:
>>
>> Thanks Surabhi. Decommission nor removenode did not work. We did not
>> capture the tokens of the dead node. Any way we could make sure the
>> replication of 3 is maintained?
>>
>> On Sat, Oct 31, 2015, 11:14 Surbhi Gupta <surbhi.gupt...@gmail.com>
>> wrote:
>>
>>> You have to do few things before unsafe as sanitation . First run the
>>> nodetool decommission if the node is up and wait till streaming happens .
>>> You can check is the streaming is completed by nodetool netstats . If
>>> streaming is completed you can do unsafe assanitation .
>>>
>>> To answer your question unsafe assanitation will not take care of
>>> replication factor .
>>> It is like forcing a node out from the cluster .
>>>
>>> Hope this helps.
>>>
>>> Sent from my iPhone
>>>
>>> > On Oct 31, 2015, at 5:12 AM, sai krishnam raju potturi <
>>> pskraj...@gmail.com> wrote:
>>> >
>>> > hi;
>>> >would unsafeassasinating a dead node maintain the replication
>>> factor like decommission process or removenode process?
>>> >
>>> > thanks
>>> >
>>> >
>>>
>>
>


Re: Re : will Unsafeassaniate a dead node maintain the replication factor

2015-10-31 Thread sai krishnam raju potturi
yes Surbhi.

On Sat, Oct 31, 2015 at 12:10 PM, Surbhi Gupta <surbhi.gupt...@gmail.com>
wrote:

> So have you already done unsafe assassination ?
>
> On 31 October 2015 at 08:37, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
>> it's dead; and we had to do unsafeassassinate as other 2 methods did not
>> work
>>
>> On Sat, Oct 31, 2015 at 11:30 AM, Surbhi Gupta <surbhi.gupt...@gmail.com>
>> wrote:
>>
>>> Whether the node is down or up which you want to decommission?
>>>
>>> Sent from my iPhone
>>>
>>> On Oct 31, 2015, at 8:24 AM, sai krishnam raju potturi <
>>> pskraj...@gmail.com> wrote:
>>>
>>> Thanks Surabhi. Decommission nor removenode did not work. We did not
>>> capture the tokens of the dead node. Any way we could make sure the
>>> replication of 3 is maintained?
>>>
>>> On Sat, Oct 31, 2015, 11:14 Surbhi Gupta <surbhi.gupt...@gmail.com>
>>> wrote:
>>>
>>>> You have to do few things before unsafe as sanitation . First run the
>>>> nodetool decommission if the node is up and wait till streaming happens .
>>>> You can check is the streaming is completed by nodetool netstats . If
>>>> streaming is completed you can do unsafe assanitation .
>>>>
>>>> To answer your question unsafe assanitation will not take care of
>>>> replication factor .
>>>> It is like forcing a node out from the cluster .
>>>>
>>>> Hope this helps.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> > On Oct 31, 2015, at 5:12 AM, sai krishnam raju potturi <
>>>> pskraj...@gmail.com> wrote:
>>>> >
>>>> > hi;
>>>> >would unsafeassasinating a dead node maintain the replication
>>>> factor like decommission process or removenode process?
>>>> >
>>>> > thanks
>>>> >
>>>> >
>>>>
>>>
>>
>


Re: Re : will Unsafeassaniate a dead node maintain the replication factor

2015-10-31 Thread Surbhi Gupta
Is the cluster using vnodes?

Sent from my iPhone

> On Oct 31, 2015, at 9:16 AM, sai krishnam raju potturi <pskraj...@gmail.com> 
> wrote:
> 
> yes Surbhi.
> 
>> On Sat, Oct 31, 2015 at 12:10 PM, Surbhi Gupta <surbhi.gupt...@gmail.com> 
>> wrote:
>> So have you already done unsafe assassination ?
>> 
>>> On 31 October 2015 at 08:37, sai krishnam raju potturi 
>>> <pskraj...@gmail.com> wrote:
>>> it's dead; and we had to do unsafeassassinate as other 2 methods did not 
>>> work
>>> 
>>>> On Sat, Oct 31, 2015 at 11:30 AM, Surbhi Gupta <surbhi.gupt...@gmail.com> 
>>>> wrote:
>>>> Whether the node is down or up which you want to decommission?
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On Oct 31, 2015, at 8:24 AM, sai krishnam raju potturi 
>>>>> <pskraj...@gmail.com> wrote:
>>>>> 
>>>>> Thanks Surabhi. Decommission nor removenode did not work. We did not 
>>>>> capture the tokens of the dead node. Any way we could make sure the 
>>>>> replication of 3 is maintained?
>>>>> 
>>>>> 
>>>>>> On Sat, Oct 31, 2015, 11:14 Surbhi Gupta <surbhi.gupt...@gmail.com> 
>>>>>> wrote:
>>>>>> You have to do few things before unsafe as sanitation . First run the 
>>>>>> nodetool decommission if the node is up and wait till streaming happens 
>>>>>> . You can check is the streaming is completed by nodetool netstats . If 
>>>>>> streaming is completed you can do unsafe assanitation .
>>>>>> 
>>>>>> To answer your question unsafe assanitation will not take care of 
>>>>>> replication factor .
>>>>>> It is like forcing a node out from the cluster .
>>>>>> 
>>>>>> Hope this helps.
>>>>>> 
>>>>>> Sent from my iPhone
>>>>>> 
>>>>>> > On Oct 31, 2015, at 5:12 AM, sai krishnam raju potturi 
>>>>>> > <pskraj...@gmail.com> wrote:
>>>>>> >
>>>>>> > hi;
>>>>>> >would unsafeassasinating a dead node maintain the replication 
>>>>>> > factor like decommission process or removenode process?
>>>>>> >
>>>>>> > thanks
>>>>>> >
>>>>>> >
> 


Re: Re : will Unsafeassaniate a dead node maintain the replication factor

2015-10-31 Thread sai krishnam raju potturi
yes Surbhi.

On Sat, Oct 31, 2015 at 1:13 PM, Surbhi Gupta <surbhi.gupt...@gmail.com>
wrote:

> Is the cluster using vnodes?
>
> Sent from my iPhone
>
> On Oct 31, 2015, at 9:16 AM, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
> yes Surbhi.
>
> On Sat, Oct 31, 2015 at 12:10 PM, Surbhi Gupta <surbhi.gupt...@gmail.com>
> wrote:
>
>> So have you already done unsafe assassination ?
>>
>> On 31 October 2015 at 08:37, sai krishnam raju potturi <
>> pskraj...@gmail.com> wrote:
>>
>>> it's dead; and we had to do unsafeassassinate as other 2 methods did not
>>> work
>>>
>>> On Sat, Oct 31, 2015 at 11:30 AM, Surbhi Gupta <surbhi.gupt...@gmail.com
>>> > wrote:
>>>
>>>> Whether the node is down or up which you want to decommission?
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Oct 31, 2015, at 8:24 AM, sai krishnam raju potturi <
>>>> pskraj...@gmail.com> wrote:
>>>>
>>>> Thanks Surabhi. Decommission nor removenode did not work. We did not
>>>> capture the tokens of the dead node. Any way we could make sure the
>>>> replication of 3 is maintained?
>>>>
>>>> On Sat, Oct 31, 2015, 11:14 Surbhi Gupta <surbhi.gupt...@gmail.com>
>>>> wrote:
>>>>
>>>>> You have to do few things before unsafe as sanitation . First run the
>>>>> nodetool decommission if the node is up and wait till streaming happens .
>>>>> You can check is the streaming is completed by nodetool netstats . If
>>>>> streaming is completed you can do unsafe assanitation .
>>>>>
>>>>> To answer your question unsafe assanitation will not take care of
>>>>> replication factor .
>>>>> It is like forcing a node out from the cluster .
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> > On Oct 31, 2015, at 5:12 AM, sai krishnam raju potturi <
>>>>> pskraj...@gmail.com> wrote:
>>>>> >
>>>>> > hi;
>>>>> >would unsafeassasinating a dead node maintain the replication
>>>>> factor like decommission process or removenode process?
>>>>> >
>>>>> > thanks
>>>>> >
>>>>> >
>>>>>
>>>>
>>>
>>
>


Re: Re : will Unsafeassaniate a dead node maintain the replication factor

2015-10-31 Thread Surbhi Gupta
If it is using vnodes, then just run nodetool repair. It should fix any data
issues.

And then run nodetool cleanup.

Sent from my iPhone
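
In shell form, run on each remaining node after the assassination:

    nodetool repair    # re-sync replicas for the ranges the dead node held
    nodetool cleanup   # drop data for ranges this node no longer owns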

> On Oct 31, 2015, at 3:12 PM, sai krishnam raju potturi <pskraj...@gmail.com> 
> wrote:
> 
> yes Surbhi.
> 
>> On Sat, Oct 31, 2015 at 1:13 PM, Surbhi Gupta <surbhi.gupt...@gmail.com> 
>> wrote:
>> Is the cluster using vnodes?
>> 
>> Sent from my iPhone
>> 
>>> On Oct 31, 2015, at 9:16 AM, sai krishnam raju potturi 
>>> <pskraj...@gmail.com> wrote:
>>> 
>>> yes Surbhi.
>>> 
>>>> On Sat, Oct 31, 2015 at 12:10 PM, Surbhi Gupta <surbhi.gupt...@gmail.com> 
>>>> wrote:
>>>> So have you already done unsafe assassination ?
>>>> 
>>>>> On 31 October 2015 at 08:37, sai krishnam raju potturi 
>>>>> <pskraj...@gmail.com> wrote:
>>>>> it's dead; and we had to do unsafeassassinate as other 2 methods did not 
>>>>> work
>>>>> 
>>>>>> On Sat, Oct 31, 2015 at 11:30 AM, Surbhi Gupta 
>>>>>> <surbhi.gupt...@gmail.com> wrote:
>>>>>> Whether the node is down or up which you want to decommission?
>>>>>> 
>>>>>> Sent from my iPhone
>>>>>> 
>>>>>>> On Oct 31, 2015, at 8:24 AM, sai krishnam raju potturi 
>>>>>>> <pskraj...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Thanks Surabhi. Decommission nor removenode did not work. We did not 
>>>>>>> capture the tokens of the dead node. Any way we could make sure the 
>>>>>>> replication of 3 is maintained?
>>>>>>> 
>>>>>>> 
>>>>>>>> On Sat, Oct 31, 2015, 11:14 Surbhi Gupta <surbhi.gupt...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>> You have to do few things before unsafe as sanitation . First run the 
>>>>>>>> nodetool decommission if the node is up and wait till streaming 
>>>>>>>> happens . You can check is the streaming is completed by nodetool 
>>>>>>>> netstats . If streaming is completed you can do unsafe assanitation .
>>>>>>>> 
>>>>>>>> To answer your question unsafe assanitation will not take care of 
>>>>>>>> replication factor .
>>>>>>>> It is like forcing a node out from the cluster .
>>>>>>>> 
>>>>>>>> Hope this helps.
>>>>>>>> 
>>>>>>>> Sent from my iPhone
>>>>>>>> 
>>>>>>>> > On Oct 31, 2015, at 5:12 AM, sai krishnam raju potturi 
>>>>>>>> > <pskraj...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > hi;
>>>>>>>> >would unsafeassasinating a dead node maintain the replication 
>>>>>>>> > factor like decommission process or removenode process?
>>>>>>>> >
>>>>>>>> > thanks
>>>>>>>> >
>>>>>>>> >
> 


Re: Re : Replication factor for system_auth keyspace

2015-10-16 Thread Victor Chen
To elaborate on what Robert said, I think with most things technology
related, the answer with these sorts of questions (i.e. "ideal settings")
is usually "it depends." Remember that technology is a tool that we use to
accomplish something we want. It's just a mechanism that we as humans use
to exert our wishes on other things. In this case, cassandra allows us to
exert our wishes on the data we need to have available. So think for a
second about what you want. To be less philosophical and more practical:
how many nodes are you comfortable losing, or likely to lose? How many
copies of your system_auth keyspace do you want to have always available?

Also, what do you mean by "really long?" What version of cassandra are you
using? If you are on 2.1, look at migrating to incremental repair. That it
takes so long for such a small keyspace leads me to believe you're using
sequential repair ...

-V
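
For example, on 2.1 an incremental, parallel repair of just that keyspace
would look something like the following (on 2.1 the -par flag is used here
because incremental repair does not combine with the default sequential
validation):

    nodetool repair -par -inc system_auth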

On Thu, Oct 15, 2015 at 7:46 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Thu, Oct 15, 2015 at 10:24 AM, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
>>   we are deploying a new cluster with 2 datacenters, 48 nodes in each DC.
>> For the system_auth keyspace, what should be the ideal replication_factor
>> set?
>>
>> We tried setting the replication factor equal to the number of nodes in a
>> datacenter, and the repair for the system_auth keyspace took really long.
>> Your suggestions would be of great help.
>>
>
> More than 1 and a lot less than 48.
>
> =Rob
>
>


Re: Re : Replication factor for system_auth keyspace

2015-10-16 Thread sai krishnam raju potturi
Thanks guys for the advice. We were running parallel repairs earlier, with
Cassandra version 2.0.14. As pointed out, setting the replication factor so
high for system_auth was what caused the repair to take so long.

thanks
Sai

On Fri, Oct 16, 2015 at 9:56 AM, Victor Chen <victor.h.c...@gmail.com>
wrote:

> To elaborate on what Robert said, I think with most things technology
> related, the answer with these sorts of questions (i.e. "ideal settings")
> is usually "it depends." Remember that technology is a tool that we use to
> accomplish something we want. It's just a mechanism that we as humans use
> to exert our wishes on other things. In this case, cassandra allows us to
> exert our wishes on the data we need to have available. So think for a
> second about what you want? To be less philosophical and more practical,
> how many nodes you are comfortable losing or likely to lose? How many
> copies of your system_auth keyspace do you want to have always available?
>
> Also, what do you mean by "really long?" What version of cassandra are you
> using? If you are on 2.1, look at migrating to incremental repair. That it
> takes so long for such a small keyspace leads me to believe you're using
> sequential repair ...
>
> -V
>
> On Thu, Oct 15, 2015 at 7:46 PM, Robert Coli <rc...@eventbrite.com> wrote:
>
>> On Thu, Oct 15, 2015 at 10:24 AM, sai krishnam raju potturi <
>> pskraj...@gmail.com> wrote:
>>
>>>   we are deploying a new cluster with 2 datacenters, 48 nodes in each
>>> DC. For the system_auth keyspace, what should be the ideal
>>> replication_factor set?
>>>
>>> We tried setting the replication factor equal to the number of nodes in
>>> a datacenter, and the repair for the system_auth keyspace took really long.
>>> Your suggestions would be of great help.
>>>
>>
>> More than 1 and a lot less than 48.
>>
>> =Rob
>>
>>
>


Re: Re : Replication factor for system_auth keyspace

2015-10-15 Thread Robert Coli
On Thu, Oct 15, 2015 at 10:24 AM, sai krishnam raju potturi <
pskraj...@gmail.com> wrote:

>   we are deploying a new cluster with 2 datacenters, 48 nodes in each DC.
> For the system_auth keyspace, what should be the ideal replication_factor
> set?
>
> We tried setting the replication factor equal to the number of nodes in a
> datacenter, and the repair for the system_auth keyspace took really long.
> Your suggestions would be of great help.
>

More than 1 and a lot less than 48.

=Rob


Re : Replication factor for system_auth keyspace

2015-10-15 Thread sai krishnam raju potturi
hi;
  we are deploying a new cluster with 2 datacenters, 48 nodes in each DC.
For the system_auth keyspace, what should the replication_factor ideally be
set to?

We tried setting the replication factor equal to the number of nodes in a
datacenter, and the repair for the system_auth keyspace took really long.
Your suggestions would be of great help.

thanks
Sai
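
As a concrete sketch of the advice above ("more than 1 and a lot less than
48"), an RF of 3 per DC is a common choice; DC1 and DC2 are placeholders for
the actual datacenter names:

    ALTER KEYSPACE system_auth WITH replication =
      {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};
    -- then run "nodetool repair system_auth" on each node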


Run which repair cmd when increasing replication factor

2015-03-06 Thread 曹志富
I want to increase the replication factor in my C* 2.1.3 cluster (RF change
from 2 to 3 for some keyspaces).

I read the doc on updating the replication factor:
http://www.datastax.com/documentation/cql/3.1/cql/cql_using/update_ks_rf_t.html

Step two is to run nodetool repair. But as far as I know, nodetool repair
defaults to a full repair, which seems to go against step three. So which
repair command should I run when increasing the replication factor?

Thanks all.

--
Ranger Tsao
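
For context: on 2.1.x, nodetool repair with no flags is already a full,
sequential repair, so after raising the RF, running the plain per-keyspace
form on every node already gives you a full repair (my_keyspace is a
placeholder):

    nodetool repair my_keyspace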


Re: Changing replication factor of Cassandra cluster

2015-01-06 Thread Pranay Agarwal
Thanks Robert. Also, I have seen the nodetool repair operation fail for some
nodes. What are the chances of the data getting corrupted if a repair fails?
I am okay with data availability issues for some time as long as I don't lose
or corrupt data. Also, is there a way to restore the graph without having to
back up the token ring ranges, using just the data backup?

-Pranay

On Mon, Dec 29, 2014 at 1:58 PM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Dec 29, 2014 at 1:40 PM, Pranay Agarwal agarwalpran...@gmail.com
 wrote:

 I want to understand what is the best way to increase/change the replica
 factor of the cassandra cluster? My priority is consistency and probably I
 am tolerant about some down time of the cluster. Is it totally weird to try
 changing replica later or are there people doing it for production env in
 past?


 The way you are doing it is fine, but risks false-negative reads.

 Basically, if you ask the wrong node does this key exist before it is
 repaired, you will get the answer no when in fact it does exist under the
 RF=1 paradigm. Unfortunately the only way to avoid this case is to do all
 reads with ConsistencyLevel.ALL until the whole cluster is repaired.

 =Rob



Re: Changing replication factor of Cassandra cluster

2015-01-06 Thread Robert Coli
On Tue, Jan 6, 2015 at 4:40 PM, Pranay Agarwal agarwalpran...@gmail.com
wrote:

 Thanks Robert. Also, I have seen the node-repair operation to fail for
 some nodes. What are the chances of the data getting corrupt if node-repair
 fails?


If repair does not complete before gc_grace_seconds, the chance of data getting
corrupted is non-zero and relates to how under-replicated the tombstones might be.

You probably want to increase gc_grace_seconds if you're experiencing
failing repairs, as the default value is ambitious in non-toy clusters. I
personally recommend 34 days so that you can start repair on the first of
every month and have up to 7 days to complete the repair.

=Rob
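
In CQL terms that is a per-table setting; a sketch for one table (the names
are placeholders; 34 days = 2937600 seconds):

    ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 2937600;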


Re: Changing replication factor of Cassandra cluster

2014-12-29 Thread Pranay Agarwal
Thanks Ryan.

I want to understand the best way to increase/change the replication factor
of the Cassandra cluster. My priority is consistency, and I can probably
tolerate some downtime of the cluster. Is it totally weird to change the
replication factor later, or have people done this for production
environments in the past?

On Tue, Dec 16, 2014 at 9:47 AM, Ryan Svihla rsvi...@datastax.com wrote:

 Repair's performance is going to vary heavily by a large number of
 factors, hours for 1 node to finish is within range of what I see in the
 wild, again there are so many factors it's impossible to speculate on if
 that is good or bad for your cluster. Factors that matter include:

1. speed of disk io
2. amount of ram and cpu on each node
3. network interface speed
4. is this multidc or not
5. are vnodes enabled or not
6. what are the jvm tunings
7. compaction settings
8. current load on the cluster
9. streaming settings

 Suffice it to say to improve repair performance is a full on tuning
 exercise, note you're current operation is going to be worse than
 tradtional repair, as your streaming copies of data around and not just
 doing normal merkel tree work.

 Restoring from backup to a new cluster (including how to handle token
 ranges) is discussed in detail here
 http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_snapshot_restore_new_cluster.html


 On Mon, Dec 15, 2014 at 4:14 PM, Pranay Agarwal agarwalpran...@gmail.com
 wrote:

 Hi All,


 I have 20 nodes cassandra cluster with 500gb of data and replication
 factor of 1. I increased the replication factor to 3 and ran nodetool
 repair on each node one by one as the docs says. But it takes hours for 1
 node to finish repair. Is that normal or am I doing something wrong?

 Also, I took backup of cassandra data on each node. How do I restore the
 graph in a new cluster of nodes using the backup? Do I have to have the
 tokens range backed up as well?

 -Pranay



 --

 Ryan Svihla

 Solution Architect, DataStax

 http://www.datastax.com/




Re: Changing replication factor of Cassandra cluster

2014-12-29 Thread Robert Coli
On Mon, Dec 29, 2014 at 1:40 PM, Pranay Agarwal agarwalpran...@gmail.com
wrote:

 I want to understand what is the best way to increase/change the replica
 factor of the cassandra cluster? My priority is consistency and probably I
 am tolerant about some down time of the cluster. Is it totally weird to try
 changing replica later or are there people doing it for production env in
 past?


The way you are doing it is fine, but risks false-negative reads.

Basically, if you ask the wrong node "does this key exist" before it is
repaired, you will get the answer "no" when in fact it does exist under the
RF=1 paradigm. Unfortunately the only way to avoid this case is to do all
reads with ConsistencyLevel.ALL until the whole cluster is repaired.

=Rob

