Re: Stop long running queries in Cassandra 3.11.x or Cassandra 4.x

2021-10-12 Thread S G
Thanks Bowen.
Any idea why cross_node_timeout is commented out by default? It seems
like a good option to enable, even per the documentation:
# If disabled, replicas will assume that requests
# were forwarded to them instantly by the coordinator, which means that
# under overload conditions we will waste that much extra time processing
# already-timed-out requests.
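
For reference, enabling it should just be a one-line change in cassandra.yaml
on each node. A minimal sketch, assuming the setting name used in 3.11/4.0
(note that cross-node timeouts rely on the nodes' clocks being reasonably well
synchronized, e.g. via NTP):

# uncomment and set to true, then restart the node
cross_node_timeout: true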

Also, taking an example from Oracle-style RDBMS systems, is there a
command like the following that can be fired from an external script to
kill a long-running query on each node:

alter system kill session




On Tue, Oct 12, 2021 at 10:49 AM Bowen Song  wrote:

> That will depend on whether you have cross_node_timeout enabled. However,
> I have to point out that setting the timeout to 15 ms is perhaps not a good
> idea; JVM GC pauses can easily cause a lot of timeouts.
> On 12/10/2021 18:20, S G wrote:
>
> ok, when a coordinator node sends a timeout to the client, does it mean all
> the replica nodes have stopped processing that specific query too?
> Or is it just the coordinator node that has stopped waiting for the
> replicas to return response?
>
> On Tue, Oct 12, 2021 at 10:12 AM Jeff Jirsa  wrote:
>
>> It sends an exception to the client; it doesn't sever the connection.
>>
>>
>> On Tue, Oct 12, 2021 at 10:06 AM S G  wrote:
>>
>>> Do the timeout values only kill the connection with the client or send
>>> error to the client?
>>> Or do they also kill the corresponding query execution happening on the
>>> Cassandra servers (co-ordinator, replicas etc) ?
>>>
>>> On Tue, Oct 12, 2021 at 10:00 AM Jeff Jirsa  wrote:
>>>
>>>> The read and write timeout values do this today.
>>>>
>>>> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L920-L943
>>>>
>>>> On Tue, Oct 12, 2021 at 9:53 AM S G  wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Is there a way to stop long running queries in Cassandra (versions
>>>>> 3.11.x or 4.x) ?
>>>>> The use-case is to have some kind of a circuit breaker based on
>>>>> query-time that has exceeded the client's SLAs.
>>>>> Example: If server response is useless to the client after 10 ms, then
>>>>> we could
>>>>> have a *query_killing_timeout* set to 15 ms (where additional 5ms
>>>>> allows for some buffer).
>>>>> And when that much time has elapsed, Cassandra will kill the query
>>>>> execution automatically.
>>>>>
>>>>> If this is not possible in Cassandra currently, any chance we can do
>>>>> it outside of Cassandra, like
>>>>> a shell script that monitors such long running queries (through users
>>>>> table etc) and kills the
>>>>> OS-thread responsible for that query (Looks unsafe though as that
>>>>> might leave the DB in an inconsistent state) ?
>>>>>
>>>>> We are trying this as a proactive measure to safeguard our clusters
>>>>> from any rogue queries fired accidentally or maliciously.
>>>>>
>>>>> Thanks !
>>>>>

Re: Stop long running queries in Cassandra 3.11.x or Cassandra 4.x

2021-10-12 Thread Bowen Song
That will depend on whether you have cross_node_timeout enabled.
However, I have to point out that setting the timeout to 15 ms is perhaps
not a good idea; JVM GC pauses can easily cause a lot of timeouts.


On 12/10/2021 18:20, S G wrote:
ok, when a coordinator node sends a timeout to the client, does it mean
all the replica nodes have stopped processing that specific query too?
Or is it just the coordinator node that has stopped waiting for the 
replicas to return response?


On Tue, Oct 12, 2021 at 10:12 AM Jeff Jirsa  wrote:

It sends an exception to the client; it doesn't sever the connection.


On Tue, Oct 12, 2021 at 10:06 AM S G 
wrote:

Do the timeout values only kill the connection with the client
or send error to the client?
Or do they also kill the corresponding query execution
happening on the Cassandra servers (co-ordinator, replicas etc) ?

On Tue, Oct 12, 2021 at 10:00 AM Jeff Jirsa 
wrote:

The read and write timeout values do this today.


https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L920-L943


On Tue, Oct 12, 2021 at 9:53 AM S G
 wrote:

Hello,

Is there a way to stop long running queries in
Cassandra (versions 3.11.x or 4.x) ?
The use-case is to have some kind of a circuit breaker
based on query-time that has exceeded the client's SLAs.
Example: If server response is useless to the client
after 10 ms, then we could
have a *query_killing_timeout* set to 15 ms (where
additional 5ms allows for some buffer).
And when that much time has elapsed, Cassandra will
kill the query execution automatically.

If this is not possible in Cassandra currently, any
chance we can do it outside of Cassandra, like
a shell script that monitors such long running queries
(through users table etc) and kills the
OS-thread responsible for that query (Looks unsafe
though as that might leave the DB in an inconsistent
state) ?

We are trying this as a proactive measure to safeguard
our clusters from any rogue queries fired accidentally
or maliciously.

Thanks !


Re: Stop long running queries in Cassandra 3.11.x or Cassandra 4.x

2021-10-12 Thread S G
ok, when a coordinator node sends a timeout to the client, does it mean all
the replica nodes have stopped processing that specific query too?
Or is it just the coordinator node that has stopped waiting for the
replicas to return response?

On Tue, Oct 12, 2021 at 10:12 AM Jeff Jirsa  wrote:

> It sends an exception to the client; it doesn't sever the connection.
>
>
> On Tue, Oct 12, 2021 at 10:06 AM S G  wrote:
>
>> Do the timeout values only kill the connection with the client or send
>> error to the client?
>> Or do they also kill the corresponding query execution happening on the
>> Cassandra servers (co-ordinator, replicas etc) ?
>>
>> On Tue, Oct 12, 2021 at 10:00 AM Jeff Jirsa  wrote:
>>
>>> The read and write timeout values do this today.
>>>
>>>
>>> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L920-L943
>>>
>>>
>>> On Tue, Oct 12, 2021 at 9:53 AM S G  wrote:
>>>
>>>> Hello,
>>>>
>>>> Is there a way to stop long running queries in Cassandra (versions
>>>> 3.11.x or 4.x) ?
>>>> The use-case is to have some kind of a circuit breaker based on
>>>> query-time that has exceeded the client's SLAs.
>>>> Example: If server response is useless to the client after 10 ms, then
>>>> we could
>>>> have a *query_killing_timeout* set to 15 ms (where additional 5ms
>>>> allows for some buffer).
>>>> And when that much time has elapsed, Cassandra will kill the query
>>>> execution automatically.
>>>>
>>>> If this is not possible in Cassandra currently, any chance we can do it
>>>> outside of Cassandra, like
>>>> a shell script that monitors such long running queries (through users
>>>> table etc) and kills the
>>>> OS-thread responsible for that query (Looks unsafe though as that might
>>>> leave the DB in an inconsistent state) ?
>>>>
>>>> We are trying this as a proactive measure to safeguard our clusters
>>>> from any rogue queries fired accidentally or maliciously.
>>>>
>>>> Thanks !




Re: Stop long running queries in Cassandra 3.11.x or Cassandra 4.x

2021-10-12 Thread Jeff Jirsa
It sends an exception to the client; it doesn't sever the connection.


On Tue, Oct 12, 2021 at 10:06 AM S G  wrote:

> Do the timeout values only kill the connection with the client or send
> error to the client?
> Or do they also kill the corresponding query execution happening on the
> Cassandra servers (co-ordinator, replicas etc) ?
>
> On Tue, Oct 12, 2021 at 10:00 AM Jeff Jirsa  wrote:
>
>> The read and write timeout values do this today.
>>
>>
>> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L920-L943
>>
>>
>> On Tue, Oct 12, 2021 at 9:53 AM S G  wrote:
>>
>>> Hello,
>>>
>>> Is there a way to stop long running queries in Cassandra (versions
>>> 3.11.x or 4.x) ?
>>> The use-case is to have some kind of a circuit breaker based on
>>> query-time that has exceeded the client's SLAs.
>>> Example: If server response is useless to the client after 10 ms, then
>>> we could
>>> have a *query_killing_timeout* set to 15 ms (where additional 5ms allows
>>> for some buffer).
>>> And when that much time has elapsed, Cassandra will kill the query
>>> execution automatically.
>>>
>>> If this is not possible in Cassandra currently, any chance we can do it
>>> outside of Cassandra, like
>>> a shell script that monitors such long running queries (through users
>>> table etc) and kills the
>>> OS-thread responsible for that query (Looks unsafe though as that might
>>> leave the DB in an inconsistent state) ?
>>>
>>> We are trying this as a proactive measure to safeguard our clusters from
>>> any rogue queries fired accidentally or maliciously.
>>>
>>> Thanks !
>>>
>>>


Re: Stop long running queries in Cassandra 3.11.x or Cassandra 4.x

2021-10-12 Thread S G
Do the timeout values only kill the connection with the client, or just send
an error to the client?
Or do they also kill the corresponding query execution happening on the
Cassandra servers (co-ordinator, replicas etc) ?

On Tue, Oct 12, 2021 at 10:00 AM Jeff Jirsa  wrote:

> The read and write timeout values do this today.
>
>
> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L920-L943
>
>
> On Tue, Oct 12, 2021 at 9:53 AM S G  wrote:
>
>> Hello,
>>
>> Is there a way to stop long running queries in Cassandra (versions 3.11.x
>> or 4.x) ?
>> The use-case is to have some kind of a circuit breaker based on
>> query-time that has exceeded the client's SLAs.
>> Example: If server response is useless to the client after 10 ms, then we
>> could
>> have a *query_killing_timeout* set to 15 ms (where additional 5ms allows
>> for some buffer).
>> And when that much time has elapsed, Cassandra will kill the query
>> execution automatically.
>>
>> If this is not possible in Cassandra currently, any chance we can do it
>> outside of Cassandra, like
>> a shell script that monitors such long running queries (through users
>> table etc) and kills the
>> OS-thread responsible for that query (Looks unsafe though as that might
>> leave the DB in an inconsistent state) ?
>>
>> We are trying this as a proactive measure to safeguard our clusters from
>> any rogue queries fired accidentally or maliciously.
>>
>> Thanks !
>>
>>


Re: Stop long running queries in Cassandra 3.11.x or Cassandra 4.x

2021-10-12 Thread Jeff Jirsa
The read and write timeout values do this today.

https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L920-L943
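
For anyone looking for the specific knobs, that section of cassandra.yaml
covers the per-request-type timeouts. A sketch with what I believe are the
stock 3.11/4.0 defaults (check your own version's file, as the exact names
and values may differ slightly):

read_request_timeout_in_ms: 5000      # single-partition reads
range_request_timeout_in_ms: 10000    # range scans
write_request_timeout_in_ms: 2000     # regular writes
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000          # default for other requests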


On Tue, Oct 12, 2021 at 9:53 AM S G  wrote:

> Hello,
>
> Is there a way to stop long running queries in Cassandra (versions 3.11.x
> or 4.x) ?
> The use-case is to have some kind of a circuit breaker based on query-time
> that has exceeded the client's SLAs.
> Example: If server response is useless to the client after 10 ms, then we
> could
> have a *query_killing_timeout* set to 15 ms (where additional 5ms allows
> for some buffer).
> And when that much time has elapsed, Cassandra will kill the query
> execution automatically.
>
> If this is not possible in Cassandra currently, any chance we can do it
> outside of Cassandra, like
> a shell script that monitors such long running queries (through users
> table etc) and kills the
> OS-thread responsible for that query (Looks unsafe though as that might
> leave the DB in an inconsistent state) ?
>
> We are trying this as a proactive measure to safeguard our clusters from
> any rogue queries fired accidentally or maliciously.
>
> Thanks !
>
>


Stop long running queries in Cassandra 3.11.x or Cassandra 4.x

2021-10-12 Thread S G
Hello,

Is there a way to stop long-running queries in Cassandra (versions 3.11.x
or 4.x)?
The use case is to have some kind of a circuit breaker based on query time
that has exceeded the client's SLAs.
Example: if the server response is useless to the client after 10 ms, then we
could have a *query_killing_timeout* set to 15 ms (where the additional 5 ms
allows for some buffer). And when that much time has elapsed, Cassandra would
kill the query execution automatically.

If this is not possible in Cassandra currently, any chance we can do it
outside of Cassandra, like a shell script that monitors such long-running
queries (through users table etc.) and kills the OS thread responsible for
that query? (Looks unsafe though, as that might leave the DB in an
inconsistent state.)

We are trying this as a proactive measure to safeguard our clusters from
any rogue queries fired accidentally or maliciously.

Thanks!


Re: Trouble After Changing Replication Factor

2021-10-12 Thread Jeff Jirsa
The most likely explanation is that repair failed and you didn't notice,
or that you didn't actually repair every host / every range.

Which version are you using?
How did you run repair?


On Tue, Oct 12, 2021 at 4:33 AM Isaeed Mohanna  wrote:

> Hi
>
> Yes I am sacrificing consistency to gain higher availability and faster
> speed, but my problem is not with newly inserted data that is not there for
> a very short period of time; my problem is that the data that was there
> before the RF change still does not exist on all replicas even after repair.
>
> It looks like my cluster configuration is RF3 but the data itself is still
> using RF2 and when the data is requested from the 3rd (new) replica, it
> is not there and an empty record is returned with read CL1.
>
> What can I do to force this data to be synced to all replicas as it
> should? So read CL1 request will actually return a correct result?
>
>
>
> Thanks
>
>
>
> *From:* Bowen Song 
> *Sent:* Monday, October 11, 2021 5:13 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Trouble After Changing Replication Factor
>
>
>
> You have RF=3 and both read & write CL=1, which means you are asking
> Cassandra to give up strong consistency in order to gain higher
> availability and perhaps slightly faster speed, and that's what you get. If
> you want to have strong consistency, you will need to make sure (read CL +
> write CL) > RF.
>
> On 10/10/2021 11:55, Isaeed Mohanna wrote:
>
> Hi
>
> We had a cluster with 3 nodes and replication factor 2, and we were using
> reads with consistency level ONE.
>
> We recently added a 4th node and changed the replication factor to 3. Once
> this was done, apps reading from the DB with CL1 would receive an empty
> record. Looking around, I was surprised to learn that after changing the
> replication factor, if a read request is sent to a node that should own the
> record under the new replication factor but does not have it yet, an empty
> record is returned because of CL1; the record is only written to that node
> once the repair operation is over.
>
> We ran the repair operation, which took days in our case (we had to change
> apps to CL2 to avoid serious data inconsistencies).
>
> Now the repair operations are over, and if I revert to CL1 we are still
> getting errors that records do not exist in the DB while they do; using CL2
> again it works fine.
>
> Any ideas what I am missing?
>
> Is there a way to validate that the repair task has actually done what is
> needed and that the data is actually now replicated at RF3?
>
> Could it be a Cassandra Driver issue? If I issue the request in cqlsh I do
> get the record, but I cannot know if I am hitting the replica that doesn't
> hold the record.
>
> Thanks for your help
>
>


Re: Trouble After Changing Replication Factor

2021-10-12 Thread Dmitry Saprykin
Hi,

You could try to run a full repair over a short subrange containing data
missing from one replica. It should take just a couple of minutes and will
show whether your earlier repair failed to finish.
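
For example, a rough sketch of such a subrange repair (keyspace, table and
token values below are placeholders; pick a narrow token range that you know
contains one of the affected partitions, e.g. by looking it up with CQL's
token() function):

nodetool repair -full -st -3074457345618258603 -et -3074457345618258500 my_keyspace my_table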

Dmitrii Saprykin

On Tue, Oct 12, 2021 at 7:54 AM Bowen Song  wrote:

> I see. In that case, I suspect the repair wasn't fully successful. Try
> repairing the newly joined node again, and make sure it actually finishes
> successfully.
> On 12/10/2021 12:23, Isaeed Mohanna wrote:
>
> Hi
>
> Yes I am sacrificing consistency to gain higher availability and faster
> speed, but my problem is not with newly inserted data that is not there for
> a very short period of time; my problem is that the data that was there
> before the RF change still does not exist on all replicas even after repair.
>
> It looks like my cluster configuration is RF3 but the data itself is still
> using RF2 and when the data is requested from the 3rd (new) replica, it
> is not there and an empty record is returned with read CL1.
>
> What can I do to force this data to be synced to all replicas as it
> should? So read CL1 request will actually return a correct result?
>
>
>
> Thanks
>
>
>
> *From:* Bowen Song  
> *Sent:* Monday, October 11, 2021 5:13 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Trouble After Changing Replication Factor
>
>
>
> You have RF=3 and both read & write CL=1, which means you are asking
> Cassandra to give up strong consistency in order to gain higher
> availability and perhaps slightly faster speed, and that's what you get. If
> you want to have strong consistency, you will need to make sure (read CL +
> write CL) > RF.
>
> On 10/10/2021 11:55, Isaeed Mohanna wrote:
>
> Hi
>
> We had a cluster with 3 nodes and replication factor 2, and we were using
> reads with consistency level ONE.
>
> We recently added a 4th node and changed the replication factor to 3. Once
> this was done, apps reading from the DB with CL1 would receive an empty
> record. Looking around, I was surprised to learn that after changing the
> replication factor, if a read request is sent to a node that should own the
> record under the new replication factor but does not have it yet, an empty
> record is returned because of CL1; the record is only written to that node
> once the repair operation is over.
>
> We ran the repair operation, which took days in our case (we had to change
> apps to CL2 to avoid serious data inconsistencies).
>
> Now the repair operations are over, and if I revert to CL1 we are still
> getting errors that records do not exist in the DB while they do; using CL2
> again it works fine.
>
> Any ideas what I am missing?
>
> Is there a way to validate that the repair task has actually done what is
> needed and that the data is actually now replicated at RF3?
>
> Could it be a Cassandra Driver issue? If I issue the request in cqlsh I do
> get the record, but I cannot know if I am hitting the replica that doesn't
> hold the record.
>
> Thanks for your help
>
>


Re: Trouble After Changing Replication Factor

2021-10-12 Thread Bowen Song
I see. In that case, I suspect the repair wasn't fully successful. Try
repairing the newly joined node again, and make sure it actually finishes
successfully.
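
Something along these lines, run on the newly joined node (the keyspace name
is a placeholder; a full rather than incremental repair is probably what you
want here, since incremental repair can skip SSTables already marked as
repaired):

nodetool repair -full my_keyspace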


On 12/10/2021 12:23, Isaeed Mohanna wrote:


Hi

Yes I am sacrificing consistency to gain higher availability and 
faster speed, but my problem is not with newly inserted data that is 
not there for a very short period of time; my problem is that the data that
was there before the RF change still does not exist on all replicas even
after repair.


It looks like my cluster configuration is RF3 but the data itself is 
still using RF2 and when the data is requested from the 3rd (new)
replica, it is not there and an empty record is returned with read CL1.


What can I do to force this data to be synced to all replicas as it 
should? So read CL1 request will actually return a correct result?


Thanks

*From:* Bowen Song 
*Sent:* Monday, October 11, 2021 5:13 PM
*To:* user@cassandra.apache.org
*Subject:* Re: Trouble After Changing Replication Factor

You have RF=3 and both read & write CL=1, which means you are asking 
Cassandra to give up strong consistency in order to gain higher 
availability and perhaps slightly faster speed, and that's what you get.
If you want to have strong consistency, you will need to make sure 
(read CL + write CL) > RF.


On 10/10/2021 11:55, Isaeed Mohanna wrote:

Hi

We had a cluster with 3 nodes and replication factor 2, and we were
using reads with consistency level ONE.

We recently added a 4th node and changed the replication factor to 3.
Once this was done, apps reading from the DB with CL1 would receive an
empty record. Looking around, I was surprised to learn that after
changing the replication factor, if a read request is sent to a node
that should own the record under the new replication factor but does
not have it yet, an empty record is returned because of CL1; the record
is only written to that node once the repair operation is over.

We ran the repair operation, which took days in our case (we had to
change apps to CL2 to avoid serious data inconsistencies).

Now the repair operations are over, and if I revert to CL1 we are
still getting errors that records do not exist in the DB while they
do; using CL2 again it works fine.

Any ideas what I am missing?

Is there a way to validate that the repair task has actually done
what is needed and that the data is actually now replicated at RF3?

Could it be a Cassandra Driver issue? If I issue the request in cqlsh
I do get the record, but I cannot know if I am hitting the replica
that doesn't hold the record.

Thanks for your help


RE: Trouble After Changing Replication Factor

2021-10-12 Thread Isaeed Mohanna
Hi
Yes, I am sacrificing consistency to gain higher availability and faster
speed, but my problem is not with newly inserted data that is missing for a
very short period of time; my problem is that the data that was there before
the RF change still does not exist on all replicas even after repair.
It looks like my cluster configuration is RF3 but the data itself is still
using RF2, and when the data is requested from the 3rd (new) replica it is not
there and an empty record is returned with read CL1.
What can I do to force this data to be synced to all replicas as it should, so
that a read at CL1 will actually return a correct result?
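(For my own understanding of the (read CL + write CL) > RF rule quoted below,
with RF=3: reads and writes both at ONE give 1 + 1 = 2 <= 3, so a read can
land entirely on a replica the write never reached, while reads and writes at
QUORUM give 2 + 2 = 4 > 3, so every read overlaps at least one replica that
took the write.)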

Thanks

From: Bowen Song 
Sent: Monday, October 11, 2021 5:13 PM
To: user@cassandra.apache.org
Subject: Re: Trouble After Changing Replication Factor


You have RF=3 and both read & write CL=1, which means you are asking Cassandra 
to give up strong consistency in order to gain higher availability and perhaps 
slightly faster speed, and that's what you get. If you want to have strong
consistency, you will need to make sure (read CL + write CL) > RF.
On 10/10/2021 11:55, Isaeed Mohanna wrote:
Hi
We had a cluster with 3 nodes and replication factor 2, and we were using
reads with consistency level ONE.
We recently added a 4th node and changed the replication factor to 3. Once
this was done, apps reading from the DB with CL1 would receive an empty
record. Looking around, I was surprised to learn that after changing the
replication factor, if a read request is sent to a node that should own the
record under the new replication factor but does not have it yet, an empty
record is returned because of CL1; the record is only written to that node
once the repair operation is over.
We ran the repair operation, which took days in our case (we had to change
apps to CL2 to avoid serious data inconsistencies).
Now the repair operations are over, and if I revert to CL1 we are still
getting errors that records do not exist in the DB while they do; using CL2
again it works fine.
Any ideas what I am missing?
Is there a way to validate that the repair task has actually done what is
needed and that the data is actually now replicated at RF3?
Could it be a Cassandra Driver issue? If I issue the request in cqlsh I do
get the record, but I cannot know if I am hitting the replica that doesn't
hold the record.
Thanks for your help