Re: How to stop "nodetool repair" in 2.1.2?

2015-04-15 Thread Benyi Wang
Using JMX worked. Thanks a lot.
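In case anyone finds this thread later, here is roughly what that JMX call can
look like from a minimal standalone Java client. This is only a sketch; it
assumes the default JMX port 7199 and no JMX authentication, and it needs to be
run against every node participating in the repair.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class StopRepair {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        // Cassandra's default JMX port; adjust if yours differs.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // The StorageService MBean exposes the repair-termination operation.
            ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");
            // No-argument operation, so params and signature are both null.
            mbs.invoke(storageService, "forceTerminateAllRepairSessions", null, null);
            System.out.println("Requested repair termination on " + host);
        }
    }
}

jconsole (or any other JMX client) pointed at the same MBean works just as well.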

On Wed, Apr 15, 2015 at 3:57 PM, Robert Coli  wrote:

> On Wed, Apr 15, 2015 at 3:30 PM, Benyi Wang  wrote:
>
>> It didn't work. I ran the command on all nodes, but I can still see the
>> repair activities.
>>
>
> Your input as an operator who wants a nodetool command to trivially stop
> repairs is welcome here:
>
> https://issues.apache.org/jira/browse/CASSANDRA-3486
>
> For now, your two options are:
>
> 1) restart all nodes participating in the repair
> 2) access the JMX endpoint forceTerminateAllRepairSessions on all nodes
> participating in the repair
>
> =Rob
>
>


Re: Delete query range limitation

2015-04-15 Thread Jim Witschey
There's a ticket for range deletions in CQL here:

https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-6237
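Until that lands, the usual workaround is to read back the clustering keys that
fall in the range and delete the rows one at a time. Below is a rough sketch
with the DataStax Java driver; the keyspace, table and column names are made up
for illustration, and note that this writes one tombstone per row rather than a
single range tombstone.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class RangeDeleteWorkaround {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1").build()) {
            Session session = cluster.connect("my_ks");
            // Find the rows whose clustering key falls in the range to delete.
            ResultSet rows = session.execute(
                    "SELECT event_time FROM events WHERE sensor_id = ? AND event_time < ?",
                    "sensor-1", 1429142400000L);
            PreparedStatement delete = session.prepare(
                    "DELETE FROM events WHERE sensor_id = ? AND event_time = ?");
            for (Row row : rows) {
                // One DELETE (and one tombstone) per matching row.
                session.execute(delete.bind("sensor-1", row.getLong("event_time")));
            }
        }
    }
}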

On Apr 15, 2015 6:27 PM, "Dan Kinder"  wrote:
>
> I understand that range deletes are currently not supported (
> http://stackoverflow.com/questions/19390335/cassandra-cql-delete-using-a-less-than-operator-on-a-secondary-key
> ).
>
> Since Cassandra now does have range tombstones, is there a reason why this
> can't be allowed? Is there a ticket for supporting it, or is it a
> deliberate design decision not to?


Re: How to stop "nodetool repair" in 2.1.2?

2015-04-15 Thread Robert Coli
On Wed, Apr 15, 2015 at 3:30 PM, Benyi Wang  wrote:

> It didn't work. I ran the command on all nodes, but I can still see the
> repair activities.
>

Your input as an operator who wants a nodetool command to trivially stop
repairs is welcome here:

https://issues.apache.org/jira/browse/CASSANDRA-3486

For now, your two options are:

1) restart all nodes participating in the repair
2) access the JMX endpoint forceTerminateAllRepairSessions on all nodes
participating in the repair

=Rob


Re: How to stop "nodetool repair" in 2.1.2?

2015-04-15 Thread Benyi Wang
It didn't work. I ran the command on all nodes, but I can still see the
repair activities.

On Wed, Apr 15, 2015 at 3:20 PM, Sebastian Estevez <
sebastian.este...@datastax.com> wrote:

> nodetool stop *VALIDATION*
> On Apr 15, 2015 5:16 PM, "Benyi Wang"  wrote:
>
>> I ran "nodetool repair -- keyspace table" for a table, and it is still
>> running after 4 days. I knew there is an issue for repair with vnodes
>> https://issues.apache.org/jira/browse/CASSANDRA-5220. My question is how
>> I can kill this sequential repair?
>>
>> I killed the process which I ran the repair command. But I still can find
>> the repair activities running on different nodes in OpsCenter.
>>
>> Is there a way I can stop the repair without restarting the nodes?
>>
>> Thanks.
>>
>


Delete query range limitation

2015-04-15 Thread Dan Kinder
I understand that range deletes are currently not supported (
http://stackoverflow.com/questions/19390335/cassandra-cql-delete-using-a-less-than-operator-on-a-secondary-key
).

Since Cassandra now does have range tombstones, is there a reason why this
can't be allowed? Is there a ticket for supporting it, or is it a
deliberate design decision not to?


Re: How to stop "nodetool repair" in 2.1.2?

2015-04-15 Thread Sebastian Estevez
nodetool stop *VALIDATION*
On Apr 15, 2015 5:16 PM, "Benyi Wang"  wrote:

> I ran "nodetool repair -- keyspace table" for a table, and it is still
> running after 4 days. I knew there is an issue for repair with vnodes
> https://issues.apache.org/jira/browse/CASSANDRA-5220. My question is how
> I can kill this sequential repair?
>
> I killed the process which I ran the repair command. But I still can find
> the repair activities running on different nodes in OpsCenter.
>
> Is there a way I can stop the repair without restarting the nodes?
>
> Thanks.
>


How to stop "nodetool repair" in 2.1.2?

2015-04-15 Thread Benyi Wang
I ran "nodetool repair -- keyspace table" for a table, and it is still
running after 4 days. I knew there is an issue for repair with vnodes
https://issues.apache.org/jira/browse/CASSANDRA-5220. My question is how I
can kill this sequential repair?

I killed the process which I ran the repair command. But I still can find
the repair activities running on different nodes in OpsCenter.

Is there a way I can stop the repair without restarting the nodes?

Thanks.


Re: Keyspace Replication changes not synchronized after adding Datacenter

2015-04-15 Thread Paul Leddy

Hello,

No, it is not expected at all that you should have to run the ALTER
statement in each DC. Yes, it indicates a larger problem, for sure.


Check that ports are open between all nodes, especially 7000, if I 
recall correctly. We use a simple telnet check.
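
Another quick check from the client side is whether all nodes actually report
the same schema version. Here is a small sketch with the DataStax Java driver
(any reachable node can be the contact point; checkSchemaAgreement() should be
available in reasonably recent 2.x drivers):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;

public class SchemaAgreementCheck {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")  // any reachable node
                .build()) {
            cluster.init();
            // True only if every reachable node reports the same schema version.
            boolean agreed = cluster.getMetadata().checkSchemaAgreement();
            System.out.println("Schema in agreement: " + agreed);
            for (Host host : cluster.getMetadata().getAllHosts()) {
                System.out.println(host.getAddress() + "  dc=" + host.getDatacenter()
                        + "  up=" + host.isUp());
            }
        }
    }
}

nodetool describecluster on any node shows the same schema version information
without writing any code.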


Paul

On 04/13/2015 10:22 AM, Thunder Stumpges wrote:

Hi guys,

We have recently added two datacenters to our existing 2.0.6 cluster. 
We followed the process here pretty much exactly:

http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html

We are using GossipingPropertyFileSnitch and NetworkTopologyStrategy 
across the board. All property files are identical in each of the 
three datacenters, and we use two nodes from each DC in the seed list.


However, when we came to step 7.a we ran the ALTER KEYSPACE command on 
one of the new datacenters (to add it as a replica). This change was 
reflected in the datacenter where it ran, as returned by DESCRIBE 
KEYSPACE. However, the change was NOT propagated to either of the other 
two datacenters. We effectively had to run the ALTER KEYSPACE command 
3 times, once in each datacenter. Is this expected? I could find no 
documentation stating that this needed to be done, nor any 
documentation on how the system keyspace is kept in sync across 
datacenters in general.


If this is indicative of a larger problem with our installation, how 
would we go about troubleshooting it?


Thanks in advance!
Thunder






Re: One node misbehaving (lots of GC), ideas?

2015-04-15 Thread Michal Michalski
Hi Erik,

Setting aside for a moment that it's only a single node: does this node store
any super-long rows?
The first things that come to my mind after reading your e-mail are
unthrottled compaction (it sounds like a possible cause, but it would affect
other nodes too) and very large rows. Or a mix of both?
This may also be of interest for investigating GC issues (if you haven't seen
it yet) and for pinning the problem down further:
http://aryanet.com/blog/cassandra-garbage-collector-tuning
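
If it helps with pinning it down, the per-collector GC counters on the suspect
node can be polled remotely over JMX and compared with a healthy node. A small
sketch, assuming the default JMX port 7199 and no JMX authentication:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcStats {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // One MBean per collector, e.g. ParNew and ConcurrentMarkSweep.
            for (ObjectName gc : mbs.queryNames(
                    new ObjectName("java.lang:type=GarbageCollector,*"), null)) {
                long count = (Long) mbs.getAttribute(gc, "CollectionCount");
                long millis = (Long) mbs.getAttribute(gc, "CollectionTime");
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getKeyProperty("name"), count, millis);
            }
        }
    }
}

Sampling this periodically on the bad node and on a healthy one should show
whether the extra collections really line up with the compaction activity.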

M.



Kind regards,
Michał Michalski,
michal.michal...@boxever.com

On 15 April 2015 at 13:15, Erik Forsberg  wrote:

> Hi!
>
> We're having problems with one node (out of 56 in total) misbehaving.
> Symptoms are:
>
> * High number of full CMS old space collections during early morning
> when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few
> thrift insertions.
> * Really long stop-the-world GC events (I've seen up to 50 seconds) for
> both CMS and ParNew.
> * CPU usage higher during early morning hours compared to other nodes.
> * The large number of Garbage Collections *seems* to correspond to doing
> a lot of compactions (SizeTiered for most of our CFs, Leveled for a few
> small ones)
> * Node losing track of which other nodes are up and keeping that state
> until restart (this, I think, is a bug caused by the GC behaviour, with
> the stop-the-world pauses making the node not accept gossip connections
> from other nodes)
>
> This is on 2.0.13 with vnodes (256 per node).
>
> All other nodes behave normally, with a few (2-3) full CMS old space
> collections in the same 3h period in which the troubled node does some
> 30. Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the
> problem was even worse (it seems; this is a bit hard to debug as it
> happens *almost* every night).
>
> nodetool status shows that although we have a certain imbalance in the
> cluster, this node is neither the most nor the least loaded. I.e. we
> have between 1.6% and 2.1% in the "Owns" column, and the troublesome
> node reports 1.7%.
>
> All nodes are under puppet control, so configuration is the same
> everywhere.
>
> We're running NetworkTopologyStrategy with rack awareness, and here's a
> deviation from recommended settings - we have a slightly varying number of
> nodes in the racks:
>
>  15 cssa01
>  15 cssa02
>  13 cssa03
>  13 cssa04
>
> The affected node is in the cssa04 rack. Could this mean I have some
> kind of hotspot situation? Why would that show up as more GC work?
>
> I'm quite puzzled here, so I'm looking for hints on how to identify what
> is causing this.
>
> Regards,
> \EF
>
>
>
>
>


Re: Delete-only work loads crash Cassandra

2015-04-15 Thread Robert Wille
I can readily reproduce the bug and have filed a JIRA ticket:
https://issues.apache.org/jira/browse/CASSANDRA-9194

I’m posting it here for posterity.

On Apr 13, 2015, at 11:59 AM, Robert Wille wrote:

Unfortunately, I’ve switched email systems and don’t have my emails from that 
time period. I did not file a Jira, and I don’t remember who made the patch for 
me or if he filed a Jira on my behalf.

I vaguely recall seeing the fix in the Cassandra change logs, but I just went 
and read them and I don’t see it. I’m probably remembering wrong.

My suspicion is that the original patch did not make it into the main branch, 
and I just have always had enough concurrent writing to keep Cassandra happy.

Hopefully the author of the patch will read this and be able to chime in.

This issue is very reproducible. I’ll try to come up with some time to write a 
simple program that illustrates the problem and file a Jira.

Thanks

Robert

On Apr 13, 2015, at 10:39 AM, Philip Thompson wrote:

Did the original patch make it into upstream? That's unclear. If so, what was 
the JIRA #? Have you filed a JIRA for the new problem?

On Mon, Apr 13, 2015 at 12:21 PM, Robert Wille wrote:
Back in 2.0.4 or 2.0.5 I ran into a problem with delete-only workloads. If I 
did lots of deletes and no upserts, Cassandra would report that the memtable 
was 0 bytes because of an accounting error. The memtable would never flush and 
Cassandra would eventually die. Someone was kind enough to create a patch, 
which seemed to have fixed the problem, but last night the issue reared its 
ugly head again.

I’m now running 2.0.14. I ran a cleanup process on my cluster (10 nodes, RF=3, 
CL=1). The workload was pretty light, because this cleanup process is 
single-threaded and does everything synchronously. It was performing 4 reads 
per second and about 3000 deletes per second. Over the course of many hours, 
heap slowly grew on all nodes. CPU utilization also increased as GC consumed an 
ever-increasing amount of time. Eventually a couple of nodes shed 3.5 GB of 
their 7.5 GB. Other nodes weren’t so fortunate and started flapping due to 
30-second GC pauses.

The workaround is pretty simple. The cleanup process can periodically write a 
dummy record with a TTL so that Cassandra can flush its memtables and 
function properly. However, I think this probably ought to be fixed. 
Delete-only workloads can’t be that rare. I can’t be the only one who needs to 
go through and clean up their tables.
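
For anyone hitting the same thing, the workaround amounts to something like the
following, sketched with the DataStax Java driver; the keyspace, table, interval
and TTL are all made up for illustration:

import java.util.Date;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class MemtableKeepAlive {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1").build()) {
            Session session = cluster.connect("my_ks");
            while (true) {
                // A throwaway write so the memtable sees some live data and can
                // flush normally; the TTL makes the dummy row expire on its own.
                session.execute(
                        "INSERT INTO cleanup_heartbeat (id, written_at) "
                                + "VALUES (0, ?) USING TTL 300",
                        new Date());
                Thread.sleep(60_000);
            }
        }
    }
}

Any small periodic write to any table in the keyspace should do; the TTL just
keeps the dummy rows from piling up.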

Robert






One node misbehaving (lots of GC), ideas?

2015-04-15 Thread Erik Forsberg
Hi!

We're having problems with one node (out of 56 in total) misbehaving.
Symptoms are:

* High number of full CMS old space collections during early morning
when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few
thrift insertions.
* Really long stop-the-world GC events (I've seen up to 50 seconds) for
both CMS and ParNew.
* CPU usage higher during early morning hours compared to other nodes.
* The large number of Garbage Collections *seems* to correspond to doing
a lot of compactions (SizeTiered for most of our CFs, Leveled for a few
small ones)
* Node losing track of which other nodes are up and keeping that state
until restart (this, I think, is a bug caused by the GC behaviour, with
the stop-the-world pauses making the node not accept gossip connections
from other nodes)

This is on 2.0.13 with vnodes (256 per node).

All other nodes behave normally, with a few (2-3) full CMS old space
collections in the same 3h period in which the troubled node does some
30. Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the
problem was even worse (it seems; this is a bit hard to debug as it
happens *almost* every night).

nodetool status shows that although we have a certain imbalance in the
cluster, this node is neither the most nor the least loaded. I.e. we
have between 1.6% and 2.1% in the "Owns" column, and the troublesome
node reports 1.7%.

All nodes are under puppet control, so configuration is the same
everywhere.

We're running NetworkTopologyStrategy with rack awareness, and here's a
deviation from recommended settings - we have a slightly varying number of
nodes in the racks:

 15 cssa01
 15 cssa02
 13 cssa03
 13 cssa04

The affected node is in the cssa04 rack. Could this mean I have some
kind of hotspot situation? Why would that show up as more GC work?

I'm quite puzzled here, so I'm looking for hints on how to identify what
is causing this.

Regards,
\EF