[jira] [Commented] (CASSANDRA-6125) Race condition in Gossip propagation

2015-07-21 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636139#comment-14636139
 ] 

Peter Haggerty commented on CASSANDRA-6125:
---

We've seen this bug, or something like it, on 2.0.11 with 45 nodes in a fairly 
noisy AWS environment, but other than CASSANDRA-8336 I don't see any fixes to 
gossip post-2.0.11.

The nodetool status command doesn't list the node that doesn't have status info. 
It's not up or down; it's simply not there, and this impacts % ownership.
In a recent instance of this, 4 nodes had the same status hole, but only 2 of 
the 4 had nodetool ring output that differed from the other 41 members of the 
ring that had no status hole.

Restarting cassandra on the node that has a missing STATUS entry in gossip 
fixes the problem, in that the hole goes away. This is something we used to 
see more commonly before 2.0.11, so the fix does appear to work, but are there 
other places where a race might be happening?

{code}
/10.xx.yyy.169
  generation:1436544814
  heartbeat:2986679
  SEVERITY:0.0
  HOST_ID:7d22299f-b35b-4035-82bc-e2b603a655d7
  LOAD:2.57836E11
  RACK:1e
  NET_VERSION:7
  DC:us-east
  RPC_ADDRESS:10.xx.yyy.169
  RELEASE_VERSION:2.0.11
  SCHEMA:0f72be52-2751-33a6-a172-8511e943b2ec
/10.xx.yyy.175
  generation:1419877470
  heartbeat:53496976
  SEVERITY:1.2787723541259766
  HOST_ID:c87ed8db-76b6-485a-ac2f-32c2822b1ef5
  LOAD:3.08812188602E11
  RACK:1e
  NET_VERSION:7
  STATUS:NORMAL,-1010822684895662807
  DC:us-east
  RPC_ADDRESS:10.xx.yyy.175
  RELEASE_VERSION:2.0.11
  SCHEMA:0f72be52-2751-33a6-a172-8511e943b2ec
{code}
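For anyone chasing the same symptom, the STATUS hole can be spotted mechanically by scanning `nodetool gossipinfo` output for endpoints that have no STATUS attribute. A minimal sketch (a hypothetical helper, not part of nodetool; it assumes the textual format shown above, where endpoint lines start with '/' and attributes are indented `KEY:value` pairs):

```python
# Sketch: list endpoints in `nodetool gossipinfo` output that lack a STATUS
# entry. Hypothetical helper, not part of nodetool; parses the textual format
# shown in the comment above.
def endpoints_missing_status(gossipinfo_text):
    missing, endpoint, has_status = [], None, False
    for line in gossipinfo_text.splitlines():
        if line.startswith("/"):
            # New endpoint section; flush the previous one first.
            if endpoint is not None and not has_status:
                missing.append(endpoint)
            endpoint, has_status = line.strip(), False
        elif line.strip().startswith("STATUS:"):
            has_status = True
    if endpoint is not None and not has_status:
        missing.append(endpoint)
    return missing

sample = (
    "/10.0.0.1\n  generation:1436544814\n  HOST_ID:7d22\n"
    "/10.0.0.2\n  STATUS:NORMAL,-1010822684895662807\n"
)
assert endpoints_missing_status(sample) == ["/10.0.0.1"]
```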


 Race condition in Gossip propagation
 

 Key: CASSANDRA-6125
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6125
 Project: Cassandra
  Issue Type: Bug
Reporter: Sergio Bossa
Assignee: Brandon Williams
 Fix For: 2.0.11, 2.1.1

 Attachments: 6125.txt


 Gossip propagation has a race when concurrent VersionedValues are created and 
 submitted/propagated, causing some updates to be lost, even if happening on 
 different ApplicationStatuses.
 That's what happens basically:
 1) A new VersionedValue V1 is created with version X.
 2) A new VersionedValue V2 is created with version Y = X + 1.
 3) V2 is added to the endpoint state map and propagated.
 4) Nodes register Y as max version seen.
 5) At this point, V1 is added to the endpoint state map and propagated too.
 6) V1's version is X < Y, so nodes do not ask for its value after digests.
 A possible solution would be to propagate/track per-ApplicationStatus 
 versions, possibly encoding them to avoid network overhead.
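The race above can be sketched as a toy simulation (class and method names are illustrative, not Cassandra's actual gossip internals): a receiver that tracks only the maximum version seen per endpoint will never request V1 once V2's higher version has been recorded, even though the two values belong to different application states.

```python
# Toy model of the race described in steps 1-6 above. Names are illustrative,
# not Cassandra's real classes. The receiver tracks only the max version seen
# per endpoint, so a value with a lower version that propagates later is
# silently dropped, even though it updates a different application state.

class Receiver:
    def __init__(self):
        self.max_version = -1
        self.state = {}  # application-state name -> value

    def on_digest(self, app_state, version, value):
        # Only request (and apply) values newer than the max version seen.
        if version > self.max_version:
            self.max_version = version
            self.state[app_state] = value

receiver = Receiver()
# V2 (version Y = X + 1) is propagated first...
receiver.on_digest("LOAD", 2, "load-v2")
# ...then V1 (version X) arrives and is never asked for:
receiver.on_digest("STATUS", 1, "NORMAL")
assert "STATUS" not in receiver.state  # the STATUS update was lost
```

Tracking versions per application state, as the proposed solution suggests, would make the second digest comparison succeed because STATUS would have its own max version.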



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8014) NPE in Message.java line 324

2015-04-20 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502921#comment-14502921
 ] 

Peter Haggerty commented on CASSANDRA-8014:
---

We just saw this again on 2.0.11 in very similar circumstances (gently shutting 
down cassandra with disable commands before terminating it):

{code}
ERROR [RPC-Thread:50] 2015-04-20 14:14:23,165 CassandraDaemon.java (line 199) Exception in thread Thread[RPC-Thread:50,5,main]
java.lang.RuntimeException: java.lang.NullPointerException
at com.lmax.disruptor.FatalExceptionHandler.handleEventException(FatalExceptionHandler.java:45)
at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:126)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at com.thinkaurelius.thrift.Message.getInputTransport(Message.java:338)
at com.thinkaurelius.thrift.Message.invoke(Message.java:308)
at com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695)
at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689)
at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
... 3 more
{code}

 NPE in Message.java line 324
 

 Key: CASSANDRA-8014
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8014
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cassandra 2.0.9
Reporter: Peter Haggerty
Assignee: Pavel Yaskevich
 Attachments: NPE_Message.java_line-324.txt


 We received this when a server was rebooting and attempted to shut Cassandra 
 down while it was still quite busy. While it's normal for us to have a 
 handful of RejectedExecution exceptions on a sudden shutdown like this, 
 these NPEs in Message.java are new.
 The attached file includes the logs from the StorageServiceShutdownHook 
 message through the "Logging initialized" message after the server restarts 
 and Cassandra comes back up.
 {code}ERROR [pool-10-thread-2] 2014-09-29 08:33:44,055 Message.java (line 
 324) Unexpected throwable while invoking!
 java.lang.NullPointerException
 at com.thinkaurelius.thrift.util.mem.Buffer.size(Buffer.java:83)
 at 
 com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.expand(FastMemoryOutputTransport.java:84)
 at 
 com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.write(FastMemoryOutputTransport.java:167)
 at 
 org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:55)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
 at 
 com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
 at 
 com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:638)
 at 
 com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:632)
 at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8014) NPE in Message.java line 324

2015-04-20 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-8014:
--
Environment: Cassandra 2.0.9, Cassandra 2.0.11  (was: Cassandra 2.0.9)

 NPE in Message.java line 324
 

 Key: CASSANDRA-8014
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8014
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cassandra 2.0.9, Cassandra 2.0.11
Reporter: Peter Haggerty
Assignee: Pavel Yaskevich
 Attachments: NPE_Message.java_line-324.txt


 We received this when a server was rebooting and attempted to shut Cassandra 
 down while it was still quite busy. While it's normal for us to have a 
 handful of RejectedExecution exceptions on a sudden shutdown like this, 
 these NPEs in Message.java are new.
 The attached file includes the logs from the StorageServiceShutdownHook 
 message through the "Logging initialized" message after the server restarts 
 and Cassandra comes back up.
 {code}ERROR [pool-10-thread-2] 2014-09-29 08:33:44,055 Message.java (line 
 324) Unexpected throwable while invoking!
 java.lang.NullPointerException
 at com.thinkaurelius.thrift.util.mem.Buffer.size(Buffer.java:83)
 at 
 com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.expand(FastMemoryOutputTransport.java:84)
 at 
 com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.write(FastMemoryOutputTransport.java:167)
 at 
 org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:55)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
 at 
 com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
 at 
 com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:638)
 at 
 com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:632)
 at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8014) NPE in Message.java line 324

2014-12-12 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245041#comment-14245041
 ] 

Peter Haggerty commented on CASSANDRA-8014:
---

We've seen this in a 2.0.9 instance when running nodetool disablethrift. It 
throws a half dozen of the "Unexpected throwable" errors, then proceeds to:

{code}
ERROR [pool-6-thread-2] 2014-12-12 23:43:13,643 CassandraDaemon.java (line 199) Exception in thread Thread[pool-6-thread-2,5,main]
java.lang.RuntimeException: java.lang.NullPointerException
at com.lmax.disruptor.FatalExceptionHandler.handleEventException(FatalExceptionHandler.java:45)
at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:126)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at com.thinkaurelius.thrift.Message.getInputTransport(Message.java:338)
at com.thinkaurelius.thrift.Message.invoke(Message.java:308)
at com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:638)
at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:632)
at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
... 3 more
{code}

The nodetool disablethrift command appears to hang until killed.


 NPE in Message.java line 324
 

 Key: CASSANDRA-8014
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8014
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cassandra 2.0.9
Reporter: Peter Haggerty
Assignee: Pavel Yaskevich
 Attachments: NPE_Message.java_line-324.txt


 We received this when a server was rebooting and attempted to shut Cassandra 
 down while it was still quite busy. While it's normal for us to have a 
 handful of RejectedExecution exceptions on a sudden shutdown like this, 
 these NPEs in Message.java are new.
 The attached file includes the logs from the StorageServiceShutdownHook 
 message through the "Logging initialized" message after the server restarts 
 and Cassandra comes back up.
 {code}ERROR [pool-10-thread-2] 2014-09-29 08:33:44,055 Message.java (line 
 324) Unexpected throwable while invoking!
 java.lang.NullPointerException
 at com.thinkaurelius.thrift.util.mem.Buffer.size(Buffer.java:83)
 at 
 com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.expand(FastMemoryOutputTransport.java:84)
 at 
 com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.write(FastMemoryOutputTransport.java:167)
 at 
 org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:55)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
 at 
 com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
 at 
 com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:638)
 at 
 com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:632)
 at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8116) HSHA fails with default rpc_max_threads setting

2014-10-29 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188625#comment-14188625
 ] 

Peter Haggerty commented on CASSANDRA-8116:
---

The latest 2.0.x release of Cassandra using hsha with default settings either 
stalls after a few minutes of operation or crashes.

This does not seem like it should have a priority of Minor; this is a major 
problem. The longer 2.0.11 remains the latest version, the bigger the problem 
becomes for new users and for existing users who have automation and a high 
level of trust in minor version upgrades.



 HSHA fails with default rpc_max_threads setting
 ---

 Key: CASSANDRA-8116
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8116
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Mike Adamson
Assignee: Tyler Hobbs
Priority: Minor
 Fix For: 2.0.12, 2.1.2

 Attachments: 8116-throw-exc-2.0.txt, 8116.txt


 The HSHA server fails with 'Out of heap space' error if the rpc_max_threads 
 is left at its default setting (unlimited) in cassandra.yaml.
 I'm not proposing any code change for this but have submitted a patch for a 
 comment change in cassandra.yaml to indicate that rpc_max_threads needs to be 
 changed if you use HSHA.
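In the meantime, capping the pool explicitly in cassandra.yaml avoids the unbounded default when running the HSHA server. A sketch of the relevant settings; the cap of 2048 is only an illustrative value, not a recommendation:

```yaml
# cassandra.yaml -- example only; tune the cap for your workload.
# rpc_max_threads defaults to unlimited, which the HSHA server cannot
# handle safely; set an explicit cap when rpc_server_type is hsha.
rpc_server_type: hsha
rpc_min_threads: 16
rpc_max_threads: 2048
```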



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8102) cassandra-cli and cqlsh report two different values for a setting, partially update it and partially report it

2014-10-10 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-8102:
-

 Summary: cassandra-cli and cqlsh report two different values for a 
setting, partially update it and partially report it
 Key: CASSANDRA-8102
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8102
 Project: Cassandra
  Issue Type: Bug
 Environment: 2.0.9
Reporter: Peter Haggerty
Priority: Minor


cassandra-cli updates and prints out a min_compaction_threshold that is not 
shown by cqlsh (which shows a different min_threshold attribute).

cqlsh updates both values but only shows one of them:

{code}
cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

$ echo "describe foo;" | cassandra-cli -h `hostname` -k bar
  Compaction min/max thresholds: 8/32

$ echo "describe table foo;" | cqlsh -k bar `hostname`
  compaction={'class': 'SizeTieredCompactionStrategy'} AND



cqlsh:
ALTER TABLE foo WITH compaction = {'class' : 'SizeTieredCompactionStrategy', 
'min_threshold' : 16};

cassandra-cli:
  Compaction min/max thresholds: 16/32
  Compaction Strategy Options:
min_threshold: 16
cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND



cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

cassandra-cli:
  Compaction min/max thresholds: 8/32
  Compaction Strategy Options:
min_threshold: 16

cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8102) cassandra-cli and cqlsh report two different values for a setting, partially update it and partially report it

2014-10-10 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-8102:
--
Description: 
cassandra-cli updates and prints out a min_compaction_threshold that is not 
shown by cqlsh (which shows a different min_threshold attribute).

cqlsh updates both values but only shows one of them:

{code}
cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

$ echo "describe foo;" | cassandra-cli -h `hostname` -k bar
  Compaction min/max thresholds: 8/32

$ echo "describe table foo;" | cqlsh -k bar `hostname`
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
{code}

{code}
cqlsh:
ALTER TABLE foo WITH compaction = {'class' : 'SizeTieredCompactionStrategy', 
'min_threshold' : 16};

cassandra-cli:
  Compaction min/max thresholds: 16/32
  Compaction Strategy Options:
min_threshold: 16
cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND
{code}

{code}
cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

cassandra-cli:
  Compaction min/max thresholds: 8/32
  Compaction Strategy Options:
min_threshold: 16

cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND
{code}


  was:
cassandra-cli updates and prints out a min_compaction_threshold that is not 
shown by cqlsh (it shows a different min_threshold attribute)

cqlsh updates both values but only shows one of them

{code}
cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

$ echo describe foo; | cassandra-cli -h `hostname` -k bar
  Compaction min/max thresholds: 8/32

$ echo describe table foo; | cqlsh -k bar `hostname`
  compaction={'class': 'SizeTieredCompactionStrategy'} AND



cqlsh:
ALTER TABLE foo WITH compaction = {'class' : 'SizeTieredCompactionStrategy', 
'min_threshold' : 16};

cassandra-cli:
  Compaction min/max thresholds: 16/32
  Compaction Strategy Options:
min_threshold: 16
cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND



cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

cassandra-cli:
  Compaction min/max thresholds: 8/32
  Compaction Strategy Options:
min_threshold: 16

cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND
{code}



 cassandra-cli and cqlsh report two different values for a setting, partially 
 update it and partially report it
 --

 Key: CASSANDRA-8102
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8102
 Project: Cassandra
  Issue Type: Bug
 Environment: 2.0.9
Reporter: Peter Haggerty
Priority: Minor

 cassandra-cli updates and prints out a min_compaction_threshold that is not 
 shown by cqlsh (it shows a different min_threshold attribute)
 cqlsh updates both values but only shows one of them
 {code}
 cassandra-cli:
 UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;
 $ echo "describe foo;" | cassandra-cli -h `hostname` -k bar
   Compaction min/max thresholds: 8/32
 $ echo "describe table foo;" | cqlsh -k bar `hostname`
   compaction={'class': 'SizeTieredCompactionStrategy'} AND
 {code}
 {code}
 cqlsh:
 ALTER TABLE foo WITH compaction = {'class' : 'SizeTieredCompactionStrategy', 
 'min_threshold' : 16};
 cassandra-cli:
   Compaction min/max thresholds: 16/32
   Compaction Strategy Options:
 min_threshold: 16
 cqlsh:
   compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
 AND
 {code}
 {code}
 cassandra-cli:
 UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;
 cassandra-cli:
   Compaction min/max thresholds: 8/32
   Compaction Strategy Options:
 min_threshold: 16
 cqlsh:
   compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
 AND
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8014) NPE in Message.java line 324

2014-09-29 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-8014:
-

 Summary: NPE in Message.java line 324
 Key: CASSANDRA-8014
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8014
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cassandra 2.0.9
Reporter: Peter Haggerty
 Attachments: NPE_Message.java_line-324.txt

We received this when a server was rebooting and attempted to shut Cassandra 
down while it was still quite busy. While it's normal for us to have a handful 
of RejectedExecution exceptions on a sudden shutdown like this, these NPEs 
in Message.java are new.

The attached file includes the logs from the StorageServiceShutdownHook 
message through the "Logging initialized" message after the server restarts 
and Cassandra comes back up.


ERROR [pool-10-thread-2] 2014-09-29 08:33:44,055 Message.java (line 324) Unexpected throwable while invoking!
java.lang.NullPointerException
at com.thinkaurelius.thrift.util.mem.Buffer.size(Buffer.java:83)
at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.expand(FastMemoryOutputTransport.java:84)
at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.write(FastMemoryOutputTransport.java:167)
at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:55)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
at com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:638)
at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:632)
at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7808) LazilyCompactedRow incorrectly handles row tombstones

2014-09-09 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127642#comment-14127642
 ] 

Peter Haggerty commented on CASSANDRA-7808:
---

This is listed in the 2.0 CHANGES.txt as present in 2.0.10 but Fix Versions 
shows 2.0.11.


 LazilyCompactedRow incorrectly handles row tombstones
 -

 Key: CASSANDRA-7808
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7808
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Richard Low
Assignee: Richard Low
 Fix For: 1.2.19, 2.0.11, 2.1.0

 Attachments: 7808-v1.diff


 LazilyCompactedRow doesn’t handle row tombstones correctly, leading to an 
 AssertionError (CASSANDRA-4206) in some cases, and the row tombstone being 
 incorrectly dropped in others. It looks like this was introduced by 
 CASSANDRA-5677.
 To reproduce an AssertionError:
 1. Hack a really small return value for 
 DatabaseDescriptor.getInMemoryCompactionLimit() like 10 bytes to force large 
 row compaction
 2. Create a column family with gc_grace = 10
 3. Insert a few columns in one row
 4. Call nodetool flush
 5. Delete the row
 6. Call nodetool flush
 7. Wait 10 seconds
 8. Call nodetool compact and it will fail
 To reproduce the row tombstone being dropped, do the same except, after the 
 delete (in step 5), insert a column that sorts before the ones you inserted 
 in step 3. E.g. if you inserted b, c, d in step 3, insert a now. After the 
 compaction, which now succeeds, the full row will be visible, rather than 
 just a.
 The problem is two fold. Firstly, LazilyCompactedRow.Reducer.reduce() and 
 getReduce() incorrectly call container.clear(). This clears the columns (as 
 intended) but also removes the deletion times from container. This means no 
 further columns are deleted if they are annihilated by the row tombstone.
 Secondly, after the second pass, LazilyCompactedRow.isEmpty() is called which 
 calls
 {{ColumnFamilyStore.removeDeletedCF(emptyColumnFamily, 
 controller.gcBefore(key.getToken()))}}
 which unfortunately removes the last deleted time from emptyColumnFamily if 
 it is earlier than gcBefore. Since this is only called after the second pass, 
 the second pass doesn’t remove any columns that are removed by the row 
 tombstone whereas the first pass removes just the first one.
 This is pretty serious - no large rows can ever be compacted and row 
 tombstones can go missing.
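A toy model of the first problem described above (class and method names are illustrative, not Cassandra's real API): if clearing the working container between reduce() calls also discards the row-level deletion time, later columns that should be shadowed by the row tombstone survive the compaction.

```python
# Toy model of the container.clear() bug described above (names illustrative,
# not Cassandra's actual classes). Clearing the working container between
# reduce() calls must preserve the row-level deletion time, or columns
# processed afterwards escape the row tombstone.

class Container:
    def __init__(self, row_tombstone_ts):
        self.row_tombstone_ts = row_tombstone_ts
        self.columns = []

    def add(self, name, ts):
        # The row tombstone annihilates any column written before it.
        if ts > self.row_tombstone_ts:
            self.columns.append((name, ts))

    def clear_buggy(self):
        self.columns = []
        self.row_tombstone_ts = -1  # bug: deletion time discarded too

    def clear_fixed(self):
        self.columns = []           # fix: keep row_tombstone_ts

def compact(clear):
    c = Container(row_tombstone_ts=100)
    c.add("a", 50)   # shadowed by the tombstone, correctly dropped
    clear(c)         # clear between reduce() calls
    c.add("b", 60)   # should also be shadowed...
    return [name for name, _ in c.columns]

assert compact(Container.clear_buggy) == ["b"]  # 'b' wrongly survives
assert compact(Container.clear_fixed) == []     # tombstone still applies
```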



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7145) FileNotFoundException during compaction

2014-09-09 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127646#comment-14127646
 ] 

Peter Haggerty commented on CASSANDRA-7145:
---

This is listed in the 2.0 CHANGES.txt as present in 2.0.10 but Fix Versions 
shows 2.0.11.


 FileNotFoundException during compaction
 ---

 Key: CASSANDRA-7145
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7145
 Project: Cassandra
  Issue Type: Bug
 Environment: CentOS 6.3, Datastax Enterprise 4.0.1 (Cassandra 2.0.5), 
 Java 1.7.0_55
Reporter: PJ
Assignee: Marcus Eriksson
 Fix For: 1.2.19, 2.0.11, 2.1.0

 Attachments: 
 0001-avoid-marking-compacted-sstables-as-compacting.patch, compaction - 
 FileNotFoundException.txt, repair - RuntimeException.txt, startup - 
 AssertionError.txt


 I can't finish any compaction because my nodes always throw a 
 FileNotFoundException. I've already tried the following but nothing helped:
 1. nodetool flush
 2. nodetool repair (ends with RuntimeException; see attachment)
 3. node restart (via dse cassandra-stop)
 Whenever I restart the nodes, another type of exception is logged (see 
 attachment) somewhere near the end of startup process. This particular 
 exception doesn't seem to be critical because the nodes still manage to 
 finish the startup and become online.
 I don't have specific steps to reproduce the problem that I'm experiencing 
 with compaction and repair. I'm in the middle of migrating 4.8 billion rows 
 from MySQL via SSTableLoader. 
 Some things that may or may not be relevant:
 1. I didn't drop and recreate the keyspace (so probably not related to 
 CASSANDRA-4857)
 2. I do the bulk-loading in batches of 1 to 20 millions rows. When a batch 
 reaches 100% total progress (i.e. starts to build secondary index), I kill 
 the sstableloader process and cancel the index build
 3. I restart the nodes occasionally. It's possible that there is an on-going 
 compaction during one of those restarts.
 Related StackOverflow question (mine): 
 http://stackoverflow.com/questions/23435847/filenotfoundexception-during-compaction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7828) New node cannot be joined if a value in composite type column is dropped (description updated)

2014-09-09 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127645#comment-14127645
 ] 

Peter Haggerty commented on CASSANDRA-7828:
---

This is listed in the 2.0 CHANGES.txt as present in 2.0.10 but Fix Versions 
shows 2.0.11.


 New node cannot be joined if a value in composite type column is dropped 
 (description updated)
 --

 Key: CASSANDRA-7828
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7828
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Igor Zubchenok
Assignee: Mikhail Stepura
 Fix For: 1.2.19, 2.0.11, 2.1.1

 Attachments: 1409959462180-myColumnFamily.zip, 
 CASSANDRA-2.0-7828.patch


 I get a *RuntimeException* at new node system.log on bootstrapping a new DC:
 {code:title=system.out - RuntimeException caused by IllegalArgumentException 
 in Buffer.limit|borderStyle=solid}
 INFO [NonPeriodicTasks:1] 2014-08-26 15:43:01,030 SecondaryIndexManager.java 
 (line 137) Submitting index build of [myColumnFamily.myColumnFamily_myColumn] 
 for data in 
 SSTableReader(path='/var/lib/cassandra/data/testbug/myColumnFamily/testbug-myColumnFamily-jb-1-Data.db')
 ERROR [CompactionExecutor:2] 2014-08-26 15:43:01,035 CassandraDaemon.java 
 (line 199) Exception in thread Thread[CompactionExecutor:2,1,main]
 java.lang.IllegalArgumentException
   at java.nio.Buffer.limit(Buffer.java:267)
   at 
 org.apache.cassandra.utils.ByteBufferUtil.readBytes(ByteBufferUtil.java:587)
   at 
 org.apache.cassandra.utils.ByteBufferUtil.readBytesWithShortLength(ByteBufferUtil.java:596)
   at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:61)
   at 
 org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:36)
   at org.apache.cassandra.dht.LocalToken.compareTo(LocalToken.java:44)
   at org.apache.cassandra.db.DecoratedKey.compareTo(DecoratedKey.java:85)
   at org.apache.cassandra.db.DecoratedKey.compareTo(DecoratedKey.java:36)
   at 
 java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:727)
   at 
 java.util.concurrent.ConcurrentSkipListMap.findNode(ConcurrentSkipListMap.java:789)
   at 
 java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:828)
   at 
 java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1626)
   at org.apache.cassandra.db.Memtable.resolve(Memtable.java:215)
   at org.apache.cassandra.db.Memtable.put(Memtable.java:173)
   at 
 org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:900)
   at 
 org.apache.cassandra.db.index.AbstractSimplePerColumnSecondaryIndex.insert(AbstractSimplePerColumnSecondaryIndex.java:107)
   at 
 org.apache.cassandra.db.index.SecondaryIndexManager.indexRow(SecondaryIndexManager.java:441)
   at org.apache.cassandra.db.Keyspace.indexRow(Keyspace.java:413)
   at 
 org.apache.cassandra.db.index.SecondaryIndexBuilder.build(SecondaryIndexBuilder.java:62)
   at 
 org.apache.cassandra.db.compaction.CompactionManager$9.run(CompactionManager.java:834)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 ERROR [NonPeriodicTasks:1] 2014-08-26 15:43:01,035 CassandraDaemon.java (line 
 199) Exception in thread Thread[NonPeriodicTasks:1,5,main]
 java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
 java.lang.IllegalArgumentException
   at 
 org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:413)
   at 
 org.apache.cassandra.db.index.SecondaryIndexManager.maybeBuildSecondaryIndexes(SecondaryIndexManager.java:142)
   at 
 org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:113)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at 

[jira] [Commented] (CASSANDRA-7810) tombstones gc'd before being locally applied

2014-09-09 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127644#comment-14127644
 ] 

Peter Haggerty commented on CASSANDRA-7810:
---

This is listed in the 2.0 CHANGES.txt as present in 2.0.10 but Fix Versions 
shows 2.0.11.


 tombstones gc'd before being locally applied
 

 Key: CASSANDRA-7810
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7810
 Project: Cassandra
  Issue Type: Bug
 Environment: 2.1.0.rc6
Reporter: Jonathan Halliday
Assignee: Marcus Eriksson
 Fix For: 1.2.19, 2.0.11, 2.1.0

 Attachments: 0001-7810-test-for-2.0.x.patch, 
 0001-track-gcable-tombstones-v2.patch, 0001-track-gcable-tombstones.patch, 
 0002-track-gcable-tombstones-for-2.0.patch, range_tombstone_test.py


 # single node environment
 CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 
 'replication_factor': 1 };
 use test;
 create table foo (a int, b int, primary key(a,b));
 alter table foo with gc_grace_seconds = 0;
 insert into foo (a,b) values (1,2);
 select * from foo;
 -- one row returned. so far, so good.
 delete from foo where a=1 and b=2;
 select * from foo;
 -- 0 rows. still rainbows and kittens.
 bin/nodetool flush;
 bin/nodetool compact;
 select * from foo;
  a | b
 ---+---
  1 | 2
 (1 rows)
 gahhh.
 looks like the tombstones were considered obsolete and thrown away before 
 being applied during the compaction?  gc_grace just means the interval after 
 which they won't be available for repair of remote nodes - they should still 
 apply locally regardless (and do correctly in 2.0.9)
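The failure mode described above can be sketched as a toy compaction merge (a minimal Python model, not Cassandra's actual code; the data layout and sizes are assumptions, only gc_grace_seconds = 0 comes from the reproduction):

```python
import time

GC_GRACE_SECONDS = 0  # matches "alter table foo with gc_grace_seconds = 0" above

def compact(cells, tombstones, now=None):
    """Toy single-node compaction merge. A tombstone must shadow older
    matching cells first; only afterwards may gc-able tombstones be
    purged from the output. The bug amounts to purging before applying."""
    now = time.time() if now is None else now
    # Step 1: apply tombstones locally, regardless of gc_grace.
    live = {key: (ts, val) for key, (ts, val) in cells.items()
            if key not in tombstones or tombstones[key][0] < ts}
    # Step 2: drop tombstones whose local deletion is past gc_grace.
    kept = {key: t for key, t in tombstones.items()
            if now - t[1] < GC_GRACE_SECONDS}
    return live, kept

cells = {(1, 2): (100, "row")}             # insert at write-time 100
tombstones = {(1, 2): (200, time.time())}  # delete at write-time 200
live, kept = compact(cells, tombstones)
assert live == {}  # the delete wins even though the tombstone is gc-able
assert kept == {}  # and the expired tombstone is purged after being applied
```

Reversing the two steps reproduces the symptom: the gc-able tombstone is discarded before it can shadow the row, and the deleted row reappears.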



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7543) Assertion error when compacting large row with map/list field or range tombstone

2014-09-09 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127649#comment-14127649
 ] 

Peter Haggerty commented on CASSANDRA-7543:
---

This is listed in the 2.0 CHANGES.txt as present in 2.0.10 but Fix Versions 
shows only 1.2.19.


 Assertion error when compacting large row with map/list field or range 
 tombstone
 -

 Key: CASSANDRA-7543
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7543
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: linux
Reporter: Matt Byrd
Assignee: Yuki Morishita
  Labels: compaction, map
 Fix For: 1.2.19

 Attachments: 0001-add-rangetombstone-test.patch, 
 0002-fix-rangetomebstone-not-included-in-LCR-size-calc.patch


 Hi,
 So in a couple of clusters we're hitting this problem when compacting large 
 rows with a schema which contains the map data-type.
 Here is an example of the error:
 {code}
 java.lang.AssertionError: incorrect row data size 87776427 written to 
 /cassandra/X/Y/X-Y-tmp-ic-2381-Data.db; correct is 87845952
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:162)
 org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:163)
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
  
 org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
  
 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
 {code}
 I have a python script which reproduces the problem, by just writing lots of 
 data to a single partition key with a schema that contains the map data-type.
 I added some debug logging and found that the difference in bytes seen in the 
 reproduction (255) was due to the following pieces of data being written:
 {code}
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:42,891 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 [java.nio.HeapByteBuffer[pos=0 lim=34 cap=34], java.nio.HeapByteBuffer[pos=0 
 lim=34 cap=34]](deletedAt=1405237116014999, localDeletion=1405237116) 
 startPosition: 262476 endPosition: 262561 diff: 85 
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:43,007 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 org.apache.cassandra.db.Column@3e5b5939 startPosition: 328157 endPosition: 
 328242 diff: 85 
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:44,159 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 org.apache.cassandra.db.Column@fc3299b startPosition: 984105 endPosition: 
 984190 diff: 85
 {code}
 So looking at the code you can see that there are extra range tombstones 
 written on the column index border (in ColumnIndex where 
 tombstoneTracker.writeOpenedMarker is called) which aren't accounted for in 
 LazilyCompactedRow.columnSerializedSize.
 This is where the difference comes from in the assertion error, so the 
 solution is just to account for this data.
 I have a patch which does just this, by keeping track of the extra data 
 written out via tombstoneTracker.writeOpenedMarker in ColumnIndex and adding 
 it back to the dataSize in LazilyCompactedRow.java, where it serialises out 
 the row size.
 After applying the patch the reproduction stops producing the AssertionError.
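The arithmetic behind the discrepancy can be checked with a toy model (the 85-byte marker size and the count of three boundaries come from the debug log above; the per-column size and column count are made-up assumptions):

```python
OPENED_MARKER_SIZE = 85  # "diff: 85" per re-opened marker in the log above
COLUMN_SIZE = 100        # assumed serialized size of one column

def written_size(n_columns, n_block_boundaries):
    # Bytes actually written: every column, plus one re-opened
    # range-tombstone marker at each column-index block boundary.
    return n_columns * COLUMN_SIZE + n_block_boundaries * OPENED_MARKER_SIZE

expected = 10 * COLUMN_SIZE     # what the pre-patch size check predicted
actual = written_size(10, 3)    # 3 boundaries, as in the three log lines
assert actual - expected == 255 # the 255-byte difference the assertion tripped on
```

The patch effectively adds the `n_block_boundaries * OPENED_MARKER_SIZE` term back into the expected size.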
 I know this is not a problem in 2.0+ because of single-pass compaction, 
 however there are lots of 1.2 clusters out there still which might run into 
 this.
 Please let me know if you've any questions.
 Thanks,
 Matt





[jira] [Updated] (CASSANDRA-7605) compactionstats reports incorrect byte values

2014-07-24 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-7605:
--

Attachment: CASSANDRA-7605.txt

 compactionstats reports incorrect byte values
 -

 Key: CASSANDRA-7605
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7605
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: 2.0.9, Java 1.7.0_55
Reporter: Peter Haggerty
 Attachments: CASSANDRA-7605.txt


 The output of nodetool compactionstats (while a compaction is running) is 
 incorrect.
 The output from nodetool compactionhistory and the log both match and they 
 disagree with the output from compactionstats.
 What nodetool said during the compaction was almost certainly wrong given the 
 sizes of files on disk:
     completed          total   unit  progress
  144713163589   146631071165  bytes    98.69%
 nodetool compactionhistory and the log both report the same values for that 
 compaction:
 52,596,321,269 bytes to 38,575,881,134
 The compactionhistory/log values make much more sense given the size of the 
 files on disk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (CASSANDRA-7605) compactionstats reports incorrect byte values

2014-07-23 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-7605:
-

 Summary: compactionstats reports incorrect byte values
 Key: CASSANDRA-7605
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7605
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: 2.0.9, Java 1.7.0_55
Reporter: Peter Haggerty


The output of nodetool compactionstats (while a compaction is running) is 
incorrect.

The output from nodetool compactionhistory and the log both match and they 
disagree with the output from compactionstats.

What nodetool said during the compaction was almost certainly wrong given the 
sizes of files on disk:
    completed          total   unit  progress
 144713163589   146631071165  bytes    98.69%

nodetool compactionhistory and the log both report the same values for that 
compaction:
52,596,321,269 bytes to 38,575,881,134

The compactionhistory/log values make much more sense given the size of the 
files on disk.
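Splitting the run-together numbers from the compactionstats line above shows the reported percentage is at least internally consistent with its own byte counts (a quick arithmetic check, not a statement about which figures are correct):

```python
completed = 144_713_163_589  # bytes, per the compactionstats output above
total     = 146_631_071_165

# The progress column is just completed/total expressed as a percentage.
progress = completed / total * 100
assert f"{progress:.2f}%" == "98.69%"
```

So compactionstats is self-consistent but disagrees with compactionhistory and the log, which is the bug being reported.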






[jira] [Created] (CASSANDRA-7246) Gossip Null Pointer Exception when a cassandra instance in ring is restarted

2014-05-16 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-7246:
-

 Summary: Gossip Null Pointer Exception when a cassandra instance 
in ring is restarted
 Key: CASSANDRA-7246
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7246
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: 12 node ring of 1.2.x.
11 of 12 are 1.2.15.
1 is 1.2.16.
Reporter: Peter Haggerty
Priority: Minor


12 Cassandra instances, one per node.
11 of the Cassandra instances are 1.2.15.
1 of the Cassandra instances is 1.2.16.

One of the eleven 1.2.15 Cassandra instances is restarted (disable thrift, 
gossip, then flush, drain, stop, start).

The 1.2.16 Cassandra instance noted this by throwing a Null Pointer Exception. 
None of the 1.2.15 instances threw an exception, and this behavior hadn't been 
observed before.


ERROR 02:18:06,009 Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException
at org.apache.cassandra.gms.Gossiper.convict(Gossiper.java:264)
at 
org.apache.cassandra.gms.FailureDetector.forceConviction(FailureDetector.java:246)
at 
org.apache.cassandra.gms.GossipShutdownVerbHandler.doVerb(GossipShutdownVerbHandler.java:37)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
 INFO 02:18:23,402 Node /10.x.y.x is now part of the cluster
 INFO 02:18:23,403 InetAddress /10.x.y.z is now UP
 INFO 02:18:53,494 FatClient /10.x.y.z has been silent for 3ms, removing 
from gossip
 INFO 02:19:00,031 Handshaking version with /10.x.y.z







[jira] [Commented] (CASSANDRA-5780) nodetool status and ring report incorrect/stale information after decommission

2013-12-14 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848357#comment-13848357
 ] 

Peter Haggerty commented on CASSANDRA-5780:
---

We just ran into this again when a node rebooted and came back up thinking 
everything was fine, but every other node in the ring disagreed. This was 
resolved by our normal manual restart procedure, where we stop thrift and 
gossip, flush the node, drain it, and then restart cassandra, but it definitely 
caused some confusion to have nodetool status and nodetool info report that 
the node was up and a working part of the cluster when in fact it wasn't.

The nodes in this state definitely do *not* make it clear that they are not 
part of the cluster anymore.

 nodetool status and ring report incorrect/stale information after decommission
 --

 Key: CASSANDRA-5780
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5780
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Reporter: Peter Haggerty
Priority: Trivial
  Labels: lhf, ponies

 Cassandra 1.2.6 ring of 12 instances, each with 256 tokens.
 Decommission 3 of the 12 nodes, one after another, resulting in a 9 instance ring.
 The 9 instances of cassandra that are in the ring all correctly report 
 nodetool status information for the ring and have the same data.
 After the first node is decommissioned:
 nodetool status on decommissioned-1st reports 11 nodes
 After the second node is decommissioned:
 nodetool status on decommissioned-1st reports 11 nodes
 nodetool status on decommissioned-2nd reports 10 nodes
 After the third node is decommissioned:
 nodetool status on decommissioned-1st reports 11 nodes
 nodetool status on decommissioned-2nd reports 10 nodes
 nodetool status on decommissioned-3rd reports 9 nodes
 The storage load information is similarly stale on the various decommissioned 
 nodes. The nodetool status and ring commands continue to return information 
 as if they were part of a cluster and they appear to return the last 
 information that they saw.
 In contrast the nodetool info command fails with an exception, which isn't 
 ideal but at least indicates that there was a failure rather than returning 
 stale information.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (CASSANDRA-4573) HSHA doesn't handle large messages gracefully

2013-07-20 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714355#comment-13714355
 ] 

Peter Haggerty commented on CASSANDRA-4573:
---

We may be seeing this behavior in 1.2.6. I haven't enabled debug but we are 
definitely seeing a correlation between groups of 'Read an invalid frame size 
of 0' messages (dozens at a time) during the same second that we're seeing 
large (10 seconds or more) 'GC for ConcurrentMarkSweep' events.

On a 9 node cluster we see this anywhere from 1 to 9 times a day.



 HSHA doesn't handle large messages gracefully
 -

 Key: CASSANDRA-4573
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4573
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Tyler Hobbs
Assignee: Vijay
 Attachments: repro.py


 HSHA doesn't seem to enforce any kind of max message length, and when 
 messages are too large, it doesn't fail gracefully.
 With debug logs enabled, you'll see this:
 {{DEBUG 13:13:31,805 Unexpected state 16}}
 Which seems to mean that there's a SelectionKey that's valid, but isn't ready 
 for reading, writing, or accepting.
 Client-side, you'll get this thrift error (while trying to read a frame as 
 part of {{recv_batch_mutate}}):
 {{TTransportException: TSocket read 0 bytes}}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (CASSANDRA-5783) nodetool and cassandra-cli report different information for Compaction min/max thresholds

2013-07-20 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-5783:
-

 Summary: nodetool and cassandra-cli report different information 
for Compaction min/max thresholds
 Key: CASSANDRA-5783
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5783
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 1.2.6
Reporter: Peter Haggerty


Ask cassandra-cli and nodetool the same question and you get different answers 
back. This was executed after using nodetool to adjust the compactionthreshold 
on this CF to have a minimum of 2. The change was observed to work, as we saw 
increased compactions, which is exactly what one would expect.


$ echo "describe ${CF};" \
  | cassandra-cli -h localhost -k ${KEYSPACE} \
  | grep thresholds

  Compaction min/max thresholds: 4/32


$ nodetool -h localhost getcompactionthreshold ${KEYSPACE} ${CF}
Current compaction thresholds for Metrics/dimensions_active_1:
 min = 2,  max = 32




[jira] [Updated] (CASSANDRA-5783) nodetool and cassandra-cli report different information for Compaction min/max thresholds

2013-07-20 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-5783:
--

Priority: Minor  (was: Major)

 nodetool and cassandra-cli report different information for Compaction 
 min/max thresholds
 ---

 Key: CASSANDRA-5783
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5783
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 1.2.6
Reporter: Peter Haggerty
Priority: Minor

 Ask cassandra-cli and nodetool the same question and you get different 
 answers back. This was executed after using nodetool to adjust the 
 compactionthreshold on this CF to have a minimum of 2. The change was 
 observed to work, as we saw increased compactions, which is exactly what one 
 would expect.
 $ echo "describe ${CF};" \
   | cassandra-cli -h localhost -k ${KEYSPACE} \
   | grep thresholds
   Compaction min/max thresholds: 4/32
 $ nodetool -h localhost getcompactionthreshold ${KEYSPACE} ${CF}
 Current compaction thresholds for Metrics/dimensions_active_1:
  min = 2,  max = 32



[jira] [Created] (CASSANDRA-5780) nodetool status and ring report incorrect/stale information after decommission

2013-07-19 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-5780:
-

 Summary: nodetool status and ring report incorrect/stale 
information after decommission
 Key: CASSANDRA-5780
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5780
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 1.2.6
Reporter: Peter Haggerty
Priority: Minor


Cassandra 1.2.6 ring of 12 instances, each with 256 tokens.

Decommission 3 of the 12 nodes, one after another, resulting in a 9 instance ring.

The 9 instances of cassandra that are in the ring all correctly report nodetool 
status information for the ring and have the same data.


After the first node is decommissioned:
nodetool status on decommissioned-1st reports 11 nodes

After the second node is decommissioned:
nodetool status on decommissioned-1st reports 11 nodes
nodetool status on decommissioned-2nd reports 10 nodes

After the third node is decommissioned:
nodetool status on decommissioned-1st reports 11 nodes
nodetool status on decommissioned-2nd reports 10 nodes
nodetool status on decommissioned-3rd reports 9 nodes


The storage load information is similarly stale on the various decommissioned 
nodes. The nodetool status and ring commands continue to return information as 
if they were part of a cluster and they appear to return the last information 
that they saw.

In contrast the nodetool info command fails with an exception, which isn't 
ideal but at least indicates that there was a failure rather than returning 
stale information.





[jira] [Commented] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2013-02-05 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13571772#comment-13571772
 ] 

Peter Haggerty commented on CASSANDRA-5068:
---

We see this in 1.1.9 as well.

 CLONE - Once a host has been hinted to, log messages for it repeat every 10 
 mins even if no hints are delivered
 ---

 Key: CASSANDRA-5068
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5068
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.6, 1.2.0
 Environment: cassandra 1.1.6
 java 1.6.0_30
Reporter: Peter Haggerty
Assignee: Brandon Williams
Priority: Minor
  Labels: hinted, hintedhandoff
 Attachments: 5068.txt


 We have 0 row hinted handoffs every 10 minutes like clockwork. This impacts 
 our ability to monitor the cluster by adding persistent noise in the handoff 
 metric.
 Previous mentions of this issue are here:
 http://www.mail-archive.com/user@cassandra.apache.org/msg25982.html
 The hinted handoffs can be scrubbed away with
 nodetool -h 127.0.0.1 scrub system HintsColumnFamily
 but they return after anywhere from a few minutes to multiple hours later.
 These started to appear after an upgrade to 1.1.6 and haven't gone away 
 despite rolling cleanups, rolling restarts, multiple rounds of scrubbing, etc.
 A few things we've noticed about the handoffs:
 1. The phantom handoff endpoint changes after a non-zero handoff comes through
 2. Sometimes a non-zero handoff will be immediately followed by an off 
 schedule phantom handoff to the endpoint the phantom had been using before
 3. The sstable2json output seems to include multiple sub-sections for each 
 handoff with the same deletedAt information.
 The phantom handoff endpoint changes after a non-zero handoff comes through:
  INFO [HintedHandoff:1] 2012-12-11 06:57:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
  INFO [HintedHandoff:1] 2012-12-11 07:07:35,092 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
  INFO [HintedHandoff:1] 2012-12-11 07:07:37,915 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 1058 rows to endpoint /10.10.10.2
  INFO [HintedHandoff:1] 2012-12-11 07:17:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
  INFO [HintedHandoff:1] 2012-12-11 07:27:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
 Sometimes a non-zero handoff will be immediately followed by an off 
 schedule phantom handoff to the endpoint the phantom had been using before:
  INFO [HintedHandoff:1] 2012-12-12 21:47:39,335 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 21:57:39,335 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 22:07:43,319 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 1416 rows to endpoint /10.10.10.4
  INFO [HintedHandoff:1] 2012-12-12 22:07:43,320 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 22:17:39,357 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
  INFO [HintedHandoff:1] 2012-12-12 22:27:39,337 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
 The first few entries from one of the json files:
 {
 "0aaa": {
 "ccf5dc203a2211e2e154da71a9bb": {
 "deletedAt": -9223372036854775808, 
 "subColumns": []
 }, 
 "ccf603303a2211e2e154da71a9bb": {
 "deletedAt": -9223372036854775808, 
 "subColumns": []
 }, 



[jira] [Updated] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2013-02-05 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-5068:
--

Affects Version/s: 1.1.8
   1.1.9

 CLONE - Once a host has been hinted to, log messages for it repeat every 10 
 mins even if no hints are delivered
 ---

 Key: CASSANDRA-5068
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5068
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.6, 1.1.8, 1.1.9, 1.2.0
 Environment: cassandra 1.1.6
 java 1.6.0_30
Reporter: Peter Haggerty
Assignee: Brandon Williams
Priority: Minor
  Labels: hinted, hintedhandoff
 Attachments: 5068.txt


 We have 0 row hinted handoffs every 10 minutes like clockwork. This impacts 
 our ability to monitor the cluster by adding persistent noise in the handoff 
 metric.
 Previous mentions of this issue are here:
 http://www.mail-archive.com/user@cassandra.apache.org/msg25982.html
 The hinted handoffs can be scrubbed away with
 nodetool -h 127.0.0.1 scrub system HintsColumnFamily
 but they return after anywhere from a few minutes to multiple hours later.
 These started to appear after an upgrade to 1.1.6 and haven't gone away 
 despite rolling cleanups, rolling restarts, multiple rounds of scrubbing, etc.
 A few things we've noticed about the handoffs:
 1. The phantom handoff endpoint changes after a non-zero handoff comes through
 2. Sometimes a non-zero handoff will be immediately followed by an off 
 schedule phantom handoff to the endpoint the phantom had been using before
 3. The sstable2json output seems to include multiple sub-sections for each 
 handoff with the same deletedAt information.
 The phantom handoff endpoint changes after a non-zero handoff comes through:
  INFO [HintedHandoff:1] 2012-12-11 06:57:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
  INFO [HintedHandoff:1] 2012-12-11 07:07:35,092 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
  INFO [HintedHandoff:1] 2012-12-11 07:07:37,915 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 1058 rows to endpoint /10.10.10.2
  INFO [HintedHandoff:1] 2012-12-11 07:17:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
  INFO [HintedHandoff:1] 2012-12-11 07:27:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
 Sometimes a non-zero handoff will be immediately followed by an off 
 schedule phantom handoff to the endpoint the phantom had been using before:
  INFO [HintedHandoff:1] 2012-12-12 21:47:39,335 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 21:57:39,335 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 22:07:43,319 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 1416 rows to endpoint /10.10.10.4
  INFO [HintedHandoff:1] 2012-12-12 22:07:43,320 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 22:17:39,357 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
  INFO [HintedHandoff:1] 2012-12-12 22:27:39,337 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
 The first few entries from one of the json files:
 {
 "0aaa": {
 "ccf5dc203a2211e2e154da71a9bb": {
 "deletedAt": -9223372036854775808, 
 "subColumns": []
 }, 
 "ccf603303a2211e2e154da71a9bb": {
 "deletedAt": -9223372036854775808, 
 "subColumns": []
 }, 



[jira] [Updated] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2013-01-11 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-5068:
--

Affects Version/s: 1.2.0

 CLONE - Once a host has been hinted to, log messages for it repeat every 10 
 mins even if no hints are delivered
 ---

 Key: CASSANDRA-5068
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5068
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.6, 1.2.0
 Environment: cassandra 1.1.6
 java 1.6.0_30
Reporter: Peter Haggerty
Assignee: Brandon Williams
Priority: Minor
  Labels: hinted, hintedhandoff, phantom

 We have 0 row hinted handoffs every 10 minutes like clockwork. This impacts 
 our ability to monitor the cluster by adding persistent noise in the handoff 
 metric.
 Previous mentions of this issue are here:
 http://www.mail-archive.com/user@cassandra.apache.org/msg25982.html
 The hinted handoffs can be scrubbed away with
 nodetool -h 127.0.0.1 scrub system HintsColumnFamily
 but they return after anywhere from a few minutes to multiple hours later.
 These started to appear after an upgrade to 1.1.6 and haven't gone away 
 despite rolling cleanups, rolling restarts, multiple rounds of scrubbing, etc.
 A few things we've noticed about the handoffs:
 1. The phantom handoff endpoint changes after a non-zero handoff comes through
 2. Sometimes a non-zero handoff will be immediately followed by an off 
 schedule phantom handoff to the endpoint the phantom had been using before
 3. The sstable2json output seems to include multiple sub-sections for each 
 handoff with the same deletedAt information.
 The phantom handoff endpoint changes after a non-zero handoff comes through:
  INFO [HintedHandoff:1] 2012-12-11 06:57:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
  INFO [HintedHandoff:1] 2012-12-11 07:07:35,092 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
  INFO [HintedHandoff:1] 2012-12-11 07:07:37,915 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 1058 rows to endpoint /10.10.10.2
  INFO [HintedHandoff:1] 2012-12-11 07:17:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
  INFO [HintedHandoff:1] 2012-12-11 07:27:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
 Sometimes a non-zero handoff will be immediately followed by an off 
 schedule phantom handoff to the endpoint the phantom had been using before:
  INFO [HintedHandoff:1] 2012-12-12 21:47:39,335 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 21:57:39,335 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 22:07:43,319 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 1416 rows to endpoint /10.10.10.4
  INFO [HintedHandoff:1] 2012-12-12 22:07:43,320 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 22:17:39,357 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
  INFO [HintedHandoff:1] 2012-12-12 22:27:39,337 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
 The first few entries from one of the json files:
 {
 "0aaa": {
 "ccf5dc203a2211e2e154da71a9bb": {
 "deletedAt": -9223372036854775808, 
 "subColumns": []
 }, 
 "ccf603303a2211e2e154da71a9bb": {
 "deletedAt": -9223372036854775808, 
 "subColumns": []
 }, 



[jira] [Created] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2012-12-13 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-5068:
-

 Summary: CLONE - Once a host has been hinted to, log messages for 
it repeat every 10 mins even if no hints are delivered
 Key: CASSANDRA-5068
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5068
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.6
Reporter: Peter Haggerty
Assignee: Brandon Williams
Priority: Minor
 Fix For: 0.8.10, 1.0.7


{noformat}
 INFO 15:36:03,977 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:36:03,978 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 15:46:31,248 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:46:31,249 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 15:56:29,448 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:56:29,449 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 16:06:09,949 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 16:06:09,950 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 16:16:21,529 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 16:16:21,530 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
{noformat}

Introduced by CASSANDRA-3554.  The problem is that until a compaction on the 
hints column family occurs, tombstones are present, causing the isEmpty() check 
to return false.
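The mechanism described above can be sketched with a toy model. The class and method names below are illustrative only, not Cassandra's actual code: the point is that an emptiness check which counts tombstoned columns as content keeps reporting "hints to deliver" until compaction purges the tombstones.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model: a hint row whose columns may be tombstones.
class HintRow {
    static class Column {
        final boolean tombstone;
        Column(boolean tombstone) { this.tombstone = tombstone; }
    }

    final List<Column> columns = new ArrayList<>();

    // The problematic check: a row full of tombstones still looks non-empty,
    // so handoff is re-attempted every 10 minutes and delivers 0 rows.
    boolean isEmpty() { return columns.isEmpty(); }

    // What compaction eventually does: drop the tombstoned columns.
    void compact() { columns.removeIf(c -> c.tombstone); }
}

public class PhantomHintDemo {
    public static void main(String[] args) {
        HintRow row = new HintRow();
        // A delivered hint leaves a tombstone behind.
        row.columns.add(new HintRow.Column(true));

        System.out.println(row.isEmpty());  // false: phantom handoffs continue
        row.compact();
        System.out.println(row.isEmpty());  // true: "hole" disappears
    }
}
```

This also matches the observed fix: scrubbing or compacting the hints data makes the zero-row handoffs stop until new tombstones accumulate.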



[jira] [Updated] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2012-12-13 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-5068:
--

Fix Version/s: (was: 0.8.10)
   (was: 1.0.7)
 Reviewer:   (was: jbellis)
   Labels: hinted hintedhandoff phantom  (was: )
  Description: 
We have 0 row hinted handoffs every 10 minutes like clockwork. This impacts 
our ability to monitor the cluster by adding persistent noise in the handoff 
metric.

Previous mentions of this issue are here:
http://www.mail-archive.com/user@cassandra.apache.org/msg25982.html

The hinted handoffs can be scrubbed away with
nodetool -h 127.0.0.1 scrub system HintsColumnFamily
but they return anywhere from a few minutes to several hours later.

These started to appear after an upgrade to 1.1.6 and haven't gone away despite 
rolling cleanups, rolling restarts, multiple rounds of scrubbing, etc.

A few things we've noticed about the handoffs:
1. The phantom handoff endpoint changes after a non-zero handoff comes through

2. Sometimes a non-zero handoff will be immediately followed by an off 
schedule phantom handoff to the endpoint the phantom had been using before

3. The sstable2json output seems to include multiple sub-sections for each 
handoff with the same deletedAt information.



The phantom handoff endpoint changes after a non-zero handoff comes through:
 INFO [HintedHandoff:1] 2012-12-11 06:57:35,093 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
 INFO [HintedHandoff:1] 2012-12-11 07:07:35,092 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
 INFO [HintedHandoff:1] 2012-12-11 07:07:37,915 HintedHandOffManager.java (line 
392) Finished hinted handoff of 1058 rows to endpoint /10.10.10.2
 INFO [HintedHandoff:1] 2012-12-11 07:17:35,093 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
 INFO [HintedHandoff:1] 2012-12-11 07:27:35,093 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2



Sometimes a non-zero handoff will be immediately followed by an off schedule 
phantom handoff to the endpoint the phantom had been using before:
 INFO [HintedHandoff:1] 2012-12-12 21:47:39,335 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
 INFO [HintedHandoff:1] 2012-12-12 21:57:39,335 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
 INFO [HintedHandoff:1] 2012-12-12 22:07:43,319 HintedHandOffManager.java (line 
392) Finished hinted handoff of 1416 rows to endpoint /10.10.10.4
 INFO [HintedHandoff:1] 2012-12-12 22:07:43,320 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
 INFO [HintedHandoff:1] 2012-12-12 22:17:39,357 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
 INFO [HintedHandoff:1] 2012-12-12 22:27:39,337 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4



The first few entries from one of the json files:
{
0aaa: {
ccf5dc203a2211e2e154da71a9bb: {
deletedAt: -9223372036854775808, 
subColumns: []
}, 
ccf603303a2211e2e154da71a9bb: {
deletedAt: -9223372036854775808, 
subColumns: []
}, 


  was:
{noformat}
 INFO 15:36:03,977 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:36:03,978 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 15:46:31,248 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:46:31,249 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 15:56:29,448 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:56:29,449 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 16:06:09,949 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 16:06:09,950 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 16:16:21,529 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 16:16:21,530 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
{noformat}

Introduced by CASSANDRA-3554.  The problem is that until a compaction on hints 
occurs, tombstones are present causing the isEmpty() check to be false.

  Environment: 
cassandra 1.1.6
java 1.6.0_30
Affects Version/s: (was: 0.6)
   1.1.6

Cloning CASSANDRA-3733 as it seems to be the same issue.

[jira] [Commented] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2012-12-13 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13531678#comment-13531678
 ] 

Peter Haggerty commented on CASSANDRA-5068:
---

When there are zero-row hinted handoffs, the output of list HintsColumnFamily
might show that 9 of 12 nodes in a ring have a row key like this:
RowKey: 7554

1 of the 12 nodes will have a different row key than all the rest:
RowKey: 1554

Another 1-2 nodes might not have any RowKeys at all.


 CLONE - Once a host has been hinted to, log messages for it repeat every 10 
 mins even if no hints are delivered
 ---

 Key: CASSANDRA-5068
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5068
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.6
 Environment: cassandra 1.1.6
 java 1.6.0_30
Reporter: Peter Haggerty
Assignee: Brandon Williams
Priority: Minor
  Labels: hinted, hintedhandoff, phantom

 We have 0 row hinted handoffs every 10 minutes like clockwork. This impacts 
 our ability to monitor the cluster by adding persistent noise in the handoff 
 metric.
 Previous mentions of this issue are here:
 http://www.mail-archive.com/user@cassandra.apache.org/msg25982.html
 The hinted handoffs can be scrubbed away with
 nodetool -h 127.0.0.1 scrub system HintsColumnFamily
 but they return anywhere from a few minutes to several hours later.
 These started to appear after an upgrade to 1.1.6 and haven't gone away 
 despite rolling cleanups, rolling restarts, multiple rounds of scrubbing, etc.
 A few things we've noticed about the handoffs:
 1. The phantom handoff endpoint changes after a non-zero handoff comes through
 2. Sometimes a non-zero handoff will be immediately followed by an off 
 schedule phantom handoff to the endpoint the phantom had been using before
 3. The sstable2json output seems to include multiple sub-sections for each 
 handoff with the same deletedAt information.
 The phantom handoff endpoint changes after a non-zero handoff comes through:
  INFO [HintedHandoff:1] 2012-12-11 06:57:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
  INFO [HintedHandoff:1] 2012-12-11 07:07:35,092 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
  INFO [HintedHandoff:1] 2012-12-11 07:07:37,915 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 1058 rows to endpoint /10.10.10.2
  INFO [HintedHandoff:1] 2012-12-11 07:17:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
  INFO [HintedHandoff:1] 2012-12-11 07:27:35,093 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
 Sometimes a non-zero handoff will be immediately followed by an off 
 schedule phantom handoff to the endpoint the phantom had been using before:
  INFO [HintedHandoff:1] 2012-12-12 21:47:39,335 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 21:57:39,335 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 22:07:43,319 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 1416 rows to endpoint /10.10.10.4
  INFO [HintedHandoff:1] 2012-12-12 22:07:43,320 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
  INFO [HintedHandoff:1] 2012-12-12 22:17:39,357 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
  INFO [HintedHandoff:1] 2012-12-12 22:27:39,337 HintedHandOffManager.java 
 (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
 The first few entries from one of the json files:
 {
 0aaa: {
 ccf5dc203a2211e2e154da71a9bb: {
 deletedAt: -9223372036854775808, 
 subColumns: []
 }, 
 ccf603303a2211e2e154da71a9bb: {
 deletedAt: -9223372036854775808, 
 subColumns: []
 }, 
