[ https://issues.apache.org/jira/browse/CASSANDRA-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082122#comment-13082122 ]

Boris Yen commented on CASSANDRA-3006:
--------------------------------------

To make this issue easier to reproduce, I document below, step by step, how I 
recreate it.

1. clean anything inside /var/lib/cassandra on node 172.17.19.151.

2. start cassandra on node 172.17.19.151.

3. clean anything inside /var/lib/cassandra on node 172.17.19.152.

4. modify the cassandra.yaml of 172.17.19.152 and add 172.17.19.151 as a seed.
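
For completeness, the relevant part of cassandra.yaml on 172.17.19.152 looks 
like this (the 0.8-style SimpleSeedProvider; everything else is left at its 
default):

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "172.17.19.151"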

5. start cassandra on node 172.17.19.152. I could see that the two nodes had 
formed a cluster; I also double-checked that using nodetool.
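
The nodetool check is just (run against either node, default JMX port):

nodetool -h 172.17.19.151 ring

Both 172.17.19.151 and 172.17.19.152 should be listed as Up/Normal.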

6. on node 172.17.19.151, I use cassandra-cli to connect to 
172.17.19.151/9160 and execute the following commands:

create keyspace test
with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
and strategy_options = [{datacenter1:2}];

create column family testCounter
    with column_type = Super
    and default_validation_class = CounterColumnType
    and replicate_on_write = true
    and comparator = BytesType
    and subcomparator = BytesType
    and comment = 'APP status information.';
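
For reference, the cli session above is started with the standard host/port 
options:

cassandra-cli -h 172.17.19.151 -p 9160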

7. use the test program to increment the counter 1000 times. Between 
increments the program pauses for 50 milliseconds.

8. in the middle of the incrementing process, shut down cassandra on node 
172.17.19.152 (say, when the count is at 200). Because the test program 
switches the consistency level to One when it encounters an exception (a 
timeout exception, to be exact), the subsequent increments still succeed. A 
sketch of the test program is included below.
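
In case it helps, here is a minimal sketch of what my test program does, 
written against Hector's 0.8-era API. The cluster name and the row key 
"testKey", super column "sc" and column "column" are just the names my 
program happens to use, and the exact HFactory/Mutator signatures may differ 
slightly between Hector versions:

import java.util.Arrays;

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.exceptions.HTimedOutException;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CounterRepro {
    public static void main(String[] args) throws InterruptedException {
        StringSerializer ss = StringSerializer.get();
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster",
                "172.17.19.151:9160");

        // Writes start at Quorum; the policy object is mutable, so it can
        // be downgraded to One later without recreating the keyspace.
        ConfigurableConsistencyLevel policy = new ConfigurableConsistencyLevel();
        policy.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);
        Keyspace keyspace = HFactory.createKeyspace("test", cluster, policy);

        int success = 0;
        for (int i = 0; i < 1000; i++) {
            try {
                // Increment testCounter['testKey']['sc']['column'] by 1.
                Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
                mutator.insertCounter("testKey", "testCounter",
                        HFactory.createCounterSuperColumn("sc",
                                Arrays.asList(HFactory.createCounterColumn(
                                        "column", 1L, ss)),
                                ss, ss));
                success++;
            } catch (HTimedOutException e) {
                // One replica is down: fall back to consistency level One
                // so the remaining increments succeed. Hector also retries
                // the timed-out write itself, which is how the stored value
                // ends up at 1001 instead of 1000.
                policy.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);
            }
            Thread.sleep(50); // pause 50 ms between increments
        }
        System.out.println("success counter: " + success);
    }
}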

9. wait for the overall process to complete. I saw "success counter: 999" 
because of the one timeout exception.

10. use cassandra-cli to connect to 172.17.19.151 and 172.17.19.152 and check 
the counter value; it is 1001 on both nodes. It shows 1001 because hector 
retries when it encounters the timeout exception (the timed-out increment was 
in fact applied, and the retry applied it a second time).
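
The check itself is something like the following in cassandra-cli (using the 
same row key and column names as in the sketch above; because the key and 
comparators are BytesType, I first tell the cli to interpret them as utf8):

use test;
assume testCounter keys as utf8;
assume testCounter comparator as utf8;
assume testCounter sub_comparator as utf8;
get testCounter['testKey']['sc'];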

11. shut down cassandra on 172.17.19.151 and wait a few seconds; I saw 
"InetAddress /172.17.19.151 is now dead" on node 172.17.19.152.

12. after seeing "InetAddress /172.17.19.151 is now dead", restart cassandra 
on node 172.17.19.151.

13. check the counter again with cassandra-cli on both nodes. This time the 
counter should no longer be 1001; it will be some other absurd number (in my 
original run it was 481387).

I hope someone else can recreate it by following these steps.

> Enormous counter 
> -----------------
>
>                 Key: CASSANDRA-3006
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3006
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.8.3
>         Environment: ubuntu 10.04
>            Reporter: Boris Yen
>            Assignee: Sylvain Lebresne
>
> I have a two-node cluster with the following keyspace and column family 
> settings.
> Cluster Information:
>    Snitch: org.apache.cassandra.locator.SimpleSnitch
>    Partitioner: org.apache.cassandra.dht.RandomPartitioner
>    Schema versions: 
>       63fda700-c243-11e0-0000-2d03dcafebdf: [172.17.19.151, 172.17.19.152]
> Keyspace: test:
>   Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
>   Durable Writes: true
>     Options: [datacenter1:2]
>   Column Families:
>     ColumnFamily: testCounter (Super)
>     "APP status information."
>       Key Validation Class: org.apache.cassandra.db.marshal.BytesType
>       Default column value validator: 
> org.apache.cassandra.db.marshal.CounterColumnType
>       Columns sorted by: 
> org.apache.cassandra.db.marshal.BytesType/org.apache.cassandra.db.marshal.BytesType
>       Row cache size / save period in seconds: 0.0/0
>       Key cache size / save period in seconds: 200000.0/14400
>       Memtable thresholds: 1.1578125/1440/247 (millions of ops/MB/minutes)
>       GC grace seconds: 864000
>       Compaction min/max thresholds: 4/32
>       Read repair chance: 1.0
>       Replicate on write: true
>       Built indexes: []
> Then, I use a test program based on hector to add a counter column 
> (testCounter[sc][column]) 1000 times. In the middle of the adding process, 
> I intentionally shut down node 172.17.19.152. In addition, the test program 
> is smart enough to switch the consistency level from Quorum to One, so that 
> the subsequent adding actions do not fail. 
> After all the adding actions are done, I start cassandra on 172.17.19.152 
> and use cassandra-cli to check whether the counter is correct on both 
> nodes; I get 1001, which is reasonable because hector retries once. 
> However, when I shut down 172.17.19.151, wait until 172.17.19.152 is aware 
> that 172.17.19.151 is down, start cassandra on 172.17.19.151 again, and 
> then check the counter once more, this time I get 481387, which is wildly 
> wrong.
> I used 0.8.3 to reproduce this bug, but I think it also happens on 0.8.2 
> and earlier. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
