Vassil Hristov created CASSANDRA-4274:
-----------------------------------------

             Summary: Cassandra cluster becomes very slow and loses data after 
node failure
                 Key: CASSANDRA-4274
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4274
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 1.0.8
         Environment: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41) 
(b...@decadent.org.uk) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Mon Jan 16 
16:22:28 UTC 2012
Debian GNU/Linux 6.0
Cassandra 1.0.8 Debian package
            Reporter: Vassil Hristov


Hi,

in a nutshell: today we experienced a problem with one of our clusters. Our 
application became very slow, and the slowdown turned out to be caused by 
Cassandra. Additionally, some data was not persisted properly. Rebooting one of 
the nodes fixed the problem in the sense that the application is responsive 
again and data is now written properly; the lost data was not recovered.

Now some more details.

The setup: we have 2 nodes, running Cassandra 1.0.8. In our java application, 
we use Hector to connect to the nodes. We store some log data in cassandra. The 
relevant method looks like this:
{code}
storeMessage(mutator, key, message, ttl);
storeMessageInIndex(mutator, key, message, ttl);
{code}
In the first method, the entire message is stored in the column family 
cfMainData under the provided key, and in the second we maintain a manual 
index, which is stored in a different column family (cfDateOrderedMessages) 
under the same key.
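For reference, the write path described above amounts to two mutations under the same key: one for the full message, one for the manual index. A minimal, self-contained sketch of that pattern (plain Java maps standing in for the two column families; Hector, the mutator, and the TTL plumbing are omitted, and all names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the two-write pattern: cfMainData holds the full message,
// cfDateOrderedMessages is the manual index, both keyed identically.
public class MessageStore {
    final Map<String, String> cfMainData = new HashMap<>();
    final Map<String, String> cfDateOrderedMessages = new HashMap<>();

    void storeMessage(String key, String message) {
        // Full message body goes into the main column family.
        cfMainData.put(key, message);
    }

    void storeMessageInIndex(String key, String message) {
        // Manual index entry under the same key; in the real application
        // this is a separate column family maintained alongside the data.
        cfDateOrderedMessages.put(key, message);
    }

    public static void main(String[] args) {
        MessageStore store = new MessageStore();
        store.storeMessage("X", "payload");
        store.storeMessageInIndex("X", "payload");
        // The reported failure mode was the inverse of this: the index
        // lookup for key X succeeded while cfMainData[X] returned nothing.
        System.out.println(store.cfDateOrderedMessages.containsKey("X"));
        System.out.println(store.cfMainData.containsKey("X"));
    }
}
```

Because the two writes are independent mutations, a node-level failure between them can leave the index pointing at a key with no main data, which matches the inconsistency observed below.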

The problem: our support reported that certain operations take extremely long 
(200+ seconds, compared to the usual <1 second). According to {{nodetool ring}} 
both nodes were up and running. When we checked the data, some of it was not 
accessible. What is really odd, though, is that the index was maintained 
properly while the main data was missing. That is, the index would hold a key 
X, but cfMainData[X] would return no results.

After the restart of one of the nodes (192.186.1.7 in the log references), 
everything went back to normal and has been working correctly since.

I am well aware that you will most likely not be able to reproduce the problem 
(we cannot either). However, perhaps you can figure out why the 'broken' node 
was not marked as down. The behaviour I would have expected is that all writes 
fail, since quorum cannot be reached. The result would again be lost data, but 
that would at least be consistent behaviour.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira