[ https://issues.apache.org/jira/browse/CASSANDRA-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vassil Hristov updated CASSANDRA-4274:
--------------------------------------
    Attachment: catalina.2012-05-23.log
                system.192.168.1.7.log
                output.192.168.1.7.log
                system.192.168.1.8.log
                output.192.168.1.8.log

> Cassandra cluster becomes very slow and looses data after node failure
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-4274
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4274
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.0.8
>         Environment: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
> (b...@decadent.org.uk) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Mon Jan
> 16 16:22:28 UTC 2012
> Debian GNU/Linux 6.0
> Cassandra 1.0.8 Debian package
>            Reporter: Vassil Hristov
>         Attachments: catalina.2012-05-23.log, output.192.168.1.7.log,
> output.192.168.1.8.log, system.192.168.1.7.log, system.192.168.1.8.log
>
>
> Hi,
> In a nutshell: today we experienced a problem with one of our clusters. Our
> application became very slow, and the cause turned out to be Cassandra.
> Additionally, some data was not persisted properly. Rebooting one of the
> nodes fixed the problem in the sense that the application is responsive
> again and data is written properly, but the lost data was not recovered.
> Now some more details.
> The setup: we have 2 nodes running Cassandra 1.0.8. Our Java application
> uses Hector to connect to the nodes. We store some log data in Cassandra.
> The relevant method looks like this:
> {{ storeMessage(mutator, key, message, ttl);
> storeMessageInIndex(mutator, key, message, ttl);}}
> The first method stores the entire message in the column family
> cfMainData under the provided key. The second maintains a manual
> index, which is stored in a different column family (cfDateOrderedMessages)
> under the same key.
> The problem: our support team reported that certain operations were taking
> extremely long (200+ seconds, compared to the usual <1 second).
> According to {{nodetool ring}}, both nodes were up and running. When we
> checked the data, some of it was not accessible. What's really odd, though,
> is that the index was maintained properly while the main data was missing.
> That is, the index would hold a key X, but cfMainData[X] would return no
> results.
> After the restart of one of the nodes (192.168.1.7 for the log reference),
> everything went back to normal, and all is now working correctly.
> I am well aware that you very likely won't be able to reproduce the
> problem (we cannot either). However, maybe you'll figure out why the 'broken'
> node wasn't marked as such. The behaviour I would have expected is that all
> writes would fail, since quorum cannot be reached. The result would again be
> lost data, but the behaviour would be more consistent.
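
The expectation that all writes would fail rests on quorum arithmetic, which can be sketched as follows. This is only an illustration under assumptions the report does not state: a replication factor of 2 and writes issued at the QUORUM consistency level. The class and method names are hypothetical and are not part of Hector or Cassandra.

```java
// Hypothetical helper illustrating Cassandra's quorum rule:
// quorum = (replication factor / 2) + 1, using integer division.
class QuorumSketch {

    // Number of replicas that must acknowledge a QUORUM read or write.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True if a QUORUM operation can succeed with this many live replicas.
    static boolean quorumReachable(int replicationFactor, int liveReplicas) {
        return liveReplicas >= quorum(replicationFactor);
    }

    public static void main(String[] args) {
        int replicationFactor = 2; // assumption: not stated in the report

        System.out.println("quorum for RF=2: " + quorum(replicationFactor));
        System.out.println("reachable with 2 live replicas: "
                + quorumReachable(replicationFactor, 2));
        System.out.println("reachable with 1 live replica: "
                + quorumReachable(replicationFactor, 1));
    }
}
```

Under these assumptions, a two-node cluster needs both replicas for every QUORUM operation, so a node that is unresponsive but still reported as up would be expected to cause timeouts or failed writes rather than silently missing rows.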