[ 
https://issues.apache.org/jira/browse/CASSANDRA-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vassil Hristov updated CASSANDRA-4274:
--------------------------------------

    Attachment: catalina.2012-05-23.log
                system.192.168.1.7.log
                output.192.168.1.7.log
                system.192.168.1.8.log
                output.192.168.1.8.log
    
> Cassandra cluster becomes very slow and loses data after node failure
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-4274
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4274
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.0.8
>         Environment: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41) 
> (b...@decadent.org.uk) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Mon Jan 
> 16 16:22:28 UTC 2012
> Debian GNU/Linux 6.0
> Cassandra 1.0.8 Debian package
>            Reporter: Vassil Hristov
>         Attachments: catalina.2012-05-23.log, output.192.168.1.7.log, 
> output.192.168.1.8.log, system.192.168.1.7.log, system.192.168.1.8.log
>
>
> Hi,
> in a nutshell: today we experienced a problem with one of our clusters. Our 
> application became very slow, and the cause turned out to be Cassandra. 
> Additionally, some data was not persisted properly. Rebooting one of the 
> nodes fixed the problem, in that the application is responsive again and 
> data is written properly, but the lost data was not recovered.
> Now some more details.
> The setup: we have 2 nodes, running Cassandra 1.0.8. In our Java application, 
> we use Hector to connect to the nodes. We store some log data in Cassandra. 
> The relevant method looks like this:
> {{  storeMessage(mutator, key, message, ttl);
>   storeMessageInIndex(mutator, key, message, ttl);}}
> In the first method, the entire message is stored in the column family 
> cfMainData under the provided key, and in the second we maintain a manual 
> index, which is stored in a different column family (cfDateOrderedMessages) 
> under the same key.
> The problem: our support team reported that certain operations took extremely 
> long (200+ seconds, compared to the usual <1 second). According to {{nodetool 
> ring}}, both nodes were up and running. When we checked the data, some of it 
> was not accessible. What's really odd, though, is that the index was 
> maintained properly while the main data was missing. That is, the index would 
> hold a key X, but cfMainData[X] would return no results.
> After the restart of one of the nodes (192.168.1.7 in the attached logs), 
> everything went back to normal and all is now working correctly. 
> I am well aware that it's very likely that you won't be able to reproduce the 
> problem (we cannot either). However, maybe you'll figure out why the 'broken' 
> node wasn't marked as such. The behaviour I would have expected is that all 
> writes would fail, since quorum cannot be reached. The result would again be 
> lost data, but it would be a more consistent behaviour. 
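
For reference, the expectation in the last paragraph can be checked with the usual quorum arithmetic, floor(RF / 2) + 1. The sketch below is illustrative only: the report states neither the replication factor nor the consistency level in use, so RF = 2 and QUORUM are assumptions, and the class and method names are hypothetical.

```java
// Sketch only: the report does not state the replication factor (RF) or the
// consistency level; RF = 2 and QUORUM are assumptions made for illustration.
class QuorumSketch {

    /** Replicas that must acknowledge a QUORUM operation: floor(RF / 2) + 1. */
    static int quorum(int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replication factor must be >= 1");
        }
        return replicationFactor / 2 + 1; // integer division gives the floor
    }

    /** True if a QUORUM operation can succeed with this many live replicas. */
    static boolean quorumReachable(int replicationFactor, int liveReplicas) {
        return liveReplicas >= quorum(replicationFactor);
    }

    public static void main(String[] args) {
        int rf = 2; // assumed: both nodes hold a replica of every row
        System.out.println("quorum for RF=" + rf + ": " + quorum(rf));
        System.out.println("reachable with 2 live replicas: " + quorumReachable(rf, 2));
        System.out.println("reachable with 1 live replica: " + quorumReachable(rf, 1));
    }
}
```

Under these assumptions, a single unresponsive replica is enough to make QUORUM writes fail, which matches the behaviour the reporter expected. A write at consistency level ONE needs only one replica to acknowledge it, so it could still be accepted by the remaining node.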

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
