The main issue turned out to be a bug in our code whereby we were writing a
lot of new columns to the same row key instead of to a new row key, turning
what we expected to be a skinny-rowed CF into a CF with one very, very wide
row. These writes to the single key were putting pressure on the three nodes
holding our replicas.
One of the replicas would eventually fail under the pressure, and the rest
of the cluster would try to hold hints for the bad key's writes, which would
cause the same problem on the rest of the cluster.
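
To make the pattern concrete, here is roughly what went wrong and what we
changed. The client interface and names below are made up for illustration
(they are not our actual code or any real driver API); the point is only the
difference between piling columns onto one fixed row key and giving each
value its own row key.

    import java.util.UUID;

    // "ColumnFamilyClient" is a stand-in for whatever Thrift/Hector-style
    // client is in use; it is not a real API, just enough to show the two
    // write patterns.
    interface ColumnFamilyClient {
        void insertColumn(String rowKey, String columnName, String value);
    }

    class EventWriter {
        private final ColumnFamilyClient cf;

        EventWriter(ColumnFamilyClient cf) {
            this.cf = cf;
        }

        // The bug: every write adds another column under the SAME row key,
        // so the row grows without bound and every write lands on the same
        // three replicas.
        void writeBuggy(String value) {
            cf.insertColumn("all-events", UUID.randomUUID().toString(), value);
        }

        // The fix: each write gets its own row key, so rows stay skinny and
        // the partitioner spreads the writes across the ring.
        void writeFixed(String value) {
            cf.insertColumn(UUID.randomUUID().toString(), "value", value);
        }
    }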

On Thu, Mar 22, 2012 at 1:55 AM, Thomas van Neerijnen
<t...@bossastudios.com> wrote:

> Hi
>
> I'm going with yes to all three of your questions.
>
> I found a very heavily hit index, which we have since reworked to remove
> the secondary index entirely.
> This fixed a large portion of the problem, but during the panic of the
> overloaded cluster we did the simple scaling-out trick of doubling the
> cluster; in the rush, two of the 7 new nodes accidentally ended up on EC2
> EBS volumes instead of the usual ephemeral RAID10.
> So, same error, but this time all nodes are reporting only the two
> EBS-backed nodes as down instead of the whole cluster getting weird.
> I'm rsyncing the data off the EBS volume onto an ephemeral RAID10 array as
> I type, so in the next hour or so I'll know whether this fixes the issue.
>
>
> On Wed, Mar 21, 2012 at 5:24 PM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> The node is overloaded with hints.
>>
>> I'll just grab the comments from code…
>>
>>     // avoid OOMing due to excess hints.  we need to do this check even
>>     // for "live" nodes, since we can still generate hints for those if
>>     // it's overloaded or simply dead but not yet known-to-be-dead.
>>     // The idea is that if we have over maxHintsInProgress hints in
>>     // flight, this is probably due to a small number of nodes causing
>>     // problems, so we should avoid shutting down writes completely to
>>     // healthy nodes.  Any node with no hintsInProgress is considered
>>     // healthy.
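>>
>> In other words, that check amounts to something like the sketch below
>> (the class and method names are illustrative, not the actual StorageProxy
>> code; only maxHintsInProgress and hintsInProgress come from the comment
>> above):
>>
>>     import java.util.Map;
>>     import java.util.concurrent.ConcurrentHashMap;
>>     import java.util.concurrent.TimeoutException;
>>     import java.util.concurrent.atomic.AtomicInteger;
>>
>>     // Reject a write only when the cluster-wide hint cap is exceeded AND
>>     // the target endpoint already has hints outstanding; endpoints with
>>     // no hints in progress are treated as healthy and keep taking writes.
>>     class HintThrottle {
>>         private final int maxHintsInProgress;
>>         private final AtomicInteger totalHintsInProgress = new AtomicInteger();
>>         private final Map<String, AtomicInteger> hintsInProgress =
>>                 new ConcurrentHashMap<String, AtomicInteger>();
>>
>>         HintThrottle(int maxHintsInProgress) {
>>             this.maxHintsInProgress = maxHintsInProgress;
>>         }
>>
>>         void checkCanHint(String endpoint) throws TimeoutException {
>>             AtomicInteger perEndpoint = hintsInProgress.get(endpoint);
>>             int outstanding = perEndpoint == null ? 0 : perEndpoint.get();
>>             if (totalHintsInProgress.get() > maxHintsInProgress && outstanding > 0)
>>                 throw new TimeoutException("hint capacity exhausted for " + endpoint);
>>         }
>>     }
>>
>> When that check trips, the write path gives up with the TimeoutException
>> you see in the stack trace below.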
>>
>> Are the nodes going up and down a lot? Are they under GC pressure? The
>> other possibility is that you have overloaded the cluster.
>>
>> Cheers
>>
>>
>>   -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 22/03/2012, at 3:20 AM, Thomas van Neerijnen wrote:
>>
>> Hi all
>>
>> I'm running into a weird error on Cassandra 1.0.7.
>> As my cluster's load gets heavier, many of the nodes seem to hit the same
>> error around the same time, resulting in MutationStage backing up and never
>> clearing down. The only way to recover the cluster is to kill all the nodes
>> and start them up again. The error is as below and is repeated continuously
>> until I kill the Cassandra process.
>>
>> ERROR [ReplicateOnWriteStage:57] 2012-03-21 14:02:05,099 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[ReplicateOnWriteStage:57,5,main]
>> java.lang.RuntimeException: java.util.concurrent.TimeoutException
>>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1227)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>         at java.lang.Thread.run(Thread.java:662)
>> Caused by: java.util.concurrent.TimeoutException
>>         at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:301)
>>         at org.apache.cassandra.service.StorageProxy$7$1.runMayThrow(StorageProxy.java:544)
>>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1223)
>>         ... 3 more
>>
>>
>>
>
