Peter, my apologies for the slow response; we had to divert to a 'Plan B' approach last week involving MySQL, memcache, redis and various other uglies.
On 20 September 2010 23:11, Peter Schuller <peter.schul...@infidyne.com> wrote:
> Are you running an old JVM by any chance? (Just grasping for straws.)

The JVM is Sun's 1.6. I've been caught out once before by OpenJDK's performance problems, so I'm particularly careful about this now.

> Hmm. I can see useless spinning decreasing efficiency, but the numbers
> from your log are really extreme. Do you have a URL / bug id or
> anything that one can read up on about this?

We've rebuilt the Cassandra cluster this week and avoided Hadoop entirely. Partly this reduces the variables in play, and partly it looks like we'll only need two 'feeder' nodes for our jobs, given the size of Cassandra cluster we're likely to end up with (10-12 nodes or so). Any higher ratio of feeders to Cassandra nodes seems, on EC2 at least, to cause too many failures on the Cassandra side.

Actually, here's a question: do you think it's 'acceptable' for GC to take out a small number of your nodes at a time, so long as the bulk of the nodes are okay (or at least so long as RF is greater than the number of nodes stalled in a stop-the-world GC)? I suspect this question is peculiar to Amazon EC2, as I've never seen a box rendered non-communicative by a single core flat-lining.

By the end of this week we hope to have a better idea (mind you, I've thought that for the past 5 weeks of experimenting). If I'm back to square one at that point, I'll start pastebin-ing some logs and configs.

Increasingly, I'm convinced that many of these problems would be solved if we hosted our own servers.

cheers,
Jedd.