Hi Peter,

We were logging the GC output as per this before and have since taken it out, but I think we'll put it back in.

Apropos logging: I've found that with RMI to our boxes at EC2 I've had to do the ugly thing with this:

    -Djava.rmi.server.hostname=<external / public IP address>

which then renders nodetool useless, as it can't talk to localhost or the internal IP address / hostname any more.

The clocks: I'm pretty sure these are very accurate, but I will investigate this tomorrow morning just in case there's some drift happening.

We think we might have cracked the underlying problem though, and it might be similar to the 'behind the scenes swap thing'. (Sadly I suspect that such things might actually be happening. I thought that memory overcommit wasn't possible with Xen, only with VMware, but I guess they could have done all kinds of things with Xen by now over there.)

There's a spinlock problem that's been identified elsewhere, where the JVM mis-detects the number of cores it is running on, based on the underlying architecture. So we've reverted to parallel GC and forced the number of threads:

    -XX:+UseParallelGC -XX:MaxGCPauseMillis=100 -XX:ParallelGCThreads=3

It *seems* to be working a bit better at the moment, but I'll be more comfortable feeling optimistic after a night's worth of jobs have been thrown at it :)

j.
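
P.S. In case it's useful to anyone following along: a rough sketch for seeing what core count the JVM itself reports on a given box, and which collectors end up active with the flags above. The class name JvmCpuCheck is made up for illustration; the calls are standard java.lang and java.lang.management APIs. It only prints what the JVM believes, so treat it as a sanity check and not as a diagnosis of the spinlock issue.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Cassandra or nodetool.
class JvmCpuCheck {

    // Number of processors the JVM believes are available to it.
    static int detectedCores() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Names of the garbage collectors registered in this JVM.
    static List<String> gcNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Cores seen by JVM: " + detectedCores());
        System.out.println("Active collectors: " + gcNames());
    }
}
```

Running it with the same -XX options as the node (for example `java -XX:+UseParallelGC -XX:ParallelGCThreads=3 JvmCpuCheck`) should make it easier to compare the reported core count against what the instance type is supposed to provide.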