Which JDK are you using? We've had similar problems with jdk1.6u22 on Ubuntu 10.04 in Amazon EC2. Nodes would lock up for 20 to 40 minutes or more.
We haven't run any conclusive tests yet, but we haven't seen the same problems since downgrading to jdk1.6u16.

-brent

On Mon, Jan 10, 2011 at 12:59 PM, Wayne <wav...@gmail.com> wrote:
> We had a node last night go awol and got stuck in permanent 50% CPU wait
> time. The node also steadily shot up the load to 400 before we saw it and
> had to hard reboot. Besides that all other Ganglia metrics flat-lined. Is
> this some sort of bizarre kernel problem? We are using xfs with std
> settings. I have seen a few postings talk about bizarre problems like this.
> Can XFS be blamed or is it more kernel related? Is there a posting somewhere
> suggesting the best file system settings? Are there recommended settings for
> using CentOS 5.5? We have a 10 nodes cluster we have been pounding for weeks
> and we can't seem to keep all ten nodes up for a 24 hour period. I am hoping
> there is a lower level problem causing much of it.
>
> Thanks.
>