Hello everyone,

We've got a very interesting problem. We're hosting our 5-node cluster on EC2, running Ubuntu 10.04 LTS (Lucid Lynx) Server 64-bit <http://aws.amazon.com/amis/4348> on m2.xlarge instances, and over the past 5 days two of the EC2 servers have randomly restarted on us. We've checked the logs and found nothing indicating why they restarted. One second they were happily logging and the next second the server was in the process of rebooting. This is particularly bad because every time a node comes back up we get merge errors due to an existing bug in Riak and have to restore from a recent backup.

Just today we noticed that the EC2 servers did not have swap enabled (apparently the norm for xlarge+ instances), which we thought might be our problem. My knowledge of what happens when swap is off is pretty poor, but I have been told that the Linux OOM killer should still be invoked and start trying to kill processes, rather than the server simply restarting. Is that correct? Also, how would Riak hypothetically handle swap being off on a system? We're using Bitcask, if that helps.

Secondly, one of our ops guys here thinks the issue might be related to a bug <http://ubuntuforums.org/showthread.php?t=1436497> (?) that other Ubuntu users on the same version seem to have. In fact, we do see the same "INFO: task cron:15047 blocked for more than 120 seconds" line in our log file. We're also running an AMI that isn't the official one from Canonical, so the thinking is that moving to the official AMI would help.

If we do want to upgrade, it will mean moving each cluster node to new hardware. I wanted to ask the list to make sure we are doing it correctly. Here is the plan to transfer a node to new hardware. Note that these steps will be done on one node at a time, and we'll make sure the cluster has stabilized after each node before moving on to the next one.

1. Stop riak on old server.
2. Copy data directory (including bitcask, mr_queue and ring folders) to a shared location.
3. Shut down old server.
4. Boot new replacement server, installing (but not starting) Riak.
5. Transfer data directory from shared location to data folder on new node.
6. Start riak.

My main concern is whether the ring state will transfer to a new node safely, assuming the new server has the same hostname and node name as the old server. The new server will have a different IP address, but all the node names in our cluster use hostnames, and those will not be changing.
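
To make the copy in steps 2 and 5 concrete, here is a minimal Python sketch of what we have in mind. It assumes Riak is already stopped on the source node. The function names and any paths passed in are our own placeholders, not anything Riak provides. The check only compares file names and sizes, not contents, so it is a sanity check rather than proof of a good copy.

```python
import shutil
from pathlib import Path

# Subfolders named in the plan; anything else in the data dir is copied too.
EXPECTED_SUBDIRS = ("bitcask", "mr_queue", "ring")


def file_sizes(root: Path) -> dict:
    """Map each file's path (relative to root) to its size in bytes."""
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def copy_data_dir(src: str, dst: str) -> int:
    """Copy a stopped node's data directory and sanity-check the copy.

    Returns the number of files copied. Raises if an expected subfolder
    is missing or if file names/sizes differ after the copy.
    """
    src_path, dst_path = Path(src), Path(dst)

    missing = [d for d in EXPECTED_SUBDIRS if not (src_path / d).is_dir()]
    if missing:
        raise FileNotFoundError(f"missing subfolders in {src}: {missing}")

    # copytree refuses to write into an existing destination, which guards
    # against mixing two nodes' data in one folder.
    shutil.copytree(src_path, dst_path)

    before, after = file_sizes(src_path), file_sizes(dst_path)
    if before != after:
        raise RuntimeError("copied data does not match source")
    return len(after)
```

We would run it once on the old server (data directory to shared location) and once on the new server (shared location to data directory), for example `copy_data_dir("/path/to/riak/data", "/path/to/shared/node1")`, where both paths are hypothetical.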
