>>> "ar" == Armon Dadgar <[email protected]> wrote:
ar> All the nodes appeared to have been blocked trying to talk to
ar> riak001, which was the ring claimant at the time. Doing this
ar> seems to have cleared the state enough for the cluster to make
ar> progress again.

Armon, it's quite unlikely that the ring claimant was doing anything special, because the claimant only acts when cluster membership changes. It's far more likely that riak001 was busy doing a set of LevelDB compactions.

There have been a number of recent changes to reduce the worst-case time that LevelDB compaction can block the Erlang process schedulers. A blocked scheduler blocks *everything*, including the keep-alives that are sent between Erlang nodes. The longest LevelDB-related stoppage that I've seen was 7.5 minutes. :-( When that happens on a node X, all of the other nodes will complain (almost simultaneously) that node X is down. It isn't *down*, it's just reallyreallyreally slow to respond to messages ... which is effectively the same as being down.

Checking for big LevelDB compaction storms is pretty easy using DTrace or SystemTap, but you're probably not running a kernel that has user-space SystemTap available. Failing that, there are compaction messages in the "LOG" file of each LevelDB data directory; the hassle is needing to look at all of them in parallel (a rough sketch of one way to do that is at the end of this message).

A secondary effect is visible in write ops via "iostat -x 1": the amount of data written spikes much higher than what Riak client operations alone would trigger. (Read ops would spike too, except that many of the files input to a compaction are already cached by the OS.)

Your primary keys look UID'ish. If they are not lexicographically adjacent to other keys inserted at the same time, you will cause many more LevelDB compaction events than if your keys were adjacent, e.g. by prefixing them with a wall-clock timestamp.
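To make the key point concrete, here's a minimal Python sketch (my illustration, not anything shipped with Riak) contrasting scattered UID-style keys with timestamp-prefixed ones. Keys written at the same moment under the second scheme sort next to each other, so they tend to land in the same LevelDB table files and trigger fewer overlapping compactions:

    import time
    import uuid

    def uid_key():
        # Random UIDs scatter uniformly across the keyspace.
        return uuid.uuid4().hex

    def timestamped_key():
        # A zero-padded wall-clock prefix keeps keys written at the
        # same moment lexicographically adjacent. (The "%010d" padding
        # and overall key layout are just illustrative choices.)
        return "%010d-%s" % (int(time.time()), uuid.uuid4().hex)

    print(sorted(uid_key() for _ in range(3)))          # scattered
    print(sorted(timestamped_key() for _ in range(3)))  # clustered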
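And for watching all of those LevelDB "LOG" files at once, something along these lines will do. (The /var/lib/riak/leveldb path and the "Compact..." message text are assumptions about your install and LevelDB version, so adjust to taste.)

    import glob
    import os

    DATA_DIR = "/var/lib/riak/leveldb"   # assumed default data dir

    for log_path in sorted(glob.glob(os.path.join(DATA_DIR, "*", "LOG"))):
        with open(log_path) as f:
            # LevelDB logs lines like "Compacting 4@1 + 5@2 files" and
            # "Compacted ..." as compactions start and finish.
            compactions = [line.rstrip() for line in f if "Compact" in line]
        if compactions:
            # Print each vnode directory and its last few compaction events.
            print(log_path)
            for line in compactions[-5:]:
                print("   ", line)

A compaction storm shows up as many near-simultaneous "Compacting" lines across many of the vnode directories.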
-Scott