Hi,
thanks for the detailed reply. So you are suggesting that the partition 
allocation somehow got into an inconsistent state across nodes. I'll have to 
check the logs to see if anything similar to your dump pops up.

> So I compared the ring states manually using the console, and in the
> ring state on the removed node quite a few partitions were assigned to
> different nodes than what the other nodes thought.
> After I manually synced the ring on the leaving node with the rest of
> the cluster by doing this on the console:
> 
> {ok,Ring} = rpc:call('r...@othernode', riak_core_ring_manager,
> get_my_ring, []).
> riak_core_ring_manager:set_my_ring(R).
> 

That ought to be 

riak_core_ring_manager:set_my_ring( Ring ).

right? Just verifying because my Erlang is rather rudimentary :)
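
If I read the workaround correctly, the comparison and the sync could be 
sketched roughly like this on the console of the leaving node. This is only a 
sketch under assumptions: 'riak@othernode' is a placeholder for a node that is 
still a regular cluster member, I am assuming riak_core_ring:all_owners/1 is 
available in the installed version, and both rings are assumed to have the 
same number of partitions.

```erlang
%% Placeholder node name (assumption): any node that is still in the cluster.
Other = 'riak@othernode'.

%% Fetch the local view of the ring and the view held by the other node.
{ok, Mine}   = riak_core_ring_manager:get_my_ring().
{ok, Theirs} = rpc:call(Other, riak_core_ring_manager, get_my_ring, []).

%% all_owners/1 returns a list of {PartitionIndex, OwnerNode} pairs.
MyOwners    = riak_core_ring:all_owners(Mine).
TheirOwners = riak_core_ring:all_owners(Theirs).

%% Collect every partition whose owner differs between the two views.
%% lists:zip/2 fails if the lists differ in length (different ring sizes).
Diff = [{Idx, MyOwner, TheirOwner}
        || {{Idx, MyOwner}, {_, TheirOwner}} <- lists:zip(MyOwners, TheirOwners),
           MyOwner =/= TheirOwner].

%% Only if Diff is non-empty: adopt the other node's ring locally,
%% as in the quoted workaround (note: Theirs, not an unbound variable).
riak_core_ring_manager:set_my_ring(Theirs).
```

An empty Diff would mean both nodes agree on partition ownership, in which 
case the last call should not be needed.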

> Also riak-admin ringready will not recognize this problem, as far as I
> read the code, because only the ring states of the current ring members
> are compared. I haven't tried it, because I am still on 0.12.0.
> The same is apparently true for riak-admin transfers, which might tell
> you that there are no handoffs left, even if the removed node still has
> data.

I'm running 0.13.0, so if we're stumbling over the same cause it's still there.

> 
> I discovered another problem while debugging this. If you restart a node
> that you removed from the cluster and that still has data (or if it
> crashes), it won't start handing off its data afterwards. The reason is
> that the node watcher also does not get notified that the other nodes are
> up, and so all of them are considered down. This also can only be worked
> around manually via the Erlang console.

Why would that have to be worked around at all? My understanding is that, 
thanks to the data replication within the ring, a single node meeting a messy 
and fatal accident shouldn't destabilize the entire ring. The nodes which 
contain the replicated data would just take over until a replacement node gets 
added and the dead node is removed (ok, via console).

So this still leaves me with some of my original questions open:
>> 
>> 1. What would normally trigger a rebalancing of the nodes? 
>> 2. Is there a way to manually trigger a rebalancing?
>> 3. Did I do anything wrong with the procedure described above to be left in 
>> the current odd state by riak?

Regards,
Sven

------------------------------------------
Scoreloop AG, Brecherspitzstrasse 8, 81541 Munich, Germany, www.scoreloop.com
[email protected]

Registered office: Munich, Register court: Amtsgericht München, HRB 174805 
Executive Board: Dr. Marc Gumpinger (Chairman), Dominik Westner, Christian van 
der Leeden, Chairman of the Supervisory Board: Olaf Jacobi 
