I think your best bet is to do a force-replace (and then a manual repair, if 
you are not using AAE) with a node that has higher capacity than your current 
standby.  You are correct that replacing with your standby is likely to fail: 
when you run the repairs, that node will run out of space as well.
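
For reference, the force-replace itself goes through the staged clustering 
commands in riak-admin; a rough sketch only, where the node names are 
placeholders for your actual failed node and the new larger node:

    # on the new node, once Riak is installed and configured:
    riak-admin cluster join riak@good-node.example.com

    # then, from any running cluster member:
    riak-admin cluster force-replace riak@failed-node.example.com riak@new-node.example.com
    riak-admin cluster plan      # review the proposed ownership changes
    riak-admin cluster commit

Double-check the exact invocation against the docs for your version before 
committing the plan.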

I think you do NOT want to do a remove (or force remove) of the node that has 
run out of space, as you will likely just end up pushing those partitions to 
other nodes that are already stressed for capacity.
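
(If you want to see for yourself where a remove would push things, the staged 
commands let you preview and then throw away the plan; the node name below is 
a placeholder:

    riak-admin cluster leave riak@full-node.example.com   # or: cluster force-remove
    riak-admin cluster plan    # shows which nodes would pick up the partitions
    riak-admin cluster clear   # abort without committing anything

That at least shows where the extra load would land.)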

I would not tempt fate by doing an add on a cluster with a node that is down 
for the count.  Fix that node first, and then add capacity to the cluster.  But 
others here might have more experience with expanding clusters that are in ill 
health.  For example, it might be possible to do a replace without a manual 
repair (which will leave a bunch of near-empty partitions on your new node), 
and then do an add with the node you took out of the cluster.  You would then 
need to track down where all the partitions got moved to in the cluster, and 
do a manual repair of those partitions.  (Or I suppose if you are using AAE, 
wait for a full set of exchanges to complete, but if it were me I'd turn off 
AAE and run the repairs manually, so you can monitor which partitions actually 
got repaired.)
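
For the manual repairs themselves, the usual route is from `riak attach` on 
the node that now owns the partitions in question; a rough sketch (it assumes 
your backend supports repair, and it repairs everything the node owns rather 
than a hand-picked list):

    %% from `riak attach` on the node in question
    {ok, Ring} = riak_core_ring_manager:get_my_ring().
    %% partitions owned by this node
    Partitions = [P || {P, N} <- riak_core_ring:all_owners(Ring), N =:= node()].
    %% kick off a repair of each, pulling data back from the adjacent replicas
    [riak_kv_vnode:repair(P) || P <- Partitions].

And if you do want AAE off while you run those, I believe the riak.conf knob 
is `anti_entropy = passive` (set per node, restart required), but check the 
docs for your version.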

If you are moving to the latest version of Riak (2.2.6), then you may find that 
the v2 claim algorithm does a better job of distributing partition ownership 
evenly around the cluster, but you should upgrade all your nodes first (before 
running your cap add).
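
If you end up wanting to pin the claim algorithm explicitly rather than take 
the default, I believe that is done through the riak_core settings in 
advanced.config; a sketch only, so verify against the 2.2.6 docs before 
relying on it:

    %% advanced.config (per node)
    [
      {riak_core, [
        {wants_claim_fun,  {riak_core_claim, wants_claim_v2}},
        {choose_claim_fun, {riak_core_claim, choose_claim_v2}}
      ]}
    ].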

Also make use of `riak_core_claim_sim:help()` in the Riak console.  You can get 
pretty good insight into what your cap adds will look like, without having to 
do a cluster plan.  As Drew has mentioned, you might not want to try the 
claim_v3 algorithms if you are on an already under-provisioned cluster, as 
claim_v3 may move partitions around to already heavily-loaded nodes as part of 
its ownership handoff.  In your case that could be catastrophic: you currently 
have keys that have only two replicas (and presumably are already growing in 
size with secondary partitions), so losing a node or two in the rest of your 
cluster due to disk space could result in data loss (4eva).
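
Getting at the simulator is just a matter of attaching to a node's console and 
starting from the built-in help, e.g. (the node prompt here is illustrative):

    $ riak attach
    (riak@node1)1> riak_core_claim_sim:help().

From there the help output describes how to run simulations of your planned 
adds against the current ring.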

-Fred

PS> I see Sick has posted a temporary workaround if you can add extra disk 
capacity to your node (NAS, etc.), and that seems like a good route to go if it 
is an option.  Otherwise, I'd find a machine with more disk and do the replace 
as described above.

> On Nov 1, 2018, at 9:25 AM, Travis Kirstine 
> <tkirst...@firstbasesolutions.com> wrote:
> 
> I’m running riak (v2.14) in a 5 node cluster and for some reason one of the 
> nodes has higher disk usage than the other nodes.  The problem seems to be 
> related to how riak distributes the partitions.  In my case I’m using the 
> default 64; riak has given each node 12 partitions except one node that gets 
> 16 (4x12+16=64).  As a result the node with 16 partitions has filled the disk 
> and become ‘unreachable’.
>  
> I have a node on standby with roughly the same disk space as the failed node; 
> my concern is that if I add it to the cluster it will overflow as well.
>  
> How do I recover the failed node and add a new node without destroying the 
> cluster… BTW, just to make things more fun, the new node is at a newer 
> version of riak so I need to perform a rolling upgrade at the same time.
>  
> Any help would be greatly appreciated!    