Hi Travis,
I see you have encountered the disk space pickle. Just for the record, the 
safest way to run Riak in production is to keep all resources (CPU, RAM, 
network, disk space, I/O, etc.) below 70% utilization on all nodes at all 
times. The reason is to leave headroom for when one or more nodes go down: the 
remaining nodes have to carry all the load of the offline node(s) on top of 
their existing load, and therefore need sufficient free resources available to 
do so. If you are running right at the limit for any resource, you should 
expect issues like this, or worse, on a regular basis.

Potential initial prevention
When you join all the nodes together to form your cluster, you get to run 
`riak-admin cluster plan` and it will show you how things will turn out. If you 
like the plan, run `riak-admin cluster commit` and partitions will be moved 
around accordingly. If not, you can clear the plan and generate a new one, 
repeating until you are happy with the distribution. With an unfortunate 
combination of ring size and node count, one node can end up with its fair 
share of partitions and then some, but a re-plan will often make the spread as 
even as possible.
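
The plan/commit loop looks roughly like this (node names here are examples, 
not from your cluster):

```shell
# On each node being joined (riak@node1.example.com is a placeholder):
riak-admin cluster join riak@node1.example.com

# Review the proposed partition distribution before anything moves:
riak-admin cluster plan

# Happy with the distribution? Commit it; otherwise clear it and re-plan:
riak-admin cluster commit
# riak-admin cluster clear
```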

Temporary escape from current situation
Before beginning here, let me mention that this method does have the potential 
to go horribly wrong, so proceed at your own risk. For the server with the 
filled disk, you can follow the method below:

  *   Stop Riak
  *   Attach additional storage (USB, additional disks, NAS, whatever)
  *   Copy partitions from the data directory (presumably bitcask) to the 
additional storage
  *   Once the copy has completed, delete those partitions from the node's 
regular hard disk
  *   Create a symlink at the original location pointing to the copy on the 
external storage
  *   Repeat until you have freed up sufficient disk space (new data may still 
be written here, so leave plenty of headroom)
  *   Start Riak
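
For illustration only, the copy-and-symlink steps can be sketched with 
throwaway directories standing in for the real paths (your actual bitcask data 
directory and mount point will differ):

```shell
# Sketch of the copy-and-symlink rescue using temporary directories.
# A real bitcask data dir lives somewhere like /var/lib/riak/bitcask/<partition_id>.
set -e

DATA_DIR=$(mktemp -d)      # stands in for the node's bitcask data directory
EXT_DIR=$(mktemp -d)       # stands in for the mounted external storage
PARTITION=0                # stands in for a partition id

mkdir -p "$DATA_DIR/$PARTITION"
echo "demo data" > "$DATA_DIR/$PARTITION/1.bitcask.data"

# 1. Copy the partition to the external storage, preserving attributes
cp -a "$DATA_DIR/$PARTITION" "$EXT_DIR/"

# 2. Delete the original only after the copy has completed
rm -rf "$DATA_DIR/$PARTITION"

# 3. Symlink the original location to the copy on the external storage
ln -s "$EXT_DIR/$PARTITION" "$DATA_DIR/$PARTITION"

# Reads through the original path now follow the symlink
cat "$DATA_DIR/$PARTITION/1.bitcask.data"   # prints "demo data"
```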

The above should bring your server back in touch with the cluster. Monitor 
transfers and once they have all finished, add your new node to the cluster. 
After this new node has been added and all transfers have finished, take the 
previously full node offline and reverse the steps above until you are able to 
remove the additional storage.
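
To monitor the transfers mentioned above, these two commands cover it (run on 
any cluster member):

```shell
# Watch handoff progress; re-run until it reports no active transfers:
riak-admin transfers

# Confirm ring ownership has settled evenly across the nodes:
riak-admin member-status
```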

Note: running a mixed-version cluster for a prolonged period of time is not 
recommended in production. My preference would be to install the same version 
of Riak on the new node, go through the above, and then look at upgrading the 
cluster once everything is stable.

Good luck,

Nicholas
From: riak-users <riak-users-boun...@lists.basho.com> On Behalf Of Travis 
Kirstine
Sent: 01 November 2018 22:25
To: riak-users@lists.basho.com
Subject: High disk usage on node

I'm running riak (v2.14) in a 5 node cluster and for some reason one of the 
nodes has higher disk usage than the others.  The problem seems to be related 
to how riak distributes the partitions; in my case I'm using the default 64, 
and riak has given each node 12 partitions except one node that gets 16 
(4x12+16=64).  As a result the node with 16 partitions has filled its disk and 
become unreachable.

I have a node on standby with roughly the same disk space as the failed node; 
my concern is that if I add it to the cluster it will overflow as well.

How do I recover the failed node and add a new node without destroying the 
cluster? BTW, just to make things more fun, the new node is at a newer version 
of riak so I need to perform a rolling upgrade at the same time.

Any help would be greatly appreciated!
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
