Re: Node Recovery Questions

2018-09-14 Thread sean mcevoy
Hi Martin, List,

Just an update to let ye know how things went and what we learned.

We did the force-replace procedure to bring the new node into the cluster
in place of the old one. I attached to the riak erlang shell and with a
little hacking was able to get all the bitcask handles and then do a
bitcask:fold/3 to count keys. This showed that only a small percentage of
all keys were present on the new node, even after the handoffs and
transfers had completed.

Following the instructions at the bottom of this page:
https://docs.basho.com/riak/kv/2.2.0/using/repair-recovery/repairs/
I attached to the erlang shell again and ran these commands (replacing the
example node name 'dev1@127.0.0.1' with our actual node name) to force
repairs on all vnodes:

{ok, Ring} = riak_core_ring_manager:get_my_ring().
Partitions = [P || {P, 'dev1@127.0.0.1'} <- riak_core_ring:all_owners(Ring)].
[riak_kv_vnode:repair(P) || P <- Partitions].

The progress was most easily monitored with "riak-admin handoff summary", and
once the repairs completed the new node had the expected number of keys.

Counting the keys is more than a bit hacky and occasionally caused a seg
fault when there was background traffic, so I don't recommend it in general.
But it did allow us to verify where the data was in our test env, and after
that we could trust the procedure without counting keys in production.
Monitoring the size of the bitcask directory is much lower resolution, but it
is at least safe; the results were similar in test & production, so it was
sufficient to verify the above procedure.

So in short, when replacing a node the force-replace procedure doesn't
actually cause data to be synched to the new node. The above erlang shell
commands do force a sync.

Thanks for the support!
//Sean.


Re: Node Recovery Questions

2018-08-09 Thread sean mcevoy
Hi Martin,
Thanks for taking the time.
Yes, by "size of the bitcask directory" I mean I did a "du -h --max-depth=1
bitcask", so I think that would cover all the vnodes. We don't use any
other backends.
Those answers are helpful, will get back to this in a few days and see what
I can determine about where our data physically lies. Might have more
questions then.
Cheers,
//Sean.



Re: Node Recovery Questions

2018-08-08 Thread Martin Sumner
Based on a quick read of the code, compaction in bitcask is performed only
on "readable" files, and the current active file for writing is excluded
from that list.  With default settings, that active file can grow to 2GB.
So it is possible that if objects had been replaced/deleted many times
within the active file, that space will not be recovered if all the
replacements amount to < 2GB per vnode.  So at these small data sizes - you
may get a relatively significant discrepancy between an old and recovered
node in terms of disk space usage.
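
For reference, that cap can be checked from the attached shell. Hedged aside:
this reads the bitcask application environment, which is where the riak.conf
bitcask.max_file_size setting should land with the default mappings:

%% Active-file size cap; {ok,2147483648} (i.e. 2GB) with default settings.
application:get_env(bitcask, max_file_size).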



RE: Node Recovery Questions

2018-08-08 Thread Martin Sumner
Sean,

Some partial answers to your questions.

I don't believe force-replace itself will sync anything up - it just
reassigns ownership (hence handoff happens very quickly).

Read repair would synchronise a portion of the data.  So if 10% of your data
is read regularly, this might explain some of what you see.

AAE should also repair your data.  But if nothing has happened for 4 days,
then that doesn't seem to be the case.  It would be worth checking the
aae-status page (
http://docs.basho.com/riak/kv/2.2.3/using/admin/riak-admin/#aae-status) to
confirm things are happening.
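
A quick way to double-check that AAE is switched on at all from the attached
shell - hedged, as the call below is from riak_kv's entropy manager in the
2.x series, so verify it against your version:

%% Whether active anti-entropy is enabled on this node (true | false).
riak_kv_entropy_manager:enabled().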

I don't know if there are any minimum levels of data before bitcask will
perform compaction.  There's nothing obvious in the code that wouldn't be
triggered way before 90%.  I don't know if it will merge on the active file
(the one currently being written to), but that is 2GB max size (configured
through bitcask.max_file_size).

When you say the size of the bitcask directory - is this the size shared
across all vnodes on the node?  I guess if each vnode has a single file
<2GB, and there are multiple vnodes, something unexpected might happen
here - if bitcask does indeed not merge the file that is active for writing.

In terms of distribution around the cluster, if you have an n_val of 3 you
should normally expect to see a relatively even distribution of the data on
failure (certainly not it all going to one).  Worst case scenario is that 3
nodes get all the load from that one failed node.

When a vnode is inaccessible, 3 (assuming n=3) fallback vnodes are selected
to handle the load for that 1 vnode (as that vnode would normally be in 3
preflists, and commonly a different node will be asked to start a vnode for
each preflist).
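
To make that concrete, a sketch of how one could see which vnodes (primaries
or fallbacks) would currently take a given key, from the attached shell - the
bucket and key below are only placeholders:

%% Active preflist for a placeholder bucket/key at n_val=3. When a node is
%% down, fallback vnodes on other nodes appear in place of its primaries.
DocIdx = riak_core_util:chash_key({<<"my_bucket">>, <<"my_key">>}).
riak_core_apl:get_apl(DocIdx, 3, riak_kv).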


I will try and dig later into bitcask merge/compaction code, to see if I
spot anything else.

Martin


Node Recovery Questions

2018-08-08 Thread sean mcevoy
Hi All,

A few questions on the procedure here to recover a failed node:
http://docs.basho.com/riak/kv/2.2.3/using/repair-recovery/failed-node/

We lost a production riak server when AWS decided to delete a node and we
plan on doing this procedure to replace it with a newly built node. A
practice run in our QA environment has brought up some questions.

- How can I tell when everything has synched up? I thought I could just
monitor the handoffs, but these completed within 5 minutes of committing the
cluster changes, while the data directories continued to grow rapidly in size
for at least an hour. I assume that this was data being synched to the new
node, but how can I tell from the user level when it has completed? Or is it
left up to AAE to sync the data?

- The size of the bitcask directory on the 4 original nodes is ~10GB; on the
new node the size of this directory climbed to 1GB within an hour but hasn't
moved much in the 4 days since. I know bitcask entries still exist until the
periodic compaction, but can it be right that it's hanging on to 90% of the
disk space it's using for dead data?

- Not directly related to the recovery procedure, but while one node of a
five-node cluster is down, how is the extra load distributed within the
cluster? It will still keep 3 copies of each entry, right? Are the copies
that would have been on the missing node all stored on the next node in the
ring, or distributed all around the cluster?

Thanks in advance,
//Sean.
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com