Re: Riak 2.9.0 - Update Available

2019-08-09 Thread Martin Sumner
There is now a third update available for 2.9.0:
https://github.com/basho/riak/tree/riak-2.9.0p3.

Again, the fixes are related to memory management in leveled, and
specifically to references to sub-binaries.  The main issue was
related to a lazy-load of file metadata which occurs following a Riak
restart; there was also an issue with managing memory use during journal
compaction following many days of repeated compactions.  The release notes (
https://github.com/basho/riak/blob/riak-2.9.0p3/RELEASE-NOTES.md) contain
some more details and links.

I would recommend updating from any previous release of 2.9.0 if you have
enabled either the leveled backend, or Tictac AAE.

Updated packages are available (thanks to Nick Adams at TI Tokyo) -
https://files.tiot.jp/riak/kv/2.9/2.9.0p3/.

Thanks again to the testing team at the NHS Spine project, Aaron Gibbon
(BJSS) and Ramen Sen, for their continued efforts to stress Riak 2.9.0 in
different environments and scenarios and uncover these problems.

On a more general note, there are ongoing tests of a pre-release of 2.9.1
that have been happening over the past month, so we continue to make
progress towards that release.  No major issues have been highlighted so
far.  Work on Riak 3.0 has slowed over the summer, but I hope we can pick
up the pace again and make further progress in September.

Regards

Martin

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Riak 2.9.0 - Update Available

2019-06-28 Thread Martin Sumner
Russell,

WRT naming.  As we'd already announced at CodeBEAM that 2.9.1 was pending
in September and would be adding some extra functionality (the automated
repl replacement), I didn't want to call the patched versions of 2.9.0 by
that name, as that might cause confusion.  Choosing 2.9 in the first place
has unnecessarily cramped the naming, which was my bad.  I've turned Riak
release numbering into a confusing mess.  So apologies for that.

Hopefully we can return to a more sane numbering system from 3.0.  Perhaps
someone else should choose!

Regards



Re: Riak 2.9.0 - Update Available

2019-06-28 Thread Bryan Hunt
Top quality spelunking - always fun to read - thanks Martin!


Re: Riak 2.9.0 - Update Available

2019-06-28 Thread Russell Brown via riak-users

Good job on finding and fixing so fast.

I have to ask. What's with the naming scheme? Why not 2.9.2 instead of 
2.9.0p2?


Cheers

Russell


Re: Riak 2.9.0 - Update Available

2019-06-28 Thread Martin Sumner
Bryan,

We saw that Riak was using much more memory than was expected at the end of
the handoffs.  Using `riak-admin top` we could see that this wasn't process
memory, but binaries.  We first did some work via attach, looping over
processes and running GC, to confirm that this wasn't a failure to collect
garbage - the references to memory were real.  We then did a bit of work in
attach, writing some functions to analyse process_info/2 for each process
(looking at binary and memory), and discovered that there were penciller
processes holding many references to large binaries (this accounted for
all the unexpected memory use), and that the penciller was the only
process with a reference to each binary.  This made no sense initially, as
the penciller should only have small binaries (metadata).  We then looked
at the running state of the penciller processes and could see no large
binaries in the state, but could see that many of the active keys in the
penciller were keys known to have large object values (but small amounts
of metadata) - and that the sizes of the object values were the same as
the sizes of the binary references found on the penciller process via
process_info/2.
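
A minimal sketch of that kind of attach-shell check follows, for anyone
wanting to repeat it.  It is not the code used in this investigation: the
module name `bin_refs` and its functions are hypothetical, and it assumes
`erlang:process_info(Pid, binary)` returns a list of `{Id, Size, RefCount}`
tuples, a shape the OTP documentation describes as subject to change.

```erlang
-module(bin_refs).
-export([gc_all/0, top/1]).

%% Force a garbage collection on every process, to rule out binaries
%% that are only waiting to be collected.
gc_all() ->
    lists:foreach(fun erlang:garbage_collect/1, erlang:processes()),
    ok.

%% Return up to N processes ranked by the bytes of reference-counted
%% binaries they hold, as {Pid, TotalBytes, BinaryCount}, largest first.
top(N) ->
    Infos = [binary_info(P) || P <- erlang:processes()],
    Sorted = lists:reverse(lists:keysort(2, Infos)),
    lists:sublist(Sorted, N).

binary_info(Pid) ->
    case erlang:process_info(Pid, binary) of
        {binary, Bins} ->
            %% The same binary can be listed once per reference, so
            %% de-duplicate before summing (assumed tuple shape).
            Unique = lists:usort(Bins),
            Total = lists:sum([Size || {_Id, Size, _Refs} <- Unique]),
            {Pid, Total, length(Unique)};
        undefined ->
            %% The process exited between listing and inspection.
            {Pid, 0, 0}
    end.
```

With the module loaded on the node, calling `bin_refs:gc_all()` followed by
`bin_refs:top(10)` should show which processes still hold large binaries
once garbage has been ruled out.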

I then recalled the first part of this:
https://dieswaytoofast.blogspot.com/2012/12/erlang-binaries-and-garbage-collection.html.
It was obvious that in extracting the metadata the beam was naturally
retaining a reference to the whole binary, as long as the sub-binary was
retained by a process (the Penciller).  Forcing a binary copy resolved
this referencing issue.  It was nice that the same tools used to detect the
issue made it quite easy to write a test to confirm the resolution -
https://github.com/martinsumner/leveled/blob/master/test/end_to_end/riak_SUITE.erl#L1214-L1239.
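
The effect can be reproduced outside leveled with a small sketch.  The
module name `subbin_demo` and the 1KB metadata prefix are assumptions made
for illustration; this is not the leveled fix itself, only the general
`binary:copy/1` technique it relies on.

```erlang
-module(subbin_demo).
-export([run/0]).

run() ->
    %% A 2MB object whose first 1024 bytes stand in for the metadata.
    MetaSize = 1024,
    Object = binary:copy(<<0>>, 2 * 1024 * 1024),
    <<Meta:MetaSize/binary, _Value/binary>> = Object,
    %% Meta is a sub-binary: small in itself, but it refers back to
    %% Object, so Object cannot be freed while Meta is retained.
    MetaSize = byte_size(Meta),
    Retained = binary:referenced_byte_size(Meta),
    %% binary:copy/1 builds a fresh binary holding only the metadata
    %% bytes, which drops the reference to Object.
    Copied = binary:copy(Meta),
    {Retained, binary:referenced_byte_size(Copied)}.
```

The first element of the result should report the size of the whole
object, and the second only the 1024 bytes that were copied.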

The memory leak section of Fred Hebert's http://www.erlang-in-anger.com/ is
great reading for helping with these types of issues.

Thanks

Martin


On Fri, 28 Jun 2019 at 09:46, b h  wrote:

> Nice work - I've read issue / PR - how did you discover / track it down -
> tools or just reading the code ?


Re: Riak 2.9.0 - Update Available

2019-06-28 Thread Martin Sumner
There is now a second update available for 2.9.0:
https://github.com/basho/riak/tree/riak-2.9.0p2.

This patch, like the patch before, resolves a memory management issue in
leveled, which this time could be triggered by sending many large objects
in a short period of time.  The underlying problem is described a bit
further here https://github.com/martinsumner/leveled/issues/285, and is
resolved by leveled working more sympathetically with the beam binary
memory management.

Switching to the patched version is not urgent unless you are using the
leveled backend, and may send a large number of large objects in a burst.

Updated packages are available (thanks to Nick Adams at TI Tokyo) -
https://files.tiot.jp/riak/kv/2.9/2.9.0p2/

Thanks again to the testing team at the NHS Spine project, Aaron Gibbon
(BJSS) and Ramen Sen, who discovered the problem.  The issue was discovered
in a handoff scenario where there were tens of thousands of 2MB objects
stored in a portion of the keyspace at the end of the handoff - which led
to memory issues until either more PUTs were received (to force a persist
to disk) or a restart occurred.

Regards




Riak 2.9.0 - Update Available

2019-05-25 Thread Martin Sumner
Unfortunately, Riak 2.9.0 was released with an issue whereby a race
condition in heavy-PUT scenarios (e.g. handoffs) could cause a leak of
file descriptors.

The issue is described here - https://github.com/basho/riak_kv/issues/1699,
and the underlying issue here -
https://github.com/martinsumner/leveled/issues/278.

There is a new patched version of the release available (2.9.0p1) at
https://github.com/basho/riak/tree/riak-2.9.0p1.  This should be used in
preference to the original release of 2.9.0.

Updated packages are available (thanks to Nick Adams at TI Tokyo) -
https://files.tiot.jp/riak/kv/2.9/2.9.0p1/

Thanks also to the testing team at the NHS Spine project, Aaron Gibbon
(BJSS) and Ramen Sen, who discovered the problem.

Regards

Martin