Good job on finding and fixing this so fast.

I have to ask. What's with the naming scheme? Why not 2.9.2 instead of 2.9.0p2?

Cheers

Russell

On 28/06/2019 10:24, Martin Sumner wrote:
Bryan,

We saw that Riak was using much more memory than expected at the end of the handoffs.  Using `riak-admin top` we could see that this wasn't process memory, but binaries.  First we did some work via attach, looping over processes and running GC, to confirm that this wasn't a failure to collect garbage - the references to memory were real.

We then wrote some functions in attach to analyse process_info/2 for each process (looking at binary and memory), and discovered that some penciller processes held many references to large binaries, that these accounted for all of the unexpected memory use, and that the penciller was the only process holding a reference to each binary.  This made no sense initially, as the penciller should only hold small binaries (metadata).

Looking at the running state of the penciller processes, we could see no large binaries in the state, but many of the active keys in the penciller were keys known to have large object values (with small amounts of metadata), and the sizes of those object values matched the sizes of the binary references found on the penciller process via process_info/2.
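
A minimal sketch of the kind of process_info/2 analysis described above. This is not the code that was actually run in attach; the module name `binary_refs` and the function `top/1` are hypothetical. It sums the sizes of the off-heap binaries each process references and returns the largest holders.

```erlang
-module(binary_refs).
-export([top/1]).

%% Return up to N {Pid, Bytes} pairs, ordered by the total size of the
%% off-heap binaries each process currently references (largest first).
top(N) ->
    Sizes = [{Pid, binary_bytes(Pid)} || Pid <- erlang:processes()],
    Sorted = lists:reverse(lists:keysort(2, Sizes)),
    lists:sublist(Sorted, N).

%% process_info(Pid, binary) is a debugging item; it currently returns a
%% list of {Id, Size, RefCount} tuples, one per referenced binary.
binary_bytes(Pid) ->
    case erlang:process_info(Pid, binary) of
        {binary, Bins} ->
            lists:sum([Size || {_Id, Size, _RefCount} <- Bins]);
        undefined ->
            %% The process exited between processes/0 and process_info/2.
            0
    end.
```

Calling `binary_refs:top(10)` before and after `erlang:garbage_collect(Pid)` on the suspect processes is one way to tell uncollected garbage apart from references that are still live.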

I then recalled the first part of this: https://dieswaytoofast.blogspot.com/2012/12/erlang-binaries-and-garbage-collection.html. It was obvious that, in extracting the metadata, the beam was naturally retaining a reference to the whole binary for as long as the sub-binary was retained by a process (the Penciller).  Forcing a binary copy resolved this referencing issue.  It was nice that the same tools used to detect the issue made it quite easy to write a test to confirm the resolution - https://github.com/martinsumner/leveled/blob/master/test/end_to_end/riak_SUITE.erl#L1214-L1239.
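
The sub-binary behaviour can be illustrated with a small sketch. This is an illustration only, not the leveled fix itself; the module name `subbin_demo` and the sizes are assumptions. A 128-byte slice is used because some newer OTP releases appear to copy very small matched segments automatically, which would hide the effect.

```erlang
-module(subbin_demo).
-export([demo/0]).

%% Returns {BytesReferencedBySlice, BytesReferencedByCopy}.
demo() ->
    %% Stand-in for a large object value (2MB, built at runtime).
    Big = binary:copy(<<0>>, 2 * 1024 * 1024),
    %% Matching out a slice creates a sub-binary that points into Big,
    %% so holding Meta keeps the whole 2MB binary alive.
    <<Meta:128/binary, _Rest/binary>> = Big,
    Retained = binary:referenced_byte_size(Meta),
    %% binary:copy/1 makes a fresh binary holding only the slice's bytes.
    Copied = binary:copy(Meta),
    {Retained, binary:referenced_byte_size(Copied)}.
```

The first element is expected to be the size of the parent binary and the second the size of the slice alone, which is why a process that keeps only the copy no longer pins the large value.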

The memory leak section of Fred Hebert's http://www.erlang-in-anger.com/ is great reading for help with these types of issues.

Thanks

Martin


On Fri, 28 Jun 2019 at 09:46, b h <bryanhuntwit...@gmail.com> wrote:

    Nice work - I've read the issue / PR - how did you discover / track it
    down - tools or just reading the code?

    On Fri, 28 Jun 2019 at 09:35, Martin Sumner
    <martin.sum...@adaptip.co.uk> wrote:

        There is now a second update available for 2.9.0:
        https://github.com/basho/riak/tree/riak-2.9.0p2.

        This patch, like the patch before, resolves a memory
        management issue in leveled, which this time could be
        triggered by sending many large objects in a short period of
        time.  The underlying problem is described a bit further here
        https://github.com/martinsumner/leveled/issues/285, and is
        resolved by leveled working more sympathetically with the beam
        binary memory management.

        Switching to the patched version is not urgent unless you are
        using the leveled backend, and may send a large number of
        large objects in a burst.

        Updated packages are available (thanks to Nick Adams at TI
        Tokyo) - https://files.tiot.jp/riak/kv/2.9/2.9.0p2/

        Thanks again to the testing team at the NHS Spine project,
        Aaron Gibbon (BJSS) and Ramen Sen, who discovered the
        problem.  The issue was discovered in a handoff scenario where
        there were tens of thousands of 2MB objects stored in a
        portion of the keyspace at the end of the handoff - which led
        to memory issues until either more PUTs were received (to
        force a persist to disk) or a restart occurred.

        Regards


        On Sat, 25 May 2019 at 09:35, Martin Sumner
        <martin.sum...@adaptip.co.uk> wrote:

            Unfortunately, Riak 2.9.0 was released with an issue
            whereby a race condition in heavy-PUT scenarios (e.g.
            handoffs) could cause a leak of file descriptors.

            The issue is described here -
            https://github.com/basho/riak_kv/issues/1699, and the
            underlying issue here -
            https://github.com/martinsumner/leveled/issues/278.

            There is a new patched version of the release available
            (2.9.0p1) at
            https://github.com/basho/riak/tree/riak-2.9.0p1. This
            should be used in preference to the original release of 2.9.0.

            Updated packages are available (thanks to Nick Adams at TI
            Tokyo) - https://files.tiot.jp/riak/kv/2.9/2.9.0p1/

            Thanks also to the testing team at the NHS Spine project,
            Aaron Gibbon (BJSS) and Ramen Sen, who discovered the problem.

            Regards

            Martin




        _______________________________________________
        riak-users mailing list
        riak-users@lists.basho.com
        http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

