See
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029780.html
for
the original thread.

Here is a trivial reproducer not using any aio or dynamically allocated
memory to store the objects read.
It simply reads 20,000 1MB large objects sequentially: when run, instead of
using a roughly fixed amount of memory, the RSS increases up to 400 MB.

<code>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rados/librados.h>

#define SIZE 1000000

void readobj(rados_ioctx_t* io, char objname[], unsigned long size) {
    static char data[SIZE];
    unsigned long bytes_read;
    int retval;

    rados_read_op_t read_op = rados_create_read_op();
    rados_read_op_read(read_op, 0, size, data, &bytes_read, &retval);
    retval = rados_read_op_operate(read_op, *io, objname, 0);
    assert(retval == 0);
    rados_release_read_op(read_op);
}

int main() {
    rados_t cluster;
    rados_create(&cluster, "test");
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    rados_connect(cluster);

    rados_ioctx_t io;
    rados_ioctx_create(cluster, "test", &io);

    char smallobj_name[16];
    int i;
    long long total_bytes_read = 0;

    for (i = 0; i < 20000; i++) {
        sprintf(smallobj_name, "1mobj_%d", i);
        readobj(&io, smallobj_name, SIZE);
        readobj(&io, smallobj_name, 1);

        total_bytes_read += SIZE;
        printf("Read %s for total %lld\n", smallobj_name, total_bytes_read);
    }

    return 0;
}
</code>


Running it under valgrind with --show-reachable=yes shows where the problem
is:

==10253== 423,755,776 bytes in 424 blocks are still reachable in loss
record 882 of 882
==10253==    at 0x4C2FEA6: memalign (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10253==    by 0x4C2FFB1: posix_memalign (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10253==    by 0x57A3C06: ceph::buffer::create_aligned_in_mempool(unsigned
int, unsigned int, int) (in /usr/lib/ceph/libceph-common.so.0)
==10253==    by 0x5984B49: AsyncConnection::process() (in
/usr/lib/ceph/libceph-common.so.0)
==10253==    by 0x5994C97: EventCenter::process_events(int,
std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (in
/usr/lib/ceph/libceph-common.so.0)
==10253==    by 0x5999817: ??? (in /usr/lib/ceph/libceph-common.so.0)
==10253==    by 0xE960C7F: ??? (in
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
==10253==    by 0xE6926B9: start_thread (pthread_create.c:333)
==10253==    by 0x52A041C: clone (clone.S:109)

We can see that the memory is allocated in AsyncConnection::process,
precisely here:
https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L614

>From my understanding - I have never read ceph code before, so my
understanding might be flawed - this code is allocating a buffer to store
the data contained in the message received from a ceph node and adding it
to a buffers list `data_buf`.
Note that this list of buffers is per AsyncConnection, that's important.

Unfortunately, looking through the file shows that `data_buf` only gets
cleared here:
https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L477
when it transitions to the STATE_OPEN_MESSAGE_HEADER state, IIUC only
occurs when it starts receiving a new CEPH_MSGR_TAG_MSG message.

Which means that basically the following happens - the below is for a
single AsyncConnection:

repeat:
   0. wait for message
   1. parse message header
   2. realize it's an CEPH_MSGR_TAG_MSG message (not a heartbeat, etc):
transition to STATE_OPEN_MESSAGE_HEADER
   3. clear all buffers (
https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L477
)
   4. start parsing message, allocate memory buffer (
https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L614
)
   ...
   n. done processing message

Which means that the memory allocated when processing a message - at step 4
- is only released when a subsequent CEPH_MSGR_TAG_MSG message is received.
For a single connection that's not too bad - memory is just freed later
than it could, but with a large OSD pool, it adds up.
Our cluster is made of over 40 ceph OSDs, which results in around 400
connections. Which means that if reading for example 1MB objects, this
delayed freeing can use up to 400MB (1MB per connection), which is exactly
what we're seeing.

It would also explains:
- why reading a small subset of objects over and over again doesn't result
in a memory increase: because they are distributed over few ceph OSDs,
hence use few connections
- the shape of the memory consumption over time: if it were a leak (even
with reachable objects), it would grow linearly, whereas we expect it to
grow slower and slower as we've connected all ceph OSDs in the pool
- why interspersing small reads helps: if one does a large read follows by
a small read over the same connection, the large read buffer will be
cleared before the small read buffer is allocated

The followingg is enough to keep the memory usage under control:
<diff>
--- a.c     2018-09-14 12:56:13.770350595 +0100
+++ b.c        2018-09-14 12:55:32.921856548 +0100
@@ -34,7 +34,6 @@
     for (i = 0; i < 20000; i++) {
         sprintf(smallobj_name, "1mobj_%d", i);
         readobj(&io, smallobj_name, SIZE);
+        readobj(&io, smallobj_name, 1);

         total_bytes_read += SIZE;
         printf("Read %s for total %lld\n", smallobj_name,
total_bytes_read);
</diff>

Reading just 1 byte of the same object after having read the object will
make the AsyncConnection clear the buffer allocated for processing the
large read result, and allocate a tiny buffer for the 1 byte read result.


TLDR: It seems that the buffers allocated while processing a message in
AsyncConnection are not cleared when the message has been processed, but
only when a subsequent message is received, which yields to a significant
memory usage if there are many connections in use (objects stored on many
different OSDs).
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to