On Thu, Sep 13, 2018 at 6:35 AM Daniel Goldbach <dan.goldb...@gmail.com>
wrote:

> I'm sure I'm not forgetting to free any buffers. I'm not even allocating
> any heap memory in the example above.
>
> On further investigation, the same issue *does* happen with the
> synchronous read operation API. I erroneously said that the issue doesn't
> happen with the synchronous API; what I meant was that it doesn't happen
> when calling the rados_read() function directly to read from an object.
>
> Using the synchronous read operation API, the following four commands are
> sufficient to trigger the 'leak':
>
>     rados_read_op_t read_op = rados_create_read_op();
>     rados_read_op_read(read_op, 0, SIZE, data, &bytes_read, &retval);
>     retval = rados_read_op_operate(read_op, *io, objname, 0);
>     assert(retval == 0);
>     rados_release_read_op(read_op);
>
> Curiously, the presence of the final line "rados_release_read_op(read_op);"
> does not seem to affect the memory usage of the program at all! It can be
> removed and the program's memory profile looks the same. It's as if the
> release mechanism is totally broken.
>
> I agree that the source is most likely mismanagement of C_bl_to_buf.
> rados_release_read_op should call Objecter.h:~ObjectOperation() on the read
> op, which in turn should delete the C_bl_to_buf whose pointer was pushed
> onto the ObjectOperation's out_handler vector in ObjectOperation::read(..).
> Something around that bit of code must be broken.
>

Okay, but you get that *you* are responsible for cleaning up the "data"
buffer, right? librados isn't going to delete that for you automatically as
it has no way of knowing when it should do so.
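
For instance, if "data" were heap-allocated, the lifecycle would have to look
something like this (just a sketch reusing the names from your snippet; the
free() is the caller's job, librados only cleans up its own per-op state):

    char *data = malloc(SIZE);
    rados_read_op_t read_op = rados_create_read_op();
    rados_read_op_read(read_op, 0, SIZE, data, &bytes_read, &retval);
    retval = rados_read_op_operate(read_op, *io, objname, 0);
    assert(retval == 0);
    rados_release_read_op(read_op);  /* releases librados' internal op state */
    /* ... use the data ... */
    free(data);                      /* caller-owned buffer; librados won't free it */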


>
>
>
> On Wed, Sep 12, 2018 at 5:35 PM Gregory Farnum <gfar...@redhat.com> wrote:
>
>> That code bit is just "we have an incoming message with data", which is
>> what we'd expect, but it means it's not very helpful for tracking down the
>> source of any leaks.
>>
>> My guess is still very much that somehow there are deallocations missing
>> here. Internally, the synchronous API is wrapping the async one, so it'd be
>> subject to the same bugs. (Perhaps not the same poor malloc behavior, but
>> if valgrind reports the same memory usage as the RSS being reported, then I
>> don't think it can be malloc behavior anyway.)
>>
>> Are you actually freeing the buffers you provide when you're done with
>> them? I guess there could be something with the "C_bl_to_buf" structure
>> getting managed wrong as well since that looks to be unique to this code
>> path, but it wouldn't depend on the size of the objects since it's just 4
>> pointers/ints.
>> -Greg
>>
>> On Wed, Sep 12, 2018 at 8:43 AM Daniel Goldbach <dan.goldb...@gmail.com>
>> wrote:
>>
>>> The issue continues even when I do rados_aio_release(completion) at the
>>> end of the readobj(..) definition in the example. Also, in our production
>>> code we call rados_aio_release for each completion and we still see the
>>> issue there. The release command doesn't guarantee instant release, so
>>> could it be that the release operations are getting queued up but never
>>> executed?
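>>>
>>> For concreteness, that means the example's readobj() now ends with the
>>> release call (a sketch; everything else is unchanged from the code below):
>>>
>>>     void readobj(rados_ioctx_t* io, char objname[]) {
>>>         char data[1000000];
>>>         unsigned long bytes_read;
>>>         rados_completion_t completion;
>>>         int retval;
>>>
>>>         rados_read_op_t read_op = rados_create_read_op();
>>>         rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
>>>         retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
>>>         assert(retval == 0);
>>>
>>>         retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
>>>         assert(retval == 0);
>>>
>>>         rados_aio_wait_for_complete(completion);
>>>         rados_aio_get_return_value(completion);
>>>
>>>         /* added: release the completion (which should drop its bufferlist
>>>            reference) once we're done with it */
>>>         rados_aio_release(completion);
>>>     }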
>>>
>>> Valgrind massif shows that the relevant allocations are all happening in
>>> the bit of code in the following stack trace:
>>>
>>>
>>> --------------------------------------------------------------------------------
>>>   n        time(i)         total(B)   useful-heap(B) extra-heap(B) stacks(B)
>>> --------------------------------------------------------------------------------
>>>  62    166,854,775       82,129,696       81,808,615       321,081         0
>>>  63    168,025,321       83,155,872       82,834,072       321,800         0
>>> 99.61% (82,834,072B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
>>> ->93.75% (77,955,072B) 0x579AC05: ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned int, int) (in /usr/lib/ceph/libceph-common.so.0)
>>> | ->93.75% (77,955,072B) 0x597BB48: AsyncConnection::process() (in /usr/lib/ceph/libceph-common.so.0)
>>> | | ->93.75% (77,955,072B) 0x598BC96: EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (in /usr/lib/ceph/libceph-common.so.0)
>>> | |   ->93.75% (77,955,072B) 0x5990816: ??? (in /usr/lib/ceph/libceph-common.so.0)
>>> | |     ->93.75% (77,955,072B) 0xE957C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
>>> | |       ->93.75% (77,955,072B) 0xE6896B8: start_thread (pthread_create.c:333)
>>> | |         ->93.75% (77,955,072B) 0x529741B: clone (clone.S:109)
>>> | |
>>> | ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
>>> |
>>> ->05.87% (4,879,000B) in 387 places, all below massif's threshold (1.00%)
>>>
>>>
>>>
>>> On Wed, Sep 12, 2018 at 4:05 PM Gregory Farnum <gfar...@redhat.com>
>>> wrote:
>>>
>>>> Yep, those completions are maintaining bufferlist references IIRC, so
>>>> they’re definitely holding the memory buffers in place!
>>>> On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley <cbod...@redhat.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > We're reading from a Ceph Luminous pool using the librados asynchronous
>>>>> > I/O API. We're seeing some concerning memory usage patterns when we
>>>>> > read many objects in sequence.
>>>>> >
>>>>> > The expected behaviour is that our memory usage stabilises at a small
>>>>> > amount, since we're just fetching objects and ignoring their data.
>>>>> > What we instead find is that the memory usage of our program grows
>>>>> > linearly with the amount of data read for an interval of time, and
>>>>> > then continues to grow at a much slower but still consistent pace.
>>>>> > This memory is not freed until program termination. My guess is that
>>>>> > this is an issue with Ceph's memory allocator.
>>>>> >
>>>>> > To demonstrate, we create 20000 objects of size 10KB, another 20000 of
>>>>> > size 100KB, and another 20000 of size 1MB:
>>>>> >
>>>>> >     #include <stdio.h>
>>>>> >     #include <stdlib.h>
>>>>> >     #include <string.h>
>>>>> >     #include <rados/librados.h>
>>>>> >
>>>>> >     int main() {
>>>>> >         rados_t cluster;
>>>>> >         rados_create(&cluster, "test");
>>>>> >         rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>>>>> >         rados_connect(cluster);
>>>>> >
>>>>> >         rados_ioctx_t io;
>>>>> >         rados_ioctx_create(cluster, "test", &io);
>>>>> >
>>>>> >         char data[1000000];
>>>>> >         memset(data, 'a', 1000000);
>>>>> >
>>>>> >         char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
>>>>> >         int i;
>>>>> >         for (i = 0; i < 20000; i++) {
>>>>> >             sprintf(smallobj_name, "10kobj_%d", i);
>>>>> >             rados_write(io, smallobj_name, data, 10000, 0);
>>>>> >
>>>>> >             sprintf(mediumobj_name, "100kobj_%d", i);
>>>>> >             rados_write(io, mediumobj_name, data, 100000, 0);
>>>>> >
>>>>> >             sprintf(largeobj_name, "1mobj_%d", i);
>>>>> >             rados_write(io, largeobj_name, data, 1000000, 0);
>>>>> >
>>>>> >             printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
>>>>> >                    smallobj_name, mediumobj_name, largeobj_name);
>>>>> >         }
>>>>> >
>>>>> >         return 0;
>>>>> >     }
>>>>> >
>>>>> >     $ gcc create.c -lrados -o create
>>>>> >     $ ./create
>>>>> >     wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of size 1000000
>>>>> >     wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of size 1000000
>>>>> >     [...]
>>>>> >     wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000, 1mobj_19998 of size 1000000
>>>>> >     wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000, 1mobj_19999 of size 1000000
>>>>> >
>>>>> > Now we read each of these objects with the async API, into the same
>>>>> > buffer. First we read just the 10KB objects:
>>>>> >
>>>>> >     #include <assert.h>
>>>>> >     #include <stdio.h>
>>>>> >     #include <stdlib.h>
>>>>> >     #include <string.h>
>>>>> >     #include <rados/librados.h>
>>>>> >
>>>>> >     void readobj(rados_ioctx_t* io, char objname[]);
>>>>> >
>>>>> >     int main() {
>>>>> >         rados_t cluster;
>>>>> >         rados_create(&cluster, "test");
>>>>> >         rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>>>>> >         rados_connect(cluster);
>>>>> >
>>>>> >         rados_ioctx_t io;
>>>>> >         rados_ioctx_create(cluster, "test", &io);
>>>>> >
>>>>> >         char smallobj_name[16];
>>>>> >         int i, total_bytes_read = 0;
>>>>> >
>>>>> >         for (i = 0; i < 20000; i++) {
>>>>> >             sprintf(smallobj_name, "10kobj_%d", i);
>>>>> >             readobj(&io, smallobj_name);
>>>>> >
>>>>> >             total_bytes_read += 10000;
>>>>> >             printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
>>>>> >         }
>>>>> >
>>>>> >         getchar();
>>>>> >         return 0;
>>>>> >     }
>>>>> >
>>>>> >     void readobj(rados_ioctx_t* io, char objname[]) {
>>>>> >         char data[1000000];
>>>>> >         unsigned long bytes_read;
>>>>> >         rados_completion_t completion;
>>>>> >         int retval;
>>>>> >
>>>>> >         rados_read_op_t read_op = rados_create_read_op();
>>>>> >         rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
>>>>> >         retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
>>>>> >         assert(retval == 0);
>>>>> >
>>>>> >         retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
>>>>> >         assert(retval == 0);
>>>>> >
>>>>> >         rados_aio_wait_for_complete(completion);
>>>>> >         rados_aio_get_return_value(completion);
>>>>> >     }
>>>>> >
>>>>> >     $ gcc read.c -lrados -o read_small -Wall -g && ./read_small
>>>>> >     Read 10kobj_0 for total 10000
>>>>> >     Read 10kobj_1 for total 20000
>>>>> >     [...]
>>>>> >     Read 10kobj_19998 for total 199990000
>>>>> >     Read 10kobj_19999 for total 200000000
>>>>> >
>>>>> > We read 200MB. A graph of the resident set size of the program is
>>>>> > attached as mem-graph-10k.png, with seconds on the x axis and KB on the
>>>>> > y axis. You can see that the memory usage increases throughout, which
>>>>> > itself is unexpected since that memory should be freed over time and
>>>>> > we should only hold 10KB of object data in memory at a time. The rate
>>>>> > of growth decreases and eventually stabilises, and by the end we've
>>>>> > used 60MB of RAM.
>>>>> >
>>>>> > We repeat this experiment for the 100KB and 1MB objects and find that
>>>>> > after all reads they use 140MB and 500MB of RAM respectively, and
>>>>> > memory usage presumably would continue to grow if there were more
>>>>> > objects. This is orders of magnitude more memory than I would expect
>>>>> > these programs to use.
>>>>> >
>>>>> >   * We do not get this behaviour with the synchronous API, and the
>>>>> >     memory usage remains stable at just a few MB.
>>>>> >   * We've found that for some reason, this doesn't happen (or doesn't
>>>>> >     happen as severely) if we intersperse large reads with much
>>>>> >     smaller reads. In this case, the memory usage seems to stabilise
>>>>> >     at a reasonable number.
>>>>> >   * Valgrind only reports a trivial amount of unreachable memory.
>>>>> >   * Memory usage doesn't increase in this manner if we repeatedly read
>>>>> >     the same object over and over again. It hovers around 20MB.
>>>>> >   * In other experiments we've done, with different object data and
>>>>> >     distributions of object sizes, we've seen memory usage grow even
>>>>> >     larger in proportion to the amount of data read.
>>>>> >
>>>>> > We maintain long-running (order of weeks) services that read objects
>>>>> > from Ceph and send them elsewhere. Over time, the memory usage of some
>>>>> > of these services has grown to more than 6GB, which is unreasonable.
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> > Dan G
>>>>> >
>>>>> >
>>>>>
>>>>> It looks like the async example is missing calls to rados_aio_release()
>>>>> to clean up the completions. I'm not sure that would account for all of
>>>>> the memory growth, but that's where I would start. Past that, running
>>>>> the client under valgrind massif should help with further investigation.
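>>>>>
>>>>> For example, something along these lines (massif writes its output to a
>>>>> massif.out.<pid> file by default):
>>>>>
>>>>>     valgrind --tool=massif ./read_small
>>>>>     ms_print massif.out.<pid>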
>>>>>
>>>>> Casey
>>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Dan Goldbach
>>>
>>
>
> --
> Regards,
> Dan Goldbach
>
