On Thu, Sep 13, 2018 at 6:35 AM Daniel Goldbach <dan.goldb...@gmail.com> wrote:
> I'm sure I'm not forgetting to free any buffers. I'm not even allocating any heap memory in the example above.
>
> On further investigation, the same issue *does* happen with the synchronous read operation API. I erroneously said that the issue doesn't happen with the synchronous API, when what I meant to say was that the issue doesn't happen when calling the rados_read() function directly to read from an object.
>
> Using the synchronous read operation API, the following four commands are sufficient to trigger the 'leak':
>
>     rados_read_op_t read_op = rados_create_read_op();
>     rados_read_op_read(read_op, 0, SIZE, data, &bytes_read, &retval);
>     retval = rados_read_op_operate(read_op, *io, objname, 0);
>     assert(retval == 0);
>     rados_release_read_op(read_op);
>
> Curiously, the presence of the final line "rados_release_read_op(read_op);" does not seem to affect the memory usage of the program at all! It can be removed and the program's memory profile looks the same. It's as if the release mechanism is totally broken.
>
> I agree that the source is most likely mismanagement of C_bl_to_buf. rados_release_read_op should call Objecter.h:~ObjectOperation() on the read op, which in turn should delete the C_bl_to_buf whose pointer was pushed onto the ObjectOperation's out_handler vector in ObjectOperation::read(..). Something around that bit of code must be broken.

Okay, but you get that *you* are responsible for cleaning up the "data" buffer, right? librados isn't going to delete that for you automatically, as it has no way of knowing when it should do so.

> On Wed, Sep 12, 2018 at 5:35 PM Gregory Farnum <gfar...@redhat.com> wrote:
>
>> That code bit is just "we have an incoming message with data", which is what we'd expect, but it means it's not very helpful for tracking down the source of any leaks.
>>
>> My guess is still very much that somehow there are deallocations missing here.
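To make the ownership split concrete, here is a minimal sketch (my own illustration, not code from the thread) of the synchronous read-op pattern with a heap-allocated buffer. The names `io`, `objname`, and `size` are assumed to come from an already-connected cluster; the point is that rados_release_read_op() only frees librados' per-op state, while the data buffer always belongs to the caller:

```c
#include <assert.h>
#include <stdlib.h>
#include <rados/librados.h>

void read_once(rados_ioctx_t io, const char *objname, size_t size) {
    char *data = malloc(size);       /* caller allocates the destination... */
    size_t bytes_read = 0;
    int retval = 0;

    rados_read_op_t read_op = rados_create_read_op();
    rados_read_op_read(read_op, 0, size, data, &bytes_read, &retval);
    int r = rados_read_op_operate(read_op, io, objname, 0);
    assert(r == 0);
    rados_release_read_op(read_op);  /* frees librados' op state only */

    /* ...and the caller frees it; librados never frees this buffer,
     * since it has no way of knowing when the caller is done with it. */
    free(data);
}
```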
>> Internally, the synchronous API is wrapping the async one, so it'd be subject to the same bugs. (Although perhaps not terrible malloc behavior; but if valgrind reports the same memory usage as the RSS being reported, then I don't think it can be malloc behavior.)
>>
>> Are you actually freeing the buffers you provide when you're done with them? I guess there could be something with the "C_bl_to_buf" structure getting managed wrong as well, since that looks to be unique to this code path, but it wouldn't depend on the size of the objects since it's just 4 pointers/ints.
>> -Greg
>>
>> On Wed, Sep 12, 2018 at 8:43 AM Daniel Goldbach <dan.goldb...@gmail.com> wrote:
>>
>>> The issue continues even when I do rados_aio_release(completion) at the end of the readobj(..) definition in the example. Also, in our production code we call rados_aio_release for each completion and we still see the issue there. The release command doesn't guarantee instant release, so could it be that the release operations are getting queued up but never executed?
>>>
>>> Valgrind massif shows that the relevant allocations are all happening in the bit of code in the following stack trace:
>>>
>>> --------------------------------------------------------------------------------
>>>   n       time(i)        total(B)  useful-heap(B)  extra-heap(B)  stacks(B)
>>> --------------------------------------------------------------------------------
>>>  62   166,854,775      82,129,696      81,808,615        321,081          0
>>>  63   168,025,321      83,155,872      82,834,072        321,800          0
>>>
>>> 99.61% (82,834,072B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
>>> ->93.75% (77,955,072B) 0x579AC05: ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned int, int) (in /usr/lib/ceph/libceph-common.so.0)
>>> | ->93.75% (77,955,072B) 0x597BB48: AsyncConnection::process() (in /usr/lib/ceph/libceph-common.so.0)
>>> | | ->93.75% (77,955,072B) 0x598BC96: EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (in /usr/lib/ceph/libceph-common.so.0)
>>> | |   ->93.75% (77,955,072B) 0x5990816: ??? (in /usr/lib/ceph/libceph-common.so.0)
>>> | |     ->93.75% (77,955,072B) 0xE957C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
>>> | |       ->93.75% (77,955,072B) 0xE6896B8: start_thread (pthread_create.c:333)
>>> | |         ->93.75% (77,955,072B) 0x529741B: clone (clone.S:109)
>>> | |
>>> | ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
>>> |
>>> ->05.87% (4,879,000B) in 387 places, all below massif's threshold (1.00%)
>>>
>>> On Wed, Sep 12, 2018 at 4:05 PM Gregory Farnum <gfar...@redhat.com> wrote:
>>>
>>>> Yep, those completions are maintaining bufferlist references IIRC, so they're definitely holding the memory buffers in place!
>>>>
>>>> On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley <cbod...@redhat.com> wrote:
>>>>
>>>>> On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > We're reading from a Ceph Luminous pool using the librados asynchronous I/O API. We're seeing some concerning memory usage patterns when we read many objects in sequence.
>>>>> >
>>>>> > The expected behaviour is that our memory usage stabilises at a small amount, since we're just fetching objects and ignoring their data. What we instead find is that the memory usage of our program grows linearly with the amount of data read for an interval of time, and then continues to grow at a much slower but still consistent pace.
>>>>> > This memory is not freed until program termination. My guess is that this is an issue with Ceph's memory allocator.
>>>>> >
>>>>> > To demonstrate, we create 20000 objects each of size 10KB, of size 100KB, and of size 1MB:
>>>>> >
>>>>> >     #include <stdio.h>
>>>>> >     #include <stdlib.h>
>>>>> >     #include <string.h>
>>>>> >     #include <rados/librados.h>
>>>>> >
>>>>> >     int main() {
>>>>> >         rados_t cluster;
>>>>> >         rados_create(&cluster, "test");
>>>>> >         rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>>>>> >         rados_connect(cluster);
>>>>> >
>>>>> >         rados_ioctx_t io;
>>>>> >         rados_ioctx_create(cluster, "test", &io);
>>>>> >
>>>>> >         char data[1000000];
>>>>> >         memset(data, 'a', 1000000);
>>>>> >
>>>>> >         char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
>>>>> >         int i;
>>>>> >         for (i = 0; i < 20000; i++) {
>>>>> >             sprintf(smallobj_name, "10kobj_%d", i);
>>>>> >             rados_write(io, smallobj_name, data, 10000, 0);
>>>>> >
>>>>> >             sprintf(mediumobj_name, "100kobj_%d", i);
>>>>> >             rados_write(io, mediumobj_name, data, 100000, 0);
>>>>> >
>>>>> >             sprintf(largeobj_name, "1mobj_%d", i);
>>>>> >             rados_write(io, largeobj_name, data, 1000000, 0);
>>>>> >
>>>>> >             printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
>>>>> >                    smallobj_name, mediumobj_name, largeobj_name);
>>>>> >         }
>>>>> >
>>>>> >         return 0;
>>>>> >     }
>>>>> >
>>>>> >     $ gcc create.c -lrados -o create
>>>>> >     $ ./create
>>>>> >     wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of size 1000000
>>>>> >     wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of size 1000000
>>>>> >     [...]
>>>>> >     wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000, 1mobj_19998 of size 1000000
>>>>> >     wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000, 1mobj_19999 of size 1000000
>>>>> >
>>>>> > Now we read each of these objects with the async API, into the same buffer.
>>>>> > First we read just the 10KB objects:
>>>>> >
>>>>> >     #include <assert.h>
>>>>> >     #include <stdio.h>
>>>>> >     #include <stdlib.h>
>>>>> >     #include <string.h>
>>>>> >     #include <rados/librados.h>
>>>>> >
>>>>> >     void readobj(rados_ioctx_t* io, char objname[]);
>>>>> >
>>>>> >     int main() {
>>>>> >         rados_t cluster;
>>>>> >         rados_create(&cluster, "test");
>>>>> >         rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>>>>> >         rados_connect(cluster);
>>>>> >
>>>>> >         rados_ioctx_t io;
>>>>> >         rados_ioctx_create(cluster, "test", &io);
>>>>> >
>>>>> >         char smallobj_name[16];
>>>>> >         int i, total_bytes_read = 0;
>>>>> >
>>>>> >         for (i = 0; i < 20000; i++) {
>>>>> >             sprintf(smallobj_name, "10kobj_%d", i);
>>>>> >             readobj(&io, smallobj_name);
>>>>> >
>>>>> >             total_bytes_read += 10000;
>>>>> >             printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
>>>>> >         }
>>>>> >
>>>>> >         getchar();
>>>>> >         return 0;
>>>>> >     }
>>>>> >
>>>>> >     void readobj(rados_ioctx_t* io, char objname[]) {
>>>>> >         char data[1000000];
>>>>> >         unsigned long bytes_read;
>>>>> >         rados_completion_t completion;
>>>>> >         int retval;
>>>>> >
>>>>> >         rados_read_op_t read_op = rados_create_read_op();
>>>>> >         rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
>>>>> >         retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
>>>>> >         assert(retval == 0);
>>>>> >
>>>>> >         retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
>>>>> >         assert(retval == 0);
>>>>> >
>>>>> >         rados_aio_wait_for_complete(completion);
>>>>> >         rados_aio_get_return_value(completion);
>>>>> >     }
>>>>> >
>>>>> >     $ gcc read.c -lrados -o read_small -Wall -g && ./read_small
>>>>> >     Read 10kobj_0 for total 10000
>>>>> >     Read 10kobj_1 for total 20000
>>>>> >     [...]
>>>>> >     Read 10kobj_19998 for total 199990000
>>>>> >     Read 10kobj_19999 for total 200000000
>>>>> >
>>>>> > We read 200MB.
>>>>> > A graph of the resident set size of the program is attached as mem-graph-10k.png, with seconds on the x axis and KB on the y axis. You can see that the memory usage increases throughout, which itself is unexpected since that memory should be freed over time and we should only hold 10KB of object data in memory at a time. The rate of growth decreases and eventually stabilises, and by the end we've used 60MB of RAM.
>>>>> >
>>>>> > We repeat this experiment for the 100KB and 1MB objects and find that after all reads they use 140MB and 500MB of RAM respectively, and memory usage presumably would continue to grow if there were more objects. This is orders of magnitude more memory than what I would expect these programs to use.
>>>>> >
>>>>> > * We do not get this behaviour with the synchronous API; the memory usage remains stable at just a few MB.
>>>>> > * We've found that for some reason this doesn't happen (or doesn't happen as severely) if we intersperse large reads with much smaller reads. In this case, the memory usage seems to stabilise at a reasonable number.
>>>>> > * Valgrind only reports a trivial amount of unreachable memory.
>>>>> > * Memory usage doesn't increase in this manner if we repeatedly read the same object over and over again. It hovers around 20MB.
>>>>> > * In other experiments we've done, with different object data and distributions of object sizes, we've seen memory usage grow even larger in proportion to the amount of data read.
>>>>> >
>>>>> > We maintain long-running (order of weeks) services that read objects from Ceph and send them elsewhere. Over time, the memory usage of some of these services has grown to more than 6GB, which is unreasonable.
>>>>> > >>>>> > -- >>>>> > Regards, >>>>> > Dan G >>>>> > >>>>> > >>>>> > _______________________________________________ >>>>> > ceph-users mailing list >>>>> > ceph-users@lists.ceph.com >>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>> >>>>> It looks like the async example is missing calls to >>>>> rados_aio_release() >>>>> to clean up the completions. I'm not sure that would account for all >>>>> of >>>>> the memory growth, but that's where I would start. Past that, running >>>>> the client under valgrind massif should help with further >>>>> investigation. >>>>> >>>>> Casey >>>>> _______________________________________________ >>>>> ceph-users mailing list >>>>> ceph-users@lists.ceph.com >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>> >>>> _______________________________________________ >>>> ceph-users mailing list >>>> ceph-users@lists.ceph.com >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>> >>> >>> >>> -- >>> Regards, >>> Dan Goldbach >>> >> > > -- > Regards, > Dan Goldbach >