Awesome! What version are you running (ceph-osd -v, include the hash)? -Sam
On Mon, Jan 7, 2013 at 11:03 AM, Dave Spano <dsp...@optogenics.com> wrote:
> This failed the first time I sent it, so I'm resending in plain text.
>
> Dave Spano
> Optogenics
> Systems Administrator
>
> ----- Original Message -----
>
> From: "Dave Spano" <dsp...@optogenics.com>
> To: "Sébastien Han" <han.sebast...@gmail.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Samuel Just" <sam.j...@inktank.com>
> Sent: Monday, January 7, 2013 12:40:06 PM
> Subject: Re: OSD memory leaks?
>
> Sam,
>
> Attached are some heaps that I collected today. 001 and 003 are from just
> after I started the profiler; 011 is the most recent. If you need more, or
> anything different, let me know. The OSD in question is already at 38%
> memory usage. As mentioned by Sébastien, restarting ceph-osd keeps things
> going.
>
> Not sure if this is helpful information, but of the two OSDs that I have
> running, the first one (osd.0) is the one that develops this problem the
> quickest. osd.1 has the same issue; it just takes much longer. Do the
> monitors hit the first osd in the list first when there's activity?
>
> Dave Spano
> Optogenics
> Systems Administrator
>
> ----- Original Message -----
>
> From: "Sébastien Han" <han.sebast...@gmail.com>
> To: "Samuel Just" <sam.j...@inktank.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Friday, January 4, 2013 10:20:58 AM
> Subject: Re: OSD memory leaks?
>
> Hi Sam,
>
> Thanks for your answer, and sorry for the late reply.
>
> Unfortunately I can't get anything useful out of the profiler. Actually I
> do get output, but I suspect it doesn't show what it is supposed to
> show... I will keep trying. Anyway, yesterday I wondered whether the
> problem might be due to overuse of some OSDs: if the distribution of
> primary OSDs were uneven, that could explain why the memory leaks are
> larger on some servers. In the end the distribution seems even, but while
> looking at the pg dump I found something interesting in the scrub column:
> the timestamps of the last scrubbing operations matched the times shown on
> the graph.
>
> After this I did some arithmetic: I compared each node's total number of
> scrubbing operations with the number that fell within the time range where
> the memory leaks occurred. First of all, here is my setup:
>
> root@c2-ceph-01 ~ # ceph osd tree
> dumped osdmap tree epoch 859
> # id  weight  type name          up/down  reweight
> -1    12      pool default
> -3    12        rack lc2_rack33
> -2    3           host c2-ceph-01
> 0     1             osd.0        up       1
> 1     1             osd.1        up       1
> 2     1             osd.2        up       1
> -4    3           host c2-ceph-04
> 10    1             osd.10       up       1
> 11    1             osd.11       up       1
> 9     1             osd.9        up       1
> -5    3           host c2-ceph-02
> 3     1             osd.3        up       1
> 4     1             osd.4        up       1
> 5     1             osd.5        up       1
> -6    3           host c2-ceph-03
> 6     1             osd.6        up       1
> 7     1             osd.7        up       1
> 8     1             osd.8        up       1
>
> And here are the results:
>
> * Ceph node 1, which has the most significant memory leak, performed 1608
> scrubs in total, 1059 of them during the time range where the memory
> leaks occurred
> * Ceph node 2: 1168 in total, 776 during that time range
> * Ceph node 3: 940 in total, 94 during that time range
> * Ceph node 4: 899 in total, 191 during that time range
>
> I'm still not entirely sure that the scrub operation causes the leak, but
> it's the only relevant correlation I have found...
>
> Could it be that the scrubbing process doesn't release memory? Btw, I was
> wondering how ceph decides at what time it should run the scrubbing
> operation. I know that it runs once a day and is controlled by the
> following options:
>
> OPTION(osd_scrub_min_interval, OPT_FLOAT, 300)
> OPTION(osd_scrub_max_interval, OPT_FLOAT, 60*60*24)
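>
> If one wanted to spread scrubs out while chasing this, those defaults can
> be overridden in ceph.conf. A rough sketch (untested here, values in
> seconds; the option names simply mirror the OPTION() entries above):
>
> [osd]
>     ; minimum interval between scrubs of a given PG while load is low
>     osd scrub min interval = 300
>     ; scrub each PG at least this often, regardless of load
>     osd scrub max interval = 86400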
> But how is the time at which the operation starts determined? During
> cluster creation, probably?
>
> I also checked the options that control OSD scrubbing and found that by
> default:
>
> OPTION(osd_max_scrubs, OPT_INT, 1)
>
> So that might explain why only one OSD uses a lot of memory.
>
> My dirty workaround at the moment is to perform a check of the memory
> used by every OSD and restart the daemon if it uses more than 25% of the
> total memory (a rough sketch of such a check is at the very end of this
> thread). Also note that on ceph 1, 3 and 4 it's always a single OSD that
> uses a lot of memory; on ceph 2 the memory usage is also high, but it is
> almost the same across all the OSD processes.
>
> Thank you in advance.
>
> --
> Regards,
> Sébastien Han.
>
> On Wed, Dec 19, 2012 at 10:43 PM, Samuel Just <sam.j...@inktank.com> wrote:
>>
>> Sorry, it's been very busy. The next step would be to try to get a heap
>> dump. You can start a heap profile on osd N with:
>>
>> ceph osd tell N heap start_profiler
>>
>> and you can get it to dump the collected profile using:
>>
>> ceph osd tell N heap dump
>>
>> The dumps should show up in the osd log directory.
>>
>> Assuming the heap profiler is working correctly, you can look at the
>> dumps using pprof from google-perftools.
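>>
>> For example (an illustrative invocation only: adjust the binary path and
>> dump names to whatever actually shows up in your log directory; on
>> Debian/Ubuntu the tool may be installed as google-pprof):
>>
>>   pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0011.heap
>>
>> or, to see only what grew between two dumps:
>>
>>   pprof --text --base=/var/log/ceph/osd.0.profile.0001.heap \
>>       /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0011.heap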
>> On Wed, Dec 19, 2012 at 8:37 AM, Sébastien Han <han.sebast...@gmail.com> wrote:
>> > No more suggestions? :(
>> > --
>> > Regards,
>> > Sébastien Han.
>> >
>> > On Tue, Dec 18, 2012 at 6:21 PM, Sébastien Han <han.sebast...@gmail.com> wrote:
>> >> Nothing terrific...
>> >>
>> >> The kernel logs on my clients are full of "libceph: osd4
>> >> 172.20.11.32:6801 socket closed"
>> >>
>> >> I saw this somewhere on the tracker.
>> >>
>> >> Does it do any harm?
>> >>
>> >> Thanks.
>> >>
>> >> --
>> >> Regards,
>> >> Sébastien Han.
>> >>
>> >> On Mon, Dec 17, 2012 at 11:55 PM, Samuel Just <sam.j...@inktank.com> wrote:
>> >>>
>> >>> What is the workload like?
>> >>> -Sam
>> >>>
>> >>> On Mon, Dec 17, 2012 at 2:41 PM, Sébastien Han <han.sebast...@gmail.com> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > No, I don't see anything abnormal in the network stats, and I don't
>> >>> > see anything in the logs... :(
>> >>> > The weird thing is that one node out of 4 seems to take far more
>> >>> > memory than the others...
>> >>> >
>> >>> > --
>> >>> > Regards,
>> >>> > Sébastien Han.
>> >>> >
>> >>> > On Mon, Dec 17, 2012 at 7:12 PM, Samuel Just <sam.j...@inktank.com> wrote:
>> >>> >>
>> >>> >> Are you having network hiccups? There was a bug noticed recently
>> >>> >> that could cause a memory leak if nodes are being marked up and
>> >>> >> down.
>> >>> >> -Sam
>> >>> >>
>> >>> >> On Mon, Dec 17, 2012 at 12:28 AM, Sébastien Han <han.sebast...@gmail.com> wrote:
>> >>> >> > Hi guys,
>> >>> >> >
>> >>> >> > Today, looking at my graphs, I noticed that one of my 4 ceph
>> >>> >> > nodes uses a lot of memory, and it keeps growing and growing.
>> >>> >> > See the graph attached to this mail.
>> >>> >> > I run 0.48.2 on Ubuntu 12.04.
>> >>> >> >
>> >>> >> > The other nodes' usage also grows, but more slowly than the
>> >>> >> > first one's.
>> >>> >> >
>> >>> >> > I'm not quite sure what information I should provide, so let me
>> >>> >> > know. The only thing I can say is that the load hasn't increased
>> >>> >> > much this week. The node seems to be consuming memory and never
>> >>> >> > giving it back.
>> >>> >> >
>> >>> >> > Thank you in advance.
>> >>> >> >
>> >>> >> > --
>> >>> >> > Regards,
>> >>> >> > Sébastien Han.
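For reference, regarding the "restart an OSD above 25% memory" workaround
mentioned earlier in the thread: a crude watchdog along those lines, run
from cron every few minutes, might look like the sketch below. It is only
an illustration; the threshold, the way the OSD id is extracted from the
command line, and the sysvinit-style restart are all assumptions to adapt
to your own setup.

#!/bin/sh
# Restart any ceph-osd whose resident memory exceeds 25% of total RAM.
# Sketch only; assumes the 0.48-era sysvinit script ("service ceph restart osd.N").
for pid in $(pgrep ceph-osd); do
    mem=$(ps -o pmem= -p "$pid" | tr -d ' ')    # %MEM, e.g. "38.2"
    id=$(ps -o args= -p "$pid" | sed -n 's/.*-i *\([0-9][0-9]*\).*/\1/p')
    if [ -n "$id" ] && [ "${mem%.*}" -ge 25 ]; then
        service ceph restart osd."$id"
    fi
done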