Re: 0.56 scrub OSD memleaks, WAS Re: [0.48.3] OSD memory leak when scrubbing

2013-02-19 Thread Christopher Kunz
Am 19.02.13 20:23, schrieb Samuel Just:
> Can you confirm that the memory size reported is res?
> -Sam

I think it was virtual, since it was the SIZE column in ps.
However, we ran into massive slow-request issues as soon as the memory
started ballooning.
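A quick way to compare the two figures ps reports (VSZ is virtual size, RSS is resident set, both in KiB); the current shell is used here as a stand-in for the OSD PID:

```shell
# VSZ = virtual size, RSS = resident set size, both in KiB.
# Inspecting the current shell ($$) as a stand-in; substitute the ceph-osd PID.
ps -o pid,vsz,rss,comm -p $$
```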

--ck

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 0.56 scrub OSD memleaks, WAS Re: [0.48.3] OSD memory leak when scrubbing

2013-02-19 Thread Samuel Just
Can you confirm that the memory size reported is res?
-Sam

On Mon, Feb 18, 2013 at 8:46 AM, Christopher Kunz  wrote:
> Am 16.02.13 10:09, schrieb Wido den Hollander:
>> On 02/16/2013 08:09 AM, Andrey Korolyov wrote:
>>> Can anyone who hit this bug please confirm that your system contains
>>> libc 2.15+?
>>>
>>
> Hello,
>
> when we started a deep scrub on our 0.56.2 cluster today, we saw a
> massive memleak about 1 hour into the scrub. One OSD claimed over
> 53GByte within 10 minutes. We had to restart the OSD to keep the cluster
> stable.
>
> Another OSD is currently claiming about 27GByte and will be restarted
> soon. All circumstantial evidence points to the deep scrub as the source
> of the leak.
>
> One affected node is running libc 2.15 (Ubuntu 12.04 LTS), the other one
> is using libc 2.11.3 (Debian Squeeze). So it seems this is not a
> libc-dependent issue.
>
> We have disabled scrub completely.
>
> Regards,
>
> --ck
>
> PS: Do we have any idea when this will be fixed?


0.56 scrub OSD memleaks, WAS Re: [0.48.3] OSD memory leak when scrubbing

2013-02-18 Thread Christopher Kunz
Am 16.02.13 10:09, schrieb Wido den Hollander:
> On 02/16/2013 08:09 AM, Andrey Korolyov wrote:
>> Can anyone who hit this bug please confirm that your system contains
>> libc 2.15+?
>>
> 
Hello,

when we started a deep scrub on our 0.56.2 cluster today, we saw a
massive memleak about 1 hour into the scrub. One OSD claimed over
53GByte within 10 minutes. We had to restart the OSD to keep the cluster
stable.

Another OSD is currently claiming about 27GByte and will be restarted
soon. All circumstantial evidence points to the deep scrub as the source
of the leak.

One affected node is running libc 2.15 (Ubuntu 12.04 LTS), the other one
is using libc 2.11.3 (Debian Squeeze). So it seems this is not a
libc-dependent issue.

We have disabled scrub completely.
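For reference, these are the runtime knobs Sylvain used later in the thread to push scrubbing out (intervals in seconds; the defaults can be read back with `ceph-osd --show-config | grep osd_scrub`):

```shell
# Push the scrub intervals out on all OSDs at runtime (values as quoted in
# Sylvain's message; requires a running cluster and admin keyring):
ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
```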

Regards,

--ck

PS: Do we have any idea when this will be fixed?


Re: [0.48.3] OSD memory leak when scrubbing

2013-02-17 Thread Sébastien Han
+1
--
Regards,
Sébastien Han.


On Sat, Feb 16, 2013 at 10:09 AM, Wido den Hollander  wrote:
> On 02/16/2013 08:09 AM, Andrey Korolyov wrote:
>>
>> Can anyone who hit this bug please confirm that your system contains libc
>> 2.15+?
>>
>
> I've seen this with 0.56.2 as well on Ubuntu 12.04. Ubuntu 12.04 comes with
> 2.15-0ubuntu10.3
>
> Haven't gotten around to adding a heap profiler to it.
>
> Wido
>
>
>> On Tue, Feb 5, 2013 at 1:27 AM, Sébastien Han 
>> wrote:
>>>
>>> oh nice, the pattern also matches path :D, didn't know that
>>> thanks Greg
>>> --
>>> Regards,
>>> Sébastien Han.
>>>
>>>
>>> On Mon, Feb 4, 2013 at 10:22 PM, Gregory Farnum  wrote:

 Set your /proc/sys/kernel/core_pattern file. :)
 http://linux.die.net/man/5/core
 -Greg

 On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han 
 wrote:
>
> ok I finally managed to get something on my test cluster,
> unfortunately, the dump goes to /
>
> any idea to change the destination path?
>
> My production / won't be big enough...
>
> --
> Regards,
> Sébastien Han.
>
>
> On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick  wrote:
>>
>> ...and/or do you have the corepath set interestingly, or one of the
>> core-trapping mechanisms turned on?
>>
>>
>> On 02/04/2013 11:29 AM, Sage Weil wrote:
>>>
>>>
>>> On Mon, 4 Feb 2013, Sébastien Han wrote:


 Hum just tried several times on my test cluster and I can't get any
 core dump. Does Ceph commit suicide or something? Is it expected
 behavior?
>>>
>>>
>>>
>>> SIGSEGV should trigger the usual path that dumps a stack trace and
>>> then
>>> dumps core.  Was your ulimit -c set before the daemon was started?
>>>
>>> sage
>>>
>>>
>>>
 --
 Regards,
 Sébastien Han.


 On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han
 
 wrote:
>
>
> Hi Loïc,
>
> Thanks for bringing our discussion on the ML. I'll check that
> tomorrow
> :-).
>
> Cheer
> --
> Regards,
> Sébastien Han.
>
>
> On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han
> 
> wrote:
>>
>>
>> Hi Loïc,
>>
>> Thanks for bringing our discussion on the ML. I'll check that
>> tomorrow
>> :-).
>>
>> Cheers
>>
>> --
>> Regards,
>> Sébastien Han.
>>
>>
>> On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary 
>> wrote:
>>>
>>>
>>>
>>> Hi,
>>>
>>> As discussed during FOSDEM, the script you wrote to kill the OSD
>>> when
>>> it
>>> grows too much could be amended to core dump instead of just
>>> being
>>> killed &
>>> restarted. The binary + core could probably be used to figure out
>>> where the
>>> leak is.
>>>
>>> You should make sure the OSD current working directory is in a
>>> file
>>> system
>>> with enough free disk space to accommodate the dump and set
>>>
>>> ulimit -c unlimited
>>>
>>> before running it ( your system default is probably ulimit -c 0
>>> which
>>> inhibits core dumps ). When you detect that OSD grows too much
>>> kill it
>>> with
>>>
>>> kill -SEGV $pid
>>>
>>> and upload the core found in the working directory, together with
>>> the
>>> binary in a public place. If the osd binary is compiled with -g
>>> but
>>> without
>>> changing the -O settings, you should have a larger binary file
>>> but no
>>> negative impact on performance. Forensic analysis will be made
>>> a lot
>>> easier with the debugging symbols.
>>>
>>> My 2cts
>>>
>>> On 01/31/2013 08:57 PM, Sage Weil wrote:


 On Thu, 31 Jan 2013, Sylvain Munaut wrote:
>
>
> Hi,
>
> I disabled scrubbing using
>
>> ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
>> ceph osd tell \* injectargs '--osd-scrub-max-interval
>> 1000'
>
>
>
> and the leak seems to be gone.
>
> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD
> memory
> for the 12 osd processes over the last 3.5 days.
> Memory was rising every 24h. I did the change yesterday around
> 13h00
> and OSDs stopped growing. OSD memory even seems to go down
> slowly by
> small blocks.

Re: [0.48.3] OSD memory leak when scrubbing

2013-02-16 Thread Wido den Hollander

On 02/16/2013 08:09 AM, Andrey Korolyov wrote:

Can anyone who hit this bug please confirm that your system contains libc 2.15+?



I've seen this with 0.56.2 as well on Ubuntu 12.04. Ubuntu 12.04 comes 
with 2.15-0ubuntu10.3


Haven't gotten around to adding a heap profiler to it.

Wido


On Tue, Feb 5, 2013 at 1:27 AM, Sébastien Han  wrote:

oh nice, the pattern also matches path :D, didn't know that
thanks Greg
--
Regards,
Sébastien Han.


On Mon, Feb 4, 2013 at 10:22 PM, Gregory Farnum  wrote:

Set your /proc/sys/kernel/core_pattern file. :) http://linux.die.net/man/5/core
-Greg

On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han  wrote:

ok I finally managed to get something on my test cluster,
unfortunately, the dump goes to /

any idea to change the destination path?

My production / won't be big enough...

--
Regards,
Sébastien Han.


On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick  wrote:

...and/or do you have the corepath set interestingly, or one of the
core-trapping mechanisms turned on?


On 02/04/2013 11:29 AM, Sage Weil wrote:


On Mon, 4 Feb 2013, Sébastien Han wrote:


Hum just tried several times on my test cluster and I can't get any
core dump. Does Ceph commit suicide or something? Is it expected
behavior?



SIGSEGV should trigger the usual path that dumps a stack trace and then
dumps core.  Was your ulimit -c set before the daemon was started?

sage




--
Regards,
Sébastien Han.


On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han 
wrote:


Hi Loïc,

Thanks for bringing our discussion on the ML. I'll check that tomorrow
:-).

Cheer
--
Regards,
Sébastien Han.


On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han 
wrote:


Hi Loïc,

Thanks for bringing our discussion on the ML. I'll check that tomorrow
:-).

Cheers

--
Regards,
Sébastien Han.


On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary  wrote:



Hi,

As discussed during FOSDEM, the script you wrote to kill the OSD when
it
grows too much could be amended to core dump instead of just being
killed &
restarted. The binary + core could probably be used to figure out
where the
leak is.

You should make sure the OSD current working directory is in a file
system
with enough free disk space to accommodate the dump and set

ulimit -c unlimited

before running it ( your system default is probably ulimit -c 0 which
inhibits core dumps ). When you detect that OSD grows too much kill it
with

kill -SEGV $pid

and upload the core found in the working directory, together with the
binary in a public place. If the osd binary is compiled with -g but
without
changing the -O settings, you should have a larger binary file but no
negative impact on performance. Forensic analysis will be made a lot
easier with the debugging symbols.

My 2cts

On 01/31/2013 08:57 PM, Sage Weil wrote:


On Thu, 31 Jan 2013, Sylvain Munaut wrote:


Hi,

I disabled scrubbing using


ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'



and the leak seems to be gone.

See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD
memory
for the 12 osd processes over the last 3.5 days.
Memory was rising every 24h. I did the change yesterday around 13h00
and OSDs stopped growing. OSD memory even seems to go down slowly by
small blocks.

Of course I assume disabling scrubbing is not a long term solution
and
I should re-enable it ... (how do I do that btw ? what were the
default values for those parameters)



It depends on the exact commit you're on.  You can see the defaults
if
you
do

   ceph-osd --show-config | grep osd_scrub

Thanks for testing this... I have a few other ideas to try to
reproduce.

sage



--
Loïc Dachary, Artisan Logiciel Libre











--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: [0.48.3] OSD memory leak when scrubbing

2013-02-15 Thread Andrey Korolyov
Can anyone who hit this bug please confirm that your system contains libc 2.15+?

On Tue, Feb 5, 2013 at 1:27 AM, Sébastien Han  wrote:
> oh nice, the pattern also matches path :D, didn't know that
> thanks Greg
> --
> Regards,
> Sébastien Han.
>
>
> On Mon, Feb 4, 2013 at 10:22 PM, Gregory Farnum  wrote:
>> Set your /proc/sys/kernel/core_pattern file. :) 
>> http://linux.die.net/man/5/core
>> -Greg
>>
>> On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han  
>> wrote:
>>> ok I finally managed to get something on my test cluster,
>>> unfortunately, the dump goes to /
>>>
>>> any idea to change the destination path?
>>>
>>> My production / won't be big enough...
>>>
>>> --
>>> Regards,
>>> Sébastien Han.
>>>
>>>
>>> On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick  wrote:
 ...and/or do you have the corepath set interestingly, or one of the
 core-trapping mechanisms turned on?


 On 02/04/2013 11:29 AM, Sage Weil wrote:
>
> On Mon, 4 Feb 2013, Sébastien Han wrote:
>>
>> Hum just tried several times on my test cluster and I can't get any
>> core dump. Does Ceph commit suicide or something? Is it expected
>> behavior?
>
>
> SIGSEGV should trigger the usual path that dumps a stack trace and then
> dumps core.  Was your ulimit -c set before the daemon was started?
>
> sage
>
>
>
>> --
>> Regards,
>> Sébastien Han.
>>
>>
>> On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han 
>> wrote:
>>>
>>> Hi Loïc,
>>>
>>> Thanks for bringing our discussion on the ML. I'll check that tomorrow
>>> :-).
>>>
>>> Cheer
>>> --
>>> Regards,
>>> Sébastien Han.
>>>
>>>
>>> On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han 
>>> wrote:

 Hi Loïc,

 Thanks for bringing our discussion on the ML. I'll check that tomorrow
 :-).

 Cheers

 --
 Regards,
 Sébastien Han.


 On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary  wrote:
>
>
> Hi,
>
> As discussed during FOSDEM, the script you wrote to kill the OSD when
> it
> grows too much could be amended to core dump instead of just being
> killed &
> restarted. The binary + core could probably be used to figure out
> where the
> leak is.
>
> You should make sure the OSD current working directory is in a file
> system
> with enough free disk space to accommodate the dump and set
>
> ulimit -c unlimited
>
> before running it ( your system default is probably ulimit -c 0 which
> inhibits core dumps ). When you detect that OSD grows too much kill it
> with
>
> kill -SEGV $pid
>
> and upload the core found in the working directory, together with the
> binary in a public place. If the osd binary is compiled with -g but
> without
> changing the -O settings, you should have a larger binary file but no
> negative impact on performance. Forensic analysis will be made a lot
> easier with the debugging symbols.
>
> My 2cts
>
> On 01/31/2013 08:57 PM, Sage Weil wrote:
>>
>> On Thu, 31 Jan 2013, Sylvain Munaut wrote:
>>>
>>> Hi,
>>>
>>> I disabled scrubbing using
>>>
 ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
 ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
>>>
>>>
>>> and the leak seems to be gone.
>>>
>>> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD
>>> memory
>>> for the 12 osd processes over the last 3.5 days.
>>> Memory was rising every 24h. I did the change yesterday around 13h00
>>> and OSDs stopped growing. OSD memory even seems to go down slowly by
>>> small blocks.
>>>
>>> Of course I assume disabling scrubbing is not a long term solution
>>> and
>>> I should re-enable it ... (how do I do that btw ? what were the
>>> default values for those parameters)
>>
>>
>> It depends on the exact commit you're on.  You can see the defaults
>> if
>> you
>> do
>>
>>   ceph-osd --show-config | grep osd_scrub
>>
>> Thanks for testing this... I have a few other ideas to try to
>> reproduce.
>>
>> sage
>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>


Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Sébastien Han
oh nice, the pattern also matches path :D, didn't know that
thanks Greg
--
Regards,
Sébastien Han.


On Mon, Feb 4, 2013 at 10:22 PM, Gregory Farnum  wrote:
> Set your /proc/sys/kernel/core_pattern file. :) 
> http://linux.die.net/man/5/core
> -Greg
>
> On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han  wrote:
>> ok I finally managed to get something on my test cluster,
>> unfortunately, the dump goes to /
>>
>> any idea to change the destination path?
>>
>> My production / won't be big enough...
>>
>> --
>> Regards,
>> Sébastien Han.
>>
>>
>> On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick  wrote:
>>> ...and/or do you have the corepath set interestingly, or one of the
>>> core-trapping mechanisms turned on?
>>>
>>>
>>> On 02/04/2013 11:29 AM, Sage Weil wrote:

 On Mon, 4 Feb 2013, Sébastien Han wrote:
>
> Hum just tried several times on my test cluster and I can't get any
> core dump. Does Ceph commit suicide or something? Is it expected
> behavior?


 SIGSEGV should trigger the usual path that dumps a stack trace and then
 dumps core.  Was your ulimit -c set before the daemon was started?

 sage



> --
> Regards,
> Sébastien Han.
>
>
> On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han 
> wrote:
>>
>> Hi Loïc,
>>
>> Thanks for bringing our discussion on the ML. I'll check that tomorrow
>> :-).
>>
>> Cheer
>> --
>> Regards,
>> Sébastien Han.
>>
>>
>> On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han 
>> wrote:
>>>
>>> Hi Loïc,
>>>
>>> Thanks for bringing our discussion on the ML. I'll check that tomorrow
>>> :-).
>>>
>>> Cheers
>>>
>>> --
>>> Regards,
>>> Sébastien Han.
>>>
>>>
>>> On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary  wrote:


 Hi,

 As discussed during FOSDEM, the script you wrote to kill the OSD when
 it
 grows too much could be amended to core dump instead of just being
 killed &
 restarted. The binary + core could probably be used to figure out
 where the
 leak is.

 You should make sure the OSD current working directory is in a file
 system
 with enough free disk space to accommodate the dump and set

 ulimit -c unlimited

 before running it ( your system default is probably ulimit -c 0 which
 inhibits core dumps ). When you detect that OSD grows too much kill it
 with

 kill -SEGV $pid

 and upload the core found in the working directory, together with the
 binary in a public place. If the osd binary is compiled with -g but
 without
 changing the -O settings, you should have a larger binary file but no
 negative impact on performance. Forensic analysis will be made a lot
 easier with the debugging symbols.

 My 2cts

 On 01/31/2013 08:57 PM, Sage Weil wrote:
>
> On Thu, 31 Jan 2013, Sylvain Munaut wrote:
>>
>> Hi,
>>
>> I disabled scrubbing using
>>
>>> ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
>>> ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
>>
>>
>> and the leak seems to be gone.
>>
>> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD
>> memory
>> for the 12 osd processes over the last 3.5 days.
>> Memory was rising every 24h. I did the change yesterday around 13h00
>> and OSDs stopped growing. OSD memory even seems to go down slowly by
>> small blocks.
>>
>> Of course I assume disabling scrubbing is not a long term solution
>> and
>> I should re-enable it ... (how do I do that btw ? what were the
>> default values for those parameters)
>
>
> It depends on the exact commit you're on.  You can see the defaults
> if
> you
> do
>
>   ceph-osd --show-config | grep osd_scrub
>
> Thanks for testing this... I have a few other ideas to try to
> reproduce.
>
> sage


 --
 Loïc Dachary, Artisan Logiciel Libre

>>>
>
>

Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Gregory Farnum
Set your /proc/sys/kernel/core_pattern file. :) http://linux.die.net/man/5/core
-Greg
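A quick sketch of Greg's suggestion (the `/var/crash` path is an example, not from the thread; writing the pattern needs root):

```shell
# Inspect where the kernel currently writes core dumps; see core(5) for the
# %e (executable name), %p (PID), and %t (timestamp) placeholders.
cat /proc/sys/kernel/core_pattern

# To redirect cores to a roomier filesystem (example path; run as root):
#   mkdir -p /var/crash
#   echo '/var/crash/core.%e.%p.%t' > /proc/sys/kernel/core_pattern
```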

On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han  wrote:
> ok I finally managed to get something on my test cluster,
> unfortunately, the dump goes to /
>
> any idea to change the destination path?
>
> My production / won't be big enough...
>
> --
> Regards,
> Sébastien Han.
>
>
> On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick  wrote:
>> ...and/or do you have the corepath set interestingly, or one of the
>> core-trapping mechanisms turned on?
>>
>>
>> On 02/04/2013 11:29 AM, Sage Weil wrote:
>>>
>>> On Mon, 4 Feb 2013, Sébastien Han wrote:

 Hum just tried several times on my test cluster and I can't get any
 core dump. Does Ceph commit suicide or something? Is it expected
 behavior?
>>>
>>>
>>> SIGSEGV should trigger the usual path that dumps a stack trace and then
>>> dumps core.  Was your ulimit -c set before the daemon was started?
>>>
>>> sage
>>>
>>>
>>>
 --
 Regards,
 Sébastien Han.


 On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han 
 wrote:
>
> Hi Loïc,
>
> Thanks for bringing our discussion on the ML. I'll check that tomorrow
> :-).
>
> Cheer
> --
> Regards,
> Sébastien Han.
>
>
> On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han 
> wrote:
>>
>> Hi Loïc,
>>
>> Thanks for bringing our discussion on the ML. I'll check that tomorrow
>> :-).
>>
>> Cheers
>>
>> --
>> Regards,
>> Sébastien Han.
>>
>>
>> On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary  wrote:
>>>
>>>
>>> Hi,
>>>
>>> As discussed during FOSDEM, the script you wrote to kill the OSD when
>>> it
>>> grows too much could be amended to core dump instead of just being
>>> killed &
>>> restarted. The binary + core could probably be used to figure out
>>> where the
>>> leak is.
>>>
>>> You should make sure the OSD current working directory is in a file
>>> system
>>> with enough free disk space to accommodate the dump and set
>>>
>>> ulimit -c unlimited
>>>
>>> before running it ( your system default is probably ulimit -c 0 which
>>> inhibits core dumps ). When you detect that OSD grows too much kill it
>>> with
>>>
>>> kill -SEGV $pid
>>>
>>> and upload the core found in the working directory, together with the
>>> binary in a public place. If the osd binary is compiled with -g but
>>> without
>>> changing the -O settings, you should have a larger binary file but no
>>> negative impact on performance. Forensic analysis will be made a lot
>>> easier with the debugging symbols.
>>>
>>> My 2cts
>>>
>>> On 01/31/2013 08:57 PM, Sage Weil wrote:

 On Thu, 31 Jan 2013, Sylvain Munaut wrote:
>
> Hi,
>
> I disabled scrubbing using
>
>> ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
>> ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
>
>
> and the leak seems to be gone.
>
> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD
> memory
> for the 12 osd processes over the last 3.5 days.
> Memory was rising every 24h. I did the change yesterday around 13h00
> and OSDs stopped growing. OSD memory even seems to go down slowly by
> small blocks.
>
> Of course I assume disabling scrubbing is not a long term solution
> and
> I should re-enable it ... (how do I do that btw ? what were the
> default values for those parameters)


 It depends on the exact commit you're on.  You can see the defaults
 if
 you
 do

   ceph-osd --show-config | grep osd_scrub

 Thanks for testing this... I have a few other ideas to try to
 reproduce.

 sage
>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>


>>


Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Sébastien Han
ok I finally managed to get something on my test cluster,
unfortunately, the dump goes to /

any idea to change the destination path?

My production / won't be big enough...

--
Regards,
Sébastien Han.


On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick  wrote:
> ...and/or do you have the corepath set interestingly, or one of the
> core-trapping mechanisms turned on?
>
>
> On 02/04/2013 11:29 AM, Sage Weil wrote:
>>
>> On Mon, 4 Feb 2013, Sébastien Han wrote:
>>>
>>> Hum just tried several times on my test cluster and I can't get any
>>> core dump. Does Ceph commit suicide or something? Is it expected
>>> behavior?
>>
>>
>> SIGSEGV should trigger the usual path that dumps a stack trace and then
>> dumps core.  Was your ulimit -c set before the daemon was started?
>>
>> sage
>>
>>
>>
>>> --
>>> Regards,
>>> Sébastien Han.
>>>
>>>
>>> On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han 
>>> wrote:

 Hi Loïc,

 Thanks for bringing our discussion on the ML. I'll check that tomorrow
 :-).

 Cheer
 --
 Regards,
 Sébastien Han.


 On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han 
 wrote:
>
> Hi Loïc,
>
> Thanks for bringing our discussion on the ML. I'll check that tomorrow
> :-).
>
> Cheers
>
> --
> Regards,
> Sébastien Han.
>
>
> On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary  wrote:
>>
>>
>> Hi,
>>
>> As discussed during FOSDEM, the script you wrote to kill the OSD when
>> it
>> grows too much could be amended to core dump instead of just being
>> killed &
>> restarted. The binary + core could probably be used to figure out
>> where the
>> leak is.
>>
>> You should make sure the OSD current working directory is in a file
>> system
>> with enough free disk space to accommodate the dump and set
>>
>> ulimit -c unlimited
>>
>> before running it ( your system default is probably ulimit -c 0 which
>> inhibits core dumps ). When you detect that OSD grows too much kill it
>> with
>>
>> kill -SEGV $pid
>>
>> and upload the core found in the working directory, together with the
>> binary in a public place. If the osd binary is compiled with -g but
>> without
>> changing the -O settings, you should have a larger binary file but no
>> negative impact on performance. Forensic analysis will be made a lot
>> easier with the debugging symbols.
>>
>> My 2cts
>>
>> On 01/31/2013 08:57 PM, Sage Weil wrote:
>>>
>>> On Thu, 31 Jan 2013, Sylvain Munaut wrote:

 Hi,

 I disabled scrubbing using

> ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
> ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'


 and the leak seems to be gone.

 See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD
 memory
 for the 12 osd processes over the last 3.5 days.
 Memory was rising every 24h. I did the change yesterday around 13h00
 and OSDs stopped growing. OSD memory even seems to go down slowly by
 small blocks.

 Of course I assume disabling scrubbing is not a long term solution
 and
 I should re-enable it ... (how do I do that btw ? what were the
 default values for those parameters)
>>>
>>>
>>> It depends on the exact commit you're on.  You can see the defaults
>>> if
>>> you
>>> do
>>>
>>>   ceph-osd --show-config | grep osd_scrub
>>>
>>> Thanks for testing this... I have a few other ideas to try to
>>> reproduce.
>>>
>>> sage
>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
>
>>>
>>>


Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Dan Mick
...and/or do you have the corepath set interestingly, or one of the 
core-trapping mechanisms turned on?


On 02/04/2013 11:29 AM, Sage Weil wrote:

On Mon, 4 Feb 2013, Sébastien Han wrote:

Hum just tried several times on my test cluster and I can't get any
core dump. Does Ceph commit suicide or something? Is it expected
behavior?


SIGSEGV should trigger the usual path that dumps a stack trace and then
dumps core.  Was your ulimit -c set before the daemon was started?

sage




--
Regards,
Sébastien Han.


On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han  wrote:

Hi Loïc,

Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).

Cheer
--
Regards,
Sébastien Han.


On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han  wrote:

Hi Loïc,

Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).

Cheers

--
Regards,
Sébastien Han.


On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary  wrote:


Hi,

As discussed during FOSDEM, the script you wrote to kill the OSD when it
grows too much could be amended to core dump instead of just being killed &
restarted. The binary + core could probably be used to figure out where the
leak is.

You should make sure the OSD current working directory is in a file system
with enough free disk space to accommodate the dump and set

ulimit -c unlimited

before running it ( your system default is probably ulimit -c 0 which
inhibits core dumps ). When you detect that OSD grows too much kill it with

kill -SEGV $pid

and upload the core found in the working directory, together with the
binary in a public place. If the osd binary is compiled with -g but without
changing the -O settings, you should have a larger binary file but no
negative impact on performance. Forensic analysis will be made a lot
easier with the debugging symbols.

My 2cts
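A minimal watchdog along these lines might look like the following sketch. The process name, threshold, and scheduling are assumptions for illustration, not from the thread; the OSD must have been started with `ulimit -c unlimited` for the SIGSEGV to leave a core.

```shell
#!/bin/sh
# Sketch: SIGSEGV-kill any ceph-osd whose resident set exceeds a threshold,
# so it dumps core for later analysis. Threshold is illustrative.
LIMIT_KB=$((20 * 1024 * 1024))   # 20 GiB, in KiB as reported by ps

check_osds() {
    for pid in $(pgrep -x ceph-osd); do
        rss=$(ps -o rss= -p "$pid")
        if [ "$rss" -gt "$LIMIT_KB" ]; then
            echo "ceph-osd $pid rss=${rss} KiB over limit, sending SIGSEGV"
            kill -SEGV "$pid"    # triggers the stack-trace handler + core dump
        fi
    done
}

# Run periodically, e.g. from cron or: while sleep 60; do check_osds; done
check_osds
```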

On 01/31/2013 08:57 PM, Sage Weil wrote:

On Thu, 31 Jan 2013, Sylvain Munaut wrote:

Hi,

I disabled scrubbing using


ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'


and the leak seems to be gone.

See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
for the 12 osd processes over the last 3.5 days.
Memory was rising every 24h. I did the change yesterday around 13h00
and OSDs stopped growing. OSD memory even seems to go down slowly by
small blocks.

Of course I assume disabling scrubbing is not a long term solution and
I should re-enable it ... (how do I do that btw ? what were the
default values for those parameters)


It depends on the exact commit you're on.  You can see the defaults if
you
do

  ceph-osd --show-config | grep osd_scrub

Thanks for testing this... I have a few other ideas to try to reproduce.

sage


--
Loïc Dachary, Artisan Logiciel Libre











Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Sage Weil
On Mon, 4 Feb 2013, Sébastien Han wrote:
> Hum just tried several times on my test cluster and I can't get any
> core dump. Does Ceph commit suicide or something? Is it expected
> behavior?

SIGSEGV should trigger the usual path that dumps a stack trace and then 
dumps core.  Was your ulimit -c set before the daemon was started?

sage



> --
> Regards,
> Sébastien Han.
> 
> 
> On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han wrote:
> > Hi Loïc,
> >
> > Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).
> >
> > Cheer
> > --
> > Regards,
> > Sébastien Han.
> >
> >
> > On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han wrote:
> >> Hi Loïc,
> >>
> >> Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).
> >>
> >> Cheers
> >>
> >> --
> >> Regards,
> >> Sébastien Han.
> >>
> >>
> >> On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary  wrote:
> >>>
> >>> Hi,
> >>>
> >>> As discussed during FOSDEM, the script you wrote to kill the OSD when it
> >>> grows too much could be amended to core dump instead of just being
> >>> killed & restarted. The binary + core could probably be used to figure
> >>> out where the leak is.
> >>>
> >>> You should make sure the OSD's current working directory is in a file
> >>> system with enough free disk space to accommodate the dump, and set
> >>>
> >>> ulimit -c unlimited
> >>>
> >>> before running it (your system default is probably ulimit -c 0, which
> >>> inhibits core dumps). When you detect that the OSD grows too much, kill
> >>> it with
> >>>
> >>> kill -SEGV $pid
> >>>
> >>> and upload the core found in the working directory, together with the
> >>> binary, to a public place. If the osd binary is compiled with -g but
> >>> without changing the -O settings, you should have a larger binary file
> >>> but no negative impact on performance. Forensic analysis will be made a
> >>> lot easier with the debugging symbols.
> >>>
> >>> My 2cts
> >>>
> >>> On 01/31/2013 08:57 PM, Sage Weil wrote:
> >>> > On Thu, 31 Jan 2013, Sylvain Munaut wrote:
> >>> >> Hi,
> >>> >>
> >>> >> I disabled scrubbing using
> >>> >>
> >>> >>> ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
> >>> >>> ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
> >>> >>
> >>> >> and the leak seems to be gone.
> >>> >>
> >>> >> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
> >>> >> for the 12 osd processes over the last 3.5 days.
> >>> >> Memory was rising every 24h. I did the change yesterday around 13h00
> >>> >> and OSDs stopped growing. OSD memory even seems to go down slowly by
> >>> >> small blocks.
> >>> >>
> >>> >> Of course I assume disabling scrubbing is not a long term solution and
> >>> >> I should re-enable it ... (how do I do that btw ? what were the
> >>> >> default values for those parameters)
> >>> >
> >>> > It depends on the exact commit you're on.  You can see the defaults if
> >>> > you
> >>> > do
> >>> >
> >>> >  ceph-osd --show-config | grep osd_scrub
> >>> >
> >>> > Thanks for testing this... I have a few other ideas to try to reproduce.
> >>> >
> >>> > sage
> >>>
> >>> --
> >>> Loïc Dachary, Artisan Logiciel Libre
> >>>
> >>
> 
> 


Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Sébastien Han
Hmm, I just tried several times on my test cluster and I can't get any
core dump. Does Ceph commit suicide or something? Is it expected
behavior?
--
Regards,
Sébastien Han.


On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han  wrote:
> Hi Loïc,
>
> Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).
>
> Cheer
> --
> Regards,
> Sébastien Han.
>
>
> On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han  
> wrote:
>> Hi Loïc,
>>
>> Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).
>>
>> Cheers
>>
>> --
>> Regards,
>> Sébastien Han.
>>
>>
>> On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary  wrote:
>>>
>>> Hi,
>>>
>>> As discussed during FOSDEM, the script you wrote to kill the OSD when it
>>> grows too much could be amended to core dump instead of just being killed &
>>> restarted. The binary + core could probably be used to figure out where the
>>> leak is.
>>>
>>> You should make sure the OSD current working directory is in a file system
>>> with enough free disk space to accommodate the dump and set
>>>
>>> ulimit -c unlimited
>>>
>>> before running it ( your system default is probably ulimit -c 0 which
>>> inhibits core dumps ). When you detect that OSD grows too much kill it with
>>>
>>> kill -SEGV $pid
>>>
>>> and upload the core found in the working directory, together with the
>>> binary in a public place. If the osd binary is compiled with -g but without
>>> changing the -O settings, you should have a larger binary file but no
>>> negative impact on performance. Forensic analysis will be made a lot
>>> easier with the debugging symbols.
>>>
>>> My 2cts
>>>
>>> On 01/31/2013 08:57 PM, Sage Weil wrote:
>>> > On Thu, 31 Jan 2013, Sylvain Munaut wrote:
>>> >> Hi,
>>> >>
>>> >> I disabled scrubbing using
>>> >>
>>> >>> ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
>>> >>> ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
>>> >>
>>> >> and the leak seems to be gone.
>>> >>
>>> >> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
>>> >> for the 12 osd processes over the last 3.5 days.
>>> >> Memory was rising every 24h. I did the change yesterday around 13h00
>>> >> and OSDs stopped growing. OSD memory even seems to go down slowly by
>>> >> small blocks.
>>> >>
>>> >> Of course I assume disabling scrubbing is not a long term solution and
>>> >> I should re-enable it ... (how do I do that btw ? what were the
>>> >> default values for those parameters)
>>> >
>>> > It depends on the exact commit you're on.  You can see the defaults if
>>> > you
>>> > do
>>> >
>>> >  ceph-osd --show-config | grep osd_scrub
>>> >
>>> > Thanks for testing this... I have a few other ideas to try to reproduce.
>>> >
>>> > sage
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>


Re: [0.48.3] OSD memory leak when scrubbing

2013-02-03 Thread Sébastien Han
Hi Loïc,

Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).

Cheers
--
Regards,
Sébastien Han.


On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han  wrote:
> Hi Loïc,
>
> Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).
>
> Cheers
>
> --
> Regards,
> Sébastien Han.
>
>
> On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary  wrote:
>>
>> Hi,
>>
>> As discussed during FOSDEM, the script you wrote to kill the OSD when it
>> grows too much could be amended to core dump instead of just being killed &
>> restarted. The binary + core could probably be used to figure out where the
>> leak is.
>>
>> You should make sure the OSD current working directory is in a file system
>> with enough free disk space to accommodate the dump and set
>>
>> ulimit -c unlimited
>>
>> before running it ( your system default is probably ulimit -c 0 which
>> inhibits core dumps ). When you detect that OSD grows too much kill it with
>>
>> kill -SEGV $pid
>>
>> and upload the core found in the working directory, together with the
>> binary in a public place. If the osd binary is compiled with -g but without
>> changing the -O settings, you should have a larger binary file but no
>> negative impact on performance. Forensic analysis will be made a lot
>> easier with the debugging symbols.
>>
>> My 2cts
>>
>> On 01/31/2013 08:57 PM, Sage Weil wrote:
>> > On Thu, 31 Jan 2013, Sylvain Munaut wrote:
>> >> Hi,
>> >>
>> >> I disabled scrubbing using
>> >>
>> >>> ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
>> >>> ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
>> >>
>> >> and the leak seems to be gone.
>> >>
>> >> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
>> >> for the 12 osd processes over the last 3.5 days.
>> >> Memory was rising every 24h. I did the change yesterday around 13h00
>> >> and OSDs stopped growing. OSD memory even seems to go down slowly by
>> >> small blocks.
>> >>
>> >> Of course I assume disabling scrubbing is not a long term solution and
>> >> I should re-enable it ... (how do I do that btw ? what were the
>> >> default values for those parameters)
>> >
>> > It depends on the exact commit you're on.  You can see the defaults if
>> > you
>> > do
>> >
>> >  ceph-osd --show-config | grep osd_scrub
>> >
>> > Thanks for testing this... I have a few other ideas to try to reproduce.
>> >
>> > sage
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
>


Re: [0.48.3] OSD memory leak when scrubbing

2013-02-03 Thread Loic Dachary
Hi,

As discussed during FOSDEM, the script you wrote to kill the OSD when it grows 
too much could be amended to core dump instead of just being killed & 
restarted. The binary + core could probably be used to figure out where the 
leak is.

You should make sure the OSD's current working directory is in a file system with
enough free disk space to accommodate the dump, and set

ulimit -c unlimited

before running it (your system default is probably ulimit -c 0, which inhibits
core dumps). When you detect that the OSD has grown too much, kill it with

kill -SEGV $pid

and upload the core found in the working directory, together with the binary, to
a public place. If the osd binary is compiled with -g but without changing the
-O settings, you should have a larger binary file but no negative impact on
performance. Forensic analysis will be made a lot easier with the debugging
symbols.

My 2cts

On 01/31/2013 08:57 PM, Sage Weil wrote:
> On Thu, 31 Jan 2013, Sylvain Munaut wrote:
>> Hi,
>>
>> I disabled scrubbing using
>>
>>> ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
>>> ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
>>
>> and the leak seems to be gone.
>>
>> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
>> for the 12 osd processes over the last 3.5 days.
>> Memory was rising every 24h. I did the change yesterday around 13h00
>> and OSDs stopped growing. OSD memory even seems to go down slowly by
>> small blocks.
>>
>> Of course I assume disabling scrubbing is not a long term solution and
>> I should re-enable it ... (how do I do that btw ? what were the
>> default values for those parameters)
> 
> It depends on the exact commit you're on.  You can see the defaults if you 
> do
> 
>  ceph-osd --show-config | grep osd_scrub
> 
> Thanks for testing this... I have a few other ideas to try to reproduce.  
> 
> sage

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: [0.48.3] OSD memory leak when scrubbing

2013-01-31 Thread Sage Weil
On Thu, 31 Jan 2013, Sylvain Munaut wrote:
> Hi,
> 
> I disabled scrubbing using
> 
> > ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
> > ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
> 
> and the leak seems to be gone.
> 
> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
> for the 12 osd processes over the last 3.5 days.
> Memory was rising every 24h. I did the change yesterday around 13h00
> and OSDs stopped growing. OSD memory even seems to go down slowly by
> small blocks.
> 
> Of course I assume disabling scrubbing is not a long term solution and
> I should re-enable it ... (how do I do that btw ? what were the
> default values for those parameters)

It depends on the exact commit you're on.  You can see the defaults if you 
do

 ceph-osd --show-config | grep osd_scrub

Thanks for testing this... I have a few other ideas to try to reproduce.  

sage


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-31 Thread Sylvain Munaut
Hi,

> I'm crossing my fingers, but I just noticed that since I upgraded to kernel
> version 3.2.0-36-generic on Ubuntu 12.04 the other day, ceph-osd memory
> usage has stayed stable.

Unfortunately for me, I'm already on 3.2.0-36-generic  (Ubuntu 12.04 as well).

Cheers,

Sylvain


PS: Dave sorry for the double, I forgot reply-to-all ...


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-31 Thread Sylvain Munaut
Hi,

I disabled scrubbing using

> ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
> ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'

and the leak seems to be gone.

See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
for the 12 osd processes over the last 3.5 days.
Memory was rising every 24h. I did the change yesterday around 13h00
and OSDs stopped growing. OSD memory even seems to go down slowly by
small blocks.

Of course I assume disabling scrubbing is not a long term solution and
I should re-enable it ... (how do I do that, btw? What were the
default values for those parameters?)

Cheers,

   Sylvain
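
To answer the re-enable question in a reversible way, one option is to record the compiled-in defaults before injecting the overrides, then inject the saved values back later. A sketch (the file name is arbitrary; the option names are the 0.56-era ones used above):

```shell
# Save the scrub-related defaults so the exact values to restore are on hand.
ceph-osd --show-config 2>/dev/null | grep osd_scrub > ./osd_scrub_defaults.txt

# Effectively disable scrubbing cluster-wide (as done above).
ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'

# To re-enable later, inject the values recorded in osd_scrub_defaults.txt:
#   ceph osd tell \* injectargs "--osd-scrub-min-interval <saved value>"
#   ceph osd tell \* injectargs "--osd-scrub-max-interval <saved value>"
```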


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-30 Thread Sage Weil
On Wed, 30 Jan 2013, Sylvain Munaut wrote:
> Hi,
> 
> 
> > Can you try disabling scrubbing and see if the leak stops?
> >
> > ceph osd tell \* injectargs '--osd-scrub-load-threshold .01'
> >
> > (that will work for 0.56.1, but is fixed in later versions, btw.)  On
> > newer code,
> >
> > ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
> > ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
> 
> Ok, I just did that.
> (I have 0.56.1 + a few more patches from the bobtail branch, up to
> c5fe0965572c07...)
> 
> I'll report back tomorrow.
> 
> 
> > Tracking this via
> >
> > http://tracker.ceph.com/issues/3883
> 
> Should I post the updates on the ML or on the ticket ?

Either or both.  We try to keep the ticket up to date, either way.

Thanks!
s


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-30 Thread Sylvain Munaut
Hi,


> Can you try disabling scrubbing and see if the leak stops?
>
> ceph osd tell \* injectargs '--osd-scrub-load-threshold .01'
>
> (that will work for 0.56.1, but is fixed in later versions, btw.)  On
> newer code,
>
> ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
> ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'

Ok, I just did that.
(I have 0.56.1 + a few more patches from the bobtail branch, up to
c5fe0965572c07...)

I'll report back tomorrow.


> Tracking this via
>
> http://tracker.ceph.com/issues/3883

Should I post the updates on the ML or on the ticket ?

Cheers,

   Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-30 Thread Sage Weil
On Wed, 30 Jan 2013, Sylvain Munaut wrote:
> >> Just to keep you posted,  upgraded our cluster yesterday to a custom
> >> compiled 0.56.1 and it has now been more than 24h and there is no sign
> >> of the memory leak anymore. Previously it would rise by ~100 MB every 24h
> >> almost like clockwork and now, it's been slightly more than 24h and
> >> memory is stable. (it fluctuates, but no large jumps that stay
> >> forever).
> >
> > That's great news.  We've been trying to replicate the argonaut leak here
> > on argonaut and haven't succeeded so far.
> 
> I'm sorry to report that my excitement was premature ...  it didn't
> grow during the first 24h but each day since then has seen a 100 MB
> increase in OSD memory, so pretty much the same behavior as before.
> And again, it happens when scrubbing PGs from the rbd pool.

Can you try disabling scrubbing and see if the leak stops?

ceph osd tell \* injectargs '--osd-scrub-load-threshold .01'

(that will work for 0.56.1, but is fixed in later versions, btw.)  On 
newer code,

ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'

Tracking this via

http://tracker.ceph.com/issues/3883

Thanks!
sage



Re: [0.48.3] OSD memory leak when scrubbing

2013-01-30 Thread Sylvain Munaut
>> Just to keep you posted,  upgraded our cluster yesterday to a custom
>> compiled 0.56.1 and it has now been more than 24h and there is no sign
>> of the memory leak anymore. Previously it would rise by ~100 MB every 24h
>> almost like clockwork and now, it's been slightly more than 24h and
>> memory is stable. (it fluctuates, but no large jumps that stay
>> forever).
>
> That's great news.  We've been trying to replicate the argonaut leak here
> on argonaut and haven't succeeded so far.

I'm sorry to report that my excitement was premature ... it didn't
grow during the first 24h, but each day since then has seen a 100 MB
increase in OSD memory, so pretty much the same behavior as before.
And again, it happens when scrubbing PGs from the rbd pool.


:(

Cheers,

Sylvain
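
A couple of lines of shell are enough to log the per-OSD RSS series that shows this kind of daily growth. A sketch; the log format, the file name, and the use of pidof are assumptions:

```shell
# Append one line per running ceph-osd: epoch seconds, pid, RSS in kB.
log_osd_rss() {
    now=$(date +%s)
    for pid in $(pidof ceph-osd 2>/dev/null); do
        rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
        [ -n "$rss" ] && echo "$now $pid $rss"
    done
    return 0
}

log_osd_rss >> ./ceph-osd-rss.log    # run from cron every few minutes
```

The resulting file graphs directly with gnuplot or a spreadsheet.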


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-27 Thread Sylvain Munaut
Hi,

>> Just to keep you posted,  upgraded our cluster yesterday to a custom
>> compiled 0.56.1 and it has now been more than 24h and there is no sign
>> of the memory leak anymore. Previously it would rise by ~100 MB every 24h
>> almost like clockwork and now, it's been slightly more than 24h and
>> memory is stable. (it fluctuates, but no large jumps that stay
>> forever).
>
> That's great news.  We've been trying to replicate the argonaut leak here
> on argonaut and haven't succeeded so far.

To be entirely complete, I also upgraded the kernel RBD client and
since the leak happened while scrubbing the RBD pool, maybe the client
behavior makes a difference.

Previously they were running kernel 3.6.8, they're now running 3.6.11
with all the ceph related patch from 3.8 backported ( ~ 150 patches ).

Cheers,

Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-27 Thread Sage Weil
On Sun, 27 Jan 2013, Sylvain Munaut wrote:
> Hi,
> 
> Just to keep you posted,  upgraded our cluster yesterday to a custom
> compiled 0.56.1 and it has now been more than 24h and there is no sign
> of the memory leak anymore. Previously it would rise by ~100 MB every 24h
> almost like clockwork and now, it's been slightly more than 24h and
> memory is stable. (it fluctuates, but no large jumps that stay
> forever).

That's great news.  We've been trying to replicate the argonaut leak here 
on argonaut and haven't succeeded so far.

sage


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-27 Thread Sylvain Munaut
Hi,

Just to keep you posted: I upgraded our cluster yesterday to a custom
compiled 0.56.1 and it has now been more than 24h and there is no sign
of the memory leak anymore. Previously it would rise by ~100 MB every 24h
almost like clockwork, and now, after slightly more than 24h, memory is
stable (it fluctuates, but no large jumps that stay forever).

Cheers,

Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-25 Thread Sylvain Munaut
> Could you provide those heaps? Is it possible?

We're updating this weekend to 0.56.1.

If it still happens after the update, I'll try to reproduce it on our
test infra and do the profiling there, because unfortunately running the
profiler seems to make it eat up CPU and RAM a lot ...

I also need to test if it happens when I force a scrub myself, because
I can't let the profiler run the whole day and just wait for it to
happen naturally, so I need a way to trigger a scrub of all PGs on a
given pool.


Cheers,

Sylvain
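
For the "trigger a scrub of all PGs on a given pool" part, one sketch is to filter `ceph pg dump` output by pool id, since PG ids have the form `<pool>.<hash>`. Pool id 3 matches the RBD pool from earlier in the thread; the dump's column layout is an assumption that may vary by version:

```shell
POOL_ID=3    # pool #3 held the RBD images in this thread

# Print the PG ids belonging to one pool, given "ceph pg dump" output on stdin.
pg_ids_for_pool() {
    awk -v p="$1" '$1 ~ "^" p "\\." { print $1 }'
}

ceph pg dump 2>/dev/null | pg_ids_for_pool "$POOL_ID" |
while read -r pgid; do
    ceph pg scrub "$pgid"    # queue a scrub of this PG
done
```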


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-25 Thread Sébastien Han
Hi,

Could you provide those heaps? Is it possible?

--
Regards,
Sébastien Han.


On Tue, Jan 22, 2013 at 10:38 PM, Sébastien Han  wrote:
> Well ideally you want to run the profiler during the scrubbing process
> when the memory leaks appear :-).
> --
> Regards,
> Sébastien Han.
>
>
> On Tue, Jan 22, 2013 at 10:32 PM, Sylvain Munaut
>  wrote:
>> Hi,
>>
>>> I don't really want to try the mem profiler, I had quite a bad
>>> experience with it on a test cluster. While running the profiler some
>>> OSD crashed...
>>> The only way to fix this is to provide a heap dump. Could you provide one?
>>
>> I just did:
>>
>> ceph osd tell 0 heap start_profiler
>> ceph osd tell 0 heap dump
>> ceph osd tell 0 heap stop_profiler
>>
>> and it produced osd.0.profile.0001.heap
>>
>> Is it enough or do I actually have to leave it running ?
>>
>> I had to stop the profiler because after doing the dump, the OSD
>> process was taking 100% of CPU ... stopping the profiler restored it
>> to normal.
>>
>> Cheers,
>>
>> Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-22 Thread Sébastien Han
Well ideally you want to run the profiler during the scrubbing process
when the memory leaks appear :-).
--
Regards,
Sébastien Han.


On Tue, Jan 22, 2013 at 10:32 PM, Sylvain Munaut
 wrote:
> Hi,
>
>> I don't really want to try the mem profiler, I had quite a bad
>> experience with it on a test cluster. While running the profiler some
>> OSD crashed...
>> The only way to fix this is to provide a heap dump. Could you provide one?
>
> I just did:
>
> ceph osd tell 0 heap start_profiler
> ceph osd tell 0 heap dump
> ceph osd tell 0 heap stop_profiler
>
> and it produced osd.0.profile.0001.heap
>
> Is it enough or do I actually have to leave it running ?
>
> I had to stop the profiler because after doing the dump, the OSD
> process was taking 100% of CPU ... stopping the profiler restored it
> to normal.
>
> Cheers,
>
> Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-22 Thread Sylvain Munaut
Hi,

> I don't really want to try the mem profiler, I had quite a bad
> experience with it on a test cluster. While running the profiler some
> OSD crashed...
> The only way to fix this is to provide a heap dump. Could you provide one?

I just did:

ceph osd tell 0 heap start_profiler
ceph osd tell 0 heap dump
ceph osd tell 0 heap stop_profiler

and it produced osd.0.profile.0001.heap

Is it enough or do I actually have to leave it running ?

I had to stop the profiler because after doing the dump, the OSD
process was taking 100% of CPU ... stopping the profiler restored it
to normal.

Cheers,

Sylvain
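
Since leaving the profiler running pegs the CPU, one practical pattern is a short session that brackets a scrub window and takes a dump before and after, then diffs the two. A sketch (osd 0 is an example; the second dump's file name and the pprof invocation assume google-perftools conventions):

```shell
OSD=0    # example OSD id

ceph osd tell "$OSD" heap start_profiler
ceph osd tell "$OSD" heap dump            # writes osd.0.profile.0001.heap

# ... trigger or wait for the scrub pass here ...

ceph osd tell "$OSD" heap dump            # writes the next numbered dump
ceph osd tell "$OSD" heap stop_profiler   # CPU usage returns to normal

# Diff the two dumps to see what grew between them:
#   pprof --base=osd.0.profile.0001.heap "$(which ceph-osd)" \
#         osd.0.profile.0002.heap
```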


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-22 Thread Sébastien Han
Hi,

I originally started a thread around these memory leaks problems here:
http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11000.html

I'm happy to see that someone supports my theory about the scrubbing
process leaking the memory. I only use RBD from Ceph, so your theory
makes sense as well. Unfortunately, since I run a production platform
I don't really want to try the mem profiler; I had quite a bad
experience with it on a test cluster, where some OSDs crashed while the
profiler was running.
The only way to get this fixed is to provide a heap dump. Could you provide one?

Moreover I can't reproduce the problem on my test environment... :(

--
Regards,
Sébastien Han.


On Tue, Jan 22, 2013 at 9:01 PM, Sylvain Munaut
 wrote:
> Hi,
>
> Since I put Ceph in prod, I have experienced a memory leak in the OSDs,
> forcing me to restart them every 5 or 6 days. Without that the OSD
> process just grows infinitely and eventually gets killed by the OOM
> killer. (To make sure it wasn't "legitimate", I left one grow up to 4G
> of RSS ...).
>
> Here's for example the RSS usage of the 12 OSDs process
> http://i.imgur.com/ZJxyldq.png during a few hours.
>
> What I've just noticed is that if I look at the logs of the osd
> process right when it grows, I can see it's scrubbing PGs from pool
> #3. When scrubbing PGs from other pools, nothing really happens memory
> wise.
>
> Pool #3 is the pool where I have all the RBD images for the VMs and so
> have a bunch of small read/write/modify. The other pools are used by
> RGW for object storage and are mostly write-once,read-many-times of
> relatively large objects.
>
> I'm planning to upgrade to 0.56.1 this weekend and I was hoping to
> see if someone knew whether that issue had been fixed in the scrubbing
> code?
>
> I've seen other posts about memory leaks, but at the time the source
> wasn't confirmed. Here I clearly see it's the scrubbing of pools that
> hold RBD images.
>
> Cheers,
>
>   Sylvain