Hi Sage,

On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <s...@newdream.net> wrote:
> On Mon, 9 Feb 2015, David McBride wrote:
>> On 09/02/15 15:31, Gregory Farnum wrote:
>>
>> > So, memory usage of an OSD is usually linear in the number of PGs it
>> > hosts. However, that memory can also grow based on at least one other
>> > thing: the number of OSD Maps required to go through peering. It
>> > *looks* to me like this is what you're running in to, not growth on
>> > the number of state machines. In particular, those past_intervals you
>> > mentioned. ;)
>>
>> Hi Greg,
>>
>> Right, that sounds entirely plausible, and is very helpful.
>>
>> In practice, that means I'll need to be careful to avoid this situation
>> occurring in production -- but given that's unlikely to occur except in the
>> case of non-trivial neglect, I don't think I need be particularly concerned.
>>
>> (Happily, I'm in the situation that my existing cluster is purely for testing
>> purposes; the data is expendable.)
>>
>> That said, for my own peace of mind, it would be valuable to have a procedure
>> that can be used to recover from this state, even if it's unlikely
>> to occur in practice.
>
> The best luck I've had recovering from situations is something like:
>
> - stop all osds
> - osd set nodown
> - osd set nobackfill
> - osd set noup
> - set map cache size smaller to reduce memory footprint.
>
>   osd map cache size = 50
>   osd map max advance = 25
>   osd map share max epochs = 25
>   osd pg epoch persisted max stale = 25
>
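
(For anyone who ends up needing that recipe: a minimal sketch of how the
flags Sage mentions are set via the CLI, assuming a standard admin keyring;
the four "osd map ..." values then go into ceph.conf under [osd] on the
affected hosts before bringing the OSDs back:)

   # cluster-wide flags: don't mark OSDs down, don't start backfill,
   # and don't mark OSDs up yet
   ceph osd set nodown
   ceph osd set nobackfill
   ceph osd set noup

   # later, once things have settled, revert with:
   ceph osd unset noup
   ceph osd unset nobackfill
   ceph osd unset nodown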

The settings above have proven very useful when setting up some of our
new OSD servers that have relatively little memory per OSD: 64GB of RAM
for 48x 4TB OSDs.
Prior to applying these settings (plus one more, below) we were seeing
memory usage of around 2-3GB per OSD when they were freshly created.
After a restart the processes stayed under 300-400MB.

It seems that the initial bootstrapping -- fetching the most recent 500
osdmaps in bunches of 100 at a time -- causes the osd map cache to
exceed its 50-entry limit, and that memory is then never freed. We
found that to fix this we also had to lower the "osd map message max"
setting on the mons; with that, OSD memory stays under 500MB per
process.
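
(For completeness, this is roughly how the mon-side value can be changed
without waiting for a config rollout -- just a sketch, assuming injectargs
picks this option up at runtime; otherwise restarting the mons with the
value in ceph.conf has the same effect:)

   ceph tell mon.* injectargs '--osd_map_message_max 10'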

Currently we're happily running a large [1] number of OSDs with the
following configuration:

[global]
   osd map message max = 10

[osd]
   osd map cache size = 20
   osd map max advance = 10
   osd map share max epochs = 10
   osd pg epoch persisted max stale = 10

and the memory consumption is 400-500MB per process, even during
backfilling. So far we haven't seen any drawbacks to this
configuration. Should we expect any problems if we keep running with
such a small osdmap cache permanently?
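
(In case it's useful to others: a quick way to check that the values
actually took effect on a running OSD, assuming the default admin socket
location:)

   ceph daemon osd.0 config show | grep osd_map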

Best Regards,
Dan

[1] "large" in this case means the osdmap is 4.6MB in size