On 3/29/22 16:27, Clippinger, Sam wrote:
Hi Mark,

Wow, that's great information, thank you for the details.  I hate to press the 
point, but I still don't understand how the two numbers for 
osd_memory_target and bluestore_cache_size are actually used.  Since I have 
disabled autotuning, is osd_memory_target used for the onode and rocksdb 
caches while bluestore_cache_size is used for bluestore buffers?  My cluster is 
only used for RBD storage; which one is more important?


No worries, it's confusing. :)  Partly that's because we created the autotuning stuff after people were already using the cache size option and wanted to keep that functionality available for folks who don't want autotuning.  Basically, if you have autotuning enabled, the OSD will start controlling the cache size itself.  It won't really matter what you set bluestore_cache_size to (I think we may still use what you provide as the initial value, but it will change as soon as the first iteration of the autotuning runs).  So basically:

autotuning off -> osd cache size determined by bluestore_cache_size* (osd_memory_target irrelevant)

autotuning on -> osd cache size adjusted automatically based on osd_memory_target (bluestore_cache_size* almost irrelevant)


In both cases the ratios are used, but in slightly different ways (overall ratio for autotuning off and per-priority level for autotuning on).
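
To make that concrete, the two setups would look something like the following in ceph.conf (the sizes here are placeholders, not recommendations):

    # autotuning off: you pick the total cache size, the ratios split it up
    [osd]
    bluestore_cache_autotune = false
    bluestore_cache_size = 6442450944        # 6 GiB of cache, total
    bluestore_cache_meta_ratio = 0.8         # share for the onode/meta cache
    bluestore_cache_kv_ratio = 0.2           # share for the rocksdb block cache

    # autotuning on: you pick a process-wide memory target, the OSD sizes the caches
    [osd]
    bluestore_cache_autotune = true
    osd_memory_target = 8589934592           # 8 GiB for the whole ceph-osd process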



Josh, to answer your questions from earlier, I've read through the cache sizing 
page and the documentation for the options at least a dozen times and I remain 
confused.  Am I understanding you correctly, that all of the bluestore cache is 
contained within the amount set by osd_memory_target?  This simply does not 
match my experience -- even right now with bluestore_cache_size=10Gi and 
osd_memory_target=6Gi, each daemon is using between 15 and 20 GiB.  I previously 
set them both to 8 GiB and the memory usage was about 18 GiB per daemon.  When 
autotuning was enabled (in Nautilus) each daemon would use about 4 GiB no 
matter how the values were set (and I've never found any settings to keep 
daemons in our lab environment from using at least 2 GiB).  This cluster uses 
NVMe drives for storage, currently running one daemon per drive.  We're 
planning to split them into multiple daemons per drive (following popular 
online wisdom) but I'm concerned about running out of RAM if I can't control 
how much memory each daemon will use.  Let's say osd_memory_target=8Gi and 
bluestore_cache_size=6Gi... what are the remaining 2Gi used for?  Is there a 
way I can calculate an appropriate amount (or even minimum amount), perhaps 
based on the number of pools/PGs/OSDs/etc.?  We've had to clean up damage caused by 
OSD daemons getting OOM-killed before; that was a bad week.

When I dump the mempools for one OSD daemon where "ps" currently shows RSS 
17235900 and VSZ 18584008:
# ceph tell osd.0 dump_mempools
{
     "mempool": {
         "by_pool": {
             "bloom_filter": {
                 "items": 0,
                 "bytes": 0
             },
             "bluestore_alloc": {
                 "items": 25328063,
                 "bytes": 263022424
             },
             "bluestore_cache_data": {
                 "items": 2017,
                 "bytes": 57925872
             },
             "bluestore_cache_onode": {
                 "items": 400268,
                 "bytes": 246565088
             },
             "bluestore_cache_meta": {
                 "items": 58839368,
                 "bytes": 643004516
             },
             "bluestore_cache_other": {
                 "items": 63930648,
                 "bytes": 2770468348
             },
             "bluestore_Buffer": {
                 "items": 20,
                 "bytes": 1920
             },
             "bluestore_Extent": {
                 "items": 17760199,
                 "bytes": 852489552
             },
             "bluestore_Blob": {
                 "items": 17567934,
                 "bytes": 1827065136
             },
             "bluestore_SharedBlob": {
                 "items": 17536894,
                 "bytes": 1964132128
             },
             "bluestore_inline_bl": {
                 "items": 967,
                 "bytes": 141154
             },
             "bluestore_fsck": {
                 "items": 0,
                 "bytes": 0
             },
             "bluestore_txc": {
                 "items": 20,
                 "bytes": 15680
             },
             "bluestore_writing_deferred": {
                 "items": 55,
                 "bytes": 427691
             },
             "bluestore_writing": {
                 "items": 45,
                 "bytes": 184320
             },
             "bluefs": {
                 "items": 16771,
                 "bytes": 249024
             },
             "bluefs_file_reader": {
                 "items": 352,
                 "bytes": 58242816
             },
             "bluefs_file_writer": {
                 "items": 3,
                 "bytes": 576
             },
             "buffer_anon": {
                 "items": 588578,
                 "bytes": 87927562
             },
             "buffer_meta": {
                 "items": 254277,
                 "bytes": 22376376
             },
             "osd": {
                 "items": 1367,
                 "bytes": 15463504
             },
             "osd_mapbl": {
                 "items": 0,
                 "bytes": 0
             },
             "osd_pglog": {
                 "items": 827739,
                 "bytes": 429467480
             },
             "osdmap": {
                 "items": 2794507,
                 "bytes": 43126616
             },
             "osdmap_mapping": {
                 "items": 0,
                 "bytes": 0
             },
             "pgmap": {
                 "items": 0,
                 "bytes": 0
             },
             "mds_co": {
                 "items": 0,
                 "bytes": 0
             },
             "unittest_1": {
                 "items": 0,
                 "bytes": 0
             },
             "unittest_2": {
                 "items": 0,
                 "bytes": 0
             }
         },
         "total": {
             "items": 205850092,
             "bytes": 9282297783
         }
     }
}

The total bytes at the end is much less than what the OS reports.  Is this 
something I can control by adjusting the calculation frequency as Mark suggests?


Looking at your numbers here, you have a lot of memory consumed by blobs, shared blobs, and extents (over 4GB!).  What kind of workload are you running?  FWIW the caches have not auto-adjusted down to their minimums to hit a smaller memory target (which they will do if autotuning is enabled).  What were the settings used for this dump?  FWIW the autotuning only adjusts cache sizes, and it will always leave a small amount of memory for each cache.  If too much memory is being dynamically allocated for things other than cache (which appears to be the case here), it may not be able to shrink the OSD to fit within the target.  It's basically best effort: it can't shrink in-memory data structures for things it doesn't control, it can only try to work around them to the extent it is able.
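
If it helps narrow down where the RSS/mempool gap is coming from, the heap admin commands will show how much memory tcmalloc is holding versus what it has actually released back to the OS (exact output varies by release):

    # what tcmalloc thinks it is using vs. holding in its freelists
    ceph tell osd.0 heap stats

    # ask tcmalloc to hand unused pages back to the kernel; if RSS drops a lot,
    # the gap was mostly freed-but-unreleased memory rather than live allocations
    ceph tell osd.0 heap release

    # double-check the settings the daemon is actually running with
    # (run this one on the host where osd.0 lives)
    ceph daemon osd.0 config show | grep -E 'osd_memory|bluestore_cache'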




-- Sam Clippinger


-----Original Message-----
From: Mark Nelson <mnel...@redhat.com>
Sent: Tuesday, March 29, 2022 1:27 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: What's the relationship between osd_memory_target and 
bluestore_cache_size?



On 3/29/22 11:44, Anthony D'Atri wrote:
[osd]
bluestore_cache_autotune = 0
Why are you turning autotuning off?
FWIW I’ve encountered the below assertions.  I neither support nor deny them, 
pasting here for discussion.  One might interpret this to only apply to OSDs 
with DB on a separate (faster) device.


With random small block workloads, it’s important to keep BlueStore metadata 
cached and keep RocksDB from spilling over to slow media – including during 
compaction. If there is adequate memory on the OSD node, it is recommended to 
increase the BlueStore metadata cache ratio. An example of this is shown below:

bluestore_cache_meta_ratio = 0.8
bluestore_cache_kv_ratio = 0.2
bluestore_cache_size_ssd = 6GB

In Ceph Nautilus and above, the cache ratios are automatically tuned so it is 
recommended to first observe the relevant cache hit counters in BlueStore 
before manually setting these parameters.  There is some disagreement regarding 
how effective the auto tuning is.

https://ceph.io/community/bluestore-default-vs-tuned-performance-comparison/
suggests that we still set

bluestore_cache_size_ssd = 8GB with 12GB memory target.

Sorry, this is going to be a long one... :)


The basic gist of it is that if you disable autotuning, the OSD will use a set "cache size" for 
various caches and then divvy the memory up between them based on the defined ratios.  For bluestore that 
means the rocksdb block cache(s), the bluestore onode cache, and the bluestore buffer cache.  IE in the above example 
that's 6GB with 80% going to the onode "meta" cache, 20% going to the rocksdb block "kv" 
cache, and an implicit 0% being dedicated to the bluestore buffer cache.  This kind of setup tends to work best 
when you have a well defined workload and you know exactly how you want to tune the different cache sizes for 
optimal performance (oftentimes giving a lot of the memory to the onode cache for RBD, for example).  The amount 
of memory the OSD uses can float up and down, and it tends to be a little easier on tcmalloc because 
you aren't growing/shrinking the caches constantly trying to stay within a certain memory target.
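
Just to spell out the arithmetic for that example (using the 6GB / 80% / 20% numbers from above):

    bluestore_cache_size_ssd  = 6 GiB
    onode "meta" cache        = 6 GiB * 0.8 = 4.8 GiB
    rocksdb block "kv" cache  = 6 GiB * 0.2 = 1.2 GiB
    bluestore buffer cache    = whatever is left over, i.e. 0 in this case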

When cache autotuning is enabled, the cache size is allowed to fluctuate based 
on the osd_memory_target and how much memory is mapped by the ceph-osd process 
as reported by tcmalloc.  This is almost like using RSS memory as the target 
but not quite.  The difference is that there is no guarantee that the kernel 
will reclaim freed memory soon (or at all), so RSS memory usage ends up being a 
really poor metric for trying to dynamically adjust memory targets (I tried 
with fairly comical results).  This process of adjusting the caches based on a 
process level memory target seems to be harder on tcmalloc, probably because 
we're freeing a bunch of fragmented memory (say from the onode cache) while 
it's simultaneously trying to hand sequential chunks of memory out to something 
else (whatever is requesting memory and forcing us to go over target).  We tend 
to oscillate around the memory target, though overall the system works fairly 
well if you are willing to accept up to ~20% memory spikes under heavy (write) 
workloads.  You can tweak the behavior to control this more aggressively by 
increasing the frequency at which we recalculate the memory target, but it's 
more CPU intensive and may overcompensate by releasing too much fragmented 
memory too quickly.
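
For reference, the knobs involved are roughly the following (double-check the exact names and defaults against your release's config reference):

    osd_memory_target                   # overall mapped-memory target for the ceph-osd process
    osd_memory_cache_min                # floor below which the autotuner won't shrink the caches
    osd_memory_cache_resize_interval    # how often (in seconds) the cache sizes are recalculated
    osd_memory_base                     # rough estimate of non-cache memory the OSD needs
    osd_memory_expected_fragmentation   # estimate of heap fragmentation factored into the target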

Enabling autotuning also enables the priority cache manager.  Each cache subsystem will 
request memory at different priority levels (say pri0, pri1, etc).  When autotuning is 
enabled the ratios no longer govern a global percentage of the cache, but instead govern 
a "fairshare" target at each priority level.  Each cache is assigned at least 
its ratio of the available memory at a given level.  If a cache is assigned all of the 
memory it requests at that level (i.e. it wanted less than its fairshare), the priority cache manager will use the leftover memory to 
fulfill requests at that level by caches that want more memory than their fairshare 
target.  This process continues until all requests at a given level have been fulfilled 
or we run out of memory available for caches.  If all requests have been fulfilled at a 
given level, we move to the next level and start the process all over again.
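
To make the fairshare mechanics concrete, here is a made-up round at a single priority level (the numbers are invented purely for illustration, and the exact way leftover memory gets divided depends on the implementation):

    memory available at this level:  4 GiB
    ratios: meta 0.5, kv 0.3, data 0.2  ->  fairshare targets of 2.0, 1.2 and 0.8 GiB
    requests: meta wants 1.0 GiB, kv wants 2.0 GiB, data wants 1.0 GiB
    first pass:  meta is fully satisfied with 1.0 GiB; kv and data get their
                 fairshares (1.2 and 0.8 GiB) since they asked for more
    second pass: the 1.0 GiB meta didn't use is handed out to the caches still
                 asking for more (kv and data) before moving on to the next
                 priority level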

In current versions of ceph we only really utilize 2 of the available levels.  Priority0 is used 
for very high priority things (like items pinned in the cache or rocksdb "hipri pool" 
items).  Everything else is basically shoved into a single level and competes there.  In Quincy, we 
finally implemented age-binning, where we associate items in the different caches with "age 
bins" that give us a coarse look at the relative ages of all cache items.  IE say that there 
are old onode entries sitting in the bluestore onode cache, but now there is a really hot read 
workload against a single large object.  That OSD's priority cache can now sort those older onode 
entries into a lower priority level than the buffer cache data for the hot object.  We generally 
may heavily favor onodes at a given priority level, but in this case the older onodes may end up in a 
lower priority level than the hot buffer data, so the buffer data memory request is fulfilled first.

Due to various factors this isn't as big of a win as I had hoped it would be 
(primarily in relation to the rocksdb block cache, since compaction tends to 
blow everything in the cache away regularly anyway).  In reality the biggest 
benefit seems to be that we are more aggressive about clearing away very old 
onode data if there are new writes (which we suspect reduces memory 
fragmentation), and it's much easier to tell the ages of items in the various 
caches via the perf admin socket.  It does give us significantly more control 
and insight into the global cache behavior though, so in general it seems to be 
a good thing.  The perf tests we ran ranged from having little effect to 
showing moderate improvement in some scenarios.

FWIW, despite the fact that I wrote the prioritycache system and memory 
autotuning code, I'd be much happier if we were much less dynamic about how we 
allocate memory.  That probably goes all the way back to how the message over 
the wire looks.  Ideally we would have a very short path from the message to 
the disk with minimal intermediate translation of the message, minimal dynamic 
behavior based on the content of the message, and recycling static buffers or 
objects from a contiguous pool whenever possible.  The prioritycache system 
tries to account for dynamic memory allocations in ceph by reactively 
growing/shrinking the caches, but it would be much better if we didn't need to 
do any of that in the first place.


Mark

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
