On 8/10/22 10:08, Frank Schilder wrote:
Hi Mark.

I actually had no idea that you needed both the yaml option
and the pool option configured
I guess you are referring to cephadm deployments, which I'm not using. In the 
ceph config database, both options must be enabled irrespective of how this 
happens (I separate the application ceph from deployment systems, which may or 
may not have their own logic). There was a longer thread started by me some 
years ago where someone posted a matrix of how both mode settings interact and 
what the resulting mode is.

I might have misunderstood what you were saying.  I was in fact referring to the yaml config and pool options.  I was under the impression that the pool setting overrode whatever was in the yaml and that you didn't need to chain them, i.e. enable them in both places.  Am I mistaken?
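For concreteness, the two knobs I have in mind are roughly these (just a sketch, the pool name "ec_data" is only a placeholder):

# OSD-side compression for all bluestore OSDs
ceph config set osd bluestore_compression_mode aggressive
# per-pool compression on the EC data pool
ceph osd pool set ec_data compression_mode aggressive
# check what is currently in effect
ceph config get osd bluestore_compression_mode
ceph osd pool get ec_data compression_mode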


Our applications are ceph fs data pools and rbd data pools, all EC pools. This 
places some heavy requirements on the compression methods in order not to kill 
IOPS performance completely. I don't know whether your long-term goal with this 
is just to simplify some internals or to achieve better storage utilisation. 
However, something like compression of entire files will probably kill 
performance to such a degree that it becomes useless.

Mostly the conversation just started out with how we could clean up some of the internals to be less complex.  That led to a discussion of how compression was implemented along with blobs early in bluestore's life as part of a big write path overhaul.  That led to further questions regarding whether or not people actually use it and whether or not it's useful out in the field, hence this conversation. :)  The general thought process was questioning whether we might be able to handle compression more cleanly higher up in the stack (with the trade-off being losing blob-level granularity), but we want to be very careful since, as you say below, it could be a lot of effort for little gain other than making bluestore simpler (which is a gain, to be sure).


I am not sure if you can get much better results out of changes. It looks like 
you could spend a lot of time on it and gain little. Maybe I can draw your 
attention to a problem that might lead to much more valuable improvements for 
both pool performance and disk utilisation. This is going to be a bit longer, 
as I need to go into details. I hope you find the time to read on.

Ceph has a couple of problems with its EC implementation. One problem that I 
have never seen discussed so far is the inability of its data stores to perform 
tail merging. I have opened a feature request 
(https://tracker.ceph.com/issues/56949) that describes the symptom and only 
requests a way to account for the excess usage. In the last sentence I mention 
that tail merging would be the real deal.

The example given there shows how extreme the problem can become. Here is 
the situation as of today while running my benchmark:

status usage: Filesystem                                 Size  Used Avail Use% Mounted on
status usage: 10.41.24.13,10.41.24.14,10.41.24.15:/data  2.0T  276G  1.8T  14% /mnt/cephfs
status usage: 10.41.24.13,10.41.24.14,10.41.24.15:/      2.5T  2.1T  419G  84% /mnt/adm/cephfs

Only /data contains any data. The first line shows the ceph.dir.rbytes=276G, 
while the second line shows the pool usage of 2.1T. The discrepancy due to 
small files is more than a factor of 7. Compression is enabled, but you won't 
gain much here because most files are below compression_min_blob_size.
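For reference, the numbers above can be pulled with something like the following (mount points as in our setup; this is only a rough sketch of how I collect them):

# logical data size as accounted by ceph fs (recursive bytes)
getfattr -n ceph.dir.rbytes /mnt/cephfs
# raw/stored usage as accounted on the pool side
ceph df detail
# the "status usage" lines above are essentially plain df on the two mounts
df -h /mnt/cephfs /mnt/adm/cephfs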

I know that the "solution" to this (and the EC overwrite amplification problem) 
was chosen to be bluestore_min_alloc_size=4K for all types of devices, which comes with 
its own problems due to the huge rocksdb required and was therefore postponed in 
octopus. I wonder how this will work on our 18TB hard drives. I personally am not 
convinced that this is a path to success and, while it reduces the problem of not having 
tail merging, it does not really remove the need for it. Even with a k=8 EC 
profile, 4K*8=32K is quite a large unit of atomicity. On geo-replicated EC pools, even 
larger values of k are the standard.

Yep, 4K is way better than what we had before with the 64K min_alloc size, but the seldom-discussed reality is that if you primarily have small (say <8-16K) objects, you might want to look at whether or not you are actually gaining anything with EC vs replication with the current implementation.
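To put rough numbers on it (back-of-the-envelope only, assuming every one of the k+m shards of a small object allocates at least min_alloc_size; the exact accounting may differ):

# raw space for one 4K object, 3x replication vs. EC 8+2, min_alloc_size=4K
echo "replica 3x: $(( 3 * 4 )) KiB raw"          # 12 KiB
echo "EC 8+2:     $(( (8 + 2) * 4 )) KiB raw"    # 40 KiB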


Are there any discussions and/or ideas on how to address this? ... in different 
ways?

There was also a discussion about de-duplication. Is there any news in this 
direction?

I haven't seen a lot of movement specifically on the EC and de-dup fronts, but it's possible that someone is working on them and I'm not in the loop.  Going to punt on these.



The following is speculative, based on incomplete knowledge:

An idea I would consider worthwhile is a separation of physical disk allocation 
from logical object allocation and using full read-modify-copy-on-write cycles 
for EC overwrites. The blob allocation size on disk should be tailored to 
accommodate the optimal IO size of the device, which is often larger than 4K or 
even 64K. All PG operations like rebalance and recovery should always operate 
on entire blobs regardless of what is stored in them.

Object allocation then happens on a second layer where, whenever possible, entire 
blobs are allocated (just like today, one address per blob). However, for small 
objects (or tails) a dedicated set of such blobs should allow addressing 
smaller chunks. Such blobs should offer 2, 4, 8, ... equal-sized chunks for 
object allocation. This would require organising blobs into levels by the number 
of sub-addresses available, possibly guided by PG stats about actual allocation 
sizes.

It sounds kind of heavy to me in some ways.  Maybe not for disks, though, where contiguous disk access is the biggest concern.  Would you still have the current extent model sitting underneath these? How would shared blobs fit in with the sub-allocatable units? Does the freespace/allocation strategy change?



What do we gain with this? Rebalance and recovery would profit dramatically. 
They would no longer operate on the object level, which is a huge pain for many 
small objects. Instead, they would operate on the blob level, i.e. on big chunks. 
All objects (shards) in a single blob would be moved/recovered in a single 
operation instead of object by object. This is particularly important for pure 
metadata objects, for example, the backtrace objects created by ceph fs on the 
primary data pool. I found on our cluster that the bottleneck for recovery is 
small objects (even on SSD!), not the amount of data. If this bottleneck could 
be removed, it would be a huge improvement.

Can you tell me a little bit more about what you are seeing with the cephfs backtrace objects?  Also, would you mind talking a bit more about what you saw with recovery performance on SSD?  Did increasing recovery parallelization help at all?  I don't want to get too sidetracked from the primary topic, but user feedback on this kind of stuff is always useful.
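By "recovery parallelization" I mean the usual knobs, e.g. something along these lines (values are just examples):

ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 8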



I guess that this idea goes to the fundamentals of ceph, namely that rados 
operates on the object level, which means that this separation of storing data 
into two logical layers instead of one would require a fundamental change to 
rados and cannot be done at the OSD level alone (the objects that rados sees may 
no longer be the same objects that a user sees). This change might be so 
fundamental that there is no upgrade path without data migration.

On the other hand, a development dead end requiring such an upgrade path is 
likely to come anyway.

I hope this was not a total waste of time.

I don't think so.  It's interesting to hear what people are struggling with and ideas to make it better.  Rados level protocol changes (especially one like that) along with the implementation changes would require extremely heavy lifting though.  I suspect we'd shoot for easier performance wins first where we can get them.



Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnel...@redhat.com>
Sent: 09 August 2022 16:56:19
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Request for Info: bluestore_compression_mode?

Hi Frank,


Thank you very much for the reply!  If you don't mind me asking, what's
the use case?  We're trying to determine if we might be able to do
compression at a higher level than blob with the eventual goal of
simplifying the underlying data structures.  I actually had no idea that
you needed both the yaml option and the pool option configured (I
figured the pool option just overrode the yaml).  That's definitely
confusing!


Not sure what the right path is here or if we should even make any
significant changes at this point, but we figured that the first step
was to figure out if people are using it and how.


Mark


On 8/9/22 04:11, Frank Schilder wrote:
Hi Mark,

we are using per-pool aggressive compression mode on any EC data pool. We need 
it per pool as we also have uncompressed replicated metadata pools sharing 
the same OSDs. Currently, one needs to enable both options for data compression: 
the bluestore option to enable compression on an OSD and the pool option to enable 
compression for a pool. Only when both options are active simultaneously is 
data actually compressed, which led to quite a bit of confusion in the past. I 
think per-pool compression should be sufficient and imply compression without 
further tweaks on the OSD side. I don't know what the objective with per-OSD 
bluestore compression was. We just enabled bluestore compression globally, since 
the pool option selects the data for compression and it's the logical way to 
select and enforce compression (per data type).

Just an enable/disable setting for pools would be sufficient (enabled=aggressive, and always treat 
bluestore_compression=aggressive implicitly). On the bluestore side, the usual 
compression_blob_size/algorithm options will probably remain necessary, although one might better 
set them via a mask as in "ceph config set osd/class:hdd compression_min_blob_size XYZ", 
or, better, allow combinations of masks as in "ceph config set osd/class:hdd,store:blue 
compression_min_blob_size XYZ" to prepare the config interface for future data stores.
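With today's option names and masks this would look roughly like the following (values are only examples):

ceph config set osd/class:hdd bluestore_compression_min_blob_size 131072
ceph config set osd/class:ssd bluestore_compression_min_blob_size 8192
ceph config set osd bluestore_compression_algorithm lz4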

I don't think the compression mode "passive" makes much sense, as I have never heard of client 
software providing a meaningful hint. I think it's better treated as an administrator's choice after testing 
performance, and then "enabled" should simply mean "always compress" and "disabled" "never 
compress".

I believe there is currently an interdependence with min_alloc_size on the OSD 
data store, which makes tuning a bit of a pain. It would be great if physical 
allocation parameters and logical allocation sizes could be decoupled somewhat. 
If they need to be coupled, then at least make it possible to read important 
creation-time settings at run-time. At the moment it is necessary to restart an 
OSD and grep the log to find the min_alloc_size that is actually used by that 
OSD. Also, with upgraded clusters it is more likely to have OSDs with 
different min_alloc_sizes in a pool, so it would be great if settings like this 
one had little or no influence on whether or not compression works as 
expected.
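As an illustration, today this boils down to something like the following (the OSD id and log path are only examples; the exact log line and required debug level depend on the release):

# bump bluestore debug logging, restart the OSD and grep for the value read at startup
ceph config set osd.12 debug_bluestore 10/10
systemctl restart ceph-osd@12
grep -i min_alloc_size /var/log/ceph/ceph-osd.12.log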

Summary:

- pool enable/disable flag for always/never compress
- data store flags for compression performance tuning
- make OSD create- and tune parameters as orthogonal as possible

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnel...@redhat.com>
Sent: 08 August 2022 20:30:49
To: ceph-users@ceph.io
Subject: [ceph-users] Request for Info: bluestore_compression_mode?

Hi Folks,


We are trying to get a sense for how many people are using
bluestore_compression_mode or the per-pool compression_mode options
(these were introduced early in bluestore's life, but afaik may not
be widely used).  We might be able to reduce complexity in bluestore's
blob code if we could do compression in some other fashion, so we are
trying to get a sense of whether or not it's something worth looking
into more.


Thanks,

Mark

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


