On 8/10/22 10:08, Frank Schilder wrote:
Hi Mark.
I actually had no idea that you needed both the yaml option
and the pool option configured.
I guess you are referring to cephadm deployments, which I'm not using. In the
ceph config database, both options must be enabled irrespective of how that
happens (I separate the application ceph from deployment systems, which may or
may not have their own logic). There was a longer thread I started some years
ago where someone posted a matrix of how the two mode settings interact and
what the resulting mode is.
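For reference, this is roughly what we have set to get data compressed at
all, i.e. both knobs at once (a sketch; the pool name is a placeholder):

  ceph config set osd bluestore_compression_mode aggressive
  ceph osd pool set ec_data_pool compression_mode aggressive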
I might have misunderstood what you were saying. I was in fact
referring to the yaml config and pool options. I was under the
impression that the pool setting overrode whatever was in the yaml and
you didn't need to sort of chain them to be enabled in both places. Am
I mistaken?
Our applications are ceph fs data pools and rbd data pools, all EC pools. This
places heavy requirements on the compression method in order not to kill IOPS
performance completely. I don't know what your long-term goal with this is,
whether it is just to simplify some internals or to achieve better storage
utilisation. However, something like compression of entire files will probably
hurt performance to such a degree that it becomes useless.
Mostly the conversation just started out with how we could clean up some
of the internals to be less complex. That led to a discussion of how
compression was implemented along with blobs early on in bluestore's
life as part of a big write path overhaul. That led to further
questions regarding whether or not people actually use it and whether or
not it's useful out in the field, hence this conversation. :) The
general thought process was questioning whether we might be able to
handle compression more cleanly higher up in the stack (with the
trade-off being losing blob level granularity), but we want to be very
careful since as you say below it could be a lot of effort for little
gain other than making bluestore simpler (which is a gain to be sure).
I am not sure you can get much better results out of such changes. It looks
like you could spend a lot of time on it and gain little. Maybe I can draw
your attention to a problem that might lead to much more valuable improvements
for both pool performance and disk utilisation. This is going to be a bit
longer, as I need to go into the details. I hope you find the time to read on.
Ceph has a couple of problems with its EC implementation. One problem that I
have never seen discussed so far is the inability of its data stores to perform
tail merging. I have opened a feature request
(https://tracker.ceph.com/issues/56949) that describes the symptom and only
requests a way to account for the excess usage. In the last sentence I mention
that tail merging would be the real deal.
The example given there shows how extreme the problem can become. Here is
the situation as of today while running my benchmark:
status usage: Filesystem                                  Size  Used Avail Use% Mounted on
status usage: 10.41.24.13,10.41.24.14,10.41.24.15:/data   2.0T  276G  1.8T  14% /mnt/cephfs
status usage: 10.41.24.13,10.41.24.14,10.41.24.15:/       2.5T  2.1T  419G  84% /mnt/adm/cephfs
Only /data contains any data. The first line shows ceph.dir.rbytes=276G, while
the second line shows the pool usage of 2.1T. The discrepancy due to small
files is more than a factor of 7. Compression is enabled, but you won't gain
much here because most files are below compression_min_blob_size.
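If you want to reproduce the comparison, this is essentially how the two
numbers are obtained (assuming the CephFS virtual xattrs are available on the
mount):

  # recursive logical size as seen by the file system
  getfattr -n ceph.dir.rbytes /mnt/cephfs/data

  # raw usage as seen by the pools
  ceph df detail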
I know that the "solution" to this (and the EC overwrite amplification
problem) was chosen to be bluestore_min_alloc_size=4K for all types of
devices. This comes with its own problems due to the huge RocksDB required and
was therefore postponed in Octopus. I wonder how this will work on our 18TB
hard drives. I personally am not convinced that this is a path to success and,
while it reduces the problem of not having tail merging, it does not really
remove the need for it. Even with a k=8 EC profile, 4K*8=32K is quite a large
unit of atomicity. On geo-replicated EC pools, even larger values of k are the
standard.
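As a back-of-the-envelope illustration (hypothetical numbers: k=8, m=2,
stripe_unit = min_alloc_size = 4K):

  4K file       -> 1 stripe: 8 data chunks (7 zero-padded) + 2 coding chunks
  raw usage     = 10 chunks * 4K = 40K
  ideal usage   = 4K * 10/8 = 5K
  amplification = factor 8

which is about the order of the factor-of-7 discrepancy shown above.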
Yep, 4K is way better than what we had before with the 64K min_alloc
size, but the seldom-discussed reality is that if you primarily have
small (say <8-16K) objects, you might want to look at whether or not you
are actually gaining anything with EC vs replication in the current
implementation.
Are there any discussions and/or ideas on how to address this, perhaps in
different ways?
There was also a discussion about de-duplication. Is there any news in this
direction?
I haven't seen a lot of movement specifically on the EC and de-dup
fronts, but it's possible that someone is working on them and I'm not in
the loop. Going to punt on these.
The following is speculative, based on incomplete knowledge:
An idea I would consider worthwhile is a separation of physical disk allocation
from logical object allocation and using full read-modify-copy-on-write cycles
for EC overwrites. The blob allocation size on disk should be tailored to
accommodate the optimal IO size of the device, which is often larger than 4K or
even 64K. All PG operations like rebalance and recovery should always operate
on entire blobs regardless of what is stored in them.
Object allocation then happens on a second layer where, whenever possible,
entire blobs are allocated (just like today, one address per blob). However,
for small objects (or tails) a dedicated set of such blobs should allow
addressing smaller chunks. Such blobs would offer 2, 4, 8, ... equal-sized
chunks for object allocation. This would require organising blobs into levels
by the sub-address granularity they offer, possibly guided by PG stats about
actual allocation sizes.
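Purely as an illustration of the levels I have in mind (all numbers made up):

  blob = 256K, matched to the device's optimal IO size
    level 0:  1 x 256K  -> whole-blob allocation for large objects
    level 1:  2 x 128K
    level 2:  4 x  64K
    ...
    level 6: 64 x   4K  -> small objects and tails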
It sounds kind of heavy to me in some ways. Maybe not for disks though
where contiguous disk access is the biggest concern. Would you still
have the current extent model sitting underneath these? How would shared
blobs fit in with the sub-allocatable units? Does the
freespace/allocation strategy change?
What do we gain with this? Rebalance and recovery would profit dramatically.
They would no longer operate on the object level, which is a huge pain with
many small objects. Instead, they would operate on the blob level, that is, on
big chunks. All objects (shards) in a single blob would be moved/recovered in
a single operation instead of object by object. This is particularly important
for pure meta-data objects, for example, the backtrace objects created by
ceph fs on the primary data pool. I found on our cluster that the bottleneck
for recovery is small objects (even on SSD!), not the amount of data. If this
bottleneck could be removed, it would be a huge improvement.
Can you tell me a little bit more about what you are seeing with the
cephfs backtrace objects? Also, would you mind talking a bit more about
what you saw with recovery performance on SSD? Did increasing recovery
parallelization help at all? I don't want to get too sidetracked from
the primary topic, but user feedback on this kind of stuff is always useful.
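For reference, by recovery parallelization I mean knobs along the lines of
the following (a sketch; defaults vary by release):

  ceph config set osd osd_max_backfills 4
  ceph config set osd osd_recovery_max_active 8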
I guess that this idea goes to the fundamentals of ceph, namely that rados
operates on object level, which means that this separation of storing data
into two logical layers instead of one would require a fundamental change of rados
and cannot be done on OSD level alone (the objects that rados sees may no
longer be the same objects that a user sees). This change might be so
fundamental that there is no upgrade path without data migration.
On the other hand, a development dead end requiring such an upgrade path is
likely to come anyway.
I hope this was not a total waste of time.
I don't think so. It's interesting to hear what people are struggling
with and ideas to make it better. Rados level protocol changes
(especially one like that) along with the implementation changes would
require extremely heavy lifting though. I suspect we'd shoot for easier
performance wins first where we can get them.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Mark Nelson <mnel...@redhat.com>
Sent: 09 August 2022 16:56:19
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Request for Info: bluestore_compression_mode?
Hi Frank,
Thank you very much for the reply! If you don't mind me asking, what's
the use case? We're trying to determine if we might be able to do
compression at a higher level than blob with the eventual goal of
simplifying the underlying data structures. I actually had no idea that
you needed both the yaml option and the pool option configured (I
figured the pool option just overrode the yaml). That's definitely
confusing!
Not sure what the right path is here or if we should even make any
significant changes at this point, but we figured that the first step
was to figure out if people are using it and how.
Mark
On 8/9/22 04:11, Frank Schilder wrote:
Hi Mark,
we are using per-pool aggressive compression mode on any EC data pool. We need
it per pool as we also have uncompressed replicated meta-data pools sharing
the same OSDs. Currently, one needs to enable both for data compression: the
bluestore option to enable compression on an OSD and the pool option to enable
compression for a pool. Only when both options are active simultaneously is
data actually compressed, which has led to quite a bit of confusion in the
past. I think the per-pool compression setting should be sufficient and should
imply compression without further tweaks on the OSD side. I don't know what
the objective of per-OSD bluestore compression was. We just enabled bluestore
compression globally, since the pool option selects the data for compression
and it's the logical place to select and enforce compression (per data type).
Just an enable/disable setting for pools would be sufficient
(enabled=aggressive, and always treat bluestore_compression=aggressive
implicitly). On the bluestore side the usual compression_blob_size/algorithm
options will probably remain necessary, although one might better set them via
a mask, as in

  ceph config set osd/class:hdd compression_min_blob_size XYZ

or, better, allow a combination of masks, as in

  ceph config set osd/class:hdd,store:blue compression_min_blob_size XYZ

to prepare the config interface for future data stores.
I don't think the compression mode "passive" makes much sense, as I have never
heard of client software providing a meaningful hint. I think it's better
treated as an administrator's choice after performance testing; "enabled"
should then simply mean "always compress" and "disabled" "never compress".
I believe currently there is an interdependence with min_alloc_size on the OSD
data store, which makes tuning a bit of a pain. It would be great if physical
allocation parameters and logical allocation sizes could be decoupled somewhat.
If they need to be coupled, then at least make it possible to read important
creation-time settings at run-time. At the moment it is necessary to restart
an OSD and grep the log to find the min_alloc_size that the OSD actually uses.
Also, with upgraded clusters it is more likely to have OSDs with different
min_alloc_sizes in a pool, so it would be great if settings like this one had
little or no influence on whether or not compression works as expected.
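For completeness, the log grep I mean looks something like this (the exact
message format may differ between releases):

  grep -i min_alloc_size /var/log/ceph/ceph-osd.0.log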
Summary:
- pool enable/disable flag for always/never compress
- data store flags for compression performance tuning
- make OSD creation-time and tuning parameters as orthogonal as possible
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Mark Nelson <mnel...@redhat.com>
Sent: 08 August 2022 20:30:49
To: ceph-users@ceph.io
Subject: [ceph-users] Request for Info: bluestore_compression_mode?
Hi Folks,
We are trying to get a sense for how many people are using
bluestore_compression_mode or the per-pool compression_mode options
(these were introduced early in bluestore's life, but afaik may not be
widely used). We might be able to reduce complexity in bluestore's
blob code if we could do compression in some other fashion, so we are
trying to get a sense of whether or not it's something worth looking
into more.
Thanks,
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io