Hi Frank,
On 7/31/2020 10:31 AM, Frank Schilder wrote:
Hi Igor,
thanks. I guess the problem with finding the corresponding images is that it
happens at the bluestore level and not at the object level. Even if I listed all
rados objects and added up their sizes, I would not see the excess storage.
Thinking about working around this issue: would re-writing the objects deflate
the excess usage? For example, if I evacuated an OSD and added it back to the pool
after it was empty, would this re-write the objects on this OSD without the
overhead?
Maybe, but I can't say for sure...
Or, if I simply copied an entire RBD image, would the copy be deflated?
Although the latter options sound a bit crazy, one could do this without (much)
downtime for the VMs, and it might get us through this migration.
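For reference, a rough sketch of what those two options could look like on the
command line; osd.N, <pool> and <image> are placeholders and this is untested,
so please verify on a test setup first:

  # evacuate one OSD and bring it back (note its original crush weight first, e.g. via 'ceph osd df tree')
  ceph osd crush reweight osd.N 0               # data migrates off osd.N
  ceph -s                                       # wait until backfill has finished and the OSD is empty
  ceph osd crush reweight osd.N <orig-weight>   # data migrates back, re-writing the objects

  # copy an RBD image within the pool
  rbd cp <pool>/<image> <pool>/<image>-copy
  # 'rbd deep cp' (if available on this release) would also carry over snapshots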
Also, you might want to try pg export/import using ceph-objectstore-tool.
See https://ceph.io/geen-categorie/incomplete-pgs-oh-my/ for some hints on
how to do that.
But again, I'm not certain whether it will help. Preferably try this on a
non-production cluster first...
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Igor Fedotov
Sent: 30 July 2020 15:40
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported
Hi Frank,
On 7/30/2020 11:19 AM, Frank Schilder wrote:
Hi Igor,
thanks for looking at this. Here a few thoughts:
The copy goes to NTFS. I would expect between 2 and 4 metadata operations per
write, which would go to a few existing objects. I guess the difference
bluestore_write_small - bluestore_write_small_new consists mostly of such writes,
which are susceptible to the partial-overwrite amplification. A first question is
how many objects are actually affected; 3 small writes do not mean that 3 objects
have partial overwrites.
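For reference, the counters in question can be pulled per OSD from the admin
socket (osd.N is a placeholder, jq assumed available):

  ceph daemon osd.N perf dump | jq '.bluestore | {bluestore_write_small, bluestore_write_small_new, bluestore_write_big}'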
The large number of small_new is indeed strange, although these would not lead
to excess allocations. It is possible that the write size used by the copy tool is
not ideal; I was wondering about this too. I will investigate.
small_new might relate to small tailing chunks that presumably appear
when doing unaligned appends. Each such append triggers a small_new write...
To know more, I would need to find out which images these small writes come
from; we have more than one active. Is there a low-level way to find out which
objects are affected by partial overwrites and which image they belong to? In
your post you were describing some properties like being shared/cloned etc. Can
one search for such objects?
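As an aside, one crude way to at least map objects back to images, assuming the
default rbd_data.<image_id> object naming (pool/image names are placeholders,
and listing a large pool is slow):

  rbd info <pool>/<image> | grep block_name_prefix    # e.g. block_name_prefix: rbd_data.1234567890ab
  rados -p <pool> ls | grep '^rbd_data.1234567890ab'  # objects belonging to that image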
IMO raising debug bluestore to 10 (or even 20) and then inspecting the OSD log
is likely the only means to learn which objects the OSD is processing... Be
careful - this produces a significant amount of data and negatively impacts
performance.
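Something along these lines should change the level at runtime and restore it
afterwards (osd.N is a placeholder; the log grows very quickly at level 10+):

  ceph daemon osd.N config set debug_bluestore 10/10
  # ... reproduce the workload, then inspect /var/log/ceph/ceph-osd.N.log ...
  ceph daemon osd.N config set debug_bluestore 1/5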
On a more fundamental level, I'm wondering why RBD images issue sub-object-size
writes at all. I naively assumed that every I/O operation to RBD always implies a
full object write, even when changing just a single byte (thinking of an object as
the equivalent of a sector on a disk, the smallest atomic unit). If this is not
the case, what is the meaning of object size then? How does it influence I/O
patterns? My benchmarks show that object size matters a lot, but it becomes a
bit unclear now why.
I'm not sure I can provide a good enough answer to the above, but I doubt that
RBD unconditionally operates on full objects.
Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Igor Fedotov
Sent: 29 July 2020 16:25:36
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported
Frank,
so you have a pretty high amount of small writes indeed. More than half
of the written volume (in bytes) is done via small writes,
and there are 6x more small requests.
This looks pretty odd for a sequential write pattern and is likely to be
the root cause of that space overhead.
I can see approx. 1.4GB additionally lost on each of these 3 OSDs since the
perf dump reset ( = (allocated_new - stored_new) - (allocated_old -
stored_old)).
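Those numbers come straight from the bluestore perf counters; the difference can
be sampled before and after a test run, e.g. (osd.N is a placeholder, jq assumed
available):

  ceph daemon osd.N perf dump | jq '.bluestore | .bluestore_allocated - .bluestore_stored'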
Below are some speculations on what might be happening, but for sure I
could be wrong/missing something. So please do not consider this a
100% valid analysis.
Client does writes in 1MB chunks. Each is split into 6 EC data chunks (+2
parity chunks), which results in an approx. 170K write to the object store per
chunk ( = 1MB / 6). That corresponds to one 128K big write and one 42K small
tailing one, resulting in 3x64K allocations.
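Spelled out with the numbers above (64K min_alloc_size assumed):

  1 MiB client write / 6 data chunks   ~ 170 KiB per shard
  170 KiB                              = 1 x 128 KiB "big" write + 1 x ~42 KiB "small" tail
  allocation at 64 KiB granularity     = 2 x 64 KiB (for the 128 KiB) + 1 x 64 KiB (for the tail) = 3 x 64 KiB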
The next adjacent client write results in another 128K blob, one more
"small" tailing blob, and a heading blob which partially overlaps with the
previous tailing 42K chunk. Overlapping chunks are expected to be merged,
but presumably this doesn't happen due to that "partial EC overwrites"
issue. So instead an additional 64K blob is allocated for the overlapped range.
I.e. 2x170K writes cause 2x128K blobs, 1x64K tailing blo