Hi Eric,
You say you don't have access to raw drives. What do you mean? Do you
run Ceph OSDs inside VMs? In that case you should probably disable
Micron caches on the hosts, not just in the VMs.
Yes, disabling the write cache only takes effect upon a power cycle... or
upon the next hotplug of th
Hi Philip,
I'm not sure if we're talking about the same thing but I was also
confused when I didn't see 100% OSD drive utilization during my first
RBD write benchmark. Since then I collect all my confusion here
https://yourcmc.ru/wiki/Ceph_performance :)
100% RBD utilization means that somet
Yes, that's it, see the end of the article. You'll have to disable
signature checks, too.
cephx_require_signatures = false
cephx_cluster_require_signatures = false
cephx_sign_messages = false
Hi Vitaliy,
thank you for your time. Do you mean
cephx sign messages = false
with "disable signature
Hi,
we're playing around with Ceph but are not quite happy with the IOPS.
on average 5000 iops / write
on average 13000 iops / read
We're expecting more. :( any ideas or is that all we can expect?
With server SSDs you can expect up to ~10-15k write / ~25k read iops
from a single client.
https
SATA: Micron 5100-5200-5300, Seagate Nytro 1351/1551 (don't forget to
disable their cache with hdparm -W 0)
NVMe: Intel P4500, Micron 9300
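A sketch of the cache-disabling step (the nvme-cli line is my own
addition for NVMe drives and worth verifying against your drive's
documentation; hdparm's setting may also not survive a power cycle, so
reapply it at boot, e.g. from a udev rule):

```shell
# Show the current volatile write cache state (the "write-caching" flag):
hdparm -W /dev/sdX

# Disable it -- this is the step that restores sync write performance
# on the SATA drives listed above:
hdparm -W 0 /dev/sdX

# Rough NVMe equivalent (volatile write cache is NVMe feature 0x06):
nvme set-feature /dev/nvme0n1 -f 6 -v 0
```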
Thanks for all the replies. In summary; consumer grade SSD is a no go.
What is an alternative to the SM863a? Since it is quite hard to get these
due to non-s
Usually it doesn't; it only harms performance, and probably SSD lifetime
too.
I would not run Ceph on SSDs without power-loss protection. It creates
a potential data loss scenario.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists
Use 30 GB for all OSDs. Other values are pointless, because
https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
You can use the rest of free NVMe space for bcache - it's much better
than just allocating it for block.db.
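A minimal bcache setup sketch, assuming the leftover NVMe space is
/dev/nvme0n1p2 and the OSD data disk is /dev/sdb (both device names are
hypothetical):

```shell
# Format the NVMe partition as a cache device and the HDD as a backing device:
make-bcache -C /dev/nvme0n1p2
make-bcache -B /dev/sdb

# Attach the cache set (UUID printed by make-bcache -C) to the backing device:
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

# Optionally switch from the default writethrough to writeback caching:
echo writeback > /sys/block/bcache0/bcache/cache_mode
```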
Hi, sorry to everyone that I post my link again, but
https://yourcmc.ru/wiki/Ceph_performance
Hello Cephers,
The only recommendation I can find about db device selection is about
the capacity (4% of the data disk) in the documents. Are there any
suggestions about technical specs like throughpu
Hi!
Latency doesn't scale with the number of OSDs at all, IOPS scale almost
linearly. IOPS are bounded by CPU usage though. Also a single RBD client
usually doesn't deliver more than 20-30k read iops and 10-15k write
iops.
You can run with more than 1 OSD per drive if you think you have enou
1 NVMe is really only usable as a readonly / writethrough cache (which
should of course be possible with bcache). Nobody wants to lose all data
after 1 disk failure...
Another option is the use of bcache / flashcache.
I have experimented with bcache, it is quite easy to set up, but once
you r
Hi Marc,
Hi Vitaliy, just saw you recommending someone to use SSDs, and wanted
to use the opportunity to thank you for composing this text[0], enjoyed
reading it.
- What do you mean with: bad-SSD-only?
A cluster consisting only of bad SSDs, like desktop ones :) their
latency with fsync is almost
Could performance of Optane + 4x SSDs per node ever exceed that of
pure Optane disks?
No. With Ceph, the results for Optane and just for good server SSDs
are
almost the same. One thing is that you can run more OSDs per an Optane
than per a usual SSD. However, the latency you get from both is a
I can add RAM, and is there a way to increase RocksDB caching? Can I
increase bluestore_cache_size_hdd to a higher value to cache RocksDB?
In recent releases it's governed by the osd_memory_target parameter; in
previous releases it's bluestore_cache_size_hdd. Check the release notes
to know for sure.
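For example (the 8 GiB value is just an illustration; size it to your
RAM budget):

```shell
# Recent releases: per-OSD memory autotuning target.
ceph config set osd osd_memory_target 8589934592    # 8 GiB per OSD

# Older releases: fixed BlueStore cache size for HDD-backed OSDs, in ceph.conf:
#   [osd]
#   bluestore_cache_size_hdd = 4294967296           # 4 GiB
```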
Hi Team,
@vita...@yourcmc.ru , thank you for the information, and could you
please clarify the below queries as well:
1. The average object size we use will be 256KB to 512KB; will there be
a deferred write queue?
With the default settings, no (bluestore_prefer_deferred_size_hdd =
32KB)
Are you su
where small means 32kb or smaller going to BlueStore, so <= 128kb
writes
from the client.
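The 32kb / 128kb relation falls out of the EC fan-out: with k=4 data
chunks (as in the 4+1 profile under discussion), a client write is split
into k chunks, so:

```shell
k=4                       # data chunks in a 4+1 EC profile
cutoff_kb=32              # BlueStore deferred-write threshold per chunk
echo $((cutoff_kb * k))   # -> 128: largest client write (KB) still deferred
```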
Also: please don't do 4+1 erasure coding, see older discussions for
details.
Can you point me to the discussion about the problems of 4+1? It's not
easy to google :)
--
Vitaliy Filippov
1. For 750 object write requests, data is written directly into the data
partition, and since we use EC 4+1 there will be 5 iops across the
cluster for each object write. This makes 750 * 5 = 3750 iops.
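The fan-out arithmetic itself is easy to sanity-check:

```shell
objects=750
shards=5                      # EC 4+1: k + m = 4 + 1 device writes per object
echo $((objects * shards))    # -> 3750 raw write iops across the cluster
```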
Don't forget about the metadata and the deferring of small writes:
deferred write queue + metadata.
Your results are okay-ish. The general rule is that it's hard to achieve
read latencies below 0.5ms and write latencies below 1ms with Ceph, **no
matter what drives or network you use**. 10000 iops with one thread is
0.1 ms per operation. It's just impossible with Ceph currently.
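The iops/latency equivalence at one thread is just arithmetic:

```shell
lat_us=100                    # 0.1 ms per operation, in microseconds
echo $((1000000 / lat_us))    # -> 10000 iops with a single thread
```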
I've heard that some people ma
60 millibits per second? 60 bits every 1000 seconds? Are you serious?
Or did we get the capitalisation wrong?
Assuming 60MB/sec (as 60 Mb/sec would still be slower than the 5MB/sec
I
was getting), maybe there's some characteristic that Bluestore is
particularly dependent on regarding the HDD
We're seeing ~5800 IOPS, ~23 MiB/s on 4 KiB IO (stripe_width 8192) on a
pool that could do 3 GiB/s with 4M blocksize. So, yeah, well, that is
rather harsh, even for EC.
4kb IO is slow in Ceph even without EC. Your 3 GB/s of linear writes
doesn't mean anything. Ceph adds a significant overhead to e
Ok, average network latency from VM to OSD's ~0.4ms.
That's rather bad; you could improve the latency by 0.3ms just by
upgrading the network.
Single threaded performance ~500-600 IOPS - or average latency of 1.6ms
Is that comparable to what other are seeing?
Good "reference" numbers are 0.5ms
Basically they max out at around 1000 IOPS and report 100%
utilization and feel slow.
Haven't seen the 5200 yet.
Micron 5100s perform wonderfully!
You just have to turn their write cache off:
hdparm -W 0 /dev/sdX
1000 IOPS means you haven't done it. Although even with write cache
enabled I o
I bet you'd see better memstore results with my vector based object
implementation instead of bufferlists.
Where can I find it?
Nick Fisk noticed the same
thing you did. One interesting observation he made was that disabling
CPU C/P states helped bluestore immensely in the iodepth=1 case.
T
One way or another we can only have a single thread sending writes to
rocksdb. A lot of the prior optimization work on the write side was
to get as much processing out of the kv_sync_thread as possible.
That's still a worthwhile goal as it's typically what bottlenecks with
high amounts of concur
The amount of metadata depends on the amount of data. But RocksDB only
puts metadata on the fast storage when it thinks all metadata on the
same level of the DB is going to fit there. So all sizes except 4,
30, 286 GB are useless.
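Those numbers follow from Ceph's RocksDB tuning
(max_bytes_for_level_base = 256 MB, level multiplier 10); a rough sketch
of the cumulative level sums, in MB:

```shell
base=256                  # max_bytes_for_level_base in MB (Ceph's default)
mult=10                   # max_bytes_for_level_multiplier
l1=$base
l2=$((l1 * mult))
l3=$((l2 * mult))
echo $((l1 + l2))         # -> 2816 MB: L1+L2, covered by the ~4 GB tier
echo $((l1 + l2 + l3))    # -> 28416 MB (~28 GB): +L3, hence the ~30 GB tier
```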
block.db is very unlikely to ever grow to 250GB with a 6TB data device.
However, there seems to be a funny "issue" with all block.db sizes
except 4, 30, and 286 GB being useless, because RocksDB puts the data on
the fast storage only if it thinks the whole LSM level will fit there.
Ceph's Rock
Decreasing the min_alloc size isn't always a win, but it can be in some
cases. Originally bluestore_min_alloc_size_ssd was set to 4096, but we
increased it to 16384 because at the time our metadata path was slow
and increasing it resulted in a pretty significant performance win
(along with increasin
I create 2-4 RBD images sized 10GB or more with --thick-provision, then
run
fio -ioengine=rbd -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128
-rw=randwrite -pool=rpool -runtime=60 -rbdname=testimg
For each of them at the same time.
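Spelled out as a script (image names testimg1..testimg4 are
hypothetical; the images must already exist and be thick-provisioned):

```shell
# Run one fio job per RBD image in parallel, then sum the reported iops by hand.
for img in testimg1 testimg2 testimg3 testimg4; do
    fio -ioengine=rbd -direct=1 -invalidate=1 -name="$img" -bs=4k -iodepth=128 \
        -rw=randwrite -pool=rpool -runtime=60 -rbdname="$img" &
done
wait
```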
How do you test what total 4Kb random write iops (RB
Yes and no... bluestore doesn't seem to work really optimally. For
example, it has no filestore-like journal waterlining and flushes the
deferred write queue only every 32 writes (deferred_batch_ops). And when
it does that, it's basically waiting for the HDD to commit, slowing down
all further writes.
Hi Dave,
The main line in SSD specs you should look at is
Enhanced Power Loss Data Protection: Yes
This makes the SSD cache non-volatile and makes the SSD ignore fsync()s,
so transactional performance becomes equal to non-transactional. So your
SSDs should be OK for the journal.
rados bench is a bad to
Use the Ubuntu bionic repository; Mimic installs without problems from
there.
You can also build it yourself: all you need is to install gcc-7 and the
other build dependencies, git clone, check out 13.2.2 and say
`dpkg-buildpackage -j4`.
It takes some time, but overall it builds without issues, except
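The steps above, sketched out (the install-deps.sh helper ships in the
Ceph source tree; untested beyond what the message describes):

```shell
git clone https://github.com/ceph/ceph.git
cd ceph
git checkout v13.2.2
git submodule update --init --recursive   # the build needs the bundled submodules
./install-deps.sh                         # installs gcc-7 and other build deps
dpkg-buildpackage -j4                     # produces the .debs one directory up
```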
If you simply multiply the number of objects by the rbd object size,
you will get 7611672*4M ~= 29T, and that is what you should see in the
USED field, and 29/2*3 = 43.5T of raw space.
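The multiplication is easy to verify (integer MiB arithmetic, so rounded
down):

```shell
echo $((7611672 * 4 / 1024 / 1024))   # -> 29 (TiB), matching the USED estimate
```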
Unfortunately I have no idea why they consume less; probably because not
all objects are fully written.
It seems some objects corresp
Hi again.
It seems I've found the problem, although I don't understand the root
cause.
I looked into OSD datastore using ceph-objectstore-tool and I see that
for almost every object there are two copies, like:
2#13:080008d8:::rbd_data.15.3d3e1d6b8b4567.00361a96:28#
2#13:080008d8:::r
Hi,
Can you benchmark your Optane 900P with `fio -fsync=1 -direct=1 -bs=4k
-rw=randwrite -runtime=60`?
It's really interesting to see how much iops it will provide for ceph :)
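A filled-out version of that command (the device path and job name are
hypothetical; note that it writes to the device, so point it at a
scratch partition):

```shell
fio -name=synctest -filename=/dev/nvme0n1p4 -ioengine=libaio -iodepth=1 \
    -fsync=1 -direct=1 -bs=4k -rw=randwrite -runtime=60
```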
--
With best regards,
Vitaliy Filippov