Re: [ceph-users] Instrument librbd+qemu IO from hypervisor
Self-follow-up: the ceph version is 0.80.11 in the cluster I'm working with, so quite old.

Adding:

  admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
  log file = /var/log/ceph/

to /etc/ceph.conf, and then in my case tweaking apparmor (disabling it for the test):

  service apparmor teardown
  service apparmor stop

then stopping a qemu VM:

  virsh stop $instance

restarting libvirt-bin:

  service libvirt-bin restart

and starting the VM again:

  virsh start $instance

allowed me to get at the perf dump data, which seems to contain basically what I need for the moment:

{ "librbd--compute/a43efe1b-461a-4b54-923e-09c2e95da1ba_disk": {
    "rd": 0, "rd_bytes": 0, "rd_latency": { "avgcount": 0, "sum": 0.0},
    "wr": 0, "wr_bytes": 0, "wr_latency": { "avgcount": 0, "sum": 0.0},
    "discard": 0, "discard_bytes": 0, "discard_latency": { "avgcount": 0, "sum": 0.0},
    "flush": 9,
    "aio_rd": 4596, "aio_rd_bytes": 88915968, "aio_rd_latency": { "avgcount": 4596, "sum": 7.335787000},
    "aio_wr": 114, "aio_wr_bytes": 1438720, "aio_wr_latency": { "avgcount": 114, "sum": 0.011218000},
    "aio_discard": 0, "aio_discard_bytes": 0, "aio_discard_latency": { "avgcount": 0, "sum": 0.0},
    "aio_flush": 0, "aio_flush_latency": { "avgcount": 0, "sum": 0.0},
    "snap_create": 0, "snap_remove": 0, "snap_rollback": 0, "notify": 0, "resize": 0},
  "objectcacher-librbd--compute/a43efe1b-461a-4b54-923e-09c2e95da1ba_disk": {
    "cache_ops_hit": 114, "cache_ops_miss": 4458,
    "cache_bytes_hit": 24985600, "cache_bytes_miss": 88279552,
    "data_read": 88764416, "data_written": 1438720, "data_flushed": 1438720,
    "data_overwritten_while_flushing": 0,
    "write_ops_blocked": 0, "write_bytes_blocked": 0, "write_time_blocked": 0.0},
  "objecter": {
    "op_active": 0, "op_laggy": 0, "op_send": 4553, "op_send_bytes": 0, "op_resend": 0,
    "op_ack": 4552, "op_commit": 89, "op": 4553, "op_r": 4464, "op_w": 88, "op_rmw": 1, "op_pg": 0,
    "osdop_stat": 2, "osdop_create": 0, "osdop_read": 4458, "osdop_write": 88, "osdop_writefull": 0,
    "osdop_append": 0, "osdop_zero": 0, "osdop_truncate": 0, "osdop_delete": 0, "osdop_mapext": 0,
    "osdop_sparse_read": 0, "osdop_clonerange": 0, "osdop_getxattr": 0, "osdop_setxattr": 0,
    "osdop_cmpxattr": 0, "osdop_rmxattr": 0, "osdop_resetxattrs": 0, "osdop_tmap_up": 0,
    "osdop_tmap_put": 0, "osdop_tmap_get": 0, "osdop_call": 9, "osdop_watch": 1, "osdop_notify": 0,
    "osdop_src_cmpxattr": 0, "osdop_pgls": 0, "osdop_pgls_filter": 0, "osdop_other": 88,
    "linger_active": 1, "linger_send": 1, "linger_resend": 0,
    "poolop_active": 0, "poolop_send": 0, "poolop_resend": 0,
    "poolstat_active": 0, "poolstat_send": 0, "poolstat_resend": 0,
    "statfs_active": 0, "statfs_send": 0, "statfs_resend": 0,
    "command_active": 0, "command_send": 0, "command_resend": 0,
    "map_epoch": 0, "map_full": 0, "map_inc": 0,
    "osd_sessions": 7140, "osd_session_open": 119, "osd_session_close": 0, "osd_laggy": 1},
  "throttle-msgr_dispatch_throttler-radosclient": {
    "val": 0, "max": 104857600, "get": 4643, "get_sum": 89851514,
    "get_or_fail_fail": 0, "get_or_fail_success": 0, "
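Once the admin socket works, the dump is easy to post-process: per-op average latencies are just sum/avgcount. A small sketch (my own wrapper script, not a Ceph tool) against an excerpt of the dump above:

```python
import json

# Excerpt of the "perf dump" JSON shown above (two counters only, to keep
# the example short); the field names are taken verbatim from the dump.
perf_dump = json.loads("""
{
  "librbd--compute/a43efe1b-461a-4b54-923e-09c2e95da1ba_disk": {
    "wr": 0, "wr_latency": {"avgcount": 0, "sum": 0.0},
    "aio_rd": 4596, "aio_rd_bytes": 88915968,
    "aio_rd_latency": {"avgcount": 4596, "sum": 7.335787}
  }
}
""")

def avg_latency_ms(counters, op):
    """Average latency for one op type in milliseconds ("sum" is seconds)."""
    lat = counters["%s_latency" % op]
    if lat["avgcount"] == 0:
        return 0.0
    return 1000.0 * lat["sum"] / lat["avgcount"]

for volume, counters in perf_dump.items():
    print("%s: aio_rd avg %.2f ms" % (volume, avg_latency_ms(counters, "aio_rd")))
```

From here it is a short step to shipping the numbers per volume to InfluxDB, as the original post asked for.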
[ceph-users] Instrument librbd+qemu IO from hypervisor
Dear fellow cephalopods,

does anyone have any pointers on how to instrument librbd-as-driven-by-qemu IO performance from a hypervisor? Are there less intrusive ways than perf or equivalent? Can librbd be told to dump statistics somewhere (per volume), client-side?

This would come in real handy whilst debugging potential performance issues troubling me. Ideally I'd like to get per-volume metrics out that I can submit to InfluxDB for presentation in Grafana. But I'll take anything.

Best,
Martin

signature.asc
Description: PGP signature

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Luminous CephFS on EC - how?
On Wed, Aug 30, 2017 at 02:06:29PM +0100, John Spray wrote:
> > As I wrote in my ticket there is room for improvement in docs on how to
> > do it and with cli/api rejecting "ceph fs new " with
> > pool1 or pool2 being EC.
>
> The CLI will indeed reject attempts to use an EC pool for metadata,
> and when an EC pool is used for data it verifies that the EC
> overwrites are enabled. This is meant to work, you're just ("just"
> being my understatement of the day) hitting an OSD crash as soon as
> you try and use it!
>
> re. the docs: https://github.com/ceph/ceph/pull/17372 - voila.

Oh, ok, so it *is* supposed to work the way I did it then, with the cephfs base data pool being EC natively. Interesting! Then I'll just hang around for a patch for the crash. :)

Thanks for the clarification & see you in the bug report. ;)

/M
Re: [ceph-users] Luminous CephFS on EC - how?
On Wed, Aug 30, 2017 at 11:06:02AM +0200, Peter Maloney wrote:
> What kind of terrible mail client is this that sends a multipart message where
> one part is blank and that's the one Thunderbird chooses to show? (see
> blankness below)

It's a real email client (mutt) sending text to the mailing list, with an attached PGP signature. The list server does various violence to the emails sent, and your client renders that as it sees fit, I suppose. WFM.

> Yes you're on the right track. As long as the main fs is on a replicated pool
> (the one with omap), the ones below it (using file layouts) can be EC without
> needing a cache pool.

Thanks!

> a quote from your first url:
> http://docs.ceph.com/docs/master/rados/operations/erasure-code/#erasure-coding-with-overwrites
>
> > For Cephfs, using an erasure coded pool means setting that pool in a file
> > layout.

Yeah, that's not very descriptive at all to me without clear examples for EC on the link target.

/M
Re: [ceph-users] Luminous CephFS on EC - how?
Hi,

On Wed, Aug 30, 2017 at 12:28:12PM +0100, John Spray wrote:
> On Wed, Aug 30, 2017 at 7:21 AM, Martin Millnert <mar...@millnert.se> wrote:
> > Hi,
> >
> > what is the proper method to not only setup but also successfully use
> > CephFS on an erasure coded data pool?
> > The docs[1] very vaguely state that erasure coded pools do not support omap
> > operations, hence: "For Cephfs, using an erasure coded pool means setting
> > that pool in a file layout.". The file layout docs say nothing further
> > about this [2]. (I filed a bug[3].)
> >
> > I'm guessing this translates to something along the lines of:
> >
> >   ceph fs new cephfs cephfs_metadata cephfs_replicated_data
> >   ceph fs add_data_pool cephfs cephfs_ec_data
> >
> > And then,
> >
> >   setfattr -n ceph.dir.layout.SOMETHING -v cephfs_ec_data $cephfs_dir
>
> Yep. The SOMETHING is just "pool".

Ok, thanks!

> I see from your ticket that you're getting an OSD crash, which is
> pretty bad news!
> For what it's worth, I have a home cephfs-on-EC configuration that has
> run happily for quite a while, so this can be done -- we just need to
> work out what's making the OSDs crash in this particular case.

Well, my base pool is EC, and I guessed from the log output that that is the root cause of the error, i.e. the list of pending omap operations is too large.

As I wrote in my ticket there is room for improvement in docs on how to do it, and with cli/api rejecting "ceph fs new " with pool1 or pool2 being EC.

/M
[ceph-users] Luminous CephFS on EC - how?
Hi,

what is the proper method to not only setup but also successfully use CephFS on an erasure coded data pool?

The docs[1] very vaguely state that erasure coded pools do not support omap operations, hence: "For Cephfs, using an erasure coded pool means setting that pool in a file layout.". The file layout docs say nothing further about this [2]. (I filed a bug[3].)

I'm guessing this translates to something along the lines of:

  ceph fs new cephfs cephfs_metadata cephfs_replicated_data
  ceph fs add_data_pool cephfs cephfs_ec_data

And then,

  setfattr -n ceph.dir.layout.SOMETHING -v cephfs_ec_data $cephfs_dir

to achieve the inheritance whereby all files under $cephfs_dir use the erasure coded pool afterwards.

Am I on the right track here?

/M

1. http://docs.ceph.com/docs/master/rados/operations/erasure-code/
2. http://docs.ceph.com/docs/master/cephfs/file-layouts/
3. http://tracker.ceph.com/issues/21174
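For what it's worth, the setfattr step above can also be done programmatically, since the layout is exposed as a virtual extended attribute; a minimal Python sketch, where the pool name is the hypothetical one from above and the directory must be inside an actual CephFS mount:

```python
import os

def layout_xattr(pool):
    """(name, value) pair that points a directory's file layout at `pool`."""
    return ("ceph.dir.layout.pool", pool.encode())

def set_ec_data_pool(directory, pool="cephfs_ec_data"):
    """Apply the layout so new files under `directory` land in the EC pool.

    Requires `directory` to live inside a mounted CephFS; on any other
    filesystem the ceph.* virtual xattr does not exist and this raises.
    """
    name, value = layout_xattr(pool)
    os.setxattr(directory, name, value)

# set_ec_data_pool("/mnt/cephfs/ec_dir")  # hypothetical mount point
```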
Re: [ceph-users] Deepscrub IO impact on Jewel: What is osd_op_queue prio implementation?
On Tue, Apr 25, 2017 at 03:39:42PM -0400, Gregory Farnum wrote:
> > I'd like to understand if "prio" in Jewel is as explained, i.e.
> > something similar to the following pseudo code:
> >
> >   if len(subqueue) > 0:
> >       dequeue(subqueue)
> >   if tokens(global) > some_cost:
> >       for queue in queues_high_to_low:
> >           if len(queue) > 0:
> >               dequeue(queue)
> >               tokens = tokens - some_other_cost
> >   else:
> >       for queue in queues_low_to_high:
> >           if len(queue) > 0:
> >               dequeue(queue)
> >               tokens = tokens - some_other_cost
> >   tokens = min(tokens + some_refill_rate, max_tokens)
>
> That looks about right.

OK, thanks for the validation. That has real impact on the entire priority queue under stress, then. (The motivation for WPQ seems clear. :))

> > The objective is to improve servicing of client IO, especially
> > read, while deep scrub is occurring. It doesn't matter for us if a
> > deep-scrub takes x or 3x time, essentially. More consistent latency
> > to clients is more important.
>
> I don't have any experience with SMR drives so it wouldn't surprise me
> if there are some exciting emergent effects with them.

Basically, a very large chunk of disk area needs to be rewritten on each write, so the write amplification factor of an inode update is just silly. They have a PMR buffer area of approx 500 GB, but that area can run out pretty fast during consistent IO over time (the exact buffer management logic is not known).

> But it sounds
> to me like you want to start by adjusting the osd_scrub_priority
> (default 5) and osd_scrub_cost (default 50 << 20, ie 50MB). That will
> directly impact how they move through the queue in relation to client
> ops. (There are also the family of scrub scheduling options, which
> might make sense if you are more tolerant of slow IO at certain times
> of the day/week, but I'm not familiar with them).
> -Greg

Thanks for those pointers!

It seems from a distance that it's necessary to use WPQ if it can be suspected that the IO scheduler is running without available tokens (not sure how to verify *that*).

#ceph also helped point out that I'm indeed missing noatime,nodiratime in the mount options. So every read is causing an inode update, which is extremely expensive on SMR compared with a regular HDD (e.g. PMR). (Not sure how I missed this when I set it up, because I've been aware of noatime earlier. :))

I think that's the first fix we'll want to do, and the biggest source of trouble; we'll look back in a week or so to see how it's doing then. After that we'll look into the various scrub-vs-client op scheduling artefacts.

Thanks!
/M
[ceph-users] Deepscrub IO impact on Jewel: What is osd_op_queue prio implementation?
Hi,

we are experiencing significant impact from deep scrubs on Jewel and have started investigating OP priorities. We use default values on the related/relevant OSD priority settings.

"osd op queue" on http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/#operations states: "The normal queue is different between implementations."

So... in Jewel, where except the code can I learn what the queue behavior is? Is there anyone who's familiar with it?

I'd like to understand if "prio" in Jewel is as explained, i.e. something similar to the following pseudo code:

  if len(subqueue) > 0:
      dequeue(subqueue)
  if tokens(global) > some_cost:
      for queue in queues_high_to_low:
          if len(queue) > 0:
              dequeue(queue)
              tokens = tokens - some_other_cost
  else:
      for queue in queues_low_to_high:
          if len(queue) > 0:
              dequeue(queue)
              tokens = tokens - some_other_cost
  tokens = min(tokens + some_refill_rate, max_tokens)

The background, for anyone interested, is: if it is similar to the above, this would explain extreme OSD commit latencies / client latency. My current theory is that the deep scrub quite possibly is consuming all available tokens, such that when a client op arrives, and priority(client_io) > priority([deep_]scrub), the prio queue essentially inverts and low priority ops get priority over high priority ops.

The OSDs are SMR, but the question here is specifically not how they perform (we're quite intimately aware of their performance profiles), but how to tame Ceph to make the cluster behave as well as possible in the normal case.

I put up some graphs on https://martin.millnert.se/ceph/jewel_prio/ :
- OSD Journal/Commit/Apply latencies show very strong correlation with ongoing deep scrubs.
- When latencies are low and noisy there's essentially no client IO happening.
- There is some evidence the write latency shoots through the roof -- but there isn't much client write occurring... Possibly deep scrub causes disk write IO?
- Mount opts used are: [...] type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

The objective is to improve servicing of client IO, especially read, while deep scrub is occurring. It doesn't matter for us if a deep-scrub takes x or 3x time, essentially. More consistent latency to clients is more important.

Best,
Martin Millnert
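To make the suspected inversion concrete, the pseudo code above can be turned into a runnable sketch. This is my own illustration of the hypothesized behaviour, not Ceph's actual PrioritizedQueue implementation:

```python
from collections import deque

class TokenBucketPrioQueue:
    """Illustration of the suspected 'prio' behaviour: while tokens remain,
    the highest-priority non-empty queue wins; once the bucket is drained,
    the scan order inverts and low-priority (e.g. scrub) ops go first."""

    def __init__(self, priorities, max_tokens=100, refill=10, cost=20):
        self.queues = {p: deque() for p in priorities}
        self.tokens = self.max_tokens = max_tokens
        self.refill, self.cost = refill, cost

    def enqueue(self, prio, op):
        self.queues[prio].append(op)

    def dequeue(self):
        # With tokens available scan high-to-low, otherwise low-to-high.
        order = sorted(self.queues, reverse=(self.tokens > self.cost))
        op = None
        for p in order:
            if self.queues[p]:
                op = self.queues[p].popleft()
                self.tokens -= self.cost
                break
        self.tokens = min(self.tokens + self.refill, self.max_tokens)
        return op
```

With the bucket drained (e.g. by a steady stream of scrub ops), a freshly arrived high-priority client op loses to queued scrub ops, which would match the latency graphs described above.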
Re: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase
On Fri, 2016-07-22 at 08:28 -0400, Jason Dillaman wrote:
> You aren't, by chance, sharing the same RBD image between multiple
> VMs, are you? An order-of-magnitude performance degradation would not
> be unexpected if you have multiple clients concurrently accessing the
> same image with the "exclusive-lock" feature enabled on the image.

No, though I did perform a live migration of the VM between the tests as well. But there is only one client of it.

> 4000 IOPS for 4K random writes also sounds suspiciously high to me.

Are the replica writes of the primary OSD async/parallel?

/M

> On Thu, Jul 21, 2016 at 7:32 PM, Martin Millnert <mar...@millnert.se> wrote:
> > Hi,
> >
> > I just upgraded from Infernalis to Jewel and see an approximate 10x
> > latency increase.
> >
> > Quick facts:
> > - 3x replicated pool
> > - 4x 2x-"E5-2690 v3 @ 2.60GHz", 128GB RAM, 6x 1.6 TB Intel S3610 SSDs
> > - LSI3008 controller with up-to-date firmware and upstream driver, and
> >   up-to-date firmware on the SSDs
> > - 40GbE (Mellanox, with up-to-date drivers & firmware)
> > - CentOS 7.2
> >
> > The physical layer checks out, both iperf3 for the network and e.g. fio
> > over all the SSDs. Not done much Linux tuning yet, but irqbalance does
> > a pretty good job pairing both NIC and HBA with their respective CPUs.
> >
> > In performance hunting mode, and today took the next logical step of
> > upgrading from Infernalis to Jewel.
> >
> > The tester is a remote KVM/Qemu/libvirt guest (openstack) CentOS 7 image
> > with fio. The test scenario is 4K randomwrite, libaio, directIO, QD=1,
> > runtime=900s, test-file-size=40GiB.
> >
> > Went from a picture of [1] to [2]. In [1], the guest saw 98.25% of the
> > I/O complete within a maximum of 250 µsec (~4000 IOPS). This, [2], sees
> > 98.95% of the IO at ~4 msec (actually ~300 IOPS).
> >
> > Between [1] and [2] (simple plots of fio's E2E-latency metrics), the
> > entire cluster including compute node code went from Infernalis to
> > 10.2.2.
> >
> > What's going on here?
> >
> > I haven't tuned the Ceph OSDs either in config or via the Linux kernel
> > at all yet; the upgrade to Jewel came first. I haven't changed any OSD
> > configs between [1] and [2] myself (only minimally before [1], zero
> > effort on performance tuning), other than updating to the Jewel
> > tunables. But the difference is very drastic, wouldn't you say?
> >
> > Best,
> > Martin
> > [1] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test08/ceph-fio-bench_lat.1.png
> > [2] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test10/ceph-fio-bench_lat.1.png
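A note on the numbers above: at QD=1 exactly one op is in flight, so IOPS is simply the reciprocal of the per-op latency. A quick sanity check (~300 IOPS actually corresponds to ~3.3 ms, so the quoted ~4 ms and ~300 IOPS are both rounded figures):

```python
def qd1_iops(latency_s):
    """At queue depth 1 exactly one op is in flight, so IOPS = 1 / latency."""
    return 1.0 / latency_s

print(qd1_iops(250e-6))  # Infernalis case: 250 us per op -> 4000 IOPS
print(qd1_iops(4e-3))    # Jewel case: 4 ms per op -> 250 IOPS (~ the reported ~300)
```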
Re: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase
On Fri, 2016-07-22 at 08:56 +0100, Nick Fisk wrote:
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Martin Millnert
> > Sent: 22 July 2016 00:33
> > To: Ceph Users <ceph-users@lists.ceph.com>
> > Subject: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase
> >
> > [original report snipped]
> >
> > Went from a picture of [1] to [2]. In [1], the guest saw 98.25% of
> > the I/O complete within a maximum of 250 µsec (~4000 IOPS). This, [2],
> > sees 98.95% of the IO at ~4 msec (actually ~300 IOPS).
>
> I would be suspicious that somehow somewhere you had some sort of
> caching going on, in the 1st example.

It wouldn't surprise me either, though to the best of my knowledge I haven't actively configured any such write caching anywhere. I did forget one brief detail regarding the setup: we run 4x OSDs per SSD drive, i.e. roughly 400 GB each. Consistent 4k random-write performance onto /var/lib/ceph/osd-$num/fiotestfile, with a similar test config as above, is 13k IOPS *per partition*.

> 250us is pretty much unachievable for directio writes with Ceph.

Thanks for the feedback, though it's disappointing to hear.

> I've just built some new nodes with the pure goal of crushing
> (excuse the pun) write latency and after extensive tuning can't get
> it much below 600-700us.

Which of the below, or other than the below, have you done, considering the directIO baseline?

- SSD-only hosts
- NIC <-> CPU/NUMA mapping
- HBA <-> CPU/NUMA mapping
- ceph-osd process <-> CPU/NUMA mapping
- Partitioning SSDs into multiple partitions
- Ceph OSD tunings for concurrency (many clients)
- Ceph OSD tunings for latency (many clients)
- Async messenger, new in Jewel (not sure what the impact is), or change/tuning of memory allocator
- RDMA (e.g. Mellanox) messenger

I have yet to iron out precisely what those two OSD tunings would be.

> The 4ms sounds more likely for an untuned cluster. I wonder if any of
> the RBD or qemu cache settings would have changed between versions?

I'm curious about this too. What are the relevant OSD-side configs here? And how do I check what the librbd clients experience? What parameters from e.g. /etc/ceph/$clustername.conf apply to them? I'll have to make another pass over the rbd PRs between Infernalis and 10.2.2, I suppose.
Re: [ceph-users] Ceph OSDs with bcache experience
The thing that worries me with your next-gen design (actually your current design as well) is SSD wear. If you use an Intel SSD at 10 DWPD, that's 12TB/day per 64TB total. I guess it's use-case dependent, and perhaps a 1:4 write:read ratio is quite high in terms of writes as-is.

You're also throughput-limiting yourself to the PCIe bandwidth of the NVMe device (regardless of NVRAM/SSD). Compared to a traditional interface, that may be OK of course in relative terms. NVRAM vs SSD here is simply a choice between wear (NVRAM as journal minimum) and cache hit probability (size).

An interesting thought experiment for me anyway; thanks for sharing, Wido.

/M

-------- Original message --------
From: Wido den Hollander
Date: 20/10/2015 16:00 (GMT+01:00)
To: ceph-users
Subject: [ceph-users] Ceph OSDs with bcache experience

Hi,

In the "newstore direction" thread on ceph-devel I wrote that I'm using bcache in production, and Mark Nelson asked me to share some details.

Bcache is running in two clusters that I manage now, but I'll keep this information to one of them (the one at PCextreme behind CloudStack). This cluster has been running for over 2 years now:

  epoch 284353
  fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
  created 2013-09-23 11:06:11.819520
  modified 2015-10-20 15:27:48.734213

The system consists of 39 hosts in 2U SuperMicro chassis:
* 80GB Intel SSD for OS
* 240GB Intel S3700 SSD for journaling + bcache
* 6x 3TB disk

This isn't the newest hardware. The next batch of hardware will be more disks per chassis, but this is it for now. All systems were installed with Ubuntu 12.04, but they are all running 14.04 now with bcache.

The Intel S3700 SSD is partitioned with a GPT label:
- 5GB journal for each OSD
- 200GB partition for bcache

  root@ceph11:~# df -h | grep osd
  /dev/bcache0  2.8T  1.1T  1.8T  38%  /var/lib/ceph/osd/ceph-60
  /dev/bcache1  2.8T  1.2T  1.7T  41%  /var/lib/ceph/osd/ceph-61
  /dev/bcache2  2.8T  930G  1.9T  34%  /var/lib/ceph/osd/ceph-62
  /dev/bcache3  2.8T  970G  1.8T  35%  /var/lib/ceph/osd/ceph-63
  /dev/bcache4  2.8T  814G  2.0T  30%  /var/lib/ceph/osd/ceph-64
  /dev/bcache5  2.8T  915G  1.9T  33%  /var/lib/ceph/osd/ceph-65

  root@ceph11:~# lsb_release -a
  No LSB modules are available.
  Distributor ID: Ubuntu
  Description:    Ubuntu 14.04.3 LTS
  Release:        14.04
  Codename:       trusty
  root@ceph11:~# uname -r
  3.19.0-30-generic

  "apply_latency": { "avgcount": 2985023, "sum": 226219.891559000 }

What did we notice?
- Fewer spikes on the disks
- Lower commit latencies on the OSDs
- Almost no 'slow requests' during backfills
- Cache-hit ratio of about 60%

Max backfills and recovery active are both set to 1 on all OSDs.

For the next generation of hardware we are looking into 3U chassis with 16 4TB SATA drives and a 1.2TB NVMe SSD for bcache, but we haven't tested those yet, so nothing to say about it. The current setup is 200GB of cache for 18TB of disks; the new setup will be 1200GB for 64TB. Curious to see what that does.

Our main conclusion however is that it does smoothen the I/O pattern towards the disks, and that gives an overall better response from the disks.

Wido
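To put numbers on the wear concern raised above: a 1.2 TB device rated at 10 DWPD sustains 12 TB of writes per day, and that single device fronts 64 TB of disks. A back-of-the-envelope sketch (the 20 TB/day figure below is purely an illustrative assumption, not a measurement from Wido's cluster):

```python
def cache_dwpd(daily_writes_tb, cache_size_tb):
    """Drive-writes-per-day actually seen by the cache device."""
    return daily_writes_tb / cache_size_tb

# Rated endurance of the assumed 10 DWPD, 1.2 TB NVMe cache device:
endurance_tb_per_day = 10 * 1.2
print(endurance_tb_per_day)  # TB of writes per day the device is rated for

# If clients pushed e.g. 20 TB/day of writes through the cache, the
# device would see ~16.7 DWPD, i.e. beyond its rating:
print(cache_dwpd(20, 1.2))
```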
Re: [ceph-users] Ceph OSDs with bcache experience
OK, it seems my Android email client (native Samsung) messed up "In-Reply-To", which confuses some MUAs. Apologies for that.

/M

On Tue, Oct 20, 2015 at 09:45:25PM +0200, Martin Millnert wrote:
> The thing that worries me with your next-gen design (actually your current
> design as well) is SSD wear. If you use Intel SSD at 10 DWPD, that's
> 12TB/day per 64TB total. [rest of quoted thread trimmed]
Re: [ceph-users] Force an OSD to try to peer
On Tue, Mar 31, 2015 at 10:44:51PM +0300, koukou73gr wrote:
> On 03/31/2015 09:23 PM, Sage Weil wrote:
> > It's nothing specific to peering (or ceph). The symptom we've seen is
> > just that bytes stop passing across a TCP connection, usually when
> > there are some largish messages being sent. The ping/heartbeat
> > messages get through because they are small and we disable nagle so
> > they never end up in large frames.
>
> Is there any special route one should take in order to transition a live
> cluster to use jumbo frames and avoid such pitfalls with OSD peering?

1. Configure the entire switch infrastructure for jumbo frames.
2. Enable config versioning of the switch infrastructure configurations.
3. Bonus points: monitor config changes of the switch infrastructure.
4. Run a ping test using e.g. fping from each node to every other node, with large frames.
5. Bonus points: set up such a test in some monitoring infrastructure.
6. Once you trust the config (and monitoring), up all the nodes' MTU to jumbo size, simultaneously.

Step 6 is the critical one, and perhaps it could be further perfected. Ideally you would like an atomic MTU-upgrade command for the entire cluster.

/M
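The rollout above hinges on every node agreeing on the MTU. A minimal sketch of the sanity check implied by steps 4-6; the hostnames are illustrative, and actually gathering the per-node readings (e.g. `ip link` over ssh, or an fping sweep) is left out:

```python
def mtu_mismatches(node_mtus, expected=9000):
    """Return the nodes whose interface MTU deviates from the expected jumbo MTU.

    `node_mtus` maps hostname -> MTU as collected from each node; how the
    readings are collected is up to the monitoring setup.
    """
    return sorted(host for host, mtu in node_mtus.items() if mtu != expected)

# Hypothetical readings: one node was missed during the rollout.
readings = {"ceph01": 9000, "ceph02": 9000, "ceph03": 1500}
print(mtu_mismatches(readings))  # ['ceph03']
```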
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
On Thu, Mar 26, 2015 at 12:36:53PM -0500, Mark Nelson wrote:
> Having said that, small nodes are absolutely more expensive per OSD as
> far as raw hardware and power/cooling goes.

The smaller the volume manufacturers have on the units, the worse the margin typically is (from the buyer's side). Also, CPUs typically run at a premium the higher up the range you go.

I've found a lot of local maxima, optimization-wise, over the past years, both in 12 OSD/U and 18 OSD/U dedicated storage node setups, for instance. There may be local maxima along colocated low-scale storage/compute nodes too, but the one major problem with colocating storage with compute is that you can't efficiently scale compute independently from storage using that building block alone. There may be temporal optimizations in doing so, however (e.g. before you have reached sufficient scale).

There's no single optimal answer when you're dealing with 20+ variables to consider... :)

BR,
Martin
[ceph-users] cold-storage tuning Ceph
Hello list,

I'm currently trying to understand what I can do with Ceph to optimize it for a cold-storage (write-once, read-very-rarely) scenario, trying to compare cost against LTO-6 tape.

There is a single main objective:
- minimal cost/GB/month of operations (including power, DC)

To achieve this, I can break it down to:
- Use the best cost/GB HDD
  * SMR today
- Minimal cost per 3.5"-slot
- Minimal power utilization per drive

While staying within what is available today, I don't imagine powering down individual disk slots using IPMI etc, as some vendors allow. Now, putting Ceph on this, drives will be on, but it would be very useful to be able to spin down drives that aren't used.

It then seems to me that I want to do a few things with Ceph:
- Have only a subset of the cluster 'active' for writes at any point in time
- Yet still have the entire cluster online and available for reads
- Minimize concurrent OSD operations in a node that use RAM, e.g.
  - Scrubbing, with a minimal number of OSDs active (ideally max 1)
- In general, minimize concurrently active OSDs as per above
- Minimize the risk that any type of re-balancing of data occurs at all
  - E.g. use a high number of EC parity chunks

Assuming e.g. 16 drives/host and 10TB drives, at ~100MB/s read and a nearly full cluster, deep scrubbing the host would take 18.5 days. This means roughly 2 deep scrubs per month. Using an EC pool, I wouldn't be very worried about errors, so perhaps that's ok (calculable), but they obviously need to be repaired. Mathematically, I can increase the number of parity chunks to lengthen the interval between deep scrubs.

Is there anyone on the list who can provide some thoughts on the higher-order goal of minimizing concurrently active OSDs in a node? I imagine I need to steer writes towards a subset of the system, but I have no idea how to implement it. Using multiple separate clusters, e.g. with each OSD in a node participating in a unique cluster, could perhaps help.

Any feedback appreciated. It does appear to be a hot topic (pun intended).

Best,
Martin
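The 18.5-day figure above follows directly from capacity over sequential read rate, assuming the OSDs on a host are deep scrubbed strictly one at a time:

```python
def host_deep_scrub_days(drives, drive_tb, read_mb_s):
    """Days to deep-scrub a full host serially, one drive at a time."""
    total_bytes = drives * drive_tb * 1e12
    seconds = total_bytes / (read_mb_s * 1e6)
    return seconds / 86400

# 16 drives/host, 10 TB each, ~100 MB/s sequential read, nearly full:
print(host_deep_scrub_days(16, 10, 100))  # ~18.5 days
```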
[ceph-users] rados -p pool cache-flush-evict-all surprisingly slow
Dear Cephers,

I have a lab setup with 6x dual-socket hosts, 48GB RAM, 2x10Gbps, each equipped with 2x S3700 100GB SSDs and 4x 500GB HDDs, where the HDDs are mapped in a tree under a 'platter' root, similar to the guidance from Seb at http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ , and the SSDs similarly under an 'ssd' root. Replication is set to 3. Journals are on tmpfs (simulating NVRAM).

I have put an ssd pool as a cache tier in front of an hdd pool (rbd), and run fio-rbd against rbd. In the benchmarks, at bs=32kb, QD=128, from a single separate client machine, I reached a peak throughput of around 1.2 GB/s. So there is some capability. IOPS-wise I see a max of around 15k IOPS currently.

After having filled the SSD cache tier, I ran

  rados -p rbd cache-flush-evict-all

and I was expecting to see the 6 SSD OSDs start to evict all the cache-tier PGs to the underlying pool, rbd, which maps to the HDDs. I would have expected parallelism and high throughput, but what I now observe is a ~80 MB/s average flush speed.

Which leads me to the question: is "rados -p pool cache-flush-evict-all" supposed to work in a parallel manner? Cursory viewing in tcpdump suggests to me that the eviction operation is serial, in which case the performance could make a little bit of sense, since it is basically limited by the write speed of a single HDD.

What should I see? If it is indeed a serial operation, is this different from the regular cache tier eviction routines that are triggered by full_ratios, max objects or max storage volume?

Regards,
Martin
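If the eviction really is serial, the observed ~80 MB/s makes the total flush time easy to estimate. A sketch with an illustrative, assumed (not measured) amount of dirty cache data:

```python
def evict_hours(cached_gb, rate_mb_s):
    """Hours to flush/evict a cache tier at a given aggregate rate."""
    return cached_gb * 1e3 / rate_mb_s / 3600.0

# Say 400 GB of data sits in the cache tier (illustrative figure):
print(evict_hours(400, 80))      # serial, at the observed ~80 MB/s
print(evict_hours(400, 6 * 80))  # if all 6 SSD OSDs could flush in parallel
```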