Re: [ceph-users] Panic in kernel CephFS client after kernel update
Thanks! I’ll remove my patch from my local build of the 4.19 kernel and upgrade to 4.19.77. Appreciate the quick fix.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
M: 228.547.8045
15052 Conference Center Dr, Chantilly, VA 20151
perspecta

On Oct 5, 2019, at 7:29 AM, Ilya Dryomov <idryo...@gmail.com> wrote:

On Tue, Oct 1, 2019 at 9:12 PM Jeff Layton <jlay...@kernel.org> wrote:

On Tue, 2019-10-01 at 15:04 -0400, Sasha Levin wrote:

On Tue, Oct 01, 2019 at 01:54:45PM -0400, Jeff Layton wrote:

On Tue, 2019-10-01 at 19:03 +0200, Ilya Dryomov wrote:

On Tue, Oct 1, 2019 at 6:41 PM Kenneth Van Alstyne <kvanalst...@knightpoint.com> wrote:

All:
    I’m not sure whether this should go to LKML or here, but I’ll start here. After upgrading from Linux kernel 4.19.60 to 4.19.75 (or 76), I started running into kernel panics in the “ceph” module. Based on the call trace, I believe I was able to narrow it down to the following commit in the Linux kernel 4.19 source tree:

commit 81281039a673d30f9d04d38659030a28051a
Author: Yan, Zheng <z...@redhat.com>
Date:   Sun Jun 2 09:45:38 2019 +0800

    ceph: use ceph_evict_inode to cleanup inode's resource

    [ Upstream commit 87bc5b895d94a0f40fe170d4cf5771c8e8f85d15 ]

    remove_session_caps() relies on __wait_on_freeing_inode() to wait for a freeing inode to remove its caps. But the VFS wakes freeing-inode waiters before calling destroy_inode().

    Cc: sta...@vger.kernel.org
    Link: https://tracker.ceph.com/issues/40102
    Signed-off-by: "Yan, Zheng" <z...@redhat.com>
    Reviewed-by: Jeff Layton <jlay...@redhat.com>
    Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
    Signed-off-by: Sasha Levin <sas...@kernel.org>

Backing this patch out and recompiling my kernel has since resolved my issues (as far as I can tell thus far). The issue was fairly easy to trigger by simply creating and deleting files; I tested using ‘dd’ and was pretty consistently able to reproduce it. Since the issue occurred in a VM, I do have a screenshot of the crashed machine, and to avoid attaching an image, I’ll link to where the images are: http://kvanals.kvanals.org/.ceph_kernel_panic_images/

Am I way off base, or has anyone else run into this issue?

Hi Kenneth,

This might be a botched backport. The first version of this patch had a conflict with Al's change that introduced ceph_free_inode(), and Zheng had to adjust it for that. However, it looks like it was taken into 4.19 verbatim, even though 4.19 does not have ceph_free_inode(). Zheng, Jeff, please take a look ASAP. (Sorry for the resend -- I got Sasha's old address.)

Thanks Ilya, I think you're right -- this patch should not have been merged on any pre-5.2 kernels. We should go ahead and revert this for now and do a one-off backport for v4.19. Sasha, what do we need to do to make that happen?

I think the easiest would be to just revert the broken one and apply a clean backport, which you'll send me?

Thanks, Sasha. You can revert the old patch as soon as you're ready. It'll take me a bit to put together and test a proper backport, but I'll try to have something ready within the next day or so.

Kenneth, this is now fixed in 4.19.77. Thanks for the report!

Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
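Kenneth mentions the panic was easy to trigger by creating and deleting files with ‘dd’, but his exact commands are not in the thread. The loop below is a hypothetical sketch of a reproducer of that shape; the target directory is an assumption and should point at a kernel-mounted CephFS path (it defaults to a temp directory here only so the script runs anywhere).

```shell
#!/bin/sh
# Hypothetical reproducer sketch: hammer inode creation/eviction, the
# code path the bad backport broke. Point TARGET_DIR at a kernel-mounted
# CephFS path; the default below is just a safe placeholder.
TARGET_DIR="${1:-${TMPDIR:-/tmp}/cephfs-panic-test}"
mkdir -p "$TARGET_DIR"
i=0
while [ "$i" -lt 50 ]; do
    # Write a small file, then immediately delete it to force inode eviction.
    dd if=/dev/zero of="$TARGET_DIR/f$i" bs=1M count=1 2>/dev/null
    rm -f "$TARGET_DIR/f$i"
    i=$((i + 1))
done
echo "completed $i create/delete cycles"
```

On an affected 4.19.75/76 kernel this kind of loop against CephFS was apparently enough to panic the client; on 4.19.77 it should run cleanly.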
[ceph-users] Panic in kernel CephFS client after kernel update
All:
    I’m not sure whether this should go to LKML or here, but I’ll start here. After upgrading from Linux kernel 4.19.60 to 4.19.75 (or 76), I started running into kernel panics in the “ceph” module. Based on the call trace, I believe I was able to narrow it down to the following commit in the Linux kernel 4.19 source tree:

commit 81281039a673d30f9d04d38659030a28051a
Author: Yan, Zheng <z...@redhat.com>
Date:   Sun Jun 2 09:45:38 2019 +0800

    ceph: use ceph_evict_inode to cleanup inode's resource

    [ Upstream commit 87bc5b895d94a0f40fe170d4cf5771c8e8f85d15 ]

    remove_session_caps() relies on __wait_on_freeing_inode() to wait for a freeing inode to remove its caps. But the VFS wakes freeing-inode waiters before calling destroy_inode().

    Cc: sta...@vger.kernel.org
    Link: https://tracker.ceph.com/issues/40102
    Signed-off-by: "Yan, Zheng" <z...@redhat.com>
    Reviewed-by: Jeff Layton <jlay...@redhat.com>
    Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
    Signed-off-by: Sasha Levin <sas...@kernel.org>

Backing this patch out and recompiling my kernel has since resolved my issues (as far as I can tell thus far). The issue was fairly easy to trigger by simply creating and deleting files; I tested using ‘dd’ and was pretty consistently able to reproduce it. Since the issue occurred in a VM, I do have a screenshot of the crashed machine, and to avoid attaching an image, I’ll link to where the images are: http://kvanals.kvanals.org/.ceph_kernel_panic_images/

Am I way off base, or has anyone else run into this issue?

Thanks,

--
Kenneth Van Alstyne
Systems Architect
perspecta
Re: [ceph-users] Ceph capacity versus pool replicated size discrepancy?
Got it! I can calculate individual clone usage using “rbd du”, but does anything exist to show total clone usage across the pool? Otherwise it looks like the phantom space is just missing.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
perspecta

On Aug 13, 2019, at 11:05 PM, Konstantin Shalygin <k0...@k0ste.ru> wrote:

Hey guys, this is probably a really silly question, but I’m trying to reconcile where all of my space has gone in one cluster that I am responsible for. The cluster is made up of 36 2TB SSDs across 3 nodes (12 OSDs per node), all using FileStore on XFS. We are running Ceph Luminous 12.2.8 on this particular cluster. The only pool where data is heavily stored is the “rbd” pool, of which 7.09TiB is consumed. With a replication of “3”, I would expect the raw used to be close to 21TiB, but it’s actually closer to 35TiB. Some additional details are below. Any thoughts?

[cluster] root@dashboard:~# ceph df
GLOBAL:
    SIZE     AVAIL    RAW USED  %RAW USED
    62.8TiB  27.8TiB  35.1TiB   55.81
POOLS:
    NAME                        ID  USED     %USED  MAX AVAIL  OBJECTS
    rbd                         0   7.09TiB  53.76  6.10TiB    3056783
    data                        3   29.4GiB  0.47   6.10TiB    7918
    metadata                    4   57.2MiB  0      6.10TiB    95
    .rgw.root                   5   1.09KiB  0      6.10TiB    4
    default.rgw.control         6   0B       0      6.10TiB    8
    default.rgw.meta            7   0B       0      6.10TiB    0
    default.rgw.log             8   0B       0      6.10TiB    207
    default.rgw.buckets.index   9   0B       0      6.10TiB    0
    default.rgw.buckets.data    10  0B       0      6.10TiB    0
    default.rgw.buckets.non-ec  11  0B       0      6.10TiB    0

[cluster] root@dashboard:~# ceph --version
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

[cluster] root@dashboard:~# ceph osd dump | grep 'replicated size'
pool 0 'rbd' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 682 pgp_num 682 last_change 414873 flags hashpspool min_write_recency_for_promote 1 stripe_width 0 application rbd
pool 3 'data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 682 pgp_num 682 last_change 409614 flags hashpspool crash_replay_interval 45 min_write_recency_for_promote 1 stripe_width 0 application cephfs
pool 4 'metadata' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 682 pgp_num 682 last_change 409617 flags hashpspool min_write_recency_for_promote 1 stripe_width 0 application cephfs
pool 5 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 409 pgp_num 409 last_change 409710 lfor 0/336229 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 409 pgp_num 409 last_change 409711 lfor 0/336232 flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 409 pgp_num 409 last_change 409713 lfor 0/336235 flags hashpspool stripe_width 0 application rgw
pool 8 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 409 pgp_num 409 last_change 409712 lfor 0/336238 flags hashpspool stripe_width 0 application rgw
pool 9 'default.rgw.buckets.index' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 409 pgp_num 409 last_change 409714 lfor 0/336241 flags hashpspool stripe_width 0 application rgw
pool 10 'default.rgw.buckets.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 409 pgp_num 409 last_change 409715 lfor 0/336244 flags hashpspool stripe_width 0 application rgw
pool 11 'default.rgw.buckets.non-ec' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 409 pgp_num 409 last_change 409716 lfor 0/336247 flags hashpspool stripe_width 0 application rgw

[cluster] root@dashboard:~# ceph osd lspools
0 rbd,3 data,4 metadata,5 .rgw.root,6 default.rgw.control,7 default.rgw.meta,8 default.rgw.log,9 default.rgw.buckets.index,10 default.rgw.buckets.data,11 default.rgw.buckets.non-ec,

[cluster] root@dashboard:~#
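On the question of pool-wide clone usage: “rbd du” can also be pointed at a pool rather than a single image, in which case it reports every image (and its snapshots) plus a TOTAL line. This is a sketch using the pool name from the thread; it of course needs an admin host with access to the cluster, so it is shown for reference rather than as something runnable here.

```shell
# Pool-wide usage, including snapshot/clone space, with a TOTAL line.
# 'rbd' is the pool name from this thread.
rbd du --pool rbd

# Scripted alternative: walk the images one at a time, which is useful
# if you want to post-process the per-image USED column.
for img in $(rbd ls --pool rbd); do
    rbd du --pool rbd "$img"
done
```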
[ceph-users] Ceph capacity versus pool replicated size discrepancy?
…c                   0B       0        0        0        0  0  0  0            0B       0            0B
default.rgw.control  0B       8        0        24       0  0  0  0            0B       0            0B
default.rgw.log      0B       207      0        621      0  0  0  21644149     20.6GiB  14422618     0B
default.rgw.meta     0B       0        0        0        0  0  0  0            0B       0            0B
metadata             57.2MiB  95       0        285      0  0  0  780          189MiB   86885        476MiB
rbd                  7.09TiB  3053998  1539909  9161994  0  0  0  23432304830  1.07PiB  11174458128  232TiB

total_objects  3062230
total_used     35.0TiB
total_avail    27.8TiB
total_space    62.8TiB

[cluster] root@dashboard:~# for pool in `rados lspools`; do echo $pool; ceph osd pool get $pool size; echo; done
rbd
size: 3

data
size: 3

metadata
size: 3

.rgw.root
size: 3

default.rgw.control
size: 3

default.rgw.meta
size: 3

default.rgw.log
size: 3

default.rgw.buckets.index
size: 3

default.rgw.buckets.data
size: 3

default.rgw.buckets.non-ec
size: 3

Thanks,

--
Kenneth Van Alstyne
Systems Architect
perspecta
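The discrepancy in this thread can be put in rough numbers: with size 3, the 7.09 TiB in “rbd” plus the small amounts in the other pools should account for about 21.4 TiB raw, yet 35.1 TiB is reported, leaving roughly 13.7 TiB that only clone/snapshot space (plus filesystem overhead) could explain. A back-of-the-envelope check, using only figures from the ‘ceph df’ output above:

```python
# Back-of-the-envelope check of expected vs. reported raw usage,
# using the figures from the 'ceph df' output in this thread.
REPLICATION = 3

# Pool USED figures converted to TiB (only the non-trivial pools matter).
pool_used_tib = {
    "rbd": 7.09,
    "data": 29.4 / 1024,        # 29.4 GiB
    "metadata": 57.2 / 1024**2, # 57.2 MiB
}

expected_raw = sum(pool_used_tib.values()) * REPLICATION
reported_raw = 35.1  # RAW USED from 'ceph df'

print(f"expected raw used: {expected_raw:.2f} TiB")
print(f"reported raw used: {reported_raw:.2f} TiB")
print(f"unaccounted:       {reported_raw - expected_raw:.2f} TiB")
```

The gap of roughly 13.7 TiB (about 4.6 TiB of logical data before replication) is consistent with the 1.5M clone objects visible in the ‘rados df’ output above not being counted in the pool USED column.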
Re: [ceph-users] Data distribution question
Unfortunately it looks like he’s still on Luminous, but if upgrading is an option, the options are indeed significantly better. If I recall correctly, at least the balancer module is available in Luminous.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 9001 / ISO 2 / ISO 27001 / CMMI Level 3

Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, copy, use, disclosure, or distribution is STRICTLY prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.

On Apr 30, 2019, at 12:15 PM, Jack <c...@jack.fr.eu.org> wrote:

Hi,

I see that you are using rgw. RGW comes with many pools, yet most of them are used for metadata and configuration; those do not store much data. Such pools do not need more than a couple of PGs each (I use pg_num = 8). You need to allocate your PGs to the pool that actually stores the data.

Please do the following, to let us know more:

Print the pg_num per pool:
for i in $(rados lspools); do echo -n "$i: "; ceph osd pool get $i pg_num; done

Print the usage per pool:
ceph df

Also, instead of doing a "ceph osd reweight-by-utilization", check out the balancer plugin: http://docs.ceph.com/docs/mimic/mgr/balancer/

Finally, in Nautilus, PGs can now upscale and downscale automatically. See https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/

On 04/30/2019 06:34 PM, Shain Miley wrote:

Hi,

We have a cluster with 235 OSDs running version 12.2.11 with a combination of 4 and 6 TB drives. The data distribution across OSDs varies from 52% to 94%. I have been trying to figure out how to get this a bit more balanced, as we are running into 'backfillfull' issues on a regular basis. I've tried adding more PGs, but this did not seem to do much in terms of the imbalance.

Here is the end output from 'ceph osd df':
MIN/MAX VAR: 0.73/1.31 STDDEV: 7.73

We have 8199 PGs total, with 6775 of them in the pool that has 97% of the data. The other pools are not really used (data, metadata, .rgw.root, .rgw.control, etc). I have thought about deleting those unused pools so that most if not all of the PGs are being used by the pool with the majority of the data. However, before I do that, is there anything else I can do or try in order to see if I can balance out the data more uniformly?

Thanks in advance,
Shain
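Turning on the balancer module that Jack and Kenneth mention is a short sequence on Luminous and later. This is a sketch for reference (it needs an admin host): the choice of mode matters, since ‘upmap’ requires all clients to be Luminous or newer, while ‘crush-compat’ works with older clients.

```shell
# Sketch: enable the ceph-mgr balancer module (available since Luminous).
ceph mgr module enable balancer
ceph balancer mode crush-compat   # or 'upmap' if every client is >= Luminous
ceph balancer on
ceph balancer status              # check the current mode and plan activity
```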
Re: [ceph-users] Data distribution question
Shain:
    Have you looked into doing a “ceph osd reweight-by-utilization” by chance? I’ve found that data distribution is rarely perfect, and on aging clusters I always have to do this periodically.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC

On Apr 30, 2019, at 11:34 AM, Shain Miley <smi...@npr.org> wrote:

Hi,

We have a cluster with 235 OSDs running version 12.2.11 with a combination of 4 and 6 TB drives. The data distribution across OSDs varies from 52% to 94%. I have been trying to figure out how to get this a bit more balanced, as we are running into 'backfillfull' issues on a regular basis. I've tried adding more PGs, but this did not seem to do much in terms of the imbalance.

Here is the end output from 'ceph osd df':
MIN/MAX VAR: 0.73/1.31 STDDEV: 7.73

We have 8199 PGs total, with 6775 of them in the pool that has 97% of the data. The other pools are not really used (data, metadata, .rgw.root, .rgw.control, etc). I have thought about deleting those unused pools so that most if not all of the PGs are being used by the pool with the majority of the data. However, before I do that, is there anything else I can do or try in order to see if I can balance out the data more uniformly?

Thanks in advance,
Shain

--
NPR | Shain Miley | Manager of Infrastructure, Digital Media | smi...@npr.org | 202.513.3649
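reweight-by-utilization has a dry-run counterpart that is worth running first, since it shows which OSDs would be touched without moving any data. A sketch of the usual sequence (admin host required):

```shell
# Dry run: report what reweight-by-utilization *would* change.
ceph osd test-reweight-by-utilization

# Apply, only adjusting OSDs more than 10% above the mean utilization
# (120 is the default threshold; 110 is more aggressive).
ceph osd reweight-by-utilization 110

# Watch recovery/backfill progress as data moves.
ceph -s
```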
Re: [ceph-users] VM management setup
This is purely anecdotal (obviously), but I have found that OpenNebula is not only easy to set up, it is relatively lightweight and has very good Ceph support. 5.8.0 was recently released, but it has a few bugs related to live migrations with Ceph as the backend datastore. You may want to look at 5.6.1 or wait for 5.8.1 to be released, since the issues have already been fixed upstream.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC

On Apr 5, 2019, at 2:34 PM, jes...@krogh.cc wrote:

Hi. I know this is a bit off-topic, but I'm seeking recommendations and advice anyway. We're looking for a "management" solution for VMs - currently in the 40-50 VM range - and would like better tooling for managing them: potentially migrating them across multiple hosts, setting up block devices, etc. This is only to be used internally in a department where a bunch of engineering people will manage it; no customers or anything of that kind. Up until now we have been using virt-manager with KVM and have been quite satisfied while we were at the "few VMs" stage, but it seems like it is time to move on. Thus we're looking for something "simple" that can help manage a Ceph+KVM based setup - the simpler and more to the point, the better. Any recommendations? I've already found a lot of names:

OpenStack
CloudStack
Proxmox

But recommendations are truly welcome. Thanks.
Re: [ceph-users] Nautilus upgrade but older releases reported by features
Anecdotally, I see the same behaviour, but there seem to be no negative side effects. The “jewel” clients below are more than likely the (Linux) kernel client:

[cinder] root@aurae-dashboard:~# ceph features
{
    "mon": [
        { "features": "0x3ffddff8ffac", "release": "luminous", "num": 1 }
    ],
    "mds": [
        { "features": "0x3ffddff8ffac", "release": "luminous", "num": 1 }
    ],
    "osd": [
        { "features": "0x3ffddff8ffac", "release": "luminous", "num": 1 }
    ],
    "client": [
        { "features": "0x27018fb86aa42ada", "release": "jewel", "num": 5 },
        { "features": "0x3ffddff8ffac", "release": "luminous", "num": 8 }
    ],
    "mgr": [
        { "features": "0x3ffddff8ffac", "release": "luminous", "num": 1 }
    ]
}

[cinder] root@aurae-dashboard:~# ceph -s
  cluster:
    id:     650c5366-efa8-4636-a1a1-08740513ac3c
    health: HEALTH_OK
  services:
    mon: 1 daemons, quorum aurae-storage-1 (age 45h)
    mgr: aurae-storage-1(active, since 45h)
    mds: cephfs:1 {0=aurae-storage-1=up:active}
    osd: 1 osds: 1 up (since 45h), 1 in (since 43h)
    rgw: 1 daemon active (radosgw.aurae-storage-1)
  data:
    pools:   10 pools, 832 pgs
    objects: 1.42k objects, 3.0 GiB
    usage:   4.1 GiB used, 91 GiB / 95 GiB avail
    pgs:     832 active+clean
  io:
    client: 36 KiB/s wr, 0 op/s rd, 3 op/s wr

[cinder] root@aurae-dashboard:~# ceph versions
{
    "mon": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 1 },
    "mgr": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 1 },
    "osd": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 1 },
    "mds": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 1 },
    "rgw": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 1 },
    "overall": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 5 }
}

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC

On Mar 27, 2019, at 6:52 AM, John Hearns <hear...@googlemail.com> wrote:

Sure:

# ceph versions
{
    "mon": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 3 },
    "mgr": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 2 },
    "osd": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 12 },
    "mds": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 3 },
    "rgw": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 4 },
    "overall": { "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)": 24 }
}

On Wed, 27 Mar 2019 at 11:20, Konstantin Shalygin <k0...@k0ste.ru> wrote:

We recently updated a cluster to the Nautilus release by updating Debian packages from the Ceph site, then rebooted all servers. 'ceph features' still reports older releases; for example, the osd section:

"osd": [
    { "features": "0x3ffddff8ffac", "release": "luminous", "num": 12 }
]

I think I am not understanding what exactly is meant by "release" here. Can we alter the osd (mon, clients, etc.) such that they report nautilus?

Show your `ceph versions` please.

k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
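The “release” names in ‘ceph features’ are derived from feature bitmasks, so one way to see what the kernel (“jewel”) clients lack relative to the daemons is plain bit arithmetic on the masks from the output above. This is a rough sketch only: mapping individual bits back to feature names requires the feature-bit table in the Ceph source, and kernel-client masks encode bits differently from the release they are labelled with, which is exactly why the label lags.

```python
# Rough sketch: compare feature bitmasks from 'ceph features' output.
# Interpreting individual bit positions requires the Ceph feature-bit
# table; this only shows *which* bits differ.

def missing_bits(server_mask: int, client_mask: int) -> int:
    """Bits set in server_mask but absent from client_mask."""
    return server_mask & ~client_mask

luminous_daemons = 0x3ffddff8ffac    # mon/mgr/osd/mds mask from the thread
kernel_clients = 0x27018fb86aa42ada  # the 'jewel' kernel clients above

print(hex(missing_bits(luminous_daemons, kernel_clients)))
```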
Re: [ceph-users] Filestore OSD on CephFS?
I’d actually rather it not be an extra cluster, but can the destination pool name be different? If not, I have conflicting image names in the “rbd” pool on either side.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC

On Jan 16, 2019, at 9:38 AM, Robert Sander <r.san...@heinlein-support.de> wrote:

On 16.01.19 16:03, Kenneth Van Alstyne wrote:

To be clear, I know the question comes across as ludicrous. It *seems* like this is going to work okay for the light workload use case that I have in mind — I just didn’t want to risk impacting the underlying cluster too much or hit any other caveats that perhaps someone else has run into before.

Why is setting up a distinct pool as destination for your RBD mirrors not an option? Does it have to be an extra cluster?

Regards,

--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
Re: [ceph-users] Filestore OSD on CephFS?
Burkhard:
    Thank you, this is literally what I was looking for. A VM with RBD images attached was my first choice (and what we do for a test and integration lab today), but I am trying to give as much space as possible to the underlying cluster without having to frequently add/remove OSDs and rebalance the “sub-cluster”. I didn’t think about a loopback-mapped file on CephFS — but at that point, to your point, I might as well use RBD. :-)

To be clear, I know the question comes across as ludicrous. It *seems* like this is going to work okay for the light workload use case that I have in mind — I just didn’t want to risk impacting the underlying cluster too much or hit any other caveats that perhaps someone else has run into before. I doubt many people have tried CephFS as a FileStore OSD backend, since in general it seems like a pretty silly idea.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC

On Jan 16, 2019, at 8:27 AM, Burkhard Linke <burkhard.li...@computational.bio.uni-giessen.de> wrote:

Hi,

just some comments: CephFS has an overhead for accessing files (a capabilities round trip to the MDS on first access, cap cache management, a limited number of concurrent caps depending on MDS cache size...), so using a CephFS filesystem as storage for a FileStore OSD will add some extra overhead. I would use a loopback file, since it reduces the CephFS overhead (one file, one cap), but it might also introduce other restrictions, e.g. a fixed file size. If you can use a Ceph cluster as 'backend storage', you can also use an RBD image. This should remove most of the restrictions you have already mentioned (except the fixed size again). You can also use multiple images to have multiple OSDs ;-)

Regards,
Burkhard
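Burkhard's loopback suggestion would look roughly like the sketch below. The paths are hypothetical, and the loop-device and mkfs steps need root, so they are shown commented rather than run; only the sparse-file creation is live.

```shell
#!/bin/sh
# Sketch of a loopback-backed OSD store on CephFS (paths hypothetical).
# In practice BACKING would live on the CephFS mount, e.g. /mnt/cephfs/osd.img;
# a temp path is used here only so the sketch runs anywhere.
BACKING="${TMPDIR:-/tmp}/cephfs-osd-backing.img"

# Sparse file: allocates no space up front, but the size is fixed,
# which is the restriction Burkhard points out.
truncate -s 1G "$BACKING"
stat -c '%s' "$BACKING"

# The remaining steps require root and a real CephFS mount, so they are
# left commented:
# losetup --find --show "$BACKING"      # prints the loop device, e.g. /dev/loop0
# mkfs.xfs /dev/loop0
# mount /dev/loop0 /var/lib/ceph/osd/ceph-0
```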
Re: [ceph-users] Filestore OSD on CephFS?
Marc:
    To clarify, there will be no direct client workload (which is what I mean by “active production workload”), but rather RBD images from a remote cluster imported via either RBD export/import or as an RBD mirror destination. Obviously the best solution is dedicated hardware, but I don’t have that. The single OSD is simply due to the underlying cluster already being either erasure coded or replicated.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC

On Jan 16, 2019, at 8:14 AM, Marc Roos <m.r...@f1-outsourcing.eu> wrote:

How can there be a "catastrophic reason" if you have "no active, production workload"...? Do as you please. I am also using 1x replication for temp and test environments. But if you have only one OSD, why use Ceph? Choose the correct 'tool' for the job.

-----Original Message-----
From: Kenneth Van Alstyne [mailto:kvanalst...@knightpoint.com]
Sent: 16 January 2019 15:04
To: ceph-users
Subject: [ceph-users] Filestore OSD on CephFS?

Disclaimer: Even I will admit that I know this is going to sound like a silly/crazy/insane question, but I have a reason for wanting to do this and asking the question. It's also worth noting that no active, production workload will be used on this "cluster", so I'm worried more about data integrity than performance or availability.

Can anyone think of any catastrophic reason why I cannot use an existing cluster's CephFS filesystem as a single OSD for a small cluster? I've tested it, and it seems to work with the following caveats:

- 50% performance degradation (due to the double-write penalty, since the journal and OSD data are both on the same backing cluster)
- Max object name and namespace length limits, which can be overcome with the following OSD parameters:
  - osd max object name len = 256
  - osd max object namespace len = 64
- Due to the above name/namespace length limits, the cluster should be limited to RBD (which is exactly what I want to do)

Some details of my cluster are below if anyone cares, and I'm getting a consistent, solid, roughly 50% of the underlying cluster's performance in benchmarks using "rados bench":

# ceph --cluster cephfs status
  cluster:
    id:     0f8904ce-754b-48d4-aa58-7ee6fe9e2cca
    health: HEALTH_OK
  services:
    mon:        1 daemons, quorum storage
    mgr:        storage(active)
    osd:        1 osds: 1 up, 1 in
    rbd-mirror: 1 daemon active
  data:
    pools:   1 pools, 32 pgs
    objects: 10 objects, 133 B
    usage:   12 MiB used, 87 GiB / 87 GiB avail
    pgs:     32 active+clean
  io:
    client: 85 B/s wr, 0 op/s rd, 0 op/s wr

# ceph --cluster cephfs versions
{
    "mon": { "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 1 },
    "mgr": { "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 1 },
    "osd": { "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 1 },
    "mds": {},
    "rbd-mirror": { "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 1 },
    "overall": { "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 4 }
}

# ceph --cluster cephfs osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
 0 hdd   0.08510 1.0      87 GiB 16 MiB 87 GiB 0.02 1.00  32
              TOTAL 87 GiB 16 MiB 87 GiB 0.02
MIN/MAX VAR: 1.00/1.00 STDDEV: 0

# ceph --cluster cephfs df
GLOBAL:
    SIZE   AVAIL  RAW USED %RAW USED
    87 GiB 87 GiB 16 MiB   0.02
POOLS:
    NAME ID USED  %USED MAX AVAIL OBJECTS
    rbd  1  133 B 0     83 GiB    10

# df -h /var/lib/ceph/osd/cephfs-0/
Filesystem            Size  Used Avail Use% Mounted on
10.0.0.1:/ceph-remote  87G   12M   87G   1% /var/lib/ceph

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
[ceph-users] Filestore OSD on CephFS?
Disclaimer: Even I will admit that I know this is going to sound like a silly/crazy/insane question, but I have a reason for wanting to do this and for asking the question. It’s also worth noting that no active, production workload will be run on this “cluster”, so I’m worried more about data integrity than performance or availability. Can anyone think of any catastrophic reason why I cannot use an existing cluster’s CephFS filesystem as a single OSD for a small cluster? I’ve tested it and it seems to work, with the following caveats:

- 50% performance degradation (due to the double-write penalty, since the journal and the OSD data are both on the same backing cluster)
- Max object name and namespace length limits, which can be overcome with the following OSD parameters:
  - osd max object name len = 256
  - osd max object namespace len = 64
- Due to the above name/namespace length limits, the cluster should be limited to RBD (which is exactly what I want to do)

Some details of my cluster are below, if anyone cares. I’m getting a consistent, solid roughly 50% of the underlying cluster’s performance in benchmarks using “rados bench”:

# ceph --cluster cephfs status
  cluster:
    id:     0f8904ce-754b-48d4-aa58-7ee6fe9e2cca
    health: HEALTH_OK
  services:
    mon:        1 daemons, quorum storage
    mgr:        storage(active)
    osd:        1 osds: 1 up, 1 in
    rbd-mirror: 1 daemon active
  data:
    pools:   1 pools, 32 pgs
    objects: 10 objects, 133 B
    usage:   12 MiB used, 87 GiB / 87 GiB avail
    pgs:     32 active+clean
  io:
    client: 85 B/s wr, 0 op/s rd, 0 op/s wr

# ceph --cluster cephfs versions
{
    "mon": { "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 1 },
    "mgr": { "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 1 },
    "osd": { "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 1 },
    "mds": {},
    "rbd-mirror": { "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 1 },
    "overall": { "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 4 }
}

# ceph --cluster cephfs osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
 0 hdd   0.08510 1.0      87 GiB 16 MiB 87 GiB 0.02 1.00  32
              TOTAL 87 GiB 16 MiB 87 GiB 0.02
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

# ceph --cluster cephfs df
GLOBAL:
    SIZE   AVAIL  RAW USED %RAW USED
    87 GiB 87 GiB 16 MiB   0.02
POOLS:
    NAME ID USED  %USED MAX AVAIL OBJECTS
    rbd  1  133 B 0     83 GiB    10

# df -h /var/lib/ceph/osd/cephfs-0/
Filesystem            Size Used Avail Use% Mounted on
10.0.0.1:/ceph-remote 87G  12M  87G   1%   /var/lib/ceph

Thanks,
--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 9001 / ISO 2 / ISO 27001 / CMMI Level 3

Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, copy, use, disclosure, or distribution is STRICTLY prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
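[Editor’s note: the two OSD parameters mentioned in the caveats would normally go in the small cluster’s ceph.conf; a minimal sketch, with the [osd] section placement standard but the exact file layout for this setup an assumption:]

```ini
# ceph.conf of the inner cluster whose OSD is backed by CephFS
# (hypothetical placement; option names as given in the post)
[osd]
osd max object name len = 256
osd max object namespace len = 64
```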
Re: [ceph-users] RBD Mirror Proxy Support?
D’oh! I was hoping that the destination pools could have unique names, regardless of the source pool name.

Thanks,
--
Kenneth Van Alstyne

On Jan 14, 2019, at 11:07 AM, Jason Dillaman wrote:

On Mon, Jan 14, 2019 at 11:09 AM Kenneth Van Alstyne wrote:

In this case, I’m imagining Clusters A/B both having write access to a third “Cluster C”. So A/B -> C, rather than A -> C -> B, B -> C -> A, or A -> B -> C. I admit, in the event that I need to replicate back to either primary cluster, there may be challenges.

While this is possible, in addition to the failback question, you would also need to use unique pool names in clusters A and B, since on cluster C you are currently prevented from adding more than a single peer per pool.
Thanks,
--
Kenneth Van Alstyne

On Jan 14, 2019, at 9:50 AM, Jason Dillaman wrote:

On Mon, Jan 14, 2019 at 10:10 AM Kenneth Van Alstyne wrote:

Thanks for the reply Jason — I was actually thinking of emailing you directly, but thought it may be beneficial to keep the conversation on the list so that everyone can see the thread. Can you think of a reason why one-way RBD mirroring would not work to a shared tertiary cluster? I need to build out a test lab to see how that would work for us.

I guess I don't understand what the tertiary cluster is doing? If the goal is to replicate from cluster A -> cluster B -> cluster C, that is not currently supported, since (by design choice) we don't currently re-write the RBD image journal entries from the source cluster to the destination cluster but instead just directly apply the journal entries to the destination image (to save IOPS).
Thanks,
--
Kenneth Van Alstyne

On Jan 12, 2019, at 4:01 PM, Jason Dillaman wrote:

On Fri, Jan 11, 2019 at 2:09 PM Kenneth Van Alstyne wrote:

Hello all (and maybe this would be better suited for the ceph devel mailing list): I’d like to use RBD mirroring between two sites (to each other), but I have the following limitations:
- The clusters use the same name (“ceph”)

That's actually not an issue. The "ceph" name is used to locate configuration files for RBD mirroring (a la /etc/ceph/.conf and /etc/ceph/.client..keyring). You just need to map that cluster config file name to the remote cluster name in the RBD mirroring configuration. Additionally, starting with Nautilus, the configuration details for connecting to a remote cluster can now be stored in the monitor (via the rbd CLI and dashboard), so there won't be any need to fiddle with configuration files for remote clusters anymore.

- The clusters share IP address space on
Re: [ceph-users] RBD Mirror Proxy Support?
In this case, I’m imagining Clusters A/B both having write access to a third “Cluster C”. So A/B -> C, rather than A -> C -> B, B -> C -> A, or A -> B -> C. I admit, in the event that I need to replicate back to either primary cluster, there may be challenges.

Thanks,
--
Kenneth Van Alstyne

On Jan 14, 2019, at 9:50 AM, Jason Dillaman wrote:

On Mon, Jan 14, 2019 at 10:10 AM Kenneth Van Alstyne wrote:

Thanks for the reply Jason — I was actually thinking of emailing you directly, but thought it may be beneficial to keep the conversation on the list so that everyone can see the thread. Can you think of a reason why one-way RBD mirroring would not work to a shared tertiary cluster? I need to build out a test lab to see how that would work for us.

I guess I don't understand what the tertiary cluster is doing? If the goal is to replicate from cluster A -> cluster B -> cluster C, that is not currently supported, since (by design choice) we don't currently re-write the RBD image journal entries from the source cluster to the destination cluster but instead just directly apply the journal entries to the destination image (to save IOPS).
Thanks,
--
Kenneth Van Alstyne

On Jan 12, 2019, at 4:01 PM, Jason Dillaman wrote:

On Fri, Jan 11, 2019 at 2:09 PM Kenneth Van Alstyne wrote:

Hello all (and maybe this would be better suited for the ceph devel mailing list): I’d like to use RBD mirroring between two sites (to each other), but I have the following limitations:
- The clusters use the same name (“ceph”)

That's actually not an issue. The "ceph" name is used to locate configuration files for RBD mirroring (a la /etc/ceph/.conf and /etc/ceph/.client..keyring). You just need to map that cluster config file name to the remote cluster name in the RBD mirroring configuration. Additionally, starting with Nautilus, the configuration details for connecting to a remote cluster can now be stored in the monitor (via the rbd CLI and dashboard), so there won't be any need to fiddle with configuration files for remote clusters anymore.

- The clusters share IP address space on a private, non-routed storage network

Unfortunately, that is an issue, since the rbd-mirror daemon needs to be able to connect to both clusters. If the two clusters are at least on different subnets and your management servers can talk to each side, you might be able to run the rbd-mirror daemon there.
There are management servers on each side that can talk to the respective storage networks, but the storage networks cannot talk directly to each other. I recall reading, some years back, of possibly adding support for an RBD mirror proxy, which would potentially solve my issues. Has anything been done in this regard?

No, I haven't really seen much demand for such support, so it's never bubbled up as a priority yet.

If not, is my best bet perhaps a tertiary cluster that both can reach and do one-way replication to?

Thanks,
--
Kenneth Van Alstyne
Re: [ceph-users] RBD Mirror Proxy Support?
Thanks for the reply Jason — I was actually thinking of emailing you directly, but thought it may be beneficial to keep the conversation on the list so that everyone can see the thread. Can you think of a reason why one-way RBD mirroring would not work to a shared tertiary cluster? I need to build out a test lab to see how that would work for us.

Thanks,
--
Kenneth Van Alstyne

On Jan 12, 2019, at 4:01 PM, Jason Dillaman wrote:

On Fri, Jan 11, 2019 at 2:09 PM Kenneth Van Alstyne wrote:

Hello all (and maybe this would be better suited for the ceph devel mailing list): I’d like to use RBD mirroring between two sites (to each other), but I have the following limitations:
- The clusters use the same name (“ceph”)

That's actually not an issue. The "ceph" name is used to locate configuration files for RBD mirroring (a la /etc/ceph/.conf and /etc/ceph/.client..keyring). You just need to map that cluster config file name to the remote cluster name in the RBD mirroring configuration.
Additionally, starting with Nautilus, the configuration details for connecting to a remote cluster can now be stored in the monitor (via the rbd CLI and dashboard), so there won't be any need to fiddle with configuration files for remote clusters anymore.

- The clusters share IP address space on a private, non-routed storage network

Unfortunately, that is an issue, since the rbd-mirror daemon needs to be able to connect to both clusters. If the two clusters are at least on different subnets and your management servers can talk to each side, you might be able to run the rbd-mirror daemon there.

There are management servers on each side that can talk to the respective storage networks, but the storage networks cannot talk directly to each other. I recall reading, some years back, of possibly adding support for an RBD mirror proxy, which would potentially solve my issues. Has anything been done in this regard?

No, I haven't really seen much demand for such support, so it's never bubbled up as a priority yet.

If not, is my best bet perhaps a tertiary cluster that both can reach and do one-way replication to?

Thanks,
--
Kenneth Van Alstyne
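[Editor’s note: a minimal sketch of the pre-Nautilus file layout Jason describes for two clusters that are both named "ceph" internally. The alias "remote" and the keyring filename are illustrative assumptions, not from the post:]

```
# On the rbd-mirror host, the peer cluster is referenced through a
# locally chosen config-file alias rather than its internal name:
#
#   /etc/ceph/ceph.conf                          local cluster ("ceph")
#   /etc/ceph/remote.conf                        peer cluster's monitors
#   /etc/ceph/remote.client.rbd-mirror.keyring   peer credentials
#
# The alias is then what you pass as the cluster name, e.g.:
#   rbd --cluster remote mirror pool info rbd
```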
--
Jason
[ceph-users] RBD Mirror Proxy Support?
Hello all (and maybe this would be better suited for the ceph devel mailing list): I’d like to use RBD mirroring between two sites (to each other), but I have the following limitations:
- The clusters use the same name (“ceph”)
- The clusters share IP address space on a private, non-routed storage network

There are management servers on each side that can talk to the respective storage networks, but the storage networks cannot talk directly to each other. I recall reading, some years back, of possibly adding support for an RBD mirror proxy, which would potentially solve my issues. Has anything been done in this regard? If not, is my best bet perhaps a tertiary cluster that both can reach and do one-way replication to?

Thanks,
--
Kenneth Van Alstyne
Re: [ceph-users] Image has watchers, but cannot determine why
Thanks for the reply — I was pretty darn sure, since I live migrated all VMs off of that box and then killed everything but a handful of system processes (init, sshd, etc.) and the watcher was STILL present. In saying that, I halted the machine (since nothing was running on it any longer) and the watcher did indeed go away and I was able to remove the images. Very, very strange. (But situation solved… except I don’t know what the cause was, really.)

Thanks,
--
Kenneth Van Alstyne

On Jan 10, 2019, at 4:03 AM, Ilya Dryomov wrote:

On Wed, Jan 9, 2019 at 5:17 PM Kenneth Van Alstyne wrote:

Hey folks, I’m looking into what I would think would be a simple problem, but is turning out to be more complicated than I would have anticipated. A virtual machine managed by OpenNebula was blown away, but the backing RBD images remain. Upon investigating, it appears that the images still have watchers on the KVM node that that VM previously lived on. I can confirm that there are no mapped RBD images on the machine and the qemu-system-x86_64 process is indeed no longer running. Any ideas?
Additional details are below:

# rbd info one-73-145-10
rbd image 'one-73-145-10':
        size 1024 GB in 262144 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.27174d6b8b4567
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        flags:
        parent: rbd/one-73@snap
        overlap: 102400 kB

# rbd status one-73-145-10
Watchers:
        watcher=10.0.235.135:0/3820784110 client.33810559 cookie=140234310778880

# rados -p rbd listwatchers rbd_header.27174d6b8b4567
watcher=10.0.235.135:0/3820784110 client.33810559 cookie=140234310778880

This appears to be a RADOS (i.e. not a kernel client) watch. Are you sure that nothing of the sort is running on that node? In order for the watch to stay live, the watcher has to send periodic ping messages to the OSD. Perhaps determine the primary OSD with "ceph osd map rbd rbd_header.27174d6b8b4567", set debug_ms to 1 on that OSD, and monitor the log for a few minutes?

Thanks,

                Ilya
[ceph-users] Image has watchers, but cannot determine why
Hey folks, I’m looking into what I would think would be a simple problem, but is turning out to be more complicated than I would have anticipated. A virtual machine managed by OpenNebula was blown away, but the backing RBD images remain. Upon investigating, it appears that the images still have watchers on the KVM node that that VM previously lived on. I can confirm that there are no mapped RBD images on the machine and the qemu-system-x86_64 process is indeed no longer running. Any ideas? Additional details are below:

# rbd info one-73-145-10
rbd image 'one-73-145-10':
        size 1024 GB in 262144 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.27174d6b8b4567
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        flags:
        parent: rbd/one-73@snap
        overlap: 102400 kB

# rbd status one-73-145-10
Watchers:
        watcher=10.0.235.135:0/3820784110 client.33810559 cookie=140234310778880

# rados -p rbd listwatchers rbd_header.27174d6b8b4567
watcher=10.0.235.135:0/3820784110 client.33810559 cookie=140234310778880

# ip addr show | grep -i 10.0.235.135
    inet 10.0.235.135/16 scope global i-storage

# rbd showmapped

# ps -efww | grep -i qemu | grep -i rbd | grep -i 145

# ceph version
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)

Thanks,
--
Kenneth Van Alstyne
Re: [ceph-users] Anyone tested Samsung 860 DCT SSDs?
Thanks for the feedback, everyone. Based on the TBW figures, it sounds like these drives are terrible for us, as the idea is NOT to use them simply for archive. This will be a high read/write workload, so that's totally a show stopper. I’m interested in the Seagate Nytro myself.

Thanks,
--
Kenneth Van Alstyne

> On Oct 12, 2018, at 9:31 AM, Corin Langosch wrote:
>
> Hi
>
> It has only TBW of 349 TB, so might die quite soon. But what about the
> "Seagate Nytro 1551 DuraWrite 3DWPD Mainstream Endurance 960GB, SATA"?
> Seems really cheap too and has TBW 5.25PB. Anybody tested that? What
> about (RBD) performance?
>
> Cheers
> Corin
>
> On Fri, 2018-10-12 at 13:53 +, Kenneth Van Alstyne wrote:
>> Cephers:
>> As the subject suggests, has anyone tested Samsung 860 DCT
>> SSDs? They are really inexpensive and we are considering buying some
>> to test.
>>
>> Thanks,
>>
>> --
>> Kenneth Van Alstyne
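[Editor’s note: the endurance figures being compared here are plain drive-writes-per-day arithmetic. A quick sketch of the conversion; the five-year warranty window is an assumption, though it does reproduce the 5.25 PB figure quoted for the 960 GB Nytro:]

```python
def tbw_from_dwpd(dwpd: float, capacity_tb: float, warranty_years: float = 5.0) -> float:
    """Total terabytes written implied by a drive-writes-per-day rating."""
    return dwpd * capacity_tb * 365 * warranty_years

# 3 DWPD on a 0.96 TB drive over an assumed 5-year warranty:
nytro_tbw = tbw_from_dwpd(3, 0.96)   # 5256 TB, i.e. ~5.25 PB

# Going the other way, the 860 DCT's quoted 349 TBW on a 960 GB drive
# works out to roughly 0.2 drive writes per day over the same window:
dct_dwpd = 349 / (0.96 * 365 * 5)
```

Which makes concrete why the 860 DCT reads as an archive drive and the Nytro as a mixed-workload one.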
[ceph-users] Anyone tested Samsung 860 DCT SSDs?
Cephers: As the subject suggests, has anyone tested Samsung 860 DCT SSDs? They are really inexpensive and we are considering buying some to test.

Thanks,
--
Kenneth Van Alstyne
Re: [ceph-users] OSD Crash When Upgrading from Jewel to Luminous?
After looking into this further, is it possible that adjusting the CRUSH weight of the OSDs while running mis-matched versions of the ceph-osd daemon across the cluster can cause this issue? Under certain circumstances in our cluster, this may happen automatically on the backend. I can’t duplicate the issue in a lab, but highly suspect this is what happened.

Thanks,
--
Kenneth Van Alstyne

On Aug 17, 2018, at 4:01 PM, Gregory Farnum wrote:

Do you have more logs that indicate what state machine event the crashing OSDs received? This obviously shouldn't have happened, but it's a plausible failure mode, especially if it's a relatively rare combination of events.
-Greg

On Fri, Aug 17, 2018 at 4:49 PM Kenneth Van Alstyne wrote:

Hello all: I ran into an issue recently with one of my clusters when upgrading from 10.2.10 to 12.2.7. I have previously tested the upgrade in a lab and upgraded one of our five production clusters with no issues.
On the second cluster, however, I ran into an issue where all OSDs that were NOT running Luminous yet (which was about 40% of the cluster at the time) all crashed with the same backtrace, which I have pasted below:

===
0> 2018-08-13 17:35:13.160849 7f145c9ec700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state::my_context)' thread 7f145c9ec700 time 2018-08-13 17:35:13.157319
osd/PG.cc: 5860: FAILED assert(0 == "we got a bad state machine event")

ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0x55b9bf08614f]
2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state, (boost::statechart::history_mode)0>::my_context)+0xc4) [0x55b9bea62db4]
3: (()+0x447366) [0x55b9bea9a366]
4: (boost::statechart::simple_state, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x2f7) [0x55b9beac8b77]
5: (boost::statechart::state_machine, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x55b9beaab5bb]
6: (PG::handle_peering_event(std::shared_ptr, PG::RecoveryCtx*)+0x384) [0x55b9bea7db14]
7: (OSD::process_peering_events(std::__cxx11::list > const&, ThreadPool::TPHandle&)+0x263) [0x55b9be9d1723]
8: (ThreadPool::BatchWorkQueue::_void_process(void*, ThreadPool::TPHandle&)+0x2a) [0x55b9bea1274a]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb0) [0x55b9bf076d40]
10: (ThreadPool::WorkThread::entry()+0x10) [0x55b9bf077ef0]
11: (()+0x7507) [0x7f14e2c96507]
12: (clone()+0x3f) [0x7f14e0ca214f]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
===

Once I restarted the impacted OSDs, which brought them up to 12.2.7, everything recovered just fine and the cluster is healthy.
The only rub is that losing that many OSDs simultaneously caused a significant I/O disruption to the production servers for several minutes while I brought up the remaining OSDs. I have been trying to duplicate this issue in a lab again before continuing the upgrades on the other three clusters, but am coming up short. Has anyone seen anything like this and am I missing something obvious? Given how quickly the issue happened and the fact that I’m having a hard time reproducing this issue, I am limited in the amount of logging and debug information I have available, unfortunately. If it helps, all ceph-mon, ceph-mds, radosgw, and ceph-mgr daemons were running 12.2.7, while 30 of the 50 total ceph-osd daemons were also on 12.2.7 when the remaining 20 ceph-osd daemons (on 10.2.10) crashed.

Thanks,
--
Kenneth Van Alstyne
[ceph-users] OSD Crash When Upgrading from Jewel to Luminous?
Hello all: I ran into an issue recently with one of my clusters when upgrading from 10.2.10 to 12.2.7. I have previously tested the upgrade in a lab and upgraded one of our five production clusters with no issues. On the second cluster, however, I ran into an issue where all OSDs that were NOT running Luminous yet (which was about 40% of the cluster at the time) all crashed with the same backtrace, which I have pasted below:

===
0> 2018-08-13 17:35:13.160849 7f145c9ec700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state::my_context)' thread 7f145c9ec700 time 2018-08-13 17:35:13.157319
osd/PG.cc: 5860: FAILED assert(0 == "we got a bad state machine event")

ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0x55b9bf08614f]
2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state, (boost::statechart::history_mode)0>::my_context)+0xc4) [0x55b9bea62db4]
3: (()+0x447366) [0x55b9bea9a366]
4: (boost::statechart::simple_state, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x2f7) [0x55b9beac8b77]
5: (boost::statechart::state_machine, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x55b9beaab5bb]
6: (PG::handle_peering_event(std::shared_ptr, PG::RecoveryCtx*)+0x384) [0x55b9bea7db14]
7: (OSD::process_peering_events(std::__cxx11::list > const&, ThreadPool::TPHandle&)+0x263) [0x55b9be9d1723]
8: (ThreadPool::BatchWorkQueue::_void_process(void*, ThreadPool::TPHandle&)+0x2a) [0x55b9bea1274a]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb0) [0x55b9bf076d40]
10: (ThreadPool::WorkThread::entry()+0x10) [0x55b9bf077ef0]
11: (()+0x7507) [0x7f14e2c96507]
12: (clone()+0x3f) [0x7f14e0ca214f]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
===

Once I restarted the impacted OSDs, which brought them up to 12.2.7, everything recovered just fine and the cluster is healthy. The only rub is that losing that many OSDs simultaneously caused a significant I/O disruption to the production servers for several minutes while I brought up the remaining OSDs.

I have been trying to duplicate this issue in a lab again before continuing the upgrades on the other three clusters, but am coming up short. Has anyone seen anything like this, and am I missing something obvious? Given how quickly the issue happened and the fact that I’m having a hard time reproducing this issue, I am limited in the amount of logging and debug information I have available, unfortunately. If it helps, all ceph-mon, ceph-mds, radosgw, and ceph-mgr daemons were running 12.2.7, while 30 of the 50 total ceph-osd daemons were also on 12.2.7 when the remaining 20 ceph-osd daemons (on 10.2.10) crashed.

Thanks,
--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 2 / ISO 27001 / CMMI Level 3

Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, copy, use, disclosure, or distribution is STRICTLY prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
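For what it's worth, a hedged sketch of common upgrade hygiene for mixed-version restarts follows. It would not have prevented the assert itself, but setting the `noout` flag limits the rebalancing churn (and some of the I/O disruption) while OSDs are restarted. The commands are prefixed with `echo` so the sketch is a dry run that prints the commands; drop the prefix to execute against a real cluster.

```shell
#!/bin/sh
# Hedged sketch: upgrade hygiene for mixed-version OSD restarts.
# CEPH is set to "echo ceph" so this prints commands instead of
# running them; remove the "echo" on a live cluster.
CEPH="echo ceph"

# Keep CRUSH from marking restarting OSDs "out" and triggering recovery
$CEPH osd set noout

# Check which daemons still report the old release before proceeding
$CEPH tell 'osd.*' version

# ...restart/upgrade the remaining OSDs one host at a time here...

# Restore normal out-marking once every OSD reports the new version
$CEPH osd unset noout
```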
[ceph-users] Snapshot cleanup performance impact on client I/O?
Hey folks:

I was wondering if the community can provide any advice — over time and due to some external issues, we have managed to accumulate thousands of snapshots of RBD images, which are now in need of cleaning up. I have recently attempted to roll through a “for” loop to perform a “rbd snap rm” on each snapshot, sequentially, waiting until the rbd command finishes before moving onto the next one, of course. I noticed that shortly after starting this, I started seeing thousands of slow ops, and a few of our guest VMs became unresponsive, naturally.

My questions are:
- Is this expected behavior?
- Is the background cleanup asynchronous from the “rbd snap rm” command?
- If so, are there any OSD parameters I can set to reduce the impact on production?
- Would “rbd snap purge” be any different? I expect not, since fundamentally, rbd is performing the same action that I do via the loop.

Relevant details are as follows, though I’m not sure cluster size *really* has any effect here:
- Ceph: version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
- 5 storage nodes, each with:
  - 10x 2TB 7200 RPM SATA Spindles (for a total of 50 OSDs)
  - 2x Samsung MZ7LM240 SSDs (used as journal for the OSDs)
  - 64GB RAM
  - 2x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz
  - 20GBit LACP Port Channel via Intel X520 Dual Port 10GbE NIC

Let me know if I’ve missed something fundamental.

Thanks,
--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
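On the third question above, one OSD knob worth trying is `osd_snap_trim_sleep`, which inserts a delay between snap-trim operations so background trimming yields to client I/O. A hedged dry-run sketch follows; the value 0.1 is illustrative rather than a recommendation, the pool, image, and snapshot names are placeholders, and the commands are prefixed with `echo` so the sketch prints instead of executing.

```shell
#!/bin/sh
# Hedged sketch of a throttled snapshot cleanup.  RBD/CEPH are set to
# "echo ..." so this is a dry run; drop the "echo" on a live cluster.
RBD="echo rbd"
CEPH="echo ceph"
POOL=rbd
IMAGE=vm-disk-01           # placeholder image name
PAUSE=0                    # seconds between deletions; raise to e.g. 30 in a real run

# Ask every OSD to sleep between snap-trim operations (runtime
# injection; reverts when the OSD restarts)
$CEPH tell 'osd.*' injectargs '--osd_snap_trim_sleep 0.1'

# Delete snapshots one at a time, pausing so trimming can catch up
# before the next deletion is queued
for SNAP in snap-2019-01 snap-2019-02; do   # placeholder snapshot list
    $RBD snap rm "$POOL/$IMAGE@$SNAP"
    sleep "$PAUSE"
done
```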
Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm
Got it — I’ll keep that in mind. That may just be what I need to “get by” for now. Ultimately, we’re looking to buy at least three nodes of servers that can hold 40+ OSDs backed by 2TB+ SATA disks.

Thanks,
--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC

> On Sep 1, 2015, at 11:26 AM, Robert LeBlanc wrote:
>
> Just swapping out spindles for SSD will not give you orders of magnitude performance gains as it does in regular cases. This is because Ceph has a lot of overhead for each I/O, which limits the performance of the SSDs. In my testing, two Intel S3500 SSDs with an 8 core Atom (Intel(R) Atom(TM) CPU C2750 @ 2.40GHz) and size=1 and fio with 8 jobs and QD=8 sync,direct 4K read/writes produced 2,600 IOPS. Don't get me wrong, it will help, but don't expect spectacular results.
>
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> On Tue, Sep 1, 2015 at 8:01 AM, Kenneth Van Alstyne wrote:
> Thanks for the awesome advice folks. Until I can go larger scale (50+ SATA disks), I’m thinking my best option here is to just swap out these 1TB SATA disks with 1TB SSDs. Am I oversimplifying the short term solution?
>
> Thanks,
> --
> Kenneth Van Alstyne
> Systems Architect
> Knight Point Systems, LLC
>
> On Aug 31, 2015, at 7:29 PM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 31 Aug 2015 12:28:15 -0500 Kenneth Van Alstyne wrote:
>
> In addition to the spot on comments by Warren and Quentin, verify this by watching your nodes with atop, iostat, etc.
> The culprit (HDDs) should be plainly visible.
>
> More inline:
>
> Christian, et al:
>
> Sorry for the lack of information. I wasn’t sure what of our hardware specifications or Ceph configuration was useful information at this point. Thanks for the feedback — any feedback is appreciated at this point, as I’ve been beating my head against a wall trying to figure out what’s going on. (If anything. Maybe the spindle count is indeed our upper limit, or our SSDs really suck? :-) )
>
> Your SSDs aren't the problem.
>
> To directly address your questions, see answers below:
> - CBT is the Ceph Benchmarking Tool. Since my question was more generic rather than with CBT itself, it was probably more useful to post in the ceph-users list rather than cbt.
> - 8 Cores are from 2x quad core Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
>
> Not your problem either.
>
> - The SSDs are indeed Intel S3500s. I agree — not ideal, but supposedly capable of up to 75,000 random 4KB reads/writes. Throughput and longevity are quite low for an SSD, rated at about 400MB/s reads and 100MB/s writes, though. When we added these as journals in front of the SATA spindles, both VM performance and rados benchmark numbers were relatively unchanged.
>
> The only thing relevant in regards to journal SSDs is the sequential write speed (SYNC); they don't seek and normally don't get read either.
> This is why a 200GB DC S3700 is a better journal SSD than the 200GB S3710, which is faster in any other aspect but sequential writes. ^o^
>
> Latency should have gone down with the SSD journals in place, but that's
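Christian's point that only sequential SYNC write speed matters for a journal SSD can be checked directly. A minimal sketch using `dd` with `oflag=dsync` follows (not from the thread; the target path is a placeholder, so point it at a scratch file on the SSD under test, and raise `count` for a longer, more meaningful measurement):

```shell
#!/bin/sh
# Hedged sketch: measure the one SSD property that matters for a
# filestore journal -- sequential synchronous write throughput.
TARGET=${TARGET:-/tmp/journal-sync-test}   # placeholder scratch path

# Write 4KB chunks with O_DSYNC, so every write waits for the device
# to acknowledge -- the same pattern the OSD journal produces.
dd if=/dev/zero of="$TARGET" bs=4k count=256 oflag=dsync 2>&1 | tail -n 1

# Remove the scratch file afterwards
rm -f "$TARGET"
```

A good journal SSD sustains this synchronous pattern at close to its rated sequential write speed; a poor one collapses to a fraction of it.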
Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm
Thanks for the awesome advice folks. Until I can go larger scale (50+ SATA disks), I’m thinking my best option here is to just swap out these 1TB SATA disks with 1TB SSDs. Am I oversimplifying the short term solution?

Thanks,
--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC

> On Aug 31, 2015, at 7:29 PM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 31 Aug 2015 12:28:15 -0500 Kenneth Van Alstyne wrote:
>
> In addition to the spot on comments by Warren and Quentin, verify this by watching your nodes with atop, iostat, etc.
> The culprit (HDDs) should be plainly visible.
>
> More inline:
>
>> Christian, et al:
>>
>> Sorry for the lack of information. I wasn’t sure what of our hardware specifications or Ceph configuration was useful information at this point. Thanks for the feedback — any feedback is appreciated at this point, as I’ve been beating my head against a wall trying to figure out what’s going on. (If anything. Maybe the spindle count is indeed our upper limit, or our SSDs really suck? :-) )
>>
> Your SSDs aren't the problem.
>
>> To directly address your questions, see answers below:
>> - CBT is the Ceph Benchmarking Tool. Since my question was more generic rather than with CBT itself, it was probably more useful to post in the ceph-users list rather than cbt.
>> - 8 Cores are from 2x quad core Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
> Not your problem either.
>
>> - The SSDs are indeed Intel S3500s. I agree — not ideal, but supposedly capable of up to 75,000 random 4KB reads/writes. Throughput and longevity are quite low for an SSD, rated at about 400MB/s reads and 100MB/s writes, though. When we added these as journals in front of the SATA spindles, both VM performance and rados benchmark numbers were relatively unchanged.
>>
> The only thing relevant in regards to journal SSDs is the sequential write speed (SYNC); they don't seek and normally don't get read either.
> This is why a 200GB DC S3700 is a better journal SSD than the 200GB S3710, which is faster in any other aspect but sequential writes. ^o^
>
> Latency should have gone down with the SSD journals in place, but that's their main function/benefit.
>
>> - Regarding throughput vs IOPS, indeed — the throughput that I’m seeing is nearly worst case scenario, with all I/O being 4KB block size. With RBD cache enabled and the writeback option set in the VM configuration, I was hoping more coalescing would occur, increasing the I/O block size.
>>
> That can only help with non-SYNC writes, so your MySQL VMs and certain file system ops will have to bypass that, and that hurts.
>
>> As an aside, the orchestration layer on top of KVM is OpenNebula, if that’s of any interest.
>>
> It is actually, as I've been eying OpenNebula (alas, no Debian Jessie packages). However, not relevant to your problem indeed.
>
>> VM information:
>> - Number = 15
>> - Workload = Mixed (I know, I know — that’s as vague of an answer as they come) A handful of VMs are running some MySQL databases and some web applications in Apache Tomcat. One is running a syslog server. Everything else is mostly static web page serving for a low number of users.
>>
> As others have mentioned, would you expect this load to work well with just 2 HDDs and via NFS to introduce network latency?
>
>> I can duplicate the blocked request issue pretty consistently, just by running something simple like a “yum -y update” in one VM. While that is running, ceph -w and ceph -s show the following:
>>
>> root@dashboard:~# ceph -s
>>     cluster f79d8c2a-3c14-49be-942d-83fc5f193a25
>>      health HEALTH_WARN
>>             1 requests are blocked > 32 sec
>>      monmap e3: 3 mons at {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
>>             election epoch 136, quorum 0,
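For completeness, the client-side writeback cache discussed above is tunable. A hedged `ceph.conf` sketch follows; the values are purely illustrative (not recommendations from this thread), and as Christian notes, a larger cache only helps non-SYNC writes, since SYNC writes bypass it.

```ini
[client]
# Hedged example of client-side RBD cache tuning; values illustrative.
# A larger cache gives writeback more room to coalesce small non-sync
# writes before flushing to the OSDs.
rbd cache = true
rbd cache writethrough until flush = true
rbd cache size = 67108864            ; 64 MB (default is 32 MB)
rbd cache max dirty = 50331648       ; 48 MB of dirty data before writes block
rbd cache target dirty = 33554432    ; begin flushing at 32 MB
```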
Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm
If spindle count is indeed the problem, is there anything else I can do to improve caching or I/O coalescing to deal with my crippling IOPS limit due to the low number of spindles?

Thanks,
--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC

> On Aug 31, 2015, at 11:01 AM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 31 Aug 2015 08:31:57 -0500 Kenneth Van Alstyne wrote:
>
>> Sorry about the repost from the cbt list, but it was suggested I post here as well:
>>
> I wasn't even aware a CBT (what the heck does that acronym stand for?) existed...
>
>> I am attempting to track down some performance issues in a Ceph cluster recently deployed. Our configuration is as follows: 3 storage nodes,
> 3 nodes is, of course, bare minimum.
>
>> each with:
>> - 8 Cores
> Of what, apples? Detailed information makes for better replies.
>
>> - 64GB of RAM
> Ample.
>
>> - 2x 1TB 7200 RPM Spindle
> Even if your cores were to be rotten apple ones, that's very few spindles, so your CPU is unlikely to be the bottleneck.
>
>> - 1x 120GB Intel SSD
> Details, again. From your P.S. I conclude that these are S3500's, definitely not my choice for journals when it comes to speed and endurance.
>
>> - 2x 10GBit NICs (In LACP Port-channel)
> Massively overspec'ed considering your storage sinks/wells aka HDDs.
>
>> The OSD pool min_size is set to “1” and “size” is set to “3”. When creating a new pool and running RADOS benchmarks, performance isn’t bad — about what I would expect from this hardware configuration:
>>
> Rados bench uses by default 4MB "blocks", which is the optimum size for (default) RBD pools.
> Bandwidth does not equal IOPS (which are commonly measured in 4KB blocks).
>
>> WRITES:
>> Total writes made:      207
>> Write size:             4194304
>> Bandwidth (MB/sec):     80.017
>> Stddev Bandwidth:       34.9212
>> Max bandwidth (MB/sec): 120
>> Min bandwidth (MB/sec): 0
>> Average Latency:        0.797667
>> Stddev Latency:         0.313188
>> Max latency:            1.72237
>> Min latency:            0.253286
>>
>> RAND READS:
>> Total time run:         10.127990
>> Total reads made:       1263
>> Read size:              4194304
>> Bandwidth (MB/sec):     498.816
>> Average Latency:        0.127821
>> Max latency:            0.464181
>> Min latency:            0.0220425
>>
>> This all looks fine, until we try to use the cluster for its purpose, which is to house images for qemu-kvm, which are accessed using librbd.
> Not that it probably matters, but knowing if this is OpenStack, Ganeti, or something else might be of interest.
>
>> I/O inside VMs has excessive wait times (in the hundreds of ms at times, making some operating systems, like Windows, unusable) and throughput struggles to exceed 10MB/s (or less). Looking at ceph health, we see very low op/s numbers as well as throughput, and the requests blocked number seems very high. Any ideas as to what to look at here?
>>
> Again, details.
>
> How many VMs?
> What are they doing?
> Keep in mind that the BEST sustained result you could hope for here (ignoring Ceph overhead and network latency) is the IOPS of 2 HDDs, so about 300 IOPS at best. TOTAL.
>
>>      health HEALTH_WARN
>>             8 requests are blocked > 32 sec
>>      monmap e3: 3 mons at {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
>>             election epoch 128, quorum 0,1,2 storage-1,storage-2,storage-3
>>      osdmap e69615: 6 osds: 6 up, 6 in
>>      pgmap v3148541: 224 pgs, 1 pools, 819 GB
> 256 or 512 PGs would have been the "correct" number here, but that's of little importance.
>
>> data, 227 kobjects
>>             2726 GB used, 2844 GB / 5571 GB avail
>>                  224 active+clean
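Christian's bandwidth-versus-IOPS point can be measured directly by re-running rados bench with 4KB objects instead of the default 4MB. A hedged dry-run sketch follows (the pool name is a placeholder and the commands are prefixed with `echo`; drop the prefix and use a throwaway test pool on a real cluster):

```shell
#!/bin/sh
# Hedged sketch: measure IOPS rather than bandwidth with rados bench.
# RADOS is set to "echo rados" so this prints the commands (dry run).
RADOS="echo rados"
POOL=bench-test   # placeholder throwaway pool

# 30-second 4KB write test, 16 concurrent ops; keep the objects so the
# read pass below has something to read
$RADOS bench -p "$POOL" 30 write -b 4096 -t 16 --no-cleanup

# Random 4KB reads against the objects written above
$RADOS bench -p "$POOL" 30 rand -t 16

# Remove the benchmark objects afterwards
$RADOS -p "$POOL" cleanup
```

With 4KB objects, the reported "ops" figure approximates the small-block IOPS the VMs actually experience, which is the number to compare against the spindle ceiling.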
[ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm
Sorry about the repost from the cbt list, but it was suggested I post here as well:

I am attempting to track down some performance issues in a Ceph cluster recently deployed. Our configuration is as follows: 3 storage nodes, each with:
- 8 Cores
- 64GB of RAM
- 2x 1TB 7200 RPM Spindle
- 1x 120GB Intel SSD
- 2x 10GBit NICs (In LACP Port-channel)

The OSD pool min_size is set to “1” and “size” is set to “3”. When creating a new pool and running RADOS benchmarks, performance isn’t bad — about what I would expect from this hardware configuration:

WRITES:
Total writes made:      207
Write size:             4194304
Bandwidth (MB/sec):     80.017
Stddev Bandwidth:       34.9212
Max bandwidth (MB/sec): 120
Min bandwidth (MB/sec): 0
Average Latency:        0.797667
Stddev Latency:         0.313188
Max latency:            1.72237
Min latency:            0.253286

RAND READS:
Total time run:         10.127990
Total reads made:       1263
Read size:              4194304
Bandwidth (MB/sec):     498.816
Average Latency:        0.127821
Max latency:            0.464181
Min latency:            0.0220425

This all looks fine, until we try to use the cluster for its purpose, which is to house images for qemu-kvm, which are accessed using librbd. I/O inside VMs has excessive wait times (in the hundreds of ms at times, making some operating systems, like Windows, unusable) and throughput struggles to exceed 10MB/s (or less). Looking at ceph health, we see very low op/s numbers as well as throughput, and the requests blocked number seems very high. Any ideas as to what to look at here?
     health HEALTH_WARN
            8 requests are blocked > 32 sec
     monmap e3: 3 mons at {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
            election epoch 128, quorum 0,1,2 storage-1,storage-2,storage-3
     osdmap e69615: 6 osds: 6 up, 6 in
     pgmap v3148541: 224 pgs, 1 pools, 819 GB data, 227 kobjects
            2726 GB used, 2844 GB / 5571 GB avail
                 224 active+clean
     client io 3957 B/s rd, 3494 kB/s wr, 30 op/s

Of note, on the other list, I was asked to provide the following:
- ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
- The SSD is split into 8GB partitions. These 8GB partitions are used as journal devices, specified in /etc/ceph/ceph.conf. For example:
  [osd.0]
  host = storage-1
  osd journal = /dev/mapper/INTEL_SSDSC2BB120G4_CVWL4363006R120LGNp1
- rbd_cache is enabled and qemu cache is set to “writeback”
- rbd_concurrent_management_ops is unset, so it appears the default is “10”

Thanks,
--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
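A back-of-envelope check of the spindle ceiling discussed in this thread can be sketched as follows. It assumes a rule-of-thumb ~150 IOPS per 7200 RPM SATA disk (an assumption, not a measured figure) and that replication multiplies the backend write load by the pool's "size":

```shell
#!/bin/sh
# Hedged back-of-envelope: sustained small-write IOPS ceiling of a
# 3-node, 2-HDD-per-node cluster with 3x replication.
SPINDLES=6         # 3 nodes x 2 HDDs
IOPS_PER_DISK=150  # rough rule of thumb for a 7200 RPM SATA disk
REPLICAS=3         # pool size = 3

echo "Raw aggregate write IOPS:  $((SPINDLES * IOPS_PER_DISK))"
echo "Client-visible write IOPS: $((SPINDLES * IOPS_PER_DISK / REPLICAS))"
```

The client-visible figure works out to roughly 300 write IOPS total, which lines up with Christian's "about 300 IOPS at best. TOTAL." estimate earlier in the thread, and explains why 15 mixed-workload VMs see hundreds of milliseconds of I/O wait.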