Re: [ceph-users] Canonical Livepatch broke CephFS client
On Wed, Aug 14, 2019 at 12:44:15PM +0200, Ilya Dryomov wrote:
> On Tue, Aug 13, 2019 at 10:56 PM Tim Bishop wrote:
> > This email is mostly a heads up for others who might be using
> > Canonical's livepatch on Ubuntu on a CephFS client.
> >
> > I have an Ubuntu 18.04 client with the standard kernel currently at
> > version linux-image-4.15.0-54-generic 4.15.0-54.58. CephFS is mounted
> > with the kernel client. The cluster is running Mimic 13.2.6. I've got
> > livepatch running and this evening it did an update:
> >
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: Client.Check
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: Checking with livepatch service.
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: updating last-check
> > Aug 13 17:33:55 myclient canonical-livepatch[2396]: touched last check
> > Aug 13 17:33:56 myclient canonical-livepatch[2396]: Applying update 54.1 for 4.15.0-54.58-generic
> > Aug 13 17:33:56 myclient kernel: [3700923.970750] PKCS#7 signature not signed with a trusted key
> > Aug 13 17:33:59 myclient kernel: [3700927.069945] livepatch: enabling patch 'lkp_Ubuntu_4_15_0_54_58_generic_54'
> > Aug 13 17:33:59 myclient kernel: [3700927.154956] livepatch: 'lkp_Ubuntu_4_15_0_54_58_generic_54': starting patching transition
> > Aug 13 17:34:01 myclient kernel: [3700928.994487] livepatch: 'lkp_Ubuntu_4_15_0_54_58_generic_54': patching complete
> > Aug 13 17:34:09 myclient canonical-livepatch[2396]: Applied patch version 54.1 to 4.15.0-54.58-generic
> >
> > And then immediately I saw:
> >
> > Aug 13 17:34:18 myclient kernel: [3700945.728684] libceph: mds0 1.2.3.4:6800 socket closed (con state OPEN)
> > Aug 13 17:34:18 myclient kernel: [3700946.040138] libceph: mds0 1.2.3.4:6800 socket closed (con state OPEN)
> > Aug 13 17:34:19 myclient kernel: [3700947.105692] libceph: mds0 1.2.3.4:6800 socket closed (con state OPEN)
> > Aug 13 17:34:20 myclient kernel: [3700948.033704] libceph: mds0 1.2.3.4:6800 socket closed (con state OPEN)
> >
> > And on the MDS:
> >
> > 2019-08-13 17:34:18.286 7ff165e75700 0 SIGN: MSG 9241367 Message signature does not match contents.
> > 2019-08-13 17:34:18.286 7ff165e75700 0 SIGN: MSG 9241367 Signature on message:
> > 2019-08-13 17:34:18.286 7ff165e75700 0 SIGN: MSG 9241367 sig: 10517606059379971075
> > 2019-08-13 17:34:18.286 7ff165e75700 0 SIGN: MSG 9241367 Locally calculated signature:
> > 2019-08-13 17:34:18.286 7ff165e75700 0 SIGN: MSG 9241367 sig_check: 4899837294009305543
> > 2019-08-13 17:34:18.286 7ff165e75700 0 Signature failed.
> > 2019-08-13 17:34:18.286 7ff165e75700 0 -- 1.2.3.4:6800/512468759 >> 4.3.2.1:0/928333509 conn(0xe6b9500 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=0).process Signature check failed
> >
> > Thankfully I was able to umount -f to unfreeze the client, but I have
> > been unsuccessful remounting the file system using the kernel client.
> > The fuse client worked fine as a workaround, but is slower.
> >
> > Taking a look at livepatch 54.1 I can see it touches Ceph code in the kernel:
> >
> > https://git.launchpad.net/~ubuntu-livepatch/+git/bionic-livepatches/commit/?id=3a3081c1e4c8e2e0f9f7a1ae4204eba5f38fbd29
> >
> > But the relevance of those changes isn't immediately clear to me. I
> > expect after a reboot it'll be fine, but that's as yet untested.
>
> These changes are very relevant. They introduce support for the CEPHX_V2
> protocol, where message signatures are computed slightly differently:
> same algorithm but a different set of inputs. The live-patched kernel
> likely started signing using CEPHX_V2 without renegotiating.

Ah - thanks for looking. Looks like something that wasn't a security
issue, so it shouldn't have been included in the live patch.

> This is a good example of how live-patching can go wrong. A reboot
> should definitely help.
Yup, it certainly has its tradeoffs (not having to reboot so regularly
is certainly a positive, though).

I've replicated this on a test machine and confirmed that a reboot does
indeed fix the problem.

Thanks,

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
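[Editor's note: Ilya's point above - same algorithm, but a different set of inputs - is easy to demonstrate with a toy HMAC signature check. This is purely illustrative and is not Ceph's actual cephx wire format or key handling; the "header" prefix stands in for whatever extra inputs CEPHX_V2 mixes in.]

```python
import hmac
import hashlib

def sign_v1(key: bytes, msg: bytes) -> bytes:
    # "v1"-style: sign only the message payload (hypothetical input set)
    return hmac.new(key, msg, hashlib.sha256).digest()

def sign_v2(key: bytes, msg: bytes) -> bytes:
    # "v2"-style: same algorithm, same key, but extra bytes mixed in
    return hmac.new(key, b"header" + msg, hashlib.sha256).digest()

key = b"shared-session-key"
msg = b"mds request payload"

# A v2-signing client talking to a v1-verifying peer fails the check,
# even though both sides hold the same key and the same payload --
# which is exactly a "Signature check failed" on the MDS side.
assert sign_v1(key, msg) != sign_v2(key, msg)
assert hmac.compare_digest(sign_v1(key, msg), sign_v1(key, msg))
```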
[ceph-users] Canonical Livepatch broke CephFS client
Hi,

This email is mostly a heads up for others who might be using
Canonical's livepatch on Ubuntu on a CephFS client.

I have an Ubuntu 18.04 client with the standard kernel currently at
version linux-image-4.15.0-54-generic 4.15.0-54.58. CephFS is mounted
with the kernel client. The cluster is running Mimic 13.2.6. I've got
livepatch running and this evening it did an update:

Aug 13 17:33:55 myclient canonical-livepatch[2396]: Client.Check
Aug 13 17:33:55 myclient canonical-livepatch[2396]: Checking with livepatch service.
Aug 13 17:33:55 myclient canonical-livepatch[2396]: updating last-check
Aug 13 17:33:55 myclient canonical-livepatch[2396]: touched last check
Aug 13 17:33:56 myclient canonical-livepatch[2396]: Applying update 54.1 for 4.15.0-54.58-generic
Aug 13 17:33:56 myclient kernel: [3700923.970750] PKCS#7 signature not signed with a trusted key
Aug 13 17:33:59 myclient kernel: [3700927.069945] livepatch: enabling patch 'lkp_Ubuntu_4_15_0_54_58_generic_54'
Aug 13 17:33:59 myclient kernel: [3700927.154956] livepatch: 'lkp_Ubuntu_4_15_0_54_58_generic_54': starting patching transition
Aug 13 17:34:01 myclient kernel: [3700928.994487] livepatch: 'lkp_Ubuntu_4_15_0_54_58_generic_54': patching complete
Aug 13 17:34:09 myclient canonical-livepatch[2396]: Applied patch version 54.1 to 4.15.0-54.58-generic

And then immediately I saw:

Aug 13 17:34:18 myclient kernel: [3700945.728684] libceph: mds0 1.2.3.4:6800 socket closed (con state OPEN)
Aug 13 17:34:18 myclient kernel: [3700946.040138] libceph: mds0 1.2.3.4:6800 socket closed (con state OPEN)
Aug 13 17:34:19 myclient kernel: [3700947.105692] libceph: mds0 1.2.3.4:6800 socket closed (con state OPEN)
Aug 13 17:34:20 myclient kernel: [3700948.033704] libceph: mds0 1.2.3.4:6800 socket closed (con state OPEN)

And on the MDS:

2019-08-13 17:34:18.286 7ff165e75700 0 SIGN: MSG 9241367 Message signature does not match contents.
2019-08-13 17:34:18.286 7ff165e75700 0 SIGN: MSG 9241367 Signature on message:
2019-08-13 17:34:18.286 7ff165e75700 0 SIGN: MSG 9241367 sig: 10517606059379971075
2019-08-13 17:34:18.286 7ff165e75700 0 SIGN: MSG 9241367 Locally calculated signature:
2019-08-13 17:34:18.286 7ff165e75700 0 SIGN: MSG 9241367 sig_check: 4899837294009305543
2019-08-13 17:34:18.286 7ff165e75700 0 Signature failed.
2019-08-13 17:34:18.286 7ff165e75700 0 -- 1.2.3.4:6800/512468759 >> 4.3.2.1:0/928333509 conn(0xe6b9500 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=0).process Signature check failed

Thankfully I was able to umount -f to unfreeze the client, but I have
been unsuccessful remounting the file system using the kernel client.
The fuse client worked fine as a workaround, but is slower.

Taking a look at livepatch 54.1 I can see it touches Ceph code in the kernel:

https://git.launchpad.net/~ubuntu-livepatch/+git/bionic-livepatches/commit/?id=3a3081c1e4c8e2e0f9f7a1ae4204eba5f38fbd29

But the relevance of those changes isn't immediately clear to me. I
expect after a reboot it'll be fine, but that's as yet untested.

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] cephfs kernel client hung after eviction
Hi,

I have a cephfs kernel client (Ubuntu 18.04, 4.15.0-34-generic) that's
completely hung after the client was evicted by the MDS. The client logged:

Jan 24 17:31:46 client kernel: [10733559.309496] libceph: FULL or reached pool quota
Jan 24 17:32:26 client kernel: [10733599.232213] libceph: mon0 n.n.n.n:6789 session lost, hunting for new mon

And the MDS logged:

2019-01-24 17:36:38.859 7f3ac7844700 0 log_channel(cluster) log [WRN] : evicting unresponsive client client:cephfs-client (86527773), after 300.081 seconds

Looking in mdsc shows:

% head /sys/kernel/debug/ceph/[id].client86527773/mdsc
20      mds0    getattr  #103d042
21      mds0    getattr  #103d042
22      mds0    getattr  #103d042
23      mds0    getattr  #103d042
24      mds0    getattr  #103d042
25      mds0    getattr  #103d042
26      mds0    getattr  #103d042
27      mds0    getattr  #103d042
28      mds0    getattr  #103d042
29      mds0    getattr  #103d042

But osdc hangs when I try to access it.

I've tried umount -f but it hangs too. umount -l hides the problem (df
no longer hangs), but any processes that were trying to access the
mount are still blocked. I've also tried switching back and forth to
standby MDSs in case that unblocked something. There are no current OSD
blacklist entries either.

It looks like rebooting is the only option, but that's somewhat of a
pain to do. There's lots of people using this machine :-(

Any ideas?

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
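[Editor's note: not a fix, but when a client wedges like this it helps to summarise what's actually stuck. A short sketch that tallies pending requests from an mdsc-style dump; the whitespace-separated field layout (tid, mds, op, inode) is assumed from the sample output above.]

```python
from collections import Counter

def summarize_mdsc(text: str) -> Counter:
    """Count pending requests per (mds, op) from mdsc debugfs-style lines."""
    counts = Counter()
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) >= 3:
            _tid, mds, op = fields[0], fields[1], fields[2]
            counts[(mds, op)] += 1
    return counts

sample = """\
20 mds0 getattr #103d042
21 mds0 getattr #103d042
22 mds0 getattr #103d042
"""
# Three getattr requests all stuck on mds0.
assert summarize_mdsc(sample)[("mds0", "getattr")] == 3
```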
Re: [ceph-users] Slow requests from bluestore osds
On Sat, Sep 01, 2018 at 12:45:06PM -0400, Brett Chancellor wrote:
> Hi Cephers,
>   I am in the process of upgrading a cluster from Filestore to
> Bluestore, but I'm concerned about frequent warnings popping up against
> the new Bluestore devices. I'm frequently seeing messages like this;
> although the specific OSD changes, it's always one of the few hosts
> I've converted to Bluestore.
>
>   6 ops are blocked > 32.768 sec on osd.219
>   1 osds have slow requests
>
> I'm running 12.2.4. Have any of you seen similar issues? It seems as
> though these messages pop up more frequently when one of the Bluestore
> PGs is involved in a scrub. I'll include my Bluestore creation process
> below, in case that might cause an issue. (sdb, sdc, sdd are SATA;
> sde and sdf are SSD)

I had the same issue for some time after a bunch of updates, including
switching to Luminous and to Bluestore from Filestore. I was unable to
track down what was causing it.

Then recently I did another bunch of updates, including Ceph 12.2.7 and
the latest Ubuntu 16.04 updates (which would have included microcode
updates), and the problem has as good as gone away. My monitoring also
shows reduced latency on my RBD volumes, and end users are reporting
improved performance.

Just throwing this out there in case it helps anyone else - appreciate
it's very anecdotal.

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
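[Editor's note: for tracking how often these warnings fire per OSD, a small parser for the health-log lines quoted above can feed a monitoring system. The message format is assumed from Brett's sample; real deployments should verify it against their own `ceph health detail` output.]

```python
import re

# Matches lines like "6 ops are blocked > 32.768 sec on osd.219"
BLOCKED_RE = re.compile(r"(\d+) ops are blocked > ([\d.]+) sec on (osd\.\d+)")

def parse_blocked(line: str):
    """Extract (op count, threshold, osd) from a slow-request warning line."""
    m = BLOCKED_RE.search(line)
    if not m:
        return None
    count, secs, osd = m.groups()
    return {"ops": int(count), "threshold_sec": float(secs), "osd": osd}

parsed = parse_blocked("6 ops are blocked > 32.768 sec on osd.219")
assert parsed == {"ops": 6, "threshold_sec": 32.768, "osd": "osd.219"}
```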
Re: [ceph-users] Disk write cache - safe?
Thank you Christian, David, and Reed for your responses.

My servers have the Dell H730 RAID controller in them, but I have the
OSD disks in Non-RAID mode. When initially testing I compared single
RAID-0 containers with Non-RAID, and the Non-RAID performance was
acceptable, so I opted for the configuration with fewer components
between Ceph and the disks. This seemed to be the safer approach at the
time. What I obviously hadn't realised was that the drive caches were
enabled. Without those caches the difference is much greater, and the
latency is now becoming a problem.

My reading of the documentation led me to think along the lines
Christian mentions below - that is, that data in flight would be lost,
but that the disks should be consistent and still usable. But it would
be nice to get confirmation of whether that holds for Bluestore.
However, it looks like this wasn't the case for Reed, although perhaps
that was at an earlier time when Ceph and/or Linux didn't handle things
as well?

I had also thought that our power supply was pretty safe - redundant
PSUs with independent feeds, redundant UPSs, and a generator. But
Reed's experiences certainly highlight that even that can fail, so it
was good to hear that from someone else rather than experience it first
hand.

I do have tape backups, but recovery would be a pain, so based on all
your comments I'll leave the drive caches off and look at using the
RAID controller cache with its BBU instead.

Tim.

On Thu, Mar 15, 2018 at 04:13:49PM +0900, Christian Balzer wrote:
> Hello,
>
> What has been said by others before is essentially true, as in if you
> want:
>
> - as much data conservation as possible, and have
> - RAID controllers with decent amounts of cache and a BBU,
>
> then disabling the on-disk cache is the way to go.
>
> But as you found out, without those caches and a controller cache to
> replace them, performance will tank.
>
> And of course any data only in the pagecache (dirty) and not yet
> flushed to the controller/disks is lost anyway in a power failure.
>
> All current filesystems _should_ be powerfail safe (barriers) in the
> sense that you may lose the data in the disk caches (if properly
> exposed to the OS, and the controller or disk not lying about having
> written data to disk) but the FS will be consistent and not "all will
> be lost".
>
> I'm hoping that this is true for Bluestore, but somebody needs to do
> that testing.
>
> So if you can live with the loss of the in-transit data in the disk
> caches in addition to the pagecache, and/or you trust your DC never to
> lose power, go ahead and re-enable the disk caches.
>
> If you have the money and need for a sound happy sleep, do the BBU
> controller cache dance. Some controllers (Areca comes to mind)
> actually manage IT-mode style exposure of the disks and still use
> their HW cache.
>
> Christian

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Disk write cache - safe?
I'm using Ceph on Ubuntu 16.04 on Dell R730xd servers. A recent [1]
update to the PERC firmware disabled the disk write cache by default,
which made a noticeable difference to the latency on my disks (spinning
disks, not SSD) - by as much as a factor of 10.

For reference, their change list says:

"Changes default value of drive cache for 6 Gbps SATA drive to
disabled. This is to align with the industry for SATA drives. This may
result in a performance degradation especially in non-Raid mode. You
must perform an AC reboot to see existing configurations change."

It's fairly straightforward to re-enable the cache either in the PERC
BIOS, or by using hdparm, and doing so returns the latency back to what
it was before.

Checking the Ceph documentation I can see that older versions [2]
recommended disabling the write cache for older kernels. But given I'm
using a newer kernel, and there's no mention of this in the Luminous
docs, is it safe to assume it's OK to enable the disk write cache now?

If it makes a difference, I'm using a mixture of Filestore and
Bluestore OSDs - migration is still ongoing.

Thanks,

Tim.

[1] - https://www.dell.com/support/home/uk/en/ukdhs1/Drivers/DriversDetails?driverId=8WK8N
[2] - http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
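[Editor's note: the hdparm route mentioned above can be scripted across OSD data disks. A sketch that only *builds* the commands, so they can be reviewed before running; the device names are illustrative, and hdparm's -W1/-W0 flags enable/disable the drive write cache.]

```python
def hdparm_write_cache_cmds(devices, enable=True):
    """Build hdparm invocations to toggle the drive write cache.

    -W1 enables the cache, -W0 disables it. Commands are returned as
    argument lists (suitable for subprocess.run) rather than executed.
    """
    flag = "-W1" if enable else "-W0"
    return [["hdparm", flag, dev] for dev in devices]

cmds = hdparm_write_cache_cmds(["/dev/sdb", "/dev/sdc"], enable=True)
assert cmds == [["hdparm", "-W1", "/dev/sdb"], ["hdparm", "-W1", "/dev/sdc"]]
```

Note that, per Christian's advice in the follow-up thread, whether to run this with enable=True at all depends on your tolerance for losing in-flight data on power failure.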
[ceph-users] Dashboard runs on all manager instances?
Hi,

I've recently upgraded from Jewel to Luminous, and I'm therefore new to
using the Dashboard. I noted this section in the documentation:

http://docs.ceph.com/docs/master/mgr/dashboard/#load-balancer

"Please note that the dashboard will only start on the manager which is
active at that moment. Query the Ceph cluster status to see which
manager is active (e.g., ceph mgr dump). In order to make the dashboard
available via a consistent URL regardless of which manager daemon is
currently active, you may want to set up a load balancer front-end to
direct traffic to whichever manager endpoint is available. If you use a
reverse http proxy that forwards a subpath to the dashboard, you need
to configure url_prefix (see above)."

However, from what I can see the dashboard is actually started on all
manager instances. On the standby instances it simply has a redirect to
the active instance. So the above documentation would look to be
incorrect?

I was planning to use Apache's mod_proxy_balancer to just pick the
active instance, but the above makes that tricky. How have others
solved this? Do you just pick the right instance and go direct? Or
maybe I'm missing a config option to make the dashboard only run on the
active manager?

Thanks,

Tim.

--
Tim Bishop
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
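[Editor's note: one approach to the load-balancer problem is a health check that asks the cluster who the active manager is and points the proxy there. A sketch parsing `ceph mgr dump` JSON output; the `active_name`/`active_addr` field names and the sample document are assumptions and should be checked against your own cluster's output.]

```python
import json

def active_mgr(dump_json: str):
    """Return (name, addr) of the active manager from `ceph mgr dump` output."""
    dump = json.loads(dump_json)
    return dump.get("active_name"), dump.get("active_addr")

# Abbreviated, hypothetical dump document for illustration:
sample = json.dumps({
    "active_name": "mgr-a",
    "active_addr": "10.0.0.1:6800/1234",
    "standbys": [{"name": "mgr-b"}],
})
assert active_mgr(sample) == ("mgr-a", "10.0.0.1:6800/1234")
```

A cron job or proxy health check could use this to rewrite the backend whenever a failover happens, rather than balancing across all managers.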
Re: [ceph-users] Is anyone seeing issues with task_numa_find_cpu?
> 0x3b/0x50
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686173] [] SYSC_recvfrom+0xe1/0x160
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686202] [] ? ktime_get_ts64+0x45/0xf0
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686230] [] SyS_recvfrom+0xe/0x10
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686259] [] entry_SYSCALL_64_fastpath+0x16/0x71
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686287] Code: 55 b0 4c 89 f7 e8 53 cd ff ff 48 8b 55 b0 49 8b 4e 78 48 8b 82 d8 01 00 00 48 83 c1 01 31 d2 49 0f af 86 b0 00 00 00 4c 8b 73 78 <48> f7 f1 48 8b 4b 20 49 89 c0 48 29 c1 48 8b 45 d0 4c 03 43 48
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686512] RIP [] task_numa_find_cpu+0x22e/0x6f0
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686544] RSP
> Jun 28 09:46:41 roc04r-sca090 kernel: [137912.686896] ---[ end trace 544cb9f68cb55c93 ]---
> Jun 28 09:52:15 roc04r-sca090 kernel: [138246.669713] mpt2sas_cm0: log_info(0x30030101): originator(IOP), code(0x03), sub_code(0x0101)
> Jun 28 09:55:01 roc0
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] New Ceph mirror
Hi Wido,

Six months or so ago you asked for new Ceph mirrors. I saw there wasn't
currently one in the UK, so I've set one up following your guidelines
here:

https://github.com/ceph/ceph/tree/master/mirroring

The mirror is available at:

http://ceph.mirrorservice.org/

Over both IPv4 and IPv6, and rsync is enabled too. The vhost is
configured to respond to uk.ceph.com should you feel it's appropriate
to use that for our site.

Our service is hosted on the UK academic network and has an 8 Gbit
uplink speed. We're also users of Ceph ourselves within the School of
Computing here at the University, so I'm following the Ceph
developments closely.

If you need any further information from us please let me know.

Tim.

--
Tim Bishop, Computing Officer, School of Computing, University of Kent.
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] OSD - single drive RAID 0 or JBOD?
Hi all,

I've got servers (Dell R730xd) with a number of drives connected to a
Dell H730 RAID controller. I'm trying to make a decision about whether
I should put the drives in "Non-RAID" mode, or make individual RAID 0
arrays for each drive.

Going for the RAID 0 approach would mean that the cache on the
controller will be used for the drives, giving some increased
performance. But it comes at the expense of less direct access to the
disks for the operating system and Ceph, and I have had one oddity [1]
with a drive that went away when switched from RAID 0 to Non-RAID.

There are currently no SSDs being used as journals in the machines, so
the increased performance is beneficial, but data integrity is
obviously paramount.

I've seen recommendations both ways on this list, so I'm hoping for
feedback from people who've made the same decision, possibly with
evidence (positive or negative) about either approach.

Thanks!

Tim.

[1] The OS was getting I/O errors when reading certain files, which
gave scrub errors. No problems were shown in the RAID controller, and a
check of the disk didn't reveal any issues. Switching to Non-RAID mode
and recreating the OSD fixed the problem, and there haven't been any
issues since.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd map on Jewel, sysfs write failed when rbd map
> NOTE: The information contained in this electronic mail message is
> intended only for the use of the designated recipient(s) named above.
> If the reader of this message is not the intended recipient, you are
> hereby notified that you have received this message in error and that
> any review, dissemination, distribution, or copying of this message is
> strictly prohibited. If you have received this communication in error,
> please notify the sender by telephone or e-mail (as shown above)
> immediately and destroy any and all copies of this message in your
> possession (whether hard copies or electronically stored copies).
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Calamari Goes Open Source
This is great news! Are there (or will there be) binary packages
available?

Tim.

On Fri, May 30, 2014 at 06:04:42PM -0400, Patrick McGarry wrote:
> Hey cephers,
>
> Sorry to push this announcement so late on a Friday, but...
>
> Calamari has arrived!
>
> The source code bits have been flipped, the ticket tracker has been
> moved, and we have even given you a little bit of background, from
> both a technical and vision point of view:
>
> Technical (ceph.com):
> http://ceph.com/community/ceph-calamari-goes-open-source/
>
> Vision (inktank.com):
> http://www.inktank.com/software/future-of-calamari/
>
> The ceph.com link should give you everything you need to know about
> what tech comprises Calamari, where the source lives, and where the
> discussions will take place. If you have any questions feel free to
> hit the new ceph-calamari list or stop by IRC and we'll get you
> started. Hope you all enjoy the GUI!
>
> Best Regards,
>
> Patrick McGarry

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Dell H310
Hi all,

I'm hoping to get some feedback on the Dell H310 (LSI SAS2008 chipset).
Based on searching I'd done previously, I got the impression that
people generally recommended avoiding it in favour of the
higher-specced H710 (LSI SAS2208 chipset).

But yesterday I read Inktank's Hardware Configuration Guide [1] (great
document, by the way!), and it mentions the H310 as a possible
controller. So maybe it's not as bad as I'd assumed?

It'd be great to hear from people using it successfully, or from people
who've had bad experiences with it.

Tim.

[1] http://info.inktank.com/hardware_configuration_guide

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ubuntu 13.10 packages
Hi Michael,

On Tue, Feb 25, 2014 at 10:01:31PM +, Michael wrote:
> Just wondering if there was a reason for no packages for Ubuntu Saucy
> in http://ceph.com/packages/ceph-extras/debian/dists/. Could do with
> upgrading to fix a few bugs, but would hate to have to drop Ceph from
> being handled through the package manager!

The Raring packages work fine with Saucy. Not ideal, but it's the
easiest solution at the moment.

I'd like to see native Saucy packages too, and actually it'd be good to
get Trusty sorted soon too, so it's all ready for the release.

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How does Ceph deal with OSDs that have been away for a while?
Thanks Greg. Can I just confirm: does it do a full backfill
automatically in the case where the log no longer overlaps? I guess the
key question is - do I have to worry about it, or will it always "do
the right thing"?

Tim.

On Fri, Feb 21, 2014 at 11:57:09AM -0800, Gregory Farnum wrote:
> It depends on how long ago (in terms of data writes) it disappeared.
> Each PG has a log of the changes that have been made (by default I
> think it's 3000? Maybe just 1k), and if an OSD goes away and comes
> back while the logs still overlap it will just sync up the changed
> objects. Otherwise it has to do a full backfill across the PG.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> On Fri, Feb 21, 2014 at 10:33 AM, Tim Bishop wrote:
> > I'm wondering how Ceph deals with OSDs that have been away for a
> > while. Do they need to be completely rebuilt, or does it know which
> > objects are good and which need to go?
> >
> > I know Ceph handles well the situation of an OSD going away, and
> > rebalances etc. to maintain the required redundancy levels. But I'm
> > unsure what it does when an OSD comes back some time later still
> > containing data.
> >
> > Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
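[Editor's note: Greg's description reduces to a simple overlap test, sketched below as a toy model. Real PGs compare versioned log entries (epoch/version pairs), not plain integers, and the 3000-entry default is Greg's guess from the thread, not a verified value.]

```python
def recovery_strategy(last_applied: int, head: int, log_len: int = 3000) -> str:
    """Decide how a returning OSD catches up, in Greg's terms.

    last_applied: newest log entry the returning OSD had applied.
    head: newest log entry on the surviving OSDs.
    The PG log covers entries (head - log_len, head]; if the returning
    OSD's position still falls inside it, only changed objects are synced.
    """
    oldest_logged = head - log_len
    if last_applied > oldest_logged:
        return "log-based recovery (sync changed objects)"
    return "full backfill (rescan the whole PG)"

# Away briefly: its position is still covered by the log.
assert recovery_strategy(last_applied=9500, head=10000).startswith("log-based")
# Away a long time: the log has rolled past it, so full backfill.
assert recovery_strategy(last_applied=2000, head=10000).startswith("full backfill")
```

Which matches Greg's answer to the original question: the cluster does the right thing automatically, it's only the amount of data moved that differs.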
[ceph-users] How does Ceph deal with OSDs that have been away for a while?
I'm wondering how Ceph deals with OSDs that have been away for a while.
Do they need to be completely rebuilt, or does it know which objects
are good and which need to go?

I know Ceph handles well the situation of an OSD going away, and
rebalances etc. to maintain the required redundancy levels. But I'm
unsure what it does when an OSD comes back some time later still
containing data.

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph network topology with redundant switches
On Fri, Dec 20, 2013 at 09:26:35AM -0800, Kyle Bader wrote:
> > The area I'm currently investigating is how to configure the
> > networking. To avoid a SPOF I'd like to have redundant switches for
> > both the public network and the internal network, most likely
> > running at 10Gb. I'm considering splitting the nodes in to two
> > separate racks and connecting each half to its own switch, and then
> > trunking the switches together to allow the two halves of the
> > cluster to see each other. The idea being that if a single switch
> > fails I'd only lose half of the cluster.
>
> This is fine if you are using a replication factor of 2; you would
> need 2/3 of the cluster surviving if using a replication factor of 3
> with "osd pool default min size" set to 2.

Ah! Thanks for pointing that out. I'd not appreciated the impact of
that setting. I'll have to consider my options here. This also fits
well with Wido's reply (in this same thread) about splitting the nodes
in to three groups rather than two, although the cost of 10Gb
networking starts to become prohibitive at that point.

> > My question is about configuring the public network. If it's all one
> > subnet then the clients consuming the Ceph resources can't have both
> > links active, so they'd be configured in an active/standby role. But
> > this results in quite heavy usage of the trunk between the two
> > switches when a client accesses nodes on the other switch than the
> > one they're actively connected to.
>
> The Linux bonding driver supports several strategies for teaming
> network adapters on L2 networks.

Across switches? Wido's reply mentions using MLAG to span LACP trunks
across switches. This isn't something I'd seen before, so I'd assumed
I couldn't do it. Certainly an area I need to look in to more.

> > So, can I configure multiple public networks? I think so, based on
> > the documentation, but I'm not completely sure. Can I have one half
> > of the cluster on one subnet, and the other half on another? And
> > then the client machine can have interfaces in different subnets and
> > "do the right thing" with both interfaces to talk to all the nodes.
> > This seems like a fairly simple solution that avoids a SPOF in Ceph
> > or the network layer.
>
> You can have multiple networks for both the public and cluster
> networks; the only restriction is that all subnets for a given type
> be within the same supernet. For example:
>
> 10.0.0.0/16 - Public supernet (configured in ceph.conf)
> 10.0.1.0/24 - Public rack 1
> 10.0.2.0/24 - Public rack 2
> 10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
> 10.1.1.0/24 - Cluster rack 1
> 10.1.2.0/24 - Cluster rack 2

Thanks, that clarifies how this works.

> > As an aside, there's a similar issue on the cluster network side
> > with heavy traffic on the trunk between the two cluster switches.
> > But I can't see that's avoidable, and presumably it's something
> > people just have to deal with in larger Ceph installations?
>
> A proper CRUSH configuration is going to place a replica on a node in
> each rack; this means every write is going to cross the trunk. Other
> traffic that you will see on the trunk:
>
> * OSDs gossiping with one another
> * OSD/monitor traffic in the case where an OSD is connected to a
>   monitor in the adjacent rack (map updates, heartbeats)

Am I right in saying that the first of these happens over the cluster
network and the second over the public network? It looks like monitors
don't have a cluster network address.

> * OSD/client traffic where the OSD and client are in adjacent racks
>
> If you use all 4 40GbE uplinks (common on 10GbE ToR) then your
> cluster level bandwidth is oversubscribed 4:1. To lower
> oversubscription you are going to have to steal some of the other 48
> ports: 12 for 2:1 and 24 for a non-blocking fabric. Given the number
> of nodes you have/plan to have, you will be utilizing 6-12 links per
> switch, leaving you with 12-18 links for clients on a non-blocking
> fabric, 24-30 for 2:1 and 36-48 for 4:1.

I guess this is an area where instrumentation can be used to measure
the load on the trunk and add more links if required.

Thank you for your help.

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph network topology with redundant switches
Hi Wido, Thanks for the reply. On Fri, Dec 20, 2013 at 08:14:13AM +0100, Wido den Hollander wrote: > On 12/18/2013 09:39 PM, Tim Bishop wrote: > > I'm investigating and planning a new Ceph cluster starting with 6 > > nodes with currently planned growth to 12 nodes over a few years. Each > > node will probably contain 4 OSDs, maybe 6. > > > > The area I'm currently investigating is how to configure the > > networking. To avoid a SPOF I'd like to have redundant switches for > > both the public network and the internal network, most likely running > > at 10Gb. I'm considering splitting the nodes in to two separate racks > > and connecting each half to its own switch, and then trunk the > > switches together to allow the two halves of the cluster to see each > > other. The idea being that if a single switch fails I'd only lose half > > of the cluster. > > Why not three switches in total and use VLANs on the switches to > separate public/cluster traffic? > > This way you can configure the CRUSH map to have one replica go to each > "switch" so that when you loose a switch you still have two replicas > available. > > Saves you a lot of switches and makes the network simpler. I was planning to use VLANs to separate the public and cluster traffic on the same switches. Two switches costs less than three switches :-) I think on a slightly larger scale cluster it might make more sense to go up to three (or even more) switches, but I'm not sure the extra cost is worth it at this level. I was planning two switches, using VLANs to separate the public and cluster traffic, and connecting half of the cluster to each switch. > > (I'm not touching on the required third MON in a separate location and > > the CRUSH rules to make sure data is correctly replicated - I'm happy > > with the setup there) > > > > To allow consumers of Ceph to see the full cluster they'd be directly > > connected to both switches. 
> > I could have another layer of switches for them and interlinks
> > between them, but I'm not sure it's worth it on this sort of scale.
> >
> > My question is about configuring the public network. If it's all one
> > subnet then the clients consuming the Ceph resources can't have both
> > links active, so they'd be configured in an active/standby role. But
> > this results in quite heavy usage of the trunk between the two
> > switches when a client accesses nodes on the other switch than the
> > one they're actively connected to.
>
> Why can't the clients have both links active? You could use LACP? Some
> switches support MLAG to span LACP trunks over two switches.
>
> Or use some intelligent bonding mode in the Linux kernel.

I've only ever used LACP to the same switch, and I hadn't realised
there were options for spanning LACP links across multiple switches.
Thanks for the information there.

> > So, can I configure multiple public networks? I think so, based on
> > the documentation, but I'm not completely sure. Can I have one half
> > of the cluster on one subnet, and the other half on another? And
> > then the client machine can have interfaces in different subnets and
> > "do the right thing" with both interfaces to talk to all the nodes.
> > This seems like a fairly simple solution that avoids a SPOF in Ceph
> > or the network layer.
>
> There is no restriction on the IPs of the OSDs. All they need is a
> Layer 3 route to the WHOLE cluster and monitors.
>
> It doesn't have to be a Layer 2 network; everything can simply be
> Layer 3. You just have to make sure all the nodes can reach each
> other.

Thanks, that makes sense and makes planning simpler. I suppose it's
logical really... in a HUGE cluster you'd probably have all manner of
networks spread around the datacenter.

> > Or maybe I'm missing an alternative that would be better? I'm aiming
> > for something that keeps things as simple as possible while meeting
> > the redundancy requirements.
>             client
>               |
>               |
>          core switch
>          /    |    \
>         /     |     \
>        /      |      \
>   switch1  switch2  switch3
>      |        |        |
>     OSD      OSD      OSD
>
> You could build something like that. That would be fairly simple.

Isn't the core switch in that diagram a SPOF? Or is it presumed to
already be a redundant setup?

> Keep in mind that you can always lose a switch and stil
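Wido's "one replica per switch" suggestion maps onto a CRUSH rule along these lines. This is a sketch in the decompiled-crushmap format used by crushtool; the bucket type (`rack`, standing in for a switch), the bucket/host names, and the weights are assumptions for illustration:

```
# One bucket per switch; OSD hosts attached to that switch go inside it.
rack switch1 {
        id -10
        alg straw
        hash 0
        item node1 weight 4.000
        item node2 weight 4.000
}
# switch2 and switch3 defined the same way.

# Place each replica under a different "switch" bucket.
rule replicated_per_switch {
        ruleset 1
        type replicated
        min_size 2
        max_size 3
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}
```

With size=3 and this rule, losing one switch leaves two replicas reachable, which is the property Wido describes.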
[ceph-users] Ceph network topology with redundant switches
Hi all,

I'm investigating and planning a new Ceph cluster starting with 6 nodes
with currently planned growth to 12 nodes over a few years. Each node
will probably contain 4 OSDs, maybe 6.

The area I'm currently investigating is how to configure the
networking. To avoid a SPOF I'd like to have redundant switches for
both the public network and the internal network, most likely running
at 10Gb. I'm considering splitting the nodes into two separate racks
and connecting each half to its own switch, and then trunking the
switches together to allow the two halves of the cluster to see each
other. The idea being that if a single switch fails I'd only lose half
of the cluster.

(I'm not touching on the required third MON in a separate location and
the CRUSH rules to make sure data is correctly replicated - I'm happy
with the setup there)

To allow consumers of Ceph to see the full cluster they'd be directly
connected to both switches. I could have another layer of switches for
them and interlinks between them, but I'm not sure it's worth it on
this sort of scale.

My question is about configuring the public network. If it's all one
subnet then the clients consuming the Ceph resources can't have both
links active, so they'd be configured in an active/standby role. But
this results in quite heavy usage of the trunk between the two switches
when a client accesses nodes on the other switch than the one they're
actively connected to.

So, can I configure multiple public networks? I think so, based on the
documentation, but I'm not completely sure. Can I have one half of the
cluster on one subnet, and the other half on another? And then the
client machine can have interfaces in different subnets and "do the
right thing" with both interfaces to talk to all the nodes. This seems
like a fairly simple solution that avoids a SPOF in Ceph or the network
layer.

Or maybe I'm missing an alternative that would be better?
I'm aiming for something that keeps things as simple as possible while
meeting the redundancy requirements.

As an aside, there's a similar issue on the cluster network side with
heavy traffic on the trunk between the two cluster switches. But I
can't see that that's avoidable, and presumably it's something people
just have to deal with in larger Ceph installations?

Finally, this is all theoretical planning to try and avoid designing in
bottlenecks at the outset. I don't have any concrete ideas of loading,
so in practice none of it may be an issue.

Thanks for your thoughts.

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
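On the multiple-public-networks question above: ceph.conf accepts a comma-separated list of subnets for both the public and cluster networks, so a split along the lines described would look roughly like this (all addresses are made up for illustration):

```ini
[global]
    ; one subnet per switch; half the cluster lives in each
    public network  = 10.1.1.0/24, 10.1.2.0/24
    cluster network = 10.2.1.0/24, 10.2.2.0/24
```

Each daemon binds to whichever listed subnet its host has an interface in; clients and OSDs then only need Layer 3 reachability to all of them, as Wido notes in the reply.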
Re: [ceph-users] ZFS on RBD?
On Fri, May 24, 2013 at 05:10:14PM +0200, Wido den Hollander wrote:
> On 05/23/2013 11:34 PM, Tim Bishop wrote:
> > I'm evaluating Ceph and one of my workloads is a server that
> > provides home directories to end users over both NFS and Samba. I'm
> > looking at whether this could be backed by Ceph provided storage.
> >
> > So to test this I built a single node Ceph instance (Ubuntu precise,
> > ceph.com packages) in a VM and popped a couple of OSDs on it. I then
> > built another VM and used it to mount an RBD from the Ceph node. No
> > problems... it all worked as described in the documentation.
> >
> > Then I started to look at the filesystem I was using on top of the
> > RBD. I'd tested ext4 without any problems. I'd been testing ZFS
> > (from the stable zfs-native PPA) separately against local storage on
> > the client VM too, so I thought I'd try that on top of the RBD. This
> > is when I hit problems, and the VM panicked (trace at the end of
> > this email).
> >
> > Now I am just experimenting, so this isn't a huge deal right now.
> > But I'm wondering if this is something that should work? Am I
> > overlooking something? Is it a silly idea to even try it?
>
> It should work, but I'm not sure what is happening here. But I'm
> wondering, what's the reasoning behind this? You can't use ZFS on
> multiple machines, so you are exporting via RBD from one machine to
> another.
>
> Wouldn't it be easier to just use NBD or iSCSI in this case?
>
> I can't find the use case here for using RBD, since that is designed
> to work with a distributed load.
>
> Is this just a test you wanted to run, or something you were thinking
> about deploying?

Thank you for the reply. It's a bit of both; at this stage I'm just
testing, but it's something I might deploy, if it works. I'll briefly
explain the scenario.

I have various systems that I'd like to move on to Ceph, including
stuff like VM servers.
But this particular workload is a set of home directories that are
mounted across a mixture of Unix-based servers, some Linux, some
Solaris, and also end user desktops using Windows and MacOS.

Since I can't directly mount the filesystem on all the end user
machines, I thought a proxy host would be a good idea. It could mount
the RBD directly and then reshare it using NFS and Samba to the various
other machines. It could have 10Gbit networking to make full use of the
available storage from Ceph.

I could make the filesystem on the proxy host just ext4, but I pondered
ZFS for some of the extra features it offers. For example, creating a
filesystem per user and easy snapshots.

The overall idea is to consolidate storage from various different
systems using locally attached storage arrays into a central storage
pool based on Ceph. It's just an idea at this stage, so I'm testing to
see what's feasible and what works. Please do let me know if I'm
approaching this in the wrong way!

Thank you,

Tim.

(I submitted a bug report to the ZFS folk:
https://github.com/zfsonlinux/spl/issues/241 )

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984
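For reference, the reshare step on the proxy host described above amounts to config fragments along these lines. The paths, the client subnet, and the `%S` per-user layout are hypothetical, and assume the RBD is already mapped and mounted at /srv/homes:

```
# /etc/exports -- reshare the mounted RBD tree over NFS
/srv/homes  10.0.0.0/24(rw,sync,no_subtree_check)

# smb.conf fragment -- same tree over Samba; the special [homes]
# section maps each user to their own directory (%S = session user)
[homes]
   path = /srv/homes/%S
   browseable = no
   read only = no
```

With ZFS underneath, a filesystem per user would mean one dataset mounted under /srv/homes per user, which is what makes the per-user quotas and snapshots mentioned above cheap.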
[ceph-users] ZFS on RBD?
Hi all,

I'm evaluating Ceph and one of my workloads is a server that provides
home directories to end users over both NFS and Samba. I'm looking at
whether this could be backed by Ceph provided storage.

So to test this I built a single node Ceph instance (Ubuntu precise,
ceph.com packages) in a VM and popped a couple of OSDs on it. I then
built another VM and used it to mount an RBD from the Ceph node. No
problems... it all worked as described in the documentation.

Then I started to look at the filesystem I was using on top of the RBD.
I'd tested ext4 without any problems. I'd been testing ZFS (from the
stable zfs-native PPA) separately against local storage on the client
VM too, so I thought I'd try that on top of the RBD. This is when I hit
problems, and the VM panicked (trace at the end of this email).

Now I am just experimenting, so this isn't a huge deal right now. But
I'm wondering if this is something that should work? Am I overlooking
something? Is it a silly idea to even try it?

The trace looks to be in the ZFS code, so if there's a bug that needs
fixing it's probably over there rather than in Ceph, but I thought here
might be a good starting point for advice.

Thanks in advance everyone,

Tim.

[  504.644120] divide error: [#1] SMP
[  504.644298] Modules linked in: coretemp(F) ppdev(F) vmw_balloon(F) microcode(F) psmouse(F) serio_raw(F) parport_pc(F) vmwgfx(F) i2c_piix4(F) mac_hid(F) ttm(F) shpchp(F) drm(F) rbd(F) libceph(F) lp(F) parport(F) zfs(POF) zcommon(POF) znvpair(POF) zavl(POF) zunicode(POF) spl(OF) floppy(F) e1000(F) mptspi(F) mptscsih(F) mptbase(F) btrfs(F) zlib_deflate(F) libcrc32c(F)
[  504.646156] CPU 0
[  504.646234] Pid: 2281, comm: txg_sync Tainted: PF B O 3.8.0-21-generic #32~precise1-Ubuntu VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
[  504.646550] RIP: 0010:[] [] spa_history_write+0x82/0x1d0 [zfs]
[  504.646816] RSP: 0018:88003ae3dab8 EFLAGS: 00010246
[  504.646940] RAX: RBX: RCX:
[  504.647091] RDX: RSI: 0020 RDI:
[  504.647242] RBP: 88003ae3db28 R08: 88003b2afc00 R09: 0002
[  504.647423] R10: 88003b9a4512 R11: 6d206b6e61742066 R12: 88003add6600
[  504.647600] R13: 88003cfc2000 R14: 88003d3c9000 R15: 0008
[  504.647778] FS: () GS:88003fc0() knlGS:
[  504.647997] CS: 0010 DS: ES: CR0: 8005003b
[  504.648153] CR2: 7fbc1ef54a38 CR3: 3bf3e000 CR4: 07f0
[  504.648380] DR0: DR1: DR2:
[  504.648586] DR3: DR6: 0ff0 DR7: 0400
[  504.648766] Process txg_sync (pid: 2281, threadinfo 88003ae3c000, task 88003b7c45c0)
[  504.648990] Stack:
[  504.649087]  0002 a01e3360 88003b2afc00 88003ae3dba0
[  504.649461]  88003d3c9000 0008 88003cfc2000 5530ebc2
[  504.649835]  88003d22ac40 88003d22ac40 88003cfc2000 88003b2afc00
[  504.650209] Call Trace:
[  504.650351]  [] spa_history_log_sync+0x235/0x650 [zfs]
[  504.650554]  [] dsl_sync_task_group_sync+0x123/0x210 [zfs]
[  504.650760]  [] dsl_pool_sync+0x41b/0x530 [zfs]
[  504.650953]  [] spa_sync+0x3a8/0xa50 [zfs]
[  504.651117]  [] ? ktime_get_ts+0x4c/0xe0
[  504.651302]  [] txg_sync_thread+0x2df/0x540 [zfs]
[  504.651501]  [] ? txg_init+0x250/0x250 [zfs]
[  504.651676]  [] thread_generic_wrapper+0x78/0x90 [spl]
[  504.651856]  [] ? __thread_create+0x310/0x310 [spl]
[  504.652029]  [] kthread+0xc0/0xd0
[  504.652174]  [] ? flush_kthread_worker+0xb0/0xb0
[  504.652339]  [] ret_from_fork+0x7c/0xb0
[  504.652492]  [] ? flush_kthread_worker+0xb0/0xb0
[  504.652655] Code: 55 b0 48 89 fa 48 29 f2 48 01 c2 48 39 55 b8 0f 82 bc 00 00 00 4c 8b 75 b0 41 bf 08 00 00 00 48 29 c8 31 d2 49 8b b5 70 08 00 00 <48> f7 f7 4c 8d 45 c0 4c 89 f7 48 01 ca 48 29 d3 48 83 fb 08 49
[  504.659810] RIP  [] spa_history_write+0x82/0x1d0 [zfs]
[  504.660045] RSP
[  504.660187] ---[ end trace e69c7eee3ba17773 ]---

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984