Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
To be clear, when you are restarting these OSDs, how many PGs go into the peering state? And do they stay there for the full 3 minutes? Certainly I've seen IOPS drop to zero or near zero when a large number of PGs are peering. It would be wonderful if we could keep IOPS flowing even when PGs are peering. In your case, with such a high PG/OSD count, my guess is peering always takes a long time. As the OSD goes down it has to peer those 564 PGs across the remaining 3 OSDs, then re-peer them once the OSD comes up again...

Also, because the OSD is a RAID6, I'm pretty sure the IO pattern is going to be bad: all 564 of those threads are going to request reads and writes (the peering process updates metadata in each PG directory on the OSD) nearly simultaneously. In a RAID6, each non-cached read will cause a read IO from at least 5 disks, and each write will cause a write IO to all 7 disks. With that many threads hitting the volume simultaneously you're going to have massive disk head contention and seek times, which is going to absolutely destroy your IOPS and make peering take that much longer. In effect, in the non-cached case the RAID6 is going to almost entirely negate the distribution of IO load across those 7 disks, and is going to make them behave with performance closer to a single HDD. As Lionel said earlier, the HW cache is going to be nearly useless in any sort of recovery scenario in Ceph (which this is).

I hope Robert or someone can come up with a way to continue IO to a PG in the peering state; that would be wonderful, as I believe this is the fundamental problem. I'm not "happy" with the amount of work we had to put in to get our cluster to behave as well as it does now, and it would certainly be great if things "Just Worked". I'm just trying to relate our experience, and point out what I see as the bottleneck in this particular setup based on that experience.
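Tom's amplification argument can be put into rough numbers. This is napkin math only: the 5-disk-read/7-disk-write figures are the assumptions from the paragraph above, and ~100 IOPS per spindle is a typical 7.2k HDD figure, not a measurement from this cluster.

```python
# Back-of-envelope disk IO generated by peering on a 7-disk RAID6 OSD,
# using the assumptions above: each non-cached read hits at least
# 5 disks, each metadata write hits all 7.
def raid6_disk_ios(pgs, reads_per_pg=1, writes_per_pg=1,
                   disks_per_read=5, disks_per_write=7):
    """Low-level disk IOs for one read and one write per PG while peering."""
    return pgs * (reads_per_pg * disks_per_read +
                  writes_per_pg * disks_per_write)

ios = raid6_disk_ios(564)
print(ios)        # 6768 disk IOs for a single read+write pass over 564 PGs
# If head contention reduces the array to single-spindle behaviour
# (~100 IOPS, as argued above), that one pass alone costs about a minute:
print(ios / 100)  # 67.68 seconds
```

Real peering does far more than one read and one write per PG, so under these assumptions this is a lower bound.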
I believe the Ceph PG calculator and the recommendations about PG counts are too high, and your setup is 2-3x above even that. I've been able to easily topple clusters (mostly due to RAM exhaustion/swapping/OOM killer) with the recommended PG/OSD counts and recommended RAM (1GB/OSD + 1GB/TB of storage) by causing recovery in a cluster for 2 years now, and it's not been improved as far as I can tell. The only solution I've seen work reliably is to drop the PG/OSD ratio. Dropping said ratio also greatly reduced the peering load and time, and made the pain of OSD restarts almost negligible.

To your question about our data distribution: it is excellent as far as per-PG size is concerned, less than 3% variance between PGs. We did, however, see a massive disparity in how many PGs each OSD gets. Originally we had OSDs with as few as 100 PGs, and some with as many as 250, when on average they should have had about 175 PGs each; that was with the recommended PG/OSD settings. Additionally, that ratio/variance has been the same regardless of the number of PGs/OSD. Meaning it started out bad and stayed bad, but didn't get worse as we added OSDs. We've had to reweight OSDs in our crushmap to get anything close to a sane distribution of PGs.

-Tom

On Sat, Feb 13, 2016 at 10:57 PM, Christian Balzer wrote: > On Sat, 13 Feb 2016 20:51:19 -0700 Tom Christensen wrote: > > > > > Next this : > --- > 2016-02-12 01:35:33.915981 7f75be4d57c0 0 osd.2 > > > > 1788 load_pgs 2016-02-12 01:36:32.989709 7f75be4d57c0 0 osd.2 1788 > > > > load_pgs opened > > > 564 pgs > --- > Another minute to load the PGs. > > > Same OSD reboot as above : 8 seconds for this. > > > > Do you really have 564 pgs on a single OSD? > > Yes, the reason is simple, more than a year ago it should have been 8 OSDs > (halving that number) and now it should be 18 OSDs, which would be a > perfect fit for the 1024 PGs in the rbd pool. > > >I've never had anything like > > decent performance on an OSD with greater than about 150pgs.
In our > > production clusters we aim for 25-30 primary pgs per osd, 75-90pgs/osd > > total (with size set to 3). When we initially deployed our large cluster > > with 150-200pgs/osd (total, 50-70 primary pgs/osd, again size 3) we had > > no end of trouble getting pgs to peer. The OSDs ate RAM like nobody's > > business, took forever to do anything, and in general caused problems. > > The cluster performs admirable for the stress it is under, the number of > PGs per OSD never really was an issue when it came to CPU/RAM/network. > For example the restart increased the OSD process size from 1.3 to 2.8GB, > but that left 24GB still "free". > The main reason to have more OSDs (and thus a lower PG count per OSD) is > to have more IOPS from the underlying storage. > > > If you're running 564 pgs/osd in this 4 OSD cluster, I'd look at that > > first as the potential culprit. That is a lot of threads inside the OSD > > process that all need to get CPU/network/disk time in order to peer as > > they come up. Especially on firefly I would point to this.
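Some napkin math tying the figures in this exchange together. The replica count here is an assumption (the thread never states the pool size; size=2 is what makes the numbers line up), and the per-PG memory cost is derived only from the 1.3 GB to 2.8 GB restart growth Christian quotes, not from Ceph internals.

```python
def pgs_per_osd(pg_num, replica_size, num_osds):
    """Average total (primary + replica) PGs landing on each OSD."""
    return pg_num * replica_size / num_osds

# 1024 PGs in the rbd pool on 4 OSDs, assuming size=2:
print(pgs_per_osd(1024, 2, 4))   # 512.0 -- in line with the 564 observed
# The originally planned 18 OSDs would land in Tom's comfort zone:
print(pgs_per_osd(1024, 2, 18))  # ~113.8, under the ~150 ceiling Tom suggests

# Per-PG memory cost implied by the restart figures above
# (OSD process grew from 1.3 GB to 2.8 GB while opening 564 PGs):
per_pg_mb = (2.8 - 1.3) * 1024 / 564
print(round(per_pg_mb, 1))       # ~2.7 MB of extra RSS per PG during startup
```

At that per-PG cost, an OSD kept under 100 PGs would grow by well under 300 MB on restart, one reason lower ratios survive recovery more gracefully.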
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
On 13/02/2016 06:31, Christian Balzer wrote:
> [...]
> ---
> So from shutdown to startup about 2 seconds, not that bad.
> However here is where the cookie crumbles massively:
> ---
> 2016-02-12 01:33:50.263152 7f75be4d57c0 0 filestore(/var/lib/ceph/osd/ceph-2) limited size xattrs
> 2016-02-12 01:35:31.809897 7f75be4d57c0 0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
> ---
> Nearly 2 minutes to mount things, it probably had to go to disk quite a
> bit, as not everything was in the various slab caches. And yes, there is
> 32GB of RAM, most of it pagecache and vfs_cache_pressure is set to 1.
> During that time, silence of the lambs when it came to ops.

Hum, that's surprisingly long. How much data (size and number of files) do you have on this OSD, which FS do you use, what are the mount options, what is the hardware and the kind of access? The only time I saw OSDs take several minutes to reach the point where they fully rejoin is with BTRFS with default options/config.

For reference, our last OSD restart only took 6 seconds to complete this step. We only have RBD storage, so this OSD with 1TB of data has ~250k 4M files. It was created ~1 year ago, and this is after a complete OS umount/mount cycle which drops the cache (from experience, Ceph "mount" messages don't actually imply that the FS was not mounted).

> Next this :
> ---
> 2016-02-12 01:35:33.915981 7f75be4d57c0 0 osd.2 1788 load_pgs
> 2016-02-12 01:36:32.989709 7f75be4d57c0 0 osd.2 1788 load_pgs opened 564 pgs
> ---
> Another minute to load the PGs.

Same OSD reboot as above: 8 seconds for this. This would be way faster if we didn't start with an unmounted OSD. This OSD is still BTRFS, but we don't use autodefrag anymore (we replaced it with our own defragmentation scheduler) and disabled BTRFS snapshots in Ceph to reach this point. Last time I checked, an OSD startup was still faster with XFS.
So do you use BTRFS in the default configuration or have a very high number of files on this OSD?

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
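Lionel's file count follows directly from RBD's default 4 MB object size (one object becomes one file on a FileStore OSD), so it can be checked with a one-liner:

```python
# File count on a FileStore OSD holding only RBD data: one file per
# fixed-size RBD object (4 MB is the RBD default object size).
def rbd_files(data_tb, object_mb=4):
    """Approximate number of object files for `data_tb` terabytes of RBD data."""
    return data_tb * 1024 * 1024 // object_mb

print(rbd_files(1))  # 262144 -- i.e. roughly 250k files for the 1 TB OSD above
```

The same arithmetic applied to a 2.6 TB OSD gives ~680k files, consistent with the "a little more than 700k" figure Christian gives later in the thread.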
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
Hello, On Sat, 13 Feb 2016 11:14:23 +0100 Lionel Bouton wrote: > Le 13/02/2016 06:31, Christian Balzer a écrit : > > [...] > --- > So from shutdown to startup about 2 seconds, not that > > bad. > > However here is where the cookie crumbles massively: > --- > 2016-02-12 > 01:33:50.263152 7f75be4d57c0 0 filestore(/var/lib/ceph/osd/ceph-2) > limited size xattrs > 2016-02-12 01:35:31.809897 7f75be4d57c0 0 > filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal > mode > : checkpoint is not enabled > --- > Nearly 2 minutes to mount > things, it probably had to go to disk quite a > bit, as not everything > was in the various slab caches. And yes, there is > 32GB of RAM, most of > it pagecache and vfs_cache_pressure is set to 1. > During that time, > silence of the lambs when it came to ops. > > > Hum that's surprisingly long. How much data (size and nb of files) do > you have on this OSD, which FS do you use, what are the mount options, > what is the hardware and the kind of access ? > I already mentioned the HW, Areca RAID controller with 2GB HW cache and a 7 disk RAID6 per OSD. Nothing aside from noatime for mount options and EXT4. 2.6TB per OSD and with 1.4 million objects in the cluster a little more than 700k files per OSD. And kindly take note that my test cluster has less than 120k objects and thus 15k files per OSD and I still was able to reproduce this behaviour (in spirit at least). > The only time I saw OSDs take several minutes to reach the point where > they fully rejoin is with BTRFS with default options/config. > There isn't a pole long enough I would touch BTRFS with for production, especially in conjunction with Ceph. > For reference our last OSD restart only took 6 seconds to complete this > step. We only have RBD storage, so this OSD with 1TB of data has ~25 > 4M files. 
> It was created ~ 1 year ago and this is after a complete OS
> umount/mount cycle which drops the cache (from experience Ceph mount
> messages doesn't actually imply that the FS was not mounted).

The "mount" in the ceph logs clearly is not a FS/OS level mount. This OSD was up for about 2 years. My other, more "conventional" production cluster has 400GB and 100k files per OSD and is very fast to restart as well. Alas, it is also nowhere near as busy as this cluster, by roughly 2 orders of magnitude.

> > Next this : > --- > 2016-02-12 01:35:33.915981 7f75be4d57c0 0 osd.2
> > 1788 load_pgs 2016-02-12 01:36:32.989709 7f75be4d57c0 0 osd.2 1788
> > load_pgs opened > 564 pgs > --- > Another minute to load the PGs.
> Same OSD reboot as above : 8 seconds for this.
>
> This would be way faster if we didn't start with an umounted OSD.

Again, it was never unmounted from a FS/OS perspective.

Regards,

Christian

> This OSD is still BTRFS but we don't use autodefrag anymore (we replaced
> it with our own defragmentation scheduler) and disabled BTRFS snapshots
> in Ceph to reach this point. Last time I checked an OSD startup was
> still faster with XFS.
>
> So do you use BTRFS in the default configuration or have a very high
> number of files on this OSD ?
>
> Lionel

--
Christian Balzer
Network/Systems Engineer
ch...@gol.com
Global OnLine Japan/Rakuten Communications
http://www.gol.com/
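Christian's figures are internally consistent if the pool size is 2 (an assumption, as the replication factor is never stated in the thread): 1.4 million objects with 2 replicas spread over 4 OSDs gives 700k files each.

```python
# Each RADOS object is one file on every OSD holding a replica of it,
# so files per OSD = objects * replicas / OSDs (assuming even spread).
def files_per_osd(objects, replica_size, num_osds):
    return objects * replica_size / num_osds

# 1.4M objects, assumed size=2, 4 OSDs:
print(files_per_osd(1_400_000, 2, 4))  # 700000.0, matching "a little more than 700k"
```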
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
Hi,

On 13/02/2016 15:52, Christian Balzer wrote:
> [..]
> > Hum that's surprisingly long. How much data (size and nb of files) do
> > you have on this OSD, which FS do you use, what are the mount options,
> > what is the hardware and the kind of access ?
>
> I already mentioned the HW, Areca RAID controller with 2GB HW cache and a
> 7 disk RAID6 per OSD.
> Nothing aside from noatime for mount options and EXT4.

Thanks for the reminder. That said, 7-disk RAID6 and EXT4 is new to me and may not be innocent.

> 2.6TB per OSD and with 1.4 million objects in the cluster a little more
> than 700k files per OSD.

That's nearly 3x more than my example OSD, but it doesn't explain the more than 10x difference in startup time (especially considering BTRFS OSDs are slow to start up, and my example was with dropped caches, unlike your case). Your average file size is similar, so it's not that either. Unless you have a more general, system-wide performance problem which impacts everything including the OSD init, there are 3 main components involved here:
- Ceph OSD init code,
- the ext4 filesystem,
- the HW RAID6 block device.

So either:
- the OSD init code doesn't scale past ~500k objects per OSD,
- your ext4 filesystem is slow for the kind of access used during init (inherently or due to fragmentation; you might want to use filefrag on a random sample of PG directories, omap and meta),
- your RAID6 array is slow for the kind of access used during init,
- or any combination of the above.

I believe it's possible but doubtful that the OSD code wouldn't scale at this level (this does not feel like an abnormally high number of objects to me). Ceph devs will know better. ext4 could be a problem, as it's not the most common choice for OSDs (from what I read here, XFS is usually preferred over it) and it forces Ceph to use omap to store data which would otherwise be stored in extended attributes (which probably isn't without performance problems). RAID5/6 on HW might have performance problems.
The usual ones happen on writes, and OSD init is probably read-intensive (or maybe not: you should check the kind of access happening during the OSD init to avoid any surprise), but with HW cards it's difficult to know for sure what performance limitations they introduce (the only sure way is testing the actual access patterns).

So I would probably try to reproduce the problem by replacing one of the RAID6-based OSDs with as many OSDs as you have devices in the array. Then, if it solves the problem and you didn't already do it, you might want to explore Areca tuning, specifically with RAID6 if you must have it.

> And kindly take note that my test cluster has less than 120k objects and
> thus 15k files per OSD and I still was able to reproduce this behaviour
> (in spirit at least).

I assume the test cluster uses ext4 and RAID6 arrays too: it would be a perfect testing environment for defragmentation/switch to XFS/switch to single-drive OSDs then.

>> The only time I saw OSDs take several minutes to reach the point where
>> they fully rejoin is with BTRFS with default options/config.
>
> There isn't a pole long enough I would touch BTRFS with for production,
> especially in conjunction with Ceph.

That's a matter of experience and environment, but I can understand: we invested more than a week of testing/development to reach a point where BTRFS was performing better than XFS in our use case. Not everyone can dedicate as much time just to select a filesystem and support it. There might be use cases where it's not even possible to use it (I'm not sure how it would perform if you only did small-object storage, for example). BTRFS has been invaluable though: it detected and helped fix corruption generated by faulty RAID controllers (by forcing Ceph to use other replicas when repairing). I wouldn't let precious data live on anything other than checksumming filesystems now (the probabilities of undetectable disk corruption are too high for our use case).
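To illustrate the error-rate arithmetic behind that last point: a sketch using the commonly quoted 1-in-10^14 unrecoverable-read-error rate for consumer HDDs. Silent (undetected) corruption is rarer than these detected UREs and drive classes vary, so the rate here is an assumption, but the same math applies.

```python
# Probability of hitting at least one unrecoverable read error (URE)
# when reading back `terabytes` of data, for a given bit error rate.
def p_at_least_one_ure(terabytes, ber=1e-14):
    bits = terabytes * 8e12          # 1 TB = 10^12 bytes = 8e12 bits
    return 1 - (1 - ber) ** bits

# Reading back one full 2.6 TB OSD (the size discussed in this thread):
print(p_at_least_one_ure(2.6))       # ~0.19, roughly a 1-in-5 chance
```

Even if only a small fraction of such errors were to slip past the drive undetected, at fleet scale and over years the expected count is not negligible, which is the argument for checksumming filesystems.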
We have 30 BTRFS OSDs in production (and many BTRFS filesystems on other systems) and we've never had any problem with them. These filesystems even survived several bad datacenter equipment failures (a faulty backup generator control system, and a UPS blowing up during periodic testing). That said, I'm subscribed to linux-btrfs and was one of the SATA controller driver maintainers long ago, so I know my way around kernel code; I hand-pick the kernel versions going to production, and we have custom tools and maintenance procedures for the BTRFS OSDs. So I have the means and experience which make this choice comfortable for me and my team: I wouldn't blindly advise BTRFS to anyone else (not yet).

Anyway, it's possible ext4 is a problem, but it seems to me less likely than the HW RAID6. In my experience RAID controllers with cache aren't really worth it with Ceph. Most of the time they perform well because of BBWC/FBWC, but when you get into a situation where you must repair/backfill because you lost an OSD or added a
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
Hello,

I was about to write something very much along these lines, thanks for beating me to it. ^o^

On Sat, 13 Feb 2016 21:50:17 -0700 Robert LeBlanc wrote:
> I'm still going to see if I can get Ceph clients to hardly notice that
> an OSD comes back in. Our setup is EXT4 and our SSDs have the hardest
> time with the longest recovery impact. It should be painless no matter
> how slow the drives/CPU/etc are. If it means waiting to service client
> I/O until all the peering, and stuff (not including
> backfilling/recovery because that can be done in the background
> without much impact already) is completed before sending the client
> I/O to the OSD, then that is what I'm going to target. That way if it
> takes 5 minutes for the OSD to get its bearings because it is swapping
> due to low memory or whatever, the clients happily ignore the OSD
> until it says it is ready and don't have all the client I/O fighting
> to get a piece of scarce resources.

Spot on. The recommendation in the Ceph documentation is noout; the logic everybody assumes is that no I/O goes to the OSD until it is actually ready to serve it, and the reality clearly disproves this. Once the restart takes longer than a few seconds, for whatever reason, it becomes very visible.

> I appreciate all the suggestions that have been mentioned and believe
> that there is a fundamental issue here that causes a problem when you
> run your hardware into the red zone (like we have to do out of
> necessity). You may be happy with how things are set up in your
> environment, but I'm not ready to give up on it and I think we can
> make it better. That way it "Just Works" (TM) with more hardware and
> configurations and doesn't need tons of effort to get it tuned just
> right. Oh, and be careful not to touch it, the balance of the force
> might get thrown off and the whole thing will tank.
This is exactly what happened in my case, and we've seen evidence for it in this ML plenty of times. Like with nearly all things I/O, there is a tipping point: until it is reached everything is fine, and then it isn't, often catastrophically so.

> That does not make
> me feel confident. Ceph is so resilient in so many ways already, why
> should this be an Achilles heel for some?

Well said indeed.

Christian

> [PGP signature elided]
>
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> On Sat, Feb 13, 2016 at 8:51 PM, Tom Christensen wrote: > >> > Next this : > --- > 2016-02-12 01:35:33.915981 7f75be4d57c0 0 osd.2 > >> > 1788 load_pgs 2016-02-12 01:36:32.989709 7f75be4d57c0 0 osd.2 1788 > >> > load_pgs opened > >> 564 pgs > --- > Another minute to load the PGs. > >> Same OSD reboot as above : 8 seconds for this. > > > > Do you really have 564 pgs on a single OSD? I've never had anything > > like decent performance on an OSD with greater than about 150pgs. In > > our production clusters we aim for 25-30 primary pgs per osd, > > 75-90pgs/osd total (with size set to 3).
When we initially deployed > > our large cluster with 150-200pgs/osd (total, 50-70 primary pgs/osd, > > again size 3) we had no end of trouble getting pgs to peer. The OSDs > > ate RAM like nobody's business, took forever to do anything, and in > > general caused problems. If you're running 564 pgs/osd in this 4 OSD > > cluster, I'd look at that first as the potential culprit. That is a > > lot of threads inside the OSD process that all need to get > > CPU/network/disk time in order to peer as they come up. Especially on > > firefly I would point to this. We've moved to Hammer and that did > > improve a number of our performance bottlenecks, though we've also > > grown our cluster without adding pgs, so we are now down in the 25-30 > > primary pgs/osd range, and restarting osds, or whole nodes (24-32 OSDs > > for us) no longer causes us pain. In the past restarting a node could > > cause 5-10 minutes of peering and pain/slow requests/unhappiness of > > various sorts (RAM exhaustion, OOM Killer, Flapping
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
On Sat, 13 Feb 2016 20:51:19 -0700 Tom Christensen wrote: > > > Next this : > --- > 2016-02-12 01:35:33.915981 7f75be4d57c0 0 osd.2 > > > 1788 load_pgs 2016-02-12 01:36:32.989709 7f75be4d57c0 0 osd.2 1788 > > > load_pgs opened > > 564 pgs > --- > Another minute to load the PGs. > > Same OSD reboot as above : 8 seconds for this. > > Do you really have 564 pgs on a single OSD?

Yes, the reason is simple, more than a year ago it should have been 8 OSDs (halving that number) and now it should be 18 OSDs, which would be a perfect fit for the 1024 PGs in the rbd pool.

>I've never had anything like > decent performance on an OSD with greater than about 150pgs. In our > production clusters we aim for 25-30 primary pgs per osd, 75-90pgs/osd > total (with size set to 3). When we initially deployed our large cluster > with 150-200pgs/osd (total, 50-70 primary pgs/osd, again size 3) we had > no end of trouble getting pgs to peer. The OSDs ate RAM like nobody's > business, took forever to do anything, and in general caused problems.

The cluster performs admirably for the stress it is under; the number of PGs per OSD never really was an issue when it came to CPU/RAM/network. For example, the restart increased the OSD process size from 1.3 to 2.8GB, but that left 24GB still "free". The main reason to have more OSDs (and thus a lower PG count per OSD) is to have more IOPS from the underlying storage.

> If you're running 564 pgs/osd in this 4 OSD cluster, I'd look at that > first as the potential culprit. That is a lot of threads inside the OSD > process that all need to get CPU/network/disk time in order to peer as > they come up. Especially on firefly I would point to this. We've moved > to Hammer and that did improve a number of our performance bottlenecks, > though we've also grown our cluster without adding pgs, so we are now > down in the 25-30 primary pgs/osd range, and restarting osds, or whole > nodes (24-32 OSDs for us) no longer causes us pain.
At that PG count, how good (bad, really) is your data balancing out?

> In the past > restarting a node could cause 5-10 minutes of peering and pain/slow > requests/unhappiness of various sorts (RAM exhaustion, OOM Killer, > Flapping OSDs).

Nodes with that high a number of OSDs I can indeed see causing pain.

> This all improved greatly once we got our pg/osd count > under 100 even before we upgraded to hammer.

Interesting point, but in my case all the slowness can be attributed to disk I/O of the respective backing storage. Which should be fast enough if all it had to do were to read things in. I'll see if Hammer behaves better, but I doubt it (especially the first time, when it upgrades stuff on the disk).

Penultimately, however, I didn't ask how to speed up OSD restarts (I have a lot of knowledge/ideas on how to do that), I asked about mitigating the impact of OSD restarts when they are going to be slow, for whatever reason.

Regards,

Christian

> > > > > On Sat, Feb 13, 2016 at 11:08 AM, Lionel Bouton < > lionel-subscript...@bouton.name> wrote: > > > Hi, > > > > Le 13/02/2016 15:52, Christian Balzer a écrit : > > > [..] > > > > > > Hum that's surprisingly long. How much data (size and nb of files) do > > > you have on this OSD, which FS do you use, what are the mount > > > options, what is the hardware and the kind of access ? > > > > > > I already mentioned the HW, Areca RAID controller with 2GB HW cache > > > and a 7 disk RAID6 per OSD. > > > Nothing aside from noatime for mount options and EXT4. > > > > Thanks for the reminder. That said 7-disk RAID6 and EXT4 is new to me > > and may not be innocent. > > > > > > > > 2.6TB per OSD and with 1.4 million objects in the cluster a little > > > more than 700k files per OSD.
> > > > That's nearly 3x more than my example OSD but it doesn't explain the > > more than 10x difference in startup time (especially considering BTRFS > > OSDs are slow to startup and my example was with dropped caches unlike > > your case). Your average file size is similar so it's not that either. > > Unless you have a more general, system-wide performance problem which > > impacts everything including the OSD init, there's 3 main components > > involved here : > > - Ceph OSD init code, > > - ext4 filesystem, > > - HW RAID6 block device. > > > > So either : > > - OSD init code doesn't scale past ~500k objects per OSD. > > - your ext4 filesystem is slow for the kind of access used during init > > (inherently or due to fragmentation, you might want to use filefrag on > > a random sample on PG directories, omap and meta), > > - your RAID6 array is slow for the kind of access used during init. > > - any combination of the above. > > > > I believe it's possible but doubtful that the OSD code wouldn't scale > > at this level (this does not feel like an abnormally high number of > > objects to me). Ceph devs will know better. > > ext4 could be a problem as it's not the most common choice for OSDs > > (from what I
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
> > Next this : > --- > 2016-02-12 01:35:33.915981 7f75be4d57c0 0 osd.2 > > 1788 load_pgs 2016-02-12 01:36:32.989709 7f75be4d57c0 0 osd.2 1788 > > load_pgs opened > 564 pgs > --- > Another minute to load the PGs. > Same OSD reboot as above : 8 seconds for this. Do you really have 564 pgs on a single OSD? I've never had anything like decent performance on an OSD with greater than about 150pgs. In our production clusters we aim for 25-30 primary pgs per osd, 75-90pgs/osd total (with size set to 3). When we initially deployed our large cluster with 150-200pgs/osd (total, 50-70 primary pgs/osd, again size 3) we had no end of trouble getting pgs to peer. The OSDs ate RAM like nobody's business, took forever to do anything, and in general caused problems. If you're running 564 pgs/osd in this 4 OSD cluster, I'd look at that first as the potential culprit. That is a lot of threads inside the OSD process that all need to get CPU/network/disk time in order to peer as they come up. Especially on firefly I would point to this. We've moved to Hammer and that did improve a number of our performance bottlenecks, though we've also grown our cluster without adding pgs, so we are now down in the 25-30 primary pgs/osd range, and restarting osds, or whole nodes (24-32 OSDs for us) no longer causes us pain. In the past restarting a node could cause 5-10 minutes of peering and pain/slow requests/unhappiness of various sorts (RAM exhaustion, OOM Killer, Flapping OSDs). This all improved greatly once we got our pg/osd count under 100 even before we upgraded to hammer. On Sat, Feb 13, 2016 at 11:08 AM, Lionel Bouton < lionel-subscript...@bouton.name> wrote: > Hi, > > Le 13/02/2016 15:52, Christian Balzer a écrit : > > [..] > > > > Hum that's surprisingly long. How much data (size and nb of files) do > > you have on this OSD, which FS do you use, what are the mount options, > > what is the hardware and the kind of access ? 
> > > > I already mentioned the HW, Areca RAID controller with 2GB HW cache and a > > 7 disk RAID6 per OSD. > > Nothing aside from noatime for mount options and EXT4. > > Thanks for the reminder. That said 7-disk RAID6 and EXT4 is new to me > and may not be innocent. > > > > > 2.6TB per OSD and with 1.4 million objects in the cluster a little more > > than 700k files per OSD. > > That's nearly 3x more than my example OSD but it doesn't explain the > more than 10x difference in startup time (especially considering BTRFS > OSDs are slow to startup and my example was with dropped caches unlike > your case). Your average file size is similar so it's not that either. > Unless you have a more general, system-wide performance problem which > impacts everything including the OSD init, there's 3 main components > involved here : > - Ceph OSD init code, > - ext4 filesystem, > - HW RAID6 block device. > > So either : > - OSD init code doesn't scale past ~500k objects per OSD. > - your ext4 filesystem is slow for the kind of access used during init > (inherently or due to fragmentation, you might want to use filefrag on a > random sample on PG directories, omap and meta), > - your RAID6 array is slow for the kind of access used during init. > - any combination of the above. > > I believe it's possible but doubtful that the OSD code wouldn't scale at > this level (this does not feel like an abnormally high number of objects > to me). Ceph devs will know better. > ext4 could be a problem as it's not the most common choice for OSDs > (from what I read here XFS is usually preferred over it) and it forces > Ceph to use omap to store data which would be stored in extended > attributes otherwise (which probably isn't without performance problems). > RAID5/6 on HW might have performance problems. 
The usual ones happen on > writes and OSD init is probably read-intensive (or maybe not, you should > check the kind of access happening during the OSD init to avoid any > surprise) but with HW cards it's difficult to know for sure the > performance limitations they introduce (the only sure way is testing the > actual access patterns). > > So I would probably try to reproduce the problem replacing one OSDs > based on RAID6 arrays with as many OSDs as you have devices in the arrays. > Then if it solves the problem and you didn't already do it you might > want to explore Areca tuning, specifically with RAID6 if you must have it. > > > > > > And kindly take note that my test cluster has less than 120k objects and > > thus 15k files per OSD and I still was able to reproduce this behaviour > (in > > spirit at least). > > I assume the test cluster uses ext4 and RAID6 arrays too: it would be a > perfect testing environment for defragmentation/switch to XFS/switch to > single drive OSDs then. > > > > >> The only time I saw OSDs take several minutes to reach the point where > >> they fully rejoin is with BTRFS with default options/config. > >> > > There isn't a pole long enough I would touch BTRFS with for production, > >
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256

I'm still going to see if I can get Ceph clients to hardly notice that an OSD comes back in. Our setup is EXT4, and our SSDs have the hardest time with the longest recovery impact. It should be painless no matter how slow the drives/CPU/etc. are. If it means waiting to service client I/O until all the peering and related work (not including backfilling/recovery, because that can already be done in the background without much impact) is completed before sending client I/O to the OSD, then that is what I'm going to target. That way, if it takes 5 minutes for the OSD to get its bearings because it is swapping due to low memory or whatever, the clients happily ignore the OSD until it says it is ready, and don't have all the client I/O fighting to get a piece of scarce resources.

I appreciate all the suggestions that have been mentioned, and believe there is a fundamental issue here that causes a problem when you run your hardware into the red zone (like we have to do out of necessity). You may be happy with how things are set up in your environment, but I'm not ready to give up on it and I think we can make it better. That way it "Just Works" (TM) with more hardware and configurations, and doesn't need tons of effort to get it tuned just right. Oh, and be careful not to touch it, the balance of the force might get thrown off and the whole thing will tank. That does not make me feel confident. Ceph is so resilient in so many ways already; why should this be an Achilles heel for some?
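Robert's proposal, that clients should simply not dispatch I/O to an OSD until it declares itself ready, can be sketched as a routing gate. This is conceptual Python only; Ceph's real OSDMap/peering machinery is nothing this simple, and all names here are made up for illustration.

```python
# Conceptual sketch of Robert's idea: route client IO only to replicas
# that have finished peering, instead of queueing it behind a
# still-initializing OSD that is busy with load_pgs/peering.
class OSD:
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.ready = False      # stays False through load_pgs/peering

    def finish_peering(self):
        self.ready = True

def route_io(pg_replicas):
    """Return the id of the first ready replica, or None if the IO
    genuinely has to wait (the degenerate case Robert wants to avoid)."""
    for osd in pg_replicas:
        if osd.ready:
            return osd.osd_id
    return None

primary, replica = OSD(2), OSD(3)
replica.finish_peering()
print(route_io([primary, replica]))  # 3 -- IO flows to the ready replica
primary.finish_peering()
print(route_io([primary, replica]))  # 2 -- primary takes over once ready
```

The point of the sketch is the ordering guarantee: a restarting OSD never sees client I/O until after it flips its own ready flag, so a 5-minute init hurts nobody as long as another replica can serve.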
Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Sat, Feb 13, 2016 at 8:51 PM, Tom Christensen wrote: >> > Next this : > --- > 2016-02-12 01:35:33.915981 7f75be4d57c0 0 osd.2 >> > 1788 load_pgs 2016-02-12 01:36:32.989709 7f75be4d57c0 0 osd.2 1788 >> > load_pgs opened >> 564 pgs > --- > Another minute to load the PGs. >> Same OSD reboot as above : 8 seconds for this. > > Do you really have 564 pgs on a single OSD? I've never had anything like > decent performance on an OSD with greater than about 150pgs. In our > production clusters we aim for 25-30 primary pgs per osd, 75-90pgs/osd total > (with size set to 3). When we initially deployed our large cluster with > 150-200pgs/osd (total, 50-70 primary pgs/osd, again size 3) we had no end of > trouble getting pgs to peer. The OSDs ate RAM like nobody's business, took > forever to do anything, and in general caused problems. If you're running > 564 pgs/osd in this 4 OSD cluster, I'd look at that first as the potential > culprit. That is a lot of threads inside the OSD process that all need to > get CPU/network/disk time in order to peer as they come up. 
Especially on > firefly I would point to this. We've moved to Hammer and that did improve a > number of our performance bottlenecks, though we've also grown our cluster > without adding pgs, so we are now down in the 25-30 primary pgs/osd range, > and restarting osds, or whole nodes (24-32 OSDs for us) no longer causes us > pain. In the past restarting a node could cause 5-10 minutes of peering and > pain/slow requests/unhappiness of various sorts (RAM exhaustion, OOM Killer, > Flapping OSDs). This all improved greatly once we got our pg/osd count > under 100 even before we upgraded to hammer. > > > > > > On Sat, Feb 13, 2016 at 11:08 AM, Lionel Bouton > wrote: >> >> Hi, >> >> Le 13/02/2016 15:52, Christian Balzer a écrit : >> > [..] >> > >> > Hum that's surprisingly long. How much data (size and nb of files) do >> > you have on this OSD, which FS do you use, what are the mount options, >> > what is the hardware and the kind of access ? >> > >> > I already mentioned the HW, Areca RAID controller with 2GB HW cache and >> > a >> > 7 disk RAID6 per OSD. >> > Nothing aside from noatime for mount options and EXT4. >> >> Thanks for the reminder. That said 7-disk RAID6 and EXT4 is new to me >> and may not be innocent. >> >> > >> > 2.6TB per OSD and with 1.4 million objects in the cluster a little more >> > than 700k files per OSD.
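As a quick sanity check on the ratios discussed above, the per-OSD PG load can be estimated with simple arithmetic. A minimal sketch; the 1024 PGs and the 4/18 OSD counts come from the thread, while the replica size of 2 is an assumption (it is never stated explicitly):

```shell
# Average PG copies per OSD = pool_pgs * replica_size / num_osds.
# replica_size=2 is an assumption; only the PG and OSD counts are from the thread.
pgs_per_osd() {
  local pool_pgs=$1 replica_size=$2 num_osds=$3
  echo $(( pool_pgs * replica_size / num_osds ))
}

pgs_per_osd 1024 2 4    # the current 4-OSD cluster: ~512 copies per OSD
pgs_per_osd 1024 2 18   # the intended 18 OSDs: ~113, near the ~150 ceiling mentioned above
```

With 4 OSDs this lands in the same ballpark as the 564 PGs seen in the osd.2 load_pgs log line (PGs from other pools would account for the remainder).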
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
Hi, On 02/12/2016 03:47 PM, Christian Balzer wrote: Hello, yesterday I upgraded our most busy (in other words lethally overloaded) production cluster to the latest Firefly in preparation for a Hammer upgrade and then phasing in of a cache tier. When restarting the OSDs it took 3 minutes (1 minute in a consecutive repeat to test the impact of primed caches) during which the cluster crawled to a near stand-still and the dreaded slow requests piled up, causing applications in the VMs to fail. I had of course set things to "noout" beforehand, in hopes of staving off this kind of scenario. Note that the other OSDs and their backing storage were NOT overloaded during that time, only the backing storage of the OSD being restarted was under duress. I was under the (wishful thinking?) impression that with noout set and a controlled OSD shutdown/restart, operations would be redirected to the new primary for the duration. The strain on the restarted OSDs when recovering those operations (which I also saw) I was prepared for, the near screeching halt not so much. Any thoughts on how to mitigate this further or is this the expected behavior? I wouldn't use noout in this scenario. It keeps the cluster from recognizing that an OSD is not available; other OSDs will still try to write to that OSD. This is probably the cause of the blocked requests. Redirecting only works if the cluster is able to detect a PG as being degraded. If the cluster is aware of the OSD being missing, it could handle the write requests more gracefully. To prevent it from backfilling etc., I prefer to use nobackfill and norecover. It blocks backfill on the cluster level, but allows requests to be carried out (at least in my understanding of these flags). 'noout' is fine for large scale cluster maintenance, since it keeps the cluster from backfilling. I've used it when I had to power down our complete cluster. 
Regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
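For reference, Burkhard's approach maps onto the standard cluster flags roughly like this (a sketch only; osd.2 and the systemd unit name are placeholders, and pre-systemd installs would use their init script instead):

```shell
# Keep the cluster from starting recovery/backfill, but do NOT set noout,
# so the restarting OSD is marked down and writes are handled as degraded.
ceph osd set nobackfill
ceph osd set norecover

systemctl restart ceph-osd@2   # osd.2 is a placeholder

# Once the OSD is back up and its PGs are active again, clear the flags.
ceph osd unset norecover
ceph osd unset nobackfill
```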
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
I wonder if Christian is hitting some performance issue when the OSD or number of OSD's all start up at once? Or maybe the OSD is still doing some internal startup procedure and when the IO hits it on a very busy cluster, it causes it to become overloaded for a few seconds? I've seen similar things in the past where if I did not have enough min free KB's configured, PG's would take a long time to peer/activate and cause slow ops. > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Steve Taylor > Sent: 12 February 2016 16:32 > To: Nick Fisk <n...@fisk.me.uk>; 'Christian Balzer' <ch...@gol.com>; ceph- > us...@lists.ceph.com > Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't > uptosnuff) > > Nick is right. Setting noout is the right move in this scenario. Restarting an > OSD shouldn't block I/O unless nodown is also set, however. The exception > to this would be a case where min_size can't be achieved because of the > down OSD, i.e. min_size=3 and 1 of 3 OSDs is restarting. That would certainly > block writes. Otherwise the cluster will recognize down OSDs as down > (without nodown set), redirect I/O requests to OSDs that are up, and backfill > as necessary when things are back to normal. > > You can set min_size to something lower if you don't have enough OSDs to > allow you to restart one without blocking writes. If this isn't the case, > something deeper is going on with your cluster. You shouldn't get slow > requests due to restarting a single OSD with only noout set and idle disks on > the remaining OSDs. I've done this many, many times. > > Steve Taylor | Senior Software Engineer | StorageCraft Technology > Corporation > 380 Data Drive Suite 300 | Draper | Utah | 84020 > Office: 801.871.2799 | Fax: 801.545.4705 > > If you are not the intended recipient of this message, be advised that any > dissemination or copying of this message is prohibited. 
> If you received this message erroneously, please notify the sender and > delete it, together with any attachments. > > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Nick Fisk > Sent: Friday, February 12, 2016 9:07 AM > To: 'Christian Balzer' <ch...@gol.com>; ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't > uptosnuff) > > > > > -Original Message- > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > > Of Christian Balzer > > Sent: 12 February 2016 15:38 > > To: ceph-users@lists.ceph.com > > Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout > > ain't > > uptosnuff) > > > > On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote: > > > > > Hi, > > > > > > On 02/12/2016 03:47 PM, Christian Balzer wrote: > > > > Hello, > > > > > > > > yesterday I upgraded our most busy (in other words lethally > > > > overloaded) production cluster to the latest Firefly in > > > > preparation for a Hammer upgrade and then phasing in of a cache tier. > > > > > > > > When restarting the ODSs it took 3 minutes (1 minute in a > > > > consecutive repeat to test the impact of primed caches) during > > > > which the cluster crawled to a near stand-still and the dreaded > > > > slow requests piled up, causing applications in the VMs to fail. > > > > > > > > I had of course set things to "noout" beforehand, in hopes of > > > > staving off this kind of scenario. > > > > > > > > Note that the other OSDs and their backing storage were NOT > > > > overloaded during that time, only the backing storage of the OSD > > > > being restarted was under duress. > > > > > > > > I was under the (wishful thinking?) impression that with noout set > > > > and a controlled OSD shutdown/restart, operations would be > > > > redirect to the new primary for the duration. 
> > > > The strain on the restarted OSDs when recovering those operations > > > > (which I also saw) I was prepared for, the near screeching halt > > > > not so much. > > > > > > > > Any thoughts on how to mitigate this further or is this the > > > > expected behavior? > > > > > > I wouldn't use noout in this scenario. It keeps the cluster from > > > recognizing that a OSD is not available; other OSD will still try to > > > write to that OSD. This is probably the cause of the blocked requests. >
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote: > Hi, > > On 02/12/2016 03:47 PM, Christian Balzer wrote: > > Hello, > > > > yesterday I upgraded our most busy (in other words lethally overloaded) > > production cluster to the latest Firefly in preparation for a Hammer > > upgrade and then phasing in of a cache tier. > > > > When restarting the OSDs it took 3 minutes (1 minute in a consecutive > > repeat to test the impact of primed caches) during which the cluster > > crawled to a near stand-still and the dreaded slow requests piled up, > > causing applications in the VMs to fail. > > > > I had of course set things to "noout" beforehand, in hopes of staving > > off this kind of scenario. > > > > Note that the other OSDs and their backing storage were NOT overloaded > > during that time, only the backing storage of the OSD being restarted > > was under duress. > > > > I was under the (wishful thinking?) impression that with noout set and > > a controlled OSD shutdown/restart, operations would be redirected to the > > new primary for the duration. > > The strain on the restarted OSDs when recovering those operations > > (which I also saw) I was prepared for, the near screeching halt not so > > much. > > > > Any thoughts on how to mitigate this further or is this the expected > > behavior? > > I wouldn't use noout in this scenario. It keeps the cluster from > recognizing that an OSD is not available; other OSDs will still try to > write to that OSD. This is probably the cause of the blocked requests. > Redirecting only works if the cluster is able to detect a PG as being > degraded. > Oh well, that makes of course sense, but I found some article stating that it also would redirect things, and the recovery activity I saw afterwards suggests it did so at some point. > If the cluster is aware of the OSD being missing, it could handle the > write requests more gracefully. To prevent it from backfilling etc, I > prefer to use nobackfill and norecover. 
It blocks backfill on the > cluster level, but allows requests to be carried out (at least in my > understanding of these flags). > Yes, I concur and was thinking of that as well. Will give it a spin with the upgrade to Hammer. > 'noout' is fine for large scale cluster maintenance, since it keeps the > cluster from backfilling. I've used it when I had to power down our > complete cluster. > Guess with my other, less busy clusters, this never showed up on my radar. Regards, Christian > Regards, > Burkhard -- Christian Balzer  Network/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Christian Balzer > Sent: 12 February 2016 15:38 > To: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't > uptosnuff) > > On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote: > > > Hi, > > > > On 02/12/2016 03:47 PM, Christian Balzer wrote: > > > Hello, > > > > > > yesterday I upgraded our most busy (in other words lethally > > > overloaded) production cluster to the latest Firefly in preparation > > > for a Hammer upgrade and then phasing in of a cache tier. > > > > > > When restarting the ODSs it took 3 minutes (1 minute in a > > > consecutive repeat to test the impact of primed caches) during which > > > the cluster crawled to a near stand-still and the dreaded slow > > > requests piled up, causing applications in the VMs to fail. > > > > > > I had of course set things to "noout" beforehand, in hopes of > > > staving off this kind of scenario. > > > > > > Note that the other OSDs and their backing storage were NOT > > > overloaded during that time, only the backing storage of the OSD > > > being restarted was under duress. > > > > > > I was under the (wishful thinking?) impression that with noout set > > > and a controlled OSD shutdown/restart, operations would be redirect > > > to the new primary for the duration. > > > The strain on the restarted OSDs when recovering those operations > > > (which I also saw) I was prepared for, the near screeching halt not > > > so much. > > > > > > Any thoughts on how to mitigate this further or is this the expected > > > behavior? > > > > I wouldn't use noout in this scenario. It keeps the cluster from > > recognizing that a OSD is not available; other OSD will still try to > > write to that OSD. This is probably the cause of the blocked requests. > > Redirecting only works if the cluster is able to detect a PG as being > > degraded. 
> > > Oh well, that makes of course sense, but I found some article stating that it > also would redirect things and the recovery activity I saw afterwards suggests > it did so at some point. Doesn't noout just stop the crushmap from being modified and hence data shuffling? Nodown controls whether or not the OSD is available for IO? Maybe try the reverse: set noup so that OSDs don't participate in IO and then bring them in manually. > > > If the cluster is aware of the OSD being missing, it could handle the > > write requests more gracefully. To prevent it from backfilling etc, I > > prefer to use nobackfill and norecover. It blocks backfill on the > > cluster level, but allows requests to be carried out (at least in my > > understanding of these flags). > > > Yes, I concur and was thinking of that as well. Will give it a spin with the > upgrade to Hammer. > > > 'noout' is fine for large scale cluster maintenance, since it keeps > > the cluster from backfilling. I've used it when I had to power down our > > complete cluster. > > > Guess with my other, less busy clusters, this never showed up on my radar. > > Regards, > > Christian > > Regards, > > Burkhard > -- > Christian Balzer  Network/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
Nick is right. Setting noout is the right move in this scenario. Restarting an OSD shouldn't block I/O unless nodown is also set, however. The exception to this would be a case where min_size can't be achieved because of the down OSD, i.e. min_size=3 and 1 of 3 OSDs is restarting. That would certainly block writes. Otherwise the cluster will recognize down OSDs as down (without nodown set), redirect I/O requests to OSDs that are up, and backfill as necessary when things are back to normal. You can set min_size to something lower if you don't have enough OSDs to allow you to restart one without blocking writes. If this isn't the case, something deeper is going on with your cluster. You shouldn't get slow requests due to restarting a single OSD with only noout set and idle disks on the remaining OSDs. I've done this many, many times. Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation 380 Data Drive Suite 300 | Draper | Utah | 84020 Office: 801.871.2799 | Fax: 801.545.4705 If you are not the intended recipient of this message, be advised that any dissemination or copying of this message is prohibited. If you received this message erroneously, please notify the sender and delete it, together with any attachments. 
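The size/min_size relationship Steve describes can be inspected and adjusted per pool with the standard commands (shown for the 'rbd' pool from the thread; the values are examples, and lowering min_size trades write availability against safety):

```shell
# Current replication settings for the pool.
ceph osd pool get rbd size       # e.g. "size: 3"
ceph osd pool get rbd min_size   # e.g. "min_size: 2"

# With size=3 and min_size=2, writes continue with one replica down.
# min_size equal to size means any down OSD in a PG blocks writes to it.
ceph osd pool set rbd min_size 2
```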
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick Fisk Sent: Friday, February 12, 2016 9:07 AM To: 'Christian Balzer' <ch...@gol.com>; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff) > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > Of Christian Balzer > Sent: 12 February 2016 15:38 > To: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout > ain't > uptosnuff) > > On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote: > > > Hi, > > > > On 02/12/2016 03:47 PM, Christian Balzer wrote: > > > Hello, > > > > > > yesterday I upgraded our most busy (in other words lethally > > > overloaded) production cluster to the latest Firefly in > > > preparation for a Hammer upgrade and then phasing in of a cache tier. > > > > > > When restarting the ODSs it took 3 minutes (1 minute in a > > > consecutive repeat to test the impact of primed caches) during > > > which the cluster crawled to a near stand-still and the dreaded > > > slow requests piled up, causing applications in the VMs to fail. > > > > > > I had of course set things to "noout" beforehand, in hopes of > > > staving off this kind of scenario. > > > > > > Note that the other OSDs and their backing storage were NOT > > > overloaded during that time, only the backing storage of the OSD > > > being restarted was under duress. > > > > > > I was under the (wishful thinking?) impression that with noout set > > > and a controlled OSD shutdown/restart, operations would be > > > redirect to the new primary for the duration. > > > The strain on the restarted OSDs when recovering those operations > > > (which I also saw) I was prepared for, the near screeching halt > > > not so much. > > > > > > Any thoughts on how to mitigate this further or is this the > > > expected behavior? > > > > I wouldn't use noout in this scenario. 
It keeps the cluster from > > recognizing that a OSD is not available; other OSD will still try to > > write to that OSD. This is probably the cause of the blocked requests. > > Redirecting only works if the cluster is able to detect a PG as > > being degraded. > > > Oh well, that makes of course sense, but I found some article stating > that it > also would redirect things and the recovery activity I saw afterwards suggests > it did so at some point. Doesn't noout just stop the crushmap from being modified and hence data shuffling. Nodown controls whether or not the OSD is available for IO? Maybe try the reverse. Set noup so that OSD's don't participate in IO and then bring them in manually? > > > If the cluster is aware of the OSD being missing, it could handle > > the write requests more gracefully. To prevent it from backfilling > > etc, I prefer to use nobackfill and norecover. It blocks backfill on > > the cluster level, but allows requests to be carried out (at least > > in my understanding of these flags). > > > Yes, I concur and was thinking of that as well. Will give it a spin > with the > upgrade to Hammer. > > > 'noout' is fine for large scale cluster maintenance, since it keeps > > the cluster from backfilling. I've
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
365 349 49.8472 0 - 1.02823
> 29 16 365 349 48.1285 0 - 1.02823
> 30 16 365 349 46.5243 0 - 1.02823
> 31 16 365 349 45.0236 0 - 1.02823
> 32 16 365 349 43.6167 0 - 1.02823
> 33 16 365 349 42.2951 0 - 1.02823
> 34 16 365 349 41.0512 0 - 1.02823
> 35 16 365 349 39.8784 0 - 1.02823
> 36 16 365 349 38.7707 0 - 1.02823
> 37 16 366 350 37.8309 0.4 17.1657 1.07434
> 38 16 386 370 38.9395 80 0.363365 1.62187
> --- > > Regards, > > Christian > > > Steve Taylor | Senior Software Engineer | StorageCraft Technology > > Corporation 380 Data Drive Suite 300 | Draper | Utah | 84020 > > Office: 801.871.2799 | Fax: 801.545.4705 > > > > If you are not the intended recipient of this message, be advised that > > any dissemination or copying of this message is prohibited. If you > > received this message erroneously, please notify the sender and delete > > it, together with any attachments. > > > > > > -Original Message- > > From: Robert LeBlanc [mailto:rob...@leblancnet.us] > > Sent: Friday, February 12, 2016 1:30 PM > > To: Nick Fisk <n...@fisk.me.uk> > > Cc: Steve Taylor <steve.tay...@storagecraft.com>; Christian Balzer > > <ch...@gol.com>; ceph-users@lists.ceph.com Subject: Re: [ceph-users] > > Reducing the impact of OSD restarts (noout ain't uptosnuff) > > > > -BEGIN PGP SIGNED MESSAGE- > > Hash: SHA256 > > > > What I've seen is that when an OSD starts up in a busy cluster, as soon > > as it is "in" (could be "out" before) it starts getting client traffic. > > However, it has to be "in" to start catching up and peering to the other > > OSDs in the cluster. The OSD is not ready to service requests for that > > PG yet, but it has the OP queued until it is ready. On a busy cluster it > > can take an OSD a long time to become ready especially if it is > > servicing client requests at the same time. 
> > > > If someone isn't able to look into the code to resolve this by the time > > I'm finished with the queue optimizations I'm doing (hopefully in a week > > or two), I plan on looking into this to see if there is something that > > can be done to prevent the OPs from being accepted until the OSD is > > ready for them. > > - ---- > > Robert LeBlanc > > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > > > > > On Fri, Feb 12, 2016 at 9:42 AM, Nick Fisk wrote: > > > I wonder if Christian is hitting some performance issue when the OSD > > > or number of OSD's all start up at once? Or maybe the OSD is still > > > doing some internal startup procedure and when the IO hits it on a > > > very busy cluster, it causes it to become overloaded for a few seconds? > > > > > > I've seen similar things in the past where if I did not have enough > > > min free KB's configured, PG's would take a long time to peer/activate > > > and cause slow ops. > > > > > >> -Original Message- > > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > > >> Of Steve Taylor > > >> Sent: 12 February 2016 16:32 > > >> To: Nick Fisk ; 'Christian Balzer' ; ceph- us...@lists.ceph.com > > >> Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout > > >> ain't > > >> uptosnuff) > > >> > > >> Nick is right. Setting noout is the right move in this scenario. > > > Restarting an > > >> OSD shouldn't block I/O unless nodown is also set, however. The > > >> exception to this would be a case where min_size can't be achieved > > >> because of the down OSD, i.e. min_size=3 and 1 of 3 OSDs is > > >> restarting. That would > > > certainly > > >> block writes. Otherwise the cluster will recognize down OSDs as down > > >> (without nodown set), redirect I/O requests to OSDs that are up, and > > > backfill > > >> as necessary when things are back to normal. 
> > >> > > >> You can set min_size to something lower if you don't have enough OSDs > > >> to allow you to restart one without blocking writes. If this isn't &
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
to Hammer first to do that. For the oldtimers here: You're in a twisty maze of passages, all alike. Again, I was watching this with atop running on all nodes, only the backing storage of the restarting OSD was busy to the point of melting, the cluster should have had enough IOPS resources to still hold up on just 3 cylinders, but the other OSDs were never getting to see any of that traffic. IOPS dropped 80-90 percent until the OSD was fully back "in". I could more or less replicate this with my cruddy test cluster, 4 nodes, 4 HDDs (no SSDs), 1Gb links, totally anemic in the CPU and RAM department. This one is already running Hammer. When running "rados -p rbd bench 30 write -b 4096" and stopping one OSD in the middle of things and starting it 10 seconds later I got this cheerful log entry:
---
2016-02-13 12:22:42.114247 osd.20 203.216.0.83:6807/26301 239 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.986682 secs
2016-02-13 12:22:42.114270 osd.20 203.216.0.83:6807/26301 240 : cluster [WRN] slow request 30.986682 seconds old, received at 2016-02-13 12:22:11.127490: osd_op(client.28601805.0:114 benchmark_data_engtest03_23598_object113 [write 0~4194304] 8.43856a6f ack+ondisk+write+known_if_redirected e11563) currently reached_pg
2016-02-13 12:22:44.116518 osd.20 203.216.0.83:6807/26301 241 : cluster [WRN] 2 slow requests, 1 included below; oldest blocked for > 32.988940 secs
2016-02-13 12:22:44.116533 osd.20 203.216.0.83:6807/26301 242 : cluster [WRN] slow request 30.254598 seconds old, received at 2016-02-13 12:22:13.861832: osd_op(client.28601805.0:128 benchmark_data_engtest03_23598_object127 [write 0~4194304] 8.ddc143c ack+ondisk+write+known_if_redirected e11563) currently reached_pg
2016-02-13 12:23:01.768430 osd.22 203.216.0.83:6812/27118 203 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.158289 secs
2016-02-13 12:23:01.768458 osd.22 203.216.0.83:6812/27118 204 : cluster [WRN] slow request 30.158289 seconds old, received at 2016-02-13 12:22:31.610001: osd_op(client.28601805.0:191 benchmark_data_engtest03_23598_object190 [write 0~4194304] 8.cafe2d76 ack+ondisk+write+known_if_redirected e11563) currently reached_pg
---
And when doing it with the default 4M blocks the bench output looks like this, spot when I stopped/started the OSD:
---
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 311 295 58.9881 44 0.542233 1.03916
21 16 324 308 58.6546 52 0.563641 1.04909
22 16 338 322 58.5335 56 1.35033 1.05188
23 16 346 330 57.3796 32 1.01166 1.04428
24 16 355 339 56.4886 36 0.66026 1.04332
25 16 361 345 55.1888 24 0.503717 1.03625
26 16 364 348 53.5277 12 0.297751 1.02988
27 16 365 349 51.6934 4 0.456629 1.02823
28 16 365 349 49.8472 0 - 1.02823
29 16 365 349 48.1285 0 - 1.02823
30 16 365 349 46.5243 0 - 1.02823
31 16 365 349 45.0236 0 - 1.02823
32 16 365 349 43.6167 0 - 1.02823
33 16 365 349 42.2951 0 - 1.02823
34 16 365 349 41.0512 0 - 1.02823
35 16 365 349 39.8784 0 - 1.02823
36 16 365 349 38.7707 0 - 1.02823
37 16 366 350 37.8309 0.4 17.1657 1.07434
38 16 386 370 38.9395 80 0.363365 1.62187
---
Regards, Christian > Steve Taylor | Senior Software Engineer | StorageCraft Technology > Corporation 380 Data Drive Suite 300 | Draper | Utah | 84020 > Office: 801.871.2799 | Fax: 801.545.4705 > > If you are not the intended recipient of this message, be advised that > any dissemination or copying of this message is prohibited. If you > received this message erroneously, please notify the sender and delete > it, together with any attachments. 
> > > -Original Message- > From: Robert LeBlanc [mailto:rob...@leblancnet.us] > Sent: Friday, February 12, 2016 1:30 PM > To: Nick Fisk <n...@fisk.me.uk> > Cc: Steve Taylor <steve.tay...@storagecraft.com>; Christian Balzer > <ch...@gol.com>; ceph-users@lists.ceph.com Subject: Re: [ceph-users] > Reducing the impact of OSD restarts (noout ain't uptosnuff) > > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > What I've seen is that when an OSD starts up in a busy cluster, as soon > as it is "in" (could be "out" before) it starts getting client
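Christian's reproduction above is easy to script. A sketch under the thread's parameters (rbd pool, 30-second small-block bench, OSD stopped for ~10 seconds mid-run); osd.2 and the systemd unit name are assumptions:

```shell
# Kick off the same small-block benchmark used in the thread...
rados -p rbd bench 30 write -b 4096 &
bench_pid=$!

# ...and bounce one OSD in the middle of it, mirroring the test above.
sleep 10
systemctl stop ceph-osd@2
sleep 10
systemctl start ceph-osd@2

wait "$bench_pid"

# Any resulting slow/blocked requests show up in the health output.
ceph health detail | grep -i 'blocked\|slow'
```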
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
What I've seen is that when an OSD starts up in a busy cluster, as soon as it is "in" (could be "out" before) it starts getting client traffic. However, it has to be "in" to start catching up and peering to the other OSDs in the cluster. The OSD is not ready to service requests for that PG yet, but it has the OP queued until it is ready. On a busy cluster it can take an OSD a long time to become ready, especially if it is servicing client requests at the same time. If someone isn't able to look into the code to resolve this by the time I'm finished with the queue optimizations I'm doing (hopefully in a week or two), I plan on looking into this to see if there is something that can be done to prevent the OPs from being accepted until the OSD is ready for them. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Fri, Feb 12, 2016 at 9:42 AM, Nick Fisk wrote: > I wonder if Christian is hitting some performance issue when the OSD or > number of OSD's all start up at once? Or maybe the OSD is still doing some > internal startup procedure and when the IO hits it on a very busy cluster, > it causes it to become overloaded for a few seconds? > > I've seen similar things in the past where if I did not have enough min free > KB's configured, PG's would take a long time to peer/activate and cause slow > ops. > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Steve Taylor >> Sent: 12 February 2016 16:32 >> To: Nick Fisk ; 'Christian Balzer' ; ceph- >> us...@lists.ceph.com >> Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't >> uptosnuff) >> >> Nick is right. Setting noout is the right move in this scenario. > Restarting an >> OSD shouldn't block I/O unless nodown is also set, however. The exception >> to this would be a case where min_size can't be achieved because of the >> down OSD, i.e. min_size=3 and 1 of 3 OSDs is restarting. 
That would > certainly >> block writes. Otherwise the cluster will recognize down OSDs as down >> (without nodown set), redirect I/O requests to OSDs that are up, and > backfill >> as necessary when things are back to normal. >> >> You can set min_size to something lower if you don't have enough OSDs to >> allow you to restart one without blocking writes. If this isn't the case, >> something deeper is going on with your cluster. You shouldn't get slow >> requests due to restarting a single OSD with only noout set and idle disks > on >> the remaining OSDs. I've done this many, many times. >> >> Steve Taylor | Senior Software Engineer | StorageCraft Technology >> Corporation >> 380 Data Drive Suite 300 | Draper | Utah | 84020 >> Office: 801.871.2799 | Fax: 801.545.4705 >> >> If you are not the intended recipient of this message, be advised that any >> dissemination or copying of this message is prohibited. >> If you received this message erroneously, please notify the sender and >> delete it, together with any attachments. 
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Nick Fisk
>> Sent: Friday, February 12, 2016 9:07 AM
>> To: 'Christian Balzer' ; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't
>> uptosnuff)
>>
>> > -----Original Message-----
>> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> > Of Christian Balzer
>> > Sent: 12 February 2016 15:38
>> > To: ceph-users@lists.ceph.com
>> > Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout
>> > ain't uptosnuff)
>> >
>> > On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote:
>> >
>> > > Hi,
>> > >
>> > > On 02/12/2016 03:47 PM, Christian Balzer wrote:
>> > > > Hello,
>> > > >
>> > > > yesterday I upgraded our most busy (in other words lethally
>> > > > overloaded) production cluster to the latest Firefly in
>> > > > preparation for a Hammer upgrade and then phasing in of a cache tier.
>> > > >
>> > > > When restarting the OSDs it took 3 minutes (1 minute in a
>> > > > consecutive repeat to test the impact of primed caches) during
>> > > > which the cluster crawled to a near stand-still and the dreaded
>> > > > slow requests piled up, causing applications in the VMs to fail.
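For reference, the noout restart procedure being discussed in this thread usually looks like the following. This is a sketch against a live cluster (osd.2 is borrowed from Christian's log excerpt as an example; the rbd pool name and systemd unit are assumptions that depend on your release and deployment), so it is not runnable outside a cluster:

```shell
# Stop CRUSH from marking the OSD "out" (and triggering backfill) during the restart
ceph osd set noout

# Optional: if the pool runs size=3/min_size=3, lower min_size first so writes
# to PGs on the restarting OSD are not blocked (weigh this against durability)
# ceph osd pool set rbd min_size 2

systemctl restart ceph-osd@2      # pre-systemd releases: service ceph restart osd.2

ceph -s                           # wait for all PGs to return to active+clean

ceph osd unset noout              # re-enable normal out-marking
```

As the thread notes, noout avoids backfill but does not by itself prevent peering load or blocked ops on an overloaded cluster.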
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
I could be wrong, but I didn't think a PG would have to peer when an OSD is restarted with noout set. If I'm wrong, then this peering would definitely block I/O. I just did a quick test on a non-busy cluster and didn't see any peering when my OSD went down or came back up, but I'm not sure how good a test that is. The OSD should also stay "in" throughout the restart with noout set, so it wouldn't have been "out" beforehand to cause peering when it came back "in."

I do know that OSDs don't mark themselves "up" until they're caught up on OSD maps. They won't accept any op requests until they're "up," so they shouldn't have any catching up to do by the time they start taking op requests. In theory they're ready to handle I/O by the time they start handling I/O. At least that's my understanding.

It would be interesting to see what this cluster looks like as far as OSD count, journal configuration, network, CPU, RAM, etc. Something is obviously amiss. Even in a semi-decent configuration one should be able to restart a single OSD with noout set under light load without causing blocked op requests.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705
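The fix Robert proposes above — hold client ops until the OSD has finished catching up on maps and peering, then drain them — can be sketched as a simple gate. This is purely illustrative; none of these names exist in the Ceph code:

```python
from collections import deque


class OpGate:
    """Queue incoming client ops while the OSD catches up; drain once ready."""

    def __init__(self):
        self.ready = False
        self.pending = deque()   # ops held while the OSD is not yet ready
        self.handled = []        # ops actually serviced, in order

    def submit(self, op):
        if self.ready:
            self.handled.append(op)   # OSD is up and caught up: service now
        else:
            self.pending.append(op)   # hold until map catch-up/peering is done

    def mark_ready(self):
        """Called once the OSD is 'up' and caught up; drain held ops in order."""
        self.ready = True
        while self.pending:
            self.handled.append(self.pending.popleft())


gate = OpGate()
gate.submit("write pg 1.a")   # arrives while the OSD is still catching up
gate.submit("read pg 1.b")
gate.mark_ready()             # catch-up complete; queued ops drain in order
assert gate.handled == ["write pg 1.a", "read pg 1.b"]
```

The point of the gate is that clients see latency rather than failed or blocked ops: nothing is accepted for service before the OSD can actually complete it, which matches Steve's description of how "up" is supposed to behave.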