Re: [ceph-users] Ceph OSD very slow startup
Hi,

More information on our Btrfs tests.

On 14/10/2014 19:53, Lionel Bouton wrote:
> Current plan: wait at least a week to study 3.17.0 behavior and
> upgrade the 3.12.21 nodes to 3.17.0 if all goes well.

3.17.0 and 3.17.1 have a bug which remounts Btrfs filesystems read-only (no corruption, but the OSD goes down) on some access patterns with snapshots:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36483.html

The bug may be present in earlier kernels (at least the 3.16.4 code in fs/btrfs/qgroup.c doesn't handle the case differently from 3.17.0 and 3.17.1) but seems less likely to show up: I never saw it with 3.16.4 in several weeks, but it happened with 3.17.1 three times in just a few hours. As far as I can tell from its changelog, 3.17.1 didn't patch any vfs/btrfs path relative to 3.17.0, so I assume 3.17.0 has the same behaviour. I switched all servers to 3.16.4, which I had previously tested without any problem.

The performance problem is still there with 3.16.4. In fact one of the 2 large OSDs was so slow that it was repeatedly marked out and generated lots of latencies when in. I just had to remove it: when this OSD is shut down with noout set (to avoid backfills slowing down the storage network), latencies are back to normal. I chose to reformat this one with XFS.

The other big node has a nearly identical system (same hardware, same software configuration, same logical volume configuration, same weight in the crush map, comparable disk usage in the OSD filesystem, ...) but is behaving itself (maybe slower than our smaller XFS and Btrfs OSDs, but usable). The only notable difference is that it was formatted more recently. So the performance problem might be linked to the cumulative amount of data accessed on the OSD over time. If my suspicion is right, I believe we might see performance problems on the other Btrfs OSDs later (we'll have to wait).

Is any Btrfs developer subscribed to this list? I could forward this information to linux-btrfs@vger if needed, but I can't offer much debugging help (the storage cluster is in production and I'm more inclined to migrate slow OSDs to XFS than to do invasive debugging with Btrfs).

Best regards,

Lionel Bouton
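PS: in case someone wants to do the same Btrfs-to-XFS migration, here is the rough shape of it. This is a sketch, not our exact procedure; the osd id, device and paths are illustrative, and the exact steps depend on how the OSD was deployed:

    ceph osd set noout                 # keep the cluster from rebalancing while the OSD is down
    service ceph stop osd.0
    mkfs.xfs -f /dev/sdb1              # wipe the Btrfs data partition, reformat as XFS
    mount /dev/sdb1 /var/lib/ceph/osd/ceph-0
    ceph-osd -i 0 --mkfs --mkjournal   # recreate the OSD data directory and journal in place
    # if the OSD keyring lived on the wiped partition, recreate and re-register
    # it (ceph-osd --mkkey, then ceph auth add) before starting the daemon
    service ceph start osd.0           # the OSD rejoins empty and backfills
    ceph osd unset noout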
Re: [ceph-users] Ceph OSD very slow startup
On Mon, Oct 20, 2014 at 8:25 AM, Lionel Bouton lionel+c...@bouton.name wrote:
> Hi,
>
> More information on our Btrfs tests.
>
> [...]
>
> The other big node has a nearly identical system (same hardware, same
> software configuration, same logical volume configuration, same weight
> in the crush map, comparable disk usage in the OSD filesystem, ...) but
> is behaving itself (maybe slower than our smaller XFS and Btrfs OSDs,
> but usable). The only notable difference is that it was formatted more
> recently. So the performance problem might be linked to the cumulative
> amount of data accessed on the OSD over time.

Yeah; we've seen this before and it appears to be related to our aggressive use of btrfs snapshots; it seems that btrfs doesn't defrag well under our use case. The btrfs developers make sporadic concerted efforts to improve things (and succeed!), but it apparently still hasn't gotten enough better yet. :(
-Greg
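PS: a quick (rough) way to eyeball fragmentation on an OSD data directory; paths are illustrative, and note that with compress=lzo filefrag over-counts extents, since each ~128K compressed chunk is reported separately:

    # list the most fragmented large files under the OSD's object store
    find /var/lib/ceph/osd/ceph-0/current -type f -size +4M -exec filefrag {} + | sort -t: -k2 -rn | head -20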
Re: [ceph-users] Ceph OSD very slow startup
On Monday, October 13, 2014, Lionel Bouton lionel+c...@bouton.name wrote:

Hi,

# First a short description of our Ceph setup

You can skip to the next section (Main questions) to save time and come back to this one if you need more context.

We are currently moving away from DRBD-based storage backed by RAID arrays to Ceph for some of our VMs. Our focus is on resiliency and capacity (one VM was outgrowing the largest RAID10 we had), not maximum performance (at least not yet).

Our Ceph OSDs are fairly unbalanced: 2 of them are on 2 historic hosts, each with 4 disks in a hardware RAID10 configuration and no room for new disks in the chassis. The 12 other OSDs are on 2 new systems with 6 disk drives each, one OSD per drive (CPU and RAM configurations are nearly identical on the 4 hosts).

All hosts run VMs too; we took some precautions to avoid too much interference, as each host has CPU and RAM to spare for the OSDs. CPU usage bursts on occasion, but as we only have one or two VMs on each host they can't starve the OSDs, which have between 2 and 8 full-fledged cores (4 to 16 hardware threads) available depending on the current load. We have at least 4GB of free RAM per OSD on each host at all times (including room for at least a 4GB OS cache).

To sum up, we have a total of 14 OSDs; the 2 largest ones on RAID10 are clearly our current bottleneck. That said, until we have additional hardware they allow us to maintain availability even if 2 servers are down (default crushmap, with pools configured with 3 replicas on 3 different hosts), and performance is acceptable (backfilling/scrubbing/... pgs required some tuning though, and I'm eagerly waiting for 0.80.7 to begin testing the new io priority tunables).

Everything is based on 7,200 rpm SATA/SAS disk drives behind P410 RAID controllers (HP ProLiant systems) with battery-backed cache to help with write bursts. The OSDs are a mix of:
- Btrfs on 3.17.0 kernels on individual disks, 450GB used of 2TB (3.17.0 fixes a filesystem lockup we had with earlier kernels, manifesting itself with concurrent accesses to several Btrfs filesystems according to recent lkml posts),
- Btrfs on 3.12.21 kernels on the 2 systems with RAID10, 1.5TB used of 3TB (no lockup on these yet, but they will migrate to 3.17.0 when we have enough experience with it),
- XFS for a minority of individual disks (with a dedicated partition for the journal). Most of them share the same history (all created at the same time); only two were created later (following Btrfs corruption and/or conversion to XFS) and are avoided when comparing behaviours.

All Btrfs volumes use these mount options:
rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery

All OSDs use a 5GB journal.

We are slowly adding monitoring to the setup to see what the benefits of Btrfs are in our case (ceph osd perf, kernel io wait per device, osd CPU usage, ...). One long-term objective is to slowly raise performance, both by migrating to/adding more suitable hardware and by tuning the software side. Detailed monitoring should help us study the behaviour of isolated OSDs with different settings, and warn us early if they cause performance problems so we can take them out with next to no impact on the whole storage network (we are strong believers in slow, incremental and continuous change, and distributed storage with redundancy makes that easy to implement).
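For reference, a sketch of what those settings correspond to concretely (device, mount point and config section are illustrative; the journal size option is expressed in MB):

    # mount command equivalent of the Btrfs options above
    mount -t btrfs -o rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery /dev/sdb1 /var/lib/ceph/osd/ceph-0

    # ceph.conf fragment for the 5GB journal
    [osd]
    osd journal size = 5120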
# Main questions

The system works well, but I just realised when restarting one of the 2 large Btrfs OSDs that it was very slow to rejoin the network (ceph osd set noout was used for the restart). I stopped the OSD init after 5 minutes to investigate what was going on and didn't find any obvious problem (filesystem sane, no swapping, no CPU hogs, concurrent IO not able to starve the system by itself, ...). The next restarts took between 43s (nearly no concurrent disk access, and warm caches after an earlier restart without unmounting the filesystem) and 3m57s (one VM still on DRBD doing ~30 IO/s on the same volume, and cold caches after a filesystem mount).

It seems the startup time is getting longer on the 2 large Btrfs filesystems (the other one gives similar results: 3m48s on the first try, for example). I noticed it was a bit slow a week ago, but not as much (there was ~half as much data on them at the time). OSDs on individual disks don't exhibit this problem (with warm caches, init finishes in ~4s on the small Btrfs volumes, ~3s on the XFS volumes), but they are on dedicated disks with less data.

With warm caches, most of the time is spent between these log lines in /var/log/ceph/ceph-osd.n.log (m is ~650 for both OSDs):

    osd.n osdmap load_pgs
    osd.n osdmap load_pgs opened m pgs

So it seems most of the time is spent opening pgs. What could explain such long startup times? Is the OSD init doing a lot of random disk accesses? Is it dependent on the volume of data or the history of the OSD (fragmentation?)? Maybe Btrfs on 3.12.21 has known performance problems or suboptimal autodefrag (on 3.17.0 with 1/3 the data and a similar history of disk accesses we have 1/10 the init time when the disks are in both cases idle)?
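(For reference, I measure this gap directly from the log timestamps. A sketch, assuming osd.0 and the default log path:

    ceph osd set noout              # avoid rebalancing during the restart
    service ceph restart osd.0
    grep 'load_pgs' /var/log/ceph/ceph-osd.0.log | tail -2    # compare the two timestamps
    ceph osd unset noout

)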
Re: [ceph-users] Ceph OSD very slow startup
On 14/10/2014 18:17, Gregory Farnum wrote:
> On Monday, October 13, 2014, Lionel Bouton lionel+c...@bouton.name wrote:
>> [...]
>> What could explain such long startup times? Is the OSD init doing a
>> lot of random disk accesses? Is it dependent on the volume of data or
>> the history of the OSD (fragmentation?)? Maybe Btrfs on 3.12.21 has
>> known performance problems or suboptimal autodefrag (on 3.17.0 with
>> 1/3 the data and a similar history of disk accesses we have 1/10 the
>> init time when the disks are in both cases idle)?
>
> Something like this is my guess; we've historically seen btrfs
> performance rapidly degrade under our workloads. And I imagine that
> your single-disk OSDs are only seeing 100 or so PGs each?

Yes.

> You could perhaps turn up OSD and FileStore debugging on one of your
> big nodes and one of the little ones and do a restart and compare the
> syscall wait times between them to check.
> -Greg

Will do (I have to look up the docs first). Thanks for the suggestions.

Lionel
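PS: for the record, a sketch of what I plan to try, assuming the usual debug options; corrections welcome:

    # in ceph.conf on the nodes under test -- injectargs only reaches a
    # running daemon, so debugging OSD init has to come from the config file
    [osd]
    debug osd = 20
    debug filestore = 20

Then restart the OSDs and compare /var/log/ceph/ceph-osd.<id>.log between the big and small nodes.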
Re: [ceph-users] Ceph OSD very slow startup
On 14/10/2014 18:51, Lionel Bouton wrote:
> On 14/10/2014 18:17, Gregory Farnum wrote:
>> On Monday, October 13, 2014, Lionel Bouton lionel+c...@bouton.name wrote:
>>> [...]
>>> What could explain such long startup times? Is the OSD init doing a
>>> lot of random disk accesses? Is it dependent on the volume of data
>>> or the history of the OSD (fragmentation?)? Maybe Btrfs on 3.12.21
>>> has known performance problems or suboptimal autodefrag (on 3.17.0
>>> with 1/3 the data and a similar history of disk accesses we have
>>> 1/10 the init time when the disks are in both cases idle)?
>>
>> Something like this is my guess; we've historically seen btrfs
>> performance rapidly degrade under our workloads. And I imagine that
>> your single-disk OSDs are only seeing 100 or so PGs each?
>
> Yes.

In fact ~200 instead of ~600 on the big nodes in the current configuration, so it's in the same ballpark as your estimate.

>> You could perhaps turn up OSD and FileStore debugging on one of your
>> big nodes and one of the little ones and do a restart and compare the
>> syscall wait times between them to check.
>> -Greg

In the logs the big OSD is osd.0, the small one osd.2. I'll call them BIG and SMALL in the following. The results are interesting.

First, there are ~8 seconds on BIG just to select the Btrfs snapshot:

    2014-10-14 19:04:56.290936 7f8b84516780 10 filestore(/var/lib/ceph/osd/ceph-0) most recent snap from 23107688,23107701 is 23107701
    2014-10-14 19:04:56.290949 7f8b84516780 10 filestore(/var/lib/ceph/osd/ceph-0) mount rolling back to consistent snap 23107701
    2014-10-14 19:04:56.290955 7f8b84516780 10 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) rollback_to: to 'snap_23107701'
    2014-10-14 19:05:04.379241 7f8b84516780 5 filestore(/var/lib/ceph/osd/ceph-0) mount op_seq is 23107701

This takes less than 1s on SMALL:

    2014-10-14 19:02:19.596158 7fd0a0d2b780 10 filestore(/var/lib/ceph/osd/ceph-2) most recent snap from 1602633,1602645 is 1602645
    2014-10-14 19:02:19.596176 7fd0a0d2b780 10 filestore(/var/lib/ceph/osd/ceph-2) mount rolling back to consistent snap 1602645
    2014-10-14 19:02:19.596182 7fd0a0d2b780 10 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) rollback_to: to 'snap_1602645'
    2014-10-14 19:02:20.311178 7fd0a0d2b780 5 filestore(/var/lib/ceph/osd/ceph-2) mount op_seq is 1602645

I assume this is the time Btrfs itself takes to mount the snapshot. This behaviour (slow Btrfs snapshot operations) repeats itself: the first checkpoint creation (which seems to involve a Btrfs snapshot) takes more than 15s on BIG:

    2014-10-14 19:05:23.135960 7f8b73fff700 10 filestore(/var/lib/ceph/osd/ceph-0) sync_entry commit took 15.503795, interval was 15.866245

and less than 0.3s on SMALL:

    2014-10-14 19:02:21.889467 7fd094be4700 10 filestore(/var/lib/ceph/osd/ceph-2) sync_entry commit took 0.135268, interval was 0.276440

The following checkpoints are much faster on BIG though. The next one:

    2014-10-14 19:05:28.426263 7f8b73fff700 10 filestore(/var/lib/ceph/osd/ceph-0) sync_entry commit took 0.969734, interval was 0.979822

and they seem to converge towards ~0.25s later on. SMALL seems to converge towards ~0.07s (it holds ~1/3 the data, and probably ~1/3 the data structures, so if Btrfs snapshots are O(n) operations this could be normal).

I couldn't find other significant differences: the other phases I identified in the OSD init process took more time on BIG, but never more than ~3x SMALL. It seems most of the time is spent doing or accessing snapshots.
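(To track how the commit times evolve across runs, the durations can be extracted from the logs directly; a sketch, assuming the default log naming for osd.0:

    grep 'sync_entry commit took' /var/log/ceph/ceph-osd.0.log | sed 's/.*commit took \([0-9.]*\),.*/\1/'

)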
My best guess currently is that Btrfs snapshot operations may have seen significant speedups between 3.12.21 and 3.17.0, and that OSD init is checkpoint(/snapshot) intensive, which would account for most of the slow startup.

Current plan: wait at least a week to study 3.17.0 behavior and upgrade the 3.12.21 nodes to 3.17.0 if all goes well.

Best regards,

Lionel Bouton
Re: [ceph-users] Ceph OSD very slow startup
On 14/10/2014 01:28, Lionel Bouton wrote:
> Hi,
>
> # First a short description of our Ceph setup
>
> You can skip to the next section (Main questions) to save time and
> come back to this one if you need more context.

An important piece of information was missing: this is Ceph 0.80.5 (guessable, since I stated I was waiting for 0.80.7, but that required some digging...).
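PS: for future readers, the running version can be checked directly:

    ceph -v                   # version of the locally installed binaries
    ceph tell osd.0 version   # what a given OSD daemon is actually running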