On 14/10/2014 18:51, Lionel Bouton wrote:
> On 14/10/2014 18:17, Gregory Farnum wrote:
>> On Monday, October 13, 2014, Lionel Bouton <lionel+c...@bouton.name> wrote:
>>
>> [...]
>>
>> What could explain such long startup times? Is the OSD init doing a lot
>> of random disk accesses? Is it dependent on the volume of data or the
>> history of the OSD (fragmentation?)? Maybe Btrfs on 3.12.21 has known
>> performance problems or suboptimal autodefrag (on 3.17.0 with 1/3 the
>> data and a similar history of disk accesses we have 1/10 the init time
>> when the disks are idle in both cases)?
>>
>>
>> Something like this is my guess; we've historically seen btrfs
>> performance rapidly degrade under our workloads. And I imagine that
>> your single-disk OSDs are only seeing 100 or so PGs each?
>
> Yes.
In fact ~200 instead of ~600 on the big nodes in the current
configuration, so it's in the same ballpark as your estimate.

>
>> You could perhaps turn up OSD and FileStore debugging on one of your
>> big nodes and one of the little ones and do a restart and compare the
>> syscall wait times between them to check.
>> -Greg

In the logs the big OSD is osd.0, the small one osd.2. I'll call them
BIG and SMALL in the following. The results are interesting.

First, BIG takes ~8 seconds just to roll back to the Btrfs snapshot:

2014-10-14 19:04:56.290936 7f8b84516780 10 filestore(/var/lib/ceph/osd/ceph-0) most recent snap from <23107688,23107701> is 23107701
2014-10-14 19:04:56.290949 7f8b84516780 10 filestore(/var/lib/ceph/osd/ceph-0) mount rolling back to consistent snap 23107701
2014-10-14 19:04:56.290955 7f8b84516780 10 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) rollback_to: to 'snap_23107701'
2014-10-14 19:05:04.379241 7f8b84516780 5 filestore(/var/lib/ceph/osd/ceph-0) mount op_seq is 23107701

This takes less than 1 second on SMALL:

2014-10-14 19:02:19.596158 7fd0a0d2b780 10 filestore(/var/lib/ceph/osd/ceph-2) most recent snap from <1602633,1602645> is 1602645
2014-10-14 19:02:19.596176 7fd0a0d2b780 10 filestore(/var/lib/ceph/osd/ceph-2) mount rolling back to consistent snap 1602645
2014-10-14 19:02:19.596182 7fd0a0d2b780 10 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) rollback_to: to 'snap_1602645'
2014-10-14 19:02:20.311178 7fd0a0d2b780 5 filestore(/var/lib/ceph/osd/ceph-2) mount op_seq is 1602645

I assume this is the time Btrfs itself takes to mount the snapshot.

This behavior (slow Btrfs snapshot operations) seems to repeat itself:
the first checkpoint creation (which seems to involve a Btrfs snapshot)
takes more than 15 seconds on BIG:

2014-10-14 19:05:23.135960 7f8b73fff700 10 filestore(/var/lib/ceph/osd/ceph-0) sync_entry commit took 15.503795, interval was 15.866245

and less than 0.3 seconds on SMALL:

2014-10-14 19:02:21.889467 7fd094be4700 10 filestore(/var/lib/ceph/osd/ceph-2) sync_entry commit took 0.135268, interval was 0.276440

Subsequent checkpoints are much faster on BIG, though. The next one:

2014-10-14 19:05:28.426263 7f8b73fff700 10 filestore(/var/lib/ceph/osd/ceph-0) sync_entry commit took 0.969734, interval was 0.979822

and they seem to converge towards ~0.25s later. SMALL seems to converge
towards ~0.07s (it holds ~1/3 the data, and probably ~1/3 the internal
data structures, so if Btrfs snapshots are O(n) operations this could be
normal).

I couldn't find other significant differences: the other phases I
identified in the OSD init process took more time on BIG, but never more
than ~3x compared to SMALL. It seems most of the time is spent creating
or accessing snapshots.

My current best guess is that Btrfs snapshot operations saw significant
speedups between 3.12.21 and 3.17.0, and that OSD init is
checkpoint(/snapshot) intensive, which accounts for most of the slow
startup.

Current plan: wait at least a week to study 3.17.0's behavior, and
upgrade the 3.12.21 nodes to 3.17.0 if all goes well.

Best regards,

Lionel Bouton
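PS: for anyone who wants to repeat the comparison on their own nodes,
here is a rough sketch (untested; the default log path is an assumption,
and the message format is just what appears in the excerpts above) that
pulls the "sync_entry commit took" durations out of an OSD log written
with debug filestore = 10:

#!/usr/bin/env python3
# Rough sketch (untested): extract FileStore commit durations from an
# OSD log to watch how commit times evolve after startup.
# Assumption: default log path below; pass your own path as argv[1].
import re
import sys

PATTERN = re.compile(r"sync_entry commit took ([0-9.]+), interval was ([0-9.]+)")

path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/ceph/ceph-osd.0.log"
commits = []
with open(path) as log:
    for line in log:
        m = PATTERN.search(line)
        if m:
            commits.append(float(m.group(1)))

for i, took in enumerate(commits):
    print(f"commit {i}: {took:.3f}s")
if commits:
    print(f"min {min(commits):.3f}s / max {max(commits):.3f}s "
          f"over {len(commits)} commits")

Running it on BIG's and SMALL's logs after a restart should show the
first-commit spike and the convergence described above.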
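PPS: before upgrading, one way to check that the snapshot operations
themselves are the bottleneck (rather than something inside the OSD)
would be to time a few snapshot create/delete cycles directly on a
scratch subvolume of the same filesystem, on both kernels. A minimal
sketch, untested, with placeholder paths; it needs root and btrfs-progs,
and it should never be pointed at the live OSD data. Note that
"btrfs subvolume delete" may return before the cleanup is committed, so
only the snapshot-creation timing is really meaningful here:

#!/usr/bin/env python3
# Rough sketch (untested): time btrfs snapshot create/delete cycles on a
# throwaway subvolume to compare 3.12.21 vs 3.17.0 outside of Ceph.
# Both paths are placeholders -- use a scratch subvolume, NOT the OSD data.
import subprocess
import time

SUBVOL = "/mnt/btrfs-scratch/subvol"   # existing scratch subvolume (placeholder)
SNAP = "/mnt/btrfs-scratch/snap_test"  # snapshot target (placeholder)

def run(*cmd):
    subprocess.check_call(cmd, stdout=subprocess.DEVNULL)

for i in range(10):
    t0 = time.monotonic()
    run("btrfs", "subvolume", "snapshot", SUBVOL, SNAP)
    t1 = time.monotonic()
    run("btrfs", "subvolume", "delete", SNAP)
    t2 = time.monotonic()
    print(f"cycle {i}: snapshot {t1 - t0:.3f}s, delete {t2 - t1:.3f}s")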