Re: [ceph-users] Ceph OSD very slow startup

2014-10-20 Thread Lionel Bouton
Hi,

More information on our Btrfs tests.

On 14/10/2014 19:53, Lionel Bouton wrote:


 Current plan: wait at least a week to study 3.17.0 behavior and
 upgrade the 3.12.21 nodes to 3.17.0 if all goes well.


3.17.0 and 3.17.1 have a bug which remounts Btrfs filesystems read-only
(no corruption but OSD goes down) on some access patterns with snapshots:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36483.html
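
For anyone who wants to spot this quickly, a minimal check sketch (the paths are
the usual defaults and the dmesg wording varies by kernel version, so adjust as
needed):

  for osd in /var/lib/ceph/osd/ceph-*; do
      dev=$(df -P "$osd" | awk 'NR==2 {print $1}')
      # look for the 'ro' flag in the mount options of that device
      if awk -v d="$dev" '$1 == d && $4 ~ /(^|,)ro(,|$)/ { found=1 } END { exit !found }' /proc/mounts; then
          echo "WARNING: $osd ($dev) is mounted read-only"
      fi
  done
  dmesg | grep -iE 'btrfs.*(forced readonly|read-only)'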

The bug may be present in earlier kernels (at least the 3.16.4 code in
fs/btrfs/qgroup.c doesn't handle the case differently from 3.17.0 and
3.17.1) but seems less likely to show up there (we never saw it with
3.16.4 in several weeks, while it happened three times with 3.17.1 in
just a few hours). As far as I can tell from its changelog, 3.17.1
didn't patch any VFS/Btrfs code path compared to 3.17.0, so I assume
3.17.0 has the same behaviour.

I switched all servers to 3.16.4 which I had previously tested without
any problem.

The performance problem is still there with 3.16.4. In fact one of the 2
large OSDs was so slow that it was repeatedly marked out and generated
high latencies whenever it was in. I just had to remove it: with this OSD
shut down (and noout set to avoid backfills slowing down the storage
network), latencies are back to normal. I chose to reformat this one with XFS.
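
For reference, a sketch of the kind of procedure involved, under Firefly (osd.0
and the init-script invocation are examples, not the exact commands run here):

  ceph osd set noout                 # avoid backfills while the OSD is only temporarily down
  /etc/init.d/ceph stop osd.0        # latencies go back to normal with the OSD down
  # once the decision to drop it is final, remove it and let the cluster rebalance:
  ceph osd out 0
  ceph osd crush remove osd.0
  ceph auth del osd.0
  ceph osd rm 0
  ceph osd unset noout
  # then reformat the underlying volume with XFS and re-create the OSD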

The other big node has a nearly identical system (same hardware, same
software configuration, same logical volume configuration, same weight in
the crush map, comparable disk usage in the OSD fs, ...) but is behaving
itself (maybe slower than our smaller XFS and Btrfs OSDs, but usable). The
only notable difference is that it was formatted more recently. So the
performance problem might be linked to the cumulative amount of data
accessed on the OSD over time. If my suspicion is correct, we might see
performance problems on the other Btrfs OSDs later (we'll have to wait).

Is any Btrfs developer subscribed to this list? I could forward this
information to linux-btrfs@vger if needed, but I can't offer much
debugging help (the storage cluster is in production and I'm more
inclined to migrate slow OSDs to XFS than to do invasive debugging with
Btrfs).

Best regards,

Lionel Bouton


Re: [ceph-users] Ceph OSD very slow startup

2014-10-20 Thread Gregory Farnum
On Mon, Oct 20, 2014 at 8:25 AM, Lionel Bouton lionel+c...@bouton.name wrote:
 Hi,

 More information on our Btrfs tests.

 On 14/10/2014 19:53, Lionel Bouton wrote:



 Current plan: wait at least a week to study 3.17.0 behavior and upgrade the
 3.12.21 nodes to 3.17.0 if all goes well.


 3.17.0 and 3.17.1 have a bug which remounts Btrfs filesystems read-only (no
 corruption but OSD goes down) on some access patterns with snapshots:
 https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36483.html

 The bug may be present in earlier kernels (at least the 3.16.4 code in
 fs/btrfs/qgroup.c doesn't handle the case differently from 3.17.0 and
 3.17.1) but seems less likely to show up there (we never saw it with 3.16.4
 in several weeks, while it happened three times with 3.17.1 in just a few
 hours). As far as I can tell from its changelog, 3.17.1 didn't patch any
 VFS/Btrfs code path compared to 3.17.0, so I assume 3.17.0 has the same behaviour.

 I switched all servers to 3.16.4 which I had previously tested without any
 problem.

 The performance problem is still there with 3.16.4. In fact one of the 2
 large OSDs was so slow that it was repeatedly marked out and generated high
 latencies whenever it was in. I just had to remove it: with this OSD shut down
 (and noout set to avoid backfills slowing down the storage network), latencies
 are back to normal. I chose to reformat this one with XFS.

 The other big node has a nearly identical system (same hardware,
 same software configuration, same logical volume configuration, same weight
 in the crush map, comparable disk usage in the OSD fs, ...) but is behaving
 itself (maybe slower than our smaller XFS and Btrfs OSDs, but usable). The
 only notable difference is that it was formatted more recently. So the
 performance problem might be linked to the cumulative amount of data
 accessed on the OSD over time.

Yeah; we've seen this before and it appears to be related to our
aggressive use of btrfs snapshots; it seems that btrfs doesn't defrag
well under our use case. The btrfs developers make sporadic concerted
efforts to improve things (and succeed!), but apparently it still
hasn't improved enough yet. :(
-Greg
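
A quick way to get a feel for that fragmentation on a filestore data directory
(a sketch; the path is an example, and with compress=lzo filefrag's extent
counts are inflated, so treat the numbers as relative):

  find /var/lib/ceph/osd/ceph-0/current -type f | head -n 1000 \
      | xargs -r filefrag 2>/dev/null \
      | awk -F: '{ gsub(/ extents? found/, "", $2); sum += $2; if ($2+0 > max) max = $2+0; n++ }
                 END { if (n) printf "files: %d  avg extents: %.1f  max: %d\n", n, sum/n, max }'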


Re: [ceph-users] Ceph OSD very slow startup

2014-10-14 Thread Gregory Farnum
On Monday, October 13, 2014, Lionel Bouton lionel+c...@bouton.name wrote:

 Hi,

 # First a short description of our Ceph setup

 You can skip to the next section (Main questions) to save time and
 come back to this one if you need more context.

 We are currently moving away from DRBD-based storage backed by RAID
 arrays to Ceph for some of our VMs. Our focus is on resiliency and
 capacity (one VM was outgrowing the largest RAID10 we had) and not on
 maximum performance (at least not yet). Our Ceph OSDs are fairly
 unbalanced: 2 of them are on 2 historic hosts, each with 4 disks in a
 hardware RAID10 configuration and no room available for new disks in
 the chassis. 12 additional OSDs are on 2 new systems with 6 disk drives
 dedicated to one OSD each (CPU and RAM configurations are nearly
 identical on the 4 hosts). All hosts are used for running VMs too, so we
 took some precautions to avoid too much interference: each host has CPU
 and RAM to spare for the OSDs. CPU usage exhibits some bursts on
 occasion, but as we only have one or two VMs on each host, they can't
 starve the OSDs, which have between 2 and 8 full-fledged cores (4 to 16
 hardware threads) available to them depending on the current load. We
 have at least 4GB of free RAM per OSD on each host at all times (including
 room for at least a 4GB OS cache).
 To sum up, we have a total of 14 OSDs; the 2 largest ones on RAID10 are
 clearly our current bottleneck. That said, until we have additional
 hardware they allow us to maintain availability even if 2 servers are
 down (default crushmap with the pool configured with 3 replicas on 3
 different hosts) and performance is acceptable (backfilling/scrubbing/...
 of pgs required some tuning though, and I'm eagerly waiting for 0.80.7 to
 begin testing the new io priority tunables).
 Everything is based on SATA/SAS 7200 rpm disk drives behind P410 RAID
 controllers (HP ProLiant systems) with battery-backed cache memory to help
 with write bursts.
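
 For reference, a hedged sketch (not part of the original setup notes): if the
 tunables in question are osd_disk_thread_ioprio_class and
 osd_disk_thread_ioprio_priority, they could be tried roughly like this once
 available; the 'idle' class only has an effect when the underlying disks use
 the CFQ io scheduler.

   ceph tell osd.\* injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 7'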

 The OSDs are a mix of:
 - Btrfs on 3.17.0 kernels on individual disks, 450GB used out of 2TB (3.17.0
 fixes a filesystem lockup we had with earlier kernels, manifesting itself
 with concurrent accesses to several Btrfs filesystems according to
 recent lkml posts),
 - Btrfs on 3.12.21 kernels on the 2 systems with RAID10, 1.5TB used out of
 3TB (no lockup on these yet, but they will migrate to 3.17.0 when we
 have enough experience with it),
 - XFS for a minority of individual disks (with a dedicated partition for
 the journal).
 Most of them have the same history (all were created at the same time);
 only two of them were created later (following Btrfs corruption
 and/or conversion to XFS) and are excluded when comparing behaviours.

 All Btrfs volumes use these mount options:
 rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery
 All OSDs use a 5GB journal.
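
 For illustration (device and mount point below are just examples), this is how
 those options would appear for one OSD data volume, plus a quick check that
 they are actually in effect:

   mount -o rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery \
         /dev/vg0/osd-0 /var/lib/ceph/osd/ceph-0
   grep /var/lib/ceph/osd /proc/mounts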

 We are slowly adding monitoring to the setup to see what the benefits of
 Btrfs are in our case (ceph osd perf, kernel io wait per device, osd CPU
 usage, ...). One long-term objective is to slowly raise performance,
 both by migrating to/adding more suitable hardware and by tuning the
 software side. Detailed monitoring is supposed to help us study the
 behaviour of isolated OSDs with different settings and warn us early if
 they generate performance problems, so we can take them out with next to
 no impact on the whole storage network (we are strong believers in
 slow, incremental and continuous change, and distributed storage with
 redundancy makes that easy to implement).
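
 A minimal sketch of the kind of numbers collected (device names and the 60s
 interval are examples):

   ceph osd perf                              # per-OSD fs_commit/fs_apply latency
   iostat -x sdb sdc sdd 60                   # per-device utilisation and await
   pidstat -p "$(pgrep -d, -f ceph-osd)" 60   # CPU usage of the OSD daemons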

 # Main questions

 The system works well, but I just realised when restarting one of the 2
 large Btrfs OSDs that it was very slow to rejoin the network (ceph osd
 set noout was used for the restart). I stopped the OSD init after 5
 minutes to investigate what was going on and didn't find any obvious
 problem (filesystem sane, no swapping, no CPU hogs, concurrent IO not able
 to starve the system by itself, ...). Subsequent restarts took between 43s
 (nearly no concurrent disk access and warm caches after an earlier
 restart without unmounting the filesystem) and 3min57s (one VM still on
 DRBD doing ~30 IO/s on the same volume and cold caches after a
 filesystem mount).
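
 A rough way to reproduce this timing (osd.0 and the init-script invocation are
 examples, adjust to the distribution; this only measures the time until the
 OSD reports up again):

   ceph osd set noout
   /etc/init.d/ceph restart osd.0
   t0=$(date +%s)
   until ceph osd dump | grep -E '^osd\.0 +up ' >/dev/null; do sleep 1; done
   echo "osd.0 back up after $(( $(date +%s) - t0 ))s"
   ceph osd unset noout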

 It seems that the startup time is getting longer on the 2 large Btrfs
 filesystems (the other one gives similar results: 3min48s on the first
 try, for example). I noticed that it was a bit slow a week ago, but not as
 much (there was ~half as much data on them at the time). OSDs on
 individual disks don't exhibit this problem (with warm caches init
 finishes in ~4s on the small Btrfs volumes, ~3s on the XFS volumes), but
 they are on dedicated disks with less data.

 With warm caches most of the time is spent between the
 osd.n osdmap load_pgs
 and
 osd.n osdmap load_pgs opened m pgs
 log lines in /var/log/ceph/ceph-osd.n.log (m is ~650 for both OSDs). So
 it seems most of the time is spent opening pgs.
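
 For reference, a sketch of how this gap can be pulled out of the log (path and
 osd id are examples; it assumes the default timestamp format, that both lines
 land on the same day, and that the first line ends with load_pgs as quoted
 above):

   awk '
       /osdmap load_pgs$/       { split($2, t, ":"); start = t[1]*3600 + t[2]*60 + t[3] }
       /osdmap load_pgs opened/ { split($2, t, ":"); if (start) printf "load_pgs took %.1fs\n", t[1]*3600 + t[2]*60 + t[3] - start }
   ' /var/log/ceph/ceph-osd.0.log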

 What could explain such long startup times? Is the OSD init doing a lot
 of random disk accesses? Is it dependent on the volume of data or the
 history of the OSD (fragmentation?)? Maybe Btrfs on 3.12.21 has known
 performance problems or suboptimal autodefrag (on 3.17.0 with 1/3 the
 data and a similar history of disk accesses we have 1/10 the init time
 when the disks are in both cases idle)?

Re: [ceph-users] Ceph OSD very slow startup

2014-10-14 Thread Lionel Bouton
On 14/10/2014 18:17, Gregory Farnum wrote:
 On Monday, October 13, 2014, Lionel Bouton lionel+c...@bouton.name wrote:

 [...]

 What could explain such long startup times? Is the OSD init doing a lot
 of random disk accesses? Is it dependent on the volume of data or the
 history of the OSD (fragmentation?)? Maybe Btrfs on 3.12.21 has known
 performance problems or suboptimal autodefrag (on 3.17.0 with 1/3 the
 data and a similar history of disk accesses we have 1/10 the init time
 when the disks are in both cases idle)?


 Something like this is my guess; we've historically seen btrfs
 performance rapidly degrade under our workloads. And I imagine that
 your single-disk OSDs are only seeing 100 or so PGs each?

Yes.

 You could perhaps turn up OSD and FileStore debugging on one of your
 big nodes and one of the little ones and do a restart and compare the
 syscall wait times between them to check.
 -Greg


Will do (I have to look up the docs first).
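
For reference, something along these lines should do it (a sketch; to capture
the restart itself the levels have to be set in ceph.conf before the daemon
starts, since injectargs only affects a running daemon):

  # in ceph.conf on the node being tested, before restarting the OSD:
  #   [osd]
  #       debug osd = 10
  #       debug filestore = 10
  # or, for a daemon that is already running:
  ceph tell osd.0 injectargs '--debug-osd 10 --debug-filestore 10'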

Thanks for the suggestions.

Lionel


Re: [ceph-users] Ceph OSD very slow startup

2014-10-14 Thread Lionel Bouton
On 14/10/2014 18:51, Lionel Bouton wrote:
 On 14/10/2014 18:17, Gregory Farnum wrote:
 On Monday, October 13, 2014, Lionel Bouton lionel+c...@bouton.name wrote:

 [...]

 What could explain such long startup times? Is the OSD init doing a lot
 of random disk accesses? Is it dependent on the volume of data or the
 history of the OSD (fragmentation?)? Maybe Btrfs on 3.12.21 has known
 performance problems or suboptimal autodefrag (on 3.17.0 with 1/3 the
 data and a similar history of disk accesses we have 1/10 the init time
 when the disks are in both cases idle)?


 Something like this is my guess; we've historically seen btrfs
 performance rapidly degrade under our workloads. And I imagine that
 your single-disk OSDs are only seeing 100 or so PGs each?

 Yes.

In fact it's ~200 instead of ~600 on the big nodes in the current
configuration, so it's in the same ballpark as your estimate.


 You could perhaps turn up OSD and FileStore debugging on one of your
 big nodes and one of the little ones and do a restart and compare the
 syscall wait times between them to check.
 -Greg


In the logs the big OSD is osd.0 and the small one is osd.2. I'll call them
BIG and SMALL in the following.

Results are interesting. First, there is ~8s spent on BIG rolling back to
the Btrfs snapshot:

2014-10-14 19:04:56.290936 7f8b84516780 10
filestore(/var/lib/ceph/osd/ceph-0)  most recent snap from
23107688,23107701 is 23107701
2014-10-14 19:04:56.290949 7f8b84516780 10
filestore(/var/lib/ceph/osd/ceph-0) mount rolling back to consistent
snap 23107701
2014-10-14 19:04:56.290955 7f8b84516780 10
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) rollback_to: to
'snap_23107701'
2014-10-14 19:05:04.379241 7f8b84516780  5
filestore(/var/lib/ceph/osd/ceph-0) mount op_seq is 23107701

This takes less than 1sec on SMALL :

2014-10-14 19:02:19.596158 7fd0a0d2b780 10
filestore(/var/lib/ceph/osd/ceph-2)  most recent snap from
1602633,1602645 is 1602645 
2014-10-14 19:02:19.596176 7fd0a0d2b780 10
filestore(/var/lib/ceph/osd/ceph-2) mount rolling back to consistent
snap 1602645
2014-10-14 19:02:19.596182 7fd0a0d2b780 10
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) rollback_to: to
'snap_1602645' 
2014-10-14 19:02:20.311178 7fd0a0d2b780  5
filestore(/var/lib/ceph/osd/ceph-2) mount op_seq is 1602645

I assume this is the time Btrfs itself takes mounting the snapshot.
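
For reference, a sketch of how this gap can be extracted from the logs (path is
an example; it assumes the default timestamp format and that both lines fall on
the same day):

  awk '
      function secs(ts,  t) { split(ts, t, ":"); return t[1]*3600 + t[2]*60 + t[3] }
      /rollback_to: to/ { start = secs($2) }
      /mount op_seq is/ { if (start) printf "snapshot rollback took %.1fs\n", secs($2) - start }
  ' /var/log/ceph/ceph-osd.0.log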

This behavior (slow Btrfs snapshot operations) seems to repeat itself:
the first checkpoint creation (which seems to involve a Btrfs snapshot)
takes more than 15s on BIG:
2014-10-14 19:05:23.135960 7f8b73fff700 10
filestore(/var/lib/ceph/osd/ceph-0) sync_entry commit took 15.503795,
interval was 15.866245
Less than 0.3s on SMALL:
2014-10-14 19:02:21.889467 7fd094be4700 10
filestore(/var/lib/ceph/osd/ceph-2) sync_entry commit took 0.135268,
interval was 0.276440

Subsequent checkpoints are much faster on BIG though; the next one:
2014-10-14 19:05:28.426263 7f8b73fff700 10
filestore(/var/lib/ceph/osd/ceph-0) sync_entry commit took 0.969734,
interval was 0.979822
and they seem to converge towards ~0.25s later on.

SMALL seems to converge towards ~0.07s (it holds ~1/3 of the data and
probably of the data structures though, so if Btrfs snapshot operations
are supposed to be O(n) this could be normal).
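
For the record, the evolution of the commit times can be followed with
something like this (path is an example; the field position matches the
sync_entry lines quoted above):

  grep 'sync_entry commit took' /var/log/ceph/ceph-osd.0.log \
      | awk '{ gsub(",", "", $9); print $1, $2, $9 }'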

I couldn't find other significant differences: the other phases I
identified in the OSD init process took more time on BIG, but never more
than ~3x longer than on SMALL.

It seems most of the time is spent creating or accessing snapshots. My
best guess currently is that Btrfs snapshot operations may have seen
significant speedups between 3.12.21 and 3.17.0 and that OSD init is
checkpoint (i.e. snapshot) intensive, which accounts for most of the slow
startup.

Current plan: wait at least a week to study 3.17.0 behavior and upgrade
the 3.12.21 nodes to 3.17.0 if all goes well.

Best regards,

Lionel Bouton


Re: [ceph-users] Ceph OSD very slow startup

2014-10-13 Thread Lionel Bouton
On 14/10/2014 01:28, Lionel Bouton wrote:
 Hi,

 # First a short description of our Ceph setup

 You can skip to the next section (Main questions) to save time and
 come back to this one if you need more context.

Missing important piece of information: this is Ceph 0.80.5 (guessable,
as I stated that I was waiting for 0.80.7, but that required some digging...).