Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-12 Thread Lionel Bouton
Hi,

On 12/07/2016 02:51, Brad Hubbard wrote:
>  [...]
>>>> This is probably a fragmentation problem: typical rbd access patterns
>>>> cause heavy BTRFS fragmentation.
>>> To the extent that operations take over 120 seconds to complete? Really?
>> Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very
>> aggressive way, rewriting data all over the place and creating/deleting
>> snapshots every filestore sync interval (5 seconds max by default IIRC).
>>
>> As I said there are three main causes of performance degradation:
>> - the snapshots,
>> - the journal in a standard copy-on-write file (move it out of the FS or
>> use NoCow),
>> - the weak auto-defragmentation of BTRFS (autodefrag mount option).
>>
>> Each one of them is enough to impact or even destroy performance in the
>> long run. The three combined make BTRFS unusable by default. This is why
>> BTRFS is not recommended: if you want to use it you have to be prepared
>> for some (heavy) tuning. The first two points are easy to address; for
>> the last (which begins to be noticeable once rewrites accumulate on your
>> data) I'm not aware of any tool other than the one we developed and
>> published on GitHub (link provided in a previous mail).
>>
>> Another thing: you'd better have a recent 4.1.x or 4.4.x kernel on your
>> OSDs if you use BTRFS. We've used it since 3.19.x, but I wouldn't advise
>> that now; I'd recommend 4.4.x if that's possible for you, and 4.1.x
>> otherwise.
> Thanks for the information. I wasn't aware things were that bad with BTRFS as
> I haven't had much to do with it up to this point.

Bad is relative. BTRFS was very time-consuming to set up (mainly because
of the defragmentation scheduler development, but finding the sources of
inefficiency was no picnic either). Once used properly, though, it has
three unique advantages:
- data checksums: they force Ceph to fall back to a good replica by
refusing to hand over corrupted data, which makes silent data corruption
far easier to handle (some of our RAID controllers, probably damaged by
electrical surges, had the nasty habit of flipping bits, so this really
was a big time/data saver here),
- compression: you get more space for free (see the example mount line
below),
- speed: we get better latencies with it than with XFS.
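
To make the compression point concrete, here is a minimal sketch of how
such an OSD data filesystem could be mounted (the device, OSD id, mount
point and lzo compression below are placeholders picked for the example,
not values taken from this thread):

# /etc/fstab -- example BTRFS OSD data filesystem with compression
/dev/sdb1  /var/lib/ceph/osd/ceph-12  btrfs  noatime,compress=lzo  0  2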

Until bluestore is production-ready (it should address these points even
better than BTRFS does), unless I find a use case where BTRFS falls on
its face, there's no way I'd use anything but BTRFS with Ceph.

Best regards,

Lionel


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Brad Hubbard
On Mon, Jul 11, 2016 at 04:53:36PM +0200, Lionel Bouton wrote:
> On 11/07/2016 11:56, Brad Hubbard wrote:
> > On Mon, Jul 11, 2016 at 7:18 PM, Lionel Bouton
> >  wrote:
> >>> On 11/07/2016 04:48, 한승진 wrote:
> >>> Hi cephers.
> >>>
> >>> I need your help with some issues.
> >>>
> >>> The Ceph cluster version is Jewel (10.2.1), and the filesystem is btrfs.
> >>>
> >>> I run 1 Mon and 48 OSDs on 4 nodes (each node has 12 OSDs).
> >>>
> >>> I've experienced one of the OSDs killing itself.
> >>>
> >>> It always issued a suicide timeout message.
> >> This is probably a fragmentation problem: typical rbd access patterns
> >> cause heavy BTRFS fragmentation.
> > To the extent that operations take over 120 seconds to complete? Really?
> 
> Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very
> aggressive way, rewriting data all over the place and creating/deleting
> snapshots every filestore sync interval (5 seconds max by default IIRC).
> 
> As I said there are three main causes of performance degradation:
> - the snapshots,
> - the journal in a standard copy-on-write file (move it out of the FS or
> use NoCow),
> - the weak auto-defragmentation of BTRFS (autodefrag mount option).
> 
> Each one of them is enough to impact or even destroy performance in the
> long run. The three combined make BTRFS unusable by default. This is why
> BTRFS is not recommended: if you want to use it you have to be prepared
> for some (heavy) tuning. The first two points are easy to address; for
> the last (which begins to be noticeable once rewrites accumulate on your
> data) I'm not aware of any tool other than the one we developed and
> published on GitHub (link provided in a previous mail).
> 
> Another thing: you'd better have a recent 4.1.x or 4.4.x kernel on your
> OSDs if you use BTRFS. We've used it since 3.19.x, but I wouldn't advise
> that now; I'd recommend 4.4.x if that's possible for you, and 4.1.x
> otherwise.

Thanks for the information. I wasn't aware things were that bad with BTRFS as
I haven't had much to do with it up to this point.

Cheers,
Brad

> 
> Best regards,
> 
> Lionel

-- 
Cheers,
Brad


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Lionel Bouton
On 11/07/2016 11:56, Brad Hubbard wrote:
> On Mon, Jul 11, 2016 at 7:18 PM, Lionel Bouton
>  wrote:
>> On 11/07/2016 04:48, 한승진 wrote:
>>> Hi cephers.
>>>
>>> I need your help with some issues.
>>>
>>> The Ceph cluster version is Jewel (10.2.1), and the filesystem is btrfs.
>>>
>>> I run 1 Mon and 48 OSDs on 4 nodes (each node has 12 OSDs).
>>>
>>> I've experienced one of the OSDs killing itself.
>>>
>>> It always issued a suicide timeout message.
>> This is probably a fragmentation problem: typical rbd access patterns
>> cause heavy BTRFS fragmentation.
> To the extent that operations take over 120 seconds to complete? Really?

Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very
aggressive way, rewriting data all over the place and creating/deleting
snapshots every filestore sync interval (5 seconds max by default IIRC).
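
For reference, you can check the relevant settings on a running OSD
through its admin socket; a quick sketch (osd.12 is just an example id,
run this on the node hosting that OSD):

$ sudo ceph daemon osd.12 config get filestore_max_sync_interval
$ sudo ceph daemon osd.12 config get filestore_btrfs_snap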

As I said there are three main causes of performance degradation:
- the snapshots,
- the journal in a standard copy-on-write file (move it out of the FS or
use NoCow),
- the weak auto-defragmentation of BTRFS (autodefrag mount option).

Each one of them is enough to impact or even destroy performance in the
long run. The three combined make BTRFS unusable by default. This is why
BTRFS is not recommended: if you want to use it you have to be prepared
for some (heavy) tuning. The first two points are easy to address; for
the last (which begins to be noticeable once rewrites accumulate on your
data) I'm not aware of any tool other than the one we developed and
published on GitHub (link provided in a previous mail).
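
To get a rough idea of how fragmented an OSD already is, filefrag can be
pointed at a few object files; a sketch (the OSD id and object path are
placeholders, and keep in mind that filefrag over-reports extents on
compressed files):

$ sudo filefrag /var/lib/ceph/osd/ceph-12/current/<pg dir>/<object file>
# prints "<object file>: N extents found"; consistently high counts on
# 4MB rbd objects are a sign of heavy fragmentation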

Another thing: you'd better have a recent 4.1.x or 4.4.x kernel on your
OSDs if you use BTRFS. We've used it since 3.19.x, but I wouldn't advise
that now; I'd recommend 4.4.x if that's possible for you, and 4.1.x
otherwise.

Best regards,

Lionel


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Brad Hubbard
On Mon, Jul 11, 2016 at 7:18 PM, Lionel Bouton
 wrote:
> On 11/07/2016 04:48, 한승진 wrote:
>> Hi cephers.
>>
>> I need your help with some issues.
>>
>> The Ceph cluster version is Jewel (10.2.1), and the filesystem is btrfs.
>>
>> I run 1 Mon and 48 OSDs on 4 nodes (each node has 12 OSDs).
>>
>> I've experienced one of the OSDs killing itself.
>>
>> It always issued a suicide timeout message.
>
> This is probably a fragmentation problem: typical rbd access patterns
> cause heavy BTRFS fragmentation.

To the extent that operations take over 120 seconds to complete? Really?

I have no experience with BTRFS but had heard that performance can "fall
off a cliff". I didn't know it was that bad, though.

-- 
Cheers,
Brad

>
> If you already use the autodefrag mount option, you can try this, which
> performs much better for us:
> https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb
>
> Note that it can take some time to fully defragment the filesystems, but
> it shouldn't put more stress on them than autodefrag does while doing so.
>
> If you don't already use it, set:
> filestore btrfs snap = false
> in ceph.conf and restart your OSDs.
>
> Finally, if you use journals on the filesystem and not on dedicated
> partitions, you'll have to recreate them with the NoCow attribute
> (otherwise there is no way to defragment the journals that doesn't kill
> performance).
>
> Best regards,
>
> Lionel


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Lionel Bouton
On 11/07/2016 04:48, 한승진 wrote:
> Hi cephers.
>
> I need your help with some issues.
>
> The Ceph cluster version is Jewel (10.2.1), and the filesystem is btrfs.
>
> I run 1 Mon and 48 OSDs on 4 nodes (each node has 12 OSDs).
>
> I've experienced one of the OSDs killing itself.
>
> It always issued a suicide timeout message.

This is probably a fragmentation problem: typical rbd access patterns
cause heavy BTRFS fragmentation.

If you already use the autodefrag mount option, you can try this, which
performs much better for us:
https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb

Note that it can take some time to fully defragment the filesystems, but
it shouldn't put more stress on them than autodefrag does while doing so.
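
A rough sketch of how to deploy it (we run the script as root with ruby;
the exact invocation and options are best taken from the script itself,
so treat this as an assumption):

$ git clone https://github.com/jtek/ceph-utils.git
$ cd ceph-utils && sudo ruby btrfs-defrag-scheduler.rb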

If you don't already use it, set:
filestore btrfs snap = false
in ceph.conf and restart your OSDs.
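
For illustration, the corresponding ceph.conf fragment and an example
restart (the systemd unit name and OSD id are assumptions; adapt them to
your init system and OSD ids):

[osd]
filestore btrfs snap = false

$ sudo systemctl restart ceph-osd@12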

Finally, if you use journals on the filesystem and not on dedicated
partitions, you'll have to recreate them with the NoCow attribute
(otherwise there is no way to defragment the journals that doesn't kill
performance).
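
A minimal sketch for one OSD, assuming the default journal path, OSD id
12 and a systemd setup (adjust to yours); note that the NoCow attribute
only takes effect when set on an empty file, hence the touch/chattr
before recreating the journal:

$ sudo systemctl stop ceph-osd@12
$ sudo ceph-osd -i 12 --flush-journal
$ sudo rm /var/lib/ceph/osd/ceph-12/journal
$ sudo touch /var/lib/ceph/osd/ceph-12/journal
$ sudo chattr +C /var/lib/ceph/osd/ceph-12/journal
$ sudo ceph-osd -i 12 --mkjournal
$ sudo chown ceph:ceph /var/lib/ceph/osd/ceph-12/journal
$ sudo systemctl start ceph-osd@12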

Best regards,

Lionel


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-10 Thread Brad Hubbard
On Mon, Jul 11, 2016 at 1:21 PM, Brad Hubbard  wrote:
> On Mon, Jul 11, 2016 at 11:48:57AM +0900, 한승진 wrote:
>> Hi cephers.
>>
>> I need your help with some issues.
>>
>> The Ceph cluster version is Jewel (10.2.1), and the filesystem is btrfs.
>>
>> I run 1 Mon and 48 OSDs on 4 nodes (each node has 12 OSDs).
>>
>> I've experienced one of the OSDs killing itself.
>>
>> It always issued a suicide timeout message.
>>
>> Below are the detailed logs.
>>
>>
>> ==
>> 0. ceph df detail
>> $ sudo ceph df detail
>> GLOBAL:
>>     SIZE    AVAIL   RAW USED  %RAW USED  OBJECTS
>>     42989G  24734G  18138G    42.19      23443k
>> POOLS:
>>     NAME     ID  CATEGORY  QUOTA OBJECTS  QUOTA BYTES  USED    %USED  MAX AVAIL  OBJECTS   DIRTY   READ    WRITE   RAW USED
>>     ha-pool  40  -         N/A            N/A          1405G   9.81   5270G      22986458  22447k  0       22447k  4217G
>>     volumes  45  -         N/A            N/A          4093G   28.57  5270G      933401    911k    648M    649M    12280G
>>     images   46  -         N/A            N/A          53745M  0.37   5270G      6746      6746    1278k   21046   157G
>>     backups  47  -         N/A            N/A          0       0      5270G      0         0       0       0       0
>>     vms      48  -         N/A            N/A          309G    2.16   5270G      79426     79426   92612k  46506k  928G
>>
>> 1. ceph no.15 log
>>
>> *(20:02 first timed out message)*
>> 2016-07-08 20:02:01.049483 7fcd3caa5700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
>> 2016-07-08 20:02:01.050403 7fcd3b2a2700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
>> 2016-07-08 20:02:01.086792 7fcd3b2a2700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
>> .
>> .
>> (sometimes accompanied by logs like these:)
>> 2016-07-08 20:02:11.379597 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> 12 slow requests, 5 included below; oldest blocked for > 30.269577 secs
>> 2016-07-08 20:02:11.379608 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> slow request 30.269577 seconds old, received at 2016-07-08 20:01:41.109937:
>> osd_op(client.895668.0:5302745 45.e2e779c2
>> rbd_data.cc460bc7fc8f.04d8 [stat,write 2596864~516096] snapc
>> 0=[] ack+ondisk+write+known_if_redirected e30969) currently commit_sent
>> 2016-07-08 20:02:11.379612 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> slow request 30.269108 seconds old, received at 2016-07-08 20:01:41.110406:
>> osd_op(client.895668.0:5302746 45.e2e779c2
>> rbd_data.cc460bc7fc8f.04d8 [stat,write 3112960~516096] snapc
>> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
>> locks
>> 2016-07-08 20:02:11.379630 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> slow request 30.268607 seconds old, received at 2016-07-08 20:01:41.110907:
>> osd_op(client.895668.0:5302747 45.e2e779c2
>> rbd_data.cc460bc7fc8f.04d8 [stat,write 3629056~516096] snapc
>> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
>> locks
>> 2016-07-08 20:02:11.379633 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> slow request 30.268143 seconds old, received at 2016-07-08 20:01:41.111371:
>> osd_op(client.895668.0:5302748 45.e2e779c2
>> rbd_data.cc460bc7fc8f.04d8 [stat,write 4145152~516096] snapc
>> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
>> locks
>> 2016-07-08 20:02:11.379636 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
>> slow request 30.267662 seconds old, received at 2016-07-08 20:01:41.111852:
>> osd_op(client.895668.0:5302749 45.e2e779c2
>> rbd_data.cc460bc7fc8f.04d8 [stat,write 4661248~516096] snapc
>> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
>> locks
>> .
>> .
>> (after a lot of the same messages)
>> 2016-07-08 20:03:53.682828 7fcd3caa5700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcd2d286700' had timed out after 15
>> 2016-07-08 20:03:53.682828 7fcd3caa5700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcd2da87700' had timed out after 15
>> 2016-07-08 20:03:53.682829 7fcd3caa5700  1 heartbeat_map is_healthy
>> 'FileStore::op_tp thread 0x7fcd48716700' had timed out after 60
>> 2016-07-08 20:03:53.682830 7fcd3caa5700  1 heartbeat_map is_healthy
>> 'FileStore::op_tp thread 0x7fcd47f15700' had timed out after 60
>> .
>> .
>> ("fault with nothing to send, going to standby" messages)
>> 2016-07-08 20:03:53.708665 7fcd15787700  0 -- 10.200.10.145:6818/6462 >>
>> 10.200.10.146:6806/4642 pipe(0x55818727e000 sd=276 :51916 s=2 pgs=2225 cs=1
>> l=0 c=0x558186f61d80).fault 

Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-10 Thread Brad Hubbard
On Mon, Jul 11, 2016 at 11:48:57AM +0900, 한승진 wrote:
> Hi cephers.
> 
> I need your help with some issues.
>
> The Ceph cluster version is Jewel (10.2.1), and the filesystem is btrfs.
>
> I run 1 Mon and 48 OSDs on 4 nodes (each node has 12 OSDs).
>
> I've experienced one of the OSDs killing itself.
>
> It always issued a suicide timeout message.
>
> Below are the detailed logs.
> 
> 
> ==
> 0. ceph df detail
> $ sudo ceph df detail
> GLOBAL:
>     SIZE    AVAIL   RAW USED  %RAW USED  OBJECTS
>     42989G  24734G  18138G    42.19      23443k
> POOLS:
>     NAME     ID  CATEGORY  QUOTA OBJECTS  QUOTA BYTES  USED    %USED  MAX AVAIL  OBJECTS   DIRTY   READ    WRITE   RAW USED
>     ha-pool  40  -         N/A            N/A          1405G   9.81   5270G      22986458  22447k  0       22447k  4217G
>     volumes  45  -         N/A            N/A          4093G   28.57  5270G      933401    911k    648M    649M    12280G
>     images   46  -         N/A            N/A          53745M  0.37   5270G      6746      6746    1278k   21046   157G
>     backups  47  -         N/A            N/A          0       0      5270G      0         0       0       0       0
>     vms      48  -         N/A            N/A          309G    2.16   5270G      79426     79426   92612k  46506k  928G
> 
> 1. ceph no.15 log
> 
> *(20:02 first timed out message)*
> 2016-07-08 20:02:01.049483 7fcd3caa5700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
> 2016-07-08 20:02:01.050403 7fcd3b2a2700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
> 2016-07-08 20:02:01.086792 7fcd3b2a2700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15
> .
> .
> (sometimes accompanied by logs like these:)
> 2016-07-08 20:02:11.379597 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> 12 slow requests, 5 included below; oldest blocked for > 30.269577 secs
> 2016-07-08 20:02:11.379608 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> slow request 30.269577 seconds old, received at 2016-07-08 20:01:41.109937:
> osd_op(client.895668.0:5302745 45.e2e779c2
> rbd_data.cc460bc7fc8f.04d8 [stat,write 2596864~516096] snapc
> 0=[] ack+ondisk+write+known_if_redirected e30969) currently commit_sent
> 2016-07-08 20:02:11.379612 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> slow request 30.269108 seconds old, received at 2016-07-08 20:01:41.110406:
> osd_op(client.895668.0:5302746 45.e2e779c2
> rbd_data.cc460bc7fc8f.04d8 [stat,write 3112960~516096] snapc
> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
> locks
> 2016-07-08 20:02:11.379630 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> slow request 30.268607 seconds old, received at 2016-07-08 20:01:41.110907:
> osd_op(client.895668.0:5302747 45.e2e779c2
> rbd_data.cc460bc7fc8f.04d8 [stat,write 3629056~516096] snapc
> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
> locks
> 2016-07-08 20:02:11.379633 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> slow request 30.268143 seconds old, received at 2016-07-08 20:01:41.111371:
> osd_op(client.895668.0:5302748 45.e2e779c2
> rbd_data.cc460bc7fc8f.04d8 [stat,write 4145152~516096] snapc
> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
> locks
> 2016-07-08 20:02:11.379636 7fcd4d8f8700  0 log_channel(cluster) log [WRN] :
> slow request 30.267662 seconds old, received at 2016-07-08 20:01:41.111852:
> osd_op(client.895668.0:5302749 45.e2e779c2
> rbd_data.cc460bc7fc8f.04d8 [stat,write 4661248~516096] snapc
> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw
> locks
> .
> .
> (after a lot of the same messages)
> 2016-07-08 20:03:53.682828 7fcd3caa5700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcd2d286700' had timed out after 15
> 2016-07-08 20:03:53.682828 7fcd3caa5700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcd2da87700' had timed out after 15
> 2016-07-08 20:03:53.682829 7fcd3caa5700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fcd48716700' had timed out after 60
> 2016-07-08 20:03:53.682830 7fcd3caa5700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fcd47f15700' had timed out after 60
> .
> .
> ("fault with nothing to send, going to standby" messages)
> 2016-07-08 20:03:53.708665 7fcd15787700  0 -- 10.200.10.145:6818/6462 >>
> 10.200.10.146:6806/4642 pipe(0x55818727e000 sd=276 :51916 s=2 pgs=2225 cs=1
> l=0 c=0x558186f61d80).fault with nothing to send, going to standby
> 2016-07-08 20:03:53.724928 7fcd072c2700  0 -- 10.200.10.145:6818/6462 >>
> 10.200.10.146:6800/4336 pipe(0x55818a25b400 sd=109