Re: [ceph-users] Undersized pgs problem

2015-11-30 Thread Vasiliy Angapov
Btw, in my configuration "mon osd downout subtree limit" is set to "host".
Does it influence things?
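
For reference, this is roughly how the live value can be checked on a monitor
(just a sketch: run it on the mon host, adjust the mon id to your own, and
note that internally the option is spelled mon_osd_down_out_subtree_limit):

ceph daemon mon.slpeah002 config get mon_osd_down_out_subtree_limit
ceph daemon mon.slpeah002 config show | grep mon_osd_down_out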

2015-11-29 14:38 GMT+08:00 Vasiliy Angapov :
> Bob,
> Thanks for the explanation, sounds reasonable! But how could it happen
> that a host is down and its OSDs are still IN the cluster?
> I mean, the NOOUT flag is not set and my timeouts are all at the defaults...
>
> But if I remember correctly the host was not completely down: it was
> pingable, but no other services were reachable, like SSH or anything else.
> Is it possible that the OSDs were still sending some information to the
> monitors, making them look IN?
>
> 2015-11-29 2:10 GMT+08:00 Bob R :
>> Vasiliy,
>>
>> Your OSDs are marked as 'down' but 'in'.
>>
>> "Ceph OSDs have two known states that can be combined. Up and Down only
>> tells you whether the OSD is actively involved in the cluster. OSD states
>> also are expressed in terms of cluster replication: In and Out. Only when a
>> Ceph OSD is tagged as Out does the self-healing process occur"
>>
>> Bob
>>
>> On Fri, Nov 27, 2015 at 6:15 AM, Mart van Santen  wrote:
>>>
>>>
>>> Dear Vasiliy,
>>>
>>>
>>>
>>> On 11/27/2015 02:00 PM, Irek Fasikhov wrote:
>>>
>>> Is your time synchronized?
>>>
>>> Best regards, Irek Fasikhov
>>> Mob.: +79229045757
>>>
>>> 2015-11-27 15:57 GMT+03:00 Vasiliy Angapov :
>>>>
>>>> > It seems that you played around with the crushmap and did something
>>>> > wrong.
>>>> > Compare the output of 'ceph osd tree' and the crushmap. There are some
>>>> > 'osd' devices renamed to 'device'; I think that is your problem.
>>>> Is this actually a mistake? What I did was remove a bunch of OSDs from
>>>> my cluster, which is why the numbering is sparse. But is it an issue to
>>>> have sparse OSD numbering?
>>>
>>>
>>> I think this is normal and should be no problem. I had this also
>>> previously.
>>>
>>>>
>>>> > Hi.
>>>> > Vasiliy, yes, it is a problem with the crushmap. Look at the weights:
>>>> > -3 14.56000 host slpeah001
>>>> > -2 14.56000 host slpeah002
>>>> What exactly is wrong here?
>>>
>>>
>>> I do not know how the weights of the hosts contribute to determining where
>>> to store the third copy of a PG. As you explained, you have enough space on
>>> all hosts, but maybe the host weights do not add up in a way that lets CRUSH
>>> place the PGs, so it comes to the conclusion that it cannot place them. What
>>> you can try is to artificially raise the weights of these hosts, to see if
>>> it starts mapping the third copies of the PGs onto the available host.
>>>
>>> I had a similar problem in the past; it was solved by upgrading to the
>>> latest CRUSH tunables. But be aware that this can cause massive data
>>> movement.
>>>
>>>
>>>>
>>>> I also found out that my OSD logs are full of such records:
>>>> 2015-11-26 08:31:19.273268 7fe4f49b1700  0 cephx: verify_authorizer
>>>> could not get service secret for service osd secret_id=2924
>>>> 2015-11-26 08:31:19.273276 7fe4f49b1700  0 --
>>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
>>>> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a520).accept: got bad
>>>> authorizer
>>>> 2015-11-26 08:31:24.273207 7fe4f49b1700  0 auth: could not find
>>>> secret_id=2924
>>>> 2015-11-26 08:31:24.273225 7fe4f49b1700  0 cephx: verify_authorizer
>>>> could not get service secret for service osd secret_id=2924
>>>> 2015-11-26 08:31:24.273231 7fe4f49b1700  0 --
>>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000
>>>> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a3c0).accept: got bad
>>>> authorizer
>>>> 2015-11-26 08:31:29.273199 7fe4f49b1700  0 auth: could not find
>>>> secret_id=2924
>>>> 2015-11-26 08:31:29.273215 7fe4f49b1700  0 cephx: verify_authorizer
>>>> could not get service secret for service osd secret_id=2924
>>>> 2015-11-26 08:31:29.273222 7fe4f49b1700  0 --
>>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
>>>> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a260).accept: got bad
>>>> authorizer
>>>> 2015-11

Re: [ceph-users] Undersized pgs problem

2015-11-28 Thread Vasiliy Angapov
Bob,
Thanks for the explanation, sounds reasonable! But how could it happen
that a host is down and its OSDs are still IN the cluster?
I mean, the NOOUT flag is not set and my timeouts are all at the defaults...

But if I remember correctly the host was not completely down: it was
pingable, but no other services were reachable, like SSH or anything else.
Is it possible that the OSDs were still sending some information to the
monitors, making them look IN?
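
In the meantime I guess marking them out by hand should let recovery start -
a sketch, with the OSD ids of the failed host taken from my "ceph osd tree"
output, and assuming the host really is gone for now:

for id in 1 33 34 35; do ceph osd out $id; done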

2015-11-29 2:10 GMT+08:00 Bob R :
> Vasiliy,
>
> Your OSDs are marked as 'down' but 'in'.
>
> "Ceph OSDs have two known states that can be combined. Up and Down only
> tells you whether the OSD is actively involved in the cluster. OSD states
> also are expressed in terms of cluster replication: In and Out. Only when a
> Ceph OSD is tagged as Out does the self-healing process occur"
>
> Bob
>
> On Fri, Nov 27, 2015 at 6:15 AM, Mart van Santen  wrote:
>>
>>
>> Dear Vasiliy,
>>
>>
>>
>> On 11/27/2015 02:00 PM, Irek Fasikhov wrote:
>>
>> Is your time synchronized?
>>
>> Best regards, Irek Fasikhov
>> Mob.: +79229045757
>>
>> 2015-11-27 15:57 GMT+03:00 Vasiliy Angapov :
>>>
>>> > It seems that you played around with the crushmap and did something
>>> > wrong.
>>> > Compare the output of 'ceph osd tree' and the crushmap. There are some
>>> > 'osd' devices renamed to 'device'; I think that is your problem.
>>> Is this actually a mistake? What I did was remove a bunch of OSDs from
>>> my cluster, which is why the numbering is sparse. But is it an issue to
>>> have sparse OSD numbering?
>>
>>
>> I think this is normal and should be no problem. I had this also
>> previously.
>>
>>>
>>> > Hi.
>>> > Vasiliy, yes, it is a problem with the crushmap. Look at the weights:
>>> > -3 14.56000 host slpeah001
>>> > -2 14.56000 host slpeah002
>>> What exactly is wrong here?
>>
>>
>> I do not know how the weights of the hosts contribute to determining where
>> to store the third copy of a PG. As you explained, you have enough space on
>> all hosts, but maybe the host weights do not add up in a way that lets CRUSH
>> place the PGs, so it comes to the conclusion that it cannot place them. What
>> you can try is to artificially raise the weights of these hosts, to see if
>> it starts mapping the third copies of the PGs onto the available host.
>>
>> I had a similar problem in the past; it was solved by upgrading to the
>> latest CRUSH tunables. But be aware that this can cause massive data
>> movement.
>>
>>
>>>
>>> I also found out that my OSD logs are full of such records:
>>> 2015-11-26 08:31:19.273268 7fe4f49b1700  0 cephx: verify_authorizer
>>> could not get service secret for service osd secret_id=2924
>>> 2015-11-26 08:31:19.273276 7fe4f49b1700  0 --
>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
>>> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a520).accept: got bad
>>> authorizer
>>> 2015-11-26 08:31:24.273207 7fe4f49b1700  0 auth: could not find
>>> secret_id=2924
>>> 2015-11-26 08:31:24.273225 7fe4f49b1700  0 cephx: verify_authorizer
>>> could not get service secret for service osd secret_id=2924
>>> 2015-11-26 08:31:24.273231 7fe4f49b1700  0 --
>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000
>>> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a3c0).accept: got bad
>>> authorizer
>>> 2015-11-26 08:31:29.273199 7fe4f49b1700  0 auth: could not find
>>> secret_id=2924
>>> 2015-11-26 08:31:29.273215 7fe4f49b1700  0 cephx: verify_authorizer
>>> could not get service secret for service osd secret_id=2924
>>> 2015-11-26 08:31:29.273222 7fe4f49b1700  0 --
>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
>>> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a260).accept: got bad
>>> authorizer
>>> 2015-11-26 08:31:34.273469 7fe4f49b1700  0 auth: could not find
>>> secret_id=2924
>>> 2015-11-26 08:31:34.273482 7fe4f49b1700  0 cephx: verify_authorizer
>>> could not get service secret for service osd secret_id=2924
>>> 2015-11-26 08:31:34.273486 7fe4f49b1700  0 --
>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000
>>> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a100).accept: got bad
>>> authorizer
>>> 2015-11-26 08:31:39.273310 7fe4f49b1700  0 auth: could not fi

Re: [ceph-users] Undersized pgs problem

2015-11-27 Thread Vasiliy Angapov
> It seems that you played around with the crushmap and did something wrong.
> Compare the output of 'ceph osd tree' and the crushmap. There are some 'osd'
> devices renamed to 'device'; I think that is your problem.
Is this actually a mistake? What I did was remove a bunch of OSDs from
my cluster, which is why the numbering is sparse. But is it an issue to
have sparse OSD numbering?

> Hi.
> Vasiliy, yes, it is a problem with the crushmap. Look at the weights:
> -3 14.56000 host slpeah001
> -2 14.56000 host slpeah002
What exactly is wrong here?

I also found out that my OSD logs are full of such records:
2015-11-26 08:31:19.273268 7fe4f49b1700  0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:19.273276 7fe4f49b1700  0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a520).accept: got bad
authorizer
2015-11-26 08:31:24.273207 7fe4f49b1700  0 auth: could not find secret_id=2924
2015-11-26 08:31:24.273225 7fe4f49b1700  0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:24.273231 7fe4f49b1700  0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000
sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a3c0).accept: got bad
authorizer
2015-11-26 08:31:29.273199 7fe4f49b1700  0 auth: could not find secret_id=2924
2015-11-26 08:31:29.273215 7fe4f49b1700  0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:29.273222 7fe4f49b1700  0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a260).accept: got bad
authorizer
2015-11-26 08:31:34.273469 7fe4f49b1700  0 auth: could not find secret_id=2924
2015-11-26 08:31:34.273482 7fe4f49b1700  0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:34.273486 7fe4f49b1700  0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000
sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a100).accept: got bad
authorizer
2015-11-26 08:31:39.273310 7fe4f49b1700  0 auth: could not find secret_id=2924
2015-11-26 08:31:39.273331 7fe4f49b1700  0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:39.273342 7fe4f49b1700  0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fcc000
sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee19fa0).accept: got bad
authorizer
2015-11-26 08:31:44.273753 7fe4f49b1700  0 auth: could not find secret_id=2924
2015-11-26 08:31:44.273769 7fe4f49b1700  0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:44.273776 7fe4f49b1700  0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fcc000
sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee189a0).accept: got bad
authorizer
2015-11-26 08:31:49.273412 7fe4f49b1700  0 auth: could not find secret_id=2924
2015-11-26 08:31:49.273431 7fe4f49b1700  0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:49.273455 7fe4f49b1700  0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee19080).accept: got bad
authorizer
2015-11-26 08:31:54.273293 7fe4f49b1700  0 auth: could not find secret_id=2924

What does it mean? Google says it might be a time sync issue, but my
clocks are perfectly synchronized...
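
For the record, a quick way to double-check this (a sketch, assuming ssh
access between the nodes; host names as in the osd tree):

for h in slpeah001 slpeah002 slpeah007 slpeah008; do
    ssh $h 'echo -n "$(hostname -s): "; date +%s; ntpq -pn | head -3'
done
ceph health detail | grep -i "clock skew"   # monitors complain here on mon clock drift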

2015-11-26 21:05 GMT+08:00 Irek Fasikhov :
> Hi.
> Vasiliy, yes, it is a problem with the crushmap. Look at the weights:
> " -3 14.56000 host slpeah001
>  -2 14.56000 host slpeah002
>  "
>
> Best regards, Irek Fasikhov
> Mob.: +79229045757
>
> 2015-11-26 13:16 GMT+03:00 ЦИТ РТ-Курамшин Камиль Фидаилевич
> :
>>
>> It seems that you played around with the crushmap and did something wrong.
>> Compare the output of 'ceph osd tree' and the crushmap. There are some 'osd'
>> devices renamed to 'device'; I think that is your problem.
>>
>> Sent from a mobile device.
>>
>>
>> -Original Message-
>> From: Vasiliy Angapov 
>> To: ceph-users 
>> Sent: Thu, 26 Nov 2015 7:53
>> Subject: [ceph-users] Undersized pgs problem
>>
>> Hi, colleagues!
>>
>> I have a small 4-node Ceph cluster (0.94.2); all pools have size 3,
>> min_size 1.
>> Last night one host failed and the cluster was unable to rebalance,
>> reporting a lot of undersized PGs.
>>
>> root@slpeah002:[~]:# ceph -s
>> cluster 78eef61a-3e9c-447c-a3ec-ce84c617d728
>>  health HEALTH_WARN
>> 1486 pgs degraded
>> 1486 pgs stuck degraded
>> 2257 pgs stuck unc

[ceph-users] Undersized pgs problem

2015-11-25 Thread Vasiliy Angapov
Hi, colleagues!

I have a small 4-node Ceph cluster (0.94.2); all pools have size 3, min_size 1.
Last night one host failed and the cluster was unable to rebalance,
reporting a lot of undersized PGs.

root@slpeah002:[~]:# ceph -s
cluster 78eef61a-3e9c-447c-a3ec-ce84c617d728
 health HEALTH_WARN
1486 pgs degraded
1486 pgs stuck degraded
2257 pgs stuck unclean
1486 pgs stuck undersized
1486 pgs undersized
recovery 80429/555185 objects degraded (14.487%)
recovery 40079/555185 objects misplaced (7.219%)
4/20 in osds are down
1 mons down, quorum 1,2 slpeah002,slpeah007
 monmap e7: 3 mons at
{slpeah001=192.168.254.11:6780/0,slpeah002=192.168.254.12:6780/0,slpeah007=172.31.252.46:6789/0}
election epoch 710, quorum 1,2 slpeah002,slpeah007
 osdmap e14062: 20 osds: 16 up, 20 in; 771 remapped pgs
  pgmap v7021316: 4160 pgs, 5 pools, 1045 GB data, 180 kobjects
3366 GB used, 93471 GB / 96838 GB avail
80429/555185 objects degraded (14.487%)
40079/555185 objects misplaced (7.219%)
1903 active+clean
1486 active+undersized+degraded
 771 active+remapped
  client io 0 B/s rd, 246 kB/s wr, 67 op/s

  root@slpeah002:[~]:# ceph osd tree
ID  WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 94.63998 root default
 -9 32.75999 host slpeah007
 72  5.45999 osd.72  up  1.0  1.0
 73  5.45999 osd.73  up  1.0  1.0
 74  5.45999 osd.74  up  1.0  1.0
 75  5.45999 osd.75  up  1.0  1.0
 76  5.45999 osd.76  up  1.0  1.0
 77  5.45999 osd.77  up  1.0  1.0
-10 32.75999 host slpeah008
 78  5.45999 osd.78  up  1.0  1.0
 79  5.45999 osd.79  up  1.0  1.0
 80  5.45999 osd.80  up  1.0  1.0
 81  5.45999 osd.81  up  1.0  1.0
 82  5.45999 osd.82  up  1.0  1.0
 83  5.45999 osd.83  up  1.0  1.0
 -3 14.56000 host slpeah001
  1  3.64000  osd.1 down  1.0  1.0
 33  3.64000 osd.33  down  1.0  1.0
 34  3.64000 osd.34  down  1.0  1.0
 35  3.64000 osd.35  down  1.0  1.0
 -2 14.56000 host slpeah002
  0  3.64000 osd.0   up  1.0  1.0
 36  3.64000 osd.36  up  1.0  1.0
 37  3.64000 osd.37  up  1.0  1.0
 38  3.64000 osd.38  up  1.0  1.0

Crushmap:

 # begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 1 osd.1
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 device11
device 12 device12
device 13 device13
device 14 device14
device 15 device15
device 16 device16
device 17 device17
device 18 device18
device 19 device19
device 20 device20
device 21 device21
device 22 device22
device 23 device23
device 24 device24
device 25 device25
device 26 device26
device 27 device27
device 28 device28
device 29 device29
device 30 device30
device 31 device31
device 32 device32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 device39
device 40 device40
device 41 device41
device 42 device42
device 43 device43
device 44 device44
device 45 device45
device 46 device46
device 47 device47
device 48 device48
device 49 device49
device 50 device50
device 51 device51
device 52 device52
device 53 device53
device 54 device54
device 55 device55
device 56 device56
device 57 device57
device 58 device58
device 59 device59
device 60 device60
device 61 device61
device 62 device62
device 63 device63
device 64 device64
device 65 device65
device 66 device66
device 67 device67
device 68 device68
device 69 device69
device 70 device70
device 71 device71
device 72 osd.72
device 73 osd.73
device 74 osd.74
device 75 osd.75
device 76 osd.76
device 77 osd.77
device 78 osd.78
device 79 osd.79
device 80 osd.80
device 81 osd.81
device 82 osd.82
device 83 osd.83

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host slpeah007 {
id -9   # do not change unnecessarily
# weight 32.760
alg straw
hash 0  # rjenkins1

Re: [ceph-users] Ceph Openstack deployment

2015-11-06 Thread Vasiliy Angapov
There must be something in /var/log/cinder/volume.log or
/var/log/nova/nova-compute.log that points to the problem. Can you
post it here?

2015-11-06 20:14 GMT+08:00 Iban Cabrillo :
> Hi Vasilly,
>   Thanks, but I still see the same error:
>
> cinder.conf (of course I just restart the cinder-volume service)
>
> # default volume type to use (string value)
>
> [rbd-cephvolume]
> rbd_user = cinder
> rbd_secret_uuid = 67a6d4a1-e53a-42c7-9bc9-xxx
> volume_backend_name=rbd
> volume_driver = cinder.volume.drivers.rbd.RBDDriver
> rbd_pool = volumes
> rbd_ceph_conf = /etc/ceph/ceph.conf
> rbd_flatten_volume_from_snapshot = false
> rbd_max_clone_depth = 5
> rbd_store_chunk_size = 4
> rados_connect_timeout = -1
> glance_api_version = 2
>
>
>   xen be: qdisk-51760: error: Could not open
> 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
> directory
> xen be: qdisk-51760: initialise() failed
> xen be: qdisk-51760: error: Could not open
> 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
> directory
> xen be: qdisk-51760: initialise() failed
> xen be: qdisk-51760: error: Could not open
> 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
> directory
> xen be: qdisk-51760: initialise() failed
> xen be: qdisk-51760: error: Could not open
> 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
> directory
> xen be: qdisk-51760: initialise() failed
> xen be: qdisk-51760: error: Could not open
> 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
> directory
> xen be: qdisk-51760: initialise() failed
> xen be: qdisk-51760: error: Could not open
> 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
> directory
> xen be: qdisk-51760: initialise() failed
>
> Regards, I
>
> 2015-11-06 13:00 GMT+01:00 Vasiliy Angapov :
>>
>> In cinder.conf you should put these options:
>>
>> rbd_user = cinder
>> rbd_secret_uuid = 67a6d4a1-e53a-42c7-9bc9-xxx
>>
>> into the [rbd-cephvolume] section instead of [DEFAULT].
>>
>> 2015-11-06 19:45 GMT+08:00 Iban Cabrillo :
>> > Hi,
>> >   One more step debugging this issue (hypervisor/nova-compute node is
>> > XEN
>> > 4.4.2):
>> >
>> > I think the problem is that libvirt is not getting the correct user or
>> > credentials to access the pool; in the instance qemu log I see:
>> >
>> > xen be: qdisk-51760: error: Could not open
>> > 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
>> > directory
>> > xen be: qdisk-51760: initialise() failed
>> > xen be: qdisk-51760: error: Could not open
>> > 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
>> > directory
>> > xen be: qdisk-51760: initialise() failed
>> > xen be: qdisk-51760: error: Could not open
>> > 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
>> > directory
>> >
>> > But using the user cinder on pool volumes :
>> >
>> > rbd ls -p volumes --id cinder
>> > test
>> > volume-4d26bb31-91e8-4646-8010-82127b775c8e
>> > volume-5e2ab5c2-4710-4c28-9755-b5bc4ff6a52a
>> > volume-7da08f12-fb0f-4269-931a-d528c1507fee
>> >
>> > Using:
>> > qemu-img info -f rbd rbd:volumes/test
>> > Does not work, but using directly the user cinder and the ceph.conf file
>> > works fine:
>> >
>> > qemu-img info -f rbd rbd:volumes/test:id=cinder:conf=/etc/ceph/ceph.conf
>> >
>> > I think nova.conf is set correctly (section libvirt):
>> > images_rbd_pool = volumes
>> > images_rbd_ceph_conf = /etc/ceph/ceph.conf
>> > hw_disk_discard=unmap
>> > rbd_user = cinder
>> > rbd_secret_uuid = 67a6d4a1-e53a-42c7-9bc9-
>> >
>> > And looking at libvirt:
>> >
>> > # virsh secret-list
>> > setlocale: No such file or directory
>> >  UUID  Usage
>> >
>> > 
>> >  67a6d4a1-e53a-42c7-9bc9-  ceph client.cinder secret
>> >
>> >
>> > virsh secret-get-value 67a6d4a1-e53a-42c7-9bc9-
>> > setlocale: No such file or directory
>> > AQAonAdWS3iMJxxj9iErv001a0k+vyFdUg==
>> > cat /etc/ceph/ceph.client.cinder.keyring

Re: [ceph-users] Ceph Openstack deployment

2015-11-06 Thread Vasiliy Angapov
In cinder.conf you should put these options:

rbd_user = cinder
rbd_secret_uuid = 67a6d4a1-e53a-42c7-9bc9-xxx

into the [rbd-cephvolume] section instead of [DEFAULT].
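
I.e. something along these lines - a sketch only: enabled_backends has to
reference the same section name, and the pool/uuid are of course your own
(and restart cinder-volume afterwards):

[DEFAULT]
enabled_backends = rbd-cephvolume

[rbd-cephvolume]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = rbd
rbd_pool = volumes
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = 67a6d4a1-e53a-42c7-9bc9-xxx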

2015-11-06 19:45 GMT+08:00 Iban Cabrillo :
> Hi,
>   One more step debugging this issue (hypervisor/nova-compute node is XEN
> 4.4.2):
>
> I think the problem is that libvirt is not getting the correct user or
> credentials to access the pool; in the instance qemu log I see:
>
> xen be: qdisk-51760: error: Could not open
> 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
> directory
> xen be: qdisk-51760: initialise() failed
> xen be: qdisk-51760: error: Could not open
> 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
> directory
> xen be: qdisk-51760: initialise() failed
> xen be: qdisk-51760: error: Could not open
> 'volumes/volume-4d26bb31-91e8-4646-8010-82127b775c8e': No such file or
> directory
>
> But using the user cinder on pool volumes :
>
> rbd ls -p volumes --id cinder
> test
> volume-4d26bb31-91e8-4646-8010-82127b775c8e
> volume-5e2ab5c2-4710-4c28-9755-b5bc4ff6a52a
> volume-7da08f12-fb0f-4269-931a-d528c1507fee
>
> Using:
> qemu-img info -f rbd rbd:volumes/test
> Does not work, but using directly the user cinder and the ceph.conf file
> works fine:
>
> qemu-img info -f rbd rbd:volumes/test:id=cinder:conf=/etc/ceph/ceph.conf
>
> I think nova.conf is set correctly (section libvirt):
> images_rbd_pool = volumes
> images_rbd_ceph_conf = /etc/ceph/ceph.conf
> hw_disk_discard=unmap
> rbd_user = cinder
> rbd_secret_uuid = 67a6d4a1-e53a-42c7-9bc9-
>
> And looking at libvirt:
>
> # virsh secret-list
> setlocale: No such file or directory
>  UUID  Usage
> 
>  67a6d4a1-e53a-42c7-9bc9-  ceph client.cinder secret
>
>
> virsh secret-get-value 67a6d4a1-e53a-42c7-9bc9-
> setlocale: No such file or directory
> AQAonAdWS3iMJxxj9iErv001a0k+vyFdUg==
> cat /etc/ceph/ceph.client.cinder.keyring
> [client.cinder]
> key = AQAonAdWS3iMJxxj9iErv001a0k+vyFdUg==
>
>
> Any idea will be welcomed.
> regards, I
>
> 2015-11-04 10:51 GMT+01:00 Iban Cabrillo :
>>
>> Dear Cephers,
>>
>>    I still cannot attach volumes to my cloud machines; the Ceph version is 0.94.5
>> (9764da52395923e0b32908d83a9f7304401fee43) and OpenStack Juno
>>
>>Nova+cinder are able to create volumes on Ceph
>> cephvolume:~ # rados ls --pool volumes
>> rbd_header.1f7784a9e1c2e
>> rbd_id.volume-5e2ab5c2-4710-4c28-9755-b5bc4ff6a52a
>> rbd_directory
>> rbd_id.volume-7da08f12-fb0f-4269-931a-d528c1507fee
>> rbd_header.23d5e33b4c15c
>> rbd_id.volume-4d26bb31-91e8-4646-8010-82127b775c8e
>> rbd_header.20407190ce77f
>>
>> cloud:~ # cinder list
>>
>> +--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+
>> |                  ID                  | Status | Display Name | Size | Volume Type | Bootable |             Attached to              |
>> +--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+
>> | 4d26bb31-91e8-4646-8010-82127b775c8e | in-use |     None     |  2   |     rbd     |  false   | 59aa021e-bb4c-4154-9b18-9d09f5fd3aeb |
>> +--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+
>>
>>
>>nova:~ # nova volume-attach 59aa021e-bb4c-4154-9b18-9d09f5fd3aeb
>> 4d26bb31-91e8-4646-8010-82127b775c8e auto
>> +----------+--------------------------------------+
>> | Property |                Value                 |
>> +----------+--------------------------------------+
>> | device   | /dev/xvdd                            |
>> | id       | 4d26bb31-91e8-4646-8010-82127b775c8e |
>> | serverId | 59aa021e-bb4c-4154-9b18-9d09f5fd3aeb |
>> | volumeId | 4d26bb31-91e8-4646-8010-82127b775c8e |
>> +----------+--------------------------------------+
>>
>> From nova-compute (Ubuntu 14.04 LTS \n \l) node I see the
>> attaching/detaching:
>> cloud01:~ # dpkg -l | grep ceph
>> ii  ceph-common 0.94.5-1trusty
>> amd64common utilities to mount and interact with a ceph storage
>> cluster
>> ii  libcephfs1   0.94.5-1trusty
>> amd64Ceph distributed file system client library
>> ii  python-cephfs 0.94.5-1trusty
>> amd64Python libraries for the Ceph libcephfs library
>> ii  librbd10.94.5-1trusty
>> amd64RADOS block device client library
>> ii  python-rbd  0.94.5-1trusty
>> amd64Pyt

Re: [ceph-users] Nova fails to download image from Glance backed with Ceph

2015-09-04 Thread Vasiliy Angapov
Thanks for response!

The free space on /var/lib/nova/instances is very large on every compute host.
Glance image-download works as expected.
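
For completeness, the check Sebastien suggests below can be done roughly like
this on a compute node (a sketch: the pool name "images" is an assumption, and
the image uuid is a placeholder - substitute your own):

rbd -p images export <glance-image-uuid> /tmp/image-test.raw
md5sum /tmp/image-test.raw    # compare with the checksum from "glance image-show"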

2015-09-04 21:27 GMT+08:00 Jan Schermer :
> Didn't you run out of space? Happened to me when a customer tried to create a 
> 1TB image...
>
> Z.
>
>> On 04 Sep 2015, at 15:15, Sebastien Han  wrote:
>>
>> Just to take away a possible issue from infra (LBs etc).
>> Did you try to download the image on the compute node? Something like rbd 
>> export?
>>
>>> On 04 Sep 2015, at 11:56, Vasiliy Angapov  wrote:
>>>
>>> Hi all,
>>>
>>> Not sure where this bug actually belongs - to OpenStack or to Ceph -
>>> but I'm writing here in the humble hope that someone else has run into it too.
>>>
>>> I configured a test OpenStack instance with Glance images stored in Ceph
>>> 0.94.3. Nova uses local storage.
>>> But when I try to launch an instance from a large image stored in Ceph,
>>> it fails to spawn with this error in nova-conductor.log:
>>>
>>> 2015-09-04 11:52:35.076 3605449 ERROR nova.scheduler.utils
>>> [req-c6af3eca-f166-45bd-8edc-b8cfadeb0d0b
>>> 82c1f134605e4ee49f65015dda96c79a 448cc6119e514398ac2793d043d4fa02 - -
>>> -] [instance: 18c9f1d5-50e8-426f-94d5-167f43129ea6] Error from last
>>> host: slpeah005 (node slpeah005.cloud): [u'Traceback (most recent call
>>> last):\n', u'  File
>>> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2220,
>>> in _do_build_and_run_instance\nfilter_properties)\n', u'  File
>>> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2363,
>>> in _build_and_run_instance\ninstance_uuid=instance.uuid,
>>> reason=six.text_type(e))\n', u'RescheduledException: Build of instance
>>> 18c9f1d5-50e8-426f-94d5-167f43129ea6 was re-scheduled: [Errno 32]
>>> Corrupt image download. Checksum was 625d0686a50f6b64e57b1facbc042248
>>> expected 4a7de2fbbd01be5c6a9e114df145b027\n']
>>>
>>> So nova tries 3 different hosts, gets the same error message on every
>>> single one, and then fails to spawn the instance.
>>> I've tried the little Cirros image and it works fine. The issue
>>> happens with large images, around 10 GB in size.
>>> I also looked into the /var/lib/nova/instances/_base folder and
>>> found that the image actually is being downloaded, but at some point
>>> the download is interrupted for some unknown reason and the instance
>>> gets deleted.
>>>
>>> I looked at the syslog and found many messages like that:
>>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735094
>>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.22 since
>>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>>> (cutoff 2015-09-04 12:51:32.735011)
>>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735099
>>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.23 since
>>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>>> (cutoff 2015-09-04 12:51:32.735011)
>>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735104
>>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.24 since
>>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>>> (cutoff 2015-09-04 12:51:32.735011)
>>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735108
>>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.26 since
>>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>>> (cutoff 2015-09-04 12:51:32.735011)
>>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735118
>>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.27 since
>>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>>> (cutoff 2015-09-04 12:51:32.735011)
>>>
>>> I've also tried monitoring the number of file descriptors of the
>>> nova-compute process, but it never exceeds 102 ("echo
>>> /proc/NOVA_COMPUTE_PID/fd/* | wc -w", as Jan advised on this ML).
>>> It also seems the problem appeared only in 0.94.3; with 0.94.2
>>> everything worked just fine!
>>>
>>> Would be very grateful for any help!
>>>
>>> Vasily.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> Cheers.
>> 
>> Sébastien Han
>> Senior Cloud Architect
>>
>> "Always give 100%. Unless you're giving blood."
>>
>> Mail: s...@redhat.com
>> Address: 11 bis, rue Roquépine - 75008 Paris
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Nova fails to download image from Glance backed with Ceph

2015-09-04 Thread Vasiliy Angapov
Hi all,

Not sure where this bug actually belongs - to OpenStack or to Ceph -
but I'm writing here in the humble hope that someone else has run into it too.

I configured a test OpenStack instance with Glance images stored in Ceph
0.94.3. Nova uses local storage.
But when I try to launch an instance from a large image stored in Ceph,
it fails to spawn with this error in nova-conductor.log:

2015-09-04 11:52:35.076 3605449 ERROR nova.scheduler.utils
[req-c6af3eca-f166-45bd-8edc-b8cfadeb0d0b
82c1f134605e4ee49f65015dda96c79a 448cc6119e514398ac2793d043d4fa02 - -
-] [instance: 18c9f1d5-50e8-426f-94d5-167f43129ea6] Error from last
host: slpeah005 (node slpeah005.cloud): [u'Traceback (most recent call
last):\n', u'  File
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2220,
in _do_build_and_run_instance\nfilter_properties)\n', u'  File
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2363,
in _build_and_run_instance\ninstance_uuid=instance.uuid,
reason=six.text_type(e))\n', u'RescheduledException: Build of instance
18c9f1d5-50e8-426f-94d5-167f43129ea6 was re-scheduled: [Errno 32]
Corrupt image download. Checksum was 625d0686a50f6b64e57b1facbc042248
expected 4a7de2fbbd01be5c6a9e114df145b027\n']

So nova tries 3 different hosts, gets the same error message on every
single one, and then fails to spawn the instance.
I've tried the little Cirros image and it works fine. The issue
happens with large images, around 10 GB in size.
I also looked into the /var/lib/nova/instances/_base folder and
found that the image actually is being downloaded, but at some point
the download is interrupted for some unknown reason and the instance
gets deleted.

I looked at the syslog and found many messages like that:
Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735094
7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.22 since
back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
(cutoff 2015-09-04 12:51:32.735011)
Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735099
7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.23 since
back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
(cutoff 2015-09-04 12:51:32.735011)
Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735104
7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.24 since
back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
(cutoff 2015-09-04 12:51:32.735011)
Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735108
7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.26 since
back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
(cutoff 2015-09-04 12:51:32.735011)
Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735118
7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.27 since
back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
(cutoff 2015-09-04 12:51:32.735011)

I've also tried monitoring the number of file descriptors of the
nova-compute process, but it never exceeds 102 ("echo
/proc/NOVA_COMPUTE_PID/fd/* | wc -w", as Jan advised on this ML).
It also seems the problem appeared only in 0.94.3; with 0.94.2
everything worked just fine!

Would be very grateful for any help!

Vasily.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] libvirt rbd issue

2015-09-03 Thread Vasiliy Angapov
And what should those of us on systemd do? systemd totally ignores
limits.conf and manages limits on a per-service basis...
Which services should actually be tuned WRT LimitNOFILE?
Or should DefaultLimitNOFILE be increased in /etc/systemd/system.conf?
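
For the record, a per-service drop-in looks cleaner to me than raising
DefaultLimitNOFILE globally. A sketch for libvirtd, assuming that is the unit
which actually spawns the qemu processes holding the RBD connections:

mkdir -p /etc/systemd/system/libvirtd.service.d
cat > /etc/systemd/system/libvirtd.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF
systemctl daemon-reload
systemctl restart libvirtd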

Thanks in advance!

2015-09-03 17:46 GMT+08:00 Jan Schermer :

> You're like the 5th person here (including me) that was hit by this.
>
> Could I get some input from someone using CEPH with RBD and thousands of
> OSDs? How high did you have to go?
>
> I only have ~200 OSDs and I had to bump the limit up to 1 for VMs that
> have multiple volumes attached, this doesn't seem right? I understand this
> is the effect of striping a volume accross multiple PGs, but shouldn't this
> be more limited or somehow garbage collected?
>
> And to get deeper - I suppose there will be one connection from QEMU to
> OSD for each NCQ queue? Or how does this work? blk-mq will likely be
> different again... Or is it decoupled from the virtio side of things by RBD
> cache if that's enabled?
>
> Anyway, out of the box, at least on OpenStack installations
> 1) anyone having more than a few OSDs should really bump this up by
> default.
> 2) librbd should handle this situation gracefully by recycling
> connections, instead of hanging
> 3) at least we should get a warning somewhere (in the libvirt/qemu log) -
> I don't think there's anything when the issue hits
>
> Should I make tickets for this?
>
> Jan
>
> On 03 Sep 2015, at 02:57, Rafael Lopez  wrote:
>
> Hi Jan,
>
> Thanks for the advice, hit the nail on the head.
>
> I checked the limits and watched the number of fds, and as it reached the
> soft limit (1024), that's when the transfer came to a grinding halt and the
> VM started locking up.
>
> After your reply I also did some more googling and found another old
> thread:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/026187.html
>
> I increased the max_files in qemu.conf and restarted libvirtd and the VM
> (as per Dan's solution in thread above), and now it seems to be happy
> copying any size files to the rbd. Confirmed the fd count is going past the
> previous soft limit of 1024 also.
>
> Thanks again!!
> Raf
>
> On 2 September 2015 at 18:44, Jan Schermer  wrote:
>
>> 1) Take a look at the number of file descriptors the QEMU process is
>> using, I think you are over the limits
>>
>> pid=pid of qemu process
>>
>> cat /proc/$pid/limits
>> echo /proc/$pid/fd/* | wc -w
>>
>> 2) Jumbo frames may be the cause, are they enabled on the rest of the
>> network? In any case, get rid of NetworkManager ASAP and set it manually,
>> though it looks like your NIC might not support them.
>>
>> Jan
>>
>>
>>
>> > On 02 Sep 2015, at 01:44, Rafael Lopez  wrote:
>> >
>> > Hi ceph-users,
>> >
>> > Hoping to get some help with a tricky problem. I have a rhel7.1 VM
>> guest (host machine also rhel7.1) with root disk presented from ceph
>> 0.94.2-0 (rbd) using libvirt.
>> >
>> > The VM also has a second rbd for storage presented from the same ceph
>> cluster, also using libvirt.
>> >
>> > The VM boots fine, no apparent issues with the OS root rbd. I am able
>> to mount the storage disk in the VM, and create a file system. I can even
>> transfer small files to it. But when I try to transfer a moderate size
>> files, eg. greater than 1GB, it seems to slow to a grinding halt and
>> eventually it locks up the whole system, and generates the kernel messages
>> below.
>> >
>> > I have googled some *similar* issues around, but haven't come across
>> some solid advice/fix. So far I have tried modifying the libvirt disk cache
>> settings, tried using the latest mainline kernel (4.2+), different file
>> systems (ext4, xfs, zfs) all produce similar results. I suspect it may be
>> network related, as when I was using the mainline kernel I was transferring
>> some files to the storage disk and this message came up, and the transfer
>> seemed to stop at the same time:
>> >
>> > Sep  1 15:31:22 nas1-rds NetworkManager[724]: 
>> [1441085482.078646] [platform/nm-linux-platform.c:2133] sysctl_set():
>> sysctl: failed to set '/proc/sys/net/ipv6/conf/eth0/mtu' to '9000': (22)
>> Invalid argument
>> >
>> > I think maybe the key info to troubleshooting is that it seems to be OK
>> for files under 1GB.
>> >
>> > Any ideas would be appreciated.
>> >
>> > Cheers,
>> > Raf
>> >
>> >
>> > Sep  1 16:04:15 nas1-rds kernel: INFO: task kworker/u8:1:60 blocked for
>> more than 120 seconds.
>> > Sep  1 16:04:15 nas1-rds kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > Sep  1 16:04:15 nas1-rds kernel: kworker/u8:1D 88023fd93680
>>  060  2 0x
>> > Sep  1 16:04:15 nas1-rds kernel: Workqueue: writeback
>> bdi_writeback_workfn (flush-252:80)
>> > Sep  1 16:04:15 nas1-rds kernel: 880230c136b0 0046
>> 8802313c4440 880230c13fd8
>> > Sep  1 16:04:15 nas1-rds kernel: 880230c13fd8 880230c13fd8
>> 8802313c4440 88023fd93f48

[ceph-users] Ceph makes syslog full

2015-09-03 Thread Vasiliy Angapov
Hi!

I have Ceph Hammer 0.94.3 with 72 OSDs and 6 nodes. My problem is that
my /var/log/messages is constantly being filled with messages like these:

Sep  3 11:16:31 slpeah001 ceph-osd: 2015-09-03 11:16:31.393234
7f5a6bfd3700 -1 osd.34 2991 heartbeat_check: no reply from osd.68
since back 2015-09-03 11:16:25.491465 front 2015-09-03 11:16:25.491465
(cutoff 2015-09-03 11:16:26.392797)
Sep  3 11:16:31 slpeah001 ceph-osd: 2015-09-03 11:16:31.393239
7f5a6bfd3700 -1 osd.34 2991 heartbeat_check: no reply from osd.69
since back 2015-09-03 11:16:25.491465 front 2015-09-03 11:16:25.491465
(cutoff 2015-09-03 11:16:26.392797)
Sep  3 11:16:31 slpeah001 ceph-osd: 2015-09-03 11:16:31.393244
7f5a6bfd3700 -1 osd.34 2991 heartbeat_check: no reply from osd.70
since back 2015-09-03 11:16:25.491465 front 2015-09-03 11:16:25.491465
(cutoff 2015-09-03 11:16:26.392797)
Sep  3 11:16:31 slpeah001 ceph-osd: 2015-09-03 11:16:31.393249
7f5a6bfd3700 -1 osd.34 2991 heartbeat_check: no reply from osd.71
since back 2015-09-03 11:16:25.491465 front 2015-09-03 11:16:25.491465
(cutoff 2015-09-03 11:16:26.392797)

These messages keep coming until my /var/log filesystem is full.
What do they mean? Is this a sign of a network issue?
How can I get rid of them? The growth rate is as high as several GB a day.

I tried to tune my configuration WRT logging:
[global]
log_to_syslog = false
log_to_stderr = false
mon_cluster_log_file = /dev/null
mon_cluster_log_to_syslog = false
mon_cluster_log_file_level = warn
mon_cluster_log_to_syslog_level = warn
clog_to_syslog_level = warn
debug lockdep = 0/0
debug context = 0/0
debug buffer = 0/0
debug timer = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug ms = 0/0
debug monc = 0/0
debug throttle = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_rgw = 0/0
debug_civetweb = 0/0
debug_javaclient = 0/0

But that didn't help. What else can I do?
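
One stopgap I can think of, just to keep /var/log from filling up while the
root cause is investigated, is to drop these particular lines at the syslog
layer - a sketch, assuming rsyslog 7 or newer and that the rule file sorts
before the default rules:

# /etc/rsyslog.d/05-ceph-heartbeat.conf
if $programname == 'ceph-osd' and $msg contains 'heartbeat_check: no reply' then stop

(and restart rsyslog afterwards). The underlying heartbeat failures are of
course still worth understanding.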

Regards, Vasily.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] НА: Rename Ceph cluster

2015-08-19 Thread Vasiliy Angapov
Thanks to all!

Everything worked like a charm (rough commands sketched after the list):
1) Stopped the cluster (I guess it's faster than moving OSDs one by one)
2) Unmounted OSDs and fixed fstab entries for them
3) Renamed the MON and OSD folders
4) Renamed config file and keyrings, fixed paths to keyrings in config
5) Mounted OSDs back (mount -a)
6) Started everything
7) Fixed path to config in nova.conf, cinder.conf and glance-api.conf
all over the OpenStack
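
Roughly, the commands were along these lines (a sketch from memory, assuming
the old name "ceph-prod", the usual /var/lib/ceph layout, and that all daemons
are already stopped):

umount /var/lib/ceph/osd/ceph-prod-*
for d in /var/lib/ceph/osd/ceph-prod-*; do mv "$d" "${d/ceph-prod-/ceph-}"; done
# same rename for /var/lib/ceph/mon/ceph-prod-<id> on the monitor nodes
mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf
mv /etc/ceph/ceph-prod.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
# fix the fstab entries and the keyring paths referenced in ceph.conf, then:
mount -a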

Everything works as expected. It took about half an hour to do all the job.
Again, thanks to all!

Regards, Vasily.

2015-08-19 15:10 GMT+08:00 Межов Игорь Александрович :
> Hi!
>
> I think that renaming a cluster is not just moving the config file. We tried
> to change the name of a test Hammer
> cluster created with ceph-deploy and ran into some issues.
>
> In a default install, the names of many parts are derived from the cluster
> name. For example, cephx keys are
> stored not in "ceph.client.admin.keyring", but
> "$CLUSTERNAME.client.admin.keyring", so we
> have to rename keyrings also.
>
> The same thing is for OSD/MON mounting points: instead
> /var/lib/ceph/osd/ceph-$OSDNUM,
> after renaming cluster, daemons try to run OSD from
> /var/lib/ceph/osd/$CLUSTERNAME-$OSDNUM.
> Of course, there are no such mountpoints and we manually create them, mount
> fs and re-run OSDs.
>
> There is one unresolved issue with udev rules: after node reboot,
> filesystems are mounted by udev
> into the old mountpoints. As the cluster is for testing - this is not a big
> thing.
>
> So, be careful when renaming a production or loaded cluster.
>
> PS: All above is my IMHO and I may be wrong. ;)
>
> Megov Igor
> CIO, Yuterra
>
>
> 
> From: ceph-users  on behalf of Jan Schermer
> 
> Sent: 18 August 2015 15:18
> To: Erik McCormick
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Rename Ceph cluster
>
> I think it's pretty clear:
>
> http://ceph.com/docs/master/install/manual-deployment/
>
> "For example, when you run multiple clusters in a federated architecture,
> the cluster name (e.g., us-west, us-east) identifies the cluster for the
> current CLI session. Note: To identify the cluster name on the command line
> interface, specify the a Ceph configuration file with the cluster name
> (e.g., ceph.conf, us-west.conf, us-east.conf, etc.). Also see CLI usage
> (ceph --cluster {cluster-name})."
>
> But it could be tricky on the OSDs that are running, depending on the
> distribution initscripts - you could find out that you can't "service ceph
> stop osd..." anymore after the change, since it can't find its pidfile
> anymore. Looking at Centos initscript it looks like it accepts "-c conffile"
> argument though.
> (So you should be managins OSDs with "-c ceph-prod.conf" now?)
>
> Jan
>
>
> On 18 Aug 2015, at 14:13, Erik McCormick  wrote:
>
> I've got a custom named cluster integrated with Openstack (Juno) and didn't
> run into any hard-coded name issues that I can recall. Where are you seeing
> that?
>
> As to the name change itself, I think it's really just a label applying to a
> configuration set. The name doesn't actually appear *in* the configuration
> files. It stands to reason you should be able to rename the configuration
> files on the client side and leave the cluster alone. It'd be worth trying in
> a test environment anyway.
>
> -Erik
>
> On Aug 18, 2015 7:59 AM, "Jan Schermer"  wrote:
>>
>> This should be simple enough
>>
>> mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf
>>
>> No? :-)
>>
>> Or you could set this in nova.conf:
>> images_rbd_ceph_conf=/etc/ceph/ceph-prod.conf
>>
>> Obviously since different parts of openstack have their own configs, you'd
>> have to do something similiar for cinder/glance... so not worth the hassle.
>>
>> Jan
>>
>> > On 18 Aug 2015, at 13:50, Vasiliy Angapov  wrote:
>> >
>> > Hi,
>> >
>> > Does anyone know what steps should be taken to rename a Ceph cluster?
>> > Btw, is it ever possible without data loss?
>> >
>> > Background: I have a cluster named "ceph-prod" integrated with
>> > OpenStack, however I found out that the default cluster name "ceph" is
>> > very much hardcoded into OpenStack so I decided to change it to the
>> > default value.
>> >
>> > Regards, Vasily.
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Rename Ceph cluster

2015-08-18 Thread Vasiliy Angapov
Hi,

Does anyone know what steps should be taken to rename a Ceph cluster?
Btw, is it ever possible without data loss?

Background: I have a cluster named "ceph-prod" integrated with
OpenStack, however I found out that the default cluster name "ceph" is
very much hardcoded into OpenStack so I decided to change it to the
default value.

Regards, Vasily.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Build latest KRBD module

2015-06-19 Thread Vasiliy Angapov
Hi, guys!

Do we have any procedure on how to build the latest KRBD module? I think it
will be helpful to many people here.

Regards, Vasily.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] calculating maximum number of disk and node failure that can be handled by cluster with out data loss

2015-06-11 Thread Vasiliy Angapov
I wrote a script which calculates a data durability SLA depending on many
factors like disk size, network speed, number of hosts etc.
It assumes a recovery time three times longer than strictly needed, to
account for client IO having priority over recovery IO.
For 2 TB disks and a 10G network it paints a bright picture:
OSDs: 10    SLA: 100.00%
OSDs: 20    SLA: 100.00%
OSDs: 30    SLA: 100.00%
OSDs: 40    SLA: 100.00%
OSDs: 50    SLA: 100.00%
OSDs: 100   SLA: 100.00%
OSDs: 200   SLA: 100.00%
OSDs: 500   SLA: 99.99%

For a 1G network it shows 7-8 nines in every line. So if my estimates are
correct, then we are almost safe from triple-failure data loss.
The script is attached. Any good criticism is welcome.

Regards, Vasily.


On Thu, Jun 11, 2015 at 3:37 AM, Christian Balzer  wrote:

>
> Hello,
>
> On Wed, 10 Jun 2015 23:53:48 +0300 Vasiliy Angapov wrote:
>
> > Hi,
> >
> > I also wrote a simple script which calculates the data loss probabilities
> > for triple disk failure. Here are some numbers:
> > OSDs: 10,   Pr: 138.89%
> > OSDs: 20,   Pr: 29.24%
> > OSDs: 30,   Pr: 12.32%
> > OSDs: 40,   Pr: 6.75%
> > OSDs: 50,   Pr: 4.25%
> > OSDs: 100, Pr: 1.03%
> > OSDs: 200, Pr: 0.25%
> > OSDs: 500, Pr: 0.04%
> >
> Nice, good to have some numbers.
>
> > Here I assumed we have 100 PGs per OSD. Also there is a constraint that the
> > 3 disks must not be in one host, because that will not lead to a failure.
> > For the situation where all disks are evenly distributed between 10 hosts it
> > gives us a correction coefficient of 83%, so for 50 OSDs it will be
> > something like 3.53% instead of 4.25%.
> >
> > There is a further constraint for 2 disks in one host and 1 disk on
> > another, but that just adds unneeded complexity. The numbers will not
> > change significantly.
> > And actually a triple simultaneous failure is itself not very likely to
> > happen, so I believe that starting from 100 OSDs we can relax somewhat
> > about data loss.
> >
> I mentioned the link below before, I found that to be one of the more
> believable RAID failure calculators and they explain their shortcomings
> nicely to boot.
> I usually half their DLO/year values (double the chance of data loss) to
> be on the safe side: https://www.memset.com/tools/raid-calculator/
>
> If you plunk in a 100 disk RAID6 (the equivalent of replica 3) and 2TB per
> disk with a recovery rate of 100MB/s the odds are indeed pretty good.
> But note the expected disk failure rate of one per 10 days!
>
> Of course the the biggest variable here is how fast your recovery speed
> will be. I picked 100MB/s, because for some people that will be as fast as
> their network goes. For others their network could be 10-40 times as
> fast, but their cluster might not have enough OSDs (or fast enough ones) to
> remain usable at those speeds, so they'll opt for lower priority
> recovery speeds.
>
> Christian
>
> > BTW, this presentation has more math
> >
> http://www.slideshare.net/kioecn/build-an-highperformance-and-highdurable-block-storage-service-based-on-ceph
> >
> > Regards, Vasily.
> >
> > On Wed, Jun 10, 2015 at 12:38 PM, Dan van der Ster 
> > wrote:
> >
> > > OK I wrote a quick script to simulate triple failures and count how
> > > many would have caused data loss. The script gets your list of OSDs
> > > and PGs, then simulates failures and checks if any permutation of that
> > > failure matches a PG.
> > >
> > > Here's an example with 10000 simulations on our production cluster:
> > >
> > > # ./simulate-failures.py
> > > We have 1232 OSDs and 21056 PGs, hence 21056 combinations e.g. like
> > > this: (945, 910, 399)
> > > Simulating 10000 failures
> > > Simulated 1000 triple failures. Data loss incidents = 0
> > > Data loss incident with failure (676, 451, 931)
> > > Simulated 2000 triple failures. Data loss incidents = 1
> > > Simulated 3000 triple failures. Data loss incidents = 1
> > > Simulated 4000 triple failures. Data loss incidents = 1
> > > Simulated 5000 triple failures. Data loss incidents = 1
> > > Simulated 6000 triple failures. Data loss incidents = 1
> > > Simulated 7000 triple failures. Data loss incidents = 1
> > > Simulated 8000 triple failures. Data loss incidents = 1
> > > Data loss incident with failure (1031, 1034, 806)
> > > Data loss incident with failure (449, 644, 329)
> > > Simulated 9000 triple failures. Data loss incidents = 3
> > > Simulated 10000 triple failures. Data loss incide

Re: [ceph-users] calculating maximum number of disk and node failure that can be handled by cluster with out data loss

2015-06-10 Thread Vasiliy Angapov
Hi,

I also wrote a simple script which calculates the data loss probabilities
for triple disk failure. Here are some numbers:
OSDs: 10,   Pr: 138.89%
OSDs: 20,   Pr: 29.24%
OSDs: 30,   Pr: 12.32%
OSDs: 40,   Pr: 6.75%
OSDs: 50,   Pr: 4.25%
OSDs: 100, Pr: 1.03%
OSDs: 200, Pr: 0.25%
OSDs: 500, Pr: 0.04%

Here I assumed we have 100 PGs per OSD. Also there is a constraint that the
3 disks must not be in one host, because that will not lead to a failure. For
the situation where all disks are evenly distributed between 10 hosts it gives
us a correction coefficient of 83%, so for 50 OSDs it will be something like
3.53% instead of 4.25%.

There is a further constraint for 2 disks in one host and 1 disk on another,
but that just adds unneeded complexity. The numbers will not change
significantly.
And actually a triple simultaneous failure is itself not very likely to
happen, so I believe that starting from 100 OSDs we can relax somewhat
about data loss.
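
The back-of-the-envelope formula behind such numbers is essentially the one
Dan uses below: the chance that a random triple of failed OSDs coincides with
some PG's acting set is roughly nPGs / C(nOSDs, 3); the exact figures of
course depend on how "100 PGs per OSD" is translated into distinct OSD
triples. E.g. with Dan's 200 OSDs and 16384 PGs:

python -c 'n=200.0; pgs=16384; print(100*pgs/(n*(n-1)*(n-2)/6))'
# -> about 1.25 (%)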

BTW, this presentation has more math
http://www.slideshare.net/kioecn/build-an-highperformance-and-highdurable-block-storage-service-based-on-ceph

Regards, Vasily.

On Wed, Jun 10, 2015 at 12:38 PM, Dan van der Ster 
wrote:

> OK I wrote a quick script to simulate triple failures and count how
> many would have caused data loss. The script gets your list of OSDs
> and PGs, then simulates failures and checks if any permutation of that
> failure matches a PG.
>
> Here's an example with 10000 simulations on our production cluster:
>
> # ./simulate-failures.py
> We have 1232 OSDs and 21056 PGs, hence 21056 combinations e.g. like
> this: (945, 910, 399)
> Simulating 10000 failures
> Simulated 1000 triple failures. Data loss incidents = 0
> Data loss incident with failure (676, 451, 931)
> Simulated 2000 triple failures. Data loss incidents = 1
> Simulated 3000 triple failures. Data loss incidents = 1
> Simulated 4000 triple failures. Data loss incidents = 1
> Simulated 5000 triple failures. Data loss incidents = 1
> Simulated 6000 triple failures. Data loss incidents = 1
> Simulated 7000 triple failures. Data loss incidents = 1
> Simulated 8000 triple failures. Data loss incidents = 1
> Data loss incident with failure (1031, 1034, 806)
> Data loss incident with failure (449, 644, 329)
> Simulated 9000 triple failures. Data loss incidents = 3
> Simulated 10000 triple failures. Data loss incidents = 3
>
> End of simulation: Out of 10000 triple failures, 3 caused a data loss
> incident
>
>
> The script is here:
>
> https://github.com/cernceph/ceph-scripts/blob/master/tools/durability/simulate-failures.py
> Give it a try (on your test clusters!)
>
> Cheers, Dan
>
>
>
>
>
> On Wed, Jun 10, 2015 at 10:47 AM, Jan Schermer  wrote:
> > Yeah, I know but I believe it was fixed so that a single copy is
> sufficient for recovery now (even with min_size=1)? Depends on what you
> want to achieve...
> >
> > The point is that even if we lost “just” 1% of data, that’s too much
> (>0%) when talking about customer data, and I know from experience that
> some volumes are unavailable when I lose 3 OSDs -  and I don’t have that
> many volumes...
> >
> > Jan
> >
> >> On 10 Jun 2015, at 10:40, Dan van der Ster  wrote:
> >>
> >> I'm not a mathematician, but I'm pretty sure there are 200 choose 3 =
> >> 1.3 million ways you can have 3 disks fail out of 200. nPGs = 16384 so
> >> that many combinations would cause data loss. So I think 1.2% of
> >> triple disk failures would lead to data loss. There might be another
> >> factor of 3! that needs to be applied to nPGs -- I'm currently
> >> thinking about that.
> >> But you're right, if indeed you do ever lose an entire PG, _every_ RBD
> >> device will have random holes in their data, like swiss cheese.
> >>
> >> BTW PGs can have stuck IOs without losing all three replicas -- see
> min_size.
> >>
> >> Cheers, Dan
> >>
> >> On Wed, Jun 10, 2015 at 10:20 AM, Jan Schermer  wrote:
> >>> When you increase the number of OSDs, you generally would (and should)
> increase the number of PGs. For us, the sweet spot for ~200 OSDs is 16384
> PGs.
> >>> RBD volume that has xxx GiBs of data gets striped across many PGs, so
> >>> the probability that the volume loses at least part of its data is very
> significant.
> >>> Someone correct me if I’m wrong, but I _know_ (from sad experience)
> that with the current CRUSH map if 3 disks fail in 3 different hosts, lots
> of instances (maybe all of them) have their IO stuck until 3 copies of data
> are restored.
> >>>
> >>> I just tested that by hand
> >>> a 150GB volume will consist of ~150000/4 = 37500 objects
> >>> When I list their location with “ceph osd map”, every time I get a
> different pg, and a random mix of osds that host the PG.
> >>>
> >>> Thus, it is very likely that this volume will be lost when I lose any
> 3 osds, as at least one of the pgs will be hosted on all of them. What this
> probability is I don’t know - (I’m not good at statistics, is it
> combinations?) - but generally the data I care most about is stored in a
> mu

[ceph-users] CRUSH algoritm and recovery time

2015-06-06 Thread Vasiliy Angapov
Hi all,

I have a theoretical question about Ceph. I've recently watched
presentation here:
http://www.slideshare.net/kioecn/build-an-highperformance-and-highdurable-block-storage-service-based-on-ceph
and did not completely understand the idea behind slides #46-56.

In particular, if we have a pool consisting of 24 OSDs and 1 OSD is lost,
how does tuning CRUSH influence the recovery time in that case? I saw the
optimized CRUSH map in the presentation (slide #55) but did not understand
how osd domains and replica domains reduce the recovery time in comparison
with the default map.

Can someone explain me the idea please?

Best Regards, Vasily
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD IO performance

2015-05-26 Thread Vasiliy Angapov
Hi,

I guess the author here means that for random loads even a 100 Mbit network
should allow 2500-3000 IOPS with 4k blocks (100 Mbit/s is roughly 12 MB/s,
i.e. about 3000 x 4 KiB operations per second).
So the complaint is reasonable, I suppose.

Regards, Vasily.

On Tue, May 26, 2015 at 5:27 PM, Karsten Heymann 
wrote:

> Hi ,
>
> you should definitely increase the speed of the network. 100Mbit/s is
> way too slow for all use cases I could think of, as it results in a
> maximum data transfer of less than 10 Mbyte per second, which is
> slower than a usb 2.0 thumb drive.
>
> Best,
> Karsten
>
> 2015-05-26 15:53 GMT+02:00 lixuehui...@126.com :
> >
> > Hi ALL:
> > I've built a Ceph 0.8 cluster of 2 nodes, each containing 5 SSD OSDs,
> > with a 100MB/s network. Testing an rbd device with the default
> > configuration, the result is not ideal. Apart from the random r/w
> > capability of the SSDs, what should be changed to get better performance?
> >
> > 2 nodes with 5 SSD OSDs each, 1 mon, 32GB RAM
> > 100MB/s network
> > and right now the overall IOPS is just 500. Should we change the filestore
> > or journal part? Thanks for any help!
> >
> > 
> > lixuehui...@126.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] iSCSI ceph rbd

2015-05-22 Thread Vasiliy Angapov
Hi, Ariel, gentlemen,

I have the same question, but with regard to multipath. Is it possible to
just export an iSCSI target on each Ceph node and use multipath on the client
side?
Could that possibly lead to data inconsistency?

Regards, Vasily.


On Fri, May 22, 2015 at 12:59 PM, Gerson Ariel  wrote:

> I apologize beforehand for not using a more descriptive subject for my
> question.
>
>
>
> On Fri, May 22, 2015 at 4:55 PM, Gerson Ariel 
> wrote:
>
>> Our hardware is like this: three identical servers with 8 OSD disks, 1 SSD
>> disk as journal, 1 disk for the OS, 32GB of ECC RAM, and 4 Gbit copper
>> Ethernet. We have been running this cluster since February 2015 and most of
>> the time the system load is not too great - lots of idle time.
>>
>> Right now we have a node that mounts rbd blocks and exports them as NFS. It
>> works quite well, but at the cost of one extra node as a bridge between the
>> storage clients (VMs) and the storage provider cluster (ceph OSDs and MONs).
>>
>> What I want to know is: is there any reason why I shouldn't mount rbd disks
>> on one of the servers, the ones that also run the OSD and MON daemons, and
>> export them as NFS or iSCSI? Assuming that I have already done my homework
>> to make my setup highly available using pacemaker (e.g. floating IP,
>> iSCSI/NFS resources), wouldn't something like this be better, as it is more
>> reliable? I.e. I remove the middle-man node(s), so I only have to worry
>> about the Ceph nodes and the VM hosts.
>>
>> Thank you
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] ceph same rbd on multiple client

2015-05-21 Thread Vasiliy Angapov
CephFS is, I believe, not yet production-ready. Use a production-quality
clustered filesystem, or consider using NFS or Samba shares.
The exact setup depends on what you need.

Cheers, Vasily.


On Thu, May 21, 2015 at 6:47 PM, gjprabu  wrote:

> Hi Angapov,
>
> I have seen the message below on the official Ceph site. How is it
> considered for use in production?
>
> "Important: CephFS currently lacks a robust ‘fsck’ check and repair
> function. Please use caution when storing important data as the disaster
> recovery tools are still under development. For more information about
> using CephFS today, see CephFS for early adopters"
>
> Regards
> Prabu
>
> Regards
> G.J
>
>
>  On Thu, 21 May 2015 19:57:09 +0530 * anga...@gmail.com
>  * wrote 
>
> Hi, Prabu!
>
> This behavior is expected because you are using a non-clustered filesystem
> (ext4 or xfs or whatever), which is not designed to be mounted on multiple
> hosts at the same time.
> What's more - you can lose data when doing this. That's the nature of
> local filesystems.
> So if you need to access a filesystem simultaneously from two or more hosts -
> consider using a clustered filesystem like CephFS, OCFS, etc., or network
> mounts like NFS or Samba.
>
> Regards, Vasily.
>
> On Thu, May 21, 2015 at 5:10 PM, gjprabu  wrote:
>
> Hi All,
>
> We are using rbd and map the same rbd image to an rbd device on
> two different clients, but I can't see the data until I umount and re-mount
> the partition. Kindly share a solution for this issue.
>
> *Example*
> create rbd image named foo
> map foo to /dev/rbd0 on server A,   mount /dev/rbd0 to /mnt
> map foo to /dev/rbd0 on server B,   mount /dev/rbd0 to /mnt
>
> Regards
> Prabu
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>


Re: [ceph-users] ceph same rbd on multiple client

2015-05-21 Thread Vasiliy Angapov
Hi, Prabu!

This behavior is expected because you are using a non-clustered filesystem
(ext4 or xfs or whatever), which is not designed to be mounted on multiple
hosts at the same time.
What's more - you can lose data when doing this. That's the nature of
local filesystems.
So if you need to access a filesystem simultaneously from two or more hosts -
consider using a clustered filesystem like CephFS, OCFS, etc., or network
mounts like NFS or Samba.

Regards, Vasily.

On Thu, May 21, 2015 at 5:10 PM, gjprabu  wrote:

> Hi All,
>
> We are using rbd and map the same rbd image to an rbd device on
> two different clients, but I can't see the data until I umount and re-mount
> the partition. Kindly share a solution for this issue.
>
> *Example*
> create rbd image named foo
> map foo to /dev/rbd0 on server A,   mount /dev/rbd0 to /mnt
> map foo to /dev/rbd0 on server B,   mount /dev/rbd0 to /mnt
>
> Regards
> Prabu
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] Write freeze when writing to rbd image and rebooting one of the nodes

2015-05-15 Thread Vasiliy Angapov
  1.0

And here are the stats of the pool i was using for tests:

root@iclcompute4:# ceph osd pool get Gold crush_ruleset
crush_ruleset: 0
root@iclcompute4:# ceph osd pool get Gold size
size: 3
root@iclcompute4:# ceph osd pool get Gold min_size
min_size: 1

The IO freeze happens whether I add or remove a host with 2 OSDs. I just did
it with the standard manual Ceph procedure at
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/ to be
sure that scripting mistakes are not involved. The freeze lasts until the
cluster says OK, then IO resumes.
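
For what it's worth, a minimal sketch of how the freeze window could be
correlated with the cluster state - purely illustrative and not part of the
original test; it assumes the ceph CLI is available on the node and simply
logs every change in the output of "ceph health" (stop it with Ctrl-C):

    import subprocess, time

    # Poll "ceph health" once a second and log state transitions, so the
    # period of blocked client IO can be compared against the
    # HEALTH_OK -> HEALTH_WARN/ERR -> HEALTH_OK window seen by the monitors.
    last = None
    while True:
        out = subprocess.run(["ceph", "health"], capture_output=True, text=True)
        state = out.stdout.strip() or out.stderr.strip()
        if state != last:
            print(time.strftime("%H:%M:%S"), state)
            last = state
        time.sleep(1)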

Regards, Vasily.



On Thu, May 14, 2015 at 6:44 PM, Robert LeBlanc 
wrote:

> Can you provide the output of the CRUSH map and a copy of the script that you 
> are using to add the OSDs? Can you also provide the pool size and pool 
> min_size?
>
>
> 
> Robert LeBlanc
> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> On Thu, May 14, 2015 at 6:33 AM, Vasiliy Angapov 
> wrote:
>
>> Thanks, Robert, for sharing so much experience! I feel like I don't
>> deserve it :)
>>
>> I have another but very similar situation which I don't understand.
>> Last time I tried to hard-kill the OSD daemons.
>> This time I add a new node with 2 OSDs to my cluster and also monitor the
>> IO. I wrote a script which adds a node with OSDs fully automatically. And it
>> seems like when I start the script, IO is also blocked until the
>> cluster shows HEALTH_OK, which takes quite an amount of time. After Ceph
>> status is OK, copying resumes.
>>
>> What should I tune this time to avoid a long IO interruption?
>>
>> Thanks in advance again :)
>>
>
>


Re: [ceph-users] Write freeze when writing to rbd image and rebooting one of the nodes

2015-05-14 Thread Vasiliy Angapov
Thanks, Robert, for sharing so much experience! I feel like I don't deserve
it :)

I have another but very similar situation which I don't understand.
Last time I tried to hard-kill the OSD daemons.
This time I add a new node with 2 OSDs to my cluster and also monitor the
IO. I wrote a script which adds a node with OSDs fully automatically. And it
seems like when I start the script, IO is also blocked until the
cluster shows HEALTH_OK, which takes quite an amount of time. After Ceph
status is OK, copying resumes.

What should I tune this time to avoid a long IO interruption?

Thanks in advance again :)


Re: [ceph-users] Write freeze when writing to rbd image and rebooting one of the nodes

2015-05-13 Thread Vasiliy Angapov
rmance. 
> I've had my fair share of outages on these where failovers did not work 
> properly or some bug that has left their engineers scratching their heads for 
> years, never to be fixed.
>
> - 
> Robert LeBlanc
> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> On Wed, May 13, 2015 at 10:29 AM, Vasiliy Angapov  wrote:
> Thanks, Sage!
>
> In the meanwhile I asked the same question in #Ceph IRC channel and Be_El 
> gave me exactly the same answer, which helped.
> I also realized that in
> http://ceph.com/docs/master/rados/configuration/mon-osd-interaction/ it is
> stated: "You may change this grace period by adding an osd heartbeat grace
> setting under the [osd] section of your Ceph configuration file, or by
> setting the value at runtime." But in reality you must add this option to
> the [global] section. Setting this value in the [osd] section only
> influenced the osd daemons, but not the monitors.
>
> Anyway, now IO resumes after only 5 seconds freeze. Thanks for help, guys!
>
> Regarding Ceph failure detection: in a real environment it seems to me that
> 20-30 seconds of freeze after a single storage node outage is very expensive,
> even when we talk about data consistency... 5 seconds is an acceptable
> threshold.
>
> But, Sage, can you please explain briefly what the drawbacks of lowering
> the timeout are? If, for example, I have a stable 10 gig cluster network
> which is not likely to lag or drop - is 5 seconds dangerous at all? How
> can OSDs report false positives in that case?
>
> Thanks in advance :)
>
> On Wed, May 13, 2015 at 7:05 PM, Sage Weil  wrote:
> On Wed, 13 May 2015, Vasiliy Angapov wrote:
> > Hi,
> >
> > Well, I've managed to find out that a clean stop of an osd causes no IO
> > downtime (/etc/init.d/ceph stop osd). But that cannot be called fault
> > tolerance, which is what Ceph is supposed to provide. However, "killall -9
> > ceph-osd" causes IO to stop for about 20 seconds.
> >
> > I've tried lowering some timeouts but without luck. Here is a related part
> > of my ceph.conf after lowering the timeout values:
> >
> > [global]
> > heartbeat interval = 5
> > mon osd down out interval = 90
> > mon pg warn max per osd = 2000
> > mon osd adjust heartbeat grace = false
> >
> > [client]
> > rbd cache = false
> >
> > [mon]
> > mon clock drift allowed = .200
> > mon osd min down reports = 1
> >
> > [osd]
> > osd heartbeat interval = 3
> > osd heartbeat grace = 5
> >
> > Can you help me to reduce IO downtime somehow? Because 20 seconds for
> > production is just horrible.
>
> You'll need to restart ceph-osd daemons for that change to take effect, or
>
>  ceph tell osd.\* injectargs '--osd-heartbeat-grace 5 
> --osd-heartbeat-interval 1'
>
> Just remember that this timeout is a tradeoff against false positives--be
> careful tuning it too low.
>
> Note that ext4 going ro after 5 seconds sounds like insanity to me.  I've
> only seen this with older guest kernels, and iirc the problem is a
> 120s timeout with ide or something?
>
> Ceph is a CP system that trades availability for consistency--it will
> block IO as needed to ensure that it is handling reads or writes in a
> completely consistent manner.  Even if you get the failure detection
> latency down, other recovery scenarios are likely to cross the magic 5s
> threshold at some point and cause the same problem.  You need to fix your
> guests one way or another!
>
> sage
>
>
> >
> > Regards, Vasily.
> >
> >
> > On Wed, May 13, 2015 at 9:57 AM, Vasiliy Angapov  wrote:
> >   Thanks, Gregory!
> > My Ceph version is 0.94.1. What I'm trying to test is the worst
> > situation, when the node is losing network or becomes unresponsive. So
> > what I do is "killall -9 ceph-osd", then reboot.
> >
> > Well, I also tried to do a clean reboot several times (just a "reboot"
> > command), but i saw no difference - there is always an IO freeze for
> > about 30 seconds. Btw, i'm using Fedora 20 on all nodes.
> >
> > Ok, I will play with timeouts more.
> >
> > Thanks again!
> >
> > On Wed, May 13, 2015 at 10:46 AM, Gregory Farnum
> > wrote:
> >   On Tue, May 12, 2015 at 11:39 PM, Vasiliy Angapov
> >wrote:
> >   > Hi, colleagues!
> >   >
> >   > I'm testing a simple Ceph cluster in order to use it in
> >   production
> >   > environment. I have 8 OSDs (1Tb SATA  drives) which are
> 

Re: [ceph-users] Write freeze when writing to rbd image and rebooting one of the nodes

2015-05-13 Thread Vasiliy Angapov
Thanks, Sage!

In the meanwhile I asked the same question in #Ceph IRC channel and Be_El
gave me exactly the same answer, which helped.
I also realized that in
http://ceph.com/docs/master/rados/configuration/mon-osd-interaction/ it is
stated: "You may change this grace period by adding an osd heartbeat grace
setting under the [osd] section of your Ceph configuration file, or by
setting the value at runtime." But in reality you must add this option to
the [global] section. Setting this value in the [osd] section only
influenced the osd daemons, but not the monitors.

Anyway, now IO resumes after only 5 seconds freeze. Thanks for help, guys!

Regarding Ceph failure detection: in a real environment it seems to me that
20-30 seconds of freeze after a single storage node outage is very
expensive, even when we talk about data consistency... 5 seconds is an
acceptable threshold.

But, Sage, can you please explain briefly what the drawbacks of lowering
the timeout are? If, for example, I have a stable 10 gig cluster network
which is not likely to lag or drop - is 5 seconds dangerous at all?
How can OSDs report false positives in that case?

Thanks in advance :)
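
For completeness, a minimal sketch of pushing the lowered heartbeat settings
to all running OSDs at runtime, using the injectargs command Sage suggests in
the quoted reply below; the values are just the ones discussed in this thread,
and the persistent copy still has to live in the [global] section of ceph.conf:

    import subprocess

    # Apply the lowered failure-detection timeouts to all running OSDs at
    # runtime; this does not survive a daemon restart, so the same values
    # also need to be written into ceph.conf under [global].
    subprocess.run(
        ["ceph", "tell", "osd.*", "injectargs",
         "--osd-heartbeat-grace 5 --osd-heartbeat-interval 3"],
        check=True,
    )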

On Wed, May 13, 2015 at 7:05 PM, Sage Weil  wrote:

> On Wed, 13 May 2015, Vasiliy Angapov wrote:
> > Hi,
> >
> > Well, I've managed to find out that a clean stop of an osd causes no IO
> > downtime (/etc/init.d/ceph stop osd). But that cannot be called fault
> > tolerance, which is what Ceph is supposed to provide. However, "killall -9
> > ceph-osd" causes IO to stop for about 20 seconds.
> >
> > I've tried lowering some timeouts but without luck. Here is a related
> part
> > of my ceph.conf after lowering the timeout values:
> >
> > [global]
> > heartbeat interval = 5
> > mon osd down out interval = 90
> > mon pg warn max per osd = 2000
> > mon osd adjust heartbeat grace = false
> >
> > [client]
> > rbd cache = false
> >
> > [mon]
> > mon clock drift allowed = .200
> > mon osd min down reports = 1
> >
> > [osd]
> > osd heartbeat interval = 3
> > osd heartbeat grace = 5
> >
> > Can you help me to reduce IO downtime somehow? Because 20 seconds for
> > production is just horrible.
>
> You'll need to restart ceph-osd daemons for that change to take effect, or
>
>  ceph tell osd.\* injectargs '--osd-heartbeat-grace 5
> --osd-heartbeat-interval 1'
>
> Just remember that this timeout is a tradeoff against false positives--be
> careful tuning it too low.
>
> Note that ext4 going ro after 5 seconds sounds like insanity to me.  I've
> only seen this with older guest kernels, and iirc the problem is a
> 120s timeout with ide or something?
>
> Ceph is a CP system that trades availability for consistency--it will
> block IO as needed to ensure that it is handling reads or writes in a
> completely consistent manner.  Even if you get the failure detection
> latency down, other recovery scenarios are likely to cross the magic 5s
> threshold at some point and cause the same problem.  You need to fix your
> guests one way or another!
>
> sage
>
>
> >
> > Regards, Vasily.
> >
> >
> > On Wed, May 13, 2015 at 9:57 AM, Vasiliy Angapov 
> wrote:
> >   Thanks, Gregory!
> > My Ceph version is 0.94.1. What I'm trying to test is the worst
> > situation, when the node is losing network or becomes unresponsive. So
> > what I do is "killall -9 ceph-osd", then reboot.
> >
> > Well, I also tried to do a clean reboot several times (just a "reboot"
> > command), but i saw no difference - there is always an IO freeze for
> > about 30 seconds. Btw, i'm using Fedora 20 on all nodes.
> >
> > Ok, I will play with timeouts more.
> >
> > Thanks again!
> >
> > On Wed, May 13, 2015 at 10:46 AM, Gregory Farnum 
> > wrote:
> >   On Tue, May 12, 2015 at 11:39 PM, Vasiliy Angapov
> >wrote:
> >   > Hi, colleagues!
> >   >
> >   > I'm testing a simple Ceph cluster in order to use it in
> >   production
> >   > environment. I have 8 OSDs (1Tb SATA  drives) which are
> >   evenly distributed
> >   > between 4 nodes.
> >   >
> >   > I'v mapped rbd image on the client node and started
> >   writing a lot of data to
> >   > it. Then I just reboot one node and see what's
> >   happening. What happens is
> >   > very sad. I have a write freeze for about 20-30 seconds
> >   which is enough for
> >   > ext4 filesystem to switch to RO.
> >   >
> >   

Re: [ceph-users] Write freeze when writing to rbd image and rebooting one of the nodes

2015-05-13 Thread Vasiliy Angapov
Hi,

Well, I've managed to find out that a clean stop of an osd causes no IO
downtime (/etc/init.d/ceph stop osd). But that cannot be called fault
tolerance, which is what Ceph is supposed to provide.
However, "killall -9 ceph-osd" causes IO to stop for about 20 seconds.

I've tried lowering some timeouts but without luck. Here is a related part
of my ceph.conf after lowering the timeout values:

[global]
heartbeat interval = 5
mon osd down out interval = 90
mon pg warn max per osd = 2000
mon osd adjust heartbeat grace = false

[client]
rbd cache = false

[mon]
mon clock drift allowed = .200
mon osd min down reports = 1

[osd]
osd heartbeat interval = 3
osd heartbeat grace = 5

Can you help me to reduce IO downtime somehow? Because 20 seconds for
production is just horrible.

Regards, Vasily.


On Wed, May 13, 2015 at 9:57 AM, Vasiliy Angapov  wrote:

> Thanks, Gregory!
>
> My Ceph version is 0.94.1. What I'm trying to test is the worst situation,
> when the node is losing network or becomes unresponsive. So what I do is
> "killall -9 ceph-osd", then reboot.
>
> Well, I also tried to do a clean reboot several times (just a "reboot"
> command), but i saw no difference - there is always an IO freeze for about
> 30 seconds. Btw, i'm using Fedora 20 on all nodes.
>
> Ok, I will play with timeouts more.
>
> Thanks again!
>
> On Wed, May 13, 2015 at 10:46 AM, Gregory Farnum  wrote:
>
>> On Tue, May 12, 2015 at 11:39 PM, Vasiliy Angapov 
>> wrote:
>> > Hi, colleagues!
>> >
>> > I'm testing a simple Ceph cluster in order to use it in production
>> > environment. I have 8 OSDs (1Tb SATA  drives) which are evenly
>> distributed
>> > between 4 nodes.
>> >
>> > I'v mapped rbd image on the client node and started writing a lot of
>> data to
>> > it. Then I just reboot one node and see what's happening. What happens
>> is
>> > very sad. I have a write freeze for about 20-30 seconds which is enough
>> for
>> > ext4 filesystem to switch to RO.
>> >
>> > I wonder, if there is any way to minimize this lag? AFAIK, ext
>> filesystems
>> > have 5 seconds timeout before switching to RO. So is there any way to
>> get
>> > that lag beyond 5 secs? I've tried lowering different osd timeouts, but
>> it
>> > doesn't seem to help.
>> >
>> > How do you deal with such a situations? 20 seconds of downtime is not
>> > tolerable in production.
>>
>> What version of Ceph are you running, and how are you rebooting it?
>> Any newish version that gets a clean reboot will notify the cluster
>> that it's shutting down, so you shouldn't witness blocked writes
>> really at all.
>>
>> If you're doing a reboot that involves just ending the daemon, you
>> will have to wait through the timeout period before the OSD gets
>> marked down, which defaults to 30 seconds. This is adjustable (look
>> for docs on the "osd heartbeat grace" config option), although if you
>> set it too low you'll need to change a bunch of other timeouts which I
>> don't know off-hand...
>> -Greg
>>
>
>


Re: [ceph-users] Write freeze when writing to rbd image and rebooting one of the nodes

2015-05-12 Thread Vasiliy Angapov
Thanks, Gregory!

My Ceph version is 0.94.1. What I'm trying to test is the worst situation,
when the node is losing network or becomes unresponsive. So what I do is
"killall -9 ceph-osd", then reboot.

Well, I also tried to do a clean reboot several times (just a "reboot"
command), but I saw no difference - there is always an IO freeze for about
30 seconds. Btw, I'm using Fedora 20 on all nodes.

Ok, I will play with timeouts more.

Thanks again!

On Wed, May 13, 2015 at 10:46 AM, Gregory Farnum  wrote:

> On Tue, May 12, 2015 at 11:39 PM, Vasiliy Angapov 
> wrote:
> > Hi, colleagues!
> >
> > I'm testing a simple Ceph cluster in order to use it in production
> > environment. I have 8 OSDs (1Tb SATA  drives) which are evenly
> distributed
> > between 4 nodes.
> >
> > I'v mapped rbd image on the client node and started writing a lot of
> data to
> > it. Then I just reboot one node and see what's happening. What happens is
> > very sad. I have a write freeze for about 20-30 seconds which is enough
> for
> > ext4 filesystem to switch to RO.
> >
> > I wonder, if there is any way to minimize this lag? AFAIK, ext
> filesystems
> > have 5 seconds timeout before switching to RO. So is there any way to get
> > that lag beyond 5 secs? I've tried lowering different osd timeouts, but
> it
> > doesn't seem to help.
> >
> > How do you deal with such a situations? 20 seconds of downtime is not
> > tolerable in production.
>
> What version of Ceph are you running, and how are you rebooting it?
> Any newish version that gets a clean reboot will notify the cluster
> that it's shutting down, so you shouldn't witness blocked writes
> really at all.
>
> If you're doing a reboot that involves just ending the daemon, you
> will have to wait through the timeout period before the OSD gets
> marked down, which defaults to 30 seconds. This is adjustable (look
> for docs on the "osd heartbeat grace" config option), although if you
> set it too low you'll need to change a bunch of other timeouts which I
> don't know off-hand...
> -Greg
>


[ceph-users] Write freeze when writing to rbd image and rebooting one of the nodes

2015-05-12 Thread Vasiliy Angapov
Hi, colleagues!

I'm testing a simple Ceph cluster in order to use it in a production
environment. I have 8 OSDs (1TB SATA drives) which are evenly distributed
between 4 nodes.

I've mapped an rbd image on the client node and started writing a lot of data
to it. Then I just reboot one node and see what happens. What happens
is very sad: I get a write freeze for about 20-30 seconds, which is enough
for an ext4 filesystem to switch to RO.

I wonder if there is any way to minimize this lag? AFAIK, ext filesystems
have a 5 second timeout before switching to RO. So is there any way to get
that lag below 5 secs? I've tried lowering different osd timeouts, but it
doesn't seem to help.

How do you deal with such situations? 20 seconds of downtime is not
tolerable in production.
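
A minimal sketch of how the length of such a freeze can be measured from the
client side - purely illustrative and not part of the original post; the mount
path is a placeholder, and the script just timestamps small fsync'ed writes
and reports any gap longer than a couple of seconds (stop it with Ctrl-C):

    import os, time

    PATH = "/mnt/rbd/stall-probe"   # placeholder: a file on the rbd-backed mount
    THRESHOLD = 2.0                 # report any write that stalls longer than this

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT)
    prev = time.time()
    while True:
        os.write(fd, b"x" * 4096)   # rewrite the same 4k block each iteration
        os.fsync(fd)                # force the write down to the rbd device
        now = time.time()
        if now - prev > THRESHOLD:
            print("IO stalled for %.1fs at %s" % (now - prev, time.ctime(now)))
        prev = now
        time.sleep(0.1)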