[ceph-users] OSD not marked as down or out

2015-02-20 Thread Sudarshan Pathak
Hello everyone,

I have a cluster running with OpenStack. It has 6 OSDs (3 at each of 2
different locations). Each pool has a replication size of 3, with 2 copies at
the primary location and 1 copy at the secondary location.
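
For reference, the two-site layout and the CRUSH rule doing the 2+1 split can
be inspected with, for example:

#ceph osd tree
#ceph osd crush rule dump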

Everything is running as expected, but the OSDs are not marked as down when I
power off an OSD server. It has been around an hour now.
I have also tried changing the heartbeat settings.

Can someone point me in the right direction?

OSD 0 log
=========
2015-02-20 16:20:14.009723 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no
reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
16:15:54.607854 (cutoff 2015-02-20 16:19:54.009720)
2015-02-20 16:20:15.009908 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no
reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
16:15:54.607854 (cutoff 2015-02-20 16:19:55.009907)
2015-02-20 16:20:16.010123 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no
reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
16:15:54.607854 (cutoff 2015-02-20 16:19:56.010119)
2015-02-20 16:20:16.648167 7f3fc9a76700 -1 osd.0 451 heartbeat_check: no
reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
16:15:54.607854 (cutoff 2015-02-20 16:19:56.648165)


Ceph monitor log
================
2015-02-20 16:49:16.831548 7f416e4aa700  1 mon.storage1@1(leader).osd e455
prepare_failure osd.2 192.168.100.33:6800/24431 from osd.4
192.168.100.35:6800/1305 is reporting failure:1
2015-02-20 16:49:16.831593 7f416e4aa700  0 log_channel(cluster) log [DBG] :
osd.2 192.168.100.33:6800/24431 reported failed by osd.4
192.168.100.35:6800/1305
2015-02-20 16:49:17.080314 7f416e4aa700  1 mon.storage1@1(leader).osd e455
prepare_failure osd.2 192.168.100.33:6800/24431 from osd.3
192.168.100.34:6800/1358 is reporting failure:1
2015-02-20 16:49:17.080527 7f416e4aa700  0 log_channel(cluster) log [DBG] :
osd.2 192.168.100.33:6800/24431 reported failed by osd.3
192.168.100.34:6800/1358
2015-02-20 16:49:17.420859 7f416e4aa700  1 mon.storage1@1(leader).osd e455
prepare_failure osd.2 192.168.100.33:6800/24431 from osd.5
192.168.100.36:6800/1359 is reporting failure:1
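
For what it's worth, whether the monitor acts on such failure reports is
controlled by mon_osd_min_down_reporters / mon_osd_min_down_reports; the
values actually in effect on the leader can be checked on its host via the
admin socket, e.g.:

#ceph daemon mon.storage1 config show | grep mon_osd_min_down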


#ceph osd stat
 osdmap e455: 6 osds: 6 up, 6 in


#ceph -s
cluster c8a5975f-4c86-4cfe-a91b-fac9f3126afc
 health HEALTH_WARN 528 pgs peering; 528 pgs stuck inactive; 528 pgs
stuck unclean; 1 requests are blocked > 32 sec; 1 mons down, quorum 1,2,3,4
storage1,storage2,compute3,compute4
 monmap e1: 5 mons at {admin=
192.168.100.39:6789/0,compute3=192.168.100.133:6789/0,compute4=192.168.100.134:6789/0,storage1=192.168.100.120:6789/0,storage2=192.168.100.121:6789/0},
election epoch 132, quorum 1,2,3,4 storage1,storage2,compute3,compute4
 osdmap e455: 6 osds: 6 up, 6 in
  pgmap v48474: 3650 pgs, 19 pools, 27324 MB data, 4420 objects
82443 MB used, 2682 GB / 2763 GB avail
3122 active+clean
 528 remapped+peering
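
For reference, the stuck PGs and the OSDs they map to can be listed with, e.g.:

#ceph health detail
#ceph pg dump_stuck inactive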



Ceph.conf file

[global]
fsid = c8a5975f-4c86-4cfe-a91b-fac9f3126afc
mon_initial_members = admin, storage1, storage2, compute3, compute4
mon_host =
192.168.100.39,192.168.100.120,192.168.100.121,192.168.100.133,192.168.100.134
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

osd pool default size = 3
osd pool default min size = 3

osd pool default pg num = 300
osd pool default pgp num = 300

public network = 192.168.100.0/24

rgw print continue = false
rgw enable ops log = false

mon osd report timeout = 60
mon osd down out interval = 30
mon osd min down reports = 2

osd heartbeat grace = 10
osd mon heartbeat interval = 20
osd mon report interval max = 60
osd mon ack timeout = 15

mon osd min down reports = 2
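
Note that values in ceph.conf only take effect on a daemon after a restart or
an injectargs call; what a running OSD is actually using can be checked on its
host via the admin socket, and a value can be pushed at runtime, e.g. (the
grace of 20 below is just an example value):

#ceph daemon osd.0 config show | grep -E 'heartbeat|mon_report'
#ceph tell osd.* injectargs '--osd_heartbeat_grace 20'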


Regards,
Sudarshan Pathak


Re: [ceph-users] OSD not marked as down or out

2015-02-20 Thread Gregory Farnum
That's pretty strange, especially since the monitor is getting the
failure reports. What version are you running? Can you bump up the
monitor debugging and provide its output from around that time?
-Greg
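
For reference, the version can be read with "ceph -v" on each node, and
monitor debug output can be raised at runtime without a restart, e.g. (the
levels 10/1 are just commonly used example values):

#ceph tell mon.storage1 injectargs '--debug-mon 10 --debug-ms 1'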

On Fri, Feb 20, 2015 at 3:26 AM, Sudarshan Pathak <sushan@gmail.com> wrote:

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com