Hi, I'm running Ceph 0.72.2 on Scientific Linux with kernel 2.6.32-431.1.2.el6.x86_64.
After network trouble on all my nodes, the OSDs flap up and down periodically. I had to set the nodown flag to stabilize the cluster. I have a public_network and a cluster_network.

Most OSDs log this message:

  2014-06-23 08:08:59.750879 7f6bd3661700 -1 osd.y 53377 heartbeat_check: no reply from osd.xxx ever on either front or back, first ping sent 2014-06-22 20:06:10.055264 (cutoff 2014-06-23 08:08:24.750744)

Cluster status:

    cluster b71fecc6-0323-4f08-8b49-e8ed1ff2d4ce
     health HEALTH_WARN 1 pgs backfill; 73 pgs down; 196 pgs peering; 196 pgs stuck inactive; 197 pgs stuck unclean; recovery 592/2459924 objects degraded (0.024%); nodown flag(s) set
     monmap e5: 3 mons at {bb-e19-x4=10.257.53.236:6789/0,cephfrontux1-r=10.257.53.241:6789/0,cephfrontux2-r=10.257.53.242:6789/0}, election epoch 202, quorum 0,1,2 bb-e19-x4,cephtux1-r,cephtux2-r
     osdmap e53377: 34 osds: 33 up, 33 in
            flags nodown
      pgmap v5928500: 5596 pgs, 5 pools, 4755 GB data, 1212 kobjects
            9466 GB used, 17248 GB / 26715 GB avail
            592/2459924 objects degraded (0.024%)
                5398 active+clean
                   1 active+remapped+wait_backfill
                 123 peering
                  73 down+peering
                   1 active+clean+scrubbing

Grepping the OSD logs shows the same pattern everywhere:

  grep heartbeat_check ceph-osd.*.log | awk '{print $5, $7, "problem", $11}' | sort -u
  osd.10 heartbeat_check: problem osd.0
  osd.10 heartbeat_check: problem osd.11
  osd.10 heartbeat_check: problem osd.19
  .....

It is the same in most OSD logs. I tried setting some options, but nothing changed:

  [osd]
  osd_heartbeat_grace = 35
  osd_min_down_reports = 4
  osd_heartbeat_addr = 10.157.53.224
  mon_osd_down_out_interval = 3000
  osd_heartbeat_interval = 12
  osd_mkfs_options_xfs = "-f"
  mon_osd_min_down_reporters = 3
  osd_mkfs_type = xfs

Do you have any idea how to fix this?

--
Eric Mourgaya,

Respectons la planete!
Luttons contre la mediocrite!
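For anyone wanting to reproduce the log mining above, here is a self-contained sketch of the one-liner. It assumes the exact heartbeat_check line format quoted in the message (the sample log line below is synthetic, written here for illustration; field positions $5, $7 and $11 may differ on other Ceph versions):

```shell
# Extract unique (reporting OSD, unreachable peer) pairs from
# heartbeat_check messages. A temporary file stands in for ceph-osd.*.log.
log=$(mktemp)
cat > "$log" <<'EOF'
2014-06-23 08:08:59.750879 7f6bd3661700 -1 osd.10 53377 heartbeat_check: no reply from osd.19 ever on either front or back, first ping sent 2014-06-22 20:06:10.055264 (cutoff 2014-06-23 08:08:24.750744)
EOF
# $5 = reporting OSD, $7 = "heartbeat_check:", $11 = unreachable peer
result=$(grep heartbeat_check "$log" | awk '{print $5, $7, "problem", $11}' | sort -u)
echo "$result"
rm -f "$log"
```

Running this against the real logs (substituting ceph-osd.*.log for the temp file) should reproduce the "osd.X heartbeat_check: problem osd.Y" listing shown above.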
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com