Re: [ceph-users] ceph mons de-synced from rest of cluster?

2018-02-12 Thread Gregory Farnum
On Sun, Feb 11, 2018 at 8:19 PM Chris Apsey  wrote:

> All,
>
> We recently doubled the number of OSDs in our cluster, and towards the
> end of the rebalancing I noticed that recovery IO fell to nothing and
> that the mons eventually reported the following when I ran ceph -s:
>
>cluster:
>  id: 6a65c3d0-b84e-4c89-bbf7-a38a1966d780
>  health: HEALTH_WARN
>  34922/4329975 objects misplaced (0.807%)
>  Reduced data availability: 542 pgs inactive, 49 pgs
> peering, 13502 pgs stale
>  Degraded data redundancy: 248778/4329975 objects
> degraded (5.745%), 7319 pgs unclean, 2224 pgs degraded, 1817 pgs
> undersized
>
>services:
>  mon: 3 daemons, quorum cephmon-0,cephmon-1,cephmon-2
>  mgr: cephmon-0(active), standbys: cephmon-1, cephmon-2
>  osd: 376 osds: 376 up, 376 in
>
>data:
>  pools:   9 pools, 13952 pgs
>  objects: 1409k objects, 5992 GB
>  usage:   31528 GB used, 1673 TB / 1704 TB avail
>  pgs: 3.225% pgs unknown
>   0.659% pgs not active
>   248778/4329975 objects degraded (5.745%)
>   34922/4329975 objects misplaced (0.807%)
>   6141 stale+active+clean
>   4537 stale+active+remapped+backfilling
>   1575 stale+active+undersized+degraded
>   489  stale+active+clean+remapped
>   450  unknown
>   396  stale+active+recovery_wait+degraded
>   216  stale+active+undersized+degraded+remapped+backfilling
>   40   stale+peering
>   30   stale+activating
>   24   stale+active+undersized+remapped
>   22   stale+active+recovering+degraded
>   13   stale+activating+degraded
>   9    stale+remapped+peering
>   4    stale+active+remapped+backfill_wait
>   3    stale+active+clean+scrubbing+deep
>   2    stale+active+undersized+degraded+remapped+backfill_wait
>   1    stale+active+remapped
>
> The problem is, everything works fine.  If I run ceph health detail and
> do a pg query against one of the 'degraded' placement groups, it reports
> back as active+clean.  All clients in the cluster can write and read at
> normal speeds, but no IO information is ever reported in ceph -s.
>
>  From what I can see, everything in the cluster is working properly
> except the actual reporting on the status of the cluster.  Has anyone
> seen this before/know how to sync the mons up to what the OSDs are
> actually reporting?  I see no connectivity errors in the logs of the
> mons or the osds.
>

It sounds like the manager has gone stale somehow. You can probably fix it
by restarting the active mgr, though if you have logs it would be good to
file a bug report at tracker.ceph.com.
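
For reference, on a systemd-managed deployment the restart/failover would
look roughly like this (the hostname is the active mgr from your output,
so adjust for your setup):

    # ask the cluster to fail over to a standby mgr
    ceph mgr fail cephmon-0

    # or restart the mgr daemon directly on its host
    ssh cephmon-0 'sudo systemctl restart ceph-mgr@cephmon-0'

    # confirm a mgr is active again and watch the pg stats refresh
    ceph -s
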
-Greg


>
> Thanks,
>
> ---
> v/r
>
> Chris Apsey
> bitskr...@bitskrieg.net
> https://www.bitskrieg.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph mons de-synced from rest of cluster?

2018-02-11 Thread Chris Apsey

All,

We recently doubled the number of OSDs in our cluster, and towards the end
of the rebalancing I noticed that recovery IO fell to nothing and that the
mons eventually reported the following when I ran ceph -s:


  cluster:
id: 6a65c3d0-b84e-4c89-bbf7-a38a1966d780
health: HEALTH_WARN
34922/4329975 objects misplaced (0.807%)
Reduced data availability: 542 pgs inactive, 49 pgs 
peering, 13502 pgs stale
Degraded data redundancy: 248778/4329975 objects 
degraded (5.745%), 7319 pgs unclean, 2224 pgs degraded, 1817 pgs 
undersized


  services:
mon: 3 daemons, quorum cephmon-0,cephmon-1,cephmon-2
mgr: cephmon-0(active), standbys: cephmon-1, cephmon-2
osd: 376 osds: 376 up, 376 in

  data:
pools:   9 pools, 13952 pgs
objects: 1409k objects, 5992 GB
usage:   31528 GB used, 1673 TB / 1704 TB avail
pgs: 3.225% pgs unknown
 0.659% pgs not active
 248778/4329975 objects degraded (5.745%)
 34922/4329975 objects misplaced (0.807%)
 6141 stale+active+clean
 4537 stale+active+remapped+backfilling
 1575 stale+active+undersized+degraded
 489  stale+active+clean+remapped
 450  unknown
 396  stale+active+recovery_wait+degraded
 216  stale+active+undersized+degraded+remapped+backfilling
 40   stale+peering
 30   stale+activating
 24   stale+active+undersized+remapped
 22   stale+active+recovering+degraded
 13   stale+activating+degraded
 9    stale+remapped+peering
 4    stale+active+remapped+backfill_wait
 3    stale+active+clean+scrubbing+deep
 2    stale+active+undersized+degraded+remapped+backfill_wait
 1    stale+active+remapped

The problem is, everything works fine.  If I run ceph health detail and
do a pg query against one of the 'degraded' placement groups, it reports
back as active+clean.  All clients in the cluster can write and read at
normal speeds, but no IO information is ever reported in ceph -s.
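
To be concrete, the check looks roughly like this (the pg id below is just
a placeholder; I pick one that ceph health detail flags as degraded or
stale):

    # pgs the mons currently flag as stale/degraded
    ceph health detail | grep stale | head

    # query one of them directly -- the OSDs answer, and the state
    # comes back active+clean despite what ceph -s shows
    ceph pg 1.2f query | grep '"state"'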


From what I can see, everything in the cluster is working properly 
except the actual reporting on the status of the cluster.  Has anyone 
seen this before/know how to sync the mons up to what the OSDs are 
actually reporting?  I see no connectivity errors in the logs of the 
mons or the osds.
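
In case it helps anyone suggest next steps, these are the sorts of checks
I mean (log paths assume a default package install, and the grep patterns
are only illustrative):

    # which mgr is active, and whether the cluster thinks one is available
    ceph mgr dump | grep -E '"active_name"|"available"'

    # mon quorum as the mons see it
    ceph quorum_status --format json-pretty | grep -A4 '"quorum_names"'

    # scan mon/osd logs for connection faults or timeouts
    grep -iE 'fault|timeout|refused' /var/log/ceph/ceph-mon.*.log | tail
    grep -iE 'fault|timeout|refused' /var/log/ceph/ceph-osd.*.log | tail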


Thanks,

---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com