Something about osd.13 is blocking the cluster.  I would first try running
the command below; if that doesn't work, then I would restart the daemon.

# ceph osd down 13

Marking it down should force it to reassert itself to the cluster without
having to restart the daemon and interrupt whatever operations it's working
on.  Also, while it's down, the secondary OSDs for those PGs should be able
to handle the requests that are blocked.  Check its log to see what it's doing.
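
If marking it down doesn't clear the blocked requests, here's a rough sketch
of what I'd check next (assuming a systemd-based install and the default log
location; adjust the paths and unit name for your setup):

# tail -f /var/log/ceph/ceph-osd.13.log
# ceph daemon osd.13 dump_ops_in_flight     (run on the node hosting osd.13)
# systemctl restart ceph-osd@13

dump_ops_in_flight should show what those 100 blocked requests are actually
waiting on before you resort to restarting the daemon.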

You didn't answer what your size and min_size are for your 2 pools.
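
For reference, you can pull those straight from the pool settings
(substitute your actual pool names for <poolname>):

# ceph osd pool ls detail
# ceph osd pool get <poolname> size
# ceph osd pool get <poolname> min_size

Those two stuck PGs show last acting [13], i.e. only a single copy online, so
if min_size is 2 or higher they will stay inactive until backfill brings
another copy up, which would explain what you're seeing.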

On Fri, Jun 23, 2017 at 3:11 PM Daniel Davidson <dani...@igb.illinois.edu>
wrote:

> Thanks for the response:
>
> [root@ceph-control ~]# ceph health detail | grep 'ops are blocked'
> 100 ops are blocked > 134218 sec on osd.13
> [root@ceph-control ~]# ceph osd blocked-by
> osd num_blocked
>
> A problem with osd.13?
>
> Dan
>
>
> On 06/23/2017 02:03 PM, David Turner wrote:
>
> # ceph health detail | grep 'ops are blocked'
> # ceph osd blocked-by
>
> My guess is that you have an OSD that is in a funky state blocking the
> requests and the peering.  Let me know what the output of those commands
> is.
>
> Also what are the replica sizes of your 2 pools?  It shows that only 1 OSD
> was last active for the 2 inactive PGs.  Not sure yet if that is anything
> of concern, but didn't want to ignore it.
>
> On Fri, Jun 23, 2017 at 1:16 PM Daniel Davidson <dani...@igb.illinois.edu>
> wrote:
>
> Two of our OSD systems hit 75% disk utilization, so I added another
>> system to try and bring that back down.  The system was usable for a day
>> while the data was being migrated, but now the system is not responding
>> when I try to mount it:
>>
>>   mount -t ceph ceph-0,ceph-1,ceph-2,ceph-3:6789:/ /home -o
>> name=admin,secretfile=/etc/ceph/admin.secret
>> mount error 5 = Input/output error
>>
>> Here is our ceph health
>>
>> [root@ceph-3 ~]# ceph -s
>>      cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
>>       health HEALTH_ERR
>>              2 pgs are stuck inactive for more than 300 seconds
>>              58 pgs backfill_wait
>>              20 pgs backfilling
>>              3 pgs degraded
>>              2 pgs stuck inactive
>>              76 pgs stuck unclean
>>              2 pgs undersized
>>              100 requests are blocked > 32 sec
>>              recovery 1197145/653713908 objects degraded (0.183%)
>>              recovery 47420551/653713908 objects misplaced (7.254%)
>>              mds0: Behind on trimming (180/30)
>>              mds0: Client biologin-0 failing to respond to capability
>> release
>>              mds0: Many clients (20) failing to respond to cache pressure
>>       monmap e3: 4 mons at
>> {ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0}
>>              election epoch 542, quorum 0,1,2,3
>> ceph-0,ceph-1,ceph-2,ceph-3
>>        fsmap e17666: 1/1/1 up {0=ceph-0=up:active}, 3 up:standby
>>       osdmap e25535: 32 osds: 32 up, 32 in; 78 remapped pgs
>>              flags sortbitwise,require_jewel_osds
>>        pgmap v19199544: 1536 pgs, 2 pools, 786 TB data, 299 Mobjects
>>              1595 TB used, 1024 TB / 2619 TB avail
>>              1197145/653713908 objects degraded (0.183%)
>>              47420551/653713908 objects misplaced (7.254%)
>>                  1448 active+clean
>>                    58 active+remapped+wait_backfill
>>                    17 active+remapped+backfilling
>>                    10 active+clean+scrubbing+deep
>>                     2 undersized+degraded+remapped+backfilling+peered
>>                     1 active+degraded+remapped+backfilling
>> recovery io 906 MB/s, 331 objects/s
>>
>> Checking in on the inactive PGs
>>
>> [root@ceph-control ~]# ceph health detail |grep inactive
>> HEALTH_ERR 2 pgs are stuck inactive for more than 300 seconds; 58 pgs
>> backfill_wait; 20 pgs backfilling; 3 pgs degraded; 2 pgs stuck inactive;
>> 78 pgs stuck unclean; 2 pgs undersized; 100 requests are blocked > 32
>> sec; 1 osds have slow requests; recovery 1197145/653713908 objects
>> degraded (0.183%); recovery 47390082/653713908 objects misplaced
>> (7.249%); mds0: Behind on trimming (180/30); mds0: Client biologin-0
>> failing to respond to capability release; mds0: Many clients (20)
>> failing to respond to cache pressure
>> pg 2.1b5 is stuck inactive for 77215.112164, current state
>> undersized+degraded+remapped+backfilling+peered, last acting [13]
>> pg 2.145 is stuck inactive for 76910.328647, current state
>> undersized+degraded+remapped+backfilling+peered, last acting [13]
>>
>> If I query, then I don't get a response:
>>
>> [root@ceph-control ~]# ceph pg 2.1b5 query
>>
>> Any ideas on what to do?
>>
>> Dan
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
