Re: [ceph-users] rbd map hangs

2018-06-08 Thread Ilya Dryomov
On Fri, Jun 8, 2018 at 6:37 AM, Tracy Reed  wrote:
> On Thu, Jun 07, 2018 at 09:30:23AM PDT, Jason Dillaman spake thusly:
>> I think what Ilya is saying is that it's a very old RHEL 7-based
>> kernel (RHEL 7.1?). For example, the current RHEL 7.5 kernel includes
>> numerous improvements that have been backported from the current
>> upstream kernel.
>
> Ah, I understand now. My VM servers tend not to get upgraded often as
> restarting all of the VMs is a hassle. I'll fix that. Do we think that
> is related to my issues? It has worked reliably for ages as far as
> mapping rbd goes.

Yes, it is likely related.  A few client-side issues that could lead to
stuck requests have been fixed since then.

>
> I still have the following in flight requests. I set osd.73 out as

Did you mean down?  If you marked it out, is it still out?
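
Out and down do different things, which matters here.  A quick sketch,
using the osd id from your mail:

  ceph osd down 73   # mark it down; it should come back up on its own,
                     # and clients resend the requests they had queued for it
  ceph osd out 73    # remove it from data placement; data rebalances away
  ceph osd in 73     # undo the "out"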

What is the output of "ceph -s", "ceph osd dump" and "ceph osd tree"?

Thanks,

Ilya


Re: [ceph-users] rbd map hangs

2018-06-07 Thread Tracy Reed
On Thu, Jun 07, 2018 at 09:30:23AM PDT, Jason Dillaman spake thusly:
> I think what Ilya is saying is that it's a very old RHEL 7-based
> kernel (RHEL 7.1?). For example, the current RHEL 7.5 kernel includes
> numerous improvements that have been backported from the current
> upstream kernel.

Ah, I understand now. My VM servers tend not to get upgraded often as
restarting all of the VMs is a hassle. I'll fix that. Do we think that
is related to my issues? It has worked reliably for ages as far as
mapping rbd goes.

I still have the following in-flight requests. I set osd.73 out as
suggested and even restarted the osd process on that node. It doesn't
seem to have had any effect, and I still have unkillable processes
blocking on mapped rbd devices. I could patch and reboot this box, which
would likely clear this up, but that will have to wait a week or so and
means downtime for 21 VMs, which is less than ideal. I would love to get
this fixed and finish transferring images from iscsi storage to ceph
rbd; then I can retire the iscsi storage, free up some surplus amps, and
bring more VM servers online so I can live migrate these VMs in the
future, making reboots/upgrades easier. That's the real limiting factor
here.
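
For the record, the in-flight requests themselves live in the
per-client osdc file (assuming the standard kernel-client debugfs
layout), so they can be checked without dumping everything:

  # print only the in-flight request list of each mapped client instance
  for f in /sys/kernel/debug/ceph/*/osdc; do echo "$f"; cat "$f"; done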

# find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
/sys/kernel/debug/ceph/b2b00aae-f00d-41b4-a29b-58859aa41375.client31276017/osdmap
epoch 232501
flags
pool 0 pg_num 2500 (4095) read_tier -1 write_tier -1
pool 2 pg_num 512 (511) read_tier -1 write_tier -1
pool 3 pg_num 128 (127) read_tier -1 write_tier -1
pool 4 pg_num 100 (127) read_tier -1 write_tier -1
osd0    10.0.5.3:6801    54%   (exists, up)      100%
osd1    10.0.5.3:6812    57%   (exists, up)      100%
osd2    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd3    10.0.5.4:6812    50%   (exists, up)      100%
osd4    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd5    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd6    10.0.5.9:6861    37%   (exists, up)      100%
osd7    10.0.5.9:6876    28%   (exists, up)      100%
osd8    10.0.5.9:6864    43%   (exists, up)      100%
osd9    10.0.5.9:6836    30%   (exists, up)      100%
osd10   10.0.5.9:6820    22%   (exists, up)      100%
osd11   10.0.5.9:6844    54%   (exists, up)      100%
osd12   10.0.5.9:6803    43%   (exists, up)      100%
osd13   10.0.5.9:6826    41%   (exists, up)      100%
osd14   10.0.5.9:6853    37%   (exists, up)      100%
osd15   10.0.5.9:6872    36%   (exists, up)      100%
osd16   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd17   10.0.5.9:6812    44%   (exists, up)      100%
osd18   10.0.5.9:6817    48%   (exists, up)      100%
osd19   10.0.5.9:6856    33%   (exists, up)      100%
osd20   10.0.5.9:6808    46%   (exists, up)      100%
osd21   10.0.5.9:6871    41%   (exists, up)      100%
osd22   10.0.5.9:6816    49%   (exists, up)      100%
osd23   10.0.5.9:6823    56%   (exists, up)      100%
osd24   10.0.5.9:6800    54%   (exists, up)      100%
osd25   10.0.5.9:6848    54%   (exists, up)      100%
osd26   10.0.5.9:6840    37%   (exists, up)      100%
osd27   10.0.5.9:6883    69%   (exists, up)      100%
osd28   10.0.5.9:6833    39%   (exists, up)      100%
osd29   10.0.5.9:6809    38%   (exists, up)      100%
osd30   10.0.5.9:6829    51%   (exists, up)      100%
osd31   10.0.5.11:6828   47%   (exists, up)      100%
osd32   10.0.5.11:6848   25%   (exists, up)      100%
osd33   10.0.5.11:6802   56%   (exists, up)      100%
osd34   10.0.5.11:6840   35%   (exists, up)      100%
osd35   10.0.5.11:6856   32%   (exists, up)      100%
osd36   10.0.5.11:6832   26%   (exists, up)      100%
osd37   10.0.5.11:6868   42%   (exists, up)      100%
osd38   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd39   10.0.5.11:6812   52%   (exists, up)      100%
osd40   10.0.5.11:6864   44%   (exists, up)      100%
osd41   10.0.5.11:6801   25%   (exists, up)      100%
osd42   10.0.5.11:6872   39%   (exists, up)      100%
osd43   10.0.5.13:6809   38%   (exists, up)      100%
osd44   10.0.5.11:6844   47%   (exists, up)      100%
osd45   10.0.5.11:6816   20%   (exists, up)      100%
osd46   10.0.5.3:6800    58%   (exists, up)      100%
osd47   10.0.5.2:6808    43%   (exists, up)      100%
osd48   10.0.5.2:6804    44%   (exists, up)      100%
osd49   10.0.5.2:6812    44%   (exists, up)      100%
osd50   10.0.5.2:6800    47%   (exists, up)      100%
osd51   10.0.5.4:6808    43%   (exists, up)      100%
osd52   10.0.5.12:6815   41%   (exists, up)      100%
osd53   10.0.5.11:6820   24%   (up)              100%
osd54   10.0.5.11:6876   34%   (exists, up)      100%
osd55   10.0.5.11:6836   48%   (exists, up)      100%
osd56   10.0.5.11:6824   31%   (exists, up)      100%
osd57   10.0.5.11:6860   48%   (exists, up)      100%
osd58   10.0.5.11:6852   35%   (exists, up)      100%
osd59   10.0.5.11:6800   42%   (exists, up)      100%
osd60   10.0.5.11:6880   58%   (exists, up)      100%
osd61   10.0.5.3:6803    52%   (exists, up)

Re: [ceph-users] rbd map hangs

2018-06-07 Thread Ilya Dryomov
On Thu, Jun 7, 2018 at 6:30 PM, Jason Dillaman  wrote:
> On Thu, Jun 7, 2018 at 12:13 PM, Tracy Reed  wrote:
>> On Thu, Jun 07, 2018 at 08:40:50AM PDT, Ilya Dryomov spake thusly:
>>> > Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue 
>>> > Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> This is a *very* old kernel.
>>
>> It's what's shipping with CentOS/RHEL 7 and probably what the vast
>> majority of people are using aside from perhaps the Ubuntu LTS people.
>
> I think what Ilya is saying is that it's a very old RHEL 7-based
> kernel (RHEL 7.1?). For example, the current RHEL 7.5 kernel includes
> numerous improvements that have been backported from the current
> upstream kernel.

Correct.  RHEL 7.1 isn't supported anymore -- even the EUS (Extended
Update Support) from Red Hat ended more than a year ago.

I would recommend an upgrade to 7.5 or a recent upstream kernel from
ELRepo.

Thanks,

Ilya


Re: [ceph-users] rbd map hangs

2018-06-07 Thread Sergey Malinin
http://elrepo.org/tiki/kernel-ml provides 4.17.
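
On CentOS 7 the install is roughly the following (a sketch; the exact
elrepo-release RPM version changes over time, so check elrepo.org
first):

  rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
  yum install https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
  yum --enablerepo=elrepo-kernel install kernel-ml
  grub2-set-default 0   # the newest installed kernel is usually entry 0
  reboot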

> On 7.06.2018, at 19:13, Tracy Reed  wrote:
> 
> It's what's shipping with CentOS/RHEL 7 and probably what the vast
> majority of people are using aside from perhaps the Ubuntu LTS people.



Re: [ceph-users] rbd map hangs

2018-06-07 Thread Jason Dillaman
On Thu, Jun 7, 2018 at 12:13 PM, Tracy Reed  wrote:
> On Thu, Jun 07, 2018 at 08:40:50AM PDT, Ilya Dryomov spake thusly:
>> > Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue 
>> > Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>
>> This is a *very* old kernel.
>
> It's what's shipping with CentOS/RHEL 7 and probably what the vast
> majority of people are using aside from perhaps the Ubuntu LTS people.

I think what Ilya is saying is that it's a very old RHEL 7-based
kernel (RHEL 7.1?). For example, the current RHEL 7.5 kernel includes
numerous improvements that have been backported from the current
upstream kernel.

> Does anyone really still compile their own latest kernels? Back in the
> mid-90's I'd compile a new kernel at the drop of a hat. But now it has
> gotten so complicated with so many options and drivers etc. that it's
> actually pretty hard to get it right.
>
>> These lines indicate in-flight requests.  Looks like there may have
>> been a problem with osd1 in the past, as some of these are much older
>> than others.  Try bouncing osd1 with "ceph osd down 1" (it should
>> come back up automatically) and see if that clears up this batch.
>
> Thanks!
>
> --
> Tracy Reed
> http://tracyreed.org
> Digital signature attached for your safety.



-- 
Jason


Re: [ceph-users] rbd map hangs

2018-06-07 Thread Tracy Reed
On Thu, Jun 07, 2018 at 08:40:50AM PDT, Ilya Dryomov spake thusly:
> > Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue 
> > Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> 
> This is a *very* old kernel.

It's what's shipping with CentOS/RHEL 7 and probably what the vast
majority of people are using aside from perhaps the Ubuntu LTS people.
Does anyone really still compile their own latest kernels? Back in the
mid-'90s I'd compile a new kernel at the drop of a hat. But now it has
gotten so complicated, with so many options and drivers, that it's
actually pretty hard to get it right.

> These lines indicate in-flight requests.  Looks like there may have
> been a problem with osd1 in the past, as some of these are much older
> than others.  Try bouncing osd1 with "ceph osd down 1" (it should
> come back up automatically) and see if that clears up this batch.
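
If I understand correctly, the cycle to try is roughly:

  ceph osd down 1                      # mark it down; it should rejoin on its own
  ceph osd tree | grep osd.1           # confirm it came back up
  cat /sys/kernel/debug/ceph/*/osdc    # check whether the stuck requests cleared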

Thanks!

-- 
Tracy Reed
http://tracyreed.org
Digital signature attached for your safety.




Re: [ceph-users] rbd map hangs

2018-06-07 Thread Ilya Dryomov
On Thu, Jun 7, 2018 at 4:33 PM, Tracy Reed  wrote:
> On Thu, Jun 07, 2018 at 02:05:31AM PDT, Ilya Dryomov spake thusly:
>> > find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
>>
>> Can you paste the entire output of that command?
>>
>> Which kernel are you running on the client box?
>
> Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 
> 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

This is a *very* old kernel.

>
> output is:
>
> # find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
> /sys/kernel/debug/ceph/b2b00aae-f00d-41b4-a29b-58859aa41375.client31276017/osdmap
> epoch 232455
> flags
> pool 0 pg_num 2500 (4095) read_tier -1 write_tier -1
> pool 2 pg_num 512 (511) read_tier -1 write_tier -1
> pool 3 pg_num 128 (127) read_tier -1 write_tier -1
> pool 4 pg_num 100 (127) read_tier -1 write_tier -1
> osd0    10.0.5.3:6801    54%   (exists, up)      100%
> osd1    10.0.5.3:6812    57%   (exists, up)      100%
> osd2    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd3    10.0.5.4:6812    50%   (exists, up)      100%
> osd4    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd5    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd6    10.0.5.9:6861    37%   (exists, up)      100%
> osd7    10.0.5.9:6876    28%   (exists, up)      100%
> osd8    10.0.5.9:6864    43%   (exists, up)      100%
> osd9    10.0.5.9:6836    30%   (exists, up)      100%
> osd10   10.0.5.9:6820    22%   (exists, up)      100%
> osd11   10.0.5.9:6844    54%   (exists, up)      100%
> osd12   10.0.5.9:6803    43%   (exists, up)      100%
> osd13   10.0.5.9:6826    41%   (exists, up)      100%
> osd14   10.0.5.9:6853    37%   (exists, up)      100%
> osd15   10.0.5.9:6872    36%   (exists, up)      100%
> osd16   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd17   10.0.5.9:6812    44%   (exists, up)      100%
> osd18   10.0.5.9:6817    48%   (exists, up)      100%
> osd19   10.0.5.9:6856    33%   (exists, up)      100%
> osd20   10.0.5.9:6808    46%   (exists, up)      100%
> osd21   10.0.5.9:6871    41%   (exists, up)      100%
> osd22   10.0.5.9:6816    49%   (exists, up)      100%
> osd23   10.0.5.9:6823    56%   (exists, up)      100%
> osd24   10.0.5.9:6800    54%   (exists, up)      100%
> osd25   10.0.5.9:6848    54%   (exists, up)      100%
> osd26   10.0.5.9:6840    37%   (exists, up)      100%
> osd27   10.0.5.9:6883    69%   (exists, up)      100%
> osd28   10.0.5.9:6833    39%   (exists, up)      100%
> osd29   10.0.5.9:6809    38%   (exists, up)      100%
> osd30   10.0.5.9:6829    51%   (exists, up)      100%
> osd31   10.0.5.11:6828   47%   (exists, up)      100%
> osd32   10.0.5.11:6848   25%   (exists, up)      100%
> osd33   10.0.5.11:6802   56%   (exists, up)      100%
> osd34   10.0.5.11:6840   35%   (exists, up)      100%
> osd35   10.0.5.11:6856   32%   (exists, up)      100%
> osd36   10.0.5.11:6832   26%   (exists, up)      100%
> osd37   10.0.5.11:6868   42%   (exists, up)      100%
> osd38   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd39   10.0.5.11:6812   52%   (exists, up)      100%
> osd40   10.0.5.11:6864   44%   (exists, up)      100%
> osd41   10.0.5.11:6801   25%   (exists, up)      100%
> osd42   10.0.5.11:6872   39%   (exists, up)      100%
> osd43   10.0.5.13:6809   38%   (exists, up)      100%
> osd44   10.0.5.11:6844   47%   (exists, up)      100%
> osd45   10.0.5.11:6816   20%   (exists, up)      100%
> osd46   10.0.5.3:6800    58%   (exists, up)      100%
> osd47   10.0.5.2:6808    43%   (exists, up)      100%
> osd48   10.0.5.2:6804    44%   (exists, up)      100%
> osd49   10.0.5.2:6812    44%   (exists, up)      100%
> osd50   10.0.5.2:6800    47%   (exists, up)      100%
> osd51   10.0.5.4:6808    43%   (exists, up)      100%
> osd52   10.0.5.12:6815   41%   (exists, up)      100%
> osd53   10.0.5.11:6820   24%   (up)              100%
> osd54   10.0.5.11:6876   34%   (exists, up)      100%
> osd55   10.0.5.11:6836   48%   (exists, up)      100%
> osd56   10.0.5.11:6824   31%   (exists, up)      100%
> osd57   10.0.5.11:6860   48%   (exists, up)      100%
> osd58   10.0.5.11:6852   35%   (exists, up)      100%
> osd59   10.0.5.11:6800   42%   (exists, up)      100%
> osd60   10.0.5.11:6880   58%   (exists, up)      100%
> osd61   10.0.5.3:6803    52%   (exists, up)      100%
> osd62   10.0.5.12:6800   42%   (exists, up)      100%
> osd63   10.0.5.12:6819   46%   (exists, up)      100%
> osd64   10.0.5.12:6809   44%   (exists, up)      100%
> osd65   10.0.5.13:6800   44%   (exists, up)      100%
> osd66   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd67   10.0.5.13:6808   50%   (exists, up)      100%
> osd68   10.0.5.4:6804    41%   (exists, up)      100%
> osd69   10.0.5.4:6800    39%   (exists, up)      100%
> osd70   10.0.5.13:6804   42%   (exists, up)      100%
> osd71   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd72   (unknown sockaddr family 0)    0%   (doesn't exis

Re: [ceph-users] rbd map hangs

2018-06-07 Thread Tracy Reed
On Thu, Jun 07, 2018 at 02:05:31AM PDT, Ilya Dryomov spake thusly:
> > find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
> 
> Can you paste the entire output of that command?
> 
> Which kernel are you running on the client box?

Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 3 
19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

output is:

# find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
/sys/kernel/debug/ceph/b2b00aae-f00d-41b4-a29b-58859aa41375.client31276017/osdmap
epoch 232455
flags
pool 0 pg_num 2500 (4095) read_tier -1 write_tier -1
pool 2 pg_num 512 (511) read_tier -1 write_tier -1
pool 3 pg_num 128 (127) read_tier -1 write_tier -1
pool 4 pg_num 100 (127) read_tier -1 write_tier -1
osd0    10.0.5.3:6801    54%   (exists, up)      100%
osd1    10.0.5.3:6812    57%   (exists, up)      100%
osd2    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd3    10.0.5.4:6812    50%   (exists, up)      100%
osd4    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd5    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd6    10.0.5.9:6861    37%   (exists, up)      100%
osd7    10.0.5.9:6876    28%   (exists, up)      100%
osd8    10.0.5.9:6864    43%   (exists, up)      100%
osd9    10.0.5.9:6836    30%   (exists, up)      100%
osd10   10.0.5.9:6820    22%   (exists, up)      100%
osd11   10.0.5.9:6844    54%   (exists, up)      100%
osd12   10.0.5.9:6803    43%   (exists, up)      100%
osd13   10.0.5.9:6826    41%   (exists, up)      100%
osd14   10.0.5.9:6853    37%   (exists, up)      100%
osd15   10.0.5.9:6872    36%   (exists, up)      100%
osd16   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd17   10.0.5.9:6812    44%   (exists, up)      100%
osd18   10.0.5.9:6817    48%   (exists, up)      100%
osd19   10.0.5.9:6856    33%   (exists, up)      100%
osd20   10.0.5.9:6808    46%   (exists, up)      100%
osd21   10.0.5.9:6871    41%   (exists, up)      100%
osd22   10.0.5.9:6816    49%   (exists, up)      100%
osd23   10.0.5.9:6823    56%   (exists, up)      100%
osd24   10.0.5.9:6800    54%   (exists, up)      100%
osd25   10.0.5.9:6848    54%   (exists, up)      100%
osd26   10.0.5.9:6840    37%   (exists, up)      100%
osd27   10.0.5.9:6883    69%   (exists, up)      100%
osd28   10.0.5.9:6833    39%   (exists, up)      100%
osd29   10.0.5.9:6809    38%   (exists, up)      100%
osd30   10.0.5.9:6829    51%   (exists, up)      100%
osd31   10.0.5.11:6828   47%   (exists, up)      100%
osd32   10.0.5.11:6848   25%   (exists, up)      100%
osd33   10.0.5.11:6802   56%   (exists, up)      100%
osd34   10.0.5.11:6840   35%   (exists, up)      100%
osd35   10.0.5.11:6856   32%   (exists, up)      100%
osd36   10.0.5.11:6832   26%   (exists, up)      100%
osd37   10.0.5.11:6868   42%   (exists, up)      100%
osd38   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd39   10.0.5.11:6812   52%   (exists, up)      100%
osd40   10.0.5.11:6864   44%   (exists, up)      100%
osd41   10.0.5.11:6801   25%   (exists, up)      100%
osd42   10.0.5.11:6872   39%   (exists, up)      100%
osd43   10.0.5.13:6809   38%   (exists, up)      100%
osd44   10.0.5.11:6844   47%   (exists, up)      100%
osd45   10.0.5.11:6816   20%   (exists, up)      100%
osd46   10.0.5.3:6800    58%   (exists, up)      100%
osd47   10.0.5.2:6808    43%   (exists, up)      100%
osd48   10.0.5.2:6804    44%   (exists, up)      100%
osd49   10.0.5.2:6812    44%   (exists, up)      100%
osd50   10.0.5.2:6800    47%   (exists, up)      100%
osd51   10.0.5.4:6808    43%   (exists, up)      100%
osd52   10.0.5.12:6815   41%   (exists, up)      100%
osd53   10.0.5.11:6820   24%   (up)              100%
osd54   10.0.5.11:6876   34%   (exists, up)      100%
osd55   10.0.5.11:6836   48%   (exists, up)      100%
osd56   10.0.5.11:6824   31%   (exists, up)      100%
osd57   10.0.5.11:6860   48%   (exists, up)      100%
osd58   10.0.5.11:6852   35%   (exists, up)      100%
osd59   10.0.5.11:6800   42%   (exists, up)      100%
osd60   10.0.5.11:6880   58%   (exists, up)      100%
osd61   10.0.5.3:6803    52%   (exists, up)      100%
osd62   10.0.5.12:6800   42%   (exists, up)      100%
osd63   10.0.5.12:6819   46%   (exists, up)      100%
osd64   10.0.5.12:6809   44%   (exists, up)      100%
osd65   10.0.5.13:6800   44%   (exists, up)      100%
osd66   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd67   10.0.5.13:6808   50%   (exists, up)      100%
osd68   10.0.5.4:6804    41%   (exists, up)      100%
osd69   10.0.5.4:6800    39%   (exists, up)      100%
osd70   10.0.5.13:6804   42%   (exists, up)      100%
osd71   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd72   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd73   10.0.5.16:6825   92%   (exists, up)      100%
osd74   10.0.5.16:6846  100%   (exists, up)      100%
osd75   10.0.5.16:6811   98%   (exists, up)      100%
osd76   10.0.5.16:6815  100%   (exists, up)      100%
osd77   10.0.5.16:6835   93%   (exists,

Re: [ceph-users] rbd map hangs

2018-06-07 Thread Ilya Dryomov
On Thu, Jun 7, 2018 at 5:12 AM, Tracy Reed  wrote:
>
> Hello all! I'm running luminous with old style non-bluestore OSDs. ceph
> 10.2.9 clients though, haven't been able to upgrade those yet.
>
> Occasionally I have access to rbds hang on the client such as right now.
> I tried to dd a VM image into a mapped rbd and it just hung.
>
> Then I tried to map a new rbd and that hangs also.
>
> How would I troubleshoot this? /var/log/ceph is empty, nothing in
> /var/log/messages or dmesg etc.
>
> I just discovered:
>
> find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
>
> which produces (among other seemingly innocuous things, let me know if
> anyone wants to see the rest):
>
> osd2    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
>
> which seems suspicious.

Can you paste the entire output of that command?

Which kernel are you running on the client box?
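
It would also help to see where the unkillable processes are blocked in
the kernel.  A sketch (the last step needs sysrq enabled):

  # list D-state (uninterruptible) processes and the kernel symbol they sleep in
  ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
  # dump stack traces of all blocked tasks into the kernel log
  echo w > /proc/sysrq-trigger
  dmesg | tail -100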

Thanks,

Ilya


Re: [ceph-users] rbd map hangs

2018-06-07 Thread ceph
Just a bet: do you have an inconsistent MTU across your network?

I already had your issue when the OSDs and client were using jumbo
frames but the MON did not (or something like that).
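
A quick way to check (a sketch; eth0, the 9000-byte MTU and the target
host are placeholders for your setup):

  ip link show dev eth0 | grep -o 'mtu [0-9]*'   # compare on client, OSD and MON hosts
  ping -M do -s 8972 <mon-host>   # 9000 minus 28 bytes of IP+ICMP headers;
                                  # fails if anything on the path has a smaller MTU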


On 06/07/2018 05:12 AM, Tracy Reed wrote:
> 
> Hello all! I'm running luminous with old style non-bluestore OSDs. ceph
> 10.2.9 clients though, haven't been able to upgrade those yet. 
> 
> Occasionally I have access to rbds hang on the client such as right now.
> I tried to dd a VM image into a mapped rbd and it just hung.
> 
> Then I tried to map a new rbd and that hangs also.
> 
> How would I troubleshoot this? /var/log/ceph is empty, nothing in
> /var/log/messages or dmesg etc.
> 
> I just discovered:
> 
> find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
> 
> which produces (among other seemingly innocuous things, let me know if
> anyone wants to see the rest):
> 
> osd2    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> 
> which seems suspicious.
> 
> rbd ls works reliably. As does create.  Cluster is healthy. 
> 
> But the processes which hung trying to access that mapped rbd appear to
> be completely unkillable. What else should I check?
> 
> Thanks!
> 
> 
> 
> 


[ceph-users] rbd map hangs

2018-06-06 Thread Tracy Reed

Hello all! I'm running luminous with old-style non-bluestore OSDs. The
clients are still ceph 10.2.9, though; I haven't been able to upgrade
those yet.

Occasionally I have access to rbds hang on the client such as right now.
I tried to dd a VM image into a mapped rbd and it just hung.

Then I tried to map a new rbd and that hangs also.

How would I troubleshoot this? /var/log/ceph is empty, nothing in
/var/log/messages or dmesg etc.

I just discovered:

find /sys/kernel/debug/ceph -type f -print -exec cat {} \;

which produces (among other seemingly innocuous things, let me know if
anyone wants to see the rest):

osd2    (unknown sockaddr family 0)    0%   (doesn't exist)   100%

which seems suspicious.

rbd ls works reliably. As does create.  Cluster is healthy. 

But the processes which hung trying to access that mapped rbd appear to
be completely unkillable. What else should I check?

Thanks!


-- 
Tracy Reed
http://tracyreed.org
Digital signature attached for your safety.




[ceph-users] rbd map hangs when using systemd-automount

2017-10-27 Thread Bjoern Laessig
Hi Cephers,

I have multiple rbds to map and mount, and bootup hangs forever while
running the rbdmap.service script. This was my mount entry in
/etc/fstab:

/dev/rbd/ptxdev/WORK_CEPH_BLA /ptx/work/ceph/bla xfs noauto,x-systemd.automount,defaults,noatime,_netdev,logbsize=256k,nofail 0 0

(The mount is activated at boot time by an NFS server that exports this
filesystem.) And I have a lot of these rbd mounts. Via systemd's
debug-shell.service I found out that boot hangs at rbdmap.service. I
added a 'set -x' to /usr/bin/rbdmap, and it showed me that it hangs at

  mount --fake /dev/rbd/$DEV >>/dev/null 2>&1

Why is this called there? Why is this done one rbd at a time? 

As there was no mention of it in the manual mounting documentation, I
masked rbdmap.service and created a rbdmap@.service instead:


[Unit]
Description=Map RBD device ptxdev/%i

After=network-online.target local-fs.target
Wants=network-online.target local-fs.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/rbd map %I --id dev --keyring /etc/ceph/ceph.client.dev.keyring
ExecStop=/usr/bin/rbd unmap /dev/rbd/%I


and added the option
  x-systemd.requires=rbdmap@ptxdev-WORK_CEPH_BLA.service
to my fstab entry.
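
For reference, the instance name is just the systemd-escaped form of
the rbd spec ('/' becomes '-'), so starting one unit by hand looks
like:

  systemd-escape ptxdev/WORK_CEPH_BLA    # prints ptxdev-WORK_CEPH_BLA
  systemctl start rbdmap@ptxdev-WORK_CEPH_BLA.service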

Now systemd is able to finish the boot process, but this is clearly
only a workaround, as there is now duplicated configuration data in the
service template and in /etc/ceph/rbdmap.

To do this right, there should be a systemd.generator(7) that reads
/etc/ceph/rbdmap at boot time and generates the
rbdmap@ptxdev-WORK_CEPH_BLA.service files.

Is this the correct way?

have a nice weekend
Björn Lässig


Re: [ceph-users] rbd map hangs

2015-01-03 Thread Max Power

On 03.01.2015 at 00:36, Dyweni - Ceph-Users wrote:
> Your OSDs are full.  The cluster will block, until space is freed up and
> both OSDs leave full state.

Okay, I did not know that an "rbd map" alone is too much for a full
cluster. That makes things a bit hard to work around, because reducing
the replica size to 1 as you mentioned works, but it is of course a
risky thing.
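
For the archive: the other escape hatch usually mentioned is to raise
the full threshold a little so deletes can proceed, then put it back (a
sketch in the syntax of the releases current at the time; leaving it
raised is dangerous):

  ceph pg set_full_ratio 0.97   # the default full ratio is 0.95
  # delete data until usage drops, then:
  ceph pg set_full_ratio 0.95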



Re: [ceph-users] rbd map hangs

2015-01-02 Thread Dyweni - Ceph-Users
Your OSDs are full.  The cluster will block until space is freed up and
both OSDs leave the full state.


You have 2 OSDs, so I'm assuming you are running a replica size of 2?
A quick (but risky) method might be to reduce your replica size to 1 to
get the cluster unblocked, clean up space, and then go back to replica
size 2.
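
A sketch of that (assuming the data lives in the pool named 'rbd'):

  ceph osd pool set rbd size 1   # risky: only one copy of the data while set
  # delete files / images until both OSDs drop below the full ratio, then:
  ceph osd pool set rbd size 2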





On 2015-01-02 13:44, Max Power wrote:
After I tried to copy some files into a rbd device I ran into an "osd
full" state. So I restarted my server and wanted to remove some files
from the filesystem again. But at this moment I cannot execute "rbd
map" anymore and I do not know why.

This all happened in my testing environment and this is the current
state with 'ceph status'

 health HEALTH_ERR
        2 full osd(s)
 monmap e1: 1 mons at {test1=10.0.0.141:6789/0}
        election epoch 1, quorum 0 test1
 osdmap e69: 2 osds: 2 up, 2 in
        flags full
  pgmap v469: 100 pgs, 1 pools, 1727 MB data, 438 objects
        3917 MB used, 156 MB / 4073 MB avail
             100 active+clean

strace reports this before 'rbd map pool/disk' hangs

[...]
access("/sys/bus/rbd", F_OK) = 0
access("/run/udev/control", F_OK) = 0
socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT) = 3
setsockopt(3, SOL_SOCKET, SO_ATTACH_FILTER, "\r\0\0\0\0\0\0\0@k\211\240\377\177\0\0", 16) = 0
bind(3, {sa_family=AF_NETLINK, pid=0, groups=0002}, 12) = 0
getsockname(3, {sa_family=AF_NETLINK, pid=1192, groups=0002}, [12]) = 0
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
open("/sys/bus/rbd/add_single_major", O_WRONLY) = 4
write(4, "10.0.0.141:6789 name=admin,key=c"..., 61

Any idea why I cannot access the rbd device anymore?


[ceph-users] rbd map hangs

2015-01-02 Thread Max Power
After I tried to copy some files into a rbd device I ran into an "osd
full" state. So I restarted my server and wanted to remove some files
from the filesystem again. But at this moment I cannot execute "rbd map"
anymore and I do not know why.

This all happened in my testing environment and this is the current state with
'ceph status'
 health HEALTH_ERR
        2 full osd(s)
 monmap e1: 1 mons at {test1=10.0.0.141:6789/0}
        election epoch 1, quorum 0 test1
 osdmap e69: 2 osds: 2 up, 2 in
        flags full
  pgmap v469: 100 pgs, 1 pools, 1727 MB data, 438 objects
        3917 MB used, 156 MB / 4073 MB avail
             100 active+clean
strace reports this before 'rbd map pool/disk' hangs
[...]
access("/sys/bus/rbd", F_OK) = 0
access("/run/udev/control", F_OK) = 0
socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT) = 3
setsockopt(3, SOL_SOCKET, SO_ATTACH_FILTER,
"\r\0\0\0\0\0\0\0@k\211\240\377\177\0\0", 16) = 0
bind(3, {sa_family=AF_NETLINK, pid=0, groups=0002}, 12) = 0
getsockname(3, {sa_family=AF_NETLINK, pid=1192, groups=0002}, [12]) = 0
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
open("/sys/bus/rbd/add_single_major", O_WRONLY) = 4
write(4, "10.0.0.141:6789 name=admin,key=c"..., 61

Any idea why I cannot access the rbd device anymore?


Re: [ceph-users] rbd map hangs on Ceph Cluster

2014-05-27 Thread Sharmila Govind
Thanks Ilya. I upgraded the kernel and it worked :-)


Thanks,
Sharmila


On Tue, May 27, 2014 at 11:26 PM, Ilya Dryomov wrote:

> On Tue, May 27, 2014 at 9:04 PM, Sharmila Govind
>  wrote:
> > Hi,
> >
> >   I am setting up a ceph cluster for some experimentation. The cluster is
> > setup successfully. But, when I try running rbd map on the host, the
> > kernel crashes (system hangs) and I need to do a hard reset for it to
> > recover. Below is my setup.
> >
> > All my nodes have Linux kernel 3.5 with Ubuntu 12.04. I am installing
> > the emperor version of Ceph.
>
> It would help if you could capture the crash, but most probably it's
> a known bug in 3.5, which leads to a crash instead of returning an
> error to 'rbd map' when the kernel misses required feature bits, i.e.
> too old.  I would recommend running at least 3.9.
>
> Thanks,
>
> Ilya
>


Re: [ceph-users] rbd map hangs on Ceph Cluster

2014-05-27 Thread Ilya Dryomov
On Tue, May 27, 2014 at 9:04 PM, Sharmila Govind
 wrote:
> Hi,
>
>   I am setting up a ceph cluster for some experimentation. The cluster is
> setup successfully. But, when I try running rbd map on the host, the kernel
> crashes (system hangs) and I need to do a hard reset for it to recover. Below
> is my setup.
>
> All my nodes have Linux kernel 3.5 with Ubuntu 12.04. I am installing the
> emperor version of Ceph.

It would help if you could capture the crash, but most probably it's
a known bug in 3.5, which leads to a crash instead of returning an
error to 'rbd map' when the kernel is missing required feature bits,
i.e. it is too old.  I would recommend running at least 3.9.
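
If the box locks up before the oops reaches the logs, netconsole is one
way to capture it (a sketch; the addresses and MAC below are
placeholders, not values from this thread):

  # on the crashing host: stream kernel messages over UDP to a log receiver
  modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55
  # on the receiving host (listener syntax varies between netcat flavours):
  nc -u -l 6666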

Thanks,

Ilya


[ceph-users] rbd map hangs on Ceph Cluster

2014-05-27 Thread Sharmila Govind
Hi,

  I am setting up a ceph cluster for some experimentation. The cluster
is set up successfully. But when I try running rbd map on the host, the
kernel crashes (system hangs) and I need to do a hard reset for it to
recover. Below is my setup.

All my nodes have Linux kernel 3.5 with Ubuntu 12.04. I am installing
the emperor version of Ceph.

Below is the ceph cluster status:

root@CephMon:~# ceph osd tree
# id    weight  type name       up/down reweight
-1      2.2     root default
-2      1.66            host cephnode2
0       0.9                     osd.0   up      1
3       0.76                    osd.3   up      1
-3      0.54            host cephnode4
1       0.27                    osd.1   up      1
2       0.27                    osd.2   up      1
root@CephMon:~#

root@CephMon:~# ceph osd dump
epoch 65
fsid bef84776-a957-495e-be34-c353eb76c3d7
created 2014-05-27 08:53:59.112200
modified 2014-05-27 15:05:42.742630
flags
pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 38 owner 0 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 36 owner 0 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 96 pgp_num 96 last_change 61 owner 0 flags hashpspool stripe_width 0
max_osd 4
osd.0 up   in  weight 1 up_from 4 up_thru 61 down_at 0 last_clean_interval [0,0) 10.223.169.166:6800/26254 10.223.169.166:6801/26254 10.223.169.166:6802/26254 10.223.169.166:6803/26254 exists,up 1aefbfb2-a220-4f0e-9d91-1b9344717337
osd.1 up   in  weight 1 up_from 8 up_thru 61 down_at 0 last_clean_interval [0,0) 10.223.169.201:6800/32211 10.223.169.201:6801/32211 10.223.169.201:6802/32211 10.223.169.201:6803/32211 exists,up 077d4fe2-e8f7-42ba-a569-87efc7c11fbe
osd.2 up   in  weight 1 up_from 12 up_thru 61 down_at 0 last_clean_interval [0,0) 10.223.169.201:6805/3 10.223.169.201:6806/3 10.223.169.201:6807/3 10.223.169.201:6808/3 exists,up 734fb969-1bb6-46ad-91c3-60b4647c90ac
osd.3 up   in  weight 1 up_from 48 up_thru 61 down_at 0 last_clean_interval [0,0) 10.223.169.166:6805/27859 10.223.169.166:6806/27859 10.223.169.166:6807/27859 10.223.169.166:6808/27859 exists,up 5b479abb-168a-411d-ba4e-a37e63fdfbd4

The following are the rbd commands used on the host:

rbd create test --size 1024 --pool rbd
modprobe rbd
rbd map test --pool rbd

Any pointers to this issue would be of great help.

Thanks in advance,
Sharmila