Re: [ceph-users] Cinder pool inaccessible after Nautilus upgrade

2019-07-04 Thread Adrien Georget
It appears that if the client or the Openstack cinder service is in the same 
network as Ceph, it works.
From the Openstack network it fails, but only on this particular pool! It 
was working well before the upgrade and no changes have been made on the 
network side.
Very strange issue. I checked the Ceph release notes looking for 
network-related changes but found nothing relevant.
Only the biggest pool is affected: same pool config, same hosts, ACLs 
all open, no iptables, ...
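
For reference, a quick way to double-check the "same pool config" point on a 
monitor would be something like this (volumes-rd is only a placeholder for the 
R&D pool name, not its real name):

# compare size, pg_num, crush rule, tiering and flags of the two pools
ceph osd pool ls detail

# spot-check a few PGs of the problematic pool and their acting OSDs
ceph pg ls-by-pool volumes-rd | head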


Anything else to check?
We are thinking about adding a VNIC to all Ceph and Openstack hosts so that 
they share the same subnet.
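
Before adding the VNIC, a first sanity check of routing and MTU from an 
Openstack host towards an OSD node could look like this (134.158.208.37 is the 
OSD node from the fault message quoted further down; a standard 1500-byte MTU 
is assumed):

# which route/interface the Openstack host uses to reach the OSD node
ip route get 134.158.208.37

# do full-size packets pass without fragmentation? (1472 + headers = 1500)
ping -M do -s 1472 -c 3 134.158.208.37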



Adrien


On 03/07/2019 at 13:46, Adrien Georget wrote:

Hi,

With --debug-objecter=20, I found that the rados ls command hangs, 
looping on "laggy" messages:

2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _op_submit op 0x7efc3800dc10
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _calc_target epoch 13146 base  @3 precalc_pgid 1 pgid 3.100 is_read
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _calc_target target  @3 -> pgid 3.100
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _calc_target  raw pgid 3.100 -> actual 3.100 acting [29,12,55] primary 29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _get_session s=0x7efc380024c0 osd=29 3
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _op_submit oid  '@3' '@3' [pgnls start_epoch 13146] tid 11 osd.29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter get_session s=0x7efc380024c0 osd=29 3
2019-07-03 13:33:24.913 7efc402f5700 15 client.21363886.objecter _session_op_assign 29 11
2019-07-03 13:33:24.913 7efc402f5700 15 client.21363886.objecter _send_op 11 to 3.100 on osd.29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter put_session s=0x7efc380024c0 osd=29 4
2019-07-03 13:33:24.913 7efc402f5700  5 client.21363886.objecter 1 in flight

2019-07-03 13:33:29.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:34.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:39.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:39.678 7efc3e2f1700  2 client.21363886.objecter  tid 11 on osd.29 is laggy
2019-07-03 13:33:39.678 7efc3e2f1700 10 client.21363886.objecter _maybe_request_map subscribing (onetime) to next osd map

2019-07-03 13:33:44.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:44.678 7efc3e2f1700  2 client.21363886.objecter  tid 11 on osd.29 is laggy
2019-07-03 13:33:44.678 7efc3e2f1700 10 client.21363886.objecter _maybe_request_map subscribing (onetime) to next osd map

2019-07-03 13:33:49.679 7efc3e2f1700 10 client.21363886.objecter tick
...

I tried to disable this OSD but the problem just moves to another OSD, and 
so on.
The ceph client packages are up to date; all RBD commands still work 
from a monitor but not from the Openstack controllers.
And the other Ceph pool, on the same OSD host but on different disks, 
works perfectly with Openstack...
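
For reference, a minimal way to map the laggy request to a concrete OSD and 
its address (osd.29 and pg 3.100 are taken from the objecter log above; run on 
a monitor):

# which OSDs currently serve the PG the stuck pgnls op targets
ceph pg map 3.100

# where osd.29 lives (host and address the client is trying to reach)
ceph osd find 29

# the addresses osd.29 advertises after the upgrade
ceph osd dump | grep '^osd.29 '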


The issue looks like these old ones, but they seem to have been fixed years 
ago: https://tracker.ceph.com/issues/2454 and 
https://tracker.ceph.com/issues/8515


Is there anything more I can check?

Adrien


On 02/07/2019 at 14:10, Adrien Georget wrote:

Hi Eugen,

The cinder keyring used by the 2 pools is the same; the rbd command 
works using this keyring and the ceph.conf used by Openstack, while the 
rados ls command stays stuck.


I tried with the previously used ceph-common version (10.2.5) and with the 
latest ceph version (14.2.1).
With the Nautilus ceph-common version, the 2 cinder-volume services 
crashed...


Adrien

On 02/07/2019 at 13:50, Eugen Block wrote:

Hi,

did you try to use rbd and rados commands with the cinder keyring, 
not the admin keyring? Did you check if the caps for that client are 
still valid (do the caps differ between the two cinder pools)?


Are the ceph versions on your hypervisors also nautilus?

Regards,
Eugen


Zitat von Adrien Georget :


Hi all,

I'm facing a very strange issue after migrating my Luminous cluster 
to Nautilus.
I have 2 pools configured for Openstack cinder volumes in a 
multiple-backend setup: one "service" Ceph pool with cache tiering 
and one "R&D" Ceph pool.
After the upgrade, the R&D pool became inaccessible for Cinder and 
the cinder-volume service using this pool can't start anymore.
What is strange is that Openstack and Ceph report no error, the Ceph 
cluster is healthy, all OSDs are UP & running, and the "service" 
pool still works fine with the other cinder service on the 
same openstack host.
I followed the upgrade procedure exactly 
(https://ceph.com/releases/v14-2-0-nautilus-released/#upgrading-from-mimic-or-luminous); 
there was no problem during the upgrade, but I can't understand why Cinder 
still fails with this pool.

Re: [ceph-users] Cinder pool inaccessible after Nautilus upgrade

2019-07-03 Thread Adrien Georget

Hi,

With --debug-objecter=20, I found that the rados ls command hangs, 
looping on "laggy" messages:

2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _op_submit op 0x7efc3800dc10
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _calc_target epoch 13146 base  @3 precalc_pgid 1 pgid 3.100 is_read
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _calc_target target  @3 -> pgid 3.100
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _calc_target  raw pgid 3.100 -> actual 3.100 acting [29,12,55] primary 29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _get_session s=0x7efc380024c0 osd=29 3
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _op_submit oid  '@3' '@3' [pgnls start_epoch 13146] tid 11 osd.29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter get_session s=0x7efc380024c0 osd=29 3
2019-07-03 13:33:24.913 7efc402f5700 15 client.21363886.objecter _session_op_assign 29 11
2019-07-03 13:33:24.913 7efc402f5700 15 client.21363886.objecter _send_op 11 to 3.100 on osd.29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter put_session s=0x7efc380024c0 osd=29 4
2019-07-03 13:33:24.913 7efc402f5700  5 client.21363886.objecter 1 in flight

2019-07-03 13:33:29.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:34.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:39.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:39.678 7efc3e2f1700  2 client.21363886.objecter  tid 11 on osd.29 is laggy
2019-07-03 13:33:39.678 7efc3e2f1700 10 client.21363886.objecter _maybe_request_map subscribing (onetime) to next osd map

2019-07-03 13:33:44.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:44.678 7efc3e2f1700  2 client.21363886.objecter  tid 11 on osd.29 is laggy
2019-07-03 13:33:44.678 7efc3e2f1700 10 client.21363886.objecter _maybe_request_map subscribing (onetime) to next osd map

2019-07-03 13:33:49.679 7efc3e2f1700 10 client.21363886.objecter tick
...

I tried to disable this OSD but the problem just moves to another OSD, and so on.
The ceph client packages are up to date; all RBD commands still work from 
a monitor but not from the Openstack controllers.
And the other Ceph pool, on the same OSD host but on different disks, 
works perfectly with Openstack...


The issue looks like these old ones, but they seem to have been fixed years ago: 
https://tracker.ceph.com/issues/2454 and 
https://tracker.ceph.com/issues/8515


Is there anything more I can check?

Adrien


On 02/07/2019 at 14:10, Adrien Georget wrote:

Hi Eugen,

The cinder keyring used by the 2 pools is the same; the rbd command 
works using this keyring and the ceph.conf used by Openstack, while the 
rados ls command stays stuck.


I tried with the previously used ceph-common version (10.2.5) and with the 
latest ceph version (14.2.1).
With the Nautilus ceph-common version, the 2 cinder-volume services 
crashed...


Adrien

On 02/07/2019 at 13:50, Eugen Block wrote:

Hi,

did you try to use rbd and rados commands with the cinder keyring, 
not the admin keyring? Did you check if the caps for that client are 
still valid (do the caps differ between the two cinder pools)?


Are the ceph versions on your hypervisors also nautilus?

Regards,
Eugen


Zitat von Adrien Georget :


Hi all,

I'm facing a very strange issue after migrating my Luminous cluster 
to Nautilus.
I have 2 pools configured for Openstack cinder volumes in a multiple-backend 
setup: one "service" Ceph pool with cache tiering and one 
"R&D" Ceph pool.
After the upgrade, the R&D pool became inaccessible for Cinder and 
the cinder-volume service using this pool can't start anymore.
What is strange is that Openstack and Ceph report no error, the Ceph 
cluster is healthy, all OSDs are UP & running, and the "service" pool 
still works fine with the other cinder service on the same 
openstack host.
I followed the upgrade procedure exactly 
(https://ceph.com/releases/v14-2-0-nautilus-released/#upgrading-from-mimic-or-luminous); 
there was no problem during the upgrade, but I can't understand why Cinder 
still fails with this pool.
I can access, list and create volumes on this pool with the rbd or rados 
command from the monitors, but on the Openstack hypervisor the rbd 
and rados ls commands stay stuck, and rados ls gives this message 
(134.158.208.37 is an OSD node, 10.158.246.214 an Openstack 
hypervisor):


2019-07-02 11:26:15.999869 7f63484b4700  0 -- 
10.158.246.214:0/1404677569 >> 134.158.208.37:6884/2457222 
pipe(0x555c2bf96240 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x555c2bf97500).fault



ceph version 14.2.1
Openstack Newton

I spent 2 days checking everything on the Ceph side but couldn't find 
anything problematic...

If you have any hints that could help, I would appreciate it :)

Re: [ceph-users] Cinder pool inaccessible after Nautilus upgrade

2019-07-02 Thread Adrien Georget

Hi Eugen,

The cinder keyring used by the 2 pools is the same; the rbd command 
works using this keyring and the ceph.conf used by Openstack, while the rados 
ls command stays stuck.
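
For what it's worth, a minimal sketch of that check (client.cinder and 
volumes-rd are placeholders for our actual cinder client and R&D pool names):

# caps of the cinder client, to compare against both pools
ceph auth get client.cinder

# repeat the listing test with the cinder identity instead of admin
rbd ls -p volumes-rd --id cinder --keyring /etc/ceph/ceph.client.cinder.keyring
rados -p volumes-rd --id cinder ls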


I tried with the previously used ceph-common version (10.2.5) and with the 
latest ceph version (14.2.1).
With the Nautilus ceph-common version, the 2 cinder-volume services 
crashed...
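
To compare the client side against the cluster, something along these lines 
can be checked (ceph commands on a monitor, the package query on the Openstack 
controller):

# versions of all cluster daemons as seen by the monitors
ceph versions

# feature bits of currently connected clients (old jewel-era clients show up here)
ceph features

# client packages actually installed on the controller
rpm -q ceph-common librados2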


Adrien

On 02/07/2019 at 13:50, Eugen Block wrote:

Hi,

did you try to use rbd and rados commands with the cinder keyring, not 
the admin keyring? Did you check if the caps for that client are still 
valid (do the caps differ between the two cinder pools)?


Are the ceph versions on your hypervisors also nautilus?

Regards,
Eugen


Zitat von Adrien Georget :


Hi all,

I'm facing a very strange issue after migrating my Luminous cluster 
to Nautilus.
I have 2 pools configured for Openstack cinder volumes in a multiple-backend 
setup: one "service" Ceph pool with cache tiering and one 
"R&D" Ceph pool.
After the upgrade, the R&D pool became inaccessible for Cinder and 
the cinder-volume service using this pool can't start anymore.
What is strange is that Openstack and Ceph report no error, the Ceph 
cluster is healthy, all OSDs are UP & running, and the "service" pool 
still works fine with the other cinder service on the same 
openstack host.
I followed the upgrade procedure exactly 
(https://ceph.com/releases/v14-2-0-nautilus-released/#upgrading-from-mimic-or-luminous); 
there was no problem during the upgrade, but I can't understand why Cinder 
still fails with this pool.
I can access, list and create volumes on this pool with the rbd or rados 
command from the monitors, but on the Openstack hypervisor the rbd and 
rados ls commands stay stuck, and rados ls gives this message 
(134.158.208.37 is an OSD node, 10.158.246.214 an Openstack 
hypervisor):


2019-07-02 11:26:15.999869 7f63484b4700  0 -- 
10.158.246.214:0/1404677569 >> 134.158.208.37:6884/2457222 
pipe(0x555c2bf96240 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x555c2bf97500).fault



ceph version 14.2.1
Openstack Newton

I spent 2 days checking everything on the Ceph side but couldn't find 
anything problematic...

If you have any hints that could help, I would appreciate it :)

Adrien




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[ceph-users] Cinder pool inaccessible after Nautilus upgrade

2019-07-02 Thread Adrien Georget

Hi all,

I'm facing a very strange issue after migrating my Luminous cluster to 
Nautilus.
I have 2 pools configured for Openstack cinder volumes in a multiple-backend 
setup: one "service" Ceph pool with cache tiering and one "R&D" 
Ceph pool.
After the upgrade, the R&D pool became inaccessible for Cinder and the 
cinder-volume service using this pool can't start anymore.
What is strange is that Openstack and Ceph report no error, the Ceph cluster 
is healthy, all OSDs are UP & running, and the "service" pool still 
works fine with the other cinder service on the same openstack host.
I followed the upgrade procedure exactly 
(https://ceph.com/releases/v14-2-0-nautilus-released/#upgrading-from-mimic-or-luminous); 
there was no problem during the upgrade, but I can't understand why Cinder 
still fails with this pool.
I can access, list and create volumes on this pool with the rbd or rados command 
from the monitors, but on the Openstack hypervisor the rbd and rados ls 
commands stay stuck, and rados ls gives this message (134.158.208.37 is an 
OSD node, 10.158.246.214 an Openstack hypervisor):


2019-07-02 11:26:15.999869 7f63484b4700  0 -- 
10.158.246.214:0/1404677569 >> 134.158.208.37:6884/2457222 
pipe(0x555c2bf96240 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x555c2bf97500).fault
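
Given that the fault names a specific OSD address and port, a minimal 
reachability test would be something like this (port 6884 is taken from the 
message above):

# from the Openstack hypervisor: is the OSD port from the fault reachable?
nc -zv 134.158.208.37 6884

# on the OSD node: confirm a ceph-osd process is listening on that port
ss -tlnp | grep 6884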



ceph version 14.2.1
Openstack Newton

I spent 2 days checking everything on the Ceph side but couldn't find 
anything problematic...

If you have any hints that could help, I would appreciate it :)

Adrien
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com