[ceph-users] Re: ceph-iscsi lock ping pong

2022-12-14 Thread Stolte, Felix
We have been using tgt for five years and switched to ceph-iscsi (LIO
Framework) two months ago. We observed a massive performance boost. Can't say
though if the performance increase was only related to the different software
or if our TGT configuration was not as good as it could have been. Personally
I prefer the ceph-iscsi configuration: it's way easier to set up, and you can
create targets, LUNs, etc. either via gwcli or the Ceph dashboard.
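For readers new to ceph-iscsi, here is a hedged sketch of the gwcli workflow
mentioned above (the target IQN, gateway name/IP, pool and image names are
placeholders, and the exact shell paths can differ between ceph-iscsi
releases):

# gwcli            (interactive shell on one of the gateways)
cd /iscsi-targets
create iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
cd /iscsi-targets/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/gateways
create ceph-gw-1 192.168.0.11
cd /disks
create pool=rbd image=disk_1 size=90G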

Regards
Felix
-
-
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Volker Rieke
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior
-
-

On 13.12.2022 at 23:54, Joe Comeau wrote:

I am curious about what is happening with your iSCSI configuration.
Is this a new iSCSI config or something that has just cropped up?

We have been using VMware with iSCSI for 5+ years.
We are using the kernel iSCSI target rather than tcmu.

We are running ALUA and all datastores are set up as RR.
We routinely reboot the iSCSI gateways during patching and updates, and the
storage migrates to and from all servers without issue.
We usually wait about 10 minutes before a gateway restart, so there is no
outage.

It has been extremely stable for us

Thanks Joe
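Since the path selection policy keeps coming up in this thread, here is a
hedged sketch (not part of the original mails) of how the policy can be
checked and pinned per device on an ESXi host. The device identifier is a
placeholder, and which policy is appropriate depends on the gateway stack;
the discussion below suggests plain round-robin can trigger the lock
ping-pong with tcmu-based gateways.

esxcli storage nmp device list                      # shows the current PSP per device
esxcli storage nmp device set --device naa.60014051234567890abcdef --psp VMW_PSP_MRU
# or change the default policy for the ALUA SATP instead of per device:
esxcli storage nmp satp set --satp VMW_SATP_ALUA --default-psp VMW_PSP_MRU

VMW_PSP_FIXED is another non-round-robin option; check the ceph-iscsi
documentation for the recommendation matching your release.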



>>> Xiubo Li  12/13/2022 4:21 AM >>>

On 13/12/2022 18:57, Stolte, Felix wrote:
> Hi Xiubo,
>
> Thx for pointing me in the right direction. All involved ESX hosts
> seem to use the correct policy. I am going to detach the LUN on each
> host one by one until I find the host causing the problem.
>
From the logs it looks like the client was switching paths in turn.

BTW, what policy are you using?

Thanks

- Xiubo

> Regards Felix
> -
> -
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Volker Rieke
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior
> -
> -
>
>> On 12.12.2022 at 13:03, Xiubo Li wrote:
>>
>> Hi Stolte,
>>
>> For the VMware config could you refer to :
>> https://docs.ceph.com/en/latest/rbd/iscsi-initiator-esx/ ?
>>
>> What's the "Path Selection Policy with ALUA" you are using ? The
>> ceph-iscsi couldn't implement the real AA, so if you use the RR I
>> think it will be like this.
>>
>> - Xiubo
>>
>> On 12/12/2022 17:45, Stolte, Felix wrote:
>>> Hi guys,
>>>
>>> we are using ceph-iscsi to provide block storage for Microsoft Exchange and 
>>> VMware vSphere. The Ceph docs state that you need to configure the Windows iSCSI 
>>> Initiator for fail-over only, but there is no such guidance for VMware. In my 
>>> tcmu-runner logs on both ceph-iscsi gateways I see the following:
>>>
>>> 2022-12-12 10:36:06.978 33789 [WARN] tcmu_notify_lock_lost:222 
>>> rbd/mailbox.vmdk_junet_sata: Async lock drop. Old state 1
>>> 2022-12-12 10:36:06.993 33789 [INFO] alua_implicit_transition:570 
>>> rbd/mailbox.vmdk_junet_sata: Starting lock acquisition operation.
>>> 2022-12-12 10:36:08.064 33789 [WARN] tcmu_rbd_lock:762 
>>> rbd/mailbox.vmdk_junet_sata: Acquired exclusive lock.
>>> 2022-12-12 10:36:09.067 33789 [WARN] tcmu_notify_lock_lost:222 
>>> rbd/mailbox.vmdk_junet_sata: Async lock drop. Old state 1
>>> 2022-12-12 10:36:09.071 33789 [INFO] alua_implicit_transition:570 
>>> rbd/mailbox.vmdk_junet_sata: Starting lock acquisition operation.
>>> 2022-12-12 10:36:10.109 33789 [WARN] tcmu_rbd_lock:762 
>>> rbd/mailbox.vmdk_junet_sata: Acquired exclusive lock.
>>> 2022-12-12 10:36:11.104 33789 [WARN] tcmu_notify_lock_lost:222 
>>> rbd/mailbox.vmdk_junet_sata: Async lock drop. Old state 1
>>> 2022-12-12 10:36:11.106 33789 [INFO] alua_implicit_transition:570 
>>> rbd/mailbox.vmdk_junet_sata: Starting lock acquisition operation.
>>>
>>> At the same time there are these log entries in ceph.audit.logs:
>>> 2022-12-12T10:36:06.731621+0100 mon.mon-k2-1 (mon.1) 3407851 : audit [INF] 
>>> from='client

[ceph-users] Purge OSD does not delete the OSD daemon

2022-12-14 Thread Mevludin Blazevic

Hi all,

while trying to perform an update from Ceph Pacific to the current patch 
version, errors occur due to failed OSD daemons which are still present 
and installed on some Ceph hosts, although I purged the corresponding OSDs 
using the GUI. I am using a Red Hat environment; what is the safe way to 
tell Ceph to also delete a specific daemon ID (not an OSD ID)?


Regards,

Mevludin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Purge OSD does not delete the OSD daemon

2022-12-14 Thread Mevludin Blazevic

Hi,

the strange thing is that on 2 different hosts an OSD daemon with the 
same ID is present (seen by doing ls on /var/lib/ceph/FSID). I am afraid 
that performing a ceph orch daemon rm will remove both OSD daemons, the 
healthy one and the failed one.



On 14.12.2022 at 11:35, Mevludin Blazevic wrote:

Hi all,

while trying to perform an update from Ceph Pacific to the current 
Patch version, errors occure due to failed osd deamon which are still 
present and installed on some Ceph hosts although I purged the 
corresponding OSD using the GUI. I am using a Red Hat environment, 
what is the save way to tell ceph to also delete specific deamon ID 
(not OSD IDs)?


Regards,

Mevludin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Mevludin Blazevic, M.Sc.

University of Koblenz-Landau
Computing Centre (GHRKO)
Universitaetsstrasse 1
D-56070 Koblenz, Germany
Room A023
Tel: +49 261/287-1326

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Purge OSD does not delete the OSD daemon

2022-12-14 Thread Stefan Kooman

On 12/14/22 11:40, Mevludin Blazevic wrote:

Hi,

the strange thing is that on 2 different host, an OSD deamon with the 
same ID is present, by doing ls on /var/lib/ceph/FSID, e.g. I am afraid 
that performing a ceph orch deamon rm will remove both osd deamons, the 
healthy one and the failed one.


IIRC that should not happen. I ran into this some time ago.
Do you get a warning that cephadm found a duplicate OSD? If so, quoting 
the answer Eugen sent earlier to me:


"Check the output of 'cepham ls' on the node where the OSD is not 
running and remove it with 'cephadm rm-daemon --name osd.3'. If there's 
an empty directory for that OSD (/var/lib/ceph/osd/ceph-3) remove it as 
well."


Gr. Stefan
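A hedged sketch of that procedure on a cephadm-managed (containerized)
cluster; osd.3 and the FSID are placeholders, and the exact flags may differ
slightly between releases:

# on the host where the OSD is NOT actually running:
cephadm ls | grep -A3 '"osd.3"'                 # confirm the stale daemon entry is listed here
cephadm rm-daemon --name osd.3 --fsid <cluster-fsid>
# if an empty daemon directory is left behind, remove it as well:
ls /var/lib/ceph/<cluster-fsid>/osd.3 && rm -rf /var/lib/ceph/<cluster-fsid>/osd.3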
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Purge OSD does not delete the OSD daemon

2022-12-14 Thread Mevludin Blazevic

Update: it was removed from the dashboard after 6 minutes.

On 14.12.2022 at 12:11, Stefan Kooman wrote:

On 12/14/22 11:40, Mevludin Blazevic wrote:

Hi,

the strange thing is that on 2 different host, an OSD deamon with the 
same ID is present, by doing ls on /var/lib/ceph/FSID, e.g. I am 
afraid that performing a ceph orch deamon rm will remove both osd 
deamons, the healthy one and the failed one.


IIRC that should not happen. I ran into this some time ago.
Do you get a warning that cephadm found duplicate OSD? If so, quoting 
the answer Eugen send earlier to me:


"Check the output of 'cepham ls' on the node where the OSD is not 
running and remove it with 'cephadm rm-daemon --name osd.3'. If 
there's an empty directory for that OSD (/var/lib/ceph/osd/ceph-3) 
remove it as well."


Gr. Stefan


--
Mevludin Blazevic, M.Sc.

University of Koblenz-Landau
Computing Centre (GHRKO)
Universitaetsstrasse 1
D-56070 Koblenz, Germany
Room A023
Tel: +49 261/287-1326

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] New pool created with 2048 pg_num not executed

2022-12-14 Thread Martin Buss

Hi,

on quincy, I created a new pool

ceph osd pool create cfs_data 2048 2048

6 hosts 71 osds

The autoscaler is off; I find it kind of strange that the pool is created 
with pg_num 1152 and pgp_num 1024, with 2048 only mentioned as the new 
target. I cannot manage to actually make this pool contain 2048 pg_num 
and 2048 pgp_num.


What config option am I missing that does not allow me to grow the pool 
to 2048? Although I specified that pg_num and pgp_num be the same, they 
are not.


Some help and guidance would be appreciated.

Thank you,

Martin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph-volume inventory reports available devices as unavailable

2022-12-14 Thread Frank Schilder
Hi all,

we are using "ceph-volume inventory" for checking if a disk can host an OSD or 
not prior to running "ceph-volume lvm batch". Unfortunately, these two tools 
behave inconsistently. Our use case are SSDs with multiple OSDs per disk and 
re-deploying one of the OSDs on disk (the OSD was purged with "ceph-volume lvm 
zap --osd-id ID" and the left-over volume removed with "lvremove 
OSD-VG/OSD-LV").

Ceph-volume inventory reports a disk as unavailable even though it has space 
for the new OSD. On the other hand, ceph-volume lvm batch happily creates the 
OSD. Expected is that inventory says there is space for an OSD and reports the 
disk as available. Is there any way to get this to behave in a consistent way? 
I don't want to run lvm batch for testing and then try to figure out how to 
interpret the conflicting information.

Example outputs below (for octopus and pacific), each of these disks has 1 OSD 
deployed and space for another one. Thanks for any help!

[root@ceph-adm:ceph-19 ~]# ceph-volume inventory --format json-pretty /dev/sdt
{
"available": false,
"device_id": "KINGSTON_SEDC500M3840G_50026B72825B6A67",
"lsm_data": {},
"lvs": [
{
"block_uuid": "iZGHyl-oY3R-K6va-t6Ji-VxFg-8K0V-Pl978X",
"cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
"cluster_name": "ceph",
"name": "osd-data-4ebd70a7-d51f-4f1c-921e-23269eb050fe",
"osd_fsid": "a4f41f0e-0cf5-4aab-a4bb-390a64cfb01a",
"osd_id": "571",
"osdspec_affinity": "",
"type": "block"
}
],
"path": "/dev/sdt",
"rejected_reasons": [
"LVM detected",
"locked"
],
"sys_api": {
"human_readable_size": "3.49 TB",
"locked": 1,
"model": "KINGSTON SEDC500",
"nr_requests": "256",
"partitions": {},
"path": "/dev/sdt",
"removable": "0",
"rev": "J2.8",
"ro": "0",
"rotational": "0",
"sas_address": "0x500056b317b777ca",
"sas_device_handle": "0x001e",
"scheduler_mode": "mq-deadline",
"sectors": 0,
"sectorsize": "512",
"size": 3840755982336.0,
"support_discard": "512",
"vendor": "ATA"
}
}

[root@ceph-adm:ceph-19 ~]# ceph-volume lvm batch --report --prepare --bluestore 
--no-systemd --crush-device-class rbd_data --osds-per-device 2 -- /dev/sdt
--> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data devices: 1 physical, 0 LVM
--> relative data size: 0.5

Total OSDs: 1

  Type      Path          LV Size    % of device

  data      /dev/sdt      1.75 TB    50.00%



# docker run --rm -v /dev:/dev --privileged --entrypoint /usr/sbin/ceph-volume 
"quay.io/ceph/ceph:v16.2.10" inventory --format json-pretty /dev/sdq
{
"available": false,
"device_id": "",
"lsm_data": {},
"lvs": [
{
"block_uuid": "ZtEuec-S672-meb5-xIQP-D20n-FjsC-jN3tVN",
"cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
"cluster_name": "ceph",
"name": "osd-data-37e894ed-167f-4fcc-a506-dca8bfc6c83f",
"osd_fsid": "eaf62795-7c24-48e4-9f64-c66f42df973a",
"osd_id": "582",
"osdspec_affinity": "",
"type": "block"
}
],
"path": "/dev/sdq",
"rejected_reasons": [
"locked",
"LVM detected"
],
"sys_api": {
"human_readable_size": "3.49 TB",
"locked": 1,
"model": "KINGSTON SEDC500",
"nr_requests": "256",
"partitions": {},
"path": "/dev/sdq",
"removable": "0",
"rev": "J2.8",
"ro": "0",
"rotational": "0",
"sas_address": "0x500056b397fe9ac5",
"sas_device_handle": "0x001b",
"scheduler_mode": "mq-deadline",
"sectors": 0,
"sectorsize": "512",
"size": 3840755982336.0,
"support_discard": "512",
"vendor": "ATA"
}
}

# docker run --rm -v /dev:/dev --privileged --entrypoint /usr/sbin/ceph-volume 
"quay.io/ceph/ceph:v16.2.10" lvm batch --report --prepare --bluestore 
--no-systemd --crush-device-class rbd_data --osds-per-device 2 -- /dev/sdq

Total OSDs: 1

  Type      Path          LV Size    % of device

  data      /dev/sdq      1.75 TB    50.00%
--> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data dev
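Until the inconsistency is resolved, a hedged workaround (not from the
original mail) is to base the "is there room for another OSD?" decision on
something other than inventory's "available" flag, for example the free space
left in the device's ceph volume group, or the machine-readable batch report
itself (the --format option for batch is assumed to be present in your
release):

# free extents in the ceph-volume-created VGs, via plain LVM:
vgs --units g -o vg_name,vg_size,vg_free --select 'vg_name=~^ceph'
# or ask batch itself for a dry run in JSON:
ceph-volume lvm batch --report --format json --osds-per-device 2 -- /dev/sdt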

[ceph-users] Re: pacific: ceph-mon services stopped after OSDs are out/down

2022-12-14 Thread Eugen Block
There's an existing tracker issue [1] that hasn't been updated in a
year. The OP reported that restarting the other MONs resolved it;
have you tried that?


[1] https://tracker.ceph.com/issues/52760
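The "removed from monmap, suicide" message in the logs quoted below suggests
the daemon is simply no longer part of the monmap. A hedged sketch of checking
that and redeploying the mon with cephadm (host and daemon names are taken
from the logs; double-check your 'ceph orch apply mon' placement before
removing anything):

ceph mon dump | grep sparci-store1        # is the mon still listed in the monmap?
ceph orch ps | grep mon                   # what cephadm believes is deployed
ceph orch daemon rm mon.sparci-store1 --force
ceph orch daemon add mon sparci-store1:<mon-ip>   # or let the existing placement spec redeploy it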

Quoting Mevludin Blazevic:

It's very strange. The keyring of the ceph monitor is the same as on
one of the working monitor hosts. The failed mon and the working
mons also have the same SELinux policies and firewalld settings. The
connection is also present, since all OSD daemons are up on the
failed ceph monitor node.


On 13.12.2022 at 11:43, Eugen Block wrote:
So you get "Permission denied" errors; I'm guessing either the mon
keyring is not present (or wrong) or the mon directory doesn't
belong to the ceph user. Can you check


ls -l /var/lib/ceph/FSID/mon.sparci-store1/

Compare the keyring file with the ones on the working mon nodes.

Zitat von Mevludin Blazevic :


Hi Eugen,

I assume the mon db is stored on the "OS disk". I could not find
any error-related lines in cephadm.log; here is what journalctl
-xe tells me:


Dec 13 11:24:21 sparci-store1  
ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug  
2022-12-13T10:24:21.392+ 7f318e1fa700  1  
mon.sparci-store1@-1(???).paxosservice(auth 251..491) refresh  
upgraded, format 0 -> 3
Dec 13 11:24:21 sparci-store1  
ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug  
2022-12-13T10:24:21.397+ 7f3179248700  1 heartbeat_map  
reset_timeout 'Monitor::cpu_tp thread 0x7f3179248700' had timed  
out after 0.0s
Dec 13 11:24:21 sparci-store1  
ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug  
2022-12-13T10:24:21.397+ 7f318e1fa700  0  
mon.sparci-store1@-1(probing) e5  my rank is now 1 (was -1)
Dec 13 11:24:21 sparci-store1  
ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug  
2022-12-13T10:24:21.398+ 7f317ba4d700 -1  
mon.sparci-store1@1(probing) e5 handle_auth_bad_method hmm, they  
didn't like 2 result (13) Permission denied
Dec 13 11:24:21 sparci-store1 systemd[1]: Started Ceph  
mon.sparci-store1 for 8c774934-1535-11ec-973e-525400130e4f.
-- Subject: Unit  
ceph-8c774934-1535-11ec-973e-525400130e4f@mon.sparci-store1.service has  
finished start-up

-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit  
ceph-8c774934-1535-11ec-973e-525400130e4f@mon.sparci-store1.service has  
finished starting up.

--
-- The start-up result is done.
Dec 13 11:24:21 sparci-store1  
ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug  
2022-12-13T10:24:21.599+ 7f317ba4d700 -1  
mon.sparci-store1@1(probing) e5 handle_auth_bad_method hmm, they  
didn't like 2 result (13) Permission denied
Dec 13 11:24:21 sparci-store1  
ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1[786211]: debug  
2022-12-13T10:24:21.600+ 7f3177a45700  0  
mon.sparci-store1@1(probing) e18  removed from monmap, suicide.
Dec 13 11:24:21 sparci-store1 systemd[1]:  
var-lib-containers-storage-overlay-2e67bce8ea3795683c4326479c7169a713e9a7630b31f25d60cd45bbd9fa56bd-merged.mount:  
Succeeded.

-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit  
var-lib-containers-storage-overlay-2e67bce8ea3795683c4326479c7169a713e9a7630b31f25d60cd45bbd9fa56bd-merged.mount has successfully entered the 'dead'  
state.
Dec 13 11:24:21 sparci-store1 bash[786318]: Error: no container  
with name or ID  
"ceph-8c774934-1535-11ec-973e-525400130e4f-mon.sparci-store1"  
found: no such container
Dec 13 11:24:21 sparci-store1 bash[786346]: Error: no container  
with name or ID  
"ceph-8c774934-1535-11ec-973e-525400130e4f-mon-sparci-store1"  
found: no such container
Dec 13 11:24:21 sparci-store1 bash[786375]: Error: no container  
with name or ID  
"ceph-8c774934-1535-11ec-973e-525400130e4f-mon.sparci-store1"  
found: no such container
Dec 13 11:24:21 sparci-store1 systemd[1]:  
ceph-8c774934-1535-11ec-973e-525400130e4f@mon.sparci-store1.service:  
Succeeded.

-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit  
ceph-8c774934-1535-11ec-973e-525400130e4f@mon.sparci-store1.service has  
successfully entered the 'dead' state.


Regards,

Mevludin


On 08.12.2022 at 09:30, Eugen Block wrote:

Hi,

do the MONs use the same SAS interface? They store the mon db on  
local disk, so it might be related. But without any logs or more  
details it's just guessing.


Regards,
Eugen

Zitat von Mevludin Blazevic :


Hi all,

I'm running Pacific with cephadm.

After installation, ceph automatically provisioned 5 ceph monitor
nodes across the cluster. After a few OSDs crashed due to a
hardware-related issue with the SAS interface, 3 monitor
services stopped and won't restart again. Is this related to
the OSD crash problem?


Thanks,
Mevludin

___
ceph-users mailing list -- cep

[ceph-users] Re: New pool created with 2048 pg_num not executed

2022-12-14 Thread Eugen Block

Hi,

are there already existing pools in the cluster? Can you share your  
'ceph osd df tree' as well as 'ceph osd pool ls detail'? It sounds  
like ceph is trying to stay within the limit of mon_max_pg_per_osd  
(default 250).


Regards,
Eugen

Quoting Martin Buss:


Hi,

on quincy, I created a new pool

ceph osd pool create cfs_data 2048 2048

6 hosts 71 osds

autoscaler is off; I find it kind of strange that the pool is  
created with pg_num 1152 and pgp_num 1024, mentioning the 2048 as  
the new target. I cannot manage to actually make this pool contain  
2048 pg_num and 2048 pgp_num.


What config option am I missing that does not allow me to grow the  
pool to 2048? Although I specified pg_num and pgp_num be the same,  
it is not.


Please some help and guidance.

Thank you,

Martin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: New pool created with 2048 pg_num not executed

2022-12-14 Thread Martin Buss

Hi Eugen,

thanks, sure, below:

pg_num stuck at 1152 and pgp_num stuck at 1024

Regards,

Martin

ceph config set global mon_max_pg_per_osd 400

ceph osd pool create cfs_data 2048 2048 --pg_num_min 2048
pool 'cfs_data' created

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off 
last_change 3099 lfor 0/3089/3096 flags hashpspool,bulk stripe_width 0 
target_size_ratio 1 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode off last_change 
2942 lfor 0/0/123 flags hashpspool stripe_width 0 pg_autoscale_bias 4 
pg_num_min 16 recovery_priority 5 application cephfs
pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 2943 flags 
hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 9 'cfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 1152 pgp_num 1024 pg_num_target 2048 pgp_num_target 2048 
autoscale_mode off last_change 3198 lfor 0/0/3198 flags hashpspool 
stripe_width 0 pg_num_min 2048




On 14.12.22 15:10, Eugen Block wrote:

Hi,

are there already existing pools in the cluster? Can you share your 
'ceph osd df tree' as well as 'ceph osd pool ls detail'? It sounds like 
ceph is trying to stay within the limit of mon_max_pg_per_osd (default 
250).


Regards,
Eugen

Zitat von Martin Buss :


Hi,

on quincy, I created a new pool

ceph osd pool create cfs_data 2048 2048

6 hosts 71 osds

autoscaler is off; I find it kind of strange that the pool is created 
with pg_num 1152 and pgp_num 1024, mentioning the 2048 as the new 
target. I cannot manage to actually make this pool contain 2048 pg_num 
and 2048 pgp_num.


What config option am I missing that does not allow me to grow the 
pool to 2048? Although I specified pg_num and pgp_num be the same, it 
is not.


Please some help and guidance.

Thank you,

Martin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume inventory reports available devices as unavailable

2022-12-14 Thread Eugen Block

Hi,

I haven't been dealing with ceph-volume too much lately, but I
remember seeing that when I had multiple DB devices on an SSD and wanted
to replace only one failed drive. Although ceph-volume inventory
reported the disk as unavailable, the actual create command was
successful. But I don't remember which versions were okay and which
weren't; there were multiple regressions in ceph-volume IIRC, and it seems
to be a very complex structure. But apparently '... batch --report' is
more reliable than '... inventory'.


Regards,
Eugen

Quoting Frank Schilder:


Hi all,

we are using "ceph-volume inventory" for checking if a disk can host  
an OSD or not prior to running "ceph-volume lvm batch".  
Unfortunately, these two tools behave inconsistently. Our use case  
are SSDs with multiple OSDs per disk and re-deploying one of the  
OSDs on disk (the OSD was purged with "ceph-volume lvm zap --osd-id  
ID" and the left-over volume removed with "lvremove OSD-VG/OSD-LV").


Ceph-volume inventory reports a disk as unavailable even though it  
has space for the new OSD. On the other hand, ceph-volume lvm batch  
happily creates the OSD. Expected is that inventory says there is  
space for an OSD and reports the disk as available. Is there any way  
to get this to behave in a consistent way? I don't want to run lvm  
batch for testing and then try to figure out how to interpret the  
conflicting information.


Example outputs below (for octopus and pacific), each of these disks  
has 1 OSD deployed and space for another one. Thanks for any help!


[root@ceph-adm:ceph-19 ~]# ceph-volume inventory --format  
json-pretty /dev/sdt

{
"available": false,
"device_id": "KINGSTON_SEDC500M3840G_50026B72825B6A67",
"lsm_data": {},
"lvs": [
{
"block_uuid": "iZGHyl-oY3R-K6va-t6Ji-VxFg-8K0V-Pl978X",
"cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
"cluster_name": "ceph",
"name": "osd-data-4ebd70a7-d51f-4f1c-921e-23269eb050fe",
"osd_fsid": "a4f41f0e-0cf5-4aab-a4bb-390a64cfb01a",
"osd_id": "571",
"osdspec_affinity": "",
"type": "block"
}
],
"path": "/dev/sdt",
"rejected_reasons": [
"LVM detected",
"locked"
],
"sys_api": {
"human_readable_size": "3.49 TB",
"locked": 1,
"model": "KINGSTON SEDC500",
"nr_requests": "256",
"partitions": {},
"path": "/dev/sdt",
"removable": "0",
"rev": "J2.8",
"ro": "0",
"rotational": "0",
"sas_address": "0x500056b317b777ca",
"sas_device_handle": "0x001e",
"scheduler_mode": "mq-deadline",
"sectors": 0,
"sectorsize": "512",
"size": 3840755982336.0,
"support_discard": "512",
"vendor": "ATA"
}
}

[root@ceph-adm:ceph-19 ~]# ceph-volume lvm batch --report --prepare  
--bluestore --no-systemd --crush-device-class rbd_data  
--osds-per-device 2 -- /dev/sdt

--> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data devices: 1 physical, 0 LVM
--> relative data size: 0.5

Total OSDs: 1

  TypePath
 LV Size % of device


  data/dev/sdt
 1.75 TB 50.00%




# docker run --rm -v /dev:/dev --privileged --entrypoint  
/usr/sbin/ceph-volume "quay.io/ceph/ceph:v16.2.10" inventory  
--format json-pretty /dev/sdq

{
"available": false,
"device_id": "",
"lsm_data": {},
"lvs": [
{
"block_uuid": "ZtEuec-S672-meb5-xIQP-D20n-FjsC-jN3tVN",
"cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
"cluster_name": "ceph",
"name": "osd-data-37e894ed-167f-4fcc-a506-dca8bfc6c83f",
"osd_fsid": "eaf62795-7c24-48e4-9f64-c66f42df973a",
"osd_id": "582",
"osdspec_affinity": "",
"type": "block"
}
],
"path": "/dev/sdq",
"rejected_reasons": [
"locked",
"LVM detected"
],
"sys_api": {
"human_readable_size": "3.49 TB",
"locked": 1,
"model": "KINGSTON SEDC500",
"nr_requests": "256",
"partitions": {},
"path": "/dev/sdq",
"removable": "0",
"rev": "J2.8",
"ro": "0",
"rotational": "0",
"sas_address": "0x500056b397fe9ac5",
"sas_device_handle": "0x001b",
"scheduler_mode": "mq-deadline",
"sectors": 0,
"sectorsize": "512",
"size": 3840755982336.0,
"support_discard": "512",
"vendor": "ATA"
}
}

# docker run --rm -v /dev:/dev --privileged --entrypoint  
/usr/sbin/ceph-volume "quay.io/ceph/ceph:v1

[ceph-users] Re: ceph-volume inventory reports available devices as unavailable

2022-12-14 Thread Martin Buss
Hi list admins, I accidentally posted my private address, can you please 
delete that post?


https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JMFG73QMB3MJKHDMNPIKZHQOUUCJPJJN/

Thanks,

Martin

On 14.12.22 15:18, Eugen Block wrote:

Hi,

I haven't been dealing with ceph-volume too much lately, but I remember 
seeing that when I have multiple DB devices on SSD and wanted to replace 
only one failed drive. Although ceph-volume inventory reported the disk 
as unavailable the actual create command was successful. But I don't 
remember which versions were okay and which weren't, there were multiple 
regressions in ceph-volume IIRC, it seems to be a very complex 
structure. But apparently '... batch --report' is more reliable than 
'... inventory'.


Regards,
Eugen

Zitat von Frank Schilder :


Hi all,

we are using "ceph-volume inventory" for checking if a disk can host 
an OSD or not prior to running "ceph-volume lvm batch". Unfortunately, 
these two tools behave inconsistently. Our use case are SSDs with 
multiple OSDs per disk and re-deploying one of the OSDs on disk (the 
OSD was purged with "ceph-volume lvm zap --osd-id ID" and the 
left-over volume removed with "lvremove OSD-VG/OSD-LV").


Ceph-volume inventory reports a disk as unavailable even though it has 
space for the new OSD. On the other hand, ceph-volume lvm batch 
happily creates the OSD. Expected is that inventory says there is 
space for an OSD and reports the disk as available. Is there any way 
to get this to behave in a consistent way? I don't want to run lvm 
batch for testing and then try to figure out how to interpret the 
conflicting information.


Example outputs below (for octopus and pacific), each of these disks 
has 1 OSD deployed and space for another one. Thanks for any help!


[root@ceph-adm:ceph-19 ~]# ceph-volume inventory --format json-pretty 
/dev/sdt

{
    "available": false,
    "device_id": "KINGSTON_SEDC500M3840G_50026B72825B6A67",
    "lsm_data": {},
    "lvs": [
    {
    "block_uuid": "iZGHyl-oY3R-K6va-t6Ji-VxFg-8K0V-Pl978X",
    "cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
    "cluster_name": "ceph",
    "name": "osd-data-4ebd70a7-d51f-4f1c-921e-23269eb050fe",
    "osd_fsid": "a4f41f0e-0cf5-4aab-a4bb-390a64cfb01a",
    "osd_id": "571",
    "osdspec_affinity": "",
    "type": "block"
    }
    ],
    "path": "/dev/sdt",
    "rejected_reasons": [
    "LVM detected",
    "locked"
    ],
    "sys_api": {
    "human_readable_size": "3.49 TB",
    "locked": 1,
    "model": "KINGSTON SEDC500",
    "nr_requests": "256",
    "partitions": {},
    "path": "/dev/sdt",
    "removable": "0",
    "rev": "J2.8",
    "ro": "0",
    "rotational": "0",
    "sas_address": "0x500056b317b777ca",
    "sas_device_handle": "0x001e",
    "scheduler_mode": "mq-deadline",
    "sectors": 0,
    "sectorsize": "512",
    "size": 3840755982336.0,
    "support_discard": "512",
    "vendor": "ATA"
    }
}

[root@ceph-adm:ceph-19 ~]# ceph-volume lvm batch --report --prepare 
--bluestore --no-systemd --crush-device-class rbd_data 
--osds-per-device 2 -- /dev/sdt

--> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data devices: 1 physical, 0 LVM
--> relative data size: 0.5

Total OSDs: 1

  Type    Path   
 LV Size % of device


  data    /dev/sdt   
 1.75 TB 50.00%




# docker run --rm -v /dev:/dev --privileged --entrypoint 
/usr/sbin/ceph-volume "quay.io/ceph/ceph:v16.2.10" inventory --format 
json-pretty /dev/sdq

{
    "available": false,
    "device_id": "",
    "lsm_data": {},
    "lvs": [
    {
    "block_uuid": "ZtEuec-S672-meb5-xIQP-D20n-FjsC-jN3tVN",
    "cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
    "cluster_name": "ceph",
    "name": "osd-data-37e894ed-167f-4fcc-a506-dca8bfc6c83f",
    "osd_fsid": "eaf62795-7c24-48e4-9f64-c66f42df973a",
    "osd_id": "582",
    "osdspec_affinity": "",
    "type": "block"
    }
    ],
    "path": "/dev/sdq",
    "rejected_reasons": [
    "locked",
    "LVM detected"
    ],
    "sys_api": {
    "human_readable_size": "3.49 TB",
    "locked": 1,
    "model": "KINGSTON SEDC500",
    "nr_requests": "256",
    "partitions": {},
    "path": "/dev/sdq",
    "removable": "0",
    "rev": "J2.8",
    "ro": "0",
    "rotational": "0",
    "sas_address": "0x500056b397fe9ac5",
    "sas_device_handle": "0x001b",
    "scheduler_mode": "mq-deadline",
    "sectors": 0,
    "

[ceph-users] Re: New pool created with 2048 pg_num not executed

2022-12-14 Thread Eugen Block

I'm wondering why the cephfs_data pool has mismatching pg_num and pgp_num:

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0  
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off


Does the autoscaler leave it like that when you disable it
in the middle of scaling? What is the current 'ceph status'?



Quoting Martin Buss:


Hi Eugen,

thanks, sure, below:

pg_num stuck at 1152 and pgp_num stuck at 1024

Regards,

Martin

ceph config set global mon_max_pg_per_osd 400

ceph osd pool create cfs_data 2048 2048 --pg_num_min 2048
pool 'cfs_data' created

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0  
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off  
last_change 3099 lfor 0/3089/3096 flags hashpspool,bulk stripe_width  
0 target_size_ratio 1 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0  
object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode off  
last_change 2942 lfor 0/0/123 flags hashpspool stripe_width 0  
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application  
cephfs
pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash  
rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 2943  
flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1  
application mgr
pool 9 'cfs_data' replicated size 3 min_size 2 crush_rule 0  
object_hash rjenkins pg_num 1152 pgp_num 1024 pg_num_target 2048  
pgp_num_target 2048 autoscale_mode off last_change 3198 lfor  
0/0/3198 flags hashpspool stripe_width 0 pg_num_min 2048




On 14.12.22 15:10, Eugen Block wrote:

Hi,

are there already existing pools in the cluster? Can you share your  
'ceph osd df tree' as well as 'ceph osd pool ls detail'? It sounds  
like ceph is trying to stay within the limit of mon_max_pg_per_osd  
(default 250).


Regards,
Eugen

Zitat von Martin Buss :


Hi,

on quincy, I created a new pool

ceph osd pool create cfs_data 2048 2048

6 hosts 71 osds

autoscaler is off; I find it kind of strange that the pool is  
created with pg_num 1152 and pgp_num 1024, mentioning the 2048 as  
the new target. I cannot manage to actually make this pool contain  
2048 pg_num and 2048 pgp_num.


What config option am I missing that does not allow me to grow the  
pool to 2048? Although I specified pg_num and pgp_num be the same,  
it is not.


Please some help and guidance.

Thank you,

Martin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: New pool created with 2048 pg_num not executed

2022-12-14 Thread Martin Buss
cephfs_data has been autoscaling while filling; the mismatched 
numbers are a result of that autoscaling.


The cluster status is WARN as there is still some old stuff backfilling 
on cephfs_data.


The issue is the newly created pool 9, cfs_data, which is stuck at 1152 
pg_num.


PS: can you help me get in touch with the list admin so I can get 
that post, including the private info, deleted?


On 14.12.22 15:41, Eugen Block wrote:

I'm wondering why the cephfs_data pool has mismatching pg_num and pgp_num:

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off


Does disabling the autoscaler leave it like that when you disable it in 
the middle of scaling? What is the current 'ceph status'?



Zitat von Martin Buss :


Hi Eugen,

thanks, sure, below:

pg_num stuck at 1152 and pgp_num stuck at 1024

Regards,

Martin

ceph config set global mon_max_pg_per_osd 400

ceph osd pool create cfs_data 2048 2048 --pg_num_min 2048
pool 'cfs_data' created

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off 
last_change 3099 lfor 0/3089/3096 flags hashpspool,bulk stripe_width 0 
target_size_ratio 1 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode off 
last_change 2942 lfor 0/0/123 flags hashpspool stripe_width 0 
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 2943 flags 
hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 9 'cfs_data' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 1152 pgp_num 1024 pg_num_target 2048 
pgp_num_target 2048 autoscale_mode off last_change 3198 lfor 0/0/3198 
flags hashpspool stripe_width 0 pg_num_min 2048




On 14.12.22 15:10, Eugen Block wrote:

Hi,

are there already existing pools in the cluster? Can you share your 
'ceph osd df tree' as well as 'ceph osd pool ls detail'? It sounds 
like ceph is trying to stay within the limit of mon_max_pg_per_osd 
(default 250).


Regards,
Eugen

Zitat von Martin Buss :


Hi,

on quincy, I created a new pool

ceph osd pool create cfs_data 2048 2048

6 hosts 71 osds

autoscaler is off; I find it kind of strange that the pool is 
created with pg_num 1152 and pgp_num 1024, mentioning the 2048 as 
the new target. I cannot manage to actually make this pool contain 
2048 pg_num and 2048 pgp_num.


What config option am I missing that does not allow me to grow the 
pool to 2048? Although I specified pg_num and pgp_num be the same, 
it is not.


Please some help and guidance.

Thank you,

Martin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: New pool created with 2048 pg_num not executed

2022-12-14 Thread Eugen Block
Then I'd suggest waiting until the backfilling is done and reporting
back if the PGs are still not created. I don't have information about
the ML admin, sorry.
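For context (not part of the original reply): since Nautilus, pg_num and
pgp_num changes are applied gradually toward pg_num_target/pgp_num_target, and
the pace is throttled while data is degraded or misplaced. A hedged sketch of
commands to watch and, cautiously, speed up the ramp (option name assumed to
be unchanged in Quincy):

ceph osd pool get cfs_data pg_num
ceph osd pool get cfs_data pgp_num
ceph osd pool ls detail | grep cfs_data           # shows pg_num_target / pgp_num_target
ceph config get mgr target_max_misplaced_ratio    # default 0.05, i.e. at most ~5% misplaced per step
# ceph config set mgr target_max_misplaced_ratio 0.10   # optional: allow larger steps

Raising the ratio speeds up the pg_num ramp at the cost of more concurrent
data movement.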


Quoting Martin Buss:

that cephfs_data has been autoscaling while filling, the mismatched  
numbers are a result of that autoscaling


the cluster status is WARN as there is still some old stuff  
backfilling on cephfs_data


The issue is the newly created pool 9 cfs_data, which is stuck at 1152 pg_num

ps: can you help me to get in touch with the list admin so I can get  
that post including private info deleted


On 14.12.22 15:41, Eugen Block wrote:

I'm wondering why the cephfs_data pool has mismatching pg_num and pgp_num:

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0  
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off


Does disabling the autoscaler leave it like that when you disable  
it in the middle of scaling? What is the current 'ceph status'?



Zitat von Martin Buss :


Hi Eugen,

thanks, sure, below:

pg_num stuck at 1152 and pgp_num stuck at 1024

Regards,

Martin

ceph config set global mon_max_pg_per_osd 400

ceph osd pool create cfs_data 2048 2048 --pg_num_min 2048
pool 'cfs_data' created

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0  
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off  
last_change 3099 lfor 0/3089/3096 flags hashpspool,bulk  
stripe_width 0 target_size_ratio 1 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0  
object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode off  
last_change 2942 lfor 0/0/123 flags hashpspool stripe_width 0  
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application  
cephfs
pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0  
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off  
last_change 2943 flags hashpspool stripe_width 0 pg_num_max 32  
pg_num_min 1 application mgr
pool 9 'cfs_data' replicated size 3 min_size 2 crush_rule 0  
object_hash rjenkins pg_num 1152 pgp_num 1024 pg_num_target 2048  
pgp_num_target 2048 autoscale_mode off last_change 3198 lfor  
0/0/3198 flags hashpspool stripe_width 0 pg_num_min 2048




On 14.12.22 15:10, Eugen Block wrote:

Hi,

are there already existing pools in the cluster? Can you share  
your 'ceph osd df tree' as well as 'ceph osd pool ls detail'? It  
sounds like ceph is trying to stay within the limit of  
mon_max_pg_per_osd (default 250).


Regards,
Eugen

Zitat von Martin Buss :


Hi,

on quincy, I created a new pool

ceph osd pool create cfs_data 2048 2048

6 hosts 71 osds

autoscaler is off; I find it kind of strange that the pool is  
created with pg_num 1152 and pgp_num 1024, mentioning the 2048  
as the new target. I cannot manage to actually make this pool  
contain 2048 pg_num and 2048 pgp_num.


What config option am I missing that does not allow me to grow  
the pool to 2048? Although I specified pg_num and pgp_num be the  
same, it is not.


Please some help and guidance.

Thank you,

Martin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume inventory reports available devices as unavailable

2022-12-14 Thread Ralph Soika

Hi,

I ran into the same problem. After installing Ceph Quincy on different 
servers, some were able to detect the disks, others not.


My servers are hosted at hetzner.de, and as I did not find a solution, 
I tried different servers until I found ones where Ceph 
detected the disks correctly. The Hetzner support did not help and 
explained it is a software issue. Of course - but it seems to be related 
to a specific hardware configuration.


I narrowed down the problem and installed a blank Debian with 
ceph/cephadm, and only tested "$ sudo cephadm ceph-volume inventory".


See also my issue here:

https://tracker.ceph.com/issues/58189

https://serverfault.com/questions/1117213/why-is-ceph-is-not-detecting-ssd-device-on-a-new-node


Best regards

Ralph



On 14.12.22 at 14:22, Frank Schilder wrote:

Hi all,

we are using "ceph-volume inventory" for checking if a disk can host an OSD or not prior to running 
"ceph-volume lvm batch". Unfortunately, these two tools behave inconsistently. Our use case are SSDs with 
multiple OSDs per disk and re-deploying one of the OSDs on disk (the OSD was purged with "ceph-volume lvm zap 
--osd-id ID" and the left-over volume removed with "lvremove OSD-VG/OSD-LV").

Ceph-volume inventory reports a disk as unavailable even though it has space 
for the new OSD. On the other hand, ceph-volume lvm batch happily creates the 
OSD. Expected is that inventory says there is space for an OSD and reports the 
disk as available. Is there any way to get this to behave in a consistent way? 
I don't want to run lvm batch for testing and then try to figure out how to 
interpret the conflicting information.

Example outputs below (for octopus and pacific), each of these disks has 1 OSD 
deployed and space for another one. Thanks for any help!

[root@ceph-adm:ceph-19 ~]# ceph-volume inventory --format json-pretty /dev/sdt
{
 "available": false,
 "device_id": "KINGSTON_SEDC500M3840G_50026B72825B6A67",
 "lsm_data": {},
 "lvs": [
 {
 "block_uuid": "iZGHyl-oY3R-K6va-t6Ji-VxFg-8K0V-Pl978X",
 "cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
 "cluster_name": "ceph",
 "name": "osd-data-4ebd70a7-d51f-4f1c-921e-23269eb050fe",
 "osd_fsid": "a4f41f0e-0cf5-4aab-a4bb-390a64cfb01a",
 "osd_id": "571",
 "osdspec_affinity": "",
 "type": "block"
 }
 ],
 "path": "/dev/sdt",
 "rejected_reasons": [
 "LVM detected",
 "locked"
 ],
 "sys_api": {
 "human_readable_size": "3.49 TB",
 "locked": 1,
 "model": "KINGSTON SEDC500",
 "nr_requests": "256",
 "partitions": {},
 "path": "/dev/sdt",
 "removable": "0",
 "rev": "J2.8",
 "ro": "0",
 "rotational": "0",
 "sas_address": "0x500056b317b777ca",
 "sas_device_handle": "0x001e",
 "scheduler_mode": "mq-deadline",
 "sectors": 0,
 "sectorsize": "512",
 "size": 3840755982336.0,
 "support_discard": "512",
 "vendor": "ATA"
 }
}

[root@ceph-adm:ceph-19 ~]# ceph-volume lvm batch --report --prepare --bluestore 
--no-systemd --crush-device-class rbd_data --osds-per-device 2 -- /dev/sdt
--> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data devices: 1 physical, 0 LVM
--> relative data size: 0.5

Total OSDs: 1

   TypePathLV 
Size % of device

   data/dev/sdt1.75 
TB 50.00%



# docker run --rm -v /dev:/dev --privileged --entrypoint /usr/sbin/ceph-volume 
"quay.io/ceph/ceph:v16.2.10" inventory --format json-pretty /dev/sdq
{
 "available": false,
 "device_id": "",
 "lsm_data": {},
 "lvs": [
 {
 "block_uuid": "ZtEuec-S672-meb5-xIQP-D20n-FjsC-jN3tVN",
 "cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
 "cluster_name": "ceph",
 "name": "osd-data-37e894ed-167f-4fcc-a506-dca8bfc6c83f",
 "osd_fsid": "eaf62795-7c24-48e4-9f64-c66f42df973a",
 "osd_id": "582",
 "osdspec_affinity": "",
 "type": "block"
 }
 ],
 "path": "/dev/sdq",
 "rejected_reasons": [
 "locked",
 "LVM detected"
 ],
 "sys_api": {
 "human_readable_size": "3.49 TB",
 "locked": 1,
 "model": "KINGSTON SEDC500",
 "nr_requests": "256",
 "partitions": {},
 "path": "/dev/sdq",
 "removable": "0",
 "rev": "J2.8",
 "ro": "0",
 "rotational": "0",
 "sas_address": "0x500056b397fe9ac5",
 "sas_device_handle"

[ceph-users] SLOW_OPS

2022-12-14 Thread Murilo Morais
Good morning everyone.

Guys, today my cluster had a "problem": it was showing SLOW_OPS. When I
restarted the OSDs that were showing this problem everything was solved
(there were VMs stuck because of this). What I'm racking my brain over is
the reason for the SLOW_OPS.

In the logs I saw that the problem started at 04:00 AM and continued until
07:50 AM (when I restarted the OSDs).

I suspect some exaggerated settings that I applied during the initial setup
while performing a test and then forgot about, which may have caused high
RAM usage, leaving at most 400 MB of the 32 GB free. In this case it was
putting 512 PGs in each of two pools, one of which was affected.

In the logs I saw that the problem started when some VMs started to perform
backup actions, increasing writes a little (to a maximum of 300 MBps);
after a few seconds a disk started to show this WARN and also this line:
Dec 14 04:01:01 dcs1.evocorp ceph-mon[639148]: 69 slow requests (by type [
'delayed' : 65 'waiting for sub ops' : 4 ] most affected pool [
'cephfs.ds_disk.data' : 69])

Then it presented these:
Dec 14 04:01:02 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
[WRN] : Health check update: 0 slow ops, oldest one blocked for 36 sec,
daemons [osd.20,osd.5 ] have slow ops. (SLOW_OPS)
[...]
Dec 14 05:52:01 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
[WRN] : Health check update: 149 slow ops, oldest one blocked for 6696 sec,
daemons [osd.20,osd.5 ,osd.50] have slow ops. (SLOW_OPS)

I've already checked SMART and the disks are all OK; I've checked the graphs
generated in Grafana and none of the disks saturate; and there haven't been any
incidents related to the network. That is, I haven't identified any other
problem that could cause this.

What could have caused this event? What can I do to prevent it from
happening again?
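A hedged sketch (not from the original mail) of how stuck ops can be
inspected before resorting to an OSD restart; the daemon IDs come from the
log above, and on a cephadm deployment the admin-socket commands may need to
be run inside 'cephadm shell' or via 'ceph tell':

ceph health detail                        # which OSDs currently report SLOW_OPS
ceph daemon osd.20 dump_ops_in_flight     # blocked ops with their age and flag points
ceph daemon osd.20 dump_historic_slow_ops # recently completed slow ops
ceph tell osd.20 dump_ops_in_flight       # alternative on recent releases, no local socket needed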

Below is some information about the cluster:
5 machines, each with 32 GB RAM, 2 processors and 12 3 TB SAS disks, connected
through 40 Gb interfaces.

# ceph osd tree
ID   CLASS  WEIGHT TYPE NAME   STATUS  REWEIGHT  PRI-AFF
 -1 163.73932  root default
 -3  32.74786  host dcs1
  0hdd2.72899  osd.0   up   1.0  1.0
  1hdd2.72899  osd.1   up   1.0  1.0
  2hdd2.72899  osd.2   up   1.0  1.0
  3hdd2.72899  osd.3   up   1.0  1.0
  4hdd2.72899  osd.4   up   1.0  1.0
  5hdd2.72899  osd.5   up   1.0  1.0
  6hdd2.72899  osd.6   up   1.0  1.0
  7hdd2.72899  osd.7   up   1.0  1.0
  8hdd2.72899  osd.8   up   1.0  1.0
  9hdd2.72899  osd.9   up   1.0  1.0
 10hdd2.72899  osd.10  up   1.0  1.0
 11hdd2.72899  osd.11  up   1.0  1.0
 -5  32.74786  host dcs2
 12hdd2.72899  osd.12  up   1.0  1.0
 13hdd2.72899  osd.13  up   1.0  1.0
 14hdd2.72899  osd.14  up   1.0  1.0
 15hdd2.72899  osd.15  up   1.0  1.0
 16hdd2.72899  osd.16  up   1.0  1.0
 17hdd2.72899  osd.17  up   1.0  1.0
 18hdd2.72899  osd.18  up   1.0  1.0
 19hdd2.72899  osd.19  up   1.0  1.0
 20hdd2.72899  osd.20  up   1.0  1.0
 21hdd2.72899  osd.21  up   1.0  1.0
 22hdd2.72899  osd.22  up   1.0  1.0
 23hdd2.72899  osd.23  up   1.0  1.0
 -7  32.74786  host dcs3
 24hdd2.72899  osd.24  up   1.0  1.0
 25hdd2.72899  osd.25  up   1.0  1.0
 26hdd2.72899  osd.26  up   1.0  1.0
 27hdd2.72899  osd.27  up   1.0  1.0
 28hdd2.72899  osd.28  up   1.0  1.0
 29hdd2.72899  osd.29  up   1.0  1.0
 30hdd2.72899  osd.30  up   1.0  1.0
 31hdd2.72899  osd.31  up   1.0  1.0
 32hdd2.72899  osd.32  up   1.0  1.0
 33hdd2.72899  osd.33  up   1.0  1.0
 34hdd2.72899  osd.34  up   1.0  1.0
 35hdd2.72899  osd.35  up   1.0  1.0
 -9  32.74786  host dcs4
 36hdd2.72899  osd.36  up   1.0  1.0
 37hdd2.72899  osd.37  up   1.0  1.0
 38hdd2.72899  osd.38  up   1.0  1.0
 39hdd2.72899  osd.39  up   1.0  1.0
 40hdd2.72899  osd.40  up   1.0  1.0
 41hdd2.72899  osd.41  up   1.0  1.0
 42hdd 

[ceph-users] Re: Recent ceph.io Performance Blog Posts

2022-12-14 Thread Stefan Kooman

On 11/21/22 10:07, Stefan Kooman wrote:

On 11/8/22 21:20, Mark Nelson wrote:


2.
    https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
    


You tested the impact of network encryption on performance. It would be nice to 
see how OSD encryption (encryption at rest) impacts performance. As far 
as I can see there is not much public information available on this. 
However, there is one thread with this exact question asked [1], and it 
contains an interesting blog post from Cloudflare [2]. I repeated the 
tests from [2] and could draw the same conclusions. TL;DR: performance 
is increased a lot and less CPU is used. Some fio 4k write, iodepth=1, 
performance numbers on a Samsung PM983 3.84 TB drive (Ubuntu 22.04 with 
HWE kernel, 5.15.0-52-generic, AMD EPYC 7302P 16-Core Processor, C-state 
pinning, CPU performance mode on, Samsung PM983 firmware EDA5702Q):


Unencrypted NVMe:

write: IOPS=63.3k, BW=247MiB/s (259MB/s)(62.6GiB/259207msec); 0 zone resets
     clat (nsec): min=13190, max=56400, avg=15397.89, stdev=1506.45
  lat (nsec): min=13250, max=56940, avg=15462.03, stdev=1507.88


Encrypted (without no_write_workqueue / no_read_workqueue):

   write: IOPS=34.8k, BW=136MiB/s (143MB/s)(47.4GiB/357175msec); 0 zone 
resets

     clat (usec): min=24, max=1221, avg=28.12, stdev= 2.98
  lat (usec): min=24, max=1221, avg=28.37, stdev= 2.99


Encrypted (with no_write_workqueue / no_read_workqueue enabled):

write: IOPS=55.7k, BW=218MiB/s (228MB/s)(57.3GiB/269574msec); 0 zone resets
     clat (nsec): min=15710, max=87090, avg=17550.99, stdev=875.72
  lat (nsec): min=15770, max=87150, avg=17614.82, stdev=876.85

So encryption does have a performance impact, but the added latency 
compared to the latency Ceph itself adds to (client) IO seems 
negligible. At least, when the work queues are bypassed, otherwise a lot 
of CPU seems to be involved (loads of kcryptd threads). And that might 
hurt max performance on a system that is CPU bound.


So, I have an update on this. One of our test clusters is now running 
with encrypted drives without the read/write work queues. Compared to 
the default (with work queues) it saves an enormous amount of CPU: no 
more hundreds of kcryptd threads consuming all available CPU.


The diff for ceph-volume encryption.py (pacific 16.2.10 docker image, 
sha256:2b68483bcd050472a18e73389c0e1f3f70d34bb7abf733f692e88c935ea0a6bd):


--- encryption.py   2022-12-07 08:32:50.949778767 +0100
+++ encryption_bypass.py2022-12-07 08:32:25.493558910 +0100
@@ -71,6 +71,8 @@
   '--key-file',
   '-',
   '--allow-discards',  # allow discards (aka TRIM) requests 
for device

+   '--perf-no_read_workqueue', # no read workqueue
+   '--perf-no_write_workqueue', # no write workqueue
   'open',
   device,
   mapping,
@@ -98,6 +100,8 @@
   '--key-file',
   '-',
   '--allow-discards',  # allow discards (aka TRIM) requests 
for device

+   '--perf-no_read_workqueue', # no read workqueue
+   '--perf-no_write_workqueue', # no write workqueue
   'luksOpen',
   device,
   mapping,
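For reference, a hedged sketch of the equivalent manual invocation (flag
names from cryptsetup >= 2.3; device and mapping names are placeholders),
which makes it easy to benchmark the work-queue flags outside of ceph-volume.
Note that the perf flags apply to LUKS2 (or plain dm-crypt) devices:

cryptsetup --allow-discards \
    --perf-no_read_workqueue --perf-no_write_workqueue \
    luksOpen /dev/nvme0n1 test-crypt
cryptsetup status test-crypt    # the flags line should list no_read/no_write_workqueue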

The performance seems to be improved for single threaded IO with 
iodepth=1. The random read performance with iodepth=32 is lower than the 
default (at the cost of extra CPU).


However, that is not all there is to it. Newish cryptsetup will auto 
determine what sector size to use for encryption.


To hard-code it (for testing purposes) the following option can be added 
to the def luks_format(key, device): function:


'--sector-size=4096', # force 4096 sector size for now. Should be auto 
derived from physical_block_size


So, ideally this should be auto determined by ceph-volume. As a matter 
of fact, the util/disk.py script does collect this information. But it 
does not seem to be used here. Info on physical / logical block size can 
be derived from:


/sys/block/device/queue/physical_block_size and 
/sys/block/device/queue/logical_block_size
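A hedged sketch of checking those values and formatting by hand (the device
name is a placeholder; --sector-size requires cryptsetup 2.x and a LUKS2
header):

cat /sys/block/nvme0n1/queue/physical_block_size   # e.g. 4096
cat /sys/block/nvme0n1/queue/logical_block_size    # e.g. 512
blockdev --getpbsz /dev/nvme0n1                    # same information via blockdev
cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/nvme0n1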


According to [1] performance is improved (on NVMe devices) by 2-3%. 
According to this thread [2] you want to use a 4K sector size and only use 
"--perf-no_read_workqueue". I have not tested this combination yet.


Strangely enough, cryptsetup 2.4.3 chose to use a 4096 sector size although 
physical_block_size and logical_block_size were both 512 bytes for a 
SAMSUNG MZQLB3T8HALS-7 disk.


I will reformat an NVMe into 4K native blocks and do a performance 
comparison, both with and without encryption to see what comes out.


The cluster I'm testing on seems to give high variability in the tests, 
so I'm going to set up a new cluster with NVMe only and repeat the 
tests. It would be great if more people could give it a try and post 
their results.


Gr. Stefan

[1]: https://fedoraproject.org/wiki/Changes/LUKSEncryptionSectorSize
[2]: 
https://www.reddit.com/r/Fedora/comments/rzvhyg/default_luks_encryption_settings_o

[ceph-users] Re: SLOW_OPS

2022-12-14 Thread Eugen Block
With 12 OSDs and a default of 4 GB RAM per OSD you would require at
least 48 GB, usually a little more. Even if you reduced the memory
target per OSD, it doesn't mean they can deal with the workload. There
was a thread explaining that a couple of weeks ago.
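A hedged sketch of checking and lowering the per-OSD memory target on
RAM-constrained hosts like these (the value is an example; the OSDs still
need enough headroom to work, as noted above, and cephadm can also autotune
this if osd_memory_target_autotune is enabled):

ceph config get osd osd_memory_target             # default 4294967296 (4 GiB)
ceph config set osd osd_memory_target 2147483648  # e.g. 2 GiB per OSD
ceph tell osd.* config get osd_memory_target      # confirm what the daemons actually use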


Quoting Murilo Morais:


Good morning everyone.

Guys, today my cluster had a "problem", it was showing SLOW_OPS, when
restarting the OSDs that were showing this problem everything was solved
(there were VMs stuck because of this), what I'm breaking my head is to
know the reason for having SLOW_OPS.

In the logs I saw that the problem started at 04:00 AM and continued until
07:50 AM (when I restarted the OSDs).

I suspect some exaggerated settings that I applied during a test in the initial
setup and then forgot about - namely 512 PGs in each of two pools, one of which
was affected - which may have caused the high RAM usage that left at most
400 MB of the 32 GB of memory free.

In the logs I saw that the problem started when some VMs began performing
backup actions, which increased writes a little (to a maximum of 300 MBps);
after a few seconds one disk started to show this WARN, along with this line:
Dec 14 04:01:01 dcs1.evocorp ceph-mon[639148]: 69 slow requests (by type [
'delayed' : 65 'waiting for sub ops' : 4 ] most affected pool [
'cephfs.ds_disk.data' : 69])

Then it presented these:
Dec 14 04:01:02 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
[WRN] : Health check update: 0 slow ops, oldest one blocked for 36 sec,
daemons [osd.20,osd.5 ] have slow ops. (SLOW_OPS)
[...]
Dec 14 05:52:01 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
[WRN] : Health check update: 149 slow ops, oldest one blocked for 6696 sec,
daemons [osd.20,osd.5 ,osd.50] have slow ops. (SLOW_OPS)

I've already checked SMART and the disks are all OK, I've checked the graphs
generated in Grafana and none of the disks saturate, and there haven't been
any incidents related to the network - in other words, I haven't identified
any other problem that could cause this.

What could have caused this event? What can I do to prevent it from
happening again?

Below is some information about the cluster:
5 machines, each with 32 GB RAM, 2 processors and 12 x 3 TB SAS disks,
connected through 40 Gb interfaces.

# ceph osd tree
ID   CLASS  WEIGHT     TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         163.73932  root default
 -3          32.74786      host dcs1
  0    hdd    2.72899          osd.0       up       1.0      1.0
  1    hdd    2.72899          osd.1       up       1.0      1.0
  2    hdd    2.72899          osd.2       up       1.0      1.0
  3    hdd    2.72899          osd.3       up       1.0      1.0
  4    hdd    2.72899          osd.4       up       1.0      1.0
  5    hdd    2.72899          osd.5       up       1.0      1.0
  6    hdd    2.72899          osd.6       up       1.0      1.0
  7    hdd    2.72899          osd.7       up       1.0      1.0
  8    hdd    2.72899          osd.8       up       1.0      1.0
  9    hdd    2.72899          osd.9       up       1.0      1.0
 10    hdd    2.72899          osd.10      up       1.0      1.0
 11    hdd    2.72899          osd.11      up       1.0      1.0
 -5          32.74786      host dcs2
 12    hdd    2.72899          osd.12      up       1.0      1.0
 13    hdd    2.72899          osd.13      up       1.0      1.0
 14    hdd    2.72899          osd.14      up       1.0      1.0
 15    hdd    2.72899          osd.15      up       1.0      1.0
 16    hdd    2.72899          osd.16      up       1.0      1.0
 17    hdd    2.72899          osd.17      up       1.0      1.0
 18    hdd    2.72899          osd.18      up       1.0      1.0
 19    hdd    2.72899          osd.19      up       1.0      1.0
 20    hdd    2.72899          osd.20      up       1.0      1.0
 21    hdd    2.72899          osd.21      up       1.0      1.0
 22    hdd    2.72899          osd.22      up       1.0      1.0
 23    hdd    2.72899          osd.23      up       1.0      1.0
 -7          32.74786      host dcs3
 24    hdd    2.72899          osd.24      up       1.0      1.0
 25    hdd    2.72899          osd.25      up       1.0      1.0
 26    hdd    2.72899          osd.26      up       1.0      1.0
 27    hdd    2.72899          osd.27      up       1.0      1.0
 28    hdd    2.72899          osd.28      up       1.0      1.0
 29    hdd    2.72899          osd.29      up       1.0      1.0
 30    hdd    2.72899          osd.30      up       1.0      1.0
 31    hdd    2.72899          osd.31      up       1.0      1.0
 32    hdd    2.72899          osd.32      up       1.0      1.0
 33    hdd    2.72899          osd.33      up       1.0      1.0
 34    hdd    2.72899          osd.34      up       1.0      1.0
 35    hdd    2.72899          osd.35      up       1.0      1.0
 -9          32.74786      host dcs4
 36    hdd    2.72899          osd.36      up       1.0      1.0
 37    hdd    2.72899          osd.37

[ceph-users] Re: MDS_DAMAGE dir_frag

2022-12-14 Thread Venky Shankar
Hi Sascha,

On Tue, Dec 13, 2022 at 6:43 PM Sascha Lucas  wrote:
>
> Hi,
>
> On Mon, 12 Dec 2022, Sascha Lucas wrote:
>
> > On Mon, 12 Dec 2022, Gregory Farnum wrote:
>
> >> Yes, we’d very much like to understand this. What versions of the server
> >> and kernel client are you using? What platform stack — I see it looks like
> >> you are using CephFS through the volumes interface? The simplest
> >> possibility I can think of here is that you are running with a bad kernel
> >> and it used async ops poorly, maybe? But I don’t remember other spontaneous
> >> corruptions of this type anytime recent.
> >
> > Ceph "servers" like MONs, OSDs, MDSs etc. are all 17.2.5/cephadm/podman. The
> > filesystem kernel clients are co-located on the same hosts running the
> > "servers". For some other reason OS is still RHEL 8.5 (yes with community
> > ceph). Kernel is 4.18.0-348.el8.x86_64 from release media. Just one
> > filesystem kernel client is at 4.18.0-348.23.1.el8_5.x86_64 from EOL of 8.5.
> >
> > Are there known issues with this kernel versions?
> >
> >> Have you run a normal forward scrub (which is non-disruptive) to check if
> >> there are other issues?
> >
> > So far I haven't dared, but will do so tomorrow.
>
> Just an update: "scrub / recursive,repair" does not uncover additional
> errors. But also does not fix the single dirfrag error.

File system scrub does not clear entries from the damage list.

The damage type you are running into ("dir_frag") implies that the
object for directory "V_7770505" is lost (from the metadata pool).
This results in the files under that directory being unavailable. The
good news is that you can regenerate the lost object by scanning the
data pool. This is documented here:


https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#recovery-from-missing-metadata-objects

(You need not run the cephfs-table-tool or cephfs-journal-tool
commands though. Also, this could take time if you have lots of objects
in the data pool.)
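
For reference, the scanning part of that procedure boils down to something
like the following (the data pool name is a placeholder; check the linked
page for the exact sequence for your release, for taking the file system
offline first, and for running multiple workers in parallel):

cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>
cephfs-data-scan scan_links

followed by a forward scrub once the MDS is running again.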

Since you mention that you do not see directory "CV_MAGNETIC" and no
other scrub errors are seen, it's possible that the application using
cephfs removed it since it was no longer needed (the data pool might
have some leftover object though).

>
> Thanks, Sascha.
>
> [2] https://www.spinics.net/lists/ceph-users/msg53202.html
> [3] 
> https://docs.ceph.com/en/quincy/cephfs/disaster-recovery/#metadata-damage-and-repair
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Cheers,
Venky

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Possible auth bug in quincy 17.2.5 on Ubuntu jammy

2022-12-14 Thread J-P Methot

Hi,

I've upgraded to the latest quincy release using cephadm on my test 
cluster (Ubuntu jammy) and I'm running into a very peculiar issue 
regarding user authentication:


-I have a pool called "cinder-replicated" for storing RBDs (application: 
RBD)


-I have a user called cinder with the following authorization caps :

client.cinder
    key: [redacted]
    caps: [mgr] profile rbd
    caps: [mon] profile rbd
    caps: [osd] profile rbd pool=cinder-replicated, profile rbd 
pool=nova-meta, profile rbd pool=glance-meta, profile rbd 
pool=cinder-erasure, profile rbd pool=cinder-meta


-If I use the command "rbd -p cinder-replicated --id cinder -k 
ceph.client.cinder.keyring ls" I get a list of RBDs in the pool, as you 
would expect


-If I use the command "rbd create --id cinder -k 
ceph.client.cinder.keyring --size 1024 cinder-replicated/test2", I get 
"rbd: create error: (22) Invalid argument"


-If I use the command "rbd create --size 1024 cinder-replicated/test2" 
which uses the admin user and keyring by default, I have no problem 
creating the RBD.


The fact that it works with the admin user and not with the cinder user 
makes me believe that it's an authentication issue. A possible cause 
could be that my client is on version 17.2.0 and my cluster is on 
17.2.5, but there don't seem to be official jammy packages for 17.2.5 
yet. Also, the release notes don't indicate any change to ceph auth.
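
A possible next step to narrow this down (just a sketch: the caps are simply 
the ones listed above re-applied in one shot, and the debug switches are 
ordinary Ceph config overrides):

ceph auth get client.cinder
ceph auth caps client.cinder \
    mon 'profile rbd' \
    mgr 'profile rbd' \
    osd 'profile rbd pool=cinder-replicated, profile rbd pool=nova-meta, profile rbd pool=glance-meta, profile rbd pool=cinder-erasure, profile rbd pool=cinder-meta'
rbd create --id cinder -k ceph.client.cinder.keyring --size 1024 \
    cinder-replicated/test2 --debug-rbd=20 --debug-ms=1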


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-12-14 Thread Jakub Jaszewski
Sure, I tried it in a screen session before, but it did not reduce the queue.

I eventually managed to zero the queue by increasing these params:
radosgw-admin gc process --include-all --debug-rgw=20
--rgw-gc-max-concurrent-io=20 --rgw-gc-max-trim-chunk=64
--rgw-gc-processor-max-time=7200
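
To keep those limits across restarts instead of passing them on each run, 
something like the following should work (the config section name depends on 
how your RGW daemons are registered; setting them globally also works):

ceph config set client.rgw rgw_gc_max_concurrent_io 20
ceph config set client.rgw rgw_gc_max_trim_chunk 64
ceph config set client.rgw rgw_gc_processor_max_time 7200
radosgw-admin gc list --include-all | grep -c '"tag"'   # rough count of pending entries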

I think it was a matter of the lock on the GC shard being released before
RGWGC::process finished removing all objects included in that shard;
however, I did not notice any errors in the output with debug-rgw=20:
https://github.com/ceph/ceph/blob/octopus/src/rgw/rgw_gc.cc#L514
Many thanks
Jakub



On Wed, Dec 14, 2022 at 1:24 AM Boris Behrens  wrote:

> You could try to do this in a screen session for a while.
> while true; do radosgw-admin gc process; done
>
> Maybe your normal RGW daemons are too busy for GC processing.
> We have this in our config and have started extra RGW instances for GC
> only:
> [global]
> ...
> # disable garbage collector default
> rgw_enable_gc_threads = false
> [client.gc-host1]
> rgw_frontends = "beast endpoint=[::1]:7489"
> rgw_enable_gc_threads = true
>
> Am Mi., 14. Dez. 2022 um 01:14 Uhr schrieb Jakub Jaszewski
> :
> >
> > Hi Boris, many thanks for the link!
> >
> > I see that GC list keep growing on my cluster and there are some very
> big multipart objects on the GC list, even 138660 parts that I calculate as
> >500GB in size.
> > These objects are visible on the GC list but not on rados-level when
> calling radosgw-admin --bucket=bucket_name bucket radoslist
> > Also I manually called GC process,  radosgw-admin gc process
> --bucket=bucket_name --debug-rgw=20   which according to logs did the job
> (no errors raised although objects do not exist in rados?)
> > ...
> > 2022-12-13T20:21:06.635+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process iterating over entry
> tag='2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD',
> time=2022-12-13T12:35:59.727067+0100, chain.objs.size()=138660
> > 2022-12-13T20:21:06.635+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__multipart_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1
> > 2022-12-13T20:21:06.703+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_1
> > 2022-12-13T20:21:06.859+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_2
> > ...
> > but GC queue did not reduce, objects are still on the GC list.
> >
> > Do you happen to know how to remove non existent RADOS objects from RGW
> GC list ?
> >
> > One more thing i have to check is max_secs=3600 for GC when entering
> particular index_shard. As you can see in the logs, processing of
> multiparted objects takes more than 3600 seconds.  I will try to increase
> rgw_gc_processor_max_time
> >
> > 2022-12-13T20:20:13.168+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process entered with GC index_shard=25, max_secs=3600, expired_only=1
> > 2022-12-13T20:20:13.168+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process cls_rgw_gc_list returned with returned:0, entries.size=0,
> truncated=0, next_marker=''
> > 2022-12-13T20:20:13.172+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process cls_rgw_gc_list returned NO non expired entries, so setting
> cache entry to TRUE
> > 2022-12-13T20:20:27.748+0100 7fe02700  2
> RGWDataChangesLog::ChangesRenewThread: start
> > 2022-12-13T20:20:49.748+0100 7fe02700  2
> RGWDataChangesLog::ChangesRenewThread: start
> > ...
> > 2022-12-13T20:21:05.339+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process cls_rgw_gc_queue_list_entries returned with return value:0,
> entries.size=100, truncated=1, next_marker='4/20986990'
> > 2022-12-13T20:21:06.635+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process iterating over entry
> tag='2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD',
> time=2022-12-13T12:35:59.727067+0100, chain.objs.size()=138660
> > 2022-12-13T20:21:06.635+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__multipart_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1
> > 2022-12-13T20:21:06.703+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_1
> > ...
> > 2022-12-13T21:31:23.505+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__
> >
> shadow_2ib3aonh7thn59a394l06un5

[ceph-users] Re: CephFS constant high write I/O to the metadata pool

2022-12-14 Thread Olli Rajala
Hi,

One thing I now noticed in the mds logs is that there's a ton of entries
like this:
2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
[d345,d346] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
694=484+210)
2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
[d345,d346] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
695=484+211)
2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
[d343,d344] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
694=484+210)
2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
[d343,d344] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
695=484+211)
2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
[d341,d342] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
694=484+210)
2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
[d341,d342] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
695=484+211)
2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
[d33f,d340] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
694=484+210)
2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
[d33f,d340] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
695=484+211)
2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
[d33d,d33e] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
694=484+210)
2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
[d33d,d33e] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
695=484+211)

...and after dropping the caches there are considerably fewer of those -
normal, abnormal, typical, atypical? ...or is that just something that
starts happening after the cache gets filled?

Tnx,
---
Olli Rajala - Lead TD
Anima Vitae Ltd.
www.anima.fi
---


On Sun, Dec 11, 2022 at 9:07 PM Olli Rajala  wrote:

> Hi,
>
> I'm still totally lost with this issue. And now lately I've had a couple
> of incidents where the write bw has suddenly jumped to even crazier levels.
> See the graph here:
> https://gist.github.com/olliRJL/3e97e15a37e8e801a785a1bd5358120d
>
> The points where it drops to something manageable again are when I have
> dropped the mds caches. Usually after the drop there is steady rise but now
> these sudden jumps are something new and even more scary :E
>
> Here's a fresh 2sec level 20 mds log:
> https://gist.github.com/olliRJL/074bec65787085e70db8af0ec35f8148
>
> Any help and ideas greatly appreciated. Is there any tool or procedure to
> safely check or rebuild the mds data? ...if this behaviour could be caused
> by some hidden issue with the data itself.
>
> Tnx,
> ---
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---
>
>
> On Fri, Nov 11, 2022 at 9:14 AM Venky Shankar  wrote:
>
>> On Fri, Nov 11, 2022 at 3:06 AM Olli Rajala  wrote:
>> >
>> > Hi Venky,
>> >
>> > I have indeed observed the output of the different sections of perf
>> dump like so:
>> > watch -n 1 ceph tell mds.`hostname` perf dump objecter
>> > watch -n 1 ceph tell mds.`hostname` perf dump mds_cache
>> > ...etc...
>> >
>> > ...but without any proper understanding of what is a normal rate for
>> some number to go up it's really difficult to make anything from that.
>> >
>> > btw - is there some convenient way to capture this kind of temporal
>> output for others to view. Sure, I could just dump once a second to a file
>> or sequential files but is there some tool or convention that is easy to
>> look at and analyze?
>>
>> Not really - you'd have to do it yourself.
>>
>> >
>> > Tnx,
>> > ---
>> > Olli Rajala - Lead TD
>> > Anima Vitae Ltd.
>> > www.anima.fi
>> > ---
>> >
>> >
>> > On Thu, Nov 10, 2022 at 8:18 AM Venky Shankar 
>> wrote:
>> >>
>> >> Hi Olli,
>> >>
>> >> On Mon, Oct 17, 2022 at 1:08 PM Olli Rajala 
>> wrote:
>> >> >
>> >> > Hi Patrick,
>> >> >
>> >> > With "objecter_ops" did you mean "ceph tell mds.pve-core-1 ops"
>> and/or
>> >> > "ceph tell mds.pve-core-1 objecter_requests"? Both these show very
>> few
>> >> > requests/ops - many times just returning empty lists. I'm pretty
>> sure that
>> >> > this I/O isn't generated by any clients - I've earlier tried to
>> isolate
>> >> > this by shutting down all cephfs clients and this didn't have any
>> >> > noticeable effect.
>> >> >
>> >> > I tried to watch what is going on with that "perf dump" but to be
>> honest
>> >> > all I can see is some numbers going up in the different sections :)
>> >> > ...don't have a clue what to focus on and how to interpret that.
>> >> >
>> >> > Here's a perf dump if you or anyone could make something out of that:
>> >> > https://gist.github.com/olliRJL/43c10173aafd82be22c080a9cd28e673
>> >>
>> >> You'd need to capture this over a period of time to see what ops might
>> >> be going through and what the mds is doing.
>> >>
>> >> >
>> >>

[ceph-users] User + Dev Monthly Meeting happening tomorrow, December 15th!

2022-12-14 Thread Laura Flores
Hi Ceph Users,

The User + Dev Monthly Meeting is coming up tomorrow, *Thursday, December
15th* *@* *3:00pm UTC* (time conversions below). See meeting details at the
bottom of this email.

Please add any topics you'd like to discuss to the agenda:
https://pad.ceph.com/p/ceph-user-dev-monthly-minutes


See you there,
Laura Flores

Meeting link: https://meet.jit.si/ceph-user-dev-monthly

Time conversions:
UTC:   Thursday, December 15, 15:00 UTC
Mountain View, CA, US: Thursday, December 15,  7:00 PST
Phoenix, AZ, US:   Thursday, December 15,  8:00 MST
Denver, CO, US:Thursday, December 15,  8:00 MST
Huntsville, AL, US:Thursday, December 15,  9:00 CST
Raleigh, NC, US:   Thursday, December 15, 10:00 EST
London, England:   Thursday, December 15, 15:00 GMT
Paris, France: Thursday, December 15, 16:00 CET
Helsinki, Finland: Thursday, December 15, 17:00 EET
Tel Aviv, Israel:  Thursday, December 15, 17:00 IST
Pune, India:   Thursday, December 15, 20:30 IST
Brisbane, Australia:   Friday, December 16,  1:00 AEST
Singapore, Asia:   Thursday, December 15, 23:00 +08
Auckland, New Zealand: Friday, December 16,  4:00 NZDT


-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage

Red Hat Inc. 

Chicago, IL

lflo...@redhat.com
M: +17087388804


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recent ceph.io Performance Blog Posts

2022-12-14 Thread Mark Nelson

On 12/14/22 10:09 AM, Stefan Kooman wrote:

On 11/21/22 10:07, Stefan Kooman wrote:

On 11/8/22 21:20, Mark Nelson wrote:


2.
    https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
    


You tested network encryption impact on performance. It would be nice 
to see how OSD encryption (encryption at rest) impacts performance. 
As far as I can see there is not much public information available on 
this. However there is one thread with this exact question asked [1]. 
And it contains an interesting blog post from Cloudlare [2]. I 
repeated the tests from [2] and could draw the same conclusions: 
TL;DR: performance is increased a lot and less CPU is used. Some fio 
4k write, iodepth=1, performance numbers on a Samsung PM983 3.84 TB 
drive (Ubuntu 22.04 with HWE kernel, 5.15.0-52-generic, AMD EPYC 
7302P 16-Core Processor, C-state pinning, CPU performance mode on, 
Samsung PM 983 firmware: EDA5702Q):


Unencrypted NVMe:

write: IOPS=63.3k, BW=247MiB/s (259MB/s)(62.6GiB/259207msec); 0 zone 
resets

 clat (nsec): min=13190, max=56400, avg=15397.89, stdev=1506.45
  lat (nsec): min=13250, max=56940, avg=15462.03, stdev=1507.88


Encrypted (without no_write_workqueue / no_read_workqueue):

   write: IOPS=34.8k, BW=136MiB/s (143MB/s)(47.4GiB/357175msec); 0 
zone resets

 clat (usec): min=24, max=1221, avg=28.12, stdev= 2.98
  lat (usec): min=24, max=1221, avg=28.37, stdev= 2.99


Encrypted (with no_write_workqueue / no_read_workqueue enabled):

write: IOPS=55.7k, BW=218MiB/s (228MB/s)(57.3GiB/269574msec); 0 zone 
resets

 clat (nsec): min=15710, max=87090, avg=17550.99, stdev=875.72
  lat (nsec): min=15770, max=87150, avg=17614.82, stdev=876.85

So encryption does have a performance impact, but the added latency 
compared to the latency Ceph itself adds to (client) IO seems 
negligible. At least, when the work queues are bypassed, otherwise a 
lot of CPU seems to be involved (loads of kcryptd threads). And that 
might hurt max performance on a system that is CPU bound.


So, I have an update on this. One of our test clusters is now running 
with encrypted drives without the read/write work queues. Compared to 
the default (with work queues) it saves an enormous amount of CPU: no 
more hundreds of kcryptd threads consuming all available CPU.


The diff for ceph-volume encryption.py (pacific 16.2.10 docker image, 
sha256:2b68483bcd050472a18e73389c0e1f3f70d34bb7abf733f692e88c935ea0a6bd):


--- encryption.py    2022-12-07 08:32:50.949778767 +0100
+++ encryption_bypass.py    2022-12-07 08:32:25.493558910 +0100
@@ -71,6 +71,8 @@
   '--key-file',
   '-',
   '--allow-discards',  # allow discards (aka TRIM) requests 
for device

+    '--perf-no_read_workqueue', # no read workqueue
+    '--perf-no_write_workqueue', # no write workqueue
   'open',
   device,
   mapping,
@@ -98,6 +100,8 @@
   '--key-file',
   '-',
   '--allow-discards',  # allow discards (aka TRIM) requests 
for device

+    '--perf-no_read_workqueue', # no read workqueue
+    '--perf-no_write_workqueue', # no write workqueue
   'luksOpen',
   device,
   mapping,

The performance seems to be improved for single-threaded IO with 
iodepth=1. The random read performance with iodepth=32 is lower than 
with the default (which gets its higher throughput at the cost of extra CPU).


However, that is not all there is to it. Newish cryptsetup will auto 
determine what sector size to use for encryption.


To hard-code it (for testing purposes) the following option can be 
added to the luks_format(key, device) function:


'--sector-size=4096', # force 4096 sector size for now. Should be 
auto derived from physical_block_size


So, ideally this should be auto determined by ceph-volume. As a matter 
of fact, the util/disk.py script does collect this information. But it 
does not seem to be used here. Info on physical / logical block size 
can be derived from:


/sys/block/device/queue/physical_block_size and 
/sys/block/device/queue/logical_block_size


According to [1] performance is improved (on NVMe devices) by 2-3%. 
According to this thread [2] you want to use a 4K sector size and only 
use "--perf-no_read_workqueue". I have not tested this combination yet.


Strangely enough, cryptsetup 2.4.3 chose a 4096-byte sector size 
although physical_block_size and logical_block_size were both 512 
bytes for a SAMSUNG MZQLB3T8HALS-7 disk.


I will reformat an NVMe into 4K native blocks and do a performance 
comparison, both with and without encryption to see what comes out.


The cluster I'm testing on seems to give high variability in the tests. 
So I'm going to set up a new cluster with NVMe only and repeat the 
tests. It would be great if more people could give it a try and post 
their results.


Gr. Stefan

[1]: https://fedoraproject.org/wiki/Changes/LUKSEncryptionSectorSize
[2]: 
https://www.reddit.com/r/Fedora/comments/rzvhyg/default_

[ceph-users] Re: New pool created with 2048 pg_num not executed

2022-12-14 Thread Martin Buss

will do, that will take another day or so.

Can this have anything to do with osd_pg_bits, which defaults to 6? Some 
operators seem to be working with 8 or 11.

Can you explain what this option means? I could not quite understand it 
from the documentation.


Thanks!

On 14.12.22 16:11, Eugen Block wrote:
Then I'd suggest to wait until the backfilling is done and then report 
back if the PGs are still not created. I don't have information about 
the ML admin, sorry.


Zitat von Martin Buss :

that cephfs_data has been autoscaling while filling, the mismatched 
numbers are a result of that autoscaling


the cluster status is WARN as there is still some old stuff 
backfilling on cephfs_data


The issue is the newly created pool 9 cfs_data, which is stuck at 1152 
pg_num


ps: can you help me to get in touch with the list admin so I can get 
that post including private info deleted


On 14.12.22 15:41, Eugen Block wrote:
I'm wondering why the cephfs_data pool has mismatching pg_num and 
pgp_num:


pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off


Does disabling the autoscaler leave it like that when you disable it 
in the middle of scaling? What is the current 'ceph status'?



Zitat von Martin Buss :


Hi Eugen,

thanks, sure, below:

pg_num stuck at 1152 and pgp_num stuck at 1024

Regards,

Martin

ceph config set global mon_max_pg_per_osd 400

ceph osd pool create cfs_data 2048 2048 --pg_num_min 2048
pool 'cfs_data' created

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off 
last_change 3099 lfor 0/3089/3096 flags hashpspool,bulk stripe_width 
0 target_size_ratio 1 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode off 
last_change 2942 lfor 0/0/123 flags hashpspool stripe_width 0 
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application 
cephfs
pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 2943 
flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 
application mgr
pool 9 'cfs_data' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 1152 pgp_num 1024 pg_num_target 2048 
pgp_num_target 2048 autoscale_mode off last_change 3198 lfor 
0/0/3198 flags hashpspool stripe_width 0 pg_num_min 2048




On 14.12.22 15:10, Eugen Block wrote:

Hi,

are there already existing pools in the cluster? Can you share your 
'ceph osd df tree' as well as 'ceph osd pool ls detail'? It sounds 
like ceph is trying to stay within the limit of mon_max_pg_per_osd 
(default 250).


Regards,
Eugen

Zitat von Martin Buss :


Hi,

on quincy, I created a new pool

ceph osd pool create cfs_data 2048 2048

6 hosts 71 osds

autoscaler is off; I find it kind of strange that the pool is 
created with pg_num 1152 and pgp_num 1024, mentioning the 2048 as 
the new target. I cannot manage to actually make this pool contain 
2048 pg_num and 2048 pgp_num.


What config option am I missing that does not allow me to grow the 
pool to 2048? Although I specified pg_num and pgp_num be the same, 
it is not.


Please some help and guidance.

Thank you,

Martin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsu

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-iscsi lock ping pong

2022-12-14 Thread Joe Comeau
That's correct - we use the kernel target not tcmu-runner


>>> Xiubo Li  12/13/2022 6:02 PM >>>

On 14/12/2022 06:54, Joe Comeau wrote:
> I am curious about what is happening with your iscsi configuration
> Is this a new iscsi config or something that has just cropped up ?
>   
> We are using/have been using vmware for 5+ years with iscsi
> We are using the kernel iscsi vs tcmu
>   

Do you mean you are using kernel target, not the ceph-iscsi/tcmu-runner 
in user space, right ?

> We are running ALUA and all datastores are setup as RR
> We routinely reboot the iscsi gateways - during patching and updates and the 
> storage migrates to and from all servers without issue
> We usually wait about 10 minutes before a gateway restart, so there is not an 
> outage
>   
> It has been extremely stable for us
>   
> Thanks Joe
>   
>
>
 Xiubo Li  12/13/2022 4:21 AM >>>
> On 13/12/2022 18:57, Stolte, Felix wrote:
>> Hi Xiubo,
>>
>> Thx for pointing me into the right direction. All involved esx host
>> seem to use the correct policy. I am going to detach the LUN on each
>> host one by one until i found the host causing the problem.
>>
>  From the logs it means the client was switching the path in turn.
>
> BTW, what's policy are you using ?
>
> Thanks
>
> - Xiubo
>
>> Regards Felix
>> -
>> -
>> Forschungszentrum Juelich GmbH
>> 52425 Juelich
>> Sitz der Gesellschaft: Juelich
>> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
>> Vorsitzender des Aufsichtsrats: MinDir Volker Rieke
>> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
>> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
>> Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior
>> -
>> -
>>
>>> Am 12.12.2022 um 13:03 schrieb Xiubo Li :
>>>
>>> Hi Stolte,
>>>
>>> For the VMware config could you refer to :
>>> https://docs.ceph.com/en/latest/rbd/iscsi-initiator-esx/ ?
>>>
>>> What's the "Path Selection Policy with ALUA" you are using ? The
>>> ceph-iscsi couldn't implement the real AA, so if you use the RR I
>>> think it will be like this.
>>>
>>> - Xiubo
>>>
>>> On 12/12/2022 17:45, Stolte, Felix wrote:
 Hi guys,

 we are using ceph-iscsi to provide block storage for Microsoft Exchange 
 and vmware vsphere. Ceph docs state that you need to configure Windows 
 iSCSI Initatior for fail-over-only but there is no such point for vmware. 
 In my tcmu-runner logs on both ceph-iscsi gateways I see the following:

 2022-12-12 10:36:06.978 33789 [WARN] tcmu_notify_lock_lost:222 
 rbd/mailbox.vmdk_junet_sata: Async lock drop. Old state 1
 2022-12-12 10:36:06.993 33789 [INFO] alua_implicit_transition:570 
 rbd/mailbox.vmdk_junet_sata: Starting lock acquisition operation.
 2022-12-12 10:36:08.064 33789 [WARN] tcmu_rbd_lock:762 
 rbd/mailbox.vmdk_junet_sata: Acquired exclusive lock.
 2022-12-12 10:36:09.067 33789 [WARN] tcmu_notify_lock_lost:222 
 rbd/mailbox.vmdk_junet_sata: Async lock drop. Old state 1
 2022-12-12 10:36:09.071 33789 [INFO] alua_implicit_transition:570 
 rbd/mailbox.vmdk_junet_sata: Starting lock acquisition operation.
 2022-12-12 10:36:10.109 33789 [WARN] tcmu_rbd_lock:762 
 rbd/mailbox.vmdk_junet_sata: Acquired exclusive lock.
 2022-12-12 10:36:11.104 33789 [WARN] tcmu_notify_lock_lost:222 
 rbd/mailbox.vmdk_junet_sata: Async lock drop. Old state 1
 2022-12-12 10:36:11.106 33789 [INFO] alua_implicit_transition:570 
 rbd/mailbox.vmdk_junet_sata: Starting lock acquisition operation.

 At the same time there are these log entries in ceph.audit.logs:
 2022-12-12T10:36:06.731621+0100 mon.mon-k2-1 (mon.1) 3407851 : audit [INF] 
 from='client.? 10.100.8.55:0/2392201639' entity='client.admin' 
 cmd=[{"prefix": "osd blocklist", "blocklistop": "add", "addr": "10
 .100.8.56:0/1598475844"}]: dispatch
 2022-12-12T10:36:06.731913+0100 mon.mon-e2-1 (mon.0) 783726 : audit [INF] 
 from='client.? ' entity='client.admin' cmd=[{"prefix": "osd blocklist", 
 "blocklistop": "add", "addr": "10.100.8.56:0/1598475844"}]
 : dispatch
 2022-12-12T10:36:06.905082+0100 mon.mon-e2-1 (mon.0) 783727 : audit [INF] 
 from='client.? ' entity='client.admin' cmd='[{"prefix": "osd blocklist", 
 "blocklistop": "add", "addr": "10.100.8.56:0/1598475844"}
 ]': finished

 Can someone explaint to me, what is happening? Why are the gateways 
 blacklisting each other? All involved daemons are running Version 16.2.10. 
 ceph-iscsi gateways are running on Ubuntu 20.04 with ceph-isci package 
 from the Ubuntu repo (all o

[ceph-users] Re: New pool created with 2048 pg_num not executed

2022-12-14 Thread Frank Schilder
Hi Eugen: déjà vu again?

I think the way autoscaler code in the MGRs interferes with operations is 
extremely confusing.

Could this be the same issue I and somebody else had a while ago? Even though 
autoscaler is disabled, there are parts of it in the MGR still interfering. One 
of the essential config options was target_max_misplaced_ratio, which needs to 
be set to 1 if you want to have all PGs created regardless of how many objects 
are misplaced.

The thread was 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/WST6K5A4UQGGISBFGJEZS4HFL2VVWW32

In addition, the PG splitting will stop if recovery IO is going on (some 
objects are degraded).
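
A quick way to check whether this is what is biting you (the commands are 
generic; 0.05 is the stock default):

ceph config get mgr target_max_misplaced_ratio   # default 0.05, i.e. 5% misplaced
ceph osd pool ls detail | grep pg_num            # compare pg_num/pgp_num with their *_target values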

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Martin Buss 
Sent: 14 December 2022 19:32
To: ceph-users@ceph.io
Subject: [ceph-users] Re: New pool created with 2048 pg_num not executed

will do, that will take another day or so.

Can this have to do anything with
osd_pg_bits that defaults to 6
some operators seem to be working with 8 or 11

Can you explain what this option means? I could not quite understand
from the documentation.

Thanks!

On 14.12.22 16:11, Eugen Block wrote:
> Then I'd suggest to wait until the backfilling is done and then report
> back if the PGs are still not created. I don't have information about
> the ML admin, sorry.
>
> Zitat von Martin Buss :
>
>> that cephfs_data has been autoscaling while filling, the mismatched
>> numbers are a result of that autoscaling
>>
>> the cluster status is WARN as there is still some old stuff
>> backfilling on cephfs_data
>>
>> The issue is the newly created pool 9 cfs_data, which is stuck at 1152
>> pg_num
>>
>> ps: can you help me to get in touch with the list admin so I can get
>> that post including private info deleted
>>
>> On 14.12.22 15:41, Eugen Block wrote:
>>> I'm wondering why the cephfs_data pool has mismatching pg_num and
>>> pgp_num:
>>>
 pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
 object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off
>>>
>>> Does disabling the autoscaler leave it like that when you disable it
>>> in the middle of scaling? What is the current 'ceph status'?
>>>
>>>
>>> Zitat von Martin Buss :
>>>
 Hi Eugen,

 thanks, sure, below:

 pg_num stuck at 1152 and pgp_num stuck at 1024

 Regards,

 Martin

 ceph config set global mon_max_pg_per_osd 400

 ceph osd pool create cfs_data 2048 2048 --pg_num_min 2048
 pool 'cfs_data' created

 pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
 object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off
 last_change 3099 lfor 0/3089/3096 flags hashpspool,bulk stripe_width
 0 target_size_ratio 1 application cephfs
 pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0
 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode off
 last_change 2942 lfor 0/0/123 flags hashpspool stripe_width 0
 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application
 cephfs
 pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
 rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 2943
 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1
 application mgr
 pool 9 'cfs_data' replicated size 3 min_size 2 crush_rule 0
 object_hash rjenkins pg_num 1152 pgp_num 1024 pg_num_target 2048
 pgp_num_target 2048 autoscale_mode off last_change 3198 lfor
 0/0/3198 flags hashpspool stripe_width 0 pg_num_min 2048



 On 14.12.22 15:10, Eugen Block wrote:
> Hi,
>
> are there already existing pools in the cluster? Can you share your
> 'ceph osd df tree' as well as 'ceph osd pool ls detail'? It sounds
> like ceph is trying to stay within the limit of mon_max_pg_per_osd
> (default 250).
>
> Regards,
> Eugen
>
> Zitat von Martin Buss :
>
>> Hi,
>>
>> on quincy, I created a new pool
>>
>> ceph osd pool create cfs_data 2048 2048
>>
>> 6 hosts 71 osds
>>
>> autoscaler is off; I find it kind of strange that the pool is
>> created with pg_num 1152 and pgp_num 1024, mentioning the 2048 as
>> the new target. I cannot manage to actually make this pool contain
>> 2048 pg_num and 2048 pgp_num.
>>
>> What config option am I missing that does not allow me to grow the
>> pool to 2048? Although I specified pg_num and pgp_num be the same,
>> it is not.
>>
>> Please some help and guidance.
>>
>> Thank you,
>>
>> Martin
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubs

[ceph-users] Re: ceph-volume inventory reports available devices as unavailable

2022-12-14 Thread Frank Schilder
Hi Eugen,

thanks for that. I guess the sane insane logic could be that if 
"rejected_reasons": ["LVM detected", "locked"], the disk has at least 1 OSD (or 
something ceph-ish) already and lvm batch would do something non-trivial 
(report-json not empty), one should consider the disk as "available".

Shame that the deployment tools are so inconsistent. It would be much easier to 
repair things if there was an easy way to query what is possible, how much 
space on a drive could be used and for what, etc.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 14 December 2022 15:18:11
To: ceph-users@ceph.io
Subject: [ceph-users] Re: ceph-volume inventory reports available devices as 
unavailable

Hi,

I haven't been dealing with ceph-volume too much lately, but I
remember seeing that when I had multiple DB devices on SSD and wanted
to replace only one failed drive. Although ceph-volume inventory
reported the disk as unavailable, the actual create command was
successful. But I don't remember which versions were okay and which
weren't; there were multiple regressions in ceph-volume IIRC, it seems
to be a very complex structure. But apparently '... batch --report' is
more reliable than '... inventory'.
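
As a pre-flight check before (re-)deploying, something along these lines is
probably more trustworthy than inventory (the flags are the ones from the
batch call quoted below; --format json is only there to make the output
easy to script against):

ceph-volume lvm batch --report --format json --prepare --bluestore \
    --no-systemd --crush-device-class rbd_data --osds-per-device 2 -- /dev/sdt

If the report comes back empty, batch would not create anything on that device.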

Regards,
Eugen

Zitat von Frank Schilder :

> Hi all,
>
> we are using "ceph-volume inventory" for checking if a disk can host
> an OSD or not prior to running "ceph-volume lvm batch".
> Unfortunately, these two tools behave inconsistently. Our use case
> are SSDs with multiple OSDs per disk and re-deploying one of the
> OSDs on disk (the OSD was purged with "ceph-volume lvm zap --osd-id
> ID" and the left-over volume removed with "lvremove OSD-VG/OSD-LV").
>
> Ceph-volume inventory reports a disk as unavailable even though it
> has space for the new OSD. On the other hand, ceph-volume lvm batch
> happily creates the OSD. Expected is that inventory says there is
> space for an OSD and reports the disk as available. Is there any way
> to get this to behave in a consistent way? I don't want to run lvm
> batch for testing and then try to figure out how to interpret the
> conflicting information.
>
> Example outputs below (for octopus and pacific), each of these disks
> has 1 OSD deployed and space for another one. Thanks for any help!
>
> [root@ceph-adm:ceph-19 ~]# ceph-volume inventory --format
> json-pretty /dev/sdt
> {
> "available": false,
> "device_id": "KINGSTON_SEDC500M3840G_50026B72825B6A67",
> "lsm_data": {},
> "lvs": [
> {
> "block_uuid": "iZGHyl-oY3R-K6va-t6Ji-VxFg-8K0V-Pl978X",
> "cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
> "cluster_name": "ceph",
> "name": "osd-data-4ebd70a7-d51f-4f1c-921e-23269eb050fe",
> "osd_fsid": "a4f41f0e-0cf5-4aab-a4bb-390a64cfb01a",
> "osd_id": "571",
> "osdspec_affinity": "",
> "type": "block"
> }
> ],
> "path": "/dev/sdt",
> "rejected_reasons": [
> "LVM detected",
> "locked"
> ],
> "sys_api": {
> "human_readable_size": "3.49 TB",
> "locked": 1,
> "model": "KINGSTON SEDC500",
> "nr_requests": "256",
> "partitions": {},
> "path": "/dev/sdt",
> "removable": "0",
> "rev": "J2.8",
> "ro": "0",
> "rotational": "0",
> "sas_address": "0x500056b317b777ca",
> "sas_device_handle": "0x001e",
> "scheduler_mode": "mq-deadline",
> "sectors": 0,
> "sectorsize": "512",
> "size": 3840755982336.0,
> "support_discard": "512",
> "vendor": "ATA"
> }
> }
>
> [root@ceph-adm:ceph-19 ~]# ceph-volume lvm batch --report --prepare
> --bluestore --no-systemd --crush-device-class rbd_data
> --osds-per-device 2 -- /dev/sdt
> --> DEPRECATION NOTICE
> --> You are using the legacy automatic disk sorting behavior
> --> The Pacific release will change the default to --no-auto
> --> passed data devices: 1 physical, 0 LVM
> --> relative data size: 0.5
>
> Total OSDs: 1
>
>   TypePath
>  LV Size % of device
> 
>   data/dev/sdt
>  1.75 TB 50.00%
>
>
>
> # docker run --rm -v /dev:/dev --privileged --entrypoint
> /usr/sbin/ceph-volume "quay.io/ceph/ceph:v16.2.10" inventory
> --format json-pretty /dev/sdq
> {
> "available": false,
> "device_id": "",
> "lsm_data": {},
> "lvs": [
> {
> "block_uuid": "ZtEuec-S672-meb5-xIQP-D20n-FjsC-jN3tVN",
> "cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
> "cluster_name": "ceph",
> "name": "osd-data-37e894ed-167f-4fcc-a506-dca8bfc6c83f",
> "osd_fsid": "eaf62795-7c24-48e4-9f64-c66f42df973a",
> "osd_id": "582",
> "

[ceph-users] Re: New pool created with 2048 pg_num not executed

2022-12-14 Thread Martin Buss

Hi Frank,

thanks for coming in on this, setting target_max_misplaced_ratio to 1 
does not help


Regards,
Martin

On 14.12.22 21:32, Frank Schilder wrote:

Hi Eugen: déjà vu again?

I think the way autoscaler code in the MGRs interferes with operations is 
extremely confusing.

Could this be the same issue I and somebody else had a while ago? Even though 
autoscaler is disabled, there are parts of it in the MGR still interfering. One 
of the essential config options was target_max_misplaced_ratio, which needs to 
be set to 1 if you want to have all PGs created regardless of how many objects 
are misplaced.

The thread was 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/WST6K5A4UQGGISBFGJEZS4HFL2VVWW32

In addition, the PG splitting will stop if recovery IO is going on (some 
objects are degraded).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Martin Buss 
Sent: 14 December 2022 19:32
To: ceph-users@ceph.io
Subject: [ceph-users] Re: New pool created with 2048 pg_num not executed

will do, that will take another day or so.

Can this have to do anything with
osd_pg_bits that defaults to 6
some operators seem to be working with 8 or 11

Can you explain what this option means? I could not quite understand
from the documentation.

Thanks!

On 14.12.22 16:11, Eugen Block wrote:

Then I'd suggest to wait until the backfilling is done and then report
back if the PGs are still not created. I don't have information about
the ML admin, sorry.

Zitat von Martin Buss :


that cephfs_data has been autoscaling while filling, the mismatched
numbers are a result of that autoscaling

the cluster status is WARN as there is still some old stuff
backfilling on cephfs_data

The issue is the newly created pool 9 cfs_data, which is stuck at 1152
pg_num

ps: can you help me to get in touch with the list admin so I can get
that post including private info deleted

On 14.12.22 15:41, Eugen Block wrote:

I'm wondering why the cephfs_data pool has mismatching pg_num and
pgp_num:


pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off


Does disabling the autoscaler leave it like that when you disable it
in the middle of scaling? What is the current 'ceph status'?


Zitat von Martin Buss :


Hi Eugen,

thanks, sure, below:

pg_num stuck at 1152 and pgp_num stuck at 1024

Regards,

Martin

ceph config set global mon_max_pg_per_osd 400

ceph osd pool create cfs_data 2048 2048 --pg_num_min 2048
pool 'cfs_data' created

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off
last_change 3099 lfor 0/3089/3096 flags hashpspool,bulk stripe_width
0 target_size_ratio 1 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode off
last_change 2942 lfor 0/0/123 flags hashpspool stripe_width 0
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application
cephfs
pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 2943
flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1
application mgr
pool 9 'cfs_data' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 1152 pgp_num 1024 pg_num_target 2048
pgp_num_target 2048 autoscale_mode off last_change 3198 lfor
0/0/3198 flags hashpspool stripe_width 0 pg_num_min 2048



On 14.12.22 15:10, Eugen Block wrote:

Hi,

are there already existing pools in the cluster? Can you share your
'ceph osd df tree' as well as 'ceph osd pool ls detail'? It sounds
like ceph is trying to stay within the limit of mon_max_pg_per_osd
(default 250).

Regards,
Eugen

Zitat von Martin Buss :


Hi,

on quincy, I created a new pool

ceph osd pool create cfs_data 2048 2048

6 hosts 71 osds

autoscaler is off; I find it kind of strange that the pool is
created with pg_num 1152 and pgp_num 1024, mentioning the 2048 as
the new target. I cannot manage to actually make this pool contain
2048 pg_num and 2048 pgp_num.

What config option am I missing that does not allow me to grow the
pool to 2048? Although I specified pg_num and pgp_num be the same,
it is not.

Please some help and guidance.

Thank you,

Martin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: New pool created with 2048 pg_num not executed

2022-12-14 Thread Frank Schilder
Hi Martin,

I can't find the output of

ceph osd df tree
ceph status

anywhere. I thought you posted it, but well. Could you please post the output 
of these commands?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Martin Buss 
Sent: 14 December 2022 22:02:43
To: Frank Schilder; ceph-users@ceph.io
Cc: Eugen Block
Subject: Re: [ceph-users] Re: New pool created with 2048 pg_num not executed

Hi Frank,

thanks for coming in on this, setting target_max_misplaced_ratio to 1
does not help

Regards,
Martin

On 14.12.22 21:32, Frank Schilder wrote:
> Hi Eugen: déjà vu again?
>
> I think the way autoscaler code in the MGRs interferes with operations is 
> extremely confusing.
>
> Could this be the same issue I and somebody else had a while ago? Even though 
> autoscaler is disabled, there are parts of it in the MGR still interfering. 
> One of the essential config options was target_max_misplaced_ratio, which 
> needs to be set to 1 if you want to have all PGs created regardless of how 
> many objects are misplaced.
>
> The thread was 
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/WST6K5A4UQGGISBFGJEZS4HFL2VVWW32
>
> In addition, the PG splitting will stop if recovery IO is going on (some 
> objects are degraded).
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Martin Buss 
> Sent: 14 December 2022 19:32
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: New pool created with 2048 pg_num not executed
>
> will do, that will take another day or so.
>
> Can this have to do anything with
> osd_pg_bits that defaults to 6
> some operators seem to be working with 8 or 11
>
> Can you explain what this option means? I could not quite understand
> from the documentation.
>
> Thanks!
>
> On 14.12.22 16:11, Eugen Block wrote:
>> Then I'd suggest to wait until the backfilling is done and then report
>> back if the PGs are still not created. I don't have information about
>> the ML admin, sorry.
>>
>> Zitat von Martin Buss :
>>
>>> that cephfs_data has been autoscaling while filling, the mismatched
>>> numbers are a result of that autoscaling
>>>
>>> the cluster status is WARN as there is still some old stuff
>>> backfilling on cephfs_data
>>>
>>> The issue is the newly created pool 9 cfs_data, which is stuck at 1152
>>> pg_num
>>>
>>> ps: can you help me to get in touch with the list admin so I can get
>>> that post including private info deleted
>>>
>>> On 14.12.22 15:41, Eugen Block wrote:
 I'm wondering why the cephfs_data pool has mismatching pg_num and
 pgp_num:

> pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off

 Does disabling the autoscaler leave it like that when you disable it
 in the middle of scaling? What is the current 'ceph status'?


 Zitat von Martin Buss :

> Hi Eugen,
>
> thanks, sure, below:
>
> pg_num stuck at 1152 and pgp_num stuck at 1024
>
> Regards,
>
> Martin
>
> ceph config set global mon_max_pg_per_osd 400
>
> ceph osd pool create cfs_data 2048 2048 --pg_num_min 2048
> pool 'cfs_data' created
>
> pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off
> last_change 3099 lfor 0/3089/3096 flags hashpspool,bulk stripe_width
> 0 target_size_ratio 1 application cephfs
> pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode off
> last_change 2942 lfor 0/0/123 flags hashpspool stripe_width 0
> pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application
> cephfs
> pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 2943
> flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1
> application mgr
> pool 9 'cfs_data' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 1152 pgp_num 1024 pg_num_target 2048
> pgp_num_target 2048 autoscale_mode off last_change 3198 lfor
> 0/0/3198 flags hashpspool stripe_width 0 pg_num_min 2048
>
>
>
> On 14.12.22 15:10, Eugen Block wrote:
>> Hi,
>>
>> are there already existing pools in the cluster? Can you share your
>> 'ceph osd df tree' as well as 'ceph osd pool ls detail'? It sounds
>> like ceph is trying to stay within the limit of mon_max_pg_per_osd
>> (default 250).
>>
>> Regards,
>> Eugen
>>
>> Zitat von Martin Buss :
>>
>>> Hi,
>>>
>>> on quincy, I created a new pool
>>>
>>> ceph osd pool create cfs_data 2048 2048
>>>
>>> 6 hosts 71 osds
>>>
>>> autoscaler is off; I find it 

[ceph-users] rgw: "failed to read header: bad method" after PutObject failed with 404 (NoSuchBucket)

2022-12-14 Thread Stefan Reuter

Hi,

When I try to upload an object to a non-existing bucket, PutObject 
returns a 404 Not Found with error code NoSuchBucket as expected.


Trying to create the bucket afterwards however results in a 400 Bad 
Request error which is not expected. The rgw logs indicate "failed to 
read header: bad method". This also happens when sending other requests 
like HeadBucket or GetObject after the failed PutObject request.


It looks like the failed PutObject request causes the HTTP parsing to 
fail afterwards, maybe due to (the body of) the PutObject request not 
being consumed completely.


The following python script reproces the issue:


#!/usr/bin/python

import boto3

s3_endpoint_url = ""
s3_access_key_id = ""
s3_secret_access_key = ""

s3 = boto3.resource('s3',
'',
use_ssl = False,
verify = False,
endpoint_url = s3_endpoint_url,
aws_access_key_id = s3_access_key_id,
aws_secret_access_key = s3_secret_access_key,
)

try:
s3.meta.client.put_object(Bucket='foo', Key='bar', Body='body')
except s3.meta.client.exceptions.NoSuchBucket:
pass

s3.meta.client.create_bucket(Bucket='foo')


Traceback (most recent call last):
  File "/tmp/badMethod.py", line 23, in 
s3.meta.client.create_bucket(Bucket='foo')
  File "/usr/lib/python3.10/site-packages/botocore/client.py", line 
514, in _api_call

return self._make_api_call(operation_name, kwargs)
  File "/usr/lib/python3.10/site-packages/botocore/client.py", line 
938, in _make_api_call

raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (400) when calling 
the CreateBucket operation: Bad Request


in the rgw logs:

1 failed to read header: bad method
1 == req done http_status=400 ==

ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy 
(stable)


=Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: New pool created with 2048 pg_num not executed

2022-12-14 Thread Martin Buss
After backfilling was complete, I was able to increase pg_num and 
pgp_num on the empty pool cfs_data in increments of 128 all the way up 
to 2048; that was fine.


This is not working for the filled pool.

pg_num 187, pgp_num 59
I am trying to increase that in small increments:

set nobackfill
set norebalance

then increase

It does not go beyond this

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 195 pgp_num 67 pg_num_target 256 
pgp_num_target 195 autoscale_mode off last_change 3319 lfor 0/3089/3315 
flags hashpspool,bulk stripe_width 0 target_size_ratio 1 application cephfs


So does this mean I can only go in increments of 8 and then have to wait 
for rebalancing / backfill? If so, it will take several months given 
that the pool is already filled with 90 million objects.


Fortunately, the data is cold backup only, so if I cannot find another 
way to increment in larger steps I will have to delete the pool and restart.




ceph status
  cluster:
id: c3f53dc2-6fec-11ed-8f82-8d92bac89f1e
health: HEALTH_WARN
4 clients failing to advance oldest client/flush tid
nobackfill,norebalance,noscrub,nodeep-scrub flag(s) set
161 pgs not deep-scrubbed in time
148 pgs not scrubbed in time
1 pools have pg_num > pgp_num

  services:
mon: 6 daemons, quorum i01,i02,i03,i04,i05,i06 (age 24h)
mgr: i05.cubljm(active, since 41h), standbys: i02.yshlju, 
i03.fxfpta, i04.bjgfeu, i06.blyjkk, i01.nbmavd

mds: 6/6 daemons up
osd: 71 osds: 71 up (since 24h), 71 in (since 24h); 21 remapped pgs
 flags nobackfill,norebalance,noscrub,nodeep-scrub

  data:
volumes: 1/1 healthy
pools:   3 pools, 212 pgs
objects: 90.33M objects, 41 TiB
usage:   128 TiB used, 510 TiB / 639 TiB avail
pgs: 24452271/270990393 objects misplaced (9.023%)
 191 active+clean
 21  active+remapped+backfilling

 ceph osd df tree
ID  CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         638.52063         -  639 TiB  128 TiB  128 TiB  137 GiB  507 GiB  510 TiB  20.08  1.00    -          root default
-3         100.05257         -  100 TiB   22 TiB   22 TiB   28 GiB   83 GiB   78 TiB  22.13  1.10    -              host i01
 0    hdd    9.09569   1.0     9.1 TiB  1.3 TiB  1.3 TiB  5.8 GiB  6.2 GiB  7.8 TiB  14.22  0.71    8      up          osd.0
 7    hdd    9.09569   1.0     9.1 TiB  656 GiB  653 GiB    1 KiB  2.7 GiB  8.5 TiB   7.04  0.35    3      up          osd.7
13    hdd    9.09569   1.0     9.1 TiB  1.3 TiB  1.3 TiB  2.9 GiB  4.5 GiB  7.8 TiB  14.20  0.71    7      up          osd.13
19    hdd    9.09569   1.0     9.1 TiB  3.2 TiB  3.2 TiB    1 KiB   10 GiB  5.9 TiB  35.40  1.76   16      up          osd.19
25    hdd    9.09569   1.0     9.1 TiB  1.3 TiB  1.3 TiB    1 KiB  3.9 GiB  7.8 TiB  14.19  0.71    6      up          osd.25
31    hdd    9.09569   1.0     9.1 TiB  3.2 TiB  3.2 TiB  8.3 GiB   14 GiB  5.9 TiB  35.50  1.77   18      up          osd.31
38    hdd    9.09569   1.0     9.1 TiB  2.6 TiB  2.6 TiB  5.6 GiB  9.0 GiB  6.5 TiB  28.35  1.41   14      up          osd.38
44    hdd    9.09569   1.0     9.1 TiB  1.9 TiB  1.9 TiB  5.6 GiB  6.9 GiB  7.2 TiB  21.28  1.06   11      up          osd.44
50    hdd    9.09569   1.0     9.1 TiB  2.6 TiB  2.6 TiB    1 KiB  9.3 GiB  6.5 TiB  28.36  1.41   13      up          osd.50
56    hdd    9.09569   1.0     9.1 TiB  1.5 TiB  1.5 TiB    1 KiB  5.7 GiB  7.6 TiB  16.34  0.81    6      up          osd.56
62    hdd    9.09569   1.0     9.1 TiB  2.6 TiB  2.6 TiB    1 KiB   10 GiB  6.5 TiB  28.52  1.42   12      up          osd.62
-5         103.69336         -  104 TiB   19 TiB   18 TiB   20 GiB   76 GiB   85 TiB  17.85  0.89    -              host i02
 5    hdd    9.09569   1.0     9.1 TiB  660 GiB  655 GiB  2.9 GiB  2.8 GiB  8.5 TiB   7.09  0.35    4      up          osd.5
11    hdd    7.27739   1.0     7.3 TiB  2.6 TiB  2.6 TiB    1 KiB  7.9 GiB  4.7 TiB  35.34  1.76   12      up          osd.11
12    hdd    9.09569   1.0     9.1 TiB  662 GiB  659 GiB    1 KiB  3.6 GiB  8.4 TiB   7.11  0.35    3      up          osd.12
18    hdd    9.09569   1.0     9.1 TiB  669 GiB  659 GiB  5.7 GiB  3.8 GiB  8.4 TiB   7.18  0.36    5      up          osd.18
24    hdd    7.27739   1.0     7.3 TiB  3.2 TiB  3.2 TiB    1 KiB  9.7 GiB  4.1 TiB  44.24  2.20   16      up          osd.24
30    hdd    7.27739   1.0     7.3 TiB  2.6 TiB  2.6 TiB  3.0 GiB  8.2 GiB  4.7 TiB  35.42  1.76   13      up          osd.30
36    hdd    9.09569   1.0     9.1 TiB  1.9 TiB  1.9 TiB  2.8 GiB   12 GiB  7.2 TiB  21.33  1.06   10      up          osd.36
42    hdd    9.09569   1.0     9.1 TiB  1.3 TiB  1.3 TiB    1 KiB  4.1 GiB  7.8 TiB  14.14  0.70    6      up          osd.42
48    hdd    9.09569   1.0     9.1 TiB   82 MiB   28 MiB    1 KiB   54 MiB  9.1 TiB      0

[ceph-users] Re: New pool created with 2048 pg_num not executed

2022-12-14 Thread Martin Buss

Hi Frank and Eugen,

target_max_misplaced_ratio 1

did the trick. Now I can increase pg_num and pgp_num in steps of 128.
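
A minimal sketch of one such step, assuming the ratio is set on the mgr 
(as in the thread Frank linked) and continuing from the pg_num 1152 
reported earlier:

  ceph config set mgr target_max_misplaced_ratio 1
  ceph osd pool set cfs_data pg_num 1280
  ceph osd pool set cfs_data pgp_num 1280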


Thanks!


On 14.12.22 21:32, Frank Schilder wrote:

Hi Eugen: déjà vu again?

I think the way the autoscaler code in the MGRs interferes with operations is 
extremely confusing.

Could this be the same issue that somebody else and I had a while ago? Even though 
the autoscaler is disabled, parts of it in the MGR are still interfering. One 
of the essential config options was target_max_misplaced_ratio, which needs to 
be set to 1 if you want all PGs created regardless of how many objects are 
misplaced.
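
For reference, the currently effective value can be checked with 
something like:

  ceph config get mgr target_max_misplaced_ratio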

The thread was 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/WST6K5A4UQGGISBFGJEZS4HFL2VVWW32

In addition, the PG splitting will stop if recovery IO is going on (some 
objects are degraded).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Martin Buss 
Sent: 14 December 2022 19:32
To: ceph-users@ceph.io
Subject: [ceph-users] Re: New pool created with 2048 pg_num not executed

Will do, that will take another day or so.

Can this have anything to do with osd_pg_bits, which defaults to 6?
Some operators seem to be working with 8 or 11.

Can you explain what this option means? I could not quite understand it
from the documentation.

Thanks!

On 14.12.22 16:11, Eugen Block wrote:

Then I'd suggest waiting until the backfilling is done and reporting
back if the PGs are still not created. I don't have information about
the ML admin, sorry.

Zitat von Martin Buss :


cephfs_data was autoscaling while it was being filled; the mismatched
numbers are a result of that autoscaling.

The cluster status is WARN because there is still some old stuff
backfilling on cephfs_data.

The issue is the newly created pool 9, cfs_data, which is stuck at
pg_num 1152.

PS: Can you help me get in touch with the list admin so I can get
that post, which includes private info, deleted?

On 14.12.22 15:41, Eugen Block wrote:

I'm wondering why the cephfs_data pool has mismatching pg_num and
pgp_num:


pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off


Does the autoscaler leave it like that when you disable it in the
middle of scaling? What is the current 'ceph status'?


Zitat von Martin Buss :


Hi Eugen,

thanks, sure, below:

pg_num stuck at 1152 and pgp_num stuck at 1024

Regards,

Martin

ceph config set global mon_max_pg_per_osd 400

ceph osd pool create cfs_data 2048 2048 --pg_num_min 2048
pool 'cfs_data' created

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 187 pgp_num 59 autoscale_mode off
last_change 3099 lfor 0/3089/3096 flags hashpspool,bulk stripe_width
0 target_size_ratio 1 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode off
last_change 2942 lfor 0/0/123 flags hashpspool stripe_width 0
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application
cephfs
pool 3 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 2943
flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1
application mgr
pool 9 'cfs_data' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 1152 pgp_num 1024 pg_num_target 2048
pgp_num_target 2048 autoscale_mode off last_change 3198 lfor
0/0/3198 flags hashpspool stripe_width 0 pg_num_min 2048



On 14.12.22 15:10, Eugen Block wrote:

Hi,

Are there already existing pools in the cluster? Can you share your
'ceph osd df tree' as well as 'ceph osd pool ls detail'? It sounds
like ceph is trying to stay within the limit of mon_max_pg_per_osd
(default 250).
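
For example, the currently configured limit can be checked with
something like:

  ceph config get mon mon_max_pg_per_osd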

Regards,
Eugen

Zitat von Martin Buss :


Hi,

on quincy, I created a new pool

ceph osd pool create cfs_data 2048 2048

6 hosts 71 osds

The autoscaler is off; I find it kind of strange that the pool is
created with pg_num 1152 and pgp_num 1024, with 2048 only listed as
the new target. I cannot manage to actually bring this pool to
pg_num 2048 and pgp_num 2048.

What config option am I missing that does not allow me to grow the
pool to 2048? Although I specified that pg_num and pgp_num should be
the same, they are not.

Please some help and guidance.

Thank you,

Martin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: MDS_DAMAGE dir_frag

2022-12-14 Thread Sascha Lucas

Hi Venky,

On Wed, 14 Dec 2022, Venky Shankar wrote:


On Tue, Dec 13, 2022 at 6:43 PM Sascha Lucas  wrote:



Just an update: "scrub / recursive,repair" does not uncover additional
errors, but it also does not fix the single dirfrag error.


File system scrub does not clear entries from the damage list.

The damage type you are running into ("dir_frag") implies that the
object for directory "V_7770505" is lost (from the metadata pool).
This results in files under that directory being unavailable. The good
news is that you can regenerate the lost object by scanning the data
pool. This is documented here:

https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#recovery-from-missing-metadata-objects

(You need not run the cephfs-table-tool or cephfs-journal-tool
commands though. Also, this could take time if you have lots of
objects in the data pool.)

Since you mention that you do not see directory "CV_MAGNETIC" and no
other scrub errors are seen, it's possible that the application using
cephfs removed it since it was no longer needed (the data pool might
have some leftover object though).


Thanks a lot for your help. Just to be clear: it's the directory structure 
CV_MAGNETIC/V_7770505, where V_7770505 cannot be seen/found, but the 
parent dir CV_MAGNETIC still exists.


However, this strengthens the idea that the application has removed the 
V_7770505 directory itself. Otherwise one would expect to still find/see 
this directory, but empty. Right?


If that is the case, there is no data needing recovery, just a cleanup of 
orphan objects.


Also very helpful: knowing which parts of the disaster-recovery-experts docs 
to run and which commands to skip. This seems to boil down to:


cephfs-data-scan init|scan_extents|scan_inodes|scan_links|cleanup
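
A rough sketch of that sequence, per the disaster-recovery-experts doc 
(the data pool name is a placeholder here, and the optional 
--worker_n/--worker_m sharding described in the doc can be used to run 
scan_extents/scan_inodes with several workers in parallel):

  cephfs-data-scan init
  cephfs-data-scan scan_extents <data pool>
  cephfs-data-scan scan_inodes <data pool>
  cephfs-data-scan scan_links
  cephfs-data-scan cleanup <data pool>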

The data pool has ~100M objects. I assume data scanning cannot be 
done while the filesystem is online/in use?


It remains a mystery how this damage could happen...

Thanks, Sascha.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io