[ceph-users] Re: OSD not created after replacing failed disk

2022-07-07 Thread Malte Stroem

Hello Vlad,

- add the following to your yaml and apply it:
unmanaged: true
- go to the server that hosts the failed OSD
- fire up cephadm shell; if it's not there, install it and give the 
server the _admin label:

ceph orch host label add servername _admin
- ceph orch osd rm 494
- ceph-volume lvm deactivate 494
- ceph-volume lvm zap --destroy --osd-id 494
- leave cephadm shell
- check whether the DB, WAL and OSD volumes were removed on the server (lsblk, vgs, lvs)
- if not, remove the leftover volumes by hand with lvremove
- set unmanaged: false and apply the yaml
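
For example (a sketch, not your actual spec; service_id, label and file 
name are placeholders), the temporary flag is just one extra top-level 
field in the OSD spec you already apply:

service_type: osd
service_id: my-osd-spec
placement:
  label: my-osd-hosts
unmanaged: true
spec:
  data_devices:
    rotational: 1

ceph orch apply -i my-osd-spec.yaml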

Best,
Malte

On 07.07.22 at 20:55, Vladimir Brik wrote:

Hello

I am running 17.2.1. We had a disk failure and I followed 
https://docs.ceph.com/en/quincy/cephadm/services/osd/ to replace the OSD 
but it didn't work.


I replaced the failed disk, ran "ceph orch osd rm 494 --replace --zap", 
which stopped and removed the daemon from "ceph orch ps", and deleted 
the WAL/DB LVM volume of the OSD from the NVMe device shared with other 
OSDs. "ceph status" says 710 OSDs total, 709 up. So far so good.


BUT

"ceph status" shows osd.494 as stray, even though it is not running on 
the host, its systemd files have been cleaned up, and "cephadm ls" 
doesn't show it.


A new OSD is not being created. The logs have entries about osd claims 
for ID 494 but nothing is happening.


Re-applying the drive group spec below didn't result in anything:
service_type: osd
service_id: r740xd2-mk2-hdd
service_name: osd.r740xd2-mk2-hdd
placement:
  label: r740xd2-mk2
spec:
  data_devices:
    rotational: 1
  db_devices:
    paths:
    - /dev/nvme0n1

Did I do something incorrectly? What do I need to do to re-create the 
failed OSD?



Vlad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [ext] Re: snap_schedule MGR module not available after upgrade to Quincy

2022-07-07 Thread Kuhring, Mathias
Hey Andreas,

Indeed, we were also able to remove the legacy schedule DB,
and the scheduler is now picking up the work again.
I wouldn't have known where to look for it.
Thanks for your help and all the details. I really appreciate it.
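
For anyone finding this thread in the archives, the commands that did the 
trick for us were essentially these (the pool name is a placeholder for 
the metadata pool of the affected CephFS file system, and any mgr 
restart/failover should work in place of ceph mgr fail):

rados -p <cephfs metadata pool> -N cephfs-snap-schedule rm snap_db_v0
ceph mgr fail

The existing schedules survived, since they had already been migrated to 
the new DB.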

Best, Mathias

On 7/7/2022 11:46 AM, Andreas Teuchert wrote:
> Hello Mathias,
>
> On 06.07.22 18:27, Kuhring, Mathias wrote:
>> Hey Andreas,
>>
>> thanks for the info.
>>
>> We also had our MGR reporting crashes related to the module.
>>
>> We have a second cluster as mirror which we also updated to Quincy.
>> But there the MGR is able to use the snap_module (so "ceph fs
>> snap-schedule status" etc are not complaining).
>> And I'm able to schedule snapshots. But we didn't have any schedules
>> there before the upgrade (due to it being the mirror).
>
> I think in that case there is no RADOS object for the legacy schedule 
> DB, which is handled gracefully by the code.
>
>>
>> I also noticed that this particular part of the code you mentioned
>> hasn't been touched in a year and a half:
>> https://github.com/ceph/ceph/blame/ec95624474b1871a821a912b8c3af68f8f8e7aa1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
>>  
>>
>>
>
> The relevant change was made 17 months ago but it was not backported 
> to Pacific and is only included in Quincy.
>
>> So I'm wondering if my previous schedule entries got somehow
>> incompatible with the new version.
>
> The schedule entries are still the same. What changed is that the 
> sqlite DB they are stored in is no longer stored as a DB dump in a 
> RADOS object in the FS's metadata pool. Instead, the sqlite Ceph VFS 
> driver is now used to store the DB in the metadata pool.
>
>>
>> Do you know if there is any way to reset/cleanup the modules config /
>> database?
>> So remove all the previously scheduled snapshots but without using "fs
>> snap-schedule remove"?
>> We only have a handful of schedules which can easily be recreated.
>> So maybe a clean start would be at least workaround.
>
> We could just solve the problem by deleting the legacy schedule DB 
> after the upgrade:
>
> rados -p <cephfs metadata pool> -N cephfs-snap-schedule rm snap_db_v0
>
> Afterwards the mgr has to be restarted or failed over.
>
> The schedules are still there afterwards because they have already 
> been migrated to the new DB.
>
> Thanks to my colleague Chris Glaubitz for figuring out that the object 
> is in a separate namespace. :-)
>
>>
>> Otherwise we will keep simple cron jobs until these issues are fixed.
>> After all, you just need regularly executed mkdir and rmdir to get you
>> started.
>>
>> Best Wishes,
>> Mathias
>>
>>
>
>
> Best regards,
>
> Andreas
>
>>
>>
>>
>> On 7/6/2022 5:05 PM, Andreas Teuchert wrote:
>>> Hello Mathias and others,
>>>
>>> I also ran into this problem after upgrading from 16.2.9 to 17.2.1.
>>>
>>> Additionally I observed a health warning: "3 mgr modules have recently
>>> crashed".
>>>
>>> Those are actually two distinct crashes that are already in the 
>>> tracker:
>>>
>>> https://tracker.ceph.com/issues/56269 and
>>> https://tracker.ceph.com/issues/56270
>>>
>>> Considering that the crashes are in the snap_schedule module I assume
>>> that they are the reason why the module is not available.
>>>
>>> I can reproduce the crash in 56270 by failing over the mgr.
>>>
>>> I believe that the faulty code causing the error is this line:
>>> https://github.com/ceph/ceph/blob/v17.2.1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
>>>  
>>>
>>>
>>> Instead of ioctx.remove(SNAP_DB_OBJECT_NAME) it should be
>>> ioctx.remove_object(SNAP_DB_OBJECT_NAME).
>>>
>>> (According to my understanding of
>>> https://docs.ceph.com/en/latest/rados/api/python/.)
>>>
>>> Best regards,
>>>
>>> Andreas
>>>
>>>
>>> On 01.07.22 18:05, Kuhring, Mathias wrote:
 Dear Ceph community,

 After upgrading our cluster to Quincy with cephadm (ceph orch upgrade
 start --image quay.io/ceph/ceph:v17.2.1), I struggle to re-activate
 the snapshot schedule module:

 0|0[root@osd-1 ~]# ceph mgr module enable snap_schedule
 0|1[root@osd-1 ~]# ceph mgr module ls | grep snap
 snap_schedule on

 0|0[root@osd-1 ~]# ceph fs snap-schedule list / --recursive
 Error ENOENT: Module 'snap_schedule' is not available

 I tried restarting the MGR daemons and failing over to a restarted one,
 but with no change.

 0|0[root@osd-1 ~]# ceph orch restart mgr
 Scheduled to restart mgr.osd-1 on host 'osd-1'
 Scheduled to restart mgr.osd-2 on host 'osd-2'
 Scheduled to restart mgr.osd-3 on host 'osd-3'
 Scheduled to restart mgr.osd-4.oylrhe on host 'osd-4'
 Scheduled to restart mgr.osd-5.jcfyqe on host 'osd-5'

 0|0[root@osd-1 ~]# ceph orch ps --daemon_type mgr
 NAME       HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
 mgr.osd-1  osd-1  *:8443,9283  running (61s)  35s ago    9M   402M     -        17.2.1   e5af760fa1c1  

[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-07 Thread Konstantin Shalygin
On 7 Jul 2022, at 15:41, Dan van der Ster  wrote:
> 
> How is one supposed to redeploy OSDs on a multi-PB cluster while the
> performance is degraded?

This is a very strong point of view!

It's good that this case can be fixed by setting 
bluestore_prefer_deferred_size_hdd to 128k. I think we need to wait for 
Igor's analysis to see which of these it is:

* this is a bug
* bluestore_prefer_deferred_size_hdd should be increased by the operator 
until the migration to 4k min_alloc_size is finished
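
For reference, a sketch of how that override can be applied cluster-wide 
via the config database (128k = 131072 bytes; adjust the target if you 
only want it on specific OSDs):

ceph config set osd bluestore_prefer_deferred_size_hdd 131072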


k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-07 Thread Dan van der Ster
Hi,

On Thu, Jul 7, 2022 at 2:37 PM Konstantin Shalygin  wrote:
>
> Hi,
>
> On 7 Jul 2022, at 13:04, Dan van der Ster  wrote:
>
> I'm not sure the html mail made it to the lists -- resending in plain text.
> I've also opened https://tracker.ceph.com/issues/56488
>
>
> I think with pacific you need to redeploy all OSDs to respect the new 
> default bluestore_min_alloc_size_hdd = 4096 [1]
> Or not? 
>

Understood, yes, that is another "solution". But it is incredibly
impractical, I would say impossible, for loaded production
installations.
(How is one supposed to redeploy OSDs on a multi-PB cluster while the
performance is degraded?)

-- Dan

>
> [1] https://github.com/ceph/ceph/pull/34588
>
> k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance in Proof-of-Concept cluster

2022-07-07 Thread Marc
> 
> Thanks for sharing. How many nodes/OSDs?, I get the following tail for
> the same command and size=3 (3 nodes, 4 OSD each):

4 nodes, 2x ssd per node.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-07 Thread Dan van der Ster
Hi again,

I'm not sure the html mail made it to the lists -- resending in plain text.
I've also opened https://tracker.ceph.com/issues/56488

Cheers, Dan


On Wed, Jul 6, 2022 at 11:43 PM Dan van der Ster  wrote:
>
> Hi Igor and others,
>
> (apologies for html, but i want to share a plot ;) )
>
> We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados 
> bench -p test 10 write -b 4096 -t 1" latency probe showed something is very 
> wrong with deferred writes in pacific.
> Here is an example cluster, upgraded today:
>
>
> [latency plot not preserved in the plain-text archive]
>
> The OSDs are 12TB HDDs, formatted in nautilus with the default 
> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>
> I found that the performance issue is because 4kB writes are no longer 
> deferred from those pre-pacific hdds to flash in pacific with the default 
> config !!!
> Here are example bench writes from both releases: 
> https://pastebin.com/raw/m0yL1H9Z
>
> I worked out that the issue is fixed if I set 
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific 
> default; note the default was 32k in octopus).
>
> I think this is related to the fixes in https://tracker.ceph.com/issues/52089 
> which landed in 16.2.6 -- _do_alloc_write is comparing the prealloc size 
> 0x10000 with bluestore_prefer_deferred_size_hdd (0x10000), and the 
> "strictly less than" condition prevents deferred writes from ever happening.
>
> So I think this would impact anyone upgrading clusters with hdd/ssd mixed 
> osds ... surely we must not be the only clusters impacted by this?!
>
> Should we increase the default bluestore_prefer_deferred_size_hdd up to 128kB 
> or is there in fact a bug here?
>
> Best Regards,
>
> Dan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance in Proof-of-Concept cluster

2022-07-07 Thread Hans van den Bogert

Hi,

Run a close-to-the-metal benchmark on the disks first (see the fio sketch 
below), just to see the theoretical ceiling.


Also, rerun your benchmarks with random writes, to get more honest 
numbers.
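
A minimal sketch of such a raw-device random-write test with fio (assuming 
libaio is available; /dev/sdX is a placeholder for an SSD that holds no 
data, because this test overwrites it):

fio --name=raw-randwrite --filename=/dev/sdX --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --runtime=60 \
    --time_based --group_reporting

Comparing those IOPS with the per-SSD estimate below shows how much 
overhead Ceph itself adds.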


Based on the numbers so far, you seem to be getting 40k client IOPS at 512 
threads. With 3x replication spread over 3 nodes the two factors cancel 
out, so that translates 1:1 to ~40k backend IOPS per node, i.e. ~10k per 
SSD with 4 OSDs per node. Depending on the benchmark directly on a disk 
(requested above) this can be either good or bad.


You might want to try 2 ceph-osd processes per SSD, just to see if the 
Ceph process is the bottleneck.
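
One hedged way to try that where OSDs are created with ceph-volume (a 
sketch; Proxmox's own pveceph tooling may not expose this option, and 
/dev/sdX is a placeholder for the SSD being redeployed):

ceph-volume lvm batch --osds-per-device 2 /dev/sdX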


Hope this gives you food for thought.

On 7/6/22 13:13, Eneko Lacunza wrote:

Hi all,

We have a proof of concept HCI cluster with Proxmox v7 and Ceph v15.

We have 3 nodes:

2x Intel Xeon 5218 Gold (16 core/32 threads per socket)
Dell PERC H330 Controller (SAS3)
4xSamsung PM1634 3.84TB SAS 12Gb SSD
Network is LACP 2x10Gbps

This cluster is used for some VDI tests, with Windows 10 VMs.

Pool has size=3/min=2 and is used for RBD (KVM/QEMU VMs)

We are seeing Ceph performance reaching about 600 MiB/s read and 500 MiB/s 
write, and about 6,000 read IOPS and 2,000 write IOPS. Reads and writes 
are simultaneous (mixed I/O), as reported by Ceph.


Is this reasonable performance for the hardware we have? We see about 
25-30% CPU used on the nodes, and ceph-osd processes spiking between 
600% and 1000% (I guess that means 6-10 threads fully in use).


I have checked cache for the disks, but they report cache as "Not 
applicable".

BIOS power profile is performance and C states are disabled.

Thanks

Eneko Lacunza
Technical director
Binovo IT Human Project

Tel. +34 943 569 206 |https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io