[ceph-users] Re: snaptrim blocks io on ceph pacific even on fast NVMEs

2021-11-10 Thread Anthony D'Atri


> How many osd you have on 1 nvme drives?
> We increased 2/nvme to 4/nvme and it improved the snap-trimming quite a lot.

Interesting.  Most analyses I’ve seen report diminishing returns with more than 
two OSDs per.

There are definitely serialization bottlenecks in the PG and OSD code, so I’m 
curious re the number and size of the NVMe devices you’re using, and especially 
their PG ratio.  Not lowballing the PGs per OSD can have a similar effect with 
less impact on CPU and RAM.  YMMV.
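
For anyone who wants to sanity-check this on their own cluster, the per-OSD PG count is easy to read off with standard commands (a generic sketch, not specific to the poster's hardware):

# the PGS column shows placement groups per OSD; compare it against your target ratio
ceph osd df tree
# per-pool PG targets, if the autoscaler module is available on your release
ceph osd pool autoscale-status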

> I guess the utilisation of the nvmes when you snaptrim is not 100%.

Take the iostat %util field with a grain of salt, like the load average.  Both 
are traditional metrics whose meanings have diffused as systems have evolved 
over the years.

— aad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: slow operation observed for _collection_list

2021-11-10 Thread Сергей Процун
No, you cannot do online compaction.

Fri, 5 Nov 2021, 17:22, Szabo, Istvan (Agoda) <
istvan.sz...@agoda.com> wrote:

> Seems like it can help, but after 1-2 days it comes back on different and
> in some cases on the same osd as well.
> Is there any other way to compact online as it compacts offline?
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> From: Szabo, Istvan (Agoda)
> Sent: Friday, October 29, 2021 8:43 PM
> To: Igor Fedotov 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] slow operation observed for _collection_list
>
> I can give a try again, but before migrated all db back to data I did
> compaction on all osd.
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
>
> On 2021. Oct 29., at 15:02, Igor Fedotov <igor.fedo...@croit.io> wrote:
> Email received from the internet. If in doubt, don't click any link nor
> open any attachment !
> 
>
> Please manually compact the DB using ceph-kvstore-tool for all the
> affected OSDs (or preferably every OSD in the cluster). Highly likely
> you're facing RocksDB performance degradation caused by prior bulk data
> removal. Setting bluefs_buffered_io to true (if not yet set) might be
> helpful as well.
>
>
> On 10/29/2021 3:22 PM, Szabo, Istvan (Agoda) wrote:
>
> Hi,
>
> Having slow ops and laggy pgs due to osd is not accessible (octopus
> 15.2.14 version and 15.2.10 also).
> At the time when slow ops started, in the osd log I can see:
>
> "7f2a8d68f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread
> 0x7f2a70de5700' had timed out after 15"
>
> And this blocks the io until the radosgateway didn't restart itself.
> Is this a bug or something else?
>
> In the ceph.log I can see also that specific osd is reported failed from
> another osds:
>
> 2021-10-29T05:49:34.386857+0700 mon.server-3s01 (mon.0) 3576376 : cluster
> [DBG] osd.7 reported failed by osd.31
> 2021-10-29T05:49:34.454037+0700 mon.server-3s01 (mon.0) 3576377 : cluster
> [DBG] osd.7 reported failed by osd.22
> 2021-10-29T05:49:34.666758+0700 mon.server-3s01 (mon.0) 3576379 : cluster
> [DBG] osd.7 reported failed by osd.6
> 2021-10-29T05:49:34.807714+0700 mon.server-3s01 (mon.0) 3576382 : cluster
> [DBG] osd.7 reported failed by osd.11
>
> Here is the osd log: https://justpaste.it/4x4h2
> Here is the ceph.log itself: https://justpaste.it/5bk8k
> Here is some additional information regarding memory usage and
> backtrace...: https://justpaste.it/1tmjg
>
> Thank you
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 2 zones for a single RGW cluster

2021-11-10 Thread prosergey07
Yes. You just need to create a separate zone with radosgw-admin and the 
corresponding pool names for that rgw zone. Then on the radosgw host you need 
to put the rgw zone it should operate in into ceph.conf.

Sent from a Galaxy device

 Original message 
From: J-P Methot 
Date: 11.11.21 00:18 (GMT+02:00)
To: ceph-users@ceph.io
Subject: [ceph-users] 2 zones for a single RGW cluster

Hi,
Is it possible to have 2 rgw zones on a single Ceph cluster? Of course, each 
zone would have a different pool on the cluster.
--
Jean-Philippe Méthot
Senior Openstack system administrator / Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-us...@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
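
For readers looking for the concrete steps behind the reply above, a minimal sketch follows; the zone name, endpoint, and client section name are placeholders, and the exact flags can vary between releases:

# create the second zone in the existing zonegroup (names are examples)
radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=zone2 --endpoints=http://rgw2.example.com:8080
# commit the period so the new zone becomes active
radosgw-admin period update --commit
# then, in ceph.conf on the radosgw host that should serve this zone:
#   [client.rgw.rgw2]
#   rgw_zone = zone2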
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Сергей Процун
The rgw.meta pool contains user, bucket, and bucket instance metadata.

rgw.bucket.index contains the bucket indexes, aka shards. If you have 32
shards you will have 32 objects in that pool, .dir.BUCKET_ID.0 through .31,
each holding part of your object listing; objects are hashed into the
corresponding shard. The objects in bucket.index are zero-sized: all of their
data is OMAP, stored in RocksDB.

Once we have the object name, we can look it up in the bucket.data pool. The
name of the rados object has a prefix which is the marker id of the bucket.
Each rgw object inside the bucket.data pool also has OMAP and xattrs, and that
data is also in RocksDB, for example the rgw.manifest xattr which contains the
manifest data. If an object is large (more than 4MB) it is stored as multiple
rados objects; that is where the shadow files come from (pieces of one bigger
object). So losing the DB device will make the OSD non-operational, as
BlueStore uses the DB device for storing omap and xattr data.
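
As an illustration of the layout described above, the shard objects and their OMAP entries can be inspected directly; this is only a sketch, and the pool and bucket names are placeholders:

# find the bucket's marker/id
radosgw-admin metadata get bucket:mybucket
# list the zero-sized index shard objects for that bucket
rados -p default.rgw.buckets.index ls | grep BUCKET_ID
# the per-object listing entries live in OMAP, e.g. for shard 0
rados -p default.rgw.buckets.index listomapkeys .dir.BUCKET_ID.0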


Wed, 10 Nov 2021, 23:51, Boris wrote:

> Oh.
> How would one recover from that? Sounds like it basically makes no
> difference if 2, 5 or 10 OSDs are in the blast radius.
>
> Can the omap key/values be regenerated?
> I always thought these data would be stored in the rgw pools. Or am I
> mixing things up and the bluestore metadata got omap k/v? And then there is
> the omap k/v from rgw objects?
>
>
> On 10.11.2021 at 22:37, Сергей Процун wrote:
>
> 
> No, you can not do that. Because RocksDB for omap key/values and WAL would
> be gone meaning all xattr and omap will be gone too. Hence osd will become
> non operational. But if you notice that ssd starts throwing errors, you can
> start migrating bluefs device to a new partition:
>
> ceph-bluestore-tool bluefs-bdev-migrate --devs-source
> /var/lib/ceph/osd/ceph-OSD_ID/block --path /var/lib/ceph/osd/ceph-OSD_ID/
> --dev-target /path/to/new/db_device
>
> Wed, 10 Nov 2021, 11:51, Boris Behrens wrote:
>
>> Hi,
>> we use enterprise SSDs like SAMSUNG MZ7KM1T9.
>> The work very well for our block storage. Some NVMe would be a lot nicer
>> but we have some good experience with them.
>>
>> One SSD fail takes down 10 OSDs might sound hard, but this would be an
>> okayish risk. Most of the tunables are defaul in our setup and this looks
>> like PGs have a failure domain of a host. I restart the systems on a
>> regular basis for kernel updates.
>> Also checking disk io with dstat seems to be rather low on the SSDs
>> (below 1k IOPs)
>> root@s3db18:~# dstat --disk --io  -T  -D sdd
>> --dsk/sdd-- ---io/sdd-- --epoch---
>>  read  writ| read  writ|  epoch
>>  214k 1656k|7.21   126 |1636536603
>>  144k 1176k|2.00   200 |1636536604
>>  128k 1400k|2.00   230 |1636536605
>>
>> Normaly I would now try this configuration:
>> 1 SSD / 10 OSDs - having 150GB of block.db and block.wal, both on the
>> same partition as someone stated before, and 200GB extra to move all pools
>> except the .data pool to SSDs.
>>
>> But thinking about 10 downed OSDs if one SSD fails let's me wonder how to
>> recover from that.
>> IIRC the configuration per OSDs is in the LVM tags:
>> root@s3db18:~# lvs -o lv_tags
>>   LV Tags
>>
>> ceph.block_device=...,ceph.db_device=/dev/sdd8,ceph.db_uuid=011275a3-4201-8840-a678-c2e23d38bfd6,...
>>
>> When the SSD fails, can I just remove the tags and restart the OSD with 
>> ceph-volume
>> lvm activate --all? And after replacing the failed SSD readd the tags
>> with the correct IDs? Do I need to do anything else to prepare a block.db
>> partition?
>>
>> Cheers
>>  Boris
>>
>>
> On Tue, 9 Nov 2021 at 22:15, prosergey07 <
> proserge...@gmail.com> wrote:
>>
>>> Not sure how much it would help the performance with osd's backed with
>>> ssd db and wal devices. Even if you go this route with one ssd per 10 hdd,
>>> you might want to set the failure domain per host in crush rules in case
>>> ssd is out of service.
>>>
>>>  But from the practice ssd will not help too much to boost the
>>> performance especially for sharing it between 10 hdds.
>>>
>>>  We use nvme db+wal per osd and separate nvme specifically for metadata
>>> pools. There will be a lot of I/O on bucket.index pool and rgw pool which
>>> stores user, bucket metadata. So you might want to put them into separate
>>> fast storage.
>>>
>>>  Also if there will not be too much objects, like huge objects but not
>>> tens-hundreds million of them then bucket index will have less presure and
>>> ssd might be okay for metadata pools in that case.
>>>
>>>
>>>
>>> Sent from a Galaxy device
>>>
>>>
>>>  Original message 
>>> From: Boris Behrens 
>>> Date: 08.11.21 13:08 (GMT+02:00)
>>> To: ceph-users@ceph.io
>>> Subject: [ceph-users] Question if WAL/block.db partition will benefit us
>>>
>>> Hi,
>>> we run a larger octopus s3 cluster with only rotating disks.
>>> 1.3 PiB with 177 OSDs, some with a SSD block.db and some without.
>>>
>>> We have a ton of spare 2TB disks and we just wondered 

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Anthony D'Atri


> 
> Oh. 
> How would one recover from that? Sounds like it basically makes no difference 
> if 2, 5 or 10 OSDs are in the blast radius. 

Perhaps.  But a larger blast radius means that you lose a larger percentage 
of your cluster, assuming that you have a CRUSH failure domain of no smaller 
than `host`:
- Reduced performance until repair
- Longer repair process
- If you’re using SATA drives, repair may saturate your HBA, which slows down 
both recovery and clients.
- Depending on your Ceph release and configuration, your cluster may try to 
restore redundancy by making copies of surviving data on surviving nodes, 
which may not have enough spare capacity, so their OSDs may enter nearfull, 
backfillfull, or even full states.  Careful selection of 
mon_osd_down_out_subtree_limit can forestall this when an entire host is down, 
with the tradeoff of reduced redundancy until the host is restored (see the 
sketch after this list).
- If your whole cluster is 30 OSDs, 10 being down is a whopping 1/3 of the 
whole.  If it’s 1000 OSDs, that’s less of a concern.
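
A minimal sketch of the setting mentioned in the list above; the value `host` is an example, not a recommendation from the original post:

# don't automatically mark OSDs out when an entire host (or larger subtree) goes down
ceph config set mon mon_osd_down_out_subtree_limit host
# verify
ceph config get mon mon_osd_down_out_subtree_limit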


> Can the omap key/values be regenerated?
> I always thought these data would be stored in the rgw pools. Or am I mixing 
> things up and the bluestore metadata got omap k/v? And then there is the omap 
> k/v from rgw objects?

Doing so might take at least as much time, effort, and hassle as just repairing 
and backfilling the OSDs in toto, though there are multiple factors.

This is one reason why I’ve long recommended all-flash clusters.  Fewer 
interdependencies, less complexity, favorable blast radius, shorter MTTR.  
These contribute to TCO in very real ways.

> 
> 
>> On 10.11.2021 at 22:37, Сергей Процун wrote:
>> 
>> 
>> No, you can not do that. Because RocksDB for omap key/values and WAL would 
>> be gone meaning all xattr and omap will be gone too. Hence osd will becom
>> 
>> ceph-bluestore-tool bluefs-bdev-migrate --devs-source 
>> /var/lib/ceph/osd/ceph-OSD_ID/block --path /var/lib/ceph/osd/ceph-OSD_ID/ 
>> --dev-target /path/to/new/db_device
>> 
>>> Wed, 10 Nov 2021, 11:51, Boris Behrens wrote:
>>> Hi,
>>> we use enterprise SSDs like SAMSUNG MZ7KM1T9.
>>> The work very well for our block storage. Some NVMe would be a lot nicer 
>>> but we have some good experience with them. 
>>> 
>>> One SSD fail takes down 10 OSDs might sound hard, but this would be an 
>>> okayish risk. Most of the tunables are defaul in our setup and this looks 
>>> like PGs have a failure domain of a host. I restart the systems on a 
>>> regular basis for kernel updates.
>>> Also checking disk io with dstat seems to be rather low on the SSDs (below 
>>> 1k IOPs)
>>> root@s3db18:~# dstat --disk --io  -T  -D sdd
>>> --dsk/sdd-- ---io/sdd-- --epoch---
>>> read  writ| read  writ|  epoch
>>> 214k 1656k|7.21   126 |1636536603
>>> 144k 1176k|2.00   200 |1636536604
>>> 128k 1400k|2.00   230 |1636536605
>>> 
>>> Normaly I would now try this configuration:
>>> 1 SSD / 10 OSDs - having 150GB of block.db and block.wal, both on the same 
>>> partition as someone stated before, and 200GB extra to move all pools 
>>> except the .data pool to SSDs.
>>> 
>>> But thinking about 10 downed OSDs if one SSD fails let's me wonder how to 
>>> recover from that.
>>> IIRC the configuration per OSDs is in the LVM tags:
>>> root@s3db18:~# lvs -o lv_tags
>>>  LV Tags
>>>  
>>> ceph.block_device=...,ceph.db_device=/dev/sdd8,ceph.db_uuid=011275a3-4201-8840-a678-c2e23d38bfd6,...
>>> 
>>> When the SSD fails, can I just remove the tags and restart the OSD with 
>>> ceph-volume lvm activate --all? And after replacing the failed SSD readd 
>>> the tags with the correct IDs? Do I need to do anything else to prepare a 
>>> block.db partition?
>>> 
>>> Cheers
>>> Boris
>>> 
>>> 
 On Tue, 9 Nov 2021 at 22:15, prosergey07 wrote:
 Not sure how much it would help the performance with osd's backed with ssd 
 db and wal devices. Even if you go this route with one ssd per 10 hdd, you 
 might want to set the failure domain per host in crush rules in case ssd 
 is out of service.
 
 But from the practice ssd will not help too much to boost the performance 
 especially for sharing it between 10 hdds.
 
 We use nvme db+wal per osd and separate nvme specifically for metadata 
 pools. There will be a lot of I/O on bucket.index pool and rgw pool which 
 stores user, bucket metadata. So you might want to put them into separate 
 fast storage. 
 
 Also if there will not be too much objects, like huge objects but not 
 tens-hundreds million of them then bucket index will have less presure and 
 ssd might be okay for metadata pools in that case.
 
 
 
 Sent from a Galaxy device
 
 
  Original message 
 From: Boris Behrens 
 Date: 08.11.21 13:08 (GMT+02:00)
 To: ceph-users@ceph.io
 Subject: [ceph-users] Question if WAL/block.db partition will benefit us
 
 Hi,
 we run a larger 

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Boris
Oh. 
How would one recover from that? Sounds like it basically makes no difference 
if 2, 5 or 10 OSDs are in the blast radius. 

Can the omap key/values be regenerated?
I always thought these data would be stored in the rgw pools. Or am I mixing 
things up and the bluestore metadata got omap k/v? And then there is the omap 
k/v from rgw objects?


> On 10.11.2021 at 22:37, Сергей Процун wrote:
> 
> 
> No, you can not do that. Because RocksDB for omap key/values and WAL would be 
> gone meaning all xattr and omap will be gone too. Hence osd will become non 
> operational. But if you notice that ssd starts throwing errors, you can start 
> migrating bluefs device to a new partition:
> 
> ceph-bluestore-tool bluefs-bdev-migrate --devs-source 
> /var/lib/ceph/osd/ceph-OSD_ID/block --path /var/lib/ceph/osd/ceph-OSD_ID/ 
> --dev-target /path/to/new/db_device
> 
> Wed, 10 Nov 2021, 11:51, Boris Behrens wrote:
>> Hi,
>> we use enterprise SSDs like SAMSUNG MZ7KM1T9.
>> The work very well for our block storage. Some NVMe would be a lot nicer but 
>> we have some good experience with them. 
>> 
>> One SSD fail takes down 10 OSDs might sound hard, but this would be an 
>> okayish risk. Most of the tunables are defaul in our setup and this looks 
>> like PGs have a failure domain of a host. I restart the systems on a regular 
>> basis for kernel updates.
>> Also checking disk io with dstat seems to be rather low on the SSDs (below 
>> 1k IOPs)
>> root@s3db18:~# dstat --disk --io  -T  -D sdd
>> --dsk/sdd-- ---io/sdd-- --epoch---
>>  read  writ| read  writ|  epoch
>>  214k 1656k|7.21   126 |1636536603
>>  144k 1176k|2.00   200 |1636536604
>>  128k 1400k|2.00   230 |1636536605
>> 
>> Normaly I would now try this configuration:
>> 1 SSD / 10 OSDs - having 150GB of block.db and block.wal, both on the same 
>> partition as someone stated before, and 200GB extra to move all pools except 
>> the .data pool to SSDs.
>> 
>> But thinking about 10 downed OSDs if one SSD fails let's me wonder how to 
>> recover from that.
>> IIRC the configuration per OSDs is in the LVM tags:
>> root@s3db18:~# lvs -o lv_tags
>>   LV Tags
>>   
>> ceph.block_device=...,ceph.db_device=/dev/sdd8,ceph.db_uuid=011275a3-4201-8840-a678-c2e23d38bfd6,...
>> 
>> When the SSD fails, can I just remove the tags and restart the OSD with 
>> ceph-volume lvm activate --all? And after replacing the failed SSD readd the 
>> tags with the correct IDs? Do I need to do anything else to prepare a 
>> block.db partition?
>> 
>> Cheers
>>  Boris
>> 
>> 
>>> On Tue, 9 Nov 2021 at 22:15, prosergey07 wrote:
>>> Not sure how much it would help the performance with osd's backed with ssd 
>>> db and wal devices. Even if you go this route with one ssd per 10 hdd, you 
>>> might want to set the failure domain per host in crush rules in case ssd is 
>>> out of service.
>>> 
>>>  But from the practice ssd will not help too much to boost the performance 
>>> especially for sharing it between 10 hdds.
>>> 
>>>  We use nvme db+wal per osd and separate nvme specifically for metadata 
>>> pools. There will be a lot of I/O on bucket.index pool and rgw pool which 
>>> stores user, bucket metadata. So you might want to put them into separate 
>>> fast storage. 
>>> 
>>>  Also if there will not be too much objects, like huge objects but not 
>>> tens-hundreds million of them then bucket index will have less presure and 
>>> ssd might be okay for metadata pools in that case.
>>> 
>>> 
>>> 
>>> Sent from a Galaxy device
>>> 
>>> 
>>>  Original message 
>>> From: Boris Behrens 
>>> Date: 08.11.21 13:08 (GMT+02:00)
>>> To: ceph-users@ceph.io
>>> Subject: [ceph-users] Question if WAL/block.db partition will benefit us
>>> 
>>> Hi,
>>> we run a larger octopus s3 cluster with only rotating disks.
>>> 1.3 PiB with 177 OSDs, some with a SSD block.db and some without.
>>> 
>>> We have a ton of spare 2TB disks and we just wondered if we can bring the
>>> to good use.
>>> For every 10 spinning disks we could add one 2TB SSD and we would create
>>> two partitions per OSD (130GB for block.db and 20GB for block.wal). This
>>> would leave some empty space on the SSD for waer leveling.
>>> 
>>> The question now is: would we benefit from this? Most of the data that is
>>> written to the cluster is very large (50GB and above). This would take a
>>> lot of work into restructuring the cluster and also two other clusters.
>>> 
>>> And does it make a different to have only a block.db partition or a
>>> block.db and a block.wal partition?
>>> 
>>> Cheers
>>> Boris
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Сергей Процун
No, you cannot do that: the RocksDB holding the omap key/values and the WAL
would be gone, meaning all xattrs and omap data would be gone too, so the OSD
would become non-operational. But if you notice that the SSD starts throwing
errors, you can start migrating the BlueFS device to a new partition:

ceph-bluestore-tool bluefs-bdev-migrate --devs-source
/var/lib/ceph/osd/ceph-OSD_ID/block --path /var/lib/ceph/osd/ceph-OSD_ID/
--dev-target /path/to/new/db_device
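
After such a migration it is worth confirming which devices the OSD now points at; a sketch, not from the original mail:

# shows the BlueStore device labels (block, block.db, block.wal) for this OSD
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-OSD_ID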

Wed, 10 Nov 2021, 11:51, Boris Behrens wrote:

> Hi,
> we use enterprise SSDs like SAMSUNG MZ7KM1T9.
> They work very well for our block storage. Some NVMe would be a lot nicer
> but we have some good experience with them.
>
> One SSD failure taking down 10 OSDs might sound hard, but this would be an
> okayish risk. Most of the tunables are default in our setup and this looks
> like PGs have a failure domain of a host. I restart the systems on a
> regular basis for kernel updates.
> Also checking disk io with dstat seems to be rather low on the SSDs (below
> 1k IOPs)
> root@s3db18:~# dstat --disk --io  -T  -D sdd
> --dsk/sdd-- ---io/sdd-- --epoch---
>  read  writ| read  writ|  epoch
>  214k 1656k|7.21   126 |1636536603
>  144k 1176k|2.00   200 |1636536604
>  128k 1400k|2.00   230 |1636536605
>
> Normally I would now try this configuration:
> 1 SSD / 10 OSDs - having 150GB of block.db and block.wal, both on the same
> partition as someone stated before, and 200GB extra to move all pools
> except the .data pool to SSDs.
>
> But thinking about 10 downed OSDs if one SSD fails lets me wonder how to
> recover from that.
> IIRC the configuration per OSDs is in the LVM tags:
> root@s3db18:~# lvs -o lv_tags
>   LV Tags
>
> ceph.block_device=...,ceph.db_device=/dev/sdd8,ceph.db_uuid=011275a3-4201-8840-a678-c2e23d38bfd6,...
>
> When the SSD fails, can I just remove the tags and restart the OSD with
> ceph-volume lvm activate --all? And after replacing the failed SSD re-add the tags
> with the correct IDs? Do I need to do anything else to prepare a block.db
> partition?
>
> Cheers
>  Boris
>
>
> On Tue, 9 Nov 2021 at 22:15, prosergey07 <
> proserge...@gmail.com> wrote:
>
>> Not sure how much it would help the performance with osd's backed with
>> ssd db and wal devices. Even if you go this route with one ssd per 10 hdd,
>> you might want to set the failure domain per host in crush rules in case
>> ssd is out of service.
>>
>>  But from the practice ssd will not help too much to boost the
>> performance especially for sharing it between 10 hdds.
>>
>>  We use nvme db+wal per osd and separate nvme specifically for metadata
>> pools. There will be a lot of I/O on bucket.index pool and rgw pool which
>> stores user, bucket metadata. So you might want to put them into separate
>> fast storage.
>>
>>  Also if there will not be too many objects (large objects, but not
>> tens or hundreds of millions of them) then the bucket index will have less pressure and
>> ssd might be okay for metadata pools in that case.
>>
>>
>>
>> Sent from a Galaxy device
>>
>>
>>  Original message 
>> From: Boris Behrens 
>> Date: 08.11.21 13:08 (GMT+02:00)
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Question if WAL/block.db partition will benefit us
>>
>> Hi,
>> we run a larger octopus s3 cluster with only rotating disks.
>> 1.3 PiB with 177 OSDs, some with a SSD block.db and some without.
>>
>> We have a ton of spare 2TB disks and we just wondered if we can bring them
>> to good use.
>> For every 10 spinning disks we could add one 2TB SSD and we would create
>> two partitions per OSD (130GB for block.db and 20GB for block.wal). This
>> would leave some empty space on the SSD for wear leveling.
>>
>> The question now is: would we benefit from this? Most of the data that is
>> written to the cluster is very large (50GB and above). This would take a
>> lot of work into restructuring the cluster and also two other clusters.
>>
>> And does it make a difference to have only a block.db partition or a
>> block.db and a block.wal partition?
>>
>> Cheers
>> Boris
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] [Pacific] OSD Spec problem?

2021-11-10 Thread [AR] Guillaume CephML
Hello,

I got something strange on a Pacific (16.2.6) cluster.
I have added 8 new empty spinning disks on this running cluster that is 
configured with: 

# ceph orch ls osd --export
service_type: osd
service_id: ar_osd_hdd_spec
service_name: osd.ar_osd_hdd_spec
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  filter_logic: AND
  objectstore: bluestore
---
service_type: osd
service_id: ar_osd_ssd_spec
service_name: osd.ar_osd_ssd_spec
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
  filter_logic: AND
  objectstore: bluestore


Before adding them I had: 
#  ceph orch ls osd
NAME PORTS  RUNNING  REFRESHED  AGE  PLACEMENT  
osd.ar_osd_hdd_spec   16/24  8m ago 4M   *  
osd.ar_osd_ssd_spec   8/16   8m ago 4M   * 

After adding the disk I have: 
#  ceph orch ls osd
NAME PORTS  RUNNING  REFRESHED  AGE  PLACEMENT  
osd.ar_osd_hdd_spec   16/24  8m ago 4M   *  
osd.ar_osd_ssd_spec   16/24  8m ago 4M   *  

I do not understand why these disks have been detected as osd.ar_osd_ssd_spec.
The new disks are on /dev/sdf.
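
One thing worth checking (an editorial aside, not from the original thread) is whether the kernel itself reports the new disks as non-rotational, since that is what the rotational filter in the spec keys off; a sketch, run on the OSD host:

# 1 = rotational (HDD), 0 = non-rotational (SSD/NVMe)
cat /sys/block/sdf/queue/rotational
# what ceph-volume (and therefore the orchestrator) sees for the same device
ceph-volume inventory /dev/sdf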

# ceph orch device ls --wide
Hostname  Path      Type  Transport  RPM      Vendor  Model             Size   Health  Ident  Fault  Avail  Reject Reasons
host10    /dev/sdc  ssd   ATA/SATA   Unknown  ATA     Micron_5300_MTFD  960G   Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host10    /dev/sdd  hdd   ATA/SATA   7200     ATA     HGST HUH721010AL  10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host10    /dev/sde  hdd   ATA/SATA   7200     ATA     WDC WUS721010AL   10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host10    /dev/sdf  hdd   ATA/SATA   7200     ATA     WDC WUS721010AL   10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host11    /dev/sdc  ssd   ATA/SATA   Unknown  ATA     Micron_5300_MTFD  960G   Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host11    /dev/sdd  hdd   ATA/SATA   7200     ATA     HGST HUH721010AL  10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host11    /dev/sde  hdd   ATA/SATA   7200     ATA     WDC WUS721010AL   10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host11    /dev/sdf  hdd   ATA/SATA   7200     ATA     WDC WUS721010AL   10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host12    /dev/sdc  ssd   ATA/SATA   Unknown  ATA     Micron_5300_MTFD  960G   Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host12    /dev/sdd  hdd   ATA/SATA   7200     ATA     HGST HUH721010AL  10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host12    /dev/sde  hdd   ATA/SATA   7200     ATA     WDC WUS721010AL   10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host12    /dev/sdf  hdd   ATA/SATA   7200     ATA     WDC WUS721010AL   10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host13    /dev/sdc  ssd   ATA/SATA   Unknown  ATA     Micron_5300_MTFD  960G   Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host13    /dev/sdd  hdd   ATA/SATA   7200     ATA     HGST HUH721010AL  10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host13    /dev/sde  hdd   ATA/SATA   7200     ATA     WDC WUS721010AL   10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host13    /dev/sdf  hdd   ATA/SATA   7200     ATA     WDC WUS721010AL   10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host14    /dev/sdc  ssd   ATA/SATA   Unknown  ATA     Micron_5300_MTFD  960G   Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host14    /dev/sdd  hdd   ATA/SATA   7200     ATA     HGST HUH721010AL  10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host14    /dev/sde  hdd   ATA/SATA   7200     ATA     WDC WUS721010AL   10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host14    /dev/sdf  hdd   ATA/SATA   7200     ATA     WDC WUS721010AL   10.0T  Good    N/A    N/A    No     Insufficient space (<10 extents) on vgs, LVM detected, locked
host15    /dev/sdc  ssd   ATA/SATA   Unknown  ATA     Micron_5300_MTFD  960G   Good    N/A    N/A    No     Insufficient space (<10 extents) on

[ceph-users] Re: slow operation observed for _collection_list

2021-11-10 Thread Boris Behrens
Did someone figure this out?
We are currently facing the same issue, but the OSDs kill themselves more
often and need to be restarted by us.

This happens to OSDs that have an SSD-backed block.db and to OSDs that keep
the block.db on the main BlueStore device.
All OSDs are rotating disks of various sizes.
We've disabled (deep-)scrubbing to see whether this causes the issue.
It happens on centos7 and ubuntu focal hosts.
It only appeared after we switched from latest nautilus to latest octopus.
(in the upgrade process we've got rid of the cluster_network option (it was
only a separate VLAN) and a lot of other config variables, so we are mostly
on default values)

And even an OSD I added 6 hours ago (one of 20) hit this problem 20
minutes ago.
What we do: offline compact them and then start them again.
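
For reference, the offline compaction step mentioned above looks roughly like this; a sketch with a placeholder OSD id and the non-cephadm unit name, plus the bluefs_buffered_io suggestion from the quoted reply further down:

systemctl stop ceph-osd@7
# offline compaction of this OSD's RocksDB
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-7 compact
systemctl start ceph-osd@7
# optionally, as suggested in the quoted reply
ceph config set osd bluefs_buffered_io true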

On Fri, 5 Nov 2021 at 16:22, Szabo, Istvan (Agoda) <
istvan.sz...@agoda.com> wrote:

> Seems like it can help, but after 1-2 days it comes back on different and
> in some cases on the same osd as well.
> Is there any other way to compact online as it compacts offline?
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> From: Szabo, Istvan (Agoda)
> Sent: Friday, October 29, 2021 8:43 PM
> To: Igor Fedotov 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] slow operation observed for _collection_list
>
> I can give a try again, but before migrated all db back to data I did
> compaction on all osd.
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
>
> On 2021. Oct 29., at 15:02, Igor Fedotov <igor.fedo...@croit.io> wrote:
> Email received from the internet. If in doubt, don't click any link nor
> open any attachment !
> 
>
> Please manually compact the DB using ceph-kvstore-tool for all the
> affected OSDs (or preferably every OSD in the cluster). Highly likely
> you're facing RocksDB performance degradation caused by prior bulk data
> removal. Setting bluefs_buffered_io to true (if not yet set) might be
> helpful as well.
>
>
> On 10/29/2021 3:22 PM, Szabo, Istvan (Agoda) wrote:
>
> Hi,
>
> Having slow ops and laggy pgs due to osd is not accessible (octopus
> 15.2.14 version and 15.2.10 also).
> At the time when slow ops started, in the osd log I can see:
>
> "7f2a8d68f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread
> 0x7f2a70de5700' had timed out after 15"
>
> And this blocks the io until the radosgateway didn't restart itself.
> Is this a bug or something else?
>
> In the ceph.log I can see also that specific osd is reported failed from
> another osds:
>
> 2021-10-29T05:49:34.386857+0700 mon.server-3s01 (mon.0) 3576376 : cluster
> [DBG] osd.7 reported failed by osd.31
> 2021-10-29T05:49:34.454037+0700 mon.server-3s01 (mon.0) 3576377 : cluster
> [DBG] osd.7 reported failed by osd.22
> 2021-10-29T05:49:34.666758+0700 mon.server-3s01 (mon.0) 3576379 : cluster
> [DBG] osd.7 reported failed by osd.6
> 2021-10-29T05:49:34.807714+0700 mon.server-3s01 (mon.0) 3576382 : cluster
> [DBG] osd.7 reported failed by osd.11
>
> Here is the osd log: https://justpaste.it/4x4h2
> Here is the ceph.log itself: https://justpaste.it/5bk8k
> Here is some additional information regarding memory usage and
> backtrace...: https://justpaste.it/1tmjg
>
> Thank you
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
The "UTF-8 problems" self-help group will, as an exception, meet in the
large hall this time.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: snaptrim blocks io on ceph pacific even on fast NVMEs

2021-11-10 Thread Arthur Outhenin-Chalandre

Hi,

On 11/10/21 16:14, Christoph Adomeit wrote:

But the cluster seemed to slowly "eat" storage space. So yesterday I decided to 
add 3 more NVMEs, 1 for each node. The second I added the first NVMe as a ceph OSD, the 
cluster started crashing. I had high loads on all OSDs and all the OSDs were dying again and 
again until I set nodown,noout,noscrub,nodeep-scrub and removed the new OSD. Then the 
cluster recovered but had slow IO and lots of snaptrim and snaptrim_wait processes.


You may have hit this issue https://tracker.ceph.com/issues/52026. AFAIU 
there could be some untrimmed snapshots (visible in snaptrimq_len with 
`ceph pg dump pgs`) which are only trimmed once the pg is repeered. We 
experience that during testing, but the root cause is not fully 
understood (at least to me).


Maybe adding your new OSDs made the snaptrim state appear on 
various PGs, which is what apparently affected your cluster.



I made this smoother by setting --osd_snap_trim_sleep=3.0

Over night the snaptrim_wait pgs became 0 and I had 15% more free space in the 
ceph cluster. But during the day the snaptrim_waits increased and increased.

I then set osd_snap_trim_sleep to 0.0 again and most vms had extremely high 
iowaits or crashed.

Now I did a ceph osd set nosnaptrim and the cluster is flying again. Iowait 0 
on all vms but count
of snaptrim wait is slowly increasing.

How can I get the snaptrims running fast and not affect ceph io performance ?
My theory is that until yesterday, for some reason, the snaptrims were not running 
and therefore the cluster was "eating" storage space. After the crash 
yesterday and the restart, the snaptrims started.


On our test cluster we actually decreased `osd_snap_trim_sleep` to 0.1s 
instead of the default 2s for hybrid OSD because the snaptrim we had 
would have lasted a few weeks without it IIRC. We didn't notice any 
slowdowns, HDD crashing or anything like that (but this cluster doesn't 
have any real production workloads, so we may have overlooked some aspects).


In your case the default value should be set to 
`osd_snap_trim_sleep_ssd` which is 0, so maybe with some SSD/NVME OSD 
the snaptrim do affect performance (with the default settings at 
least)... Therefore, you may want to set `osd_snap_trim_sleep` to 
something different than 0. The 0.1s sleep worked smoothly in our tests, 
but this was needed because I was stress testing snapshots and there was 
many many objects that needed this snaptrim process. You could probably 
increase this value for safety reasons, any value between 0.1s and 3s 
(that you already tested!) is probably fine!
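
For readers following along, a sketch of checking the snaptrim backlog and applying such a sleep value; the 0.5s here is an arbitrary example inside the range discussed above:

# PGs currently in snaptrim / snaptrim_wait, plus the SNAPTRIMQ_LEN column
ceph pg dump pgs | grep snaptrim | head
# throttle trimming with a small sleep between snap trim operations
ceph config set osd osd_snap_trim_sleep 0.5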


Cheers,

--
Arthur Outhenin-Chalandre
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: allocate_bluefs_freespace failed to allocate

2021-11-10 Thread mhnx
Hello Igor. Thanks for the answer.

There are so many changes to read and test for me but I will plan an
upgrade to Octopus when I'm available.

Is there any problem upgrading from 14.2.16 ---> 15.2.15 ?



Igor Fedotov wrote the following on Wed, 10 Nov 2021 at 17:50:

> I would encourage you to upgrade to at least the latest Nautilus (and
> preferably to Octopus).
>
> There were a bunch of allocator's bugs fixed since 14.2.16. Not even
> sure all of them landed into N since it's EOL.
>
> A couple examples are (both are present in the latest Nautilus):
>
> https://github.com/ceph/ceph/pull/41673
>
> https://github.com/ceph/ceph/pull/38475
>
>
> Thanks,
>
> Igor
>
>
> On 11/8/2021 4:31 PM, mhnx wrote:
> > Hello.
> >
> > I'm using Nautilus 14.2.16
> > I have 30 SSD in my cluster and I use them as Bluestore OSD for RGW
> index.
> > Almost every week I'm losing (down) an OSD and when I check osd log I
> see:
> >
> >  -6> 2021-11-06 19:01:10.854 7fa799989c40  1 *bluefs _allocate
> > failed to allocate 0xf4f04 on bdev 1, free 0xb; fallback to bdev
> > 2*
> >  -5> 2021-11-06 19:01:10.854 7fa799989c40  1 *bluefs _allocate
> > unable to allocate 0xf4f04 on bdev 2, free 0x;
> > fallback to slow device expander*
> >  -4> 2021-11-06 19:01:10.854 7fa799989c40 -1
> > bluestore(/var/lib/ceph/osd/ceph-218) *allocate_bluefs_freespace
> > failed to allocate on* 0x8000 min_size 0x10 > allocated total
> > 0x0 bluefs_shared_alloc_size 0x1 allocated 0x0 available 0x
> > a497aab000
> >  -3> 2021-11-06 19:01:10.854 7fa799989c40 -1 *bluefs _allocate
> > failed to expand slow device to fit +0xf4f04*
> >
> >
> > Full log: https://paste.ubuntu.com/p/MpJfVjMh7V/plain/
> >
> > And OSD does not start without offline compaction.
> > Offline compaction log: https://paste.ubuntu.com/p/vFZcYnxQWh/plain/
> >
> > After the Offline compaction I tried to start OSD with bitmap allocator
> but
> > it is not getting up because of " FAILED ceph_assert(available >=
> > allocated)"
> > Log: https://paste.ubuntu.com/p/2Bbx983494/plain/
> >
> > Then I start the OSD with hybrid allocator and let it recover.
> > When the recover is done I stop the OSD and start with the bitmap
> > allocator.
> > This time it came up but I've got "80 slow ops, oldest one blocked for
> 116
> > sec, osd.218 has slow ops" and I increased "osd_recovery_sleep 10" to
> give
> > a breath to cluster and cluster marked the osd as down (it was still
> > working) after a while the osd marked up and cluster became normal. But
> > while recovering, other osd's started to give slow ops and I've played
> > around with "osd_recovery_sleep 0.1 <---> 10" to keep the cluster stable
> > till recovery finishes.
> >
> > Ceph osd df tree before: https://paste.ubuntu.com/p/4K7JXcZ8FJ/plain/
> > Ceph osd df tree after osd.218 = bitmap:
> > https://paste.ubuntu.com/p/5SKbhrbgVM/plain/
> >
> > If I want to change all other osd's allocator to bitmap, I need to repeat
> > the process 29 time and it will take too much time.
> > I don't want to heal OSDs with the offline compaction anymore so I will
> do
> > that if that's the solution but I want to be sure before doing a lot of
> > work and maybe with the issue I can provide helpful logs and information
> > for developers.
> >
> > Have a nice day.
> > Thanks.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: snaptrim blocks io on ceph pacific even on fast NVMEs

2021-11-10 Thread Christoph Adomeit
Thanks Stefan,

I played with bluefs_buffered_io but I think the impact is not great since the 
NVMes are so fast.
I think buffered IO on increased CPU load, while buffered IO off increased NVMe 
load. The problem was there with both settings.

I am not sure if require-osd-release was run. What do you think "ceph osd 
require-osd-release pacific" will do, and is there a risk that running the 
command might make the cluster unavailable?

I think all OSDs are running 16.2.6 anyway, and so it would not change anything 
if I set it to require Pacific?
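
A quick, read-only way to answer the first question (a sketch, not from the original thread):

# shows the currently required release, e.g. "require_osd_release octopus"
ceph osd dump | grep require_osd_release
# confirms which release every daemon is actually running
ceph versions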


On Wed, Nov 10, 2021 at 04:22:14PM +0100, Stefan Kooman wrote:
> On 11/10/21 16:14, Christoph Adomeit wrote:
> > I have upgraded my ceph cluster to pacific in August and updated to pacific 
> > 16.2.6 in September without problems.
> 
> Have you set "ceph osd require-osd-release pacific" when you finished
> upgrading (this sometimes gets forgotten)?
> 
> Is "bluefs_buffered_io" set to true on the OSDs?
> 
> Gr. Stefan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
Hard times create strong men. Strong men create good times.Good times create 
weak men. And weak men create hard times.
Christoph Adomeit
GATWORKS GmbH
Metzenweg 78
41068 Moenchengladbach
Sitz: Moenchengladbach
Amtsgericht Moenchengladbach, HRB 6303
Geschaeftsfuehrer:
Christoph Adomeit, Hans Wilhelm Terstappen

christoph.adom...@gatworks.de Internetloesungen vom Feinsten
Fon. +49 2161 68464-32  Fax. +49 2161 68464-10
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] snaptrim blocks io on ceph pacific even on fast NVMEs

2021-11-10 Thread Christoph Adomeit
I have upgraded my ceph cluster to pacific in August and updated to pacific 
16.2.6 in September without problems. 

I had no performance issues at all, the cluster has 3 nodes 64 core each, 15 
blazing fast Samsung PM1733 NVME osds, 25 GBit/s Network and around 100 vms. 
The cluster was really fast. I never saw something like "snaptrim" in the ceph 
status output.

But the cluster seemed to slowly "eat" storage space. So yesterday I decided to 
add 3 more NVMEs, 1 for each node. The second I added the first NVMe as a ceph 
OSD, the cluster started crashing. I had high loads on all OSDs and all the OSDs 
were dying again and again until I set nodown,noout,noscrub,nodeep-scrub and 
removed the new OSD. Then the cluster recovered but had slow IO and lots of 
snaptrim and snaptrim_wait processes.

I made this smoother by setting --osd_snap_trim_sleep=3.0 

Over night the snaptrim_wait pgs became 0 and I had 15% more free space in the 
ceph cluster. But during the day the snaptrim_waits increased and increased.

I then set osd_snap_trim_sleep to 0.0 again and most vms had extremely high 
iowaits or crashed.

Now I did a ceph osd set nosnaptrim and the cluster is flying again. Iowait 0 
on all vms but count
of snaptrim wait is slowly increasing.

How can I get the snaptrims running fast and not affect ceph io performance ?
My theory is that until yesterday, for some reason, the snaptrims were not 
running and therefore the cluster was "eating" storage space. After the crash 
yesterday and the restart, the snaptrims started.

In the logs I cannot find any info on what is going on. From what I read in the 
mailing lists and forums I suppose the problem might have something to do with 
ceph OSDs and omaps and compaction and RocksDB format, or maybe with the OSD 
on-disk format?

Any ideas what the next steps could be ?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: allocate_bluefs_freespace failed to allocate

2021-11-10 Thread Igor Fedotov
I would encourage you to upgrade to at least the latest Nautilus (and 
preferably to Octopus).


There were a bunch of allocator's bugs fixed since 14.2.16. Not even 
sure all of them landed into N since it's EOL.


A couple examples are (both are present in the latest Nautilus):

https://github.com/ceph/ceph/pull/41673

https://github.com/ceph/ceph/pull/38475


Thanks,

Igor


On 11/8/2021 4:31 PM, mhnx wrote:

Hello.

I'm using Nautilus 14.2.16
I have 30 SSD in my cluster and I use them as Bluestore OSD for RGW index.
Almost every week I'm losing (down) an OSD and when I check osd log I see:

 -6> 2021-11-06 19:01:10.854 7fa799989c40  1 *bluefs _allocate
failed to allocate 0xf4f04 on bdev 1, free 0xb; fallback to bdev
2*
 -5> 2021-11-06 19:01:10.854 7fa799989c40  1 *bluefs _allocate
unable to allocate 0xf4f04 on bdev 2, free 0x;
fallback to slow device expander*
 -4> 2021-11-06 19:01:10.854 7fa799989c40 -1
bluestore(/var/lib/ceph/osd/ceph-218) *allocate_bluefs_freespace
failed to allocate on* 0x8000 min_size 0x10 > allocated total
0x0 bluefs_shared_alloc_size 0x1 allocated 0x0 available 0x
a497aab000
 -3> 2021-11-06 19:01:10.854 7fa799989c40 -1 *bluefs _allocate
failed to expand slow device to fit +0xf4f04*


Full log: https://paste.ubuntu.com/p/MpJfVjMh7V/plain/

And OSD does not start without offline compaction.
Offline compaction log: https://paste.ubuntu.com/p/vFZcYnxQWh/plain/

After the Offline compaction I tried to start OSD with bitmap allocator but
it is not getting up because of " FAILED ceph_assert(available >=
allocated)"
Log: https://paste.ubuntu.com/p/2Bbx983494/plain/

Then I start the OSD with hybrid allocator and let it recover.
When the recover is done I stop the OSD and start with the bitmap
allocator.
This time it came up but I've got "80 slow ops, oldest one blocked for 116
sec, osd.218 has slow ops" and I increased "osd_recovery_sleep 10" to give
a breath to cluster and cluster marked the osd as down (it was still
working) after a while the osd marked up and cluster became normal. But
while recovering, other osd's started to give slow ops and I've played
around with "osd_recovery_sleep 0.1 <---> 10" to keep the cluster stable
till recovery finishes.

Ceph osd df tree before: https://paste.ubuntu.com/p/4K7JXcZ8FJ/plain/
Ceph osd df tree after osd.218 = bitmap:
https://paste.ubuntu.com/p/5SKbhrbgVM/plain/

If I want to change all other osd's allocator to bitmap, I need to repeat
the process 29 time and it will take too much time.
I don't want to heal OSDs with the offline compaction anymore so I will do
that if that's the solution but I want to be sure before doing a lot of
work and maybe with the issue I can provide helpful logs and information
for developers.

Have a nice day.
Thanks.
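
As an editorial aside, not from the original mail: if switching the allocator is the goal, it can be set per OSD in the config database and is picked up on restart; a sketch, reusing osd.218 from the example above:

ceph config set osd.218 bluestore_allocator bitmap
systemctl restart ceph-osd@218
# verify the value the OSD will use
ceph config get osd.218 bluestore_allocator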
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: LVM support in Ceph Pacific

2021-11-10 Thread Janne Johansson
On Wed, 10 Nov 2021 at 11:27, MERZOUKI, HAMID wrote:
>
> Hello everybody,
>
> I have a misunderstanding about LVM configuration for OSD devices:
>
> In the pacific documentation cephadm/install (and it was already written in 
> octopus documentation in section DEPLOY OSDS), it is written :
> “The device must not have any LVM state”
> (https://docs.ceph.com/en/pacific/cephadm/services/osd/#listing-storage-devices)

I think this mostly means "It should be empty, and not have a previous
LVM configuration"

> whereas there are lots of reference to LVM configuration in other sections, 
> such as in release notes : Pacific — Ceph Documentation or in ceph-volume — 
> Ceph Documentation :
> "NEW DEPLOYMENTS" 
> (https://docs.ceph.com/en/pacific/ceph-volume/index.html#new-deployments)
> For new deployments, LVM is recommended, it can use any logical volume as 
> input for data OSDs, or it can setup a minimal/naive logical volume from a 
> device

...and the many mentions of LVM are because the tools will happily
default to making a new LVM setup for you on the clean disks you
supply.
If there already is LVM on it, the installation needs to take many
different situations into consideration, compared to "the disk needs to
be clean and without previous LVM on it", in which case it can just
pvcreate and so on from a clean start.
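
To illustrate the two paths, a sketch; the device and VG/LV names are placeholders:

# clean, unused device: ceph-volume creates the PV/VG/LV for you
ceph-volume lvm create --data /dev/sdx
# pre-existing logical volume: pass it as vg/lv instead of a raw device
ceph-volume lvm create --data my_vg/my_osd_lv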


> So I’m not sure to understand and have two questions:
>
> 1/ Is it possible or not to deploy a Ceph Pacific configuration upon a LVM 
> configuration for OSDs or is it not supported or is it just a problem linked 
> to the use of ceph-adm instead of ceph-ansible ?

I'm sure you can if you do it even more manually with ceph-volume, but
there should seldom be a need to. As for if the scripts and tools will
allow you to specify such a config I don't know.

> 2/ Is it possible to migrate a Ceph Octopus configuration installed (thks to 
> ceph-ansible) upon LVM configuration for OSDs towards a Ceph Pacific 
> configuration without changing anything around LVM configuration ?

Yes, upgrades do not contain LVM management, as far as I have ever seen.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph-data-scan: Watching progress and choosing the number of threads

2021-11-10 Thread Anderson, Erik
I deleted a filesystem that should not have been deleted on a seven node 1.2P 
cluster running Octopus. After looking through various docs and threads I am 
running ‘ceph-data-scan’ to try and rebuild the metadata from the data pool. 
The example for ceph-data-scan in the documentation uses four threads but a 
little more research showed that four was likely very low for my cluster and so 
I stopped those jobs and restarted with 512 threads. Is there a rule of thumb 
to determine how many threads you should use? Something based on available 
resources on the cluster, file count or perhaps the size of the data pool? My 
current run is winding down with about 80 threads and I would like to know if 
upping the thread count will help the second phase run any faster than the 
first. Secondly, other than monitoring the number of threads that are currently 
running, how can I monitor the overall progress of the ceph-data-scan job?
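
For context, the parallel-worker invocation from the CephFS disaster-recovery documentation looks roughly like this; a sketch where the data pool name and worker count are placeholders and the flags may vary by release:

# run 4 workers in parallel; each owns slice n of m
cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data &
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data &
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 cephfs_data &
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 cephfs_data &
wait
# the same --worker_n/--worker_m pattern applies to the scan_inodes phase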

Thanks for your help,

Erik
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to enable RDMA

2021-11-10 Thread David Majchrzak, Oderland Webbhotell AB
I think the latest docs on ceph RDMA "support" is based on Luminous.

I'd be careful using RDMA on later versions of ceph if you're running a 
production cluster.

⁣Kind Regards,

David Majchrzak
CTO
Oderland Webbhotell AB​

On 10 Nov 2021 at 11:47, "Mason-Williams, Gabryel (RFI,RAL,-)" wrote:
>Hi GHui,
>
>You might find this document useful:
>https://support.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide
>
>Also, I previously asked this question and there was some useful
>information in the thread:
>https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/5JD4ATRXKMMLIUQI5TUAUYQFGJ45Q7MJ/
>
>Kind regards
>
>Gabryel
>
>From: GHui 
>Sent: 10 November 2021 10:34
>To: ceph-users 
>Subject: [ceph-users] How to enable RDMA
>
>Hi,
>
>How can I know whether my Ceph cluster has RDMA enabled, and with RoCE v2?
>
>I would very much appreciate any advice.
>
>Best Regards,
>GHui
>___
>ceph-users mailing list -- ceph-users@ceph.io
>To unsubscribe send an email to ceph-users-le...@ceph.io
>___
>ceph-users mailing list -- ceph-users@ceph.io
>To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to enable RDMA

2021-11-10 Thread Mason-Williams, Gabryel (RFI,RAL,-)
Hi GHui,

You might find this document useful: 
https://support.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide

Also, I previously asked this question and there was some useful information in 
the thread: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/5JD4ATRXKMMLIUQI5TUAUYQFGJ45Q7MJ/

Kind regards

Gabryel

From: GHui 
Sent: 10 November 2021 10:34
To: ceph-users 
Subject: [ceph-users] How to enable RDMA

Hi,

How can I know whether my Ceph cluster has RDMA enabled, and with RoCE v2?

I would very much appreciate any advice.

Best Regards,
GHui
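
To address the question above directly (an editorial sketch, not from the original reply): whether RDMA is in use is visible from the messenger settings of the running daemons.

# async+posix is the default; async+rdma indicates RDMA messaging
ceph daemon osd.0 config show | grep -E 'ms_type|ms_cluster_type'
# any RDMA-specific options that were set explicitly
ceph config dump | grep rdma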
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-10 Thread Manuel Lausch
This is the patch I made. I think this is the wrong place to do this, but in 
my first tests it worked.


diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index 9fb22e0f9ee..69341840153 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -798,6 +798,10 @@ void PrimaryLogPG::maybe_force_recovery()
 
 bool PrimaryLogPG::check_laggy(OpRequestRef& op)
 {
+  if (!cct->_conf->osd_read_lease_enabled) {
+// possibility to deactivate this feature.
+return true;
+  }
   if (!HAVE_FEATURE(recovery_state.get_min_upacting_features(),
SERVER_OCTOPUS)) {
 dout(20) << __func__ << " not all upacting has SERVER_OCTOPUS" << dendl;
@@ -833,6 +837,10 @@ bool PrimaryLogPG::check_laggy(OpRequestRef& op)
 
 bool PrimaryLogPG::check_laggy_requeue(OpRequestRef& op)
 {
+  if (!cct->_conf->osd_read_lease_enabled) {
+// possibility to deactivate this feature.
+return true;
+  }
   if (!HAVE_FEATURE(recovery_state.get_min_upacting_features(),
SERVER_OCTOPUS)) {
 return true;



From: Peter Lieven 
Sent: Wednesday, 10 November 2021 11:37
To: Manuel Lausch; Sage Weil
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD spend too much time on "waiting for readable" 
-> slow ops -> laggy pg -> rgw stop -> worst case osd restart

On 10.11.21 at 11:35, Manuel Lausch wrote:
> oh shit,
>
> I patched in a switch to deactivate the read_lease feature. This is only a 
> hack to test a bit around. But accidentally I had this switch enabled for my 
> last tests done here in this mail-thread.
>
> The bad news. The require_osd_release doesn't fix the slow op problematic, 
> only the increasing of the osdmap epochs are fixed.
> Unfortunately, even reducing the paxos_propose_interval does not change anything. 
> My earlier tests with it were wrong due to my hack :-(


Would it be an option to make this hack a switch for all those who don't 
require the read lease feature and are happy with reading from just the primary?


Peter



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-10 Thread Peter Lieven
On 10.11.21 at 11:35, Manuel Lausch wrote:
> oh shit,
>
> I patched in a switch to deactivate the read_lease feature. This is only a 
> hack to test a bit around. But accidentally I had this switch enabled for my 
> last tests done here in this mail-thread.
>
> The bad news. The require_osd_release doesn't fix the slow op problematic, 
> only the increasing of the osdmap epochs are fixed.
> Unfortunately, even reducing the paxos_propose_interval does not change anything. 
> My earlier tests with it were wrong due to my hack :-(


Would it be an option to make this hack a switch for all those who don't 
require the read lease feature and are happy with reading from just the primary?


Peter



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-10 Thread Manuel Lausch
oh shit,

I patched in a switch to deactivate the read_lease feature. This is only a hack 
to test a bit around. But accidentally I had this switch enabled for my last 
tests done here in this mail-thread.

The bad news. The require_osd_release doesn't fix the slow op problematic, only 
the increasing of the osdmap epochs are fixed.
Unfortunately, even reducing the paxos_propose_interval does not change anything. My 
earlier tests with it were wrong due to my hack :-(

So still affected are octopus and nautilus.
Sorry for the confusion.

Manuel


From: Peter Lieven 
Sent: Wednesday, 10 November 2021 10:15
To: Manuel Lausch; Sage Weil
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD spend too much time on "waiting for readable" 
-> slow ops -> laggy pg -> rgw stop -> worst case osd restart

On 10.11.21 at 09:57, Manuel Lausch wrote:
> Hi Sage,
>
>
> thank you for your help.
>
>
> My original issue with slow ops on OSD restarts is gone too. Even with default 
> values for paxos_propose_interval.
>
>
> It's a bit annoying that I spent many hours debugging this and in the end only 
> missed one step in the upgrade.
>
> Only during the update itself, until require_osd_release is set to the new 
> version, will there be interruptions.


However, in Octopus the issue does still exist, right?


Peter



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-10 Thread Boris Behrens
I am just creating a bucket with a lot of files to test it. Who would have
thought that uploading a million 1k files would take days?

On Tue, 9 Nov 2021 at 00:50, prosergey07 wrote:

> When resharding is performed I believe it's considered a bucket operation
> and goes through updating the bucket stats. For example, a new bucket shard is
> created and it may increase the number of objects within the bucket stats.
>  If it was broken during resharding, you could check the current bucket id
> from:
>  radosgw-admin metadata get "bucket:BUCKET_NAME".
>
> That would give an idea of which bucket index objects to keep.
>
>  Then you could remove the corrupted bucket shards (not the ones with the
> bucket id from the previous command), i.e. the .dir.corrupted_bucket_index.SHARD_NUM
> objects, from the bucket.index pool:
>
> rados -p bucket.index rm .dir.corrupted_bucket_index.SHARD_NUM
>
> Where SHARD_NUM is the shard number you want to delete.
>
>  And then running "radosgw-admin bucket check --fix --bucket=BUCKET_NAME"
>
>  That should have resolved your issue with the number of objects.
>
>  As for slow object deletion: do you run your metadata pools for rgw on
> NVMe drives? Specifically the bucket.index pool. The problem is that you have
> a lot of objects and probably not enough shards. Radosgw retrieves the list
> of objects from bucket.index and, if I remember correctly, it retrieves them as
> an ordered list, which is a very expensive operation. Hence a good deal of time
> might be spent just on getting the object list.
>
>  We get 1000 objects per second deleted  inside our storage.
>
>
> I would not recommend using "--inconsistent-index", to avoid more
> consistency issues.
>
>
>
>
> Sent from a Galaxy device
>
>
>  Original message 
> From: mhnx 
> Date: 08.11.21 13:28 (GMT+02:00)
> To: Сергей Процун 
> Cc: "Szabo, Istvan (Agoda)" , Boris Behrens <
> b...@kervyn.de>, Ceph Users 
> Subject: Re: [ceph-users] Re: large bucket index in multisite environement
> (how to deal with large omap objects warning)?
>
> (There should not be any issues using rgw for other buckets while
> re-sharding.)
> If there is, then disabling access to the bucket will work, right? Also sync
> should be disabled.
>
> Yes, after the manual reshard it should clear the leftovers, but in my
> situation resharding failed and I got double entries for that bucket.
> I didn't push further; instead I divided the bucket into new buckets and
> reduced the object count with a new bucket tree. I copied all of the objects
> with rclone and started the bucket removal "radosgw-admin bucket rm
> --bucket=mybucket --bypass-gc --purge-objects --max-concurrent-ios=128"; it has
> been running for a very long time (started at Sep 08) and it is still working.
> There were 250M objects in that bucket and after the manual reshard failed I got
> a 500M object count when I checked with bucket stats num_objects. Now I have:
> "size_kb": 10648067645,
> "num_objects": 132270190
>
> The removal speed is 50-60 objects per second. It's not because of the cluster
> speed; the cluster is fine.
> I have space so I let it go. When I see a stable object count I will stop
> the removal process and start again with the
> "--inconsistent-index" parameter.
> I wonder, is it safe to use the parameter with referenced objects? I want
> to learn how "--inconsistent-index" works and what it does.
>
> On Fri, 5 Nov 2021 at 17:46, Сергей Процун wrote:
>
>> There should not be any issues using rgw for other buckets while
>> re-sharding.
>>
>> As for the doubling of the number of objects after reshard, that is an
>> interesting situation. After the manual reshard is done, there might be
>> leftovers from the old bucket index, as during reshard new
>> .dir.new_bucket_index objects are created. They contain all data related to
>> the objects which are stored in the buckets.data pool. Just wondering if the
>> issue with the doubled number of objects was related to the old bucket index.
>> If so, it's safe to delete the old bucket index.
>>
>>  In a perfect world, it would be ideal to know the eventual number of
>> objects inside the bucket and set the number of shards to the corresponding
>> value initially.
>>
>> In the real world, when the client re-purposes the bucket, we
>> have to deal with reshards.
>>
>> On Fri, 5 Nov 2021 at 14:43, mhnx wrote:
>>
>>> I also use this method and I hate it.
>>>
>>> Stopping all of the RGW clients is never an option! It shouldn't be.
>>> Sharding is hell. I had 250M objects in a bucket and reshard failed
>>> after 2 days and the object count doubled somehow! 2 days of downtime is
>>> not an option.
>>>
>>> I wonder, if I stop reads and writes on a bucket while resharding it, is
>>> there any problem with using the RGWs for all other buckets?
>>>
>>> Nowadays I advise splitting buckets as much as you can! That means changing
>>> your app's directory tree, but this design requires it.
>>> You need to plan the object count for at least 5 years and create the buckets
>>> accordingly. Usually I use 101 shards, which means 10.100.000 
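
Putting the index-cleanup steps quoted above together, a rough sketch looks
like this (the index pool name assumes a default zone and may differ in your
setup; BUCKET_NAME and OLD_BUCKET_ID are placeholders):

# find the current bucket instance id -- these are the shards to keep
radosgw-admin metadata get bucket:BUCKET_NAME

# list the index shard objects of the old/corrupted instance and remove them
rados -p default.rgw.buckets.index ls | grep OLD_BUCKET_ID
rados -p default.rgw.buckets.index rm .dir.OLD_BUCKET_ID.0

# then rebuild the bucket index stats
radosgw-admin bucket check --fix --bucket=BUCKET_NAME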

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Boris Behrens
Hi,
we use enterprise SSDs like the SAMSUNG MZ7KM1T9.
They work very well for our block storage. Some NVMe would be a lot nicer,
but we have had good experience with them.

One SSD failure taking down 10 OSDs might sound harsh, but this would be an
okayish risk. Most of the tunables are default in our setup and it looks
like the PGs have a failure domain of host. I restart the systems on a
regular basis for kernel updates.
Also, the disk IO checked with dstat seems to be rather low on the SSDs (below
1k IOPS):
root@s3db18:~# dstat --disk --io  -T  -D sdd
--dsk/sdd-- ---io/sdd-- --epoch---
 read  writ| read  writ|  epoch
 214k 1656k|7.21   126 |1636536603
 144k 1176k|2.00   200 |1636536604
 128k 1400k|2.00   230 |1636536605

Normally I would now try this configuration:
1 SSD / 10 OSDs, with 150GB for block.db and block.wal (both on the same
partition, as someone stated before), and 200GB extra to move all pools
except the .data pool to SSDs.
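
Roughly, provisioning that layout could look like the sketch below (device
names and the VG/LV naming are placeholders, sizes as above; as far as I know,
the WAL ends up on the block.db device when no separate --block.wal is given):

# one VG on the shared 2TB SSD, one 150GB DB LV per HDD-backed OSD
vgcreate ceph-db-0 /dev/sdd
for i in $(seq 0 9); do lvcreate -L 150G -n db-$i ceph-db-0; done

# each new OSD references its DB LV at creation time
ceph-volume lvm prepare --data /dev/sda --block.db ceph-db-0/db-0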

But thinking about 10 downed OSDs if one SSD fails lets me wonder how to
recover from that.
IIRC the per-OSD configuration is in the LVM tags:
root@s3db18:~# lvs -o lv_tags
  LV Tags

ceph.block_device=...,ceph.db_device=/dev/sdd8,ceph.db_uuid=011275a3-4201-8840-a678-c2e23d38bfd6,...

When the SSD fails, can I just remove the tags and restart the OSD with
ceph-volume lvm activate --all? And after replacing the failed SSD, re-add the
tags with the correct IDs? Do I need to do anything else to prepare a block.db
partition?
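
For the tag bookkeeping itself, a minimal sketch (VG/LV names and device paths
are placeholders; the ceph.db_uuid tag would need the same treatment, and this
only covers the LVM metadata, not whether the OSD can actually survive losing
its block.db):

# show the current device mapping that ceph-volume recorded
ceph-volume lvm list

# drop the stale reference on the OSD's data LV, then add the new one
lvchange --deltag 'ceph.db_device=/dev/sdd8' ceph-vg/osd-block-lv
lvchange --addtag 'ceph.db_device=/dev/sde8' ceph-vg/osd-block-lv

# re-activate the OSDs
ceph-volume lvm activate --all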

Cheers
 Boris


On Tue, 9 Nov 2021 at 22:15, prosergey07 wrote:

> Not sure how much it would help performance to have OSDs backed with SSD
> db and wal devices. Even if you go this route with one SSD per 10 HDDs, you
> might want to set the failure domain to host in the crush rules in case an SSD
> is out of service.
>
>  But in practice an SSD will not help too much to boost the performance,
> especially when shared between 10 HDDs.
>
>  We use NVMe db+wal per OSD and a separate NVMe specifically for the metadata
> pools. There will be a lot of I/O on the bucket.index pool and the rgw pool
> which stores user and bucket metadata. So you might want to put them on
> separate fast storage.
>
>  Also, if there are not too many objects (large objects, but not tens to
> hundreds of millions of them), then the bucket index will have less pressure
> and an SSD might be okay for the metadata pools in that case.
>
>
>
> Sent from a Galaxy device
>
>
>  Original message 
> From: Boris Behrens 
> Date: 08.11.21 13:08 (GMT+02:00)
> To: ceph-users@ceph.io
> Subject: [ceph-users] Question if WAL/block.db partition will benefit us
>
> Hi,
> we run a larger octopus s3 cluster with only rotating disks.
> 1.3 PiB with 177 OSDs, some with a SSD block.db and some without.
>
> We have a ton of spare 2TB disks and we just wondered if we can bring them
> to good use.
> For every 10 spinning disks we could add one 2TB SSD and we would create
> two partitions per OSD (130GB for block.db and 20GB for block.wal). This
> would leave some empty space on the SSD for wear leveling.
>
> The question now is: would we benefit from this? Most of the data that is
> written to the cluster is very large (50GB and above). This would take a
> lot of work to restructure the cluster, and also two other clusters.
>
> And does it make a difference to have only a block.db partition or a
> block.db and a block.wal partition?
>
> Cheers
> Boris
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-10 Thread Peter Lieven
On 10.11.21 at 09:57, Manuel Lausch wrote:
> Hi Sage,
>
>
> thank you for your help.
>
>
> My original issue with slow ops on OSD restarts is gone too. Even with default 
> values for paxos_propose_interval.
>
>
> It's a bit annoying that I spent many hours debugging this and in the end only 
> missed one step in the upgrade.
>
> Only during the update itself, until require_osd_release is set to the new 
> version, will there be interruptions.


However, in Octopus the issue does still exist, right?


Peter



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: steady increasing of osd map epoch since octopus

2021-11-10 Thread Manuel Lausch
We found the reason:

after the upgrade from nautilus we forgot to set "ceph osd require-osd-release
pacific".
Now all is fine.
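
For anyone hitting the same thing, checking and setting the flag looks like
this (the same pattern applies when finishing an upgrade to octopus):

# see which release the osdmap currently requires
ceph osd dump | grep require_osd_release

# raise it once all OSDs run the new release
ceph osd require-osd-release pacific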

Thanks
Manuel


From: Manuel Lausch 
Sent: Monday, 8 November 2021 14:37
To: Dan van der Ster
Cc: Ceph Users
Subject: [ceph-users] Re: steady increasing of osd map epoch since octopus

Hi Dan,

thanks for the hint.
The cluster is not doing any changes (rebalance, merging, splitting, or
something like this). Only normal client traffic via librados.

In the mon.log I regularly see the following messages, which seem to
correlate with the osd map "changes":

2021-11-08T14:15:58.915+0100 7f8bd32a3700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd32a3700' had timed out after 0.0s
2021-11-08T14:15:58.953+0100 7f8bd3aa4700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd3aa4700' had timed out after 0.0s
2021-11-08T14:15:59.201+0100 7f8bd2aa2700  1 
mon.csdeveubs-u02c01mon03@2(peon).osd e1970041 e1970041: 125 total, 125 up, 125 
in
2021-11-08T14:15:59.242+0100 7f8bd4aa6700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd4aa6700' had timed out after 0.0s
2021-11-08T14:15:59.480+0100 7f8bd2aa2700  1 
mon.csdeveubs-u02c01mon03@2(peon).osd e1970042 e1970042: 125 total, 125 up, 125 
in
2021-11-08T14:15:59.484+0100 7f8bd32a3700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd32a3700' had timed out after 0.0s
2021-11-08T14:15:59.520+0100 7f8bd42a5700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd42a5700' had timed out after 0.0s
2021-11-08T14:15:59.757+0100 7f8bd2aa2700  1 
mon.csdeveubs-u02c01mon03@2(peon).osd e1970043 e1970043: 125 total, 125 up, 125 
in
2021-11-08T14:15:59.797+0100 7f8bd3aa4700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd3aa4700' had timed out after 0.0s
2021-11-08T14:16:00.047+0100 7f8bd2aa2700  1 
mon.csdeveubs-u02c01mon03@2(peon).osd e1970044 e1970044: 125 total, 125 up, 125 
in
2021-11-08T14:16:00.051+0100 7f8bd4aa6700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd4aa6700' had timed out after 0.0s
2021-11-08T14:16:00.087+0100 7f8bd32a3700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd32a3700' had timed out after 0.0s
2021-11-08T14:16:00.329+0100 7f8bd2aa2700  1 
mon.csdeveubs-u02c01mon03@2(peon).osd e1970045 e1970045: 125 total, 125 up, 125 
in
2021-11-08T14:16:00.369+0100 7f8bd4aa6700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd4aa6700' had timed out after 0.0s
2021-11-08T14:16:00.635+0100 7f8bd2aa2700  1 
mon.csdeveubs-u02c01mon03@2(peon).osd e1970046 e1970046: 125 total, 125 up, 125 
in
2021-11-08T14:16:00.640+0100 7f8bd32a3700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd32a3700' had timed out after 0.0s
2021-11-08T14:16:00.674+0100 7f8bd3aa4700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd3aa4700' had timed out after 0.0s
2021-11-08T14:16:00.930+0100 7f8bd2aa2700  1 
mon.csdeveubs-u02c01mon03@2(peon).osd e1970047 e1970047: 125 total, 125 up, 125 
in
2021-11-08T14:16:00.968+0100 7f8bd32a3700  1 heartbeat_map reset_timeout 
'Monitor::cpu_tp thread 0x7f8bd32a3700' had timed out after 0.0s


Timeouts after 0.0 seconds?
In between these timeouts the osdmap epoch is increasing. This happens
in bursts. Between the bursts there is no new map epoch.


Manuel


On Mon, 8 Nov 2021 13:01:06 +0100
Dan van der Ster  wrote:

> Hi,
>
> Okay. Here is another case which was churning the osdmaps:
> https://tracker.ceph.com/issues/51433
> Perhaps similar debugging will show what's creating the maps in your
> case.
>
> Cheers, Dan
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-10 Thread Manuel Lausch
Hi Sage,


thank you for your help.


My original issue with slow ops on OSD restarts is gone too. Even with default 
values for paxos_propose_interval.


It's a bit annoying that I spent many hours debugging this and in the end only 
missed one step in the upgrade.

Only during the update itself, until require_osd_release is set to the new 
version, will there be interruptions.



Regards
Manuel



From: Sage Weil 
Sent: Tuesday, 9 November 2021 17:29
To: Manuel Lausch
Subject: Re: [ceph-users] Re: OSD spend too much time on "waiting for readable" 
-> slow ops -> laggy pg -> rgw stop -> worst case osd restart

Yeah, I think that is the problem.  The field that is getting updated by 
prepare_beacon is new in octopus, so if your osdmap still has 
require_osd_release=nautilus then it is trying to set it but then not getting 
encoded (for compatibility).  Doing `ceph osd require-osd-release octopus` 
should resolve this.

On Tue, Nov 9, 2021 at 9:01 AM Sage Weil 
mailto:s...@newdream.net>> wrote:
What version are you running?  I thought it was pacific or octopus but the 
osdmap says "require_osd_release": "nautilus" which implies the upgrade 
procedure wasn't finished?

sage

On Tue, Nov 9, 2021 at 8:08 AM Manuel Lausch 
mailto:manuel.lau...@1und1.de>> wrote:
As far as I see, the maps differ only in the epoch and creation date.
Nothing else. I dumped some maps and uploaded it for you:
1f1e1e5e-1c1c-470b-b691-ed820687bab8

On this cluster I don't create snapshots regularly. For some weeks now,
there have been no snapshots present.

please let me know, if you need further information.

Regards
Manuel


On Tue, 9 Nov 2021 07:40:29 -0600
Sage Weil mailto:s...@newdream.net>> wrote:

> Are you sure consecutive maps are identical?  Can you get the latest
> epoch ('ceph osd stat'), and then dump a few consecutive ones?  e.g.
>
> ceph osd dump 1000 -f json-pretty  > 1000
> ceph osd dump 1001 -f json-pretty  > 1001
> ceph osd dump 1002 -f json-pretty  > 1002
> ceph osd dump 1003 -f json-pretty  > 1003
>
> ...and ceph-post-file those?  Based on the logs I think the delta is
> related to snap trimming, but want to confirm.  Thanks!
>
> Thanks!
> sage
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: allocate_bluefs_freespace failed to allocate

2021-11-10 Thread mhnx
Yes. I don't have separate DB/WAL. These SSDs are only used by the rgw index.
The command "--command bluefs-bdev-sizes" does not work if the OSD is up and
running.
I need a new OSD failure to get useful output. I will check when I get one.
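
For the record, the sizes can also be read offline by stopping the OSD first;
a sketch with the OSD id and path as examples (invocation details may vary
slightly by release):

systemctl stop ceph-osd@14
# needs exclusive access to the bluestore devices, hence the stop above
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-14 --command bluefs-bdev-sizes
systemctl start ceph-osd@14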

I picked an OSD from my test environment to check the command output, and it
looks like it is almost the same as "ceph osd df tree":

ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME
14  ssd    0.87299  1.0       894 GiB  29 GiB   9.8 GiB  19 GiB  485 MiB  865 GiB  3.25  0.05  87   up      osd.14

inferring bluefs devices from bluestore path
1 : device size 0xdf9000 : own 0x[6b4f5c~8f148] = 0x8f148 : using 0x4cbe4(19 GiB) : bluestore has 0xd3f43e(848 GiB) available

In my production env I have large OMAPs, but the AVAIL space is large
enough to fit anything.

SIZE= 894 GiB
RAW USE = 214 GiB
DATA= 95  GiB
OMAP= 118 GiB
META= 839 MiB
AVAIL   = 680 GiB
%USE= 23.92



In case you couldn't check the OSD log, I'm sending it below:

-78> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.write_buffer_size: 67108864
   -77> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.max_write_buffer_number: 32
   -76> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.compression: NoCompression
   -75> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
   Options.bottommost_compression: Disabled
   -74> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.prefix_extractor: nullptr
   -73> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.memtable_insert_with_hint_prefix_extractor: nullptr
   -72> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.num_levels: 7
   -71> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.min_write_buffer_number_to_merge: 2
   -70> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.max_write_buffer_number_to_maintain: 0
   -69> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.bottommost_compression_opts.window_bits: -14
   -68> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
   Options.bottommost_compression_opts.level: 32767
   -67> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.bottommost_compression_opts.strategy: 0
   -66> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.bottommost_compression_opts.max_dict_bytes: 0
   -65> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.bottommost_compression_opts.zstd_max_train_bytes: 0
   -64> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
   Options.bottommost_compression_opts.enabled: false
   -63> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.compression_opts.window_bits: -14
   -62> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
   Options.compression_opts.level: 32767
   -61> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.compression_opts.strategy: 0
   -60> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.compression_opts.max_dict_bytes: 0
   -59> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.compression_opts.zstd_max_train_bytes: 0
   -58> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
   Options.compression_opts.enabled: false
   -57> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.level0_file_num_compaction_trigger: 8
   -56> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.level0_slowdown_writes_trigger: 32
   -55> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.level0_stop_writes_trigger: 64
   -54> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.target_file_size_base: 67108864
   -53> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.target_file_size_multiplier: 1
   -52> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
 Options.max_bytes_for_level_base: 536870912
   -51> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.level_compaction_dynamic_level_bytes: 0
   -50> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.max_bytes_for_level_multiplier: 10.00
   -49> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.max_bytes_for_level_multiplier_addtl[0]: 1
   -48> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.max_bytes_for_level_multiplier_addtl[1]: 1
   -47> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.max_bytes_for_level_multiplier_addtl[2]: 1
   -46> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.max_bytes_for_level_multiplier_addtl[3]: 1
   -45> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.max_bytes_for_level_multiplier_addtl[4]: 1
   -44> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.max_bytes_for_level_multiplier_addtl[5]: 1
   -43> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.max_bytes_for_level_multiplier_addtl[6]: 1
   -42> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
Options.max_sequential_skip_in_iterations: 8
   -41> 2021-11-06 19:01:10.454 7fa799989c40  4 rocksdb:
 Options.max_compaction_bytes: 1677721600
   -40> 2021-11-06