[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-12-13 Thread Boris Behrens
You could try to do this in a screen session for a while.
while true; do radosgw-admin gc process; done

Maybe your normal RGW daemons are too busy for GC processing.
We have this in our config and have started extra RGW instances for GC only:
[global]
...
# disable garbage collection threads by default
rgw_enable_gc_threads = false
[client.gc-host1]
rgw_frontends = "beast endpoint=[::1]:7489"
rgw_enable_gc_threads = true
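
To see whether the backlog actually shrinks over time, we roughly count the
pending GC entries like this (a quick sketch, assuming the gc list JSON still
prints one "tag" field per entry):

radosgw-admin gc list --include-all | grep -c '"tag"'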

Am Mi., 14. Dez. 2022 um 01:14 Uhr schrieb Jakub Jaszewski
:
>
> Hi Boris, many thanks for the link!
>
> I see that the GC list keeps growing on my cluster and there are some very big
> multipart objects on the GC list, even 138660 parts that I calculate as
> >500GB in size.
> These objects are visible on the GC list but not on the RADOS level when calling
> radosgw-admin --bucket=bucket_name bucket radoslist
> Also, I manually ran GC processing (radosgw-admin gc process
> --bucket=bucket_name --debug-rgw=20), which according to the logs did the job (no
> errors raised, although the objects do not exist in RADOS?)
> ...
> 2022-12-13T20:21:06.635+0100 7fe0eb771080 20 garbage collection: 
> RGWGC::process iterating over entry tag='2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD', 
> time=2022-12-13T12:35:59.727067+0100, chain.objs.size()=138660
> 2022-12-13T20:21:06.635+0100 7fe0eb771080  5 garbage collection: 
> RGWGC::process removing 
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__multipart_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1
> 2022-12-13T20:21:06.703+0100 7fe0eb771080  5 garbage collection: 
> RGWGC::process removing 
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_1
> 2022-12-13T20:21:06.859+0100 7fe0eb771080  5 garbage collection: 
> RGWGC::process removing 
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_2
> ...
> but GC queue did not reduce, objects are still on the GC list.
>
> Do you happen to know how to remove non-existent RADOS objects from the RGW GC
> list?
>
> One more thing I have to check is max_secs=3600 for GC when entering a
> particular index_shard. As you can see in the logs, processing of multipart
> objects takes more than 3600 seconds. I will try to increase
> rgw_gc_processor_max_time
>
> 2022-12-13T20:20:13.168+0100 7fe0eb771080 20 garbage collection: 
> RGWGC::process entered with GC index_shard=25, max_secs=3600, expired_only=1
> 2022-12-13T20:20:13.168+0100 7fe0eb771080 20 garbage collection: 
> RGWGC::process cls_rgw_gc_list returned with returned:0, entries.size=0, 
> truncated=0, next_marker=''
> 2022-12-13T20:20:13.172+0100 7fe0eb771080 20 garbage collection: 
> RGWGC::process cls_rgw_gc_list returned NO non expired entries, so setting 
> cache entry to TRUE
> 2022-12-13T20:20:27.748+0100 7fe02700  2 
> RGWDataChangesLog::ChangesRenewThread: start
> 2022-12-13T20:20:49.748+0100 7fe02700  2 
> RGWDataChangesLog::ChangesRenewThread: start
> ...
> 2022-12-13T20:21:05.339+0100 7fe0eb771080 20 garbage collection: 
> RGWGC::process cls_rgw_gc_queue_list_entries returned with return value:0, 
> entries.size=100, truncated=1, next_marker='4/20986990'
> 2022-12-13T20:21:06.635+0100 7fe0eb771080 20 garbage collection: 
> RGWGC::process iterating over entry tag='2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD', 
> time=2022-12-13T12:35:59.727067+0100, chain.objs.size()=138660
> 2022-12-13T20:21:06.635+0100 7fe0eb771080  5 garbage collection: 
> RGWGC::process removing 
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__multipart_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1
> 2022-12-13T20:21:06.703+0100 7fe0eb771080  5 garbage collection: 
> RGWGC::process removing 
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_1
> ...
> 2022-12-13T21:31:23.505+0100 7fe0eb771080  5 garbage collection: 
> RGWGC::process removing 
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__
> shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.4622_29
> ...
> 2022-12-13T21:31:23.565+0100 7fe0eb771080 20 garbage collection: 
> RGWGC::process entered with GC index_shard=26, max_secs=3600, expired_only=1
> ...
>
> Best Regards
> Jakub
>
> On Wed, Dec 7, 2022 at 6:10 PM Boris  wrote:
>>
>> Hi Jakub,
>>
>> the problem in our case is that we hit this bug
>> (https://tracker.ceph.com/issues/53585) and the GC leads to this problem.
>>
>> We worked around this by moving the GC to a separate disk.
>> It still runs nuts all the time, but at least it does not bring down the
>> cluster; now, however, it looks like RADOS objects go missing.
>>
>> Mit freundlichen Grüßen
>>  - Bor

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-12-14 Thread Jakub Jaszewski
Sure, I tried it in a screen session before, but it did not reduce the queue.

Eventually I managed to zero the queue by increasing these parameters:
radosgw-admin gc process --include-all --debug-rgw=20
--rgw-gc-max-concurrent-io=20 --rgw-gc-max-trim-chunk=64
--rgw-gc-processor-max-time=7200

I think it was a matter of the lock on the GC shard being released before
RGWGC::process finished removing all objects included in that shard; however,
I did not notice any errors in the output with debug-rgw=20.
https://github.com/ceph/ceph/blob/octopus/src/rgw/rgw_gc.cc#L514
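
In case it helps someone: to keep these values instead of passing them on every
run, something like the following should work via the config database (a sketch,
assuming the option names match the command-line flags):

ceph config set client.rgw rgw_gc_max_concurrent_io 20
ceph config set client.rgw rgw_gc_max_trim_chunk 64
ceph config set client.rgw rgw_gc_processor_max_time 7200
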
Many thanks
Jakub



On Wed, Dec 14, 2022 at 1:24 AM Boris Behrens  wrote:

> You could try to do this in a screen session for a while.
> while true; do radosgw-admin gc process; done
>
> Maybe your normal RGW daemons are too busy for GC processing.
> We have this in our config and have started extra RGW instances for GC
> only:
> [global]
> ...
> # disable garbage collector default
> rgw_enable_gc_threads = false
> [client.gc-host1]
> rgw_frontends = "beast endpoint=[::1]:7489"
> rgw_enable_gc_threads = true
>
> Am Mi., 14. Dez. 2022 um 01:14 Uhr schrieb Jakub Jaszewski
> :
> >
> > Hi Boris, many thanks for the link!
> >
> > I see that GC list keep growing on my cluster and there are some very
> big multipart objects on the GC list, even 138660 parts that I calculate as
> >500GB in size.
> > These objects are visible on the GC list but not on rados-level when
> calling radosgw-admin --bucket=bucket_name bucket radoslist
> > Also I manually called GC process,  radosgw-admin gc process
> --bucket=bucket_name --debug-rgw=20   which according to logs did the job
> (no errors raised although objects do not exist in rados?)
> > ...
> > 2022-12-13T20:21:06.635+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process iterating over entry
> tag='2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD',
> time=2022-12-13T12:35:59.727067+0100, chain.objs.size()=138660
> > 2022-12-13T20:21:06.635+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__multipart_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1
> > 2022-12-13T20:21:06.703+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_1
> > 2022-12-13T20:21:06.859+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_2
> > ...
> > but GC queue did not reduce, objects are still on the GC list.
> >
> > Do you happen to know how to remove non existent RADOS objects from RGW
> GC list ?
> >
> > One more thing i have to check is max_secs=3600 for GC when entering
> particular index_shard. As you can see in the logs, processing of
> multiparted objects takes more than 3600 seconds.  I will try to increase
> rgw_gc_processor_max_time
> >
> > 2022-12-13T20:20:13.168+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process entered with GC index_shard=25, max_secs=3600, expired_only=1
> > 2022-12-13T20:20:13.168+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process cls_rgw_gc_list returned with returned:0, entries.size=0,
> truncated=0, next_marker=''
> > 2022-12-13T20:20:13.172+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process cls_rgw_gc_list returned NO non expired entries, so setting
> cache entry to TRUE
> > 2022-12-13T20:20:27.748+0100 7fe02700  2
> RGWDataChangesLog::ChangesRenewThread: start
> > 2022-12-13T20:20:49.748+0100 7fe02700  2
> RGWDataChangesLog::ChangesRenewThread: start
> > ...
> > 2022-12-13T20:21:05.339+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process cls_rgw_gc_queue_list_entries returned with return value:0,
> entries.size=100, truncated=1, next_marker='4/20986990'
> > 2022-12-13T20:21:06.635+0100 7fe0eb771080 20 garbage collection:
> RGWGC::process iterating over entry
> tag='2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD',
> time=2022-12-13T12:35:59.727067+0100, chain.objs.size()=138660
> > 2022-12-13T20:21:06.635+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__multipart_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1
> > 2022-12-13T20:21:06.703+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_1
> > ...
> > 2022-12-13T21:31:23.505+0100 7fe0eb771080  5 garbage collection:
> RGWGC::process removing
> default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__
> >
> shadow_2ib3aonh7thn59a394l06un5

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-08 Thread Dan van der Ster
Here's the reason they exit:

7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 >
osd_max_markdown_count 5 in last 600.00 seconds, shutting down

If an osd flaps (marked down, then up) 6 times in 10 minutes, it
exits. (This is a safety measure).
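
The relevant knobs can be inspected like this (just an illustration; the
defaults are usually fine):

ceph config get osd osd_max_markdown_count    # default 5
ceph config get osd osd_max_markdown_period   # default 600 seconds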

It's normally caused by a network issue -- other OSDs are telling the
mon that he is down, but then the OSD himself tells the mon that he's
up!

Cheers, Dan

On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens  wrote:
>
> Hi,
>
> we've had the problem with OSDs being marked as offline since we updated to
> octopus and hoped the problem would be fixed with the latest patch. We have
> this kind of problem only with octopus, and there only with the big s3
> cluster.
> * Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
> * Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
> * We only use the frontend network.
> * All disks are spinning, some have block.db devices.
> * All disks are bluestore
> * configs are mostly defaults
> * we've set the OSDs to restart=always without a limit, because we had the
> problem with unavailable PGs when two OSDs are marked as offline and they
> share PGs.
>
> But since we installed the latest patch we are experiencing more OSD downs
> and even crashes.
> I tried to remove as many duplicated lines as possible.
>
> Is the NUMA error a problem?
> Why do OSD daemons not respond to heartbeats? I mean, even when the disk is
> totally loaded with IO, the system itself should answer heartbeats, or am I
> missing something?
>
> I really hope some of you could send me on the correct way to solve this
> nasty problem.
>
> This is how the latest crash looks like
> Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+
> 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify public
> interface '' numa node: (2) No such file or directory
> ...
> Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+
> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public
> interface '' numa node: (2) No such file or directory
> Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) **
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
> thread_name:tp_osd_tp
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
> [0x7f5f0d45ef08]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
> unsigned long)+0x471) [0x55a699a01201]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
> long, unsigned long)+0x8e) [0x55a699a0199e]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
> [0x55a699a224b0]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43) [0x7f5f0cfc0163]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2022-03-07T17:53:07.387+
> 7f5ef1501700 -1 *** Caught signal (Aborted) **
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
> thread_name:tp_osd_tp
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
> [0x7f5f0d45ef08]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
> unsigned long)+0x471) [0x55a699a01201]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
> long, unsigned long)+0x8e) [0x55a699a0199e]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
> [0x55a699a224b0]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43) [0x7f5f0cfc0163]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  NOTE: a copy of the executable, or
> `objdump -rdS ` is needed to interpret this.
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  -5246> 2022-03-07T17:49:07.678+
> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public
> interface '' numa node: (2) No such file or directory
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  0> 2022-03-07T17:53:07.387+
> 7f5ef1501700 -1 *** Caught signal (Aborted) **
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-08 Thread Boris Behrens
Yes, this is something we know and we disabled it, because we ran into the
problem that PGs went unavailable when two or more OSDs went offline.

I am searching for the reason WHY this happens.
Currently we have set the service file to restart=always and removed the
StartLimitBurst from the service file.

We just don't understand why the OSDs don't answer the heartbeat. The OSDs
that are flapping are random in terms of host, disk size, and having an SSD
block.db or not.
Network connectivity issues are something I would rule out, because the
cluster went from "nothing ever happens except IOPS" to "random OSDs are
marked DOWN until they kill themselves" with the update from nautilus to
octopus.

I am out of ideas and hoped this was a bug in 15.2.15, but after the update
things got worse (it happens more often).
We tried to:
* disable swap
* more swap
* disable bluefs_buffered_io
* disable write cache for all disks
* disable scrubbing
* reinstall with new OS (from centos7 to ubuntu 20.04)
* disable cluster_network (so there is only one way to communicate)
* increase txqueuelen on the network interfaces
* everything together


What we try next: add more SATA controllers, so there are not 24 disks
attached to a single controller, but I doubt this will help.
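
One more thing we plan to capture on a flapping OSD is the heartbeat ping times
the daemon itself records (a sketch, assuming osd.97 from the log above and the
Octopus admin socket commands):

ceph daemon osd.97 dump_osd_network 0    # dump all recorded heartbeat ping times
ceph osd perf                            # commit/apply latency per OSD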

Cheers
 Boris



Am Di., 8. März 2022 um 09:10 Uhr schrieb Dan van der Ster <
dvand...@gmail.com>:

> Here's the reason they exit:
>
> 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 >
> osd_max_markdown_count 5 in last 600.00 seconds, shutting down
>
> If an osd flaps (marked down, then up) 6 times in 10 minutes, it
> exits. (This is a safety measure).
>
> It's normally caused by a network issue -- other OSDs are telling the
> mon that he is down, but then the OSD himself tells the mon that he's
> up!
>
> Cheers, Dan
>
> On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens  wrote:
> >
> > Hi,
> >
> > we've had the problem with OSDs marked as offline since we updated to
> > octopus and hope the problem would be fixed with the latest patch. We
> have
> > this kind of problem only with octopus and there only with the big s3
> > cluster.
> > * Hosts are all Ubuntu 20,04 and we've set the txqueuelen to 10k
> > * Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
> > * We only use the frontend network.
> > * All disks are spinning, some have block.db devices.
> > * All disks are bluestore
> > * configs are mostly defaults
> > * we've set the OSDs to restart=always without a limit, because we had
> the
> > problem with unavailable PGs when two OSDs are marked as offline and the
> > share PGs.
> >
> > But since we installed the latest patch we are experiencing more OSD
> downs
> > and even crashes.
> > I tried to remove as much duplicated lines as possible.
> >
> > Is the numa error a problem?
> > Why do OSD daemons not respond to hearthbeats? I mean even when the disk
> is
> > totally loaded with IO, the system itself should answer heathbeats, or
> am I
> > missing something?
> >
> > I really hope some of you could send me on the correct way to solve this
> > nasty problem.
> >
> > This is how the latest crash looks like
> > Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+
> > 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify
> public
> > interface '' numa node: (2) No such file or directory
> > ...
> > Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+
> > 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify
> public
> > interface '' numa node: (2) No such file or directory
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) **
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
> > thread_name:tp_osd_tp
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
> > (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
> > [0x7f5f0d45ef08]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
> > (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
> > unsigned long)+0x471) [0x55a699a01201]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
> > (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
> > long, unsigned long)+0x8e) [0x55a699a0199e]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
> > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
> > [0x55a699a224b0]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
> > (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43)
> [0x7f5f0cfc0163]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2022-03-07T17:53:07.387+
> > 7f5ef1501700 -1 *** Caught signal (Aborted) **
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
> > thread_name:tp_osd_tp

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-08 Thread Francois Legrand

Hi,
We also had this kind of problem after upgrading to Octopus. Maybe you 
can play with the heartbeat grace time ( 
https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/ 
) to tell OSDs to wait a little longer before declaring another OSD down!
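
For example, something along these lines (just a sketch, adjust the value to
your environment):

ceph config set osd osd_heartbeat_grace 30    # default is 20 seconds
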
We also tried to fix the problem by manually compacting the down OSD 
(something like: systemctl stop ceph-osd@74; sleep 10; 
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact; 
systemctl start ceph-osd@74).
This worked a few times, but some OSDs went down again, so we simply 
waited for the data to be reconstructed elsewhere and then reinstalled the 
dead OSD:

ceph osd destroy 74 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sde --destroy
ceph-volume lvm create --osd-id 74 --data /dev/sde

This seems to fix the issue for us (up to now).

F.

Le 08/03/2022 à 09:35, Boris Behrens a écrit :

Yes, this is something we know and we disabled it, because we ran into the
problem that PGs went unavailable when two or more OSDs went offline.

I am searching for the reason WHY this happens.
Currently we have set the service file to restart=always and removed the
StartLimitBurst from the service file.

We just don't understand why the OSDs don't answer the heathbeat. The OSDs
that are flapping are random in terms of Host, Disksize, having SSD
block.db or not.
Network connectivity issues is something that I would rule out, because the
cluster went from "nothing ever happens except IOPS" to "random OSDs are
marked DOWN until they kill themself" with the update from nautilus to
octopus.

I am out of ideas and hoped this was a bug in 15.2.15, but after the update
things got worse (happen more often).
We tried to:
* disable swap
* more swap
* disable bluefs_buffered_io
* disable write cache for all disks
* disable scrubbing
* reinstall with new OS (from centos7 to ubuntu 20.04)
* disable cluster_network (so there is only one way to communicate)
* increase txqueuelen on the network interfaces
* everything together


What we try next: add more SATA controllers, so there are not 24 disks
attached to a single controller, but I doubt this will help.

Cheers
  Boris



Am Di., 8. März 2022 um 09:10 Uhr schrieb Dan van der Ster <
dvand...@gmail.com>:


Here's the reason they exit:

7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 >
osd_max_markdown_count 5 in last 600.00 seconds, shutting down

If an osd flaps (marked down, then up) 6 times in 10 minutes, it
exits. (This is a safety measure).

It's normally caused by a network issue -- other OSDs are telling the
mon that he is down, but then the OSD himself tells the mon that he's
up!

Cheers, Dan

On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens  wrote:

Hi,

we've had the problem with OSDs marked as offline since we updated to
octopus and hope the problem would be fixed with the latest patch. We

have

this kind of problem only with octopus and there only with the big s3
cluster.
* Hosts are all Ubuntu 20,04 and we've set the txqueuelen to 10k
* Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
* We only use the frontend network.
* All disks are spinning, some have block.db devices.
* All disks are bluestore
* configs are mostly defaults
* we've set the OSDs to restart=always without a limit, because we had

the

problem with unavailable PGs when two OSDs are marked as offline and the
share PGs.

But since we installed the latest patch we are experiencing more OSD

downs

and even crashes.
I tried to remove as much duplicated lines as possible.

Is the numa error a problem?
Why do OSD daemons not respond to hearthbeats? I mean even when the disk

is

totally loaded with IO, the system itself should answer heathbeats, or

am I

missing something?

I really hope some of you could send me on the correct way to solve this
nasty problem.

This is how the latest crash looks like
Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+
7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify

public

interface '' numa node: (2) No such file or directory
...
Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+
7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify

public

interface '' numa node: (2) No such file or directory
Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) **
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
thread_name:tp_osd_tp
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
(d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
[0x7f5f0d45ef08]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
unsigned long)+0x471) [0x55a699a01201]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
(ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
long, unsigned long

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-08 Thread Boris Behrens
Hi Francois,

thanks for the reminder. We offline compacted all of the OSDs when we
reinstalled the hosts with the new OS.
But actually reinstalling them was never on my list.

I could try that, and in the same go I can remove all the cache SSDs (when
one SSD shares the cache for 10 OSDs this is a horrible SPOF) and reuse the
SSDs as OSDs for the smaller pools in RGW (like log and meta).
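
If the goal is only to drop the shared block.db without fully re-syncing every
OSD, something like ceph-bluestore-tool's bluefs-bdev-migrate might also work
(a sketch, untested on our side; the LVM tags/symlink of the old db device
still need cleaning up afterwards):

systemctl stop ceph-osd@NN
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-NN \
  --devs-source /var/lib/ceph/osd/ceph-NN/block.db \
  --dev-target /var/lib/ceph/osd/ceph-NN/block
systemctl start ceph-osd@NN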

How long ago did you recreate the earliest OSD?

Cheers
 Boris

Am Di., 8. März 2022 um 10:03 Uhr schrieb Francois Legrand <
f...@lpnhe.in2p3.fr>:

> Hi,
> We also had this kind of problems after upgrading to octopus. Maybe you
> can play with the hearthbeat grace time (
> https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/
> ) to tell osds to wait a little more before declaring another osd down !
> We also try to fix the problem by manually compact the down osd
> (something like : systemctl stop ceph-osd@74; sleep 10;
> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact;
> systemctl start ceph-osd@74).
> This worked a few times, but some osd went down again, thus we simply
> wait for the datas to be reconstructed elswhere and then reinstall the
> dead osd :
> ceph osd destroy 74 --yes-i-really-mean-it
> ceph-volume lvm zap /dev/sde --destroy
> ceph-volume lvm create --osd-id 74 --data /dev/sde
>
> This seems to fix the issue for us (up to now).
>
> F.
>
> Le 08/03/2022 à 09:35, Boris Behrens a écrit :
> > Yes, this is something we know and we disabled it, because we ran into
> the
> > problem that PGs went unavailable when two or more OSDs went offline.
> >
> > I am searching for the reason WHY this happens.
> > Currently we have set the service file to restart=always and removed the
> > StartLimitBurst from the service file.
> >
> > We just don't understand why the OSDs don't answer the heathbeat. The
> OSDs
> > that are flapping are random in terms of Host, Disksize, having SSD
> > block.db or not.
> > Network connectivity issues is something that I would rule out, because
> the
> > cluster went from "nothing ever happens except IOPS" to "random OSDs are
> > marked DOWN until they kill themself" with the update from nautilus to
> > octopus.
> >
> > I am out of ideas and hoped this was a bug in 15.2.15, but after the
> update
> > things got worse (happen more often).
> > We tried to:
> > * disable swap
> > * more swap
> > * disable bluefs_buffered_io
> > * disable write cache for all disks
> > * disable scrubbing
> > * reinstall with new OS (from centos7 to ubuntu 20.04)
> > * disable cluster_network (so there is only one way to communicate)
> > * increase txqueuelen on the network interfaces
> > * everything together
> >
> >
> > What we try next: add more SATA controllers, so there are not 24 disks
> > attached to a single controller, but I doubt this will help.
> >
> > Cheers
> >   Boris
> >
> >
> >
> > Am Di., 8. März 2022 um 09:10 Uhr schrieb Dan van der Ster <
> > dvand...@gmail.com>:
> >
> >> Here's the reason they exit:
> >>
> >> 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 >
> >> osd_max_markdown_count 5 in last 600.00 seconds, shutting down
> >>
> >> If an osd flaps (marked down, then up) 6 times in 10 minutes, it
> >> exits. (This is a safety measure).
> >>
> >> It's normally caused by a network issue -- other OSDs are telling the
> >> mon that he is down, but then the OSD himself tells the mon that he's
> >> up!
> >>
> >> Cheers, Dan
> >>
> >> On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens  wrote:
> >>> Hi,
> >>>
> >>> we've had the problem with OSDs marked as offline since we updated to
> >>> octopus and hope the problem would be fixed with the latest patch. We
> >> have
> >>> this kind of problem only with octopus and there only with the big s3
> >>> cluster.
> >>> * Hosts are all Ubuntu 20,04 and we've set the txqueuelen to 10k
> >>> * Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
> >>> * We only use the frontend network.
> >>> * All disks are spinning, some have block.db devices.
> >>> * All disks are bluestore
> >>> * configs are mostly defaults
> >>> * we've set the OSDs to restart=always without a limit, because we had
> >> the
> >>> problem with unavailable PGs when two OSDs are marked as offline and
> the
> >>> share PGs.
> >>>
> >>> But since we installed the latest patch we are experiencing more OSD
> >> downs
> >>> and even crashes.
> >>> I tried to remove as much duplicated lines as possible.
> >>>
> >>> Is the numa error a problem?
> >>> Why do OSD daemons not respond to hearthbeats? I mean even when the
> disk
> >> is
> >>> totally loaded with IO, the system itself should answer heathbeats, or
> >> am I
> >>> missing something?
> >>>
> >>> I really hope some of you could send me on the correct way to solve
> this
> >>> nasty problem.
> >>>
> >>> This is how the latest crash looks like
> >>> Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+
> >>> 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-08 Thread Francois Legrand

Hi,

The last 2 OSDs I recreated were on December 30 and February 8.

I totally agree that SSD caches are a terrible SPOF. I think that's an 
option if you use 1 SSD/NVMe for 1 or 2 OSDs, but the cost is then very 
high. Using 1 SSD for 10 OSDs increases the risk for almost no gain, 
because the SSD is 10 times faster but gets 10 times more accesses!
Indeed, we did some benchmarks with NVMe for the WAL/DB (1 NVMe for ~10 
OSDs), and the gain was not tremendous, so we decided not to use them!

F.


Le 08/03/2022 à 11:57, Boris Behrens a écrit :

Hi Francois,

thanks for the reminder. We offline compacted all of the OSDs when we 
reinstalled the hosts with the new OS.

But actually reinstalling them was never on my list.

I could try that and in the same go I can remove all the cache SSDs 
(when one SSD share the cache for 10 OSDs this is a horrible SPOF) and 
reuse the SSDs as OSDs for the smaller pools in a RGW (like log and meta).


How long ago did you recreate the earliest OSD?

Cheers
 Boris

Am Di., 8. März 2022 um 10:03 Uhr schrieb Francois Legrand 
:


Hi,
We also had this kind of problems after upgrading to octopus.
Maybe you
can play with the hearthbeat grace time (
https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/

) to tell osds to wait a little more before declaring another osd
down !
We also try to fix the problem by manually compact the down osd
(something like : systemctl stop ceph-osd@74; sleep 10;
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact;
systemctl start ceph-osd@74).
This worked a few times, but some osd went down again, thus we simply
wait for the datas to be reconstructed elswhere and then reinstall
the
dead osd :
ceph osd destroy 74 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sde --destroy
ceph-volume lvm create --osd-id 74 --data /dev/sde

This seems to fix the issue for us (up to now).

F.

Le 08/03/2022 à 09:35, Boris Behrens a écrit :
> Yes, this is something we know and we disabled it, because we
ran into the
> problem that PGs went unavailable when two or more OSDs went
offline.
>
> I am searching for the reason WHY this happens.
> Currently we have set the service file to restart=always and
removed the
> StartLimitBurst from the service file.
>
> We just don't understand why the OSDs don't answer the
heathbeat. The OSDs
> that are flapping are random in terms of Host, Disksize, having SSD
> block.db or not.
> Network connectivity issues is something that I would rule out,
because the
> cluster went from "nothing ever happens except IOPS" to "random
OSDs are
> marked DOWN until they kill themself" with the update from
nautilus to
> octopus.
>
> I am out of ideas and hoped this was a bug in 15.2.15, but after
the update
> things got worse (happen more often).
> We tried to:
> * disable swap
> * more swap
> * disable bluefs_buffered_io
> * disable write cache for all disks
> * disable scrubbing
> * reinstall with new OS (from centos7 to ubuntu 20.04)
> * disable cluster_network (so there is only one way to communicate)
> * increase txqueuelen on the network interfaces
> * everything together
>
>
> What we try next: add more SATA controllers, so there are not 24
disks
> attached to a single controller, but I doubt this will help.
>
> Cheers
>   Boris
>
>
>
> Am Di., 8. März 2022 um 09:10 Uhr schrieb Dan van der Ster <
> dvand...@gmail.com>:
>
>> Here's the reason they exit:
>>
>> 7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 >
>> osd_max_markdown_count 5 in last 600.00 seconds, shutting down
>>
>> If an osd flaps (marked down, then up) 6 times in 10 minutes, it
>> exits. (This is a safety measure).
>>
>> It's normally caused by a network issue -- other OSDs are
telling the
>> mon that he is down, but then the OSD himself tells the mon
that he's
>> up!
>>
>> Cheers, Dan
>>
>> On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens  wrote:
>>> Hi,
>>>
>>> we've had the problem with OSDs marked as offline since we
updated to
>>> octopus and hope the problem would be fixed with the latest
patch. We
>> have
>>> this kind of problem only with octopus and there only with the
big s3
>>> cluster.
>>> * Hosts are all Ubuntu 20,04 and we've set the txqueuelen to 10k
>>> * Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
>>> * We only use the frontend network.
>>> * All disks are spinning, some have block.db devices.
>>> * All disks are bluestore
>>> * configs are mostly defaults
>>> * we've set the OSDs to restart=always without a limit,
because we had
>> the
>>> problem with unavailable PGs when two OSDs are

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-20 Thread Boris Behrens
So,
I have tried to remove the OSDs, wipe the disks, and sync them back in
without a block.db SSD. (Still in progress; 212 spinning disks take time to
out and in again.)
And I just experienced the same behavior on one OSD on a host where all
disks were freshly synced in. This disk was marked as in yesterday and is still
backfilling.

Right before the OSD gets marked down by other OSDs, I observe a ton of
these log entries:
2022-03-20T11:54:40.759+ 7ff9eef5d700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
...
2022-03-20T11:55:02.370+ 7ff9ee75c700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
2022-03-20T11:55:03.290+ 7ff9d3c8a700  1 heartbeat_map reset_timeout
'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
..
2022-03-20T11:55:03.390+ 7ff9df4a1700  0 log_channel(cluster) log [WRN]
: Monitor daemon marked osd.48 down, but it is still running
2022-03-20T11:55:03.390+ 7ff9df4a1700  0 log_channel(cluster) log [DBG]
: map e514383 wrongly marked me down at e514383
2022-03-20T11:55:03.390+ 7ff9df4a1700 -1 osd.48 514383
_committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last
600.00 seconds, shutting down

There are 21 OSDs without cache SSD in the host.
All disks are attached to a single Broadcom / LSI SAS3008 SAS controller.
256GB ECC-RAM / 40 CPU cores.

What else can I do to find the problem?
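
For reference, this is what I plan to capture on the next occurrence (a sketch,
using osd.48 from the log above):

ceph daemon osd.48 dump_ops_in_flight        # what the op threads are currently stuck on
ceph daemon osd.48 dump_historic_slow_ops    # recent slow ops with per-step timings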

Am Di., 8. März 2022 um 12:25 Uhr schrieb Francois Legrand <
f...@lpnhe.in2p3.fr>:

> Hi,
>
> The last 2 osd I recreated were on december 30 and february 8.
>
> I totally agree that ssd cache are a terrible spof. I think that's an
> option if you use 1 ssd/nvme for 1 or 2 osd, but the cost is then very
> high. Using 1 ssd for 10 osd increase the risk for almost no gain because
> the ssd is 10 times faster but has 10 times more access !
> Indeed, we did some benches with nvme for the wal db (1 nvme for ~10
> osds), and the gain was not tremendous, so we decided not use them !
> F.
>
>
> Le 08/03/2022 à 11:57, Boris Behrens a écrit :
>
> Hi Francois,
>
> thanks for the reminder. We offline compacted all of the OSDs when we
> reinstalled the hosts with the new OS.
> But actually reinstalling them was never on my list.
>
> I could try that and in the same go I can remove all the cache SSDs (when
> one SSD share the cache for 10 OSDs this is a horrible SPOF) and reuse the
> SSDs as OSDs for the smaller pools in a RGW (like log and meta).
>
> How long ago did you recreate the earliest OSD?
>
> Cheers
>  Boris
>
> Am Di., 8. März 2022 um 10:03 Uhr schrieb Francois Legrand <
> f...@lpnhe.in2p3.fr>:
>
>> Hi,
>> We also had this kind of problems after upgrading to octopus. Maybe you
>> can play with the hearthbeat grace time (
>> https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/
>> ) to tell osds to wait a little more before declaring another osd down !
>> We also try to fix the problem by manually compact the down osd
>> (something like : systemctl stop ceph-osd@74; sleep 10;
>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact;
>> systemctl start ceph-osd@74).
>> This worked a few times, but some osd went down again, thus we simply
>> wait for the datas to be reconstructed elswhere and then reinstall the
>> dead osd :
>> ceph osd destroy 74 --yes-i-really-mean-it
>> ceph-volume lvm zap /dev/sde --destroy
>> ceph-volume lvm create --osd-id 74 --data /dev/sde
>>
>> This seems to fix the issue for us (up to now).
>>
>> F.
>>
>> Le 08/03/2022 à 09:35, Boris Behrens a écrit :
>> > Yes, this is something we know and we disabled it, because we ran into
>> the
>> > problem that PGs went unavailable when two or more OSDs went offline.
>> >
>> > I am searching for the reason WHY this happens.
>> > Currently we have set the service file to restart=always and removed the
>> > StartLimitBurst from the service file.
>> >
>> > We just don't understand why the OSDs don't answer the heathbeat. The
>> OSDs
>> > that are flapping are random in terms of Host, Disksize, having SSD
>> > block.db or not.
>> > Network connectivity issues is something that I would rule out, because
>> the
>> > cluster went from "nothing ever happens except IOPS" to "random OSDs are
>> > marked DOWN until they kill themself" with the update from nautilus to
>> > octopus.
>> >
>> > I am out of ideas and hoped this was a bug in 15.2.15, but after the
>> update
>> > things got worse (happen more often).
>> > We tried to:
>> > * disable swap
>> > * more swap
>> > * disable bluefs_buffered_io
>> > * disable write cache for all disks
>> > * disable scrubbing
>> > * reinstall with new OS (from centos7 to ubuntu 20.04)
>> > * disable cluster_network (so there is only one way to communicate)
>> > * increase txqueuelen on the network interfaces
>> > * everything together
>> >
>> >
>> > What we try next: add more SATA controllers, so there are not 24 disks
>> > attached to a single controller, but I doubt this wi

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-21 Thread Konstantin Shalygin
Hi,

What is the actual hardware (CPU, spinners, NVMe, network)?
Is this HDD with block.db on NVMe?
How many PGs per OSD?
How many objects per PG?


k
Sent from my iPhone

> On 20 Mar 2022, at 19:59, Boris Behrens  wrote:
> 
> So,
> I have tried to remove the OSDs, wipe the disks and sync them back in
> without block.db SSD. (Still in progress, 212 spinning disks take time to
> out and in again)
> And I just experienced them same behavior on one OSD on a host where all
> disks got synced in new. This disk was marked as in yesterday and is still
> backfilling.
> 
> Right before the OSD get marked as down by other OSDs I observe a ton of
> these log entries:
> 2022-03-20T11:54:40.759+ 7ff9eef5d700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
> ...
> 2022-03-20T11:55:02.370+ 7ff9ee75c700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
> 2022-03-20T11:55:03.290+ 7ff9d3c8a700  1 heartbeat_map reset_timeout
> 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
> ..
> 2022-03-20T11:55:03.390+ 7ff9df4a1700  0 log_channel(cluster) log [WRN]
> : Monitor daemon marked osd.48 down, but it is still running
> 2022-03-20T11:55:03.390+ 7ff9df4a1700  0 log_channel(cluster) log [DBG]
> : map e514383 wrongly marked me down at e514383
> 2022-03-20T11:55:03.390+ 7ff9df4a1700 -1 osd.48 514383
> _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last
> 600.00 seconds, shutting down
> 
> There are 21 OSDs without cache SSD in the host.
> All disks are attached to a single Broadcom / LSI SAS3008 SAS controller.
> 256GB ECC-RAM / 40 CPU cores.
> 
> What else can I do to find the problem?
> 
>> Am Di., 8. März 2022 um 12:25 Uhr schrieb Francois Legrand <
>> f...@lpnhe.in2p3.fr>:
>> 
>> Hi,
>> 
>> The last 2 osd I recreated were on december 30 and february 8.
>> 
>> I totally agree that ssd cache are a terrible spof. I think that's an
>> option if you use 1 ssd/nvme for 1 or 2 osd, but the cost is then very
>> high. Using 1 ssd for 10 osd increase the risk for almost no gain because
>> the ssd is 10 times faster but has 10 times more access !
>> Indeed, we did some benches with nvme for the wal db (1 nvme for ~10
>> osds), and the gain was not tremendous, so we decided not use them !
>> F.
>> 
>> 
>> Le 08/03/2022 à 11:57, Boris Behrens a écrit :
>> 
>> Hi Francois,
>> 
>> thanks for the reminder. We offline compacted all of the OSDs when we
>> reinstalled the hosts with the new OS.
>> But actually reinstalling them was never on my list.
>> 
>> I could try that and in the same go I can remove all the cache SSDs (when
>> one SSD share the cache for 10 OSDs this is a horrible SPOF) and reuse the
>> SSDs as OSDs for the smaller pools in a RGW (like log and meta).
>> 
>> How long ago did you recreate the earliest OSD?
>> 
>> Cheers
>> Boris
>> 
>> Am Di., 8. März 2022 um 10:03 Uhr schrieb Francois Legrand <
>> f...@lpnhe.in2p3.fr>:
>> 
>>> Hi,
>>> We also had this kind of problems after upgrading to octopus. Maybe you
>>> can play with the hearthbeat grace time (
>>> https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/
>>> ) to tell osds to wait a little more before declaring another osd down !
>>> We also try to fix the problem by manually compact the down osd
>>> (something like : systemctl stop ceph-osd@74; sleep 10;
>>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact;
>>> systemctl start ceph-osd@74).
>>> This worked a few times, but some osd went down again, thus we simply
>>> wait for the datas to be reconstructed elswhere and then reinstall the
>>> dead osd :
>>> ceph osd destroy 74 --yes-i-really-mean-it
>>> ceph-volume lvm zap /dev/sde --destroy
>>> ceph-volume lvm create --osd-id 74 --data /dev/sde
>>> 
>>> This seems to fix the issue for us (up to now).
>>> 
>>> F.
>>> 
>>> Le 08/03/2022 à 09:35, Boris Behrens a écrit :
 Yes, this is something we know and we disabled it, because we ran into
>>> the
 problem that PGs went unavailable when two or more OSDs went offline.
 
 I am searching for the reason WHY this happens.
 Currently we have set the service file to restart=always and removed the
 StartLimitBurst from the service file.
 
 We just don't understand why the OSDs don't answer the heathbeat. The
>>> OSDs
 that are flapping are random in terms of Host, Disksize, having SSD
 block.db or not.
 Network connectivity issues is something that I would rule out, because
>>> the
 cluster went from "nothing ever happens except IOPS" to "random OSDs are
 marked DOWN until they kill themself" with the update from nautilus to
 octopus.
 
 I am out of ideas and hoped this was a bug in 15.2.15, but after the
>>> update
 things got worse (happen more often).
 We tried to:
 * disable swap
 * more swap
 * disable bluefs_buffered_io
 * disable write cache for all disks
 * disable sc

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-22 Thread Boris Behrens
Good morning K,

the "freshly done" host, where it happened last got:
* 21x 8TB TOSHIBA MG06ACA800E (Spinning)
* No block.db devices (just removed the 2 cache SSDs by syncing the disks
out, wiping them and adding them back without block.db)
* 1x Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
* 256GB ECC RAM
* 2x 10GBit Network (802.3ad encap3+4 lacp-fast bonding)

# free -g
              total        used        free      shared  buff/cache   available
Mem:            251          87           2           0         161         162
Swap:            15           0          15

We had this problem with one of the 21 OSDs, but I expect it to happen at
random some time in the future. The cluster has 212 OSDs and 2-3 of them get
marked down at least once per day. Sometimes they get marked down >3 times, so
systemd has to restart the OSD process.
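
A simple way to count these events per OSD is to grep the journal (a sketch,
assuming the OSDs log to the journal):

journalctl -u ceph-osd@48 --since "-24h" | grep -c "wrongly marked me down"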

Cheers
 Boris


Am Di., 22. März 2022 um 07:48 Uhr schrieb Konstantin Shalygin <
k0...@k0ste.ru>:

> Hi,
>
> What is actual hardware (CPU, spinners, NVMe, network)?
> This is HDD with block.db on NVMe?
> How much PG per osd?
> How much obj per PG?
>
>
> k
> Sent from my iPhone
>
> > On 20 Mar 2022, at 19:59, Boris Behrens  wrote:
> >
> > So,
> > I have tried to remove the OSDs, wipe the disks and sync them back in
> > without block.db SSD. (Still in progress, 212 spinning disks take time to
> > out and in again)
> > And I just experienced them same behavior on one OSD on a host where all
> > disks got synced in new. This disk was marked as in yesterday and is
> still
> > backfilling.
> >
> > Right before the OSD get marked as down by other OSDs I observe a ton of
> > these log entries:
> > 2022-03-20T11:54:40.759+ 7ff9eef5d700  1 heartbeat_map is_healthy
> > 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
> > ...
> > 2022-03-20T11:55:02.370+ 7ff9ee75c700  1 heartbeat_map is_healthy
> > 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
> > 2022-03-20T11:55:03.290+ 7ff9d3c8a700  1 heartbeat_map reset_timeout
> > 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
> > ..
> > 2022-03-20T11:55:03.390+ 7ff9df4a1700  0 log_channel(cluster) log
> [WRN]
> > : Monitor daemon marked osd.48 down, but it is still running
> > 2022-03-20T11:55:03.390+ 7ff9df4a1700  0 log_channel(cluster) log
> [DBG]
> > : map e514383 wrongly marked me down at e514383
> > 2022-03-20T11:55:03.390+ 7ff9df4a1700 -1 osd.48 514383
> > _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last
> > 600.00 seconds, shutting down
> >
> > There are 21 OSDs without cache SSD in the host.
> > All disks are attached to a single Broadcom / LSI SAS3008 SAS controller.
> > 256GB ECC-RAM / 40 CPU cores.
> >
> > What else can I do to find the problem?
> >
> >> Am Di., 8. März 2022 um 12:25 Uhr schrieb Francois Legrand <
> >> f...@lpnhe.in2p3.fr>:
> >>
> >> Hi,
> >>
> >> The last 2 osd I recreated were on december 30 and february 8.
> >>
> >> I totally agree that ssd cache are a terrible spof. I think that's an
> >> option if you use 1 ssd/nvme for 1 or 2 osd, but the cost is then very
> >> high. Using 1 ssd for 10 osd increase the risk for almost no gain
> because
> >> the ssd is 10 times faster but has 10 times more access !
> >> Indeed, we did some benches with nvme for the wal db (1 nvme for ~10
> >> osds), and the gain was not tremendous, so we decided not use them !
> >> F.
> >>
> >>
> >> Le 08/03/2022 à 11:57, Boris Behrens a écrit :
> >>
> >> Hi Francois,
> >>
> >> thanks for the reminder. We offline compacted all of the OSDs when we
> >> reinstalled the hosts with the new OS.
> >> But actually reinstalling them was never on my list.
> >>
> >> I could try that and in the same go I can remove all the cache SSDs
> (when
> >> one SSD share the cache for 10 OSDs this is a horrible SPOF) and reuse
> the
> >> SSDs as OSDs for the smaller pools in a RGW (like log and meta).
> >>
> >> How long ago did you recreate the earliest OSD?
> >>
> >> Cheers
> >> Boris
> >>
> >> Am Di., 8. März 2022 um 10:03 Uhr schrieb Francois Legrand <
> >> f...@lpnhe.in2p3.fr>:
> >>
> >>> Hi,
> >>> We also had this kind of problems after upgrading to octopus. Maybe you
> >>> can play with the hearthbeat grace time (
> >>>
> https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/
> >>> ) to tell osds to wait a little more before declaring another osd down
> !
> >>> We also try to fix the problem by manually compact the down osd
> >>> (something like : systemctl stop ceph-osd@74; sleep 10;
> >>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact;
> >>> systemctl start ceph-osd@74).
> >>> This worked a few times, but some osd went down again, thus we simply
> >>> wait for the datas to be reconstructed elswhere and then reinstall the
> >>> dead osd :
> >>> ceph osd destroy 74 --yes-i-really-mean-it
> >>> ceph-volume lvm zap /dev/sde --destroy
> >>> ceph-volume lvm create --osd-id 74 --data /dev/sde
> >>>
> >>> This seems to fix the issue for us (up to now).
> >>>
> 

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-22 Thread Boris Behrens
Norf, I missed half of the answers...
* the 8TB disks hold around 80-90 PGs (the 16TB disks around 160-180)
* per PG we've around 40k objects; 170M objects in 1.2 PiB of storage in total


Am Di., 22. März 2022 um 09:29 Uhr schrieb Boris Behrens :

> Good morning K,
>
> the "freshly done" host, where it happened last got:
> * 21x 8TB TOSHIBA MG06ACA800E (Spinning)
> * No block.db devices (just removed the 2 cache SSDs by syncing the disks
> out, wiping them and adding them back without block.db)
> * 1x Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
> * 256GB ECC RAM
> * 2x 10GBit Network (802.3ad encap3+4 lacp-fast bonding)
>
> # free -g
>   totalusedfree  shared  buff/cache
> available
> Mem:251  87   2   0 161
>   162
> Swap:15   0  15
>
> We had this problem with one of the 21 OSDs, but I expect it to happen
> random some time in the future. Cluster got 212 OSDs and 2-3 of them get at
> least marked down once per day. Sometime they get marked down >3 times, so
> systemd hast to restart the OSD process.
>
> Cheers
>  Boris
>
>
> Am Di., 22. März 2022 um 07:48 Uhr schrieb Konstantin Shalygin <
> k0...@k0ste.ru>:
>
>> Hi,
>>
>> What is actual hardware (CPU, spinners, NVMe, network)?
>> This is HDD with block.db on NVMe?
>> How much PG per osd?
>> How much obj per PG?
>>
>>
>> k
>> Sent from my iPhone
>>
>> > On 20 Mar 2022, at 19:59, Boris Behrens  wrote:
>> >
>> > So,
>> > I have tried to remove the OSDs, wipe the disks and sync them back in
>> > without block.db SSD. (Still in progress, 212 spinning disks take time
>> to
>> > out and in again)
>> > And I just experienced them same behavior on one OSD on a host where all
>> > disks got synced in new. This disk was marked as in yesterday and is
>> still
>> > backfilling.
>> >
>> > Right before the OSD get marked as down by other OSDs I observe a ton of
>> > these log entries:
>> > 2022-03-20T11:54:40.759+ 7ff9eef5d700  1 heartbeat_map is_healthy
>> > 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
>> > ...
>> > 2022-03-20T11:55:02.370+ 7ff9ee75c700  1 heartbeat_map is_healthy
>> > 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
>> > 2022-03-20T11:55:03.290+ 7ff9d3c8a700  1 heartbeat_map reset_timeout
>> > 'OSD::osd_op_tp thread 0x7ff9d3c8a700' had timed out after 15
>> > ..
>> > 2022-03-20T11:55:03.390+ 7ff9df4a1700  0 log_channel(cluster) log
>> [WRN]
>> > : Monitor daemon marked osd.48 down, but it is still running
>> > 2022-03-20T11:55:03.390+ 7ff9df4a1700  0 log_channel(cluster) log
>> [DBG]
>> > : map e514383 wrongly marked me down at e514383
>> > 2022-03-20T11:55:03.390+ 7ff9df4a1700 -1 osd.48 514383
>> > _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last
>> > 600.00 seconds, shutting down
>> >
>> > There are 21 OSDs without cache SSD in the host.
>> > All disks are attached to a single Broadcom / LSI SAS3008 SAS
>> controller.
>> > 256GB ECC-RAM / 40 CPU cores.
>> >
>> > What else can I do to find the problem?
>> >
>> >> Am Di., 8. März 2022 um 12:25 Uhr schrieb Francois Legrand <
>> >> f...@lpnhe.in2p3.fr>:
>> >>
>> >> Hi,
>> >>
>> >> The last 2 osd I recreated were on december 30 and february 8.
>> >>
>> >> I totally agree that ssd cache are a terrible spof. I think that's an
>> >> option if you use 1 ssd/nvme for 1 or 2 osd, but the cost is then very
>> >> high. Using 1 ssd for 10 osd increase the risk for almost no gain
>> because
>> >> the ssd is 10 times faster but has 10 times more access !
>> >> Indeed, we did some benches with nvme for the wal db (1 nvme for ~10
>> >> osds), and the gain was not tremendous, so we decided not use them !
>> >> F.
>> >>
>> >>
>> >> Le 08/03/2022 à 11:57, Boris Behrens a écrit :
>> >>
>> >> Hi Francois,
>> >>
>> >> thanks for the reminder. We offline compacted all of the OSDs when we
>> >> reinstalled the hosts with the new OS.
>> >> But actually reinstalling them was never on my list.
>> >>
>> >> I could try that and in the same go I can remove all the cache SSDs
>> (when
>> >> one SSD share the cache for 10 OSDs this is a horrible SPOF) and reuse
>> the
>> >> SSDs as OSDs for the smaller pools in a RGW (like log and meta).
>> >>
>> >> How long ago did you recreate the earliest OSD?
>> >>
>> >> Cheers
>> >> Boris
>> >>
>> >> Am Di., 8. März 2022 um 10:03 Uhr schrieb Francois Legrand <
>> >> f...@lpnhe.in2p3.fr>:
>> >>
>> >>> Hi,
>> >>> We also had this kind of problems after upgrading to octopus. Maybe
>> you
>> >>> can play with the hearthbeat grace time (
>> >>>
>> https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/
>> >>> ) to tell osds to wait a little more before declaring another osd
>> down !
>> >>> We also try to fix the problem by manually compact the down osd
>> >>> (something like : systemctl stop ceph-osd@74; sleep 10;
>> >>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-74 compact;
>> >>> systemctl start ceph-osd@74).
>

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-22 Thread Konstantin Shalygin
180 PGs per OSD is usually too much overhead; also, 40k objects per PG is not much, but I don't 
think this will work without block.db on NVMe. I think your "wrongly marked down" 
events happen at the time of RocksDB compaction. With default log settings you can try to 
grep for 'latency' strings
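
For example (a sketch; the path depends on your logging setup):

grep -i 'latency' /var/log/ceph/ceph-osd.*.log
# or, if the logs go to the journal:
journalctl -u ceph-osd@48 | grep -i 'latency'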

Also, https://tracker.ceph.com/issues/50297


k
Sent from my iPhone

> On 22 Mar 2022, at 14:29, Boris Behrens  wrote:
> * the 8TB disks hold around 80-90 PGs (16TB around 160-180)
> * per PG we've around 40k objects 170m objects in 1.2PiB of storage


[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-22 Thread Boris Behrens
The number of 180 PGs is because of the 16TB disks. 3/4 of our OSDs had cache
SSDs (not NVMe though, and most of them are 10 OSDs per SSD), but this problem
only came in with Octopus.

We also thought this might be the db compactation, but it doesn't match up.
It might happen when the compactation run, but it looks also that it
happens, when there are other operations like table_file_deletion
and it happens on OSDs that have SSD backed block.db devices (like 5 OSDs
share one SAMSUNG MZ7KM1T9HAJM-5 and the IOPS/throughput on the SSD is
not huge (100IOPS r/s 300IOPS w/s when compacting an OSD on it, and around
50mb/s r/w throughput)

I also can not reproduce it via "ceph tell osd.NN compact", so I am not
100% sure it is the compactation.
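
One way to narrow it down (a sketch, not verified on this cluster; osd.161 is
a placeholder) is to put the compaction events and the heartbeat timeouts side
by side:

# cumulative RocksDB compaction counters via the admin socket
ceph daemon osd.161 perf dump rocksdb | grep -i compact
# compaction start/finish events in the OSD log
grep -E 'compaction_started|compaction_finished' /var/log/ceph/ceph-osd.161.log
# heartbeat timeouts around the same timestamps
grep 'had timed out' /var/log/ceph/ceph-osd.161.log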

What do you mean by "grep 'latency' strings"?

Cheers
 Boris

Am Di., 22. März 2022 um 15:53 Uhr schrieb Konstantin Shalygin <
k0...@k0ste.ru>:

> 180PG per OSD is usually overhead, also 40k obj per PG is not much, but I
> don't think this will works without block.db NVMe. I think your "wrong out
> marks" evulate in time of rocksdb compaction. With default log settings you
> can try to grep 'latency' strings
>
> Also, https://tracker.ceph.com/issues/50297
>
>
> k
> Sent from my iPhone
>
> On 22 Mar 2022, at 14:29, Boris Behrens  wrote:
>
> * the 8TB disks hold around 80-90 PGs (16TB around 160-180)
> * per PG we've around 40k objects 170m objects in 1.2PiB of storage
>
>

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-22 Thread Boris Behrens
Good morning Istvan,
those are rotating disks and we don't use EC. Splitting the 16TB disks
into two 8TB partitions and running two OSDs on one disk also sounds
interesting, but would it solve the problem?

I also thought about raising the PGs for the data pool from 4096 to 8192, but
I am not sure whether this will solve the problem or make it worse.
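
The change itself would just be something like this (pool name is an
assumption here):

# raise the PG count for the RGW data pool; pgp_num can be raised alongside it
ceph osd pool set default.rgw.buckets.data pg_num 8192
ceph osd pool set default.rgw.buckets.data pgp_num 8192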

So far, nothing I've tried has worked.

Am Mi., 23. März 2022 um 05:10 Uhr schrieb Szabo, Istvan (Agoda) <
istvan.sz...@agoda.com>:

> Hi,
>
> I think you are having similar issue as me in the past.
>
> I have 1.6B objects on a cluster average 40k and all my osd had spilled
> over.
>
> Also slow ops, wrongly marked down…
>
> My osds are 15.3TB ssds, so my solution was to store block+db together on
> the ssds, put 4 osd/ssd and go up to 100pg/osd so 1 disk holds 400pg approx.
> Also turned on balancer with upmap and max deviation 1.
>
> I’m using ec 4:2, let’s see how long it lasts. My bottleneck is always the
> pg number, too small pg number for too many objects.
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> On 2022. Mar 22., at 23:34, Boris Behrens  wrote:
>
> Email received from the internet. If in doubt, don't click any link nor
> open any attachment !
> 
>
> The number 180 PGs is because of the 16TB disks. 3/4 of our OSDs had cache
> SSDs (not nvme though and most of them are 10OSDs one SSD) but this problem
> only came in with octopus.
>
> We also thought this might be the db compactation, but it doesn't match up.
> It might happen when the compactation run, but it looks also that it
> happens, when there are other operations like table_file_deletion
> and it happens on OSDs that have SSD backed block.db devices (like 5 OSDs
> share one SAMSUNG MZ7KM1T9HAJM-5 and the IOPS/throughput on the SSD is
> not huge (100IOPS r/s 300IOPS w/s when compacting an OSD on it, and around
> 50mb/s r/w throughput)
>
> I also can not reproduce it via "ceph tell osd.NN compact", so I am not
> 100% sure it is the compactation.
>
> What do you mean with "grep for latency string"?
>
> Cheers
> Boris
>
> Am Di., 22. März 2022 um 15:53 Uhr schrieb Konstantin Shalygin <
> k0...@k0ste.ru>:
>
> 180PG per OSD is usually overhead, also 40k obj per PG is not much, but I
>
> don't think this will works without block.db NVMe. I think your "wrong out
>
> marks" evulate in time of rocksdb compaction. With default log settings you
>
> can try to grep 'latency' strings
>
>
> Also, https://tracker.ceph.com/issues/50297
>
>
>
> k
>
> Sent from my iPhone
>
>
> On 22 Mar 2022, at 14:29, Boris Behrens  wrote:
>
>
> * the 8TB disks hold around 80-90 PGs (16TB around 160-180)
>
> * per PG we've around 40k objects 170m objects in 1.2PiB of storage
>
>
>
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> --
> This message is confidential and is for the sole use of the intended
> recipient(s). It may also be privileged or otherwise protected by copyright
> or other legal rules. If you have received it by mistake please let us know
> by reply email and delete it from your system. It is prohibited to copy
> this message or disclose its content to anyone. Any confidentiality or
> privilege is not waived or lost by any mistaken delivery or unauthorized
> disclosure of the message. All messages sent to and from Agoda may be
> monitored to ensure compliance with company policies, to protect the
> company's interests and to remove potential malware. Electronic messages
> may be intercepted, amended, lost or deleted, or contain viruses.
>


-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-23 Thread Boris Behrens
You mean in the OSD logfiles?
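
Per-op step timing is also visible on the OSD admin socket, e.g. (osd.161 as a
placeholder):

# recently completed ops with a per-event timeline (queued_for_pg, reached_pg, started, done, ...)
ceph daemon osd.161 dump_historic_ops
# ops currently in flight, useful while an OSD is being wrongly marked down
ceph daemon osd.161 dump_ops_in_flight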

Am Mi., 23. März 2022 um 08:23 Uhr schrieb Szabo, Istvan (Agoda) <
istvan.sz...@agoda.com>:

> Can you see in the pg dump like waiting for reading or something like
> this? In each step how much time it spends?
>
>
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
>
>
> *From:* Boris Behrens 
> *Sent:* Wednesday, March 23, 2022 1:29 PM
> *To:* Szabo, Istvan (Agoda) 
> *Cc:* ceph-users@ceph.io
> *Subject:* Re: [ceph-users] Re: octopus (15.2.16) OSDs crash or don't
> answer heathbeats (and get marked as down)
>
>
>
> Email received from the internet. If in doubt, don't click any link nor
> open any attachment !
> --
>
> Good morning Istvan,
>
> those are rotating disks and we don't use EC. Splitting up the 16TB disks
> into two 8TB partitions and have two OSDs on one disk also sounds
> interesting, but would it solve the problem?
>
>
>
> I also thought to adjust the PGs for the data pool from 4096 to 8192. But
> I am not sure if this will solve the problem or make it worse.
>
>
>
> Until now, everything I've tried didn't work.
>
>
>
> Am Mi., 23. März 2022 um 05:10 Uhr schrieb Szabo, Istvan (Agoda) <
> istvan.sz...@agoda.com>:
>
> Hi,
>
>
>
> I think you are having similar issue as me in the past.
>
>
>
> I have 1.6B objects on a cluster average 40k and all my osd had spilled
> over.
>
>
>
> Also slow ops, wrongly marked down…
>
>
>
> My osds are 15.3TB ssds, so my solution was to store block+db together on
> the ssds, put 4 osd/ssd and go up to 100pg/osd so 1 disk holds 400pg approx.
>
> Also turned on balancer with upmap and max deviation 1.
>
>
>
> I’m using ec 4:2, let’s see how long it lasts. My bottleneck is always the
> pg number, too small pg number for too many objects.
>
>
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
>
>
> On 2022. Mar 22., at 23:34, Boris Behrens  wrote:
>
> Email received from the internet. If in doubt, don't click any link nor
> open any attachment !
> 
>
> The number 180 PGs is because of the 16TB disks. 3/4 of our OSDs had cache
> SSDs (not nvme though and most of them are 10OSDs one SSD) but this problem
> only came in with octopus.
>
> We also thought this might be the db compactation, but it doesn't match up.
> It might happen when the compactation run, but it looks also that it
> happens, when there are other operations like table_file_deletion
> and it happens on OSDs that have SSD backed block.db devices (like 5 OSDs
> share one SAMSUNG MZ7KM1T9HAJM-5 and the IOPS/throughput on the SSD is
> not huge (100IOPS r/s 300IOPS w/s when compacting an OSD on it, and around
> 50mb/s r/w throughput)
>
> I also can not reproduce it via "ceph tell osd.NN compact", so I am not
> 100% sure it is the compactation.
>
> What do you mean with "grep for latency string"?
>
> Cheers
> Boris
>
> Am Di., 22. März 2022 um 15:53 Uhr schrieb Konstantin Shalygin <
> k0...@k0ste.ru>:
>
>
> 180PG per OSD is usually overhead, also 40k obj per PG is not much, but I
>
> don't think this will works without block.db NVMe. I think your "wrong out
>
> marks" evulate in time of rocksdb compaction. With default log settings you
>
> can try to grep 'latency' strings
>
>
>
> Also, https://tracker.ceph.com/issues/50297
>
>
>
>
>
> k
>
> Sent from my iPhone
>
>
>
> On 22 Mar 2022, at 14:29, Boris Behrens  wrote:
>
>
>
> * the 8TB disks hold around 80-90 PGs (16TB around 160-180)
>
> * per PG we've around 40k objects 170m objects in 1.2PiB of storage
>
>
>
>
>
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> --
> This message is confidential and is for the sole use of the intended
> recipient(s). It may also be privileged or otherwise protected by copyright
> or other legal rules. If you have received it by mistake please let us know
> by reply email and d

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-23 Thread Igor Fedotov

Hi Boris,

Curious whether you tried to compact RocksDB for all your OSDs? Sorry if this
has already been discussed, I haven't read through the whole thread...


From my experience the symptoms you're facing are pretty common for DB 
performance degradation caused by bulk data removal. In that case OSDs 
start to flap due to suicide timeout as some regular user ops take ages 
to complete.


The issue has been discussed in this list multiple times.
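
For reference, an online compaction of every OSD can be kicked off roughly
like this (a sketch; on a busy cluster it is safer to go host by host):

# compact the RocksDB of each OSD, one at a time
for id in $(ceph osd ls); do
    ceph tell osd.$id compact
done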

Thanks,

Igor

On 3/8/2022 12:36 AM, Boris Behrens wrote:

Hi,

we've had the problem with OSDs marked as offline since we updated to
octopus and hope the problem would be fixed with the latest patch. We have
this kind of problem only with octopus and there only with the big s3
cluster.
* Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
* Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
* We only use the frontend network.
* All disks are spinning, some have block.db devices.
* All disks are bluestore
* configs are mostly defaults
* we've set the OSDs to restart=always without a limit, because we had the
problem with unavailable PGs when two OSDs that share PGs are marked as
offline.

But since we installed the latest patch we are experiencing more OSD downs
and even crashes.
I tried to remove as much duplicated lines as possible.

Is the numa error a problem?
Why do OSD daemons not respond to heartbeats? I mean, even when the disk is
totally loaded with IO, the system itself should still answer heartbeats, or
am I missing something?
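
For context, the timeouts in the crash below come from the OSD's internal
heartbeat_map (worker threads failing to check in), so a single stuck
tp_osd_tp thread is enough to trip them. A sketch for pulling the relevant
lines out of the journal (unit name assumed):

# internal thread heartbeat timeouts and the abort that follows
journalctl -u ceph-osd@161 | grep -E 'heartbeat_map|had timed out|Caught signal'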

I really hope some of you could send me on the correct way to solve this
nasty problem.

This is how the latest crash looks like
Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+
7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify public
interface '' numa node: (2) No such file or directory
...
Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+
7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public
interface '' numa node: (2) No such file or directory
Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) **
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
thread_name:tp_osd_tp
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
(d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
[0x7f5f0d45ef08]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
unsigned long)+0x471) [0x55a699a01201]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
(ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
long, unsigned long)+0x8e) [0x55a699a0199e]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
[0x55a699a224b0]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43) [0x7f5f0cfc0163]
Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2022-03-07T17:53:07.387+
7f5ef1501700 -1 *** Caught signal (Aborted) **
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
thread_name:tp_osd_tp
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
(d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
[0x7f5f0d45ef08]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
unsigned long)+0x471) [0x55a699a01201]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
(ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
long, unsigned long)+0x8e) [0x55a699a0199e]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
[0x55a699a224b0]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43) [0x7f5f0cfc0163]
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  NOTE: a copy of the executable, or
`objdump -rdS ` is needed to interpret this.
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  -5246> 2022-03-07T17:49:07.678+
7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public
interface '' numa node: (2) No such file or directory
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  0> 2022-03-07T17:53:07.387+
7f5ef1501700 -1 *** Caught signal (Aborted) **
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
thread_name:tp_osd_tp
Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
(d46a73d6d0a67a79558054a3a5a72cb561724974) octopu

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-23 Thread Boris Behrens
Hi Igor,
yes, I've compacted them all.

So is there a solution to the problem? I can imagine this happens when we
remove large files from S3 (we use it as backup storage for lz4-compressed
rbd exports).
Maybe I missed it.
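
If the flapping really does line up with bulk S3 deletes, one knob that might
be worth a try (an assumption on my side, not something confirmed in this
thread; the value is illustrative) is limiting how hard each RGW garbage
collection cycle hits the OSDs:

# default is 10 concurrent GC IOs
ceph config set global rgw_gc_max_concurrent_io 5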

Cheers
 Boris

Am Mi., 23. März 2022 um 13:43 Uhr schrieb Igor Fedotov <
igor.fedo...@croit.io>:

> Hi Boris,
>
> Curious if you tried to compact RocksdDB for all your OSDs? Sorry I this
> has been already discussed, haven't read through all the thread...
>
>  From my experience the symptoms you're facing are pretty common for DB
> performance degradation caused by bulk data removal. In that case OSDs
> start to flap due to suicide timeout as some regular user ops take ages
> to complete.
>
> The issue has been discussed in this list multiple times.
>
> Thanks,
>
> Igor
>
> On 3/8/2022 12:36 AM, Boris Behrens wrote:
> > Hi,
> >
> > we've had the problem with OSDs marked as offline since we updated to
> > octopus and hope the problem would be fixed with the latest patch. We
> have
> > this kind of problem only with octopus and there only with the big s3
> > cluster.
> > * Hosts are all Ubuntu 20,04 and we've set the txqueuelen to 10k
> > * Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
> > * We only use the frontend network.
> > * All disks are spinning, some have block.db devices.
> > * All disks are bluestore
> > * configs are mostly defaults
> > * we've set the OSDs to restart=always without a limit, because we had
> the
> > problem with unavailable PGs when two OSDs are marked as offline and the
> > share PGs.
> >
> > But since we installed the latest patch we are experiencing more OSD
> downs
> > and even crashes.
> > I tried to remove as much duplicated lines as possible.
> >
> > Is the numa error a problem?
> > Why do OSD daemons not respond to hearthbeats? I mean even when the disk
> is
> > totally loaded with IO, the system itself should answer heathbeats, or
> am I
> > missing something?
> >
> > I really hope some of you could send me on the correct way to solve this
> > nasty problem.
> >
> > This is how the latest crash looks like
> > Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+
> > 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify
> public
> > interface '' numa node: (2) No such file or directory
> > ...
> > Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+
> > 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify
> public
> > interface '' numa node: (2) No such file or directory
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) **
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
> > thread_name:tp_osd_tp
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
> > (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
> > [0x7f5f0d45ef08]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
> > (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
> > unsigned long)+0x471) [0x55a699a01201]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
> > (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
> > long, unsigned long)+0x8e) [0x55a699a0199e]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
> > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
> > [0x55a699a224b0]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
> > (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43)
> [0x7f5f0cfc0163]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2022-03-07T17:53:07.387+
> > 7f5ef1501700 -1 *** Caught signal (Aborted) **
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
> > thread_name:tp_osd_tp
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
> > (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
> > [0x7f5f0d45ef08]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
> > (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
> > unsigned long)+0x471) [0x55a699a01201]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
> > (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
> > long, unsigned long)+0x8e) [0x55a699a0199e]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
> > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
> > [0x55a699a224b0]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
> > (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-23 Thread Igor Fedotov
Unfortunately there is no silver bullet here so far. Just one note after 
looking at your configuration - I would strongly encourage you to add 
SSD disks for spinner-only OSDs.


Particularly when they are used for an S3 payload, which is pretty
DB-intensive.
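
When an OSD is rebuilt anyway, the DB device can be given at creation time,
e.g. (device paths are placeholders; roughly 30-60 GB of DB per OSD is a
common rule of thumb for RGW-heavy workloads):

# BlueStore OSD with data on the spinner and RocksDB/WAL on a flash partition
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p1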



Thanks,

Igor

On 3/23/2022 5:03 PM, Boris Behrens wrote:

Hi Igor,
yes, I've compacted them all.

So is there a solution for the problem, because I can imagine this 
happens when we remove large files from s3 (we use it as backup 
storage for lz4 compressed rbd exports).

Maybe I missed it.

Cheers
 Boris

Am Mi., 23. März 2022 um 13:43 Uhr schrieb Igor Fedotov 
:


Hi Boris,

Curious if you tried to compact RocksdDB for all your OSDs? Sorry
I this
has been already discussed, haven't read through all the thread...

 From my experience the symptoms you're facing are pretty common
for DB
performance degradation caused by bulk data removal. In that case
OSDs
start to flap due to suicide timeout as some regular user ops take
ages
to complete.

The issue has been discussed in this list multiple times.

Thanks,

Igor

On 3/8/2022 12:36 AM, Boris Behrens wrote:
> Hi,
>
> we've had the problem with OSDs marked as offline since we
updated to
> octopus and hope the problem would be fixed with the latest
patch. We have
> this kind of problem only with octopus and there only with the
big s3
> cluster.
> * Hosts are all Ubuntu 20,04 and we've set the txqueuelen to 10k
> * Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
> * We only use the frontend network.
> * All disks are spinning, some have block.db devices.
> * All disks are bluestore
> * configs are mostly defaults
> * we've set the OSDs to restart=always without a limit, because
we had the
> problem with unavailable PGs when two OSDs are marked as offline
and the
> share PGs.
>
> But since we installed the latest patch we are experiencing more
OSD downs
> and even crashes.
> I tried to remove as much duplicated lines as possible.
>
> Is the numa error a problem?
> Why do OSD daemons not respond to hearthbeats? I mean even when
the disk is
> totally loaded with IO, the system itself should answer
heathbeats, or am I
> missing something?
>
> I really hope some of you could send me on the correct way to
solve this
> nasty problem.
>
> This is how the latest crash looks like
> Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+
> 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to
identify public
> interface '' numa node: (2) No such file or directory
> ...
> Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+
> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to
identify public
> interface '' numa node: (2) No such file or directory
> Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal
(Aborted) **
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
> thread_name:tp_osd_tp
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0)
[0x7f5f0d4623c0]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
> [0x7f5f0d45ef08]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*,
char const*,
> unsigned long)+0x471) [0x55a699a01201]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*,
unsigned
> long, unsigned long)+0x8e) [0x55a699a0199e]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
> [0x55a699a224b0]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
> (ShardedThreadPool::WorkThreadSharded::entry()+0x14)
[0x55a699a252c4]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609)
[0x7f5f0d456609]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43)
[0x7f5f0cfc0163]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2022-03-07T17:53:07.387+
> 7f5ef1501700 -1 *** Caught signal (Aborted) **
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
> thread_name:tp_osd_tp
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0)
[0x7f5f0d4623c0]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
> [0x7f5f0d45ef08]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*,
char const*,
> unsigned long)+0x471) [0x55a699a01201]
> Mar 07 1

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-23 Thread Boris Behrens
What should go to the SSDs? We don't have enough slots for a 3:1 ratio for
block.db. Most of the block.db SSDs served 10 OSDs and were mostly idling, so
we are now removing them, as we haven't seen any benefit. (But maybe I am just
blind and ignorant and simply don't see it.)
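
Whether a DB device is actually being used (or has spilled over to the
spinner) can be checked per OSD, e.g. (osd.161 as a placeholder):

# db_used_bytes vs. slow_used_bytes shows how much of RocksDB sits on the fast device
ceph daemon osd.161 perf dump bluefs | grep -E 'db_total_bytes|db_used_bytes|slow_used_bytes'
# cluster-wide warning if RocksDB has spilled over onto the slow device
ceph health detail | grep -i spillover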

Am Mi., 23. März 2022 um 15:17 Uhr schrieb Igor Fedotov <
igor.fedo...@croit.io>:

> Unfortunately there is no silver bullet here so far. Just one note after
> looking at your configuration - I would strongly encourage you to add SSD
> disks for spinner-only OSDs.
>
> Particularly when they are used for s3 payload which is pretty DB
> intensive.
>
>
> Thanks,
>
> Igor
> On 3/23/2022 5:03 PM, Boris Behrens wrote:
>
> Hi Igor,
> yes, I've compacted them all.
>
> So is there a solution for the problem, because I can imagine this happens
> when we remove large files from s3 (we use it as backup storage for lz4
> compressed rbd exports).
> Maybe I missed it.
>
> Cheers
>  Boris
>
> Am Mi., 23. März 2022 um 13:43 Uhr schrieb Igor Fedotov <
> igor.fedo...@croit.io>:
>
>> Hi Boris,
>>
>> Curious if you tried to compact RocksdDB for all your OSDs? Sorry I this
>> has been already discussed, haven't read through all the thread...
>>
>>  From my experience the symptoms you're facing are pretty common for DB
>> performance degradation caused by bulk data removal. In that case OSDs
>> start to flap due to suicide timeout as some regular user ops take ages
>> to complete.
>>
>> The issue has been discussed in this list multiple times.
>>
>> Thanks,
>>
>> Igor
>>
>> On 3/8/2022 12:36 AM, Boris Behrens wrote:
>> > Hi,
>> >
>> > we've had the problem with OSDs marked as offline since we updated to
>> > octopus and hope the problem would be fixed with the latest patch. We
>> have
>> > this kind of problem only with octopus and there only with the big s3
>> > cluster.
>> > * Hosts are all Ubuntu 20,04 and we've set the txqueuelen to 10k
>> > * Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
>> > * We only use the frontend network.
>> > * All disks are spinning, some have block.db devices.
>> > * All disks are bluestore
>> > * configs are mostly defaults
>> > * we've set the OSDs to restart=always without a limit, because we had
>> the
>> > problem with unavailable PGs when two OSDs are marked as offline and the
>> > share PGs.
>> >
>> > But since we installed the latest patch we are experiencing more OSD
>> downs
>> > and even crashes.
>> > I tried to remove as much duplicated lines as possible.
>> >
>> > Is the numa error a problem?
>> > Why do OSD daemons not respond to hearthbeats? I mean even when the
>> disk is
>> > totally loaded with IO, the system itself should answer heathbeats, or
>> am I
>> > missing something?
>> >
>> > I really hope some of you could send me on the correct way to solve this
>> > nasty problem.
>> >
>> > This is how the latest crash looks like
>> > Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+
>> > 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify
>> public
>> > interface '' numa node: (2) No such file or directory
>> > ...
>> > Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+
>> > 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify
>> public
>> > interface '' numa node: (2) No such file or directory
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) **
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
>> > thread_name:tp_osd_tp
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
>> > (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
>> > [0x7f5f0d45ef08]
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
>> > (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
>> const*,
>> > unsigned long)+0x471) [0x55a699a01201]
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
>> > (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
>> > long, unsigned long)+0x8e) [0x55a699a0199e]
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
>> > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
>> > [0x55a699a224b0]
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
>> > (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43)
>> [0x7f5f0cfc0163]
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2022-03-07T17:53:07.387+
>> > 7f5ef1501700 -1 *** Caught signal (Aborted) **
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
>> > thread_name:tp_osd_tp
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
>> > (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
>> > Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-23 Thread Christian Wuerdig
I would not host multiple OSDs on a spinning drive (unless it's one of those
Seagate MACH.2 drives that have two independent heads) - head seek time
will most likely kill performance. The main reason to host multiple OSDs on
a single SSD or NVMe is typically to make use of the large IOPS capacity,
which Ceph can struggle to fully utilize on a single drive. With spinners
you usually don't have that "problem" (quite the opposite, usually).
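
For completeness, that multi-OSD-per-flash-device layout is normally set up at
provisioning time, e.g. (a sketch with a placeholder device):

# split one NVMe into four OSDs to make better use of its IOPS headroom
ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1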

On Wed, 23 Mar 2022 at 19:29, Boris Behrens  wrote:

> Good morning Istvan,
> those are rotating disks and we don't use EC. Splitting up the 16TB disks
> into two 8TB partitions and have two OSDs on one disk also sounds
> interesting, but would it solve the problem?
>
> I also thought to adjust the PGs for the data pool from 4096 to 8192. But I
> am not sure if this will solve the problem or make it worse.
>
> Until now, everything I've tried didn't work.
>
> Am Mi., 23. März 2022 um 05:10 Uhr schrieb Szabo, Istvan (Agoda) <
> istvan.sz...@agoda.com>:
>
> > Hi,
> >
> > I think you are having similar issue as me in the past.
> >
> > I have 1.6B objects on a cluster average 40k and all my osd had spilled
> > over.
> >
> > Also slow ops, wrongly marked down…
> >
> > My osds are 15.3TB ssds, so my solution was to store block+db together on
> > the ssds, put 4 osd/ssd and go up to 100pg/osd so 1 disk holds 400pg
> approx.
> > Also turned on balancer with upmap and max deviation 1.
> >
> > I’m using ec 4:2, let’s see how long it lasts. My bottleneck is always
> the
> > pg number, too small pg number for too many objects.
> >
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---
> > Agoda Services Co., Ltd.
> > e: istvan.sz...@agoda.com
> > ---
> >
> > On 2022. Mar 22., at 23:34, Boris Behrens  wrote:
> >
> > Email received from the internet. If in doubt, don't click any link nor
> > open any attachment !
> > 
> >
> > The number 180 PGs is because of the 16TB disks. 3/4 of our OSDs had
> cache
> > SSDs (not nvme though and most of them are 10OSDs one SSD) but this
> problem
> > only came in with octopus.
> >
> > We also thought this might be the db compactation, but it doesn't match
> up.
> > It might happen when the compactation run, but it looks also that it
> > happens, when there are other operations like table_file_deletion
> > and it happens on OSDs that have SSD backed block.db devices (like 5 OSDs
> > share one SAMSUNG MZ7KM1T9HAJM-5 and the IOPS/throughput on the SSD
> is
> > not huge (100IOPS r/s 300IOPS w/s when compacting an OSD on it, and
> around
> > 50mb/s r/w throughput)
> >
> > I also can not reproduce it via "ceph tell osd.NN compact", so I am not
> > 100% sure it is the compactation.
> >
> > What do you mean with "grep for latency string"?
> >
> > Cheers
> > Boris
> >
> > Am Di., 22. März 2022 um 15:53 Uhr schrieb Konstantin Shalygin <
> > k0...@k0ste.ru>:
> >
> > 180PG per OSD is usually overhead, also 40k obj per PG is not much, but I
> >
> > don't think this will works without block.db NVMe. I think your "wrong
> out
> >
> > marks" evulate in time of rocksdb compaction. With default log settings
> you
> >
> > can try to grep 'latency' strings
> >
> >
> > Also, https://tracker.ceph.com/issues/50297
> >
> >
> >
> > k
> >
> > Sent from my iPhone
> >
> >
> > On 22 Mar 2022, at 14:29, Boris Behrens  wrote:
> >
> >
> > * the 8TB disks hold around 80-90 PGs (16TB around 160-180)
> >
> > * per PG we've around 40k objects 170m objects in 1.2PiB of storage
> >
> >
> >
> >
> > --
> > Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> > groüen Saal.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> > --
> > This message is confidential and is for the sole use of the intended
> > recipient(s). It may also be privileged or otherwise protected by
> copyright
> > or other legal rules. If you have received it by mistake please let us
> know
> > by reply email and delete it from your system. It is prohibited to copy
> > this message or disclose its content to anyone. Any confidentiality or
> > privilege is not waived or lost by any mistaken delivery or unauthorized
> > disclosure of the message. All messages sent to and from Agoda may be
> > monitored to ensure compliance with company policies, to protect the
> > company's interests and to remove potential malware. Electronic messages
> > may be intercepted, amended, lost or deleted, or contain viruses.
> >
>
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-user

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-04-26 Thread Boris Behrens
So,
I just checked the logs on one of our smaller clusters and it looks like
this error happened twice last week.

The cluster contains 12x8TB OSDs without any SSDs as cache, and it started
on Octopus (so no upgrade from Nautilus was performed).

root@3cecef08a104:~# zgrep -i marked /var/log/ceph/ceph*
/var/log/ceph/ceph.log.1.gz:2022-04-25T07:05:38.767452+
mon.3cecef5afb05 (mon.0) 390021 : cluster [INF] osd.6 marked itself dead as
of e4011
/var/log/ceph/ceph.log.1.gz:2022-04-25T07:05:38.764813+ osd.6 (osd.6)
3312 : cluster [WRN] Monitor daemon marked osd.6 down, but it is still
running
/var/log/ceph/ceph.log.1.gz:2022-04-25T07:05:38.764827+ osd.6 (osd.6)
3313 : cluster [DBG] map e4011 wrongly marked me down at e4010
/var/log/ceph/ceph.log.2.gz:2022-04-24T06:54:53.726084+ osd.6 (osd.6)
3227 : cluster [WRN] Monitor daemon marked osd.6 down, but it is still
running
/var/log/ceph/ceph.log.2.gz:2022-04-24T06:54:53.726098+ osd.6 (osd.6)
3228 : cluster [DBG] map e3995 wrongly marked me down at e3994
/var/log/ceph/ceph.log.2.gz:2022-04-24T06:54:53.729151+
mon.3cecef5afb05 (mon.0) 382918 : cluster [INF] osd.6 marked itself dead as
of e3995

Checking the log of said OSD shows that this happened after the
compaction began:

2022-04-24T06:54:24.341+ 7f7c6e152700  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1650783264345276, "job": 1609, "event":
"table_file_deletion", "file_number": 10916}
2022-04-24T06:54:24.341+ 7f7c6e152700  4 rocksdb:
[db/compaction_job.cc:1642] [default] [JOB 1610] Compacting 1@1 + 4@2 files
to L2, score 1.76
2022-04-24T06:54:24.341+ 7f7c6e152700  4 rocksdb:
[db/compaction_job.cc:1648] [default] Compaction start summary: Base
version 1609 Base level 1, inputs: [10975(43MB)], [10930(67MB) 10931(672KB)
10862(51MB) 10843(67MB)]
2022-04-24T06:54:24.341+ 7f7c6e152700  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1650783264345617, "job": 1610, "event":
"compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L1":
[10975], "files_L2": [10930, 10931, 10862, 10843], "score": 1.76282,
"input_data_size": 241028245}
2022-04-24T06:54:25.609+ 7f7c6e152700  4 rocksdb:
[db/compaction_job.cc:1327] [default] [JOB 1610] Generated table #10986:
226423 keys, 70464004 bytes
2022-04-24T06:54:25.609+ 7f7c6e152700  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1650783265611705, "cf_name": "default", "job": 1610,
"event": "table_file_creation", "file_number": 10986, "file_size":
70464004, "table_properties": {"data_size": 67109810, "index_size":
2787246, "filter_size": 566085, "raw_key_size": 42830142,
"raw_average_key_size": 189, "raw_value_size": 55015026,
"raw_average_value_size": 242, "num_data_blocks": 16577, "num_entries":
226423, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2022-04-24T06:54:27.197+ 7f7c6e152700  4 rocksdb:
[db/compaction_job.cc:1327] [default] [JOB 1610] Generated table #10987:
233189 keys, 70525661 bytes
2022-04-24T06:54:27.197+ 7f7c6e152700  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1650783267203149, "cf_name": "default", "job": 1610,
"event": "table_file_creation", "file_number": 10987, "file_size":
70525661, "table_properties": {"data_size": 67113431, "index_size":
2828386, "filter_size": 582981, "raw_key_size": 44106906,
"raw_average_key_size": 189, "raw_value_size": 54869394,
"raw_average_value_size": 235, "num_data_blocks": 16569, "num_entries":
233189, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2022-04-24T06:54:28.597+ 7f7c6e152700  4 rocksdb:
[db/compaction_job.cc:1327] [default] [JOB 1610] Generated table #10988:
228113 keys, 70497098 bytes
2022-04-24T06:54:28.597+ 7f7c6e152700  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1650783268603183, "cf_name": "default", "job": 1610,
"event": "table_file_creation", "file_number": 10988, "file_size":
70497098, "table_properties": {"data_size": 67111373, "index_size":
2814553, "filter_size": 570309, "raw_key_size": 43137333,
"raw_average_key_size": 189, "raw_value_size": 54984875,
"raw_average_value_size": 241, "num_data_blocks": 16584, "num_entries":
228113, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2022-04-24T06:54:28.689+ 7f7c76c00700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f7c5d130700' had timed out after 15
2022-04-24T06:54:28.689+ 7f7c75bfe700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f7c5d130700' had timed out after 15

2022-04-24T06:54:53.521+ 7f7c5d130700  1 heartbeat_map reset_timeout
'OSD::osd_op_tp thread 0x7f7c5d130700' had timed out after 15
2022-04-24T06:54:53.721+ 7f7c66943700  0 log_channel(cluster) log [WRN]
: Monitor daemon marked osd.6 down, but it is still running
2022-04-24T06:54:53.721+ 7f7c66943700  0 log_channel(cluster) log [DBG]
: map e3995 wrongly marked me down at e3994

The cluster itself still has plenty of free space and doesn't have a huge
IO load.

So what next? How can I provide more debug output to help solve this issue?
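
To capture more detail the next time osd.6 gets marked down, the debug levels
could be raised temporarily (a sketch; these settings are chatty, so revert
them afterwards):

# more verbose OSD, BlueStore and RocksDB logging for osd.6
ceph config set osd.6 debug_osd 10
ceph config set osd.6 debug_bluestore 10
ceph config set osd.6 debug_rocksdb 10
# revert once enough has been captured
ceph config rm osd.6 debug_osd
ceph config rm osd.6 debug_bluestore
ceph config rm osd.6 debug_rocksdb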


-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal a


[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-04-26 Thread Konstantin Shalygin
Hi,

After some load, HDDs will not perform well. You should move the block.db to
NVMe to avoid database vacuuming problems.
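
Moving the DB of an existing OSD onto a new device can be done offline with
ceph-bluestore-tool, roughly like this (a sketch; OSD id and device path are
placeholders, and the OSD has to be stopped first):

systemctl stop ceph-osd@6
# attach a dedicated DB device to the existing BlueStore OSD
ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-6 --dev-target /dev/nvme0n1p1
# move the RocksDB data that currently lives on the slow device over to it
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-6 \
  --devs-source /var/lib/ceph/osd/ceph-6/block --dev-target /var/lib/ceph/osd/ceph-6/block.db
systemctl start ceph-osd@6
# for ceph-volume/LVM based OSDs the LV tags may also need updating
# (newer releases provide "ceph-volume lvm migrate" for this)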

k
Sent from my iPhone

> On 26 Apr 2022, at 13:58, Boris Behrens  wrote:
> 
> The cluster contains 12x8TB OSDs without any SSDs as cache

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-04-26 Thread Boris
I also have this problem with OSDs that have SSDs as block.db.

> 
> Am 26.04.2022 um 17:10 schrieb Konstantin Shalygin :
> 
> Hi,
> 
> After some load HDD's will be not perform well. You should move block.db's to 
> NVMe for avoid database vacuuming problems
> 
> k
> Sent from my iPhone
> 
>> On 26 Apr 2022, at 13:58, Boris Behrens  wrote:
>> 
>> The cluster contains 12x8TB OSDs without any SSDs as cache
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io