[ceph-users] bluefs_allocator bitmap or hybrid

2021-12-01 Thread huxia...@horebdata.cn
Dear Cephers,

We are running tons of Ceph clusters on Luminous with bluefs_allocator set to
bitmap, and looking at Nautilus 14.2.22, bluefs_allocator now defaults to
hybrid. I am therefore wondering the following:

1) What is the advantage of using hybrid instead of bitmap (which seems to be
working very well for us now)?

2) Is it safe to keep Nautilus with bluefs_allocator set to bitmap?

Cheers,

samuel 



huxia...@horebdata.cn


[ceph-users] Re: 16.2.7 pacific QE validation status, RC1 available for testing

2021-12-01 Thread Venky Shankar
On Mon, Nov 29, 2021 at 10:53 PM Yuri Weinstein  wrote:

> fs - Venky, Patrick

fs approved - failures are known and have trackers.



-- 
Cheers,
Venky



[ceph-users] Re: Rocksdb: Corruption: missing start of fragmented record(1)

2021-12-01 Thread Dan van der Ster
Hi Frank,

I'd be interested to read that paper, if you can find it again. I
don't understand why the volatile cache + fsync might be dangerous due
to buggy firmware, yet we should trust that the same firmware respects
FUA when the volatile cache is disabled.

In https://github.com/ceph/ceph/pull/43848 we're documenting the
implications of WCE -- but in the context of performance, not safety.
If write through / volatile cache off is required for safety too, then
we should take a different approach (e.g. ceph could disable the write
cache itself).
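
For reference, turning the volatile cache off by hand usually looks
something like this (device names are placeholders, and exact behaviour
varies a bit between vendors and transports):

  hdparm -W 0 /dev/sdX                  # SATA: disable volatile write cache
  sdparm --set WCE=0 --save /dev/sdX    # SAS/SCSI: clear the Write Cache Enable bit
  smartctl -g wcache /dev/sdX           # check the current write-cache setting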

Cheers, dan



On Tue, Nov 30, 2021 at 9:36 AM Frank Schilder  wrote:
>
> Hi Dan.
>
> > ...however it is not unsafe to leave the cache enabled -- ceph uses
> > fsync appropriately to make the writes durable.
>
> Actually it is. You will rely on the drive's firmware to implement this 
> correctly and this is, unfortunately, less than a given. Within the last 
> one-two years somebody posted a link to a very interesting research paper to 
> this list, where drives were tested under real conditions. Turns out that the 
> "fsync to make writes persistent" is very vulnerable to power loss if 
> volatile write cache is enabled. If I remember correctly, about 1-2% of 
> drives ended up with data loss every time. In other words, for every drive 
> with volatile write cache enabled, every 100 power loss events you will have 
> 1-2 data loss events (in certain situations, the drive replies with ack 
> before the volatile cache is actually flushed). I think even PLP did not 
> prevent data loss in all cases.
>
> It's all down to bugs in firmware that fail to catch all corner cases and 
> internal race conditions with ops scheduling. Vendors very often prioritize 
> performance over fixing a rare race condition, and I will not take chances, 
> nor recommend taking them.
>
> I think this kind of advice should really not be given in a ceph context 
> without also referring to the pre-requisites: perfect firmware. Ceph is a 
> scale-out system and any large sized cluster will have enough drives to see 
> low-probability events on a regular basis. At least recommend to test that 
> thoroughly, that is, perform power-loss tests under load, and I mean many 
> power loss events per drive with randomised intervals under different load 
> patterns.
>
> Same applies to disk controllers with cache. Nobody recommends using the 
> controller cache because of firmware bugs that seem to be present in all 
> models. We have sufficient cases on this list for data loss after power loss 
> with controller cache being the issue. The recommendation is to enable HBA 
> mode and write-through. Do the same with your disk firmware, get better sleep 
> and better performance in one go.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 29 November 2021 09:24:29
> To: Frank Schilder
> Cc: huxia...@horebdata.cn; YiteGu; ceph-users
> Subject: Re: [ceph-users] Re: Rocksdb: Corruption: missing start of 
> fragmented record(1)
>
> Hi Frank,
>
> That's true from the performance perspective, however it is not unsafe
> to leave the cache enabled -- ceph uses fsync appropriately to make
> the writes durable.
>
> This issue looks rather to be related to concurrent hardware failure.
>
> Cheers, Dan
>
> On Mon, Nov 29, 2021 at 9:21 AM Frank Schilder  wrote:
> >
> > This may sound counter-intuitive, but you need to disable write cache to 
> > enable only the PLP-protected cache. SSDs with PLP usually have two types of cache, 
> > volatile and non-volatile. The volatile cache will experience data loss on 
> > power loss. It is the volatile cache that gets disabled when issuing the 
> > hd-/sdparm/smartctl command to switch it off. In many cases this can 
> > increase the non-volatile cache and also performance.
> >
> > It is the non-volatile cache you want your writes to go to directly.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: huxia...@horebdata.cn 
> > Sent: 26 November 2021 22:41:10
> > To: YiteGu; ceph-users
> > Subject: [ceph-users] Re: Rocksdb: Corruption: missing start of fragmented 
> > record(1)
> >
> > wal/db are on Intel S4610 960GB SSDs, with PLP and write back on
> >
> >
> >
> > huxia...@horebdata.cn
> >
> > From: YiteGu
> > Date: 2021-11-26 11:32
> > To: huxia...@horebdata.cn; ceph-users
> > Subject: Re:[ceph-users] Rocksdb: Corruption: missing start of fragmented 
> > record(1)
> > It looks like your wal/db device lost data.
> > Please check whether your wal/db device has a writeback cache; a power 
> > loss can then cause data loss, and the log replay fails when RocksDB restarts.
> >
> >
> >
> > YiteGu
> > ess_...@qq.com
> >
> >
> >
> > -- Original --
> > From: "huxia...@horebdata.cn" ;
> > Date: Fri, Nov 26, 2021 06:02 PM
> > To: "ceph-users";
> > Su

[ceph-users] [RGW] Too much index objects and OMAP keys on them

2021-12-01 Thread Gilles Mocellin

Hello,

We see large omap objects warnings on the RGW bucket index pool.
The objects OMAP keys are about objects in one identified big bucket.

Context :
=
We use S3 storage for an application, with ~1.5 M objects.

The production cluster is "replicated" to another, distant cluster with 
rclone cron jobs.


For the moment we have only one big bucket (23 shards), but we are working 
on a multi-bucket solution.

That is not the problem here, though.

One other important piece of information: the bucket is versioned. We don't 
really have versions or delete markers, due to the way the application 
works. It's mainly a recovery mechanism, as we don't have backups because of 
the expected storage volume. Versioning + replication should cover most 
of the restoration use cases.



First, we don't have large omap objects in the production cluster, only 
on the replicated / backup one.


Differences between the two clusters:
- production is a 5-node cluster with SSDs for rocksdb+wal and 2 TB 10k SCSI 
drives in RAID0 behind a battery-backed cache.
- the backup cluster is a 13-node cluster without SSDs, only 8 TB HDDs on a 
direct HBA.


Both clusters use erasure coding for the RGW buckets data pool (3+2 on 
the production one, 8+2 on the backup one).


First observations:
===

Both clusters have the same number of S3 objects in the main bucket.
I've seen that there are 10x more objects in the RGW buckets index pool 
of the backup cluster than in the prod cluster.

On those objects, there are 4x more OMAP keys in the backup cluster.

Example, with rados ls:
- 311 objects in defaults.rgw.buckets.index (prod cluster)
- 3157 objects in MRS4.rgw.buckets.index (backup cluster)

In the backup cluster, we have 22 objects with more than 20 OMAP keys, 
which is why we get the warning.
Searching in the production cluster, I see around 6 OMAP keys at most 
per object.
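
For anyone who wants to reproduce the per-object OMAP key counts above,
something along these lines should work (pool name as in our backup
cluster; a rough sketch that can be slow on a large index pool):

  for obj in $(rados -p MRS4.rgw.buckets.index ls); do
      echo "$(rados -p MRS4.rgw.buckets.index listomapkeys "$obj" | wc -l) $obj"
  done | sort -rn | head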


Root cause?


It seems we have too many OMAP keys and even too many objects in the 
index pool of our backup cluster. But why? And how do we remove the 
orphans?


I've already tried:
- radosgw-admin bucket check --fix --check-objects (still running)
- rgw-orphan-list (but it was interrupted last night after 5 hours)

As I understand it, that last script does the reverse of what I need: it 
lists data objects that no index entry points to?
radosgw-admin bucket check will perhaps rebuild indexes, but will it 
remove unused ones?


Workaround?


How can I get rid of the unused index objects and OMAP keys?
Of course, I could add more shards, but I think it would be better to 
solve the root cause if I can.
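
If resharding turns out to be the way to go, I assume it would be
something like this (the shard count is only an example):

  radosgw-admin bucket reshard --bucket=<bucket-name> --num-shards=101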



[ceph-users] Re: bluefs_allocator bitmap or hybrid

2021-12-01 Thread Igor Fedotov

Hi Samuel,


On 12/1/2021 11:54 AM, huxia...@horebdata.cn wrote:

Dear Cephers,

We are running tons of Ceph clusters on Luminous with bluefs_allocator set to 
bitmap, and looking at Nautilus 14.2.22, bluefs_allocator now defaults to 
hybrid. I am therefore wondering the following:

1) What is the advantage of using hybrid instead of bitmap (which seems to be 
working very well for us now)?


Generally the hybrid allocator looks faster than bitmap, especially when 
long contiguous chunks have to be allocated in a highly fragmented space. 
But there have also been some complaints about higher tail latencies caused 
by the hybrid allocator, e.g. 
https://tracker.ceph.com/issues/52804


Hopefully this is fixed in master and will be backported to Octopus soon.



2) Is it safe to keep Nautilus with bluefs_allocator set to bitmap?


Yes, it's totally safe.
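
If you want to pin it explicitly, that would look roughly like this
(the OSDs need a restart for an allocator change to take effect):

  ceph config set osd bluefs_allocator bitmap
  ceph config set osd bluestore_allocator bitmap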




Cheers,

samuel



huxia...@horebdata.cn


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx



[ceph-users] Re: Ceph unresponsive on manager restart

2021-12-01 Thread Janne Johansson
Den ons 1 dec. 2021 kl 13:43 skrev Roman Steinhart :
> Hi all,
> We're currently troubleshooting our Ceph cluster.
> It appears that every time the active manager switches or restarts the
> whole cluster becomes slow/unresponsive for a short period of time.
> Everytime that happens we also see a lot of leader elections in the
> monitors and down monitor reports when doing "ceph status".

The word "Manager" might have been slightly unfortunate here and as a
subject if you meant "monitor", since there is a role specifically
named Manager (mgr) which is not the same as "mon", even if many ceph
admins co-locate them on the same machine.

-- 
May the most significant bit of your life be positive.


[ceph-users] Re: Ceph unresponsive on manager restart

2021-12-01 Thread Janne Johansson
Den ons 1 dec. 2021 kl 15:45 skrev Roman Steinhart :
> Hi Janne,
> That's not a typo :D, I really mean manager. It happens when I restart 
> the active ceph manager daemon or when the active manager switches 
> on its own.

My bad, I thought since you later mentioned monitor elections that you
used the word 'manager' broadly.
Apologies for only nitpicking (while being wrong!), and not being able
to help with your actual problem. 8-/


-- 
May the most significant bit of your life be positive.


[ceph-users] Re: Rocksdb: Corruption: missing start of fragmented record(1)

2021-12-01 Thread Frank Schilder
Hi Dan,

I can try to find the thread and the link again. I should mention that my inbox 
is a mess and the search function in the Outlook 365 app is, well, don't 
mention the war. Is there a "list by thread" option on lists.ceph.io? I can 
go through two years of threads, but not all messages.

> ceph could disable the write cache itself

I thought the newer versions were doing that already, but it looks like only 
a udev rule is recommended: https://github.com/ceph/ceph/pull/43848/files. 
I think the write cache issue is mostly relevant for consumer-grade or 
low-end datacenter hardware, which needs to simulate performance with cheap 
components. I have never seen an enterprise SAS drive with the write cache enabled.
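
For completeness, the kind of udev rule meant here boils down to something
like the following sketch (not necessarily the exact rule from the PR; the
match should be narrowed to the devices you actually want to touch):

  # e.g. /etc/udev/rules.d/99-disable-volatile-write-cache.rules
  ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}="write through"

which is equivalent to writing "write through" into
/sys/class/scsi_disk/<device>/cache_type by hand.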

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 01 December 2021 11:28:03
To: Frank Schilder
Cc: huxia...@horebdata.cn; YiteGu; ceph-users
Subject: Re: [ceph-users] Re: Rocksdb: Corruption: missing start of fragmented 
record(1)

Hi Frank,

I'd be interested to read that paper, if you can find it again. I
don't understand why the volatile cache + fsync might be dangerous due
to buggy firmware, yet we should trust that the same firmware respects
FUA when the volatile cache is disabled.

In https://github.com/ceph/ceph/pull/43848 we're documenting the
implications of WCE -- but in the context of performance, not safety.
If write through / volatile cache off is required for safety too, then
we should take a different approach (e.g. ceph could disable the write
cache itself).

Cheers, dan



On Tue, Nov 30, 2021 at 9:36 AM Frank Schilder  wrote:
>
> Hi Dan.
>
> > ...however it is not unsafe to leave the cache enabled -- ceph uses
> > fsync appropriately to make the writes durable.
>
> Actually it is. You will rely on the drive's firmware to implement this 
> correctly and this is, unfortunately, less than a given. Within the last 
> one-two years somebody posted a link to a very interesting research paper to 
> this list, where drives were tested under real conditions. Turns out that the 
> "fsync to make writes persistent" is very vulnerable to power loss if 
> volatile write cache is enabled. If I remember correctly, about 1-2% of 
> drives ended up with data loss every time. In other words, for every drive 
> with volatile write cache enabled, every 100 power loss events you will have 
> 1-2 data loss events (in certain situations, the drive replies with ack 
> before the volatile cache is actually flushed). I think even PLP did not 
> prevent data loss in all cases.
>
> It's all down to bugs in firmware that fail to catch all corner cases and 
> internal race conditions with ops scheduling. Vendors very often prioritize 
> performance over fixing a rare race condition, and I will not take chances, 
> nor recommend taking them.
>
> I think this kind of advice should really not be given in a ceph context 
> without also referring to the pre-requisites: perfect firmware. Ceph is a 
> scale-out system and any large sized cluster will have enough drives to see 
> low-probability events on a regular basis. At least recommend to test that 
> thoroughly, that is, perform power-loss tests under load, and I mean many 
> power loss events per drive with randomised intervals under different load 
> patterns.
>
> Same applies to disk controllers with cache. Nobody recommends using the 
> controller cache because of firmware bugs that seem to be present in all 
> models. We have sufficient cases on this list for data loss after power loss 
> with controller cache being the issue. The recommendation is to enable HBA 
> mode and write-through. Do the same with your disk firmware, get better sleep 
> and better performance in one go.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 29 November 2021 09:24:29
> To: Frank Schilder
> Cc: huxia...@horebdata.cn; YiteGu; ceph-users
> Subject: Re: [ceph-users] Re: Rocksdb: Corruption: missing start of 
> fragmented record(1)
>
> Hi Frank,
>
> That's true from the performance perspective, however it is not unsafe
> to leave the cache enabled -- ceph uses fsync appropriately to make
> the writes durable.
>
> This issue looks rather to be related to concurrent hardware failure.
>
> Cheers, Dan
>
> On Mon, Nov 29, 2021 at 9:21 AM Frank Schilder  wrote:
> >
> > This may sound counter-intuitive, but you need to disable write cache to 
> > enable only the PLP-protected cache. SSDs with PLP usually have two types of cache, 
> > volatile and non-volatile. The volatile cache will experience data loss on 
> > power loss. It is the volatile cache that gets disabled when issuing the 
> > hd-/sdparm/smartctl command to switch it off. In many cases this can 
> > increase the non-volatile cache and also performance.
> >
> > It is the non-volatile cache you 

[ceph-users] OSD repeatedly marked down

2021-12-01 Thread Jan Kasprzak
Hello,

I am trying to upgrade my Ceph cluster (v15.2.15) from CentOS 7 to CentOS 8
stream. I upgraded monitors (a month or so ago), and now I want to upgrade
OSDs: for now I upgraded one host with two OSDs: I kept the partitions
where OSD data live (I have separate db on NVMe partition and data on
the whole HDD), and removed/recreated the OS / and /boot/efi partitions.
When I run

ceph-volume lvm activate --all

the /var/lib/ceph/osd/ceph-* tmpfs volumes get mounted and populated,
and the ceph-osd processes get started. In "ceph -s", they "2 osds down"
message disappears, and the number of degraded objects steadily decreases.
However, after some time the number of degraded objects starts going up
and down again, and osds appear to be down (and then up again). After 5 minutes
the OSDs are kicked out from the cluster, and the ceph-osd daemons stop.
The log from "journalctl -u ceph-osd@32.service" is below.

What else should I check? Thanks!

-Yenya

Dec 01 17:15:20 my.osd.host ceph-osd[3818]: 2021-12-01T17:15:20.384+0100 
7f8c4280af00 -1 Falling back to public interface
Dec 01 17:15:24 my.osd.host ceph-osd[3818]: 2021-12-01T17:15:24.666+0100 
7f8c4280af00 -1 osd.32 1119445 log_to_monitors {default=true}
Dec 01 17:15:25 my.osd.host ceph-osd[3818]: 2021-12-01T17:15:25.334+0100 
7f8c34dfa700 -1 osd.32 1119445 set_numa_affinity unable to identify public 
interface '' numa node: (2) No such file or directory
Dec 01 17:15:48 my.osd.host ceph-osd[3818]: 2021-12-01T17:15:48.714+0100 
7f8c34dfa700 -1 osd.32 1119496 set_numa_affinity unable to identify public 
interface '' numa node: (2) No such file or directory
Dec 01 17:16:14 my.osd.host ceph-osd[3818]: 2021-12-01T17:16:14.717+0100 
7f8c34dfa700 -1 osd.32 1119508 set_numa_affinity unable to identify public 
interface '' numa node: (2) No such file or directory
Dec 01 17:16:45 my.osd.host ceph-osd[3818]: 2021-12-01T17:16:45.682+0100 
7f8c34dfa700 -1 osd.32 1119526 set_numa_affinity unable to identify public 
interface '' numa node: (2) No such file or directory
Dec 01 17:17:13 my.osd.host ceph-osd[3818]: 2021-12-01T17:17:13.565+0100 
7f8c34dfa700 -1 osd.32 1119538 set_numa_affinity unable to identify public 
interface '' numa node: (2) No such file or directory
Dec 01 17:17:42 my.osd.host ceph-osd[3818]: 2021-12-01T17:17:42.237+0100 
7f8c34dfa700 -1 osd.32 1119548 set_numa_affinity unable to identify public 
interface '' numa node: (2) No such file or directory
Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.623+0100 
7f8c295e3700 -1 osd.32 1119559 _committed_osd_maps marked down 6 > 
osd_max_markdown_count 5 in last 600.00 seconds, shutting down
Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 received  signal: Interrupt from Kernel ( Could be generated by 
pthread_kill(), raise(), abort(), alarm() ) UID: 0
Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 osd.32 1119559 *** Got signal Interrupt ***
Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 osd.32 1119559 *** Immediate shutdown (osd_fast_shutdown=true) 
***

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall


[ceph-users] Re: OSD repeatedly marked down

2021-12-01 Thread Sebastian Knust

Hi Jan,

On 01.12.21 17:31, Jan Kasprzak wrote:

In "ceph -s", they "2 osds down"
message disappears, and the number of degraded objects steadily decreases.
However, after some time the number of degraded objects starts going up
and down again, and osds appear to be down (and then up again). After 5 minutes
the OSDs are kicked out from the cluster, and the ceph-osd daemons stop
Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 received  signal: Interrupt from Kernel ( Could be generated by 
pthread_kill(), raise(), abort(), alarm() ) UID: 0
Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 osd.32 1119559 *** Got signal Interrupt ***
Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 osd.32 1119559 *** Immediate shutdown (osd_fast_shutdown=true) 
***



Do you have enough memory on your host? You might want to look for OOM 
messages in dmesg / the journal and monitor your memory usage throughout 
the recovery.


If the OSD processes are indeed killed by the OOM killer, you have a few 
options. Adding more memory would probably be best to future-proof the 
system. Maybe you could also work with some Ceph config settings, e.g. 
lowering osd_max_backfills (although I'm definitely not an expert on 
which parameters would give you the best result). Adding swap will most 
likely only produce other issues, but might be a method of last resort.
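
If it does turn out to be memory pressure, a rough sketch of what I had in
mind (the values are only examples):

  journalctl -k | grep -i 'out of memory'        # any OOM kills?
  ceph config set osd osd_max_backfills 1        # throttle backfill
  ceph config set osd osd_recovery_max_active 1  # throttle recovery
  ceph config set osd osd_memory_target 3221225472   # ~3 GiB per OSD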


Cheers
Sebastian


[ceph-users] ceph-mgr constantly dying

2021-12-01 Thread Malte Stroem

Hello,

one of our mgrs is constantly dying.

Everything worked fine for a long time but now it happens every two 
weeks or so.


We have two clusters. Both use the same ceph version 14.2.8. Each 
cluster hosts three ceph-mgrs.


Only one and always the same ceph-mgr is dying on the same machine on 
one of the two clusters.


A web search turns up a tracker ticket:

https://tracker.ceph.com/issues/24995

However, that one affects Ceph 12.

I have not found any hardware issues yet (maybe a reboot alone would help), 
but the log shows the following regarding the prometheus module:


*** Caught signal (Segmentation fault) **
in thread 7fb3f752f700 thread_name:prometheus
ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus 
(stable)

1: (()+0x12890) [0x7fb4205d1890]
2: 
(ceph::buffer::v14_2_0::ptr_node::cloner::operator()(ceph::buffer::v14_2_0::ptr_node 
const&)+0x40) [0x7fb42158ca60]

3: (()+0xdd89c) [0x55fc4b5a889c]
4: (ActivePyModules::get_python(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1cb0) 
[0x55fc4b60e3e0]

5: (()+0x1524ab) [0x55fc4b61d4ab]
[...]

The crash log shows:

{
"os_version_id": "18.04",
"utsname_release": "4.15.0-76-generic",
"os_name": "Ubuntu",
"entity_name": "mgr.",
"timestamp": "",
"process_name": "ceph-mgr",
"utsname_machine": "x86_64",
"utsname_sysname": "Linux",
"os_version": "18.04.4 LTS (Bionic Beaver)",
"os_id": "ubuntu",
"utsname_version": "#86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020",
"backtrace": [
"(()+0x12890) [0x7fb4205d1890]",

"(ceph::buffer::v14_2_0::ptr_node::cloner::operator()(ceph::buffer::v14_2_0::ptr_node 
const&)+0x40) [0x7fb42158ca60]",

"(()+0xdd89c) [0x55fc4b5a889c]",
"(ActivePyModules::get_python(std::__cxx11::basic_stringstd::char_traits, std::allocator > const&)+0x1cb0) 
[0x55fc4b60e3e0]",

"(()+0x1524ab) [0x55fc4b61d4ab]",
"(PyEval_EvalFrameEx()+0x8010) [0x7fb420af0770]",
"(PyEval_EvalFrameEx()+0x8b5b) [0x7fb420af12bb]",
"(PyEval_EvalFrameEx()+0x8b5b) [0x7fb420af12bb]",
"(PyEval_EvalFrameEx()+0x8b5b) [0x7fb420af12bb]",
"(PyEval_EvalCodeEx()+0x7d8) [0x7fb420c2d908]",
"(PyEval_EvalFrameEx()+0x5bf6) [0x7fb420aee356]",
"(PyEval_EvalCodeEx()+0x7d8) [0x7fb420c2d908]",
"(()+0x17a71d) [0x7fb420bc871d]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(PyEval_EvalFrameEx()+0x5314) [0x7fb420aeda74]",
"(PyEval_EvalCodeEx()+0x7d8) [0x7fb420c2d908]",
"(()+0x17a71d) [0x7fb420bc871d]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(()+0x1a1eec) [0x7fb420befeec]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(()+0x12030a) [0x7fb420b6e30a]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(PyEval_EvalFrameEx()+0x5314) [0x7fb420aeda74]",
"(PyEval_EvalCodeEx()+0x7d8) [0x7fb420c2d908]",
"(()+0x17a639) [0x7fb420bc8639]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(()+0x1a1eec) [0x7fb420befeec]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(()+0x1a2704) [0x7fb420bf0704]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(PyEval_EvalFrameEx()+0x41e1) [0x7fb420aec941]",
"(PyEval_EvalFrameEx()+0x8b5b) [0x7fb420af12bb]",
"(PyEval_EvalFrameEx()+0x8b5b) [0x7fb420af12bb]",
"(PyEval_EvalFrameEx()+0x8b5b) [0x7fb420af12bb]",
"(PyEval_EvalCodeEx()+0x7d8) [0x7fb420c2d908]",
"(()+0x17a639) [0x7fb420bc8639]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(()+0x1a1eec) [0x7fb420befeec]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(()+0x11f862) [0x7fb420b6d862]",
"(()+0x1240ca) [0x7fb420b720ca]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(PyEval_EvalFrameEx()+0x41e1) [0x7fb420aec941]",
"(PyEval_EvalFrameEx()+0x8b5b) [0x7fb420af12bb]",
"(PyEval_EvalCodeEx()+0x7d8) [0x7fb420c2d908]",
"(()+0x17a71d) [0x7fb420bc871d]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(()+0x1a1eec) [0x7fb420befeec]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(()+0x12030a) [0x7fb420b6e30a]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(PyEval_EvalFrameEx()+0x5314) [0x7fb420aeda74]",
"(PyEval_EvalCodeEx()+0x7d8) [0x7fb420c2d908]",
"(PyEval_EvalFrameEx()+0x5bf6) [0x7fb420aee356]",
"(PyEval_EvalCodeEx()+0x7d8) [0x7fb420c2d908]",
"(()+0x17a639) [0x7fb420bc8639]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(()+0x1a1eec) [0x7fb420befeec]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(()+0x11f862) [0x7fb420b6d862]",
"(()+0x1240ca) [0x7fb420b720ca]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
"(PyEval_EvalFrameEx()+0x41e1) [0x7fb420aec941]",
"(PyEval_EvalCodeEx()+0x7d8) [0x7fb420c2d908]",
"(()+0x17a639) [0x7fb420bc8639]",
"(PyObject_Call()+0x43) [0x7fb420bc2903]",
 

[ceph-users] Re: OSD repeatedly marked down

2021-12-01 Thread Dan van der Ster
Hi,

You should check the central ceph.log to understand why the osd is
getting marked down to begin with. Is it a connectivity issue from
peers to that OSD?
It looks like you have osd logging disabled -- revert to defaults
while you troubleshoot this.
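
Concretely, something along these lines on a mon host (paths and debug
levels are just examples):

  grep 'osd.32' /var/log/ceph/ceph.log | grep -Ei 'marked down|failed'
  ceph config show osd.32 | grep -E 'debug_osd|debug_ms|log_file'
  ceph tell osd.32 config set debug_ms 1    # temporarily bump messenger logging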

-- dan


On Wed, Dec 1, 2021 at 5:31 PM Jan Kasprzak  wrote:
>
> Hello,
>
> I am trying to upgrade my Ceph cluster (v15.2.15) from CentOS 7 to CentOS 8
> stream. I upgraded monitors (a month or so ago), and now I want to upgrade
> OSDs: for now I upgraded one host with two OSDs: I kept the partitions
> where OSD data live (I have separate db on NVMe partition and data on
> the whole HDD), and removed/recreated the OS / and /boot/efi partitions.
> When I run
>
> ceph-volume lvm activate --all
>
> the /var/lib/ceph/osd/ceph-* tmpfs volumes get mounted and populated,
> and the ceph-osd processes get started. In "ceph -s", they "2 osds down"
> message disappears, and the number of degraded objects steadily decreases.
> However, after some time the number of degraded objects starts going up
> and down again, and osds appear to be down (and then up again). After 5 
> minutes
> the OSDs are kicked out from the cluster, and the ceph-osd daemons stop.
> The log from "journalctl -u ceph-osd@32.service" is below.
>
> What else should I check? Thanks!
>
> -Yenya
>
> Dec 01 17:15:20 my.osd.host ceph-osd[3818]: 2021-12-01T17:15:20.384+0100 
> 7f8c4280af00 -1 Falling back to public interface
> Dec 01 17:15:24 my.osd.host ceph-osd[3818]: 2021-12-01T17:15:24.666+0100 
> 7f8c4280af00 -1 osd.32 1119445 log_to_monitors {default=true}
> Dec 01 17:15:25 my.osd.host ceph-osd[3818]: 2021-12-01T17:15:25.334+0100 
> 7f8c34dfa700 -1 osd.32 1119445 set_numa_affinity unable to identify public 
> interface '' numa node: (2) No such file or directory
> Dec 01 17:15:48 my.osd.host ceph-osd[3818]: 2021-12-01T17:15:48.714+0100 
> 7f8c34dfa700 -1 osd.32 1119496 set_numa_affinity unable to identify public 
> interface '' numa node: (2) No such file or directory
> Dec 01 17:16:14 my.osd.host ceph-osd[3818]: 2021-12-01T17:16:14.717+0100 
> 7f8c34dfa700 -1 osd.32 1119508 set_numa_affinity unable to identify public 
> interface '' numa node: (2) No such file or directory
> Dec 01 17:16:45 my.osd.host ceph-osd[3818]: 2021-12-01T17:16:45.682+0100 
> 7f8c34dfa700 -1 osd.32 1119526 set_numa_affinity unable to identify public 
> interface '' numa node: (2) No such file or directory
> Dec 01 17:17:13 my.osd.host ceph-osd[3818]: 2021-12-01T17:17:13.565+0100 
> 7f8c34dfa700 -1 osd.32 1119538 set_numa_affinity unable to identify public 
> interface '' numa node: (2) No such file or directory
> Dec 01 17:17:42 my.osd.host ceph-osd[3818]: 2021-12-01T17:17:42.237+0100 
> 7f8c34dfa700 -1 osd.32 1119548 set_numa_affinity unable to identify public 
> interface '' numa node: (2) No such file or directory
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.623+0100 
> 7f8c295e3700 -1 osd.32 1119559 _committed_osd_maps marked down 6 > 
> osd_max_markdown_count 5 in last 600.00 seconds, shutting down
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
> 7f8c38e02700 -1 received  signal: Interrupt from Kernel ( Could be generated 
> by pthread_kill(), raise(), abort(), alarm() ) UID: 0
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
> 7f8c38e02700 -1 osd.32 1119559 *** Got signal Interrupt ***
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
> 7f8c38e02700 -1 osd.32 1119559 *** Immediate shutdown 
> (osd_fast_shutdown=true) ***
>
> --
> | Jan "Yenya" Kasprzak  |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> We all agree on the necessity of compromise. We just can't agree on
> when it's necessary to compromise. --Larry Wall


[ceph-users] Is it normal for a orch osd rm drain to take so long?

2021-12-01 Thread Zach Heise (SSCC)

  
I wanted to swap out an existing OSD, preserve the number, and then
remove the HDD that had it (osd.14 in this case) and give the ID of 14
to a new SSD that would be taking its place in the same node. First time
ever doing this, so not sure what to expect. I followed the instructions
here, using the --replace flag.

However, I'm a bit concerned that the operation is taking so long in my
test cluster. Out of 70TB in the cluster, only 40GB were in use. This is
a relatively large OSD in comparison to others in the cluster (2.7TB
versus 300GB for most other OSDs) and yet it's been 36 hours with the
following status:

ceph04.ssc.wisc.edu> ceph orch osd rm status
OSD_ID  HOST STATE PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT  
14  ceph04.ssc.wisc.edu  draining  1 True True   2021-11-30 15:22:23.469150+00:00

Another note: I don't know why it has "force = true" set; the command
that I ran was just "ceph orch osd rm 14 --replace", without specifying
--force. Hopefully not a big deal but still strange.

At this point is there any way to tell if it's still actually doing
something, or whether it is hung? If it is hung, what would be the
'recommended' way to proceed? I know that I could just manually eject
the HDD from the chassis, run the "ceph osd crush remove osd.14" command
and then manually delete the auth keys, etc., but the documentation
seems to state that this shouldn't be necessary if a Ceph OSD
replacement goes properly.
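
For reference, a few commands that should show whether the drain is
actually progressing (a sketch, using osd.14 as above):

  ceph osd df tree | grep -w 'osd.14'   # PGs and data still on the OSD
  ceph pg ls-by-osd 14                  # which PG is still mapped there
  ceph osd safe-to-destroy osd.14       # reports when the OSD is empty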

  



[ceph-users] Re: Cephalocon 2022 is official!

2021-12-01 Thread Mike Perez
Hi everyone,

We're near the deadline of December 10th for the Cephalocon CFP. So
don't miss your chance to speak at this event either in-person or
virtually.

https://ceph.io/en/community/events/2022/cephalocon-portland/

If you're interested in sponsoring Cephalocon, the sponsorship
prospectus is now available:

https://ceph.io/assets/pdfs/cephalocon-2022-sponsorship-prospectus.pdf

On Fri, Nov 5, 2021 at 10:06 AM Mike Perez  wrote:
>
> Hello everyone!
>
> I'm pleased to announce Cephalocon 2022 will be taking place April 5-7
> in Portland, Oregon + Virtually!
>
> The CFP is now open until December 10th, so don't delay! Registration
> and sponsorship details will be available soon!
>
> I am looking forward to seeing you all in person again soon!
>
> https://ceph.io/en/community/events/2022/cephalocon-portland/
>
> --
> Mike Perez



[ceph-users] Re: OSD repeatedly marked down

2021-12-01 Thread Jan Kasprzak
Sebastian,

Sebastian Knust wrote:
: On 01.12.21 17:31, Jan Kasprzak wrote:
: >In "ceph -s", they "2 osds down"
: >message disappears, and the number of degraded objects steadily decreases.
: >However, after some time the number of degraded objects starts going up
: >and down again, and osds appear to be down (and then up again). After 5 
minutes
: >the OSDs are kicked out from the cluster, and the ceph-osd daemons stop
: >Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 received  signal: Interrupt from Kernel ( Could be generated by 
pthread_kill(), raise(), abort(), alarm() ) UID: 0
: >Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 osd.32 1119559 *** Got signal Interrupt ***
: >Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 osd.32 1119559 *** Immediate shutdown (osd_fast_shutdown=true) 
***
: >
: 
: Do you have enough memory on your host? You might want to look for
: oom messages in dmesg / journal and monitor your memory usage
: throughout the recovery.

Yes, I have lots of memory. This particular node has 512 GB,
and according to top(1), the ceph-osd daemon has VSZ around 1.1 GB.
OOM would be visible in dmesg(8) (it is not). AFAIK, CentOS 8 Stream
does not have systemd-oomd(8) yet.

: If the osd processes are indeed killed by OOM killer, you have a few
: options. Adding more memory would probably be best to future-proof
: the system. Maybe you could also work with some Ceph config setting,
: e.g. lowering osd_max_backfills (although I'm definitely not an
: expert on which parameters would give you the best result). Adding
: swap will most likely only produce other issues, but might be a
: method of last resort.

I tend to add a small swap partition to my systems (this one
has 8 GB of swap) just to get rid of initialization code in various
processes. But after starting ceph-osd daemons (and them being killed
exactly after 600.0 seconds), there are exactly zero bytes of swap space used.

So I don't think my problem is OOM. It might be communication,
but I tried to tcpdump and look for example for ICMP port unreachable
messages, but nothing interesting there.

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall


[ceph-users] Re: 16.2.7 pacific QE validation status, RC1 available for testing

2021-12-01 Thread Neha Ojha
Hi Luis,

On Wed, Dec 1, 2021 at 8:19 AM Luis Domingues  wrote:
>
> We upgraded a test cluster (3 controllers + 6 OSD nodes with HDDs and SSDs 
> for rocksdb) from the latest Nautilus to this 16.2.7 RC1.
>
> The upgrade went well, without issues. We repaired the OSDs and none of them crashed.

That's good to know! Thanks for testing 16.2.7 RC1.

>
> But we are still hitting this bug:
> https://tracker.ceph.com/issues/50657
>
> Do you think this can still be backported to 16.2.7 before it get released?

Sure, we now have a PR out for it https://github.com/ceph/ceph/pull/44164.

Thanks,
Neha

>
> Luis Domingues
> Proton AG
>



[ceph-users] Re: Is it normal for a orch osd rm drain to take so long?

2021-12-01 Thread David Orman
What's "ceph osd df" show?

On Wed, Dec 1, 2021 at 2:20 PM Zach Heise (SSCC)  wrote:

> I wanted to swap out an existing OSD, preserve the number, and then remove
> the HDD that had it (osd.14 in this case) and give the ID of 14 to a new
> SSD that would be taking its place in the same node. First time ever doing
> this, so not sure what to expect.
>
> I followed the instructions here, using the --replace flag.
>
> However, I'm a bit concerned that the operation is taking so long in my
> test cluster. Out of 70TB in the cluster, only 40GB were in use. This is a
> relatively large OSD in comparison to others in the cluster (2.7TB versus
> 300GB for most other OSDs) and yet it's been 36 hours with the following
> status:
>
> ceph04.ssc.wisc.edu> ceph orch osd rm status
> OSD_ID  HOST STATE PG_COUNT  REPLACE  FORCE  
> DRAIN_STARTED_AT
> 14  ceph04.ssc.wisc.edu  draining  1 True True   2021-11-30 
> 15:22:23.469150+00:00
>
>
> Another note: I don't know why it has "force = true" set; the command
> that I ran was just "ceph orch osd rm 14 --replace", without specifying
> --force. Hopefully not a big deal but still strange.
>
> At this point is there any way to tell if it's still actually doing
> something, or perhaps it is hung? if it is hung, what would be the
> 'recommended' way to proceed? I know that I could just manually eject the
> HDD from the chassis and run the "ceph osd crush remove osd.14" command and
> then manually delete the auth keys, etc, but the documentation seems to
> state that this shouldn't be necessary if a ceph OSD replacement goes
> properly.
>


[ceph-users] Re: ceph-mgr constantly dying

2021-12-01 Thread Konstantin Shalygin
Hi,

The fix was backported to 14.2.10.
I suggest upgrading your clusters to 14.2.22.
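
To double-check the running versions and whether the crashes are being
recorded, something like this should do (assuming the crash module is
enabled):

  ceph versions
  ceph crash ls
  ceph crash info <crash-id>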


k
Sent from my iPhone

> On 1 Dec 2021, at 19:56, Malte Stroem  wrote:
> 
> We have two clusters. Both use the same ceph version 14.2.8. Each cluster 
> hosts three ceph-mgrs.
> 
> Only one and always the same ceph-mgr is dying on the same machine on one of 
> the two clusters.
> 
> The net shows a tracker ticket:
> 
> https://tracker.ceph.com/issues/24995
> 
> However it affects Ceph 12.



[ceph-users] [solved] Re: OSD repeatedly marked down

2021-12-01 Thread Jan Kasprzak
Jan Kasprzak wrote:
[...]
:   So I don't think my problem is OOM. It might be communication,
: but I tried to tcpdump and look for example for ICMP port unreachable
: messages, but nothing interesting there.

D'oh. Wrong prefix length of public_network in ceph.conf,
copied from the old kickstart file when creating a C8stream kickstart file.
This caused _some_ requests to go through an incorrect network interface.
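
For the record, a quick sanity check of that setting looks something like
this (the address is an example; if public_network only lives in ceph.conf,
the config database query will not show it):

  grep -E 'public_network|cluster_network' /etc/ceph/ceph.conf
  ceph config get osd public_network
  ip route get 192.0.2.10    # which interface actually reaches a mon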

Sebastian, Dan - thanks for the hints you sent. It was my own
misconfiguration after all.

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall