[ceph-users] Re: reef 18.2.3 QE validation status

2024-05-28 Thread Yuri Weinstein
We have discovered some issues (#1 and #2) during the final stages of
testing that require us to consider delaying this point release until
all options and risks are assessed and resolved.

We will keep you all updated on the progress.

Thank you for your patience!

#1 https://tracker.ceph.com/issues/66260
#2 https://tracker.ceph.com/issues/61948#note-21

On Wed, May 1, 2024 at 3:41 PM Yuri Weinstein  wrote:
>
> We've run into a problem during the last verification steps before
> publishing this release after upgrading the LRC to it  =>
> https://tracker.ceph.com/issues/65733
>
> After this issue is resolved, we will continue testing and publishing
> this point release.
>
> Thanks for your patience!
>
> On Thu, Apr 18, 2024 at 11:29 PM Christian Rohmann
>  wrote:
> >
> > On 18.04.24 8:13 PM, Laura Flores wrote:
> > > Thanks for bringing this to our attention. The leads have decided that
> > > since this PR hasn't been merged to main yet and isn't approved, it
> > > will not go in v18.2.3, but it will be prioritized for v18.2.4.
> > > I've already added the PR to the v18.2.4 milestone so it's sure to be
> > > picked up.
> >
> > Thanks a bunch. If you miss the train, you miss the train - fair enough.
> > Nice to know there is another one going soon and that the bug is going to
> > be on it!
> >
> >
> > Regards
> >
> > Christian


[ceph-users] Rocky 8 to Rocky 9 upgrade and ceph without data loss

2024-05-28 Thread Christopher Durham

I have both a small test cluster and a larger production cluster. They are
(were, for the test cluster) running Rocky Linux 8.9. Both were originally
upgraded from Pacific and are currently at reef 18.2.2. These are all rpm
installs.

It has come time to consider upgrades to Rocky 9.3. As there is no officially 
supported upgrade procedure from Rocky 8 to Rocky 9, I wanted to document my 
procedure for the test cluster (it worked!) before I move on to my production 
cluster, and verify that what I have done makes sense and ask a few questions. 
As such, this involves a reinstall of the OS.

For my test cluster, my mon nodes also have all the other services: ceph-mon,
ceph-mds, ceph-radosgw and ceph-mgr.
ALL of my OSDs are LVMs, and they each have wal devices, also as LVMs. My aim
was to preserve these across the reinstall of the OS, and I did. To be clear,
none of the OSDs are on the same physical disk as the OS and the
mon/mgr/mds/radosgw daemons, and most are grouped on their own node.

First, I set noout, norecover, and nobackfill:

# ceph osd set noout
# ceph osd set norecover
# ceph osd set nobackfill
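
For completeness, the matching unset commands go at the very end, once
everything is back up (not shown again below):

# ceph osd unset noout
# ceph osd unset norecover
# ceph osd unset nobackfill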

On one of the mon nodes, I did the following to save off the monmap, osdmap,
and crushmap, in case something went wildly wrong:

# systemctl stop ceph-mon.target
# ceph-mon -i  --extract-monmap /tmp/monmap.bin
# systemctl start ceph-mon.target
# ceph osd getcrushmap -o /tmp/crush.bin
# ceph osd getmap -o /tmp/osd.bin
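
Not needed for the upgrade itself, but if something does go wrong, the saved
maps can be inspected offline with the standard tools, e.g.:

# monmaptool --print /tmp/monmap.bin
# crushtool -d /tmp/crush.bin -o /tmp/crush.txt
# osdmaptool --print /tmp/osd.bin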

Then on EVERY node, both mon and OSD nodes, I tarred up /etc/ceph and
/var/lib/ceph:

# tar cvf /tmp/varlibceph.tar /etc/ceph /var/lib/ceph

I then saved off each tar file to a location not being reinstalled, so I had
a tarfile per OSD node and one for each mon node. I also saved off the
monmap, crushmap, and osdmap created above to the same location.

I then went sequentially through the mon/mds/radosgw/mgr servers, one at a
time. As I am using kickstart/cobbler, I told kickstart to ignore all disks
EXCEPT the one that the original 8.9 OS was installed on. For this I had to
use the following in the kickstart file:

ignoredisk --only-use=disk/by-path/pci-id-of-root-disk

I did this because, at least on my hardware, Rocky 9 reorders the drive
letters sda, sdb, sdc, etc. based on recognition order on *every* boot, which
might overwrite an OSD LVM if I wasn't careful enough.
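
The stable by-path name for the OS disk can be looked up before the reinstall
with, for example:

# ls -l /dev/disk/by-path/
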
Note that while I group some of the commands below, I only did the mon nodes
sequentially, while I grouped the OSD nodes together and did each group at
the same time, based on the failure domains in my crushmap.
After the Rocky 9 reinstall and a reboot:

I then installed the appropriate el9 18.2.2 ceph packages:

# dnf install ceph-mon (mon nodes)
# dnf install ceph-mds (mds nodes)
# dnf install ceph-radosgw (radosgw nodes)
# dnf install ceph-mgr (mgr nodes)
# dnf install ceph-osd (osd nodes)

For the various nodes, once the reinstall to Rocky 9 was done, I re-enabled
the firewall rules:

# firewall-cmd --add-service ceph-mon --permanent (for mon nodes)
# firewall-cmd --add-service ceph --permanent (for osd nodes)
# firewall-cmd --add-service https --permanent (for radosgw servers; I am running them on port 443, with certs)

and then of course restarted firewalld:

# systemctl restart firewalld
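
A quick way to confirm the rules took effect after the reload:

# firewall-cmd --list-services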

I then restored /etc/ceph and /var/lib/ceph on each node from that node's
tar file backup.
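
For reference, since tar strips the leading "/" on create, the restore is
just an extract from the root directory (the backup path below is
illustrative):

# tar xvf /path/to/varlibceph-nodename.tar -C /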

For the non-OSD nodes, I re-enabled services:
# systemctl enable ceph-mon@short-name-of-server.service (mon nodes)
# systemctl enable ceph-mds@short-name-of-server.service (mds nodes)
# systemctl enable ceph-radosgw@rgw.short-name-of-server.service (radosgw servers)
# systemctl enable ceph-mgr@short-name-of-server.service (mgr nodes)
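
Before moving on to the next mon node, quorum can be sanity-checked with,
e.g.:

# ceph mon stat
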
For the OSD nodes, "ceph-volume lvm list" shows this for each OSD, even after
reinstalling to Rocky 9 (which is good evidence that I installed on the
proper disk and did not overwrite an OSD):
# ceph-volume lvm list

  block device  
/dev/ceph-c5b97619-4184-4637-9b82-2575004dba41/osd-block-abb210f3-52cf-4d4b-8745-126dc57287da
  block uuid    lmxbj0-zgcB-5MwI-wpsf-9yut-3VlG-p9DoFL
  cephx lockbox secret  
  cluster fsid  0f3b6c81-3814-4dc5-a19e-f7307543e56c
  cluster name  ceph
  crush device class    
  encrypted 0
  osd fsid  abb210f3-52cf-4d4b-8745-126dc57287da
  osd id    7
  osdspec affinity  
  type  block
  vdo   0
  devices   /dev/sdj

I had to enable units of the form:

# systemctl enable ceph-volume@lvm-$osd_id-$osd_fsid.service

For the above OSD LVM, this would be:

# systemctl enable ceph-volume@lvm-7-abb210f3-52cf-4d4b-8745-126dc57287da.service

So if I had 10 OSD LVMs on a given node, I had to enable 10 services of the
form above.
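
A sketch of how all of these could be enabled in one pass, assuming jq is
available and that the JSON from "ceph-volume lvm list --format json" is
keyed by OSD id (that matches what I see here, but treat the exact layout as
an assumption):

  for id in $(ceph-volume lvm list --format json | jq -r 'keys[]'); do
      fsid=$(ceph-volume lvm list --format json | \
             jq -r --arg id "$id" '.[$id][] | select(.type=="block") | .tags["ceph.osd_fsid"]')
      systemctl enable "ceph-volume@lvm-${id}-${fsid}.service"
  done

I believe "ceph-volume lvm activate --all" would get to the same end state
(it recreates and enables these units), but I have not verified that on this
exact setup.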

After rebooting either a mon or an OSD node (depending on which server I am
doing), everything comes up just fine.

My questions:
1. Did I do anything 

[ceph-users] Re: ceph orch osd rm --zap --replace leaves cluster in odd state

2024-05-28 Thread Matthew Vernon

On 28/05/2024 17:07, Wesley Dillingham wrote:

What is the state of your PGs? Could you post "ceph -s"?


PGs all good:

root@moss-be1001:/# ceph -s
  cluster:
id: d7849d66-183c-11ef-b973-bc97e1bb7c18
health: HEALTH_WARN
1 stray daemon(s) not managed by cephadm

  services:
mon: 3 daemons, quorum moss-be1001,moss-be1003,moss-be1002 (age 6d)
mgr: moss-be1001.yibskr(active, since 6d), standbys: moss-be1003.rwdjgw
osd: 48 osds: 47 up (since 2d), 47 in (since 2d)

  data:
pools:   1 pools, 1 pgs
objects: 6 objects, 19 MiB
usage:   4.2 TiB used, 258 TiB / 263 TiB avail
pgs: 1 active+clean

The OSD is marked as "destroyed" in the osd tree:

root@moss-be1001:/# ceph osd tree | grep -E '^35'
35   hdd   3.75999   osd.35   destroyed   0   1.0

root@moss-be1001:/# ceph osd safe-to-destroy osd.35 ; echo $?
OSD(s) 35 are safe to destroy without reducing data durability.
0

I should have said - this is a reef 18.2.2 cluster, cephadm deployed.

Regards,

Matthew


[ceph-users] Re: ceph orch osd rm --zap --replace leaves cluster in odd state

2024-05-28 Thread Wesley Dillingham
What is the state of your PGs? Could you post "ceph -s"?

I believe (though this is a bit of an assumption after encountering something
similar myself) that under the hood cephadm is using the "ceph osd
safe-to-destroy osd.X" command, and when OSD.X is no longer running and not
all PGs are active+clean (for instance in an active+remapped state) the
safe-to-destroy command will return in the negative with a response along the
lines of "OSD.X not reporting stats, not all PGs are active+clean, cannot
draw any conclusions". The cephadm OSD removal will stall in that state until
all PGs reach active+clean.
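
For what it's worth, the state of cephadm's removal queue can also be
inspected directly, which makes that waiting stage visible:

ceph orch osd rm status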

Respectfully,

*Wes Dillingham*
LinkedIn 
w...@wesdillingham.com



On Tue, May 28, 2024 at 11:43 AM Matthew Vernon 
wrote:

> Hi,
>
> I want to prepare a failed disk for replacement. I did:
> ceph orch osd rm 35 --zap --replace
>
> and it's now in the state "Done, waiting for purge", with 0 pgs, and
> REPLACE and ZAP set to true. It's been like this for some hours, and now
> my cluster is unhappy:
>
> [WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
>  stray daemon osd.35 on host moss-be1002 not managed by cephadm
>
> (the OSD is down & out)
>
> ...and also neither the disk nor the relevant NVME LV has been zapped.
>
> I have my OSDs deployed via a spec:
> service_type: osd
> service_id: rrd_single_NVMe
> placement:
>label: "NVMe"
> spec:
>data_devices:
>  rotational: 1
>db_devices:
>  model: "NVMe"
>
> And before issuing the ceph orch osd rm I set that to be unmanaged (ceph
> orch set-unmanaged osd.rrd_single_NVMe), as obviously I don't want ceph
> to just try and re-make a new OSD on the sad disk.
>
> I'd expected from the docs[0] that what I did would leave me with a
> system ready for the failed disk to be swapped (and that I could then
> mark osd.rrd_single_NVMe as managed again, and a new OSD built),
> including removing/wiping the NVME lv so it can be removed.
>
> What did I do wrong? I don't much care about the OSD id (but obviously
> it's neater to not just incrementally increase OSD numbers every time a
> disk died), but I thought that telling ceph orch not to make new OSDs
> then using ceph orch osd rm to zap the disk and NVME lv would have been
> the way to go...
>
> Thanks,
>
> Matthew
>
> [0] https://docs.ceph.com/en/reef/cephadm/services/osd/#replacing-an-osd


[ceph-users] ceph orch osd rm --zap --replace leaves cluster in odd state

2024-05-28 Thread Matthew Vernon

Hi,

I want to prepare a failed disk for replacement. I did:
ceph orch osd rm 35 --zap --replace

and it's now in the state "Done, waiting for purge", with 0 pgs, and 
REPLACE and ZAP set to true. It's been like this for some hours, and now 
my cluster is unhappy:


[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
stray daemon osd.35 on host moss-be1002 not managed by cephadm

(the OSD is down & out)

...and also neither the disk nor the relevant NVME LV has been zapped.

I have my OSDs deployed via a spec:
service_type: osd
service_id: rrd_single_NVMe
placement:
  label: "NVMe"
spec:
  data_devices:
rotational: 1
  db_devices:
model: "NVMe"

And before issuing the ceph orch osd rm I set that to be unmanaged (ceph 
orch set-unmanaged osd.rrd_single_NVMe), as obviously I don't want ceph 
to just try and re-make a new OSD on the sad disk.
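
(For completeness: I think the spec as cephadm sees it, including the
unmanaged flag, can be confirmed with something like

ceph orch ls osd --export

which exports the service specs as YAML and should show "unmanaged: true".)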


I'd expected from the docs[0] that what I did would leave me with a 
system ready for the failed disk to be swapped (and that I could then 
mark osd.rrd_single_NVMe as managed again, and a new OSD built), 
including removing/wiping the NVME lv so it can be removed.


What did I do wrong? I don't much care about the OSD id (but obviously 
it's neater to not just incrementally increase OSD numbers every time a 
disk died), but I thought that telling ceph orch not to make new OSDs 
then using ceph orch osd rm to zap the disk and NVME lv would have been 
the way to go...


Thanks,

Matthew

[0] https://docs.ceph.com/en/reef/cephadm/services/osd/#replacing-an-osd


[ceph-users] Re: OSD processes crashes on repair 'unexpected clone'

2024-05-28 Thread Thomas Björklund
Sorry, not sure what happened with the formatting, pasting the whole
contents again.

Hi,

We have an old cluster with 3 nodes running ceph version 15.2.17.
We have a PG in state active+clean+inconsistent which we are unable to
repair.

It's an RBD pool in use by kubernetes.

The earliest indication of the issue comes from ceph-osd.4.log on one of
the nodes:

2024-05-23T14:34:38.003+0200 7f00b907c700  0 log_channel(cluster) log [DBG]
: 20.4 repair starts
2024-05-23T14:34:46.095+0200 7f00b907c700 -1 log_channel(cluster) log [ERR]
: 20.4 shard 4
20:20a41d01:stc::rbd_data.44feb146358046.7737:38e : missing
2024-05-23T14:34:46.095+0200 7f00b907c700 -1 log_channel(cluster) log [ERR]
: 20.4 shard 16
20:20a41d01:stc::rbd_data.44feb146358046.7737:38e : missing
2024-05-23T14:34:46.095+0200 7f00b907c700 -1 log_channel(cluster) log [ERR]
: repair 20.4 20:20a41d01:stc::rbd_data.44feb146358046.7737:38e
: is an unexpected clone
2024-05-23T14:34:46.095+0200 7f00b907c700 -1 osd.4 pg_epoch: 30768 pg[20.4(
v 30768'359661769 (30767'359657067,30768'359661769]
local-lis/les=30660/30665 n=10628 ec=28562/20971 lis/c=30660/30660
les/c/f=30665/30665/0 sis=30660) [4,10,16] r=0 lpr=30660
luod=30768'359661760 crt=30768'359661769 lcod 30768'359661759 mlcod
30768'359661759 active+clean+scrubbing+deep+inconsistent+repair REQ_SCRUB]
_scan_snaps no head for
20:20a41d01:stc::rbd_data.44feb146358046.7737:38e (have
20:20a42186:stc::rbd_data.44feb146358046.000425b6:head)
2024-05-23T14:35:27.712+0200 7f00b5074700 -1 log_channel(cluster) log [ERR]
: 20.4 repair 1 missing, 0 inconsistent objects
2024-05-23T14:35:27.712+0200 7f00b5074700 -1 log_channel(cluster) log [ERR]
: 20.4 repair 3 errors, 2 fixed
2024-05-23T14:35:27.720+0200 7f00b5074700 -1 ./src/osd/PrimaryLogPG.cc: In
function 'int PrimaryLogPG::recover_missing(const hobject_t&, eversion_t,
int, PGBackend::RecoveryHandle*)' thread 7f00b5074700 time
2024-05-23T14:35:27.718485+0200
./src/osd/PrimaryLogPG.cc: 11540: FAILED ceph_assert(head_obc)

 ceph version 15.2.17 (542df8d06ef24dbddcf4994db16bcc4c89c9ec2d) octopus
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x131) [0x55a740013abb]
 2: (()+0x9d5c46) [0x55a740013c46]
 3: (PrimaryLogPG::recover_missing(hobject_t const&, eversion_t, int,
PGBackend::RecoveryHandle*)+0x6cf) [0x55a7401fabbf]
 4: (PrimaryLogPG::recover_primary(unsigned long,
ThreadPool::TPHandle&)+0xf7e) [0x55a74023b16e]
 5: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&,
unsigned long*)+0x5ee) [0x55a740246b4e]
 6: (OSD::do_recovery(PG*, unsigned int, unsigned long,
ThreadPool::TPHandle&)+0x295) [0x55a7400c4c85]
 7: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*,
boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x19) [0x55a74031c229]
 8: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xf13) [0x55a7400e13c3]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a)
[0x55a74072ae6a]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a74072d410]
 11: (()+0x7ea7) [0x7f00d59e2ea7]
 12: (clone()+0x3f) [0x7f00d556ba6f]

2024-05-23T14:35:27.724+0200 7f00b5074700 -1 *** Caught signal (Aborted) **
 in thread 7f00b5074700 thread_name:tp_osd_tp

 ceph version 15.2.17 (542df8d06ef24dbddcf4994db16bcc4c89c9ec2d) octopus
(stable)
 1: (()+0x13140) [0x7f00d59ee140]
 2: (gsignal()+0x141) [0x7f00d54a8ce1]
 3: (abort()+0x123) [0x7f00d5492537]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x17b) [0x55a740013b05]
 5: (()+0x9d5c46) [0x55a740013c46]
 6: (PrimaryLogPG::recover_missing(hobject_t const&, eversion_t, int,
PGBackend::RecoveryHandle*)+0x6cf) [0x55a7401fabbf]
 7: (PrimaryLogPG::recover_primary(unsigned long,
ThreadPool::TPHandle&)+0xf7e) [0x55a74023b16e]
 8: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&,
unsigned long*)+0x5ee) [0x55a740246b4e]
 9: (OSD::do_recovery(PG*, unsigned int, unsigned long,
ThreadPool::TPHandle&)+0x295) [0x55a7400c4c85]
 10: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*,
boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x19) [0x55a74031c229]
 11: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xf13) [0x55a7400e13c3]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a)
[0x55a74072ae6a]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a74072d410]
 14: (()+0x7ea7) [0x7f00d59e2ea7]
 15: (clone()+0x3f) [0x7f00d556ba6f]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

Every time we try to start a repair on 20.4, all three OSDs crash (4, 10 &
16).
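
As a side note, a read-only diagnostic that should not trigger the repair
path is to list the scrub errors recorded for the PG (this needs a reasonably
recent deep scrub):

rados list-inconsistent-obj 20.4 --format=json-pretty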

2024-05-28T11:14:30.573+0200 7f4d3405a700  0 log_channel(cluster) log [DBG]
: 20.4 repair starts
2024-05-28T11:14:39.481+0200 7f4d30052700 -1 log_channel(cluster) log [ERR]
: 20.4 shard 10 soid
20:20a41d01:stc::rbd_data.44feb146358046.7737:38e : candidate
had a read error
2024-05-28T11:14:39.481+0200 

[ceph-users] Help needed! First MDs crashing, then MONs. How to recover ?

2024-05-28 Thread Noe P.
Hi,

we ran into a bigger problem today with our ceph cluster (Quincy,
Alma8.9).
We have 4 filesystems and a total of 6 MDSs, the largest fs having
two ranks assigned (i.e. one standby).

Since we often have the problem of MDSs lagging behind, we restart
the MDSs occasionally. That usually helps, with the standby taking over.

Today however, the restart didn't work and the rank 1 MDS started to
crash for unclear reasons. Rank 0 seemed OK.

We decided at some point to go back to one rank by setting max_mds to 1.
Due to the permanent crashing, the rank 1 MDS didn't stop, however, and at
some point we set it to failed and made the fs not joinable.

At this point it looked like this:
 fs_cluster - 716 clients
 ==========
 RANK  STATE   MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
  0    active  cephmd6a  Reqs: 0 /s   13.1M  13.1M  1419k  79.2k
  1    failed
       POOL          TYPE      USED   AVAIL
 fs_cluster_meta   metadata   1791G  54.2T
 fs_cluster_data   data        421T  54.2T

with rank 1 still being listed.

The next attempt was to remove that failed rank:

   ceph mds rmfailed fs_cluster:1 --yes-i-really-mean-it

which, after a short while, brought down three of the five MONs.
They keep crashing shortly after restart with stack traces like this:

ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy 
(stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7ff8813adcf0]
2: gsignal()
3: abort()
4: /lib64/libstdc++.so.6(+0x9009b) [0x7ff8809bf09b]
5: /lib64/libstdc++.so.6(+0x9654c) [0x7ff8809c554c]
6: /lib64/libstdc++.so.6(+0x965a7) [0x7ff8809c55a7]
7: /lib64/libstdc++.so.6(+0x96808) [0x7ff8809c5808]
8: /lib64/libstdc++.so.6(+0x92045) [0x7ff8809c1045]
9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xa9e) [0x55f05d9a5e8e]
10: (MDSMonitor::tick()+0x18a) [0x55f05d9b18da]
11: (MDSMonitor::on_active()+0x2c) [0x55f05d99a17c]
12: (Context::complete(int)+0xd) [0x55f05d76c56d]
13: (void finish_contexts > >(ceph::common::CephContext*, std::__cxx11::list >&, int)+0x9d) [0x55f05d799d7d]
14: (Paxos::finish_round()+0x74) [0x55f05d8c5c24]
15: (Paxos::dispatch(boost::intrusive_ptr)+0x41b) 
[0x55f05d8c7e5b]
16: (Monitor::dispatch_op(boost::intrusive_ptr)+0x123e) 
[0x55f05d76a2ae]
17: (Monitor::_ms_dispatch(Message*)+0x406) [0x55f05d76a976]
18: (Dispatcher::ms_dispatch2(boost::intrusive_ptr const&)+0x5d) 
[0x55f05d79b3ed]
19: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr 
const&)+0x478) [0x7ff88367fed8]
20: (DispatchQueue::entry()+0x50f) [0x7ff88367d31f]
21: (DispatchQueue::DispatchThread::entry()+0x11) [0x7ff883747381]
22: /lib64/libpthread.so.0(+0x81ca) [0x7ff8813a31ca]
23: clone()
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

The MDSMonitor::maybe_resize_cluster frame suggests a connection to the
mds rmfailed operation above.

Does anyone have an idea how to get this cluster back together again? Like
manually fixing the MDS ranks?

'fs get' can still be called in the short moments when enough MONs are reachable:

   fs_name fs_cluster
   epoch   3065486
   flags   13 allow_snaps allow_multimds_snaps
   created 2022-08-26T15:55:07.186477+0200
   modified    2024-05-28T12:21:59.294364+0200
   tableserver 0
   root    0
   session_timeout 60
   session_autoclose   300
   max_file_size   4398046511104
   required_client_features{}
   last_failure    0
   last_failure_osd_epoch  1777109
   compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no 
anchor table,9=file layout v2,10=snaprealm v2}
   max_mds 1
   in  0,1
   up  {0=911794623}
   failed
   damaged
   stopped 2,3
   data_pools  [32]
   metadata_pool   33
   inline_data disabled
   balancer
   standby_count_wanted    1
   [mds.cephmd6a{0:911794623} state up:active seq 22777 addr 
[v2:10.13.5.6:6800/189084355,v1:10.13.5.6:6801/189084355] compat 
{c=[1],r=[1],i=[7ff]}]


Regards,
  Noe


[ceph-users] Re: Safe method to perform failback for RBD on one way mirroring.

2024-05-28 Thread Eugen Block

Hi,

I think there might be a misunderstanding about one-way mirroring. It
really only mirrors one way, from A to B. In case site A fails, you
can promote the images in B and continue using those images. But
there's no automated way back, because it's only one way. When site A
comes back, you would need to sync the changes from B manually (or in
some self-scripted way).
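
One self-scripted option for that manual sync, once site A is reachable
again, is a plain export/import of the affected images (the pool/image names
below are placeholders, and rbd import will refuse to overwrite an existing
image, so the target would need to be removed or renamed first):

rbd export pool/image - | ssh site-a 'rbd import - pool/image'

For incremental transfers against a common snapshot, rbd export-diff and
rbd import-diff can be piped the same way.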


Regards,
Eugen

Quoting Saif Mohammad :


Hello Everyone

We have Clusters in production with the following configuration:
Cluster-A :  quincy v17.2.5
Cluster-B :  quincy v17.2.5
All images in a pool have the snapshot feature enabled and are mirrored.
Each site has 3 daemons.

We're testing disaster recovery with one-way mirroring in our block
device mirroring setup. On the primary site (Cluster-A) we have Ceph
clients attached, and a couple of images are present there. These
images are replicated to the secondary site (Cluster-B).
During testing we've successfully conducted failovers, with all
resources accessible from Cluster-B once the Ceph client is attached
there on the secondary site.


However, during failback (restoring the primary site), we've  
encountered an issue. Data that was pushed from the secondary site  
seems to be deleted, while data originally present only on the  
primary site remains intact. Here are the steps we took during  
failback:

- Detached the client from Cluster-B.
- Ensured that on Cluster-B, "mirroring primary" is set to true, and on
Cluster-A, it's set to false.
- Demoted the images on Cluster-B and promoted the images on Cluster-A.

After performing these steps, our images went into an error state,  
and they started syncing from Cluster-A to Cluster-B. However, in a  
failback scenario, the direction should be from Cluster-B to  
Cluster-A.


We are not sure where we are making a mistake. Could anybody please
advise on the correct procedure for failback in one-way mirroring and
the safest way to execute it without impacting our data?


Regards,
Mohammad Saif