[ceph-users] LVM osds lose connection to disk

2022-10-08 Thread Frank Schilder
Hi all,

we are facing a very annoying and disruptive problem. This happens only on a 
single type of disk:

Vendor:   TOSHIBA
Product:  PX05SMB040Y
Revision: AS10
Compliance:   SPC-4
User Capacity:400,088,457,216 bytes [400 GB]

schedulers: mq-deadline kyber [bfq] none

The default for these disks is none. Could this be a problem?
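
For reference, a minimal sketch of how the scheduler can be checked and switched per device via sysfs (sdX is a placeholder for the device backing one of these OSDs; the change takes effect immediately but does not persist across reboots):

# show the available schedulers; the active one is shown in brackets
cat /sys/block/sdX/queue/scheduler
# switch to mq-deadline as a test
echo mq-deadline > /sys/block/sdX/queue/scheduler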

On these disks we have 4 OSDs deployed (yes, the ones that ran out of space
during conversion). These disks hold our ceph fs metadata. Currently there is
no load; we unmounted all clients due to problems during OSD conversions. The
problem seems more likely under high load, but it also happens with very little
load, like we have now.

We run the OSD daemons inside a Centos8 container built from 
quay.io/ceph/ceph:v15.2.17 on a Centos7 host with kernel version

# uname -r
5.14.13-1.el7.elrepo.x86_64

The lvm versions on the host and inside the container are almost identical:

[host]# yum list installed | grep lvm
lvm2.x86_64  7:2.02.187-6.el7_9.5   @updates
lvm2-libs.x86_64 7:2.02.187-6.el7_9.5   @updates

[con]# yum list installed | grep lvm
lvm2.x86_64       8:2.03.14-5.el8        @baseos
lvm2-libs.x86_64  8:2.03.14-5.el8        @baseos

We have >1000 OSDs and only the OSDs on these disks are causing trouble. The 
symptom is as if the disk suddenly gets stuck and does not accept IO any more. 
Trying to kill the hanging OSD daemons puts them in D-state.
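
As a quick check of what the stuck daemons are actually blocked on, something like the following sketch can help (output formats vary by kernel; the strings to grep for are examples, not the exact messages from this system):

# list processes in uninterruptible sleep and the kernel function they wait in
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# hung-task and SCSI error details for the disk usually end up in the kernel log
dmesg -T | grep -iE 'blocked for more than|I/O error|reset'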

The very, very odd thing is that ceph did not recognise all 4 down OSDs
correctly. 1 out of 4 OSDs crashed (see log below) and the other 3 OSD daemons
got stuck. These 3 stuck daemons were marked as down. However, the one that
crashed was *not* marked as down even though it was dead for good (its process
no longer showed up in ps; the other 3 did). This caused IO to hang, and I
don't understand how it is possible that this OSD was not recognised as down.
There must be plenty of reporters. I see a few messages like this (osd.975
crashed):

Oct  8 16:08:54 ceph-13 ceph-osd: 2022-10-08T16:08:54.913+0200 7f942817b700 -1 
osd.990 912445 heartbeat_check: no reply from 192.168.32.88:7079 osd.975 since 
back 2022-10-08T16:08:34.029625+0200 front 2022-10-08T16:08:34.029288+0200 
(oldest deadline 2022-10-08T16:08:54.528209+0200)
[...]
Oct  8 16:08:56 ceph-08 journal: 2022-10-08T16:08:56.195+0200 7fb85ce4d700 -1 
osd.352 912445 heartbeat_check: no reply from 192.168.32.88:7079 osd.975 since 
back 2022-10-08T16:08:31.763519+0200 front 2022-10-08T16:08:31.764077+0200 
(oldest deadline 2022-10-08T16:08:55.861407+0200)

But nothing happened. Here is some OSD log info:

This is where everything starts:
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [default] **

2022-10-08T16:08:34.439+0200 7fbdf567a700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fbdd1dc3700' had timed out after 15
2022-10-08T16:08:34.440+0200 7fbdf4e79700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fbdd1dc3700' had timed out after 15
[... loads and loads of these ...]
2022-10-08T16:10:51.065+0200 7fbdf4678700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fbdd1dc3700' had suicide timed out after 150
2022-10-08T16:10:52.072+0200 7fbdf4678700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, ceph::coarse_mono_clock::rep)' thread 7fbdf4678700 time 2022-10-08T16:10:52.065768+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/HeartbeatMap.cc: 80: ceph_abort_msg("hit suicide timeout")

 ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x556b9b10cb32]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x295) [0x556b9b82c795]
 3: (ceph::HeartbeatMap::is_healthy()+0x112) [0x556b9b82d292]
 4: (OSD::handle_osd_ping(MOSDPing*)+0xc2f) [0x556b9b1e253f]
 5: (OSD::heartbeat_dispatch(Message*)+0x1db) [0x556b9b1e44eb]
 6: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x155) [0x556b9bb83aa5]
 7: (Protoco
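
For reference, a hedged sketch of what can be checked or done while a dead OSD is wrongly still shown as up (osd id taken from the logs above; option names as in octopus):

# force the mons to mark the dead OSD down while investigating
ceph osd down 975
# how many distinct reporters the mons need before marking an OSD down
ceph config get mon mon_osd_min_down_reporters
ceph config get mon mon_osd_reporter_subtree_level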

[ceph-users] Invalid crush class

2022-10-08 Thread Michael Thomas
In 15.2.7, how can I remove an invalid crush class?  I'm surprised that 
I was able to create it in the first place:


[root@ceph1 bin]# ceph osd crush class ls
[
"ssd",
"JBOD.hdd",
"nvme",
"hdd"
]


[root@ceph1 bin]# ceph osd crush class ls-osd JBOD.hdd
Invalid command: invalid chars . in JBOD.hdd
osd crush class ls-osd <class> :  list all osds belonging to the specific <class>
Error EINVAL: invalid command

There are no devices mapped to this class:

[root@ceph1 bin]# ceph osd crush tree | grep JBOD | wc -l
0
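
One possible path, sketched here without having been tested against this exact case: since the CLI validation rejects the '.' in the class name, inspect the raw CRUSH map and, if needed, remove the class from the decompiled map directly. Diff the old and new maps before injecting, since any other change will trigger data movement.

# see where the class shows up in the raw crush map
ceph osd crush dump | grep -B2 -A2 JBOD
# decompile, edit out the class / its shadow buckets, recompile, inject
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# (edit crushmap.txt by hand, then)
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin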

--Mike


[ceph-users] How to check which directory has ephemeral pinning set?

2022-10-08 Thread Frank Schilder
Hi all,

I believe I enabled ephemeral pinning on a home dir, but I can't figure out how
to check that it's working. Here is my attempt:

Set the flag:
# setfattr -n ceph.dir.pin.distributed -v 1 /mnt/admin/cephfs/hpc/home

Try to read it:
# getfattr -n ceph.dir.pin.distributed  /mnt/admin/cephfs/hpc/home
/mnt/admin/cephfs/hpc/home: ceph.dir.pin.distributed: No such attribute

Hmm. ???

Well, at least the first command might have done something, because this fails:
# setfattr -n ceph.dir.pin.distrid -v 1 /mnt/admin/cephfs/hpc/groups
setfattr: /mnt/admin/cephfs/hpc/groups: Invalid argument

What is the right way to confirm it's working?
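
One hedged way to cross-check, assuming the getfattr failure is just the (kernel) client not exposing that vxattr rather than the pin not being set: ask the MDSs directly how the subtrees are distributed.

# subtree/export-pin state as seen by rank 0 (repeat for the other active ranks)
ceph tell mds.0 get subtrees | less
# with distributed pinning working, subdirectories of the pinned directory
# should end up spread across the active ranks
ceph fs status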

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] Re: recurring stat mismatch on PG

2022-10-08 Thread Dan van der Ster
It's not necessarily a bug... Running deep scrub again will just tell you
the current state of the PG. That's safe any time.

If it comes back inconsistent again, I'd repair the PG again, let it
finish completely, then scrub once again to double check that the repair
worked.

Thinking back, I've seen PG 1fff have scrub errors like this in the past,
but not recently, indicating it was a listing bug of some sort. Perhaps
this is just a leftover stats error from a bug in mimic, and the complete
repair will fix this fully for you.

(Btw, I've never had a stats error like this result in a visible issue.
Repair should probably fix this transparently).

.. Dan
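
For reference, a minimal sketch of that cycle on the PG in question (deep-scrub and repair are scheduled asynchronously, so each step only takes effect once the OSDs pick it up):

ceph pg deep-scrub 19.1fff
# once the scrub has finished, see what is recorded as inconsistent
# (may be empty for a pure stat mismatch)
rados list-inconsistent-obj 19.1fff --format=json-pretty
ceph pg repair 19.1fff
# after the repair completes, deep-scrub once more and re-check
ceph health detail | grep 19.1fff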



On Sat, Oct 8, 2022, 11:27 Frank Schilder  wrote:

> Yes, primary OSD. Extracted with grep -e scrub -e repair -e 19.1fff
> /var/log/ceph/ceph-osd.338.log and then only relevant lines copied.
>
> Yes, according to the case I should just run a deep-scrub and should see.
> I guess if this error was cleared on an aborted repair, this would be a new
> bug? I will do a deep-scrub and report back.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 08 October 2022 11:18:37
> To: Frank Schilder
> Cc: Ceph Users
> Subject: Re: [ceph-users] recurring stat mismatch on PG
>
> Is that the log from the primary OSD?
>
> About the restart, you should probably just deep-scrub again to see the
> current state.
>
>
> .. Dan
>
>
>
> On Sat, Oct 8, 2022, 11:14 Frank Schilder <fr...@dtu.dk> wrote:
> Hi Dan,
>
> yes, 15.2.17. I remember that case and was expecting it to be fixed. Here
> a relevant extract from the log:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
> 2022-10-08T10:38:20.618+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff repair starts
> 2022-10-08T10:54:25.801+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 repair : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:54:25.802+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff repair 1 errors, 1 fixed
>
> Just completed a repair and it's gone for now. As an alternative
> explanation, we had this scrub error, I started a repair but then OSDs in
> that PG were shut down and restarted. Is it possible that the repair was
> cancelled and the error cleared erroneously?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster <dvand...@gmail.com>
> Sent: 08 October 2022 11:03:05
> To: Frank Schilder
> Cc: Ceph Users
> Subject: Re: [ceph-users] recurring stat mismatch on PG
>
> Hi,
>
> Is that 15.2.17? It reminds me of this bug -
> https://tracker.ceph.com/issues/52705 - where an object with a particular
> name would hash to  and cause a stat mismatch during scrub. But
> 15.2.17 should have the fix for that.
>
>
> Can you find the relevant osd log for more info?
>
> .. Dan
>
>
>
> On Sat, Oct 8, 2022, 10:42 Frank Schilder <fr...@dtu.dk> wrote:
> Hi all,
>
> I seem to observe something strange on an octopus(latest) cluster. We have
> a PG with a stat mismatch:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
>
> This exact same mismatch was found before and I executed a pg-repair that
> fixed it. Now it's back. Does anyone have an idea why this might be
> happening and how to deal with it?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14

[ceph-users] Re: recurring stat mismatch on PG

2022-10-08 Thread Frank Schilder
Yes, primary OSD. Extracted with grep -e scrub -e repair -e 19.1fff 
/var/log/ceph/ceph-osd.338.log and then only relevant lines copied.

Yes, according to the case I should just run a deep-scrub and should see. I 
guess if this error was cleared on an aborted repair, this would be a new bug? 
I will do a deep-scrub and report back.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 08 October 2022 11:18:37
To: Frank Schilder
Cc: Ceph Users
Subject: Re: [ceph-users] recurring stat mismatch on PG

Is that the log from the primary OSD?

About the restart, you should probably just deep-scrub again to see the current 
state.


.. Dan



On Sat, Oct 8, 2022, 11:14 Frank Schilder <fr...@dtu.dk> wrote:
Hi Dan,

yes, 15.2.17. I remember that case and was expecting it to be fixed. Here a 
relevant extract from the log:

2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log [DBG] : 
19.1fff deep-scrub starts
2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects, 1243/1243 
clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1215/1215 
whiteouts, 170978253582/170974059278 bytes, 0/0 manifest objects, 0/0 
hit_set_archive bytes.
2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fff deep-scrub 1 errors
2022-10-08T10:38:20.618+0200 7fa3c48c7700  0 log_channel(cluster) log [DBG] : 
19.1fff repair starts
2022-10-08T10:54:25.801+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fffs0 repair : stat mismatch, got 64532/64531 objects, 1243/1243 clones, 
64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1215/1215 
whiteouts, 170978253582/170974059278 bytes, 0/0 manifest objects, 0/0 
hit_set_archive bytes.
2022-10-08T10:54:25.802+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fff repair 1 errors, 1 fixed

Just completed a repair and it's gone for now. As an alternative explanation, we
had this scrub error, I started a repair but then OSDs in that PG were shut 
down and restarted. Is it possible that the repair was cancelled and the error 
cleared erroneously?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster <dvand...@gmail.com>
Sent: 08 October 2022 11:03:05
To: Frank Schilder
Cc: Ceph Users
Subject: Re: [ceph-users] recurring stat mismatch on PG

Hi,

Is that 15.2.17? It reminds me of this bug - 
https://tracker.ceph.com/issues/52705 - where an object with a particular name 
would hash to  and cause a stat mismatch during scrub. But 15.2.17 
should have the fix for that.


Can you find the relevant osd log for more info?

.. Dan



On Sat, Oct 8, 2022, 10:42 Frank Schilder <fr...@dtu.dk> wrote:
Hi all,

I seem to observe something strange on an octopus(latest) cluster. We have a PG 
with a stat mismatch:

2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log [DBG] : 
19.1fff deep-scrub starts
2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects, 1243/1243 
clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1215/1215 
whiteouts, 170978253582/170974059278 bytes, 0/0 manifest objects, 0/0 
hit_set_archive bytes.
2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fff deep-scrub 1 errors

This exact same mismatch was found before and I executed a pg-repair that fixed 
it. Now it's back. Does anyone have an idea why this might be happening and how
to deal with it?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] Re: iscsi deprecation

2022-10-08 Thread Lucian Petrut
Hi,

As Ilya mentioned, RBD is natively supported on Windows since Pacific. 
Furthermore, we’re about to add Persistent Reservations support, which is going 
to enable Microsoft Failover Cluster and CSV support.

Regards,
Lucian
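
For context, a minimal sketch of what consuming RBD natively from a Windows host looks like with the Ceph for Windows client (pool and image names are made up for illustration):

# from an elevated prompt on the Windows host, with ceph.conf/keyring in place
rbd create rbd/win-vol01 --size 100G
rbd device map rbd/win-vol01
# the image shows up as a local disk that can be initialised in Disk Management
rbd device list
rbd device unmap rbd/win-vol01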

From: Maged Mokhtar
Sent: Friday, October 7, 2022 5:05 PM
To: Filipe Mendes; 
ceph-users@ceph.io
Subject: [ceph-users] Re: iscsi deprecation


You can try PetaSAN

www.petasan.org

We are an open source solution on top of Ceph. We provide scalable
active/active iSCSI which supports VMware VAAI and Microsoft Cluster
Shared Volumes for Hyper-V clustering.

Cheers /maged

On 30/09/2022 19:36, Filipe Mendes wrote:
> Hello!
>
>
> I'm considering switching my current storage solution to CEPH. Today we use
> iscsi as a communication protocol and we use several different hypervisors:
> VMware, hyper-v, xcp-ng, etc.
>
>
> I was reading that the current version of CEPH has discontinued iscsi
> support in favor of RBD or Nvmeof. I imagine there are thousands of
> projects in production using different hypervisors connecting to ceph via
> iscsi, so I was surprised that I did not find much discussion on the topic in
> forums or mailing lists, since so many projects depend on both ceph + iscsi,
> and that RBD only communicates well with Proxmox or OpenStack. Also, nvmeof
> is not fully supported by ceph or by many other popular hypervisors.
>
>
> So is the trend that other hypervisors will start to support RBD over time,
> or that they will start to support nvmeof once ceph implements it stably?
>
>
> Am I missing something, or maybe mixing things up?
>
> Filipe


[ceph-users] Re: recurring stat mismatch on PG

2022-10-08 Thread Dan van der Ster
Is that the log from the primary OSD?

About the restart, you should probably just deep-scrub again to see the
current state.


.. Dan



On Sat, Oct 8, 2022, 11:14 Frank Schilder  wrote:

> Hi Dan,
>
> yes, 15.2.17. I remember that case and was expecting it to be fixed. Here
> a relevant extract from the log:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
> 2022-10-08T10:38:20.618+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff repair starts
> 2022-10-08T10:54:25.801+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 repair : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:54:25.802+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff repair 1 errors, 1 fixed
>
> Just completed a repair and it's gone for now. As an alternative
> explanation, we had this scrub error, I started a repair but then OSDs in
> that PG were shut down and restarted. Is it possible that the repair was
> cancelled and the error cleared erroneously?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 08 October 2022 11:03:05
> To: Frank Schilder
> Cc: Ceph Users
> Subject: Re: [ceph-users] recurring stat mismatch on PG
>
> Hi,
>
> Is that 15.2.17? It reminds me of this bug -
> https://tracker.ceph.com/issues/52705 - where an object with a particular
> name would hash to  and cause a stat mismatch during scrub. But
> 15.2.17 should have the fix for that.
>
>
> Can you find the relevant osd log for more info?
>
> .. Dan
>
>
>
> On Sat, Oct 8, 2022, 10:42 Frank Schilder <fr...@dtu.dk> wrote:
> Hi all,
>
> I seem to observe something strange on an octopus(latest) cluster. We have
> a PG with a stat mismatch:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
>
> This exact same mismatch was found before and I executed a pg-repair that
> fixed it. Now it's back. Does anyone have an idea why this might be
> happening and how to deal with it?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14


[ceph-users] Re: recurring stat mismatch on PG

2022-10-08 Thread Frank Schilder
Hi Dan,

yes, 15.2.17. I remember that case and was expecting it to be fixed. Here a 
relevant extract from the log:

2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log [DBG] : 
19.1fff deep-scrub starts
2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects, 1243/1243 
clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1215/1215 
whiteouts, 170978253582/170974059278 bytes, 0/0 manifest objects, 0/0 
hit_set_archive bytes.
2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fff deep-scrub 1 errors
2022-10-08T10:38:20.618+0200 7fa3c48c7700  0 log_channel(cluster) log [DBG] : 
19.1fff repair starts
2022-10-08T10:54:25.801+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fffs0 repair : stat mismatch, got 64532/64531 objects, 1243/1243 clones, 
64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1215/1215 
whiteouts, 170978253582/170974059278 bytes, 0/0 manifest objects, 0/0 
hit_set_archive bytes.
2022-10-08T10:54:25.802+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fff repair 1 errors, 1 fixed

Just completed a repair and it's gone for now. As an alternative explanation, we
had this scrub error, I started a repair but then OSDs in that PG were shut 
down and restarted. Is it possible that the repair was cancelled and the error 
cleared erroneously?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 08 October 2022 11:03:05
To: Frank Schilder
Cc: Ceph Users
Subject: Re: [ceph-users] recurring stat mismatch on PG

Hi,

Is that 15.2.17? It reminds me of this bug - 
https://tracker.ceph.com/issues/52705 - where an object with a particular name 
would hash to  and cause a stat mismatch during scrub. But 15.2.17 
should have the fix for that.


Can you find the relevant osd log for more info?

.. Dan



On Sat, Oct 8, 2022, 10:42 Frank Schilder <fr...@dtu.dk> wrote:
Hi all,

I seem to observe something strange on an octopus(latest) cluster. We have a PG 
with a stat mismatch:

2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log [DBG] : 
19.1fff deep-scrub starts
2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects, 1243/1243 
clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1215/1215 
whiteouts, 170978253582/170974059278 bytes, 0/0 manifest objects, 0/0 
hit_set_archive bytes.
2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fff deep-scrub 1 errors

This exact same mismatch was found before and I executed a pg-repair that fixed 
it. Now it's back. Does anyone have an idea why this might be happening and how
to deal with it?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] Re: recurring stat mismatch on PG

2022-10-08 Thread Dan van der Ster
Hi,

Is that 15.2.17? It reminds me of this bug -
https://tracker.ceph.com/issues/52705 - where an object with a particular
name would hash to  and cause a stat mismatch during scrub. But
15.2.17 should have the fix for that.


Can you find the relevant osd log for more info?

.. Dan



On Sat, Oct 8, 2022, 10:42 Frank Schilder  wrote:

> Hi all,
>
> I seem to observe something strange on an octopus(latest) cluster. We have
> a PG with a stat mismatch:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
>
> This exact same mismatch was found before and I executed a pg-repair that
> fixed it. Now it's back. Does anyone have an idea why this might be
> happening and how to deal with it?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14


[ceph-users] recurring stat mismatch on PG

2022-10-08 Thread Frank Schilder
Hi all,

I seem to observe something strange on an octopus(latest) cluster. We have a PG 
with a stat mismatch:

2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log [DBG] : 
19.1fff deep-scrub starts
2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects, 1243/1243 
clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1215/1215 
whiteouts, 170978253582/170974059278 bytes, 0/0 manifest objects, 0/0 
hit_set_archive bytes.
2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log [ERR] : 
19.1fff deep-scrub 1 errors

This exact same mismatch was found before and I executed a pg-repair that fixed 
it. Now it's back. Does anyone have an idea why this might be happening and how
to deal with it?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14