[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-28 Thread Eugen Block

Hi,

great that you found a solution. Maybe that will also help you get rid
of the cache tier entirely?


Quoting Cedric:


Hello,

Sorry for the late reply. So yes, we finally found a solution, which
was to split the cache_pool off onto dedicated OSDs. This cleared the
slow ops and allowed the cluster to serve clients again after 5 days
of lockdown; fortunately the majority of VMs resumed fine, thanks to
the virtio driver, which does not seem to have any timeout.


It seems that at least one of the main culprits was storing both the
cold and hot data pools on the same OSDs (which in the end makes
total sense). Maybe some of the other actions we took also had an
effect; we are still trying to troubleshoot the root cause of the
slow ops. Oddly, this was the 5th cluster we upgraded and all have
almost the same configuration, but this one handles 5x more workload.


In the hope it helps.

Cédric


On 26 Feb 2024, at 10:57, Eugen Block  wrote:

Hi,

thanks for the context. Was there any progress over the weekend?  
The hanging commands seem to be MGR related, and there's only one  
in your cluster according to your output. Can you deploy a second  
one manually, then adopt it with cephadm? Can you add 'ceph  
versions' as well?
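
For reference, a rough sketch of how that could look (the mgr name "mon2" is
a hypothetical example, and the keyring caps follow the manual ceph-mgr setup;
adjust names and paths to the actual host):

~~~
# hedged sketch: bring up a second mgr the legacy way, then adopt it with cephadm
# (the name "mon2" is an assumption - use the actual hostname)
mkdir -p /var/lib/ceph/mgr/ceph-mon2
ceph auth get-or-create mgr.mon2 mon 'allow profile mgr' osd 'allow *' mds 'allow *' \
    -o /var/lib/ceph/mgr/ceph-mon2/keyring
ceph-mgr -i mon2
# once it is running, let cephadm take it over
cephadm adopt --style legacy --name mgr.mon2
# and the requested version overview
ceph versions
~~~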



Quoting florian.le...@socgen.com:


Hi,
A bit of history might help to understand why we have the cache tier.

We have been running OpenStack on top of Ceph for many years now (we
started with Mimic, then upgraded to Nautilus 2 years ago, and today
upgraded to Pacific). At the beginning of the setup we had a mix of
HDD+SSD devices in HCI mode for OpenStack Nova. After the upgrade to
Nautilus, we did a hardware refresh with brand new NVMe devices and
transitioned from mixed devices to NVMe. But we were never able to
evict all the data from the vms_cache pools (even being aggressive
with the eviction; the last resort would have been to stop all the
virtual instances, and that was not an option for our customers), so
we decided to move on, set cache-mode proxy and serve data only from
NVMe since then. And it's been like this for a year and a half.


But today, after the upgrade, the situation is that we cannot query
any stats (with ceph pg x.x query), rados queries hang, and scrubs
hang even though all PGs are "active+clean". There is no client,
recovery or rebalance activity reported by the cluster. Some other
commands hang as well, e.g. "ceph balancer status".


--
bash-4.2$ ceph -s
 cluster:
   id: 
   health: HEALTH_WARN
   mon is allowing insecure global_id reclaim
   noscrub,nodeep-scrub,nosnaptrim flag(s) set
   18432 slow ops, oldest one blocked for 7626 sec,  
daemons  
[osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.120,osd.122]... have slow  
ops.


 services:
   mon: 3 daemons, quorum mon1,mon2,mon3(age 36m)
   mgr: bm9612541(active, since 39m)
   osd: 72 osds: 72 up (since 97m), 72 in (since 9h)
flags noscrub,nodeep-scrub,nosnaptrim

 data:
   pools:   8 pools, 2409 pgs
   objects: 14.64M objects, 92 TiB
   usage:   276 TiB used, 143 TiB / 419 TiB avail
   pgs: 2409 active+clean
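
For reference, a minimal sketch of standard admin-socket queries that can show
what those blocked ops are waiting on; osd.0 is just one of the daemons named
in the warning above, and the commands must run on that OSD's host:

~~~
# hedged sketch: inspect the slow/blocked ops on one of the affected OSDs
ceph daemon osd.0 dump_ops_in_flight
ceph daemon osd.0 dump_blocked_ops
ceph daemon osd.0 dump_historic_ops
~~~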



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-28 Thread Cedric
Hello,

Sorry for the late reply. So yes, we finally found a solution, which was to 
split the cache_pool off onto dedicated OSDs. This cleared the slow ops and 
allowed the cluster to serve clients again after 5 days of lockdown; fortunately 
the majority of VMs resumed fine, thanks to the virtio driver, which does not 
seem to have any timeout.

It seems that at least one of the main culprits was storing both the cold and 
hot data pools on the same OSDs (which in the end makes total sense). Maybe some 
of the other actions we took also had an effect; we are still trying to 
troubleshoot the root cause of the slow ops. Oddly, this was the 5th cluster we 
upgraded and all have almost the same configuration, but this one handles 5x 
more workload.
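
For illustration only, a minimal sketch of one way such a split can be done,
by giving the cache pool its own CRUSH rule and device class; the OSD ids, the
class name "cache-nvme" and the pool name "vms_cache" are assumptions, not
necessarily what was done here:

~~~
# hedged sketch: move the cache pool onto dedicated OSDs via its own CRUSH rule
ceph osd crush rm-device-class osd.10 osd.11 osd.12
ceph osd crush set-device-class cache-nvme osd.10 osd.11 osd.12
ceph osd crush rule create-replicated cache-rule default host cache-nvme
ceph osd pool set vms_cache crush_rule cache-rule
# data then migrates off the shared OSDs as the new rule takes effect
~~~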

In the hope it helps.

Cédric

> On 26 Feb 2024, at 10:57, Eugen Block  wrote:
> 
> Hi,
> 
> thanks for the context. Was there any progress over the weekend? The hanging 
> commands seem to be MGR related, and there's only one in your cluster 
> according to your output. Can you deploy a second one manually, then adopt it 
> with cephadm? Can you add 'ceph versions' as well?
> 
> 
> Quoting florian.le...@socgen.com:
> 
>> Hi,
>> A bit of history might help to understand why we have the cache tier.
>> 
>> We have been running OpenStack on top of Ceph for many years now (we started 
>> with Mimic, then upgraded to Nautilus 2 years ago, and today upgraded to 
>> Pacific). At the beginning of the setup we had a mix of HDD+SSD devices in 
>> HCI mode for OpenStack Nova. After the upgrade to Nautilus, we did a hardware 
>> refresh with brand new NVMe devices and transitioned from mixed devices to 
>> NVMe. But we were never able to evict all the data from the vms_cache pools 
>> (even being aggressive with the eviction; the last resort would have been to 
>> stop all the virtual instances, and that was not an option for our 
>> customers), so we decided to move on, set cache-mode proxy and serve data 
>> only from NVMe since then. And it's been like this for a year and a half.
>> 
>> But today, after the upgrade, the situation is that we cannot query any 
>> stats (with ceph pg x.x query), rados queries hang, and scrubs hang even 
>> though all PGs are "active+clean". There is no client, recovery or rebalance 
>> activity reported by the cluster. Some other commands hang as well, e.g. 
>> "ceph balancer status".
>> 
>> --
>> bash-4.2$ ceph -s
>>  cluster:
>>id: 
>>health: HEALTH_WARN
>>mon is allowing insecure global_id reclaim
>>noscrub,nodeep-scrub,nosnaptrim flag(s) set
>>18432 slow ops, oldest one blocked for 7626 sec, daemons 
>> [osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.120,osd.122]...
>>  have slow ops.
>> 
>>  services:
>>mon: 3 daemons, quorum mon1,mon2,mon3(age 36m)
>>mgr: bm9612541(active, since 39m)
>>osd: 72 osds: 72 up (since 97m), 72 in (since 9h)
>> flags noscrub,nodeep-scrub,nosnaptrim
>> 
>>  data:
>>pools:   8 pools, 2409 pgs
>>objects: 14.64M objects, 92 TiB
>>usage:   276 TiB used, 143 TiB / 419 TiB avail
>>pgs: 2409 active+clean
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Dropping focal for squid

2024-02-28 Thread Reed Dier
Found this mention in the CLT Minutes posted this morning[1], of a discussion 
on ceph-dev[2] about dropping ubuntu focal builds for the squid release, and 
beginning builds of quincy for jammy to facilitate quincy->squid upgrades.

> there was a consensus to drop support for ubuntu focal and centos
> stream 8 with the squid release, and i'd love to remove those distros
> from the shaman build matrix for squid and main branches asap
> 
> however, i see that quincy never supported ubuntu jammy, so our quincy
> upgrade tests still have to run against focal. that means we'd still
> have to build focal packages for squid
> 
> would it be possible to start building jammy packages for quincy to
> allow those upgrade tests to run jammy instead?


Just wanting to voice my support for this, as this both seems to match the 
historical ceph:ubuntu cadence going back roughly a decade, and helps 
facilitate a narrow upgrade window for ubuntu users to get to jammy.

+----------+-----+-----+-----+-----+-----+-----+
| ceph     | u14 | u16 | u18 | u20 | u22 | u24 |
+----------+-----+-----+-----+-----+-----+-----+
| jewel    | +   | +   | -   | -   | -   | -   |
| luminous | +   | +   | -   | -   | -   | -   |
| mimic    | -   | +   | +   | -   | -   | -   |
| nautilus | -   | +   | +   | -   | -   | -   |
| octopus  | -   | -   | +   | +   | -   | -   |
| pacific  | -   | -   | +   | +   | -   | -   |
| quincy   | -   | -   | -   | +   | M   | -   |
| reef     | -   | -   | -   | +   | +   | -   |
| squid    | -   | -   | -   | -   | E   | E   |
| T        | -   | -   | -   | -   | E   | E   |
+----------+-----+-----+-----+-----+-----+-----+

Hopefully this table comes through on the mailing list well enough.
Going back to jewel/10 and Ubuntu 14.04/trusty, there have been a consistent 
4 Ceph releases per Ubuntu LTS release, with a dist drop/add every 
two releases.
This gives ample window for users to upgrade ubuntu and ceph at a reasonable 
pace.
However with quincy not being built for jammy (M=missing), this broke the trend 
and forced anyone looking to get to jammy to have to go all the way to reef, 
which they may not be ready to do just yet.
Running the table out to the T release and ubuntu 24.04/noble, following this 
trend, it would be expected (E=expected) that squid would be built for jammy 
(and eventually noble), and the same would be true for the T release.
 
Many words to say that as a user this would be beneficial to me, and likely 
others.

Reed

[1] 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/SI3KZTU6GLGEHWICVDZLQEKWUSVKYQHG/
 

[2] 
https://lists.ceph.io/hyperkitty/list/d...@ceph.io/thread/ONAWOAE7MPMT7CP6KH7Y4NGWIP5SZ7XR/
 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Leadership Team Meeting, 2024-02-28 Minutes

2024-02-28 Thread Patrick Donnelly
Hi folks,

Today we discussed:


- [casey] on dropping ubuntu focal support for squid
  - Discussion thread:
    https://lists.ceph.io/hyperkitty/list/d...@ceph.io/thread/ONAWOAE7MPMT7CP6KH7Y4NGWIP5SZ7XR/
  - Quincy doesn't build jammy packages, so quincy->squid upgrade tests
    have to run on focal
  - proposing to add jammy packages for quincy to enable that upgrade path
    (from 17.2.8+): https://github.com/ceph/ceph-build/pull/2206
  - Need to indicate that Quincy clusters must upgrade to jammy before
    upgrading to Squid.

- T release name: https://pad.ceph.com/p/t
  - Tentacle wins!
  - Patrick to do release kick-off

- Cephalocon news?
  - Planning is in progress; no news as knowledgeable parties not present
    for this meeting.

- Volunteers for compiling the Contributor Credits?
  - https://tracker.ceph.com/projects/ceph/wiki/Ceph_contributors_list_maintenance_guide
  - Laura will give it a try.

- Plan for tagged vs. named Github milestones?
  - Continue using priority order for qa testing: exhaust testing on
    tagged milestone, then go to "release" catch-all milestone

- v18.2.2 hotfix release next
  - Reef HEAD is still cooking with to-be-addressed upgrade issues.

- v19.1.0 (first Squid RC)
  - two rgw features still waiting to go into squid
  - cephfs quiesce feature to be backported
  - Nightlies crontab to be updated by Patrick.
  - v19.1.0 milestone: https://github.com/ceph/ceph/milestone/21


-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS On Windows 10

2024-02-28 Thread Robert W. Eckert
I have it working on my machines - the global configuration for me looks like:
[global]
fsid = fe3a7cb0-69ca-11eb-8d45-c86000d08867
mon_host = [v2:192.168.2.142:3300/0,v1:192.168.2.142:6789/0] 
[v2:192.168.2.141:3300/0,v1:192.168.2.141:6789/0] 
[v2:192.168.2.199:3300/0,v1:192.168.2.199:6789/0]
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

The important things for me were to add the fsid and the auth sections.

Also note the port is 6789, not 3300.
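
Building on that, a hedged sketch of what the [global] section from the
original post might look like with an explicit fsid and both messenger
addresses spelled out (the fsid below is a placeholder; the real one can be
read with "ceph fsid" on a cluster node, and the monitor IPs are the ones
from the original post):

~~~
[global]
    # placeholder fsid - replace with the output of "ceph fsid" from the cluster
    fsid = 00000000-0000-0000-0000-000000000000
    # spell out both the v2 (3300) and v1 (6789) monitor addresses
    mon_host = [v2:192.168.1.10:3300/0,v1:192.168.1.10:6789/0] [v2:192.168.1.11:3300/0,v1:192.168.1.11:6789/0] [v2:192.168.1.12:3300/0,v1:192.168.1.12:6789/0]
    auth_cluster_required = cephx
    auth_service_required = cephx
    auth_client_required = cephx
~~~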



-Original Message-
From: duluxoz  
Sent: Wednesday, February 28, 2024 4:05 AM
To: ceph-users@ceph.io
Subject: [ceph-users] CephFS On Windows 10

Hi All,

  I'm looking for some pointers/help as to why I can't get my Win10 PC to 
connect to our Ceph Cluster's CephFS Service. Details are as follows:

Ceph Cluster:

- IP Addresses: 192.168.1.10, 192.168.1.11, 192.168.1.12

- Each node above is a monitor & an MDS

- Firewall Ports: open (ie 3300, etc)

- CephFS System Name: my_cephfs

- Log files: nothing jumps out at me

Windows PC:

- Keyring file created and findable: ceph.client.me.keyring

- dokany installed

- ceph-for-windows installed

- Can ping all three ceph nodes

- Connection command: ceph-dokan -l v -o -id me --debug --client_fs my_cephfs 
-c C:\ProgramData\Ceph\ceph.conf

Ceph.conf contents:

~~~

[global]
   mon_host = 192.168.1.10, 192.168.1.11, 192.168.1.12
   log to stderr = true
   log to syslog = true
   run dir = C:/ProgramData/ceph
   crash dir = C:/logs/ceph
   debug client = 2
[client]
   keyring = C:/ProgramData/ceph/ceph.client.me.keyring
   log file = C:/logs/ceph/$name.$pid.log
   admin socket = C:/ProgramData/ceph/$name.$pid.asok
~~~

Windows Logfile contents (ie C:/logs/ceph/client.me..log):

~~~

2024-02-28T18:26:45.201+1100 1  0 monclient(hunting): authenticate timed out 
after 300
2024-02-28T18:31:45.203+1100 1  0 monclient(hunting): authenticate timed out 
after 300
2024-02-28T18:36:45.205+1100 1  0 monclient(hunting): authenticate timed out 
after 300

~~~

Additional info from Windows CLI:

~~~

failed to fetch mon config (--no-mon-config to skip)

~~~

So I've gone through the doco and done some Google-foo and I can't work out 
*why* I'm getting a failure; why I'm getting the authentication failure. I know 
it'll be something simple, something staring me in the face, but I'm at the 
point where I can't see the forest for the trees - please help.

Any help greatly appreciated

Thanks in advance

Cheers

Dulux-Oz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possible to tune Full Disk warning ??

2024-02-28 Thread Eugen Block

Maybe this [2] helps, one specific mountpoint is excluded:

mountpoint !~ "/mnt.*"

[2] https://alex.dzyoba.com/blog/prometheus-alerts/
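
As an illustration only, the kind of PromQL expression such a rule could use
(metric names assume node_exporter; the 95% threshold and the excluded
patterns are examples, not the stock cephadm rule):

~~~
# fire when a filesystem is more than 95% full, but ignore NFS mounts
# and anything mounted under /mnt
(node_filesystem_avail_bytes{fstype!="nfs",mountpoint!~"/mnt.*"}
  / node_filesystem_size_bytes{fstype!="nfs",mountpoint!~"/mnt.*"}) * 100 < 5
~~~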

Quoting Eugen Block:


Hi,

let me refer you to my response to a similar question [1]. I don't
have a working example of how to exclude some mountpoints, but it
should be possible to modify the existing rules.


Regards,
Eugen

[1]  
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/NIYXE27AQZHKOJD6Q4RDTPTIK7LIZEKM/


Quoting Daniel Brown:


Greetings -

Is there any way to tune what CEPH will complain about, in terms of  
“full disks” ??


One of my ceph servers has an NFS mount which is for all intents  
and purposes “read only” and is sitting at 100% full. Ceph keeps  
warning me about this, unless I unmount the nfs mount point.


Is there any way to tell it to ignore that mount point?


I’m using reef 18.2.1, running on Ubuntu, which was setup with cephadm.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possible to tune Full Disk warning ??

2024-02-28 Thread Eugen Block

Hi,

let me refer you to my response to a similar question [1]. I don't
have a working example of how to exclude some mountpoints, but it
should be possible to modify the existing rules.


Regards,
Eugen

[1]  
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/NIYXE27AQZHKOJD6Q4RDTPTIK7LIZEKM/


Quoting Daniel Brown:


Greetings -

Is there any way to tune what CEPH will complain about, in terms of  
“full disks” ??


One of my ceph servers has an NFS mount which is for all intents and  
purposes “read only” and is sitting at 100% full. Ceph keeps warning  
me about this, unless I unmount the nfs mount point.


Is there any way to tell it to ignore that mount point?


I’m using reef 18.2.1, running on Ubuntu, which was setup with cephadm.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS On Windows 10

2024-02-28 Thread Lucian Petrut
Hi,

I’d double check that the 3300 port is accessible (e.g. using telnet, which can 
be installed as an optional Windows feature). Make sure that it’s using the 
default port and not a custom one; also be aware that the v1 protocol uses 6789 
by default.

Increasing the messenger log level to 10 might also be useful: debug ms = 10.
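
A quick sketch of those two checks (monitor IPs taken from the original post,
config path as already used there):

~~~
:: from the Windows client, check that the mon ports are reachable
telnet 192.168.1.10 3300
telnet 192.168.1.10 6789

:: and in C:\ProgramData\Ceph\ceph.conf, under [global], raise messenger logging:
::   debug ms = 10
~~~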

Regards,
Lucian

> On 28 Feb 2024, at 11:05, duluxoz  wrote:
> 
> Hi All,
> 
>  I'm looking for some pointers/help as to why I can't get my Win10 PC to 
> connect to our Ceph Cluster's CephFS Service. Details are as follows:
> 
> Ceph Cluster:
> 
> - IP Addresses: 192.168.1.10, 192.168.1.11, 192.168.1.12
> 
> - Each node above is a monitor & an MDS
> 
> - Firewall Ports: open (ie 3300, etc)
> 
> - CephFS System Name: my_cephfs
> 
> - Log files: nothing jumps out at me
> 
> Windows PC:
> 
> - Keyring file created and findable: ceph.client.me.keyring
> 
> - dokany installed
> 
> - ceph-for-windows installed
> 
> - Can ping all three ceph nodes
> 
> - Connection command: ceph-dokan -l v -o -id me --debug --client_fs my_cephfs 
> -c C:\ProgramData\Ceph\ceph.conf
> 
> Ceph.conf contents:
> 
> ~~~
> 
> [global]
>   mon_host = 192.168.1.10, 192.168.1.11, 192.168.1.12
>   log to stderr = true
>   log to syslog = true
>   run dir = C:/ProgramData/ceph
>   crash dir = C:/logs/ceph
>   debug client = 2
> [client]
>   keyring = C:/ProgramData/ceph/ceph.client.me.keyring
>   log file = C:/logs/ceph/$name.$pid.log
>   admin socket = C:/ProgramData/ceph/$name.$pid.asok
> ~~~
> 
> Windows Logfile contents (ie C:/logs/ceph/client.me..log):
> 
> ~~~
> 
> 2024-02-28T18:26:45.201+1100 1  0 monclient(hunting): authenticate timed out 
> after 300
> 2024-02-28T18:31:45.203+1100 1  0 monclient(hunting): authenticate timed out 
> after 300
> 2024-02-28T18:36:45.205+1100 1  0 monclient(hunting): authenticate timed out 
> after 300
> 
> ~~~
> 
> Additional info from Windows CLI:
> 
> ~~~
> 
> failed to fetch mon config (--no-mon-config to skip)
> 
> ~~~
> 
> So I've gone through the doco and done some Google-foo and I can't work out 
> *why* I'm getting a failure; why I'm getting the authentication failure. I 
> know it'll be something simple, something staring me in the face, but I'm at 
> the point where I can't see the forest for the trees - please help.
> 
> Any help greatly appreciated
> 
> Thanks in advance
> 
> Cheers
> 
> Dulux-Oz

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg repair doesn't fix "got incorrect hash on read" / "candidate had an ec hash mismatch"

2024-02-28 Thread Eugen Block

Hi,


I'm debating with myself if I should
1. Stop both OSD 223 and 269,
2. Just one of them.


I understand your struggle. I think I would stop them both, just to
rule out replication of the corrupted data.
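
For what it's worth, a hedged sketch of the sequence being discussed (pool
name and PG id are from the thread; the systemd unit names assume a
non-cephadm deployment, and the final min_size assumes the usual k+1=5
default for an EC 4+2 pool - please verify all of this against the actual
cluster before running anything like it):

~~~
# hedged sketch of the discussed procedure, not a tested recipe
ceph osd pool set default.rgw.buckets.data min_size 4
# on the host of osd.223:
systemctl stop ceph-osd@223
# on the host of osd.269:
systemctl stop ceph-osd@269
ceph -s                      # wait until recovery/backfill has finished
ceph pg deep-scrub 404.bc    # then re-check the PG
ceph osd pool set default.rgw.buckets.data min_size 5   # restore (assumed previous value)
~~~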


Quoting Kai Stian Olstad:


Hi Eugen, thank you for the reply.

The OSDs were drained over the weekend, so OSD 223 and 269 now only
hold the problematic PG 404.bc.


I don't think moving the PG would help, since I don't have any empty
OSDs to move it to, and a move would not fix the hash mismatch.
The reason I want only the problematic PG on those OSDs is to reduce
recovery time.
I would need to set min_size to 4 on an EC 4+2 pool and stop them both
at the same time to force a rebuild of the corrupted parts of the PG
that are on osd 223 and 269, since repair doesn't fix it.


I'm debating with myself if I should
1. Stop both OSD 223 and 269,
2. Just one of them.

Stopping them both, I'm guaranteed that the parts of the PG on 223 and
269 are rebuilt from the 4 others, 297, 276, 136 and 197, which don't
have any errors.


OSD 223 is the primary in the EC set; pg 404.bc acting is [223,297,269,276,136,197].
So maybe just stop that one, wait for recovery and then run deep-scrub
to check if things look better.

But would it then use the corrupted data on osd 269 for the rebuild?


-
Kai Stian Olstad



On 26.02.2024 10:19, Eugen Block wrote:

Hi,

I think your approach makes sense. But I'm wondering if moving only
the problematic PGs to different OSDs could have an effect as well. I
assume that moving the 2 PGs is much quicker than moving all BUT those
2 PGs. If that doesn't work you could still fall back to draining the
entire OSDs (except for the problematic PG).


Regards,
Eugen

Quoting Kai Stian Olstad:


Hi,

Does no one have any comment at all?
I'm not picky, so any speculation or guess - "I would", "I wouldn't",
"it should work" and so on - would be highly appreciated.



Since 4 out of 6 shards in the EC 4+2 are OK and ceph pg repair doesn't
solve it, I think the following might work.


pg 404.bc acting [223,297,269,276,136,197]

- Use pgremapper to move all PGs on OSD 223 and 269, except 404.bc, to other OSDs.

- Set min_size to 4: ceph osd pool set default.rgw.buckets.data min_size 4
- Stop osd 223 and 269

What I hope will happen is that Ceph then recreates the 404.bc shards
s0 (osd.223) and s2 (osd.269), since they are now down, from the
remaining shards s1 (osd.297), s3 (osd.276), s4 (osd.136) and s5 (osd.197).


_Any_ comment is highly appreciated.

-
Kai Stian Olstad


On 21.02.2024 13:27, Kai Stian Olstad wrote:

Hi,

Short summary

PG 404.bc is an EC 4+2 PG where s0 and s2 report a hash mismatch for 698 objects.
ceph pg repair doesn't fix it, because if you run deep-scrub on the
PG after the repair has finished, it still reports scrub errors.

Why can't ceph pg repair fix this? It has 4 out of 6 shards and should
be able to reconstruct the corrupted ones.
Is there a way to fix this? Like deleting the s0 and s2 shards so it's
forced to recreate them?



Long detailed summary

A short backstory.
* This is the aftermath of problems with mclock, after "17.2.7:
Backfilling deadlock / stall / stuck / standstill" [1].

- 4 OSDs had a few bad sectors; all 4 were set out and the cluster stopped.
- The solution was to swap from mclock to wpq and restart all OSDs.
- When all backfilling was finished, all 4 OSDs were replaced.
- osd.223 and osd.269 were 2 of the 4 OSDs that were replaced.


PG / pool 404 is EC 4+2 default.rgw.buckets.data

9 days after osd.223 and osd.269 were replaced, deep-scrub was run and
reported errors:

  ceph status
  ---
  HEALTH_ERR 1396 scrub errors; Possible data damage: 1 pg inconsistent
  [ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
  [ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
  pg 404.bc is active+clean+inconsistent, acting [223,297,269,276,136,197]


I then run repair
  ceph pg repair 404.bc

And ceph status showed this
  ceph status
  ---
  HEALTH_WARN Too many repaired reads on 2 OSDs
  [WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
  osd.223 had 698 reads repaired
  osd.269 had 698 reads repaired

But osd.223 and osd.269 are new disks, and the disks have no SMART
errors or any I/O errors in the OS logs.

So I tried to run deep-scrub again on the PG.
  ceph pg deep-scrub 404.bc

And got this result.

  ceph status
  ---
  HEALTH_ERR 1396 scrub errors; Too many repaired reads on 2 OSDs; Possible data damage: 1 pg inconsistent

  [ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
  [WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
  osd.223 had 698 reads repaired
  osd.269 had 698 reads repaired
  [ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
  pg 404.bc is active+clean+scrubbing+deep+inconsistent+repair, acting [223,297,269,276,136,197]


698 + 698 = 1396, so the same number of errors.

Run repair again on 404.bc and ceph status is

  HEALTH_WARN Too many repaired reads on 2 OSDs
  [WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 

[ceph-users] CephFS On Windows 10

2024-02-28 Thread duluxoz

Hi All,

 I'm looking for some pointers/help as to why I can't get my Win10 PC 
to connect to our Ceph Cluster's CephFS Service. Details are as follows:


Ceph Cluster:

- IP Addresses: 192.168.1.10, 192.168.1.11, 192.168.1.12

- Each node above is a monitor & an MDS

- Firewall Ports: open (ie 3300, etc)

- CephFS System Name: my_cephfs

- Log files: nothing jumps out at me

Windows PC:

- Keyring file created and findable: ceph.client.me.keyring

- dokany installed

- ceph-for-windows installed

- Can ping all three ceph nodes

- Connection command: ceph-dokan -l v -o -id me --debug --client_fs 
my_cephfs -c C:\ProgramData\Ceph\ceph.conf


Ceph.conf contents:

~~~

[global]
  mon_host = 192.168.1.10, 192.168.1.11, 192.168.1.12
  log to stderr = true
  log to syslog = true
  run dir = C:/ProgramData/ceph
  crash dir = C:/logs/ceph
  debug client = 2
[client]
  keyring = C:/ProgramData/ceph/ceph.client.me.keyring
  log file = C:/logs/ceph/$name.$pid.log
  admin socket = C:/ProgramData/ceph/$name.$pid.asok
~~~

Windows Logfile contents (ie C:/logs/ceph/client.me..log):

~~~

2024-02-28T18:26:45.201+1100 1  0 monclient(hunting): authenticate timed 
out after 300
2024-02-28T18:31:45.203+1100 1  0 monclient(hunting): authenticate timed 
out after 300
2024-02-28T18:36:45.205+1100 1  0 monclient(hunting): authenticate timed 
out after 300


~~~

Additional info from Windows CLI:

~~~

failed to fetch mon config (--no-mon-config to skip)

~~~

So I've gone through the doco and done some Google-foo and I can't work 
out *why* I'm getting a failure; why I'm getting the authentication 
failure. I know it'll be something simple, something staring me in the 
face, but I'm at the point where I can't see the forest for the trees - 
please help.


Any help greatly appreciated

Thanks in advance

Cheers

Dulux-Oz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io