[ceph-users] Unable to map RBDs after running pg-upmap-primary on the pool

2024-03-06 Thread Torkil Svensgaard

Hi

I tried to do offline read optimization[1] this morning but I am now 
unable to map the RBDs in the pool.


I did this prior to running the pg-upmap-primary commands suggested by 
the optimizer, as suggested by the latest documentation[2]:


"
ceph osd set-require-min-compat-client reef
"

The command didn't complain and the documentation stated it should 
failed if any pre-reef clients were connected so I thought all was well 
and rad the pg_upmap_primary commands.


Can I simply remove the applied pg_upmap_primary settings somehow to my 
RBDs back while investigating?


Mvh.

Torkil

[1] https://docs.ceph.com/en/reef/rados/operations/read-balancer/
[2] https://docs.ceph.com/en/latest/rados/operations/read-balancer/
--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: tor...@drcmr.dk
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Cluster Config File Locations?

2024-03-06 Thread Eugen Block

You're welcome, great that your cluster is healthy again.

Zitat von matt...@peregrineit.net:


Thanks Eugen, you pointed me in the right direction  :-)

Yes, the config files I mentioned were the ones in  
`/var/lib/ceph/{FSID}/mgr.{MGR}/config` - I wasn't aware there were  
others (well, I suspected their was, hence my Q).


The `global public-network` was (re-)set to the old subnet, while  
the `mon public-network` was set to the proper subnet. With you  
pointers you gave me I reset the `global public-network` to the  
proper subnet, rebooted/reloaded all the containers, and  
everything's come back up. Thecluster is now doing a deep scrub and  
self-correcting itself, so all good.


Thank you *very* much - I owe you a beer (no, I *really* owe you a beer)!

Cheers

Dulux-Oz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph Quincy to Reef non cephadm upgrade

2024-03-06 Thread Konstantin Shalygin
Hi, 

Yes, you upgrade ceph-common package, then restart your mons

k
Sent from my iPhone

> On 6 Mar 2024, at 21:55, sarda.r...@gmail.com wrote:
> 
> My question is - does this mean I need to upgrade all ceph packages (ceph, 
> ceph-common) and restart only monitor daemon first?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-storage slack access

2024-03-06 Thread Zac Dover
Greg & co,

https://github.com/ceph/ceph/pull/56010 contains the updated link. As soon as 
it passes its tests, I'll merge and backport it.

Zac Dover
Upstream Docs
Ceph Foundation



On Thursday, March 7th, 2024 at 3:01 AM, Gregory Farnum  
wrote:

> 
> 
> On Wed, Mar 6, 2024 at 8:56 AM Matthew Vernon mver...@wikimedia.org wrote:
> 
> > Hi,
> > 
> > On 06/03/2024 16:49, Gregory Farnum wrote:
> > 
> > > Has the link on the website broken? https://ceph.com/en/community/connect/
> > > We've had trouble keeping it alive in the past (getting a non-expiring
> > > invite), but I thought that was finally sorted out.
> > 
> > Ah, yes, that works. Sorry, I'd gone to
> > https://docs.ceph.com/en/latest/start/get-involved/
> > 
> > which lacks the registration link.
> 
> 
> Whoops! Can you update that, docs master Zac? :)
> -Greg
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 回复:Re: How to build ceph without QAT?

2024-03-06 Thread 张东川
Hi Hualong and llya,


Thanks for your help.


More info:
I am trying to buid ceph on Milkv Pioneer board (RISCV arch, OS is fedora-riscv 
6.1.55).
The ceph code being used was downloaded from github last week (master branch)


​Currently I am working on environment cleanup (I suspect my work environment 
is not clean).
I will try to switch to V19.0.0 and rebuild.
Will let you know if further help needed.
Thanks.


Best Regards,
Dongchuan




Ilya Dryomov

[ceph-users] Re: change ip node and public_network in cluster

2024-03-06 Thread Eugen Block

Hi,

your response arrived in my inbox today, so sorry for the delay.
I wrote a blog post [1] just two weeks ago for that procedure with  
cephadm, Zac adopted that and updated the docs [2]. Can you give that  
a try and let me know if it worked? I repeated that procedure a couple  
of times to be sure the docs are correct, but of course there's still  
a chance I might have missed something.


Thanks,
Eugen

[1]  
http://heiterbiswolkig.blogs.nde.ag/2024/02/22/cephadm-change-public-network/
[2]  
https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#using-cephadm-to-change-the-public-network


Zitat von farhad khedriyan :

Hi, thanks But this document is for old versions and cannot be used  
for the container version .
I used the reef version and when I retrieve the monmap and edit I  
can't use ceph-mon to change the monmap or any way to change
I tried to solve this problem by adding a new node and deleting the  
old nodes, but because the monmap has not changed, it does not allow  
adding a new node with a different subnet.

is ther a way for this change in container versions?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Leadership Team Meeting Minutes - March 6, 2024

2024-03-06 Thread Ernesto Puerta
Hi Cephers,

These are the topics covered in today's meeting:


   - *Releases*


   - *Hot fixes Releases*


   - *18.2.2*


   - https://github.com/ceph/ceph/pull/55491 - reef: mgr/prometheus: fix
   orch check to prevent Prometheus crash


   - https://github.com/ceph/ceph/pull/55709 - reef: debian/*.postinst: add
   adduser as a dependency and specify --home when adduser


   - https://github.com/ceph/ceph/pull/55712 - reef: src/osd/OSDMap.cc: Fix
   encoder to produce same bytestream


   - [Laura] When/who can upgrade the LRC for this release? (Dan will after
   we do some last checks today)


   - Gibba has been upgraded with no apparent issues


   - *17.2.8* (on hold based on whether the osdmap fix requirement) - *No
   longer needed*


   - osdmap fix (not needed:
   https://github.com/ceph/ceph/blob/quincy/src/osd/OSDMap.cc#L3087-L3089)


   - Rook requests to include this c-v fix that blocks OSDs in some
   scenarios (not worthy of its own hotfix, just please include if we do the
   hotfix)


   - https://github.com/ceph/ceph/pull/54522 - quincy: ceph-volume: fix a
   regression in raw list


   - As crc fix is not needed, the Rook request can be included in a
   regular quincy release


   - *Regular releases*


   - *18.2.3* - exporter fixes for rook and debian-derived users make this
   more urgent than quincy


   - mgr usage of pyO3/cryptography an issue for debian - and possibly
   centos9 (https://tracker.ceph.com/issues/64213#note-2) - see notes from
   02/07


   - Any updates on potentially dropping modules or another fix? Adam?


   - Squid *19.1.0*


   - CephFS waiting for 2 feature PR


   - RGW PRs


   - NVMe? To be confirmed with Aviv


   - squid blockers:


   - build centos 9 containers:
   https://github.com/ceph/ceph-container/pull/2183


   - ceph-object-corpus:
https://github.com/ceph/ceph-object-corpus/pull/17 (testing
   in https://github.com/ceph/ceph/pull/54735)


   - Milestone for squid blockers (use to tag blockers for the first 19.1.0
   RC): https://github.com/ceph/ceph/milestone/21


   - Squid RCs and community testing


   - https://pad.ceph.com/p/squid_scale_testing


   - Target date March ~20


   - *17.2.9*


   - need jammy builds for quincy before a squid release. maybe we can just
   build them for the 17.2.7 release? (do you mean the 17.2.8 release?)


   - *Meeting time - *change days to Monday or Thursday? (added by Josh -
   who has a conflict on Wednesdays now)


   - Thursday has several conflicting community meetings


   - Any objections to Monday at the same time?


   - Note the change to US daylight savings next week


   - Let's do a poll (Doodle)


   - *debian-reef_OLD email thread "[ceph-users] debian-reef_OLD?"*


   - Fixed by Yuri


   - *CDM APAC tonight*:
   https://tracker.ceph.com/projects/ceph/wiki/CDM_06-MAR-2024


   - *Sepia Lab*:


   - PSA: https://github.com/ceph/ceph/pull/55820 merged (squid crontab
   additions and overhaul to nightlies)


   - New grafana widget for smithi node utilization:


   -
   
https://grafana-route-grafana.apps.os.sepia.ceph.com/d/teuthology/teuthology?orgId=1=1m=1709695487524=1709738687524=36


   - (Basically: unlocked machine * hours / total machine * hours )


   - [Zac] *ceph-exporter release notes question from Jan Horacek* (from
   the upstream community)


   - Route to Juanmi Olmo


   - [Zac] - *Eugen Block's question about removing sensitive information
   from ceph-users mailing list*


   - No easy way to request/remove sensitive information.


   - [Zac] - *Anthony D'Atri submits Index HQ in Toronto as a possible
   venue for Cephalocon 2024*


   - Venue already booked (Patrick)


   - [Zac] - *CQ issue 4 -- submit your requests before 25 Mar 2024* --
   zac.do...@proton.me


Kind Regards,

Ernesto Puerta
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Hanging request in S3

2024-03-06 Thread Casey Bodley
hey Christian, i'm guessing this relates to
https://tracker.ceph.com/issues/63373 which tracks a deadlock in s3
DeleteObjects requests when multisite is enabled.
rgw_multi_obj_del_max_aio can be set to 1 as a workaround until the
reef backport lands

On Wed, Mar 6, 2024 at 2:41 PM Christian Kugler  wrote:
>
> Hi,
>
> I am having some trouble with some S3 requests and I am at a loss.
>
> After upgrading to reef a couple of weeks ago some requests get stuck and
> never
> return. The two Ceph clusters are set up to sync the S3 realm
> bidirectionally.
> The bucket has 479 shards (dynamic resharding) at the moment.
>
> Putting an object (/etc/services) into the bucket via s3cmd works, and
> deleting
> it works as well. So I know it is not just the entire bucket that is somehow
> faulty.
>
> When I try to delete a specific prefix it the request for listing all
> objects
> never comes back. In the example below I only included the request in
> question
> which I aborted with ^C.
>
> $ s3cmd rm -r
> s3://sql20/pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/ -d
> [...snip...]
> DEBUG: Canonical Request:
> GET
> /sql20/
> prefix=pgbackrest%2Fbackup%2Fadrpb%2F20240130-200410F%2Fpg_data%2Fbase%2F16560%2F
> host:[...snip...]
> x-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
> x-amz-date:20240306T183435Z
>
> host;x-amz-content-sha256;x-amz-date
> e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
> --
> DEBUG: signature-v4 headers: {'x-amz-date': '20240306T183435Z',
> 'Authorization': 'AWS4-HMAC-SHA256
> Credential=VL0FRB7CYGMHBGCD419M/20240306/[...snip...]/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=45b133675535ab611bbf2b9a7a6e40f9f510c0774bf155091dc9a05b76856cb7',
> 'x-amz-content-sha256':
> 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'}
> DEBUG: Processing request, please wait...
> DEBUG: get_hostname(sql20): [...snip...]
> DEBUG: ConnMan.get(): re-using connection: [...snip...]#1
> DEBUG: format_uri():
> /sql20/?prefix=pgbackrest%2Fbackup%2Fadrpb%2F20240130-200410F%2Fpg_data%2Fbase%2F16560%2F
> DEBUG: Sending request method_string='GET',
> uri='/sql20/?prefix=pgbackrest%2Fbackup%2Fadrpb%2F20240130-200410F%2Fpg_data%2Fbase%2F16560%2F',
> headers={'x-amz-date': '20240306T183435Z', 'Authorization':
> 'AWS4-HMAC-SHA256
> Credential=VL0FRB7CYGMHBGCD419M/20240306/[...snip...]/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=45b133675535ab611bbf2b9a7a6e40f9f510c0774bf155091dc9a05b76856cb7',
> 'x-amz-content-sha256':
> 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'},
> body=(0 bytes)
> ^CDEBUG: Response:
> {}
> See ya!
>
> The request did not show up normally in the logs so I set debug_rgw=20 and
> debug_ms=20 via ceph config set.
>
> I tried to isolate the request and looked for its request id:
> 13321243250692796422
> The following is a grep for the request id:
>
> Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
> s3:list_bucket verifying op params
> Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
> s3:list_bucket pre-executing
> Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
> s3:list_bucket check rate limiting
> Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
> s3:list_bucket executing
> Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
> s3:list_bucket list_objects_ordered: starting attempt 1
> Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
> s3:list_bucket cls_bucket_list_ordered: request from each of 479 shard(s)
> for 8 entries to get 1001 total entries
> Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
> s3:list_bucket cls_bucket_list_ordered: currently processing
> pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/101438318.gz
> from shard 437
> Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
> s3:list_bucket get_obj_state: rctx=0x7f74bdc6f860
> obj=sql20:pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/101438318.gz
> state=0x55d4237419e8 s->prefetch_data=0
> Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
> s3:list_bucket cls_bucket_list_ordered: skipping
> pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/101438318.gz[]
> Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
> s3:list_bucket cls_bucket_list_ordered: currently processing
> pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/101457659_fsm.gz
> from shard 202
> Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
> s3:l

[ceph-users] Hanging request in S3

2024-03-06 Thread Christian Kugler
Hi,

I am having some trouble with some S3 requests and I am at a loss.

After upgrading to reef a couple of weeks ago some requests get stuck and
never
return. The two Ceph clusters are set up to sync the S3 realm
bidirectionally.
The bucket has 479 shards (dynamic resharding) at the moment.

Putting an object (/etc/services) into the bucket via s3cmd works, and
deleting
it works as well. So I know it is not just the entire bucket that is somehow
faulty.

When I try to delete a specific prefix it the request for listing all
objects
never comes back. In the example below I only included the request in
question
which I aborted with ^C.

$ s3cmd rm -r
s3://sql20/pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/ -d
[...snip...]
DEBUG: Canonical Request:
GET
/sql20/
prefix=pgbackrest%2Fbackup%2Fadrpb%2F20240130-200410F%2Fpg_data%2Fbase%2F16560%2F
host:[...snip...]
x-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
x-amz-date:20240306T183435Z

host;x-amz-content-sha256;x-amz-date
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
--
DEBUG: signature-v4 headers: {'x-amz-date': '20240306T183435Z',
'Authorization': 'AWS4-HMAC-SHA256
Credential=VL0FRB7CYGMHBGCD419M/20240306/[...snip...]/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=45b133675535ab611bbf2b9a7a6e40f9f510c0774bf155091dc9a05b76856cb7',
'x-amz-content-sha256':
'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'}
DEBUG: Processing request, please wait...
DEBUG: get_hostname(sql20): [...snip...]
DEBUG: ConnMan.get(): re-using connection: [...snip...]#1
DEBUG: format_uri():
/sql20/?prefix=pgbackrest%2Fbackup%2Fadrpb%2F20240130-200410F%2Fpg_data%2Fbase%2F16560%2F
DEBUG: Sending request method_string='GET',
uri='/sql20/?prefix=pgbackrest%2Fbackup%2Fadrpb%2F20240130-200410F%2Fpg_data%2Fbase%2F16560%2F',
headers={'x-amz-date': '20240306T183435Z', 'Authorization':
'AWS4-HMAC-SHA256
Credential=VL0FRB7CYGMHBGCD419M/20240306/[...snip...]/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=45b133675535ab611bbf2b9a7a6e40f9f510c0774bf155091dc9a05b76856cb7',
'x-amz-content-sha256':
'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'},
body=(0 bytes)
^CDEBUG: Response:
{}
See ya!

The request did not show up normally in the logs so I set debug_rgw=20 and
debug_ms=20 via ceph config set.

I tried to isolate the request and looked for its request id:
13321243250692796422
The following is a grep for the request id:

Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
s3:list_bucket verifying op params
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
s3:list_bucket pre-executing
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
s3:list_bucket check rate limiting
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
s3:list_bucket executing
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
s3:list_bucket list_objects_ordered: starting attempt 1
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.0s
s3:list_bucket cls_bucket_list_ordered: request from each of 479 shard(s)
for 8 entries to get 1001 total entries
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
s3:list_bucket cls_bucket_list_ordered: currently processing
pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/101438318.gz
from shard 437
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
s3:list_bucket get_obj_state: rctx=0x7f74bdc6f860
obj=sql20:pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/101438318.gz
state=0x55d4237419e8 s->prefetch_data=0
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
s3:list_bucket cls_bucket_list_ordered: skipping
pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/101438318.gz[]
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
s3:list_bucket cls_bucket_list_ordered: currently processing
pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/101457659_fsm.gz
from shard 202
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
s3:list_bucket get_obj_state: rctx=0x7f74bdc6f860
obj=sql20:pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/101457659_fsm.gz
state=0x55d4237419e8 s->prefetch_data=0
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
s3:list_bucket cls_bucket_list_ordered: skipping
pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/101457659_fsm.gz[]
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
s3:list_bucket cls_bucket_list_ordered: currently processing
pgbackrest/backup/adrpb/20240130-200410F/pg_data/base/16560/101457662_fsm.gz
from shard 420
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.332010120s
s3:list_bucket get_obj_state: rctx=0x7f74bdc6f860
obj=sql20:pgbackrest/backup/adrpb/20240130-200410F/pg_data/base

[ceph-users] Re: ceph-volume fails when adding spearate DATA and DATA.DB volumes

2024-03-06 Thread Adam King
If you want to be directly setting up the OSDs using ceph-volume commands
(I'll pretty much always recommend following
https://docs.ceph.com/en/latest/cephadm/services/osd/#dedicated-wal-db over
manual ceph-volume stuff in cephadm deployments unless what you're doing
can't be done with the spec files), you probably actually want to use
`cephadm ceph-volume -- ...` rather than `cephadm shell`. The `cephadm
ceph-volume` command mounts the provided keyring (or whatever keyring it
infers) at `/var/lib/ceph/bootstrap-osd/ceph.keyring` inside the container,
whereas the shell will not. So in theory you could try `cephadm ceph-volume
-- lvm prepare --bluestore  --data ceph-block-0/block-0 --block.db
ceph-db-0/db-0` and that might get you past the keyring issue it seems to
be complaining about.

On Wed, Mar 6, 2024 at 2:10 PM  wrote:

> Hi all!
> I;ve faced an issue I couldnt even google.
> Trying to create OSD with two separate LVM for data.db and data, gives me
> intresting error
>
> ```
> root@ceph-uvm2:/# ceph-volume lvm prepare --bluestore  --data
> ceph-block-0/block-0 --block.db ceph-db-0/db-0
> --> Incompatible flags were found, some values may get ignored
> --> Cannot use None (None) with --bluestore (bluestore)
> --> Incompatible flags were found, some values may get ignored
> --> Cannot use --bluestore (bluestore) with --block.db (bluestore)
> Running command: /usr/bin/ceph-authtool --gen-print-key
> Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd
> --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new
> e3b7b9e5-6399-4e61-8634-41a991bb1948
>  stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1 auth: unable to find
> a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or
> directory
>  stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1
> AuthRegistry(0x7f71bc064978) no keyring found at
> /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
>  stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1 auth: unable to find
> a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or
> directory
>  stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1
> AuthRegistry(0x7f71bc067fd0) no keyring found at
> /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
>  stderr: 2024-03-04T11:45:52.267+ 7f71c3a89700 -1 auth: unable to find
> a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or
> directory
>  stderr: 2024-03-04T11:45:52.267+ 7f71c3a89700 -1
> AuthRegistry(0x7f71c3a87ea0) no keyring found at
> /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
>  stderr: 2024-03-04T11:45:52.267+ 7f71c1024700 -1 monclient(hunting):
> handle_auth_bad_method server allowed_methods [2] but i only support [1]
>  stderr: 2024-03-04T11:45:52.267+ 7f71c2026700 -1 monclient(hunting):
> handle_auth_bad_method server allowed_methods [2] but i only support [1]
>  stderr: 2024-03-04T11:45:52.267+ 7f71c1825700 -1 monclient(hunting):
> handle_auth_bad_method server allowed_methods [2] but i only support [1]
>  stderr: 2024-03-04T11:45:52.267+ 7f71c3a89700 -1 monclient:
> authenticate NOTE: no keyring found; disabled cephx authentication
>  stderr: [errno 13] RADOS permission denied (error connecting to the
> cluster)
> -->  RuntimeError: Unable to create a new OSD id
> ```
>
> And here is output which shows my created volumes
>
> ```
> root@ceph-uvm2:/# lvs
>   LV  VG   Attr   LSize   Pool Origin Data%  Meta%  Move
> Log Cpy%Sync Convert
>   block-0 ceph-block-0 -wi-a-  16.37t
>
>   db-0ceph-db-0-wi-a- 326.00g
> ```
>
> CEPH was rolled out by using cephadm and all the commands I do under
> cephadm shell
>
> I have no idea what is the reason of that error and that is why I am here.
>
>
> Any help is appreciated.
> Thanks in advance.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph reef mon is not starting after host reboot

2024-03-06 Thread Adam King
When you ran this, was it directly on the host, or did you run `cephadm
shell` first? The two things you tend to need to connect to the cluster
(that "RADOS timed out" error is generally what you get when connecting to
the cluster fails. A bunch of different causes all end with that error) are
a keyring with proper permissions and a ceph conf that includes the
locations of the mon daemons. When you run `cephadm shell` on a host that
has the keyring and config present, which the host you bootstrapped the
cluster on should, it starts up a bash shell inside a container that mounts
the keyring and the config and has all the ceph packages you need to talk
to the cluster. If you weren't using the shell, or were trying this from a
node other than the bootstrap node, it could be worth trying that
combination. Otherwise, I see the title of your message says a mon is down.
For debugging that, I'd think we'd need to see the journal logs from when
it failed to start. `cephadm ls --no-detail | grep systemd` on a host where
the mon is (NOT from within `cephadm shell`, directly on the host) will
list out the systemd units for all the daemons cephadm has deployed there.
You could use that systemd unit name to try and grab the journal logs.

On Wed, Mar 6, 2024 at 2:09 PM  wrote:

> Hi guys,
>
> i am very newbie to ceph-cluster but after multiple attempts, i was able
> to install ceph-reef cluster on debian-12 by cephadm tool on test
> environment with 2 mons and 3 OSD's om VM's. All was seeming good and i was
> exploring more about it so i rebooted cluster and found that now i am not
> able to access ceph dashboard and i have try to check this
>
> root@ceph-mon-01:/# ceph orch ls
> 2024-03-01T08:53:05.051+ 7ff7602b8700  0 monclient(hunting):
> authenticate timed out after 300
> [errno 110] RADOS timed out (error connecting to the cluster)
>
> i have not configured RADOS. And i have no clue about it. Any help would
> be very appreciated? the same issue.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.2 (hot-fix) QE validation status

2024-03-06 Thread Laura Flores
Went over the rados results with Radek. All looks good for the hotfix, and
we are ready to upgrade the LRC.

Rados approved!

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804




On Wed, Mar 6, 2024 at 9:20 AM Patrick Donnelly  wrote:

> On Wed, Mar 6, 2024 at 2:55 AM Venky Shankar  wrote:
> >
> > +Patrick Donnelly
> >
> > On Tue, Mar 5, 2024 at 9:18 PM Yuri Weinstein 
> wrote:
> > >
> > > Details of this release are summarized here:
> > >
> > > https://tracker.ceph.com/issues/64721#note-1
> > > Release Notes - TBD
> > > LRC upgrade - TBD
> > >
> > > Seeking approvals/reviews for:
> > >
> > > smoke - in progress
> > > rados - Radek, Laura?
> > > quincy-x - in progress
> >
> > I think
> >
> > https://github.com/ceph/ceph/pull/55669
> >
> > was supposed to be included in this hotfix (I recall Patrick
> > mentioning this in last week's CLT). The change was merged into reef
> > head last week.
>
> That's fixing a bug in reef HEAD not v18.2.1 so no need to get into this
> hotfix.
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Red Hat Partner Engineer
> IBM, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluestore_min_alloc_size and bluefs_shared_alloc_size

2024-03-06 Thread Anthony D'Atri



> On Feb 28, 2024, at 17:55, Joel Davidow  wrote:
> 
> Current situation
> -
> We have three Ceph clusters that were originally built via cephadm on octopus 
> and later upgraded to pacific. All osds are HDD (will be moving to wal+db on 
> SSD) and were resharded after the upgrade to enable rocksdb sharding. 
> 
> The value for bluefs_shared_alloc_size has remained unchanged at 65535. 
> 
> The value for bluestore_min_alloc_size_hdd was 65535 in octopus but is 
> reported as 4096 by ceph daemon osd. config show in pacific.

min_alloc_size is baked into a given OSD when it is created.  The central 
config / runtime value does not affect behavior for existing OSDs.  The only 
way to change it is to destroy / redeploy the OSD.

There was a succession of PRs in the Octopus / Pacific timeframe around default 
min_alloc_size for HDD and SSD device classes, including IIRC one temporary 
reversion.  

> However, the osd label after upgrading to pacific retains the value of 65535 
> for bfm_bytes_per_block.

OSD label?

I'm not sure if your Pacific release has the back port, but not that along ago 
`ceph osd metadata` was amended to report the min_alloc_size that a given OSD 
was built with.  If you don't have that, the OSD's startup log should report it.

-- aad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph is constantly scrubbing 1/4 of all PGs and still have pigs not scrubbed in time

2024-03-06 Thread Anthony D'Atri
I don't see these in the config dump.

I think you might have to apply them to `global` for them to take effect, not 
just `osd`, FWIW.

> I have tried various settings, like osd_deep_scrub_interval, osd_max_scrubs, 
> mds_max_scrub_ops_in_progress etc.
> All those get ignored.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Cluster Config File Locations?

2024-03-06 Thread matthew
Thanks Eugen, you pointed me in the right direction  :-)

Yes, the config files I mentioned were the ones in 
`/var/lib/ceph/{FSID}/mgr.{MGR}/config` - I wasn't aware there were others 
(well, I suspected their was, hence my Q).

The `global public-network` was (re-)set to the old subnet, while the `mon 
public-network` was set to the proper subnet. With you pointers you gave me I 
reset the `global public-network` to the proper subnet, rebooted/reloaded all 
the containers, and everything's come back up. Thecluster is now doing a deep 
scrub and self-correcting itself, so all good.

Thank you *very* much - I owe you a beer (no, I *really* owe you a beer)!

Cheers

Dulux-Oz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] InvalidAccessKeyId

2024-03-06 Thread ashar . khan
Dear Team,

I am facing one issue, which could be a possible bug, but not able to find
any solution.

We are unable to create a bucket on ceph from ceph-dashboard.
Bucket is getting created with a fresh/different name but once I am trying
with a deleted bucket name then it is not getting created.
I have tested on (Octopus and Pacific version).

Steps to reproduce.

a. Create user RGW user from ceph Dashboard. (user1)
b. Create a bucket with the bucket owner as user1. (bucket1)
c. Delete the bucket
d  Delete user.
e. Create a user again with the same name (user1)
f. Create a new bucket with the bucket owner as user1. (Bucket2)
I get below error message:

RGW REST API failed request with status code 403
(b'{"Code":"InvalidAccessKeyId","RequestId":"tx0457ff5169168a9e3-00648afcd0'
b'-fdd1c2-dev","HostId":"fdd1c2-dev-india"}')
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Number of pgs

2024-03-06 Thread ndandoul
Hi all, 

Pretty sure not the very first time you see a thread like this. 

Our cluster consists of 12 nodes/153 OSDs/1.2 PiB used, 708 TiB /1.9 PiB avail

The data pool is 2048 pgs big exactly the number when the cluster started. We 
have no issues with the cluster, everything runs as expected and very 
efficiently. We support about 1000 clients. The question is should we increase 
the number of pgs? If you think so, what is the sensible number to go to? 4096? 
More? 

I will eagerly await for your response. 

P.S. Yes, autoscaler is off :)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.2 (hot-fix) QE validation status

2024-03-06 Thread Redouane Kachach
Looks good to me. Testing went OK without any issues.

Thanks,
Redo.

On Tue, Mar 5, 2024 at 5:22 PM Travis Nielsen  wrote:

> Looks great to me, Redo has tested this thoroughly.
>
> Thanks!
> Travis
>
> On Tue, Mar 5, 2024 at 8:48 AM Yuri Weinstein  wrote:
>
>> Details of this release are summarized here:
>>
>> https://tracker.ceph.com/issues/64721#note-1
>> Release Notes - TBD
>> LRC upgrade - TBD
>>
>> Seeking approvals/reviews for:
>>
>> smoke - in progress
>> rados - Radek, Laura?
>> quincy-x - in progress
>>
>> Also need approval from Travis, Redouane for Prometheus fix testing.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PGs with status active+clean+laggy

2024-03-06 Thread mori . ricardo
Dear community,

I have a ceph quincy cluster with 5 nodes currently. But only 3 with 
SSDs and others with nvme. On separate pools. I have had many alerts from PGs 
with active-clean-laggy status. 
This has caused problems with slow writing. I wanted to know how to 
troubleshoot properly. I checked several things related to the network, 
I have 10 GB cards on all nodes and everything seems to be correct.


Many thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PGs with status active+clean+laggy

2024-03-06 Thread mori . ricardo
I expressed myself wrong. I only have SSDs in my cluster. I meant that in 2 of 
3 nodes I have nvme in another pool.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow RGW multisite sync due to "304 Not Modified" responses on primary zone

2024-03-06 Thread praveenkumargpk17
Hi All , 

Regarding our earlier email regarding "Slow RGW multisite sync due to '304 Not 
Modified' responses on primary zone," We just wanted to quickly follow up. We 
wanted to make it clear that we still having problems and that we desperately 
need your help to find a solution.


Thank you for taking the time to consider this.

Thanks, 
Praveen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW multisite slowness issue due to the "304 Not Modified" responses on primary zone

2024-03-06 Thread praveenkumargpk17
Hi,
We have 2 clusters (v18.2.1) primarily used for RGW which has over 2+ billion 
RGW objects. They are also in multisite configuration totaling to 2 zones and 
we've got around 2 Gbps of bandwidth dedicated (P2P) for the multisite traffic. 
We see that using "radosgw-admin sync status" on the zone 2, all the 128 shards 
are recovering and unfortunately there is very less data transfer from primary 
zone ie., the link utilization is barely 100 Mbps / 2 Gbps. Our objects are 
quite small as well like avg. of 1 MB in size. 
On further inspection, we noticed the rgw access the logs at primary site are 
mostly yielding "304 Not Modified" for RGWs at site-2. Is this expected? Here 
are some of the logs (information is redacted)

root@host-04:~# tail -f /var/log/haproxy-msync.log
Feb 12 05:06:51 host-04 haproxy[971171]: 10.1.85.14:33730 
[12/Feb/2024:05:06:51.047] https~ backend/host-04-msync 0/0/0/2/2 304 143 - - 
 56/55/1/0/0 0/0 "GET 
/bucket1/object1.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972=true=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7
 HTTP/1.1"
Feb 12 05:06:51 host-04 haproxy[971171]: 10.1.85.14:59730 
[12/Feb/2024:05:06:51.048] https~ backend/host-04-msync 0/0/0/2/2 304 143 - - 
 56/55/3/1/0 0/0 "GET 
/bucket1/object91.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972=true=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7
 HTTP/1.1"

We also took a look at our Grafana instance and out of 1000 requests / second, 
200 requests are getting  "200 OK" and 800 requests are getting "304 Not 
Modified". Sync threads are run on only 2 rgw daemons per zone and are behind a 
Load Balancer. "# radosgw-admin sync error list" also contains around 20 errors 
which are mostly automatically recoverable.
As we understand, does it mean that RGW multisite sync logs in the log pool are 
yet to be generated or some sort? Please provide us some insights and let us 
know how to resolve this.

Thanks,
Praveen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph-volume fails when adding spearate DATA and DATA.DB volumes

2024-03-06 Thread service . plant
Hi all!
I;ve faced an issue I couldnt even google.
Trying to create OSD with two separate LVM for data.db and data, gives me 
intresting error

```
root@ceph-uvm2:/# ceph-volume lvm prepare --bluestore  --data 
ceph-block-0/block-0 --block.db ceph-db-0/db-0
--> Incompatible flags were found, some values may get ignored
--> Cannot use None (None) with --bluestore (bluestore)
--> Incompatible flags were found, some values may get ignored
--> Cannot use --bluestore (bluestore) with --block.db (bluestore)
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd 
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 
e3b7b9e5-6399-4e61-8634-41a991bb1948
 stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1 auth: unable to find a 
keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or 
directory
 stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1 
AuthRegistry(0x7f71bc064978) no keyring found at 
/var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1 auth: unable to find a 
keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or 
directory
 stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1 
AuthRegistry(0x7f71bc067fd0) no keyring found at 
/var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: 2024-03-04T11:45:52.267+ 7f71c3a89700 -1 auth: unable to find a 
keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or 
directory
 stderr: 2024-03-04T11:45:52.267+ 7f71c3a89700 -1 
AuthRegistry(0x7f71c3a87ea0) no keyring found at 
/var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: 2024-03-04T11:45:52.267+ 7f71c1024700 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]
 stderr: 2024-03-04T11:45:52.267+ 7f71c2026700 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]
 stderr: 2024-03-04T11:45:52.267+ 7f71c1825700 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]
 stderr: 2024-03-04T11:45:52.267+ 7f71c3a89700 -1 monclient: authenticate 
NOTE: no keyring found; disabled cephx authentication
 stderr: [errno 13] RADOS permission denied (error connecting to the cluster)
-->  RuntimeError: Unable to create a new OSD id
```

And here is output which shows my created volumes

```
root@ceph-uvm2:/# lvs
  LV  VG   Attr   LSize   Pool Origin Data%  Meta%  Move Log 
Cpy%Sync Convert
  block-0 ceph-block-0 -wi-a-  16.37t   
 
  db-0ceph-db-0-wi-a- 326.00g  
```

CEPH was rolled out by using cephadm and all the commands I do under cephadm 
shell

I have no idea what is the reason of that error and that is why I am here.


Any help is appreciated.
Thanks in advance.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph reef mon is not starting after host reboot

2024-03-06 Thread ankit
Hi guys,

i am very newbie to ceph-cluster but after multiple attempts, i was able to 
install ceph-reef cluster on debian-12 by cephadm tool on test environment with 
2 mons and 3 OSD's om VM's. All was seeming good and i was exploring more about 
it so i rebooted cluster and found that now i am not able to access ceph 
dashboard and i have try to check this 

root@ceph-mon-01:/# ceph orch ls
2024-03-01T08:53:05.051+ 7ff7602b8700  0 monclient(hunting): authenticate 
timed out after 300
[errno 110] RADOS timed out (error connecting to the cluster)

i have not configured RADOS. And i have no clue about it. Any help would be 
very appreciated? the same issue.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: has anyone enabled bdev_enable_discard?

2024-03-06 Thread jsterr
Is there any update on this? Did someone test the option and has 
performance values before and after?

Is there any good documentation regarding this option?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Slow RGW multisite sync due to "304 Not Modified" responses on primary zone

2024-03-06 Thread praveenkumargpk17
Hi,
We have 2 clusters (v18.2.1) primarily used for RGW which has over 2+ billion 
RGW objects. They are also in multisite configuration totaling to 2 zones and 
we've got around 2 Gbps of bandwidth dedicated (P2P) for the multisite traffic. 
We see that using "radosgw-admin sync status" on the zone 2, all the 128 shards 
are recovering and unfortunately there is very less data transfer from primary 
zone ie., the link utilization is barely 100 Mbps / 2 Gbps. Our objects are 
quite small as well like avg. of 1 MB in size. 
On further inspection, we noticed the rgw access the logs at primary site are 
mostly yielding "304 Not Modified" for RGWs at site-2. Is this expected? Here 
are some of the logs (information is redacted)

root@host-04:~# tail -f /var/log/haproxy-msync.log
Feb 12 05:06:51 host-04 haproxy[971171]: 10.1.85.14:33730 
[12/Feb/2024:05:06:51.047] https~ backend/host-04-msync 0/0/0/2/2 304 143 - - 
 56/55/1/0/0 0/0 "GET 
/bucket1/object1.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972=true=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7
 HTTP/1.1"
Feb 12 05:06:51 host-04 haproxy[971171]: 10.1.85.14:59730 
[12/Feb/2024:05:06:51.048] https~ backend/host-04-msync 0/0/0/2/2 304 143 - - 
 56/55/3/1/0 0/0 "GET 
/bucket1/object91.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972=true=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7
 HTTP/1.1"

We also took a look at our Grafana instance and out of 1000 requests / second, 
200 requests are getting  "200 OK" and 800 requests are getting "304 Not 
Modified". Sync threads are run on only 2 rgw daemons per zone and are behind a 
Load Balancer. "# radosgw-admin sync error list" also contains around 20 errors 
which are mostly automatically recoverable.
As we understand, does it mean that RGW multisite sync logs in the log pool are 
yet to be generated or some sort? Please provide us some insights and let us 
know how to resolve this.

Thanks , 
Praveen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] bluestore_min_alloc_size and bluefs_shared_alloc_size

2024-03-06 Thread Joel Davidow
Summary
--
The relationship of the values configured for bluestore_min_alloc_size and 
bluefs_shared_alloc_size are reported to impact space amplification, partial 
overwrites in erasure coded pools, and storage capacity as an osd becomes more 
fragmented and/or more full.


Previous discussions including this topic

comment #7 in bug 63618 in Dec 2023 - 
https://tracker.ceph.com/issues/63618#note-7

pad writeup related to bug 62282 likely from late 2023 - 
https://pad.ceph.com/p/RCA_62282

email sent 13 Sept 2023 in mail list discussion of cannot create new osd - 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/5M4QAXJDCNJ74XVIBIFSHHNSETCCKNMC/

comment #9 in bug 58530 likely from early 2023 - 
https://tracker.ceph.com/issues/58530#note-9

email sent 30 Sept 2021 in mail list discussion of flapping osds - 
https://www.mail-archive.com/ceph-users@ceph.io/msg13072.html

email sent 25 Feb 2020 in mail list discussion of changing allocation size - 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/B3DGKH6THFGHALLX6ATJ4GGD4SVFNEKU/


Current situation
-
We have three Ceph clusters that were originally built via cephadm on octopus 
and later upgraded to pacific. All osds are HDD (will be moving to wal+db on 
SSD) and were resharded after the upgrade to enable rocksdb sharding. 

The value for bluefs_shared_alloc_size has remained unchanged at 65535. 

The value for bluestore_min_alloc_size_hdd was 65535 in octopus but is reported 
as 4096 by ceph daemon osd. config show in pacific. However, the osd label 
after upgrading to pacific retains the value of 65535 for bfm_bytes_per_block. 
BitmapFreelistManager.h in Ceph source code 
(src/os/bluestore/BitmapFreelistManager.h) indicates that bytes_per_block is 
bdev_block_size.  This indicates that the physical layout of the osd has not 
changed from 65535 despite the return of the ceph dameon command reporting it 
as 4096. This interpretation is supported by the Minimum Allocation Size part 
of the Bluestore configuration reference for quincy 
(https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#minimum-allocation-size)


Questions
--
What are the pros and cons of the following three cases with two variations per 
case - when using co-located wal+db on HDD and when using separate wal+db on 
SSD:
1) bluefs_shared_alloc_size, bluestore_min_alloc_size, and bfm_bytes_per_block 
all equal
2) bluefs_shared_alloc_size greater than but a multiple of 
bluestore_min_alloc_size with bfm_bytes_per_block equal to 
bluestore_min_alloc_size
3) bluefs_shared_alloc_size greater than but a multiple of 
bluestore_min_alloc_size with bfm_bytes_per_block equal to 
bluefs_shared_alloc_size
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg repair doesn't fix "got incorrect hash on read" / "candidate had an ec hash mismatch"

2024-03-06 Thread Kai Stian Olstad

Hi Eugen, thank you for the reply.

The OSD was drained over the weekend, so OSD 223 and 269 have only the 
problematic PG 404.bc.


I don't think moving the PG would help since I don't have any empty OSD 
to move it to, and a move would not fix the hash mismatch.
The reason I just want to have the problematic PG on the OSDs is to 
reduce recovery time.
I would need to set min_size to 4 in an EC 4+2, and stop them both at 
the same time to force a rebuild of the corrupted part of PG that is on 
osd 223 and 269, since repair doesn't fix it.


I'm debating with myself if I should
1. Stop both OSD 223 and 269,
2. Just one of them.

Stopping them both, I'm guarantied that part of the PG on 223 and 269 is 
rebuild from the 4 other, 297, 276, 136 and 197 that doesn't have any 
errors.


OSD 223 is the master in the EC, pg 404.bc acting 
[223,297,269,276,136,197]
So maybe just stop that one, wait for recovery and the run deep-scrub to 
check if things look better.

But would it then use corrupted data on osd 269 to rebuild.


-
Kai Stian Olstad



On 26.02.2024 10:19, Eugen Block wrote:

Hi,

I think your approach makes sense. But I'm wondering if moving only  
the problematic PGs to different OSDs could have an effect as well. I  
assume that moving the 2 PGs is much quicker than moving all BUT those  
2 PGs. If that doesn't work you could still fall back to draining the  
entire OSDs (except for the problematic PG).


Regards,
Eugen

Zitat von Kai Stian Olstad :


Hi,

No one have any comment at all?
I'm not picky so any speculation, guessing, I would, I wouldn't,  
should work and so one would be highly appreciated.



Since 4 out of 6 in EC 4+2 is OK and ceph pg repair doesn't solve it  
I think the following might work.


pg 404.bc acting [223,297,269,276,136,197]

- Use pgremapper to move all PG on OSD 223 and 269 except 404.bc to  
other OSD.
- Set min_since to 4, ceph osd pool set default.rgw.buckets.data 
min_size 4

- Stop osd 223 and 269

What I hope will happen is that Ceph then recreate 404.bc shard  
s0(osd.223) and s2(osd.269) since they are now down from the  
remaining shards

s1(osd.297), s3(osd.276), s4(osd.136) and s5(osd.197)


_Any_ comment is highly appreciated.

-
Kai Stian Olstad


On 21.02.2024 13:27, Kai Stian Olstad wrote:

Hi,

Short summary

PG 404.bc is an EC 4+2 where s0 and s2 report hash mismtach for 698 
objects.
Ceph pg repair doesn't fix it, because if you run deep-srub on the  
PG after repair is finished, it still report scrub errors.


Why can't ceph pg repair repair this, it has 4 out of 6 should be  
able to reconstruct the corrupted shards?
Is there a way to fix this? Like delete object s0 and s2 so it's  
forced to recreate them?



Long detailed summary

A short backstory.
* This is aftermath of problems with mclock, post "17.2.7:  
Backfilling deadlock / stall / stuck / standstill" [1].

 - 4 OSDs had a few bad sectors, set all 4 out and cluster stopped.
 - Solution was to swap from mclock to wpq and restart alle OSD.
 - When all backfilling was finished all 4 OSD was replaced.
 - osd.223 and osd.269 was 2 of the 4 OSDs that was replaced.


PG / pool 404 is EC 4+2 default.rgw.buckets.data

9 days after the osd.223 og osd.269 was replaced, deep-scub was run  
and reported errors

   ceph status
   ---
   HEALTH_ERR 1396 scrub errors; Possible data damage: 1 pg 
inconsistent

   [ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
   [ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
   pg 404.bc is active+clean+inconsistent, acting  
[223,297,269,276,136,197]


I then run repair
   ceph pg repair 404.bc

And ceph status showed this
   ceph status
   ---
   HEALTH_WARN Too many repaired reads on 2 OSDs
   [WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
   osd.223 had 698 reads repaired
   osd.269 had 698 reads repaired

But osd.223 and osd.269 is new disks and the disks has no SMART  
error or any I/O error in OS logs.

So I tried to run deep-scrub again on the PG.
   ceph pg deep-scrub 404.bc

And got this result.

   ceph status
   ---
   HEALTH_ERR 1396 scrub errors; Too many repaired reads on 2 OSDs;  
Possible data damage: 1 pg inconsistent

   [ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
   [WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
   osd.223 had 698 reads repaired
   osd.269 had 698 reads repaired
   [ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
   pg 404.bc is  active+clean+scrubbing+deep+inconsistent+repair, 
acting  [223,297,269,276,136,197]


698 + 698 = 1396 so the same amount of errors.

Run repair again on 404.bc and ceph status is

   HEALTH_WARN Too many repaired reads on 2 OSDs
   [WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
   osd.223 had 1396 reads repaired
   osd.269 had 1396 reads repaired

So even when repair finish it doesn't fix the problem since they  
reappear again after a deep-scrub.


The log for osd.223 and osd.269 contain "got 

[ceph-users] Re: ambigous mds behind on trimming and slowops (ceph 17.2.5 and rook operator 1.10.8)

2024-03-06 Thread a . warkhade98
Thanks Dhairya for response.

its ceph 17.2.5
I don't have exact output for ceph -s currently as it is past issue.but it was 
like below and all PGs were active + clean AFAIR
 
mds slow requests
Mds behind on trimming

don't know root cause why mds was crashed but i am suspecting its something to 
do with active mon failure.
before the crash 2 nodes consisting 2 active mons were restarted causing those 
mon pods to restart
both nodes were restarted one by one for some maintenance activity.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph Quincy to Reef non cephadm upgrade

2024-03-06 Thread sarda . ravi
I want to perform non cephadm upgrade from Quincy to Reef. Reason for not using 
cephadm is do not want to go for ceph in containers.

My test deployment is as given below.

Total cluster hosts : 5
ceph-mon hosts: 3
ceph-mgr hosts: 3 (ceph-mgr active on one node, and other ceph-mgr each on 
ceph-mon host)
ceph-mds : 1
ceph-osd : 5 (one ceph-osd on each of the host in the cluster.)

While I try to follow the steps - 
https://docs.ceph.com/en/latest/releases/reef/#upgrading-non-cephadm-clusters - 
on the step - Upgrade monitors by installing the new packages and restarting 
the monitor daemons. when I try to upgrade only ceph-mon using "apt upgrade 
ceph-mon" command it upgrades all packages including ceph-mgr, ceph-mds, 
ceph-osd etc. as ceph-mon package has dependency on these packages.

My question is - does this mean I need to upgrade all ceph packages (ceph, 
ceph-common) and restart only monitor daemon first? Or there is any way I can 
upgrade only ceph-mon pacakge first, then ceph-mgr, ceph-osd and so on?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: change ip node and public_network in cluster

2024-03-06 Thread farhad khedriyan
Hi, thanks But this document is for old versions and cannot be used for the 
container version .
I used the reef version and when I retrieve the monmap and edit I can't use 
ceph-mon to change the monmap or any way to change
I tried to solve this problem by adding a new node and deleting the old nodes, 
but because the monmap has not changed, it does not allow adding a new node 
with a different subnet.
is ther a way for this change in container versions?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PG damaged "failed_repair"

2024-03-06 Thread Romain Lebbadi-Breteau

Hi,

We're a student club from Montréal where we host an Openstack cloud with 
a Ceph backend for storage of virtual machines and volumes using rbd.


Two weeks ago we received an email from our ceph cluster saying that 
some pages were damaged. We ran "sudo ceph pg repair " but then 
there was an I/O error on the disk during the recovery ("An 
unrecoverable disk media error occurred on Disk 4 in Backplane 1 of 
Integrated RAID Controller 1." and "Bad block medium error is detected 
at block 0x1377e2ad on Virtual Disk 3 on Integrated RAID Controller 1." 
messages on iDRAC).


After that, the PG we tried to repair was in the state 
"active+recovery_unfound+degraded". After a week, we ran the command 
"sudo ceph pg 2.1b mark_unfound_lost revert" to try to recover the 
damaged PG. We tried to boot the virtual machine that had crashed 
because of this incident, but the volume seemed to have been completely 
erased, the "mount" command said there was no filesystem on it, so we 
recreated the VM from a backup.


A few days later, the same PG was once again damaged, and since we knew 
the physical disk on the OSD hosting one part of the PG had problems, we 
tried to "out" the OSD from the cluster. That resulted in the two other 
OSDs hosting copies of the problematic PG to go down, which caused 
timeouts on our virtual machines, so we put the OSD back in.


We then tried to repair the PG again, but that failed and the PG is now 
"active+clean+inconsistent+failed_repair", and whenever it goes down, 
two other OSDs from two other hosts go down too after a few minutes, so 
it's impossible to replace the disk right now, even if we have new ones 
available.


We have backups for most of our services, but it would be very 
disrupting to delete the whole cluster, and we don't know that to do 
with the broken PG and the OSD that can't be shut down.


Any help would be really appreciated, we're not experts with Ceph and 
Openstack, and it's likely we handled things wrong at some point, but we 
really want to go back to a healthy Ceph.


Here are some information about our cluster :

romain:step@alpha-cen ~  $ sudo ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 2.1b is active+clean+inconsistent+failed_repair, acting [3,11,0]

romain:step@alpha-cen ~  $ sudo ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1 70.94226  root default
-7 20.00792  host alpha-cen
 3    hdd   1.81879  osd.3   up   1.0  1.0
 6    hdd   1.81879  osd.6   up   1.0  1.0
12    hdd   1.81879  osd.12  up   1.0  1.0
13    hdd   1.81879  osd.13  up   1.0  1.0
15    hdd   1.81879  osd.15  up   1.0  1.0
16    hdd   9.09520  osd.16  up   1.0  1.0
17    hdd   1.81879  osd.17  up   1.0  1.0
-5 23.64874  host beta-cen
 1    hdd   5.45749  osd.1   up   1.0  1.0
 4    hdd   5.45749  osd.4   up   1.0  1.0
 8    hdd   5.45749  osd.8   up   1.0  1.0
11    hdd   5.45749  osd.11  up   1.0  1.0
14    hdd   1.81879  osd.14  up   1.0  1.0
-3 27.28560  host gamma-cen
 0    hdd   9.09520  osd.0   up   1.0  1.0
 5    hdd   9.09520  osd.5   up   1.0  1.0
 9    hdd   9.09520  osd.9   up   1.0  1.0

romain:step@alpha-cen ~  $ sudo rados list-inconsistent-obj 2.1b
{"epoch":9787,"inconsistents":[]}

romain:step@alpha-cen ~  $ sudo ceph pg 2.1b query

https://pastebin.com/gsKCPCjr

Best regards,

Romain Lebbadi-Breteau
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph commands on host cannot connect to cluster after cephx disabling

2024-03-06 Thread service . plant
Hello everybody,
Suddenly faced with a problem with (probably) authorization playing with cephx.
So, long story short:
1) Rollout completely new testing cluster by cephadm with only one node
2) According to docs I've set this to /etc/ceph/ceph.conf
  auth_cluster_required = none
  auth_service_required = none
  auth_client_required = none
3) restart ceph.target
4) now even "ceph -s " cannot connect to RADOS saying
root@ceph1:/etc/ceph# ceph -s
  2024-02-24T18:15:59.219+ 7f7c10d65700 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]
  2024-02-24T18:15:59.219+ 7f7c11d67700  0 librados: client.admin 
authentication error (13) Permission denied
  [errno 13] RADOS permission denied (error connecting to the cluster)
4) I have ceph.client.admin.keyring in both /etc/ceph and 
/var/lib/ceph/$fsid/config 
5) logs of monitor doesnt show any error. It looks like it keeps normal living 
and even doesn't know that something goes wrong
6) Tried to set back /etc/ceph/ceph.conf to
  auth_cluster_required = cephx
  auth_service_required = cephx
  auth_client_required = cephx
with no success
7) I have noted that some process (I guess it is one of processes in 
containers?) always rewrite /etc/ceph/ceph.cong and 
/var/lib/ceph/$fsid/config/ceph.conf whatever I woud write there. What is the 
process? Who is it? How to set up settings if I want to keep it in the file??

Ubuntu 20.04, Reef, 18.0.2

Thanks in advance.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph is constantly scrubbing 1/4 of all PGs and still have pigs not scrubbed in time

2024-03-06 Thread thymus_03fumbler
I recently switched from 16.2.x to 18.2.x and migrated to cephadm, since the 
switch the cluster is constantly scrubbing, 24/7 up to 50 PGs simultaneously 
and up to 20 deep scrubs simultaneously in a cluster that has only 12 (in use) 
OSDs.
Furthermore it still manages to regularly have a warning with ‘pgs not scrubbed 
in time’

I have tried various settings, like osd_deep_scrub_interval, osd_max_scrubs, 
mds_max_scrub_ops_in_progress etc.
All those get ignored.

Please advice.

Here is an output of ceos config dump:

WHO MASK  LEVEL OPTION   VALUE  

RO
globaladvanced  auth_client_required cephx  

*
globaladvanced  auth_cluster_requiredcephx  

*
globaladvanced  auth_service_requiredcephx  

*
globaladvanced  auth_supported   cephx  

*
globalbasic container_image  
quay.io/ceph/ceph@sha256:aca35483144ab3548a7f670db9b79772e6fc51167246421c66c0bd56a6585468
  *
globalbasic device_failure_prediction_mode   local
globaladvanced  mon_allow_pool_deletetrue
globaladvanced  mon_data_avail_warn  20
globaladvanced  mon_max_pg_per_osd   400
globaladvanced  osd_max_pg_per_osd_hard_ratio
10.00
globaladvanced  osd_pool_default_pg_autoscale_mode   on
mon   advanced  auth_allow_insecure_global_id_reclaimfalse
mon   advanced  mon_crush_min_required_version   
firefly 
   *
mon   advanced  mon_warn_on_pool_no_redundancy   false
mon   advanced  public_network   
10.79.0.0/16
   *
mgr   advanced  mgr/balancer/active  true
mgr   advanced  mgr/balancer/modeupmap
mgr   advanced  mgr/cephadm/manage_etc_ceph_ceph_conf_hosts  
label:admin 
   *
mgr   advanced  mgr/cephadm/migration_current6  

*
mgr   advanced  mgr/dashboard/GRAFANA_API_PASSWORD   admin  

*
mgr   advanced  mgr/dashboard/GRAFANA_API_SSL_VERIFY false  

*
mgr   advanced  mgr/dashboard/GRAFANA_API_URL
https://10.79.79.12:3000
   *
mgr   advanced  mgr/dashboard/PROMETHEUS_API_HOST
http://10.79.79.12:9095 
   *
mgr   advanced  mgr/devicehealth/enable_monitoring   true
mgr   advanced  mgr/orchestrator/orchestratorcephadm
osd   advanced  osd_map_cache_size   250
osd   advanced  osd_map_share_max_epochs 50
osd   advanced  osd_mclock_profile   
high_client_ops
osd   advanced  osd_pg_epoch_persisted_max_stale 50
osd.0 basic osd_mclock_max_capacity_iops_hdd 
380.869888
osd.1 basic osd_mclock_max_capacity_iops_hdd 
441.00
osd.10basic osd_mclock_max_capacity_iops_ssd 
13677.906485
osd.11basic osd_mclock_max_capacity_iops_hdd 
274.411212
osd.13basic osd_mclock_max_capacity_iops_hdd 
198.492501
osd.2 basic osd_mclock_max_capacity_iops_hdd 
251.592009
osd.3 basic osd_mclock_max_capacity_iops_hdd 
208.197434
osd.4 basic osd_mclock_max_capacity_iops_hdd 
196.544082
osd.5 basic osd_mclock_max_capacity_iops_ssd 
12739.225456
osd.6 basic osd_mclock_max_capacity_iops_hdd 
211.288660

[ceph-users] Re: Ceph-storage slack access

2024-03-06 Thread Marc
Is it possible to access this also with xmpp?

> 
> At the very bottom of this page is a link
> https://ceph.io/en/community/connect/
> 
> Respectfully,
> 
> *Wes Dillingham*
> w...@wesdillingham.com
> LinkedIn 
> 
> 
> On Wed, Mar 6, 2024 at 11:45 AM Matthew Vernon 
> wrote:
> 
> > Hi,
> >
> > How does one get an invite to the ceph-storage slack, please?
> >
> > Thanks,
> >
> > Matthew
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-storage slack access

2024-03-06 Thread Gregory Farnum
On Wed, Mar 6, 2024 at 8:56 AM Matthew Vernon  wrote:
>
> Hi,
>
> On 06/03/2024 16:49, Gregory Farnum wrote:
> > Has the link on the website broken? https://ceph.com/en/community/connect/
> > We've had trouble keeping it alive in the past (getting a non-expiring
> > invite), but I thought that was finally sorted out.
>
> Ah, yes, that works. Sorry, I'd gone to
> https://docs.ceph.com/en/latest/start/get-involved/
>
> which lacks the registration link.

Whoops! Can you update that, docs master Zac? :)
-Greg
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-storage slack access

2024-03-06 Thread Matthew Vernon

Hi,

On 06/03/2024 16:49, Gregory Farnum wrote:

Has the link on the website broken? https://ceph.com/en/community/connect/
We've had trouble keeping it alive in the past (getting a non-expiring
invite), but I thought that was finally sorted out.


Ah, yes, that works. Sorry, I'd gone to
https://docs.ceph.com/en/latest/start/get-involved/

which lacks the registration link.

Regards,

Matthew
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-storage slack access

2024-03-06 Thread Wesley Dillingham
At the very bottom of this page is a link
https://ceph.io/en/community/connect/

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Wed, Mar 6, 2024 at 11:45 AM Matthew Vernon 
wrote:

> Hi,
>
> How does one get an invite to the ceph-storage slack, please?
>
> Thanks,
>
> Matthew
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-storage slack access

2024-03-06 Thread Gregory Farnum
Has the link on the website broken? https://ceph.com/en/community/connect/
We've had trouble keeping it alive in the past (getting a non-expiring
invite), but I thought that was finally sorted out.
-Greg

On Wed, Mar 6, 2024 at 8:46 AM Matthew Vernon  wrote:
>
> Hi,
>
> How does one get an invite to the ceph-storage slack, please?
>
> Thanks,
>
> Matthew
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph-storage slack access

2024-03-06 Thread Matthew Vernon

Hi,

How does one get an invite to the ceph-storage slack, please?

Thanks,

Matthew
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph storage project for virtualization

2024-03-06 Thread egoitz
Hi Eneko! 

Sorry for the delay answering. Thank you so much for your time really
mate :) . I answer you in green bold for instance between your lines for
better understanding what I'm talking about.

moving forward below! 

El 2024-03-05 12:26, Eneko Lacunza escribió:

> Hi Egoitz,
> 
> I don't think it is a good idea, but can't comment about if that's possible 
> because I don't know well enough Ceph's inner workings, maybe others can 
> comment.
> 
> This is what worries me:
> "
> 
> ach NFS redundant service of each datacenter will be composed by two
> NFS gateways accessing to the OSDs of the placement group located in the
> own datacenter. I planned achieving this with OSD weights and getting
> with that the fact that the crush algorithm to build the map so that
> each datacenter accesses end up having as master, the OSD of the own
> datacenter in the placement group. Obviously, slave OSD replicas will
> exist in the other three datacenters or even I don't discard the fact of
> using erasure coding in some manner.
> 
> " 
> 
> First, I don't think you got OSD weights right. 
> 
> YOU MEAN I DON'T HAVE THE CONCEPT CLEAR OR... THAT WEIGHTS DOESN'T WORK THAT 
> WAY OR... PLEASE TELL ME FOR UNDERSTANDING WHAT YOU MEAN :) . IN CASE I DON'T 
> HAVE THE CONCEPT CLEAR :) I'LL GO BACK TO READ ABOUT IT :) 
> 
> Also, any write will be synchronous to the replicas so that's why I asked 
> about latencies first. You may be able to read from DC-local "master" pgs (I 
> recall someone doing this with host-local pgs...) 
> 
> YOU MEAN HERE THAT AS I/O IS SYNCHRONOUS THE LATENCY IS EXTREMELY IMPORTANT 
> WETHER YOU JUST ACCESS TO THE SAME DATACENTER OR NOT FROM THAT DATACENTER 
> HIPERVISORS
> 
> In the best case you'll have your data in a corner-case configuration, which 
> may trigger strange bugs and/or behaviour not seen elsewhere. 
> 
> I wouldn't like to be in such a position, but I don't know how valuable your 
> data is... 
> 
> I SEE... YES THE INFO IS ESSENTIAL REALLY 
> 
> PERHAPS THEN I WOULD HAVE ALL THE OSD SERVERS IN THE SAME DATACENTER FOR 
> AVOIDING THAT DELAYS AND THAT EXTRANGE ISSUES THAT COULD HAPPEN WITH MY 
> ORIGINAL IDEA...
> 
> I think it would be best to determine inter-DC network latency first; if you 
> can choose DCs, then choose wisely with low enough latency ;) Then see if a 
> regular Ceph storage configuration will give you good enough performance. 
> 
> UNDERSTOOD YES!! 
> 
> Another option would be to run DC-local ceph storages and to mirror to other 
> DC. 
> 
> THIS IS VALID TOO... ALTHOUGH THAT WOULD BE SYNCHRONOUS?. I MEAN THE MIRROR?.
> 
> Cheers 
> 
> THANKS A LOT MATE!! REALLY!!!
> 
> El 5/3/24 a las 11:50, ego...@ramattack.net escribió: Hi Eneko!
> 
> I don't really have that data but I was planning to have as master OSD
> only the ones in the same datacenter as the hypervisor using the
> storage. The other datacenters would be just replicas. I assume you ask
> it because replication is totally synchronous.
> 
> Well for doing step by step. Imagine for the moment, the point of
> failure is a rack and all the replicas will be in the same datacenter in
> different racks and rows. In this case the latency should be acceptable
> and low.
> 
> My question was more related to the redundant nfs and if you have some
> experience with similar setups. I was trying to know if first is
> feasible what I'm planning to do.
> 
> Thank you so much :)
> 
> Cheers!
> 
> El 2024-03-05 11:43, Eneko Lacunza escribió:
> 
> Hi Egoitz,
> 
> What network latency between datacenters?
> 
> Cheers
> 
> El 5/3/24 a las 11:31, ego...@ramattack.net escribió:
> 
> Hi!
> 
> I have been reading some ebooks of Ceph and some doc and learning about
> it. The goal of all it, is the fact of creating a rock solid storage por
> virtual machines. After all the learning I have not been able to answer
> by myself to this question so I was wondering if perhaps you could
> clarify my doubt.
> 
> Let's imagine three datacenters, each one with for instance, 4
> virtualization hosts. As I was planning to build a solution for diferent
> hypervisors I have been thinking in the following env.
> 
> - I planed to have my Ceph storage (with different pools inside) with
> OSDs in three different datacenters (as failure point).
> 
> - Each datacenter's hosts, will be accessing to a NFS redundant service
> in the own datacenter.
> 
> - Each NFS redundant service of each datacenter will be composed by two
> NFS gateways accessing to the OSDs of the placement group located in the
> own datacenter. I planned achieving this with OSD weights and getting
> with that the fact that the crush algorithm to build the map so that
> each datacenter accesses end up having as master, the OSD of the own
> datacenter in the placement group. Obviously, slave OSD replicas will
> exist in the other three datacenters or even I don't discard the fact of
> using erasure coding in some manner.
> 
> - The NFS gateways could be a NFS redundant 

[ceph-users] Re: reef 18.2.2 (hot-fix) QE validation status

2024-03-06 Thread Patrick Donnelly
On Wed, Mar 6, 2024 at 2:55 AM Venky Shankar  wrote:
>
> +Patrick Donnelly
>
> On Tue, Mar 5, 2024 at 9:18 PM Yuri Weinstein  wrote:
> >
> > Details of this release are summarized here:
> >
> > https://tracker.ceph.com/issues/64721#note-1
> > Release Notes - TBD
> > LRC upgrade - TBD
> >
> > Seeking approvals/reviews for:
> >
> > smoke - in progress
> > rados - Radek, Laura?
> > quincy-x - in progress
>
> I think
>
> https://github.com/ceph/ceph/pull/55669
>
> was supposed to be included in this hotfix (I recall Patrick
> mentioning this in last week's CLT). The change was merged into reef
> head last week.

That's fixing a bug in reef HEAD not v18.2.1 so no need to get into this hotfix.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to build ceph without QAT?

2024-03-06 Thread Ilya Dryomov
On Wed, Mar 6, 2024 at 7:41 AM Feng, Hualong  wrote:
>
> Hi Dongchuan
>
> Could I know which version or which commit that you are building and your 
> environment: system, CPU, kernel?
>
> ./do_cmake.sh -DCMAKE_BUILD_TYPE=RelWithDebInfo   this command should be OK 
> without QAT.

Hi Hualong,

I don't think this is true.  In main, both WITH_QATLIB and WITH_QATZIP
default to ON unless the system is aarch64.  IIRC I needed to append -D
WITH_QATLIB=OFF -D WITH_QATZIP=OFF to build without QAT.

Thanks,

Ilya

>
> Thanks
> -Hualong
>
> > -Original Message-
> > From: 张东川 
> > Sent: Wednesday, March 6, 2024 9:51 AM
> > To: ceph-users 
> > Subject: [ceph-users] How to build ceph without QAT?
> >
> > Hi guys,
> >
> >
> > I tried both following commands.
> > Neither of them worked.
> >
> >
> > "./do_cmake.sh -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_QAT=OFF
> > -DWITH_QATDRV=OFF -DWITH_QATZIP=OFF"
> > "ARGS="-DWITH_QAT=OFF -DWITH_QATDRV=OFF -
> > DWITH_QATZIP=OFF" ./do_cmake.sh -
> > DCMAKE_BUILD_TYPE=RelWithDebInfo"
> >
> >
> > I still see errors like:
> > make[1]: *** [Makefile:4762:
> > quickassist/lookaside/access_layer/src/sample_code/performance/framew
> > ork/linux/user_space/cpa_sample_code-cpa_sample_code_utils.o] Error 1
> >
> >
> >
> >
> > So what's the proper way to configure build commands?
> > Thanks a lot.
> >
> >
> > Best Regards,
> > Dongchuan
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email
> > to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitoring Ceph Bucket and overall ceph cluster remaining space

2024-03-06 Thread Michael Worsham
SW is SolarWinds (www.soparwinds.com), a network and application monitoring and 
alerting platform.

It's not very open source at all, but it's what we use for monitoring all of 
our physical and virtual servers, network switches, SAN and NAS devices, and 
anything else with a network card in it.

From: Konstantin Shalygin 
Sent: Wednesday, March 6, 2024 1:39:43 AM
To: Michael Worsham 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: Monitoring Ceph Bucket and overall ceph cluster 
remaining space

This is an external email. Please take care when clicking links or opening 
attachments. When in doubt, check with the Help Desk or Security.


Hi,

Don't aware about what is SW, but if this software works with Prometheus 
metrics format - why not. Anyway the exporters are open source, you can modify 
the existing code for your environment


k

Sent from my iPhone

> On 6 Mar 2024, at 07:58, Michael Worsham  wrote:
>
> This looks interesting, but instead of Prometheus, could the data be exported 
> for SolarWinds?
>
> The intent is to have SW watch the available storage space allocated and then 
> to alert when a certain threshold is reached (75% remaining for a warning; 
> 95% remaining for a critical).

This message and its attachments are from Data Dimensions and are intended only 
for the use of the individual or entity to which it is addressed, and may 
contain information that is privileged, confidential, and exempt from 
disclosure under applicable law. If the reader of this message is not the 
intended recipient, or the employee or agent responsible for delivering the 
message to the intended recipient, you are hereby notified that any 
dissemination, distribution, or copying of this communication is strictly 
prohibited. If you have received this communication in error, please notify the 
sender immediately and permanently delete the original email and destroy any 
copies or printouts of this email as well as any attachments.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck

2024-03-06 Thread Eugen Block
Okay, so the first thing I would do is to stop the upgrade. Then make  
sure that you have two running MGRs with the current version of the  
rest of the cluster (.1). If no other daemons have been upgraded it  
shouldn't be a big issue. If necessary you can modify the unit.run  
file and specify there the container image for the MGRs. If they both  
start successfully try an upgrade to 16.2.15 (was just released this  
week) instead of 16.2.2.


Zitat von Edouard FAZENDA :


Dear Eugen,

I have removed one mgr on the node 3 , the second one is still  
crashlooping and on node 1 mgr is in 16.2.2


Not sure to understand your workaround.

* Stopping current upgrade to rollback if possible  and afterward  
upgrading to latest release of pacific ?


Best Regards,



Edouard FAZENDA
Technical Support



Chemin du Curé-Desclouds 2, CH-1226 THONEX  +41 (0)22 869 04 40

www.csti.ch

-Original Message-
From: Eugen Block 
Sent: mercredi, 6 mars 2024 10:47
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck

There was another issue when having more than two MGRs, maybe you're  
hitting that (https://tracker.ceph.com/issues/57675,
https://github.com/ceph/ceph/pull/48258). I believe my workaround  
was to set the global config to a newer image (target version) and  
then deployed a new mgr.



Zitat von Edouard FAZENDA :


The process has now started but I have the following error on mgr to
the second node



root@rke-sh1-1:~# ceph orch ps

NAME  HOST   PORTSSTATUS
REFRESHED  AGE  VERSION  IMAGE ID  CONTAINER ID

crash.rke-sh1-1   rke-sh1-1   running (12d)  41s ago
12d  16.2.1   c757e4a3636b  e8652edb2b49

crash.rke-sh1-2   rke-sh1-2   running (12d)  2s ago
20M  16.2.1   c757e4a3636b  a1249a605ee0

crash.rke-sh1-3   rke-sh1-3   running (12d)  41s ago
12d  16.2.1   c757e4a3636b  026667bc1776

mds.cephfs.rke-sh1-1.ojmpnk   rke-sh1-1   running (12d)  41s ago
5M   16.2.1   c757e4a3636b  9b4c2b08b759

mds.cephfs.rke-sh1-2.isqjza   rke-sh1-2   running (12d)  2s ago
23M  16.2.1   c757e4a3636b  71681a5f34d3

mds.cephfs.rke-sh1-3.vdicdn   rke-sh1-3   running (12d)  41s ago
4M   16.2.1   c757e4a3636b  e89946ad6b7e

mgr.rke-sh1-1.qskoyj  rke-sh1-1  *:8082,9283  running (66m)  41s ago
2y   16.2.2   5e237c38caa6  123cabbc2994

mgr.rke-sh1-2.lxmguj  rke-sh1-2  *:8082,9283  running (6s)   2s ago
22M  16.2.2   5e237c38caa6  b2a9047be1d6

mgr.rke-sh1-3.ckunvo  rke-sh1-3  *:8082,9283  running (12d)  41s ago
7M   16.2.1   c757e4a3636b  2fcaf18f3218

mon.rke-sh1-1 rke-sh1-1   running (37m)  41s ago
37m  16.2.1   c757e4a3636b  84e63e0415a8

mon.rke-sh1-2 rke-sh1-2   running (12d)  2s ago
4M   16.2.1   c757e4a3636b  f4b32ba4466b

mon.rke-sh1-3 rke-sh1-3   running (12d)  41s ago
12d  16.2.1   c757e4a3636b  d5e44c245998

osd.0 rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  7b0e69942c15

osd.1 rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  4451654d9a2d

osd.10rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  3f9d5f95e284

osd.11rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  db1cc6d2e37f

osd.12rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  de416c1ef766

osd.13rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  25a281cc5a9b

osd.14rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  62f25ba61667

osd.15rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  d3514d823c45

osd.16rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  bba857759bfe

osd.17rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  59281d4bb3d0

osd.2 rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  418041b5e60d

osd.3 rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  04a0e29d5623

osd.4 rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  1cc78a5153d3

osd.5 rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  39a4b11e31fb

osd.6 rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  2f218ffb566e

osd.7 rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  cf761fbe4d5f

osd.8 rke-sh1-3   

[ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck

2024-03-06 Thread Edouard FAZENDA
Dear Eugen,

I have removed one mgr on the node 3 , the second one is still crashlooping and 
on node 1 mgr is in 16.2.2

Not sure to understand your workaround.

* Stopping current upgrade to rollback if possible  and afterward upgrading to 
latest release of pacific ? 

Best Regards, 



Edouard FAZENDA
Technical Support
 


Chemin du Curé-Desclouds 2, CH-1226 THONEX  +41 (0)22 869 04 40
 
www.csti.ch

-Original Message-
From: Eugen Block  
Sent: mercredi, 6 mars 2024 10:47
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck

There was another issue when having more than two MGRs, maybe you're hitting 
that (https://tracker.ceph.com/issues/57675,
https://github.com/ceph/ceph/pull/48258). I believe my workaround was to set 
the global config to a newer image (target version) and then deployed a new mgr.


Zitat von Edouard FAZENDA :

> The process has now started but I have the following error on mgr to 
> the second node
>
>
>
> root@rke-sh1-1:~# ceph orch ps
>
> NAME  HOST   PORTSSTATUS
> REFRESHED  AGE  VERSION  IMAGE ID  CONTAINER ID
>
> crash.rke-sh1-1   rke-sh1-1   running (12d)  41s ago
> 12d  16.2.1   c757e4a3636b  e8652edb2b49
>
> crash.rke-sh1-2   rke-sh1-2   running (12d)  2s ago
> 20M  16.2.1   c757e4a3636b  a1249a605ee0
>
> crash.rke-sh1-3   rke-sh1-3   running (12d)  41s ago
> 12d  16.2.1   c757e4a3636b  026667bc1776
>
> mds.cephfs.rke-sh1-1.ojmpnk   rke-sh1-1   running (12d)  41s ago
> 5M   16.2.1   c757e4a3636b  9b4c2b08b759
>
> mds.cephfs.rke-sh1-2.isqjza   rke-sh1-2   running (12d)  2s ago
> 23M  16.2.1   c757e4a3636b  71681a5f34d3
>
> mds.cephfs.rke-sh1-3.vdicdn   rke-sh1-3   running (12d)  41s ago
> 4M   16.2.1   c757e4a3636b  e89946ad6b7e
>
> mgr.rke-sh1-1.qskoyj  rke-sh1-1  *:8082,9283  running (66m)  41s ago
> 2y   16.2.2   5e237c38caa6  123cabbc2994
>
> mgr.rke-sh1-2.lxmguj  rke-sh1-2  *:8082,9283  running (6s)   2s ago
> 22M  16.2.2   5e237c38caa6  b2a9047be1d6
>
> mgr.rke-sh1-3.ckunvo  rke-sh1-3  *:8082,9283  running (12d)  41s ago
> 7M   16.2.1   c757e4a3636b  2fcaf18f3218
>
> mon.rke-sh1-1 rke-sh1-1   running (37m)  41s ago
> 37m  16.2.1   c757e4a3636b  84e63e0415a8
>
> mon.rke-sh1-2 rke-sh1-2   running (12d)  2s ago
> 4M   16.2.1   c757e4a3636b  f4b32ba4466b
>
> mon.rke-sh1-3 rke-sh1-3   running (12d)  41s ago
> 12d  16.2.1   c757e4a3636b  d5e44c245998
>
> osd.0 rke-sh1-2   running (12d)  2s ago
> 3y   16.2.1   c757e4a3636b  7b0e69942c15
>
> osd.1 rke-sh1-3   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  4451654d9a2d
>
> osd.10rke-sh1-3   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  3f9d5f95e284
>
> osd.11rke-sh1-1   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  db1cc6d2e37f
>
> osd.12rke-sh1-2   running (12d)  2s ago
> 3y   16.2.1   c757e4a3636b  de416c1ef766
>
> osd.13rke-sh1-3   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  25a281cc5a9b
>
> osd.14rke-sh1-1   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  62f25ba61667
>
> osd.15rke-sh1-2   running (12d)  2s ago
> 3y   16.2.1   c757e4a3636b  d3514d823c45
>
> osd.16rke-sh1-3   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  bba857759bfe
>
> osd.17rke-sh1-1   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  59281d4bb3d0
>
> osd.2 rke-sh1-1   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  418041b5e60d
>
> osd.3 rke-sh1-2   running (12d)  2s ago
> 3y   16.2.1   c757e4a3636b  04a0e29d5623
>
> osd.4 rke-sh1-1   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  1cc78a5153d3
>
> osd.5 rke-sh1-3   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  39a4b11e31fb
>
> osd.6 rke-sh1-2   running (12d)  2s ago
> 3y   16.2.1   c757e4a3636b  2f218ffb566e
>
> osd.7 rke-sh1-1   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  cf761fbe4d5f
>
> osd.8 rke-sh1-3   running (12d)  41s ago
> 3y   16.2.1   c757e4a3636b  f9f85480e800
>
> osd.9 rke-sh1-2   running (12d)  2s ago
> 3y   16.2.1   c757e4a3636b  664c54ff46d2
>
> rgw.default.rke-sh1-1.dgucwl  rke-sh1-1  *:8000   running (12d)  41s ago
> 22M  16.2.1   c757e4a3636b  f03212b955a7
>
> 

[ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck

2024-03-06 Thread Edouard FAZENDA
Dear Eugen,

Thanks again for the help.

We wanted to go smoothly, as we have unfortunaltey not test clusters,  
effectively the risk to get a bad version is high, you are right we will see to 
upgrade to the latest of pacific for the next steps.

I have wait about 30 minutes.

Still looking why the mgr is crashlooping on the second node

Thanks for the help.


Edouard FAZENDA
Technical Support
 


Chemin du Curé-Desclouds 2, CH-1226 THONEX  +41 (0)22 869 04 40
 
www.csti.ch

-Original Message-
From: Eugen Block  
Sent: mercredi, 6 mars 2024 10:33
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck

Hi,

a couple of things.
First, is there any specific reason why you're upgrading from .1 to .2? Why not 
directly to .15? It seems unnecessary and you're risking upgrading to a "bad" 
version (I believe it was 16.2.7) if you're applying evey minor release. Or why 
not upgrading to Quincy or Reef directly?
Second, the error message has changed
(https://github.com/ceph/ceph/pull/41257) from

"could not verify host allowed virtual ips"

to

"does not belong to mon public_network".

I saw this just recently during an upgrade of a cluster I didn't deploy and it 
turned out to be a misconfiguration issue (mismatch between cephadm ssh-user 
and missing pub key). I recommend to verify the ssh connections.

ceph cephadm get-ssh-config
ceph cephadm get-user
ceph cephadm get-pub-key

It also could be just a timing issue according to Sage's statement in above PR:

> Oh! I know what the problem is. 1897d1c changed the way we store the 
> per-host network interface/network info. > On upgrade, cephadm thinks 
> there are no networks on each host until the device refresh happens.

How long did you wait?


Zitat von Edouard FAZENDA :

> Dear Ceph Community,
>
>
>
> I am in the process of upgrading ceph pacific 16.2.1 to 16.2.2 , I 
> have followed the documentation :
> https://docs.ceph.com/en/pacific/cephadm/upgrade/
>
>
>
> My cluster is in Healthy state , but the upgrade is not going forward 
> , as on the cephadm logs I have the following :
>
>
>
> # Ceph -W cephadm
>
> 2024-03-06T08:39:11.653447+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: 
> Need to upgrade myself (mgr.rke-sh1-1.qskoyj)
>
> 2024-03-06T08:39:12.281386+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: 
> Updating mgr.rke-sh1-2.lxmguj
>
> 2024-03-06T08:39:12.286096+ mgr.rke-sh1-1.qskoyj [INF] Deploying 
> daemon mgr.rke-sh1-2.lxmguj on rke-sh1-2
>
> 2024-03-06T08:39:19.347877+ mgr.rke-sh1-1.qskoyj [INF] Filtered 
> out host
> rke-sh1-1: could not verify host allowed virtual ips
>
> 2024-03-06T08:39:19.347989+ mgr.rke-sh1-1.qskoyj [INF] Filtered 
> out host
> rke-sh1-3: could not verify host allowed virtual ips
>
> 2024-03-06T08:39:19.366355+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: 
> Need to upgrade myself (mgr.rke-sh1-1.qskoyj)
>
> 2024-03-06T08:39:19.965822+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: 
> Updating mgr.rke-sh1-2.lxmguj
>
> 2024-03-06T08:39:19.969089+ mgr.rke-sh1-1.qskoyj [INF] Deploying 
> daemon mgr.rke-sh1-2.lxmguj on rke-sh1-2
>
> 2024-03-06T08:39:26.961455+ mgr.rke-sh1-1.qskoyj [INF] Filtered 
> out host
> rke-sh1-1: could not verify host allowed virtual ips
>
> 2024-03-06T08:39:26.961502+ mgr.rke-sh1-1.qskoyj [INF] Filtered 
> out host
> rke-sh1-3: could not verify host allowed virtual ips
>
> 2024-03-06T08:39:26.973897+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: 
> Need to upgrade myself (mgr.rke-sh1-1.qskoyj)
>
> 2024-03-06T08:39:27.623773+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: 
> Updating mgr.rke-sh1-2.lxmguj
>
> 2024-03-06T08:39:27.628115+ mgr.rke-sh1-1.qskoyj [INF] Deploying 
> daemon mgr.rke-sh1-2.lxmguj on rke-sh1-2
>
>
>
> My public_network is set :
>
>
>
> root@rke-sh1-1:~# ceph config dump  | grep public_network
>
>   mon  advanced  public_network
> 10.10.71.0/24
>
>  *
>
> Do you have an idea why I have the following error :
>
>
>
> Filtered out host: could not verify host allowed virtual ips
>
>
>
>
>
> Current state of the upgrade :
>
>
>
> # ceph orch upgrade status
>
> {
>
> "target_image":
> "docker.io/ceph/ceph@sha256:8cdd8c7dfc7be5865255f0d59c048a1fb8d1335f69
> 23996e
> 2c2d9439499b5cf2",
>
> "in_progress": true,
>
> "services_complete": [],
>
> "progress": "0/35 ceph daemons upgraded",
>
> "message": "Currently upgrading mgr daemons"
>
> }
>
>
>
>   progress:
>
> Upgrade to 16.2.2 (24m)
>
>   []
>
>
>
> Thanks for the help.
>
>
>
> Best Regards,
>
>
>
> Edouard FAZENDA
>
> Technical Support
>
>
>
>
>
>
>
> Chemin du Curé-Desclouds 2, CH-1226 THONEX  +41 (0)22 869 04 40
>
>
>
>   www.csti.ch


___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io


smime.p7s
Description: S/MIME 

[ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck

2024-03-06 Thread Eugen Block
There was another issue when having more than two MGRs, maybe you're  
hitting that (https://tracker.ceph.com/issues/57675,  
https://github.com/ceph/ceph/pull/48258). I believe my workaround was  
to set the global config to a newer image (target version) and then  
deployed a new mgr.



Zitat von Edouard FAZENDA :


The process has now started but I have the following error on mgr to the
second node



root@rke-sh1-1:~# ceph orch ps

NAME  HOST   PORTSSTATUS
REFRESHED  AGE  VERSION  IMAGE ID  CONTAINER ID

crash.rke-sh1-1   rke-sh1-1   running (12d)  41s ago
12d  16.2.1   c757e4a3636b  e8652edb2b49

crash.rke-sh1-2   rke-sh1-2   running (12d)  2s ago
20M  16.2.1   c757e4a3636b  a1249a605ee0

crash.rke-sh1-3   rke-sh1-3   running (12d)  41s ago
12d  16.2.1   c757e4a3636b  026667bc1776

mds.cephfs.rke-sh1-1.ojmpnk   rke-sh1-1   running (12d)  41s ago
5M   16.2.1   c757e4a3636b  9b4c2b08b759

mds.cephfs.rke-sh1-2.isqjza   rke-sh1-2   running (12d)  2s ago
23M  16.2.1   c757e4a3636b  71681a5f34d3

mds.cephfs.rke-sh1-3.vdicdn   rke-sh1-3   running (12d)  41s ago
4M   16.2.1   c757e4a3636b  e89946ad6b7e

mgr.rke-sh1-1.qskoyj  rke-sh1-1  *:8082,9283  running (66m)  41s ago
2y   16.2.2   5e237c38caa6  123cabbc2994

mgr.rke-sh1-2.lxmguj  rke-sh1-2  *:8082,9283  running (6s)   2s ago
22M  16.2.2   5e237c38caa6  b2a9047be1d6

mgr.rke-sh1-3.ckunvo  rke-sh1-3  *:8082,9283  running (12d)  41s ago
7M   16.2.1   c757e4a3636b  2fcaf18f3218

mon.rke-sh1-1 rke-sh1-1   running (37m)  41s ago
37m  16.2.1   c757e4a3636b  84e63e0415a8

mon.rke-sh1-2 rke-sh1-2   running (12d)  2s ago
4M   16.2.1   c757e4a3636b  f4b32ba4466b

mon.rke-sh1-3 rke-sh1-3   running (12d)  41s ago
12d  16.2.1   c757e4a3636b  d5e44c245998

osd.0 rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  7b0e69942c15

osd.1 rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  4451654d9a2d

osd.10rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  3f9d5f95e284

osd.11rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  db1cc6d2e37f

osd.12rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  de416c1ef766

osd.13rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  25a281cc5a9b

osd.14rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  62f25ba61667

osd.15rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  d3514d823c45

osd.16rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  bba857759bfe

osd.17rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  59281d4bb3d0

osd.2 rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  418041b5e60d

osd.3 rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  04a0e29d5623

osd.4 rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  1cc78a5153d3

osd.5 rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  39a4b11e31fb

osd.6 rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  2f218ffb566e

osd.7 rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  cf761fbe4d5f

osd.8 rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  f9f85480e800

osd.9 rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  664c54ff46d2

rgw.default.rke-sh1-1.dgucwl  rke-sh1-1  *:8000   running (12d)  41s ago
22M  16.2.1   c757e4a3636b  f03212b955a7

rgw.default.rke-sh1-1.vylchc  rke-sh1-1  *:8001   running (12d)  41s ago
22M  16.2.1   c757e4a3636b  da486ce43fe5

rgw.default.rke-sh1-2.dfhhfw  rke-sh1-2  *:8000   running (12d)  2s ago
2y   16.2.1   c757e4a3636b  ef4089d0aef2

rgw.default.rke-sh1-2.efkbum  rke-sh1-2  *:8001   running (12d)  2s ago
2y   16.2.1   c757e4a3636b  9e053d5a2f7b

rgw.default.rke-sh1-3.krfgey  rke-sh1-3  *:8001   running (12d)  41s ago
9M   16.2.1   c757e4a3636b  45cd3d75edd3

rgw.default.rke-sh1-3.pwdbmp  rke-sh1-3  *:8000   running (12d)  41s ago
9M   16.2.1   c757e4a3636b  e2710265a7f4



#tail -f
/var/log/ceph/fcb373ce-7aaa-11eb-984f-e7c6e0038e87/ceph-mgr.rke-sh1-2.lxmguj
.log

2024-03-06T09:24:42.468+ 7fe68b500700  0 [dashboard DEBUG root] setting

[ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck

2024-03-06 Thread Eugen Block

Hi,

a couple of things.
First, is there any specific reason why you're upgrading from .1 to  
.2? Why not directly to .15? It seems unnecessary and you're risking  
upgrading to a "bad" version (I believe it was 16.2.7) if you're  
applying evey minor release. Or why not upgrading to Quincy or Reef  
directly?
Second, the error message has changed  
(https://github.com/ceph/ceph/pull/41257) from


"could not verify host allowed virtual ips"

to

"does not belong to mon public_network".

I saw this just recently during an upgrade of a cluster I didn't  
deploy and it turned out to be a misconfiguration issue (mismatch  
between cephadm ssh-user and missing pub key). I recommend to verify  
the ssh connections.


ceph cephadm get-ssh-config
ceph cephadm get-user
ceph cephadm get-pub-key

It also could be just a timing issue according to Sage's statement in  
above PR:


Oh! I know what the problem is. 1897d1c changed the way we store the  
per-host network interface/network info. > On upgrade, cephadm  
thinks there are no networks on each host until the device refresh  
happens.


How long did you wait?


Zitat von Edouard FAZENDA :


Dear Ceph Community,



I am in the process of upgrading ceph pacific 16.2.1 to 16.2.2 , I have
followed the documentation :
https://docs.ceph.com/en/pacific/cephadm/upgrade/



My cluster is in Healthy state , but the upgrade is not going forward , as
on the cephadm logs I have the following :



# Ceph -W cephadm

2024-03-06T08:39:11.653447+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Need to
upgrade myself (mgr.rke-sh1-1.qskoyj)

2024-03-06T08:39:12.281386+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Updating
mgr.rke-sh1-2.lxmguj

2024-03-06T08:39:12.286096+ mgr.rke-sh1-1.qskoyj [INF] Deploying daemon
mgr.rke-sh1-2.lxmguj on rke-sh1-2

2024-03-06T08:39:19.347877+ mgr.rke-sh1-1.qskoyj [INF] Filtered out host
rke-sh1-1: could not verify host allowed virtual ips

2024-03-06T08:39:19.347989+ mgr.rke-sh1-1.qskoyj [INF] Filtered out host
rke-sh1-3: could not verify host allowed virtual ips

2024-03-06T08:39:19.366355+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Need to
upgrade myself (mgr.rke-sh1-1.qskoyj)

2024-03-06T08:39:19.965822+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Updating
mgr.rke-sh1-2.lxmguj

2024-03-06T08:39:19.969089+ mgr.rke-sh1-1.qskoyj [INF] Deploying daemon
mgr.rke-sh1-2.lxmguj on rke-sh1-2

2024-03-06T08:39:26.961455+ mgr.rke-sh1-1.qskoyj [INF] Filtered out host
rke-sh1-1: could not verify host allowed virtual ips

2024-03-06T08:39:26.961502+ mgr.rke-sh1-1.qskoyj [INF] Filtered out host
rke-sh1-3: could not verify host allowed virtual ips

2024-03-06T08:39:26.973897+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Need to
upgrade myself (mgr.rke-sh1-1.qskoyj)

2024-03-06T08:39:27.623773+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Updating
mgr.rke-sh1-2.lxmguj

2024-03-06T08:39:27.628115+ mgr.rke-sh1-1.qskoyj [INF] Deploying daemon
mgr.rke-sh1-2.lxmguj on rke-sh1-2



My public_network is set :



root@rke-sh1-1:~# ceph config dump  | grep public_network

  mon  advanced  public_network
10.10.71.0/24

 *

Do you have an idea why I have the following error :



Filtered out host: could not verify host allowed virtual ips





Current state of the upgrade :



# ceph orch upgrade status

{

"target_image":
"docker.io/ceph/ceph@sha256:8cdd8c7dfc7be5865255f0d59c048a1fb8d1335f6923996e
2c2d9439499b5cf2",

"in_progress": true,

"services_complete": [],

"progress": "0/35 ceph daemons upgraded",

"message": "Currently upgrading mgr daemons"

}



  progress:

Upgrade to 16.2.2 (24m)

  []



Thanks for the help.



Best Regards,



Edouard FAZENDA

Technical Support







Chemin du Curé-Desclouds 2, CH-1226 THONEX  +41 (0)22 869 04 40



  www.csti.ch



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck

2024-03-06 Thread Edouard FAZENDA
The process has now started but I have the following error on mgr to the
second node 

 

root@rke-sh1-1:~# ceph orch ps

NAME  HOST   PORTSSTATUS
REFRESHED  AGE  VERSION  IMAGE ID  CONTAINER ID

crash.rke-sh1-1   rke-sh1-1   running (12d)  41s ago
12d  16.2.1   c757e4a3636b  e8652edb2b49

crash.rke-sh1-2   rke-sh1-2   running (12d)  2s ago
20M  16.2.1   c757e4a3636b  a1249a605ee0

crash.rke-sh1-3   rke-sh1-3   running (12d)  41s ago
12d  16.2.1   c757e4a3636b  026667bc1776

mds.cephfs.rke-sh1-1.ojmpnk   rke-sh1-1   running (12d)  41s ago
5M   16.2.1   c757e4a3636b  9b4c2b08b759

mds.cephfs.rke-sh1-2.isqjza   rke-sh1-2   running (12d)  2s ago
23M  16.2.1   c757e4a3636b  71681a5f34d3

mds.cephfs.rke-sh1-3.vdicdn   rke-sh1-3   running (12d)  41s ago
4M   16.2.1   c757e4a3636b  e89946ad6b7e

mgr.rke-sh1-1.qskoyj  rke-sh1-1  *:8082,9283  running (66m)  41s ago
2y   16.2.2   5e237c38caa6  123cabbc2994

mgr.rke-sh1-2.lxmguj  rke-sh1-2  *:8082,9283  running (6s)   2s ago
22M  16.2.2   5e237c38caa6  b2a9047be1d6

mgr.rke-sh1-3.ckunvo  rke-sh1-3  *:8082,9283  running (12d)  41s ago
7M   16.2.1   c757e4a3636b  2fcaf18f3218

mon.rke-sh1-1 rke-sh1-1   running (37m)  41s ago
37m  16.2.1   c757e4a3636b  84e63e0415a8

mon.rke-sh1-2 rke-sh1-2   running (12d)  2s ago
4M   16.2.1   c757e4a3636b  f4b32ba4466b

mon.rke-sh1-3 rke-sh1-3   running (12d)  41s ago
12d  16.2.1   c757e4a3636b  d5e44c245998

osd.0 rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  7b0e69942c15

osd.1 rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  4451654d9a2d

osd.10rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  3f9d5f95e284

osd.11rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  db1cc6d2e37f

osd.12rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  de416c1ef766

osd.13rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  25a281cc5a9b

osd.14rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  62f25ba61667

osd.15rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  d3514d823c45

osd.16rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  bba857759bfe

osd.17rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  59281d4bb3d0

osd.2 rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  418041b5e60d

osd.3 rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  04a0e29d5623

osd.4 rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  1cc78a5153d3

osd.5 rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  39a4b11e31fb

osd.6 rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  2f218ffb566e

osd.7 rke-sh1-1   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  cf761fbe4d5f

osd.8 rke-sh1-3   running (12d)  41s ago
3y   16.2.1   c757e4a3636b  f9f85480e800

osd.9 rke-sh1-2   running (12d)  2s ago
3y   16.2.1   c757e4a3636b  664c54ff46d2

rgw.default.rke-sh1-1.dgucwl  rke-sh1-1  *:8000   running (12d)  41s ago
22M  16.2.1   c757e4a3636b  f03212b955a7

rgw.default.rke-sh1-1.vylchc  rke-sh1-1  *:8001   running (12d)  41s ago
22M  16.2.1   c757e4a3636b  da486ce43fe5

rgw.default.rke-sh1-2.dfhhfw  rke-sh1-2  *:8000   running (12d)  2s ago
2y   16.2.1   c757e4a3636b  ef4089d0aef2

rgw.default.rke-sh1-2.efkbum  rke-sh1-2  *:8001   running (12d)  2s ago
2y   16.2.1   c757e4a3636b  9e053d5a2f7b

rgw.default.rke-sh1-3.krfgey  rke-sh1-3  *:8001   running (12d)  41s ago
9M   16.2.1   c757e4a3636b  45cd3d75edd3

rgw.default.rke-sh1-3.pwdbmp  rke-sh1-3  *:8000   running (12d)  41s ago
9M   16.2.1   c757e4a3636b  e2710265a7f4

 

#tail -f
/var/log/ceph/fcb373ce-7aaa-11eb-984f-e7c6e0038e87/ceph-mgr.rke-sh1-2.lxmguj
.log

2024-03-06T09:24:42.468+ 7fe68b500700  0 [dashboard DEBUG root] setting
log level: INFO

2024-03-06T09:24:42.468+ 7fe68b500700  1 mgr load Constructed class from
module: dashboard

2024-03-06T09:24:42.468+ 7fe68acff700  0 ms_deliver_dispatch: unhandled
message 0x55f722292160 mon_map magic: 0 v1 from mon.0 v2:10.10.71.2:3300/0

2024-03-06T09:24:42.468+ 7fe68b500700  0 

[ceph-users] Upgarde from 16.2.1 to 16.2.2 pacific stuck

2024-03-06 Thread Edouard FAZENDA
Dear Ceph Community,

 

I am in the process of upgrading ceph pacific 16.2.1 to 16.2.2 , I have
followed the documentation :
https://docs.ceph.com/en/pacific/cephadm/upgrade/

 

My cluster is in Healthy state , but the upgrade is not going forward , as
on the cephadm logs I have the following : 

 

# Ceph -W cephadm

2024-03-06T08:39:11.653447+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Need to
upgrade myself (mgr.rke-sh1-1.qskoyj)

2024-03-06T08:39:12.281386+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Updating
mgr.rke-sh1-2.lxmguj

2024-03-06T08:39:12.286096+ mgr.rke-sh1-1.qskoyj [INF] Deploying daemon
mgr.rke-sh1-2.lxmguj on rke-sh1-2

2024-03-06T08:39:19.347877+ mgr.rke-sh1-1.qskoyj [INF] Filtered out host
rke-sh1-1: could not verify host allowed virtual ips

2024-03-06T08:39:19.347989+ mgr.rke-sh1-1.qskoyj [INF] Filtered out host
rke-sh1-3: could not verify host allowed virtual ips

2024-03-06T08:39:19.366355+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Need to
upgrade myself (mgr.rke-sh1-1.qskoyj)

2024-03-06T08:39:19.965822+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Updating
mgr.rke-sh1-2.lxmguj

2024-03-06T08:39:19.969089+ mgr.rke-sh1-1.qskoyj [INF] Deploying daemon
mgr.rke-sh1-2.lxmguj on rke-sh1-2

2024-03-06T08:39:26.961455+ mgr.rke-sh1-1.qskoyj [INF] Filtered out host
rke-sh1-1: could not verify host allowed virtual ips

2024-03-06T08:39:26.961502+ mgr.rke-sh1-1.qskoyj [INF] Filtered out host
rke-sh1-3: could not verify host allowed virtual ips

2024-03-06T08:39:26.973897+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Need to
upgrade myself (mgr.rke-sh1-1.qskoyj)

2024-03-06T08:39:27.623773+ mgr.rke-sh1-1.qskoyj [INF] Upgrade: Updating
mgr.rke-sh1-2.lxmguj

2024-03-06T08:39:27.628115+ mgr.rke-sh1-1.qskoyj [INF] Deploying daemon
mgr.rke-sh1-2.lxmguj on rke-sh1-2

 

My public_network is set : 

 

root@rke-sh1-1:~# ceph config dump  | grep public_network

  mon  advanced  public_network
10.10.71.0/24

 *

Do you have an idea why I have the following error : 

 

Filtered out host: could not verify host allowed virtual ips

 

 

Current state of the upgrade : 

 

# ceph orch upgrade status

{

"target_image":
"docker.io/ceph/ceph@sha256:8cdd8c7dfc7be5865255f0d59c048a1fb8d1335f6923996e
2c2d9439499b5cf2",

"in_progress": true,

"services_complete": [],

"progress": "0/35 ceph daemons upgraded",

"message": "Currently upgrading mgr daemons"

}

 

  progress:

Upgrade to 16.2.2 (24m)

  []

 

Thanks for the help.

 

Best Regards, 

 

Edouard FAZENDA

Technical Support

 



 

Chemin du Curé-Desclouds 2, CH-1226 THONEX  +41 (0)22 869 04 40

 

  www.csti.ch

 



smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io