[ceph-users] Re: Operations: cannot update immutable features

2023-06-12 Thread Eugen Block

Hi,

can you check for snapshots in the trash namespace?

# rbd snap ls --all <pool>/<image>

Instead of removing the feature try to remove the snapshot from trash  
(if there are any).
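
For example, a quick sketch to flag images that still carry trashed snapshots
(the pool name is taken from the quoted message below; the match relies on the
NAMESPACE column that `rbd snap ls --all` prints):

# report every image in the pool that has snapshots in the trash namespace
for img in $(rbd ls Cloud-Ceph1); do
    rbd snap ls --all "Cloud-Ceph1/$img" | grep -q trash && echo "$img has trashed snapshot(s)"
done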



Quoting Adam Boyhan:

I have a small cluster on Pacific with roughly 600 RBD images.   Out  
of those 600 images I have 2 which are in a somewhat odd state.


root@cephmon:~# rbd info Cloud-Ceph1/vm-134-disk-0
rbd image 'vm-134-disk-0':
size 1000 GiB in 256000 objects
order 22 (4 MiB objects)
snapshot_count: 11
id: 7c326b8b4567
block_name_prefix: rbd_data.7c326b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, operations
op_features: snap-trash
flags:
create_timestamp: Fri Aug 14 07:11:44 2020
access_timestamp: Thu Jun  8 06:31:06 2023
modify_timestamp: Thu Jun  8 06:31:11 2023

Specifically, the feature "operations". I never set this feature, and I don't
see it listed in the documentation.


This feature is preventing my backup software from backing up the RBD image.
Otherwise, it's working OK.


I did attempt to remove the feature.

root@cephmon:~# rbd feature disable Cloud-Ceph1/vm-464-disk-0 operations
rbd: failed to update image features: 2023-06-08T07:50:21.899-0400
7fdea52ae340 -1 librbd::Operations: cannot update immutable features
(22) Invalid argument

Any help or input is greatly appreciated.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: Container image of Pacific latest version

2023-06-12 Thread mahnoosh shahidi
Thank you for your information.

On Mon, Jun 12, 2023 at 9:35 AM Jonas Nemeiksis 
wrote:

> Hi,
>
> The ceph daemon image build is deprecated. You can read here [1]
>
> [1] https://github.com/ceph/ceph-container/issues/2112
>
> On Sun, Jun 11, 2023 at 4:03 PM mahnoosh shahidi 
> wrote:
>
>> Thanks for your response. I need the ceph daemon image. I forgot to
>> mention
>> it in the first message.
>>
>> Best Regards,
>> Mahnoosh
>>
>> On Sun, Jun 11, 2023 at 4:22 PM 胡 玮文  wrote:
>>
>> > It is available at quay.io/ceph/ceph:v16.2.13
>> >
>> > > On Jun 11, 2023, at 16:31, mahnoosh shahidi  wrote:
>> > >
>> > > Hi all,
>> > >
>> > > It seems the latest Pacific image in the registry is 16.2.11. Is there any
>> > > plan to push the latest version of Pacific (16.2.13) in the near future?
>> > >
>> > > Best Regards,
>> > > Mahnoosh
>> > > ___
>> > > ceph-users mailing list -- ceph-users@ceph.io
>> > > To unsubscribe send an email to ceph-users-le...@ceph.io
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
>
> --
> Jonas
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs perf stats output is empty

2023-06-12 Thread Denis Polom

Hi,

yes, I've found the trick: I had to wait about 15 seconds to see the
metrics.


Now I can see some numbers. Are the units there in milliseconds? I also
see two numbers reported - is the first value the actual value and the second a delta?


  "client.4636": [
    [
  924,
  4
    ],
    [
  0,
  0
    ],
    [
  10,
  981017512
    ],
    [
  0,
  33484910
    ],
    [
  136,
  0
    ],
    [
  1,
  2
    ],
    [
  2,
  2
    ],
    [
  1,
  2
    ],
    [
  0,
  0
    ],
    [
  302,
  1266679808
    ],
    [
  0,
  0
    ],
    [
  0,
  0
    ],
    [
  0,
  36361015
    ],
    [
  4205537661271535,
  302
    ],
    [
  0,
  11161636
    ],
    [
  190208004421472,
  3
    ]
  ]
    }
  },

Thx!



On 6/12/23 06:36, Jos Collin wrote:
Additionally, the first `ceph fs perf stats` output would be empty or 
outdated in most cases. You need to query a few times to get the 
latest values. So try `watch ceph fs perf stats`.


On Mon, 12 Jun 2023 at 06:30, Xiubo Li  wrote:


On 6/10/23 05:35, Denis Polom wrote:
> Hi
>
> I'm running latest Ceph Pacific 16.2.13 with Cephfs. I need to
collect
> performance stats per client, but getting empty list without any
numbers
>
> I even run dd on client against mounted ceph fs, but output is only
> like this:
>
> #> ceph fs perf stats 0 4638 192.168.121.1
>
> {"version": 2, "global_counters": ["cap_hit", "read_latency",
> "write_latency", "metadata_latency", "dentry_lease",
"opened_files",
> "pinned_icaps", "
> opened_inodes", "read_io_sizes", "write_io_sizes",
"avg_read_latency",
> "stdev_read_latency", "avg_write_latency", "stdev_write_latency",
> "avg_metada
> ta_latency", "stdev_metadata_latency"], "counters": [],
> "client_metadata": {}, "global_metrics": {}, "metrics":
> {"delayed_ranks": []}}
>
> Do I need to set some extra options?
>
Were you using the ceph-fuse/libcephfs user-space clients? If so, you
need to manually enable the 'client_collect_and_send_global_metrics'
option, which is disabled by default in Ceph Pacific 16.2.13.

If you are using the kclient instead, make sure the
'disable_send_metrics' module parameter is 'false' for the ceph.ko module.

Thanks

- Xiubo
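
For reference, a minimal sketch of how those two knobs could be checked and
enabled (the sysfs path for the kernel module parameter is an assumption based
on the usual /sys/module layout):

# ceph-fuse/libcephfs clients: enable global metrics collection
ceph config set client client_collect_and_send_global_metrics true

# kernel client: check the ceph.ko parameter (should report false/N)
cat /sys/module/ceph/parameters/disable_send_metrics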


> Does it work for some of you guys?
>
> Thank you
>
> dp
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] what are the options for config a CephFS client session

2023-06-12 Thread Denis Polom

Hi,

I couldn't find any documentation or any way to learn the valid options for
configuring a client session over the MDS socket:


#> ceph tell mds.mds1 session config

session config <client_id> <option> [<value>] :  Config a CephFS
client session



Any hint on this?

Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: what are the options for config a CephFS client session

2023-06-12 Thread Dhairya Parmar
Hi,

There's just one option for `session config` (or `client config`; both are the
same) as of now, i.e. "timeout":
#> ceph tell mds.0 session config <client_id> timeout <value>
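
For example (the client id is a placeholder and the value is just an example):

#> ceph tell mds.0 session ls                        # find the client id
#> ceph tell mds.0 session config <client_id> timeout 300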


*Dhairya Parmar*

Associate Software Engineer, CephFS


On Mon, Jun 12, 2023 at 2:29 PM Denis Polom  wrote:

> Hi,
>
> I didn't find any doc and any way how to get to know valid options to
> configure client session over mds socket:
>
> #> ceph tell mds.mds1 session config
>
> session config <client_id> <option> [<value>] :  Config a CephFS
> client session
>
>
> Any hint on this?
>
> Thank you
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-12 Thread Janek Bevendorff
Good news: We haven't had any new fill-ups so far. On the contrary, the 
pool size is as small as it's ever been (200GiB).


Bad news: The MDS are still acting strangely. I have very uneven session 
load and I don't know where it comes from. ceph_mds_sessions_total_load 
reports a number of 1.4 million on mds.3, whereas all the others are 
mostly idle. I checked the client list on that rank, but the heaviest 
client has about 8k caps, which isn't very much at all. Most have 0 or 
1. I don't see any blocked ops in flight. I don't think this is to do 
with the disabled balancer, because I've seen this pattern before.


The event log size of 3/5 MDS is also very high, still. mds.1, mds.3, 
and mds.4 report between 4 and 5 million events, mds.0 around 1.4 
million and mds.2 between 0 and 200,000. The numbers have been constant 
since my last MDS restart four days ago.


I ran your ceph-gather.sh script a couple of times, but it dumps only
mds.0. Should I modify it to dump mds.3 instead so you can have a look?


Janek


On 10/06/2023 15:23, Patrick Donnelly wrote:

On Fri, Jun 9, 2023 at 3:27 AM Janek Bevendorff
 wrote:

Hi Patrick,


I'm afraid your ceph-post-file logs were lost to the nether. AFAICT,
our ceph-post-file storage has been non-functional since the beginning
of the lab outage last year. We're looking into it.

I have it here still. Any other way I can send it to you?

Nevermind, I found the machine it was stored on. It was a
misconfiguration caused by post-lab-outage rebuilds.


Extremely unlikely.

Okay, taking your word for it. But something seems to be stalling
journal trimming. We had a similar thing yesterday evening, but at much
smaller scale without noticeable pool size increase. I only got an alert
that the ceph_mds_log_ev Prometheus metric started going up again for a
single MDS. It grew past 1M events, so I restarted it. I also restarted
the other MDS and they all immediately jumped to above 5M events and
stayed there. They are, in fact, still there and have decreased only
very slightly in the morning. The pool size is totally within a normal
range, though, at 290GiB.

Please keep monitoring it. I think you're not the only cluster to
experience this.


So clearly (a) an incredible number of journal events are being logged
and (b) trimming is slow or unable to make progress. I'm looking into
why but you can help by running the attached script when the problem
is occurring so I can investigate. I'll need a tarball of the outputs.

How do I send it to you if not via ceph-post-file?

It should work soon next week. We're moving the drop.ceph.com service
to a standalone VM soonish.


Also, in the off-chance this is related to the MDS balancer, please
disable it since you're using ephemeral pinning:

ceph config set mds mds_bal_interval 0

Done.

Thanks for your help!
Janek


--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de




--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] The num of objects with cmd "rados -p xxx" not equal with s3 api?

2023-06-12 Thread Louis Koo
List objects with the rados cmd:

rados -p oss.rgw.buckets.index ls \
  | grep "c2af65dc-b456-4f5a-be6a-2a142adeea75.335721.1" \
  | awk '{print "rados listomapkeys -p oss.rgw.buckets.index "$1}' | sh -x

Some objects look like this; I don't know what they are:
"?1000_DR-MON-1_20220307_224723/configs/lidars.cfgiUFWot6m-OJzb4cl6fC4czRzv9tGJMLw
?1000_DR-MON-1_20220307_224723/logs/sentry/DR_MON_1_20220307_224724_carstatus.jsoniYf0RYRvfYXWwwr6U8JDQ59ZSN2SLXg5
?1000_MONDEO_A12/DR-MKZ-1_20220114_191640/coredumps/i.b2Je.JKFrSJAmlga7IIXLyKFSmXgAo
?1000_MONDEO_A12/DR-MKZ-1_20220114_191640/coredumps/iCuATmyPL9vXXfsSHENNkIpAqS.EZKFj
?1000_MONDEO_A12/DR-MKZ-1_20220211_190901/logs/gnss_adapter/i7g.vodrYiNsy4ljwy6UsQNptRHoY4Uz
?1000_MONDEO_A12/DR-MKZ-1_20220211_190901/logs/gnss_adapter/izHgzCRwJY2eGqV0ifuz9DaS50p9zmjA
?1000_MONDEO_A12/DR-MON-1_20211002_012629/config/iEYp.8-exWsChWoEGKTPBRzfFcm8EmK4
?1000_MONDEO_A12/DR-MON-1_20211005_011823/log_other/iikTrMOJno2pKZFzxhBlJ04TqTrv37Be
?1000_MONDEO_A12/DR-MON-1_20211019_181039/task_information/iiyyJUW86YITRf023drwHV5TBNlxCIwJ
?1000_MONDEO_A12/DR-MON-1_20211019_181039/task_information/int4GOdWMZ7zXddTVtChac1TlbgwgJpJ
?1000_MONDEO_A12/DR-MON-1_2020_235107_OT1495/config/irVFxhMgTrTI.cRhUPdt3YxMg-pX.GVy
?1000_MONDEO_A12/DR-MON-1_2020_235107_OT1495/config/ixGn1hqiDGPR1IOmhW6aCGLJPELJAHNW
?1000_MONDEO_A12/DR-MON-1_20211130_230853/bag/important/ia7X4HvOe4VTk6i6jrgOLpAGzvUDlCpz
?1000_MONDEO_A12/DR-MON-1_20211130_230853/bag/important/ilyff1HsUSYEuWNr4smOpfS2lDyY.CoZ
?1000_MONDEO_A12/DR-MON-1_20211130_230853/task_information/task_description.jsoniDMuGFwTNPY2fTKuDIqSlVfy5AY6AydN
?1000_MONDEO_A12/DR-MON-1_20211203_203345/bag/bags_info.jsoniBEU6igyLo59VwAijFYQVl.iJMUP5s6F
?1000_MONDEO_A12/DR-MON-1_20211213_233958/config/lidars.cfgi.vISYc4d7gOj-qv2grSBhSZJ4LAPIti
?1000_MONDEO_A12/DR-MON-1_20211214_001357/task_information/sentry_config.jsonio8sOurJMZZReHUMu4jx116dz2buWwS0
?1000_MONDEO_A12/DR-MON-1_20211215_195308/bag/important/iCOoK1O1ydDYHgDQsXAfiCdKc84gglEb
?1000_MONDEO_A12/DR-MON-1_20211215_195308/bag/important/iTQxsDHPef1UGyMmgQL9RNQBlxLHEHyT
?1000_MONDEO_A12/DR-MON-1_20211220_223504/log/DR_MON_1_20211220_223504_syslog.jsonig-E8J2jUe2IekrF8EIOebxaC3HZHWeF
?1000_MONDEO_A12/DR-MON-1_20211220_223504/log/DR_MON_1_20211220_223504_syslog.jsoniw2cN-1wvaz35.fkanuE.zEke95.UZ8V
?1000_MONDEO_A12/DR-MON-1_20211228_200413/log_other/gnss_imu_time.logiCL3l2NTALD2Kyi-KbxYAP4oQg.m6qjN
?1000_MONDEO_A12/DR-MON-1_20211228_200413/log_other/gnss_imu_time.logitAPux7n2MeOIGLejTDnCyirEpaQoQuQ
?1000_MONDEO_A12/DR-MON-1_20220104_195217/config/versioni0QqTxQh7EdOCgTIU3KHgaZfMXiAAQXF
?1000_MONDEO_A12/DR-MON-1_20220104_195217/config/versionida0hUxK8TREImCjsG"

total objects: 395797



List objects with the S3 API:
total objects: 67498
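
For comparison, the S3-visible count can also be read from the bucket stats kept
by RGW (the bucket name is a placeholder). Note that the index pool also holds
internal omap entries (e.g. for versioned objects or incomplete multipart
uploads), so the raw omap key count is not expected to match the S3 listing:

# number of objects as accounted by RGW for the bucket
radosgw-admin bucket stats --bucket=<bucket-name> | jq '.usage."rgw.main".num_objects'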
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Keepalived configuration with cephadm

2023-06-12 Thread Luis Domingues
Hi all,

We are running a test Ceph cluster with cephadm, currently on the latest Pacific
(16.2.13).
We use cephadm to deploy keepalived:2.1.5 and HAProxy:2.3.
We have 3 VIPs, 1 for each instance of HAProxy.

However, we do not use the same network for managing the cluster and for the public
traffic: we have a management network to connect to the machines and for cephadm to
do the deployments, and a prod network where the connections to HAProxy are made.

Our spec file looks like:
---
service_type: ingress
service_id: rgw.rgw
placement:
  label: rgws
spec:
  backend_service: rgw.rgw
  virtual_ips_list:
  - 10.X.X.10/24
  - 10.X.X.2/24
  - 10.X.X.3/24
  frontend_port: 443
  monitor_port: 1967

Our issue is that cephadm populates `unicast_src_ip` and `unicast_peer`
using the IPs from the mgmt network and not the ones from the prod network.
A quick look into the code suggests it is designed that way.

Because of this, the keepalived instances will not talk to each other,
since VRRP traffic is only allowed on our prod network.
I quickly tested removing `unicast_src_ip` and `unicast_peer`, and the keepalived
instances were then able to talk to each other.

My question: did I miss something in the configuration? Or should we add some
kind of option to generate keepalived's config without `unicast_src_ip` and
`unicast_peer`?
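
For illustration, the relevant part of a keepalived vrrp_instance looks roughly
like this (addresses and interface names are made up, and this is not the exact
template cephadm renders); the problem is that the unicast addresses end up being
the mgmt IPs while the VIP and the allowed VRRP traffic live on the prod interface:

vrrp_instance VI_0 {
    state BACKUP
    interface ens224                # prod interface carrying the VIP
    virtual_router_id 50
    priority 90
    unicast_src_ip 192.0.2.11       # mgmt IP picked by cephadm (example address)
    unicast_peer {
        192.0.2.12                  # mgmt IPs of the peers (example addresses)
        192.0.2.13
    }
    virtual_ipaddress {
        10.X.X.10/24 dev ens224
    }
}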

Thanks,

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] stray daemons not managed by cephadm

2023-06-12 Thread farhad kh
 I deployed a Ceph cluster with 8 nodes (v17.2.6), and after adding all of the
hosts, Ceph created 5 mon daemon instances.
I tried to decrease that to 3 instances with `ceph orch apply mon
--placement=label:mon,count:3`. It worked, but after that I got the error "2
stray daemons not managed by cephadm".
Every time I tried to deploy and delete other instances, this number
increased. Now I have 7 daemons that are not managed by cephadm.
How do I deal with this issue?


[root@opcsdfpsbpp0201 ~]# ceph -s
  cluster:
id: 79a2627c-0821-11ee-a494-00505695c58c
health: HEALTH_WARN
16 stray daemon(s) not managed by cephadm

  services:
mon: 3 daemons, quorum opcsdfpsbpp0201,opcsdfpsbpp0205,opcsdfpsbpp0203
(age 2m)
mgr: opcsdfpsbpp0201.vttwxa(active, since 27h), standbys:
opcsdfpsbpp0207.kzxepm
mds: 1/1 daemons up, 2 standby
osd: 74 osds: 74 up (since 26h), 74 in (since 26h)

  data:
volumes: 1/1 healthy
pools:   6 pools, 6 pgs
objects: 2.10k objects, 8.1 GiB
usage:   28 GiB used, 148 TiB / 148 TiB avail
pgs: 6 active+clean

  io:
client:   426 B/s rd, 0 op/s rd, 0 op/s wr

[root@opcsdfpsbpp0201 ~]# ceph health detail
HEALTH_WARN 16 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 16 stray daemon(s) not managed by cephadm
stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0203 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0203 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0203 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0203 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0205 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0205 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0205 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0205 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0207 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0209 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0209 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0211 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0215 on host opcsdfpsbpp0211 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0213 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0215 not managed by
cephadm
stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0215 not managed by
cephadm
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: stray daemons not managed by cephadm

2023-06-12 Thread Nino Kotur
+1 for this issue, I've managed to reproduce it on my test cluster.




Kind regards,
Nino Kotur


On Mon, Jun 12, 2023 at 2:54 PM farhad kh 
wrote:

>  i deployed the ceph cluster with 8 node (v17.2.6) and  after add all of
> hosts, ceph create 5 mon daemon instances
> i try decrease that to 3 instance with ` ceph orch apply mon
> --placement=label:mon,count:3 it worked, but after that i get error "2
> stray daemons not managed by cephadm" .
> But every time I tried to deploy and delete other instances, this number
> increased Now I have 7 daemon that are not managed by cephadm
> How to deal with this issue?
>
> 
> [root@opcsdfpsbpp0201 ~]# ceph -s
>   cluster:
> id: 79a2627c-0821-11ee-a494-00505695c58c
> health: HEALTH_WARN
> 16 stray daemon(s) not managed by cephadm
>
>   services:
> mon: 3 daemons, quorum opcsdfpsbpp0201,opcsdfpsbpp0205,opcsdfpsbpp0203
> (age 2m)
> mgr: opcsdfpsbpp0201.vttwxa(active, since 27h), standbys:
> opcsdfpsbpp0207.kzxepm
> mds: 1/1 daemons up, 2 standby
> osd: 74 osds: 74 up (since 26h), 74 in (since 26h)
>
>   data:
> volumes: 1/1 healthy
> pools:   6 pools, 6 pgs
> objects: 2.10k objects, 8.1 GiB
> usage:   28 GiB used, 148 TiB / 148 TiB avail
> pgs: 6 active+clean
>
>   io:
> client:   426 B/s rd, 0 op/s rd, 0 op/s wr
>
> [root@opcsdfpsbpp0201 ~]# ceph health detail
> HEALTH_WARN 16 stray daemon(s) not managed by cephadm
> [WRN] CEPHADM_STRAY_DAEMON: 16 stray daemon(s) not managed by cephadm
> stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0207 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0209 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0209 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0211 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0215 on host opcsdfpsbpp0211 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0213 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0215 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0215 not managed by
> cephadm
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: stray daemons not managed by cephadm

2023-06-12 Thread Adam King
If you do a mgr failover ("ceph mgr fail") and wait a few minutes, do the
issues clear out? I know there's a bug where removed mons get marked as
stray daemons while downsizing by multiple mons at once (cephadm might be
removing them too quickly, I'm not totally sure of the cause), but doing a mgr
failover has always cleared the stray daemon notifications for me. For some
context, what it lists as stray daemons is roughly what is reported in
"ceph node ls" but doesn't show up in "ceph orch ps". The idea is that the
orch ps output shows all the daemons cephadm is aware of and managing, while
"ceph node ls" shows the Ceph daemons that the cluster, but not necessarily
cephadm itself, is aware of. For me, the mon daemons marked stray were still
showing up in that "ceph node ls" output, but doing a mgr failover would clean
that up and then the stray daemon warnings would also disappear.
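
In short, a sequence like this reflects the above (give the new mgr a couple of
minutes before re-checking):

ceph mgr fail                      # fail over to a standby mgr
ceph orch ps --daemon-type mon     # daemons cephadm is managing
ceph node ls mon                   # mon daemons the cluster itself knows about
ceph health detail                 # the stray-daemon warning should clear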

On Mon, Jun 12, 2023 at 8:54 AM farhad kh 
wrote:

>  i deployed the ceph cluster with 8 node (v17.2.6) and  after add all of
> hosts, ceph create 5 mon daemon instances
> i try decrease that to 3 instance with ` ceph orch apply mon
> --placement=label:mon,count:3 it worked, but after that i get error "2
> stray daemons not managed by cephadm" .
> But every time I tried to deploy and delete other instances, this number
> increased Now I have 7 daemon that are not managed by cephadm
> How to deal with this issue?
>
> 
> [root@opcsdfpsbpp0201 ~]# ceph -s
>   cluster:
> id: 79a2627c-0821-11ee-a494-00505695c58c
> health: HEALTH_WARN
> 16 stray daemon(s) not managed by cephadm
>
>   services:
> mon: 3 daemons, quorum opcsdfpsbpp0201,opcsdfpsbpp0205,opcsdfpsbpp0203
> (age 2m)
> mgr: opcsdfpsbpp0201.vttwxa(active, since 27h), standbys:
> opcsdfpsbpp0207.kzxepm
> mds: 1/1 daemons up, 2 standby
> osd: 74 osds: 74 up (since 26h), 74 in (since 26h)
>
>   data:
> volumes: 1/1 healthy
> pools:   6 pools, 6 pgs
> objects: 2.10k objects, 8.1 GiB
> usage:   28 GiB used, 148 TiB / 148 TiB avail
> pgs: 6 active+clean
>
>   io:
> client:   426 B/s rd, 0 op/s rd, 0 op/s wr
>
> [root@opcsdfpsbpp0201 ~]# ceph health detail
> HEALTH_WARN 16 stray daemon(s) not managed by cephadm
> [WRN] CEPHADM_STRAY_DAEMON: 16 stray daemon(s) not managed by cephadm
> stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0207 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0209 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0209 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0211 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0215 on host opcsdfpsbpp0211 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0213 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0215 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0215 not managed by
> cephadm
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bucket notification retries

2023-06-12 Thread Yuval Lifshitz
Hi Stefan,
I was not able to reproduce the issue of not reconnecting after slow-down.
My steps are documented here:
https://gist.github.com/yuvalif/e58e264bafe847bc5196f95be0e704a2
Can you please share some of the radosgw logs after the broker is up again
and the reconnect fails?

Regardless, there are several race conditions that happen with Kafka and
persistent notifications and also exist for AMQP. We will be fixing those as
part of: https://tracker.ceph.com/issues/61639

Yuval



On Sun, Jun 11, 2023 at 11:48 AM Yuval Lifshitz  wrote:

> Hi Stefan,
> Thanks for the inputs. Replied inline
>
> On Fri, Jun 9, 2023 at 6:53 PM Stefan Reuter 
> wrote:
>
>> Hi Yuval,
>>
>> Thanks for having a look at bucket notifications and collecting
>> feedback. I also see potential for improvement in the area of bucket
>> notifications.
>>
>> We have observed issues in a setup with RabbitMQ as a broker where the
>> RADOS queue seems to fill up and clients receive "slow down" replies.
>> Unfortunately this state did not recover. The only solution to overcome
>> the situation was to remove and recreate the topic and bucket
>> notification configuration. This happened multiple times on different
>> ceph clusters with the latest Quincy.
>>
>>
> will check that. We had a similar issue with Kafka broker that was
> recently fixed.
>
>
>> It would be great to improve the ability to monitor bucket notifications
>> (e.g. via prometheus/grafana) to see the RADOS queues and their
>> usage/queue depth as well as the health of the process that consumes the
>> queue and passes the notifications to the broker.
>>
>>
> agree. we are working on that. see: https://tracker.ceph.com/issues/52927
>
>
>> For our use case notifications are very important as they trigger
>> downstream processing of the uploaded files. If the notification does
>> not happen, the files are not processed and the result is the same as if
>> the upload did not happen at all.
>>
>> Best regards,
>>
>> Stefan
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Monitor mkfs with keyring failed

2023-06-12 Thread Michel Niyoyita
Hello Team ,

I am trying to build a Ceph cluster with 3 nodes running Ubuntu 20.04,
configured using ceph-ansible. Because it is a testing cluster, the OSD servers
are the same hosts that will run as monitors. During installation I am facing the
following error:

TASK [ceph-mon : ceph monitor mkfs with keyring]
*
Monday 12 June 2023  15:48:18 + (0:00:00.089)   0:01:20.254
***
fatal: [ceph-osd1]: FAILED! => changed=true
  cmd:
  - ceph-mon
  - --cluster
  - ceph
  - --setuser
  - ceph
  - --setgroup
  - ceph
  - --mkfs
  - -i
  - ceph-osd1
  - --fsid
  - d80c5efc-1d3e-4054-a6a4-38553543
  - --keyring
  - /var/lib/ceph/tmp/ceph.mon..keyring
  delta: '0:00:00.282151'
  end: '2023-06-12 15:48:19.062350'
  msg: non-zero return code
  rc: -6
  start: '2023-06-12 15:48:18.780199'
  stderr: |-
/build/ceph-16.2.13/src/mon/MonMap.h: In function 'void
MonMap::add(const mon_info_t&)' thread 7ffae5cb8700 time
2023-06-12T15:48:18.810345+


If you have faced this before, kindly help me solve it.

Regards

Michel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] radosgw hang under pressure

2023-06-12 Thread grin
Hello,

ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

There is a single (test) radosgw serving plenty of test traffic. When under
heavy req/s ("heavy" in a relative sense, about 1k req/s) it pretty reliably hangs:
low-traffic threads seem to work (like handling occasional PUTs), but GETs are
completely nonresponsive; all attention seems to be spent on futexes.

The effect is extremely similar to 
https://ceph-users.ceph.narkive.com/I4uFVzH9/radosgw-civetweb-hangs-once-around-850-established-connections
 (subject: Radosgw (civetweb) hangs once around)
except this is quincy so it's beast instead of civetweb. The effect is the same 
as described there, except the cluster is way smaller (about 20-40 OSDs).

I observed that when I start radosgw -f with debug 20/20 it almost never hangs, 
so my guess is some ugly race condition. However I am a bit clueless how to 
actually debug it since debugging makes it go away. Debug 1 (default) with -d 
seems to hang after a while but it's not that simple to induce, I'm still 
testing under 4/4.

Also I do not see much to configure about beast.

As to answer the question in the original (2016) thread:
- Debian stable
- no visible limits issue
- no obvious memory leak observed
- no other visible resource shortage
- strace says everyone's waiting on futexes, about 600-800 threads, apart from 
the one serving occasional PUTs
- tcp port doesn't respond.

IRC didn't react. ;-)

Thanks,
Peter
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw hang under pressure

2023-06-12 Thread Mark Nelson

Hi Peter,

If you can reproduce and have debug symbols installed, I'd be interested 
to see the output of this tool:



https://github.com/markhpc/uwpmp/


It might need slightly different compile instructions if you have a 
newer version of go.  I can send you an executable offline if needed.  
Since RGW potentially can have a fairly insane number of threads with 
the default settings, it will gather samples pretty slowly.  Just start 
out collecting something like 100 samples:



sudo ./unwindpmp -n 100 -p `pidof radosgw` > foo.txt


Hopefully that should help diagnose where all of the threads are 
spending time in the code.  uwpmp has a much faster libdw backend (-b 
libdw), but the callgraphs aren't always accurate so I would stick with 
the default unwind backend for now.



Mark


On 6/12/23 12:15, grin wrote:

Hello,

ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

There is a single (test) radosgw serving plenty of test traffic. When under heavy req/s 
("heavy" in a low sense, about 1k rq/s) it pretty reliably hangs: low traffic 
threads seem to work (like handling occasional PUTs) but GETs are completely 
nonresponsive, all attention seems to be spent on futexes.

The effect is extremely similar to
https://ceph-users.ceph.narkive.com/I4uFVzH9/radosgw-civetweb-hangs-once-around-850-established-connections
 (subject: Radosgw (civetweb) hangs once around)
except this is quincy so it's beast instead of civetweb. The effect is the same 
as described there, except the cluster is way smaller (about 20-40 OSDs).

I observed that when I start radosgw -f with debug 20/20 it almost never hangs, 
so my guess is some ugly race condition. However I am a bit clueless how to 
actually debug it since debugging makes it go away. Debug 1 (default) with -d 
seems to hang after a while but it's not that simple to induce, I'm still 
testing under 4/4.

Also I do not see much to configure about beast.

As to answer the question in the original (2016) thread:
- Debian stable
- no visible limits issue
- no obvious memory leak observed
- no other visible resource shortage
- strace says everyone's waiting on futexes, about 600-800 threads, apart from 
the one serving occasional PUTs
- tcp port doesn't respond.

IRC didn't react. ;-)

Thanks,
Peter
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to release the invalid tcp connection under radosgw?

2023-06-12 Thread Louis Koo
connections:
[root@et-uos-warm02 deeproute]# netstat -anltp | grep rados | grep 10.x.x.x:7480 | grep ESTAB | grep 10.12 | wc -l
6650

The prints:
tcp0  0 10.x.x.x:7480   10.x.x.12:40210   ESTABLISHED 
76749/radosgw   
tcp0  0 10.x.x.x:7480   10.x.x.12:33218ESTABLISHED 
76749/radosgw   
tcp0  0 10.x.x.x:7480   10.x.x.12:33546ESTABLISHED 
76749/radosgw   
tcp0  0 10.x.x.x:7480   10.x.x.12:50024ESTABLISHED 
76749/radosgw 


but the client IP 10.x.x.12 is unreachable (because the node was shut down), and the
status of the TCP connections is always "ESTABLISHED". How can I fix this?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs perf stats output is empty

2023-06-12 Thread Jos Collin
Each array entry has a different type. You need to look at the cephfs-top
code below to see how they are interpreted.


[1] https://github.com/ceph/ceph/blob/main/src/tools/cephfs/top/cephfs-top#L66-L83
[2] https://github.com/ceph/ceph/blob/main/src/tools/cephfs/top/cephfs-top#L641-L714
[3] https://github.com/ceph/ceph/blob/main/src/tools/cephfs/top/cephfs-top#L110-L143
[4] https://github.com/ceph/ceph/blob/main/src/tools/cephfs/top/cephfs-top#L775-L823
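
As a rough illustration (the interpretation below is an assumption drawn from the
cephfs-top code linked above): the latency counters are (seconds, nanoseconds)
pairs, and counters like cap_hit/dentry_lease are (hits, misses) pairs, e.g.:

# write_latency pair from the output above: [10, 981017512] -> cumulative ms
sec=10; nsec=981017512
echo "write_latency: $(( sec * 1000 + nsec / 1000000 )) ms"   # prints 10981 ms

# cap_hit pair [924, 4] -> hit ratio in percent
hits=924; misses=4
awk -v h="$hits" -v m="$misses" 'BEGIN { printf "cap hit ratio: %.1f%%\n", 100*h/(h+m) }'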


On 12/06/23 14:23, Denis Polom wrote:


Hi,

yes, I've found that trick was I had to wait for about 15 sec to see 
the metrics.


Now I can see some numbers. Are units there in miliseconds? And also I 
see 2 numbers reported - is the first value actual and second is delta?


  "client.4636": [
    [
  924,
  4
    ],
    [
  0,
  0
    ],
    [
  10,
  981017512
    ],
    [
  0,
  33484910
    ],
    [
  136,
  0
    ],
    [
  1,
  2
    ],
    [
  2,
  2
    ],
    [
  1,
  2
    ],
    [
  0,
  0
    ],
    [
  302,
  1266679808
    ],
    [
  0,
  0
    ],
    [
  0,
  0
    ],
    [
  0,
  36361015
    ],
    [
  4205537661271535,
  302
    ],
    [
  0,
  11161636
    ],
    [
  190208004421472,
  3
    ]
  ]
    }
  },

Thx!



On 6/12/23 06:36, Jos Collin wrote:
Additionally, the first `ceph fs perf stats` output would be empty or 
outdated in most cases. You need to query a few times to get the 
latest values. So try `watch ceph fs perf stats`.


On Mon, 12 Jun 2023 at 06:30, Xiubo Li  wrote:


On 6/10/23 05:35, Denis Polom wrote:
> Hi
>
> I'm running latest Ceph Pacific 16.2.13 with Cephfs. I need to
collect
> performance stats per client, but getting empty list without
any numbers
>
> I even run dd on client against mounted ceph fs, but output is
only
> like this:
>
> #> ceph fs perf stats 0 4638 192.168.121.1
>
> {"version": 2, "global_counters": ["cap_hit", "read_latency",
> "write_latency", "metadata_latency", "dentry_lease",
"opened_files",
> "pinned_icaps", "
> opened_inodes", "read_io_sizes", "write_io_sizes",
"avg_read_latency",
> "stdev_read_latency", "avg_write_latency", "stdev_write_latency",
> "avg_metada
> ta_latency", "stdev_metadata_latency"], "counters": [],
> "client_metadata": {}, "global_metrics": {}, "metrics":
> {"delayed_ranks": []}}
>
> Do I need to set some extra options?
>
Were you using the ceph-fuse/libcephfs user space clients ? if so
you
need to manually enable the 'client_collect_and_send_global_metrics'
option, which is disabled in Ceph Pacific 16.2.13.

While if you were using the kclient you need to make sure the
'disable_send_metrics' module parameter is 'false' for ceph.ko
module.

Thanks

- Xiubo


> Does it work for some of you guys?
>
> Thank you
>
> dp
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD stuck down

2023-06-12 Thread Nicola Mori

Dear Ceph users,

after a host reboot one of the OSDs is now stuck down (and out). I tried 
several times to restart it and even to reboot the host, but it still 
remains down.


# ceph -s
  cluster:
id: b1029256-7bb3-11ec-a8ce-ac1f6b627b45
health: HEALTH_WARN
4 OSD(s) have spurious read errors
(muted: OSD_SLOW_PING_TIME_BACK OSD_SLOW_PING_TIME_FRONT)

  services:
mon: 5 daemons, quorum bofur,balin,aka,romolo,dwalin (age 16h)
mgr: bofur.tklnrn(active, since 16h), standbys: aka.wzystq, 
balin.hvunfe

mds: 2/2 daemons up, 1 standby
osd: 104 osds: 103 up (since 16h), 103 in (since 13h); 4 remapped pgs

  data:
volumes: 1/1 healthy
pools:   3 pools, 529 pgs
objects: 18.85M objects, 41 TiB
usage:   56 TiB used, 139 TiB / 195 TiB avail
pgs: 68130/150150628 objects misplaced (0.045%)
 522 active+clean
 4   active+remapped+backfilling
 3   active+clean+scrubbing+deep

  io:
recovery: 46 MiB/s, 21 objects/s



The host is reachable (its other OSDs are in) and from the systemd logs 
of the OSD I don't see anything wrong:


$ sudo systemctl status ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@osd.34
● ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@osd.34.service - Ceph osd.34 
for b1029256-7bb3-11ec-a8ce-ac1f6b627b45
   Loaded: loaded 
(/etc/systemd/system/ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@.service; 
enabled; vendor preset: disabled)

   Active: active (running) since Mon 2023-06-12 17:00:25 CEST; 15h ago
 Main PID: 36286 (bash)
Tasks: 11 (limit: 152154)
   Memory: 20.0M
   CGroup: 
/system.slice/system-ceph\x2db1029256\x2d7bb3\x2d11ec\x2da8ce\x2dac1f6b627b45.slice/ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45@osd.34.service
   ├─36286 /bin/bash 
/var/lib/ceph/b1029256-7bb3-11ec-a8ce-ac1f6b627b45/osd.34/unit.run
   └─36657 /usr/bin/docker run --rm --ipc=host 
--stop-signal=SIGTERM --net=host --entrypoint /usr/bin/ceph-osd 
--privileged --group-add=disk --init --name 
ceph-b1029256-7bb3-11ec-a8ce-ac1f6b627b45-osd-34 --pids-limit=0 -e 
CONTAINER_IMAGE=snack14/ceph-wizard@sha>


Jun 12 17:00:25 balin systemd[1]: Started Ceph osd.34 for 
b1029256-7bb3-11ec-a8ce-ac1f6b627b45.
Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -R 
ceph:ceph /var/lib/ceph/osd/ceph-34
Jun 12 17:00:27 balin bash[36306]: Running command: 
/usr/bin/ceph-bluestore-tool prime-osd-dir --path 
/var/lib/ceph/osd/ceph-34 --no-mon-config --dev 
/dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d
Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -h 
ceph:ceph 
/dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d
Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -R 
ceph:ceph /dev/dm-6
Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/ln -s 
/dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d 
/var/lib/ceph/osd/ceph-34/block
Jun 12 17:00:27 balin bash[36306]: Running command: /usr/bin/chown -R 
ceph:ceph /var/lib/ceph/osd/ceph-34
Jun 12 17:00:27 balin bash[36306]: --> ceph-volume raw activate 
successful for osd ID: 34
Jun 12 17:00:29 balin bash[36657]: debug 2023-06-12T15:00:29.066+ 
7f818e356540 -1 Falling back to public interface



I'd need some help to understand how to fix this.
Thank you,

Nicola


smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io