[ceph-users] How to remove remaining bucket index shard objects

2022-09-26 Thread 伊藤 祐司
Hi,

I have encountered a problem after deleting an RGW bucket: there seem to be 
remaining bucket index shard objects. Could you tell me the recommended way to 
delete these objects? Is it OK to just delete them, or should I use some 
dedicated ceph commands? I couldn't find how to do this in the official 
documentation.

Environment:
Rook: 1.9.6
Ceph: 16.2.10

Here is the detailed information:
I got the following HEALTH_WARN after deleting an RGW bucket.

```
$ kubectl exec -n ceph-poc deploy/rook-ceph-tools -- ceph health detail
HEALTH_WARN 35 large omap objects
[WRN] LARGE_OMAP_OBJECTS: 35 large omap objects
   35 large objects found in pool 
'ceph-poc-object-store-ssd-index.rgw.buckets.index'
   Search the cluster log for 'Large omap object found' for more details.
```

I tried the `bilog trim` and `stale-instance delete` commands, referring to the 
following documents.
- https://access.redhat.com/solutions/6450561
- 
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/object_gateway_guide_for_ubuntu/index#cleaning-stale-instances-after-resharding-rgw

Then I ran a deep-scrub and the warning disappeared. However, it appeared again 
later. After investigating, I found that the bucket index shard objects of the 
deleted bucket still exist.

There were two buckets.

```
$ kubectl exec -it -n ceph-poc deploy/rook-ceph-tools -- radosgw-admin bucket 
stats | jq '.[] | {"bucket": .bucket, "id": .id}' | jq .
{
 "bucket": "csa-large-omap-9332ba5c-3cb5-4ff7-98cf-1729b44b954c",
 "id": "83a2aeca-b5a0-46b2-843b-fb34884bb148.62065601.1"
}
{
 "bucket": "rook-ceph-bucket-checker-dfef5d3c-036a-428a-b4df-ae6be5d5c41a",
 "id": "83a2aeca-b5a0-46b2-843b-fb34884bb148.53178977.1"
}
```

However, there were three sets of bucket index shard objects.

```
$ kubectl exec -n ceph-poc deploy/rook-ceph-tools -- rados ls --pool 
ceph-poc-object-store-ssd-index.rgw.buckets.index | sort
.dir.83a2aeca-b5a0-46b2-843b-fb34884bb148.14548925.2.0
<...snip...>
.dir.83a2aeca-b5a0-46b2-843b-fb34884bb148.14548925.2.9
.dir.83a2aeca-b5a0-46b2-843b-fb34884bb148.53178977.1.0
<...snip...>
.dir.83a2aeca-b5a0-46b2-843b-fb34884bb148.53178977.1.9
.dir.83a2aeca-b5a0-46b2-843b-fb34884bb148.62065601.1.0
<...snip...>
.dir.83a2aeca-b5a0-46b2-843b-fb34884bb148.62065601.1.9
```

I would like to delete the above unused objects with the `rados rm` command, but 
I'm not sure whether this operation is safe. I would like to know how to delete 
them manually and the correct procedure for doing so.
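
For reference, here is a minimal sketch of how I could cross-check the leftover 
shard markers against the bucket instances RGW still knows about before removing 
anything (the pool name is taken from the output above; the jq/sed filters are 
just an illustration, not an authoritative procedure):

```
# List all bucket instance IDs that RGW still knows about.
kubectl exec -n ceph-poc deploy/rook-ceph-tools -- \
  radosgw-admin metadata list bucket.instance \
  | jq -r '.[]' | sed 's/.*://' | sort > known-instances.txt

# Extract the marker ID from every index shard object name (.dir.<marker>.<shard>).
kubectl exec -n ceph-poc deploy/rook-ceph-tools -- \
  rados ls --pool ceph-poc-object-store-ssd-index.rgw.buckets.index \
  | sed -E 's/^\.dir\.(.*)\.[0-9]+$/\1/' | sort -u > shard-markers.txt

# Markers that exist in the index pool but belong to no known bucket instance.
comm -23 shard-markers.txt known-instances.txt
```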

Thanks,
Yuji
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: weird performance issue on ceph

2022-09-26 Thread Zoltan Langi

Hi Mark,

Of course, here is how we triggered it:

Set the drive to 4k:
nvme format --lbaf=1 /dev/nvme0n1

Make a file system on the disk, we used ext4:
mkfs.ext4 /dev/nvme0n1

Mount the disk to a mount point:
mount /dev/nvme0n1 /mnt/test/

Run the FIO write test to write data to a file:
fio --randrepeat=0 --verify=0 --ioengine=libaio --direct=1 
--gtod_reduce=1 --name=write_seq --filename=/mnt/test/fiotest1 --bs=4M 
--iodepth=16 --size=500G --readwrite=write --time_based --ramp_time=2s 
--runtime=480m --thread --numjobs=4 --offset_increment=100M


Check the nvme list output. Once the drive usage reaches 500GB, kill the 
fio process and restart it with a new filename so it won't overwrite the 
original file:
fio --randrepeat=0 --verify=0 --ioengine=libaio --direct=1 
--gtod_reduce=1 --name=write_seq --filename=/mnt/test/fiotest2 --bs=4M 
--iodepth=16 --size=500G --readwrite=write --time_based --ramp_time=2s 
--runtime=480m --thread --numjobs=4 --offset_increment=100M
Shortly you will see the degraded performance, which gets worse and worse 
over time.
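
If it helps to automate the switch-over point, a rough sketch (using the size of 
the first test file as a proxy for drive usage; adjust the path as needed):

# after starting the first fio command above, wait until the test file
# reaches ~500GB, then stop fio and start the second command
while [ "$(du -sb /mnt/test/fiotest1 | cut -f1)" -lt $((500*1000*1000*1000)) ]; do
    sleep 30
done
pkill -x fio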


We used fw: EDA5702Q

Hope this makes sense. Opened a case with the disk supplier, will update 
you if I get any kind of sensible response from them.


Zoltan

On 26.09.22 at 16:52, Mark Nelson wrote:






Hi Zoltan,


Great investigation work!  I think in my tests the data set typically 
was smaller than 500GB/drive.  If you have a simple fio test that can 
be run against a bare NVMe drive I can try running it on one of our 
test nodes.  FWIW I kind of suspected that the issue I had to work 
around for quincy might have been related to some kind of internal 
cache being saturated.  I wonder if the drive is fast up until some 
limit is hit where it's reverted to slower flash or something?



Mark


On 9/26/22 06:39, Zoltan Langi wrote:
Hi Mark and the mailing list, we managed to figure out something very 
weird that I would like to share with you, and ask if you have 
seen anything like this before.


We started to investigate the drives one-by-one after Mark's 
suggestion that a few osd-s are holding back the ceph and we noticed 
this:


When the disk usage reaches 500GB on a single drive, the drive loses 
half of its write performance compared to when it's empty.

To show you, let's see the fio write performance when the disk is empty:
Jobs: 4 (f=4): [W(4)][6.0%][w=1930MiB/s][w=482 IOPS][eta 07h:31m:13s]
We see, when the disk is empty, the drive achieves almost 1,9GB/s 
throughput and 482 iops. Very decent values.


However! When the disk gets to 500GB full and we start to write a new 
file, all of a sudden we get these values:

Jobs: 4 (f=4): [W(4)][0.9%][w=1033MiB/s][w=258 IOPS][eta 07h:55m:43s]
As we see we lost significant throughput and iops as well.

If we remove all the files and do an fstrim on the disk, the 
performance returns back to normal again.


If we format the disk, no need to do fstrim, we get the performance 
back to normal again. That explains why the ceph recreation from 
scratch helped us.


Have you seen this behaviour before in your deployments?

Thanks,

Zoltan

On 17.09.22 at 06:58, Mark Nelson wrote:






Hi Zoltan,


So kind of interesting results.  In the "good" write test the OSD 
doesn't actually seem to be working very hard.  If you look at the 
kv sync thread, it's mostly idle with only about 22% of the time in 
the thread spent doing real work:


1.
   | + 99.90% BlueStore::_kv_sync_thread()
2.
   | + 78.60% 
std::condition_variable::wait(std::unique_lock&)

3.
   | |+ 78.60% pthread_cond_wait
4.
   | + 18.00%
RocksDBStore::submit_transaction_sync(std::shared_ptr) 



...but at least it's actually doing work!  For reference though, on 
our high performing setup with enough concurrency we can push things 
hard enough where this thread isn't spending much time in 
pthread_cond_wait.  In the "bad" state, your example OSD here is 
basically doing nothing at all (100% of the time in 
pthread_cond_wait!).  The tp_osd_tp and the kv sync thread are just 
waiting around twiddling their thumbs:


1.
   Thread 339848 (bstore_kv_sync) - 1000 samples
2.
   + 100.00% clone
3.
   + 100.00% start_thread
4.
   + 100.00% BlueStore::KVSyncThread::entry()
5.
   + 100.00% BlueStore::_kv_sync_thread()
6.
   + 100.00% 
std::condition_variable::wait(std::unique_lock&)

7.
   + 100.00% pthread_cond_wait


My first thought is that you might have one or more OSDs that are 
slowing the whole cluster down so that clients are backing up on it 
and other OSDs are just waiting around for IO.  It might be worth 
checking the perf admin socket stats on each OSD to see if you

[ceph-users] Re: External RGW always down

2022-09-26 Thread Monish Selvaraj
Hi Eugen,

Yes, the OSDs stay online when I start them manually.

No PG recovery starts automatically when the OSD starts.

I'm using an erasure-coded pool for RGW. In that rule we have k=11, m=4, a
total of 15 hosts, and the crush failure domain is host.

I didn't find any error logs on the OSDs.

The first time, I upgraded Ceph from Pacific to Quincy.

The second time, I upgraded from Quincy 17.2.1 to 17.2.2.

One thing I suspect: we are migrating data from Scality to Ceph. When we migrate
at our normal speed of 800 to 900 Mbps, it does not cause the problem.

When I migrate at a higher speed of about 2 Gbps, the OSDs go down automatically.
Some OSDs come back up on their own; some of the OSDs we need to start manually.
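
For reference, this is roughly how I am checking the pool and PG state (a sketch;
the names in angle brackets stand for the real pool/profile names):

# which erasure-code profile and failure domain the RGW data pool uses
ceph osd pool get <rgw-data-pool> erasure_code_profile
ceph osd erasure-code-profile get <profile-name>
# with k=11, m=4 and failure domain = host, min_size is typically k+1 = 12,
# so once 4 of the 15 hosts are unavailable the affected PGs go inactive
ceph osd pool get <rgw-data-pool> min_size
# list PGs that are currently stuck inactive and which OSDs they map to
ceph pg dump_stuck inactive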


On Mon, Sep 26, 2022 at 11:06 PM Eugen Block  wrote:

> > Yes, I have an inactive pgs when the osd goes down. Then I started the
> osds
> > manually. But the rgw fails to start.
>
> But the OSDs stay online if you start them manually? Do the inactive
> PGs recover when you start them manually? By the way, you should check
> your crush rules, depending on how many OSDs fail you may have room
> for improvement there. And why do the OSDs fail with automatic
> restart, what's in the logs?
>
> > Only upgrading to a newer version is only for the issue and we faced this
> > issue two times.
>
> What versions are you using (ceph versions)?
>
> > I dont know why it is happening. But maybe the rgw are running in
> separate
> > machines. This causes the issue ?
>
> I don't know how that should
>
> Zitat von Monish Selvaraj :
>
> > Hi Eugen,
> >
> > Yes, I have an inactive pgs when the osd goes down. Then I started the
> osds
> > manually. But the rgw fails to start.
> >
> > Only upgrading to a newer version is only for the issue and we faced this
> > issue two times.
> >
> > I dont know why it is happening. But maybe the rgw are running in
> separate
> > machines. This causes the issue ?
> >
> > On Sat, Sep 10, 2022 at 11:27 PM Eugen Block  wrote:
> >
> >> You didn’t respond to the other questions. If you want people to be
> >> able to help you need to provide more information. If your OSDs fail
> >> do you have inactive PGs? Or do you have full OSDs which would RGW
> >> prevent from starting? I’m assuming that if you fix your OSDs the RGWs
> >> would start working again. But then again, we still don’t know
> >> anything about the current situation.
> >>
> >> Zitat von Monish Selvaraj :
> >>
> >> > Hi Eugen,
> >> >
> >> > Below is the log output,
> >> >
> >> > 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  0 framework: beast
> >> > 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  0 framework conf key: port,
> >> val:
> >> > 80
> >> > 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  1 radosgw_Main not setting
> >> numa
> >> > affinity
> >> > 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  1 rgw_d3n:
> >> > rgw_d3n_l1_local_datacache_enabled=0
> >> > 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  1 D3N datacache enabled: 0
> >> > 2022-09-07T12:03:53.313+ 7fdd23fdc5c0  1 rgw main: int
> >> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*,
> RGWSI_RADOS::Obj&,
> >> > const RGWCacheNotifyInfo&, optional_yi>
> >> > 2022-09-07T12:03:53.313+ 7fdd23fdc5c0  1 rgw main: int
> >> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*,
> RGWSI_RADOS::Obj&,
> >> > const RGWCacheNotifyInfo&, optional_yi>
> >> > 2022-09-07T12:08:42.891+ 7fdd1661c700 -1 Initialization timeout,
> >> failed
> >> > to initialize
> >> > 2022-09-07T12:08:53.395+ 7f69017095c0  0 deferred set uid:gid to
> >> > 167:167 (ceph:ceph)
> >> > 2022-09-07T12:08:53.395+ 7f69017095c0  0 ceph version 17.2.0
> >> > (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process
> >> > radosgw, pid 7
> >> > 2022-09-07T12:08:53.395+ 7f69017095c0  0 framework: beast
> >> > 2022-09-07T12:08:53.395+ 7f69017095c0  0 framework conf key: port,
> >> val:
> >> > 80
> >> > 2022-09-07T12:08:53.395+ 7f69017095c0  1 radosgw_Main not setting
> >> numa
> >> > affinity
> >> > 2022-09-07T12:08:53.395+ 7f69017095c0  1 rgw_d3n:
> >> > rgw_d3n_l1_local_datacache_enabled=0
> >> > 2022-09-07T12:08:53.395+ 7f69017095c0  1 D3N datacache enabled: 0
> >> > 2022-09-07T12:09:03.747+ 7f69017095c0  1 rgw main: int
> >> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*,
> RGWSI_RADOS::Obj&,
> >> > const RGWCacheNotifyInfo&, optional_yi>
> >> > 2022-09-07T12:09:03.747+ 7f69017095c0  1 rgw main: int
> >> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*,
> RGWSI_RADOS::Obj&,
> >> > const RGWCacheNotifyInfo&, optional_yi>
> >> > 2022-09-07T12:13:53.397+ 7f68f3d49700 -1 Initialization timeout,
> >> failed
> >> > to initialize
> >> >
> >> > I installed the cluster in quincy.
> >> >
> >> >
> >> > On Sat, Sep 10, 2022 at 4:02 PM Eugen Block  wrote:
> >> >
> >> >> What troubleshooting have you tried? You don’t provide any log output
> >> >> or information about the cluster setup, for example the ceph osd
> 

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-26 Thread Tyler Stachecki
Just a datapoint - we upgraded several large Mimic-born clusters straight
to 15.2.12 with the quick fsck disabled in ceph.conf, then did
require-osd-release, and finally did the omap conversion offline after the
cluster was upgraded using the bluestore tool while the OSDs were down (all
done in batches). Clusters are zippy as ever.

Maybe on a whim, try doing an offline fsck with the bluestore tool and see
if it improves things?

To answer an earlier question, if you have no health statuses muted, a
'ceph health detail' should show you at least a subset of OSDs that have
not gone through the omap conversion yet.
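
In case it helps, the offline fsck/conversion on a single OSD looks roughly like
this (a sketch; the OSD id and data path are examples, and the OSD must be stopped
first):

systemctl stop ceph-osd@42
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-42
# as far as I know, 'repair' (or 'quick-fix' on newer releases) also performs the
# omap format conversion that quick-fix-on-mount would otherwise do at startup
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-42
systemctl start ceph-osd@42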

Cheers,
Tyler

On Mon, Sep 26, 2022, 5:13 PM Marc  wrote:

> Hi Frank,
>
> Thank you very much for this! :)
>
> >
> > we just completed a third upgrade test. There are 2 ways to convert the
> > OSDs:
> >
> > A) convert along with the upgrade (quick-fix-on-start=true)
> > B) convert after setting require-osd-release=octopus (quick-fix-on-
> > start=false until require-osd-release set to octopus, then restart to
> > initiate conversion)
> >
> > There is a variation A' of A: follow A, then initiate manual compaction
> > and restart all OSDs.
> >
> > Our experiments show that paths A and B do *not* yield the same result.
> > Following path A leads to a severely performance degraded cluster. As of
> > now, we cannot confirm that A' fixes this. It seems that the only way
> > out is to zap and re-deploy all OSDs, basically what Boris is doing
> > right now.
> >
> > We extended now our procedure to adding
> >
> >   bluestore_fsck_quick_fix_on_mount = false
> >
> > to every ceph.conf file and executing
> >
> >   ceph config set osd bluestore_fsck_quick_fix_on_mount false
> >
> > to catch any accidents. After daemon upgrade, we set
> > bluestore_fsck_quick_fix_on_mount = true host by host in the ceph.conf
> > and restart OSDs.
> >
> > This procedure works like a charm.
> >
> > I don't know what the difference between A and B is. It is possible that
> > B executes an extra step that is missing in A. The performance
> > degradation only shows up when snaptrim is active, but then it is very
> > severe. I suspect that many users who complained about snaptrim in the
> > past have at least 1 A-converted OSD in their cluster.
> >
> > If you have a cluster upgraded with B-converted OSDs, it works like a
> > native octopus cluster. There is very little performance reduction
> > compared with mimic. In exchange, I have the impression that it operates
> > more stable.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph configuration for rgw

2022-09-26 Thread Tony Liu
You can always "config get" what was set by "config set", cause that's just
write and read KV to and from configuration DB.

To "config show" what was set by "config set" requires the support for mgr
to connect to the service daemon to get running config. I see such support
for mgr, mon and osd, but not rgw.

The case I am asking about is the latter: for rgw, after "config set", I can't
get it with "config show". I'd like to know if this is expected.

Also, the config in the configuration DB doesn't seem to be applied to rgw, even
after restarting the service.

I also noticed that when cephadm deploys rgw, it tries to add a firewall rule for
the open port. In my case, the port is not in the "public" zone, and I don't see
a way to set the zone or disable this action.
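
To illustrate the difference I mean (client.rgw.myrgw, the option, and the names
in angle brackets are just examples):

# stored in the configuration DB -- this always works
ceph config set client.rgw.myrgw rgw_enable_usage_log true
ceph config get client.rgw.myrgw rgw_enable_usage_log

# running config -- works for mon/mgr/osd, e.g.:
ceph config show osd.0 osd_memory_target
# ...but for rgw I get nothing back, so I fall back to the admin socket
# inside the container:
docker exec <rgw-container> ceph daemon <rgw-daemon> config show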


Thanks!
Tony

From: Eugen Block 
Sent: September 26, 2022 12:08 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph configuration for rgw

Just adding this:

ses7-host1:~ # ceph config set client.rgw.ebl-rgw rgw_frontends "beast
port=8080"

This change is visible in the config get output:

client.rgw.ebl-rgw  basic  rgw_frontends  beast port=8080


Zitat von Eugen Block :

> Hi,
>
> the docs [1] show how to specify the rgw configuration via yaml
> file (similar to OSDs).
> If you applied it with ceph orch you should see your changes in the
> 'ceph config dump' output, or like this:
>
> ---snip---
> ses7-host1:~ # ceph orch ls | grep rgw
> rgw.ebl-rgw?:80 2/2  33s ago3M   ses7-host3;ses7-host4
>
> ses7-host1:~ # ceph config get client.rgw.ebl-rgw
> WHO                 MASK  LEVEL     OPTION           VALUE                                            RO
> global                    basic     container_image  registry.fqdn:5000/ses/7.1/ceph/ceph@sha256:...  *
> client.rgw.ebl-rgw        basic     rgw_frontends    beast port=80                                    *
> client.rgw.ebl-rgw        advanced  rgw_realm        ebl-rgw                                          *
> client.rgw.ebl-rgw        advanced  rgw_zone         ebl-zone                                         *
> ---snip---
>
> As you see the RGWs are clients so you need to consider that when
> you request the current configuration. But what I find strange is
> that apparently it only shows the config initially applied, it
> doesn't show the changes after running 'ceph orch apply -i rgw.yaml'
> although the changes are applied to the containers after restarting
> them. I don't know if this is intended but sounds like a bug to me
> (I haven't checked).
>
>> 1) When start rgw with cephadm ("orch apply -i "), I have
>> to start the daemon
>>then update configuration file and restart. I don't find a way
>> to achieve this by single step.
>
> I haven't played around too much yet, but you seem to be right,
> changing the config isn't applied immediately, but only after a
> service restart ('ceph orch restart rgw.ebl-rgw'). Maybe that's on
> purpose? So you can change your config now and apply it later when a
> service interruption is not critical.
>
>
> [1] https://docs.ceph.com/en/pacific/cephadm/services/rgw/
>
> Zitat von Tony Liu :
>
>> Hi,
>>
>> The cluster is Pacific 16.2.10 with containerized service and
>> managed by cephadm.
>>
>> "config show" shows running configuration. Who is supported?
>> mon, mgr and osd all work, but rgw doesn't. Is this expected?
>> I tried with client. and
>> without "client",
>> neither works.
>>
>> When issue "config show", who connects the daemon and retrieves
>> running config?
>> Is it mgr or mon?
>>
>> Config update by "config set" will be populated to the service.
>> Which services are
>> supported by this? I know mon, mgr and osd work, but rgw doesn't.
>> Is this expected?
>> I assume this is similar to "config show", this support needs the
>> capability of mgr/mon
>> to connect to service daemon?
>>
>> To get running config from rgw, I always do
>> "docker exec  ceph daemon  config show".
>> Is that the only way? I assume it's the same to get running config
>> from all services.
>> Just the matter of supported by mgr/mon or not?
>>
>> I've been configuring rgw by configuration file. Is that the
>> recommended way?
>> I tried with configuration db, like "config set", it doesn't seem working.
>> Is this expected?
>>
>> I see two cons with configuration file for rgw.
>> 1) When start rgw with cephadm ("orch apply -i "), I have
>> to start the daemon
>>then update configuration file and restart. I don't find a way
>> to achieve this by single step.
>> 2) When "orch daemon redeploy" or upgrade rgw, the configuration
>> file will be re-generated
>>   and I have to update it again.
>> Is this all how it's supposed to work or I am missing anything?
>>
>>
>> Thanks!
>> Tony
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-26 Thread Marc
Hi Frank,

Thank you very much for this! :)

> 
> we just completed a third upgrade test. There are 2 ways to convert the
> OSDs:
> 
> A) convert along with the upgrade (quick-fix-on-start=true)
> B) convert after setting require-osd-release=octopus (quick-fix-on-
> start=false until require-osd-release set to octopus, then restart to
> initiate conversion)
> 
> There is a variation A' of A: follow A, then initiate manual compaction
> and restart all OSDs.
> 
> Our experiments show that paths A and B do *not* yield the same result.
> Following path A leads to a severely performance degraded cluster. As of
> now, we cannot confirm that A' fixes this. It seems that the only way
> out is to zap and re-deploy all OSDs, basically what Boris is doing
> right now.
> 
> We extended now our procedure to adding
> 
>   bluestore_fsck_quick_fix_on_mount = false
> 
> to every ceph.conf file and executing
> 
>   ceph config set osd bluestore_fsck_quick_fix_on_mount false
> 
> to catch any accidents. After daemon upgrade, we set
> bluestore_fsck_quick_fix_on_mount = true host by host in the ceph.conf
> and restart OSDs.
> 
> This procedure works like a charm.
> 
> I don't know what the difference between A and B is. It is possible that
> B executes an extra step that is missing in A. The performance
> degradation only shows up when snaptrim is active, but then it is very
> severe. I suspect that many users who complained about snaptrim in the
> past have at least 1 A-converted OSD in their cluster.
> 
> If you have a cluster upgraded with B-converted OSDs, it works like a
> native octopus cluster. There is very little performance reduction
> compared with mimic. In exchange, I have the impression that it operates
> more stable.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osds not bootstrapping: monclient: wait_auth_rotating timed out

2022-09-26 Thread Wyll Ingersoll


Yes, we restarted the primary mon and mgr services.  Still no luck.


From: Dhairya Parmar 
Sent: Monday, September 26, 2022 3:44 PM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] osds not bootstrapping: monclient: wait_auth_rotating 
timed out

Looking at the shared tracker, I can see people talking about restarting 
primary mon/mgr
and getting this fixed at note-4 
and note-8. Did you try that out?

On Tue, Sep 27, 2022 at 12:44 AM Wyll Ingersoll <wyllys.ingers...@keepertech.com> wrote:
Ceph Pacific (16.2.9) on a large cluster.  Approximately 60 (out of 700) osds 
fail to start and show an error:

monclient: wait_auth_rotating timed out after 300

We modified the "rotating_keys_bootstrap_timeout" from 30 to 300, but they 
still fail.  All nodes are time-synced with NTP and the skew has been verified 
to be < 1.0 seconds.
It looks a lot like this bug: https://tracker.ceph.com/issues/17170  which does 
not appear to be resolved yet.

Any other suggestions on how to get these OSDs to sync up with the cluster?


thanks!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to 
ceph-users-le...@ceph.io



--
Dhairya Parmar

He/Him/His

Associate Software Engineer, CephFS

Red Hat Inc.

dpar...@redhat.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osds not bootstrapping: monclient: wait_auth_rotating timed out

2022-09-26 Thread Dhairya Parmar
Looking at the shared tracker, I can see people talking about restarting
primary mon/mgr
and getting this fixed at note-4 and note-8. Did you try that out?

On Tue, Sep 27, 2022 at 12:44 AM Wyll Ingersoll <
wyllys.ingers...@keepertech.com> wrote:

> Ceph Pacific (16.2.9) on a large cluster.  Approximately 60 (out of 700)
> osds fail to start and show an error:
>
> monclient: wait_auth_rotating timed out after 300
>
> We modified the "rotating_keys_bootstrap_timeout" from 30 to 300, but they
> still fail.  All nodes are time-synced with NTP and the skew has been
> verified to be < 1.0 seconds.
> It looks a lot like this bug: https://tracker.ceph.com/issues/17170
> which does not appear to be resolved yet.
>
> Any other suggestions on how to get these OSDs to sync up with the cluster?
>
>
> thanks!
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

-- 
*Dhairya Parmar*

He/Him/His

Associate Software Engineer, CephFS

Red Hat Inc. 

dpar...@redhat.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Cluster clone

2022-09-26 Thread Dhairya Parmar
Can you provide some more information on this? Can you show exactly what
error you get while trying to start the cluster?

On Mon, Sep 26, 2022 at 7:19 PM Ahmed Bessaidi 
wrote:

> Hello,
> I am working on cloning an existent Ceph Cluster (VMware).
> I fixed the IP/hostname part, but I cannot get the cloned cluster to start
> (Monitors issues).
> Any ideas ?
>
>
>
>
> Best Regards,
> Ahmed.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

-- 
*Dhairya Parmar*

He/Him/His

Associate Software Engineer, CephFS

Red Hat Inc. 

dpar...@redhat.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Newer linux kernel cephfs clients is more trouble?

2022-09-26 Thread William Edwards

Stefan Kooman wrote on 2022-05-11 18:06:

Hi List,

We have quite a few linux kernel clients for CephFS. One of our
customers has been running mainline kernels (CentOS 7 elrepo) for the
past two years. They started out with 3.x kernels (default CentOS 7),
but upgraded to mainline when those kernels would frequently generate
MDS warnings like "failing to respond to capability release". That
worked fine until 5.14 kernel. 5.14 and up would use a lot of CPU and
*way* more bandwidth on CephFS than older kernels (order of
magnitude). After the MDS was upgraded from Nautilus to Octopus that
behavior is gone (comparable CPU / bandwidth usage as older kernels).
However, the newer kernels are now the ones that give "failing to
respond to capability release", and worse, clients get evicted
(unresponsive as far as the MDS is concerned). Even the latest 5.17
kernels have that. No difference is observed between using messenger
v1 or v2. MDS version is 15.2.16.
Surprisingly the latest stable kernels from CentOS 7 work flawlessly
now. Although that is good news, newer operating systems come with
newer kernels.

Does anyone else observe the same behavior with newish kernel clients?


Yes.

I upgraded some CephFS clients from kernel 5.10.0 to 5.18.0. Ever since, 
I've experienced these issues on these clients:


- On the busiest client, ceph-msgr reads 3 - 6 Gb/s from disk. With 
5.10.0, this rarely exceeds 200 K/s.

- Clients more often don't respond to capability release.

The cluster is running Nautilus (14.2.22).



Gr. Stefan

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
With kind regards,

William Edwards

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] osds not bootstrapping: monclient: wait_auth_rotating timed out

2022-09-26 Thread Wyll Ingersoll
Ceph Pacific (16.2.9) on a large cluster.  Approximately 60 (out of 700) osds 
fail to start and show an error:

monclient: wait_auth_rotating timed out after 300

We modified the "rotating_keys_bootstrap_timeout" from 30 to 300, but they 
still fail.  All nodes are time-synced with NTP and the skew has been verified 
to be < 1.0 seconds.
It looks a lot like this bug: https://tracker.ceph.com/issues/17170  which does 
not appear to be resolved yet.

Any other suggestions on how to get these OSDs to sync up with the cluster?
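
For completeness, the checks and the change we applied look roughly like this (a
sketch; osd.123 stands for one of the failing OSDs):

# verify clock skew from the cluster's point of view
ceph time-sync-status
# the timeout we raised from 30 to 300 seconds
ceph config set osd rotating_keys_bootstrap_timeout 300
# confirm the failing OSD's key is still present in the auth database
ceph auth get osd.123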


thanks!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: weird performance issue on ceph

2022-09-26 Thread Frank Schilder
Hi Zoltan and Mark,

this observation of performance loss when a solid-state drive gets full and/or 
has exceeded a certain number of write OPs is very typical even for enterprise 
SSDs. This performance drop can be very dramatic. Therefore, I'm reluctant to 
add untested solid-state drives (SSD/NVMe) to our cluster, because a single bad 
choice can ruin everything.

For testing, I always fill the entire drive before performing a benchmark. I 
found only a few drives that don't suffer from this kind of performance 
degradation. Manufacturers of such "good" drives usually provide "sustained 
XYZ" performance specs instead of just "XYZ", for example, "sustained write 
IOP/s" instead of "write IOP/s". When you start a test on these, they start 
with much higher than spec performance and settle down to spec as they fill up. 
A full drive lives up to specs for its declared lifetime. The downside is that 
these drives are usually very expensive; I never saw a cheap one living up 
to specs when full or after a couple of days of fio random 4K writes.
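
To illustrate, the kind of test I mean looks roughly like this (a sketch; the 
device name is an example and the test destroys the drive's contents):

# 1) fill the whole device sequentially so the benchmark runs against a "full" drive
fio --name=precondition --filename=/dev/nvme0n1 --rw=write --bs=1M \
    --iodepth=32 --ioengine=libaio --direct=1
# 2) then measure steady-state 4K random write performance over a long period
fio --name=sustained-randwrite --filename=/dev/nvme0n1 --rw=randwrite --bs=4k \
    --iodepth=32 --ioengine=libaio --direct=1 --time_based --runtime=24h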

I believe the Samsung PM-drives have been flagged in earlier posts as "a bit 
below expectation". There were also a lot of posts with other drives where 
users got a rude awakening.

I wonder if it might be a good idea to collect such experience somewhere in the 
ceph documentation, for example, a link under hardware recommendations -> solid 
state drives in the docs. Are there legal implications in creating a list of 
drives showing the effective sustained performance of a drive in a ceph cluster? 
Maybe according to a standardised benchmark that hammers the drives for a 
couple of weeks and provides sustained performance information under sustained 
max load, in contrast to peak load (which most cheap drives are optimized for 
and which therefore makes them less suitable for a constant-load system like ceph)?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Nelson 
Sent: 26 September 2022 16:52
To: ceph-users@ceph.io
Subject: [ceph-users] Re: weird performance issue on ceph

Hi Zoltan,


Great investigation work!  I think in my tests the data set typically
was smaller than 500GB/drive.  If you have a simple fio test that can be
run against a bare NVMe drive I can try running it on one of our test
nodes.  FWIW I kind of suspected that the issue I had to work around for
quincy might have been related to some kind of internal cache being
saturated.  I wonder if the drive is fast up until some limit is hit
where it's reverted to slower flash or something?


Mark


On 9/26/22 06:39, Zoltan Langi wrote:
> Hi Mark and the mailing list, we managed to figure out something very
> weird that I would like to share with you, and ask if you have seen
> anything like this before.
>
> We started to investigate the drives one-by-one after Mark's
> suggestion that a few osd-s are holding back the ceph and we noticed
> this:
>
> When the disk usage reaches 500GB on a single drive, the drive loses
> half of its write performance compared to when it's empty.
> To show you, let's see the fio write performance when the disk is empty:
> Jobs: 4 (f=4): [W(4)][6.0%][w=1930MiB/s][w=482 IOPS][eta 07h:31m:13s]
> We see, when the disk is empty, the drive achieves almost 1,9GB/s
> throughput and 482 iops. Very decent values.
>
> However! When the disk gets to 500GB full and we start to write a new
> file, all of a sudden we get these values:
> Jobs: 4 (f=4): [W(4)][0.9%][w=1033MiB/s][w=258 IOPS][eta 07h:55m:43s]
> As we see we lost significant throughput and iops as well.
>
> If we remove all the files and do an fstrim on the disk, the
> performance returns back to normal again.
>
> If we format the disk, no need to do fstrim, we get the performance
> back to normal again. That explains why the ceph recreation from
> scratch helped us.
>
> Have you seen this behaviour before in your deployments?
>
> Thanks,
>
> Zoltan
>
> On 17.09.22 at 06:58, Mark Nelson wrote:
>>
>>
>> Hi Zoltan,
>>
>>
>> So kind of interesting results.  In the "good" write test the OSD
>> doesn't actually seem to be working very hard.  If you look at the kv
>> sync thread, it's mostly idle with only about 22% of the time in the
>> thread spent doing real work:
>>
>> 1.
>>| + 99.90% BlueStore::_kv_sync_thread()
>> 2.
>>| + 78.60%
>> std::condition_variable::wait(std::unique_lock&)
>> 3.
>>| |+ 78.60% pthread_cond_wait
>> 4.
>>| + 18.00%
>> RocksDBStore::submit_transaction_sync(std::shared_ptr)
>>
>>
>> ...but at least it's actually doing work!  For reference though, on
>> our high performing setup with enough concurrency we can push things
>> hard enough where this thread isn't spending much time in
>> pthread_cond_wait.  In the "bad" state, your e

[ceph-users] Re: Cephadm credential support for private container repositories

2022-09-26 Thread John Mulligan
On Monday, September 26, 2022 12:53:04 PM EDT Gary Molenkamp wrote:
> I'm trying to determine whether cephadm can use credential based login 
> for container images from private repositories.  I don't see anything 
> obvious on the official documentation for cephadm to specify the 
> credentials to use.   Can someone confirm whether this is supported?
> 
> The motivation for the question is to find a solution to lack of zabbix 
> support in the standard container images:
>  https://github.com/ceph/ceph-container/issues/1651
> Our zabbix setup uses PSK for clients and we need to keep this
> confidential.
 
> I see three approaches that may work:
>  #1 Extend the standard container image with the zabbix 
> software+configs and place the containers on a hosted private repo. This 
> is easy to maintain, but requires support in cephadm to pull the images 
> with credentials.
>  #2  Extend the standard container image with the software+configs 
> and host the containers on a self-hosted repo. More work to maintain the 
> repository, but does not require cephadm to log into the repo.
>  #3. Extend the standard container image with the software, hosted 
> on a public repo, and use 
> https://docs.ceph.com/en/latest/cephadm/services/#extra-container-arguments
> 
 to map in the config/psk files for zabbix.
> 
> I'm leaning toward solution #3, but it would be nice to know if 
> credential login is supported.
>


cephadm has a 'registry-login' subcommand.  I think this might be what you are 
looking for. 

https://docs.ceph.com/en/quincy/api/mon_command_api/#cephadm-registry-login
https://docs.ceph.com/en/latest/man/8/cephadm/

I'm not finding a lot more documentation for this command - so if you find 
these 
links too sparse you can also consider filing a tracker issue for expanding the 
docs for this command.
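
If it helps, the call looks roughly like this (a sketch; the registry URL and
credentials are placeholders, and the exact flags may differ between releases, so
please double-check against the linked docs):

# log the host in to a private registry so cephadm can pull images from it
cephadm registry-login --registry-url registry.example.com \
    --registry-username myuser --registry-password mypassword
# or read the details from a JSON file:
# cephadm registry-login --registry-json /root/registry.json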


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: External RGW always down

2022-09-26 Thread Eugen Block

> Yes, I have an inactive pgs when the osd goes down. Then I started the osds
> manually. But the rgw fails to start.

But the OSDs stay online if you start them manually? Do the inactive
PGs recover when you start them manually? By the way, you should check
your crush rules, depending on how many OSDs fail you may have room
for improvement there. And why do the OSDs fail with automatic
restart, what's in the logs?

> Only upgrading to a newer version is only for the issue and we faced this
> issue two times.

What versions are you using (ceph versions)?

> I dont know why it is happening. But maybe the rgw are running in separate
> machines. This causes the issue ?

I don't know how that should

Zitat von Monish Selvaraj :


Hi Eugen,

Yes, I have an inactive pgs when the osd goes down. Then I started the osds
manually. But the rgw fails to start.

Only upgrading to a newer version is only for the issue and we faced this
issue two times.

I dont know why it is happening. But maybe the rgw are running in separate
machines. This causes the issue ?

On Sat, Sep 10, 2022 at 11:27 PM Eugen Block  wrote:


You didn’t respond to the other questions. If you want people to be
able to help you need to provide more information. If your OSDs fail
do you have inactive PGs? Or do you have full OSDs which would RGW
prevent from starting? I’m assuming that if you fix your OSDs the RGWs
would start working again. But then again, we still don’t know
anything about the current situation.

Zitat von Monish Selvaraj :

> Hi Eugen,
>
> Below is the log output,
>
> 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  0 framework: beast
> 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  0 framework conf key: port,
val:
> 80
> 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  1 radosgw_Main not setting
numa
> affinity
> 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  1 rgw_d3n:
> rgw_d3n_l1_local_datacache_enabled=0
> 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  1 D3N datacache enabled: 0
> 2022-09-07T12:03:53.313+ 7fdd23fdc5c0  1 rgw main: int
> RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:03:53.313+ 7fdd23fdc5c0  1 rgw main: int
> RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:08:42.891+ 7fdd1661c700 -1 Initialization timeout,
failed
> to initialize
> 2022-09-07T12:08:53.395+ 7f69017095c0  0 deferred set uid:gid to
> 167:167 (ceph:ceph)
> 2022-09-07T12:08:53.395+ 7f69017095c0  0 ceph version 17.2.0
> (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process
> radosgw, pid 7
> 2022-09-07T12:08:53.395+ 7f69017095c0  0 framework: beast
> 2022-09-07T12:08:53.395+ 7f69017095c0  0 framework conf key: port,
val:
> 80
> 2022-09-07T12:08:53.395+ 7f69017095c0  1 radosgw_Main not setting
numa
> affinity
> 2022-09-07T12:08:53.395+ 7f69017095c0  1 rgw_d3n:
> rgw_d3n_l1_local_datacache_enabled=0
> 2022-09-07T12:08:53.395+ 7f69017095c0  1 D3N datacache enabled: 0
> 2022-09-07T12:09:03.747+ 7f69017095c0  1 rgw main: int
> RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:09:03.747+ 7f69017095c0  1 rgw main: int
> RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> const RGWCacheNotifyInfo&, optional_yi>
> 2022-09-07T12:13:53.397+ 7f68f3d49700 -1 Initialization timeout,
failed
> to initialize
>
> I installed the cluster in quincy.
>
>
> On Sat, Sep 10, 2022 at 4:02 PM Eugen Block  wrote:
>
>> What troubleshooting have you tried? You don’t provide any log output
>> or information about the cluster setup, for example the ceph osd tree,
>> ceph status, are the failing OSDs random or do they all belong to the
>> same pool? Any log output from failing OSDs and the RGWs might help,
>> otherwise it’s just wild guessing. Is the cluster a new installation
>> with cephadm or an older cluster upgraded to Quincy?
>>
>> Zitat von Monish Selvaraj :
>>
>> > Hi all,
>> >
>> > I have one critical issue in my prod cluster. When the customer's data
>> > comes from 600 MiB .
>> >
>> > My Osds are down *8 to 20 from 238* . Then I manually up my osds .
After
>> a
>> > few minutes, my all rgw crashes.
>> >
>> > We did some troubleshooting but nothing works. When we upgrade ceph to
>> > 17.2.0. to 17.2.1 is resolved. Also we faced the issue two times. But
>> both
>> > times we upgraded the ceph.
>> >
>> > *Node schema :*
>> >
>> > *Node 1 to node 5 --> mon,mgr and osds*
>> > *Node 6 to Node15 --> only osds*
>> > *Node 16 to Node 20 --> only rgws.*
>> >
>> > Kindly, check this issue and let me know the correct troubleshooting
>> method.
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>>
>> ___
>> 

[ceph-users] Re: PGImbalance

2022-09-26 Thread Eugen Block

Is the autoscaler running [1]? You can see the status with:

ceph osd pool autoscale-status

If it's turned off you can enable warn mode first to see what it would do:

ceph osd pool set  pg_autoscale_mode warn

If the autoscaler doesn't help you could increase the pg_num manually  
to 512 and see how the distribution changes.


[1] https://docs.ceph.com/en/pacific/rados/operations/placement-groups/
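
If you go the manual route, a sketch (the pool name is a placeholder):

# per-OSD PG counts, to see the current spread
ceph osd df tree
# with ~306 OSDs and a 4+2 EC pool, 512 PGs gives roughly 10 PG shards per OSD,
# which usually balances much better than 256
ceph osd pool set <ec-data-pool> pg_num 512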

Zitat von mailing-lists :


Dear Ceph-Users,

i've recently setup a 4.3P Ceph-Cluster with cephadm.

I am seeing that the health is ok, as seen here:

ceph -s
  cluster:
    id: 8038f0xxx
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum  
ceph-a2-07,ceph-a1-01,ceph-a1-10,ceph-a2-01,ceph-a1-05 (age 3w)

    mgr: ceph-a1-01.mkptvb(active, since 2d), standbys: ceph-a2-01.bznood
    osd: 306 osds: 306 up (since 3w), 306 in (since 3w)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   7 pools, 420 pgs
    objects: 7.74M objects, 30 TiB
    usage:   45 TiB used, 4.3 PiB / 4.3 PiB avail
    pgs: 420 active+clean

But the Monitoring from the dashboard tells me, "CephPGImbalance"  
for several OSDs. The balancer is enabled and set to upmap.


ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.011314",
    "last_optimize_started": "Mon Sep 26 14:23:32 2022",
    "mode": "upmap",
    "optimize_result": "Unable to find further optimization, or  
pool(s) pg_num is decreasing, or distribution is already perfect",

    "plans": []
}

My main datapool is not yet filled by much. Its roughly 50T filled  
and I've set it to 256 PG_num. It is a 4+2 EC pool.


The average PG per OSD is 6.6, but actually some OSDs have 1, and  
some have up to 13 PGs... so it is in fact very unbalanced, but I  
don't know how to solve this, since the balancer is telling me, that  
everything is just fine. Do you have a hint for me?



Best

Ken







___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephadm credential support for private container repositories

2022-09-26 Thread Gary Molenkamp
I'm trying to determine whether cephadm can use credential-based login 
for container images from private repositories.  I don't see anything 
obvious in the official documentation for cephadm on how to specify the 
credentials to use.   Can someone confirm whether this is supported?


The motivation for the question is to find a solution to lack of zabbix 
support in the standard container images:

    https://github.com/ceph/ceph-container/issues/1651
Our zabbix setup uses PSK for clients and we need to keep this confidential.

I see three approaches that may work:
    #1 Extend the standard container image with the zabbix 
software+configs and place the containers on a hosted private repo. This 
is easy to maintain, but requires support in cephadm to pull the images 
with credentials.
    #2  Extend the standard container image with the software+configs 
and host the containers on a self-hosted repo. More work to maintain the 
repository, but does not require cephadm to log into the repo.
    #3. Extend the standard container image with the software, hosted 
on a public repo, and use 
https://docs.ceph.com/en/latest/cephadm/services/#extra-container-arguments 
to map in the config/psk files for zabbix.


I'm leaning toward solution #3, but it would be nice to know if 
credential login is supported.


Thanks
Gary



--
Gary Molenkamp  Science Technology Services
Systems Administrator   University of Western Ontario
molen...@uwo.ca http://sts.sci.uwo.ca
(519) 661-2111 x86882   (519) 661-3566

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow OSD startup and slow ops

2022-09-26 Thread Gauvain Pocentek
Hello Stefan,

Thank you for your answers.

On Thu, Sep 22, 2022 at 5:54 PM Stefan Kooman  wrote:

> Hi,
>
> On 9/21/22 18:00, Gauvain Pocentek wrote:
> > Hello all,
> >
> > We are running several Ceph clusters and are facing an issue on one of
> > them, we would appreciate some input on the problems we're seeing.
> >
> > We run Ceph in containers on Centos Stream 8, and we deploy using
> > ceph-ansible. While upgrading ceph from 16.2.7 to 16.2.10, we noticed
> that
> > OSDs were taking a very long time to restart on one of the clusters.
> (Other
> > clusters were not impacted at all.)
>
> Are the other clusters of similar size?
>

We have at least one cluster that is roughly the same size. It has not been
upgraded yet but restarting the OSDs doesn't create any issues.



> The OSD startup was so slow sometimes
> > that we ended up having slow ops, with 1 or 2 pg stuck in a peering
> state.
> > We've interrupted the upgrade and the cluster runs fine now, although we
> > have seen 1 OSD flapping recently, having trouble coming back to life.
> >
> > We've checked a lot of things and read a lot of mails from this list, and
> > here are some info:
> >
> > * this cluster has RBD pools for OpenStack and RGW pools; everything is
> > replicated x 3, except the RGW data pool which is EC 4+2
> > * we haven't found any hardware related issues; we run fully on SSDs and
> > they are all in good shape, no network issue, RAM and CPU are available
> on
> > all OSD hosts
> > * bluestore with an LVM collocated setup
> > * we have seen the slow restart with almost all the OSDs we've upgraded
> > (100 out of 350)
> > * on restart the ceph-osd process runs at 100% CPU but we haven't seen
> > anything weird on the host
>
> Are the containers restricted to use a certain amount of CPU? Do the
> OSDs, after ~ 10-20 seconds increase their CPU usage to 200% (if so this
> is probably because of rocksdb option max_background_compactions = 2).
>

This is actually a good point. We run the containers with --cpus=2. We also
had a couple of incidents where OSDs started to act up on nodes where VMs were
running CPU-intensive workloads (we have a hyperconverged setup with
OpenStack). So there's definitely something going on there.

I haven't had the opportunity to do a new restart to check more about the
CPU usage, but I hope to do that this week.


>
> > * no DB spillover
> > * we have other clusters with the same hardware, and we don't see
> problems
> > there
> >
> > The only thing that we found that looks suspicious is the number of op
> logs
> > for the PGs of the RGW index pool. `osd_max_pg_log_entries` is set to 10k
> > but `ceph pg dump` show PGs with more than 100k logs (the largest one
> has >
> > 400k logs).
> >
> > Could this be the reason for the slow startup of OSDs? If so is there a
> way
> > to trim these logs without too much impact on the cluster?
>
> Not sure. We have ~ 2K logs per PG.
>
> >
> > Let me know if additional info or logs are needed.
>
> Do you have a log of slow ops and osd logs?
>

I will get more logs when I restart an OSD this week. What log levels for
bluestore/rocksdb would you recommend?


>
> Do you have any non-standard configuration for the daemons? I.e. ceph
> daemon osd.$id config diff
>

Nothing non-standard.


>
> We are running a Ceph Octopus (15.2.16) cluster with similar
> configuration. We have *a lot* of slow ops when starting OSDs. Also
> during peering. When the OSDs start they consume 100% CPU for up to ~ 10
> seconds, and after that consume 200% for a minute or more. During that
> time the OSDs perform a compaction. You should be able to find this in
> the OSD logs if it's the same in your case. After some time the OSDs are done
> initializing and starting the boot process. As soon as they boot up and
> start peering the slow ops start to kick in. Lot's of "transitioning to
> Primary" and "transitioning to Stray" logging. Some time later the OSD
> becomes "active". While the OSD is busy with peering it's also busy
> compacting. As I also see RocksDB compaction logging. So it might be due
> to RocksDB compactions impacting OSD performance while it's already busy
> becoming primary (and or secondary / tertiary) for it's PGs.
>
> We had norecover, nobackfill, norebalance active when booting the OSDs.
>
> So, it might just take a long time to do RocksDB compaction. In this
> case it might be better to do all needed RocksDB compactions, and then
> start booting. So, what might help is to set "ceph osd set noup". This
> prevents the OSD from becoming active, then wait for the RocksDB
> compactions, and after that unset the flag.
>
> If you try this, please let me know how it goes.
>

That sounds like a good thing to try, I'll keep you posted.
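
For the record, the sequence I plan to try based on your description (a sketch;
osd.12 is just an example id):

# keep restarted OSDs from being marked 'up' while they compact
ceph osd set noup
# restart the OSD container, watch its log until the RocksDB compaction finishes,
# then let it join again:
ceph osd unset noup
# an online compaction can also be requested explicitly afterwards:
ceph tell osd.12 compact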

Thanks again,
Gauvain
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: weird performance issue on ceph

2022-09-26 Thread Mark Nelson

Hi Zoltan,


Great investigation work!  I think in my tests the data set typically 
was smaller than 500GB/drive.  If you have a simple fio test that can be 
run against a bare NVMe drive I can try running it on one of our test 
nodes.  FWIW I kind of suspected that the issue I had to work around for 
quincy might have been related to some kind of internal cache being 
saturated.  I wonder if the drive is fast up until some limit is hit 
where it's reverted to slower flash or something?



Mark


On 9/26/22 06:39, Zoltan Langi wrote:
Hi Mark and the mailing list, we managed to figure out something very 
weird that I would like to share with you, and ask if you have seen 
anything like this before.


We started to investigate the drives one-by-one after Mark's 
suggestion that a few osd-s are holding back the ceph and we noticed 
this:


When the disk usage reaches 500GB on a single drive, the drive loses 
half of its write performance compared to when it's empty.

To show you, let's see the fio write performance when the disk is empty:
Jobs: 4 (f=4): [W(4)][6.0%][w=1930MiB/s][w=482 IOPS][eta 07h:31m:13s]
We see, when the disk is empty, the drive achieves almost 1,9GB/s 
throughput and 482 iops. Very decent values.


However! When the disk gets to 500GB full and we start to write a new 
file, all of a sudden we get these values:

Jobs: 4 (f=4): [W(4)][0.9%][w=1033MiB/s][w=258 IOPS][eta 07h:55m:43s]
As we see we lost significant throughput and iops as well.

If we remove all the files and do an fstrim on the disk, the 
performance returns back to normal again.


If we format the disk, no need to do fstrim, we get the performance 
back to normal again. That explains why the ceph recreation from 
scratch helped us.


Have you seen this behaviour before in your deployments?

Thanks,

Zoltan

On 17.09.22 at 06:58, Mark Nelson wrote:






Hi Zoltan,


So kind of interesting results.  In the "good" write test the OSD 
doesn't actually seem to be working very hard.  If you look at the kv 
sync thread, it's mostly idle with only about 22% of the time in the 
thread spent doing real work:


1.
   | + 99.90% BlueStore::_kv_sync_thread()
2.
   | + 78.60% 
std::condition_variable::wait(std::unique_lock&)

3.
   | |+ 78.60% pthread_cond_wait
4.
   | + 18.00%
RocksDBStore::submit_transaction_sync(std::shared_ptr) 



...but at least it's actually doing work!  For reference though, on 
our high performing setup with enough concurrency we can push things 
hard enough where this thread isn't spending much time in 
pthread_cond_wait.  In the "bad" state, your example OSD here is 
basically doing nothing at all (100% of the time in 
pthread_cond_wait!).  The tp_osd_tp and the kv sync thread are just 
waiting around twiddling their thumbs:


1.
   Thread 339848 (bstore_kv_sync) - 1000 samples
2.
   + 100.00% clone
3.
   + 100.00% start_thread
4.
   + 100.00% BlueStore::KVSyncThread::entry()
5.
   + 100.00% BlueStore::_kv_sync_thread()
6.
   + 100.00% 
std::condition_variable::wait(std::unique_lock&)

7.
   + 100.00% pthread_cond_wait


My first thought is that you might have one or more OSDs that are 
slowing the whole cluster down so that clients are backing up on it 
and other OSDs are just waiting around for IO.  It might be worth 
checking the perf admin socket stats on each OSD to see if you can 
narrow down if any of them are having issues.
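
Something along these lines (a sketch; the OSD id and the jq filter are just
examples):

# dump per-OSD performance counters via the admin socket and look at op latencies
ceph daemon osd.3 perf dump | jq '.osd | {op_latency, op_w_latency, op_r_latency}'
# reset the counters first if you want numbers for a specific test window
ceph daemon osd.3 perf reset all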



Thanks,

Mark


On 9/16/22 05:57, Zoltan Langi wrote:
Hey people and Mark, the cluster was left overnight to do nothing 
and the problem as expected came back in the morning. We managed to 
capture the bad states on the exact same OSD-s we captured the good 
states earlier:


Here is the output of a read test when the cluster is in a bad state 
on the same OSD which I recorded in the good state earlier:


https://pastebin.com/jp5JLWYK

Here is the output of a write test when the cluster is in a bad 
state on the same OSD which I recorded in the good state earlier:


The write speed came down from 30,1GB/s to 17,9GB/s

https://pastebin.com/9e80L5XY

We are still open for any suggestions, so please feel free to 
comment or suggest. :)


Thanks a lot,
Zoltan

On 15.09.22 at 16:53, Zoltan Langi wrote:
Hey people and Mark, we managed to capture the good and bad states 
separately:


Here is the output of a read test when the cluster is in a bad state:

https://pastebin.com/0HdNapLQ

Here is the output of a write test when the cluster is in a bad state:

https://pastebin.com/2T2pKu6Q

Here is the output of a read test when the cluster is in a brand 
new reinstalled state:


https://pastebin.com/qsKeX0D8

Here is the output of a write test when the cluster is in a brand 
new reinstalled state:


https://pastebin.com/nTCuEUAb

Hope anyone can suggest anything, any ideas are welcome! :)

Zoltan

Am 13.09.2

[ceph-users] PGImbalance

2022-09-26 Thread mailing-lists

Dear Ceph-Users,

i've recently setup a 4.3P Ceph-Cluster with cephadm.

I am seeing that the health is ok, as seen here:

ceph -s
  cluster:
    id: 8038f0xxx
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum 
ceph-a2-07,ceph-a1-01,ceph-a1-10,ceph-a2-01,ceph-a1-05 (age 3w)

    mgr: ceph-a1-01.mkptvb(active, since 2d), standbys: ceph-a2-01.bznood
    osd: 306 osds: 306 up (since 3w), 306 in (since 3w)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   7 pools, 420 pgs
    objects: 7.74M objects, 30 TiB
    usage:   45 TiB used, 4.3 PiB / 4.3 PiB avail
    pgs: 420 active+clean

But the monitoring from the dashboard tells me "CephPGImbalance" for 
several OSDs. The balancer is enabled and set to upmap.


ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.011314",
    "last_optimize_started": "Mon Sep 26 14:23:32 2022",
    "mode": "upmap",
    "optimize_result": "Unable to find further optimization, or pool(s) 
pg_num is decreasing, or distribution is already perfect",

    "plans": []
}

My main data pool is not yet filled by much. It's roughly 50T filled and 
I've set it to 256 PG_num. It is a 4+2 EC pool.


The average PG count per OSD is 6.6, but actually some OSDs have 1 and some 
have up to 13 PGs... so it is in fact very unbalanced, but I don't know 
how to solve this, since the balancer is telling me that everything is 
just fine. Do you have a hint for me?



Best

Ken







___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: External RGW always down

2022-09-26 Thread Monish Selvaraj
Hi Eugen,

Yes, I have inactive PGs when the OSDs go down. Then I started the OSDs
manually, but the RGW fails to start.

Only upgrading to a newer version has resolved the issue, and we have faced
this issue two times.

I don't know why it is happening. But maybe it's because the RGWs are running
on separate machines? Could that cause the issue?

On Sat, Sep 10, 2022 at 11:27 PM Eugen Block  wrote:

> You didn’t respond to the other questions. If you want people to be
> able to help you need to provide more information. If your OSDs fail
> do you have inactive PGs? Or do you have full OSDs which would RGW
> prevent from starting? I’m assuming that if you fix your OSDs the RGWs
> would start working again. But then again, we still don’t know
> anything about the current situation.
>
> Zitat von Monish Selvaraj :
>
> > Hi Eugen,
> >
> > Below is the log output,
> >
> > 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  0 framework: beast
> > 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  0 framework conf key: port,
> val:
> > 80
> > 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  1 radosgw_Main not setting
> numa
> > affinity
> > 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  1 rgw_d3n:
> > rgw_d3n_l1_local_datacache_enabled=0
> > 2022-09-07T12:03:42.893+ 7fdd23fdc5c0  1 D3N datacache enabled: 0
> > 2022-09-07T12:03:53.313+ 7fdd23fdc5c0  1 rgw main: int
> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> > const RGWCacheNotifyInfo&, optional_yi>
> > 2022-09-07T12:03:53.313+ 7fdd23fdc5c0  1 rgw main: int
> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> > const RGWCacheNotifyInfo&, optional_yi>
> > 2022-09-07T12:08:42.891+ 7fdd1661c700 -1 Initialization timeout,
> failed
> > to initialize
> > 2022-09-07T12:08:53.395+ 7f69017095c0  0 deferred set uid:gid to
> > 167:167 (ceph:ceph)
> > 2022-09-07T12:08:53.395+ 7f69017095c0  0 ceph version 17.2.0
> > (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable), process
> > radosgw, pid 7
> > 2022-09-07T12:08:53.395+ 7f69017095c0  0 framework: beast
> > 2022-09-07T12:08:53.395+ 7f69017095c0  0 framework conf key: port,
> val:
> > 80
> > 2022-09-07T12:08:53.395+ 7f69017095c0  1 radosgw_Main not setting
> numa
> > affinity
> > 2022-09-07T12:08:53.395+ 7f69017095c0  1 rgw_d3n:
> > rgw_d3n_l1_local_datacache_enabled=0
> > 2022-09-07T12:08:53.395+ 7f69017095c0  1 D3N datacache enabled: 0
> > 2022-09-07T12:09:03.747+ 7f69017095c0  1 rgw main: int
> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> > const RGWCacheNotifyInfo&, optional_yi>
> > 2022-09-07T12:09:03.747+ 7f69017095c0  1 rgw main: int
> > RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&,
> > const RGWCacheNotifyInfo&, optional_yi>
> > 2022-09-07T12:13:53.397+ 7f68f3d49700 -1 Initialization timeout,
> failed
> > to initialize
> >
> > I installed the cluster in quincy.
> >
> >
> > On Sat, Sep 10, 2022 at 4:02 PM Eugen Block  wrote:
> >
> >> What troubleshooting have you tried? You don’t provide any log output
> >> or information about the cluster setup, for example the ceph osd tree,
> >> ceph status, are the failing OSDs random or do they all belong to the
> >> same pool? Any log output from failing OSDs and the RGWs might help,
> >> otherwise it’s just wild guessing. Is the cluster a new installation
> >> with cephadm or an older cluster upgraded to Quincy?
> >>
> >> Zitat von Monish Selvaraj :
> >>
> >> > Hi all,
> >> >
> >> > I have one critical issue in my prod cluster. When the customer's data
> >> > comes from 600 MiB .
> >> >
> >> > My OSDs go down, *8 to 20 out of 238*. Then I manually bring my OSDs up.
> >> > After a few minutes, all my RGWs crash.
> >> >
> >> > We did some troubleshooting but nothing worked. When we upgraded Ceph
> >> > from 17.2.0 to 17.2.1, the issue was resolved. We have faced the issue two
> >> > times, and both times we upgraded Ceph.
> >> >
> >> > *Node schema :*
> >> >
> >> > *Node 1 to node 5 --> mon,mgr and osds*
> >> > *Node 6 to Node15 --> only osds*
> >> > *Node 16 to Node 20 --> only rgws.*
> >> >
> >> > Kindly, check this issue and let me know the correct troubleshooting
> >> method.
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Cluster clone

2022-09-26 Thread Ahmed Bessaidi
Hello,
I am working on cloning an existing Ceph cluster (VMware).
I fixed the IP/hostname part, but I cannot get the cloned cluster to start
(monitor issues).
Any ideas?
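
In case the monitors are failing because the monmap still contains the old
addresses, the usual approach is to rewrite the monmap by hand. Below is only
a rough sketch of that procedure (mon IDs, names and addresses are
placeholders) and not something verified against this particular setup:

```
# Stop the MON, export its monmap, swap the old entry for the new address,
# then inject the result back before starting the MON again
systemctl stop ceph-mon@<mon-id>
ceph-mon -i <mon-id> --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm <old-mon-name> /tmp/monmap
monmaptool --add <mon-name> <new-ip>:6789 /tmp/monmap
ceph-mon -i <mon-id> --inject-monmap /tmp/monmap
systemctl start ceph-mon@<mon-id>
```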




Best Regards,
Ahmed.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS crashes after evicting client session

2022-09-26 Thread Kotresh Hiremath Ravishankar
You can find the upstream fix here https://github.com/ceph/ceph/pull/46833

Thanks,
Kotresh HR

On Mon, Sep 26, 2022 at 3:17 PM Dhairya Parmar  wrote:

> The patch for this has already been merged and backported to Quincy as well.
> It will be in the next Quincy release.
>
> On Thu, Sep 22, 2022 at 5:12 PM E Taka <0eta...@gmail.com> wrote:
>
> > Ceph 17.2.3 (dockerized in Ubuntu 20.04)
> >
> > The subject says it. The MDS process always crashes after evicting. ceph
> -w
> > shows:
> >
> > 2022-09-22T13:26:23.305527+0200 mds.ksz-cephfs2.ceph00.kqjdwe [INF]
> > Evicting (and blocklisting) client session 5181680 (
> > 10.149.12.21:0/3369570791)
> > 2022-09-22T13:26:35.729317+0200 mon.ceph00 [INF] daemon
> > mds.ksz-cephfs2.ceph03.vsyrbk restarted
> > 2022-09-22T13:26:36.039678+0200 mon.ceph00 [INF] daemon
> > mds.ksz-cephfs2.ceph01.xybiqv restarted
> > 2022-09-22T13:29:21.000392+0200 mds.ksz-cephfs2.ceph04.ekmqio [INF]
> > Evicting (and blocklisting) client session 5249349 (
> > 10.149.12.22:0/2459302619)
> > 2022-09-22T13:29:32.069656+0200 mon.ceph00 [INF] daemon
> > mds.ksz-cephfs2.ceph01.xybiqv restarted
> > 2022-09-22T13:30:00.000101+0200 mon.ceph00 [INF] overall HEALTH_OK
> > 2022-09-22T13:30:20.710271+0200 mon.ceph00 [WRN] Health check failed: 1
> > daemons have recently crashed (RECENT_CRASH)
> >
> > The crash info of the crashed MDS is:
> > # ceph crash info
> > 2022-09-22T11:26:24.013274Z_b005f3fc-7704-4cfc-96c5-f2a9c993f166
> > {
> >"assert_condition": "!mds->is_any_replay()",
> >"assert_file":
> >
> >
> "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc",
> >
> >"assert_func": "void MDLog::_submit_entry(LogEvent*,
> > MDSLogContextBase*)",
> >"assert_line": 283,
> >"assert_msg":
> >
> >
> "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc:
> > In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)'
> > thread 7f76fa8f6700 time
> >
> >
> 2022-09-22T11:26:23.992050+\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc:
> > 283: FAILED ceph_assert(!mds->is_any_replay())\n",
> >"assert_thread_name": "ms_dispatch",
> >"backtrace": [
> >"/lib64/libpthread.so.0(+0x12ce0) [0x7f770231bce0]",
> >"gsignal()",
> >"abort()",
> >"(ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x1b0) [0x7f770333bcd2]",
> >"/usr/lib64/ceph/libceph-common.so.2(+0x283e95) [0x7f770333be95]",
> >"(MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x3f)
> > [0x55991905efdf]",
> >"(Server::journal_close_session(Session*, int, Context*)+0x78c)
> > [0x559918d7d63c]",
> >"(Server::kill_session(Session*, Context*)+0x212)
> [0x559918d7dd92]",
> >"(Server::apply_blocklist()+0x10d) [0x559918d7e04d]",
> >"(MDSRank::apply_blocklist(std::set > std::less, std::allocator > const&,
> unsigned
> > int)+0x34) [0x559918d39d74]",
> >"(MDSRankDispatcher::handle_osd_map()+0xf6) [0x559918d3a0b6]",
> >"(MDSDaemon::handle_core_message(boost::intrusive_ptr const>
> > const&)+0x39b) [0x559918d2330b]",
> >"(MDSDaemon::ms_dispatch2(boost::intrusive_ptr
> > const&)+0xc3) [0x559918d23cc3]",
> >"(DispatchQueue::entry()+0x14fa) [0x7f77035c240a]",
> >"(DispatchQueue::DispatchThread::entry()+0x11) [0x7f7703679481]",
> >"/lib64/libpthread.so.0(+0x81ca) [0x7f77023111ca]",
> >"clone()"
> >],
> >"ceph_version": "17.2.3",
> >"crash_id":
> > "2022-09-22T11:26:24.013274Z_b005f3fc-7704-4cfc-96c5-f2a9c993f166",
> >"entity_name": "mds.ksz-cephfs2.ceph03.vsyrbk",
> >"os_id": "centos",
> >"os_name": "CentOS Stream",
> >"os_version": "8",
> >"os_version_id": "8",
> >"process_name": "ceph-mds",
> >"stack_sig":
> > "b75e46941b5f6b7c05a037f9af5d42bb19d82ab7fc6a3c168533fc31a42b4de8",
> >"timestamp": "2022-09-22T11:26:24.013274Z",
> >"utsname_hostname": "ceph03",
> >"utsname_machine": "x86_64",
> >"utsname_release": "5.4.0-125-generic",
> >"utsname_sysname": "Linux",
> >"utsname_version": "#141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022"
> > }
> >
> > (Don't be confused by the time information, "ceph -w" is UTC+2, "crash
> > info" is UTC)
> >
> > Should I report this a bug or did I miss something which caused the
> error?
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
>
> --
> *Dhairya Parmar*
>
> He/Him/His
>
> Associate Software Engineer, CephFS
>
> Red Hat Inc.

[ceph-users] Re: weird performance issue on ceph

2022-09-26 Thread Zoltan Langi
Hi Mark and the mailing list, we managed to figure out something very weird
that I would like to share with you, and to ask whether you have seen
anything like this before.


We started to investigate the drives one by one after Mark's suggestion
that a few OSDs might be holding back the cluster, and we noticed this:


When disk usage on a single drive reaches 500 GB, the drive loses half of
its write performance compared to when it is empty.

To illustrate, here is the fio write performance when the disk is empty:
Jobs: 4 (f=4): [W(4)][6.0%][w=1930MiB/s][w=482 IOPS][eta 07h:31m:13s]
When the disk is empty, the drive achieves almost 1.9 GB/s of throughput
and 482 IOPS. Very decent values.


However, when the disk reaches 500 GB used and we start to write a new
file, all of a sudden we get these values:

Jobs: 4 (f=4): [W(4)][0.9%][w=1033MiB/s][w=258 IOPS][eta 07h:55m:43s]
As you can see, we lost significant throughput and IOPS as well.

If we remove all the files and run fstrim on the filesystem, the performance
returns to normal again.


If we format the disk, there is no need to run fstrim; the performance
returns to normal as well. That explains why recreating Ceph from scratch
helped us.
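
For reference, the trim step above is just standard tooling; the nvme-cli
check at the end is an extra sanity check on my side rather than part of our
procedure, and the mount point and device name are placeholders:

```
MOUNTPOINT=/mnt/nvme-test   # placeholder: where the NVMe filesystem is mounted
NVME_DEV=/dev/nvme0n1       # placeholder: the raw NVMe device

# Discard all unused blocks on the filesystem and report how much was trimmed
fstrim -v "$MOUNTPOINT"

# Extra check: confirm the controller advertises the Dataset Management
# (deallocate/TRIM) command at all -- ONCS bit 2
nvme id-ctrl "$NVME_DEV" | grep -i oncs
```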


Have you seen this behaviour before in your deployments?

Thanks,

Zoltan

On 17.09.22 at 06:58, Mark Nelson wrote:


Hi Zoltan,


So kind of interesting results.  In the "good" write test the OSD 
doesn't actually seem to be working very hard.  If you look at the kv 
sync thread, it's mostly idle with only about 22% of the time in the 
thread spent doing real work:


   | + 99.90% BlueStore::_kv_sync_thread()
   | + 78.60% std::condition_variable::wait(std::unique_lock&)
   | |+ 78.60% pthread_cond_wait
   | + 18.00% RocksDBStore::submit_transaction_sync(std::shared_ptr)

...but at least it's actually doing work!  For reference though, on 
our high performing setup with enough concurrency we can push things 
hard enough where this thread isn't spending much time in 
pthread_cond_wait.  In the "bad" state, your example OSD here is 
basically doing nothing at all (100% of the time in 
pthread_cold_wait!).  The tp_osd_tp and the kv sync thread are just 
waiting around twiddling their thumbs:


   Thread 339848 (bstore_kv_sync) - 1000 samples
   + 100.00% clone
   + 100.00% start_thread
   + 100.00% BlueStore::KVSyncThread::entry()
   + 100.00% BlueStore::_kv_sync_thread()
   + 100.00% std::condition_variable::wait(std::unique_lock&)
   + 100.00% pthread_cond_wait


My first thought is that you might have one or more OSDs that are 
slowing the whole cluster down so that clients are backing up on it 
and other OSDs are just waiting around for IO.  It might be worth 
checking the perf admin socket stats on each OSD to see if you can 
narrow down if any of them are having issues.
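
For example, a rough way to compare write latencies across all OSDs and spot
an outlier; this is only a sketch and assumes `jq` is installed and that the
`op_w_latency` counter is present in this release's `perf dump` output:

```
# Print the average write latency reported by each OSD, slowest first
for osd in $(ceph osd ls); do
  printf 'osd.%s %s\n' "$osd" \
    "$(ceph tell osd.$osd perf dump 2>/dev/null | jq -r '.osd.op_w_latency.avgtime')"
done | sort -k2 -rn | head
```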



Thanks,

Mark


On 9/16/22 05:57, Zoltan Langi wrote:
Hey people and Mark, the cluster was left overnight doing nothing, and as
expected the problem came back in the morning. We managed to capture the bad
states on the exact same OSDs on which we captured the good states earlier:


Here is the output of a read test when the cluster is in a bad state 
on the same OSD which I recorded in the good state earlier:


https://pastebin.com/jp5JLWYK

Here is the output of a write test when the cluster is in a bad state 
on the same OSD which I recorded in the good state earlier:


The write speed came down from 30.1 GB/s to 17.9 GB/s.

https://pastebin.com/9e80L5XY

We are still open to any suggestions, so please feel free to comment. :)


Thanks a lot,
Zoltan

On 15.09.22 at 16:53, Zoltan Langi wrote:
Hey people and Mark, we managed to capture the good and bad states 
separately:


Here is the output of a read test when the cluster is in a bad state:

https://pastebin.com/0HdNapLQ

Here is the output of a write test when the cluster is in a bad state:

https://pastebin.com/2T2pKu6Q

Here is the output of a read test when the cluster is in a brand new 
reinstalled state:


https://pastebin.com/qsKeX0D8

Here is the output of a write test when the cluster is in a brand 
new reinstalled state:


https://pastebin.com/nTCuEUAb

Hope anyone can suggest anything, any ideas are welcome! :)

Zoltan

On 13.09.22 at 14:27, Zoltan Langi wrote:

Hey Mark,

Sorry about the silence for a while, but a lot of things came up.
We finally managed to fix up the profiler, and here is an output taken while
Ceph is under heavy write load, in a pretty bad state, with throughput not
exceeding 12.2 GB/s.


To capture a good state we would have to recreate the whole thing, so we
thought we would start with the bad state; maybe something obvious is already
visible to someone who knows the OSD internals well.


You find the file here: https://pastebin.com/0HdNapLQ

Zoltan

[ceph-users] Re: MDS crashes after evicting client session

2022-09-26 Thread Dhairya Parmar
The patch for this has already been merged and backported to Quincy as well.
It will be in the next Quincy release.

On Thu, Sep 22, 2022 at 5:12 PM E Taka <0eta...@gmail.com> wrote:

> Ceph 17.2.3 (dockerized in Ubuntu 20.04)
>
> The subject says it. The MDS process always crashes after evicting. ceph -w
> shows:
>
> 2022-09-22T13:26:23.305527+0200 mds.ksz-cephfs2.ceph00.kqjdwe [INF]
> Evicting (and blocklisting) client session 5181680 (
> 10.149.12.21:0/3369570791)
> 2022-09-22T13:26:35.729317+0200 mon.ceph00 [INF] daemon
> mds.ksz-cephfs2.ceph03.vsyrbk restarted
> 2022-09-22T13:26:36.039678+0200 mon.ceph00 [INF] daemon
> mds.ksz-cephfs2.ceph01.xybiqv restarted
> 2022-09-22T13:29:21.000392+0200 mds.ksz-cephfs2.ceph04.ekmqio [INF]
> Evicting (and blocklisting) client session 5249349 (
> 10.149.12.22:0/2459302619)
> 2022-09-22T13:29:32.069656+0200 mon.ceph00 [INF] daemon
> mds.ksz-cephfs2.ceph01.xybiqv restarted
> 2022-09-22T13:30:00.000101+0200 mon.ceph00 [INF] overall HEALTH_OK
> 2022-09-22T13:30:20.710271+0200 mon.ceph00 [WRN] Health check failed: 1
> daemons have recently crashed (RECENT_CRASH)
>
> The crash info of the crashed MDS is:
> # ceph crash info
> 2022-09-22T11:26:24.013274Z_b005f3fc-7704-4cfc-96c5-f2a9c993f166
> {
>"assert_condition": "!mds->is_any_replay()",
>"assert_file":
>
> "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc",
>
>"assert_func": "void MDLog::_submit_entry(LogEvent*,
> MDSLogContextBase*)",
>"assert_line": 283,
>"assert_msg":
>
> "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc:
> In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)'
> thread 7f76fa8f6700 time
>
> 2022-09-22T11:26:23.992050+\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc:
> 283: FAILED ceph_assert(!mds->is_any_replay())\n",
>"assert_thread_name": "ms_dispatch",
>"backtrace": [
>"/lib64/libpthread.so.0(+0x12ce0) [0x7f770231bce0]",
>"gsignal()",
>"abort()",
>"(ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x1b0) [0x7f770333bcd2]",
>"/usr/lib64/ceph/libceph-common.so.2(+0x283e95) [0x7f770333be95]",
>"(MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x3f)
> [0x55991905efdf]",
>"(Server::journal_close_session(Session*, int, Context*)+0x78c)
> [0x559918d7d63c]",
>"(Server::kill_session(Session*, Context*)+0x212) [0x559918d7dd92]",
>"(Server::apply_blocklist()+0x10d) [0x559918d7e04d]",
>"(MDSRank::apply_blocklist(std::set std::less, std::allocator > const&, unsigned
> int)+0x34) [0x559918d39d74]",
>"(MDSRankDispatcher::handle_osd_map()+0xf6) [0x559918d3a0b6]",
>"(MDSDaemon::handle_core_message(boost::intrusive_ptr
> const&)+0x39b) [0x559918d2330b]",
>"(MDSDaemon::ms_dispatch2(boost::intrusive_ptr
> const&)+0xc3) [0x559918d23cc3]",
>"(DispatchQueue::entry()+0x14fa) [0x7f77035c240a]",
>"(DispatchQueue::DispatchThread::entry()+0x11) [0x7f7703679481]",
>"/lib64/libpthread.so.0(+0x81ca) [0x7f77023111ca]",
>"clone()"
>],
>"ceph_version": "17.2.3",
>"crash_id":
> "2022-09-22T11:26:24.013274Z_b005f3fc-7704-4cfc-96c5-f2a9c993f166",
>"entity_name": "mds.ksz-cephfs2.ceph03.vsyrbk",
>"os_id": "centos",
>"os_name": "CentOS Stream",
>"os_version": "8",
>"os_version_id": "8",
>"process_name": "ceph-mds",
>"stack_sig":
> "b75e46941b5f6b7c05a037f9af5d42bb19d82ab7fc6a3c168533fc31a42b4de8",
>"timestamp": "2022-09-22T11:26:24.013274Z",
>"utsname_hostname": "ceph03",
>"utsname_machine": "x86_64",
>"utsname_release": "5.4.0-125-generic",
>"utsname_sysname": "Linux",
>"utsname_version": "#141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022"
> }
>
> (Don't be confused by the time information, "ceph -w" is UTC+2, "crash
> info" is UTC)
>
> Should I report this a bug or did I miss something which caused the error?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

-- 
*Dhairya Parmar*

He/Him/His

Associate Software Engineer, CephFS

Red Hat Inc. 

dpar...@redhat.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HA cluster

2022-09-26 Thread Dhairya Parmar
You should give this doc
https://docs.ceph.com/en/quincy/rados/configuration/mon-config-ref/#monitor-quorum
a read. It will help you understand and set up an HA cluster much better.
Long story short, you need at least 3 MONs to achieve HA because of the
monitor quorum.
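
For illustration only, on a cephadm-managed cluster that could look roughly
like this (the hostnames and the IP are placeholders):

```
# Add a third host to the cluster
ceph orch host add ceph-node3 192.168.1.13
# Run three MON daemons, pinned to three distinct hosts, so quorum survives
# the loss of any single host
ceph orch apply mon --placement="ceph-node1 ceph-node2 ceph-node3"
```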

On Sun, Sep 25, 2022 at 7:51 PM Murilo Morais  wrote:

> Hello guys.
>
> I have a question regarding HA.
>
> I set up two hosts with cephadm, created the pools and set up an NFS;
> everything worked so far. I turned off the second host and the first one
> continued to work without problems, but if I turn off the first, the second
> is totally unresponsive. What could be causing this?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

-- 
*Dhairya Parmar*

He/Him/His

Associate Software Engineer, CephFS

Red Hat Inc. 

dpar...@redhat.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HA cluster

2022-09-26 Thread Neeraj Pratap Singh
We need at least 3 hosts to achieve HA with shared storage.
If one node is turned off or fails, the storage stops.

On Mon, Sep 26, 2022 at 2:01 PM Neeraj Pratap Singh 
wrote:

> We need at least 3 hosts to achieve HA with shared storage.
> If one node is turned off or fails, the storage stops.
>
> On Sun, Sep 25, 2022 at 7:51 PM Murilo Morais 
> wrote:
>
>> Hello guys.
>>
>> I have a question regarding HA.
>>
>> I set up two hosts with cephadm, created the pools and set up an NFS;
>> everything worked so far. I turned off the second host and the first one
>> continued to work without problems, but if I turn off the first, the
>> second is totally unresponsive. What could be causing this?
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>
> --
> *Neeraj Pratap Singh*
> He/Him/His
> Associate Software Engineer, CephFS
> neesi...@redhat.com
>
> Red Hat Inc.
> 
>
>
>
>

-- 
*Neeraj Pratap Singh*
He/Him/His
Associate Software Engineer, CephFS
neesi...@redhat.com

Red Hat Inc.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HA cluster

2022-09-26 Thread Robert Sander

On 25.09.22 at 19:20, Murilo Morais wrote:


I set up two hosts with cephadm,


You cannot have HA with only two hosts.

You need at least three separate hosts for three MONs to keep your 
cluster running.


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph configuration for rgw

2022-09-26 Thread Eugen Block

Just adding this:

ses7-host1:~ # ceph config set client.rgw.ebl-rgw rgw_frontends "beast port=8080"


This change is visible in the config get output:

client.rgw.ebl-rgw    basic    rgw_frontends    beast port=8080


Zitat von Eugen Block :


Hi,

the docs [1] show how to specify the rgw configuration via a yaml
file (similar to OSDs).
If you applied it with ceph orch you should see your changes in the  
'ceph config dump' output, or like this:


---snip---
ses7-host1:~ # ceph orch ls | grep rgw
rgw.ebl-rgw  ?:80  2/2  33s ago  3M  ses7-host3;ses7-host4

ses7-host1:~ # ceph config get client.rgw.ebl-rgw
WHO                 MASK  LEVEL     OPTION           VALUE                                            RO
global                    basic     container_image  registry.fqdn:5000/ses/7.1/ceph/ceph@sha256:...  *
client.rgw.ebl-rgw        basic     rgw_frontends    beast port=80                                    *
client.rgw.ebl-rgw        advanced  rgw_realm        ebl-rgw                                          *
client.rgw.ebl-rgw        advanced  rgw_zone         ebl-zone
---snip---

As you can see, the RGWs are clients, so you need to take that into account
when you request the current configuration. What I find strange is that it
apparently only shows the config that was initially applied; it doesn't show
the changes after running 'ceph orch apply -i rgw.yaml', although the changes
are applied to the containers after restarting them. I don't know if this is
intended, but it sounds like a bug to me (I haven't checked).
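
For reference, a minimal sketch of such a spec applied in a single step,
reusing the realm/zone/port values from the example above (the field names
should match the cephadm RGW service spec described in [1]; adjust the hosts
as needed):

```
cat > rgw.yaml <<'EOF'
service_type: rgw
service_id: ebl-rgw
placement:
  hosts:
    - ses7-host3
    - ses7-host4
spec:
  rgw_realm: ebl-rgw
  rgw_zone: ebl-zone
  rgw_frontend_port: 8080
EOF
ceph orch apply -i rgw.yaml
```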


1) When starting rgw with cephadm ("orch apply -i "), I have to start the daemon,
   then update the configuration file and restart. I don't find a way
   to achieve this in a single step.


I haven't played around too much yet, but you seem to be right: changes to
the config aren't applied immediately, but only after a service restart
('ceph orch restart rgw.ebl-rgw'). Maybe that's on purpose, so you can change
your config now and apply it later when a service interruption is not
critical.



[1] https://docs.ceph.com/en/pacific/cephadm/services/rgw/

Zitat von Tony Liu :


Hi,

The cluster is Pacific 16.2.10 with containerized services, managed by
cephadm.


"config show" shows running configuration. Who is supported?
mon, mgr and osd all work, but rgw doesn't. Is this expected?
I tried with client. and  
without "client",

neither works.

When issue "config show", who connects the daemon and retrieves  
running config?

Is it mgr or mon?

Config update by "config set" will be populated to the service.  
Which services are
supported by this? I know mon, mgr and osd work, but rgw doesn't.  
Is this expected?
I assume this is similar to "config show", this support needs the  
capability of mgr/mon

to connect to service daemon?

To get the running config from rgw, I always do
"docker exec  ceph daemon  config show".
Is that the only way? I assume it's the same for getting the running config
from all services. Is it just a matter of whether it is supported by the
mgr/mon or not?

I've been configuring rgw via the configuration file. Is that the
recommended way? I tried the configuration DB, i.e. "config set", but it
doesn't seem to work. Is this expected?

I see two cons with the configuration file for rgw.
1) When starting rgw with cephadm ("orch apply -i "), I have to start the daemon,
   then update the configuration file and restart. I don't find a way
   to achieve this in a single step.
2) When I "orch daemon redeploy" or upgrade rgw, the configuration file will
   be re-generated and I have to update it again.
Is this all how it's supposed to work, or am I missing anything?


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io