[ceph-users] cephadm does not recreate OSD

2024-06-25 Thread Luis Domingues
Hello all.

After a disk was changed, we see that cephadm does not recreate the OSD. Digging all 
the way back to the pvs command, I ended up on this issue: 
https://tracker.ceph.com/issues/62862 and this PR: 
https://github.com/ceph/ceph/pull/53500. Unfortunately, the PR is closed.

Is this a non-bug? I tried to replicate this pvs issue on various OSes, and it 
looks like this is the expected behavior, at least on the LVM side, when an LV was 
deleted in the middle of the disk.

Example with a simple VG with 5 LVs, where lv3 was deleted:
```
root@debian:~# pvs --readonly -o pv_name,vg_name,lv_name
  PV        VG    LV
  /dev/vdb  newvg lv1
  /dev/vdb  newvg lv2
  /dev/vdb  newvg
  /dev/vdb  newvg lv4
  /dev/vdb  newvg lv5
  /dev/vdb  newvg
```

This output can seem weird, but it becomes clearer if we expand the output labels:
```
root@debian:~# pvs --segments -o+lv_name,seg_start_pe,segtype
  PV        VG    Fmt  Attr PSize   PFree   Start SSize LV  Start Type
  /dev/vdb  newvg lvm2 a--  <60.00g <20.00g     0  2560 lv1     0 linear
  /dev/vdb  newvg lvm2 a--  <60.00g <20.00g  2560  2560 lv2     0 linear
  /dev/vdb  newvg lvm2 a--  <60.00g <20.00g  5120  2560         0 free
  /dev/vdb  newvg lvm2 a--  <60.00g <20.00g  7680  2560 lv4     0 linear
  /dev/vdb  newvg lvm2 a--  <60.00g <20.00g 10240  2560 lv5     0 linear
  /dev/vdb  newvg lvm2 a--  <60.00g <20.00g 12800  2559         0 free
```

Anyway, this seems to be what causes the duplicated entries.
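For anyone who wants to reproduce it, the layout above can be recreated with something 
like this (the device name and LV sizes are arbitrary, picked to fit the 60G test disk):
```
# Build a VG with 5 LVs on a scratch disk, then delete the middle one.
pvcreate /dev/vdb
vgcreate newvg /dev/vdb
for i in 1 2 3 4 5; do
    lvcreate -L 10G -n "lv$i" newvg
done
lvremove -y newvg/lv3                        # leaves a free segment in the middle of the PV
pvs --readonly -o pv_name,vg_name,lv_name    # shows the duplicated /dev/vdb rows
```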

Will someone have a look into this issue? Or should we look into a workaround?

Thanks,
Luis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mclock and massive reads

2024-03-28 Thread Luis Domingues





Luis Domingues
Proton AG


On Thursday, 28 March 2024 at 10:10, Sridhar Seshasayee  
wrote:

> Hi Luis,
> 
> > So our question: is mClock taking into account the reads as well as the
> > writes? Or are the reads calculated to be less expensive than the writes?
> 
> mClock treats both reads and writes equally. When you say "massive reads",
> do you mean a predominantly
> read workload? Also, the size of the reads is also factored in to arrive at
> the cost of the operation. In general,
> the cost of an I/O operation in mClock is proportional to its size. The
> higher the cost, the longer the operation
> stays in the queue. That being said, the implementation of mClock on
> pacific is experimental at best. I would
> recommend upgrading to either quincy or reef considering the significant
> improvements that were made both
> in terms of scheduling and usability.
> 
> -Sridhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

When I say massive reads, I mean when we are draining a disk or a node. Outside of 
that particular use case, everything works quite well.

We plan to upgrade in the near future, so we will see.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] mclock and massive reads

2024-03-26 Thread Luis Domingues
Hello,

We have a question about mClock scheduling reads on pacific (16.2.14 currently).

When we do massive reads, for example from machines we want to drain that contain a 
lot of data on EC pools, we quite frequently observe slow ops on the source 
OSDs. Those slow ops affect the client services, which talk directly to RADOS. If we 
kill the OSD that causes the slow ops, the recovery stays at more or less the same 
speed, but there are no more slow ops.

And when we tweak mClock, limiting the source OSDs has no effect that we can 
observe. However, if we limit the target OSDs, the global speed slows down and the 
slow ops disappear.
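
For context, the kind of per-OSD limiting we played with looks roughly like this; it is 
only a sketch, the OSD id and values are placeholders, and the scheduler knobs only take 
effect with the custom mClock profile:
```
# Switch one target OSD to the custom mClock profile (osd.42 is an example).
ceph config set osd.42 osd_mclock_profile custom
# Cap background recovery IOPS on that OSD (the value is purely illustrative).
ceph config set osd.42 osd_mclock_scheduler_background_recovery_lim 100
```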

So our question: is mClock taking into account the reads as well as the writes? 
Or are the reads calculated to be less expensive than the writes?

Thanks,

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_mclock_max_capacity_iops_hdd in Reef

2024-01-08 Thread Luis Domingues
Hi Sridhar. Thanks for your reply:


> > We are testing migrations from a cluster running Pacific to Reef. In
> > Pacific we needed to tweak osd_mclock_max_capacity_iops_hdd to have decent
> > performance of our cluster.
>
> It would be helpful to know the procedure you are employing for the
> migration.

For now we run some benchmarks on a fairly small dev/test cluster. It has been 
deployed using cephadm and upgraded with cephadm from Pacific to Reef.

What we observed is that with Pacific, by tweaking 
osd_mclock_max_capacity_iops_hdd, we can go from around 200 MB/s of writes up 
to 600 MB/s of writes with the balanced profile.
But with Reef, changing osd_mclock_max_capacity_iops_hdd does not change the 
cluster's performance much. (Or if it does, the difference is small enough that I 
did not see it.)

That being said, the out-of-the-box performance of Reef is what we expect of 
our cluster (around 600 MB/s), while with Pacific we needed to tweak 
osd_mclock_max_capacity_iops_hdd manually to get the expected performance. So there is 
definitely a big improvement there.

What made me think that this option was maybe not used anymore is that during a 
Pacific deployment, each OSD pushes its own osd_mclock_max_capacity_iops_hdd, 
but a Reef deployment does not. We did not see any values for the OSDs in the ceph 
config db.
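
For reference, this is roughly how we checked (osd.0 is just an example id):
```
# Look for per-OSD capacity values pushed into the config db.
ceph config dump | grep osd_mclock_max_capacity_iops
# Show the value a given OSD is actually running with.
ceph config show osd.0 osd_mclock_max_capacity_iops_hdd
```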

In conclusion, we could say, at least from our pre-upgrade tests, that mClock 
seems to behave a lot better in Reef than in Pacific.

Luis Domingues
Proton AG


On Monday, 8 January 2024 at 12:29, Sridhar Seshasayee  
wrote:


> Hi Luis,
> 
> > We are testing migrations from a cluster running Pacific to Reef. In
> > Pacific we needed to tweak osd_mclock_max_capacity_iops_hdd to have decent
> > performance of our cluster.
> 
> 
> It would be helpful to know the procedure you are employing for the
> migration.
> 
> > But in Reef it looks like changing the value of
> > osd_mclock_max_capacity_iops_hdd does not impact cluster performance. Did
> > osd_mclock_max_capacity_iops_hdd become useless?
> 
> 
> "osd_mclock_max_capacity_iops_hdd" is still valid in Reef as long as it
> accurately represents the capability of the underlying OSD device for the
> intended workload.
> 
> Between Pacific and Reef many improvements to the mClock feature have been
> made. An important change relates to the automatic determination of cost
> per I/O which is now tied to the sequential and random IOPS capability of
> the underlying device of an OSD. As long as
> "osd_mclock_max_capacity_iops_hdd" and
> "osd_mclock_max_sequential_bandwidth_hdd" represent a fairly accurate
> capability of the backing OSD device, the performance should be along
> expected lines. Changing the "osd_mclock_max_capacity_iops_hdd" to a value
> that is beyond the capability of the device will obviously not yield any
> improvement.
> 
> If the above parameters are representative of the capability of the backing
> OSD device and you still see lower than expected performance, then it could
> be some other issue that needs looking into.
> -Sridhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] osd_mclock_max_capacity_iops_hdd in Reef

2024-01-08 Thread Luis Domingues
Hi all,

We are testing migrations from a cluster running Pacific to Reef. In Pacific we 
needed to tweak osd_mclock_max_capacity_iops_hdd to have decent performance of 
our cluster.

But in Reef it looks like changing the value of 
osd_mclock_max_capacity_iops_hdd does not impact cluster performance. Did 
osd_mclock_max_capacity_iops_hdd become useless?

I did not find anything about it in the changelogs, but I could have missed 
something.

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm bootstrap on 3 network clusters

2024-01-03 Thread Luis Domingues

> Why? The public network should not have any restrictions between the
> Ceph nodes. Same with the cluster network.

Internal policies and network rules.

Luis Domingues
Proton AG


On Wednesday, 3 January 2024 at 16:15, Robert Sander 
 wrote:


> Hi Luis,
> 
> On 1/3/24 16:12, Luis Domingues wrote:
> 
> > My issue is that mon1 cannot connect via SSH to itself using the pub network, 
> > and the bootstrap fails at the end when cephadm tries to add mon1 to the list of 
> > hosts.
> 
> 
> Why? The public network should not have any restrictions between the
> Ceph nodes. Same with the cluster network.
> 
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
> 
> https://www.heinlein-support.de
> 
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
> 
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm bootstrap on 3 network clusters

2024-01-03 Thread Luis Domingues
Hi Robert,

Thanks for your reply.

I am bootstrapping from the first node that will become the first monitor, 
let's call it mon1. I get 1 monitor and 1 manager deployed.

My issue is that mon1 cannot connect via SSH to itself using the pub network, and 
the bootstrap fails at the end when cephadm tries to add mon1 to the list of hosts.

When I afterwards apply a spec with my list of hosts and the IPs where 
cephadm can reach them, it works fine. But that means that I also need to create the 
client-keyring rule for the _admin label manually.
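
For reference, the manual steps after bootstrap look roughly like this (the hostname 
and mgmt IP are placeholders):
```
# Re-add the host with the address cephadm can actually reach, plus the _admin label.
ceph orch host add mon1 172.Z.Z.11 _admin
# Recreate the client-keyring rule that bootstrap would normally set up.
ceph orch client-keyring set client.admin label:_admin
```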


Luis Domingues
Proton AG


On Wednesday, 3 January 2024 at 16:00, Robert Sander 
 wrote:


> Hi,
> 
> On 1/3/24 14:51, Luis Domingues wrote:
> 
> > But when I bootstrap my cluster, I set my MON IP and CLUSTER NETWORK, and 
> > then the bootstrap process tries to add my bootstrap node using the MON IP.
> 
> 
> IMHO the bootstrap process has to run directly on the first node.
> The MON IP is local to this node. It is used to determine the public
> network.
> 
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
> 
> https://www.heinlein-support.de
> 
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
> 
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm bootstrap on 3 network clusters

2024-01-03 Thread Luis Domingues
Hi,

I am bootstrapping a ceph cluster using cephadm, and our cluster uses 3 
networks.
We have

- 1 network as public network (10.X.X.0/24) (pub)
- 1 network as cluster network (10.X.Y.0/24) (cluster)
- 1 network for management (172.Z.Z.0/24) (mgmt)

The nodes are reachable via SSH only on the mgmt network. However, they are 
reachable by our services on the pub network. I want my MONs to be bound to this 
pub network.

But when I bootstrap my cluster, I set my MON IP and CLUSTER NETWORK, and the 
bootstrap process then tries to add my bootstrap node using the MON IP. It then 
fails because it cannot reach the node. If I apply a proper spec afterwards it works 
fine, but the bootstrap process did not finish properly.
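
For clarity, the bootstrap call is roughly the following (the addresses are 
placeholders following the networks above):
```
# mon1's IP on the pub network; SSH, however, only works on the mgmt network.
cephadm bootstrap \
    --mon-ip 10.X.X.11 \
    --cluster-network 10.X.Y.0/24
```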

Is there an option to tell cephadm not to use the MON IP but another one for 
accessing the node during bootstrap? If I pass --skip-prepare-host, it 
tries to connect to it anyway, and then fails.

Thanks,
Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm user on cephadm rpm package

2023-11-17 Thread Luis Domingues
So I guess I need to install the cephadm rpm package on all my machines then?

I like the idea of not using the root user, and in fact we do that on our 
clusters. But since we need to push SSH keys to the user's config, we manage 
users outside of Ceph, during OS provisioning. 
So it looks a little bit redundant for the cephadm rpm package to create that user, 
when we still need to figure out how to enable cephadm's access to the machines.
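
For what it's worth, wiring up a non-root user for the orchestrator looks roughly 
like this on our side (the host name is a placeholder):
```
# Tell the orchestrator to SSH as the cephadm user instead of root.
ceph cephadm set-user cephadm
# Distribute the cluster's SSH public key to that user on each host.
ceph cephadm get-pub-key > /tmp/ceph.pub
ssh-copy-id -f -i /tmp/ceph.pub cephadm@node01
```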

Anyway, thanks for your reply.

Luis Domingues
Proton AG


On Friday, 17 November 2023 at 13:55, David C.  wrote:


> Hi,
> 
> You can use the cephadm account (instead of root) to control machines with
> the orchestrator.
> 
> 
> Le ven. 17 nov. 2023 à 13:30, Luis Domingues luis.doming...@proton.ch a
> 
> écrit :
> 
> > Hi,
> > 
> > I noticed when installing the cephadm rpm package, to bootstrap a cluster
> > for example, that a user cephadm was created. But I do not see it used
> > anywhere.
> > 
> > What is the purpose of creating a user on the machine we install the local
> > binary of cephadm?
> > 
> > Luis Domingues
> > Proton AG
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm user on cephadm rpm package

2023-11-17 Thread Luis Domingues
Hi,

I noticed that when installing the cephadm rpm package, to bootstrap a cluster for 
example, a user cephadm was created. But I do not see it used anywhere.

What is the purpose of creating a user on the machine where we install the local 
cephadm binary?

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephadm specs application order

2023-09-27 Thread Luis Domingues
Hi,

We are playing a little bit with OSD specs on a test cluster, and we ended up 
having nodes that match more than one OSD spec (currently 4 or 5).

And there is something we have not figured out yet: is there any order in which cephadm 
applies the specs? Are the specs sorted in any way inside cephadm?

We understand that for a specific spec, cephadm will try to match nodes by 
host, label and then host_pattern. Our question is more at the spec level, and about the 
order in which cephadm will "loop" over the specs.
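
To illustrate, the kind of overlap we have looks roughly like this; the service ids, 
labels and filters below are made up for the example:
```
# Two OSD specs that can both match the same node.
ceph orch apply -i - <<EOF
service_type: osd
service_id: spec-hdd-by-label
placement:
  label: osds
spec:
  data_devices:
    rotational: 1
---
service_type: osd
service_id: spec-hdd-by-host
placement:
  host_pattern: 'node0*'
spec:
  data_devices:
    size: '10TB:'
EOF
```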

I hope I was clear enough.

Thanks,

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm logs

2023-07-30 Thread Luis Domingues
Hi,

We are interested in having cephadm log to journald, so I created the ticket: 
https://tracker.ceph.com/issues/62233

Thanks

Luis Domingues
Proton AG


--- Original Message ---
On Saturday, July 29th, 2023 at 20:55, John Mulligan 
 wrote:


> On Friday, July 28, 2023 11:51:06 AM EDT Adam King wrote:
> 
> > Not currently. Those logs aren't generated by any daemons, they come
> > directly from anything done by the cephadm binary on the host, which tends
> > to be quite a bit since the cephadm mgr module runs most of its operations
> > on the host through a copy of the cephadm binary. It doesn't log to journal
> > because it doesn't have a systemd unit or anything, it's just a python
> > script being run directly and nothing has been implemented to make it
> > possible for that to log to journald.
> 
> 
> 
> For what it's worth, there's no requirement that a process be executed
> directly by a specific systemd unit to have it log to the journal. These days
> I'm pretty sure that anything that tries to use the local syslog goes to the
> journal. Here's a quick example:
> 
> I create foo.py with the following:
> 
>     import logging
>     import logging.handlers
>     import sys
> 
>     handler = logging.handlers.SysLogHandler('/dev/log')
>     handler.ident = 'notcephadm: '
>     h2 = logging.StreamHandler(stream=sys.stderr)
>     logging.basicConfig(
>         level=logging.DEBUG,
>         handlers=[handler, h2],
>         format="(%(levelname)s): %(message)s",
>     )
>     log = logging.getLogger(__name__)
>     log.debug("debug me")
>     log.error("oops, an error was here")
>     log.info("some helpful information goes here")
> 
> I ran the above and now I can run:
> 
>     $ journalctl --no-pager -t notcephadm
>     Jul 29 14:35:31 edfu notcephadm[105868]: (DEBUG): debug me
>     Jul 29 14:35:31 edfu notcephadm[105868]: (ERROR): oops, an error was here
>     Jul 29 14:35:31 edfu notcephadm[105868]: (INFO): some helpful information goes here
> 
> Just getting logs into the journal does not even require one of the libraries
> specific to the systemd journal. Personally, I find centralized logging with 
> the
> syslog/journal more appealing than logging to a file. But they both have their
> advantages and disadvantages.
> 
> Luis, I'd suggest that you should file a ceph tracker issue [1] if having
> cephadm log this way is a use case you would be interested in. We could also
> discuss the topic further in a ceph orchestration weekly meeting.
> 
> 
> [1]: https://tracker.ceph.com/projects/orchestrator/issues/new
> 
> > On Fri, Jul 28, 2023 at 9:43 AM Luis Domingues luis.doming...@proton.ch
> > wrote:
> > 
> > > Hi,
> > > 
> > > Quick question about cephadm and its logs. On my cluster I have every
> > > logs
> > > that goes to journald. But on each machine, I still have
> > > /var/log/ceph/cephadm.log that is alive.
> > > 
> > > Is there a way to make cephadm log to journald instead of a file? If yes,
> > > did I miss it in the documentation? Or if not, is there any reason to log
> > > into a file while everything else logs to journald?
> > > 
> > > Thanks
> > > 
> > > Luis Domingues
> > > Proton AG
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > 
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm logs

2023-07-28 Thread Luis Domingues
Hi,

Quick question about cephadm and its logs. On my cluster, every log goes to 
journald. But on each machine, I still have /var/log/ceph/cephadm.log 
that is alive.

Is there a way to make cephadm log to journald instead of a file? If yes, did I 
miss it in the documentation? Or if not, is there any reason to log into a file 
while everything else logs to journald?

Thanks

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm and kernel memory usage

2023-07-26 Thread Luis Domingues
That's the weird thing: processes and user-space memory are the same on the good 
and the bad machines. ceph-osd memory usage looks good on all machines, and cache 
is more or less the same. When I do a ps, htop or any other process review, 
everything looks good and coherent between all machines, containerized or not.

The only difference I can see, using smem, is the non-cache kernel memory on 
the containerized machines.

Maybe it's a podman issue, maybe a kernel one. It does not seem related to Ceph 
directly. I just asked here to see if anyone has hit the same issue.

Anyway, thanks for your time.

Luis Domingues
Proton AG


--- Original Message ---
On Wednesday, July 26th, 2023 at 09:01, Konstantin Shalygin  
wrote:


> Without determining what exactly process (kernel or userspace) "eat" memory, 
> the ceph-users can't tell what exactly use memory, because don't see your 
> display with your eyes 
>
> You should run this commands on good & bad hosts to see the real difference. 
> This may be related to kernel version, or Ceph options in container config or 
> ...
>
>
> k
> Sent from my iPhone
>
> > On 26 Jul 2023, at 07:26, Luis Domingues luis.doming...@proton.ch wrote:
> >
> > First, thank you for taking time to reply to me.
> >
> > However, my question was not on user-space memory neither on cache usage, 
> > as I can see on my machines everything sums up quite nicely.
> >
> > My question is: with packages, the non-cache kernel memory is around 2G to 
> > 3G, while with Podman usage, it is more around 10G, and it can go up to 
> > 40G-50G. Do anyone knows if this is expected and why this is the case?
> >
> > Maybe this is a podman related question and ceph-dev is not the best place 
> > to ask this kind of question, but maybe someone using cephadm saw similar 
> > behavior.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm and kernel memory usage

2023-07-25 Thread Luis Domingues
Hi,

First, thank you for taking time to reply to me.

However, my question was not about user-space memory nor about cache usage; as far as I 
can see, on my machines everything sums up quite nicely.

My question is: with packages, the non-cache kernel memory is around 2G to 3G, 
while with Podman it is more around 10G, and it can go up to 40G-50G. 
Does anyone know if this is expected and why this is the case?

Maybe this is a podman-related question and ceph-dev is not the best place to 
ask this kind of question, but maybe someone using cephadm has seen similar behavior.

Luis Domingues
Proton AG


--- Original Message ---
On Tuesday, July 25th, 2023 at 11:42, Konstantin Shalygin  
wrote:


> Good,
> 
> > On 24 Jul 2023, at 20:01, Luis Domingues luis.doming...@proton.ch wrote:
> > 
> > Of course:
> > 
> > free -h
> > total used free shared buff/cache available
> > Mem: 125Gi 96Gi 9.8Gi 4.0Gi 19Gi 7.6Gi
> > Swap: 0B 0B 0B
> 
> 
> As we can see, actually you have ~30GiB free (9.8GiB is not used & 19GiB is a 
> page cache)
> With this command you can determine what process actually use memory & how 
> much
> 
> `ps -eo size,pid,user,command | \\ awk '{ hr=$1/1024 ; printf("%13.6f Mb 
> ",hr) } { for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }' | \\ 
> sort -n`
> 
> k
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm and kernel memory usage

2023-07-24 Thread Luis Domingues


Of course:

free -h
  totalusedfree  shared  buff/cache   available
Mem:  125Gi96Gi   9.8Gi   4.0Gi19Gi   7.6Gi
Swap:0B  0B  0B


Luis Domingues
Proton AG


--- Original Message ---
On Monday, July 24th, 2023 at 16:42, Konstantin Shalygin  wrote:


> Hi,
> 
> Can you paste `free -h` output for this hosts?
> 
> 
> k
> Sent from my iPhone
> 
> > On 24 Jul 2023, at 14:42, Luis Domingues luis.doming...@proton.ch wrote:
> > 
> > Hi,
> > 
> > So after, looking into OSDs memory usage, which seem to be fine, on a 
> > v16.2.13 running with cephadm, on el8, it seems that the kernel is using a 
> > lot of memory.
> > 
> > # smem -t -w -k
> > Area Used Cache Noncache
> > firmware/hardware 0 0 0
> > kernel image 0 0 0
> > kernel dynamic memory 65.0G 18.6G 46.4G
> > userspace memory 50.1G 260.5M 49.9G
> > free memory 9.9G 9.9G 0
> > -- 125.0G 28.8G 96.3G
> > 
> > Comparing with a similar other cluster, same OS, same ceph version, but 
> > running packages instead if containers, and machines have a little bit more 
> > memory:
> > 
> > # smem -t -w -k
> > Area Used Cache Noncache
> > firmware/hardware 0 0 0
> > kernel image 0 0 0
> > kernel dynamic memory 52.8G 50.5G 2.4G
> > userspace memory 123.9G 198.5M 123.7G
> > free memory 10.6G 10.6G 0
> > ------ 187.3G 61.3G 126.0G
> > 
> > Does anyone have an idea why when using containers with podman the kernel 
> > needs a lot more memory?
> > 
> > Luis Domingues
> > Proton AG
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm and kernel memory usage

2023-07-24 Thread Luis Domingues
Hi,

So, after looking into OSD memory usage, which seems to be fine, on a v16.2.13 
cluster running with cephadm on el8, it seems that the kernel is using a lot of memory.

# smem -t -w -k
Area                       Used      Cache   Noncache
firmware/hardware             0          0          0
kernel image                  0          0          0
kernel dynamic memory     65.0G      18.6G      46.4G
userspace memory          50.1G     260.5M      49.9G
free memory                9.9G       9.9G          0
------------------------------------------------------
                         125.0G      28.8G      96.3G

Comparing with another, similar cluster, same OS, same Ceph version, but running 
packages instead of containers, and whose machines have a little bit more memory:

# smem -t -w -k
Area                       Used      Cache   Noncache
firmware/hardware             0          0          0
kernel image                  0          0          0
kernel dynamic memory     52.8G      50.5G       2.4G
userspace memory         123.9G     198.5M     123.7G
free memory               10.6G      10.6G          0
------------------------------------------------------
                         187.3G      61.3G     126.0G

Does anyone have an idea why the kernel needs a lot more memory when using 
containers with Podman?
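
In case it helps others compare, the kernel side can be broken down a bit further with 
standard tools (nothing Ceph- or podman-specific):
```
# Slab and vmalloc usage, which usually dominate "kernel dynamic memory".
grep -E 'Slab|SReclaimable|SUnreclaim|VmallocUsed' /proc/meminfo
# Top slab caches, sorted by cache size.
slabtop -o -s c | head -n 20
```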

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm does not redeploy OSD

2023-07-20 Thread Luis Domingues
Here you have.

So, in the log, when cephadm gets the inventory:

Found inventory for host [Device(path=/
Device(path=/dev/nvme2n1, lvs=[{'cluster_fsid': 
'11b47c57-5e7f-44c0-8b19-ddd801a89435', 'cluster_name': 'ceph', 'db_uuid': 
'irQUVH-txAO-fh3p-tkEj-ZoAH-p7lI-HcHOJp', 'name': 
'osd-db-75f820d1-1597-4894-88d5-e1f21e0425a6', 'osd_fsid': 
'1abbad8e-9053-4335-8673-7f1c7832b7b0', 'osd_id': '35', 'osdspec_affinity': 
'spec-a', 'type': 'db'}, 

And then when I have the list of disks, with different filters and telling if 
the disk is taken or not:

[DBG] : /dev/sde. is already used in spec spec-a, skipping it.

But I could confirm that, globally, the NVMe for the DB was taken into account:
'/dev/sdc', '/dev/sdd', '/dev/sdf', '/dev/sdg', '/dev/sdh', '/dev/sdj', 
'/dev/sdj', '/dev/sdk', '/dev/sdl'], wal_devices=[], db 
devices=['/dev/nvme2n1'], journal devices=[]

And then I saw cephadm apply the spec on all nodes except this one:

skipping apply of node05 on DriveGroupSpec.from_json(yaml.safe_load('''
---
service_type: osd
service_id: spec-b
placement:
  label: osds
spec:
  data_devices:
    rotational: 1
    encrypted: true
  db_devices:
    size: '1TB:2TB'
  db_slots: 1
''')) (no change)


And now I can see both my disks in spec-b when cephadm checks the disk inventory:

[DBG] : /dev/sde is already used in spec spec-b, skipping it.
...
[DBG] : /dev/sdk is already used in spec spec-b, skipping it.


As I said in my previous e-mail, I am not sure this was the reason, as I 
did not find any clear message saying the db_device was ignored, and I have 
not tried to replicate this behavior yet.
So yes, I fixed my issue, but I am not sure whether it was just luck or not.

Luis Domingues
Proton AG


--- Original Message ---
On Wednesday, July 19th, 2023 at 22:04, Adam King  wrote:


> > When looking at the very verbose cephadm logs, it seemed that cephadm was
> > just skipping my node, with a message saying that a node was already part
> > of another spec.
>
>
> If you have it, would you mind sharing what this message was? I'm still not
> totally sure what happened here.
>
> On Wed, Jul 19, 2023 at 10:15 AM Luis Domingues luis.doming...@proton.ch
>
> wrote:
>
> > So good news, I was not hit by the bug you mention on this thread.
> >
> > What happened, (apparently, I did not tried to replicated it yet) is that
> > I had another OSD (let call it OSD.1) using the db device, but that was
> > part of an old spec. (let call it spec-a). And the OSD (OSD.2) I removed
> > should be detected as part of spec-b. The difference between them was just
> > the name and the placement, using labels instead of hostname.
> >
> > When looking at the very verbose cephadm logs, it seemed that cephadm was
> > just skipping my node, with a message saying that a node was already part
> > of another spec.
> >
> > I purged OSD.1 with --replace and --zap, and once disks where empty and
> > ready to go, cephamd just added back OSD.1 and OSD.2 using the db_device as
> > specified.
> >
> > I do not know if this is the intended behavior, or if I was just lucky,
> > but all my OSDs are back to the cluster.
> >
> > Luis Domingues
> > Proton AG
> >
> > --- Original Message ---
> > On Tuesday, July 18th, 2023 at 18:32, Luis Domingues <
> > luis.doming...@proton.ch> wrote:
> >
> > > That part looks quite good:
> > >
> > > "available": false,
> > > "ceph_device": true,
> > > "created": "2023-07-18T16:01:16.715487Z",
> > > "device_id": "SAMSUNG MZPLJ1T6HBJR-7_S55JNG0R600354",
> > > "human_readable_type": "ssd",
> > > "lsm_data": {},
> > > "lvs": [
> > > {
> > > "cluster_fsid": "11b47c57-5e7f-44c0-8b19-ddd801a89435",
> > > "cluster_name": "ceph",
> > > "db_uuid": "CUMgp7-Uscn-ASLo-bh14-7Sxe-80GE-EcywDb",
> > > "name": "osd-block-db-5cb8edda-30f9-539f-b4c5-dbe420927911",
> > > "osd_fsid": "089894cf-1782-4a3a-8ac0-9dd043f80c71",
> > > "osd_id": "7",
> > > "osdspec_affinity": "",
> > > "type": "db"
> > > },
> > > {
> > >
> > > I forgot to mention that the cluster was initially deployed with
> > > ceph-ansible and adopted by cephadm.
> > >
> > > Luis Domingues
> > > Proton AG
> > >
> > > --- Original Message ---
> > > On Tuesday, July 18th, 2023 at 18:

[ceph-users] Re: cephadm does not redeploy OSD

2023-07-19 Thread Luis Domingues
So, good news: I was not hit by the bug you mention in this thread.

What happened (apparently; I have not tried to replicate it yet) is that I had 
another OSD (let's call it OSD.1) using the db device, but that one was part of an 
old spec (let's call it spec-a). And the OSD I removed (OSD.2) should have been 
detected as part of spec-b. The difference between the two specs was just the name and 
the placement, using labels instead of hostnames.

When looking at the very verbose cephadm logs, it seemed that cephadm was just 
skipping my node, with a message saying that the node was already part of another 
spec.

I purged OSD.1 with --replace and --zap, and once the disks were empty and ready 
to go, cephadm just added back OSD.1 and OSD.2 using the db_device as specified.

I do not know if this is the intended behavior, or if I was just lucky, but all 
my OSDs are back in the cluster.

Luis Domingues
Proton AG


--- Original Message ---
On Tuesday, July 18th, 2023 at 18:32, Luis Domingues  
wrote:


> That part looks quite good:
> 
> "available": false,
> "ceph_device": true,
> "created": "2023-07-18T16:01:16.715487Z",
> "device_id": "SAMSUNG MZPLJ1T6HBJR-7_S55JNG0R600354",
> "human_readable_type": "ssd",
> "lsm_data": {},
> "lvs": [
> {
> "cluster_fsid": "11b47c57-5e7f-44c0-8b19-ddd801a89435",
> "cluster_name": "ceph",
> "db_uuid": "CUMgp7-Uscn-ASLo-bh14-7Sxe-80GE-EcywDb",
> "name": "osd-block-db-5cb8edda-30f9-539f-b4c5-dbe420927911",
> "osd_fsid": "089894cf-1782-4a3a-8ac0-9dd043f80c71",
> "osd_id": "7",
> "osdspec_affinity": "",
> "type": "db"
> },
> {
> 
> I forgot to mention that the cluster was initially deployed with ceph-ansible 
> and adopted by cephadm.
> 
> Luis Domingues
> Proton AG
> 
> 
> 
> 
> --- Original Message ---
> On Tuesday, July 18th, 2023 at 18:15, Adam King adk...@redhat.com wrote:
> 
> 
> 
> > in the "ceph orch device ls --format json-pretty" output, in the blob for
> > that specific device, is the "ceph_device" field set? There was a bug where
> > it wouldn't be set at all (https://tracker.ceph.com/issues/57100) and it
> > would make it so you couldn't use a device serving as a db device for any
> > further OSDs, unless the device was fully cleaned out (so it is no longer
> > serving as a db device). The "ceph_device" field is meant to be our way of
> > knowing "yes there are LVM partitions here, but they're our partitions for
> > ceph stuff, so we can still use the device" and without it (or with it just
> > being broken, as in the tracker) redeploying OSDs that used the device for
> > its DB wasn't working as we don't know if those LVs imply its our device or
> > has LVs for some other purpose. I had thought this was fixed already in
> > 16.2.13 but it sounds too similar to what you're seeing not to consider it.
> > 
> > On Tue, Jul 18, 2023 at 10:53 AM Luis Domingues luis.doming...@proton.ch
> > 
> > wrote:
> > 
> > > Hi,
> > > 
> > > We are running a ceph cluster managed with cephadm v16.2.13. Recently we
> > > needed to change a disk, and we replaced it with:
> > > 
> > > ceph orch osd rm 37 --replace.
> > > 
> > > It worked fine, the disk was drained and the OSD marked as destroy.
> > > 
> > > However, after changing the disk, no OSD was created. Looking to the db
> > > device, the partition for db for OSD 37 was still there. So we destroyed 
> > > it
> > > using:
> > > ceph-volume lvm zap --osd-id=37 --destroy.
> > > 
> > > But we still have no OSD redeployed.
> > > Here we have our spec:
> > > 
> > > ---
> > > service_type: osd
> > > service_id: osd-hdd
> > > placement:
> > >   label: osds
> > > spec:
> > >   data_devices:
> > >     rotational: 1
> > >     encrypted: true
> > >   db_devices:
> > >     size: '1TB:2TB'
> > >   db_slots: 12
> > > 
> > > And the disk looks good:
> > > 
> > > HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS
> > > node05 /dev/nvme2n1 ssd SAMSUNG MZPLJ1T6HBJR-7_S55JNG0R600357 1600G
> > > 12m ago LVM detected, locked
> > > 
> > > node05 /dev/sdk hdd SEAGATE_ST1NM0206_ZA21G217C7240KPF 10.0T Yes
> > > 12m ago
> > > 
> > > And VG on db_device looks to have enough space:
> >

[ceph-users] Re: cephadm does not redeploy OSD

2023-07-18 Thread Luis Domingues
That part looks quite good:

"available": false,
"ceph_device": true,
"created": "2023-07-18T16:01:16.715487Z",
"device_id": "SAMSUNG MZPLJ1T6HBJR-7_S55JNG0R600354",
"human_readable_type": "ssd",
"lsm_data": {},
"lvs": [
  {
"cluster_fsid": "11b47c57-5e7f-44c0-8b19-ddd801a89435",
"cluster_name": "ceph",
"db_uuid": "CUMgp7-Uscn-ASLo-bh14-7Sxe-80GE-EcywDb",
"name": "osd-block-db-5cb8edda-30f9-539f-b4c5-dbe420927911",
"osd_fsid": "089894cf-1782-4a3a-8ac0-9dd043f80c71",
"osd_id": "7",
"osdspec_affinity": "",
"type": "db"
  },
  {

I forgot to mention that the cluster was initially deployed with ceph-ansible 
and adopted by cephadm.
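
For reference, a quick way to pull out just that field across devices; the jq filter 
assumes the json-pretty layout shown above:
```
ceph orch device ls --format json-pretty \
  | jq '.[].devices[] | {path, available, ceph_device}'
```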

Luis Domingues
Proton AG


--- Original Message ---
On Tuesday, July 18th, 2023 at 18:15, Adam King  wrote:


> in the "ceph orch device ls --format json-pretty" output, in the blob for
> that specific device, is the "ceph_device" field set? There was a bug where
> it wouldn't be set at all (https://tracker.ceph.com/issues/57100) and it
> would make it so you couldn't use a device serving as a db device for any
> further OSDs, unless the device was fully cleaned out (so it is no longer
> serving as a db device). The "ceph_device" field is meant to be our way of
> knowing "yes there are LVM partitions here, but they're our partitions for
> ceph stuff, so we can still use the device" and without it (or with it just
> being broken, as in the tracker) redeploying OSDs that used the device for
> its DB wasn't working as we don't know if those LVs imply its our device or
> has LVs for some other purpose. I had thought this was fixed already in
> 16.2.13 but it sounds too similar to what you're seeing not to consider it.
> 
> On Tue, Jul 18, 2023 at 10:53 AM Luis Domingues luis.doming...@proton.ch
> 
> wrote:
> 
> > Hi,
> > 
> > We are running a ceph cluster managed with cephadm v16.2.13. Recently we
> > needed to change a disk, and we replaced it with:
> > 
> > ceph orch osd rm 37 --replace.
> > 
> > It worked fine, the disk was drained and the OSD marked as destroy.
> > 
> > However, after changing the disk, no OSD was created. Looking to the db
> > device, the partition for db for OSD 37 was still there. So we destroyed it
> > using:
> > ceph-volume lvm zap --osd-id=37 --destroy.
> > 
> > But we still have no OSD redeployed.
> > Here we have our spec:
> > 
> > ---
> > service_type: osd
> > service_id: osd-hdd
> > placement:
> >   label: osds
> > spec:
> >   data_devices:
> >     rotational: 1
> >     encrypted: true
> >   db_devices:
> >     size: '1TB:2TB'
> >   db_slots: 12
> > 
> > And the disk looks good:
> > 
> > HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS
> > node05 /dev/nvme2n1 ssd SAMSUNG MZPLJ1T6HBJR-7_S55JNG0R600357 1600G
> > 12m ago LVM detected, locked
> > 
> > node05 /dev/sdk hdd SEAGATE_ST1NM0206_ZA21G217C7240KPF 10.0T Yes
> > 12m ago
> > 
> > And VG on db_device looks to have enough space:
> > ceph-33b06f1a-f6f6-57cf-9ca8-6e4aa81caae0 1 11 0 wz--n- <1.46t 173.91g
> > 
> > If I remove the db_devices and db_slots from the specs, and do a dry run,
> > the orchestrator seems to see the new disk as available:
> > 
> > ceph orch apply -i osd_specs.yml --dry-run
> > WARNING! Dry-Runs are snapshots of a certain point in time and are bound
> > to the current inventory setup. If any of these conditions change, the
> > preview will be invalid. Please make sure to have a minimal
> > timeframe between planning and applying the specs.
> > 
> > SERVICESPEC PREVIEWS
> > 
> > +-+--++-+
> > |SERVICE |NAME |ADD_TO |REMOVE_FROM |
> > +-+--++-+
> > +-+--++-+
> > 
> > OSDSPEC PREVIEWS
> > 
> > +-+-+-+--++-+
> > |SERVICE |NAME |HOST |DATA |DB |WAL |
> > +-+-+-+--++-+
> > |osd |osd-hdd |node05 |/dev/sdk |- |- |
> > +-+-+-+--++

[ceph-users] cephadm does not redeploy OSD

2023-07-18 Thread Luis Domingues
Hi,

We are running a ceph cluster managed with cephadm v16.2.13. Recently we needed 
to change a disk, and we replaced it with:

ceph orch osd rm 37 --replace.

It worked fine, the disk was drained and the OSD marked as destroy.

However, after changing the disk, no OSD was created. Looking at the db device, 
the DB partition for OSD 37 was still there. So we destroyed it using:
ceph-volume lvm zap --osd-id=37 --destroy.

But we still have no OSD redeployed.
Here we have our spec:

---
service_type: osd
service_id: osd-hdd
placement:
  label: osds
spec:
  data_devices:
    rotational: 1
    encrypted: true
  db_devices:
    size: '1TB:2TB'
  db_slots: 12

And the disk looks good:

HOST    PATH          TYPE  DEVICE ID                              SIZE   AVAILABLE  REFRESHED  REJECT REASONS
node05  /dev/nvme2n1  ssd   SAMSUNG MZPLJ1T6HBJR-7_S55JNG0R600357  1600G             12m ago    LVM detected, locked
node05  /dev/sdk      hdd   SEAGATE_ST1NM0206_ZA21G217C7240KPF     10.0T  Yes        12m ago

And VG on db_device looks to have enough space:
ceph-33b06f1a-f6f6-57cf-9ca8-6e4aa81caae0 1 11 0 wz--n- <1.46t 173.91g

If I remove the db_devices and db_slots from the specs, and do a dry run, the 
orchestrator seems to see the new disk as available:

ceph orch apply -i osd_specs.yml --dry-run
WARNING! Dry-Runs are snapshots of a certain point in time and are bound
to the current inventory setup. If any of these conditions change, the
preview will be invalid. Please make sure to have a minimal
timeframe between planning and applying the specs.

SERVICESPEC PREVIEWS

+-+--++-+
|SERVICE |NAME |ADD_TO |REMOVE_FROM |
+-+--++-+
+-+--++-+

OSDSPEC PREVIEWS

+-+-+-+--++-+
|SERVICE |NAME |HOST |DATA |DB |WAL |
+-+-+-+--++-+
|osd |osd-hdd |node05 |/dev/sdk |- |- |
+-+-+-+--++-+

But as soon as I add db_devices back, the orchestrator is happy as it is, like 
there is nothing to do:

ceph orch apply -i osd_specs.yml --dry-run
WARNING! Dry-Runs are snapshots of a certain point in time and are bound
to the current inventory setup. If any of these conditions change, the
preview will be invalid. Please make sure to have a minimal
timeframe between planning and applying the specs.

SERVICESPEC PREVIEWS

+-+--++-+
|SERVICE |NAME |ADD_TO |REMOVE_FROM |
+-+--++-+
+-+--++-+

OSDSPEC PREVIEWS

+-+--+--+--++-+
|SERVICE |NAME |HOST |DATA |DB |WAL |
+-+--+--+--++-+

I do not know why Ceph will not use this disk, and I do not know where to look; the 
logs do not seem to say anything. And the weirdest thing: another disk was replaced 
on the same machine, and that went without any issues.

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory usage after cephadm adoption

2023-07-17 Thread Luis Domingues
It looks indeed like that is the bug I hit.

Thanks.
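
For the record, the per-OSD workaround mentioned below looks something like this (the 
OSD id and the value, taken from what the autotuner computed for the host, are examples):
```
ceph config set osd.0 osd_memory_target 7219293672
```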

Luis Domingues
Proton AG


--- Original Message ---
On Monday, July 17th, 2023 at 07:45, Sridhar Seshasayee  
wrote:


> Hello Luis,
> 
> Please see my response below:
> 
> But when I took a look on the memory usage of my OSDs, I was below of that
> 
> > value, by quite a bite. Looking at the OSDs themselves, I have:
> > 
> > "bluestore-pricache": {
> > "target_bytes": 4294967296,
> > "mapped_bytes": 1343455232,
> > "unmapped_bytes": 16973824,
> > "heap_bytes": 1360429056,
> > "cache_bytes": 2845415832
> > },
> > 
> > And if I get the running config:
> > "osd_memory_target": "4294967296",
> > "osd_memory_target_autotune": "true",
> > "osd_memory_target_cgroup_limit_ratio": "0.80",
> > 
> > Which is not the value I observe from the config. I have 4294967296
> > instead of something around 7219293672. Did I miss something?
> 
> This is very likely due to https://tracker.ceph.com/issues/48750. The fix
> was recently merged into
> the main branch and should be backported soon all the way to pacific.
> 
> Until then, the workaround would be to set the osd_memory_target on each
> OSD individually to
> the desired value.
> 
> -Sridhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory usage after cephadm adoption

2023-07-16 Thread Luis Domingues
Hi,

Thanks for your hints. I tried to play a little bit with the configs, and now I 
want to make 0.7 the default value.

So I configured ceph:

  mgr  advanced  mgr/cephadm/autotune_memory_target_ratio  0.70  *
  osd  advanced  osd_memory_target_autotune                 true

And I ended up having these configs:

  osd  host:st10-cbosd-001  basic  osd_memory_target  7219293672
  osd  host:st10-cbosd-002  basic  osd_memory_target  7219293672
  osd  host:st10-cbosd-004  basic  osd_memory_target  7219293672
  osd  host:st10-cbosd-005  basic  osd_memory_target  7219293451
  osd  host:st10-cbosd-006  basic  osd_memory_target  7219293451
  osd  host:st11-cbosd-007  basic  osd_memory_target  7216821484
  osd  host:st11-cbosd-008  basic  osd_memory_target  7216825454

And running a ceph orch ps gives me:

osd.0    st11-cbosd-007.plabs.ch  running (2d)   10m ago  10d  25.8G  6882M  16.2.13  327f301eff51  29a075f2f925
osd.1    st10-cbosd-001.plabs.ch  running (19m)   8m ago  10d  2115M  6884M  16.2.13  327f301eff51  df5067bde5ce
osd.10   st10-cbosd-005.plabs.ch  running (2d)   10m ago  10d  5524M  6884M  16.2.13  327f301eff51  f7bc0641ee46
osd.100  st11-cbosd-008.plabs.ch  running (2d)   10m ago  10d  5234M  6882M  16.2.13  327f301eff51  74efa243b953
osd.101  st11-cbosd-008.plabs.ch  running (2d)   10m ago  10d  4741M  6882M  16.2.13  327f301eff51  209671007c65
osd.102  st11-cbosd-008.plabs.ch  running (2d)   10m ago  10d  5174M  6882M  16.2.13  327f301eff51  63691d557732

So far so good.

But when I took a look at the memory usage of my OSDs, I was below that value by 
quite a bit. Looking at the OSDs themselves, I have:

"bluestore-pricache": {
"target_bytes": 4294967296,
"mapped_bytes": 1343455232,
"unmapped_bytes": 16973824,
"heap_bytes": 1360429056,
"cache_bytes": 2845415832
},

And if I get the running config:
"osd_memory_target": "4294967296",
"osd_memory_target_autotune": "true",
"osd_memory_target_cgroup_limit_ratio": "0.80",

Which is not the value I observe in the config: I have 4294967296 instead of 
something around 7219293672. Did I miss something?

Luis Domingues
Proton AG


--- Original Message ---
On Tuesday, July 11th, 2023 at 18:10, Mark Nelson  wrote:


> On 7/11/23 09:44, Luis Domingues wrote:
> 
> > "bluestore-pricache": {
> > "target_bytes": 6713193267,
> > "mapped_bytes": 6718742528,
> > "unmapped_bytes": 467025920,
> > "heap_bytes": 7185768448,
> > "cache_bytes": 4161537138
> > },
> 
> 
> Hi Luis,
> 
> 
> Looks like the mapped bytes for this OSD process is very close to (just
> a little over) the target bytes that has been set when you did the perf
> dump. There is some unmapped memory that can be reclaimed by the kernel,
> but we can't force the kernel to reclaim it. It could be that the
> kernel is being a little lazy if there isn't memory pressure.
> 
> The way the memory autotuning works in Ceph is that periodically the
> prioritycache system will look at the mapped memory usage of the
> process, then gro

[ceph-users] Re: OSD memory usage after cephadm adoption

2023-07-11 Thread Luis Domingues
7883,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 0,
"take_sum": 0,
"put": 225099,
"put_sum": 1227078527883,
"wait": {
"avgcount": 2039,
"sum": 65.096143721,
"avgtime": 0.031925524
}
},
"throttle-bluestore_throttle_deferred_bytes": {
"val": 0,
"max": 201326592,
"get_started": 0,
"get": 47230,
"get_sum": 32111982926,
"get_or_fail_fail": 0,
"get_or_fail_success": 47230,
"take": 0,
"take_sum": 0,
"put": 44540,
"put_sum": 32111982926,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-msgr_dispatch_throttler-client": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 292633,
"get_sum": 39290356304,
    "get_or_fail_fail": 0,
"get_or_fail_success": 292633,
"take": 0,
"take_sum": 0,
"put": 292633,
"put_sum": 39290356304,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-msgr_dispatch_throttler-cluster": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 7182670,
"get_sum": 60512426404,
"get_or_fail_fail": 0,
"get_or_fail_success": 7182670,
"take": 0,
"take_sum": 0,
"put": 7182670,
"put_sum": 60512426404,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-msgr_dispatch_throttler-hb_back_client": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 7382217,
"get_sum": 15008047161,
"get_or_fail_fail": 0,
"get_or_fail_success": 7382217,
"take": 0,
"take_sum": 0,
"put": 7382217,
    "put_sum": 15008047161,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-msgr_dispatch_throttler-hb_back_server": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 6979126,
"get_sum": 14188562814,
"get_or_fail_fail": 0,
"get_or_fail_success": 6979126,
"take": 0,
"take_sum": 0,
"put": 6979126,
"put_sum": 14188562814,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-msgr_dispatch_throttler-hb_front_client": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 7382217,
"get_sum": 15008047161,
"get_or_fail_fail": 0,
"get_or_fail_success": 7382217,
"take": 0,
"take_sum": 0,
"put": 7382217,
"put_sum": 15008047161,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-msgr_dispatch_throttler-hb_front_server": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 6979126,
"get_sum": 14188562814,
"get_or_fail_fail": 0,
"get_or_fail_success": 6979126,
"take": 0,
"take_sum": 0,
"put": 6979126,
"put_sum": 14188562814,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-msgr_dispatch_throttler-ms_objecter": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 0,
"take_sum": 0,
"put": 0,
"put_sum": 0,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-objecter_bytes": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 0,
"take_sum": 0,
"put": 0,
"put_sum": 0,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-objecter_ops": {
"val": 0,
"max": 1024,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 0,
"take_sum": 0,
"put": 0,
"put_sum": 0,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-osd_client_bytes": {
"val": 47,
"max": 524288000,
"get_started": 0,
"get": 291042,
"get_sum": 39289027913,
"get_or_fail_fail": 0,
"get_or_fail_success": 291042,
"take": 0,
"take_sum": 0,
"put": 1164143,
"put_sum": 39289027866,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-osd_client_messages": {
"val": 1,
"max": 256,
"get_started": 0,
"get": 291042,
"get_sum": 291042,
"get_or_fail_fail": 0,
"get_or_fail_success": 291042,
"take": 0,
"take_sum": 0,
"put": 291041,
"put_sum": 291041,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
}
}

and dump_mempools:

{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 155779,
"bytes": 12462320
},
"bluestore_cache_data": {
"items": 228476,
"bytes": 233092536
},
"bluestore_cache_onode": {
"items": 265318,
"bytes": 163435888
},
"bluestore_cache_meta": {
"items": 83890049,
"bytes": 455300708
},
"bluestore_cache_other": {
"items": 11355469,
"bytes": 91930988
},
"bluestore_Buffer": {
"items": 5325,
"bytes": 511200
},
"bluestore_Extent": {
"items": 842524,
"bytes": 40441152
},
"bluestore_Blob": {
"items": 842524,
"bytes": 94362688
},
"bluestore_SharedBlob": {
"items": 842524,
"bytes": 94362688
},
"bluestore_inline_bl": {
"items": 8842,
"bytes": 1142714
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 0,
"bytes": 0
},
"bluestore_writing_deferred": {
"items": 77,
"bytes": 1798719
},
"bluestore_writing": {
"items": 0,
"bytes": 0
},
"bluefs": {
"items": 1443,
"bytes": 36800
},
"bluefs_file_reader": {
"items": 96,
"bytes": 2104832
},
"bluefs_file_writer": {
"items": 3,
"bytes": 576
},
"buffer_anon": {
"items": 25898,
"bytes": 10464534
},
"buffer_meta": {
"items": 28808,
"bytes": 2535104
},
"osd": {
"items": 81,
"bytes": 916272
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 363769,
"bytes": 164806408
},
"osdmap": {
"items": 53839,
"bytes": 2182392
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 98910844,
"bytes": 1371888519
}
}
}

Luis Domingues
Proton AG


--- Original Message ---
On Tuesday, July 11th, 2023 at 14:59, Mark Nelson  wrote:


> Hi Luis,
> 
> 
> Can you do a "ceph tell osd.<id> perf dump" and "ceph daemon osd.<id>
> dump_mempools"? Those should help us understand how much memory is
> being used by different parts of the OSD/bluestore and how much memory
> the priority cache thinks it has to work with.
> 
> 
> Mark
> 
> On 7/11/23 4:57 AM, Luis Domingues wrote:
> 
> > Hi everyone,
> > 
> > We recently migrate a cluster from ceph-ansible to cephadm. Everything went 
> > as expected.
> > But now we have some alerts on high memory usage. Cluster is running ceph 
> > 16.2.13.
> > 
> > Of course, after adoption OSDs ended up in the  zone:
> > 
> > NAME PORTS RUNNING REFRESHED AGE PLACEMENT
> > osd 88 7m ago - 
> > 
> > But the weirdest thing I observed, is that the OSDs seem to use more memory 
> > that the mem limit:
> > 
> > NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID 
> > CONTAINER ID
> > osd.0  running (5d) 2m ago 5d 19.7G 6400M 16.2.13 327f301eff51 
> > ca07fe74a0fa
> > osd.1  running (5d) 2m ago 5d 7068M 6400M 16.2.13 327f301eff51 
> > 6223ed8e34e9
> > osd.10  running (5d) 10m ago 5d 7235M 6400M 16.2.13 327f301eff51 
> > 073ddc0d7391 osd.100  running (5d) 2m ago 5d 7118M 6400M 16.2.13 
> > 327f301eff51 b7f9238c0c24
> > 
> > Does anybody knows why OSDs would use more memory than the limit?
> > 
> > Thanks
> > 
> > Luis Domingues
> > Proton AG
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> --
> Best Regards,
> Mark Nelson
> Head of R&D (USA)
> 
> Clyso GmbH
> p: +49 89 21552391 12
> a: Loristraße 8 | 80335 München | Germany
> w: https://clyso.com | e: mark.nel...@clyso.com
> 
> We are hiring: https://www.clyso.com/jobs/
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD memory usage after cephadm adoption

2023-07-11 Thread Luis Domingues
Hi everyone,

We recently migrated a cluster from ceph-ansible to cephadm. Everything went as 
expected.
But now we have some alerts about high memory usage. The cluster is running Ceph 
16.2.13.

Of course, after adoption the OSDs ended up in the <unmanaged> zone:

NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd          88       7m ago     -    <unmanaged>

But the weirdest thing I observed is that the OSDs seem to use more memory than 
the mem limit:

NAME     HOST  PORTS  STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
osd.0                 running (5d)  2m ago     5d   19.7G    6400M    16.2.13  327f301eff51  ca07fe74a0fa
osd.1                 running (5d)  2m ago     5d   7068M    6400M    16.2.13  327f301eff51  6223ed8e34e9
osd.10                running (5d)  10m ago    5d   7235M    6400M    16.2.13  327f301eff51  073ddc0d7391
osd.100               running (5d)  2m ago     5d   7118M    6400M    16.2.13  327f301eff51  b7f9238c0c24

Does anybody know why OSDs would use more memory than the limit?

Thanks

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy osd bench in order to define osd_mclock_max_capacity_iops_[hdd|ssd]

2023-06-30 Thread Luis Domingues
Hi Rafael.

We faced the exact same issue, and we did a bunch of tests and asked ourselves the same questions.

We started with some fio runs, but the results were quite meh once in production. Ceph 
bench did not seem very reliable.

What we ended up doing, and what seems to hold up quite nicely, is the following. It's 
probably not the best, but it works for us.

Step 1: We put the disks we want to test in a small dev/test cluster we have. 
That way we can mess with the configs and use the not-so-production-friendly way of 
configuring the cluster, and we can make sure no activity other than our 
test is running.

Step 2: Create a few pools, that have this particularities:
- Only 1 PG
- No replication at all. The only PG of the pool lives in only 1 disk.
- Upmap the PG to the OSD you want to test.
- Repeat for a few disks.

Step 2 requires enabling some Ceph options, as it will not allow you to do this 
by default. I do not remember the exact options, but you can find them in the 
documentation.
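
A rough sketch of what step 2 can look like (pool name, PG id and OSD ids are examples; 
mon_allow_pool_size_one is, as far as I recall, the option needed to allow a size-1 pool):
```
# Allow pools with a single replica, then create a 1-PG, size-1 pool.
ceph config set global mon_allow_pool_size_one true
ceph osd pool create bench-osd12 1 1
ceph osd pool set bench-osd12 pg_autoscale_mode off
ceph osd pool set bench-osd12 size 1 --yes-i-really-mean-it
# Pin the pool's only PG to the OSD under test
# (here: move PG 12.0 from osd.3 to osd.12; ids depend on your cluster).
ceph osd pg-upmap-items 12.0 3 12
```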

Step 3: Set mClock to high_client_ops, so mClock will virtually not limit 
client ops.

Step 4: Run a few ceph rados bench on the different disks.
rados bench <seconds> write -t <threads> -b <block size> -p <pool>

<seconds>: we used 300 seconds, to be able to perform various tests without 
taking a week, while still letting the cluster write a bunch of stuff.
<threads>: we used 100; it worked well, and previous tests for other purposes 
showed 100 was a good fit on our installation.
<pool>: the pool with the disk you want to test.
<block size>: we used 128k and 4M. Feel free to experiment with other values, 
but in our use case that is what we used.

Somewhere in the output of the bench, after it finishes, you will find the average IOPS. 
This average is more or less what the disk is capable of handling. We then configure 
some number close to that one. If we have two types of disks that are close to 
each other, we take the smaller value for all disks and set it as a global 
configuration in Ceph instead of going disk by disk.
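
For reference, that global setting is then something like this (the value is an example only):
```
ceph config set osd osd_mclock_max_capacity_iops_hdd 600
```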

It's probably not perfect, and it looks very much like something we tinkered together, 
but it's the best testing solution we have found so far. And most importantly, results 
between runs were a lot more consistent than with ceph bench or fio.

Hope this will help you.

Luis Domingues
Proton AG


--- Original Message ---
On Friday, June 30th, 2023 at 12:15, Rafael Diaz Maurin 
 wrote:


> Hello,
> 
> I've just upgraded a Pacific cluster into Quincy, and all my osd have
> the low value osd_mclock_max_capacity_iops_hdd : 315.00.
> 
> The manual does not explain how to benchmark the OSD with fio or ceph
> bench with good options.
> Can someone have the good ceph bench options or fio options in order to
> configure osd_mclock_max_capacity_iops_hdd for each osd ?
> 
> I ran this bench various times on the same OSD (class hdd) and I obtain
> different results.
> ceph tell ${osd} cache drop
> ceph tell ${osd} bench 12288000 4096 4194304 100
> 
> example :
> osd.21 (hdd): osd_mclock_max_capacity_iops_hdd = 315.00
> bench 1 : 3006.2271379745534
> bench 2 : 819.503206458996
> bench 3 : 946.5406320134085
> 
> How can I succeed in getting the good values for the
> osd_mclock_max_capacity_iops_[hdd|ssd] options ?
> 
> Thank you for your help,
> 
> Rafael
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Keepalived configuration with cephadm

2023-06-12 Thread Luis Domingues
Hi all,

We are running a test Ceph cluster with cephadm, currently on the latest Pacific 
(16.2.13).
We use cephadm to deploy keepalived:2.1.5 and HAProxy:2.3.
We have 3 VIPs, 1 for each instance of HAProxy.

However, we do not use the same network for managing the cluster and for the public 
traffic.
We have a management network to connect to the machines and for cephadm to do 
the deployments, and a prod network where the connections to HAProxy are made.

Our spec file looks like:
---
service_type: ingress
service_id: rgw.rgw
placement:
  label: rgws
spec:
  backend_service: rgw.rgw
  virtual_ips_list:
  - 10.X.X.10/24
  - 10.X.X.2/24
  - 10.X.X.3/24
  frontend_port: 443
  monitor_port: 1967

Our issue is that cephadm populates `unicast_src_ip` and `unicast_peer` 
using the IPs from the mgmt network and not the ones from the prod network. 
A quick look at the code suggests it is designed that way.

Doing so, the keepalived instances will not talk to each other, 
because VRRP traffic is only allowed on our prod network.
I quickly tested removing `unicast_src_ip` and `unicast_peer`, and the keepalived 
instances were able to talk to each other.
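
To illustrate, the generated vrrp_instance looks roughly like this (addresses are invented for the example; the point is that the unicast addresses cephadm picks are on the mgmt network, while the VIP lives on the prod interface):

```
vrrp_instance VI_0 {
    state MASTER
    interface eth1                 # prod interface carrying the VIP
    virtual_router_id 50
    priority 100
    unicast_src_ip 192.0.2.11      # mgmt-network address chosen by cephadm
    unicast_peer {
        192.0.2.12                 # mgmt-network addresses of the other hosts
        192.0.2.13
    }
    virtual_ipaddress {
        10.X.X.10/24
    }
}
```

With the unicast_src_ip/unicast_peer lines removed, keepalived falls back to multicast VRRP on the interface, which is why the instances start seeing each other.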

My question: did I miss something in the configuration? Or should we add some 
kind of option to generate keepalived's config without `unicast_src_ip` and 
`unicast_peer`?

Thanks,

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How mClock profile calculation works, and IOPS

2023-04-03 Thread Luis Domingues
Hi,

Thanks a lot for the information.

I have one last question. Why is the bench performed using writes of 4 KiB? Is 
there any reason to choose that over another value?

In my lab, I tested with various values, and I have mainly two types of disks: 
some Seagate and some Toshiba.

If I do the bench with 4 KiB, what I get from the Seagate is a result around 2000 IOPS, 
while the Toshiba is more around 600.

If I do the bench with 128 KiB, I still have results around 2000 IOPS for the Seagate, 
but the Toshiba also benches around 2000 IOPS. And from the rados experiments I did, 
setting osd_mclock_max_capacity_iops_hdd to 2000 on that lab setup is the 
value that gives me the most performance, both with the Seagate 
and Toshiba disks.
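
For reference, the comparison was done with invocations roughly along these lines, only changing the bytes-per-write argument (total bytes / bytes per write / object size / object count):

```
ceph tell osd.0 cache drop
ceph tell osd.0 bench 12288000 4096 4194304 100      # 4 KiB writes
ceph tell osd.0 cache drop
ceph tell osd.0 bench 12288000 131072 4194304 100    # 128 KiB writes
```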

Luis Domingues
Proton AG


--- Original Message ---
On Monday, April 3rd, 2023 at 08:44, Sridhar Seshasayee  
wrote:


> Why was it done that way? I do not understand the reason why distributing
> 
> > the IOPS accross different disks, when the measurement we have is for one
> > disk alone. This means with default parameters we will always be far from
> > reaching OSD limit right?
> > 
> > It's not on different disks. We distribute the IOPS across shards on a
> 
> given OSD/disk. This is an internal implementation detail.
> This means in your case, 450 IOPS is distributed across 5 shards on the
> same OSD/disk. You can think of it as 5 threads
> being allocated a share of the total IOPS on a given OSD.
> -Sridhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How mClock profile calculation works, and IOPS

2023-04-03 Thread Luis Domingues
Hi Sridhar

Thanks for the information.

> 
> The above values are a result of distributing the IOPS across all the OSD
> shards as defined by the
> osd_op_num_shards_[hdd|ssd] option. For HDDs, this is set to 5 and
> therefore the IOPS will be
> distributed across the 5 shards (i.e. for e.g., 675/5 for
> osd_mclock_scheduler_background_recovery_lim
> and so on for other reservation and limit options).

Why was it done that way? I do not understand the reason for distributing the 
IOPS across different disks, when the measurement we have is for one disk 
alone. This means that with the default parameters we will always be far from reaching 
the OSD limit, right?
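
For what it's worth, here is the back-of-the-envelope math that reproduces the values I posted, assuming the balanced-profile fractions from the Quincy docs (client/recovery reservation 40%, client limit 100%, recovery limit 150%, best-effort reservation 20%) and osd_op_num_shards_hdd = 5:

```
# sketch of the per-shard allocation for osd_mclock_max_capacity_iops_hdd = 450
IOPS=450; SHARDS=5
echo "client_res      = $(( IOPS * 40  / 100 / SHARDS ))"   # 36
echo "client_lim      = $(( IOPS * 100 / 100 / SHARDS ))"   # 90
echo "recovery_res    = $(( IOPS * 40  / 100 / SHARDS ))"   # 36
echo "recovery_lim    = $(( IOPS * 150 / 100 / SHARDS ))"   # 135
echo "best_effort_res = $(( IOPS * 20  / 100 / SHARDS ))"   # 18
```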

Luis Domingues
Proton AG


--- Original Message ---
On Monday, April 3rd, 2023 at 07:43, Sridhar Seshasayee  
wrote:


> Hi Luis,
> 
> 
> I am reading reading some documentation about mClock and have two questions.
> 
> > First, about the IOPS. Are those IOPS disk IOPS or other kind of IOPS? And
> > what the assumption of those? (Like block size, sequential or random
> > reads/writes)?
> 
> 
> This is the result of running OSD bench random writes at 4 KiB
> block size.
> 
> > But what I get is:
> > 
> > "osd_mclock_scheduler_background_best_effort_lim": "99",
> > "osd_mclock_scheduler_background_best_effort_res": "18",
> > "osd_mclock_scheduler_background_best_effort_wgt": "2",
> > "osd_mclock_scheduler_background_recovery_lim": "135",
> > "osd_mclock_scheduler_background_recovery_res": "36",
> > "osd_mclock_scheduler_background_recovery_wgt": "1",
> > "osd_mclock_scheduler_client_lim": "90",
> > "osd_mclock_scheduler_client_res": "36",
> > "osd_mclock_scheduler_client_wgt": "1",
> > 
> > Which seems very low according to what my disk seems to be able to handle.
> > 
> > Is this calculation the expected one? Or did I miss something on how those
> > profiles are populated?
> 
> 
> The above values are a result of distributing the IOPS across all the OSD
> shards as defined by the
> osd_op_num_shards_[hdd|ssd] option. For HDDs, this is set to 5 and
> therefore the IOPS will be
> distributed across the 5 shards (i.e. for e.g., 675/5 for
> osd_mclock_scheduler_background_recovery_lim
> and so on for other reservation and limit options).
> 
> -Sridhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How mClock profile calculation works, and IOPS

2023-03-31 Thread Luis Domingues
Hi,

I am reading some documentation about mClock and have two questions.

First, about the IOPS: are those disk IOPS or some other kind of IOPS? And what 
are the assumptions behind them (like block size, sequential or random reads/writes)?

And the second question,

How does mClock calculate its profiles? I have my lab cluster running Quincy, and I 
have these parameters for mClock:

"osd_mclock_max_capacity_iops_hdd": "450.00",
"osd_mclock_profile": "balanced",

According to the documentation: 
https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/#balanced 
I am expecting to have:
"osd_mclock_scheduler_background_best_effort_lim": "99",
"osd_mclock_scheduler_background_best_effort_res": "90",
"osd_mclock_scheduler_background_best_effort_wgt": "2",
"osd_mclock_scheduler_background_recovery_lim": "675",
"osd_mclock_scheduler_background_recovery_res": "180",
"osd_mclock_scheduler_background_recovery_wgt": "1",
"osd_mclock_scheduler_client_lim": "450",
"osd_mclock_scheduler_client_res": "180", "osd_mclock_scheduler_client_wgt": 
"1",

But what I get is:

"osd_mclock_scheduler_background_best_effort_lim": "99",
"osd_mclock_scheduler_background_best_effort_res": "18",
"osd_mclock_scheduler_background_best_effort_wgt": "2",
"osd_mclock_scheduler_background_recovery_lim": "135",
"osd_mclock_scheduler_background_recovery_res": "36",
"osd_mclock_scheduler_background_recovery_wgt": "1",
"osd_mclock_scheduler_client_lim": "90",
"osd_mclock_scheduler_client_res": "36",
"osd_mclock_scheduler_client_wgt": "1",

Which seems very low compared to what my disk seems to be able to handle.

Is this calculation the expected one? Or did I miss something on how those 
profiles are populated?

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: how ceph OSD bench works?

2023-03-31 Thread Luis Domingues
> > OSD bench performs IOs at the objectstore level and the stats are
> > reported
> > based on the response from those transactions. It performs either
> > sequential
> > or random IOs (i.e. a random offset into an object) based on the
> > arguments
> > passed to it. IIRC if number of objects and object size is provided, a
> > random
> > offset into an object is written.
> > 
> > Therefore, depending on the parameters passed, sequential or random
> > offset
> > is determined and this obviously would result in different measurements.
> 
> 
> Do you know if it tries to do this at times when the osd is not being
> actively used, or waits until there is no activity? I have been testing a bit
> recently and noticed that some ssds of the same type are reporting
> significantly different values, like 117 and 90


I am doing this test on a lab cluster, freshly installed, with no activity and 
almost no data on it. So the disks should be idle in-between tests.

And I performed various runs; even if the results are not exactly the 
same, they are in the same order of magnitude (1% or 2% difference between 
runs), while the differences between disk vendors are quite high and consistent 
between runs.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] how ceph OSD bench works?

2023-03-30 Thread Luis Domingues
Hi,

I am currently testing some new disks, doing some benchmarks and such, and I 
would like to understand how the OSD bench works.

To quickly explain our setup: we have a small Ceph cluster where our new 
disks are inserted, and we have some pools with no replication at all and only 1 PG, 
upmapped to those new disks, so I can do some benchmarks on them.

The odd thing is that doing some tests with fio, I get similar results on all 
disks, and the same with a 5-minute rados bench. But the OSD bench run at OSD 
startup, which mClock uses to configure osd_mclock_max_capacity_iops_hdd, 
gives me a very big difference between disks (600 vs 2200).
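
For reference, the value mClock derived at startup for each OSD can be read back from the config store, which is how we see the 600 vs 2200 spread (osd ids are examples):

```
ceph config dump | grep osd_mclock_max_capacity_iops
ceph config show osd.0 osd_mclock_max_capacity_iops_hdd
```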

I am running Pacific on this test cluster.

Is there documentation anywhere on how this works? If anyone could explain, 
that would be great.

I did not find any documentation on how the OSD benchmark works, only on how to use 
it. But playing a little bit with it, it seems the results we get are highly 
dependent on the block size we use. The same goes for rados bench: results are 
dependent, at least in my tests, on the block size we use, which I found a 
little bit weird to be honest.

And as mClock depends on that, it is impactful performance-wise. On our cluster 
we can reach much better performance if we tweak those values instead of 
letting the cluster do its own measurements. And this seems to impact certain 
disk vendors more than others.

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] mclock and background best effort

2022-02-28 Thread Luis Domingues
Hello,

As we are testing the mClock scheduler, we have a question to which we did not find any 
answer in the documentation.

The documentation says mClock has 3 types of load, client, recovery and best 
effort. I guess client is the client traffic, and recovery is the recovery when 
something goes wrong.

Could someone tell me what kind of load is included in best effort?

Regards,

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm cluster behing a proxy

2021-10-14 Thread Luis Domingues
Hello,

We have a cluster deployed with cephadm that sits behind a proxy. It has no 
direct access to internet.

Deploying was not an issue; we did a cephadm pull on all the machines before 
bootstrapping the cluster. But we are now facing errors when we try to upgrade 
the cluster, basically this kind of issue:

/bin/podman: stderr Error: Error initializing source 
docker://quay.io/ceph/ceph:v16.2.6: error pinging docker registry quay.io: Get 
"https://quay.io/v2/": dial tcp 54.156.10.58:443: connect: network is 
unreachable

Is there a way to tell cephadm to use an HTTP proxy? I did not find anything 
in the documentation, and I want to avoid having http_proxy environment 
variables set system-wide in the shell.

Or should I use a local container registry mirroring the ceph images?
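
The local-registry route would look roughly like this (registry.local:5000 is a made-up example of a registry reachable from the cluster; the pull is done on a machine that does have internet/proxy access):

```
# on a host with internet access: pull, retag and push to the internal registry
podman pull quay.io/ceph/ceph:v16.2.6
podman tag quay.io/ceph/ceph:v16.2.6 registry.local:5000/ceph/ceph:v16.2.6
podman push registry.local:5000/ceph/ceph:v16.2.6

# then point the cluster and the upgrade at that image
ceph config set global container_image registry.local:5000/ceph/ceph:v16.2.6
ceph orch upgrade start --image registry.local:5000/ceph/ceph:v16.2.6
```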

Thanks,
Luis Domingues
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Adopting "unmanaged" OSDs into OSD service specification

2021-10-13 Thread Luis Domingues
Hi,

We have the same issue on our lab cluster. The only way I found to get the 
osds onto the new specification was to drain, remove and re-add the host. The 
orchestrator was then happy to recreate the osds under the correct specification.

But I do not think this is a good solution for a production cluster. We are still 
looking for a smoother way to do that.
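
For the record, the drain/re-add dance was roughly along these lines (host name, address and label are examples; depending on the release the drain step may be a separate command or manual OSD removal):

```
ceph orch host drain ceph-node-05               # evacuate daemons from the host
ceph orch osd rm status                         # wait until the OSD removals finish
ceph orch host rm ceph-node-05
ceph orch host add ceph-node-05 10.0.0.15 osd   # re-add with its address and the osd label
```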

Luis Domingues

‐‐‐ Original Message ‐‐‐

On Monday, October 4th, 2021 at 10:01 PM, David Orman  
wrote:

> We have an older cluster which has been iterated on many times. It's
>
> always been cephadm deployed, but I am certain the OSD specification
>
> used has changed over time. I believe at some point, it may have been
>
> 'rm'd.
>
> So here's our current state:
>
> root@ceph02:/# ceph orch ls osd --export
> service_type: osd
> service_id: osd_spec_foo
> service_name: osd.osd_spec_foo
> placement:
>   label: osd
> spec:
>   data_devices:
>     rotational: 1
>   db_devices:
>     rotational: 0
>   db_slots: 12
>   filter_logic: AND
>   objectstore: bluestore
> ---
> service_type: osd
> service_id: unmanaged
> service_name: osd.unmanaged
> placement: {}
> unmanaged: true
> spec:
>   filter_logic: AND
>   objectstore: bluestore
>
> root@ceph02:/# ceph orch ls
> NAME              PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
> crash                    7/7      10m ago    14M  *
> mgr                      5/5      10m ago    7M   label:mgr
> mon                      5/5      10m ago    14M  label:mon
> osd.osd_spec_foo         0/7      -          24m  label:osd
> osd.unmanaged            167/167  10m ago    -
>
> The osd_spec_foo would match these devices normally, so we're curious
>
> how we can get these 'managed' under this service specification.
>
> What's the appropriate way in order to effectively 'adopt' these
>
> pre-existing OSDs into the service specification that we want them to
>
> be managed under?
>
> ceph-users mailing list -- ceph-users@ceph.io
>
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm adopt with another user than root

2021-10-11 Thread Luis Domingues
I tried your advice today, and it worked well.
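
For anyone finding this thread later, the sequence is roughly the following (the user name is just an example):

```
ceph cephadm set-user cephadm-mgr             # example non-root user
ceph cephadm get-pub-key > ceph.pub
# copy the key to that user on every host instead of root
ssh-copy-id -f -i ceph.pub cephadm-mgr@<host>
```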

Thanks,
Luis Domingues

‐‐‐ Original Message ‐‐‐

On Friday, October 8th, 2021 at 9:50 PM, Daniel Pivonka  
wrote:

> I'd have to test this to make sure it works, but I believe you can run 'ceph
>
> cephadm set-user <user>'
>
> https://docs.ceph.com/en/octopus/cephadm/operations/#configuring-a-different-ssh-user
>
> after step 4 and before step 5 in the adoption guide
>
> https://docs.ceph.com/en/pacific/cephadm/adoption/
>
> and then in step 6 you need to copy the ssh key to your user instead of root
>
> Let me know if that works for you? I will also test things myself if I have
>
> a chance.
>
> -Daniel Pivonka
>
> On Fri, Oct 8, 2021 at 8:44 AM Luis Domingues luis.doming...@proton.ch
>
> wrote:
>
> > Hello,
> >
> > On our test cluster, we are running containerized latest pacific, and we
> >
> > are testing the upgrade path to cephadm. But we do not want cephadm to use
> >
> > the root user to connect to other machines.
> >
> > We found how to set the ssh-user during bootstrapping, but not when
> >
> > adopting an existing cluster.
> >
> > Is any way to set the ssh-user when adopting a cluster? I did not found
> >
> > the way to change the ssh-user on the documentation.
> >
> > Thanks,
> >
> > Luis Domingues
> >
> > ceph-users mailing list -- ceph-users@ceph.io
> >
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ceph-users mailing list -- ceph-users@ceph.io
>
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm adopt with another user than root

2021-10-08 Thread Luis Domingues
Hello,

On our test cluster, we are running containerized latest pacific, and we are 
testing the upgrade path to cephadm. But we do not want cephadm to use the root 
user to connect to other machines.

We found how to set the ssh-user during bootstrapping, but not when adopting an 
existing cluster.

Is there any way to set the ssh-user when adopting a cluster? I did not find a way 
to change the ssh-user in the documentation.

Thanks,
Luis Domingues
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Drop of performance after Nautilus to Pacific upgrade

2021-09-20 Thread Luis Domingues
We tested Ceph 16.2.6, and indeed the performance came back to what we expect 
for this cluster.

Luis Domingues

‐‐‐ Original Message ‐‐‐

On Saturday, September 11th, 2021 at 9:55 AM, Luis Domingues 
 wrote:

> Hi Igor,
>
> I have a SSD for the physical DB volume. And indeed it has very high 
> utilisation during the benchmark. I will test 16.2.6.
>
> Thanks,
>
> Luis Domingues
>
> ‐‐‐ Original Message ‐‐‐
>
> On Friday, September 10th, 2021 at 5:57 PM, Igor Fedotov ifedo...@suse.de 
> wrote:
>
> > Hi Luis,
> >
> > some chances that you're hit by https://tracker.ceph.com/issues/52089.
> >
> > What is your physical DB volume configuration - are there fast
> >
> > standalone disks for that? If so are they showing high utilization
> >
> > during the benchmark?
> >
> > It makes sense to try 16.2.6 once available - would the problem go away?
> >
> > Thanks,
> >
> > Igor
> >
> > On 9/5/2021 8:45 PM, Luis Domingues wrote:
> >
> > > Hello,
> > >
> > > I run a test cluster of 3 machines with 24 HDDs each, running bare-metal 
> > > on CentOS 8. Long story short, I can have a bandwidth of ~ 1'200 MB/s 
> > > when I do a rados bench, writing objects of 128k, when the cluster is 
> > > installed with Nautilus.
> > >
> > > When I upgrade the cluster to Pacific, (using ceph-ansible to deploy 
> > > and/or upgrade), my performances drop to ~400 MB/s of bandwidth doing the 
> > > same rados bench.
> > >
> > > I am kind of clueless on what makes the performance drop so much. Does 
> > > someone have some ideas where I can dig to find the root of this 
> > > difference?
> > >
> > > Thanks,
> > >
> > > Luis Domingues
> > >
> > > ceph-users mailing list -- ceph-users@ceph.io
> > >
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> > ceph-users mailing list -- ceph-users@ceph.io
> >
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ceph-users mailing list -- ceph-users@ceph.io
>
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Drop of performance after Nautilus to Pacific upgrade

2021-09-11 Thread Luis Domingues
Hi Igor,

I have an SSD for the physical DB volume, and indeed it shows very high 
utilisation during the benchmark. I will test 16.2.6.

Thanks,

Luis Domingues

‐‐‐ Original Message ‐‐‐

On Friday, September 10th, 2021 at 5:57 PM, Igor Fedotov  
wrote:

> Hi Luis,
>
> some chances that you're hit by https://tracker.ceph.com/issues/52089.
>
> What is your physical DB volume configuration - are there fast
>
> standalone disks for that? If so are they showing high utilization
>
> during the benchmark?
>
> It makes sense to try 16.2.6 once available - would the problem go away?
>
> Thanks,
>
> Igor
>
> On 9/5/2021 8:45 PM, Luis Domingues wrote:
>
> > Hello,
> >
> > I run a test cluster of 3 machines with 24 HDDs each, running bare-metal on 
> > CentOS 8. Long story short, I can have a bandwidth of ~ 1'200 MB/s when I 
> > do a rados bench, writing objects of 128k, when the cluster is installed 
> > with Nautilus.
> >
> > When I upgrade the cluster to Pacific, (using ceph-ansible to deploy and/or 
> > upgrade), my performances drop to ~400 MB/s of bandwidth doing the same 
> > rados bench.
> >
> > I am kind of clueless on what makes the performance drop so much. Does 
> > someone have some ideas where I can dig to find the root of this difference?
> >
> > Thanks,
> >
> > Luis Domingues
> >
> > ceph-users mailing list -- ceph-users@ceph.io
> >
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ceph-users mailing list -- ceph-users@ceph.io
>
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Drop of performance after Nautilus to Pacific upgrade

2021-09-10 Thread Luis Domingues
Thanks for your observation.

Indeed, I do not get the drop of performance when upgrading from Nautilus to 
Octopus. But even using Pacific 16.1.0, the performance just goes down, so I 
guess we run into the same issue somehow.

I do not think just staying on Octopus is a solution, as it will reach EOL 
eventually. The source of this performance drop is still a mystery to me.

Luis Domingues

‐‐‐ Original Message ‐‐‐

On Tuesday, September 7th, 2021 at 10:51 AM, Martin Mlynář  
wrote:

> Hello,
>
> we've noticed similar issue after upgrading our test 3 node cluster from
>
> 15.2.14-1~bpo10+1 to 16.1.0-1~bpo10+1.
>
> quick tests using rados bench:
>
> 16.2.5-1~bpo10+1:
>
> Total time run:         133.28
>
> Total writes made:      576
>
> Write size:             4194304
>
> Object size:            4194304
>
> Bandwidth (MB/sec):     17.2869
>
> Stddev Bandwidth:       34.1485
>
> Max bandwidth (MB/sec): 204
>
> Min bandwidth (MB/sec): 0
>
> Average IOPS:           4
>
> Stddev IOPS:            8.55426
>
> Max IOPS:               51
>
> Min IOPS:               0
>
> Average Latency(s):     3.59873
>
> Stddev Latency(s):      5.99964
>
> Max latency(s):         30.6307
>
> Min latency(s):         0.0865062
>
> after downgrading OSDs:
>
> 15.2.14-1~bpo10+1:
>
> Total time run:         120.135
>
> Total writes made:      16324
>
> Write size:             4194304
>
> Object size:            4194304
>
> Bandwidth (MB/sec):     543.524
>
> Stddev Bandwidth:       21.7548
>
> Max bandwidth (MB/sec): 580
>
> Min bandwidth (MB/sec): 436
>
> Average IOPS:           135
>
> Stddev IOPS:            5.43871
>
> Max IOPS:               145
>
> Min IOPS:               109
>
> Average Latency(s):     0.117646
>
> Stddev Latency(s):      0.0391269
>
> Max latency(s):         0.544229
>
> Min latency(s):         0.0602735
>
> We currently run on this setup:
>
> {
>
>     "mon": {
>
>     "ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a)
>
> pacific (stable)": 2
>
>     },
>
>     "mgr": {
>
>     "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be)
>
> octopus (stable)": 3
>
>     },
>
>     "osd": {
>
>     "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be)
>
> octopus (stable)": 35
>
>     },
>
>     "mds": {},
>
>     "overall": {
>
>     "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be)
>
> octopus (stable)": 38,
>
>     "ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a)
>
> pacific (stable)": 2
>
>     }
>
> }
>
> which solved performance issues. All OSDs were newly created and fully
>
> synced from other nodes when upgrading and downgrading back to 15.2.
>
> Best Regards,
>
> Martin
>
> Dne 05. 09. 21 v 19:45 Luis Domingues napsal(a):
>
> > Hello,
> >
> > I run a test cluster of 3 machines with 24 HDDs each, running bare-metal on 
> > CentOS 8. Long story short, I can have a bandwidth of ~ 1'200 MB/s when I 
> > do a rados bench, writing objects of 128k, when the cluster is installed 
> > with Nautilus.
> >
> > When I upgrade the cluster to Pacific, (using ceph-ansible to deploy and/or 
> > upgrade), my performances drop to ~400 MB/s of bandwidth doing the same 
> > rados bench.
> >
> > I am kind of clueless on what makes the performance drop so much. Does 
> > someone have some ideas where I can dig to find the root of this difference?
> >
> > Thanks,
> >
> > Luis Domingues
>
> Martin Mlynář
>
> ceph-users mailing list -- ceph-users@ceph.io
>
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Drop of performance after Nautilus to Pacific upgrade

2021-09-05 Thread Luis Domingues
Hello,

I run a test cluster of 3 machines with 24 HDDs each, running bare-metal on 
CentOS 8. Long story short, I can get a bandwidth of ~1'200 MB/s when I do a 
rados bench writing objects of 128k, when the cluster is installed with 
Nautilus.
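
The benchmark we run is along these lines (pool name, duration and thread count are examples; 131072 bytes = 128k objects):

```
rados bench -p testpool 300 write -b 131072 -t 16 --no-cleanup
```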

When I upgrade the cluster to Pacific (using ceph-ansible to deploy and/or 
upgrade), my performance drops to ~400 MB/s of bandwidth doing the same rados 
bench.

I am kind of clueless about what makes the performance drop so much. Does someone 
have some ideas on where I can dig to find the root of this difference?

Thanks,
Luis Domingues
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io