[ceph-users] Re: Trying to debug "Failed to send data to Zabbix"

2021-10-20 Thread Konstantin Shalygin
Hi,

Check your zabbix_sender binary and the network reachability of the Zabbix
server; the mgr calls zabbix_sender, but its exit code is bad:

"/usr/bin/zabbix_sender exited non-zero"



k
Sent from my iPhone

> On 20 Oct 2021, at 00:46, shubjero  wrote:
> 
> Hey all,
> 
> Recently upgraded to Ceph Octopus (15.2.14). We also run Zabbix
> 5.0.15. Have had ceph/zabbix monitoring for a long time. After the
> Ceph Octopus update I installed the latest version of the Ceph
> template in Zabbix
> (https://github.com/ceph/ceph/blob/master/src/pybind/mgr/zabbix/zabbix_template.xml).
> 
> Zabbix is successfully getting metrics for all the items in the items
> list in my 'ceph' zabbix host. The ceph zabbix host is configured by
> fsid so that any of my 3 ceph-mgr's can send data to it via uuid.
> 
> Here's the ceph zabbix config:
> 
> {
>"discovery_interval": 100,
>"identifier": "4a158d27-f750-41d5-9e7f-26ce4c9d2d45",
>"interval": 60,
>"log_level": "",
>"log_to_cluster": false,
>"log_to_cluster_level": "info",
>"log_to_file": false,
>"zabbix_host": "172.25.4.20",
>"zabbix_port": 10051,
>"zabbix_sender": "/usr/bin/zabbix_sender"
> }
> 
> But for some reason when I run 'ceph zabbix send' or 'ceph zabbix
> discover' I get the following errors:
> 
> # ceph zabbix send
> Failed to send data to Zabbix
> # ceph zabbix discovery
> Failed to send discovery data to Zabbix
> 
> And the ceph logs are constantly logging zabbix errors:
> # ceph log last
> 2021-10-19T17:40:00.005371-0400 mon.controller1 (mon.0) 682609 :
> cluster [INF] overall HEALTH_OK
> 2021-10-19T17:40:04.347459-0400 mon.controller1 (mon.0) 682611 :
> cluster [WRN] Health check failed: Failed to send data to Zabbix
> (MGR_ZABBIX_SEND_FAILED)
> 2021-10-19T17:40:05.352579-0400 mon.controller1 (mon.0) 682612 :
> cluster [INF] Health check cleared: MGR_ZABBIX_SEND_FAILED (was:
> Failed to send data to Zabbix)
> 2021-10-19T17:40:05.352611-0400 mon.controller1 (mon.0) 682613 :
> cluster [INF] Cluster is now healthy
> 2021-10-19T17:41:06.196293-0400 mon.controller1 (mon.0) 682647 :
> cluster [WRN] Health check failed: Failed to send data to Zabbix
> (MGR_ZABBIX_SEND_FAILED)
> 2021-10-19T17:41:07.260666-0400 mon.controller1 (mon.0) 682649 :
> cluster [INF] Health check cleared: MGR_ZABBIX_SEND_FAILED (was:
> Failed to send data to Zabbix)
> 2021-10-19T17:41:07.260689-0400 mon.controller1 (mon.0) 682650 :
> cluster [INF] Cluster is now healthy
> 
> I've tried setting debug_mgr and debug_mon to 20/20 to look for
> additional detail but I didn't see much more other than:
> 
> 2021-10-19T17:15:27.042-0400 7f2c6c50d700 7
> mon.controller1@0(leader).log v30689480 update_from_paxos applying
> incremental log 30689480 2021-10-19T17:15:26.604054-0400
> mon.controller3 (mon.2) 42876 : audit [DBG] from='mgr.490501944
> 172.25.12.17:0/3421653' entity='mgr.controller1' cmd=[{"prefix":
> "config-key get", "key": "mgr/zabbix/zabbix_host"}]: dispatch
> "MGR_ZABBIX_SEND_FAILED": {
> "message": "Failed to send data to Zabbix",
> "message": "/usr/bin/zabbix_sender exited non-zero: b''"
> 
> 
> If anyone has any tips for troubleshooting that would be greatly appreciated!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Glaza
Hi Everyone,

I am in the process of upgrading nautilus (14.2.22) to octopus (15.2.14)
on centos7 (Mon/Mgr were additionally migrated to centos8 beforehand).
Each day I upgraded one host and, after all OSDs were up, I manually
compacted them one by one. Today (8 hosts upgraded, 7 still to go) I
started getting errors like "Possible data damage: 1 pg inconsistent".
The first time it was "acting [56,58,62]", but I thought "OK": in the
osd.62 logs there are many lines like "osd.62 39892 class rgw_gc open
got (1) Operation not permitted", so maybe rgw did not clean some omaps
properly and ceph did not notice it until a scrub happened. But now I
have got "acting [56,57,58]" and none of these OSDs has those rgw_gc
errors in its logs. All affected OSDs are octopus 15.2.14 on NVMe,
hosting the default.rgw.buckets.index pool. Has anyone experienced this
problem? Any help appreciated.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Etienne Menguy
Hi,

You should check for the root cause of the inconsistency:
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent

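Roughly, something like this will show which objects and shards are
affected before you decide on a repair (the pool name and PG id below are
placeholders):

    rados list-inconsistent-pg default.rgw.buckets.index
    rados list-inconsistent-obj <pgid> --format=json-pretty
    ceph pg repair <pgid>    # only once the root cause is understood
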
-
Etienne Menguy
etienne.men...@croit.io




> On 20 Oct 2021, at 09:21, Szabo, Istvan (Agoda)  
> wrote:
> 
> Have you tried to repair pg?
> 
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
> 
> On 2021. Oct 20., at 9:04, Glaza  wrote:
> 
> Email received from the internet. If in doubt, don't click any link nor open 
> any attachment !
> 
> 
> Hi Everyone,
> 
> I am in the process of upgrading nautilus (14.2.22) to octopus (15.2.14)
> on centos7 (Mon/Mgr were additionally migrated to centos8 beforehand).
> Each day I upgraded one host and, after all OSDs were up, I manually
> compacted them one by one. Today (8 hosts upgraded, 7 still to go) I
> started getting errors like "Possible data damage: 1 pg inconsistent".
> The first time it was "acting [56,58,62]", but I thought "OK": in the
> osd.62 logs there are many lines like "osd.62 39892 class rgw_gc open
> got (1) Operation not permitted", so maybe rgw did not clean some omaps
> properly and ceph did not notice it until a scrub happened. But now I
> have got "acting [56,57,58]" and none of these OSDs has those rgw_gc
> errors in its logs. All affected OSDs are octopus 15.2.14 on NVMe,
> hosting the default.rgw.buckets.index pool. Has anyone experienced this
> problem? Any help appreciated.
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Marc



How did you do the upgrade from centos7 to centos8? I assume you kept osd 
config's etc?

> upgrading nautilus (14.2.22) to octopus (15.2.14) on centos7 (Mon/Mgr
> were additionally migrated to centos8 beforehand). Each day I upgraded
> one host and, after all OSDs were up, I manually compacted them one by
> one. Today (8 hosts upgraded, 7 still to go) I started getting errors
> like "Possible data damage: 1 pg inconsistent". The first time it was
> "acting [56,58,62]", but I thought "OK": in the osd.62 logs there are
> many lines like "osd.62 39892 class rgw_gc open got (1) Operation not
> permitted", so maybe rgw did not clean some omaps properly,

Is the rgw still nautilus? What about trying with rgw of octopus?

> and ceph did not notice it until a scrub happened. But now I have got
> "acting [56,57,58]" and none of these OSDs has those rgw_gc errors in
> its logs. All affected OSDs are octopus 15.2.14 on NVMe, hosting the
> default.rgw.buckets.index pool. Has anyone experienced this problem?
> Any help appreciated.
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Tomasz Płaza
Yes I did, and despite "Too many repaired reads on 1 OSDs" health is
back to HEALTH_OK.
But this is the second time it has happened and I do not know whether I
should go forward with the update or hold off. Or maybe it was a bad move
to run compaction right after migrating to 15.2.14.


On 20.10.2021 o 09:21, Szabo, Istvan (Agoda) wrote:

Have you tried to repair pg?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---


On 2021. Oct 20., at 9:04, Glaza  wrote:

Email received from the internet. If in doubt, don't click any link 
nor open any attachment !



Hi Everyone,

I am in the process of upgrading nautilus (14.2.22) to octopus (15.2.14)
on centos7 (Mon/Mgr were additionally migrated to centos8 beforehand).
Each day I upgraded one host and, after all OSDs were up, I manually
compacted them one by one. Today (8 hosts upgraded, 7 still to go) I
started getting errors like "Possible data damage: 1 pg inconsistent".
The first time it was "acting [56,58,62]", but I thought "OK": in the
osd.62 logs there are many lines like "osd.62 39892 class rgw_gc open
got (1) Operation not permitted", so maybe rgw did not clean some omaps
properly and ceph did not notice it until a scrub happened. But now I
have got "acting [56,57,58]" and none of these OSDs has those rgw_gc
errors in its logs. All affected OSDs are octopus 15.2.14 on NVMe,
hosting the default.rgw.buckets.index pool. Has anyone experienced this
problem? Any help appreciated.





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Tomasz Płaza

I did it only on the MON servers. The OSDs are on centos 7. The process was:
1. stop mon&mgr
2. back up /var/lib/ceph
3. reinstall the server as centos 8 and install ceph nautilus
4. restore /var/lib/ceph and start mon&mgr
5. wait a few days
6. upgrade mon&mgr to octopus
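
In shell terms it was roughly this per host (default paths and systemd
targets assumed; the backup filename is just an example and the archive was
of course kept off-host, since the server gets reinstalled):

    systemctl stop ceph-mon.target ceph-mgr.target
    tar czf /root/ceph-mon-backup.tgz /var/lib/ceph
    # reinstall the host with centos 8 + nautilus packages, then:
    tar xzf /root/ceph-mon-backup.tgz -C /
    systemctl start ceph-mon.target ceph-mgr.target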

On 20.10.2021 o 09:51, Marc wrote:


How did you do the upgrade from centos7 to centos8? I assume you kept osd 
config's etc?


upgrading nautilus (14.2.22) to octopus (15.2.14) on centos7 (Mon/Mgr
were additionally migrated to centos8 beforehand). Each day I upgraded
one host and, after all OSDs were up, I manually compacted them one by
one. Today (8 hosts upgraded, 7 still to go) I started getting errors
like "Possible data damage: 1 pg inconsistent". The first time it was
"acting [56,58,62]", but I thought "OK": in the osd.62 logs there are
many lines like "osd.62 39892 class rgw_gc open got (1) Operation not
permitted", so maybe rgw did not clean some omaps properly,

Is the rgw still nautilus? What about trying with rgw of octopus?


and ceph did not notice it until a scrub happened. But now I have got
"acting [56,57,58]" and none of these OSDs has those rgw_gc errors in
its logs. All affected OSDs are octopus 15.2.14 on NVMe, hosting the
default.rgw.buckets.index pool. Has anyone experienced this problem?
Any help appreciated.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Expose rgw using consul or service discovery

2021-10-20 Thread Sebastian Wagner

On 20.10.21 at 09:12, Pierre GINDRAUD wrote:
> Hello,
>
> I'm migrating from puppet to cephadm to deploy a ceph cluster, and I'm
> using consul to expose radosgateway. Before, with puppet, we were
> deploying radosgateway with "apt install radosgw" and applying upgrades
> using "apt upgrade radosgw". In our consul service a simple healthcheck
> on the url "/swift/healthcheck" worked fine, because we were able to
> put the consul agent in maintenance mode before operations.
> I've seen this thread
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/32JZAIU45KDTOWEW6LKRGJGXOFCTJKSS/#N7EGVSDHMMIXHCTPEYBA4CYJBWLD3LLP
> that proves consul is a possible way.
>
> So, with cephadm, the upgrade process decides by itself when to stop,
> upgrade and start each radosgw instance. 

Right

> It's an issue because the
> consul healthcheck must detect the broken instance "as fast as possible"
> to minimize the number of application requests that may still hit the
> down instance's IP.
>
> In some applications like traefik
> https://doc.traefik.io/traefik/reference/static-configuration/cli/ there
> is an option "requestacceptgracetimeout" that allows the "http server"
> to keep handling requests for some time after a stop signal has been
> received, while the healthcheck endpoint immediately starts responding
> with an error. This allows the load balancer (consul here) to mark the
> instance down and stop sending traffic to it before it effectively goes down.
>
> In https://docs.ceph.com/en/latest/radosgw/config-ref/ I haven't seen
> any option like that. And in cephadm I haven't seen "pre-task" and
> "post-task" hooks to, for example, touch a file somewhere that consul
> will be able to test, or put a host into maintenance.
>
> How do you expose the radosgw service to your applications?

cephadm nowadays ships an ingress service using haproxy for this use case:

https://docs.ceph.com/en/latest/cephadm/services/rgw/#high-availability-service-for-rgw
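
A minimal spec looks roughly like this (service id, VIP and ports are
placeholders; see the docs above for the exact fields):

    cat > rgw-ingress.yaml <<'EOF'
    service_type: ingress
    service_id: rgw.myrealm
    placement:
      count: 2
    spec:
      backend_service: rgw.myrealm   # the existing rgw service to front
      virtual_ip: 192.0.2.10/24      # placeholder VIP
      frontend_port: 443
      monitor_port: 1967
    EOF
    ceph orch apply -i rgw-ingress.yaml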

> Do you have any idea for a workaround for my issue?

Plenty, actually. cephadm itself does not provide a notification
mechanism, but other components in the deployment stack might.

On the highest level we have the config-key store of the MONs. You
should be able to get notifications for config-key changes.
Unfortunately this would involve some coding.

On the systemd level we have systemd-notify. I haven't looked into it,
but maybe you can get events about the rgw unit deployed by cephadm.

On the container level we have "podman events" that prints state changes
of containers.

A script that watches podman events and pushes updates to consul sounds
like the most promising solution to me.
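
As a rough, untested sketch of that idea (the consul service name, event
filters and template fields will need adjusting to your podman/consul
versions):

    # toggle consul maintenance mode based on rgw container lifecycle events
    podman events --filter event=died --filter event=start \
                  --format '{{.Name}} {{.Status}}' |
    while read -r name status; do
        case "$name" in *rgw*) ;; *) continue ;; esac
        if [ "$status" = "died" ]; then
            consul maint -enable -service=rgw -reason="rgw container stopped"
        else
            consul maint -disable -service=rgw
        fi
    done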

In case you get this setup working properly, I'd love to read a blog
post about it.

>
> Regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Tomasz Płaza

Sorry Marc, I didn't see the second question.

As the upgrade process states, the RGWs are the last to be upgraded, so
they are still on nautilus (centos7). Those logs showed up after the
upgrade of the first OSD host. It is a multisite setup, so I am a little
afraid of upgrading the RGWs now.


Etienne:

Sorry for answering in this thread, but somehow I do not get messages 
directed only to ceph-users list. I did "rados list-inconsistent-pg" and 
got many entries like:


{
  "object": {
    "name": ".dir.99a07ed8-2112-429b-9f94-81383220a95b.7104621.23.7",
    "nspace": "",
    "locator": "",
    "snap": "head",
    "version": 82561410
  },
  "errors": [
    "omap_digest_mismatch"
  ],
  "union_shard_errors": [],
  "selected_object_info": {
    "oid": {
      "oid": ".dir.99a07ed8-2112-429b-9f94-81383220a95b.7104621.23.7",
      "key": "",
      "snapid": -2,
      "hash": 3316145293,
      "max": 0,
      "pool": 230,
      "namespace": ""
    },
    "version": "107760'82561410",
    "prior_version": "106468'82554595",
    "last_reqid": "client.392341383.0:2027385771",
    "user_version": 82561410,
    "size": 0,
    "mtime": "2021-10-19T16:32:25.699134+0200",
    "local_mtime": "2021-10-19T16:32:25.699073+0200",
    "lost": 0,
    "flags": [
      "dirty",
      "omap",
      "data_digest"
    ],
    "truncate_seq": 0,
    "truncate_size": 0,
    "data_digest": "0x",
    "omap_digest": "0x",
    "expected_object_size": 0,
    "expected_write_size": 0,
    "alloc_hint_flags": 0,
    "manifest": {
      "type": 0
    },
    "watchers": {}
  },
  "shards": [
    {
      "osd": 56,
      "primary": true,
      "errors": [],
      "size": 0,
      "omap_digest": "0xf4cf0e1c",
      "data_digest": "0x"
    },
    {
      "osd": 58,
      "primary": false,
      "errors": [],
      "size": 0,
      "omap_digest": "0xf4cf0e1c",
      "data_digest": "0x"
    },
    {
      "osd": 62,
      "primary": false,
      "errors": [],
      "size": 0,
      "omap_digest": "0x4bd5703a",
      "data_digest": "0x"
    }
  ]
}


On 20.10.2021 o 09:51, Marc wrote:

Is the rgw still nautilus? What about trying with rgw of octopus?


and ceph did not notice it until a scrub happened. But now I have got
"acting [56,57,58]" and none of these OSDs has those rgw_gc errors in
its logs. All affected OSDs are octopus 15.2.14 on NVMe, hosting the
default.rgw.buckets.index pool. Has anyone experienced this problem?
Any help appreciated.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] clients failing to respond to cache pressure (nfs-ganesha)

2021-10-20 Thread Marc


If I restart nfs-ganesha this message disappears. Is there another (server-side)
solution that would clear this message, without the need to restart nfs or
have some sort of service interruption?





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CEPH Zabbix MGR unable to send TLS Data

2021-10-20 Thread Marc Riudalbas Clemente

Hello,

we are trying to monitor our Ceph cluster using the native Zabbix module
from Ceph (ceph mgr zabbix).


We have configured our Zabbix Server to only accept TLS (PSK) 
connections. When we send data with the Zabbix Sender to the Zabbix 
Server this way:


/usr/bin/zabbix_sender -vv --tls-connect psk --tls-psk-identity hostname \
    --tls-psk-file /etc/zabbix/zabbix_agent.psk -z serverip -s hostname \
    -p 10051 -k ceph.num_osd -o 10


Then everything works like expected. We get our value in Zabbix.

-

However, when the Ceph module invokes zabbix_sender, we cannot pass the
parameters --tls-connect, --tls-psk-identity and --tls-psk-file, so
we can't send data to our server:


7774:20211018:140832.103 connection of type "unencrypted" is not allowed for host 
"hostname".

Is there a possibility to tell the CEPH Zabbix Module to use these 
parameters?
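
One possible workaround (an untested sketch; the wrapper path is just an
example) would be to point the module at a small wrapper that appends the
TLS options:

    #!/bin/sh
    # /usr/local/bin/zabbix_sender_psk -- adds the TLS options and passes
    # everything the mgr module supplies through to the real zabbix_sender
    exec /usr/bin/zabbix_sender \
        --tls-connect psk \
        --tls-psk-identity hostname \
        --tls-psk-file /etc/zabbix/zabbix_agent.psk \
        "$@"

and then:

    ceph zabbix config-set zabbix_sender /usr/local/bin/zabbix_sender_psk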


Thank you in advance!

Marc

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Expose rgw using consul or service discovery

2021-10-20 Thread Pierre GINDRAUD
Hello,

I'm migrating from puppet to cephadm to deploy a ceph cluster, and I'm
using consul to expose radosgateway. Before, with puppet, we were
deploying radosgateway with "apt install radosgw" and applying upgrades
using "apt upgrade radosgw". In our consul service a simple healthcheck
on the url "/swift/healthcheck" worked fine, because we were able to
put the consul agent in maintenance mode before operations.
I've seen this thread
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/32JZAIU45KDTOWEW6LKRGJGXOFCTJKSS/#N7EGVSDHMMIXHCTPEYBA4CYJBWLD3LLP
that proves consul is a possible way.

So, with cephadm, the upgrade process decides by itself when to stop,
upgrade and start each radosgw instance. It's an issue because the
consul healthcheck must detect the broken instance "as fast as possible"
to minimize the number of application requests that may still hit the
down instance's IP.

In some applications like traefik
https://doc.traefik.io/traefik/reference/static-configuration/cli/ there
is an option "requestacceptgracetimeout" that allows the "http server"
to keep handling requests for some time after a stop signal has been
received, while the healthcheck endpoint immediately starts responding
with an error. This allows the load balancer (consul here) to mark the
instance down and stop sending traffic to it before it effectively goes
down.

In https://docs.ceph.com/en/latest/radosgw/config-ref/ I haven't seen
any option like that. And in cephadm I haven't seen "pre-task" and
"post-task" hooks to, for example, touch a file somewhere that consul
will be able to test, or put a host into maintenance.

How do you expose the radosgw service to your applications?
Do you have any idea for a workaround for my issue?

Regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Szabo, Istvan (Agoda)
Have you tried to repair pg?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Oct 20., at 9:04, Glaza  wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi Everyone,

I am in the process of upgrading nautilus (14.2.22) to octopus (15.2.14)
on centos7 (Mon/Mgr were additionally migrated to centos8 beforehand).
Each day I upgraded one host and, after all OSDs were up, I manually
compacted them one by one. Today (8 hosts upgraded, 7 still to go) I
started getting errors like "Possible data damage: 1 pg inconsistent".
The first time it was "acting [56,58,62]", but I thought "OK": in the
osd.62 logs there are many lines like "osd.62 39892 class rgw_gc open
got (1) Operation not permitted", so maybe rgw did not clean some omaps
properly and ceph did not notice it until a scrub happened. But now I
have got "acting [56,57,58]" and none of these OSDs has those rgw_gc
errors in its logs. All affected OSDs are octopus 15.2.14 on NVMe,
hosting the default.rgw.buckets.index pool. Has anyone experienced this
problem? Any help appreciated.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph-ansible stable-5.0 repository must be quincy?

2021-10-20 Thread Simon Oosthoek

Hi

we're trying to get ceph-ansible working again for our current version 
of ceph (octopus), in order to be able to add some osd nodes to our 
cluster. (Obviously there's a longer story here, but just a quick 
question for now...)


When we add in all.yml
ceph_origin: repository
ceph_repository: community
# Enabled when ceph_repository == 'community'
#
ceph_mirror: https://eu.ceph.com
ceph_stable_key: https://eu.ceph.com/keys/release.asc
ceph_stable_release: octopus
ceph_stable_repo: "{{ ceph_mirror }}/debian-{{ ceph_stable_release }}"

This fails with a message originating from

- name: validate ceph_repository_community
  fail:
msg: "ceph_stable_release must be 'quincy'"
  when:
- ceph_origin == 'repository'
- ceph_repository == 'community'
- ceph_stable_release not in ['quincy']

in: ceph-ansible/roles/ceph-validate/tasks/main.yml

This is from the "Stable-5.0" branch of ceph-ansible, which is 
specifically for Octopus, as I understand it...


Is this a bug in ceph-ansible in the stable-5.0 branch, or is this our 
problem in understanding what to put in all.yml to get the octopus 
repository for ubuntu 20.04?


Cheers

/Simon
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible stable-5.0 repository must be quincy?

2021-10-20 Thread Guillaume Abrioux
Hi Simon,

are you indeed using the latest version of stable-5.0?

Regards,

On Wed, 20 Oct 2021 at 14:19, Simon Oosthoek 
wrote:

> Hi
>
> we're trying to get ceph-ansible working again for our current version
> of ceph (octopus), in order to be able to add some osd nodes to our
> cluster. (Obviously there's a longer story here, but just a quick
> question for now...)
>
> When we add in all.yml
> ceph_origin: repository
> ceph_repository: community
> # Enabled when ceph_repository == 'community'
> #
> ceph_mirror: https://eu.ceph.com
> ceph_stable_key: https://eu.ceph.com/keys/release.asc
> ceph_stable_release: octopus
> ceph_stable_repo: "{{ ceph_mirror }}/debian-{{ ceph_stable_release }}"
>
> This fails with a message originating from
>
>  - name: validate ceph_repository_community
>fail:
>  msg: "ceph_stable_release must be 'quincy'"
>when:
>  - ceph_origin == 'repository'
>  - ceph_repository == 'community'
>  - ceph_stable_release not in ['quincy']
>
> in: ceph-ansible/roles/ceph-validate/tasks/main.yml
>
> This is from the "Stable-5.0" branch of ceph-ansible, which is
> specifically for Octopus, as I understand it...
>
> Is this a bug in ceph-ansible in the stable-5.0 branch, or is this our
> problem in understanding what to put in all.yml to get the octopus
> repository for ubuntu 20.04?
>
> Cheers
>
> /Simon

-- 

Guillaume Abrioux, Senior Software Engineer
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] jj's "improved" ceph balancer

2021-10-20 Thread Jonas Jelten
Hi!

I've been working on this for quite some time now and I think it's ready for 
some broader testing and feedback.

https://github.com/TheJJ/ceph-balancer

It's an alternative standalone balancer implementation, optimizing for equal 
OSD storage utilization and PG placement across all pools.

It doesn't change your cluster in any way, it just prints the commands you can 
run to apply the PG movements.
Please play around with it :)

Quickstart example: generate 10 PG movements on hdd to stdout

./placementoptimizer.py -v balance --max-pg-moves 10 --only-crushclass hdd 
| tee /tmp/balance-upmaps

When there's remapped pgs (e.g. by applying the above upmaps), you can inspect 
progress with:

./placementoptimizer.py showremapped
./placementoptimizer.py showremapped --by-osd

And you can get a nice Pool and OSD usage overview:

./placementoptimizer.py show --osds --per-pool-count --sort-utilization


Of course there's many more features and optimizations to be added,
but it has already served us very well in reclaiming terabytes of
previously unavailable storage where the `mgr balancer` could no longer
optimize.

What do you think?

Cheers
  -- Jonas
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: monitor not joining quorum

2021-10-20 Thread Michael Moyles
Have you checked sync status and progress?

A mon status command on the leader and problematic monitor should show if
any sync is going on. When datastores (/var/lib/ceph/mon/ by default) get
large the sync can take a long time, assuming the default sync settings,
and needs to complete before a mon will join quorum

  ceph daemon mon.ceph1/3 mon_status
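
For example, to keep an eye on whether the problematic mon is still
synchronizing and how large its store is (hostname taken from the earlier
mails; exact field names can vary by release):

    watch -n 10 'ceph daemon mon.ceph1 mon_status | grep -w state; \
                 du -sh /var/lib/ceph/mon/*'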

Mike


On Wed, 20 Oct 2021 at 07:58, Konstantin Shalygin  wrote:

> Do you have any backfilling operations?
> In our case, when backfilling was done the mon joined the quorum immediately
>
>
> k
>
> Sent from my iPhone
>
> > On 20 Oct 2021, at 08:52, Denis Polom  wrote:
> >
> > 
> > Hi,
> >
> > I've checked it, there is not IP address collision, arp tables are OK,
> mtu also and according tcpdump there are not packet being lost.
> >
> >
> >
> > On 10/19/21 21:36, Konstantin Shalygin wrote:
> >> Hi,
> >>
> >>> On 19 Oct 2021, at 21:59, Denis Polom  wrote:
> >>>
> >>> 2021-10-19 16:22:07.629 7faec9dd2700  1 mon.ceph1@0(synchronizing) e4
> handle_auth_request failed to assign global_id
> >>> 2021-10-19 16:22:08.193 7faec8dd0700  1 mon.ceph1@0(synchronizing) e4
> handle_auth_request failed to assign global_id
> >>> 2021-10-19 16:22:09.565 7faec8dd0700  1 mon.ceph1@0(synchronizing) e4
> handle_auth_request failed to assign global_id
> >>> 2021-10-19 16:22:11.885 7faec8dd0700  1 mon.ceph1@0(synchronizing) e4
> handle_auth_request failed to assign global_id
> >>> 2021-10-19 16:22:14.233 7faec8dd0700  1 mon.ceph1@0(synchronizing) e4
> handle_auth_request failed to assign global_id
> >>> 2021-10-19 16:22:14.889 7faec8dd0700  1 mon.ceph1@0(synchronizing) e4
> handle_auth_request failed to assign global_id
> >>> 2021-10-19 16:22:16.365 7faec8dd0700  1 mon.ceph1@0(synchronizing) e4
> handle_auth_request failed to assign global_id
> >>>
> >>> any idea how to get this monitor to join the quorum?
> >>
> >> We catch this issue couple of weeks ago - this is should be a network
> issue. First check ipaddr collisions, arps, losses, mtu
> >>
> >>
> >>
> >>
> >> k

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: clients failing to respond to cache pressure (nfs-ganesha)

2021-10-20 Thread 胡 玮文
I don’t know if it is related. But we are routinely get warning about 1-4 
clients failed to respond to cache pressure. But it seems to be harmless.

We are running 16.2.6, 2 active MDSes, over 20 kernel cephfs clients, with the 
latest 5.11 kernel from Ubuntu.

> On 20 Oct 2021, at 16:36, Marc  wrote:
> 
> 
> If I restart nfs-ganesha this message disappears. Is there another (server-side)
> solution that would clear this message, without the need to restart nfs
> or have some sort of service interruption?
> 
> 
> 
> 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: config db host filter issue

2021-10-20 Thread Josh Baergen
Hey Richard,

On Tue, Oct 19, 2021 at 8:37 PM Richard Bade  wrote:
> user@cstor01 DEV:~$ sudo ceph config set osd/host:cstor01 osd_max_backfills 2
> user@cstor01 DEV:~$ sudo ceph config get osd.0 osd_max_backfills
> 2
> ...
> Are others able to reproduce?

Yes, we've found the same thing on Nautilus. root-based filtering
works, but nothing else that we've tried so far. We were going to
investigate at some point whether this is fixed in Octopus/Pacific
before filing a ticket.
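
For anyone reproducing this, it is worth comparing what the config database
resolves with what the running daemon actually reports, e.g. using the
OSD/host from Richard's example:

    ceph config get osd.0 osd_max_backfills    # value the mon config db resolves
    ceph config show osd.0 osd_max_backfills   # value the running osd.0 is using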

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: clients failing to respond to cache pressure (nfs-ganesha)

2021-10-20 Thread Magnus HAGDORN
We have increased the cache on our MDS which makes this issue mostly go away. 
It is due to an interaction between the MDS and the ganesha NFS server which 
keeps its own cache. I believe newer versions of ganesha can deal with it.
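
Concretely, that just means raising mds_cache_memory_limit, e.g. (the 8 GiB
value is only an example; the default is 4 GiB):

    ceph config set mds mds_cache_memory_limit 8589934592   # 8 GiB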

Sent from Android device

On 20 Oct 2021 09:37, Marc  wrote:

If I restart nfs-ganesha this message disappears. Is there another (server-side)
solution that would clear this message, without the need to restart nfs or
have some sort of service interruption?





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi Jonas,

From your readme:

"the best possible solution is some OSDs having an offset of 1 PG to the
ideal count. As a PG-distribution-optimization is done per pool, without
checking other pool's distribution at all, some devices will be the +1 more
often than others. At worst one OSD is the +1 for each pool in the cluster."

That's an interesting observation/flaw which hadn't occurred to me before.
I think we don't ever see it in practice in our clusters because we do not
have multiple large pools on the same osds.

How large are the variances in your real clusters? I hope the example in
your readme isn't from real life??

Cheers, Dan

On Wed, 20 Oct 2021, 15:11 Jonas Jelten,  wrote:

> Hi!
>
> I've been working on this for quite some time now and I think it's ready
> for some broader testing and feedback.
>
> https://github.com/TheJJ/ceph-balancer
>
> It's an alternative standalone balancer implementation, optimizing for
> equal OSD storage utilization and PG placement across all pools.
>
> It doesn't change your cluster in any way, it just prints the commands you
> can run to apply the PG movements.
> Please play around with it :)
>
> Quickstart example: generate 10 PG movements on hdd to stdout
>
> ./placementoptimizer.py -v balance --max-pg-moves 10 --only-crushclass
> hdd | tee /tmp/balance-upmaps
>
> When there's remapped pgs (e.g. by applying the above upmaps), you can
> inspect progress with:
>
> ./placementoptimizer.py showremapped
> ./placementoptimizer.py showremapped --by-osd
>
> And you can get a nice Pool and OSD usage overview:
>
> ./placementoptimizer.py show --osds --per-pool-count --sort-utilization
>
>
> Of course there's many more features and optimizations to be added,
> but it has already served us very well in reclaiming terabytes of
> previously unavailable storage where the `mgr balancer` could no longer
> optimize.
>
> What do you think?
>
> Cheers
>   -- Jonas
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Jonas Jelten
Hi Dan,

I'm not kidding, these were real-world observations, hence my motivation to 
create this balancer :)
First I tried "fixing" the mgr balancer, but after understanding the exact 
algorithm there I thought of a completely different approach.

For us the main reason things got out of balance was this (from the README):
> To make things worse, if there's a huge server in the cluster which is so 
> big, CRUSH can't place data often enough on it to fill it to the same level 
> as any other server, the balancer will fail moving PGs across servers that 
> actually would have space.
> This happens since it sees only this server's OSDs as "underfull", but each 
> PG has one shard on that server already, so no data can be moved on it.

But all the aspects in that section play together, and I don't think it's 
easily improvable in mgr-balancer while keeping the same base algorithm.

Cheers
  -- Jonas

On 20/10/2021 19.55, Dan van der Ster wrote:
> Hi Jonas,
> 
> From your readme:
> 
> "the best possible solution is some OSDs having an offset of 1 PG to the 
> ideal count. As a PG-distribution-optimization is done per pool, without 
> checking other pool's distribution at all, some devices will be the +1 more 
> often than others. At worst one OSD is the +1 for each pool in the cluster."
> 
> That's an interesting observation/flaw which hadn't occurred to me before. I 
> think we don't ever see it in practice in our clusters because we do not have 
> multiple large pools on the same osds.
> 
> How large are the variances in your real clusters? I hope the example in your 
> readme isn't from real life??
> 
> Cheers, Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi,

I don't quite understand your "huge server" scenario, other than a basic
understanding that the balancer cannot do magic in some impossible cases.

But anyway, I wonder if this sort of higher order balancing could/should be
added as a "part two" to the mgr balancer. The existing code does a quite
good job in many (dare I say most?) cases. E.g. it even balances empty
clusters perfectly.
But after it cannot find a further optimization, maybe a heuristic like
yours can further refine the placement...

 Dan


On Wed, 20 Oct 2021, 20:52 Jonas Jelten,  wrote:

> Hi Dan,
>
> I'm not kidding, these were real-world observations, hence my motivation
> to create this balancer :)
> First I tried "fixing" the mgr balancer, but after understanding the exact
> algorithm there I thought of a completely different approach.
>
> For us the main reason things got out of balance was this (from the
> README):
> > To make things worse, if there's a huge server in the cluster which is
> so big, CRUSH can't place data often enough on it to fill it to the same
> level as any other server, the balancer will fail moving PGs across servers
> that actually would have space.
> > This happens since it sees only this server's OSDs as "underfull", but
> each PG has one shard on that server already, so no data can be moved on it.
>
> But all the aspects in that section play together, and I don't think it's
> easily improvable in mgr-balancer while keeping the same base algorithm.
>
> Cheers
>   -- Jonas
>
> On 20/10/2021 19.55, Dan van der Ster wrote:
> > Hi Jonas,
> >
> > From your readme:
> >
> > "the best possible solution is some OSDs having an offset of 1 PG to the
> ideal count. As a PG-distribution-optimization is done per pool, without
> checking other pool's distribution at all, some devices will be the +1 more
> often than others. At worst one OSD is the +1 for each pool in the cluster."
> >
> > That's an interesting observation/flaw which hadn't occurred to me
> before. I think we don't ever see it in practice in our clusters because we
> do not have multiple large pools on the same osds.
> >
> > How large are the variances in your real clusters? I hope the example in
> your readme isn't from real life??
> >
> > Cheers, Dan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi Josh,

That's another interesting dimension...
Indeed a cluster that has plenty of free capacity could indeed be balanced
by workload/iops, but once it reaches maybe 60 or 70% full, then I think
capacity would need to take priority.

But to be honest I don't really understand the workload/iops balancing
use-case. Can you describe some of the scenarios you have in mind?

.. Dan


On Wed, 20 Oct 2021, 20:45 Josh Salomon,  wrote:

> Just another point of view:
> The current balancer balances the capacity but this is not enough. The
> balancer should also balance the workload and we plan on adding primary
> balancing for Quincy. In order to balance the workload you should work pool
> by pool because pools have different workloads. So while the observation
> about the +1 PGs is correct, I believe the correct solution should be
> taking this into consideration while still balancing capacity pool by pool.
> Capacity balancing is a functional requirement, while workload balancing
> is a performance requirement so it is important only for very loaded
> systems (loaded in terms of high IOPS not nearly full systems)
>
> I would appreciate comments on this thought.
>
> On Wed, 20 Oct 2021, 20:57 Dan van der Ster,  wrote:
>
>> Hi Jonas,
>>
>> From your readme:
>>
>> "the best possible solution is some OSDs having an offset of 1 PG to the
>> ideal count. As a PG-distribution-optimization is done per pool, without
>> checking other pool's distribution at all, some devices will be the +1 more
>> often than others. At worst one OSD is the +1 for each pool in the cluster."
>>
>> That's an interesting observation/flaw which hadn't occurred to me
>> before. I think we don't ever see it in practice in our clusters because we
>> do not have multiple large pools on the same osds.
>>
>> How large are the variances in your real clusters? I hope the example in
>> your readme isn't from real life??
>>
>> Cheers, Dan
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, 20 Oct 2021, 15:11 Jonas Jelten,  wrote:
>>
>>> Hi!
>>>
>>> I've been working on this for quite some time now and I think it's ready
>>> for some broader testing and feedback.
>>>
>>> https://github.com/TheJJ/ceph-balancer
>>>
>>> It's an alternative standalone balancer implementation, optimizing for
>>> equal OSD storage utilization and PG placement across all pools.
>>>
>>> It doesn't change your cluster in any way, it just prints the commands
>>> you can run to apply the PG movements.
>>> Please play around with it :)
>>>
>>> Quickstart example: generate 10 PG movements on hdd to stdout
>>>
>>> ./placementoptimizer.py -v balance --max-pg-moves 10
>>> --only-crushclass hdd | tee /tmp/balance-upmaps
>>>
>>> When there's remapped pgs (e.g. by applying the above upmaps), you can
>>> inspect progress with:
>>>
>>> ./placementoptimizer.py showremapped
>>> ./placementoptimizer.py showremapped --by-osd
>>>
>>> And you can get a nice Pool and OSD usage overview:
>>>
>>> ./placementoptimizer.py show --osds --per-pool-count
>>> --sort-utilization
>>>
>>>
>>> Of course there's many more features and optimizations to be added,
>>> but it has already served us very well in reclaiming terabytes of
>>> previously unavailable storage where the `mgr balancer` could no longer
>>> optimize.
>>>
>>> What do you think?
>>>
>>> Cheers
>>>   -- Jonas
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] v15.2.15 Octopus released

2021-10-20 Thread David Galloway
We're happy to announce the 15th backport release in the Octopus series.
We recommend users update to this release. For detailed release notes
with links and a changelog, please refer to the official blog entry at
https://ceph.io/en/news/blog/2021/v15-2-15-octopus-released

Notable Changes
---

* The default value of `osd_client_message_cap` has been set to 256, to
provide better flow control by limiting maximum number of in-flight
client requests.

* A new ceph-erasure-code-tool has been added to help manually recover
an object from a damaged PG.


Getting Ceph

* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-15.2.15.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 2dfb18841cfecc2f7eb7eb2afd65986ca4d95985

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi Josh,

Okay, but do you agree that for any given pool, the load is uniform across
its PGs?

Doesn't the existing mgr balancer already balance the PGs for each pool
individually? So in your example, the PGs from the loaded pool will be
balanced across all osds, as will the idle pool's PGs. So the net load is
uniform, right?

OTOH I could see a workload/capacity imbalance if there are mixed capacity
but equal performance devices (e.g. a cluster with 50% 6TB HDDs and 50%
12TB HDDs).
In that case we're probably better to treat the disks as uniform in size
until the smaller osds fill up.

.. Dan


On Wed, 20 Oct 2021, 22:09 Josh Salomon,  wrote:

> Hi Dan,
>
> Assume you have 2 pools with the same used capacity and the same number of
> PGs, but one gets 10x the IOs of the other. From a capacity-balancing
> perspective all the PGs look identical, but devices holding PGs only from
> the idle pool will get a tenth of the IOs of devices holding PGs from the
> busy pool. Under load almost all the load will go to the latter devices
> while the former will be almost idle, which makes very bad use of the
> cluster bandwidth.
> This is an extreme case, but even in the case that the PGs are blended but
> not ideally (even one device has more PGs from the loaded pool and it is
> not split 50-50) we get weakest link in the chain effect on that pool and
> under load it will provide less than optimal bandwidth from the cluster.
>
> IMHO it should be correct also when the cluster is almost full and not
> limited to half full clusters.
>
> I do agree with the observation of bad +1 PG splits among the OSDs and I
> believe this should be fixed. I am not sure I fully understood the huge
> node use case: if every PG already has a shard on that node and it is
> still underutilized, I don't see how we can improve on this without
> sacrificing reliability (by putting 2 copies on the same node).
>
> Josh
>
>
> On Wed, Oct 20, 2021 at 10:56 PM Dan van der Ster 
> wrote:
>
>> Hi Josh,
>>
>> That's another interesting dimension...
>> Indeed a cluster that has plenty of free capacity could indeed be
>> balanced by workload/iops, but once it reaches maybe 60 or 70% full, then I
>> think capacity would need to take priority.
>>
>> But to be honest I don't really understand the workload/iops balancing
>> use-case. Can you describe some of the scenarios you have in mind?
>>
>> .. Dan
>>
>>
>> On Wed, 20 Oct 2021, 20:45 Josh Salomon,  wrote:
>>
>>> Just another point of view:
>>> The current balancer balances the capacity but this is not enough. The
>>> balancer should also balance the workload and we plan on adding primary
>>> balancing for Quincy. In order to balance the workload you should work pool
>>> by pool because pools have different workloads. So while the observation
>>> about the +1 PGs is correct, I believe the correct solution should be
>>> taking this into consideration while still balancing capacity pool by pool.
>>> Capacity balancing is a functional requirement, while workload balancing
>>> is a performance requirement so it is important only for very loaded
>>> systems (loaded in terms of high IOPS not nearly full systems)
>>>
>>> I would appreciate comments on this thought.
>>>
>>> On Wed, 20 Oct 2021, 20:57 Dan van der Ster,  wrote:
>>>
 Hi Jonas,

 From your readme:

 "the best possible solution is some OSDs having an offset of 1 PG to
 the ideal count. As a PG-distribution-optimization is done per pool,
 without checking other pool's distribution at all, some devices will be the
 +1 more often than others. At worst one OSD is the +1 for each pool in the
 cluster."

 That's an interesting observation/flaw which hadn't occurred to me
 before. I think we don't ever see it in practice in our clusters because we
 do not have multiple large pools on the same osds.

 How large are the variances in your real clusters? I hope the example
 in your readme isn't from real life??

 Cheers, Dan

 On Wed, 20 Oct 2021, 15:11 Jonas Jelten,  wrote:

> Hi!
>
> I've been working on this for quite some time now and I think it's
> ready for some broader testing and feedback.
>
> https://github.com/TheJJ/ceph-balancer
>
> It's an alternative standalone balancer implementation, optimizing for
> equal OSD storage utilization and PG placement across all pools.
>
> It doesn't change your cluster in any way, it just prints the commands
> you can run to apply the PG movements.
> Please play around with it :)
>
> Quickstart example: generate 10 PG movements on hdd to stdout
>
> ./placementoptimizer.py -v balance --max-pg-moves 10
> --only-crushclass hdd | tee /tmp/balance-upmaps
>
> When there's remapped pgs (e.g. by applying the above upmaps), you can
> inspect progress with:
>
> ./placementoptimizer.py showremapped
>

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Anthony D'Atri


> Doesn't the existing mgr balancer already balance the PGs for each pool 
> individually? So in your example, the PGs from the loaded pool will be 
> balanced across all osds, as will the idle pool's PGs. So the net load is 
> uniform, right?

If there’s a single CRUSH root and all pools share the same set of OSDs?  I 
suspect that what he’s getting at is if pools use different sets of OSDs, or 
(eek) live on partly overlapping sets of OSDs.


> OTOH I could see a workload/capacity imbalance if there are mixed capacity 
> but equal performance devices (e.g. a cluster with 50% 6TB HDDs and 50% 12TB 
> HDDs). 
> In that case we're probably better to treat the disks as uniform in size 
> until the smaller osds fill up.

Primary affinity can help, with reads at least, but it’s a bit fussy.
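
For example (OSD ids and weights here are rough illustrations, not a tuned
mapping):

    # de-prefer the large 12TB OSDs as primaries so more reads land on the 6TB ones
    ceph osd primary-affinity osd.12 0.5
    ceph osd primary-affinity osd.7  1.0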
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Anthony D'Atri


> On Oct 20, 2021, at 1:49 PM, Josh Salomon  wrote:
> 
> but in the extreme case (some capacity on 1TB devices and some on 6TB 
> devices) the workload can't be balanced. I

It’s also super easy in such a scenario to

a) Have the larger drives not uniformly spread across failure domains, which 
can lead to fractional capacity that is unusuable because it can’t meet 
replication policy.

b) Find the OSDs on the larger drives exceeding the configured max PG per OSD 
figure and refusing to activate, especially when maintenance, failures, or 
other topology changes precipitate recovery.  This has bitten me with a mix of 
1.x and 3.84 TB drives; I ended up raising the limit to 1000 while I juggled 
drives, nodes, and clusters so that a given cluster had uniformly sized drives. 
 At smaller scales of course that often won’t be an option.


> primary affinity can help with a single pool - with multiple pools with 
> different r/w ratio it becomes messy since pa is per device - it could help 
> more if it was per device/pool pair. Also it could be more useful if the 
> values were not 0-1 but 0-replica_count, but this is a usability issue, not 
> functional, it just makes the use more cumbersome. It was designed for a 
> different purpose though so this is not the "right" solution, the right 
> solution is primary balancer.   


Absolutely.  I had the luxury of clusters containing a single pool.  In the 
above instance, before refactoring the nodes/drives, we achieved an easy 15-20% 
increase in aggregate read performance by applying a very rough guestimate of 
affinities based on OSD size.  The straw-draw factor does complicate deriving 
the *optimal* mapping of values, especially when topology changes.

I’ve seen someone set the CRUSH weight of larger/outlier OSDs artificially low 
to balance workload.  All depends on the topology, future plans, and local 
priorities.

> I don't quite understand your "huge server" scenario, other than a basic 
> understanding that the balancer cannot do magic in some impossible cases.

I read it as describing a cluster where nodes / failure domains have 
significantly non-uniform CRUSH weights.  Which is suboptimal, but sometimes 
folks don’t have a choice.  Or during migration between chassis generations.  
Back around … Firefly I think it was, there were a couple of bugs that resulted 
in undesirable behavior in those scenarios.

— aad

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Upgrade to 16.2.6 and osd+mds crash after bluestore_fsck_quick_fix_on_mount true

2021-10-20 Thread mgrzybowski

Hi
  Recently I performed upgrades on the single-node CephFS server I have.

# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ecpoolk3m1osd ecpoolk5m1osd ecpoolk4m2osd ]
~# ceph osd pool ls detail
pool 20 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 10674 lfor 
0/0/5088 flags hashpspool stripe_width 0 application cephfs
pool 21 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 10674 lfor 
0/0/5179 flags hashpspool stripe_width 0 application cephfs
pool 22 'ecpoolk3m1osd' erasure profile myprofilek3m1osd size 4 min_size 3 
crush_rule 3 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode warn 
last_change 10674 lfor 0/0/1442 flags hashpspool,ec_overwrites stripe_width 
12288 compression_algorithm zstd compression_mode aggressive application cephfs
pool 23 'ecpoolk5m1osd' erasure profile myprofilek5m1osd size 6 min_size 5 
crush_rule 5 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn 
last_change 12517 lfor 0/0/7892 flags hashpspool,ec_overwrites stripe_width 
20480 compression_algorithm zstd compression_mode aggressive application cephfs
pool 24 'ecpoolk4m2osd' erasure profile myprofilek4m2osd size 6 min_size 5 
crush_rule 6 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn 
last_change 10674 flags hashpspool,ec_overwrites stripe_width 16384 
compression_algorithm zstd compression_mode aggressive application cephfs
pool 25 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 11033 
lfor 0/0/10991 flags hashpspool stripe_width 0 pg_num_min 1 application 
mgr_devicehealth


I started this upgrade from ubuntu 16.04 and luminous (there were upgrades
in the past, and some OSDs could have been created back in Kraken):
- first I upgraded ceph to Nautilus; all seemed to go well and, according
  to the docs, there were no warnings in status
- then I did "do-release-upgrade" to ubuntu 18.04 (ceph packages were not
  touched by that upgrade)
- then I did "do-release-upgrade" to ubuntu 20.04 (this upgrade bumped the
  ceph packages to 15.2.1-0ubuntu1; before each do-release-upgrade I removed
  /etc/ceph/ceph.conf, so at least the mon daemon was down, and the OSDs
  should not start because the volumes are encrypted)
- next I upgraded the ceph packages to 16.2.6-1focal and started the daemons.

All seemed to work well; the only thing left was the warning:

10 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats

I found on the list that it is recommended to set:

ceph config set osd bluestore_fsck_quick_fix_on_mount true

and do a rolling restart of the OSDs. After the first restart+fsck I got a
crash on an OSD (and on the MDS too):

-1> 2021-10-14T22:02:45.877+0200 7f7f080a4f00 -1 
/build/ceph-16.2.6/src/osd/PG.cc: In function 'static int 
PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*)' thread 7f7f080a4f00 time 
2021-10-14T22:02:45.878154+0200
/build/ceph-16.2.6/src/osd/PG.cc: 1009: FAILED ceph_assert(values.size() == 2)
 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x152) [0x55e29cd0ce61]
 2: /usr/bin/ceph-osd(+0xac6069) [0x55e29cd0d069]
 3: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*)+0xa17) 
[0x55e29ce97057]
 4: (OSD::load_pgs()+0x6b4) [0x55e29ce07ec4]
 5: (OSD::init()+0x2b4e) [0x55e29ce14a6e]
 6: main()
 7: __libc_start_main()
 8: _start()


The same happened on the next restart+fsck of an OSD:

-1> 2021-10-17T22:47:49.291+0200 7f98877bff00 -1 
/build/ceph-16.2.6/src/osd/PG.cc: In function 'static int 
PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*)' thread 7f98877bff00 time 
2021-10-17T22:47:49.292912+0200
/build/ceph-16.2.6/src/osd/PG.cc: 1009: FAILED ceph_assert(values.size() == 2)

 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x152) [0x560e09af7e61]
 2: /usr/bin/ceph-osd(+0xac6069) [0x560e09af8069]
 3: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*)+0xa17) 
[0x560e09c82057]
 4: (OSD::load_pgs()+0x6b4) [0x560e09bf2ec4]
 5: (OSD::init()+0x2b4e) [0x560e09bffa6e]
 6: main()
 7: __libc_start_main()
 8: _start()


Once crashed, the OSDs could not be brought back online; they crash again
if I try to start them.
A deep fsck did not find anything:

~# ceph-bluestore-tool --command fsck  --deep yes --path 
/var/lib/ceph/osd/ceph-2
fsck success


Any ideas what could cause these crashes, and is it possible to bring the
crashed OSDs back online?


--
  mgrzybowski
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade to 16.2.6 and osd+mds crash after bluestore_fsck_quick_fix_on_mount true

2021-10-20 Thread Igor Fedotov

Hey mgrzybowski!

Never seen that before, but perhaps some omaps have been improperly
converted to the new format and can't be read any more...


I'll take a more detailed look at what's happening during that load_pgs 
call and what exact information is missing.


Meanwhile, could you please set debug_bluestore to 20 and collect an OSD
startup log?
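
Something along these lines should do (osd.2 taken from your fsck command;
adjust the id and paths to your setup):

    ceph config set osd.2 debug_bluestore 20
    systemctl restart ceph-osd@2
    # then grab /var/log/ceph/ceph-osd.2.log covering the startup/crash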



Thanks,

Igor

On 10/21/2021 12:56 AM, mgrzybowski wrote:

Hi
  Recently I did perform upgrades on single node cephfs server i have.

# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data 
ecpoolk3m1osd ecpoolk5m1osd ecpoolk4m2osd

~# ceph osd pool ls detail
pool 20 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn 
last_change 10674 lfor 0/0/5088 flags hashpspool stripe_width 0 
application cephfs
pool 21 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn 
last_change 10674 lfor 0/0/5179 flags hashpspool stripe_width 0 
application cephfs
pool 22 'ecpoolk3m1osd' erasure profile myprofilek3m1osd size 4 
min_size 3 crush_rule 3 object_hash rjenkins pg_num 16 pgp_num 16 
autoscale_mode warn last_change 10674 lfor 0/0/1442 flags 
hashpspool,ec_overwrites stripe_width 12288 compression_algorithm zstd 
compression_mode aggressive application cephfs
pool 23 'ecpoolk5m1osd' erasure profile myprofilek5m1osd size 6 
min_size 5 crush_rule 5 object_hash rjenkins pg_num 128 pgp_num 128 
autoscale_mode warn last_change 12517 lfor 0/0/7892 flags 
hashpspool,ec_overwrites stripe_width 20480 compression_algorithm zstd 
compression_mode aggressive application cephfs
pool 24 'ecpoolk4m2osd' erasure profile myprofilek4m2osd size 6 
min_size 5 crush_rule 6 object_hash rjenkins pg_num 64 pgp_num 64 
autoscale_mode warn last_change 10674 flags hashpspool,ec_overwrites 
stripe_width 16384 compression_algorithm zstd compression_mode 
aggressive application cephfs
pool 25 'device_health_metrics' replicated size 3 min_size 2 
crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode 
on last_change 11033 lfor 0/0/10991 flags hashpspool stripe_width 0 
pg_num_min 1 application mgr_devicehealth



I started this upgrade from ubuntu 16.04 and luminous ( there were 
upgrades in the past and some osd's could be started in Kraken ) ):
- first i upgraded ceph to Nautilus,  all seems to went well and 
accoording to the docs, no warning in status
- then i did "do-release-upgrade" to ubuntu to 18.04 ( ceph packaged  
were not touch by that upgrade )
- then i did "do-release-upgrade" to ubuntu to 20.04 ( this upgrade 
bumped ceph
  packages to 15.2.1-0ubuntu1, before each do-release-upgrade i 
removed /etc/ceph/ceph.conf,
  so at least mon deamon was down. osd should not start ( siple 
volumes are encrypted )

- next i upgraded ceph packages to  16.2.6-1focal m started deamons .

All seems to work well, only what left was warning:

10 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats

I found on the list that it is recommend to set:

ceph config set osd bluestore_fsck_quick_fix_on_mount true

and rolling restart OSDs. After first restart+fsck i got crash on OSD 
( and on MDS to) :


    -1> 2021-10-14T22:02:45.877+0200 7f7f080a4f00 -1 
/build/ceph-16.2.6/src/osd/PG.cc: In function 'static int 
PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*)' thread 7f7f080a4f00 
time 2021-10-14T22:02:45.878154+0200
/build/ceph-16.2.6/src/osd/PG.cc: 1009: FAILED 
ceph_assert(values.size() == 2)
 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) 
pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x152) [0x55e29cd0ce61]

 2: /usr/bin/ceph-osd(+0xac6069) [0x55e29cd0d069]
 3: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*)+0xa17) 
[0x55e29ce97057]

 4: (OSD::load_pgs()+0x6b4) [0x55e29ce07ec4]
 5: (OSD::init()+0x2b4e) [0x55e29ce14a6e]
 6: main()
 7: __libc_start_main()
 8: _start()


The same went on next restart+fsck  osd:

    -1> 2021-10-17T22:47:49.291+0200 7f98877bff00 -1 
/build/ceph-16.2.6/src/osd/PG.cc: In function 'static int 
PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*)' thread 7f98877bff00 
time 2021-10-17T22:47:49.292912+0200
/build/ceph-16.2.6/src/osd/PG.cc: 1009: FAILED 
ceph_assert(values.size() == 2)


 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) 
pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x152) [0x560e09af7e61]

 2: /usr/bin/ceph-osd(+0xac6069) [0x560e09af8069]
 3: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*)+0xa17) 
[0x560e09c82057]

 4: (OSD::load_pgs()+0x6b4) [0x560e09bf2ec4]
 5: (OSD::init()+0x2b4e) [0x560e09bffa6e]
 6: main()
 7: __libc_start_main()
 8: _start()


Once crashed OSDs could not be bring back online, they will crash 
again if i try start them.

Deep fsck did not found anything:

~# ceph-bluestore-tool --command fsck  --deep yes --path 
/var/lib/ceph