[ceph-users] TOO_MANY_PGS after upgrade from Nautilus to Octopus
Hi,

We are currently upgrading our cluster from Nautilus to Octopus. After upgrading the mons and mgrs, we get warnings about the number of PGs. Which parameter changed during the upgrade to explain these new warnings? Nothing else was changed. Is it risky to change the PGs per pool as proposed in the warnings? In particular, to reduce from 4096 to 64!

Thanks in advance,
Patrick

[root@server4 ~]# ceph -s
  cluster:
    id:     ba00c030-382f-4d75-b150-5b17f77e57fe
    health: HEALTH_WARN
            clients are using insecure global_id reclaim
            6 pools have too few placement groups
            9 pools have too many placement groups

  services:
    mon: 3 daemons, quorum server2,server5,server6 (age 66m)
    mgr: server8(active, since 67m), standbys: server4, server1
    osd: 244 osds: 244 up (since 12m), 244 in (since 2w)
    rgw: 2 daemons active (server1, server4)

  task status:

  data:
    pools:   16 pools, 11441 pgs
    objects: 2.02M objects, 5.9 TiB
    usage:   18 TiB used, 982 TiB / 1000 TiB avail
    pgs:     11441 active+clean

  io:
    client:  862 KiB/s rd, 1.4 MiB/s wr, 61 op/s rd, 100 op/s wr

[root@server4 ~]# ceph health detail
...
[WRN] POOL_TOO_MANY_PGS: 9 pools have too many placement groups
    Pool default.rgw.buckets.index has 128 placement groups, should have 32
    Pool default.rgw.buckets.data has 4096 placement groups, should have 64
    Pool os_glance has 1024 placement groups, should have 32
...

[root@server4 ~]# ceph config get mon mon_max_pg_per_osd
250

In ceph.conf, we also set:
osd_max_pg_per_osd_hard_ratio = 3
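For reference, a sketch of what I understand I could run to inspect (and, if needed, silence) the recommendations, assuming the warnings come from the pg_autoscaler module that Octopus enables by default; the pool name is just one taken from the health detail above:

    # show the autoscaler's view of each pool (current vs. suggested PG_NUM)
    ceph osd pool autoscale-status
    # keep pg_num as it is and stop the recommendation for one pool
    ceph osd pool set default.rgw.buckets.data pg_autoscale_mode off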
[ceph-users] HELP NEEDED: cephadm adopt osd crash
Hi,

We've already converted two PRODUCTION storage nodes on Octopus to cephadm without problems. On the third one, we succeeded in converting only one OSD.

[root@server4 osd]# cephadm adopt --style legacy --name osd.0
Found online OSD at //var/lib/ceph/osd/ceph-0/fsid
objectstore_type is bluestore
Stopping old systemd unit ceph-osd@0...
Disabling old systemd unit ceph-osd@0...
Moving data...
Chowning content...
Chowning /var/lib/ceph/fsid replaced/osd.0/block...
Renaming /etc/ceph/osd/0-2d973f03-82f3-499f-b5dc-d4c28dbe1b3d.json -> /etc/ceph/osd/0-2d973f03-82f3-499f-b5dc-d4c28dbe1b3d.json.adopted-by-cephadm
Disabling host unit ceph-volume@ simple unit...
Moving logs...
Creating new units...

For the others, we get this error:

[root@server4 osd]# cephadm adopt --style legacy --name osd.17
Found online OSD at //var/lib/ceph/osd/ceph-17/fsid
objectstore_type is bluestore
Stopping old systemd unit ceph-osd@17...
Disabling old systemd unit ceph-osd@17...
Moving data...
Traceback (most recent call last):
  File "/sbin/cephadm", line 6251, in 
    r = args.func()
  File "/sbin/cephadm", line 1458, in _default_image
    return func()
  File "/sbin/cephadm", line 4027, in command_adopt
    command_adopt_ceph(daemon_type, daemon_id, fsid);
  File "/sbin/cephadm", line 4170, in command_adopt_ceph
    os.rmdir(data_dir_src)
OSError: [Errno 16] Device or resource busy: '//var/lib/ceph/osd/ceph-17'

The directory /var/lib/ceph/osd/ceph-17 is now empty. The directory /var/lib/ceph//osd.17 contains:

[root@server4 osd.17]# ls -l
total 72
-rw-r--r-- 1 ceph ceph  411 Jan 29  2018 activate.monmap
-rw-r--r-- 1 ceph ceph    3 Jan 29  2018 active
lrwxrwxrwx 1 root root   10 Nov  8 15:54 block -> /dev/sdad2
-rw-r--r-- 1 ceph ceph   37 Jan 29  2018 block_uuid
-rw-r--r-- 1 ceph ceph    2 Jan 29  2018 bluefs
-rw-r--r-- 1 ceph ceph   37 Jan 29  2018 ceph_fsid
-rw-r--r-- 1 ceph ceph 1226 Nov  8 15:53 config
-rw-r--r-- 1 ceph ceph   37 Jan 29  2018 fsid
-rw------- 1 ceph ceph   57 Jan 29  2018 keyring
-rw-r--r-- 1 ceph ceph    8 Jan 29  2018 kv_backend
-rw-r--r-- 1 ceph ceph   21 Jan 29  2018 magic
-rw-r--r-- 1 ceph ceph    4 Jan 29  2018 mkfs_done
-rw-r--r-- 1 ceph ceph    6 Jan 29  2018 ready
-rw------- 1 ceph ceph    3 Nov  8 14:47 require_osd_release
-rw-r--r-- 1 ceph ceph    0 Jan 13  2020 systemd
-rw-r--r-- 1 ceph ceph   10 Jan 29  2018 type
-rw------- 1 root root   22 Nov  8 15:53 unit.image
-rw------- 1 root root 1042 Nov  8 16:30 unit.poststop
-rw------- 1 root root 1851 Nov  8 16:30 unit.run
-rw-r--r-- 1 ceph ceph    3 Jan 29  2018 whoami

When trying to start or redeploy osd.17, podman inspect complains about a non-existent image:

2022-11-08 16:58:58,503 7f930fab3740 DEBUG Running command: /bin/podman inspect --format {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels "io.ceph.version"}} ceph--osd.17
2022-11-08 16:58:58,591 7f930fab3740 DEBUG /bin/podman: stderr Error: error getting image "ceph--osd.17": unable to find a name and tag match for ceph--osd.17 in repotags: no such image

Is there a way to save osd.17 and create the podman image manually?

Thanks in advance,
Patrick
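P.S. Since os.rmdir() usually fails with EBUSY when the directory is still a mountpoint, I wonder whether the old OSD data mount was simply still active when cephadm tried to remove it. Would something like the following be a sensible check before retrying the adoption? (Just a guess on my side, not a confirmed fix.)

    # is the legacy OSD directory still mounted?
    findmnt /var/lib/ceph/osd/ceph-17
    # if it is, unmount it and retry the adoption
    umount /var/lib/ceph/osd/ceph-17
    cephadm adopt --style legacy --name osd.17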
[ceph-users] setup problem for ingress + SSL for RGW
Hi,

Our cluster runs Pacific on Rocky 8. We have 3 RGWs running on port 7480. I tried to set up an ingress service with a yaml definition of the service: no luck.

service_type: ingress
service_id: rgw.myceph.be
placement:
  hosts:
    - ceph001
    - ceph002
    - ceph003
spec:
  backend_service: rgw.myceph.be
  virtual_ip: 192.168.0.10
  frontend_port: 443
  monitor_port: 9000
  ssl_cert: |
    -----BEGIN PRIVATE KEY-----
    ...
    -----END PRIVATE KEY-----
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----

I tried to set up the ingress service with the dashboard... still no luck. So I started debugging the problem.

1. Even though I entered the certificate and the private key in the form, Ceph complained about a missing haproxy.pem.key file. I added the file manually in the container definition folder, and the haproxy containers started!

2. Looking at the monitoring page of HAProxy, I realized that there was no backend server defined. In the form, I had selected manually the servers running the rgw. In the container definition folder, the backend definition of haproxy.cfg looks like:

...
backend backend
    option forwardfor
    balance static-rr
    option httpchk HEAD / HTTP/1.0

No mention of the servers or of port 7480. Once again, I added the definitions manually:

    server ceph001 192.168.0.1:7480 check
    server ceph004 192.168.0.2:7480 check
    server ceph008 192.168.0.2:7480 check

and redeployed the containers. Now it's working.

Any idea?

Patrick
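For completeness, this is roughly how I applied the spec and checked the resulting daemons (the file name ingress.yaml is just what I call the spec above locally):

    ceph orch apply -i ingress.yaml
    # confirm the ingress service and its haproxy/keepalived daemons were created
    ceph orch ls ingress
    ceph orch ps --daemon-type haproxy
    ceph orch ps --daemon-type keepalived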
[ceph-users] How to submit a bug report?
Hi,

I suspect a bug in cephadm when configuring the ingress service for rgw. Our production cluster was upgraded continuously from Luminous to Pacific. When configuring the ingress service for rgw, the generated haproxy.cfg is incomplete. The same yaml file applied on our test cluster does the job.

Regards,
Patrick
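In case it is useful for the report, this is how I plan to dump the applied ingress spec on both clusters, so I can diff what cephadm actually stored on each of them:

    # export the ingress service spec as cephadm sees it
    ceph orch ls ingress --export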
[ceph-users] Unable to deploy a new manager in Octopus
Hi,

On my test cluster, I migrated from Nautilus to Octopus and then converted most of the daemons to cephadm. I had a lot of problems with podman 1.6.4 on CentOS 7 through an https proxy, because my servers are on a private network. Now I'm unable to deploy new managers and the cluster is in a bizarre situation:

[root@cepht003 f5a025f9-fbe8-4506-8769-453902eb28d6]# ceph -s
  cluster:
    id:     f5a025f9-fbe8-4506-8769-453902eb28d6
    health: HEALTH_WARN
            client is using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            failed to probe daemons or devices
            42 stray daemon(s) not managed by cephadm
            2 stray host(s) with 39 daemon(s) not managed by cephadm
            1 daemons have recently crashed

  services:
    mon: 5 daemons, quorum cepht003,cepht002,cepht001,cepht004,cephtstor01 (age 19m)
    mgr: cepht004.wyibzh(active, since 29m), standbys: cepht003.aa
    mds: fsdup:1 fsec:1 {fsdup:0=fsdup.cepht001.opiyzk=up:active,fsec:0=fsec.cepht003.giatub=up:active} 7 up:standby
    osd: 40 osds: 40 up (since 92m), 40 in (since 3d)
    rgw: 2 daemons active (cepht001, cepht004)

  task status:

  data:
    pools:   18 pools, 577 pgs
    objects: 6.32k objects, 24 GiB
    usage:   80 GiB used, 102 TiB / 102 TiB avail
    pgs:     577 active+clean

[root@cepht003 f5a025f9-fbe8-4506-8769-453902eb28d6]# ceph orch ps
NAME                         HOST         STATUS         REFRESHED  AGE  VERSION  IMAGE NAME                 IMAGE ID      CONTAINER ID
mds.fdec.cepht004.vbuphb     cepht004     running (62m)  47s ago    4h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  5fad10ffc981
mds.fdec.cephtstor01.gtxsnr  cephtstor01  running (24m)  46s ago    24m  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  24e837f6ac8a
mds.fdup.cepht001.nydfzs     cepht001     running (2h)   47s ago    2h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  b1880e343ece
mds.fdup.cepht003.thsnbk     cepht003     running (34m)  45s ago    34m  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  ddd4e395e7b3
mds.fsdup.cepht001.opiyzk    cepht001     running (4h)   47s ago    4h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  ad081f718863
mds.fsdup.cepht004.cfnxxw    cepht004     running (62m)  47s ago    20h  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  c6feed82af8f
mds.fsec.cepht002.uebrlc     cepht002     running (20m)  47s ago    20m  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  836f448c5708
mds.fsec.cepht003.giatub     cepht003     running (76m)  45s ago    5h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  f235957145cb
mgr.cepht003.aa              cepht003     stopped        45s ago    20h  15.2.6   quay.io/ceph/ceph:v15.2.6  f16a759354cc  770d7cf078ad
mgr.cepht004.wyibzh          cepht004     unknown        47s ago    20h  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  6baa0f625271
mon.cepht001                 cepht001     running (4h)   47s ago    4h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  e7f24769153c
mon.cepht002                 cepht002     running (20m)  47s ago    20m  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  dbb5be113201
mon.cepht003                 cepht003     running (76m)  45s ago    5h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  6c2d6707b3fe
mon.cepht004                 cepht004     running (62m)  47s ago    4h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  7986b598fd17
mon.cephtstor01              cephtstor01  running (93m)  46s ago    2h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  dbd9255aab10
osd.10                       cephtstor01  running (93m)  46s ago    2h   15.2.16  quay.io/ceph/ceph:v15      8d5775c85c6a  01b07c4a75f7

When I try to create a new mgr, I get:

[ceph: root@cepht002 /]# ceph orch daemon add mgr cepht002
Error EINVAL: cephadm exited with an error code: 1, stderr:Deploy daemon mgr.cepht002.kqhnbt ...
Verifying port 8443 ...
ERROR: TCP Port(s) '8443' required for mgr already in use

But nothing runs on that port:

[root@cepht002 f5a025f9-fbe8-4506-8769-453902eb28d6]# ss -lntu
Netid  State   Recv-Q  Send-Q  Local Address:Port      Peer Address:Port
udp    UNCONN  0       0            127.0.0.1:323      *:*
tcp    LISTEN  0       128     192.168.64.152:6789     *:*
tcp    LISTEN  0       128     192.168.64.152:6800     *:*
tcp    LISTEN  0       128     192.168.64.152:6801     *:*
tcp    LISTEN  0       128                  *:22       *:*
tcp    LISTEN  0       100          127.0.0.1:25       *:*
tcp    LISTEN  0       128          127.0.0.1:6010     *:*
tcp    LISTEN  0       128                  *:10050    *:*
tcp    LISTEN  0       128     192.168.64.152:3300     *:*

I get the same error with the command "ceph orch apply mgr ...", and the same on each node of the cluster. I find no answer on Google...

Any idea?

Patrick
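In case it matters: as far as I understand, 8443 is the default SSL port of the dashboard module, so I also checked which ports the dashboard is configured to use and whether any process on the host actually claims 8443 (not sure this is the right track):

    # dashboard ports as configured cluster-wide
    ceph config get mgr mgr/dashboard/ssl_server_port
    ceph config get mgr mgr/dashboard/server_port
    # any listener on 8443, with the owning process
    ss -ltnp | grep 8443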
[ceph-users] ceph orch: list of scheduled tasks
Hi,

When you change the configuration of your cluster with "ceph orch apply ..." or "ceph orch daemon ...", tasks are scheduled:

[root@cephc003 ~]# ceph orch apply mgr --placement="cephc001 cephc002 cephc003"
Scheduled mgr update...

Is there a way to list all the pending tasks?

Regards,
Patrick
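For now I fall back on comparing running vs. expected daemon counts and watching the cephadm module log, which at least shows when a scheduled change is picked up, but it is not a real task list:

    # running vs. target daemon counts per service
    ceph orch ls
    # follow what the cephadm module is doing
    ceph -W cephadm
    # or show its recent log entries
    ceph log last cephadm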
[ceph-users] all PGs remapped after OSD server reinstallation (Pacific)
Hi,

I use a Ceph test infrastructure with only two storage servers running the OSDs. Objects are replicated between these servers:

[ceph: root@cepht001 /]# ceph osd dump | grep 'replicated size'
pool 1 '.rgw.root' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 237 flags hashpspool stripe_width 0 application rgw
pool 2 'default.rgw.control' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 239 flags hashpspool stripe_width 0 application rgw
pool 3 'default.rgw.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 243 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.log' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 244 flags hashpspool stripe_width 0 application rgw
pool 6 'rbd_dup' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 975 lfor 0/975/973 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 7 'cephfs_metadata' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 1121 lfor 0/1121/1119 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 8 'cephfs_data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1005 lfor 0/1005/1003 flags hashpspool stripe_width 0 application cephfs
pool 9 'device_health_metrics' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 11476 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth

[ceph: root@cepht001 /]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

The Ceph version is 16.2.9 (Pacific).

I reinstalled one storage server (same Ceph version), following these steps:
- set the noout flag
- stop all OSDs on this server
- back up all OSD definitions in /var/lib/ceph//osd.X
- back up all symlinks related to the OSDs in /etc/systemd/system/ceph-.target.wants
- reinstall the OS
- reinstall cephadm, keyring, ...
- move a monitor to this server in order to recreate the /var/lib/ceph/ and /etc/systemd/system/ceph-.target.wants trees
- restore the OSD definitions and the OSD systemd symlinks
- systemctl daemon-reload
- systemctl restart ceph-.target

(As an alternative, I restored the OSD definitions in /var/lib/ceph/ and did a redeploy of each OSD to recreate the symlinks in systemd.)

All daemons are seen as running by the orchestrator and the health of the cluster is OK, but all the PGs are remapped and half of the objects are misplaced... as if the restored OSDs were seen as a new and different group of OSDs.
[ceph: root@cepht001 /]# ceph -s
  cluster:
    id:     1f0f76fa-7d62-43b9-b9d2-ee87da10fc32
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cepht001,cephtstor01,cephtstor02 (age 116m)
    mgr: cepht002.bxlxvc(active, since 18m), standbys: cepht003.ldxygn, cepht001.ljtuai
    mds: 1/1 daemons up, 2 standby
    osd: 41 osds: 41 up (since 2h), 41 in (since 2h); 209 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 209 pgs
    objects: 3.36k objects, 12 GiB
    usage:   29 GiB used, 104 TiB / 104 TiB avail
    pgs:     3361/6722 objects misplaced (50.000%)
             209 active+clean+remapped

How can I recover from this situation? Is there a better way to achieve the OS reinstallation than the steps I followed?

Thanks for your help,
Patrick
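P.S. One thing I still have to rule out (just my own guess: a changed host bucket or CRUSH weight would be one plausible explanation for a 50% remap with only two hosts) is whether the reinstalled server re-registered its OSDs under a different name or position in the CRUSH map:

    # do the restored OSDs sit under the expected host bucket, with the same weights?
    ceph osd tree
    # full CRUSH map, to compare with a dump taken before the reinstallation, if available
    ceph osd crush dump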