[ceph-users] Re: cephadm to setup wal/db on nvme

2023-08-25 Thread Satish Patel
Thank you for the reply.

I have created two device classes, SSD and NVMe, and assigned them to CRUSH rules:

$ ceph osd crush rule ls
replicated_rule
ssd_pool
nvme_pool
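
For context, device-class rules and test pools like these are usually created
along the following lines. This is only a sketch using the rule, class and pool
names that appear in this thread, not necessarily the exact commands that were
run here:

# replicated rule bound to a device class: <name> <root> <failure-domain> <class>
ceph osd crush rule create-replicated ssd_pool default host ssd
ceph osd crush rule create-replicated nvme_pool default host nvme

# pools pinned to each rule (PG counts are just an example)
ceph osd pool create test-ssd 64 64 replicated ssd_pool
ceph osd pool create test-nvme 64 64 replicated nvme_pool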


In the benchmarks the NVMe pool is the worst performer; the SSD pool shows much
better results than the NVMe pool. The NVMe model is Samsung_SSD_980_PRO_1TB.

### NVMe pool benchmark with 3x replication

# rados -p test-nvme -t 64 -b 4096 bench 10 write
hints = 1
Maintaining 64 concurrent writes of 4096 bytes to objects of size 4096 for
up to 10 seconds or 0 objects
Object prefix: benchmark_data_os-ctrl1_1931595
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)   avg lat(s)
    0       0         0         0         0         0            -            0
    1      64      5541      5477   21.3917   21.3945    0.0134898    0.0116529
    2      64     11209     11145   21.7641   22.1406   0.00939951    0.0114506
    3      64     17036     16972   22.0956   22.7617   0.00938263    0.0112938
    4      64     23187     23123   22.5776   24.0273   0.00863939    0.0110473
    5      64     29753     29689   23.1911   25.6484   0.00925603    0.0107662
    6      64     36222     36158   23.5369   25.2695    0.0100759     0.010606
    7      63     42997     42934   23.9551   26.4688   0.00902186    0.0104246
    8      64     49859     49795   24.3102   26.8008   0.00884379    0.0102765
    9      64     56429     56365   24.4601   25.6641   0.00989885    0.0102124
   10      31     62727     62696   24.4869   24.7305    0.0115833    0.0102027
Total time run: 10.0064
Total writes made:  62727
Write size: 4096
Object size:4096
Bandwidth (MB/sec): 24.4871
Stddev Bandwidth:   1.85423
Max bandwidth (MB/sec): 26.8008   <   Only 26 MB/s for the NVMe disk
Min bandwidth (MB/sec): 21.3945
Average IOPS:   6268
Stddev IOPS:474.683
Max IOPS:   6861
Min IOPS:   5477
Average Latency(s): 0.0102022
Stddev Latency(s):  0.00170505
Max latency(s): 0.0365743
Min latency(s): 0.00641319
Cleaning up (deleting benchmark objects)
Removed 62727 objects
Clean up completed and total clean up time :8.23223



### SSD pool benchmark

(venv-openstack) root@os-ctrl1:~# rados -p test-ssd -t 64 -b 4096 bench 10 write
hints = 1
Maintaining 64 concurrent writes of 4096 bytes to objects of size 4096 for
up to 10 seconds or 0 objects
Object prefix: benchmark_data_os-ctrl1_1933383
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)   avg lat(s)
    0       0         0         0         0         0            -            0
    1      63     43839     43776   170.972       171  0.000991462   0.00145833
    2      64     92198     92134   179.921   188.898   0.00211419     0.001387
    3      64    141917    141853   184.675   194.215   0.00106326   0.00135174
    4      63    193151    193088   188.534   200.137   0.00179379   0.00132423
    5      63    243104    243041   189.847   195.129  0.000831263   0.00131512
    6      63    291045    290982   189.413    187.27   0.00120208   0.00131807
    7      64    341295    341231   190.391   196.285   0.00102127   0.00131137
    8      63    393336    393273   191.999   203.289  0.000958149   0.00130041
    9      63    442459    442396   191.983   191.887   0.00123453   0.00130053
Total time run: 10.0008
Total writes made:  488729
Write size: 4096
Object size:4096
Bandwidth (MB/sec): 190.894
Stddev Bandwidth:   9.35224
Max bandwidth (MB/sec): 203.289
Min bandwidth (MB/sec): 171
Average IOPS:   48868
Stddev IOPS:2394.17
Max IOPS:   52042
Min IOPS:   43776
Average Latency(s): 0.00130796
Stddev Latency(s):  0.000604629
Max latency(s): 0.0268462
Min latency(s): 0.000628738
Cleaning up (deleting benchmark objects)
Removed 488729 objects
Clean up completed and total clean up time :8.84114








On Wed, Aug 23, 2023 at 1:25 PM Adam King  wrote:

> This should be possible by specifying "data_devices" and "db_devices"
> fields in the OSD spec file, each with different filters. There are some
> examples in the docs
> (https://docs.ceph.com/en/latest/cephadm/services/osd/#the-simple-case) that
> show roughly how that's done, and other sections
> (https://docs.ceph.com/en/latest/cephadm/services/osd/#filters) that go
> more in depth on the different filtering options available, so you can try
> to find one that works for your disks. You can check the output of "ceph
> orch device ls --format json | jq" to see things like what cephadm
> considers the model, size, etc. of the devices to be, for use in the
> filtering.
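
(For reference, a minimal spec of the kind described above could look like the
following for the hardware in this thread. The size and model filters are only
an example; verify them against "ceph orch device ls" before applying.)

service_type: osd
service_id: osd_ssd_data_nvme_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    size: '2TB:'                      # only the 2.9TB SSDs become data devices
  db_devices:
    model: Samsung_SSD_980_PRO_1TB    # the 1TB NVMe holds the WAL/DB

Applied with something like "ceph orch apply -i osd-spec.yaml", cephadm and
ceph-volume should then carve one block.db LV per OSD out of the NVMe, so no
manual partitioning should be needed.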
>
> On Wed, Aug 23, 2023 at 1:13 PM Satish Patel  wrote:
>
>> Folks,
>>
>> I have 3 nodes with each having 1x NvME (1TB) and 3x 2.9TB SSD. Trying to
>> build ceph storage using cephadm on Ubuntu 22.04 distro.
>>
>> If I want to use NvME for Journaling (WAL/DB) for my SSD based 

[ceph-users] cephadm to setup wal/db on nvme

2023-08-23 Thread Satish Patel
Folks,

I have 3 nodes, each with 1x NVMe (1TB) and 3x 2.9TB SSDs. I am trying to
build Ceph storage using cephadm on the Ubuntu 22.04 distro.

If I want to use the NVMe for journaling (WAL/DB) for my SSD-based OSDs, how
does cephadm handle it?

I am trying to find a document that explains how to tell cephadm to deploy the
WAL/DB on the NVMe so it can speed up writes. Do I need to create a partition
for each OSD, or will cephadm create them?

Help me to understand how it works and is it worth doing?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm docker registry

2023-05-09 Thread Satish Patel
Folks,

I am trying to install Ceph on a 10-node cluster and am planning to use
cephadm. My question is: if I add new nodes to this cluster next year, which
Docker image version will cephadm use for the new nodes?

Is there a local registry option? Can I create one to copy the images locally?
How does cephadm control which images it uses?
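
In broad strokes, cephadm records the image the cluster is currently running
(pinned by digest), so a node added later normally gets the same version as
the rest of the cluster rather than whatever is newest. A rough sketch of the
usual knobs follows; the registry host is a placeholder and the config keys
should be double-checked against the cephadm docs for your release:

# image base cephadm deploys new daemons from
ceph config get mgr mgr/cephadm/container_image_base

# mirror the release into a local registry, log in, and point the cluster at it
ceph cephadm registry-login registry.local:5000 <user> <password>
ceph config set global container_image registry.local:5000/ceph/ceph:v17.2.6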
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] [Quincy] Module 'devicehealth' has failed: disk I/O error

2023-02-09 Thread Satish Patel
Folks,

Any idea what is going on? I am running a 3-node Quincy cluster (for
OpenStack), and today I suddenly noticed the following error. I found a
reference link but am not sure whether that is my issue or not:
https://tracker.ceph.com/issues/51974

root@ceph1:~# ceph -s
  cluster:
id: cd748128-a3ea-11ed-9e46-c309158fad32
health: HEALTH_ERR

1 mgr modules have recently crashed

  services:
mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 2d)
mgr: ceph1.ckfkeb(active, since 6h), standbys: ceph2.aaptny
osd: 9 osds: 9 up (since 2d), 9 in (since 2d)

  data:
pools:   4 pools, 128 pgs
objects: 1.18k objects, 4.7 GiB
usage:   17 GiB used, 16 TiB / 16 TiB avail
pgs: 128 active+clean



root@ceph1:~# ceph health
HEALTH_ERR Module 'devicehealth' has failed: disk I/O error; 1 mgr modules
have recently crashed
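
(Aside: once the underlying cause is addressed, this failed-module state can
usually be cleared by archiving the crash and restarting the mgr, roughly as
follows; the daemon name is the one shown in the ceph -s output above.)

ceph crash archive-all                      # acknowledge the recorded crash(es)
ceph orch daemon restart mgr.ceph1.ckfkeb   # restart the mgr so devicehealth reloads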
root@ceph1:~# ceph crash ls
ID                                                                 ENTITY            NEW
2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035   mgr.ceph1.ckfkeb  *
root@ceph1:~# ceph crash info
2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035
{
"backtrace": [
"  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373,
in serve\nself.scrape_all()",
"  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425,
in scrape_all\nself.put_device_metrics(device, data)",
"  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500,
in put_device_metrics\nself._create_device(devid)",
"  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487,
in _create_device\ncursor = self.db.execute(SQL, (devid,))",
"sqlite3.OperationalError: disk I/O error"
],
"ceph_version": "17.2.5",
"crash_id":
"2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035",
"entity_name": "mgr.ceph1.ckfkeb",
"mgr_module": "devicehealth",
"mgr_module_caller": "PyModuleRunner::serve",
"mgr_python_exception": "OperationalError",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-mgr",
"stack_sig":
"7e506cc2729d5a18403f0373447bb825b42aafa2405fb0e5cfffc2896b093ed8",
"timestamp": "2023-02-07T00:07:12.739187Z",
"utsname_hostname": "ceph1",
"utsname_machine": "x86_64",
"utsname_release": "5.15.0-58-generic",
"utsname_sysname": "Linux",
"utsname_version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [cephadm] Found duplicate OSDs

2022-10-21 Thread Satish Patel
Hi Eugen,

My error cleared up by itself. It looks like it just took some time; now I am
not seeing any errors and the output is very clean. Thank you so much.




On Fri, Oct 21, 2022 at 1:46 PM Eugen Block  wrote:

> Do you still see it with 'cephadm ls' on that node? If yes, you could
> try 'cephadm rm-daemon --name osd.3'. Or you can try it with the
> orchestrator: ceph orch daemon rm ...
> I don't have the exact command at the moment; you should check the docs.
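
(Spelled out, the two options Eugen mentions look roughly like this; the fsid
is a placeholder for your cluster's fsid.)

# on the node that still lists the stale daemon
cephadm ls | grep osd.3
cephadm rm-daemon --name osd.3 --fsid <cluster-fsid> --force

# or centrally, via the orchestrator
ceph orch daemon rm osd.3 --force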
>
> Zitat von Satish Patel :
>
> > Hi Eugen,
> >
> > I have delected osd.3 directory from datastorn4 node as you mentioned but
> > still i am seeing that duplicate osd in ps output.
> >
> > root@datastorn1:~# ceph orch ps | grep osd.3
> > osd.3  datastorn4stopped  5m
> > ago   3w-42.6G 
> > osd.3  datastorn5running (3w) 5m
> > ago   3w2587M42.6G  17.2.3 0912465dcea5  d139f8a1234b
> >
> > How do I clean up permanently?
> >
> >
> > On Fri, Oct 21, 2022 at 6:24 AM Eugen Block  wrote:
> >
> >> Hi,
> >>
> >> it looks like the OSDs haven't been cleaned up after removing them. Do
> >> you see the osd directory in /var/lib/ceph//osd.3 on datastorn4?
> >> Just remove the osd.3 directory, then cephadm won't try to activate it.
> >>
> >>
> >> Zitat von Satish Patel :
> >>
> >> > Folks,
> >> >
> >> > I have deployed 15 OSDs node clusters using cephadm and encount
> duplicate
> >> > OSD on one of the nodes and am not sure how to clean that up.
> >> >
> >> > root@datastorn1:~# ceph health
> >> > HEALTH_WARN 1 failed cephadm daemon(s); 1 pool(s) have no replicas
> >> > configured
> >> >
> >> > osd.3 is duplicated on two nodes, i would like to remove it from
> >> > datastorn4 but I'm not sure how to remove it. In the ceph osd tree I
> am
> >> not
> >> > seeing any duplicate.
> >> >
> >> > root@datastorn1:~# ceph orch ps | grep osd.3
> >> > osd.3  datastorn4stopped
> 7m
> >> > ago   3w-42.6G 
> >> > osd.3  datastorn5running (3w)
>  7m
> >> > ago   3w2584M42.6G  17.2.3 0912465dcea5  d139f8a1234b
> >> >
> >> >
> >> > Getting following error in logs
> >> >
> >> > 2022-10-21T09:10:45.226872+ mgr.datastorn1.nciiiu (mgr.14188)
> >> 1098186 :
> >> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on
> >> datastorn4,
> >> > osd.3 in status running on datastorn5
> >> > 2022-10-21T09:11:46.254979+ mgr.datastorn1.nciiiu (mgr.14188)
> >> 1098221 :
> >> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on
> >> datastorn4,
> >> > osd.3 in status running on datastorn5
> >> > 2022-10-21T09:12:53.009252+ mgr.datastorn1.nciiiu (mgr.14188)
> >> 1098256 :
> >> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on
> >> datastorn4,
> >> > osd.3 in status running on datastorn5
> >> > 2022-10-21T09:13:59.283251+ mgr.datastorn1.nciiiu (mgr.14188)
> >> 1098293 :
> >> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on
> >> datastorn4,
> >> > osd.3 in status running on datastorn5
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [cephadm] Found duplicate OSDs

2022-10-21 Thread Satish Patel
Hi Eugen,

I have deleted the osd.3 directory from the datastorn4 node as you mentioned,
but I am still seeing the duplicate OSD in the ps output.

root@datastorn1:~# ceph orch ps | grep osd.3
osd.3  datastorn4stopped  5m
ago   3w-42.6G 
osd.3  datastorn5running (3w) 5m
ago   3w2587M42.6G  17.2.3 0912465dcea5  d139f8a1234b

How do I clean up permanently?


On Fri, Oct 21, 2022 at 6:24 AM Eugen Block  wrote:

> Hi,
>
> it looks like the OSDs haven't been cleaned up after removing them. Do
> you see the osd directory in /var/lib/ceph//osd.3 on datastorn4?
> Just remove the osd.3 directory, then cephadm won't try to activate it.
>
>
> Zitat von Satish Patel :
>
> > Folks,
> >
> > I have deployed 15 OSDs node clusters using cephadm and encount duplicate
> > OSD on one of the nodes and am not sure how to clean that up.
> >
> > root@datastorn1:~# ceph health
> > HEALTH_WARN 1 failed cephadm daemon(s); 1 pool(s) have no replicas
> > configured
> >
> > osd.3 is duplicated on two nodes, i would like to remove it from
> > datastorn4 but I'm not sure how to remove it. In the ceph osd tree I am
> not
> > seeing any duplicate.
> >
> > root@datastorn1:~# ceph orch ps | grep osd.3
> > osd.3  datastorn4stopped  7m
> > ago   3w-42.6G 
> > osd.3  datastorn5running (3w) 7m
> > ago   3w2584M42.6G  17.2.3 0912465dcea5  d139f8a1234b
> >
> >
> > Getting following error in logs
> >
> > 2022-10-21T09:10:45.226872+ mgr.datastorn1.nciiiu (mgr.14188)
> 1098186 :
> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on
> datastorn4,
> > osd.3 in status running on datastorn5
> > 2022-10-21T09:11:46.254979+ mgr.datastorn1.nciiiu (mgr.14188)
> 1098221 :
> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on
> datastorn4,
> > osd.3 in status running on datastorn5
> > 2022-10-21T09:12:53.009252+ mgr.datastorn1.nciiiu (mgr.14188)
> 1098256 :
> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on
> datastorn4,
> > osd.3 in status running on datastorn5
> > 2022-10-21T09:13:59.283251+ mgr.datastorn1.nciiiu (mgr.14188)
> 1098293 :
> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on
> datastorn4,
> > osd.3 in status running on datastorn5
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] [cephadm] Found duplicate OSDs

2022-10-21 Thread Satish Patel
Folks,

I have deployed a 15-OSD-node cluster using cephadm and encountered a duplicate
OSD on one of the nodes, and I am not sure how to clean that up.

root@datastorn1:~# ceph health
HEALTH_WARN 1 failed cephadm daemon(s); 1 pool(s) have no replicas
configured

osd.3 is duplicated on two nodes. I would like to remove it from datastorn4,
but I'm not sure how to remove it. In the ceph osd tree I am not seeing any
duplicate.

root@datastorn1:~# ceph orch ps | grep osd.3
osd.3  datastorn4  stopped       7m ago  3w      -  42.6G  <unknown>  <unknown>     <unknown>
osd.3  datastorn5  running (3w)  7m ago  3w  2584M  42.6G  17.2.3     0912465dcea5  d139f8a1234b


I am getting the following errors in the logs:

2022-10-21T09:10:45.226872+ mgr.datastorn1.nciiiu (mgr.14188) 1098186 :
cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
osd.3 in status running on datastorn5
2022-10-21T09:11:46.254979+ mgr.datastorn1.nciiiu (mgr.14188) 1098221 :
cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
osd.3 in status running on datastorn5
2022-10-21T09:12:53.009252+ mgr.datastorn1.nciiiu (mgr.14188) 1098256 :
cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
osd.3 in status running on datastorn5
2022-10-21T09:13:59.283251+ mgr.datastorn1.nciiiu (mgr.14188) 1098293 :
cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
osd.3 in status running on datastorn5
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: strange osd error during add disk

2022-09-30 Thread Satish Patel
Hi Dominique,

How do I check using the cephadm shell? I am new to cephadm :)

https://paste.opendev.org/show/b4egkEdAkCWSkT3VRyO9/
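
(For anyone following along, "using the cephadm shell" boils down to something
like the following on the affected node; the hostname is a placeholder.)

cephadm shell            # opens a container with the cluster config and keyring mounted
ceph -s                  # run inside the shell to confirm the node can reach the cluster
ceph orch device ls <hostname> --refresh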

On Fri, Sep 30, 2022 at 6:20 AM Dominique Ramaekers <
dominique.ramaek...@cometal.be> wrote:

>
> Ceph.conf isn't available on that node/container.
>
> Wat happens if you try to start a cephadm shell on that node?
>
>
> > -Oorspronkelijk bericht-
> > Van: Satish Patel 
> > Verzonden: donderdag 29 september 2022 21:45
> > Aan: ceph-users 
> > Onderwerp: [ceph-users] Re: strange osd error during add disk
> >
> > Bump! Any suggestions?
> >
> > On Wed, Sep 28, 2022 at 4:26 PM Satish Patel 
> wrote:
> >
> > > Folks,
> > >
> > > I have 15 nodes for ceph and each node has a 160TB disk attached. I am
> > > using cephadm quincy release and all 14 nodes have been added except
> > > one node which is giving a very strange error during adding it. I have
> > > put all logs here
> > https://paste.opendev.org/show/bbSKwlSLyANMbrlhwzXL/
> > >
> > > In short, the following error logs I am getting. I have tried zap to
> > > disk and re-add but getting the following error every single time.
> > >
> > > [2022-09-28 20:13:28,644][ceph_volume.main][INFO  ] Running command:
> > > ceph-volume  lvm list --format json
> > > [2022-09-28 20:13:28,644][ceph_volume.main][ERROR ] ignoring inability
> > > to load ceph.conf Traceback (most recent call last):
> > >   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line
> 145,
> > in main
> > > conf.ceph = configuration.load(conf.path)
> > >   File "/usr/lib/python3.6/site-packages/ceph_volume/configuration.py",
> > line 51, in load
> > > raise exceptions.ConfigurationError(abspath=abspath)
> > >
> > >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email
> > to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: strange osd error during add disk

2022-09-30 Thread Satish Patel
Hi Alvaro,

I have seen that error on every node, even functional and working ones, so
I assume it's not important: "ceph_volume.exceptions.ConfigurationError:
Unable to load expected Ceph config at: /etc/ceph/ceph.conf"

Maybe cephadm runs inside Docker and that is why it just gives this
warning.

On Thu, Sep 29, 2022 at 4:29 PM Alvaro Soto  wrote:

> Where is your ceph.conf file?
>
> ceph_volume.exceptions.ConfigurationError: Unable to load expected Ceph 
> config at: /etc/ceph/ceph.conf
>
>
>
> ---
> Alvaro Soto.
>
> Note: My work hours may not be your work hours. Please do not feel the
> need to respond during a time that is not convenient for you.
> --
> Great people talk about ideas,
> ordinary people talk about things,
> small people talk... about other people.
>
> On Thu, Sep 29, 2022, 2:45 PM Satish Patel  wrote:
>
>> Bump! Any suggestions?
>>
>> On Wed, Sep 28, 2022 at 4:26 PM Satish Patel 
>> wrote:
>>
>> > Folks,
>> >
>> > I have 15 nodes for ceph and each node has a 160TB disk attached. I am
>> > using cephadm quincy release and all 14 nodes have been added except one
>> > node which is giving a very strange error during adding it. I have put
>> all
>> > logs here https://paste.opendev.org/show/bbSKwlSLyANMbrlhwzXL/
>> >
>> > In short, the following error logs I am getting. I have tried zap to
>> disk
>> > and re-add but getting the following error every single time.
>> >
>> > [2022-09-28 20:13:28,644][ceph_volume.main][INFO  ] Running command:
>> ceph-volume  lvm list --format json
>> > [2022-09-28 20:13:28,644][ceph_volume.main][ERROR ] ignoring inability
>> to load ceph.conf
>> > Traceback (most recent call last):
>> >   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line
>> 145, in main
>> > conf.ceph = configuration.load(conf.path)
>> >   File "/usr/lib/python3.6/site-packages/ceph_volume/configuration.py",
>> line 51, in load
>> > raise exceptions.ConfigurationError(abspath=abspath)
>> >
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: strange osd error during add disk

2022-09-29 Thread Satish Patel
Bump! Any suggestions?

On Wed, Sep 28, 2022 at 4:26 PM Satish Patel  wrote:

> Folks,
>
> I have 15 nodes for ceph and each node has a 160TB disk attached. I am
> using cephadm quincy release and all 14 nodes have been added except one
> node which is giving a very strange error during adding it. I have put all
> logs here https://paste.opendev.org/show/bbSKwlSLyANMbrlhwzXL/
>
> In short, the following error logs I am getting. I have tried zap to disk
> and re-add but getting the following error every single time.
>
> [2022-09-28 20:13:28,644][ceph_volume.main][INFO  ] Running command: 
> ceph-volume  lvm list --format json
> [2022-09-28 20:13:28,644][ceph_volume.main][ERROR ] ignoring inability to 
> load ceph.conf
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 145, in 
> main
> conf.ceph = configuration.load(conf.path)
>   File "/usr/lib/python3.6/site-packages/ceph_volume/configuration.py", line 
> 51, in load
> raise exceptions.ConfigurationError(abspath=abspath)
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] strange osd error during add disk

2022-09-28 Thread Satish Patel
Folks,

I have 15 nodes for Ceph, and each node has a 160TB disk attached. I am
using the cephadm Quincy release; 14 of the nodes have been added, but one
node gives a very strange error when I try to add it. I have put all the
logs here: https://paste.opendev.org/show/bbSKwlSLyANMbrlhwzXL/

In short, these are the error logs I am getting. I have tried zapping the
disk and re-adding it, but I get the same error every single time.

[2022-09-28 20:13:28,644][ceph_volume.main][INFO  ] Running command:
ceph-volume  lvm list --format json
[2022-09-28 20:13:28,644][ceph_volume.main][ERROR ] ignoring inability
to load ceph.conf
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 145, in main
conf.ceph = configuration.load(conf.path)
  File "/usr/lib/python3.6/site-packages/ceph_volume/configuration.py",
line 51, in load
raise exceptions.ConfigurationError(abspath=abspath)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [cephadm] not detecting new disk

2022-09-03 Thread Satish Patel
I used sgdisk to zap the disk and wipe out everything, but it is still not
being detected.

Is there any other good way to wipe it out?

Sent from my iPhone

> On Sep 3, 2022, at 2:53 AM, Eugen Block  wrote:
> 
> It is detecting the disk, but it contains a partition table so it can’t use 
> it. Wipe the disk properly first.
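
(A more thorough wipe than sgdisk alone looks roughly like this; the host and
device names are taken from the original post below.)

# via the orchestrator: clears the partition table, LVM metadata and signatures
ceph orch device zap os-ctrl-1 /dev/sda --force

# or directly on the node, then let cephadm re-scan
wipefs --all /dev/sda
sgdisk --zap-all /dev/sda
ceph orch device ls --refresh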
> 
> Zitat von Satish Patel :
> 
>> Folks,
>> 
>> I have created a new lab using cephadm and installed a new 1TB spinning
>> disk which is trying to add in a cluster but somehow ceph is not detecting
>> it.
>> 
>> $ parted /dev/sda print
>> Model: ATA WDC WD10EZEX-00B (scsi)
>> Disk /dev/sda: 1000GB
>> Sector size (logical/physical): 512B/4096B
>> Partition Table: gpt
>> Disk Flags:
>> 
>> Number  Start  End  Size  File system  Name  Flags
>> 
>> Trying following but no luck
>> 
>> $ cephadm shell -- ceph orch daemon add osd os-ctrl-1:/dev/sda
>> Inferring fsid 351f8a26-2b31-11ed-b555-494149d85a01
>> Using recent ceph image
>> quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa
>> Error EINVAL: Traceback (most recent call last):
>>  File "/usr/share/ceph/mgr/mgr_module.py", line 1446, in _handle_command
>>return self.handle_command(inbuf, cmd)
>>  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 171, in
>> handle_command
>>return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>>  File "/usr/share/ceph/mgr/mgr_module.py", line 414, in call
>>return self.func(mgr, **kwargs)
>>  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in
>> 
>>wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
>> # noqa: E731
>>  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper
>>return func(*args, **kwargs)
>>  File "/usr/share/ceph/mgr/orchestrator/module.py", line 803, in
>> _daemon_add_osd
>>raise_if_exception(completion)
>>  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 228, in
>> raise_if_exception
>>raise e
>> RuntimeError: cephadm exited with an error code: 1, stderr:Inferring config
>> /var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/mon.os-ctrl-1/config
>> Non-zero exit code 2 from /usr/bin/docker run --rm --ipc=host
>> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
>> --privileged --group-add=disk --init -e CONTAINER_IMAGE=
>> quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa
>> -e NODE_NAME=os-ctrl-1 -e CEPH_USE_RANDOM_NONCE=1 -e
>> CEPH_VOLUME_OSDSPEC_AFFINITY=None -e CEPH_VOLUME_SKIP_RESTORECON=yes -e
>> CEPH_VOLUME_DEBUG=1 -v
>> /var/run/ceph/351f8a26-2b31-11ed-b555-494149d85a01:/var/run/ceph:z -v
>> /var/log/ceph/351f8a26-2b31-11ed-b555-494149d85a01:/var/log/ceph:z -v
>> /var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/crash:/var/lib/ceph/crash:z
>> -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
>> /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
>> /tmp/ceph-tmpznn3t_7i:/etc/ceph/ceph.conf:z -v
>> /tmp/ceph-tmpun8t5_ej:/var/lib/ceph/bootstrap-osd/ceph.keyring:z
>> quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa
>> lvm batch --no-auto /dev/sda --yes --no-systemd
>> /usr/bin/docker: stderr usage: ceph-volume lvm batch [-h] [--db-devices
>> [DB_DEVICES [DB_DEVICES ...]]]
>> /usr/bin/docker: stderr  [--wal-devices
>> [WAL_DEVICES [WAL_DEVICES ...]]]
>> /usr/bin/docker: stderr  [--journal-devices
>> [JOURNAL_DEVICES [JOURNAL_DEVICES ...]]]
>> /usr/bin/docker: stderr  [--auto] [--no-auto]
>> [--bluestore] [--filestore]
>> /usr/bin/docker: stderr  [--report] [--yes]
>> /usr/bin/docker: stderr  [--format
>> {json,json-pretty,pretty}] [--dmcrypt]
>> /usr/bin/docker: stderr  [--crush-device-class
>> CRUSH_DEVICE_CLASS]
>> /usr/bin/docker: stderr  [--no-systemd]
>> /usr/bin/docker: stderr  [--osds-per-device
>> OSDS_PER_DEVICE]
>> /usr/bin/docker: stderr  [--data-slots
>> DATA_SLOTS]
>> /usr/bin/docker: stderr  [--block-db-size
>> BLOCK_DB_SIZE]
>> /usr/bin/docker: stderr  [--block-db-slots
>> BLOCK_DB_SLOTS]
>> /usr/bin/docker: stderr

[ceph-users] [cephadm] not detecting new disk

2022-09-02 Thread Satish Patel
Folks,

I have created a new lab using cephadm and installed a new 1TB spinning
disk which is trying to add in a cluster but somehow ceph is not detecting
it.

$ parted /dev/sda print
Model: ATA WDC WD10EZEX-00B (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:

Number  Start  End  Size  File system  Name  Flags

Trying following but no luck

$ cephadm shell -- ceph orch daemon add osd os-ctrl-1:/dev/sda
Inferring fsid 351f8a26-2b31-11ed-b555-494149d85a01
Using recent ceph image
quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1446, in _handle_command
return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 171, in
handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 414, in call
return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in

wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
 # noqa: E731
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper
return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 803, in
_daemon_add_osd
raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 228, in
raise_if_exception
raise e
RuntimeError: cephadm exited with an error code: 1, stderr:Inferring config
/var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/mon.os-ctrl-1/config
Non-zero exit code 2 from /usr/bin/docker run --rm --ipc=host
--stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
--privileged --group-add=disk --init -e CONTAINER_IMAGE=
quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa
-e NODE_NAME=os-ctrl-1 -e CEPH_USE_RANDOM_NONCE=1 -e
CEPH_VOLUME_OSDSPEC_AFFINITY=None -e CEPH_VOLUME_SKIP_RESTORECON=yes -e
CEPH_VOLUME_DEBUG=1 -v
/var/run/ceph/351f8a26-2b31-11ed-b555-494149d85a01:/var/run/ceph:z -v
/var/log/ceph/351f8a26-2b31-11ed-b555-494149d85a01:/var/log/ceph:z -v
/var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/crash:/var/lib/ceph/crash:z
-v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
/run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
/tmp/ceph-tmpznn3t_7i:/etc/ceph/ceph.conf:z -v
/tmp/ceph-tmpun8t5_ej:/var/lib/ceph/bootstrap-osd/ceph.keyring:z
quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa
lvm batch --no-auto /dev/sda --yes --no-systemd
/usr/bin/docker: stderr usage: ceph-volume lvm batch [-h] [--db-devices
[DB_DEVICES [DB_DEVICES ...]]]
/usr/bin/docker: stderr  [--wal-devices
[WAL_DEVICES [WAL_DEVICES ...]]]
/usr/bin/docker: stderr  [--journal-devices
[JOURNAL_DEVICES [JOURNAL_DEVICES ...]]]
/usr/bin/docker: stderr  [--auto] [--no-auto]
[--bluestore] [--filestore]
/usr/bin/docker: stderr  [--report] [--yes]
/usr/bin/docker: stderr  [--format
{json,json-pretty,pretty}] [--dmcrypt]
/usr/bin/docker: stderr  [--crush-device-class
CRUSH_DEVICE_CLASS]
/usr/bin/docker: stderr  [--no-systemd]
/usr/bin/docker: stderr  [--osds-per-device
OSDS_PER_DEVICE]
/usr/bin/docker: stderr  [--data-slots
DATA_SLOTS]
/usr/bin/docker: stderr  [--block-db-size
BLOCK_DB_SIZE]
/usr/bin/docker: stderr  [--block-db-slots
BLOCK_DB_SLOTS]
/usr/bin/docker: stderr  [--block-wal-size
BLOCK_WAL_SIZE]
/usr/bin/docker: stderr  [--block-wal-slots
BLOCK_WAL_SLOTS]
/usr/bin/docker: stderr  [--journal-size
JOURNAL_SIZE]
/usr/bin/docker: stderr  [--journal-slots
JOURNAL_SLOTS] [--prepare]
/usr/bin/docker: stderr  [--osd-ids [OSD_IDS
[OSD_IDS ...]]]
/usr/bin/docker: stderr  [DEVICES [DEVICES ...]]
/usr/bin/docker: stderr ceph-volume lvm batch: error: GPT headers found,
they must be removed on: /dev/sda
Traceback (most recent call last):
  File
"/var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
line 8971, in 
main()
  File
"/var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
line 8959, in main
r = ctx.func(ctx)
  File
"/var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
line 1902, in _infer_config
return func(ctx)
  File

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
Adam,

On Google someone suggested a manual upgrade using the following method, and
it seems to work, but now I am stuck on the MON redeploy.. haha

Go to the mgr container directory, edit the
/var/lib/ceph/$fsid/mgr.$whatever/unit.run file, change the image to
ceph/ceph:v16.2.10 for both mgrs, and restart the mgr service using
systemctl restart 

After a few minutes I noticed Docker had downloaded the image, and I can see
both mgrs running with version 16.2.10.

Then I tried to do an upgrade and nothing happened, so I used the same
manual method on the MON node and ran the command ceph orch daemon redeploy
mon.ceph1, which destroyed the mon service, and now I can't do anything
because I don't have a mon. ceph -s and all other commands hang.

I am trying to find out how to get the mon back :)



On Fri, Sep 2, 2022 at 3:34 PM Satish Patel  wrote:

> Yes, i have stopped upgrade and those log before upgrade
>
> On Fri, Sep 2, 2022 at 3:27 PM Adam King  wrote:
>
>> I don't think the number of mons should have any effect on this. Looking
>> at your logs, the interesting thing is that all the messages are so close
>> together. Was this before having stopped the upgrade?
>>
>> On Fri, Sep 2, 2022 at 2:53 PM Satish Patel  wrote:
>>
>>> Do you think this is because I have only a single MON daemon running?  I
>>> have only two nodes.
>>>
>>> On Fri, Sep 2, 2022 at 2:39 PM Satish Patel 
>>> wrote:
>>>
>>>> Adam,
>>>>
>>>> I have enabled debug and my logs flood with the following. I am going
>>>> to try some stuff from your provided mailing list and see..
>>>>
>>>> root@ceph1:~# tail -f
>>>> /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log
>>>> 2022-09-02T18:38:21.754391+ mgr.ceph2.huidoh (mgr.344392) 211198 :
>>>> cephadm [DBG] 0 OSDs are scheduled for removal: []
>>>> 2022-09-02T18:38:21.754519+ mgr.ceph2.huidoh (mgr.344392) 211199 :
>>>> cephadm [DBG] Saving [] to store
>>>> 2022-09-02T18:38:21.757155+ mgr.ceph2.huidoh (mgr.344392) 211200 :
>>>> cephadm [DBG] refreshing hosts and daemons
>>>> 2022-09-02T18:38:21.758065+ mgr.ceph2.huidoh (mgr.344392) 211201 :
>>>> cephadm [DBG] _check_for_strays
>>>> 2022-09-02T18:38:21.758334+ mgr.ceph2.huidoh (mgr.344392) 211202 :
>>>> cephadm [DBG] 0 OSDs are scheduled for removal: []
>>>> 2022-09-02T18:38:21.758455+ mgr.ceph2.huidoh (mgr.344392) 211203 :
>>>> cephadm [DBG] Saving [] to store
>>>> 2022-09-02T18:38:21.761001+ mgr.ceph2.huidoh (mgr.344392) 211204 :
>>>> cephadm [DBG] refreshing hosts and daemons
>>>> 2022-09-02T18:38:21.762092+ mgr.ceph2.huidoh (mgr.344392) 211205 :
>>>> cephadm [DBG] _check_for_strays
>>>> 2022-09-02T18:38:21.762357+ mgr.ceph2.huidoh (mgr.344392) 211206 :
>>>> cephadm [DBG] 0 OSDs are scheduled for removal: []
>>>> 2022-09-02T18:38:21.762480+ mgr.ceph2.huidoh (mgr.344392) 211207 :
>>>> cephadm [DBG] Saving [] to store
>>>>
>>>> On Fri, Sep 2, 2022 at 12:17 PM Adam King  wrote:
>>>>
>>>>> hmm, okay. It seems like cephadm is stuck in general rather than an
>>>>> issue specific to the upgrade. I'd first make sure the orchestrator isn't
>>>>> paused (just running "ceph orch resume" should be enough, it's 
>>>>> idempotent).
>>>>>
>>>>> Beyond that, there was someone else who had an issue with things
>>>>> getting stuck that was resolved in this thread
>>>>> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M
>>>>> <https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M>
>>>>>  that
>>>>> might be worth a look.
>>>>>
>>>>> If you haven't already, it's possible stopping the upgrade is a good
>>>>> idea, as maybe that's interfering with it getting to the point where it
>>>>> does the redeploy.
>>>>>
>>>>> If none of those help, it might be worth setting the log level to
>>>>> debug and seeing where things are ending up ("ceph config set mgr
>>>>> mgr/cephadm/log_to_cluster_level debug; ceph orch ps --refresh" then
>>>>> waiting a few minutes before running "ceph log last 100 debug cephadm" 
>>>>> (not
>>>>> 100% on format of that command, if it fails try just "ceph log last
>&g

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
Yes, I have stopped the upgrade, and those logs are from before I stopped it.

On Fri, Sep 2, 2022 at 3:27 PM Adam King  wrote:

> I don't think the number of mons should have any effect on this. Looking
> at your logs, the interesting thing is that all the messages are so close
> together. Was this before having stopped the upgrade?
>
> On Fri, Sep 2, 2022 at 2:53 PM Satish Patel  wrote:
>
>> Do you think this is because I have only a single MON daemon running?  I
>> have only two nodes.
>>
>> On Fri, Sep 2, 2022 at 2:39 PM Satish Patel  wrote:
>>
>>> Adam,
>>>
>>> I have enabled debug and my logs flood with the following. I am going to
>>> try some stuff from your provided mailing list and see..
>>>
>>> root@ceph1:~# tail -f
>>> /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log
>>> 2022-09-02T18:38:21.754391+ mgr.ceph2.huidoh (mgr.344392) 211198 :
>>> cephadm [DBG] 0 OSDs are scheduled for removal: []
>>> 2022-09-02T18:38:21.754519+ mgr.ceph2.huidoh (mgr.344392) 211199 :
>>> cephadm [DBG] Saving [] to store
>>> 2022-09-02T18:38:21.757155+ mgr.ceph2.huidoh (mgr.344392) 211200 :
>>> cephadm [DBG] refreshing hosts and daemons
>>> 2022-09-02T18:38:21.758065+ mgr.ceph2.huidoh (mgr.344392) 211201 :
>>> cephadm [DBG] _check_for_strays
>>> 2022-09-02T18:38:21.758334+ mgr.ceph2.huidoh (mgr.344392) 211202 :
>>> cephadm [DBG] 0 OSDs are scheduled for removal: []
>>> 2022-09-02T18:38:21.758455+ mgr.ceph2.huidoh (mgr.344392) 211203 :
>>> cephadm [DBG] Saving [] to store
>>> 2022-09-02T18:38:21.761001+ mgr.ceph2.huidoh (mgr.344392) 211204 :
>>> cephadm [DBG] refreshing hosts and daemons
>>> 2022-09-02T18:38:21.762092+ mgr.ceph2.huidoh (mgr.344392) 211205 :
>>> cephadm [DBG] _check_for_strays
>>> 2022-09-02T18:38:21.762357+ mgr.ceph2.huidoh (mgr.344392) 211206 :
>>> cephadm [DBG] 0 OSDs are scheduled for removal: []
>>> 2022-09-02T18:38:21.762480+ mgr.ceph2.huidoh (mgr.344392) 211207 :
>>> cephadm [DBG] Saving [] to store
>>>
>>> On Fri, Sep 2, 2022 at 12:17 PM Adam King  wrote:
>>>
>>>> hmm, okay. It seems like cephadm is stuck in general rather than an
>>>> issue specific to the upgrade. I'd first make sure the orchestrator isn't
>>>> paused (just running "ceph orch resume" should be enough, it's idempotent).
>>>>
>>>> Beyond that, there was someone else who had an issue with things
>>>> getting stuck that was resolved in this thread
>>>> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M
>>>> <https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M>
>>>>  that
>>>> might be worth a look.
>>>>
>>>> If you haven't already, it's possible stopping the upgrade is a good
>>>> idea, as maybe that's interfering with it getting to the point where it
>>>> does the redeploy.
>>>>
>>>> If none of those help, it might be worth setting the log level to debug
>>>> and seeing where things are ending up ("ceph config set mgr
>>>> mgr/cephadm/log_to_cluster_level debug; ceph orch ps --refresh" then
>>>> waiting a few minutes before running "ceph log last 100 debug cephadm" (not
>>>> 100% on format of that command, if it fails try just "ceph log last
>>>> cephadm"). We could maybe get more info on why it's not performing the
>>>> redeploy from those debug logs. Just remember to set the log level back
>>>> after 'ceph config set mgr mgr/cephadm/log_to_cluster_level info' as debug
>>>> logs are quite verbose.
>>>>
>>>> On Fri, Sep 2, 2022 at 11:39 AM Satish Patel 
>>>> wrote:
>>>>
>>>>> Hi Adam,
>>>>>
>>>>> As you said, i did following
>>>>>
>>>>> $ ceph orch daemon redeploy mgr.ceph1.smfvfd
>>>>> quay.io/ceph/ceph:v16.2.10
>>>>>
>>>>> Noticed following line in logs but then no activity nothing, still
>>>>> standby mgr running in older version
>>>>>
>>>>> 2022-09-02T15:35:45.753093+ mgr.ceph2.huidoh (mgr.344392) 2226 :
>>>>> cephadm [INF] Schedule redeploy daemon mgr.ceph1.smfvfd
>>>>> 2022-09-02T15:36:17.279190+ mgr.ceph2.huidoh (mgr.344392) 2245 :
>>>

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
Do you think this is because I have only a single MON daemon running?  I
have only two nodes.

On Fri, Sep 2, 2022 at 2:39 PM Satish Patel  wrote:

> Adam,
>
> I have enabled debug and my logs flood with the following. I am going to
> try some stuff from your provided mailing list and see..
>
> root@ceph1:~# tail -f
> /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log
> 2022-09-02T18:38:21.754391+ mgr.ceph2.huidoh (mgr.344392) 211198 :
> cephadm [DBG] 0 OSDs are scheduled for removal: []
> 2022-09-02T18:38:21.754519+ mgr.ceph2.huidoh (mgr.344392) 211199 :
> cephadm [DBG] Saving [] to store
> 2022-09-02T18:38:21.757155+ mgr.ceph2.huidoh (mgr.344392) 211200 :
> cephadm [DBG] refreshing hosts and daemons
> 2022-09-02T18:38:21.758065+ mgr.ceph2.huidoh (mgr.344392) 211201 :
> cephadm [DBG] _check_for_strays
> 2022-09-02T18:38:21.758334+ mgr.ceph2.huidoh (mgr.344392) 211202 :
> cephadm [DBG] 0 OSDs are scheduled for removal: []
> 2022-09-02T18:38:21.758455+ mgr.ceph2.huidoh (mgr.344392) 211203 :
> cephadm [DBG] Saving [] to store
> 2022-09-02T18:38:21.761001+ mgr.ceph2.huidoh (mgr.344392) 211204 :
> cephadm [DBG] refreshing hosts and daemons
> 2022-09-02T18:38:21.762092+ mgr.ceph2.huidoh (mgr.344392) 211205 :
> cephadm [DBG] _check_for_strays
> 2022-09-02T18:38:21.762357+ mgr.ceph2.huidoh (mgr.344392) 211206 :
> cephadm [DBG] 0 OSDs are scheduled for removal: []
> 2022-09-02T18:38:21.762480+ mgr.ceph2.huidoh (mgr.344392) 211207 :
> cephadm [DBG] Saving [] to store
>
> On Fri, Sep 2, 2022 at 12:17 PM Adam King  wrote:
>
>> hmm, okay. It seems like cephadm is stuck in general rather than an issue
>> specific to the upgrade. I'd first make sure the orchestrator isn't paused
>> (just running "ceph orch resume" should be enough, it's idempotent).
>>
>> Beyond that, there was someone else who had an issue with things getting
>> stuck that was resolved in this thread
>> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M
>> <https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M>
>>  that
>> might be worth a look.
>>
>> If you haven't already, it's possible stopping the upgrade is a good
>> idea, as maybe that's interfering with it getting to the point where it
>> does the redeploy.
>>
>> If none of those help, it might be worth setting the log level to debug
>> and seeing where things are ending up ("ceph config set mgr
>> mgr/cephadm/log_to_cluster_level debug; ceph orch ps --refresh" then
>> waiting a few minutes before running "ceph log last 100 debug cephadm" (not
>> 100% on format of that command, if it fails try just "ceph log last
>> cephadm"). We could maybe get more info on why it's not performing the
>> redeploy from those debug logs. Just remember to set the log level back
>> after 'ceph config set mgr mgr/cephadm/log_to_cluster_level info' as debug
>> logs are quite verbose.
>>
>> On Fri, Sep 2, 2022 at 11:39 AM Satish Patel 
>> wrote:
>>
>>> Hi Adam,
>>>
>>> As you said, i did following
>>>
>>> $ ceph orch daemon redeploy mgr.ceph1.smfvfd  quay.io/ceph/ceph:v16.2.10
>>>
>>> Noticed following line in logs but then no activity nothing, still
>>> standby mgr running in older version
>>>
>>> 2022-09-02T15:35:45.753093+ mgr.ceph2.huidoh (mgr.344392) 2226 :
>>> cephadm [INF] Schedule redeploy daemon mgr.ceph1.smfvfd
>>> 2022-09-02T15:36:17.279190+ mgr.ceph2.huidoh (mgr.344392) 2245 :
>>> cephadm [INF] refreshing ceph2 facts
>>> 2022-09-02T15:36:17.984478+ mgr.ceph2.huidoh (mgr.344392) 2246 :
>>> cephadm [INF] refreshing ceph1 facts
>>> 2022-09-02T15:37:17.663730+ mgr.ceph2.huidoh (mgr.344392) 2284 :
>>> cephadm [INF] refreshing ceph2 facts
>>> 2022-09-02T15:37:18.386586+ mgr.ceph2.huidoh (mgr.344392) 2285 :
>>> cephadm [INF] refreshing ceph1 facts
>>>
>>> I am not seeing any image get downloaded also
>>>
>>> root@ceph1:~# docker image ls
>>> REPOSITORY TAG   IMAGE ID   CREATED
>>> SIZE
>>> quay.io/ceph/ceph  v15   93146564743f   3 weeks ago
>>> 1.2GB
>>> quay.io/ceph/ceph-grafana  8.3.5 dad864ee21e9   4 months
>>> ago558MB
>>> quay.io/prometheus/prometheus  v2.33.4   514e6a882f6e   6 months
>>> ago204MB
>>> quay.io/prom

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
Adam,

I have enabled debug and my logs are flooded with the following. I am going
to try some of the suggestions from the mailing list thread you provided and
see..

root@ceph1:~# tail -f
/var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log
2022-09-02T18:38:21.754391+ mgr.ceph2.huidoh (mgr.344392) 211198 :
cephadm [DBG] 0 OSDs are scheduled for removal: []
2022-09-02T18:38:21.754519+ mgr.ceph2.huidoh (mgr.344392) 211199 :
cephadm [DBG] Saving [] to store
2022-09-02T18:38:21.757155+ mgr.ceph2.huidoh (mgr.344392) 211200 :
cephadm [DBG] refreshing hosts and daemons
2022-09-02T18:38:21.758065+ mgr.ceph2.huidoh (mgr.344392) 211201 :
cephadm [DBG] _check_for_strays
2022-09-02T18:38:21.758334+ mgr.ceph2.huidoh (mgr.344392) 211202 :
cephadm [DBG] 0 OSDs are scheduled for removal: []
2022-09-02T18:38:21.758455+ mgr.ceph2.huidoh (mgr.344392) 211203 :
cephadm [DBG] Saving [] to store
2022-09-02T18:38:21.761001+ mgr.ceph2.huidoh (mgr.344392) 211204 :
cephadm [DBG] refreshing hosts and daemons
2022-09-02T18:38:21.762092+ mgr.ceph2.huidoh (mgr.344392) 211205 :
cephadm [DBG] _check_for_strays
2022-09-02T18:38:21.762357+ mgr.ceph2.huidoh (mgr.344392) 211206 :
cephadm [DBG] 0 OSDs are scheduled for removal: []
2022-09-02T18:38:21.762480+ mgr.ceph2.huidoh (mgr.344392) 211207 :
cephadm [DBG] Saving [] to store

On Fri, Sep 2, 2022 at 12:17 PM Adam King  wrote:

> hmm, okay. It seems like cephadm is stuck in general rather than an issue
> specific to the upgrade. I'd first make sure the orchestrator isn't paused
> (just running "ceph orch resume" should be enough, it's idempotent).
>
> Beyond that, there was someone else who had an issue with things getting
> stuck that was resolved in this thread
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M
> <https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M>
>  that
> might be worth a look.
>
> If you haven't already, it's possible stopping the upgrade is a good idea,
> as maybe that's interfering with it getting to the point where it does the
> redeploy.
>
> If none of those help, it might be worth setting the log level to debug
> and seeing where things are ending up ("ceph config set mgr
> mgr/cephadm/log_to_cluster_level debug; ceph orch ps --refresh" then
> waiting a few minutes before running "ceph log last 100 debug cephadm" (not
> 100% on format of that command, if it fails try just "ceph log last
> cephadm"). We could maybe get more info on why it's not performing the
> redeploy from those debug logs. Just remember to set the log level back
> after 'ceph config set mgr mgr/cephadm/log_to_cluster_level info' as debug
> logs are quite verbose.
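
(Condensed, that debug sequence is the following; as Adam notes, the exact
"ceph log last" syntax may need adjusting.)

ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph orch ps --refresh
# wait a few minutes, then:
ceph log last 100 debug cephadm     # or simply: ceph log last cephadm
ceph config set mgr mgr/cephadm/log_to_cluster_level info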
>
> On Fri, Sep 2, 2022 at 11:39 AM Satish Patel  wrote:
>
>> Hi Adam,
>>
>> As you said, i did following
>>
>> $ ceph orch daemon redeploy mgr.ceph1.smfvfd  quay.io/ceph/ceph:v16.2.10
>>
>> Noticed following line in logs but then no activity nothing, still
>> standby mgr running in older version
>>
>> 2022-09-02T15:35:45.753093+ mgr.ceph2.huidoh (mgr.344392) 2226 :
>> cephadm [INF] Schedule redeploy daemon mgr.ceph1.smfvfd
>> 2022-09-02T15:36:17.279190+ mgr.ceph2.huidoh (mgr.344392) 2245 :
>> cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T15:36:17.984478+ mgr.ceph2.huidoh (mgr.344392) 2246 :
>> cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T15:37:17.663730+ mgr.ceph2.huidoh (mgr.344392) 2284 :
>> cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T15:37:18.386586+ mgr.ceph2.huidoh (mgr.344392) 2285 :
>> cephadm [INF] refreshing ceph1 facts
>>
>> I am not seeing any image get downloaded also
>>
>> root@ceph1:~# docker image ls
>> REPOSITORY TAG   IMAGE ID   CREATED
>>   SIZE
>> quay.io/ceph/ceph  v15   93146564743f   3 weeks ago
>> 1.2GB
>> quay.io/ceph/ceph-grafana  8.3.5 dad864ee21e9   4 months ago
>>558MB
>> quay.io/prometheus/prometheus  v2.33.4   514e6a882f6e   6 months ago
>>204MB
>> quay.io/prometheus/alertmanagerv0.23.0   ba2b418f427c   12 months
>> ago   57.5MB
>> quay.io/ceph/ceph-grafana  6.7.4 557c83e11646   13 months
>> ago   486MB
>> quay.io/prometheus/prometheus  v2.18.1   de242295e225   2 years ago
>> 140MB
>> quay.io/prometheus/alertmanagerv0.20.0   0881eb8f169f   2 years ago
>> 52.1MB
>> quay.io/prometheus/node-exporter   v0.18.1   e5a616e4b9cf   3 years ago
>> 22.9MB
>>
>>
>> On Fri, Sep 2, 2022

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
Hi Adam,

As you said, I did the following:

$ ceph orch daemon redeploy mgr.ceph1.smfvfd  quay.io/ceph/ceph:v16.2.10

I noticed the following line in the logs, but then no activity at all; the
standby mgr is still running the older version.

2022-09-02T15:35:45.753093+ mgr.ceph2.huidoh (mgr.344392) 2226 :
cephadm [INF] Schedule redeploy daemon mgr.ceph1.smfvfd
2022-09-02T15:36:17.279190+ mgr.ceph2.huidoh (mgr.344392) 2245 :
cephadm [INF] refreshing ceph2 facts
2022-09-02T15:36:17.984478+ mgr.ceph2.huidoh (mgr.344392) 2246 :
cephadm [INF] refreshing ceph1 facts
2022-09-02T15:37:17.663730+ mgr.ceph2.huidoh (mgr.344392) 2284 :
cephadm [INF] refreshing ceph2 facts
2022-09-02T15:37:18.386586+ mgr.ceph2.huidoh (mgr.344392) 2285 :
cephadm [INF] refreshing ceph1 facts

I am also not seeing any image get downloaded:

root@ceph1:~# docker image ls
REPOSITORY                         TAG       IMAGE ID       CREATED         SIZE
quay.io/ceph/ceph                  v15       93146564743f   3 weeks ago     1.2GB
quay.io/ceph/ceph-grafana          8.3.5     dad864ee21e9   4 months ago    558MB
quay.io/prometheus/prometheus      v2.33.4   514e6a882f6e   6 months ago    204MB
quay.io/prometheus/alertmanager    v0.23.0   ba2b418f427c   12 months ago   57.5MB
quay.io/ceph/ceph-grafana          6.7.4     557c83e11646   13 months ago   486MB
quay.io/prometheus/prometheus      v2.18.1   de242295e225   2 years ago     140MB
quay.io/prometheus/alertmanager    v0.20.0   0881eb8f169f   2 years ago     52.1MB
quay.io/prometheus/node-exporter   v0.18.1   e5a616e4b9cf   3 years ago     22.9MB


On Fri, Sep 2, 2022 at 11:06 AM Adam King  wrote:

> hmm, at this point, maybe we should just try manually upgrading the mgr
> daemons and then move from there. First, just stop the upgrade "ceph orch
> upgrade stop". If you figure out which of the two mgr daemons is the
> standby (it should say which one is active in "ceph -s" output) and then do
> a "ceph orch daemon redeploy  quay.io/ceph/ceph:v16.2.10"
> it should redeploy that specific mgr with the new version. You could then
> do a "ceph mgr fail" to swap which of the mgr daemons is active, then do
> another "ceph orch daemon redeploy 
> quay.io/ceph/ceph:v16.2.10" where the standby is now the other mgr still
> on 15.2.17. Once the mgr daemons are both upgraded to the new version, run
> a "ceph orch redeploy mgr" and then "ceph orch upgrade start --image
> quay.io/ceph/ceph:v16.2.10" and see if it goes better.
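
(Put together, the sequence Adam describes is roughly the following; which of
the two mgr daemons is the standby depends on your ceph -s output, so the two
redeploys may need to be swapped.)

ceph orch upgrade stop
ceph -s                                     # note which mgr is active
ceph orch daemon redeploy mgr.ceph1.smfvfd quay.io/ceph/ceph:v16.2.10   # standby first
ceph mgr fail                               # fail over so the upgraded mgr becomes active
ceph orch daemon redeploy mgr.ceph2.huidoh quay.io/ceph/ceph:v16.2.10   # then the other one
ceph orch redeploy mgr
ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10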
>
> On Fri, Sep 2, 2022 at 10:36 AM Satish Patel  wrote:
>
>> Hi Adam,
>>
>> I run the following command to upgrade but it looks like nothing is
>> happening
>>
>> $ ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10
>>
>> Status message is empty..
>>
>> root@ceph1:~# ceph orch upgrade status
>> {
>> "target_image": "quay.io/ceph/ceph:v16.2.10",
>> "in_progress": true,
>> "services_complete": [],
>> "message": ""
>> }
>>
>> Nothing in Logs
>>
>> root@ceph1:~# tail -f
>> /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log
>> 2022-09-02T14:31:52.597661+ mgr.ceph2.huidoh (mgr.344392) 174 :
>> cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:31:52.991450+ mgr.ceph2.huidoh (mgr.344392) 176 :
>> cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:32:52.965092+ mgr.ceph2.huidoh (mgr.344392) 207 :
>> cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:32:53.369789+ mgr.ceph2.huidoh (mgr.344392) 208 :
>> cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:33:53.367986+ mgr.ceph2.huidoh (mgr.344392) 239 :
>> cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:33:53.760427+ mgr.ceph2.huidoh (mgr.344392) 240 :
>> cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:34:53.754277+ mgr.ceph2.huidoh (mgr.344392) 272 :
>> cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:34:54.162503+ mgr.ceph2.huidoh (mgr.344392) 273 :
>> cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:35:54.133467+ mgr.ceph2.huidoh (mgr.344392) 305 :
>> cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:35:54.522171+ mgr.ceph2.huidoh (mgr.344392) 306 :
>> cephadm [INF] refreshing ceph1 facts
>>
>> In progress that mesg stuck there for long time
>>
>> root@ceph1:~# ceph -s
>>   cluster:
>> id: f270ad9e-1f6f-11ed-b6f8-a539d87379ea
>> health: HEALTH_OK
>>
>>   services:
>> mon: 1 daemons, quorum ceph1 (age 9h)
>> mgr: ceph2.huidoh(active, since 9m), standbys: ceph1.smfvfd
>> osd: 4 osds: 4 up (since 9h), 4 in (since 11h)
>>
>>   data:
>> 

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
Hi Adam,

I ran the following command to upgrade, but it looks like nothing is
happening:

$ ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10

The status message is empty:

root@ceph1:~# ceph orch upgrade status
{
"target_image": "quay.io/ceph/ceph:v16.2.10",
"in_progress": true,
"services_complete": [],
"message": ""
}

Nothing in the logs:

root@ceph1:~# tail -f
/var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log
2022-09-02T14:31:52.597661+ mgr.ceph2.huidoh (mgr.344392) 174 : cephadm
[INF] refreshing ceph2 facts
2022-09-02T14:31:52.991450+ mgr.ceph2.huidoh (mgr.344392) 176 : cephadm
[INF] refreshing ceph1 facts
2022-09-02T14:32:52.965092+ mgr.ceph2.huidoh (mgr.344392) 207 : cephadm
[INF] refreshing ceph2 facts
2022-09-02T14:32:53.369789+ mgr.ceph2.huidoh (mgr.344392) 208 : cephadm
[INF] refreshing ceph1 facts
2022-09-02T14:33:53.367986+ mgr.ceph2.huidoh (mgr.344392) 239 : cephadm
[INF] refreshing ceph2 facts
2022-09-02T14:33:53.760427+ mgr.ceph2.huidoh (mgr.344392) 240 : cephadm
[INF] refreshing ceph1 facts
2022-09-02T14:34:53.754277+ mgr.ceph2.huidoh (mgr.344392) 272 : cephadm
[INF] refreshing ceph2 facts
2022-09-02T14:34:54.162503+ mgr.ceph2.huidoh (mgr.344392) 273 : cephadm
[INF] refreshing ceph1 facts
2022-09-02T14:35:54.133467+ mgr.ceph2.huidoh (mgr.344392) 305 : cephadm
[INF] refreshing ceph2 facts
2022-09-02T14:35:54.522171+ mgr.ceph2.huidoh (mgr.344392) 306 : cephadm
[INF] refreshing ceph1 facts

The "in progress" message has been stuck there for a long time.

root@ceph1:~# ceph -s
  cluster:
id: f270ad9e-1f6f-11ed-b6f8-a539d87379ea
health: HEALTH_OK

  services:
mon: 1 daemons, quorum ceph1 (age 9h)
mgr: ceph2.huidoh(active, since 9m), standbys: ceph1.smfvfd
osd: 4 osds: 4 up (since 9h), 4 in (since 11h)

  data:
pools:   5 pools, 129 pgs
objects: 20.06k objects, 83 GiB
usage:   168 GiB used, 632 GiB / 800 GiB avail
pgs: 129 active+clean

  io:
client:   12 KiB/s wr, 0 op/s rd, 1 op/s wr

  progress:
Upgrade to quay.io/ceph/ceph:v16.2.10 (0s)
  [........]
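When the progress bar just sits there like this, a few commands are usually
worth running before anything else; this is only a rough checklist, with the
image tag being the target used here:

$ ceph orch upgrade status       # confirm the target image and any error message
$ ceph log last cephadm          # recent cephadm events, including failures
$ ceph -W cephadm                # follow the cephadm cluster log live
$ ceph orch upgrade stop         # if it never moves, stop it ...
$ ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10   # ... and start again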

On Fri, Sep 2, 2022 at 10:25 AM Satish Patel  wrote:

> It Looks like I did it with the following command.
>
> $ ceph orch daemon add mgr ceph2:10.73.0.192
>
> Now i can see two with same version 15.x
>
> root@ceph1:~# ceph orch ps --daemon-type mgr
> NAME  HOST   STATUS REFRESHED  AGE  VERSION  IMAGE
> NAME
>   IMAGE ID  CONTAINER ID
> mgr.ceph1.smfvfd  ceph1  running (8h)   41s ago8h   15.2.17
> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
>  93146564743f  1aab837306d2
> mgr.ceph2.huidoh  ceph2  running (60s)  110s ago   60s  15.2.17
> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
>  93146564743f  294fd6ab6c97
>
> On Fri, Sep 2, 2022 at 10:19 AM Satish Patel  wrote:
>
>> Let's come back to the original question: how to bring back the second
>> mgr?
>>
>> root@ceph1:~# ceph orch apply mgr 2
>> Scheduled mgr update...
>>
>> Nothing happened with above command, logs saying nothing
>>
>> 2022-09-02T14:16:20.407927+ mgr.ceph1.smfvfd (mgr.334626) 16939 :
>> cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:16:40.247195+ mgr.ceph1.smfvfd (mgr.334626) 16952 :
>> cephadm [INF] Saving service mgr spec with placement count:2
>> 2022-09-02T14:16:53.106919+ mgr.ceph1.smfvfd (mgr.334626) 16961 :
>> cephadm [INF] Saving service mgr spec with placement count:2
>> 2022-09-02T14:17:19.135203+ mgr.ceph1.smfvfd (mgr.334626) 16975 :
>> cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:17:20.780496+ mgr.ceph1.smfvfd (mgr.334626) 16977 :
>> cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:18:19.502034+ mgr.ceph1.smfvfd (mgr.334626) 17008 :
>> cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:18:21.127973+ mgr.ceph1.smfvfd (mgr.334626) 17010 :
>> cephadm [INF] refreshing ceph2 facts
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Sep 2, 2022 at 10:15 AM Satish Patel 
>> wrote:
>>
>>> Hi Adam,
>>>
>>> Wait..wait.. now it's working suddenly without doing anything.. very odd
>>>
>>> root@ceph1:~# ceph orch ls
>>> NAME  RUNNING  REFRESHED  AGE  PLACEMENTIMAGE NAME
>>>
>>>   IMAGE ID
>>> alertmanager  1/1  5s ago 2w   count:1
>>> quay.io/prometheus/alertmanager:v0.20.0
>>>0881eb8f169f
>>> crash 2/2  5s ago 2w   *
>>> quay.io/ceph/ceph:v15
>>>93146564743f

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
It looks like I did it with the following command.

$ ceph orch daemon add mgr ceph2:10.73.0.192

Now I can see two mgr daemons with the same version, 15.2.17:

root@ceph1:~# ceph orch ps --daemon-type mgr
NAME  HOST   STATUS REFRESHED  AGE  VERSION  IMAGE NAME

IMAGE ID  CONTAINER ID
mgr.ceph1.smfvfd  ceph1  running (8h)   41s ago8h   15.2.17
quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
 93146564743f  1aab837306d2
mgr.ceph2.huidoh  ceph2  running (60s)  110s ago   60s  15.2.17
quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
 93146564743f  294fd6ab6c97

On Fri, Sep 2, 2022 at 10:19 AM Satish Patel  wrote:

> Let's come back to the original question: how to bring back the second mgr?
>
> root@ceph1:~# ceph orch apply mgr 2
> Scheduled mgr update...
>
> Nothing happened with above command, logs saying nothing
>
> 2022-09-02T14:16:20.407927+ mgr.ceph1.smfvfd (mgr.334626) 16939 :
> cephadm [INF] refreshing ceph2 facts
> 2022-09-02T14:16:40.247195+ mgr.ceph1.smfvfd (mgr.334626) 16952 :
> cephadm [INF] Saving service mgr spec with placement count:2
> 2022-09-02T14:16:53.106919+ mgr.ceph1.smfvfd (mgr.334626) 16961 :
> cephadm [INF] Saving service mgr spec with placement count:2
> 2022-09-02T14:17:19.135203+ mgr.ceph1.smfvfd (mgr.334626) 16975 :
> cephadm [INF] refreshing ceph1 facts
> 2022-09-02T14:17:20.780496+ mgr.ceph1.smfvfd (mgr.334626) 16977 :
> cephadm [INF] refreshing ceph2 facts
> 2022-09-02T14:18:19.502034+ mgr.ceph1.smfvfd (mgr.334626) 17008 :
> cephadm [INF] refreshing ceph1 facts
> 2022-09-02T14:18:21.127973+ mgr.ceph1.smfvfd (mgr.334626) 17010 :
> cephadm [INF] refreshing ceph2 facts
>
>
>
>
>
>
>
> On Fri, Sep 2, 2022 at 10:15 AM Satish Patel  wrote:
>
>> Hi Adam,
>>
>> Wait..wait.. now it's working suddenly without doing anything.. very odd
>>
>> root@ceph1:~# ceph orch ls
>> NAME  RUNNING  REFRESHED  AGE  PLACEMENTIMAGE NAME
>>
>>   IMAGE ID
>> alertmanager  1/1  5s ago 2w   count:1
>> quay.io/prometheus/alertmanager:v0.20.0
>>0881eb8f169f
>> crash 2/2  5s ago 2w   *
>> quay.io/ceph/ceph:v15
>>93146564743f
>> grafana   1/1  5s ago 2w   count:1
>> quay.io/ceph/ceph-grafana:6.7.4
>>557c83e11646
>> mgr   1/2  5s ago 8h   count:2
>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
>>  93146564743f
>> mon   1/2  5s ago 8h   ceph1;ceph2
>> quay.io/ceph/ceph:v15
>>93146564743f
>> node-exporter 2/2  5s ago 2w   *
>> quay.io/prometheus/node-exporter:v0.18.1
>>   e5a616e4b9cf
>> osd.osd_spec_default  4/0  5s ago -
>> quay.io/ceph/ceph:v15
>>93146564743f
>> prometheus1/1  5s ago 2w   count:1
>> quay.io/prometheus/prometheus:v2.18.1
>>
>> On Fri, Sep 2, 2022 at 10:13 AM Satish Patel 
>> wrote:
>>
>>> I can see that in the output but I'm not sure how to get rid of it.
>>>
>>> root@ceph1:~# ceph orch ps --refresh
>>> NAME
>>>  HOST   STATUSREFRESHED  AGE  VERSIONIMAGE NAME
>>> IMAGE ID
>>>CONTAINER ID
>>> alertmanager.ceph1
>>>  ceph1  running (9h)  64s ago2w   0.20.0
>>> quay.io/prometheus/alertmanager:v0.20.0
>>>0881eb8f169f  ba804b555378
>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>  ceph2  stopped   65s ago-  
>>>  
>>> 
>>> crash.ceph1
>>>   ceph1  running (9h)  64s ago2w   15.2.17quay.io/ceph/ceph:v15
>>>
>>>  93146564743f  a3a431d834fc
>>> crash.ceph2
>>>   ceph2  running (9h)  65s ago13d  15.2.17quay.io/ceph/ceph:v15
>>>
>>>  93146564743f  3c963693ff2b
>>> grafana.ceph1
>>>   ceph1  running (9h)  64s ago2w   6.7.4
>>> quay.io/ceph/ceph-grafana:6.7.4
>>>557c83e11646  7583a8dc4c61
>>> mgr.ceph1.smfvfd
>>>  ceph1  running (8h)  64s ago8h   15.2.17
>>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
>>>  93146564743f  1aab837306d2
>>> mon.ceph1
>>>   ceph1  running 

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
Let's come back to the original question: how to bring back the second mgr?

root@ceph1:~# ceph orch apply mgr 2
Scheduled mgr update...

Nothing happened with the above command, and the logs show nothing relevant:

2022-09-02T14:16:20.407927+ mgr.ceph1.smfvfd (mgr.334626) 16939 :
cephadm [INF] refreshing ceph2 facts
2022-09-02T14:16:40.247195+ mgr.ceph1.smfvfd (mgr.334626) 16952 :
cephadm [INF] Saving service mgr spec with placement count:2
2022-09-02T14:16:53.106919+ mgr.ceph1.smfvfd (mgr.334626) 16961 :
cephadm [INF] Saving service mgr spec with placement count:2
2022-09-02T14:17:19.135203+ mgr.ceph1.smfvfd (mgr.334626) 16975 :
cephadm [INF] refreshing ceph1 facts
2022-09-02T14:17:20.780496+ mgr.ceph1.smfvfd (mgr.334626) 16977 :
cephadm [INF] refreshing ceph2 facts
2022-09-02T14:18:19.502034+ mgr.ceph1.smfvfd (mgr.334626) 17008 :
cephadm [INF] refreshing ceph1 facts
2022-09-02T14:18:21.127973+ mgr.ceph1.smfvfd (mgr.334626) 17010 :
cephadm [INF] refreshing ceph2 facts
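If "ceph orch apply mgr 2" is accepted but a second daemon never appears, it
can help to look at the stored spec and pin the placement to explicit hosts.
A sketch, using the hostnames from this cluster:

$ ceph orch ls mgr --export                      # show the mgr spec cephadm has stored
$ ceph orch apply mgr --placement="ceph1 ceph2"  # ask for one mgr on each named host
$ ceph orch ps --daemon-type mgr --refresh       # check what was actually deployed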







On Fri, Sep 2, 2022 at 10:15 AM Satish Patel  wrote:

> Hi Adam,
>
> Wait..wait.. now it's working suddenly without doing anything.. very odd
>
> root@ceph1:~# ceph orch ls
> NAME  RUNNING  REFRESHED  AGE  PLACEMENTIMAGE NAME
>
> IMAGE ID
> alertmanager  1/1  5s ago 2w   count:1
> quay.io/prometheus/alertmanager:v0.20.0
>  0881eb8f169f
> crash 2/2  5s ago 2w   *
> quay.io/ceph/ceph:v15
>  93146564743f
> grafana   1/1  5s ago 2w   count:1
> quay.io/ceph/ceph-grafana:6.7.4
>  557c83e11646
> mgr   1/2  5s ago 8h   count:2
> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
>  93146564743f
> mon   1/2  5s ago 8h   ceph1;ceph2
> quay.io/ceph/ceph:v15
>  93146564743f
> node-exporter 2/2  5s ago 2w   *
> quay.io/prometheus/node-exporter:v0.18.1
>   e5a616e4b9cf
> osd.osd_spec_default  4/0  5s ago -
> quay.io/ceph/ceph:v15
>  93146564743f
> prometheus1/1  5s ago 2w   count:1
> quay.io/prometheus/prometheus:v2.18.1
>
> On Fri, Sep 2, 2022 at 10:13 AM Satish Patel  wrote:
>
>> I can see that in the output but I'm not sure how to get rid of it.
>>
>> root@ceph1:~# ceph orch ps --refresh
>> NAME
>>  HOST   STATUSREFRESHED  AGE  VERSIONIMAGE NAME
>> IMAGE ID
>>CONTAINER ID
>> alertmanager.ceph1
>>  ceph1  running (9h)  64s ago2w   0.20.0
>> quay.io/prometheus/alertmanager:v0.20.0
>>0881eb8f169f  ba804b555378
>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>  ceph2  stopped   65s ago-  
>>  
>> 
>> crash.ceph1
>> ceph1  running (9h)  64s ago2w   15.2.17quay.io/ceph/ceph:v15
>>
>>  93146564743f  a3a431d834fc
>> crash.ceph2
>> ceph2  running (9h)  65s ago13d  15.2.17quay.io/ceph/ceph:v15
>>
>>  93146564743f  3c963693ff2b
>> grafana.ceph1
>> ceph1  running (9h)  64s ago2w   6.7.4
>> quay.io/ceph/ceph-grafana:6.7.4
>>557c83e11646  7583a8dc4c61
>> mgr.ceph1.smfvfd
>>  ceph1  running (8h)  64s ago8h   15.2.17
>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
>>  93146564743f  1aab837306d2
>> mon.ceph1
>> ceph1  running (9h)  64s ago2w   15.2.17quay.io/ceph/ceph:v15
>>
>>  93146564743f  c1d155d8c7ad
>> node-exporter.ceph1
>> ceph1  running (9h)  64s ago2w   0.18.1
>> quay.io/prometheus/node-exporter:v0.18.1
>>   e5a616e4b9cf  2ff235fe0e42
>> node-exporter.ceph2
>> ceph2  running (9h)  65s ago13d  0.18.1
>> quay.io/prometheus/node-exporter:v0.18.1
>>   e5a616e4b9cf  17678b9ba602
>> osd.0
>> ceph1  running (9h)  64s ago13d  15.2.17quay.io/ceph/ceph:v15
>>
>>  93146564743f  d0fd73b777a3
>> osd.1
>> ceph1  running (9h)  64s ago13d  15.2.17quay.io/ceph/ceph:v15
>>
>>  93146564743f  049120e83102
>> osd.2
>> ceph2  running (9h)  65s ago13d  15.2.17quay.io/ceph/ceph:v15
>>
>>  93146564743f  8700e8cefd1f
>> osd.3
>> ceph2  running (9h)  65s ago13d  15.2.17quay.io/ceph/ceph:v15
>>
>>  93146564743f  9c71bc87ed16
>> prometheus.ceph1
>>  ceph1  running (9h)  64s ago2w   2.18.1
>>

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
Hi Adam,

Wait..wait.. now it's working suddenly without doing anything.. very odd

root@ceph1:~# ceph orch ls
NAME  RUNNING  REFRESHED  AGE  PLACEMENTIMAGE NAME

IMAGE ID
alertmanager  1/1  5s ago 2w   count:1
quay.io/prometheus/alertmanager:v0.20.0
   0881eb8f169f
crash 2/2  5s ago 2w   *
quay.io/ceph/ceph:v15
   93146564743f
grafana   1/1  5s ago 2w   count:1
quay.io/ceph/ceph-grafana:6.7.4
   557c83e11646
mgr   1/2  5s ago 8h   count:2
quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
 93146564743f
mon   1/2  5s ago 8h   ceph1;ceph2
quay.io/ceph/ceph:v15
   93146564743f
node-exporter 2/2  5s ago 2w   *
quay.io/prometheus/node-exporter:v0.18.1
e5a616e4b9cf
osd.osd_spec_default  4/0  5s ago -
quay.io/ceph/ceph:v15
   93146564743f
prometheus1/1  5s ago 2w   count:1
quay.io/prometheus/prometheus:v2.18.1

On Fri, Sep 2, 2022 at 10:13 AM Satish Patel  wrote:

> I can see that in the output but I'm not sure how to get rid of it.
>
> root@ceph1:~# ceph orch ps --refresh
> NAME
>  HOST   STATUSREFRESHED  AGE  VERSIONIMAGE NAME
> IMAGE ID
>CONTAINER ID
> alertmanager.ceph1
>  ceph1  running (9h)  64s ago2w   0.20.0
> quay.io/prometheus/alertmanager:v0.20.0
>  0881eb8f169f  ba804b555378
> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>  ceph2  stopped   65s ago-  
>  
> 
> crash.ceph1
> ceph1  running (9h)  64s ago2w   15.2.17quay.io/ceph/ceph:v15
>
>  93146564743f  a3a431d834fc
> crash.ceph2
> ceph2  running (9h)  65s ago13d  15.2.17quay.io/ceph/ceph:v15
>
>  93146564743f  3c963693ff2b
> grafana.ceph1
> ceph1  running (9h)  64s ago2w   6.7.4
> quay.io/ceph/ceph-grafana:6.7.4
>  557c83e11646  7583a8dc4c61
> mgr.ceph1.smfvfd
>  ceph1  running (8h)  64s ago8h   15.2.17
> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
>  93146564743f  1aab837306d2
> mon.ceph1
> ceph1  running (9h)  64s ago2w   15.2.17quay.io/ceph/ceph:v15
>
>  93146564743f  c1d155d8c7ad
> node-exporter.ceph1
> ceph1  running (9h)  64s ago2w   0.18.1
> quay.io/prometheus/node-exporter:v0.18.1
>   e5a616e4b9cf  2ff235fe0e42
> node-exporter.ceph2
> ceph2  running (9h)  65s ago13d  0.18.1
> quay.io/prometheus/node-exporter:v0.18.1
>   e5a616e4b9cf  17678b9ba602
> osd.0
> ceph1  running (9h)  64s ago13d  15.2.17quay.io/ceph/ceph:v15
>
>  93146564743f  d0fd73b777a3
> osd.1
> ceph1  running (9h)  64s ago13d  15.2.17quay.io/ceph/ceph:v15
>
>  93146564743f  049120e83102
> osd.2
> ceph2  running (9h)  65s ago13d  15.2.17quay.io/ceph/ceph:v15
>
>  93146564743f  8700e8cefd1f
> osd.3
> ceph2  running (9h)  65s ago13d  15.2.17quay.io/ceph/ceph:v15
>
>  93146564743f  9c71bc87ed16
> prometheus.ceph1
>  ceph1  running (9h)  64s ago2w   2.18.1
> quay.io/prometheus/prometheus:v2.18.1
>  de242295e225  74a538efd61e
>
> On Fri, Sep 2, 2022 at 10:10 AM Adam King  wrote:
>
>> maybe also a "ceph orch ps --refresh"? It might still have the old cached
>> daemon inventory from before you remove the files.
>>
>> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel  wrote:
>>
>>> Hi Adam,
>>>
>>> I have deleted file located here - rm
>>> /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>
>>> But still getting the same error, do i need to do anything else?
>>>
>>> On Fri, Sep 2, 2022 at 9:51 AM Adam King  wrote:
>>>
>>>> Okay, I'm wondering if this is an issue with version mismatch. Having
>>>> previously had a 16.2.10 mgr and then now having a 15.2.17 one that doesn't
>>>> expect this sort of thing to be present. Either way, I'd think just
>>>> deleting this cephadm.7ce656a8721deb5054c37b0cfb9038
>>>> 1522d521dde51fb0c5a2142314d663f63d (and any others like it) file would
>>>> be the way forward to get orch ls working again.
>>>>
>>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel 
>>>> wrote:
>>>>
>>>>> Hi Adam,
>>>>>
>>>>> In cephadm ls i found the following service but

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
I can see that in the output but I'm not sure how to get rid of it.

root@ceph1:~# ceph orch ps --refresh
NAME
 HOST   STATUSREFRESHED  AGE  VERSIONIMAGE NAME
IMAGE ID
   CONTAINER ID
alertmanager.ceph1
 ceph1  running (9h)  64s ago2w   0.20.0
quay.io/prometheus/alertmanager:v0.20.0
   0881eb8f169f  ba804b555378
cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
 ceph2  stopped   65s ago-  
 

crash.ceph1
ceph1  running (9h)  64s ago2w   15.2.17quay.io/ceph/ceph:v15
   93146564743f
 a3a431d834fc
crash.ceph2
ceph2  running (9h)  65s ago13d  15.2.17quay.io/ceph/ceph:v15
   93146564743f
 3c963693ff2b
grafana.ceph1
ceph1  running (9h)  64s ago2w   6.7.4
quay.io/ceph/ceph-grafana:6.7.4
   557c83e11646  7583a8dc4c61
mgr.ceph1.smfvfd
 ceph1  running (8h)  64s ago8h   15.2.17
quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
 93146564743f  1aab837306d2
mon.ceph1
ceph1  running (9h)  64s ago2w   15.2.17quay.io/ceph/ceph:v15
   93146564743f
 c1d155d8c7ad
node-exporter.ceph1
ceph1  running (9h)  64s ago2w   0.18.1
quay.io/prometheus/node-exporter:v0.18.1
e5a616e4b9cf  2ff235fe0e42
node-exporter.ceph2
ceph2  running (9h)  65s ago13d  0.18.1
quay.io/prometheus/node-exporter:v0.18.1
e5a616e4b9cf  17678b9ba602
osd.0
ceph1  running (9h)  64s ago13d  15.2.17quay.io/ceph/ceph:v15
   93146564743f
 d0fd73b777a3
osd.1
ceph1  running (9h)  64s ago13d  15.2.17quay.io/ceph/ceph:v15
   93146564743f
 049120e83102
osd.2
ceph2  running (9h)  65s ago13d  15.2.17quay.io/ceph/ceph:v15
   93146564743f
 8700e8cefd1f
osd.3
ceph2  running (9h)  65s ago13d  15.2.17quay.io/ceph/ceph:v15
   93146564743f
 9c71bc87ed16
prometheus.ceph1
 ceph1  running (9h)  64s ago2w   2.18.1
quay.io/prometheus/prometheus:v2.18.1
   de242295e225  74a538efd61e

On Fri, Sep 2, 2022 at 10:10 AM Adam King  wrote:

> maybe also a "ceph orch ps --refresh"? It might still have the old cached
> daemon inventory from before you remove the files.
>
> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel  wrote:
>
>> Hi Adam,
>>
>> I have deleted file located here - rm
>> /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>
>> But still getting the same error, do i need to do anything else?
>>
>> On Fri, Sep 2, 2022 at 9:51 AM Adam King  wrote:
>>
>>> Okay, I'm wondering if this is an issue with version mismatch. Having
>>> previously had a 16.2.10 mgr and then now having a 15.2.17 one that doesn't
>>> expect this sort of thing to be present. Either way, I'd think just
>>> deleting this cephadm.7ce656a8721deb5054c37b0cfb9038
>>> 1522d521dde51fb0c5a2142314d663f63d (and any others like it) file would
>>> be the way forward to get orch ls working again.
>>>
>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel 
>>> wrote:
>>>
>>>> Hi Adam,
>>>>
>>>> In cephadm ls i found the following service but i believe it was there
>>>> before also.
>>>>
>>>> {
>>>> "style": "cephadm:v1",
>>>> "name":
>>>> "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>> "systemd_unit":
>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>> ",
>>>> "enabled": false,
>>>> "state": "stopped",
>>>> "container_id": null,
>>>> "container_image_name": null,
>>>> "container_image_id": null,
>>>> "version": null,
>>>> "started": null,
>>>> "created": null,
>>>> "deployed": null,
>>>> "configured&q

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
Hi Adam,

I have deleted the file located here:
rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d

But I'm still getting the same error; do I need to do anything else?
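For reference, the cleanup discussed in this thread comes down to roughly the
following on each affected host; <fsid> and <hash> stand for the values shown
elsewhere in the thread:

$ cephadm ls | grep '"name": "cephadm\.'   # spot stray cephadm.<hash> entries on this host
$ rm /var/lib/ceph/<fsid>/cephadm.<hash>   # remove the stray entry
$ ceph orch ps --refresh                   # force cephadm to rebuild its cached inventory
$ ceph mgr fail <active-mgr>               # if the cache still looks stale, fail over the mgr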

On Fri, Sep 2, 2022 at 9:51 AM Adam King  wrote:

> Okay, I'm wondering if this is an issue with version mismatch. Having
> previously had a 16.2.10 mgr and then now having a 15.2.17 one that doesn't
> expect this sort of thing to be present. Either way, I'd think just
> deleting this cephadm.7ce656a8721deb5054c37b0cfb9038
> 1522d521dde51fb0c5a2142314d663f63d (and any others like it) file would be
> the way forward to get orch ls working again.
>
> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel  wrote:
>
>> Hi Adam,
>>
>> In cephadm ls i found the following service but i believe it was there
>> before also.
>>
>> {
>> "style": "cephadm:v1",
>> "name":
>> "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>> "systemd_unit":
>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>> ",
>> "enabled": false,
>> "state": "stopped",
>> "container_id": null,
>> "container_image_name": null,
>> "container_image_id": null,
>> "version": null,
>> "started": null,
>> "created": null,
>> "deployed": null,
>> "configured": null
>> },
>>
>> Look like remove didn't work
>>
>> root@ceph1:~# ceph orch rm cephadm
>> Failed to remove service.  was not found.
>>
>> root@ceph1:~# ceph orch rm
>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>> Failed to remove service.
>> 
>> was not found.
>>
>> On Fri, Sep 2, 2022 at 8:27 AM Adam King  wrote:
>>
>>> this looks like an old traceback you would get if you ended up with a
>>> service type that shouldn't be there somehow. The things I'd probably check
>>> are that "cephadm ls" on either host definitely doesn't report and strange
>>> things that aren't actually daemons in your cluster such as
>>> "cephadm.". Another thing you could maybe try, as I believe the
>>> assertion it's giving is for an unknown service type here ("AssertionError:
>>> cephadm"), is just "ceph orch rm cephadm" which would maybe cause it to
>>> remove whatever it thinks is this "cephadm" service that it has deployed.
>>> Lastly, you could try having the mgr you manually deploy be a 16.2.10 one
>>> instead of 15.2.17 (I'm assuming here, but the line numbers in that
>>> traceback suggest octopus). The 16.2.10 one is just much less likely to
>>> have a bug that causes something like this.
>>>
>>> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel 
>>> wrote:
>>>
>>>> Now when I run "ceph orch ps" it works but the following command throws
>>>> an
>>>> error.  Trying to bring up second mgr using ceph orch apply mgr command
>>>> but
>>>> didn't help
>>>>
>>>> root@ceph1:/ceph-disk# ceph version
>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus
>>>> (stable)
>>>>
>>>> root@ceph1:/ceph-disk# ceph orch ls
>>>> Error EINVAL: Traceback (most recent call last):
>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in
>>>> _handle_command
>>>> return self.handle_command(inbuf, cmd)
>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in
>>>> handle_command
>>>> return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
>>>> return self.func(mgr, **kwargs)
>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in
>>>> 
>>>> wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args,
>>>> **l_kwargs)
>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in
>>>> wrapper
>>>> return func(*args, **kwargs)
>>>>   File "/usr/share

[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-02 Thread Satish Patel
Hi Adam,

In cephadm ls I found the following service, but I believe it was there
before as well.

{
"style": "cephadm:v1",
"name":
"cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
"fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
"systemd_unit":
"ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
",
"enabled": false,
"state": "stopped",
"container_id": null,
"container_image_name": null,
"container_image_id": null,
"version": null,
"started": null,
"created": null,
"deployed": null,
"configured": null
},

Looks like the removal didn't work:

root@ceph1:~# ceph orch rm cephadm
Failed to remove service.  was not found.

root@ceph1:~# ceph orch rm
cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
Failed to remove service.

was not found.

On Fri, Sep 2, 2022 at 8:27 AM Adam King  wrote:

> this looks like an old traceback you would get if you ended up with a
> service type that shouldn't be there somehow. The things I'd probably check
> are that "cephadm ls" on either host definitely doesn't report and strange
> things that aren't actually daemons in your cluster such as
> "cephadm.". Another thing you could maybe try, as I believe the
> assertion it's giving is for an unknown service type here ("AssertionError:
> cephadm"), is just "ceph orch rm cephadm" which would maybe cause it to
> remove whatever it thinks is this "cephadm" service that it has deployed.
> Lastly, you could try having the mgr you manually deploy be a 16.2.10 one
> instead of 15.2.17 (I'm assuming here, but the line numbers in that
> traceback suggest octopus). The 16.2.10 one is just much less likely to
> have a bug that causes something like this.
>
> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel  wrote:
>
>> Now when I run "ceph orch ps" it works but the following command throws an
>> error.  Trying to bring up second mgr using ceph orch apply mgr command
>> but
>> didn't help
>>
>> root@ceph1:/ceph-disk# ceph version
>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus
>> (stable)
>>
>> root@ceph1:/ceph-disk# ceph orch ls
>> Error EINVAL: Traceback (most recent call last):
>>   File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in _handle_command
>> return self.handle_command(inbuf, cmd)
>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in
>> handle_command
>> return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>>   File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
>> return self.func(mgr, **kwargs)
>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in
>> 
>> wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args,
>> **l_kwargs)
>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in
>> wrapper
>> return func(*args, **kwargs)
>>   File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in
>> _list_services
>> raise_if_exception(completion)
>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in
>> raise_if_exception
>> raise e
>> AssertionError: cephadm
>>
>> On Fri, Sep 2, 2022 at 1:32 AM Satish Patel  wrote:
>>
>> > nevermind, i found doc related that and i am able to get 1 mgr up -
>> >
>> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
>> >
>> >
>> > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel 
>> wrote:
>> >
>> >> Folks,
>> >>
>> >> I am having little fun time with cephadm and it's very annoying to deal
>> >> with it
>> >>
>> >> I have deployed a ceph cluster using cephadm on two nodes. Now when i
>> was
>> >> trying to upgrade and noticed hiccups where it just upgraded a single
>> mgr
>> >> with 16.2.10 but not other so i started messing around and somehow I
>> >> deleted both mgr in the thought that cephadm will recreate them.
>> >>
>> >> Now i don't have any single mgr so my ceph orch command hangs forever
>> and
>> >> looks like a chicken egg issue.
>> >>
>> >> How do I recover from this? If I can't run the ceph orch command, I
>> won't
>> >> be able to redeploy my mgr daemons.
>> >>
>> >> I am not able to find any mgr in the following command on both nodes.
>> >>
>> >> $ cephadm ls | grep mgr
>> >>
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-01 Thread Satish Patel
Now when I run "ceph orch ps" it works but the following command throws an
error.  Trying to bring up second mgr using ceph orch apply mgr command but
didn't help

root@ceph1:/ceph-disk# ceph version
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus
(stable)

root@ceph1:/ceph-disk# ceph orch ls
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in _handle_command
return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in
handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in

wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in
_list_services
raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in
raise_if_exception
raise e
AssertionError: cephadm

On Fri, Sep 2, 2022 at 1:32 AM Satish Patel  wrote:

> nevermind, i found doc related that and i am able to get 1 mgr up -
> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
>
>
> On Fri, Sep 2, 2022 at 1:21 AM Satish Patel  wrote:
>
>> Folks,
>>
>> I am having little fun time with cephadm and it's very annoying to deal
>> with it
>>
>> I have deployed a ceph cluster using cephadm on two nodes. Now when i was
>> trying to upgrade and noticed hiccups where it just upgraded a single mgr
>> with 16.2.10 but not other so i started messing around and somehow I
>> deleted both mgr in the thought that cephadm will recreate them.
>>
>> Now i don't have any single mgr so my ceph orch command hangs forever and
>> looks like a chicken egg issue.
>>
>> How do I recover from this? If I can't run the ceph orch command, I won't
>> be able to redeploy my mgr daemons.
>>
>> I am not able to find any mgr in the following command on both nodes.
>>
>> $ cephadm ls | grep mgr
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [cephadm] mgr: no daemons active

2022-09-01 Thread Satish Patel
Never mind, I found the doc related to that and I was able to get one mgr up:
https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
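For anyone landing here with the same chicken-and-egg problem, the gist of
that troubleshooting page is roughly the following, heavily abbreviated and
with placeholder values; follow the exact steps in the linked doc:

# create a keyring and a minimal config for a new mgr
$ ceph auth get-or-create mgr.ceph1.smfvfd mon "profile mgr" osd "allow *" mds "allow *"
$ ceph config generate-minimal-conf
# put the config and keyring into config-json.json, then deploy the daemon directly
$ cephadm --image quay.io/ceph/ceph:v15 deploy --fsid <fsid> \
    --name mgr.ceph1.smfvfd --config-json config-json.json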


On Fri, Sep 2, 2022 at 1:21 AM Satish Patel  wrote:

> Folks,
>
> I am having little fun time with cephadm and it's very annoying to deal
> with it
>
> I have deployed a ceph cluster using cephadm on two nodes. Now when i was
> trying to upgrade and noticed hiccups where it just upgraded a single mgr
> with 16.2.10 but not other so i started messing around and somehow I
> deleted both mgr in the thought that cephadm will recreate them.
>
> Now i don't have any single mgr so my ceph orch command hangs forever and
> looks like a chicken egg issue.
>
> How do I recover from this? If I can't run the ceph orch command, I won't
> be able to redeploy my mgr daemons.
>
> I am not able to find any mgr in the following command on both nodes.
>
> $ cephadm ls | grep mgr
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] [cephadm] mgr: no daemons active

2022-09-01 Thread Satish Patel
Folks,

I am having quite a time with cephadm, and it's very annoying to deal
with.

I have deployed a ceph cluster using cephadm on two nodes. When I was
trying to upgrade, I noticed hiccups where it upgraded a single mgr
to 16.2.10 but not the other, so I started messing around and somehow
deleted both mgrs, thinking that cephadm would recreate them.

Now I don't have a single mgr, so my ceph orch commands hang forever; it
looks like a chicken-and-egg issue.

How do I recover from this? If I can't run the ceph orch command, I won't
be able to redeploy my mgr daemons.

I am not able to find any mgr with the following command on either node.

$ cephadm ls | grep mgr
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [cephadm] Found duplicate OSDs

2022-09-01 Thread Satish Patel
Great, thanks!

Don't ask me how many commands I have typed to fix my issue. Finally I did
it. Basically, I fixed /etc/hosts and then removed the broken mgr daemon with
the following command:

ceph orch daemon rm mgr.ceph1.xmbvsb

And cephadm auto-deployed a new working mgr. I found that ceph orch ps was
hanging, and the solution was to restart all ceph daemons with the
systemctl restart ceph.target command.

root@ceph1:/ceph-disk# ceph orch ps
NAME HOST   PORTSSTATUS REFRESHED  AGE  MEM
USE  MEM LIM  VERSION  IMAGE ID  CONTAINER ID
alertmanager.ceph1   ceph1   running (12m) 9m ago   2w
 16.0M-  0.20.0   0881eb8f169f  d064a0177439
crash.ceph1  ceph1   running (49m) 9m ago   2w
 7963k-  15.2.17  93146564743f  550b088467e4
crash.ceph2  ceph2   running (35m) 9m ago  13d
 7287k-  15.2.17  93146564743f  c4b5b3327fa5
grafana.ceph1ceph1   running (14m) 9m ago   2w
 34.9M-  6.7.4557c83e11646  46048ebff031
mgr.ceph1.hxsfrs ceph1  *:8443,9283  running (13m) 9m ago  13m
327M-  15.2.17  93146564743f  4c5169890e9d
mgr.ceph2.hmbdla ceph2   running (35m) 9m ago  13d
435M-  16.2.10  0d668911f040  361d58a423cd
mon.ceph1ceph1   running (49m) 9m ago   2w
 85.5M2048M  15.2.17  93146564743f  a5f055953256
node-exporter.ceph1  ceph1   running (14m) 9m ago   2w
 32.9M-  0.18.1   e5a616e4b9cf  833cc2e6c9ed
node-exporter.ceph2  ceph2   running (13m) 9m ago  13d
 33.9M-  0.18.1   e5a616e4b9cf  30d15dde3860
osd.0ceph1   running (49m) 9m ago  13d
355M4096M  15.2.17  93146564743f  6e9bee5c211e
osd.1ceph1   running (49m) 9m ago  13d
372M4096M  15.2.17  93146564743f  09b8616bc096
osd.2ceph2   running (35m) 9m ago  13d
287M4096M  15.2.17  93146564743f  20f75a1b5221
osd.3ceph2   running (35m) 9m ago  13d
300M4096M  15.2.17  93146564743f  c57154355b03
prometheus.ceph1 ceph1   running (12m) 9m ago   2w
 89.5M-  2.18.1   de242295e225  b5ff35307ac0

Now I am going to start the upgrade process next. I will keep you posted on
how it goes.
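For the host/IP mix-up itself, the re-add procedure Adam outlines below comes
down to roughly this; the IP is a placeholder, since only you know which
address really belongs to ceph2:

$ ceph orch host ls                      # check which IP cephadm has recorded for each host
$ ceph orch host add ceph2 <correct-ip>  # re-adding with the same hostname updates the stored IP
# if that alone doesn't take, remove the entry first (daemons are left alone) and add it again:
$ ceph orch host rm ceph2 --force
$ ceph orch host add ceph2 <correct-ip>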

On Thu, Sep 1, 2022 at 10:06 PM Adam King  wrote:

> I'm not sure exactly what needs to be done to fix that, but I'd imagine
> just editing the /etc/hosts file on all your hosts to be correct would be
> the start (the cephadm shell would have taken its /etc/hosts off of
> whatever host you ran the shell from). Unfortunately I'm not much of a
> networking expert and if you have some sort of DNS stuff going on for your
> local network I'm not too sure what to do there, but if it's possible just
> fixing the /etc/hosts entries will resolve things. Either way, once you've
> got the networking fixed so ssh-ing to the hosts works as expected with the
> IPs you might need to  re-add one or both of the hosts to the cluster with
> the correct IP as well ( "ceph orch host add  "). I believe
> if you just run the orch host add command again with a different IP but the
> same hostname it will just change the IP cephadm has stored for the host.
> If that isn't working, running "ceph orch host rm  --force"
> beforehand should make it work (if you just remove the host with --force it
> shouldn't touch the host's daemons and should therefore be a relatively
> sage operation). In the end, the IP cephadm lists for each host in "ceph
> orch host ls" must be an IP that allows correctly ssh-ing to the host.
>
> On Thu, Sep 1, 2022 at 9:17 PM Satish Patel  wrote:
>
>> Hi Adam,
>>
>> You are correct, look like it was a naming issue in my /etc/hosts file.
>> Is there a way to correct it?
>>
>> If you see i have ceph1 two time. :(
>>
>> 10.73.0.191 ceph1.example.com ceph1
>> 10.73.0.192 ceph2.example.com ceph1
>>
>> On Thu, Sep 1, 2022 at 8:06 PM Adam King  wrote:
>>
>>> the naming for daemons is a bit different for each daemon type, but for
>>> mgr daemons it's always "mgr..".  The daemons
>>> cephadm will be able to find for something like a daemon redeploy are
>>> pretty much always whatever is reported in "ceph orch ps". Given that
>>> "mgr.ceph1.xmbvsb" isn't listed there, it's not surprising it said it
>>> couldn't find it.
>>>
>>> There is definitely something very odd going on here. It looks like the
>>> crash daemons as well are reporting a duplicate "crash.ceph2" on both ceph1
>>> and ceph2. Going back to your original orch ps output from the first email,
>>> it seems that every daemon see

[ceph-users] Re: [cephadm] Found duplicate OSDs

2022-09-01 Thread Satish Patel
Hi Adam,

You are correct, it looks like it was a naming issue in my /etc/hosts file. Is
there a way to correct it?

As you can see, I have ceph1 listed twice. :(

10.73.0.191 ceph1.example.com ceph1
10.73.0.192 ceph2.example.com ceph1
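Presumably the second line was meant to map to ceph2, i.e. something like
this, with the same correction applied on every node:

10.73.0.191 ceph1.example.com ceph1
10.73.0.192 ceph2.example.com ceph2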

On Thu, Sep 1, 2022 at 8:06 PM Adam King  wrote:

> the naming for daemons is a bit different for each daemon type, but for
> mgr daemons it's always "mgr..".  The daemons
> cephadm will be able to find for something like a daemon redeploy are
> pretty much always whatever is reported in "ceph orch ps". Given that
> "mgr.ceph1.xmbvsb" isn't listed there, it's not surprising it said it
> couldn't find it.
>
> There is definitely something very odd going on here. It looks like the
> crash daemons as well are reporting a duplicate "crash.ceph2" on both ceph1
> and ceph2. Going back to your original orch ps output from the first email,
> it seems that every daemon seems to have a duplicate and none of the actual
> daemons listed in the "cephadm ls" on ceph1 are actually being reported in
> the orch ps output. I think something may have gone wrong with the host and
> networking setup here and it seems to be reporting ceph2 daemons as the
> daemons for both ceph1 and ceph2 as if trying to connect to ceph1 ends up
> connecting to ceph2. The only time I've seen anything like this was when I
> made a mistake and setup a virtual IP on one host that was the same as the
> actual IP for another host on the cluster and cephadm basically ended up
> ssh-ing to the same host via both IPs (the one that was supposed to be for
> host A and host B where the virtual IP matching host B was setup on host
> A). I doubt you're in that exact situation, but I think we need to look
> very closely at the networking setup here. I would try opening up a cephadm
> shell and ssh-ing to each of the two hosts by the IP listed in "ceph orch
> host ls" and make sure you actually get to the correct host and it has the
> correct hostname. Given the output, I wouldn't be surprised if trying to
> connect to ceph1's IP landed you on ceph2 or vice versa. I will say I found
> it a bit odd originally when I saw the two IPs were 10.73.0.192 and
> 10.73.3.192. There's nothing necessarily wrong with that, but typically IPs
> on the host are more likely to differ at the end than in the middle (e.g.
> 192.168.122.1 and 192.168.122.2 rather than 192.168.1.122 and
> 192.168.2.122) and it did make me wonder if a mistake had occurred in the
> networking. Either way, there's clearly something making it think ceph2's
> daemons are on both ceph1 and ceph2 and some sort of networking issue is
> the only thing I'm aware of currently that causes something like that.
>
> On Thu, Sep 1, 2022 at 6:30 PM Satish Patel  wrote:
>
>> Hi Adam,
>>
>> I have also noticed a very strange thing which is Duplicate name in the
>> following output.  Is this normal?  I don't know how it got here. Is there
>> a way I can rename them?
>>
>> root@ceph1:~# ceph orch ps
>> NAME HOST   PORTSSTATUS  REFRESHED  AGE
>>  MEM USE  MEM LIM  VERSIONIMAGE ID  CONTAINER ID
>> alertmanager.ceph1   ceph1  *:9093,9094  starting--
>>  -- 
>> crash.ceph2  ceph1   running (13d) 10s ago  13d
>>  10.0M-  15.2.1793146564743f  0a009254afb0
>> crash.ceph2  ceph2   running (13d) 10s ago  13d
>>  10.0M-  15.2.1793146564743f  0a009254afb0
>> grafana.ceph1ceph1  *:3000   starting--
>>  -- 
>> mgr.ceph2.hmbdla ceph1   running (103m)10s ago  13d
>>   518M-  16.2.100d668911f040  745245c18d5e
>> mgr.ceph2.hmbdla ceph2   running (103m)10s ago  13d
>>   518M-  16.2.100d668911f040  745245c18d5e
>> node-exporter.ceph2  ceph1   running (7h)  10s ago  13d
>>  70.2M-  0.18.1 e5a616e4b9cf  d0ba04bb977c
>> node-exporter.ceph2  ceph2   running (7h)  10s ago  13d
>>  70.2M-  0.18.1 e5a616e4b9cf  d0ba04bb977c
>> osd.2ceph1   running (19h) 10s ago  13d
>>   901M4096M  15.2.1793146564743f  e286fb1c6302
>> osd.2ceph2   running (19h) 10s ago  13d
>>   901M4096M  15.2.1793146564743f  e286fb1c6302
>> osd.3ceph1   running (19h) 10s ago  13d
>>  1006M4096M  15.2.1793146564743f  d3ae5d9f694f
>> osd.3ceph2   running (19h) 10s ago  13d
>>  1006M4096M  15.2.1793146564

[ceph-users] Re: [cephadm] Found duplicate OSDs

2022-09-01 Thread Satish Patel
Hi Adam,

I have also noticed a very strange thing: duplicate names in the
following output. Is this normal? I don't know how they got here. Is there
a way I can rename them?

root@ceph1:~# ceph orch ps
NAME HOST   PORTSSTATUS  REFRESHED  AGE
 MEM USE  MEM LIM  VERSIONIMAGE ID  CONTAINER ID
alertmanager.ceph1   ceph1  *:9093,9094  starting--
   -- 
crash.ceph2  ceph1   running (13d) 10s ago  13d
 10.0M-  15.2.1793146564743f  0a009254afb0
crash.ceph2  ceph2   running (13d) 10s ago  13d
 10.0M-  15.2.1793146564743f  0a009254afb0
grafana.ceph1ceph1  *:3000   starting--
   -- 
mgr.ceph2.hmbdla ceph1   running (103m)10s ago  13d
518M-  16.2.100d668911f040  745245c18d5e
mgr.ceph2.hmbdla ceph2   running (103m)10s ago  13d
518M-  16.2.100d668911f040  745245c18d5e
node-exporter.ceph2  ceph1   running (7h)  10s ago  13d
 70.2M-  0.18.1 e5a616e4b9cf  d0ba04bb977c
node-exporter.ceph2  ceph2   running (7h)  10s ago  13d
 70.2M-  0.18.1 e5a616e4b9cf  d0ba04bb977c
osd.2ceph1   running (19h) 10s ago  13d
901M4096M  15.2.1793146564743f  e286fb1c6302
osd.2ceph2   running (19h) 10s ago  13d
901M4096M  15.2.1793146564743f  e286fb1c6302
osd.3ceph1   running (19h) 10s ago  13d
 1006M4096M  15.2.1793146564743f  d3ae5d9f694f
osd.3ceph2   running (19h) 10s ago  13d
 1006M4096M  15.2.1793146564743f  d3ae5d9f694f
osd.5ceph1   running (19h) 10s ago   9d
222M4096M  15.2.1793146564743f  405068fb474e
osd.5ceph2   running (19h) 10s ago   9d
222M4096M  15.2.1793146564743f  405068fb474e
prometheus.ceph1 ceph1  *:9095   running (15s) 10s ago  15s
 30.6M- 514e6a882f6e  65a0acfed605
prometheus.ceph1 ceph2  *:9095   running (15s) 10s ago  15s
 30.6M- 514e6a882f6e  65a0acfed605

I found the following example link where all the names are different; how does
cephadm decide on naming?

https://achchusnulchikam.medium.com/deploy-ceph-cluster-with-cephadm-on-centos-8-257b300e7b42

On Thu, Sep 1, 2022 at 6:20 PM Satish Patel  wrote:

> Hi Adam,
>
> Getting the following error, not sure why it's not able to find it.
>
> root@ceph1:~# ceph orch daemon redeploy mgr.ceph1.xmbvsb
> Error EINVAL: Unable to find mgr.ceph1.xmbvsb daemon(s)
>
> On Thu, Sep 1, 2022 at 5:57 PM Adam King  wrote:
>
>> what happens if you run `ceph orch daemon redeploy mgr.ceph1.xmbvsb`?
>>
>> On Thu, Sep 1, 2022 at 5:12 PM Satish Patel  wrote:
>>
>>> Hi Adam,
>>>
>>> Here is requested output
>>>
>>> root@ceph1:~# ceph health detail
>>> HEALTH_WARN 4 stray daemon(s) not managed by cephadm
>>> [WRN] CEPHADM_STRAY_DAEMON: 4 stray daemon(s) not managed by cephadm
>>> stray daemon mon.ceph1 on host ceph1 not managed by cephadm
>>> stray daemon osd.0 on host ceph1 not managed by cephadm
>>> stray daemon osd.1 on host ceph1 not managed by cephadm
>>> stray daemon osd.4 on host ceph1 not managed by cephadm
>>>
>>>
>>> root@ceph1:~# ceph orch host ls
>>> HOST   ADDR LABELS  STATUS
>>> ceph1  10.73.0.192
>>> ceph2  10.73.3.192  _admin
>>> 2 hosts in cluster
>>>
>>>
>>> My cephadm ls  saying mgr is in error state
>>>
>>> {
>>> "style": "cephadm:v1",
>>> "name": "mgr.ceph1.xmbvsb",
>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>> "systemd_unit":
>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb",
>>> "enabled": true,
>>> "state": "error",
>>> "container_id": null,
>>> "container_image_name": "quay.io/ceph/ceph:v15",
>>> "container_image_id": null,
>>> "version": null,
>>> "started": null,
>>> "created": "2022-09-01T20:59:49.314347Z",
>>> "deployed": "2022-09-01T20:59:48.718347Z",
>>> "configured": "2022-09-01T20:59:49.314347Z"
>>> },
>>>
>>>
>>> Getting error
>>>

[ceph-users] Re: [cephadm] Found duplicate OSDs

2022-09-01 Thread Satish Patel
Hi Adam,

I'm getting the following error; I'm not sure why it's not able to find the daemon.

root@ceph1:~# ceph orch daemon redeploy mgr.ceph1.xmbvsb
Error EINVAL: Unable to find mgr.ceph1.xmbvsb daemon(s)

On Thu, Sep 1, 2022 at 5:57 PM Adam King  wrote:

> what happens if you run `ceph orch daemon redeploy mgr.ceph1.xmbvsb`?
>
> On Thu, Sep 1, 2022 at 5:12 PM Satish Patel  wrote:
>
>> Hi Adam,
>>
>> Here is requested output
>>
>> root@ceph1:~# ceph health detail
>> HEALTH_WARN 4 stray daemon(s) not managed by cephadm
>> [WRN] CEPHADM_STRAY_DAEMON: 4 stray daemon(s) not managed by cephadm
>> stray daemon mon.ceph1 on host ceph1 not managed by cephadm
>> stray daemon osd.0 on host ceph1 not managed by cephadm
>> stray daemon osd.1 on host ceph1 not managed by cephadm
>> stray daemon osd.4 on host ceph1 not managed by cephadm
>>
>>
>> root@ceph1:~# ceph orch host ls
>> HOST   ADDR LABELS  STATUS
>> ceph1  10.73.0.192
>> ceph2  10.73.3.192  _admin
>> 2 hosts in cluster
>>
>>
>> My cephadm ls  saying mgr is in error state
>>
>> {
>> "style": "cephadm:v1",
>> "name": "mgr.ceph1.xmbvsb",
>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>> "systemd_unit":
>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb",
>> "enabled": true,
>> "state": "error",
>> "container_id": null,
>> "container_image_name": "quay.io/ceph/ceph:v15",
>> "container_image_id": null,
>> "version": null,
>> "started": null,
>> "created": "2022-09-01T20:59:49.314347Z",
>> "deployed": "2022-09-01T20:59:48.718347Z",
>> "configured": "2022-09-01T20:59:49.314347Z"
>> },
>>
>>
>> Getting error
>>
>> root@ceph1:~# cephadm unit --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea
>> --name mgr.ceph1.xmbvsb start
>> stderr Job for
>> ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service
>> failed because the control process exited with error code.
>> stderr See "systemctl status
>> ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service" and
>> "journalctl -xe" for details.
>> Traceback (most recent call last):
>>   File "/usr/sbin/cephadm", line 6250, in 
>> r = args.func()
>>   File "/usr/sbin/cephadm", line 1357, in _infer_fsid
>> return func()
>>   File "/usr/sbin/cephadm", line 3727, in command_unit
>> call_throws([
>>   File "/usr/sbin/cephadm", line 1119, in call_throws
>> raise RuntimeError('Failed command: %s' % ' '.join(command))
>> RuntimeError: Failed command: systemctl start
>> ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb
>>
>>
>> How do I remove and re-deploy mgr?
>>
>> On Thu, Sep 1, 2022 at 4:54 PM Adam King  wrote:
>>
>>> cephadm deploys the containers with --rm so they will get removed if you
>>> stop them. As for getting the 2nd mgr back, if it still lists the 2nd one
>>> in `ceph orch ps` you should be able to do a `ceph orch daemon redeploy
>>> ` where  should match the name given in
>>> the orch ps output for the one that isn't actually up. If it isn't listed
>>> there, given you have a count of 2, cephadm should deploy another one. I do
>>> see in the orch ls output you posted that it says the mgr service has "2/2"
>>> running which implies it believes a 2nd mgr is present (and you would
>>> therefore be able to try the daemon redeploy if that daemon isn't actually
>>> there).
>>>
>>> Is it still reporting the duplicate osds in orch ps? I see in the
>>> cephadm ls output on ceph1 that osd.2 isn't being reported, which was
>>> reported as being on ceph1 in the orch ps output in your original message
>>> in this thread. I'm interested in what `ceph health detail` is reporting
>>> now as well, as it says there are 4 stray daemons. Also, the `ceph orch
>>> host ls` output just to get a better grasp of the topology of this cluster.
>>>
>>> On Thu, Sep 1, 2022 at 3:50 PM Satish Patel 
>>> wrote:
>>>
>>>> Adam,
>>>>
>>>> I have posted a question related to upgrading earlier and this thread
>>>> is related to that, 

[ceph-users] Re: [cephadm] Found duplicate OSDs

2022-09-01 Thread Satish Patel
Hi Adam,

Here is the requested output:

root@ceph1:~# ceph health detail
HEALTH_WARN 4 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 4 stray daemon(s) not managed by cephadm
stray daemon mon.ceph1 on host ceph1 not managed by cephadm
stray daemon osd.0 on host ceph1 not managed by cephadm
stray daemon osd.1 on host ceph1 not managed by cephadm
stray daemon osd.4 on host ceph1 not managed by cephadm


root@ceph1:~# ceph orch host ls
HOST   ADDR LABELS  STATUS
ceph1  10.73.0.192
ceph2  10.73.3.192  _admin
2 hosts in cluster


My cephadm ls output says the mgr is in an error state:

{
"style": "cephadm:v1",
"name": "mgr.ceph1.xmbvsb",
"fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
"systemd_unit":
"ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb",
"enabled": true,
"state": "error",
"container_id": null,
"container_image_name": "quay.io/ceph/ceph:v15",
"container_image_id": null,
"version": null,
"started": null,
"created": "2022-09-01T20:59:49.314347Z",
"deployed": "2022-09-01T20:59:48.718347Z",
"configured": "2022-09-01T20:59:49.314347Z"
},


I'm getting this error:

root@ceph1:~# cephadm unit --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea
--name mgr.ceph1.xmbvsb start
stderr Job for
ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service failed
because the control process exited with error code.
stderr See "systemctl status
ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service" and
"journalctl -xe" for details.
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 6250, in 
r = args.func()
  File "/usr/sbin/cephadm", line 1357, in _infer_fsid
return func()
  File "/usr/sbin/cephadm", line 3727, in command_unit
call_throws([
  File "/usr/sbin/cephadm", line 1119, in call_throws
raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: systemctl start
ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb


How do I remove and redeploy the mgr?
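One way to dig into the failed unit and then replace the broken daemon,
roughly (the unit and daemon names are the ones from the listing above):

$ systemctl status ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service
$ journalctl -u ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service -n 50
# remove the broken daemon and let the orchestrator schedule a fresh one
$ ceph orch daemon rm mgr.ceph1.xmbvsb --force
$ ceph orch apply mgr 2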

On Thu, Sep 1, 2022 at 4:54 PM Adam King  wrote:

> cephadm deploys the containers with --rm so they will get removed if you
> stop them. As for getting the 2nd mgr back, if it still lists the 2nd one
> in `ceph orch ps` you should be able to do a `ceph orch daemon redeploy
> ` where  should match the name given in
> the orch ps output for the one that isn't actually up. If it isn't listed
> there, given you have a count of 2, cephadm should deploy another one. I do
> see in the orch ls output you posted that it says the mgr service has "2/2"
> running which implies it believes a 2nd mgr is present (and you would
> therefore be able to try the daemon redeploy if that daemon isn't actually
> there).
>
> Is it still reporting the duplicate osds in orch ps? I see in the cephadm
> ls output on ceph1 that osd.2 isn't being reported, which was reported as
> being on ceph1 in the orch ps output in your original message in this
> thread. I'm interested in what `ceph health detail` is reporting now as
> well, as it says there are 4 stray daemons. Also, the `ceph orch host ls`
> output just to get a better grasp of the topology of this cluster.
>
> On Thu, Sep 1, 2022 at 3:50 PM Satish Patel  wrote:
>
>> Adam,
>>
>> I have posted a question related to upgrading earlier and this thread is
>> related to that, I have opened a new one because I found that error in logs
>> and thought the upgrade may be stuck because of duplicate OSDs.
>>
>> root@ceph1:~# ls -l /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/
>> total 44
>> drwx-- 3 nobody nogroup 4096 Aug 19 05:37 alertmanager.ceph1
>> drwx-- 3167 167 4096 Aug 19 05:36 crash
>> drwx-- 2167 167 4096 Aug 19 05:37 crash.ceph1
>> drwx-- 4998 996 4096 Aug 19 05:37 grafana.ceph1
>> drwx-- 2167 167 4096 Aug 19 05:36 mgr.ceph1.xmbvsb
>> drwx-- 3167 167 4096 Aug 19 05:36 mon.ceph1
>> drwx-- 2 nobody nogroup 4096 Aug 19 05:37 node-exporter.ceph1
>> drwx-- 2167 167 4096 Aug 19 17:55 osd.0
>> drwx-- 2167 167 4096 Aug 19 18:03 osd.1
>> drwx-- 2167 167 4096 Aug 31 05:20 osd.4
>> drwx-- 4 nobody nogroup 4096 Aug 19 05:38 prometheus.ceph1
>>
>> Here is the output of cephadm ls
>>
>> root@ceph1:~# cephadm ls
>> [
>> {
>> "style": "cephadm:v1",
>> "name&qu

[ceph-users] Re: [cephadm] Found duplicate OSDs

2022-09-01 Thread Satish Patel
ed": "2022-08-19T03:38:06.487603Z"
},
{
"style": "cephadm:v1",
"name": "osd.4",
"fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
"systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@osd.4",
"enabled": true,
"state": "running",
"container_id":
"938840fe7fd0cb45cc26d077837c9847d7c7a7a68c7e1588d4bb4343c695a071",
    "container_image_name": "quay.io/ceph/ceph:v15",
"container_image_id":
"93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4",
"version": "15.2.17",
"started": "2022-08-31T03:20:55.416219Z",
"created": "2022-08-23T21:46:49.458533Z",
"deployed": "2022-08-23T21:46:48.818533Z",
"configured": "2022-08-31T02:53:41.196643Z"
}
]


I have noticed one more thing: I ran docker stop on a container on the
ceph1 node, and now my mgr container has disappeared. I can't see it anywhere,
and I'm not sure how to bring the mgr back, because the upgrade won't let me do
anything if I don't have two mgr instances.

root@ceph1:~# ceph -s
  cluster:
id: f270ad9e-1f6f-11ed-b6f8-a539d87379ea
health: HEALTH_WARN
4 stray daemon(s) not managed by cephadm

  services:
mon: 1 daemons, quorum ceph1 (age 17h)
mgr: ceph2.hmbdla(active, since 5h)
osd: 6 osds: 6 up (since 40h), 6 in (since 8d)

  data:
pools:   6 pools, 161 pgs
objects: 20.59k objects, 85 GiB
usage:   174 GiB used, 826 GiB / 1000 GiB avail
pgs: 161 active+clean

  io:
client:   0 B/s rd, 12 KiB/s wr, 0 op/s rd, 2 op/s wr

  progress:
Upgrade to quay.io/ceph/ceph:16.2.10 (0s)
  []

I can see mgr count:2, but I'm not sure how to bring it back.

root@ceph1:~# ceph orch ls
NAME   PORTSRUNNING  REFRESHED  AGE  PLACEMENT
alertmanager   ?:9093,9094  1/1  20s ago13d  count:1
crash   2/2  20s ago13d  *
grafana?:3000   1/1  20s ago13d  count:1
mgr 2/2  20s ago13d  count:2
mon 0/5  -  13d  
node-exporter  ?:9100   2/2  20s ago13d  *
osd   6  20s ago-
osd.all-available-devices 0  -  13d  *
osd.osd_spec_default  0  -  8d   *
prometheus ?:9095   1/1  20s ago13d  count:1

On Thu, Sep 1, 2022 at 12:28 PM Adam King  wrote:

> Are there any extra directories in /var/lib/ceph or /var/lib/ceph/
> that appear to be for those OSDs on that host? When cephadm builds the info
> it uses for "ceph orch ps" it's actually scraping those directories. The
> output of "cephadm ls" on the host with the duplicates could also
> potentially have some insights.
>
> On Thu, Sep 1, 2022 at 12:15 PM Satish Patel  wrote:
>
>> Folks,
>>
>> I am playing with cephadm and life was good until I started upgrading from
>> octopus to pacific. My upgrade process stuck after upgrading mgr and in
>> logs now i can see following error
>>
>> root@ceph1:~# ceph log last cephadm
>> 2022-09-01T14:40:45.739804+ mgr.ceph2.hmbdla (mgr.265806) 8 :
>> cephadm [INF] Deploying daemon grafana.ceph1 on ceph1
>> 2022-09-01T14:40:56.115693+ mgr.ceph2.hmbdla (mgr.265806) 14 :
>> cephadm [INF] Deploying daemon prometheus.ceph1 on ceph1
>> 2022-09-01T14:41:11.856725+ mgr.ceph2.hmbdla (mgr.265806) 25 :
>> cephadm [INF] Reconfiguring alertmanager.ceph1 (dependencies
>> changed)...
>> 2022-09-01T14:41:11.861535+ mgr.ceph2.hmbdla (mgr.265806) 26 :
>> cephadm [INF] Reconfiguring daemon alertmanager.ceph1 on ceph1
>> 2022-09-01T14:41:12.927852+ mgr.ceph2.hmbdla (mgr.265806) 27 :
>> cephadm [INF] Reconfiguring grafana.ceph1 (dependencies changed)...
>> 2022-09-01T14:41:12.940615+ mgr.ceph2.hmbdla (mgr.265806) 28 :
>> cephadm [INF] Reconfiguring daemon grafana.ceph1 on ceph1
>> 2022-09-01T14:41:14.056113+ mgr.ceph2.hmbdla (mgr.265806) 33 :
>> cephadm [INF] Found duplicate OSDs: osd.2 in status running on ceph1,
>> osd.2 in status running on ceph2
>> 2022-09-01T14:41:14.056437+ mgr.ceph2.hmbdla (mgr.265806) 34 :
>> cephadm [INF] Found duplicate OSDs: osd.5 in status running on ceph1,
>> osd.5 in status running on ceph2
>> 2022-09-01T14:41:14.056630+ mgr.ceph2.hmbdla (mgr.265806) 35 :
>> cephadm [INF] Found duplicate OSDs: osd.3 in status running on ceph1,
>> osd.3 in status running on ceph2

[ceph-users] [cephadm] Found duplicate OSDs

2022-09-01 Thread Satish Patel
Folks,

I am playing with cephadm and life was good until I started upgrading from
octopus to pacific. My upgrade process got stuck after upgrading the mgr, and
in the logs I can now see the following errors:

root@ceph1:~# ceph log last cephadm
2022-09-01T14:40:45.739804+ mgr.ceph2.hmbdla (mgr.265806) 8 :
cephadm [INF] Deploying daemon grafana.ceph1 on ceph1
2022-09-01T14:40:56.115693+ mgr.ceph2.hmbdla (mgr.265806) 14 :
cephadm [INF] Deploying daemon prometheus.ceph1 on ceph1
2022-09-01T14:41:11.856725+ mgr.ceph2.hmbdla (mgr.265806) 25 :
cephadm [INF] Reconfiguring alertmanager.ceph1 (dependencies
changed)...
2022-09-01T14:41:11.861535+ mgr.ceph2.hmbdla (mgr.265806) 26 :
cephadm [INF] Reconfiguring daemon alertmanager.ceph1 on ceph1
2022-09-01T14:41:12.927852+ mgr.ceph2.hmbdla (mgr.265806) 27 :
cephadm [INF] Reconfiguring grafana.ceph1 (dependencies changed)...
2022-09-01T14:41:12.940615+ mgr.ceph2.hmbdla (mgr.265806) 28 :
cephadm [INF] Reconfiguring daemon grafana.ceph1 on ceph1
2022-09-01T14:41:14.056113+ mgr.ceph2.hmbdla (mgr.265806) 33 :
cephadm [INF] Found duplicate OSDs: osd.2 in status running on ceph1,
osd.2 in status running on ceph2
2022-09-01T14:41:14.056437+ mgr.ceph2.hmbdla (mgr.265806) 34 :
cephadm [INF] Found duplicate OSDs: osd.5 in status running on ceph1,
osd.5 in status running on ceph2
2022-09-01T14:41:14.056630+ mgr.ceph2.hmbdla (mgr.265806) 35 :
cephadm [INF] Found duplicate OSDs: osd.3 in status running on ceph1,
osd.3 in status running on ceph2


I am not sure where the duplicate names came from or how that happened. In the
following output I can't see any duplication:

root@ceph1:~# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1 0.97656  root default
-3 0.48828  host ceph1
 4hdd  0.09769  osd.4   up   1.0  1.0
 0ssd  0.19530  osd.0   up   1.0  1.0
 1ssd  0.19530  osd.1   up   1.0  1.0
-5 0.48828  host ceph2
 5hdd  0.09769  osd.5   up   1.0  1.0
 2ssd  0.19530  osd.2   up   1.0  1.0
 3ssd  0.19530  osd.3   up   1.0  1.0


But at the same time I can see duplicate OSD numbers on ceph1 and ceph2:


root@ceph1:~# ceph orch ps
NAME HOST   PORTSSTATUS REFRESHED  AGE
 MEM USE  MEM LIM  VERSION  IMAGE ID  CONTAINER ID
alertmanager.ceph1   ceph1  *:9093,9094  running (20s) 2s ago  20s
   17.1M-   ba2b418f427c  856a4fe641f1
alertmanager.ceph1   ceph2  *:9093,9094  running (20s) 3s ago  20s
   17.1M-   ba2b418f427c  856a4fe641f1
crash.ceph2  ceph1   running (12d) 2s ago  12d
   10.0M-  15.2.17  93146564743f  0a009254afb0
crash.ceph2  ceph2   running (12d) 3s ago  12d
   10.0M-  15.2.17  93146564743f  0a009254afb0
grafana.ceph1ceph1  *:3000   running (18s) 2s ago  19s
   47.9M-  8.3.5dad864ee21e9  7d7a70b8ab7f
grafana.ceph1ceph2  *:3000   running (18s) 3s ago  19s
   47.9M-  8.3.5dad864ee21e9  7d7a70b8ab7f
mgr.ceph2.hmbdla ceph1   running (13h) 2s ago  12d
506M-  16.2.10  0d668911f040  6274723c35f7
mgr.ceph2.hmbdla ceph2   running (13h) 3s ago  12d
506M-  16.2.10  0d668911f040  6274723c35f7
node-exporter.ceph2  ceph1   running (91m) 2s ago  12d
   60.7M-  0.18.1   e5a616e4b9cf  d0ba04bb977c
node-exporter.ceph2  ceph2   running (91m) 3s ago  12d
   60.7M-  0.18.1   e5a616e4b9cf  d0ba04bb977c
osd.2ceph1   running (12h) 2s ago  12d
867M4096M  15.2.17  93146564743f  e286fb1c6302
osd.2ceph2   running (12h) 3s ago  12d
867M4096M  15.2.17  93146564743f  e286fb1c6302
osd.3ceph1   running (12h) 2s ago  12d
978M4096M  15.2.17  93146564743f  d3ae5d9f694f
osd.3ceph2   running (12h) 3s ago  12d
978M4096M  15.2.17  93146564743f  d3ae5d9f694f
osd.5ceph1   running (12h) 2s ago   8d
225M4096M  15.2.17  93146564743f  405068fb474e
osd.5ceph2   running (12h) 3s ago   8d
225M4096M  15.2.17  93146564743f  405068fb474e
prometheus.ceph1 ceph1  *:9095   running (8s)  2s ago   8s
   30.4M-   514e6a882f6e  9031dbe30cae
prometheus.ceph1 ceph2  *:9095   running (8s)  3s ago   8s
   30.4M-   514e6a882f6e  9031dbe30cae


Is this a bug, or did I do something wrong? Is there any workaround to get out
of this condition?
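
If this turns out to be stale daemon state on one of the hosts (see Adam
King's note in the reply about cephadm scraping /var/lib/ceph), one possible
cleanup, purely as a sketch and only after confirming the daemon is genuinely
not running on that host, would be:

cephadm ls                                    # on the host showing the bogus entry
cephadm rm-daemon --name osd.2 --fsid <fsid>  # osd.2 and <fsid> are placeholders
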
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm upgrade from octopus to pacific stuck

2022-08-31 Thread Satish Patel
Hi,

I have a small cluster in the lab which has only two nodes. I have a single
monitor and two OSD nodes.

I am running the upgrade, but somehow it got stuck after upgrading the mgr:

ceph orch upgrade start --ceph-version 16.2.10

root@ceph1:~# ceph -s
  cluster:
id: f270ad9e-1f6f-11ed-b6f8-a539d87379ea
health: HEALTH_WARN
5 stray daemon(s) not managed by cephadm

  services:
mon: 1 daemons, quorum ceph1 (age 22m)
mgr: ceph1.xmbvsb(active, since 21m), standbys: ceph2.hmbdla
osd: 6 osds: 6 up (since 23h), 6 in (since 8d)

  data:
pools:   6 pools, 161 pgs
objects: 20.53k objects, 85 GiB
usage:   173 GiB used, 826 GiB / 1000 GiB avail
pgs: 161 active+clean

  io:
client:   0 B/s rd, 2.7 KiB/s wr, 0 op/s rd, 0 op/s wr

  progress:
Upgrade to quay.io/ceph/ceph:v16.2.10 (0s)
  []


root@ceph1:~# ceph health detail
HEALTH_WARN 5 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 5 stray daemon(s) not managed by cephadm
stray daemon mgr.ceph1.xmbvsb on host ceph1 not managed by cephadm
stray daemon mon.ceph1 on host ceph1 not managed by cephadm
stray daemon osd.0 on host ceph1 not managed by cephadm
stray daemon osd.1 on host ceph1 not managed by cephadm
stray daemon osd.4 on host ceph1 not managed by cephadm


root@ceph1:~# ceph log last cephadm
2022-09-01T02:46:12.020993+ mgr.ceph1.xmbvsb (mgr.254112) 437 : cephadm
[INF] refreshing ceph2 facts
2022-09-01T02:47:12.016303+ mgr.ceph1.xmbvsb (mgr.254112) 469 : cephadm
[INF] refreshing ceph1 facts
2022-09-01T02:47:12.431002+ mgr.ceph1.xmbvsb (mgr.254112) 470 : cephadm
[INF] refreshing ceph2 facts
2022-09-01T02:48:12.424640+ mgr.ceph1.xmbvsb (mgr.254112) 501 : cephadm
[INF] refreshing ceph1 facts
2022-09-01T02:48:12.839790+ mgr.ceph1.xmbvsb (mgr.254112) 502 : cephadm
[INF] refreshing ceph2 facts
2022-09-01T02:49:12.836875+ mgr.ceph1.xmbvsb (mgr.254112) 534 : cephadm
[INF] refreshing ceph1 facts
2022-09-01T02:49:13.210871+ mgr.ceph1.xmbvsb (mgr.254112) 535 : cephadm
[INF] refreshing ceph2 facts
2022-09-01T02:50:13.207635+ mgr.ceph1.xmbvsb (mgr.254112) 566 : cephadm
[INF] refreshing ceph1 facts
2022-09-01T02:50:13.615722+ mgr.ceph1.xmbvsb (mgr.254112) 568 : cephadm
[INF] refreshing ceph2 facts



root@ceph1:~# ceph orch ps
NAME
 HOST   STATUS REFRESHED  AGE  VERSIONIMAGE NAME
 IMAGE ID  CONTAINER ID
cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
 ceph1  stopped3m ago -  
 
cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
 ceph2  stopped3m ago -  
 
crash.ceph2
ceph1  running (12d)  3m ago 12d  15.2.17quay.io/ceph/ceph:v15
93146564743f  0a009254afb0
crash.ceph2
ceph2  running (12d)  3m ago 12d  15.2.17quay.io/ceph/ceph:v15
93146564743f  0a009254afb0
mgr.ceph2.hmbdla
 ceph1  running (43m)  3m ago 12d  16.2.10quay.io/ceph/ceph:v16.2.10
   0d668911f040  6274723c35f7
mgr.ceph2.hmbdla
 ceph2  running (43m)  3m ago 12d  16.2.10quay.io/ceph/ceph:v16.2.10
   0d668911f040  6274723c35f7
node-exporter.ceph2
ceph1  running (23m)  3m ago 12d  0.18.1
quay.io/prometheus/node-exporter:v0.18.1  e5a616e4b9cf  7a6217cb1a9e
node-exporter.ceph2
ceph2  running (23m)  3m ago 12d  0.18.1
quay.io/prometheus/node-exporter:v0.18.1  e5a616e4b9cf  7a6217cb1a9e
osd.2
ceph1  running (23h)  3m ago 12d  15.2.17quay.io/ceph/ceph:v15
93146564743f  e286fb1c6302
osd.2
ceph2  running (23h)  3m ago 12d  15.2.17quay.io/ceph/ceph:v15
93146564743f  e286fb1c6302
osd.3
ceph1  running (23h)  3m ago 12d  15.2.17quay.io/ceph/ceph:v15
93146564743f  d3ae5d9f694f
osd.3
ceph2  running (23h)  3m ago 12d  15.2.17quay.io/ceph/ceph:v15
93146564743f  d3ae5d9f694f
osd.5
ceph1  running (23h)  3m ago 8d   15.2.17quay.io/ceph/ceph:v15
93146564743f  405068fb474e
osd.5
ceph2  running (23h)  3m ago 8d   15.2.17quay.io/ceph/ceph:v15
93146564743f  405068fb474e



What could be wrong here, and how do I debug this issue? cephadm is new to me,
so I am not sure where to look for logs.
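
For reference, these seem to be the standard commands for watching a cephadm
upgrade and its logs (nothing here is specific to this cluster):

ceph orch upgrade status               # current target image and upgrade state
ceph -W cephadm                        # follow cephadm events live
ceph log last cephadm                  # recent cephadm log entries (as above)
tail -n 100 /var/log/ceph/cephadm.log  # per-host log written by cephadm itself
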
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Benefits of dockerized ceph?

2022-08-24 Thread Satish Patel
Hi,

I believe the main advantage of running dockerized Ceph is that it isolates
the binaries from the OS and, as you said, makes upgrades easier. In my case I
am running the OSD and MON roles on the same servers, so it provides greater
isolation when I want to upgrade a component.

cephadm uses containers to deploy ceph clusters in production.

On Wed, Aug 24, 2022 at 4:07 PM Boris  wrote:

> Hi,
> I was just asked if we can switch to dockerized ceph, because it is easier
> to update.
>
> Last time I tried to use ceph orch I failed really hard to get the rgw
> daemon running the way I would like (IP/port/zonegroup and so on).
> Also I never really felt comfortable running production workloads in
> docker.
>
> Now I wanted to ask the ML: are there good reasons to run ceph in docker,
> other than "updates are easier and decoupled from OS packages"?
>
> Cheers
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Suggestion to build ceph storage

2022-06-20 Thread Satish Patel
Thanks Christophe,

On Mon, Jun 20, 2022 at 11:45 AM Christophe BAILLON  wrote:

> Hi
>
> We have 20 Ceph nodes, each with 12 x 18TB HDDs and 2 x 1TB NVMe drives
>
> I tried this method to create the OSDs:
>
> ceph orch apply -i osd_spec.yaml
>
> with this conf
>
> osd_spec.yaml
> service_type: osd
> service_id: osd_spec_default
> placement:
>   host_pattern: '*'
> data_devices:
>   rotational: 1
> db_devices:
>   paths:
> - /dev/nvme0n1
> - /dev/nvme1n1
>
> this created 6 OSDs with WAL/DB on /dev/nvme0n1 and 6 on /dev/nvme1n1 per
> node
>
>
Does cephadm automatically create the partitions for WAL/DB, or is it
something I have to define in the config? (Sorry, I am new to cephadm; we are
currently using ceph-ansible, and I heard cephadm will replace ceph-ansible
soon. Is that correct?)
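
On the sizing question: my understanding from the docs (not verified on this
cluster) is that the DB LV size can also be pinned explicitly in the same spec
with block_db_size, something like the following, where the value is only an
illustration:

service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  paths:
    - /dev/nvme0n1
    - /dev/nvme1n1
block_db_size: 161061273600  # ~150 GiB per OSD, in bytes

Besides lvs, I think "ceph daemon osd.<id> perf dump" (the bluefs section,
db_total_bytes / db_used_bytes) shows what was actually allocated and used.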


> but when I do an lvs, I see only 6 x 80 GB partitions on each nvme...
>
> I think this is dynamic sizing, but I'm not sure; I don't know how to
> check it...
>
> Our cluster will only host a couple of kinds of files, a small one and a big
> one (~2 GB), for CephFS-only use, and with only 8 users accessing the data
>


How many MDS nodes do you have for your cluster size? Are your MDS nodes
dedicated, or shared with the OSDs?


> I don't know if this is optimal; we are still in the testing process...
>
> - Original Message -
> > From: "Stefan Kooman" 
> > To: "Jake Grimmett" , "Christian Wuerdig" <
> christian.wuer...@gmail.com>, "Satish Patel"
> > 
> > Cc: "ceph-users" 
> > Sent: Monday, 20 June 2022 16:59:58
> > Subject: [ceph-users] Re: Suggestion to build ceph storage
>
> > On 6/20/22 16:47, Jake Grimmett wrote:
> >> Hi Stefan
> >>
> >> We use cephfs for our 7200CPU/224GPU HPC cluster, for our use-case
> >> (large-ish image files) it works well.
> >>
> >> We have 36 ceph nodes, each with 12 x 12TB HDD, 2 x 1.92TB NVMe, plus a
> >> 240GB System disk. Four dedicated nodes have NVMe for metadata pool, and
> >> provide mon,mgr and MDS service.
> >>
> >> I'm not sure you need 4% of OSD for wal/db, search this mailing list
> >> archive for a definitive answer, but my personal notes are as follows:
> >>
> >> "If you expect lots of small files: go for a DB that's > ~300 GB
> >> For mostly large files you are probably fine with a 60 GB DB.
> >> 266 GB is the same as 60 GB, due to the way the cache multiplies at each
> >> level, spills over during compaction."
> >
> > There is (experimental ...) support for dynamic sizing in Pacific [1].
> > Not sure if it's stable yet in Quincy.
> >
> > Gr. Stefan
> >
> > [1]:
> >
> https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#sizing
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> Christophe BAILLON
> Mobile :: +336 16 400 522
> Work :: https://eyona.com
> Twitter :: https://twitter.com/ctof
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Suggestion to build ceph storage

2022-06-20 Thread Satish Patel
Thanks, Jake.

On Mon, Jun 20, 2022 at 10:47 AM Jake Grimmett 
wrote:

> Hi Stefan
>
> We use cephfs for our 7200CPU/224GPU HPC cluster, for our use-case
> (large-ish image files) it works well.
>
> We have 36 ceph nodes, each with 12 x 12TB HDD, 2 x 1.92TB NVMe, plus a
> 240GB System disk. Four dedicated nodes have NVMe for metadata pool, and
> provide mon,mgr and MDS service.
>

This is great info. I am assuming we don't need redundancy for the NVMe,
because if it fails it will impact only 6 OSDs, and that is acceptable. At
present, because of limited HW supply, I am planning to host the MDS on the
same OSD nodes (no dedicated HW for MDS). Agreed, that is not best practice,
but currently I am dealing with all unknowns and I don't want to throw money
at something we don't understand yet. I may have more data once we start
using the cluster, and then I can adjust the requirements accordingly.


>
> I'm not sure you need 4% of OSD for wal/db, search this mailing list
> archive for a definitive answer, but my personal notes are as follows:
>
> "If you expect lots of small files: go for a DB that's > ~300 GB
> For mostly large files you are probably fine with a 60 GB DB.
> 266 GB is the same as 60 GB, due to the way the cache multiplies at each
> level, spills over during compaction."
>
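
As I understand the reasoning behind that note (this is only the rule of
thumb from past list threads, assuming RocksDB's default ~10x level sizing,
not something I have measured), the useful DB sizes come in steps:

L1 ~ 0.25 GB
L2 ~ 2.5 GB
L3 ~ 25 GB    -> levels up to L3 total ~28 GB; with compaction headroom,
                 ~60 GB is enough
L4 ~ 250 GB   -> the next useful step needs ~280 GB of levels, i.e. ~300+ GB

so a DB sized between two steps effectively behaves like the lower one.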

We don't know what kind of workload we are going to run, because currently all
they ask for is large storage with many, many drives. In the future, if they
ask for more IOPS, we may replace some boxes with NVMe or SSD and adjust the
requirements.


>
> We use a single enterprise quality 1.9TB NVMe for each 6 OSDs to good
> effect, you probably need 1DWPD to be safe. I suspect you might be able
> to increase the ratio of HDD per NVMe with PCIe gen4 NVMe drives.
>
>
Can you share which vendor's NVMe drives you are using?


> best regards,
>
> Jake
>
> On 20/06/2022 08:22, Stefan Kooman wrote:
> > On 6/19/22 23:23, Christian Wuerdig wrote:
> >> On Sun, 19 Jun 2022 at 02:29, Satish Patel 
> wrote:
> >>
> >>> Greeting folks,
> >>>
> >>> We are planning to build Ceph storage for mostly cephFS for HPC
> workload
> >>> and in future we are planning to expand to S3 style but that is yet
> >>> to be
> >>> decided. Because we need mass storage, we bought the following HW.
> >>>
> >>> 15 Total servers and each server has a 12x18TB HDD (spinning disk) . We
> >>> understand SSD/NvME would be best fit but it's way out of budget.
> >>>
> >>> I hope you have extra HW on hand for Monitor and MDS  servers
> >
> > ^^ this. It also depends on the uptime guarantees you have to provide
> > (if any). Are the HPC users going to write large files? Or loads of
> > small files? The more metadata operations the busier the MDSes will be,
> > but if it's mainly large files the load on them will be much lower.
> >>
> >>> Ceph recommends using a faster disk for wal/db if the data disk is
> >>> slow and
> >>> in my case I do have a slower disk for data.
> >>>
> >>> Question:
> >>> 1. Let's say if i want to put a NvME disk for wal/db then what size i
> >>> should buy.
> >>>
> >>
> >> The official recommendation is to budget 4% of OSD size for WAL/DB -
> >> so in
> >> your case that would be 720GB per OSD. Especially if you want to go to
> S3
> >> later you should stick closer to that limit since RGW is a heavy meta
> >> data
> >> user.
> >
> > CephFS can be metadata heavy also, depending on work load. You can
> > co-locate the S3 service on this cluster later on, but from an
> > operational perspective this might not be preferred: you can tune the
> > hardware / configuration for each use case. Easier to troubleshoot,
> > independent upgrade cycle, etc.
> >
> > Gr. Stefan
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> For help, read https://www.mrc-lmb.cam.ac.uk/scicomp/
> then contact unixad...@mrc-lmb.cam.ac.uk
> --
> Dr Jake Grimmett
> Head Of Scientific Computing
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> Phone 01223 267019
> Mobile 0776 9886539
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Suggestion to build ceph storage

2022-06-18 Thread Satish Patel
Greeting folks,

We are planning to build Ceph storage, mostly for CephFS for an HPC workload,
and in the future we plan to expand to S3-style storage, but that is yet to be
decided. Because we need mass storage, we bought the following HW.

15 total servers, and each server has 12x 18TB HDDs (spinning disks). We
understand SSD/NVMe would be the best fit, but it's way out of budget.

Ceph recommends using a faster disk for the WAL/DB if the data disks are slow,
and in my case I do have slower disks for data.

Questions:
1. If I want to put in an NVMe disk for WAL/DB, what size should I buy?
2. Do I need a separate WAL/DB partition for each OSD, or can a single
partition be shared by all OSDs? (See the sketch below.)
3. Can I put the OS on the same disk where the WAL/DB is going to sit?
(This way I don't need to spend extra money on an extra disk.)

Any suggestions you have for this kind of storage would be much
appreciated.
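
Regarding question 2, my understanding so far (please correct me if this is
wrong) is that each OSD gets its own WAL/DB LV carved out of the shared fast
device rather than one partition shared by all OSDs. With ceph-volume, which
ceph-ansible drives under the hood, a preview would look something like this
(device names are placeholders; --report only prints the plan):

ceph-volume lvm batch --report --bluestore \
    /dev/sdb /dev/sdc /dev/sdd \
    --db-devices /dev/nvme0n1
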
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io