[ceph-users] cephx server mgr.a: couldn't find entity name: mgr.a

2022-12-01 Thread Sagara Wijetunga
Hi. I'm trying to enable Cephx on a cluster that is already running without Cephx. Here is what I did: 1. I shut down the cluster. 2. Enabled Cephx in ceph.conf for the Mon and Mgr. 3. Brought the Monitor cluster up. No issue. 4. Tried to bring the first Manager up and I'm getting the following error: === mgr.a ===
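
The "couldn't find entity name: mgr.a" message usually means the monitors have no cephx key for mgr.a yet, because the cluster previously ran with auth disabled. A minimal sketch of the pieces involved, assuming the default keyring path for a manager with id "a" (adjust ids and paths to your layout):
  # in ceph.conf on all nodes:
  #   auth_cluster_required = cephx
  #   auth_service_required = cephx
  #   auth_client_required  = cephx
  # create a key for the manager and store it where mgr.a looks for it
  ceph auth get-or-create mgr.a mon 'allow profile mgr' osd 'allow *' mds 'allow *' \
      -o /var/lib/ceph/mgr/ceph-a/keyring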

[ceph-users] Ceph radosgw cannot bring up

2022-11-26 Thread Sagara Wijetunga
Hi all. I have a running Ceph cluster (ceph_mon, ceph_mgr, ceph_osd and ceph_mds) on IP addresses A, B and C. I have installed Ceph radosgw on IP address X (Ubuntu 22.04) and configured it to listen on port 9000. When I bring up the Ceph radosgw, port 9000 does not seem to be active and I'm following
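
A minimal sketch of the pieces usually involved in getting radosgw onto port 9000, assuming the beast frontend and a hypothetical instance name "gw1" (not from the thread):
  # ceph.conf on the radosgw host
  [client.rgw.gw1]
      rgw_frontends = beast port=9000
  # after (re)starting the daemon, confirm something is listening
  ss -tlnp | grep 9000
  curl -i http://127.0.0.1:9000/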

[ceph-users] Re: One mds daemon damaged, filesystem is offline. How to recover?

2021-05-23 Thread Sagara Wijetunga
On Sunday, May 23, 2021, 01:16:12 AM GMT+8, Eugen Block wrote: Awesome! I'm glad it worked out this far! At least you have a working filesystem now, even if it means that you may have to use a backup. But now I can say it: Having only three OSDs is really not the best idea. ;-) Are all

[ceph-users] Re: One mds daemon damaged, filesystem is offline. How to recover?

2021-05-22 Thread Sagara Wijetunga
Hi Eugen. Now Ceph is HEALTH_OK. > I think what we need to do now is: > 1. Get MDS.0 to recover, discard if necessary part of the object > 200.6048, and bring MDS.0 up. Yes, I agree, I just can't tell what the best way is here, maybe remove all three objects from the disks
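
For reference, a sketch of the documented CephFS journal recovery sequence that discards unreadable journal events, assuming the filesystem is named "cephfs" and the damaged rank is 0 (both names assumed, not taken from the thread):
  # back up the rank 0 journal before any destructive step
  cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
  # salvage what can be salvaged from the journal into the metadata store
  cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
  # reset the damaged journal
  cephfs-journal-tool --rank=cephfs:0 journal reset
  # mark the rank repaired so a standby MDS can take it over
  ceph mds repaired cephfs:0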

[ceph-users] Re: One mds daemon damaged, filesystem is offline. How to recover?

2021-05-22 Thread Sagara Wijetunga
Sorry, the above post has to be corrected as: "From the info that has emerged so far, it seems the Ceph client wanted to write an object of size 1555896 but managed to write only 1540096 bytes to the journal." Sagara On Saturday, May 22, 2021, 08:29:34 PM GMT+8, Sagara Wijetunga wrote:

[ceph-users] Re: One mds daemon damaged, filesystem is offline. How to recover?

2021-05-22 Thread Sagara Wijetunga
From the info that has emerged so far, it seems the Ceph client wanted to write an object of size 1555896 but managed to write only 1555896 bytes to the journal. I think what we need to do now is: 1. Get MDS.0 to recover, discard if necessary part of the object 200.6048, and bring MDS.0 up. 2. Do
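
Before discarding anything, the extent of the journal damage can be checked and the journal backed up with cephfs-journal-tool; a small sketch, again assuming rank 0 of a filesystem named "cephfs":
  # report whether the journal is readable and where it is damaged
  cephfs-journal-tool --rank=cephfs:0 journal inspect
  # keep a raw copy before attempting recovery
  cephfs-journal-tool --rank=cephfs:0 journal export journal-rank0-backup.bin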

[ceph-users] Re: One mds daemon damaged, filesystem is offline. How to recover?

2021-05-22 Thread Sagara Wijetunga
On Saturday, May 22, 2021, 03:14:13 PM GMT+8, Eugen Block wrote: What does the MDS report in its logs from when it went down? NOTE: Power failure happened somewhere around 2021-05-20 23:56. Here are log messages from the MDS.0 log: 2021-05-20 17:26:19.358 2192d80  1 mds.a Updating MDS map to version
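
In case it helps others reading the thread: with default settings the MDS log lives under /var/log/ceph/, and its verbosity can be raised on the MDS host if the messages around the failure are too sparse (id "a" is taken from the log line above):
  less /var/log/ceph/ceph-mds.a.log
  # temporarily increase MDS debug logging via the admin socket
  ceph daemon mds.a config set debug_mds 10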

[ceph-users] Re: One mds daemon damaged, filesystem is offline. How to recover?

2021-05-22 Thread Sagara Wijetunga
Here are the physical file sizes of the "200.6048*" copies:
OSD.0: -rw-r--r--  1 ceph  ceph  1540096 May 20 22:47 /var/lib/ceph/osd/ceph-0/current/2.44_head/200.6048__head_56F5F744__
OSD.1: -rw-r--r--  1 ceph  ceph  1540096 May 20 22:47
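
A small sketch of how such replica-by-replica comparisons can be done on filestore OSDs, assuming the copies live under the 2.44_head directories shown above (paths are abbreviated in the listing, and md5sum may be named differently on non-Linux hosts):
  # list every local on-disk copy of the object with its size
  find /var/lib/ceph/osd/ceph-*/current/2.44_head -name '200.6048__head*' -exec ls -l {} \;
  # compare contents as well as sizes (run on each OSD host)
  find /var/lib/ceph/osd/ceph-*/current/2.44_head -name '200.6048__head*' -exec md5sum {} \;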

[ceph-users] Re: One mds daemon damaged, filesystem is offline. How to recover?

2021-05-22 Thread Sagara Wijetunga
On Saturday, May 22, 2021, 03:14:13 PM GMT+8, Eugen Block wrote: What does the MDS report in its logs from when it went down? What size do you get when you run rados -p cephfs_metadata stat 200.6048?
# rados -p cephfs_metadata stat 200.6048
cephfs_metadata/200.6048 mtime
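
Beyond stat, the object can also be pulled out with rados to compare what the cluster serves against the on-disk copies; a sketch using the pool and object names from the thread:
  # size and mtime as the cluster sees them
  rados -p cephfs_metadata stat 200.6048
  # fetch the object and check its length locally
  rados -p cephfs_metadata get 200.6048 /tmp/200.6048
  ls -l /tmp/200.6048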

[ceph-users] Re: One mds daemon damaged, filesystem is offline. How to recover?

2021-05-22 Thread Sagara Wijetunga
Hi Eugen, thanks for the reply. Ceph version:
# ceph version
ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
> Can you share
> rados list-inconsistent-obj 2.44
# rados list-inconsistent-obj 2.44
{"epoch":6996,"inconsistents":[]}
> ceph tell mds. damage ls
#
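
For completeness, the usual way to refresh and inspect that scrub information (the mds id below is a placeholder, since it is elided above):
  # re-run a deep scrub so list-inconsistent-obj has fresh data, then query it
  ceph pg deep-scrub 2.44
  rados list-inconsistent-obj 2.44 --format=json-pretty
  # ask the MDS what metadata damage it has recorded
  ceph tell mds.<id> damage ls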

[ceph-users] One mds daemon damaged, filesystem is offline. How to recover?

2021-05-21 Thread Sagara Wijetunga
Hi all. An accidental power failure happened. That left CephFS offline and it cannot be mounted. I have 3 MDS daemons but the cluster complains "1 mds daemon damaged". It seems a PG of cephfs_metadata is inconsistent. I tried to repair it, but it doesn't get repaired. How do I repair the damaged MDS and
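
A sketch of the usual first-pass triage for this situation (PG 2.44 comes from later messages in the thread; the mds id is a placeholder):
  ceph health detail            # names the damaged rank and the inconsistent PG
  ceph fs status                # shows which ranks are up, failed or damaged
  ceph tell mds.<id> damage ls  # lists the metadata damage the MDS recorded
  ceph pg repair 2.44           # asks the primary OSD to repair the inconsistent PG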

[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair? [SOLVED]

2020-11-03 Thread Sagara Wijetunga
Hi Frank. Found the issue and fixed it. One copy of the object was 0 bytes. I removed it, and a deep scrub of the PG fixed the issue.
# find /var/lib/ceph/osd/ -type f -name "123675e*"
/var/lib/ceph/osd/ceph-2/current/3.b_head/DIR_B/DIR_A/DIR_E/123675e.__head_AE97EEAB__3
# ls -l
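
For anyone hitting the same thing, a sketch of the sequence used here (filestore paths; the object name is abbreviated in the listing above, and the OSD owning the bad copy should ideally be stopped while its file is removed):
  # find the on-disk copies and spot the 0-byte one
  find /var/lib/ceph/osd/ -type f -name "123675e*" -exec ls -l {} \;
  # remove the zero-length replica, on the affected OSD host only
  rm /var/lib/ceph/osd/ceph-2/current/3.b_head/DIR_B/DIR_A/DIR_E/123675e.__head_AE97EEAB__3
  # let a deep scrub reconcile the PG
  ceph pg deep-scrub 3.b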

[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-03 Thread Sagara Wijetunga
Hi Frank. 1. We will disable the disk controller and disk-level caching to avoid future issues. 2. My pools are:
ceph osd lspools
2 cephfs_metadata
3 cephfs_data
4 rbd
The PG that is now inconsistent is 3.b; therefore, it belongs to the cephfs_data pool. The following also shows that PG 3.b belongs
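
The pool-to-PG mapping can also be confirmed directly; a quick sketch using the ids from the thread:
  ceph osd lspools                              # pool 3 is cephfs_data, so PG 3.b belongs to it
  ceph pg map 3.b                               # up and acting OSD sets for the PG
  ceph pg ls-by-pool cephfs_data inconsistent   # inconsistent PGs in that pool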

[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Sagara Wijetunga
> Hmm, I'm getting a bit confused. Could you also send the output of "ceph osd pool ls detail"? File ceph-osd-pool-ls-detail.txt attached. > Did you look at the disk/controller cache settings? I don't have disk controllers on the Ceph machines. The hard disk is directly attached to the
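
On directly attached drives it is the volatile write cache that usually matters; a sketch for Linux hosts with SATA/SAS disks (device name is a placeholder, and other platforms or NVMe devices need different tooling):
  # query the drive's write-cache state
  hdparm -W /dev/sdX
  # disable it so acknowledged writes are on stable media
  hdparm -W0 /dev/sdX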

[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Sagara Wijetunga
Hi Frank > the primary OSD is probably not listed as a peer. Can you post the complete > output of > - ceph pg 3.b query > - ceph pg dump > - ceph osd df tree > in a pastebin? Yes, the Primary OSD is 0. I have attached the above as .txt files. Please let me know if you still cannot read them.
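
For the record, a trivial sketch of capturing those outputs to files before uploading them:
  ceph pg 3.b query > pg-3.b-query.txt
  ceph pg dump      > pg-dump.txt
  ceph osd df tree  > osd-df-tree.txt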

[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Sagara Wijetunga
Hi Frank > Please note, there is no peer 0 in "ceph pg 3.b query". Also no word osd. I checked other PGs with "active+clean"; there is a "peer": "0". But "ceph pg pgid query" always shows only two peers, sometimes peers 0 and 1, or 1 and 2, or 0 and 2, etc. Regards Sagara

[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Sagara Wijetunga
Hi Frank > looks like you have one on a new and 2 on an old version. Can you add the > information about which OSD each version resides on? The "ceph pg 3.b query" shows the following:
    "peer_info": [
        {
            "peer": "1",
            "pgid": "3.b",
            "last_update":
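
A compact way to compare the peers' versions, assuming jq is available ("ceph pg <pgid> query" prints JSON, so the relevant fields can be pulled out directly):
  # one line per peer: which OSD it is and how far its log goes
  ceph pg 3.b query | jq '.peer_info[] | {peer, last_update, last_complete}'
  # the up and acting OSD sets for the same PG
  ceph pg 3.b query | jq '{up, acting}'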

[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Sagara Wijetunga
Hi Frank > I'm not sure if my hypothesis can be correct. Ceph sends an acknowledgement of a > write only after all copies are on disk. In other words, if PGs end up on > different versions after a power outage, one always needs to roll back. Since > you have two healthy OSDs in the PG and the PG
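
For context on that acknowledgement behaviour, the replica counts involved can be read from the pool settings; a minimal sketch for the pool discussed in this thread:
  ceph osd pool get cephfs_data size       # number of copies kept
  ceph osd pool get cephfs_data min_size   # copies required to serve I/O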

[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-01 Thread Sagara Wijetunga
Hi Frank, thanks for the reply. > I think this happens when a PG has 3 different copies and cannot decide which > one is correct. You might have hit a very rare case. You should start with > the scrub errors, check which PGs and which copies (OSDs) are affected. It > sounds almost like all 3
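
A sketch of that triage, using the pool and PG names that appear later in the thread:
  ceph health detail                                     # which PGs carry scrub errors
  rados list-inconsistent-pg cephfs_data                 # inconsistent PGs in a pool
  rados list-inconsistent-obj 3.b --format=json-pretty   # per-object detail, including which copy differs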

[ceph-users] How to recover from active+clean+inconsistent+failed_repair?

2020-11-01 Thread Sagara Wijetunga
Hi all. I have a Ceph cluster (Nautilus 14.2.11) with 3 Ceph nodes. A crash happened and all 3 Ceph nodes went down. One (1) PG turned "active+clean+inconsistent", and I tried to repair it. After the repair, the PG in question now shows "active+clean+inconsistent+failed_repair" and cannot
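
The commands behind those steps, for readers landing here from a search (PG id 3.b is taken from the rest of the thread):
  ceph -s               # overall state and scrub error count
  ceph health detail    # names the inconsistent PG
  ceph pg repair 3.b    # the repair attempt described above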