Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?
Hello, thanks for the hint. I opened a ticket with a feature request to include the ec-profile information in the output of ceph osd pool ls detail: http://tracker.ceph.com/issues/40009 Rainer On 22.05.19 at 17:04, Jan Fajerski wrote: > On Wed, May 22, 2019 at 03:38:27PM +0200, Rainer Krienke wrote: >> On 22.05.19 at 15:16, Dan van der Ster wrote: >> >> Yes, this is basically what I was looking for; however, I had expected that >> it's a little better visible in the output... > Mind opening a tracker ticket on http://tracker.ceph.com/ so we can have > this added to the non-json output of ceph osd pool ls detail? >> -- Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312 Web: http://userpages.uni-koblenz.de/~krienke PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
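Until that lands in the plain-text output, the pool-to-profile mapping and the k/m values can already be pulled out with two commands — a small sketch using the names from this thread (the JSON field names match the Nautilus-era output Dan showed; profile output trimmed):

$ ceph osd pool ls detail -f json | jq '.[] | {pool_name, erasure_code_profile}'
$ ceph osd erasure-code-profile get ecprofile-42
k=4
m=2
plugin=jerasure
technique=reed_sol_van
...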
Re: [ceph-users] assume_role() :http_code 405 error
On Thu, May 23, 2019 at 9:24 AM Yuan Minghui wrote: > Hello: > > The version I am using is Ceph Luminous 12.2.4. Which versions of Ceph > support AssumeRole or STS? > > > STS is available in Nautilus (v14.2.0) and in later releases. Thanks, Pritha > Thanks a lot. > > kyle > > > > *From:* Pritha Srivastava > *Date:* Thursday, May 23, 2019, 11:49 AM > *To:* Yuan Minghui > *Cc:* "ceph-users@lists.ceph.com" > *Subject:* Re: [ceph-users] assume_role() :http_code 405 error > > > > Hello, > > It looks like the version that you are trying this on doesn't support > AssumeRole or STS. What version of Ceph are you using? > > Thanks, > > Pritha > > > > On Thu, May 23, 2019 at 9:10 AM Yuan Minghui > wrote: > > Hello: > > When I try to create a secure temporary session (STS), I run the following > actions: > > s3 = session.client('sts', > aws_access_key_id=tomAccessKey, > aws_secret_access_key=tomSecretKey, > endpoint_url=host > ) # returns a low-level client instance > > response = s3.assume_role(RoleArn='arn:aws:iam:::role/S3Access1', > RoleSessionName='test_session1', > ) > > > > however, it returns the following error: > > [inline screenshot of the HTTP 405 error response] > > Can someone help with this problem? > > > > Thanks a lot. > > Kyle > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
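For later readers: on Nautilus the gateway also has to have STS enabled before assume_role() stops returning 405. A minimal sketch — the section name, port and STS key below are assumptions, not taken from Kyle's setup; the role ARN is the one from this thread:

# ceph.conf on the radosgw host
[client.rgw.gateway]
rgw sts key = abcdefghijklmnop        # 16-character key used to encrypt the session token
rgw s3 auth use sts = true

# restart the gateway, then request temporary credentials, e.g. via the AWS CLI:
$ aws sts assume-role \
      --endpoint-url http://rgw-host:8000 \
      --role-arn "arn:aws:iam:::role/S3Access1" \
      --role-session-name test_session1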
Re: [ceph-users] assume_role() :http_code 405 error
Hello: The version I am using is Ceph Luminous 12.2.4. Which versions of Ceph support AssumeRole or STS? Thanks a lot. kyle From: Pritha Srivastava Date: Thursday, May 23, 2019, 11:49 AM To: Yuan Minghui Cc: "ceph-users@lists.ceph.com" Subject: Re: [ceph-users] assume_role() :http_code 405 error Hello, It looks like the version that you are trying this on doesn't support AssumeRole or STS. What version of Ceph are you using? Thanks, Pritha On Thu, May 23, 2019 at 9:10 AM Yuan Minghui wrote: Hello: When I try to create a secure temporary session (STS), I run the following actions: s3 = session.client('sts', aws_access_key_id=tomAccessKey, aws_secret_access_key=tomSecretKey, endpoint_url=host ) # returns a low-level client instance response = s3.assume_role(RoleArn='arn:aws:iam:::role/S3Access1', RoleSessionName='test_session1', ) however, it returns an HTTP 405 error (screenshot not included). Can someone help with this problem? Thanks a lot. Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] assume_role() :http_code 405 error
Hello, It looks like the version that you are trying this on doesn't support AssumeRole or STS. What version of Ceph are you using? Thanks, Pritha On Thu, May 23, 2019 at 9:10 AM Yuan Minghui wrote: > Hello: > > When I try to create a secure temporary session (STS), I run the following > actions: > > s3 = session.client('sts', > aws_access_key_id=tomAccessKey, > aws_secret_access_key=tomSecretKey, > endpoint_url=host > ) # returns a low-level client instance > > response = s3.assume_role(RoleArn='arn:aws:iam:::role/S3Access1', > RoleSessionName='test_session1', > ) > > > > however, it returns an HTTP 405 error (screenshot not included). > > Can someone help with this problem? > > > > Thanks a lot. > > Kyle > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] assume_role() :http_code 405 error
Hello: When I try to create a secure temporary session (STS), I run the following actions: s3 = session.client('sts', aws_access_key_id=tomAccessKey, aws_secret_access_key=tomSecretKey, endpoint_url=host ) # returns a low-level client instance response = s3.assume_role(RoleArn='arn:aws:iam:::role/S3Access1', RoleSessionName='test_session1', ) however, it returns an HTTP 405 error (see the attached screenshot). Can someone help with this problem? Thanks a lot. Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Massive TCP connection on radosgw
Thank you for your reply. We will run the script and let you know the results once the number of TCP connections raises up. We just restarted the sever several days ago. Sent from my iPhone > On 23 May 2019, at 12:26 AM, Igor Podlesny wrote: > >> On Wed, 22 May 2019 at 20:32, Torben Hørup wrote: >> >> Which states are all these connections in ? >> >> ss -tn > > That set of the args won't display anything but ESTAB-lished conn-s.. > > One typically needs `-atn` instead. > > -- > End of message. Next message? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Massive TCP connection on radosgw
On Wed, 22 May 2019 at 20:32, Torben Hørup wrote: > > Which states are all these connections in ? > > ss -tn That set of the args won't display anything but ESTAB-lished conn-s.. One typically needs `-atn` instead. -- End of message. Next message? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
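Combining the two suggestions, a quick sketch for breaking the socket count down further (the radosgw match assumes the gateway process runs under that name and that the commands are run as root):

$ ss -atn | awk '{print $1}' | sort | uniq -c                    # all TCP sockets, grouped by state
$ ss -atnp | grep radosgw | awk '{print $1}' | sort | uniq -c    # only sockets owned by the radosgw process
$ ss -tn state established | wc -l                               # count just the established connections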
Re: [ceph-users] Major ceph disaster
On Wed, May 22, 2019 at 4:31 AM Kevin Flöh wrote: > Hi, > > thank you, it worked. The PGs are not incomplete anymore. Still we have > another problem, there are 7 PGs inconsistent and a ceph pg repair is > not doing anything. I just get "instructing pg 1.5dd on osd.24 to > repair" and nothing happens. Does somebody know how we can get the PGs > to repair? > > Regards, > > Kevin > Kevin, I just fixed an inconsistent PG yesterday. You will need to figure out why they are inconsistent. Do these steps and then we can figure out how to proceed. 1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of them.) 2. Print out the inconsistency report for each inconsistent PG: `rados list-inconsistent-obj <pgid> --format=json-pretty` 3. Look at the error messages and check whether all the shards have the same data. Robert LeBlanc ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
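In command form, those steps look roughly like this (a sketch — the PG id is the one from Kevin's mail, the pool name is a placeholder):

$ rados list-inconsistent-pg <poolname>                      # JSON list of the PGs flagged inconsistent
$ ceph pg deep-scrub 1.5dd                                   # step 1: re-run a deep scrub on one of them
$ rados list-inconsistent-obj 1.5dd --format=json-pretty     # step 2: per-object error report
$ ceph pg repair 1.5dd                                       # only once it is clear which shard is bad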
Re: [ceph-users] CephFS object mapping.
On Wed, May 22, 2019 at 12:22 AM Burkhard Linke < burkhard.li...@computational.bio.uni-giessen.de> wrote: > Hi, > > On 5/21/19 9:46 PM, Robert LeBlanc wrote: > > I'm at a new job working with Ceph again and am excited to back in the > > community! > > > > I can't find any documentation to support this, so please help me > > understand if I got this right. > > > > I've got a Jewel cluster with CephFS and we have an inconsistent PG. > > All copies of the object are zero size, but the digest says that it > > should be a non-zero size, so it seems that my two options are, delete > > the file that the object is part of, or rewrite the object with RADOS > > to update the digest. So, this leads to my question, how to I tell > > which file the object belongs to. > > > > From what I found, the object is prefixed with the hex value of the > > inode and suffixed by the stripe number: > > 1000d2ba15c.0005 > > . > > > > I then ran `find . -xdev -inum 1099732590940` and found a file on the > > CephFS file system. I just want to make sure that I found the right > > file before I start trying recovery options. > > > > The first stripe XYZ. has some metadata stored as xattr (rados > xattr, not cephfs xattr). One of the entries has the key 'parent': > When you say 'some' is it a fixed offset that the file data starts? Is the first stripe just metadata? > # ls Ubuntu16.04-WS2016-17.ova > Ubuntu16.04-WS2016-17.ova > > # ls -i Ubuntu16.04-WS2016-17.ova > 1099751898435 Ubuntu16.04-WS2016-17.ova > > # rados -p cephfs_test_data stat 1000e523d43. > cephfs_test_data/1000e523d43. mtime 2016-10-13 16:20:10.00, > size 4194304 > > # rados -p cephfs_test_data listxattr 1000e523d43. > layout > parent > > # rados -p cephfs_test_data getxattr 1000e523d43. parent | strings > Ubuntu16.04-WS2016-17.ova5: > adm2 > volumes > > > The complete path of the file is > /volumes/adm/Ubuntu16.04-WS2016-17.ova5. For a complete check you can > store the content of the parent key and use ceph-dencoder to print its > content: > > # rados -p cephfs_test_data getxattr 1000e523d43. parent > > parent.bin > > # ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json > { > "ino": 1099751898435, > "ancestors": [ > { > "dirino": 1099527190071, > "dname": "Ubuntu16.04-WS2016-17.ova", > "version": 14901 > }, > { > "dirino": 1099521974514, > "dname": "adm", > "version": 61190706 > }, > { > "dirino": 1, > "dname": "volumes", > "version": 48394885 > } > ], > "pool": 7, > "old_pools": [] > } > > > One important thing to note: ls -i prints the inode id in decimal, > cephfs uses hexadecimal for the rados object names. Thus the different > value in the above commands. > Thank you for this, this is much faster than doing a find for the inode (that took many hours, I let it run overnight and it found it some time. It took about 21 hours to search the whole filesystem.) Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RGW metadata pool migration
Hi All, Which metadata pools in an RGW deployment need to sit on the fastest storage medium to improve the client access experience? Also, is there an easy way to migrate these pools in a production scenario with minimal or, ideally, no outage? Regards, Nikhil ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS msg length greater than osd_max_write_size
Thanks for the reply! We will be more proactive about evicting clients in the future rather than waiting. One followup however, it seems that the filesystem going read only was only a WARNING state, which didn’t immediately catch our eye due to some other rebalancing operations. Is there a reason that this wouldn’t be a HEALTH_ERR condition since it represents a significant service degradation? Thanks! Ryan > On May 22, 2019, at 4:20 AM, Yan, Zheng wrote: > > On Tue, May 21, 2019 at 6:10 AM Ryan Leimenstoll > wrote: >> >> Hi all, >> >> We recently encountered an issue where our CephFS filesystem unexpectedly >> was set to read-only. When we look at some of the logs from the daemons I >> can see the following: >> >> On the MDS: >> ... >> 2019-05-18 16:34:24.341 7fb3bd610700 -1 mds.0.89098 unhandled write error >> (90) Message too long, force readonly... >> 2019-05-18 16:34:24.341 7fb3bd610700 1 mds.0.cache force file system >> read-only >> 2019-05-18 16:34:24.341 7fb3bd610700 0 log_channel(cluster) log [WRN] : >> force file system read-only >> 2019-05-18 16:34:41.289 7fb3c0616700 1 heartbeat_map is_healthy 'MDSRank' >> had timed out after 15 >> 2019-05-18 16:34:41.289 7fb3c0616700 0 mds.beacon.objmds00 Skipping beacon >> heartbeat to monitors (last acked 4.00101s ago); MDS internal heartbeat is >> not healthy! >> ... >> >> On one of the OSDs it was most likely targeting: >> ... >> 2019-05-18 16:34:24.140 7f8134e6c700 -1 osd.602 pg_epoch: 682796 pg[49.20b( >> v 682796'15706523 (682693'15703449,682796'15706523] >> local-lis/les=673041/673042 n=10524 ec=245563/245563 lis/c 673041/673041 >> les/c/f 673042/673042/0 673038/673041/668565) [602,530,558] r=0 lpr=673041 >> crt=682796'15706523 lcod 682796'15706522 mlcod 682796'15706522 active+clean] >> do_op msg data len 95146005 > osd_max_write_size 94371840 on >> osd_op(mds.0.89098:48609421 49.20b 49:d0630e4c:::mds0_sessionmap:head >> [omap-set-header,omap-set-vals] snapc 0=[] >> ondisk+write+known_if_redirected+full_force e682796) v8 >> 2019-05-18 17:10:33.695 7f813466b700 0 log_channel(cluster) log [DBG] : >> 49.31c scrub starts >> 2019-05-18 17:10:34.980 7f813466b700 0 log_channel(cluster) log [DBG] : >> 49.31c scrub ok >> 2019-05-18 22:17:37.320 7f8134e6c700 -1 osd.602 pg_epoch: 683434 pg[49.20b( >> v 682861'15706526 (682693'15703449,682861'15706526] >> local-lis/les=673041/673042 n=10525 ec=245563/245563 lis/c 673041/673041 >> les/c/f 673042/673042/0 673038/673041/668565) [602,530,558] r=0 lpr=673041 >> crt=682861'15706526 lcod 682859'15706525 mlcod 682859'15706525 active+clean] >> do_op msg data len 95903764 > osd_max_write_size 94371840 on >> osd_op(mds.0.91565:357877 49.20b 49:d0630e4c:::mds0_sessionmap:head >> [omap-set-header,omap-set-vals,omap-rm-keys] snapc 0=[] >> ondisk+write+known_if_redirected+full_force e683434) v8 >> … >> >> During this time there were some health concerns with the cluster. >> Significantly, since the error above seems to be related to the SessionMap, >> we had a client that had a few blocked requests for over 35948 secs (it’s a >> member of a compute cluster so we let the node drain/finish jobs before >> rebooting). We have also had some issues with certain OSDs running older >> hardware staying up/responding timely to heartbeats after upgrading to >> Nautilus, although that seems to be an iowait/load issue that we are >> actively working to mitigate separately. >> > > This prevent mds from trimming completed requests recorded in session. > which results a very large session item. 
> To recover, blacklist the client that has the blocked requests, then restart the mds. > >> We are running Nautilus 14.2.1 on RHEL7.6. There is only one MDS Rank, with >> an active/standby setup between two MDS nodes. MDS clients are mounted using >> the RHEL7.6 kernel driver. >> >> My read here would be that the MDS is sending too large a message to the >> OSD, however my understanding was that the MDS should be using >> osd_max_write_size to determine the size of that message [0]. Is this maybe >> a bug in how this is calculated on the MDS side? >> >> >> Thanks! >> Ryan Leimenstoll >> rleim...@umiacs.umd.edu >> University of Maryland Institute for Advanced Computer Studies >> >> >> >> [0] https://www.spinics.net/lists/ceph-devel/msg11951.html >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?
On Wed, May 22, 2019 at 03:38:27PM +0200, Rainer Krienke wrote:
> On 22.05.19 at 15:16, Dan van der Ster wrote:
> Yes, this is basically what I was looking for; however, I had expected that it's a little better visible in the output...
Mind opening a tracker ticket on http://tracker.ceph.com/ so we can have this added to the non-json output of ceph osd pool ls detail?
> Rainer
>> Is this what you're looking for?
>> # ceph osd pool ls detail -f json | jq .[0].erasure_code_profile
>> "jera_4plus2"
>> -- Dan
> -- Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312 Web: http://userpages.uni-koblenz.de/~krienke PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
-- Jan Fajerski Engineer Enterprise Storage SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?
On 22.05.19 at 15:16, Dan van der Ster wrote: Yes, this is basically what I was looking for; however, I had expected that it's a little better visible in the output... Rainer > > Is this what you're looking for? > > # ceph osd pool ls detail -f json | jq .[0].erasure_code_profile > "jera_4plus2" > > -- Dan -- Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312 Web: http://userpages.uni-koblenz.de/~krienke PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Massive TCP connection on radosgw
Which states are all these connections in ? ss -tn | awk '{print $1}' | sort | uniq -c /Torben On 22.05.2019 15:19, Li Wang wrote: > Hi guys, > > Any help here? > > Sent from my iPhone > > On 20 May 2019, at 2:48 PM, John Hearns wrote: > > I found similar behaviour on a Nautilus cluster on Friday. Around 300 000 > open connections which I think were the result of a benchmarking run which > was terminated. I restarted the radosgw service to get rid of them. > > On Mon, 20 May 2019 at 06:56, Li Wang wrote: Dear ceph > community members, > > We have a ceph cluster (mimic 13.2.4) with 7 nodes and 130+ OSDs. However, we > observed over 70 millions active TCP connections on the radosgw host, which > makes the radosgw very unstable. > > After further investigation, we found most of the TCP connections on the > radosgw are connected to OSDs. > > May I ask what might be the possible reason causing the the massive amount of > TCP connection? And is there anything configuration or tuning work that I can > do to solve this issue? > > Any suggestion is highly appreciated. > > Regards, > Li Wang > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Massive TCP connection on radosgw
Hi guys, Any help here? Sent from my iPhone > On 20 May 2019, at 2:48 PM, John Hearns wrote: > > I found similar behaviour on a Nautilus cluster on Friday. Around 300 000 > open connections which I think were the result of a benchmarking run which > was terminated. I restarted the radosgw service to get rid of them. > >> On Mon, 20 May 2019 at 06:56, Li Wang wrote: >> Dear ceph community members, >> >> We have a ceph cluster (mimic 13.2.4) with 7 nodes and 130+ OSDs. However, >> we observed over 70 millions active TCP connections on the radosgw host, >> which makes the radosgw very unstable. >> >> After further investigation, we found most of the TCP connections on the >> radosgw are connected to OSDs. >> >> May I ask what might be the possible reason causing the the massive amount >> of TCP connection? And is there anything configuration or tuning work that I >> can do to solve this issue? >> >> Any suggestion is highly appreciated. >> >> Regards, >> Li Wang >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?
On Wed, May 22, 2019 at 3:03 PM Rainer Krienke wrote: > > Hello, > > I created an erasure code profile named ecprofile-42 with the following > parameters: > > $ ceph osd erasure-code-profile set ecprofile-42 plugin=jerasure k=4 m=2 > > Next I created a new pool using the ec profile from above: > > $ ceph osd pool create my_erasure_pool 64 64 erasure ecprofile-42 > > The pool created then has an autogenerated crush rule with the contents > as shown at the end of this mail (see: ceph osd crush rule dump > my_erasure_pool). > > What I am missing in the output of the crush rule dump below are the k,m > values used for this pool or a "link" from the crushrule to the erasure > code profile that contains these settings and was used creating the pool > and thus the ec crushrule. If I had several ec profiles and pools > created with the different ec profiles how else could I see which k,m > values were used for the different pools? > > For a replicated crush rule there is the size parameter which is part of > the crush-rule and indirectly tells you the number of replicas, but what > about erasure coded pools? Is this what you're looking for? # ceph osd pool ls detail -f json | jq .[0].erasure_code_profile "jera_4plus2" -- Dan > > Probably there is somewhere the link I am looking for, but I din't find > it yet... > > Thanks Rainer > > # > # autogenerated crush rule my_erasure_pool: > # > $ ceph osd crush rule dump my_erasure_pool > { > "rule_id": 1, > "rule_name": "my_erasure_pool", > "ruleset": 1, > "type": 3, > "min_size": 3, > "max_size": 6, > "steps": [ > { > "op": "set_chooseleaf_tries", > "num": 5 > }, > { > "op": "set_choose_tries", > "num": 100 > }, > { > "op": "take", > "item": -1, > "item_name": "default" > }, > { > "op": "chooseleaf_indep", > "num": 0, > "type": "host" > }, > { > "op": "emit" > } > ] > } > > -- > Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1 > 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312 > Web: http://userpages.uni-koblenz.de/~krienke > PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?
CRUSH only specifies where the chunks are placed, not how many chunks there are (the pool specifies this) This is the same with replicated rules: pool specifies the number of replicas, the rule where they are put. You can use one CRUSH rule for multiple ec pools Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Wed, May 22, 2019 at 3:03 PM Rainer Krienke wrote: > Hello, > > I created an erasure code profile named ecprofile-42 with the following > parameters: > > $ ceph osd erasure-code-profile set ecprofile-42 plugin=jerasure k=4 m=2 > > Next I created a new pool using the ec profile from above: > > $ ceph osd pool create my_erasure_pool 64 64 erasure ecprofile-42 > > The pool created then has an autogenerated crush rule with the contents > as shown at the end of this mail (see: ceph osd crush rule dump > my_erasure_pool). > > What I am missing in the output of the crush rule dump below are the k,m > values used for this pool or a "link" from the crushrule to the erasure > code profile that contains these settings and was used creating the pool > and thus the ec crushrule. If I had several ec profiles and pools > created with the different ec profiles how else could I see which k,m > values were used for the different pools? > > For a replicated crush rule there is the size parameter which is part of > the crush-rule and indirectly tells you the number of replicas, but what > about erasure coded pools? > > Probably there is somewhere the link I am looking for, but I din't find > it yet... > > Thanks Rainer > > # > # autogenerated crush rule my_erasure_pool: > # > $ ceph osd crush rule dump my_erasure_pool > { > "rule_id": 1, > "rule_name": "my_erasure_pool", > "ruleset": 1, > "type": 3, > "min_size": 3, > "max_size": 6, > "steps": [ > { > "op": "set_chooseleaf_tries", > "num": 5 > }, > { > "op": "set_choose_tries", > "num": 100 > }, > { > "op": "take", > "item": -1, > "item_name": "default" > }, > { > "op": "chooseleaf_indep", > "num": 0, > "type": "host" > }, > { > "op": "emit" > } > ] > } > > -- > Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1 > 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312 > Web: http://userpages.uni-koblenz.de/~krienke > PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
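To make that concrete with the names from this thread — the pool carries the profile (and hence k+m), the rule only decides placement; a short sketch:

$ ceph osd pool get my_erasure_pool erasure_code_profile     # -> ecprofile-42: the pool knows its profile
$ ceph osd pool get my_erasure_pool size                     # -> 6, i.e. k+m chunks for k=4 m=2
$ ceph osd crush rule dump my_erasure_pool                   # the rule only says where the chunks are placed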
[ceph-users] Erasure code profiles and crush rules. Missing link...?
Hello, I created an erasure code profile named ecprofile-42 with the following parameters: $ ceph osd erasure-code-profile set ecprofile-42 plugin=jerasure k=4 m=2 Next I created a new pool using the ec profile from above: $ ceph osd pool create my_erasure_pool 64 64 erasure ecprofile-42 The pool created then has an autogenerated crush rule with the contents as shown at the end of this mail (see: ceph osd crush rule dump my_erasure_pool). What I am missing in the output of the crush rule dump below are the k,m values used for this pool or a "link" from the crushrule to the erasure code profile that contains these settings and was used creating the pool and thus the ec crushrule. If I had several ec profiles and pools created with the different ec profiles how else could I see which k,m values were used for the different pools? For a replicated crush rule there is the size parameter which is part of the crush-rule and indirectly tells you the number of replicas, but what about erasure coded pools? Probably there is somewhere the link I am looking for, but I din't find it yet... Thanks Rainer # # autogenerated crush rule my_erasure_pool: # $ ceph osd crush rule dump my_erasure_pool { "rule_id": 1, "rule_name": "my_erasure_pool", "ruleset": 1, "type": 3, "min_size": 3, "max_size": 6, "steps": [ { "op": "set_chooseleaf_tries", "num": 5 }, { "op": "set_choose_tries", "num": 100 }, { "op": "take", "item": -1, "item_name": "default" }, { "op": "chooseleaf_indep", "num": 0, "type": "host" }, { "op": "emit" } ] } -- Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312 Web: http://userpages.uni-koblenz.de/~krienke PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Major ceph disaster
It's been suggested here in the past to disable deep scrubbing temporarily before running the repair because it does not execute immediately but gets queued up behind deep scrubs. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
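In command form (a sketch — the PG id is the one mentioned in this thread; the flag is cluster-wide, so unset it afterwards):

$ ceph osd set nodeep-scrub      # stop new deep scrubs from being scheduled
$ ceph pg repair 1.5dd           # the repair should now run instead of queueing behind deep scrubs
$ ceph osd unset nodeep-scrub    # re-enable deep scrubs once the repair has finished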
Re: [ceph-users] Major ceph disaster
Hi, thank you, it worked. The PGs are not incomplete anymore. Still we have another problem, there are 7 PGs inconsistent and a cpeh pg repair is not doing anything. I just get "instructing pg 1.5dd on osd.24 to repair" and nothing happens. Does somebody know how we can get the PGs to repair? Regards, Kevin On 21.05.19 4:52 nachm., Wido den Hollander wrote: On 5/21/19 4:48 PM, Kevin Flöh wrote: Hi, we gave up on the incomplete pgs since we do not have enough complete shards to restore them. What is the procedure to get rid of these pgs? You need to start with marking the OSDs as 'lost' and then you can force_create_pg to get the PGs back (empty). Wido regards, Kevin On 20.05.19 9:22 vorm., Kevin Flöh wrote: Hi Frederic, we do not have access to the original OSDs. We exported the remaining shards of the two pgs but we are only left with two shards (of reasonable size) per pg. The rest of the shards displayed by ceph pg query are empty. I guess marking the OSD as complete doesn't make sense then. Best, Kevin On 17.05.19 2:36 nachm., Frédéric Nass wrote: Le 14/05/2019 à 10:04, Kevin Flöh a écrit : On 13.05.19 11:21 nachm., Dan van der Ster wrote: Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs? It would be useful to double confirm that: check with `ceph pg query` and `ceph pg dump`. (If so, this is why the ignore history les thing isn't helping; you don't have the minimum 3 stripes up for those 3+1 PGs.) yes, but as written in my other mail, we still have enough shards, at least I think so. If those "lost" OSDs by some miracle still have the PG data, you might be able to export the relevant PG stripes with the ceph-objectstore-tool. I've never tried this myself, but there have been threads in the past where people export a PG from a nearly dead hdd, import to another OSD, then backfilling works. guess that is not possible. Hi Kevin, You want to make sure of this. Unless you recreated the OSDs 4 and 23 and had new data written on them, they should still host the data you need. What Dan suggested (export the 7 inconsistent PGs and import them on a healthy OSD) seems to be the only way to recover your lost data, as with 4 hosts and 2 OSDs lost, you're left with 2 chunks of data/parity when you actually need 3 to access it. Reducing min_size to 3 will not help. Have a look here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html This is probably the best way you want to follow form now on. Regards, Frédéric. If OTOH those PGs are really lost forever, and someone else should confirm what I say here, I think the next step would be to force recreate the incomplete PGs then run a set of cephfs scrub/repair disaster recovery cmds to recover what you can from the cephfs. -- dan would this let us recover at least some of the data on the pgs? If not we would just set up a new ceph directly without fixing the old one and copy whatever is left. 
Best regards, Kevin On Mon, May 13, 2019 at 4:20 PM Kevin Flöh wrote: Dear ceph experts, we have several (maybe related) problems with our ceph cluster, let me first show you the current ceph status: cluster: id: 23e72372-0d44-4cad-b24f-3641b14b86f4 health: HEALTH_ERR 1 MDSs report slow metadata IOs 1 MDSs report slow requests 1 MDSs behind on trimming 1/126319678 objects unfound (0.000%) 19 scrub errors Reduced data availability: 2 pgs inactive, 2 pgs incomplete Possible data damage: 7 pgs inconsistent Degraded data redundancy: 1/500333881 objects degraded (0.000%), 1 pg degraded 118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91 services: mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02 mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3 up:standby osd: 96 osds: 96 up, 96 in data: pools: 2 pools, 4096 pgs objects: 126.32M objects, 260TiB usage: 372TiB used, 152TiB / 524TiB avail pgs: 0.049% pgs not active 1/500333881 objects degraded (0.000%) 1/126319678 objects unfound (0.000%) 4076 active+clean 10 active+clean+scrubbing+deep 7 active+clean+inconsistent 2 incomplete 1 active+recovery_wait+degraded io: client: 449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr and ceph health detail: HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19 scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs incomplete; Possible data damage: 7 pgs inconsistent; De
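For reference, Wido's procedure in command form — destructive, and only appropriate once the data on those PGs is accepted as lost (the OSD ids are the two discussed in this thread, the PG id is a placeholder):

$ ceph osd lost 4 --yes-i-really-mean-it
$ ceph osd lost 23 --yes-i-really-mean-it
$ ceph osd force-create-pg <pgid>      # recreates the incomplete PG empty; newer releases also require --yes-i-really-mean-it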
Re: [ceph-users] CephFS msg length greater than osd_max_write_size
On Tue, May 21, 2019 at 6:10 AM Ryan Leimenstoll wrote: > > Hi all, > > We recently encountered an issue where our CephFS filesystem unexpectedly was > set to read-only. When we look at some of the logs from the daemons I can see > the following: > > On the MDS: > ... > 2019-05-18 16:34:24.341 7fb3bd610700 -1 mds.0.89098 unhandled write error > (90) Message too long, force readonly... > 2019-05-18 16:34:24.341 7fb3bd610700 1 mds.0.cache force file system > read-only > 2019-05-18 16:34:24.341 7fb3bd610700 0 log_channel(cluster) log [WRN] : > force file system read-only > 2019-05-18 16:34:41.289 7fb3c0616700 1 heartbeat_map is_healthy 'MDSRank' > had timed out after 15 > 2019-05-18 16:34:41.289 7fb3c0616700 0 mds.beacon.objmds00 Skipping beacon > heartbeat to monitors (last acked 4.00101s ago); MDS internal heartbeat is > not healthy! > ... > > On one of the OSDs it was most likely targeting: > ... > 2019-05-18 16:34:24.140 7f8134e6c700 -1 osd.602 pg_epoch: 682796 pg[49.20b( v > 682796'15706523 (682693'15703449,682796'15706523] local-lis/les=673041/673042 > n=10524 ec=245563/245563 lis/c 673041/673041 les/c/f 673042/673042/0 > 673038/673041/668565) [602,530,558] r=0 lpr=673041 crt=682796'15706523 lcod > 682796'15706522 mlcod 682796'15706522 active+clean] do_op msg data len > 95146005 > osd_max_write_size 94371840 on osd_op(mds.0.89098:48609421 49.20b > 49:d0630e4c:::mds0_sessionmap:head [omap-set-header,omap-set-vals] snapc 0=[] > ondisk+write+known_if_redirected+full_force e682796) v8 > 2019-05-18 17:10:33.695 7f813466b700 0 log_channel(cluster) log [DBG] : > 49.31c scrub starts > 2019-05-18 17:10:34.980 7f813466b700 0 log_channel(cluster) log [DBG] : > 49.31c scrub ok > 2019-05-18 22:17:37.320 7f8134e6c700 -1 osd.602 pg_epoch: 683434 pg[49.20b( v > 682861'15706526 (682693'15703449,682861'15706526] local-lis/les=673041/673042 > n=10525 ec=245563/245563 lis/c 673041/673041 les/c/f 673042/673042/0 > 673038/673041/668565) [602,530,558] r=0 lpr=673041 crt=682861'15706526 lcod > 682859'15706525 mlcod 682859'15706525 active+clean] do_op msg data len > 95903764 > osd_max_write_size 94371840 on osd_op(mds.0.91565:357877 49.20b > 49:d0630e4c:::mds0_sessionmap:head > [omap-set-header,omap-set-vals,omap-rm-keys] snapc 0=[] > ondisk+write+known_if_redirected+full_force e683434) v8 > … > > During this time there were some health concerns with the cluster. > Significantly, since the error above seems to be related to the SessionMap, > we had a client that had a few blocked requests for over 35948 secs (it’s a > member of a compute cluster so we let the node drain/finish jobs before > rebooting). We have also had some issues with certain OSDs running older > hardware staying up/responding timely to heartbeats after upgrading to > Nautilus, although that seems to be an iowait/load issue that we are actively > working to mitigate separately. > This prevent mds from trimming completed requests recorded in session. which results a very large session item. To recovery, blacklist the client that has blocked request, the restart mds. > We are running Nautilus 14.2.1 on RHEL7.6. There is only one MDS Rank, with > an active/standby setup between two MDS nodes. MDS clients are mounted using > the RHEL7.6 kernel driver. > > My read here would be that the MDS is sending too large a message to the OSD, > however my understanding was that the MDS should be using osd_max_write_size > to determine the size of that message [0]. Is this maybe a bug in how this is > calculated on the MDS side? > > > Thanks! 
> Ryan Leimenstoll > rleim...@umiacs.umd.edu > University of Maryland Institute for Advanced Computer Studies > > > > [0] https://www.spinics.net/lists/ceph-devel/msg11951.html > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
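Spelled out as a sketch, the recovery Yan describes (daemon name, client address and nonce are placeholders; later releases rename 'blacklist' to 'blocklist'):

$ ceph daemon mds.<name> session ls                # find the client whose requests are stuck
$ ceph osd blacklist add <client-ip>:0/<nonce>     # blacklist that client so the MDS can drop its session
$ ceph mds fail 0                                  # fail rank 0 so the standby MDS takes over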
Re: [ceph-users] Cephfs client evicted, how to unmount the filesystem on the client?
try 'umount -f' On Tue, May 21, 2019 at 4:41 PM Marc Roos wrote: > > > > > > [@ceph]# ps -aux | grep D > USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND > root 12527 0.0 0.0 123520 932 pts/1D+ 09:26 0:00 umount > /home/mail-archive > root 14549 0.2 0.0 0 0 ?D09:29 0:09 > [kworker/0:0] > root 23350 0.0 0.0 123520 932 pts/0D09:38 0:00 umount > /home/archiveindex > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
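For the record, the sequence that is usually worth trying on a mount left behind by an eviction (the lazy unmount is a generic Linux fallback, not anything cephfs-specific):

$ umount -f /home/mail-archive     # force-unmount the evicted cephfs mount
$ umount -l /home/mail-archive     # if it still hangs, lazily detach it and let the kernel clean up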
Re: [ceph-users] Failed Disk simulation question
Hi Alex,

> The cluster has been idle at the moment being new and all. I noticed some disk related errors in dmesg but that was about it. It looked to me for the next 20 - 30 minutes the failure has not been detected. All osds were up and in and health was OK. OSD logs had no smoking gun either. After 30 minutes, I restarted the OSD container and it failed to start as expected.

If the cluster doesn't have to read or write to specific OSDs (or sectors on that OSD), the failure won't be detected immediately. We had an issue last year where one of the SSDs (used for rocksdb and wal) had a failure, but that was never reported. We only discovered it when we tried to migrate the LVM to a new device and got read errors.

> Later on, I performed the same operation during the fio benchmark and the OSD failed immediately.

This confirms our experience: if there's data to read/write on that disk, the failure will be detected. Please note that this was on a Luminous cluster; I don't know if and how Nautilus has improved at sensing disk failures.

Regards, Eugen

Quoting Alex Litvak: > Hello cephers, I know that there was a similar question posted 5 years ago. However, the answer was inconclusive for me. I installed a new Nautilus 14.2.1 cluster and started pre-production testing. I followed a Red Hat document and simulated a soft disk failure by # echo 1 > /sys/block/sdc/device/delete The cluster has been idle at the moment being new and all. I noticed some disk related errors in dmesg but that was about it. It looked to me for the next 20 - 30 minutes the failure has not been detected. All osds were up and in and health was OK. OSD logs had no smoking gun either. After 30 minutes, I restarted the OSD container and it failed to start as expected. Later on, I performed the same operation during the fio benchmark and the OSD failed immediately. My question is: should the disk problem have been detected quickly enough even on the idle cluster? I thought Nautilus has the means to sense a failure before intensive IO hits the disk. Am I wrong to expect that? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
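A practical consequence of the above: on an idle cluster you can generate the reads yourself so a dead disk surfaces sooner — a sketch, assuming the pulled disk backs osd.12 and is still /dev/sdc on that host:

$ ceph osd deep-scrub 12       # schedule deep scrubs for every PG on that OSD, forcing reads from the disk
$ ceph -w                      # watch for scrub errors or the OSD being marked down
$ smartctl -a /dev/sdc         # check what the drive itself reports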
Re: [ceph-users] CephFS object mapping.
Hi, On 5/21/19 9:46 PM, Robert LeBlanc wrote: I'm at a new job working with Ceph again and am excited to back in the community! I can't find any documentation to support this, so please help me understand if I got this right. I've got a Jewel cluster with CephFS and we have an inconsistent PG. All copies of the object are zero size, but the digest says that it should be a non-zero size, so it seems that my two options are, delete the file that the object is part of, or rewrite the object with RADOS to update the digest. So, this leads to my question, how to I tell which file the object belongs to. From what I found, the object is prefixed with the hex value of the inode and suffixed by the stripe number: 1000d2ba15c.0005 . I then ran `find . -xdev -inum 1099732590940` and found a file on the CephFS file system. I just want to make sure that I found the right file before I start trying recovery options. The first stripe XYZ. has some metadata stored as xattr (rados xattr, not cephfs xattr). One of the entries has the key 'parent': # ls Ubuntu16.04-WS2016-17.ova Ubuntu16.04-WS2016-17.ova # ls -i Ubuntu16.04-WS2016-17.ova 1099751898435 Ubuntu16.04-WS2016-17.ova # rados -p cephfs_test_data stat 1000e523d43. cephfs_test_data/1000e523d43. mtime 2016-10-13 16:20:10.00, size 4194304 # rados -p cephfs_test_data listxattr 1000e523d43. layout parent # rados -p cephfs_test_data getxattr 1000e523d43. parent | strings Ubuntu16.04-WS2016-17.ova5: adm2 volumes The complete path of the file is /volumes/adm/Ubuntu16.04-WS2016-17.ova5. For a complete check you can store the content of the parent key and use ceph-dencoder to print its content: # rados -p cephfs_test_data getxattr 1000e523d43. parent > parent.bin # ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json { "ino": 1099751898435, "ancestors": [ { "dirino": 1099527190071, "dname": "Ubuntu16.04-WS2016-17.ova", "version": 14901 }, { "dirino": 1099521974514, "dname": "adm", "version": 61190706 }, { "dirino": 1, "dname": "volumes", "version": 48394885 } ], "pool": 7, "old_pools": [] } One important thing to note: ls -i prints the inode id in decimal, cephfs uses hexadecimal for the rados object names. Thus the different value in the above commands. Regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
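Condensed into a few commands (a sketch — the data pool is a placeholder; by default the parent backtrace lives on the file's first object, whose suffix is 00000000):

$ printf '%x\n' 1099732590940                         # decimal inode -> 1000d2ba15c, the hex prefix used in object names
$ rados -p <data-pool> getxattr 1000d2ba15c.00000000 parent > parent.bin
$ ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json   # full path via the 'ancestors' list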