[ceph-users] Ceph PVE cluster help
I have some friends I set up a Ceph cluster for, for use with PVE, a few years ago. It wasn't maintained and is now in bad shape. They've reached out to me for help, but I don't have the time to assist right now. Is there anyone on the list who would be willing to help? As a professional service, of course. Please reach out to me off-list. Thanks, Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to just delete PGs stuck incomplete on EC pool
Thanks for the suggestions. I've tried both -- setting osd_find_best_info_ignore_history_les = true and restarting all OSDs, as well as 'ceph osd force-create-pg' -- but both still show incomplete PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete') pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete') The OSDs in down_osds_we_would_probe have already been marked lost. When I ran the force-create-pg command, they went to peering for a few seconds, but then went back to incomplete. Updated ceph pg 18.1e query https://pastebin.com/XgZHvJXu Updated ceph pg 18.c query https://pastebin.com/N7xdQnhX Any other suggestions? Thanks again, Daniel On Sat, Mar 2, 2019 at 3:44 PM Paul Emmerich wrote: > On Sat, Mar 2, 2019 at 5:49 PM Alexandre Marangone > wrote: > > > > If you have no way to recover the drives, you can try to reboot the OSDs > with `osd_find_best_info_ignore_history_les = true` (revert it afterwards); > you'll lose data. If after this the PGs are down, you can mark the OSDs > blocking the PGs from becoming active as lost. > > This should work for PG 18.1e, but not for 18.c. Try running "ceph osd > force-create-pg " to reset the PGs instead. > Data will obviously be lost afterwards. > > Paul > > > > > On Sat, Mar 2, 2019 at 6:08 AM Daniel K wrote: > >> > >> They all just started having read errors. Bus resets. Slow reads. Which > is one of the reasons the cluster didn't recover fast enough to compensate. > >> > >> I tried to be mindful of the drive type and specifically avoided the > larger capacity Seagates that are SMR. Used 1 SM863 for every 6 drives for > the WAL. > >> > >> Not sure why they failed. 
The data isn't critical at this point, just > need to get the cluster back to normal. > >> > >> On Sat, Mar 2, 2019, 9:00 AM wrote: > >>> > >>> Did they break, or did something go wrong trying to replace them? > >>> > >>> Jesper > >>> > >>> > >>> > >>> Sent from myMail for iOS > >>> > >>> > >>> Saturday, 2 March 2019, 14.34 +0100 from Daniel K >: > >>> > >>> I bought the wrong drives trying to be cheap. They were 2TB WD Blue > 5400rpm 2.5 inch laptop drives. > >>> > >>> They've been replaced now with HGST 10K 1.8TB SAS drives. > >>> > >>> > >>> > >>> On Sat, Mar 2, 2019, 12:04 AM wrote: > >>> > >>> > >>> > >>> Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com < > satha...@gmail.com>: > >>> > >>> 56 OSD, 6-node 12.2.5 cluster on Proxmox > >>> > >>> We had multiple drives fail (about 30%) within a few days of each > other, likely faster than the cluster could recover. > >>> > >>> > >>> How did so many drives break? > >>> > >>> Jesper
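For anyone following this thread later, the last-resort sequence discussed above can be sketched as below. This is a hedged sketch, not a verified procedure: PG and OSD IDs are taken from this thread (osd.12 is a placeholder), and depending on your Ceph release `force-create-pg` may additionally require `--yes-i-really-mean-it`. It is destructive -- any data still on the incomplete PGs is lost.

```
# 1. Let peering ignore last-epoch-started history, then restart OSDs
#    (revert the option afterwards, as Alexandre notes):
ceph tell 'osd.*' injectargs '--osd_find_best_info_ignore_history_les=true'
systemctl restart ceph-osd.target        # run on each OSD host

# 2. Mark the unrecoverable OSDs (from down_osds_we_would_probe) as lost:
ceph osd lost 12 --yes-i-really-mean-it  # osd.12 is a placeholder ID

# 3. As Paul suggests, force-recreate the PGs that stay incomplete:
ceph osd force-create-pg 18.c
ceph osd force-create-pg 18.1e

# 4. Revert the peering override:
ceph tell 'osd.*' injectargs '--osd_find_best_info_ignore_history_les=false'
```

As Daniel reports above, the PGs may still fall back to incomplete after a brief peering phase, so treat this as a starting point for debugging, not a guaranteed fix.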
Re: [ceph-users] How to just delete PGs stuck incomplete on EC pool
They all just started having read errors. Bus resets. Slow reads. Which is one of the reasons the cluster didn't recover fast enough to compensate. I tried to be mindful of the drive type and specifically avoided the larger capacity Seagates that are SMR. Used 1 SM863 for every 6 drives for the WAL. Not sure why they failed. The data isn't critical at this point, just need to get the cluster back to normal. On Sat, Mar 2, 2019, 9:00 AM wrote: > Did they break, or did something go wrong trying to replace them? > > Jesper > > > > Sent from myMail for iOS > > > Saturday, 2 March 2019, 14.34 +0100 from Daniel K : > > I bought the wrong drives trying to be cheap. They were 2TB WD Blue > 5400rpm 2.5 inch laptop drives. > > They've been replaced now with HGST 10K 1.8TB SAS drives. > > > > On Sat, Mar 2, 2019, 12:04 AM wrote: > > > > Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com < > satha...@gmail.com>: > > 56 OSD, 6-node 12.2.5 cluster on Proxmox > > We had multiple drives fail (about 30%) within a few days of each other, > likely faster than the cluster could recover. > > > How did so many drives break? > > Jesper > >
Re: [ceph-users] How to just delete PGs stuck incomplete on EC pool
I bought the wrong drives trying to be cheap. They were 2TB WD Blue 5400rpm 2.5 inch laptop drives. They've been replaced now with HGST 10K 1.8TB SAS drives. On Sat, Mar 2, 2019, 12:04 AM wrote: > > > Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com < > satha...@gmail.com>: > > 56 OSD, 6-node 12.2.5 cluster on Proxmox > > We had multiple drives fail (about 30%) within a few days of each other, > likely faster than the cluster could recover. > > > How did so many drives break? > > Jesper > >
[ceph-users] How to just delete PGs stuck incomplete on EC pool
56 OSD, 6-node 12.2.5 cluster on Proxmox. We had multiple drives fail (about 30%) within a few days of each other, likely faster than the cluster could recover. After the dust settled, we have 2 out of 896 pgs stuck inactive. The failed drives are completely inaccessible, so I can't mount them and attempt to export the PGs. Do I have any options besides just considering them lost -- and how do I tell Ceph they are lost so that I can get my cluster back to normal? I already reduced min_size from 9 to 8 and can't reduce it any more. The OSDs in "down_osds_we_would_probe" have already all been marked as lost (ceph osd lost xx). ceph health detail: PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete') pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete') root@pve4:~# ceph osd erasure-code-profile get ec-84-hdd crush-device-class= crush-failure-domain=host crush-root=default k=8 m=4 plugin=isa technique=reed_sol_van Results of ceph pg 18.c query https://pastebin.com/V8nByRF6 Results of ceph pg 18.1e query https://pastebin.com/rBWwPYUn Thanks Dan
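For context on the min_size warning in the health output: with the k=8, m=4 profile shown above, each PG has 12 shards and needs at least k=8 of them to serve I/O, so min_size cannot usefully drop below 8 -- which is why it "can't be reduced any more." A quick sanity check of that arithmetic (plain shell, numbers from the profile above):

```shell
# Failure-tolerance arithmetic for the ec-84-hdd profile above (k=8, m=4).
k=8; m=4
shards=$(( k + m ))                 # shards per PG
min_size=$(( k ))                   # min_size cannot go below k for an EC pool
max_lost=$(( shards - min_size ))   # shard losses a PG can absorb and stay active
echo "shards=$shards min_size=$min_size tolerates=$max_lost lost shards per PG"
```

With ~30% of drives failing nearly at once, losing more than 4 of a PG's 12 shards is plausible, which matches the two PGs stuck incomplete.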
Re: [ceph-users] Multiple OSD crashing on 12.2.0. Bluestore / EC pool / rbd
I'm hitting this same issue on 12.2.5. Upgraded one node to 12.2.10 and it didn't clear. 6 OSDs flapping with this error. I know this is an older issue, but are traces still needed? I don't see a resolution available. Thanks, Dan On Wed, Sep 6, 2017 at 10:30 PM Brad Hubbard wrote: > These error logs look like they are being generated here, > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8987-L8993 > or possibly here, > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L9230-L9236 > . > > Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: > 2017-09-05 17:02:58.686723 7fe1871ac700 -1 > bluestore(/var/lib/ceph/osd/ceph-12) _txc_add_transaction error (2) No > such file or directory not handled on operation 15 (op 0, counting > from 0) > > The table of operations is here, > https://github.com/ceph/ceph/blob/master/src/os/ObjectStore.h#L370 > > Operation 15 is OP_SETATTRS, so it appears to be some extended > attribute operation that is failing. > > Can someone run the ceph-osd under strace and find the last system > call (probably a call that manipulates xattrs) that returns -2 in the > thread that crashes (or that outputs the above messages)? > > strace -fvttyyTo /tmp/strace.out -s 1024 ceph-osd [system specific > arguments] > > Capturing logs with "debug_bluestore = 20" may tell us more as well. > > We need to work out what resource it is trying to access when it > receives the error '2' (No such file or directory). > > > On Thu, Sep 7, 2017 at 12:13 AM, Thomas Coelho > wrote: > > Hi, > > > > I have the same problem. A bug [1] has been reported for months, but > > unfortunately it is not fixed yet. I hope that if more people are having > > this problem, the developers can reproduce and fix it. > > > > I was using Kernel-RBD with a Cache Tier. 
> > > > so long > > Thomas Coelho > > > > [1] http://tracker.ceph.com/issues/20222 > > > > > > Am 06.09.2017 um 15:41 schrieb Henrik Korkuc: > >> On 17-09-06 16:24, Jean-Francois Nadeau wrote: > >>> Hi, > >>> > >>> On a 4 node / 48 OSDs Luminous cluster Im giving a try at RBD on EC > >>> pools + Bluestore. > >>> > >>> Setup went fine but after a few bench runs several OSD are failing and > >>> many wont even restart. > >>> > >>> ceph osd erasure-code-profile set myprofile \ > >>>k=2\ > >>>m=1 \ > >>>crush-failure-domain=host > >>> ceph osd pool create mypool 1024 1024 erasure myprofile > >>> ceph osd pool set mypool allow_ec_overwrites true > >>> rbd pool init mypool > >>> ceph -s > >>> ceph health detail > >>> ceph osd pool create metapool 1024 1024 replicated > >>> rbd create --size 1024G --data-pool mypool --image metapool/test1 > >>> rbd bench -p metapool test1 --io-type write --io-size 8192 > >>> --io-pattern rand --io-total 10G > >>> ... > >>> > >>> > >>> One of many OSD failing logs > >>> > >>> Sep 05 17:02:54 r72-k7-06-01.k8s.ash1.cloudsys.tmcs systemd[1]: > >>> Started Ceph object storage daemon osd.12. 
> >>> Sep 05 17:02:54 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: > >>> starting osd.12 at - osd_data /var/lib/ceph/osd/ceph-12 > >>> /var/lib/ceph/osd/ceph-12/journal > >>> Sep 05 17:02:56 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: > >>> 2017-09-05 17:02:56.627301 7fe1a2e42d00 -1 osd.12 2219 log_to_monitors > >>> {default=true} > >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: > >>> 2017-09-05 17:02:58.686723 7fe1871ac700 -1 > >>> bluestore(/var/lib/ceph/osd/ceph-12) _txc_add_transac > >>> tion error (2) No such file or directory not handled on operation 15 > >>> (op 0, counting from 0) > >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: > >>> 2017-09-05 17:02:58.686742 7fe1871ac700 -1 > >>> bluestore(/var/lib/ceph/osd/ceph-12) unexpected error > >>> code > >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: > >>> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/ > >>> > centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: > >>> In function 'void BlueStore::_txc_add_transaction(Blu > >>> eStore::TransContext*, ObjectStore::Transaction*)' thread 7fe1871ac700 > >>> time 2017-09-05 17:02:58.686821 > >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: > >>> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/ > >>> > centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: > >>> 9282: FAILED assert(0 == "unexpected error") > >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: > >>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) > >>> luminous (rc) > >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: 1: > >>> (ceph::__ceph_assert_fail(char const*, char const*, int, char > >>> const*)+0x110) [0x7fe1a38bf510] > 
>>> Sep 05
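To make Brad's request above concrete, here is a hedged sketch of how one might capture both the strace output and the bluestore debug log on the affected daemon. The OSD id (12) and paths are placeholders, and the `ceph-osd` arguments shown are the usual systemd-style invocation -- check your unit file for the exact arguments on your system:

```
# Stop the supervised instance first so we can run it in the foreground.
systemctl stop ceph-osd@12

# Brad's strace invocation: -f follows threads, -T times syscalls,
# -y/-yy decode file descriptors, -o writes to a file.
strace -fvttyyTo /tmp/strace.out -s 1024 \
    /usr/bin/ceph-osd -f --cluster ceph --id 12 --setuser ceph --setgroup ceph

# Raise bluestore logging for that OSD while it runs:
ceph tell osd.12 injectargs '--debug_bluestore 20'

# After the crash, look for the failing call returning ENOENT:
grep -n '= -2 ENOENT' /tmp/strace.out | tail
```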
Re: [ceph-users] Ceph health error (was: Prioritize recovery over backfilling)
Did you ever get anywhere with this? I have 6 OSDs out of 36 continuously flapping with this error in the logs. Thanks, Dan On Fri, Jun 8, 2018 at 11:10 AM Caspar Smit wrote: > Hi all, > > Maybe this will help: > > The issue is with shards 3,4 and 5 of PG 6.3f: > > LOG's of OSD's 16, 17 & 36 (the ones crashing on startup). > > > *Log OSD.16 (shard 4):* > > 2018-06-08 08:35:01.727261 7f4c585e3700 -1 > bluestore(/var/lib/ceph/osd/ceph-16) _txc_add_transaction error (2) No such > file or directory not handled on operation 30 (op 0, counting from 0) > 2018-06-08 08:35:01.727273 7f4c585e3700 -1 > bluestore(/var/lib/ceph/osd/ceph-16) ENOENT on clone suggests osd bug > 2018-06-08 08:35:01.727274 7f4c585e3700 0 > bluestore(/var/lib/ceph/osd/ceph-16) transaction dump: > { > "ops": [ > { > "op_num": 0, > "op_name": "clonerange2", > "collection": "6.3fs4_head", > "src_oid": > "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#903d0", > "dst_oid": > "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#", > "src_offset": 950272, > "len": 98304, > "dst_offset": 950272 > }, > { > "op_num": 1, > "op_name": "remove", > "collection": "6.3fs4_head", > "oid": > "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#903d0" > }, > { > "op_num": 2, > "op_name": "setattrs", > "collection": "6.3fs4_head", > "oid": > "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#", > "attr_lens": { > "_": 297, > "hinfo_key": 18, > "snapset": 35 > } > }, > { > "op_num": 3, > "op_name": "clonerange2", > "collection": "6.3fs4_head", > "src_oid": > "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#903cf", > "dst_oid": > "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#", > "src_offset": 679936, > "len": 274432, > "dst_offset": 679936 > }, > { > "op_num": 4, > "op_name": "remove", > "collection": "6.3fs4_head", > "oid": > "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#903cf" > }, > { > "op_num": 5, > "op_name": "setattrs", > "collection": "6.3fs4_head", > "oid": > 
"4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#", > "attr_lens": { > "_": 297, > "hinfo_key": 18, > "snapset": 35 > } > }, > { > "op_num": 6, > "op_name": "nop" > }, > { > "op_num": 7, > "op_name": "op_omap_rmkeyrange", > "collection": "6.3fs4_head", > "oid": "4#6:fc00head#", > "first": "011124.00590799", > "last": "4294967295.18446744073709551615" > }, > { > "op_num": 8, > "op_name": "omap_setkeys", > "collection": "6.3fs4_head", > "oid": "4#6:fc00head#", > "attr_lens": { > "_biginfo": 597, > "_epoch": 4, > "_info": 953, > "can_rollback_to": 12, > "rollback_info_trimmed_to": 12 > } > } > ] > } > > 2018-06-08 08:35:01.730584 7f4c585e3700 -1 > /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: In function > 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, > ObjectStore::Transaction*)' thread 7f4c585e3700 time 2018-06-08 > 08:35:01.727379 > /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: 9363: > FAILED assert(0 == "unexpected error") > > ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous > (stable) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x102) [0x558e08ba4202] > 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, > ObjectStore::Transaction*)+0x15fa) [0x558e08a55c3a] > 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, > std::vector std::allocator >&, > boost::intrusive_ptr, ThreadPool::TPHandle*)+0x546) > [0x558e08a572a6] > 4: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, > ObjectStore::Transaction&&, Context*, Context*, Context*, > boost::intrusive_ptr, ThreadPool::TPHandle*)+0x14f) > [0x558e085fa37f] > 5: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, > ThreadPool::TPHandle*)+0x6c) [0x558e0857db5c] > 6: (OSD::process_peering_events(std::__cxx11::list std::allocator > const&, ThreadPool::TPHandle&)+0x442) [0x558e085abec2] > 7: (ThreadPool::BatchWorkQueue::_void_process(void*, >
[ceph-users] 12.2.5 multiple OSDs crashing
12.2.5 on Proxmox cluster. 6 nodes, about 50 OSDs, bluestore and cache tiering on an EC pool. Mostly spinners with an SSD OSD drive and an SSD WAL DB drive on each node. PM863 SSDs with ~75%+ endurance remaining. It has been running relatively okay besides some spinner failures until I checked today and found 5-6 OSDs flapping. I remember reading about some issues with 12.2.5, so I upgraded one node to 12.2.10, but no change. Seeing: 2018-12-20 00:27:42.754485 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0) -3> 2018-12-20 00:27:42.754503 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) ENOENT on clone suggests osd bug in the logs for each of them. I've found several bugs in the tracker related to these, but nothing with a resolution I could apply besides upgrading, which doesn't appear to have helped. Suggestions welcome. Snippet of the last few lines: rbd_data.17.afb3726b8b4567.000db0d8:head (bitwise) local-lis/les=10138/10139 n=163802 ec=408/408 lis/c 15778/4121 les/c/f 15781/4127/0 59341/59343/15778) [28,27,17,19,32,33,14,22,7,9,25,23]/[28,27,17,19,2147483647,23,2147483647,13,2147483647,9,32,33]p28(0) r=-1 lpr=59343 pi=[4121,59343)/12 crt=15786'3211273 lcod 0'0 remapped NOTIFY mbc={}] enter Started/ReplicaActive -6> 2018-12-20 00:27:42.753106 7f578f68a700 5 osd.33 pg_epoch: 59344 pg[18.10s5( v 15786'3211273 (15781'3201224,15786'3211273] lb 18:0af1fd67:::rbd_data.17.afb3726b8b4567.000db0d8:head (bitwise) local-lis/les=10138/10139 n=163802 ec=408/408 lis/c 15778/4121 les/c/f 15781/4127/0 59341/59343/15778) [28,27,17,19,32,33,14,22,7,9,25,23]/[28,27,17,19,2147483647,23,2147483647,13,2147483647,9,32,33]p28(0) r=-1 lpr=59343 pi=[4121,59343)/12 crt=15786'3211273 lcod 0'0 remapped NOTIFY mbc={}] enter Started/ReplicaActive/RepNotRecovering -5> 2018-12-20 00:27:42.753186 7f578f68a700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 15786'3211274, 
writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0 -4> 2018-12-20 00:27:42.754485 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0) -3> 2018-12-20 00:27:42.754503 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) ENOENT on clone suggests osd bug -2> 2018-12-20 00:27:42.754507 7f578f68a700 0 bluestore(/var/lib/ceph/osd/ceph-33) transaction dump: { "ops": [ { "op_num": 0, "op_name": "clonerange2", "collection": "18.10s5_head", "src_oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.002ad142:head#310014", "dst_oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.002ad142:head#", "src_offset": 512000, "len": 8192, "dst_offset": 512000 }, { "op_num": 1, "op_name": "remove", "collection": "18.10s5_head", "oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.002ad142:head#310014" }, { "op_num": 2, "op_name": "setattrs", "collection": "18.10s5_head", "oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.002ad142:head#", "attr_lens": { "_": 298, "hinfo_key": 18, "snapset": 35 } }, { "op_num": 3, "op_name": "nop" }, { "op_num": 4, "op_name": "op_omap_rmkeyrange", "collection": "18.10s5_head", "oid": "5#18:0800head#", "first": "015786.03211274", "last": "4294967295.18446744073709551615" }, { "op_num": 5, "op_name": "omap_setkeys", "collection": "18.10s5_head", "oid": "5#18:0800head#", "attr_lens": { "_biginfo": 1646, "_epoch": 4, "_info": 1014, "can_rollback_to": 12, "rollback_info_trimmed_to": 12 } } ] } -1> 2018-12-20 00:27:42.757231 7f5795696700 1 -- 10.10.145.105:6801/29516 --> 10.10.145.100:6818/5468876 -- pg_info((query:59344 sent:59344 18.1es9( v 15786'3322304 (15786'3312225,15786'3322304] local-lis/les=59343/59344 n=163868 ec=408/408 lis/c 59343/3966 les/c/f 59344/3980/0 59341/59343/15773) 9->0)=(empty) epoch 59344) v5 -- 0x55c2f7e7be00 con 0 0> 2018-12-20 00:27:42.758519 7f578f68a700 -1 
/home/builder/source/ceph-12.2.10/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f578f68a700 time 2018-12-20 00:27:42.754596 /home/builder/source/ceph-12.2.10/src/os/bluestore/BlueStore.cc:
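One workaround sometimes suggested for this class of crash -- not proposed in this thread, so treat it strictly as a hedged sketch -- is to export and then remove the single PG shard that trips the assert (here 18.10s5 on osd.33, per the log above), so the OSD can start again and the shard can backfill from the rest of the acting set. Only do this if the PG is recoverable from its other shards, and keep the export file as a backup:

```
# Placeholder IDs/paths, taken from the log above; adapt to your cluster.
systemctl stop ceph-osd@33
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33 \
    --pgid 18.10s5 --op export --file /root/18.10s5.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33 \
    --pgid 18.10s5 --op remove --force
systemctl start ceph-osd@33
```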
Re: [ceph-users] Ceph newbie(?) issues
I had a similar problem with some relatively underpowered servers (2x E5-2603, 6-core 1.7GHz, no HT, 12-14 2TB OSDs per server, 32GB RAM). There was a process on a couple of the servers that would hang and chew up all available CPU. When that happened, I started getting scrub errors on those servers. On Mon, Mar 5, 2018 at 8:45 AM, Jan Marquardt wrote: > Am 05.03.18 um 13:13 schrieb Ronny Aasen: > > I had some similar issues when I started my proof of concept. Especially > > the snapshot deletion I remember well. > > > > The rule of thumb for filestore, which I assume you are running, is 1GB RAM > > per TB of OSD. So with 8 x 4TB OSDs you are looking at 32GB of RAM for > > the OSDs + some GBs for the mon service, + some GBs for the OS itself. > > > > I suspect if you inspect your dmesg log and memory graphs you will find > > that the out-of-memory killer ends your OSDs when the snap deletion (or > > any other high-load task) runs. > > > > I ended up reducing the number of OSDs per node, since the old > > mainboard I used was maxed for memory. > > Well, thanks for the broad hint. Somehow I assumed we fulfilled the > recommendations, but of course you are right. We'll check if our boards > support 48 GB RAM. Unfortunately, there are currently no corresponding > messages. But I can't rule out that there haven't been any. > > > Corruptions occurred for me as well, and they were normally associated with > > disks dying or giving read errors. Ceph often managed to fix them, but > > sometimes I had to just remove the hurting OSD disk. > > > > Have some graphs to look at. Personally I used munin/munin-node, since > > it was just an apt-get away from functioning graphs. > > > > I also used smartmontools to send me emails about hurting disks, > > and smartctl to check all disks for errors. > > I'll check the S.M.A.R.T. stuff. I am wondering if scrubbing errors are > always caused by disk problems or if they could also be triggered > by flapping OSDs or other circumstances. 
> > > Good luck with Ceph! > > Thank you!
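Ronny's 1 GB of RAM per TB of filestore OSD rule of thumb is easy to sanity-check against the hardware described in this thread (numbers taken from the messages above; plain shell arithmetic):

```shell
# Filestore rule of thumb from the thread: ~1 GB RAM per TB of OSD capacity.
# Jan's box per the quoted message: 8 OSDs x 4 TB each, 32 GB installed.
osds=8; tb_per_osd=4
osd_ram_gb=$(( osds * tb_per_osd ))   # RAM wanted by the OSDs alone
installed_gb=32
headroom_gb=$(( installed_gb - osd_ram_gb ))
echo "OSDs alone want ~${osd_ram_gb} GB; installed ${installed_gb} GB; headroom ${headroom_gb} GB"
```

The OSDs alone consume the full 32 GB, leaving nothing for the mon daemon or the OS itself -- consistent with the OOM-killer theory for the crashes during snapshot deletion.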
Re: [ceph-users] Ceph iSCSI is a prank?
There have been quite a few VMware/Ceph threads on the mailing list in the past. One setup I've been toying with is a Linux guest running on the VMware host on local storage, with the guest mounting a Ceph RBD with a filesystem on it, then exporting that via NFS to the VMware host as a datastore. Exporting CephFS via NFS to VMware is another option. I'm not sure how well shared storage will work with either of these configurations, but they work fairly well for single-host deployments. There are also quite a few products that do support iSCSI on Ceph. SUSE Enterprise Storage is a commercial one; PetaSAN is an open-source option. On Fri, Mar 2, 2018 at 2:24 AM, Joshua Chen wrote: > Dear all, > I wonder how we could support VM systems with Ceph storage (block > device)? My colleagues are waiting for my answer for VMware (vSphere 5), and > I myself use oVirt (RHEV); the default protocol is iSCSI. > I know that OpenStack/Cinder works well with Ceph, and Proxmox (just > heard) too. But currently we are using VMware and oVirt. > > > Your wise suggestion is appreciated > > Cheers > Joshua > > > On Thu, Mar 1, 2018 at 3:16 AM, Mark Schouten wrote: > >> Does Xen still not support RBD? Ceph has been around for years now! >> >> Met vriendelijke groeten, >> >> -- >> Kerio Operator in de Cloud? https://www.kerioindecloud.nl/ >> Mark Schouten | Tuxis Internet Engineering >> KvK: 61527076 | http://www.tuxis.nl/ >> T: 0318 200208 | i...@tuxis.nl >> >> >> >> * Van: * Massimiliano Cuttini >> * Aan: * "ceph-users@lists.ceph.com" >> * Verzonden: * 28-2-2018 13:53 >> * Onderwerp: * [ceph-users] Ceph iSCSI is a prank? >> >> I was building Ceph in order to use it with iSCSI. >> But I just saw from the docs that it needs: >> >> *CentOS 7.5* >> (which is not available yet; it's still at 7.4) >> https://wiki.centos.org/Download >> >> *Kernel 4.17* >> (which is not available yet; it is still at 4.15.7) >> https://www.kernel.org/ >> >> So I guess there is no official support and this is just a bad prank. 
>> >> Ceph has been ready to be used with S3 for many years. >> But it needs the kernel of the next century to work with such an old >> technology like iSCSI. >> So sad.
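The RBD-to-NFS datastore setup described at the top of this message can be sketched roughly as follows. This is a hedged outline, not a tested recipe: pool and image names, the mapped device path, mount point, and export subnet are all placeholders, and the NFS export options are just typical choices for VMware:

```
# On the Linux guest acting as the NFS gateway (placeholder names throughout):
rbd create vmpool/esx-datastore --size 2T
rbd map vmpool/esx-datastore            # appears as e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mkdir -p /export/esx-datastore
mount /dev/rbd0 /export/esx-datastore

# Export to the ESXi host(s); sync + no_root_squash are common for VMware:
echo '/export/esx-datastore 10.0.0.0/24(rw,sync,no_root_squash)' >> /etc/exports
exportfs -ra
# Then add an NFS datastore in vSphere pointing at this host and path.
```

Note the gateway guest is a single point of failure in this layout, which is the "shared storage" caveat mentioned above.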
Re: [ceph-users] Added two OSDs, 10% of pgs went inactive
Caspar, I found Nick Fisk's post yesterday http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023223.html and set osd_max_pg_per_osd_hard_ratio = 4 in my ceph.conf on the OSDs and restarted the 10TB OSDs. The PGs went back active and recovery is complete now. My setup is similar to his in that there's a large difference in OSD size; most are 1.8TB, but about 10% of them are 10TB. The difference is I had a functional Luminous cluster until I increased the number of 10TB OSDs from 6 to 8. I'm still not sure why that caused *more* PGs per OSD with the same pools. Thanks! Daniel On Wed, Dec 20, 2017 at 10:23 AM, Caspar Smit <caspars...@supernas.eu> wrote: > Hi Daniel, > > I've had the same problem with creating a new 12.2.2 cluster, where I > couldn't get some pgs out of the "activating+remapped" status after I > switched some OSDs from one chassis to another (there was no data on them > yet). > > I tried restarting OSDs to no avail. > > I couldn't find anything about being stuck in the "activating+remapped" state, so > in the end I threw away the pool and started over. > > Could this be a bug in 12.2.2? > > Kind regards, > Caspar > > 2017-12-20 15:48 GMT+01:00 Daniel K <satha...@gmail.com>: > >> Just an update. >> >> Recovery completed but the PGS are still inactive. >> >> Still having a hard time understanding why adding OSDs caused this. 
I'm >> on 12.2.2 >> >> user@admin:~$ ceph -s >> cluster: >> id: a3672c60-3051-440c-bd83-8aff7835ce53 >> health: HEALTH_WARN >> Reduced data availability: 307 pgs inactive >> Degraded data redundancy: 307 pgs unclean >> >> services: >> mon: 5 daemons, quorum stor585r2u8a,stor585r2u12a,sto >> r585r2u16a,stor585r2u20a,stor585r2u24a >> mgr: stor585r2u8a(active) >> osd: 88 osds: 87 up, 87 in; 133 remapped pgs >> >> data: >> pools: 12 pools, 3016 pgs >> objects: 387k objects, 1546 GB >> usage: 3313 GB used, 186 TB / 189 TB avail >> pgs: 10.179% pgs not active >> 2709 active+clean >> 174 activating >> 133 activating+remapped >> >> io: >> client: 8436 kB/s rd, 935 kB/s wr, 140 op/s rd, 64 op/s wr >> >> >> On Tue, Dec 19, 2017 at 8:57 PM, Daniel K <satha...@gmail.com> wrote: >> >>> I'm trying to understand why adding OSDs would cause pgs to go inactive. >>> >>> This cluster has 88 OSDs, and had 6 OSD with device class "hdd_10TB_7.2k" >>> >>> I added two more OSDs, set the device class to "hdd_10TB_7.2k" and 10% >>> of pgs went inactive. >>> >>> I have an EC pool on these OSDs with the profile: >>> user@admin:~$ ceph osd erasure-code-profile get ISA_10TB_7.2k_4.2 >>> crush-device-class=hdd_10TB_7.2k >>> crush-failure-domain=host >>> crush-root=default >>> k=4 >>> m=2 >>> plugin=isa >>> technique=reed_sol_van. 
>>> >>> some outputs of ceph health detail and ceph osd df >>> user@admin:~$ ceph osd df |grep 10TB >>> 76 hdd_10TB_7.2k 9.09509 1.0 9313G 349G 8963G 3.76 2.20 488 >>> 20 hdd_10TB_7.2k 9.09509 1.0 9313G 345G 8967G 3.71 2.17 489 >>> 28 hdd_10TB_7.2k 9.09509 1.0 9313G 344G 8968G 3.70 2.17 484 >>> 36 hdd_10TB_7.2k 9.09509 1.0 9313G 345G 8967G 3.71 2.17 484 >>> 87 hdd_10TB_7.2k 9.09560 1.0 9313G 8936M 9305G 0.09 0.05 311 >>> 86 hdd_10TB_7.2k 9.09560 1.0 9313G 8793M 9305G 0.09 0.05 304 >>> 6 hdd_10TB_7.2k 9.09509 1.0 9313G 344G 8969G 3.70 2.16 471 >>> 68 hdd_10TB_7.2k 9.09509 1.0 9313G 344G 8969G 3.70 2.17 480 >>> user@admin:~$ ceph health detail|grep inactive >>> HEALTH_WARN 68287/1928007 objects misplaced (3.542%); Reduced data >>> availability: 307 pgs inactive; Degraded data redundancy: 341 pgs unclean >>> PG_AVAILABILITY Reduced data availability: 307 pgs inactive >>> pg 24.60 is stuck inactive for 1947.792377, current state >>> activating+remapped, last acting [36,20,76,6,68,28] >>> pg 24.63 is stuck inactive for 1946.571425, current state >>> activating+remapped, last acting [28,76,6,20,68,36] >>> pg 24.71 is stuck inactive for 1947.625988, current state >>> activating+remapped, last acting [6,68,20,36,28,76] >>> pg 24.73 is stuck inactive for 1947.705250, current state >>> activating+remapped, last acting [36,6,20,76,68,28] >>> pg 24.74 is stuck inactive for 1947.828063, current state >&g
Re: [ceph-users] Added two OSDs, 10% of pgs went inactive
Just an update. Recovery completed but the PGS are still inactive. Still having a hard time understanding why adding OSDs caused this. I'm on 12.2.2 user@admin:~$ ceph -s cluster: id: a3672c60-3051-440c-bd83-8aff7835ce53 health: HEALTH_WARN Reduced data availability: 307 pgs inactive Degraded data redundancy: 307 pgs unclean services: mon: 5 daemons, quorum stor585r2u8a,stor585r2u12a,stor585r2u16a,stor585r2u20a,stor585r2u24a mgr: stor585r2u8a(active) osd: 88 osds: 87 up, 87 in; 133 remapped pgs data: pools: 12 pools, 3016 pgs objects: 387k objects, 1546 GB usage: 3313 GB used, 186 TB / 189 TB avail pgs: 10.179% pgs not active 2709 active+clean 174 activating 133 activating+remapped io: client: 8436 kB/s rd, 935 kB/s wr, 140 op/s rd, 64 op/s wr On Tue, Dec 19, 2017 at 8:57 PM, Daniel K <satha...@gmail.com> wrote: > I'm trying to understand why adding OSDs would cause pgs to go inactive. > > This cluster has 88 OSDs, and had 6 OSD with device class "hdd_10TB_7.2k" > > I added two more OSDs, set the device class to "hdd_10TB_7.2k" and 10% of > pgs went inactive. > > I have an EC pool on these OSDs with the profile: > user@admin:~$ ceph osd erasure-code-profile get ISA_10TB_7.2k_4.2 > crush-device-class=hdd_10TB_7.2k > crush-failure-domain=host > crush-root=default > k=4 > m=2 > plugin=isa > technique=reed_sol_van. 
> [quoted "ceph osd df" and "ceph health detail" output trimmed; it repeats
> the original message below verbatim]
[ceph-users] Added two OSDs, 10% of pgs went inactive
I'm trying to understand why adding OSDs would cause pgs to go inactive.

This cluster has 88 OSDs, and had 6 OSDs with device class "hdd_10TB_7.2k". I added two more OSDs, set the device class to "hdd_10TB_7.2k", and 10% of pgs went inactive.

I have an EC pool on these OSDs with the profile:

user@admin:~$ ceph osd erasure-code-profile get ISA_10TB_7.2k_4.2
crush-device-class=hdd_10TB_7.2k
crush-failure-domain=host
crush-root=default
k=4
m=2
plugin=isa
technique=reed_sol_van

Some outputs of ceph health detail and ceph osd df:

user@admin:~$ ceph osd df |grep 10TB
76 hdd_10TB_7.2k 9.09509 1.0 9313G  349G 8963G 3.76 2.20 488
20 hdd_10TB_7.2k 9.09509 1.0 9313G  345G 8967G 3.71 2.17 489
28 hdd_10TB_7.2k 9.09509 1.0 9313G  344G 8968G 3.70 2.17 484
36 hdd_10TB_7.2k 9.09509 1.0 9313G  345G 8967G 3.71 2.17 484
87 hdd_10TB_7.2k 9.09560 1.0 9313G 8936M 9305G 0.09 0.05 311
86 hdd_10TB_7.2k 9.09560 1.0 9313G 8793M 9305G 0.09 0.05 304
 6 hdd_10TB_7.2k 9.09509 1.0 9313G  344G 8969G 3.70 2.16 471
68 hdd_10TB_7.2k 9.09509 1.0 9313G  344G 8969G 3.70 2.17 480

user@admin:~$ ceph health detail|grep inactive
HEALTH_WARN 68287/1928007 objects misplaced (3.542%); Reduced data availability: 307 pgs inactive; Degraded data redundancy: 341 pgs unclean
PG_AVAILABILITY Reduced data availability: 307 pgs inactive
pg 24.60 is stuck inactive for 1947.792377, current state activating+remapped, last acting [36,20,76,6,68,28]
pg 24.63 is stuck inactive for 1946.571425, current state activating+remapped, last acting [28,76,6,20,68,36]
pg 24.71 is stuck inactive for 1947.625988, current state activating+remapped, last acting [6,68,20,36,28,76]
pg 24.73 is stuck inactive for 1947.705250, current state activating+remapped, last acting [36,6,20,76,68,28]
pg 24.74 is stuck inactive for 1947.828063, current state activating+remapped, last acting [68,36,28,20,6,76]
pg 24.75 is stuck inactive for 1947.475644, current state activating+remapped, last acting [6,28,76,36,20,68]
pg 24.76 is stuck inactive for 1947.712046, current state activating+remapped, last acting [20,76,6,28,68,36]
pg 24.78 is stuck inactive for 1946.576304, current state activating+remapped, last acting [76,20,68,36,6,28]
pg 24.7a is stuck inactive for 1947.820932, current state activating+remapped, last acting [36,20,28,68,6,76]
pg 24.7b is stuck inactive for 1947.858305, current state activating+remapped, last acting [68,6,20,28,76,36]
pg 24.7c is stuck inactive for 1947.753917, current state activating+remapped, last acting [76,6,20,36,28,68]
pg 24.7d is stuck inactive for 1947.803229, current state activating+remapped, last acting [68,6,28,20,36,76]
pg 24.7f is stuck inactive for 1947.792506, current state activating+remapped, last acting [28,20,76,6,68,36]
pg 24.8a is stuck inactive for 1947.823189, current state activating+remapped, last acting [28,76,20,6,36,68]
pg 24.8b is stuck inactive for 1946.579755, current state activating+remapped, last acting [76,68,20,28,6,36]
pg 24.8c is stuck inactive for 1947.555872, current state activating+remapped, last acting [76,36,68,6,28,20]
pg 24.8d is stuck inactive for 1946.589814, current state activating+remapped, last acting [36,6,28,76,68,20]
pg 24.8e is stuck inactive for 1947.802894, current state activating+remapped, last acting [28,6,68,36,76,20]
pg 24.8f is stuck inactive for 1947.528603, current state activating+remapped, last acting [76,28,6,68,20,36]
pg 25.60 is stuck inactive for 1947.620823, current state activating, last acting [20,6,87,36,28,68]
pg 25.61 is stuck inactive for 1947.883517, current state activating, last acting [28,36,86,76,6,87]
pg 25.62 is stuck inactive for 1542089.552271, current state activating, last acting [86,6,76,20,87,68]
pg 25.70 is stuck inactive for 1542089.729631, current state activating, last acting [86,87,76,20,68,28]
pg 25.71 is stuck inactive for 1947.642271, current state activating, last acting [28,86,68,20,6,36]
pg 25.75 is stuck inactive for 1947.825872, current state activating, last acting [68,86,36,20,76,6]
pg 25.76 is stuck inactive for 1947.737307, current state activating, last acting [36,87,28,6,68,76]
pg 25.77 is stuck inactive for 1947.218420, current state activating, last acting [87,36,86,28,76,6]
pg 25.79 is stuck inactive for 1947.253871, current state activating, last acting [6,36,86,28,68,76]
pg 25.7a is stuck inactive for 1542089.794085, current state activating, last acting [86,36,68,20,76,87]
pg 25.7c is stuck inactive for 1947.666774, current state activating, last acting [20,86,36,6,76,87]
pg 25.8a is stuck inactive for 1542089.687299, current state activating, last acting [87,36,68,20,86,28]
pg 25.8c is stuck inactive for 1947.545965, current state activating, last acting [76,6,28,87,36,86]
pg 25.8d is stuck inactive for 1947.213627, current state activating, last acting
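One thing worth checking for PGs stuck in "activating" after adding OSDs -- this is a hypothesis on my part, not something established in the thread: Luminous has PG overdose protection, and a PG will not activate if the mapping would push an OSD past mon_max_pg_per_osd (times osd_max_pg_per_osd_hard_ratio). The last column of the 'ceph osd df' output above shows ~480 PGs per 10TB OSD, which is well above the default limit of 200. A hedged diagnostic/workaround sketch (monitor and OSD names taken from the output above):

```shell
# Check the limits in effect (run on the host with the daemon's admin socket):
ceph daemon mon.stor585r2u8a config get mon_max_pg_per_osd
ceph daemon osd.76 config get osd_max_pg_per_osd_hard_ratio

# If the per-OSD PG count exceeds mon_max_pg_per_osd * hard ratio,
# raising the limit may let the stuck PGs activate:
ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd 500'
```

The longer-term fix would be reducing PG counts or adding OSDs, since ~480 PGs per OSD is far above the recommended ~100.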
Re: [ceph-users] Ceph re-ip of OSD node
Just curious why it wouldn't work as long as the IPs were reachable? Is there something going on in layer 2 with Ceph that wouldn't survive a trip across a router?

On Wed, Aug 30, 2017 at 1:52 PM, David Turner wrote:
> ALL OSDs need to be running the same private network at the same time.
> ALL clients, RGW, OSD, MON, MGR, MDS, etc, etc need to be running on the
> same public network at the same time. You cannot do this as a one at a
> time migration to the new IP space. Even if all of the servers can still
> communicate via routing, it just won't work. Changing the public/private
> network addresses for a cluster requires full cluster down time.
>
> On Wed, Aug 30, 2017 at 11:09 AM Ben Morrice wrote:
>> Hello
>>
>> We have a small cluster that we need to move to a different network in
>> the same datacentre.
>>
>> My workflow was the following (for a single OSD host), but I failed
>> (further details below):
>>
>> 1) ceph osd set noout
>> 2) stop ceph-osd processes
>> 3) change IP, gateway, domain (short hostname is the same), VLAN
>> 4) change references of OLD IP (cluster and public network) in
>> /etc/ceph/ceph.conf with NEW IP (see [1])
>> 5) start a single OSD process
>>
>> This seems to work, as the NEW IP can communicate with mon hosts and osd
>> hosts on the OLD network; the OSD is booted and is visible via 'ceph -w'.
>> However, after a few seconds the OSD drops, with messages such as the
>> below in its log file:
>>
>> heartbeat_check: no reply from 10.1.1.100:6818 osd.14 ever on either
>> front or back, first ping sent 2017-08-30 16:42:14.692210 (cutoff
>> 2017-08-30 16:42:24.962245)
>>
>> There are logs like the above for every OSD server/process, and then
>> eventually a
>>
>> 2017-08-30 16:42:14.486275 7f6d2c966700 0 log_channel(cluster) log
>> [WRN] : map e85351 wrongly marked me down
>>
>> Am I missing something obvious to reconfigure the network on an OSD host?
>>
>> [1]
>>
>> OLD
>> [osd.0]
>> host = sn01
>> devs = /dev/sdi
>> cluster addr = 10.1.1.101
>> public addr = 10.1.1.101
>>
>> NEW
>> [osd.0]
>> host = sn01
>> devs = /dev/sdi
>> cluster addr = 10.1.2.101
>> public addr = 10.1.2.101
>>
>> --
>> Kind regards,
>>
>> Ben Morrice
>>
>> Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
>> EPFL / BBP
>> Biotech Campus
>> Chemin des Mines 9
>> 1202 Geneva
>> Switzerland

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
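For reference, re-addressing the monitors (the part that forces full downtime) follows the monmap-rewrite procedure from the Ceph docs ("changing a monitor's IP address, the messy way"). A rough sketch, with all daemons stopped -- the monitor name mon01 and the addresses are examples, not taken from this thread:

```shell
# Extract the current monmap (from a stopped mon's store if the quorum is down):
ceph-mon -i mon01 --extract-monmap /tmp/monmap

# Inspect, then replace the monitor's entry with its new address:
monmaptool --print /tmp/monmap
monmaptool --rm mon01 /tmp/monmap
monmaptool --add mon01 10.1.2.51:6789 /tmp/monmap

# Inject the edited map back into each monitor, then update mon_host and
# the public/cluster network settings in /etc/ceph/ceph.conf everywhere:
ceph-mon -i mon01 --inject-monmap /tmp/monmap
```

OSD addresses, by contrast, are not pinned in the monmap; once ceph.conf and the interfaces are updated and the whole cluster is restarted on the new networks, the OSDs register their new addresses themselves.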
Re: [ceph-users] RBD encryption options?
Awesome -- I searched and all I could find was restricting access at the pool level. I will investigate the dm-crypt/RBD path also. Thanks again!

On Thu, Aug 24, 2017 at 7:40 PM, Alex Gorbachev <a...@iss-integration.com> wrote:
> On Mon, Aug 21, 2017 at 9:03 PM Daniel K <satha...@gmail.com> wrote:
>> Are there any client-side options to encrypt an RBD device?
>>
>> Using latest luminous RC, on Ubuntu 16.04 and a 4.10 kernel.
>>
>> I assumed adding client-side encryption would be as simple as using
>> luks/dm-crypt/cryptsetup after adding the RBD device to /etc/ceph/rbdmap
>> and enabling the rbdmap service -- but I failed to consider the order of
>> things loading, and it appears that the RBD gets mapped too late for
>> dm-crypt to recognize it as valid. It just keeps telling me it's not a
>> valid LUKS device.
>>
>> I know you can run the OSDs on an encrypted drive, but I was hoping for
>> something client-side, since it's not exactly simple (as far as I can
>> tell) to restrict client access to a single (or group) of RBDs within a
>> shared pool.
>
> Daniel, we used info from here for single or multiple RBD mappings to
> client:
>
> https://blog-fromsomedude.rhcloud.com/2016/04/26/Allowing-a-RBD-client-to-map-only-one-RBD
>
> Also, I ran into the race condition with zfs, and wound up putting zfs and
> rbdmap into rc.local. It should work for dm-crypt as well.
>
> Regards,
> Alex
>
>> Any suggestions?
[ceph-users] RBD encryption options?
Are there any client-side options to encrypt an RBD device?

Using latest luminous RC, on Ubuntu 16.04 and a 4.10 kernel.

I assumed adding client-side encryption would be as simple as using luks/dm-crypt/cryptsetup after adding the RBD device to /etc/ceph/rbdmap and enabling the rbdmap service -- but I failed to consider the order of things loading, and it appears that the RBD gets mapped too late for dm-crypt to recognize it as valid. It just keeps telling me it's not a valid LUKS device.

I know you can run the OSDs on an encrypted drive, but I was hoping for something client-side, since it's not exactly simple (as far as I can tell) to restrict client access to a single (or group) of RBDs within a shared pool.

Any suggestions?
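The dm-crypt-on-RBD layering described above can be done by hand, sidestepping the rbdmap ordering race entirely. A hedged sketch -- pool, image, and client names are examples, not from the thread:

```shell
# Map the image first, so the block device definitely exists before
# cryptsetup touches it:
rbd map rbd/secure-vol --id myclient

# One-time: initialize a LUKS header on the raw RBD, then open and format it.
cryptsetup luksFormat /dev/rbd/rbd/secure-vol
cryptsetup luksOpen /dev/rbd/rbd/secure-vol secure-vol
mkfs.ext4 /dev/mapper/secure-vol

# Subsequent boots: map, luksOpen, mount -- in that order.
mount /dev/mapper/secure-vol /mnt/secure
```

Putting these steps in rc.local (as suggested later in the thread for zfs) works because it runs after networking and the rbd kernel module are up, whereas the rbdmap unit plus crypttab can race at boot.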
[ceph-users] implications of losing the MDS map
I finally figured out how to get the ceph-monstore-tool (compiled from source) and am ready to attempt to recover my cluster.

I have one question -- in the instructions at http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/ under "Recovery from OSDs", the known limitations include:

- *MDS Maps*: the MDS maps are lost.

What are the implications of this? Do I just need to rebuild this, or is there a data loss component to it? Is my data stored in CephFS still safe?
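For what it's worth, the file data and metadata live in the RADOS pools, not in the MDS map; the MDS map only records which filesystem exists and which pools back it. Assuming the metadata and data pools survive the mon store rebuild intact (an assumption -- pool and fs names below are the common defaults, not confirmed from this cluster), recreating the map is sketched in the disaster-recovery docs roughly as:

```shell
# Recreate the filesystem entry against the existing pools without
# touching their contents; --force permits reusing pools that hold data:
ceph fs new cephfs cephfs_metadata cephfs_data --force

# Reset the filesystem state so an MDS can be brought up fresh:
ceph fs reset cephfs --yes-i-really-mean-it
```

A scrub or cephfs-journal-tool inspection afterwards would be prudent before trusting the tree, but no file contents are implied lost by the missing MDS map itself.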
[ceph-users] ceph-monstore-tool missing in 12.1.1 on Xenial?
All 3 of my mons crashed while I was adding OSDs, and now error out with:

/build/ceph-12.1.1/src/mon/OSDMonitor.cc: 3018: FAILED assert(osdmap.get_up_osd_features() & CEPH_FEATURE_MON_STATEFUL_SUB)

I've resorted to just rebuilding the mon DB and making 3 new mon daemons, using the steps at http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/ under "Recovery using OSDs", but I am not finding the ceph-monstore-tool anywhere. Is there a different package I need to install, or did this tool get replaced with something else in Luminous?
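If I remember correctly -- worth verifying with apt-file before relying on it -- ceph-monstore-tool is not in the base ceph packages on Debian/Ubuntu but ships with the ceph-test package:

```shell
# Install the test/debug tools package and confirm it provides the binary:
apt-get install ceph-test
dpkg -L ceph-test | grep monstore
```

That would explain the tool being "missing" on a node with only the standard ceph, ceph-mon, and ceph-osd packages installed.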
[ceph-users] Client behavior when OSD is unreachable
Does the client track which OSDs are reachable? How does it behave if some are not reachable?

For example: cluster network with all OSD hosts on a switch; public network with OSD hosts split between two switches, failure domain is switch. With copies=3, on a failure of one public switch, 1 copy would still be reachable by the client. Will the client know that it can't reach the OSDs on the failed switch?

Well... thinking through this: the mons communicate on the public network -- correct? So an unreachable public network for some of the OSDs would cause them to be marked down, which the clients would then learn about. Correct?
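That reasoning matches how the osdmap works: clients do not probe OSDs themselves; they act on the up/down state the monitors publish. A quick way to see exactly what clients will see in the hypothetical switch-failure scenario above (the commands are standard; the scenario is hypothetical):

```shell
# Per-OSD up/down state as published in the osdmap -- this is what clients use:
ceph osd dump | grep -E '^osd\.'

# Luminous and later: show only the OSDs currently marked down.
ceph osd tree down
```

Until the mons mark the unreachable OSDs down (after the heartbeat grace period), client I/O directed at them will simply block, so there is a window where the client does not yet "know".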
Re: [ceph-users] Ceph object recovery
So I'm not sure if this was the best or right way to do this, but --

Using rados, I confirmed the unfound object was in the cephfs_data pool:

# rados -p cephfs_data ls|grep 001c0ed4

Using the osdmaptool I found the pg/osd the unfound object was in (against a previously exported osdmap in the file "osdmap"):

# osdmaptool --test-map-object 162.001c0ed4 osdmap
object '162.001c0ed4' -> 1.21 -> [4]

Then I told ceph to just delete the unfound object:

ceph pg 1.21 mark_unfound_lost delete

and then used rados to put the object back (from the file I had extracted previously):

# rados -p cephfs_data put 162.001c0ed4 162.001c0ed4.obj

Still have more recovery to do, but this seems to have fixed my unfound object problem.

On Tue, Jul 25, 2017 at 12:54 PM, Daniel K <satha...@gmail.com> wrote:
> [quoted original question trimmed; it appears in full under "Ceph object
> recovery" below]
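The steps above can be consolidated into one sequence. This is a sketch, reusing the pool, PG, and object IDs from the post; the `ceph osd getmap` step is an assumption about how the "previously exported osdmap" file was produced:

```shell
# Export the current osdmap, then map the object name to its PG and acting OSD:
ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --test-map-object 162.001c0ed4 --pool 1

# Mark the unfound object lost (deleting it), then re-upload the copy
# previously extracted with ceph-objectstore-tool get-bytes:
ceph pg 1.21 mark_unfound_lost delete
rados -p cephfs_data put 162.001c0ed4 162.001c0ed4.obj
```

Note that mark_unfound_lost only acts on objects the PG currently reports unfound, so the delete-then-put order matters here.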
Re: [ceph-users] Can't start bluestore OSDs after successfully moving them 12.1.1 ** ERROR: osd init failed: (2) No such file or directory
Just some more info -- this happens also when I just restart an OSD that *was* working -- it won't start back up.

In the mon log I have the following (which correspond to the OSDs that I've been trying to start). osd.13 was working just now, before I stopped the service and tried to start it again:

2017-07-25 14:42:49.249076 7f2386806700 0 cephx server osd.10: couldn't find entity name: osd.10
2017-07-25 14:43:24.323603 7f2386806700 0 cephx server osd.13: couldn't find entity name: osd.13
2017-07-25 14:43:25.033487 7f2386806700 0 cephx server osd.7: couldn't find entity name: osd.7

Still reading and learning.

On Tue, Jul 25, 2017 at 2:38 PM, Daniel K <satha...@gmail.com> wrote:
> [quoted message and debug log trimmed; they repeat the previous message
> in this thread]
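The "couldn't find entity name: osd.N" messages suggest the monitors' auth database is missing the OSDs' cephx entities -- plausible after the mon DB rebuild described in the earlier "ceph-monstore-tool" thread, though that is an assumption on my part. If that is the cause, re-registering each OSD's existing on-disk key should let it authenticate again:

```shell
# Re-add osd.10's key from its local keyring with standard OSD caps;
# repeat for each affected OSD id:
ceph auth add osd.10 osd 'allow *' mon 'allow profile osd' \
    -i /var/lib/ceph/osd/ceph-10/keyring
```

`ceph auth list` before and after would confirm whether the osd.N entities were in fact absent.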
Re: [ceph-users] Can't start bluestore OSDs after successfully moving them 12.1.1 ** ERROR: osd init failed: (2) No such file or directory
Update to this -- I tried building a new host and a new OSD, new disk, and I am having the same issue.

I set osd debug level to 10 -- the issue looks like it's coming from a mon daemon. Still trying to learn enough about the internals of ceph to understand what's happening here.

Relevant debug logs (I think):

2017-07-25 14:21:58.889016 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 1 mon_map magic: 0 v1 541+0+0 (2831459213 0 0) 0x556640ecd900 con 0x556641949800
2017-07-25 14:21:58.889109 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 2 auth_reply(proto 2 0 (0) Success) v1 33+0+0 (248727397 0 0) 0x556640ecdb80 con 0x556641949800
2017-07-25 14:21:58.889204 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x556640ecd400 con 0
2017-07-25 14:21:58.889966 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 3 auth_reply(proto 2 0 (0) Success) v1 206+0+0 (3141870879 0 0) 0x556640ecd400 con 0x556641949800
2017-07-25 14:21:58.890066 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x556640ecdb80 con 0
2017-07-25 14:21:58.890759 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 4 auth_reply(proto 2 0 (0) Success) v1 564+0+0 (1715764650 0 0) 0x556640ecdb80 con 0x556641949800
2017-07-25 14:21:58.890871 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_subscribe({monmap=0+}) v2 -- 0x556640e77680 con 0
2017-07-25 14:21:58.890901 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x556640ecd400 con 0
2017-07-25 14:21:58.891494 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 5 mon_map magic: 0 v1 541+0+0 (2831459213 0 0) 0x556640ecde00 con 0x556641949800
2017-07-25 14:21:58.891555 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 6 auth_reply(proto 2 0 (0) Success) v1 194+0+0 (1036670921 0 0) 0x556640ece080 con 0x556641949800
2017-07-25 14:21:58.892003 7f25b5e71c80 10 osd.7 0 mon_cmd_maybe_osd_create cmd: {"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}
2017-07-25 14:21:58.892039 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]} v 0) v1 -- 0x556640e78d00 con 0
*2017-07-25 14:21:58.894596 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 7 mon_command_ack([{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}]=-2 (2) No such file or directory v10406) v1 133+0+0 (3400959855 0 0) 0x556640ece300 con 0x556641949800*
2017-07-25 14:21:58.894797 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd create", "id": 7, "uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"} v 0) v1 -- 0x556640e79180 con 0
2017-07-25 14:21:58.896301 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 8 mon_command_ack([{"prefix": "osd create", "id": 7, "uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"}]=0 v10406) v1 115+0+2 (2540205126 0 1371665406) 0x556640ece580 con 0x556641949800
2017-07-25 14:21:58.896473 7f25b5e71c80 10 osd.7 0 mon_cmd_maybe_osd_create cmd: {"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}
2017-07-25 14:21:58.896516 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]} v 0) v1 -- 0x556640e793c0 con 0
*2017-07-25 14:21:58.898180 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 9 mon_command_ack([{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}]=-2 (2) No such file or directory v10406) v1 133+0+0 (3400959855 0 0) 0x556640ecd900 con 0x556641949800*
*2017-07-25 14:21:58.898276 7f25b5e71c80 -1 osd.7 0 mon_cmd_maybe_osd_create fail: '(2) No such file or directory': (2) No such file or directory*
2017-07-25 14:21:58.898380 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 >> 10.0.15.51:6789/0 conn(0x556641949800 :-1 s=STATE_OPEN pgs=367879 cs=1 l=1).mark_down

On Mon, Jul 24, 2017 at 1:33 PM, Daniel K <satha...@gmail.com> w
[ceph-users] Ceph object recovery
I did some bad things to my cluster, broke 5 OSDs and wound up with 1 unfound object.

I mounted one of the OSD drives and used ceph-objectstore-tool to find and export the object:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 162.001c0ed4 get-bytes filename.obj

What's the best way to bring this object back into the active cluster?

Do I need to bring an OSD offline, mount it, and do the reverse of the above command? Something like:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 162.001c0ed4 set-bytes filename.obj

Is there some way to do this without bringing down an osd?
[ceph-users] Can't start bluestore OSDs after successfully moving them 12.1.1 ** ERROR: osd init failed: (2) No such file or directory
List --

I have a 4-node cluster running on bare metal and have a need to use the kernel client on 2 nodes. As I read you should not run the kernel client on a node that runs an OSD daemon, I decided to move the OSD daemons into a VM on the same device. Original host is stor-vm2 (bare metal), new host is stor-vm2a (virtual).

All went well -- I did these steps (for each OSD, 5 total per host):

- setup the VM
- install the OS
- installed ceph (using ceph-deploy)
- set noout
- stopped ceph osd on bare metal host
- unmount /dev/sdb1 from /var/lib/ceph/osd/ceph-0
- add /dev/sdb to the VM
- ceph detected the osd and started automatically
- moved VM host to the same bucket as physical host in crushmap

I did this for each OSD, and despite some recovery IO because of the updated crushmap, all OSDs were up. I rebooted the physical host, which rebooted the VM, and now the OSDs are refusing to start. I've tried moving them back to the bare metal host with the same results. Any ideas?

Here are what seem to be the relevant osd log lines:

2017-07-24 13:21:53.561265 7faf1752fc80 0 osd.10 8854 crush map has features 2200130813952, adjusting msgr requires for clients
2017-07-24 13:21:53.561284 7faf1752fc80 0 osd.10 8854 crush map has features 2200130813952 was 8705, adjusting msgr requires for mons
2017-07-24 13:21:53.561298 7faf1752fc80 0 osd.10 8854 crush map has features 720578140510109696, adjusting msgr requires for osds
2017-07-24 13:21:55.626834 7faf1752fc80 0 osd.10 8854 load_pgs
2017-07-24 13:22:20.970222 7faf1752fc80 0 osd.10 8854 load_pgs opened 536 pgs
2017-07-24 13:22:20.972659 7faf1752fc80 0 osd.10 8854 using weightedpriority op queue with priority op cut off at 64.
2017-07-24 13:22:20.976861 7faf1752fc80 -1 osd.10 8854 log_to_monitors {default=true}
2017-07-24 13:22:20.998233 7faf1752fc80 -1 osd.10 8854 mon_cmd_maybe_osd_create fail: '(2) No such file or directory': (2) No such file or directory
2017-07-24 13:22:20.999165 7faf1752fc80 1 bluestore(/var/lib/ceph/osd/ceph-10) umount
2017-07-24 13:22:21.016146 7faf1752fc80 1 freelist shutdown
2017-07-24 13:22:21.016243 7faf1752fc80 4 rocksdb: [/build/ceph-12.1.1/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all background work
2017-07-24 13:22:21.020440 7faf1752fc80 4 rocksdb: [/build/ceph-12.1.1/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2017-07-24 13:22:21.274481 7faf1752fc80 1 bluefs umount
2017-07-24 13:22:21.275822 7faf1752fc80 1 bdev(0x558bb1f82d80 /var/lib/ceph/osd/ceph-10/block) close
2017-07-24 13:22:21.485226 7faf1752fc80 1 bdev(0x558bb1f82b40 /var/lib/ceph/osd/ceph-10/block) close
2017-07-24 13:22:21.551009 7faf1752fc80 -1 ** ERROR: osd init failed: (2) No such file or directory
2017-07-24 13:22:21.563567 7faf1752fc80 -1 /build/ceph-12.1.1/src/common/HeartbeatMap.cc: In function 'ceph::HeartbeatMap::~HeartbeatMap()' thread 7faf1752fc80 time 2017-07-24 13:22:21.558275
/build/ceph-12.1.1/src/common/HeartbeatMap.cc: 39: FAILED assert(m_workers.empty())
ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x558ba6ba6b72]
2: (()+0xb81cf1) [0x558ba6cc0cf1]
3: (CephContext::~CephContext()+0x4d9) [0x558ba6ca77b9]
4: (CephContext::put()+0xe6) [0x558ba6ca7ab6]
5: (main()+0x563) [0x558ba650df73]
6: (__libc_start_main()+0xf0) [0x7faf14999830]
7: (_start()+0x29) [0x558ba6597cf9]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
--- begin dump of recent events ---
Re: [ceph-users] ceph recovery incomplete PGs on Luminous RC
I was able to export the PGs using the ceph-objectstore-tool and import them to the new OSDs.

I moved some other OSDs from the bare metal on a node into a virtual machine on the same node and was surprised at how easy it was: install ceph in the VM (using ceph-deploy), stop the OSD and dismount the OSD drive from the physical machine, mount it to the VM -- the OSD was auto-detected, and the ceph-osd process started automatically and was up within a few seconds.

I'm having a different problem now that I will make a separate message about. Thanks!

On Mon, Jul 24, 2017 at 12:52 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> On Fri, Jul 21, 2017 at 10:23 PM Daniel K <satha...@gmail.com> wrote:
>> Luminous 12.1.0 (RC)
>>
>> I replaced two OSD drives (old ones were still good, just too small),
>> using:
>>
>> ceph osd out osd.12
>> ceph osd crush remove osd.12
>> ceph auth del osd.12
>> systemctl stop ceph-osd@osd.12
>> ceph osd rm osd.12
>>
>> I later found that I also should have unmounted it from
>> /var/lib/ceph/osd-12.
>>
>> (remove old disk, insert new disk)
>>
>> I added the new disk/osd with: ceph-deploy osd prepare stor-vm3:sdg
>> --bluestore
>>
>> This automatically activated the osd (not sure why, I thought it needed a
>> ceph-deploy osd activate as well).
>>
>> Then, working on an unrelated issue, I upgraded one (out of 4 total)
>> nodes to 12.1.1 using apt and rebooted.
>>
>> The mon daemon would not form a quorum with the others on 12.1.0, so,
>> instead of troubleshooting that, I just went ahead and upgraded the other
>> 3 nodes and rebooted.
>>
>> Lots of recovery IO went on afterwards, but now things have stopped at:
>>
>> pools: 10 pools, 6804 pgs
>> objects: 1784k objects, 7132 GB
>> usage: 11915 GB used, 19754 GB / 31669 GB avail
>> pgs: 0.353% pgs not active
>>      70894/2988573 objects degraded (2.372%)
>>      422090/2988573 objects misplaced (14.123%)
>>      6626 active+clean
>>      129  active+remapped+backfill_wait
>>      23   incomplete
>>      14   active+undersized+degraded+remapped+backfill_wait
>>      4    active+undersized+degraded+remapped+backfilling
>>      4    active+remapped+backfilling
>>      2    active+clean+scrubbing+deep
>>      1    peering
>>      1    active+recovery_wait+degraded+remapped
>>
>> When I run ceph pg query on the incompletes, they all list at least one
>> of the two removed OSDs (12,17) in "down_osds_we_would_probe".
>>
>> Most pools are size:2 min_size:1 (trusting bluestore to tell me which
>> one is valid). One pool is size:1 min_size:1, and I'm okay with losing
>> it, except I had it mounted in a directory on cephfs; I rm'd the
>> directory, but I can't delete the pool because it's "in use by CephFS".
>>
>> I still have the old drives -- can I stick them into another host and
>> re-add them somehow?
>
> Yes, that'll probably be your easiest solution. You may have some trouble
> because you already deleted them, but I'm not sure.
>
> Alternatively, you ought to be able to remove the pool from CephFS using
> some of the monitor commands and then delete it.
>
>> This data isn't super important, but I'd like to learn a bit on how to
>> recover when bad things happen, as we are planning a production
>> deployment in a couple of weeks.
Re: [ceph-users] dealing with incomplete PGs while using bluestore
I am in the process of doing exactly what you are -- this worked for me: 1. mount the first partition of the bluestore drive that holds the missing PGs (if it's not already mounted) > mkdir /mnt/tmp > mount /dev/sdb1 /mnt/tmp 2. export the pg to a suitable temporary storage location: > ceph-objectstore-tool --data-path /mnt/tmp --pgid 1.24 --op export --file /mnt/sdd1/recover.1.24 3. find the acting osd > ceph health detail |grep incomplete PG_DEGRADED Degraded data redundancy: 23 pgs unclean, 23 pgs incomplete pg 1.24 is incomplete, acting [18,13] pg 4.1f is incomplete, acting [11] ... 4. set noout > ceph osd set noout 5. Find the OSD and log into it -- I used 18 here. > ceph osd find 18 { "osd": 18, "ip": "10.0.15.54:6801/9263", "crush_location": { "building": "building-dc", "chassis": "chassis-dc400f5-10", "city": "city", "floor": "floor-dc4", "host": "stor-vm4", "rack": "rack-dc400f5", "region": "cfl", "room": "room-dc400", "root": "default", "row": "row-dc400f" } } > ssh user@10.0.15.54 6. copy the file to somewhere accessible by the new (acting) osd > scp user@10.0.14.51:/mnt/sdd1/recover.1.24 /tmp/recover.1.24 7. stop the osd > service ceph-osd@18 stop 8. import the file using ceph-objectstore-tool > ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-18 --op import --file /tmp/recover.1.24 9. start the osd > service ceph-osd@18 start this worked for me -- not sure if this is the best way or if I took any extra steps, and I have yet to validate that the data is good. I based this partially off your original email, and the guide here http://ceph.com/geen-categorie/incomplete-pgs-oh-my/ On Sat, Jul 22, 2017 at 4:46 PM, mofta7y wrote: > Hi All, > > I have a situation here. > > I have an EC pool that has a cache tier pool (the cache tier is > replicated with size 2). 
> > Had an issue on the pool and the crush map got changed after rebooting > some OSDs; in any case I lost 4 cache tier OSDs > > those lost OSDs are not really lost, they look fine to me, but bluestore is > giving me an exception when starting them that I can't deal with. (will open a > question about that exception as well) > > So now I have 14 incomplete PGs on the caching tier. > > > I am trying to recover them using ceph-objectstore-tool > > the extraction and import work nicely with no issues, but the OSD fails to > start afterwards with the same issue as the original OSD. > > after importing the PG on the acting OSD I get the exact same exception I > was getting while trying to start the failed OSD > > removing that import resolves the issue. > > > So the question is how can I use ceph-objectstore-tool to import in > bluestore, as I think I am missing something here > > > here is the procedure and the steps I used > > 1- stop the old osd (it cannot start anyway) > > 2- use this command to extract the pg I need > > ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-116 --pgid 15.371 > --op export --file /tmp/recover.15.371 > > that command works > > 3- check what is the acting OSD for the pg > > 4- stop the acting OSD > > 5- delete the current folder with the same pg name > > 6- use this command > > ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-78 --op import > /tmp/recover.15.371 > the error I got in both cases is this bluestore error > > Jul 22 16:35:20 alm9 ceph-osd[3799171]: -257> 2017-07-22 16:20:19.544195 > 7f7157036a40 -1 osd.116 119691 log_to_monitors {default=true} > Jul 22 16:35:20 alm9 ceph-osd[3799171]: 0> 2017-07-22 16:35:20.142143 > 7f713c597700 -1 /tmp/buildd/ceph-11.2.0/src/os/bluestore/BitMapAllocator.cc: > In function 'virtual int BitMapAllocator::reserve(uint64_t)' thread > 7f713c597700 time 2017-07-22 16:35:20.139309 > Jul 22 16:35:20 alm9 ceph-osd[3799171]: > /tmp/buildd/ceph-11.2.0/src/os/bluestore/BitMapAllocator.cc: > 82: FAILED assert(!(need % 
m_block_size)) > Jul 22 16:35:20 alm9 ceph-osd[3799171]: ceph version 11.2.0 > (f223e27eeb35991352ebc1f67423d4ebc252adb7) > Jul 22 16:35:20 alm9 ceph-osd[3799171]: 1: (ceph::__ceph_assert_fail(char > const*, char const*, int, char const*)+0x80) [0x562b84558380] > Jul 22 16:35:20 alm9 ceph-osd[3799171]: 2: (BitMapAllocator::reserve(unsigned > long)+0x2ab) [0x562b8437c5cb] > Jul 22 16:35:20 alm9 ceph-osd[3799171]: 3: (BlueFS::reclaim_blocks(unsigned > int, unsigned long, std::vector mempool::pool_allocator<(mempool::pool_index_t)7, > AllocExtent> >*)+0x22a) [0x562b8435109a] > Jul 22 16:35:20 alm9 ceph-osd[3799171]: 4: (BlueStore::_balance_bluefs_fr > eespace(std::vector >*)+0x28e) [0x562b84270dae] > Jul 22 16:35:20 alm9 ceph-osd[3799171]: 5: > (BlueStore::_kv_sync_thread()+0x164a) > [0x562b84273eea] > Jul 22 16:35:20 alm9 ceph-osd[3799171]: 6: > (BlueStore::KVSyncThread::entry()+0xd) > [0x562b842ad9dd] > Jul 22 16:35:20 alm9 ceph-osd[3799171]: 7: (()+0x76ba)
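For anyone following along, the nine steps at the top of this thread condense into something like the sketch below. The pg id, mount point, and OSD id are the example values from the thread; adjust them to your cluster, and note this assumes the source OSD's data directory can still be mounted read-only.

```shell
# Condensed sketch of the export/import cycle described above.
PGID=1.24
SRC_DATA=/mnt/tmp          # mounted data dir of the failed/removed OSD
DEST_OSD=18                # an acting OSD for the incomplete PG

ceph osd set noout                               # keep CRUSH from reacting while OSDs are down
ceph-objectstore-tool --data-path "$SRC_DATA" --pgid "$PGID" \
    --op export --file /tmp/recover."$PGID"
service ceph-osd@"$DEST_OSD" stop                # OSD must be stopped for the import
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-"$DEST_OSD" \
    --op import --file /tmp/recover."$PGID"
service ceph-osd@"$DEST_OSD" start
ceph osd unset noout
```

As the poster notes, validate the recovered data afterwards; this is a best-effort salvage, not a guaranteed repair.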
[ceph-users] ceph recovery incomplete PGs on Luminous RC
Luminous 12.1.0(RC) I replaced two OSD drives (old ones were still good, just too small), using: ceph osd out osd.12 ceph osd crush remove osd.12 ceph auth del osd.12 systemctl stop ceph-osd@osd.12 ceph osd rm osd.12 I later found that I also should have unmounted it from /var/lib/ceph/osd-12 (remove old disk, insert new disk) I added the new disk/osd with ceph-deploy osd prepare stor-vm3:sdg --bluestore This automatically activated the osd (not sure why, I thought it needed a ceph-deploy osd activate as well) Then, working on an unrelated issue, I upgraded one (out of 4 total) nodes to 12.1.1 using apt and rebooted. The mon daemon would not form a quorum with the others on 12.1.0, so, instead of troubleshooting that, I just went ahead and upgraded the other 3 nodes and rebooted. Lots of recovery IO went on afterwards, but now things have stopped at: pools: 10 pools, 6804 pgs objects: 1784k objects, 7132 GB usage: 11915 GB used, 19754 GB / 31669 GB avail pgs: 0.353% pgs not active 70894/2988573 objects degraded (2.372%) 422090/2988573 objects misplaced (14.123%) 6626 active+clean 129 active+remapped+backfill_wait 23 incomplete 14 active+undersized+degraded+remapped+backfill_wait 4 active+undersized+degraded+remapped+backfilling 4 active+remapped+backfilling 2 active+clean+scrubbing+deep 1 peering 1 active+recovery_wait+degraded+remapped when I run ceph pg query on the incompletes, they all list at least one of the two removed OSDs (12,17) in "down_osds_we_would_probe" most pools are size:2 min_size:1 (trusting bluestore to tell me which one is valid). One pool is size:1 min_size:1 and I'm okay with losing it, except I had it mounted in a directory on cephfs, I rm'd the directory but I can't delete the pool because it's "in use by CephFS" I still have the old drives, can I stick them into another host and re-add them somehow? 
This data isn't super important, but I'd like to learn a bit on how to recover when bad things happen as we are planning a production deployment in a couple of weeks.
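For reference, a removal sequence that includes the unmount step the poster missed might look like this. It uses osd.12 as in the post; the standard mount point is /var/lib/ceph/osd/ceph-12 (the post's /var/lib/ceph/osd-12 appears to be a typo), and the exact ordering of the crush/auth/rm steps varies between guides.

```shell
# Hedged sketch of a fuller single-OSD removal sequence.
ID=12
ceph osd out osd.$ID                   # start draining data off the OSD
# ... wait for the cluster to finish rebalancing, then:
systemctl stop ceph-osd@$ID            # stop the daemon before removing it
ceph osd crush remove osd.$ID          # remove it from the CRUSH map
ceph auth del osd.$ID                  # drop its cephx key
ceph osd rm osd.$ID                    # remove it from the osdmap
umount /var/lib/ceph/osd/ceph-$ID      # the step the post notes was missed
```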
[ceph-users] how to map rbd using rbd-nbd on boot?
Once again my google-fu has failed me and I can't find the 'correct' way to map an rbd using rbd-nbd on boot. Everything takes me to rbdmap, which isn't using rbd-nbd. If someone could just point me in the right direction I'd appreciate it. Thanks! Dan
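One workaround people use (since rbdmap only handles kernel-mapped devices) is a small systemd unit that runs rbd-nbd at boot. This is an untested sketch: the unit name, pool/image, and the assumption that the device comes up as /dev/nbd0 are all hypothetical, and it assumes rbd-nbd daemonizes after mapping (its default behavior).

```shell
# Hypothetical systemd unit to map rbd_storage/myimage via rbd-nbd at boot.
cat > /etc/systemd/system/rbd-nbd-myimage.service <<'EOF'
[Unit]
Description=Map rbd_storage/myimage via rbd-nbd
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/rbd-nbd map rbd_storage/myimage
# Assumes the first mapping lands on /dev/nbd0; not guaranteed if
# multiple images are mapped.
ExecStop=/usr/bin/rbd-nbd unmap /dev/nbd0

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable rbd-nbd-myimage.service
```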
Re: [ceph-users] rbd-fuse performance
thank you! On Wed, Jun 28, 2017 at 11:48 AM, Mykola Golub <mgo...@mirantis.com> wrote: > On Tue, Jun 27, 2017 at 07:17:22PM -0400, Daniel K wrote: > > > rbd-nbd isn't good as it stops at 16 block devices (/dev/nbd0-15) > > modprobe nbd nbds_max=1024 > > Or, if nbd module is loaded by rbd-nbd, use --nbds_max command line > option. > > -- > Mykola Golub
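To make the higher device count survive a reboot, the module option can be persisted via modprobe.d -- a common Debian/Ubuntu approach; the file name is arbitrary.

```shell
# Persist the nbd device-count option across reboots.
echo "options nbd nbds_max=1024" > /etc/modprobe.d/nbd.conf
echo "nbd" >> /etc/modules        # load the module at boot (Debian/Ubuntu)
# Load it now with the same option:
modprobe nbd nbds_max=1024
ls /dev/nbd* | wc -l              # should show far more than the default 16
```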
[ceph-users] rbd-fuse performance
Hi, As mentioned in my previous emails, I'm extremely new to ceph, so please forgive my lack of knowledge. I'm trying to find a good way to mount ceph rbd images for export by LIO/targetcli. rbd-nbd isn't good as it stops at 16 block devices (/dev/nbd0-15), and kernel rbd mapping doesn't have support for new features. I thought rbd-fuse looked good, except write performance is abysmal: rados bench gives me ~250MB/s of write speed, while an image mounted with rbd-fuse gives me ~2MB/s of write speed. CephFS write speeds are good as well. Is something wrong with my testing method or configuration?
root@stor-vm1:/# ceph osd pool create rbd_storage 128 128
root@stor-vm1:/# rbd create --pool=rbd_storage --size=25G rbd_25g1
root@stor-vm1:/# mkdir /mnt/rbd
root@stor-vm1:/# cd /mnt
root@stor-vm1:/# rbd-fuse rbd -p rbd_storage
root@stor-vm1:/# cd rbd
root@stor-vm1:/# dd if=/dev/zero of=rbd_25g1 bs=4M count=2 status=progress
8388608 bytes (8.4 MB, 8.0 MiB) copied, 4.37754 s, 1.9 MB/s
2+0 records in
2+0 records out
8388608 bytes (8.4 MB, 8.0 MiB) copied, 4.3776 s, 1.9 MB/s
rados bench:
root@stor-vm1:/mnt/rbd# rados bench -p rbd_storage 10 write
2017-06-27 18:56:59.505647 7fb9c24a7e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
2017-06-27 18:56:59.505768 7fb9c24a7e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
2017-06-27 18:56:59.507385 7fb9c24a7e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_stor-vm1_8786
 sec Cur ops  started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   0       0        0         0         0         0            -           0
   1      16       63        47   187.989       188     0.620617    0.285428
   2      16      134       118   235.976       284     0.195319    0.250789
   3      16      209       193   257.306       300     0.198448    0.239798
   4      16      282       266   265.972       292     0.232927    0.233386
   5      16      362       346   276.771       320     0.222398    0.226373
   6      16      429       413   275.303       268     0.193111    0.226703
   7      16      490       474   270.828       244    0.0879974    0.228776
   8      16      562       546    272.97       288     0.125843    0.230455
   9      16      625       609   270.637       252     0.145847    0.232388
  10      16      701       685    273.97       304     0.411055    0.230831
Total time run:         10.161789
Total writes made:      702
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     276.329
Stddev Bandwidth:       38.1925
Max bandwidth (MB/sec): 320
Min bandwidth (MB/sec): 188
Average IOPS:           69
Stddev IOPS:            9
Max IOPS:               80
Min IOPS:               47
Average Latency(s):     0.231391
Stddev Latency(s):      0.107305
Max latency(s):         0.774406
Min latency(s):         0.0828756
Cleaning up (deleting benchmark objects)
Removed 702 objects
Clean up completed and total clean up time: 1.190687
Thanks, Dan
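The rados bench summary numbers above are internally consistent -- bandwidth and IOPS follow directly from the totals (4 MiB objects, values taken from the output):

```python
# Cross-check the rados bench summary: bandwidth = writes * 4 MiB / runtime.
total_time = 10.161789   # Total time run
writes = 702             # Total writes made
obj_mb = 4               # 4194304 bytes = 4 MiB

bandwidth = writes * obj_mb / total_time
iops = writes / total_time
print(round(bandwidth, 3))   # ~276.329 MB/s, as reported
print(int(iops))             # ~69, as reported
```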
[ceph-users] Luminous/Bluestore compression documentation
Is there anywhere that details the various compression settings for bluestore backed pools? I can see compression in the list of options when I run ceph osd pool set, but can't find anything that details what valid settings are. I've tried discovering the options via the command line utilities and via google and have failed at both. Thanks, Dan
Re: [ceph-users] osds exist in the crush map but not in the osdmap after kraken > luminous rc1 upgrade
Well that was simple. In the process of preparing the decompiled crush map, ceph status, ceph osd tree for posting I noticed that those two OSDs -- 5 & 11 didn't exist. Which explains it. I removed them from the crushmap and all is well now. Nothing changed in the config from kraken to luminous, so I guess kraken just didn't have a health check for that problem. Thanks for the help! Dan On Tue, Jun 27, 2017 at 2:18 PM, David Turner <drakonst...@gmail.com> wrote: > Can you post your decompiled crush map, ceph status, ceph osd tree, etc? > Something will allow what the extra stuff is and the easiest way to remove > it. > > On Tue, Jun 27, 2017, 12:12 PM Daniel K <satha...@gmail.com> wrote: > >> Hi, >> >> I'm extremely new to ceph and have a small 4-node/20-osd cluster. >> >> I just upgraded from kraken to luminous without much ado, except now when >> I run ceph status, I get a health_warn because "2 osds exist in the crush >> map but not in the osdmap" >> >> Googling the error message only took me to the source file on github >> >> I tried exporting and decompiling the crushmap -- there were two osd >> devices named differently. The normal name would be something like >> >> device 0 osd.0 >> device 1 osd.1 >> >> but two were named: >> >> device 5 device5 >> device 11 device11 >> >> I had edited the crushmap in the past, so it's possible this was >> introduced by me. >> >> I tried changing those to match the rest, recompiling and setting the >> crushmap, but ceph status still complains. >> >> Any assistance would be greatly appreciated. >> >> Thanks, >> Dan >> >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
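For reference, the decompile/edit/recompile round trip used in this thread looks like this (file names are arbitrary):

```shell
# Export, edit, and re-inject the CRUSH map to drop phantom device entries.
ceph osd getcrushmap -o crushmap.bin       # binary CRUSH map from the cluster
crushtool -d crushmap.bin -o crushmap.txt  # decompile to editable text
# Edit crushmap.txt: delete the phantom "device 5 device5" and
# "device 11 device11" lines -- they refer to OSDs absent from the osdmap.
crushtool -c crushmap.txt -o crushmap.new  # recompile
ceph osd setcrushmap -i crushmap.new       # inject the fixed map
```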
[ceph-users] osds exist in the crush map but not in the osdmap after kraken > luminous rc1 upgrade
Hi, I'm extremely new to ceph and have a small 4-node/20-osd cluster. I just upgraded from kraken to luminous without much ado, except now when I run ceph status, I get a health_warn because "2 osds exist in the crush map but not in the osdmap" Googling the error message only took me to the source file on github. I tried exporting and decompiling the crushmap -- there were two osd devices named differently. The normal name would be something like device 0 osd.0 device 1 osd.1 but two were named: device 5 device5 device 11 device11 I had edited the crushmap in the past, so it's possible this was introduced by me. I tried changing those to match the rest, recompiling and setting the crushmap, but ceph status still complains. Any assistance would be greatly appreciated. Thanks, Dan
Re: [ceph-users] design guidance
I started down that path and got so deep that I couldn't even find where I went in. I couldn't make heads or tails out of what would or wouldn't work. We didn't need multiple hosts accessing a single datastore, so on the client side I just have a single VM guest running on each ESXi host, with the cephfs filesystem mounted on it (via a 10Gb connection to the ceph environment), and then exported via NFS on a host-only network, and mounted on the host. Not quite as redundant as it could be, but good enough for our usage. I'm seeing ~500MB/s speeds going to a 4-node cluster with 5x1TB 7200rpm drives. I tried it first, in a similar config, except using LIO to export an RBD device via iSCSI, still on the local host network. Write performance was good, but read performance was only around 120MB/s. I didn't do much troubleshooting, just tried NFS after that and was happy with it. On Tue, Jun 6, 2017 at 2:33 AM, Adrian Saul wrote: > > > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 > > > and > > > 6.0 hosts(migrating from a VMWare environment), later to transition to > > > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio > > > and saw much worse performance with the first cluster, so it seems > > > this may be the better way, but I'm open to other suggestions. > > > > > I've never seen any ultimate solution to providing HA iSCSI on top of > Ceph, > > though other people here have made significant efforts. > > In our tests our best results were with SCST - also because it provided > proper ALUA support at the time. I ended up developing my own pacemaker > cluster resources to manage the SCST orchestration and ALUA failover. In > our model we have a pacemaker cluster in front being an RBD client > presenting LUNs/NFS out to VMware (NFS), Solaris and Hyper-V (iSCSI). We > are using CephFS over NFS but performance has been poor, even using it just > for VMware templates. 
We are on an earlier version of Jewel so it's > possibly some later versions may improve CephFS for that but I have not had > time to test it. > > We have been running a small production/POC for over 18 months on that > setup, and gone live into a much larger setup in the last 6 months based on > that model. It's not without its issues, but most of that is a lack of > test resources to be able to shake out some of the client compatibility and > failover shortfalls we have.
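The gateway-VM setup Daniel describes at the top of this thread can be sketched roughly as follows. The monitor address, mount points, client network, and export options are illustrative assumptions, not taken from the thread.

```shell
# Rough sketch of the NFS-gateway VM: kernel-mount CephFS, re-export over NFS.
# 1. Mount CephFS inside the gateway VM (mon address and keyfile are examples):
mkdir -p /mnt/cephfs
mount -t ceph 10.0.15.51:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret
# 2. Export it over the host-only network to the ESXi host:
echo "/mnt/cephfs 192.168.100.0/24(rw,sync,no_subtree_check,no_root_squash)" \
    >> /etc/exports
exportfs -ra
```

ESXi then mounts the export as an NFS datastore over the host-only network, as described above.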
Re: [ceph-users] design guidance
Christian, Thank you for the tips -- I certainly googled my eyes out for a good while before asking -- maybe my google-fu wasn't too good last night. > I love using IB, alas with just one port per host you're likely best off > ignoring it, unless you have a converged network/switches that can make > use of it (or run it in Ethernet mode). I've always heard people speak fondly of IB, but I've honestly never dealt with it. I'm mostly a network guy at heart, so I'm perfectly comfortable aggregating 10GB/s connections till the cows come home. What are some of the virtues of IB, over ethernet? (not ethernet over IB) > Bluestore doesn't have journals per se and unless you're going to wait for > Luminous I wouldn't recommend using Bluestore in production. > Hell, I won't be using it any time soon, but anything pre L sounds > like outright channeling Murphy to smite you I do like to play with fire often, but not normally with other people's data. I suppose I will stay away from Bluestore for now, unless Luminous is released within the next few weeks. I am using it on Kraken in my small test-cluster so far without a visit from Murphy. > That said, what SSD is it? > Bluestore WAL needs are rather small. > OTOH, a single SSD isn't something I'd recommend either, SPOF and all. > I'm guessing you have no budget to improve on that gift horse? It's a Micron 1100 256Gb, rated for 120TBW, which works out to about 100GB/day for 3 years, so not even .5DWPD. I doubt it has the endurance to journal 36 1TB drives. I do have some room in the budget, and NVMe journals have been on the back of my mind. These servers have 6 PCIe x8 slots in them, so tons of room. But then I'm going to get asked about a cache tier, which everyone seems to think is the holy grail (and probably would be, if they could 'just work') But from what I read, they're an utter nightmare to manage, particularly without a well defined workload, and often would hurt more than they help. 
I haven't spent a ton of time with the network gear that was dumped on me, but the switches I have now are a Nexus 7000, x4 Force10 S4810 (so I do have some stackable 10Gb that I can MC-LAG), x2 Mellanox IS5023 (18 port IB switch), what appears to be a giant IB switch (Qlogic 12800-120) and another apparently big boy (Qlogic 12800-180). I'm going to pick them up from the warehouse tomorrow. If I stay away from IB completely, may just use the IB card as a 4x10GB + the 2x 10GB on board like I had originally mentioned. But if that IB gear is good, I'd hate to see it go to waste. Might be worth getting a second IB card for each server. Again, thanks a million for the advice. I'd rather learn this the easy way than to have to rebuild this 6 times over the next 6 months. On Tue, Jun 6, 2017 at 2:05 AM, Christian Balzer <ch...@gol.com> wrote: > > Hello, > > lots of similar questions in the past, google is your friend. > > On Mon, 5 Jun 2017 23:59:07 -0400 Daniel K wrote: > > > I've built 'my-first-ceph-cluster' with two of the 4-node, 12 drive > > Supermicro servers and dual 10Gb interfaces(one cluster, one public) > > > > I now have 9x 36-drive supermicro StorageServers made available to me, > each > > with dual 10GB and a single Mellanox IB/40G nic. No 1G interfaces except > > IPMI. 2x 6-core 6-thread 1.7ghz xeon processors (12 cores total) for 36 > > drives. Currently 32GB of ram. 36x 1TB 7.2k drives. > > > I love using IB, alas with just one port per host you're likely best off > ignoring it, unless you have a converged network/switches that can make > use of it (or run it in Ethernet mode). > > > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and > > 6.0 hosts(migrating from a VMWare environment), later to transition to > > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio and > saw > > much worse performance with the first cluster, so it seems this may be > the > > better way, but I'm open to other suggestions. 
> > > I've never seen any ultimate solution to providing HA iSCSI on top of > Ceph, though other people here have made significant efforts. > > > Considerations: > > Best practice documents indicate .5 cpu per OSD, but I have 36 drives and > > 12 CPUs. Would it be better to create 18x 2-drive raid0 on the hardware > > raid card to present a fewer number of larger devices to ceph? Or run > > multiple drives per OSD? > > > You're definitely underpowered in the CPU department and I personally > would make RAID1 or 10s for never having to re-balance an OSD. > But if space is an issue, RAID0s would do. > OTOH, w/o any SSDs in the game your HDD only cluster is going to be less > CPU hungry than others. > > > There is a single 256gb SSD which i feel would be a bottleneck if I used > it
[ceph-users] design guidance
I've built 'my-first-ceph-cluster' with two of the 4-node, 12 drive Supermicro servers and dual 10Gb interfaces(one cluster, one public) I now have 9x 36-drive supermicro StorageServers made available to me, each with dual 10GB and a single Mellanox IB/40G nic. No 1G interfaces except IPMI. 2x 6-core 6-thread 1.7ghz xeon processors (12 cores total) for 36 drives. Currently 32GB of ram. 36x 1TB 7.2k drives. Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and 6.0 hosts(migrating from a VMWare environment), later to transition to qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio and saw much worse performance with the first cluster, so it seems this may be the better way, but I'm open to other suggestions. Considerations: Best practice documents indicate .5 cpu per OSD, but I have 36 drives and 12 CPUs. Would it be better to create 18x 2-drive raid0 on the hardware raid card to present a fewer number of larger devices to ceph? Or run multiple drives per OSD? There is a single 256gb SSD which i feel would be a bottleneck if I used it as a journal for all 36 drives, so I believe bluestore with a journal on each drive would be the best option. Is 1.7Ghz too slow for what I'm doing? I like the idea of keeping the public and cluster networks separate. Any suggestions on which interfaces to use for what? I could theoretically push 36Gb/s, figuring 125MB/s for each drive, but in reality will I ever see that? Perhaps bond the two 10GB and use them as the public, and the 40gb as the cluster network? Or split the 40gb in to 4x10gb and use 3x10GB bonded for each? If there is a more appropriate venue for my request, please point me in that direction. Thanks, Dan
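For the "bond the two 10GB" idea, an LACP bond on Ubuntu 16.04 ifupdown would look roughly like this. Interface names and addresses are examples, the ifenslave package is required, and the switch ports must be configured for LACP to match.

```shell
# Hedged sketch of an LACP bond across the two 10GbE ports (Ubuntu 16.04).
apt-get install -y ifenslave
cat >> /etc/network/interfaces <<'EOF'
auto bond0
iface bond0 inet static
    address 10.0.15.51
    netmask 255.255.255.0
    bond-slaves enp3s0f0 enp3s0f1
    bond-mode 802.3ad      # LACP; switch-side port-channel must match
    bond-miimon 100
    bond-lacp-rate fast
EOF
ifup bond0
```

Note that 802.3ad hashes per-flow, so a single TCP stream still tops out at one link's bandwidth; the aggregate helps with many concurrent OSD/client flows.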
[ceph-users] Kraken bluestore compression
Hi, I see several mentions that compression is available in Kraken for bluestore OSDs, however, I can find almost nothing in the documentation that indicates how to use it. I've found: - http://docs.ceph.com/docs/master/radosgw/compression/ - http://ceph.com/releases/v11-2-0-kraken-released/ I'm fairly new to ceph, so I don't have a good grasp of how rados commands apply to osd pools, so if the first link is relevant, I apologize. I am seeing: "compression_mode|compression_algorithm|compression_required_ratio|compression_max_blob_size|compression_min_blob_size" as options when I run ceph osd pool set -- but I can't find anything documented to explain what parameters are available for those options. Could someone point me in the right direction? Thanks, Dan
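For what it's worth, those pool options take values along these lines. The pool name is an example, which algorithms are actually available depends on how ceph was built (snappy and zlib are the common ones), and the blob-size values shown are illustrative rather than recommended defaults.

```shell
# Example invocations for the bluestore compression options listed above.
ceph osd pool set mypool compression_mode aggressive       # none|passive|aggressive|force
ceph osd pool set mypool compression_algorithm snappy      # e.g. snappy|zlib (lz4/zstd if built in)
ceph osd pool set mypool compression_required_ratio 0.875  # keep compressed copy only if <= 87.5% of original
ceph osd pool set mypool compression_min_blob_size 131072  # don't bother compressing smaller blobs
ceph osd pool set mypool compression_max_blob_size 524288  # split larger writes into blobs of this size
```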
Re: [ceph-users] mds slow request, getattr currently failed to rdlock. Kraken with Bluestore
Yes -- the crashed server also mounted cephfs as a client, and also likely had active writes to the file when it crashed. I have the max file size set to 17,592,186,044,416 -- but this file was about 5.8TB. The likely reason for the crash? The file was mounted as a fileio backstore to LIO, which was exported as an FC lun that I had connected to an ESXi server, mapped via RDM to a guest, in which I had a dd if=/dev/zero of=/dev/sdb bs=1M count=6 running(for several hours) Which I think was breaking at least 3 "don't do this" rules with ceph. Once it moves into production the pieces will be separated. On Wed, May 24, 2017 at 4:55 PM, Gregory Farnum <gfar...@redhat.com> wrote: > On Wed, May 24, 2017 at 3:15 AM, John Spray <jsp...@redhat.com> wrote: > > On Tue, May 23, 2017 at 11:41 PM, Daniel K <satha...@gmail.com> wrote: > >> Have a 20 OSD cluster -"my first ceph cluster" that has another 400 OSDs > >> enroute. > >> > >> I was "beating up" on the cluster, and had been writing to a 6TB file in > >> CephFS for several hours, during which I changed the crushmap to better > >> match my environment, generating a bunch of recovery IO. After about > 5.8TB > >> written, one of the OSD(which is also a MON..soon to be rectivied) hosts > >> crashed that hat 5 OSDs on it, and after rebooting, I have this in ceph > -s: > >> (The degraded/misplaced warnings are likely because the cluster hasn't > >> completed rebalancing after I changed the crushmap I assume) > >> > > > > Losing a quarter of your OSDs down while simultaneously rebalancing > > after editing your CRUSH map is a brutal thing to a Ceph cluster, and > > I would expect it to impact your client IO severely. > > > > I see that you've got 112MB/s of recovery going on, which may or may > > not be saturating some links depending on whether you're using 1gig or > > 10gig networking. 
> > > >> 2017-05-23 18:33:13.775924 7ff9d3230700 -1 WARNING: the following > dangerous > >> and experimental features are enabled: bluestore > >> 2017-05-23 18:33:13.781732 7ff9d3230700 -1 WARNING: the following > dangerous > >> and experimental features are enabled: bluestore > >> cluster e92e20ca-0fe6-4012-86cc-aa51e041 > >> health HEALTH_WARN > >> 440 pgs backfill_wait > >> 7 pgs backfilling > >> 85 pgs degraded > >> 5 pgs recovery_wait > >> 85 pgs stuck degraded > >> 452 pgs stuck unclean > >> 77 pgs stuck undersized > >> 77 pgs undersized > >> recovery 196526/3554278 objects degraded (5.529%) > >> recovery 1690392/3554278 objects misplaced (47.559%) > >> mds0: 1 slow requests are blocked > 30 sec > >> monmap e4: 3 mons at > >> {stor-vm1=10.0.15.51:6789/0,stor-vm2=10.0.15.52:6789/0, > stor-vm3=10.0.15.53:6789/0} > >> election epoch 136, quorum 0,1,2 stor-vm1,stor-vm2,stor-vm3 > >> fsmap e21: 1/1/1 up {0=stor-vm4=up:active} > >> mgr active: stor-vm1 standbys: stor-vm2 > >> osdmap e4655: 20 osds: 20 up, 20 in; 450 remapped pgs > >> flags sortbitwise,require_jewel_osds,require_kraken_osds > >> pgmap v192589: 1428 pgs, 5 pools, 5379 GB data, 1345 kobjects > >> 11041 GB used, 16901 GB / 27943 GB avail > >> 196526/3554278 objects degraded (5.529%) > >> 1690392/3554278 objects misplaced (47.559%) > >> 975 active+clean > >> 364 active+remapped+backfill_wait > >> 76 active+undersized+degraded+remapped+backfill_wait > >>3 active+recovery_wait+degraded+remapped > >>3 active+remapped+backfilling > >>3 active+degraded+remapped+backfilling > >>2 active+recovery_wait+degraded > >>1 active+clean+scrubbing+deep > >>1 active+undersized+degraded+remapped+backfilling > >> recovery io 112 MB/s, 28 objects/s > >> > >> > >> Seems related to the "corrupted rbd filesystems since jewel" thread. 
> >> > >> > >> log entries on the MDS server: > >> > >> 2017-05-23 18:27:12.966218 7f95ed6c0700 0 log_channel(cluster) log > [WRN] : > >> slow request 243.113407 seconds old, received at 2017-05-23 > 18:23:09.852729: > >> cl
Re: [ceph-users] mds slow request, getattr currently failed to rdlock. Kraken with Bluestore
Networking is 10Gig. I notice recovery IO is wildly variable, I assume that's normal. Very little load as this is yet to go into production, I was "seeing what it would handle" at the time it broke. I checked this morning and the slow request had gone and I could access the blocked file again. All OSes are Ubuntu 16.04.01 with the stock 4.4.0-72-generic kernel, and there were two CephFS clients accessing it, also 16.04.1. Ceph on all is 11.2.0, installed from the debian-kraken repos at download.ceph.com. All OSDs are bluestore. As of now all is okay, so don't want to waste anyone's time on a wild goose chase. On Wed, May 24, 2017 at 6:15 AM, John Spray <jsp...@redhat.com> wrote: > On Tue, May 23, 2017 at 11:41 PM, Daniel K <satha...@gmail.com> wrote: > > Have a 20 OSD cluster -"my first ceph cluster" that has another 400 OSDs > > enroute. > > > > I was "beating up" on the cluster, and had been writing to a 6TB file in > > CephFS for several hours, during which I changed the crushmap to better > > match my environment, generating a bunch of recovery IO. After about > 5.8TB > > written, one of the OSD(which is also a MON..soon to be rectivied) hosts > > crashed that hat 5 OSDs on it, and after rebooting, I have this in ceph > -s: > > (The degraded/misplaced warnings are likely because the cluster hasn't > > completed rebalancing after I changed the crushmap I assume) > > > > Losing a quarter of your OSDs down while simultaneously rebalancing > after editing your CRUSH map is a brutal thing to a Ceph cluster, and > I would expect it to impact your client IO severely. > > I see that you've got 112MB/s of recovery going on, which may or may > not be saturating some links depending on whether you're using 1gig or > 10gig networking. 
> > > 2017-05-23 18:33:13.775924 7ff9d3230700 -1 WARNING: the following > dangerous > > and experimental features are enabled: bluestore > > 2017-05-23 18:33:13.781732 7ff9d3230700 -1 WARNING: the following > dangerous > > and experimental features are enabled: bluestore > > cluster e92e20ca-0fe6-4012-86cc-aa51e041 > > health HEALTH_WARN > > 440 pgs backfill_wait > > 7 pgs backfilling > > 85 pgs degraded > > 5 pgs recovery_wait > > 85 pgs stuck degraded > > 452 pgs stuck unclean > > 77 pgs stuck undersized > > 77 pgs undersized > > recovery 196526/3554278 objects degraded (5.529%) > > recovery 1690392/3554278 objects misplaced (47.559%) > > mds0: 1 slow requests are blocked > 30 sec > > monmap e4: 3 mons at > > {stor-vm1=10.0.15.51:6789/0,stor-vm2=10.0.15.52:6789/0,stor- > vm3=10.0.15.53:6789/0} > > election epoch 136, quorum 0,1,2 stor-vm1,stor-vm2,stor-vm3 > > fsmap e21: 1/1/1 up {0=stor-vm4=up:active} > > mgr active: stor-vm1 standbys: stor-vm2 > > osdmap e4655: 20 osds: 20 up, 20 in; 450 remapped pgs > > flags sortbitwise,require_jewel_osds,require_kraken_osds > > pgmap v192589: 1428 pgs, 5 pools, 5379 GB data, 1345 kobjects > > 11041 GB used, 16901 GB / 27943 GB avail > > 196526/3554278 objects degraded (5.529%) > > 1690392/3554278 objects misplaced (47.559%) > > 975 active+clean > > 364 active+remapped+backfill_wait > > 76 active+undersized+degraded+remapped+backfill_wait > >3 active+recovery_wait+degraded+remapped > >3 active+remapped+backfilling > >3 active+degraded+remapped+backfilling > >2 active+recovery_wait+degraded > >1 active+clean+scrubbing+deep > >1 active+undersized+degraded+remapped+backfilling > > recovery io 112 MB/s, 28 objects/s > > > > > > Seems related to the "corrupted rbd filesystems since jewel" thread. 
> > > > > > log entries on the MDS server: > > > > 2017-05-23 18:27:12.966218 7f95ed6c0700 0 log_channel(cluster) log > [WRN] : > > slow request 243.113407 seconds old, received at 2017-05-23 > 18:23:09.852729: > > client_request(client.204100:5 getattr pAsLsXsFs #10003ec 2017-05-23 > > 17:48:23.770852 RETRY=2 caller_uid=0, caller_gid=0{}) currently failed to > > rdlock, waiting > > > > > > output of ceph daemon mds.stor-vm4 objecter_requests(changes each time I > run > > it) > > If that changes each time you run i