Re: [ceph-users] rebalancing taking very long time
I found a place to paste my output of `ceph daemon osd.xx config show` for all my OSDs:
https://www.zerobin.net/?743bbbdea41874f4#FNk5EjsfRxvkX1JuTp52fQ4CXW6VOIEB0Lj0Icnyr4Q=
If you want it in a gzip'd txt file, you can download it here:
https://mega.nz/#!oY5QAByC!JEWhHRms0WwbYbwG4o4RdTUWtFwFjUDLWhtNtEDhBkA

It honestly looks to me like the disks are maxing out on IOPS, and a good portion of the disks were hitting 100% utilization according to dstat whenever there was rebalancing or client I/O. I'm running this to look at my disk stats:

dstat -cd --disk-util -D sda,sdb,sdc,sdd,sde,sdf,sdg,sdh --disk-tps

I don't have any client load on my cluster at this point to show any good output, but with just '11 active+clean+scrubbing+deep' running, I am seeing 70-80% disk utilization for each OSD according to dstat.

On Thu, Sep 3, 2015 at 2:34 AM, Jan Schermer wrote:
> Can you post the output of
>
> ceph daemon osd.xx config show? (probably as an attachment).
>
> There are several things that I've seen cause it:
> 1) too many PGs but too few degraded objects make it seem "slow" (if
> you just have 2 degraded objects but restarted a host with 10K PGs, it will
> probably have to scan all the PGs)
> 2) sometimes the process gets stuck when a toofull condition occurs
> 3) sometimes the process gets stuck for no apparent reason - restarting
> the currently backfilling/recovering OSDs fixes it
> setting osd_recovery_threads sometimes fixes both 2) and 3), but usually
> not
> 4) setting recovery_delay_start to anything > 0 makes recovery slow (even
> 0.001 makes it much slower than a simple 0). On the other hand we had to
> set it high as a default because of slow ops when restarting OSDs, which
> was partially fixed by this.
>
> Can you see any bottleneck in the system? CPU spinning, disks reading? I
> don't think this is the issue, just make sure it's not something more
> obvious...
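For reference, the knobs Jan lists live under [osd] in ceph.conf (they can also be injected at runtime). A minimal sketch; the values here are illustrative assumptions, not recommendations from this thread, and the right numbers depend on your hardware:

```ini
[osd]
# point 4: any recovery_delay_start > 0 slows recovery noticeably
osd recovery delay start = 0
# the workaround Jan mentions for stuck recovery (points 2 and 3); often no effect
osd recovery threads = 2
# common companion throttles (assumptions, not from this thread):
# lower them to protect client I/O on IOPS-bound spinners,
# raise them to finish rebalancing faster when the cluster is idle
osd max backfills = 1
osd recovery max active = 3
```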
>
> Jan
>
>
> On 02 Sep 2015, at 22:34, Bob Ababurko wrote:
>
> When I lose a disk OR replace an OSD in my POC ceph cluster, it takes a
> very long time to rebalance. I should note that my cluster is slightly
> unique in that I am using cephfs (shouldn't matter?) and it currently
> contains about 310 million objects.
>
> The last time I replaced a disk/OSD was 2.5 days ago and it is still
> rebalancing. This is on a cluster with no client load.
>
> The configuration is 5 hosts with 6 x 1TB 7200rpm SATA OSDs & 1 850 Pro
> SSD which contains the journals for said OSDs. That means 30 OSDs in
> total. The system disk is on its own disk. I'm also using a backend network
> with a single Gb NIC. The rebalancing rate (objects/s) seems to be very slow
> when it is close to finishing, say <1% objects misplaced.
>
> It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
> with no load on the cluster. Are my expectations off?
>
> I'm not sure if my pg_num/pgp_num needs to be changed OR the rebalance
> time is dependent on the number of objects in the pool. These are thoughts
> I've had but am not certain are relevant here.
>
> $ sudo ceph -v
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>
> $ sudo ceph -s
> [sudo] password for bababurko:
>     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>      health HEALTH_WARN
>             5 pgs backfilling
>             5 pgs stuck unclean
>             recovery 3046506/676638611 objects misplaced (0.450%)
>      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>             election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>      mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
>      osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
>       pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
>             18319 GB used, 9612 GB / 27931 GB avail
>             3046506/676638611 objects misplaced (0.450%)
>                 2095 active+clean
>                   12 active+clean+scrubbing+deep
>                    5 active+remapped+backfilling
>   recovery io 2294 kB/s, 147 objects/s
>
> $ sudo rados df
> pool name KB objects clones degraded unfound rd rd KB wr wr KB
> cephfs_data 676756996233574670200 0 21368341676984208 7052266742
> cephfs_metadata 42738 105843700 0 16130199 30718800215295996938 3811963908
> rbd 0000 00000
> total use
[ceph-users] rebalancing taking very long time
When I lose a disk OR replace an OSD in my POC ceph cluster, it takes a very long time to rebalance. I should note that my cluster is slightly unique in that I am using cephfs (shouldn't matter?) and it currently contains about 310 million objects.

The last time I replaced a disk/OSD was 2.5 days ago and it is still rebalancing. This is on a cluster with no client load.

The configuration is 5 hosts with 6 x 1TB 7200rpm SATA OSDs & 1 850 Pro SSD which contains the journals for said OSDs. That means 30 OSDs in total. The system disk is on its own disk. I'm also using a backend network with a single Gb NIC. The rebalancing rate (objects/s) seems to be very slow when it is close to finishing, say <1% objects misplaced.

It doesn't seem right that it would take 2+ days to rebalance a 1TB disk with no load on the cluster. Are my expectations off?

I'm not sure if my pg_num/pgp_num needs to be changed OR the rebalance time is dependent on the number of objects in the pool. These are thoughts I've had but am not certain are relevant here.
$ sudo ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

$ sudo ceph -s
[sudo] password for bababurko:
    cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
     health HEALTH_WARN
            5 pgs backfilling
            5 pgs stuck unclean
            recovery 3046506/676638611 objects misplaced (0.450%)
     monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
            election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
     mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
     osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
      pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
            18319 GB used, 9612 GB / 27931 GB avail
            3046506/676638611 objects misplaced (0.450%)
                2095 active+clean
                  12 active+clean+scrubbing+deep
                   5 active+remapped+backfilling
  recovery io 2294 kB/s, 147 objects/s

$ sudo rados df
pool name KB objects clones degraded unfound rd rd KB wr wr KB
cephfs_data 676756996233574670200 0 21368341676984208 7052266742
cephfs_metadata 42738 105843700 0 16130199 30718800215295996938 3811963908
rbd 0000 00000
total used 19209068780336805139
total avail 10079469460
total space 29288538240

$ sudo ceph osd pool get cephfs_data pgp_num
pg_num: 1024
$ sudo ceph osd pool get cephfs_metadata pgp_num
pg_num: 1024

thanks,
Bob

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
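For what it's worth, the usual rule of thumb for sizing pg_num at the time was roughly 100 PGs per OSD divided by the replica count, rounded up to a power of two. A small sketch; the 100-PGs-per-OSD target and 3x replication are assumptions for illustration, not values stated in this thread:

```python
def target_pg_num(num_osds, replicas, pgs_per_osd=100):
    """Rule-of-thumb pg_num: (OSDs * target PGs per OSD) / replicas,
    rounded up to the next power of two."""
    raw = num_osds * pgs_per_osd / replicas
    pg = 1
    while pg < raw:
        pg *= 2
    return pg

# 30 OSDs with assumed 3x replication -> 1024,
# which matches the per-pool pg_num shown above
print(target_pg_num(30, 3))
```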
Re: [ceph-users] mds server(s) crashed
On Wed, Aug 12, 2015 at 7:21 PM, Yan, Zheng wrote:
> On Thu, Aug 13, 2015 at 7:05 AM, Bob Ababurko wrote:
> >
> > If I am using a more recent client (kernel OR ceph-fuse), should I still
> > be worried about the MDSs crashing? I have added RAM to my MDS hosts and
> > it's my understanding this will also help mitigate any issues, in addition to
> > setting mds_bal_frag = true. Not having used cephfs before, do I always
> > need to worry about my MDS servers crashing all the time, thus the need for
> > setting mds_reconnect_timeout to 0? This is not ideal for us, nor is the
> > idea of clients not being able to access their mounts after an MDS recovery.
> >
>
> It's unlikely this issue will happen again. But I can't guarantee no
> other issue.
>
> no need to set mds_reconnect_timeout to 0.

ok, good to know.

> >
> > I am actually looking for the most stable way to implement cephfs at this
> > point. My cephfs cluster contains millions of small files, so many inodes,
> > if that needs to be taken into account. Perhaps I should only be using one
> > MDS node for stability at this point? Is this the best way forward to get a
> > handle on stability? I'm also curious whether I should set my mds cache
> > size to a number greater than the number of files I have in the cephfs
> > cluster. If you can give some key points on configuring cephfs for the best
> > stability and, if possible, availability, this would be helpful to me.
>
> One active MDS is the most stable setup. Adding a few standby MDSes
> should not hurt stability.
>
> You can't set mds cache size to a number greater than the files in the
> fs; it requires lots of memory.

I'm not sure what amount of RAM you consider to be 'lots', but I would really like to understand a bit more about this. Perhaps a rule of thumb? Is there an advantage to more RAM & a large mds cache size?
We plan on putting close to a billion small files in this pool via cephfs, so what should we be considering when sizing our MDS hosts OR changing the MDS config? Basically, what should we OR should we not be doing when we have a cluster with this many files?

Thanks!

> Yan, Zheng
>
> >
> > thanks again for the help.
> >
> > thanks,
> > Bob
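As a rough way to reason about Zheng's "lots of memory" point: each inode held in the MDS cache costs on the order of a few KB of RAM. A back-of-the-envelope sketch; the ~2 KB/inode figure is an assumption for illustration, not a measured number from this thread:

```python
def mds_cache_ram_gib(mds_cache_size, bytes_per_inode=2048):
    """Estimate MDS RAM needed for a given mds_cache_size (inode count),
    assuming a rough per-inode cache cost."""
    return mds_cache_size * bytes_per_inode / 2**30

# default cache of 100k inodes vs. caching 1% of a billion files
print(round(mds_cache_ram_gib(100_000), 2))     # well under 1 GiB
print(round(mds_cache_ram_gib(10_000_000), 1))  # roughly 19 GiB at 2 KB/inode
```

Caching all billion inodes at once would need terabytes of RAM under this assumption, which is why Zheng advises against setting mds cache size larger than the file count.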
Re: [ceph-users] mds server(s) crashed
If I am using a more recent client (kernel OR ceph-fuse), should I still be worried about the MDSs crashing? I have added RAM to my MDS hosts and it's my understanding this will also help mitigate any issues, in addition to setting mds_bal_frag = true. Not having used cephfs before, do I always need to worry about my MDS servers crashing all the time, thus the need for setting mds_reconnect_timeout to 0? This is not ideal for us, nor is the idea of clients not being able to access their mounts after an MDS recovery.

I am actually looking for the most stable way to implement cephfs at this point. My cephfs cluster contains millions of small files, so many inodes, if that needs to be taken into account. Perhaps I should only be using one MDS node for stability at this point? Is this the best way forward to get a handle on stability? I'm also curious whether I should set my mds cache size to a number greater than the number of files I have in the cephfs cluster. If you can give some key points on configuring cephfs for the best stability and, if possible, availability, that would be helpful to me.

thanks again for the help.

thanks,
Bob
Re: [ceph-users] mds server(s) crashed
John,

This seems to have worked. I rebooted my client and restarted ceph on the MDS hosts after giving them more RAM. I restarted the rsyncs that were running on the client after remounting the cephfs fs, and things seem to be working. I can access the files, so that is a relief.

What is risky about enabling mds_bal_frag on a cluster with data, and will there be any performance degradation if enabled?

Thanks again for the help.

On Tue, Aug 11, 2015 at 2:25 PM, John Spray wrote:
> On Tue, Aug 11, 2015 at 6:23 PM, Bob Ababurko wrote:
> > Here is the backtrace from the core dump.
> >
> > (gdb) bt
> > #0  0x7f71f5404ffb in raise () from /lib64/libpthread.so.0
> > #1  0x0087065d in reraise_fatal (signum=6) at global/signal_handler.cc:59
> > #2  handle_fatal_signal (signum=6) at global/signal_handler.cc:109
> > #3  <signal handler called>
> > #4  0x7f71f40235d7 in raise () from /lib64/libc.so.6
> > #5  0x7f71f4024cc8 in abort () from /lib64/libc.so.6
> > #6  0x7f71f49279b5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
> > #7  0x7f71f4925926 in ?? () from /lib64/libstdc++.so.6
> > #8  0x7f71f4925953 in std::terminate() () from /lib64/libstdc++.so.6
> > #9  0x7f71f4925b73 in __cxa_throw () from /lib64/libstdc++.so.6
> > #10 0x0077d0fc in operator new (num_bytes=2408) at mds/CInode.h:120
> > Python Exception list index out of range:
> > #11 CDir::_omap_fetched (this=0x90af04f8, hdrbl=..., omap=std::map with 65536 elements, want_dn="", r=<optimized out>) at mds/CDir.cc:1700
> > #12 0x007d7d44 in complete (r=0, this=0x502b000) at include/Context.h:65
> > #13 MDSIOContextBase::complete (this=0x502b000, r=0) at mds/MDSContext.cc:59
> > #14 0x00894818 in Finisher::finisher_thread_entry (this=0x5108698) at common/Finisher.cc:59
> > #15 0x7f71f53fddf5 in start_thread () from /lib64/libpthread.so.0
> > #16 0x7f71f40e41ad in clone () from /lib64/libc.so.6
>
> If we believe the line numbers here, then it's a malloc failure. Are
> you running out of memory?
>
> The MDS is loading a bunch of these 64k-file directories (presumably a
> characteristic of your workload), and ending up with an unusually
> large number of inodes in cache (this is all happening during the
> "rejoin" phase, so no trimming of the cache is done and we merrily
> exceed the default mds_cache_size limit of 100k inodes).
>
> The thing triggering the load of the dirs is clients replaying
> requests that refer to inodes by inode number, and the MDS's procedure
> for handling that involves fully loading the relevant dirs. That
> might be something we can improve; it doesn't seem obviously necessary
> to load all the dentries in a dirfrag during this phase.
>
> Anyway, you can hopefully recover from this state by forcibly
> unmounting your clients. Since you're using the kernel client, it may
> be easiest to hard reset the client boxes. When you next restart your
> MDS, the clients won't be present, so the MDS will be able to make it
> all the way up without trying to load a bunch of directory fragments.
> If you've got some more RAM for the MDS box, that wouldn't hurt either.
>
> One of the less well tested (but relevant here) features we have is
> directory fragmentation, where large dirs like these are internally
> split up (partly to avoid memory management issues like this). It
> might be a risky business on a system that you've already got real
> data on, but once your MDS is back up and running you can try enabling
> the mds_bal_frag setting.
>
> This is not a use case we have particularly strong coverage of in our
> automated tests, so thanks for your experimentation and persistence.
>
> John
>
> >
> > I have also gotten a log file w/ debug mds = 20. It was 1.2GB, so I
> > bzip2'd it w/ max compression and got it down to 75MB. I wasn't sure where
> > to upload it, so if there is a better place to put it, please let me know.
> >
> > https://mega.nz/#!5V4z3A7K!0METjVs5t3DAQAts8_TYXWrLh2FhGHcb7oC4uuhr2T8
> >
> > thanks,
> > Bob
> >
> >
> > On Mon, Aug 10, 2015 at 8:05 PM, Yan, Zheng wrote:
> >>
> >> On Tue, Aug 11, 2015 at 9:21 AM, Bob Ababurko wrote:
> >> > I had a dual mds server configuration and have been copying data via cephfs
> >> > kernel module to my cluster for the past 3 weeks and just had an MDS crash
> >> > halting all IO. Leading up to the crash, I ran a test dd that increased the
> >> > throughpu
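Enabling the directory fragmentation John describes above is a one-line setting. A hedged sketch of the ceph.conf fragment; in hammer this was off by default and, as John notes, not heavily tested, so treat it as experimental on a cluster that already holds real data:

```ini
[mds]
# split large directory fragments internally instead of
# holding each huge directory as a single dirfrag
mds bal frag = true
```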
Re: [ceph-users] mds server(s) crashed
Yes, this was a package install, and ceph-debuginfo was used, so hopefully the output of the backtrace is useful.

I thought it was interesting that you mentioned reproducing with an ls, because aside from me doing a large dd before this issue surfaced, your post made me recall that I also ran ls a few times around the same time, to drill down and eventually list the files that are located two subdirectories down. I also recall thinking for a moment that it was strange that I got results back so quickly, because our netapp takes forever to do this... it was so quick that, in retrospect, the list of files may not have been complete. I regret not following up that thought.

On Tue, Aug 11, 2015 at 1:52 AM, John Spray wrote:
> On Tue, Aug 11, 2015 at 2:21 AM, Bob Ababurko wrote:
> > I had a dual mds server configuration and have been copying data via cephfs
> > kernel module to my cluster for the past 3 weeks, and just had an MDS crash
> > halting all IO. Leading up to the crash, I ran a test dd that increased the
> > throughput by about 2x and stopped it, but about 10 minutes later the MDS
> > server crashed and did not fail over to the standby properly. I am using
> > an active/standby mds configuration, but neither of the mds servers will
> > stay running at this point; they crash after starting them.
> >
> > [bababurko@cephmon01 ~]$ sudo ceph -s
> >     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
> >      health HEALTH_WARN
> >             mds cluster is degraded
> >             mds cephmds02 is laggy
> >             noscrub,nodeep-scrub flag(s) set
> >      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
> >             election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
> >      mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
> >      osdmap e324: 30 osds: 30 up, 30 in
> >             flags noscrub,nodeep-scrub
> >       pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
> >             14051 GB used, 13880 GB / 27931 GB avail
> >                 2112 active+clean
> >
> > I am not sure what information is relevant, so I will try to cover what I
> > think is relevant based on posts I have read through:
> >
> > Cluster:
> > running ceph-0.94.1 on CentOS 7.1
> > [root@mdstest02 bababurko]$ uname -r
> > 3.10.0-229.el7.x86_64
> >
> > Here is my ceph-mds log with 'debug objecter = 10':
> >
> > https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=
>
> Ouch! Unfortunately all we can tell from this is that we're hitting
> an assertion somewhere while loading a directory fragment from disk.
>
> As Zheng says, you'll need to drill a bit deeper. If you were
> installing from packages you may find ceph-debuginfo useful. In
> addition to getting us a clearer stack trace with debug symbols,
> please also crank "debug mds" up to 20 (this is massively verbose, so
> hopefully it doesn't take too long to reproduce the issue).
>
> Hopefully this is fairly straightforward to reproduce. If it's
> something fundamentally malformed on disk then just doing a recursive
> ls on the filesystem would trigger it, at least.
>
> Cheers,
> John
Re: [ceph-users] mds server(s) crashed
Here is the backtrace from the core dump.

(gdb) bt
#0  0x7f71f5404ffb in raise () from /lib64/libpthread.so.0
#1  0x0087065d in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3  <signal handler called>
#4  0x7f71f40235d7 in raise () from /lib64/libc.so.6
#5  0x7f71f4024cc8 in abort () from /lib64/libc.so.6
#6  0x7f71f49279b5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#7  0x7f71f4925926 in ?? () from /lib64/libstdc++.so.6
#8  0x7f71f4925953 in std::terminate() () from /lib64/libstdc++.so.6
#9  0x7f71f4925b73 in __cxa_throw () from /lib64/libstdc++.so.6
#10 0x0077d0fc in operator new (num_bytes=2408) at mds/CInode.h:120
Python Exception list index out of range:
#11 CDir::_omap_fetched (this=0x90af04f8, hdrbl=..., omap=std::map with 65536 elements, want_dn="", r=<optimized out>) at mds/CDir.cc:1700
#12 0x007d7d44 in complete (r=0, this=0x502b000) at include/Context.h:65
#13 MDSIOContextBase::complete (this=0x502b000, r=0) at mds/MDSContext.cc:59
#14 0x00894818 in Finisher::finisher_thread_entry (this=0x5108698) at common/Finisher.cc:59
#15 0x7f71f53fddf5 in start_thread () from /lib64/libpthread.so.0
#16 0x7f71f40e41ad in clone () from /lib64/libc.so.6

I have also gotten a log file w/ debug mds = 20. It was 1.2GB, so I bzip2'd it w/ max compression and got it down to 75MB. I wasn't sure where to upload it, so if there is a better place to put it, please let me know.

https://mega.nz/#!5V4z3A7K!0METjVs5t3DAQAts8_TYXWrLh2FhGHcb7oC4uuhr2T8

thanks,
Bob

On Mon, Aug 10, 2015 at 8:05 PM, Yan, Zheng wrote:
> On Tue, Aug 11, 2015 at 9:21 AM, Bob Ababurko wrote:
> > I had a dual mds server configuration and have been copying data via cephfs
> > kernel module to my cluster for the past 3 weeks and just had an MDS crash
> > halting all IO.
> > Leading up to the crash, I ran a test dd that increased the
> > throughput by about 2x and stopped it, but about 10 minutes later the MDS
> > server crashed and did not fail over to the standby properly. I am using
> > an active/standby mds configuration, but neither of the mds servers will
> > stay running at this point; they crash after starting them.
> >
> > [bababurko@cephmon01 ~]$ sudo ceph -s
> >     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
> >      health HEALTH_WARN
> >             mds cluster is degraded
> >             mds cephmds02 is laggy
> >             noscrub,nodeep-scrub flag(s) set
> >      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
> >             election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
> >      mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
> >      osdmap e324: 30 osds: 30 up, 30 in
> >             flags noscrub,nodeep-scrub
> >       pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
> >             14051 GB used, 13880 GB / 27931 GB avail
> >                 2112 active+clean
> >
> > I am not sure what information is relevant, so I will try to cover what I
> > think is relevant based on posts I have read through:
> >
> > Cluster:
> > running ceph-0.94.1 on CentOS 7.1
> > [root@mdstest02 bababurko]$ uname -r
> > 3.10.0-229.el7.x86_64
> >
> > Here is my ceph-mds log with 'debug objecter = 10':
> >
> > https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=
>
> could you use gdb to check where the crash happened. (gdb
> /usr/local/bin/ceph-mds /core.x.
> maybe you need to re-compile mds with debuginfo)
>
> Yan, Zheng
>
> >
> > cat /sys/kernel/debug/ceph/*/mdsc output:
> >
> > https://www.zerobin.net/?ed238ce77b20583d#CK7Yt6yC1VgHfDee7y/CGkFh5bfyLkhwZB6i5R6N/8g=
> >
> > ceph.conf:
> >
> > https://www.zerobin.net/?62a125349aa43c92#5VH3XRR4P7zjhBHNWmTHrFYmwE0TZEig6r2EU6X1q/U=
> >
> > I have copied almost 5TB of small files to this cluster, which has taken the
> > better part of three weeks, so I am really hoping that there is a way to
> > recover from this. This is our POC cluster.
> >
> > I'm sure I have missed something relevant, as I'm just getting my mind back
> > after nearly losing it, so feel free to ask for anything to assist.
> >
> > Any help would be greatly appreciated.
> >
> > thanks,
> > Bob
Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.
Thanks John. I'll give that a try as soon as I fix an issue with my MDS servers that cropped up today.

On Mon, Aug 10, 2015 at 2:58 AM, John Spray wrote:
> On Fri, Aug 7, 2015 at 1:36 AM, Bob Ababurko wrote:
> > @John,
> >
> > Can you clarify which values would suggest that my metadata pool is too
> > slow? I have added a link that includes values for "op_active" &
> > "handle_client_request", gathered in a crude fashion but hopefully with
> > enough data to paint a picture of what is happening.
> >
> > http://pastebin.com/5zAG8VXT
>
> Dividing by the first 20s of the second sample period, you're seeing
> ~750 client metadata operations handled per second, which is kind of a
> baseline level of performance (a little better than what I get running
> a ceph cluster locally on my workstation). That's probably
> corresponding to roughly the same number of file creates per second --
> your workload is very much a small-file one, where "files per second"
> is a much more meaningful measure than IOPS or MB/s.
>
> It does look like the kind of pattern where you've got a large clutch
> of several thousand metadata pool rados ops coming out every few
> seconds, then draining out over a few seconds. Your metadata pool
> isn't pathologically slow (it's completing at least hundreds of ops
> per second), but it is noticeable that during some periods where
> op_active is draining, handle_client_request is not incrementing --
> i.e. client metadata ops are stalling while the MDS waits for its
> RADOS operations to complete.
>
> I can't say a massive amount beyond that, other than what you'd
> already figured out -- it would be worth trying to put some faster
> storage in for your metadata pool.
>
> John
[ceph-users] mds server(s) crashed
I had a dual mds server configuration and have been copying data via cephfs kernel module to my cluster for the past 3 weeks, and just had an MDS crash halting all IO. Leading up to the crash, I ran a test dd that increased the throughput by about 2x and stopped it, but about 10 minutes later the MDS server crashed and did not fail over to the standby properly. I am using an active/standby mds configuration, but neither of the mds servers will stay running at this point; they crash after starting them.

[bababurko@cephmon01 ~]$ sudo ceph -s
    cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
     health HEALTH_WARN
            mds cluster is degraded
            mds cephmds02 is laggy
            noscrub,nodeep-scrub flag(s) set
     monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
            election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
     mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
     osdmap e324: 30 osds: 30 up, 30 in
            flags noscrub,nodeep-scrub
      pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
            14051 GB used, 13880 GB / 27931 GB avail
                2112 active+clean

I am not sure what information is relevant, so I will try to cover what I think is relevant based on posts I have read through:

Cluster:
running ceph-0.94.1 on CentOS 7.1
[root@mdstest02 bababurko]$ uname -r
3.10.0-229.el7.x86_64

Here is my ceph-mds log with 'debug objecter = 10':

https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=

cat /sys/kernel/debug/ceph/*/mdsc output:

https://www.zerobin.net/?ed238ce77b20583d#CK7Yt6yC1VgHfDee7y/CGkFh5bfyLkhwZB6i5R6N/8g=

ceph.conf:

https://www.zerobin.net/?62a125349aa43c92#5VH3XRR4P7zjhBHNWmTHrFYmwE0TZEig6r2EU6X1q/U=

I have copied almost 5TB of small files to this cluster, which has taken the better part of three weeks, so I am really hoping that there is a way to recover from this.
This is our POC cluster.

I'm sure I have missed something relevant, as I'm just getting my mind back after nearly losing it, so feel free to ask for anything to assist.

Any help would be greatly appreciated.

thanks,
Bob
Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.
@John,

Can you clarify which values would suggest that my metadata pool is too slow? I have added a link that includes values for "op_active" & "handle_client_request", gathered in a crude fashion but hopefully with enough data to paint a picture of what is happening.

http://pastebin.com/5zAG8VXT

thanks in advance,
Bob

On Thu, Aug 6, 2015 at 1:24 AM, Bob Ababurko wrote:
> I should have probably condensed my findings over the course of the day
> into one post, but I guess that's just not how I'm built.
>
> Another data point. I ran `ceph daemon mds.cephmds02 perf dump` in a
> while loop w/ a 1 second sleep, grepping out the stats John mentioned, and
> at times (~every 10-15 seconds) I see some large objecter.op_active
> values. After the high values hit, there are 5-10 seconds of zero values.
>
> "handle_client_request": 5785438,
> "op_active": 2375,
> "handle_client_request": 5785438,
> "op_active": 2444,
> "handle_client_request": 5785438,
> "op_active": 2239,
> "handle_client_request": 5785438,
> "op_active": 1648,
> "handle_client_request": 5785438,
> "op_active": 1121,
> "handle_client_request": 5785438,
> "op_active": 709,
> "handle_client_request": 5785438,
> "op_active": 235,
> "handle_client_request": 5785572,
> "op_active": 0,
> ...
>
> Should I be concerned about these "op_active" values? I see that in my
> narrow slice of output, "handle_client_request" does not increment. What
> is happening there?
>
> thanks,
> Bob
>
> On Wed, Aug 5, 2015 at 11:43 PM, Bob Ababurko wrote:
>
>> I found a way to get the stats you mentioned: mds_server.handle_client_request
>> & objecter.op_active. I can see these values when I run:
>>
>> ceph daemon mds.<id> perf dump
>>
>> I recently restarted the mds server so my stats reset, but I still have
>> something to share:
>>
>> "mds_server.handle_client_request": 4406055
>> "objecter.op_active": 0
>>
>> Should I assume that op_active might be write or read operations
>> that are queued?
I haven't been able to find anything describing what >> these stats actually mean so if anyone knows where to find them, please >> advise. >> >> On Wed, Aug 5, 2015 at 4:59 PM, Bob Ababurko wrote: >> >>> I have installed diamond(built by ksingh found at >>> https://github.com/ksingh7/ceph-calamari-packages) on the MDS node and >>> I am not seeing the mds_server.handle_client_request OR objecter.op_active >>> metrics being sent to graphite. Mind you, this is not the graphite that is >>> part of the calamari install but our own internal graphite cluster. >>> Perhaps that is the reason? I could not get calamari working correctly on >>> hammerhead/centos7.1 so I put it on pause for now to concentrate on the >>> cluster itself. >>> >>> Ultimately, I need to find a way to get a hold of these metrics to >>> determine the health of my MDS so I can justify moving forward on a SSD >>> based cephfs metadata pool. >>> >>> On Wed, Aug 5, 2015 at 4:05 PM, Bob Ababurko wrote: >>> >>>> Hi John, >>>> >>>> You are correct in that my expectations may be incongruent with what is >>>> possible with ceph(fs). I'm currently copying many small files(images) >>>> from a netapp to the cluster...~35k sized files to be exact and the number >>>> of objects/files copied thus far is fairly significant(below in bold): >>>> >>>> [bababurko@cephmon01 ceph]$ sudo rados df >>>> pool name KB objects clones degraded >>>> unfound rdrd KB wrwr KB >>>> cephfs_data 3289284749*163993660*00 >>>> 000328097038 3369847354 >>>> cephfs_metadata 133364 52436300 >>>> 0 3600023 5264453980 9564 1361554516 >>>> rbd0000 >>>> 00000 >>>> total used 9297615196164518023 >>>> total avail19990923044 >>>> total space292885382
Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.
I should have probably condensed my findings over the course of the day into one post, but I guess that's just not how I'm built.

Another data point. I ran `ceph daemon mds.cephmds02 perf dump` in a while loop w/ a 1 second sleep, grepping out the stats John mentioned, and at times (~every 10-15 seconds) I see some large objecter.op_active values. After the high values hit, there are 5-10 seconds of zero values.

"handle_client_request": 5785438,
"op_active": 2375,
"handle_client_request": 5785438,
"op_active": 2444,
"handle_client_request": 5785438,
"op_active": 2239,
"handle_client_request": 5785438,
"op_active": 1648,
"handle_client_request": 5785438,
"op_active": 1121,
"handle_client_request": 5785438,
"op_active": 709,
"handle_client_request": 5785438,
"op_active": 235,
"handle_client_request": 5785572,
"op_active": 0,
...

Should I be concerned about these "op_active" values? I see that in my narrow slice of output, "handle_client_request" does not increment. What is happening there?

thanks,
Bob

On Wed, Aug 5, 2015 at 11:43 PM, Bob Ababurko wrote:
> I found a way to get the stats you mentioned: mds_server.handle_client_request
> & objecter.op_active. I can see these values when I run:
>
> ceph daemon mds.<id> perf dump
>
> I recently restarted the mds server so my stats reset, but I still have
> something to share:
>
> "mds_server.handle_client_request": 4406055
> "objecter.op_active": 0
>
> Should I assume that op_active might be write or read operations that
> are queued? I haven't been able to find anything describing what these
> stats actually mean, so if anyone knows where to find them, please advise.
>
> On Wed, Aug 5, 2015 at 4:59 PM, Bob Ababurko wrote:
>
>> I have installed diamond (built by ksingh, found at
>> https://github.com/ksingh7/ceph-calamari-packages) on the MDS node and I
>> am not seeing the mds_server.handle_client_request OR objecter.op_active
>> metrics being sent to graphite.
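The while/grep loop above can be made less crude by parsing the perf dump JSON directly. A sketch against a captured sample; the JSON shape matches the nesting of `ceph daemon mds.<id> perf dump` output, but the counter values below are made up for illustration:

```python
import json

# made-up sample of the two counters from `ceph daemon mds.<id> perf dump`
sample = json.loads("""
{
  "mds_server": {"handle_client_request": 5785438},
  "objecter":   {"op_active": 2375}
}
""")

def mds_stats(dump):
    """Pull the two counters John suggested watching: cumulative client
    requests handled, and RADOS ops currently in flight."""
    return (dump["mds_server"]["handle_client_request"],
            dump["objecter"]["op_active"])

reqs, active = mds_stats(sample)
print(reqs, active)  # 5785438 2375
```

In a real loop you would pipe each dump through this and watch whether handle_client_request stalls while op_active drains, which is the pattern John analyzes below.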
Mind you, this is not the graphite that is >> part of the calamari install but our own internal graphite cluster. >> Perhaps that is the reason? I could not get calamari working correctly on >> hammerhead/centos7.1 so I put it on pause for now to concentrate on the >> cluster itself. >> >> Ultimately, I need to find a way to get a hold of these metrics to >> determine the health of my MDS so I can justify moving forward on a SSD >> based cephfs metadata pool. >> >> On Wed, Aug 5, 2015 at 4:05 PM, Bob Ababurko wrote: >> >>> Hi John, >>> >>> You are correct in that my expectations may be incongruent with what is >>> possible with ceph(fs). I'm currently copying many small files(images) >>> from a netapp to the cluster...~35k sized files to be exact and the number >>> of objects/files copied thus far is fairly significant(below in bold): >>> >>> [bababurko@cephmon01 ceph]$ sudo rados df >>> pool name KB objects clones degraded >>> unfound rdrd KB wrwr KB >>> cephfs_data 3289284749*163993660*00 >>> 000328097038 3369847354 >>> cephfs_metadata 133364 52436300 >>> 0 3600023 5264453980 9564 1361554516 >>> rbd0000 >>> 00000 >>> total used 9297615196164518023 >>> total avail19990923044 >>> total space29288538240 >>> >>> Yes, that looks like ~164 million objects copied to the cluster. I >>> would assume this will potentially be a burden to the MDS but I have yet to >>> confirm with the ceph daemontool mds.. I cannot seem to run it on the >>> mds host as it doesn't seem to know about that command: >>> >>> [bababurko@cephmds01]$ sudo ceph daemonperf mds.cephmds01 >>> no valid command found; 10 closest matches: >>> osd lost {--yes-i-really-mean-it} >>> osd create {} >>> osd primary-temp >>> osd primary-affinity >>> osd reweight >>> osd pg-temp { [...]} >>> osd in [...] >>> osd rm [...] >>> osd down [...] >>> osd out [...] >>&g
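For what it's worth, the grep loop above can be replaced with a small parser so the two counters can be diffed or graphed directly. A minimal sketch, assuming the JSON shape of a hammer-era perf dump; the SAMPLE document here is fabricated for illustration (real output of ceph daemon mds.<id> perf dump has many more sections):

```python
import json

# Minimal parser for the two counters discussed in this thread. SAMPLE is a
# fabricated stand-in for `ceph daemon mds.<id> perf dump` output, which is
# JSON with one section per subsystem (real dumps carry many more keys).
SAMPLE = """
{
  "mds_server": {"handle_client_request": 5785438},
  "objecter": {"op_active": 2375}
}
"""

def extract_counters(perf_dump_text):
    """Return (handle_client_request, op_active) from a perf dump document."""
    doc = json.loads(perf_dump_text)
    return (doc["mds_server"]["handle_client_request"],
            doc["objecter"]["op_active"])

if __name__ == "__main__":
    reqs, active = extract_counters(SAMPLE)
    # Sample once per second (time.sleep(1)) to watch trends; a rising
    # op_active with a flat handle_client_request suggests RADOS ops backing up.
    print(reqs, active)
```

handle_client_request is a cumulative counter, so the interesting signal is its delta between samples; op_active is a gauge of in-flight RADOS operations.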
Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.
I found a way to get the stats you mentioned: mds_server.handle_client_request & objecter.op_active. I can see these values when I run:

ceph daemon mds. perf dump

I recently restarted the mds server so my stats reset, but I still have something to share:

"mds_server.handle_client_request": 4406055
"objecter.op_active": 0

Should I assume that op_active might be operations, in writes or reads, that are queued? I haven't been able to find anything describing what these stats actually mean, so if anyone knows where to find them, please advise.

On Wed, Aug 5, 2015 at 4:59 PM, Bob Ababurko wrote:

> I have installed diamond (built by ksingh, found at https://github.com/ksingh7/ceph-calamari-packages) on the MDS node and I am not seeing the mds_server.handle_client_request OR objecter.op_active metrics being sent to graphite. Mind you, this is not the graphite that is part of the calamari install but our own internal graphite cluster. Perhaps that is the reason? I could not get calamari working correctly on hammer/centos7.1, so I put it on pause for now to concentrate on the cluster itself.
>
> Ultimately, I need to find a way to get a hold of these metrics to determine the health of my MDS so I can justify moving forward on an SSD-based cephfs metadata pool.
>
> On Wed, Aug 5, 2015 at 4:05 PM, Bob Ababurko wrote:
>
>> Hi John,
>>
>> You are correct in that my expectations may be incongruent with what is possible with ceph(fs). I'm currently copying many small files (images) from a netapp to the cluster... ~35k-sized files to be exact, and the number of objects/files copied thus far is fairly significant (below in bold):
>>
>> [bababurko@cephmon01 ceph]$ sudo rados df
>> pool name KB objects clones degraded unfound rd rd KB wr wr KB
>> cephfs_data 3289284749 *163993660* 0 0 0 0 0 328097038 3369847354
>> cephfs_metadata 133364 52436300 0 3600023 5264453980 9564 1361554516
>> rbd 0 0 0 0 0 0 0 0 0
>> total used 9297615196 164518023
>> total avail 19990923044
>> total space 29288538240
>>
>> Yes, that looks like ~164 million objects copied to the cluster. I would assume this will potentially be a burden to the MDS, but I have yet to confirm with the ceph daemonperf tool. I cannot seem to run it on the mds host, as it doesn't seem to know about that command:
>>
>> [bababurko@cephmds01]$ sudo ceph daemonperf mds.cephmds01
>> no valid command found; 10 closest matches:
>> osd lost {--yes-i-really-mean-it}
>> osd create {}
>> osd primary-temp
>> osd primary-affinity
>> osd reweight
>> osd pg-temp { [...]}
>> osd in [...]
>> osd rm [...]
>> osd down [...]
>> osd out [...]
>> Error EINVAL: invalid command
>>
>> This fails in a similar manner on all the hosts in the cluster. I'm very green w/ ceph, and I'm probably missing something obvious. Is there something I need to install to get access to the 'ceph daemonperf' command in hammer?
>>
>> thanks,
>> Bob
>>
>> On Wed, Aug 5, 2015 at 2:43 AM, John Spray wrote:
>>
>>> On Tue, Aug 4, 2015 at 10:36 PM, Bob Ababurko wrote:
>>> > My writes are not going as I would expect wrt IOPS (50-1000 IOPS) & write throughput (~25MB/s max). I'm interested in understanding what it takes to create an SSD pool that I can then migrate the current cephfs_metadata pool to. I suspect that the spinning disk metadata pool is a bottleneck, and I want to try to get the max performance out of this cluster to prove that we would build out a larger version. One caveat is that I have copied about 4 TB of data to the cluster via cephfs and don't want to lose the data, so I obviously need to keep the metadata intact.
>>>
>>> I'm a bit suspicious of this: your IOPS expectations sort of imply doing big files, but you're then suggesting that metadata is the bottleneck (i.e. small file workload).
>>>
>>> There are lots of statistics that come out of the MDS; you may be particularly interested in mds_server.
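As a sanity check on the rados df figures quoted in this thread, the cephfs_data row implies an average object size of roughly 20 KB, which squares with a small-file workload being metadata-bound rather than throughput-bound. A quick worked check (numbers copied from the cephfs_data row; KB-to-TiB conversion uses binary units):

```python
# Arithmetic on the cephfs_data row of the rados df output quoted above.
data_kb = 3289284749       # KB column for cephfs_data
data_objects = 163993660   # objects column for cephfs_data

avg_kb_per_object = data_kb / data_objects   # average RADOS object size, in KB
total_tib = data_kb / 1024.0 ** 3            # KB -> TiB

print(round(avg_kb_per_object, 1))  # -> 20.1
print(round(total_tib, 2))          # -> 3.06
```

So about 3 TiB copied so far of the ~4 TB total, at ~20 KB per object on average: each small file lands in its own RADOS object well under the 4 MB default object size.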
Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.
I have installed diamond (built by ksingh, found at https://github.com/ksingh7/ceph-calamari-packages) on the MDS node and I am not seeing the mds_server.handle_client_request OR objecter.op_active metrics being sent to graphite. Mind you, this is not the graphite that is part of the calamari install but our own internal graphite cluster. Perhaps that is the reason? I could not get calamari working correctly on hammer/centos7.1, so I put it on pause for now to concentrate on the cluster itself.

Ultimately, I need to find a way to get a hold of these metrics to determine the health of my MDS so I can justify moving forward on an SSD-based cephfs metadata pool.

On Wed, Aug 5, 2015 at 4:05 PM, Bob Ababurko wrote:

> Hi John,
>
> You are correct in that my expectations may be incongruent with what is possible with ceph(fs). I'm currently copying many small files (images) from a netapp to the cluster... ~35k-sized files to be exact, and the number of objects/files copied thus far is fairly significant (below in bold):
>
> [bababurko@cephmon01 ceph]$ sudo rados df
> pool name KB objects clones degraded unfound rd rd KB wr wr KB
> cephfs_data 3289284749 *163993660* 0 0 0 0 0 328097038 3369847354
> cephfs_metadata 133364 52436300 0 3600023 5264453980 9564 1361554516
> rbd 0 0 0 0 0 0 0 0 0
> total used 9297615196 164518023
> total avail 19990923044
> total space 29288538240
>
> Yes, that looks like ~164 million objects copied to the cluster. I would assume this will potentially be a burden to the MDS, but I have yet to confirm with the ceph daemonperf tool. I cannot seem to run it on the mds host, as it doesn't seem to know about that command:
>
> [bababurko@cephmds01]$ sudo ceph daemonperf mds.cephmds01
> no valid command found; 10 closest matches:
> osd lost {--yes-i-really-mean-it}
> osd create {}
> osd primary-temp
> osd primary-affinity
> osd reweight
> osd pg-temp { [...]}
> osd in [...]
> osd rm [...]
> osd down [...]
> osd out [...]
> Error EINVAL: invalid command
>
> This fails in a similar manner on all the hosts in the cluster. I'm very green w/ ceph, and I'm probably missing something obvious. Is there something I need to install to get access to the 'ceph daemonperf' command in hammer?
>
> thanks,
> Bob
>
> On Wed, Aug 5, 2015 at 2:43 AM, John Spray wrote:
>
>> On Tue, Aug 4, 2015 at 10:36 PM, Bob Ababurko wrote:
>> > My writes are not going as I would expect wrt IOPS (50-1000 IOPS) & write throughput (~25MB/s max). I'm interested in understanding what it takes to create an SSD pool that I can then migrate the current cephfs_metadata pool to. I suspect that the spinning disk metadata pool is a bottleneck, and I want to try to get the max performance out of this cluster to prove that we would build out a larger version. One caveat is that I have copied about 4 TB of data to the cluster via cephfs and don't want to lose the data, so I obviously need to keep the metadata intact.
>>
>> I'm a bit suspicious of this: your IOPS expectations sort of imply doing big files, but you're then suggesting that metadata is the bottleneck (i.e. small file workload).
>>
>> There are lots of statistics that come out of the MDS; you may be particularly interested in mds_server.handle_client_request and objecter.op_active, to work out if there really are lots of RADOS operations getting backed up on the MDS (which would be the symptom of a too-slow metadata pool). "ceph daemonperf mds." may be some help if you don't already have graphite or similar set up.
>>
>> > If anyone has done this OR understands how this can be done, I would appreciate the advice.
>>
>> You could potentially do this in a two-phase process where you initially set a crush rule that includes both SSDs and spinners, and then finally set a crush rule that just points to SSDs. Obviously that'll do lots of data movement, but your metadata is probably a fair bit smaller than your data, so that might be acceptable.
>>
>> John

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.
Hi John,

You are correct in that my expectations may be incongruent with what is possible with ceph(fs). I'm currently copying many small files (images) from a netapp to the cluster... ~35k-sized files to be exact, and the number of objects/files copied thus far is fairly significant (below in bold):

[bababurko@cephmon01 ceph]$ sudo rados df
pool name KB objects clones degraded unfound rd rd KB wr wr KB
cephfs_data 3289284749 *163993660* 0 0 0 0 0 328097038 3369847354
cephfs_metadata 133364 52436300 0 3600023 5264453980 9564 1361554516
rbd 0 0 0 0 0 0 0 0 0
total used 9297615196 164518023
total avail 19990923044
total space 29288538240

Yes, that looks like ~164 million objects copied to the cluster. I would assume this will potentially be a burden to the MDS, but I have yet to confirm with the ceph daemonperf tool. I cannot seem to run it on the mds host, as it doesn't seem to know about that command:

[bababurko@cephmds01]$ sudo ceph daemonperf mds.cephmds01
no valid command found; 10 closest matches:
osd lost {--yes-i-really-mean-it}
osd create {}
osd primary-temp
osd primary-affinity
osd reweight
osd pg-temp { [...]}
osd in [...]
osd rm [...]
osd down [...]
osd out [...]
Error EINVAL: invalid command

This fails in a similar manner on all the hosts in the cluster. I'm very green w/ ceph, and I'm probably missing something obvious. Is there something I need to install to get access to the 'ceph daemonperf' command in hammer?

thanks,
Bob

On Wed, Aug 5, 2015 at 2:43 AM, John Spray wrote:

> On Tue, Aug 4, 2015 at 10:36 PM, Bob Ababurko wrote:
> > My writes are not going as I would expect wrt IOPS (50-1000 IOPS) & write throughput (~25MB/s max). I'm interested in understanding what it takes to create an SSD pool that I can then migrate the current cephfs_metadata pool to. I suspect that the spinning disk metadata pool is a bottleneck, and I want to try to get the max performance out of this cluster to prove that we would build out a larger version. One caveat is that I have copied about 4 TB of data to the cluster via cephfs and don't want to lose the data, so I obviously need to keep the metadata intact.
>
> I'm a bit suspicious of this: your IOPS expectations sort of imply doing big files, but you're then suggesting that metadata is the bottleneck (i.e. small file workload).
>
> There are lots of statistics that come out of the MDS; you may be particularly interested in mds_server.handle_client_request and objecter.op_active, to work out if there really are lots of RADOS operations getting backed up on the MDS (which would be the symptom of a too-slow metadata pool). "ceph daemonperf mds." may be some help if you don't already have graphite or similar set up.
>
> > If anyone has done this OR understands how this can be done, I would appreciate the advice.
>
> You could potentially do this in a two-phase process where you initially set a crush rule that includes both SSDs and spinners, and then finally set a crush rule that just points to SSDs. Obviously that'll do lots of data movement, but your metadata is probably a fair bit smaller than your data, so that might be acceptable.
>
> John
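To make the two-phase CRUSH approach concrete, here is a plan-only sketch of the command sequence. Everything in it is hypothetical: the rule names, the assumed "mixed" and "ssd" CRUSH roots, and the rule IDs (1 and 2) would all come from your own CRUSH map, and the crush_ruleset pool property is the hammer-era spelling. Nothing below talks to a cluster; it just lays out the steps in order so they can be reviewed.

```python
# Plan-only sketch of the two-phase CRUSH migration described above.
# Rule names, CRUSH roots, and rule IDs are hypothetical placeholders;
# nothing here contacts a cluster.

def build_migration_plan(pool="cephfs_metadata"):
    """Return the ceph CLI steps, in order, as plain strings."""
    return [
        # Phase 1: move the pool to a rule that spans spinners *and* SSDs,
        # so replicas can drain onto the SSDs while the pool stays available.
        "ceph osd crush rule create-simple mixed-rule mixed-root host",
        "ceph osd pool set %s crush_ruleset 1" % pool,
        # (wait here for backfill to finish and HEALTH_OK)
        # Phase 2: restrict the pool to SSDs only.
        "ceph osd crush rule create-simple ssd-rule ssd-root host",
        "ceph osd pool set %s crush_ruleset 2" % pool,
    ]

if __name__ == "__main__":
    for step in build_migration_plan():
        print(step)
```

The intermediate mixed rule is what keeps the metadata pool readable throughout: at no point does every replica of a PG have to move at once.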
Re: [ceph-users] migrating cephfs metadata pool from spinning disk to SSD.
I will dig into the network and determine if we have any issues. One thing to note is our MTU is 1500 and will not be changed for this test. Simply put, I am not going to be able to get these changes implemented in our current network. I don't expect a huge increase in performance by moving to jumbo frames, and I suspect it's not necessarily worth it for a POC and not the reason my cluster performance is sucking so badly at this particular moment.

One other thing I wanted to get clarity on was your rbd perf (dd) tests. I was under the impression that rbd devices are striped across all of the OSDs, whereas when writing via objects and files, the object would be getting written to a single disk. If my understanding is true, a dd would yield significantly better results (throughput/IOPS) for an rbd vs a file OR object. Please let me know if I am missing something.

thank you.

On Tue, Aug 4, 2015 at 2:53 PM, Shane Gibson wrote:

> Bob,
>
> Those numbers would seem to indicate some other problem. One of the biggest culprits of that poor performance is often related to network issues. In the last few months, there have been several reported issues of performance that have turned out to be network. Not all, but most. Your best bet is to check each host interface's statistics for errors. Make sure you have a match on the MTU size (jumbo frames settings on the host and on your switches). Check your switches for network errors. Try extended-size ping checks between nodes; ensure you set the packet size close to your max MTU size and check that you're getting good performance from *all nodes* to every other node. Last, try a network performance test to each of the OSD nodes and see if one of them is acting up.
>
> If you are backing your journal on SSD, you DEFINITELY should be getting vastly better performance than that. I have a cluster with 6 OSD nodes w/ 10x 4TB OSDs, using 2 7200 rpm disks as the journal (12 disks total). NO SSDs in that configuration. I can push the cluster to about 650 MByte/sec via network RBD 'dd' tests, and get about 2500 IOPS. NOTE - this is an all-spinning HDD cluster w/ 7200 rpm disks!
>
> ~~shane
>
> On 8/4/15, 2:36 PM, "ceph-users on behalf of Bob Ababurko" < ceph-users-boun...@lists.ceph.com on behalf of b...@ababurko.net> wrote:
>
> I have my first ceph cluster up and running and am currently testing cephfs for file access. It turns out I am not getting excellent write performance on my cluster via cephfs (kernel driver) and would like to explore moving my cephfs_metadata pool to SSD.
>
> To quickly describe the cluster:
>
> all nodes run Centos 7.1 w/ ceph-0.94.1 (hammer)
> [bababurko@cephosd01 ~]$ uname -r
> 3.10.0-229.el7.x86_64
> [bababurko@cephosd01 ~]$ cat /etc/redhat-release
> CentOS Linux release 7.1.1503 (Core)
>
> 6 OSD nodes w/ 5 x 1TB (7200 rpm, don't have the model handy) SATA & 1 TB SSD (850 Pro), which includes a journal (5GB) for each of the 5 OSDs, so there is much space left on the SSD to create a partition for an SSD pool... at least 900GB per SSD. Also noteworthy is that these disks are behind a raid controller (LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2) with each disk configured as raid 0.
> 3 MON nodes
> 1 MDS node
>
> My writes are not going as I would expect wrt IOPS (50-1000 IOPS) & write throughput (~25MB/s max). I'm interested in understanding what it takes to create an SSD pool that I can then migrate the current cephfs_metadata pool to. I suspect that the spinning disk metadata pool is a bottleneck, and I want to try to get the max performance out of this cluster to prove that we would build out a larger version. One caveat is that I have copied about 4 TB of data to the cluster via cephfs and don't want to lose the data, so I obviously need to keep the metadata intact.
>
> If anyone has done this OR understands how this can be done, I would appreciate the advice.
>
> thanks in advance,
> Bob
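On the striping question in the message above: an RBD image is chunked into fixed-size RADOS objects (4 MB by default), so even a single sequential dd fans out across many objects, and therefore many PGs and OSDs, while each small cephfs file lands in a single object on one primary OSD. A rough illustration of the arithmetic, assuming the default 4 MB object size:

```python
# Why a dd to an RBD device spreads load: the image is striped over 4 MB
# RADOS objects by default, so a large sequential write touches many
# distinct objects, each mapped by CRUSH to its own PG / OSD set.
OBJECT_SIZE = 4 * 1024 * 1024  # default RBD/cephfs object size, in bytes

def objects_touched(write_bytes, object_size=OBJECT_SIZE):
    """Number of distinct RADOS objects a sequential write of this size hits."""
    return (write_bytes + object_size - 1) // object_size

print(objects_touched(1024 ** 3))  # a 1 GiB dd -> 256 objects
print(objects_touched(35 * 1024))  # a ~35 KB file -> 1 object
```

So Bob's intuition is roughly right for large sequential I/O; for the ~35k small-file workload in this thread, each write still hits only one object's primary OSD, plus the MDS for the metadata update.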
[ceph-users] migrating cephfs metadata pool from spinning disk to SSD.
I have my first ceph cluster up and running and am currently testing cephfs for file access. It turns out I am not getting excellent write performance on my cluster via cephfs (kernel driver) and would like to explore moving my cephfs_metadata pool to SSD.

To quickly describe the cluster:

all nodes run Centos 7.1 w/ ceph-0.94.1 (hammer)
[bababurko@cephosd01 ~]$ uname -r
3.10.0-229.el7.x86_64
[bababurko@cephosd01 ~]$ cat /etc/redhat-release
CentOS Linux release 7.1.1503 (Core)

6 OSD nodes w/ 5 x 1TB (7200 rpm, don't have the model handy) SATA & 1 TB SSD (850 Pro), which includes a journal (5GB) for each of the 5 OSDs, so there is much space left on the SSD to create a partition for an SSD pool... at least 900GB per SSD. Also noteworthy is that these disks are behind a raid controller (LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2) with each disk configured as raid 0.
3 MON nodes
1 MDS node

My writes are not going as I would expect wrt IOPS (50-1000 IOPS) & write throughput (~25MB/s max). I'm interested in understanding what it takes to create an SSD pool that I can then migrate the current cephfs_metadata pool to. I suspect that the spinning disk metadata pool is a bottleneck, and I want to try to get the max performance out of this cluster to prove that we would build out a larger version. One caveat is that I have copied about 4 TB of data to the cluster via cephfs and don't want to lose the data, so I obviously need to keep the metadata intact.

If anyone has done this OR understands how this can be done, I would appreciate the advice.

thanks in advance,
Bob