Re: [ceph-users] Understanding incomplete PGs
On Friday, July 5, 2019 11:50:44 AM CDT Paul Emmerich wrote:

> * There are virtually no use cases for ec pools with m=1, this is a bad
> configuration as you can't have both availability and durability

I'll have to look into this more. The cluster only has 4 hosts, so it might
be worth switching to osd failure domain for the EC pools and using k=5,m=2.

> * Due to weird internal restrictions ec pools below their min size can't
> recover, you'll probably have to reduce min_size temporarily to recover it

Lowering min_size to 2 did allow it to recover.

> * Depending on your version it might be necessary to restart some of the
> OSDs due to a bug (fixed by now) that caused it to mark some objects as
> degraded if you remove or restart an OSD while you have remapped objects
>
> * run "ceph osd safe-to-destroy X" to check if it's safe to destroy a given
> OSD

Excellent, thanks!

> > [...]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
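The arithmetic behind Paul's and Caspar's point can be sketched in a few lines (illustrative Python; `ec_pool_status` is a made-up helper, not a Ceph API - the only facts taken from the thread are that EC min_size defaults to k+1 and that a PG below min_size cannot serve I/O):

```python
# Why k=2,m=1 EC pools go "incomplete" when a single OSD is lost:
# the data is still decodable, but the acting set drops below the
# default min_size of k+1, so the PG cannot serve I/O (or recover).
def ec_pool_status(k: int, m: int, failed: int):
    """Return (data_survives, serves_io_at_default_min_size)."""
    shards_alive = k + m - failed
    data_survives = shards_alive >= k       # enough shards to decode
    serves_io = shards_alive >= k + 1       # default min_size = k+1
    return data_survives, serves_io

# k=2,m=1 (the poster's pools): one lost OSD keeps the data decodable
# but leaves every affected PG below min_size, hence "incomplete".
print(ec_pool_status(2, 1, failed=1))   # (True, False)

# k=5,m=2 (the suggested alternative): one failure still serves I/O.
print(ec_pool_status(5, 2, failed=1))   # (True, True)
```

This is also why temporarily setting min_size = k is dangerous: the pool serves I/O with zero remaining redundancy, so one more failure loses data outright.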
Re: [ceph-users] Understanding incomplete PGs
On Friday, July 5, 2019 11:28:32 AM CDT Caspar Smit wrote:

> Kyle,
>
> Was the cluster still backfilling when you removed osd 6 or did you only
> check its utilization?

Yes, still backfilling.

> Running an EC pool with m=1 is a bad idea. EC pool min_size = k+1 so losing
> a single OSD results in inaccessible data.
> Your incomplete PG's are probably all EC pool pgs, please verify.

Yes, also correct.

> If the above statement is true, you could *temporarily* set min_size to 2
> (on your EC pools) to get back access to your data again but this is a very
> dangerous action. Losing another OSD during this period results in actual
> data loss.

This resolved the issue. I had seen reducing min_size mentioned elsewhere,
but for some reason I thought that applied only to replicated pools. Thank
you!

> Kind regards,
> Caspar Smit
>
> On Fri, Jul 5, 2019 at 01:17, Kyle wrote:
>
> > [...]
[ceph-users] Understanding incomplete PGs
Hello,

I'm working with a small ceph cluster (about 10TB, 7-9 OSDs, all Bluestore
on lvm) and recently ran into a problem with 17 pgs marked as incomplete
after adding/removing OSDs.

Here's the sequence of events:
1. 7 osds in the cluster, health is OK, all pgs are active+clean
2. 3 new osds on a new host are added, lots of backfilling in progress
3. osd 6 needs to be removed, so we do "ceph osd crush reweight osd.6 0"
4. after a few hours we see "min osd.6 with 0 pgs" from "ceph osd
utilization"
5. ceph osd out 6
6. systemctl stop ceph-osd@6
7. the drive backing osd 6 is pulled and wiped
8. backfilling has now finished; all pgs are active+clean except for 17
incomplete pgs

From reading the docs, it sounds like there has been unrecoverable data
loss in those 17 pgs. That raises some questions for me:

Was "ceph osd utilization" only showing a goal of 0 pgs allocated instead
of the current actual allocation?

Why is there data loss from a single osd being removed? Shouldn't that be
recoverable? All pools in the cluster are either replicated 3 or
erasure-coded k=2,m=1 with default "host" failure domain. They shouldn't
suffer data loss with a single osd being removed even if there were no
reweighting beforehand. Does the backfilling temporarily reduce data
durability in some way?

Is there a way to see which pgs actually have data on a given osd?

I attached an example of one of the incomplete pgs.
Thanks for any help, Kyle{ "state": "incomplete", "snap_trimq": "[]", "snap_trimq_len": 0, "epoch": 2087, "up": [ 4, 3, 8 ], "acting": [ 4, 3, 8 ], "info": { "pgid": "15.59s0", "last_update": "753'7465", "last_complete": "753'7465", "log_tail": "663'4401", "last_user_version": 6947, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": [], "history": { "epoch_created": 603, "epoch_pool_created": 603, "last_epoch_started": 1581, "last_interval_started": 1580, "last_epoch_clean": 945, "last_interval_clean": 944, "last_epoch_split": 0, "last_epoch_marked_full": 0, "same_up_since": 2082, "same_interval_since": 2082, "same_primary_since": 2076, "last_scrub": "753'7465", "last_scrub_stamp": "2019-07-02 13:40:58.935208", "last_deep_scrub": "0'0", "last_deep_scrub_stamp": "2019-06-27 17:42:04.685790", "last_clean_scrub_stamp": "2019-07-02 13:40:58.935208" }, "stats": { "version": "753'7465", "reported_seq": "12691", "reported_epoch": "2087", "state": "incomplete", "last_fresh": "2019-07-04 14:30:47.930190", "last_change": "2019-07-04 14:30:47.930190", "last_active": "2019-07-03 13:04:00.967354", "last_peered": "2019-07-03 13:02:40.242867", "last_clean": "2019-07-02 23:04:26.601070", "last_became_active": "2019-07-03 08:35:12.459857", "last_became_peered": "2019-07-03 08:35:12.459857", "last_unstale": "2019-07-04 14:30:47.930190", "last_undegraded": "2019-07-04 14:30:47.930190", "last_fullsized": "2019-07-04 14:30:47.930190", "mapping_epoch": 2082, "log_start": "663'4401", "ondisk_log_start": "663'4401", "created": 603, "last_epoch_clean": 945, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "753'7465", "last_scrub_stamp": "2019-07-02 13:40:58.935208", "last_deep_scrub": "0'0", "last_deep_scrub_stamp": "2019-06-27 17:42:04.685790", "last_clean_scrub_stamp": "2019-07-02 1
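On the question of which PGs have data on a given OSD: `ceph pg ls-by-osd <id>` answers this directly, and the same answer can be derived from the JSON that `ceph pg dump --format=json` emits by checking each PG's acting set. A rough sketch (the sample data below is made up, and the exact JSON layout varies somewhat between Ceph releases):

```python
import json

def pgs_on_osd(pg_dump_json: str, osd_id: int):
    """List the pgids whose acting set includes the given OSD."""
    pgs = json.loads(pg_dump_json)["pg_stats"]
    return [pg["pgid"] for pg in pgs if osd_id in pg["acting"]]

# Fabricated two-PG excerpt in the shape of `ceph pg dump --format=json`:
sample = json.dumps({"pg_stats": [
    {"pgid": "15.59", "acting": [4, 3, 8]},
    {"pgid": "15.5a", "acting": [6, 1, 2]},
]})

print(pgs_on_osd(sample, 6))   # ['15.5a']
```

Note that this reports the *current* acting set, which is exactly the subtlety the thread ran into: during backfill, "ceph osd utilization" can reflect the target mapping rather than where data still physically lives.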
Re: [ceph-users] Prioritized pool recovery
On 5/6/2019 6:37 PM, Gregory Farnum wrote:

> Hmm, I didn't know we had this functionality before. It looks to be
> changing quite a lot at the moment, so be aware this will likely require
> reconfiguring later.

Good to know, and not a problem. In any case, I'd assume it won't change
substantially for luminous, correct?

> I'm not seeing this in the luminous docs, are you sure?

You're probably right, but there are options for this in luminous:

# ceph osd pool get vm
Invalid command: missing required parameter
var([...] recovery_priority|recovery_op_priority [...])

> The source code indicates in Luminous it's 0-254. (As I said, things have
> changed, so in the current master build it seems to be -10 to 10 and
> configured a bit differently.)
>
> The 1-63 values generally apply to op priorities within the OSD, and are
> used as part of a weighted priority queue when selecting the next op to
> work on out of those available; you may have been looking at
> osd_recovery_op_priority which is on that scale and should apply to
> individual recovery messages/ops but will not work to schedule PGs
> differently.

So I was probably looking at the OSD level then.

> > Questions:
> > 1) If I have pools 1-4, what would I set these values to in order to
> > backfill pools 1, 2, 3, and then 4 in order?
>
> So if I'm reading the code right, they just need to be different weights,
> and the higher value will win when trying to get a reservation if there's
> a queue of them. (However, it's possible that lower-priority pools will
> send off requests first and get to do one or two PGs first, then the
> higher-priority pool will get to do all its work before that pool
> continues.)

Where higher is 0, or higher is 254? And what's the difference between
recovery_priority and recovery_op_priority? In reading the docs for the
OSD, _op_ is "priority set for recovery operations," and non-op is
"priority set for recovery work queue." For someone new to ceph such as
myself, this reads like the same thing at a glance. Would the recovery
operations not be a part of the work queue? And would this apply the same
for the pools?

> > 2) Assuming this is possible, how do I ensure that backfill isn't
> > prioritized over client I/O?
>
> This is an ongoing issue but I don't think the pool prioritization will
> change the existing mechanisms.

Okay, understood. Not a huge problem, I'm primarily looking for
understanding.

> > 3) Is there a command that enumerates the weights of the current
> > operations (so that I can observe what's going on)?
>
> "ceph osd pool ls detail" will include them.

Perfect! Thank you very much for the information. Once I have a little
more, I'm probably going to work towards sending a pull request in for the
docs...

--Kyle
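The reservation ordering Greg describes reduces to a simple rule: the pools only need *different* recovery_priority values, and when multiple PGs queue for a backfill reservation the higher value wins. A minimal sketch of that ordering (the pool names come from the thread; the priority values are illustrative, not taken from a live cluster):

```python
# Assumed illustrative recovery_priority assignments: higher value is
# reserved (and thus backfilled) first, per Greg's reading of the code.
pools = {
    "cephfs_metadata": 4,   # recover first
    "vm": 3,
    "storage": 2,
    "cephfs_data": 1,       # recover last
}

def backfill_order(pools: dict) -> list:
    """Order pools as the backfill reservation queue would prefer them."""
    return sorted(pools, key=pools.get, reverse=True)

print(backfill_order(pools))
# ['cephfs_metadata', 'vm', 'storage', 'cephfs_data']
```

As Greg notes, this is a preference among *queued* reservations, not a strict barrier: a lower-priority pool whose requests arrive first may still complete a PG or two before the higher-priority pool takes over.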
[ceph-users] Prioritized pool recovery
I've been running luminous / ceph-12.2.11-0.el7.x86_64 on CentOS 7 for
about a month now, and have had a few times when I've needed to recreate
the OSDs on a server. (No, I'm not planning on routinely doing this...)

What I've noticed is that ceph will generally stagger the recovery so that
the pools on the cluster finish around the same time (+/- a few hours).
What I'm hoping to do is prioritize specific pools over others, so that
ceph will recover all of pool 1 before it moves on to pool 2, for example.

In the docs, recovery_{,op}_priority both have roughly the same
description, which is "the priority set for recovery operations," as well
as a valid range of 1-63, default 5. This doesn't tell me whether a value
of 1 is considered a higher priority than 63, and it doesn't tell me how
it fits in line with other ceph operations.

Questions:
1) If I have pools 1-4, what would I set these values to in order to
backfill pools 1, 2, 3, and then 4 in order?
2) Assuming this is possible, how do I ensure that backfill isn't
prioritized over client I/O?
3) Is there a command that enumerates the weights of the current
operations (so that I can observe what's going on)?

For context, my pools are:
1) cephfs_metadata
2) vm (RBD pool, VM OS drives)
3) storage (RBD pool, VM data drives)
4) cephfs_data

These are sorted both by size (smallest to largest) and by criticality of
recovery (most to least). If there's a critique of this setup / a better
way of organizing this, suggestions are welcome.

Thanks,
--Kyle
Re: [ceph-users] OSD Segfaults after Bluestore conversion
I'm following up from awhile ago. I don't think this is the same bug. The
bug referenced shows "abort: Corruption: block checksum mismatch", and I'm
not seeing that on mine.

Now I've had 8 OSDs down on this one server for a couple of weeks, and I
just tried to start it back up. Here's a link to the log of that OSD
(which segfaulted right after starting up):
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.414.log

To me, it looks like the logs are providing surprisingly few hints as to
where the problem lies. Is there a way I can turn up logging to see if I
can get any more info as to why this is happening?

On Thu, Feb 8, 2018 at 3:02 AM, Mike O'Connor wrote:

> On 7/02/2018 8:23 AM, Kyle Hutson wrote:
> > [...]
>
> Ideas? Yes.
>
> There is a bug which is hitting a small number of systems and at this
> time there is no solution. Issue details at
> http://tracker.ceph.com/issues/22102.
>
> Please submit more details of your problem on the ticket.
>
> Mike
[ceph-users] OSD Segfaults after Bluestore conversion
We had a 26-node production ceph cluster which we upgraded to Luminous a
little over a month ago. I added a 27th node with Bluestore and didn't
have any issues, so I began converting the others, one at a time. The
first two went off pretty smoothly, but the 3rd is doing something
strange.

Initially, all the OSDs came up fine, but then some started to segfault.
Out of curiosity more than anything else, I did reboot the server to see
if it would get better or worse, and it pretty much stayed the same - 12
of the 18 OSDs did not properly come up. Of those, 3 again segfaulted.

I picked one that didn't properly come up and copied the log to where
anybody can view it:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log

You can contrast that with one that is up:
http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log

(which is still showing segfaults in the logs, but seems to be recovering
from them OK?)

Any ideas?
Re: [ceph-users] CephFS kernel driver is 10-15x slower than FUSE driver
I tried Fedora 25 (kernel 4.10.8-200.fc25.x86_64) with the kernel driver
and it works great. Perhaps you're on to something with the kernel
version. I didn't realize how far behind 16.04 was on this. I will give
upgrading Ubuntu 16.04 to a newer kernel the old college try.

Thanks.

On Sun, Apr 9, 2017 at 11:41 AM, Kyle Drake wrote:

> [...]
Re: [ceph-users] CephFS kernel driver is 10-15x slower than FUSE driver
On Sun, Apr 9, 2017 at 9:31 AM, John Spray wrote:

> On Sun, Apr 9, 2017 at 12:48 AM, Kyle Drake wrote:
> > [...]
> > Yikes. The kernel driver averages ~5MB and the fuse driver averages
> > ~150MBish? Something crazy is happening here. It's not caching, I ran
> > both tests fresh.
>
> What does "fresh" mean in this context? i.e. what did you do in between
> runs to reset it? Have you tried running your procedure in the reverse
> order (i.e. is the kernel client still slow when you're running it after
> the fuse client)?

I rebooted the machine and ran the same test.

I just repeated the exercise by creating two completely different test
files, one for each driver, and got the same results.

The FUSE driver has never under any circumstances been as slow, though
when I feed a lot of activity into it at once, it tends to get stuck on
something and hang for a while, so it's not a solution for me
unfortunately.

> > Ubuntu 16.04.2, 4.4.0-72-generic, ceph-fuse 10.2.6-1xenial,
> > ceph-fs-common 10.2.6-0ubuntu0.16.04.1 (I also tried the 16.04.2 one,
> > same issue).
>
> I don't know of any issues in the older kernel that you're running, but
> you should be aware that 4.4 is over a year old and as far as I know
> there is no backporting of cephfs stuff to the Ubuntu kernel, so you're
> not getting the latest fixes.

That could be related, but Ubuntu 16.04 is going to be around for a long
time, so this is probably something that needs to get addressed (unless
I'm literally the only person on the planet experiencing this bug, as it
seems to be right now). I don't know how to get on a newer kernel than
that (without potentially wrecking the distro).

I was actually about to try 14.04 to see if it does the same thing. If it
works I'll post an update.

-Kyle
[ceph-users] CephFS kernel driver is 10-15x slower than FUSE driver
Pretty much says it all. 1GB test file copy to local:

$ time cp /mnt/ceph-kernel-driver-test/test.img .

real 2m50.063s
user 0m0.000s
sys 0m9.000s

$ time cp /mnt/ceph-fuse-test/test.img .

real 0m3.648s
user 0m0.000s
sys 0m1.872s

Yikes. The kernel driver averages ~5MB/s and the fuse driver averages
~150MB/s-ish? Something crazy is happening here. It's not caching, I ran
both tests fresh.

Ubuntu 16.04.2, 4.4.0-72-generic, ceph-fuse 10.2.6-1xenial, ceph-fs-common
10.2.6-0ubuntu0.16.04.1 (I also tried the 16.04.2 one, same issue).

Anyone run into this? Did a lot of digging on the ML and didn't see
anything. I was going to use FUSE for production, but it tends to lag more
on a lot of small requests, so I had to fall back to the kernel driver.
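A back-of-envelope model makes the gap in this thread plausible: if each round trip to the OSDs costs on the order of 2 ms (the latency figure Greg Farnum quotes elsewhere on this list) and the client only requests `readahead` bytes per round trip, sequential throughput is bounded by readahead divided by latency. This is a hedged sketch, not a claim about what the 4.4 kernel actually does:

```python
# Sequential-read throughput bound: one in-flight request of
# `readahead` bytes every `io_latency` seconds.
def seq_throughput_mb_s(readahead_bytes: int, io_latency_s: float) -> float:
    return readahead_bytes / io_latency_s / 1e6

# A tiny effective readahead lands in the single-digit MB/s range the
# kernel mount shows; a few hundred KB of readahead is enough to reach
# the FUSE numbers.
print(round(seq_throughput_mb_s(8 * 1024, 0.002), 1))    # 4.1 MB/s
print(round(seq_throughput_mb_s(512 * 1024, 0.002), 1))  # 262.1 MB/s
```

In other words, if something (a client bug, a mount option, a bdi readahead setting) clamps the kernel client's effective readahead to a few pages, ~5 MB/s sequential reads are exactly what this model predicts.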
Re: [ceph-users] Default CRUSH Weight Set To 0 ?
Burkhard Linke writes:

> The default weight is the size of the OSD in tera bytes. Did you use a
> very small OSD partition for test purposes, e.g. 20 GB? In that case the
> weight is rounded and results in an effective weight of 0.0. As a result
> the OSD will not be used for data storage.
>
> Regards,
> Burkhard

Thanks Burkhard,

Yes, I used only 5 GB drives for this test. So how is this calculated in
the Infernalis release? I used the exact same setup and the CRUSH weight
turned out to be 321?
[ceph-users] Default CRUSH Weight Set To 0 ?
Hello,

I have been working on a very basic cluster with 3 nodes and a single OSD
per node. I am using Hammer installed on CentOS 7
(ceph-0.94.5-0.el7.x86_64) since it is the LTS version.

I kept running into an issue of not getting past the status of
undersized+degraded+peered. I finally discovered the problem was that in
the default CRUSH map, the weight assigned is 0. I changed the weight and
everything came up as it should. I did the same test using the Infernalis
release and everything worked as expected, as the weight had been changed
to a default of 321.

- Is this a bug or by design, and if the latter, why? Perhaps I'm missing
something?
- Has anyone else run into this?
- Am I correct in assuming a weight of 0 won't allow the OSDs to be used,
or is there some other purpose for this?

Hopefully this will help others that may run into this same situation.

Thank you.
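The behavior Burkhard describes in his reply comes down to rounding: the default CRUSH weight is the OSD's size expressed in TiB, rendered with two decimal places, so a tiny test OSD collapses to 0.00 and attracts no placement groups. A small sketch of that arithmetic (illustrative helper, not a Ceph function):

```python
# Default CRUSH weight = OSD size in TiB, shown rounded to two decimals.
def default_crush_weight(size_bytes: int) -> float:
    return round(size_bytes / 2**40, 2)

print(default_crush_weight(5 * 2**30))   # 0.0 - a 5 GiB test OSD gets no data
print(default_crush_weight(4 * 2**40))   # 4.0 - a 4 TiB OSD
```

For test clusters on tiny disks, the usual workaround is to set a nonzero weight by hand, e.g. `ceph osd crush reweight osd.0 0.05`. (The "321" default seen on Infernalis in this thread looks like a separate quirk of that release's deployment tooling rather than the size-in-TiB rule.)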
Re: [ceph-users] v9.1.0 Infernalis release candidate released
Nice! Thanks!

On Wed, Oct 14, 2015 at 1:23 PM, Sage Weil wrote:

> On Wed, 14 Oct 2015, Kyle Hutson wrote:
> > > Which bug? We want to fix hammer, too!
> >
> > This one:
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg23915.html
> >
> > (Adam sits about 5' from me.)
>
> Oh... that fix is already in the hammer branch and will be in 0.94.4.
> Since you have to go to that anyway before infernalis you may as well
> stop there (unless there is something else you want from infernalis!).
>
> sage
Re: [ceph-users] v9.1.0 Infernalis release candidate released
> Which bug? We want to fix hammer, too!

This one:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg23915.html

(Adam sits about 5' from me.)
Re: [ceph-users] v9.1.0 Infernalis release candidate released
A couple of questions related to this, especially since we have a hammer
bug that's biting us, so we're anxious to upgrade to Infernalis.

1) RE: "librbd and librados ABI compatibility is broken. Be careful
installing this RC on client machines (e.g., those running qemu). It will
be fixed in the final v9.2.0 release."

We have several qemu clients. If we upgrade the ceph servers (and not the
qemu clients), will this affect us?

2) RE: "Upgrading directly from Firefly v0.80.z is not possible. All
clusters must first upgrade to Hammer v0.94.4 or a later v0.94.z release;
only then is it possible to upgrade to Infernalis 9.2.z."

I think I understand this, but want to verify. We're on 0.94.3. Can we
upgrade to the RC 9.1.0 and then safely upgrade to 9.2.z when it is
finalized? Any foreseen issues with this upgrade path?

On Wed, Oct 14, 2015 at 7:30 AM, Sage Weil wrote:

> On Wed, 14 Oct 2015, Dan van der Ster wrote:
> > Hi Goncalo,
> >
> > On Wed, Oct 14, 2015 at 6:51 AM, Goncalo Borges wrote:
> > > Hi Sage...
> > >
> > > I've seen that the rh6 derivatives have been ruled out.
> > >
> > > This is a problem in our case since the OS choice in our systems is,
> > > somehow, imposed by CERN. The experiments software is certified for
> > > SL6 and the transition to SL7 will take some time.
> >
> > Are you accessing Ceph directly from "physics" machines? Here at CERN
> > we run CentOS 7 on the native clients (e.g. qemu-kvm hosts) and by the
> > time we upgrade to Infernalis the servers will all be CentOS 7 as
> > well. Batch nodes running SL6 don't (currently) talk to Ceph directly
> > (in the future they might talk to Ceph-based storage via an xroot
> > gateway). But if there are use-cases then perhaps we could find a
> > place to build and distribute the newer ceph clients.
> >
> > There's a ML ceph-t...@cern.ch where we could take this discussion.
> > Mail me if you have trouble joining that e-Group.
> > Also note that it *is* possible to build infernalis on el6, but it > requires a lot more effort... enough that we would rather spend our time > elsewhere (at least as far as ceph.com packages go). If someone else > wants to do that work we'd be happy to take patches to update the and/or > release process. > > IIRC the thing that eventually made me stop going down this patch was the > fact that the newer gcc had a runtime dependency on the newer libstdc++, > which wasn't part of the base distro... which means we'd need also to > publish those packages in the ceph.com repos, or users would have to > add some backport repo or ppa or whatever to get things running. Bleh. > > sage > > > > > > Cheers, Dan > > CERN IT-DSS > > > > > This is kind of a showstopper specially if we can't deploy clients in > SL6 / > > > Centos6. > > > > > > Is there any alternative? > > > > > > TIA > > > Goncalo > > > > > > > > > > > > On 10/14/2015 08:01 AM, Sage Weil wrote: > > >> > > >> This is the first Infernalis release candidate. There have been some > > >> major changes since hammer, and the upgrade process is non-trivial. > > >> Please read carefully. > > >> > > >> Getting the release candidate > > >> - > > >> > > >> The v9.1.0 packages are pushed to the development release > repositories:: > > >> > > >>http://download.ceph.com/rpm-testing > > >>http://download.ceph.com/debian-testing > > >> > > >> For for info, see:: > > >> > > >>http://docs.ceph.com/docs/master/install/get-packages/ > > >> > > >> Or install with ceph-deploy via:: > > >> > > >>ceph-deploy install --testing HOST > > >> > > >> Known issues > > >> > > >> > > >> * librbd and librados ABI compatibility is broken. Be careful > > >>installing this RC on client machines (e.g., those running qemu). > > >>It will be fixed in the final v9.2.0 release. 
> > >> > > >> Major Changes from Hammer > > >> - > > >> > > >> * *General*: > > >>* Ceph daemons are now managed via systemd (with the exception of > > >> Ubuntu Trusty, which still uses upstart). > > >>* Ceph daemons run as 'ceph' user instead root. > > >>* On Red Hat distros, there is also an SELinux policy. > > >> * *RADOS*: > > >>* The RADOS cache tier can now proxy write operations to the base > > >> tier, allowing writes to be handled without forcing migration of > > >> an object into the cache. > > >>* The SHEC erasure coding support is no longer flagged as > > >> experimental. SHEC trades some additional storage space for > faster > > >> repair. > > >>* There is now a unified queue (and thus prioritization) of client > > >> IO, recovery, scrubbing, and snapshot trimming. > > >>* There have been many improvements to low-level repair tooling > > >> (ceph-objectstore-tool). > > >>* The internal ObjectStore API has been significantly cleaned up in > > >> order > > >> to faciliate new storage backends like NewStore. > > >> * *RGW*: > > >>* The Swift API now
Re: [ceph-users] CephFS and caching
A 'rados -p cachepool ls' takes about 3 hours - not exactly useful.

I'm intrigued that you say a single read may not promote it into the
cache. My understanding is that if you have an EC-backed pool the clients
can't talk to them directly, which means they would necessarily be
promoted to the cache pool so the client could read it. Is my
understanding wrong?

I'm also wondering if it's possible to use RAM as a read-cache layer.
Obviously, we don't want this for write-cache because of power outages,
motherboard failures, etc., but it seems to make sense for a read-cache.
Is that something that's being done, can be done, is going to be done, or
has even been considered?

On Wed, Sep 9, 2015 at 10:33 AM, Gregory Farnum wrote:

> On Wed, Sep 9, 2015 at 4:26 PM, Kyle Hutson wrote:
> > On Wed, Sep 9, 2015 at 9:34 AM, Gregory Farnum wrote:
> >> On Wed, Sep 9, 2015 at 3:27 PM, Kyle Hutson wrote:
> >> > We are using Hammer - latest released version. How do I check if
> >> > it's getting promoted into the cache?
> >>
> >> Umm...that's a good question. You can run rados ls on the cache pool,
> >> but that's not exactly scalable; you can turn up logging and dig into
> >> them to see if redirects are happening, or watch the OSD operations
> >> happening via the admin socket. But I don't know if there's a good
> >> interface for users to just query the cache state of a single object.
> >> :/
> >
> > even using 'rados ls', I (naturally) get cephfs object names - is
> > there a way to see a filename -> objectname conversion ... or
> > objectname -> filename?
>
> The object name is <inode number in hex>.<object index>. So you can look
> at the file inode and then see which of its objects are actually in the
> pool.
> -Greg
>
> >> > We're using the latest ceph kernel client. Where do I poke at
> >> > readahead settings there?
> >> > >> Just the standard kernel readahead settings; I'm not actually familiar > >> with how to configure those but I don't believe Ceph's are in any way > >> special. What do you mean by "latest ceph kernel client"; are you > >> running one of the developer testing kernels or something? > > > > > > No, just what comes with the latest stock kernel. Sorry for any > confusion. > > > >> > >> I think > >> Ilya might have mentioned some issues with readahead being > >> artificially blocked, but that might have only been with RBD. > >> > >> Oh, are the files you're using sparse? There was a bug with sparse > >> files not filling in pages that just got patched yesterday or > >> something. > > > > > > No, these are not sparse files. Just really big. > > > >> > >> > > >> > On Tue, Sep 8, 2015 at 8:29 AM, Gregory Farnum > >> > wrote: > >> >> > >> >> On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson > >> >> wrote: > >> >> > I was wondering if anybody could give me some insight as to how > >> >> > CephFS > >> >> > does > >> >> > its caching - read-caching in particular. > >> >> > > >> >> > We are using CephFS with an EC pool on the backend with a > replicated > >> >> > cache > >> >> > pool in front of it. We're seeing some very slow read times. Trying > >> >> > to > >> >> > compute an md5sum on a 15GB file twice in a row (so it should be in > >> >> > cache) > >> >> > takes the time from 23 minutes down to 17 minutes, but this is > over a > >> >> > 10Gbps > >> >> > network and with a crap-ton of OSDs (over 300), so I would expect > it > >> >> > to > >> >> > be > >> >> > down in the 2-3 minute range. > >> >> > >> >> A single sequential read won't necessarily promote an object into the > >> >> cache pool (although if you're using Hammer I think it will), so you > >> >> want to check if it's actually getting promoted into the cache before > >> >> assuming that's happened. > >> >> > >> >> > > >> >> > I'm just trying to figure out what we can do to increase the > >> >> > performance. 
I > >> >> > have over 300 TB of live data that I have to be careful with, > though, > >> >> > so > >> >> > I > >> >> > have to have some level of caution. > >> >> > > >> >> > Is there some other caching we can do (client-side or server-side) > >> >> > that > >> >> > might give us a decent performance boost? > >> >> > >> >> Which client are you using for this testing? Have you looked at the > >> >> readahead settings? That's usually the big one; if you're only asking > >> >> for 4KB at once then stuff is going to be slow no matter what (a > >> >> single IO takes at minimum about 2 milliseconds right now, although > >> >> the RADOS team is working to improve that). > >> >> -Greg > >> >> > >> >> > > >> >> > ___ > >> >> > ceph-users mailing list > >> >> > ceph-users@lists.ceph.com > >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> > > >> > > >> > > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS and caching
On Wed, Sep 9, 2015 at 9:34 AM, Gregory Farnum wrote: > On Wed, Sep 9, 2015 at 3:27 PM, Kyle Hutson wrote: > > We are using Hammer - latest released version. How do I check if it's > > getting promoted into the cache? > > Umm...that's a good question. You can run rados ls on the cache pool, > but that's not exactly scalable; you can turn up logging and dig into > them to see if redirects are happening, or watch the OSD operations > happening via the admin socket. But I don't know if there's a good > interface for users to just query the cache state of a single object. > :/ > even using 'rados ls', I (naturally) get cephfs object names - is there a way to see a filename -> objectname conversion ... or objectname -> filename ? > > We're using the latest ceph kernel client. Where do I poke at readahead > > settings there? > > Just the standard kernel readahead settings; I'm not actually familiar > with how to configure those but I don't believe Ceph's are in any way > special. What do you mean by "latest ceph kernel client"; are you > running one of the developer testing kernels or something? No, just what comes with the latest stock kernel. Sorry for any confusion. > I think > Ilya might have mentioned some issues with readahead being > artificially blocked, but that might have only been with RBD. > > Oh, are the files you're using sparse? There was a bug with sparse > files not filling in pages that just got patched yesterday or > something. > No, these are not sparse files. Just really big. > > > > On Tue, Sep 8, 2015 at 8:29 AM, Gregory Farnum > wrote: > >> > >> On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson > wrote: > >> > I was wondering if anybody could give me some insight as to how CephFS > >> > does > >> > its caching - read-caching in particular. > >> > > >> > We are using CephFS with an EC pool on the backend with a replicated > >> > cache > >> > pool in front of it. We're seeing some very slow read times. 
Trying to > >> > compute an md5sum on a 15GB file twice in a row (so it should be in > >> > cache) > >> > takes the time from 23 minutes down to 17 minutes, but this is over a > >> > 10Gbps > >> > network and with a crap-ton of OSDs (over 300), so I would expect it > to > >> > be > >> > down in the 2-3 minute range. > >> > >> A single sequential read won't necessarily promote an object into the > >> cache pool (although if you're using Hammer I think it will), so you > >> want to check if it's actually getting promoted into the cache before > >> assuming that's happened. > >> > >> > > >> > I'm just trying to figure out what we can do to increase the > >> > performance. I > >> > have over 300 TB of live data that I have to be careful with, though, > so > >> > I > >> > have to have some level of caution. > >> > > >> > Is there some other caching we can do (client-side or server-side) > that > >> > might give us a decent performance boost? > >> > >> Which client are you using for this testing? Have you looked at the > >> readahead settings? That's usually the big one; if you're only asking > >> for 4KB at once then stuff is going to be slow no matter what (a > >> single IO takes at minimum about 2 milliseconds right now, although > >> the RADOS team is working to improve that). > >> -Greg > >> > >> > > >> > ___ > >> > ceph-users mailing list > >> > ceph-users@lists.ceph.com > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> > > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS and caching
We are using Hammer - latest released version. How do I check if it's getting promoted into the cache? We're using the latest ceph kernel client. Where do I poke at readahead settings there? On Tue, Sep 8, 2015 at 8:29 AM, Gregory Farnum wrote: > On Thu, Sep 3, 2015 at 11:58 PM, Kyle Hutson wrote: > > I was wondering if anybody could give me some insight as to how CephFS > does > > its caching - read-caching in particular. > > > > We are using CephFS with an EC pool on the backend with a replicated > cache > > pool in front of it. We're seeing some very slow read times. Trying to > > compute an md5sum on a 15GB file twice in a row (so it should be in > cache) > > takes the time from 23 minutes down to 17 minutes, but this is over a > 10Gbps > > network and with a crap-ton of OSDs (over 300), so I would expect it to > be > > down in the 2-3 minute range. > > A single sequential read won't necessarily promote an object into the > cache pool (although if you're using Hammer I think it will), so you > want to check if it's actually getting promoted into the cache before > assuming that's happened. > > > > > I'm just trying to figure out what we can do to increase the > performance. I > > have over 300 TB of live data that I have to be careful with, though, so > I > > have to have some level of caution. > > > > Is there some other caching we can do (client-side or server-side) that > > might give us a decent performance boost? > > Which client are you using for this testing? Have you looked at the > readahead settings? That's usually the big one; if you're only asking > for 4KB at once then stuff is going to be slow no matter what (a > single IO takes at minimum about 2 milliseconds right now, although > the RADOS team is working to improve that). 
> -Greg > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
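Greg's readahead pointer can be acted on from the kernel client's mount options. A hedged sketch — "rasize" is the kernel CephFS readahead window in bytes; the monitor address, mountpoint, and sysfs path below are assumptions, so check your kernel's mount.ceph documentation:

```shell
# Hedged sketch: raise readahead for the kernel CephFS client via the
# "rasize" mount option (bytes). Monitor address, mountpoint, and the
# sysfs path are assumptions.

RASIZE=$((64 * 1024 * 1024))   # 64 MiB readahead window

build_mount_cmd() {
  # $1 = mon address and path, $2 = mountpoint; prints the mount command
  printf 'mount -t ceph %s %s -o name=admin,rasize=%s\n' "$1" "$2" "$RASIZE"
}

build_mount_cmd 10.5.38.1:6789:/ /mnt/cephfs

# To inspect the value actually in effect on a mounted client (assumption):
#   cat /sys/class/bdi/ceph-*/read_ahead_kb
```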
[ceph-users] CephFS and caching
I was wondering if anybody could give me some insight as to how CephFS does its caching - read-caching in particular. We are using CephFS with an EC pool on the backend with a replicated cache pool in front of it. We're seeing some very slow read times. Trying to compute an md5sum on a 15GB file twice in a row (so it should be in cache) takes the time from 23 minutes down to 17 minutes, but this is over a 10Gbps network and with a crap-ton of OSDs (over 300), so I would expect it to be down in the 2-3 minute range. I'm just trying to figure out what we can do to increase the performance. I have over 300 TB of live data that I have to be careful with, though, so I have to have some level of caution. Is there some other caching we can do (client-side or server-side) that might give us a decent performance boost? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph migration to AWS
> To those interested in a tricky problem, > > We have a Ceph cluster running at one of our data centers. One of our > client's requirements is to have them hosted at AWS. My question is: How do > we effectively migrate our data on our internal Ceph cluster to an AWS Ceph > cluster? > > Ideas currently on the table: > > 1. Build OSDs at AWS and add them to our current Ceph cluster. Build quorum > at AWS then sever the connection between AWS and our data center. I would highly discourage this. > 2. Build a Ceph cluster at AWS and send snapshots from our data center to > our AWS cluster allowing us to "migrate" to AWS. This sounds far more sensible. I'd look at the I2 (iops) or D2 (density) class instances, depending on use case. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
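Option 2 above is typically done by shipping RBD snapshot diffs between the two clusters with `rbd export-diff`/`import-diff`. A sketch that only prints the pipeline so it can be reviewed first — pool, image, snapshot, and host names are invented for illustration:

```shell
# Hedged sketch of option 2: replicate an RBD image to the AWS cluster by
# shipping snapshot diffs. All names here are examples; the script prints
# the commands rather than running them.

POOL=rbd; IMAGE=vol1; FROM=snap1; TO=snap2; REMOTE=aws-ceph-gw

SYNC_CMDS="rbd snap create ${POOL}/${IMAGE}@${TO}
rbd export-diff --from-snap ${FROM} ${POOL}/${IMAGE}@${TO} - | ssh ${REMOTE} rbd import-diff - ${POOL}/${IMAGE}"

echo "$SYNC_CMDS"
```

Repeating this periodically keeps the AWS copy close to current, so the final cutover only has to ship the last small diff.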
Re: [ceph-users] mds crashing
Thank you, John! That was exactly the bug we were hitting. My Google-fu didn't lead me to this one. On Wed, Apr 15, 2015 at 4:16 PM, John Spray wrote: > On 15/04/2015 20:02, Kyle Hutson wrote: > >> I upgraded to 0.94.1 from 0.94 on Monday, and everything had been going >> pretty well. >> >> Then, about noon today, we had an mds crash. And then the failover mds >> crashed. And this cascaded through all 4 mds servers we have. >> >> If I try to start it ('service ceph start mds' on CentOS 7.1), it appears >> to be OK for a little while. ceph -w goes through 'replay' 'reconnect' >> 'rejoin' 'clientreplay' and 'active' but nearly immediately after getting >> to 'active', it crashes again. >> >> I have the mds log at http://people.beocat.cis.ksu.edu/~kylehutson/ceph-mds.hobbit01.log >> >> For the possibly, but not necessarily, useful background info. >> - Yesterday we took our erasure coded pool and increased both pg_num and >> pgp_num from 2048 to 4096. We still have several objects misplaced (~17%), >> but those seem to be continuing to clean themselves up. >> - We are in the midst of a large (300+ TB) rsync from our old (non-ceph) >> filesystem to this filesystem. >> - Before we realized the mds crashes, we had just changed the size of our >> metadata pool from 2 to 4. >> > > It looks like you're seeing http://tracker.ceph.com/issues/10449, which > is a situation where the SessionMap object becomes too big for the MDS to > save. The cause of it in that case was stuck requests from a misbehaving > client running a slightly older kernel. > > Assuming you're using the kernel client and having a similar problem, you > could try to work around this situation by forcibly unmounting the clients > while the MDS is offline, such that during clientreplay the MDS will remove > them from the SessionMap after timing out, and then next time it tries to > save the map it won't be oversized. 
If that works, you could then look > into getting newer kernels on the clients to avoid hitting the issue again > -- the #10449 ticket has some pointers about which kernel changes were > relevant. > > Cheers, > John > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] mds crashing
I upgraded to 0.94.1 from 0.94 on Monday, and everything had been going pretty well. Then, about noon today, we had an mds crash. And then the failover mds crashed. And this cascaded through all 4 mds servers we have. If I try to start it ('service ceph start mds' on CentOS 7.1), it appears to be OK for a little while. ceph -w goes through 'replay' 'reconnect' 'rejoin' 'clientreplay' and 'active' but nearly immediately after getting to 'active', it crashes again. I have the mds log at http://people.beocat.cis.ksu.edu/~kylehutson/ceph-mds.hobbit01.log For the possibly, but not necessarily, useful background info. - Yesterday we took our erasure coded pool and increased both pg_num and pgp_num from 2048 to 4096. We still have several objects misplaced (~17%), but those seem to be continuing to clean themselves up. - We are in the midst of a large (300+ TB) rsync from our old (non-ceph) filesystem to this filesystem. - Before we realized the mds crashes, we had just changed the size of our metadata pool from 2 to 4. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] "protocol feature mismatch" after upgrading to Hammer
http://people.beocat.cis.ksu.edu/~kylehutson/crushmap On Thu, Apr 9, 2015 at 11:25 AM, Gregory Farnum wrote: > Hmmm. That does look right and neither I nor Sage can come up with > anything via code inspection. Can you post the actual binary crush map > somewhere for download so that we can inspect it with our tools? > -Greg > > On Thu, Apr 9, 2015 at 7:57 AM, Kyle Hutson wrote: > > Here 'tis: > > https://dpaste.de/POr1 > > > > > > On Thu, Apr 9, 2015 at 9:49 AM, Gregory Farnum wrote: > >> > >> Can you dump your crush map and post it on pastebin or something? > >> > >> On Thu, Apr 9, 2015 at 7:26 AM, Kyle Hutson wrote: > >> > Nope - it's 64-bit. > >> > > >> > (Sorry, I missed the reply-all last time.) > >> > > >> > On Thu, Apr 9, 2015 at 9:24 AM, Gregory Farnum > wrote: > >> >> > >> >> [Re-added the list] > >> >> > >> >> Hmm, I'm checking the code and that shouldn't be possible. What's > your > >> >> ciient? (In particular, is it 32-bit? That's the only thing i can > >> >> think of that might have slipped through our QA.) > >> >> > >> >> On Thu, Apr 9, 2015 at 7:17 AM, Kyle Hutson > wrote: > >> >> > I did nothing to enable anything else. Just changed my ceph repo > from > >> >> > 'giant' to 'hammer', then did 'yum update' and restarted services. > >> >> > > >> >> > On Thu, Apr 9, 2015 at 9:15 AM, Gregory Farnum > >> >> > wrote: > >> >> >> > >> >> >> Did you enable the straw2 stuff? CRUSHV4 shouldn't be required by > >> >> >> the > >> >> >> cluster unless you made changes to the layout requiring it. > >> >> >> > >> >> >> If you did, the clients have to be upgraded to understand it. You > >> >> >> could disable all the v4 features; that should let them connect > >> >> >> again. 
> >> >> >> -Greg > >> >> >> > >> >> >> On Thu, Apr 9, 2015 at 7:07 AM, Kyle Hutson > >> >> >> wrote: > >> >> >> > This particular problem I just figured out myself ('ceph -w' was > >> >> >> > still > >> >> >> > running from before the upgrade, and ctrl-c and restarting > solved > >> >> >> > that > >> >> >> > issue), but I'm still having a similar problem on the ceph > client: > >> >> >> > > >> >> >> > libceph: mon19 10.5.38.20:6789 feature set mismatch, my > >> >> >> > 2b84a042aca < > >> >> >> > server's 102b84a042aca, missing 1 > >> >> >> > > >> >> >> > It appears that even the latest kernel doesn't have support for > >> >> >> > CEPH_FEATURE_CRUSH_V4 > >> >> >> > > >> >> >> > How do I make my ceph cluster backward-compatible with the old > >> >> >> > cephfs > >> >> >> > client? > >> >> >> > > >> >> >> > On Thu, Apr 9, 2015 at 8:58 AM, Kyle Hutson > > >> >> >> > wrote: > >> >> >> >> > >> >> >> >> I upgraded from giant to hammer yesterday and now 'ceph -w' is > >> >> >> >> constantly > >> >> >> >> repeating this message: > >> >> >> >> > >> >> >> >> 2015-04-09 08:50:26.318042 7f95dbf86700 0 -- > 10.5.38.1:0/2037478 > >> >> >> >> >> > >> >> >> >> 10.5.38.1:6789/0 pipe(0x7f95e00256e0 sd=3 :39489 s=1 pgs=0 > cs=0 > >> >> >> >> l=1 > >> >> >> >> c=0x7f95e0023670).connect protocol feature mismatch, my > >> >> >> >> 3fff > >> >> >> >> < > >> >> >> >> peer > >> >> >> >> 13fff missing 1 > >> >> >> >> > >> >> >> >> It isn't always the same IP for the destination - here's > another: > >> >> >> >> 2015-04-09 08:50:20.322059 7f95dc087700 0 -- > 10.5.38.1:0/2037478 > >> >> >> >> >> > >> >> >> >> 10.5.38.8:6789/0 pipe(0x7f95e00262f0 sd=3 :54047 s=1 pgs=0 > cs=0 > >> >> >> >> l=1 > >> >> >> >> c=0x7f95e002b480).connect protocol feature mismatch, my > >> >> >> >> 3fff > >> >> >> >> < > >> >> >> >> peer > >> >> >> >> 13fff missing 1 > >> >> >> >> > >> >> >> >> Some details about our install: > >> >> >> >> We have 24 hosts with 18 OSDs each. 
16 per host are spinning > >> >> >> >> disks > >> >> >> >> in > >> >> >> >> an > >> >> >> >> erasure coded pool (k=8 m=4). 2 OSDs per host are SSD > partitions > >> >> >> >> used > >> >> >> >> for a > >> >> >> >> caching tier in front of the EC pool. All 24 hosts are > monitors. > >> >> >> >> 4 > >> >> >> >> hosts are > >> >> >> >> mds. We are running cephfs with a client trying to write data > >> >> >> >> over > >> >> >> >> cephfs > >> >> >> >> when we're seeing these messages. > >> >> >> >> > >> >> >> >> Any ideas? > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > ___ > >> >> >> > ceph-users mailing list > >> >> >> > ceph-users@lists.ceph.com > >> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> >> > > >> >> > > >> >> > > >> > > >> > > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] "protocol feature mismatch" after upgrading to Hammer
Here 'tis: https://dpaste.de/POr1 On Thu, Apr 9, 2015 at 9:49 AM, Gregory Farnum wrote: > Can you dump your crush map and post it on pastebin or something? > > On Thu, Apr 9, 2015 at 7:26 AM, Kyle Hutson wrote: > > Nope - it's 64-bit. > > > > (Sorry, I missed the reply-all last time.) > > > > On Thu, Apr 9, 2015 at 9:24 AM, Gregory Farnum wrote: > >> > >> [Re-added the list] > >> > >> Hmm, I'm checking the code and that shouldn't be possible. What's your > >> ciient? (In particular, is it 32-bit? That's the only thing i can > >> think of that might have slipped through our QA.) > >> > >> On Thu, Apr 9, 2015 at 7:17 AM, Kyle Hutson wrote: > >> > I did nothing to enable anything else. Just changed my ceph repo from > >> > 'giant' to 'hammer', then did 'yum update' and restarted services. > >> > > >> > On Thu, Apr 9, 2015 at 9:15 AM, Gregory Farnum > wrote: > >> >> > >> >> Did you enable the straw2 stuff? CRUSHV4 shouldn't be required by the > >> >> cluster unless you made changes to the layout requiring it. > >> >> > >> >> If you did, the clients have to be upgraded to understand it. You > >> >> could disable all the v4 features; that should let them connect > again. > >> >> -Greg > >> >> > >> >> On Thu, Apr 9, 2015 at 7:07 AM, Kyle Hutson > wrote: > >> >> > This particular problem I just figured out myself ('ceph -w' was > >> >> > still > >> >> > running from before the upgrade, and ctrl-c and restarting solved > >> >> > that > >> >> > issue), but I'm still having a similar problem on the ceph client: > >> >> > > >> >> > libceph: mon19 10.5.38.20:6789 feature set mismatch, my > 2b84a042aca < > >> >> > server's 102b84a042aca, missing 1 > >> >> > > >> >> > It appears that even the latest kernel doesn't have support for > >> >> > CEPH_FEATURE_CRUSH_V4 > >> >> > > >> >> > How do I make my ceph cluster backward-compatible with the old > cephfs > >> >> > client? 
> >> >> > > >> >> > On Thu, Apr 9, 2015 at 8:58 AM, Kyle Hutson > >> >> > wrote: > >> >> >> > >> >> >> I upgraded from giant to hammer yesterday and now 'ceph -w' is > >> >> >> constantly > >> >> >> repeating this message: > >> >> >> > >> >> >> 2015-04-09 08:50:26.318042 7f95dbf86700 0 -- 10.5.38.1:0/2037478 > >> > >> >> >> 10.5.38.1:6789/0 pipe(0x7f95e00256e0 sd=3 :39489 s=1 pgs=0 cs=0 > l=1 > >> >> >> c=0x7f95e0023670).connect protocol feature mismatch, my > 3fff > >> >> >> < > >> >> >> peer > >> >> >> 13fff missing 1 > >> >> >> > >> >> >> It isn't always the same IP for the destination - here's another: > >> >> >> 2015-04-09 08:50:20.322059 7f95dc087700 0 -- 10.5.38.1:0/2037478 > >> > >> >> >> 10.5.38.8:6789/0 pipe(0x7f95e00262f0 sd=3 :54047 s=1 pgs=0 cs=0 > l=1 > >> >> >> c=0x7f95e002b480).connect protocol feature mismatch, my > 3fff > >> >> >> < > >> >> >> peer > >> >> >> 13fff missing 1 > >> >> >> > >> >> >> Some details about our install: > >> >> >> We have 24 hosts with 18 OSDs each. 16 per host are spinning disks > >> >> >> in > >> >> >> an > >> >> >> erasure coded pool (k=8 m=4). 2 OSDs per host are SSD partitions > >> >> >> used > >> >> >> for a > >> >> >> caching tier in front of the EC pool. All 24 hosts are monitors. 4 > >> >> >> hosts are > >> >> >> mds. We are running cephfs with a client trying to write data over > >> >> >> cephfs > >> >> >> when we're seeing these messages. > >> >> >> > >> >> >> Any ideas? > >> >> > > >> >> > > >> >> > > >> >> > ___ > >> >> > ceph-users mailing list > >> >> > ceph-users@lists.ceph.com > >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> > > >> > > >> > > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] "protocol feature mismatch" after upgrading to Hammer
Nope - it's 64-bit. (Sorry, I missed the reply-all last time.) On Thu, Apr 9, 2015 at 9:24 AM, Gregory Farnum wrote: > [Re-added the list] > > Hmm, I'm checking the code and that shouldn't be possible. What's your > ciient? (In particular, is it 32-bit? That's the only thing i can > think of that might have slipped through our QA.) > > On Thu, Apr 9, 2015 at 7:17 AM, Kyle Hutson wrote: > > I did nothing to enable anything else. Just changed my ceph repo from > > 'giant' to 'hammer', then did 'yum update' and restarted services. > > > > On Thu, Apr 9, 2015 at 9:15 AM, Gregory Farnum wrote: > >> > >> Did you enable the straw2 stuff? CRUSHV4 shouldn't be required by the > >> cluster unless you made changes to the layout requiring it. > >> > >> If you did, the clients have to be upgraded to understand it. You > >> could disable all the v4 features; that should let them connect again. > >> -Greg > >> > >> On Thu, Apr 9, 2015 at 7:07 AM, Kyle Hutson wrote: > >> > This particular problem I just figured out myself ('ceph -w' was still > >> > running from before the upgrade, and ctrl-c and restarting solved that > >> > issue), but I'm still having a similar problem on the ceph client: > >> > > >> > libceph: mon19 10.5.38.20:6789 feature set mismatch, my 2b84a042aca < > >> > server's 102b84a042aca, missing 1 > >> > > >> > It appears that even the latest kernel doesn't have support for > >> > CEPH_FEATURE_CRUSH_V4 > >> > > >> > How do I make my ceph cluster backward-compatible with the old cephfs > >> > client? 
> >> > > >> > On Thu, Apr 9, 2015 at 8:58 AM, Kyle Hutson > wrote: > >> >> > >> >> I upgraded from giant to hammer yesterday and now 'ceph -w' is > >> >> constantly > >> >> repeating this message: > >> >> > >> >> 2015-04-09 08:50:26.318042 7f95dbf86700 0 -- 10.5.38.1:0/2037478 >> > >> >> 10.5.38.1:6789/0 pipe(0x7f95e00256e0 sd=3 :39489 s=1 pgs=0 cs=0 l=1 > >> >> c=0x7f95e0023670).connect protocol feature mismatch, my 3fff > < > >> >> peer > >> >> 13fff missing 1 > >> >> > >> >> It isn't always the same IP for the destination - here's another: > >> >> 2015-04-09 08:50:20.322059 7f95dc087700 0 -- 10.5.38.1:0/2037478 >> > >> >> 10.5.38.8:6789/0 pipe(0x7f95e00262f0 sd=3 :54047 s=1 pgs=0 cs=0 l=1 > >> >> c=0x7f95e002b480).connect protocol feature mismatch, my 3fff > < > >> >> peer > >> >> 13fff missing 1 > >> >> > >> >> Some details about our install: > >> >> We have 24 hosts with 18 OSDs each. 16 per host are spinning disks in > >> >> an > >> >> erasure coded pool (k=8 m=4). 2 OSDs per host are SSD partitions used > >> >> for a > >> >> caching tier in front of the EC pool. All 24 hosts are monitors. 4 > >> >> hosts are > >> >> mds. We are running cephfs with a client trying to write data over > >> >> cephfs > >> >> when we're seeing these messages. > >> >> > >> >> Any ideas? > >> > > >> > > >> > > >> > ___ > >> > ceph-users mailing list > >> > ceph-users@lists.ceph.com > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> > > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] "protocol feature mismatch" after upgrading to Hammer
This particular problem I just figured out myself ('ceph -w' was still running from before the upgrade, and ctrl-c and restarting solved that issue), but I'm still having a similar problem on the ceph client: libceph: mon19 10.5.38.20:6789 feature set mismatch, my 2b84a042aca < server's 102b84a042aca, missing 1 It appears that even the latest kernel doesn't have support for CEPH_FEATURE_CRUSH_V4 How do I make my ceph cluster backward-compatible with the old cephfs client? On Thu, Apr 9, 2015 at 8:58 AM, Kyle Hutson wrote: > I upgraded from giant to hammer yesterday and now 'ceph -w' is constantly > repeating this message: > > 2015-04-09 08:50:26.318042 7f95dbf86700 0 -- 10.5.38.1:0/2037478 >> > 10.5.38.1:6789/0 pipe(0x7f95e00256e0 sd=3 :39489 s=1 pgs=0 cs=0 l=1 > c=0x7f95e0023670).connect protocol feature mismatch, my 3fff < peer > 13fff missing 1 > > It isn't always the same IP for the destination - here's another: > 2015-04-09 08:50:20.322059 7f95dc087700 0 -- 10.5.38.1:0/2037478 >> > 10.5.38.8:6789/0 pipe(0x7f95e00262f0 sd=3 :54047 s=1 pgs=0 cs=0 l=1 > c=0x7f95e002b480).connect protocol feature mismatch, my 3fff < peer > 13fff missing 1 > > Some details about our install: > We have 24 hosts with 18 OSDs each. 16 per host are spinning disks in an > erasure coded pool (k=8 m=4). 2 OSDs per host are SSD partitions used for a > caching tier in front of the EC pool. All 24 hosts are monitors. 4 hosts > are mds. We are running cephfs with a client trying to write data over > cephfs when we're seeing these messages. > > Any ideas? > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
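Since the missing bit is CRUSH_V4 (straw2 buckets), one way to restore compatibility with the old kernel client is Greg's suggestion of disabling the v4 features: convert straw2 buckets back to straw in the decompiled CRUSH map, at the cost of straw2's improved rebalancing. A hedged sketch using standard crushtool round-tripping; file names are examples:

```shell
# Hedged workaround sketch: demote straw2 buckets to straw so older kernel
# clients can map the CRUSH map again. File names are examples; review the
# edited map before setting it.
#
#   ceph osd getcrushmap -o crushmap.bin
#   crushtool -d crushmap.bin -o crushmap.txt
#   demote_straw2 < crushmap.txt > crushmap.fixed.txt
#   crushtool -c crushmap.fixed.txt -o crushmap.new
#   ceph osd setcrushmap -i crushmap.new

demote_straw2() {
  # rewrite "alg straw2" bucket lines to "alg straw"
  sed 's/alg straw2$/alg straw/'
}

printf 'host hobbit01 {\n    alg straw2\n    hash 0\n}\n' | demote_straw2
```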
[ceph-users] "protocol feature mismatch" after upgrading to Hammer
I upgraded from giant to hammer yesterday and now 'ceph -w' is constantly repeating this message: 2015-04-09 08:50:26.318042 7f95dbf86700 0 -- 10.5.38.1:0/2037478 >> 10.5.38.1:6789/0 pipe(0x7f95e00256e0 sd=3 :39489 s=1 pgs=0 cs=0 l=1 c=0x7f95e0023670).connect protocol feature mismatch, my 3fff < peer 13fff missing 1 It isn't always the same IP for the destination - here's another: 2015-04-09 08:50:20.322059 7f95dc087700 0 -- 10.5.38.1:0/2037478 >> 10.5.38.8:6789/0 pipe(0x7f95e00262f0 sd=3 :54047 s=1 pgs=0 cs=0 l=1 c=0x7f95e002b480).connect protocol feature mismatch, my 3fff < peer 13fff missing 1 Some details about our install: We have 24 hosts with 18 OSDs each. 16 per host are spinning disks in an erasure coded pool (k=8 m=4). 2 OSDs per host are SSD partitions used for a caching tier in front of the EC pool. All 24 hosts are monitors. 4 hosts are mds. We are running cephfs with a client trying to write data over cephfs when we're seeing these messages. Any ideas? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)
For what it's worth, I don't think "being patient" was the answer. I was having the same problem a couple of weeks ago, and I waited from before 5pm one day until after 8am the next, and still got the same errors. I ended up adding a "new" cephfs pool with a newly-created small pool, but was never able to actually remove cephfs altogether. On Thu, Mar 26, 2015 at 12:45 PM, Jake Grimmett wrote: > On 03/25/2015 05:44 PM, Gregory Farnum wrote: > >> On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett >> wrote: >> >>> Dear All, >>> >>> Please forgive this post if it's naive, I'm trying to familiarise myself >>> with cephfs! >>> >>> I'm using Scientific Linux 6.6. with Ceph 0.87.1 >>> >>> My first steps with cephfs using a replicated pool worked OK. >>> >>> Now trying now to test cephfs via a replicated caching tier on top of an >>> erasure pool. I've created an erasure pool, cannot put it under the >>> existing >>> replicated pool. >>> >>> My thoughts were to delete the existing cephfs, and start again, however >>> I >>> cannot delete the existing cephfs: >>> >>> errors are as follows: >>> >>> [root@ceph1 ~]# ceph fs rm cephfs2 >>> Error EINVAL: all MDS daemons must be inactive before removing filesystem >>> >>> I've tried killing the ceph-mds process, but this does not prevent the >>> above >>> error. >>> >>> I've also tried this, which also errors: >>> >>> [root@ceph1 ~]# ceph mds stop 0 >>> Error EBUSY: must decrease max_mds or else MDS will immediately >>> reactivate >>> >> >> Right, so did you run "ceph mds set_max_mds 0" and then repeating the >> stop command? :) >> >> >>> This also fail... 
>>> >>> [root@ceph1 ~]# ceph-deploy mds destroy >>> [ceph_deploy.conf][DEBUG ] found configuration file at: >>> /root/.cephdeploy.conf >>> [ceph_deploy.cli][INFO ] Invoked (1.5.21): /usr/bin/ceph-deploy mds >>> destroy >>> [ceph_deploy.mds][ERROR ] subcommand destroy not implemented >>> >>> Am I doing the right thing in trying to wipe the original cephfs config >>> before attempting to use an erasure cold tier? Or can I just redefine the >>> cephfs? >>> >> >> Yeah, unfortunately you need to recreate it if you want to try and use >> an EC pool with cache tiering, because CephFS knows what pools it >> expects data to belong to. Things are unlikely to behave correctly if >> you try and stick an EC pool under an existing one. :( >> >> Sounds like this is all just testing, which is good because the >> suitability of EC+cache is very dependent on how much hot data you >> have, etc...good luck! >> -Greg >> >> >>> many thanks, >>> >>> Jake Grimmett >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> > Thanks for your help - much appreciated. > > The "set_max_mds 0" command worked, but only after I rebooted the server, > and restarted ceph twice. Before this I still got an > "mds active" error, and so was unable to destroy the cephfs. > > Possibly I was being impatient, and needed to let mds go inactive? there > were ~1 million files on the system. > > [root@ceph1 ~]# ceph mds set_max_mds 0 > max_mds = 0 > > [root@ceph1 ~]# ceph mds stop 0 > telling mds.0 10.1.0.86:6811/3249 to deactivate > > [root@ceph1 ~]# ceph mds stop 0 > Error EEXIST: mds.0 not active (up:stopping) > > [root@ceph1 ~]# ceph fs rm cephfs2 > Error EINVAL: all MDS daemons must be inactive before removing filesystem > > There shouldn't be any other mds servers running.. > [root@ceph1 ~]# ceph mds stop 1 > Error EEXIST: mds.1 not active (down:dne) > > At this point I rebooted the server, did a "service ceph restart" twice. 
> Shutdown ceph, then restarted ceph before this command worked: > > [root@ceph1 ~]# ceph fs rm cephfs2 --yes-i-really-mean-it > > Anyhow, I've now been able to create an erasure coded pool, with a > replicated tier which cephfs is running on :) > > *Lots* of testing to go! > > Again, many thanks > > Jake > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
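For anyone retracing Jake's steps, the whole teardown-and-rebuild sequence discussed in this thread can be sketched as below. This is a sketch only: pool names, PG counts, and the k/m values are illustrative, the fs name cephfs2 comes from the thread, and the exact CLI behaviour varies by release (this is the 0.87-era syntax used above).

```shell
# 1. Wind down the MDS cluster, then remove the old filesystem
ceph mds set_max_mds 0
ceph mds stop 0            # wait for up:stopping -> down before proceeding
ceph fs rm cephfs2 --yes-i-really-mean-it

# 2. Create the erasure-coded base pool and a replicated cache pool
ceph osd erasure-code-profile set ecprofile k=4 m=2
ceph osd pool create ecpool 1024 1024 erasure ecprofile
ceph osd pool create cachepool 1024 1024 replicated

# 3. Put the replicated pool in front of the EC pool as a writeback tier
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool

# 4. Recreate CephFS (metadata must live on a replicated pool)
ceph osd pool create cephfs_metadata 256 256 replicated
ceph fs new cephfs cephfs_metadata ecpool
```

As Greg notes, this only makes sense if the hot/cold split of the workload suits cache tiering; do not put an EC pool under CephFS without the tier.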
Re: [ceph-users] New EC pool undersized
So it sounds like I should figure out at how many nodes I need to increase pg_num to 4096, and again to 8192, and increase those incrementally as I add more hosts, correct? On Wed, Mar 4, 2015 at 3:04 PM, Don Doerner wrote: > Sorry, I missed your other questions, down at the bottom. See here > <http://ceph.com/docs/master/rados/operations/placement-groups/> (look > for "number of replicas for replicated pools or the K+M sum for erasure > coded pools") for the formula; 38400/8 probably implies 8192. > > > > The thing is, you've got to think about how many ways you can form > combinations of 8 unique OSDs (with replacement) that match your failure > domain rules. If you've only got 8 hosts, and your failure domain is > hosts, it severely limits this number. And I have read that too many > isn't good either – a serialization issue, I believe. > > > > -don- > > > > *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf > Of *Don Doerner > *Sent:* 04 March, 2015 12:49 > *To:* Kyle Hutson > *Cc:* ceph-users@lists.ceph.com > > *Subject:* Re: [ceph-users] New EC pool undersized > > > > Hmmm, I just struggled through this myself. How many racks do you have? If > not more than 8, you might want to make your failure domain smaller? I.e., > maybe host?
That, at least, would allow you to debug the situation… > > > > -don- > > > > *From:* Kyle Hutson [mailto:kylehut...@ksu.edu] > *Sent:* 04 March, 2015 12:43 > *To:* Don Doerner > *Cc:* Ceph Users > *Subject:* Re: [ceph-users] New EC pool undersized > > > > It wouldn't let me simply change the pg_num, giving > > Error EEXIST: specified pg_num 2048 <= current 8192 > > > > But that's not a big deal, I just deleted the pool and recreated with > 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' > > ...and the result is quite similar: 'ceph status' is now > > ceph status > > cluster 196e5eb8-d6a7-4435-907e-ea028e946923 > > health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs > undersized > > monmap e1: 4 mons at {hobbit01= > 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, > election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 > > osdmap e412: 144 osds: 144 up, 144 in > > pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects > > 90590 MB used, 640 TB / 640 TB avail > > 4 active+undersized+degraded > > 6140 active+clean > > > > 'ceph pg dump_stuck' results in > > ok > > pg_stat objects mip degr misp unf bytes log disklog state > state_stamp v reported up up_primary acting acting_primary > last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp > > 2.296 0 0 0 0 0 0 0 0 > active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 > [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] > 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 > > 2.69c 0 0 0 0 0 0 0 0 > active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 > [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] > 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 > > 2.36d 0 0 0 0 0 0 0 0 > active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 > [12,108,136,104,52,18,63,2147483647] 12 [12,108,136,104,52,18,63,2147483647] > 12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480 > > 2.5f7 0 0 0 0 0 0 0 0 > active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 > [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] > 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 > > > > I do have questions for you, even at this point, though. > > 1) Where did you find the formula (14400/(k+m))? > > 2) I was really trying to size this for when it goes to production, at > which point it may have as many as 384 OSDs. Doesn't that imply I should > have even more pgs?
Re: [ceph-users] New EC pool undersized
That did it. 'step set_choose_tries 200' fixed the problem right away. Thanks Yann! On Wed, Mar 4, 2015 at 2:59 PM, Yann Dupont wrote: > > Le 04/03/2015 21:48, Don Doerner a écrit : > > Hmmm, I just struggled through this myself. How many racks do you have? > If not more than 8, you might want to make your failure domain smaller? I.e., > maybe host? That, at least, would allow you to debug the situation… > > > > -don- > > > > > Hello, I think I already had this problem. > It's explained here > http://tracker.ceph.com/issues/10350 > > And solution is probably here : > http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ > > Section : "CRUSH gives up too soon" > > Cheers, > Yann > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
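For reference, the "CRUSH gives up too soon" fix Yann points to amounts to decompiling the crushmap, raising the retry budget in the EC rule, and reinjecting it. A sketch (file names are arbitrary; 200 is the value that worked here):

```shell
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# In the EC pool's rule, add a set_choose_tries step before 'step take', e.g.:
#     step set_chooseleaf_tries 5
#     step set_choose_tries 200
#     step take default
#     step chooseleaf indep 0 type disktype
#     step emit

crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
```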
Re: [ceph-users] New EC pool undersized
My lowest level (other than OSD) is 'disktype' (based on the crushmaps at http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ ) since I have SSDs and HDDs on the same host. I just made that change (deleted the pool, deleted the profile, deleted the crush ruleset), then re-created using ruleset-failure-domain=disktype. Very similar results. health HEALTH_WARN 3 pgs degraded; 3 pgs stuck unclean; 3 pgs undersized 'ceph pg dump stuck' looks very similar to the last one I posted. On Wed, Mar 4, 2015 at 2:48 PM, Don Doerner wrote: > Hmmm, I just struggled through this myself. How many racks do you have? > If not more than 8, you might want to make your failure domain smaller? I.e., > maybe host? That, at least, would allow you to debug the situation… > > > > -don- > > > > *From:* Kyle Hutson [mailto:kylehut...@ksu.edu] > *Sent:* 04 March, 2015 12:43 > *To:* Don Doerner > *Cc:* Ceph Users > > *Subject:* Re: [ceph-users] New EC pool undersized > > > > It wouldn't let me simply change the pg_num, giving > > Error EEXIST: specified pg_num 2048 <= current 8192 > > > > But that's not a big deal, I just deleted the pool and recreated with > 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' > > ...and the result is quite similar: 'ceph status' is now > > ceph status > > cluster 196e5eb8-d6a7-4435-907e-ea028e946923 > > health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs > undersized > > monmap e1: 4 mons at {hobbit01= > 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, > election epoch 6, quorum 0,1,2,3
hobbit01,hobbit02,hobbit13,hobbit14 > > osdmap e412: 144 osds: 144 up, 144 in > > pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects > > 90590 MB used, 640 TB / 640 TB avail > > 4 active+undersized+degraded > > 6140 active+clean > > > > 'ceph pg dump_stuck' results in > > ok > > pg_stat objects mip degr misp unf bytes log disklog state > state_stamp v reported up up_primary acting acting_primary > last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp > > 2.296 0 0 0 0 0 0 0 0 > active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 > [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] > 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 > > 2.69c 0 0 0 0 0 0 0 0 > active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 > [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] > 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 > > 2.36d 0 0 0 0 0 0 0 0 > active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 > [12,108,136,104,52,18,63,2147483647] 12 [12,108,136,104,52,18,63,2147483647] > 12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480 > > 2.5f7 0 0 0 0 0 0 0 0 > active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 > [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] > 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 > > > > I do have questions for you, even at this point, though. > > 1) Where did you find the formula (14400/(k+m))? > > 2) I was really trying to size this for when it goes to production, at > which point it may have as many as 384 OSDs. Doesn't that imply I should > have even more pgs? > > > > On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner > wrote: > > Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so > try 2048.
> > > > -don- > > > > *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf > Of *Don Doerner > *Sent:* 04 March, 2015 12:14 > *To:* Kyle Hutson; Ceph Users > *Subject:* Re: [ceph-users] New EC pool undersized > > > > In this case, that number means that there is not an OSD that can be > assigned. What's your k, m from your erasure coded pool? You'll need > approximately (14400/(k+m)) PGs, rounded up to the next power of 2…
Re: [ceph-users] New EC pool undersized
It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 <= current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01= 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647] 12 [12,108,136,104,52,18,63,2147483647] 12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480 2.5f7 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 I do have questions for you, even at this point, though. 
1) Where did you find the formula (14400/(k+m))? 2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs? On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner wrote: > Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so > try 2048. > > > > -don- > > > > *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf > Of *Don Doerner > *Sent:* 04 March, 2015 12:14 > *To:* Kyle Hutson; Ceph Users > *Subject:* Re: [ceph-users] New EC pool undersized > > > > In this case, that number means that there is not an OSD that can be > assigned. What's your k, m from your erasure coded pool? You'll need > approximately (14400/(k+m)) PGs, rounded up to the next power of 2… > > > > -don- > > > > *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com > ] *On Behalf Of *Kyle Hutson > *Sent:* 04 March, 2015 12:06 > *To:* Ceph Users > *Subject:* [ceph-users] New EC pool undersized > > > > Last night I blew away my previous ceph configuration (this environment is > pre-production) and have 0.87.1 installed. I've manually edited the > crushmap so it now looks like https://dpaste.de/OLEa > > > > I currently have 144 OSDs on 8 nodes. > > > > After increasing pg_num and pgp_num to a more suitable 1024 (due to the > high number of OSDs), everything looked happy. > > So, now I'm trying to play with an erasure-coded pool.
> > I did: > > ceph osd erasure-code-profile set ec44profile k=4 m=4 > ruleset-failure-domain=rack > > ceph osd pool create ec44pool 8192 8192 erasure ec44profile > > > > After settling for a bit 'ceph status' gives > > cluster 196e5eb8-d6a7-4435-907e-ea028e946923 > > health HEALTH_WARN 7 pgs degraded; 7 pgs stuck degraded; 7 pgs stuck > unclean; 7 pgs stuck undersized; 7 pgs undersized > > monmap e1: 4 mons at {hobbit01= > 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0},
[ceph-users] New EC pool undersized
Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it now looks like https://dpaste.de/OLEa I currently have 144 OSDs on 8 nodes. After increasing pg_num and pgp_num to a more suitable 1024 (due to the high number of OSDs), everything looked happy. So, now I'm trying to play with an erasure-coded pool. I did: ceph osd erasure-code-profile set ec44profile k=4 m=4 ruleset-failure-domain=rack ceph osd pool create ec44pool 8192 8192 erasure ec44profile After settling for a bit 'ceph status' gives cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 7 pgs degraded; 7 pgs stuck degraded; 7 pgs stuck unclean; 7 pgs stuck undersized; 7 pgs undersized monmap e1: 4 mons at {hobbit01= 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e409: 144 osds: 144 up, 144 in pgmap v6763: 12288 pgs, 2 pools, 0 bytes data, 0 objects 90598 MB used, 640 TB / 640 TB avail 7 active+undersized+degraded 12281 active+clean So to troubleshoot the undersized pgs, I issued 'ceph pg dump_stuck' ok pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 1.d77 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:33:57.502849 0'0 408:12 [15,95,58,73,52,31,116,2147483647] 15 [15,95,58,73,52,31,116,2147483647] 15 0'0 2015-03-04 11:33:42.100752 0'0 2015-03-04 11:33:42.100752 1.10fa 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:29.362554 0'0 408:12 [23,12,99,114,132,53,56,2147483647] 23 [23,12,99,114,132,53,56,2147483647] 23 0'0 2015-03-04 11:33:42.168571 0'0 2015-03-04 11:33:42.168571 1.1271 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:33:48.795742 0'0 408:12 [135,112,69,4,22,95,2147483647,83] 135
[135,112,69,4,22,95,2147483647,83] 135 0'0 2015-03-04 11:33:42.139555 0'0 2015-03-04 11:33:42.139555 1.2b5 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:32.189738 0'0 408:12 [11,115,139,19,76,52,94,2147483647] 11 [11,115,139,19,76,52,94,2147483647] 11 0'0 2015-03-04 11:33:42.079673 0'0 2015-03-04 11:33:42.079673 1.7ae 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:26.848344 0'0 408:12 [27,5,132,119,94,56,52,2147483647] 27 [27,5,132,119,94,56,52,2147483647] 27 0'0 2015-03-04 11:33:42.109832 0'0 2015-03-04 11:33:42.109832 1.1a97 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:25.457454 0'0 408:12 [20,53,14,54,102,118,2147483647,72] 20 [20,53,14,54,102,118,2147483647,72] 20 0'0 2015-03-04 11:33:42.833850 0'0 2015-03-04 11:33:42.833850 1.10a6 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:30.059936 0'0 408:12 [136,22,4,2147483647,72,52,101,55] 136 [136,22,4,2147483647,72,52,101,55] 136 0'0 2015-03-04 11:33:42.125871 0'0 2015-03-04 11:33:42.125871 This appears to have a number on all these (2147483647) that is way out of line from what I would expect. Thoughts? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
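That out-of-place number decodes simply: 2147483647 is 2^31 − 1, the sentinel CRUSH reports for a slot it could not fill with any OSD (often referred to as ITEM_NONE). In other words, each of these PGs is missing one of its eight shards, which is exactly why they show as undersized.

```shell
# 2147483647 in an 'up'/'acting' set is not a real OSD id; it is the
# "no OSD found for this slot" sentinel, 2^31 - 1:
printf '%d\n' $(( (1 << 31) - 1 ))   # -> 2147483647
```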
Re: [ceph-users] Centos 7 OSD silently fail to start
Just did it. Thanks for suggesting it. On Wed, Feb 25, 2015 at 5:59 PM, Brad Hubbard wrote: > On 02/26/2015 09:05 AM, Kyle Hutson wrote: > >> Thank you Thomas. You at least made me look it the right spot. Their >> long-form is showing what to do for a mon, not an osd. >> >> At the bottom of step 11, instead of >> sudo touch /var/lib/ceph/mon/{cluster-name}-{hostname}/sysvinit >> >> It should read >> sudo touch /var/lib/ceph/osd/{cluster-name}-{osd-num}/sysvinit >> >> Once I did that 'service ceph status' correctly shows that I have that >> OSD available and I can start or stop it from there. >> >> > Could you open a bug for this? > > Cheers, > Brad > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Centos 7 OSD silently fail to start
Thank you Thomas. You at least made me look in the right spot. Their long-form is showing what to do for a mon, not an osd. At the bottom of step 11, instead of sudo touch /var/lib/ceph/mon/{cluster-name}-{hostname}/sysvinit It should read sudo touch /var/lib/ceph/osd/{cluster-name}-{osd-num}/sysvinit Once I did that 'service ceph status' correctly shows that I have that OSD available and I can start or stop it from there. On Wed, Feb 25, 2015 at 4:55 PM, Thomas Foster wrote: > I am using the long form and have it working. The one thing that I saw > was to change from osd_host to just host. See if that works. > On Feb 25, 2015 5:44 PM, "Kyle Hutson" wrote: > >> I just tried it, and that does indeed get the OSD to start. >> >> However, it doesn't add it to the appropriate place so it would survive a >> reboot. In my case, running 'service ceph status osd.16' still results in >> the same line I posted above. >> >> There's still something broken such that 'ceph-disk activate' works >> correctly, but using the long-form version does not. >> >> On Wed, Feb 25, 2015 at 4:03 PM, Robert LeBlanc >> wrote: >> >>> Step #6 in >>> http://ceph.com/docs/master/install/manual-deployment/#long-form >>> only set-ups the file structure for the OSD, it doesn't start the long >>> running process. >>> >>> On Wed, Feb 25, 2015 at 2:59 PM, Kyle Hutson wrote: >>> > But I already issued that command (back in step 6). >>> > >>> > The interesting part is that "ceph-disk activate" apparently does it >>> > correctly. Even after reboot, the services start as they should. >>> > >>> > On Wed, Feb 25, 2015 at 3:54 PM, Robert LeBlanc >>> > wrote: >>> >> >>> >> I think that your problem lies with systemd (even though you are using >>> >> SysV syntax, systemd is really doing the work). Systemd does not like >>> >> multiple arguments and I think this is why it is failing.
There is >>> >> supposed to be some work done to get systemd working ok, but I think >>> >> it has the limitation of only working with a cluster named 'ceph' >>> >> currently. >>> >> >>> >> What I did to get around the problem was to run the osd command >>> manually: >>> >> >>> >> ceph-osd -i >>> >> >>> >> Once I understand the under-the-hood stuff, I moved to ceph-disk and >>> >> now because of the GPT partition IDs, udev automatically starts up the >>> >> OSD process at boot/creation and moves to the appropiate CRUSH >>> >> location (configuratble in ceph.conf >>> >> >>> http://ceph.com/docs/master/rados/operations/crush-map/#crush-location, >>> >> an example: crush location = host=test rack=rack3 row=row8 >>> >> datacenter=local region=na-west root=default). To restart an OSD >>> >> process, I just kill the PID for the OSD then issue ceph-disk activate >>> >> /dev/sdx1 to restart the OSD process. You probably could stop it with >>> >> systemctl since I believe udev creates a resource for it (I should >>> >> probably look into that now that this system will be going production >>> >> soon). >>> >> >>> >> On Wed, Feb 25, 2015 at 2:13 PM, Kyle Hutson >>> wrote: >>> >> > I'm having a similar issue. >>> >> > >>> >> > I'm following >>> http://ceph.com/docs/master/install/manual-deployment/ to >>> >> > a T. >>> >> > >>> >> > I have OSDs on the same host deployed with the short-form and they >>> work >>> >> > fine. I am trying to deploy some more via the long form (because I >>> want >>> >> > them >>> >> > to appear in a different location in the crush map). Everything >>> through >>> >> > step >>> >> > 10 (i.e. ceph osd crush add {id-or-name} {weight} >>> >> > [{bucket-type}={bucket-name} ...] ) works just fine. 
When I go to >>> step >>> >> > 11 >>> >> > (sudo /etc/init.d/ceph start osd.{osd-num}) I get: >>> >> > /etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines >>> >> > mon.hobbit01 >>> >> > osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 >>> osd.12 >>> >>
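Condensed, the workaround Kyle landed on for the long-form deployment is the following (cluster name ceph and osd.16 are the examples from this thread; the sysvinit marker is what lets the SysV init script recognise the daemon):

```shell
# Docs' step 11 touches the marker under the mon path; for an OSD it
# belongs under the OSD's data directory instead:
sudo touch /var/lib/ceph/osd/ceph-16/sysvinit
sudo service ceph start osd.16
sudo service ceph status osd.16
```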
Re: [ceph-users] Centos 7 OSD silently fail to start
I just tried it, and that does indeed get the OSD to start. However, it doesn't add it to the appropriate place so it would survive a reboot. In my case, running 'service ceph status osd.16' still results in the same line I posted above. There's still something broken such that 'ceph-disk activate' works correctly, but using the long-form version does not. On Wed, Feb 25, 2015 at 4:03 PM, Robert LeBlanc wrote: > Step #6 in > http://ceph.com/docs/master/install/manual-deployment/#long-form > only set-ups the file structure for the OSD, it doesn't start the long > running process. > > On Wed, Feb 25, 2015 at 2:59 PM, Kyle Hutson wrote: > > But I already issued that command (back in step 6). > > > > The interesting part is that "ceph-disk activate" apparently does it > > correctly. Even after reboot, the services start as they should. > > > > On Wed, Feb 25, 2015 at 3:54 PM, Robert LeBlanc > > wrote: > >> > >> I think that your problem lies with systemd (even though you are using > >> SysV syntax, systemd is really doing the work). Systemd does not like > >> multiple arguments and I think this is why it is failing. There is > >> supposed to be some work done to get systemd working ok, but I think > >> it has the limitation of only working with a cluster named 'ceph' > >> currently. > >> > >> What I did to get around the problem was to run the osd command > manually: > >> > >> ceph-osd -i > >> > >> Once I understand the under-the-hood stuff, I moved to ceph-disk and > >> now because of the GPT partition IDs, udev automatically starts up the > >> OSD process at boot/creation and moves to the appropiate CRUSH > >> location (configuratble in ceph.conf > >> http://ceph.com/docs/master/rados/operations/crush-map/#crush-location, > >> an example: crush location = host=test rack=rack3 row=row8 > >> datacenter=local region=na-west root=default). 
To restart an OSD > >> process, I just kill the PID for the OSD then issue ceph-disk activate > >> /dev/sdx1 to restart the OSD process. You probably could stop it with > >> systemctl since I believe udev creates a resource for it (I should > >> probably look into that now that this system will be going production > >> soon). > >> > >> On Wed, Feb 25, 2015 at 2:13 PM, Kyle Hutson > wrote: > >> > I'm having a similar issue. > >> > > >> > I'm following http://ceph.com/docs/master/install/manual-deployment/ > to > >> > a T. > >> > > >> > I have OSDs on the same host deployed with the short-form and they > work > >> > fine. I am trying to deploy some more via the long form (because I > want > >> > them > >> > to appear in a different location in the crush map). Everything > through > >> > step > >> > 10 (i.e. ceph osd crush add {id-or-name} {weight} > >> > [{bucket-type}={bucket-name} ...] ) works just fine. When I go to step > >> > 11 > >> > (sudo /etc/init.d/ceph start osd.{osd-num}) I get: > >> > /etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines > >> > mon.hobbit01 > >> > osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 > >> > osd.6 > >> > osd.11 osd.5 osd.4 osd.0 , /var/lib/ceph defines mon.hobbit01 osd.7 > >> > osd.15 > >> > osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 osd.6 osd.11 > >> > osd.5 > >> > osd.4 osd.0) > >> > > >> > > >> > > >> > On Wed, Feb 25, 2015 at 11:55 AM, Travis Rhoden > >> > wrote: > >> >> > >> >> Also, did you successfully start your monitor(s), and define/create > the > >> >> OSDs within the Ceph cluster itself? > >> >> > >> >> There are several steps to creating a Ceph cluster manually. I'm > >> >> unsure > >> >> if you have done the steps to actually create and register the OSDs > >> >> with the > >> >> cluster. > >> >> > >> >> - Travis > >> >> > >> >> On Wed, Feb 25, 2015 at 9:49 AM, Leszek Master > >> >> wrote: > >> >>> > >> >>> Check firewall rules and selinux. 
It sometimes is a pain in the ... > :) > >> >>> > >> >>> 25 lut 2015 01:46 "Barclay Jameson" > >> >>> napisał(a):
Re: [ceph-users] Centos 7 OSD silently fail to start
So I issue it twice? e.g. ceph-osd -i X --mkfs --mkkey ...other commands... ceph-osd -i X ? On Wed, Feb 25, 2015 at 4:03 PM, Robert LeBlanc wrote: > Step #6 in > http://ceph.com/docs/master/install/manual-deployment/#long-form > only set-ups the file structure for the OSD, it doesn't start the long > running process. > > On Wed, Feb 25, 2015 at 2:59 PM, Kyle Hutson wrote: > > But I already issued that command (back in step 6). > > > > The interesting part is that "ceph-disk activate" apparently does it > > correctly. Even after reboot, the services start as they should. > > > > On Wed, Feb 25, 2015 at 3:54 PM, Robert LeBlanc > > wrote: > >> > >> I think that your problem lies with systemd (even though you are using > >> SysV syntax, systemd is really doing the work). Systemd does not like > >> multiple arguments and I think this is why it is failing. There is > >> supposed to be some work done to get systemd working ok, but I think > >> it has the limitation of only working with a cluster named 'ceph' > >> currently. > >> > >> What I did to get around the problem was to run the osd command > manually: > >> > >> ceph-osd -i > >> > >> Once I understand the under-the-hood stuff, I moved to ceph-disk and > >> now because of the GPT partition IDs, udev automatically starts up the > >> OSD process at boot/creation and moves to the appropiate CRUSH > >> location (configuratble in ceph.conf > >> http://ceph.com/docs/master/rados/operations/crush-map/#crush-location, > >> an example: crush location = host=test rack=rack3 row=row8 > >> datacenter=local region=na-west root=default). To restart an OSD > >> process, I just kill the PID for the OSD then issue ceph-disk activate > >> /dev/sdx1 to restart the OSD process. You probably could stop it with > >> systemctl since I believe udev creates a resource for it (I should > >> probably look into that now that this system will be going production > >> soon). 
> >> > >> On Wed, Feb 25, 2015 at 2:13 PM, Kyle Hutson > wrote: > >> > I'm having a similar issue. > >> > > >> > I'm following http://ceph.com/docs/master/install/manual-deployment/ > to > >> > a T. > >> > > >> > I have OSDs on the same host deployed with the short-form and they > work > >> > fine. I am trying to deploy some more via the long form (because I > want > >> > them > >> > to appear in a different location in the crush map). Everything > through > >> > step > >> > 10 (i.e. ceph osd crush add {id-or-name} {weight} > >> > [{bucket-type}={bucket-name} ...] ) works just fine. When I go to step > >> > 11 > >> > (sudo /etc/init.d/ceph start osd.{osd-num}) I get: > >> > /etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines > >> > mon.hobbit01 > >> > osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 > >> > osd.6 > >> > osd.11 osd.5 osd.4 osd.0 , /var/lib/ceph defines mon.hobbit01 osd.7 > >> > osd.15 > >> > osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 osd.6 osd.11 > >> > osd.5 > >> > osd.4 osd.0) > >> > > >> > > >> > > >> > On Wed, Feb 25, 2015 at 11:55 AM, Travis Rhoden > >> > wrote: > >> >> > >> >> Also, did you successfully start your monitor(s), and define/create > the > >> >> OSDs within the Ceph cluster itself? > >> >> > >> >> There are several steps to creating a Ceph cluster manually. I'm > >> >> unsure > >> >> if you have done the steps to actually create and register the OSDs > >> >> with the > >> >> cluster. > >> >> > >> >> - Travis > >> >> > >> >> On Wed, Feb 25, 2015 at 9:49 AM, Leszek Master > >> >> wrote: > >> >>> > >> >>> Check firewall rules and selinux. It sometimes is a pain in the ... > :) > >> >>> > >> >>> 25 lut 2015 01:46 "Barclay Jameson" > >> >>> napisał(a): > >> >>> > >> >>>> I have tried to install ceph using ceph-deploy but sgdisk seems to > >> >>>> have too many issues so I did a manual install. After mkfs.btrfs on > >> >>>> the disks and journals and mounted them
Re: [ceph-users] Centos 7 OSD silently fail to start
But I already issued that command (back in step 6). The interesting part is that "ceph-disk activate" apparently does it correctly. Even after reboot, the services start as they should. On Wed, Feb 25, 2015 at 3:54 PM, Robert LeBlanc wrote: > I think that your problem lies with systemd (even though you are using > SysV syntax, systemd is really doing the work). Systemd does not like > multiple arguments and I think this is why it is failing. There is > supposed to be some work done to get systemd working ok, but I think > it has the limitation of only working with a cluster named 'ceph' > currently. > > What I did to get around the problem was to run the osd command manually: > > ceph-osd -i > > Once I understand the under-the-hood stuff, I moved to ceph-disk and > now because of the GPT partition IDs, udev automatically starts up the > OSD process at boot/creation and moves to the appropiate CRUSH > location (configuratble in ceph.conf > http://ceph.com/docs/master/rados/operations/crush-map/#crush-location, > an example: crush location = host=test rack=rack3 row=row8 > datacenter=local region=na-west root=default). To restart an OSD > process, I just kill the PID for the OSD then issue ceph-disk activate > /dev/sdx1 to restart the OSD process. You probably could stop it with > systemctl since I believe udev creates a resource for it (I should > probably look into that now that this system will be going production > soon). > > On Wed, Feb 25, 2015 at 2:13 PM, Kyle Hutson wrote: > > I'm having a similar issue. > > > > I'm following http://ceph.com/docs/master/install/manual-deployment/ to > a T. > > > > I have OSDs on the same host deployed with the short-form and they work > > fine. I am trying to deploy some more via the long form (because I want > them > > to appear in a different location in the crush map). Everything through > step > > 10 (i.e. ceph osd crush add {id-or-name} {weight} > > [{bucket-type}={bucket-name} ...] ) works just fine. 
When I go to step 11 > > (sudo /etc/init.d/ceph start osd.{osd-num}) I get: > > /etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines > mon.hobbit01 > > osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 > osd.6 > > osd.11 osd.5 osd.4 osd.0 , /var/lib/ceph defines mon.hobbit01 osd.7 > osd.15 > > osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 osd.6 osd.11 > osd.5 > > osd.4 osd.0) > > > > On Wed, Feb 25, 2015 at 11:55 AM, Travis Rhoden > wrote: > >> > >> Also, did you successfully start your monitor(s), and define/create the > >> OSDs within the Ceph cluster itself? > >> > >> There are several steps to creating a Ceph cluster manually. I'm unsure > >> if you have done the steps to actually create and register the OSDs > with the > >> cluster. > >> > >> - Travis > >> > >> On Wed, Feb 25, 2015 at 9:49 AM, Leszek Master > wrote: > >>> > >>> Check firewall rules and selinux. It sometimes is a pain in the ... :) > >>> > >>> On 25 Feb 2015 01:46, "Barclay Jameson" > wrote: > >>> > >>>> I have tried to install ceph using ceph-deploy but sgdisk seems to > >>>> have too many issues so I did a manual install. After mkfs.btrfs on > >>>> the disks and journals and mounted them I then tried to start the osds, > >>>> which failed. The first error was: > >>>> #/etc/init.d/ceph start osd.0 > >>>> /etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines , > >>>> /var/lib/ceph defines ) > >>>> > >>>> I then manually added the osds to the conf file with the following as > >>>> an example: > >>>> [osd.0] > >>>> osd_host = node01 > >>>> > >>>> Now when I run the command: > >>>> # /etc/init.d/ceph start osd.0 > >>>> > >>>> There is no error or output from the command, and in fact when I do a > >>>> ceph -s no osds are listed as being up. > >>>> Doing a ps aux | grep -i ceph or ps aux | grep -i osd shows there are > >>>> no osds running. > >>>> I have also done htop to see if any processes are running and none are > >>>> shown.
> >>>> I had this working on SL6.5 with Firefly but Giant on Centos 7 has > >>>> been nothing but a giant pain.
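The crush location hook Robert describes is just a ceph.conf setting; a minimal sketch follows. The host/rack/row names are placeholders and must correspond to buckets and bucket types that actually exist in your CRUSH map.

```ini
[osd]
# Illustrative only - every key=value must match a bucket type and
# bucket name defined in your CRUSH map.
crush location = host=test rack=rack3 row=row8 datacenter=local region=na-west root=default
```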
Re: [ceph-users] Centos 7 OSD silently fail to start
I'm having a similar issue. I'm following http://ceph.com/docs/master/install/manual-deployment/ to a T. I have OSDs on the same host deployed with the short-form and they work fine. I am trying to deploy some more via the long form (because I want them to appear in a different location in the crush map). Everything through step 10 (i.e. ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] ) works just fine. When I go to step 11 (sudo /etc/init.d/ceph start osd.{osd-num}) I get: /etc/init.d/ceph: osd.16 not found (/etc/ceph/ceph.conf defines mon.hobbit01 osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 osd.6 osd.11 osd.5 osd.4 osd.0 , /var/lib/ceph defines mon.hobbit01 osd.7 osd.15 osd.10 osd.9 osd.1 osd.14 osd.2 osd.3 osd.13 osd.8 osd.12 osd.6 osd.11 osd.5 osd.4 osd.0) On Wed, Feb 25, 2015 at 11:55 AM, Travis Rhoden wrote: > Also, did you successfully start your monitor(s), and define/create the > OSDs within the Ceph cluster itself? > > There are several steps to creating a Ceph cluster manually. I'm unsure > if you have done the steps to actually create and register the OSDs with > the cluster. > > - Travis > > On Wed, Feb 25, 2015 at 9:49 AM, Leszek Master wrote: > >> Check firewall rules and selinux. It sometimes is a pain in the ... :) >> On 25 Feb 2015 01:46, "Barclay Jameson" wrote: >> >>> I have tried to install ceph using ceph-deploy but sgdisk seems to >>> have too many issues so I did a manual install. After mkfs.btrfs on >>> the disks and journals and mounted them I then tried to start the osds, >>> which failed.
The first error was: >>> #/etc/init.d/ceph start osd.0 >>> /etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines , >>> /var/lib/ceph defines ) >>> >>> I then manually added the osds to the conf file with the following as >>> an example: >>> [osd.0] >>> osd_host = node01 >>> >>> Now when I run the command: >>> # /etc/init.d/ceph start osd.0 >>> >>> There is no error or output from the command, and in fact when I do a >>> ceph -s no osds are listed as being up. >>> Doing a ps aux | grep -i ceph or ps aux | grep -i osd shows there are >>> no osds running. >>> I have also done htop to see if any processes are running and none are >>> shown. >>> >>> I had this working on SL6.5 with Firefly but Giant on Centos 7 has >>> been nothing but a giant pain. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fixing a crushmap
Oh, and I don't yet have any important data here, so I'm not worried about losing anything at this point. I just need to get my cluster happy again so I can play with it some more. On Fri, Feb 20, 2015 at 11:00 AM, Kyle Hutson wrote: > Here was the process I went through. > 1) I created an EC pool which created ruleset 1 > 2) I edited the crushmap to approximately its current form > 3) I discovered my previous EC pool wasn't doing what I meant for it to > do, so I deleted it. > 4) I created a new EC pool with the parameters I wanted and told it to use > ruleset 3 > > On Fri, Feb 20, 2015 at 10:55 AM, Luis Periquito > wrote: > >> The process of creating an erasure coded pool and a replicated one is >> slightly different. You can use Sebastian's guide to create/manage the osd >> tree, but you should follow this guide >> http://ceph.com/docs/giant/dev/erasure-coded-pool/ to create the EC pool. >> >> I'm not sure (i.e. I never tried) whether you can create an EC pool the way you did. >> The normal replicated ones do work like this. >> >> On Fri, Feb 20, 2015 at 4:49 PM, Kyle Hutson wrote: >> >>> I manually edited my crushmap, basing my changes on >>> http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ >>> I have SSDs and HDDs in the same box and was wanting to separate them by >>> ruleset. My current crushmap can be seen at http://pastie.org/9966238 >>> >>> I had it installed and everything looked good until I created a new >>> pool. All of the new pgs are stuck in "creating". I first tried creating an >>> erasure-coded pool using ruleset 3, then created another pool using ruleset >>> 0. Same result. >>> >>> I'm not opposed to an 'RTFM' answer, so long as you can point me to the >>> right one. I've seen very little documentation on crushmap rules, in >>> particular.
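For anyone following along, the manual-edit cycle being discussed is the standard CRUSH map round trip: dump the compiled map, decompile it, edit, recompile, and inject it back. A sketch, wrapped in a function for clarity; the file names are arbitrary:

```shell
# Round-trip CRUSH map edit with the stock ceph/crushtool commands.
edit_crushmap() {
    ceph osd getcrushmap -o crush.bin      # dump the compiled map
    crushtool -d crush.bin -o crush.txt    # decompile to editable text
    "${EDITOR:-vi}" crush.txt              # adjust buckets/rules by hand
    crushtool -c crush.txt -o crush.new    # recompile (catches syntax errors)
    ceph osd setcrushmap -i crush.new      # inject the new map
}
```

Recompiling with crushtool before injecting is what catches most typos; you can also sanity-check placement with `crushtool --test` before setting the map.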
Re: [ceph-users] Fixing a crushmap
Here was the process I went through. 1) I created an EC pool which created ruleset 1 2) I edited the crushmap to approximately its current form 3) I discovered my previous EC pool wasn't doing what I meant for it to do, so I deleted it. 4) I created a new EC pool with the parameters I wanted and told it to use ruleset 3 On Fri, Feb 20, 2015 at 10:55 AM, Luis Periquito wrote: > The process of creating an erasure coded pool and a replicated one is > slightly different. You can use Sebastian's guide to create/manage the osd > tree, but you should follow this guide > http://ceph.com/docs/giant/dev/erasure-coded-pool/ to create the EC pool. > > I'm not sure (i.e. I never tried) whether you can create an EC pool the way you did. The > normal replicated ones do work like this. > > On Fri, Feb 20, 2015 at 4:49 PM, Kyle Hutson wrote: > >> I manually edited my crushmap, basing my changes on >> http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ >> I have SSDs and HDDs in the same box and was wanting to separate them by >> ruleset. My current crushmap can be seen at http://pastie.org/9966238 >> >> I had it installed and everything looked good until I created a new >> pool. All of the new pgs are stuck in "creating". I first tried creating an >> erasure-coded pool using ruleset 3, then created another pool using ruleset >> 0. Same result. >> >> I'm not opposed to an 'RTFM' answer, so long as you can point me to the >> right one. I've seen very little documentation on crushmap rules, in >> particular. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Fixing a crushmap
I manually edited my crushmap, basing my changes on http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ I have SSDs and HDDs in the same box and was wanting to separate them by ruleset. My current crushmap can be seen at http://pastie.org/9966238 I had it installed and everything looked good until I created a new pool. All of the new pgs are stuck in "creating". I first tried creating an erasure-coded pool using ruleset 3, then created another pool using ruleset 0. Same result. I'm not opposed to an 'RTFM' answer, so long as you can point me to the right one. I've seen very little documentation on crushmap rules, in particular. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] xfs/nobarrier
> do people consider a UPS + Shutdown procedures a suitable substitute? I certainly wouldn't; I've seen utility power fail and the transfer switch fail to transition to the UPS strings. Had that happened to me with nobarrier it would have been a very sad day. -- Kyle Bader ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] private network - VLAN vs separate switch
> Thanks for all the help. Can the moving over from VLAN to separate > switches be done on a live cluster? Or does there need to be a down > time? You can do it on a live cluster. The more cavalier approach would be to quickly switch the link over one server at a time, which might cause a short io stall. The more careful approach would be to 'ceph osd set noup', mark all the osds on a node down, move the link, 'ceph osd unset noup', and then wait for their peers to mark them back up before proceeding to the next host. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
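The careful approach can be sketched as a small function; this is a sketch, not a tested procedure. The host name and OSD ids are placeholders, and the physical link move itself remains a manual step.

```shell
# Flag noup, mark this host's OSDs down, then prompt for the link move.
# Usage: migrate_host_links <host> <osd-id> [<osd-id> ...]
migrate_host_links() {
    local host="$1"; shift
    ceph osd set noup                 # prevent OSDs being marked up mid-move
    local id
    for id in "$@"; do
        ceph osd down "osd.${id}"     # mark each OSD on this host down
    done
    echo "move the cluster-network link for ${host}, then run: ceph osd unset noup"
}
```

After `ceph osd unset noup`, wait until the peers report those OSDs up again before repeating on the next host.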
Re: [ceph-users] private network - VLAN vs separate switch
> For a large network (say 100 servers and 2500 disks), are there any > strong advantages to using separate switch and physical network > instead of VLAN? Physical isolation will ensure that congestion on one does not affect the other. On the flip side, asymmetric network failures tend to be more difficult to troubleshoot, e.g. a backend failure with a functional front end. That said, in a pinch you can switch to using the front end network for both until you can repair the backend. > Also, how difficult it would be to switch from a VLAN to using > separate switches later? Should be relatively straightforward. Simply configure the VLAN/subnets on the new physical switches and move links over one by one. Once all the links are moved over you can remove the VLAN and subnets that are now on the new kit from the original hardware. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install
> Can you paste me the whole output of the install? I am curious why/how you > are getting el7 and el6 packages. You need priority=1 in the /etc/yum.repos.d/ceph.repo entries (with the yum priorities plugin installed) so the Ceph repositories take precedence over other repos carrying conflicting packages. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
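An illustrative repo entry with the priority line in place; the baseurl depends on the release you are installing and is shown here only as an example:

```ini
[ceph]
name=Ceph packages for $basearch
baseurl=http://ceph.com/rpm-giant/el7/$basearch
enabled=1
gpgcheck=1
priority=1
```

The priority= line only takes effect once yum-plugin-priorities is installed and enabled.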
Re: [ceph-users] Is OSDs based on VFS?
> I wonder whether OSDs use the system calls of the Virtual File System (i.e. open, read, > write, etc) when they access disks. > > I mean ... Could I monitor the I/O commands requested by an OSD to its disks if I > monitor the VFS? Ceph OSDs run on top of a traditional filesystem, as long as it supports xattrs - xfs by default. As such you can use kernel instrumentation to view what is going on "under" the Ceph OSDs. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bypass Cache-Tiering for special reads (Backups)
> I was wondering, having a cache pool in front of an RBD pool is all fine > and dandy, but imagine you want to pull backups of all your VMs (or one > of them, or multiple...). Going to the cache for all those reads isn't > only pointless, it'll also potentially fill up the cache and possibly > evict actually frequently used data. Which got me thinking... wouldn't > it be nifty if there was a special way of doing specific backup reads > where you'd bypass the cache, ensuring the dirty cache contents get > written to cold pool first? Or at least doing special reads where a > cache-miss won't actually cache the requested data? > > AFAIK the backup routine for an RBD-backed KVM usually involves creating > a snapshot of the RBD and putting that into a backup storage/tape, all > done via librbd/API. > > Maybe something like that even already exists? When used in the context of OpenStack Cinder, it does: http://ceph.com/docs/next/rbd/rbd-openstack/#configuring-cinder-backup You can have the backup pool use the default crush rules, assuming the default isn't your hot pool. Another option might be to put backups on an erasure coded pool, I'm not sure if that has been tested, but in principle should work since objects composing a snapshot should be immutable. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Journal SSD durability
> TL;DR: Power outages are more common than your colo facility will admit. Seconded. I've seen power failures in at least 4 different facilities and all of them had the usual gamut of batteries/generators/etc. Some of those facilities I've seen problems multiple times in a single year. Even a datacenter with five nines power availability is going to see > 5m of downtime per year, and that would qualify for the highest rating from the Uptime Institute (Tier IV)! I've lost power to Ceph clusters on several occasions, in all cases the journals were on spinning media. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrate whole clusters
> Anyway replacing the set of monitors means downtime for every client, so > I`m in doubt if the 'no outage' wording is still applicable there. Taking the entire quorum down for migration would be bad. It's better to add one in the new location, remove one at the old, ad infinitum. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
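One rotation step might be sketched like this; names and address are placeholders, and the new monitor must already have been created (ceph-mon --mkfs) and started before it can be added:

```shell
# Add a prepared monitor at the new site, check quorum, then remove one
# at the old site. Repeat once per monitor until the quorum has moved.
rotate_mon() {
    local new_name="$1" new_addr="$2" old_name="$3"
    ceph mon add "${new_name}" "${new_addr}"
    ceph quorum_status       # confirm the new mon joined before removing one
    ceph mon remove "${old_name}"
}
```

Doing this one monitor at a time keeps a majority alive throughout, which is what preserves client access.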
Re: [ceph-users] Migrate whole clusters
> Let's assume a test cluster up and running with real data on it. > Which is the best way to migrate everything to a production (and > larger) cluster? > > I'm thinking to add production MONs to the test cluster, after that, > add productions OSDs to the test cluster, waiting for a full rebalance > and then starting to remove test OSDs and test mons. > > This should migrate everything with no outage. It's possible and I've done it, this was around the argonaut/bobtail timeframe on a pre-production cluster. If your cluster has a lot of data then it may take a long time or be disruptive, make sure you've tested that your recovery tunables are suitable for your hardware configuration. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache
>> >> I think the timing should work that we'll be deploying with Firefly and >> >> so >> >> have Ceph cache pool tiering as an option, but I'm also evaluating >> >> Bcache >> >> versus Tier to act as node-local block cache device. Does anybody have >> >> real >> >> or anecdotal evidence about which approach has better performance? >> > New idea that is dependent on failure behaviour of the cache tier... >> >> The problem with this type of configuration is it ties a VM to a >> specific hypervisor, in theory it should be faster because you don't >> have network latency from round trips to the cache tier, resulting in >> higher iops. Large sequential workloads may achieve higher throughput >> by parallelizing across many OSDs in a cache tier, whereas local flash >> would be limited to single device throughput. > > Ah, I was ambiguous. When I said node-local I meant OSD-local. So I'm really > looking at: > 2-copy write-back object ssd cache-pool > versus > OSD write-back ssd block-cache > versus > 1-copy write-around object cache-pool & ssd journal Ceph cache pools allow you to scale the size of the cache pool independent of the underlying storage and avoids constraints about disk:ssd ratios (for flashcache, bcache, etc). Local block caches should have lower latency than a cache tier for a cache miss, due to the extra hop(s) across the network. I would lean towards using Ceph's cache tiers for the scaling independence. > This is undoubtedly true for a write-back cache-tier. But in the scenario > I'm suggesting, a write-around cache, that needn't be bad news - if a > cache-tier OSD is lost the cache simply just got smaller and some cached > objects were unceremoniously flushed. The next read on those objects should > just miss and bring them into the now smaller cache. > > The thing I'm trying to avoid with the above is double read-caching of > objects (so as to get more aggregate read cache). 
I assume the standard > wisdom with write-back cache-tiering is that the backing data pool shouldn't > bother with ssd journals? Currently, all cache tiers need to be durable - regardless of cache mode. As such, cache tiers should be erasure coded or N+1 replicated (I'd recommend N+2 or 3x replica). Ceph could potentially do what you described in the future, it just doesn't yet. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache
>> Obviously the ssds could be used as journal devices, but I'm not really >> convinced whether this is worthwhile when all nodes have 1GB of hardware >> writeback cache (writes to journal and data areas on the same spindle have >> time to coalesce in the cache and minimise seek time hurt). Any advice on >> this? All writes need to be written to the journal before being written to the data volume, so it's going to impact your overall throughput and cause seeking; a hardware cache will only help with the latter (unless you use btrfs). >> I think the timing should work that we'll be deploying with Firefly and so >> have Ceph cache pool tiering as an option, but I'm also evaluating Bcache >> versus Tier to act as node-local block cache device. Does anybody have real >> or anecdotal evidence about which approach has better performance? > New idea that is dependent on failure behaviour of the cache tier... The problem with this type of configuration is it ties a VM to a specific hypervisor; in theory it should be faster because you don't have network latency from round trips to the cache tier, resulting in higher iops. Large sequential workloads may achieve higher throughput by parallelizing across many OSDs in a cache tier, whereas local flash would be limited to single device throughput. > Carve the ssds 4-ways: each with 3 partitions for journals servicing the > backing data pool and a fourth larger partition serving a write-around cache > tier with only 1 object copy. Thus both reads and writes hit ssd but the ssd > capacity is not halved by replication for availability. > > ...The crux is how the current implementation behaves in the face of cache > tier OSD failures? Cache tiers are durable by way of replication or erasure coding; OSDs will remap degraded placement groups and backfill as appropriate. With single replica cache pools loss of OSDs becomes a real concern; in the case of RBD this means losing arbitrary chunk(s) of your block devices - bad news.
If you want host independence, durability, and speed, your best bet is a replicated cache pool (2-3x). -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
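Attaching a replicated cache pool in front of a base pool uses the standard tiering commands; a sketch, with placeholder pool names (both pools must already exist, and the cache pool should carry its own replication):

```shell
# Attach a durable cache pool in front of a base pool and route client
# IO through it (writeback mode).
setup_cache_tier() {
    local base="$1" cache="$2"
    ceph osd tier add "${base}" "${cache}"
    ceph osd tier cache-mode "${cache}" writeback
    ceph osd tier set-overlay "${base}" "${cache}"   # clients now hit the cache
}
```

In practice you would also set hit_set and target_max_bytes/objects parameters on the cache pool so flushing and eviction behave sensibly.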
Re: [ceph-users] question on harvesting freed space
> I'm assuming Ceph/RBD doesn't have any direct awareness of this since > the file system doesn't traditionally have a "give back blocks" > operation to the block device. Is there anything special RBD does in > this case that communicates the release of the Ceph storage back to the > pool? VMs running a 3.2+ kernel (iirc) can "give back blocks" by issuing TRIM. http://wiki.qemu.org/Features/QED/Trim -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
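For the TRIM path to work end to end, the virtual disk also has to advertise discard support to the guest. With qemu/libvirt that is roughly the following disk definition (an illustrative fragment; the image name and target are placeholders, and discard passthrough needs a reasonably recent qemu/libvirt):

```xml
<!-- virtio-scsi passes guest TRIM through; discard='unmap' forwards it
     to the RBD image. The guest still needs to run fstrim (or mount
     with -o discard) for blocks to be released. -->
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' discard='unmap'/>
  <source protocol='rbd' name='rbd/myvm'/>
  <target dev='sda' bus='scsi'/>
</disk>
```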
Re: [ceph-users] 答复: 答复: why object can't be recovered when delete one replica
> I have run the repair command, and the warning info disappears in the output of "ceph health detail", but the replicas isn't recovered in the "current" directory. > In all, the ceph cluster status can recover (the pg's status recover from inconsistent to active and clean), but not the replica. If you run a pg query does it still show the osd you removed the object from in the acting set? It could be that the pg has a different member now and the restored copy is on another osd. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Error initializing cluster client: Error
> I have two nodes with 8 OSDs on each. First node running 2 monitors on > different virtual machines (mon.1 and mon.2), second node running mon.3 > After several reboots (I have tested power failure scenarios) "ceph -w" on > node 2 always fails with message: > > root@bes-mon3:~# ceph --verbose -w > Error initializing cluster client: Error The cluster is simply protecting itself from a split-brain situation. Say you have: mon.1 mon.2 mon.3 If mon.1 fails, no big deal, you still have 2/3 so no problem. Now instead, say mon.1 is separated from mon.2 and mon.3 because of a network partition (trunk failure, whatever). If one monitor of the three could elect itself as leader then you might have divergence between your monitors. Self-elected mon.1 thinks it's the leader and mon.{2,3} have elected a leader amongst themselves. The harsh reality is you really need to have monitors on 3 distinct physical hosts to protect against the failure of a physical host. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why object can't be recovered when delete one replica
> I upload a file through swift API, then delete it in the “current” directory > in the secondary OSD manually, why the object can’t be recovered? > > If I delete it in the primary OSD, the object is deleted directly in the > pool .rgw.bucket and it can’t be recovered from the secondary OSD. > > Do anyone know this behavior? This is because the placement group containing that object likely needs to scrub (just a light scrub should do). The scrub will compare the two replicas, notice the replica is missing from the secondary and trigger recovery/backfill. Can you try scrubbing the placement group containing the removed object and let us know if it triggers recovery? -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
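To try that, first locate the PG holding the object and then request the scrub; a sketch with placeholder pool, object, and pgid values:

```shell
# Find the PG for an object, then ask that PG to scrub so a missing
# replica is detected and recovery/backfill is triggered.
scrub_object_pg() {
    local pool="$1" obj="$2" pgid="$3"
    ceph osd map "${pool}" "${obj}"   # prints the pgid and acting set
    ceph pg scrub "${pgid}"           # light scrub; 'ceph pg deep-scrub' reads data
}
```

The pgid argument is whatever `ceph osd map` reports for the object, e.g. something like 3.2f.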
Re: [ceph-users] Mounting with dmcrypt still fails
> ceph-disk-prepare --fs-type xfs --dmcrypt --dmcrypt-key-dir > /etc/ceph/dmcrypt-keys --cluster ceph -- /dev/sdb > ceph-disk: Error: Device /dev/sdb2 is in use by a device-mapper mapping > (dm-crypt?): dm-0 It sounds like device-mapper still thinks it's using the volume; you might be able to track it down with this: for i in `ls -1 /sys/block/ | grep sd`; do echo $i: `ls /sys/block/$i/${i}1/holders/`; done Then it's a matter of making sure there are no open file handles on the encrypted volume and unmounting it. You will still need to completely clear out the partition table on that disk, which can be tricky with GPT because it's not as simple as dd'ing over the start of the volume. This is what the --zap-disk option is for in ceph-disk-prepare; I don't know enough about ceph-deploy to know if you can somehow pass it. After you know the device/dm mapping you can use udevadm to find out where it should map to (uuids replaced with xxx's): udevadm test /block/sdc/sdc1 run: '/sbin/cryptsetup --key-file /etc/ceph/dmcrypt-keys/x --key-size 256 create /dev/sdc1' run: '/bin/bash -c 'while [ ! -e /dev/mapper/x ];do sleep 1; done'' run: '/usr/sbin/ceph-disk-activate /dev/mapper/x' -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
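Once the holder is identified, the cleanup amounts to dropping the stale mapping and wiping both GPT copies; a sketch, where the mapping name and device are examples - double-check the device before zapping:

```shell
# Remove a stale plain-mode dm-crypt mapping (matching the 'cryptsetup
# create' in the udev rule above), then destroy primary and backup GPT.
zap_ceph_disk() {
    local mapping="$1" dev="$2"       # e.g. zap_ceph_disk <dm-name> /dev/sdb
    cryptsetup remove "${mapping}"    # tear down the device-mapper mapping
    sgdisk --zap-all "${dev}"         # wipes both GPT headers, not just the start
}
```

sgdisk --zap-all matters here because GPT keeps a backup header at the end of the disk, which a simple dd over the first sectors leaves intact.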
Re: [ceph-users] osd rebalance question
> I need to add an additional server, hosting several osds, to a > running ceph cluster. When adding osds, ceph does not automatically modify > ceph.conf, so I manually modified it > > and restarted the whole ceph cluster with the command 'service ceph -a restart'. > I'm just confused: if I restart the ceph cluster, will ceph rebalance the > whole data set (redistribute all data) among the osds, or just move some > > data from the existing osds to the new osds? Anybody know? It depends on how you added the OSDs: if the initial crush weight is set to 0 then no data will be moved to the OSD when it joins the cluster. Only once the weight has been increased with the rest of the OSD population will data start to move to the new OSD(s). If you add new OSD(s) with an initial weight > 0 then they will start accepting data from peers as soon as they are up/in. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
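The zero-initial-weight approach can be sketched as a small ramp-up loop; the OSD name and weight steps are examples, and in real use you would wait for HEALTH_OK between steps rather than a fixed sleep:

```shell
# Raise a new OSD's CRUSH weight in steps so backfill trickles in
# gradually instead of landing all at once.
# Usage: ramp_osd_weight osd.9 0.5 1.0 1.5 2.0
ramp_osd_weight() {
    local osd="$1"; shift
    local w
    for w in "$@"; do
        ceph osd crush reweight "${osd}" "${w}"
        sleep 60    # placeholder; poll 'ceph health' until clean instead
    done
}
```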
Re: [ceph-users] OSD + FlashCache vs. Cache Pool for RBD...
> One downside of the above arrangement: I read that support for mapping > newer-format RBDs is only present in fairly recent kernels. I'm running > Ubuntu 12.04 on the cluster at present with its stock 3.2 kernel. There > is a PPA for the 3.11 kernel used in Ubuntu 13.10, but if you're looking > at a new deployment it might be better to wait until 14.04: then you'll > get kernel 3.13. > > Anyone else have any ideas on the above? I don't think there are any hairy udev issues or similar that will make using a newer kernel on precise problematic. The only caveat of this kind of setup I can think of is that if you lose a hypervisor the cache will go with it, and you likely won't be able to migrate the guest to another host. The alternative is to use flashcache on top of the OSD partition, but then you introduce network hops; that is closer to what the tiering feature will offer, except the flashcache OSD method is more particular about disk:ssd ratio, whereas in a tier the flash could be on completely separate hosts (possibly dedicated flash machines). -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] What's the difference between using /dev/sdb and /dev/sdb1 as osd?
> If I want to use a disk dedicated for osd, can I just use something like > /dev/sdb instead of /dev/sdb1? Is there any negative impact on performance? You can pass /dev/sdb to ceph-disk-prepare and it will create two partitions, one for the journal (raw partition) and one for the data volume (defaults to formatting xfs). This is known as a single device OSD, in contrast with a multi-device OSD where the journal is on a completely different device (like a partition on a shared journaling SSD). -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] if partition name changes, will ceph get corrupted?
> We use /dev/disk/by-path for this reason, but we confirmed that is stable > for our HBAs. Maybe /dev/disk/by-something is consistent with your > controller. The upstart/udev scripts will handle mounting and osd id detection, at least on Ubuntu. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB
> This is in my lab. Plain passthrough setup with automap enabled on the F5. s3 > & curl work fine as far as queries go. But file transfer rate degrades badly > once I start file up/download. Maybe the difference can be attributed to LAN client traffic with jumbo frames vs F5 using a smaller WAN MTU? -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB
> You're right. Sorry didn't specify I was trying this for Radosgw. Even for > this I'm seeing performance degrade once my clients start to hit the LB VIP. Could you tell us more about your load balancer and configuration? -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Put Ceph Cluster Behind a Pair of LB
> Anybody has a good practice on how to set up a ceph cluster behind a pair of > load balancers? The only place you would want to put a load balancer in the context of a Ceph cluster is north of the RGW nodes. You can do L3 transparent load balancing or balance with an L7 proxy, e.g. Linux Virtual Server or HAProxy/Nginx. The other components of Ceph are horizontally scalable, and because of the way Ceph's native protocols work you don't need load balancers doing L2/L3/L7 tricks to achieve HA. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
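If you go the L7 proxy route, a minimal HAProxy sketch for RGW might look like this; the addresses, ports, and server names are all placeholders for your own RGW nodes:

```
frontend rgw_http
    bind *:80
    mode http
    default_backend rgw_nodes

backend rgw_nodes
    mode http
    balance leastconn
    option httpchk GET /
    server rgw1 10.0.0.11:80 check
    server rgw2 10.0.0.12:80 check
```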
Re: [ceph-users] qemu-rbd
> I tried rbd-fuse and it's throughput using fio is approx. 1/4 that of the > kernel client. > > Can you please let me know how to setup RBD backend for FIO? I'm assuming > this RBD backend is also based on librbd? You will probably have to build fio from source since the rbd engine is new: https://github.com/axboe/fio Assuming you already have a cluster and a client configured this should do the trick: https://github.com/axboe/fio/blob/master/examples/rbd.fio -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
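The upstream example job is roughly the following; this is an illustrative sketch (the image must exist first, e.g. `rbd create fio_test --size 2048`, and the rbd engine only appears in fio builds compiled against librbd):

```ini
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
invalidate=0
rw=randwrite
bs=4k

[rbd_iodepth32]
iodepth=32
```

Because this engine goes through librbd directly, it benchmarks the cluster without any kernel-client or FUSE overhead, which is why it is a fairer comparison than rbd-fuse.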
Re: [ceph-users] Utilizing DAS on XEN or XCP hosts for Openstack Cinder
> 1. Is it possible to install Ceph and Ceph monitors on the XCP > (XEN) Dom0 or would we need to install it on the DomU containing the > Openstack components? I'm not a Xen guru, but in the case of KVM I would run the OSDs on the hypervisor to avoid virtualization overhead. > 2. Is Ceph server aware, or Rack aware so that replicas are not stored > on the same server? Yes, placement is defined with your crush map and placement rules. > 3. Are 4TB OSDs too large? We are attempting to restrict the qty of > OSDs per server to minimise system overhead Nope! > Any other feedback regarding our plan would also be welcomed. I would probably run each disk as its own OSD, which means you need a bit more memory per host. Networking could certainly be a bottleneck with 8 to 16 spindle nodes. YMMV. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
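The server awareness in #2 comes from the chooseleaf step in the CRUSH rule; a sketch of a decompiled rule (the names shown are the stock defaults, adjust to your own map):

```
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    # 'type host' spreads replicas across servers; switch to
    # 'type rack' once the map defines rack buckets
    step chooseleaf firstn 0 type host
}
```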
Re: [ceph-users] Encryption/Multi-tennancy
> There could be millions of tennants. Looking deeper at the docs, it looks > like Ceph prefers to have one OSD per disk. We're aiming at having > backblazes, so will be looking at 45 OSDs per machine, many machines. I want > to separate the tennants and separately encrypt their data. The encryption > will be provided by us, but I was originally intending to have > passphrase-based encryption, and use programmatic means to either hash the > passphrase or/and encrypt it using the same passphrase. This way, we > wouldn't be able to access the tennant's data, or the key for the passphrase, > although we'd still be able to store both.

The way I see it you have several options:

1. Encrypted OSDs

   Preserves confidentiality in the event someone gets physical access to a disk, whether theft or RMA. Requires the tenant to trust the provider.

       vm
       rbd
       rados
       osd    <- here
       disks

2. Whole disk VM encryption

   Preserves confidentiality in the event someone gets physical access to a disk, whether theft or RMA. Possible key arrangements:

       tenant: key/passphrase   provider: nothing
       tenant: passphrase       provider: key
       tenant: nothing          provider: key

       vm     <- here
       rbd
       rados
       osd
       disks

3. Encryption further up the stack (application perhaps?)

To me, #1/#2 are identical except in the case of #2 when the rbd volume is not attached to a VM. Block devices attached to a VM and mounted will be decrypted, making the encryption only useful at defending against unauthorized access to storage media. With a different key per VM, and potentially millions of tenants, you now have a massive key escrow/management problem that only buys you a bit of additional security when block devices are detached. Sounds like a crappy deal to me; I'd either go with #1 or #3. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Recommended node size for Ceph
> Why the limit of 6 OSDs per SSD? SATA/SAS throughput generally. > I am doing testing with a PCI-e based SSD, and showing that even with 15 OSD disk drives per SSD that the SSD is keeping up. That will probably be fine performance wise but it's worth noting that all OSDs will fail if the flash fails (same as node failure). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Encryption/Multi-tennancy
> Ceph is seriously badass, but my requirements are to create a cluster in > which I can host my customer's data in separate areas which are independently > encrypted, with passphrases which we as cloud admins do not have access to. > > My current thoughts are: > 1. Create an OSD per machine stretching over all installed disks, then create > a user-sized block device per customer. Mount this block device on an access > VM and create a LUKS container in to it followed by a zpool and then I can > allow the users to create separate bins of data as separate ZFS filesystems > in the container which is actually a blockdevice striped across the OSDs. > 2. Create an OSD per customer and use dm-crypt, then store the dm-crypt key > somewhere which is rendered in some way so that we cannot access it, such as > a pgp-encrypted file using a passphrase which only the customer knows. > My questions are: > 1. What are people's comments regarding this problem (irrespective of my > thoughts) What is the threat model that leads to these requirements? The story "cloud admins do not have access" is not achievable through technology alone. > 2. Which would be the most efficient of (1) and (2) above? In the case of #1 and #2, you are only protecting data at rest. With #2 you would need to decrypt the key to open the block device, and the key would remain in memory until the device is unmounted (where the cloud admin could access it). This means #2 is safe only so long as you never mount the volume, which makes its utility rather limited (archive perhaps). Neither of these schemes buys you much more than the encryption handling provided by ceph-disk-prepare (dm-crypted OSD data/journal volumes), and the key management problem becomes more acute, e.g. per tenant. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Running a mon on a USB stick
> Is there an issue with IO performance? Ceph monitors store cluster maps and various other things in leveldb, which persists to disk. I wouldn't recommend using SD cards or USB sticks for the monitor store because they tend to be slow and have poor durability. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] questions about ceph cluster in multi-dacenter
> What could be the best replication ? Are you using two sites to increase availability, durability, or both? For availability you're really better off using three sites and using CRUSH to place each of three replicas in a different datacenter. In this setup you can survive losing 1 of 3 datacenters. If two sites are the only option and your goal is availability and durability, then I would do 4 replicas with osd_pool_default_min_size = 2. > How to tune the crushmap of this kind of setup ? > and last question : It's possible to have the reads from vms on DC1 to always > read datas on DC1 ? Not yet! -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
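For the three-site layout, a CRUSH rule along these lines would put one replica in each datacenter (a sketch in the pre-Firefly rule syntax; it assumes `datacenter` buckets already exist in your CRUSH hierarchy, and the rule/ruleset numbers are placeholders):

```
rule replicated_3dc {
    ruleset 1
    type replicated
    min_size 3
    max_size 3
    step take default
    # pick one datacenter per replica...
    step choose firstn 0 type datacenter
    # ...then one host's OSD within each datacenter
    step chooseleaf firstn 1 type host
    step emit
}
```

With size = 3 and this rule, losing a whole datacenter costs you exactly one replica per PG.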
Re: [ceph-users] How client choose among replications?
> Why would it help? Since it's not that ONE OSD will be primary for all objects. There will be 1 Primary OSD per PG and you'll probably have a couple of thousands PGs. The primary may be across an oversubscribed/expensive link, in which case a local replica with a common ancestor to the client may be preferable. It's a work in progress with the goal of landing in Firefly, IIRC. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] poor data distribution
> Change pg_num for .rgw.buckets to power of 2, an 'crush tunables > optimal' didn't help :( Did you bump pgp_num as well? The split PGs will stay in place until pgp_num is raised to match; if you do this, be prepared for (potentially lots of) data movement. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
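As a sketch, the commands look roughly like this (pool name from the thread; the 2048 is a placeholder for whatever power of two pg_num was raised to):

```shell
# see where the pool stands
ceph osd pool get .rgw.buckets pg_num
ceph osd pool get .rgw.buckets pgp_num

# raise pgp_num to match pg_num -- this is what triggers the
# actual (potentially heavy) rebalancing of the split PGs
ceph osd pool set .rgw.buckets pgp_num 2048
```

Until pgp_num matches pg_num, the new PGs exist but stay co-located with their parents, so the distribution doesn't improve.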
Re: [ceph-users] RADOS Gateway Issues
> HEALTH_WARN 1 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs stuck unclean; 7 requests are blocked > 32 sec; 3 osds have slow requests; pool cloudstack has too few pgs; pool .rgw.buckets has too few pgs
> pg 14.0 is stuck inactive since forever, current state incomplete, last acting [5,0]
> pg 14.2 is stuck inactive since forever, current state incomplete, last acting [0,5]
> pg 14.6 is stuck inactive since forever, current state down+incomplete, last acting [4,2]
> pg 14.0 is stuck unclean since forever, current state incomplete, last acting [5,0]
> pg 14.2 is stuck unclean since forever, current state incomplete, last acting [0,5]
> pg 14.6 is stuck unclean since forever, current state down+incomplete, last acting [4,2]
> pg 14.0 is incomplete, acting [5,0]
> pg 14.2 is incomplete, acting [0,5]
> pg 14.6 is down+incomplete, acting [4,2]
> 3 ops are blocked > 8388.61 sec
> 3 ops are blocked > 4194.3 sec
> 1 ops are blocked > 2097.15 sec
> 1 ops are blocked > 8388.61 sec on osd.0
> 1 ops are blocked > 4194.3 sec on osd.0
> 2 ops are blocked > 8388.61 sec on osd.4
> 2 ops are blocked > 4194.3 sec on osd.5
> 1 ops are blocked > 2097.15 sec on osd.5
> 3 osds have slow requests
> pool cloudstack objects per pg (37316) is more than 27.1587 times cluster average (1374)
> pool .rgw.buckets objects per pg (76219) is more than 55.4723 times cluster average (1374)
>
> Ignore the cloudstack pool, I was using cloudstack but not anymore, it's an inactive pool.

You will probably want to check osds 0, 2, 4 and 5 to make sure they are all up and in. Pg 14.6 needs (4,2) and the others need (0,5). Other than that, running a pg query on the inactive/incomplete PGs may provide more insight. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Power Cycle Problems
> On two separate occasions I have lost power to my Ceph cluster. Both times, I > had trouble bringing the cluster back to good health. I am wondering if I > need to config something that would solve this problem? No special configuration should be necessary; I've had the unfortunate luck of witnessing several power loss events with large Ceph clusters, and in both cases something other than Ceph was the source of frustrations once power was returned. That said, monitor daemons should be started first and must form a quorum before the cluster will be usable. It sounds like you have made it that far if you're getting output from "ceph health" commands. The next step is to get your Ceph OSD daemons running, which will require the data partitions to be mounted and the journal device present. In Ubuntu installations this is handled by udev scripts installed by the Ceph packages (I think this may be true for RHEL/CentOS but have not verified). Short of the udev method you can mount the data partition manually. Once the data partition is mounted you can start the OSDs manually in the event that init still doesn't work after mounting; to do so you will need to know the location of your keyring, ceph.conf and the OSD id. If you are unsure of what the OSD id is, look at the root of the OSD data partition, after it is mounted, in a file named "whoami". To manually start:

/usr/bin/ceph-osd -i ${OSD_ID} --pid-file /var/run/ceph/osd.${OSD_ID}.pid -c /etc/ceph/ceph.conf

After that it's a matter of examining the logs if you're still having issues getting the OSDs to boot. > After powering back up the cluster, “ceph health” revealed stale pages, mds > cluster degraded, 3/3 OSDs down. I tried to issue “sudo /etc/init.d/ceph -a > start” but I got no output from the command and the health status did not > change. The placement groups are stale because none of the OSDs have reported their state recently since they are down. 
> I ended up having to re-install the cluster to fix the issue, but as my group > wants to use Ceph for VM storage in the future, we need to find a solution. That's a shame, but at least you will be better prepared if it happens again, hopefully your luck is not as unfortunate as mine! -- Kyle Bader ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Networking questions
> Do monitors have to be on the cluster network as well or is it sufficient > for them to be on the public network as > http://ceph.com/docs/master/rados/configuration/network-config-ref/ > suggests? Monitors only need to be on the public network. > Also would the OSDs re-route their traffic over the public network if that > was still available in case the cluster network fails? Ceph doesn't currently support this type of configuration. Hope that clears things up! -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Failure probability with largish deployments
> Yes, that also makes perfect sense, so the aforementioned 12500 objects > for a 50GB image, at a 60 TB cluster/pool with 72 disk/OSDs and 3 way > replication that makes 2400 PGs, following the recommended formula. > >> > What amount of disks (OSDs) did you punch in for the following run? >> >> Disk Modeling Parameters >> >> size: 3TiB >> >> FIT rate:826 (MTBF = 138.1 years) >> >> NRE rate:1.0E-16 >> >> RADOS parameters >> >> auto mark-out: 10 minutes >> >> recovery rate:50MiB/s (40 seconds/drive) >> > Blink??? >> > I guess that goes back to the number of disks, but to restore 2.25GB at >> > 50MB/s with 40 seconds per drive... >> >> The surviving replicas for placement groups that the failed OSDs >> participated will naturally be distributed across many OSDs in the >> cluster, when the failed OSD is marked out, it's replicas will be >> remapped to many OSDs. It's not a 1:1 replacement like you might find >> in a RAID array. >> > I completely get that part, however the total amount of data to be > rebalanced after a single disk/OSD failure to fully restore redundancy is > still 2.25TB (mistyped that as GB earlier) at the 75% utilization you > assumed. > What I'm still missing in this pictures is how many disks (OSDs) you > calculated this with. Maybe I'm just misreading the 40 seconds per drive > bit there. Because if that means each drive is only required to be just > active for 40 seconds to do it's bit of recovery, we're talking 1100 > drives. ^o^ 1100 PGs would be another story. 
To recreate the modeling:

git clone https://github.com/ceph/ceph-tools.git
cd ceph-tools/models/reliability/
python main.py -g

I used the following values:

Disk Type: Enterprise
Size: 3000 GiB
Primary FITs: 826
Secondary FITs: 826
NRE Rate: 1.0E-16
RAID Type: RAID6
Replace (hours): 6
Rebuild (MiB/s): 500
Volumes: 11
RADOS Copies: 3
Mark-out (min): 10
Recovery (MiB/s): 50
Space Usage: 75%
Declustering (pg): 1100
Stripe length: 1100 (limited by pgs anyway)
RADOS sites: 1
Rep Latency (s): 0
Recovery (MiB/s): 10
Disaster (years): 1000
Site Recovery (days): 30
NRE Model: Fail
Period (years): 1
Object Size: 4MB

It seems that the number of disks is not considered when calculating the recovery window, only the number of pgs https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L68 I could also see the recovery rates varying based on the max osd backfill tunable. http://ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling Doing both would improve the quality of models generated by the tool. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
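The "40 seconds/drive" figure falls out of the model's parallel-recovery assumption: each of a failed OSD's PGs recovers in parallel at the per-OSD recovery rate. A quick back-of-the-envelope check (a sketch, assuming a decimal 3 TB drive and the values above):

```python
# Sketch of the reliability tool's recovery-window logic: the surviving
# replicas of a failed OSD's PGs are spread across the cluster, so all
# of its PGs re-replicate in parallel at the per-OSD recovery rate.

def recovery_seconds(disk_bytes, fullness, rate_mib_s, pgs):
    """Time to re-replicate one failed drive's data, assuming all of
    its PGs recover concurrently at rate_mib_s each."""
    data = disk_bytes * fullness
    return data / (rate_mib_s * 2**20 * pgs)

# 3 TB (decimal) drive, 75% full, 50 MiB/s recovery, 1100 PGs/OSD
t = recovery_seconds(3e12, 0.75, 50, 1100)
print(round(t))  # prints 39 -- matching the tool's "40 seconds/drive"
```

Note the window depends only on the PG count, not the disk count, which is exactly the limitation pointed out above.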
Re: [ceph-users] Failure probability with largish deployments
> Is an object a CephFS file or a RBD image or is it the 4MB blob on the > actual OSD FS? Objects are at the RADOS level, CephFS filesystems, RBD images and RGW objects are all composed by striping RADOS objects - default is 4MB. > In my case, I'm only looking at RBD images for KVM volume storage, even > given the default striping configuration I would assume that those 12500 > OSD objects for a 50GB image would not be in the same PG and thus just on > 3 (with 3 replicas set) OSDs total? Objects are striped across placement groups, so you take your RBD size / 4 MB and cap it at the total number of placement groups in your cluster. > What amount of disks (OSDs) did you punch in for the following run? >> Disk Modeling Parameters >> size: 3TiB >> FIT rate:826 (MTBF = 138.1 years) >> NRE rate:1.0E-16 >> RADOS parameters >> auto mark-out: 10 minutes >> recovery rate:50MiB/s (40 seconds/drive) > Blink??? > I guess that goes back to the number of disks, but to restore 2.25GB at > 50MB/s with 40 seconds per drive... The surviving replicas for placement groups that the failed OSDs participated will naturally be distributed across many OSDs in the cluster, when the failed OSD is marked out, it's replicas will be remapped to many OSDs. It's not a 1:1 replacement like you might find in a RAID array. >> osd fullness: 75% >> declustering:1100 PG/OSD >> NRE model: fail >> object size: 4MB >> stripe length: 1100 > I take it that is to mean that any RBD volume of sufficient size is indeed > spread over all disks? Spread over all placement groups, the difference is subtle but there is a difference. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
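To make the arithmetic concrete, a small sketch (decimal megabytes, matching the 12500-object figure quoted in the thread; the 2400-PG pool size is the earlier poster's number):

```python
# How many RADOS objects back an RBD image with default 4 MB striping,
# and how many PGs it can land on: objects hash pseudo-randomly onto
# PGs, so a large image is spread over min(objects, total PGs).

def rados_objects(image_mb, stripe_mb=4):
    # one RADOS object per stripe unit
    return image_mb // stripe_mb

def pgs_touched(image_mb, total_pgs, stripe_mb=4):
    return min(rados_objects(image_mb, stripe_mb), total_pgs)

print(rados_objects(50_000))       # prints 12500 -- objects for a 50 GB image
print(pgs_touched(50_000, 2400))   # prints 2400 -- capped at the pool's PGs
```

So the image is not on just 3 OSDs, but it is also not on "all disks" per se: it is spread across every PG in the pool.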
Re: [ceph-users] Failure probability with largish deployments
Using your data as inputs to the Ceph reliability calculator [1] results in the following:

Disk Modeling Parameters
    size:     3TiB
    FIT rate: 826 (MTBF = 138.1 years)
    NRE rate: 1.0E-16
RAID parameters
    replace:       6 hours
    recovery rate: 500MiB/s (100 minutes)
    NRE model:     fail
    object size:   4MiB

Column legends
    1 storage unit/configuration being modeled
    2 probability of object survival (per 1 years)
    3 probability of loss due to site failures (per 1 years)
    4 probability of loss due to drive failures (per 1 years)
    5 probability of loss due to NREs during recovery (per 1 years)
    6 probability of loss due to replication failure (per 1 years)
    7 expected data loss per Petabyte (per 1 years)

storage        durability  PL(site)   PL(copies)  PL(NRE)    PL(rep)    loss/PiB
-----------    ----------  ---------  ----------  --------   ---------  ---------
RAID-6: 9+2    6-nines     0.000e+00  2.763e-10   0.11%      0.000e+00  9.317e+07

Disk Modeling Parameters
    size:     3TiB
    FIT rate: 826 (MTBF = 138.1 years)
    NRE rate: 1.0E-16
RADOS parameters
    auto mark-out:  10 minutes
    recovery rate:  50MiB/s (40 seconds/drive)
    osd fullness:   75%
    declustering:   1100 PG/OSD
    NRE model:      fail
    object size:    4MB
    stripe length:  1100

Column legends (same as above)

storage        durability  PL(site)   PL(copies)  PL(NRE)    PL(rep)    loss/PiB
-----------    ----------  ---------  ----------  --------   ---------  ---------
RADOS: 3 cp    10-nines    0.000e+00  5.232e-08   0.000116%  0.000e+00  6.486e+03

[1] https://github.com/ceph/ceph-tools/tree/master/models/reliability -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph network topology with redundant switches
> The area I'm currently investigating is how to configure the > networking. To avoid a SPOF I'd like to have redundant switches for > both the public network and the internal network, most likely running > at 10Gb. I'm considering splitting the nodes in to two separate racks > and connecting each half to its own switch, and then trunk the > switches together to allow the two halves of the cluster to see each > other. The idea being that if a single switch fails I'd only lose half > of the cluster. This is fine if you are using a replication factor of 2, you would need 2/3 of the cluster surviving if using a replication factor 3 with "osd pool default min size" set to 2. > My question is about configuring the public network. If it's all one > subnet then the clients consuming the Ceph resources can't have both > links active, so they'd be configured in an active/standby role. But > this results in quite heavy usage of the trunk between the two > switches when a client accesses nodes on the other switch than the one > they're actively connected to. The linux bonding driver supports several strategies for teaming network adapters on L2 networks. > So, can I configure multiple public networks? I think so, based on the > documentation, but I'm not completely sure. Can I have one half of the > cluster on one subnet, and the other half on another? And then the > client machine can have interfaces in different subnets and "do the > right thing" with both interfaces to talk to all the nodes. This seems > like a fairly simple solution that avoids a SPOF in Ceph or the network > layer. You can have multiple networks for both the public and cluster networks, the only restriction is that all subnets for a given type be within the same supernet. 
For example:

10.0.0.0/16 - Public supernet (configured in ceph.conf)
    10.0.1.0/24 - Public rack 1
    10.0.2.0/24 - Public rack 2
10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
    10.1.1.0/24 - Cluster rack 1
    10.1.2.0/24 - Cluster rack 2

> Or maybe I'm missing an alternative that would be better? I'm aiming > for something that keeps things as simple as possible while meeting > the redundancy requirements. > > As an aside, there's a similar issue on the cluster network side with > heavy traffic on the trunk between the two cluster switches. But I > can't see that's avoidable, and presumably it's something people just > have to deal with in larger Ceph installations? A proper CRUSH configuration is going to place a replica on a node in each rack, which means every write is going to cross the trunk. Other traffic that you will see on the trunk:

* OSDs gossiping with one another
* OSD/Monitor traffic in the case where an OSD is connected to a monitor in the adjacent rack (map updates, heartbeats)
* OSD/Client traffic where the OSD and client are in adjacent racks

If you use all 4 40GbE uplinks (common on 10GbE ToR) then your cluster level bandwidth is oversubscribed 4:1. To lower oversubscription you are going to have to steal some of the other 48 ports: 12 for 2:1 and 24 for a non-blocking fabric. Given the number of nodes you have/plan to have, you will be utilizing 6-12 links per switch, leaving you with 12-18 links for clients on a non-blocking fabric, 24-30 for 2:1 and 36-48 for 4:1. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
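In ceph.conf the supernet layout above would be expressed roughly as follows (a sketch using the example subnets; adjust to your addressing plan):

```ini
[global]
public network  = 10.0.0.0/16
cluster network = 10.1.0.0/16
```

Each OSD host is then addressed out of its rack's /24; since both rack subnets fall within the configured supernets, address autodetection works without per-host configuration.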
Re: [ceph-users] radosgw daemon stalls on download of some files
> Do you have any futher detail on this radosgw bug? https://github.com/ceph/ceph/commit/0f36eddbe7e745665a634a16bf3bf35a3d0ac424 https://github.com/ceph/ceph/commit/0b9dc0e5890237368ba3dc34cb029010cb0b67fd > Does it only apply to emperor? The bug is present in dumpling too. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rbd image performance
>> Has anyone tried scaling a VMs io by adding additional disks and >> striping them in the guest os? I am curious what effect this would have >> on io performance? > Why would it? You can also change the stripe size of the RBD image. Depending on the workload you might change it from 4MB to something like 1MB or 32MB? That would give you more or less RADOS objects which will also give you a different I/O pattern. The question comes up because it's common for people operating on EC2 to stripe EBS volumes together for higher iops rates. I've tried striping kernel RBD volumes before but hit some sort of thread limitation where throughput was consistent despite the volume count. I've since learned the thread limit is configurable. I don't think there is a thread limit that needs to be tweaked for RBD via KVM/QEMU but I haven't tested this empirically. As Wido mentioned, if you are operating your own cluster configuring the stripe size may achieve similar results. Google used to use a 64MB chunk size with GFS but switched to 1MB after they started supporting more and more seek heavy workloads. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] SysAdvent: Day 15 - Distributed Storage with Ceph
For you holiday pleasure I've prepared a SysAdvent article on Ceph: http://sysadvent.blogspot.com/2013/12/day-15-distributed-storage-with-ceph.html Check it out! -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CEPH and Savanna Integration
> Introduction of Savanna for those haven't heard of it: > > Savanna project aims to provide users with simple means to provision a > Hadoop > > cluster at OpenStack by specifying several parameters like Hadoop version, > cluster > > topology, nodes hardware details and a few more. > > For now, Savanna can use Swift as a storage for data that will be processed > by > Hadoop jobs. As far as I know, we can use Hadoop with CephFS. > Is there anybody interested in CEPH and Savanna integration? and how to? You could use a Ceph RADOS gateway as a drop in replacement that provides a Swift compatible endpoint. Alternatively, the docs for using Hadoop in conjunction with CephFS are here: http://ceph.com/docs/master/cephfs/hadoop/ Hopefully that puts you in the right direction! -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] NUMA and ceph
It seems that NUMA can be problematic for ceph-osd daemons in certain circumstances. Namely, if a NUMA zone is running out of memory due to uneven allocation, the zone can enter reclaim mode when threads/processes scheduled on a core in that zone request memory allocations greater than the zone's remaining memory. In order for the kernel to satisfy those allocations it needs to page out some of the contents of the contended zone, which can have dramatic performance implications due to cache misses, etc. I see two ways an operator could alleviate these issues:

1. Set the vm.zone_reclaim_mode sysctl setting to 0 and prefix ceph-osd daemons with "numactl --interleave=all". This should probably be activated by a flag in /etc/default/ceph and by modifying the ceph-osd.conf upstart script, along with adding a dependency on the "numactl" package in the ceph package's debian/rules file.

2. Use a cgroup for each ceph-osd daemon, pinning each one to cores in the same NUMA zone using cpuset.cpus and cpuset.mems. This would probably also live in /etc/default/ceph and the upstart scripts.

-- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
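As a concrete sketch of the first option (the manual commands only; the /etc/default/ceph flag itself is the proposal above, not something the package ships today, and osd id 0 is just an example):

```shell
# disable zone reclaim: allocations spill over to the other NUMA zone
# instead of forcing page-outs from the local one
sysctl -w vm.zone_reclaim_mode=0

# interleave the OSD daemon's allocations across all NUMA zones
numactl --interleave=all /usr/bin/ceph-osd -i 0 -c /etc/ceph/ceph.conf
```

To persist the sysctl, the vm.zone_reclaim_mode = 0 line would go in /etc/sysctl.conf or a file under /etc/sysctl.d/.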
Re: [ceph-users] ceph reliability in large RBD setups
> I've been running similar calculations recently. I've been using this > tool from Inktank to calculate RADOS reliabilities with different > assumptions: > https://github.com/ceph/ceph-tools/tree/master/models/reliability > > But I've also had similar questions about RBD (or any multi-part files > stored in RADOS) -- naively, a file/device stored in N objects would > be N times less reliable than a single object. But I hope there's an > error in that logic. It's worth pointing out that Ceph's RGW will actually stripe S3 objects across many RADOS objects - even when it's not a multi-part upload, this has been the case since the Bobtail release. There is an in-depth Google paper about availability modeling, it might provide some insight into what the math should look like: http://research.google.com/pubs/archive/36737.pdf When reading it you can think of objects as chunks and pgs as stripes. CRUSH should be configured based on failure domains that cause correlated failures, e.g. power and networking. You also want to consider the availability of the facility itself: "Typical availability estimates used in the industry range from 99.7% availability for tier II datacenters to 99.98% and 99.995% for tiers III and IV, respectively." http://www.morganclaypool.com/doi/pdf/10.2200/s00193ed1v01y200905cac006 If you combine the cluster availability metric and the facility availability metric, you might be surprised. A cluster with 99.995% availability in a Tier II facility is going to be dragged down to 99.7% availability. If a cluster goes down in the forest, does anyone know? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
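The combined effect can be sketched with a one-liner (treating cluster and facility failures as independent is an assumption of this back-of-the-envelope model):

```python
# Service availability is bounded by the product of the cluster's
# availability and the hosting facility's availability.

def combined_availability(cluster, facility):
    return cluster * facility

# a 99.995% cluster in a tier II (99.7%) facility
a = combined_availability(0.99995, 0.997)
print(f"{a:.4%}")  # prints 99.6950% -- dragged down to the facility's level
```

In other words, past a certain point more cluster nines buy nothing; the facility dominates.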
Re: [ceph-users] Anybody doing Ceph for OpenStack with OSDs across compute/hypervisor nodes?
> We're running OpenStack (KVM) with local disk for ephemeral storage. > Currently we use local RAID10 arrays of 10k SAS drives, so we're quite rich > for IOPS and have 20GE across the board. Some recent patches in OpenStack > Havana make it possible to use Ceph RBD as the source of ephemeral VM > storage, so I'm interested in the potential for clustered storage across our > hypervisors for this purpose. Any experience out there? I believe Piston converges their storage/compute, they refer to it as a null-tier architecture. http://www.pistoncloud.com/openstack-cloud-software/technology/#storage -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] optimal setup with 4 x ethernet ports
> looking at tcpdump all the traffic is going exactly where it is supposed to > go, in particular an osd on the 192.168.228.x network appears to talk to an > osd on the 192.168.229.x network without anything strange happening. I was > just wondering if there was anything about ceph that could make this > non-optimal, assuming traffic was reasonably balanced between all the osd's > (eg all the same weights). I think the only time it would suffer is if writes > to other osds result in a replica write to a single osd, and even then a > single OSD is still limited to 7200RPM disk speed anyway so the loss isn't > going to be that great. Should be fine given you only have a 1:1 ratio of link to disk. > I think I'll be moving over to bonded setup anyway, although I'm not sure if > rr or lacp is best... rr will give the best potential throughput, but lacp > should give similar aggregate throughput if there are plenty of connections > going on, and less cpu load as no need to reassemble fragments. One of the DreamHost clusters is using a pair of bonded 1GbE links on the public network and another pair for the cluster network, we configured each to use mode 802.3ad. -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] optimal setup with 4 x ethernet ports
>> Is having two cluster networks like this a supported configuration? Every >> osd and mon can reach every other so I think it should be. > > Maybe. If your back end network is a supernet and each cluster network is a > subnet of that supernet. For example: > > Ceph.conf cluster network (supernet): 10.0.0.0/8 > > Cluster network #1: 10.1.1.0/24 > Cluster network #2: 10.1.2.0/24 > > With that configuration OSD address autodection *should* just work. It should work, but thinking more about it, the OSDs will likely be assigned IPs on a single network, whichever is inspected first and matches the supernet range (which could be either subnet). In order to have OSDs on two distinct networks you will likely have to use a declarative configuration in /etc/ceph/ceph.conf which lists the IP address for each OSD (making sure to balance between links). >> 1. move osd traffic to eth1. This obviously limits maximum throughput to >> ~100Mbytes/second, but I'm getting nowhere near that right now anyway. > > Given three links I would probably do this if your replication factor is >= > 3. Keep in mind 100Mbps links could very well end up being a limiting > factor. Sorry, I read Mbytes as Mbps; big difference, the former is much preferable! -- Kyle ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] optimal setup with 4 x ethernet ports
> Is having two cluster networks like this a supported configuration? Every osd and mon can reach every other so I think it should be. Maybe. If your back end network is a supernet and each cluster network is a subnet of that supernet. For example:

Ceph.conf cluster network (supernet): 10.0.0.0/8
Cluster network #1: 10.1.1.0/24
Cluster network #2: 10.1.2.0/24

With that configuration OSD address autodetection *should* just work. > 1. move osd traffic to eth1. This obviously limits maximum throughput to ~100Mbytes/second, but I'm getting nowhere near that right now anyway. Given three links I would probably do this if your replication factor is >= 3. Keep in mind 100Mbps links could very well end up being a limiting factor. What are you backing each OSD with storage wise, and how many OSDs do you expect to participate in this cluster? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Impact of fancy striping
> This journal problem is a bit of wizardry to me, I even had weird intermittent issues with OSDs not starting because the journal was not found, so please do not hesitate to suggest a better journal setup. You mentioned using SAS for journal; if your OSDs are SATA and an expander is in the data path, it might be slow from MUX/STP/etc overhead. If the setup is all SAS you might try collocating each journal with its matching data partition on a single disk; nine OSDs contending for two journal spindles is a lot. How are your drives attached? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] installing OS on software RAID
> > Is the OS doing anything apart from ceph? Would booting a ramdisk-only
> > system from USB or compact flash work?

I haven't tested this kind of configuration myself, but I can't think of anything that would preclude this type of setup. I'd probably use squashfs layered with a tmpfs via aufs to avoid any writes to the USB drive. I would also mount high-capacity spinning media for /var/log, or set up log streaming to something like rsyslog/syslog-ng/logstash.
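As a sketch of the writable pieces of that layout (device names and sizes are hypothetical), /etc/fstab could look like:

```
# tmpfs overlays for the paths that must stay writable
tmpfs       /tmp      tmpfs  defaults,size=512m  0 0
tmpfs       /var/run  tmpfs  defaults,size=64m   0 0
# spinning disk dedicated to logs so the USB stick sees no writes
/dev/sdb1   /var/log  ext4   defaults,noatime    0 2
```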
Re: [ceph-users] Re: testing ceph performance issue
> How much performance can be improved if use SSDs to storage journals?

You will see roughly twice the throughput unless you are using btrfs (still improved, but not as dramatic). You will also see lower latency, because the disk head doesn't have to seek back and forth between the journal and data partitions.

> Kernel RBD Driver , what is this ?

There are several RBD implementations: one is the kernel RBD driver in upstream Linux, another is built into Qemu/KVM.

> and we want to know the RBD if support XEN virual ?

It is possible, but not nearly as well tested and not as prevalent as RBD via Qemu/KVM. This might be a starting point if you're interested in testing Xen/RBD integration:

http://wiki.xenproject.org/wiki/Ceph_and_libvirt_technology_preview

Hope that helps!

--
Kyle
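For the Qemu/KVM path, RBD is usually attached through libvirt; a minimal disk-definition sketch (the pool/image name and monitor address are hypothetical):

```xml
<disk type='network' device='disk'>
  <!-- qemu's built-in RBD client; no kernel module involved -->
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd/vm-disk-1'>
    <host name='10.0.0.1' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
```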
Re: [ceph-users] OSD on an external, shared device
> Is there any way to manually configure which OSDs are started on which
> machines? The osd configuration block includes the osd name and host, so is
> there a way to say that, say, osd.0 should only be started on host vashti
> and osd.1 should only be started on host zadok? I tried using this
> configuration:

The ceph udev rules automatically mount disks that match the ceph "magic" GUIDs; to dig through the full logic you need to inspect these files:

/lib/udev/rules.d/60-ceph-partuuid-workaround.rules
/lib/udev/rules.d/95-ceph-osd.rules

The upstart scripts look at what is mounted under /var/lib/ceph/osd/ and start osd daemons as appropriate:

/etc/init/ceph-osd-all-starter.conf

In theory you should be able to remove the udev scripts and mount the osds in /var/lib/ceph/osd yourself if you're using upstart. You will want to make sure that upgrades to the ceph package don't replace the files; maybe that means making a null rule and using -o Dpkg::Options::="--force-confold" in ceph-deploy/chef/puppet/whatever. You will also want to avoid putting the mounts in fstab, because that could render your node unbootable if the device or filesystem fails.

--
Kyle
Re: [ceph-users] installing OS on software RAID
Several people have reported issues with combining the OS and OSD journals on the same SSD drives/RAID due to contention. If you do something like this, I would definitely test to make sure it meets your expectations. Ceph logs are going to compose the majority of the writes to the OS storage devices.

On Mon, Nov 25, 2013 at 12:46 PM, James Harper wrote:
>>
>> We need to install the OS on the 3TB harddisks that come with our Dell
>> servers. (After many attempts, I've discovered that Dell servers won't allow
>> attaching an external harddisk via the PCIe slot. I've tried everything.)
>>
>> But must I therefore sacrifice two hard disks (RAID-1) for the OS? I don't
>> see why I can't just create a small partition (~30GB) on all 6 of my hard
>> disks, do a software-based RAID 1 on it, and be done.
>>
>> I know that software-based RAID-5 seems computationally expensive, but
>> shouldn't RAID 1 be fast and computationally inexpensive for a computer
>> built over the last 4 years? I wouldn't think that a Ceph system (with lots
>> of VMs but little data change) would even do much writing to the OS
>> partition, but I'm not sure. (And in the past, I have noticed that RAID-5
>> systems did suck up a lot of CPU and caused lots of waits, unlike what the
>> blogs implied. But I'm thinking that RAID 1 takes little CPU and the OS
>> does little writing to disk; it's mostly reads, which should hit the RAM.)
>>
>> Does anyone see any holes in the above idea? Any gut instincts? (I would
>> try it, but it's hard to tell how well the system would really behave under
>> "real" load conditions without some degree of experience and/or strong
>> theoretical knowledge.)
>
> Is the OS doing anything apart from ceph? Would booting a ramdisk-only
> system from USB or compact flash work?
>
> If the OS doesn't produce a lot of writes then having it on the main disk
> should work okay. I've done it exactly as you describe before.
> James

--
Kyle
Re: [ceph-users] misc performance tuning queries (related to OpenStack in particular)
> So quick correction based on Michael's response. In question 4, I should
> have not made any reference to Ceph objects, since objects are not striped
> (per Michael's response). Instead, I should simply have used the words "Ceph
> VM Image" instead of "Ceph objects". A Ceph VM image would constitute 1000s
> of objects, and the different objects are striped/spread across multiple
> OSDs from multiple servers. In that situation, what's the answer to #4?

It depends on which Linux bonding driver is in use: some drivers load-share on transmit, some load-share on receive, some do both, and some only provide active/passive fault tolerance. I have Ceph OSD hosts using LACP (bond-mode 802.3ad) and they load-share on both receive and transmit. We're utilizing a pair of bonded 1GbE links for the Ceph public network and another pair of bonded 1GbE links for the cluster network.

The issues we've seen with 1GbE are complexity, shallow buffers on 1GbE top-of-rack switch gear (Cisco 4948-10G), and the fact that not all flows are equal (4x 1GbE != 4GbE).

--
Kyle
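A sketch of that dual-bond LACP layout in Debian/Ubuntu-style /etc/network/interfaces (interface names and addresses are hypothetical; it assumes the ifenslave package and a switch configured for 802.3ad):

```
# public network bond
auto bond0
iface bond0 inet static
    address 192.168.1.10
    netmask 255.255.255.0
    bond-mode 802.3ad
    bond-miimon 100
    bond-slaves eth0 eth1

# cluster network bond
auto bond1
iface bond1 inet static
    address 10.1.1.10
    netmask 255.255.255.0
    bond-mode 802.3ad
    bond-miimon 100
    bond-slaves eth2 eth3
```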
Re: [ceph-users] Ceph performance
> We have the plan to run ceph as block storage for openstack, but from test
> we found the IOPS is slow.
>
> Our apps primarily use the block storage for saving logs (i.e, nginx's
> access logs). How to improve this?

There are a number of things you can do, notably:

1. Tune the cache on the hypervisor:
   http://ceph.com/docs/master/rbd/rbd-config-ref/

2. Put OSD journals on separate devices, usually SSDs (no more seeking
   between the data and journal volumes on a single disk).

3. Add a flash-based writeback cache for the OSD data volume:
   https://github.com/facebook/flashcache/
   http://bcache.evilpiepirate.org/

If you have any questions let us know!

--
Kyle
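For item 1, the client-side RBD cache settings live in the [client] section of ceph.conf on the hypervisor; a sketch with illustrative values (see the rbd-config-ref link above for the authoritative option list and defaults):

```ini
[client]
    rbd cache = true
    rbd cache size = 33554432                  ; 32 MB per-client cache
    rbd cache max dirty = 25165824             ; start flushing at 24 MB dirty
    rbd cache writethrough until flush = true  ; stay safe for guests that never flush
```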
Re: [ceph-users] Today I’ve encountered multiple OSD down and multiple OSD won’t start and OSD disk access “Input/Output error”
> 3). Comment out, #hashtag the bad OSD drives in the “/etc/fstab”.

This is unnecessary if you're using the provided upstart and udev scripts; OSD data devices will be identified by label and mounted. If you choose not to use the upstart and udev scripts, then you should write init scripts that do the same, so that you don't need /etc/fstab entries.

> 3). Login to Ceph Node with bad OSD net/serial/video.

I'd put "check dmesg" somewhere near the top of this section; if you lose an OSD to a filesystem hiccup, it will often be evident in the dmesg output.

> 4). Stop only this local Ceph node with “service ceph stop”

You may want to set "noout", depending on whether you expect the node to come back online within your "mon osd down out interval" threshold.
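A sketch of the interval knob involved (the value here is just an example; noout itself is toggled at runtime with `ceph osd set noout` / `ceph osd unset noout`):

```ini
; hypothetical ceph.conf fragment
[mon]
    ; how long an OSD may stay "down" before it is marked "out"
    ; and data starts rebalancing (default is 300 seconds)
    mon osd down out interval = 600
```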
Re: [ceph-users] Ceph User Committee
> Would this be something like
> http://wiki.ceph.com/01Planning/02Blueprints/Firefly/Ceph-Brag ?

Something very much like that :)

--
Kyle