[ceph-users] Build Raw Volume from Recovered RBD Objects
All, I was called in to assist in a failed Ceph environment with the cluster in an inoperable state. No rbd volumes are mountable/exportable due to missing PGs. The previous operator was using a replica count of 2. The cluster suffered a power outage and various non-catastrophic hardware issues as they were starting it back up. At some point during recovery, drives were removed from the cluster leaving several PGs missing. Efforts to restore the missing PGs from the data on the removed drives failed using the process detailed in a Red Hat Customer Support blog post [0]. Upon starting the OSDs with recovered PGs, a segfault halts progress. The original operator isn't clear on when, but there may have been a software upgrade applied after the drives were pulled. I believe the cluster may be irrecoverable at this point. My recovery assistance has focused on a plan to: 1) Scrape all objects for several key rbd volumes from live OSDs and the removed former OSD drives. 2) Compare and deduplicate the two copies of each object. 3) Recombine the objects for each volume into a raw image. I have completed steps 1 and 2 with apparent success. My initial stab at step 3 yielded a raw image that could be mounted and had signs of a filesystem, but it could not be read. Could anyone assist me with the following questions? 1) Are the rbd objects in order by filename? If not, what is the method to determine their order? 2) How should objects smaller than the default 4MB chunk size be handled? Should they be padded somehow? 3) If any objects were completely missing and therefore unavailable to this process, how should they be handled? I assume we need to offset/pad to compensate. -- Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 M: 317-490-3018 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
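Regarding the three questions above: in both rbd image formats, the trailing hex field of an object's name encodes its index into the image, so byte offset = index × object size; objects shorter than the chunk size are simply sparse at the tail; and wholly missing objects should become zero-filled holes at their offsets. A rough, untested sketch of step 3 under those assumptions (4 MiB is only the default object size; verify against the image header if it can be recovered, and note `reassemble`/`object_index` are hypothetical helpers, not Ceph tooling):

```python
#!/usr/bin/env python
# Illustrative sketch only -- not a supported Ceph tool. Assumes the
# scraped object files keep their rados names, whose trailing hex field
# is the object's index into the image (e.g. rb.0.75a7.238e1f29.000000000005
# would be the sixth 4 MiB chunk).
import os
import re
import sys

OBJECT_SIZE = 4 * 1024 * 1024  # rbd default order (22) = 4 MiB objects

def object_index(name):
    """Parse the trailing hex index from an rbd object name (question 1)."""
    m = re.search(r'\.([0-9a-f]{12,16})$', name)
    return int(m.group(1), 16) if m else None

def reassemble(obj_dir, out_path, object_size=OBJECT_SIZE):
    with open(out_path, 'wb') as out:
        for name in sorted(os.listdir(obj_dir)):
            idx = object_index(name)
            if idx is None:
                continue  # not an rbd object file
            with open(os.path.join(obj_dir, name), 'rb') as f:
                data = f.read()
            # Questions 2 and 3: seeking makes short objects sparse at the
            # tail and turns wholly missing objects into holes; both read
            # back as zeros, so no explicit padding is required. If the
            # image's *final* objects are missing, truncate out_path to the
            # full image size afterwards so the device length is right.
            out.seek(idx * object_size)
            out.write(data)

if __name__ == '__main__' and len(sys.argv) >= 3:
    reassemble(sys.argv[1], sys.argv[2])
```

If the resulting image still will not mount cleanly, a zero-filled hole where a filesystem metadata object used to live is the likely cause; fsck on a copy of the image may recover some of it.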
Re: [ceph-users] Discuss: New default recovery config settings
With a write-heavy RBD workload, I add the following to ceph.conf: osd_max_backfills = 2 osd_recovery_max_active = 2 If things are going well during recovery (i.e. guests happy and no slow requests), I will often bump both up to three: # ceph tell osd.* injectargs '--osd-max-backfills 3 --osd-recovery-max-active 3' If I see slow requests, I drop them down. The biggest downside to setting either to 1 seems to be the long tail issue detailed in: http://tracker.ceph.com/issues/9566 Thanks, Mike Dawson On 6/3/2015 6:44 PM, Sage Weil wrote: On Mon, 1 Jun 2015, Gregory Farnum wrote: On Mon, Jun 1, 2015 at 6:39 PM, Paul Von-Stamwitz pvonstamw...@us.fujitsu.com wrote: On Fri, May 29, 2015 at 4:18 PM, Gregory Farnum g...@gregs42.com wrote: On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote: Many people have reported that they need to lower the osd recovery config options to minimize the impact of recovery on client io. We are talking about changing the defaults as follows: osd_max_backfills to 1 (from 10) osd_recovery_max_active to 3 (from 15) osd_recovery_op_priority to 1 (from 10) osd_recovery_max_single_start to 1 (from 5) I'm under the (possibly erroneous) impression that reducing the number of max backfills doesn't actually reduce recovery speed much (but will reduce memory use), but that dropping the op priority can. I'd rather we make users manually adjust values which can have a material impact on their data safety, even if most of them choose to do so. After all, even under our worst behavior we're still doing a lot better than a resilvering RAID array. ;) -Greg -- Greg, When we set... osd recovery max active = 1 osd max backfills = 1 We see rebalance times go down by more than half and client write performance increase significantly while rebalancing. We initially played with these settings to improve client IO expecting recovery time to get worse, but we got a 2-for-1. 
This was with firefly using replication, downing an entire node with lots of SAS drives. We left osd_recovery_threads, osd_recovery_op_priority, and osd_recovery_max_single_start default. We dropped osd_recovery_max_active and osd_max_backfills together. If you're right, do you think osd_recovery_max_active=1 is the primary reason for the improvement? (higher osd_max_backfills helps recovery time with erasure coding.) Well, recovery max active and max backfills are similar in many ways. Both are about moving data into a new or outdated copy of the PG; the difference is that recovery refers to our log-based recovery (where we compare the PG logs and move over the objects which have changed) whereas backfill requires us to incrementally move through the entire PG's hash space and compare. I suspect dropping down max backfills is more important than reducing max recovery (gathering recovery metadata happens largely in memory) but I don't really know either way. My comment was meant to convey that I'd prefer we not reduce the recovery op priority levels. :) We could make a less extreme move than to 1, but IMO we have to reduce it one way or another. Every major operator I've talked to does this, our PS folks have been recommending it for years, and I've yet to see a single complaint about recovery times... meanwhile we're drowning in a sea of complaints about the impact on clients. How about osd_max_backfills to 1 (from 10) osd_recovery_max_active to 3 (from 15) osd_recovery_op_priority to 3 (from 10) osd_recovery_max_single_start to 1 (from 5) (same as above, but 1/3rd the recovery op prio instead of 1/10th)? sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Negative amount of objects degraded
Erik, I reported a similar issue 22 months ago. I don't think any developer has ever really prioritized these issues. http://tracker.ceph.com/issues/3720 I was able to recover that cluster. The method I used is in the comments. I have no idea if my cluster was broken for the same reason as yours. Your results may vary. - Mike Dawson On 10/30/2014 4:50 PM, Erik Logtenberg wrote: Thanks for pointing that out. Unfortunately, those tickets contain only a description of the problem, but no solution or workaround. One was opened 8 months ago and the other more than a year ago. No love since. Is there any way I can get my cluster back in a healthy state? Thanks, Erik. On 10/30/2014 05:13 PM, John Spray wrote: There are a couple of open tickets about bogus (negative) stats on PGs: http://tracker.ceph.com/issues/5884 http://tracker.ceph.com/issues/7737 Cheers, John On Thu, Oct 30, 2014 at 12:38 PM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, Yesterday I removed two OSD's, to replace them with new disks. Ceph was not able to completely reach an all active+clean state, but some degraded objects remain. However, the amount of degraded objects is negative (-82), see below: 2014-10-30 13:31:32.862083 mon.0 [INF] pgmap v209175: 768 pgs: 761 active+clean, 7 active+remapped; 1644 GB data, 2524 GB used, 17210 GB / 19755 GB avail; 2799 B/s wr, 1 op/s; -82/1439391 objects degraded (-0.006%) According to rados df, the -82 degraded objects are part of the cephfs-data-cache pool, which is an SSD-backed replicated pool, that functions as a cache pool for an HDD-backed erasure coded pool for cephfs. The cache should be empty, because I issued the rados cache-flush-evict-all command, and rados -p cephfs-data-cache ls indeed shows zero objects in this pool. rados df however does show 192 objects for this pool, with just 35KB used and -82 degraded: pool name category KB objects clones degraded unfound rd rd KB wr wr KB cephfs-data-cache - 35 192 0 -82 0 1119 348800 1198371 1703673493 Please advise...
Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] converting legacy puppet-ceph configured OSDs to look like ceph-deployed OSDs
On 10/15/2014 4:20 PM, Dan van der Ster wrote: Hi Ceph users, (sorry for the novel, but perhaps this might be useful for someone) During our current project to upgrade our cluster from disks-only to SSD journals, we've found it useful to convert our legacy puppet-ceph deployed cluster (using something like the enovance module) to one that looks like it has had its OSD created with ceph-disk prepare. It's been educational for me, and I thought it would be good experience to share. To start, the old puppet-ceph configures OSDs explicitly in ceph.conf, like this: [osd.211] host = p05151113489275 devs = /dev/disk/by-path/pci-:02:00.0-sas-...-lun-0-part1 and ceph-disk list says this about the disks: /dev/sdh : /dev/sdh1 other, xfs, mounted on /var/lib/ceph/osd/osd.211 In other words, ceph-disk doesn't know anything about the OSD living on that disk. Before deploying our SSD journals I was trying to find the best way to map OSDs to SSD journal partitions (in puppet!), but basically there is no good way to do this with the legacy puppet-ceph module. (What we'd have to do is puppetize the partitioning of SSDs, then manually map OSDs to SSD partitions. This would be tedious, and also error prone after disk replacements and reboots). However, I've found that by using ceph-deploy, i.e. ceph-disk, to prepare and activate OSDs, this becomes very simple, trivial even. Using ceph-disk we keep the OSD/SSD mapping out of puppet; instead the state is stored in the OSD itself. (1.5 years ago when we deployed this cluster, ceph-deploy was advertised as a quick tool to spin up small clusters, so we didn't dare use it. I realize now that it (or the puppet/chef/... recipes based on it) is _the_only_way_ to build a cluster if you're starting out today.) Now our problem was that I couldn't go and re-ceph-deploy the whole cluster, since we've got some precious user data there.
Instead, I needed to learn how ceph-disk is labeling and preparing disks, and modify our existing OSDs in place to look like they'd been prepared and activated with ceph-disk. In the end, I've worked out all the configuration and sgdisk magic and put the recipes into a couple of scripts here [1]. Note that I do not expect these to work for any other cluster unmodified. In fact, that would be dangerous, so don't blame me if you break something. But they might be helpful for understanding how the ceph-disk udev magic works and could be a basis for upgrading other clusters. The scripts are: ceph-deployifier/ceph-create-journals.sh: - this script partitions SSDs (assuming sda to sdd) with 5 partitions each - the only trick is to add the partition name 'ceph journal' and set the typecode to the magic JOURNAL_UUID along with a random partition guid ceph-deployifier/ceph-label-disks.sh: - this script discovers the next OSD which is not prepared with ceph-disk, finds an appropriate unused journal partition, and converts the OSD to a ceph-disk prepared lookalike. - aside from the discovery part, the main magic is to: - create the files active, sysvinit and journal_uuid on the OSD - rename the partition to 'ceph data', set the typecode to the magic OSD_UUID, and the partition guid to the OSD's uuid. - link to the /dev/disk/by-partuuid/ journal symlink, and make the new journal - at the end, udev is triggered and the OSD is started (via the ceph-disk activation magic) The complete details are of course in the scripts. (I also have another version of ceph-label-disks.sh that doesn't expect an SSD journal but instead prepares the single disk 2 partitions scheme.) After running these scripts you'll get a nice shiny ceph-disk list output: /dev/sda : /dev/sda1 ceph journal, for /dev/sde1 /dev/sda2 ceph journal, for /dev/sdf1 /dev/sda3 ceph journal, for /dev/sdg1 ...
/dev/sde : /dev/sde1 ceph data, active, cluster ceph, osd.2, journal /dev/sda1 /dev/sdf : /dev/sdf1 ceph data, active, cluster ceph, osd.8, journal /dev/sda2 /dev/sdg : /dev/sdg1 ceph data, active, cluster ceph, osd.12, journal /dev/sda3 ... And all of the udev magic is working perfectly. I've tested all of the reboot, failed OSD, and failed SSD scenarios and it all works as it should. And the puppet-ceph manifest for osd's is now just a very simple wrapper around ceph-disk prepare. (I haven't published ours to github yet, but it is very similar to the stackforge puppet-ceph manifest). There you go, sorry that was so long. I hope someone finds this useful :) Best Regards, Dan [1] https://github.com/cernceph/ceph-scripts/tree/master/tools/ceph-deployifier Dan, Thank you for publishing this! I put some time into this very issue earlier this year, but got pulled in another direction before completing the work. I'd like to bring a production cluster deployed with mkcephfs out of the stone ages, so your work will be very useful to me. Thanks again, Mike Dawson ___ ceph-users mailing list
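For readers curious about the "sgdisk magic" without opening Dan's scripts, the relabeling boils down to a few sgdisk calls per partition. A hypothetical sketch that only assembles those command lines (the two type-code GUIDs are the well-known ceph-disk partition types; the device names and the `label_cmds` helper are made up for illustration, not part of any Ceph tooling):

```python
# Illustrative sketch of the labeling step; it only *builds* the sgdisk
# command lines rather than running them, so nothing here touches a disk.
# The two type-code GUIDs are the well-known ceph-disk partition types;
# device names, partition numbers, and label_cmds itself are hypothetical.
CEPH_DATA_GUID = "4fbd7e29-9d25-41b8-afd0-062c0ceff05d"      # 'ceph data'
CEPH_JOURNAL_GUID = "45b0969e-9b03-4f30-b4c6-b4b80ceff106"   # 'ceph journal'

def label_cmds(dev, partnum, part_uuid, journal=False):
    """Commands that rename a partition and set its ceph type code and
    partition GUID, mirroring what the ceph-label-disks.sh recipe does."""
    name = "ceph journal" if journal else "ceph data"
    typecode = CEPH_JOURNAL_GUID if journal else CEPH_DATA_GUID
    return [
        "sgdisk --change-name=%d:'%s' %s" % (partnum, name, dev),
        "sgdisk --typecode=%d:%s %s" % (partnum, typecode, dev),
        "sgdisk --partition-guid=%d:%s %s" % (partnum, part_uuid, dev),
    ]
```

Once a partition carries the 'ceph data' typecode, a udev trigger (or reboot) is what lets the ceph-disk activation rules find and start the OSD, which is the behavior Dan's scripts rely on.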
Re: [ceph-users] v0.67.11 dumpling released
On 9/25/2014 11:09 AM, Sage Weil wrote: v0.67.11 Dumpling === This stable update for Dumpling fixes several important bugs that affect a small set of users. We recommend that all Dumpling users upgrade at their convenience. If none of these issues are affecting your deployment there is no urgency. Notable Changes --- * common: fix sending dup cluster log items (#9080 Sage Weil) * doc: several doc updates (Alfredo Deza) * libcephfs-java: fix build against older JNI headers (Greg Farnum) * librados: fix crash in op timeout path (#9362 Matthias Kiefer, Sage Weil) * librbd: fix crash using clone of flattened image (#8845 Josh Durgin) * librbd: fix error path cleanup when failing to open image (#8912 Josh Durgin) * mon: fix crash when adjusting pg_num before any OSDs are added (#9052 Sage Weil) * mon: reduce log noise from paxos (Aanchal Agrawal, Sage Weil) * osd: allow scrub and snap trim thread pool IO priority to be adjusted (Sage Weil) Sage, Thanks for the great work! Could you provide any links describing how to tune the scrub and snap trim thread pool IO priority? I couldn't find these settings in the docs. IIUC, 0.67.11 does not include the proposed changes to address #9487 or #9503, right? Thanks, Mike Dawson * osd: fix mount/remount sync race (#9144 Sage Weil) Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.com/download/ceph-0.67.11.tar.gz * For packages, see http://ceph.com/docs/master/install/get-packages * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.67.11 dumpling released
Looks like the packages have partially hit the repo, but at least the following are missing: Failed to fetch http://ceph.com/debian-dumpling/pool/main/c/ceph/librbd1_0.67.11-1precise_amd64.deb 404 Not Found Failed to fetch http://ceph.com/debian-dumpling/pool/main/c/ceph/librados2_0.67.11-1precise_amd64.deb 404 Not Found Failed to fetch http://ceph.com/debian-dumpling/pool/main/c/ceph/python-ceph_0.67.11-1precise_amd64.deb 404 Not Found Failed to fetch http://ceph.com/debian-dumpling/pool/main/c/ceph/ceph_0.67.11-1precise_amd64.deb 404 Not Found Failed to fetch http://ceph.com/debian-dumpling/pool/main/c/ceph/libcephfs1_0.67.11-1precise_amd64.deb 404 Not Found Based on the timestamps of the files that made it, it looks like the process to publish the packages isn't still in progress, but rather failed yesterday. Thanks, Mike Dawson On 9/25/2014 11:09 AM, Sage Weil wrote: v0.67.11 Dumpling === This stable update for Dumpling fixes several important bugs that affect a small set of users. We recommend that all Dumpling users upgrade at their convenience. If none of these issues are affecting your deployment there is no urgency.
Notable Changes --- * common: fix sending dup cluster log items (#9080 Sage Weil) * doc: several doc updates (Alfredo Deza) * libcephfs-java: fix build against older JNI headers (Greg Farnum) * librados: fix crash in op timeout path (#9362 Matthias Kiefer, Sage Weil) * librbd: fix crash using clone of flattened image (#8845 Josh Durgin) * librbd: fix error path cleanup when failing to open image (#8912 Josh Durgin) * mon: fix crash when adjusting pg_num before any OSDs are added (#9052 Sage Weil) * mon: reduce log noise from paxos (Aanchal Agrawal, Sage Weil) * osd: allow scrub and snap trim thread pool IO priority to be adjusted (Sage Weil) * osd: fix mount/remount sync race (#9144 Sage Weil) Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.com/download/ceph-0.67.11.tar.gz * For packages, see http://ceph.com/docs/master/install/get-packages * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Best practice K/M-parameters EC pool
On 8/28/2014 11:17 AM, Loic Dachary wrote: On 28/08/2014 16:29, Mike Dawson wrote: On 8/28/2014 12:23 AM, Christian Balzer wrote: On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote: On 27/08/2014 04:34, Christian Balzer wrote: Hello, On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote: Hi Craig, I assume the reason for the 48 hours recovery time is to keep the cost of the cluster low ? I wrote 1h recovery time because it is roughly the time it would take to move 4TB over a 10Gb/s link. Could you upgrade your hardware to reduce the recovery time to less than two hours ? Or are there factors other than cost that prevent this ? I doubt Craig is operating on a shoestring budget. And even if his network were to be just GbE, that would still make it only 10 hours according to your wishful thinking formula. He probably has set the max_backfills to 1 because that is the level of I/O his OSDs can handle w/o degrading cluster performance too much. The network is unlikely to be the limiting factor. The way I see it most Ceph clusters are in sort of steady state when operating normally, i.e. a few hundred VM RBD images ticking over, most actual OSD disk ops are writes, as nearly all hot objects that are being read are in the page cache of the storage nodes. Easy peasy. Until something happens that breaks this routine, like a deep scrub, all those VMs rebooting at the same time or a backfill caused by a failed OSD. Now all of a sudden client ops compete with the backfill ops, page caches are no longer hot, the spinners are seeking left and right. Pandemonium. I doubt very much that even with a SSD backed cluster you would get away with less than 2 hours for 4TB. To give you some real life numbers, I currently am building a new cluster but for the time being have only one storage node to play with. It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs and 8 actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it. 
So I took out one OSD (reweight 0 first, then the usual removal steps) because the actual disk was wonky. Replaced the disk and re-added the OSD. Both operations took about the same time, 4 minutes for evacuating the OSD (having 7 write targets clearly helped) for a measly 12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the OSD. And that is on one node (thus no network latency) that has the default parameters (so a max_backfill of 10) which was otherwise totally idle. In other words, in this pretty ideal case it would have taken 22 hours to re-distribute 4TB. That makes sense to me :-) When I wrote 1h, I thought about what happens when an OSD becomes unavailable with no planning in advance. In the scenario you describe the risk of a data loss does not increase since the objects are evicted gradually from the disk being decommissioned and the number of replicas stays the same at all times. There is not a sudden drop in the number of replicas which is what I had in mind. That may be, but I'm rather certain that there is no difference in speed and priority of a rebalancing caused by an OSD set to weight 0 or one being set out. If the lost OSD was part of 100 PG, the other disks (let's say 50 of them) will start transferring a new replica of the objects they have to the new OSD in their PG. The replacement will not be a single OSD although nothing prevents the same OSD from being used in more than one PG as a replacement for the lost one. If the cluster network is connected at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new duplicates do not originate from a single OSD but from at least dozens of them and since they target more than one OSD, I assume we can expect an actual throughput of 5Gb/s. I should have written 2h instead of 1h to account for the fact that the cluster network is never idle. Am I being too optimistic? Vastly. Do you see another blocking factor that would significantly slow down recovery?
As Craig and I keep telling you, the network is not the limiting factor. Concurrent disk IO is, as I pointed out in the other thread. Completely agree. On a production cluster with OSDs backed by spindles, even with OSD journals on SSDs, it is insufficient to calculate single-disk replacement backfill time based solely on network throughput. IOPS will likely be the limiting factor when backfilling a single failed spinner in a production cluster. Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio of 3:1), with dual 1GbE bonded NICs. Using only the throughput math, backfill could have theoretically completed in a bit over 2.5 hours, but it actually took 15 hours. I've done this a few times with similar results. Why? Spindle contention on the replacement drive. Graph the '%util' metric from something like 'iostat -xt 2' during a single disk backfill to get a very clear view that spindle contention is the true limiting factor.
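Mike's numbers are easy to sanity-check with back-of-the-envelope arithmetic (all figures are approximations taken from the paragraph above, not fresh measurements):

```python
# Back-of-the-envelope check of the replacement-drive example above
# (3 TB drive ~75% full, dual bonded 1GbE ~ 2 Gb/s ~ 250 MB/s raw).
data_bytes = 0.75 * 3e12            # ~2.25 TB to backfill
link_bytes_per_s = 250e6            # theoretical bonded-pair throughput

theoretical_h = data_bytes / link_bytes_per_s / 3600   # ~2.5 hours
actual_h = 15.0                                        # observed
effective_mb_per_s = data_bytes / (actual_h * 3600) / 1e6

# The drive was actually fed at roughly 42 MB/s -- a ~6x gap that
# throughput math cannot explain, consistent with spindle contention
# (seeks competing between backfill writes and client IO) being the cap.
```

The same arithmetic explains why a "1h recovery over 10Gb/s" estimate is optimistic for spinner-backed OSDs: the disk, not the wire, sets the ceiling.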
Re: [ceph-users] Best practice K/M-parameters EC pool
On 8/28/2014 4:17 PM, Craig Lewis wrote: My initial experience was similar to Mike's, causing a similar level of paranoia. :-) I'm dealing with RadosGW though, so I can tolerate higher latencies. I was running my cluster with noout and nodown set for weeks at a time. I'm sure Craig will agree, but wanted to add this for other readers: I find value in the noout flag for temporary intervention, but prefer to set mon osd down out interval for dealing with events that may occur in the future to give an operator time to intervene. The nodown flag is another beast altogether. The nodown flag tends to be *a bad thing* when attempting to provide reliable client io. For our use case, we want OSDs to be marked down quickly if they are in fact unavailable for any reason, so client io doesn't hang waiting for them. If OSDs are flapping during recovery (i.e. the wrongly marked me down log messages), I've found far superior results by tuning the recovery knobs than by permanently setting the nodown flag. - Mike Recovery of a single OSD might cause other OSDs to crash. In the primary cluster, I was always able to get it under control before it cascaded too wide. In my secondary cluster, it did spiral out to 40% of the OSDs, with 2-5 OSDs down at any time. I traced my problems to a combination of osd max backfills was too high for my cluster, and my mkfs.xfs arguments were causing memory starvation issues. I lowered osd max backfills, added SSD journals, and reformatted every OSD with better mkfs.xfs arguments. Now both clusters are stable, and I don't want to break it. I only have 45 OSDs, so the risk with a 24-48 hours recovery time is acceptable to me. It will be a problem as I scale up, but scaling up will also help with the latency problems. On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson mike.daw...@cloudapt.com mailto:mike.daw...@cloudapt.com wrote: We use 3x replication and have drives that have relatively high steady-state IOPS. 
Therefore, we tend to prioritize client-side IO more than a reduction from 3 copies to 2 during the loss of one disk. The disruption to client io is so great on our cluster, we don't want our cluster to be in a recovery state without operator supervision. Letting OSDs get marked out without operator intervention was a disaster in the early going of our cluster. For example, an OSD daemon crash would trigger automatic recovery where it was unneeded. Ironically, the unneeded recovery would often trigger additional daemons to crash, making a bad situation worse. During the recovery, rbd client io would often go to 0. To deal with this issue, we set mon osd down out interval = 14400, so as operators we have 4 hours to intervene before Ceph attempts to self-heal. When hardware is at fault, we remove the osd, replace the drive, re-add the osd, then allow backfill to begin, thereby completely skipping step B in your timeline above. - Mike ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to avoid deep-scrubbing performance hit?
Craig, I've struggled with the same issue for quite a while. If your i/o is similar to mine, I believe you are on the right track. For the past month or so, I have been running this cronjob: * * * * * for strPg in `ceph pg dump | egrep '^[0-9]\.[0-9a-f]{1,4}' | sort -k20 | awk '{ print $1 }' | head -2`; do ceph pg deep-scrub $strPg; done That roughly handles my 20672 PGs that are set to be deep-scrubbed every 7 days. Your script may be a bit better, but this quick and dirty method has helped my cluster maintain more consistency. The real key for me is to avoid the clumpiness I have observed without that hack, where concurrent deep-scrubs sit at zero for a long period of time (despite having PGs that were months overdue for a deep-scrub), then concurrent deep-scrubs suddenly spike up and stay in the teens for hours, killing client writes/second. The scrubbing behavior table[0] indicates that a periodic tick initiates scrubs on a per-PG basis. Perhaps the timing of ticks isn't sufficiently randomized when you restart lots of OSDs concurrently (for instance via pdsh). On my cluster I suffer a significant drag on client writes/second when I exceed perhaps four or five concurrent PGs in deep-scrub. When concurrent deep-scrubs get into the teens, I get a massive drop in client writes/second. Greg, is there locking involved when a PG enters deep-scrub? If so, is the entire PG locked for the duration or is each individual object inside the PG locked as it is processed? Some of my PGs will be in deep-scrub for minutes at a time. 0: http://ceph.com/docs/master/dev/osd_internals/scrub/ Thanks, Mike Dawson On 6/9/2014 6:22 PM, Craig Lewis wrote: I've correlated a large deep scrubbing operation to cluster stability problems. My primary cluster does a small amount of deep scrubs all the time, spread out over the whole week. It has no stability problems. My secondary cluster doesn't spread them out. It saves them up, and tries to do all of the deep scrubs over the weekend.
The secondary starts losing OSDs about an hour after these deep scrubs start. To avoid this, I'm thinking of writing a script that continuously scrubs the oldest outstanding PG. In pseudo-bash: # Sort by the deep-scrub timestamp, taking the single oldest PG while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21, $1}' | sort | head -1 | read date time pg do ceph pg deep-scrub ${pg} while ceph status | grep scrubbing+deep do sleep 5 done sleep 30 done Does anybody think this will solve my problem? I'm also considering disabling deep-scrubbing until the secondary finishes replicating from the primary. Once it's caught up, the write load should drop enough that opportunistic deep scrubs should have a chance to run. It should only take another week or two to catch up. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
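If the shell parsing in the pseudo-bash above gets unwieldy, the selection step can be sketched in Python. The column positions mirror the $20/$21 used in the awk, but they differ between Ceph releases, so verify them against your own `ceph pg dump` output first; `oldest_deep_scrubbed` is an illustrative helper, not an official tool:

```python
# Hedged sketch: pick the PG with the oldest deep-scrub stamp from plain
# `ceph pg dump` output. awk's $20/$21 (1-indexed) are fields[19]/[20]
# here (0-indexed); check these offsets against your Ceph version.
import re

PG_RE = re.compile(r'^[0-9a-f]+\.[0-9a-f]+$')

def oldest_deep_scrubbed(pg_dump_text):
    """Return the pgid whose deep-scrub date+time stamp sorts earliest,
    or None if no PG lines are found."""
    candidates = []
    for line in pg_dump_text.splitlines():
        fields = line.split()
        if len(fields) > 20 and PG_RE.match(fields[0]):
            # ISO-style timestamps sort correctly as plain strings
            candidates.append((fields[19] + ' ' + fields[20], fields[0]))
    return min(candidates)[1] if candidates else None
```

A driving loop would then shell out to `ceph pg deep-scrub <pg>` and wait for `scrubbing+deep` to clear, exactly as in Craig's pseudo-bash.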
Re: [ceph-users] Calamari Goes Open Source
Great work Inktank / Red Hat! An open source Calamari will be a great benefit to the community! Cheers, Mike Dawson On 5/30/2014 6:04 PM, Patrick McGarry wrote: Hey cephers, Sorry to push this announcement so late on a Friday but... Calamari has arrived! The source code bits have been flipped, the ticket tracker has been moved, and we have even given you a little bit of background from both a technical and vision point of view: Technical (ceph.com): http://ceph.com/community/ceph-calamari-goes-open-source/ Vision (inktank.com): http://www.inktank.com/software/future-of-calamari/ The ceph.com link should give you everything you need to know about what tech comprises Calamari, where the source lives, and where the discussions will take place. If you have any questions feel free to hit the new ceph-calamari list or stop by IRC and we'll get you started. Hope you all enjoy the GUI! Best Regards, Patrick McGarry Director, Community || Inktank http://ceph.com || http://inktank.com @scuttlemonkey || @ceph || @inktank ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Multiple L2 LAN segments with Ceph
Travis, We run a routed ECMP spine-leaf network architecture with Ceph and have no issues on the network side whatsoever. Each leaf switch has an L2 cidr block inside a common L3 supernet. We do not currently split cluster_network and public_network. If we did, we'd likely build a separate spine-leaf network with its own L3 supernet. A simple IPv4 example: - ceph-cluster: 10.1.0.0/16 - cluster-leaf1: 10.1.1.0/24 - node1: 10.1.1.1/24 - node2: 10.1.1.2/24 - cluster-leaf2: 10.1.2.0/24 - ceph-public: 10.2.0.0/16 - public-leaf1: 10.2.1.0/24 - node1: 10.2.1.1/24 - node2: 10.2.1.2/24 - public-leaf2: 10.2.2.0/24 ceph.conf would be: cluster_network: 10.1.0.0/255.255.0.0 public_network: 10.2.0.0/255.255.0.0 - Mike Dawson On 5/28/2014 1:01 PM, Travis Rhoden wrote: Hi folks, Does anybody know if there are any issues running Ceph with multiple L2 LAN segments? I'm picturing a large multi-rack/multi-row deployment where you may give each rack (or row) its own L2 segment, then connect them all with L3/ECMP in a leaf-spine architecture. I'm wondering how cluster_network (or public_network) in ceph.conf works in this case. Does that directive just tell a daemon starting on a particular node which network to bind to? Or is it a CIDR that has to be accurate for every OSD and MON in the entire cluster? Thanks, - Travis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
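In this design, each daemon only needs its own local address to fall inside the configured network CIDR, which the /16 supernet guarantees for every leaf /24. A tiny containment check using the hypothetical addresses from the example above:

```python
# Containment check for the supernet example above: every node address
# in a leaf /24 must fall inside the /16 given to cluster_network or
# public_network. Addresses are the hypothetical ones from the example.
import ipaddress

cluster_net = ipaddress.ip_network('10.1.0.0/16')
public_net = ipaddress.ip_network('10.2.0.0/16')

def binds_ok(addr, net):
    """Would a daemon with this local address match the configured network?"""
    return ipaddress.ip_address(addr) in net
```

Running the check against node addresses from both leaves (e.g. 10.1.1.1 and 10.1.2.x against cluster_net) confirms the supernet covers them all, so no per-rack ceph.conf differences are needed.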
Re: [ceph-users] How to find the disk partitions attached to a OSD
Perhaps: # mount | grep ceph - Mike Dawson On 5/21/2014 11:00 AM, Sharmila Govind wrote: Hi, I am new to Ceph. I have a storage node with 2 OSDs. I am trying to figure out which physical device/partition each of the OSDs is attached to. Is there a command that can be executed on the storage node to find out the same? Thanks in Advance, Sharmila ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to find the disk partitions attached to a OSD
Looks like you may not have any OSDs properly set up and mounted. It should look more like:

user@host:~# mount | grep ceph
/dev/sdb1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,noatime,inode64)
/dev/sdc1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime,inode64)
/dev/sdd1 on /var/lib/ceph/osd/ceph-2 type xfs (rw,noatime,inode64)

Confirm the OSDs in your ceph cluster with:

user@host:~# ceph osd tree

- Mike

On 5/21/2014 11:15 AM, Sharmila Govind wrote: Hi Mike, Thanks for your quick response. When I try mount on the storage node this is what I get:

root@cephnode4:~# mount
/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
/dev/sdb on /mnt/CephStorage1 type ext4 (rw)
/dev/sdc on /mnt/CephStorage2 type ext4 (rw)
/dev/sda7 on /mnt/Storage type ext4 (rw)
/dev/sda2 on /boot type ext4 (rw)
/dev/sda5 on /home type ext4 (rw)
/dev/sda6 on /mnt/CephStorage type ext4 (rw)

Is there anything wrong with the setup I have? I don't have any 'ceph' related mounts. Thanks, Sharmila

On Wed, May 21, 2014 at 8:34 PM, Mike Dawson mike.daw...@cloudapt.com wrote: Perhaps: # mount | grep ceph - Mike Dawson On 5/21/2014 11:00 AM, Sharmila Govind wrote: Hi, I am new to Ceph. I have a storage node with 2 OSDs. I am trying to figure out which physical device/partition each of the OSDs is attached to. Is there a command that can be executed on the storage node to find this out?
Thanks in Advance, Sharmila ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] PG Selection Criteria for Deep-Scrub
Today I noticed that deep-scrub is consistently missing some of my Placement Groups, leaving me with the following distribution of PGs and the last day they were successfully deep-scrubbed.

# ceph pg dump all | grep active | awk '{ print $20}' | sort -k1 | uniq -c
   5 2013-11-06
 221 2013-11-20
   1 2014-02-17
  25 2014-02-19
  60 2014-02-20
   4 2014-03-06
   3 2014-04-03
   6 2014-04-04
   6 2014-04-05
  13 2014-04-06
   4 2014-04-08
   3 2014-04-10
   2 2014-04-11
  50 2014-04-12
  28 2014-04-13
  14 2014-04-14
   3 2014-04-15
  78 2014-04-16
  44 2014-04-17
   8 2014-04-18
   1 2014-04-20
  16 2014-05-02
  69 2014-05-04
 140 2014-05-05
 569 2014-05-06
9231 2014-05-07
 103 2014-05-08
 514 2014-05-09
1593 2014-05-10
 393 2014-05-16
2563 2014-05-17
1283 2014-05-18
1640 2014-05-19
1979 2014-05-20

I have been running the default osd deep scrub interval of once per week, but have disabled deep-scrub on several occasions in an attempt to avoid the associated degraded cluster performance I have written about before. To get the PGs longest in need of a deep-scrub started, I set the nodeep-scrub flag and wrote a script to manually kick off deep-scrubs according to age. It is processing as expected. Do you consider this a feature request or a bug? Perhaps the code that schedules PGs to deep-scrub could be improved to prioritize PGs that have needed a deep-scrub the longest. Thanks, Mike Dawson ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
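The "kick off deep-scrubs by age" script mentioned above is not shown, but a hedged sketch might look like the following. Column positions follow the awk command in this message ($1 = pgid, $20 = last deep-scrub date on this version); they shift between releases, so verify against the `ceph pg dump` header row first.

```python
# Hedged sketch of a script that deep-scrubs the PGs that have waited
# longest. The live-cluster path requires the ceph CLI and admin keyring.
import subprocess

def oldest_pgs(pg_dump_lines, limit):
    """Return the `limit` pgids with the oldest deep-scrub dates."""
    stamped = []
    for line in pg_dump_lines:
        cols = line.split()
        if len(cols) >= 20 and "active" in line:
            stamped.append((cols[19], cols[0]))  # (date, pgid)
    stamped.sort()
    return [pgid for _, pgid in stamped[:limit]]

def kick_deep_scrubs(limit=10):
    # Live-cluster path; not exercised by the demo below.
    dump = subprocess.check_output(["ceph", "pg", "dump"]).decode()
    for pgid in oldest_pgs(dump.splitlines(), limit):
        subprocess.check_call(["ceph", "pg", "deep-scrub", pgid])

# Demo with synthetic 20-column rows in pg dump order
demo = [
    "2.b active+clean " + "x " * 17 + "2014-05-20",
    "1.a active+clean " + "x " * 17 + "2013-11-06",
]
print(oldest_pgs(demo, 1))  # ['1.a']
```

Run `kick_deep_scrubs()` in small batches (the `limit` parameter) so manually triggered deep-scrubs don't reproduce the very IO contention being avoided.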
Re: [ceph-users] PG Selection Criteria for Deep-Scrub
I tend to set it whenever I don't want to be bothered by storage performance woes (nights I value sleep, etc). This cluster is bounded by relentless small writes (it has a couple dozen rbd volumes backing video surveillance DVRs). Some of the software we run is completely unaffected whereas other software falls apart during periods of deep-scrubs. I theorize it has to do with the individual software's attitude about flushing to disk / buffering. - Mike

On 5/20/2014 8:31 PM, Aaron Ten Clay wrote: For what it's worth, version 0.79 has different headers, and the awk command needs $19 instead of $20. But here is the output I have on a small cluster that I recently rebuilt:

$ ceph pg dump all | grep active | awk '{ print $19}' | sort -k1 | uniq -c
dumped all in format plain
   1 2014-05-15
   2 2014-05-17
  19 2014-05-18
 193 2014-05-19
 105 2014-05-20

I have set noscrub and nodeep-scrub, as well as noout and nodown, off and on while I performed various maintenance, but that hasn't (apparently) impeded the regular schedule. With what frequency are you setting the nodeep-scrub flag? -Aaron

On Tue, May 20, 2014 at 5:21 PM, Mike Dawson mike.daw...@cloudapt.com wrote: Today I noticed that deep-scrub is consistently missing some of my Placement Groups, leaving me with the following distribution of PGs and the last day they were successfully deep-scrubbed.
# ceph pg dump all | grep active | awk '{ print $20}' | sort -k1 | uniq -c
   5 2013-11-06
 221 2013-11-20
   1 2014-02-17
  25 2014-02-19
  60 2014-02-20
   4 2014-03-06
   3 2014-04-03
   6 2014-04-04
   6 2014-04-05
  13 2014-04-06
   4 2014-04-08
   3 2014-04-10
   2 2014-04-11
  50 2014-04-12
  28 2014-04-13
  14 2014-04-14
   3 2014-04-15
  78 2014-04-16
  44 2014-04-17
   8 2014-04-18
   1 2014-04-20
  16 2014-05-02
  69 2014-05-04
 140 2014-05-05
 569 2014-05-06
9231 2014-05-07
 103 2014-05-08
 514 2014-05-09
1593 2014-05-10
 393 2014-05-16
2563 2014-05-17
1283 2014-05-18
1640 2014-05-19
1979 2014-05-20

I have been running the default osd deep scrub interval of once per week, but have disabled deep-scrub on several occasions in an attempt to avoid the associated degraded cluster performance I have written about before. To get the PGs longest in need of a deep-scrub started, I set the nodeep-scrub flag, and wrote a script to manually kick off deep-scrub according to age. It is processing as expected. Do you consider this a feature request or a bug? Perhaps the code that schedules PGs to deep-scrub could be improved to prioritize PGs that have needed a deep-scrub the longest. Thanks, Mike Dawson ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Occasional Missing Admin Sockets
All, I have a recurring issue where the admin sockets (/var/run/ceph/ceph-*.*.asok) may vanish on a running cluster while the daemons keep running (or restart without my knowledge). I see this issue on a dev cluster running Ubuntu and Ceph Emperor/Firefly, deployed with ceph-deploy, using Upstart to control daemons. I never see this issue on Ubuntu / Dumpling / sysvinit. Has anyone else seen this issue or know the likely cause? -- Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Monitoring ceph statistics using rados python module
Adrian, Yes, it is single-OSD oriented. Like Haomai, we monitor perf dumps from individual OSD admin sockets. On new enough versions of Ceph, you can run 'ceph daemon osd.x perf dump', which is a shorter way to ask for the same output as 'ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump'. Keep in mind, either version has to be run locally on the host where osd.x is running. We use Sensu to take samples and push them to Graphite. We can then build dashboards showing the whole cluster, units in our CRUSH tree, hosts, or individual OSDs. I have found that monitoring each OSD's admin daemon is critical. Oftentimes a single OSD can affect performance of the entire cluster. Without individual data, these types of issues can be quite difficult to pinpoint. Also, note that Inktank has developed Calamari. There are rumors that it may be open sourced at some point in the future. Cheers, Mike Dawson

On 5/13/2014 12:33 PM, Adrian Banasiak wrote: Thanks for the suggestion about the admin daemon, but it looks single-OSD oriented. I have used perf dump on a mon socket and it outputs some interesting data for monitoring the whole cluster:

{ cluster: { num_mon: 4,
      num_mon_quorum: 4,
      num_osd: 29,
      num_osd_up: 29,
      num_osd_in: 29,
      osd_epoch: 1872,
      osd_kb: 20218112516,
      osd_kb_used: 5022202696,
      osd_kb_avail: 15195909820,
      num_pool: 4,
      num_pg: 3500,
      num_pg_active_clean: 3500,
      num_pg_active: 3500,
      num_pg_peering: 0,
      num_object: 400746,
      num_object_degraded: 0,
      num_object_unfound: 0,
      num_bytes: 1678788329609,
      num_mds_up: 0,
      num_mds_in: 0,
      num_mds_failed: 0,
      mds_epoch: 1},

Unfortunately, cluster-wide IO statistics are still missing.

2014-05-13 17:17 GMT+02:00 Haomai Wang haomaiw...@gmail.com: Not sure about your demand. I use ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump to get the monitor info. And the result can be parsed by simplejson easily via python.
On Tue, May 13, 2014 at 10:56 PM, Adrian Banasiak adr...@banasiak.it wrote: Hi, I am working with a test Ceph cluster and now I want to implement Zabbix monitoring with items such as: - whole cluster IO (for example ceph -s - recovery io 143 MB/s, 35 objects/s) - pg statistics. I would like to create a single script in python to retrieve values using the rados python module, but there is little information in the documentation about module usage. I've created a single function which calculates all pools' current read/write statistics, but I can't find out how to add recovery IO usage and pg statistics:

read = 0
write = 0
for pool in conn.list_pools():
    io = conn.open_ioctx(pool)
    stats[pool] = io.get_stats()
    read += int(stats[pool]['num_rd'])
    write += int(stats[pool]['num_wr'])

Could someone share his knowledge about the rados module for retrieving ceph statistics? BTW Ceph is awesome! -- Best regards, Adrian Banasiak email: adr...@banasiak.it -- Best Regards, Wheat -- Regards, Adrian Banasiak email: adr...@banasiak.it ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
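A cleaned-up, hedged version of Adrian's pool-stats loop. The pure aggregation is separated from the cluster calls so it can be checked on its own; the `collect_pool_stats` path requires python-rados and a reachable cluster, and the `conffile` default is an assumption.

```python
# Sketch based on the snippet above: sum cumulative read/write op counts
# across per-pool get_stats() dicts (keys 'num_rd'/'num_wr' as in the
# original code).
def total_io(stats_by_pool):
    """Return (total_reads, total_writes) across per-pool stat dicts."""
    read = sum(int(s["num_rd"]) for s in stats_by_pool.values())
    write = sum(int(s["num_wr"]) for s in stats_by_pool.values())
    return read, write

def collect_pool_stats(conffile="/etc/ceph/ceph.conf"):
    import rados  # requires python-rados on a cluster node
    conn = rados.Rados(conffile=conffile)
    conn.connect()
    try:
        stats = {}
        for pool in conn.list_pools():
            ioctx = conn.open_ioctx(pool)
            try:
                stats[pool] = ioctx.get_stats()
            finally:
                ioctx.close()
        return stats
    finally:
        conn.shutdown()

print(total_io({"rbd": {"num_rd": 10, "num_wr": 4},
                "data": {"num_rd": 5, "num_wr": 6}}))  # (15, 10)
```

Note these counters are cumulative, so a monitoring item like Zabbix's should sample `total_io(collect_pool_stats())` periodically and derive rates from the deltas.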
Re: [ceph-users] Occasional Missing Admin Sockets
Greg/Loic, I can confirm that logrotate --force /etc/logrotate.d/ceph removes the monitor admin socket on my boxes running 0.80.1 just like the description in Issue 7188 [0]. 0: http://tracker.ceph.com/issues/7188 Should that bug be reopened? Thanks, Mike Dawson On 5/13/2014 2:10 PM, Gregory Farnum wrote: On Tue, May 13, 2014 at 9:06 AM, Mike Dawson mike.daw...@cloudapt.com wrote: All, I have a recurring issue where the admin sockets (/var/run/ceph/ceph-*.*.asok) may vanish on a running cluster while the daemons keep running Hmm. (or restart without my knowledge). I'm guessing this might be involved: I see this issue on a dev cluster running Ubuntu and Ceph Emperor/Firefly, deployed with ceph-deploy using Upstart to control daemons. I never see this issue on Ubuntu / Dumpling / sysvinit. *goes and greps the git log* I'm betting it was commit 45600789f1ca399dddc5870254e5db883fb29b38 (which has, in fact, been backported to dumpling and emperor), intended so that turning on a new daemon wouldn't remove the admin socket of an existing one. But I think that means that if you activate the new daemon before the old one has finished shutting down and unlinking, you would end up with a daemon that had no admin socket. Perhaps it's an incomplete fix and we need a tracker ticket? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.80 Firefly released
Andrey, In initial testing, it looks like it may work rather efficiently.

1) Upgrade all mon, osd, and clients to Firefly. Restart everything so no legacy ceph code is running.

2) Add "mon osd allow primary affinity = true" to ceph.conf and distribute ceph.conf to the nodes.

3) Inject it into the monitors to make it immediately active:

# ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity true'

Ignore the "mon.a: injectargs: failed to parse arguments: true" warnings; this appears to be a bug [0].

4) Check to see how many PGs have osd.0 as their primary:

# ceph pg dump | awk '{ print $15 $14 $1}' | egrep ^0 | wc -l

5) Set primary affinity to zero on osd.0:

# ceph osd primary-affinity osd.0 0

If you didn't set mon_osd_allow_primary_affinity properly above, you'll get a helpful error message.

6) Confirm it worked by comparing how many PGs have osd.0 as their primary:

# ceph pg dump | awk '{ print $15 }' | egrep ^0 | wc -l

On my small dev cluster, the number goes to 0 in less than 10 seconds.

7) Perform maintenance and watch ceph -w. If you didn't get all your clients updated, you'll likely see a bunch of errors in ceph -w like:

2014-05-09 21:12:42.534900 osd.0 [WRN] client.130959 x.x.x.x:0/1015056 misdirected client.130959.0:619497 pg 4.90eaebe to osd.0 not [6,1,0] in e1650/1650

8) After you are done with maintenance, reset the primary affinity:

# ceph osd primary-affinity osd.0 1

I have not scaled up my testing, but it looks like this has the potential to work well in preventing unnecessary read starvation in certain situations.

0: http://tracker.ceph.com/issues/8323#note-1

Cheers, Mike Dawson

On 5/8/2014 8:20 AM, Andrey Korolyov wrote: Mike, would you mind writing up your experience if you manage to get this flow through first? I hope I'll be able to conduct some tests related to 0.80 only next week, including maintenance combined with primary pointer relocation - one of the most crucial things remaining in Ceph for production performance.
On Wed, May 7, 2014 at 10:18 PM, Mike Dawson mike.daw...@cloudapt.com wrote: On 5/7/2014 11:53 AM, Gregory Farnum wrote: On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster daniel.vanders...@cern.ch wrote: Hi, Sage Weil wrote: * *Primary affinity*: Ceph now has the ability to skew selection of OSDs as the primary copy, which allows the read workload to be cheaply skewed away from parts of the cluster without migrating any data. Can you please elaborate a bit on this one? I found the blueprint [1] but still don't quite understand how it works. Does this only change the crush calculation for reads? i.e writes still go to the usual primary, but reads are distributed across the replicas? If so, does this change the consistency model in any way. It changes the calculation of who becomes the primary, and that primary serves both reads and writes. In slightly more depth: Previously, the primary has always been the first OSD chosen as a member of the PG. For erasure coding, we added the ability to specify a primary independent of the selection ordering. This was part of a broad set of changes to prevent moving the EC shards around between different members of the PG, and means that the primary might be the second OSD in the PG, or the fourth. Once this work existed, we realized that it might be useful in other cases, because primaries get more of the work for their PG (serving all reads, coordinating writes). So we added the ability to specify a primary affinity, which is like the CRUSH weights but only impacts whether you become the primary. So if you have 3 OSDs that each have primary affinity = 1, it will behave as normal. If two have primary affinity = 0, the remaining OSD will be the primary. Etc. Is it possible (and/or advisable) to set primary affinity low while backfilling / recovering an OSD in an effort to prevent unnecessary slow reads that could be directed to less busy replicas? 
I suppose if the cost of setting/unsetting primary affinity is low and clients are starved for reads during backfill/recovery from the osd in question, it could be a win. Perhaps the workflow for maintenance on osd.0 would be something like: - Stop osd.0, do some maintenance on osd.0 - Read primary affinity of osd.0, store it for later - Set primary affinity on osd.0 to 0 - Start osd.0 - Enjoy a better backfill/recovery experience. RBD clients happier. - Reset primary affinity on osd.0 to previous value If the cost of setting primary affinity is low enough, perhaps this strategy could be automated by the ceph daemons. Thanks, Mike Dawson -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http
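The "how many PGs does this OSD lead" check from the primary-affinity procedure earlier in this thread can also be done against JSON output instead of awk column positions. A hedged sketch, assuming `ceph pg dump --format json` returns a "pg_stats" list and treating the first entry of "acting" as the primary (newer releases also expose an "acting_primary" field, which is used when present):

```python
# Sketch: count PGs per primary OSD from `ceph pg dump --format json`.
# Field names are assumptions about this version's output; verify locally.
import json
import subprocess
from collections import Counter

def primary_counts(pg_dump_json):
    """Return a Counter mapping primary osd id -> number of PGs it leads."""
    counts = Counter()
    for pg in json.loads(pg_dump_json)["pg_stats"]:
        counts[pg.get("acting_primary", pg["acting"][0])] += 1
    return counts

def live_dump():
    # Requires a reachable cluster; not exercised by the sample below.
    return subprocess.check_output(
        ["ceph", "pg", "dump", "--format", "json"]).decode()

sample = json.dumps({"pg_stats": [
    {"pgid": "4.1", "acting": [0, 3]},
    {"pgid": "4.2", "acting": [0, 5]},
    {"pgid": "4.3", "acting": [3, 0]},
]})
print(primary_counts(sample)[0])  # 2
```

After `ceph osd primary-affinity osd.0 0`, `primary_counts(live_dump())[0]` should drop to 0, the same signal as the awk/egrep pipeline.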
Re: [ceph-users] v0.80 Firefly released
On 5/7/2014 11:53 AM, Gregory Farnum wrote: On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster daniel.vanders...@cern.ch wrote: Hi, Sage Weil wrote: * *Primary affinity*: Ceph now has the ability to skew selection of OSDs as the primary copy, which allows the read workload to be cheaply skewed away from parts of the cluster without migrating any data. Can you please elaborate a bit on this one? I found the blueprint [1] but still don't quite understand how it works. Does this only change the crush calculation for reads? i.e writes still go to the usual primary, but reads are distributed across the replicas? If so, does this change the consistency model in any way. It changes the calculation of who becomes the primary, and that primary serves both reads and writes. In slightly more depth: Previously, the primary has always been the first OSD chosen as a member of the PG. For erasure coding, we added the ability to specify a primary independent of the selection ordering. This was part of a broad set of changes to prevent moving the EC shards around between different members of the PG, and means that the primary might be the second OSD in the PG, or the fourth. Once this work existed, we realized that it might be useful in other cases, because primaries get more of the work for their PG (serving all reads, coordinating writes). So we added the ability to specify a primary affinity, which is like the CRUSH weights but only impacts whether you become the primary. So if you have 3 OSDs that each have primary affinity = 1, it will behave as normal. If two have primary affinity = 0, the remaining OSD will be the primary. Etc. Is it possible (and/or advisable) to set primary affinity low while backfilling / recovering an OSD in an effort to prevent unnecessary slow reads that could be directed to less busy replicas? 
I suppose if the cost of setting/unsetting primary affinity is low and clients are starved for reads during backfill/recovery from the osd in question, it could be a win. Perhaps the workflow for maintenance on osd.0 would be something like: - Stop osd.0, do some maintenance on osd.0 - Read primary affinity of osd.0, store it for later - Set primary affinity on osd.0 to 0 - Start osd.0 - Enjoy a better backfill/recovery experience. RBD clients happier. - Reset primary affinity on osd.0 to previous value If the cost of setting primary affinity is low enough, perhaps this strategy could be automated by the ceph daemons. Thanks, Mike Dawson -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 16 osds: 11 up, 16 in
Craig, I suspect the disks in question are seeking constantly and the spindle contention is causing significant latency. A strategy of throttling backfill/recovery and reducing client traffic tends to work for me.

1) You should make sure recovery and backfill are throttled:

ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'

2) We run a not-particularly-critical service with a constant stream of 95% write / 5% read small, random IO. During recovery/backfill, we are heavily bound by IOPS. It often feels like a net win to throttle unessential client traffic in an effort to get spindle contention under control if step 1 wasn't enough.

If that all fails, you can try 'ceph osd set nodown', which will prevent OSDs from being marked down (with or without proper cause), but that tends to cause me more trouble than it's worth.

Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250

On 5/7/2014 1:28 PM, Craig Lewis wrote: The 5 OSDs that are down have all been kicked out for being unresponsive. The 5 OSDs are getting kicked out faster than they can complete the recovery+backfill. The number of degraded PGs is growing over time.
root@ceph0c:~# ceph -w
    cluster 1604ec7a-6ceb-42fc-8c68-0a7896c4e120
     health HEALTH_WARN 49 pgs backfill; 926 pgs degraded; 252 pgs down; 30 pgs incomplete; 291 pgs peering; 1 pgs recovery_wait; 175 pgs stale; 255 pgs stuck inactive; 175 pgs stuck stale; 1234 pgs stuck unclean; 66 requests are blocked 32 sec; recovery 6820014/3806 objects degraded (17.921%); 4/16 in osds are down; noout flag(s) set
     monmap e2: 2 mons at {ceph0c=10.193.0.6:6789/0,ceph1c=10.193.0.7:6789/0}, election epoch 238, quorum 0,1 ceph0c,ceph1c
     osdmap e38673: 16 osds: 12 up, 16 in
            flags noout
      pgmap v7325233: 2560 pgs, 17 pools, 14090 GB data, 18581 kobjects
            28456 GB used, 31132 GB / 59588 GB avail
            6820014/3806 objects degraded (17.921%)
                   1 stale+active+clean+scrubbing+deep
                  15 active
                1247 active+clean
                   1 active+recovery_wait
                  45 stale+active+clean
                  39 peering
                  29 stale+active+degraded+wait_backfill
                 252 down+peering
                 827 active+degraded
                  50 stale+active+degraded
                  20 stale+active+degraded+remapped+wait_backfill
                  30 stale+incomplete
                   4 active+clean+scrubbing+deep

Here's a snippet of ceph.log for one of these OSDs:

2014-05-07 09:22:46.747036 mon.0 10.193.0.6:6789/0 39981 : [INF] osd.3 marked down after no pg stats for 901.212859 seconds
2014-05-07 09:47:17.930251 mon.0 10.193.0.6:6789/0 40561 : [INF] osd.3 10.193.0.6:6812/2830 boot
2014-05-07 09:47:16.914519 osd.3 10.193.0.6:6812/2830 823 : [WRN] map e38649 wrongly marked me down

root@ceph0c:~# uname -a
Linux ceph0c 3.5.0-46-generic #70~precise1-Ubuntu SMP Thu Jan 9 23:55:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@ceph0c:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise
root@ceph0c:~# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

Any ideas what I can do to make these OSDs stop dying after 15 minutes? -- Craig Lewis Senior Systems Engineer Office +1.714.602.1309 Email cle...@centraldesktop.com Central Desktop.
Work together in ways you never thought possible. Connect with us: Website http://www.centraldesktop.com/ | Twitter http://www.twitter.com/centraldesktop | Facebook http://www.facebook.com/CentralDesktop | LinkedIn http://www.linkedin.com/groups?gid=147417 | Blog http://cdblog.centraldesktop.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
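The injectargs throttles suggested earlier in this thread last only until the daemons restart. Hedged ceph.conf equivalents (same options, persistent across restarts) would be:

```ini
# Persistent counterparts to the runtime injectargs throttles above.
# Applied to all OSDs on daemon (re)start.
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
```

Keeping the persistent values conservative and raising them temporarily via injectargs when a fast rebuild matters more than client latency is one reasonable split.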
[ceph-users] Deep-Scrub Scheduling
My write-heavy cluster struggles under the additional load created by deep-scrub from time to time. As I have instrumented the cluster more, it has become clear that there is something I cannot explain happening in the scheduling of PGs to undergo deep-scrub. Please refer to these images [0][1] to see two graphical representations of how deep-scrub goes awry in my cluster. These were two separate incidents. Both show a period of happy scrub and deep-scrubs and stable writes/second across the cluster, then an approximately 5x jump in concurrent deep-scrubs where client IO is cut by nearly 50%. The first image (deep-scrub-issue1.jpg) shows a happy cluster with low numbers of scrub and deep-scrub running until about 10pm, then something triggers deep-scrubs to increase about 5x and remain high until I manually 'ceph osd set nodeep-scrub' at approx 10am. During the time of higher concurrent deep-scrubs, IOPS drop significantly due to OSD spindle contention preventing qemu/rbd clients from writing like normal. The second image (deep-scrub-issue2.jpg) shows a similar approx 5x jump in concurrent deep-scrubs and associated drop in writes/second. This image also adds a summary of the 'dump historic ops' which show the to be expected jump in the slowest ops in the cluster. Does anyone have an idea of what is happening when the spike in concurrent deep-scrub occurs and how to prevent the adverse effects, outside of disabling deep-scrub permanently? 0: http://www.mikedawson.com/deep-scrub-issue1.jpg 1: http://www.mikedawson.com/deep-scrub-issue2.jpg Thanks, Mike Dawson ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Deep-Scrub Scheduling
Perhaps, but if that were the case, would you expect the max concurrent number of deep-scrubs to approach the number of OSDs in the cluster? I have 72 OSDs in this cluster and concurrent deep-scrubs seem to peak at a max of 12. Do pools (two in use) and replication settings (3 copies in both pools) factor in? 72 OSDs / (2 pools * 3 copies) = 12 max concurrent deep-scrubs. That seems plausible (without looking at the code). But, if I 'ceph osd set nodeep-scrub' then 'ceph osd unset nodeep-scrub', the count of concurrent deep-scrubs doesn't resume the high level, but rather stays low, seemingly for days at a time, until the next onslaught. If driven by the max scrub interval, shouldn't it jump quickly back up? Is there a way to find the last scrub time for a given PG via the CLI to know for sure? Thanks, Mike Dawson

On 5/7/2014 10:59 PM, Gregory Farnum wrote: Is it possible you're running into the max scrub intervals and jumping up to one-per-OSD from a much lower normal rate? On Wednesday, May 7, 2014, Mike Dawson mike.daw...@cloudapt.com wrote: My write-heavy cluster struggles under the additional load created by deep-scrub from time to time. As I have instrumented the cluster more, it has become clear that there is something I cannot explain happening in the scheduling of PGs to undergo deep-scrub. Please refer to these images [0][1] to see two graphical representations of how deep-scrub goes awry in my cluster. These were two separate incidents. Both show a period of happy scrub and deep-scrubs and stable writes/second across the cluster, then an approximately 5x jump in concurrent deep-scrubs where client IO is cut by nearly 50%. The first image (deep-scrub-issue1.jpg) shows a happy cluster with low numbers of scrub and deep-scrub running until about 10pm, then something triggers deep-scrubs to increase about 5x and remain high until I manually 'ceph osd set nodeep-scrub' at approx 10am.
During the time of higher concurrent deep-scrubs, IOPS drop significantly due to OSD spindle contention preventing qemu/rbd clients from writing like normal. The second image (deep-scrub-issue2.jpg) shows a similar approx 5x jump in concurrent deep-scrubs and associated drop in writes/second. This image also adds a summary of the 'dump historic ops', which shows the to-be-expected jump in the slowest ops in the cluster. Does anyone have an idea of what is happening when the spike in concurrent deep-scrubs occurs and how to prevent the adverse effects, outside of disabling deep-scrub permanently? 0: http://www.mikedawson.com/deep-scrub-issue1.jpg 1: http://www.mikedawson.com/deep-scrub-issue2.jpg Thanks, Mike Dawson ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
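On the per-PG scrub-time question in this thread: `ceph pg <pgid> query` returns JSON which, on recent releases, carries the last deep-scrub time; the `info -> stats -> last_deep_scrub_stamp` path below is an assumption to verify against your version's output. A hedged sketch, with the parsing separated from the subprocess call so it can be checked against sample data:

```python
# Sketch: read a PG's last deep-scrub time from `ceph pg <pgid> query`.
# The JSON path is an assumption about this release's output format.
import json
import subprocess

def last_deep_scrub(pg_query_json):
    """Extract last_deep_scrub_stamp from pg query JSON output."""
    return json.loads(pg_query_json)["info"]["stats"]["last_deep_scrub_stamp"]

def query_pg(pgid):
    # Live-cluster path; requires the ceph CLI and admin keyring.
    return subprocess.check_output(["ceph", "pg", pgid, "query"]).decode()

sample = json.dumps(
    {"info": {"stats": {"last_deep_scrub_stamp": "2014-05-07 22:01:11.719254"}}})
print(last_deep_scrub(sample))  # 2014-05-07 22:01:11.719254
```

On a live cluster, `last_deep_scrub(query_pg("4.90"))` would confirm whether the max scrub interval theory holds for a suspect PG.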
Re: [ceph-users] ceph-deploy osd activate error: AttributeError: 'module' object has no attribute 'logger' exception
Victor, This is a verified issue reported earlier today: http://tracker.ceph.com/issues/8260 Cheers, Mike

On 4/30/2014 3:10 PM, Victor Bayon wrote: Hi all, I am following the quick-ceph-deploy tutorial [1] and I am getting an error when running ceph-deploy osd activate; see the exception below [2]. I am following the quick tutorial step by step, except that ceph-deploy mon create-initial does not seem to gather the keys and I have to execute that manually with ceph-deploy gatherkeys node01. Any help greatly appreciated. I am using the same configuration, with: - one admin node (myhost) - 1 monitoring node (node01) - 2 osds (node02, node03). I am on Ubuntu Server 12.04 LTS (precise) and using ceph emperor. Many thanks, Best regards /V

[1] http://ceph.com/docs/master/start/quick-ceph-deploy/
[2] Error:

ceph@myhost:~/cluster$ ceph-deploy osd activate node02:/var/local/osd0 node03:/var/local/osd1
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.0): /usr/bin/ceph-deploy osd activate node02:/var/local/osd0 node03:/var/local/osd1
[ceph_deploy.osd][DEBUG ] Activating cluster ceph disks node02:/var/local/osd0: node03:/var/local/osd1:
[node02][DEBUG ] connected to host: node02
[node02][DEBUG ] detect platform information from remote host
[node02][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: Ubuntu 12.04 precise
[ceph_deploy.osd][DEBUG ] activating host node02 disk /var/local/osd0
[ceph_deploy.osd][DEBUG ] will use init type: upstart
[node02][INFO ] Running command: sudo ceph-disk-activate --mark-init upstart --mount /var/local/osd0
[node02][WARNIN] got latest monmap
[node02][WARNIN] 2014-04-30 19:36:30.268882 7f506fd07780 -1 journal FileJournal::_open: disabling aio for non-block journal.
Use journal_force_aio to force use of aio anyway
[node02][WARNIN] 2014-04-30 19:36:30.298239 7f506fd07780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
[node02][WARNIN] 2014-04-30 19:36:30.301091 7f506fd07780 -1 filestore(/var/local/osd0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
[node02][WARNIN] 2014-04-30 19:36:30.307474 7f506fd07780 -1 created object store /var/local/osd0 journal /var/local/osd0/journal for osd.0 fsid 76de3b72-44e3-47eb-8bd7-2b5b6e3666eb
[node02][WARNIN] 2014-04-30 19:36:30.307512 7f506fd07780 -1 auth: error reading file: /var/local/osd0/keyring: can't open /var/local/osd0/keyring: (2) No such file or directory
[node02][WARNIN] 2014-04-30 19:36:30.307547 7f506fd07780 -1 created new key in keyring /var/local/osd0/keyring
[node02][WARNIN] added key for osd.0
Traceback (most recent call last):
  File "/usr/bin/ceph-deploy", line 21, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/dist-packages/ceph_deploy/util/decorators.py", line 62, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/ceph_deploy/cli.py", line 147, in main
    return args.func(args)
  File "/usr/lib/python2.7/dist-packages/ceph_deploy/osd.py", line 532, in osd
    activate(args, cfg)
  File "/usr/lib/python2.7/dist-packages/ceph_deploy/osd.py", line 338, in activate
    catch_osd_errors(distro.conn, distro.logger, args)
AttributeError: 'module' object has no attribute 'logger'
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Backfill and Recovery traffic shaping
Hi Greg, On 4/19/2014 2:20 PM, Greg Poirier wrote: We have a cluster in a sub-optimal configuration with data and journal colocated on OSDs (that coincidentally are spinning disks). During recovery/backfill, the entire cluster suffers degraded performance because of the IO storm that backfills cause. Client IO becomes extremely latent. Graph '%util' or simply watch it with 'iostat -xt 2'. It will likely show you that the bottleneck is the IOPS available from your spinning disks. Client IO can see significant latency (or at worst, complete stalls) as your disks approach saturation. I've tried to decrease the impact that recovery/backfill has with the following: ceph tell osd.* injectargs '--osd-max-backfills 1' ceph tell osd.* injectargs '--osd-max-recovery-threads 1' ceph tell osd.* injectargs '--osd-recovery-op-priority 1' ceph tell osd.* injectargs '--osd-client-op-priority 63' ceph tell osd.* injectargs '--osd-recovery-max-active 1' On our cluster, these settings can be an effective method for minimizing disruption. I'd also recommend disabling deep scrub with: ceph osd set nodeep-scrub Re-enable it later with: ceph osd unset nodeep-scrub I have some clients that are much more susceptible to disruptions from spindle contention during recovery/backfill. Others operate without disruption. I am working to quantify the difference, but I believe it is related to the caching or syncing behavior of the individual application/OS. The only other option I have left would be to use linux traffic shaping to artificially reduce the bandwidth available to the interface tagged for cluster traffic (instead of separate physical networks, we use VLAN tagging). We are nowhere _near_ the point where network saturation would cause the latency we're seeing, so I am left to believe that it is simply disk IO saturation. I could be wrong about this assumption, though, as iostat doesn't terrify me. This could be suboptimal network configuration on the cluster as well. 
I'm still looking into that possibility, but I wanted to get feedback on what I'd done already first--as well as the proposed traffic shaping idea. Thoughts? I would exhaust all troubleshooting / tuning related to spindle contention before spending much more than a cursory look at network sanity. It sounds to me like you simply don't have enough IOPS available in your cluster as configured to handle your client IO workload while also absorbing the performance hit of recovery/backfill. With a workload consisting of lots of small writes, I've seen client IO starved with as little as 5Mbps of traffic per host due to spindle contention once deep-scrub and/or recovery/backfill start. Co-locating OSD journals on the same spinners, as you have, will double that likelihood. Possible solutions include moving OSD journals to SSD (with a reasonable ratio), expanding the cluster, or increasing the performance of the underlying storage. Cheers, Mike
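Mike's advice to watch '%util' can be automated. Below is a hedged sketch (not a tool from this thread) that scans `iostat -x`-style output for disks approaching saturation; it assumes the sysstat extended device report layout, where %util is the last column of each device row.

```python
# Hypothetical helper: flag disks nearing IOPS saturation from `iostat -xt 2`
# output, using the %util column (assumed to be the last field of each row).
def saturated_devices(iostat_text, threshold=90.0):
    """Return [(device, util)] for device rows whose %util exceeds threshold."""
    flagged = []
    in_table = False
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            in_table = False  # blank line ends a device table
            continue
        if fields[0] == "Device:":
            in_table = True   # header row starts a device table
            continue
        if in_table:
            try:
                util = float(fields[-1])
            except ValueError:
                continue
            if util > threshold:
                flagged.append((fields[0], util))
    return flagged

sample = """Device:         rrqm/s   wrqm/s     r/s     w/s   %util
sda               0.00     1.20    4.00   12.00    3.10
sdb               0.00    88.00  140.00  310.00   97.40
sdc               0.00    79.50  120.00  280.00   91.20
"""
print(saturated_devices(sample))  # [('sdb', 97.4), ('sdc', 91.2)]
```

If this regularly flags the same spindles during recovery/backfill, that supports the disk-saturation theory over a network problem.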
Re: [ceph-users] RBD write access patterns and atime
Thanks Dan! Thanks, Mike Dawson On 4/17/2014 4:06 AM, Dan van der Ster wrote: Mike Dawson wrote: Dan, Could you describe how you harvested and analyzed this data? Even better, could you share the code? Cheers, Mike First enable debug_filestore=10, then you'll see logs like this: 2014-04-17 09:40:34.466749 7fb39df16700 10 filestore(/var/lib/ceph/osd/osd.0) write 4.206_head/57186206/rbd_data.1f7ccd36575a0ed.1620/head//4 651264~4096 = 4096 and this for reads: 2014-04-17 09:46:10.449577 7fb392427700 10 filestore(/var/lib/ceph/osd/osd.0) FileStore::read 4.fe9_head/f7281fe9/rbd_data.10bb48f705289c0.6a24/head//4 1994752~4096/4096 The last num is the size of the write/read. Then run this: https://github.com/cernceph/ceph-scripts/blob/master/tools/rbd-io-stats.pl Cheers, Dan
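For those who prefer Python, the core of what rbd-io-stats.pl does can be sketched as follows. The regex is inferred from the sample log lines in Dan's message, not taken from the actual script, so treat it as an assumption.

```python
import re
from collections import Counter

# Tally write sizes from FileStore debug logs (debug_filestore = 10).
# Pattern inferred from the sample line above: "... write <obj> <offset>~<len> = <len>"
WRITE_RE = re.compile(r'filestore\(.*?\) write .* (\d+)~(\d+) = \2')

def write_length_histogram(log_lines):
    """Map write length -> count, e.g. {4096: 1}."""
    hist = Counter()
    for line in log_lines:
        m = WRITE_RE.search(line)
        if m:
            hist[int(m.group(2))] += 1
    return dict(hist)

logs = [
    "2014-04-17 09:40:34.466749 7fb39df16700 10 "
    "filestore(/var/lib/ceph/osd/osd.0) write "
    "4.206_head/57186206/rbd_data.1f7ccd36575a0ed.1620/head//4 651264~4096 = 4096",
]
print(write_length_histogram(logs))  # {4096: 1}
```

Run over a day of OSD logs, this yields exactly the kind of "writes per length" table Dan posted below.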
Re: [ceph-users] RBD write access patterns and atime
Dan, Could you describe how you harvested and analyzed this data? Even better, could you share the code? Cheers, Mike On 4/16/2014 11:08 AM, Dan van der Ster wrote: Dear ceph-users, I've recently started looking through our FileStore logs to better understand the VM/RBD IO patterns, and noticed something interesting. Here is a snapshot of the write lengths for one OSD server (with 24 OSDs) -- I've listed the top write lengths ordered by number of writes in one day:

Writes per length:
4096: 2011442
8192: 438259
4194304: 207293
12288: 175848
16384: 148274
20480: 69050
24576: 58961
32768: 54771
28672: 43627
65536: 34208
49152: 31547
40960: 28075

There were ~4M writes to that server on that day, so you see that ~50% of the writes were 4096 bytes, and then the distribution drops off sharply before a peak again at 4MB (the object size, i.e. the max write size). (For those interested, read lengths are below in the P.S.) I'm trying to understand that distribution, and the best explanation I've come up with is that these are ext4/xfs metadata updates, probably atime updates. Based on that theory, I'm going to test noatime on a few VMs and see if I notice a change in the distribution. Did anyone already go through such an exercise, or does anyone already enforce/recommend specific mount options for their clients' RBD volumes? Of course I realize that noatime is a generally recommended mount option for performance, but I've never heard a discussion about noatime specifically in relation to RBD volumes. Best Regards, Dan

P.S. Reads per length:
524288: 1235401
4096: 675012
8192: 488194
516096: 342771
16384: 187577
65536: 87783
131072: 87279
12288: 66735
49152: 50170
24576: 47794
262144: 45199
466944: 23064

So reads are mostly 512kB, which is probably some default read-ahead size. 
-- Dan van der Ster || Data Storage Services || CERN IT Department --
[ceph-users] Migrate from mkcephfs to ceph-deploy
Hello, I have a production cluster that was deployed with mkcephfs around the Bobtail release. Quite a bit has changed in regards to ceph.conf conventions, ceph-deploy, symlinks to journal partitions, udev magic, and upstart. Is there any path to migrate these OSDs up to the new style setup? For obvious reasons I'd prefer to avoid redeploying the OSDs. With each release, I get a bit more worried that this legacy setup will cause issues. If you are an operator with a cluster older than a year or so, what have you done? Thanks, Mike
Re: [ceph-users] Error while provisioning my first OSD
Adam, I believe you need the command 'ceph osd create' prior to 'ceph-osd -i X --mkfs --mkkey' for each OSD you add. http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#adding-an-osd-manual Cheers, Mike On 4/5/2014 7:37 PM, Adam Clark wrote: HI all, I am trying to setup a Ceph cluster for the first time. I am following the manual deployment guide at http://ceph.com/docs/master/install/manual-deployment/ as I want to orchestrate it with puppet. All is going well until I want to add the OSD to the crush map. I get the following error: ceph osd crush add osd.0 1.0 host=ceph-osd133 Error ENOENT: osd.0 does not exist. create it before updating the crush map Here is the process that I went through: ceph -v ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60) cat /etc/ceph/ceph.conf [global] osd_pool_default_pgp_num = 100 osd_pool_default_min_size = 1 auth_service_required = cephx mon_initial_members = ceph-mon01,ceph-mon02,ceph-mon03 fsid = 983a74a9-1e99-42ef-8a1d-097553c3e6ce cluster_network = 172.16.34.0/24 auth_supported = cephx auth_cluster_required = cephx mon_host = 172.16.33.20,172.16.33.21,172.16.33.22 auth_client_required = cephx osd_pool_default_size = 2 osd_pool_default_pg_num = 100 public_network = 172.16.33.0/24 ceph -s cluster 983a74a9-1e99-42ef-8a1d-097553c3e6ce health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds monmap e3: 3 mons at {ceph-mon01=172.16.33.20:6789/0,ceph-mon02=172.16.33.21:6789/0,ceph-mon03=172.16.33.22:6789/0}, election epoch 6, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 osdmap e3: 0 osds: 0 up, 0 in pgmap v4: 192 pgs, 3 pools, 0 bytes data, 0 objects 0 kB used, 0 kB / 0 kB avail 192 creating ceph-disk list /dev/fd0 other, unknown /dev/sda : /dev/sda1 other, ext2, mounted on /boot /dev/sda2 other /dev/sda5 other, LVM2_member /dev/sdb : /dev/sdb1 ceph data, active, cluster 
ceph, osd.0, journal /dev/sdb2 /dev/sdb2 ceph journal, for /dev/sdb1 /dev/sr0 other, unknown mount /dev/sdb1 /var/lib/ceph/osd/ceph-0 ceph-osd -i 0 --mkfs --mkkey ceph auth add osd.0 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-0/keyring ceph osd crush add-bucket ceph-osd133 host ceph osd crush move ceph-osd133 root=default ceph osd crush add osd.0 1.0 host=ceph-osd133 Error ENOENT: osd.0 does not exist. create it before updating the crush map I have seen that in earlier versions it could show this message but happily proceed. Is the doco out of date, or am I missing something? Cheers Adam
Re: [ceph-users] Pause i/o from time to time
What version of qemu do you have? The issues I had were fixed once I upgraded qemu to >= 1.4.2, which includes a critical rbd patch for asynchronous io from Josh Durgin. Cheers, Mike On 12/28/2013 4:09 PM, Andrei Mikhailovsky wrote: Hi guys, Did anyone figure out what could be causing this problem and a workaround? I've noticed a very annoying behaviour with my vms. It seems to happen randomly about 5-10 times a day and the pauses last between 2-10 minutes. It happens across all vms on all host servers in my cluster. I am running 0.67.4 on ubuntu 12.04 with the 3.11 kernel from backports. Initially I thought that these pauses were caused by the scrubbing issue reported by Mike, however, I've also noticed the stalls when the cluster is not scrubbing. Both of my osd servers are pretty idle (load around 1 to 2) with osds less than 10% utilised. Unlike Uwe's case, I am not using iscsi, but plain rbd with qemu, and I do not see any i/o errors in dmesg or kernel panics. The vms just freeze and become unresponsive, so I can't ssh into them or run simple commands like ls. VMs do respond to pings though. Thanks Andrei *From: *Uwe Grohnwaldt u...@grohnwaldt.eu *To: *ceph-users@lists.ceph.com *Sent: *Thursday, 24 October, 2013 8:31:42 AM *Subject: *Re: [ceph-users] Pause i/o from time to time Hello ceph-users, we hit a similar problem last Thursday and today. We have a cluster consisting of 6 storage nodes containing 70 osds (JBOD configuration). We created several rbd devices, mapped them on a dedicated server, and exported them via targetcli. These iscsi targets are connected to Citrix XenServer 6.1 (with HF30) and XenServer 6.2 (HF4). Recently, some disks died. 
After this, some errors occurred on this dedicated iscsi target: Oct 23 15:19:42 targetcli01 kernel: [673836.709887] end_request: I/O error, dev rbd4, sector 2034037064 Oct 23 15:19:42 targetcli01 kernel: [673836.713596] test_bit(BIO_UPTODATE) failed for bio: 880127546c00, err: -6 Oct 23 15:19:43 targetcli01 kernel: [673837.497382] end_request: I/O error, dev rbd4, sector 2034037064 Oct 23 15:19:43 targetcli01 kernel: [673837.501323] test_bit(BIO_UPTODATE) failed for bio: 880124d933c0, err: -6 These errors go through up to the virtual machines and lead to readonly filesystems. We could trigger this behavior by setting one disk to out. We are using Ubuntu 13.04 with latest stable ceph (ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)). Our ceph.conf is like this: [global] filestore_xattr_use_omap = true mon_host = 10.200.20.1,10.200.20.2,10.200.20.3 osd_journal_size = 1024 public_network = 10.200.40.0/16 mon_initial_members = ceph-mon01, ceph-mon02, ceph-mon03 cluster_network = 10.210.40.0/16 auth_supported = none fsid = 9283e647-2b57-4077-b427-0d3d656233b3 [osd] osd_max_backfills = 4 osd_recovery_max_active = 1 [osd.0] public_addr = 10.200.40.1 cluster_addr = 10.210.40.1 After the first outage we set osd_max_backfills to 8, after the second one to 4, but it didn't help. It seems like it is the bug mentioned at http://tracker.ceph.com/issues/6278 . The problem is that this is a production environment and the problems began after we moved several VMs to it. In our test environment we can't reproduce it, but we are working on a larger test installation. Does anybody have an idea how to investigate further without destroying virtual machines? ;) Sometimes these IO errors lead to kernel panics on the iscsi target machine. The targetcli/lio config is a simple default config without any tuning or big configurations. 
Mit freundlichen Grüßen / Best Regards, Uwe Grohnwaldt - Original Message - From: Timofey timo...@koolin.ru To: Mike Dawson mike.daw...@cloudapt.com Cc: ceph-users@lists.ceph.com Sent: Tuesday, 17 September 2013 22:37:44 Subject: Re: [ceph-users] Pause i/o from time to time I have examined the logs. Yes, the first time it could have been scrubbing. It repaired itself. I had 2 servers before the first problem: one dedicated to an osd (osd.0), and a second with an osd and websites (osd.1). After the problem I added a third server dedicated to an osd (osd.2) and ran 'ceph osd out osd.1' to move the data off it. In ceph -s I saw a normal replacing process and everything worked well for about 5-7 hours. Then I got many misdirected records (a few hundred per second): osd.0 [WRN] client.359671 misdirected client.359671.1:220843 pg 2.3ae744c0 to osd.0 not [2,0] in e1040/1040 and errors in i/o operations. Now I have about 20GB of ceph logs with these errors. (I don't work with the cluster now - I copied all data out onto an hdd and work from the hdd.) Is there any way to have local software raid1 with a ceph rbd and a local image (to keep working when ceph fails or is slow for any reason)? I tried mdadm but it worked badly - the server hung up every few hours. You could be suffering from a known, but unfixed issue [1] where spindle contention from scrub
Re: [ceph-users] rebooting nodes in a ceph cluster
It is also useful to mention that you can set the noout flag when maintenance of any given length needs to exceed the 'mon osd down out interval'. $ ceph osd set noout ** no re-balancing will happen ** $ ceph osd unset noout ** normal re-balancing rules will resume ** - Mike Dawson On 12/19/2013 7:51 PM, Sage Weil wrote: On Thu, 19 Dec 2013, John-Paul Robinson wrote: What impact does rebooting nodes in a ceph cluster have on the health of the ceph cluster? Can it trigger rebalancing activities that then have to be undone once the node comes back up? I have a 4 node ceph cluster; each node has 11 osds. There is a single pool with redundant storage. If it takes 15 minutes for one of my servers to reboot, is there a risk that some sort of needless automatic processing will begin? By default, we start rebalancing data after 5 minutes. You can adjust this (to, say, 15 minutes) with mon osd down out interval = 900 in ceph.conf. sage I'm assuming that the ceph cluster can go into a not ok state, but that in this particular configuration all the data is protected against the single node failure and there is no place for the data to migrate to, so nothing bad will happen. Thanks for any feedback. ~jpr
Re: [ceph-users] rebooting nodes in a ceph cluster
I think my wording was a bit misleading in my last message. Instead of "no re-balancing will happen", I should have said that no OSDs will be marked out of the cluster with the noout flag set. - Mike On 12/21/2013 2:06 PM, Mike Dawson wrote: It is also useful to mention that you can set the noout flag when maintenance of any given length needs to exceed the 'mon osd down out interval'. $ ceph osd set noout ** no re-balancing will happen ** $ ceph osd unset noout ** normal re-balancing rules will resume ** - Mike Dawson On 12/19/2013 7:51 PM, Sage Weil wrote: On Thu, 19 Dec 2013, John-Paul Robinson wrote: What impact does rebooting nodes in a ceph cluster have on the health of the ceph cluster? Can it trigger rebalancing activities that then have to be undone once the node comes back up? I have a 4 node ceph cluster; each node has 11 osds. There is a single pool with redundant storage. If it takes 15 minutes for one of my servers to reboot, is there a risk that some sort of needless automatic processing will begin? By default, we start rebalancing data after 5 minutes. You can adjust this (to, say, 15 minutes) with mon osd down out interval = 900 in ceph.conf. sage I'm assuming that the ceph cluster can go into a not ok state, but that in this particular configuration all the data is protected against the single node failure and there is no place for the data to migrate to, so nothing bad will happen. Thanks for any feedback. ~jpr
Re: [ceph-users] Sanity check of deploying Ceph very unconventionally (on top of RAID6, with very few nodes and OSDs)
Christian, I think you are going to suffer the effects of spindle contention with this type of setup. Based on your email and my assumptions, I will use the following inputs: - 4 OSDs, each backed by a 12-disk RAID6 set - 75 iops for each 7200rpm 3TB drive - RAID6 write penalty of 6 - OSD journal co-located with OSD - Ceph replication size of 2

4 osds * 12 disks * 75 iops / 6 (RAID6 write penalty) / 2 (co-located journal) / 2 (replication) = 150 writes/second max
4 osds * 12 disks * 75 iops / 2 (replication) = 1800 reads/second max

My guess is 150 writes/second is far lower than your 500 VMs will require. After all, this setup will likely give you lower writes/second than a single 15K SAS drive. Further, if you need to replace a drive, I suspect this setup would grind to a halt as the RAID6 set attempts to repair. On the other hand, if you planned for 48 individual drives with OSD journals on SSDs in a typical setup of perhaps a 5:1 or lower ratio of HDDs:SSDs, the calculation would look like:

48 osds * 75 iops / 2 (replication) = 1800 writes/second max
48 osds * 75 iops / 2 (replication) = 1800 reads/second max

As you can see, I estimate 12x more random writes without RAID6 (6x) and co-located osd journals (2x). Plus you'll be able to configure 12x more placement groups in your CRUSH rules by going from 4 osds to 48 osds. That will allow Ceph's pseudo-random placement rules to significantly improve the distribution of data and io load across the cluster and decrease the risk of hot-spots. A few other notes: - You'll certainly want QEMU 1.4.2 or later to get asynchronous io for RBD. - You'll likely want to enable RBD writeback cache. It helps coalesce small writes before they hit the disks. Cheers, Mike On 12/17/2013 2:44 AM, Christian Balzer wrote: Hello, I've been doing a lot of reading and am looking at the following design for a storage cluster based on Ceph. 
I will address all the likely knee-jerk reactions and reasoning below, so hold your guns until you've read it all. I also have a number of questions I've not yet found the answer to or determined by experimentation. Hardware: 2x 4U (can you say Supermicro? ^.^) servers with 24 3.5" hotswap bays, 2 internal OS (journal?) drives, probably Opteron 4300 CPUs (see below), an Areca 1882 controller with 4GB cache, and 2 or 3 2-port Infiniband HCAs. 24 3TB HDs (30% of the price of a 4TB one!) in one or two RAID6 sets, 2 of them hotspares, giving us 60TB per node and thus, with a replication factor of 2, that's also the usable space. Space for 2 more identical servers if need be. Network: Infiniband QDR, 2x 18-port switches (interconnected of course), redundant paths everywhere, including to the clients (compute nodes). Ceph configuration: an additional server with a mon, mons also on the 2 storage nodes, at least 2 OSDs per node (see below). This is for a private cloud with about 500 VMs at most. There will be 2 types of VMs, the majority writing a small amount of log chatter to their volumes, the other type (a few dozen) writing a more substantial data stream. I estimate less than 100MB/s of reads/writes at full build out, which should be well within the abilities of this setup. Now for the rationale of this design that goes contrary to anything normal Ceph layouts suggest: 1. Idiot (aka NOC monkey) proof hotswap of disks. This will be deployed in a remote data center, meaning that qualified people will not be available locally and thus would have to travel there each time a disk or two fails. In short, telling somebody to pull the disk tray with the red flashing LED and put a new one from the spare pile in there is a lot more likely to result in success than telling them to pull the 3rd row, 4th column disk in server 2. ^o^ 2. 
Density, TCO Ideally I would love to deploy something like this: http://www.mbx.com/60-drive-4u-storage-server/ but they seem to not really have a complete product description, price list, etc. ^o^ With a monster like that, I'd be willing to reconsider local raids and just overspec things in a way that a LOT of disks can fail before somebody (with a clue) needs to visit that DC. However failing that, the typical approach of using many smaller servers for OSDs increases the costs and/or reduces density. Replacing the 4U servers with 2U ones (that hold 12 disks) would require some sort of controller (to satisfy my #1 requirement) and similar amounts of HCAs per node, clearly driving the TCO up. 1U servers with typically 4 disks would be even worse. 3. Increased reliability/stability Failure of a single disk has no impact on the whole cluster, no need for any CPU/network intensive rebalancing. Questions/remarks: Due to the fact that there will be redundancy and reliability on the disk level, and that there will be only 2 storage nodes initially, I'm planning to disable rebalancing. Or will Ceph realize that making replicas on the same server won't really save the day and refrain from doing so? If more nodes are added
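Mike's back-of-the-envelope IOPS estimate above can be expressed as a small helper. The penalty factors (RAID6 write penalty, co-located-journal double write, replication factor) are the ones from his message; the function itself is just an illustrative sketch, not a cluster sizing tool.

```python
def max_write_iops(osds, disks_per_osd, iops_per_disk,
                   raid_write_penalty=1, journal_colocated=False, replicas=2):
    """Rough ceiling on cluster random-write IOPS: raw spindle IOPS divided
    by the RAID write penalty, the co-located-journal double write, and the
    replication factor."""
    iops = osds * disks_per_osd * iops_per_disk
    iops /= raid_write_penalty
    if journal_colocated:
        iops /= 2  # each write hits the journal and the data partition
    return iops / replicas

# 4 OSDs on 12-disk RAID6 sets, co-located journals, 2x replication:
print(max_write_iops(4, 12, 75, raid_write_penalty=6, journal_colocated=True))  # 150.0
# 48 individual OSDs with SSD journals, 2x replication:
print(max_write_iops(48, 1, 75))  # 1800.0
```

The 12x gap between the two configurations is exactly the RAID6 penalty (6x) times the co-located journal (2x).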
Re: [ceph-users] Adding new OSDs, need to increase PGs?
Robert, Interesting results on the effect of # of PGs/PGPs. My cluster struggles a bit under the strain of heavy random small-sized writes. The IOPS you mention seem high to me given 30 drives and 3x replication unless they were pure reads or on high-rpm drives. Instead of assuming, I want to pose a few questions: - How are you testing? rados bench, rbd bench, rbd bench with writeback cache, etc? - Were the 2000-2500 random 4k IOPS more reads than writes? If you test 100% 4k random reads, what do you get? If you test 100% 4k random writes, what do you get? - What drives do you have? Any RAID involved under your OSDs? Thanks, Mike Dawson On 12/3/2013 1:31 AM, Robert van Leeuwen wrote: On 2 dec. 2013, at 18:26, Brian Andrus brian.and...@inktank.com wrote: Setting your pg_num and pgp_num to say... 1024 would A) increase data granularity, B) likely lend no noticeable increase to resource consumption, and C) allow some room for future OSDs to be added while still within range of acceptable pg numbers. You could probably safely double even that number if you plan on expanding at a rapid rate and want to avoid splitting PGs every time a node is added. In general, you can conservatively err on the larger side when it comes to pg/p_num. Any excess resource utilization will be negligible (up to a certain point). If you have a comfortable amount of available RAM, you could experiment with increasing the multiplier in the equation you are using and see how it affects your final number. The pg_num and pgp_num parameters can safely be changed before or after your new nodes are integrated. I would be a bit conservative with the PGs / PGPs. I've experimented with the PG number a bit and noticed the following random IO performance drop. (This could be something specific to our setup, but since the PG count is easily increased and impossible to decrease, I would be conservative.) The setup: 3 OSD nodes with 128GB ram, 2 * 6 core CPUs (12 with ht). 
Nodes have 10 OSDs running on 1 tb disks and 2 SSDs for journals. We use a replica count of 3, so the optimum according to the formula is about 1000. With 1000 PGs I got about 2000-2500 random 4k IOPS. Because the nodes are fast enough and I expect the cluster to be expanded with 3 more nodes, I set the PGs to 2000. Performance dropped to about 1200-1400 IOPS. I noticed that the spinning disks were no longer maxing out at 100% usage. Memory and CPU did not seem to be a problem. Since I had the option to recreate the pool and I was not using the recommended settings, I did not really dive into the issue. I will not stray too far from the recommended settings in the future though :) Cheers, Robert van Leeuwen
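The formula Robert refers to is presumably the commonly cited sizing rule of roughly 100 PGs per OSD divided by the replica count, rounded up to a power of two. A minimal sketch, assuming that rule:

```python
def suggested_pg_num(num_osds, replicas, target_pgs_per_osd=100):
    """Commonly cited PG sizing rule: (OSDs * target) / replicas,
    rounded up to the next power of two."""
    raw = num_osds * target_pgs_per_osd / replicas
    pg = 1
    while pg < raw:
        pg *= 2
    return pg

# Robert's cluster: 3 nodes * 10 OSDs, replica count 3 -> ~1000, rounded to 1024
print(suggested_pg_num(30, 3))  # 1024
```

His observed slowdown at 2000 PGs suggests the rounding/overshoot headroom in this rule is not free on all hardware, which is why he now stays close to the recommended value.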
Re: [ceph-users] Adding new OSDs, need to increase PGs?
Robert, Do you have rbd writeback cache enabled on these volumes? That could certainly explain the higher than expected write performance. Any chance you could re-test with rbd writeback on vs. off? Thanks, Mike Dawson On 12/3/2013 10:37 AM, Robert van Leeuwen wrote: Hi Mike, I am using filebench within a kvm virtual machine (like an actual workload we will have), using 100% synchronous 4k writes with a 50GB file on a 100GB volume with 32 writer threads. Also tried from multiple KVM machines on multiple hosts. Aggregate performance stays at 2k+ IOPS. The disks are 7200RPM 2.5 inch drives, no RAID whatsoever. I agree the amount of IOPS seems high. Maybe the journal on SSD (2 x Intel 3500) helps a bit in this regard, but the SSDs were not maxed out yet. The writes seem to be limited by the spinning disks: as soon as the benchmark starts they are at 100% utilization. Also, the usage dropped to 0% pretty much immediately after the benchmark, so it looks like it's not lagging behind the journal. Did not really test reads yet; since we have so much read cache (128 GB per node) I assume we will mostly be write limited. Cheers, Robert van Leeuwen Sent from my iPad On 3 dec. 2013, at 16:15, Mike Dawson mike.daw...@cloudapt.com wrote: Robert, Interesting results on the effect of # of PGs/PGPs. My cluster struggles a bit under the strain of heavy random small-sized writes. The IOPS you mention seem high to me given 30 drives and 3x replication unless they were pure reads or on high-rpm drives. Instead of assuming, I want to pose a few questions: - How are you testing? rados bench, rbd bench, rbd bench with writeback cache, etc? - Were the 2000-2500 random 4k IOPS more reads than writes? If you test 100% 4k random reads, what do you get? If you test 100% 4k random writes, what do you get? - What drives do you have? Any RAID involved under your OSDs? Thanks, Mike Dawson On 12/3/2013 1:31 AM, Robert van Leeuwen wrote: On 2 dec. 
2013, at 18:26, Brian Andrus brian.and...@inktank.com wrote: Setting your pg_num and pgp_num to say... 1024 would A) increase data granularity, B) likely lend no noticeable increase to resource consumption, and C) allow some room for future OSDs to be added while still within range of acceptable pg numbers. You could probably safely double even that number if you plan on expanding at a rapid rate and want to avoid splitting PGs every time a node is added. In general, you can conservatively err on the larger side when it comes to pg/p_num. Any excess resource utilization will be negligible (up to a certain point). If you have a comfortable amount of available RAM, you could experiment with increasing the multiplier in the equation you are using and see how it affects your final number. The pg_num and pgp_num parameters can safely be changed before or after your new nodes are integrated. I would be a bit conservative with the PGs / PGPs. I've experimented with the PG number a bit and noticed the following random IO performance drop. (This could be something specific to our setup, but since the PG count is easily increased and impossible to decrease, I would be conservative.) The setup: 3 OSD nodes with 128GB ram, 2 * 6 core CPUs (12 with ht). Nodes have 10 OSDs running on 1 tb disks and 2 SSDs for journals. We use a replica count of 3, so the optimum according to the formula is about 1000. With 1000 PGs I got about 2000-2500 random 4k IOPS. Because the nodes are fast enough and I expect the cluster to be expanded with 3 more nodes, I set the PGs to 2000. Performance dropped to about 1200-1400 IOPS. I noticed that the spinning disks were no longer maxing out at 100% usage. Memory and CPU did not seem to be a problem. Since I had the option to recreate the pool and I was not using the recommended settings, I did not really dive into the issue. 
I will not stray too far from the recommended settings in the future though :) Cheers, Robert van Leeuwen
Re: [ceph-users] how to enable rbd cache
Greg is right, you need to enable RBD admin sockets. This can be a bit tricky though, so here are a few tips: 1) In ceph.conf on the compute node, explicitly set a location for the admin socket: [client.volumes] admin socket = /var/run/ceph/rbd-$pid.asok In this example, libvirt/qemu is running with permissions from ceph.client.volumes.keyring. If you use something different, adjust accordingly. You can put this under a more generic [client] section, but there are some downsides (like a new admin socket for each ceph cli command). 2) Watch for permissions issues creating the admin socket at the path you used above. For me, I needed to explicitly grant some permissions in /etc/apparmor.d/abstractions/libvirt-qemu, specifically I had to add: # for rbd capability mknod, and # for rbd /etc/ceph/ceph.conf r, /var/log/ceph/* rw, /{,var/}run/ceph/** rw, 3) Be aware that if you have multiple rbd volumes attached to a single VM, you'll only get an admin socket for the volume mounted last. If you can set admin_socket via the libvirt xml for each volume, you can avoid this issue. This thread will explain better: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg16168.html 4) Once you get an RBD admin socket, query it like: ceph --admin-daemon /var/run/ceph/rbd-29050.asok config show | grep rbd Cheers, Mike Dawson On 11/25/2013 11:12 AM, Gregory Farnum wrote: On Mon, Nov 25, 2013 at 5:58 AM, Mark Nelson mark.nel...@inktank.com wrote: On 11/25/2013 07:21 AM, Shu, Xinxin wrote: Recently, I want to enable the rbd cache to identify the performance benefit. I added the rbd_cache=true option to my ceph configuration file and use 'virsh attach-device' to attach the rbd to a vm; below is my vdb xml file. Ceph configuration files are a bit confusing because sometimes you'll see something like rbd_cache listed somewhere, but in the ceph.conf file you'll want a space instead: rbd cache = true with no underscore. That should (hopefully) fix it for you! 
I believe the config file will take either format. The RBD cache is a client-side thing, though, so it's not ever going to show up in the OSD! You want to look at the admin socket created by QEMU (via librbd) to see if it's working. :) -Greg <disk type='network' device='disk'> <driver name='qemu' type='raw' cache='writeback'/> <source protocol='rbd' name='rbd/node12_2:rbd_cache=true:rbd_cache_writethrough_until_flush=true'/> <target dev='vdb' bus='virtio'/> <serial>6b5ff6f4-9f8c-4fe0-84d6-9d795967c7dd</serial> <address type='pci' domain='0x' bus='0x00' slot='0x06' function='0x0'/> </disk> I do not know if this is ok to enable rbd cache. I see perf counters for rbd cache in the source code, but when I used the admin daemon to check rbd cache statistics, ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump I did not get any rbd cache flags. My question is how to enable rbd cache and check the rbd cache perf counters, or how can I make sure rbd cache is enabled? Any tips will be appreciated. Thanks in advance. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Running on disks that lose their head
Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 11/7/2013 2:12 PM, Kyle Bader wrote: Once I know a drive has had a head failure, do I trust that the rest of the drive isn't going to go at an inconvenient moment vs just fixing it right now when it's not 3AM on Christmas morning? (true story) As good as Ceph is, do I trust that Ceph is smart enough to prevent spreading corrupt data all over the cluster if I leave bad disks in place and they start doing terrible things to the data? I have a lot more disks than I have trust in disks. If a drive lost a head then I want it gone. I love the idea of using smart data but can foresee some implementation issues. We have seen some raid configurations where polling smart will halt all raid operations momentarily. Also, some controllers require you to use their CLI tool to poll for smart vs smartmontools. It would be similarly awesome to embed something like an apdex score against each osd, especially if it factored in hierarchy to identify poorly performing osds, nodes, racks, etc.. Kyle, I think you are spot-on here. Apdex or similar scoring for gear performance is important for Ceph, IMO. Due to pseudo-random placement and replication, it can be quite difficult to identify 1) if hardware, software, or configuration are the cause of slowness, and 2) which hardware (if any) is slow. I recently discovered a method that seems to address both points. Zackc, Loicd, and I have been the main participants in a weekly Teuthology call the past few weeks. We've talked mostly about methods to extend Teuthology to capture performance metrics. Would you be willing to join us during the Teuthology and Ceph-Brag sessions at the Firefly Developer Summit? Cheers, Mike ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
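The Apdex idea Kyle and Mike discuss above can be sketched numerically. This is a hypothetical scoring helper, not anything Ceph ships; the 50 ms target and the 4x tolerating zone are assumed thresholds, following the standard Apdex formula (satisfied + tolerating/2) / total:

```python
# Hypothetical Apdex-style score for per-OSD op latencies.
# Apdex = (satisfied + tolerating / 2) / total samples, where
# "satisfied" ops finish under the target and "tolerating" ops
# finish within 4x the target (both thresholds are assumptions).
def apdex(latencies_ms, target_ms=50.0):
    """Return 1.0 when every op is under target, None for no samples."""
    if not latencies_ms:
        return None
    satisfied = sum(1 for l in latencies_ms if l <= target_ms)
    tolerating = sum(1 for l in latencies_ms if target_ms < l <= 4 * target_ms)
    return (satisfied + tolerating / 2.0) / len(latencies_ms)

if __name__ == "__main__":
    # 3 satisfied, 1 tolerating, 1 frustrated -> (3 + 0.5) / 5
    print(apdex([10, 20, 30, 120, 900], target_ms=50))  # 0.7
```

Aggregating such a score per OSD, per host, and per rack would give the hierarchy-aware view Kyle describes.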
Re: [ceph-users] Ceph User Committee
I also have time I could spend. Thanks for getting this started Loic! Thanks, Mike Dawson On 11/6/2013 12:35 PM, Loic Dachary wrote: Hi Ceph, I would like to open a discussion about organizing a Ceph User Committee. We briefly discussed the idea with Ross Turk, Patrick McGarry and Sage Weil today during the OpenStack summit. A pad was created and roughly summarizes the idea: http://pad.ceph.com/p/user-committee If there is enough interest, I'm willing to devote one day a week working for the Ceph User Committee. And yes, that includes sitting at the Ceph booth during the FOSDEM :-) And interviewing Ceph users and describing their use cases, which I enjoy very much. But also contribute to a user centric roadmap, which is what ultimately matters for the company I work for. If you'd like to see this happen but don't have time to participate in this discussion, please add your name + email at the end of the pad. What do you think ? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cluster performance
We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected. What SSDs were you using that were so slow? Cheers, Mike On 11/6/2013 12:39 PM, Dinu Vlad wrote: I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended? Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 500-600 iops max with 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander) is doing the same test with ~ 8k iops. I guess I'm good for replacing them. Removing the SSD drives from the setup and re-testing with ceph = 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each). On Nov 5, 2013, at 4:38 PM, Mark Nelson mark.nel...@inktank.com wrote: Ok, some more thoughts: 1) What kernel are you using? 2) Mixing SATA and SAS on an expander backplane can sometimes have bad effects. We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on solaris and it's not impossible linux may suffer too: http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html 3) If you are doing tests and look at disk throughput with something like collectl -sD -oT do the writes look balanced across the spinning disks? Do any devices have really high service times or queue times? 4) Also, after the test is done, you can try: find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo and then grep for duration in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node.
Once you identify a slow duration, you can go back and in an editor search for the slow duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations. 5) Something interesting here is that I've heard from another party that in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36 drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication). Mark On 11/05/2013 05:15 AM, Dinu Vlad wrote: Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! On Oct 31, 2013, at 6:35 PM, Dinu Vlad dinuvla...@gmail.com wrote: I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed default, with the same additions about xfs mount mkfs.xfs as before. 
With a single host, the pgs were stuck unclean (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 01:51 PM, Dinu Vlad wrote: Mark, The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS. The chasis is a SiliconMechanics C602 - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander. I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). here are the results (filtered): Sequential: Run status group 0 (all jobs):
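Mark's dump_historic_ops tip above can be automated instead of grepped by hand. A minimal sketch, assuming a top-level "Ops" list with "duration" and "description" fields (the real JSON layout of `dump_historic_ops` varies by Ceph release, so treat the field names as assumptions):

```python
import json

# Sketch: rank ops from `ceph --admin-daemon <sock> dump_historic_ops`
# by duration. Field names ("Ops", "duration", "description") are
# assumptions; adjust to the JSON your Ceph release emits.
def slowest_ops(dump_json, top=5):
    ops = json.loads(dump_json).get("Ops", [])
    return sorted(ops, key=lambda op: op.get("duration", 0), reverse=True)[:top]

# Illustrative stand-in for real admin-socket output.
sample = json.dumps({"Ops": [
    {"description": "osd_op(client.4123 write)", "duration": 0.004},
    {"description": "osd_op(client.4911 write)", "duration": 2.731},
    {"description": "osd_sub_op(replica write)", "duration": 0.150},
]})

for op in slowest_ops(sample, top=2):
    print(op["duration"], op["description"])
```

Run across every `*.asok` on a node, this gives the same "slowest operations" list as the grep, sorted and ready to compare between OSDs.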
Re: [ceph-users] ceph cluster performance
No, in our case flashing the firmware to the latest release cured the problem. If you build a new cluster with the slow SSDs, I'd be interested in the results of ioping[0] or fsync-tester[1]. I theorize that you may see spikes of high latency. [0] https://code.google.com/p/ioping/ [1] https://github.com/gregsfortytwo/fsync-tester Thanks, Mike Dawson On 11/6/2013 4:18 PM, Dinu Vlad wrote: ST240FN0021 connected via a SAS2x36 to a LSI 9207-8i. By fixed - you mean replaced the SSDs? Thanks, Dinu On Nov 6, 2013, at 10:25 PM, Mike Dawson mike.daw...@cloudapt.com wrote: We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected. What SSDs were you using that were so slow? Cheers, Mike On 11/6/2013 12:39 PM, Dinu Vlad wrote: I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended? Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 500-600 iops max with 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander) is doing the same test with ~ 8k iops. I guess I'm good for replacing them. Removing the SSD drives from the setup and re-testing with ceph = 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each). On Nov 5, 2013, at 4:38 PM, Mark Nelson mark.nel...@inktank.com wrote: Ok, some more thoughts: 1) What kernel are you using? 2) Mixing SATA and SAS on an expander backplane can sometimes have bad effects.
We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on solaris and it's not impossible linux may suffer too: http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html 3) If you are doing tests and look at disk throughput with something like collectl -sD -oT do the writes look balanced across the spinning disks? Do any devices have really high service times or queue times? 4) Also, after the test is done, you can try: find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo and then grep for duration in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node. Once you identify a slow duration, you can go back and in an editor search for the slow duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations. 5) Something interesting here is that I've heard from another party that in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36 drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication). Mark On 11/05/2013 05:15 AM, Dinu Vlad wrote: Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1) This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!! I'd appreciate any suggestion, where to look for the issue. Thanks! On Oct 31, 2013, at 6:35 PM, Dinu Vlad dinuvla...@gmail.com wrote: I tested the osd performance from a single node.
For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed default, with the same additions about xfs mount mkfs.xfs as before. With a single host, the pgs were stuck unclean (active only, not active+clean): # ceph -s cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062 health HEALTH_WARN 1800 pgs stuck unclean monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3 osdmap e101: 18 osds: 18 up, 18 in pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail mdsmap e1: 0/0/1 up Test results: Local test, 1 process, 16 threads: 241.7 MB/s Local test, 8 processes, 128 threads: 374.8 MB/s Remote test, 1 process, 16 threads: 231.8 MB/s Remote test, 8 processes, 128 threads: 366.1 MB/s Maybe it's just me, but it seems on the low side too. Thanks, Dinu On Oct 30, 2013, at 8:59 PM, Mark Nelson mark.nel...@inktank.com wrote: On 10/30/2013 01:51 PM, Dinu Vlad wrote: Mark, The SSDs
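The latency-spike test Mike proposes with ioping/fsync-tester can be approximated in a few lines. This is a minimal sketch in the spirit of fsync-tester, not the tool itself: it times repeated small write+fsync cycles on a file placed on the journal device, which is roughly the I/O pattern an osd journal generates:

```python
import os
import tempfile
import time

# Sketch in the spirit of fsync-tester: time write+fsync cycles to
# expose latency spikes on a journal SSD. Block size and iteration
# count are arbitrary choices, not tuned values.
def fsync_latencies(path, iterations=20, block=b"x" * 4096):
    """Return per-cycle write+fsync latencies in seconds."""
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(iterations):
            t0 = time.monotonic()
            os.write(fd, block)
            os.fsync(fd)
            latencies.append(time.monotonic() - t0)
    finally:
        os.close(fd)
    return latencies

if __name__ == "__main__":
    # Point this at a file on the SSD under test instead of a temp file.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        target = f.name
    lats = fsync_latencies(target)
    os.unlink(target)
    print("max fsync latency: %.6fs" % max(lats))
```

A healthy SSD should show a tight latency distribution; the 100x spikes Mike saw would show up as outliers in the max versus the median.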
Re: [ceph-users] Ceph health checkup
Narendra, This is an issue. You really want your cluster to be HEALTH_OK with all PGs active+clean. Some exceptions apply (like scrub / deep-scrub). What do 'ceph health detail' and 'ceph osd tree' show? Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 10/31/2013 6:53 PM, Trivedi, Narendra wrote: My Ceph cluster health checkup tells me the following. Should I be concerned? What's the remedy? What is missing? I issued this command from the monitor node. Please correct me if I am wrong, but I think the admin node's job is done after the installation unless I want to add additional OSD/MONs. [ceph@ceph-node1-mon-centos-6-4 ceph]$ sudo ceph health HEALTH_WARN 145 pgs degraded; 43 pgs down; 47 pgs peering; 76 pgs stale; 47 pgs stuck inactive; 76 pgs stuck stale; 192 pgs stuck unclean Thanks a lot in advance! Narendra This message contains information which may be confidential and/or privileged. Unless you are the intended recipient (or authorized to receive for the intended recipient), you may not read, use, copy or disclose to anyone the message or any information contained in the message. If you have received the message in error, please advise the sender by reply e-mail and delete the message and any attachment(s) thereto without retaining any copies. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
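A HEALTH_WARN summary like Narendra's packs several PG states into one line. A hypothetical helper (not part of the ceph CLI) can split it into per-state counts for easier triage:

```python
import re

# Hypothetical parser for a `ceph health` HEALTH_WARN summary line.
# Returns {pg_state: count}; the line format is the pre-Luminous
# "<N> pgs <state>; ..." style shown in the thread above.
def pg_warn_counts(health_line):
    return {state: int(n) for n, state in
            re.findall(r"(\d+) pgs ([a-z ]+?)(?:;|$)", health_line)}

line = ("HEALTH_WARN 145 pgs degraded; 43 pgs down; 47 pgs peering; "
        "76 pgs stale; 47 pgs stuck inactive; 76 pgs stuck stale; "
        "192 pgs stuck unclean")
counts = pg_warn_counts(line)
print(counts["stuck unclean"])  # 192
```

The down and stuck inactive buckets are the ones to chase first; 'ceph health detail' then names the exact PGs.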
Re: [ceph-users] How can I check the image's IO ?
Vernon, You can use the rbd command bench-write documented here: http://ceph.com/docs/next/man/8/rbd/#commands The command might look something like: rbd --pool test-pool bench-write --io-size 4096 --io-threads 16 --io-total 1GB test-image Some other interesting flags are --rbd-cache, --no-rbd-cache, and --io-pattern {seq|rand} Cheers, Mike On 10/30/2013 3:23 AM, vernon1987 wrote: Hi cephers, I use qemu-img create -f rbd rbd:test-pool/test-image to create an image. I want to know how I can check this image's IO. Or how to check the IO for each block? Thanks. 2013-10-30 vernon ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph monitor problems
Aaron, Don't mistake valid for advisable. For documentation purposes, three monitors is the advisable initial configuration for multi-node ceph clusters. If there is a valid need for more than three monitors, it is advisable to add them two at a time to maintain an odd number of total monitors. -Mike On 10/30/2013 4:46 PM, Aaron Ten Clay wrote: On Wed, Oct 30, 2013 at 1:43 PM, Joao Eduardo Luis joao.l...@inktank.com mailto:joao.l...@inktank.com wrote: A quorum of 2 monitors is completely fine as long as both monitors are up. A quorum is always possible regardless of how many monitors you have, as long as a majority is up and able to form it (1 out of 1, 2 out of 2, 2 out of 3, 3 out of 4, 3 out of 5, 4 out of 6,...). -Joao Joao, The page at http://ceph.com/docs/master/rados/operations/add-or-rm-mons/ only lists 1; 3 out of 5; 4 out of 6; etc.. Perhaps it should be updated if 2 out of 2 is a valid configuration? -Aaron ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
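The quorum rule Joao describes is simple majority arithmetic, which a few lines make concrete (a sketch of the rule, not of the actual monitor election code):

```python
# Sketch of the monitor quorum rule: a quorum needs a strict majority
# of the monmap, i.e. floor(n/2) + 1 monitors up.
def quorum_needed(num_mons):
    return num_mons // 2 + 1

for n in range(1, 7):
    print(n, "monitors -> quorum of", quorum_needed(n))
# Note that 2 monitors need both up (no failure tolerance), and even
# counts like 4 tolerate no more failures than 3 -- which is why an odd
# number of monitors is the advisable configuration.
```

This reproduces Joao's list: 1 of 1, 2 of 2, 2 of 3, 3 of 4, 3 of 5, 4 of 6.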
Re: [ceph-users] About use same SSD for OS and Journal
Kurt, When you had OS and osd journals co-located, how many osd journals were on the SSD containing the OS? You mention you now use a 5:1 ratio. Was the ratio something like 11:1 before (one SSD for OS plus 11 osd journals to 11 OSDs in a 12-disk chassis)? Also, what throughput per drive were you seeing on the cluster during the periods where things got laggy due to backfills, etc? Last, did you attempt to throttle using ceph config settings in the old setup? Do you need to throttle in your current setup? Thanks, Mike Dawson On 10/24/2013 10:40 AM, Kurt Bauer wrote: Hi, we had a setup like this and ran into trouble, so I would strongly discourage you from setting it up like this. Under normal circumstances there's no problem, but when the cluster is under heavy load, for example when it has a lot of pgs backfilling, for whatever reason (increasing num of pgs, adding OSDs,..), there's obviously a lot of entries written to the journals. What we saw then was extremely laggy behavior of the cluster, and when looking at the iostats of the SSD, they were at 100% most of the time. I don't exactly know what causes this and why the SSDs can't cope with the amount of IOs, but separating OS and journals did the trick. We now have quick 15k HDDs in Raid1 for OS and Monitor journal, and one SSD per 5 OSD journals, with one partition per journal (used as a raw partition). Hope that helps, best regards, Kurt Martin Catudal schrieb: Hi, Here is my scenario: I will have a small cluster (4 nodes) with 4 (4 TB) OSD's per node. I will have the OS installed on two SSDs in a raid 1 configuration. Have any of you successfully and efficiently run a Ceph cluster built with the journal on a separate partition on the OS SSDs? I know that a lot of IO may occur on the journal SSD and I'm scared my OS will suffer from too much IO. Any background experience?
Martin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
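The sizing question behind Kurt's 5:1 ratio comes down to whether one SSD can absorb the combined journal write stream of the OSDs behind it. A back-of-the-envelope sketch (the per-OSD throughput figure is an assumption for illustration, not a measured value):

```python
# Rough journal-SSD sizing sketch: every byte written to an OSD also
# hits its journal, so an SSD hosting j journals must sustain roughly
# j x the per-OSD write throughput during backfill-heavy periods.
def ssd_write_load_mb_s(osds_per_ssd, per_osd_write_mb_s):
    return osds_per_ssd * per_osd_write_mb_s

print(ssd_write_load_mb_s(5, 80))   # 5:1 ratio at an assumed 80 MB/s/OSD -> 400 MB/s
print(ssd_write_load_mb_s(11, 80))  # 11:1 ratio -> 880 MB/s, beyond many SATA SSDs
```

That gap between the two ratios is consistent with Kurt's experience: the co-located setup saturated the SSD under backfill, while 5:1 keeps the load inside what one SATA SSD can sustain.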
Re: [ceph-users] saucy salamander support?
For the time being, you can install the Raring debs on Saucy without issue. echo deb http://ceph.com/debian-dumpling/ raring main | sudo tee /etc/apt/sources.list.d/ceph.list I'd also like to register a +1 request for official builds targeted at Saucy. Cheers, Mike On 10/22/2013 11:42 AM, LaSalle, Jurvis wrote: Hi, I accidentally installed Saucy Salamander. Does the project have a timeframe for supporting this Ubuntu release? Thanks, JL ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Multiply OSDs per host strategy ?
Andrija, You can use a single pool and the proper CRUSH rule step chooseleaf firstn 0 type host to accomplish your goal. http://ceph.com/docs/master/rados/operations/crush-map/ Cheers, Mike Dawson On 10/16/2013 5:16 PM, Andrija Panic wrote: Hi, I have 2 x 2TB disks, in 3 servers, so total of 6 disks... I have deployed total of 6 OSDs. ie: host1 = osd.0 and osd.1 host2 = osd.2 and osd.3 host4 = osd.4 and osd.5 Now, since I will have total of 3 replica (original + 2 replicas), I want my replica placement to be such, that I don't end up having 2 replicas on 1 host (replica on osd0, osd1 (both on host1) and replica on osd2. I want all 3 replicas spread on different hosts... I know this is to be done via crush maps, but I'm not sure if it would be better to have 2 pools, 1 pool on osd0,2,4 and and another pool on osd1,3,5. If possible, I would want only 1 pool, spread across all 6 OSDs, but with data placement such, that I don't end up having 2 replicas on 1 host...not sure if this is possible at all... Is that possible, or maybe I should go for RAID0 in each server (2 x 2Tb = 4TB for osd0) or maybe JBOD (1 volume, so 1 OSD per host) ? Any suggesting about best practice ? Regards, -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
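For reference, a full rule using that step might look like the following sketch (the rule name and ruleset number are placeholders; adjust to your crushmap, then recompile and inject it):

```
rule rbd_host_spread {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```

With `chooseleaf firstn 0 type host`, CRUSH picks as many hosts as the pool's replica count and one OSD under each, so no two replicas ever share a host, while all six OSDs still serve the single pool.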
Re: [ceph-users] Ceph and RAID
Currently Ceph uses replication. Each pool is set with a replication factor. A replication factor of 1 obviously offers no redundancy. Replication factors of 2 or 3 are common. So, Ceph currently halves or thirds your usable storage, accordingly. Also, note you can co-mingle pools of various replication factors, so the actual math can get more complicated. There is a team of developers building an Erasure Coding backend for Ceph that will allow for more options. http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend http://wiki.ceph.com/01Planning/02Blueprints/Emperor/Erasure_coded_storage_backend_%28step_2%29 Initial release is scheduled for Ceph's Firefly release in February 2014. Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC On 10/3/2013 2:44 PM, Aronesty, Erik wrote: Does Ceph really halve your storage like that? If you specify N+1, does it really store two copies, or just compute checksums across MxN stripes? I guess Raid5+Ceph with a large array (12 disks say) would be not too bad (2.2TB for each 1). But it would be nicer, if I had 12 storage units in a single rack on a single network, for me to tell CEPH to stripe across them in a RAIDZ fashion, so that I'm only losing 10% of my storage to redundancy... not 50%. -Original Message- From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John-Paul Robinson Sent: Thursday, October 03, 2013 12:08 PM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph and RAID What is the take on such a configuration? Is it worth the effort of tracking rebalancing at two layers, RAID mirror and possibly Ceph if the pool has a redundancy policy? Or is it better to just let ceph rebalance itself when you lose a non-mirrored disk? If following the raid mirror approach, would you then skip redundancy at the ceph layer to keep your total overhead the same?
It seems that would be risky in the event you lose your storage server with the raid-1'd drives. No Ceph level redundancy would then be fatal. But if you do raid-1 plus ceph redundancy, doesn't that mean it takes 4TB for each 1 real TB? ~jpr On 10/02/2013 10:03 AM, Dimitri Maziuk wrote: I would consider (mdadm) raid-1, dep. on the hardware budget, because this way a single disk failure will not trigger a cluster-wide rebalance. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
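The raw-to-usable arithmetic the thread keeps circling can be written down once (a minimal sketch of the overhead math only; erasure coding, which changes this math, is not modeled):

```python
# Sketch of the raw-to-usable capacity math discussed above:
# n-way replication divides raw capacity by n, and stacking RAID-1
# underneath halves it again -- giving jpr's 4:1 overhead for
# RAID-1 + 2x replication.
def usable_tb(raw_tb, replication, raid_mirror=False):
    raw = raw_tb / 2.0 if raid_mirror else float(raw_tb)
    return raw / replication

print(usable_tb(12, 3))                    # 12 TB raw, 3x pool -> 4.0 TB usable
print(usable_tb(12, 2, raid_mirror=True))  # RAID-1 under a 2x pool -> 3.0 TB usable
```

The erasure-coded backend Mike mentions is exactly what replaces this divide-by-n with a k/(k+m) fraction, bringing overhead closer to the 10% Erik wants.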
Re: [ceph-users] RBD Snap removal priority
[cc ceph-devel] Travis, RBD doesn't behave well when Ceph maintenance operations create spindle contention (i.e. 100% util from iostat). More about that below. Do you run XFS under your OSDs? If so, can you check for extent fragmentation? Should be something like: xfs_db -c frag -r /dev/sdb1 We recently saw fragmentation factors of over 80%, with lots of ino's having hundreds of extents. After 24 hours+ of defrag'ing, we got it under control, but we're seeing the fragmentation factor grow by ~1.5% daily. We experienced spindle contention issues even after the defrag. Sage, Sam, etc, I think the real issue is Ceph has several states where it performs what I would call maintenance operations that saturate the underlying storage without properly yielding to client i/o (which should have a higher priority). I have experienced or seen reports of Ceph maintenance affecting rbd client i/o in many ways: - QEMU/RBD Client I/O Stalls or Halts Due to Spindle Contention from Ceph Maintenance [1] - Recovery and/or Backfill Cause QEMU/RBD Reads to Hang [2] - rbd snap rm (Travis' report below) [1] http://tracker.ceph.com/issues/6278 [2] http://tracker.ceph.com/issues/6333 I think this family of issues speaks to the need for Ceph to have more visibility into the underlying storage's limitations (especially spindle contention) when performing known expensive maintenance operations. Thanks, Mike Dawson On 9/27/2013 12:25 PM, Travis Rhoden wrote: Hello everyone, I'm running a Cuttlefish cluster that hosts a lot of RBDs. I recently removed a snapshot of a large one (rbd snap rm -- 12TB), and I noticed that all of the clients had markedly decreased performance. Looking at iostat on the OSD nodes had most disks pegged at 100% util. I know there are thread priorities that can be set for clients vs recovery, but I'm not sure what deleting a snapshot falls under. I couldn't really find anything relevant. Is there anything I can tweak to lower the priority of such an operation?
I didn't need it to complete fast, as rbd snap rm returns immediately and the actual deletion is done asynchronously. I'd be fine with it taking longer at a lower priority, but as it stands now it brings my cluster to a crawl and is causing issues with several VMs. I see an osd snap trim thread timeout option in the docs -- Is the operation occurring here what you would call snap trimming? If so, any chance of adding an option for osd snap trim priority just like there is for osd client op and osd recovery op? Hope what I am saying makes sense... - Travis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Pause i/o from time to time
You could be suffering from a known, but unfixed issue [1] where spindle contention from scrub and deep-scrub cause periodic stalls in RBD. You can try to disable scrub and deep-scrub with: # ceph osd set noscrub # ceph osd set nodeep-scrub If your problem stops, Issue #6278 is likely the cause. To re-enable scrub and deep-scrub: # ceph osd unset noscrub # ceph osd unset nodeep-scrub Because you seem to only have two OSDs, you may also be saturating your disks even without scrub or deep-scrub. http://tracker.ceph.com/issues/6278 Cheers, Mike Dawson On 9/16/2013 12:30 PM, Timofey wrote: I use ceph for HA-cluster. Some time ceph rbd go to have pause in work (stop i/o operations). Sometime it can be when one of OSD slow response to requests. Sometime it can be my mistake (xfs_freeze -f for one of OSD-drive). I have 2 storage servers with one osd on each. This pauses can be few minutes. 1. Is any settings for fast change primary osd if current osd work bad (slow, don't response). 2. Can I use ceph-rbd in software raid-array with local drive, for use local drive instead of ceph if ceph cluster fail? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] status of glance/cinder/nova integration in openstack grizzly
Darren, I can confirm Copy on Write (show_image_direct_url = True) does work in Grizzly. It sounds like you are close. To check permissions, run 'ceph auth list', and reply with client.images and client.volumes (or whatever keys you use in Glance and Cinder). Cheers, Mike Dawson On 9/10/2013 10:12 AM, Darren Birkett wrote: Hi All, tl;dr - does glance/rbd and cinder/rbd play together nicely in grizzly? I'm currently testing a ceph/rados back end with an openstack installation. I have the following things working OK: 1. cinder configured to create volumes in RBD 2. nova configured to boot from RBD backed cinder volumes (libvirt UUID secret set etc) 3. glance configured to use RBD as a back end store for images With this setup, when I create a bootable volume in cinder, passing an id of an image in glance, the image gets downloaded, converted to raw, and then created as an RBD object and made available to cinder. The correct metadata field for the cinder volume is populated (volume_image_metadata) and so the cinder client marks the volume as bootable. This is all fine. If I want to take advantage of the fact that both glance images and cinder volumes are stored in RBD, I can add the following flag to the glance-api.conf: show_image_direct_url = True This enables cinder to see that the glance image is stored in RBD, and the cinder rbd driver then, instead of downloading the image and creating an RBD image from it, just issues an 'rbd clone' command (seen in the cinder-volume.log): rbd clone --pool images --image dcb2f16d-a09d-4064-9198-1965274e214d --snap snap --dest-pool volumes --dest volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d This is all very nice, and the cinder volume is available immediately as you'd expect. The problem is that the metadata field is not populated so it's not seen as bootable. Even manually populating this field leaves the volume unbootable. The volume can not even be attached to another instance for inspection. 
libvirt doesn't seem to be able to access the rbd device. From nova-compute.log: qemu-system-x86_64: -drive file=rbd:volumes/volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d:id=volumes:key=AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==:auth_supported=cephx\;none,if=none,id=drive-virtio-disk0,format=raw,serial=20987f9d-b4fb-463d-8b8f-fa667bd47c6d,cache=none: error reading header from volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d qemu-system-x86_64: -drive file=rbd:volumes/volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d:id=volumes:key=AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==:auth_supported=cephx\;none,if=none,id=drive-virtio-disk0,format=raw,serial=20987f9d-b4fb-463d-8b8f-fa667bd47c6d,cache=none: could not open disk image rbd:volumes/volume-20987f9d-b4fb-463d-8b8f-fa667bd47c6d:id=volumes:key=AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==:auth_supported=cephx\;none: Operation not permitted It's almost like a permission issue, but my ceph/rbd knowledge is still fledgeling. I know that the cinder rbd driver has been rewritten to use librbd in havana, and I'm wondering if this will change any of this behaviour? I'm also wondering if anyone has actually got this working with grizzly, and how? Many thanks Darren ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] status of glance/cinder/nova integration in openstack grizzly
On 9/10/2013 4:50 PM, Darren Birkett wrote: Hi Mike, That led me to realise what the issue was. My cinder (volumes) client did not have the correct perms on the images pool. I ran the following to update the perms for that client: ceph auth caps client.volumes mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images' ...and was then able to successfully boot an instance from a cinder volume that was created by cloning a glance image from the images pool! Glad you found it. This has been a sticking point for several people. One last question: I presume the fact that the 'volume_image_metadata' field is not populated when cloning a glance image into a cinder volume is a bug? It means that the cinder client doesn't show the volume as bootable, though I'm not sure what other detrimental effect it actually has (clearly the volume can be booted from). I think you are talking about data in the cinder table of your database backend (mysql?). I don't have 'volume_image_metadata' at all here. I don't think this is the issue. To create a Cinder volume from Glance, I do something like: cinder --os_tenant_name MyTenantName create --image-id 00e0042e-d007-400a-918a-d5e00cea8b0f --display-name MyVolumeName 40 I can then spin up an instance backed by MyVolumeName and boot as expected. Thanks Darren On 10 September 2013 21:04, Darren Birkett darren.birk...@gmail.com mailto:darren.birk...@gmail.com wrote: Hi Mike, Thanks - glad to hear it definitely works as expected! 
Here's my client.glance and client.volumes from 'ceph auth list':

client.glance
    key: AQAWFi9SOKzAABAAPV1ZrpWkx72tmJ5E7nOi3A==
    caps: [mon] allow r
    caps: [osd] allow rwx pool=images, allow class-read object_prefix rbd_children
client.volumes
    key: AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==
    caps: [mon] allow r
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=volumes

Thanks Darren On 10 September 2013 20:08, Mike Dawson mike.daw...@cloudapt.com mailto:mike.daw...@cloudapt.com wrote: Darren, I can confirm Copy on Write (show_image_direct_url = True) does work in Grizzly. It sounds like you are close. To check permissions, run 'ceph auth list', and reply with client.images and client.volumes (or whatever keys you use in Glance and Cinder). Cheers, Mike Dawson
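The permission gap behind Darren's 'Operation not permitted' can be checked mechanically before touching libvirt. A minimal sketch, assuming 'ceph auth list' output in the shape quoted in this thread — the sample below is text copied from the thread, not a live query; on a real node you would pipe `ceph auth list` in instead:

```shell
# Check whether client.volumes has any grant on the images pool.
# Sample output copied from this thread (assumption: your live
# 'ceph auth list' output has the same shape).
auth_list=$(cat <<'EOF'
client.volumes
    key: AQAnAy9ScPB4IRAAtxD/V1rDciqFiT9AMPPr+A==
    caps: [mon] allow r
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=volumes
EOF
)

# Isolate the osd caps line for client.volumes, then look for pool=images.
caps=$(printf '%s\n' "$auth_list" | sed -n '/^client\.volumes/,/^client\./p' | grep '\[osd\]')
if printf '%s\n' "$caps" | grep -q 'pool=images'; then
    echo "client.volumes can reach the images pool"
else
    echo "client.volumes has NO grant on the images pool"
fi
```

Darren's eventual fix was re-issuing the caps with an extra `allow rx pool=images` grant, after which a check like this would report access.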
Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling
Sam and Oliver, We've had tons of issues with Dumpling rbd volumes showing sporadic periods of high latency for Windows guests doing lots of small writes. We saw the issue occasionally with Cuttlefish, but it got significantly worse with Dumpling. Initial results with wip-dumpling-perf2 appear very promising. Thanks for your work! I'll report back tomorrow if I have any new results. Thanks, Mike Dawson Co-Founder Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/29/2013 2:52 PM, Oliver Daudey wrote: Hey Mark and list, FYI for you and the list: Samuel and I seem to have found and fixed the remaining performance-problems. For those who can't wait, fixes are in wip-dumpling-perf2 and will probably be in the next point-release. Regards, Oliver On 27-08-13 17:13, Mark Nelson wrote: Ok, definitely let us know how it goes! For what it's worth, I'm testing Sam's wip-dumpling-perf branch with the wbthrottle code disabled now and comparing it both to that same branch with it enabled along with 0.67.1. Don't have any perf data, but quite a bit of other data to look through, both in terms of RADOS bench and RBD. Mark On 08/27/2013 10:07 AM, Oliver Daudey wrote: Hey Mark, That will take a day or so for me to know with enough certainty. With the low CPU-usage and preliminary results today, I'm confident enough to upgrade all OSDs in production and test the cluster all-Dumpling tomorrow. For now, I only upgraded a single OSD and measured CPU-usage and whatever performance-effects that had on the cluster, so if I would lose that OSD, I could recover. :-) Will get back to you. Regards, Oliver On 27-08-13 15:04, Mark Nelson wrote: Hi Olver/Matthew, Ignoring CPU usage, has speed remained slower as well? Mark On 08/27/2013 03:08 AM, Oliver Daudey wrote: Hey Samuel, The PGLog::check() is now no longer visible in profiling, so it helped for that. Unfortunately, it doesn't seem to have helped to bring down the OSD's CPU-loading much. 
Leveldb still uses much more than in Cuttlefish. On my test-cluster, I didn't notice any difference in the RBD bench-results, either, so I have to assume that it didn't help performance much. Here's the `perf top' I took just now on my production-cluster with your new version, under regular load. Also note the memcmp and memcpy, which also don't show up when running a Cuttlefish-OSD:

 15.65%  [kernel]              [k] intel_idle
  7.20%  libleveldb.so.1.9     [.] 0x3ceae
  6.28%  libc-2.11.3.so        [.] memcmp
  5.22%  [kernel]              [k] find_busiest_group
  3.92%  kvm                   [.] 0x2cf006
  2.40%  libleveldb.so.1.9     [.] leveldb::InternalKeyComparator::Compar
  1.95%  [kernel]              [k] _raw_spin_lock
  1.69%  [kernel]              [k] default_send_IPI_mask_sequence_phys
  1.46%  libc-2.11.3.so        [.] memcpy
  1.17%  libleveldb.so.1.9     [.] leveldb::Block::Iter::Next()
  1.16%  [kernel]              [k] hrtimer_interrupt
  1.07%  [kernel]              [k] native_write_cr0
  1.01%  [kernel]              [k] __hrtimer_start_range_ns
  1.00%  [kernel]              [k] clockevents_program_event
  0.93%  [kernel]              [k] find_next_bit
  0.93%  libstdc++.so.6.0.13   [.] std::string::_M_mutate(unsigned long,
  0.89%  [kernel]              [k] cpumask_next_and
  0.87%  [kernel]              [k] __schedule
  0.85%  [kernel]              [k] _raw_spin_unlock_irqrestore
  0.85%  [kernel]              [k] do_select
  0.84%  [kernel]              [k] apic_timer_interrupt
  0.80%  [kernel]              [k] fget_light
  0.79%  [kernel]              [k] native_write_msr_safe
  0.76%  [kernel]              [k] _raw_spin_lock_irqsave
  0.66%  libc-2.11.3.so        [.] 0xdc6d8
  0.61%  libpthread-2.11.3.so  [.] pthread_mutex_lock
  0.61%  [kernel]              [k] tg_load_down
  0.59%  [kernel]              [k] reschedule_interrupt
  0.59%  libsnappy.so.1.1.2    [.] snappy::RawUncompress(snappy::Source*,
  0.56%  libstdc++.so.6.0.13   [.] std::string::append(char const*, unsig
  0.54%  [kvm_intel]           [k] vmx_vcpu_run
  0.53%  [kernel]              [k] copy_user_generic_string
  0.53%  [kernel]              [k] load_balance
  0.50%  [kernel]              [k] rcu_needs_cpu
  0.45%  [kernel]              [k] fput

Regards, Oliver On Mon, 2013-08-26 at 23:33 -0700, Samuel Just wrote: I just pushed a patch to wip-dumpling-log-assert (based on current dumpling head).
I had disabled most of the code in PGLog::check() but left an (I thought) innocuous assert. It seems that with (at least) g
Re: [ceph-users] Openstack glance ceph rbd_store_user authentification problem
Steffan, It works for me. I have:

user@node:/etc/ceph# cat /etc/glance/glance-api.conf | grep rbd
default_store = rbd
# glance.store.rbd.Store,
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_user = images
rbd_store_pool = images
rbd_store_chunk_size = 4

Thanks, Mike Dawson On 8/8/2013 9:01 AM, Steffen Thorhauer wrote: Hi, recently I had a problem with openstack glance and ceph. I used the http://ceph.com/docs/master/rbd/rbd-openstack/#configuring-glance documentation and the http://docs.openstack.org/developer/glance/configuring.html documentation. I'm using ubuntu 12.04 LTS with grizzly from Ubuntu Cloud Archive and ceph 0.61.7. glance-api.conf had the following config options:

default_store = rbd
rbd_store_user=images
rbd_store_pool = images
rbd_store_ceph_conf = /etc/ceph/ceph.conf

Every time I did a glance image create, I got errors. In the glance api log I only found errors like:

2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images Traceback (most recent call last):
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File /usr/lib/python2.7/dist-packages/glance/api/v1/images.py, line 444, in _upload
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images     image_meta['size'])
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File /usr/lib/python2.7/dist-packages/glance/store/rbd.py, line 241, in add
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images     with rados.Rados(conffile=self.conf_file, rados_id=self.user) as conn:
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File /usr/lib/python2.7/dist-packages/rados.py, line 134, in __enter__
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images     self.connect()
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File /usr/lib/python2.7/dist-packages/rados.py, line 192, in connect
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images     raise make_ex(ret, error calling connect)
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images ObjectNotFound: error calling connect

This trace message helped
me not very much :-( My google search for "glance.api.v1.images ObjectNotFound: error calling connect" only found http://irclogs.ceph.widodh.nl/index.php?date=2012-10-26 This pointed me to a ceph authentication problem. But the ceph tools worked fine for me. Then I tried the debug option in glance-api.conf and found the following entries:

DEBUG glance.common.config [-] rbd_store_pool = images log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1485
DEBUG glance.common.config [-] rbd_store_user = glance log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1485

The glance-api service did not use my rbd_store_user = images option!! Then I configured a client.glance auth and it worked with the implicit glance user!!! Now my question: Am I the only one with this problem?? Regards, Steffen Thorhauer
Re: [ceph-users] how to recover the osd.
Looks like you didn't get osd.0 deployed properly. Can you show: - ls /var/lib/ceph/osd/ceph-0 - cat /etc/ceph/ceph.conf Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/8/2013 9:13 AM, Suresh Sadhu wrote: HI, My storage cluster health is in warning state; one of the osds is down and even if I try to start the osd it fails to start.

sadhu@ubuntu3:~$ ceph osd stat
e22: 2 osds: 1 up, 1 in

sadhu@ubuntu3:~$ ls /var/lib/ceph/osd/
ceph-0  ceph-1

sadhu@ubuntu3:~$ ceph osd tree
# id    weight    type name       up/down reweight
-1      0.14      root default
-2      0.14          host ubuntu3
0       0.06999           osd.0   down    0
1       0.06999           osd.1   up      1

sadhu@ubuntu3:~$ sudo /etc/init.d/ceph -a start 0
/etc/init.d/ceph: 0. not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )
sadhu@ubuntu3:~$ sudo /etc/init.d/ceph -a start osd.0
/etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

Ceph health status in warning mode:

pg 4.10 is active+degraded, acting [1]
pg 3.17 is active+degraded, acting [1]
pg 5.16 is active+degraded, acting [1]
pg 4.17 is active+degraded, acting [1]
pg 3.10 is active+degraded, acting [1]
recovery 62/124 degraded (50.000%)
mds.ceph@ubuntu3 at 10.147.41.3:6803/2148 is laggy/unresponsi

regards sadhu
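For reference, with a sysvinit/mkcephfs-style layout the "osd.0 not found (/etc/ceph/ceph.conf defines , ...)" error usually means the init script cannot map osd.0 to a host via ceph.conf. A minimal sketch of the missing stanza — the hostname is taken from the `ceph osd tree` output above, and whether this matches the actual deployment method is an assumption:

```ini
[osd.0]
    host = ubuntu3
```

With ceph-deploy-style clusters the daemon list instead comes from marker files in the OSD data directory (e.g. /var/lib/ceph/osd/ceph-0), so this stanza may not apply there.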
Re: [ceph-users] Large storage nodes - best practices
On 8/5/2013 12:51 PM, Brian Candler wrote: On 05/08/2013 17:15, Mike Dawson wrote: Short answer: Ceph generally is used with multiple OSDs per node. One OSD per storage drive with no RAID is the most common setup. At 24- or 36-drives per chassis, there are several potential bottlenecks to consider. Mark Nelson, the Ceph performance guy at Inktank, has published several articles you should consider reading. A few of interest are [0], [1], and [2]. The last link is a 5-part series. Yep, I saw [0] and [1]. He tries a 6-disk RAID0 array and generally gets lower throughput than 6 x JBOD disks (although I think he's using the controller RAID0 functionality, rather than mdraid). AFAICS he has a 36-disk chassis but only runs tests with 6 disks, which is a shame as it would be nice to know which other bottleneck you could hit first with this type of setup. The third link I sent shows Mark's results with 24 spinners and 8 SSDs for journals. Specifically read: http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/#setup Florian Haas has also published some thoughts on bottlenecks: http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals Also, note that there is on-going work to add erasure coding as an optional backend (as opposed to the current replication scheme). If you prioritize bulk storage over performance, you may be interested in following the progress [3], [4], and [5].
[0]: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[1]: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
[2]: http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/
[3]: http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
[4]: http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
[5]: http://www.inktank.com/about-inktank/roadmap/

Thank you - erasure coding is very much of interest for the archival-type storage I'm looking at. However your links [3] and [4] are identical, did you mean to link to another one? Oops. http://wiki.ceph.com/01Planning/02Blueprints/Emperor/Erasure_coded_storage_backend_%28step_2%29 Cheers, Brian.
Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]
Josh, Logs are uploaded to cephdrop with the file name mikedawson-rbd-qemu-deadlock. - At about 2013-08-05 19:46 or 47, we hit the issue, traffic went to 0 - At about 2013-08-05 19:53:51, ran a 'virsh screenshot' Environment is: - Ceph 0.61.7 (client is co-mingled with three OSDs) - rbd cache = true and cache=writeback - qemu 1.4.0 1.4.0+dfsg-1expubuntu4 - Ubuntu Raring with 3.8.0-25-generic This issue is reproducible in my environment, and I'm willing to run any wip branch you need. What else can I provide to help? Thanks, Mike Dawson On 8/5/2013 3:48 AM, Stefan Hajnoczi wrote: On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote: Am 02.08.2013 um 23:47 schrieb Mike Dawson mike.daw...@cloudapt.com: We can un-wedge the guest by opening a NoVNC session or running a 'virsh screenshot' command. After that, the guest resumes and runs as expected. At that point we can examine the guest. Each time we'll see: If virsh screenshot works then this confirms that QEMU itself is still responding. Its main loop cannot be blocked since it was able to process the screendump command. This supports Josh's theory that a callback is not being invoked. The virtio-blk I/O request would be left in a pending state. Now here is where the behavior varies between configurations: On a Windows guest with 1 vCPU, you may see the symptom that the guest no longer responds to ping. On a Linux guest with multiple vCPUs, you may see the hung task message from the guest kernel because other vCPUs are still making progress. Just the vCPU that issued the I/O request and whose task is in UNINTERRUPTIBLE state would really be stuck. Basically, the symptoms depend not just on how QEMU is behaving but also on the guest kernel and how many vCPUs you have configured. I think this can explain how both problems you are observing, Oliver and Mike, are a result of the same bug. At least I hope they are :). 
Stefan
Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process
Oliver, We've had a similar situation occur. For about three months, we've run several Windows 2008 R2 guests with virtio drivers that record video surveillance. We have long suffered an issue where the guest appears to hang indefinitely (or until we intervene). For the sake of this conversation, we call this state wedged, because it appears something (rbd, qemu, virtio, etc) gets stuck on a deadlock. When a guest gets wedged, we see the following:

- the guest will not respond to pings
- the qemu-system-x86_64 process drops to 0% cpu
- graphite graphs show the interface traffic dropping to 0bps
- the guest will stay wedged forever (or until we intervene)
- strace of qemu-system-x86_64 shows QEMU is making progress [1][2]

We can un-wedge the guest by opening a NoVNC session or running a 'virsh screenshot' command. After that, the guest resumes and runs as expected. At that point we can examine the guest. Each time we'll see:

- No Windows error logs whatsoever while the guest is wedged
- A time sync typically occurs right after the guest gets un-wedged
- Scheduled tasks do not run while wedged
- Windows error logs do not show any evidence of suspend, sleep, etc

We had so many issues with guests becoming wedged, we wrote a script to 'virsh screenshot' them via cron. Then we installed some updates and had a month or so of higher stability (wedging happened maybe 1/10th as often). Until today we couldn't figure out why. Yesterday, I realized qemu was starting the instances without specifying cache=writeback. We corrected that, and let them run overnight. With RBD writeback re-enabled, wedging came back as often as we had seen in the past. I've counted ~40 occurrences in the past 12-hour period. So I feel like writeback caching in RBD certainly makes the deadlock more likely to occur.
Joshd asked us to gather RBD client logs: joshd it could very well be the writeback cache not doing a callback at some point - if you could gather logs of a vm getting stuck with debug rbd = 20, debug ms = 1, and debug objectcacher = 30 that would be great We'll do that over the weekend. If you could as well, we'd love the help! [1] http://www.gammacode.com/kvm/wedged-with-timestamps.txt [2] http://www.gammacode.com/kvm/not-wedged.txt Thanks, Mike Dawson Co-Founder Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/2/2013 6:22 AM, Oliver Francke wrote: Well, I believe, I'm the winner of buzzwords-bingo for today. But seriously speaking... as I don't have this particular problem with qcow2 with kernel 3.2 nor qemu-1.2.2 nor newer kernels, I hope I'm not alone here? We have a raising number of tickets from people reinstalling from ISO's with 3.2-kernel. Fast fallback is to start all VM's with qemu-1.2.2, but we then lose some features ala latency-free-RBD-cache ;) I just opened a bug for qemu per: https://bugs.launchpad.net/qemu/+bug/1207686 with all dirty details. Installing a backport-kernel 3.9.x or upgrade Ubuntu-kernel to 3.8.x fixes it. So we have a bad combination for all distros with 3.2-kernel and rbd as storage-backend, I assume. Any similar findings? Any idea of tracing/debugging ( Josh? ;) ) very welcome, Oliver. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
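The debug levels Josh asked for can be staged ahead of the next wedge by setting them on the client side of ceph.conf on the compute node. A sketch — the log file path and the $name/$pid metavariables are assumptions, adjust for your own logging setup:

```ini
[client]
    debug rbd = 20
    debug ms = 1
    debug objectcacher = 30
    log file = /var/log/ceph/client.$name.$pid.log
```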
Re: [ceph-users] Why is my mon store.db is 220GB?
220GB is way, way too big. I suspect your monitors need to go through a successful leveldb compaction. The early releases of Cuttlefish suffered several issues with store.db growing unbounded. Most were fixed by 0.61.5, I believe. You may have luck stopping all Ceph daemons, then starting the monitor by itself. When there were bugs, leveldb compaction tended to work better without OSD traffic hitting the monitors. Also, there are some settings to force a compact on startup like 'mon compact on start = true' and 'mon compact on trim = true'. I don't think either are required anymore, though. See some history here: http://tracker.ceph.com/issues/4895 Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/1/2013 6:52 PM, Jeppesen, Nelson wrote: My Mon store.db has been at 220GB for a few months now. Why is this and how can I fix it? I have one monitor in this cluster and I suspect that I can’t add monitors to the cluster because it is too big. Thank you.
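For reference, the two compaction settings mentioned live in the [mon] section of ceph.conf; a sketch (as noted above, neither should normally be required on a fixed point release):

```ini
[mon]
    mon compact on start = true
    mon compact on trim = true
```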
Re: [ceph-users] Defective ceph startup script
Greg, You can check the currently running version (and much more) using the admin socket: http://ceph.com/docs/master/rados/operations/monitoring/#using-the-admin-socket For me, this looks like:

# ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok version
{"version":"0.61.7"}
# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version
{"version":"0.61.7"}

Also, I use 'service ceph restart' on Ubuntu 13.04 running a mkcephfs deployment. It may be different when using ceph-deploy. Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 7/31/2013 2:51 PM, Greg Chavez wrote: I am running on Ubuntu 13.04. There is something amiss with /etc/init.d/ceph on all of my ceph nodes. I was upgrading to 0.61.7 from what I *thought* was 0.61.5 today when I realized that service ceph-all restart wasn't actually doing anything. I saw nothing in /var/log/ceph.log - it just kept printing pg statuses - and the PIDs of the osd and mon daemons did not change. Stops failed as well. Then, when I tried to do individual osd restarts like this:

root@kvm-cs-sn-14i:/var/lib/ceph/osd# service ceph -v status osd.10
/etc/init.d/ceph: osd.10 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

Despite the fact that I have this directory: /var/lib/ceph/osd/ceph-10/. I have the same issue with mon restarts:

root@kvm-cs-sn-14i:/var/lib/ceph/mon# ls
ceph-kvm-cs-sn-14i
root@kvm-cs-sn-14i:/var/lib/ceph/mon# service ceph -v status mon.kvm-cs-sn-14i
/etc/init.d/ceph: mon.kvm-cs-sn-14i not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

I'm very worried that I have all my packages at 0.61.7 while my osd and mon daemons could be running as old as 0.61.1! Can anyone help me figure this out? Thanks.
-- \*..+.- --Greg Chavez +//..;};
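To check every daemon on a node in one go, the same admin-socket query can be looped over whatever sockets exist; a sketch assuming the default /var/run/ceph socket directory (on a machine with no running Ceph daemons it just reports that nothing was found):

```shell
# Query the running version of each local Ceph daemon via its admin socket.
found=0
for sock in /var/run/ceph/*.asok; do
    [ -S "$sock" ] || continue   # skips the unexpanded glob when no sockets exist
    found=1
    printf '%s: ' "$sock"
    ceph --admin-daemon "$sock" version
done
[ "$found" -eq 1 ] || echo "no admin sockets found on this node"
```

Comparing the reported versions against the installed package version shows immediately whether a restart actually took effect.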
Re: [ceph-users] Production/Non-production segmentation
Greg, IMO the most critical risks when running Ceph are bugs that affect daemon stability and the upgrade process. Due to the speed of releases in the Ceph project, I feel having separate physical hardware is the safer way to go, especially in light of your mention of an SLA for your production services. A separate non-production cluster will allow you to test and validate new versions (including point releases within a stable series) before you attempt to upgrade your production cluster. Cheers, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 7/31/2013 10:47 AM, Greg Poirier wrote: Does anyone here have multiple clusters or segment their single cluster in such a way as to try to maintain different SLAs for production vs non-production services? We have been toying with the idea of running separate clusters (on the same hardware, but reserve a portion of the OSDs for the production cluster), but I'd rather have a single cluster in order to more evenly distribute load across all of the spindles. Thoughts or observations from people with Ceph in production would be greatly appreciated. Greg
Re: [ceph-users] Production/Non-production segmentation
On 7/31/2013 3:34 PM, Greg Poirier wrote: On Wed, Jul 31, 2013 at 12:19 PM, Mike Dawson mike.daw...@cloudapt.com mailto:mike.daw...@cloudapt.com wrote: Due to the speed of releases in the Ceph project, I feel having separate physical hardware is the safer way to go, especially in light of your mention of an SLA for your production services. Ah. I guess I should offer a little more background as to what I mean by production vs. non-production: customer-facing, and not. That makes more sense. We're using Ceph primarily for volume storage with OpenStack at the moment and operate two OS clusters: one for all of our customer-facing services (which require a higher SLA) and one for all of our internal services. The idea being that all of the customer-facing stuff is segmented physically from anything our developers might be testing internally. What I'm wondering: Does anyone else here do this? Have you looked at Ceph Pools? I think you may find they address many of your concerns while maintaining a single cluster. If so, do you run multiple Ceph clusters? Do you let Ceph sort itself out? Can this be done with a single physical cluster, but multiple logical clusters? Should it be? I know that, mathematically speaking, the larger your Ceph cluster is, the more evenly distributed the load (thanks to CRUSH). I'm wondering if, in practice, RBD can still create hotspots (say from a runaway service with multiple instances and volumes that is suddenly doing a ton of IO). This would increase IO latency across the Ceph cluster, I'd assume, and could impact the performance of customer-facing services. So, to some degree, physical segmentation makes sense to me. But can we simply reserve some OSDs per physical host for a production logical cluster and then use the rest for the development logical cluster (separate MON clusters for each, but all running on the same hardware). Or, given a sufficiently large cluster, is this not even a concern? 
I'm also interested in hearing about experience using CephFS, Swift, and RBD all on a single cluster or if people have chosen to use multiple clusters for these as well. For example, if you need faster volume storage in RBD, so you go for more spindles and smaller disks vs. larger disks with fewer spindles for object storage, which can have a higher allowance for latency than volume storage. See the response from Greg F. from Inktank to a similar question: http://comments.gmane.org/gmane.comp.file-systems.ceph.user/2090 A separate non-production cluster will allow you to test and validate new versions (including point releases within a stable series) before you attempt to upgrade your production cluster. Oh yeah. I'm doing that for sure. Thanks, Greg
Re: [ceph-users] Cinder volume creation issues
You can specify the uuid in the secret.xml file like:

<secret ephemeral='no' private='no'>
  <uuid>bdf77f5d-bf0b-1053-5f56-cd76b32520dc</uuid>
  <usage type='ceph'>
    <name>client.volumes secret</name>
  </usage>
</secret>

Then use that same uuid on all machines in cinder.conf:

rbd_secret_uuid=bdf77f5d-bf0b-1053-5f56-cd76b32520dc

Also, the column you are referring to in the OpenStack Dashboard lists the machine running the Cinder APIs, not specifically the server hosting the storage. Like Greg stated, Ceph stripes the storage across your cluster. Fix your uuids and cinder.conf and you'll be moving in the right direction. Cheers, Mike On 7/26/2013 1:32 PM, johnu wrote: Greg, :) I am not getting where the mistake in the configuration was. virsh secret-define gave different secrets:

sudo virsh secret-define --file secret.xml
(uuid of secret is output here)
sudo virsh secret-set-value --secret {uuid of secret} --base64 $(cat client.volumes.key)

On Fri, Jul 26, 2013 at 10:16 AM, Gregory Farnum g...@inktank.com mailto:g...@inktank.com wrote: On Fri, Jul 26, 2013 at 10:11 AM, johnu johnugeorge...@gmail.com mailto:johnugeorge...@gmail.com wrote: Greg, Yes, the outputs match Nope, they don't. :) You need the secret_uuid to be the same on each node, because OpenStack is generating configuration snippets on one node (which contain these secrets) and then shipping them to another node where they're actually used. Your secrets are also different despite having the same rbd user specified, so that's broken too; not quite sure how you got there...
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

master node:
ceph auth get-key client.volumes
AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==
virsh secret-get-value bdf77f5d-bf0b-1053-5f56-cd76b32520dc
AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==
/etc/cinder/cinder.conf:
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=volumes
glance_api_version=2
rbd_user=volumes
rbd_secret_uuid=bdf77f5d-bf0b-1053-5f56-cd76b32520dc

slave1 /etc/cinder/cinder.conf:
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=volumes
glance_api_version=2
rbd_user=volumes
rbd_secret_uuid=62d0b384-50ad-2e17-15ed-66bfeda40252
virsh secret-get-value 62d0b384-50ad-2e17-15ed-66bfeda40252
AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==

slave2 /etc/cinder/cinder.conf:
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=volumes
glance_api_version=2
rbd_user=volumes
rbd_secret_uuid=33651ba9-5145-1fda-3e61-df6a5e6051f5
virsh secret-get-value 33651ba9-5145-1fda-3e61-df6a5e6051f5
AQC/ze1R2EOWNBAAmLUE4U7zO1KafZ/CzVVTqQ==

Yes, Openstack horizon is showing the same host for all volumes. Somehow, if a volume is attached to an instance lying on the same host, it works; otherwise, it doesn't. Might be a coincidence. And I am surprised that no one else has seen or reported this issue. Any idea? On Fri, Jul 26, 2013 at 9:45 AM, Gregory Farnum g...@inktank.com mailto:g...@inktank.com wrote: On Fri, Jul 26, 2013 at 9:35 AM, johnu johnugeorge...@gmail.com mailto:johnugeorge...@gmail.com wrote: Greg, I verified in all cluster nodes that rbd_secret_uuid is the same as in virsh secret-list. And if I do virsh secret-get-value of this uuid, I get back the auth key for client.volumes. What did you mean by same configuration? Did you mean same secret for all compute nodes? If you run virsh secret-get-value with that rbd_secret_uuid on each compute node, does it return the right secret for client.volumes? When we log in as admin, there is a column in the admin panel which gives the 'host' where the volumes lie.
I know that volumes are striped across the cluster, but it gives the same host for all volumes. That is why I got a little confused.

That's not something you can get out of the RBD stack itself; is this something that OpenStack is showing you? I suspect it's just making up information to fit some API expectations, but somebody more familiar with the OpenStack guts can probably chime in. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
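Greg's fix (one shared secret UUID on every compute node) can be sketched end to end. A hedged sketch: the UUID is the example from this thread (generate your own with uuidgen), client.volumes.key holding the Cinder key is an assumption, and the virsh/cinder steps are shown as comments since they need a live libvirt and Ceph setup:

```shell
# One UUID, generated once, reused verbatim on every compute node.
UUID=bdf77f5d-bf0b-1053-5f56-cd76b32520dc   # example from the thread

cat > secret.xml <<EOF
<secret ephemeral='no' private='no'>
  <uuid>$UUID</uuid>
  <usage type='ceph'>
    <name>client.volumes secret</name>
  </usage>
</secret>
EOF

# On each compute node (commented; needs libvirt and the client.volumes key):
#   sudo virsh secret-define --file secret.xml
#   sudo virsh secret-set-value --secret $UUID --base64 $(cat client.volumes.key)
# And in /etc/cinder/cinder.conf on every node:
#   rbd_secret_uuid=$UUID

grep "<uuid>" secret.xml
```

The point is that secret-define must consume the same secret.xml (same uuid element) everywhere, rather than letting each node mint its own UUID.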
Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4
Darryl, I've seen this issue a few times recently. I believe Joao was looking into it at one point, but I don't know if it has been resolved (Any news Joao?). Others have run into it too. Look closely at: http://tracker.ceph.com/issues/4999 http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15 I'd recommend you submit this as a bug on the tracker. It sounds like you have reliable quorum between a and b, that's good. The workaround that has worked for me is to remove mon.c, then re-add it. Assuming your monitor leveldb stores aren't too large, the process is rather quick. Follow the instructions at: http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors then http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors - Mike On 6/25/2013 10:34 PM, Darryl Bond wrote: Upgrading a cluster from 6.1.3 to 6.1.4 with 3 monitors. Cluster had been successfully upgraded from bobtail to cuttlefish and then from 6.1.2 to 6.1.3. There have been no changes to ceph.conf. Node mon.a upgrade, a,b,c monitors OK after upgrade Node mon.b upgrade a,b monitors OK after upgrade (note that c was not available, even though I hadn't touched it) Node mon.c very slow to install the upgrade, RAM was tight for some reason and mon process was using half the RAM Node mon.c shutdown mon.c Node mon.c performed the upgrade Node mon.c restart ceph - mon.c will not start service ceph start mon.c === mon.c === Starting Ceph mon.c on ceph3... [23992]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf ' Starting ceph-create-keys on ceph3... 
health HEALTH_WARN 1 mons down, quorum 0,1 a,b monmap e1: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14224, quorum 0,1 a,b osdmap e1342: 18 osds: 18 up, 18 in pgmap v4058788: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s mdsmap e1: 0/0/1 up Set debug mon = 20 Nothing going into logs other than assertion--- begin dump of recent events --- 0 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal (Aborted) ** in thread 7fd5e81b57c0 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404) 1: /usr/bin/ceph-mon() [0x596fe2] 2: (()+0xf000) [0x7fd5e782] 3: (gsignal()+0x35) [0x7fd5e619fba5] 4: (abort()+0x148) [0x7fd5e61a1358] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d] 6: (()+0x5eeb6) [0x7fd5e6a97eb6] 7: (()+0x5eee3) [0x7fd5e6a97ee3] 8: (()+0x5f10e) [0x7fd5e6a9810e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x64a6aa] 10: /usr/bin/ceph-mon() [0x65f916] 11: /usr/bin/ceph-mon() [0x6960e9] 12: (pick_addresses(CephContext*)+0x8d) [0x69624d] 13: (main()+0x1a8a) [0x49786a] 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05] 15: /usr/bin/ceph-mon() [0x499a69] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 
--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 20/20 mon 0/10 monc 0/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 hadoop 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mon.c.log --- end dump of recent events --- The contents of this electronic message and any attachments are intended only for the addressee and may contain legally privileged, personal, sensitive or confidential information. If you are not the intended addressee, and have received this email, any transmission, distribution, downloading, printing or photocopying of the contents of this message or attachments is strictly prohibited. Any legal privilege or confidentiality attached to this message and attachments is not waived, lost or destroyed by reason of delivery to any person other than intended addressee. If you have received this message and are not the intended addressee you should notify the sender by return email and destroy all copies of the message and any attachments. Unless expressly
Re: [ceph-users] One monitor won't start after upgrade from 6.1.3 to 6.1.4
I've typically moved it off to a non-conflicting path in lieu of deleting it outright, but either way should work. IIRC, I used something like:

sudo mv /var/lib/ceph/mon/ceph-c /var/lib/ceph/mon/ceph-c-bak
sudo mkdir /var/lib/ceph/mon/ceph-c

- Mike

On 6/25/2013 11:08 PM, Darryl Bond wrote: Thanks for your prompt response. Given that my mon.c /var/lib/ceph/mon/ceph-c is currently populated, should I delete its contents after removing the monitor and before re-adding it? Darryl On 06/26/13 12:50, Mike Dawson wrote: Darryl, I've seen this issue a few times recently. I believe Joao was looking into it at one point, but I don't know if it has been resolved (Any news Joao?). Others have run into it too. Look closely at: http://tracker.ceph.com/issues/4999 http://irclogs.ceph.widodh.nl/index.php?date=2013-06-07 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-27 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-25 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-21 http://irclogs.ceph.widodh.nl/index.php?date=2013-05-15 I'd recommend you submit this as a bug on the tracker. It sounds like you have reliable quorum between a and b, that's good. The workaround that has worked for me is to remove mon.c, then re-add it. Assuming your monitor leveldb stores aren't too large, the process is rather quick. Follow the instructions at: http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#removing-monitors then http://ceph.com/docs/next/rados/operations/add-or-rm-mons/#adding-monitors - Mike On 6/25/2013 10:34 PM, Darryl Bond wrote: Upgrading a cluster from 6.1.3 to 6.1.4 with 3 monitors. Cluster had been successfully upgraded from bobtail to cuttlefish and then from 6.1.2 to 6.1.3. There have been no changes to ceph.conf.
Node mon.a upgrade, a,b,c monitors OK after upgrade Node mon.b upgrade a,b monitors OK after upgrade (note that c was not available, even though I hadn't touched it) Node mon.c very slow to install the upgrade, RAM was tight for some reason and mon process was using half the RAM Node mon.c shutdown mon.c Node mon.c performed the upgrade Node mon.c restart ceph - mon.c will not start service ceph start mon.c === mon.c === Starting Ceph mon.c on ceph3... [23992]: (33) Numerical argument out of domain failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf ' Starting ceph-create-keys on ceph3... health HEALTH_WARN 1 mons down, quorum 0,1 a,b monmap e1: 3 mons at {a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0}, election epoch 14224, quorum 0,1 a,b osdmap e1342: 18 osds: 18 up, 18 in pgmap v4058788: 5448 pgs: 5447 active+clean, 1 active+clean+scrubbing+deep; 5820 GB data, 11673 GB used, 35464 GB / 47137 GB avail; 813B/s rd, 643KB/s wr, 69op/s mdsmap e1: 0/0/1 up Set debug mon = 20 Nothing going into logs other than assertion--- begin dump of recent events --- 0 2013-06-26 12:20:36.383430 7fd5e81b57c0 -1 *** Caught signal (Aborted) ** in thread 7fd5e81b57c0 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404) 1: /usr/bin/ceph-mon() [0x596fe2] 2: (()+0xf000) [0x7fd5e782] 3: (gsignal()+0x35) [0x7fd5e619fba5] 4: (abort()+0x148) [0x7fd5e61a1358] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5e6a99e1d] 6: (()+0x5eeb6) [0x7fd5e6a97eb6] 7: (()+0x5eee3) [0x7fd5e6a97ee3] 8: (()+0x5f10e) [0x7fd5e6a9810e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x64a6aa] 10: /usr/bin/ceph-mon() [0x65f916] 11: /usr/bin/ceph-mon() [0x6960e9] 12: (pick_addresses(CephContext*)+0x8d) [0x69624d] 13: (main()+0x1a8a) [0x49786a] 14: (__libc_start_main()+0xf5) [0x7fd5e618ba05] 15: /usr/bin/ceph-mon() [0x499a69] NOTE: a copy of the executable, or `objdump -rdS executable` is 
needed to interpret this.
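Condensing the exchange above into one place, a hedged sketch of the remove/re-add workaround. The directory shuffle is dry-run in a scratch location here so it is safe to execute (on a real node the base is /var/lib/ceph/mon, and the moves need sudo); the cluster-facing commands are commented because they need a live quorum:

```shell
# Scratch stand-in for /var/lib/ceph/mon so the directory shuffle is safe to run.
BASE=$(mktemp -d)
MON=c
STORE="$BASE/ceph-$MON"
mkdir -p "$STORE" && touch "$STORE/store.db"   # stand-in for the old mon store

#   service ceph stop mon.$MON
#   ceph mon remove $MON                # drop mon.c from the monmap
mv "$STORE" "$STORE-bak"                # keep the old store as a fallback
mkdir -p "$STORE"                       # fresh, empty data dir
#   ceph mon getmap -o /tmp/monmap
#   ceph auth get mon. -o /tmp/mon.keyring
#   ceph-mon -i $MON --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
#   service ceph start mon.$MON         # mon.c rejoins and syncs from a,b

ls "$BASE"
```

Keeping the -bak copy around, as Mike suggests, costs only disk space and gives you something to fall back on if the re-add goes sideways.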
Re: [ceph-users] Multi Rack Reference architecture
Behind a registration form, but iirc, this is likely what you are looking for: http://www.inktank.com/resource/dreamcompute-architecture-blueprint/ - Mike On 5/31/2013 3:26 AM, Gandalf Corvotempesta wrote: In reference architecture PDF, downloadable from your website, there was some reference to a multi rack architecture described in another doc. Is this paper available ?
Re: [ceph-users] mon IO usage
Sylvain, I can confirm I see a similar traffic pattern. Any time I have lots of writes going to my cluster (like heavy writes from RBD or remapping/backfilling after losing an OSD), I see all sorts of monitor issues. If my monitor leveldb store.db directories grow past some unknown point (maybe ~1GB or so), 'compact on trim' is insufficiently slow. The store.db grows faster than compact can trim the garbage. After that point, the only hope to rein in the store.db size is to stop the OSDs and get leveldb to compact without any ongoing writes. I sent Sage and Joao a transaction dump of the growth yesterday. Sage looked, but the files are so large it is tough to get useful info. http://tracker.ceph.com/issues/4895 I believe this issue has existed since 0.48. - Mike On 5/21/2013 8:16 AM, Sylvain Munaut wrote: Hi, I've just added some monitoring to the IO usage of mon (trying to track down that growing mon issue), and I'm kind of surprised by the amount of IO generated by the monitor process. I get continuous 4 Mo/s / 75 iops with added big spikes at each compaction every 3 min or so. Is there a description somewhere of what the monitor does exactly ? I mean the monmap / pgmap / osdmap / mdsmap / election epoch don't change that often (pgmap is like 1 per second and that's the fastest change by several orders of magnitude). So what exactly does the monitor do with all that IO ??? Cheers, Sylvain ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
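For reference, the compaction behavior discussed above is tunable on the monitor side. A hedged ceph.conf fragment, assuming the cuttlefish-era option names:

```ini
[mon]
    # assumption: option names as spelled in cuttlefish-era releases
    # compact the whole leveldb store every time the monitor starts
    mon compact on start = true
    # compact trimmed prefixes as old paxos states are discarded
    mon compact on trim = true
```

If memory serves, builds of this vintage also accept an online compaction command (ceph tell mon.{id} compact), which avoids a daemon restart; treat that as an assumption and check it against your version.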
Re: [ceph-users] Running Ceph issues: HEALTH_WARN, unknown auth protocol, others
Wyatt, A few notes:

- Yes, the second "host = ceph" under mon.a is redundant and should be deleted.
- "auth client required = cephx [osd]" should be simply "auth client required = cephx"; the stray "[osd]" fused onto that line is what triggers the "unknown auth protocol" warning, and it should start its own section.
- Looks like you only have one OSD. You need at least as many (and hopefully more) OSDs than the highest replication level of your pools.

Mike

On 5/1/2013 12:23 PM, Wyatt Gorman wrote: Here is my ceph.conf. I just figured out that the second "host =" isn't necessary, though it is like that on the 5-minute quick start guide... (Perhaps I'll submit my couple of fixes that I've had to implement so far). That fixes the redefined host issue, but none of the others.

[global]
    # For version 0.55 and beyond, you must explicitly enable or
    # disable authentication with auth entries in [global].
    auth cluster required = cephx
    auth service required = cephx
    auth client required = cephx

[osd]
    osd journal size = 1000
    # The following assumes ext4 filesystem.
    filestore xattr use omap = true
    # For Bobtail (v 0.56) and subsequent versions, you may add
    # settings for mkcephfs so that it will create and mount the file
    # system on a particular OSD for you. Remove the comment `#`
    # character for the following settings and replace the values in
    # braces with appropriate values, or leave the following settings
    # commented out to accept the default values. You must specify
    # the --mkfs option with mkcephfs in order for the deployment
    # script to utilize the following settings, and you must define
    # the 'devs' option for each osd instance; see below.
    # osd mkfs type = {fs-type}
    # osd mkfs options {fs-type} = {mkfs options}   # default for xfs is -f
    # osd mount options {fs-type} = {mount options} # default mount option is rw,noatime
    # For example, for ext4, the mount option might look like this:
    # osd mkfs options ext4 = user_xattr,rw,noatime

    # Execute $ hostname to retrieve the name of your host, and
    # replace {hostname} with the name of your host. For the
    # monitor, replace {ip-address} with the IP address of your
    # host.

[mon.a]
    host = ceph
    mon addr = 10.81.2.100:6789

[osd.0]
    host = ceph
    # For Bobtail (v 0.56) and subsequent versions, you may add
    # settings for mkcephfs so that it will create and mount the
    # file system on a particular OSD for you. Remove the comment
    # `#` character for the following setting for each OSD and
    # specify a path to the device if you use mkcephfs with the
    # --mkfs option.
    # devs = {path-to-device}

[osd.1]
    host = ceph
    # devs = {path-to-device}

[mds.a]
    host = ceph

On Wed, May 1, 2013 at 12:14 PM, Mike Dawson mike.daw...@scholarstack.com wrote: Wyatt, Please post your ceph.conf. - mike On 5/1/2013 12:06 PM, Wyatt Gorman wrote: Hi everyone, I'm setting up a test ceph cluster and am having trouble getting it running (great for testing, huh?). I went through the installation on Debian squeeze, had to modify the mkcephfs script a bit because it calls monmaptool with too many parameters in the $args variable (mine had --add a [ip address]:[port] [osd1] and I had to get rid of the [osd1] part for the monmaptool command to take it). Anyway, so I got it installed, started the service, waited a little while for it to build the fs, and ran ceph health and got (and am still getting after a day and a reboot) the following error: (note: I have also been getting the first line in various calls, unsure why it is complaining, I followed the instructions...)

warning: line 34: 'host' in section 'mon.a' redefined
2013-05-01 12:04:39.801102 b733b710 -1 WARNING: unknown auth protocol defined: [osd]
HEALTH_WARN 384 pgs degraded; 384 pgs stuck unclean; recovery 21/42 degraded (50.000%)

Can anybody tell me the root of this issue, and how I can fix it? Thank you!
- Wyatt Gorman
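Putting Mike's notes into practice, the cleaned-up conf would look something like this (a sketch of only the affected sections, not a verified configuration):

```ini
[global]
    auth cluster required = cephx
    auth service required = cephx
    # the stray "[osd]" fused onto this line now starts its own section
    auth client required = cephx

[osd]
    osd journal size = 1000
    filestore xattr use omap = true

[mon.a]
    # exactly one "host =" line per section
    host = ceph
    mon addr = 10.81.2.100:6789

[osd.0]
    host = ceph

[osd.1]
    host = ceph

[mds.a]
    host = ceph
```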
Re: [ceph-users] cuttlefish countdown -- OSD doesn't get marked out
Sage, I confirm this issue. The requested info is listed below. *Note that due to the pre-Cuttlefish monitor sync issues, this deployment has been running three monitors (mon.b and mon.c working properly in quorum. mon.a stuck forever synchronizing). For the past two hours, no OSD processes have been running on any host, yet some OSDs are still marked as up. http://www.gammacode.com/upload/ceph-osd-tree The mon* sections of ceph.conf are: [mon] debug mon = 20 debug paxos = 20 debug ms = 1 [mon.a] host = node2 mon addr = 10.1.0.3:6789 [mon.b] host = node26 mon addr = 10.1.0.67:6789 [mon.c] host = node49 mon addr = 10.1.0.130:6789 root@controller1:~# ceph -s health HEALTH_WARN 43 pgs degraded; 13308 pgs peering; 27932 pgs stale; 13308 pgs stuck inactive; 27932 pgs stuck stale; 13582 pgs stuck unclean; recovery 7264/7986546 degraded (0.091%); 47/66 in osds are down; 1 mons down, quorum 1,2 b,c monmap e1: 3 mons at {a=10.1.0.3:6789/0,b=10.1.0.67:6789/0,c=10.1.0.130:6789/0}, election epoch 1428, quorum 1,2 b,c osdmap e1323: 66 osds: 19 up, 66 in pgmap v427324: 28864 pgs: 257 active+clean, 231 stale+active, 15025 stale+active+clean, 675 peering, 12633 stale+peering, 43 stale+active+degraded; 448 GB data, 1402 GB used, 178 TB / 180 TB avail; 7264/7986546 degraded (0.091%) mdsmap e1: 0/0/1 up For reference, this is ceph version 0.60-666-ga5cade1 (a5cade1fe7338602fb2bbfa867433d825f337c87) from gitbuilder. Thanks, Mike On 4/25/2013 12:17 PM, Sage Weil wrote: On Thu, 25 Apr 2013, Martin Mailand wrote: Hi, if I shutdown an OSD, the OSD gets marked down after 20 seconds, after 300 seconds the osd should get marked out, an the cluster should resync. But that doesn't happened, the OSD stays in the status down/in forever, therefore the cluster stays forever degraded. I can reproduce it with a new installed cluster. If I manually set the osd out (ceph osd out 1), the cluster resync starts immediately. 
I think thats a release critical bug, because the cluster health is not automatically recovered. What is the output from 'ceph osd tree' and the contents of your [mon*] sections of ceph.conf? Thanks! sage And I reported this behavior a while ago http://article.gmane.org/gmane.comp.file-systems.ceph.user/603/ -martin Log: root@store1:~# ceph -s health HEALTH_OK monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c osdmap e204: 24 osds: 24 up, 24 in pgmap v106709: 5056 pgs: 5056 active+clean; 526 GB data, 1068 GB used, 173 TB / 174 TB avail mdsmap e1: 0/0/1 up root@store1:~# ceph --version ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c) root@store1:~# /etc/init.d/ceph stop osd.1 === osd.1 === Stopping Ceph osd.1 on store1...bash: warning: setlocale: LC_ALL: cannot change locale (en_GB.utf8) kill 5492...done root@store1:~# ceph -s health HEALTH_OK monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c osdmap e204: 24 osds: 24 up, 24 in pgmap v106709: 5056 pgs: 5056 active+clean; 526 GB data, 1068 GB used, 173 TB / 174 TB avail mdsmap e1: 0/0/1 up root@store1:~# date -R Thu, 25 Apr 2013 13:09:54 +0200 root@store1:~# ceph -s date -R health HEALTH_WARN 423 pgs degraded; 423 pgs stuck unclean; recovery 10999/269486 degraded (4.081%); 1/24 in osds are down monmap e1: 3 mons at {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c osdmap e206: 24 osds: 23 up, 24 in pgmap v106715: 5056 pgs: 4633 active+clean, 423 active+degraded; 526 GB data, 1068 GB used, 173 TB / 174 TB avail; 10999/269486 degraded (4.081%) mdsmap e1: 0/0/1 up Thu, 25 Apr 2013 13:10:14 +0200 root@store1:~# ceph -s date -R health HEALTH_WARN 423 pgs degraded; 423 pgs stuck unclean; recovery 10999/269486 degraded (4.081%); 1/24 in osds are down monmap e1: 3 mons at 
{a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0}, election epoch 82, quorum 0,1,2 a,b,c osdmap e206: 24 osds: 23 up, 24 in pgmap v106719: 5056 pgs: 4633 active+clean, 423 active+degraded; 526 GB data, 1068 GB used, 173 TB / 174 TB avail; 10999/269486 degraded (4.081%) mdsmap e1: 0/0/1 up Thu, 25 Apr 2013 13:23:01 +0200 On 25.04.2013 01:46, Sage Weil wrote: Hi everyone- We are down to a handful of urgent bugs (3!) and a cuttlefish release date that is less than a week away. Thank you to everyone who has been involved in coding, testing, and stabilizing this release. We are close! If you would like to test the current release candidate, your efforts would be much appreciated! For deb systems, you can do wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc' | sudo apt-key add - echo deb
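For reference, the mark-out behavior Martin expects is driven by a mon-side timer. A hedged ceph.conf fragment, assuming the cuttlefish-era option name and default:

```ini
[mon]
    # default 300 s: how long an OSD may stay down before it is
    # automatically marked out (0 would disable auto mark-out)
    mon osd down out interval = 300
```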
Re: [ceph-users] Crushmap doesn't match osd tree
Mike, I use a process like:

crushtool -c new-crushmap.txt -o new-crushmap
ceph osd setcrushmap -i new-crushmap

I did not attempt to validate your crush map. If that command fails, I would scrutinize your crushmap for validity/correctness. Once you have the new crushmap injected, you can do something like:

ceph osd crush move ec02sv35 root=default datacenter=site-hd room=room-CR3.11391 rack=rack-9.41933-pehdpw09a

- Mike

On 4/25/2013 6:11 AM, Mike Bryant wrote: Hi, On version 0.56.4, I'm having a problem with my crush map. The output of osd tree is:

# id    weight  type name    up/down  reweight
0       0       osd.0        up       1
1       0       osd.1        up       1
2       0       osd.2        up       1
3       0       osd.3        up       1
4       0       osd.4        up       1
5       0       osd.5        up       1

But there are buckets set in the crush map (attached). How can I fix this? Editing the crush map and doing setcrushmap doesn't appear to change anything. Cheers Mike
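The compile/inject cycle above can be made concrete. A hedged sketch: the toy decompiled map below uses the old (bobtail/cuttlefish-era) text format with made-up IDs and only one host, and the cluster-facing commands are commented since they need crushtool and a live cluster; note the nonzero item weights, since CRUSH places no data on weight-0 items:

```shell
# A minimal decompiled crushmap, in the text format crushtool -c expects.
cat > crushmap.txt <<'EOF'
# devices
device 0 osd.0
# types
type 0 osd
type 1 host
type 2 root
# buckets
host ec02sv35 {
        id -2
        alg straw
        hash 0
        item osd.0 weight 1.000
}
root default {
        id -1
        alg straw
        hash 0
        item ec02sv35 weight 1.000
}
# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
EOF

#   ceph osd getcrushmap -o crushmap.bin       # fetch the live map
#   crushtool -d crushmap.bin -o crushmap.txt  # decompile, edit, then:
#   crushtool -c crushmap.txt -o crushmap.new  # compile; errors out on a bad map
#   ceph osd setcrushmap -i crushmap.new       # inject into the cluster

grep -c '^root default' crushmap.txt
```

Running the decompile/compile round trip before injecting is a cheap sanity check: if crushtool -c rejects the text, setcrushmap would silently get you nowhere.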
Re: [ceph-users] Monitor Access Denied message to itself?
Greg, Looks like Sage has a fix for this problem. In case it matters, I have seen a few cases that conflict with your notes in this thread and the bug report. I have seen the bug exclusively on new Ceph installs (without upgrading from bobtail), so it is not isolated to upgrades. Further, I have seen it on test deployments with a single monitor, so it doesn't seem to be limited to deployments with a leader and followers. Thanks for getting this bug moving forward. Thanks, Mike On 4/18/2013 6:23 PM, Gregory Farnum wrote: There's a little bit of python called ceph-create-keys, which is invoked by the upstart scripts. You can kill the running processes, and edit them out of the scripts, without direct harm. (Their purpose is to create some standard keys which the newer deployment tools rely on to do things like create OSDs, etc.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Apr 18, 2013 at 3:20 PM, Matthew Roy imjustmatt...@gmail.com wrote: On 04/18/2013 06:03 PM, Joao Eduardo Luis wrote: There's definitely some command messages being forwarded, but AFAICT they're being forwarded to the monitor, not by the monitor, which by itself is a good omen towards the monitor being the leader :-) In any case, nothing in the trace's code path indicates we could be a peon, unless the monitor itself believed to be the leader. If you take a closer look, you'll see that we come from 'handle_last()', which is bound to happen only on the leader (we'll assert otherwise). For the monitor to be receiving these messages it must mean the peons believe him to be the leader -- or we have so many bugs going around that it's just madness! In all seriousness, when I was chasing after this bug, Matthew sent me his logs with higher debug levels -- no craziness going around :-) -Joao Is there a way to tell who's being denied? Even if it's just log pollution I'd like to know which client is misconfigured.
There are similar messages in all the mon logs: mon.a: 2013-04-18 18:16:51.254378 7fc7c6d10700 1 -- [2001:470:8:dd9::20]:6789/0 -- [2001:470:8:dd9::21]:6789/0 -- route(mon_command_ack([auth,get-or-create,client.admin,mon,allow *,osd,allow *,mds,allow]=-13 access denied v775211) v1 tid 8867608) v2 -- ?+0 0x7fc61a18b160 con 0x253f700 mon.b: 2013-04-18 18:16:49.670758 7f37c7afa700 20 -- [2001:470:8:dd9::21]:6789/0 [2001:470:8:dd9::21]:0/22372 pipe(0x7f383c070b70 sd=90 :6789 s=2 pgs=1 cs=1 l=1).writer encoding 7 0x7f37f49876a0 mon_command_ack([auth,get-or-create,client.admin,mon,allow *,osd,allow *,mds,allow]=-13 access denied v775209) v1 (mon.c was removed since the first log file in the thread) mon.d: 2013-04-18 18:16:51.304897 7f927d40f700 1 -- [2001:470:8:dd9:7271:bcff:febd:e398]:6789/0 -- client.? [2001:470:8:dd9::21]:0/26333 -- mon_command_ack([auth,get-or-create,client.admin,mon,allow *,osd,allow *,mds,allow]=-13 access denied v775211) v1 -- ?+0 0x7f923c0230a0 The spacing on these messages is about 0.001s so there's a lot of them going around. All these systems are running 0.60-472-g327002e Matthew -- Matthew ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Monitor Access Denied message to itself?
Matthew, I have seen the same behavior on 0.59. Ran through some troubleshooting with Dan and Joao on March 21st and 22nd, but I haven't looked at it since then. If you look at running processes, I believe you'll see an instance of ceph-create-keys start each time you start a Monitor. So, if you restart the monitor several times, you'll have several ceph-create-keys processes piling up, essentially leaking processes. IIRC, the tmp files you see in /etc/ceph correspond with the ceph-create-keys PID. Can you confirm that's what you are seeing? I haven't looked in a couple weeks, but I hope to start 0.60 later today. - Mike On 4/8/2013 12:43 AM, Matthew Roy wrote: I'm seeing weird messages in my monitor logs that don't correlate to admin activity: 2013-04-07 22:54:11.528871 7f2e9e6c8700 1 -- [2001:something::20]:6789/0 -- [2001:something::20]:0/1920 -- mon_command_ack([auth,get-or-create,client.admin,mon,allow *,osd,allow *,mds,allow]=-13 access denied v134192) v1 -- ?+0 0x37bfc00 con 0x3716840 It's also writing out a bunch of empty files along the lines of ceph.client.admin.keyring.1008.tmp in /etc/ceph/ Could this be related to the mon running "Starting ceph-create-keys" when starting? This could be the cause of, or just associated with, some general instability of the monitor cluster.
After increasing the logging level I did catch one crash: ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c) 1: /usr/bin/ceph-mon() [0x5834fa] 2: (()+0xfcb0) [0x7f4b03328cb0] 3: (gsignal()+0x35) [0x7f4b01efe425] 4: (abort()+0x17b) [0x7f4b01f01b8b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f4b0285069d] 6: (()+0xb5846) [0x7f4b0284e846] 7: (()+0xb5873) [0x7f4b0284e873] 8: (()+0xb596e) [0x7f4b0284e96e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x636c8f] 10: (PaxosService::propose_pending()+0x46d) [0x4dee3d] 11: (MDSMonitor::tick()+0x1c62) [0x51cdd2] 12: (MDSMonitor::on_active()+0x1a) [0x512ada] 13: (PaxosService::_active()+0x31d) [0x4e067d] 14: (Context::complete(int)+0xa) [0x4b7b4a] 15: (finish_contexts(CephContext*, std::listContext*, std::allocatorContext* , int)+0x95) [0x4ba5a5] 16: (Paxos::handle_last(MMonPaxos*)+0xbef) [0x4da92f] 17: (Paxos::dispatch(PaxosServiceMessage*)+0x26b) [0x4dad8b] 18: (Monitor::_ms_dispatch(Message*)+0x149f) [0x4b310f] 19: (Monitor::ms_dispatch(Message*)+0x32) [0x4c9d12] 20: (DispatchQueue::entry()+0x341) [0x698da1] 21: (DispatchQueue::DispatchThread::entry()+0xd) [0x626c5d] 22: (()+0x7e9a) [0x7f4b03320e9a] 23: (clone()+0x6d) [0x7f4b01fbbcbd] The complete log is at: http://goo.gl/UmNs3 Does anyone recognize what's going on? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
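A quick way to check for the leaked processes described above (a sketch; pgrep from procps and the tmp-file path from Matthew's report are the assumptions):

```shell
# Count ceph-create-keys processes; one can pile up per monitor restart,
# per the thread. A count of 0 is healthy.
COUNT=$(pgrep -fc ceph-create-keys || true)
COUNT=${COUNT:-0}
echo "ceph-create-keys processes: $COUNT"

# Cleanup (commented; needs root):
#   sudo pkill -f ceph-create-keys
#   sudo rm -f /etc/ceph/ceph.client.admin.keyring.*.tmp
```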