Re: OSD and MON memory usage
On Wed, 28 Nov 2012 13:00:17 -0800 Samuel Just <sam.j...@inktank.com> wrote:
> What replication level are you using?

Hi,

The replication level is 3.

Thanks

Cláudio
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD and MON memory usage
On Wed, 28 Nov 2012 13:08:08 -0800 Samuel Just <sam.j...@inktank.com> wrote:
> Can you post the output of ceph -s?

'ceph -s' right now gives:

   health HEALTH_WARN 923 pgs degraded; 8666 pgs down; 9606 pgs peering; 7 pgs recovering; 406 pgs recovery_wait; 3769 pgs stale; 9606 pgs stuck inactive; 3769 pgs stuck stale; 11052 pgs stuck unclean; recovery 121068/902868 degraded (13.409%); 4824/300956 unfound (1.603%); 2/18 in osds are down
   monmap e1: 1 mons at {0=193.136.128.202:6789/0}, election epoch 1, quorum 0 0
   osdmap e7669: 62 osds: 16 up, 18 in
   pgmap v47643: 12480 pgs: 35 active, 1223 active+clean, 129 stale+active, 321 active+recovery_wait, 198 stale+active+clean, 236 peering, 2 active+remapped, 2 stale+active+recovery_wait, 6126 down+peering, 249 active+degraded, 2 stale+active+recovering+degraded, 598 stale+peering, 7 active+clean+scrubbing, 29 active+recovery_wait+remapped, 2067 stale+down+peering, 618 stale+active+degraded, 52 active+recovery_wait+degraded, 61 remapped+peering, 365 down+remapped+peering, 2 stale+active+recovery_wait+degraded, 45 stale+remapped+peering, 108 stale+down+remapped+peering, 5 active+recovering; 1175 GB data, 1794 GB used, 25969 GB / 27764 GB avail; 121068/902868 degraded (13.409%); 4824/300956 unfound (1.603%)
   mdsmap e1: 0/0/1 up

The cluster has been in this state since the last attempt to get it going. I added about 100GB of swap on each machine to avoid the OOM killer. Running like this resulted in the machines thrashing wildly and getting to ~2000 load avg, and after a while the OSDs started dying/committed suicide, but *not* from OOM. Some of the few that remain have bloated to around 1.9GB of mem usage.

If you want, I can try to restart the whole thing tomorrow and collect fresh log output from the dying OSDs, or take any other action or collect any debug info that you might find useful.

Thanks!
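As an aside, the state counts in a pgmap line like the one above can be tallied per flag to see what dominates; a quick sketch in Python, with the state string copied verbatim from the output above:

```python
# Tally pg states from a ceph -s pgmap line (state string copied from above).
from collections import Counter

pgmap = ("35 active, 1223 active+clean, 129 stale+active, 321 active+recovery_wait, "
         "198 stale+active+clean, 236 peering, 2 active+remapped, "
         "2 stale+active+recovery_wait, 6126 down+peering, 249 active+degraded, "
         "2 stale+active+recovering+degraded, 598 stale+peering, "
         "7 active+clean+scrubbing, 29 active+recovery_wait+remapped, "
         "2067 stale+down+peering, 618 stale+active+degraded, "
         "52 active+recovery_wait+degraded, 61 remapped+peering, "
         "365 down+remapped+peering, 2 stale+active+recovery_wait+degraded, "
         "45 stale+remapped+peering, 108 stale+down+remapped+peering, 5 active+recovering")

totals = Counter()   # per-flag pg counts (a pg contributes to every flag it carries)
grand = 0            # total pgs across all states
for part in pgmap.split(", "):
    count, state = part.split(" ", 1)
    grand += int(count)
    for flag in state.split("+"):
        totals[flag] += int(count)

print("total pgs:", grand)
for flag, n in totals.most_common():
    print(f"{flag:>15}: {n}")
```

The per-flag sums reproduce the health line exactly (12480 pgs total, 8666 down, 9606 peering, 3769 stale, 923 degraded), and make it easy to see that the bulk of the stuck PGs are down+peering, i.e. waiting on OSDs that are not up.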
Cláudio
Re: OSD and MON memory usage
On Thu, 29 Nov 2012 00:13:25 +0100 Sylvain Munaut <s.mun...@whatever-company.com> wrote:
> > If you want, I can try to restart the whole thing tomorrow and collect
> > fresh log output from the dying OSDs, or any other action or debug
> > info that you might find useful.
>
> Is the clock synchronized on all machines? What you describe (growing
> mem, recovery that doesn't seem to end) seems pretty similar to what I
> experienced when the clocks of the OSDs were off ...

Yup. All machines synced by ntp.

Cláudio
Re: OSD and MON memory usage
On Fri, 23 Nov 2012 16:46:00 + Joao Eduardo Luis <joao.l...@inktank.com> wrote:
> On 11/16/2012 05:24 PM, Cláudio Martins wrote:
> > As for the monitor daemon on this cluster (running on a dedicated
> > machine), it is currently using 3.2GB of memory, and it got to that
> > point again in a matter of minutes after being restarted. Would it be
> > good if we tested with the changes from the wip-mon-leaks-fix branch?
>
> Following up on this, wip-mon-leaks-fix was merged into master a couple
> of days ago. If you have the chance to check if that fixes your memory
> consumption issues on the monitor, it would be much appreciated!

Hi João,

I've had a chance to test it and it does indeed seem to make a big difference in mon memory usage.

As for the OSD memory usage issue, it still looks really bad. I'm preparing to do more testing and send more info about this, but a lot of unrelated stuff crept up this week and things are going slowly on this front. I hope to have more to say about this before the weekend.

Thanks!

Cláudio
OSD and MON memory usage
Hi,

We're testing Ceph using a recent build from the 'next' branch (commit b40387d) and we've run into some interesting problems related to memory usage.

The setup consists of 64 OSDs (4 boxes, each with 16 disks, most of them 2TB, some 1.5TB, XFS filesystems, Debian Wheezy). After the initial mkcephfs, 'ceph -s' reports 12480 pgs total.

For generating some load we used

  rados -p rbd bench 28000 write -t 25

and left it running overnight. After several hours most of the OSDs had eaten up around 1GB or more of memory each, which caused thrashing on the servers (12GB of RAM per box), and eventually the OOM killer was invoked, killing many OSDs and even the SSH daemons. This seems to have caused a domino effect, and in the morning only around 18 of the OSDs were still up.

After a hard reboot of the boxes that were unresponsive, we are now in a situation in which there is simply not enough memory for the cluster to recover. That is, after restarting the OSDs, within 2 to 3 minutes many of them are using 1-1.5GB of RAM and the thrashing starts all over again; the OOM killer comes in and things go downhill again. Effectively, the cluster is not able to recover no matter how many times we restart the daemons.

We're not using any non-default options in the OSD section of the config file. We checked that there is free space for logging on the system partitions.

While I know that 12GB per machine can hardly be called too much RAM, the question I put forward is: is it reasonable for an OSD to consume so much memory in normal usage, or even in recovery situations, when there are just around ~200 PGs per OSD and only around ~3TB of objects created by rados bench? Is there a rule of thumb to estimate the amount of memory consumed as a function of PG count, object count and perhaps the number of PGs trying to recover at a given instant?
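On the rule-of-thumb question, the kind of estimate I have in mind could be sketched as below. Every constant here (log entries per PG, bytes per log entry, in-flight recovery data) is an illustrative guess, not a measured Ceph figure:

```python
# Back-of-envelope OSD memory estimate during recovery.
# All constants are illustrative assumptions, not measured Ceph values.

def estimate_osd_mem_mb(pgs_per_osd,
                        pg_log_entries=1000,      # entries kept per pg log (assumed)
                        bytes_per_log_entry=200,  # rough guess per log entry
                        recovering_pgs=50,        # pgs recovering at once (assumed)
                        bytes_per_recovering_pg=4 * 1024 * 1024):  # in-flight data (assumed)
    steady = pgs_per_osd * pg_log_entries * bytes_per_log_entry
    recovery = recovering_pgs * bytes_per_recovering_pg
    return (steady + recovery) / (1024 * 1024)

# ~200 PGs per OSD, as in the cluster described above:
print(f"{estimate_osd_mem_mb(200):.0f} MB")  # prints "238 MB"
```

Under these (guessed) assumptions the footprint stays in the low hundreds of MB, so sustained 1-1.5GB per OSD would suggest either per-PG state far larger than assumed here, or a leak.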
One of my concerns here is also to understand whether memory consumption during recovery is bounded and deterministic at all, or whether we're simply hitting a severe memory leak in the OSDs.

As for the monitor daemon on this cluster (running on a dedicated machine), it is currently using 3.2GB of memory, and it got to that point again in a matter of minutes after being restarted. Would it be good if we tested with the changes from the wip-mon-leaks-fix branch?

We would appreciate any advice on the best way to determine whether the OSDs are leaking memory or not. We will gladly provide any config or debug info that you might be interested in, or run any tests.

Thanks in advance

Best regards

Cláudio
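As a footnote on telling a leak from bounded usage: one low-tech first step is to periodically sample each daemon's resident set size and look for monotonic growth under a constant workload. A minimal sketch (Linux-only, /proc-based; the pid and sampling interval are placeholders to be pointed at an actual osd process):

```python
# Sample a process's resident set size over time to help distinguish
# bounded memory use from steady, leak-like growth. Linux-only.
import os
import time

def rss_kb(pid):
    """Read VmRSS (in kB) from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found")

def sample(pid, n=5, interval=1.0):
    """Collect n RSS samples, one every 'interval' seconds."""
    samples = []
    for _ in range(n):
        samples.append(rss_kb(pid))
        time.sleep(interval)
    return samples

# Demo on our own pid; in practice, pass an osd daemon's pid instead.
print(sample(os.getpid(), n=3, interval=0.1))
```

If the curve keeps climbing while PG counts and workload stay flat, that points at a leak rather than bounded recovery state.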
Re: some snapshot problems
On Thu, 8 Nov 2012 09:30:55 -0800 (PST) Sage Weil <s...@inktank.com> wrote:
> Lots of different snapshots:
>
> - librados lets you do 'selfmanaged snaps' in its API, which let an
>   application control which snapshots apply to which objects.
>
> - you can create a 'pool' snapshot on an entire librados pool. This
>   cannot be used at the same time as rbd, fs, or the above
>   'selfmanaged' snaps.

Could you please clarify this? You mean that if a given pool has a snapshot created, a subsequent 'rbd snap create' on an image placed in that very same pool would fail?

A quick search through the RADOS and RBD documentation didn't turn anything up about this restriction, but I apologize if I missed it.

Thanks

Cláudio
Re: OSD deadlock with cephfs client and OSD on same machine
On Fri, 1 Jun 2012 11:35:37 +0200 Amon Ott <a@m-privacy.de> wrote:
> After backporting syncfs() support into Debian stable libc6 2.11 and
> recompiling Ceph with it, our test cluster is now running with syncfs().

Hi,

We're running OSDs on top of Debian wheezy, which unfortunately has libc6 2.13. By chance, do you still have that patch to backport syncfs()?

Thanks in advance

Best regards

Cláudio
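For anyone else hitting this: where the kernel is new enough (2.6.39+) but libc lacks the wrapper, syncfs() can also be invoked directly by syscall number, without rebuilding libc. A minimal sketch in Python via ctypes, assuming x86-64 Linux (the number 306 is architecture-specific and differs on other platforms):

```python
# Invoke the syncfs syscall directly when libc has no syncfs() wrapper.
# Assumes x86-64 Linux: syscall 306 is syncfs there; other architectures differ.
import ctypes
import os

SYS_syncfs = 306  # x86-64 only; check asm/unistd.h for other architectures

libc = ctypes.CDLL(None, use_errno=True)

def syncfs(path):
    """Flush only the filesystem containing 'path', via the raw syscall."""
    fd = os.open(path, os.O_RDONLY)
    try:
        ret = libc.syscall(SYS_syncfs, fd)
        if ret != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)

syncfs("/tmp")  # sync just /tmp's filesystem, not every mounted fs as sync(2) would
```

The C backport presumably does the same thing with syscall(SYS_syncfs, fd); the point of syncfs over plain sync() is that an OSD only flushes its own data disk instead of every filesystem on the box.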
Re: bobtail timing
On Wed, 31 Oct 2012 20:17:49 -0700 (PDT) Sage Weil <s...@inktank.com> wrote:
> On Thu, 1 Nov 2012, Cláudio Martins wrote:
> > On Wed, 31 Oct 2012 14:38:28 -0700 (PDT) Sage Weil wrote:
> > > On Wed, 31 Oct 2012, Noah Watkins wrote:
> > > > Which branch is the freeze taken against? master?
> > >
> > > Right. Basically, every 3-4 weeks:
> > >
> > > - next is tagged as v0.XX
> > > - and is merged back into master
> > > - next branch is reset to current master
> > > - testing branch is reset to just-tagged v0.XX
> >
> > Hmm, interesting. But doesn't that mean that when the real v0.XX is
> > later officially _released_, its top commit might not be the commit
> > that was tagged as v0.XX? Assuming that issues are found after the
> > testing branch is reset to v0.XX, fixes would go on top of v0.XX,
> > right? Am I missing something, or might people checking out v0.XX
> > with git not be getting the real v0.XX that was released as tarballs?
>
> The releases and tarballs contain *exactly* the content that is tagged
> v0.X. The branches may accumulate additional fixes after that, which
> will later be tagged with a v0.X.Y point release. Since we've started
> maintaining a stable release, we haven't done point releases for the
> development 'testing' releases, although if there are important bugs
> we may need to do so in the future.
>
> The 'stable' branch tracks the last stable release (currently argonaut)
> and is where bug fixes accumulate until the next release. For example,
> stable currently contains v0.48.2 and several additional commits
> (mostly backports of provisioning scripts to support the ceph-deploy
> tool that we're working on).

OK, it makes perfect sense now. I hadn't realised that the tarball is released as soon as the branch is tagged, sorry about that.

Thanks for the clarification.

Best regards

Cláudio
Re: [PATCH 0/4] rbd: finish up basic format 2 support
On Tue, 09 Oct 2012 13:57:09 -0700 Alex Elder <el...@inktank.com> wrote:
> This series includes updates for two patches posted previously.
>
> -Alex

Greetings,

We're gearing up to test v0.52 (specifically the RBD stuff) on our cluster. After reading this series of posts about the rbd format 2 patches, I began wondering whether we should start testing these patches as well.

To put it simply, what I'd like to know is: is it enough to use the 3.6 vanilla kernel client to take full advantage of the rbd changes in v0.52 (i.e. the new RBD cloning features)? Do we gain any benefit from applying any of these patches on top of v3.6 and using format 2, assuming that we stick to v0.52 on the server, or is this strictly v0.53-and-beyond stuff?

I apologize if this is a dumb question, but looking at the v0.52 changelog, at doc/rbd/* and the list, it isn't clear how this fits with v0.52.

Thanks in advance

Best regards

Cláudio
Re: some thoughts about scrub
On Tue, 1 Feb 2011 17:20:34 +0800 Henry Chang <henry.cy.ch...@gmail.com> wrote:
> Yeah. I expect that scrub can both detect disk errors and check data
> integrity (based on the checksum) in the background. For disk errors, I
> would like Ceph to mark the OSD down/failed and notify the sysadmin
> immediately. For data errors, I expect that Ceph can repair them
> automatically (by fetching a good copy from another replica).

I suppose the best approach would be for this to be configurable with per-OSD granularity, something like an io_error_threshold config variable. I would set it to something like 50 or 100, but you could set it to 1, and the OSD would put itself down or out after that many I/O errors have propagated up to the OSD daemon.

I guess that even if that OSD becomes unresponsive for a while it won't be much trouble, since Ceph will mark it down and should recover later, or else the OSD will soon be out by itself due to the error threshold.

What do you think?

Cheers

Cláudio
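To make the proposal concrete, here is a minimal sketch of the threshold logic. Both the io_error_threshold name and the behaviour are hypothetical; nothing like this exists in Ceph:

```python
# Hypothetical sketch of an io_error_threshold mechanism for an OSD.
# The option name and the behaviour are a proposal, not existing Ceph code.

class IOErrorTracker:
    def __init__(self, io_error_threshold=50):
        self.threshold = io_error_threshold
        self.errors = 0
        self.marked_out = False

    def record_io_error(self):
        """Called whenever an EIO propagates up to the OSD daemon."""
        self.errors += 1
        if self.errors >= self.threshold:
            self.mark_self_out()

    def mark_self_out(self):
        # In a real OSD this would ask the monitors to mark us down/out.
        self.marked_out = True

# With threshold=1 the OSD takes itself out on the very first I/O error;
# with 50-100 it tolerates the occasional correctable error.
tracker = IOErrorTracker(io_error_threshold=3)
for _ in range(3):
    tracker.record_io_error()
print(tracker.marked_out)  # True once the threshold is reached
```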
Re: some thoughts about scrub
On Mon, 31 Jan 2011 12:56:36 -0800 Colin McCabe <cmcc...@alumni.cmu.edu> wrote:
> Case #1: The hard disk that the FileStore is reading from could be
> dying. In my experience, hard disks that are dying will tend to
> experience long delays in reading from the filesystem. Occasionally you
> will be unable to read some files, and you'll get EIO instead. When a
> hard disk is dying, all you want to do is get your data off there as
> soon as possible. You don't want to bother trying to fix the files on
> the disk. That disk is toast.

In my experience, with recent disks not every read error means that the disk is going to die anytime soon. I manage several dozen Western Digital drives (Caviar Black 2TB) in Linux raid6 arrays. When running an MD array background check, MD will report a read error from time to time on some drives. It recovers the data for that block and rewrites it, but the bad block won't show up as Reallocated or Pending in the SMART report for that drive. Later, the same drive will complete several entire background checks just fine and will go some time before acting up again. I have also seen some big Hitachi drives throwing uncorrected errors (but reallocating the blocks, unlike the WD drives) that otherwise work just fine for months.

So, granted, I may have flaky drives, but since they currently are not causing significant hangs or timeouts on the array, why should I just replace all of them? Even a flaky drive is a useful drive if it holds a known good copy of your blocks for some time, just in case your other good drive dies at the wrong moment.

So, I do agree that, as Brian Chrisman pointed out, background scrub is always important, as it helps to prevent your data redundancy going bad without you knowing about it. I also agree that sysadmin notification is important in either case.
But I also think that Ceph should try to correct the errors it finds through scrub, because some of today's drives may throw uncorrected errors even while they are still useful. I'd rather have more copies of my data, even if they're slightly unreliable, since I should always be able to tell the bad ones apart by the BTRFS checksums. Besides, I think this model of always trying to correct errors fits well with Ceph's goal of working with unreliable, commodity hardware, so it makes no sense to just bail out and force the operator to swap every flaky drive.

Best regards

Cláudio
Re: Multiple disks per server.
On Tue, 04 May 2010 14:18:25 +0200 Mickaël Canévet <cane...@embl.fr> wrote:
> Hi,
>
> I'm testing ceph on 4 old servers. As there is more than one disk per
> server available for data (2 with 6 disks and 2 with 10 disks, for a
> total of 32 disks over 4 nodes), I was wondering how to define the
> OSDs. I have a choice between one OSD per disk (32 OSDs on the
> cluster) or one OSD per server with one btrfs filesystem over all the
> disks of the server (4 OSDs on the cluster). Which one is the best
> solution?
>
> In the first case, if I lose one disk, I lose only a small part of the
> available space. In the other case, if I lose one disk, I lose the
> whole server (as the btrfs filesystem is striped), which is much more
> space.

Hi,

I too am facing a similar dilemma:

Scenario 1: I can set up an MD raid6 array for each OSD box, and so can afford up to 2 simultaneous disk failures without Ceph noticing anything wrong. When the 3rd drive fails, a long time will be spent redistributing data across the cluster (though much less time than a simple 25TB raid6 rebuild). This setup should be quite simple, and a 16-disk raid6 should generally give nice performance. I would probably use 2-way data replication (in the Ceph config) in this case.

Scenario 2: I can try to configure 1 OSD per disk. As soon as a drive fails, there will be data redistribution across the remaining OSDs, but this should be quite fast, as only the contents of a single drive (or slightly more, worst case) have to be redistributed across the cluster. In this case I would use 3-way replication for added protection against simultaneous double drive failures and to compensate for the OSDs not having a raid array underneath them.

I can see several potential advantages in Scenario 2:

* Greater simplicity and ease of administration, as there's no need to worry about RAID arrays, their configuration and their possible bugs. You have one less layer in the stack to worry about, and that has to be good news.
* You can replace failed drives with different drives without worrying about wasted capacity when they are bigger (as you would with raid), and you can even take advantage of older, smaller drives that would otherwise go to the trash can. Overall, this gives greater freedom when upgrading hardware.

* Degradation of available cluster capacity and bandwidth would be much softer. In fact, assuming you don't have many power supplies or mainboards burning up, your cluster will maintain redundancy as drives fail. That is, as long as you have more drive capacity than (amount_of_data * replication_level), your cluster will probably stay in a good, fully redundant state. That should make for better sleep at night.

* Workloads with small, scattered writes should perform better. In a RAID array those can cause entire stripes to be read, thus requiring data chunks to be read from a lot of disks just to compute the redundancy chunks. This should be quite an advantage for big mail server workloads, which is one of the workloads I'm interested in.

* Large write performance should be no worse than with raid, since Ceph also spreads chunks across OSDs.

Having said that, there are some aspects of how Ceph would behave in Scenario 2 that I still have to investigate:

* Whether multiple OSDs per node is a well supported option. Do multiple OSDs per node play well with each other and with a node's resources?

* Whether there are issues with network ports/addresses when setting up more than 1 OSD per node.

* OSD behaviour when getting I/O errors from its drive -- this is really the most complex and important one, and the one I wish I could hear your opinions about: usually, in a RAID array, when there is a fatal failure, the upper layers will just get permanent I/O errors, and you can assume that storage area is dead and go on with life.
However, this is frequently not true when you consider single drives, as in Scenario 2, at least for reads: the drive may return read errors for a small region but still be quite OK for the remaining data. So, ideally, a Ceph OSD receiving a read error from the filesystem would request a copy of the object in question from another OSD and try to rewrite it several times before giving up and declaring the drive dead (1). This is actually what Linux MD does on recent kernels, and I know from experience that it increases array survivability a lot.

Background data scrubbing would help a lot with the above, and I guess BTRFS checksumming will simplify things here.

Sorry for the huge email, but I hope that what I wrote are valid points towards making Ceph more robust, and I hope to hear what you think about them.

Notes:

(1) Better yet, if the error repeats, it could leave the old backing file alone and try to allocate a new one for that object, thus avoiding declaring the drive completely dead too early.

Best regards and thanks

Cláudio
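The read-repair behaviour proposed above, including the fallback from note (1), could look roughly like this. This is a hypothetical illustration of the idea; the store/replica interfaces are invented for the example, and this is not how the Ceph OSD is actually implemented:

```python
# Hypothetical sketch of OSD read-repair on EIO (proposed behaviour, not Ceph code).

MAX_REWRITE_ATTEMPTS = 3

def read_with_repair(local_store, replicas, obj):
    """Try a local read; on EIO, fetch a good copy from a replica and try to
    rewrite it locally (which may remap the bad sector in-drive) before
    falling back to a fresh backing file, as in note (1)."""
    try:
        return local_store.read(obj)
    except IOError:
        pass
    good_copy = next(r.read(obj) for r in replicas)  # assume some replica is healthy
    for _ in range(MAX_REWRITE_ATTEMPTS):
        try:
            local_store.write(obj, good_copy)
            return local_store.read(obj)
        except IOError:
            continue
    # Rewrites keep failing: allocate a new backing file for this object
    # instead of declaring the whole drive dead right away.
    local_store.write_new_backing_file(obj, good_copy)
    return good_copy

class FakeStore:
    """Tiny stand-in for a local disk / replica, for demonstration only."""
    def __init__(self, data=None, failing_reads=0):
        self.data = dict(data or {})
        self.failing_reads = failing_reads  # number of reads that raise EIO

    def read(self, obj):
        if self.failing_reads > 0:
            self.failing_reads -= 1
            raise IOError("EIO")
        return self.data[obj]

    def write(self, obj, value):
        self.data[obj] = value

    def write_new_backing_file(self, obj, value):
        self.data[obj] = value

local = FakeStore(failing_reads=1)            # first local read hits a bad sector
replica = FakeStore({"obj1": b"payload"})
print(read_with_repair(local, [replica], "obj1"))  # b'payload'
```

The point of the retry loop is that a rewrite gives the drive a chance to reallocate the bad sector internally, exactly the behaviour the email credits to Linux MD.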