[ceph-users] Repository with some internal utils
Hi,

Someone asked me if he could get access to the BTRFS defragmenter we use for our Ceph OSDs. I took a few minutes to put together a small GitHub repository with:
- the defragmenter I was asked about (tested on 7200 rpm drives and designed to put low IO load on them),
- the scrub scheduler we use to avoid load spikes on Firefly,
- some basic documentation (this is still rough around the edges, so you'd better like reading Ruby code if you want to peek at most of the logic, tune or hack these).

Here it is: https://github.com/jtek/ceph-utils

This has been running in production for several months now, and I haven't touched the code or the numerous internal tunables these scripts have for several weeks, so it probably won't destroy your clusters. These scripts come without warranties though.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to properly deal with NEAR FULL OSD
On 19/02/2016 17:17, Don Laursen wrote:
> Thanks. To summarize:
> Your data, images+volumes = 27.15% space used
> Raw used = 81.71% used
>
> This is a big difference that I can't account for. Can anyone? So is
> your cluster actually full?

I believe this is the pool size (the replication factor) being accounted for, and it is harmless: 3 x 27.15 = 81.45, which is awfully close to 81.71. We see the same behavior on our Ceph cluster.

> I had the same problem with my small cluster. Raw used was about 85%
> and actual data, with replication, was about 30%. My OSDs were also
> BTRFS. BTRFS was causing its own problems. I fixed my problem by
> removing each OSD one at a time and re-adding it with the default XFS
> filesystem. Doing so brought the percentages used to about the same
> and it's good now.

That's odd: AFAIK we had the same behaviour with XFS before migrating to BTRFS.

Best regards,

Lionel
Re: [ceph-users] ZFS or BTRFS for performance?
Hi,

On 18/03/2016 20:58, Mark Nelson wrote:
> FWIW, from purely a performance perspective Ceph usually looks pretty
> fantastic on a fresh BTRFS filesystem. In fact it will probably
> continue to look great until you do small random writes to large
> objects (like say to blocks in an RBD volume). Then COW starts
> fragmenting the objects into oblivion. I've seen sequential read
> performance drop by 300% after 5 minutes of 4K random writes to the
> same RBD blocks.
>
> Autodefrag might help.

With 3.19 it wasn't enough for our workload and we had to develop our own defragmentation scheduler, see https://github.com/jtek/ceph-utils. We tried autodefrag again with a 4.0.5 kernel but it wasn't good enough yet (and based on my reading of the linux-btrfs list I don't think there is any work being done on it currently).

> A long time ago I recall Josef told me it was dangerous to use (I
> think it could run the node out of memory and corrupt the FS), but it
> may be that it's safer now.

No problem here (as long as we use our defragmentation scheduler; otherwise performance degrades over time/amount of rewrites).

> In any event we don't really do a lot of testing with BTRFS these
> days as bluestore is indeed the next gen OSD backend.

Will bluestore provide the same protection against bitrot as BTRFS? Ie: with BTRFS the deep-scrubs detect inconsistencies *and* the OSD(s) with invalid data get IO errors when trying to read the corrupted data, and as such can't be used as the source for repairs even if they are primary OSD(s). So with BTRFS you get a pretty good overall protection against bitrot in Ceph (it allowed us to automate the repair process in the most common cases). With XFS, IIRC, unless you override the default behavior, the primary OSD is always the source for repairs (even if all the secondaries agree on another version of the data).

Best regards,

Lionel
Re: [ceph-users] Deprecating ext4 support
On 12/04/2016 01:40, Lindsay Mathieson wrote:
> On 12/04/2016 9:09 AM, Lionel Bouton wrote:
>> * If the journal is not on a separate partition (SSD), it should
>> definitely be re-created NoCoW to avoid unnecessary fragmentation. From
>> memory: stop OSD, touch journal.new, chattr +C journal.new, dd
>> if=journal of=journal.new (your dd options here for best perf/least
>> amount of cache eviction), rm journal, mv journal.new journal, start OSD
>> again.
>
> Flush the journal after stopping the OSD!

No need to: dd makes an exact duplicate.

Lionel
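The procedure quoted above can be sketched as a small script. This is only a sketch under stated assumptions: the OSD id and paths are examples, the OSD must be stopped first, and the dd options should be tuned for your hardware:

```shell
#!/bin/sh
# Sketch of the NoCoW journal re-creation described above.
# Assumes the OSD is stopped before calling this, and that the journal
# lives on a BTRFS filesystem (chattr +C only warns elsewhere).
recreate_nocow_journal() {
  j=$1
  touch "$j.new"
  # NoCoW must be set while the file is still empty to be effective.
  chattr +C "$j.new" 2>/dev/null || echo "warning: chattr +C failed (not BTRFS?)" >&2
  # Pick dd options (bs, direct I/O, ...) suited to your hardware.
  dd if="$j" of="$j.new" bs=4M conv=fsync 2>/dev/null
  rm "$j"
  mv "$j.new" "$j"
}

# Hypothetical usage (OSD id 0): stop the OSD, then
#   recreate_nocow_journal /var/lib/ceph/osd/ceph-0/journal
# then start the OSD again.
```

As discussed in this thread, no journal flush is needed since dd produces an exact copy.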
Re: [ceph-users] Deprecating ext4 support
Hi,

On 11/04/2016 23:57, Mark Nelson wrote:
> [...]
> To add to this on the performance side, we stopped doing regular
> performance testing on ext4 (and btrfs) sometime back around when ICE
> was released to focus specifically on filestore behavior on xfs.
> There were some cases at the time where ext4 was faster than xfs, but
> not consistently so. btrfs is often quite fast on a fresh fs, but
> degrades quickly due to fragmentation induced by cow with
> small-writes-to-large-object workloads (IE RBD small writes). If
> btrfs auto-defrag is now safe to use in production it might be worth
> looking at again, but probably not ext4.

For BTRFS, autodefrag is probably not performance-safe (yet), at least with RBD access patterns. At least it wasn't in 4.1.9 when we last tested it: performance degraded slowly but surely over several weeks, from an initially well-performing filesystem to the point where we measured a 100% increase in average latencies along with large spikes, and stopped the experiment. I haven't seen any patches on linux-btrfs since then (it might have benefited from other modifications, but the autodefrag algorithm itself wasn't reworked AFAIK). That's not an inherent problem of BTRFS but of the autodefrag implementation, though.

Deactivating autodefrag and reimplementing a basic, cautious defragmentation scheduler gave us noticeably better latencies with BTRFS vs XFS (~30% better) on the same hardware and workload long term (as in almost a year and countless full-disk rewrites on the same filesystems, due to both normal writes and rebalancing, with 3 to 4 months of XFS and BTRFS OSDs coexisting for comparison purposes).

I'll certainly remount a subset of our OSDs with autodefrag, as I did with 4.1.9, when we deploy 4.4.x or a later LTS kernel, so I might have more up-to-date information in the coming months. I don't plan to compare BTRFS to XFS anymore though: XFS only saves us from running our defragmentation scheduler; BTRFS is far more suited to our workload, and we've seen constant improvements in behavior along the (arguably bumpy until late 3.19 versions) 3.16.x to 4.1.x road.

Other things:
* If the journal is not on a separate partition (SSD), it should definitely be re-created NoCoW to avoid unnecessary fragmentation. From memory: stop OSD, touch journal.new, chattr +C journal.new, dd if=journal of=journal.new (your dd options here for best perf/least amount of cache eviction), rm journal, mv journal.new journal, start OSD again.
* filestore btrfs snap = false is mandatory if you want consistent performance (at least on HDDs). It may not be felt with almost empty OSDs, but performance hiccups appear as soon as any non-trivial amount of data is added to the filesystems. IIRC, after debugging, surprisingly the snapshot creation didn't seem to be the actual cause of the performance problems, but the snapshot deletion was... It's so bad that the default should probably be false and not true.

Lionel
Re: [ceph-users] ZFS or BTRFS for performance?
On 19/03/2016 18:38, Heath Albritton wrote:
> If you google "ceph bluestore" you'll be able to find a couple slide
> decks on the topic. One of them by Sage is easy to follow without the
> benefit of the presentation. There's also the "Redhat Ceph Storage
> Roadmap 2016" deck.
>
> In any case, bluestore is not intended to address bitrot. Given that
> ceph is a distributed file system, many of the posix file system
> features are not required for the underlying block storage device.
> Bluestore is intended to address this and reduce the disk IO required
> to store user data.
>
> Ceph protects against bitrot at a much higher level by validating the
> checksum of the entire placement group during a deep scrub.

My impression is that the only protection against bitrot is provided by the underlying filesystem, which means that you don't get any if you use XFS or EXT4. I can't trust Ceph on this alone until its bitrot protection (if any) is clearly documented, and the situation is far from clear right now. The documentation states that deep scrubs use checksums to validate data, but this is not good enough, at least because we don't know what these checksums are supposed to cover (see below for another reason). There is even this howto by Sebastien Han about repairing a PG: http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/ which clearly concludes that with only 2 replicas you can't reliably find out which object is corrupted with Ceph alone. If Ceph really stored checksums for all the objects it stores, we could manually check which replica is valid.

Even if deep scrubs used checksums to verify data, this would not be enough to protect against bitrot: there is a window between a corruption event and the next deep scrub during which the data on a primary can be returned to a client. BTRFS solves this problem by returning an IO error for any data read that doesn't match its checksum (or by automatically rebuilding the data if the allocation group uses RAID1/10/5/6). I've never seen this kind of behavior documented for Ceph.

Lionel
Re: [ceph-users] ZFS or BTRFS for performance?
Hi,

On 20/03/2016 15:23, Francois Lafont wrote:
> Hello,
>
> On 20/03/2016 04:47, Christian Balzer wrote:
>
>> That's not protection, that's an "uh-oh, something is wrong, you better
>> check it out" notification, after which you get to spend a lot of time
>> figuring out which is the good replica.
>
> In fact, I have never been confronted with this case so far and I have a
> couple of questions.
>
> 1. When it happens (ie a deep scrub fails), is it mentioned in the output
> of the "ceph status" command and, in this case, can you confirm that
> the health of the cluster in the output is different from "HEALTH_OK"?

Yes. This is obviously a threat to your data, so the cluster isn't HEALTH_OK (HEALTH_WARN IIRC).

> 2. For instance, say it happens with PG id == 19.10 and I have 3 OSDs
> for this PG (because my pool has replica size == 3). I suppose that the
> concerned OSDs are OSD ids 1, 6 and 12. Can you tell me if this "naive"
> method is valid to solve the problem (and, if not, why)?
>
> a) ssh into the node which hosts osd-1 and launch this command:
> ~# id=1 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* |
>      sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
> 055b0fd18cee4b158a8d336979de74d25fadc1a3 -
>
> b) ssh into the node which hosts osd-6 and launch this command:
> ~# id=6 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* |
>      sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
> 055b0fd18cee4b158a8d336979de74d25fadc1a3 -
>
> c) ssh into the node which hosts osd-12 and launch this command:
> ~# id=12 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* |
>      sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
> 3f786850e387550fdab836ed7e6dc881de23001b -

You may get 3 different hashes because of concurrent writes on the PG, so you may have to restart your commands, and probably try to launch them at the same time on all nodes to avoid this problem. If you have constant heavy writes on all your PGs this will probably never give a useful result.

> I notice that the result is different for osd-12 so it's the "bad" osd.
> So, on the node which hosts osd-12, I launch this command:
>
> id=12 && rm /var/lib/ceph/osd/ceph-$id/current/19.10_head/*

You should stop the OSD and flush its journal, then do this, before restarting the OSD.

> And now I can safely launch this command:
>
> ceph pg repair 19.10
>
> Is there a problem with this "naive" method?

It is probably overkill (and may not work, see above). Usually you can find out the exact file in this directory which differs and should be deleted (see the link in my previous post). I believe that if the offending file isn't on the primary you can directly launch the repair command.

Lionel
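The sequence discussed above (stop the OSD, flush its journal, remove the bad copy, restart, repair) could be scripted along these lines. This is only a sketch, not an official tool: the OSD id, PG id and object filename are hypothetical, the service commands are examples for a systemd host, and the helper defaults to printing the commands instead of running them:

```shell
#!/bin/sh
# Dry-run sketch of the manual PG repair sequence discussed above.
# Args: osd id, pg id, object filename relative to the pg directory.
# With no 4th argument the commands are only printed; pass "" as the
# 4th argument to actually execute them.
repair_pg_object() {
  osd=$1 pg=$2 obj=$3 run=${4-echo}
  $run systemctl stop "ceph-osd@$osd"
  $run ceph-osd -i "$osd" --flush-journal
  $run rm "/var/lib/ceph/osd/ceph-$osd/current/${pg}_head/$obj"
  $run systemctl start "ceph-osd@$osd"
  $run ceph pg repair "$pg"
}

# Dry run with hypothetical values (prints the five commands):
repair_pg_object 12 19.10 some_object_file
```

The dry-run default is deliberate: deleting the wrong replica on the primary is exactly the failure mode this thread warns about.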
Re: [ceph-users] Fwd: Ceph OSD suicide himself
Hi,

On 12/07/2016 02:51, Brad Hubbard wrote:
> [...]
>>>> This is probably a fragmentation problem : typical rbd access patterns
>>>> cause heavy BTRFS fragmentation.
>>> To the extent that operations take over 120 seconds to complete? Really?
>> Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very
>> aggressive way, rewriting data all over the place and creating/deleting
>> snapshots every filestore sync interval (5 seconds max by default IIRC).
>>
>> As I said there are 3 main causes of performance degradation :
>> - the snapshots,
>> - the journal in a standard copy-on-write file (move it out of the FS or
>> use NoCow),
>> - the weak auto defragmentation of BTRFS (autodefrag mount option).
>>
>> Each one of them is enough to impact or even destroy performance in the
>> long run. The 3 combined make BTRFS unusable by default. This is why
>> BTRFS is not recommended : if you want to use it you have to be prepared
>> for some (heavy) tuning. The first 2 points are easy to address, for the
>> last (which begins to be noticeable when you accumulate rewrites on your
>> data) I'm not aware of any other tool than the one we developed and
>> published on github (link provided in previous mail).
>>
>> Another thing : you better have a recent 4.1.x or 4.4.x kernel on your
>> OSDs if you use BTRFS. We've used it since 3.19.x but I wouldn't advise
>> it now and would recommend 4.4.x if it's possible for you and 4.1.x
>> otherwise.
> Thanks for the information. I wasn't aware things were that bad with BTRFS as
> I haven't had much to do with it up to this point.

Bad is relative. BTRFS was very time consuming to set up (mainly because of the defragmentation scheduler development, but finding the sources of inefficiency was no picnic either). Once used properly, though, it has 3 unique advantages:
- data checksums: this forces Ceph to use a good replica, by refusing to hand over corrupted data, and makes it far easier to handle silent data corruption (and some of our RAID controllers, probably damaged by electrical surges, had this nasty habit of flipping bits, so it really was a big time/data saver here),
- compression: you get more space for free,
- speed: we get better latencies than with XFS.

Until bluestore is production ready (it should address these points even better than BTRFS does), unless I find a use case where BTRFS falls on its face, there's no way I'd use anything but BTRFS with Ceph.

Best regards,

Lionel
Re: [ceph-users] ceph OSD with 95% full
Hi,

On 19/07/2016 13:06, Wido den Hollander wrote:
>> On 19 July 2016 at 12:37, M Ranga Swami Reddy wrote:
>>
>> Thanks for the correction... so even if one OSD reaches 95% full, the
>> total ceph cluster IO (R/W) will be blocked... Ideally read IO should
>> work...
> That should be a config option, since reading while writes still block is
> also a danger. Multiple clients could read the same object, perform an
> in-memory change and their writes will block.
>
> Now, which client will 'win' after the full flag has been removed?
>
> That could lead to data corruption.

If it did, the clients would be broken: normal usage (without writes being blocked) doesn't prevent multiple clients from reading the same data and trying to write at the same time. So if multiple writes (I suppose on the same data blocks) can be waiting, the order in which they are performed *must not* matter in your system. The alternative is to prevent simultaneous write accesses from multiple clients (this is how non-cluster filesystems must be configured on top of Ceph/RBD; they must even be prevented from read-only accessing an already mounted fs).

> Just make sure you have proper monitoring on your Ceph cluster. At nearfull
> it goes into WARN and you should act on that.

+1: monitoring is not an option.

Lionel
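For reference, the thresholds discussed above are configurable in ceph.conf. The values below are, as far as I know, the defaults of this era; check the documentation for your release before changing them:

```ini
[global]
; cluster reports HEALTH_WARN when an OSD crosses this ratio
mon osd nearfull ratio = 0.85
; writes are blocked cluster-wide when any OSD crosses this ratio
mon osd full ratio = 0.95
```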
Re: [ceph-users] Fwd: Ceph OSD suicide himself
On 11/07/2016 04:48, 한승진 wrote:
> Hi cephers.
>
> I need your help for some issues.
>
> The ceph cluster version is Jewel (10.2.1), and the filesystem is btrfs.
>
> I run 1 Mon and 48 OSDs in 4 nodes (each node has 12 OSDs).
>
> I've experienced one of the OSDs killing itself.
>
> It always issued a suicide timeout message.

This is probably a fragmentation problem: typical rbd access patterns cause heavy BTRFS fragmentation.

If you already use the autodefrag mount option, you can try this instead, which performs much better for us: https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb
Note that it can take some time to fully defragment the filesystems, but it shouldn't put more stress on them than autodefrag while doing so.

If you don't already use it, set:

filestore btrfs snap = false

in ceph.conf and restart your OSDs.

Finally, if you use journals on the filesystem and not on dedicated partitions, you'll have to recreate them with the NoCow attribute (there's no way to defragment journals that doesn't kill performance otherwise).

Best regards,

Lionel
Re: [ceph-users] Fwd: Ceph OSD suicide himself
On 11/07/2016 11:56, Brad Hubbard wrote:
> On Mon, Jul 11, 2016 at 7:18 PM, Lionel Bouton
> <lionel-subscript...@bouton.name> wrote:
>> On 11/07/2016 04:48, 한승진 wrote:
>>> Hi cephers.
>>>
>>> I need your help for some issues.
>>>
>>> The ceph cluster version is Jewel (10.2.1), and the filesystem is btrfs.
>>>
>>> I run 1 Mon and 48 OSDs in 4 nodes (each node has 12 OSDs).
>>>
>>> I've experienced one of the OSDs killing itself.
>>>
>>> It always issued a suicide timeout message.
>> This is probably a fragmentation problem : typical rbd access patterns
>> cause heavy BTRFS fragmentation.
> To the extent that operations take over 120 seconds to complete? Really?

Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very aggressive way, rewriting data all over the place and creating/deleting snapshots every filestore sync interval (5 seconds max by default IIRC).

As I said, there are 3 main causes of performance degradation:
- the snapshots,
- the journal in a standard copy-on-write file (move it out of the FS or use NoCow),
- the weak auto-defragmentation of BTRFS (autodefrag mount option).

Each one of them is enough to impact or even destroy performance in the long run. The 3 combined make BTRFS unusable by default. This is why BTRFS is not recommended: if you want to use it you have to be prepared for some (heavy) tuning. The first 2 points are easy to address; for the last (which begins to be noticeable when you accumulate rewrites on your data) I'm not aware of any other tool than the one we developed and published on GitHub (link provided in my previous mail).

Another thing: you'd better have a recent 4.1.x or 4.4.x kernel on your OSDs if you use BTRFS. We've used it since 3.19.x, but I wouldn't advise that now; I'd recommend 4.4.x if it's possible for you, and 4.1.x otherwise.

Best regards,

Lionel
Re: [ceph-users] Another cluster completely hang
Hi,

On 29/06/2016 12:00, Mario Giammarco wrote:
> Now the problem is that ceph has put out two disks because scrub has
> failed (I think it is not a disk fault but due to mark-complete)

There is something odd going on. I've only seen deep-scrub failing (ie detecting one inconsistency and marking the pg accordingly), so I'm not sure what happens in the case of a "simple" scrub failure, but what should not happen is the whole OSD going down on a scrub or deep-scrub failure, which you seem to imply did happen. Do you have logs for these two failures giving a hint at what happened (probably /var/log/ceph/ceph-osd..log)? Any kernel log pointing to hardware failure(s) around the time these events happened?

Another point: you said that you had one disk "broken". Usually ceph handles this case in the following manner:
- the OSD detects the problem and commits suicide (unless it's configured to ignore IO errors, which is not the default),
- your cluster is then in degraded state with one OSD down/in,
- after a timeout (several minutes), Ceph decides that the OSD won't come back up soon and marks it "out" (so one OSD down/out),
- as the OSD is out, crush adapts pg positions based on the remaining available OSDs and brings all degraded pgs back to clean state by creating the missing replicas while moving pgs around. You see a lot of IO and many pgs in wait_backfill/backfilling states at this point,
- when all is done the cluster is back to HEALTH_OK.

When your disk was broken and you waited 24 hours, how far along this process was your cluster?

Best regards,

Lionel
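The "timeout (several minutes)" in the sequence above corresponds, if I'm not mistaken, to this ceph.conf setting (default of this era shown, in seconds):

```ini
[mon]
; how long an OSD may stay "down" before it is marked "out"
; and backfilling to the remaining OSDs starts
mon osd down out interval = 600
```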
Re: [ceph-users] pg scrub and auto repair in hammer
Hi,

On 29/06/2016 18:33, Stefan Priebe - Profihost AG wrote:
>> On 28.06.2016 at 09:43, Lionel Bouton
>> <lionel-subscript...@bouton.name> wrote:
>>
>> Hi,
>>
>> On 28/06/2016 08:34, Stefan Priebe - Profihost AG wrote:
>>> [...]
>>> Yes but at least BTRFS is still not working for ceph due to
>>> fragmentation. I've even tested a 4.6 kernel a few weeks ago. But it
>>> doubles its I/O after a few days.
>> BTRFS autodefrag is not working over the long term. That said, BTRFS
>> itself is working far better than XFS on our cluster (noticeably better
>> latencies). As not having checksums wasn't an option, we coded and are
>> using this:
>>
>> https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb
>>
>> This actually saved us from 2 faulty disk controllers which were
>> infrequently corrupting data in our cluster.
>>
>> Mandatory too for performance:
>> filestore btrfs snap = false
> This sounds interesting. For how long have you been using this method?

More than a year now. Since the beginning, almost two years ago, we have always had at least one or two BTRFS OSDs to test and compare to the XFS ones. At the very beginning we had to recycle them regularly because their performance degraded over time. This was not a problem as Ceph makes it easy to move data around safely. We only switched over after finding out both that "filestore btrfs snap = false" was mandatory (when true it creates large write spikes every filestore sync interval) and that a custom defragmentation process was needed to maintain performance over the long run.

> What kind of workload do you have?

A dozen VMs using rbd through KVM's built-in support. There are different kinds of access patterns: a large PostgreSQL instance (75+ GB on disk, 300+ tx/s with peaks of ~2000, a mean of 50+ IO/s with peaks to 1000, mostly writes), a small MySQL instance (hard to say: it was very large, but we moved most of its content to PostgreSQL, which left only a small database for a proprietary tool and large ibdata* files with mostly holes), a very large NFS server (~10 TB), and lots of Ruby on Rails applications and background workers. On the whole storage system Ceph reports an average of 170 op/s, with peaks that can reach 3000.

> How did you measure the performance and latency?

Every useful metric we can get is fed to a Zabbix server. Latency is measured both by the kernel on each disk, as the average time a request stays in queue (accumulated wait time / number of IOs over a given period: you can find these values in /sys/block/<dev>/stat), and at the Ceph level by monitoring the apply latency (we now have journals on SSD, so our commit latency is mostly limited by the available CPU). The most interesting metric is the apply latency; block device latency is useful to see how hard the device itself is pushed and how well reads perform (apply latency only gives us the write side of the story). The behavior during backfills confirmed the latency benefits too: BTRFS OSDs were less frequently involved in slow requests than the XFS ones.

> What kernel do you use with btrfs?

4.4.6 currently (we just finished migrating all servers last weekend), but the switch from XFS to BTRFS occurred with late 3.19 kernels IIRC. I don't have measurements for this, but when we switched from 4.1.15-r1 ("-r1" is for Gentoo patches) to 4.4.6 we saw faster OSD startups (including the initial filesystem mount). The only drawback of BTRFS (if you don't count having to develop and run a custom defragmentation scheduler) was the OSD startup time vs XFS: it was very slow when starting from an unmounted filesystem, at least until 4.1.x. This was not really a problem as we don't restart OSDs often.

Best regards,

Lionel
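The per-device queue-time metric described above (average time per request = accumulated time-in-queue delta divided by completed-IOs delta, both taken from /sys/block/&lt;dev&gt;/stat) can be sketched like this. The two samples below are made up for illustration; on a live host you would read the stat file twice, a few seconds apart:

```shell
#!/bin/sh
# Average request wait time between two samples of /sys/block/<dev>/stat.
# Fields used (see Documentation/block/stat.txt): 1 = reads completed,
# 5 = writes completed, 11 = accumulated time_in_queue in ms.
avg_wait_ms() {
  echo "$1 $2" | awk '{
    d_ios = ($12 + $16) - ($1 + $5)   # completed IOs during the interval
    d_q   = $22 - $11                 # time_in_queue delta (ms)
    printf "%.1f\n", (d_ios ? d_q / d_ios : 0)
  }'
}

# Made-up samples, e.g. taken a few seconds apart with: cat /sys/block/sda/stat
s1="1000 0 8000 500 2000 0 16000 1500 0 1800 2000"
s2="1100 0 8800 550 2200 0 17600 1650 0 1980 2600"
avg_wait_ms "$s1" "$s2"   # 600 ms spread over 300 IOs -> 2.0
```

A monitoring agent (Zabbix in our case) would feed such deltas at a fixed sampling interval.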
Re: [ceph-users] how possible is that ceph cluster crash
On 19/11/2016 00:52, Brian :: wrote:
> This is like your mother telling you not to cross the road when you were 4
> years of age but not telling you it was because you could be flattened
> by a car :)
>
> Can you expand on your answer? If you are in a DC with AB power,
> redundant UPS, dual feed from the electric company, onsite generators,
> dual PSU servers, is it still a bad idea?

Yes it is. In one such datacenter, where we have a Ceph cluster, there was a complete shutdown because of a design error: the probes used by the system responsible for starting and stopping the generators were installed before the breakers on the feeds. After a blackout where the generators kicked in, the breakers opened due to a surge when power was restored. The generators were stopped because power was "restored", and the UPS systems failed 3 minutes later. The breakers couldn't be closed again in time (you don't approach them without being heavily protected, and putting on the protective suit takes more time than simply closing the breaker).

There's no such thing as an uninterruptible power supply.

Best regards,

Lionel
Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release
Hi,

On 10/01/2017 19:32, Brian Andrus wrote:
> [...]
>
> I think the main point I'm trying to address is - as long as the
> backing OSD isn't egregiously handling large amounts of writes and it
> has a good journal in front of it (that properly handles O_DSYNC [not
> D_SYNC as Sebastien's article states]), it is unlikely inconsistencies
> will occur upon a crash and subsequent restart.

I don't see how you can guess whether it is "unlikely". If you need SSDs you are probably handling relatively large amounts of accesses (so large amounts of writes aren't unlikely), or you would have used cheap 7200rpm or even slower drives. Remember that in the default configuration, if any 3 OSDs fail at the same time, you have a chance of losing data. For <30 OSDs and size=3 this is highly probable, as there are only a few thousand possible combinations of 3 OSDs (and you typically have a thousand or two of pgs, each picking OSDs in a more or less random pattern). With SSDs not handling write barriers properly, I wouldn't bet on recovering the filesystems of all OSDs properly after a cluster-wide power loss shutting down all the SSDs at the same time... In fact, as the hardware will lie about the stored data, the filesystem might not even detect the crash properly and might apply its own journal on outdated data, leading to unexpected results. So losing data is a possibility, and testing for it is almost impossible (you would have to reproduce all the different access patterns your Ceph cluster could experience at the time of a power loss and trigger power losses in each case).

> Therefore - while not ideal to rely on journals to maintain consistency,

Ceph journals aren't designed to maintain the filestore's consistency. They *might* restrict the access patterns to the filesystems in such a way that running fsck on them after a "let's throw away committed data" crash has better chances of restoring enough data, but if that's the case it's only a happy coincidence (and you would have to run these fscks *manually*, as the filesystem can't detect the inconsistencies by itself).

> that is what they are there for.

No. They are there for Ceph's internal consistency, not for the consistency of the filesystem backing the filestore. Ceph relies both on journals and on filesystems able to maintain internal consistency and supporting syncfs; if the journal or the filesystem fails, the OSD is damaged. If 3 OSDs are damaged at the same time on a size=3 pool, you enter "probable data loss" territory.

> There is a situation where "consumer-grade" SSDs could be used as
> OSDs. While not ideal, it can and has been done before, and may be
> preferable to tossing out $500k of SSDs (Seen it firsthand!)

For these I'd like to know:
- which SSD models were used?
- how long did the SSDs survive (some consumer SSDs not only lie to the system about write completions, but they usually don't handle large amounts of writes nearly as well as DC models)?
- how many cluster-wide power losses did the cluster survive?
- what were the access patterns on the cluster during the power losses?

If, for a model not guaranteed for sync writes, there haven't been dozens of power losses on clusters under large load without any problem detected in the following week (think deep-scrub), using them is playing Russian roulette with your data. AFAIK there have only been reports of data losses and/or heavy maintenance when people tried to use consumer SSDs (admittedly mainly for journals). I've yet to spot a long-running robust cluster built with consumer SSDs.

Lionel
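The back-of-the-envelope risk estimate above can be made explicit. This is a deliberately simplistic sketch: it assumes each pg picks its 3 OSDs uniformly and independently at random, which real CRUSH placement does not do, so take the number as an order of magnitude only:

```shell
#!/bin/sh
# Chance that one simultaneous 3-OSD failure loses data, i.e. hits at
# least one pg, assuming (simplistically) each pg picks its 3 OSDs
# uniformly and independently at random.
est_loss_probability() { # args: number of OSDs, number of pgs
  awk -v n="$1" -v pgs="$2" 'BEGIN {
    triples = n * (n - 1) * (n - 2) / 6     # possible 3-OSD combinations
    p = 1 - (1 - 1 / triples) ^ pgs         # >=1 pg on the failed triple
    printf "%d triples, p = %.2f\n", triples, p
  }'
}

est_loss_probability 30 2000   # 30 OSDs, size=3, ~2000 pgs
```

With only ~4000 possible triples and a couple thousand pgs, a simultaneous 3-OSD failure has a very real chance of destroying some pg entirely, which is the point made above.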
Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release
Le 07/01/2017 à 14:11, kevin parrikar a écrit : > Thanks for your valuable input. > We were using these SSD in our NAS box(synology) and it was giving > 13k iops for our fileserver in raid1.We had a few spare disks which we > added to our ceph nodes hoping that it will give good performance same > as that of NAS box.(i am not comparing NAS with ceph ,just the reason > why we decided to use these SSD) > > We dont have S3520 or S3610 at the moment but can order one of these > to see how it performs in ceph .We have 4xS3500 80Gb handy. > If i create a 2 node cluster with 2xS3500 each and with replica of > 2,do you think it can deliver 24MB/s of 4k writes . Probably not. See http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ According to the page above the DC S3500 reaches 39MB/s. Its capacity isn't specified, yours are 80GB only which is the lowest capacity I'm aware of and for all DC models I know of the speed goes down with the capacity so you probably will get lower than that. If you put both data and journal on the same device you cut your bandwidth in half : so this would give you an average <20MB/s per OSD (with occasional peaks above that if you don't have a sustained 20MB/s). With 4 OSDs and size=2, your total write bandwidth is <40MB/s. For a single stream of data you will only get <20MB/s though (you won't benefit from parallel writes to the 4 OSDs and will only write on 2 at a time). Not that by comparison the 250GB 840 EVO only reaches 1.9MB/s. But even if you reach the 40MB/s, these models are not designed for heavy writes, you will probably kill them long before their warranty is expired (IIRC these are rated for ~24GB writes per day over the warranty period). In your configuration you only have to write 24G each day (as you have 4 of them, write both to data and journal and size=2) to be in this situation (this is an average of only 0.28 MB/s compared to your 24 MB/s target). 
> We bought the S3500 because last time when we tried ceph, people were > suggesting this model :) :) The 3500 series might be enough with the higher capacities in some rare cases, but the 80GB model is almost useless. You have to do the math considering: - how much you will write to the cluster (guess high if you have to guess), - whether you will use the SSDs for both journals and data (which means writing twice to them), - your replication level (which means you will write the same data multiple times), - when you expect to replace the hardware, - the amount of writes per day they support under warranty (if the manufacturer doesn't present this number prominently they are probably trying to sell you a fast car headed for a brick wall). If your hardware can't handle the amount of writes you expect to put on it, then you are screwed. There were reports of new Ceph users not aware of this who used cheap SSDs that failed in a matter of months, all at the same time. You definitely don't want to be in their position. In fact, as problems happen (hardware failure leading to cluster storage rebalancing for example), you should probably get a system able to handle 10x the amount of writes you expect it to handle, then monitor the SSD SMART attributes to be alerted long before they die and replace them before problems happen. You definitely want a controller allowing access to this information. If you can't get it, you will have to monitor the writes and estimate this value yourself, which is risky as write amplification inside SSDs is not easy to guess... Lionel
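The checklist above can be condensed into a small sizing helper. This is a sketch, not an authoritative formula: `required_dwpd` and all its parameters are names I made up here, and `safety_factor=10` mirrors the "10x the amount of writes" suggestion from the mail:

```python
def required_dwpd(client_write_gb_day, replication, journal_on_ssd,
                  n_ssd, ssd_capacity_gb, safety_factor=10):
    """Rough per-SSD endurance requirement in drive-writes-per-day (DWPD).

    Inputs mirror the checklist above; all values are assumptions to be
    replaced with your own measurements (guess high if you have to guess).
    """
    # Replication multiplies writes; a colocated journal doubles them again.
    amplification = replication * (2 if journal_on_ssd else 1)
    per_ssd_gb_day = client_write_gb_day * amplification / n_ssd
    return safety_factor * per_ssd_gb_day / ssd_capacity_gb

# The 80GB S3500 scenario from the previous mail:
print(required_dwpd(24, replication=2, journal_on_ssd=True,
                    n_ssd=4, ssd_capacity_gb=80))  # -> 3.0 DWPD
```

3 DWPD is squarely in write-intensive datacenter SSD territory, which is why the entry-level 80GB model falls short even for a modest 24 GB/day of client writes.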
Re: [ceph-users] slow requests and short OSD failures in small cluster
On 13/04/2017 at 17:47, mj wrote: > Hi, > > On 04/13/2017 04:53 PM, Lionel Bouton wrote: >> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any >> measurable impact on performance... until we tried to remove them. > > What exactly do you mean with that? Just what I said: having snapshots doesn't impact performance, only removing them does (obviously until Ceph has finished cleaning up). Lionel
Re: [ceph-users] slow requests and short OSD failures in small cluster
Hi, On 13/04/2017 at 10:51, Peter Maloney wrote: > [...] > Also more things to consider... > > Ceph snapshots really slow things down. We use rbd snapshots on Firefly (and Hammer now) and I didn't see any measurable impact on performance... until we tried to remove them. We usually have at least one snapshot per VM image, often 3 or 4. Note that we use BTRFS filestores where IIRC the CoW is handled by the filesystem, so it might be faster compared to the default/recommended XFS filestores. > They aren't efficient like on > zfs and btrfs. Having one might take away some % performance, and having > 2 snaps potentially takes double, etc. until it is crawling. And it's > not just the CoW... even just rbd snap rm, rbd diff, etc. start to take > many times longer. See http://tracker.ceph.com/issues/10823 for an > explanation of CoW. My goal is just to keep max 1 long term snapshot. [...] In my experience with BTRFS filestores, snap rm impact is proportional to the amount of data specific to the snapshot being removed (i.e. not present in any other snapshot) but completely unrelated to the number of existing snapshots. For example the first one removed can be handled very fast, and it can be the last one removed that takes the most time and impacts performance the most. Best regards, Lionel
Re: [ceph-users] slow requests and short OSD failures in small cluster
On 18/04/2017 at 11:24, Jogi Hofmüller wrote: > Hi, > > thanks for all your comments so far. > > On Thursday, 13.04.2017 at 16:53 +0200, Lionel Bouton wrote: >> Hi, >> >> On 13/04/2017 at 10:51, Peter Maloney wrote: >>> Ceph snapshots really slow things down. > I can confirm that now :( > >> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any >> measurable impact on performance... until we tried to remove them. We >> usually have at least one snapshot per VM image, often 3 or 4. > This might have been true for hammer and older versions of ceph. From > what I can tell now, every snapshot taken reduces performance of the > entire cluster :( The version isn't the only difference here. We use BTRFS with a custom defragmentation process for the filestores, which is highly uncommon for Ceph users. As I said, Ceph has support for BTRFS CoW, so part of the snapshot handling process is actually handled by BTRFS. Lionel
Re: [ceph-users] dropping filestore+btrfs testing for luminous
On 04/07/2017 at 19:00, Jack wrote: > You may just upgrade to Luminous, then replace filestore by bluestore You don't just "replace" filestore with bluestore on a production cluster: you transition over several weeks/months from one to the other. The two must be rock stable and have predictable performance characteristics to do that. We took more than 6 months with Firefly to migrate from XFS to Btrfs and studied/tuned the cluster along the way. Simply replacing one store with another without any experience of the real-world behavior of the new one is just playing with fire (and a huge heap of customer data). Best regards, Lionel
Re: [ceph-users] dropping filestore+btrfs testing for luminous
On 30/06/2017 at 18:48, Sage Weil wrote: > On Fri, 30 Jun 2017, Lenz Grimmer wrote: >> Hi Sage, >> >> On 06/30/2017 05:21 AM, Sage Weil wrote: >> >>> The easiest thing is to >>> >>> 1/ Stop testing filestore+btrfs for luminous onward. We've recommended >>> against btrfs for a long time and are moving toward bluestore anyway. >> Searching the documentation for "btrfs" does not really give a user any >> clue that the use of Btrfs is discouraged. >> >> Where exactly has this been recommended? >> >> The documentation currently states: >> >> http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/?highlight=btrfs#osds >> >> "We recommend using the xfs file system or the btrfs file system when >> running mkfs." >> >> http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=btrfs#filesystems >> >> "btrfs is still supported and has a comparatively compelling set of >> features, but be mindful of its stability and support status in your >> Linux distribution." >> >> http://docs.ceph.com/docs/master/start/os-recommendations/?highlight=btrfs#ceph-dependencies >> >> "If you use the btrfs file system with Ceph, we recommend using a recent >> Linux kernel (3.14 or later)." >> >> As an end user, none of these statements would really sound as >> recommendations *against* using Btrfs to me. >> >> I'm therefore concerned about just disabling the tests related to >> filestore on Btrfs while still including and shipping it. This has >> potential to introduce regressions that won't get caught and fixed. > Ah, crap. This is what happens when devs don't read their own > documentation. I recommend against btrfs every time it ever comes up, the > downstream distributions all support only xfs, but yes, it looks like the > docs never got updated... despite the xfs focus being 5ish years old now. 
> > I'll submit a PR to clean this up, but > >>> 2/ Leave btrfs in the mix for jewel, and manually tolerate and filter out >>> the occasional ENOSPC errors we see. (They make the test runs noisy but >>> are pretty easy to identify.) >>> >>> If we don't stop testing filestore on btrfs now, I'm not sure when we >>> would ever be able to stop, and that's pretty clearly not sustainable. >>> Does that seem reasonable? (Pretty please?) >> If you want to get rid of filestore on Btrfs, start a proper deprecation >> process and inform users that support for it it's going to be removed in >> the near future. The documentation must be updated accordingly and it >> must be clearly emphasized in the release notes. >> >> Simply disabling the tests while keeping the code in the distribution is >> setting up users who happen to be using Btrfs for failure. > I don't think we can wait *another* cycle (year) to stop testing this. > > We can, however, > > - prominently feature this in the luminous release notes, and > - require the 'enable experimental unrecoverable data corrupting features = > btrfs' in order to use it, so that users are explicitly opting in to > luminous+btrfs territory. > > The only good(ish) news is that we aren't touching FileStore if we can > help it, so it less likely to regress than other things. And we'll > continue testing filestore+btrfs on jewel for some time. > > Is that good enough? I'm not sure how we will handle the transition. Is bluestore considered stable in Jewel? If so, our current clusters (recently migrated from Firefly to Hammer) will have support for both BTRFS+Filestore and Bluestore when the next upgrade takes place. If Bluestore is only considered stable in Luminous I don't see how we can manage the transition easily. 
The only path I see is to: - migrate to XFS+filestore with Jewel (which will not only take time but will be a regression for us: it will cause performance and sizing problems on at least one of our clusters, and we will lose the silent corruption detection from BTRFS), - then upgrade to Luminous and migrate again to Bluestore. I was not expecting the transition from Btrfs+Filestore to Bluestore to be this convoluted (we were planning to add Bluestore OSDs one at a time and study the performance/stability for months before migrating the whole clusters). Is there any way to restrict your BTRFS tests to at least a given stable configuration? (BTRFS is known to have problems with the high rate of snapshot deletion Ceph generates by default, for example, and we use 'filestore btrfs snap = false'.) Best regards, Lionel
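For reference, the setting mentioned above lives in the OSD section of ceph.conf. A minimal fragment (this reflects this cluster's tuning, not a general recommendation):

```ini
[osd]
# Disable the btrfs-snapshot-based journal consistency mechanism;
# filestore then falls back to write-ahead journaling as on XFS,
# avoiding the high rate of btrfs snapshot creation/deletion.
filestore btrfs snap = false
```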
Re: [ceph-users] HW Raid vs. Multiple OSD
On 13/11/2017 at 15:47, Oscar Segarra wrote: > Thanks Mark, Peter, > > For clarification, the configuration with RAID5 is having many servers > (2 or more) with RAID5 and CEPH on top of it. Ceph will replicate data > between servers. Of course, each server will have just one OSD daemon > managing a big disk. > > It looks like functionally it is the same using RAID5 + 1 Ceph daemon as 8 > CEPH daemons. Functionally it's the same, but RAID5 will kill your write performance. For example if you start with 3 OSD hosts and a pool size of 3, due to RAID5 every write on your Ceph cluster will imply a read of all disks minus one on one server, then a write to *all* the disks of the cluster. If you use one OSD per disk you'll have a read on one disk only and a write on 3 disks only: you'll get approximately 8 times the IOPS for writes (with 8 disks per server). Clever RAID5 logic can minimize this for some I/O patterns, but it is a bet and will never be as good as what you'll get with one disk per OSD. Best regards, Lionel
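The "approximately 8 times" figure can be checked by counting disk operations per client write, under the worst-case assumption in the mail (a naive full-stripe RAID5 update with no write-back cache absorbing it):

```python
# Disk operations per client write: one big RAID5 OSD per host vs one OSD
# per disk, for the mail's example (3 hosts, 8 disks each, pool size 3).
disks_per_host, pool_size = 8, 3

# RAID5: the parity update reads the other disks on one host, then the
# stripe write hits every disk on all 3 replica hosts (worst case).
raid5_ops = (disks_per_host - 1) + disks_per_host * pool_size

# One OSD per disk: one read plus one write per replica.
per_disk_osd_ops = 1 + pool_size

print(raid5_ops, per_disk_osd_ops, raid5_ops / per_disk_osd_ops)
# -> 31 ops vs 4 ops, a ratio of 7.75 -- roughly the "8 times" above
```

Real controllers batch and cache, so the observed ratio varies, but the asymmetry is structural: every extra disk in the RAID5 set makes the per-write cost worse, while the per-disk OSD layout stays constant.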
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
Hi, On 22/02/2018 23:32, Mike Lovell wrote: > hrm. intel has, until a year ago, been very good with ssds. the > description of your experience definitely doesn't inspire confidence. > intel also dropping the entire s3xxx and p3xxx series last year before > having a viable replacement has been driving me nuts. > > i don't know that i have the luxury of being able to return all of the > ones i have or just buying replacements. i'm going to need to at least > try them in production. it'll probably happen with the s4600 limited > to a particular fault domain. these are also going to be filestore > osds so maybe that will result in a different behavior. i'll try to > post updates as i have them. Sorry for digging so deep into the archives. I might be in a situation where I could get S4600s (with filestore initially, but I would very much like them to support Bluestore without bursting into flames). To expand a Ceph cluster and test EPYC in our context we have ordered a server based on a Supermicro EPYC motherboard and SM863a SSDs. For reference: https://www.supermicro.nl/Aplus/motherboard/EPYC7000/H11DSU-iN.cfm Unfortunately I just learned that Supermicro found an incompatibility between this motherboard and SM863a SSDs (I don't have more information yet) and they proposed the S4600 as an alternative. I immediately remembered that there were problems, asked for a delay/more information, and dug out this old thread. Has anyone successfully used Ceph with the S4600? If so, could you share whether you used filestore or bluestore, which firmware was used, and approximately how much data was written on the most used SSDs? Best regards, Lionel
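On the earlier point about watching SMART attributes before SSDs die: a simple linear extrapolation of a countdown-style wear indicator is often enough for alerting. This is a hypothetical sketch — the function name and parameters are mine, and the attribute to read (e.g. Intel's Media_Wearout_Indicator, SMART attribute 233) is vendor-specific, so check your model's documentation:

```python
def days_until_worn_out(wearout_now, wearout_before, days_elapsed, alert_floor=10):
    """Linear extrapolation of an SSD wear indicator that counts down from 100.

    `wearout_before` is the value observed `days_elapsed` days ago. Returns the
    estimated days left before the indicator reaches `alert_floor`; alert well
    before that, since write amplification can accelerate wear unpredictably.
    """
    points_per_day = (wearout_before - wearout_now) / days_elapsed
    if points_per_day <= 0:
        return float('inf')  # no measurable wear over the observation window
    return (wearout_now - alert_floor) / points_per_day

# Example: the indicator dropped from 100 to 94 over 90 days.
print(round(days_until_worn_out(94, 100, 90)))  # -> 1260 (about 3.5 years)
```

Reading the raw values typically means parsing `smartctl -A` output, which requires a controller that passes SMART through, as noted above.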
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
On 31/05/2018 14:41, Simon Ironside wrote: > On 24/05/18 19:21, Lionel Bouton wrote: > >> Unfortunately I just learned that Supermicro found an incompatibility >> between this motherboard and SM863a SSDs (I don't have more information >> yet) and they proposed S4600 as an alternative. I immediately remembered >> that there were problems and asked for a delay/more information and dug >> out this old thread. > > In case it helps you, I'm about to go down the same Supermicro EPYC > and SM863a path as you. I asked about the incompatibility you > mentioned and they knew what I was referring to. The incompatibility > is between the on-board SATA controller and the SM863a and has > apparently already been fixed. That's good news. > Even if not fixed, the incompatibility wouldn't be present if you're > using a RAID controller instead of the on board SATA (which I intend > to - don't know if you were?). I wasn't: we plan to use the 14 on-board SATA connectors. As long as we can, we use a standard SATA/AHCI controller as they cause fewer headaches than RAID controllers, even in HBA mode. Thanks a lot for this information, I've forwarded it to our Supermicro reseller. Best regards, Lionel
Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM
On 11/12/2018 at 15:51, Konstantin Shalygin wrote: > >> Currently I plan a migration of a large VM (MS Exchange, 300 Mailboxes >> and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous >> cluster (which already holds lots of images). >> The server has access to both local and cluster storage, I only need >> to live migrate the storage, not the machine. >> >> I have never used live migration as it can cause more issues, and the >> VMs that were already migrated had planned downtime. >> Taking the VM offline and converting/importing using qemu-img would take >> some hours but I would like to still serve clients, even if it is >> slower. >> >> The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with >> BBU). There are two HDDs bound as RAID1 which are constantly under 30% >> - 60% load (this goes up to 100% during reboot, updates or login >> prime-time). >> >> What happens when either the local compute node or the ceph cluster >> fails (degraded)? Or network is unavailable? >> Are all writes performed to both locations? Is this fail-safe? Or does >> the VM crash in the worst case, which can lead to a dirty shutdown for MS-EX >> DBs? >> >> The node currently has 4GB free RAM and 29GB listed as cache / >> available. These numbers need caution because we have "tuned" enabled, >> which causes de-duplication of RAM, and this host runs about 10 Windows >> VMs. >> During reboots or updates, RAM can get full again. >> >> Maybe I am too cautious about live storage migration, maybe I am not. >> >> What are your experiences or advice? >> >> Thank you very much! > > I read your message two times and still can't figure out what your > question is. > > You need to move your block image from some storage to Ceph? No, you > can't do this without downtime because of fs consistency. > > You can easily migrate your filesystem via rsync for example, with a small > downtime to reboot the VM. > I believe the OP is trying to use the storage migration feature of QEMU. 
I've never tried it and I wouldn't recommend it (probably not very well tested, and there is a large window for failure). One tactic that can be used, assuming the OP is using LVM in the VM for storage, is to add a Ceph volume to the VM (probably needs a reboot), add the corresponding virtual disk to the VM's volume group, and then migrate all data from the logical volume(s) to the new disk. LVM uses mirroring internally during the transfer, so you get robustness by using it. It can be slow (especially with old kernels) but at least it is safe. I did a DRBD to Ceph migration with this process 5 years ago. When all logical volumes have been moved to the new disk you can remove the old disk from the volume group. Assuming everything is on LVM including the root filesystem, only moving the boot partition will have to be done outside of LVM. Best regards, Lionel
Re: [ceph-users] Major ceph disaster
On 13/05/2019 at 16:20, Kevin Flöh wrote: > Dear ceph experts, > > [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...] > Here is what happened: One osd daemon could not be started and > therefore we decided to mark the osd as lost and set it up from > scratch. Ceph started recovering and then we lost another osd with the > same behavior. We did the same as for the first osd. With 3+1 you only allow a single OSD failure per pg at any given time. You have 4096 pgs and 96 OSDs; having 2 OSDs fail at the same time on 2 separate servers (assuming standard crush rules) is a death sentence for the data on any pg using both of those OSDs (the ones not fully recovered before the second failure). Depending on the data stored (CephFS?) you can probably recover most of it, but some of it is irremediably lost. If you can recover the data from the failed OSDs as it was at the time they failed, you might be able to recover some of your lost data (with the help of the Ceph devs); if not, there's nothing to do. In the latter case I'd add a new server, use at least 3+2 for a fresh pool instead of 3+1, and begin moving the data to it. The 12.2 + 13.2 mix is a potential problem in addition to the one above, but it's a different one. Best regards, Lionel
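To get a feel for how many pgs a double failure condemns with this layout, here is an illustrative Monte Carlo sketch. It uses uniform random placement of one shard per host as a crude stand-in for CRUSH (not a real CRUSH simulation), with the pg/OSD counts from the mail:

```python
import random

def pgs_hitting_both(n_pgs, hosts, osds_per_host, failed, rng):
    """Count pgs that placed a shard on both failed OSDs.

    Each pg gets one shard (3 data + 1 parity with a 3+1 profile)
    on a randomly chosen OSD of each host.
    """
    hit = 0
    for _ in range(n_pgs):
        shards = {(host, rng.randrange(osds_per_host)) for host in range(hosts)}
        if failed <= shards:
            hit += 1
    return hit

rng = random.Random(0)
failed = {(0, 0), (1, 0)}   # one lost OSD on each of two different hosts
trials = [pgs_hitting_both(4096, 4, 24, failed, rng) for _ in range(100)]
print(sum(trials) / len(trials))  # close to 4096 / 24**2, i.e. ~7 pgs doomed
```

With m=1 any pg touching both lost OSDs has two missing shards and is unrecoverable; even though only a handful of pgs are affected, each can hold objects from many files, which matches the partial-loss outcome described above. With 3+2 the same double failure would leave every pg recoverable.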