Re: Designing a cluster guide
On Thu, May 17, 2012 at 2:27 PM, Gregory Farnum g...@inktank.com wrote: Sorry this got left for so long...

On Thu, May 10, 2012 at 6:23 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, the Designing a cluster guide http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good, but it still leaves some questions unanswered. It mentions, for example, "fast CPU" for the MDS system. What does fast mean? Just the speed of one core? Or is Ceph designed to use multiple cores? Is multi-core or more speed important?

Right now, it's primarily the speed of a single core. The MDS is highly threaded, but doing most things requires grabbing a big lock. How fast is a qualitative rather than quantitative assessment at this point, though.

The Cluster Design Recommendations mention separating all daemons onto dedicated machines. Is this also useful for the MON? As they're so lightweight, why not run them on the OSDs?

It depends on what your nodes look like, and what sort of cluster you're running. The monitors are pretty lightweight, but they will add *some* load. More important is their disk access patterns — they have to do a lot of syncs. So if they're sharing a machine with some other daemon, you want them to have an independent disk and to be running a new kernel+glibc so that they can use syncfs rather than sync. (The only distribution I know for sure does this is Ubuntu 12.04.)

I just had it pointed out to me that I rather overstated the importance of syncfs if you were going to do this. The monitor mostly does fsync(), not sync()/syncfs(), so that's not so important. What is important is that it has highly seeky disk behavior, so you don't want a ceph-osd and a ceph-mon daemon sharing a disk. :) -Greg
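[A quick way to check whether a given kernel+glibc combination actually exposes syncfs(2) is simply to call it. A minimal sketch in Python, assuming Linux; the raw-syscall fallback number is x86_64-specific and differs on other architectures:]

    import ctypes, ctypes.util, os

    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    fd = os.open("/", os.O_RDONLY)
    try:
        if hasattr(libc, "syncfs"):
            rc = libc.syncfs(fd)          # glibc >= 2.14 ships a wrapper
        else:
            SYS_syncfs = 306              # x86_64 syscall number only
            rc = libc.syscall(SYS_syncfs, fd)
        if rc == 0:
            print("syncfs works here")
        else:
            print("syncfs failed: %s" % os.strerror(ctypes.get_errno()))
    finally:
        os.close(fd)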
Re: Designing a cluster guide
On Fri, Jun 29, 2012 at 11:07 AM, Gregory Farnum g...@inktank.com wrote: the Designing a cluster guide http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it still leaves some questions unanswered.

Oh, thank you. I've been poking through the Ceph docs, but somehow had not managed to turn up the wiki yet. What are the likely and worst-case scenarios if the OSD journal were to simply be on a garden-variety ramdisk, with no battery backing? In the case of a single node losing power, and thus losing some data, surely Ceph can recognize this and handle it through normal redundancy? I could see it being an issue if the whole cluster lost power at once. Anything I'm missing? Brian.
Re: Designing a cluster guide
On Fri, Jun 29, 2012 at 11:50 AM, Gregory Farnum g...@inktank.com wrote: If you lose a journal, you lose the OSD.

Really? Everything? Not just recent commits? I would have hoped it would just come back up in an old state. Replication should already have been taking care of regaining redundancy for the stuff that was on it, particularly the newest stuff that wouldn't return with it and say "Hi, I'm back." I suppose it makes the design easier, though. =) Brian.
Re: Designing a cluster guide
On Fri, Jun 29, 2012 at 1:59 PM, Brian Edmonds mor...@gmail.com wrote: On Fri, Jun 29, 2012 at 11:50 AM, Gregory Farnum g...@inktank.com wrote: If you lose a journal, you lose the OSD. Really? Everything? Not just recent commits? I would have hoped it would just come back up in an old state. Replication should already have been taking care of regaining redundancy for the stuff that was on it, particularly the newest stuff that wouldn't return with it and say "Hi, I'm back." I suppose it makes the design easier, though. =)

Well, actually this depends on the filesystem you're using. With btrfs, the OSD will roll back to a consistent state, but you don't know how out-of-date that state is. (Practically speaking it's pretty recent, but if you were doing any writes there is going to be data loss.) With xfs/ext4/other, the OSD can't create consistency points the same way it can with btrfs, and so the loss of a journal means that it can't repair itself. Sorry for not mentioning the distinction earlier; I didn't think we'd implemented the rollback on btrfs. :) -Greg
Re: Designing a cluster guide
On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum g...@inktank.com wrote: Well, actually this depends on the filesystem you're using. With btrfs, the OSD will roll back to a consistent state, but you don't know how out-of-date that state is.

Ok, so assuming btrfs, then a single machine failure with a ramdisk journal should not result in any data loss, assuming replication is working? The cluster would then be at risk of data loss primarily from a full power outage. (In practice I'd expect either one machine to die, or a power loss to take out all of them; smaller but non-unitary losses would be uncommon.) Something to play with, perhaps. Brian.
Re: Designing a cluster guide
On Fri, Jun 29, 2012 at 2:18 PM, Brian Edmonds mor...@gmail.com wrote: On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum g...@inktank.com wrote: Well, actually this depends on the filesystem you're using. With btrfs, the OSD will roll back to a consistent state, but you don't know how out-of-date that state is. Ok, so assuming btrfs, then a single machine failure with a ramdisk journal should not result in any data loss, assuming replication is working? The cluster would then be at risk of data loss primarily from a full power outage. (In practice I'd expect either one machine to die, or a power loss to take out all of them; smaller but non-unitary losses would be uncommon.)

That's correct. And replication will be working — it's all synchronous, so if the replication isn't working, you won't be able to write. :) There are some edge cases here — if an OSD is down but not out then you might not have the same number of data copies as normal, but that's all configurable.

Something to play with, perhaps. Brian.
Re: Designing a cluster guide
On Fri, 29 Jun 2012, Brian Edmonds wrote: On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum g...@inktank.com wrote: Well, actually this depends on the filesystem you're using. With btrfs, the OSD will roll back to a consistent state, but you don't know how out-of-date that state is. Ok, so assuming btrfs, then a single machine failure with a ramdisk journal should not result in any data loss, assuming replication is working? The cluster would then be at risk of data loss primarily from a full power outage. (In practice I'd expect either one machine to die, or a power loss to take out all of them; smaller but non-unitary losses would be uncommon.)

Right. From a data-safety perspective ("the cluster said my writes were safe... are they?"), consider journal loss an OSD failure. If there aren't other surviving replicas, something may be lost. From a recovery perspective, it is a partial failure; not everything was lost, and recovery will be quick (only recent objects get copied around). Maybe your application can tolerate that, maybe it can't. sage
RE: Designing a cluster guide
Interesting. I've been thinking about this, and I think most Ceph installations could benefit from more nodes and fewer disks per node. For example: say we have a replica level of 2 and an RBD object size of 4 MB, and you start writing a 10 GB file. It is effectively divided into 4 MB chunks. The first chunk goes to node 1 and node 2 (at the same time, I assume), where it is written to a journal and then replayed to the data filesystem. The second chunk might be sent to nodes 2 and 3 at the same time, written to a journal, then replayed (we now have overlap with chunk 1). The third chunk might be sent to nodes 1 and 3 (more overlap with chunks 1 and 2), and as you can see this quickly becomes an issue. So if we have 10 nodes vs. 3 nodes with the same amount of disks, we should see better write and read performance, as you would have less overlap. (A toy simulation of this follows below.)

Now we take btrfs into the picture: as I understand it, journals are not necessary due to the way it writes/snapshots and reads data, and this alone would be a major performance increase on a btrfs RAID level (like ZFS RAIDZ). Side note — this may sound crazy, but the more I read about SSDs the less I wish to use/rely on them, and RAM SSDs are crazily priced, IMO. =) Regards, Quenten
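[Quenten's striping argument is easy to sanity-check numerically. The toy simulation below uses plain random placement — not CRUSH — so it is purely illustrative: it spreads the 4 MB chunks of a 10 GB image over 3 vs. 10 nodes at replica level 2 and reports how loaded the busiest node ends up.]

    import random
    from collections import Counter

    def simulate(nodes, image_gb=10, chunk_mb=4, replicas=2):
        chunks = image_gb * 1024 // chunk_mb   # 2560 chunks for a 10 GB image
        load = Counter()
        for _ in range(chunks):
            for node in random.sample(range(nodes), replicas):
                load[node] += 1
        return chunks, load

    for n in (3, 10):
        chunks, load = simulate(n)
        print("%2d nodes: %d chunks, busiest node holds %d of them"
              % (n, chunks, max(load.values())))

[With 3 nodes every node carries roughly two thirds of all chunk replicas; with 10 nodes each carries about a fifth, which is the "less overlap" effect described above.]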
Re: Designing a cluster guide
On Tue, May 29, 2012 at 12:25 AM, Quenten Grasso qgra...@onq.com.au wrote: So if we have 10 nodes vs. 3 nodes with the same amount of disks, we should see better write and read performance, as you would have less overlap.

First of all, a typical way to run Ceph is with, say, 8-12 disks per node and an OSD per disk. That means your 3-10 node clusters actually have 24-120 OSDs on them. The number of physical machines is not really a factor; the number of OSDs is what matters. Secondly, 10-node or 3-node clusters are fairly uninteresting for Ceph. The real challenge is at the hundreds, thousands and above range.

Now we take btrfs into the picture: as I understand it, journals are not necessary due to the way it writes/snapshots and reads data, and this alone would be a major performance increase on a btrfs RAID level (like ZFS RAIDZ).

A journal is still needed on btrfs; snapshots just enable us to write to the journal in parallel with the real write, instead of needing to journal first.
Re: Designing a cluster guide
On Wed, 23 May 2012, Gregory Farnum wrote: On Wed, May 23, 2012 at 12:47 PM, Jerker Nyberg jer...@update.uu.se wrote: * Scratch file system for HPC. (kernel client) * Scratch file system for research groups. (SMB, NFS, SSH) * Backend for simple disk backup. (SSH/rsync, AFP, BackupPC) * Metropolitan cluster. * VDI backend. KVM with RBD.

Hmm. Sounds to me like scratch filesystems would get a lot out of not having to hit disk on the commit, but not much out of having separate caching locations versus just letting the OSD page cache handle it. :) The others, I don't really see collaborative caching helping much either.

Oh, sorry, those were my use cases for Ceph in general. Yes, scratch is mostly of interest. But also fast backup: currently IOPS are limiting our backup speed on a small cluster with many files but not much data. I have problems scanning through and backing up all changed files every night. Currently I am backing up to ZFS, but Ceph might help with scaling up performance and size. Another option is going for SSDs instead of mechanical drives.

Anyway, make a bug for it in the tracker (I don't think one exists yet, though I could be wrong) and someday when we start work on the filesystem again we should be able to get to it. :)

Thank you for your thoughts on this. I hope to be able to do that soon. Regards, Jerker Nyberg, Uppsala, Sweden.
Re: Designing a cluster guide
On Wed, May 23, 2012 at 12:47 PM, Jerker Nyberg jer...@update.uu.se wrote: On Tue, 22 May 2012, Gregory Farnum wrote: Direct users of the RADOS object store (i.e., librados) can do all kinds of things with the integrity guarantee options. But I don't believe there's currently a way to make the filesystem do so — among other things, you're running through the page cache and other writeback caches anyway, so it generally wouldn't be useful except when running an fsync or similar. And at that point you probably really want to not be lying to the application that's asking for it.

I am comparing with in-memory databases. If replication and failover are used, couldn't in-memory in some cases be good enough? And faster.

Do you have a use case on Ceph? Currently of interest: * Scratch file system for HPC. (kernel client) * Scratch file system for research groups. (SMB, NFS, SSH) * Backend for simple disk backup. (SSH/rsync, AFP, BackupPC) * Metropolitan cluster. * VDI backend. KVM with RBD.

Hmm. Sounds to me like scratch filesystems would get a lot out of not having to hit disk on the commit, but not much out of having separate caching locations versus just letting the OSD page cache handle it. :) The others, I don't really see collaborative caching helping much either. So basically it sounds like you want to be able to toggle off Ceph's data safety requirements. That would have to be done in the clients; it wouldn't even be hard in ceph-fuse (although I'm not sure about the kernel client). It's probably a pretty easy way to jump into the code base :) Anyway, make a bug for it in the tracker (I don't think one exists yet, though I could be wrong) and someday when we start work on the filesystem again we should be able to get to it. :) -Greg
Re: Designing a cluster guide
On 21.05.2012 20:13, Gregory Farnum wrote: On Sat, May 19, 2012 at 1:37 AM, Stefan Priebe s.pri...@profihost.ag wrote: So would you recommend a fast (more GHz) Core i3 instead of a single Xeon for this system? (price per GHz is better). If that's all the MDS is doing there, probably? (It would also depend on cache sizes and things; I don't have a good sense for how that impacts the MDS' performance.)

As I'm only using KVM / rbd, I don't have any MDS.

Well, RAID1 isn't going to make it any faster than just the single SSD, which is why I pointed that out. I wouldn't recommend using a ramdisk for the journal — that will guarantee local data loss in the event the server doesn't shut down properly, and if it happens to several servers at once you get a good chance of losing client writes.

Sure, but it's the same when NOT using a RAID 1 for the journal, isn't it? Stefan
Re: Designing a cluster guide
On Mon, 21 May 2012, Gregory Farnum wrote: This one — the write is considered safe once it is on-disk on all OSDs currently responsible for hosting the object.

Is it possible to configure the client to consider the write successful when the data is hitting RAM on all the OSDs but is not yet committed to disk? Also, the IBM zFS research file system talks about a cooperative cache, and Lustre about a collaborative cache. Do you have any thoughts on this regarding Ceph? Regards, Jerker Nyberg, Uppsala, Sweden.
Re: Designing a cluster guide
On 20.05.2012 10:31, Christian Brunner wrote: That's exactly what I thought too, but then you need a separate ceph/rbd cluster for each type. Which will result in a minimum of: 3x mon servers per type, 4x osd servers per type — so you'll need a minimum of 12x osd systems and 9x mon systems. You can arrange the storage types in different pools, so that you don't need separate mon servers (this can be done by adjusting the crushmap), and you could even run multiple OSDs per server.

That sounds great. Can you give me a hint how to set up pools? Right now I have data, metadata and rbd — the default pools. But I wasn't able to find any page in the wiki which describes how to set up pools. Thanks, Stefan
Re: Designing a cluster guide
2012/5/20 Tim O'Donovan t...@icukhosting.co.uk: - High performance Block Storage (RBD): many large SATA SSDs for the storage (probably in a RAID5 config), a STEC ZeusRAM SSD drive for the journal. How do you think standard SATA disks would perform in comparison to this, and is a separate journaling device really necessary?

A journaling device improves write latency a lot, and write latency is directly related to the throughput you get in your virtual machine. If you have a RAID controller with a battery-backed write cache, you could try to put the journal on a separate, small partition of your SATA disk. I haven't tried this, but I think it could work. Apart from that, you should calculate the sum of the IOPS your guests generate. In the end everything has to be written to your backend storage, and it has to be able to deliver the IOPS. With the journal you might be able to compensate for short write peaks, and there might be a gain from merging write requests on the OSDs, but for solid sizing I would neglect this. Read requests can be delivered from the OSDs' cache (RAM), but again this will probably give you only a small gain. For a single SATA disk you can calculate with 100-150 IOPS (depending on the speed of the disk). SSDs can deliver much higher IOPS values.

Perhaps three servers, each with 12 x 1TB SATA disks configured in RAID10, an osd on each server and three separate mon servers. With a replication level of two this would be 1350 IOPS: 150 IOPS per disk * 12 disks * 3 servers / 2 for the RAID10 / 2 for ceph replication. Comments on this formula would be welcome... Would this be suitable as the storage backend for a small OpenStack cloud, performance-wise, for instance?

That depends on what you are doing in your guests. Regards, Christian
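[The sizing formula above, written out as a small function with the same stated assumptions — ~150 IOPS per SATA disk, RAID10 halving effective write IOPS, and Ceph replication halving them again:]

    def usable_write_iops(disks_per_server, servers, iops_per_disk=150,
                          raid10_factor=2, ceph_replicas=2):
        raw = iops_per_disk * disks_per_server * servers
        return raw // raid10_factor // ceph_replicas

    print(usable_write_iops(12, 3))   # 150 * 12 * 3 / 2 / 2 = 1350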
Re: Designing a cluster guide
2012/5/21 Stefan Priebe - Profihost AG s.pri...@profihost.ag: On 20.05.2012 10:31, Christian Brunner wrote: That's exactly what I thought too, but then you need a separate ceph/rbd cluster for each type. Which will result in a minimum of: 3x mon servers per type, 4x osd servers per type — so you'll need a minimum of 12x osd systems and 9x mon systems. You can arrange the storage types in different pools, so that you don't need separate mon servers (this can be done by adjusting the crushmap), and you could even run multiple OSDs per server. That sounds great. Can you give me a hint how to set up pools? Right now I have data, metadata and rbd — the default pools. But I wasn't able to find any page in the wiki which describes how to set up pools.

rados mkpool <pool-name> [123[ 4]] — create pool <pool-name> [with auid 123 [and using crush rule 4]]

Christian
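[The same pool creation can be done from the librados Python bindings. This is a sketch under the assumption that python-rados is installed and exposes create_pool/list_pools (those are the binding names I'd expect, so treat the exact calls as an assumption), and that /etc/ceph/ceph.conf points at the cluster:]

    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        cluster.create_pool("fast-ssd")    # shows up alongside data/metadata/rbd
        print(cluster.list_pools())
    finally:
        cluster.shutdown()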
Re: Designing a cluster guide
Another great thing that should be mentioned is https://github.com/facebook/flashcache/. It gives really huge performance improvements for reads/writes (especially on FusionIO drives), even without using librbd caching :-)

On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER aderum...@odiso.com wrote: Hi, for your journal, if you have money, you can use a STEC ZeusRAM SSD drive (around 2000€ / 8GB / 10 iops read/write with 4k blocks). I'm using them with a ZFS SAN; they rock for journals. http://www.stec-inc.com/product/zeusram.php Another interesting product is the DDRdrive: http://www.ddrdrive.com/

-- Tomasz Paszkowski SS7, Asterisk, SAN, Datacenter, Cloud Computing +48500166299
Re: Designing a cluster guide
Project is indeed very interesting, but it requires patching the kernel source. For me, using an LKM is safer ;)

On Mon, May 21, 2012 at 5:30 PM, Kiran Patil kirantpa...@gmail.com wrote: Hello, has someone looked into bcache (http://bcache.evilpiepirate.org/)? It seems it is superior to flashcache. Lwn.net article: https://lwn.net/Articles/497024/ Mailing list: http://news.gmane.org/gmane.linux.kernel.bcache.devel Source code: http://evilpiepirate.org/cgi-bin/cgit.cgi/linux-bcache.git/ Thanks, Kiran Patil.

-- Tomasz Paszkowski SS7, Asterisk, SAN, Datacenter, Cloud Computing +48500166299
Re: Designing a cluster guide
On Sat, May 19, 2012 at 1:37 AM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Greg, on 17.05.2012 23:27, Gregory Farnum wrote: It mentions for example "fast CPU" for the MDS system. What does fast mean? Just the speed of one core? Or is Ceph designed to use multiple cores? Is multi-core or more speed important? Right now, it's primarily the speed of a single core. The MDS is highly threaded but doing most things requires grabbing a big lock. How fast is a qualitative rather than quantitative assessment at this point, though. So would you recommend a fast (more GHz) Core i3 instead of a single Xeon for this system? (price per GHz is better).

If that's all the MDS is doing there, probably? (It would also depend on cache sizes and things; I don't have a good sense for how that impacts the MDS' performance.)

It depends on what your nodes look like, and what sort of cluster you're running. The monitors are pretty lightweight, but they will add *some* load. More important is their disk access patterns — they have to do a lot of syncs. So if they're sharing a machine with some other daemon you want them to have an independent disk and to be running a new kernel+glibc so that they can use syncfs rather than sync. (The only distribution I know for sure does this is Ubuntu 12.04.) Which kernel and which glibc version support this? I have searched Google but haven't found an exact version. We're using Debian Lenny/Squeeze with a custom kernel.

syncfs is in Linux 2.6.39; I'm not sure about glibc, but from a quick web search it looks like it might have appeared in glibc 2.15?

Regarding the OSDs: is it fine to use an SSD RAID 1 for the journal and perhaps 22x SATA disks in a RAID 10 for the FS, or is this quite absurd and you should go for 22x SSD disks in a RAID 6? You'll need to do your own failure calculations on this one, I'm afraid. Just take note that you'll presumably be limited to the speed of your journaling device here. Yeah, that's why I wanted to use a RAID 1 of SSDs for the journaling. Or is this still too slow? Another idea was to use only a ramdisk for the journal, back the files up to disk while shutting down, and restore them after boot.

Well, RAID1 isn't going to make it any faster than just the single SSD, which is why I pointed that out. I wouldn't recommend using a ramdisk for the journal — that will guarantee local data loss in the event the server doesn't shut down properly, and if it happens to several servers at once you get a good chance of losing client writes.

Is it more useful to use a RAID 6 HW controller or the btrfs RAID? I would use the hardware controller over btrfs RAID for now; it allows more flexibility in e.g. switching to xfs. :) OK, but overall you would recommend running one OSD per disk, right? So instead of using a RAID 6 with for example 10 disks you would run 6 OSDs on this machine?

Right now all the production systems I'm involved in are using 1 OSD per disk, but honestly we don't know if that's the right answer or not. It's a tradeoff — more OSDs increase CPU and memory requirements (per storage space) but also localize failure a bit more.

Use a single-socket Xeon for the OSDs or dual-socket? Dual-socket servers will be overkill given the setup you're describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD daemon. You might consider it if you decided you wanted to do an OSD per disk instead (that's a more common configuration, but it requires more CPU and RAM per disk and we don't know yet which is the better choice). Is there also a rule of thumb for the memory?

About 200MB per daemon right now, plus however much you want the page cache to be able to use. :) This might go up a bit during peering, but under normal operation it shouldn't be more than another couple hundred MB.

My biggest problem with Ceph right now is the awfully slow speed while doing random reads and writes. Sequential reads and writes are at 200 MB/s (that's pretty good for bonded dual Gbit/s). But random reads and writes are only at 0.8-1.5 MB/s, which is definitely too slow.

Hmm. I'm not super-familiar with where our random IO performance is right now (and lots of other people seem to have advice on journaling devices :), but that's about in line with what you get from a hard disk normally. Unless you've designed your application very carefully (lots and lots of parallel IO), an individual client doing synchronous random IO is unlikely to be able to get much faster than a regular drive. -Greg
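[A quick back-of-envelope for Greg's last point: a single client doing synchronous random IO pays the full disk service time per request, so the ceiling is roughly one request per seek. Assuming ~10 ms average service time on a SATA disk and 4 KB requests:]

    service_ms = 10.0                  # assumed avg seek + rotation, 7.2k SATA
    iops = 1000.0 / service_ms         # ~100 synchronous IOPS
    mb_per_s = iops * 4 / 1024.0       # at 4 KB per request
    print("%.0f IOPS -> %.2f MB/s" % (iops, mb_per_s))   # ~0.39 MB/s

[That is the same order of magnitude as the 0.8-1.5 MB/s Stefan measured; larger requests or some request merging account for the difference.]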
Re: Designing a cluster guide
On 21 May 2012 16:36, Tomasz Paszkowski ss7...@gmail.com wrote: Project is indeed very interesting, but it requires patching the kernel source. For me, using an LKM is safer ;)

I believe bcache is actually in the process of being mainlined and moved to a device mapper target, although I could be wrong about one or more of those things.
Re: Designing a cluster guide
On 21.05.2012 17:12, Tomasz Paszkowski wrote: If you're using Qemu/KVM you can use the 'info blockstats' command for measuring I/O on a particular VM.

I want to migrate physical servers to KVM. Any idea for that? Stefan
Re: Designing a cluster guide
Just to clarify: you'd like to measure I/O on those systems which are currently running on physical machines?

On Mon, May 21, 2012 at 10:11 PM, Stefan Priebe s.pri...@profihost.ag wrote: On 21.05.2012 17:12, Tomasz Paszkowski wrote: If you're using Qemu/KVM you can use the 'info blockstats' command for measuring I/O on a particular VM. I want to migrate physical servers to KVM. Any idea for that? Stefan

-- Tomasz Paszkowski SS7, Asterisk, SAN, Datacenter, Cloud Computing +48500166299
Re: Designing a cluster guide
On 21.05.2012 22:13, Tomasz Paszkowski wrote: Just to clarify: you'd like to measure I/O on those systems which are currently running on physical machines?

IOPS, not just I/O. Stefan
Re: Designing a cluster guide
On Linux boxes you may use the output from iostat -x /dev/sda and connect it to any monitoring system like Zabbix or Cacti :-)

On Mon, May 21, 2012 at 10:14 PM, Stefan Priebe s.pri...@profihost.ag wrote: On 21.05.2012 22:13, Tomasz Paszkowski wrote: Just to clarify: you'd like to measure I/O on those systems which are currently running on physical machines? IOPS, not just I/O. Stefan

-- Tomasz Paszkowski SS7, Asterisk, SAN, Datacenter, Cloud Computing +48500166299
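[If you'd rather not parse iostat output, the same numbers can be read straight from /proc/diskstats. A rough sketch that samples twice and derives read/write IOPS for one device (assumes Linux; "sda" is a placeholder device name):]

    import time

    def io_counts(dev):
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    # 4th field = reads completed, 8th = writes completed
                    return int(fields[3]), int(fields[7])
        raise ValueError("device %s not found" % dev)

    interval = 10.0
    r1, w1 = io_counts("sda")
    time.sleep(interval)
    r2, w2 = io_counts("sda")
    print("read IOPS: %.1f  write IOPS: %.1f"
          % ((r2 - r1) / interval, (w2 - w1) / interval))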
Re: Designing a cluster guide
Maybe good for the journal would be two cheap MLC Intel drives on SandForce (320/520), 120GB or 240GB, with the HPA changed to 20-30GB, used only for separate journaling partitions in hardware RAID1. I'd like to test a setup like this, but maybe someone has some real-life info?

On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski ss7...@gmail.com wrote: Another great thing that should be mentioned is https://github.com/facebook/flashcache/. It gives really huge performance improvements for reads/writes (especially on FusionIO drives), even without using librbd caching :-)
Re: Designing a cluster guide
On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso qgra...@onq.com.au wrote: Hi All, I've been thinking about this issue myself the past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146GB disks in RAID 10 inside a 2U server, with JBODs attached to the server for actual storage. Can someone help clarify this one: once the data is written to the (journal disk), then read from the (journal disk) and written to the (storage disk), and once this is complete — is this considered a successful write by the client? Or: once the data is written to the (journal disk), is this considered successful by the client?

This one — the write is considered safe once it is on-disk on all OSDs currently responsible for hosting the object. Every time anybody mentions RAID10 I have to remind them of the storage amplification that entails, though. Are you sure you want that on top of (well, underneath, really) Ceph's own replication?

Or: once the data is written to the (journal disk) and written to the (storage disk) at the same time, once complete, is this considered a successful write by the client? (If this is the case, SSDs may not be so useful.) Pros: quite fast write throughput to the journal disks; no write wear-out of SSDs; RAID 10 with a 1GB cache controller also helps improve things (if really keen you could use CacheCade as well). Cons: not as fast as SSDs; more rackspace required per server. Regards, Quenten
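[Greg's distinction — acknowledged when the write is in memory on the replicas vs. safe when it is on disk on all of them — is visible directly in librados' async API. A sketch with the Python bindings; aio_write's oncomplete/onsafe callbacks are the names I believe the bindings use, so treat the exact signatures as an assumption, and "test-obj" is just a placeholder object:]

    import threading
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")
    done = threading.Event()

    def on_ack(completion):
        # replicas have the write in memory; readable, not yet durable
        print("acked")

    def on_safe(completion):
        # committed to disk on every OSD hosting the object
        print("safe on disk")
        done.set()

    ioctx.aio_write("test-obj", b"hello", 0, oncomplete=on_ack, onsafe=on_safe)
    done.wait()
    ioctx.close()
    cluster.shutdown()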
RE: Designing a cluster guide
Hi Greg, I'm only talking about journal disks, not storage. :) Regards, Quenten
RE: Designing a cluster guide
I should have added: for storage I'm considering something like enterprise nearline SAS 3TB disks, running individual disks (not RAIDed) with a rep level of 2, as suggested. :) Regards, Quenten
Re: Designing a cluster guide
I get performance of about 320MB/s on a VM from a 3-node rbd cluster, but that's with 10GbE and with 26 2.5'' SAS drives in every machine, so it's not everything the hardware can do. Every OSD drive is a single-drive RAID0 behind battery-backed NVRAM cache in the hardware RAID controller, and every OSD takes a lot of RAM for caching. That's why I'm thinking about swapping two drives for SSDs in RAID1, with the HPA tuned to increase drive durability, for journaling - if that will work ;) The newest drives can theoretically reach 500MB/s at a long queue depth. That means I could in theory improve the bandwidth score, get lower latency, and better handle multiple IO writes from many hosts. Reads are cached in RAM by the OSD daemon, by the VFS in the kernel, and by the NVRAM in the controller, and in the near future by the cache in KVM (I need to test that; it should improve performance). But if the SSD drive slows down, it can drag whole write performance down with it. It's very delicate. Regards, iSS On 22 May 2012, at 02:47, Quenten Grasso qgra...@onq.com.au wrote: [...]
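Figures like the 320MB/s above are easier to compare when everyone measures the same way. A sketch using rados bench, which ships with Ceph and exercises the whole OSD path (journal, replication, network); the pool name is hypothetical and flag details vary a bit between versions:

    # 60 seconds of parallel object writes to pool "data"
    rados bench -p data 60 write
    # sequential read-back of the benchmark objects; depending on the
    # version you may need to keep the write objects around (--no-cleanup)
    rados bench -p data 60 seq

Because these are client-side numbers, they reflect roughly what a VM on rbd would actually see rather than raw disk throughput.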
Re: Designing a cluster guide
On 19.05.2012 18:15, Alexandre DERUMIER wrote: Hi, for your journal, if you have money, you can use the STEC ZeusRAM SSD drive (around 2000€ / 8GB / 10 iops read/write with 4k blocks). I'm using them with a ZFS SAN; they rock for journals. http://www.stec-inc.com/product/zeusram.php Another interesting product is the DDRdrive: http://www.ddrdrive.com/ Great products, but really expensive. The question is: do we really need this in the case of an rbd block device? Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Designing a cluster guide
I think that depends on how much random write IO you have and what latency you can accept. (The purpose of the journal is to absorb random IO and then flush it sequentially to the slow storage.) Maybe a slower SSD will fill your needs. (Just be careful about performance degradation over time, TRIM, ...) ----- Original message ----- From: Stefan Priebe s.pri...@profihost.ag To: Alexandre DERUMIER aderum...@odiso.com Cc: ceph-devel@vger.kernel.org, Gregory Farnum g...@inktank.com Sent: Sunday 20 May 2012 09:56:21 Subject: Re: Designing a cluster guide [...] Great products, but really expensive. The question is: do we really need this in the case of an rbd block device? Stefan -- -- Alexandre Derumier Systems Engineer Phone: 03 20 68 88 90 Fax: 03 20 68 90 81 45 Bvd du Général Leclerc 59100 Roubaix - France 12 rue Marivaux 75002 Paris - France -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
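One way to check whether a cheaper SSD can keep up as a journal is to measure small synchronous writes, which roughly matches the journal's IO pattern (sequential writes with O_DIRECT/O_SYNC-style semantics). A sketch with fio; the device name is hypothetical and the test will destroy its contents:

    # DANGER: writes raw to the device; use a scratch disk or partition.
    # /dev/sdX is a hypothetical candidate journal SSD.
    fio --name=journal-test --filename=/dev/sdX \
        --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based

Consumer SandForce-era drives often post good burst numbers but sag under sustained synchronous writes, which is exactly the degradation-over-time caveat above.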
Re: Designing a cluster guide
2012/5/20 Stefan Priebe s.pri...@profihost.ag: On 19.05.2012 18:15, Alexandre DERUMIER wrote: [...] Great products, but really expensive. The question is: do we really need this in the case of an rbd block device? I think it depends on what you are planning to do. I was calculating different storage types for our cloud solution lately. I think there are three different types that make sense (at least for us):
- Cheap Object Storage (S3): many 3.5'' SATA drives for the storage (probably in a RAID config); a small and cheap SSD for the journal
- Basic Block Storage (RBD): many 2.5'' SATA drives for the storage (RAID10 and/or multiple OSDs); small MaxIOPS SSDs for each OSD journal
- High-performance Block Storage (RBD): many large SATA SSDs for the storage (probably in a RAID5 config); a STEC ZeusRAM SSD drive for the journal
Regards, Christian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Designing a cluster guide
On 20.05.2012 10:19, Christian Brunner wrote:
- Cheap Object Storage (S3): many 3.5'' SATA drives for the storage (probably in a RAID config); a small and cheap SSD for the journal
- Basic Block Storage (RBD): many 2.5'' SATA drives for the storage (RAID10 and/or multiple OSDs); small MaxIOPS SSDs for each OSD journal
- High-performance Block Storage (RBD): many large SATA SSDs for the storage (probably in a RAID5 config); a STEC ZeusRAM SSD drive for the journal
That's exactly what I thought too, but then you need a separate ceph/rbd cluster for each type. Which results in a minimum of: 3x mon servers per type, 4x osd servers per type --- so you'll need a minimum of 12 osd systems and 9 mon systems. Regards, Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Designing a cluster guide
2012/5/20 Stefan Priebe s.pri...@profihost.ag: On 20.05.2012 10:19, Christian Brunner wrote: [...] That's exactly what I thought too, but then you need a separate ceph/rbd cluster for each type. Which results in a minimum of: 3x mon servers per type, 4x osd servers per type --- so you'll need a minimum of 12 osd systems and 9 mon systems. You can arrange the storage types in different pools, so you don't need separate mon servers (this can be done by adjusting the crushmap), and you could even run multiple OSDs per server. Christian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
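A sketch of what that might look like: two CRUSH hierarchies, one per storage type, each with its own rule, and pools mapped onto the rules. Bucket and host names are hypothetical, and the syntax follows the decompiled CRUSH map format of this era:

    # node1-sata etc. would themselves be host buckets defined earlier in the map
    root sata {
        id -2
        alg straw
        hash 0
        item node1-sata weight 2.000
        item node2-sata weight 2.000
    }
    root ssd {
        id -3
        alg straw
        hash 0
        item node1-ssd weight 1.000
        item node2-ssd weight 1.000
    }
    rule sata {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take sata
        step chooseleaf firstn 0 type host
        step emit
    }
    rule ssd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
    }

A pool is then pointed at a rule, e.g. ceph osd pool create fast 128 followed by ceph osd pool set fast crush_ruleset 2, and all the tiers share one set of monitors.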
Re: Designing a cluster guide
- High-performance Block Storage (RBD): many large SATA SSDs for the storage (probably in a RAID5 config); a STEC ZeusRAM SSD drive for the journal How do you think standard SATA disks would perform in comparison to this, and is a separate journaling device really necessary? Perhaps three servers, each with 12 x 1TB SATA disks configured in RAID10, an osd on each server and three separate mon servers. Would this be suitable as the storage backend for a small OpenStack cloud, performance-wise, for instance? Regards, Tim O'Donovan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Designing a cluster guide
On 20.05.2012 at 10:56, Tim O'Donovan t...@icukhosting.co.uk wrote: - High-performance Block Storage (RBD): many large SATA SSDs for the storage (probably in a RAID5 config); a STEC ZeusRAM SSD drive for the journal How do you think standard SATA disks would perform in comparison to this, and is a separate journaling device really necessary? Perhaps three servers, each with 12 x 1TB SATA disks configured in RAID10, an osd on each server and three separate mon servers. He's talking about SSDs, not normal SATA disks. Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Designing a cluster guide
He's talking about SSDs, not normal SATA disks. I realise that. I'm looking for similar advice and have been following this thread; it didn't seem off-topic to ask here. Regards, Tim O'Donovan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Designing a cluster guide
No, sorry, I just wanted to clarify since you quoted the SSD part. Stefan On 20.05.2012 at 11:46, Tim O'Donovan t...@icukhosting.co.uk wrote: [...] -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Designing a cluster guide
Hi Greg, on 17.05.2012 23:27, Gregory Farnum wrote: It mentions for example Fast CPU for the mds system. What does fast mean? Just the speed of one core? Or is ceph designed to use multi core? Is multi core or more speed important? Right now, it's primarily the speed of a single core. The MDS is highly threaded but doing most things requires grabbing a big lock. How fast is a qualitative rather than quantitative assessment at this point, though. So would you recommend a fast (more GHz) Core i3 instead of a single Xeon for this system? (The price per GHz is better.) It depends on what your nodes look like, and what sort of cluster you're running. The monitors are pretty lightweight, but they will add *some* load. More important is their disk access patterns — they have to do a lot of syncs. So if they're sharing a machine with some other daemon you want them to have an independent disk and to be running a new kernel/glibc so that they can use syncfs rather than sync. (The only distribution I know for sure does this is Ubuntu 12.04.) Which kernel and which glibc version support this? I have searched Google but haven't found an exact version. We're using Debian Lenny/Squeeze with a custom kernel. Regarding the OSDs: is it fine to use an SSD RAID1 for the journal and perhaps 22x SATA disks in a RAID10 for the FS, or is this quite absurd and you should go for 22x SSD disks in a RAID6? You'll need to do your own failure calculations on this one, I'm afraid. Just take note that you'll presumably be limited to the speed of your journaling device here. Yeah, that's why I wanted to use a RAID1 of SSDs for the journaling. Or is this still too slow? Another idea was to use just a ramdisk for the journal, back the files up to disk while shutting down, and restore them after boot. Given that Ceph is going to be doing its own replication, though, I wouldn't want to add in another whole layer of replication with raid10 — do you really want to multiply your storage requirements by another factor of two? OK, correct, bad idea. Is it more useful to use a RAID6 HW controller or the btrfs raid? I would use the hardware controller over btrfs raid for now; it allows more flexibility in eg switching to xfs. :) OK, but overall you would recommend running one osd per disk, right? So instead of using a RAID6 with for example 10 disks you would run 6 osds on this machine? Use single socket Xeon for the OSDs or Dual Socket? Dual socket servers will be overkill given the setup you're describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD daemon. You might consider it if you decided you wanted to do an OSD per disk instead (that's a more common configuration, but it requires more CPU and RAM per disk and we don't know yet which is the better choice). Is there also a rule of thumb for the memory? My biggest problem with ceph right now is the awfully slow speed when doing random reads and writes. Sequential reads and writes are at 200MB/s (that's pretty good for bonded dual Gbit/s), but random reads and writes are only at 0.8-1.5MB/s, which is definitely too slow. Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
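To answer the version question: the syncfs() system call was added in Linux 2.6.39, and the glibc wrapper for it appeared in glibc 2.14, which is why Ubuntu 12.04 (3.2 kernel, glibc 2.15) qualifies. A quick check on any node:

    # syncfs(2) needs kernel >= 2.6.39 and glibc >= 2.14
    uname -r                  # kernel version
    ldd --version | head -n1  # glibc version

A custom kernel newer than 2.6.39 on Squeeze won't help by itself, since stock Squeeze ships glibc 2.11, which lacks the wrapper.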
Re: Designing a cluster guide
Hi, for your journal, if you have money, you can use the STEC ZeusRAM SSD drive (around 2000€ / 8GB / 10 iops read/write with 4k blocks). I'm using them with a ZFS SAN; they rock for journals. http://www.stec-inc.com/product/zeusram.php Another interesting product is the DDRdrive: http://www.ddrdrive.com/ ----- Original message ----- From: Stefan Priebe s.pri...@profihost.ag To: Gregory Farnum g...@inktank.com Cc: ceph-devel@vger.kernel.org Sent: Saturday 19 May 2012 10:37:01 Subject: Re: Designing a cluster guide [...] -- Alexandre Derumier Systems Engineer Phone: 03 20 68 88 90 Fax: 03 20 68 90 81 45 Bvd du Général Leclerc 59100 Roubaix - France 12 rue Marivaux 75002 Paris - France -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Designing a cluster guide
Sorry this got left for so long... On Thu, May 10, 2012 at 6:23 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi, the Designing a cluster guide http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it still leaves some questions unanswered. It mentions for example Fast CPU for the mds system. What does fast mean? Just the speed of one core? Or is ceph designed to use multi core? Is multi core or more speed important? Right now, it's primarily the speed of a single core. The MDS is highly threaded but doing most things requires grabbing a big lock. How fast is a qualitative rather than quantitative assessment at this point, though. The Cluster Design Recommendations page mentions separating all daemons onto dedicated machines. Is this also useful for the MON? As they're so lightweight, why not run them on the OSDs? It depends on what your nodes look like, and what sort of cluster you're running. The monitors are pretty lightweight, but they will add *some* load. More important is their disk access patterns — they have to do a lot of syncs. So if they're sharing a machine with some other daemon you want them to have an independent disk and to be running a new kernel/glibc so that they can use syncfs rather than sync. (The only distribution I know for sure does this is Ubuntu 12.04.) Regarding the OSDs: is it fine to use an SSD RAID1 for the journal and perhaps 22x SATA disks in a RAID10 for the FS, or is this quite absurd and you should go for 22x SSD disks in a RAID6? You'll need to do your own failure calculations on this one, I'm afraid. Just take note that you'll presumably be limited to the speed of your journaling device here. Given that Ceph is going to be doing its own replication, though, I wouldn't want to add in another whole layer of replication with raid10 — do you really want to multiply your storage requirements by another factor of two? Is it more useful to use a RAID6 HW controller or the btrfs raid? I would use the hardware controller over btrfs raid for now; it allows more flexibility in eg switching to xfs. :) Use single socket Xeon for the OSDs or Dual Socket? Dual socket servers will be overkill given the setup you're describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD daemon. You might consider it if you decided you wanted to do an OSD per disk instead (that's a more common configuration, but it requires more CPU and RAM per disk and we don't know yet which is the better choice). -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
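For the one-OSD-per-disk layout mentioned above, the old-style config simply declares one section per daemon on the host. A minimal sketch, with hypothetical hostnames and device paths:

    ; hypothetical node1 with one ceph-osd per disk,
    ; data on xfs filesystems, journals on partitions of a shared SSD
    [osd.0]
        host = node1
        osd data = /srv/osd.0
        osd journal = /dev/disk/by-partlabel/journal-0
    [osd.1]
        host = node1
        osd data = /srv/osd.1
        osd journal = /dev/disk/by-partlabel/journal-1

By the 1GHz-per-daemon rule of thumb, a 12-disk node run this way wants roughly 12GHz of aggregate CPU, e.g. two 6-core 2GHz sockets with headroom to spare, plus RAM for each daemon.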