RE: Designing a cluster guide
Interesting. I've been thinking about this, and I think most Ceph installations could benefit from more nodes and fewer disks per node.

For example, say we have a replica level of 2 and an RBD block size of 4 MB. You start writing a 10 GB file, which is effectively divided into 4 MB chunks. The first chunk goes to node 1 and node 2 (at the same time, I assume), where it is written to a journal and then replayed to the data file system. The second chunk might be sent to nodes 2 and 3 at the same time, journaled, then replayed (we now have overlap from chunk 1). The third chunk might be sent to nodes 1 and 3 (more overlap, from chunks 1 and 2), and as you can see, this quickly becomes an issue. So if we have 10 nodes vs. 3 nodes with the same amount of disks, we should see better write and read performance, as there would be less overlap.

Now take BTRFS into the picture. As I understand it, journals are not necessary due to the way it writes/snapshots and reads data, so this alone would be a major performance increase on a BTRFS RAID level (like ZFS RAIDZ).

Side note: this may sound crazy, but the more I read about SSDs, the less I wish to use or rely on them, and RAM SSDs are crazily priced, IMO. =)

Regards,
Quenten

-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Slawomir Skowron
Sent: Tuesday, 22 May 2012 3:52 PM
To: Quenten Grasso
Cc: Gregory Farnum; ceph-devel@vger.kernel.org
Subject: Re: Designing a cluster guide

I have some performance numbers from an RBD cluster: near 320 MB/s on a VM from a 3-node cluster, but with 10GbE and 26 2.5" SAS drives used on every machine, that is not everything that can be had. Every OSD drive is a RAID0 of one drive, behind battery-backed NVRAM cache in a hardware RAID controller. Every OSD takes a lot of RAM for caching.
That's why I'm thinking about swapping 2 drives for SSDs in RAID1, with HPA tuned to increase drive durability, for journaling - if this works ;) With the newest drives you can theoretically get 500 MB/s at a long queue depth. This means I could in theory improve the bandwidth numbers, get lower latency, and better handle multiple IO writes from many hosts.

Reads are cached in RAM by the OSD daemon, by the VFS in the kernel, and by NVRAM in the controller, and in the near future will improve with the cache in KVM (I need to test that; it should improve performance). But if the SSD drive slows down, it can drag whole write performance down. It is very delicate.

Regards,
iSS

On 22 May 2012, at 02:47, Quenten Grasso qgra...@onq.com.au wrote:

I should have added: for storage I'm considering something like enterprise nearline SAS 3 TB disks, run as individual disks (not RAIDed) with a rep level of 2, as suggested :)

Regards,
Quenten

-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Quenten Grasso
Sent: Tuesday, 22 May 2012 10:43 AM
To: 'Gregory Farnum'
Cc: ceph-devel@vger.kernel.org
Subject: RE: Designing a cluster guide

Hi Greg, I'm only talking about journal disks, not storage. :)

Regards,
Quenten

-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Tuesday, 22 May 2012 10:30 AM
To: Quenten Grasso
Cc: ceph-devel@vger.kernel.org
Subject: Re: Designing a cluster guide

On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso qgra...@onq.com.au wrote:

Hi All, I've been thinking about this issue myself for the past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146 GB disks in RAID10 inside a 2U server, with JBODs attached to the server for the actual storage.
Can someone help clarify this one? Once the data is written to the journal disk, then read from the journal disk and written to the storage disk, and once this is complete - is that when it is considered a successful write by the client? Or is it considered successful as soon as the data is written to the journal disk?

This one: the write is considered safe once it is on-disk on all OSDs currently responsible for hosting the object.

Every time anybody mentions RAID10, I have to remind them of the storage amplification that entails, though. Are you sure you want that on top of (well, underneath, really) Ceph's own replication?

Or, once the data is written to the journal disk and to the storage disk at the same time, is it considered a successful write by the client once both complete? (If this is the case, SSDs may not be so useful.)

Pros: quite fast write throughput to the journal disks; no write wear-out of SSDs; RAID10 with a 1 GB cache controller also helps improve things (if really keen, you could use CacheCade as well).

Cons: not as fast as SSDs; more rackspace required per server.

Regards,
Quenten

-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of
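Quenten's overlap argument at the top of this thread can be sketched with a toy model. To be clear, this is pure illustration: placement here is uniform random choice rather than CRUSH, and the function name and numbers are mine, not from the thread.

```python
import random

def busiest_node_chunks(num_nodes, file_gb=10, chunk_mb=4, replicas=2, seed=1):
    """Toy model (NOT CRUSH): place each chunk of a file on `replicas`
    distinct pseudo-randomly chosen nodes, then report how many chunk
    writes land on the single most-loaded node."""
    num_chunks = file_gb * 1024 // chunk_mb          # 10 GB -> 2560 x 4 MB chunks
    rng = random.Random(seed)
    load = [0] * num_nodes
    for _ in range(num_chunks):
        for node in rng.sample(range(num_nodes), replicas):
            load[node] += 1
    return max(load)

# Same total number of disks, spread over more nodes: the busiest node
# receives far fewer chunk writes, i.e. less of the overlap described above.
print(busiest_node_chunks(3), busiest_node_chunks(10))
```

With 3 nodes every node participates in roughly two thirds of all chunk writes; with 10 nodes the busiest node sees only about a fifth of them, which is the intuition behind "more nodes, fewer disks per node".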
OSD deadlock with cephfs client and OSD on same machine
Hello again!

On Linux, if you run an OSD on an ext4 filesystem, have a cephfs kernel client mount on the same system, and have no syncfs system call (as is to be expected with libc6 before 2.14 or a kernel before 2.6.39), the OSD deadlocks in sys_sync(). Only a reboot recovers the system.

After some investigation in the code, this is what I found:

In src/common/sync_filesystem.h, the function sync_filesystem() first tries a syncfs() (not available), then a btrfs ioctl sync (not available on non-btrfs), then finally a sync(). sys_sync tries to sync all filesystems, including the journal device, the OSD storage area and the cephfs mount. Under some load, when the OSD calls sync(), the cephfs sync waits for the local OSD, which already waits for its storage to sync, which the kernel wants to do after the cephfs sync. Deadlock.

The function sync_filesystem() is called by FileStore::sync_entry() in src/os/FileStore.cc, but only on non-btrfs storage and only if filestore_fsync_flushes_journal_data is false. After forcing this to true in the OSD config, our test cluster survived three days of heavy load (and is still running fine) instead of deadlocking all nodes within an hour. Reproduced with 0.47.2 and kernel 3.2.18, but the related code seems unchanged in current master.

Conclusion: If you want to run an OSD and a cephfs kernel client on the same Linux server and have a libc6 before 2.14 (e.g. Debian's newest, in experimental, is 2.13) or a kernel before 2.6.39, either do not use ext4 (but btrfs is still unstable) or risk data loss through missed syncs via the workaround of forcing filestore_fsync_flushes_journal_data to true.

Please consider putting out a fat warning at least at build time if syncfs() is not available, e.g. "No syncfs() syscall, please expect a deadlock when running osd on non-btrfs together with a local cephfs mount." Even better would be a quick runtime test for missing syncfs() and storage on non-btrfs that spits out a warning if a deadlock is possible.
As a side effect, the experienced lockup seems to be a good way to reproduce the long-standing bug 1047 - when our cluster tried to recover, all MDS instances died with those symptoms. It seems that a partial sync of the journal or data partition causes that broken state.

Amon Ott
--
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de
Amtsgericht Charlottenburg, HRB 84946
Geschäftsführer: Dipl.-Kfm. Holger Maczkowsky, Roman Maczkowsky
GnuPG-Key-ID: 0x2DD3A649
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
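The fallback chain Amon describes in sync_filesystem() can be sketched as follows. This is a Python paraphrase of the C++ logic for illustration only; the function name is mine, and the btrfs ioctl step is only noted in a comment.

```python
import ctypes
import os

def sync_osd_store(fd):
    """Illustration of the fallback chain in src/common/sync_filesystem.h:
    prefer syncing only the filesystem behind `fd`; fall back to a global
    sync(2).  The global fallback is the dangerous step: sync(2) also syncs
    a local cephfs mount, which is what creates the deadlock cycle."""
    libc = ctypes.CDLL(None, use_errno=True)
    # 1. syncfs(2): sync only the filesystem containing fd
    #    (needs kernel >= 2.6.39 and glibc >= 2.14).
    if hasattr(libc, "syncfs") and libc.syncfs(fd) == 0:
        return "syncfs"
    # 2. (On btrfs, Ceph tries the BTRFS_IOC_SYNC ioctl here; omitted.)
    # 3. Last resort: sync(2) flushes *every* mounted filesystem,
    #    including a cephfs mount on the same host.
    libc.sync()
    return "sync"
```

On a system with syncfs(2) this returns at step 1 and never touches other filesystems; on older glibc/kernel combinations it always falls through to the global sync(), which is exactly the situation Amon hit.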
RE: distributed cluster
This is also something I'm very interested in, from a power-outage or other data-centre-issue perspective. I assume the main issue here would be our friend latency; however, there is a bloke on the mailing list who is currently running a 2-site cluster setup as well.

I've been thinking about a setup with replica level 2 (1 replica per site). With the sites only 2-3 km apart, latency shouldn't be much of an issue, but the obvious bottleneck will be the 10GbE link between the sites, and split-brain isn't an issue if the RBD volume is only mounted at a single site anyway. If the data is sitting on a BTRFS/ZFS RAID (or RAID6 until BTRFS is ready), this would be a reasonable level of risk. As for data integrity/availability with only 2 replicas: the likelihood of having a complete server failure and a link outage at the same time would be fairly minimal.

Regards,
Quenten

-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Jimmy Tang
Sent: Monday, 28 May 2012 11:48 PM
To: Jerker Nyberg
Cc: ceph-devel@vger.kernel.org
Subject: Re: distributed cluster

Hi All,

On 28 May 2012, at 12:28, Jerker Nyberg wrote:

This may not really be a subject for the ceph-devel mailing list but rather for a potential ceph-users. I hope it is ok to write here. I would like to discuss whether it sounds reasonable to run a Ceph cluster distributed over a metro (city) network.

Let us assume we have a couple of sites distributed over a metro network with at least gigabit interconnect. The demands for storage capacity and speed at our sites are increasing, together with the demands for reasonably stable storage. Might Ceph be part of a solution?

One idea is to set up Ceph distributed over this metro network. A public service network is announced at all sites, anycasted from the storage SMB/NFS/RGW(?)-to-Ceph gateway (for stateless connections). Stateful connections (iSCSI?)
have to contact the individual storage gateways, and redundancy is handled at the application level (dual path). Ceph kernel clients contact the storage servers directly. Hopefully this means that clients at sites with a storage gateway will contact it. Clients at a site without a local storage gateway, or when the local gateway is down, will contact a storage gateway at another site. Hopefully not all power and network in the whole city will go down at once!

Does this sound reasonable? It should be easy to scale up with more storage nodes with Ceph. Or is it better to put all servers in the same server room?

(ASCII diagram in the original: Internet -> routers -> metro network -> the sites, each behind a router R, with servers Ceph1 through Ceph4 at the sites.)

I'm also interested in this type of use case; I would be interested in running a Ceph cluster across a metropolitan area network. Has anyone tried running Ceph in a WAN/MAN environment across a city/state/country?

Regards,
Jimmy Tang

--
Senior Software Engineer, Digital Repository of Ireland (DRI)
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/
Re: poor OSD performance using kernel 3.4
Am 29.05.2012 05:54, schrieb Alexandre DERUMIER:

This happens with ext4 or btrfs too.

maybe this is related to the io scheduler? have you compared the cfq, deadline and noop schedulers?

This is something I consider for performance tuning later on, when everything is running smoothly. Right now I'm using CFQ with the tuned IBM settings (which Proxmox uses too).

Here are some outputs of basic fio tests running on 3.4 and 3.0:
3.4: http://pastebin.com/raw.php?i=6GEKsCYH
3.0: http://pastebin.com/raw.php?i=FU4AtUck

Strangely, 3.4 is faster, but this corresponds to the fact that normal disk I/O is working fine with 3.4. It's just Ceph which isn't working fine.

also, what is your SAS/SATA controller?

Intel onboard SATA controller in this test setup.

Stefan
Re: poor OSD performance using kernel 3.4
It would be really nice if somebody from Inktank could comment on this whole situation.

Thanks!
Stefan

Am 29.05.2012 05:54, schrieb Alexandre DERUMIER:

This happens with ext4 or btrfs too.

maybe this is related to the io scheduler? have you compared the cfq, deadline and noop schedulers? noop should be fast with ssd.

also, what is your SAS/SATA controller?

----- Mail original -----
De: Stefan Priebe s.pri...@profihost.ag
À: Alexandre DERUMIER aderum...@odiso.com
Cc: ceph-devel@vger.kernel.org, Mark Nelson mark.nel...@inktank.com
Envoyé: Lundi 28 Mai 2012 21:48:34
Objet: Re: poor OSD performance using kernel 3.4

Am 28.05.2012 08:52, schrieb Alexandre DERUMIER:

I think the filestore journal is parallel only with btrfs. Other filesystems are writeahead.

... you might be right, but I can't change Ceph's implementation.

See my schema - I think you see parallel writes because you see the flush of the first wave's write to disk at the same time as the second wave's write to the journal.

Yes, I fully understand and agree - but this should still at least result in a constant bandwidth near the max of the underlying disk.

I totally agree with you, but this is just a test setup, AND if you have a big log file to copy, let's say 100 GB, your journal will never be big enough, and the speed should never drop to 0 MB/s. Also, I see the correct behaviour with 3.0.X, where the speed is maxed to the underlying device. So I still see no reason why with 3.4 the speed drops to 0 MB/s and is mostly 10-20 MB/s instead of 130 MB/s.

Maybe something is wrong with 3.4, and your disks write more slowly. (xfs bug, sata controller driver bug, ...)

This happens with ext4 or btrfs too. Sequential write speed to the FS is exactly the same under 3.0 and 3.4 using oflag=direct.

3.4:
1+0 records in
1+0 records out
10485760000 bytes (10 GB) copied, 41,4899 s, 253 MB/s

3.0:
1+0 records in
1+0 records out
10485760000 bytes (10 GB) copied, 40,861 s, 257 MB/s

maybe some local benchmark of your ssd with 3.4 can give some tips? How many disks (7.2K) do you have per OSD?
One Intel 520 SSD per OSD.

I see some benchmarks on the internet of about 150-300 MB/s (depending on the blocksize).

bench OSD shows around 260 MB/s

ceph osd tell X bench shows me a speed of 260 MB/s under both kernels, which corresponds to the dd result above.

Something must be wrong. Doing local benchmarks can really help, I think. You can use sysbench-tools (https://github.com/tsuna/sysbench-tools); it makes benchmark comparisons with nice graphs.

Thanks, hopefully I'll find something.

Stefan
Re: poor OSD performance using kernel 3.4
fio benchmarks will give you raw device performance, bypassing the filesystem. So maybe the problem is in xfs or the Linux VFS layer. I think you need to bench the filesystem to compare performance.

----- Mail original -----
De: Stefan Priebe - Profihost AG s.pri...@profihost.ag
À: Alexandre DERUMIER aderum...@odiso.com
Cc: ceph-devel@vger.kernel.org, Mark Nelson mark.nel...@inktank.com
Envoyé: Mardi 29 Mai 2012 10:22:34
Objet: Re: poor OSD performance using kernel 3.4

Am 29.05.2012 05:54, schrieb Alexandre DERUMIER:

This happens with ext4 or btrfs too.

maybe this is related to the io scheduler? have you compared the cfq, deadline and noop schedulers?

This is something I consider for performance tuning later on, when everything is running smoothly. Right now I'm using CFQ with the tuned IBM settings (which Proxmox uses too).

Here are some outputs of basic fio tests running on 3.4 and 3.0:
3.4: http://pastebin.com/raw.php?i=6GEKsCYH
3.0: http://pastebin.com/raw.php?i=FU4AtUck

Strangely, 3.4 is faster, but this corresponds to the fact that normal disk I/O is working fine with 3.4. It's just Ceph which isn't working fine.

also, what is your SAS/SATA controller?

Intel onboard SATA controller in this test setup.

Stefan

--
Alexandre Derumier
Ingénieur Système
Fixe : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France
Re: poor OSD performance using kernel 3.4
On 29/05/2012 11:46, Stefan Priebe - Profihost AG wrote:

It would be really nice if somebody from Inktank could comment on this whole situation.

Hello. I think I have the same bug.

My setup is 8 OSD nodes, 3 MDS (1 active), 3 MON. All my machines are Debian, using a custom 3.4.0 kernel. Ceph is 0.47.2-1~bpo60+1 (Debian package).

root@label5:~# rados -p data bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        99        83     331.9       332  0.059756 0.0946512
    2      16       141       125   249.946       168  0.049822  0.212338
    3      16       166       150   199.963       100  0.057352  0.257179
    4      16       227       211   210.965       244  0.043592  0.265005
    5      16       257       241   192.767       120  0.040883  0.276718
    6      16       260       244   162.641        12   1.59593  0.293439
    7      16       319       303   173.118       236  0.056913  0.357856
    8      16       348       332   165.976       116  0.052954  0.332424
    9      16       348       332   147.535         0         -  0.332424
   10      16       472       456   182.374       248  0.038543  0.343745
   11      16       485       469   170.522        52  0.040475  0.347328
   12      16       485       469   156.312         0         -  0.347328
   13      16       517       501   154.133        64  0.047759  0.378595
   14      16       562       546    155.98       180  0.042814  0.395036
   15      16       563       547   145.847         4  0.045834  0.394398
   16      16       563       547   136.732         0         -  0.394398
   17      16       563       547   128.689         0         -  0.394398
   18      16       667       651   144.648   138.667   0.06501  0.440847
   19      16       703       687   144.613       144  0.040772  0.421935
min lat: 0.030505 max lat: 5.05834 avg lat: 0.421935
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20      16       703       687   137.382         0         -  0.421935
   21      16       704       688   131.031         2   2.65675  0.425184
   22      14       704       690   125.439         8   3.26857  0.433417
Total time run:        22.042041
Total writes made:     704
Write size:            4194304
Bandwidth (MB/sec):    127.756
Average Latency:       0.498932
Max latency:           5.05834
Min latency:           0.030505

What puzzles me is when I test with the rbd pool instead:

root@label5:~# rados -p rbd bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16       191       175   699.782       700  0.236737 0.0841979
    2      16       397       381   761.837       824  0.065643 0.0813094
    3      16       602       586   781.193       820   0.07921 0.0808584
    4      16       815       799    798.88       852  0.066597 0.0785906
    5      16      1026      1010   807.885       844   0.10364 0.0785475
    6      16      1249      1233   821.886       892  0.069324 0.0773951
    7      16      1461      1445   825.608       848  0.053176 0.0770628
    8      16      1680      1664   831.895       876   0.09612 0.0765263
    9      16      1897      1881   835.891       868  0.100736 0.0761617
   10      16      2105      2089   835.491       832  0.114913 0.0761897
   11      16      2329      2313   840.983       896  0.042009 0.0758589
   12      16      2553      2537   845.559       896   0.07017 0.0754364
   13      16      2786      2770   852.203       932  0.066365 0.0749136
   14      16      3009      2993   855.041       892   0.06491 0.0746046
   15      16      3228      3212   856.431       876   0.05698 0.0745573
   16      16      3437      3421   855.148       836  0.062162 0.0746339
   17      16      3652      3636   855.428       860  0.140451  0.074534
   18      16      3878      3862   858.121       904  0.081505 0.0743125
   19      16      4106      4090   860.952       912  0.079922 0.0742146
min lat: 0.032342 max lat: 0.63151 avg lat: 0.0741575
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20      16      4324      4308   861.495       872   0.06199 0.0741575
Total time run:        20.102264
Total writes made:     4325
Write size:            4194304
Bandwidth (MB/sec):    860.600
Average Latency:       0.0743131
Max latency:           0.63151
Min latency:           0.032342

As you can see, much more stable bandwidth with this pool. I understand the data and rbd pools probably don't use the same internals, but is this difference expected?

disclaimer: By no means am I a Ceph expert; I'm just experimenting with it, and still don't
Re: NFS re-exporting CEPH cluster
Greg Farnum greg at inktank.com writes:

Have you tried something and it failed? Or are you looking for suggestions? If the former, please report the failure. :) If the latter: http://ceph.com/wiki/Re-exporting_NFS
-Greg

Greg, I have tried the link. But my production build (t_make) is failing on the NFS-exported ceph_cluster, whereas it runs fine over another NFS directory coming from an NFS server. Is Ceph 100% compatible with NFS?

Thanks
__M
Re: poor OSD performance using kernel 3.4
Am 29.05.2012 15:01, schrieb Alexandre DERUMIER:

fio benchmarks will give you raw device performance, bypassing the filesystem. So maybe the problem is in xfs or the Linux VFS layer. I think you need to bench the filesystem to compare performance.

Here is another test, with bonnie, which shows the same: http://pastebin.com/raw.php?i=fGTt4NLi

Stefan
Re: poor OSD performance using kernel 3.4
Am 29.05.2012 15:39, schrieb Yann Dupont:

On 29/05/2012 11:46, Stefan Priebe - Profihost AG wrote:

It would be really nice if somebody from Inktank could comment on this whole situation.

Hello. I think I have the same bug: my setup is 8 OSD nodes, 3 MDS (1 active), 3 MON. All my machines are Debian, using a custom 3.4.0 kernel. Ceph is 0.47.2-1~bpo60+1 (Debian package).

That sounds absolutely like the same issue. Sadly, nobody from Inktank has replied to these problems in the last few days.

As you can see, much more stable bandwidth with this pool.

That's pretty strange...

I understand the data and rbd pools probably don't use the same internals, but is this difference expected?

There must be differences in pool handling.

Stefan
Re: OSD deadlock with cephfs client and OSD on same machine
On Tue, 29 May 2012, Amon Ott wrote:

Hello again!

On Linux, if you run an OSD on an ext4 filesystem, have a cephfs kernel client mount on the same system, and have no syncfs system call (as is to be expected with libc6 before 2.14 or a kernel before 2.6.39), the OSD deadlocks in sys_sync(). Only a reboot recovers the system.

After some investigation in the code, this is what I found:

In src/common/sync_filesystem.h, the function sync_filesystem() first tries a syncfs() (not available), then a btrfs ioctl sync (not available on non-btrfs), then finally a sync(). sys_sync tries to sync all filesystems, including the journal device, the OSD storage area and the cephfs mount. Under some load, when the OSD calls sync(), the cephfs sync waits for the local OSD, which already waits for its storage to sync, which the kernel wants to do after the cephfs sync. Deadlock.

The function sync_filesystem() is called by FileStore::sync_entry() in src/os/FileStore.cc, but only on non-btrfs storage and only if filestore_fsync_flushes_journal_data is false. After forcing this to true in the OSD config, our test cluster survived three days of heavy load (and is still running fine) instead of deadlocking all nodes within an hour. Reproduced with 0.47.2 and kernel 3.2.18, but the related code seems unchanged in current master.

Conclusion: If you want to run an OSD and a cephfs kernel client on the same Linux server and have a libc6 before 2.14 (e.g. Debian's newest, in experimental, is 2.13) or a kernel before 2.6.39, either do not use ext4 (but btrfs is still unstable) or risk data loss through missed syncs via the workaround of forcing filestore_fsync_flushes_journal_data to true.

Note that filestore_fsync_flushes_journal_data should only be set to true with ext3 and the 'data=ordered' or 'data=journal' mount option. It is an implementation artifact only that fsync() will flush all previous writes.

Please consider putting out a fat warning at least at build time if syncfs() is not available, e.g.
"No syncfs() syscall, please expect a deadlock when running osd on non-btrfs together with a local cephfs mount." Even better would be a quick runtime test for missing syncfs() and storage on non-btrfs that spits out a warning if a deadlock is possible.

I think a runtime warning makes more sense; nobody will see the build-time warning (e.g., those who installed debs).

As a side effect, the experienced lockup seems to be a good way to reproduce the long-standing bug 1047 - when our cluster tried to recover, all MDS instances died with those symptoms. It seems that a partial sync of the journal or data partition causes that broken state.

Interesting! If you could also note on that bug what the metadata workload was (what was making hard links?), that would be great!

Thanks-
sage
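For reference, Amon's workaround would look something like the following in ceph.conf. This is a sketch of the setting under discussion, not a recommendation; per Sage's caveat above it is only safe where fsync() flushing all previous writes is a guaranteed implementation artifact (ext3 with data=ordered or data=journal):

```
[osd]
    ; WARNING: on ext4 this only masks the missing syncfs() and
    ; risks data loss through skipped syncs.
    filestore fsync flushes journal data = true
```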
Re: OSD deadlock with cephfs client and OSD on same machine
On Tue, May 29, 2012 at 12:44 AM, Amon Ott a@m-privacy.de wrote:

On Linux, if you run an OSD on an ext4 filesystem, have a cephfs kernel client mount on the same system and no syncfs system call (as is to be expected with libc6 before 2.14 or a kernel before 2.6.39), the OSD deadlocks in sys_sync(). Only a reboot recovers the system.

This is the classic issue of relieving memory pressure itself requiring free memory. While syncfs(2) may make the hang less common, I do not think having syncfs(2) is enough; nothing short of a reserved memory pool guaranteed to be big enough to handle the request will be, and maintaining that solution is hideously complex. Loopback NFS suffers from the exact same thing.

Apparently using ceph-fuse is enough to move so much of the processing to user space that the pageability of userspace memory allows the system to recover.

Here's a fragment of the earlier conversation on this topic. Apologies for gmane/mail clients breaking the thread; anything with that subject line is part of the conversation: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/1673
Re: OSD per disk.
On Mon, May 28, 2012 at 2:34 AM, Alexandre DERUMIER aderum...@odiso.com wrote:

maybe try

[osd.0]
    host = testnode
    osd data = /data/osd0
    osd journal = /data/osd0/osd0journal
    osd journal size = 1000

That shouldn't be needed. osd.0 will happily read the [osd] section in the config file.
Re: OSD per disk.
On Mon, May 28, 2012 at 2:25 AM, chandrashekhar chandub...@gmail.com wrote:

Thanks Alexandre, I created four directories in /data (osd0, osd1, osd2, osd3) and mounted them as below:

/dev/sdb1 - /data/osd1
/dev/sdc1 - /data/osd2
/dev/sdd1 - /data/osd3

But when I start Ceph, it starts the mon and mds daemons but not the osds. Please help me to get this working.

How did you create the cluster? mkcephfs? Do you see log entries in /var/log/ceph/*osd*.log? What do they say?
Re: distributed cluster
On Mon, May 28, 2012 at 4:28 AM, Jerker Nyberg jer...@update.uu.se wrote:

This may not really be a subject for the ceph-devel mailing list but rather for a potential ceph-users. I hope it is ok to write here.

It's absolutely ok to talk on this mailing list about using Ceph. We may create a separate ceph-users later on, but right now this list is where the conversation should go.

Let us assume we have a couple of sites distributed over a metro network with at least gigabit interconnect. The demands for storage capacity and speed at our sites are increasing, together with the demands for reasonably stable storage. Might Ceph be part of a solution?

Ceph was designed to work within a single data center. If parts of the cluster reside in remote locations, you essentially suffer the worst combination of their latency and bandwidth limits. A write that gets replicated to three different data centers is not complete until the data has been transferred to all three and an acknowledgement has been received. For example: with data replicated over data centers A, B and C, connected at 1 Gb/s, the fastest all of A will ever handle writes is 0.5 Gb/s -- it'll need to replicate everything to B and C over that single pipe.

I am aware of a few people building multi-dc Ceph clusters. Some have shared their network latency, bandwidth and availability numbers with me (confidentially), and at first glance their wide-area network performs better than many single-dc networks. They are far above a 1 gigabit interconnect.

I would really recommend you embark on a project like this only if you are able to understand the Ceph replication model, and can do the math for yourself and figure out what your expected service levels for Ceph operations would be. (Naturally, Inktank Professional Services will help you in your endeavors, though their first response should be "that's not a recommended setup".)

One idea is to set up Ceph distributed over this metro network.
A public service network is announced at all sites, anycasted from the storage SMB/NFS/RGW(?)-to-Ceph gateway (for stateless connections). Stateful connections (iSCSI?) have to contact the individual storage gateways, and redundancy is handled at the application level (dual path). Ceph kernel clients contact the storage servers directly.

The Ceph Distributed File System is not considered production ready yet.
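Tommi's 0.5 Gb/s figure follows from the replicating site having to forward a copy of every write to each other site over its single uplink. A rough back-of-the-envelope model (ignoring latency, acknowledgements and protocol overhead; the function name is mine):

```python
def max_site_write_gbps(link_gbps, num_sites):
    """Upper bound on the write rate one site can accept when every
    write must be forwarded to each of the other (num_sites - 1)
    replica sites over a single link of `link_gbps`."""
    return link_gbps / (num_sites - 1)

# Sites A, B, C replicating over a 1 Gb/s metro interconnect:
print(max_site_write_gbps(1.0, 3))  # 0.5
```

The same formula shows why adding a fourth replica site over the same 1 Gb/s pipe would cap writes at about 0.33 Gb/s.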
Re: Designing a cluster guide
On Tue, May 29, 2012 at 12:25 AM, Quenten Grasso qgra...@onq.com.au wrote:

So if we have 10 nodes vs. 3 nodes with the same amount of disks, we should see better write and read performance as you would have less overlap.

First of all, a typical way to run Ceph is with, say, 8-12 disks per node and an OSD per disk. That means your 3-10 node clusters actually have 24-120 OSDs on them. The number of physical machines is not really a factor; the number of OSDs is what matters. Secondly, 10-node or 3-node clusters are fairly uninteresting for Ceph. The real challenge is at the hundreds, thousands and above range.

Now we take BTRFS into the picture. As I understand it, journals are not necessary due to the way it writes/snapshots and reads data; this alone would be a major performance increase on a BTRFS RAID level (like ZFS RAIDZ).

A journal is still needed on btrfs; snapshots just enable us to write to the journal in parallel with the real write, instead of needing to journal first.
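Tommi's closing point can be phrased as commit-path timing. A simplified model (the mode names follow the FileStore terms used earlier in the thread; the millisecond figures are invented for illustration):

```python
def time_until_data_applied(journal_ms, datafs_ms, mode):
    """Simplified view of the two FileStore journaling modes:
    - 'writeahead' (non-btrfs): the write must be journaled before the
      data filesystem write starts, so the costs add up.
    - 'parallel' (btrfs): snapshots let the journal write and the data
      write run concurrently, so the slower of the two dominates."""
    if mode == "writeahead":
        return journal_ms + datafs_ms
    if mode == "parallel":
        return max(journal_ms, datafs_ms)
    raise ValueError("unknown journal mode: %s" % mode)

print(time_until_data_applied(5, 20, "writeahead"))  # 25
print(time_until_data_applied(5, 20, "parallel"))    # 20
```

Either way a journal exists; btrfs only removes the serialization between the journal write and the data write, not the journal itself.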
Re: distributed cluster
I could see a lot of use-cases for BC and DR tiers where performance may not be as much of an issue, but availability is critical above all. Most options in use today rely on some form of async replication, are in most cases quite expensive, and still do not view performance as their primary concern.

On Tue, May 29, 2012 at 9:44 AM, Tommi Virtanen t...@inktank.com wrote:

On Mon, May 28, 2012 at 4:28 AM, Jerker Nyberg jer...@update.uu.se wrote:

This may not really be a subject for the ceph-devel mailing list but rather for a potential ceph-users. I hope it is ok to write here.

It's absolutely ok to talk on this mailing list about using Ceph. We may create a separate ceph-users later on, but right now this list is where the conversation should go.

Let us assume we have a couple of sites distributed over a metro network with at least gigabit interconnect. The demands for storage capacity and speed at our sites are increasing, together with the demands for reasonably stable storage. Might Ceph be part of a solution?

Ceph was designed to work within a single data center. If parts of the cluster reside in remote locations, you essentially suffer the worst combination of their latency and bandwidth limits. A write that gets replicated to three different data centers is not complete until the data has been transferred to all three and an acknowledgement has been received. For example: with data replicated over data centers A, B and C, connected at 1 Gb/s, the fastest all of A will ever handle writes is 0.5 Gb/s -- it'll need to replicate everything to B and C over that single pipe.

I am aware of a few people building multi-dc Ceph clusters. Some have shared their network latency, bandwidth and availability numbers with me (confidentially), and at first glance their wide-area network performs better than many single-dc networks. They are far above a 1 gigabit interconnect.
I would really recommend you embark on a project like this only if you are able to understand the Ceph replication model, and do the math for yourself to figure out what your expected service levels for Ceph operations would be. (Naturally, Inktank Professional Services will help you in your endeavors, though their first response should be that's not a recommended setup.)

One idea is to set up Ceph distributed over this metro network. A public service network is announced at all sites, anycasted from the storage SMB/NFS/RGW(?)-to-Ceph gateway (for stateless connections). Stateful connections (iSCSI?) have to contact the individual storage gateways, and redundancy is handled at the application level (dual path). Ceph kernel clients contact the storage servers directly.

The Ceph Distributed File System is not considered production ready yet.
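The replication example above (data centers A, B, C on 1 Gb/s links) reduces to simple arithmetic, which is worth doing for any proposed replica count and link speed. A minimal sketch (the function name is ours, for illustration):

```python
# Sanity check of the multi-DC example above: with synchronous replication,
# a primary in data center A must forward each write to the other replicas
# over its single uplink, so client write bandwidth into A is capped at
# link_speed / (replicas - 1), before latency is even considered.

def max_write_bandwidth(link_gbps, replicas):
    """Upper bound on client write bandwidth into one DC (in Gb/s),
    when all replica-forwarding traffic shares one uplink."""
    return link_gbps / (replicas - 1)

print(max_write_bandwidth(1.0, 3))   # A, B, C at 1 Gb/s -> 0.5 Gb/s
```

Latency compounds this further: each write also waits for the slowest remote acknowledgement, so wide-area round-trip times set a floor on write latency.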
Re: poor OSD performance using kernel 3.4
On 05/29/2012 09:43 AM, Stefan Priebe - Profihost AG wrote:
Am 29.05.2012 15:39, schrieb Yann Dupont:
On 29/05/2012 11:46, Stefan Priebe - Profihost AG wrote:

It would be really nice if somebody from Inktank could comment on this whole situation.

Hello. I think I have the same bug: my setup is 8 OSD nodes, 3 MDS (1 active), 3 MON. All my machines are Debian, using a custom 3.4.0 kernel. Ceph is 0.47.2-1~bpo60+1 (Debian package).

That sounds absolutely like the same issue. Sadly nobody from Inktank has replied to these problems in the last few days.

Sorry about that, yesterday was a holiday in the US. I did some quick tests on a couple of nodes I had laying around this morning.

Distro: Oneiric (i.e. no syncfs in glibc)
Ceph: 0.46-65-gf6c5dff
1 1GbE client node
3 1GbE mon nodes
2 1GbE OSD nodes with 1 OSD on each, mounted on a 7200rpm SAS drive. btrfs with -l 64k -n 64k, mounted using noatime. H700 RAID controller with each drive in a 1-disk RAID0. Journals are partitioned on a separate drive.
/proc/version: Linux version 3.4.0-ceph (autobuild-ceph@gitbuilder-kernel-amd64)

rados -p data bench 120 write:
Total time run:        120.601286
Total writes made:     2979
Write size:            4194304
Bandwidth (MB/sec):    98.805
Average Latency:       0.647507
Max latency:           1.39966
Min latency:           0.181663

Once I get these nodes up to 0.47 and get them switched over to 10GbE I'll redo the btrfs tests and try out xfs as well with longer running tests.

As you can see, much more stable bandwidth with this pool. That's pretty strange...

Indeed, that is very strange! Can you check to see how many pgs are in each? Any difference in replication level? You can check with:

ceph osd pool get <pool> size
ceph osd pool get <pool> pg_num

I understand the data and rbd pools probably don't use the same internals, but is this difference expected? There must be differences in pool handling.
Stefan

Thanks,
Mark
Re: Appearing messages per 10 sec
On Mon, May 28, 2012 at 3:35 AM, Tomoki BENIYA ben...@bit-isle.co.jp wrote:

The following messages are appearing every 10 seconds on the terminal of mds.1, but not on mds.0. What do these mean? And how can I stop them?

Message from syslogd@mds1 at May 28 19:26:03 ...
ceph-mds: 2012-05-28 19:26:03.497958 7fede74e5700 0 mds.0.bal mds.0 mdsload[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.05 = 0 ~ 0

That looks like https://github.com/ceph/ceph/blob/master/src/mds/MDBalancer.cc#L472 which looks like a diagnostic message only output by the root MDS. That's why you only see it on one of your MDS servers. It's a harmless status message. It's logged at a fairly high priority, but you can ignore it. The decision to output it to a console is made by your syslog daemon, and that is the right place to configure that.
Re: Question regarding API doc
You are quite right. I've updated the documentation in master, commit f953c4c0b0ba69342cab52243c1b73987f7f94f6. Thanks for the info!
-Sam

On Fri, May 25, 2012 at 8:08 PM, Xiaopong Tran xiaopong.t...@gmail.com wrote:

I'm looking at the description in this API: http://ceph.com/docs/master/api/librados/#rados_objects_list_next

For the parameters entry and key, the doc said (caller must free). I looked up in the code, and found this statement in the doc a bit misleading. Is the doc outdated, or did I miss anything?

Cheers
xp
Re: Multiple named clusters on same nodes
On Thursday, May 24, 2012 at 1:58 AM, Amon Ott wrote:
On Thursday 24 May 2012, Amon Ott wrote:

Attached is a patch based on current git stable that makes mkcephfs work fine for me with --cluster name. ceph-mon uses the wrong mkfs path for mon data (default ceph instead of the supplied cluster name), so I put in a workaround. Please have a look and consider inclusion, as well as fixing the mon data path. Thanks.

And another patch for the init script to handle multiple clusters.

Amon: Thanks for the patches! Unfortunately nobody who's competent to review these (ie, not me) has time to look into them right now, but they're on the queue for when TV or Sage gets some time. :)
-Greg
Re: poor OSD performance using kernel 3.4
Le 29/05/2012 19:50, Mark Nelson a écrit :

1 1GbE client node
3 1GbE mon nodes
2 1GbE OSD nodes with 1 OSD on each, mounted on a 7200rpm SAS drive. btrfs with -l 64k -n 64k, mounted using noatime. H700 RAID controller with each drive in a 1-disk RAID0. Journals are partitioned on a separate drive.

Hello,

Forgot to mention I'm using 10GbE, and an FS using btrfs with -l 64k -n 64k, but also space_cache,compress=lzo,nobarrier,noatime. The journal is on tmpfs:

osd journal = /dev/shm/journal
osd journal size = 6144

Remember, it's not a production system for the moment. I'm just trying to evaluate the best performance I can get (and whether the system is stable enough to start alpha/pre-production services). BTW, I noticed OSDs using XFS are much, much slower than OSDs with btrfs right now, particularly in rbd tests. btrfs has some stability problems, even if it seems better with newer kernels.

/proc/version: Linux version 3.4.0-ceph (autobuild-ceph@gitbuilder-kernel-amd64)

rados -p data bench 120 write:
Total time run:        120.601286
Total writes made:     2979
Write size:            4194304
Bandwidth (MB/sec):    98.805
Average Latency:       0.647507
Max latency:           1.39966
Min latency:           0.181663

Once I get these nodes up to 0.47 and get them switched over to 10GbE I'll redo the btrfs tests and try out xfs as well with longer running tests.

As you can see, much more stable bandwidth with this pool. That's pretty strange...

Indeed, that is very strange! Can you check to see how many pgs are in each? Any difference in replication level? You can check with:

ceph osd pool get <pool> size

root@label5:~# ceph osd pool get data size
don't know how to get pool field size
root@label5:~# ceph osd pool get rbd size
don't know how to get pool field size

Is size the right name for the field?
In the wiki, size isn't listed as a valid field.

ceph osd pool get <pool> pg_num

root@label5:~# ceph osd pool get rbd pg_num
PG_NUM: 576
root@label5:~# ceph osd pool get data pg_num
PG_NUM: 576

The pg num is quite low because I started with small OSDs (9 OSDs with 200G each, internal disks) when I formatted. Now I have reduced to 8 OSDs (osd.4 is out), but with much larger (and faster) storage. 6 OSDs have 5T on them; 2 still have 200G, but they are planned to migrate before the end of the week. I try, for the moment, to keep the OSDs similar. Replication is set to 2. No OSD is full; I don't have much data stored for the moment.

Concerning the crush map, I'm not using the default one: the 8 nodes are in 3 different locations (some kilometers away). 2 are in one place, 2 in another, and the last 4 in the principal place. I try to group hosts together to avoid problems when I lose a location (an electrical problem, for example). Not sure I really customized the crush map as I should have. Here is the map:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 rack
type 3 pool

# buckets
host karuizawa {
	id -5		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 1.000
}
host hazelburn {
	id -6		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.3 weight 1.000
}
rack loire {
	id -3		# do not change unnecessarily
	# weight 2.000
	alg straw
	hash 0	# rjenkins1
	item karuizawa weight 1.000
	item hazelburn weight 1.000
}
host carsebridge {
	id -8		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.5 weight 1.000
}
host cameronbridge {
	id -9		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.6 weight 1.000
}
rack chantrerie {
	id -7		# do not change unnecessarily
	# weight 2.000
	alg straw
	hash 0	# rjenkins1
	item carsebridge weight 1.000
	item cameronbridge weight 1.000
}
host chichibu {
	id -2		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
}
host glenesk {
	id -4		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 1.000
}
host braeval {
	id -10		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.7 weight 1.000
}
host hanyu {
	id -11		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.8 weight 1.000
}
rack lombarderie {
	id -12		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item chichibu weight 1.000
	item glenesk weight 1.000
	item braeval weight 1.000
	item hanyu weight 1.000
}
pool default {
	id -1	# do
Re: poor OSD performance using kernel 3.4
Am 29.05.2012 19:50, schrieb Mark Nelson:

Once I get these nodes up to 0.47 and get them switched over to 10GbE I'll redo the btrfs tests and try out xfs as well with longer running tests.

I always test on 1GbE and see this problem no matter whether btrfs or xfs. So I think this is just a waste of time. At least my tests differ, as I see this problem on ALL pools. Mark, should I try 0.46?

Thanks,
Stefan
Re: poor OSD performance using kernel 3.4
Am 29.05.2012 19:50, schrieb Mark Nelson:

I did some quick tests on a couple of nodes I had laying around this morning.

I just noticed that I get a constant rate of 40 MB/s while using 1 thread. When I use two threads or more it drops to 0 MB/s and crazily jumping values.

~# rados -p rbd bench 90 write -t 1
Maintaining 1 concurrent writes of 4194304 bytes for at least 90 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1       1        10         9    35.994        36  0.100147  0.101133
    2       1        20        19   37.9931        40  0.096893  0.100719
    3       1        31        30   39.9921        44   0.09784 0.0999607
    4       1        41        40   39.9929        40  0.099156 0.0999003
    5       1        51        50   39.9932        40  0.098239 0.0996518
    6       1        61        60   39.9932        40  0.098682 0.0994851
    7       1        71        70   39.9933        40  0.094397  0.099184
    8       1        81        80   39.9931        40  0.099823 0.0993327
    9       1        91        90   39.9931        40  0.101013 0.0992236
   10       1       101       100    39.993        40  0.098277  0.099237

# rados -p rbd bench 90 write -t 2
Maintaining 2 concurrent writes of 4194304 bytes for at least 90 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1       2        15        13   51.9888        52    0.0956  0.115315
    2       2        22        20   39.9928        28  0.120065  0.193125
    3       2        41        39   51.9917        76   0.09557   0.15246
    4       2        58        56   55.9912        68   0.09875  0.137688
    5       2        67        65    51.992        36  0.111211  0.139465
    6       2        85        83   55.3251        72  0.136967  0.143079
    7       2       101        99   56.5625        64  0.098664  0.136263
    8       2       101        99   49.4919         0         -  0.136263
    9       2       112       110   48.8808        22  0.099479  0.160563

Stefan
Re: poor OSD performance using kernel 3.4
Le 29/05/2012 23:08, Stefan Priebe a écrit :
Am 29.05.2012 19:50, schrieb Mark Nelson:

I did some quick tests on a couple of nodes I had laying around this morning.

I just noticed that I get a constant rate of 40 MB/s while using 1 thread. When I use two threads or more it drops to 0 MB/s and crazily jumping values.

~# rados -p rbd bench 90 write -t 1
Maintaining 1 concurrent writes of 4194304 bytes for at least 90 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1       1        10         9    35.994        36  0.100147  0.101133
    2       1        20        19   37.9931        40  0.096893  0.100719
    3       1        31        30   39.9921        44   0.09784 0.0999607
    4       1        41        40   39.9929        40  0.099156 0.0999003
    5       1        51        50   39.9932        40  0.098239 0.0996518
    6       1        61        60   39.9932        40  0.098682 0.0994851
    7       1        71        70   39.9933        40  0.094397  0.099184
    8       1        81        80   39.9931        40  0.099823 0.0993327
    9       1        91        90   39.9931        40  0.101013 0.0992236
   10       1       101       100    39.993        40  0.098277  0.099237

Not here; on data:

root@label5:~# rados -p data bench 20 write -t 1
Maintaining 1 concurrent writes of 4194304 bytes for at least 20 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1       1        15        14   55.9837        56  0.096813 0.0677311
    2       1        33        32   63.9852        72  0.088802 0.0612602
    3       1        51        50   66.6529        72  0.056883 0.0594909
    4       1        60        59    58.989        36  0.046377 0.0577145
    5       1        60        59   47.1916         0         - 0.0577145
    6       1        79        78   51.9911        38  0.041831 0.0768918
    7       1        98        97    55.419        76  0.050436 0.0718439
    8       1       101       100   49.9919        12  0.043673 0.0712079
    9       1       101       100   44.4375         0         - 0.0712079
   10       1       115       114   45.5929        28  0.043768 0.0876947
   11       1       134       133    48.356        76  0.052382 0.0826428
   12       1       154       153   50.9919        80  0.042077 0.0783619
   13       1       175       174   53.5299        84  0.053474 0.0745956
   14       1       194       193   55.1339        76  0.049631 0.0724711
   15       1       211       210    55.991        68  0.052683 0.0712887
   16       1       232       231   57.7407        84  0.044341 0.0692121
   17       1       249       248   58.3436        68  0.053707 0.0684414
   18       1       258       257    57.102        36  0.086088 0.0680656
   19       1       267       266   55.9911        36  0.050902 0.0713341
min lat: 0.033395 max lat: 2.14757 avg lat: 0.0703545
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20       1       285       284   56.7909        72  0.047755 0.0703545
Total time run:        20.066134
Total writes made:     286
Write size:            4194304
Bandwidth (MB/sec):    57.011

on rbd:

Maintaining 1 concurrent writes of 4194304 bytes for at least 20 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       1         1         0         0         0         -         0
    1       1        18        17   67.9801        68  0.065869 0.0587313
    2       1        35        34   67.9842        68  0.056982 0.0580468
    3       1        55        54   71.9848        80  0.050305 0.0554721
    4       1        72        71   70.9858        68  0.039387 0.0561269
    5       1        91        90    71.986        76  0.055236 0.0554057
    6       1       109       108   71.9864        72  0.069547 0.0554112
    7       1       126       125   71.4154        68  0.049234 0.0556564
    8       1       146       145   72.4868        80  0.052302 0.0551064
    9       1       165       164   72.8758        76    0.0533 0.0548858
   10       1       184       183    73.187        76  0.041342 0.0543598
   11       1       202       201    73.078        72  0.048963 0.0544978
   12       1       218       217   72.3207        64  0.071926 0.0549402
   13       1       236       235   72.2951        72  0.055804 0.0551936
   14       1       254       253   72.2731        72  0.058315 0.0552612
   15       1       272       271   72.2541        72  0.047687 0.0552036
   16       1       290       289   72.2375        72  0.059162  0.055275
   17       1       308       307   72.2229        72  0.051991 0.0553467
   18       1       327       326    72.432        76  0.053271 0.0552114
   19       1       346       345   72.6192        76  0.058125 0.0550658
min lat:
Re: poor OSD performance using kernel 3.4
Am 29.05.2012 23:31, schrieb Yann Dupont:

On the contrary, pool data is jumping up and down, no matter how many threads are involved :) Maybe this is because the journal is too tight? Or because 2 of the 8 nodes have slower disks?

Can you try with 3.0.X? I would be really interested in what happens in this case.

Stefan
Re: poor OSD performance using kernel 3.4
On 05/29/2012 04:08 PM, Stefan Priebe wrote:
Am 29.05.2012 19:50, schrieb Mark Nelson:

I did some quick tests on a couple of nodes I had laying around this morning.

I just noticed that I get a constant rate of 40 MB/s while using 1 thread. When I use two threads or more it drops to 0 MB/s and crazily jumping values.

~# rados -p rbd bench 90 write -t 1
Maintaining 1 concurrent writes of 4194304 bytes for at least 90 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1       1        10         9    35.994        36  0.100147  0.101133
    2       1        20        19   37.9931        40  0.096893  0.100719
    3       1        31        30   39.9921        44   0.09784 0.0999607
    4       1        41        40   39.9929        40  0.099156 0.0999003
    5       1        51        50   39.9932        40  0.098239 0.0996518
    6       1        61        60   39.9932        40  0.098682 0.0994851
    7       1        71        70   39.9933        40  0.094397  0.099184
    8       1        81        80   39.9931        40  0.099823 0.0993327
    9       1        91        90   39.9931        40  0.101013 0.0992236
   10       1       101       100    39.993        40  0.098277  0.099237

When you are using 1 thread, you are hitting a ~40MB/s limit (probably networking related) before the data gets to the journal. Because (in this case) the filestore data disk can handle that throughput, everything looks nice and consistent.

# rados -p rbd bench 90 write -t 2
Maintaining 2 concurrent writes of 4194304 bytes for at least 90 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1       2        15        13   51.9888        52    0.0956  0.115315
    2       2        22        20   39.9928        28  0.120065  0.193125
    3       2        41        39   51.9917        76   0.09557   0.15246
    4       2        58        56   55.9912        68   0.09875  0.137688
    5       2        67        65    51.992        36  0.111211  0.139465
    6       2        85        83   55.3251        72  0.136967  0.143079
    7       2       101        99   56.5625        64  0.098664  0.136263
    8       2       101        99   49.4919         0         -  0.136263
    9       2       112       110   48.8808        22  0.099479  0.160563

In this case, that 40MB/s limit with 1 thread has increased. Now more data is getting fed into the journal than the filestore can write out to disk. Eventually writes stall while the data is being written out.
Stefan
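The stall Mark describes — journal ingest outpacing the filestore drain until the journal fills — can be sketched as a toy model. All the numbers here (drain rate, journal size) are assumed for illustration, not measured from this thread:

```python
# Toy full-journal stall model: the journal accepts client writes up to
# its free space plus whatever the filestore drains this second; once the
# journal fills, accepted throughput collapses toward the drain rate.

def simulate(ingest_mb_s, drain_mb_s, journal_mb, seconds):
    """Return per-second accepted client throughput (MB/s)."""
    backlog, out = 0.0, []
    for _ in range(seconds):
        accepted = min(ingest_mb_s, drain_mb_s + max(0, journal_mb - backlog))
        backlog = max(0.0, backlog + accepted - drain_mb_s)
        out.append(accepted)
    return out

# 1 thread: ingest (40 MB/s) below the assumed drain (50 MB/s) -> steady.
print(simulate(40, 50, 100, 5))
# 2 threads: ingest (80 MB/s) above drain -> the assumed 100 MB journal
# fills after a few seconds and throughput sags to the drain rate.
print(simulate(80, 50, 100, 5))
```

The second run reproduces the shape of the -t 2 trace: a few seconds of full speed, then a sag once the journal's buffering is exhausted.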
Re: RBD format changes and layering
On Fri, May 25, 2012 at 4:07 PM, Josh Durgin josh.dur...@inktank.com wrote:

To check whether children exist, you can iterate over all the pools and check the rbd_clones object in each one. Since the number of pools is relatively small, this isn't very expensive. If the pool is deleted, by definition all the children in it are deleted. With separate namespaces in the future, this will be a bit more expensive, but it's only needed at base image deletion time, which is relatively rare. Deleting the image itself already requires an I/O per object, so this is probably not the slow part anyway. Yehuda, TV, did I miss anything?

One thing: that's still racy, and we discussed a solution.

1. A: walk through all pools, look for clones, find none
2. B: create a clone
3. A: rbd unpreserve parent
4. A: rbd rm parent

Oopsie. To avoid that, I proposed a deleting flag. Clones can only be created when the parent is preserved and not deleting. Now:

1. A: rbd deleting parent
2. A: walk through all pools, look for clones, find none
3. B: attempt to create a clone, fails
...

Now, that doesn't have to be strictly deleting; going_unpreserved or something -- instead of deletion, the intended operation might be starting a VM against the parent image to e.g. add security updates. And, as we discussed, these flags would be per snapshot (or also on the master image, if you want to support that). Thus, one snapshot can be preserved while an older one is scheduled for deletion.
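The deleting-flag protocol above can be sketched in miniature. The class and field names here are invented for illustration; the real state would live in RBD's on-disk metadata and the flag transitions would need to be atomic RADOS operations:

```python
# Toy model of the race fix: setting 'deleting' BEFORE scanning for clones
# closes the window in which another client could create a clone between
# the scan and the removal.

class ParentSnap:
    def __init__(self):
        self.preserved = True   # clones allowed only while preserved
        self.deleting = False   # set first, to block new clones
        self.clones = set()

    def create_clone(self, name):
        if not self.preserved or self.deleting:
            raise RuntimeError("parent not available for cloning")
        self.clones.add(name)

    def remove(self):
        self.deleting = True        # step 1: block new clones
        if self.clones:             # step 2: scan for existing clones
            self.deleting = False   # roll back; children still exist
            raise RuntimeError("parent has children")
        return "removed"            # step 3: safe to delete

p = ParentSnap()
p.create_clone("vm-1")
try:
    p.remove()
except RuntimeError as e:
    print(e)
```

With the flag set first, a concurrent create_clone either lands before the scan (and is found) or fails against the deleting flag — there is no longer an unguarded window.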
Re: poor OSD performance using kernel 3.4
On 05/24/2012 09:10 AM, Stefan Priebe - Profihost AG wrote:

Hi list,

today while testing btrfs I discovered very poor OSD performance using kernel 3.4. The underlying FS is XFS, but it is the same with btrfs.

3.0.30:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        41        25   99.9767       100  0.586984  0.447293
    2      16        71        55   109.979       120  0.934388  0.488375
    3      16        99        83   110.647       112   1.15982  0.503111
    4      16       130       114   113.981       124   1.05952  0.516925
    5      16       159       143   114.382       116  0.149313  0.510734
    6      16       188       172   114.649       116  0.287166   0.52203
    7      16       215       199   113.697       108  0.151784  0.531461
    8      16       242       226   112.984       108  0.623478  0.539896
    9      16       265       249   110.651        92   0.50354  0.538504
   10      16       296       280   111.984       124  0.155048  0.542846
Total time run:        10.776153
Total writes made:     297
Write size:            4194304
Bandwidth (MB/sec):    110.243
Average Latency:       0.577534
Max latency:           1.85499
Min latency:           0.091473

3.4:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        40        24   95.9794        96  0.393196  0.455936
    2      16        68        52   103.983       112  0.835652  0.517297
    3      16        85        69   91.9849        68   1.00535  0.493058
    4      16        96        80   79.9869        44  0.096564  0.577948
    5      16       103        87   69.5879        28  0.092722  0.589147
    6      16       117       101   67.3216        56  0.222175  0.675334
    7      16       130       114   65.1321        52   0.15677  0.623806
    8      16       144       128   63.9896        56  0.089157   0.56746
    9      16       144       128   56.8794         0         -   0.56746
   10      16       144       128   51.1912         0         -   0.56746
   11      16       144       128   46.5373         0         -   0.56746
   12      16       144       128   42.6591         0         -   0.56746
   13      16       144       128   39.3776         0         -   0.56746
   14      16       144       128   36.5649         0         -   0.56746
   15      16       144       128   34.1272         0         -   0.56746
   16      16       145       129   32.2443       0.5   11.3422  0.650985
Total time run:        16.193871
Total writes made:     145
Write size:            4194304
Bandwidth (MB/sec):    35.816
Average Latency:       1.78467
Max latency:           14.4744
Min latency:           0.088753

Stefan

I set up some tests today to try to replicate your findings (and also check results against some previous ones I've done). I don't think I'm seeing exactly the same results as you, but I definitely see xfs performing worse than btrfs in this specific test. I've included the results here.

Distro: Ubuntu Oneiric (i.e. no syncfs in glibc)
Ceph: 0.47.2
Kernel: 3.4.0-ceph (autobuild-ceph@gitbuilder-kernel-amd64)
Network: 10GbE
1 client node
3 mon nodes
2 OSD nodes with 1 OSD each, mounted on a 7200rpm SAS drive. H700 RAID controller with each drive in a 1-disk RAID0. Journals are partitioned on a separate drive. OSD data disks are using WT cache while journals are using WB.
btrfs created with -l 64k -n 64k, mounted using noatime.
xfs created with -f -d su=64k,sw=1 -i size=2048, mounted using noatime.
rados bench invocation: rados -p data bench 300 write -t 16 -b 4194304

btrfs:
Total time run:        300.413696
Total writes made:     7582
Write size:            4194304
Bandwidth (MB/sec):    100.954
Average Latency:       0.633932
Max latency:           3.78661
Min latency:           0.065734

xfs:
Total time run:        304.435966
Total writes made:     5023
Write size:            4194304
Bandwidth (MB/sec):    65.997
Average Latency:       0.96965
Max latency:           36.4993
Min latency:           0.07516

Full results are available here: http://nhm.ceph.com/results/mailinglist-tests/

I created seekwatcher movies by running blktrace on the underlying OSD data disks during the tests. These show throughput over time, seeks/sec, and a visual representation of where the disk is being written to for each OSD. You can
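As a quick sanity check, the summary lines above are internally consistent: the bandwidth rados bench reports is just total writes made, times the 4 MB write size, divided by total run time:

```python
# Recompute the reported bandwidth from the other summary fields of each
# rados bench run above.

def bench_bandwidth(writes, write_size_mb, seconds):
    """Bandwidth in MB/s as rados bench computes it."""
    return writes * write_size_mb / seconds

print(round(bench_bandwidth(7582, 4, 300.413696), 3))  # btrfs run: 100.954
print(round(bench_bandwidth(5023, 4, 304.435966), 3))  # xfs run:   65.997
```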
Kernel crash bug status
We are still seeing a crash on 0.47.2 with 3.2.18, which seems to be this bug: http://tracker.newdream.net/issues/2260

Is anyone else seeing this problem, and/or does anyone have any ideas how to fix or work around it?