Re: [ceph-users] CEPH I/O Performance with OpenStack
> I have two ceph nodes with the following specifications:
>
> 2x CEPH OSD nodes - replication factor 2
> Model: SuperMicro X8DT3
> CPU: Dual Intel E5620
> RAM: 32G
> HDD: 2x 480GB SSD RAID-1 (OS and Journal), 22x 4TB SATA RAID-10 (OSD)
>
> 3x Controllers - CEPH Monitor
> Model: ProLiant DL180 G6
> CPU: Dual Intel E5620
> RAM: 24G
>
> If it's a hardware issue, please help find an answer to the following questions.

4TB spinners do not give a lot of IOPS, about 100 random IOPS per disk. In total that is just 1100 IOPS: 44 disks times 100 IOPS, divided by 2 for RAID-10 and divided by 2 again for the replication factor. There might be a bit of caching on the RAID controller and SSD journal, but worst case you will get just 1100 IOPS.

> 1. I need around 20TB of storage. A SuperMicro SC846TQ can take 24 disks, so I may attach 24x 960G SSD - no RAID - with 3x SuperMicro servers - replication factor 3. Or is it better to scale out and put smaller disks in more servers, such as the HP DL380p G8 / 2x Intel Xeon E5-2650, which can hold 12 disks, and attach 12x 960G SSD - no RAID - 6x OSD nodes - replication factor 3?

An OSD for an SSD can easily eat a whole CPU core, so 24 SSDs would be too much. More, smaller nodes also have the upside of a smaller impact when a node breaks. You could also look at the Supermicro 2U twin chassis: 2 servers with 12 disks each in 2U. Note that you will not get near the theoretical native performance of those combined SSDs (100k+ IOPS), but performance will be good nonetheless. There have been a few threads about that here before, so look back in the mail archives to find out more.

> 2. I'm using Mirantis/Fuel 5 for provisioning and deployment of nodes. When I attach the new Ceph OSD nodes to the environment, will the data be replicated automatically from my current old SuperMicro OSD nodes to the new servers after the deployment completes?

I don't know the specifics of Fuel and how it manages the crush map. Some of the data will end up there, but not a copy of all data, unless you specify the new servers as a new failure domain in the crush map.

> 3. I will use 2x 960G SSD in RAID 1 for the OS. Is it recommended to put the SSD journal as a separate partition on the same disk as the OS?

If you run with SSDs only, I would put the journals together with the data SSDs. It makes a lot of sense to have them on separate SSDs when your data disks are spinners (because of the speed difference and bad random-IOPS performance of spinners).

> 4. Is it safe to remove the OLD ceph nodes, while I'm currently using replication factor 2, after adding the new hardware nodes?

It is probably not safe to just turn them off (as mentioned above, it depends on the crush map failure domain layout). The safe way is to follow the documentation on how to remove an OSD: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ This will make sure the data is relocated before the OSD is removed.

> 5. Do I need RAID 1 for the journal disk? And if not, what will happen if one of the journal disks fails?

No, it is not required. Both have trade-offs. Disks that are behind the journal will become unavailable when it happens. RAID 1 is a bit easier to replace in case of a single SSD failure, but is useless if the 2 SSDs fail at the same time (e.g. due to wear). JBOD reduces the write load and wear, plus it has less impact when it does fail.

> 6. Should I use a RAID level for the drives on the OSD nodes, or is it better to go without RAID?

Without RAID usually makes for better performance. Benchmark your specific workload to be sure.
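As a concrete follow-up to the OSD-removal answer in question 4: the documented sequence looks roughly like this (a sketch only; osd.12 is a hypothetical OSD id, and you should wait for the cluster to return to HEALTH_OK between steps):

ceph osd out 12                  # start migrating PGs off the OSD
ceph -w                          # watch until recovery completes
/etc/init.d/ceph stop osd.12     # on the host holding the OSD
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12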
In general I would go for 3 replicas and no RAID.

Cheers,
Robert van Leeuwen
Re: [ceph-users] Total number PGs using multiple pools
Although the documentation is not great, and open to interpretation, there is a PG calculator here: http://ceph.com/pgcalc/. With it you should be able to simulate your use case and generate numbers based on your scenario.

On Mon, Jan 26, 2015 at 8:00 PM, Italo Santos okd...@gmail.com wrote:
> Thanks for your answer. But what I'd like to understand is whether these numbers are on a per-pool basis or a per-cluster basis. If this number is per cluster, I'll plan at cluster deploy time how many pools I'd like to have on that cluster, and their replicas.
>
> Regards,
> Italo Santos
> http://italosantos.com.br/
>
> On Saturday, January 17, 2015 at 07:04, lidc...@redhat.com wrote:
>> Here are a few values commonly used:
>> - Less than 5 OSDs: set pg_num to 128
>> - Between 5 and 10 OSDs: set pg_num to 512
>> - Between 10 and 50 OSDs: set pg_num to 4096
>> - If you have more than 50 OSDs, you need to understand the tradeoffs and calculate the pg_num value yourself
>> But I think 10 OSDs is too small for a RADOS cluster.
>>
>> From: Italo Santos okd...@gmail.com
>> Date: 2015-01-17 05:00
>> To: ceph-users ceph-users@lists.ceph.com
>> Subject: [ceph-users] Total number PGs using multiple pools
>>
>> Hello,
>> In the placement groups documentation http://ceph.com/docs/giant/rados/operations/placement-groups/ we have the message below:
>> "When using multiple data pools for storing objects, you need to ensure that you balance the number of placement groups per pool with the number of placement groups per OSD so that you arrive at a reasonable total number of placement groups that provides reasonably low variance per OSD without taxing system resources or making the peering process too slow."
>> Does this mean that, if I have a cluster with 10 OSDs and 3 pools with size = 3, each pool can have only ~111 PGs? E.g.: (100 * 10 OSDs) / 3 replicas = 333 PGs; 333 PGs / 3 pools = 111 PGs per pool.
>> I don't know if my reasoning is right... I'd be glad for any help.
>> Regards,
>> Italo Santos
>> http://italosantos.com.br/
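As a concrete sketch of the 10-OSD example above (hypothetical pool names; pg_num rounded up to the nearest power of two, 128):

ceph osd pool create pool1 128 128   # pg_num pgp_num
ceph osd pool create pool2 128 128
ceph osd pool create pool3 128 128
ceph osd dump | grep pg_num          # verify the per-pool values

That gives 3 pools x 128 PGs x 3 replicas / 10 OSDs = ~115 PGs per OSD, close to the ~100 the documentation targets.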
Re: [ceph-users] Ceph File System Question
Raj,

The note is still valid, but the filesystem is getting more stable all the time. Some people are using it, especially in an active/passive configuration with a single active MDS. If you do choose to do some testing, use the most recent stable release of Ceph and the most recent Linux kernel you can.

Thanks,
John

On Mon, Jan 26, 2015 at 11:25 PM, Jeripotula, Shashiraj shashiraj.jeripot...@verizon.com wrote:
> Hi All,
> We are planning to use the Ceph File System in our data center. I was reading the Ceph documentation, and they do not recommend this for production. Is this still valid?
> Please advise.
> Thanks
> Raj
[ceph-users] Ceph Testing
Hi All,

Is there good documentation on Ceph testing? I have the following setup done, but I am not able to find a good document to start doing the tests.

[inline image attachment: setup]

Please advise.

Thanks
Raj
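For a first functional check, the built-in benchmarks are a reasonable starting point. A minimal sequence, assuming a pool named "rbd" already exists:

rados bench -p rbd 60 write --no-cleanup   # 60s write benchmark
rados bench -p rbd 60 seq                  # sequential read of the objects just written
rados -p rbd cleanup                       # remove the benchmark objects
ceph osd perf                              # per-OSD commit/apply latencies

For block-level testing, "rbd bench-write" and fio against a mapped RBD image are the usual next steps.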
Re: [ceph-users] cephfs modification time
Hey folks,

Any update on this fix getting merged? We suspect other crashes are based on this bug.

Thanks,
Chris

On Tue, Jan 13, 2015 at 7:09 AM, Gregory Farnum g...@gregs42.com wrote:
> Awesome, thanks for the bug report and the fix, guys. :)
> -Greg
>
> On Mon, Jan 12, 2015 at 11:18 PM, 严正 z...@redhat.com wrote:
>> I tracked down the bug. Please try the attached patch.
>> Regards
>> Yan, Zheng
>>
>> On Jan 13, 2015, at 07:40, Gregory Farnum g...@gregs42.com wrote:
>>> Zheng, this looks like a kernel client issue to me, or else something funny is going on with the cap flushing and the timestamps (note how the reading client's ctime is set to an even second, while the mtime is ~.63 seconds later and matches what the writing client sees). Any ideas?
>>> -Greg
>>>
>>> On Mon, Jan 12, 2015 at 12:19 PM, Lorieri lori...@gmail.com wrote:

Hi Gregory,

$ uname -a
Linux coreos2 3.17.7+ #2 SMP Tue Jan 6 08:22:04 UTC 2015 x86_64 Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz GenuineIntel GNU/Linux

Kernel client, using `mount -t ceph ...`

core@coreos2 /var/run/systemd/system $ modinfo ceph
filename:     /lib/modules/3.17.7+/kernel/fs/ceph/ceph.ko
license:      GPL
description:  Ceph filesystem for Linux
author:       Patience Warnick patie...@newdream.net
author:       Yehuda Sadeh yeh...@hq.newdream.net
author:       Sage Weil s...@newdream.net
alias:        fs-ceph
depends:      libceph
intree:       Y
vermagic:     3.17.7+ SMP mod_unload
signer:       Magrathea: Glacier signing key
sig_key:      D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2
sig_hashalgo: sha256

core@coreos2 /var/run/systemd/system $ modinfo libceph
filename:     /lib/modules/3.17.7+/kernel/net/ceph/libceph.ko
license:      GPL
description:  Ceph filesystem for Linux
author:       Patience Warnick patie...@newdream.net
author:       Yehuda Sadeh yeh...@hq.newdream.net
author:       Sage Weil s...@newdream.net
depends:      libcrc32c
intree:       Y
vermagic:     3.17.7+ SMP mod_unload
signer:       Magrathea: Glacier signing key
sig_key:      D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2
sig_hashalgo: sha256

Ceph is installed in Ubuntu containers (same kernel):

$ dpkg -l | grep ceph
ii ceph            0.87-1trusty amd64  distributed storage and file system
ii ceph-common     0.87-1trusty amd64  common utilities to mount and interact with a ceph storage cluster
ii ceph-fs-common  0.87-1trusty amd64  common utilities to mount and interact with a ceph file system
ii ceph-fuse       0.87-1trusty amd64  FUSE-based client for the Ceph distributed file system
ii ceph-mds        0.87-1trusty amd64  metadata server for the ceph distributed file system
ii libcephfs1      0.87-1trusty amd64  Ceph distributed file system client library
ii python-ceph     0.87-1trusty amd64  Python libraries for the Ceph distributed filesystem

Reproducing the error, at machine 1:

core@coreos1 /var/lib/deis/store/logs $ > test.log
core@coreos1 /var/lib/deis/store/logs $ echo 1 > test.log
core@coreos1 /var/lib/deis/store/logs $ stat test.log
  File: 'test.log'
  Size: 2  Blocks: 1  IO Block: 4194304  regular file
  Device: 0h/0d  Inode: 1099511629882  Links: 1
  Access: (0644/-rw-r--r--)  Uid: (500/core)  Gid: (500/core)
  Access: 2015-01-12 20:05:03.000000000 +0000
  Modify: 2015-01-12 20:06:09.637234229 +0000
  Change: 2015-01-12 20:06:09.637234229 +0000
  Birth: -

At machine 2:

core@coreos2 /var/lib/deis/store/logs $ stat test.log
  File: 'test.log'
  Size: 2  Blocks: 1  IO Block: 4194304  regular file
  Device: 0h/0d  Inode: 1099511629882  Links: 1
  Access: (0644/-rw-r--r--)  Uid: (500/core)  Gid: (500/core)
  Access: 2015-01-12 20:05:03.000000000 +0000
  Modify: 2015-01-12 20:06:09.637234229 +0000
  Change: 2015-01-12 20:06:09.000000000 +0000
  Birth: -

The change time is not updated, making some tail libraries not show new content until you force the change time to be updated, for example by running touch on the file. Some tools freeze and trigger other issues in the system.

Tests, all on machine #2:
FAILED - https://github.com/ActiveState/tail
FAILED - /usr/bin/tail of a Google docker image running Debian wheezy
PASSED - /usr/bin/tail of an Ubuntu 14.04 docker image
PASSED - /usr/bin/tail of the CoreOS release 494.5.0

In machine #1 (the same machine that is writing the file) all tests pass.

On Mon, Jan 12, 2015 at 5:14 PM, Gregory Farnum g...@gregs42.com wrote:
> What versions of all the Ceph pieces are you using? (Kernel client/ceph-fuse, MDS, etc.) Can you provide more details on exactly what the program is doing on which
Re: [ceph-users] Appending to a rados object with feedback
Hi Greg,

Thanks for your feedback.

On 2015-01-27 15:38, Gregory Farnum wrote:
> On Mon, Jan 26, 2015 at 6:47 PM, Kim Vandry van...@tzone.org wrote:
>> By the way, I have a question about the class. Following the example in cls_hello.cc method record_hello, our method calls cls_cxx_stat() and yet is declared CLS_METHOD_WR, not CLS_METHOD_RD|CLS_METHOD_WR. Is stat-ing an object not considered reading it? How come the method does not need the CLS_METHOD_RD flag? I tried including that flag to see what would happen, but then my method was unable to create new objects, which we want to support with the same meaning as appending to a 0-size object. It seems that in that case Ceph asserts that the object exists before calling the method.
>
> Mmmm, this actually might be an issue. Write ops don't always force an object into a readable state before being processed, so you could read out-of-date status in some cases. :/

I see. I'll change this, then.

> I don't have the exact API calls to hand, but librados exposes versions on op completion and you can assert the version when submitting ops, too. Did you check that out?

That sounds like exactly what we should be using. I see rados_get_last_version() for reading the version, which I missed before. Unfortunately, I can't find how to assert the version when submitting an op. I'm looking at src/include/rados/librados.h in git. Maybe you or someone else can help me find it once you have the API docs to hand.

-kv
Re: [ceph-users] Appending to a rados object with feedback
On 2015-01-27 17:06, Kim Vandry wrote:
> Unfortunately, I can't find how to assert the version when submitting an op. I'm looking at src/include/rados/librados.h in git. Maybe you or someone else can help me find it once you have the API docs at hand.

Ah, never mind, I see that assert_version is found in the C++ version. I will see if I can change our client to use C++ instead of C. I'll worry about the Python client (which also can't call into C++) later.

-kv
[ceph-users] ceph as a primary storage for owncloud
Dear all,

We would like to use Ceph as a primary (object) storage for ownCloud. Did anyone already do this? I mean: is that actually possible, or am I wrong? As I understand it, I have to use radosgw in its Swift flavor, but what about the S3 flavor? I cannot find anything official, hence my question. Do you have any advice, or can you point me to some kind of documentation/how-to?

I know that maybe this is not the right place for these questions, but I also asked ownCloud's community... in the meantime, every answer is appreciated!

Thanks
Simone

--
Simone Spinelli simone.spine...@unipi.it
Università di Pisa
Direzione ICT - Servizi di Rete
PGP KEY http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xDBDA383DEA2F1F96
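If it helps, the radosgw side of this is just a user plus a Swift subuser; a sketch, with "owncloud" as a hypothetical uid (the resulting keys go into ownCloud's external-storage configuration):

radosgw-admin user create --uid=owncloud --display-name="ownCloud storage"
radosgw-admin subuser create --uid=owncloud --subuser=owncloud:swift --access=full
radosgw-admin key create --subuser=owncloud:swift --key-type=swift --gen-secret

The S3-style access/secret keys from the first command should also work with any S3-compatible client, since radosgw exposes both APIs for the same user.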
[ceph-users] Ceph and btrfs - disable copy-on-write?
When starting an OSD in a Docker container (so the volume is btrfs), we see the following output:

2015-01-24 16:48:30.511813 7f9f3d066900  0 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-osd, pid 1
2015-01-24 16:48:30.522509 7f9f3d066900  0 filestore(/var/lib/ceph/osd/ceph-0) backend btrfs (magic 0x9123683e)
2015-01-24 16:48:30.535455 7f9f3d066900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is supported and appears to work
2015-01-24 16:48:30.535519 7f9f3d066900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-01-24 16:48:30.628612 7f9f3d066900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-01-24 16:48:30.628960 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: CLONE_RANGE ioctl is supported
2015-01-24 16:48:30.629211 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: failed to create simple subvolume test_subvol: (17) File exists
2015-01-24 16:48:30.629509 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: SNAP_CREATE is supported
2015-01-24 16:48:30.630487 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: SNAP_DESTROY failed: (1) Operation not permitted
2015-01-24 16:48:30.630763 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: snaps enabled, but no SNAP_DESTROY ioctl; DISABLING
2015-01-24 16:48:30.631744 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: START_SYNC is supported (transid 67)
2015-01-24 16:48:30.639763 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: WAIT_SYNC is supported
2015-01-24 16:48:30.639914 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: removing old async_snap_test
2015-01-24 16:48:30.640178 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: failed to remove old async_snap_test: (1) Operation not permitted
2015-01-24 16:48:30.641138 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: SNAP_CREATE_V2 is supported
2015-01-24 16:48:30.641387 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: SNAP_DESTROY failed: (1) Operation not permitted
2015-01-24 16:48:30.641528 7f9f3d066900  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: failed to remove test_subvol: (1) Operation not permitted
2015-01-24 16:48:30.651029 7f9f3d066900  0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-01-24 16:48:30.651282 7f9f3d066900 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2015-01-24 16:48:30.652322 7f9f3d066900  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 19: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-01-24 16:48:30.652945 7f9f3d066900  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 19: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-01-24 16:48:30.654462 7f9f3d066900  1 journal close /var/lib/ceph/osd/ceph-0/journal

We're considering disabling copy-on-write for the directory to improve write performance. Are there any recommendations for or against this?

Thanks!
Chris
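For reference, disabling CoW on btrfs is done per-file/per-directory with chattr +C, or filesystem-wide with the nodatacow mount option; a sketch, assuming the OSD path from the log above:

# +C only affects files created after the flag is set, so set it on the
# (empty) directory before the OSD writes its data:
chattr +C /var/lib/ceph/osd/ceph-0
# or for the whole volume, e.g. in /etc/fstab:
#   /dev/sdb1  /var/lib/ceph/osd/ceph-0  btrfs  noatime,nodatacow  0 0

Note that the filestore btrfs backend relies on clone/snapshot ioctls (visible in the feature probes above), so benchmark before and after to make sure nothing regresses.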
Re: [ceph-users] Consumer Grade SSD Clusters
Hi Nick,

Agreed, I see your point: basically, once you're past the 150TBW or whatever that number may be, you're effectively just waiting for failure - but aren't we anyway? I guess it depends on your use case at the end of the day.

I wonder what the likes of Amazon, Rackspace etc. are doing in the way of SSDs; either they are buying them so cheaply per GB due to volume, or they are possibly using consumer-grade SSDs. Hmm... consumer-grade SSDs may be an interesting option: if you have decent monitoring and alerting using SMART, you should be able to see how much spare flash you have available. As suggested by Wido, using multiple brands would help remove the possible cascading-failure effect, which I guess we all should be doing anyway with our spinners. I guess we have to decide whether it's worth the extra effort in the long run vs running enterprise SSDs.

Regards,
Quenten Grasso

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Saturday, 24 January 2015 7:33 PM
To: Quenten Grasso; ceph-users@lists.ceph.com
Subject: RE: Consumer Grade SSD Clusters

Hi Quenten,

There is no real answer to your question. It really depends on how busy your storage will be, and particularly whether it is mainly reads or writes. I wouldn't pay too much attention to that SSD endurance test; whilst it's great to know that they have a lot more headroom than their official specs, you run the risk of a spectacular multiple-disk failure if you intend to run them all that high. You can probably guarantee that as one SSD starts to fail, the increase in workload to re-balance the cluster will cause failures on the rest.

I guess it really comes down to how important the availability of your data is. Whilst an average PC user might balk at the price of paying 4 times more per GB for an S3700 SSD, in the enterprise world they are still comparatively cheap. The other thing you need to be aware of is that most consumer SSDs don't have power-loss protection; again, if you are mainly doing reads and cost is more important than availability, there may be an argument to use them.

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Quenten Grasso
Sent: 24 January 2015 09:13
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Consumer Grade SSD Clusters

Hi Everyone,

Just wondering if anyone has had any experience in using consumer-grade SSDs for a Ceph cluster? I came across this article: http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte/3

They have been testing different SSDs' write endurance, and they were able to write 1PB+ to a Samsung 840 Pro 256GB which is only rated at 150TBW; of course, other SSDs failed well before 1PBW. So definitely worth a read.

So I've been thinking about using consumer-grade SSDs for OSDs and enterprise SSDs for journals. The reasoning is that enterprise SSDs are a lot faster at journaling than consumer-grade drives, plus this would effectively halve the overall write requirements on the consumer-grade disks. This could also be a cost-effective alternative to using enterprise SSDs as OSDs; however, it seems that if you're happy to use 2x replication it's a pretty good cost saving, with 3x replication not so much.
Cheers,
Quenten Grasso
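On the SMART monitoring point: wear is exposed as a vendor-specific attribute, so the exact name/ID varies. A sketch of what to poll (attribute 177 Wear_Leveling_Count on Samsung, 233 Media_Wearout_Indicator on Intel):

smartctl -A /dev/sda | egrep -i 'wear|media'
# alert when the normalized value drops below your comfort threshold

Wiring that into Nagios/Zabbix or smartd's own thresholds gives the early warning needed to stagger replacements before a cascading failure.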
[ceph-users] chattr +i not working with cephfs
Should chattr +i work with CephFS? Using Ceph v0.91 and a 3.18 kernel on the CephFS client, I tried this:

# mount | grep ceph
172.16.30.10:/ on /cephfs/test01 type ceph (name=cephfs,key=client.cephfs)
# echo 1 > /cephfs/test01/test.1
# ls -l /cephfs/test01/test.1
-rw-r--r-- 1 root root 2 Jan 27 19:09 /cephfs/test01/test.1
# chattr +i /cephfs/test01/test.1
chattr: Inappropriate ioctl for device while reading flags on /cephfs/test01/test.1

I also tried it using the FUSE interface:

# ceph-fuse -m 172.16.30.10 /cephfs/fuse01/
ceph-fuse[5326]: starting ceph client
2015-01-27 19:54:59.002563 7f6f8fbcb7c0 -1 init, newargv = 0x2ec2be0 newargc=11
ceph-fuse[5326]: starting fuse
# mount | grep ceph
ceph-fuse on /cephfs/fuse01 type fuse.ceph-fuse (rw,nosuid,nodev,allow_other,default_permissions)
# echo 1 > /cephfs/fuse01/test02.dat
# chattr +i /cephfs/fuse01/test02.dat
chattr: Invalid argument while reading flags on /cephfs/fuse01/test02.dat

Eric
Re: [ceph-users] CEPH I/O Performance with OpenStack
Thanks Robert for your response.

I'm considering giving 600G 15K SAS disks a try before moving to SSD. They should give ~175 IOPS per disk. Do you think performance will be better if I go with the following setup?

4x OSD nodes
2x SSD - RAID 1 for OS and Journal
10x 600G SAS 15K - no RAID
Replication factor 2.

Regarding the IOPS calculation you did for the 4TB disks: please clarify whether the 1100 IOPS is for one node, and cluster IOPS = $number_of_nodes x $IOPS_per_node? If this formula is correct, that being said, the 4TB cluster - my current setup - should give 2200 IOPS in total, and the new SAS setup should give 3500 IOPS? Please correct me if I understand this wrong.

Thanks in advance,

On Tue, Jan 27, 2015 at 3:30 PM, Robert van Leeuwen robert.vanleeu...@spilgames.com wrote:
> [snip - full reply quoted above]
Re: [ceph-users] How to do maintenance without falling out of service?
On Wed, Jan 21, 2015 at 5:53 PM, Gregory Farnum g...@gregs42.com wrote:
> Depending on how you configured things it's possible that the min_size is also set to 2, which would be bad for your purposes (it should be at 1).

This was exactly the problem. Setting min_size=1 (which I believe used to be the default; it looks like it changed almost exactly when we set this cluster up) got things back on track for us.

Thanks!
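For anyone finding this thread later, the check and fix look like this (assuming the pool is named "rbd"):

ceph osd pool get rbd min_size
ceph osd pool set rbd min_size 1   # allow I/O with a single replica up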
Re: [ceph-users] CEPH I/O Performance with OpenStack
> Thanks Robert for your response. I'm considering giving 600G 15K SAS disks a try before moving to SSD. They should give ~175 IOPS per disk. Do you think performance will be better if I go with the following setup?
>
> 4x OSD nodes
> 2x SSD - RAID 1 for OS and Journal
> 10x 600G SAS 15K - no RAID
> Replication factor 2.
>
> Regarding the IOPS calculation you did for the 4TB disks: please clarify whether the 1100 IOPS is for one node, and cluster IOPS = $number_of_nodes x $IOPS_per_node? If this formula is correct, the 4TB cluster - my current setup - should give 2200 IOPS in total, and the new SAS setup should give 3500 IOPS? Please correct me if I understand this wrong.

No, the current setup is 1100 IOPS in total!
You have 44 disks each doing 100 IOPS = 4400 IOPS.
You have RAID-10, which effectively halves the write speed = 2200 IOPS.
You have a replication factor of 2 in Ceph, which halves it again = 1100 IOPS.

I would not be a fan of a replication factor of 2 with NO RAID. The chance that 2 disks in the cluster fail at the same time is significant, and you will lose data. Replication of 3 would be the absolute minimum.

For the suggested setup that would be:
40 * 175 = 7000 IOPS
Replication factor of 3, divide by three = ~2300 IOPS

So you effectively double the number of writes you can do. Note that this is the total cluster performance. You will not get this from a single instance, since the data would need to be written exactly spread across the cluster. In my experience it is good enough for some low-write instances, but not for write-intensive applications like MySQL.

Cheers,
Robert van Leeuwen
Re: [ceph-users] Help:mount error
Hi Wang:

I have created the pool and fs before.

On 2015-01-28 14:54:33, 王亚洲 breb...@163.com wrote:
> hi:
> Did you run the "ceph fs new" command? I encountered the same issue when I had not run "ceph fs new".
> [snip]
[ceph-users] Help:mount error
Hi:

I have completed the installation of the ceph cluster, and the ceph health is OK:

    cluster 15ee68b9-eb3c-4a49-8a99-e5de64449910
     health HEALTH_OK
     monmap e1: 1 mons at {ceph01=10.194.203.251:6789/0}, election epoch 1, quorum 0 ceph01
     mdsmap e2: 0/0/1 up
     osdmap e16: 2 osds: 2 up, 2 in
      pgmap v729: 92 pgs, 4 pools, 136 MB data, 46 objects
            23632 MB used, 31172 MB / 54805 MB avail
                  92 active+clean

But when I mount from the client, the error is: mount error 5 = Input/output error.

I have tried lots of things, for example disabling SELinux and updating the kernel... Could anyone help me to resolve it?

Thanks!
Jason
Re: [ceph-users] Help:mount error
hi:

Did you run the "ceph fs new" command? I encountered the same issue when I had not run "ceph fs new".

At 2015-01-28 14:48:09, 于泓海 foxconn-...@163.com wrote:
> [snip - original message quoted above]
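For reference, the "mdsmap e2: 0/0/1 up" line in the status above means no MDS is active, which is the usual cause of "mount error 5". A sketch of the setup steps (pool names and PG counts are examples; an MDS daemon must also be running):

ceph osd pool create cephfs_data 64
ceph osd pool create cephfs_metadata 64
ceph fs new cephfs cephfs_metadata cephfs_data
ceph mds stat   # should report an active MDS before you try to mount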
Re: [ceph-users] RBD over cache tier over EC pool: rbd rm doesn't remove objects
On Tue, 27 Jan 2015, Irek Fasikhov wrote:
> Hi, All.
> Indeed, there is a problem. I removed 1 TB of data, and the space on the cluster is not cleared. Is this expected behavior or a bug? And how long will it take to be cleaned up?

Your subject says "cache tier" but I don't see it in the 'ceph df' output below. Cache tiers will store 'whiteout' objects that cache object non-existence, and those could be delaying some deletion. You can wrangle the cluster into flushing those with

    ceph osd pool set cachepool cache_target_dirty_ratio .05

(though you'll probably want to change it back to the default .4 later).

If there's no cache tier involved, there may be another problem. What version is this? Firefly?

sage

> On Sat, Sep 20, 2014 at 8:19:24 AM, Mikaël Cluseau mclus...@isi.nc wrote:
>> Hi all,
>>
>> I have weird behaviour on my firefly test + convenience storage cluster. It consists of 2 nodes with a light imbalance in available space:
>>
>> # id  weight  type name       up/down  reweight
>> -1    14.58   root default
>> -2    8.19      host store-1
>> 1     2.73        osd.1       up       1
>> 0     2.73        osd.0       up       1
>> 5     2.73        osd.5       up       1
>> -3    6.39      host store-2
>> 2     2.73        osd.2       up       1
>> 3     2.73        osd.3       up       1
>> 4     0.93        osd.4       up       1
>>
>> I used to store ~8TB of rbd volumes, coming to a near-full state. There were some annoying stuck misplaced PGs, so I began to remove 4.5TB of data; the weird thing is: the space hasn't been reclaimed on the OSDs, they stayed stuck around 84% usage. I tried to move PGs around, and it happens that the space is correctly reclaimed if I take an OSD out, let it empty its XFS volume, and then take it in again. I'm currently applying this to each OSD in turn, but I thought it could be worth telling about this. The current ceph df output is:
>>
>> GLOBAL:
>>     SIZE      AVAIL    RAW USED    %RAW USED
>>     12103G    5311G    6792G       56.12
>> POOLS:
>>     NAME         ID    USED      %USED    OBJECTS
>>     data         0     0         0        0
>>     metadata     1     0         0        0
>>     rbd          2     444G      3.67     117333
>>     [...]
>>     archives-ec  14    3628G     29.98    928902
>>     archives     15    37518M    0.30     273167
>>
>> Before just moving data, AVAIL was around 3TB. I finished the process with the OSDs on store-1, which show the following space usage now:
>>
>> /dev/sdb1  2.8T  1.4T  1.4T  50%  /var/lib/ceph/osd/ceph-0
>> /dev/sdc1  2.8T  1.3T  1.5T  46%  /var/lib/ceph/osd/ceph-1
>> /dev/sdd1  2.8T  1.3T  1.5T  48%  /var/lib/ceph/osd/ceph-5
>>
>> I'm currently fixing OSD 2; 3 will be the last one to be fixed. df on store-2 shows the following:
>>
>> /dev/sdb1  2.8T  1.9T  855G  70%  /var/lib/ceph/osd/ceph-2
>> /dev/sdc1  2.8T  2.4T  417G  86%  /var/lib/ceph/osd/ceph-3
>> /dev/sdd1  932G  481G  451G  52%  /var/lib/ceph/osd/ceph-4
>>
>> OSD 2 was at 84% 3h ago, and OSD 3 was ~75%. During rbd rm (which took a bit more than 3 days), the ceph log was showing things like this:
>>
>> 2014-09-03 16:17:38.831640 mon.0 192.168.1.71:6789/0 417194 : [INF] pgmap v14953987: 3196 pgs: 2882 active+clean, 314 active+remapped; 7647 GB data, 11067 GB used, 3828 GB / 14896 GB avail; 0 B/s rd, 6778 kB/s wr, 18 op/s; -5/5757286 objects degraded (-0.000%)
>> [...]
>> 2014-09-05 03:09:59.895507 mon.0 192.168.1.71:6789/0 513976 : [INF] pgmap v15050766: 3196 pgs: 2882 active+clean, 314 active+remapped; 6010 GB data, 11156 GB used, 3740 GB / 14896 GB avail; 0 B/s rd, 0 B/s wr, 8 op/s; -388631/5247320 objects degraded (-7.406%)
>> [...]
>> 2014-09-06 03:56:50.008109 mon.0 192.168.1.71:6789/0 580816 : [INF] pgmap v15117604: 3196 pgs: 2882 active+clean, 314 active+remapped; 4865 GB data, 11207 GB used, 3689 GB / 14896 GB avail; 0 B/s rd, 6117 kB/s wr, 22 op/s; -706519/3699415 objects degraded (-19.098%)
>> 2014-09-06 03:56:44.476903 osd.0 192.168.1.71:6805/11793 729 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.058434 secs
>> 2014-09-06 03:56:44.476909 osd.0 192.168.1.71:6805/11793 730 : [WRN] slow request 30.058434 seconds old, received at 2014-09-06 03:56:14.418429: osd_op(client.19843278.0:46081 rb.0.c7fd7f.238e1f29.b3fa [delete] 15.b8fb7551 ack+ondisk+write e38950) v4 currently waiting for blocked object
>> 2014-09-06 03:56:49.477785 osd.0 192.168.1.71:6805/11793
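If a cache tier is involved, the flush Sage describes can also be forced directly; a sketch, with "cachepool" as a placeholder name (the second command blocks while it flushes and evicts everything in the cache pool):

ceph osd pool set cachepool cache_target_dirty_ratio .05
rados -p cachepool cache-flush-evict-all
ceph df   # per-pool USED should drop as whiteouts and dirty objects drain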
Re: [ceph-users] slow read-performance inside the vm
Hello again,

Thanks, everyone, for the very helpful advice. I have now reinstalled my ceph cluster: three nodes with ceph version 0.80.7 and an OSD for every single disk. The journals live on an SSD.

My ceph.conf:

[global]
fsid = bceade34-3c54-4a35-a759-7af631a19df7
mon_initial_members = ceph01
mon_host = 10.0.0.20,10.0.0.21,10.0.0.22
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.0.0.0/24
cluster_network = 10.0.1.0/24
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 4096
osd_pool_default_pgp_num = 4096
filestore_max_sync_interval = 30

ceph osd tree:

-1  6.76  root default
-2  2.44    host ceph01
0   0.55      osd.0    up  1
3   0.27      osd.3    up  1
4   0.27      osd.4    up  1
5   0.27      osd.5    up  1
6   0.27      osd.6    up  1
7   0.27      osd.7    up  1
1   0.27      osd.1    up  1
2   0.27      osd.2    up  1
-3  2.16    host ceph02
9   0.27      osd.9    up  1
11  0.27      osd.11   up  1
12  0.27      osd.12   up  1
13  0.27      osd.13   up  1
14  0.27      osd.14   up  1
15  0.27      osd.15   up  1
8   0.27      osd.8    up  1
10  0.27      osd.10   up  1
-4  2.16    host ceph03
17  0.27      osd.17   up  1
18  0.27      osd.18   up  1
19  0.27      osd.19   up  1
20  0.27      osd.20   up  1
21  0.27      osd.21   up  1
22  0.27      osd.22   up  1
23  0.27      osd.23   up  1
16  0.27      osd.16   up  1

rados bench -p kvm 50 write --no-cleanup:

Total time run:         50.494855
Total writes made:      1180
Write size:             4194304
Bandwidth (MB/sec):     93.475
Stddev Bandwidth:       16.3955
Max bandwidth (MB/sec): 112
Min bandwidth (MB/sec): 0
Average Latency:        0.684571
Stddev Latency:         0.216088
Max latency:            1.86831
Min latency:            0.234673

rados bench -p kvm 50 seq:

Total time run:        15.009855
Total reads made:      1180
Read size:             4194304
Bandwidth (MB/sec):    314.460
Average Latency:       0.20296
Max latency:           1.06341
Min latency:           0.02983

I am really happy; these values are enough for my small number of VMs. Inside the VMs I now get 80 MB/s write and 130 MB/s read, with write-cache enabled. But there is one little problem: are there some tuning parameters for small files? For 4 KB to 50 KB files the cluster is very slow.

Thank you,
best regards

-----Original message-----
From: Lindsay Mathieson lindsay.mathie...@gmail.com
Sent: Friday 9th January 2015 0:59
To: ceph-users@lists.ceph.com
Cc: Patrik Plank pat...@plank.me
Subject: Re: [ceph-users] slow read-performance inside the vm

On Thu, 8 Jan 2015 05:36:43 PM Patrik Plank wrote:

Hi Patrik, just a beginner myself, but I have been through a similar process recently :)

> With these values above, I get a write performance of 90 MB/s and read performance of 29 MB/s inside the VM (Windows 2008/R2 with virtio driver and writeback-cache enabled). Are these values normal with my configuration and hardware?

They do seem *very* odd. Your write performance is pretty good; your read performance is abysmal - with a similar setup, with 3 OSDs slower than yours, I was getting 200 MB/s reads. Maybe your network setup is dodgy? Jumbo frames can be tricky. Have you run iperf between the nodes? What are you using for benchmark testing on the Windows guest? Also, it's probably more useful to turn writeback caching off for benchmarking; the cache will totally obscure the real performance. How is the VM mounted? rbd driver?

> The read performance seems slow. Would the read performance be better if I ran an OSD for every single disk?

I think so - in general, the more OSDs the better. Also, having 8 HDs in RAID-0 is a recipe for disaster; you'll lose the entire OSD if one of those disks fails. I'd be creating an OSD for each HD (8 per node), with a 5-10GB SSD partition per OSD for journal. Tedious, but it should make a big difference to reads and writes.

Might be worth while trying

[global]
filestore max sync interval = 30

as well.

--
Lindsay
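On the small-file question earlier in this message: it helps to measure small-block performance directly inside the VM, bypassing the page cache, before tuning. A sketch with fio (path and sizes are examples):

fio --name=small-rand --filename=/root/fio.test --size=1G \
    --bs=4k --rw=randwrite --direct=1 --ioengine=libaio --iodepth=16 \
    --runtime=60 --time_based

Small random I/O in Ceph is latency-bound (every write waits on the journal plus replication), so expect far lower throughput here than in the sequential numbers above.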
[ceph-users] CEPH I/O Performance with OpenStack
Hello,

I'm using ceph-0.80.7 with Mirantis OpenStack IceHouse - RBD for nova ephemeral disks and glance.

I have two ceph nodes with the following specifications:

2x CEPH OSD nodes - replication factor 2
Model: SuperMicro X8DT3
CPU: Dual Intel E5620
RAM: 32G
HDD: 2x 480GB SSD RAID-1 (OS and Journal), 22x 4TB SATA RAID-10 (OSD)

3x Controllers - CEPH Monitor
Model: ProLiant DL180 G6
CPU: Dual Intel E5620
RAM: 24G

Network:
Public: 1G NIC (eth0) - Juniper 2200-48
Storage, Admin, Management: 10G NIC (eth1) - Arista 7050T-36 (32x 10GE UTP, 4x 10GE SFP+)

I'm getting very poor ceph performance and high I/O on write/read, and when a light or deep scrub is running the load on the VMs goes crazy. ceph.conf tuning didn't help:

[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = xx.xx.xx.xx xx.xx.xx.xx xx.xx.xx.xx
mon_initial_members = node-xx node-xx node-xx
fsid =
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 50
public_network = xx.xx.xx.xx
osd_journal_size = 10
auth_supported = cephx
osd_pool_default_pgp_num = 50
osd_pool_default_flag_hashpspool = true
osd_mkfs_type = xfs
cluster_network = xx.xx.xx.xx
mon_clock_drift_allowed = 2

[osd]
osd_op_threads=16
osd_disk_threads=4
osd_disk_thread_ioprio_priority=7
osd_disk_thread_ioprio_class=idle
filestore op threads=8
filestore_queue_max_ops=10
filestore_queue_committing_max_ops=10
filestore_queue_max_bytes=1073741824
filestore_queue_committing_max_bytes=1073741824
filestore_max_sync_interval=10
filestore_fd_cache_size=20240
filestore_flusher=false
filestore_flush_min=0
filestore_sync_flush=true
journal_dio=true
journal_aio=true
journal_max_write_bytes=1073741824
journal_max_write_entries=5
journal_queue_max_bytes=1073741824
journal_queue_max_ops=10
ms_dispatch_throttle_bytes=1073741824
objecter_infilght_op_bytes=1073741824
objecter_inflight_ops=1638400
osd_recovery_threads = 16
#osd_recovery_max_active = 2
#osd_recovery_max_chunk = 8388608
#osd_recovery_op_priority = 2
#osd_max_backfills = 1

[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = true
rbd_cache_size = 20 GiB
rbd_cache_max_dirty = 16 GiB
rbd_cache_target_dirty = 512 MiB

Results inside a CentOS 6 64-bit VM:

[root@vm ~]# dd if=/dev/zero of=./largefile bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 17.3417 s, 61.9 MB/s

[root@vm ~]# rm -rf /tmp/test
[root@vm ~]# spew -i 50 -v -d --write -r -b 4096 10M /tmp/test
Iteration:  1  Total runtime: 00:00:00  WTR: 27753.91 KiB/s  Transfer time: 00:00:00  IOPS: 6938.48
Iteration:  2  Total runtime: 00:00:00  WTR: 29649.53 KiB/s  Transfer time: 00:00:00  IOPS: 7412.38
Iteration:  3  Total runtime: 00:00:01  WTR: 30897.44 KiB/s  Transfer time: 00:00:00  IOPS: 7724.36
Iteration:  4  Total runtime: 00:00:02  WTR:  7474.93 KiB/s  Transfer time: 00:00:01  IOPS: 1868.73
Iteration:  5  Total runtime: 00:00:02  WTR: 24810.11 KiB/s  Transfer time: 00:00:00  IOPS: 6202.53
Iteration:  6  Total runtime: 00:00:03  WTR: 28534.01 KiB/s  Transfer time: 00:00:00  IOPS: 7133.50
Iteration:  7  Total runtime: 00:00:03  WTR: 27687.95 KiB/s  Transfer time: 00:00:00  IOPS: 6921.99
Iteration:  8  Total runtime: 00:00:03  WTR: 29195.91 KiB/s  Transfer time: 00:00:00  IOPS: 7298.98
Iteration:  9  Total runtime: 00:00:04  WTR: 28315.53 KiB/s  Transfer time: 00:00:00  IOPS: 7078.88
Iteration: 10  Total runtime: 00:00:04  WTR: 27971.42 KiB/s  Transfer time: 00:00:00  IOPS: 6992.85
Iteration: 11  Total runtime: 00:00:04  WTR: 29873.39 KiB/s  Transfer time: 00:00:00  IOPS: 7468.35
Iteration: 12  Total runtime: 00:00:05  WTR: 32364.30 KiB/s  Transfer time: 00:00:00  IOPS: 8091.08
Iteration: 13  Total runtime: 00:00:05  WTR: 32619.98 KiB/s  Transfer time: 00:00:00  IOPS: 8155.00
Iteration: 14  Total runtime: 00:00:06  WTR: 18714.54 KiB/s  Transfer time: 00:00:00  IOPS: 4678.64
Iteration: 15  Total runtime: 00:00:06  WTR: 17070.37 KiB/s  Transfer time: 00:00:00  IOPS: 4267.59
Iteration: 16  Total runtime: 00:00:07  WTR: 22403.23 KiB/s  Transfer time: 00:00:00  IOPS: 5600.81
Iteration: 17  Total runtime: 00:00:07  WTR: 16076.39 KiB/s  Transfer time: 00:00:00  IOPS: 4019.10
Iteration: 18  Total runtime: 00:00:08  WTR: 26219.77 KiB/s  Transfer time: 00:00:00  IOPS: 6554.94
Iteration: 19  Total runtime: 00:00:08  WTR: 29054.01 KiB/s  Transfer time: 00:00:00  IOPS: 7263.50
Iteration: 20  Total runtime: 00:00:08  WTR: 27210.02 KiB/s  Transfer time: 00:00:00  IOPS: 6802.50
Iteration: 21  Total runtime: 00:00:09  WTR: 28502.72 KiB/s  Transfer
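Regarding the scrub impact: scrubbing can be disabled temporarily during peak load and re-enabled afterwards (cluster-wide flags; a sketch):

ceph osd set noscrub
ceph osd set nodeep-scrub
# ... peak window passes ...
ceph osd unset noscrub
ceph osd unset nodeep-scrub

Also note the osd_disk_thread_ioprio_* options already in the ceph.conf above only take effect when the disk scheduler is CFQ, so it's worth verifying that with: cat /sys/block/sdX/queue/scheduler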
Re: [ceph-users] Ceph File System Question
On Tue, Jan 27, 2015 at 6:13 AM, John Spray john.sp...@redhat.com wrote:
> Raj,
> The note is still valid, but the filesystem is getting more stable all the time. Some people are using it, especially in an active/passive configuration with a single active MDS. If you do choose to do some testing, use the most recent stable release of Ceph and the most recent linux kernel you can.
> Thanks,
> John

For what it's worth, I've had much more stability with the FUSE client than the kernel client. I know there have been lots of bugfixes in the kernel client recently, though, so I'd be interested in hearing how that works for you if you try it. I run a single MDS, and haven't had any issues at all since switching to the FUSE client :)

-Aaron
Re: [ceph-users] slow read-performance inside the vm
Hi Patrik,

On 27.01.2015 14:06, Patrik Plank wrote:
> ...
> I am really happy; these values are enough for my small number of VMs. Inside the VMs I now get 80 MB/s write and 130 MB/s read, with write-cache enabled. But there is one little problem: are there some tuning parameters for small files? For 4 KB to 50 KB files the cluster is very slow.

Do you use a higher read-ahead inside the VM? Like

echo 4096 > /sys/block/vda/queue/read_ahead_kb

Udo
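Note that the echo above does not persist across reboots; a udev rule inside the VM makes the setting stick. A sketch, assuming virtio disks named vdX:

# /etc/udev/rules.d/99-readahead.rules
ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{queue/read_ahead_kb}="4096"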
Re: [ceph-users] verifying tiered pool functioning
Hi Zhang,

Thanks for the pointer. That page documents the commands to set up the cache, though, not how to verify that it is working. I think I have been able to see objects (not PGs, I guess) moving from the cache pool to the storage pool using 'rados df'. (I haven't run long enough to verify yet.)

Thanks again!
Chad.

On Tuesday, January 27, 2015 03:47:53 you wrote:
> Do you mean cache tiering? You can refer to http://ceph.com/docs/master/rados/operations/cache-tiering/ for the detailed command lines. PGs won't migrate from pool to pool.
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Chad William Seys
> Sent: Thursday, January 22, 2015 5:40 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] verifying tiered pool functioning
>
> Hello,
> Could anyone provide a howto for verifying that a tiered pool is working correctly? E.g. a command to watch as PGs migrate from one pool to another (or determine which pool a PG is currently in), and a command to see how much data is in each pool (a global view of the number of PGs, I guess)?
> Thanks!
> Chad.
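For anyone else trying to verify tiering, a crude way to watch objects move between the pools (pool names below are placeholders):

watch -n 5 "rados df | egrep 'cachepool|storagepool'"
ceph df detail   # per-pool usage and object counts

A growing object count in the backing pool while the cache pool's count shrinks, or holds steady under writes, is a sign the tiering agent is flushing as intended.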
[ceph-users] cache pool and storage pool: possible to remove storage pool?
Hi all,

The documentation explains how to remove the cache pool: http://ceph.com/docs/master/rados/operations/cache-tiering/

Does anyone know how to remove the storage pool instead (e.g. if the storage pool has the wrong parameters)? I was hoping to push all the objects into the cache pool and then replace the storage pool.

Thanks!
Chad.
[ceph-users] 85% of the cluster won't start, or how I learned why to use disk UUIDs
Story time.

Over the past year or so, our datacenter had been undergoing the first of a series of renovations designed to add more power and cooling capacity. As part of these renovations, changes to the emergency power off (EPO) system necessitated that this system be tested. If you're unfamiliar, the EPO system is tied into the fire system, and presented as an angry caged red button next to each exit, designed to *immediately* cut all power and backup power to the datacenter. The idea being that if there's a fire, or someone's being electrocuted, or some other life-threatening electrical shenanigans occur, power can be completely cut in one swift action. As this system hadn't been tested in about 10 years and a whole bunch of changes had been made due to the renovations, the powers that be scheduled downtime for all services one Saturday, at which time we would test the EPO and cut all power to the room.

On the appointed day, I shut down each of the 21 nodes and the 3 monitors in our cluster. A couple hours later, after testing and some associated work had been completed, I powered the monitors back up and began turning on the nodes holding the spinning OSDs and associated SSD journals. After pushing the power buttons, I sat down at the console and noticed something odd: only about 15% of the OSDs in the cluster had come back online. Checking the logs, I noticed that the OSDs which had failed to start were complaining about not being able to find their associated journal partitions.

Fortunately, two things were true at this point. First and most importantly, I had split off 8 nodes which had not been added to the cluster yet and set up a second, separate cluster in another site, to which I had exported/imported the critical images (and diffs) from the primary cluster over the past few weeks. Second, I happened to have restarted a node a month or so prior which had presented the same symptoms, so I knew why this had happened. When I first provisioned the cluster, I added the journals using the /dev/sd[a-z]+ identifier. On the first four nodes, which I had provisioned manually, this was fine. On subsequent nodes, I had used FAI Linux, Saltstack, and a Python script I wrote to automatically provision the OS and configuration and to add the OSDs and journals as they were inserted into the nodes. After a reboot on these nodes, the devices were reordered, and the OSDs subsequently couldn't find their journals. I had written a script to trickle remove/re-insert OSDs one by one with journals using /dev/disk/by-id (which is a persistent identifier), but hadn't yet run it on the production cluster.

After some thought, I came up with a potential (if somewhat unpleasant) solution which would let me get the production cluster back into a running state quickly, without having to blow away the whole thing, re-provision, and restore the backups. I theorized that if I shut down a node, removed all the hot-swap disks (the OSDs and journals), booted the node, and then added the journals in the same order as I had when the node was first provisioned, the OS should give them the same /dev/sd[a-z]+ identifiers they had had pre-EPO. A quick test determined I was correct, and I could restore the cluster to working order by applying the same operation to each node. Luckily, I had (mostly) added drives to each node in the same order, and where I hadn't, at least one journal was placed in the correct order, which allowed me to determine the correct order for the other two; i.e. if journal 2 was ok but 1 and 3 weren't, when I had added them in order 1,2,3, I knew the correct order was 3,2,1.

After pulling and re-inserting 336 disks, I had a working cluster once again, except for one node where one journal had originally been /dev/sda, which was now half of the OS software RAID mirror. Breaking that, toggling the /sys/block/sdX/device/delete flag on that disk, rescanning the bus, re-adding it to the RAID set when it came back as /dev/sds, and symlinking /dev/sda to the appropriate SSD fixed that last node. Needless to say, I started pulling that node, and subsequently the other nodes, out of the cluster and re-adding them with /dev/disk/by-id journals to prevent this from happening again.

So, a couple lessons here. First, remember when adding OSDs with SSD journals to use a device UUID, not /dev/sd[a-z]+, so you don't end up needing to spend three hours manually touching each disk in your cluster, and even longer slowly shifting a couple hundred terabytes around while you fix the root cause. Second, establish standards early and stick to them. As much of a headache as pulling all the disks and re-inserting them was, it would have been much worse if they hadn't originally been inserted in the same order on (almost) all the nodes. Finally, backups are important. Having that safety net helped me focus on the solution rather than the problem, since I knew that if none of my ideas worked, I'd be able to get the most critical data back. Hopefully this saves
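For anyone in the same spot: besides the remove/re-add approach, an existing OSD's journal can be repointed at a persistent path in place. A sketch (the osd id and by-id name are hypothetical; noout avoids rebalancing during the window):

ceph osd set noout
service ceph stop osd.12
ceph-osd -i 12 --flush-journal
ln -sf /dev/disk/by-id/ata-INTEL_SSDSC2BB080G4_EXAMPLE-part1 \
    /var/lib/ceph/osd/ceph-12/journal
ceph-osd -i 12 --mkjournal
service ceph start osd.12
ceph osd unset noout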
Re: [ceph-users] ceph as a primary storage for owncloud
I tried this a while back. In my setup, I exposed a block device with rbd on the owncloud host, and tried sharing an image to the owncloud host via NFS. If I recall correctly, both worked fine (I didn't try S3). The problem I had at the time (maybe 6-12 months ago) was that owncloud didn't support enough automated management of LDAP group permissions for me to easily deploy and manage it for 1000+ users. It is on my list of things to revisit, however, so I'd be curious to hear how things go for you. If it doesn't work out, I'd also recommend checking out Pydio. It didn't make it into production in my environment (I didn't have time to focus on it), but I liked its user management better than owncloud's at the time.

-Steve

On 01/27/2015 05:05 AM, Simone Spinelli wrote:
> Dear all,
> we would like to use ceph as a primary (object) storage for owncloud.
> [snip - full message quoted above]

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
Oh yeah, I am not completely sure (have not tested myself), but if you were doing a setup where you were not using a clustering app like windows/redhat clustering that uses PRs, did not use vmfs and were instead accessing the disks exported by LIO/TGT directly in the vm (either using the guest's iscsi client or as a raw esx device), and were not using ESX clustering, then you might be safe doing active/passive or active/active with no modifications needed other than some scripts to distribute the setup info across LIO/TGT nodes. Were any of you trying this type of setup when you were describing your results? If so, were you running oracle or something like that? Just wondering. On 01/27/2015 08:58 PM, Mike Christie wrote: I do not know about perf, but here is some info on what is safe and general info. - If you are not using VAAI then it will use older style RESERVE/RELEASE commands only. If you are using VAAI ATS, and doing active/active then you need something, like the lock/sync talked about in the slides/hammer doc, that would coordinate multiple ATS/COMPARE_AND_WRITEs from executing at the same time on the same sectors. You probably do not ever see problems today, because it seems ESX normally does this command for only one sector and I do not think there are multiple commands for the same sectors in flight normally. For active/passive, ATS is simple since you only have the one LIO/TGT node executing commands at a time, so the locking is done locally using a normal old mutex. - tgt and LIO both support SCSI-3 persistent reservations. This is not really needed for ESX vmfs though since it uses ATS or older RESERVE/RELEASE. If you were using a cluster app like windows clustering, red hat cluster, etc in ESX or in normal non vm use, then you need something extra to support SCSI-3 PRs in both active/active or active/passive. For AA, you need something like described in that doc/video. For AP, you would need to copy over the PR state from one node to the other when failing over/back across nodes. For LIO this is in /var/target. Depending on how you do AP (what ALUA states you use if you do ALUA), you might also need to always distribute the PR info if you are doing windows clustering. Windows wants to see a consistent view of the PR info from all ports if you do something like ALUA active-optimized and standby states for active/passive. - I do not completely understand the comment about using LIO as a backend for tgt. You would either use tgt or LIO to export a rbd device. Not both at the same time like using LIO for some sort of tgt backend. Maybe people meant using the RBD backend instead of LIO backend - There are some other setup complications that you can see here http://comments.gmane.org/gmane.linux.scsi.target.devel/7044 if you are using ALUA. I think tgt does not support ALUA, but LIO does. On 01/23/2015 04:25 PM, Zoltan Arnold Nagy wrote: Correct me if I'm wrong, but tgt doesn't have full SCSI-3 persistence support when _not_ using the LIO backend for it, right? AFAIK you can either run tgt with it's own iSCSI implementation or you can use tgt to manage your LIO targets. I assume when you're running tgt with the rbd backend code you're skipping all the in-kernel LIO parts (in which case the RedHat patches won't help a bit), and you won't have proper active-active support, since the initiators have no way to synchronize state (and more importantly, no way to synchronize write caching! [I can think of some really ugly hacks to get around that, tho...]). 
On 01/23/2015 05:46 PM, Jake Young wrote:

Thanks for the feedback Nick and Zoltan,

I have been seeing periodic kernel panics when I used LIO. It was either due to LIO or the kernel rbd mapping. I have seen this on Ubuntu precise with kernel 3.14.14 and again in Ubuntu trusty with the utopic kernel (currently 3.16.0-28). Ironically, this is the primary reason I started exploring a redundancy solution for my iSCSI proxy node. So, yes, these crashes have nothing to do with running the Active/Active setup. I am moving my entire setup from LIO to rbd enabled tgt, which I've found to be much more stable and which gives equivalent performance.

I've been testing active/active LIO since July of 2014 with VMWare and I've never seen any vmfs corruption. I am now convinced (thanks Nick) that it is possible. The reason I have not seen any corruption may have to do with how VMWare happens to be configured. Originally, I had made a point to use round robin path selection in the VMware hosts; but as I did performance testing, I found that it actually didn't help performance. When the host switches iSCSI targets there is a short spin-up time for LIO to get to 100% IO capability. Since round robin switches targets every 30 seconds (60 seconds? I forget), this seemed to be significant. A secondary goal for me was to end up with a config that required minimal tuning from VMWare and the target software; so the obvious choice is to leave VMWare's path selection at the default, which is Fixed and picks the first target in ASCII-betical order. That means I am actually functioning in Active/Passive mode.

Jake
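For readers wondering what "rbd enabled tgt" looks like in practice, here is a minimal sketch, assuming a tgt build that includes the rbd backstore (bs_rbd) and a running tgtd; the target IQN, pool, and image names below are made up for illustration.

#!/usr/bin/env python
# Minimal sketch: export an existing RBD image through tgt's rbd
# backstore via tgtadm. IQN, pool and image names are illustrative.
import subprocess

TID = "1"
IQN = "iqn.2015-01.com.example:rbd-gw"  # hypothetical target name
RBD_IMAGE = "rbd/vmware-lun0"           # pool/image, must already exist

def run(args):
    print("+ " + " ".join(args))
    subprocess.check_call(args)

# Create the target, attach the RBD image as LUN 1, allow all initiators.
run(["tgtadm", "--lld", "iscsi", "--op", "new", "--mode", "target",
     "--tid", TID, "-T", IQN])
run(["tgtadm", "--lld", "iscsi", "--op", "new", "--mode", "logicalunit",
     "--tid", TID, "--lun", "1", "--bstype", "rbd",
     "--backing-store", RBD_IMAGE])
run(["tgtadm", "--lld", "iscsi", "--op", "bind", "--mode", "target",
     "--tid", TID, "-I", "ALL"])

The same can be expressed persistently in tgt's targets.conf; in either case the initiator ACL ("ALL" here) should be tightened for real deployments.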
Re: [ceph-users] Consumer Grade SSD Clusters
Hello,

As others said, it depends on your use case and expected write load. If you search the ML archives, you will find that there can be SEVERE write amplification with Ceph, something to very much keep in mind. You should run tests yourself before deploying things and committing to hardware that won't cut the mustard. I did this comparison for a project (involving DRBD, not Ceph) 3 months ago:

---
Model              Size   Endurance  Cost      USD per TBW
Samsung 845DC EVO  960GB  600 TBW    700 USD   1.16
Intel DC S3700     800GB  7300 TBW   1500 USD  0.20
---

In that particular case the Samsung made the grade, as the expected writes per year and SSD are less than 60TB. Make those calculations for the specific SSDs you have in mind. Cheap initial costs may come back to bite you in the behind later.

With Ceph, I'd be _very_ uncomfortable putting data on consumer SSDs. Aside from the peace of mind that the Intel DC S3700 (or maybe the Samsung 845DC Pro, not tested myself yet) gives you, there's the overall better performance (and consistently so, no going to sleep for garbage collection). On top of the SMART monitoring you'll probably also be forced to use fstrim on these SSDs to keep their performance (such as it is) from degrading.

Christian

On Wed, 28 Jan 2015 00:30:04 + Quenten Grasso wrote:

Hi Nick,

Agreed, I see your point: once you're past the 150TBW or whatever that number may be, you're effectively just waiting for failure, but aren't we anyway? I guess it depends on your use case at the end of the day.

I wonder what the likes of Amazon, Rackspace etc are doing in the way of SSD's; either they are buying them so cheap per GB due to the volume, or they are possibly using consumer grade SSD's. Hmm.. using consumer grade SSD's may be an interesting option; if you have decent monitoring and alerting using SMART, you should still be able to see how much spare flash you have available. As suggested by Wido, using multiple brands would help remove the possible cascading failure effect, which I guess we all should be doing anyway on our spinners. I guess we have to decide whether it's worth the extra effort in the long run vs running enterprise SSDs.

Regards,
Quenten Grasso

From: Nick Fisk [mailto:n...@fisk.me.uk] Sent: Saturday, 24 January 2015 7:33 PM To: Quenten Grasso; ceph-users@lists.ceph.com Subject: RE: Consumer Grade SSD Clusters

Hi Quenten,

There is no real answer to your question. It really depends on how busy your storage will be, and particularly whether it is mainly reads or writes. I wouldn't pay too much attention to that SSD endurance test; whilst it's great to know that they have a lot more headroom than their official specs, you run the risk of having a spectacular multiple disk failure if you intend to run them all that high. You can probably guarantee that as 1 SSD starts to fail, the increase in workload to re-balance the cluster will cause failures on the rest. I guess it really comes down to how important the availability of your data is. Whilst an average PC user might balk at the price of paying 4 times more per GB for an S3700 SSD, in the enterprise world they are still comparatively cheap. The other thing you need to be aware of is that most consumer SSD's don't have power loss protection; again, if you are mainly doing reads and cost is more important than availability, there may be an argument to use them.

Nick
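To make the arithmetic in the comparison table above explicit, here is a small worked example. The drive figures come from Christian's table and the 60TB/year write load is the one he quotes; the 4x write amplification factor (replication plus journal double-write) is an assumption added for illustration, since the thread warns about amplification without giving an exact number.

# Worked example of the cost/endurance math above. Drive figures are
# from the comparison table; the write amplification factor is an
# assumed illustration (2x replication * 2x journal double-write).
drives = {
    "Samsung 845DC EVO": {"endurance_tbw": 600,  "cost_usd": 700},
    "Intel DC S3700":    {"endurance_tbw": 7300, "cost_usd": 1500},
}

client_writes_tb_per_year = 60  # expected client writes per SSD per year
write_amplification = 4.0       # assumption, not a measured value

for name, d in drives.items():
    usd_per_tbw = d["cost_usd"] / float(d["endurance_tbw"])
    device_tb_per_year = client_writes_tb_per_year * write_amplification
    years = d["endurance_tbw"] / device_tb_per_year
    print("%-18s %.2f USD/TBW, ~%.1f years at %.0f TB/year on-device"
          % (name, usd_per_tbw, years, device_tb_per_year))

With those assumptions the consumer-class drive is worn out in a couple of years while the S3700 has decades of headroom, which is exactly the trade-off the thread weighs against the higher purchase price.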
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Quenten Grasso Sent: 24 January 2015 09:13 To: ceph-users@lists.ceph.com Subject: [ceph-users] Consumer Grade SSD Clusters

Hi Everyone,

Just wondering if anyone has had any experience in using consumer grade SSD's for a Ceph cluster? I came across this article: http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte/3

They have been testing different SSDs' write endurance, and they have been able to write 1PB+ to a Samsung 840 Pro 256GB, which is only rated at 150TBW. Of course other SSD's have failed well before 1PBW, so it's definitely worth a read.

So I've been thinking about using consumer grade SSD's for OSD's and enterprise SSD's for journals. The reasoning is that enterprise SSD's are a lot faster at journaling than consumer grade drives, plus this would effectively halve the overall write requirements on the consumer grade disks. This could also be a cost effective alternative to using enterprise SSD's as OSD's; with 2x replication it seems a pretty good cost saving, with 3x replication not so much.

Cheers,
Quenten Grasso

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com
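Quenten's suggestion above, watching remaining spare flash via SMART, can be automated. Below is a minimal sketch using smartctl; note that the wear attribute name varies by vendor (e.g. Media_Wearout_Indicator on Intel drives, Wear_Leveling_Count on Samsung), so the attribute names, device list, and alert threshold here are assumptions to adapt to your hardware.

#!/usr/bin/env python
# Minimal sketch: warn when an SSD's normalized SMART wear indicator
# (100 = new, lower = more worn) drops below a threshold. Attribute
# names, devices, and threshold are assumptions for illustration.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]  # hypothetical OSD data SSDs
WEAR_ATTRS = ("Media_Wearout_Indicator", "Wear_Leveling_Count")
THRESHOLD = 20                      # alert below ~20% life left

def wear_value(device):
    """Return the normalized wear VALUE for device, or None if absent."""
    out = subprocess.check_output(["smartctl", "-A", device])
    for line in out.decode("utf-8", "replace").splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] in WEAR_ATTRS:
            return int(fields[3])   # normalized VALUE column
    return None

for dev in DEVICES:
    v = wear_value(dev)
    if v is None:
        print("%s: no known wear attribute found" % dev)
    elif v < THRESHOLD:
        print("ALERT %s: wear indicator down to %d" % (dev, v))
    else:
        print("%s: wear indicator %d (ok)" % (dev, v))

Hooked into whatever alerting you already run, something like this gives early warning before a batch of identical consumer drives hits end-of-life together.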
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
On Fri, Jan 23, 2015 at 8:46 AM, Zoltan Arnold Nagy zol...@linux.vnet.ibm.com wrote:

Just to chime in: it will look fine, feel fine, but underneath it's quite easy to get VMFS corruption. It happened in our tests. Also, if you're running LIO, expect a kernel panic from time to time (I haven't tried with the latest upstream, as I've been using Ubuntu 14.04 on my export hosts for the test, so it might have improved...). As of now I would not recommend this