Re: [ceph-users] Ceph Supermicro hardware recommendation
Hi Christian, On 04/02/15 02:39, Christian Balzer ch...@gol.com wrote: On Tue, 3 Feb 2015 15:16:57 + Colombo Marco wrote: Hi all, I have to build a new Ceph storage cluster. After I've read the hardware recommendations and some mail from this mailing list, I would like to buy these servers: Nick mentioned a number of things already I totally agree with, so don't be surprised if some of this feels like a repeat. OSD: SSG-6027R-E1R12L - http://www.supermicro.nl/products/system/2U/6027/SSG-6027R-E1R12L.cfm Intel Xeon e5-2630 v2 64 GB RAM As Nick said, v3 and more RAM might be helpful, depending on your use case (small writes versus large ones) even faster CPUs as well. Ok, we switch from v2 to v3 and from 64 to 96 GB of RAM. LSI 2308 IT 2 x SSD Intel DC S3700 400GB 2 x SSD Intel DC S3700 200GB Why the separation of SSDs? They aren't going to be that busy with regards to the OS. We would like to use the 400GB SSDs for a cache pool, and the 200GB SSDs for the journaling. Get a case like Nick mentioned with 2 x 2.5" bays in the back, put 2 DC S3700 400GBs in there (connected to onboard 6Gb/s SATA3), partition them so that you have a RAID1 for the OS and plain partitions for the journals of the now 12 OSD HDDs in your chassis. Of course this optimization in terms of cost and density comes with a price: if one SSD should fail, you will have 6 OSDs down. Given how reliable the Intels are this is unlikely, but something you need to consider. If you want to limit the impact of a SSD failure and have just 2 OSD journals per SSD, get a chassis like the one above and 4 DC S3700 200GB, RAID10 them for the OS and put 2 journal partitions on each. I did the same with 8 3TB HDDs and 4 DC S3700 100GB; the HDDs (and CPU with 4KB IOPS) are the limiting factor, not the SSDs. 8 x HDD Seagate Enterprise 6TB Are you really sure you need that density? One disk failure will result in a LOT of data movement once these become somewhat full. If you were to go for a 12 OSD node as described above, consider 4TB ones for the same overall density, while having more IOPS and likely the same price or less. We chose the 6TB disks because we need a lot of storage in a small number of servers, and we prefer servers without too many disks. However, we plan to use at most 80% of a 6TB disk. 2 x 40GbE for backend network You'd be lucky to write more than 800MB/s sustained to your 8 HDDs (remember they will have to deal with competing reads and writes, this is not a sequential synthetic write benchmark). Incidentally 1GB/s to 1.2GB/s (depending on configuration) would also be the limit of your journal SSDs. Other than backfilling caused by cluster changes (OSD removed/added), your limitation is nearly always going to be IOPS, not bandwidth. Ok, after some discussion, we switch to 2 x 10 GbE. So 2x10GbE or, if you're comfortable with it (I am ^o^), an Infiniband backend (can be cheaper, less latency, plans for RDMA support in Ceph) should be more than sufficient. 2 x 10GbE for public network META/MON: SYS-6017R-72RFTP - http://www.supermicro.com/products/system/1U/6017/SYS-6017R-72RFTP.cfm 2 x Intel Xeon e5-2637 v2 4 x SSD Intel DC S3500 240GB raid 1+0 You're likely to get better performance and of course MUCH better durability by using 2 DC S3700, at about the same price. Ok, we switch to 2 x SSD DC S3700. 128 GB RAM Total overkill for a MON, but I have no idea about MDS and RAM never hurts.
Ok, we switch from 128 to 96. In your follow-up you mentioned 3 mons; I would suggest putting 2 more mons (only, not MDS) on OSD nodes and making sure that within the IP numbering the real mons have the lowest IP addresses, because the MON with the lowest IP becomes master (and thus the busiest). This way you can survive the loss of 2 nodes and still have a valid quorum. Ok, got it. Christian 2 x 10 GbE What do you think? Any feedback, advice, or ideas are welcome! Thanks so much Regards, -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ Thanks so much! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rbd recover tool for stopped ceph cluster
rbd recover tool is an offline tool to recover an rbd image when the ceph cluster is stopped. It is useful when you urgently need to recover an rbd image from a broken ceph cluster. I have used a similar prototype tool to successfully recover a large rbd image in a ceph cluster of 900+ osds, so I think this tool can help us keep rbd data safe. Before running this tool, just make sure to stop all ceph services: ceph-mon, ceph-osd, ceph-mds. Currently, this tool supports both raw images and snapshots; clone images will be supported soon. Here is the pull request: https://github.com/ceph/ceph/pull/3611 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Question about output message and object update for ceph class
Hello, I wrote a ceph client using the rados lib to execute a function upon an object.

CLIENT SIDE CODE ===

int main() {
    ...
    strcpy(in, "from client");
    err = rados_exec(io, objname, "devctl", "devctl_op", in, strlen(in), out, 128);
    if (err < 0) {
        fprintf(stderr, "rados_exec() failed: %s\n", strerror(-err));
        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        exit(1);
    }
    out[err] = '\0';
    printf("err = %d, exec result out = %s, in = %s\n", err, out, in);
    ...
}

CLASS CODE IN OSD SIDE ==

static int devctl_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
    ...
    i = cls_cxx_stat(hctx, &size, NULL);
    if (i < 0)
        return i;
    bufferlist read_bl, write_bl;
    i = cls_cxx_read(hctx, 0, size, &read_bl);
    if (i < 0) {
        CLS_ERR("cls_cxx_read failed");
        return i;
    }
    // we generate our reply
    out->append("Hello, ");
    if (in->length() == 0)
        out->append("world");
    else
        out->append(*in);
    out->append("!");
#if 1
    const char *tstr = "from devctl func";
    write_bl.append(tstr);
    i = cls_cxx_write(hctx, size, write_bl.length(), &write_bl);
    if (i < 0) {
        CLS_ERR("cls_cxx_write failed: %s", strerror(-i));
        return i;
    }
#endif
    // this return value will be returned back to the librados caller
    return 0;
}

I found that if I update the content of the object by calling cls_cxx_write(), then 'out' will be null on the client side; otherwise out will be "Hello, from client!". Can anybody here give some hints? -- Den ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] snapshoting on btrfs vs xfs
On Wed, 4 Feb 2015, Cristian Falcas wrote: Hi, We have an openstack installation that uses ceph as the storage backend. We use mainly snapshot and boot from snapshot from an original instance with a 200gb disk. Something like this: 1. import original image 2. make volume from image (those 2 steps were done only once, when we installed openstack) 3. boot main instance from volume, update the db inside 4. snapshot the instance 5. make volumes from previous snapshot 6. boot test instances from those volumes (the last 3 steps take less then 30s) Currently the fs is btrfs and we are in love with the solution: the snapshots are instant and boot from snapshot is also instant. It cut our tests time (compared with the vmware solution + netap storage) from 12h to 2h. With vmware we were spending 10h with what now is done in a few seconds. That's great to hear! I was wondering if the fs matters in this case, because we are a little worry about using btrfs and reading all the horror story here and on btrfs mailing list. Is the snapshoting performed by ceph or by the fs? Can we switch to xfs and have the same capabilities: instant snapshot + instant boot from snapshot? The feature set and capabilities are identical. The difference is that on btrfs we are letting btrfs do the efficient copy-on-write cloning when we touch a snapshotted object while with XFS we literally copy the object file (usually 4MB) on the first write. You will likely see some penalty in the boot-from-clone scenario, although I have no idea how significant it will be. On the other hand, we've also seen that btrfs fragmentation over time can lead to poor performance relative to XFS. So, no clear answer, really. Sorry! If you do stick with btrfs, please report back here and share what you see as far as stability (along with the kernel version(s) you are using; most of the XFS over btrfs usage is based on FUD (in the literal sense) and I don't think we have seen much in the way of real user reports here in a while. Thanks! sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PG to pool mapping?
On Feb 4, 2015, at 3:27 PM, Gregory Farnum wrote: On Wed, Feb 4, 2015 at 1:20 PM, Chad William Seys cws...@physics.wisc.edu wrote: Hi all, How do I determine which pool a PG belongs to? (Also, is it the case that all objects in a PG belong to one pool?) PGs are of the form 1.a2b3c4. The part prior to the period is the pool ID; the part following distinguishes the PG and is based on the hash range it covers. :) Yes, all objects in a PG belong to a single pool; they are hash ranges of the pool. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com You can also map the pool number to the pool name with: 'ceph osd lspools' Similarly, 'rados lspools' will print out the pools line by line. Cheers, Lincoln ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
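For scripted lookups, the same pool-id-to-name mapping is also available from librados itself; below is a minimal C sketch (the PG id 1.a2b3c4 is just the made-up example from this thread, and a reachable cluster with the default ceph.conf and admin keyring is assumed):

    /* Hypothetical helper: map a PG id like "1.a2b3c4" to its pool name. */
    #include <rados/librados.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const char *pgid = argc > 1 ? argv[1] : "1.a2b3c4";
        int64_t pool_id = strtoll(pgid, NULL, 10);   /* part before the '.' */
        char name[128];
        rados_t cluster;
        int r;

        if (rados_create(&cluster, NULL) < 0 ||
            rados_conf_read_file(cluster, NULL) < 0 ||
            rados_connect(cluster) < 0) {
            fprintf(stderr, "could not connect to cluster\n");
            return 1;
        }

        r = rados_pool_reverse_lookup(cluster, pool_id, name, sizeof(name));
        if (r < 0)
            fprintf(stderr, "no pool with id %lld\n", (long long)pool_id);
        else
            printf("PG %s belongs to pool %lld (%s)\n",
                   pgid, (long long)pool_id, name);

        rados_shutdown(cluster);
        return r < 0 ? 1 : 0;
    }

rados_pool_reverse_lookup() fills in the pool name for a given id and returns a negative errno (e.g. -ENOENT) when no pool has that id.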
Re: [ceph-users] PG to pool mapping?
On Wed, Feb 4, 2015 at 1:20 PM, Chad William Seys cws...@physics.wisc.edu wrote: Hi all, How do I determine which pool a PG belongs to? (Also, is it the case that all objects in a PG belong to one pool?) PGs are of the form 1.a2b3c4. The part prior to the period is the pool ID; the part following distinguishes the PG and is based on the hash range it covers. :) Yes, all objects in a PG belong to a single pool; they are hash ranges of the pool. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] snapshoting on btrfs vs xfs
Thank you for the clarifications. We will try to report back, but I'm not sure our use case is relevant. We are trying to use every dirty trick to speed up the VMs. We have only 1 replica, and 2 pools. One pool with journal on disk, where the original instance exists (we want to keep this one safe). The second pool is for the tests machines and has the journal in ram, so this part is very volatile. We don't really care, because if the worst happens and we have a power loss we just redo the pool and start new instances. Journal in ram did wonders for us in terms of read/write speed. On Wed, Feb 4, 2015 at 11:22 PM, Sage Weil s...@newdream.net wrote: On Wed, 4 Feb 2015, Cristian Falcas wrote: Hi, We have an openstack installation that uses ceph as the storage backend. We use mainly snapshot and boot from snapshot from an original instance with a 200gb disk. Something like this: 1. import original image 2. make volume from image (those 2 steps were done only once, when we installed openstack) 3. boot main instance from volume, update the db inside 4. snapshot the instance 5. make volumes from previous snapshot 6. boot test instances from those volumes (the last 3 steps take less then 30s) Currently the fs is btrfs and we are in love with the solution: the snapshots are instant and boot from snapshot is also instant. It cut our tests time (compared with the vmware solution + netap storage) from 12h to 2h. With vmware we were spending 10h with what now is done in a few seconds. That's great to hear! I was wondering if the fs matters in this case, because we are a little worry about using btrfs and reading all the horror story here and on btrfs mailing list. Is the snapshoting performed by ceph or by the fs? Can we switch to xfs and have the same capabilities: instant snapshot + instant boot from snapshot? The feature set and capabilities are identical. The difference is that on btrfs we are letting btrfs do the efficient copy-on-write cloning when we touch a snapshotted object while with XFS we literally copy the object file (usually 4MB) on the first write. You will likely see some penalty in the boot-from-clone scenario, although I have no idea how significant it will be. On the other hand, we've also seen that btrfs fragmentation over time can lead to poor performance relative to XFS. So, no clear answer, really. Sorry! If you do stick with btrfs, please report back here and share what you see as far as stability (along with the kernel version(s) you are using; most of the XFS over btrfs usage is based on FUD (in the literal sense) and I don't think we have seen much in the way of real user reports here in a while. Thanks! sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] snapshoting on btrfs vs xfs
Hi, We have an openstack installation that uses ceph as the storage backend. We mainly use snapshot and boot-from-snapshot from an original instance with a 200gb disk. Something like this: 1. import original image 2. make volume from image (those 2 steps were done only once, when we installed openstack) 3. boot main instance from volume, update the db inside 4. snapshot the instance 5. make volumes from previous snapshot 6. boot test instances from those volumes (the last 3 steps take less than 30s) Currently the fs is btrfs and we are in love with the solution: the snapshots are instant and boot from snapshot is also instant. It cut our test time (compared with the vmware solution + netapp storage) from 12h to 2h. With vmware we were spending 10h on what is now done in a few seconds. I was wondering if the fs matters in this case, because we are a little worried about using btrfs after reading all the horror stories here and on the btrfs mailing list. Is the snapshotting performed by ceph or by the fs? Can we switch to xfs and have the same capabilities: instant snapshot + instant boot from snapshot? Best regards, Cristian Falcas ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
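For reference, the snapshot/clone part of the workflow above maps onto a handful of librbd calls; a rough sketch follows (pool, image and snapshot names are made up, and OpenStack's rbd driver performs the equivalent calls for steps 4-6 on your behalf):

    /* Sketch of steps 4-6 done directly against librbd: snapshot the main
     * image, protect the snapshot, and create a copy-on-write clone to
     * boot a test instance from. */
    #include <rados/librados.h>
    #include <rbd/librbd.h>

    int clone_test_instance(rados_ioctx_t io)
    {
        rbd_image_t parent;
        int order = 0;                    /* 0 = use the default object size */
        int r = rbd_open(io, "main-instance", &parent, NULL);
        if (r < 0)
            return r;

        r = rbd_snap_create(parent, "gold");          /* step 4 */
        if (r == 0)
            r = rbd_snap_protect(parent, "gold");     /* clones need a protected snap */
        if (r == 0)
            r = rbd_clone(io, "main-instance", "gold",    /* steps 5-6 */
                          io, "test-instance-01",
                          RBD_FEATURE_LAYERING, &order);

        rbd_close(parent);
        return r;
    }

The clone is copy-on-write regardless of the backing filesystem; as discussed below, the fs only changes how the OSDs materialize object copies when a snapshotted object is first written to.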
[ceph-users] PG to pool mapping?
Hi all, How do I determine which pool a PG belongs to? (Also, is it the case that all objects in a PG belong to one pool?) Thanks! C. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] snapshoting on btrfs vs xfs
Hi Cristian, We will try to report back, but I'm not sure our use case is relevant. We are trying to use every dirty trick to speed up the VMs. We have the same use case. The second pool is for the tests machines and has the journal in ram, so this part is very volatile. We don't really care, because if the worst happens and we have a power loss we just redo the pool and start new instances. Journal in ram did wonders for us in terms of read/write speed. How do you handle a reboot of a node that has its OSD journals in RAM? All the mons know about the volatile pool - do you have to remove and recreate the pool automatically after rebooting this node? Did you try to enable rbd caching? Is there a write-performance benefit to using the journal in RAM instead of enabling rbd caching on the client (openstack) side? I thought with rbd caching the write performance should be fast enough. regards Danny ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RGW put file question
When I put the same file with multiple threads, writing the file head oid (ref.ioctx.operate(ref.oid, &op)) sometimes returns -ECANCELED. I think this is normal, but the function then jumps to done_cancel and runs complete_update_index_cancel (or index_op.cancel()); the osd then executes rgw_bucket_complete_op with CLS_RGW_OP_ADD and a file size of 0, so at this moment the bucket index records the file size as zero. I think this is not right. baijia...@126.com From: Yehuda Sadeh-Weinraub Date: 2015-02-05 12:06 To: baijiaruo CC: ceph-users Subject: Re: [ceph-users] RGW put file question - Original Message - From: baijia...@126.com To: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, February 4, 2015 5:47:03 PM Subject: [ceph-users] RGW put file question when I put file failed, and run the function RGWRados::cls_obj_complete_cancel, why we use CLS_RGW_OP_ADD not use CLS_RGW_OP_CANCEL? why we set poolid is -1 and set epoch is 0? I'm not sure, could very well be a bug. It should definitely be OP_CANCEL, but going back through the history it seems like it has been OP_ADD since at least argonaut. How did you notice it? It might explain a couple of issues that we've been seeing. Yehuda ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] snapshoting on btrfs vs xfs
We want to use this script as a service for start/stop (but it hasn't been tested yet):

#!/bin/bash
# chkconfig: - 50 90
# description: make a journal for osd.0 in ram

start () {
    # recreate the journal in /dev/shm if it does not exist yet
    [ -f /dev/shm/osd.0.journal ] || ceph-osd -i 0 --mkjournal
}

stop () {
    service ceph stop osd.0
    # write out anything still in the journal, then drop it
    ceph-osd -i 0 --flush-journal
    rm -f /dev/shm/osd.0.journal
}

case "$1" in
    start) start;;
    stop) stop;;
esac

Also, we didn't see any noticeable improvement with rbd caching, but we didn't perform any tests to measure it; that's just how it feels. On Thu, Feb 5, 2015 at 12:09 AM, Daniel Schwager daniel.schwa...@dtnet.de wrote: Hi Cristian, We will try to report back, but I'm not sure our use case is relevant. We are trying to use every dirty trick to speed up the VMs. We have the same use case. The second pool is for the tests machines and has the journal in ram, so this part is very volatile. We don't really care, because if the worst happens and we have a power loss we just redo the pool and start new instances. Journal in ram did wonders for us in terms of read/write speed. How do you handle a reboot of a node that has its OSD journals in RAM? All the mons know about the volatile pool - do you have to remove and recreate the pool automatically after rebooting this node? Did you try to enable rbd caching? Is there a write-performance benefit to using the journal in RAM instead of enabling rbd caching on the client (openstack) side? I thought with rbd caching the write performance should be fast enough. regards Danny ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] snapshoting on btrfs vs xfs
On 5 February 2015 at 07:22, Sage Weil s...@newdream.net wrote: Is the snapshoting performed by ceph or by the fs? Can we switch to xfs and have the same capabilities: instant snapshot + instant boot from snapshot? The feature set and capabilities are identical. The difference is that on btrfs we are letting btrfs do the efficient copy-on-write cloning when we touch a snapshotted object while with XFS we literally copy the object file (usually 4MB) on the first write. Are ceph snapshots really that much faster when using btrfs underneath? One of the problems we have with ceph is that snapshot take/restore is insanely slow, tens of minutes - but we are using xfs. -- Lindsay ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RGW put file question
When a put file fails and we run the function RGWRados::cls_obj_complete_cancel, why do we use CLS_RGW_OP_ADD instead of CLS_RGW_OP_CANCEL? And why do we set the poolid to -1 and the epoch to 0? baijia...@126.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] command to flush rbd cache?
On 02/05/2015 07:44 AM, Udo Lembke wrote: Hi all, is there any command to flush the rbd cache like the echo 3 /proc/sys/vm/drop_caches for the os cache? librbd exposes it as rbd_invalidate_cache(), and qemu uses it internally, but I don't think you can trigger that via any user-facing qemu commands. Exposing it through the admin socket would be pretty simple though: http://tracker.ceph.com/issues/2468 You can also just detach and reattach the device to flush the rbd cache. Josh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
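A minimal sketch of calling rbd_invalidate_cache() directly from a small librbd client, as mentioned above (pool and image names are made up, and error handling is trimmed to the essentials):

    #include <rados/librados.h>
    #include <rbd/librbd.h>

    int drop_image_cache(void)
    {
        rados_t cluster;
        rados_ioctx_t io;
        rbd_image_t image;
        int r;

        rados_create(&cluster, NULL);
        rados_conf_read_file(cluster, NULL);     /* default ceph.conf/keyring */
        if ((r = rados_connect(cluster)) < 0)
            goto out_cluster;
        if ((r = rados_ioctx_create(cluster, "rbd", &io)) < 0)
            goto out_cluster;
        if ((r = rbd_open(io, "testimage", &image, NULL)) < 0)
            goto out_ioctx;

        r = rbd_invalidate_cache(image);   /* drop cached data for this image */

        rbd_close(image);
    out_ioctx:
        rados_ioctx_destroy(io);
    out_cluster:
        rados_shutdown(cluster);
        return r;
    }

Note this only affects the caller's own librbd instance; it does not reach into a cache held by a running qemu process, which is why the admin socket hook in #2468 (or detach/reattach) is the practical route for VMs.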
Re: [ceph-users] command to flush rbd cache?
Hi Dan, I mean qemu-kvm, i.e. librbd. But how can I tell kvm to flush the buffer? Udo On 05.02.2015 07:59, Dan Mick wrote: On 02/04/2015 10:44 PM, Udo Lembke wrote: Hi all, is there any command to flush the rbd cache, like 'echo 3 > /proc/sys/vm/drop_caches' for the os cache? Udo Do you mean the kernel rbd or librbd? The latter responds to flush requests from the hypervisor. The former...I'm not sure it has a separate cache. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] command to flush rbd cache?
On 02/04/2015 10:44 PM, Udo Lembke wrote: Hi all, is there any command to flush the rbd cache like the echo 3 /proc/sys/vm/drop_caches for the os cache? Udo Do you mean the kernel rbd or librbd? The latter responds to flush requests from the hypervisor. The former...I'm not sure it has a separate cache. -- Dan Mick Red Hat, Inc. Ceph docs: http://ceph.com/docs ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] command to flush rbd cache?
Hi Josh, thanks for the info. detach/reattach should be fine for me, because it's only for performance testing. #2468 would be fine of course. Udo On 05.02.2015 08:02, Josh Durgin wrote: On 02/05/2015 07:44 AM, Udo Lembke wrote: Hi all, is there any command to flush the rbd cache, like 'echo 3 > /proc/sys/vm/drop_caches' for the os cache? librbd exposes it as rbd_invalidate_cache(), and qemu uses it internally, but I don't think you can trigger that via any user-facing qemu commands. Exposing it through the admin socket would be pretty simple though: http://tracker.ceph.com/issues/2468 You can also just detach and reattach the device to flush the rbd cache. Josh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] command to flush rbd cache?
Hi all, is there any command to flush the rbd cache, like 'echo 3 > /proc/sys/vm/drop_caches' for the os cache? Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] command to flush rbd cache?
I don't know the details well; I know the device itself supports the block-device-level cache-flush commands (I know there's a SCSI-specific one but I don't know offhand if there's a device generic one) so the guest OS can, and does, request flushing. I can't remember if there's also a qemu command to prompt the virtual device to flush without telling the guest. On 02/04/2015 11:08 PM, Udo Lembke wrote: Hi Dan, I mean qemu-kvm, also librbd. But how I can kvm told to flush the buffer? Udo On 05.02.2015 07:59, Dan Mick wrote: On 02/04/2015 10:44 PM, Udo Lembke wrote: Hi all, is there any command to flush the rbd cache like the echo 3 /proc/sys/vm/drop_caches for the os cache? Udo Do you mean the kernel rbd or librbd? The latter responds to flush requests from the hypervisor. The former...I'm not sure it has a separate cache. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
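If a guest-initiated flush is enough, you can trigger one yourself from inside the VM; a rough sketch, assuming the rbd-backed disk shows up as /dev/vdb (a made-up device name). fsync() on the block device makes the guest kernel issue a cache-flush request, which qemu forwards to librbd when the drive runs with cache=writeback:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/vdb", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (fsync(fd) < 0) {   /* triggers a flush request on the device */
            perror("fsync");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }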
Re: [ceph-users] ceph Performance random write is more then sequential
Yes. So far I have tried both options, and in both cases I am able to get better sequential performance than random (as explained by Somnath). *But* the performance numbers (iops, mbps) are way lower than with the default option; I can understand that, as ceph is dealing with 1000 times more objects than with the default option. So keeping this in mind, I am running performance tests for random only and leaving out the sequential tests. Still not sure how the reports available on the internet from Intel and Mellanox show good numbers for sequential write; maybe they have enabled the cache. http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf Thanks sumit On Thu, Feb 5, 2015 at 2:09 PM, Alexandre DERUMIER aderum...@odiso.com wrote: Hi, What I saw after enabling RBD cache it is working as expected, means sequential write has better MBps than random write. can somebody explain this behaviour ? This is because rbd_cache merge coalesced ios in bigger ios, so it's working only with sequential workload. you'll do less ios but bigger ios to ceph, so less cpus, - Mail original - De: Sumit Gaur sumitkg...@gmail.com À: Florent MONTHEL fmont...@flox-arts.net Cc: ceph-users ceph-users@lists.ceph.com Envoyé: Lundi 2 Février 2015 03:54:36 Objet: Re: [ceph-users] ceph Performance random write is more then sequential Hi All, What I saw after enabling RBD cache it is working as expected, means sequential write has better MBps than random write. can somebody explain this behaviour ? Is RBD cache setting must for ceph cluster to behave normally ? Thanks sumit On Mon, Feb 2, 2015 at 9:59 AM, Sumit Gaur sumitkg...@gmail.com wrote: Hi Florent, Cache tiering , No . ** Our Architecture : vdbench/FIO inside VM -- RBD without cache - Ceph Cluster (6 OSDs + 3 Mons) Thanks sumit [root@ceph-mon01 ~]# ceph -s cluster 47b3b559-f93c-4259-a6fb-97b00d87c55a health HEALTH_WARN clock skew detected on mon.ceph-mon02, mon.ceph-mon03 monmap e1: 3 mons at {ceph-mon01= 192.168.10.19:6789/0,ceph-mon02=192.168.10.20:6789/0,ceph-mon03=192.168.10.21:6789/0 }, election epoch 14, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 osdmap e603: 36 osds: 36 up, 36 in pgmap v40812: 5120 pgs, 2 pools, 179 GB data, 569 kobjects 522 GB used, 9349 GB / 9872 GB avail 5120 active+clean On Mon, Feb 2, 2015 at 12:21 AM, Florent MONTHEL fmont...@flox-arts.net wrote: BQ_BEGIN Hi Sumit Do you have cache pool tiering activated ? Some feed-back regarding your architecture ? Thanks Sent from my iPad On 1 févr. 2015, at 15:50, Sumit Gaur sumitkg...@gmail.com wrote: Hi I have installed 6 node ceph cluster and to my surprise when I ran rados bench I saw that random write has more performance number then sequential write. This is opposite to normal disk write. Can some body let me know if I am missing any ceph Architecture point here ? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com BQ_END ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
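For a standalone librbd test client (i.e. outside qemu, which normally picks the cache settings up from the [client] section of ceph.conf), the RBD cache can be toggled per process before connecting; a hedged sketch with made-up pool/image names, useful for comparing sequential vs random writes with and without write merging:

    #include <rados/librados.h>
    #include <rbd/librbd.h>

    int open_cached_image(rados_t *cluster, rados_ioctx_t *io, rbd_image_t *image)
    {
        int r = rados_create(cluster, NULL);
        if (r < 0)
            return r;
        rados_conf_read_file(*cluster, NULL);

        /* same effect as "rbd cache = true" in the [client] section of ceph.conf */
        rados_conf_set(*cluster, "rbd_cache", "true");
        rados_conf_set(*cluster, "rbd_cache_size", "33554432");        /* 32 MB */
        rados_conf_set(*cluster, "rbd_cache_max_dirty", "25165824");   /* 24 MB */

        if ((r = rados_connect(*cluster)) < 0)
            return r;
        if ((r = rados_ioctx_create(*cluster, "bench", io)) < 0)
            return r;
        return rbd_open(*io, "testimage", image, NULL);
    }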
Re: [ceph-users] Question about output message and object update for ceph class
I take back the question, because I just found that for a successful write operation in the class there is *no* data in the out buffer... On Wed, Feb 4, 2015 at 5:44 PM, Dennis Chen kernel.org@gmail.com wrote: Hello, I wrote a ceph client using the rados lib to execute a function upon an object.

CLIENT SIDE CODE ===

int main() {
    ...
    strcpy(in, "from client");
    err = rados_exec(io, objname, "devctl", "devctl_op", in, strlen(in), out, 128);
    if (err < 0) {
        fprintf(stderr, "rados_exec() failed: %s\n", strerror(-err));
        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        exit(1);
    }
    out[err] = '\0';
    printf("err = %d, exec result out = %s, in = %s\n", err, out, in);
    ...
}

CLASS CODE IN OSD SIDE ==

static int devctl_op(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
    ...
    i = cls_cxx_stat(hctx, &size, NULL);
    if (i < 0)
        return i;
    bufferlist read_bl, write_bl;
    i = cls_cxx_read(hctx, 0, size, &read_bl);
    if (i < 0) {
        CLS_ERR("cls_cxx_read failed");
        return i;
    }
    // we generate our reply
    out->append("Hello, ");
    if (in->length() == 0)
        out->append("world");
    else
        out->append(*in);
    out->append("!");
#if 1
    const char *tstr = "from devctl func";
    write_bl.append(tstr);
    i = cls_cxx_write(hctx, size, write_bl.length(), &write_bl);
    if (i < 0) {
        CLS_ERR("cls_cxx_write failed: %s", strerror(-i));
        return i;
    }
#endif
    // this return value will be returned back to the librados caller
    return 0;
}

I found that if I update the content of the object by calling cls_cxx_write(), then 'out' will be null on the client side; otherwise out will be "Hello, from client!". Can anybody here give some hints? -- Den -- Den ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Supermicro hardware recommendation
Hi Marco, On 04.02.2015 10:20, Colombo Marco wrote: ... We chose the 6TB disks because we need a lot of storage in a small number of servers, and we prefer servers without too many disks. However, we plan to use at most 80% of a 6TB disk. 80% is too much! You will run into trouble. Ceph doesn't distribute the data evenly; sometimes I see a difference of 20% in the usage between OSDs. I recommend 60-70% as the maximum. Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Supermicro hardware recommendation
Hello, On Wed, 4 Feb 2015 09:20:24 + Colombo Marco wrote: Hi Christian, On 04/02/15 02:39, Christian Balzer ch...@gol.com wrote: On Tue, 3 Feb 2015 15:16:57 + Colombo Marco wrote: Hi all, I have to build a new Ceph storage cluster. After I've read the hardware recommendations and some mail from this mailing list, I would like to buy these servers: Nick mentioned a number of things already I totally agree with, so don't be surprised if some of this feels like a repeat. OSD: SSG-6027R-E1R12L - http://www.supermicro.nl/products/system/2U/6027/SSG-6027R-E1R12L.cfm Intel Xeon e5-2630 v2 64 GB RAM As Nick said, v3 and more RAM might be helpful, depending on your use case (small writes versus large ones) even faster CPUs as well. Ok, we switch from v2 to v3 and from 64 to 96 GB of RAM. LSI 2308 IT 2 x SSD Intel DC S3700 400GB 2 x SSD Intel DC S3700 200GB Why the separation of SSDs? They aren't going to be that busy with regards to the OS. We would like to use the 400GB SSDs for a cache pool, and the 200GB SSDs for the journaling. Don't, at least not like that. First and foremost, SSD based OSDs/pools have different requirements, especially when it comes to CPU. Mixing your HDD and SSD based OSDs in the same chassis is generally a bad idea. If you really want to use SSD based OSDs, go at least with Giant, probably better even to wait for Hammer. Otherwise your performance will be nowhere near the investment you're making. Read up in the ML archives about SSD based clusters and their performance, as well as cache pools. Which brings us to the second point: cache pools are pretty pointless currently when it comes to performance. So unless you're planning to use EC pools, you will gain very little from them. Lastly, if you still want to do SSD based OSDs, go for something like this: http://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-DC0TR.cfm Add the fastest CPUs you can afford and voila, instant SSD based cluster (replication of 2 should be fine with DC S3700). Now with _this_ particular type of nodes, you might want to consider 40GbE links (front and back-end). Get a case like Nick mentioned with 2 x 2.5" bays in the back, put 2 DC S3700 400GBs in there (connected to onboard 6Gb/s SATA3), partition them so that you have a RAID1 for the OS and plain partitions for the journals of the now 12 OSD HDDs in your chassis. Of course this optimization in terms of cost and density comes with a price: if one SSD should fail, you will have 6 OSDs down. Given how reliable the Intels are this is unlikely, but something you need to consider. If you want to limit the impact of a SSD failure and have just 2 OSD journals per SSD, get a chassis like the one above and 4 DC S3700 200GB, RAID10 them for the OS and put 2 journal partitions on each. I did the same with 8 3TB HDDs and 4 DC S3700 100GB; the HDDs (and CPU with 4KB IOPS) are the limiting factor, not the SSDs. 8 x HDD Seagate Enterprise 6TB Are you really sure you need that density? One disk failure will result in a LOT of data movement once these become somewhat full. If you were to go for a 12 OSD node as described above, consider 4TB ones for the same overall density, while having more IOPS and likely the same price or less. We chose the 6TB disks because we need a lot of storage in a small number of servers, and we prefer servers without too many disks. However, we plan to use at most 80% of a 6TB disk. Fewer disks, fewer IOPS, less bandwidth. Reducing the number of servers (which are a fixed cost after all) is understandable. 
But you have an option up there that gives you the same density as with the 6TB disks, but with a significantly improved performance. 2 x 40GbE for backend network You'd be lucky to write more that 800MB/s sustained to your 8 HDDs (remember they will have to deal with competing reads and writes, this is not a sequential synthetic write benchmark). Incidentally 1GB/s to 1.2GB/s (depending on configuration) would also be the limit of your journal SSDs. Other than backfilling caused by cluster changes (OSD removed/added), your limitation is nearly always going to be IOPS, not bandwidth. Ok, after some discussion, we switch to 2 x 10 GbE. So 2x10GbE or if you're comfortable with it (I am ^o^) an Infiniband backend (can be cheaper, less latency, plans for RDMA support in Ceph) should be more than sufficient. 2 x 10GbE for public network META/MON: SYS-6017R-72RFTP - http://www.supermicro.com/products/system/1U/6017/SYS-6017R-72RFTP.cfm 2 x Intel Xeon e5-2637 v2 4 x SSD Intel DC S3500 240GB raid 1+0 You're likely to get better performance and of course MUCH better durability by using 2 DC S3700, at about the same price. Ok we switch to 2 x SSD DC S3700 128 GB RAM Total overkill for a MON, but I have no idea about MDS and RAM never hurts. Ok we switch from 128 to 96 Don't take my
Re: [ceph-users] ceph Performance random write is more then sequential
Hi, What I saw after enabling RBD cache it is working as expected, means sequential write has better MBps than random write. can somebody explain this behaviour ? This is because rbd_cache merge coalesced ios in bigger ios, so it's working only with sequential workload. you'll do less ios but bigger ios to ceph, so less cpus, - Mail original - De: Sumit Gaur sumitkg...@gmail.com À: Florent MONTHEL fmont...@flox-arts.net Cc: ceph-users ceph-users@lists.ceph.com Envoyé: Lundi 2 Février 2015 03:54:36 Objet: Re: [ceph-users] ceph Performance random write is more then sequential Hi All, What I saw after enabling RBD cache it is working as expected, means sequential write has better MBps than random write. can somebody explain this behaviour ? Is RBD cache setting must for ceph cluster to behave normally ? Thanks sumit On Mon, Feb 2, 2015 at 9:59 AM, Sumit Gaur sumitkg...@gmail.com wrote: Hi Florent, Cache tiering , No . ** Our Architecture : vdbench/FIO inside VM -- RBD without cache - Ceph Cluster (6 OSDs + 3 Mons) Thanks sumit [root@ceph-mon01 ~]# ceph -s cluster 47b3b559-f93c-4259-a6fb-97b00d87c55a health HEALTH_WARN clock skew detected on mon.ceph-mon02, mon.ceph-mon03 monmap e1: 3 mons at {ceph-mon01= 192.168.10.19:6789/0,ceph-mon02=192.168.10.20:6789/0,ceph-mon03=192.168.10.21:6789/0 }, election epoch 14, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 osdmap e603: 36 osds: 36 up, 36 in pgmap v40812: 5120 pgs, 2 pools, 179 GB data, 569 kobjects 522 GB used, 9349 GB / 9872 GB avail 5120 active+clean On Mon, Feb 2, 2015 at 12:21 AM, Florent MONTHEL fmont...@flox-arts.net wrote: BQ_BEGIN Hi Sumit Do you have cache pool tiering activated ? Some feed-back regarding your architecture ? Thanks Sent from my iPad On 1 févr. 2015, at 15:50, Sumit Gaur sumitkg...@gmail.com wrote: Hi I have installed 6 node ceph cluster and to my surprise when I ran rados bench I saw that random write has more performance number then sequential write. This is opposite to normal disk write. Can some body let me know if I am missing any ceph Architecture point here ? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com BQ_END ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com