Re: CEPH Erasure Encoding + OSD Scalability
Hi Andreas,

That sounds reasonable. Would you be so kind as to send a patch with your changes? I'll rework it into something that fits the test infrastructure of Ceph.

Cheers

On 22/09/2013 09:26, Andreas Joachim Peters wrote:

Hi Loic,

I will run a benchmark with the changed code tomorrow ... I actually had to insert some of my realtime benchmark macros into your Jerasure code to see the different time fractions of the buffer-preparation and encoding steps, but for your QA suite it is probably enough to get a total value after your fix. I will send you a program sampling the performance at different buffer sizes and encoding types.

I changed my code to use vector operations (128-bit XORs) and it gives another 10% gain. I also want to try out whether it makes sense to do the CRC32C computation in-line in the encoding step, and compare that with the two-step procedure of first encoding all blocks, then running CRC32C over all blocks.

Cheers Andreas.

From: Loic Dachary [l...@dachary.org]
Sent: 21 September 2013 17:11
To: Andreas Joachim Peters
Cc: ceph-devel@vger.kernel.org
Subject: Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

It's probably too soon to be smart about reducing the number of copies, but you're right: this copy is not necessary. The following pull request gets rid of it: https://github.com/ceph/ceph/pull/615

Cheers

On 20/09/2013 18:49, Loic Dachary wrote:

Hi,

This is a first attempt at avoiding an unnecessary copy: https://github.com/dachary/ceph/blob/03445a5926cd073c11cd8693fb110729e40f35fa/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L66

I'm not sure how it could be made more readable / terse with bufferlist iterators. Any kind of hint would be welcome :-)

Cheers

On 20/09/2013 17:36, Sage Weil wrote:

On Fri, 20 Sep 2013, Loic Dachary wrote:

Hi Andreas,

Great work on these benchmarks! They're definitely an incentive to improve as much as possible. Could you push / send the scripts and sequence of operations you've used? I'll reproduce this locally while getting rid of the extra copy. It would be useful to capture that into a script that can be conveniently run from the teuthology integration tests to check against performance regressions.

Regarding the 3P implementation, in my opinion it would be very valuable for people who prefer low CPU consumption. And I'm eager to see more than one plugin in the erasure code plugin directory ;-)

One way to approach this might be to make a bufferlist 'multi-iterator' that you give bufferlist::iterator's and that gives you back a pointer and length for each contiguous segment. This would capture the annoying iterator details and let the user focus on processing chunks that are as large as possible.

sage

Cheers

On 20/09/2013 13:35, Andreas Joachim Peters wrote:

Hi Loic,

I now have some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4 (-O2) for ENCODING, based on the Ceph Jerasure port. I measured objects from 128k to 512 MB with random contents (if you encode 1 GB objects you see slow-downs due to caching inefficiencies ...); otherwise the results are stable for the given object sizes.

I quote only the benchmark for ErasureCodeJerasureReedSolomonRAID6 (3,2) (the other algorithms are significantly slower, 2-3x) and my 3P (3,2,1) implementation, which provides the same redundancy level as RS-RAID6 [3,2] (double disk failure) but uses more space (100% vs 66% overhead). The effect of out.c_str() is significant: it contributes a factor 2 slow-down for the best jerasure algorithm for [3,2].
Averaged results for object size 4MB:

1) Erasure CRS [3,2]: 2.6 ms buffer preparation (out.c_str()) + 2.4 ms encoding = ~780 MB/s
2) 3P [3,2,1]: 0.005 ms buffer preparation (3P adjusts the padding in the algorithm) + 0.87 ms encoding = ~4.4 GB/s

I think it pays off to avoid the copy in the encoding, if that does not matter for the buffer handling upstream, and to pad only the last chunk.

The last thing I tested is how performance scales with the number of cores, running 4 tests in parallel:

Jerasure (3,2) tops out at ~2.0 GB/s on a 4-core CPU (Xeon 2.27 GHz).
3P (3,2,1) tops out at ~8 GB/s on a 4-core CPU (Xeon 2.27 GHz).

I also implemented the decoding for 3P, but haven't yet tested all reconstruction cases. There is probably room for improvement using AVX support for the XOR operations in both implementations.

Before I invest more time: do you think it is useful to have this fast 3P algorithm for double disk failures with 100% space overhead? I believe that people will always optimize for space and would rather use something like (10,2), even if the performance degrades and CPU consumption goes up?!? Let me know, no problem in any case!

Finally, I tested several combinations for ErasureCodeJerasureReedSolomonRAID6: (3,2) (4,2) (6,2) (8,2) (10,2) all run at around 780-800 MB/s.

Cheers
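For illustration, here is a minimal sketch of the kind of 128-bit vector XOR loop described above, using SSE2 intrinsics. The function name and alignment assumptions (16-byte aligned buffers, length a multiple of 16) are illustrative only, not the actual 3P or jerasure code:

#include <emmintrin.h> /* SSE2 */
#include <stddef.h>

/* XOR 'src' into 'dst' 16 bytes at a time, the basic building block of
 * parity generation; assumes both buffers are 16-byte aligned and len
 * is a multiple of 16. */
static void xor_region_128(unsigned char *dst, const unsigned char *src,
                           size_t len)
{
    size_t i;

    for (i = 0; i < len; i += 16) {
        __m128i a = _mm_load_si128((const __m128i *)(dst + i));
        __m128i b = _mm_load_si128((const __m128i *)(src + i));
        _mm_store_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
}

The in-line CRC32C idea mentioned above would fold a checksum update (for example via the SSE4.2 crc32 instruction) into the same pass, so each block is read from memory only once instead of twice.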
Re: [ceph-users] Ceph write performance and my Dell R515's
On 09/22/2013 03:12 AM, Quenten Grasso wrote:

Hi All,

I'm finding my write performance is less than I would have expected. After spending a considerable amount of time testing several different configurations, I can never seem to break ~360MB/s writes, even when using tmpfs for journaling.

I've purchased 3x Dell R515's, each with 1 x AMD 6-core CPU, 12 x 3TB SAS, 2 x 100GB Intel DC S3700 SSDs, 32GB RAM, the PERC H710p RAID controller and dual-port 10GbE network cards.

First up, I realise the SSDs were a mistake; I should have bought the 200GB ones, as they have considerably better write throughput (~375 MB/s vs 200 MB/s).

Now to the node configuration: 2 x 3TB disks in RAID1 for OS/MON plus 1 partition for an OSD, and 12 disks each in a single-disk RAID0 (JBOD fashion) with a 1MB stripe size. (The stripe size was particularly important: I found it matters considerably even on a single-disk RAID0, contrary to what you might read on the internet.) Each disk is also configured with write-back cache enabled and read-ahead disabled.

For networking, all nodes are connected via an LACP bond with L3 hashing, and using iperf I can get up to 16gbit/s tx and rx between the nodes.

OS: Ubuntu 12.04.3 LTS w/ kernel 3.10.12-031012-generic (had to upgrade the kernel due to 10Gbit Intel NIC driver issues).

So this gives me 11 OSDs + 2 SSDs per node.

I'm a bit leery about that 1 OSD on the RAID1. It may be fine, but you definitely will want to do some investigation to make sure that OSD isn't holding the other ones back. iostat or collectl might be useful, along with the ceph osd admin socket and the dump_ops_in_flight and dump_historic_ops commands.

Next I've tried several different configurations, 2 of which I'll briefly describe below.

1) Cluster configuration 1: 33 OSDs with 6x SSDs as journals, w/ 15GB journals on SSD.

# ceph osd pool create benchmark1 1800 1800
# rados bench -p benchmark1 180 write --no-cleanup

Maintaining 16 concurrent writes of 4194304 bytes for up to 180 seconds or 0 objects
Total time run:         180.250417
Total writes made:      10152
Write size:             4194304
Bandwidth (MB/sec):     225.287
Stddev Bandwidth:       35.0897
Max bandwidth (MB/sec): 312
Min bandwidth (MB/sec): 0
Average Latency:        0.284054
Stddev Latency:         0.199075
Max latency:            1.46791
Min latency:            0.038512

What was your pool replication set to?

# rados bench -p benchmark1 180 seq

Total time run:        43.782554
Total reads made:      10120
Read size:             4194304
Bandwidth (MB/sec):    924.569
Average Latency:       0.0691903
Max latency:           0.262542
Min latency:           0.015756

In this configuration I found my write performance suffers a lot; the SSDs seem to be the bottleneck, and my write performance using rados bench was around 224-230MB/s.

2) Cluster configuration 2: 33 OSDs with 1GByte journals on tmpfs.

# ceph osd pool create benchmark1 1800 1800
# rados bench -p benchmark1 180 write --no-cleanup

Maintaining 16 concurrent writes of 4194304 bytes for up to 180 seconds or 0 objects
Total time run:         180.044669
Total writes made:      15328
Write size:             4194304
Bandwidth (MB/sec):     340.538
Stddev Bandwidth:       26.6096
Max bandwidth (MB/sec): 380
Min bandwidth (MB/sec): 0
Average Latency:        0.187916
Stddev Latency:         0.0102989
Max latency:            0.336581
Min latency:            0.034475

Definitely low, especially with journals on tmpfs. :( How are the CPUs doing at this point? We have some R515s in our lab, and they definitely are slow too. Ours have 7 OSD disks and 1 Dell-branded SSD (usually unused) each, and can do about ~150MB/s writes per system.
It's actually a puzzle we've been trying to solve for quite some time. Some thoughts:

- Could the expander backplane be having issues due to having to tunnel STP for the SATA SSDs (or potentially be causing expander-wide resets)?
- Could the H700 (and apparently H710) be doing something unusual that the stock LSI firmware handles better? We replaced the H700 with an Areca 1880 and definitely saw changes in performance (better large-IO throughput and worse IOPS), but the performance was still much lower than in a Supermicro node with no expanders in the backplane using either an LSI 2208 or Areca 1880.

Things you might want to try:

- Single-node tests and, if you have an alternate controller you can try, seeing if that works better.
- Removing the S3700s from the chassis entirely and retrying the tmpfs journal tests.
- Since the H710 is SAS2208 based, you may be able to use megacli to set it into JBOD mode and see if that works any better (it may if you are using SSD or tmpfs backed journals): MegaCli -AdpSetProp -EnableJBOD
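For reference, the per-OSD admin socket commands suggested above are typically run on the OSD host like this (the socket path and OSD id are illustrative):

# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops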
Re: [ceph-users] Ceph write performance and my Dell R515's
On Sun, Sep 22, 2013 at 07:40:24AM -0500, Mark Nelson wrote:

On 09/22/2013 03:12 AM, Quenten Grasso wrote:

Hi All, I'm finding my write performance is less than I would have expected. [hardware and cluster configuration details snipped; quoted in full in the message above] For networking, all nodes are connected via an LACP bond with L3 hashing, and using iperf I can get up to 16gbit/s tx and rx between the nodes.

I'm a bit leery about that 1 OSD on the RAID1. It may be fine, but you definitely will want to do some investigation to make sure that OSD isn't holding the other ones back. iostat or collectl might be useful, along with the ceph osd admin socket and the dump_ops_in_flight and dump_historic_ops commands.

I was wondering if the latency on the network was OK; I wondered if there was some kind of LACP bonding problem, or the L3 hashing causing trouble. ifstat or iptraf, or graphs of SNMP data from the switch, might show you where the traffic went. I did my last tests with nping echo mode from nmap, using --rate to run latency tests under network load. It generates a lot of output, which slows it down a bit, so you might want to redirect it somewhere.

1) Cluster configuration 1: 33 OSDs with 6x SSDs as journals, w/ 15GB journals on SSD. [rados bench results snipped: ~225 MB/s write, ~925 MB/s sequential read]

What was your pool replication set to?

2) Cluster configuration 2: 33 OSDs with 1GByte journals on tmpfs. [rados bench results snipped: ~340 MB/s write]

Definitely low, especially with journals on tmpfs. :( How are the CPUs doing at this point?

I'm no expert, but I did notice the tmpfs journals were only 1GB; that seems kinda small. But the systems didn't have a lot more memory, so there wasn't much choice. Even if you make them slightly larger, it will cut into the memory available for the filesystem cache. That might be a bad idea as well, I guess.

We have some R515s in our lab, and they definitely are slow too. Ours have 7 OSD disks and 1 Dell-branded SSD (usually unused) each, and can do about ~150MB/s writes per system. It's actually a puzzle we've been trying to solve for quite some time. Some thoughts: Could the [remainder of Mark's reply snipped; quoted in full above]
Hiding auth key string for the qemu process
Hello,

Since it has been a long time since cephx was enabled by default, and we may assume that everyone is using it, it seems worthwhile to introduce a bit of code hiding the key from the command line. The first applicable place for such an improvement is most likely OpenStack environments, with their sparse security and use of the admin key as the default one.
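For context, the usual technique for hiding a secret passed on the command line is to keep a private copy and then overwrite the argv bytes, so the secret no longer appears in /proc/<pid>/cmdline. A minimal standalone sketch of that technique (illustrative only, not actual qemu or Ceph code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    char *key;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <key>\n", argv[0]);
        return 1;
    }
    key = strdup(argv[1]);                  /* keep a private copy */
    memset(argv[1], 'x', strlen(argv[1]));  /* scrub the visible cmdline */
    /* ... authenticate using 'key' ... */
    memset(key, 0, strlen(key));            /* wipe the copy when done */
    free(key);
    return 0;
}

Note there is still a window between exec and the scrub where the key is visible, so passing the key via a keyring file remains the safer option.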
RE: CEPH Erasure Encoding + OSD Scalability
Hi Loic,

I was applying the changes and the situation improves. However, there is still one important thing which actually dominated all the measurements that need larger packet sizes (everything besides RAID6): pad_in_length(unsigned in_length). The implementation is sort of 'unlucky' and slow when one increases the packetsize:

while (in_length%(k*w*packetsize*sizeof(int)) != 0) in_length++;

Better to do it like this (see the sketch after this message):

alignment = k*w*packetsize*sizeof(int);
if (in_length % alignment)
    in_length += alignment - (in_length % alignment);

E.g. for the CauchyGood algorithm one should increase the packetsize, and when changing the pad_in_length implementation one gets excellent (pure encoding) performance for (3+2): 2.6 GB/s, and it scales well with the number of cores to 8 GB/s. I compared this with the output of the 'encode' example from the jerasure sources and it gives the same result for (3+2), so that now looks good and consistent! (10,4) is ~610 MB/s. ...

Finally, the description of Jerasure 2.0 looks really great and will probably shift all the performance problems upstream ;-) Do you perhaps want to add support in the plugin for local parities (like Xorbas does) to improve the disk-draining performance?

Cheers Andreas.
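As a self-contained version of the fix sketched above (the function name and parameters mirror the plugin's pad_in_length, but the exact signature is assumed):

#include <stddef.h>

/* Round in_length up to the next multiple of k*w*packetsize*sizeof(int).
 * Unlike the increment loop this runs in constant time, and the guard
 * leaves already-aligned lengths unchanged, matching the old behaviour. */
static size_t pad_in_length(size_t in_length, int k, int w, int packetsize)
{
    size_t alignment = (size_t)k * w * packetsize * sizeof(int);
    size_t remainder = in_length % alignment;

    if (remainder != 0)
        in_length += alignment - remainder;
    return in_length;
}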
[PATCH] ceph: fix sync read eof check deadlock
As Yan, Zheng said, commit 0913444208db introduced a bug: getattr needs to read-lock the inode's filelock, but the lock can be in an unstable state. The getattr request waits for the lock's state to become stable, while the lock waits for the client to release the Fr cap. Commit 6a026589ba333185c466c90 resolved the same bug earlier. Before doing the getattr, we must put the caps already held to avoid this deadlock.

Reported-by: Yan, Zheng zheng.z@intel.com
Signed-off-by: Jianpeng Ma majianp...@gmail.com
---
 fs/ceph/file.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 7da35c7..bc00ace 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -839,7 +839,15 @@ again:
 		ret = ceph_sync_read(iocb, &i, &checkeof);
 		if (checkeof && ret >= 0) {
-			int statret = ceph_do_getattr(inode,
+			int statret;
+			/*
+			 * Before getattr, it should put caps to avoid
+			 * deadlock.
+			 */
+			ceph_put_cap_refs(ci, got);
+			got = 0;
+
+			statret = ceph_do_getattr(inode,
 						  CEPH_STAT_CAP_SIZE);

 			/* hit EOF or hole? */
@@ -851,16 +859,23 @@ again:
 				read += ret;
 				checkeof = 0;
-				goto again;
+				ret = ceph_get_caps(ci, CEPH_CAP_FILE_RD,
+						    want, &got, -1);
+				if (ret < 0)
+					ret = 0;
+				else
+					goto again;
 			}
 		}
 	} else
 		ret = generic_file_aio_read(iocb, iov, nr_segs, pos);

-	dout("aio_read %p %llx.%llx dropping cap refs on %s = %d\n",
-	     inode, ceph_vinop(inode), ceph_cap_string(got), (int)ret);
-	ceph_put_cap_refs(ci, got);
+	if (got) {
+		dout("aio_read %p %llx.%llx dropping cap refs on %s = %d\n",
+		     inode, ceph_vinop(inode), ceph_cap_string(got), (int)ret);
+		ceph_put_cap_refs(ci, got);
+	}

 	if (ret >= 0)
 		ret += read;
--
1.8.4-rc0
Re: [PATCH 1/3] ceph: queue cap release in __ceph_remove_cap()
Reviewed-by: Sage Weil s...@inktank.com

On Sun, 22 Sep 2013, Yan, Zheng wrote:

From: Yan, Zheng zheng.z@intel.com

call __queue_cap_release() in __ceph_remove_cap(), this avoids acquiring s_cap_lock twice.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c       | 21 +++++++++++----------
 fs/ceph/mds_client.c |  6 ++----
 fs/ceph/super.h      |  8 +-------
 3 files changed, 14 insertions(+), 21 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 13976c3..d2d6e40 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -897,7 +897,7 @@ static int __ceph_is_any_caps(struct ceph_inode_info *ci)
  * caller should hold i_ceph_lock.
  * caller will not hold session s_mutex if called from destroy_inode.
  */
-void __ceph_remove_cap(struct ceph_cap *cap)
+void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
 {
 	struct ceph_mds_session *session = cap->session;
 	struct ceph_inode_info *ci = cap->ci;
@@ -909,6 +909,10 @@ void __ceph_remove_cap(struct ceph_cap *cap)
 	/* remove from session list */
 	spin_lock(&session->s_cap_lock);
+	if (queue_release)
+		__queue_cap_release(session, ci->i_vino.ino, cap->cap_id,
+				    cap->mseq, cap->issue_seq);
+
 	if (session->s_cap_iterator == cap) {
 		/* not yet, we are iterating over this very cap */
 		dout("__ceph_remove_cap delaying %p removal from session %p\n",
@@ -1023,7 +1027,6 @@ void __queue_cap_release(struct ceph_mds_session *session,
 	struct ceph_mds_cap_release *head;
 	struct ceph_mds_cap_item *item;

-	spin_lock(&session->s_cap_lock);
 	BUG_ON(!session->s_num_cap_releases);
 	msg = list_first_entry(&session->s_cap_releases,
 			       struct ceph_msg, list_head);
@@ -1052,7 +1055,6 @@ void __queue_cap_release(struct ceph_mds_session *session,
 		     (int)CEPH_CAPS_PER_RELEASE,
 		     (int)msg->front.iov_len);
 	}
-	spin_unlock(&session->s_cap_lock);
 }

 /*
@@ -1067,12 +1069,8 @@ void ceph_queue_caps_release(struct inode *inode)
 	p = rb_first(&ci->i_caps);
 	while (p) {
 		struct ceph_cap *cap = rb_entry(p, struct ceph_cap, ci_node);
-		struct ceph_mds_session *session = cap->session;
-
-		__queue_cap_release(session, ceph_ino(inode), cap->cap_id,
-				    cap->mseq, cap->issue_seq);
 		p = rb_next(p);
-		__ceph_remove_cap(cap);
+		__ceph_remove_cap(cap, true);
 	}
 }

@@ -2791,7 +2789,7 @@ static void handle_cap_export(struct inode *inode, struct ceph_mds_caps *ex,
 			}
 			spin_unlock(&mdsc->cap_dirty_lock);
 		}
-		__ceph_remove_cap(cap);
+		__ceph_remove_cap(cap, false);
 	}
 	/* else, we already released it */

@@ -2931,9 +2929,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 	if (!inode) {
 		dout(" i don't have ino %llx\n", vino.ino);

-		if (op == CEPH_CAP_OP_IMPORT)
+		if (op == CEPH_CAP_OP_IMPORT) {
+			spin_lock(&session->s_cap_lock);
 			__queue_cap_release(session, vino.ino, cap_id,
 					    mseq, seq);
+			spin_unlock(&session->s_cap_lock);
+		}
 		goto flush_cap_releases;
 	}

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f51ab26..8f8f5c0 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -986,7 +986,7 @@ static int remove_session_caps_cb(struct inode *inode, struct ceph_cap *cap,
 	dout("removing cap %p, ci is %p, inode is %p\n",
 	     cap, ci, &ci->vfs_inode);
 	spin_lock(&ci->i_ceph_lock);
-	__ceph_remove_cap(cap);
+	__ceph_remove_cap(cap, false);
 	if (!__ceph_is_any_real_caps(ci)) {
 		struct ceph_mds_client *mdsc =
 			ceph_sb_to_client(inode->i_sb)->mdsc;
@@ -1231,9 +1231,7 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
 	session->s_trim_caps--;
 	if (oissued) {
 		/* we aren't the only cap.. just remove us */
-		__queue_cap_release(session, ceph_ino(inode), cap->cap_id,
-				    cap->mseq, cap->issue_seq);
-		__ceph_remove_cap(cap);
+		__ceph_remove_cap(cap, true);
 	} else {
 		/* try to drop referring dentries */
 		spin_unlock(&ci->i_ceph_lock);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index a538b51..8de94b5 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -742,13 +742,7 @@ extern int ceph_add_cap(struct inode *inode,
 			int fmode, unsigned issued, unsigned wanted,
 			unsigned cap, unsigned seq, u64 realmino, int flags,