[RESEND][PATCH 0/2] fix few root xattr bugs
The first patch fixes a bug that causes an MDS crash while setting or removing xattrs on the root directory. The second patch fixes another bug where root xattrs are not correctly logged in the MDS journal.

Kuan Kai Chiu (2):
  mds: fix setting/removing xattrs on root
  mds: journal the projected root xattrs in add_root()

 src/mds/Server.cc          | 6 ++----
 src/mds/events/EMetaBlob.h | 2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)

-- 
1.7.9.5
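For reference, a minimal way to exercise the paths the first patch touches (a sketch only; it assumes a CephFS client mounted at /mnt/cephfs, and the attribute name is arbitrary) is to set, read back and remove a user xattr directly on the root of the mounted filesystem:

  setfattr -n user.test -v somevalue /mnt/cephfs
  getfattr -n user.test /mnt/cephfs
  setfattr -x user.test /mnt/cephfs

Without the fix, these setxattr/removexattr requests against the root inode are what trigger the MDS crash described above.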
[PATCH 1/2] mds: fix setting/removing xattrs on root
MDS crashes while journaling dirty root inode in handle_client_setxattr and handle_client_removexattr. We should use journal_dirty_inode to safely log root inode here.

Signed-off-by: Kuan Kai Chiu <big.c...@bigtera.com>
---
 src/mds/Server.cc | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 11ab834..1e62dd2 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -3907,8 +3907,7 @@ void Server::handle_client_setxattr(MDRequest *mdr)
   mdlog->start_entry(le);
   le->metablob.add_client_req(req->get_reqid(), req->get_oldest_client_tid());
   mdcache->predirty_journal_parents(mdr, &le->metablob, cur, 0, PREDIRTY_PRIMARY, false);
-  mdcache->journal_cow_inode(mdr, &le->metablob, cur);
-  le->metablob.add_primary_dentry(cur->get_projected_parent_dn(), true, cur);
+  mdcache->journal_dirty_inode(mdr, &le->metablob, cur);
 
   journal_and_reply(mdr, cur, 0, le, new C_MDS_inode_update_finish(mds, mdr, cur));
 }
@@ -3964,8 +3963,7 @@ void Server::handle_client_removexattr(MDRequest *mdr)
   mdlog->start_entry(le);
   le->metablob.add_client_req(req->get_reqid(), req->get_oldest_client_tid());
   mdcache->predirty_journal_parents(mdr, &le->metablob, cur, 0, PREDIRTY_PRIMARY, false);
-  mdcache->journal_cow_inode(mdr, &le->metablob, cur);
-  le->metablob.add_primary_dentry(cur->get_projected_parent_dn(), true, cur);
+  mdcache->journal_dirty_inode(mdr, &le->metablob, cur);
 
   journal_and_reply(mdr, cur, 0, le, new C_MDS_inode_update_finish(mds, mdr, cur));
 }
-- 
1.7.9.5
[PATCH 2/2] mds: journal the projected root xattrs in add_root()
In EMetaBlob::add_root(), we should log the projected root xattrs instead of original ones to reflect xattr changes.

Signed-off-by: Kuan Kai Chiu <big.c...@bigtera.com>
---
 src/mds/events/EMetaBlob.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/mds/events/EMetaBlob.h b/src/mds/events/EMetaBlob.h
index 7065460..439bd78 100644
--- a/src/mds/events/EMetaBlob.h
+++ b/src/mds/events/EMetaBlob.h
@@ -468,7 +468,7 @@ private:
     if (!pi) pi = in->get_projected_inode();
     if (!pdft) pdft = &in->dirfragtree;
-    if (!px) px = &in->xattrs;
+    if (!px) px = in->get_projected_xattrs();
 
     bufferlist snapbl;
     if (psnapbl)
-- 
1.7.9.5
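A quick way to check the journaling side of this (a sketch, assuming a CephFS mount at /mnt/cephfs on a test cluster whose MDS you can freely restart) is to set an xattr on the root, restart the MDS so the value has to come back from journal replay, and read it back:

  setfattr -n user.test -v somevalue /mnt/cephfs
  # restart the active MDS (e.g. /etc/init.d/ceph restart mds) and remount the client, then:
  getfattr -n user.test /mnt/cephfs

Without this patch the xattr is gone after the restart, since only the original (non-projected) xattrs were logged.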
Re: RBD Read performance
On 04/17/2013 11:35 PM, Malcolm Haak wrote: Hi all, Hi Malcolm! I jumped into the IRC channel yesterday and they said to email ceph-devel. I have been having some read performance issues. With Reads being slower than writes by a factor of ~5-8. I recently saw this kind of behaviour (writes were fine, but reads were terrible) on an IPoIB based cluster and it was caused by the same TCP auto tune issues that Jim Schutt saw last year. It's worth a try at least to see if it helps. echo 0 /proc/sys/net/ipv4/tcp_moderate_rcvbuf on all of the clients and server nodes should be enough to test it out. Sage added an option in more recent Ceph builds that lets you work around it too. First info: Server SLES 11 SP2 Ceph 0.56.4. 12 OSD's that are Hardware Raid 5 each of the twelve is made from 5 NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s stream write and the same if not better read) Connected via 2xQDR IB OSD's/MDS and such all on same box (for testing) Box is a Quad AMD Opteron 6234 Ram is 256Gb 10GB Journals osd_op_theads: 8 osd_disk_threads:2 Filestore_op_threads:4 OSD's are all XFS Interesting setup! QUAD socket Opteron boxes have somewhat slow and slightly oversubscribed hypertransport links don't they? I wonder if on a system with so many disks and QDR-IB if that could become a problem... We typically like smaller nodes where we can reasonably do 1 OSD per drive, but we've tested on a couple of 60 drive chassis in RAID configs too. Should be interesting to hear what kind of aggregate performance you can eventually get. All nodes are connected via QDR IB using IP_O_IB. We get 1.7GB/s on TCP performance tests between the nodes. Clients: One is FC17 the other us Ubuntu 12.10 they only have around 32GB-70GB ram. We ran into an odd issue were the OSD's would all start in the same NUMA node and pretty much on the same processor core. We fixed that up with some cpuset magic. Strange! Was that more due to cpuset or Ceph? I can't imagine that we are doing anything that would cause that. Performance testing we have done: (Note oflag=direct was yielding results within 5% of cached results) root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200 3200+0 records in 3200+0 records out 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s root@ty3:~# root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800 4800+0 records in 4800+0 records out 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=2400 2400+0 records in 2400+0 records out 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=9600 9600+0 records in 9600+0 records out 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s Both clients each doing a 140GB write (2x dogbreath's RAM) at the same time to two different rbds in the same pool. root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 412.404 s, 356 MB/s root@ty3:~# [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 433.351 s, 339 MB/s [root@dogbreath ~]# Onto reads... 
Also we found that doing iflag=direct increased read performance. [root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M count=160 160+0 records in 160+0 records out 1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s [root@dogbreath ~]# [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=1 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 382.334 s, 110 MB/s [root@dogbreath ~]# [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=1 iflag=direct 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 150.774 s, 278 MB/s [root@dogbreath ~]# So what info do you want/where do I start hunting for my wumpus? might also be worth looking at the size of the reads to see if there's a lot of fragmentation. Also, is this kernel rbd or qemu-kvm? Regards Malcolm Haak -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
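For anyone wanting to try the autotuning workaround mentioned above, the test amounts to disabling receive-buffer moderation on every node (both clients and the OSD box), dropping caches, and re-running one of the read tests; a sketch, reusing the same test file:

  echo 0 > /proc/sys/net/ipv4/tcp_moderate_rcvbuf
  echo 1 > /proc/sys/vm/drop_caches
  dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=10000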
Re: [PATCH] mds: fix setting/removing xattrs on root
I didn't notice the bug. Guessing it was hidden because CephFS had been accessed by other daemons in my test environment. Thank you for the hint! The signed-off patches are resent, also including your fix. On Wed, Apr 17, 2013 at 4:06 AM, Gregory Farnum g...@inktank.com wrote: On Mon, Apr 15, 2013 at 3:23 AM, Kuan Kai Chiu big.c...@bigtera.com wrote: MDS crashes while journaling dirty root inode in handle_client_setxattr and handle_client_removexattr. We should use journal_dirty_inode to safely log root inode here. --- src/mds/Server.cc |6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/src/mds/Server.cc b/src/mds/Server.cc index 11ab834..1e62dd2 100644 --- a/src/mds/Server.cc +++ b/src/mds/Server.cc @@ -3907,8 +3907,7 @@ void Server::handle_client_setxattr(MDRequest *mdr) mdlog-start_entry(le); le-metablob.add_client_req(req-get_reqid(), req-get_oldest_client_tid()); mdcache-predirty_journal_parents(mdr, le-metablob, cur, 0, PREDIRTY_PRIMARY, false); - mdcache-journal_cow_inode(mdr, le-metablob, cur); - le-metablob.add_primary_dentry(cur-get_projected_parent_dn(), true, cur); + mdcache-journal_dirty_inode(mdr, le-metablob, cur); journal_and_reply(mdr, cur, 0, le, new C_MDS_inode_update_finish(mds, mdr, cur)); } @@ -3964,8 +3963,7 @@ void Server::handle_client_removexattr(MDRequest *mdr) mdlog-start_entry(le); le-metablob.add_client_req(req-get_reqid(), req-get_oldest_client_tid()); mdcache-predirty_journal_parents(mdr, le-metablob, cur, 0, PREDIRTY_PRIMARY, false); - mdcache-journal_cow_inode(mdr, le-metablob, cur); - le-metablob.add_primary_dentry(cur-get_projected_parent_dn(), true, cur); + mdcache-journal_dirty_inode(mdr, le-metablob, cur); journal_and_reply(mdr, cur, 0, le, new C_MDS_inode_update_finish(mds, mdr, cur)); } This is fine as far as it goes, but we'll need your sign-off for us to incorporate it into the codebase. Also, have you run any tests with it? The reason I ask is that when I apply this patch, set an xattr on the root inode, and then restart the MDS and client, there are no xattrs on the root any more. I think this should fix that, but there may be other such issues: diff --git a/src/mds/events/EMetaBlob.h b/src/mds/events/EMetaBlob.h index 7065460..439bd78 100644 --- a/src/mds/events/EMetaBlob.h +++ b/src/mds/events/EMetaBlob.h @@ -468,7 +468,7 @@ private: if (!pi) pi = in-get_projected_inode(); if (!pdft) pdft = in-dirfragtree; -if (!px) px = in-xattrs; +if (!px) px = in-get_projected_xattrs(); bufferlist snapbl; if (psnapbl) You've fallen victim to this new setup, incidentally — in the past the root inode wasn't allowed to get any of these modifications because it's not quite real in the way the rest of them are. We opened that up when we made the virtual xattr interface, but we weren't very careful about it so apparently we missed some side effects. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RBD Read performance
Hi Mark! Thanks for the quick reply! I'll reply inline below. On 18/04/13 17:04, Mark Nelson wrote: On 04/17/2013 11:35 PM, Malcolm Haak wrote: Hi all, Hi Malcolm! I jumped into the IRC channel yesterday and they said to email ceph-devel. I have been having some read performance issues. With Reads being slower than writes by a factor of ~5-8. I recently saw this kind of behaviour (writes were fine, but reads were terrible) on an IPoIB based cluster and it was caused by the same TCP auto tune issues that Jim Schutt saw last year. It's worth a try at least to see if it helps. echo 0 /proc/sys/net/ipv4/tcp_moderate_rcvbuf on all of the clients and server nodes should be enough to test it out. Sage added an option in more recent Ceph builds that lets you work around it too. Awesome I will test this first up tomorrow. First info: Server SLES 11 SP2 Ceph 0.56.4. 12 OSD's that are Hardware Raid 5 each of the twelve is made from 5 NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s stream write and the same if not better read) Connected via 2xQDR IB OSD's/MDS and such all on same box (for testing) Box is a Quad AMD Opteron 6234 Ram is 256Gb 10GB Journals osd_op_theads: 8 osd_disk_threads:2 Filestore_op_threads:4 OSD's are all XFS Interesting setup! QUAD socket Opteron boxes have somewhat slow and slightly oversubscribed hypertransport links don't they? I wonder if on a system with so many disks and QDR-IB if that could become a problem... We typically like smaller nodes where we can reasonably do 1 OSD per drive, but we've tested on a couple of 60 drive chassis in RAID configs too. Should be interesting to hear what kind of aggregate performance you can eventually get. We are also going to try this out with 6 luns on a dual xeon box. The Opteron box was the biggest scariest thing we had that was doing nothing. All nodes are connected via QDR IB using IP_O_IB. We get 1.7GB/s on TCP performance tests between the nodes. Clients: One is FC17 the other us Ubuntu 12.10 they only have around 32GB-70GB ram. We ran into an odd issue were the OSD's would all start in the same NUMA node and pretty much on the same processor core. We fixed that up with some cpuset magic. Strange! Was that more due to cpuset or Ceph? I can't imagine that we are doing anything that would cause that. More than likely it is an odd quirk in the SLES kernel.. but when I have time I'll do some more poking. We were seeing insane CPU usage on some cores because all the OSD's were piled up in one place. Performance testing we have done: (Note oflag=direct was yielding results within 5% of cached results) root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200 3200+0 records in 3200+0 records out 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s root@ty3:~# root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800 4800+0 records in 4800+0 records out 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=2400 2400+0 records in 2400+0 records out 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=9600 9600+0 records in 9600+0 records out 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s Both clients each doing a 140GB write (2x dogbreath's RAM) at the same time to two different rbds in the same pool. 
root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 412.404 s, 356 MB/s root@ty3:~# [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 433.351 s, 339 MB/s [root@dogbreath ~]# Onto reads... Also we found that doing iflag=direct increased read performance. [root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M count=160 160+0 records in 160+0 records out 1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s [root@dogbreath ~]# [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=1 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 382.334 s, 110 MB/s [root@dogbreath ~]# [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=1 iflag=direct 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 150.774 s, 278 MB/s [root@dogbreath ~]# So what info do you want/where do I start hunting for my wumpus? might also be worth looking at the size of the reads to see if there's a lot of fragmentation. Also, is this kernel rbd or
poor write performance
I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance at 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing.

My setup, approximately, is:

Two OSD's
. 1 x 7200RPM SATA disk each
. 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch)
. 1 x gigabit public network
. journal on another spindle

Three MON's
. 1 each on the OSD's
. 1 on another server, which is also the one used for testing performance

I'm using debian packages from ceph which are version 0.56.4.

For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators, which run Xen on top of (C)LVM volumes on top of the iSCSI. Performance is not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on.

Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far...

Thanks

James
Re: poor write performance
Hi James, This is just pure speculation, but can you assure that the bonding works correctly? Maybe you have issues there. I have seen a lot of incorrectly configured bonding throughout my life as unix admin. Maybe this could help you a little: http://www.wogri.at/Port-Channeling-802-3ad.338.0.html On 04/18/2013 01:46 PM, James Harper wrote: I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing. My setup, approximately, is: Two OSD's . 1 x 7200RPM SATA disk each . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch) . 1 x gigabit public network . journal on another spindle Three MON's . 1 each on the OSD's . 1 on another server, which is also the one used for testing performance I'm using debian packages from ceph which are version 0.56.4 For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far... Thanks James -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- DI (FH) Wolfgang Hennerbichler Software Development Unit Advanced Computing Technologies RISC Software GmbH A company of the Johannes Kepler University Linz IT-Center Softwarepark 35 4232 Hagenberg Austria Phone: +43 7236 3343 245 Fax: +43 7236 3343 250 wolfgang.hennerbich...@risc-software.at http://www.risc-software.at -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
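If it helps, a quick first check on the Linux side (a sketch, assuming the bond interface is named bond0) is to look at the kernel's view of the bond and the per-slave traffic counters, to confirm the bonding mode is what you expect and that both slaves are actually carrying traffic:

  cat /proc/net/bonding/bond0
  ip -s link show bond0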
[PATCH 7/7, v2] rbd: issue stat request before layered write
(Since this hasn't been reviewed I have updated it slightly. I rebased the series onto the current testing branch. They are all available in the review/wip-4679-3 in the ceph-client git repository. I also made some minor changes in the definition of rbd_img_obj_exists_callback()). This is a step toward fully implementing layered writes. Add checks before request submission for the object(s) associated with an image request. For write requests, if we don't know that the target object exists, issue a STAT request to find out. When that request completes, mark the known and exists flags for the original object request accordingly and re-submit the object request. (Note that this still does the existence check only; the copyup operation is not yet done.) A new object request is created to perform the existence check. A pointer to the original request is added to that object request to allow the stat request to re-issue the original request after updating its flags. If there is a failure with the stat request the error code is stored with the original request, which is then completed. This resolves: http://tracker.ceph.com/issues/3418 Signed-off-by: Alex Elder el...@inktank.com --- v2: rebased to testing; small cleanup in rbd_img_obj_exists_callback() drivers/block/rbd.c | 163 --- 1 file changed, 155 insertions(+), 8 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index b1b8ef8..ce2fb3a 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -183,9 +183,31 @@ struct rbd_obj_request { u64 length; /* bytes from offset */ unsigned long flags; - struct rbd_img_request *img_request; - u64 img_offset; /* image relative offset */ - struct list_headlinks; /* img_request-obj_requests */ + /* +* An object request associated with an image will have its +* img_data flag set; a standlone object request will not. +* +* A standalone object request will have which == BAD_WHICH +* and a null obj_request pointer. +* +* An object request initiated in support of a layered image +* object (to check for its existence before a write) will +* have which == BAD_WHICH and a non-null obj_request pointer. +* +* Finally, an object request for rbd image data will have +* which != BAD_WHICH, and will have a non-null img_request +* pointer. The value of which will be in the range +* 0..(img_request-obj_request_count-1). +*/ + union { + struct rbd_obj_request *obj_request; /* STAT op */ + struct { + struct rbd_img_request *img_request; + u64 img_offset; + /* links for img_request-obj_requests list */ + struct list_headlinks; + }; + }; u32 which; /* posn image request list */ enum obj_request_type type; @@ -1656,10 +1678,6 @@ static struct rbd_img_request *rbd_img_request_create( INIT_LIST_HEAD(img_request-obj_requests); kref_init(img_request-kref); - (void) obj_request_existence_set; - (void) obj_request_known_test; - (void) obj_request_exists_test; - rbd_img_request_get(img_request); /* Avoid a warning */ rbd_img_request_put(img_request); /* TEMPORARY */ @@ -1847,18 +1865,147 @@ out_unwind: return -ENOMEM; } +static void rbd_img_obj_exists_callback(struct rbd_obj_request *obj_request) +{ + struct rbd_device *rbd_dev; + struct ceph_osd_client *osdc; + struct rbd_obj_request *orig_request; + int result; + + rbd_assert(!obj_request_img_data_test(obj_request)); + + /* +* All we need from the object request is the original +* request and the result of the STAT op. Grab those, then +* we're done with the request. 
+*/ + orig_request = obj_request-obj_request; + obj_request-obj_request = NULL; + rbd_assert(orig_request); + rbd_assert(orig_request-img_request); + + result = obj_request-result; + obj_request-result = 0; + + dout(%s: obj %p for obj %p result %d %llu/%llu\n, __func__, + obj_request, orig_request, result, + obj_request-xferred, obj_request-length); + rbd_obj_request_put(obj_request); + + rbd_assert(orig_request); + rbd_assert(orig_request-img_request); + rbd_dev = orig_request-img_request-rbd_dev; + osdc = rbd_dev-rbd_client-client-osdc; + + /* +* Our only purpose here is to determine whether the object +* exists, and we don't want to treat the non-existence as +* an error. If something else comes back, transfer the +
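For context, the kind of setup this series is building toward is a kernel-mapped clone of a protected snapshot, where a write may target an object that has not yet been copied up from the parent; a sketch of the userspace side (pool and image names are arbitrary):

  rbd create --size 1024 --format 2 rbd/parent
  rbd snap create rbd/parent@snap1
  rbd snap protect rbd/parent@snap1
  rbd clone rbd/parent@snap1 rbd/child
  rbd map rbd/child

With this patch, a write to the mapped child first issues the STAT existence check on the target object before the original request is re-submitted.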
[PATCH V2] radosgw: receiving unexpected error code while accessing a non-existing object by an authorized non-owner user
This patch fixes a bug in the radosgw Swift compatibility code: if an authorized user who is not the owner accesses a non-existing object in a container, they receive an unexpected error code. To reproduce the bug, do the following steps:

1. User1 creates a container, and grants the read/write permission to user2

   curl -X PUT -i -k -H "X-Auth-Token: $user1_token" $url/$container
   curl -X POST -i -k -H "X-Auth-Token: $user1_token" -H "X-Container-Read: $user2" -H "X-Container-Write: $user2" $url/$container

2. User2 queries the object 'obj' in the newly created container with a HEAD request; note that the container is currently empty

   curl -X HEAD -i -k -H "X-Auth-Token: $user2_token" $url/$container/obj

3. The response received by user2 is '401 Authorization Required', rather than the expected '404 Not Found'; the details are as follows,

   HTTP/1.1 401 Authorization Required
   Date: Tue, 16 Apr 2013 01:52:49 GMT
   Server: Apache/2.2.22 (Ubuntu)
   Accept-Ranges: bytes
   Content-Length: 12
   Vary: Accept-Encoding
   Content-Type: text/plain; charset=utf-8

Signed-off-by: Yunchuan Wen <yunchuan...@ubuntukylin.com>
Signed-off-by: Li Wang <liw...@ubuntukylin.com>
---
 src/rgw/rgw_op.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/rgw/rgw_op.cc b/src/rgw/rgw_op.cc
index d2fbeeb..ef6448c 100644
--- a/src/rgw/rgw_op.cc
+++ b/src/rgw/rgw_op.cc
@@ -268,7 +268,7 @@ static int read_policy(RGWRados *store, struct req_state *s, RGWBucketInfo& buck
     return ret;
   string owner = bucket_policy.get_owner().get_id();
   if (owner.compare(s->user.user_id) != 0 &&
-      !bucket_policy.verify_permission(s->user.user_id, s->perm_mask, RGW_PERM_READ))
+      !bucket_policy.verify_permission(s->user.user_id, s->perm_mask, RGW_PERM_READ) && !bucket_policy.verify_permission(s->user.user_id, RGW_PERM_READ_OBJS, RGW_PERM_READ_OBJS))
     ret = -EACCES;
   else
     ret = -ENOENT;
-- 
1.7.9.5
Re: poor write performance
On 04/18/2013 06:46 AM, James Harper wrote: I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing. My setup, approximately, is: Two OSD's . 1 x 7200RPM SATA disk each . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch) . 1 x gigabit public network . journal on another spindle Three MON's . 1 each on the OSD's . 1 on another server, which is also the one used for testing performance I'm using debian packages from ceph which are version 0.56.4 For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far... Hi James! Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you troubles? If you are using the kernel version of RBD, we don't have any kind of cache implemented there and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD repeatedly and then moving on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance. Here's a couple of thoughts: 1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now. 2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS. 3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store all may be helpful. Thanks James Good luck! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
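To expand on point 3, the usual approach (a sketch; the pool name, paths and hosts are examples, not required values) is to test each layer in isolation before blaming RBD itself:

  # raw speed of the OSD data disk and journal spindle
  dd if=/dev/zero of=/path/on/osd/disk/testfile bs=4M count=1000 oflag=direct
  # network between client and OSD hosts
  iperf -s                # on the OSD host
  iperf -c osd-host       # on the client
  # the object store directly, bypassing RBD
  rados -p rbd bench 30 write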
Re: test osd on zfs
On Thu, 18 Apr 2013, Stefan Priebe - Profihost AG wrote: On 17.04.2013 at 23:14, Brian Behlendorf <behlendo...@llnl.gov> wrote: On 04/17/2013 01:16 PM, Mark Nelson wrote: I'll let Brian talk about the virtues of ZFS, I think the virtues of ZFS have been discussed at length in various other forums. But in short it brings some nice functionality to the table which may be useful to ceph, and that's worth exploring.

Sure, I know about the advantages of zfs. I just thought about how ceph can benefit. Right now I've no idea. The osds should be single disks, so zpool/zraid does not matter. Ceph does its own scrubbing and checksumming, and unlike with btrfs, ceph does not know how to use snapshots with zfs. That's why I'm asking.

The main things that come to mind:

- zfs checksumming
- ceph can eventually use zfs snapshots similarly to how it uses btrfs snapshots to create stable checkpoints as journal reference points, allowing parallel (instead of writeahead) journaling
- can use raidz beneath a single ceph-osd for better reliability (e.g., 2x * raidz instead of 3x replication)

ZFS doesn't have a clone function that we can use to enable efficient cephfs/rbd/rados snaps, but maybe this will motivate someone to implement one. :)

sage
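For the record, the raidz-under-one-OSD idea from the list above looks roughly like this with ZFS on Linux (a sketch; device names, pool name and mount point are placeholders):

  zpool create osd0 raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
  zfs create -o mountpoint=/var/lib/ceph/osd/ceph-0 osd0/data

so a single ceph-osd sits on a pool that already survives a drive failure, rather than relying purely on 3x replication across OSDs.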
Xen blktap driver for Ceph RBD : Anybody wants to test ? :p
Hi,

I've been working on a blktap driver that allows access to Ceph RBD block devices without relying on the RBD kernel driver, and it has finally reached a point where it works and is testable.

Some of the advantages are:
- Easier to update to a newer RBD version
- Allows functionality only available in the userspace RBD library (write cache, layering, ...)
- Fewer issues when you have an OSD as a domU on the same dom0
- Crashes are contained to user space :p (they shouldn't happen, but ...)

It's still an early prototype, but if you want to give it a shot, feedback is welcome. You can find the code at https://github.com/smunaut/blktap/tree/rbd (rbd branch).

Currently the username, pool name and image name are hardcoded ... (look for FIXME in the code). I'll get to that next, once I've figured out the best format for arguments.

Cheers,

Sylvain
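For anyone who wants to give it a spin, grabbing the branch is just (URL and branch as above; the hardcoded credentials still need editing before it will talk to your cluster):

  git clone https://github.com/smunaut/blktap.git
  cd blktap
  git checkout rbd
  grep -rn FIXME .    # find the hardcoded user/pool/image names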
Re: [PATCH] Swift ACL .rlistings support
Sorry for the late response, this somehow went through the cracks. The main issue that I see with this patch is that it introduces a new bit for object listing that is not really needed. You just need to set the RGW_PERM_READ on the bucket. This way setting this flag through swift you'd be able to see it via S3. Is there any compelling reason not to do so? Thanks, Yehuda On Tue, Apr 2, 2013 at 7:07 AM, Li Wang liw...@ubuntukylin.com wrote: This patch implements the Swift ACL .rlistings for Radosgw, it should be seamlessly compatible with earlier version as well as S3. Signed-off-by: Yunchuan Wen yunchuan...@ubuntukylin.com Signed-off-by: Li Wang liw...@ubuntukylin.com --- src/rgw/rgw_acl.cc |3 +++ src/rgw/rgw_acl.h| 19 ++- src/rgw/rgw_acl_swift.cc | 14 ++ src/rgw/rgw_op.cc|2 +- 4 files changed, 32 insertions(+), 6 deletions(-) diff --git a/src/rgw/rgw_acl.cc b/src/rgw/rgw_acl.cc index 1a90649..d6255e1 100644 --- a/src/rgw/rgw_acl.cc +++ b/src/rgw/rgw_acl.cc @@ -96,6 +96,9 @@ bool RGWAccessControlPolicy::verify_permission(string uid, int user_perm_mask, int policy_perm = get_perm(uid, test_perm); + if (policy_perm RGW_PERM_READ) { +policy_perm |= (test_perm RGW_PERM_READ_LIST); + } /* the swift WRITE_OBJS perm is equivalent to the WRITE obj, just convert those bits. Note that these bits will only be set on buckets, so the swift READ permission on bucket will allow listing diff --git a/src/rgw/rgw_acl.h b/src/rgw/rgw_acl.h index c06e9eb..6374413 100644 --- a/src/rgw/rgw_acl.h +++ b/src/rgw/rgw_acl.h @@ -15,11 +15,15 @@ using namespace std; #define RGW_PERM_WRITE 0x02 #define RGW_PERM_READ_ACP0x04 #define RGW_PERM_WRITE_ACP 0x08 -#define RGW_PERM_READ_OBJS 0x10 -#define RGW_PERM_WRITE_OBJS 0x20 +#define RGW_PERM_READ_OBJS 0x10 // Swift read +#define RGW_PERM_WRITE_OBJS 0x20 // Swift write +#define RGW_PERM_READ_LIST 0x40 // Swift .rlistings #define RGW_PERM_FULL_CONTROL( RGW_PERM_READ | RGW_PERM_WRITE | \ + RGW_PERM_READ_ACP | RGW_PERM_WRITE_ACP | \ + RGW_PERM_READ_LIST ) +#define RGW_PERM_ALL_S3 ( RGW_PERM_READ | RGW_PERM_WRITE | \ RGW_PERM_READ_ACP | RGW_PERM_WRITE_ACP ) -#define RGW_PERM_ALL_S3 RGW_PERM_FULL_CONTROL + enum ACLGranteeTypeEnum { /* numbers are encoded, should not change */ @@ -47,13 +51,18 @@ public: void set_permissions(int perm) { flags = perm; } void encode(bufferlist bl) const { -ENCODE_START(2, 2, bl); +ENCODE_START(3, 2, bl); ::encode(flags, bl); ENCODE_FINISH(bl); } void decode(bufferlist::iterator bl) { -DECODE_START_LEGACY_COMPAT_LEN(2, 2, 2, bl); +DECODE_START_LEGACY_COMPAT_LEN(3, 2, 2, bl); ::decode(flags, bl); +if (struct_v = 2) { + ACLGrant grant; + grant.set_group(ACL_GROUP_ALL_USERS, RGW_PERM_READ_LIST); + acl.add_grant(grant); +} DECODE_FINISH(bl); } void dump(Formatter *f) const; diff --git a/src/rgw/rgw_acl_swift.cc b/src/rgw/rgw_acl_swift.cc index b02ce90..af5f804 100644 --- a/src/rgw/rgw_acl_swift.cc +++ b/src/rgw/rgw_acl_swift.cc @@ -15,6 +15,7 @@ using namespace std; #define SWIFT_PERM_WRITE RGW_PERM_WRITE_OBJS #define SWIFT_GROUP_ALL_USERS .r:* +#define SWIFT_GROUP_LIST .rlistings static int parse_list(string uid_list, vectorstring uids) { @@ -54,6 +55,11 @@ static bool uid_is_public(string uid) sub.compare(.referrer) == 0; } +static bool uid_is_list(string uid) +{ + return uid.compare(SWIFT_GROUP_LIST) == 0; +} + void RGWAccessControlPolicy_SWIFT::add_grants(RGWRados *store, vectorstring uids, int perm) { vectorstring::iterator iter; @@ -64,6 +70,9 @@ void RGWAccessControlPolicy_SWIFT::add_grants(RGWRados *store, vectorstring u if (uid_is_public(uid)) 
{ grant.set_group(ACL_GROUP_ALL_USERS, perm); acl.add_grant(grant); +} else if ((perm SWIFT_PERM_READ) (uid_is_list(uid))) { + grant.set_group(ACL_GROUP_ALL_USERS, RGW_PERM_READ_LIST); + acl.add_grant(grant); } else if (rgw_get_user_info_by_uid(store, uid, grant_user) 0) { ldout(cct, 10) grant user does not exist: uid dendl; /* skipping silently */ @@ -116,6 +125,11 @@ void RGWAccessControlPolicy_SWIFT::to_str(string read, string write) if (grant.get_group() != ACL_GROUP_ALL_USERS) continue; id = SWIFT_GROUP_ALL_USERS; + if (perm RGW_PERM_READ_LIST) { +if (!read.empty()) + read.append(, ); +read.append(SWIFT_GROUP_LIST); + } } if (perm SWIFT_PERM_READ) { if (!read.empty()) diff --git a/src/rgw/rgw_op.cc b/src/rgw/rgw_op.cc index 43415d4..5c4d95a 100644 --- a/src/rgw/rgw_op.cc +++
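For reference, the Swift-side usage this patch is aiming at (a sketch; $token, $url and $container are placeholders as in the other examples on this list) is making a container listable by anonymous readers:

  curl -X POST -i -k -H "X-Auth-Token: $token" -H "X-Container-Read: .r:*,.rlistings" $url/$container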
Re: poor write performance
On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson mark.nel...@inktank.com wrote: On 04/18/2013 06:46 AM, James Harper wrote: I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing. My setup, approximately, is: Two OSD's . 1 x 7200RPM SATA disk each . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch) . 1 x gigabit public network . journal on another spindle Three MON's . 1 each on the OSD's . 1 on another server, which is also the one used for testing performance I'm using debian packages from ceph which are version 0.56.4 For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far... Hi James! Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you troubles? If you are using the kernel version of RBD, we don't have any kind of cache implemented there and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD repeatedly and then moving on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance. Here's a couple of thoughts: 1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now. 2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS. Can you point into related commits, if possible? 3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store all may be helpful. Thanks James Good luck! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: poor write performance
On 04/18/2013 11:46 AM, Andrey Korolyov wrote: On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson mark.nel...@inktank.com wrote: On 04/18/2013 06:46 AM, James Harper wrote: I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing. My setup, approximately, is: Two OSD's . 1 x 7200RPM SATA disk each . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch) . 1 x gigabit public network . journal on another spindle Three MON's . 1 each on the OSD's . 1 on another server, which is also the one used for testing performance I'm using debian packages from ceph which are version 0.56.4 For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far... Hi James! Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you troubles? If you are using the kernel version of RBD, we don't have any kind of cache implemented there and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD repeatedly and then moving on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance. Here's a couple of thoughts: 1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now. 2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS. Can you point into related commits, if possible? here you go: http://tracker.ceph.com/projects/ceph/repository/revisions/188f3ea6867eeb6e950f6efed18d53ff17522bbc 3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store all may be helpful. Thanks James Good luck! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil s...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil s...@inktank.com wrote: We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this time on the read side: - ceph-fuse queues an invalidate (in a separate thread) - kernel initiates a read - invalidate blocks in kernel, waiting on a page lock - the read blocks in ceph-fuse Now, assuming we're reading the stack traces properly, this is more or less what we see with writes, except with reads, and the obvious don't block the read would resolve it. But! If that is the only way to avoid deadlock, I'm afraid it is difficult to implement reliable cache invalidation at all. The reason we are invalidating is because the server told us to: we are no longer allowed to do reads and cached data is invalid. The obvious approach is to 1- stop processing new reads 2- let in-progress reads complete 3- invalidate the cache 4- ack to server ...but that will deadlock as above, as any new read will lock pages before blcoking. If we don't block, then the read may repopulate pages we just invalidated. We could 1- invalidate 2- if any reads happened while we were invalidating, goto 1 3- ack but then we risk starvation and livelock. How do other people solve this problem? It seems like another upcall that would let you block new reads (and/or writes) from starting while the invalidate is in progress would do the trick, but I'm not convinced I'm not missing something much simpler. Do you really need to call fuse_lowlevel_notify_inval_inode() while still holding the mutex in cfuse? It should be sufficient if you - 0 - Receive inval request from server 1 - mutex_lock() in cfuse 2 - invalidate cfuse cache 3 - mutex_unlock() in cfuse 4 - fuse_lowlevel_notify_inval_inode() 5 - ack to server The only necessary ordering seems to be 0-[2,4]-5. Placing 4 within the mutex boundaries looks unnecessary and self-imposed. In-progress reads which took the page lock before fuse_lowlevel_notify_inval_inode() would either read data cached in cfuse (in case they reached the cache before 1), or get sent over to server as though data was never cached. There wouldn't be a livelock either. Did I miss something? It's the concurrent reads I'm concerned about: 3.5 - read(2) is called, locks some pages, and sends a message through the fuse connection 3.9 or 4.1 - ceph-fuse gets the reads request. It can either handle it, repopulating a region of the page cache it possibly just partially invalidated (rendering the invalidate a failure), I think the problem lies here, that handling the read before 5 is returning old data (which effectively renders the invalidation a failure). Step 0 needs to guarantee that new data about to be written is already staged in the server and made available for read. However the write request itself needs to be blocked from completing till step 5 from all other clients completes. It sounds like you're thinking of a weaker consistency model. The mds is telling us to stop reads and invalidate our cache *before* anybody is allowed to write. At the end of this process, we ack, we should have an empty page cache for this file and any reads should be blocked until further notice. or block, possibly preventing the invalidate from ever completing. Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the server should be prepared to serve staged data even before Step 0. 
Invalidating might be *delayed* till the in-progress reads finishes. But that only delays the completion of write(), but no deadlocks anywhere. Hope I din't miss something :-) You can ignore the mutex_lock stuff; I don't think it's necessary to see the issue. Simply consider a racing read (that has pages locked) and invalidate (that is walking through the address_space mapping locking and discarding pages). We can't block the read without risking deadlock with the invalidate, and we can't continue with the read without making the invalidate unsuccessful/unreliable. We can actually do reads at this point from the ceph client vs server perspective since we haven't acked the revocation yet.. but with the fuse vs ceph-fuse interaction we are choosing betweeen deadlock or potential livelock (if we do the read and then try the invalidate a second time). sage 4.2 - invalidate either completes (having possibly missed some just-read pages),
Re: [RESEND][PATCH 0/2] fix few root xattr bugs
Thanks! I merged these into next (going to be Cuttlefish) in commits f379ce37bfdcb3670f52ef47c02787f82e50e612 and 87634d882fda80c4a2e3705c83a38bdfd613763f. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Apr 17, 2013 at 11:43 PM, Kuan Kai Chiu big.c...@bigtera.com wrote: The first patch fixes a bug that causes MDS crash while setting or removing xattrs on root directory. The second patch fixes another bug that root xattrs not correctly logged in MDS journal. Kuan Kai Chiu (2): mds: fix setting/removing xattrs on root mds: journal the projected root xattrs in add_root() src/mds/Server.cc |6 ++ src/mds/events/EMetaBlob.h |2 +- 2 files changed, 3 insertions(+), 5 deletions(-) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
On Apr 18, 2013, at 10:05 AM, Sage Weil s...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil s...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil s...@inktank.com wrote: We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this time on the read side: - ceph-fuse queues an invalidate (in a separate thread) - kernel initiates a read - invalidate blocks in kernel, waiting on a page lock - the read blocks in ceph-fuse Now, assuming we're reading the stack traces properly, this is more or less what we see with writes, except with reads, and the obvious don't block the read would resolve it. But! If that is the only way to avoid deadlock, I'm afraid it is difficult to implement reliable cache invalidation at all. The reason we are invalidating is because the server told us to: we are no longer allowed to do reads and cached data is invalid. The obvious approach is to 1- stop processing new reads 2- let in-progress reads complete 3- invalidate the cache 4- ack to server ...but that will deadlock as above, as any new read will lock pages before blcoking. If we don't block, then the read may repopulate pages we just invalidated. We could 1- invalidate 2- if any reads happened while we were invalidating, goto 1 3- ack but then we risk starvation and livelock. How do other people solve this problem? It seems like another upcall that would let you block new reads (and/or writes) from starting while the invalidate is in progress would do the trick, but I'm not convinced I'm not missing something much simpler. Do you really need to call fuse_lowlevel_notify_inval_inode() while still holding the mutex in cfuse? It should be sufficient if you - 0 - Receive inval request from server 1 - mutex_lock() in cfuse 2 - invalidate cfuse cache 3 - mutex_unlock() in cfuse 4 - fuse_lowlevel_notify_inval_inode() 5 - ack to server The only necessary ordering seems to be 0-[2,4]-5. Placing 4 within the mutex boundaries looks unnecessary and self-imposed. In-progress reads which took the page lock before fuse_lowlevel_notify_inval_inode() would either read data cached in cfuse (in case they reached the cache before 1), or get sent over to server as though data was never cached. There wouldn't be a livelock either. Did I miss something? It's the concurrent reads I'm concerned about: 3.5 - read(2) is called, locks some pages, and sends a message through the fuse connection 3.9 or 4.1 - ceph-fuse gets the reads request. It can either handle it, repopulating a region of the page cache it possibly just partially invalidated (rendering the invalidate a failure), I think the problem lies here, that handling the read before 5 is returning old data (which effectively renders the invalidation a failure). Step 0 needs to guarantee that new data about to be written is already staged in the server and made available for read. However the write request itself needs to be blocked from completing till step 5 from all other clients completes. It sounds like you're thinking of a weaker consistency model. The mds is telling us to stop reads and invalidate our cache *before* anybody is allowed to write. At the end of this process, we ack, we should have an empty page cache for this file and any reads should be blocked until further notice. Yes, the consistency model I was talking about is weaker than block new reads, purge all cache, wait until further notified. 
If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version. or block, possibly preventing the invalidate from ever completing. Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the server should be prepared to serve staged data even before Step 0. Invalidating might be *delayed* till the in-progress reads finishes. But that only delays the completion of write(), but no deadlocks anywhere. Hope I din't miss something :-) You can ignore the mutex_lock stuff; I don't think it's necessary to see the issue. Simply consider a racing read (that has pages locked) and invalidate (that is walking through the address_space mapping locking and discarding pages). We can't block the read without risking deadlock with the invalidate, and we can't continue with the read without making the invalidate unsuccessful/unreliable. We can actually do reads at this point from the ceph client vs server perspective since we haven't acked the revocation yet.. but with the fuse vs ceph-fuse interaction we are choosing betweeen deadlock or potential livelock (if we do the
Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
On Thu, 18 Apr 2013, Anand Avati wrote: On Apr 18, 2013, at 10:05 AM, Sage Weil s...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil s...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil s...@inktank.com wrote: We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this time on the read side: - ceph-fuse queues an invalidate (in a separate thread) - kernel initiates a read - invalidate blocks in kernel, waiting on a page lock - the read blocks in ceph-fuse Now, assuming we're reading the stack traces properly, this is more or less what we see with writes, except with reads, and the obvious don't block the read would resolve it. But! If that is the only way to avoid deadlock, I'm afraid it is difficult to implement reliable cache invalidation at all. The reason we are invalidating is because the server told us to: we are no longer allowed to do reads and cached data is invalid. The obvious approach is to 1- stop processing new reads 2- let in-progress reads complete 3- invalidate the cache 4- ack to server ...but that will deadlock as above, as any new read will lock pages before blcoking. If we don't block, then the read may repopulate pages we just invalidated. We could 1- invalidate 2- if any reads happened while we were invalidating, goto 1 3- ack but then we risk starvation and livelock. How do other people solve this problem? It seems like another upcall that would let you block new reads (and/or writes) from starting while the invalidate is in progress would do the trick, but I'm not convinced I'm not missing something much simpler. Do you really need to call fuse_lowlevel_notify_inval_inode() while still holding the mutex in cfuse? It should be sufficient if you - 0 - Receive inval request from server 1 - mutex_lock() in cfuse 2 - invalidate cfuse cache 3 - mutex_unlock() in cfuse 4 - fuse_lowlevel_notify_inval_inode() 5 - ack to server The only necessary ordering seems to be 0-[2,4]-5. Placing 4 within the mutex boundaries looks unnecessary and self-imposed. In-progress reads which took the page lock before fuse_lowlevel_notify_inval_inode() would either read data cached in cfuse (in case they reached the cache before 1), or get sent over to server as though data was never cached. There wouldn't be a livelock either. Did I miss something? It's the concurrent reads I'm concerned about: 3.5 - read(2) is called, locks some pages, and sends a message through the fuse connection 3.9 or 4.1 - ceph-fuse gets the reads request. It can either handle it, repopulating a region of the page cache it possibly just partially invalidated (rendering the invalidate a failure), I think the problem lies here, that handling the read before 5 is returning old data (which effectively renders the invalidation a failure). Step 0 needs to guarantee that new data about to be written is already staged in the server and made available for read. However the write request itself needs to be blocked from completing till step 5 from all other clients completes. It sounds like you're thinking of a weaker consistency model. The mds is telling us to stop reads and invalidate our cache *before* anybody is allowed to write. At the end of this process, we ack, we should have an empty page cache for this file and any reads should be blocked until further notice. Yes, the consistency model I was talking about is weaker than block new reads, purge all cache, wait until further notified. 
If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version. or block, possibly preventing the invalidate from ever completing. Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the server should be prepared to serve staged data even before Step 0. Invalidating might be *delayed* till the in-progress reads finishes. But that only delays the completion of write(), but no deadlocks anywhere. Hope I din't miss something :-) You can ignore the mutex_lock stuff; I don't think it's necessary to see the issue. Simply consider a racing read (that has pages locked) and invalidate (that is walking through the address_space mapping locking and discarding pages). We can't block the read without risking deadlock with the invalidate, and we can't continue with the read without making the invalidate unsuccessful/unreliable. We can actually do reads at this point from the ceph client vs server perspective
Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
On 04/18/2013 12:12 PM, Sage Weil wrote: On Thu, 18 Apr 2013, Anand Avati wrote: On Apr 18, 2013, at 10:05 AM, Sage Weils...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 9:45 PM, Sage Weils...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 5:43 PM, Sage Weils...@inktank.com wrote: We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this time on the read side: - ceph-fuse queues an invalidate (in a separate thread) - kernel initiates a read - invalidate blocks in kernel, waiting on a page lock - the read blocks in ceph-fuse Now, assuming we're reading the stack traces properly, this is more or less what we see with writes, except with reads, and the obvious don't block the read would resolve it. But! If that is the only way to avoid deadlock, I'm afraid it is difficult to implement reliable cache invalidation at all. The reason we are invalidating is because the server told us to: we are no longer allowed to do reads and cached data is invalid. The obvious approach is to 1- stop processing new reads 2- let in-progress reads complete 3- invalidate the cache 4- ack to server ...but that will deadlock as above, as any new read will lock pages before blcoking. If we don't block, then the read may repopulate pages we just invalidated. We could 1- invalidate 2- if any reads happened while we were invalidating, goto 1 3- ack but then we risk starvation and livelock. How do other people solve this problem? It seems like another upcall that would let you block new reads (and/or writes) from starting while the invalidate is in progress would do the trick, but I'm not convinced I'm not missing something much simpler. Do you really need to call fuse_lowlevel_notify_inval_inode() while still holding the mutex in cfuse? It should be sufficient if you - 0 - Receive inval request from server 1 - mutex_lock() in cfuse 2 - invalidate cfuse cache 3 - mutex_unlock() in cfuse 4 - fuse_lowlevel_notify_inval_inode() 5 - ack to server The only necessary ordering seems to be 0-[2,4]-5. Placing 4 within the mutex boundaries looks unnecessary and self-imposed. In-progress reads which took the page lock before fuse_lowlevel_notify_inval_inode() would either read data cached in cfuse (in case they reached the cache before 1), or get sent over to server as though data was never cached. There wouldn't be a livelock either. Did I miss something? It's the concurrent reads I'm concerned about: 3.5 - read(2) is called, locks some pages, and sends a message through the fuse connection 3.9 or 4.1 - ceph-fuse gets the reads request. It can either handle it, repopulating a region of the page cache it possibly just partially invalidated (rendering the invalidate a failure), I think the problem lies here, that handling the read before 5 is returning old data (which effectively renders the invalidation a failure). Step 0 needs to guarantee that new data about to be written is already staged in the server and made available for read. However the write request itself needs to be blocked from completing till step 5 from all other clients completes. It sounds like you're thinking of a weaker consistency model. The mds is telling us to stop reads and invalidate our cache *before* anybody is allowed to write. At the end of this process, we ack, we should have an empty page cache for this file and any reads should be blocked until further notice. 
Yes, the consistency model I was talking about is weaker than block new reads, purge all cache, wait until further notified. If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version.

or block, possibly preventing the invalidate from ever completing.

Hmm, where would it block? Didn't we mutex_unlock() in 3 already? and the server should be prepared to serve staged data even before Step 0. Invalidating might be *delayed* till the in-progress reads finish. But that only delays the completion of write(), but no deadlocks anywhere. Hope I didn't miss something :-)

You can ignore the mutex_lock stuff; I don't think it's necessary to see the issue. Simply consider a racing read (that has pages locked) and invalidate (that is walking through the address_space mapping locking and discarding pages). We can't block the read without risking deadlock with the invalidate, and we can't continue with the read without making the invalidate unsuccessful/unreliable. We can actually do reads at this point from the ceph client vs server perspective since we haven't acked the revocation yet.. but with the fuse vs ceph-fuse interaction we are choosing between deadlock or potential livelock (if
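For reference, the 0-5 ordering being debated above looks roughly like the sketch below. This is illustrative only and not the actual ceph-fuse code: it assumes the libfuse 2.x lowlevel API, a pthread mutex guarding the cfuse object cache, and leaves the cache drop and the MDS ack as comments.

  #define FUSE_USE_VERSION 26
  #include <fuse/fuse_lowlevel.h>
  #include <pthread.h>

  static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

  // Called when the MDS revokes our caps for [off, off+len) of inode 'ino'.
  void on_cap_revoke(struct fuse_chan *ch, fuse_ino_t ino, off_t off, off_t len)
  {
    pthread_mutex_lock(&cache_lock);     // 1: take the cfuse cache lock
    // 2: drop the client-side cached extents for this range here
    pthread_mutex_unlock(&cache_lock);   // 3: release before the upcall

    // 4: ask the kernel to drop its page cache for the same range; this is
    //    the call that can block on a page lock held by an in-flight read(2)
    fuse_lowlevel_notify_inval_inode(ch, ino, off, len);

    // 5: only now ack the revocation back to the MDS
  }

Even with 4 outside the mutex, the race Sage describes remains: a read(2) that locks pages and reaches ceph-fuse between steps 3 and 5 can still be serviced and repopulate pages that step 4 is trying to discard, which is exactly the deadlock-or-livelock choice discussed above.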
erasure coding (sorry)
sorry to bring this up again, googling revealed some people don't like the subject [anymore]. but I'm working on a new +- 3PB cluster for storage of immutable files. and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now)

For this use case, my impression is erasure coding would make a lot of sense (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files. inbound traffic would be minimal)

I know that the answer a while ago was no plans to implement erasure coding, has this changed? if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.

thanks, Dieter
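To make the tradeoff concrete (example figures only, not a recommendation for any particular layout): with 3x replication, 3 PB of usable data needs roughly 9 PB of raw disk. With a k=10, m=4 erasure code the overhead is (k+m)/k = 1.4, i.e. roughly 4.2 PB of raw disk, while still surviving the loss of any 4 of the 14 shards. The cost shows up on reads and repairs: a 150 MB file stored as 10 data shards of 15 MB plus 4 parity shards must be reassembled from at least 10 of them, so every read fans out across 10 disks/hosts, and rebuilding a single lost shard means re-reading k shards.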
Re: erasure coding (sorry)
On Thu, 18 Apr 2013, Plaetinck, Dieter wrote: sorry to bring this up again, googling revealed some people don't like the subject [anymore]. but I'm working on a new +- 3PB cluster for storage of immutable files. and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now) For this use case, my impression is erasure coding would make a lot of sense (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files. inbound traffic would be minimal) I know that the answer a while ago was no plans to implement erasure coding, has this changed? if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.

We would love to do it, but it is not a priority at the moment (things like multi-site replication are in much higher demand). That of course doesn't prevent someone outside of Inktank from working on it :)

The main caveat is that it will be complicated. For an initial implementation, the full breadth of the rados API probably wouldn't be supported for erasure/parity encoded pools (things like rados classes and the omap key/value api get tricky when you start talking about parity). But for many (or even most) use cases, objects are just bytes, and those restrictions are just fine.

sage
Re: erasure coding (sorry)
On 04/18/2013 04:08 PM, Josh Durgin wrote: On 04/18/2013 01:47 PM, Sage Weil wrote: On Thu, 18 Apr 2013, Plaetinck, Dieter wrote: sorry to bring this up again, googling revealed some people don't like the subject [anymore]. but I'm working on a new +- 3PB cluster for storage of immutable files. and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now) For this use case, my impression is erasure coding would make a lot of sense (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files. inbound traffic would be minimal) I know that the answer a while ago was no plans to implement erasure coding, has this changed? if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.

We would love to do it, but it is not a priority at the moment (things like multi-site replication are in much higher demand). That of course doesn't prevent someone outside of Inktank from working on it :) The main caveat is that it will be complicated. For an initial implementation, the full breadth of the rados API probably wouldn't be supported for erasure/parity encoded pools (things like rados classes and the omap key/value api get tricky when you start talking about parity). But for many (or even most) use cases, objects are just bytes, and those restrictions are just fine.

I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. When the osd detected this marker, it would proxy the request to the erasure coding thread/process which would service the request on the second pool for reads, and potentially make writes move the data back to the first pool in a tiering sort of scenario. I might have misremembered some details, but I think it's an interesting way to get many of the benefits of erasure coding with a relatively small amount of work compared to a fully native osd solution.

Josh

Neat. :)

[1] http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
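To make the marker idea concrete, here is a rough client/proxy-side sketch using librados. Everything in it is illustrative: the pool names, the "redirect.pool" xattr convention, and read_and_decode_shards() are invented for the example, and error handling is minimal.

  #include <rados/librados.hpp>
  #include <string>

  // Hypothetical helper: fetch the erasure-coded shards for 'oid' from the
  // cold pool and decode them into 'out' (not implemented here).
  int read_and_decode_shards(librados::IoCtx &cold, const std::string &oid,
                             librados::bufferlist &out, size_t len, uint64_t off);

  int proxied_read(librados::Rados &cluster, const std::string &oid,
                   librados::bufferlist &out, size_t len, uint64_t off)
  {
    librados::IoCtx hot;
    int r = cluster.ioctx_create("hot", hot);
    if (r < 0)
      return r;

    librados::bufferlist target;
    if (hot.getxattr(oid, "redirect.pool", target) < 0) {
      // No marker: the object still lives in the hot pool, read it directly.
      return hot.read(oid, out, len, off);
    }

    // Marker found: the object was erasure coded into another pool, so hand
    // the request to the proxy path that reads and decodes the shards.
    librados::IoCtx cold;
    std::string pool(target.c_str(), target.length());
    r = cluster.ioctx_create(pool.c_str(), cold);
    if (r < 0)
      return r;
    return read_and_decode_shards(cold, oid, out, len, off);
  }

Sage's redirect/symlink idea further down the thread would push this check into the OSD itself, so unmodified clients would not need a proxy at all.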
Re: erasure coding (sorry)
On Apr 18, 2013, at 2:08 PM, Josh Durgin josh.dur...@inktank.com wrote: I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool.

This sounds at a high-level similar to work out of Microsoft: https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf

The basic idea is to replicate first, then erasure code in the background.

- Noah
Re: erasure coding (sorry)
On Thu, 18 Apr 2013, Noah Watkins wrote: On Apr 18, 2013, at 2:08 PM, Josh Durgin josh.dur...@inktank.com wrote: I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. This sounds at a high-level similar to work out of Microsoft: https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf The basic idea is to replicate first, then erasure code in the background.

FWIW, I think a useful (and generic) concept to add to rados would be a redirect symlink sort of thing that says oh, this object is over there in that other pool, such that client requests will be transparently redirected or proxied. This will enable generic tiering type operations, and probably simplify/enable migration without a lot of additional complexity on the client side.

sage
RE: poor write performance
Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far...

Hi James! Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you troubles? If you are using the kernel version of RBD, we don't have any kind of cache implemented there and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD repeatedly and then moving on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance.

Using dd I tried various block sizes. With 4kb I was getting around 500kbytes/second rate. With 1MB I was getting a few mbytes/second. Read performance seems great though.

Here's a couple of thoughts: 1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now.

I'm using Xen so that won't work for me right now, although I did notice someone posted some blktap code to support ceph. I'm trying a windows restore of a physical machine into a VM under Xen and performance matches what I am seeing with dd - very very slow.

2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS.

Do packages for this exist for Debian? At the moment my sources.list contains ceph.com/debian-bobtail wheezy main.

3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store all may be helpful.

I tried that and while the write worked the seq test always said I had to do a write test first.

While running my Xen restore, /var/log/ceph/ceph.log looks like:
pgmap v18316: 832 pgs: 832 active+clean; 61443 MB data, 119 GB used, 1742 GB / 1862 GB avail; 824KB/s wr, 12op/s
pgmap v18317: 832 pgs: 832 active+clean; 61446 MB data, 119 GB used, 1742 GB / 1862 GB avail; 649KB/s wr, 10op/s
pgmap v18318: 832 pgs: 832 active+clean; 61449 MB data, 119 GB used, 1742 GB / 1862 GB avail; 652KB/s wr, 10op/s
pgmap v18319: 832 pgs: 832 active+clean; 61452 MB data, 119 GB used, 1742 GB / 1862 GB avail; 614KB/s wr, 9op/s
pgmap v18320: 832 pgs: 832 active+clean; 61454 MB data, 119 GB used, 1742 GB / 1862 GB avail; 537KB/s wr, 8op/s
pgmap v18321: 832 pgs: 832 active+clean; 61457 MB data, 119 GB used, 1742 GB / 1862 GB avail; 511KB/s wr, 7op/s

James
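For reference, rados bench's seq test replays reads against the objects left behind by an earlier write test, so the write phase has to be run with --no-cleanup (otherwise the benchmark objects are deleted at the end and seq has nothing to read), roughly:

  rados -p pool bench 300 write --no-cleanup
  rados -p pool bench 300 seq

The write numbers above are also consistent with the earlier explanation: with the default 4 MB RBD object size, 1024 consecutive 4 KB writes (4 MB / 4 KB) land in the same object on the same primary OSD, and without a client-side cache each one is a separate round trip.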
Re: RBD Read performance
Morning all, Did the echos on all boxes involved... and the results are in.. [root@dogbreath ~]# [root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=1 iflag=direct 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 144.083 s, 291 MB/s [root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=1 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 316.025 s, 133 MB/s [root@dogbreath ~]# No change which is a shame. What other information or testing should I start? Regards Malcolm Haak On 18/04/13 17:22, Malcolm Haak wrote: Hi Mark! Thanks for the quick reply! I'll reply inline below. On 18/04/13 17:04, Mark Nelson wrote: On 04/17/2013 11:35 PM, Malcolm Haak wrote: Hi all, Hi Malcolm! I jumped into the IRC channel yesterday and they said to email ceph-devel. I have been having some read performance issues. With Reads being slower than writes by a factor of ~5-8. I recently saw this kind of behaviour (writes were fine, but reads were terrible) on an IPoIB based cluster and it was caused by the same TCP auto tune issues that Jim Schutt saw last year. It's worth a try at least to see if it helps. echo 0 /proc/sys/net/ipv4/tcp_moderate_rcvbuf on all of the clients and server nodes should be enough to test it out. Sage added an option in more recent Ceph builds that lets you work around it too. Awesome I will test this first up tomorrow. First info: Server SLES 11 SP2 Ceph 0.56.4. 12 OSD's that are Hardware Raid 5 each of the twelve is made from 5 NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s stream write and the same if not better read) Connected via 2xQDR IB OSD's/MDS and such all on same box (for testing) Box is a Quad AMD Opteron 6234 Ram is 256Gb 10GB Journals osd_op_theads: 8 osd_disk_threads:2 Filestore_op_threads:4 OSD's are all XFS Interesting setup! QUAD socket Opteron boxes have somewhat slow and slightly oversubscribed hypertransport links don't they? I wonder if on a system with so many disks and QDR-IB if that could become a problem... We typically like smaller nodes where we can reasonably do 1 OSD per drive, but we've tested on a couple of 60 drive chassis in RAID configs too. Should be interesting to hear what kind of aggregate performance you can eventually get. We are also going to try this out with 6 luns on a dual xeon box. The Opteron box was the biggest scariest thing we had that was doing nothing. All nodes are connected via QDR IB using IP_O_IB. We get 1.7GB/s on TCP performance tests between the nodes. Clients: One is FC17 the other us Ubuntu 12.10 they only have around 32GB-70GB ram. We ran into an odd issue were the OSD's would all start in the same NUMA node and pretty much on the same processor core. We fixed that up with some cpuset magic. Strange! Was that more due to cpuset or Ceph? I can't imagine that we are doing anything that would cause that. More than likely it is an odd quirk in the SLES kernel.. but when I have time I'll do some more poking. We were seeing insane CPU usage on some cores because all the OSD's were piled up in one place. 
Performance testing we have done: (Note oflag=direct was yielding results within 5% of cached results) root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200 3200+0 records in 3200+0 records out 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s root@ty3:~# root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800 4800+0 records in 4800+0 records out 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=2400 2400+0 records in 2400+0 records out 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=9600 9600+0 records in 9600+0 records out 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s Both clients each doing a 140GB write (2x dogbreath's RAM) at the same time to two different rbds in the same pool. root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 412.404 s, 356 MB/s root@ty3:~# [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 433.351 s, 339 MB/s [root@dogbreath ~]# Onto reads... Also we found that doing iflag=direct increased read performance. [root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M count=160 160+0 records in 160+0 records out 1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s [root@dogbreath ~]# [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd
Re: erasure coding (sorry)
Supposedly, on 2013-Apr-18, at 14.08 PDT(-0700), someone claiming to be Josh Durgin scribed: On 04/18/2013 01:47 PM, Sage Weil wrote: On Thu, 18 Apr 2013, Plaetinck, Dieter wrote: sorry to bring this up again, googling revealed some people don't like the subject [anymore]. but I'm working on a new +- 3PB cluster for storage of immutable files. and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now) For this use case, my impression is erasure coding would make a lot of sense (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files. inbound traffic would be minimal) I know that the answer a while ago was no plans to implement erasure coding, has this changed? if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.

We would love to do it, but it is not a priority at the moment (things like multi-site replication are in much higher demand). That of course doesn't prevent someone outside of Inktank from working on it :) The main caveat is that it will be complicated. For an initial implementation, the full breadth of the rados API probably wouldn't be supported for erasure/parity encoded pools (things like rados classes and the omap key/value api get tricky when you start talking about parity). But for many (or even most) use cases, objects are just bytes, and those restrictions are just fine.

I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. When the osd detected this marker, it would proxy the request to the erasure coding thread/process which would service the request on the second pool for reads, and potentially make writes move the data back to the first pool in a tiering sort of scenario. I might have misremembered some details, but I think it's an interesting way to get many of the benefits of erasure coding with a relatively small amount of work compared to a fully native osd solution.

Greetings, I'm one of those individuals :) Our thinking is evolving on this, and I think we can keep most of the work out of the main machinery of ceph, and simply require a modified client that runs the proxy function on the hot pool OSDs. Even wondering if it could be prototyped in fuse. I will be writing this up in the next day or two in the blueprint below. Josh has the idea basically correct.

Josh

Christopher

[1] http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
Re: RBD Read performance
On 04/18/2013 07:27 PM, Malcolm Haak wrote: Morning all, Did the echos on all boxes involved... and the results are in.. [root@dogbreath ~]# [root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=1 iflag=direct 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 144.083 s, 291 MB/s [root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=1 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 316.025 s, 133 MB/s [root@dogbreath ~]# Boo! No change which is a shame. What other information or testing should I start? Any chance you can try out a quick rados bench test from the client against the pool for writes and reads and see how that works? rados -p pool bench 300 write --no-cleanup rados -p pool bench 300 seq Regards Malcolm Haak On 18/04/13 17:22, Malcolm Haak wrote: Hi Mark! Thanks for the quick reply! I'll reply inline below. On 18/04/13 17:04, Mark Nelson wrote: On 04/17/2013 11:35 PM, Malcolm Haak wrote: Hi all, Hi Malcolm! I jumped into the IRC channel yesterday and they said to email ceph-devel. I have been having some read performance issues. With Reads being slower than writes by a factor of ~5-8. I recently saw this kind of behaviour (writes were fine, but reads were terrible) on an IPoIB based cluster and it was caused by the same TCP auto tune issues that Jim Schutt saw last year. It's worth a try at least to see if it helps. echo 0 /proc/sys/net/ipv4/tcp_moderate_rcvbuf on all of the clients and server nodes should be enough to test it out. Sage added an option in more recent Ceph builds that lets you work around it too. Awesome I will test this first up tomorrow. First info: Server SLES 11 SP2 Ceph 0.56.4. 12 OSD's that are Hardware Raid 5 each of the twelve is made from 5 NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s stream write and the same if not better read) Connected via 2xQDR IB OSD's/MDS and such all on same box (for testing) Box is a Quad AMD Opteron 6234 Ram is 256Gb 10GB Journals osd_op_theads: 8 osd_disk_threads:2 Filestore_op_threads:4 OSD's are all XFS Interesting setup! QUAD socket Opteron boxes have somewhat slow and slightly oversubscribed hypertransport links don't they? I wonder if on a system with so many disks and QDR-IB if that could become a problem... We typically like smaller nodes where we can reasonably do 1 OSD per drive, but we've tested on a couple of 60 drive chassis in RAID configs too. Should be interesting to hear what kind of aggregate performance you can eventually get. We are also going to try this out with 6 luns on a dual xeon box. The Opteron box was the biggest scariest thing we had that was doing nothing. All nodes are connected via QDR IB using IP_O_IB. We get 1.7GB/s on TCP performance tests between the nodes. Clients: One is FC17 the other us Ubuntu 12.10 they only have around 32GB-70GB ram. We ran into an odd issue were the OSD's would all start in the same NUMA node and pretty much on the same processor core. We fixed that up with some cpuset magic. Strange! Was that more due to cpuset or Ceph? I can't imagine that we are doing anything that would cause that. More than likely it is an odd quirk in the SLES kernel.. but when I have time I'll do some more poking. We were seeing insane CPU usage on some cores because all the OSD's were piled up in one place. 
Performance testing we have done: (Note oflag=direct was yielding results within 5% of cached results) root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200 3200+0 records in 3200+0 records out 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s root@ty3:~# root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800 4800+0 records in 4800+0 records out 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=2400 2400+0 records in 2400+0 records out 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=9600 9600+0 records in 9600+0 records out 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s Both clients each doing a 140GB write (2x dogbreath's RAM) at the same time to two different rbds in the same pool. root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 412.404 s, 356 MB/s root@ty3:~# [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 433.351 s, 339 MB/s [root@dogbreath ~]# Onto reads... Also we found that doing iflag=direct increased read performance.
Re: erasure coding (sorry)
Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed: On Thu, 18 Apr 2013 16:09:52 -0500 Mark Nelson mark.nel...@inktank.com wrote: On 04/18/2013 04:08 PM, Josh Durgin wrote: On 04/18/2013 01:47 PM, Sage Weil wrote: On Thu, 18 Apr 2013, Plaetinck, Dieter wrote: sorry to bring this up again, googling revealed some people don't like the subject [anymore]. but I'm working on a new +- 3PB cluster for storage of immutable files. and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now) For this use case, my impression is erasure coding would make a lot of sense (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files. inbound traffic would be minimal) I know that the answer a while ago was no plans to implement erasure coding, has this changed? if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc. We would love to do it, but it is not a priority at the moment (things like multi-site replication are in much higher demand). That of course doesn't prevent someone outside of Inktank from working on it :) The main caveat is that it will be complicate. For an initial implementation, the full breadth of the rados API probably wouldn't be support for erasure/parity encoded pools (thinkgs like rados classes and the omap key/value api get tricky when you start talking about parity). But for many (or even most) use cases, objects are just bytes, and those restrictions are just fine. I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the rasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. When the osd detected this marker, it would proxy the request to the erasure coding thread/process which would service the request on the second pool for reads, and potentially make writes move the data back to the first pool in a tiering sort of scenario. I might have misremembered some details, but I think it's an interesting way to get many of the benefits of erasure coding with a relatively small amount of work compared to a fully native osd solution. Josh Neat. :) @Bryan: I did come across cleversafe. all the articles around it seemed promising, but unfortunately it seems everything related to the cleversafe open source project somehow vanished from the internet. (e.g. http://www.cleversafe.org/) quite weird... Yea - in a previous incarnation I looked at cleversafe to do something similar a few years ago. It is odd that the cleversafe.org stuff did disapear. However, tahoe-lafs also does encoding, and their package (zfec) [1] may be leverageable. @Sage: interesting. I thought it would be more relatively simple if one assumes the restriction of immutable files. I'm not familiar with those ceph specifics you're mentioning. When building an erasure codes-based system, maybe there's ways to reuse existing ceph code and/or allow some integration with replication based objects, without aiming for full integration or full support of the rados api, based on some tradeoffs. 
I think this might sit UNDER the rados API. I would certainly want to leverage CRUSH to place the shards, however (great tool, no reason to re-invent the wheel).

@Josh, that sounds like an interesting approach. Too bad that page doesn't contain any information yet :)

Give me time :) - openstack has kept me a bit busy… May also be a factor of design at keyboard :)

Dieter

Christopher

[1] https://tahoe-lafs.org/trac/zfec
Re: erasure coding (sorry)
Supposedly, on 2013-Apr-18, at 14.24 PDT(-0700), someone claiming to be Noah Watkins scribed: On Apr 18, 2013, at 2:08 PM, Josh Durgin josh.dur...@inktank.com wrote: I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. This sounds at a high-level similar to work out of Microsoft:

I've looked at that, and it would be somewhat similar (not completely, but borrow some ideas).

Christopher

https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf The basic idea is to replicate first, then erasure code in the background. - Noah
Re: erasure coding (sorry)
Supposedly, on 2013-Apr-18, at 14.26 PDT(-0700), someone claiming to be Sage Weil scribed: On Thu, 18 Apr 2013, Noah Watkins wrote: On Apr 18, 2013, at 2:08 PM, Josh Durgin josh.dur...@inktank.com wrote: I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. This sounds at a high-level similar to work out of Microsoft: https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf The basic idea is to replicate first, then erasure code in the background.

FWIW, I think a useful (and generic) concept to add to rados would be a redirect symlink sort of thing that says oh, this object is over there in that other pool, such that client requests will be transparently redirected or proxied. This will enable generic tiering type operations, and probably simplify/enable migration without a lot of additional complexity on the client side.

sage

More to come, but I'm starting to think of a union mount of a fuse re-directing overlay. The quick idea: on the hot pool, the OSD's would write to the host FS as usual. However, that FS is actually a light-weight fuse (at least for prototype) fs that passes almost everything right down to the file system. As the OSD hits a capacity HWM, a watcher (an asynchronous process) starts evicting objects from the OSD. It does that by using a modified ceph client that calls zfec and uses CRUSH to place the resulting shards in the cool pool. Once those are committed, it replaces the object in the hot OSD with a special token. This is repeated until a LWM is reached.

When the OSD gets a read request for that object and the fuse shim sees the token, it knows to do a modified client fetch from the cool pool. It returns the resulting object to the original requester and (potentially) stores the object back in the hot OSD (if you want cache-like performance), replacing the token. If necessary, some other object may get, in turn, evicted if the HWM is again breached.

We would also need to modify the repair mechanism for the deep scrub in the cool pool to account for the repair being a reconstitution of an invalid shard, rather than a copy (as there is only one copy of a given shard). I'll get a bit more of a write-up today, hopefully, in the wiki.

Christopher
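As a toy illustration of the HWM/LWM behaviour described above (purely a sketch: the thresholds and names are invented, and the encode/place/token steps themselves are the ones outlined in the mail):

  #include <cstdint>

  // Hysteresis for the eviction watcher: start evicting once usage crosses
  // the high watermark and keep evicting until it drops back below the low
  // watermark.  Example thresholds only.
  struct WatermarkPolicy {
    double hwm = 0.85;      // begin evicting above 85% full
    double lwm = 0.70;      // stop once back under 70%
    bool evicting = false;

    bool should_evict(uint64_t used, uint64_t total) {
      if (total == 0)
        return false;
      double usage = double(used) / double(total);
      if (!evicting && usage >= hwm)
        evicting = true;    // crossed the HWM: start an eviction pass
      else if (evicting && usage <= lwm)
        evicting = false;   // reached the LWM: done until next time
      return evicting;
    }
  };

  // Each time should_evict() returns true, the watcher would pick an object,
  // erasure-code it (e.g. with zfec), write the shards to the cool pool via
  // CRUSH/librados, and replace the hot-pool copy with the redirect token.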
Re: RBD Read performance
Ok this is getting interesting. rados -p pool bench 300 write --no-cleanup Total time run: 301.103933 Total writes made: 22477 Write size: 4194304 Bandwidth (MB/sec): 298.595 Stddev Bandwidth: 171.941 Max bandwidth (MB/sec): 832 Min bandwidth (MB/sec): 8 Average Latency:0.214295 Stddev Latency: 0.405511 Max latency:3.26323 Min latency:0.019429 rados -p pool bench 300 seq Total time run:76.634659 Total reads made: 22477 Read size:4194304 Bandwidth (MB/sec):1173.203 Average Latency: 0.054539 Max latency: 0.937036 Min latency: 0.018132 So the writes on the rados bench are slower than we have achieved with dd and were slower on the back-end file store as well. But the reads are great. We could see 1~1.5GB/s on the back-end as well. So we started doing some other tests to see if it was in RBD or the VFS layer in the kernel.. And things got weird. So using CephFS: root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=10 10+0 records in 10+0 records out 10737418240 bytes (11 GB) copied, 7.28658 s, 1.5 GB/s [root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=20 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 20.6105 s, 1.0 GB/s [root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=40 40+0 records in 40+0 records out 42949672960 bytes (43 GB) copied, 53.4013 s, 804 MB/s [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=4 iflag=direct 4+0 records in 4+0 records out 4294967296 bytes (4.3 GB) copied, 23.1572 s, 185 MB/s [root@dogbreath ~]# [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=4 4+0 records in 4+0 records out 4294967296 bytes (4.3 GB) copied, 1.20258 s, 3.6 GB/s [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=20 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 5.40589 s, 4.0 GB/s [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40 40+0 records in 40+0 records out 42949672960 bytes (43 GB) copied, 10.4781 s, 4.1 GB/s [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40 ^C24+0 records in 23+0 records out 24696061952 bytes (25 GB) copied, 56.8824 s, 434 MB/s [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40 40+0 records in 40+0 records out 42949672960 bytes (43 GB) copied, 113.542 s, 378 MB/s [root@dogbreath ~]# So about the same, when we were not hitting cache. So we decided to just hit the RBD block device with no FS on it.. 
Welcome to weirdsville root@ty3:~# umount /test-rbd-fs root@ty3:~# root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=4 4+0 records in 4+0 records out 4294967296 bytes (4.3 GB) copied, 18.6603 s, 230 MB/s root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=4 iflag=direct 4+0 records in 4+0 records out 4294967296 bytes (4.3 GB) copied, 1.13584 s, 3.8 GB/s root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 4.61028 s, 4.7 GB/s root@ty3:~# echo 1 /proc/sys/vm/drop_caches root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 4.43416 s, 4.8 GB/s root@ty3:~# echo 1 /proc/sys/vm/drop_caches root@ty3:~# root@ty3:~# root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 5.07426 s, 4.2 GB/s root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=40 iflag=direct 40+0 records in 40+0 records out 42949672960 bytes (43 GB) copied, 8.60885 s, 5.0 GB/s root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=80 iflag=direct 80+0 records in 80+0 records out 85899345920 bytes (86 GB) copied, 18.4305 s, 4.7 GB/s root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 91.5546 s, 235 MB/s root@ty3:~# So.. we just started reading from the block device. And the numbers were well.. Faster than the QDR IB can do TCP/IP. So we figured local caching. So we dropped caches and ramped up to bigger than ram. (ram is 24GB) and it got faster. So we went to 3x ram.. and it was a bit slower.. Oh also the whole time we were doing these tests, the back-end disk was seeing no I/O at all.. We were dropping caches on the OSD's as well, but even if it was caching at the OSD end, the IB link is only QDR and we aren't doing RDMA so. Yeah..No idea what is going on here... On 19/04/13 10:40, Mark Nelson wrote: On 04/18/2013 07:27 PM, Malcolm Haak wrote: Morning all, Did the echos on all boxes involved... and the results are in.. [root@dogbreath ~]# [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=1 iflag=direct 1+0 records in 1+0 records out 4194304 bytes