[RESEND][PATCH 0/2] fix few root xattr bugs
The first patch fixes a bug that causes an MDS crash while setting or removing xattrs on the root directory. The second patch fixes another bug where root xattrs are not correctly logged in the MDS journal.

Kuan Kai Chiu (2):
  mds: fix setting/removing xattrs on root
  mds: journal the projected root xattrs in add_root()

 src/mds/Server.cc          | 6 ++----
 src/mds/events/EMetaBlob.h | 2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)

-- 
1.7.9.5
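For reference, a minimal way to exercise the paths the first patch touches (a sketch only; it assumes a CephFS client mounted at /mnt/cephfs, and the attribute name is arbitrary) is to set, read back and remove a user xattr directly on the root of the mounted filesystem:

  setfattr -n user.test -v somevalue /mnt/cephfs
  getfattr -n user.test /mnt/cephfs
  setfattr -x user.test /mnt/cephfs

Without the fix, these setxattr/removexattr requests against the root inode are what trigger the MDS crash described above.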
[PATCH 1/2] mds: fix setting/removing xattrs on root
MDS crashes while journaling dirty root inode in handle_client_setxattr and handle_client_removexattr. We should use journal_dirty_inode to safely log root inode here.

Signed-off-by: Kuan Kai Chiu <big.c...@bigtera.com>
---
 src/mds/Server.cc | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 11ab834..1e62dd2 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -3907,8 +3907,7 @@ void Server::handle_client_setxattr(MDRequest *mdr)
   mdlog->start_entry(le);
   le->metablob.add_client_req(req->get_reqid(), req->get_oldest_client_tid());
   mdcache->predirty_journal_parents(mdr, &le->metablob, cur, 0, PREDIRTY_PRIMARY, false);
-  mdcache->journal_cow_inode(mdr, &le->metablob, cur);
-  le->metablob.add_primary_dentry(cur->get_projected_parent_dn(), true, cur);
+  mdcache->journal_dirty_inode(mdr, &le->metablob, cur);
 
   journal_and_reply(mdr, cur, 0, le, new C_MDS_inode_update_finish(mds, mdr, cur));
 }
@@ -3964,8 +3963,7 @@ void Server::handle_client_removexattr(MDRequest *mdr)
   mdlog->start_entry(le);
   le->metablob.add_client_req(req->get_reqid(), req->get_oldest_client_tid());
   mdcache->predirty_journal_parents(mdr, &le->metablob, cur, 0, PREDIRTY_PRIMARY, false);
-  mdcache->journal_cow_inode(mdr, &le->metablob, cur);
-  le->metablob.add_primary_dentry(cur->get_projected_parent_dn(), true, cur);
+  mdcache->journal_dirty_inode(mdr, &le->metablob, cur);
 
   journal_and_reply(mdr, cur, 0, le, new C_MDS_inode_update_finish(mds, mdr, cur));
 }
-- 
1.7.9.5
[PATCH 2/2] mds: journal the projected root xattrs in add_root()
In EMetaBlob::add_root(), we should log the projected root xattrs instead of original ones to reflect xattr changes.

Signed-off-by: Kuan Kai Chiu <big.c...@bigtera.com>
---
 src/mds/events/EMetaBlob.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/mds/events/EMetaBlob.h b/src/mds/events/EMetaBlob.h
index 7065460..439bd78 100644
--- a/src/mds/events/EMetaBlob.h
+++ b/src/mds/events/EMetaBlob.h
@@ -468,7 +468,7 @@ private:
     if (!pi) pi = in->get_projected_inode();
     if (!pdft) pdft = &in->dirfragtree;
-    if (!px) px = &in->xattrs;
+    if (!px) px = in->get_projected_xattrs();
 
     bufferlist snapbl;
     if (psnapbl)
-- 
1.7.9.5
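A quick way to check the journaling side of this (a sketch, assuming a CephFS mount at /mnt/cephfs on a test cluster whose MDS you can freely restart) is to set an xattr on the root, restart the MDS so the value has to come back from journal replay, and read it back:

  setfattr -n user.test -v somevalue /mnt/cephfs
  # restart the active MDS (e.g. /etc/init.d/ceph restart mds) and remount the client, then:
  getfattr -n user.test /mnt/cephfs

Without this patch the xattr is gone after the restart, since only the original (non-projected) xattrs were logged.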
Re: RBD Read performance
On 04/17/2013 11:35 PM, Malcolm Haak wrote: Hi all, Hi Malcolm! I jumped into the IRC channel yesterday and they said to email ceph-devel. I have been having some read performance issues. With Reads being slower than writes by a factor of ~5-8. I recently saw this kind of behaviour (writes were fine, but reads were terrible) on an IPoIB based cluster and it was caused by the same TCP auto tune issues that Jim Schutt saw last year. It's worth a try at least to see if it helps. echo 0 /proc/sys/net/ipv4/tcp_moderate_rcvbuf on all of the clients and server nodes should be enough to test it out. Sage added an option in more recent Ceph builds that lets you work around it too. First info: Server SLES 11 SP2 Ceph 0.56.4. 12 OSD's that are Hardware Raid 5 each of the twelve is made from 5 NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s stream write and the same if not better read) Connected via 2xQDR IB OSD's/MDS and such all on same box (for testing) Box is a Quad AMD Opteron 6234 Ram is 256Gb 10GB Journals osd_op_theads: 8 osd_disk_threads:2 Filestore_op_threads:4 OSD's are all XFS Interesting setup! QUAD socket Opteron boxes have somewhat slow and slightly oversubscribed hypertransport links don't they? I wonder if on a system with so many disks and QDR-IB if that could become a problem... We typically like smaller nodes where we can reasonably do 1 OSD per drive, but we've tested on a couple of 60 drive chassis in RAID configs too. Should be interesting to hear what kind of aggregate performance you can eventually get. All nodes are connected via QDR IB using IP_O_IB. We get 1.7GB/s on TCP performance tests between the nodes. Clients: One is FC17 the other us Ubuntu 12.10 they only have around 32GB-70GB ram. We ran into an odd issue were the OSD's would all start in the same NUMA node and pretty much on the same processor core. We fixed that up with some cpuset magic. Strange! Was that more due to cpuset or Ceph? I can't imagine that we are doing anything that would cause that. Performance testing we have done: (Note oflag=direct was yielding results within 5% of cached results) root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200 3200+0 records in 3200+0 records out 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s root@ty3:~# root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800 4800+0 records in 4800+0 records out 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=2400 2400+0 records in 2400+0 records out 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=9600 9600+0 records in 9600+0 records out 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s Both clients each doing a 140GB write (2x dogbreath's RAM) at the same time to two different rbds in the same pool. root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 412.404 s, 356 MB/s root@ty3:~# [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 433.351 s, 339 MB/s [root@dogbreath ~]# Onto reads... 
Also we found that doing iflag=direct increased read performance. [root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M count=160 160+0 records in 160+0 records out 1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s [root@dogbreath ~]# [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=1 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 382.334 s, 110 MB/s [root@dogbreath ~]# [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=1 iflag=direct 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 150.774 s, 278 MB/s [root@dogbreath ~]# So what info do you want/where do I start hunting for my wumpus? might also be worth looking at the size of the reads to see if there's a lot of fragmentation. Also, is this kernel rbd or qemu-kvm? Regards Malcolm Haak -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
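For anyone wanting to try the autotuning workaround mentioned above, the test amounts to disabling receive-buffer moderation on every node (both clients and the OSD box), dropping caches, and re-running one of the read tests; a sketch, reusing the same test file:

  echo 0 > /proc/sys/net/ipv4/tcp_moderate_rcvbuf
  echo 1 > /proc/sys/vm/drop_caches
  dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=10000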
Re: [PATCH] mds: fix setting/removing xattrs on root
I didn't notice the bug. Guessing it was hidden because CephFS had been accessed by other daemons in my test environment. Thank you for the hint! The signed-off patches are resent, also including your fix. On Wed, Apr 17, 2013 at 4:06 AM, Gregory Farnum g...@inktank.com wrote: On Mon, Apr 15, 2013 at 3:23 AM, Kuan Kai Chiu big.c...@bigtera.com wrote: MDS crashes while journaling dirty root inode in handle_client_setxattr and handle_client_removexattr. We should use journal_dirty_inode to safely log root inode here. --- src/mds/Server.cc |6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/src/mds/Server.cc b/src/mds/Server.cc index 11ab834..1e62dd2 100644 --- a/src/mds/Server.cc +++ b/src/mds/Server.cc @@ -3907,8 +3907,7 @@ void Server::handle_client_setxattr(MDRequest *mdr) mdlog-start_entry(le); le-metablob.add_client_req(req-get_reqid(), req-get_oldest_client_tid()); mdcache-predirty_journal_parents(mdr, le-metablob, cur, 0, PREDIRTY_PRIMARY, false); - mdcache-journal_cow_inode(mdr, le-metablob, cur); - le-metablob.add_primary_dentry(cur-get_projected_parent_dn(), true, cur); + mdcache-journal_dirty_inode(mdr, le-metablob, cur); journal_and_reply(mdr, cur, 0, le, new C_MDS_inode_update_finish(mds, mdr, cur)); } @@ -3964,8 +3963,7 @@ void Server::handle_client_removexattr(MDRequest *mdr) mdlog-start_entry(le); le-metablob.add_client_req(req-get_reqid(), req-get_oldest_client_tid()); mdcache-predirty_journal_parents(mdr, le-metablob, cur, 0, PREDIRTY_PRIMARY, false); - mdcache-journal_cow_inode(mdr, le-metablob, cur); - le-metablob.add_primary_dentry(cur-get_projected_parent_dn(), true, cur); + mdcache-journal_dirty_inode(mdr, le-metablob, cur); journal_and_reply(mdr, cur, 0, le, new C_MDS_inode_update_finish(mds, mdr, cur)); } This is fine as far as it goes, but we'll need your sign-off for us to incorporate it into the codebase. Also, have you run any tests with it? The reason I ask is that when I apply this patch, set an xattr on the root inode, and then restart the MDS and client, there are no xattrs on the root any more. I think this should fix that, but there may be other such issues: diff --git a/src/mds/events/EMetaBlob.h b/src/mds/events/EMetaBlob.h index 7065460..439bd78 100644 --- a/src/mds/events/EMetaBlob.h +++ b/src/mds/events/EMetaBlob.h @@ -468,7 +468,7 @@ private: if (!pi) pi = in-get_projected_inode(); if (!pdft) pdft = in-dirfragtree; -if (!px) px = in-xattrs; +if (!px) px = in-get_projected_xattrs(); bufferlist snapbl; if (psnapbl) You've fallen victim to this new setup, incidentally — in the past the root inode wasn't allowed to get any of these modifications because it's not quite real in the way the rest of them are. We opened that up when we made the virtual xattr interface, but we weren't very careful about it so apparently we missed some side effects. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RBD Read performance
Hi Mark! Thanks for the quick reply! I'll reply inline below. On 18/04/13 17:04, Mark Nelson wrote: On 04/17/2013 11:35 PM, Malcolm Haak wrote: Hi all, Hi Malcolm! I jumped into the IRC channel yesterday and they said to email ceph-devel. I have been having some read performance issues. With Reads being slower than writes by a factor of ~5-8. I recently saw this kind of behaviour (writes were fine, but reads were terrible) on an IPoIB based cluster and it was caused by the same TCP auto tune issues that Jim Schutt saw last year. It's worth a try at least to see if it helps. echo 0 /proc/sys/net/ipv4/tcp_moderate_rcvbuf on all of the clients and server nodes should be enough to test it out. Sage added an option in more recent Ceph builds that lets you work around it too. Awesome I will test this first up tomorrow. First info: Server SLES 11 SP2 Ceph 0.56.4. 12 OSD's that are Hardware Raid 5 each of the twelve is made from 5 NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s stream write and the same if not better read) Connected via 2xQDR IB OSD's/MDS and such all on same box (for testing) Box is a Quad AMD Opteron 6234 Ram is 256Gb 10GB Journals osd_op_theads: 8 osd_disk_threads:2 Filestore_op_threads:4 OSD's are all XFS Interesting setup! QUAD socket Opteron boxes have somewhat slow and slightly oversubscribed hypertransport links don't they? I wonder if on a system with so many disks and QDR-IB if that could become a problem... We typically like smaller nodes where we can reasonably do 1 OSD per drive, but we've tested on a couple of 60 drive chassis in RAID configs too. Should be interesting to hear what kind of aggregate performance you can eventually get. We are also going to try this out with 6 luns on a dual xeon box. The Opteron box was the biggest scariest thing we had that was doing nothing. All nodes are connected via QDR IB using IP_O_IB. We get 1.7GB/s on TCP performance tests between the nodes. Clients: One is FC17 the other us Ubuntu 12.10 they only have around 32GB-70GB ram. We ran into an odd issue were the OSD's would all start in the same NUMA node and pretty much on the same processor core. We fixed that up with some cpuset magic. Strange! Was that more due to cpuset or Ceph? I can't imagine that we are doing anything that would cause that. More than likely it is an odd quirk in the SLES kernel.. but when I have time I'll do some more poking. We were seeing insane CPU usage on some cores because all the OSD's were piled up in one place. Performance testing we have done: (Note oflag=direct was yielding results within 5% of cached results) root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200 3200+0 records in 3200+0 records out 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s root@ty3:~# root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800 4800+0 records in 4800+0 records out 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=2400 2400+0 records in 2400+0 records out 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=9600 9600+0 records in 9600+0 records out 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s Both clients each doing a 140GB write (2x dogbreath's RAM) at the same time to two different rbds in the same pool. 
root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 412.404 s, 356 MB/s root@ty3:~# [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 433.351 s, 339 MB/s [root@dogbreath ~]# Onto reads... Also we found that doing iflag=direct increased read performance. [root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M count=160 160+0 records in 160+0 records out 1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s [root@dogbreath ~]# [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=1 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 382.334 s, 110 MB/s [root@dogbreath ~]# [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=1 iflag=direct 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 150.774 s, 278 MB/s [root@dogbreath ~]# So what info do you want/where do I start hunting for my wumpus? might also be worth looking at the size of the reads to see if there's a lot of fragmentation. Also, is this kernel rbd or
poor write performance
I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance at 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing.

My setup, approximately, is:

Two OSD's
. 1 x 7200RPM SATA disk each
. 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch)
. 1 x gigabit public network
. journal on another spindle

Three MON's
. 1 each on the OSD's
. 1 on another server, which is also the one used for testing performance

I'm using debian packages from ceph which are version 0.56.4.

For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators, which run Xen on top of (C)LVM volumes on top of the iSCSI. Performance is not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on.

Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far...

Thanks

James
Re: poor write performance
Hi James, This is just pure speculation, but can you assure that the bonding works correctly? Maybe you have issues there. I have seen a lot of incorrectly configured bonding throughout my life as unix admin. Maybe this could help you a little: http://www.wogri.at/Port-Channeling-802-3ad.338.0.html On 04/18/2013 01:46 PM, James Harper wrote: I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing. My setup, approximately, is: Two OSD's . 1 x 7200RPM SATA disk each . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch) . 1 x gigabit public network . journal on another spindle Three MON's . 1 each on the OSD's . 1 on another server, which is also the one used for testing performance I'm using debian packages from ceph which are version 0.56.4 For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far... Thanks James -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- DI (FH) Wolfgang Hennerbichler Software Development Unit Advanced Computing Technologies RISC Software GmbH A company of the Johannes Kepler University Linz IT-Center Softwarepark 35 4232 Hagenberg Austria Phone: +43 7236 3343 245 Fax: +43 7236 3343 250 wolfgang.hennerbich...@risc-software.at http://www.risc-software.at -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
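If it helps, a quick first check on the Linux side (a sketch, assuming the bond interface is named bond0) is to look at the kernel's view of the bond and the per-slave traffic counters, to confirm the bonding mode is what you expect and that both slaves are actually carrying traffic:

  cat /proc/net/bonding/bond0
  ip -s link show bond0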
[PATCH 7/7, v2] rbd: issue stat request before layered write
(Since this hasn't been reviewed I have updated it slightly. I rebased the series onto the current testing branch. They are all available in the review/wip-4679-3 in the ceph-client git repository. I also made some minor changes in the definition of rbd_img_obj_exists_callback()). This is a step toward fully implementing layered writes. Add checks before request submission for the object(s) associated with an image request. For write requests, if we don't know that the target object exists, issue a STAT request to find out. When that request completes, mark the known and exists flags for the original object request accordingly and re-submit the object request. (Note that this still does the existence check only; the copyup operation is not yet done.) A new object request is created to perform the existence check. A pointer to the original request is added to that object request to allow the stat request to re-issue the original request after updating its flags. If there is a failure with the stat request the error code is stored with the original request, which is then completed. This resolves: http://tracker.ceph.com/issues/3418 Signed-off-by: Alex Elder el...@inktank.com --- v2: rebased to testing; small cleanup in rbd_img_obj_exists_callback() drivers/block/rbd.c | 163 --- 1 file changed, 155 insertions(+), 8 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index b1b8ef8..ce2fb3a 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -183,9 +183,31 @@ struct rbd_obj_request { u64 length; /* bytes from offset */ unsigned long flags; - struct rbd_img_request *img_request; - u64 img_offset; /* image relative offset */ - struct list_headlinks; /* img_request-obj_requests */ + /* +* An object request associated with an image will have its +* img_data flag set; a standlone object request will not. +* +* A standalone object request will have which == BAD_WHICH +* and a null obj_request pointer. +* +* An object request initiated in support of a layered image +* object (to check for its existence before a write) will +* have which == BAD_WHICH and a non-null obj_request pointer. +* +* Finally, an object request for rbd image data will have +* which != BAD_WHICH, and will have a non-null img_request +* pointer. The value of which will be in the range +* 0..(img_request-obj_request_count-1). +*/ + union { + struct rbd_obj_request *obj_request; /* STAT op */ + struct { + struct rbd_img_request *img_request; + u64 img_offset; + /* links for img_request-obj_requests list */ + struct list_headlinks; + }; + }; u32 which; /* posn image request list */ enum obj_request_type type; @@ -1656,10 +1678,6 @@ static struct rbd_img_request *rbd_img_request_create( INIT_LIST_HEAD(img_request-obj_requests); kref_init(img_request-kref); - (void) obj_request_existence_set; - (void) obj_request_known_test; - (void) obj_request_exists_test; - rbd_img_request_get(img_request); /* Avoid a warning */ rbd_img_request_put(img_request); /* TEMPORARY */ @@ -1847,18 +1865,147 @@ out_unwind: return -ENOMEM; } +static void rbd_img_obj_exists_callback(struct rbd_obj_request *obj_request) +{ + struct rbd_device *rbd_dev; + struct ceph_osd_client *osdc; + struct rbd_obj_request *orig_request; + int result; + + rbd_assert(!obj_request_img_data_test(obj_request)); + + /* +* All we need from the object request is the original +* request and the result of the STAT op. Grab those, then +* we're done with the request. 
+*/ + orig_request = obj_request-obj_request; + obj_request-obj_request = NULL; + rbd_assert(orig_request); + rbd_assert(orig_request-img_request); + + result = obj_request-result; + obj_request-result = 0; + + dout(%s: obj %p for obj %p result %d %llu/%llu\n, __func__, + obj_request, orig_request, result, + obj_request-xferred, obj_request-length); + rbd_obj_request_put(obj_request); + + rbd_assert(orig_request); + rbd_assert(orig_request-img_request); + rbd_dev = orig_request-img_request-rbd_dev; + osdc = rbd_dev-rbd_client-client-osdc; + + /* +* Our only purpose here is to determine whether the object +* exists, and we don't want to treat the non-existence as +* an error. If something else comes back, transfer the +
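For context, the kind of setup this series is building toward is a kernel-mapped clone of a protected snapshot, where a write may target an object that has not yet been copied up from the parent; a sketch of the userspace side (pool and image names are arbitrary):

  rbd create --size 1024 --format 2 rbd/parent
  rbd snap create rbd/parent@snap1
  rbd snap protect rbd/parent@snap1
  rbd clone rbd/parent@snap1 rbd/child
  rbd map rbd/child

With this patch, a write to the mapped child first issues the STAT existence check on the target object before the original request is re-submitted.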
[PATCH V2] radosgw: receiving unexpected error code while accessing a non-existing object by an authorized non-owner user
This patch fixes a bug in the radosgw Swift compatibility code: if an authorized user who is not the owner accesses a non-existing object in a container, they receive an unexpected error code. To reproduce the bug, do the following steps:

1. User1 creates a container, and grants the read/write permission to user2

   curl -X PUT -i -k -H "X-Auth-Token: $user1_token" $url/$container
   curl -X POST -i -k -H "X-Auth-Token: $user1_token" -H "X-Container-Read: $user2" -H "X-Container-Write: $user2" $url/$container

2. User2 queries the object 'obj' in the newly created container with a HEAD request; note that the container is currently empty

   curl -X HEAD -i -k -H "X-Auth-Token: $user2_token" $url/$container/obj

3. The response received by user2 is '401 Authorization Required', rather than the expected '404 Not Found'; the details are as follows,

   HTTP/1.1 401 Authorization Required
   Date: Tue, 16 Apr 2013 01:52:49 GMT
   Server: Apache/2.2.22 (Ubuntu)
   Accept-Ranges: bytes
   Content-Length: 12
   Vary: Accept-Encoding
   Content-Type: text/plain; charset=utf-8

Signed-off-by: Yunchuan Wen <yunchuan...@ubuntukylin.com>
Signed-off-by: Li Wang <liw...@ubuntukylin.com>
---
 src/rgw/rgw_op.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/rgw/rgw_op.cc b/src/rgw/rgw_op.cc
index d2fbeeb..ef6448c 100644
--- a/src/rgw/rgw_op.cc
+++ b/src/rgw/rgw_op.cc
@@ -268,7 +268,7 @@ static int read_policy(RGWRados *store, struct req_state *s, RGWBucketInfo& buck
     return ret;
   string owner = bucket_policy.get_owner().get_id();
   if (owner.compare(s->user.user_id) != 0 &&
-      !bucket_policy.verify_permission(s->user.user_id, s->perm_mask, RGW_PERM_READ))
+      !bucket_policy.verify_permission(s->user.user_id, s->perm_mask, RGW_PERM_READ) && !bucket_policy.verify_permission(s->user.user_id, RGW_PERM_READ_OBJS, RGW_PERM_READ_OBJS))
     ret = -EACCES;
   else
     ret = -ENOENT;
-- 
1.7.9.5
Re: poor write performance
On 04/18/2013 06:46 AM, James Harper wrote: I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing. My setup, approximately, is: Two OSD's . 1 x 7200RPM SATA disk each . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch) . 1 x gigabit public network . journal on another spindle Three MON's . 1 each on the OSD's . 1 on another server, which is also the one used for testing performance I'm using debian packages from ceph which are version 0.56.4 For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far... Hi James! Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you troubles? If you are using the kernel version of RBD, we don't have any kind of cache implemented there and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD repeatedly and then moving on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance. Here's a couple of thoughts: 1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now. 2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS. 3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store all may be helpful. Thanks James Good luck! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
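To expand on point 3, the usual approach (a sketch; the pool name, paths and hosts are examples, not required values) is to test each layer in isolation before blaming RBD itself:

  # raw speed of the OSD data disk and journal spindle
  dd if=/dev/zero of=/path/on/osd/disk/testfile bs=4M count=1000 oflag=direct
  # network between client and OSD hosts
  iperf -s                # on the OSD host
  iperf -c osd-host       # on the client
  # the object store directly, bypassing RBD
  rados -p rbd bench 30 write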
Re: test osd on zfs
On Thu, 18 Apr 2013, Stefan Priebe - Profihost AG wrote: On 17.04.2013 at 23:14, Brian Behlendorf <behlendo...@llnl.gov> wrote: On 04/17/2013 01:16 PM, Mark Nelson wrote: I'll let Brian talk about the virtues of ZFS, I think the virtues of ZFS have been discussed at length in various other forums. But in short it brings some nice functionality to the table which may be useful to ceph, and that's worth exploring.

Sure, I know about the advantages of zfs. I just thought about how ceph can benefit. Right now I've no idea. The osds should be single disks, so zpool/zraid does not matter. Ceph does its own scrubbing and checksumming, and unlike with btrfs, ceph does not know how to use snapshots with zfs. That's why I'm asking.

The main things that come to mind:

- zfs checksumming
- ceph can eventually use zfs snapshots similarly to how it uses btrfs snapshots to create stable checkpoints as journal reference points, allowing parallel (instead of writeahead) journaling
- can use raidz beneath a single ceph-osd for better reliability (e.g., 2x * raidz instead of 3x replication)

ZFS doesn't have a clone function that we can use to enable efficient cephfs/rbd/rados snaps, but maybe this will motivate someone to implement one. :)

sage
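For the record, the raidz-under-one-OSD idea from the list above looks roughly like this with ZFS on Linux (a sketch; device names, pool name and mount point are placeholders):

  zpool create osd0 raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
  zfs create -o mountpoint=/var/lib/ceph/osd/ceph-0 osd0/data

so a single ceph-osd sits on a pool that already survives a drive failure, rather than relying purely on 3x replication across OSDs.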
Xen blktap driver for Ceph RBD : Anybody wants to test ? :p
Hi,

I've been working on a blktap driver that allows access to Ceph RBD block devices without relying on the RBD kernel driver, and it has finally reached a point where it works and is testable.

Some of the advantages are:
- Easier to update to a newer RBD version
- Allows functionality only available in the userspace RBD library (write cache, layering, ...)
- Fewer issues when you have an OSD as a domU on the same dom0
- Crashes are contained to user space :p (they shouldn't happen, but ...)

It's still an early prototype, but if you want to give it a shot, feedback is welcome. You can find the code at https://github.com/smunaut/blktap/tree/rbd (rbd branch).

Currently the username, pool name and image name are hardcoded ... (look for FIXME in the code). I'll get to that next, once I've figured out the best format for arguments.

Cheers,

Sylvain
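For anyone who wants to give it a spin, grabbing the branch is just (URL and branch as above; the hardcoded credentials still need editing before it will talk to your cluster):

  git clone https://github.com/smunaut/blktap.git
  cd blktap
  git checkout rbd
  grep -rn FIXME .    # find the hardcoded user/pool/image names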
Re: [PATCH] Swift ACL .rlistings support
Sorry for the late response, this somehow went through the cracks. The main issue that I see with this patch is that it introduces a new bit for object listing that is not really needed. You just need to set the RGW_PERM_READ on the bucket. This way setting this flag through swift you'd be able to see it via S3. Is there any compelling reason not to do so? Thanks, Yehuda On Tue, Apr 2, 2013 at 7:07 AM, Li Wang liw...@ubuntukylin.com wrote: This patch implements the Swift ACL .rlistings for Radosgw, it should be seamlessly compatible with earlier version as well as S3. Signed-off-by: Yunchuan Wen yunchuan...@ubuntukylin.com Signed-off-by: Li Wang liw...@ubuntukylin.com --- src/rgw/rgw_acl.cc |3 +++ src/rgw/rgw_acl.h| 19 ++- src/rgw/rgw_acl_swift.cc | 14 ++ src/rgw/rgw_op.cc|2 +- 4 files changed, 32 insertions(+), 6 deletions(-) diff --git a/src/rgw/rgw_acl.cc b/src/rgw/rgw_acl.cc index 1a90649..d6255e1 100644 --- a/src/rgw/rgw_acl.cc +++ b/src/rgw/rgw_acl.cc @@ -96,6 +96,9 @@ bool RGWAccessControlPolicy::verify_permission(string uid, int user_perm_mask, int policy_perm = get_perm(uid, test_perm); + if (policy_perm RGW_PERM_READ) { +policy_perm |= (test_perm RGW_PERM_READ_LIST); + } /* the swift WRITE_OBJS perm is equivalent to the WRITE obj, just convert those bits. Note that these bits will only be set on buckets, so the swift READ permission on bucket will allow listing diff --git a/src/rgw/rgw_acl.h b/src/rgw/rgw_acl.h index c06e9eb..6374413 100644 --- a/src/rgw/rgw_acl.h +++ b/src/rgw/rgw_acl.h @@ -15,11 +15,15 @@ using namespace std; #define RGW_PERM_WRITE 0x02 #define RGW_PERM_READ_ACP0x04 #define RGW_PERM_WRITE_ACP 0x08 -#define RGW_PERM_READ_OBJS 0x10 -#define RGW_PERM_WRITE_OBJS 0x20 +#define RGW_PERM_READ_OBJS 0x10 // Swift read +#define RGW_PERM_WRITE_OBJS 0x20 // Swift write +#define RGW_PERM_READ_LIST 0x40 // Swift .rlistings #define RGW_PERM_FULL_CONTROL( RGW_PERM_READ | RGW_PERM_WRITE | \ + RGW_PERM_READ_ACP | RGW_PERM_WRITE_ACP | \ + RGW_PERM_READ_LIST ) +#define RGW_PERM_ALL_S3 ( RGW_PERM_READ | RGW_PERM_WRITE | \ RGW_PERM_READ_ACP | RGW_PERM_WRITE_ACP ) -#define RGW_PERM_ALL_S3 RGW_PERM_FULL_CONTROL + enum ACLGranteeTypeEnum { /* numbers are encoded, should not change */ @@ -47,13 +51,18 @@ public: void set_permissions(int perm) { flags = perm; } void encode(bufferlist bl) const { -ENCODE_START(2, 2, bl); +ENCODE_START(3, 2, bl); ::encode(flags, bl); ENCODE_FINISH(bl); } void decode(bufferlist::iterator bl) { -DECODE_START_LEGACY_COMPAT_LEN(2, 2, 2, bl); +DECODE_START_LEGACY_COMPAT_LEN(3, 2, 2, bl); ::decode(flags, bl); +if (struct_v = 2) { + ACLGrant grant; + grant.set_group(ACL_GROUP_ALL_USERS, RGW_PERM_READ_LIST); + acl.add_grant(grant); +} DECODE_FINISH(bl); } void dump(Formatter *f) const; diff --git a/src/rgw/rgw_acl_swift.cc b/src/rgw/rgw_acl_swift.cc index b02ce90..af5f804 100644 --- a/src/rgw/rgw_acl_swift.cc +++ b/src/rgw/rgw_acl_swift.cc @@ -15,6 +15,7 @@ using namespace std; #define SWIFT_PERM_WRITE RGW_PERM_WRITE_OBJS #define SWIFT_GROUP_ALL_USERS .r:* +#define SWIFT_GROUP_LIST .rlistings static int parse_list(string uid_list, vectorstring uids) { @@ -54,6 +55,11 @@ static bool uid_is_public(string uid) sub.compare(.referrer) == 0; } +static bool uid_is_list(string uid) +{ + return uid.compare(SWIFT_GROUP_LIST) == 0; +} + void RGWAccessControlPolicy_SWIFT::add_grants(RGWRados *store, vectorstring uids, int perm) { vectorstring::iterator iter; @@ -64,6 +70,9 @@ void RGWAccessControlPolicy_SWIFT::add_grants(RGWRados *store, vectorstring u if (uid_is_public(uid)) 
{ grant.set_group(ACL_GROUP_ALL_USERS, perm); acl.add_grant(grant); +} else if ((perm SWIFT_PERM_READ) (uid_is_list(uid))) { + grant.set_group(ACL_GROUP_ALL_USERS, RGW_PERM_READ_LIST); + acl.add_grant(grant); } else if (rgw_get_user_info_by_uid(store, uid, grant_user) 0) { ldout(cct, 10) grant user does not exist: uid dendl; /* skipping silently */ @@ -116,6 +125,11 @@ void RGWAccessControlPolicy_SWIFT::to_str(string read, string write) if (grant.get_group() != ACL_GROUP_ALL_USERS) continue; id = SWIFT_GROUP_ALL_USERS; + if (perm RGW_PERM_READ_LIST) { +if (!read.empty()) + read.append(, ); +read.append(SWIFT_GROUP_LIST); + } } if (perm SWIFT_PERM_READ) { if (!read.empty()) diff --git a/src/rgw/rgw_op.cc b/src/rgw/rgw_op.cc index 43415d4..5c4d95a 100644 --- a/src/rgw/rgw_op.cc +++
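For reference, the Swift-side usage this patch is aiming at (a sketch; $token, $url and $container are placeholders as in the other examples on this list) is making a container listable by anonymous readers:

  curl -X POST -i -k -H "X-Auth-Token: $token" -H "X-Container-Read: .r:*,.rlistings" $url/$container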
Re: poor write performance
On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson mark.nel...@inktank.com wrote: On 04/18/2013 06:46 AM, James Harper wrote: I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing. My setup, approximately, is: Two OSD's . 1 x 7200RPM SATA disk each . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch) . 1 x gigabit public network . journal on another spindle Three MON's . 1 each on the OSD's . 1 on another server, which is also the one used for testing performance I'm using debian packages from ceph which are version 0.56.4 For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far... Hi James! Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you troubles? If you are using the kernel version of RBD, we don't have any kind of cache implemented there and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD repeatedly and then moving on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance. Here's a couple of thoughts: 1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now. 2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS. Can you point into related commits, if possible? 3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store all may be helpful. Thanks James Good luck! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: poor write performance
On 04/18/2013 11:46 AM, Andrey Korolyov wrote: On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson mark.nel...@inktank.com wrote: On 04/18/2013 06:46 AM, James Harper wrote: I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing. My setup, approximately, is: Two OSD's . 1 x 7200RPM SATA disk each . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch) . 1 x gigabit public network . journal on another spindle Three MON's . 1 each on the OSD's . 1 on another server, which is also the one used for testing performance I'm using debian packages from ceph which are version 0.56.4 For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far... Hi James! Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you troubles? If you are using the kernel version of RBD, we don't have any kind of cache implemented there and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD repeatedly and then moving on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance. Here's a couple of thoughts: 1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now. 2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS. Can you point into related commits, if possible? here you go: http://tracker.ceph.com/projects/ceph/repository/revisions/188f3ea6867eeb6e950f6efed18d53ff17522bbc 3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store all may be helpful. Thanks James Good luck! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil s...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil s...@inktank.com wrote: We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this time on the read side: - ceph-fuse queues an invalidate (in a separate thread) - kernel initiates a read - invalidate blocks in kernel, waiting on a page lock - the read blocks in ceph-fuse Now, assuming we're reading the stack traces properly, this is more or less what we see with writes, except with reads, and the obvious don't block the read would resolve it. But! If that is the only way to avoid deadlock, I'm afraid it is difficult to implement reliable cache invalidation at all. The reason we are invalidating is because the server told us to: we are no longer allowed to do reads and cached data is invalid. The obvious approach is to 1- stop processing new reads 2- let in-progress reads complete 3- invalidate the cache 4- ack to server ...but that will deadlock as above, as any new read will lock pages before blcoking. If we don't block, then the read may repopulate pages we just invalidated. We could 1- invalidate 2- if any reads happened while we were invalidating, goto 1 3- ack but then we risk starvation and livelock. How do other people solve this problem? It seems like another upcall that would let you block new reads (and/or writes) from starting while the invalidate is in progress would do the trick, but I'm not convinced I'm not missing something much simpler. Do you really need to call fuse_lowlevel_notify_inval_inode() while still holding the mutex in cfuse? It should be sufficient if you - 0 - Receive inval request from server 1 - mutex_lock() in cfuse 2 - invalidate cfuse cache 3 - mutex_unlock() in cfuse 4 - fuse_lowlevel_notify_inval_inode() 5 - ack to server The only necessary ordering seems to be 0-[2,4]-5. Placing 4 within the mutex boundaries looks unnecessary and self-imposed. In-progress reads which took the page lock before fuse_lowlevel_notify_inval_inode() would either read data cached in cfuse (in case they reached the cache before 1), or get sent over to server as though data was never cached. There wouldn't be a livelock either. Did I miss something? It's the concurrent reads I'm concerned about: 3.5 - read(2) is called, locks some pages, and sends a message through the fuse connection 3.9 or 4.1 - ceph-fuse gets the reads request. It can either handle it, repopulating a region of the page cache it possibly just partially invalidated (rendering the invalidate a failure), I think the problem lies here, that handling the read before 5 is returning old data (which effectively renders the invalidation a failure). Step 0 needs to guarantee that new data about to be written is already staged in the server and made available for read. However the write request itself needs to be blocked from completing till step 5 from all other clients completes. It sounds like you're thinking of a weaker consistency model. The mds is telling us to stop reads and invalidate our cache *before* anybody is allowed to write. At the end of this process, we ack, we should have an empty page cache for this file and any reads should be blocked until further notice. or block, possibly preventing the invalidate from ever completing. Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the server should be prepared to serve staged data even before Step 0. 
Invalidating might be *delayed* till the in-progress reads finishes. But that only delays the completion of write(), but no deadlocks anywhere. Hope I din't miss something :-) You can ignore the mutex_lock stuff; I don't think it's necessary to see the issue. Simply consider a racing read (that has pages locked) and invalidate (that is walking through the address_space mapping locking and discarding pages). We can't block the read without risking deadlock with the invalidate, and we can't continue with the read without making the invalidate unsuccessful/unreliable. We can actually do reads at this point from the ceph client vs server perspective since we haven't acked the revocation yet.. but with the fuse vs ceph-fuse interaction we are choosing betweeen deadlock or potential livelock (if we do the read and then try the invalidate a second time). sage 4.2 - invalidate either completes (having possibly missed some just-read pages),
Re: [RESEND][PATCH 0/2] fix few root xattr bugs
Thanks! I merged these into next (going to be Cuttlefish) in commits f379ce37bfdcb3670f52ef47c02787f82e50e612 and 87634d882fda80c4a2e3705c83a38bdfd613763f. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Apr 17, 2013 at 11:43 PM, Kuan Kai Chiu big.c...@bigtera.com wrote: The first patch fixes a bug that causes MDS crash while setting or removing xattrs on root directory. The second patch fixes another bug that root xattrs not correctly logged in MDS journal. Kuan Kai Chiu (2): mds: fix setting/removing xattrs on root mds: journal the projected root xattrs in add_root() src/mds/Server.cc |6 ++ src/mds/events/EMetaBlob.h |2 +- 2 files changed, 3 insertions(+), 5 deletions(-) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
On Apr 18, 2013, at 10:05 AM, Sage Weil s...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil s...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil s...@inktank.com wrote: We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this time on the read side: - ceph-fuse queues an invalidate (in a separate thread) - kernel initiates a read - invalidate blocks in kernel, waiting on a page lock - the read blocks in ceph-fuse Now, assuming we're reading the stack traces properly, this is more or less what we see with writes, except with reads, and the obvious don't block the read would resolve it. But! If that is the only way to avoid deadlock, I'm afraid it is difficult to implement reliable cache invalidation at all. The reason we are invalidating is because the server told us to: we are no longer allowed to do reads and cached data is invalid. The obvious approach is to 1- stop processing new reads 2- let in-progress reads complete 3- invalidate the cache 4- ack to server ...but that will deadlock as above, as any new read will lock pages before blcoking. If we don't block, then the read may repopulate pages we just invalidated. We could 1- invalidate 2- if any reads happened while we were invalidating, goto 1 3- ack but then we risk starvation and livelock. How do other people solve this problem? It seems like another upcall that would let you block new reads (and/or writes) from starting while the invalidate is in progress would do the trick, but I'm not convinced I'm not missing something much simpler. Do you really need to call fuse_lowlevel_notify_inval_inode() while still holding the mutex in cfuse? It should be sufficient if you - 0 - Receive inval request from server 1 - mutex_lock() in cfuse 2 - invalidate cfuse cache 3 - mutex_unlock() in cfuse 4 - fuse_lowlevel_notify_inval_inode() 5 - ack to server The only necessary ordering seems to be 0-[2,4]-5. Placing 4 within the mutex boundaries looks unnecessary and self-imposed. In-progress reads which took the page lock before fuse_lowlevel_notify_inval_inode() would either read data cached in cfuse (in case they reached the cache before 1), or get sent over to server as though data was never cached. There wouldn't be a livelock either. Did I miss something? It's the concurrent reads I'm concerned about: 3.5 - read(2) is called, locks some pages, and sends a message through the fuse connection 3.9 or 4.1 - ceph-fuse gets the reads request. It can either handle it, repopulating a region of the page cache it possibly just partially invalidated (rendering the invalidate a failure), I think the problem lies here, that handling the read before 5 is returning old data (which effectively renders the invalidation a failure). Step 0 needs to guarantee that new data about to be written is already staged in the server and made available for read. However the write request itself needs to be blocked from completing till step 5 from all other clients completes. It sounds like you're thinking of a weaker consistency model. The mds is telling us to stop reads and invalidate our cache *before* anybody is allowed to write. At the end of this process, we ack, we should have an empty page cache for this file and any reads should be blocked until further notice. Yes, the consistency model I was talking about is weaker than block new reads, purge all cache, wait until further notified. 
If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version. or block, possibly preventing the invalidate from ever completing. Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the server should be prepared to serve staged data even before Step 0. Invalidating might be *delayed* till the in-progress reads finishes. But that only delays the completion of write(), but no deadlocks anywhere. Hope I din't miss something :-) You can ignore the mutex_lock stuff; I don't think it's necessary to see the issue. Simply consider a racing read (that has pages locked) and invalidate (that is walking through the address_space mapping locking and discarding pages). We can't block the read without risking deadlock with the invalidate, and we can't continue with the read without making the invalidate unsuccessful/unreliable. We can actually do reads at this point from the ceph client vs server perspective since we haven't acked the revocation yet.. but with the fuse vs ceph-fuse interaction we are choosing betweeen deadlock or potential livelock (if we do the
Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
On Thu, 18 Apr 2013, Anand Avati wrote: On Apr 18, 2013, at 10:05 AM, Sage Weil s...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil s...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil s...@inktank.com wrote: We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this time on the read side: - ceph-fuse queues an invalidate (in a separate thread) - kernel initiates a read - invalidate blocks in kernel, waiting on a page lock - the read blocks in ceph-fuse Now, assuming we're reading the stack traces properly, this is more or less what we see with writes, except with reads, and the obvious don't block the read would resolve it. But! If that is the only way to avoid deadlock, I'm afraid it is difficult to implement reliable cache invalidation at all. The reason we are invalidating is because the server told us to: we are no longer allowed to do reads and cached data is invalid. The obvious approach is to 1- stop processing new reads 2- let in-progress reads complete 3- invalidate the cache 4- ack to server ...but that will deadlock as above, as any new read will lock pages before blcoking. If we don't block, then the read may repopulate pages we just invalidated. We could 1- invalidate 2- if any reads happened while we were invalidating, goto 1 3- ack but then we risk starvation and livelock. How do other people solve this problem? It seems like another upcall that would let you block new reads (and/or writes) from starting while the invalidate is in progress would do the trick, but I'm not convinced I'm not missing something much simpler. Do you really need to call fuse_lowlevel_notify_inval_inode() while still holding the mutex in cfuse? It should be sufficient if you - 0 - Receive inval request from server 1 - mutex_lock() in cfuse 2 - invalidate cfuse cache 3 - mutex_unlock() in cfuse 4 - fuse_lowlevel_notify_inval_inode() 5 - ack to server The only necessary ordering seems to be 0-[2,4]-5. Placing 4 within the mutex boundaries looks unnecessary and self-imposed. In-progress reads which took the page lock before fuse_lowlevel_notify_inval_inode() would either read data cached in cfuse (in case they reached the cache before 1), or get sent over to server as though data was never cached. There wouldn't be a livelock either. Did I miss something? It's the concurrent reads I'm concerned about: 3.5 - read(2) is called, locks some pages, and sends a message through the fuse connection 3.9 or 4.1 - ceph-fuse gets the reads request. It can either handle it, repopulating a region of the page cache it possibly just partially invalidated (rendering the invalidate a failure), I think the problem lies here, that handling the read before 5 is returning old data (which effectively renders the invalidation a failure). Step 0 needs to guarantee that new data about to be written is already staged in the server and made available for read. However the write request itself needs to be blocked from completing till step 5 from all other clients completes. It sounds like you're thinking of a weaker consistency model. The mds is telling us to stop reads and invalidate our cache *before* anybody is allowed to write. At the end of this process, we ack, we should have an empty page cache for this file and any reads should be blocked until further notice. Yes, the consistency model I was talking about is weaker than block new reads, purge all cache, wait until further notified. 
If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version. or block, possibly preventing the invalidate from ever completing. Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the server should be prepared to serve staged data even before Step 0. Invalidating might be *delayed* till the in-progress reads finishes. But that only delays the completion of write(), but no deadlocks anywhere. Hope I din't miss something :-) You can ignore the mutex_lock stuff; I don't think it's necessary to see the issue. Simply consider a racing read (that has pages locked) and invalidate (that is walking through the address_space mapping locking and discarding pages). We can't block the read without risking deadlock with the invalidate, and we can't continue with the read without making the invalidate unsuccessful/unreliable. We can actually do reads at this point from the ceph client vs server perspective
Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
On 04/18/2013 12:12 PM, Sage Weil wrote: On Thu, 18 Apr 2013, Anand Avati wrote: On Apr 18, 2013, at 10:05 AM, Sage Weils...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 9:45 PM, Sage Weils...@inktank.com wrote: On Wed, 17 Apr 2013, Anand Avati wrote: On Wed, Apr 17, 2013 at 5:43 PM, Sage Weils...@inktank.com wrote: We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this time on the read side: - ceph-fuse queues an invalidate (in a separate thread) - kernel initiates a read - invalidate blocks in kernel, waiting on a page lock - the read blocks in ceph-fuse Now, assuming we're reading the stack traces properly, this is more or less what we see with writes, except with reads, and the obvious don't block the read would resolve it. But! If that is the only way to avoid deadlock, I'm afraid it is difficult to implement reliable cache invalidation at all. The reason we are invalidating is because the server told us to: we are no longer allowed to do reads and cached data is invalid. The obvious approach is to 1- stop processing new reads 2- let in-progress reads complete 3- invalidate the cache 4- ack to server ...but that will deadlock as above, as any new read will lock pages before blcoking. If we don't block, then the read may repopulate pages we just invalidated. We could 1- invalidate 2- if any reads happened while we were invalidating, goto 1 3- ack but then we risk starvation and livelock. How do other people solve this problem? It seems like another upcall that would let you block new reads (and/or writes) from starting while the invalidate is in progress would do the trick, but I'm not convinced I'm not missing something much simpler. Do you really need to call fuse_lowlevel_notify_inval_inode() while still holding the mutex in cfuse? It should be sufficient if you - 0 - Receive inval request from server 1 - mutex_lock() in cfuse 2 - invalidate cfuse cache 3 - mutex_unlock() in cfuse 4 - fuse_lowlevel_notify_inval_inode() 5 - ack to server The only necessary ordering seems to be 0-[2,4]-5. Placing 4 within the mutex boundaries looks unnecessary and self-imposed. In-progress reads which took the page lock before fuse_lowlevel_notify_inval_inode() would either read data cached in cfuse (in case they reached the cache before 1), or get sent over to server as though data was never cached. There wouldn't be a livelock either. Did I miss something? It's the concurrent reads I'm concerned about: 3.5 - read(2) is called, locks some pages, and sends a message through the fuse connection 3.9 or 4.1 - ceph-fuse gets the reads request. It can either handle it, repopulating a region of the page cache it possibly just partially invalidated (rendering the invalidate a failure), I think the problem lies here, that handling the read before 5 is returning old data (which effectively renders the invalidation a failure). Step 0 needs to guarantee that new data about to be written is already staged in the server and made available for read. However the write request itself needs to be blocked from completing till step 5 from all other clients completes. It sounds like you're thinking of a weaker consistency model. The mds is telling us to stop reads and invalidate our cache *before* anybody is allowed to write. At the end of this process, we ack, we should have an empty page cache for this file and any reads should be blocked until further notice. 
Yes, the consistency model I was talking about is weaker than block new reads, purge all cache, wait until further notified. If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version.

or block, possibly preventing the invalidate from ever completing.

Hmm, where would it block? Didn't we mutex_unlock() in 3 already? and the server should be prepared to serve staged data even before Step 0. Invalidating might be *delayed* till the in-progress reads finish. But that only delays the completion of write(), but no deadlocks anywhere. Hope I didn't miss something :-)

You can ignore the mutex_lock stuff; I don't think it's necessary to see the issue. Simply consider a racing read (that has pages locked) and invalidate (that is walking through the address_space mapping locking and discarding pages). We can't block the read without risking deadlock with the invalidate, and we can't continue with the read without making the invalidate unsuccessful/unreliable. We can actually do reads at this point from the ceph client vs server perspective since we haven't acked the revocation yet.. but with the fuse vs ceph-fuse interaction we are choosing between deadlock or potential livelock (if
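For reference, the 0-5 ordering being debated above looks roughly like the sketch below. This is illustrative only and not the actual ceph-fuse code: it assumes the libfuse 2.x lowlevel API, a pthread mutex guarding the cfuse object cache, and leaves the cache drop and the MDS ack as comments.

  #define FUSE_USE_VERSION 26
  #include <fuse/fuse_lowlevel.h>
  #include <pthread.h>

  static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

  // Called when the MDS revokes our caps for [off, off+len) of inode 'ino'.
  void on_cap_revoke(struct fuse_chan *ch, fuse_ino_t ino, off_t off, off_t len)
  {
    pthread_mutex_lock(&cache_lock);     // 1: take the cfuse cache lock
    // 2: drop the client-side cached extents for this range here
    pthread_mutex_unlock(&cache_lock);   // 3: release before the upcall

    // 4: ask the kernel to drop its page cache for the same range; this is
    //    the call that can block on a page lock held by an in-flight read(2)
    fuse_lowlevel_notify_inval_inode(ch, ino, off, len);

    // 5: only now ack the revocation back to the MDS
  }

Even with 4 outside the mutex, the race Sage describes remains: a read(2) that locks pages and reaches ceph-fuse between steps 3 and 5 can still be serviced and repopulate pages that step 4 is trying to discard, which is exactly the deadlock-or-livelock choice discussed above.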
erasure coding (sorry)
sorry to bring this up again, googling revealed some people don't like the subject [anymore]. but I'm working on a new +- 3PB cluster for storage of immutable files. and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now)

For this use case, my impression is erasure coding would make a lot of sense (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files. inbound traffic would be minimal)

I know that the answer a while ago was no plans to implement erasure coding, has this changed? if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.

thanks, Dieter
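To make the tradeoff concrete (example figures only, not a recommendation for any particular layout): with 3x replication, 3 PB of usable data needs roughly 9 PB of raw disk. With a k=10, m=4 erasure code the overhead is (k+m)/k = 1.4, i.e. roughly 4.2 PB of raw disk, while still surviving the loss of any 4 of the 14 shards. The cost shows up on reads and repairs: a 150 MB file stored as 10 data shards of 15 MB plus 4 parity shards must be reassembled from at least 10 of them, so every read fans out across 10 disks/hosts, and rebuilding a single lost shard means re-reading k shards.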
Re: erasure coding (sorry)
On Thu, 18 Apr 2013, Plaetinck, Dieter wrote: sorry to bring this up again, googling revealed some people don't like the subject [anymore]. but I'm working on a new +- 3PB cluster for storage of immutable files. and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now) For this use case, my impression is erasure coding would make a lot of sense (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files. inbound traffic would be minimal) I know that the answer a while ago was no plans to implement erasure coding, has this changed? if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.

We would love to do it, but it is not a priority at the moment (things like multi-site replication are in much higher demand). That of course doesn't prevent someone outside of Inktank from working on it :)

The main caveat is that it will be complicated. For an initial implementation, the full breadth of the rados API probably wouldn't be supported for erasure/parity encoded pools (things like rados classes and the omap key/value api get tricky when you start talking about parity). But for many (or even most) use cases, objects are just bytes, and those restrictions are just fine.

sage
Re: erasure coding (sorry)
On 04/18/2013 04:08 PM, Josh Durgin wrote: On 04/18/2013 01:47 PM, Sage Weil wrote: On Thu, 18 Apr 2013, Plaetinck, Dieter wrote: sorry to bring this up again, googling revealed some people don't like the subject [anymore]. but I'm working on a new +- 3PB cluster for storage of immutable files. and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now) For this use case, my impression is erasure coding would make a lot of sense (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files. inbound traffic would be minimal) I know that the answer a while ago was no plans to implement erasure coding, has this changed? if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.

We would love to do it, but it is not a priority at the moment (things like multi-site replication are in much higher demand). That of course doesn't prevent someone outside of Inktank from working on it :) The main caveat is that it will be complicated. For an initial implementation, the full breadth of the rados API probably wouldn't be supported for erasure/parity encoded pools (things like rados classes and the omap key/value api get tricky when you start talking about parity). But for many (or even most) use cases, objects are just bytes, and those restrictions are just fine.

I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. When the osd detected this marker, it would proxy the request to the erasure coding thread/process which would service the request on the second pool for reads, and potentially make writes move the data back to the first pool in a tiering sort of scenario. I might have misremembered some details, but I think it's an interesting way to get many of the benefits of erasure coding with a relatively small amount of work compared to a fully native osd solution.

Josh

Neat. :)

[1] http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
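To make the marker idea concrete, here is a rough client/proxy-side sketch using librados. Everything in it is illustrative: the pool names, the "redirect.pool" xattr convention, and read_and_decode_shards() are invented for the example, and error handling is minimal.

  #include <rados/librados.hpp>
  #include <string>

  // Hypothetical helper: fetch the erasure-coded shards for 'oid' from the
  // cold pool and decode them into 'out' (not implemented here).
  int read_and_decode_shards(librados::IoCtx &cold, const std::string &oid,
                             librados::bufferlist &out, size_t len, uint64_t off);

  int proxied_read(librados::Rados &cluster, const std::string &oid,
                   librados::bufferlist &out, size_t len, uint64_t off)
  {
    librados::IoCtx hot;
    int r = cluster.ioctx_create("hot", hot);
    if (r < 0)
      return r;

    librados::bufferlist target;
    if (hot.getxattr(oid, "redirect.pool", target) < 0) {
      // No marker: the object still lives in the hot pool, read it directly.
      return hot.read(oid, out, len, off);
    }

    // Marker found: the object was erasure coded into another pool, so hand
    // the request to the proxy path that reads and decodes the shards.
    librados::IoCtx cold;
    std::string pool(target.c_str(), target.length());
    r = cluster.ioctx_create(pool.c_str(), cold);
    if (r < 0)
      return r;
    return read_and_decode_shards(cold, oid, out, len, off);
  }

Sage's redirect/symlink idea further down the thread would push this check into the OSD itself, so unmodified clients would not need a proxy at all.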
Re: erasure coding (sorry)
On Apr 18, 2013, at 2:08 PM, Josh Durgin josh.dur...@inktank.com wrote: I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool.

This sounds at a high-level similar to work out of Microsoft: https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf

The basic idea is to replicate first, then erasure code in the background.

- Noah
Re: erasure coding (sorry)
On Thu, 18 Apr 2013, Noah Watkins wrote: On Apr 18, 2013, at 2:08 PM, Josh Durgin josh.dur...@inktank.com wrote: I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. This sounds at a high-level similar to work out of Microsoft: https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf The basic idea is to replicate first, then erasure code in the background.

FWIW, I think a useful (and generic) concept to add to rados would be a redirect symlink sort of thing that says oh, this object is over there in that other pool, such that client requests will be transparently redirected or proxied. This will enable generic tiering type operations, and probably simplify/enable migration without a lot of additional complexity on the client side.

sage
RE: poor write performance
Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far...

Hi James! Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you troubles? If you are using the kernel version of RBD, we don't have any kind of cache implemented there and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD repeatedly and then moving on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance.

Using dd I tried various block sizes. With 4kb I was getting around 500kbytes/second rate. With 1MB I was getting a few mbytes/second. Read performance seems great though.

Here's a couple of thoughts: 1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now.

I'm using Xen so that won't work for me right now, although I did notice someone posted some blktap code to support ceph. I'm trying a windows restore of a physical machine into a VM under Xen and performance matches what I am seeing with dd - very very slow.

2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS.

Do packages for this exist for Debian? At the moment my sources.list contains ceph.com/debian-bobtail wheezy main.

3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store all may be helpful.

I tried that and while the write worked the seq test always said I had to do a write test first.

While running my Xen restore, /var/log/ceph/ceph.log looks like:
pgmap v18316: 832 pgs: 832 active+clean; 61443 MB data, 119 GB used, 1742 GB / 1862 GB avail; 824KB/s wr, 12op/s
pgmap v18317: 832 pgs: 832 active+clean; 61446 MB data, 119 GB used, 1742 GB / 1862 GB avail; 649KB/s wr, 10op/s
pgmap v18318: 832 pgs: 832 active+clean; 61449 MB data, 119 GB used, 1742 GB / 1862 GB avail; 652KB/s wr, 10op/s
pgmap v18319: 832 pgs: 832 active+clean; 61452 MB data, 119 GB used, 1742 GB / 1862 GB avail; 614KB/s wr, 9op/s
pgmap v18320: 832 pgs: 832 active+clean; 61454 MB data, 119 GB used, 1742 GB / 1862 GB avail; 537KB/s wr, 8op/s
pgmap v18321: 832 pgs: 832 active+clean; 61457 MB data, 119 GB used, 1742 GB / 1862 GB avail; 511KB/s wr, 7op/s

James
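For reference, rados bench's seq test replays reads against the objects left behind by an earlier write test, so the write phase has to be run with --no-cleanup (otherwise the benchmark objects are deleted at the end and seq has nothing to read), roughly:

  rados -p pool bench 300 write --no-cleanup
  rados -p pool bench 300 seq

The write numbers above are also consistent with the earlier explanation: with the default 4 MB RBD object size, 1024 consecutive 4 KB writes (4 MB / 4 KB) land in the same object on the same primary OSD, and without a client-side cache each one is a separate round trip.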
Re: RBD Read performance
Morning all, Did the echos on all boxes involved... and the results are in.. [root@dogbreath ~]# [root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=1 iflag=direct 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 144.083 s, 291 MB/s [root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=1 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 316.025 s, 133 MB/s [root@dogbreath ~]# No change which is a shame. What other information or testing should I start? Regards Malcolm Haak On 18/04/13 17:22, Malcolm Haak wrote: Hi Mark! Thanks for the quick reply! I'll reply inline below. On 18/04/13 17:04, Mark Nelson wrote: On 04/17/2013 11:35 PM, Malcolm Haak wrote: Hi all, Hi Malcolm! I jumped into the IRC channel yesterday and they said to email ceph-devel. I have been having some read performance issues. With Reads being slower than writes by a factor of ~5-8. I recently saw this kind of behaviour (writes were fine, but reads were terrible) on an IPoIB based cluster and it was caused by the same TCP auto tune issues that Jim Schutt saw last year. It's worth a try at least to see if it helps. echo 0 /proc/sys/net/ipv4/tcp_moderate_rcvbuf on all of the clients and server nodes should be enough to test it out. Sage added an option in more recent Ceph builds that lets you work around it too. Awesome I will test this first up tomorrow. First info: Server SLES 11 SP2 Ceph 0.56.4. 12 OSD's that are Hardware Raid 5 each of the twelve is made from 5 NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s stream write and the same if not better read) Connected via 2xQDR IB OSD's/MDS and such all on same box (for testing) Box is a Quad AMD Opteron 6234 Ram is 256Gb 10GB Journals osd_op_theads: 8 osd_disk_threads:2 Filestore_op_threads:4 OSD's are all XFS Interesting setup! QUAD socket Opteron boxes have somewhat slow and slightly oversubscribed hypertransport links don't they? I wonder if on a system with so many disks and QDR-IB if that could become a problem... We typically like smaller nodes where we can reasonably do 1 OSD per drive, but we've tested on a couple of 60 drive chassis in RAID configs too. Should be interesting to hear what kind of aggregate performance you can eventually get. We are also going to try this out with 6 luns on a dual xeon box. The Opteron box was the biggest scariest thing we had that was doing nothing. All nodes are connected via QDR IB using IP_O_IB. We get 1.7GB/s on TCP performance tests between the nodes. Clients: One is FC17 the other us Ubuntu 12.10 they only have around 32GB-70GB ram. We ran into an odd issue were the OSD's would all start in the same NUMA node and pretty much on the same processor core. We fixed that up with some cpuset magic. Strange! Was that more due to cpuset or Ceph? I can't imagine that we are doing anything that would cause that. More than likely it is an odd quirk in the SLES kernel.. but when I have time I'll do some more poking. We were seeing insane CPU usage on some cores because all the OSD's were piled up in one place. 
Performance testing we have done: (Note oflag=direct was yielding results within 5% of cached results) root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200 3200+0 records in 3200+0 records out 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s root@ty3:~# root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800 4800+0 records in 4800+0 records out 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=2400 2400+0 records in 2400+0 records out 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=9600 9600+0 records in 9600+0 records out 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s Both clients each doing a 140GB write (2x dogbreath's RAM) at the same time to two different rbds in the same pool. root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 412.404 s, 356 MB/s root@ty3:~# [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 433.351 s, 339 MB/s [root@dogbreath ~]# Onto reads... Also we found that doing iflag=direct increased read performance. [root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M count=160 160+0 records in 160+0 records out 1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s [root@dogbreath ~]# [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd
Re: erasure coding (sorry)
Supposedly, on 2013-Apr-18, at 14.08 PDT(-0700), someone claiming to be Josh Durgin scribed: On 04/18/2013 01:47 PM, Sage Weil wrote: On Thu, 18 Apr 2013, Plaetinck, Dieter wrote: sorry to bring this up again, googling revealed some people don't like the subject [anymore]. but I'm working on a new +- 3PB cluster for storage of immutable files. and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now) For this use case, my impression is erasure coding would make a lot of sense (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files. inbound traffic would be minimal) I know that the answer a while ago was no plans to implement erasure coding, has this changed? if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.

We would love to do it, but it is not a priority at the moment (things like multi-site replication are in much higher demand). That of course doesn't prevent someone outside of Inktank from working on it :) The main caveat is that it will be complicated. For an initial implementation, the full breadth of the rados API probably wouldn't be supported for erasure/parity encoded pools (things like rados classes and the omap key/value api get tricky when you start talking about parity). But for many (or even most) use cases, objects are just bytes, and those restrictions are just fine.

I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. When the osd detected this marker, it would proxy the request to the erasure coding thread/process which would service the request on the second pool for reads, and potentially make writes move the data back to the first pool in a tiering sort of scenario. I might have misremembered some details, but I think it's an interesting way to get many of the benefits of erasure coding with a relatively small amount of work compared to a fully native osd solution.

Greetings, I'm one of those individuals :) Our thinking is evolving on this, and I think we can keep most of the work out of the main machinery of ceph, and simply require a modified client that runs the proxy function on the hot pool OSDs. Even wondering if it could be prototyped in fuse. I will be writing this up in the next day or two in the blueprint below. Josh has the idea basically correct.

Josh

Christopher

[1] http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
Re: RBD Read performance
On 04/18/2013 07:27 PM, Malcolm Haak wrote: Morning all, Did the echos on all boxes involved... and the results are in.. [root@dogbreath ~]# [root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=1 iflag=direct 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 144.083 s, 291 MB/s [root@dogbreath ~]# dd if=/todd-rbd-fs/DELETEME of=/dev/null bs=4M count=1 1+0 records in 1+0 records out 4194304 bytes (42 GB) copied, 316.025 s, 133 MB/s [root@dogbreath ~]# Boo! No change which is a shame. What other information or testing should I start? Any chance you can try out a quick rados bench test from the client against the pool for writes and reads and see how that works? rados -p pool bench 300 write --no-cleanup rados -p pool bench 300 seq Regards Malcolm Haak On 18/04/13 17:22, Malcolm Haak wrote: Hi Mark! Thanks for the quick reply! I'll reply inline below. On 18/04/13 17:04, Mark Nelson wrote: On 04/17/2013 11:35 PM, Malcolm Haak wrote: Hi all, Hi Malcolm! I jumped into the IRC channel yesterday and they said to email ceph-devel. I have been having some read performance issues. With Reads being slower than writes by a factor of ~5-8. I recently saw this kind of behaviour (writes were fine, but reads were terrible) on an IPoIB based cluster and it was caused by the same TCP auto tune issues that Jim Schutt saw last year. It's worth a try at least to see if it helps. echo 0 /proc/sys/net/ipv4/tcp_moderate_rcvbuf on all of the clients and server nodes should be enough to test it out. Sage added an option in more recent Ceph builds that lets you work around it too. Awesome I will test this first up tomorrow. First info: Server SLES 11 SP2 Ceph 0.56.4. 12 OSD's that are Hardware Raid 5 each of the twelve is made from 5 NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s stream write and the same if not better read) Connected via 2xQDR IB OSD's/MDS and such all on same box (for testing) Box is a Quad AMD Opteron 6234 Ram is 256Gb 10GB Journals osd_op_theads: 8 osd_disk_threads:2 Filestore_op_threads:4 OSD's are all XFS Interesting setup! QUAD socket Opteron boxes have somewhat slow and slightly oversubscribed hypertransport links don't they? I wonder if on a system with so many disks and QDR-IB if that could become a problem... We typically like smaller nodes where we can reasonably do 1 OSD per drive, but we've tested on a couple of 60 drive chassis in RAID configs too. Should be interesting to hear what kind of aggregate performance you can eventually get. We are also going to try this out with 6 luns on a dual xeon box. The Opteron box was the biggest scariest thing we had that was doing nothing. All nodes are connected via QDR IB using IP_O_IB. We get 1.7GB/s on TCP performance tests between the nodes. Clients: One is FC17 the other us Ubuntu 12.10 they only have around 32GB-70GB ram. We ran into an odd issue were the OSD's would all start in the same NUMA node and pretty much on the same processor core. We fixed that up with some cpuset magic. Strange! Was that more due to cpuset or Ceph? I can't imagine that we are doing anything that would cause that. More than likely it is an odd quirk in the SLES kernel.. but when I have time I'll do some more poking. We were seeing insane CPU usage on some cores because all the OSD's were piled up in one place. 
Performance testing we have done: (Note oflag=direct was yielding results within 5% of cached results) root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200 3200+0 records in 3200+0 records out 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s root@ty3:~# root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800 4800+0 records in 4800+0 records out 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=2400 2400+0 records in 2400+0 records out 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=9600 9600+0 records in 9600+0 records out 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s Both clients each doing a 140GB write (2x dogbreath's RAM) at the same time to two different rbds in the same pool. root@ty3:~# rm /test-rbd-fs/DELETEME root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 412.404 s, 356 MB/s root@ty3:~# [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=14000 14000+0 records in 14000+0 records out 14680064 bytes (147 GB) copied, 433.351 s, 339 MB/s [root@dogbreath ~]# Onto reads... Also we found that doing iflag=direct increased read performance.
Re: erasure coding (sorry)
Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed: On Thu, 18 Apr 2013 16:09:52 -0500 Mark Nelson mark.nel...@inktank.com wrote: On 04/18/2013 04:08 PM, Josh Durgin wrote: On 04/18/2013 01:47 PM, Sage Weil wrote: On Thu, 18 Apr 2013, Plaetinck, Dieter wrote: sorry to bring this up again, googling revealed some people don't like the subject [anymore]. but I'm working on a new +- 3PB cluster for storage of immutable files. and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now) For this use case, my impression is erasure coding would make a lot of sense (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files. inbound traffic would be minimal) I know that the answer a while ago was no plans to implement erasure coding, has this changed? if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc. We would love to do it, but it is not a priority at the moment (things like multi-site replication are in much higher demand). That of course doesn't prevent someone outside of Inktank from working on it :) The main caveat is that it will be complicate. For an initial implementation, the full breadth of the rados API probably wouldn't be support for erasure/parity encoded pools (thinkgs like rados classes and the omap key/value api get tricky when you start talking about parity). But for many (or even most) use cases, objects are just bytes, and those restrictions are just fine. I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the rasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. When the osd detected this marker, it would proxy the request to the erasure coding thread/process which would service the request on the second pool for reads, and potentially make writes move the data back to the first pool in a tiering sort of scenario. I might have misremembered some details, but I think it's an interesting way to get many of the benefits of erasure coding with a relatively small amount of work compared to a fully native osd solution. Josh Neat. :) @Bryan: I did come across cleversafe. all the articles around it seemed promising, but unfortunately it seems everything related to the cleversafe open source project somehow vanished from the internet. (e.g. http://www.cleversafe.org/) quite weird... Yea - in a previous incarnation I looked at cleversafe to do something similar a few years ago. It is odd that the cleversafe.org stuff did disapear. However, tahoe-lafs also does encoding, and their package (zfec) [1] may be leverageable. @Sage: interesting. I thought it would be more relatively simple if one assumes the restriction of immutable files. I'm not familiar with those ceph specifics you're mentioning. When building an erasure codes-based system, maybe there's ways to reuse existing ceph code and/or allow some integration with replication based objects, without aiming for full integration or full support of the rados api, based on some tradeoffs. 
I think this might sit UNDER the rados API. I would certainly want to leverage CRUSH to place the shards, however (great tool, no reason to re-invent the wheel).

@Josh, that sounds like an interesting approach. Too bad that page doesn't contain any information yet :)

Give me time :) - openstack has kept me a bit busy… May also be a factor of design at keyboard :)

Dieter

Christopher

[1] https://tahoe-lafs.org/trac/zfec
Re: erasure coding (sorry)
Supposedly, on 2013-Apr-18, at 14.24 PDT(-0700), someone claiming to be Noah Watkins scribed: On Apr 18, 2013, at 2:08 PM, Josh Durgin josh.dur...@inktank.com wrote: I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. This sounds at a high-level similar to work out of Microsoft:

I've looked at that, and it would be somewhat similar (not completely, but borrow some ideas).

Christopher

https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf The basic idea is to replicate first, then erasure code in the background. - Noah
Re: erasure coding (sorry)
Supposedly, on 2013-Apr-18, at 14.26 PDT(-0700), someone claiming to be Sage Weil scribed: On Thu, 18 Apr 2013, Noah Watkins wrote: On Apr 18, 2013, at 2:08 PM, Josh Durgin josh.dur...@inktank.com wrote: I talked to some folks interested in doing a more limited form of this yesterday. They started a blueprint [1]. One of their ideas was to have erasure coding done by a separate process (or thread perhaps). It would use erasure coding on an object and then use librados to store the erasure-encoded pieces in a separate pool, and finally leave a marker in place of the original object in the first pool. This sounds at a high-level similar to work out of Microsoft: https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf The basic idea is to replicate first, then erasure code in the background.

FWIW, I think a useful (and generic) concept to add to rados would be a redirect symlink sort of thing that says oh, this object is over there in that other pool, such that client requests will be transparently redirected or proxied. This will enable generic tiering type operations, and probably simplify/enable migration without a lot of additional complexity on the client side.

sage

More to come, but I'm starting to think of a union mount of a fuse re-directing overlay. The quick idea: on the hot pool, the OSD's would write to the host FS as usual. However, that FS is actually a light-weight fuse (at least for prototype) fs that passes almost everything right down to the file system. As the OSD hits a capacity HWM, a watcher (an asynchronous process) starts evicting objects from the OSD. It does that by using a modified ceph client that calls zfec and uses CRUSH to place the resulting shards in the cool pool. Once those are committed, it replaces the object in the hot OSD with a special token. This is repeated until a LWM is reached.

When the OSD gets a read request for that object and the fuse shim sees the token, it knows to do a modified client fetch from the cool pool. It returns the resulting object to the original requester and (potentially) stores the object back in the hot OSD (if you want cache-like performance), replacing the token. If necessary, some other object may get, in turn, evicted if the HWM is again breached.

We would also need to modify the repair mechanism for the deep scrub in the cool pool to account for the repair being a reconstitution of an invalid shard, rather than a copy (as there is only one copy of a given shard). I'll get a bit more of a write-up today, hopefully, in the wiki.

Christopher
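As a toy illustration of the HWM/LWM behaviour described above (purely a sketch: the thresholds and names are invented, and the encode/place/token steps themselves are the ones outlined in the mail):

  #include <cstdint>

  // Hysteresis for the eviction watcher: start evicting once usage crosses
  // the high watermark and keep evicting until it drops back below the low
  // watermark.  Example thresholds only.
  struct WatermarkPolicy {
    double hwm = 0.85;      // begin evicting above 85% full
    double lwm = 0.70;      // stop once back under 70%
    bool evicting = false;

    bool should_evict(uint64_t used, uint64_t total) {
      if (total == 0)
        return false;
      double usage = double(used) / double(total);
      if (!evicting && usage >= hwm)
        evicting = true;    // crossed the HWM: start an eviction pass
      else if (evicting && usage <= lwm)
        evicting = false;   // reached the LWM: done until next time
      return evicting;
    }
  };

  // Each time should_evict() returns true, the watcher would pick an object,
  // erasure-code it (e.g. with zfec), write the shards to the cool pool via
  // CRUSH/librados, and replace the hot-pool copy with the redirect token.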
Re: RBD Read performance
Ok this is getting interesting. rados -p pool bench 300 write --no-cleanup Total time run: 301.103933 Total writes made: 22477 Write size: 4194304 Bandwidth (MB/sec): 298.595 Stddev Bandwidth: 171.941 Max bandwidth (MB/sec): 832 Min bandwidth (MB/sec): 8 Average Latency:0.214295 Stddev Latency: 0.405511 Max latency:3.26323 Min latency:0.019429 rados -p pool bench 300 seq Total time run:76.634659 Total reads made: 22477 Read size:4194304 Bandwidth (MB/sec):1173.203 Average Latency: 0.054539 Max latency: 0.937036 Min latency: 0.018132 So the writes on the rados bench are slower than we have achieved with dd and were slower on the back-end file store as well. But the reads are great. We could see 1~1.5GB/s on the back-end as well. So we started doing some other tests to see if it was in RBD or the VFS layer in the kernel.. And things got weird. So using CephFS: root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=10 10+0 records in 10+0 records out 10737418240 bytes (11 GB) copied, 7.28658 s, 1.5 GB/s [root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=20 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 20.6105 s, 1.0 GB/s [root@dogbreath ~]# dd if=/dev/zero of=/test-fs/DELETEME1 bs=1G count=40 40+0 records in 40+0 records out 42949672960 bytes (43 GB) copied, 53.4013 s, 804 MB/s [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=4 iflag=direct 4+0 records in 4+0 records out 4294967296 bytes (4.3 GB) copied, 23.1572 s, 185 MB/s [root@dogbreath ~]# [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=4 4+0 records in 4+0 records out 4294967296 bytes (4.3 GB) copied, 1.20258 s, 3.6 GB/s [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=20 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 5.40589 s, 4.0 GB/s [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40 40+0 records in 40+0 records out 42949672960 bytes (43 GB) copied, 10.4781 s, 4.1 GB/s [root@dogbreath ~]# echo 1 /proc/sys/vm/drop_caches [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40 ^C24+0 records in 23+0 records out 24696061952 bytes (25 GB) copied, 56.8824 s, 434 MB/s [root@dogbreath ~]# dd if=/test-fs/DELETEME1 of=/dev/null bs=1G count=40 40+0 records in 40+0 records out 42949672960 bytes (43 GB) copied, 113.542 s, 378 MB/s [root@dogbreath ~]# So about the same, when we were not hitting cache. So we decided to just hit the RBD block device with no FS on it.. 
Welcome to weirdsville root@ty3:~# umount /test-rbd-fs root@ty3:~# root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=4 4+0 records in 4+0 records out 4294967296 bytes (4.3 GB) copied, 18.6603 s, 230 MB/s root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=4 iflag=direct 4+0 records in 4+0 records out 4294967296 bytes (4.3 GB) copied, 1.13584 s, 3.8 GB/s root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 4.61028 s, 4.7 GB/s root@ty3:~# echo 1 /proc/sys/vm/drop_caches root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 4.43416 s, 4.8 GB/s root@ty3:~# echo 1 /proc/sys/vm/drop_caches root@ty3:~# root@ty3:~# root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 iflag=direct 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 5.07426 s, 4.2 GB/s root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=40 iflag=direct 40+0 records in 40+0 records out 42949672960 bytes (43 GB) copied, 8.60885 s, 5.0 GB/s root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=80 iflag=direct 80+0 records in 80+0 records out 85899345920 bytes (86 GB) copied, 18.4305 s, 4.7 GB/s root@ty3:~# dd if=/dev/rbd1 of=/dev/null bs=1G count=20 20+0 records in 20+0 records out 21474836480 bytes (21 GB) copied, 91.5546 s, 235 MB/s root@ty3:~# So.. we just started reading from the block device. And the numbers were well.. Faster than the QDR IB can do TCP/IP. So we figured local caching. So we dropped caches and ramped up to bigger than ram. (ram is 24GB) and it got faster. So we went to 3x ram.. and it was a bit slower.. Oh also the whole time we were doing these tests, the back-end disk was seeing no I/O at all.. We were dropping caches on the OSD's as well, but even if it was caching at the OSD end, the IB link is only QDR and we aren't doing RDMA so. Yeah..No idea what is going on here... On 19/04/13 10:40, Mark Nelson wrote: On 04/18/2013 07:27 PM, Malcolm Haak wrote: Morning all, Did the echos on all boxes involved... and the results are in.. [root@dogbreath ~]# [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=1 iflag=direct 1+0 records in 1+0 records out 4194304 bytes