Re: Write Replication on Degraded PGs

2013-02-18 Thread Ben Rowland
Hi Sam,

I can still reproduce it.  I'm not clear whether this is actually the
expected behaviour of Ceph: if reads/writes are served by the primary
OSD, and a new primary can't be 'elected' (say due to a network split
between failure domains), is a failure then expected, for consistency
guarantees?  Or am I missing something?  If this is the case, we'll
have to rule out Ceph, as it would not be appropriate for our
use-case.  We need high availability across failure domains, which
could become split from one another by, say, a network failure,
resulting in an incomplete PG.  In that case we still need read
availability.

I tried to enable OSD logging by adding "debug osd = 20" to the [osd]
section of ceph.conf on the requesting machine, but didn't get much
output (see below).  Could the fundamental issue be that the primary
OSD on the other machine is down (intentionally, for our test case)
and no other primary can be elected (as the CRUSH rule demands one OSD
on each host)?  Apologies for any speculation on my part here; any
clarification will help a lot!
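
For reference, this is roughly what I added, plus a runtime
alternative I haven't verified on this version (the "debug ms" line
and the injectargs call are my own guesses, so treat them as a
sketch):

    [osd]
        debug osd = 20
        # added by me, not confirmed to be needed
        debug ms = 1

    # untested here: bump logging on a running osd without a restart
    $ ceph tell osd.0 injectargs '--debug-osd 20 --debug-ms 1'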

2013-02-18 10:11:51.913256 osd.0 10.9.64.61:6801/25064 5 : [WRN] 2
slow requests, 1 included below; oldest blocked for  95.700672 secs
2013-02-18 10:11:51.913290 osd.0 10.9.64.61:6801/25064 6 : [WRN] slow
request 30.976297 seconds old, received at 2013-02-18 10:11:20.936876:
osd_op(client.4345.0:29594
4345.365_91bf7acb-8321-494e-bc79-6ab1625162bc [getxattrs,stat,read
0~524288] 9.5aaf1592) v4 currently reached pg
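
If it helps, I believe I can also dump the in-flight ops on osd.0 via
the admin socket; something like this (the socket path assumes the
default location on my install):

    # sketch only: dump currently blocked/in-flight ops on osd.0
    $ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight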

Thanks,

Ben

On Sat, Feb 16, 2013 at 5:42 PM, Sam Lang sam.l...@inktank.com wrote:
 On Fri, Feb 15, 2013 at 6:29 AM, Ben Rowland ben.rowl...@gmail.com wrote:
 Further to my question about reads on a degraded PG, my tests show
 that indeed reads from rgw fail when not all OSDs in a PG are up, even
 when the data is physically available on an up/in OSD.

 I have a size and min_size of 2 on my pool, and 2 hosts with 2
 OSDs on each.  Crush map is set to write to 1 OSD on each of 2 hosts.
 After successfully writing a file to rgw via host 1, I then stop
 all Ceph services on host 2.  Attempts to read the file I just wrote
 time out after 30 seconds.  Starting Ceph again on host 2 allows reads
 to proceed from host 1 once again.

 I see the following in ceph.log after the read times out:

 2013-02-15 12:04:39.162685 osd.0 10.9.64.61:6802/19242 3 : [WRN] slow
 request 30.461867 seconds old, received at 2013-02-15 12:04:08.700630:
 osd_op(client.4345.0:21511
 4345.365_91bf7acb-8321-494e-bc79-6ab1625162bc [getxattrs,stat,read
 0~524288] 9.5aaf1592 RETRY) v4 currently reached pg

 After stopping Ceph on host 2, ceph -s reports:

health HEALTH_WARN 514 pgs degraded; 16 pgs incomplete; 16 pgs
 stuck inactive; 632 pgs stuck unclean; recovery 44/6804 degraded
 (0.647%)
monmap e1: 1 mons at {a=10.9.64.61:6789/0}, election epoch 1, quorum 0 a
osdmap e155: 4 osds: 2 up, 2 in
 pgmap v4911: 632 pgs: 102 active+remapped, 514 active+degraded, 16
 incomplete; 844 MB data, 5969 MB used, 2280 MB / 8691 MB avail;
 44/6804 degraded (0.647%)
mdsmap e1: 0/0/1 up

 OSD tree just in case:

 # id weight type name up/down reweight
 -1 2 root default
 -3 2 rack unknownrack
 -2 1 host squeezeceph1
 0 1 osd.0 up 1
 2 1 osd.2 up 1
 -4 1 host squeezeceph2
 1 1 osd.1 down 0
 3 0 osd.3 down 0

 Running 'ceph osd map' on both the container and object names says host 1 is
 acting for that PG (not sure if I'm looking at the right pools,
 though):

 $ ceph osd map .rgw.buckets aa94e84a-e720-45e1-8c85-4afa7d0f6b5c

 osdmap e155 pool '.rgw.buckets' (9) object
 'aa94e84a-e720-45e1-8c85-4afa7d0f6b5c' -> pg 9.494717b9 (9.1) -> up
 [0] acting [0]

 $ ceph osd map .rgw 91bf7acb-8321-494e-bc79-6ab1625162bc

 osdmap e155 pool '.rgw' (3) object
 '91bf7acb-8321-494e-bc79-6ab1625162bc' -> pg 3.1db18d16 (3.6) -> up
 [2] acting [2]

 Any thoughts?  It doesn't seem right that taking out a single failure
 domain should cause this degradation.

 Hi Ben,

 Are you still seeing this?  Can you enable osd logging and restart the
 osds on host 1?
 -sam


 Many thanks,

 Ben

 On Thu, Feb 14, 2013 at 11:53 PM, Ben Rowland ben.rowl...@gmail.com wrote:
 On 13 Feb 2013 18:16, Gregory Farnum g...@inktank.com wrote:
 
  On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland ben.rowl...@gmail.com wrote:

 So it sounds from the rest of your post like you'd want to, for each
 pool that RGW uses (it's not just .rgw), run "ceph osd pool set .rgw
 min_size 2" (and likewise for .rgw.buckets, etc.).

 Thanks, that did the trick. When the number of up OSDs is less than
 min_size, writes block for 30s and then return HTTP 500. Ceph honours my
 CRUSH rule in this case: adding more OSDs to only one of two failure
 domains continues to block writes, so all well and good!

  If this is the expected behaviour of Ceph, then it seems to prefer
  write-availability over read-availability (in this case my data is
  only stored on 1 OSD, thus a SPOF).  Is there any way to change this
  trade-off, e.g. as you can in Cassandra with its write quorums?

Re: Write Replication on Degraded PGs

2013-02-15 Thread Ben Rowland
Further to my question about reads on a degraded PG, my tests show
that indeed reads from rgw fail when not all OSDs in a PG are up, even
when the data is physically available on an up/in OSD.

I have a size and min_size of 2 on my pool, and 2 hosts with 2
OSDs on each.  Crush map is set to write to 1 OSD on each of 2 hosts.
After successfully writing a file to rgw via host 1, I then stop
all Ceph services on host 2.  Attempts to read the file I just wrote
time out after 30 seconds.  Starting Ceph again on host 2 allows reads
to proceed from host 1 once again.
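
For context, the relevant rule in my decompiled CRUSH map is along
these lines (the rule name and numbers here are illustrative rather
than copied verbatim), together with the kind of pool settings I mean
by a size/min_size of 2:

    # illustrative rule: one replica per host, chosen across hosts
    rule rgw-across-hosts {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }

    $ ceph osd pool set .rgw.buckets size 2
    $ ceph osd pool set .rgw.buckets min_size 2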

I see the following in ceph.log after the read times out:

2013-02-15 12:04:39.162685 osd.0 10.9.64.61:6802/19242 3 : [WRN] slow
request 30.461867 seconds old, received at 2013-02-15 12:04:08.700630:
osd_op(client.4345.0:21511
4345.365_91bf7acb-8321-494e-bc79-6ab1625162bc [getxattrs,stat,read
0~524288] 9.5aaf1592 RETRY) v4 currently reached pg

After stopping Ceph on host 2, ceph -s reports:

   health HEALTH_WARN 514 pgs degraded; 16 pgs incomplete; 16 pgs
          stuck inactive; 632 pgs stuck unclean; recovery 44/6804 degraded (0.647%)
   monmap e1: 1 mons at {a=10.9.64.61:6789/0}, election epoch 1, quorum 0 a
   osdmap e155: 4 osds: 2 up, 2 in
   pgmap v4911: 632 pgs: 102 active+remapped, 514 active+degraded, 16
          incomplete; 844 MB data, 5969 MB used, 2280 MB / 8691 MB avail;
          44/6804 degraded (0.647%)
   mdsmap e1: 0/0/1 up
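
I can also pull a more detailed health breakdown if that would help;
as far as I know this should list the individual stuck/incomplete PGs
(the grep pattern is just my guess at the interesting lines):

    $ ceph health detail | grep -E 'incomplete|stuck'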

OSD tree just in case:

# id    weight  type name               up/down reweight
-1      2       root default
-3      2           rack unknownrack
-2      1               host squeezeceph1
0       1                   osd.0       up      1
2       1                   osd.2       up      1
-4      1               host squeezeceph2
1       1                   osd.1       down    0
3       0                   osd.3       down    0

Running 'ceph osd map' on both the container and object names says host 1 is
acting for that PG (not sure if I'm looking at the right pools,
though):

$ ceph osd map .rgw.buckets aa94e84a-e720-45e1-8c85-4afa7d0f6b5c

osdmap e155 pool '.rgw.buckets' (9) object
'aa94e84a-e720-45e1-8c85-4afa7d0f6b5c' -> pg 9.494717b9 (9.1) -> up
[0] acting [0]

$ ceph osd map .rgw 91bf7acb-8321-494e-bc79-6ab1625162bc

osdmap e155 pool '.rgw' (3) object
'91bf7acb-8321-494e-bc79-6ab1625162bc' -> pg 3.1db18d16 (3.6) -> up
[2] acting [2]
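
I'm also planning to query one of the affected PGs directly; as far as
I understand, something like this should show which PGs are stuck and
why (the PG id below is just the one from the osd map output above):

    $ ceph pg dump_stuck inactive
    $ ceph pg 9.1 query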

Any thoughts?  It doesn't seem right that taking out a single failure
domain should cause this degradation.

Many thanks,

Ben

On Thu, Feb 14, 2013 at 11:53 PM, Ben Rowland ben.rowl...@gmail.com wrote:
 On 13 Feb 2013 18:16, Gregory Farnum g...@inktank.com wrote:
 
  On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland ben.rowl...@gmail.com wrote:

 So it sounds from the rest of your post like you'd want to, for each
 pool that RGW uses (it's not just .rgw), run "ceph osd pool set .rgw
 min_size 2" (and likewise for .rgw.buckets, etc.).

 Thanks, that did the trick. When the number of up OSDs is less than
 min_size, writes block for 30s and then return HTTP 500. Ceph honours my
 CRUSH rule in this case: adding more OSDs to only one of two failure
 domains continues to block writes, so all well and good!

  If this is the expected behaviour of Ceph, then it seems to prefer
  write-availability over read-availability (in this case my data is
  only stored on 1 OSD, thus a SPOF).  Is there any way to change this
  trade-off, e.g. as you can in Cassandra with its write quorums?

 I'm not quite sure this is describing it correctly — Ceph guarantees
 that anything that's been written to disk will be readable later on,
 and placement groups won't go active if they can't retrieve all data.
 The sort of flexible policies allowed by Cassandra aren't possible
 within Ceph — it is a strictly consistent system.

 Are objects always readable even if a PG is missing some OSDs and
 cannot recover? Example: 2 hosts each with 1 OSD, pool
 min_size is 2, and a CRUSH rule saying to write to both hosts. I
 write a file successfully, then one host goes down and is eventually
 marked 'out'. Is the file readable on the 'up' host (say if I'm
 running rgw there)? What if the up host does not have the primary
 copy?

 Furthermore, if Ceph is strictly consistent, how would it prevent
 possible stale reads? Say, in the 2-host example, the network
 connection died but min_size was set to 1. Would it be possible for
 writes to proceed, say making edits to an existing object? Could
 readers at the other host then see stale data?

 Thanks again in advance,

 Ben


Re: Write Replication on Degraded PGs

2013-02-13 Thread Gregory Farnum
On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland ben.rowl...@gmail.com wrote:
 Hi,

 Apologies that this is a fairly long post, but hopefully all my
 questions are similar (or even invalid!)

 Does Ceph allow writes to proceed if it's not possible to satisfy the
 rules for replica placement across failure domains, as specified in
 the CRUSH map?  For example, if my CRUSH map says to place one replica
 on each of 2 hosts, and no devices are up on one of the hosts, what
 will happen?

As you've discovered, it will write to only the up host. You can
control this behavior by setting the minimum write size on your pools —
ceph osd pool set <pool> min_size <size> — to be the minimum number of
copies you'd like to guarantee are on disk. You can also set the osd
pool default min size parameter on the monitors; it defaults to 0,
which is interpreted as half of the requested size. (So for a size-2
pool it will require at least one copy [d'oh], for a size-5 pool it
will require at least 3 copies, etc.)
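If you want that default for newly created pools, I believe it can
also go in ceph.conf on the monitors, roughly like this (untested
sketch):

    [global]
        # defaults applied to pools created after this is set
        osd pool default size = 2
        osd pool default min size = 2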
So it sounds from the rest of your post like you'd want to, for each
pool that RGW uses (it's not just .rgw), run "ceph osd pool set .rgw
min_size 2" (and likewise for .rgw.buckets, etc.).
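Something like this should do it; I'm listing the pool names from
memory, so check the actual set on your cluster first:

    # list the pools rgw actually created on this cluster
    $ rados lspools
    # then bump min_size on each rgw pool (names below are typical, not exhaustive)
    $ for p in .rgw .rgw.buckets .rgw.control .rgw.gc; do ceph osd pool set $p min_size 2; done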

 From tests on a small cluster, my finding is that Ceph will allow
 writes to proceed in this case, even when there are fewer OSDs in the
 cluster than the replication size.

 If this is the expected behaviour of Ceph, then it seems to prefer
 write-availability over read-availability (in this case my data is
 only stored on 1 OSD, thus a SPOF).  Is there any way to change this
 trade-off, e.g. as you can in Cassandra with its write quorums?

I'm not quite sure this is describing it correctly — Ceph guarantees
that anything that's been written to disk will be readable later on,
and placement groups won't go active if they can't retrieve all data.
The sort of flexible policies allowed by Cassandra aren't possible
within Ceph — it is a strictly consistent system.


 From reading the paper which details the CRUSH replica placement
 algorithm, I understand several concepts:

 - The CRUSH algorithm loops over n replicas, for each descending from
 items in the current set (buckets) until it finds a device
 - CRUSH may reject and reselect items using a modified input for three
 different reasons: if an item has already been selected in the current
 set (a collision), if a device is failed, or if a device is
 overloaded.

 I'm not clear on what will happen if these constraints cannot be met,
 say in the case mentioned above.  Does Ceph store the object once
 only, not meeting the replica size, or does it store the object twice
 on the same OSD somehow?  The latter would violate point 2 above,
 unless the word reselect is appropriate here.

CRUSH has a specified retry size; if it can't meet the constraints
then it spits out whatever replicas it was able to select. The
higher-level system needs to decide what to do with that list. Ceph
chooses to use whatever CRUSH gives back, within the minimum size
constraint specified above and a few other override mechanisms we
don't need to go into here. :)
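If you want to see for yourself what CRUSH hands back for a given rule
and replica count, you can pull the map out of the cluster and run it
through crushtool, roughly like this (exact test options vary a bit
between versions):

    $ ceph osd getcrushmap -o /tmp/crushmap
    # decompile to text if you want to read or edit the rules
    $ crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
    # simulate placements for rule 1 with 2 replicas
    $ crushtool -i /tmp/crushmap --test --rule 1 --num-rep 2 --show-statistics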
-Greg