Re: Write Replication on Degraded PGs
Hi Sam,

I can still reproduce it. I'm not clear whether this is actually the expected behaviour of Ceph: if reads/writes are done at the primary OSD, and a new primary can't be 'elected' (say, due to a net-split between failure domains), is a failure then expected, for consistency guarantees? Or am I missing something? If that is the case, we'll have to rule out Ceph, as it would not be appropriate for our use case. We need high availability across failure domains, which could become split from one another (say, by a network failure), resulting in an incomplete PG. In that case we still need read availability.

I tried to enable osd logging by adding:

    debug osd = 20

to the [osd] section of my ceph.conf on the requesting machine, but didn't get much output (see below). Could the fundamental issue be that the primary OSD on the other machine is down (intentionally, for our test case) and no other primary can be elected (as the CRUSH rule demands one OSD on each host)? Apologies for any speculation on my part here; any clarification will help a lot!

    2013-02-18 10:11:51.913256 osd.0 10.9.64.61:6801/25064 5 : [WRN] 2 slow requests, 1 included below; oldest blocked for 95.700672 secs
    2013-02-18 10:11:51.913290 osd.0 10.9.64.61:6801/25064 6 : [WRN] slow request 30.976297 seconds old, received at 2013-02-18 10:11:20.936876: osd_op(client.4345.0:29594 4345.365_91bf7acb-8321-494e-bc79-6ab1625162bc [getxattrs,stat,read 0~524288] 9.5aaf1592) v4 currently reached pg
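
For completeness, the exact fragment I added was the following. I believe the same level can also be injected at runtime without restarting the daemon (the injectargs syntax below is from memory and may need adjusting for this release), and adding 'debug ms = 1' is an assumption on my part about what might expose the messenger traffic during a net-split:

    [osd]
        debug osd = 20    ; verbose OSD subsystem logging
        ;debug ms = 1     ; messenger-level logging (assumed useful here)

    $ ceph tell osd.0 injectargs '--debug-osd 20'    # repeat per OSD id; syntax from memory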
Thanks,
Ben

On Sat, Feb 16, 2013 at 5:42 PM, Sam Lang sam.l...@inktank.com wrote:

On Fri, Feb 15, 2013 at 6:29 AM, Ben Rowland ben.rowl...@gmail.com wrote:

Further to my question about reads on a degraded PG, my tests show that reads from rgw do indeed fail when not all OSDs in a PG are up, even when the data is physically available on an up/in OSD. I have a size and min_size of 2 on my pool, and 2 hosts with 2 OSDs each. The CRUSH map is set to write to 1 OSD on each of the 2 hosts. After writing a file successfully to rgw via host 1, I then stop all Ceph services on host 2. Attempts to read the file I just wrote time out after 30 seconds. Starting Ceph again on host 2 allows reads to proceed from host 1 once again.

I see the following in ceph.log after the read times out:

    2013-02-15 12:04:39.162685 osd.0 10.9.64.61:6802/19242 3 : [WRN] slow request 30.461867 seconds old, received at 2013-02-15 12:04:08.700630: osd_op(client.4345.0:21511 4345.365_91bf7acb-8321-494e-bc79-6ab1625162bc [getxattrs,stat,read 0~524288] 9.5aaf1592 RETRY) v4 currently reached pg

After stopping Ceph on host 2, 'ceph -s' reports:

    health HEALTH_WARN 514 pgs degraded; 16 pgs incomplete; 16 pgs stuck inactive; 632 pgs stuck unclean; recovery 44/6804 degraded (0.647%)
    monmap e1: 1 mons at {a=10.9.64.61:6789/0}, election epoch 1, quorum 0 a
    osdmap e155: 4 osds: 2 up, 2 in
    pgmap v4911: 632 pgs: 102 active+remapped, 514 active+degraded, 16 incomplete; 844 MB data, 5969 MB used, 2280 MB / 8691 MB avail; 44/6804 degraded (0.647%)
    mdsmap e1: 0/0/1 up

OSD tree, just in case:

    # id  weight  type name                 up/down  reweight
    -1    2       root default
    -3    2         rack unknownrack
    -2    1           host squeezeceph1
    0     1             osd.0               up       1
    2     1             osd.2               up       1
    -4    1           host squeezeceph2
    1     1             osd.1               down     0
    3     0             osd.3               down     0

Running 'ceph osd map' on both the container and object names says host 1 is acting for those PGs (not sure if I'm looking at the right pools, though):

    $ ceph osd map .rgw.buckets aa94e84a-e720-45e1-8c85-4afa7d0f6b5c
    osdmap e155 pool '.rgw.buckets' (9) object 'aa94e84a-e720-45e1-8c85-4afa7d0f6b5c' -> pg 9.494717b9 (9.1) -> up [0] acting [0]

    $ ceph osd map .rgw 91bf7acb-8321-494e-bc79-6ab1625162bc
    osdmap e155 pool '.rgw' (3) object '91bf7acb-8321-494e-bc79-6ab1625162bc' -> pg 3.1db18d16 (3.6) -> up [2] acting [2]

Any thoughts? It doesn't seem right that taking out a single failure domain should cause this degradation.

Hi Ben,

Are you still seeing this? Can you enable osd logging and restart the osds on host 1?
-sam

Many thanks,
Ben

On Thu, Feb 14, 2013 at 11:53 PM, Ben Rowland ben.rowl...@gmail.com wrote:

On 13 Feb 2013 18:16, Gregory Farnum g...@inktank.com wrote:

On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland ben.rowl...@gmail.com wrote:

So it sounds from the rest of your post like you'd want to, for each pool that RGW uses (it's not just .rgw), run 'ceph osd pool set .rgw min_size 2' (and likewise for .rgw.buckets, etc.).

Thanks, that did the trick. When the number of up OSDs is less than min_size, writes block for 30s and then return HTTP 500. Ceph honours my CRUSH rule in this case: adding more OSDs to only one of two failure domains continues to block writes. All well and good!

If this is the expected behaviour of Ceph, then it seems to prefer write-availability over read-availability (in this case my data is only stored on 1 OSD, and is thus a SPOF). Is there any way to change this trade-off, e.g. as you can in Cassandra with its write quorums?
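
For reference, a CRUSH rule that places one replica on each of two hosts, as discussed in this thread, looks roughly like the following in decompiled crushmap syntax. This is a sketch: the rule name is illustrative, though the 'default' root does appear in the OSD tree above:

    rule one_replica_per_host {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }

With one of the two hosts down, 'chooseleaf firstn 0 type host' can only return one up OSD per PG, so a pool with min_size 2 has no way to satisfy its minimum, which is consistent with the incomplete PGs reported above.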
Re: Write Replication on Degraded PGs
Further to my question about reads on a degraded PG, my tests show that reads from rgw do indeed fail when not all OSDs in a PG are up, even when the data is physically available on an up/in OSD. I have a size and min_size of 2 on my pool, and 2 hosts with 2 OSDs each. The CRUSH map is set to write to 1 OSD on each of the 2 hosts. After writing a file successfully to rgw via host 1, I then stop all Ceph services on host 2. Attempts to read the file I just wrote time out after 30 seconds. Starting Ceph again on host 2 allows reads to proceed from host 1 once again.

I see the following in ceph.log after the read times out:

    2013-02-15 12:04:39.162685 osd.0 10.9.64.61:6802/19242 3 : [WRN] slow request 30.461867 seconds old, received at 2013-02-15 12:04:08.700630: osd_op(client.4345.0:21511 4345.365_91bf7acb-8321-494e-bc79-6ab1625162bc [getxattrs,stat,read 0~524288] 9.5aaf1592 RETRY) v4 currently reached pg

After stopping Ceph on host 2, 'ceph -s' reports:

    health HEALTH_WARN 514 pgs degraded; 16 pgs incomplete; 16 pgs stuck inactive; 632 pgs stuck unclean; recovery 44/6804 degraded (0.647%)
    monmap e1: 1 mons at {a=10.9.64.61:6789/0}, election epoch 1, quorum 0 a
    osdmap e155: 4 osds: 2 up, 2 in
    pgmap v4911: 632 pgs: 102 active+remapped, 514 active+degraded, 16 incomplete; 844 MB data, 5969 MB used, 2280 MB / 8691 MB avail; 44/6804 degraded (0.647%)
    mdsmap e1: 0/0/1 up

OSD tree, just in case:

    # id  weight  type name                 up/down  reweight
    -1    2       root default
    -3    2         rack unknownrack
    -2    1           host squeezeceph1
    0     1             osd.0               up       1
    2     1             osd.2               up       1
    -4    1           host squeezeceph2
    1     1             osd.1               down     0
    3     0             osd.3               down     0

Running 'ceph osd map' on both the container and object names says host 1 is acting for those PGs (not sure if I'm looking at the right pools, though):

    $ ceph osd map .rgw.buckets aa94e84a-e720-45e1-8c85-4afa7d0f6b5c
    osdmap e155 pool '.rgw.buckets' (9) object 'aa94e84a-e720-45e1-8c85-4afa7d0f6b5c' -> pg 9.494717b9 (9.1) -> up [0] acting [0]

    $ ceph osd map .rgw 91bf7acb-8321-494e-bc79-6ab1625162bc
    osdmap e155 pool '.rgw' (3) object '91bf7acb-8321-494e-bc79-6ab1625162bc' -> pg 3.1db18d16 (3.6) -> up [2] acting [2]

Any thoughts? It doesn't seem right that taking out a single failure domain should cause this degradation.
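
One way to dig further into why particular PGs are stuck is to query one of them directly. Assuming the pgid from the mapping above is among the incomplete ones (and going from memory on the exact output format), something like:

    $ ceph pg 9.1 query

should print the PG's state, its up/acting sets, and a recovery_state section describing what it is waiting for before it can go active.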

Many thanks,
Ben

On Thu, Feb 14, 2013 at 11:53 PM, Ben Rowland ben.rowl...@gmail.com wrote:

On 13 Feb 2013 18:16, Gregory Farnum g...@inktank.com wrote:

On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland ben.rowl...@gmail.com wrote:

So it sounds from the rest of your post like you'd want to, for each pool that RGW uses (it's not just .rgw), run 'ceph osd pool set .rgw min_size 2' (and likewise for .rgw.buckets, etc.).

Thanks, that did the trick. When the number of up OSDs is less than min_size, writes block for 30s and then return HTTP 500. Ceph honours my CRUSH rule in this case: adding more OSDs to only one of two failure domains continues to block writes. All well and good!

If this is the expected behaviour of Ceph, then it seems to prefer write-availability over read-availability (in this case my data is only stored on 1 OSD, and is thus a SPOF). Is there any way to change this trade-off, e.g. as you can in Cassandra with its write quorums?

I'm not quite sure this is describing it correctly. Ceph guarantees that anything that has been written to disk will be readable later on, and placement groups won't go active if they can't retrieve all data. The sort of flexible policies allowed by Cassandra aren't possible within Ceph; it is a strictly consistent system.

Are objects always readable even if a PG is missing some OSDs and cannot recover? Example: 2 hosts, each with 1 osd; pool min_size is 2, with a crush rule saying to write to both hosts. I write a file successfully, then one host goes down and is eventually marked 'out'. Is the file readable on the 'up' host (say, if I'm running rgw there)? What if the up host does not have the primary copy?

Furthermore, if Ceph is strictly consistent, how would it resolve possible stale reads? Say, in the 2-host example, the network connection died, but min_size was set to 1. Would it be possible for writes to proceed, say making edits to an existing object? Could readers at the other host see stale data?

Thanks again in advance,
Ben
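
A minimal way to exercise the scenario described above on a test cluster; this is a sketch with illustrative pool and object names, not a verified transcript:

    # create a small replicated pool and pin its replication settings
    $ ceph osd pool create failtest 64
    $ ceph osd pool set failtest size 2
    $ ceph osd pool set failtest min_size 1

    # write an object while both hosts are up
    $ echo hello > /tmp/obj
    $ rados -p failtest put testobj /tmp/obj

    # stop the OSD on one host, then read from the surviving one
    $ sudo service ceph stop osd.1     # run on host 2; init syntax varies by distro
    $ rados -p failtest get testobj -  # with min_size 1 this should still succeed

With min_size 2 instead, the same 'rados get' should block and eventually show up as a slow request, matching the behaviour reported earlier in the thread.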
Re: Write Replication on Degraded PGs
On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland ben.rowl...@gmail.com wrote:

Hi,

Apologies that this is a fairly long post, but hopefully all my questions are related (or even invalid!). Does Ceph allow writes to proceed if it's not possible to satisfy the rules for replica placement across failure domains, as specified in the CRUSH map? For example, if my CRUSH map says to place one replica on each of 2 hosts, and no devices are up on one of the hosts, what will happen?

As you've discovered, it will write to only the up host. You can control this behavior by setting the minimum write size on your pools ('ceph osd pool set <pool> min_size <size>') to the minimum number of copies you'd like to guarantee are on disk. You can also set the 'osd pool default min size' parameter on the monitors; it defaults to 0, which is interpreted as half of the requested size. (So a size 2 pool will require at least one copy [d'oh], a size 5 pool will require at least 3 copies, etc.) So it sounds from the rest of your post like you'd want to, for each pool that RGW uses (it's not just .rgw), run 'ceph osd pool set .rgw min_size 2' (and likewise for .rgw.buckets, etc.).

From tests on a small cluster, my finding is that Ceph will allow writes to proceed in this case, even when there are fewer OSDs in the cluster than the replication size. If this is the expected behaviour of Ceph, then it seems to prefer write-availability over read-availability (in this case my data is only stored on 1 OSD, and is thus a SPOF). Is there any way to change this trade-off, e.g. as you can in Cassandra with its write quorums?

I'm not quite sure this is describing it correctly. Ceph guarantees that anything that has been written to disk will be readable later on, and placement groups won't go active if they can't retrieve all data. The sort of flexible policies allowed by Cassandra aren't possible within Ceph; it is a strictly consistent system.

From reading the paper which details the CRUSH replica placement algorithm, I understand several concepts:

- The CRUSH algorithm loops over n replicas, for each descending from items in the current set (buckets) until it finds a device.
- CRUSH may reject and reselect items, using a modified input, for three different reasons: if an item has already been selected in the current set (a collision), if a device is failed, or if a device is overloaded.

I'm not clear on what will happen if these constraints cannot be met, say in the case mentioned above. Does Ceph store the object once only, not meeting the replica size, or does it store the object twice on the same OSD somehow? The latter would violate point 2 above, unless the word 'reselect' is appropriate here.

CRUSH has a specified number of retries; if it can't meet the constraints, it spits out whatever replicas it was able to select. The higher-level system needs to decide what to do with that list. Ceph chooses to use whatever CRUSH gives back, within the minimum size constraint specified above and a few other override mechanisms we don't need to get into here. :)
-Greg
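
In concrete terms, a sketch of applying that suggestion across the radosgw pools. The pool names below are typical defaults for radosgw of this era rather than confirmed for this cluster, so check the output of 'rados lspools' first:

    $ rados lspools
    $ for pool in .rgw .rgw.control .rgw.gc .rgw.buckets; do
          ceph osd pool set $pool min_size 2
      done

The cluster-wide default for newly created pools can be set in ceph.conf on the monitors:

    [global]
        osd pool default min size = 2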