On Wed, 11 Dec 2013, Josh Durgin wrote:
> The PAUSEWR and PAUSERD flags are meant to stop the cluster from
> processing writes and reads, respectively. The FULL flag is set when
> the cluster determines that it is out of space, and will no longer
> process writes. PAUSEWR and PAUSERD are purely client-side settings
> already implemented in userspace clients. The osd does nothing special
> with these flags.
>
> When the FULL flag is set, however, the osd responds to all writes
> with -ENOSPC. For cephfs, this makes sense, but for rbd the block
> layer translates this into EIO. If a cluster goes from full to
> non-full quickly, a filesystem on top of rbd will not behave well,
> since some writes succeed while others get EIO.
>
> Fix this by blocking any writes when the FULL flag is set in the osd
> client. This is the same strategy used by userspace, so apply it by
> default. A follow-on patch makes this configurable.
>
> __map_request() is called to re-target osd requests in case the
> available osds changed. Add a paused field to a ceph_osd_request, and
> set it whenever an appropriate osd map flag is set. Avoid queueing
> paused requests in __map_request(), but force them to be resent if
> they become unpaused.
>
> Also subscribe to the next osd map from the monitor if any of these
> flags are set, so paused requests can be unblocked as soon as
> possible.
>
> Fixes: http://tracker.ceph.com/issues/6079
>
> Signed-off-by: Josh Durgin <[email protected]>
> ---
> include/linux/ceph/osd_client.h | 1 +
> net/ceph/osd_client.c | 29 +++++++++++++++++++++++++++--
> 2 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 8f47625..4fb6a89 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -138,6 +138,7 @@ struct ceph_osd_request {
> __le64 *r_request_pool;
> void *r_request_pgid;
> __le32 *r_request_attempts;
> + bool r_paused;
> struct ceph_eversion *r_request_reassert_version;
>
> int r_result;
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index a17eaae..1ad9866 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1232,6 +1232,22 @@ void ceph_osdc_set_request_linger(struct ceph_osd_client *osdc,
> EXPORT_SYMBOL(ceph_osdc_set_request_linger);
>
> /*
> + * Returns whether a request should be blocked from being sent
> + * based on the current osdmap and osd_client settings.
> + *
> + * Caller should hold map_sem for read.
> + */
> +static bool __req_should_be_paused(struct ceph_osd_client *osdc,
> + struct ceph_osd_request *req)
> +{
> + bool pauserd = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD);
> + bool pausewr = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR) ||
> + ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);
> + return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
> + (req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
> +}
> +
> +/*
> * Pick an osd (the first 'up' osd in the pg), allocate the osd struct
> * (as needed), and set the request r_osd appropriately. If there is
> * no up osd, set r_osd to NULL. Move the request to the appropriate list
> @@ -1248,6 +1264,7 @@ static int __map_request(struct ceph_osd_client *osdc,
> int acting[CEPH_PG_MAX_SIZE];
> int o = -1, num = 0;
> int err;
> + bool was_paused;
>
> dout("map_request %p tid %lld\n", req, req->r_tid);
> err = ceph_calc_ceph_pg(&pgid, req->r_oid, osdc->osdmap,
> @@ -1264,12 +1281,18 @@ static int __map_request(struct ceph_osd_client *osdc,
> num = err;
> }
>
> + was_paused = req->r_paused;
> + req->r_paused = __req_should_be_paused(osdc, req);
> + if (was_paused && !req->r_paused)
> + force_resend = 1;
> +
> if ((!force_resend &&
> req->r_osd && req->r_osd->o_osd == o &&
> req->r_sent >= req->r_osd->o_incarnation &&
> req->r_num_pg_osds == num &&
> memcmp(req->r_pg_osds, acting, sizeof(acting[0])*num) == 0) ||
> - (req->r_osd == NULL && o == -1))
> + (req->r_osd == NULL && o == -1) ||
> + req->r_paused)

It seems like we could be a bit more aggressive (and more closely aligned
with what the other causes of changed mappings do) and cancel the request
if it is newly paused. Otherwise, we leave req->r_osd set to the last
osd we sent the request to, which means we might still get a reply from it.
I guess that is what we want, actually...

> return 0; /* no change */
>
> dout("map_request tid %llu pgid %lld.%x osd%d (was osd%d)\n",
> @@ -1811,7 +1834,9 @@ done:
> * we find out when we are no longer full and stop returning
> * ENOSPC.
> */
> - if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
> + if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL) ||
> + ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD) ||
> + ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR))
> ceph_monc_request_next_osdmap(&osdc->client->monc);
>
> mutex_lock(&osdc->request_mutex);
> --
> 1.7.10.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
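
For anyone following along, here is how I read the pause logic in the patch, as a rough userspace sketch rather than the actual kernel code. The sk_* names, the flag values, and the sk_action enum are made up for illustration; the real code uses ceph_osdmap_flag(), req->r_flags and req->r_paused, and the "normal" case falls through to the existing mapping comparison in __map_request().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the osdmap flags; values are illustrative only. */
enum {
	SK_OSDMAP_FULL    = 1 << 0,
	SK_OSDMAP_PAUSERD = 1 << 1,
	SK_OSDMAP_PAUSEWR = 1 << 2,
};

/* Hypothetical stand-ins for the request read/write flags. */
enum {
	SK_REQ_READ  = 1 << 0,
	SK_REQ_WRITE = 1 << 1,
};

/* Minimal model of a request: only the fields the pause logic looks at. */
struct sk_request {
	unsigned int flags;
	bool paused;
};

/* What the caller should do with the request after a map change. */
enum sk_action {
	SK_HOLD,	/* paused: leave it alone, do not queue it */
	SK_RESEND,	/* just became unpaused: force a resend */
	SK_NORMAL,	/* not paused before or now: usual remap checks apply */
};

/* Reads pause on PAUSERD; writes pause on PAUSEWR or FULL. */
static bool sk_should_be_paused(unsigned int map_flags,
				const struct sk_request *req)
{
	bool pauserd = (map_flags & SK_OSDMAP_PAUSERD) != 0;
	bool pausewr = (map_flags & (SK_OSDMAP_PAUSEWR | SK_OSDMAP_FULL)) != 0;

	return ((req->flags & SK_REQ_READ) && pauserd) ||
	       ((req->flags & SK_REQ_WRITE) && pausewr);
}

/* Update the paused state and report the resulting transition. */
static enum sk_action sk_remap(unsigned int map_flags, struct sk_request *req)
{
	bool was_paused;

	assert(req != NULL);
	was_paused = req->paused;
	req->paused = sk_should_be_paused(map_flags, req);

	if (req->paused)
		return SK_HOLD;
	if (was_paused)
		return SK_RESEND;
	return SK_NORMAL;
}
```

In this model a newly paused request is simply held, with whatever osd it was last sent to left untouched, which is the behavior I was asking about above.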