Re: radosgw Segmentation fault on obj copy

2013-12-03 Thread Dominik Mostowiec
Thanks.

--
Regards
Dominik

2013/12/3 Yehuda Sadeh yeh...@inktank.com:
 For bobtail at this point yes. You can try the unofficial version with
 that fix off the gitbuilder. Another option is to upgrade everything
 to dumpling.

 Yehuda

 On Mon, Dec 2, 2013 at 10:24 PM, Dominik Mostowiec
 dominikmostow...@gmail.com wrote:
 Thanks.
 Workaround: don't use multipart when obj size == 0?

 On Dec 3, 2013 6:43 AM, Yehuda Sadeh yeh...@inktank.com wrote:

 I created an issue earlier (6919) and updated it with the relevant
 information. This has been fixed in dumpling, although I don't remember
 hitting the scenario that you did; it was probably hit as part of the
 development work that was done then.
 In any case I created a branch with the relevant fixes in it (wip-6919).

 Thanks,
 Yehuda

 On Mon, Dec 2, 2013 at 8:39 PM, Dominik Mostowiec
 dominikmostow...@gmail.com wrote:
  for another object.
  http://pastebin.com/VkVAYgwn
 
 
  2013/12/3 Yehuda Sadeh yeh...@inktank.com:
  I see. Do you have backtrace for the crash?
 
  On Mon, Dec 2, 2013 at 6:19 PM, Dominik Mostowiec
  dominikmostow...@gmail.com wrote:
  0.56.7
 
  W dniu poniedziałek, 2 grudnia 2013 użytkownik Yehuda Sadeh napisał:
 
  I'm having trouble reproducing the issue. What version are you using?
 
  Thanks,
  Yehuda
 
  On Mon, Dec 2, 2013 at 2:16 PM, Yehuda Sadeh yeh...@inktank.com
  wrote:
    Actually, I read that differently. It only says that if there's more
    than 1 part, all parts except for the last one need to be >= 5M. Which
    means that for uploads that are smaller than 5M there should be zero
    or one parts.
  
   On Mon, Dec 2, 2013 at 12:54 PM, Dominik Mostowiec
   dominikmostow...@gmail.com wrote:
   You're right.
  
   S3 api doc:
  
   http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadComplete.html
   Err:EntityTooSmall
   Your proposed upload is smaller than the minimum allowed object
   size.
   Each part must be at least 5 MB in size, except the last part.
  
   Thanks.
  
   This error should be triggered from radosgw also.
  
   --
   Regards
   Dominik
  
   2013/12/2 Yehuda Sadeh yeh...@inktank.com:
    Looks like it. There should be a guard against it (multipart upload
    minimum is 5M).
  
   On Mon, Dec 2, 2013 at 12:32 PM, Dominik Mostowiec
   dominikmostow...@gmail.com wrote:
   Yes, this is probably upload empty file.
   This is the problem?
  
   --
   Regards
   Dominik
  
  
   2013/12/2 Yehuda Sadeh yeh...@inktank.com:
   By any chance are you uploading empty objects through the
   multipart
   upload api?
  
   On Mon, Dec 2, 2013 at 12:08 PM, Dominik Mostowiec
   dominikmostow...@gmail.com wrote:
   Hi,
   Another file with the same problems:
  
    2013-12-01 11:37:15.556687 7f7891fd3700  1 ====== starting new request req=0x25406d0 =====
    2013-12-01 11:37:15.556739 7f7891fd3700  2 req 1314:0.52initializing
    2013-12-01 11:37:15.556789 7f7891fd3700 10 s->object=files/192.txt s->bucket=testbucket
    2013-12-01 11:37:15.556799 7f7891fd3700  2 req 1314:0.000112:s3:POST /testbucket/files/192.txt::getting op
    2013-12-01 11:37:15.556804 7f7891fd3700  2 req 1314:0.000118:s3:POST /testbucket/files/192.txt:complete_multipart:authorizing
    2013-12-01 11:37:15.560013 7f7891fd3700 10 get_canon_resource(): dest=/testbucket/files/192.txt?uploadId=i92xi2olzDtFAeLXlfU2PFP9CDU87BC
    2013-12-01 11:37:15.560027 7f7891fd3700 10 auth_hdr:
    POST

    application/xml
    Sun, 01 Dec 2013 10:37:10 GMT
    /testbucket/files/192.txt?uploadId=i92xi2olzDtFAeLXlfU2PFP9CDU87BC
    2013-12-01 11:37:15.560085 7f7891fd3700  2 req 1314:0.003399:s3:POST /testbucket/files/192.txt:complete_multipart:reading permissions
    2013-12-01 11:37:15.562356 7f7891fd3700  2 req 1314:0.005670:s3:POST /testbucket/files/192.txt:complete_multipart:verifying op permissions
    2013-12-01 11:37:15.562373 7f7891fd3700  5 Searching permissions for uid=0 mask=2
    2013-12-01 11:37:15.562377 7f7891fd3700  5 Found permission: 15
    2013-12-01 11:37:15.562378 7f7891fd3700 10  uid=0 requested perm (type)=2, policy perm=2, user_perm_mask=2, acl perm=2
    2013-12-01 11:37:15.562381 7f7891fd3700  2 req 1314:0.005695:s3:POST /testbucket/files/192.txt:complete_multipart:verifying op params
    2013-12-01 11:37:15.562384 7f7891fd3700  2 req 1314:0.005698:s3:POST /testbucket/files/192.txt:complete_multipart:executing
    2013-12-01 11:37:15.565461 7f7891fd3700 10 calculated etag: d41d8cd98f00b204e9800998ecf8427e-0
    2013-12-01 11:37:15.566718 7f7891fd3700 10 can't clone object testbucket:files/192.txt to shadow object, tag/shadow_obj haven't been set
    2013-12-01 11:37:15.566777 7f7891fd3700  0 setting object tag=_leyAzxCw7YxpKv8P3v3QGwcsw__9VmP
    2013-12-01 11:37:15.678973 7f7891fd3700  2 req 1314:0.122286:s3:POST /testbucket/files/192.txt:complete_multipart:http status=200
    2013-12-01 11:37:15.679192 7f7891fd3700  1 ====== req done
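For what it's worth, the EntityTooSmall guard discussed in this thread boils down to a small check on the part list. A sketch in Python (helper name hypothetical; the 5 MB minimum is from the S3 API doc quoted above, and a zero-byte object is better served by a plain PUT than by multipart):

```python
# Hypothetical client-side version of the EntityTooSmall guard: every
# part except the last must be at least 5 MB, and a multipart complete
# with no data at all should be rejected rather than crash the gateway.
MIN_PART_SIZE = 5 * 1024 * 1024

def check_multipart_parts(part_sizes):
    """Return None if the part list is acceptable, else an error string."""
    if not part_sizes or sum(part_sizes) == 0:
        return "EntityTooSmall: empty multipart upload"
    for size in part_sizes[:-1]:
        if size < MIN_PART_SIZE:
            return "EntityTooSmall: non-final part smaller than 5 MB"
    return None
```

A client that validates like this before calling CompleteMultipartUpload avoids tripping the server-side crash for zero-byte objects.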

Re: [Xen-devel] Xen blktap driver for Ceph RBD : Anybody wants to test ? :p

2013-12-03 Thread Sylvain Munaut
Hi,

 What sort of memory are your instances using?

I just had a look. Around 120 MB, which indeed is a bit higher than I'd like.


 I haven't turned on any caching so I assume it's disabled.

Yes.


Cheers,

Sylvain
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] Radosgw on Ubuntu vs CentOS

2013-12-03 Thread Sage Weil
On Tue, 3 Dec 2013, Andy McCrae wrote:
 Hi ceph-users,
 I've been playing around with radosgw and I notice there is an inconsistency
 between the Ubuntu and CentOS startup scripts.
 
 On Ubuntu, if I run a start ceph-all (which will start radosgw), or I run
 the init script /etc/init.d/radosgw start - the radosgw process starts up
 fine, but running as root.
 
 On CentOS the init script starts radosgw as the apache user by default.
 
 I can see the Ubuntu init script is specifying www-data which would be in
 keeping with the CentOS init script, but the process runs as root.
 
 + start-stop-daemon --start -u www-data -x /usr/bin/radosgw -- -n
 client.radosgw.ubunTest
 2013-12-03 15:13:26.449087 7fee1d33b780 -1 WARNING: libcurl doesn't support
 curl_multi_wait()
 2013-12-03 15:13:26.449093 7fee1d33b780 -1 WARNING: cross zone / region
 transfer performance may be affected
 root@ubunTest:~# ps -ef | grep radosgw
 root     28528     1  0 15:13 ?        00:00:00 /usr/bin/radosgw -n
 client.radosgw.ubunTest
 
 
 The question is, do we consider this a bug in that radosgw shouldn't run as
 root by default, or should the CentOS/RHEL (rpm) init scripts start radosgw
 as root - I'd assume the former.

I think it's a bug.  There is no real need for radosgw to run as root, 
except that it needs to log to /var/log/radosgw/*.  We should update the 
packaging (rpm and deb) to create a radosgw user (and/or a ceph group?) 
and then make the two environments behave consistently.
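
The open-logs-then-drop-root pattern Sage mentions is standard for daemons; a minimal Python sketch of the idea (user and group names hypothetical, and the real change belongs in the packaging/init scripts):

```python
import os

def drop_privileges(uid, gid):
    """Give up root after privileged setup (e.g. opening files under
    /var/log/radosgw). Order matters: change the group before the user,
    because after setuid() to an unprivileged uid the process is no
    longer allowed to call setgid()."""
    os.setgid(gid)
    os.setuid(uid)

# Typical shape in a daemon started as root (names hypothetical):
#   log = open("/var/log/radosgw/radosgw.log", "a")  # needs root
#   pw = pwd.getpwnam("radosgw")
#   drop_privileges(pw.pw_uid, pw.pw_gid)            # then serve unprivileged
```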

Anyone with strong opinions in this area interested?

sage


Re: MDS can't join in

2013-12-03 Thread Gregory Farnum
Does the MDS have access to a keyring which contains its key, and does
that match what's on the monitor? You're just referring to the
client.admin one, which it won't use (it's not a client). It certainly
looks like there's a mismatch based on the verification error.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Dec 2, 2013 at 5:31 PM, Luo Shaobo (DSI)
lu...@dsi.a-star.edu.sg wrote:
 Dear All,

 I have updated a new version of Ceph, but I meet some strange things. It 
 seems that MDS does not join the cluster.

 In the manager node, there no MDS in. and the log of ceph_mds  shows error 
 during decoding block for decryption. But all the configuration on all the 
 node are same.

 1.  sudo cat /etc/ceph/ceph.client.admin.keyring
 [client.admin]
 key = AQADa5xSCNRIIRAA/wPhnf+opgCFfwVFQTI0sg==

 2. sudo cat /etc/ceph/ceph.conf
 [global]
 fsid = fc446df6-cc65-4ae8-8508-9eb8db6ad922
 mon_initial_members = ubuntu217
 mon_host = 192.168.36.217
 auth_supported = cephx
 osd_journal_size = 1024
 filestore_xattr_use_omap = true

 3. ceph mds dump.
 dumped mdsmap epoch 32
 epoch   32
 flags   0
 created 2013-12-02 19:12:02.932374
 modified2013-12-03 18:13:03.363896
 tableserver 0
 root0
 session_timeout 60
 session_autoclose   300
 last_failure0
 last_failure_osd_epoch  19
 compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
 ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
 uses versioned encoding}
 max_mds 1
 in  0
 up  {0=4322}
 failed
 stopped
 data_pools  0
 metadata_pool   1
 4322:   192.168.36.217:6800/2507 'ubuntu217' mds.0.6 up:replay seq 1 laggy 
 since 2013-12-03 18:09:18.277255

 4. ceph-mds process log.

 ceph@ubuntu217:~$ sudo /usr/bin/ceph-mds --cluster=ceph -i ubuntu217 -d
 2013-12-03 18:08:55.210327 7f30827a5780  0 starting mds.ubuntu217 at :/ceph 
 version 0.72.1 (4d923861868f6a15dcb33fef7f50f674997322de), process ceph-mds, 
 pid 2507
 0
 2013-12-03 18:08:55.235122 7f307d838700  1 mds.0.6 handle_mds_map i am now 
 mds.0.6
 2013-12-03 18:08:55.235125 7f307d838700  1 mds.0.6 handle_mds_map state 
 change up:boot -- up:replay
 2013-12-03 18:08:55.235135 7f307d838700  1 mds.0.6 replay_start
 2013-12-03 18:08:55.235139 7f307d838700  1 mds.0.6  recovery set is
 2013-12-03 18:08:55.235141 7f307d838700  1 mds.0.6  need osdmap epoch 19, 
 have 19
 2013-12-03 18:08:55.237000 7f307a630700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
 2013-12-03 18:08:55.237028 7f307a630700  0 -- 192.168.36.217:6800/2507 >> 192.168.35.82:6800/7500 pipe(0x298b500 sd=16 :48150 s=1 pgs=0 cs=0 l=1 c=0x2958840).failed verifying authorize reply
 2013-12-03 18:08:55.237083 7f307a630700  0 -- 192.168.36.217:6800/2507 >> 192.168.35.82:6800/7500 pipe(0x298b500 sd=16 :48150 s=1 pgs=0 cs=0 l=1 c=0x2958840).fault

 
 This email and any attachments are confidential and may be privileged. If you 
 are not the intended recipient, please delete it and notify us immediately. 
 Please do not copy or use it for any purpose, or disclose its contents to any 
 other person. This email does not constitute a contract offer, a contract 
 amendment, or an acceptance of a contract offer. Thank you.


Re: /etc/init.d/ceph vs upstart

2013-12-03 Thread Tim Spriggs
This does seem to fix the issue. Thanks!

On Mon, Nov 25, 2013 at 3:02 PM, Josh Durgin josh.dur...@inktank.com wrote:
 On 11/25/2013 11:01 AM, Tim Spriggs wrote:

 ... ping

 On Thu, Nov 7, 2013 at 3:31 PM, Tim Spriggs t...@uahirise.org wrote:

 Oops, I just realized I did the patch in the wrong direction :)

 On Thu, Nov 7, 2013 at 3:06 PM, Tim Spriggs t...@uahirise.org wrote:

 Hi All,

 I am battling extraneous error messages from two sources:

 1. logrotate, which is run in cron.daily and has a definition from the
 ceph package in /etc/logrotate.d. The message I get in an email from
 every node once a day is:

   cat: /var/run/ceph/osd.3.pid: No such file or directory

 This comes up because upstart is actually running ceph-osd while the
 init.d script expects a pidfile.

 2. /var/log/ceph/ceph-osd.$id.log, which complains:

   ERROR: error converting store /var/lib/ceph/osd/ceph-3: (16) Device or
   resource busy

 This happens on boot as well as on log rotation.


 After talking with dmick on irc.oftc.net#ceph, I was alerted to the
 fact that there are bits in upstart as well as the sysvinit style
 script that attempt to only use one scheme or the other. However, the
 logic seems wrong. Inside of ceph_common.sh, there is a function named
 check_host which looks for /var/lib/ceph/$type/ceph-$id/sysvinit and
 if it exists, it returns. If it doesn't exist, it just goes on to the
 next check (which passes in my environment.) Instead, it should return
 a non-0 value. Attached is an example patch.


 I think continuing if the host matches is intentional, so the init
 script continues working for daemons deployed before
 /var/lib/ceph/$type/ceph-$id/sysvinit or
 /var/lib/ceph/$type/ceph-$id/upstart were used.

 To maintain backwards compatibility, and prevent both upstart and sysvinit
 from trying to manage the same daemons, I think we can exit
 if the file for the other init system is present, like this patch:

 https://github.com/ceph/ceph/commit/b1d260cabb90bb9155f18c8e38a1dca102e6466c
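
The check Josh describes (each init system skips a daemon when the marker file of the other one is present) reduces to something like this sketch (function name hypothetical; the real logic lives in ceph_common.sh):

```python
import os
import tempfile

def sysvinit_should_manage(daemon_dir, other_marker="upstart"):
    """Skip the daemon if the other init system has claimed it via its
    marker file in /var/lib/ceph/$type/ceph-$id/."""
    return not os.path.exists(os.path.join(daemon_dir, other_marker))

# Demo on a scratch directory standing in for /var/lib/ceph/osd/ceph-3:
scratch = tempfile.mkdtemp()
managed_before = sysvinit_should_manage(scratch)   # no marker: manage it
open(os.path.join(scratch, "upstart"), "w").close()
managed_after = sysvinit_should_manage(scratch)    # upstart marker: skip
```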

 Does this work for you?

 Josh


crush changes for firefly

2013-12-03 Thread Sage Weil
https://github.com/ceph/ceph/pull/869

has a bunch of pending changes to CRUSH to support the erasure coding 
work in firefly.

The main item is that the behavior of 'choose indep' has changed 
significantly.  This is strictly speaking a change in behavior, but nobody 
should be using indep mode in a normal ceph cluster (unless they went 
manually fiddling with their crush map).

The new and improved indep does a breadth-first mapping instead of 
depth-first, which means fewer items shift around when there are 
failures.  It also drops some of the cruft that fell out of the combined 
code from before.  As a bonus, the old method is now firstn-only and I was 
able to strip out a bunch of crap in the process.  Yay!
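
The stability win of breadth-first indep can be illustrated with a toy model (this is not the real CRUSH algorithm, just the per-slot idea): each replica slot walks its own deterministic candidate sequence, so marking a device out only re-rolls the slots that had chosen it.

```python
import hashlib

def _draw(pg, slot, attempt, n):
    """Deterministic pseudo-random index in [0, n) for (pg, slot, attempt)."""
    key = ("%s:%s:%s" % (pg, slot, attempt)).encode()
    return int(hashlib.sha1(key).hexdigest(), 16) % n

def place_indep(pg, devices, out, slots, tries=50):
    """Toy 'indep' placement: each slot independently walks its own
    candidate sequence, skipping out devices and duplicates, so a failed
    device disturbs only the slots that had picked it."""
    chosen = []
    for slot in range(slots):
        pick = None
        for attempt in range(tries):
            cand = devices[_draw(pg, slot, attempt, len(devices))]
            if cand not in out and cand not in chosen:
                pick = cand
                break
        chosen.append(pick)  # None is a hole, as in CRUSH indep
    return chosen

devices = ["osd.%d" % i for i in range(8)]
before = place_indep(42, devices, set(), 3)
# Mark the first chosen device out: only slots that used it re-roll.
after = place_indep(42, devices, {before[0]}, 3)
```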

There are a few other things:

- The 'osd crush rule create-simple ..' command now takes an optional mode 
  (firstn or indep) so that it can be used for erasure pools.

- There is an 'erasure' pg pool type (existing types were 'rep' (default) 
and 'raid4' (never used or implemented)).

- New rule commands:

 step set_choose_tries N

This overrides the tunable total_tries (default is 50) for the current 
rule only.

 step set_chooseleaf_tries M

This overrides the recursive behavior when using chooseleaf.  By default, 
for indep mode, we try exactly once with the recursive call, as this 
maintains the same bound on computational complexity.  However, increasing 
this a bit (say, to 5) improves stability of the mapping a bit when 
there are devices marked out.  This lets you set it for *just* the current 
rule.
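
As a concrete sketch (names and numbers illustrative; exact grammar per the pull request above), a rule using the new commands might look like:

```
rule ecpool {
	ruleset 1
	type erasure
	min_size 3
	max_size 10
	step set_choose_tries 100
	step set_chooseleaf_tries 5
	step take default
	step chooseleaf indep 0 type host
	step emit
}
```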

Note that for the 'firstn' mode, the default (legacy) behavior is to try 
total_tries in the recursive call, which makes the computational 
complexity proportional to total_tries^2 (in the extreme).  If the 
'descend_once' tunable is set (now the default), then we make only one 
attempt when we hit a reject, but unfortunately not in the case of a 
collision (dup).  We can't change that without breaking compatibility for 
existing rules.  To fix it, we can add a set_chooseleaf_tries 1 command to 
firstn rules.  It's a bit muddled, though.  :(

- CrushWrapper has a helper to detect if any of these rule commands are in 
use, and OSDMap sets the required features accordingly.

- There is a small fix for OSDMap CACHEPOOL feature detection.

Long story short: if any of this new stuff is used (and it will be 
needed for erasure pools), the new feature bit will be required and old 
clients won't be able to connect.  I think the new behavior is good.  My 
main concern is the weird interplay of the 'descend_once' tunable, which 
unfortunately wasn't implemented to mean the same as chooseleaf_tries = 1.  
I'm not sure if it's worth fixing that via _another_ tunable or not; if 
so, we could (yay) end up where set_chooseleaf_tries actually works for 
firstn the same way it does for indep, and the tunable just makes it 
default to 1 (as it does with indep).

sage


[PATCH 0/3] block I/O when cluster is full

2013-12-03 Thread Josh Durgin
These patches allow rbd to block writes instead of returning errors
when OSDs are full enough that the FULL flag is set in the osd map.
This avoids filesystems on top of rbd getting confused by transient
EIOs if the cluster oscillates between full and non-full.

These are also available in the wip-full branch of ceph-client.git.

Josh Durgin (3):
  libceph: block I/O when PAUSE or FULL osd map flags are set
  libceph: add an option to configure client behavior when osds are
full
  rbd: document rbd-specific options

 Documentation/ABI/testing/sysfs-bus-rbd |   19 ++
 include/linux/ceph/libceph.h|7 +++
 include/linux/ceph/osd_client.h |1 +
 net/ceph/ceph_common.c  |   13 +
 net/ceph/osd_client.c   |   32 +--
 5 files changed, 70 insertions(+), 2 deletions(-)

-- 
1.7.10.4



[PATCH 3/3] rbd: document rbd-specific options

2013-12-03 Thread Josh Durgin
osd_full_behavior only affects rbd, so document it along with
read-only and read-write.

Signed-off-by: Josh Durgin josh.dur...@inktank.com
---
 Documentation/ABI/testing/sysfs-bus-rbd |   19 +++
 1 file changed, 19 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-rbd 
b/Documentation/ABI/testing/sysfs-bus-rbd
index 0a30647..15f3ba6 100644
--- a/Documentation/ABI/testing/sysfs-bus-rbd
+++ b/Documentation/ABI/testing/sysfs-bus-rbd
@@ -18,6 +18,25 @@ Removal of a device:
 
   $ echo <dev-id> > /sys/bus/rbd/remove
 
+Options
+-------
+
+read_only/ro
+
+	The mapped device will only handle reads. This is the default for
+	snapshots.
+
+read_write/rw
+
+	The mapped device will handle reads and writes. This is invalid
+	for snapshots.
+
+osd_full_behavior
+
+	Choose how to handle writes to a full ceph cluster. Options are
+	"block", to pause I/O until there is space (the default), or
+	"error", to return an I/O error.
+
 Entries under /sys/bus/rbd/devices/<dev-id>/
 
-- 
1.7.10.4



[PATCH 1/3] libceph: block I/O when PAUSE or FULL osd map flags are set

2013-12-03 Thread Josh Durgin
The PAUSEWR and PAUSERD flags are meant to stop the cluster from
processing writes and reads, respectively. The FULL flag is set when
the cluster determines that it is out of space, and will no longer
process writes.  PAUSEWR and PAUSERD are purely client-side settings
already implemented in userspace clients. The osd does nothing special
with these flags.

When the FULL flag is set, however, the osd responds to all writes
with -ENOSPC. For cephfs, this makes sense, but for rbd the block
layer translates this into EIO.  If a cluster goes from full to
non-full quickly, a filesystem on top of rbd will not behave well,
since some writes succeed while others get EIO.

Fix this by blocking any writes when the FULL flag is set in the osd
client. This is the same strategy used by userspace, so apply it by
default.  A follow-on patch makes this configurable.

__map_request() is called to re-target osd requests in case the
available osds changed.  Add a paused field to a ceph_osd_request, and
set it whenever an appropriate osd map flag is set.  Avoid queueing
paused requests in __map_request(), but force them to be resent if
they become unpaused.

Also subscribe to the next osd map from the monitor if any of these
flags are set, so paused requests can be unblocked as soon as
possible.

Fixes: http://tracker.ceph.com/issues/6079

Signed-off-by: Josh Durgin josh.dur...@inktank.com
---
 include/linux/ceph/osd_client.h |1 +
 net/ceph/osd_client.c   |   29 +++--
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 8f47625..4fb6a89 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -138,6 +138,7 @@ struct ceph_osd_request {
 	__le64           *r_request_pool;
 	void             *r_request_pgid;
 	__le32           *r_request_attempts;
+	bool              r_paused;
 	struct ceph_eversion *r_request_reassert_version;
 
 	int               r_result;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 2b4b32a..21476be 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1232,6 +1232,22 @@ void ceph_osdc_set_request_linger(struct ceph_osd_client *osdc,
 EXPORT_SYMBOL(ceph_osdc_set_request_linger);
 
 /*
+ * Returns whether a request should be blocked from being sent
+ * based on the current osdmap and osd_client settings.
+ *
+ * Caller should hold map_sem for read.
+ */
+static bool __req_should_be_paused(struct ceph_osd_client *osdc,
+				   struct ceph_osd_request *req)
+{
+	bool pauserd = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD);
+	bool pausewr = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR) ||
+		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);
+	return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
+	       (req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
+}
+
+/*
  * Pick an osd (the first 'up' osd in the pg), allocate the osd struct
  * (as needed), and set the request r_osd appropriately.  If there is
  * no up osd, set r_osd to NULL.  Move the request to the appropriate list
@@ -1248,6 +1264,7 @@ static int __map_request(struct ceph_osd_client *osdc,
 	int acting[CEPH_PG_MAX_SIZE];
 	int o = -1, num = 0;
 	int err;
+	bool was_paused;
 
 	dout("map_request %p tid %lld\n", req, req->r_tid);
 	err = ceph_calc_ceph_pg(&pgid, req->r_oid, osdc->osdmap,
@@ -1264,12 +1281,18 @@ static int __map_request(struct ceph_osd_client *osdc,
 		num = err;
 	}
 
+	was_paused = req->r_paused;
+	req->r_paused = __req_should_be_paused(osdc, req);
+	if (was_paused && !req->r_paused)
+		force_resend = 1;
+
 	if ((!force_resend &&
 	     req->r_osd && req->r_osd->o_osd == o &&
 	     req->r_sent >= req->r_osd->o_incarnation &&
 	     req->r_num_pg_osds == num &&
 	     memcmp(req->r_pg_osds, acting, sizeof(acting[0])*num) == 0) ||
-	    (req->r_osd == NULL && o == -1))
+	    (req->r_osd == NULL && o == -1) ||
+	    req->r_paused)
 		return 0;  /* no change */
 
 	dout("map_request tid %llu pgid %lld.%x osd%d (was osd%d)\n",
@@ -1804,7 +1827,9 @@ done:
 	 * we find out when we are no longer full and stop returning
 	 * ENOSPC.
 	 */
-	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
+	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL) ||
+	    ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD) ||
+	    ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR))
 		ceph_monc_request_next_osdmap(osdc->client->monc);
 
 	mutex_lock(&osdc->request_mutex);
-- 
1.7.10.4



[PATCH 2/3] libceph: add an option to configure client behavior when osds are full

2013-12-03 Thread Josh Durgin
Default to blocking requests to be consistent with userspace. Some
applications may prefer the previous behavior of returning an error
instead, so make that an option. CephFS implements returning -ENOSPC
at a higher level, so only rbd is really affected by this.

Signed-off-by: Josh Durgin josh.dur...@inktank.com
---
 include/linux/ceph/libceph.h |7 +++
 net/ceph/ceph_common.c   |   13 +
 net/ceph/osd_client.c|5 -
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index 2e30248..77b28ac 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -32,6 +32,12 @@
 
 #define CEPH_OPT_DEFAULT   (0)
 
+/* osd full behavior */
+enum {
+   CEPH_OSD_FULL_ERROR,
+   CEPH_OSD_FULL_BLOCK,
+};
+
 #define ceph_set_opt(client, opt) \
 	(client)->options->flags |= CEPH_OPT_##opt;
 #define ceph_test_opt(client, opt) \
@@ -44,6 +50,7 @@ struct ceph_options {
int mount_timeout;
int osd_idle_ttl;
int osd_keepalive_timeout;
+   int osd_full_behavior;
 
/*
 * any type that can't be simply compared or doesn't need need
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index 34b11ee..d029fc5 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -217,6 +217,7 @@ enum {
Opt_secret,
Opt_key,
Opt_ip,
+   Opt_osd_full_behavior,
Opt_last_string,
/* string args above */
Opt_share,
@@ -236,6 +237,7 @@ static match_table_t opt_tokens = {
{Opt_secret, secret=%s},
{Opt_key, key=%s},
{Opt_ip, ip=%s},
+   {Opt_osd_full_behavior, osd_full_behavior=%s},
/* string args above */
{Opt_share, share},
{Opt_noshare, noshare},
@@ -329,6 +331,7 @@ ceph_parse_options(char *options, const char *dev_name,
 	opt->osd_keepalive_timeout = CEPH_OSD_KEEPALIVE_DEFAULT;
 	opt->mount_timeout = CEPH_MOUNT_TIMEOUT_DEFAULT; /* seconds */
 	opt->osd_idle_ttl = CEPH_OSD_IDLE_TTL_DEFAULT;   /* seconds */
+	opt->osd_full_behavior = CEPH_OSD_FULL_BLOCK;
 
/* get mon ip(s) */
/* ip1[:port1][,ip2[:port2]...] */
@@ -408,6 +411,16 @@ ceph_parse_options(char *options, const char *dev_name,
 			if (err < 0)
 				goto out;
 			break;
+		case Opt_osd_full_behavior:
+			if (!strcmp(argstr[0].from, "error")) {
+				opt->osd_full_behavior = CEPH_OSD_FULL_ERROR;
+			} else if (!strcmp(argstr[0].from, "block")) {
+				opt->osd_full_behavior = CEPH_OSD_FULL_BLOCK;
+			} else {
+				err = -EINVAL;
+				goto out;
+			}
+			break;
 
 		/* misc */
 		case Opt_osdtimeout:
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 21476be..664432e 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1240,9 +1240,12 @@ EXPORT_SYMBOL(ceph_osdc_set_request_linger);
 static bool __req_should_be_paused(struct ceph_osd_client *osdc,
 				   struct ceph_osd_request *req)
 {
+	bool block_on_full =
+		osdc->client->options->osd_full_behavior == CEPH_OSD_FULL_BLOCK;
 	bool pauserd = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSERD);
 	bool pausewr = ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_PAUSEWR) ||
-		ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL);
+		(ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL) &&
+		 block_on_full);
 	return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
 	       (req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
 }
-- 
1.7.10.4
