RE: Question to RWLock reverse DNS ip=hostname
We did a lot of testing of RW locks in another multi-threaded project and we successfully use PTHREAD_RWLOCK_PREFER_WRITER_NP, which works perfectly fine if you avoid recursive read locks. It is probably good practice to avoid any recursive mutex in general.

If one stays with the default RW mutex without setting attributes, we have seen that writers starve for long periods if there is contention on the read mutex (n readers, n writers). I agree this is a very 'sensitive' topic but could well be an origin of threading scalability problems - it depends how and where you use the RW locks in CEPH.

Cheers Andreas.

________________________________________
From: Gregory Farnum [g...@inktank.com]
Sent: 09 September 2014 21:56
To: Sage Weil
Cc: Andreas Joachim Peters; ceph-devel@vger.kernel.org
Subject: Re: Question to RWLock / reverse DNS ip=hostname

On Tue, Sep 9, 2014 at 10:50 AM, Sage Weil <sw...@redhat.com> wrote:
> On Tue, 9 Sep 2014, Andreas Joachim Peters wrote:
>> Hi, by chance I had a look at the RWLock class. To my best knowledge,
>> the way you create RW locks defaults to writer starvation, e.g. all
>> readers will always jump ahead of a pending writer. I cannot imagine
>> that you never have the opposite requirement in the CEPH multithreaded
>> code, but I didn't review where it is used. In case you are aware, you
>> can just ignore this comment; otherwise one could add the option to
>> create a writer-preferring RWLock at construction time using e.g.
>>
>>   pthread_rwlockattr_setkind_np(&attr, PTHREAD_RWLOCK_PREFER_WRITER_NP)
>
> Hmm, I was not aware. Perhaps we should make the reader vs writer
> preference explicit during construction?

Yeah, this is interesting, as I thought I'd learned somewhere that the default behavior was to prefer writers. Poking around the internet makes me wonder if we actually want to make any changes -- best I can tell, PTHREAD_RWLOCK_PREFER_WRITER_NP does not actually behave the way one wants (as part of the spec!) because pthreads requires recursive read locking, so we'd have to use PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP. But we'd probably want to profile the performance changes (and, uh, run tests to make sure we don't deadlock somehow) before putting any such code upstream.

>> Another question: I couldn't find a base implementation to translate
>> IPv4/6 IPs to host names. Is there some configuration table/method
>> allowing one to get back, from an OSD address retrieved via
>> OSDMap::get_addr(), the original host name, avoiding the use of DNS
>> lookups?
>
> I don't think we are doing reverse DNS lookups anywhere in the ceph
> code ...

In particular, I think we've looked at doing this before and been unable to reliably map to the *proper* hostname in cases where servers have multiple NICs.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
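For readers following the thread: a minimal glibc-specific sketch (illustrative, not Ceph code) of the construction Greg describes. Note the NONRECURSIVE variant: plain PTHREAD_RWLOCK_PREFER_WRITER_NP is specified not to take effect because POSIX allows recursive read locking, so the variant below is only safe if no thread ever takes the read lock recursively.

```c
/* Sketch: initialize an rwlock so pending writers are preferred over
 * new readers.  glibc-specific (_np suffix), requires _GNU_SOURCE.
 * Only safe if no thread takes the read lock recursively. */
#define _GNU_SOURCE
#include <pthread.h>

/* Returns 0 on success, an errno value otherwise (pthreads style). */
int rwlock_init_writer_pref(pthread_rwlock_t *lock)
{
	pthread_rwlockattr_t attr;
	int ret;

	ret = pthread_rwlockattr_init(&attr);
	if (ret)
		return ret;
	/* PREFER_WRITER_NP alone still lets readers jump ahead of
	 * pending writers; the NONRECURSIVE variant actually works. */
	ret = pthread_rwlockattr_setkind_np(&attr,
			PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
	if (!ret)
		ret = pthread_rwlock_init(lock, &attr);
	pthread_rwlockattr_destroy(&attr);
	return ret;
}
```

Link with -pthread; the attribute object can be destroyed once the lock is initialized.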
[PATCH 1/3] libceph: gracefully handle large reply messages from the mon
From: Sage Weil <s...@redhat.com>

We preallocate a few of the message types we get back from the mon.  If
we get a larger message than we are expecting, fall back to trying to
allocate a new one instead of blindly using the one we have.

CC: sta...@vger.kernel.org
Signed-off-by: Sage Weil <s...@redhat.com>
---
 net/ceph/mon_client.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
index 067d3af2eaf6..61fcfc304f68 100644
--- a/net/ceph/mon_client.c
+++ b/net/ceph/mon_client.c
@@ -1181,7 +1181,15 @@ static struct ceph_msg *mon_alloc_msg(struct ceph_connection *con,
 	if (!m) {
 		pr_info("alloc_msg unknown type %d\n", type);
 		*skip = 1;
+	} else if (front_len > m->front_alloc_len) {
+		pr_warning("mon_alloc_msg front %d > prealloc %d (%u#%llu)\n",
+			   front_len, m->front_alloc_len,
+			   (unsigned int)con->peer_name.type,
+			   le64_to_cpu(con->peer_name.num));
+		ceph_msg_put(m);
+		m = ceph_msg_new(type, front_len, GFP_NOFS, false);
 	}
+
 	return m;
 }
--
1.7.10.4
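The patch applies a general pattern: prefer the preallocated buffer, but fall back to a fresh allocation when the incoming payload is larger than what was set aside. A minimal userspace sketch of the same idea (illustrative only, with a hypothetical `struct msg` standing in for `ceph_msg`):

```c
/* Illustrative sketch of the patch's pattern: reuse a preallocated
 * message if it is big enough, otherwise drop it and allocate one
 * sized for the actual payload instead of blindly reusing it. */
#include <stdlib.h>

struct msg {
	size_t alloc_len;
	char  *data;
};

static struct msg *msg_new(size_t len)
{
	struct msg *m = malloc(sizeof(*m));

	if (!m)
		return NULL;
	m->data = malloc(len);
	if (!m->data) {
		free(m);
		return NULL;
	}
	m->alloc_len = len;
	return m;
}

static void msg_put(struct msg *m)
{
	if (m) {
		free(m->data);
		free(m);
	}
}

/* Return a message able to hold front_len bytes. */
static struct msg *alloc_msg(struct msg *prealloc, size_t front_len)
{
	if (prealloc && front_len > prealloc->alloc_len) {
		msg_put(prealloc);          /* too small: drop it ...  */
		return msg_new(front_len);  /* ... and allocate afresh */
	}
	return prealloc;
}
```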
[PATCH 2/3] libceph: add process_one_ticket() helper
Add a helper for processing individual cephx auth tickets.  Needed for
the next commit, which deals with allocating ticket buffers.  (Most of
the diff here is whitespace - view with git diff -b).

Cc: sta...@vger.kernel.org
Signed-off-by: Ilya Dryomov <ilya.dryo...@inktank.com>
---
 net/ceph/auth_x.c | 228 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 124 insertions(+), 104 deletions(-)

diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
index 96238ba95f2b..0eb146dce1aa 100644
--- a/net/ceph/auth_x.c
+++ b/net/ceph/auth_x.c
@@ -129,17 +129,131 @@ static void remove_ticket_handler(struct ceph_auth_client *ac,
 	kfree(th);
 }
 
+static int process_one_ticket(struct ceph_auth_client *ac,
+			      struct ceph_crypto_key *secret,
+			      void **p, void *end,
+			      void *dbuf, void *ticket_buf)
+{
+	struct ceph_x_info *xi = ac->private;
+	int type;
+	u8 tkt_struct_v, blob_struct_v;
+	struct ceph_x_ticket_handler *th;
+	void *dp, *dend;
+	int dlen;
+	char is_enc;
+	struct timespec validity;
+	struct ceph_crypto_key old_key;
+	void *tp, *tpend;
+	struct ceph_timespec new_validity;
+	struct ceph_crypto_key new_session_key;
+	struct ceph_buffer *new_ticket_blob;
+	unsigned long new_expires, new_renew_after;
+	u64 new_secret_id;
+	int ret;
+
+	ceph_decode_need(p, end, sizeof(u32) + 1, bad);
+
+	type = ceph_decode_32(p);
+	dout(" ticket type %d %s\n", type, ceph_entity_type_name(type));
+
+	tkt_struct_v = ceph_decode_8(p);
+	if (tkt_struct_v != 1)
+		goto bad;
+
+	th = get_ticket_handler(ac, type);
+	if (IS_ERR(th)) {
+		ret = PTR_ERR(th);
+		goto out;
+	}
+
+	/* blob for me */
+	dlen = ceph_x_decrypt(secret, p, end, dbuf,
+			      TEMP_TICKET_BUF_LEN);
+	if (dlen <= 0) {
+		ret = dlen;
+		goto out;
+	}
+	dout(" decrypted %d bytes\n", dlen);
+	dp = dbuf;
+	dend = dp + dlen;
+
+	tkt_struct_v = ceph_decode_8(&dp);
+	if (tkt_struct_v != 1)
+		goto bad;
+
+	memcpy(&old_key, &th->session_key, sizeof(old_key));
+	ret = ceph_crypto_key_decode(&new_session_key, &dp, dend);
+	if (ret)
+		goto out;
+
+	ceph_decode_copy(&dp, &new_validity, sizeof(new_validity));
+	ceph_decode_timespec(&validity, &new_validity);
+	new_expires = get_seconds() + validity.tv_sec;
+	new_renew_after = new_expires - (validity.tv_sec / 4);
+	dout(" expires=%lu renew_after=%lu\n", new_expires,
+	     new_renew_after);
+
+	/* ticket blob for service */
+	ceph_decode_8_safe(p, end, is_enc, bad);
+	tp = ticket_buf;
+	if (is_enc) {
+		/* encrypted */
+		dout(" encrypted ticket\n");
+		dlen = ceph_x_decrypt(&old_key, p, end, ticket_buf,
+				      TEMP_TICKET_BUF_LEN);
+		if (dlen < 0) {
+			ret = dlen;
+			goto out;
+		}
+		dlen = ceph_decode_32(&tp);
+	} else {
+		/* unencrypted */
+		ceph_decode_32_safe(p, end, dlen, bad);
+		ceph_decode_need(p, end, dlen, bad);
+		ceph_decode_copy(p, ticket_buf, dlen);
+	}
+	tpend = tp + dlen;
+	dout(" ticket blob is %d bytes\n", dlen);
+	ceph_decode_need(&tp, tpend, 1 + sizeof(u64), bad);
+	blob_struct_v = ceph_decode_8(&tp);
+	new_secret_id = ceph_decode_64(&tp);
+	ret = ceph_decode_buffer(&new_ticket_blob, &tp, tpend);
+	if (ret)
+		goto out;
+
+	/* all is well, update our ticket */
+	ceph_crypto_key_destroy(&th->session_key);
+	if (th->ticket_blob)
+		ceph_buffer_put(th->ticket_blob);
+	th->session_key = new_session_key;
+	th->ticket_blob = new_ticket_blob;
+	th->validity = new_validity;
+	th->secret_id = new_secret_id;
+	th->expires = new_expires;
+	th->renew_after = new_renew_after;
+	dout(" got ticket service %d (%s) secret_id %lld len %d\n",
+	     type, ceph_entity_type_name(type), th->secret_id,
+	     (int)th->ticket_blob->vec.iov_len);
+	xi->have_keys |= th->service;
+
+out:
+	return ret;
+
+bad:
+	ret = -EINVAL;
+	goto out;
+}
+
 static int ceph_x_proc_ticket_reply(struct ceph_auth_client *ac,
 				    struct ceph_crypto_key *secret,
 				    void *buf, void *end)
 {
-	struct ceph_x_info *xi = ac->private;
-	int num;
 	void *p = buf;
-	int ret;
 	char *dbuf;
 	char *ticket_buf;
 	u8 reply_struct_v;
+	u32 num;
+	int ret;
 
 	dbuf = kmalloc(TEMP_TICKET_BUF_LEN, GFP_NOFS);
 	if (!dbuf)
@@ -150,112
[PATCH 3/3] libceph: do not hard code max auth ticket len
We hard code the cephx auth ticket buffer size to 256 bytes.  This isn't
enough for any moderate setup and, in case tickets themselves are not
encrypted, leads to buffer overflows (ceph_x_decrypt() errors out, but
ceph_decode_copy() doesn't - it's just a memcpy() wrapper).  Since the
buffer is allocated dynamically anyway, allocate it a bit later, at the
point where we know how much is going to be needed.

Fixes: http://tracker.ceph.com/issues/8979
Cc: sta...@vger.kernel.org
Signed-off-by: Ilya Dryomov <ilya.dryo...@inktank.com>
---
 net/ceph/auth_x.c | 64 +++++++++++++++++++++++++-----------------------------
 1 file changed, 29 insertions(+), 35 deletions(-)

diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
index 0eb146dce1aa..de6662b14e1f 100644
--- a/net/ceph/auth_x.c
+++ b/net/ceph/auth_x.c
@@ -13,8 +13,6 @@
 #include "auth_x.h"
 #include "auth_x_protocol.h"
 
-#define TEMP_TICKET_BUF_LEN	256
-
 static void ceph_x_validate_tickets(struct ceph_auth_client *ac, int *pneed);
 
 static int ceph_x_is_authenticated(struct ceph_auth_client *ac)
@@ -64,7 +62,7 @@ static int ceph_x_encrypt(struct ceph_crypto_key *secret,
 }
 
 static int ceph_x_decrypt(struct ceph_crypto_key *secret,
-			  void **p, void *end, void *obuf, size_t olen)
+			  void **p, void *end, void **obuf, size_t olen)
 {
 	struct ceph_x_encrypt_header head;
 	size_t head_len = sizeof(head);
@@ -75,8 +73,14 @@ static int ceph_x_decrypt(struct ceph_crypto_key *secret,
 		return -EINVAL;
 
 	dout("ceph_x_decrypt len %d\n", len);
-	ret = ceph_decrypt2(secret, &head, &head_len, obuf, &olen,
-			    *p, len);
+	if (*obuf == NULL) {
+		*obuf = kmalloc(len, GFP_NOFS);
+		if (!*obuf)
+			return -ENOMEM;
+		olen = len;
+	}
+
+	ret = ceph_decrypt2(secret, &head, &head_len, *obuf, &olen, *p, len);
 	if (ret)
 		return ret;
 	if (head.struct_v != 1 || le64_to_cpu(head.magic) != CEPHX_ENC_MAGIC)
@@ -131,18 +135,19 @@ static void remove_ticket_handler(struct ceph_auth_client *ac,
 
 static int process_one_ticket(struct ceph_auth_client *ac,
 			      struct ceph_crypto_key *secret,
-			      void **p, void *end,
-			      void *dbuf, void *ticket_buf)
+			      void **p, void *end)
 {
 	struct ceph_x_info *xi = ac->private;
 	int type;
 	u8 tkt_struct_v, blob_struct_v;
 	struct ceph_x_ticket_handler *th;
+	void *dbuf = NULL;
 	void *dp, *dend;
 	int dlen;
 	char is_enc;
 	struct timespec validity;
 	struct ceph_crypto_key old_key;
+	void *ticket_buf = NULL;
 	void *tp, *tpend;
 	struct ceph_timespec new_validity;
 	struct ceph_crypto_key new_session_key;
@@ -167,8 +172,7 @@ static int process_one_ticket(struct ceph_auth_client *ac,
 	}
 
 	/* blob for me */
-	dlen = ceph_x_decrypt(secret, p, end, dbuf,
-			      TEMP_TICKET_BUF_LEN);
+	dlen = ceph_x_decrypt(secret, p, end, &dbuf, 0);
 	if (dlen <= 0) {
 		ret = dlen;
 		goto out;
@@ -195,20 +199,25 @@ static int process_one_ticket(struct ceph_auth_client *ac,
 
 	/* ticket blob for service */
 	ceph_decode_8_safe(p, end, is_enc, bad);
-	tp = ticket_buf;
 	if (is_enc) {
 		/* encrypted */
 		dout(" encrypted ticket\n");
-		dlen = ceph_x_decrypt(&old_key, p, end, ticket_buf,
-				      TEMP_TICKET_BUF_LEN);
+		dlen = ceph_x_decrypt(&old_key, p, end, &ticket_buf, 0);
 		if (dlen < 0) {
 			ret = dlen;
 			goto out;
 		}
+		tp = ticket_buf;
 		dlen = ceph_decode_32(&tp);
 	} else {
 		/* unencrypted */
 		ceph_decode_32_safe(p, end, dlen, bad);
+		ticket_buf = kmalloc(dlen, GFP_NOFS);
+		if (!ticket_buf) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		tp = ticket_buf;
 		ceph_decode_need(p, end, dlen, bad);
 		ceph_decode_copy(p, ticket_buf, dlen);
 	}
@@ -237,6 +246,8 @@ static int process_one_ticket(struct ceph_auth_client *ac,
 	xi->have_keys |= th->service;
 
 out:
+	kfree(ticket_buf);
+	kfree(dbuf);
 	return ret;
 
 bad:
@@ -249,21 +260,10 @@ static int ceph_x_proc_ticket_reply(struct ceph_auth_client *ac,
 				    void *buf, void *end)
 {
 	void *p = buf;
-	char *dbuf;
-	char *ticket_buf;
 	u8 reply_struct_v;
 	u32 num;
 	int ret;
 
-	dbuf = kmalloc(TEMP_TICKET_BUF_LEN, GFP_NOFS);
-	if (!dbuf)
-		return -ENOMEM;
-
-	ret = -ENOMEM;
-
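The core of this fix is the "allocate after you know the length" pattern: read the length field from the wire first, then size the output buffer from it rather than from a hard-coded constant. A minimal userspace sketch of that pattern (illustrative only, not the kernel code; `decode_blob` is a hypothetical helper):

```c
/* Illustrative sketch: size the output buffer from the length prefix
 * actually decoded from the input stream, instead of a hard-coded
 * maximum that a larger ticket would overflow. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Decode a u32-length-prefixed blob from [*p, end), allocating *obuf
 * to fit.  Returns the blob length on success, -1 on error.
 * (Length prefix is copied in host byte order for simplicity.) */
static int decode_blob(const uint8_t **p, const uint8_t *end,
		       uint8_t **obuf)
{
	uint32_t len;

	if (end - *p < (ptrdiff_t)sizeof(len))
		return -1;                 /* truncated header */
	memcpy(&len, *p, sizeof(len));
	*p += sizeof(len);

	if ((uint32_t)(end - *p) < len)
		return -1;                 /* truncated payload */

	*obuf = malloc(len);               /* sized from the wire, not a #define */
	if (!*obuf)
		return -1;
	memcpy(*obuf, *p, len);
	*p += len;
	return (int)len;
}
```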
Re: [PATCH] libceph: fix a use after free issue in osdmap_set_max_osd
On Sun, Sep 7, 2014 at 6:27 PM, Ilya Dryomov <ilya.dryo...@inktank.com> wrote:
> On Sun, Sep 7, 2014 at 2:10 PM, <roy.qing...@gmail.com> wrote:
>> From: Li RongQing <roy.qing...@gmail.com>
>>
>> If the state variable is krealloced successfully, map->osd_state will
>> have been freed; if either of the following two reallocations then
>> fails and we exit the function without resetting map->osd_state,
>> map->osd_state becomes a wild pointer.  Fix it by resetting the
>> pointers immediately after each successful krealloc.
>>
>> Signed-off-by: Li RongQing <roy.qing...@gmail.com>
>> ---
>>  net/ceph/osdmap.c | 20 ++++++++++----------
>>  1 file changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
>> index c547e46..81e9c66 100644
>> --- a/net/ceph/osdmap.c
>> +++ b/net/ceph/osdmap.c
>> @@ -671,15 +671,19 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
>>         int i;
>>
>>         state = krealloc(map->osd_state, max*sizeof(*state), GFP_NOFS);
>> +       if (!state)
>> +               return -ENOMEM;
>> +       map->osd_state = state;
>> +
>>         weight = krealloc(map->osd_weight, max*sizeof(*weight), GFP_NOFS);
>> -       addr = krealloc(map->osd_addr, max*sizeof(*addr), GFP_NOFS);
>> -       if (!state || !weight || !addr) {
>> -               kfree(state);
>> -               kfree(weight);
>> -               kfree(addr);
>> +       if (!weight)
>> +               return -ENOMEM;
>> +       map->osd_weight = weight;
>>
>> +       addr = krealloc(map->osd_addr, max*sizeof(*addr), GFP_NOFS);
>> +       if (!addr)
>>                 return -ENOMEM;
>> -       }
>> +       map->osd_addr = addr;
>>
>>         for (i = map->max_osd; i < max; i++) {
>>                 state[i] = 0;
>> @@ -687,10 +691,6 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
>>                 memset(addr + i, 0, sizeof(*addr));
>>         }
>>
>> -       map->osd_state = state;
>> -       map->osd_weight = weight;
>> -       map->osd_addr = addr;
>> -
>>         if (map->osd_primary_affinity) {
>>                 u32 *affinity;
>> --
>> 1.7.10.4
>
> Looks good.  I'll apply it tomorrow.

Pushed a branch with your patch.  Minor modifications: use map->* instead of local variables for initialization and change the primary affinity case so it fits in.
You can take a look at
https://github.com/ceph/ceph-client/commit/790415a024871a2100388ce4b3d485756fb90865

Thanks,

                Ilya
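The bug class fixed above is worth spelling out: when growing several parallel arrays, freeing the new buffers on partial failure leaves the owning struct pointing at memory that krealloc() already freed. The safe pattern is to commit each successfully reallocated pointer back into the struct immediately. A userspace sketch (illustrative only; `struct map` is a hypothetical stand-in for `ceph_osdmap`):

```c
/* Sketch of the use-after-free class fixed above: commit each
 * successfully reallocated pointer into the owning struct right away.
 * Then, if a later allocation fails, the struct never dangles. */
#include <stdlib.h>

struct map {
	int  max;
	int *state;
	int *weight;
};

static int map_set_max(struct map *m, int max)
{
	int *state, *weight;
	int i;

	state = realloc(m->state, max * sizeof(*state));
	if (!state)
		return -1;
	m->state = state;	/* commit immediately: never dangles */

	weight = realloc(m->weight, max * sizeof(*weight));
	if (!weight)
		return -1;	/* m->state is still valid here */
	m->weight = weight;

	for (i = m->max; i < max; i++) {
		m->state[i] = 0;
		m->weight[i] = 0;
	}
	m->max = max;
	return 0;
}
```

The original code reallocated all three arrays first and only then assigned them, so a failure in the second or third krealloc() freed the buffer behind the first pointer while the struct still referenced it.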
[PATCH] ceph: trim unused inodes before reconnecting to recovering MDS
So the recovering MDS does not need to fetch these unused inodes during
cache rejoin.  This may reduce MDS recovery time.

Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 fs/ceph/mds_client.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index bad07c0..f751fea 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2695,16 +2695,6 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
 	session->s_state = CEPH_MDS_SESSION_RECONNECTING;
 	session->s_seq = 0;
 
-	ceph_con_close(&session->s_con);
-	ceph_con_open(&session->s_con,
-		      CEPH_ENTITY_TYPE_MDS, mds,
-		      ceph_mdsmap_get_addr(mdsc->mdsmap, mds));
-
-	/* replay unsafe requests */
-	replay_unsafe_requests(mdsc, session);
-
-	down_read(&mdsc->snap_rwsem);
-
 	dout("session %p state %s\n", session,
 	     session_state_name(session->s_state));
 
@@ -2723,6 +2713,19 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
 	discard_cap_releases(mdsc, session);
 	spin_unlock(&session->s_cap_lock);
 
+	/* trim unused caps to reduce MDS's cache rejoin time */
+	shrink_dcache_parent(mdsc->fsc->sb->s_root);
+
+	ceph_con_close(&session->s_con);
+	ceph_con_open(&session->s_con,
+		      CEPH_ENTITY_TYPE_MDS, mds,
+		      ceph_mdsmap_get_addr(mdsc->mdsmap, mds));
+
+	/* replay unsafe requests */
+	replay_unsafe_requests(mdsc, session);
+
+	down_read(&mdsc->snap_rwsem);
+
 	/* traverse this session's caps */
 	s_nr_caps = session->s_nr_caps;
 	err = ceph_pagelist_encode_32(pagelist, s_nr_caps);
--
1.9.3
Re: [ceph-users] question about RGW
[Moving this to ceph-devel, where you're more likely to get a response from a developer!]

On Wed, 10 Sep 2014, baijia...@126.com wrote:
> While reading the RGW code I can't understand master_ver inside
> struct rgw_bucket_dir_header.  Can someone explain this struct,
> in particular master_ver and stats?  Thanks.
>
> baijia...@126.com
Re: [PATCH net-next 2/5] ceph: Convert pr_warning to pr_warn
On Wed, Sep 10, 2014 at 8:17 AM, Joe Perches <j...@perches.com> wrote:
> Use the more common pr_warn.
>
> Other miscellanea:
> o Coalesce formats
> o Realign arguments
>
> Signed-off-by: Joe Perches <j...@perches.com>
> ---
>  net/ceph/ceph_common.c | 15 +++++++++------
>  net/ceph/messenger.c   | 19 ++++++++++---------
>  net/ceph/osd_client.c  | 15 ++++++++-------
>  net/ceph/osdmap.c      | 20 ++++++++++----------
>  4 files changed, 36 insertions(+), 33 deletions(-)
>
> diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
> index 1675021..58fbfe1 100644
> --- a/net/ceph/ceph_common.c
> +++ b/net/ceph/ceph_common.c
> @@ -293,17 +293,20 @@ static int get_secret(struct ceph_crypto_key *dst, const char *name) {
>  		key_err = PTR_ERR(ukey);
>  		switch (key_err) {
>  		case -ENOKEY:
> -			pr_warning("ceph: Mount failed due to key not found: %s\n", name);
> +			pr_warn("ceph: Mount failed due to key not found: %s\n",
> +				name);
>  			break;
>  		case -EKEYEXPIRED:
> -			pr_warning("ceph: Mount failed due to expired key: %s\n", name);
> +			pr_warn("ceph: Mount failed due to expired key: %s\n",
> +				name);
>  			break;
>  		case -EKEYREVOKED:
> -			pr_warning("ceph: Mount failed due to revoked key: %s\n", name);
> +			pr_warn("ceph: Mount failed due to revoked key: %s\n",
> +				name);
>  			break;
>  		default:
> -			pr_warning("ceph: Mount failed due to unknown key error %d: %s\n", key_err, name);
> +			pr_warn("ceph: Mount failed due to unknown key error %d: %s\n",
> +				key_err, name);
>  		}
>  		err = -EPERM;
>  		goto out;
> @@ -433,7 +436,7 @@ ceph_parse_options(char *options, const char *dev_name,
>
>  		/* misc */
>  		case Opt_osdtimeout:
> -			pr_warning("ignoring deprecated osdtimeout option\n");
> +			pr_warn("ignoring deprecated osdtimeout option\n");
>  			break;
>  		case Opt_osdkeepalivetimeout:
>  			opt->osd_keepalive_timeout = intval;
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b2f571d..a7203be 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1937,11 +1937,11 @@ static int process_banner(struct ceph_connection *con)
>  		   sizeof(con->peer_addr)) != 0 &&
>  	    !(addr_is_blank(&con->actual_peer_addr.in_addr) &&
>  	      con->actual_peer_addr.nonce == con->peer_addr.nonce)) {
> -		pr_warning("wrong peer, want %s/%d, got %s/%d\n",
> -			   ceph_pr_addr(&con->peer_addr.in_addr),
> -			   (int)le32_to_cpu(con->peer_addr.nonce),
> -			   ceph_pr_addr(&con->actual_peer_addr.in_addr),
> -			   (int)le32_to_cpu(con->actual_peer_addr.nonce));
> +		pr_warn("wrong peer, want %s/%d, got %s/%d\n",
> +			ceph_pr_addr(&con->peer_addr.in_addr),
> +			(int)le32_to_cpu(con->peer_addr.nonce),
> +			ceph_pr_addr(&con->actual_peer_addr.in_addr),
> +			(int)le32_to_cpu(con->actual_peer_addr.nonce));
>  		con->error_msg = "wrong peer at address";
>  		return -1;
>  	}
> @@ -2302,7 +2302,7 @@ static int read_partial_message(struct ceph_connection *con)
>  		BUG_ON(!con->in_msg ^ skip);
>  		if (con->in_msg && data_len > con->in_msg->data_length) {
> -			pr_warning("%s skipping long message (%u > %zd)\n",
> +			pr_warn("%s skipping long message (%u > %zd)\n",
>  				__func__, data_len, con->in_msg->data_length);
>  			ceph_msg_put(con->in_msg);
>  			con->in_msg = NULL;
> @@ -2712,7 +2712,7 @@ static bool con_sock_closed(struct ceph_connection *con)
>  	CASE(OPEN);
>  	CASE(STANDBY);
>  	default:
> -		pr_warning("%s con %p unrecognized state %lu\n",
> +		pr_warn("%s con %p unrecognized state %lu\n",
>  			__func__, con, con->state);
>  		con->error_msg = "unrecognized con state";
>  		BUG();
> @@ -2828,8 +2828,9 @@ static void con_work(struct work_struct *work)
>   */
>  static void con_fault(struct ceph_connection *con)
>  {
> -	pr_warning("%s%lld %s %s\n", ENTITY_NAME(con->peer_name),
> -		   ceph_pr_addr(&con->peer_addr.in_addr), con->error_msg);
> +	pr_warn("%s%lld %s %s\n",
> +		ENTITY_NAME(con->peer_name),
> +		ceph_pr_addr(&con->peer_addr.in_addr), con->error_msg);
>
>  	dout("fault %p state %lu to peer %s\n",
Re: [PATCH net-next 2/5] ceph: Convert pr_warning to pr_warn
On Wed, Sep 10, 2014 at 8:17 AM, Joe Perches <j...@perches.com> wrote:
> Use the more common pr_warn.
>
> Other miscellanea:
> o Coalesce formats
> o Realign arguments
>
> Signed-off-by: Joe Perches <j...@perches.com>

[...]

> @@ -2828,8 +2828,9 @@ static void con_work(struct work_struct *work)
>   */
>  static void con_fault(struct ceph_connection *con)
>  {
> -	pr_warning("%s%lld %s %s\n", ENTITY_NAME(con->peer_name),
> -		   ceph_pr_addr(&con->peer_addr.in_addr), con->error_msg);
> +	pr_warn("%s%lld %s %s\n",
> +		ENTITY_NAME(con->peer_name),
> +		ceph_pr_addr(&con->peer_addr.in_addr), con->error_msg);

Just wondering, why did you move ENTITY_NAME
RE: Question to RWLock reverse DNS ip=hostname
On Wed, 10 Sep 2014, Andreas Joachim Peters wrote:
> We did a lot of testing of RW locks in another multi-threaded project
> and we successfully use PTHREAD_RWLOCK_PREFER_WRITER_NP, which works
> perfectly fine if you avoid recursive read locks.  It is probably good
> practice to avoid any recursive mutex in general.

The good news is that there are (I believe) no recursive uses of the mutexes or rwlocks in the code base.  Our lockdep dependency checking system throws errors if it detects any recursion.  There might be a few exceptions that are explicitly marked recursive via the constructor.

> If one stays with the default RW mutex without setting attributes, we
> have seen that writers starve for long periods if there is contention
> on the read mutex (n readers, n writers).  I agree this is a very
> 'sensitive' topic but could well be an origin of threading scalability
> problems - it depends how and where you use the RW locks in CEPH.

Yeah!  I suspect the way to approach this would be to make reader vs writer favoritism explicit in the constructor arguments, with no change in behavior, and then a series of patches that switch to writer favoritism...

sage

[...]
Re: [PATCH 1/3] libceph: gracefully handle large reply messages from the mon
Reviewed-by: Sage Weil <s...@redhat.com>

On Wed, 10 Sep 2014, Ilya Dryomov wrote:
> From: Sage Weil <s...@redhat.com>
>
> We preallocate a few of the message types we get back from the mon.  If
> we get a larger message than we are expecting, fall back to trying to
> allocate a new one instead of blindly using the one we have.
>
> CC: sta...@vger.kernel.org
> Signed-off-by: Sage Weil <s...@redhat.com>

[...]
Re: [PATCH 2/3] libceph: add process_one_ticket() helper
Reviewed-by: Sage Weil <s...@redhat.com>

On Wed, 10 Sep 2014, Ilya Dryomov wrote:
> Add a helper for processing individual cephx auth tickets.  Needed for
> the next commit, which deals with allocating ticket buffers.  (Most of
> the diff here is whitespace - view with git diff -b).
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Ilya Dryomov <ilya.dryo...@inktank.com>

[...]
Re: [PATCH 3/3] libceph: do not hard code max auth ticket len
On Wed, 10 Sep 2014, Ilya Dryomov wrote:

We hard code the cephx auth ticket buffer size to 256 bytes. This isn't enough for any moderate setup and, in case tickets themselves are not encrypted, leads to buffer overflows (ceph_x_decrypt() errors out, but ceph_decode_copy() doesn't - it's just a memcpy() wrapper). Since the buffer is allocated dynamically anyway, allocate it a bit later, at the point where we know how much is going to be needed.

Fixes: http://tracker.ceph.com/issues/8979
Cc: sta...@vger.kernel.org
Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
Reviewed-by: Sage Weil s...@redhat.com
---
 net/ceph/auth_x.c | 64 -
 1 file changed, 29 insertions(+), 35 deletions(-)

diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
index 0eb146dce1aa..de6662b14e1f 100644
--- a/net/ceph/auth_x.c
+++ b/net/ceph/auth_x.c
@@ -13,8 +13,6 @@
 #include "auth_x.h"
 #include "auth_x_protocol.h"
 
-#define TEMP_TICKET_BUF_LEN	256
-
 static void ceph_x_validate_tickets(struct ceph_auth_client *ac, int *pneed);
 
 static int ceph_x_is_authenticated(struct ceph_auth_client *ac)
@@ -64,7 +62,7 @@ static int ceph_x_encrypt(struct ceph_crypto_key *secret,
 }
 
 static int ceph_x_decrypt(struct ceph_crypto_key *secret,
-			  void **p, void *end, void *obuf, size_t olen)
+			  void **p, void *end, void **obuf, size_t olen)
 {
 	struct ceph_x_encrypt_header head;
 	size_t head_len = sizeof(head);
@@ -75,8 +73,14 @@ static int ceph_x_decrypt(struct ceph_crypto_key *secret,
 		return -EINVAL;
 
 	dout("ceph_x_decrypt len %d\n", len);
-	ret = ceph_decrypt2(secret, &head, &head_len, obuf, &olen,
-			    *p, len);
+	if (*obuf == NULL) {
+		*obuf = kmalloc(len, GFP_NOFS);
+		if (!*obuf)
+			return -ENOMEM;
+		olen = len;
+	}
+
+	ret = ceph_decrypt2(secret, &head, &head_len, *obuf, &olen, *p, len);
 	if (ret)
 		return ret;
 	if (head.struct_v != 1 || le64_to_cpu(head.magic) != CEPHX_ENC_MAGIC)
@@ -131,18 +135,19 @@ static void remove_ticket_handler(struct ceph_auth_client *ac,
 
 static int process_one_ticket(struct ceph_auth_client *ac,
 			      struct ceph_crypto_key *secret,
-			      void **p, void *end,
-			      void *dbuf, void *ticket_buf)
+			      void **p, void *end)
 {
 	struct ceph_x_info *xi = ac->private;
 	int type;
 	u8 tkt_struct_v, blob_struct_v;
 	struct ceph_x_ticket_handler *th;
+	void *dbuf = NULL;
 	void *dp, *dend;
 	int dlen;
 	char is_enc;
 	struct timespec validity;
 	struct ceph_crypto_key old_key;
+	void *ticket_buf = NULL;
 	void *tp, *tpend;
 	struct ceph_timespec new_validity;
 	struct ceph_crypto_key new_session_key;
@@ -167,8 +172,7 @@ static int process_one_ticket(struct ceph_auth_client *ac,
 	}
 
 	/* blob for me */
-	dlen = ceph_x_decrypt(secret, p, end, dbuf,
-			      TEMP_TICKET_BUF_LEN);
+	dlen = ceph_x_decrypt(secret, p, end, &dbuf, 0);
 	if (dlen <= 0) {
 		ret = dlen;
 		goto out;
@@ -195,20 +199,25 @@ static int process_one_ticket(struct ceph_auth_client *ac,
 
 	/* ticket blob for service */
 	ceph_decode_8_safe(p, end, is_enc, bad);
-	tp = ticket_buf;
 	if (is_enc) {
 		/* encrypted */
 		dout(" encrypted ticket\n");
-		dlen = ceph_x_decrypt(&old_key, p, end, ticket_buf,
-				      TEMP_TICKET_BUF_LEN);
+		dlen = ceph_x_decrypt(&old_key, p, end, &ticket_buf, 0);
 		if (dlen < 0) {
 			ret = dlen;
 			goto out;
 		}
+		tp = ticket_buf;
 		dlen = ceph_decode_32(&tp);
 	} else {
 		/* unencrypted */
 		ceph_decode_32_safe(p, end, dlen, bad);
+		ticket_buf = kmalloc(dlen, GFP_NOFS);
+		if (!ticket_buf) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		tp = ticket_buf;
 		ceph_decode_need(p, end, dlen, bad);
 		ceph_decode_copy(p, ticket_buf, dlen);
 	}
@@ -237,6 +246,8 @@ static int process_one_ticket(struct ceph_auth_client *ac,
 	xi->have_keys |= th->service;
 
 out:
+	kfree(ticket_buf);
+	kfree(dbuf);
 	return ret;
 
 bad:
@@ -249,21 +260,10 @@ static int ceph_x_proc_ticket_reply(struct ceph_auth_client *ac,
 				    void *buf, void *end)
 {
 	void *p = buf;
-	char *dbuf;
-	char *ticket_buf;
 	u8 reply_struct_v;
 	u32 num;
 	int ret;
 
-	dbuf = kmalloc(TEMP_TICKET_BUF_LEN, GFP_NOFS);
-	if (!dbuf)
-
[ANN] ceph-deploy 1.5.14 released
Hi All, There is a new bug-fix release of ceph-deploy that fixes environment variable handling which could sometimes cause problems when remote hosts depend on those variables. It is also now possible to specify public and cluster networks when creating a new ceph.conf file. Make sure you update! -Alfredo -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 2/5] ceph: Convert pr_warning to pr_warn
On Wed, 2014-09-10 at 16:38 +0400, Ilya Dryomov wrote:
On Wed, Sep 10, 2014 at 8:17 AM, Joe Perches j...@perches.com wrote:

Use the more common pr_warn. Other miscellanea: o Coalesce formats o Realign arguments

[]
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
[]
@@ -2828,8 +2828,9 @@ static void con_work(struct work_struct *work)
  */
 static void con_fault(struct ceph_connection *con)
 {
-	pr_warning("%s%lld %s %s\n", ENTITY_NAME(con->peer_name),
-		   ceph_pr_addr(&con->peer_addr.in_addr), con->error_msg);
+	pr_warn("%s%lld %s %s\n",
+		ENTITY_NAME(con->peer_name),
+		ceph_pr_addr(&con->peer_addr.in_addr), con->error_msg);

Just wondering, why did you move the ENTITY_NAME argument to the next line? I see you are doing this pretty consistently. Is this a style guideline or something?

Hello Ilya. It's just a personal preference to have the formats on one line and the arguments on another when the format and its arguments don't fit in 80 columns. Maximal 80-column fill isn't always easy for me to read. I find it easier to match formats and arguments when scanning because my eye doesn't have to scan to the end of the format line. If you don't like it, it's your code. Change it. cheers, Joe
Re: [PATCH net-next 2/5] ceph: Convert pr_warning to pr_warn
On Wed, Sep 10, 2014 at 4:30 PM, Ilya Dryomov ilya.dryo...@inktank.com wrote:
On Wed, Sep 10, 2014 at 8:17 AM, Joe Perches j...@perches.com wrote:

Use the more common pr_warn. Other miscellanea: o Coalesce formats o Realign arguments

Signed-off-by: Joe Perches j...@perches.com
---
 net/ceph/ceph_common.c | 15 +--
 net/ceph/messenger.c   | 19 ++-
 net/ceph/osd_client.c  | 15 +++
 net/ceph/osdmap.c      | 20 ++--
 4 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index 1675021..58fbfe1 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -293,17 +293,20 @@ static int get_secret(struct ceph_crypto_key *dst, const char *name) {
 		key_err = PTR_ERR(ukey);
 		switch (key_err) {
 		case -ENOKEY:
-			pr_warning("ceph: Mount failed due to key not found: %s\n", name);
+			pr_warn("ceph: Mount failed due to key not found: %s\n",
+				name);
 			break;
 		case -EKEYEXPIRED:
-			pr_warning("ceph: Mount failed due to expired key: %s\n", name);
+			pr_warn("ceph: Mount failed due to expired key: %s\n",
+				name);
 			break;
 		case -EKEYREVOKED:
-			pr_warning("ceph: Mount failed due to revoked key: %s\n", name);
+			pr_warn("ceph: Mount failed due to revoked key: %s\n",
+				name);
 			break;
 		default:
-			pr_warning("ceph: Mount failed due to unknown key error - %d: %s\n", key_err, name);
+			pr_warn("ceph: Mount failed due to unknown key error %d: %s\n",
+				key_err, name);
 		}
 		err = -EPERM;
 		goto out;
@@ -433,7 +436,7 @@ ceph_parse_options(char *options, const char *dev_name,
 
 		/* misc */
 		case Opt_osdtimeout:
-			pr_warning("ignoring deprecated osdtimeout option\n");
+			pr_warn("ignoring deprecated osdtimeout option\n");
 			break;
 		case Opt_osdkeepalivetimeout:
 			opt->osd_keepalive_timeout = intval;
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index b2f571d..a7203be 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1937,11 +1937,11 @@ static int process_banner(struct ceph_connection *con)
 		   sizeof(con->peer_addr)) != 0 &&
 	    !(addr_is_blank(&con->actual_peer_addr.in_addr) &&
 	      con->actual_peer_addr.nonce == con->peer_addr.nonce)) {
-		pr_warning("wrong peer, want %s/%d, got %s/%d\n",
-			   ceph_pr_addr(&con->peer_addr.in_addr),
-			   (int)le32_to_cpu(con->peer_addr.nonce),
-			   ceph_pr_addr(&con->actual_peer_addr.in_addr),
-			   (int)le32_to_cpu(con->actual_peer_addr.nonce));
+		pr_warn("wrong peer, want %s/%d, got %s/%d\n",
+			ceph_pr_addr(&con->peer_addr.in_addr),
+			(int)le32_to_cpu(con->peer_addr.nonce),
+			ceph_pr_addr(&con->actual_peer_addr.in_addr),
+			(int)le32_to_cpu(con->actual_peer_addr.nonce));
 		con->error_msg = "wrong peer at address";
 		return -1;
 	}
@@ -2302,7 +2302,7 @@ static int read_partial_message(struct ceph_connection *con)
 		BUG_ON(!con->in_msg ^ skip);
 		if (con->in_msg && data_len > con->in_msg->data_length) {
-			pr_warning("%s skipping long message (%u > %zd)\n",
+			pr_warn("%s skipping long message (%u > %zd)\n",
 				__func__, data_len, con->in_msg->data_length);
 			ceph_msg_put(con->in_msg);
 			con->in_msg = NULL;
@@ -2712,7 +2712,7 @@ static bool con_sock_closed(struct ceph_connection *con)
 	CASE(OPEN);
 	CASE(STANDBY);
 	default:
-		pr_warning("%s con %p unrecognized state %lu\n",
+		pr_warn("%s con %p unrecognized state %lu\n",
 			__func__, con, con->state);
 		con->error_msg = "unrecognized con state";
 		BUG();
@@ -2828,8 +2828,9 @@ static void con_work(struct work_struct *work)
  */
 static void con_fault(struct ceph_connection *con)
 {
-	pr_warning("%s%lld %s %s\n", ENTITY_NAME(con->peer_name),
-		   ceph_pr_addr(&con->peer_addr.in_addr), con->error_msg);
+	pr_warn("%s%lld %s %s\n",
+		ENTITY_NAME(con->peer_name),
+
Re: [ceph-users] [ANN] ceph-deploy 1.5.14 released
Suggestion: Can you link to a changelog of any new features or major bug fixes when you do new releases. Thanks, Scottix On Wed, Sep 10, 2014 at 6:45 AM, Alfredo Deza alfredo.d...@inktank.com wrote: Hi All, There is a new bug-fix release of ceph-deploy that helps prevent the environment variable issues that sometimes may cause issues when depending on them on remote hosts. It is also now possible to specify public and cluster networks when creating a new ceph.conf file. Make sure you update! -Alfredo ___ ceph-users mailing list ceph-us...@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Follow Me: @Scottix http://about.me/scottix scot...@gmail.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OpTracker optimization
Added a comment about the approach. -Sam

On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy somnath@sandisk.com wrote: Hi Sam/Sage, As we discussed earlier, enabling the present OpTracker code degrades performance severely. For example, in my setup a single OSD node with 10 clients reaches ~103K read iops with io served from memory while optracking is disabled, but with optracker enabled it is reduced to ~39K iops. Running an OSD without enabling OpTracker is probably not an option for many Ceph users. Now, by sharding the OpTracker::ops_in_flight_lock (and thus the xlist ops_in_flight) and removing some other bottlenecks, I am able to match the performance of an OpTracking-enabled OSD with OpTracking disabled, at the expense of ~1 extra cpu core. In this process I have also fixed the following tracker: http://tracker.ceph.com/issues/9384 and probably http://tracker.ceph.com/issues/8885 too. I have created the following pull request for the same. Please review it. https://github.com/ceph/ceph/pull/2440 Thanks & Regards Somnath

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
Re: [PATCH net-next 0/5] net: Convert pr_warning to pr_warn
From: Joe Perches j...@perches.com Date: Tue, 9 Sep 2014 21:17:27 -0700 Remove remaining uses of pr_warning in net/ Joe Perches (5): atm: Convert pr_warning to pr_warn ceph: Convert pr_warning to pr_warn pktgen: Convert pr_warning to pr_warn iucv: Convert pr_warning to pr_warn netfilter: Convert pr_warning to pr_warn All except the ceph patch which is going via another tree. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: OpTracker optimization
Thanks Sam..I responded back :-) -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just Sent: Wednesday, September 10, 2014 11:17 AM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization Added a comment about the approach. -Sam On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy somnath@sandisk.com wrote: Hi Sam/Sage, As we discussed earlier, enabling the present OpTracker code degrading performance severely. For example, in my setup a single OSD node with 10 clients is reaching ~103K read iops with io served from memory while optracking is disabled but enabling optracker it is reduced to ~39K iops. Probably, running OSD without enabling OpTracker is not an option for many of Ceph users. Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist ops_in_flight) and removing some other bottlenecks I am able to match the performance of OpTracking enabled OSD with OpTracking disabled, but with the expense of ~1 extra cpu core. In this process I have also fixed the following tracker. http://tracker.ceph.com/issues/9384 and probably http://tracker.ceph.com/issues/8885 too. I have created following pull request for the same. Please review it. https://github.com/ceph/ceph/pull/2440 Thanks Regards Somnath PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). 
Re: OpTracker optimization
Responded with cosmetic nonsense. Once you've got that and the other comments addressed, I can put it in wip-sam-testing. -Sam On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy somnath@sandisk.com wrote: Thanks Sam..I responded back :-) -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just Sent: Wednesday, September 10, 2014 11:17 AM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization Added a comment about the approach. -Sam On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy somnath@sandisk.com wrote: Hi Sam/Sage, As we discussed earlier, enabling the present OpTracker code degrading performance severely. For example, in my setup a single OSD node with 10 clients is reaching ~103K read iops with io served from memory while optracking is disabled but enabling optracker it is reduced to ~39K iops. Probably, running OSD without enabling OpTracker is not an option for many of Ceph users. Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist ops_in_flight) and removing some other bottlenecks I am able to match the performance of OpTracking enabled OSD with OpTracking disabled, but with the expense of ~1 extra cpu core. In this process I have also fixed the following tracker. http://tracker.ceph.com/issues/9384 and probably http://tracker.ceph.com/issues/8885 too. I have created following pull request for the same. Please review it. https://github.com/ceph/ceph/pull/2440 Thanks Regards Somnath PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. 
RE: OpTracker optimization
Thanks Sam. So, you want me to go with optracker/shadedopWq , right ? Regards Somnath -Original Message- From: Samuel Just [mailto:sam.j...@inktank.com] Sent: Wednesday, September 10, 2014 2:36 PM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization Responded with cosmetic nonsense. Once you've got that and the other comments addressed, I can put it in wip-sam-testing. -Sam On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy somnath@sandisk.com wrote: Thanks Sam..I responded back :-) -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just Sent: Wednesday, September 10, 2014 11:17 AM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization Added a comment about the approach. -Sam On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy somnath@sandisk.com wrote: Hi Sam/Sage, As we discussed earlier, enabling the present OpTracker code degrading performance severely. For example, in my setup a single OSD node with 10 clients is reaching ~103K read iops with io served from memory while optracking is disabled but enabling optracker it is reduced to ~39K iops. Probably, running OSD without enabling OpTracker is not an option for many of Ceph users. Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist ops_in_flight) and removing some other bottlenecks I am able to match the performance of OpTracking enabled OSD with OpTracking disabled, but with the expense of ~1 extra cpu core. In this process I have also fixed the following tracker. http://tracker.ceph.com/issues/9384 and probably http://tracker.ceph.com/issues/8885 too. I have created following pull request for the same. Please review it. 
https://github.com/ceph/ceph/pull/2440 Thanks & Regards Somnath
Re: OpTracker optimization
I don't quite understand. -Sam On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy somnath@sandisk.com wrote: Thanks Sam. So, you want me to go with optracker/shadedopWq , right ? Regards Somnath -Original Message- From: Samuel Just [mailto:sam.j...@inktank.com] Sent: Wednesday, September 10, 2014 2:36 PM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization Responded with cosmetic nonsense. Once you've got that and the other comments addressed, I can put it in wip-sam-testing. -Sam On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy somnath@sandisk.com wrote: Thanks Sam..I responded back :-) -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just Sent: Wednesday, September 10, 2014 11:17 AM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization Added a comment about the approach. -Sam On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy somnath@sandisk.com wrote: Hi Sam/Sage, As we discussed earlier, enabling the present OpTracker code degrading performance severely. For example, in my setup a single OSD node with 10 clients is reaching ~103K read iops with io served from memory while optracking is disabled but enabling optracker it is reduced to ~39K iops. Probably, running OSD without enabling OpTracker is not an option for many of Ceph users. Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist ops_in_flight) and removing some other bottlenecks I am able to match the performance of OpTracking enabled OSD with OpTracking disabled, but with the expense of ~1 extra cpu core. In this process I have also fixed the following tracker. http://tracker.ceph.com/issues/9384 and probably http://tracker.ceph.com/issues/8885 too. I have created following pull request for the same. Please review it. 
https://github.com/ceph/ceph/pull/2440 Thanks & Regards Somnath
RE: OpTracker optimization
As I understand, you want me to implement the following. 1. Keep this implementation one sharded optracker for the ios going through ms_dispatch path. 2. Additionally, for ios going through ms_fast_dispatch, you want me to implement optracker (without internal shard) per opwq shard Am I right ? Thanks Regards Somnath -Original Message- From: Samuel Just [mailto:sam.j...@inktank.com] Sent: Wednesday, September 10, 2014 3:08 PM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization I don't quite understand. -Sam On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy somnath@sandisk.com wrote: Thanks Sam. So, you want me to go with optracker/shadedopWq , right ? Regards Somnath -Original Message- From: Samuel Just [mailto:sam.j...@inktank.com] Sent: Wednesday, September 10, 2014 2:36 PM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization Responded with cosmetic nonsense. Once you've got that and the other comments addressed, I can put it in wip-sam-testing. -Sam On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy somnath@sandisk.com wrote: Thanks Sam..I responded back :-) -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just Sent: Wednesday, September 10, 2014 11:17 AM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization Added a comment about the approach. -Sam On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy somnath@sandisk.com wrote: Hi Sam/Sage, As we discussed earlier, enabling the present OpTracker code degrading performance severely. For example, in my setup a single OSD node with 10 clients is reaching ~103K read iops with io served from memory while optracking is disabled but enabling optracker it is reduced to ~39K iops. 
Probably, running OSD without enabling OpTracker is not an option for many of Ceph users. Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist ops_in_flight) and removing some other bottlenecks I am able to match the performance of OpTracking enabled OSD with OpTracking disabled, but with the expense of ~1 extra cpu core. In this process I have also fixed the following tracker. http://tracker.ceph.com/issues/9384 and probably http://tracker.ceph.com/issues/8885 too. I have created following pull request for the same. Please review it. https://github.com/ceph/ceph/pull/2440 Thanks Regards Somnath PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. 
Re: OpTracker optimization
Re: OpTracker optimization
Oh, I changed my mind, your approach is fine. I was unclear. Currently, I just need you to address the other comments. -Sam

On Wed, Sep 10, 2014 at 3:13 PM, Somnath Roy somnath@sandisk.com wrote:
As I understand, you want me to implement the following. 1. Keep this implementation: one sharded optracker for the ios going through the ms_dispatch path. 2. Additionally, for ios going through ms_fast_dispatch, implement an optracker (without internal sharding) per opwq shard. Am I right? Thanks & Regards Somnath

-Original Message- From: Samuel Just [mailto:sam.j...@inktank.com] Sent: Wednesday, September 10, 2014 3:08 PM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization

I don't quite understand. -Sam

On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy somnath@sandisk.com wrote:
Thanks Sam. So, you want me to go with optracker/shardedopWq, right? Regards Somnath

-Original Message- From: Samuel Just [mailto:sam.j...@inktank.com] Sent: Wednesday, September 10, 2014 2:36 PM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization

Responded with cosmetic nonsense. Once you've got that and the other comments addressed, I can put it in wip-sam-testing. -Sam

On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy somnath@sandisk.com wrote:
Thanks Sam.. I responded back :-)

-Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just Sent: Wednesday, September 10, 2014 11:17 AM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization

Added a comment about the approach. -Sam

On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy somnath@sandisk.com wrote:
Hi Sam/Sage, As we discussed earlier, enabling the present OpTracker code degrades performance severely. For example, in my setup a single OSD node with 10 clients reaches ~103K read iops (io served from memory) with optracking disabled, but enabling optracker reduces that to ~39K iops. Running the OSD with OpTracker disabled is probably not an option for many Ceph users. Now, by sharding Optracker::ops_in_flight_lock (and thus the xlist ops_in_flight) and removing some other bottlenecks, I am able to match the performance of an OpTracking-enabled OSD with OpTracking disabled, at the expense of ~1 extra cpu core. In the process I have also fixed the following trackers: http://tracker.ceph.com/issues/9384 and probably http://tracker.ceph.com/issues/8885 too. I have created the following pull request for the same. Please review it. https://github.com/ceph/ceph/pull/2440 Thanks & Regards Somnath

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
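The sharding idea in the patch above can be illustrated with a small userspace sketch: instead of one global ops_in_flight_lock protecting a single list, hash each op to one of N shards, each with its own mutex and list, so concurrent threads registering different ops rarely contend. This is a minimal model of the technique under assumed names (ShardedOpTracker, register_op), not the actual Ceph OpTracker code:

```cpp
#include <array>
#include <cstddef>
#include <list>
#include <mutex>
#include <string>

// Minimal sketch of a sharded in-flight op tracker. Each shard has its own
// mutex and list, so lock contention is spread across kShards locks instead
// of serializing on one global lock.
struct ShardedOpTracker {
    static constexpr std::size_t kShards = 32;

    struct Shard {
        std::mutex lock;
        std::list<std::string> ops_in_flight;  // stand-in for xlist<TrackedOp*>
    };
    std::array<Shard, kShards> shards;

    // Map an op's sequence number to a shard.
    std::size_t shard_of(std::size_t seq) const { return seq % kShards; }

    void register_op(std::size_t seq, const std::string& op) {
        Shard& s = shards[shard_of(seq)];
        std::lock_guard<std::mutex> g(s.lock);
        s.ops_in_flight.push_back(op);
    }

    // Aggregation (e.g. for dumping in-flight ops) walks all shards.
    std::size_t total_in_flight() {
        std::size_t n = 0;
        for (auto& s : shards) {
            std::lock_guard<std::mutex> g(s.lock);
            n += s.ops_in_flight.size();
        }
        return n;
    }
};
```

The trade-off mentioned in the thread (~1 extra cpu core) comes from operations like dumping or counting in-flight ops, which now have to take every shard lock in turn.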
RBD readahead strategies
I've been testing a few strategies for RBD readahead and wanted to share my results as well as ask for input.

I have four sample workloads that I replayed at maximum speed with rbd-replay. boot-ide and boot-virtio are captured from booting a VM with the image on the IDE and virtio buses, respectively. Likewise, grep-ide and grep-virtio are captured from a large grep run. (I'm not entirely sure why the IDE and virtio workloads are different, but part of it is the number of pending requests allowed.)

The readahead strategies are:
- none: No readahead.
- plain: My initial implementation. The readahead window doubles for each readahead request, up to a limit, and resets when a random request is detected.
- aligned: Same as above, but readahead requests are aligned with object boundaries, when possible.
- eager: When activated, read to the end of the object.

For all of these, 10 sequential requests trigger readahead, the maximum readahead size is 4 MB, and rbd readahead disable after bytes is disabled (meaning that readahead is enabled for the entire workload). The object size is the default 4 MB, and data is striped over a single object. (Alignment with stripes or object sets is ignored for now.)
Here's the data:

workload     strategy   time (seconds)     RA ops  RA MB  read ops  read MB
boot-ide     none        46.22 +/- 0.41         0      0     57516      407
boot-ide     plain       11.42 +/- 0.25       281    203     57516      407
boot-ide     aligned     11.46 +/- 0.13       276    201     57516      407
boot-ide     eager       12.48 +/- 0.61       111    303     57516      407
boot-virtio  none         9.05 +/- 0.25         0      0     11851      393
boot-virtio  plain        8.05 +/- 0.38       451    221     11851      393
boot-virtio  aligned      7.86 +/- 0.27       452    213     11851      393
boot-virtio  eager        9.17 +/- 0.34       249    600     11851      393
grep-ide     none       138.55 +/- 1.67         0      0    130104     3044
grep-ide     plain      136.07 +/- 1.57       397    867    130104     3044
grep-ide     aligned    137.30 +/- 1.77       379    844    130104     3044
grep-ide     eager      138.77 +/- 1.52       346    993    130104     3044
grep-virtio  none       120.73 +/- 1.33         0      0    130061     2820
grep-virtio  plain      121.29 +/- 1.28      1186   1485    130061     2820
grep-virtio  aligned    123.32 +/- 1.29      1139   1409    130061     2820
grep-virtio  eager      127.75 +/- 1.32       842   2218    130061     2820

(The time is the mean wall-clock time +/- the margin of error with 99.7% confidence. RA = readahead.)

Right off the bat, readahead is a huge improvement for the boot-ide workload, which is no surprise because it issues 50,000 sequential, single-sector reads. (Why the early boot process is so inefficient is open for speculation, but that's a real, natural workload.) boot-virtio also sees an improvement, although not nearly so dramatic. The grep workloads show no statistically significant improvement.

One conclusion I draw is that 'eager' is, well, too eager. 'aligned' shows no statistically significant difference from 'plain', and 'plain' is no worse than 'none' (at statistically significant levels) and sometimes better.

Should the readahead strategy be configurable, or should we just stick with whichever seems the best one? Is there anything big I'm missing?

Adam
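As a point of reference, the 'plain' and 'aligned' strategies described above can be modeled in a few lines: the window doubles on each sequential hit up to the maximum, resets on a random request, and (for 'aligned') is trimmed so readahead ends on an object boundary. The constants and names (ReadaheadState, on_read) are assumptions for illustration, not the actual librbd implementation:

```cpp
#include <algorithm>
#include <cstdint>

// Toy model of the readahead window logic described in the post.
struct ReadaheadState {
    static constexpr std::uint64_t kObjectSize = 4ULL << 20;  // 4 MB default object
    static constexpr std::uint64_t kMaxWindow  = 4ULL << 20;  // max readahead size
    static constexpr int kTriggerCount = 10;   // sequential reads before RA kicks in

    std::uint64_t next_offset = 0;        // offset expected if access is sequential
    std::uint64_t window = 64ULL << 10;   // current window; doubles on each hit
    int sequential = 0;

    // Returns how many bytes to read ahead after a read at [off, off+len).
    std::uint64_t on_read(std::uint64_t off, std::uint64_t len, bool align) {
        if (off != next_offset) {          // random request: reset the window
            sequential = 0;
            window = 64ULL << 10;
            next_offset = off + len;
            return 0;
        }
        next_offset = off + len;
        if (++sequential < kTriggerCount)  // not enough sequential history yet
            return 0;
        std::uint64_t ra = window;
        window = std::min(window * 2, kMaxWindow);
        if (align) {
            // 'aligned': trim so the readahead ends on an object boundary.
            std::uint64_t end = next_offset + ra;
            std::uint64_t aligned_end = (end / kObjectSize) * kObjectSize;
            if (aligned_end > next_offset)
                ra = aligned_end - next_offset;
        }
        return ra;
    }
};
```

The boot-ide result makes sense under this model: 50,000 single-sector sequential reads quickly satisfy the trigger, after which most data arrives via a handful of large readahead requests instead of tens of thousands of tiny ones.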
[PATCH] libceph: fix a memory leak in handle_watch_notify
From: Li RongQing roy.qing...@gmail.com

event_work should be freed when adding it to the workqueue fails.

Signed-off-by: Li RongQing roy.qing...@gmail.com
---
 net/ceph/osd_client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 30f6faf..1e1b4f1 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2323,6 +2323,7 @@ static void handle_watch_notify(struct ceph_osd_client *osdc,
 		event_work->opcode = opcode;
 		if (!queue_work(osdc->notify_wq, &event_work->work)) {
 			dout("WARNING: failed to queue notify event work\n");
+			kfree(event_work);
 			goto done_err;
 		}
 	}
--
1.7.10.4
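The leak this patch fixes follows a general ownership rule: queue_work() returns false when it does not accept the work item, in which case ownership stays with the caller, who must free the allocation. A minimal userspace analogue of the pattern (hypothetical WorkQueue/submit names, not kernel APIs):

```cpp
#include <queue>

// Toy stand-in for a workqueue whose enqueue can fail.
struct WorkQueue {
    std::queue<int*> q;
    bool accepting = true;

    // Mirrors queue_work(): returns false when the item is NOT queued,
    // i.e. the queue did not take ownership.
    bool queue_work(int* item) {
        if (!accepting) return false;
        q.push(item);
        return true;
    }
};

// Returns true on success. On failure the allocation is freed here,
// analogous to the added kfree(event_work) in the patch.
bool submit(WorkQueue& wq, int value) {
    int* item = new int(value);
    if (!wq.queue_work(item)) {
        delete item;  // caller still owns it; free instead of leaking
        return false;
    }
    return true;      // queue owns the item now
}
```

The pre-patch code took the error path without the free, so every rejected notify event leaked its event_work allocation.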
Re: [PATCH] libceph: fix a memory leak in handle_watch_notify
On 09/10/2014 07:20 PM, roy.qing...@gmail.com wrote: From: Li RongQing roy.qing...@gmail.com event_work should be freed when adding it to the workqueue fails. Signed-off-by: Li RongQing roy.qing...@gmail.com

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

---
 net/ceph/osd_client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 30f6faf..1e1b4f1 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2323,6 +2323,7 @@ static void handle_watch_notify(struct ceph_osd_client *osdc,
 		event_work->opcode = opcode;
 		if (!queue_work(osdc->notify_wq, &event_work->work)) {
 			dout("WARNING: failed to queue notify event work\n");
+			kfree(event_work);
 			goto done_err;
 		}
 	}
RE: OpTracker optimization
Sam/Sage, I have incorporated all of your comments. Please have a look at the same pull request. https://github.com/ceph/ceph/pull/2440 Thanks & Regards Somnath

-Original Message- From: Samuel Just [mailto:sam.j...@inktank.com] Sent: Wednesday, September 10, 2014 3:25 PM To: Somnath Roy Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com Subject: Re: OpTracker optimization

Oh, I changed my mind, your approach is fine. I was unclear. Currently, I just need you to address the other comments. -Sam
RE: OpTracker optimization
I had two substantive comments on the first patch and then some trivial whitespace nits. Otherwise looks good! thanks- sage

On Thu, 11 Sep 2014, Somnath Roy wrote:
Sam/Sage, I have incorporated all of your comments. Please have a look at the same pull request. https://github.com/ceph/ceph/pull/2440 Thanks & Regards Somnath