Problem with slow operation on xattr
Dear all developers,

I use the rbd kernel module on the client end, and when we test random write performance the throughput is quite poor and always drops to zero. I traced the debug logs on the server side and found that it is always blocked in the functions get_object_context, getattr() and _setattrs. The average time is about hundreds of milliseconds. Even worse, the maximum latency is up to 4-6 seconds, so the throughput observed on the client side is blocked for several seconds at a time. This is really ruining the performance of the cluster.

Therefore, I carefully analyzed the functions mentioned above (get_object_context, getattr() and _setattrs). I cannot find any blocking code except for the xattr system calls (fgetxattr, fsetxattr, flistxattr). On the OSD node I use XFS as the underlying OSD file system, and by default it will use the extended attribute feature of XFS to store the ceph user xattrs ("_" and "snapset"). Since those system calls are synchronous, I set the I/O scheduler of the disk to [deadline] so that metadata reads are not blocked for a long time before being served. However, even so, the performance is still quite poor and the functions mentioned above are still blocked, sometimes for up to several seconds.

Therefore, I want to know how to solve this problem. Does Ceph provide any user-space cache for xattrs? Is this problem caused by the XFS file system and its xattr system calls?

Furthermore, I tried to disable the XFS xattr feature by setting "filestore_max_inline_xattrs_xfs = 0" and "filestore_max_inline_xattr_size_xfs = 0", so that the xattr key/value pairs are stored in omap, implemented with LevelDB. This improves things a bit; the maximum blocked interval drops to about 1-2 seconds. But if an xattr is read from the physical disk rather than the page cache, it is still quite slow.
So I wonder, would it be a good idea to cache all xattr data in a user-space cache? For the "_" xattr, the length is just 242 bytes if we use the XFS file system. For hundreds of thousands of objects, it would cost less than 100 MB.

Best Regards,
Neal Yao
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
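For reference, the omap fallback Neal describes corresponds to a ceph.conf fragment along these lines (the [osd] section placement is an assumption; the option names are the ones quoted in the thread):

```ini
[osd]
; push all xattrs out of the XFS inline attribute area and into
; omap (LevelDB) instead
filestore_max_inline_xattrs_xfs = 0
filestore_max_inline_xattr_size_xfs = 0
```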
Re: [PATCH 03/14] libceph: move and add dout()s to ceph_msg_{get,put}()
On Mon, Jun 30, 2014 at 4:29 PM, Alex Elder el...@ieee.org wrote: On 06/25/2014 12:16 PM, Ilya Dryomov wrote:

Add dout()s to ceph_msg_{get,put}(). Also move them to .c and turn the kref release callback into a static function.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

This is all very good. I have one suggestion though, below, but regardless:

Reviewed-by: Alex Elder el...@linaro.org

---
 include/linux/ceph/messenger.h | 14 ++
 net/ceph/messenger.c           | 31 ++-
 2 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index d21f2dba0731..40ae58e3e9db 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -285,19 +285,9 @@ extern void ceph_msg_data_add_bio(struct ceph_msg *msg, struct bio *bio,
 extern struct ceph_msg *ceph_msg_new(int type, int front_len, gfp_t flags,
 				     bool can_fail);
-extern void ceph_msg_kfree(struct ceph_msg *m);
-
-static inline struct ceph_msg *ceph_msg_get(struct ceph_msg *msg)
-{
-	kref_get(&msg->kref);
-	return msg;
-}
-extern void ceph_msg_last_put(struct kref *kref);
-static inline void ceph_msg_put(struct ceph_msg *msg)
-{
-	kref_put(&msg->kref, ceph_msg_last_put);
-}
+extern struct ceph_msg *ceph_msg_get(struct ceph_msg *msg);
+extern void ceph_msg_put(struct ceph_msg *msg);

 extern void ceph_msg_dump(struct ceph_msg *msg);

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 1948d592aa54..8bffa5b90fef 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -3269,24 +3269,21 @@ static int ceph_con_in_msg_alloc(struct ceph_connection *con, int *skip)
 /*
  * Free a generically kmalloc'd message.
  */
-void ceph_msg_kfree(struct ceph_msg *m)
+static void ceph_msg_free(struct ceph_msg *m)
 {
-	dout("msg_kfree %p\n", m);
+	dout("%s %p\n", __func__, m);
 	ceph_kvfree(m->front.iov_base);
 	kmem_cache_free(ceph_msg_cache, m);
 }

-/*
- * Drop a msg ref.  Destroy as needed.
- */
-void ceph_msg_last_put(struct kref *kref)
+static void ceph_msg_release(struct kref *kref)
 {
 	struct ceph_msg *m = container_of(kref, struct ceph_msg, kref);
 	LIST_HEAD(data);
 	struct list_head *links;
 	struct list_head *next;

-	dout("ceph_msg_put last one on %p\n", m);
+	dout("%s %p\n", __func__, m);
 	WARN_ON(!list_empty(&m->list_head));

 	/* drop middle, data, if any */
@@ -3308,9 +3305,25 @@ void ceph_msg_last_put(struct kref *kref)
 	if (m->pool)
 		ceph_msgpool_put(m->pool, m);
 	else
-		ceph_msg_kfree(m);
+		ceph_msg_free(m);
+}
+
+struct ceph_msg *ceph_msg_get(struct ceph_msg *msg)
+{
+	dout("%s %p (was %d)\n", __func__, msg,
+	     atomic_read(&msg->kref.refcount));
+	kref_get(&msg->kref);

I suggest you do the dout() *after* you call kref_get(). I'm sure it doesn't matter in practice here, but my reasoning is that you get the reference immediately, and you'll have the reference when reading the refcount value. It also makes the dout() calls here and in ceph_msg_put() end up getting symmetrically wrapped by the kref get and put. (You have a race reading the updated refcount value either way, but it's debug code.)

My inspiration was rbd_{img,obj}_request_get(). kref_get() can't fail (may spit out a WARN though) and it is racey anyway, so I'll leave it as is for consistency.

Thanks,

                Ilya
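The get/put pattern under discussion can be sketched in userspace. The names below loosely mirror the kernel API for illustration only; this is not the kernel implementation, and the `released` flag stands in for the real release path (msgpool return or free).

```cpp
#include <atomic>
#include <cstdio>

// Minimal userspace analogue of the kref get/put pattern discussed above:
// msg_get() bumps the refcount and returns the object; msg_put() drops a
// reference and invokes the release callback when the count reaches zero.
struct msg {
    std::atomic<int> refcount{1};
    bool released = false;  // stands in for ceph_msg_free()/msgpool return
};

static void msg_release(msg *m) {  // the static release callback
    m->released = true;
}

msg *msg_get(msg *m) {
    std::printf("%s %p (was %d)\n", __func__, (void *)m, m->refcount.load());
    m->refcount.fetch_add(1);
    return m;
}

void msg_put(msg *m) {
    std::printf("%s %p (was %d)\n", __func__, (void *)m, m->refcount.load());
    if (m->refcount.fetch_sub(1) == 1)  // dropped the last reference
        msg_release(m);
}
```

As in the thread, the debug print happens before the count is adjusted, so the logged value is the count "before" the operation; either ordering is racy, which is why it only matters for debug output.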
Re: [PATCH 09/14] libceph: introduce ceph_osdc_cancel_request()
On Mon, Jul 7, 2014 at 5:47 PM, Alex Elder el...@ieee.org wrote: On 06/30/2014 09:34 AM, Ilya Dryomov wrote: On Mon, Jun 30, 2014 at 5:39 PM, Alex Elder el...@ieee.org wrote: On 06/25/2014 12:16 PM, Ilya Dryomov wrote:

Introduce ceph_osdc_cancel_request() intended for canceling requests from the higher layers (rbd and cephfs). Because higher layers are in charge and are supposed to know what and when they are canceling, the request is not completed, only unref'ed and removed from the libceph data structures.

This seems reasonable. But you make two changes here that I'd like to see a little more explanation on. They seem significant enough to warrant a little more attention in the commit description.

- __cancel_request() is no longer called when ceph_osdc_wait_request() fails due to an interrupt. This is my main concern, and I plan to work through it but I'm in a small hurry right now. Perhaps it should have been a separate commit.

__unregister_request() revokes r_request, so I opted for not trying to do it twice.

OK, that makes sense--revoking the request is basically all __cancel_request() does anyway. You ought to have mentioned that in the description anyway, even if it wasn't a separate commit.

I have added the explanation to the commit message.

As for the r_sent condition and assignment, it doesn't seem to make much of a difference, given that the request is about to be unregistered anyway. If the request is getting canceled (from a caller outside libceph) it can't take into account whether it was sent or not. As you said, it is getting revoked unconditionally by __unregister_request().

To be honest though, *that* revoke call should have been zeroing the r_sent value also. It apparently won't matter, but I think it's wrong. The revoke drops a reference, it doesn't necessarily free the structure (though I expect it may always do so anyway).

- __unregister_linger_request() is now called when a request is canceled, but it wasn't before.
(Since we're consistent about setting the r_linger flag this should be fine, but it *is* a behavior change.)

The general goal here is to make ceph_osdc_cancel_request() cancel *any* request correctly, so if r_linger is set, which means that the request in question *could* be lingering, __unregister_linger_request() is called.

The goal is good. Note that __unregister_linger_request() drops a reference to the request though. There are three other callers of this function. Two of them drop a reference that had just been added by a call to __register_request(). The other one is in ceph_osdc_unregister_linger_request(), initiated by a higher layer. In that last case, r_linger will be zeroed, so calling it again should be harmless.

Yeah, ceph_osdc_unregister_linger_request() is removed in favor of ceph_osdc_cancel_request() later in the series. r_linger is now treated as a sort of immutable field - it's never zeroed after it's been set. It's still safe to call __unregister_linger_request() at any point in time though, because unless the request is *actually* lingering it won't do a thing.

Are you OK with your Reviewed-by for this patch?

Thanks,

                Ilya
Re: [PATCH 10/14] rbd: rbd_obj_request_wait() should cancel the request if interrupted
On Mon, Jul 7, 2014 at 8:55 PM, Alex Elder el...@linaro.org wrote: On 06/25/2014 12:16 PM, Ilya Dryomov wrote: rbd_obj_request_wait() should cancel the underlying OSD request if interrupted. Otherwise libceph will hold onto it indefinitely, causing assert failures or leaking the original object request. At first I didn't understand this. Let me see if I've got it now, though. Each OSD request has a completion associated with it. An OSD request is started via ceph_osdc_start_request(), which registers the request and takes a reference to it. One can call ceph_osdc_wait_request() after the request has been successfully started. Whether the wait succeeds or not, by the time ceph_osdc_wait_request() returns the request should have been cleaned up, back to the state it was in before the start_request call. That means the request needs to be unregistered and its reference dropped, etc. Similarly, each RBD object request has a completion associated with it. It is distinct from the OSD request associated with the RBD object request because there may be more to do for RBD request to complete than just complete one object request. An RBD object request is started by a call to rbd_obj_request_submit(), and once that's successfully returned, one can wait for it to complete using a call to rbd_obj_request_wait(). And as above, that call should return state to (roughly) where it was before the submit call, whether the wait request succeeded or not. Now, RBD doesn't actually wait for its object requests to complete--all its OSD requests complete asynchronously. Instead, it relies on the OSD client to call the callback function (always rbd_osd_req_callback()) when it has completed. That function will lead to the RBD request's completion being signaled when appropriate. So... What happens when an interrupt occurs after rbd_obj_request_submit() has returned successfully? 
That function is a simple wrapper for ceph_osdc_start_request(), so a successful return means the request was mapped and put on a target's unsent list (or the OSD client's no-target list). Nobody waits for the OSD request, so an interrupt won't get the benefit of the cleanup done in ceph_osdc_wait_request(). Therefore the RBD layer needs to be responsible for doing that.

In other words, when rbd_obj_request_wait() gets an interrupt while waiting for the completion, it needs to clean up (end) the interrupted request, and rbd_obj_request_end() sounds right. And what that cleanup function should do is basically the same as what ceph_osdc_wait_request() should do in that situation, which is call ceph_osdc_cancel_request().

That's exactly right.

The only question that leaves me with is, does ceph_osdc_cancel_request() need to include the call to complete_request() that's present in ceph_osdc_wait_request()?

I don't think so - I mentioned it in the ceph_osdc_cancel_request() function comment. ceph_osdc_cancel_request() is supposed to be used by higher layers - rbd, cephfs - and exactly because their completion logic is decoupled from libceph completions (as you have brilliantly explained above) it's the higher layers who should be taking care of it. IOW higher layers are in charge and are supposed to know what and when they are cancelling.

Thanks,

                Ilya
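The control flow Alex walks through above (wait for completion; on interruption, cancel the underlying request so nothing holds onto it) can be sketched in userspace like this. All the types and names here are invented for illustration and do not correspond to the kernel code; interruption is modeled as a wait timeout.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Toy stand-ins for an in-flight OSD request and its completion.
struct request {
    std::mutex mtx;
    std::condition_variable done_cv;
    bool done = false;
    bool canceled = false;
};

// Analogue of ceph_osdc_cancel_request(): unhook the request from the
// client's data structures and drop the reference -- without completing it.
void cancel_request(request &req) {
    req.canceled = true;
}

// Analogue of the fixed rbd_obj_request_wait(): wait for the completion,
// and if the wait is interrupted (modeled here as a timeout) cancel the
// request instead of leaking it.  Returns true if the request completed.
bool wait_or_cancel(request &req, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lk(req.mtx);
    if (!req.done_cv.wait_for(lk, timeout, [&] { return req.done; })) {
        cancel_request(req);  // the fix: clean up on interruption
        return false;
    }
    return true;
}
```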
Re: [PATCH 11/14] rbd: add rbd_obj_watch_request_helper() helper
On Tue, Jul 8, 2014 at 2:36 AM, Alex Elder el...@ieee.org wrote: On 06/25/2014 12:16 PM, Ilya Dryomov wrote:

In the past, rbd_dev_header_watch_sync() used to handle both watch and unwatch requests and was entangled and leaky. Commit b30a01f2a307 (rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()) split it into two separate functions. This commit cleanly abstracts the common bits, relying on the fixed rbd_obj_request_wait().

Adding this without calling it leads to an unused function warning in the build, I'm sure. You could probably squash this into the next patch.

It used to be a single patch in the previous version of this series, but it was too hard to review even for myself, so I had to split it.

Thanks,

                Ilya
Re: [PATCH 14/14] libceph: drop osd ref when canceling con work
On Tue, Jul 8, 2014 at 2:38 AM, Alex Elder el...@ieee.org wrote: On 06/25/2014 12:16 PM, Ilya Dryomov wrote:

queue_con() bumps the osd ref count. We should do the reverse when canceling con work.

Kind of unrelated to the rest of the series, but it looks good. Good to have a same-level-of-abstraction function for it as well.

This series is really everything I stumbled upon while fixing #6628 ;)

Thanks,

                Ilya
Re: [PATCH 10/14] rbd: rbd_obj_request_wait() should cancel the request if interrupted
On 07/08/2014 06:18 AM, Ilya Dryomov wrote: The only question that leaves me with is, does ceph_osdc_cancel_request() need to include the call to complete_request() that's present in ceph_osdc_wait_request()? I don't think so - I mentioned it in the ceph_osdc_cancel_request() function comment. ceph_osdc_cancel_request() is supposed to be used by higher layers - rbd, cephfs - and exactly because their completion logic is decoupled from libceph completions (as you have brilliantly explained above) it's the higher layers who should be taking care of it. IOW higher layers are in charge and are supposed to know what and when they are cancelling.

I noticed that comment only after sending my message. RBD doesn't use the safe completion, only the FS client does, and I was pretty focused on RBD behavior while looking at this.

I was trying to conceptualize how (from the perspective of the upper layer) the safe completion differs from the normal completion. It's possible that an "I have your request" (normal completion) *also* carries with it the "your request has completed" (safe completion) indication, but the higher layer caller has no way of knowing that.

Maybe I should flip my question around and ask: why should ceph_osdc_cancel_request() include the call to complete_request()? The answer lies in details of the file system client, and I'm not in a position right now to dive into that. Whether it's called in ceph_osdc_cancel_request() or not has no effect on RBD.

Anyway, your response is fine with me, thank you.

                                        -Alex
Re: [PATCH 09/14] libceph: introduce ceph_osdc_cancel_request()
On 07/08/2014 06:15 AM, Ilya Dryomov wrote: Are you OK with your Reviewed-by for this patch?

Reviewed-by: Alex Elder el...@linaro.org
Re: Problem with slow operation on xattr
Hi Neal,

On Tue, 8 Jul 2014, n...@cs.hku.hk wrote: I use the rbd kernel module on the client end, and when we test random write performance the throughput is quite poor and always drops to zero. [...] It is always blocked in the functions get_object_context, getattr() and _setattrs, with average times of hundreds of milliseconds and maximum latencies of 4-6 seconds. [...] On the OSD node I use XFS as the underlying OSD file system, and by default it will use the extended attribute feature of XFS to store the ceph user xattrs ("_" and "snapset"). [...] I tried to disable the XFS xattr feature by setting "filestore_max_inline_xattrs_xfs = 0" and "filestore_max_inline_xattr_size_xfs = 0", so that the xattr key/value pairs are stored in omap. The maximum blocked interval drops to about 1-2 seconds, but reads that miss the page cache are still quite slow.
So I wonder, would it be a good idea to cache all xattr data in a user-space cache? For the "_" xattr, the length is just 242 bytes if we use the XFS file system. For hundreds of thousands of objects, it would cost less than 100 MB.

I would have guessed that it is not actually the XFS xattrs that are slow, but leveldb, which may be used when there are object xattrs too big to fit inside the file system's inline xattr area. Have you adjusted any of the filestore_max_inline_xattr* options from their defaults?

I don't think XFS's getxattr should be that slow. Ideally the XFS inode size is 1k or more so that the xattrs are embedded there; this normally means there is only a single read needed to load them up (if they are not already in the cache). Did your fs get created by the ceph-disk or ceph-deploy tools, or did you create those file systems manually when your cluster was created? By default, those tools create 2 KB inodes. Try running xfs_info <mountpoint> to see what the current file systems are using.

sage
Re: Cache tier READ_FORWARD transition
On Mon, 7 Jul 2014, Luis Pabón wrote: What about the following use case (please forgive some of my ceph architecture ignorance): If it were possible to set up an OSD caching tier at the host (if the host had a dedicated SSD for accelerating I/O), then caching pools could be created to cache VM rbds, since they are inherently exclusive to a single host. Using a write-through (or a read-only, depending on the workload) policy would give a major increase in VM IOPS. Using a write-through or read-only policy would also ensure any writes are first written to the back-end storage tier. Enabling hosts to service most of their VM read I/O would also increase the overall IOPS of the back-end storage tier. This could be accomplished by doing a rados pool per client host.

The rados caching only works as a writeback cache, though, not write-through, so you really need to replicate it for it to be usable in practice. So although it's possible, this isn't a particularly attractive approach.

What you're describing is really a client-side write-through cache, either for librbd or librados. We've discussed this in the past (mostly in the context of a shared host-wide read-only cache, not as write-through), but in both cases the caching would plug into the client libraries. There are some CDS notes from emperor:

http://wiki.ceph.com/Planning/Sideboard/rbd%3A_shared_read_cache
http://pad.ceph.com/p/rbd-shared-read-cache
http://www.youtube.com/watch?v=SVgBdUv_Lv4&t=70m11s

Note that you can also accomplish this with the kernel rbd driver by layering dm-cache or bcache or something similar on top and running it in write-through mode. Most clients are (KVM+)librbd, though, so eventually a userspace implementation for librbd (or maybe librados) makes sense.

sage

Does this make sense?
- Luis

On 07/07/2014 03:29 PM, Sage Weil wrote: On Mon, 7 Jul 2014, Luis Pabón wrote: Hi all, I am working on OSDMonitor.cc:5325 and wanted to confirm the following read_forward cache tier transitions:

readforward -> forward || writeback || (any, if num_objects_dirty == 0)
forward -> writeback || readforward || (any, if num_objects_dirty == 0)
writeback -> readforward || forward

Is this the correct cache tier state transition? That looks right to me. By the way, I had a thought after we spoke that we probably want something that is somewhere in between the current writeback behavior (promote on first read) and the read_forward behavior (never promote on read). I suspect a good all-around policy is something like "promote on second read"? This should probably be rolled into the writeback mode as a tunable...

sage
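The quoted transition rules can be captured in a small validity check. This is an illustrative model of the rules exactly as written in the thread (readforward and forward may move anywhere once the pool has no dirty objects; writeback may only move to readforward or forward), not the actual OSDMonitor implementation.

```cpp
#include <string>

// Model of the cache-mode transitions quoted above.  Returns true if
// switching a cache pool from mode `from` to mode `to` is allowed when
// the pool currently has `num_objects_dirty` dirty objects.
bool transition_ok(const std::string &from, const std::string &to,
                   unsigned num_objects_dirty) {
    if (from == "readforward" || from == "forward") {
        if (num_objects_dirty == 0)
            return true;  // "(any, if num_objects_dirty == 0)"
        // otherwise only the explicitly listed targets
        return to == "writeback" ||
               (from == "readforward" ? to == "forward"
                                      : to == "readforward");
    }
    if (from == "writeback")  // no "any" clause in the quoted rules
        return to == "readforward" || to == "forward";
    return false;
}
```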
RE: Intel ISA-L EC plugin
Hi Xavi,

I had the same understanding. However, the decoding routine is not always able to invert the given matrix. This can be due either to the generator matrix not preserving the MDS property, to the matrix inversion procedure, or to a bug on my side. I will carefully check my code and the input I provide. I already asked the maintainer of the library, who put me on that path, and there are also inline comments about non-invertible Vandermonde matrices in the code. However, if I switch to a Cauchy matrix I never see any problem with the code.

Cheers Andreas.

From: Xavier Hernandez [xhernan...@datalab.es]
Sent: 04 July 2014 09:43
To: Andreas Joachim Peters
Cc: Loic Dachary; ceph-devel@vger.kernel.org
Subject: Re: Intel ISA-L EC plugin

On Thursday 03 July 2014 21:24:59 Andreas Joachim Peters wrote: Hi Loic, I have chosen after benchmarking to use the LRU cache for the decoding matrix. I can easily cache all decoding matrices (worst-case scenario) for configurations up to (10,4) without thrashing it. I will restrict the (k,m) in the case of the Vandermonde matrix to the ones which are invertible; for Cauchy they are all invertible anyway.

Maybe I'm misunderstanding something because I haven't had time to analyze the EC code. AFAIK the only condition for a Vandermonde matrix to be invertible is to have all rows different (i.e. use a different base number for each row). Do you have matrices with duplicated rows? Or is there some property I don't know?

Thanks,
Xavi
RE: Intel ISA-L EC plugin
Hi Loic,

probably one should do the same trick for Jerasure. However, in the most common situation (one chunk lost), the decoding matrix is not generated anyway.

I used std::map + std::list, where the map value is a pair with the iterator pointing into the list + the matrix, and I use the splice function to move an LRU entry in the list. The key is just a string defining the decoding situation (erasures). I can also use lru.h ... it should give the same performance.

Cheers Andreas.

I have chosen after benchmarking to use the LRU cache for the decoding matrix. I can easily cache all decoding matrices (worst-case scenario) for configurations up to (10,4) without thrashing it. I will restrict the (k,m) in the case of the Vandermonde matrix to the ones which are invertible; for Cauchy they are all invertible anyway.

I will rebase to your branch wip-7238-lrc then!

Hi Andreas, From what I read in https://bitbucket.org/jimplank/jerasure/src/21de98383350e7c46e5ee329de2a93c696dee67a/src/jerasure.c?at=master#cl-167 it looks like it could benefit from an LRU cache also. Are you going to use https://github.com/ceph/ceph/blob/master/src/include/lru.h for this?

Cheers

Cheers Andreas.

From: Loic Dachary [l...@dachary.org]
Sent: 02 July 2014 20:33
To: Andreas Joachim Peters; ceph-devel@vger.kernel.org
Subject: Re: FW: Intel ISA-L EC plugin

Hi Andreas,

On 02/07/2014 19:54, Andreas Joachim Peters wrote: Hi Sage, Loic et al ... getting some support from Paul Luse, I have finished the refactoring of the EC ISA-L plug-in. The essential ISA-L v2.10 sources are now part of the source tree and it builds a single shared library which is portable on platforms with varying CPU extensions (SSE2, AVX, AVX2). I tested on various Intel and AMD processor types. The build of the plug-in is coupled to the presence of 'yasm', similar to the crc32c extension in common/ ... (I couldn't build ISA-L on ARM). It supports two encoding matrices, Vandermonde and Cauchy. The techniques are called, similar to the ones used by Loic in Jerasure, reed_sol_van and cauchy; cauchy is the default. Greg Tucker from Intel pointed me to the proper (and faster) way of decoding if parity chunks are missing.

Great!

How do we proceed? I currently rebase against firefly and use its API definition - or should this be for a later release with Loic's refactored interface? Shall I make a pull request, or shall I hand it over to Loic and he takes care of the integration including QA etc ...?

It would be great if you could rebase against https://github.com/dachary/ceph/tree/wip-7238-lrc. It contains the base class that will help us share code common to plugins. I hope it will be merged in the next few days. During the last CDS the remapping of the data chunks was agreed on, and the only reason why it is not yet merged is that integration tests must first show it does not break anything and is fully backward compatible.

Cheers

I still have an open question on the library optimization for decoding (= repair). If you call decoding for a certain set, one needs to do a matrix inversion coupled to the given set. If the payload is like 1M, the computation of the decoding matrix does not play a role. If the payload is 4k, it plays a role. Can I assume that the plugin will be called concurrently for the same object with the same set of chunks, or would the plugin be called interleaved for many objects with changing chunk configurations? Is the EC object called single-threaded or by a thread pool? Will backfill use 4k IOs or larger?

I would either commit the simple cache mechanism caching the last computed erasure configuration and corresponding matrix, or put an LRU cache for the last computed matrices. I prototyped both, but would stick to the simplest required.

Cheers Andreas.
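The structure Andreas describes - a std::map whose value pairs a list iterator with the cached matrix, plus std::list::splice to refresh recency in O(1) - can be sketched like this. The class name, the int-vector matrix type, and the capacity policy are all illustrative assumptions, not the plugin's actual code.

```cpp
#include <list>
#include <map>
#include <string>
#include <utility>
#include <vector>

// LRU cache for decoding matrices, keyed by a string describing the
// erasure configuration.  The map value pairs an iterator into the
// recency list with the cached matrix; splice() moves a hit to the
// front of the list in O(1) without invalidating any iterators.
class MatrixLru {
public:
    explicit MatrixLru(size_t capacity) : capacity_(capacity) {}

    // Return the cached matrix for `key`, or nullptr on a miss.
    const std::vector<int> *get(const std::string &key) {
        auto it = map_.find(key);
        if (it == map_.end())
            return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second.first);  // refresh
        return &it->second.second;
    }

    // Insert or update an entry, evicting the least recently used one
    // when the cache is full.
    void put(const std::string &key, std::vector<int> matrix) {
        auto it = map_.find(key);
        if (it != map_.end()) {
            it->second.second = std::move(matrix);
            lru_.splice(lru_.begin(), lru_, it->second.first);
            return;
        }
        if (map_.size() == capacity_) {  // evict from the tail
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(key);
        map_.emplace(key, std::make_pair(lru_.begin(), std::move(matrix)));
    }

    size_t size() const { return map_.size(); }

private:
    size_t capacity_;
    std::list<std::string> lru_;  // front = most recently used
    std::map<std::string,
             std::pair<std::list<std::string>::iterator,
                       std::vector<int>>> map_;
};
```

A (10,4) configuration has a bounded number of erasure patterns, so as Andreas notes the whole worst case fits in a cache of this shape without thrashing.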
--
Loïc Dachary, Artisan Logiciel Libre