Problem with slow operation on xattr

2014-07-08 Thread nyao


Dear all developers,

I use the rbd kernel module on the client end, and when we test random
write performance, the throughput is quite poor and always drops
to zero.


I traced the debug logs on the server side and found that it
is always blocked in the functions get_object_context, getattr() and
_setattrs. The average time is about hundreds of milliseconds. Even
worse, the maximum latency is up to 4-6 seconds, so the I/O observed
on the client side is often blocked for several seconds. This is
really ruining the performance of the cluster.


Therefore, I carefully analyzed those functions mentioned above
(get_object_context, getattr() and _setattrs). I cannot find any
blocking code except for the xattr system calls (fgetxattr,
fsetxattr, flistxattr).


On the OSD node, I use the xfs file system as the underlying OSD file
system. By default, it will use the extended attribute feature of
xfs to store the Ceph user xattrs ("_" and "snapset"). Since those
system calls are synchronous, I set the I/O scheduler of
the disk to deadline so that no metadata read is blocked for a
long time before it is served. However, even so, the
performance is still quite poor and those functions mentioned above
are still blocked, sometimes up to several seconds.


Therefore, I want to know how to solve this problem. Does Ceph
provide any user-space cache for xattrs?


Is this problem caused by the xfs file system and its xattr system calls?

Furthermore, I tried to disable the xfs xattr feature by setting
"filestore_max_inline_xattrs_xfs = 0" and
"filestore_max_inline_xattr_size_xfs = 0", so the xattr key/value
pairs are stored in the omap implemented by LevelDB. This helps a
bit: the maximum blocked interval drops to about 1-2 seconds.
But if the xattr is read from the physical disk rather than the page
cache, it is still quite slow.
So I wonder: is it a good idea to cache all xattr data in a
user-space cache? For the "_" xattr, the length is just 242 bytes if
we use the xfs file system, so for hundreds of thousands of objects it
would cost less than 100 MB.
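To make the proposal concrete, a minimal sketch of what such a user-space xattr cache could look like (illustrative only; XattrCache and its method names are invented here, and a real cache would also need a size bound and invalidation on object deletion):

```cpp
#include <map>
#include <string>
#include <utility>

// Illustrative sketch of the proposed user-space xattr cache (not Ceph
// code).  Keyed by (object, xattr name); on a hit the synchronous
// fgetxattr() round trip to the file system is avoided entirely.
class XattrCache {
public:
  // Returns true and fills *out on a hit; on a miss the caller falls
  // back to fgetxattr() and then calls put() with the result.
  bool get(const std::string &obj, const std::string &name,
           std::string *out) const {
    auto it = cache_.find(std::make_pair(obj, name));
    if (it == cache_.end())
      return false;
    *out = it->second;
    return true;
  }

  // Must also be called from the _setattrs path so the cache never
  // serves stale values.
  void put(const std::string &obj, const std::string &name,
           const std::string &value) {
    cache_[std::make_pair(obj, name)] = value;
  }

private:
  std::map<std::pair<std::string, std::string>, std::string> cache_;
};
```

With ~242-byte "_" values, even a few hundred thousand entries stay well under the 100 MB estimated above.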



Best Regards,
Neal Yao
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/14] libceph: move and add dout()s to ceph_msg_{get,put}()

2014-07-08 Thread Ilya Dryomov
On Mon, Jun 30, 2014 at 4:29 PM, Alex Elder el...@ieee.org wrote:
 On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Add dout()s to ceph_msg_{get,put}().  Also move them to .c and turn
 kref release callback into a static function.

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

 This is all very good.

 I have one suggestion though, below, but regardless:

 Reviewed-by: Alex Elder el...@linaro.org


 ---
  include/linux/ceph/messenger.h |   14 ++
  net/ceph/messenger.c   |   31 ++-
  2 files changed, 24 insertions(+), 21 deletions(-)

 diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
 index d21f2dba0731..40ae58e3e9db 100644
 --- a/include/linux/ceph/messenger.h
 +++ b/include/linux/ceph/messenger.h
 @@ -285,19 +285,9 @@ extern void ceph_msg_data_add_bio(struct ceph_msg *msg, struct bio *bio,

  extern struct ceph_msg *ceph_msg_new(int type, int front_len, gfp_t flags,
bool can_fail);
 -extern void ceph_msg_kfree(struct ceph_msg *m);

 -
 -static inline struct ceph_msg *ceph_msg_get(struct ceph_msg *msg)
 -{
 -	kref_get(&msg->kref);
 -	return msg;
 -}
 -extern void ceph_msg_last_put(struct kref *kref);
 -static inline void ceph_msg_put(struct ceph_msg *msg)
 -{
 -	kref_put(&msg->kref, ceph_msg_last_put);
 -}
 +extern struct ceph_msg *ceph_msg_get(struct ceph_msg *msg);
 +extern void ceph_msg_put(struct ceph_msg *msg);

  extern void ceph_msg_dump(struct ceph_msg *msg);

 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 1948d592aa54..8bffa5b90fef 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -3269,24 +3269,21 @@ static int ceph_con_in_msg_alloc(struct ceph_connection *con, int *skip)
  /*
   * Free a generically kmalloc'd message.
   */
 -void ceph_msg_kfree(struct ceph_msg *m)
 +static void ceph_msg_free(struct ceph_msg *m)
  {
 -	dout("msg_kfree %p\n", m);
 +	dout("%s %p\n", __func__, m);
 	ceph_kvfree(m->front.iov_base);
 	kmem_cache_free(ceph_msg_cache, m);
  }

 -/*
 - * Drop a msg ref.  Destroy as needed.
 - */
 -void ceph_msg_last_put(struct kref *kref)
 +static void ceph_msg_release(struct kref *kref)
  {
   struct ceph_msg *m = container_of(kref, struct ceph_msg, kref);
   LIST_HEAD(data);
   struct list_head *links;
   struct list_head *next;

 -	dout("ceph_msg_put last one on %p\n", m);
 +	dout("%s %p\n", __func__, m);
 	WARN_ON(!list_empty(&m->list_head));

   /* drop middle, data, if any */
 @@ -3308,9 +3305,25 @@ void ceph_msg_last_put(struct kref *kref)
 	if (m->pool)
 		ceph_msgpool_put(m->pool, m);
 	else
 -		ceph_msg_kfree(m);
 +		ceph_msg_free(m);
 +}
 +
 +struct ceph_msg *ceph_msg_get(struct ceph_msg *msg)
 +{
 +	dout("%s %p (was %d)\n", __func__, msg,
 +	     atomic_read(&msg->kref.refcount));
 +	kref_get(&msg->kref);

 I suggest you do the dout() *after* you call kref_get().
 I'm sure it doesn't matter in practice here, but my
 reasoning is that you get the reference immediately, and
 you'll have the reference when reading the refcount value.
 It also makes the dout() calls here and in ceph_msg_put()
 end up getting symmetrically wrapped by the kref get
 and put.  (You have a race reading the updated
 refcount value either way, but it's debug code.)

My inspiration was rbd_{img,obj}_request_get().  kref_get() can't fail
(it may spit out a WARN, though) and it is racy anyway, so I'll leave it
as is for consistency.

Thanks,

Ilya


Re: [PATCH 09/14] libceph: introduce ceph_osdc_cancel_request()

2014-07-08 Thread Ilya Dryomov
On Mon, Jul 7, 2014 at 5:47 PM, Alex Elder el...@ieee.org wrote:
 On 06/30/2014 09:34 AM, Ilya Dryomov wrote:
 On Mon, Jun 30, 2014 at 5:39 PM, Alex Elder el...@ieee.org wrote:
 On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Introduce ceph_osdc_cancel_request() intended for canceling requests
 from the higher layers (rbd and cephfs).  Because higher layers are in
 charge and are supposed to know what and when they are canceling, the
 request is not completed, only unref'ed and removed from the libceph
 data structures.

 This seems reasonable.

 But you make two changes here that I'd like to see a little
 more explanation on.  They seem significant enough to warrant
 a little more attention in the commit description.

 - __cancel_request() is no longer called when
   ceph_osdc_wait_request() fails due to an
   an interrupt.  This is my main concern, and I
   plan to work through it but I'm in a small hurry
   right now.

 Perhaps it should have been a separate commit.  __unregister_request()
 revokes r_request, so I opted for not trying to do it twice.  As for

 OK, that makes sense--revoking the request is basically all
 __cancel_request() does anyway.  You ought to have mentioned
 that in the description anyway, even if it wasn't a separate
 commit.

I have added the explanation to the commit message.


 the r_sent condition and assignment, it doesn't seem to make much of
 a difference, given that the request is about to be unregistered
 anyway.

 If the request is getting canceled (from a caller outside libceph)
 it can't take into account whether it was sent or not.  As you said,
 it is getting revoked unconditionally by __unregister_request().
 To be honest though, *that* revoke call should have been zeroing
 the r_sent value also.  It apparently won't matter, but I think it's
 wrong.  The revoke drops a reference, it doesn't necessarily free
 the structure (though I expect it may always do so anyway).

 - __unregister_linger_request() is now called when
   a request is canceled, but it wasn't before.  (Since
   we're consistent about setting the r_linger flag
   this should be fine, but it *is* a behavior change.)

 The general goal here is to make ceph_osdc_cancel_request() cancel
 *any* request correctly, so if r_linger is set, which means that the
 request in question *could* be lingering, __unregister_linger_request()
 is called.

 The goal is good.  Note that __unregister_linger_request() drops
 a reference to the request though.  There are three other callers
 of this function.  Two of them drop a reference that had just been
 added by a call to __register_request().  The other one is in
 ceph_osdc_unregister_linger_request(), initiated by a higher layer.
 In that last case, r_linger will be zeroed, so calling it again
 should be harmless.

Yeah, ceph_osdc_unregister_linger_request() is removed in favor of
ceph_osdc_cancel_request() later in the series.  r_linger is now
treated as a sort of immutable field - it's never zeroed after it's
been set.  It's still safe to call __unregister_linger_request()
at any point in time though, because unless the request is *actually*
lingering it won't do a thing.

Are you OK with your Reviewed-by for this patch?

Thanks,

Ilya


Re: [PATCH 10/14] rbd: rbd_obj_request_wait() should cancel the request if interrupted

2014-07-08 Thread Ilya Dryomov
On Mon, Jul 7, 2014 at 8:55 PM, Alex Elder el...@linaro.org wrote:
 On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 rbd_obj_request_wait() should cancel the underlying OSD request if
 interrupted.  Otherwise libceph will hold onto it indefinitely, causing
 assert failures or leaking the original object request.

 At first I didn't understand this.  Let me see if I've got
 it now, though.

 Each OSD request has a completion associated with it.  An
 OSD request is started via ceph_osdc_start_request(), which
 registers the request and takes a reference to it.  One can
 call ceph_osdc_wait_request() after the request has been
 successfully started.  Whether the wait succeeds or not,
 by the time ceph_osdc_wait_request() returns the request
 should have been cleaned up, back to the state it was
 in before the start_request call.  That means the request
 needs to be unregistered and its reference dropped, etc.

 Similarly, each RBD object request has a completion associated
 with it.  It is distinct from the OSD request associated
 with the RBD object request because there may be more to do
 for RBD request to complete than just complete one object
 request.  An RBD object request is started by a call to
 rbd_obj_request_submit(), and once that's successfully
 returned, one can wait for it to complete using a call to
 rbd_obj_request_wait().  And as above, that call should
 return state to (roughly) where it was before the submit
 call, whether the wait request succeeded or not.

 Now, RBD doesn't actually wait for its object requests
 to complete--all its OSD requests complete asynchronously.
 Instead, it relies on the OSD client to call the callback
 function (always rbd_osd_req_callback()) when it has
 completed.  That function will lead to the RBD request's
 completion being signaled when appropriate.

 So...  What happens when an interrupt occurs after
 rbd_obj_request_submit() has returned successfully?  That
 function is a simple wrapper for ceph_osdc_start_request(),
 so a successful return means the request was mapped and
 put on a target's unsent list (or the OSD client's no
 target list).  Nobody waits for the OSD request, so an
 interrupt won't get the benefit of the cleanup done in
 ceph_osdc_wait_request().  Therefore the RBD layer needs
 to be responsible for doing that.

 In other words, when rbd_obj_request_wait() gets an
 interrupt while waiting for the completion, it needs
 to clean up (end) the interrupted request, and
 rbd_obj_request_end() sounds right.  And what that
 cleanup function should do is basically the same
 as what ceph_osdc_wait_request() should do in that
 situation, which is call ceph_osdc_cancel_request().

That's exactly right.


 The only question that leaves me with is, does
 ceph_osdc_cancel_request() need to include the
 call to complete_request() that's present in
 ceph_osdc_wait_request()?

I don't think so - I mentioned it in the ceph_osdc_cancel_request()
function comment.  ceph_osdc_cancel_request() is supposed to be used by
higher layers - rbd, cephfs - and exactly because their completion
logic is decoupled from libceph completions (as you have brilliantly
explained above) it's the higher layers who should be taking care of
it.  IOW higher layers are in charge and are supposed to know what and
when they are cancelling.

Thanks,

Ilya


Re: [PATCH 11/14] rbd: add rbd_obj_watch_request_helper() helper

2014-07-08 Thread Ilya Dryomov
On Tue, Jul 8, 2014 at 2:36 AM, Alex Elder el...@ieee.org wrote:
 On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 In the past, rbd_dev_header_watch_sync() used to handle both watch and
 unwatch requests and was entangled and leaky.  Commit b30a01f2a307
 (rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync())
 split it into two separate functions.  This commit cleanly abstracts
 the common bits, relying on the fixed rbd_obj_request_wait().

 Adding this without calling it leads to an unused function
 warning in the build, I'm sure.

 You could probably squash this into the next patch.

It used to be a single patch in the previous version of this series,
but it was too hard to review even for myself, so I had to split it.

Thanks,

Ilya


Re: [PATCH 14/14] libceph: drop osd ref when canceling con work

2014-07-08 Thread Ilya Dryomov
On Tue, Jul 8, 2014 at 2:38 AM, Alex Elder el...@ieee.org wrote:
 On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 queue_con() bumps osd ref count.  We should do the reverse when
 canceling con work.

 Kind of unrelated to the rest of the series, but it looks
 good.  Good to have a same-level-of-abstraction function
 for it as well.

This series is really everything I stumbled upon while fixing #6628 ;)

Thanks,

Ilya


Re: [PATCH 10/14] rbd: rbd_obj_request_wait() should cancel the request if interrupted

2014-07-08 Thread Alex Elder
On 07/08/2014 06:18 AM, Ilya Dryomov wrote:
  The only question that leaves me with is, does
  ceph_osdc_cancel_request() need to include the
  call to complete_request() that's present in
  ceph_osdc_wait_request()?
 I don't think so - I mentioned it in the ceph_osdc_cancel_request()
 function comment.  ceph_osdc_cancel_request() is supposed to be used by
 higher layers - rbd, cephfs - and exactly because their completion
 logic is decoupled from libceph completions (as you have brilliantly
 explained above) it's the higher layers who should be taking care of
 it.  IOW higher layers are in charge and are supposed to know what and
 when they are cancelling.

I noticed that comment only after sending my message.

RBD doesn't use the safe completion, only the FS client
does, and I was pretty focused on RBD behavior while
looking at this.  I was trying to conceptualize how
(from the perspective of the upper layer) the safe
completion differs from the normal completion.

It's possible that an "I have your request" (normal
completion) indication *also* carries with it the "your
request has completed" (safe completion) indication, but
the higher layer caller has no way of knowing that.

Maybe I should flip my question around, and ask, why
should the ceph_osdc_cancel_request() include the call
to complete_request()?

The answer lies in details of the file system client,
and I'm not in a position right now to dive into that.
Whether it's called in ceph_osdc_cancel_request() or
not has no effect on RBD.

Anyway, your response is fine with me, thank you.

-Alex


Re: [PATCH 09/14] libceph: introduce ceph_osdc_cancel_request()

2014-07-08 Thread Alex Elder
On 07/08/2014 06:15 AM, Ilya Dryomov wrote:
 Are you OK with your Reviewed-by for this patch?

Reviewed-by: Alex Elder el...@linaro.org



Re: Problem with slow operation on xattr

2014-07-08 Thread Sage Weil
Hi Neal,

On Tue, 8 Jul 2014, n...@cs.hku.hk wrote:
 Dear all developers,
 
 I use the rbd kernel module on the client-end, and when we test the random
 write performance. The throughput is quit poor and always drops to zero.
 
 And I trace the development logs on the server-side and find that it is always
 blocked in the function: get_object_context, getattr() and _setattrs. The
 average time os about hundreds of milliseconds. Even bad, the maximum latency
 is up to 4-6 seconds, so the throughput observed on the client-side is always
 blocked several seconds. This is really ruining the performance of the
 cluster.
 
 Therefore, I carefully analyze those functions mentioned above
 (get_object_context, getattr() and _setattrs). I cannot find any blocked code
 except for the system calls for xattr like (fgetattr, fsetattr, flistattr).
 
 On the OSD node, I use the xfs file system as the underlying osd file system.
 And by default, it will use the extend attribute feature of the xfs to store
 ceph.user xattr ("_" and "snapset"). Since those system calls are
 synchronized function call, I set the io-scheduler of the disk to [Deadline]
 so that no reading meta-data will be blocked a long time before it will be
 served. However, even though, the performance is still quite poor and those
 functions mentioned above are still blocked, sometimes, up to several seconds.
 
 Therefore, I wanna know that how to solve this problem, does ceph provide any
 user-space cache for xattr?
 
 Does this problem caused by xfs file-system, its xattr system calls?

 Furthermore, I try to stop the feature of xfs xattr by setting
 "filestore_max_inline_xattrs_xfs = 0" 
 "filestore_max_inline_xattr_size_xfs = 0". So the xattr key/value pair will
 be stored in omap implemented by LevelDB. It solves the problem a bit, the
 maximum blocked interval drops to about 1-2 second. But if the xattr read from
 the physical disk not the page cache, it still quite slow.
 So I wonder that is it a good idea to cache all xattr data in use_space cache
 as for xattr, "_", the length is just 242 bytes if we use xfs file-system?
 For hundred thousands of Objects, it will cost just less than 100MB.

I would have guessed that it is not actually the XFS xattrs that are slow, 
but leveldb, which may be used when there are objects whose xattrs are too 
big to fit inside the file system's xattr.  Have you adjusted any of the 
filestore_max_inline_xattr* options from their defaults?  I don't think 
XFS's getxattr should be that slow.

Ideally the XFS inode size is 1k or more so that the xattrs are embedded 
there; this normally means there is only a single read needed to load 
them up (if they are not already in the cache).  Did your fs get created 
by the ceph-disk or ceph-deploy tools, or did you create those file 
systems manually when your cluster was created?  By default, those tools 
create 2 KB inodes.  Try running xfs_info on the mountpoint to see what 
the current file systems are using.
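Concretely, checking and creating such file systems might look like this (illustrative commands; the device and mountpoint are assumptions for your setup):

```shell
# Check the inode size of an existing OSD file system (look at "isize="):
xfs_info /var/lib/ceph/osd/ceph-0

# When (re)creating an OSD file system by hand, match what ceph-disk does
# and use 2 KB inodes so the xattrs stay inline in the inode:
mkfs.xfs -i size=2048 /dev/sdb1
```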

sage


Re: Cache tier READ_FORWARD transition

2014-07-08 Thread Sage Weil
On Mon, 7 Jul 2014, Luis Pabón wrote:
 What about the following use case (please forgive some of my ceph architecture
 ignorance):
 
 If it were possible to set up an OSD caching tier at the host (if the host had a
 dedicated SSD for accelerating I/O), then caching pools could be created to
 cache VM rbds, since they are inherently exclusive to a single host.  Using a
 write-through (or a read-only, depending on the workload) policy would give a
 major increase in VM IOPS.   Using a write-through or read-only policy would also
 ensure any writes are first written to the back-end storage tier.  Enabling
 hosts to service most of their VM I/O reads would also increase the overall
 IOPS of the back-end storage tier.

This could be accomplished by doing a rados pool per client host.  The 
rados caching only works as a writeback cache, though, not 
write-through, so you really need to replicate it for it to be usable in 
practice.  So although it's possible, this isn't a particularly attractive 
approach.

What you're describing is really a client-side write-through cache, either 
for librbd or librados.  We've discussed this in the past (mostly in the 
context of a shared host-wide read-only cache, not write-through), but 
in both cases the caching would plug into the client libraries.  There are 
some CDS notes from emperor:

http://wiki.ceph.com/Planning/Sideboard/rbd%3A_shared_read_cache
http://pad.ceph.com/p/rbd-shared-read-cache
http://www.youtube.com/watch?v=SVgBdUv_Lv4t=70m11s

Note that you can also accomplish this with the kernel rbd driver by 
layering dm-cache or bcache or something similar on top and running it in 
write-through mode.  Most clients are (KVM+)librbd, though, so eventually 
a userspace implementation for librbd (or maybe librados) makes sense.
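For the kernel rbd case, the bcache variant of that layering might look roughly like this (illustrative commands only; the device names and cache-set UUID are placeholders for your setup):

```shell
# Format the SSD as a cache device and the rbd block device as backing:
make-bcache -C /dev/nvme0n1p1
make-bcache -B /dev/rbd0

# Attach the backing device to the cache set (UUID from bcache-super-show):
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

# Run it write-through so every write also reaches the Ceph cluster:
echo writethrough > /sys/block/bcache0/bcache/cache_mode
```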

sage


 Does this make sense?
 
 - Luis
 
 On 07/07/2014 03:29 PM, Sage Weil wrote:
  On Mon, 7 Jul 2014, Luis Pabon wrote:
   Hi all,
    I am working on OSDMonitor.cc:5325 and wanted to confirm the following
    read_forward cache tier transitions:
   
      readforward -> forward || writeback || (any && num_objects_dirty == 0)
      forward -> writeback || readforward || (any && num_objects_dirty == 0)
      writeback -> readforward || forward
   
    Is this the correct cache tier state transition?
  That looks right to me.
  
  By the way, I had a thought after we spoke that we probably want something
  that is somewhere in between the current writeback behavior (promote on
  first read) and the read_forward behavior (never promote on read).  I
  suspect a good all-around policy is something like promote on second read.
  This should probably be rolled into the writeback mode as a tunable...
  
  sage
  
  


RE: Intel ISA-L EC plugin

2014-07-08 Thread Andreas Joachim Peters
Hi Xavi, 
I had the same understanding. However, the decoding routine is not always able 
to invert the given matrix. This can be due to the generator matrix not 
preserving the MDS property, the matrix inversion procedure, or a bug on my 
side. I will check my code and the input I provide carefully. I already asked 
the maintainer of the library, who put me on that path, and there are also 
inline comments about non-invertible Vandermonde matrices in the code. 
However, if I switch to a Cauchy matrix I never see any problem with the code.

Cheers Andreas.


From: Xavier Hernandez [xhernan...@datalab.es]
Sent: 04 July 2014 09:43
To: Andreas Joachim Peters
Cc: Loic Dachary; ceph-devel@vger.kernel.org
Subject: Re: Intel ISA-L EC plugin

On Thursday 03 July 2014 21:24:59 Andreas Joachim Peters wrote:
 Hi Loic,

 I have chosen after benchmarking to use the LRU cache for the decoding
 matrix. I can cache easily all decoding matrices (worst case scenario) for
 configurations up to (10,4) without trashing it. I will restrict the (k,m)
 in case of the vandermonde matrix to the ones which are invertible, for
 cauchy they are all invertible anyway.

Maybe I'm misunderstanding something because I haven't had time to analyze the
EC code.

AFAIK the only condition for a Vandermonde matrix to be invertible is to have
all rows different (i.e. use a different base number for each row). Do you
have matrices with duplicated rows, or is there some property I don't know
about?

Thanks,

Xavi


RE: Intel ISA-L EC plugin

2014-07-08 Thread Andreas Joachim Peters
Hi Loic, 
probably one should do the same trick for Jerasure. However, in the most common 
situation (one lost chunk), the decoding matrix is not generated anyway.

I used std::map + std::list, where the map value is a pair of the iterator 
pointing into the list and the matrix, and I use the splice function to move an 
LRU entry in the list. The key is just a string describing the decoding 
situation (the erasures). I can also use lru.h ... it should give the same 
performance.
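
A minimal sketch of the std::map + std::list LRU described above (illustrative, not the actual plugin code; a decoding matrix is abbreviated here as a std::vector<int>):

```cpp
#include <list>
#include <map>
#include <string>
#include <utility>
#include <vector>

// The list holds keys in recency order (front = most recently used);
// the map value pairs the list iterator with the cached decoding
// matrix, and std::list::splice moves an entry to the front in O(1)
// on a hit.  Capacity must be > 0.
class DecodingMatrixCache {
public:
  explicit DecodingMatrixCache(size_t capacity) : capacity_(capacity) {}

  // Key is a string describing the erasure pattern, e.g. "0,3".
  // Returns nullptr on a miss; on a hit, refreshes the entry.
  const std::vector<int> *find(const std::string &erasures) {
    auto it = map_.find(erasures);
    if (it == map_.end())
      return nullptr;
    lru_.splice(lru_.begin(), lru_, it->second.first);
    return &it->second.second;
  }

  void insert(const std::string &erasures, std::vector<int> matrix) {
    if (find(erasures))
      return;                         // already cached (and now fresh)
    if (map_.size() == capacity_) {   // evict least recently used
      map_.erase(lru_.back());
      lru_.pop_back();
    }
    lru_.push_front(erasures);
    map_[erasures] = std::make_pair(lru_.begin(), std::move(matrix));
  }

  size_t size() const { return map_.size(); }

private:
  size_t capacity_;
  std::list<std::string> lru_;
  std::map<std::string,
           std::pair<std::list<std::string>::iterator,
                     std::vector<int>>> map_;
};
```

Map and list iterators stay valid across splice, so the cached matrices are never copied on a hit.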

Cheers Andreas.


 I have chosen after benchmarking to use the LRU cache for the decoding 
 matrix. I can cache easily all decoding matrices (worst case scenario) for 
 configurations up to (10,4) without trashing it. I will restrict the (k,m) in 
 case of the vandermonde matrix to the ones which are invertible, for cauchy 
 they are all invertible anyway.

 I will rebase to your branch wip-7238-lrc then!

Hi Andreas,

From what I read in 
https://bitbucket.org/jimplank/jerasure/src/21de98383350e7c46e5ee329de2a93c696dee67a/src/jerasure.c?at=master#cl-167
it looks like it could benefit from an LRU cache also. Are you going to use 
https://github.com/ceph/ceph/blob/master/src/include/lru.h for this?

Cheers

 Cheers Andreas.


 
 From: Loic Dachary [l...@dachary.org]
 Sent: 02 July 2014 20:33
 To: Andreas Joachim Peters; ceph-devel@vger.kernel.org
 Subject: Re: FW: Intel ISA-L EC plugin

 Hi Andreas,

 On 02/07/2014 19:54, Andreas Joachim Peters wrote: Hi Sage & Loic et al ...
 getting some support from Paul Luse I have finished the refactoring of the 
 EC ISA-L plug-in.


 The essential ISA-L v 2.10 sources are now part of the source tree and it 
 builds a single shared library which is portable on platforms with varying 
 CPU extensions (SSE2, AVX, AVX2). I tested on various Intel & AMD processor 
 types.

 The build of the plug-in is coupled to the presence of 'yasm', similar to 
 the crc32c extension in common/ ... (I couldn't build ISA-L on ARM).

 It supports two encoding matrices, Vandermonde & Cauchy. The techniques are 
 named similarly to the ones used by Loic in Jerasure: reed_sol_van & 
 cauchy. cauchy is the default.

 Greg Tucker from Intel pointed me to the proper ( and faster ) way of 
 decoding if parity chunks are missing.

 Great !

 How do we proceed? I currently rebase against firefly and use its API 
 definition, or should this be for a later release with Loic's refactored 
 interface? Shall I make a pull request, or shall I hand it over to Loic so 
 he takes care of the integration including QA etc.?

 It would be great if you could rebase against 
 https://github.com/dachary/ceph/tree/wip-7238-lrc. It contains the base class 
 that will help us share code common to plugins. I hope it will be merged in 
 the next few days. During the last CDS the remapping of the data chunks has 
 been agreed on and the only reason why it is not yet merged is that 
 integration tests must first show it does not break anything and is fully 
 backward compatible.

 Cheers

 I still have an open question on the library optimization for 
 decoding (= repair). If you call decoding for a certain set, one needs to do 
 a matrix inversion coupled to the given set. If the payload is like 1M, the 
 computation of the decoding matrix does not play a role. If the payload is 
 4k, it plays a role. Can I assume that the plugin will be called concurrently 
 for the same object with the same set of chunks, or would the plugin be 
 called interleaved for many objects with changing chunk configurations? Is 
 the EC object called single-threaded or by a thread pool? Will backfill use 
 4k IOs or larger?

 I would either commit the simple cache mechanism caching the last computed 
 erasure configuration & corresponding matrix, or put in an LRU cache for the 
 last computed matrices. I prototyped both, but would stick to the simplest 
 required.

 Cheers Andreas.


 --
 Loïc Dachary, Artisan Logiciel Libre


