Re: [PATCHv2 0/3] rbd: header read/refresh improvements

2015-04-26 Thread Alex Elder

On 04/24/2015 08:22 AM, Douglas Fuller wrote:

Support multiple class op calls in one ceph_msg and consolidate rbd header
read and refresh processes to use this feature to reduce the number of
ceph_msgs sent for that process. Refresh features on header refresh and
begin returning EIO if features have changed since mapping.


I think I understand the motivation for what you're doing here.
You ultimately want to pack more ops into a message, which
would be really great.  Ideally you could pack some arbitrary
number (much larger than 3), but it seems to me there'll be
some restructuring required to get there.

I am having a hard time understanding your intended
implementation though.  I've looked mainly at your first
patch, and scanned the other two.  Let's see if I have
it right.

You are adding another hunk of data to a class request,
and using it as an intermediate buffer for receiving
data.

Currently a class request has three hunks:
- request_info, a pagelist data item, which contains the
  class name and method name, and perhaps someday, an
  argument list (?).
- request_data, which can contain any number of data
  items of any type, and constitutes outgoing data for
  a write request.
- response_data, also any number of any type of data
  items, which constitutes space into which incoming
  data is placed.

You are adding a fourth:
- chain_data, which is a page vector big enough to hold
  all incoming response data.
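For concreteness, here is roughly how I picture the class-op branch of
struct ceph_osd_req_op after your change (field list from memory of
include/linux/ceph/osd_client.h, so take the exact layout as approximate;
chain_data is the addition from patch 1):

struct {
        const char *class_name;
        const char *method_name;
        struct ceph_osd_data request_info;   /* class + method names (pagelist) */
        struct ceph_osd_data request_data;   /* outbound call data, if any */
        struct ceph_osd_data response_data;  /* caller-provided reply buffers */
        struct ceph_osd_data chain_data;     /* new: page vector staging all replies */
        __u8 class_len;
        __u8 method_len;
        __u8 argc;
} cls;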

When building up an OSD request, we call osd_req_encode_op()
for every op in the request.  For a class op, that records
the length of the class and method names in the op structure,
and then appends those names to the outbound request_data.
Following that, the real request data (if any) provided
with the class op is appended to request_data.

Next (before your patches) whatever buffer space was provided
(as a page array, pagelist, or bio) to receive response data
is appended to response_data.  Prior to this, the class
op will have been set up with adequate buffers to directly
receive the expected data.

What your first patch seems to do is circumvent this last
step, and instead of using the prearranged buffers, you
allocate another page vector, big enough to hold the
combined size of all response_data data items, and store
that in the new chain_data field of an OSD request class op.
(Note that this could be in *any* of the ops in an OSD
request, though I suspect it's always the first.)

When a response arrives, if the *first* op in the request
array is CEPH_OSD_OP_CALL, you split the data, which will
have been deposited in the chain_data page vector.  That
process involves allocating another buffer sufficient to
hold the entirety of the received data.  You copy the
received data into that buffer and release the chain_data
page vector.  And then you copy the data from this new
buffer back into the original response array, and then
free the buffer.
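Just to make the cost concrete, the end result is essentially a
bounce-buffer copy between page vectors, along these lines (illustrative
only -- this is not the code from your patch, and it glosses over
per-item offsets and the split of the destination across multiple
response_data items):

/* needs linux/slab.h, linux/highmem.h, linux/kernel.h */
static int copy_via_bounce_buffer(struct page **src, struct page **dst,
                                  size_t len)
{
        void *buf = kmalloc(len, GFP_NOIO);
        size_t off;

        if (!buf)
                return -ENOMEM;

        /* extra copy #1: linearize everything received into one buffer */
        for (off = 0; off < len; off += PAGE_SIZE) {
                size_t l = min_t(size_t, PAGE_SIZE, len - off);
                void *p = kmap_atomic(src[off >> PAGE_SHIFT]);

                memcpy(buf + off, p, l);
                kunmap_atomic(p);
        }

        /* extra copy #2: scatter it back out to the original reply pages */
        for (off = 0; off < len; off += PAGE_SIZE) {
                size_t l = min_t(size_t, PAGE_SIZE, len - off);
                void *p = kmap_atomic(dst[off >> PAGE_SHIFT]);

                memcpy(p, buf + off, l);
                kunmap_atomic(p);
        }

        kfree(buf);
        return 0;
}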

This sounds pretty expensive.  For every class operation
you are copying the received data two extra times.

It's possible I'm not understanding what you're doing here
and that's why I'm writing this now.  Before I provide
any more specific commentary on your patches, I want to
be sure I understand what you're trying to do.

If my understanding *is* correct, I would say you're
going about this the wrong way, and I may have some
suggestions for a better approach.

Will you please correct me where I'm wrong above, and
maybe give a little better high-level explanation of
what you're trying to do?  I see in patch 1 you mention
a problem with short reads, and there may be a simpler
fix than that (to begin with).  But beyond that, if
you're trying to incorporate more ops in a message,
there may be better ways to go about that as well.

Thanks.

-Alex


v2: Edit history and address comments from Mike Christie.

Douglas Fuller (3):
   ceph: support multiple class method calls in one ceph_msg
   rbd: combine object method calls in header refresh using fewer
 ceph_msgs
   rbd: re-read features during header refresh and detect changes.

  drivers/block/rbd.c | 512 +---
  include/linux/ceph/osd_client.h |   3 +-
  net/ceph/messenger.c|   4 +
  net/ceph/osd_client.c   |  90 ++-
  4 files changed, 462 insertions(+), 147 deletions(-)





Re: [PATCHv2 0/3] rbd: header read/refresh improvements

2015-04-26 Thread Ilya Dryomov
On Sun, Apr 26, 2015 at 9:29 AM, Alex Elder el...@ieee.org wrote:
 On 04/24/2015 08:22 AM, Douglas Fuller wrote:

 Support multiple class op calls in one ceph_msg and consolidate rbd header
 read and refresh processes to use this feature to reduce the number of
 ceph_msgs sent for that process. Refresh features on header refresh and
 begin returning EIO if features have changed since mapping.


 [...]

 Will you please correct me where I'm wrong above, and
 maybe give a little better high-level explanation of
 what you're trying to do?  I see in patch 1 you mention
 a problem with short reads, and there may be a simpler
 fix than that (to begin with).  But beyond that, if
 you're trying to incorporate more ops in a message,
 there may be better ways to go about that as well.

Yeah, the only objective of this was to pack more call ops into an
osd_request in order to reduce the number of network round trips during rbd map
and refresh.  The fundamental problem the first patch is trying to work
around is the first ceph_msg_data item consuming all reply buffers
instead of just its own.  We have to preallocate response buffers and
if the preallocated response buffer for the first op is large enough
(it always is) it will consume the entire ceph_msg, along with replies
to other ops.
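To make that concrete, what the rbd map/refresh path wants to build is
roughly this (a sketch with placeholder variables; the exact prototypes
of the osd_req_op_cls_* helpers are from memory):

osd_req_op_cls_init(req, 0, CEPH_OSD_OP_CALL, "rbd", "get_size");
osd_req_op_cls_response_data_pages(req, 0, size_pages, size_len,
                                   0, false, false);

osd_req_op_cls_init(req, 1, CEPH_OSD_OP_CALL, "rbd", "get_features");
osd_req_op_cls_response_data_pages(req, 1, feat_pages, feat_len,
                                   0, false, false);

On the receive side the messenger just fills the message's data items in
order, with no notion of per-op boundaries, so the generously sized
buffer attached to op 0 soaks up op 1's reply bytes as well and the
second buffer never sees its data.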

My understanding is that the first patch is supposed to be a specific
workaround for just that - I think it's noted somewhere that it will
break on reads combined with call ops or similar.

I too have my efficiency concerns and I raised them in one of the
standups but the argument was that this is only going to be used for
header init/refresh, not copyup, so the overhead is negligible.  I'm
still not sold on the copying everything twice though and was going to
try to see if there is a simple way to special case this even more and
make the diffstat smaller.

Thanks,

Ilya

Re: [PATCHv2 0/3] rbd: header read/refresh improvements

2015-04-26 Thread Douglas Fuller
Alex,

I think you are correct in both your understanding and your impression of the 
approach.

 On Apr 26, 2015, at 4:44 AM, Ilya Dryomov idryo...@gmail.com wrote:
 
 On Sun, Apr 26, 2015 at 9:29 AM, Alex Elder el...@ieee.org wrote:
 On 04/24/2015 08:22 AM, Douglas Fuller wrote:
 
 Support multiple class op calls in one ceph_msg and consolidate rbd header
 read and refresh processes to use this feature to reduce the number of
 ceph_msgs sent for that process. Refresh features on header refresh and
 begin returning EIO if features have changed since mapping.
 
 This sounds pretty expensive.  For every class operation
 you are copying the received data two extra times.

I’d really like a solution to this, but I don’t think one is available given 
the constraints.

 Will you please correct me where I'm wrong above, and
 maybe give a little better high-level explanation of
 what you're trying to do?  I see in patch 1 you mention
 a problem with short reads, and there may be a simpler
 fix than that (to begin with).  But beyond that, if
 you're trying to incorporate more ops in a message,
 there may be better ways to go about that as well.
 
 Yeah, the only objective of this was to pack more call ops into an
 osd_request in order to reduce the number of network round trips during rbd map
 and refresh.  The fundamental problem the first patch is trying to work
 around is the first ceph_msg_data item consuming all reply buffers
 instead of just its own.  We have to preallocate response buffers and
 if the preallocated response buffer for the first op is large enough
 (it always is) it will consume the entire ceph_msg, along with replies
 to other ops.
 
 My understanding is that the first patch is supposed to be a specific
 workaround for just that - I think it's noted somewhere that it will
 break on reads combined with call ops or similar.

That’s correct. This patch only works around that short response issue in this 
specific corner case. It does not fix cases where call ops returning data could 
be arbitrarily combined with reads or writes (and apparently they never have 
been, because that would not work). It should not introduce additional cases of 
breakage, though.

I was told that the preferred way to proceed for now was to avoid changing the 
osd_client interface and to handle this case in osd_client instead of the 
messaging layer. Given those constraints, it was agreed in the standups and on 
#ceph-devel that it should be done this way.

We can’t know the actual response sizes until they are decoded in osd_client. 
Therefore, we can’t copy them directly to the user buffers off the wire. That 
costs us one extra copy. The other copy is required because pagevec.c does not 
provide an interface for copying arbitrary data between two page vectors.
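For what it's worth, the helper that's missing would look roughly like
this (purely hypothetical -- nothing of the sort exists in pagevec.c
today, and it ignores non-zero page offsets):

static void copy_page_vector_to_page_vector(struct page **dst,
                                            struct page **src, size_t len)
{
        size_t off = 0;

        while (off < len) {
                size_t l = min_t(size_t, PAGE_SIZE, len - off);
                void *s = kmap_atomic(src[off >> PAGE_SHIFT]);
                void *d = kmap_atomic(dst[off >> PAGE_SHIFT]);

                memcpy(d, s, l);
                kunmap_atomic(d);
                kunmap_atomic(s);
                off += l;
        }
}

With something like that available, the intermediate linear buffer (and
one of the two extra copies) could go away.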

 I too have my efficiency concerns and I raised them in one of the
 standups but the argument was that this is only going to be used for
 header init/refresh, not copyup, so the overhead is negligible.  I'm
 still not sold on the copying everything twice though and was going to
 try to see if there is a simple way to special case this even more and
 make the diffstat smaller.

You and I agreed in that particular standup discussion; I don’t like it, 
either. I would prefer a general-case solution that avoids introducing so much 
extra overhead, especially for such a small amount of data (really just a few 
bytes).

If there’s a way to handle this with a better method, I’m all ears. I had 
thought of taking advantage of the fact that the sum total of all this data 
will never exceed a single page, but that seems to me to require working around 
even more functionality for what is, essentially, a corner case.

-Doug




Re: [PATCHv2 0/3] rbd: header read/refresh improvements

2015-04-26 Thread Douglas Fuller

 I’d really like a solution to this, but I don’t think one is
 available given the constraints.
 
 Now that I know I understand it, I will at least provide
 some review comments on the patches.  I'll also think about
 it a little.  My instinct says there should be no need to
 make the copies, but the devil will be in the details.

Yeah, it seems like it should be simpler. To fix it, I think we would have to 
either commit a layering violation (decode the call ops’ response sizes in the 
message receive flow) or change the interface to osd_client (don’t provide 
preallocated pagevecs for call response data even though that’s what’s done for 
other types of ops). I’m the new guy, though, so it’s certainly possible I’m 
missing something.

 I was told that the preferred way to proceed for now was to avoid
 changing the osd_client interface and to handle this case in
 osd_client instead of the messaging layer. Given those constraints,
 it was agreed in the standups and on #ceph-devel that it should be
 done this way.
 
 Sorry, I haven't been keeping up with all the activity on the
 mailing list.  I pay closest attention to patches to the kernel
 code.

Well, this was worked out mainly in the standups and on IRC. There wasn’t a 
mailing list discussion on it. Having said that, the attention to the patch is 
certainly appreciated.

 OK, well I'll go look at that code path again to see if I can
 come up with any bright ideas.

Thanks very much.

-Doug


Re: [PATCHv2 0/3] rbd: header read/refresh improvements

2015-04-26 Thread Alex Elder

On 04/26/2015 11:35 PM, Douglas Fuller wrote:

This solution just feels hacky and inefficient, so I think there is a
desire to feel like we at least tried to come up with something simpler
and more efficient before proceeding.


Right.  And somehow more fitting with the existing code, if
that's possible.  There are a few things in this proposal
that are just special case code, and it would be nice to
avoid that.

I'm sorry I've been too busy today to get back to this.  I spent
a lot of time on Saturday getting ramped back up on this code
so I could provide a decent review...

I'm going to at least look at the code (error on a read op)
before going to bed, so I at least know what the problem is.
I'll try to have something to say (even if it's "I've got
nothing") in the next day or two.

-Alex


Re: [PATCHv2 0/3] rbd: header read/refresh improvements

2015-04-26 Thread Sage Weil
On Sun, 26 Apr 2015, Douglas Fuller wrote:
  I'd really like a solution to this, but I don't think one is
  available given the constraints.
  
  Now that I know I understand it, I will at least provide
  some review comments on the patches.  I'll also think about
  it a little.  My instinct says there should be no need to
  make the copies, but the devil will be in the details.
 
 Yeah, it seems like it should be simpler. To fix it, I think we would have to 
 either commit a layering violation (decode the call ops' response sizes in 
 the message receive flow) or change the interface to osd_client (don't 
 provide preallocated pagevecs for call response data even though that's 
 what's done for other types of ops). I'm the new guy, though, so it's 
 certainly possible I'm missing something.
 
  I was told that the preferred way to proceed for now was to avoid
  changing the osd_client interface and to handle this case in
  osd_client instead of the messaging layer. Given those constraints,
  it was agreed in the standups and on #ceph-devel that it should be
  done this way.
  
  Sorry, I haven't been keeping up with all the activity on the
  mailing list.  I pay closest attention to patches to the kernel
  code.
 
 Well, this was worked out mainly in the standups and on IRC. There wasn't a 
 mailing list discussion on it. Having said that, the attention to the patch 
 is certainly appreciated.

Two basic approaches come to mind:

1) add an up-call between the front and data portions.  Right now we have 
a get_reply between header and front that is used to get the buffers for 
front (and data).  We could add a second that comes after front to prepare 
the data buffers.  This would basically change the sequence from

 - read header
 - get/allocate ceph_msg, along with front, middle, and data buffers
 - read front, middle, data
 - dispatch message
- decode
- process

to

 - read header
 - get/allocate ceph_msg, along with front, maybe middle, and data buffers
 - read front, middle
 - call get_data hook
- decode front, prepare/allocate data buffers
 - read data
 - dispatch message
- process

I think that's a viable approach, but it means a fair bit of refactoring 
to separate the decoding from the processing (at least for 
osd_reply)--right now those two steps are combined in all of the 
handle_* functions.
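In terms of the connection ops, that would look something like this
(just a sketch; the get_data name follows the sequence above and the
signature is invented):

struct ceph_connection_operations {
        struct ceph_msg *(*alloc_msg)(struct ceph_connection *con,
                                      struct ceph_msg_header *hdr,
                                      int *skip);
        /* new: called after front (and middle) have been read but before
         * the data section, so osd_client can decode the per-op reply
         * lengths and attach correctly sized data buffers */
        int (*get_data)(struct ceph_connection *con, struct ceph_msg *msg);
        /* ... existing dispatch, fault, etc. unchanged ... */
};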

2) Introduce a protocol change so that the section boundaries are 
described in a generic way.  Even more work!

Given the only users we have in mind have very small data payloads, 
copying seems simplest to me, but it's up to you guys.  Maybe there is 
something much simpler I'm missing?  Somehow we have to do the front 
parsing before reading data... and probably we want to avoid parsing it 
twice...

sage


Re: [PATCHv2 0/3] rbd: header read/refresh improvements

2015-04-26 Thread Douglas Fuller




 On Apr 27, 2015, at 12:15 AM, Sage Weil s...@newdream.net wrote:
 
 Two basic approaches come to mind:
 
 1) add an up-call between the front and data portions.  Right now we have 
 a get_reply between header and front that is used to get the buffers for 
 front (and data).  We could add a second that comes after front to prepare 
 the data buffers.  This would basically change the sequence from
 
 - read header
 - get/allocate ceph_msg, along with front, middle, and data buffers
 - read front, middle, data
 - dispatch message
- decode
- process
 
 to
 
 - read header
 - get/allocate ceph_msg, along with front, maybe middle, and data buffers
 - read front, middle
 - call get_data hook
- decode front, prepare/allocate data buffers
 - read data
 - dispatch message
- process
 
 I think that's a viable approach, but it means a fair bit of refactoring 
 to separate the decoding from the processing (at least for 
 osd_reply)--right now those two steps are combined in all of the 
 handle_* functions.

In the kernel client, the data buffers are allocated before the message is even 
sent. The messaging code reads the reply data directly into those preallocated 
buffers. If we add an up-call to decode the message size and allocate the data 
buffers when we process the reply, we have to change the osd_client interface.

 2) Introduce a protocol change so that the section boundaries are 
 described in a generic way.  Even more work!

We could also pad the relevant op replies so that each has a fixed length. It 
would be a protocol change, but a smaller one, I assume.

 Given the only users we have in mind have very small data payloads, 
 copying seems simplest to me, but it's up to you guys.  Maybe there is 
 something much simpler I'm missing?  Somehow we have to do the front 
 parsing before reading data... and probably we want to avoid parsing it 
 twice...


This solution just feels hacky and inefficient, so I think there is a desire to 
feel like we at least tried to come up with something simpler and more efficient 
before proceeding.

-Doug


[PATCHv2 0/3] rbd: header read/refresh improvements

2015-04-24 Thread Douglas Fuller
Support multiple class op calls in one ceph_msg and consolidate rbd header
read and refresh processes to use this feature to reduce the number of
ceph_msgs sent for that process. Refresh features on header refresh and
begin returning EIO if features have changed since mapping.

v2: Edit history and address comments from Mike Christie.

Douglas Fuller (3):
  ceph: support multiple class method calls in one ceph_msg
  rbd: combine object method calls in header refresh using fewer
ceph_msgs
  rbd: re-read features during header refresh and detect changes.

 drivers/block/rbd.c | 512 +---
 include/linux/ceph/osd_client.h |   3 +-
 net/ceph/messenger.c|   4 +
 net/ceph/osd_client.c   |  90 ++-
 4 files changed, 462 insertions(+), 147 deletions(-)

-- 
1.9.3
