Re: [PATCHv2 0/3] rbd: header read/refresh improvements
On 04/24/2015 08:22 AM, Douglas Fuller wrote:
> Support multiple class op calls in one ceph_msg and consolidate rbd
> header read and refresh processes to use this feature to reduce the
> number of ceph_msgs sent for that process. Refresh features on header
> refresh and begin returning EIO if features have changed since
> mapping.

I think I understand the motivation for what you're doing here. You
ultimately want to pack more ops into a message, which would be really
great. Ideally you could pack some arbitrary number (much larger than
3), but it seems to me there'll be some restructuring required to get
there.

I am having a hard time understanding your intended implementation
though. I've looked mainly at your first patch, and scanned the other
two. Let's see if I have it right.

You are adding another hunk of data to a class request, and using it
as an intermediate buffer for receiving data. Currently a class
request has three hunks:
- request_info, a pagelist data item, which contains the class name
  and method name, and perhaps someday, an argument list (?).
- request_data, which can contain any number of data items of any
  type, and constitutes outgoing data for a write request.
- response_data, also any number of any type of data items, which
  constitutes space into which incoming data is placed.
You are adding a fourth:
- chain_data, which is a page vector big enough to hold all incoming
  response data.

When building up an OSD request, we call osd_req_encode_op() for every
op in the request. For a class op, that records the length of the
class and method names in the op structure, and then appends those
names to the outbound request_data. Following that, the real request
data (if any) provided with the class op is appended to request_data.
Next (before your patches) whatever buffer space was provided (as a
page array, pagelist, or bio) to receive response data is appended to
response_data. Prior to this, the class op will have been set up with
adequate buffers to directly receive the expected data.

What your first patch seems to do is circumvent this last step, and
instead of using the prearranged buffers, you allocate another page
vector, big enough to hold the combined size of all response_data data
items, and store that in the new chain_data field of an OSD request
class op. (Note that this could be in *any* of the ops in an OSD
request, though I suspect it's always the first.)

When a response arrives, if the *first* op in the request array is
CEPH_OSD_OP_CALL, you split the data, which will have been deposited
in the chain_data page vector. That process involves allocating
another buffer sufficient to hold the entirety of the received data.
You copy the received data into that buffer and release the chain_data
page vector. And then you copy the data from this new buffer back into
the original response array, and then free the buffer.

This sounds pretty expensive. For every class operation you are
copying the received data two extra times.

It's possible I'm not understanding what you're doing here and that's
why I'm writing this now. Before I provide any more specific
commentary on your patches, I want to be sure I understand what you're
trying to do. If my understanding *is* correct, I would say you're
going about this the wrong way, and I may have some suggestions for a
better approach.

Will you please correct me where I'm wrong above, and maybe give a
little better high-level explanation of what you're trying to do?
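In other words, the picture I have in mind is roughly this (a
simplified sketch with stub types, not the actual declarations, and
with chain_data being the field your patch adds):

/*
 * Simplified sketch of a class op's data hunks -- stub types only,
 * not the kernel declarations; chain_data is the proposed new field.
 */
struct osd_data;                        /* a page array, pagelist, or bio */

struct osd_req_op_cls_sketch {
        const char *class_name;
        const char *method_name;
        struct osd_data *request_info;  /* class + method names (pagelist) */
        struct osd_data *request_data;  /* outgoing data for the call */
        struct osd_data *response_data; /* caller's buffers for the reply */
        struct osd_data *chain_data;    /* new: page vector sized to hold
                                         * all response_data items; the
                                         * reply is split back out of it
                                         * after receive */
};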
I see in patch 1 you mention a problem with short reads, and there may
be a simpler fix than that (to begin with). But beyond that, if you're
trying to incorporate more ops in a message, there may be better ways
to go about that as well.

Thanks.

					-Alex

> v2: Edit history and address comments from Mike Christie.
>
> Douglas Fuller (3):
>   ceph: support multiple class method calls in one ceph_msg
>   rbd: combine object method calls in header refresh using fewer
>     ceph_msgs
>   rbd: re-read features during header refresh and detect changes.
>
>  drivers/block/rbd.c             | 512 +---
>  include/linux/ceph/osd_client.h |   3 +-
>  net/ceph/messenger.c            |   4 +
>  net/ceph/osd_client.c           |  90 ++-
>  4 files changed, 462 insertions(+), 147 deletions(-)
Re: [PATCHv2 0/3] rbd: header read/refresh improvements
On Sun, Apr 26, 2015 at 9:29 AM, Alex Elder <el...@ieee.org> wrote:
> On 04/24/2015 08:22 AM, Douglas Fuller wrote:
>> Support multiple class op calls in one ceph_msg and consolidate rbd
>> header read and refresh processes to use this feature to reduce the
>> number of ceph_msgs sent for that process. Refresh features on
>> header refresh and begin returning EIO if features have changed
>> since mapping.
>
> I think I understand the motivation for what you're doing here. You
> ultimately want to pack more ops into a message, which would be
> really great. Ideally you could pack some arbitrary number (much
> larger than 3), but it seems to me there'll be some restructuring
> required to get there.
>
> [...]
>
> This sounds pretty expensive. For every class operation you are
> copying the received data two extra times.
>
> [...]
>
> Will you please correct me where I'm wrong above, and maybe give a
> little better high-level explanation of what you're trying to do?
> I see in patch 1 you mention a problem with short reads, and there
> may be a simpler fix than that (to begin with). But beyond that, if
> you're trying to incorporate more ops in a message, there may be
> better ways to go about that as well.

Yeah, the only objective of this was to pack more call ops into an
osd_request in order to reduce the number of network RTs during rbd
map and refresh.

The fundamental problem the first patch is trying to work around is
the first ceph_msg_data item consuming all reply buffers instead of
just its own. We have to preallocate response buffers, and if the
preallocated response buffer for the first op is large enough (it
always is), it will consume the entire ceph_msg, along with the
replies to the other ops. My understanding is that the first patch is
supposed to be a specific workaround for just that - I think it's
noted somewhere that it will break on reads combined with call ops or
similar.

I too have my efficiency concerns and I raised them in one of the
standups, but the argument was that this is only going to be used for
header init/refresh, not copyup, so the overhead is negligible. I'm
still not sold on the copying everything twice, though, and was going
to try to see if there is a simple way to special-case this even more
and make the diffstat smaller.

Thanks,

                Ilya
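P.S. A toy userspace model of that first-buffer problem (purely
illustrative, nothing to do with the actual messenger code): reply
bytes are poured into the message's data items in order, bounded only
by each buffer's capacity rather than by the per-op reply lengths,
which are only known after decoding the front.

#include <stdio.h>
#include <string.h>

struct buf {
        char data[64];
        size_t cap;
        size_t used;
};

/* pour wire bytes into the buffers in order, capped only by capacity */
static void receive(struct buf *bufs, int nbufs, const char *wire,
                    size_t len)
{
        size_t off = 0;
        int i;

        for (i = 0; i < nbufs && off < len; i++) {
                size_t n = len - off;

                if (n > bufs[i].cap)
                        n = bufs[i].cap;
                memcpy(bufs[i].data, wire + off, n);
                bufs[i].used = n;
                off += n;
        }
}

int main(void)
{
        /* two call ops, each really replying 8 bytes, but op0's
         * preallocated response buffer is 64 bytes - "large enough" */
        struct buf bufs[2] = { { .cap = 64 }, { .cap = 64 } };
        const char wire[] = "op0replyop1reply";  /* both replies, 16 bytes */

        receive(bufs, 2, wire, strlen(wire));
        printf("op0 buffer got %zu bytes, op1 buffer got %zu\n",
               bufs[0].used, bufs[1].used);      /* prints 16 and 0 */
        return 0;
}

Run as is, op0's buffer ends up with all 16 bytes and op1's with none,
which is exactly the short-read symptom the first patch works around.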
Re: [PATCHv2 0/3] rbd: header read/refresh improvements
Alex,

I think you are correct in both your understanding and your impression
of the approach.

> On Apr 26, 2015, at 4:44 AM, Ilya Dryomov <idryo...@gmail.com> wrote:
>
> On Sun, Apr 26, 2015 at 9:29 AM, Alex Elder <el...@ieee.org> wrote:
>> On 04/24/2015 08:22 AM, Douglas Fuller wrote:
>>> Support multiple class op calls in one ceph_msg and consolidate
>>> rbd header read and refresh processes to use this feature to
>>> reduce the number of ceph_msgs sent for that process. Refresh
>>> features on header refresh and begin returning EIO if features
>>> have changed since mapping.
>>
>> This sounds pretty expensive. For every class operation you are
>> copying the received data two extra times.

I'd really like a solution to this, but I don't think one is available
given the constraints.

>> Will you please correct me where I'm wrong above, and maybe give a
>> little better high-level explanation of what you're trying to do?
>>
>> I see in patch 1 you mention a problem with short reads, and there
>> may be a simpler fix than that (to begin with). But beyond that, if
>> you're trying to incorporate more ops in a message, there may be
>> better ways to go about that as well.
>
> Yeah, the only objective of this was to pack more call ops into an
> osd_request in order to reduce the number of network RTs during rbd
> map and refresh.
>
> The fundamental problem the first patch is trying to work around is
> the first ceph_msg_data item consuming all reply buffers instead of
> just its own. We have to preallocate response buffers, and if the
> preallocated response buffer for the first op is large enough (it
> always is), it will consume the entire ceph_msg, along with the
> replies to the other ops. My understanding is that the first patch
> is supposed to be a specific workaround for just that - I think it's
> noted somewhere that it will break on reads combined with call ops
> or similar.

That's correct. This patch only works around that short response issue
in this specific corner case. It does not fix cases where call ops
returning data could be arbitrarily combined with reads or writes (and
apparently they never have been, because that would not work). It
should not introduce additional cases of breakage, though.

I was told that the preferred way to proceed for now was to avoid
changing the osd_client interface and to handle this case in
osd_client instead of the messaging layer. Given those constraints, it
was agreed in the standups and on #ceph-devel that it should be done
this way.

We can't know the actual response sizes until they are decoded in
osd_client. Therefore, we can't copy them directly into the user
buffers off the wire. That costs us one extra copy. The other copy is
required because pagevec.c does not provide an interface for copying
arbitrary data between two page vectors.

> I too have my efficiency concerns and I raised them in one of the
> standups, but the argument was that this is only going to be used
> for header init/refresh, not copyup, so the overhead is negligible.
> I'm still not sold on the copying everything twice, though, and was
> going to try to see if there is a simple way to special-case this
> even more and make the diffstat smaller.

You and I agreed in that particular standup discussion; I don't like
it, either. I would prefer a general-case solution that avoids
introducing so much extra overhead, especially for such a small amount
of data (really just a few bytes). If there's a way to handle this
with a better method, I'm all ears.
I had thought of taking advantage of the fact that the sum total of
all this data will never exceed a single page, but that seems to me to
require working around even more functionality for what is,
essentially, a corner case.

-Doug
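P.S. The split step we keep referring to amounts to something like the
following sketch (split_replies is a hypothetical helper, not the
patch's actual code, and it ignores the page-vector mechanics that
force the extra intermediate copy):

#include <string.h>

/*
 * Hypothetical sketch: once osd_client has decoded the per-op reply
 * lengths from the front, parcel the bytes staged in chain_data back
 * out to the per-op response buffers.
 */
void split_replies(const char *chain, size_t chain_len,
                   char **dst, const size_t *op_len, int nops)
{
        size_t off = 0;
        int i;

        for (i = 0; i < nops; i++) {
                size_t n = op_len[i];

                if (n > chain_len - off)        /* tolerate a short reply */
                        n = chain_len - off;
                memcpy(dst[i], chain + off, n);
                off += n;
        }
}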
Re: [PATCHv2 0/3] rbd: header read/refresh improvements
>> I'd really like a solution to this, but I don't think one is
>> available given the constraints.
>
> Now that I know I understand it, I will at least provide some review
> comments on the patches. I'll also think about it a little. My
> instinct says there should be no need to make the copies, but the
> devil will be in the details.

Yeah, it seems like it should be simpler. To fix it, I think we would
have to either commit a layering violation (decode the call ops'
response sizes in the message receive flow) or change the interface to
osd_client (don't provide preallocated pagevecs for call response
data, even though that's what's done for other types of ops). I'm the
new guy, though, so it's certainly possible I'm missing something.

>> I was told that the preferred way to proceed for now was to avoid
>> changing the osd_client interface and to handle this case in
>> osd_client instead of the messaging layer. Given those constraints,
>> it was agreed in the standups and on #ceph-devel that it should be
>> done this way.
>
> Sorry, I haven't been keeping up with all the activity on the
> mailing list. I pay closest attention to patches to the kernel code.

Well, this was worked out mainly in the standups and on IRC. There
wasn't a mailing list discussion on it. Having said that, the
attention to the patch is certainly appreciated.

> OK, well I'll go look at that code path again to see if I can come
> up with any bright ideas.

Thanks very much.

-Doug
Re: [PATCHv2 0/3] rbd: header read/refresh improvements
On 04/26/2015 11:35 PM, Douglas Fuller wrote:
> This solution just feels hacky and inefficient, so I think there is
> a desire to feel like we at least tried to come up with something
> simpler and more efficient before proceeding.

Right. And somehow more fitting with the existing code, if that's
possible. There are a few things in this proposal that are just
special-case code, and it would be nice to avoid that.

I'm sorry I've been too busy today to get back to this. I spent a lot
of time on Saturday getting ramped back up on this code so I could
provide a decent review... I'm going to at least look at the code
(error on a read op) before going to bed, so I at least know what the
problem is. I'll try to have something to say (even if it's "I've got
nothing") in the next day or two.

					-Alex
Re: [PATCHv2 0/3] rbd: header read/refresh improvements
On Sun, 26 Apr 2015, Douglas Fuller wrote:
>>> I'd really like a solution to this, but I don't think one is
>>> available given the constraints.
>>
>> Now that I know I understand it, I will at least provide some
>> review comments on the patches. I'll also think about it a little.
>> My instinct says there should be no need to make the copies, but
>> the devil will be in the details.
>
> Yeah, it seems like it should be simpler. To fix it, I think we
> would have to either commit a layering violation (decode the call
> ops' response sizes in the message receive flow) or change the
> interface to osd_client (don't provide preallocated pagevecs for
> call response data, even though that's what's done for other types
> of ops). I'm the new guy, though, so it's certainly possible I'm
> missing something.
>
> [...]
>
> Well, this was worked out mainly in the standups and on IRC. There
> wasn't a mailing list discussion on it. Having said that, the
> attention to the patch is certainly appreciated.

Two basic approaches come to mind:

1) add an up-call between the front and data portions. Right now we
have a get_reply between header and front that is used to get the
buffers for front (and data). We could add a second that comes after
front to prepare the data buffers. This would basically change the
sequence from

 - read header
 - get/allocate ceph_msg, along with front, middle, and data buffers
 - read front, middle, data
 - dispatch message
 - decode
 - process

to

 - read header
 - get/allocate ceph_msg, along with front, maybe middle, and data
   buffers
 - read front, middle
 - call get_data hook
 - decode front, prepare/allocate data buffers
 - read data
 - dispatch message
 - process

I think that's a viable approach, but it means a fair bit of
refactoring to separate the decoding from the processing (at least for
osd_reply)--right now those two steps are combined in all of the
handle_* functions.

2) Introduce a protocol change so that the section boundaries are
described in a generic way. Even more work!

Given the only users we have in mind have very small data payloads,
copying seems simplest to me, but it's up to you guys. Maybe there is
something much simpler I'm missing? Somehow we have to do the front
parsing before reading data... and probably we want to avoid parsing
it twice...

sage
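P.S. In code, option 1 would roughly mean a second hook next to the
existing one - get_data here is hypothetical, and the stub types stand
in for the real messenger ones:

/* rough shape of option 1; get_data does not exist today */
struct ceph_connection;
struct ceph_msg;
struct ceph_msg_header;

struct connection_ops_sketch {
        /* existing: runs once the header is read; allocates the msg
         * and its front (and, today, commits the data buffers too) */
        struct ceph_msg *(*alloc_msg)(struct ceph_connection *con,
                                      struct ceph_msg_header *hdr,
                                      int *skip);

        /* proposed: runs after front/middle are read but before data;
         * decodes the front and attaches right-sized data buffers
         * (e.g. per-op reply buffers for an osd reply) */
        int (*get_data)(struct ceph_connection *con,
                        struct ceph_msg *msg);
};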
Re: [PATCHv2 0/3] rbd: header read/refresh improvements
> On Apr 27, 2015, at 12:15 AM, Sage Weil <s...@newdream.net> wrote:
>
> Two basic approaches come to mind:
>
> 1) add an up-call between the front and data portions. Right now we
> have a get_reply between header and front that is used to get the
> buffers for front (and data). We could add a second that comes after
> front to prepare the data buffers. [...]
>
> I think that's a viable approach, but it means a fair bit of
> refactoring to separate the decoding from the processing (at least
> for osd_reply)--right now those two steps are combined in all of the
> handle_* functions.

In the kernel client, the data buffers are allocated before the
message is even sent. The messaging code reads the reply data directly
into those preallocated buffers. If we add a call up to decode the
message size and allocate the data buffers when we process the reply,
we have to change the osd_client interface.

> 2) Introduce a protocol change so that the section boundaries are
> described in a generic way. Even more work!

We could also pad the relevant op replies so that each has a fixed
length. It would be a protocol change, but a smaller one, I assume.

> Given the only users we have in mind have very small data payloads,
> copying seems simplest to me, but it's up to you guys. Maybe there
> is something much simpler I'm missing? Somehow we have to do the
> front parsing before reading data... and probably we want to avoid
> parsing it twice...

This solution just feels hacky and inefficient, so I think there is a
desire to feel like we at least tried to come up with something
simpler and more efficient before proceeding.

-Doug
[PATCHv2 0/3] rbd: header read/refresh improvements
Support multiple class op calls in one ceph_msg and consolidate rbd
header read and refresh processes to use this feature to reduce the
number of ceph_msgs sent for that process. Refresh features on header
refresh and begin returning EIO if features have changed since
mapping.

v2: Edit history and address comments from Mike Christie.

Douglas Fuller (3):
  ceph: support multiple class method calls in one ceph_msg
  rbd: combine object method calls in header refresh using fewer
    ceph_msgs
  rbd: re-read features during header refresh and detect changes.

 drivers/block/rbd.c             | 512 +---
 include/linux/ceph/osd_client.h |   3 +-
 net/ceph/messenger.c            |   4 +
 net/ceph/osd_client.c           |  90 ++-
 4 files changed, 462 insertions(+), 147 deletions(-)

-- 
1.9.3