Re: [PATCH 01/18] libceph: add scatterlist messenger data type
On Wed, Jul 29, 2015 at 06:40:01PM -0500, Mike Christie wrote:
> I guess I was viewing this similar to cephfs, where it does not use
> rbd and the block layer. It just makes ceph/rados calls directly using
> libceph. I am using rbd.c for its helper/wrapper functions around the
> libceph ones, but I could just make libceph calls directly too.
>
> Were you saying that because lio support needs more block-layer'ish
> operations like write same, compare and write, etc. than cephfs, I
> should not do the lio backend, and we should always go through rbd for
> lio support?

I'd really prefer that. We have other users for these facilities as
well, and I'd much prefer having block layer support rather than
working around it.

> Is that for all operations? For distributed TMFs and PRs, are you
> thinking I should make those more block layer based (some sort of
> queue or block device callouts or REQ_ types), or should those still
> have some sort of lio callouts which could call different
> locking/cluster APIs like libceph?

Yes. FYI, I've pushed out my WIP work for PRs here:

http://git.infradead.org/users/hch/scsi.git/shortlog/refs/heads/pr-api

TMFs are a bit of a borderline case, but instead of needing special
bypasses I'd rather find a way to add them. For example, we already
have TMF ioctls for SCSI, so we might as well pull this up to the
block layer.
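As a rough illustration of the direction discussed above, here is a
hypothetical sketch of what block_device-level PR callouts could look
like, in the spirit of the pr-api WIP branch linked above. The struct
and member names are assumptions, not the branch's actual interface;
an rbd implementation could back these callouts with RADOS lock or
object-class operations via libceph rather than SCSI PR-OUT CDBs.

#include <linux/blkdev.h>

/* Sketch only: names, members, and signatures are assumptions. */
struct pr_ops_sketch {
	/* register (or move) a reservation key for this initiator */
	int (*pr_register)(struct block_device *bdev, u64 old_key,
			   u64 new_key, u32 flags);
	/* acquire a reservation of the given type with a held key */
	int (*pr_reserve)(struct block_device *bdev, u64 key, int type,
			  u32 flags);
	int (*pr_release)(struct block_device *bdev, u64 key, int type);
	/* preempt (optionally aborting) another registrant's key */
	int (*pr_preempt)(struct block_device *bdev, u64 old_key,
			  u64 new_key, int type, bool abort);
	int (*pr_clear)(struct block_device *bdev, u64 key);
};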
Re: [PATCH 01/18] libceph: add scatterlist messenger data type
Hey Mike & HCH,

On Wed, 2015-07-29 at 18:40 -0500, Mike Christie wrote:
> On 07/29/2015 05:59 PM, Mike Christie wrote:
>> On 07/29/2015 12:55 PM, Christoph Hellwig wrote:
>>> On Wed, Jul 29, 2015 at 04:23:38AM -0500, mchri...@redhat.com wrote:
>>>> From: Mike Christie micha...@cs.wisc.edu
>>>>
>>>> LIO uses scatterlist for its page/data management. This patch adds
>>>> a scatterlist messenger data type, so LIO can pass its sg down
>>>> directly to rbd.
>>>
>>> Just as I mentioned for David's patches, this is the wrong way to
>>> attack your problem. The block layer already supports WRITE SAME,
>>> and COMPARE AND WRITE needs to be supported at that level too
>>> instead of creating artificial bypasses.
>>
>> Why do I have to use the block layer? I just want to map the se_cmd
>> to a ceph request and then put it back on the wire. I don't think I
>> need any of the block layer services. We will do things like io
>> scheduling on the OSD side. We just want to use LIO as more of a
>> passthrough.
>
> Maybe I misunderstood you. I guess I was viewing this similar to
> cephfs, where it does not use rbd and the block layer. It just makes
> ceph/rados calls directly using libceph. I am using rbd.c for its
> helper/wrapper functions around the libceph ones, but I could just
> make libceph calls directly too.
>
> Were you saying that because lio support needs more block-layer'ish
> operations like write same, compare and write, etc. than cephfs, I
> should not do the lio backend, and we should always go through rbd
> for lio support?

If we're using common request_queue function pointers, it would avoid
the need to maintain an extra backend driver, and be a generic offload
interface for other make_request_fn() based block drivers using
target-core to utilize.

In the WRITE_SAME + COMPARE_AND_WRITE pass-through cases, IBLOCK
se_cmd pass-through should be invoking the driver-provided VAAI
callbacks directly if they exist, and disabling local target-core CDB
emulation.

For EXTENDED_COPY pass-through, target-core copy-manager still needs
to be responsible for config_group dependencies across multiple
se_device backends, with an IBLOCK pass-through providing both
block_device *src_bd + *dst_bd pointers into a request_queue callback
for the different PUSH/PULL offload models. It should also be able to
fall back to a local copy if the source + destination devices do not
both support the same type of copy-offload pass-through.

> Is that for all operations? For distributed TMFs and PRs, are you
> thinking I should make those more block layer based (some sort of
> queue or block device callouts or REQ_ types), or should those still
> have some sort of lio callouts which could call different
> locking/cluster APIs like libceph?

The PR-OUT logic for REGISTER w/ remote I_PORT and remote PREEMPT-*
currently obtains the necessary config_group dependencies, and that
would need to be considered for request_queue based PR pass-through
too.

Exposing target TMF pass-through into a request_queue is a different
beast..
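To make the request_queue offload idea above concrete, here is a
hedged sketch of the kind of callout table an IBLOCK pass-through
might dispatch to. None of these names exist in the kernel; they only
illustrate how WRITE_SAME, COMPARE_AND_WRITE, and EXTENDED_COPY could
be handed to a make_request_fn() based driver such as rbd instead of
being emulated in target-core.

#include <linux/blkdev.h>

/* Sketch only: struct, members, and signatures are assumptions. */
struct blk_offload_ops_sketch {
	/* WRITE SAME: replicate one page's data over nr_sects sectors */
	int (*write_same)(struct block_device *bdev, sector_t sector,
			  sector_t nr_sects, struct page *page);
	/* COMPARE AND WRITE: atomically verify, then write, one block */
	int (*compare_and_write)(struct block_device *bdev,
				 sector_t sector, struct page *verify,
				 struct page *write);
	/* EXTENDED COPY: PUSH/PULL offload between two devices */
	int (*copy_range)(struct block_device *src_bd, sector_t src_sector,
			  struct block_device *dst_bd, sector_t dst_sector,
			  sector_t nr_sects);
};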
Re: [PATCH 01/18] libceph: add scatterlist messenger data type
On 07/29/2015 08:34 AM, Alex Elder wrote:
> On 07/29/2015 04:23 AM, mchri...@redhat.com wrote:
>> From: Mike Christie micha...@cs.wisc.edu
>>
>> LIO uses scatterlist for its page/data management. This patch adds a
>> scatterlist messenger data type, so LIO can pass its sg down directly
>> to rbd.
>>
>> Signed-off-by: Mike Christie micha...@cs.wisc.edu
>
> I'm not going to be able to review all of these, and this isn't even
> a complete review. But it's something...

No problem. Thanks for any comments.

> You're clearly on the right track, but I want to provide a meaningful
> review for correctness and design, so I'm looking for a bit more
> information.
>
>> ---
>>  include/linux/ceph/messenger.h  | 13 ++
>>  include/linux/ceph/osd_client.h | 12 +-
>>  net/ceph/messenger.c            | 96 +
>>  net/ceph/osd_client.c           | 26 +++
>>  4 files changed, 146 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
>> index 3775327..bc1bde8 100644
>> --- a/include/linux/ceph/messenger.h
>> +++ b/include/linux/ceph/messenger.h
>> @@ -79,6 +79,7 @@ enum ceph_msg_data_type {
>>  #ifdef CONFIG_BLOCK
>>  	CEPH_MSG_DATA_BIO,	/* data source/destination is a bio list */
>>  #endif /* CONFIG_BLOCK */
>> +	CEPH_MSG_DATA_SG,	/* data source/destination is a scatterlist */
>>  };
>>
>>  static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type)
>> @@ -90,6 +91,7 @@ static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type)
>>  #ifdef CONFIG_BLOCK
>>  	case CEPH_MSG_DATA_BIO:
>>  #endif /* CONFIG_BLOCK */
>> +	case CEPH_MSG_DATA_SG:
>>  		return true;
>>  	default:
>>  		return false;
>> @@ -112,6 +114,11 @@ struct ceph_msg_data {
>>  			unsigned int	alignment;	/* first page */
>>  		};
>>  		struct ceph_pagelist	*pagelist;
>> +		struct {
>> +			struct scatterlist *sgl;
>> +			unsigned int	sgl_init_offset;
>> +			u64		sgl_length;
>> +		};
>
> Can you supply a short explanation of what these fields represent?
> It seems sgl_init_offset is the offset of the starting byte in the
> sgl, but is the purpose of that for page offset calculation, or does
> it represent an offset into the total length of the sgl?
>
> Or put another way, does sgl_init_offset represent some portion of
> the sgl_length that has been consumed already (and so the initial
> residual length is sgl_length - sgl_init_offset)?

sgl - starting scatterlist entry we are going to send/receive to/from.

sgl_init_offset - byte offset in the sgl above we will start executing
from. It is for cases like where a LIO command crossed segment/object
boundaries, so we had to break it up, and the first obj request ended
up in the middle of a scatterlist entry. For the second obj request we
set the sgl to the sg we ended on in the first request, and then set
the sgl_init_offset to where we left off in the first request.

So it basically allows me to not have to clone the list, similar to
how the bio code does it. However, if we did clone it, then I could
just manipulate the cloned sg's sg->offset instead of adding the
sgl_init_offset field.

sgl_length - number of bytes in the sgl we are going to send/receive.
This also is for the case where we broke up the LIO command into
multiple obj requests.

>>  	};
>>  };
>>
>> @@ -139,6 +146,10 @@ struct ceph_msg_data_cursor {
>>  			struct page	*page;		/* page from list */
>>  			size_t		offset;		/* bytes from list */
>>  		};
>> +		struct {
>> +			struct scatterlist *sg;	/* curr sg */ /* current sg */
>> +			unsigned int	sg_consumed;
>
> Here too, what does sg_consumed represent with respect to the
> initial offset and the length?

It is the number of bytes in the sgl we have sent/received. It is used
by the messenger advance code to track whether we need to advance to
the next sg or whether we still have data left from the current one.
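A short illustrative sketch of the cursor bookkeeping Mike describes
above may help. These helpers are not the patch's actual code; the
names, and the exact scope of sg_consumed (tracked per entry here),
are assumptions. They only show how an initial byte offset plus a
consumed count let the messenger walk a scatterlist across object
request boundaries without cloning it.

#include <linux/scatterlist.h>

struct sg_cursor_sketch {
	struct scatterlist *sg;	/* current sg entry */
	unsigned int sg_consumed;	/* bytes consumed within sg */
	u64 resid;			/* bytes left in this request */
};

static void sg_cursor_init_sketch(struct sg_cursor_sketch *cur,
				  struct scatterlist *sgl,
				  unsigned int init_offset, u64 length)
{
	cur->sg = sgl;
	cur->sg_consumed = init_offset;	/* may begin mid-entry */
	cur->resid = length;
}

/* Advance by bytes; step to the next entry once this one is used up. */
static void sg_cursor_advance_sketch(struct sg_cursor_sketch *cur,
				     unsigned int bytes)
{
	cur->resid -= bytes;
	cur->sg_consumed += bytes;
	if (cur->resid && cur->sg_consumed >= cur->sg->length) {
		cur->sg = sg_next(cur->sg);
		cur->sg_consumed = 0;
	}
}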
Re: [PATCH 01/18] libceph: add scatterlist messenger data type
On Wed, Jul 29, 2015 at 04:23:38AM -0500, mchri...@redhat.com wrote:
> From: Mike Christie micha...@cs.wisc.edu
>
> LIO uses scatterlist for its page/data management. This patch adds a
> scatterlist messenger data type, so LIO can pass its sg down directly
> to rbd.

Just as I mentioned for David's patches, this is the wrong way to
attack your problem. The block layer already supports WRITE SAME, and
COMPARE AND WRITE needs to be supported at that level too instead of
creating artificial bypasses.
[PATCH 01/18] libceph: add scatterlist messenger data type
From: Mike Christie micha...@cs.wisc.edu

LIO uses scatterlist for its page/data management. This patch adds a
scatterlist messenger data type, so LIO can pass its sg down directly
to rbd.

Signed-off-by: Mike Christie micha...@cs.wisc.edu
---
 include/linux/ceph/messenger.h  | 13 ++
 include/linux/ceph/osd_client.h | 12 +-
 net/ceph/messenger.c            | 96 +
 net/ceph/osd_client.c           | 26 +++
 4 files changed, 146 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 3775327..bc1bde8 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -79,6 +79,7 @@ enum ceph_msg_data_type {
 #ifdef CONFIG_BLOCK
 	CEPH_MSG_DATA_BIO,	/* data source/destination is a bio list */
 #endif /* CONFIG_BLOCK */
+	CEPH_MSG_DATA_SG,	/* data source/destination is a scatterlist */
 };
 
 static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type)
@@ -90,6 +91,7 @@ static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type)
 #ifdef CONFIG_BLOCK
 	case CEPH_MSG_DATA_BIO:
 #endif /* CONFIG_BLOCK */
+	case CEPH_MSG_DATA_SG:
 		return true;
 	default:
 		return false;
@@ -112,6 +114,11 @@ struct ceph_msg_data {
 			unsigned int	alignment;	/* first page */
 		};
 		struct ceph_pagelist	*pagelist;
+		struct {
+			struct scatterlist *sgl;
+			unsigned int	sgl_init_offset;
+			u64		sgl_length;
+		};
 	};
 };
 
@@ -139,6 +146,10 @@ struct ceph_msg_data_cursor {
 			struct page	*page;		/* page from list */
 			size_t		offset;		/* bytes from list */
 		};
+		struct {
+			struct scatterlist *sg;	/* curr sg */
+			unsigned int	sg_consumed;
+		};
 	};
 };
 
@@ -294,6 +305,8 @@ extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
 extern void ceph_msg_data_add_bio(struct ceph_msg *msg, struct bio *bio,
 				  size_t length);
 #endif /* CONFIG_BLOCK */
+extern void ceph_msg_data_add_sg(struct ceph_msg *msg, struct scatterlist *sgl,
+				 unsigned int sgl_init_offset, u64 length);
 
 extern struct ceph_msg *ceph_msg_new(int type, int front_len, gfp_t flags,
 				     bool can_fail);
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 0890167..2152f06 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -52,6 +52,7 @@ enum ceph_osd_data_type {
 #ifdef CONFIG_BLOCK
 	CEPH_OSD_DATA_TYPE_BIO,
 #endif /* CONFIG_BLOCK */
+	CEPH_OSD_DATA_TYPE_SG,
 };
 
 struct ceph_osd_data {
@@ -70,6 +71,11 @@ struct ceph_osd_data {
 			struct bio	*bio;		/* list of bios */
 			size_t		bio_length;	/* total in list */
 		};
+		struct {
+			struct scatterlist *sgl;
+			size_t		sgl_length;
+			unsigned int	sgl_init_offset;
+		};
 #endif /* CONFIG_BLOCK */
 	};
 };
@@ -313,7 +319,11 @@ extern void osd_req_op_extent_osd_data_bio(struct ceph_osd_request *,
 					unsigned int which,
 					struct bio *bio, size_t bio_length);
 #endif /* CONFIG_BLOCK */
-
+extern void osd_req_op_extent_osd_data_sg(struct ceph_osd_request *,
+					unsigned int which,
+					struct scatterlist *sgl,
+					unsigned int init_sg_offset,
+					u64 length);
 extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *,
 					unsigned int which,
 					struct ceph_pagelist *pagelist);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index e3be1d2..08d39fb 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -893,6 +893,75 @@ static bool ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
 #endif /* CONFIG_BLOCK */
 
 /*
+ * For a sg data item, a piece is whatever remains of the next
+ * entry in the current sg entry, or the first entry in the next
+ * sg in the list.
+ */
+static void ceph_msg_data_sg_cursor_init(struct ceph_msg_data_cursor *cursor,
+					 size_t length)
+{
+	struct ceph_msg_data *data = cursor->data;
+	struct scatterlist *sg;
+
+	BUG_ON(data->type != CEPH_MSG_DATA_SG);
+
+	sg = data->sgl;
+
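For context, a hedged usage sketch of the helpers this patch adds: how
a caller such as a LIO backend might point an OSD request's extent op
at an existing scatterlist. The surrounding request setup is elided;
req is assumed to be an initialized ceph_osd_request, and the helper
below is an illustration, not part of the series.

#include <linux/scatterlist.h>
#include <linux/ceph/osd_client.h>

static void attach_sg_sketch(struct ceph_osd_request *req,
			     struct scatterlist *sgl,
			     unsigned int init_sg_offset, u64 length)
{
	/*
	 * Hand the caller's sgl to libceph without copying or cloning;
	 * init_sg_offset lets a follow-on object request resume in the
	 * middle of an sg entry when a command spans object boundaries.
	 */
	osd_req_op_extent_osd_data_sg(req, 0, sgl, init_sg_offset, length);
}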
Re: [PATCH 01/18] libceph: add scatterlist messenger data type
On 07/29/2015 04:23 AM, mchri...@redhat.com wrote:
> From: Mike Christie micha...@cs.wisc.edu
>
> LIO uses scatterlist for its page/data management. This patch adds a
> scatterlist messenger data type, so LIO can pass its sg down directly
> to rbd.
>
> Signed-off-by: Mike Christie micha...@cs.wisc.edu

I'm not going to be able to review all of these, and this isn't even a
complete review. But it's something... You're clearly on the right
track, but I want to provide a meaningful review for correctness and
design, so I'm looking for a bit more information.

> ---
>  include/linux/ceph/messenger.h  | 13 ++
>  include/linux/ceph/osd_client.h | 12 +-
>  net/ceph/messenger.c            | 96 +
>  net/ceph/osd_client.c           | 26 +++
>  4 files changed, 146 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> index 3775327..bc1bde8 100644
> --- a/include/linux/ceph/messenger.h
> +++ b/include/linux/ceph/messenger.h
> @@ -79,6 +79,7 @@ enum ceph_msg_data_type {
>  #ifdef CONFIG_BLOCK
>  	CEPH_MSG_DATA_BIO,	/* data source/destination is a bio list */
>  #endif /* CONFIG_BLOCK */
> +	CEPH_MSG_DATA_SG,	/* data source/destination is a scatterlist */
>  };
>
>  static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type)
> @@ -90,6 +91,7 @@ static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type)
>  #ifdef CONFIG_BLOCK
>  	case CEPH_MSG_DATA_BIO:
>  #endif /* CONFIG_BLOCK */
> +	case CEPH_MSG_DATA_SG:
>  		return true;
>  	default:
>  		return false;
> @@ -112,6 +114,11 @@ struct ceph_msg_data {
>  			unsigned int	alignment;	/* first page */
>  		};
>  		struct ceph_pagelist	*pagelist;
> +		struct {
> +			struct scatterlist *sgl;
> +			unsigned int	sgl_init_offset;
> +			u64		sgl_length;
> +		};

Can you supply a short explanation of what these fields represent? It
seems sgl_init_offset is the offset of the starting byte in the sgl,
but is the purpose of that for page offset calculation, or does it
represent an offset into the total length of the sgl?

Or put another way, does sgl_init_offset represent some portion of the
sgl_length that has been consumed already (and so the initial residual
length is sgl_length - sgl_init_offset)?

>  	};
>  };
>
> @@ -139,6 +146,10 @@ struct ceph_msg_data_cursor {
>  			struct page	*page;		/* page from list */
>  			size_t		offset;		/* bytes from list */
>  		};
> +		struct {
> +			struct scatterlist *sg;	/* curr sg */ /* current sg */
> +			unsigned int	sg_consumed;

Here too, what does sg_consumed represent with respect to the initial
offset and the length?

I guess I'm going to stop with that. It'll be a lot easier for me to
review this if I'm sure I understand what these represent.

Thanks.					-Alex

> +		};
>  	};
>  };

. . .
Re: [PATCH 01/18] libceph: add scatterlist messenger data type
On 07/29/2015 12:55 PM, Christoph Hellwig wrote:
> On Wed, Jul 29, 2015 at 04:23:38AM -0500, mchri...@redhat.com wrote:
>> From: Mike Christie micha...@cs.wisc.edu
>>
>> LIO uses scatterlist for its page/data management. This patch adds a
>> scatterlist messenger data type, so LIO can pass its sg down directly
>> to rbd.
>
> Just as I mentioned for David's patches, this is the wrong way to
> attack your problem. The block layer already supports WRITE SAME, and
> COMPARE AND WRITE needs to be supported at that level too instead of
> creating artificial bypasses.

Why do I have to use the block layer? I just want to map the se_cmd to
a ceph request and then put it back on the wire. I don't think I need
any of the block layer services. We will do things like io scheduling
on the OSD side. We just want to use LIO as more of a passthrough.
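Mike's passthrough point can be made concrete with a hedged sketch: a
LIO backend handing the command's scatterlist straight to a ceph
request via the sg data type from this series, with no block layer in
between. Request setup and submission are elided, and the helper name
is an illustration, though t_data_sg and data_length are fields of the
real struct se_cmd.

#include <target/target_core_base.h>
#include <linux/ceph/osd_client.h>

static void se_cmd_to_ceph_sketch(struct ceph_osd_request *req,
				  struct se_cmd *cmd)
{
	/* Map the se_cmd's data directly onto the OSD request. */
	osd_req_op_extent_osd_data_sg(req, 0, cmd->t_data_sg, 0,
				      cmd->data_length);
}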