Re: iscsi target regression due to "tcp: remove prequeue support" patch
On 01/18/2018 09:10 AM, Florian Westphal wrote:
>> It would indicate users providing their own ->sk_data_ready() callback
>> must be responsible for waking up a kthread context blocked on
>> sock_recvmsg(..., MSG_WAITALL), when a second ->sk_data_ready() is
>> received before the first sock_recvmsg(..., MSG_WAITALL) completes.
>
> I agree, it looks like we need something like this?
> (not even build tested):
>
> diff --git a/drivers/target/iscsi/iscsi_target_nego.c b/drivers/target/iscsi/iscsi_target_nego.c
> index b686e2ce9c0e..3723f8f419aa 100644
> --- a/drivers/target/iscsi/iscsi_target_nego.c
> +++ b/drivers/target/iscsi/iscsi_target_nego.c
> @@ -432,6 +432,9 @@ static void iscsi_target_sk_data_ready(struct sock *sk)
>  	if (test_and_set_bit(LOGIN_FLAGS_READ_ACTIVE, &conn->login_flags)) {
>  		write_unlock_bh(&sk->sk_callback_lock);
>  		pr_debug("Got LOGIN_FLAGS_READ_ACTIVE=1, conn: %p >>>>\n", conn);
> +		if (WARN_ON(iscsi_target_sk_data_ready == conn->orig_data_ready))
> +			return;
> +		conn->orig_data_ready(sk);
>  		return;

This allows iscsi login to work for me. I ran it against the target-pending
for-next branch.
Re: iscsi target regression due to "tcp: remove prequeue support" patch
On 01/15/2018 12:41 AM, Nicholas A. Bellinger wrote:
> Hey MNC & Co,
>
> Ping on the earlier iscsi-target authentication login failures atop
> 4.14 + commit e7942d063 removing tcp prequeue support.
>
> For reference, what is your pre 4.14 environment using for
> sysctl_tcp_low_latency..?

tcp_low_latency=1. I have tested the current kernel with 1 and 0, and it
does not make a difference.

> netdev folks, how would you like to proceed for -rc1..?
>
> On Mon, 2018-01-08 at 22:32 -0800, Nicholas A. Bellinger wrote:
>> Hi MNC & Florian,
>>
>> (Adding net-dev + DaveM CC')
>>
>> Catching up on pre-holiday threads, thanks for the heads up.
>>
>> Comments below.
>>
>> On Wed, 2017-12-13 at 23:56 -0600, Mike Christie wrote:
>>> Hey Nick and Florian,
>>>
>>> Starting in 4.14 iscsi logins will fail around 50% of the time.
>>>
>>> I git bisected the issue down to this commit:
>>>
>>> commit e7942d0633c47c791ece6afa038be9cf977226de
>>> Author: Florian Westphal <f...@strlen.de>
>>> Date: Sun Jul 30 03:57:18 2017 +0200
>>>
>>>     tcp: remove prequeue support
>>>
>>> Nick, attached is the iscsi target log info when the login fails.
>>>
>>> You can see at:
>>>
>>> Dec 13 17:55:01 rhel73n1 kernel: Got Login Command, Flags 0x81, ITT:
>>> 0x, CmdSN: 0x, ExpStatSN: 0xf86dc69b, CID: 0, Length: 65
>>>
>>> we have got a login command and we seem to then go into
>>> iscsit_do_rx_data -> sock_recvmsg
>>>
>>> We seem to get stuck in there though, because we stay blocked until:
>>>
>>> Dec 13 17:55:01 rhel73n1 kernel: Entering iscsi_target_sk_data_ready:
>>> conn: 88b35cbb3000
>>> Dec 13 17:55:01 rhel73n1 kernel: Got LOGIN_FLAGS_READ_ACTIVE=1, conn:
>>> 88b35cbb3000 >>>>
>>>
>>> where the initiator side timeout fires 15 seconds later and it
>>> disconnects the tcp connection, and we eventually break out of the
>>> recvmsg call:
>>>
>>> Dec 13 17:55:16 rhel73n1 kernel: Entering iscsi_target_sk_state_change
>>> Dec 13 17:55:16 rhel73n1 kernel: __iscsi_target_sk_check_close:
>>> TCP_CLOSE_WAIT|TCP_CLOSE, returning FALSE
>>>
>>> Dec 13 17:55:16 rhel73n1 kernel: rx_loop: 68, total_rx: 68, data: 68
>>> Dec 13 17:55:16 rhel73n1 kernel: iscsi_target_do_login_rx after
>>> rx_login_io, 88b35cbb3000, kworker/2:2:1829
>>
>> Ok, the 3rd login request payload (65 + 3 padded to 68 bytes)
>> containing CHAP_N + CHAP_R keys remains blocked on sock_recvmsg(), until
>> TPG login_timeout subsequently fires after 15 seconds of inactivity to
>> terminate this login attempt.
>>
>>> Is the iscsi target doing something incorrect in its use of
>>> sk_data_ready and sock_recvmsg or is the tcp patch at fault?
>>
>> From the logs, the sk_data_ready() -> iscsi_target_sk_data_ready()
>> callbacks appear to be firing as expected.
>>
>> iscsi-target login does iscsit_rx_do_data() -> rx_data() ->
>> sock_recvmsg(..., MSG_WAITALL) from a system_wq kworker process context
>> after the iscsi_target_sk_data_ready() callback queues up
>> iscsi_conn->login_work for execution, and sock_recvmsg() uses a single
>> struct kvec iovec for struct msghdr.
>>
>> AFAICT, iscsi-target uses blocking kernel socket reads from process
>> context, similar to kernel_recvmsg(..., MSG_WAITALL) with DRBD.
>>
>> Florian + DaveM, any idea why the removal of prequeue support is having
>> an effect here..?

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 1/4] scsi_tcp: block BH in TCP callbacks
On 05/17/2016 07:44 PM, Eric Dumazet wrote:
> iscsi_sw_tcp_data_ready() and iscsi_sw_tcp_state_change() were
> using read_lock(&sk->sk_callback_lock) which is fine if caller
> disabled BH.
>
> TCP stack no longer has this requirement and can run from
> process context.
>
> Use read_lock_bh() variant to restore previous assumption.
>
> Ideally this code could use RCU instead...
>
> Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
> Fixes: d41a69f1d390 ("tcp: make tcp_sendmsg() aware of socket backlog")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Mike Christie <micha...@cs.wisc.edu>
> Cc: Venkatesh Srinivas <venkate...@google.com>
> ---
>  drivers/scsi/iscsi_tcp.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
> index 2e4c82f8329c..ace4f1f41b8e 100644
> --- a/drivers/scsi/iscsi_tcp.c
> +++ b/drivers/scsi/iscsi_tcp.c
> @@ -131,10 +131,10 @@ static void iscsi_sw_tcp_data_ready(struct sock *sk)
>  	struct iscsi_tcp_conn *tcp_conn;
>  	read_descriptor_t rd_desc;
>
> -	read_lock(&sk->sk_callback_lock);
> +	read_lock_bh(&sk->sk_callback_lock);
>  	conn = sk->sk_user_data;
>  	if (!conn) {
> -		read_unlock(&sk->sk_callback_lock);
> +		read_unlock_bh(&sk->sk_callback_lock);
>  		return;
>  	}
>  	tcp_conn = conn->dd_data;
> @@ -154,7 +154,7 @@ static void iscsi_sw_tcp_data_ready(struct sock *sk)
>  	/* If we had to (atomically) map a highmem page,
>  	 * unmap it now. */
>  	iscsi_tcp_segment_unmap(&tcp_conn->in.segment);
> -	read_unlock(&sk->sk_callback_lock);
> +	read_unlock_bh(&sk->sk_callback_lock);
>  }
>
>  static void iscsi_sw_tcp_state_change(struct sock *sk)
> @@ -165,10 +165,10 @@ static void iscsi_sw_tcp_state_change(struct sock *sk)
>  	struct iscsi_session *session;
>  	void (*old_state_change)(struct sock *);
>
> -	read_lock(&sk->sk_callback_lock);
> +	read_lock_bh(&sk->sk_callback_lock);
>  	conn = sk->sk_user_data;
>  	if (!conn) {
> -		read_unlock(&sk->sk_callback_lock);
> +		read_unlock_bh(&sk->sk_callback_lock);
>  		return;
>  	}
>  	session = conn->session;
> @@ -179,7 +179,7 @@ static void iscsi_sw_tcp_state_change(struct sock *sk)
>  	tcp_sw_conn = tcp_conn->dd_data;
>  	old_state_change = tcp_sw_conn->old_state_change;
>
> -	read_unlock(&sk->sk_callback_lock);
> +	read_unlock_bh(&sk->sk_callback_lock);
>
>  	old_state_change(sk);
>  }

Reviewed and tested. Thanks.

Acked-by: Mike Christie <micha...@cs.wisc.edu>
Re: [PATCH net-next 1/4] scsi_tcp: block BH in TCP callbacks
On 05/17/2016 07:44 PM, Eric Dumazet wrote:
> iscsi_sw_tcp_data_ready() and iscsi_sw_tcp_state_change() were
> using read_lock(&sk->sk_callback_lock) which is fine if caller
> disabled BH.
>
> TCP stack no longer has this requirement and can run from
> process context.
>
> Use read_lock_bh() variant to restore previous assumption.
>
> Ideally this code could use RCU instead...
>
> Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
> Fixes: d41a69f1d390 ("tcp: make tcp_sendmsg() aware of socket backlog")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Mike Christie <micha...@cs.wisc.edu>
> Cc: Venkatesh Srinivas <venkate...@google.com>
> ---
> [patch body snipped; quoted in full in the message above]

Can I just confirm that nested bh lock calls like:

spin_lock_bh(lock1);
spin_lock_bh(lock2);
/* do something */
spin_unlock_bh(lock2);
spin_unlock_bh(lock1);

are ok? It seems smatch sometimes warns about this. I found this thread
https://lkml.org/lkml/2011/1/25/232 which says it is ok.
Re: [dm-devel] [PATCH 22/26] iscsi_tcp: Use ahash
On 01/24/2016 07:19 AM, Herbert Xu wrote:
> This patch replaces uses of the long obsolete hash interface with
> ahash.
>
> Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>
> ---
>
>  drivers/scsi/iscsi_tcp.c    | 54 ++--
>  drivers/scsi/iscsi_tcp.h    |  4 +--
>  drivers/scsi/libiscsi_tcp.c | 29 +--
>  include/scsi/libiscsi_tcp.h | 13 +-
>  4 files changed, 58 insertions(+), 42 deletions(-)

iSCSI parts look ok.

Reviewed-by: Mike Christie <micha...@cs.wisc.edu>
Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
Anil Veerabhadrappa wrote:

Which ones were they exactly? I think JamesB wanted only common transport
values in the transport class. If it is driver specific then it should go
on the host or target or device with the scsi_host_template attrs.

It's a chicken and egg issue to put the port mapper sysfs entry in the scsi
host attributes. Application won't see sysfs unless initiator creates an

Sorry for the late response. I was on vacation.

That is only with how you coded it today. I asked you to do something like
qla4xxx where the session and host are not so closely bound.

bnx2i is still using the scsi host per iscsi session model; we plan to
migrate to the scsi host per device model pretty soon. Previously you
indicated that you are working on this port, can you please share the
code? We will build on your work and bring this to closure.

Send me a link to your code so I can rebuild my stuff against it.

iSCSI session and driver can't create an iSCSI session without a tcp

That is not right with how things are today even. The iscsi_session struct
can be created before the tcp connection. This was done because we thought
we were going to have to use only sysfs for all setup and management (we
ended up with netlink and sysfs though).

You are right, the software is flexible enough to allow creation of
session/connection and endpoint independently, but so far the user daemon
creates the TCP endpoint before the iSCSI session and binds them all
together. Any changes to this scheme will not be compatible with existing
distributions.

So what do you need to do exactly? You want userspace to be able to set
some session or connection value, then have it used for that session,
right? We have been using the netlink interface for that. You would
basically pass iscsiadm some value (on the cmdline or in some file/db),
then when the session and connection is being created we will pass those
values in. Values that are used after we are in FFP, like abort timeouts,
are passed in the set_param command.

If a value is needed for the session/connection setup then we were passing
that in with the command (like create_session has the queue depth sent
with it). To add new values (driver specific and common ones) and have the
interface compatible with older versions should not be too difficult.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
Anil Veerabhadrappa wrote:

The sysfs bits related to the hba should use one of the scsi sysfs
facilities, or if they are related to iscsi bits and are generic then go
through the iscsi hba.

bnx2i needs 2 sysfs entries:

1. QP size info - this is used to size per connection shared data
structures to issue work requests to the chip (login, scsi cmd, tmf,
nopin) and get completions from the chip (scsi completions, async
messages, etc.). This is an iSCSI HBA attribute.

2. port mapper - we can be more flexible on classifying this as either an
iSCSI HBA attribute or a bnx2i driver global attribute.

Can hooks be added to the iSCSI transport class to include these?

Which ones were they exactly? I think JamesB wanted only common transport
values in the transport class. If it is driver specific then it should go
on the host or target or device with the scsi_host_template attrs.

It's a chicken and egg issue to put the port mapper sysfs entry in the scsi
host attributes. Application won't see sysfs unless initiator creates an

Sorry for the late response. I was on vacation.

That is only with how you coded it today. I asked you to do something like
qla4xxx where the session and host are not so closely bound.

iSCSI session and driver can't create an iSCSI session without a tcp
port. I was wondering if there is a better way than using IOCTL in this
situation?

That is not right with how things are today even. The iscsi_session struct
can be created before the tcp connection. This was done because we thought
we were going to have to use only sysfs for all setup and management (we
ended up with netlink and sysfs though).

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
Anil Veerabhadrappa wrote:

+
+/* iSCSI stages */
+#define ISCSI_STAGE_SECURITY_NEGOTIATION		(0)
+#define ISCSI_STAGE_LOGIN_OPERATIONAL_NEGOTIATION	(1)
+#define ISCSI_STAGE_FULL_FEATURE_PHASE			(3)
+
+/* Logout response codes */
+#define ISCSI_LOGOUT_RESPONSE_CONNECTION_CLOSED	(0)
+#define ISCSI_LOGOUT_RESPONSE_CID_NOT_FOUND	(1)
+#define ISCSI_LOGOUT_RESPONSE_CLEANUP_FAILED	(3)
+
+/* iSCSI task types */
+#define ISCSI_TASK_TYPE_READ	(0)
+#define ISCSI_TASK_TYPE_WRITE	(1)
+#define ISCSI_TASK_TYPE_MPATH	(2)

All of this iscsi code should be in iscsi_proto.h or should be added
there.

This is a very tricky proposal as this header file is automatically
generated by a well defined process and is shared between various drivers
supporting multiple platforms/OSes and the firmware. If it is not a big
issue I would like to keep it the way it is.

The values that are iscsi RFC values should come from the iscsi_proto.h
file and not be duplicated for each driver.

+/*
+ * hardware reset
+ */
+int bnx2i_reset(struct scsi_cmnd *sc)
+{
+	return 0;
+}

So what is up with this one? It seems like if there is a way to reset the
hardware then you would want it as the scsi eh host reset callout instead
of dropping the session. We could add some transport level recovery
callouts for the iscsi specifics.

We may not be able to support HBA cold reset as the bnx2 driver is the
primary owner of chip reset and initialization. This is the drawback of
sharing the network interface with the NIC driver. If there is a need for
the administrator to reset the iSCSI port, the same can be achieved by
running 'ifdown eth#' and 'ifup eth#'. The current driver even allows
ethernet interface reset when there are active iSCSI connections; all
active iscsi sessions will be reinstated when the network link comes back
live.

If you cannot support it or it does not make sense just remove the stub
then. I say it is not a big deal now, but hopefully we do not hit fun like
with qla3xxx and qla4xxx :)

+
+void bnx2i_sysfs_cleanup(void)
+{
+	class_device_unregister(port_class_dev);
+	class_unregister(bnx2i_class);
+}

The sysfs bits related to the hba should use one of the scsi sysfs
facilities, or if they are related to iscsi bits and are generic then go
through the iscsi hba.

bnx2i needs 2 sysfs entries:

1. QP size info - this is used to size per connection shared data
structures to issue work requests to the chip (login, scsi cmd, tmf,
nopin) and get completions from the chip (scsi completions, async
messages, etc.). This is an iSCSI HBA attribute.

2. port mapper - we can be more flexible on classifying this as either an
iSCSI HBA attribute or a bnx2i driver global attribute.

Can hooks be added to the iSCSI transport class to include these?

Which ones were they exactly? I think JamesB wanted only common transport
values in the transport class. If it is driver specific then it should go
on the host or target or device with the scsi_host_template attrs.
Re: SG_IO with 4k buffer size to iscsi sg device causes Bad page panic
Qi, Yanling wrote:

Hi All,

This panic is related to the interactions between scsi/sg.c, the iscsi
initiator and tcp on the RHEL 2.6.9-42 kernel. But we may also have a
similar problem with the open-iscsi initiator. I will explain why we see
the

Yeah, this problem should occur in the upstream open-iscsi iscsi code.
open-iscsi works very similar to linux-scsi where it just sends pages
around with sock->ops->sendpage, and it looks like sg uses
__get_free_pages in RHEL's kernel and upstream it uses alloc_pages, so
unless there was a change in those functions or the network layer then we
should have a similar problem.
Re: [PATCH 20/20] iscsi: support for swapping over iSCSI.
Peter Zijlstra wrote:

On Wed, 2006-09-13 at 15:50 -0500, Mike Christie wrote:

Peter Zijlstra wrote:

Implement sht->swapdev() for iSCSI. This method takes care of reserving
the extra memory needed and marking all relevant sockets with SOCK_VMIO.

When used for swapping, TCP socket creation is done under GFP_MEMALLOC
and the TCP connect is done with SOCK_VMIO to ensure their success.

Also the netlink userspace interface is marked SOCK_VMIO, this will
ensure that even under pressure we can still communicate with the daemon
(which runs as mlockall() and needs no additional memory to operate).
Netlink requests are handled under the new PF_MEM_NOWAIT when a swapper
is present. This ensures that the netlink socket will not block.
User-space will need to retry failed requests.

The TCP receive path is handled under PF_MEMALLOC for SOCK_VMIO sockets.
This makes sure we do not block the critical socket, and that we do not
fail to process incoming data.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
CC: Mike Christie [EMAIL PROTECTED]
---
 drivers/scsi/iscsi_tcp.c            | 103 +++-
 drivers/scsi/scsi_transport_iscsi.c |  23 +++-
 include/scsi/libiscsi.h             |   1
 include/scsi/scsi_transport_iscsi.h |   2
 4 files changed, 113 insertions(+), 16 deletions(-)

Index: linux-2.6/drivers/scsi/iscsi_tcp.c
===
--- linux-2.6.orig/drivers/scsi/iscsi_tcp.c
+++ linux-2.6/drivers/scsi/iscsi_tcp.c
@@ -42,6 +42,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi.h>
 #include <scsi/scsi_transport_iscsi.h>
+#include <scsi/scsi_device.h>

 #include "iscsi_tcp.h"

@@ -845,9 +846,13 @@ iscsi_tcp_data_recv(read_descriptor_t *r
 	int rc;
 	struct iscsi_conn *conn = rd_desc->arg.data;
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
-	int processed;
+	int processed = 0;
 	char pad[ISCSI_PAD_LEN];
 	struct scatterlist sg;
+	unsigned long pflags = current->flags;
+
+	if (sk_has_vmio(tcp_conn->sock->sk))
+		current->flags |= PF_MEMALLOC;

Is this too late or not needed or what is it for?

This function gets run from the network layer's softirq and at this point
we have a skbuff with data that we want to process. The iscsi layer also
does not allocate memory for read or write IO in this path.

I thought I found allocations in that path, lemme search... found this:

iscsi_tcp_data_recv()
  iscsi_data_recv()
    iscsi_complete_pdu()
      __iscsi_complete_pdu()
        iscsi_recv_pdu()
          alloc_skb(GFP_ATOMIC);

You are right, that is for the netlink interface. Could we move the
PF_MEMALLOC setting and clearing to iscsi_recv_pdu and add it to
iscsi_conn_error in scsi_transport_iscsi.c so that iscsi_iser and qla4xxx
will have it set when they need it? I will send a patch for this along
with a way to have the netlink sock vmio set for all iscsi drivers that
need it.

I think we would want to set this flag at a lower level. Something closer
to where the skbuff is allocated?

Is that the skbuff you were talking about? If so, I'd need to carve a path
to pass the swapper information. I had that in a previous patch, but that
was large and ugly. I had to go carrying gfp_t flags all through that call
chain.

In my original post I was just concerned about the sk_buff that gets
passed to the iscsi layer in iscsi_tcp_data_recv. I was wondering if the
chunk of code in the network layer or network driver that allocated that
skbuff needed to set PF_MEMALLOC.
Re: [PATCH 20/20] iscsi: support for swapping over iSCSI.
Peter Zijlstra wrote:

On Thu, 2006-09-14 at 14:22 -0500, Mike Christie wrote:

On Wed, 2006-09-13 at 15:50 -0500, Mike Christie wrote:

Implement sht->swapdev() for iSCSI. This method takes care of reserving
the extra memory needed and marking all relevant sockets with SOCK_VMIO.

[rest of the patch description and diff snipped; quoted in full in the
message above]

Is this too late or not needed or what is it for?

This function gets run from the network layer's softirq and at this point
we have a skbuff with data that we want to process. The iscsi layer also
does not allocate memory for read or write IO in this path.

I thought I found allocations in that path, lemme search... found this:

iscsi_tcp_data_recv()
  iscsi_data_recv()
    iscsi_complete_pdu()
      __iscsi_complete_pdu()
        iscsi_recv_pdu()
          alloc_skb(GFP_ATOMIC);

You are right, that is for the netlink interface. Could we move the
PF_MEMALLOC setting and clearing to iscsi_recv_pdu and add it to
iscsi_conn_error in scsi_transport_iscsi.c so that iscsi_iser and qla4xxx
will have it set when they need it? I will send a patch for this along
with a way to have the netlink sock vmio set for all iscsi drivers that
need it.

I already have such a patch, look at:

http://programming.kicks-ass.net/kernel-patches/vm_deadlock/current/iscsi_vmio.patch

You are drowning me in patches :) I did not see that one. I was still
commenting on this patch :) The new patch looks ok.

but what conditional do you want to use for PF_MEMALLOC, an unconditional
setting will be highly unpopular. Hmm, perhaps you could key it off
sk_has_vmio(nls)...

Yes.
Re: [PATCH 20/20] iscsi: support for swapping over iSCSI.
Mike Christie wrote:

[earlier exchange on the sht->swapdev() patch, the PF_MEMALLOC placement,
and http://programming.kicks-ass.net/kernel-patches/vm_deadlock/current/iscsi_vmio.patch
quoted in full, as in the previous message]

The new patch looks ok.

Oh, I think you need a sock_put to go with the netlink lookup (the lookup
does a hold).
Re: [PATCH 20/20] iscsi: support for swapping over iSCSI.
Peter Zijlstra wrote:

On Thu, 2006-09-14 at 22:35 +0200, Peter Zijlstra wrote:

On Thu, 2006-09-14 at 14:22 -0500, Mike Christie wrote:

[earlier exchange on moving the PF_MEMALLOC setting and clearing to
iscsi_recv_pdu and iscsi_conn_error, and the iscsi_vmio.patch link,
quoted in full]

but what conditional do you want to use for PF_MEMALLOC, an unconditional
setting will be highly unpopular. Hmm, perhaps you could key it off
sk_has_vmio(nls)...

On second thought, not such a good idea, that will still be too coarse.
You only want to force feed stuff originating from
sk_has_vmio(iscsi_tcp_conn->sock->sk) connections, not all connections as
soon as there is a swapper in the system.

You can move the iscsi_session->swapper field to the iscsi_cls_session
and have iscsi_swapdev take an iscsi_cls_session and set that flag.
iscsi_recv_pdu and iscsi_conn_error and all the llds can then access this
bit.

In order to preserve that information you need extra state; abusing the
process flags is as good as propagating __GFP_EMERGENCY down the call
chain with extra gfp_t arguments, perhaps even better, since it will make
sure we catch all allocations.
Re: [PATCH 20/20] iscsi: support for swapping over iSCSI.
Mike Christie wrote: Peter Zijlstra wrote: On Thu, 2006-09-14 at 22:35 +0200, Peter Zijlstra wrote: On Thu, 2006-09-14 at 14:22 -0500, Mike Christie wrote: I thought I found allocations in that path, lemme search... found this: iscsi_tcp_data_recv() iscsi_data_rescv() iscsi_complete_pdu() __iscsi_complete_pdu() iscsi_recv_pdu() alloc_skb( GFP_ATOMIC); You are right that is for the netlink interface. Could we move the PF_MEMALLOC setting and clearing to iscsi_recv_pdu and and add it to iscsi_conn_error in scsi_transport_iscsi.c so that iscsi_iser and qla4xxx will have it set when they need it. I will send a patch for this along with a way to have the netlink sock vmio set for all iscsi drivers that need it. I already have such a patch, look at: http://programming.kicks-ass.net/kernel-patches/vm_deadlock/current/iscsi_vmio.patch but what conditional do you want to use for PF_MEMALLOC, an unconditional setting will be highly unpopular. Hmm, perhaps you could key it of sk_has_vmio(nls)... On second thought, not such a good idea, that will still be too course. You only want to force feed stuff originating from sk_has_vmio(iscsi_tcp_conn-sock-sk) connections, not all connectections as soon as there is a swapper in the system. You can move the iscsi_session-swapper field to the iscsi_cls_session and have iscsi_swapdev take a iscsi_cls_session and set that flag. iscsi_recv_pdu and iscsi_conn_error and all the llds can then access this bit. In order to preserve that information you need extra state, abusing this process flags is as good as propagating __GFP_EMERGENCY down the call chain with extra gfp_t arguments, perhaps even better, since it will make sure we catch all allocations. Oh yeah, on the send side we also allocate some memory for the netlink interface if there is a connection error (iscsi_conn_failure - iscsi_conn_error). And when that is called from the transmit side we can change the GFP_ATOMICs to GFP_NOIOs since we have process context. 
So I am just saying we need to set that flag in a couple more places (if you set it in iscsi_conn_error when iscsi_cls_session->swapper is set, then don't worry about it), and that I need to change iscsi_conn_failure and iscsi_conn_error to take a gfp_t as an argument (or do an in_interrupt() check or something) so we can use GFP_NOIO in the transmit code.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 20/20] iscsi: support for swapping over iSCSI.
Peter Zijlstra wrote:

Implement sht->swapdev() for iSCSI. This method takes care of reserving the extra memory needed and marking all relevant sockets with SOCK_VMIO.

When used for swapping, TCP socket creation is done under GFP_MEMALLOC and the TCP connect is done with SOCK_VMIO to ensure their success.

Also the netlink userspace interface is marked SOCK_VMIO; this will ensure that even under pressure we can still communicate with the daemon (which runs as mlockall() and needs no additional memory to operate). Netlink requests are handled under the new PF_MEM_NOWAIT when a swapper is present. This ensures that the netlink socket will not block. User-space will need to retry failed requests.

The TCP receive path is handled under PF_MEMALLOC for SOCK_VMIO sockets. This makes sure we do not block the critical socket, and that we do not fail to process incoming data.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
CC: Mike Christie [EMAIL PROTECTED]
---
 drivers/scsi/iscsi_tcp.c            | 103 +++-
 drivers/scsi/scsi_transport_iscsi.c |  23 +++-
 include/scsi/libiscsi.h             |   1
 include/scsi/scsi_transport_iscsi.h |   2
 4 files changed, 113 insertions(+), 16 deletions(-)

Index: linux-2.6/drivers/scsi/iscsi_tcp.c
===================================================================
--- linux-2.6.orig/drivers/scsi/iscsi_tcp.c
+++ linux-2.6/drivers/scsi/iscsi_tcp.c
@@ -42,6 +42,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi.h>
 #include <scsi/scsi_transport_iscsi.h>
+#include <scsi/scsi_device.h>

 #include "iscsi_tcp.h"
@@ -845,9 +846,13 @@ iscsi_tcp_data_recv(read_descriptor_t *r
 	int rc;
 	struct iscsi_conn *conn = rd_desc->arg.data;
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
-	int processed;
+	int processed = 0;
 	char pad[ISCSI_PAD_LEN];
 	struct scatterlist sg;
+	unsigned long pflags = current->flags;
+
+	if (sk_has_vmio(tcp_conn->sock->sk))
+		current->flags |= PF_MEMALLOC;

Is this too late, or not needed, or what is it for? This function gets run from the network layer's softirq, and at this point we have an skbuff with data that we want to process.
The iscsi layer also does not allocate memory for read or write IO in this path. I think we would want to set this flag at a lower level, something closer to where the skbuff is allocated.
Re: [RFC] Netlink and user-space buffer pointers
James Smart wrote: Mike Christie wrote:

For the tasks you want to do for the fc class, is performance critical? No, it should not be. If not, you could do what the iscsi class does (for the netdev people, this is drivers/scsi/scsi_transport_iscsi.c) and just suffer a couple copies. For iscsi we do this in userspace to send down a login pdu:

/*
 * xmitbuf is a buffer that is large enough for the iscsi_event,
 * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
 */

Well, the real difference is that the payload of the message is actually the payload of the SCSI command or ELS/CT Request.

I am not sure I follow. For iscsi, everything after the iscsi_event struct can be the iscsi request that is to be transmitted. The payload will not normally be Mbytes, but it is not a couple of bytes.

Thus, the payload may range in size from a few hundred bytes to several kbytes (> 1 page) to Mbytes in size. Rather than buffer all of this and push it over the socket, with the extra copies, it would be best to have the LLDD simply DMA the payload like on a typical SCSI command. Additionally, there will be response data that can be several kbytes in length.

Once you have got the buffer to the class, the class can create a scatterlist to DMA from for the LLD, I thought. iscsi does not do this just because it is software right now. For qla4xxx we do not need something like what you are talking about (see below for what I was thinking about for the initiators). If you are saying the extra step of the copy is plain dumb, I agree, but this happens (you have to suffer some copy and cannot do dio) for sg io as well in some cases. I think for the sg driver the copy_*_user is the default.

Instead of netlink for scsi commands and transport requests: for scsi commands could we just use sg io, or is there something special about the command you want to send?
If you can use sg io for scsi commands, maybe for transport level requests (in my example, iscsi pdus) we could modify something like sg/bsg/block layer scsi_ioctl.c to send down transport requests to the classes and encapsulate them in some new struct transport_requests, or use the existing struct request and do that thing people keep talking about, using the request/request_queue for message passing.
Re: [RFC] Netlink and user-space buffer pointers
Mike Christie wrote:
> iscsi does not do this just because it is software right now. For qla4xxx we do not need

That should be: we do need.
Re: [RFC] Netlink and user-space buffer pointers
James Smart wrote:

Note: We've transitioned off topic. If what this means is there isn't a good way except by ioctls (which still aren't easily portable) or system calls, then that's ok. Then at least we know the limits and can look at other implementation alternatives.

> The payload will not normally be Mbytes, but it is not a couple of bytes.

True... For a large read/write, it will eventually total what the i/o request size was, and you did have to push it through the socket.

What this discussion really comes down to is the difference between initiator offload and what a target does. The initiator offloads the full i/o from the users - e.g. send command, get response. In the initiator case, the user isn't aware of each and every IU that makes up the i/o. As it's on an i/o basis, the LLDD doing the offload needs the full buffer sitting and ready. DMA is preferred so the buffer doesn't have to be consuming socket/kernel/driver buffers while it's pending - plus speed. In the target case, the target controls each IU and its size, thus it only has to have access to as much buffer space as it wants to push the next IU. The i/o can be paced by the target.
Unfortunately, this is an entirely different use model than users of a scsi initiator expect, and it won't map well into replacing things like our sg_io ioctls.

I am not talking about the target here. For the open-iscsi initiator that is in mainline, which I referenced in the example, we send pdus from userspace to the LLD. In the future, initiators that offload some iscsi processing and will login from userspace, or have userspace monitor the transport by doing iscsi pings, need to be able to send these pdus. And the iscsi pdu cannot be broken up at the iscsi level (it can at the interconnect level though). From the iscsi host level they have to go out like a scsi command would, in that the LLD cannot decide to send out multiple pdus for the pdu that userspace sends down. I do agree with you that targets can break down a scsi command into multiple transport level packets as they see fit.

> Instead of netlink for scsi commands and transport requests: for scsi commands could we just use sg io, or is there something special about the command you want to send? If you can use sg io for scsi commands, maybe for transport level requests (in my example, iscsi pdus) we could modify something like sg/bsg/block layer scsi_ioctl.c to send down transport requests to the classes.

Well - there's 2 parts to this answer. First: ioctls are considered dangerous/bad practice, and therefore it would be nice to find a replacement mechanism that eliminates them. If that mechanism has some of the cool features that netlink does, even better.

Yeah, I am not trying to kill ioctls. I go where the community goes. What I am trying to do is just reuse the sg io mapping code so that we do not end up with sg, st, target, blk scsi_ioctl.c, and bsg all doing similar things.
Using sg io in the manner you indicate wouldn't remove the ioctl use. Note: I have OEMs/users that are very confused about the community's statement about ioctls. They've heard they are bad, should never be allowed, will no longer be supported, but yet they are at the heart of DM and sg io and other subsystems. Other than a grandfathered explanation, they don't understand why the rules bend for one piece of code but not for another. To them, all the features are just as critical regardless of who's providing them.

Second: transport level i/o could be done like you suggest, and we've prototyped some of this as well. However, there's something very wrong about putting block device wrappers and settings around something that is not a block device. In general, it's a heck of a lot of overhead and still doesn't solve the real issue - how to portably pass that user buffer.

I am not talking about putting block device wrappers. This is the magic part
Re: [RFC] Netlink and user-space buffer pointers
Mike Christie wrote:
> I do agree with you that targets can break down a scsi command into multiple transport level packets as they see fit.

Oh yeah, is FC IU == iscsi tcp packet, or FC IU == iscsi pdu?
Re: [RFC] Netlink and user-space buffer pointers
James Smart wrote:

Folks, to take netlink to where we want to use it within the SCSI subsystem (as the mechanism of choice to replace ioctls), we're going to need to pass user-space buffer pointers. What is the best, portable manner to pass a pointer between user and kernel space within a netlink message? The example I've seen is in the iscsi target code - it's passed between user-kernel space as a u64, then typecast to a void *, and later, within the bio_map_xxx functions, as an unsigned long. I assume we are going to continue with this method?

I do not know if it is needed. For the target code, we use the bio_map_xxx functions to avoid having to copy the command data, which is needed for decent performance. We have also been trying to figure out ways of getting out of using netlink to send the command info (cdb, tag info, etc) around, because in some of Tomo's tests using mmapped packet sockets he was able to improve performance by removing that copy from the kernel to userspace. We had problems with that though, and other nice interfaces like relayfs only allowed us to pass data from the kernel to userspace, so we still need another interface to pass things from userspace to the kernel. Still working on this though. If someone knows an interface, please let us know.

For the tasks you want to do for the fc class, is performance critical? If not, you could do what the iscsi class does (for the netdev people, this is drivers/scsi/scsi_transport_iscsi.c) and just suffer a couple copies.
For iscsi we do this in userspace to send down a login pdu:

/*
 * xmitbuf is a buffer that is large enough for the iscsi_event,
 * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
 */
memset(xmitbuf, 0, sizeof(*ev) + hdr_size + data_size);
xmitlen = sizeof(*ev);
ev = xmitbuf;
ev->type = ISCSI_UEVENT_SEND_PDU;
ev->transport_handle = transport_handle;
ev->u.send_pdu.sid = sid;
ev->u.send_pdu.cid = cid;
ev->u.send_pdu.hdr_size = hdr_size;
ev->u.send_pdu.data_size = data_size;

Then later we do sendmsg() to send down the xmitbuf to the kernel iscsi driver. I think there may be issues with packing structs, or 32 bit userspace and 64 bit kernels, and other fun things like this, so the iscsi pdu and iscsi event have to be defined correctly - and I guess we are back to some of the problems with ioctls :(
[PATCH 5/6] scsi tgt: scsi target netlink interface
This patch implements a netlink interface for the scsi tgt framework. I was not sure if code using the netlink interface had to get reviewed by the netdev guys. I am ccing them on this patch and providing a basic review of why/how we want to use netlink. I did not think the netdev people wanted to see the scsi and block layer code, so I am just sending the netlink interface part of this patchset to netdev. I can resend the other parts if needed.

The scsi tgt framework adds support for scsi target mode cards. So instead of using the scsi card in your box as an initiator/host, you can use it as a target/server.

The reason for the netlink use is because the target normally receives an interrupt indicating that a command or event is ready to be processed. The scsi card's driver will then call a scsi lib function which eventually calls scsi_tgt_uspace_send (in this patch below) to tell userspace to begin to process the request (userspace contains the state model). Later, userspace will call back into the kernel by sending a netlink msg, and instruct the scsi driver what to do next. When the scsi driver is done executing the operation, it will send a netlink message back to userspace to indicate the success or failure of the operation (using scsi_tgt_uspace_send_status in the patch below).

Signed-off-by: Mike Christie [EMAIL PROTECTED]
Signed-off-by: FUJITA Tomonori [EMAIL PROTECTED]

diff --git a/drivers/scsi/scsi_tgt_if.c b/drivers/scsi/scsi_tgt_if.c
new file mode 100644
index 000..38b35da
--- /dev/null
+++ b/drivers/scsi/scsi_tgt_if.c
@@ -0,0 +1,214 @@
+/*
+ * SCSI target kernel/user interface functions
+ *
+ * Copyright (C) 2005 FUJITA Tomonori [EMAIL PROTECTED]
+ * Copyright (C) 2005 Mike Christie [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
+ * 02110-1301 USA
+ */
+#include <linux/blkdev.h>
+#include <linux/file.h>
+#include <linux/netlink.h>
+#include <net/tcp.h>
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_tgt.h>
+#include <scsi/scsi_tgt_if.h>
+
+#include "scsi_tgt_priv.h"
+
+static int tgtd_pid;
+static struct sock *nl_sk;
+
+static int send_event_res(uint16_t type, struct tgt_event *p,
+			  void *data, int dlen, gfp_t flags, pid_t pid)
+{
+	struct tgt_event *ev;
+	struct nlmsghdr *nlh;
+	struct sk_buff *skb;
+	uint32_t len;
+
+	len = NLMSG_SPACE(sizeof(*ev) + dlen);
+	skb = alloc_skb(len, flags);
+	if (!skb)
+		return -ENOMEM;
+
+	nlh = __nlmsg_put(skb, pid, 0, type, len - sizeof(*nlh), 0);
+
+	ev = NLMSG_DATA(nlh);
+	memcpy(ev, p, sizeof(*ev));
+	if (dlen)
+		memcpy(ev->data, data, dlen);
+
+	return netlink_unicast(nl_sk, skb, pid, 0);
+}
+
+int scsi_tgt_uspace_send(struct scsi_cmnd *cmd, struct scsi_lun *lun, gfp_t gfp_mask)
+{
+	struct Scsi_Host *shost = scsi_tgt_cmd_to_host(cmd);
+	struct sk_buff *skb;
+	struct nlmsghdr *nlh;
+	struct tgt_event *ev;
+	struct tgt_cmd *tcmd;
+	int err, len;
+
+	len = NLMSG_SPACE(sizeof(*ev) + sizeof(struct tgt_cmd));
+	/*
+	 * TODO: add MAX_COMMAND_SIZE to ev and add mempool
+	 */
+	skb = alloc_skb(NLMSG_SPACE(len), gfp_mask);
+	if (!skb)
+		return -ENOMEM;
+
+	nlh = __nlmsg_put(skb, tgtd_pid, 0, TGT_KEVENT_CMD_REQ,
+			  len - sizeof(*nlh), 0);
+
+	ev = NLMSG_DATA(nlh);
+	ev->k.cmd_req.host_no = shost->host_no;
+	ev->k.cmd_req.cid = cmd->request->tag;
+	ev->k.cmd_req.data_len = cmd->request_bufflen;
+
+	dprintk("%d %u %u\n", ev->k.cmd_req.host_no, ev->k.cmd_req.cid,
+		ev->k.cmd_req.data_len);
+
+	/* FIXME: we need scsi core to do that. */
+	memcpy(cmd->cmnd, cmd->data_cmnd, MAX_COMMAND_SIZE);
+
+	tcmd = (struct tgt_cmd *) ev->data;
+	memcpy(tcmd->scb, cmd->cmnd, sizeof(tcmd->scb));
+	memcpy(tcmd->lun, lun, sizeof(struct scsi_lun));
+
+	err = netlink_unicast(nl_sk, skb, tgtd_pid, 0);
+	if (err < 0)
+		printk(KERN_ERR "scsi_tgt_uspace_send: could not send skb %d\n",
+		       err);
+	return err;
+}
+
+int scsi_tgt_uspace_send_status(struct scsi_cmnd *cmd, gfp_t gfp_mask)
+{
+	struct Scsi_Host