Re: [PATCH v4] iscsi: Perform connection failure entirely in kernel space

2020-01-23 Thread The Lee-Man
On Wednesday, January 15, 2020 at 7:52:39 PM UTC-8, Martin K. Petersen 
wrote:
>
>
> > Please consider the v4 below with the lock added. 
>
> Lee: Please re-review this given the code change. 
>

Martin:

The recent change makes sense, so please still include my:

Reviewed-by: Lee Duncan 

>
> > From: Bharath Ravi  
> > 
> > Connection failure processing depends on a daemon being present to (at 
> > least) stop the connection and start recovery.  This is a problem on a 
> > multipath scenario, where if the daemon failed for whatever reason, the 
> > SCSI path is never marked as down, multipath won't perform the 
> > failover and IO to the device will be forever waiting for that 
> > connection to come back. 
> > 
> > This patch performs the connection failure entirely inside the kernel. 
> > This way, the failover can happen and pending IO can continue even if 
> > the daemon is dead. Once the daemon comes alive again, it can execute 
> > recovery procedures if applicable. 
>
> -- 
> Martin K. PetersenOracle Linux Engineering 
>

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to open-iscsi+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/open-iscsi/1ece4cbc-29df-4b10-952d-a8d829e24e6f%40googlegroups.com.


Re: [PATCH v4] iscsi: Perform connection failure entirely in kernel space

2020-01-15 Thread Martin K. Petersen


> Please consider the v4 below with the lock added.

Lee: Please re-review this given the code change.

> From: Bharath Ravi 
>
> Connection failure processing depends on a daemon being present to (at
> least) stop the connection and start recovery.  This is a problem on a
> multipath scenario, where if the daemon failed for whatever reason, the
> SCSI path is never marked as down, multipath won't perform the
> failover and IO to the device will be forever waiting for that
> connection to come back.
>
> This patch performs the connection failure entirely inside the kernel.
> This way, the failover can happen and pending IO can continue even if
> the daemon is dead. Once the daemon comes alive again, it can execute
> recovery procedures if applicable.

-- 
Martin K. Petersen  Oracle Linux Engineering

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to open-iscsi+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/open-iscsi/yq1tv4wnjm2.fsf%40oracle.com.


Re: [PATCH v4] iscsi: Perform connection failure entirely in kernel space

2020-01-03 Thread 'Khazhismel Kumykov' via open-iscsi
On Fri, Jan 3, 2020 at 2:26 PM Gabriel Krisman Bertazi
 wrote:
> Please consider the v4 below with the lock added.
>
Reviewed-by: Khazhismel Kumykov 

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to open-iscsi+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/open-iscsi/CACGdZYKsGk-kD7aO%3DbCSUzsFkX12xPkB3D2XDYGgDE4gD%2B1cmA%40mail.gmail.com.


smime.p7s
Description: S/MIME Cryptographic Signature


[PATCH v4] iscsi: Perform connection failure entirely in kernel space

2020-01-03 Thread Gabriel Krisman Bertazi
Khazhismel Kumykov  writes:

>> >> +   if (!list_empty(>conn_list_err))
>> > Does this check need to be under connlock?
>>
>> My understanding is that it is not necessary, since it is serialized
>> against the conn removal itself, through the rx_mutex, it seemed safe to
>> do the verification lockless.
>>
>> It can only race with the insertion, in which case, it will be safely
>> removed from the dispatch list here, under rx_mutex, and the worker will
>> detect and skipped it.
>
> My worry is the splice, which is under only connlock, not rx_mutex, which
> might lead to UB if we're checking empty while modifying the list_head ?

Hi,

Please consider the v4 below with the lock added.

-- >8 --
From: Bharath Ravi 

Connection failure processing depends on a daemon being present to (at
least) stop the connection and start recovery.  This is a problem on a
multipath scenario, where if the daemon failed for whatever reason, the
SCSI path is never marked as down, multipath won't perform the
failover and IO to the device will be forever waiting for that
connection to come back.

This patch performs the connection failure entirely inside the kernel.
This way, the failover can happen and pending IO can continue even if
the daemon is dead. Once the daemon comes alive again, it can execute
recovery procedures if applicable.

Changes since v3:
  - Protect list_empty with connlock on session destroy

Changes since v2:
  - Don't hold rx_mutex for too long at once

Changes since v1:
  - Remove module parameter.
  - Always do kernel-side stop work.
  - Block recovery timeout handler if system is dying.
  - send a CONN_TERM stop if the system is dying.

Cc: Mike Christie 
Cc: Lee Duncan 
Cc: Bart Van Assche 
Reviewed-by: Lee Duncan 
Co-developed-by: Dave Clausen 
Signed-off-by: Dave Clausen 
Co-developed-by: Nick Black 
Signed-off-by: Nick Black 
Co-developed-by: Vaibhav Nagarnaik 
Signed-off-by: Vaibhav Nagarnaik 
Co-developed-by: Anatol Pomazau 
Signed-off-by: Anatol Pomazau 
Co-developed-by: Tahsin Erdogan 
Signed-off-by: Tahsin Erdogan 
Co-developed-by: Frank Mayhar 
Signed-off-by: Frank Mayhar 
Co-developed-by: Junho Ryu 
Signed-off-by: Junho Ryu 
Co-developed-by: Khazhismel Kumykov 
Signed-off-by: Khazhismel Kumykov 
Signed-off-by: Bharath Ravi 
Co-developed-by: Gabriel Krisman Bertazi 
Signed-off-by: Gabriel Krisman Bertazi 
---
 drivers/scsi/scsi_transport_iscsi.c | 68 +
 include/scsi/scsi_transport_iscsi.h |  1 +
 2 files changed, 69 insertions(+)

diff --git a/drivers/scsi/scsi_transport_iscsi.c 
b/drivers/scsi/scsi_transport_iscsi.c
index 271afea654e2..ba6cfaf71aef 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -86,6 +86,12 @@ struct iscsi_internal {
struct transport_container session_cont;
 };
 
+/* Worker to perform connection failure on unresponsive connections
+ * completely in kernel space.
+ */
+static void stop_conn_work_fn(struct work_struct *work);
+static DECLARE_WORK(stop_conn_work, stop_conn_work_fn);
+
 static atomic_t iscsi_session_nr; /* sysfs session id for next new session */
 static struct workqueue_struct *iscsi_eh_timer_workq;
 
@@ -1611,6 +1617,7 @@ static DEFINE_MUTEX(rx_queue_mutex);
 static LIST_HEAD(sesslist);
 static DEFINE_SPINLOCK(sesslock);
 static LIST_HEAD(connlist);
+static LIST_HEAD(connlist_err);
 static DEFINE_SPINLOCK(connlock);
 
 static uint32_t iscsi_conn_get_sid(struct iscsi_cls_conn *conn)
@@ -2247,6 +2254,7 @@ iscsi_create_conn(struct iscsi_cls_session *session, int 
dd_size, uint32_t cid)
 
mutex_init(>ep_mutex);
INIT_LIST_HEAD(>conn_list);
+   INIT_LIST_HEAD(>conn_list_err);
conn->transport = transport;
conn->cid = cid;
 
@@ -2293,6 +2301,7 @@ int iscsi_destroy_conn(struct iscsi_cls_conn *conn)
 
spin_lock_irqsave(, flags);
list_del(>conn_list);
+   list_del(>conn_list_err);
spin_unlock_irqrestore(, flags);
 
transport_unregister_device(>dev);
@@ -2407,6 +2416,51 @@ int iscsi_offload_mesg(struct Scsi_Host *shost,
 }
 EXPORT_SYMBOL_GPL(iscsi_offload_mesg);
 
+static void stop_conn_work_fn(struct work_struct *work)
+{
+   struct iscsi_cls_conn *conn, *tmp;
+   unsigned long flags;
+   LIST_HEAD(recovery_list);
+
+   spin_lock_irqsave(, flags);
+   if (list_empty(_err)) {
+   spin_unlock_irqrestore(, flags);
+   return;
+   }
+   list_splice_init(_err, _list);
+   spin_unlock_irqrestore(, flags);
+
+   list_for_each_entry_safe(conn, tmp, _list, conn_list_err) {
+   uint32_t sid = iscsi_conn_get_sid(conn);
+   struct iscsi_cls_session *session;
+
+   mutex_lock(_queue_mutex);
+
+   session = iscsi_session_lookup(sid);
+   if (session) {
+   if (system_state != SYSTEM_RUNNING) {
+   session->recovery_tmo = 0;
+