Change in vdsm[master]: multipath: Do not fail I/O after short outage

2016-08-21 Thread amureini
Allon Mureinik has posted comments on this change.

Change subject: multipath: Do not fail I/O after short outage
..


Patch Set 1: Verified+1

Marking as verified based on Elad Ben Aharon's comment on 
https://bugzilla.redhat.com/show_bug.cgi?id=1335176#c31

-- 
To view, visit https://gerrit.ovirt.org/61281
To unsubscribe, visit https://gerrit.ovirt.org/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6496dbdaafca6b110c952fcc5d51bf9ac04d49b4
Gerrit-PatchSet: 1
Gerrit-Project: vdsm
Gerrit-Branch: master
Gerrit-Owner: Nir Soffer 
Gerrit-Reviewer: Adam Litke 
Gerrit-Reviewer: Allon Mureinik 
Gerrit-Reviewer: Jenkins CI
Gerrit-Reviewer: Nir Soffer 
Gerrit-Reviewer: Yaniv Kaul 
Gerrit-Reviewer: gerrit-hooks 
Gerrit-HasComments: No
___
vdsm-patches mailing list
vdsm-patches@lists.fedorahosted.org
https://lists.fedorahosted.org/admin/lists/vdsm-patches@lists.fedorahosted.org


Change in vdsm[master]: multipath: Do not fail I/O after short outage

2016-07-25 Thread nsoffer
Nir Soffer has posted comments on this change.

Change subject: multipath: Do not fail I/O after short outage
..


Patch Set 1:

Yaniv, no_path_retry applies only once all paths have failed, so this adds the 
20-second delay only in that event. The description in the commit message was 
mostly copied from Benjamin Marzinski's mail.
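
As a rough illustration of the arithmetic behind that 20-second figure, here 
is a minimal sketch. The helper name queue_window_seconds is hypothetical (it 
is not part of vdsm or multipath-tools), and the result is only an 
approximation of what multipathd does, assuming paths are checked once per 
polling_interval.

```python
# Hypothetical helper for illustration only; not part of vdsm or multipathd.
def queue_window_seconds(no_path_retry, polling_interval=5):
    """Approximate how long I/O stays queued after the last path fails."""
    if no_path_retry == "fail":
        # Fail I/O immediately once no path is left.
        return 0
    if no_path_retry == "queue":
        # Queue forever until a path comes back.
        return float("inf")
    # Numeric value: number of extra checks, one per polling interval.
    return int(no_path_retry) * polling_interval


print(queue_window_seconds(4))       # the value proposed in this patch
print(queue_window_seconds("fail"))  # the previous default
```

With the proposed setting and the default 5 second polling_interval this 
gives 4 * 5 = 20 seconds, and the window only starts after the last path has 
failed.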



Change in vdsm[master]: multipath: Do not fail I/O after short outage

2016-07-25 Thread ykaul
Yaniv Kaul has posted comments on this change.

Change subject: multipath: Do not fail I/O after short outage
..


Patch Set 1:

In the worst case, the timeout really depends on the path selection 
algorithm (a combination of path_grouping_policy, path_selector, failback, 
prio, rr_weight and some other witchcraft no one really knows). So assuming it 
has (for example) an active-passive set of paths, it may retry each path 4 more 
times at 5-second intervals, one path after another - so the delay might 
be worse. To make things even worse, there's also the SCSI layer timeout 
(/sys/block/<device>/device/timeout). See 
https://www.flagword.net/2014/06/default-linux-io-multipathd-configuration-and-oracle-rac-caveat/
for some explanation.

I think the bottom line (and this is a rule of thumb) is that if the overall 
timeout is larger than 30 seconds, we may hit application timeouts. Therefore 
I'm in favor of 'no_path_retry fail', hoping it'll move quickly to the next 
path and that not all paths have failed. If they have, we have the SCSI layer 
queuing to overcome small outages (I hope), and if not, so be it.
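
The budget argument above can be sketched as a small calculation. This is 
only a model of the pessimistic reading in this comment (every path retried 
one after another, plus the SCSI layer timeout); Nir's reply states that 
no_path_retry counts only after all paths have failed, so multiplying by the 
number of paths is an assumption, not confirmed multipathd behavior. The 
function names and the example numbers are hypothetical.

```python
# Rule-of-thumb application timeout from the comment above, in seconds.
APP_TIMEOUT_BUDGET = 30


def worst_case_delay(paths, no_path_retry, polling_interval, scsi_timeout):
    """Pessimistic estimate: each path is retried in turn (an assumption),
    then the SCSI layer timeout is added on top."""
    return paths * no_path_retry * polling_interval + scsi_timeout


def exceeds_budget(delay, budget=APP_TIMEOUT_BUDGET):
    """Return True if the estimated delay is over the application budget."""
    return delay > budget


# Example numbers only: 2 paths, 4 retries, 5 second polling, and a
# 30 second SCSI timeout.
delay = worst_case_delay(paths=2, no_path_retry=4, polling_interval=5,
                         scsi_timeout=30)
print(delay, exceeds_budget(delay))
```

Under these assumed numbers the estimate is well over the 30 second budget, 
which is the concern raised here. With a numeric no_path_retry of 0 the 
multipath term drops to zero and only the SCSI timeout remains.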



Change in vdsm[master]: multipath: Do not fail I/O after short outage

2016-07-24 Thread automation
gerrit-hooks has posted comments on this change.

Change subject: multipath: Do not fail I/O after short outage
..


Patch Set 1:

* #1335176::Update tracker: OK
* Check Bug-Url::OK
* Check Public Bug::#1335176::OK, public bug
* Check Product::#1335176::OK, Correct product Red Hat Enterprise 
Virtualization Manager
* Check TM::SKIP, not in a monitored branch (ovirt-3.6 ovirt-4.0)
* Check merged to previous::IGNORE, Not in stable branch (['ovirt-3.6', 
'ovirt-4.0'])



Change in vdsm[master]: multipath: Do not fail I/O after short outage

2016-07-24 Thread nsoffer
Nir Soffer has uploaded a new change for review.

Change subject: multipath: Do not fail I/O after short outage
..

multipath: Do not fail I/O after short outage

Since 3.6 we are using "no_path_retry fail" for all devices by default.
This setting is not new, but before 3.6 it was applied only to a few
devices specified in our multipath conf, while other devices used the
defaults hard-coded in the multipathd daemon.

We have seen several reports on the users list that this makes oVirt too
fragile in the face of short storage outages, causing VMs to pause too
quickly.

This patch changes the default setting to "no_path_retry 4". With this
setting, once multipathd notices that the last path has failed, it will
check all paths 4 more times (20 seconds assuming the default 5 second
polling_interval). If no paths are up, it will tell the kernel to stop
queuing.  After that, all outstanding and future I/O will immediately be
failed, until a path is restored. Once a path is restored the 20 second
delay is reset for the next time all paths fail.

This change introduces a 20-second delay after the last path has
failed, before outstanding I/O fails. If there is a short storage
outage on the server, and the server recovers within 20 seconds, no VM
will pause, and commands run by vdsm (e.g. lvm) will not fail.

Change-Id: I6496dbdaafca6b110c952fcc5d51bf9ac04d49b4
Bug-Url: https://bugzilla.redhat.com/1335176
Signed-off-by: Nir Soffer 
---
M lib/vdsm/tool/configurators/multipath.py
1 file changed, 21 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/81/61281/1

diff --git a/lib/vdsm/tool/configurators/multipath.py b/lib/vdsm/tool/configurators/multipath.py
index 00ec558..db5d34f 100644
--- a/lib/vdsm/tool/configurators/multipath.py
+++ b/lib/vdsm/tool/configurators/multipath.py
@@ -38,9 +38,10 @@
 # "VDSM REVISION X.Y" tag.  Note that older version used "RHEV REVISION X.Y"
 # format.
 
-_CURRENT_TAG = "# VDSM REVISION 1.3"
+_CURRENT_TAG = "# VDSM REVISION 1.4"
 
 _OLD_TAGS = (
+"# VDSM REVISION 1.3",
 "# VDSM REVISION 1.2",
 "# RHEV REVISION 1.1",
 "# RHEV REVISION 1.0",
@@ -60,12 +61,19 @@
 _PRIVATE_TAG = "# VDSM PRIVATE"
 _OLD_PRIVATE_TAG = "# RHEV PRIVATE"
 
+# Once multipathd notices that the last path has failed, it will check
+# all paths no_path_retry more times. If no paths are up, it will tell
+# the kernel to stop queuing.  After that, all outstanding and future
+# I/O will immediately be failed, until a path is restored. Once a path
+# is restored the delay is reset for the next time all paths fail.
+_NO_PATH_RETRY = 4
+
 _CONF_DATA = """\
 %(current_tag)s
 
 defaults {
 polling_interval 5
-no_path_retry   fail
+no_path_retry   %(no_path_retry)d
 user_friendly_names no
 flush_on_last_del   yes
 fast_io_fail_tmo 5
@@ -79,10 +87,15 @@
 # These settings overrides built-in devices settings. It does not apply
 # to devices without built-in settings (these use the settings in the
 # "defaults" section), or to devices defined in the "devices" section.
-# Note: This is not available yet on Fedora 21. For more info see
-# https://bugzilla.redhat.com/1253799
 all_devs yes
-no_path_retry   fail
+
+# Once multipathd notices that the last path has failed, it will
+# check all paths no_path_retry more times. If no paths are up,
+# it will tell the kernel to stop queuing.  After that, all
+# outstanding and future I/O will immediately be failed, until a
+# path is restored. Once a path is restored the delay is reset
+# for the next time all paths fail.
+no_path_retry   %(no_path_retry)d
 }
 }
 
@@ -91,10 +104,11 @@
 # multipathd.
 #
 # overrides {
-#  no_path_retry   fail
+#  no_path_retry   %(no_path_retry)d
 # }
 
-""" % {"current_tag": _CURRENT_TAG}
+""" % {"current_tag": _CURRENT_TAG,
+   "no_path_retry": _NO_PATH_RETRY}
 
 # If multipathd is up, it will be reloaded after configuration,
 # or started before vdsm starts, so service should not be stopped

