Change in vdsm[master]: multipath: Do not fail I/O after short outage
Nir Soffer has uploaded a new change for review.

Change subject: multipath: Do not fail I/O after short outage
......................................................................

multipath: Do not fail I/O after short outage

Since 3.6 we are using "no_path_retry fail" for all devices by default.
This setting is not new, but before 3.6 it was applied only to a few
devices specified in our multipath conf, while other devices used the
defaults hard-coded in the multipathd daemon.

We have seen several reports in the user list that this makes ovirt too
fragile during short storage outages, causing vms to pause too quickly.

This patch changes the default setting to "no_path_retry 4". With this
setting, once multipathd notices that the last path has failed, it will
check all paths 4 more times (20 seconds assuming the default 5 second
polling_interval). If no paths are up, it will tell the kernel to stop
queuing. After that, all outstanding and future I/O will immediately be
failed, until a path is restored. Once a path is restored the 20 second
delay is reset for the next time all paths fail.

This change will introduce a 20 second delay after the last path has
failed, before outstanding I/O will fail. If there is a short storage
outage on the server, and the server can recover within 20 seconds, no
vm will pause, and commands run by vdsm (e.g. lvm) will not fail.

Change-Id: I6496dbdaafca6b110c952fcc5d51bf9ac04d49b4
Bug-Url: https://bugzilla.redhat.com/1335176
Signed-off-by: Nir Soffer
---
M lib/vdsm/tool/configurators/multipath.py
1 file changed, 21 insertions(+), 7 deletions(-)

  git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/81/61281/1

diff --git a/lib/vdsm/tool/configurators/multipath.py b/lib/vdsm/tool/configurators/multipath.py
index 00ec558..db5d34f 100644
--- a/lib/vdsm/tool/configurators/multipath.py
+++ b/lib/vdsm/tool/configurators/multipath.py
@@ -38,9 +38,10 @@
 # "VDSM REVISION X.Y" tag. Note that older version used "RHEV REVISION X.Y"
 # format.
 
-_CURRENT_TAG = "# VDSM REVISION 1.3"
+_CURRENT_TAG = "# VDSM REVISION 1.4"
 
 _OLD_TAGS = (
+    "# VDSM REVISION 1.3",
     "# VDSM REVISION 1.2",
     "# RHEV REVISION 1.1",
     "# RHEV REVISION 1.0",
@@ -60,12 +61,19 @@
 _PRIVATE_TAG = "# VDSM PRIVATE"
 _OLD_PRIVATE_TAG = "# RHEV PRIVATE"
 
+# Once multipathd notices that the last path has failed, it will check
+# all paths no_path_retry more times. If no paths are up, it will tell
+# the kernel to stop queuing. After that, all outstanding and future
+# I/O will immediately be failed, until a path is restored. Once a path
+# is restored the delay is reset for the next time all paths fail.
+_NO_PATH_RETRY = 4
+
 _CONF_DATA = """\
 %(current_tag)s
 
 defaults {
     polling_interval            5
-    no_path_retry               fail
+    no_path_retry               %(no_path_retry)d
     user_friendly_names         no
     flush_on_last_del           yes
     fast_io_fail_tmo            5
@@ -79,10 +87,15 @@
         # These settings overrides built-in devices settings. It does not apply
         # to devices without built-in settings (these use the settings in the
         # "defaults" section), or to devices defined in the "devices" section.
-        # Note: This is not available yet on Fedora 21. For more info see
-        # https://bugzilla.redhat.com/1253799
         all_devs                yes
-        no_path_retry           fail
+
+        # Once multipathd notices that the last path has failed, it will
+        # check all paths no_path_retry more times. If no paths are up,
+        # it will tell the kernel to stop queuing. After that, all
+        # outstanding and future I/O will immediately be failed, until a
+        # path is restored. Once a path is restored the delay is reset
+        # for the next time all paths fail.
+        no_path_retry           %(no_path_retry)d
     }
 }
 
@@ -91,10 +104,11 @@
 # multipathd.
 #
 # overrides {
-#      no_path_retry            fail
+#      no_path_retry            %(no_path_retry)d
 # }
 
-""" % {"current_tag": _CURRENT_TAG}
+""" % {"current_tag": _CURRENT_TAG,
+       "no_path_retry": _NO_PATH_RETRY}
 
 # If multipathd is up, it will be reloaded after configuration,
 # or started before vdsm starts, so service should not be stopped

--
To view, visit https://gerrit.ovirt.org/61281
To unsubscribe, visit https://gerrit.ovirt.org/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6496dbdaafca6b110c952fcc5d51bf9ac04d49b4
Gerrit-PatchSet: 1
Gerrit-Project: vdsm
Gerrit-Branch: master
Gerrit-Owner: Nir Soffer
_______________________________________________
vdsm-patches mailing list
vdsm-patches@lists.fedorahosted.org
https://lists.fedorahosted.org/admin/lists/vdsm-patches@lists.fedorahosted.org
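The 20 second figure in the commit message is simply no_path_retry multiplied by polling_interval. The sketch below illustrates that arithmetic together with the "%" dictionary substitution the patch uses to fill in the template. It is a trimmed-down illustration, not the vdsm code: CONF_TEMPLATE, render_conf and queue_seconds are hypothetical names, and the template contains only two of the settings from the real "defaults" section.

```python
# Values taken from the commit message: 4 retries, 5 second polling interval.
NO_PATH_RETRY = 4
POLLING_INTERVAL = 5

# Hypothetical, shortened template; the real one in multipath.py has
# more settings and a revision tag on the first line.
CONF_TEMPLATE = """\
defaults {
    polling_interval            %(polling_interval)d
    no_path_retry               %(no_path_retry)d
}
"""


def render_conf(no_path_retry=NO_PATH_RETRY,
                polling_interval=POLLING_INTERVAL):
    """Fill the template using the same "%" dict formatting as the patch."""
    return CONF_TEMPLATE % {
        "no_path_retry": no_path_retry,
        "polling_interval": polling_interval,
    }


def queue_seconds(no_path_retry, polling_interval):
    """Approximate time I/O stays queued after the last path has failed."""
    return no_path_retry * polling_interval


if __name__ == "__main__":
    print(render_conf())
    print("queued for about %d seconds"
          % queue_seconds(NO_PATH_RETRY, POLLING_INTERVAL))
```

Keeping the retry count in one constant, as the patch does with _NO_PATH_RETRY, means the "defaults" and "devices" sections cannot drift apart when the value is tuned later.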
Change in vdsm[master]: multipath: Do not fail I/O after short outage
gerrit-hooks has posted comments on this change.

Change subject: multipath: Do not fail I/O after short outage
......................................................................


Patch Set 1:

* #1335176::Update tracker: OK
* Check Bug-Url::OK
* Check Public Bug::#1335176::OK, public bug
* Check Product::#1335176::OK, Correct product Red Hat Enterprise Virtualization Manager
* Check TM::SKIP, not in a monitored branch (ovirt-3.6 ovirt-4.0)
* Check merged to previous::IGNORE, Not in stable branch (['ovirt-3.6', 'ovirt-4.0'])

--
Gerrit-MessageType: comment
Gerrit-Change-Id: I6496dbdaafca6b110c952fcc5d51bf9ac04d49b4
Gerrit-PatchSet: 1
Gerrit-Project: vdsm
Gerrit-Branch: master
Gerrit-Owner: Nir Soffer
Gerrit-Reviewer: gerrit-hooks
Gerrit-HasComments: No
Change in vdsm[master]: multipath: Do not fail I/O after short outage
Yaniv Kaul has posted comments on this change.

Change subject: multipath: Do not fail I/O after short outage
......................................................................


Patch Set 1:

In the worst case scenario, the timeout really depends on the path
selection algorithm (a combination of path_grouping_policy,
path_selector, failback, prio, rr_weight and some other witchcraft no
one really knows). So assuming it has (for example) an active-passive
set of paths, it'll retry each path 4 more times for 5 seconds, one by
one in the worst case scenario - so the delay might be worse.

To make things even worse, there's also the SCSI layer timeout
(/sys/block//device/timeout). See
https://www.flagword.net/2014/06/default-linux-io-multipathd-configuration-and-oracle-rac-caveat/
for some explanation.

I think the bottom line is (and that is a rule of thumb) that if the
overall timeout is larger than 30 seconds, we may face application
timeouts - therefore I'm in favor of 'no_path_retry fail' - hoping it'll
move quickly to the next path, hoping not all paths failed. If they did,
we have the SCSI layer queuing to overcome small outages (I hope) and if
not - so be it.

--
Gerrit-MessageType: comment
Gerrit-Change-Id: I6496dbdaafca6b110c952fcc5d51bf9ac04d49b4
Gerrit-PatchSet: 1
Gerrit-Project: vdsm
Gerrit-Branch: master
Gerrit-Owner: Nir Soffer
Gerrit-Reviewer: Adam Litke
Gerrit-Reviewer: Allon Mureinik
Gerrit-Reviewer: Jenkins CI
Gerrit-Reviewer: Yaniv Kaul
Gerrit-Reviewer: gerrit-hooks
Gerrit-HasComments: No
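The comment above points at the per-device SCSI command timeout exposed through sysfs. The following is a minimal sketch for inspecting that value, not part of the patch. It assumes a SCSI-backed block device; the function name scsi_timeout, the sysfs_root parameter, and the example device name are all hypothetical. The device name is a parameter because the comment does not name one.

```python
import os


def scsi_timeout(device, sysfs_root="/sys/block"):
    """Return the SCSI command timeout, in seconds, for a block device.

    device is a kernel block device name (for example "sda"; this name
    is an assumption for illustration). sysfs_root is configurable only
    so the function can be exercised against a fake directory tree.
    """
    path = os.path.join(sysfs_root, device, "device", "timeout")
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    import sys
    # Usage: python scsi_timeout.py <device>
    print(scsi_timeout(sys.argv[1]))
```

Comparing this value with the multipath retry window is one way to check an overall delay against the 30 second rule of thumb mentioned in the comment; how the two delays actually combine depends on the setup and is not modelled here.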
Change in vdsm[master]: multipath: Do not fail I/O after short outage
Nir Soffer has posted comments on this change.

Change subject: multipath: Do not fail I/O after short outage
......................................................................


Patch Set 1:

Yaniv, no_path_retry is used only when all paths have failed, so this
adds a 20 second delay only in that event.

The description in the commit message was mostly copied from Benjamin
Marzinski's mail.

--
Gerrit-MessageType: comment
Gerrit-Change-Id: I6496dbdaafca6b110c952fcc5d51bf9ac04d49b4
Gerrit-PatchSet: 1
Gerrit-Project: vdsm
Gerrit-Branch: master
Gerrit-Owner: Nir Soffer
Gerrit-Reviewer: Adam Litke
Gerrit-Reviewer: Allon Mureinik
Gerrit-Reviewer: Jenkins CI
Gerrit-Reviewer: Nir Soffer
Gerrit-Reviewer: Yaniv Kaul
Gerrit-Reviewer: gerrit-hooks
Gerrit-HasComments: No
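The point of the reply above, that the retry countdown starts only once every path is down and is reset as soon as any path returns, can be illustrated with a small simulation. This is a simplified model based on the description in the commit message, not multipathd's actual implementation; the function name poll_when_queuing_stops and the list-of-counts input are hypothetical.

```python
def poll_when_queuing_stops(paths_up_per_poll, no_path_retry=4):
    """Return the index of the poll at which queuing would be disabled.

    paths_up_per_poll lists how many paths were usable at each polling
    interval. Returns None if queuing is never disabled.
    """
    retries_left = None  # None: at least one path was up at the last poll
    for i, paths_up in enumerate(paths_up_per_poll):
        if paths_up > 0:
            # Some path is usable: no countdown, or the countdown is reset.
            retries_left = None
        elif retries_left is None:
            # The last path was just seen failing: start the countdown.
            retries_left = no_path_retry
        else:
            retries_left -= 1
            if retries_left == 0:
                return i  # still no path after all retries: stop queuing
    return None


if __name__ == "__main__":
    # One of two paths fails: the countdown never starts.
    print(poll_when_queuing_stops([2, 1, 1, 1, 1, 1, 1]))
    # All paths fail and stay down: queuing stops 4 polls after detection.
    print(poll_when_queuing_stops([2, 0, 0, 0, 0, 0]))
    # A path returns before the retries run out: the countdown is reset.
    print(poll_when_queuing_stops([2, 0, 0, 0, 1, 0, 0, 0, 0]))
```

With a 5 second polling_interval, the second case corresponds to the 20 second window described in the commit message, while the first case shows that a partial path failure adds no delay in this model.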
Change in vdsm[master]: multipath: Do not fail I/O after short outage
Allon Mureinik has posted comments on this change.

Change subject: multipath: Do not fail I/O after short outage
......................................................................


Patch Set 1: Verified+1

Marking as verified based on Elad Ben Aharon's comment on
https://bugzilla.redhat.com/show_bug.cgi?id=1335176#c31

--
Gerrit-MessageType: comment
Gerrit-Change-Id: I6496dbdaafca6b110c952fcc5d51bf9ac04d49b4
Gerrit-PatchSet: 1
Gerrit-Project: vdsm
Gerrit-Branch: master
Gerrit-Owner: Nir Soffer
Gerrit-Reviewer: Adam Litke
Gerrit-Reviewer: Allon Mureinik
Gerrit-Reviewer: Jenkins CI
Gerrit-Reviewer: Nir Soffer
Gerrit-Reviewer: Yaniv Kaul
Gerrit-Reviewer: gerrit-hooks
Gerrit-HasComments: No