Hi Guan,
>> If the patch is OK for you, can I add your Reviewed-by tag into this patch?
The patch is OK with me.
Regards,
Muneendra.
-----Original Message-----
From: Guan Junxiong [mailto:[email protected]]
Sent: Monday, October 09, 2017 6:13 AM
To: Muneendra Kumar M <[email protected]>
Cc: Shenhong (C) <[email protected]>; niuhaoxin <[email protected]>;
Martin Wilck <[email protected]>; Christophe Varoqui
<[email protected]>; [email protected]
Subject: Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting
to improve reliability
Hi Muneendra,
Sorry for the late reply; it was the National Holiday here.
On 2017/10/6 13:54, Muneendra Kumar M wrote:
> Hi Guan,
> Did you push the patch to mainline.
> If so can you just provide me those details.
> If not can you just let me know the status.
>
Yes, I posted version 6 of the patch to the mailing list, but it has not been
merged yet. It is still awaiting review.
You can find it at this link:
https://www.redhat.com/archives/dm-devel/2017-September/msg00296.html
> A couple of our clients are already using the previous patch (san_path_XX).
> Once your patch is pushed, I can give them the updated patch and have them
> test it.
>
If the patch is OK for you, can I add your Reviewed-by tag to this patch?
Regards,
Guan
> Regards,
> Muneendra.
>
>
> -----Original Message-----
> From: Muneendra Kumar M
> Sent: Thursday, September 21, 2017 3:41 PM
> To: 'Guan Junxiong' <[email protected]>; Martin Wilck
> <[email protected]>; [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected]
> Subject: RE: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting
> to improve reliability
>
> Hi Guan,
> Thanks for adopting the naming convention.
> Instead of marginal_path_err_recheck_gap_time, marginal_path_recovery_time
> would look more reasonable. Could you please take another look at it?
>
> I will review the code within a day.
>
> Regards,
> Muneendra.
>
> -----Original Message-----
> From: Guan Junxiong [mailto:[email protected]]
> Sent: Thursday, September 21, 2017 3:35 PM
> To: Muneendra Kumar M <[email protected]>; Martin Wilck <[email protected]>;
> [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting
> to improve reliability
>
> Hi, Muneendra
>
> Thanks for your clarification. I have adopted this renaming. If it is
> convenient for you, please review the V5 patch that I sent out two hours ago.
>
> Regards,
> Guan
>
> On 2017/9/20 20:58, Muneendra Kumar M wrote:
>> Hi Guan,
>>>>> Shall we use existing PATH_SHAKY ?
>> Since PATH_SHAKY indicates a path that is not available for "normal"
>> operations, we can use this state. That's a good idea.
>>
>> Regarding marginal paths, my explanation is below. Brocade is also
>> publishing a couple of white papers on this topic to educate SAN
>> administrators and the SAN community.
>>
>> Marginal path:
>>
>> A host/target/LUN (ITL path) flow goes through the SAN. Note that each I/O
>> request that goes to the SCSI layer is transformed into a single SCSI
>> exchange. In a single SAN, there are typically multiple network paths for
>> an ITL flow/path, and each SCSI exchange can take any of the network paths
>> available for that ITL path. A SAN can be based on Ethernet, FC, or
>> InfiniBand physical networks to carry block storage traffic (SCSI, NVMe,
>> etc.).
>>
>> There are typically two types of SAN network problems that are categorized
>> as marginal issues. By nature these issues are not permanent; they come and
>> go over time.
>> 1) Switches in the SAN can have intermittent frame drops or frame
>> corruption due to a bad optics cable (SFP) or similar wear-and-tear port
>> issues. This causes ITL flows that go through the faulty switch/port to
>> intermittently experience frame drops.
>> 2) There are SAN topologies in which certain switch ports in the fabric
>> become the only conduit for many different ITL flows across multiple hosts.
>> These single network paths are essentially shared across multiple ITL
>> flows. Under these conditions, if the port link bandwidth cannot handle the
>> combined bandwidth of the shared ITL flows going through the single path,
>> we can see intermittent network congestion. This condition is called
>> network oversubscription. The intermittent congestion can delay SCSI
>> exchange completion (an increase in I/O latency is observed).
>>
>> To overcome the above network issues and many other such target issues,
>> frame-level retries are done in HBA device firmware and I/O retries in the
>> SCSI layer. These retries might succeed for one of several reasons:
>> 1) The intermittent switch/port issue is not hit again.
>> 2) The retry I/O is a new SCSI exchange, which can take an alternate SAN
>> path for the ITL flow, if such a path exists.
>> 3) The network congestion disappears momentarily because the combined I/O
>> bandwidth from the ITL flows sharing the single network path is something
>> the path can handle.
>>
>> However, in some cases we have seen I/O retries fail because the retry
>> I/Os hit a SAN network path with an intermittent switch/port issue and/or
>> network congestion.
>>
>> On the host, we thus see configurations with two or more ITL paths to the
>> same target/LUN going through two or more HBA ports, with those HBA ports
>> connected through two or more SANs to the same target/LUN.
>> If an I/O fails at the multipath layer, the ITL path is put into the
>> Failed state. Because of the marginal nature of the network, the next
>> health-check command sent from the multipath layer might succeed, which
>> puts the ITL path back into the Active state. You end up seeing the DM
>> path state cycling through Active, Failed, Active transitions. This causes
>> an overall reduction in application I/O throughput and sometimes
>> application I/O failures (because of timing constraints). All of this can
>> happen because of I/O retries and I/O requests moving across multiple
>> paths of the DM device. Note that on the host, both I/O retries on a
>> single path and I/O movement across multiple paths slow down the forward
>> progress of new application I/O, because these re-queue actions are given
>> higher priority than newer I/O requests coming from the application.
>>
>> The above condition of the ITL path is hence called “marginal”.
>>
>> What we want is for DM to deterministically categorize an ITL path as
>> "marginal" and to move all pending I/Os from the marginal path to an
>> active path. This will help meet application I/O timing constraints. We
>> also want the capability to automatically reinstate the marginal path as
>> Active once the marginal condition in the network is fixed.
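[Editor's note: the detection-and-recovery behavior described above can be sketched as a small state machine. The Python below is purely illustrative and is not the actual patch code; the class name, thresholds, and sliding-window logic are all hypothetical.]

```python
# Illustrative model of intermittent I/O error accounting: a path is
# declared "marginal" when it accumulates err_rate_threshold errors
# within an err_sample_time-second window, and is reinstated only after
# it stays error-free for recovery_time seconds. All names are made up.
from collections import deque


class PathErrorAccounting:
    def __init__(self, err_sample_time=60, err_rate_threshold=3,
                 recovery_time=30):
        self.err_sample_time = err_sample_time
        self.err_rate_threshold = err_rate_threshold
        self.recovery_time = recovery_time
        self.errors = deque()       # timestamps of recent I/O errors
        self.marginal_since = None  # time the path was declared marginal

    def record_error(self, now):
        self.errors.append(now)
        # Drop errors that have fallen out of the sampling window.
        while self.errors and now - self.errors[0] > self.err_sample_time:
            self.errors.popleft()
        if len(self.errors) >= self.err_rate_threshold:
            self.marginal_since = now  # declare (or re-declare) marginal

    def is_marginal(self, now):
        if self.marginal_since is None:
            return False
        # Reinstate the path after a quiet recovery period.
        last_error = self.errors[-1] if self.errors else self.marginal_since
        if now - max(last_error, self.marginal_since) >= self.recovery_time:
            self.marginal_since = None
            self.errors.clear()
            return False
        return True
```

In this toy model, pending I/O would be steered away from any path for which is_marginal() returns True, matching the "move I/Os off the marginal path, then reinstate later" behavior described above.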
>>
>>
>> Based on the above explanation, I want to rename the options to
>> marginal_path_XXXX, irrespective of the storage network.
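[Editor's note: as an illustration of the proposed marginal_path_* naming, a multipath.conf fragment could look like the sketch below. The exact option set and semantics are defined by the patch itself; the values here are placeholders.]

```
defaults {
    # Count path errors over a sampling window of this many seconds ...
    marginal_path_err_sample_time        120
    # ... and declare the path marginal if its error rate exceeds
    # this threshold during the window.
    marginal_path_err_rate_threshold     10
    # Wait this many seconds before re-checking a marginal path.
    marginal_path_err_recheck_gap_time   300
}
```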
>>
>> Regards,
>> Muneendra.
>
--
dm-devel mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/dm-devel