Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-27 Thread Martin K. Petersen

Dave,

>> 
> Previously we had used sleep to delay until the controller got its
> mind back, but early testing indicated it wasn't needed. I'm good with
> this.

Applied to 4.14/scsi-fixes. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


RE: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-27 Thread Dave Carroll

> 
> 
> Guilherme,
> 
> > James/Martin, am I expected to send a v2 with some change? Perhaps
> > with Dave's ack?  Sorry to annoy, thanks in advance for any advice!
> 
> I was just about to mail Dave and ask for confirmation that your 
> interpretation
> of the controller behavior is correct.
> 
> Dave?
> 
 Hi Martin,

Previously we had used sleep to delay until the controller got its mind back, 
but early testing indicated it wasn't needed. I'm  good with this.

-Dave


Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-25 Thread Martin K. Petersen

Guilherme,

> James/Martin, am I expected to send a v2 with some change? Perhaps
> with Dave's ack?  Sorry to annoy, thanks in advance for any advice!

I was just about to mail Dave and ask for confirmation that your
interpretation of the controller behavior is correct.

Dave?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-25 Thread Guilherme G. Piccoli
On 09/21/2017 01:19 PM, Dave Carroll wrote:
>> [...]
>> ---
> Acked-by: Dave Carroll 
> 

Thanks Dave!

James/Martin, am I expected to send a v2 with some change? Perhaps with
Dave's ack?
Sorry to annoy, thanks in advance for any advice!

Cheers,


Guilherme



RE: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-21 Thread Dave Carroll
> From: Guilherme G. Piccoli [mailto:gpicc...@linux.vnet.ibm.com]
> Sent: Tuesday, September 19, 2017 9:12 AM
> To: dl-esc-Aacraid Linux Driver ; linux-
> s...@vger.kernel.org
> Cc: gpicc...@linux.vnet.ibm.com; Raghava Aditya Renukunta
> ; Dave Carroll
> ; brk...@linux.vnet.ibm.com;
> dougm...@linux.vnet.ibm.com; sta...@vger.kernel.org
> Subject: [PATCH] scsi: aacraid: Add a small delay after IOP reset
> 
> Commit 0e9973ed3382 ("scsi: aacraid: Add periodic checks to see IOP reset
> status") changed the way driver checks if a reset succeeded. Now, after an IOP
> reset, aacraid immediately start polling a register to verify the reset is 
> complete.
> 
> This behavior cause regressions on the reset path in PowerPC (at least).
> Since the delay after the IOP reset was removed by the aforementioned patch,
> the fact driver just starts to read a register instantly after the reset was 
> issued
> (by writing in another register) "corrupts" the reset procedure, which ends up
> failing all the time.
> 
> The issue highly impacted kdump on PowerPC, since on kdump path we
> proactively issue a reset in adapter (through the reset_devices kernel
> parameter).
> 
> This patch (re-)adds a delay right after IOP reset is issued. Empirically we
> measured that 3 seconds is enough, but for safety reasons we delay for 5s (and
> since it was 30s before, 5s is still a small amount).
> 
> For reference, without this patch we observe the following messages on kdump
> kernel boot process:
> 
>   [ 76.294] aacraid 0003:01:00.0: IOP reset failed
>   [ 76.294] aacraid 0003:01:00.0: ARC Reset attempt failed
>   [ 86.524] aacraid 0003:01:00.0: adapter kernel panic'd ff.
>   [ 86.524] aacraid 0003:01:00.0: Controller reset type is 3
>   [ 86.524] aacraid 0003:01:00.0: Issuing IOP reset
>   [146.534] aacraid 0003:01:00.0: IOP reset failed
>   [146.534] aacraid 0003:01:00.0: ARC Reset attempt failed
> 
> Fixes: 0e9973ed3382 ("scsi: aacraid: Add periodic checks to see IOP reset
> status")
> Cc: sta...@vger.kernel.org # v4.13+
> Signed-off-by: Guilherme G. Piccoli 
> ---
Acked-by: Dave Carroll 


Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Guilherme G. Piccoli
On 09/19/2017 02:05 PM, James Bottomley wrote:
> Actually, the whole problem sounds like a posted write.  Likely the
> write that causes the reset doesn't get flushed until the read checking
> if the reset has succeeded, which might explain the 100% initial
> failure.  Why not throw away that first value if it's a failure and
> then do your polled wait and timeout on the reset success.  We should
> anyway be waiting some time for a reset to be issued, so even on non-
> posted write systems we could see this problem intermittently.
> 
> James
> 

Thanks for this suggestion James.

I tried to remove the sleep and did a dummy read to register using
readl() - issue reproduced.

I did expect that, since in aac_is_ctrl_up_and_running() we indeed read
a register and if it shows us reset is not complete, we wait and read it
again. So, we can think in this 1st read as a dummy one heheh

My theory here is that we're observing a failure similar to one we
already did in some specific NVMe adapters - the readl before some delay
(in nvme it was 2s) corrupts the adapter FW procedure. It's as if the
adapter doesn't like to deal with this read while the reset procedure is
ongoing. So, we wait a bit to issue a readl and everything goes fine.

Cheers,


Guilherme



Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread James Bottomley
On Tue, 2017-09-19 at 08:52 -0700, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 12:49:21PM -0300, Guilherme G. Piccoli wrote:
> > 
> > On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
> > > 
> > > On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli
> > > wrote:
> > > > 
> > > >     src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
> > > > +
> > > > +   msleep(5000);
> > > 
> > > src_writel is a writel, and thus a posted MMIO write.  You'll
> > > need
> > > to have to a read first to make it a reliable timing base.
> > > 
> > 
> > Just for my full understanding - you're saying a readl BEFORE
> > src_writel() or AFTER src_writel() ?
> 
> AFTER.

Actually, the whole problem sounds like a posted write.  Likely the
write that causes the reset doesn't get flushed until the read checking
if the reset has succeeded, which might explain the 100% initial
failure.  Why not throw away that first value if it's a failure and
then do your polled wait and timeout on the reset success.  We should
anyway be waiting some time for a reset to be issued, so even on non-
posted write systems we could see this problem intermittently.

James



Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Guilherme G. Piccoli
On 09/19/2017 12:52 PM, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 12:49:21PM -0300, Guilherme G. Piccoli wrote:
>> On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
>>> On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
 +
 +  msleep(5000);
>>>
>>> src_writel is a writel, and thus a posted MMIO write.  You'll need
>>> to have to a read first to make it a reliable timing base.
>>>
>>
>> Just for my full understanding - you're saying a readl BEFORE
>> src_writel() or AFTER src_writel() ?
> 
> AFTER.

Thanks!

> 
>> I could add a read to some dummy register, but notice it was a sequence
>> of readl's on aac_is_ctrl_up_and_running() that caused the failure of
>> reset...
> 
> Oh, ouch.  I guess in that case we'll need to do the writel and pray,
> but that would need a big comment explaining what's going on there.
> 

Heheh you're right!

But do you remember that quirk added on nvme? Basically, it was a
similar scenario in which some adapters weren't happy in poll a register
in nvme_wait_ready() right after we wrote in the Controller Config
register when disabling a controller.

Seems the same thing here, the controller is not being able to handle a
read right after some internal procedure (reset) were initiated.

If you have suggestion to improve the commit msg, let me know :)
Thanks!



Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Christoph Hellwig
On Tue, Sep 19, 2017 at 12:49:21PM -0300, Guilherme G. Piccoli wrote:
> On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
> > On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
> >>src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
> >> +
> >> +  msleep(5000);
> > 
> > src_writel is a writel, and thus a posted MMIO write.  You'll need
> > to have to a read first to make it a reliable timing base.
> > 
> 
> Just for my full understanding - you're saying a readl BEFORE
> src_writel() or AFTER src_writel() ?

AFTER.

> I could add a read to some dummy register, but notice it was a sequence
> of readl's on aac_is_ctrl_up_and_running() that caused the failure of
> reset...

Oh, ouch.  I guess in that case we'll need to do the writel and pray,
but that would need a big comment explaining what's going on there.


Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Guilherme G. Piccoli
On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
>>  src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
>> +
>> +msleep(5000);
> 
> src_writel is a writel, and thus a posted MMIO write.  You'll need
> to have to a read first to make it a reliable timing base.
> 

Just for my full understanding - you're saying a readl BEFORE
src_writel() or AFTER src_writel() ?

I could add a read to some dummy register, but notice it was a sequence
of readl's on aac_is_ctrl_up_and_running() that caused the failure of
reset...

Thanks



Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Christoph Hellwig
On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
>   src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
> +
> + msleep(5000);

src_writel is a writel, and thus a posted MMIO write.  You'll need
to have to a read first to make it a reliable timing base.