Re: Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler

2018-01-12 Thread Brian Norris
On Thu, Jan 11, 2018 at 06:25:09PM -0800, Brian Norris wrote:
> Anyway, I'll do my own testing and then submit my patch properly.

OK, so I definitely confirmed: if your patch does anything, it
introduces a new deadlock possibility. Just trigger a Wifi timeout or
reset from within remove(), and you'll see the work event get stuck in
pci_reset_function(), while remove() gets stuck at cancel_work_sync().

I also confirmed that my patch resolves this problem.

I'll send the revert + my patch now.

Brian


Re: Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler

2018-01-11 Thread Brian Norris
Hi Simon,

On Wed, Jan 10, 2018 at 12:30:35PM +, Xinming Hu wrote:
> > -Original Message-
> > From: Brian Norris [mailto:briannor...@chromium.org]
> > On Tue, Jan 09, 2018 at 11:42:21AM +, Xinming Hu wrote:
> > > I am not sure whether there is any mutual exclusion protect between
> > pcie_reset and pcie_remove in pcie core.
> > > But it looks a different race.
> > > We still need this fix, right?
> > 
> > Good point. Previously, there wasn't any such exclusion, and that's why 
> > races
> > like the above were even more likely. But as of 4.13, now there
> > *is* exclusion. See commit b014e96d1abb ("PCI: Protect
> > pci_error_handlers->reset_notify() usage with device_lock()"). That 
> > incidentally
> > means that you're creating a deadlock with this patch! [1]
> > 
> > If we start a timeout/reset sequence in mwifiex_init_shutdown_fw() (called
> > from remove()), then you'll eventually have pci_reset_function() waiting on 
> > the
> > device lock, but mwifiex_pcie_remove() will be holding the device lock 
> > already,
> > and now (with your patch), remove() will be waiting on the worker thread to
> > finish pci_reset_function()...deadlock!
> > 
> > I actually think that the above patch (adding device_lock()) resolves most 
> > of the
> > race but introduces a possible deadlock. I believe an easy solution is just 
> > to
> > switch to pci_try_reset_function() instead? That will just abort a reset 
> > attempt
> > if we're in the middle of removing the device. Problem solved? Diff 
> > appended,
> > but I'll send out a real version if that looks right. Can you test your 
> > original
> > problem with the above commit from Christopher, as well as the appended
> > diff?
> > 
> 
> Since I don't have the original customer platform, which is 4.1 based.

You could suggest your customer try the aforementioned commit + my
patch. It will likely get them a more reliable result in the end.

> Just run the stress test on my 4.14, with commands:
> while true; do rmmod mwifiex;insmod $mwd/mwifiex.ko; sleep 1; insmod 
> $mwd/mwifiex_pcie.ko; sleep 1; rmmod mwifiex_pcie & echo "1" > 
> /sys/kernel/debug/mwifiex/mlan0/reset;done &

That's actually not much of a good test, since the mutual exclusion of
reset and remove will mean that 'rmmod' and 'echo 1 > ... /reset' won't
really be doing anything significant in parallel.

What you'd really need to do is fake a timeout from within remove().
e.g., you could do

adapter->if_ops.card_reset(adapter);

plus some sleep (to let the work event get started) followed by the
cancel_work_sync() of your patch.

> 
> Till now, I didn't hit deadlock with/without below diff, though it
> looks finally will be reached, according to the above explanation.  I
> guess, maybe rmmod operation is slowly then sysfs operation.

Per my above explanation, you won't hit the deadlock like that.

Anyway, I'll do my own testing and then submit my patch properly.

Brian


Re: Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler

2018-01-10 Thread Xinming Hu
Hi Brian,

> -Original Message-
> From: Brian Norris [mailto:briannor...@chromium.org]
> Sent: 2018年1月10日 6:01
> To: Xinming Hu <h...@marvell.com>; Kalle Valo <kv...@codeaurora.org>
> Cc: Kalle Valo <kv...@codeaurora.org>; Linux Wireless
> <linux-wireless@vger.kernel.org>; Dmitry Torokhov <d...@google.com>;
> raja...@google.com; Zhiyuan Yang <yan...@marvell.com>; Tim Song
> <song...@marvell.com>; Cathy Luo <c...@marvell.com>; James Cao
> <j...@marvell.com>; Ganapathi Bhat <gb...@marvell.com>; Ellie Reeves
> <ellierev...@gmail.com>; Christoph Hellwig <h...@lst.de>
> Subject: [EXT] Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in
> remove/shutdown handler
> 
> External Email
> 
> --
> + Christopher
> 
> Hi Simon and Kalle,
> 
> On Tue, Jan 09, 2018 at 11:42:21AM +, Xinming Hu wrote:
> > Hi,
> >
> > > -Original Message-
> > > From: Kalle Valo [mailto:kv...@codeaurora.org]
> > > Sent: 2018年1月9日 15:40
> > > To: Brian Norris <briannor...@chromium.org>
> > > Cc: Xinming Hu <h...@marvell.com>; Linux Wireless
> > > <linux-wireless@vger.kernel.org>; Dmitry Torokhov <d...@google.com>;
> > > raja...@google.com; Zhiyuan Yang <yan...@marvell.com>; Tim Song
> > > <song...@marvell.com>; Cathy Luo <c...@marvell.com>; James Cao
> > > <j...@marvell.com>; Ganapathi Bhat <gb...@marvell.com>; Ellie Reeves
> > > <ellierev...@gmail.com>
> > > Subject: [EXT] Re: [PATCH] mwifiex: cancel pcie/sdio work in
> > > remove/shutdown handler
> > >
> > > External Email
> > >
> > > 
> > > -- Brian Norris <briannor...@chromium.org> writes:
> > >
> > > >> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > >> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > >> @@ -310,6 +310,8 @@ static void mwifiex_pcie_remove(struct
> > > >> pci_dev
> > > *pdev)
> > > >>mwifiex_init_shutdown_fw(priv, MWIFIEX_FUNC_SHUTDOWN);
> > > >>}
> > > >>
> > > >> +  cancel_work_sync(>work);
> > > >> +
> > > >
> > > > Just FYI, this "fix" is not a real fix. It will likely paper over
> > > > some of your bugs (where, e.g., the FW shutdown command times out
> > > > in the previous couple of lines), but this highlights the fact
> > > > that there are other races that could trigger the same behavior.
> > > > You're not fixing those.
> > > >
> > > > For example, what if somebody initiates a scan or other nl80211
> > > > command between the above line and mwifiex_remove_card()? That
> > > command
> > > > could potentially time out too.
> > > >
> >
> > The hardware status have been reset before downloading the last
> > command(FUNC SHUTDOWN), in this way, follow commands  download will
> be
> > ignored and warned.
> 
> Hmm, I suppose that's true. So the race I'm talking about probably can't
> happen usually. What about in manufacturing mode
> or !FIRMWARE_READY_PCIE though? Those cases don't shut down the
> firmware. Can we still have outstanding timeouts in those cases?
> 
> Anyway, I still think there's a problem though, and this patch is just going 
> to
> make things worse. See below.
> 
> > > > The proper fix would be to institute some kind of mutual exclusion
> > > > (locking) between mwifiex_shutdown_sw() and mwifiex_remove_card(),
> > > > so that they can't occur at the same time.
> > > >
> >
> > I am not sure whether there is any mutual exclusion protect between
> pcie_reset and pcie_remove in pcie core.
> > But it looks a different race.
> > We still need this fix, right?
> 
> Good point. Previously, there wasn't any such exclusion, and that's why races
> like the above were even more likely. But as of 4.13, now there
> *is* exclusion. See commit b014e96d1abb ("PCI: Protect
> pci_error_handlers->reset_notify() usage with device_lock()"). That 
> incidentally
> means that you're creating a deadlock with this patch! [1]
> 
> If we start a timeout/reset sequence in mwifiex_init_shutdown_fw() (called
> from remove()), then you'll eventually have pci_reset_function() waiting on 
> the
> device lock, but mwifiex_pcie_remove() will be holding 

Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler

2018-01-09 Thread Brian Norris
+ Christopher

Hi Simon and Kalle,

On Tue, Jan 09, 2018 at 11:42:21AM +, Xinming Hu wrote:
> Hi,
> 
> > -Original Message-
> > From: Kalle Valo [mailto:kv...@codeaurora.org]
> > Sent: 2018年1月9日 15:40
> > To: Brian Norris <briannor...@chromium.org>
> > Cc: Xinming Hu <h...@marvell.com>; Linux Wireless
> > <linux-wireless@vger.kernel.org>; Dmitry Torokhov <d...@google.com>;
> > raja...@google.com; Zhiyuan Yang <yan...@marvell.com>; Tim Song
> > <song...@marvell.com>; Cathy Luo <c...@marvell.com>; James Cao
> > <j...@marvell.com>; Ganapathi Bhat <gb...@marvell.com>; Ellie Reeves
> > <ellierev...@gmail.com>
> > Subject: [EXT] Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown
> > handler
> > 
> > External Email
> > 
> > --
> > Brian Norris <briannor...@chromium.org> writes:
> > 
> > >> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> > >> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> > >> @@ -310,6 +310,8 @@ static void mwifiex_pcie_remove(struct pci_dev
> > *pdev)
> > >>  mwifiex_init_shutdown_fw(priv, MWIFIEX_FUNC_SHUTDOWN);
> > >>  }
> > >>
> > >> +cancel_work_sync(>work);
> > >> +
> > >
> > > Just FYI, this "fix" is not a real fix. It will likely paper over some
> > > of your bugs (where, e.g., the FW shutdown command times out in the
> > > previous couple of lines), but this highlights the fact that there are
> > > other races that could trigger the same behavior. You're not fixing
> > > those.
> > >
> > > For example, what if somebody initiates a scan or other nl80211
> > > command between the above line and mwifiex_remove_card()? That
> > command
> > > could potentially time out too.
> > >
> 
> The hardware status have been reset before downloading the last
> command(FUNC SHUTDOWN), in this way, follow commands  download will be
> ignored and warned.

Hmm, I suppose that's true. So the race I'm talking about probably can't
happen usually. What about in manufacturing mode or !FIRMWARE_READY_PCIE
though? Those cases don't shut down the firmware. Can we still have
outstanding timeouts in those cases?

Anyway, I still think there's a problem though, and this patch is just
going to make things worse. See below.

> > > The proper fix would be to institute some kind of mutual exclusion
> > > (locking) between mwifiex_shutdown_sw() and mwifiex_remove_card(), so
> > > that they can't occur at the same time.
> > >
> 
> I am not sure whether there is any mutual exclusion protect between 
> pcie_reset and pcie_remove in pcie core.
> But it looks a different race.
> We still need this fix, right?

Good point. Previously, there wasn't any such exclusion, and that's why
races like the above were even more likely. But as of 4.13, now there
*is* exclusion. See commit b014e96d1abb ("PCI: Protect
pci_error_handlers->reset_notify() usage with device_lock()"). That
incidentally means that you're creating a deadlock with this patch! [1]

If we start a timeout/reset sequence in mwifiex_init_shutdown_fw()
(called from remove()), then you'll eventually have pci_reset_function()
waiting on the device lock, but mwifiex_pcie_remove() will be holding
the device lock already, and now (with your patch), remove() will be
waiting on the worker thread to finish pci_reset_function()...deadlock!

I actually think that the above patch (adding device_lock()) resolves
most of the race but introduces a possible deadlock. I believe an easy
solution is just to switch to pci_try_reset_function() instead? That
will just abort a reset attempt if we're in the middle of removing the
device. Problem solved? Diff appended, but I'll send out a real version
if that looks right. Can you test your original problem with the above
commit from Christopher, as well as the appended diff?

> Regards,
> Simon
> > > Unfortunately, I only paid attention to this after Kalle already
> > > applied this patch. Personally, I'd prefer this patch not get applied,
> > > since it's a bad solution to an obvious problem, which instead leaves
> > > a subtle problem that perhaps no one will bother fixing.
> > 
> > I can revert it, that's not a problem. Can I use the text below as 
> > explanation for
> > the revert?
> > 
> > --
> > Brian Norris <briannor...@chromium.org> says:
> > 
> > Just FYI, 

Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler

2018-01-09 Thread Xinming Hu
Hi,

> -Original Message-
> From: Kalle Valo [mailto:kv...@codeaurora.org]
> Sent: 2018年1月9日 15:40
> To: Brian Norris <briannor...@chromium.org>
> Cc: Xinming Hu <h...@marvell.com>; Linux Wireless
> <linux-wireless@vger.kernel.org>; Dmitry Torokhov <d...@google.com>;
> raja...@google.com; Zhiyuan Yang <yan...@marvell.com>; Tim Song
> <song...@marvell.com>; Cathy Luo <c...@marvell.com>; James Cao
> <j...@marvell.com>; Ganapathi Bhat <gb...@marvell.com>; Ellie Reeves
> <ellierev...@gmail.com>
> Subject: [EXT] Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown
> handler
> 
> External Email
> 
> --
> Brian Norris <briannor...@chromium.org> writes:
> 
> >> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> >> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> >> @@ -310,6 +310,8 @@ static void mwifiex_pcie_remove(struct pci_dev
> *pdev)
> >>mwifiex_init_shutdown_fw(priv, MWIFIEX_FUNC_SHUTDOWN);
> >>}
> >>
> >> +  cancel_work_sync(>work);
> >> +
> >
> > Just FYI, this "fix" is not a real fix. It will likely paper over some
> > of your bugs (where, e.g., the FW shutdown command times out in the
> > previous couple of lines), but this highlights the fact that there are
> > other races that could trigger the same behavior. You're not fixing
> > those.
> >
> > For example, what if somebody initiates a scan or other nl80211
> > command between the above line and mwifiex_remove_card()? That
> command
> > could potentially time out too.
> >

The hardware status have been reset before downloading the last command(FUNC 
SHUTDOWN), in this way, follow commands  download will be ignored and warned.

> > The proper fix would be to institute some kind of mutual exclusion
> > (locking) between mwifiex_shutdown_sw() and mwifiex_remove_card(), so
> > that they can't occur at the same time.
> >

I am not sure whether there is any mutual exclusion protect between pcie_reset 
and pcie_remove in pcie core.
But it looks a different race.
We still need this fix, right?

Regards,
Simon
> > Unfortunately, I only paid attention to this after Kalle already
> > applied this patch. Personally, I'd prefer this patch not get applied,
> > since it's a bad solution to an obvious problem, which instead leaves
> > a subtle problem that perhaps no one will bother fixing.
> 
> I can revert it, that's not a problem. Can I use the text below as 
> explanation for
> the revert?
> 
> --
> Brian Norris <briannor...@chromium.org> says:
> 
> Just FYI, this "fix" is not a real fix. It will likely paper over some of 
> your bugs
> (where, e.g., the FW shutdown command times out in the previous couple of
> lines), but this highlights the fact that there are other races that could 
> trigger
> the same behavior. You're not fixing those.
> 
> For example, what if somebody initiates a scan or other nl80211 command
> between the above line and mwifiex_remove_card()? That command could
> potentially time out too.
> 
> The proper fix would be to institute some kind of mutual exclusion
> (locking) between mwifiex_shutdown_sw() and mwifiex_remove_card(), so that
> they can't occur at the same time.
> 
> --
> 
> --
> Kalle Valo