Re: [PATCH] mlx4: prevent the device from being removed concurrently

2012-02-29 Jack Morgenstein
Mr Cascardo,
If this is at a customer site, there is a workaround available for you. In file /etc/modprobe.d/mlx4_core.conf, enter a line:
options mlx4_core internal_err_reset=0
(do "modinfo mlx4_core" to see a description of the module parameter).
Setting this parameter to 0 in the modprobe co

Re: [PATCH] mlx4: prevent the device from being removed concurrently

2012-02-29 Jack Morgenstein
On Wednesday 29 February 2012 16:47, Jack Morgenstein wrote:
> Some comments.
>
> 1. Mr Cascardo's solution is only partial, and does not cover all the problem cases. He
>    simply uncovered one of several examples of what lack-of-sync will do when removing a device.
>    Mr. Cascardo found

Re: [PATCH] mlx4: prevent the device from being removed concurrently

2012-02-29 Jack Morgenstein
On Tuesday 28 February 2012 22:46, David Miller wrote:
> From: Thadeu Lima de Souza Cascardo
> Date: Tue, 28 Feb 2012 17:34:38 -0300
>
> > On Tue, Feb 28, 2012 at 02:30:51PM -0500, David Miller wrote:
> >> From: Thadeu Lima de Souza Cascardo
> >> Date: Tue, 28 Feb 2012 15:36:16 -0300
> >>
> >>

Re: [PATCH] mlx4: prevent the device from being removed concurrently

2012-02-28 David Miller
From: Thadeu Lima de Souza Cascardo
Date: Tue, 28 Feb 2012 17:34:38 -0300
> On Tue, Feb 28, 2012 at 02:30:51PM -0500, David Miller wrote:
>> From: Thadeu Lima de Souza Cascardo
>> Date: Tue, 28 Feb 2012 15:36:16 -0300
>>
>> > When a EEH happens, the catas poll code will try to restart the devic

Re: [PATCH] mlx4: prevent the device from being removed concurrently

2012-02-28 Thadeu Lima de Souza Cascardo
On Tue, Feb 28, 2012 at 02:30:51PM -0500, David Miller wrote:
> From: Thadeu Lima de Souza Cascardo
> Date: Tue, 28 Feb 2012 15:36:16 -0300
>
> > When a EEH happens, the catas poll code will try to restart the device,
> > removing it and adding it back again. The EEH code will try to do the
> > s

Re: [PATCH] mlx4: prevent the device from being removed concurrently

2012-02-28 David Miller
From: Thadeu Lima de Souza Cascardo
Date: Tue, 28 Feb 2012 15:36:16 -0300
> When a EEH happens, the catas poll code will try to restart the device,
> removing it and adding it back again. The EEH code will try to do the
> same. One of the threads ends up accessing memory that was freed by the
> o

[PATCH] mlx4: prevent the device from being removed concurrently

2012-02-28 Thadeu Lima de Souza Cascardo
When a EEH happens, the catas poll code will try to restart the device, removing it and adding it back again. The EEH code will try to do the same. One of the threads ends up accessing memory that was freed by the other thread and we get a crash.
The EEH backtrace:
<4>Call Trace:
<4>[c0007fff