Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt

2015-11-27 Thread Thomas Gleixner
On Fri, 27 Nov 2015, Jiang Liu wrote:

Please trim your replies.
 
> On 2015/11/26 5:12, Thomas Gleixner wrote:
> > Looks a bit overkill with the extra cpumask. I tried a simple counter
> > but that does not work versus cpu unplug as we do not know whether the
> > outgoing cpu is involved in the cleanup or not. And if the cpu is
> > involved we starve assign_irq_vector() 
> > 
> > The upside of this is that we get rid of that atomic allocation in
> > __send_cleanup_vector().
>
>   Maybe more headaches for you now :)
>   It seems there is still room for improvement. First, it
> seems we could just reuse old_domain instead of adding cleanup_mask.

I'd really like to get rid of that atomic allocation in
__send_cleanup_vector().

> Second I found another race window among x86_vector_free_irqs(),
> __send_cleanup_vector() and smp_irq_move_cleanup_interrupt().

What's the race there?

> I'm trying to refine your patch based on the following rules:
> 1) move_in_progress controls whether we need to send IPIs
> 2) old_domain controls which CPUs we should clean up
> 3) assign_irq_vector checks both move_in_progress and old_domain.
> I will send out the patch soon for comments :)

Sure.

Thanks,

tglx


Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt

2015-11-27 Thread Jiang Liu


On 2015/11/26 5:12, Thomas Gleixner wrote:
> On Wed, 25 Nov 2015, Thomas Gleixner wrote:
>> So if CPU1 gets the IPI _BEFORE_ move_in_progress is set to 0, and
>> does not get another IPI before the next move . That has been that
>> way forever.
>>
>> Duh. Working on a real fix this time.
> 
> Here you go. Completely untested of course.
> 
> Larger than I hoped for, but the simple fix of just clearing the
> move_in_progress flag before sending the IPI does not work because:
> 
> CPU0                        CPU1                      CPU2
> data->move_in_progress=0
> sendIPI()
>                             set_affinity()
>                             lock_vector()             handle_IPI
>                             move_in_progress = 1      lock_vector()
>                             unlock_vector()
>                                                       move_in_progress == 1
>                                                       -> no cleanup
> 
> So we are back to square one. Now one might think that taking vector
> lock prevents that issue:
> 
> CPU0                        CPU1                      CPU2
> lock_vector()
> data->move_in_progress=0
> sendIPI()
> unlock_vector()
>                             set_affinity()
>                             assign_irq_vector()
>                             lock_vector()             handle_IPI
>                             move_in_progress = 1      lock_vector()
>                             unlock_vector()
>                                                       move_in_progress == 1
> Not really. 
> 
> So now the solution is:
> 
> CPU0                                    CPU1                            CPU2
> lock_vector()
> data->move_in_progress=0
> data->cleanup_mask = data->old_domain
> sendIPI()
> unlock_vector()
>                                         set_affinity()
>                                         assign_irq_vector()
>                                         lock_vector()
>                                         if (move_in_progress ||
>                                             !empty(cleanup_mask)) {
>                                                 unlock_vector()
>                                                 return -EBUSY;        handle_IPI
>                                         }                             lock_vector()
>                                                                       move_in_progress == 0
>                                                                       cpu is set in cleanup mask
>                                                                       -> cleanup vector
> 
> Looks a bit overkill with the extra cpumask. I tried a simple counter
> but that does not work versus cpu unplug as we do not know whether the
> outgoing cpu is involved in the cleanup or not. And if the cpu is
> involved we starve assign_irq_vector() 
> 
> The upside of this is that we get rid of that atomic allocation in
> __send_cleanup_vector().
Hi Thomas,
Maybe more headaches for you now :)
It seems there is still room for improvement. First, it
seems we could just reuse old_domain instead of adding cleanup_mask.
Second, I found another race window among x86_vector_free_irqs(),
__send_cleanup_vector() and smp_irq_move_cleanup_interrupt().
I'm trying to refine your patch based on the following rules:
1) move_in_progress controls whether we need to send IPIs
2) old_domain controls which CPUs we should clean up
3) assign_irq_vector checks both move_in_progress and old_domain.
I will send out the patch soon for comments :)
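
As a rough illustration of rule 3 only (not the actual patch, which is
still to be posted), the check at the top of __assign_irq_vector() would
become something like:

        /*
         * Sketch: refuse a new move while either an IPI still has to be
         * sent (move_in_progress) or a previous cleanup has not finished
         * (old_domain still has bits set).
         */
        if (d->move_in_progress || !cpumask_empty(d->old_domain))
                return -EBUSY;
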
Thanks,
Gerry   

> 
> Brain hurts by now. 
> 
> Not-Yet-Signed-off-by: Thomas Gleixner 
> ---
>  arch/x86/kernel/apic/vector.c |   37 -
>  1 file changed, 16 insertions(+), 21 deletions(-)
> 
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -25,6 +25,7 @@ struct apic_chip_data {
>   struct irq_cfg  cfg;
>   cpumask_var_t   domain;
>   cpumask_var_t   old_domain;
> + cpumask_var_t   cleanup_mask;
>   u8  move_in_progress : 1;
>  };
>  
> @@ -83,7 +84,11 @@ static struct apic_chip_data *alloc_apic
>   goto out_data;
>   if (!zalloc_cpumask_var_node(&data->old_domain, GFP_KERNEL, node))
>   goto out_domain;
> + if (!zalloc_cpumask_var_node(&data->cleanup_mask, GFP_KERNEL, node))
> + goto out_old;
>   return data;
> +out_old:
> + free_cpumask_var(data->old_domain);
>  out_domain:
>   free_cpumask_var(data->domain);
>  out_data:
> @@ -96,6 +101,7 @@ static void free_apic_chip_data(struct a
>   if (data) {
>   free_cpumask_var(data->domain);
>   free_cpumask_var(data->old_domain);
> + free_cpumask_var(data->cleanup_mask);
>   kfree(data);
>   }
>  }
> @@ -118,7 +124,7 @@ static int __assign_irq_vector(int irq,
>   static int current_offset = VECTOR_OFFSET_START % 16;
>   int cpu, err;
>  
> - if (d->move_in_progress)
> + if 

Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt

2015-11-25 Thread Joe Lawrence

On 11/25/2015 04:12 PM, Thomas Gleixner wrote:
> On Wed, 25 Nov 2015, Thomas Gleixner wrote:
>> So if CPU1 gets the IPI _BEFORE_ move_in_progress is set to 0, and
>> does not get another IPI before the next move . That has been that
>> way forever.
>>
>> Duh. Working on a real fix this time.
>
> Here you go. Completely untested of course.



Hi Thomas -- thanks for taking a look!  (Well, the analysis looks like 
more than *just* a look :)


I'll give the patch a go when I get back in the office next week.

-- Joe


Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt

2015-11-25 Thread Thomas Gleixner
On Wed, 25 Nov 2015, Thomas Gleixner wrote:
> So if CPU1 gets the IPI _BEFORE_ move_in_progress is set to 0, and
> does not get another IPI before the next move . That has been that
> way forever.
> 
> Duh. Working on a real fix this time.

Here you go. Completely untested of course.

Larger than I hoped for, but the simple fix of just clearing the
move_in_progress flag before sending the IPI does not work because:

CPU0                        CPU1                      CPU2
data->move_in_progress=0
sendIPI()
                            set_affinity()
                            lock_vector()             handle_IPI
                            move_in_progress = 1      lock_vector()
                            unlock_vector()
                                                      move_in_progress == 1
                                                      -> no cleanup

So we are back to square one. Now one might think that taking vector
lock prevents that issue:

CPU0                        CPU1                      CPU2
lock_vector()
data->move_in_progress=0
sendIPI()
unlock_vector()
                            set_affinity()
                            assign_irq_vector()
                            lock_vector()             handle_IPI
                            move_in_progress = 1      lock_vector()
                            unlock_vector()
                                                      move_in_progress == 1
Not really. 

So now the solution is:

CPU0                                    CPU1                            CPU2
lock_vector()
data->move_in_progress=0
data->cleanup_mask = data->old_domain
sendIPI()
unlock_vector()
                                        set_affinity()
                                        assign_irq_vector()
                                        lock_vector()
                                        if (move_in_progress ||
                                            !empty(cleanup_mask)) {
                                                unlock_vector()
                                                return -EBUSY;        handle_IPI
                                        }                             lock_vector()
                                                                      move_in_progress == 0
                                                                      cpu is set in cleanup mask
                                                                      -> cleanup vector

Looks a bit overkill with the extra cpumask. I tried a simple counter
but that does not work versus cpu unplug as we do not know whether the
outgoing cpu is involved in the cleanup or not. And if the cpu is
involved we starve assign_irq_vector() 

The upside of this is that we get rid of that atomic allocation in
__send_cleanup_vector().
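
(A minimal sketch of what __send_cleanup_vector() then reduces to; the
cleanup_mask is already filled under vector_lock, as in the diagram
above, so no allocation is needed here. Illustration only, not
necessarily the exact hunk further down.)

        static void __send_cleanup_vector(struct apic_chip_data *data)
        {
                /* data->cleanup_mask was filled under vector_lock */
                apic->send_IPI_mask(data->cleanup_mask, IRQ_MOVE_CLEANUP_VECTOR);
        }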

Brain hurts by now. 

Not-Yet-Signed-off-by: Thomas Gleixner 
---
 arch/x86/kernel/apic/vector.c |   37 -
 1 file changed, 16 insertions(+), 21 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -25,6 +25,7 @@ struct apic_chip_data {
struct irq_cfg  cfg;
cpumask_var_t   domain;
cpumask_var_t   old_domain;
+   cpumask_var_t   cleanup_mask;
u8  move_in_progress : 1;
 };
 
@@ -83,7 +84,11 @@ static struct apic_chip_data *alloc_apic
goto out_data;
if (!zalloc_cpumask_var_node(&data->old_domain, GFP_KERNEL, node))
goto out_domain;
+   if (!zalloc_cpumask_var_node(&data->cleanup_mask, GFP_KERNEL, node))
+   goto out_old;
return data;
+out_old:
+   free_cpumask_var(data->old_domain);
 out_domain:
free_cpumask_var(data->domain);
 out_data:
@@ -96,6 +101,7 @@ static void free_apic_chip_data(struct a
if (data) {
free_cpumask_var(data->domain);
free_cpumask_var(data->old_domain);
+   free_cpumask_var(data->cleanup_mask);
kfree(data);
}
 }
@@ -118,7 +124,7 @@ static int __assign_irq_vector(int irq,
static int current_offset = VECTOR_OFFSET_START % 16;
int cpu, err;
 
-   if (d->move_in_progress)
+   if (d->move_in_progress || !cpumask_empty(d->cleanup_mask))
return -EBUSY;
 
/* Only try and allocate irqs on cpus that are present */
@@ -511,20 +517,11 @@ static struct irq_chip lapic_controller
 #ifdef CONFIG_SMP
 static void __send_cleanup_vector(struct apic_chip_data *data)
 {
-   cpumask_var_t cleanup_mask;
-
-   if (unlikely(!alloc_cpumask_var(&cleanup_mask, GFP_ATOMIC))) {
-   unsigned int i;
-
-   for_each_cpu_and(i, data->old_domain, cpu_online_mask)
-   apic->send_IPI_mask(cpumask_of(i),
-   IRQ_MOVE_CLEANUP_VECTOR);
-   } else {
-   cpumask_and(cleanup_mask, data->old_domain, cpu_online_mask);
-   apic->send_IPI_mask(cleanup_mask, 

Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt

2015-11-25 Thread Thomas Gleixner
On Wed, 25 Nov 2015, Thomas Gleixner wrote:
> The problem is actually in the vector assignment code.
> 
> >   [001]22.936764: __assign_irq_vector : cpu 44 : vector=134 -> 
> > 0x88102a8196f8
> 
> No interrupt happened so far. So nothing cleans up the vector on cpu 1
> 
> >   [044]61.670267: __assign_irq_vector : cpu 34 : vector=123 -> 
> > 0x88102a8196f8
>  
> Now that moves it from 44 to 34 and ignores that cpu 1 still has the
> vector assigned. __assign_irq_vector unconditionally overwrites
> data->old_domain, so the bit of cpu 1 is lost 
> 
> I'm staring into the code to figure out a fix 

Just to figure out that my analysis was completely wrong.

__assign_irq_vector()
{
if (d->move_in_progress)
return -EBUSY;
...

So that cannot happen. Now the question is:

>   [001]22.936764: __assign_irq_vector : cpu 44 : vector=134 -> 
> 0x88102a8196f8

So CPU1 sees still data->move_in_progress

  [001]54.636722: smp_irq_move_cleanup_interrupt : data->move_in_progress : 
vector=145 0x88102a8196f8

And why does __assign_irq_vector not see it, when no cleanup vector was
received by cpu1 with data->move_in_progress == 0?

>   [044]61.670267: __assign_irq_vector : cpu 34 : vector=123 -> 
> 0x88102a8196f8

Ah.

__send_cleanup_vector()
{
send_IPI()
move_in_progress = 0; 
}

So if CPU1 gets the IPI _BEFORE_ move_in_progress is set to 0, and
does not get another IPI before the next move . That has been that
way forever.

Duh. Working on a real fix this time.

Thanks,

tglx


Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt

2015-11-25 Thread Thomas Gleixner
Joe,

On Mon, 23 Nov 2015, Joe Lawrence wrote:

Nice detective work!

> The irq_desc is in R15: 88102a8196f8
>
> This irq_desc is no longer allocated, it's been filled with the slub debug
> poison pattern (hence the spinlock is stuck):
> 
> The irq vector is in RBX: 0091
>
> But from a custom crash gdb script, the freed irq_desc can still be found
> in CPU 1's vector_irq[145]:
> 
>   cpu[1] vector_irq[145] irq_desc @ 0x88102a8196f8
> 
> Sifting through a heavily grepped and abbreviated trace log:
> 
>   - The irq moved from CPU 1, vector 145 to CPU 44, vector 134 
>   - Both CPUs skip cleaning up their vector_irq[] entries for this irq
> because data->move_in_progress is set (is this normal?)

Yes. The cleanup interrupt is not targeting a particular vector.
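
(Roughly, the handler walks the whole per-cpu vector table of whichever
CPU receives it; a slightly simplified sketch of the 4.3 loop:)

        for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
                struct irq_desc *desc = __this_cpu_read(vector_irq[vector]);

                if (IS_ERR_OR_NULL(desc))
                        continue;
                /* ... check whether this vector belongs to a completed move ... */
        }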

>   - The irq moves again to CPU 34, vector 123
>   - The underlying device is removed @ 593.106514 jiffies
>   - The irq_desc is freed @ 593.239879 jiffies
>   - CPU 1 is last heard from @ 1022.830083 jiffies
> 
>   [001]22.936764: __assign_irq_vector : cpu 44 : vector=134 -> 
> 0x88102a8196f8
>   [044]35.209945: smp_irq_move_cleanup_interrupt : data->move_in_progress 
> : vector=134 0x88102a8196f8
>   [001]35.212370: smp_irq_move_cleanup_interrupt : data->move_in_progress 
> : vector=145 0x88102a8196f8
>   [044]35.674114: smp_irq_move_cleanup_interrupt : data->move_in_progress 
> : vector=134 0x88102a8196f8
>   [001]35.675395: smp_irq_move_cleanup_interrupt : data->move_in_progress 
> : vector=145 0x88102a8196f8
>   [044]36.265716: smp_irq_move_cleanup_interrupt : data->move_in_progress 
> : vector=134 0x88102a8196f8
>   [044]36.517785: smp_irq_move_cleanup_interrupt : data->move_in_progress 
> : vector=134 0x88102a8196f8
>   [001]36.520115: smp_irq_move_cleanup_interrupt : data->move_in_progress 
> : vector=145 0x88102a8196f8
>   [001]54.636651: smp_irq_move_cleanup_interrupt : data->move_in_progress 
> : vector=145 0x88102a8196f8
>   [001]54.636722: smp_irq_move_cleanup_interrupt : data->move_in_progress 
> : vector=145 0x88102a8196f8

So I assume that no interrupt happened so far.

>   [044]61.670267: __assign_irq_vector : cpu 34 : vector=123 -> 
> 0x88102a8196f8

Now it's moved again.

>   [001]61.670979: smp_irq_move_cleanup_interrupt : data->move_in_progress 
> : vector=145 0x88102a8196f8
>   [044]61.696120: smp_irq_move_cleanup_interrupt : cpu (this) : 
> vector=134 0x88102a8196f8 -> (nil)

>   [034]69.086107: smp_irq_move_cleanup_interrupt : vector == 
> data->cfg.vector && ... : vector=123 0x88102a8196f8
>   [000]   593.239811: clear_irq_vector : 1 : cpu 34 : vector=123 
> 0x88102a8196f8 -> (nil)
>   [000]   593.239879: free_desc : desc @ 0x88102a8196f8
>   [001]  1022.830083: smp_irq_move_cleanup_interrupt : IS_ERR_OR_NULL : 
> vector=144 (nil)

Ok. Here is the issue. So I assume the following happens:

CPU0  CPU1                          CPU34   CPU44                           CPUxx

                                                                            send_IPI(CLEANUP)
      smp_irq_move_cleanup_interrupt
        sees data->move_in_progress

                                            device_irq happens
                                            data->move_in_progress = 0
                                            send_IPI(CLEANUP)

      Does not receive another
      cleanup IPI for whatever
      reason                                                                See below.

                                            smp_irq_move_cleanup_interrupt
                                            vector = NULL

free_vector on CPU34

Does not clean CPU1 because
data->move_in_progress is not set.

free_desc

                                                                            send_IPI(CLEANUP)
      smp_irq_move_cleanup_interrupt
        Accesses random memory
 
> Prior to a782a7e46bb5 "x86/irq: Store irq descriptor in vector array", the
> vector_irq array held irq values, those interested in the irq_desc would
> have to perform their own irq_to_desc() (ie, a radix_tree_lookup of the
> irq_desc_tree).  The radix tree is updated in free_desc(), so any
> subsequent lookups would fail.

Yes, the above race has existed forever and the irq_to_desc() check
papered over it. In case the irq number was reassigned, the thing
operated on the wrong descriptor. Pretty bad as well, as it fiddles
with the wrong bits. Though it clears the vector ...
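
(For reference, a rough sketch of the pre-a782a7e46bb5 loop body; the
radix tree lookup is what papered over a stale entry:)

        int irq = __this_cpu_read(vector_irq[vector]);
        struct irq_desc *desc;

        if (irq <= 0)
                continue;

        desc = irq_to_desc(irq);        /* radix tree lookup */
        if (!desc)                      /* desc already freed: lookup fails, skip */
                continue;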

The problem is actually in the vector assignment code.

>   [001]22.936764: __assign_irq_vector : cpu 44 : vector=134 -> 
> 0x88102a8196f8

No interrupt happened so far. So nothing cleans up the vector on cpu 1

>   [044]61.670267: __assign_irq_vector : cpu 34 : vector=123 -> 
> 0x88102a8196f8
 
Now that moves it from 44 to 34 and ignores that cpu 1 still has the
vector 

irq_desc use-after-free in smp_irq_move_cleanup_interrupt

2015-11-23 Thread Joe Lawrence
Hi Thomas,

I've been chasing down a use-after-free on an irq_desc structure during
repeated device removal testing that crashes 4.3 in
smp_irq_move_cleanup_interrupt.  So far I have a bunch of crashes and some
data gleaned from instrumenting the kernel with trace logging.  Details to
follow, but in short, I think I'm hitting the use-after-free because of
a782a7e46bb5 "x86/irq: Store irq descriptor in vector array".
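
(For reference, that commit changes the per-cpu vector table from irq
numbers to descriptor pointers, roughly:

        /* before a782a7e46bb5 (sketch) */
        typedef int vector_irq_t[NR_VECTORS];

        /* after a782a7e46bb5 (sketch) */
        typedef struct irq_desc *vector_irq_t[NR_VECTORS];
        DECLARE_PER_CPU(vector_irq_t, vector_irq);

so a stale entry is now a dangling pointer instead of an irq number that
could still be re-validated via irq_to_desc().)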

During the tests, I usually see RCU stalls (sometimes a crash later) --
when looking at a vmcore, the CPU stall is here:

  crash> bt 881038ee4200
  PID: 0  TASK: 881038ee4200  CPU: 1   COMMAND: "swapper/1"
  [exception RIP: _raw_spin_lock+0x10]
  RIP: 81672a10  RSP: 88103f843f70  RFLAGS: 0046
  RAX:   RBX: 0091  RCX: 0001
  RDX: 0001  RSI: 0001  RDI: 81d36a88
  RBP: 88103f843fa8   R8: 0101   R9: 0f78
  R10: 0010  R11: 8810384ccf8c  R12: 0091
  R13: 88102a819794  R14: b5e8  R15: 88102a8196f8
  CS: 0010  SS: 0018
   #0 [88103f843f70] smp_irq_move_cleanup_interrupt at 8105319b
   #1 [88103f843fb0] irq_move_cleanup_interrupt at 81673a04
  ---  ---
   #2 [881038f0fd90] irq_move_cleanup_interrupt at 81673a04
  [exception RIP: unknown or invalid address]
  RIP: 001d  RSP:   RFLAGS: 881038f1
  RAX: 0001  RBX: 881038f0c000  RCX: 0005
  RDX: 000a087a  RSI: 00051ca2  RDI: 0018
  RBP: 88103f84f100   R8: 881038f0fea0   R9: e8f42378
  R10: 0002  R11: 00ee954efd02  R12: 810e5060
  R13: 881038f0fdc8  R14: 88103f84f500  R15: 88103f84f100
  ORIG_RAX: 88103f856c40  CS: 8000a044  SS: ffdf
  bt: WARNING: possibly bogus exception frame
   #3 [881038f0fe38] cpuidle_enter_state at 8151e278
   #4 [881038f0fea8] cpuidle_enter at 8151e417
   #5 [881038f0feb8] call_cpuidle at 810bef42
   #6 [881038f0fed0] cpu_startup_entry at 810bf1dc
   #7 [881038f0ff28] start_secondary at 8104e9e0

The irq_desc is in R15: 88102a8196f8

  crash> kmem 88102a8196f8
  CACHENAME OBJSIZE  ALLOCATED TOTAL  SLABS  
SSIZE
  88103f408c80 kmalloc-512  512   5292  5460140
32k
SLAB  MEMORYNODE  TOTAL  ALLOCATED  FREE
ea0040aa0600  88102a818000 0 39 1722
FREE / [ALLOCATED]
 88102a8196f8  

PAGE PHYSICAL  MAPPING   INDEX CNT FLAGS
  ea0040aa0640 102a81900000  0 2f80008000 tail

This irq_desc is no longer allocated, it's been filled with the slub debug
poison pattern (hence the spinlock is stuck):

  crash> rd 88102a8196f8 0x28
  88102a8196f8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819708:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819718:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819728:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819738:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819748:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819758:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819768:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819778:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819788:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819798:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a8197a8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a8197b8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a8197c8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a8197d8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a8197e8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a8197f8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819808:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819818:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
  88102a819828:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   

The irq vector is in RBX: 0091

  crash> p/d 0x91
  $1 = 145

The irq_desc_tree entry for vector 0x91 appears to have been updated to
something new (the device re-add):

  crash> tree -t radix irq_desc_tree -s irq_desc | grep 'irq = 0x91' -B11 | 
grep -e '^' -e 'irq ='
  88102fb144e8
  irq = 0x91,

But from a custom crash gdb script, the freed irq_desc can still be found
in CPU 1's vector_irq[145]:

  cpu[1] vector_irq[145] irq_desc @ 0x88102a8196f8
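
(That matches the stall above: the cleanup handler picks the stale
pointer out of vector_irq and then spins on the poisoned lock, roughly

        desc = __this_cpu_read(vector_irq[vector]);     /* stale, already freed */
        raw_spin_lock(&desc->lock);                     /* 0x6b poison: never acquired */

which is where the backtrace sits in _raw_spin_lock.)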

Sifting 
