Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt
On Fri, 27 Nov 2015, Jiang Liu wrote:

Please trim your replies.

> On 2015/11/26 5:12, Thomas Gleixner wrote:
> > Looks a bit overkill with the extra cpumask. I tried a simple counter
> > but that does not work versus cpu unplug as we do not know whether the
> > outgoing cpu is involved in the cleanup or not. And if the cpu is
> > involved we starve assign_irq_vector()
> >
> > The upside of this is that we get rid of that atomic allocation in
> > __send_cleanup_vector().
>
> Maybe more headache for you now:)
> It seems there is still room for improvement. First, it
> seems we could just reuse old_domain instead of adding cleanup_mask.

I'd really like to get rid of that atomic allocation in
__send_cleanup_vector().

> Second, I found another race window among x86_vector_free_irqs(),
> __send_cleanup_vector() and smp_irq_move_cleanup_interrupt().

What's the race there?

> I'm trying to refine your patch based on the following rules:
> 1) move_in_progress controls whether we need to send IPIs
> 2) old_domain controls which CPUs we should do cleanup on
> 3) assign_irq_vector checks both move_in_progress and old_domain.
> Will send out the patch soon for comments:)

Sure.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt
On 2015/11/26 5:12, Thomas Gleixner wrote:
> On Wed, 25 Nov 2015, Thomas Gleixner wrote:
>> So if CPU1 gets the IPI _BEFORE_ move_in_progress is set to 0, and
>> does not get another IPI before the next move ... That has been that
>> way forever.
>>
>> Duh. Working on a real fix this time.
>
> Here you go. Completely untested of course.
>
> Larger than I hoped for, but the simple fix of just clearing the
> move_in_progress flag before sending the IPI does not work because:
>
> CPU0                        CPU1                  CPU2
> data->move_in_progress=0
> sendIPI()
>                             set_affinity()
>                             lock_vector()         handle_IPI
>                             move_in_progress = 1  lock_vector()
>                             unlock_vector()
>                                                   move_in_progress == 1
>                                                   -> no cleanup
>
> So we are back to square one. Now one might think that taking vector
> lock prevents that issue:
>
> CPU0                        CPU1                  CPU2
> lock_vector()
> data->move_in_progress=0
> sendIPI()
> unlock_vector()
>                             set_affinity()
>                             assign_irq_vector()
>                             lock_vector()         handle_IPI
>                             move_in_progress = 1  lock_vector()
>                             unlock_vector()
>                                                   move_in_progress == 1
>
> Not really.
>
> So now the solution is:
>
> CPU0                          CPU1                        CPU2
> lock_vector()
> data->move_in_progress=0
> data->cleanup_mask = data->old_domain
> sendIPI()
> unlock_vector()
>                               set_affinity()
>                               assign_irq_vector()
>                               lock_vector()
>                               if (move_in_progress ||
>                                   !empty(cleanup_mask)) {
>                                  unlock_vector()
>                                  return -EBUSY;           handle_IPI
>                               }                           lock_vector()
>                                                           move_in_progress == 0
>                                                           cpu is set in cleanup mask
>                                                           -> cleanup vector
>
> Looks a bit overkill with the extra cpumask. I tried a simple counter
> but that does not work versus cpu unplug as we do not know whether the
> outgoing cpu is involved in the cleanup or not. And if the cpu is
> involved we starve assign_irq_vector()
>
> The upside of this is that we get rid of that atomic allocation in
> __send_cleanup_vector().

Hi Thomas,
	Maybe more headache for you now:)
	It seems there is still room for improvement. First, it
seems we could just reuse old_domain instead of adding cleanup_mask.
Second, I found another race window among x86_vector_free_irqs(),
__send_cleanup_vector() and smp_irq_move_cleanup_interrupt(). I'm
trying to refine your patch based on the following rules:
1) move_in_progress controls whether we need to send IPIs
2) old_domain controls which CPUs we should do cleanup on
3) assign_irq_vector checks both move_in_progress and old_domain.
Will send out the patch soon for comments:)
Thanks,
Gerry

> Brain hurts by now.
>
> Not-Yet-Signed-off-by: Thomas Gleixner
> ---
>  arch/x86/kernel/apic/vector.c | 37 ++++++++++++++++---------------------
>  1 file changed, 16 insertions(+), 21 deletions(-)
>
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -25,6 +25,7 @@ struct apic_chip_data {
>  	struct irq_cfg		cfg;
>  	cpumask_var_t		domain;
>  	cpumask_var_t		old_domain;
> +	cpumask_var_t		cleanup_mask;
>  	u8			move_in_progress : 1;
>  };
>
> @@ -83,7 +84,11 @@ static struct apic_chip_data *alloc_apic
>  		goto out_data;
>  	if (!zalloc_cpumask_var_node(&data->old_domain, GFP_KERNEL, node))
>  		goto out_domain;
> +	if (!zalloc_cpumask_var_node(&data->cleanup_mask, GFP_KERNEL, node))
> +		goto out_old;
>  	return data;
> +out_old:
> +	free_cpumask_var(data->old_domain);
>  out_domain:
>  	free_cpumask_var(data->domain);
>  out_data:
> @@ -96,6 +101,7 @@ static void free_apic_chip_data(struct a
>  	if (data) {
>  		free_cpumask_var(data->domain);
>  		free_cpumask_var(data->old_domain);
> +		free_cpumask_var(data->cleanup_mask);
>  		kfree(data);
>  	}
>  }
> @@ -118,7 +124,7 @@ static int __assign_irq_vector(int irq,
>  	static int current_offset = VECTOR_OFFSET_START % 16;
>  	int cpu, err;
>
> -	if (d->move_in_progress)
> +	if (d->move_in_progress || !cpumask_empty(d->cleanup_mask))
>  		return -EBUSY;
Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt
On 11/25/2015 04:12 PM, Thomas Gleixner wrote:
> On Wed, 25 Nov 2015, Thomas Gleixner wrote:
>> So if CPU1 gets the IPI _BEFORE_ move_in_progress is set to 0, and
>> does not get another IPI before the next move ... That has been that
>> way forever.
>>
>> Duh. Working on a real fix this time.
>
> Here you go. Completely untested of course.

Hi Thomas -- thanks for taking a look! (Well, the analysis looks like
more than *just* a look :) I'll give the patch a go when I get back in
the office next week.

-- Joe
Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt
On Wed, 25 Nov 2015, Thomas Gleixner wrote:
> So if CPU1 gets the IPI _BEFORE_ move_in_progress is set to 0, and
> does not get another IPI before the next move ... That has been that
> way forever.
>
> Duh. Working on a real fix this time.

Here you go. Completely untested of course.

Larger than I hoped for, but the simple fix of just clearing the
move_in_progress flag before sending the IPI does not work because:

CPU0                        CPU1                  CPU2
data->move_in_progress=0
sendIPI()
                            set_affinity()
                            lock_vector()         handle_IPI
                            move_in_progress = 1  lock_vector()
                            unlock_vector()
                                                  move_in_progress == 1
                                                  -> no cleanup

So we are back to square one. Now one might think that taking vector
lock prevents that issue:

CPU0                        CPU1                  CPU2
lock_vector()
data->move_in_progress=0
sendIPI()
unlock_vector()
                            set_affinity()
                            assign_irq_vector()
                            lock_vector()         handle_IPI
                            move_in_progress = 1  lock_vector()
                            unlock_vector()
                                                  move_in_progress == 1

Not really.

So now the solution is:

CPU0                          CPU1                        CPU2
lock_vector()
data->move_in_progress=0
data->cleanup_mask = data->old_domain
sendIPI()
unlock_vector()
                              set_affinity()
                              assign_irq_vector()
                              lock_vector()
                              if (move_in_progress ||
                                  !empty(cleanup_mask)) {
                                 unlock_vector()
                                 return -EBUSY;           handle_IPI
                              }                           lock_vector()
                                                          move_in_progress == 0
                                                          cpu is set in cleanup mask
                                                          -> cleanup vector

Looks a bit overkill with the extra cpumask. I tried a simple counter
but that does not work versus cpu unplug as we do not know whether the
outgoing cpu is involved in the cleanup or not. And if the cpu is
involved we starve assign_irq_vector()

The upside of this is that we get rid of that atomic allocation in
__send_cleanup_vector().

Brain hurts by now.

Not-Yet-Signed-off-by: Thomas Gleixner
---
 arch/x86/kernel/apic/vector.c | 37 ++++++++++++++++---------------------
 1 file changed, 16 insertions(+), 21 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -25,6 +25,7 @@ struct apic_chip_data {
 	struct irq_cfg		cfg;
 	cpumask_var_t		domain;
 	cpumask_var_t		old_domain;
+	cpumask_var_t		cleanup_mask;
 	u8			move_in_progress : 1;
 };

@@ -83,7 +84,11 @@ static struct apic_chip_data *alloc_apic
 		goto out_data;
 	if (!zalloc_cpumask_var_node(&data->old_domain, GFP_KERNEL, node))
 		goto out_domain;
+	if (!zalloc_cpumask_var_node(&data->cleanup_mask, GFP_KERNEL, node))
+		goto out_old;
 	return data;
+out_old:
+	free_cpumask_var(data->old_domain);
 out_domain:
 	free_cpumask_var(data->domain);
 out_data:
@@ -96,6 +101,7 @@ static void free_apic_chip_data(struct a
 	if (data) {
 		free_cpumask_var(data->domain);
 		free_cpumask_var(data->old_domain);
+		free_cpumask_var(data->cleanup_mask);
 		kfree(data);
 	}
 }
@@ -118,7 +124,7 @@ static int __assign_irq_vector(int irq,
 	static int current_offset = VECTOR_OFFSET_START % 16;
 	int cpu, err;

-	if (d->move_in_progress)
+	if (d->move_in_progress || !cpumask_empty(d->cleanup_mask))
 		return -EBUSY;

 	/* Only try and allocate irqs on cpus that are present */
@@ -511,20 +517,11 @@ static struct irq_chip lapic_controller
 #ifdef CONFIG_SMP
 static void __send_cleanup_vector(struct apic_chip_data *data)
 {
-	cpumask_var_t cleanup_mask;
-
-	if (unlikely(!alloc_cpumask_var(&cleanup_mask, GFP_ATOMIC))) {
-		unsigned int i;
-
-		for_each_cpu_and(i, data->old_domain, cpu_online_mask)
-			apic->send_IPI_mask(cpumask_of(i),
-					    IRQ_MOVE_CLEANUP_VECTOR);
-	} else {
-		cpumask_and(cleanup_mask, data->old_domain, cpu_online_mask);
-		apic->send_IPI_mask(cleanup_mask,
Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt
On Wed, 25 Nov 2015, Thomas Gleixner wrote:
> The problem is actually in the vector assignment code.
>
> > [001]    22.936764: __assign_irq_vector : cpu 44 : vector=134 -> 0x88102a8196f8
>
> No interrupt happened so far. So nothing cleans up the vector on cpu 1
>
> > [044]    61.670267: __assign_irq_vector : cpu 34 : vector=123 -> 0x88102a8196f8
>
> Now that moves it from 44 to 34 and ignores that cpu 1 still has the
> vector assigned. __assign_irq_vector unconditionally overwrites
> data->old_domain, so the bit of cpu 1 is lost
>
> I'm staring into the code to figure out a fix

Just to figure out that my analysis was completely wrong.

    __assign_irq_vector()
    {
	if (d->move_in_progress)
		return -EBUSY;
	...

So that cannot happen. Now the question is:

> [001]    22.936764: __assign_irq_vector : cpu 44 : vector=134 -> 0x88102a8196f8

So CPU1 still sees data->move_in_progress

  [001]    54.636722: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8

And why does __assign_irq_vector not see it, but no cleanup vector was
received by cpu1 with data->move_in_progress == 0.

> [044]    61.670267: __assign_irq_vector : cpu 34 : vector=123 -> 0x88102a8196f8

Ah.

    __send_cleanup_vector()
    {
	send_IPI();
	move_in_progress = 0;
    }

So if CPU1 gets the IPI _BEFORE_ move_in_progress is set to 0, and
does not get another IPI before the next move ... That has been that
way forever.

Duh. Working on a real fix this time.

Thanks,

	tglx
Re: irq_desc use-after-free in smp_irq_move_cleanup_interrupt
Joe,

On Mon, 23 Nov 2015, Joe Lawrence wrote:

Nice detective work!

> The irq_desc is in R15: 88102a8196f8
>
> This irq_desc is no longer allocated, it's been filled with the slub debug
> poison pattern (hence the spinlock is stuck):
>
> The irq vector is in RBX: 0091
>
> But from a custom crash gdb script, the freed irq_desc can still be found
> in CPU 1's vector_irq[145]:
>
>   cpu[1] vector_irq[145] irq_desc @ 0x88102a8196f8
>
> Sifting through a heavily grepped and abbreviated trace log:
>
> - The irq moved from CPU 1, vector 145 to CPU 44, vector 134
> - Both CPUs skip cleaning up their vector_irq[] entries for this irq
>   because data->move_in_progress is set (is this normal?)

Yes. The cleanup interrupt is not targeting a particular vector.

> - The irq moves again to CPU 34, vector 123
> - The underlying device is removed @ 593.106514 jiffies
> - The irq_desc is freed @ 593.239879 jiffies
> - CPU 1 is last heard from @ 1022.830083 jiffies
>
> [001]    22.936764: __assign_irq_vector : cpu 44 : vector=134 -> 0x88102a8196f8
> [044]    35.209945: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=134 0x88102a8196f8
> [001]    35.212370: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8
> [044]    35.674114: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=134 0x88102a8196f8
> [001]    35.675395: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8
> [044]    36.265716: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=134 0x88102a8196f8
> [044]    36.517785: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=134 0x88102a8196f8
> [001]    36.520115: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8
> [001]    54.636651: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8
> [001]    54.636722: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8

So I assume, that no interrupt happened so far.

> [044]    61.670267: __assign_irq_vector : cpu 34 : vector=123 -> 0x88102a8196f8

Now it's moved again.

> [001]    61.670979: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8
> [044]    61.696120: smp_irq_move_cleanup_interrupt : cpu (this) : vector=134 0x88102a8196f8 -> (nil)
> [034]    69.086107: smp_irq_move_cleanup_interrupt : vector == data->cfg.vector && ... : vector=123 0x88102a8196f8
> [000]   593.239811: clear_irq_vector : 1 : cpu 34 : vector=123 0x88102a8196f8 -> (nil)
> [000]   593.239879: free_desc : desc @ 0x88102a8196f8
> [001]  1022.830083: smp_irq_move_cleanup_interrupt : IS_ERR_OR_NULL : vector=144 (nil)

Ok. Here is the issue. So I assume the following happens:

CPUxx:  send_IPI(CLEANUP)
CPU1:   smp_irq_move_cleanup_interrupt
          sees data->move_in_progress
CPUxx:  device_irq happens
          data->move_in_progress = 0
          send_IPI(CLEANUP)
CPU1:   Does not receive another cleanup IPI
        for whatever reason. See below.
CPU44:  smp_irq_move_cleanup_interrupt
          vector = NULL
CPU0:   free_vector on CPU34
          Does not clean CPU1 because
          data->move_in_progress is not set.
        free_desc
        send_IPI(CLEANUP)
CPU1:   smp_irq_move_cleanup_interrupt
          Accesses random memory

> Prior to a782a7e46bb5 "x86/irq: Store irq descriptor in vector array", the
> vector_irq array held irq values, those interested in the irq_desc would
> have to perform their own irq_to_desc() (ie, a radix_tree_lookup of the
> irq_desc_tree). The radix tree is updated in free_desc(), so any
> subsequent lookups would fail.

Yes, that above race has existed forever and the irq_to_desc() check
papered over it. In case that the irq number was reassigned the thing
operated on the wrong descriptor. Pretty bad as well as it fiddles with
the wrong bits. Though it clears the vector ...

The problem is actually in the vector assignment code.

> [001]    22.936764: __assign_irq_vector : cpu 44 : vector=134 -> 0x88102a8196f8

No interrupt happened so far. So nothing cleans up the vector on cpu 1

> [044]    61.670267: __assign_irq_vector : cpu 34 : vector=123 -> 0x88102a8196f8

Now that moves it from 44 to 34 and ignores that cpu 1 still has the
vector assigned. __assign_irq_vector unconditionally overwrites
data->old_domain, so the bit of cpu 1 is lost

I'm staring into the code to figure out a fix
irq_desc use-after-free in smp_irq_move_cleanup_interrupt
Hi Thomas,

I've been chasing down a use-after-free on an irq_desc structure during
repeated device removal testing that crashes 4.3 in
smp_irq_move_cleanup_interrupt. So far I have a bunch of crashes and some
data gleaned from instrumenting the kernel with trace logging. Details to
follow, but in short, I think I'm hitting the use-after-free because of
a782a7e46bb5 "x86/irq: Store irq descriptor in vector array".

During the tests, I usually see RCU stalls (sometimes a crash later) --
when looking at a vmcore, the CPU stall is here:

crash> bt 881038ee4200
PID: 0      TASK: 881038ee4200  CPU: 1   COMMAND: "swapper/1"
 [exception RIP: _raw_spin_lock+0x10]
    RIP: 81672a10          RSP: 88103f843f70      RFLAGS: 0046
    RAX:                   RBX: 0091              RCX: 0001
    RDX: 0001              RSI: 0001              RDI: 81d36a88
    RBP: 88103f843fa8       R8: 0101               R9: 0f78
    R10: 0010              R11: 8810384ccf8c      R12: 0091
    R13: 88102a819794      R14: b5e8              R15: 88102a8196f8
    CS: 0010  SS: 0018
 #0 [88103f843f70] smp_irq_move_cleanup_interrupt at 8105319b
 #1 [88103f843fb0] irq_move_cleanup_interrupt at 81673a04
--- <IRQ stack> ---
 #2 [881038f0fd90] irq_move_cleanup_interrupt at 81673a04
 [exception RIP: unknown or invalid address]
    RIP: 001d              RSP:                   RFLAGS: 881038f1
    RAX: 0001              RBX: 881038f0c000      RCX: 0005
    RDX: 000a087a          RSI: 00051ca2          RDI: 0018
    RBP: 88103f84f100       R8: 881038f0fea0       R9: e8f42378
    R10: 0002              R11: 00ee954efd02      R12: 810e5060
    R13: 881038f0fdc8      R14: 88103f84f500      R15: 88103f84f100
    ORIG_RAX: 88103f856c40  CS: 8000a044  SS: ffdf
bt: WARNING: possibly bogus exception frame
 #3 [881038f0fe38] cpuidle_enter_state at 8151e278
 #4 [881038f0fea8] cpuidle_enter at 8151e417
 #5 [881038f0feb8] call_cpuidle at 810bef42
 #6 [881038f0fed0] cpu_startup_entry at 810bf1dc
 #7 [881038f0ff28] start_secondary at 8104e9e0

The irq_desc is in R15: 88102a8196f8

crash> kmem 88102a8196f8
CACHE         NAME          OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
88103f408c80  kmalloc-512       512       5292   5460    140    32k
  SLAB          MEMORY        NODE  TOTAL  ALLOCATED  FREE
  ea0040aa0600  88102a818000     0     39         17    22
  FREE / [ALLOCATED]
  88102a8196f8
      PAGE  PHYSICAL  MAPPING  INDEX  CNT  FLAGS
  ea0040aa0640  102a81900000 0  2f80008000  tail

This irq_desc is no longer allocated, it's been filled with the slub debug
poison pattern (hence the spinlock is stuck):

crash> rd 88102a8196f8 0x28
88102a8196f8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819708:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819718:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819728:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819738:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819748:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819758:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819768:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819778:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819788:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819798:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a8197a8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a8197b8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a8197c8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a8197d8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a8197e8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a8197f8:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819808:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819818:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b
88102a819828:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b

The irq vector is in RBX: 0091

crash> p/d 0x91
$1 = 145

The irq_desc_tree entry for vector 0x91 appears to have been updated to
something new (the device re-add):

crash> tree -t radix irq_desc_tree -s irq_desc | grep 'irq = 0x91' -B11 | grep -e '^' -e 'irq ='
88102fb144e8
  irq = 0x91,

But from a custom crash gdb script, the freed irq_desc can still be found
in CPU 1's vector_irq[145]:

  cpu[1] vector_irq[145] irq_desc @ 0x88102a8196f8

Sifting through a heavily grepped and abbreviated trace log:

- The irq moved from CPU 1, vector 145 to CPU 44, vector 134
- Both CPUs skip cleaning up their vector_irq[] entries for this irq
  because data->move_in_progress is set (is this normal?)
- The irq moves again to CPU 34, vector 123
- The underlying device is removed @ 593.106514 jiffies
- The irq_desc is freed @ 593.239879 jiffies
- CPU 1 is last heard from @ 1022.830083 jiffies

[001]    22.936764: __assign_irq_vector : cpu 44 : vector=134 -> 0x88102a8196f8
[044]    35.209945: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=134 0x88102a8196f8
[001]    35.212370: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8
[044]    35.674114: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=134 0x88102a8196f8
[001]    35.675395: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8
[044]    36.265716: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=134 0x88102a8196f8
[044]    36.517785: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=134 0x88102a8196f8
[001]    36.520115: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8
[001]    54.636651: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8
[001]    54.636722: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8
[044]    61.670267: __assign_irq_vector : cpu 34 : vector=123 -> 0x88102a8196f8
[001]    61.670979: smp_irq_move_cleanup_interrupt : data->move_in_progress : vector=145 0x88102a8196f8
[044]    61.696120: smp_irq_move_cleanup_interrupt : cpu (this) : vector=134 0x88102a8196f8 -> (nil)
[034]    69.086107: smp_irq_move_cleanup_interrupt : vector == data->cfg.vector && ... : vector=123 0x88102a8196f8
[000]   593.239811: clear_irq_vector : 1 : cpu 34 : vector=123 0x88102a8196f8 -> (nil)
[000]   593.239879: free_desc : desc @ 0x88102a8196f8
[001]  1022.830083: smp_irq_move_cleanup_interrupt : IS_ERR_OR_NULL : vector=144 (nil)

Prior to a782a7e46bb5 "x86/irq: Store irq descriptor in vector array", the
vector_irq array held irq values, those interested in the irq_desc would
have to perform their own irq_to_desc() (ie, a radix_tree_lookup of the
irq_desc_tree). The radix tree is updated in free_desc(), so any
subsequent lookups would fail.