Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg
On Fri, Sep 14, 2018 at 09:57:48PM +0100, Al Viro wrote: > On Fri, Sep 14, 2018 at 01:35:06PM -0700, Darren Hart wrote: > > > Acked-by: Darren Hart (VMware) > > > > As for a longer term solution, would it be possible to init fops in such > > a way that the compat_ioctl call defaults to generic_compat_ioctl_ptrarg > > so we don't have to duplicate this boilerplate for every ioctl fops > > structure? > > Bad idea, that... Because several years down the road somebody will add > an ioctl that takes an unsigned int for argument. Without so much as looking > at your magical mystery macro being used to initialize file_operations. Fair, being explicit in the declaration as it is currently may be preferable then. -- Darren Hart VMware Open Source Technology Center
Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg
On Wed, Sep 12, 2018 at 05:08:52PM +0200, Arnd Bergmann wrote: > The .ioctl and .compat_ioctl file operations have the same prototype so > they can both point to the same function, which works great almost all > the time when all the commands are compatible. > > One exception is the s390 architecture, where a compat pointer is only > 31 bit wide, and converting it into a 64-bit pointer requires calling > compat_ptr(). Most drivers here will ever run in s390, but since we now > have a generic helper for it, it's easy enough to use it consistently. > > I double-checked all these drivers to ensure that all ioctl arguments > are used as pointers or are ignored, but are not interpreted as integer > values. > > Signed-off-by: Arnd Bergmann > --- ... > drivers/platform/x86/wmi.c | 2 +- ... > static void link_event_work(struct work_struct *work) > diff --git a/drivers/platform/x86/wmi.c b/drivers/platform/x86/wmi.c > index 04791ea5d97b..e4d0697e07d6 100644 > --- a/drivers/platform/x86/wmi.c > +++ b/drivers/platform/x86/wmi.c > @@ -886,7 +886,7 @@ static const struct file_operations wmi_fops = { > .read = wmi_char_read, > .open = wmi_char_open, > .unlocked_ioctl = wmi_ioctl, > - .compat_ioctl = wmi_ioctl, > + .compat_ioctl = generic_compat_ioctl_ptrarg, > }; For platform/drivers/x86: Acked-by: Darren Hart (VMware) As for a longer term solution, would it be possible to init fops in such a way that the compat_ioctl call defaults to generic_compat_ioctl_ptrarg so we don't have to duplicate this boilerplate for every ioctl fops structure? -- Darren Hart VMware Open Source Technology Center
Re: [PATCH 1/1] futex: remove duplicated code and fix UB
On Wed, Jun 21, 2017 at 01:53:18PM +0200, Jiri Slaby wrote: > There is code duplicated over all architecture's headers for > futex_atomic_op_inuser. Namely op decoding, access_ok check for uaddr, > and comparison of the result. > > Remove this duplication and leave up to the arches only the needed > assembly which is now in arch_futex_atomic_op_inuser. > > This effectively distributes the Will Deacon's arm64 fix for undefined > behaviour reported by UBSAN to all architectures. The fix was done in > commit 5f16a046f8e1 (arm64: futex: Fix undefined behaviour with > FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump. > > Note that s390 removed access_ok check in d12a29703 ("s390/uaccess: > remove pointless access_ok() checks") as access_ok there returns true. > We introduce it back to the helper for the sake of simplicity (it gets > optimized away anyway). > This required a minor manual merge for ARM on the tip of Linus' tree today. The reduced duplication is a welcome improvement. Reviewed-by: Darren Hart (VMware) <dvh...@infradead.org> -- Darren Hart VMware Open Source Technology Center
Re: [PATCH 3/8] selftests: futex: override clean in lib.mk to fix warnings
On Fri, Apr 21, 2017 at 05:14:45PM -0600, Shuah Khan wrote: > Add override for lib.mk clean to fix the following warnings from clean > target run. > > Makefile:36: warning: overriding recipe for target 'clean' > ../lib.mk:55: warning: ignoring old recipe for target 'clean' > > Signed-off-by: Shuah Khan <shua...@osg.samsung.com> > --- > tools/testing/selftests/futex/Makefile | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/tools/testing/selftests/futex/Makefile > b/tools/testing/selftests/futex/Makefile > index c8095e6..e2fbb89 100644 > --- a/tools/testing/selftests/futex/Makefile > +++ b/tools/testing/selftests/futex/Makefile > @@ -32,9 +32,10 @@ override define EMIT_TESTS > echo "./run.sh" > endef > > -clean: > +override define CLEAN > for DIR in $(SUBDIRS); do \ > BUILD_TARGET=$(OUTPUT)/$$DIR; \ > mkdir $$BUILD_TARGET -p; \ > make OUTPUT=$$BUILD_TARGET -C $$DIR $@;\ > done > +endef Taking the move of clean into lib.mk as a given, Acked-by: Darren Hart (VMware) <dvh...@infradead.org> -- Darren Hart VMware Open Source Technology Center
Re: [RFC 07/24] x86/thinkpad_acpi: Use arch_nvram_ops methods instead of nvram_read_byte() and nvram_write_byte()
On Wed, Jun 03, 2015 at 07:37:13AM -0300, Henrique de Moraes Holschuh wrote: On Wed, Jun 3, 2015, at 00:34, Darren Hart wrote: On Tue, Jun 02, 2015 at 07:09:28AM -0300, Henrique de Moraes Holschuh wrote: Test results were sent to me privately, and they are correct, so... Finn, unless there is some compelling reason not to - like they are MBs worth of data, please submit these to the list in the future so we have them for reference. After I told him which exact bitmask to use on a T43 to test hotkey_source_mask, his test results can be summarized as I could see no difference in behavior, which is *exactly* what I expected to happen. If anything went wrong with the thinkpad-acpi NVRAM code, you'd notice a very large change in behavior (typical: hotkeys don't work, less typical: random hotkey keypresses, hotkey press bursts, low responsivity of hotkeys). Perfect, thanks for the update so we have it recorded here on the list. -- Darren Hart Intel Open Source Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 07/24] x86/thinkpad_acpi: Use arch_nvram_ops methods instead of nvram_read_byte() and nvram_write_byte()
On Tue, Jun 02, 2015 at 07:09:28AM -0300, Henrique de Moraes Holschuh wrote: Test results were sent to me privately, and they are correct, so... Finn, unless there is some compelling reason not to - like they are MBs worth of data, please submit these to the list in the future so we have them for reference. Acked-by: Henrique de Moraes Holschuh h...@hmh.eng.br I'm fine with the changes, but they need to be submitted with the other changes as this one change cannot compile independently in my tree. Finn, please work with whomever is pulling the series to include this in their pull request. Reviewed-by: Darren Hart dvh...@linux.intel.com -- Darren Hart Intel Open Source Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] drivers: depend on instead of select BACKLIGHT_CLASS_DEVICE and ACPI_VIDEO
On Sat, Oct 18, 2014 at 12:13:23AM +0300, Jani Nikula wrote: Documentation/kbuild/kconfig-language.txt warns to use select with care, and in general use select only for non-visible symbols and for symbols with no dependencies, because select will force a symbol to a value without visiting the dependencies. Select has become particularly problematic, interdependently, with the BACKLIGHT_CLASS_DEVICE and ACPI_VIDEO configs. For example: scripts/kconfig/conf --randconfig Kconfig KCONFIG_SEED=0x48312B00 warning: (DRM_RADEON DRM_NOUVEAU DRM_I915 DRM_GMA500 DRM_SHMOBILE DRM_TILCDC FB_BACKLIGHT FB_MX3 USB_APPLEDISPLAY FB_OLPC_DCON ASUS_LAPTOP SONY_LAPTOP THINKPAD_ACPI EEEPC_LAPTOP ACPI_CMPC SAMSUNG_Q10) selects BACKLIGHT_CLASS_DEVICE which has unmet direct dependencies (HAS_IOMEM BACKLIGHT_LCD_SUPPORT) warning: (DRM_RADEON DRM_NOUVEAU DRM_I915 DRM_GMA500 DRM_SHMOBILE DRM_TILCDC FB_BACKLIGHT FB_MX3 USB_APPLEDISPLAY FB_OLPC_DCON ASUS_LAPTOP SONY_LAPTOP THINKPAD_ACPI EEEPC_LAPTOP ACPI_CMPC SAMSUNG_Q10) selects BACKLIGHT_CLASS_DEVICE which has unmet direct dependencies (HAS_IOMEM BACKLIGHT_LCD_SUPPORT) With tristates it's possible to end up selecting FOO=y depending on BAR=m in the config, which gets discovered at build time, not config time, like reported in the thread referenced below. Do the following to fix the dependencies: * Depend on instead of select BACKLIGHT_CLASS_DEVICE everywhere. Drop select BACKLIGHT_LCD_SUPPORT in such cases, as it's a dependency of BACKLIGHT_CLASS_DEVICE. * Remove config FB_BACKLIGHT altogether, and replace it with a dependency on BACKLIGHT_CLASS_DEVICE. All configs selecting FB_BACKLIGHT select or depend on FB anyway, so we can simplify. * Depend on (ACPI ACPI_VIDEO) || ACPI=n in several places instead of selecting ACPI_VIDEO and a number of its dependencies if ACPI is enabled. This is tied to backlight, as ACPI_VIDEO depends on BACKLIGHT_CLASS_DEVICE. * Replace a couple of select INPUT/VT with depends as it seemed to be necessary. Reference: http://lkml.kernel.org/r/ca+r1zhhmt4drwtf6mbrqo5eqxwx+lxcqh15vsu_d9wpftlh...@mail.gmail.com Reported-by: Jim Davis jim.ep...@gmail.com Cc: Randy Dunlap rdun...@infradead.org Cc: David Airlie airl...@linux.ie Cc: Daniel Vetter daniel.vet...@intel.com Cc: Greg Kroah-Hartman gre...@linuxfoundation.org Cc: Darren Hart dvh...@infradead.org Cc: Laurent Pinchart laurent.pinch...@ideasonboard.com Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Jens Frederich jfreder...@gmail.com Cc: Daniel Drake d...@laptop.org Cc: Jon Nettleton jon.nettle...@gmail.com Cc: Jean-Christophe Plagniol-Villard plagn...@jcrosoft.com Cc: Tomi Valkeinen tomi.valkei...@ti.com Signed-off-by: Jani Nikula jani.nik...@intel.com As for the drivers/platform/x86/Kconfig: Acked-by: Darren Hart dvh...@linux.intel.com Thanks, -- Darren Hart Intel Open Source Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty young
Obviously no objection from the futex side of things, looks good. Couple nits on the function comment: On 07/18/2011 09:29 PM, Benjamin Herrenschmidt wrote: ... -/** +/* + * fixup_user_fault() - manually resolve a user page fault s/ fault/ fault/ + * @tsk: the task_struct to use for page fault accounting, or + * NULL if faults are not to be recorded. + * @mm: mm_struct of target mm + * @address: user address + * @fault_flags:flags to pass down to handle_mm_fault() + * + * This is meant to be called in the specific scenario where for + * locking reasons we try to access user memory in atomic context + * (within a pagefault_disable() section), this returns -EFAULT, + * and we want to resolve the user fault before trying again. + * + * Typically this is meant to be used by the futex code. + * + * The main difference with get_user_pages() is that this function + * will unconditionally call handle_mm_fault() which will in turn + * perform all the necessary SW fixup of the dirty and young bits + * in the PTE, while handle_mm_fault() only guarantees to update + * these in the struct page. + * + * This is important for some architectures where those bits also + * gate the access permission to the page because their are s/their/they/ Thanks, -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
] [c005b0d8] .probe_hcall_entry+0x38/0x94 [c0008e776e80] [c004ef70] .__trace_hcall_entry+0x80/0xd4 [c0008e776f10] [c004f96c] .plpar_hcall_norets+0x50/0xd0 [c0008e776f80] [c0055044] .smp_xics_message_pass+0x110/0x244 [c0008e777030] [c0034824] .smp_send_reschedule+0x5c/0x78 [c0008e7770c0] [c006a6bc] .resched_task+0xb4/0xd8 [c0008e777150] [c006a840] .check_preempt_curr_idle+0x2c/0x44 [c0008e7771e0] [c007a868] .try_to_wake_up+0x460/0x548 [c0008e7772b0] [c007a98c] .default_wake_function+0x3c/0x54 [c0008e777350] [c00ab3b0] .autoremove_wake_function+0x44/0x84 [c0008e777400] [c006449c] .__wake_up_common+0x7c/0xf4 [c0008e7774c0] [c006a5d8] .__wake_up+0x60/0x90 [c0008e777570] [c00810dc] .printk_tick+0x84/0xa8 [c0008e777600] [c0095c90] .update_process_times+0x64/0xa4 [c0008e7776a0] [c00bdcec] .tick_sched_timer+0xd0/0x120 [c0008e50] [c00afe7c] .__run_hrtimer+0x1a0/0x29c [c0008e777800] [c00b02a4] .hrtimer_interrupt+0x124/0x278 [c0008e777900] [c002ea90] .timer_interrupt+0x1dc/0x2e4 [c0008e7779a0] [c0003700] decrementer_common+0x100/0x180 --- Exception: 901 at .raw_local_irq_restore+0x48/0x54 LR = .cpu_idle+0x12c/0x208 [c0008e777c90] [c0b21130] ppc_md+0x0/0x240 (unreliable) [c0008e777cd0] [c0016a60] .cpu_idle+0x120/0x208 [c0008e777d70] [c069dbec] .start_secondary+0x394/0x3d4 [c0008e777e30] [c0008278] .start_secondary_resume+0x10/0x14 Instruction dump: 409e0034 78630620 2ba300f4 40fd0028 4bd0b711 6000 2fa3 419e0018 e93e8880 8009 2f80 409e0008 0fe0 383f0090 e8010010 7c0803a6 BUG: scheduling while atomic: swapper/0/0x00f0 Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan] Call Trace: [c0008e7779a0] [c0014280] .show_stack+0xd8/0x218 (unreliable) [c0008e777a80] [c06980b0] .dump_stack+0x28/0x3c [c0008e777b00] [c0071954] .__schedule_bug+0x94/0xb8 [c0008e777b90] [c068db40] .schedule+0xf8/0xbf4 [c0008e777cd0] [c0016b34] .cpu_idle+0x1f4/0x208 [c0008e777d70] [c069dbec] .start_secondary+0x394/0x3d4 [c0008e777e30] [c0008278] .start_secondary_resume+0x10/0x14 I'll spend tomorrow collecting traces and trying to see who's groping who this time... -- Darren -- Steve if (get_cpu_current_state(lcpu) == CPU_STATE_INACTIVE) goto out; +/* Check to see if the CPU out of FW already for kexec */ +if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){ +cpumask_set_cpu(lcpu, of_spin_mask); +return 1; +} + /* * If the RTAS start-cpu token does not exist then presume the * cpu is already spinning. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 09/02/2010 04:04 PM, Michael Neuling wrote: In message 1283400367.2356.69.ca...@gandalf.stny.rr.com you wrote: On Thu, 2010-09-02 at 11:02 +1000, Michael Neuling wrote: We need to call smp_startup_cpu on boot when we the cpus are still in FW. smp_startup_cpu does this for us on boot. I'm wondering if we just need to move the test down a bit to make sure the preempt_count is set. I've not been following this thread, but maybe this might work? Egad no! Setting the preempt_count to zero _is_ the bug. I think Darren even said that adding the exit prevented the bug (although now he's hitting a hard lockup someplace else). The original code he was using did not have the condition to return for kexec. It was just a coincidence that this code helped in bringing a CPU back online. Untested patch below... Mikey diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/ pseries/smp.c index 0317cce..3afaba4 100644 --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu) pcpu = get_hard_smp_processor_id(lcpu); - /* Check to see if the CPU out of FW already for kexec */ - if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){ - cpumask_set_cpu(lcpu, of_spin_mask); - return 1; - } - /* Fixup atomic count: it exited inside IRQ handler. */ task_thread_info(paca[lcpu].__current)-preempt_count = 0; We DON'T want to do the above. It's nasty! This is one CPU's task touching an intimate part of another CPU's task. It's equivalent of me putting my hand down you wife's blouse. It's offensive, and rude. OK, if the CPU was never online, then you can do what you want. But what we see is that this fails on CPU hotplug. You stop a CPU, and it goes into this cede_processor() call. When you wake it up, suddenly the task on that woken CPU has its preempt count fscked up. This was really really hard to debug. We thought it was stack corruption or something. But it ended up being that this code has one CPU touching the breasts of another CPU. This code is a pervert! What the trace clearly showed, was that we take down a CPU, and in doing so, the code on that CPU set the preempt count to 1, and it expected to have it as 1 when it returned. But the code that kicked started this CPU back to life (bring the CPU back online), set the preempt count on the task of that CPU to 0, and screwed everything up. /me goes to checks where this came from... It's been in the kernel since hotplug CPU support was added to ppc64 back in 2004, so I guess we are all at fault for letting this pervert get away with this stuff for so long in plain sight. :-) So I guess we should remove this but we need to audit all the different paths that go through here to make sure they are OK with preempt. Normal boot, kexec boot, hotplug with FW stop and hotplug with extended_cede all hit this. Mikey CC'ing my alter ego. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 08/31/2010 10:54 PM, Michael Ellerman wrote: On Tue, 2010-08-31 at 00:12 -0700, Darren Hart wrote: .. When running with the function plugin I had to stop the trace immediately before entering start_secondary after an online or my traces would not include the pseries_mach_cpu_die function, nor the tracing I added there (possibly buffer size, I am using 2048). The following trace was collected using trace-cmd record -p function -e irq -e sched and has been filtered to only show CPU [001] (the CPU undergoing the offline/online test, and the one seeing preempt_count (pcnt) go to after cede. The function tracer does not indicate anything running on the CPU other than the HCALL - unless the __trace_hcall* commands might be to blame. It's not impossible. Though normally they're disabled right, so the only reason they're running is because you're tracing. So if they are causing the bug then that doesn't explain why you see it normally. Still, might be worth disabling just the hcall tracepoints just to be 100% sure. A couple of updates. I was working from tip/rt/head, which has been stale for some months now. I switched to tip/rt/2.6.33 and the preempt_count() change over cede went away. I now see the live hang that Will has reported. Before I dive into the live hang, I want to understand what fixed the preempt_count() change. That might start pointing us in the right direction for the live hang. I did an inverted git bisect between tip/rt/head and tip/rt/2.6.33 to try and locate the fix. I've narrowed it down to the 2.6.33.6 merge: # git show 7e1af1172bbd4109d09ac515c5d376f633da7cff commit 7e1af1172bbd4109d09ac515c5d376f633da7cff Merge: d8e94db 9666790 Author: Thomas Gleixner t...@linutronix.de Date: Tue Jul 13 16:01:16 2010 +0200 Merge git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.33.y Conflicts: Makefile Signed-off-by: Thomas Gleixner t...@linutronix.de Visual inspection yields two patches of interest: f8b67691828321f5c85bb853283aa101ae673130 powerpc/pseries: Make query-cpu-stopped callable outside hotplug cpu aef40e87d866355ffd279ab21021de733242d0d5 powerpc/pseries: Only call start-cpu when a CPU is stopped I'm going to try reverting these today and see if they addressed the issue indirectly. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 09/01/2010 08:10 AM, Darren Hart wrote: On 08/31/2010 10:54 PM, Michael Ellerman wrote: On Tue, 2010-08-31 at 00:12 -0700, Darren Hart wrote: .. When running with the function plugin I had to stop the trace immediately before entering start_secondary after an online or my traces would not include the pseries_mach_cpu_die function, nor the tracing I added there (possibly buffer size, I am using 2048). The following trace was collected using trace-cmd record -p function -e irq -e sched and has been filtered to only show CPU [001] (the CPU undergoing the offline/online test, and the one seeing preempt_count (pcnt) go to after cede. The function tracer does not indicate anything running on the CPU other than the HCALL - unless the __trace_hcall* commands might be to blame. It's not impossible. Though normally they're disabled right, so the only reason they're running is because you're tracing. So if they are causing the bug then that doesn't explain why you see it normally. Still, might be worth disabling just the hcall tracepoints just to be 100% sure. A couple of updates. I was working from tip/rt/head, which has been stale for some months now. I switched to tip/rt/2.6.33 and the preempt_count() change over cede went away. I now see the live hang that Will has reported. Before I dive into the live hang, I want to understand what fixed the preempt_count() change. That might start pointing us in the right direction for the live hang. I did an inverted git bisect between tip/rt/head and tip/rt/2.6.33 to try and locate the fix. I've narrowed it down to the 2.6.33.6 merge: # git show 7e1af1172bbd4109d09ac515c5d376f633da7cff commit 7e1af1172bbd4109d09ac515c5d376f633da7cff Merge: d8e94db 9666790 Author: Thomas Gleixner t...@linutronix.de Date: Tue Jul 13 16:01:16 2010 +0200 Merge git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.33.y Conflicts: Makefile Signed-off-by: Thomas Gleixner t...@linutronix.de Visual inspection yields two patches of interest: f8b67691828321f5c85bb853283aa101ae673130 powerpc/pseries: Make query-cpu-stopped callable outside hotplug cpu aef40e87d866355ffd279ab21021de733242d0d5 powerpc/pseries: Only call start-cpu when a CPU is stopped I'm going to try reverting these today and see if they addressed the issue indirectly. Removing: aef40e87d866355ffd279ab21021de733242d0d5 powerpc/pseries: Only call start-cpu when a CPU is stopped --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -82,6 +82,12 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu) pcpu = get_hard_smp_processor_id(lcpu); + /* Check to see if the CPU out of FW already for kexec */ This comment is really confusing to me. I _think_ it is saying that this test determines if the CPU is done executing firmware code and has begun executing OS code Is that right? + if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){ + cpu_set(lcpu, of_spin_map); + return 1; + } + /* Fixup atomic count: it exited inside IRQ handler. */ task_thread_info(paca[lcpu].__current)-preempt_count = 0; from tip/rt/2.6.33 causes the preempt_count() to change across the cede call. This patch appears to prevents the proxy preempt_count assignment from happening. This non-local-cpu assignment to 0 would cause an underrun of preempt_count() if the local-cpu had disabled preemption prior to the assignment and then later tried to enable it. This appears to be the case with the stack of __trace_hcall* calls preceeding the return from extended_cede_processor() in the latency format trace-cmd report: idle-0 1d 201.252737: function: .cpu_die idle-0 1d 201.252738: function: .pseries_mach_cpu_die idle-0 1d 201.252740: function: .idle_task_exit idle-0 1d 201.252741: function: .switch_slb idle-0 1d 201.252742: function: .xics_teardown_cpu idle-0 1d 201.252743: function: .xics_set_cpu_priority idle-0 1d 201.252744: function: .__trace_hcall_entry idle-0 1d..1. 201.252745: function: .probe_hcall_entry idle-0 1d..1. 201.252746: function: .__trace_hcall_exit idle-0 1d..2. 201.252747: function:.probe_hcall_exit idle-0 1d 201.252748: function: .__trace_hcall_entry idle-0 1d..1. 201.252748: function: .probe_hcall_entry idle-0 1d..1. 201.252750: function: .__trace_hcall_exit idle-0 1d..2. 201.252751: function:.probe_hcall_exit idle-0 1d 201.252752: function: .__trace_hcall_entry idle-0 1d..1. 201.252753: function
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 09/01/2010 12:59 PM, Steven Rostedt wrote: On Wed, 2010-09-01 at 11:47 -0700, Darren Hart wrote: from tip/rt/2.6.33 causes the preempt_count() to change across the cede call. This patch appears to prevents the proxy preempt_count assignment from happening. This non-local-cpu assignment to 0 would cause an underrun of preempt_count() if the local-cpu had disabled preemption prior to the assignment and then later tried to enable it. This appears to be the case with the stack of __trace_hcall* calls preceeding the return from extended_cede_processor() in the latency format trace-cmd report: idle-0 1d 201.252737: function: .cpu_die Note, the above 1d is a series of values. The first being the CPU, the next if interrupts are disabled, the next if the NEED_RESCHED flag is set, the next is softirqs enabled or disabled, next the preempt_count, and finally the lockdepth count. Here we only care about the preempt_count, which is zero when '.' and a number if it is something else. It is the second to last field in that list. idle-0 1d 201.252738: function: .pseries_mach_cpu_die idle-0 1d 201.252740: function: .idle_task_exit idle-0 1d 201.252741: function: .switch_slb idle-0 1d 201.252742: function: .xics_teardown_cpu idle-0 1d 201.252743: function: .xics_set_cpu_priority idle-0 1d 201.252744: function: .__trace_hcall_entry idle-0 1d..1. 201.252745: function: .probe_hcall_entry ^ preempt_count set to 1 idle-0 1d..1. 201.252746: function: .__trace_hcall_exit idle-0 1d..2. 201.252747: function: .probe_hcall_exit idle-0 1d 201.252748: function: .__trace_hcall_entry idle-0 1d..1. 201.252748: function: .probe_hcall_entry idle-0 1d..1. 201.252750: function: .__trace_hcall_exit idle-0 1d..2. 201.252751: function: .probe_hcall_exit idle-0 1d 201.252752: function: .__trace_hcall_entry idle-0 1d..1. 201.252753: function: .probe_hcall_entry ^ ^ CPU preempt_count Entering the function probe_hcall_entry() the preempt_count is 1 (see below). But probe_hcall_entry does: h = get_cpu_var(hcall_stats)[opcode / 4]; Without doing the put (which it does in probe_hcall_exit()) So exiting the probe_hcall_entry() the prempt_count is 2. The trace_hcall_entry() will do a preempt_enable() making it leave as 1. offon.sh-3684 6. 201.466488: bprint: .smp_pSeries_kick_cpu : resetting pcnt to 0 for cpu 1 This is CPU 6, changing the preempt count from 1 to 0. preempt_count() is reset from 1 to 0 by smp_startup_cpu() without the QCSS_NOT_STOPPED check from the patch above. idle-0 1d 201.466503: function: .__trace_hcall_exit Note: __trace_hcall_exit() and __trace_hcall_entry() basically do: preempt_disable(); call probe(); preempt_enable(); idle-0 1d..1. 201.466505: function: .probe_hcall_exit The preempt_count of 1 entering the probe_hcall_exit() is because of the preempt_disable() shown above. It should have been entered as a 2. But then it does: put_cpu_var(hcall_stats); making preempt_count 0. But the preempt_enable() in the trace_hcall_exit() causes this to be -1. idle-0 1d.Hff. 201.466507: bprint: .pseries_mach_cpu_die : after cede: With the preempt_count() being one less than it should be, the final preempt_enable() in the trace_hcall path drops preempt_count to 0x, which of course is an illegal value and leads to a crash. I'm confused to how this works in mainline? Turns out it didn't. 2.6.33.5 with CONFIG_PREEMPT=y sees this exact same behavior. The following, part of the 2.6.33.6 stable release, prevents this from happening: aef40e87d866355ffd279ab21021de733242d0d5 powerpc/pseries: Only call start-cpu when a CPU is stopped --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -82,6 +82,12 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu) pcpu = get_hard_smp_processor_id(lcpu); + /* Check to see if the CPU out of FW already for kexec */ + if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){ + cpu_set(lcpu, of_spin_map); + return 1; + } + /* Fixup atomic count: it exited inside IRQ handler. */ task_thread_info(paca[lcpu].__current)-preempt_count = 0; The question is now, Is this the right fix? If so, perhaps we can
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 08/19/2010 08:58 AM, Ankita Garg wrote: Hi Darren, On Thu, Jul 22, 2010 at 11:24:13AM -0700, Darren Hart wrote: With some instrumentation we were able to determine that the preempt_count() appears to change across the extended_cede_processor() call. Specifically across the plpar_hcall_norets(H_CEDE) call. On PREEMPT_RT we call this with preempt_count=1 and return with preempt_count=0x. On mainline with CONFIG_PREEMPT=y, the value is different (0x65) but is still incorrect. I was trying to reproduce this issue on a 2.6.33.7-rt29 kernel. I could easily reproduce this on the RT kernel and not the non-RT kernel. This matches my experience. However, I hit it every single time I did a cpu online operation. I also noticed that the issue persists even when I disable H_CEDE by passing the cede_offline=0 kernel commandline parameter. Could you pl confirm if you observe the same in your setup ? Yes, this is my experience as well. However, the issue still remains. Will spend few cycles looking into this issue. I've spent today trying to collect some useful traces. I consistently see the preempt_count() change to 0x across the H_CEDE call, but the irq and sched events (to ftrace) do not indicate anything else running on the CPU that could change that. idle-0 1d.h1. 11408us : irq_handler_entry: irq=16 name=IPI idle-0 1d.h1. 11411us : irq_handler_exit: irq=16 ret=handled ...other cpus idle-0 1d 22101us : .pseries_mach_cpu_die: start idle-0 1d 22108us : .pseries_mach_cpu_die: cpu 1 (hwid 1) ceding for offline with hint 2 ...other cpus idle-0 1d.Hff. 33804us : .pseries_mach_cpu_die: returned from cede with pcnt: idle-0 1d.Hff. 33805us : .pseries_mach_cpu_die: forcing pcnt to 0 idle-0 1d 33807us : .pseries_mach_cpu_die: cpu 1 (hwid 1) returned from cede. idle-0 1d 33808us : .pseries_mach_cpu_die: Decrementer value = 7fa49474 Timebase value = baefee6c88113 idle-0 1d 33809us : .pseries_mach_cpu_die: cpu 1 (hwid 1) got prodded to go online idle-0 1d 33816us : .pseries_mach_cpu_die: calling start_seconday, pcnt: 0 idle-0 1d 33816us : .pseries_mach_cpu_die: forcing pcnt to 0 - Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan] NIP: c006aa50 LR: c006ac40 CTR: c006ac14 REGS: c0008e79ffc0 TRAP: 0300 Not tainted (2.6.33-rt-dvhrt16-02358-g61223dd-dirty) MSR: 80001032 ME,IR,DR CR: 2822 XER: DAR: c18000ba44c0, DSISR: 4000 TASK = c0010e6de040[0] 'swapper' THREAD: c0008e7a CPU: 1 GPR00: 01800040 c0008e7a0240 c0b55008 c0010e6de040 GPR04: c00085d800c0 000f GPR08: 0180 c0008e7a c0ba4480 c0a32c80 GPR12: 80009032 c0ba4680 c0008e7a08f0 0001 GPR16: f2fa c0010e6de040 7fff GPR20: 0001 0001 c0f42c80 GPR24: c000850b7638 0001 GPR28: c0f42c80 c0010e6de040 c0ad7698 c0008e7a0240 NIP [c006aa50] .resched_task+0x48/0xd8 LR [c006ac40] .check_preempt_curr_idle+0x2c/0x44 Call Trace: Instruction dump: f821ff71 7c3f0b78 ebc2ab28 7c7d1b78 6000 6000 e95e8008 e97e8000 e93d0008 81090010 79084da4 38080040 7d4a002a 7c0b502e 7c74 7800d182 ---[ end trace 6d3f1cddaa17382c ]--- Kernel panic - not syncing: Attempted to kill the idle task! Call Trace: Rebooting in 180 seconds.. When running with the function plugin I had to stop the trace immediately before entering start_secondary after an online or my traces would not include the pseries_mach_cpu_die function, nor the tracing I added there (possibly buffer size, I am using 2048). The following trace was collected using trace-cmd record -p function -e irq -e sched and has been filtered to only show CPU [001] (the CPU undergoing the offline/online test, and the one seeing preempt_count (pcnt) go to after cede. The function tracer does not indicate anything running on the CPU other than the HCALL - unless the __trace_hcall* commands might be to blame. Can a POWER guru comment? idle-0 [001] 417.625286: function: .cpu_die idle-0 [001] 417.625287: function: .pseries_mach_cpu_die idle-0 [001] 417.625290: bprint: .pseries_mach_cpu_die : start idle-0 [001] 417.625291: function: .idle_task_exit idle-0 [001] 417.625292: function: .switch_slb idle-0 [001] 417.625294: function: .xics_teardown_cpu idle-0
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 08/19/2010 08:58 AM, Ankita Garg wrote: Hi Darren, On Thu, Jul 22, 2010 at 11:24:13AM -0700, Darren Hart wrote: With some instrumentation we were able to determine that the preempt_count() appears to change across the extended_cede_processor() call. Specifically across the plpar_hcall_norets(H_CEDE) call. On PREEMPT_RT we call this with preempt_count=1 and return with preempt_count=0x. On mainline with CONFIG_PREEMPT=y, the value is different (0x65) but is still incorrect. I was trying to reproduce this issue on a 2.6.33.7-rt29 kernel. I could easily reproduce this on the RT kernel and not the non-RT kernel. However, I hit it every single time I did a cpu online operation. I also noticed that the issue persists even when I disable H_CEDE by passing the cede_offline=0 kernel commandline parameter. Could you pl confirm if you observe the same in your setup ? with the following patches: [r...@igoort1 linux-2.6-combined]# quilt applied patches/0001-wms-fix01.patch patches/powerpc-increase-pseries_cpu_die-delay.patch patches/powerpc-enable-preemption-before-cpu_die.patch patches/powerpc-silence-__cpu_up-under-normal-operation.patch patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch patches/powerpc-wait-for-cpu-to-go-inactive.patch patches/powerpc-disable-decrementer-on-offline.patch patches/powerpc-cpu_die-preempt-hack.patch patches/powerpc-cede-processor-inst.patch patches/irq-preempt-inst.patch patches/disable-decrementer-in-cpu_die.patch patches/powerpc-hard_irq_disable.patch [r...@igoort1 linux-2.6-combined]# quilt unapplied patches/powerpc-debug-replace-cede-with-mdelay.patch patches/powerpc-pad-thread_info.patch applied to tip/rt/head (2.6.33-rt ish) I will see the following crash after 3 or 4 runs: 3Badness at kernel/sched.c:3720 4NIP: c06986f8 LR: c06986dc CTR: c006ec34 4REGS: c0008e7a7e50 TRAP: 0700 Not tainted (2.6.33-rt-dvhrt08-02358-g61223dd-dirty) 4MSR: 80021032 ME,CE,IR,DR CR: 2822 XER: 4TASK = c0010e7080c0[0] 'swapper' THREAD: c0008e7a8000 CPU: 3 4GPR00: c0008e7a80d0 c0b54fa0 0001 4GPR04: 0032 000f 4GPR08: c0008eb68d00 c0ca5420 0001 c0bc22e8 4GPR12: 80009032 c0ba4a80 c0008e7a8a70 0003 4GPR16: f9ba c0010e7080c0 7fff 4GPR20: 0001 0003 c1042c80 4GPR24: c0008eb686a0 0003 0001 4GPR28: 0001 c0ad7628 c0008e7a80d0 4NIP [c06986f8] .sub_preempt_count+0x6c/0xdc 4LR [c06986dc] .sub_preempt_count+0x50/0xdc 4Call Trace: 4Instruction dump: 478290464 80090014 7f80e800 40fc002c 4bd08a99 6000 2fa3 419e0068 4e93e87e0 8009 2f80 409e0058 0fe0 4850 2b9d00fe 41fd0038 1Unable to handle kernel paging request for data at address 0xc18000ba44c0 1Faulting instruction address: 0xc006aae4 This occurs with or without the cede_offline=0 parameter. Also, in a similar experiment which seems to corroborate your results, suggesting the HCEDE call is not necessarily to blame here. I had replaced the HCEDE call with a mdelay(2) and still ran into issues. I didn't see the preempt count change, but I do see the rtmutex.c:684 bug. cpu 0x0: Vector: 300 (Data Access) at [c0b53ef0] pc: c006aa54: .resched_task+0x48/0xd8 lr: c006ac44: .check_preempt_curr_idle+0x2c/0x44 sp: c0b54170 msr: 80001032 dar: c18000ba44c0 dsisr: 4000 current = 0xc0aa1410 paca= 0xc0ba4480 pid = 0, comm = swapper enter ? for help [c0b54200] c006ac44 .check_preempt_curr_idle+0x2c/0x44 [c0b54290] c007b494 .try_to_wake_up+0x430/0x540 [c0b54360] c007b754 .wake_up_process+0x34/0x48 [c0b543f0] c0089fa8 .wakeup_softirqd+0x78/0x9c [c0b54480] c008a2c4 .raise_softirq+0x7c/0xb8 [c0b54520] c00977b0 .run_local_timers+0x2c/0x4c [c0b545a0] c0097828 .update_process_times+0x58/0x9c [c0b54640] c00beb3c .tick_sched_timer+0xd0/0x120 [c0b546f0] c00b08b8 .__run_hrtimer+0x1a0/0x29c [c0b547a0] c00b1258 .hrtimer_interrupt+0x21c/0x394 [c0b548d0] c00304c4 .timer_interrupt+0x1ec/0x2f8 [c0b54980] c0003700 decrementer_common+0x100/0x180 --- Exception: 901 (Decrementer) at c00100f8 .raw_local_irq_restore+0x74/0x8c [c0b54d00] c0017d14 .cpu_idle+0x12c/0x220 [c0b54da0] c06a1768 .start_secondary+0x3d8/0x418 [c0b54e60] c005c1f0 .pseries_mach_cpu_die+0x244/0x318 [c0b54f10] c001e7e0 .cpu_die+0x4c/0x68 [c0b54f90
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 07/22/2010 11:38 AM, Thomas Gleixner wrote: On Thu, 22 Jul 2010, Darren Hart wrote: Also of interest is that this path cpu_idle()-cpu_die()-pseries_mach_cpu_die() to start_secondary() enters with a preempt_count=1 if it wasn't corrupted across the hcall. That triggers the problem as well. preempt_count needs to be 0 when entering start_secondary(). So I really wonder how that ever worked. The early boot path from _start however appears to call start_secondary() with a preempt_count of 0. Which is correct. The following patch is most certainly not correct, but it does eliminate It is correct, but i think it is incomplete as other portions of the thread_info on the stack might be in some weird state as well. Just FYI, I took a look at the stack pointers as well as all the fields in the thread_info struct. The only one that changes is preempt_count. The previous value of preempt_count doesn't impact the value after cede. An initial value of 0, 1, or 4 all result in an after-cede value of 0x. I also added 32 bits of padding on either side of the preempt_count in case the change was accidental - it wasnt', the padded values remained unchanged across the cede call while the preempt_count still changed to 0x. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 08/05/2010 10:09 PM, Vaidyanathan Srinivasan wrote: * Darren Hartdvh...@us.ibm.com [2010-08-05 19:19:00]: On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote: On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote: suggestion I updated the instrumentation to display the local_save_flags and irqs_disabled_flags: Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1 Jul 22 23:36:58 igoort1 kernel: before H_CEDE current-stack: c0010e9e3ce0, pcnt: 1 Jul 22 23:36:58 igoort1 kernel: after H_CEDE current-stack: c0010e9e3ce0, pcnt: 1 I'm not sure if I'm reading that right, but I believe interrupts are intended to be disabled here. If accomplished via the spin_lock_irqsave() this would behave differently on RT. However, this path disables the interrupts handled by xics, all but the IPIs anyway. On RT I disabled the decrementer as well. Is it possible for RT to be receiving other interrupts here? Also you may want to call hard_irq_disable() to -really- disable interrupts ... since we do lazy-disable on powerpc A quick update and request for direction wrt the cede hcall, interrupt handling, and the stack pointer. I've added patches one at a time, eliminating bugs with preempt_rt (tip/rt/head). With my current patchset I have no crashes with a single run of random_online.sh (stress testing to hit the hang will sees is my todo for tomorrow). Current patchset includes: patches/0001-wms-fix01.patch # P7 lazy flushing thing # next four are sent to / queued for mainline patches/powerpc-increase-pseries_cpu_die-delay.patch patches/powerpc-enable-preemption-before-cpu_die.patch patches/powerpc-silence-__cpu_up-under-normal-operation.patch patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch # this one needs to be cleaned up and sent to mainline patches/powerpc-wait-for-cpu-to-go-inactive.patch patches/powerpc-disable-decrementer-on-offline.patch patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after cede patches/powerpc-cede-processor-inst.patch # instrumentation patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede patches/powerpc-pad-thread_info.patch I didn't include all the patches as the relevant bits are included below in code form. With the instrumentation, it's clear the change to preempt_count() is targeted (since I added padding before and after preempt_count in the thread_info struct) and preempt_count is still changed. It is also clear that it isn't just a stray decrement since I set preempt_count() to 4 prior to calling cede and it still is 0x after the hcall. Also note that the stack pointer doesn't change across the cede call and neither does any other part of the thread_info structure. Adding hard_irq_disable() did seem to affect things somewhat. Rather than a serialized list of before/after thread_info prints around cede, I see several befores then several afters. But the preempt_count is still modified. The relevant code looks like this (from pseries_mach_cpu_die()) hard_irq_disable(); /* this doesn't fix things... but does change behavior a bit */ preempt_count() = 0x4; asm(mr %0,1 : =r (sp)); /* stack pointer is in R1 */ printk(before cede: sp=%lx pcnt=%x\n, sp, preempt_count()); print_thread_info(); extended_cede_processor(cede_latency_hint); asm(mr %0,1 : =r (sp)); printk(after cede: sp=%lx pcnt=%x\n, sp, preempt_count()); print_thread_info(); preempt_count() = 0; With these patches applied, the output across cede looks like: before cede: sp=c0b57150 pcnt=4 *** current-thread_info *** ti-task: c0aa1410 ti-exec_domain: c0a59958 ti-cpu: 0 ti-stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */ ti-preempt_count: 4 ti-stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */ ti-local_flags: 0 *** end thread_info *** after cede: sp=c0b57150 pcnt= *** current-thread_info *** ti-task: c0aa1410 ti-exec_domain: c0a59958 ti-cpu: 0 Is this print sufficient to prove that we did not jump CPU? Probably not, but the following should be sufficient. Note that the dump occurs on CPU: 6 which matches the CPU noted in the ti-cpu before and after the cede call. *** current-thread_info *** ti-task: c0008e7b2240 ti-exec_domain: c0a59958 ti-cpu: 6 ti-stomp_on_me: 57005 ti-preempt_count: 4 ti-stomp_on_me_too: 48879 ti-local_flags: 0 *** end thread_info *** [ cut here ] Badness at kernel/sched.c:3698 NIP: c06987e4 LR: c06987c8 CTR: c005c668 REGS: c0010e713800 TRAP: 0700 Not tainted (2.6.33-rt
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote: On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote: suggestion I updated the instrumentation to display the local_save_flags and irqs_disabled_flags: Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1 Jul 22 23:36:58 igoort1 kernel: before H_CEDE current-stack: c0010e9e3ce0, pcnt: 1 Jul 22 23:36:58 igoort1 kernel: after H_CEDE current-stack: c0010e9e3ce0, pcnt: 1 I'm not sure if I'm reading that right, but I believe interrupts are intended to be disabled here. If accomplished via the spin_lock_irqsave() this would behave differently on RT. However, this path disables the interrupts handled by xics, all but the IPIs anyway. On RT I disabled the decrementer as well. Is it possible for RT to be receiving other interrupts here? Also you may want to call hard_irq_disable() to -really- disable interrupts ... since we do lazy-disable on powerpc A quick update and request for direction wrt the cede hcall, interrupt handling, and the stack pointer. I've added patches one at a time, eliminating bugs with preempt_rt (tip/rt/head). With my current patchset I have no crashes with a single run of random_online.sh (stress testing to hit the hang will sees is my todo for tomorrow). Current patchset includes: patches/0001-wms-fix01.patch # P7 lazy flushing thing # next four are sent to / queued for mainline patches/powerpc-increase-pseries_cpu_die-delay.patch patches/powerpc-enable-preemption-before-cpu_die.patch patches/powerpc-silence-__cpu_up-under-normal-operation.patch patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch # this one needs to be cleaned up and sent to mainline patches/powerpc-wait-for-cpu-to-go-inactive.patch patches/powerpc-disable-decrementer-on-offline.patch patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after cede patches/powerpc-cede-processor-inst.patch # instrumentation patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede patches/powerpc-pad-thread_info.patch I didn't include all the patches as the relevant bits are included below in code form. With the instrumentation, it's clear the change to preempt_count() is targeted (since I added padding before and after preempt_count in the thread_info struct) and preempt_count is still changed. It is also clear that it isn't just a stray decrement since I set preempt_count() to 4 prior to calling cede and it still is 0x after the hcall. Also note that the stack pointer doesn't change across the cede call and neither does any other part of the thread_info structure. Adding hard_irq_disable() did seem to affect things somewhat. Rather than a serialized list of before/after thread_info prints around cede, I see several befores then several afters. But the preempt_count is still modified. The relevant code looks like this (from pseries_mach_cpu_die()) hard_irq_disable(); /* this doesn't fix things... but does change behavior a bit */ preempt_count() = 0x4; asm(mr %0,1 : =r (sp)); /* stack pointer is in R1 */ printk(before cede: sp=%lx pcnt=%x\n, sp, preempt_count()); print_thread_info(); extended_cede_processor(cede_latency_hint); asm(mr %0,1 : =r (sp)); printk(after cede: sp=%lx pcnt=%x\n, sp, preempt_count()); print_thread_info(); preempt_count() = 0; With these patches applied, the output across cede looks like: before cede: sp=c0b57150 pcnt=4 *** current-thread_info *** ti-task: c0aa1410 ti-exec_domain: c0a59958 ti-cpu: 0 ti-stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */ ti-preempt_count: 4 ti-stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */ ti-local_flags: 0 *** end thread_info *** after cede: sp=c0b57150 pcnt= *** current-thread_info *** ti-task: c0aa1410 ti-exec_domain: c0a59958 ti-cpu: 0 ti-stomp_on_me: 57005 ti-preempt_count: ti-stomp_on_me_too: 48879 ti-local_flags: 0 *** end thread_info *** Are there any additional thoughts on what might be causing preempt_count to change? I was thinking that cede might somehow force it to 0 (or perhaps one of the preempt_count explicit value setters in irq.c) and then decrement it - but that dec should trigger an error in the dec_preempt_count() as I have CONFIG_DEBUG_PREEMPT=y. ... -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote: On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote: 1) How can the preempt_count() get mangled across the H_CEDE hcall? 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ? The preempt count is on the thread info at the bottom of the stack. Can you check the stack pointers ? Hi Ben, sorry if I didn't get back to you on this already. I checked the stack pointer before and after the cede call and they match. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/4] powerpc: cpu offline/online fixes for CONFIG_PREEMPT
The following patch series addresses several issues detected during intense CPU offline/online testing on the mainline kernel with CONFIG_PREEMPT=y. These patches require the following patch from Brian King: http://patchwork.ozlabs.org/patch/59645/ Tested against linux-2.6.git master with and without CONFIG_DEBUG_PREEMPT. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/3] powerpc-enable-preemption-before-cpu_die
From: Signed-off-by: Darren Hart dvh...@us.ibm.com start_secondary() is called shortly after _start and also via cpu_idle()-cpu_die()-pseries_mach_cpu_die() start_secondary() expects a preempt_count() of 0. pseries_mach_cpu_die() is called via the cpu_idle() routine with preemption disabled, resulting in the following repeating message during rapid cpu offline/online tests with CONFIG_PREEMPT=y: BUG: scheduling while atomic: swapper/0/0x0002 Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan] Call Trace: [c0010e7079c0] [c00133ec] .show_stack+0xd8/0x218 (unreliable) [c0010e707aa0] [c06a47f0] .dump_stack+0x28/0x3c [c0010e707b20] [c006e7a4] .__schedule_bug+0x7c/0x9c [c0010e707bb0] [c0699d9c] .schedule+0x104/0x800 [c0010e707cd0] [c0015b24] .cpu_idle+0x1c4/0x1d8 [c0010e707d70] [c06aa1b4] .start_secondary+0x398/0x3d4 [c0010e707e30] [c0008278] .start_secondary_resume+0x10/0x14 Move the cpu_die() call inside the existing preemption enabled block of cpu_idle(). This is safe as the idle task is affined to a single CPU so the debug_smp_processor_id() tests (from cpu_should_die()) won't trigger as we are in a migration disabled region. Signed-off-by: Darren Hart dvh...@us.ibm.com Acked-by: Will Schmidt will_schm...@vnet.ibm.com Cc: Thomas Gleixner t...@linutronix.de Cc: Nathan Fontenot nf...@austin.ibm.com Cc: Robert Jennings r...@linux.vnet.ibm.com Cc: Brian King brk...@linux.vnet.ibm.com --- arch/powerpc/kernel/idle.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c index 049dda6..39a2baa 100644 --- a/arch/powerpc/kernel/idle.c +++ b/arch/powerpc/kernel/idle.c @@ -94,9 +94,9 @@ void cpu_idle(void) HMT_medium(); ppc64_runlatch_on(); tick_nohz_restart_sched_tick(); + preempt_enable_no_resched(); if (cpu_should_die()) cpu_die(); - preempt_enable_no_resched(); schedule(); preempt_disable(); } -- 1.7.0.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/3] powerpc-silence-__cpu_up-under-normal-operation
From: Signed-off-by: Darren Hart dvh...@us.ibm.com During CPU offline/online tests __cpu_up would flood the logs with the following message: Processor 0 found. This provides no useful information to the user as there is no context provided, and since the operation was a success (to this point) it is expected that the CPU will come back online, providing all the feedback necessary. Change the Processor found message to DBG() similar to other such messages in the same function. Also, add an appropriate log level for the Processor is stuck message. Signed-off-by: Darren Hart dvh...@us.ibm.com Acked-by: Will Schmidt will_schm...@vnet.ibm.com Cc: Thomas Gleixner t...@linutronix.de Cc: Nathan Fontenot nf...@austin.ibm.com Cc: Robert Jennings r...@linux.vnet.ibm.com Cc: Brian King brk...@linux.vnet.ibm.com --- arch/powerpc/kernel/smp.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 5c196d1..cc05792 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -429,11 +429,11 @@ int __cpuinit __cpu_up(unsigned int cpu) #endif if (!cpu_callin_map[cpu]) { - printk(Processor %u is stuck.\n, cpu); + printk(KERN_ERR Processor %u is stuck.\n, cpu); return -ENOENT; } - printk(Processor %u found.\n, cpu); + DBG(Processor %u found.\n, cpu); if (smp_ops-give_timebase) smp_ops-give_timebase(); -- 1.7.0.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 3/3] powerpc-silence-xics_migrate_irqs_away-during-cpu-offline
From: Signed-off-by: Darren Hart dvh...@us.ibm.com All IRQs are migrated away from a CPU that is being offlined so the following messages suggest a problem when the system is behaving as designed: IRQ 262 affinity broken off cpu 1 IRQ 17 affinity broken off cpu 0 IRQ 18 affinity broken off cpu 0 IRQ 19 affinity broken off cpu 0 IRQ 256 affinity broken off cpu 0 IRQ 261 affinity broken off cpu 0 IRQ 262 affinity broken off cpu 0 Don't print these messages when the CPU is not online. Signed-off-by: Darren Hart dvh...@us.ibm.com Acked-by: Will Schmidt will_schm...@vnet.ibm.com Cc: Thomas Gleixner t...@linutronix.de Cc: Nathan Fontenot nf...@austin.ibm.com Cc: Robert Jennings r...@linux.vnet.ibm.com Cc: Brian King brk...@linux.vnet.ibm.com --- arch/powerpc/platforms/pseries/xics.c |6 -- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pseries/xics.c b/arch/powerpc/platforms/pseries/xics.c index f19d194..8d0b0b1 100644 --- a/arch/powerpc/platforms/pseries/xics.c +++ b/arch/powerpc/platforms/pseries/xics.c @@ -930,8 +930,10 @@ void xics_migrate_irqs_away(void) if (xics_status[0] != hw_cpu) goto unlock; - printk(KERN_WARNING IRQ %u affinity broken off cpu %u\n, - virq, cpu); + /* This is expected during cpu offline. */ + if (cpu_online(cpu)) + printk(KERN_WARNING IRQ %u affinity broken off cpu %u\n, + virq, cpu); /* Reset affinity to all cpus */ cpumask_setall(irq_to_desc(virq)-affinity); -- 1.7.0.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 07/23/2010 12:07 AM, Vaidyanathan Srinivasan wrote: * Benjamin Herrenschmidtb...@kernel.crashing.org [2010-07-23 15:11:00]: On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote: Yes. extended_cede_processor() will return with interrupts enabled in the cpu. (This is done by the hypervisor). Under normal cases we cannot be interrupted because no IO interrupts are routed to us after xics_teardown_cpu() and since the CPU is out of the map, nobody will send us IPIs. What about decrementer ? Decrementer expiry event handling is bit complex. The event as such may not bring back the extended_cede_processor() cpu, but may be marked pending when we get out of this state eventually. I will find more information on this event and update. Hi Vaidy, have you been able to dig anything up about the decrementer expiry? Thanks, -- Darren Hart IBM Linux Technology Center Real-Time Linux Team Though H_CEDE will return with interrupts enabled, it is unlikely that an interrupt can be delivered in this context. Well, if interrupts are soft-disabled, even if one occurs, we will just mask and return, so that at least should be ok. Yes. We will immediately return to the extended_cede_processor() in the while loop until the preferred_offline_state is changed. --Vaidy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
While testing CPU offline/online, we hit various preempt_count related bugs. Various hacks have been employed for several theoretical corner cases. One situation however is perfectly repeatable on 2.6.33.6 with CONFIG_PREEMPT=y. BUG: scheduling while atomic: swapper/0/0x0065 Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log dm_mod ehea sg ext4 jbd2 mbcache sd_mod crc_t10dif ibmvscsic scsi_transport_srp scsi_tgt [last unloaded: scsi_wait_scan] Call Trace: [c0010e9e39f0] [c00144d4] .show_stack+0x74/0x1c0 (unreliable) [c0010e9e3aa0] [c007a680] .__schedule_bug+0xa0/0xb0 [c0010e9e3b30] [c056dea4] .schedule+0x7a4/0xd60 [c0010e9e3cd0] [c0016be8] .cpu_idle+0x1f8/0x220 [c0010e9e3d80] [c057d858] .start_secondary+0x388/0x3c0 [c0010e9e3e30] [c0008278] .start_secondary_resume+0x10/0x14 With some instrumentation we were able to determine that the preempt_count() appears to change across the extended_cede_processor() call. Specifically across the plpar_hcall_norets(H_CEDE) call. On PREEMPT_RT we call this with preempt_count=1 and return with preempt_count=0x. On mainline with CONFIG_PREEMPT=y, the value is different (0x65) but is still incorrect. Also of interest is that this path cpu_idle()-cpu_die()-pseries_mach_cpu_die() to start_secondary() enters with a preempt_count=1 if it wasn't corrupted across the hcall. The early boot path from _start however appears to call start_secondary() with a preempt_count of 0. The following patch is most certainly not correct, but it does eliminate the situation on mainline 100% of the time (there is still a 25% reproduction rate on PREEMPT_RT). Can someone comment on: 1) How can the preempt_count() get mangled across the H_CEDE hcall? 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ? Hacked-up-by: Darren Hart dvh...@us.ibm.com Index: linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c === --- linux-2.6.33.6.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -138,6 +138,7 @@ static void pseries_mach_cpu_die(void) * Kernel stack will be reset and start_secondary() * will be called to continue the online operation. */ + preempt_count() = 0; start_secondary_resume(); } } -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 07/22/2010 11:24 AM, Darren Hart wrote: The following patch is most certainly not correct, but it does eliminate the situation on mainline 100% of the time (there is still a 25% reproduction rate on PREEMPT_RT). Can someone comment on: Apologies. This particular issue is also 100% eliminated on PREEMPT_RT. We hit another issue possibly unrelated to this 25% of time. Please disregard the comment regarding 25% failure on PREEMPT_RT. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote: On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote: 1) How can the preempt_count() get mangled across the H_CEDE hcall? 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ? The preempt count is on the thread info at the bottom of the stack. Can you check the stack pointers ? Hi Ben, thanks for looking. I instrumented the area around extended_cede_processor() as follows (please confirm I'm getting the stack pointer correctly). while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) { asm(mr %0,1 : =r (sp)); printk(before H_CEDE current-stack: %lx, pcnt: %x\n, sp, preempt_count()); extended_cede_processor(cede_latency_hint); asm(mr %0,1 : =r (sp)); printk(after H_CEDE current-stack: %lx, pcnt: %x\n, sp, preempt_count()); } On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this: Jul 22 18:37:08 igoort1 kernel: before H_CEDE current-stack: c0010e9e3ce0, pcnt: 1 Jul 22 18:37:08 igoort1 kernel: after H_CEDE current-stack: c0010e9e3ce0, pcnt: 1 This surprised me as preempt_count is 1 before and after, so no corruption appears to occur on mainline. This makes the pcnt of 65 I see without the preempt_count()=0 hack very strange. I ran several hundred off/on cycles. The issue of preempt_count being 1 is still addressed by this patch however. On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise): Jul 22 18:51:11 igoort1 kernel: before H_CEDE current-stack: c00089bcfcf0, pcnt: 1 Jul 22 18:51:11 igoort1 kernel: after H_CEDE current-stack: c00089bcfcf0, pcnt: In both cases the stack pointer appears unchanged. Note: there is a BUG triggered in between these statements as the preempt_count causes the printk to trigger: Badness at kernel/sched.c:5572 Thanks, -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
On 07/22/2010 04:57 PM, Darren Hart wrote: On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote: On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote: 1) How can the preempt_count() get mangled across the H_CEDE hcall? 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ? The preempt count is on the thread info at the bottom of the stack. Can you check the stack pointers ? Hi Ben, thanks for looking. I instrumented the area around extended_cede_processor() as follows (please confirm I'm getting the stack pointer correctly). while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) { asm(mr %0,1 : =r (sp)); printk(before H_CEDE current-stack: %lx, pcnt: %x\n, sp, preempt_count()); extended_cede_processor(cede_latency_hint); asm(mr %0,1 : =r (sp)); printk(after H_CEDE current-stack: %lx, pcnt: %x\n, sp, preempt_count()); } On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this: Jul 22 18:37:08 igoort1 kernel: before H_CEDE current-stack: c0010e9e3ce0, pcnt: 1 Jul 22 18:37:08 igoort1 kernel: after H_CEDE current-stack: c0010e9e3ce0, pcnt: 1 This surprised me as preempt_count is 1 before and after, so no corruption appears to occur on mainline. This makes the pcnt of 65 I see without the preempt_count()=0 hack very strange. I ran several hundred off/on cycles. The issue of preempt_count being 1 is still addressed by this patch however. On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise): Jul 22 18:51:11 igoort1 kernel: before H_CEDE current-stack: c00089bcfcf0, pcnt: 1 Jul 22 18:51:11 igoort1 kernel: after H_CEDE current-stack: c00089bcfcf0, pcnt: In both cases the stack pointer appears unchanged. Note: there is a BUG triggered in between these statements as the preempt_count causes the printk to trigger: Badness at kernel/sched.c:5572 At Steven's suggestion I updated the instrumentation to display the local_save_flags and irqs_disabled_flags: while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) { local_save_flags(flags); printk(local flags: %lx, irqs disabled: %d\n, flags, irqs_disabled_flags(flags)); asm(mr %0,1 : =r (sp)); printk(before H_CEDE current-stack: %lx, pcnt: %x\n, sp, preempt_count()); extended_cede_processor(cede_latency_hint); asm(mr %0,1 : =r (sp)); printk(after H_CEDE current-stack: %lx, pcnt: %x\n, sp, preempt_count()); } Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1 Jul 22 23:36:58 igoort1 kernel: before H_CEDE current-stack: c0010e9e3ce0, pcnt: 1 Jul 22 23:36:58 igoort1 kernel: after H_CEDE current-stack: c0010e9e3ce0, pcnt: 1 I'm not sure if I'm reading that right, but I believe interrupts are intended to be disabled here. If accomplished via the spin_lock_irqsave() this would behave differently on RT. However, this path disables the interrupts handled by xics, all but the IPIs anyway. On RT I disabled the decrementer as well. Is it possible for RT to be receiving other interrupts here? -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Badness in xics_ipi_dispatch
Michael Ellerman michael at ellerman.id.au writes: On Tue, 2010-04-20 at 17:17 -0500, Brian King wrote: In stress testing enabling and disabling of SMT, we are regularly seeing the badness warning below. Looking through the cpu offline path, this is what I see: 1. stop_cpu: IRQ's get disabled 2. pseries_cpu_disable: set cpu offline (no barriers after this) 3. xics_migrate_irqs_away: Remove ourselves from the GIQ, but still allow IPIs 4. stop_cpu: IRQ's get enabled again (local_irq_enable) It looks to me like there is plenty of opportunity between 1 and 2 for an IPI to get queued, resulting in the badness below. Is there something in xics_migrate_irqs_away that should clear any pending IPIs? Is that not what this does? /* Reject any interrupt that was queued to us... */ xics_set_cpu_priority(0); /* Remove ourselves from the global interrupt queue */ xics_set_cpu_giq(default_distrib_server, 0); I thought the above would clear any pending (queued) interrupts and disable additional interrupts from coming in. Of course the next line allows IPIs again /* Allow IPIs again... */ xics_set_cpu_priority(DEFAULT_PRIORITY); Which I confess I really don't get... If there is, maybe the solution is as simple as adding a barrier after marking the cpu offline. Or is the warning bogus and we should just remove it? It looks like xics_migrate_irqs_away() doesn't do anything about IPIs, at least the comment says Allow IPIs again. So I don't see what's to stop you just taking another IPI after you reenable interrupts in stop_cpu(). Maybe xics_ipi_dispatch() should just return if the cpu is offline? We're seeing something possibly related in real-time. Notice how the decrementer handler interrupts stop_cpu(). Is the decrementer interrupt delivered as an IPI? cpu 0x3: Vector: 700 (Program Check) at [c00084d02d90] pc: c0068af4: .__might_sleep+0x11c/0x148 lr: c0068af0: .__might_sleep+0x118/0x148 sp: c00084d03010 msr: 80021032 current = 0xc00086658240 paca= 0xc0bb8a80 pid = 4045, comm = kstop/3 kernel BUG at kernel/sched.c:10168! enter ? for help [c00084d030b0] c06a2798 .rt_spin_lock+0x4c/0x9c [c00084d03140] c00e3c98 .cpuset_cpus_allowed_locked+0x38/0x74 [c00084d031e0] c0070be0 .select_fallback_rq+0x10c/0x1a4 [c00084d032a0] c007cda8 .try_to_wake_up+0x1b0/0x540 [c00084d03370] c007d2e8 .wake_up_process+0x34/0x48 [c00084d03400] c008c5f8 .wakeup_softirqd+0x78/0x9c [c00084d03490] c008c8e4 .raise_softirq+0x6c/0xa4 [c00084d03520] c0099c18 .run_local_timers+0x2c/0x4c [c00084d035a0] c0099c90 .update_process_times+0x58/0x9c [c00084d03640] c00c2e70 .tick_sched_timer+0xd0/0x120 [c00084d036f0] c00b4bec .__run_hrtimer+0x1a0/0x29c [c00084d037a0] c00b558c .hrtimer_interrupt+0x21c/0x394 [c00084d038d0] c00307d8 .timer_interrupt+0x1dc/0x2e4 [c00084d03970] c0003700 decrementer_common+0x100/0x180 --- Exception: 901 (Decrementer) at c000d144 .raw_local_irq_restore+0x48/0x54 [link register ] c00e57ec .stop_cpu+0x1c0/0x1ec [c00084d03c60] c104a4f0 (unreliable) [c00084d03ca0] c00e5780 .stop_cpu+0x154/0x1ec [c00084d03d40] c00a8b84 .worker_thread+0x25c/0x338 [c00084d03e60] c00af8c8 .kthread+0xb8/0xc4 [c00084d03f90] c0034408 .kernel_thread+0x54/0x70 Thanks, Darren Hart ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On 05/20/2010 01:14 AM, Thomas Gleixner wrote: On Thu, 20 May 2010, Jan-Bernd Themann wrote: Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) From our perspective there is no need to disable interrupts for the RX side as the chip does not fire further interrupts until we tell the chip to do so for a particular queue. We have multiple receive The traces tell a different story though: ehea_recv_irq_handler() napi_reschedule() eoi() ehea_poll() ... ehea_recv_irq_handler() ??? napi_reschedule() ... napi_complete() Can't tell whether you can see the same behaviour in mainline, but I don't see a reason why not. I was going to suggest that because these are threaded handlers, perhaps they are rescheduled on a different CPU and then receive the interrupt for the other CPU/queue that Jan was mentioning. But, the handlers are affined if I remember correctly, and we aren't running with multiple receive queues. So, we're back to the same question, why are we seeing another irq. It comes in before napi_complete() and therefor before the ehea_reset*() block of calls which do the equivalent of re-enabling interrupts. -- Darren queues with an own interrupt each so that the interrupts can arrive on multiple CPUs in parallel. Interrupts are enabled again when we leave the NAPI Poll function for the corresponding receive queue. I can't see a piece of code which does that, but that's probably just lack of detailed hardware knowledge on my side. Thanks, tglx -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On 05/18/2010 06:25 PM, Michael Ellerman wrote: On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote: On 05/18/2010 02:52 PM, Brian King wrote: Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline... Yes, it basically says don't make this handler threaded. That is a good fix for EHEA, but the threaded handling is still broken for anything else that is edge triggered isn't it? No, I don't believe so. Edge triggered interrupts that are reported as edge triggered interrupts will use the edge handler (which was the approach Sebastien took to make this work back in 2008). Since XICS presents all interrupts as Level Triggered, they use the fasteoi path. The result of the discussion about two years ago on this was that we needed a custom flow handler for XICS on RT. I'm still not clear on why the ultimate solution wasn't to have XICS report edge triggered as edge triggered. Probably some complexity of the entire power stack that I am ignorant of. Apart from the issue of loosing interrupts there is also the fact that masking on the XICS requires an RTAS call which takes a global lock. Right, one of may reasons why we felt this was the right fix. The other is that there is no real additional overhead in running this as non-threaded since the receive handler is so short (just napi_schedule()). -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On 05/18/2010 02:52 PM, Brian King wrote: Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline... Yes, it basically says don't make this handler threaded. -- Darren -Brian On 05/18/2010 04:33 PM, dvh...@linux.vnet.ibm.com wrote: From ad81664794e33d785f533c5edee37aaba20dd92d Mon Sep 17 00:00:00 2001 From: Darren Hartdvh...@us.ibm.com Date: Tue, 18 May 2010 11:07:13 -0700 Subject: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY) The underlying hardware is edge triggered but presented by XICS as level triggered. The edge triggered interrupts are not reissued after masking. This is not a problem in mainline which does not mask the interrupt (relying on the EOI mechanism instead). The threaded interrupts in PREEMPT_RT do mask the interrupt, and can lose interrupts that occurred while masked, resulting in a hung ethernet interface. The receive handler simply calls napi_schedule(), as such, there is no significant additional overhead in making this non-threaded, since we either wakeup the threaded irq handler to call napi_schedule(), or just call napi_schedule() directly to wakeup the softirqs. As the receive handler is lockless, there is no need to convert any of the ehea spinlock_t's to atomic_spinlock_t's. Without this patch, a simple scp file copy loop would fail quickly (usually seconds). We have over two hours of sustained scp activity with the patch applied. Credit goes to Will Schmidt for lots of instrumentation and tracing which clarified the scenario and to Thomas Gleixner for the incredibly simple solution. Signed-off-by: Darren Hartdvh...@us.ibm.com Acked-by: Will Schmidtwill_schm...@vnet.ibm.com Cc: Thomas Gleixnert...@linuxtronix.de Cc: Jan-Bernd Themannthem...@de.ibm.com Cc: Nivedita Singhvin...@us.ibm.com Cc: Brian Kingbjki...@us.ibm.com Cc: Michael Ellermaneller...@au1.ibm.com Cc: Doug Maxeydoug.ma...@us.ibm.com --- drivers/net/ehea/ehea_main.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c index 977c3d3..2c53df2 100644 --- a/drivers/net/ehea/ehea_main.c +++ b/drivers/net/ehea/ehea_main.c @@ -1263,7 +1263,7 @@ static int ehea_reg_interrupts(struct net_device *dev) %s-queue%d, dev-name, i); ret = ibmebus_request_irq(pr-eq-attr.ist1, ehea_recv_irq_handler, - IRQF_DISABLED, pr-int_send_name, + IRQF_DISABLED | IRQF_NODELAY, pr-int_send_name, pr); if (ret) { ehea_error(failed registering irq for ehea_queue -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev