On Mon, Jan 19, 2015 at 02:40:28PM +0000, Mark Rutland wrote:
> On Fri, Jan 16, 2015 at 02:11:04PM +0000, Peter Zijlstra wrote:
> > On Fri, Jan 16, 2015 at 11:46:44AM +0100, Peter Zijlstra wrote:
> > > Its a bandaid at best :/ The problem is (again) that we changes
> > > event->ctx without any kind of serialization.
> > >
> > > The issue came up before:
> > >
> > >   https://lkml.org/lkml/2014/9/5/397
>
> In the end neither the CCI or CCN perf drivers migrate events on
> hotplug, so ARM is currently safe from the perf_pmu_migrate_context
> case, but I see that you fix the move_group handling too.
>
> I had a go at testing this by hacking migration back into the CCI PMU
> driver (atop of v3.19-rc5), but I'm seeing lockups after a few minutes
> with my original test case (https://lkml.org/lkml/2014/9/1/569 with
> PMU_TYPE and PMU_EVENT fixed up).
>
> I unfortunately don't have a suitable x86 box spare to run that on.
> Would someone be able to give it a spin on something with an uncore PMU?
>
> I'll go and dig a bit further. I may just be hitting another latent
> issue on my board.

I'm able to trigger the lockups with neither your patch nor the call to
perf_pmu_migrate_context applied, so there is a latent issue.

On vanilla v3.19-rc5 and vanilla v3.18, my hotplug script hangs when run
concurrently with the test case against the CCI PMU driver (without
migration). The v3.18 and v3.19-rc5 lockups are identical:

INFO: task hpall.sh:1506 blocked for more than 120 seconds.
      Not tainted 3.19.0-rc5 #9
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
hpall.sh        D 804a6ffc     0  1506   1497 0x00000000
[<804a6ffc>] (__schedule) from [<80022308>] (cpu_hotplug_begin+0xa0/0xac)
[<80022308>] (cpu_hotplug_begin) from [<8002236c>] (_cpu_up+0x24/0x180)
[<8002236c>] (_cpu_up) from [<8002253c>] (cpu_up+0x74/0x98)
[<8002253c>] (cpu_up) from [<802bce60>] (device_online+0x64/0x90)
[<802bce60>] (device_online) from [<802bcef4>] (online_store+0x68/0x74)
[<802bcef4>] (online_store) from [<8014059c>] (kernfs_fop_write+0xbc/0x1a0)
[<8014059c>] (kernfs_fop_write) from [<800e71b0>] (vfs_write+0xa0/0x1ac)
[<800e71b0>] (vfs_write) from [<800e7808>] (SyS_write+0x44/0x9c)
[<800e7808>] (SyS_write) from [<8000e560>] (ret_fast_syscall+0x0/0x48)
7 locks held by hpall.sh/1506:
 #0:  (sb_writers#6){.+.+.+}, at: [<800e729c>] vfs_write+0x18c/0x1ac
 #1:  (&of->mutex){+.+.+.}, at: [<8014052c>] kernfs_fop_write+0x4c/0x1a0
 #2:  (s_active#15){.+.+.+}, at: [<80140534>] kernfs_fop_write+0x54/0x1a0
 #3:  (device_hotplug_lock){+.+.+.}, at: [<802bbe44>] lock_device_hotplug_sysfs+0xc/0x4c
 #4:  (&dev->mutex){......}, at: [<802bce14>] device_online+0x18/0x90
 #5:  (cpu_add_remove_lock){+.+.+.}, at: [<80022508>] cpu_up+0x40/0x98
 #6:  (cpu_hotplug.lock){++++++}, at: [<80022268>] cpu_hotplug_begin+0x0/0xac

I guess that lockup is my fundamental issue, and that with your patch the
perf_rwsem spreads a transitive dependency on one of those locks all over
the perf subsystem. I haven't considered that in great detail, however.

Thanks,
Mark.