On Tue, Mar 25, 2014 at 08:08:23AM +0000, Dmitry Malkin wrote:
> Hello, Daniel,
>
> Thank you for the response. I've got a couple of questions for you:
>
> 0. What do you think about making an integration test that continuously
> reloads the i915 driver and X server (with an FLR between iterations)?
> Simplified example for ubuntu (root required):

Module reloading is known to be horribly racy atm. Don't do that in any
kind of stress-test situation before fixing up piles of bugs ;-)

Which means: you'll uncover a _lot_ more than just this. Patches for all of
this are highly welcome though.

Thanks, Daniel

>
> #!/bin/bash
> service lightdm stop
> rmmod snd_hda_intel
> echo -n "0000:00:02.0" > /sys/bus/pci/devices/0000\:00\:02.0/driver/unbind
> rmmod i915
> echo 1 > /sys/bus/pci/devices/0000\:00\:02.0/reset
> modprobe i915
> service lightdm start
>
> This will uncover the following problems:
> * sysfs add/remove i2c connectors (parent/child warning)
> * drm static buffer races
> * NX bit violation crash (see dump below)
> * and a bunch of DMAR problems
>
>
> 1. Could you point me to the git tree/branch with the FIXME comments?
>
> 2. About the kfree() problem: what if we put the string buffer on the
> caller's stack for drm_get_connector_name and drm_get_encoder_name?
> It would be released automatically, and the patch would add a new
> parameter to these functions.
> (yes, the patch will be big and awful to read)
>
> ======================= NX bit violation crash ========================================
> Mar 20 21:53:29 haswell01 kernel: [13447.100849] Console: switching to colour dummy device 80x25
> Mar 20 21:53:29 haswell01 kernel: [13447.100950] drm_kms_helper: drm: unregistered panic notifier
> Mar 20 21:53:29 haswell01 kernel: [13447.117785] waiting module removal not supported: please upgrade
> Mar 20 21:53:29 haswell01 kernel: [13447.117880] [drm] Module unloaded
> Mar 20 21:53:29 haswell01 kernel: [13447.118244] waiting module removal not supported: please upgrade
> Mar 20 21:53:29 haswell01 kernel: [13447.118360] waiting module removal not supported: please upgradewaiting module removal not supported: please upgrade
> Mar 20 21:53:39 haswell01 kernel: [13447.118590] waiting module removal not supported: please upgrade<6>[13457.263808] [drm] Initialized drm 1.1.0 20060810
> Mar 20 21:53:39 haswell01 kernel: [13457.333992] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
> Mar 20 21:53:39 haswell01 kernel: [13457.333996] BUG: unable to handle kernel paging request at ffff8803eb27fcb0
> Mar 20 21:53:39 haswell01 kernel: [13457.333998] IP: [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0
> Mar 20 21:53:39 haswell01 kernel: [13457.334001] PGD 2fd9067 PUD 3e68c3063 PMD 404902063 PTE 80000003eb27f163
> Mar 20 21:53:39 haswell01 kernel: [13457.334004] Oops: 0011 [#1] SMP
> Mar 20 21:53:39 haswell01 kernel: [13457.334006] Modules linked in: i915(+) video drm_kms_helper drm i2c_algo_bit snd_hda_codec_hdmi rfcomm bnep bluetooth nls_iso8859_1 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek kvm snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq hid_generic crct10dif_pclmul snd_seq_device usbhid crc32_pclmul snd_timer ghash_clmulni_intel snd aesni_intel psmouse hid mei_me aes_x86_64 lrw ppdev gf128mul mei glue_helper parport_pc ablk_helper lp cryptd soundcore parport serio_raw lpc_ich mac_hid e1000e ahci ptp libahci pps_core [last unloaded: video]
> Mar 20 21:53:39 haswell01 kernel: [13457.334031] CPU: 0 PID: 4974 Comm: modprobe Not tainted 3.13.6 #10
> Mar 20 21:53:39 haswell01 kernel: [13457.334032] Hardware name: /DQ87PG, BIOS PGQ8710H.86A.0144.2014.0113.1604 01/13/2014
> Mar 20 21:53:39 haswell01 kernel: [13457.334034] task: ffff88002e6c5fc0 ti: ffff880406394000 task.ti: ffff880406394000
> Mar 20 21:53:39 haswell01 kernel: [13457.334035] RIP: 0010:[<ffff8803eb27fcb0>] [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0
> Mar 20 21:53:39 haswell01 kernel: [13457.334037] RSP: 0018:ffff880406395af0 EFLAGS: 00010282
> Mar 20 21:53:39 haswell01 kernel: [13457.334038] RAX: ffff8803eb27f4b0 RBX: ffff8803f6016000 RCX: ffffffffa0238630
> Mar 20 21:53:39 haswell01 kernel: [13457.334039] RDX: ffff8803eb27fcb0 RSI: ffffffffa03ba3c4 RDI: ffff8803f6016000
> Mar 20 21:53:39 haswell01 kernel: [13457.334040] RBP: ffff880406395b10 R08: 0000000000000000 R09: ffff88041ea172e0
> Mar 20 21:53:39 haswell01 kernel: [13457.334042] R10: ffffea00101d7200 R11: 0000000000000000 R12: 0000000000000000
> Mar 20 21:53:39 haswell01 kernel: [13457.334043] R13: ffffffffa03ba3c4 R14: ffffffffa03dd100 R15: 0000000000000000
> Mar 20 21:53:39 haswell01 kernel: [13457.334044] FS: 00007f3f392d0740(0000) GS:ffff88041ea00000(0000) knlGS:0000000000000000
> Mar 20 21:53:39 haswell01 kernel: [13457.334046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Mar 20 21:53:39 haswell01 kernel: [13457.334047] CR2: ffff8803eb27fcb0 CR3: 00000003e7358000 CR4: 00000000001407f0
> Mar 20 21:53:39 haswell01 kernel: [13457.334048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Mar 20 21:53:39 haswell01 kernel: [13457.334050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Mar 20 21:53:39 haswell01 kernel: [13457.334051] Stack:
> Mar 20 21:53:39 haswell01 kernel: [13457.334052] ffffffffa0213cb2 ffff8803f6016000 ffff880408199000 ffff880408199098
> Mar 20 21:53:39 haswell01 kernel: [13457.334054] ffff880406395b60 ffffffffa0215b92 0000000000000004 ffff880408199140
> Mar 20 21:53:39 haswell01 kernel: [13457.334057] ffffffffa03b9920 ffff880408199000 0000000000000000 ffffffffa03dd000
> Mar 20 21:53:39 haswell01 kernel: [13457.334059] Call Trace:
> Mar 20 21:53:39 haswell01 kernel: [13457.334067] [<ffffffffa0213cb2>] ? drm_dev_register+0xa2/0x1e0 [drm]
> Mar 20 21:53:39 haswell01 kernel: [13457.334073] [<ffffffffa0215b92>] drm_get_pci_dev+0x92/0x140 [drm]
> Mar 20 21:53:39 haswell01 kernel: [13457.334082] [<ffffffffa033d67c>] i915_pci_probe+0x3c/0x90 [i915]
> Mar 20 21:53:39 haswell01 kernel: [13457.334086] [<ffffffff813982b5>] local_pci_probe+0x45/0xa0
> Mar 20 21:53:39 haswell01 kernel: [13457.334088] [<ffffffff81399555>] ? pci_match_device+0xc5/0xd0
> Mar 20 21:53:39 haswell01 kernel: [13457.334090] [<ffffffff81399679>] pci_device_probe+0xd9/0x130
> Mar 20 21:53:39 haswell01 kernel: [13457.334093] [<ffffffff81484795>] driver_probe_device+0x125/0x3b0
> Mar 20 21:53:39 haswell01 kernel: [13457.334095] [<ffffffff81484af3>] __driver_attach+0x93/0xa0
> Mar 20 21:53:39 haswell01 kernel: [13457.334098] [<ffffffff81484a60>] ? __device_attach+0x40/0x40
> Mar 20 21:53:39 haswell01 kernel: [13457.334100] [<ffffffff81482703>] bus_for_each_dev+0x63/0xa0
> Mar 20 21:53:39 haswell01 kernel: [13457.334102] [<ffffffff8148414e>] driver_attach+0x1e/0x20
> Mar 20 21:53:39 haswell01 kernel: [13457.334104] [<ffffffff81483d30>] bus_add_driver+0x180/0x250
> Mar 20 21:53:39 haswell01 kernel: [13457.334106] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff
> Mar 20 21:53:39 haswell01 kernel: [13457.334109] [<ffffffff81485174>] driver_register+0x64/0xf0
> Mar 20 21:53:39 haswell01 kernel: [13457.334110] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff
> Mar 20 21:53:39 haswell01 kernel: [13457.334113] [<ffffffff81397c4c>] __pci_register_driver+0x4c/0x50
> Mar 20 21:53:39 haswell01 kernel: [13457.334117] [<ffffffffa0215d5a>] drm_pci_init+0x11a/0x130 [drm]
> Mar 20 21:53:39 haswell01 kernel: [13457.334119] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff
> Mar 20 21:53:39 haswell01 kernel: [13457.334127] [<ffffffffa03fe066>] i915_init+0x66/0x68 [i915]
> Mar 20 21:53:39 haswell01 kernel: [13457.334130] [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0
> Mar 20 21:53:39 haswell01 kernel: [13457.334133] [<ffffffff810589e3>] ? set_memory_nx+0x43/0x50
> Mar 20 21:53:39 haswell01 kernel: [13457.334137] [<ffffffff810e0630>] load_module+0x2050/0x26f0
> Mar 20 21:53:39 haswell01 kernel: [13457.334139] [<ffffffff810dbfa0>] ? store_uevent+0x40/0x40
> Mar 20 21:53:39 haswell01 kernel: [13457.334141] [<ffffffff810e0e46>] SyS_finit_module+0x86/0xb0
> Mar 20 21:53:39 haswell01 kernel: [13457.334144] [<ffffffff8171823f>] tracesys+0xe1/0xe6
> Mar 20 21:53:39 haswell01 kernel: [13457.334145] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 <f0> b1 ca 81 ff ff ff ff b0 f4 27 eb 03 88 ff ff ff ff ff 7f 00
> Mar 20 21:53:39 haswell01 kernel: [13457.334164] RIP [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0
> Mar 20 21:53:39 haswell01 kernel: [13457.334166] RSP <ffff880406395af0>
> Mar 20 21:53:39 haswell01 kernel: [13457.334167] CR2: ffff8803eb27fcb0
> Mar 20 21:53:39 haswell01 kernel: [13457.334168] ---[ end trace e2598b1fc83a65fd ]---
>
>
> ________________________________________
> From: Daniel Vetter <daniel.vetter at ffwll.ch> on behalf of Daniel Vetter <daniel at ffwll.ch>
> Sent: Tuesday, March 25, 2014 12:31 AM
> To: Dmitry Malkin
> Cc: dri-devel at lists.freedesktop.org
> Subject: Re: [DRM] drm_get_connector_name internal static string buffer causes a race
>
> On Mon, Mar 24, 2014 at 12:06:21PM +0000, Dmitry Malkin wrote:
> >
> > Hello guys,
> >
> > I've been playing with reloading the intel gfx driver (i915) in a cycle
> > for a while, and at some point I found a non-deterministic kernel crash
> > with a highly variable iteration dependency -- 2 to 200 driver reload
> > iterations.
> >
> > The apparent race is over the shared internal string buffer in
> > drm_get_connector_name().
> > It is mostly harmless, because its results are mostly used for log
> > output, but in at least one case -- drm_sysfs_connector_add() -- this
> > leads to a more critical error.
> >
> > Race scenario:
> >
> > - drm_sysfs_connector_add()
> >     - drm_get_connector_name()
> >   vs.
> > - anything that generates log messages involving DRM connectors
> >     - drm_get_connector_name()
> >
> > drm_get_connector_name() (and many others in drm_crtc.c) shares with the
> > caller a const char * to an internal static char buffer.
> > If something calls it from another thread while the main thread stores
> > and uses the returned pointer, it may overwrite the connector name.
> >
> > Here we go: registering the HDMI connector (drm_sysfs_connector_add
> > stores and uses the pointer from drm_get_connector_name) and at the same
> > time got the VGA connector name down through the stack. (The second
> > thread is upowerd, which continuously watches sysfs.)
>
> Yeah, in my recent kerneldoc series I've noticed this too and added FIXME
> comments. There's a lot more than just those you've pointed out. The
> problem is that fixing these will be an awful lot of work since you need
> to add proper cleanup code (calling kfree()) to all the required places.
>
> So a full subsystem-wide code audit for every single use-site of these.
> -Daniel

--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
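
For readers following the thread, here is a minimal userspace sketch of the static-buffer problem described in the quoted report, next to the caller-supplied-buffer idea from question 2. It is an illustration only: struct connector, connector_name_static() and connector_name_buf() are hypothetical stand-ins, not the DRM structures or the actual signatures in drm_crtc.c, and the sketch says nothing about how the kernel should eventually fix this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a DRM connector; NOT the kernel's struct. */
struct connector {
	const char *type_name;	/* e.g. "HDMI-A" or "VGA" */
	int type_id;
};

/*
 * Racy pattern: every caller gets a pointer into the same static buffer,
 * so a later call (from any thread) overwrites an earlier caller's result
 * while that caller may still be using the pointer.
 */
static const char *connector_name_static(const struct connector *c)
{
	static char buf[32];

	snprintf(buf, sizeof(buf), "%s-%d", c->type_name, c->type_id);
	return buf;
}

/*
 * Caller-supplied buffer (assumed shape of the extra parameter): the name
 * is written into storage the caller owns, for example on its stack, so
 * concurrent callers no longer share state.
 */
static const char *connector_name_buf(const struct connector *c,
				      char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "%s-%d", c->type_name, c->type_id);
	return buf;
}
```

With the static variant, two calls made back to back return the same pointer and the first name is gone, which is the single-threaded version of what upowerd and drm_sysfs_connector_add() do to each other. With the buffer variant each caller keeps its own copy, at the cost of touching every call site.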