On Tue, Mar 25, 2014 at 08:08:23AM +0000, Dmitry Malkin wrote:
> Hello, Daniel,
> 
> Thank you for response. I've got a couple questions for you:
> 
> 0. What do you think about making integration test with continuous reloading 
> i915 driver and X server (with FLR between iteration)?
> Simplified example for ubuntu (root required):

Module reloading is know to be horribly racy atm. Don't do that in any
kind of stress-test situation before fixing up piles of bugs ;-)

Will mean: You'll uncover at _lot_ more than just this. Patches for all
this highly welcome though.

Thanks, Daniel

> 
> #!/bin/bash
> service lightdm stop
> rmmod snd_hda_intel
> echo -n "0000:00:02.0" > /sys/bus/pci/devices/0000\:00\:02.0/driver/unbind
> rmmod i915
> echo 1 > /sys/bus/pci/devices/0000\:00\:02.0/reset
> modprobe i915
> service lightdm start
> 
> This will uncover next problems:
> * sysfs add/remove i2c connectors (parent/child warning)
> * drm static buffer races
> * NX bit violation crash (see dump below)
> * and bunch of DMAR problems
> 
> 
> 1. Could you point me out git/branch with FIXME comments?
> 
> 2. About kfree() problem: what if put string buffer onto stack of caller for 
> drm_get_connector_name and drm_get_encoder_name? 
> It will be auto-removed and there is will be the patch about adding new param 
> for these functions.
> (yes the patch will be big and awful to read)
> 
> ======================= NX bit violation crash 
> ========================================
> Mar 20 21:53:29 haswell01 kernel: [13447.100849] Console: switching to colour 
> dummy device 80x25
> Mar 20 21:53:29 haswell01 kernel: [13447.100950] drm_kms_helper: drm: 
> unregistered panic notifier
> Mar 20 21:53:29 haswell01 kernel: [13447.117785] waiting module removal not 
> supported: please upgrade
> Mar 20 21:53:29 haswell01 kernel: [13447.117880] [drm] Module unloaded
> Mar 20 21:53:29 haswell01 kernel: [13447.118244] waiting module removal not 
> supported: please upgrade
> Mar 20 21:53:29 haswell01 kernel: [13447.118360] waiting module removal not 
> supported: please upgradewaiting module removal not supported: please upgrade
> Mar 20 21:53:39 haswell01 kernel: [13447.118590] waiting module removal not 
> supported: please upgrade<6>[13457.263808] [drm] Initialized drm 1.1.0 
> 20060810
> Mar 20 21:53:39 haswell01 kernel: [13457.333992] kernel tried to execute 
> NX-protected page - exploit attempt? (uid: 0)
> Mar 20 21:53:39 haswell01 kernel: [13457.333996] BUG: unable to handle kernel 
> paging request at ffff8803eb27fcb0
> Mar 20 21:53:39 haswell01 kernel: [13457.333998] IP: [<ffff8803eb27fcb0>] 
> 0xffff8803eb27fcb0
> Mar 20 21:53:39 haswell01 kernel: [13457.334001] PGD 2fd9067 PUD 3e68c3063 
> PMD 404902063 PTE 80000003eb27f163
> Mar 20 21:53:39 haswell01 kernel: [13457.334004] Oops: 0011 [#1] SMP 
> Mar 20 21:53:39 haswell01 kernel: [13457.334006] Modules linked in: i915(+) 
> video drm_kms_helper drm i2c_algo_bit snd_hda_codec_hdmi rfcomm bnep 
> bluetooth nls_iso8859_1 x86_pkg_temp_thermal intel_powerclamp coretemp 
> kvm_intel snd_hda_codec_realtek kvm snd_hda_codec snd_hwdep snd_pcm 
> snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq 
> hid_generic crct10dif_pclmul snd_seq_device usbhid crc32_pclmul snd_timer 
> ghash_clmulni_intel snd aesni_intel psmouse hid mei_me aes_x86_64 lrw ppdev 
> gf128mul mei glue_helper parport_pc ablk_helper lp cryptd soundcore parport 
> serio_raw lpc_ich mac_hid e1000e ahci ptp libahci pps_core [last unloaded: 
> video]
> Mar 20 21:53:39 haswell01 kernel: [13457.334031] CPU: 0 PID: 4974 Comm: 
> modprobe Not tainted 3.13.6 #10
> Mar 20 21:53:39 haswell01 kernel: [13457.334032] Hardware name:               
>    /DQ87PG, BIOS PGQ8710H.86A.0144.2014.0113.1604 01/13/2014
> Mar 20 21:53:39 haswell01 kernel: [13457.334034] task: ffff88002e6c5fc0 ti: 
> ffff880406394000 task.ti: ffff880406394000
> Mar 20 21:53:39 haswell01 kernel: [13457.334035] RIP: 
> 0010:[<ffff8803eb27fcb0>]  [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0
> Mar 20 21:53:39 haswell01 kernel: [13457.334037] RSP: 0018:ffff880406395af0  
> EFLAGS: 00010282
> Mar 20 21:53:39 haswell01 kernel: [13457.334038] RAX: ffff8803eb27f4b0 RBX: 
> ffff8803f6016000 RCX: ffffffffa0238630
> Mar 20 21:53:39 haswell01 kernel: [13457.334039] RDX: ffff8803eb27fcb0 RSI: 
> ffffffffa03ba3c4 RDI: ffff8803f6016000
> Mar 20 21:53:39 haswell01 kernel: [13457.334040] RBP: ffff880406395b10 R08: 
> 0000000000000000 R09: ffff88041ea172e0
> Mar 20 21:53:39 haswell01 kernel: [13457.334042] R10: ffffea00101d7200 R11: 
> 0000000000000000 R12: 0000000000000000
> Mar 20 21:53:39 haswell01 kernel: [13457.334043] R13: ffffffffa03ba3c4 R14: 
> ffffffffa03dd100 R15: 0000000000000000
> Mar 20 21:53:39 haswell01 kernel: [13457.334044] FS:  00007f3f392d0740(0000) 
> GS:ffff88041ea00000(0000) knlGS:0000000000000000
> Mar 20 21:53:39 haswell01 kernel: [13457.334046] CS:  0010 DS: 0000 ES: 0000 
> CR0: 0000000080050033
> Mar 20 21:53:39 haswell01 kernel: [13457.334047] CR2: ffff8803eb27fcb0 CR3: 
> 00000003e7358000 CR4: 00000000001407f0
> Mar 20 21:53:39 haswell01 kernel: [13457.334048] DR0: 0000000000000000 DR1: 
> 0000000000000000 DR2: 0000000000000000
> Mar 20 21:53:39 haswell01 kernel: [13457.334050] DR3: 0000000000000000 DR6: 
> 00000000fffe0ff0 DR7: 0000000000000400
> Mar 20 21:53:39 haswell01 kernel: [13457.334051] Stack:
> Mar 20 21:53:39 haswell01 kernel: [13457.334052]  ffffffffa0213cb2 
> ffff8803f6016000 ffff880408199000 ffff880408199098
> Mar 20 21:53:39 haswell01 kernel: [13457.334054]  ffff880406395b60 
> ffffffffa0215b92 0000000000000004 ffff880408199140
> Mar 20 21:53:39 haswell01 kernel: [13457.334057]  ffffffffa03b9920 
> ffff880408199000 0000000000000000 ffffffffa03dd000
> Mar 20 21:53:39 haswell01 kernel: [13457.334059] Call Trace:
> Mar 20 21:53:39 haswell01 kernel: [13457.334067]  [<ffffffffa0213cb2>] ? 
> drm_dev_register+0xa2/0x1e0 [drm]
> Mar 20 21:53:39 haswell01 kernel: [13457.334073]  [<ffffffffa0215b92>] 
> drm_get_pci_dev+0x92/0x140 [drm]
> Mar 20 21:53:39 haswell01 kernel: [13457.334082]  [<ffffffffa033d67c>] 
> i915_pci_probe+0x3c/0x90 [i915]
> Mar 20 21:53:39 haswell01 kernel: [13457.334086]  [<ffffffff813982b5>] 
> local_pci_probe+0x45/0xa0
> Mar 20 21:53:39 haswell01 kernel: [13457.334088]  [<ffffffff81399555>] ? 
> pci_match_device+0xc5/0xd0
> Mar 20 21:53:39 haswell01 kernel: [13457.334090]  [<ffffffff81399679>] 
> pci_device_probe+0xd9/0x130
> Mar 20 21:53:39 haswell01 kernel: [13457.334093]  [<ffffffff81484795>] 
> driver_probe_device+0x125/0x3b0
> Mar 20 21:53:39 haswell01 kernel: [13457.334095]  [<ffffffff81484af3>] 
> __driver_attach+0x93/0xa0
> Mar 20 21:53:39 haswell01 kernel: [13457.334098]  [<ffffffff81484a60>] ? 
> __device_attach+0x40/0x40
> Mar 20 21:53:39 haswell01 kernel: [13457.334100]  [<ffffffff81482703>] 
> bus_for_each_dev+0x63/0xa0
> Mar 20 21:53:39 haswell01 kernel: [13457.334102]  [<ffffffff8148414e>] 
> driver_attach+0x1e/0x20
> Mar 20 21:53:39 haswell01 kernel: [13457.334104]  [<ffffffff81483d30>] 
> bus_add_driver+0x180/0x250
> Mar 20 21:53:39 haswell01 kernel: [13457.334106]  [<ffffffffa03fe000>] ? 
> 0xffffffffa03fdfff
> Mar 20 21:53:39 haswell01 kernel: [13457.334109]  [<ffffffff81485174>] 
> driver_register+0x64/0xf0
> Mar 20 21:53:39 haswell01 kernel: [13457.334110]  [<ffffffffa03fe000>] ? 
> 0xffffffffa03fdfff
> Mar 20 21:53:39 haswell01 kernel: [13457.334113]  [<ffffffff81397c4c>] 
> __pci_register_driver+0x4c/0x50
> Mar 20 21:53:39 haswell01 kernel: [13457.334117]  [<ffffffffa0215d5a>] 
> drm_pci_init+0x11a/0x130 [drm]
> Mar 20 21:53:39 haswell01 kernel: [13457.334119]  [<ffffffffa03fe000>] ? 
> 0xffffffffa03fdfff
> Mar 20 21:53:39 haswell01 kernel: [13457.334127]  [<ffffffffa03fe066>] 
> i915_init+0x66/0x68 [i915]
> Mar 20 21:53:39 haswell01 kernel: [13457.334130]  [<ffffffff8100214a>] 
> do_one_initcall+0xfa/0x1b0
> Mar 20 21:53:39 haswell01 kernel: [13457.334133]  [<ffffffff810589e3>] ? 
> set_memory_nx+0x43/0x50
> Mar 20 21:53:39 haswell01 kernel: [13457.334137]  [<ffffffff810e0630>] 
> load_module+0x2050/0x26f0
> Mar 20 21:53:39 haswell01 kernel: [13457.334139]  [<ffffffff810dbfa0>] ? 
> store_uevent+0x40/0x40
> Mar 20 21:53:39 haswell01 kernel: [13457.334141]  [<ffffffff810e0e46>] 
> SyS_finit_module+0x86/0xb0
> Mar 20 21:53:39 haswell01 kernel: [13457.334144]  [<ffffffff8171823f>] 
> tracesys+0xe1/0xe6
> Mar 20 21:53:39 haswell01 kernel: [13457.334145] Code: 00 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20 00 
> 00 00 00 00 00 00 00 00 00 00 <f0> b1 ca 81 ff ff ff ff b0 f4 27 eb 03 88 ff 
> ff ff ff ff 7f 00 
> Mar 20 21:53:39 haswell01 kernel: [13457.334164] RIP  [<ffff8803eb27fcb0>] 
> 0xffff8803eb27fcb0
> Mar 20 21:53:39 haswell01 kernel: [13457.334166]  RSP <ffff880406395af0>
> Mar 20 21:53:39 haswell01 kernel: [13457.334167] CR2: ffff8803eb27fcb0
> Mar 20 21:53:39 haswell01 kernel: [13457.334168] ---[ end trace 
> e2598b1fc83a65fd ]---
> 
> 
> ________________________________________
> From: Daniel Vetter <daniel.vetter at ffwll.ch> on behalf of Daniel Vetter 
> <daniel at ffwll.ch>
> Sent: Tuesday, March 25, 2014 12:31 AM
> To: Dmitry Malkin
> Cc: dri-devel at lists.freedesktop.org
> Subject: Re: [DRM] drm_get_connector_name internal static string buffer 
> causes a race
> 
> On Mon, Mar 24, 2014 at 12:06:21PM +0000, Dmitry Malkin wrote:
> >
> > Hello guys,
> >
> > I've been playing with reloading intel gfx driver (i915) in a cycle, for a 
> > while,
> > and at some point I've found a non-deterministic kernel crash with a 
> > highly-variable
> > iteration dependency -- 2 to 200 driver reload iterations.
> >
> > The apparent race is over the shared internal string buffer in 
> > drm_get_connector_name().
> > It is mostly harmless, due to its results being mostly used for log output, 
> > but in at least one
> > case  -- drm_sysfs_connector_add() -- this leads to a more critical error.
> >
> > Race scenario:
> >
> > - drm_sysfs_connector_add()
> >    - drm_get_connector_name()
> > vs.
> > - anything that generates log messages involving DRM connectors
> >    - drm_get_connector_name()
> >
> >  (and many other from drm_crtc.c) shares with caller const char* to 
> > internal static char buffer.
> > If something call it from other thread, while main thread strore and use 
> > returned pointer it may overwrite connector name.
> >
> > Here are we go: registering HDMI connecter  (drm_sysfs_connector_add store 
> > and use pointer from drm_get_connector_name)
> > and the same time got VGA connector name down through the stack. (the 
> > second thread is upowerd who watch continuously sysfs)
> 
> Yeah, in my recent kerneldoc series I've noticed this too and added FIXME
> comments. There's a lot more than just those you've pointed out. The
> problem is that fixing these will be an awful lot of work since you need
> to add proper cleanup code (calling kfree()) to all the required places.
> 
> So a full subsystem wide code audit for every single use-site of these.
> -Daniel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

Reply via email to