On Wed, 2013-11-13 at 18:50 -0800, David Decotigny wrote: > Hello, > > Thank you for looking into this. I could reproduce the oops on some > Dell Poweredge R720 with the following config flags, otherwise the > problem goes un-noticed: > > CONFIG_DEBUG_PAGEALLOC=y > CONFIG_DEBUG_SLAB=y > > [ 4.924033] BUG: unable to handle kernel paging request at ffff88000004dd10 > [ 4.931823] IP: [<ffffffff8139797f>] __scsi_scan_target+0x3ef/0x6f0 > [ 4.938846] PGD 1ba1067 PUD 1ba2067 PMD 1ba3067 PTE 800000000004d060 > [ 4.945985] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC > [ 4.951074] Modules linked in: > [ 4.954492] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.12.0-smp-scsi01 #1 > > This points to this line on the return path of scsi_report_lun_scan: > if (scsi_device_created(sdev)) > > Kernel is jejb/scsi/for-next at 2aee240c68ed32 and I could reproduce > the bug with other 3.x kernels on same hardware. For me, it is 100% > reproducible. > > The ref counter values I indicated in my previous email are the result > of a basic instrumentation. It shows that ref count drops from 3 to 1 > as a result of scsi_probe_and_add_lun(). I believe this is because the > latter calls __scsi_remove_device(sdev). > > Now, if sdev reclaiming is not allowed to happen at the end of > scsi_report_lun_scan by design because someone else is expected to > hold a reference to it, then I'd be happy to add a BUG_ON() on the > return path and explicit the post-condition in the function > documentation, and also try to find out where a ref is killed by > mistake. However, if sdev relcaiming at the end of > scsi_report_lun_scan is allowed, then I'd argue that the "if > (scsi_device_created(sdev))" on the potentially reclaimed sdev is not > right, that's why I was proposing this patch.
Heh, perhaps this is why bug reports are so useful. Your patch told me where you thought the bug was but this report actually tells me where the bug is: it's in scsi_probe_and_add_luns(). There's no way that routine should reduce the refcount of a struct scsi_device. It should either leave it as it is, or if an sdevp is specified, increment the reference and return the sdev in sdevp. This should be enough information for you to come up with the fix. Please document it with the actual bug details like you have above and I'll apply it. James -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/