Re: usb HC busted?
Hi Mathias, On Fri, Jul 20, 2018 at 01:54:21PM +0100, Sudip Mukherjee wrote: > Hi Mathias, > > On Fri, Jul 20, 2018 at 02:10:58PM +0300, Mathias Nyman wrote: > > On 19.07.2018 20:32, Sudip Mukherjee wrote: > > > Hi Mathias, > > > > > > On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote: > > > > > > As first aid I could try to implement checks that make sure the > > > > > > flushed URBs > > > > > > trb pointers really are on the current endpoint ring, and also add > > > > > > some warning > > > > > > if we are we are dropping endpoints with URBs still queued. > > > > > > > > > > Yes, please. I think your first-aid will be a much better option than > > > > > the hacky patch I am using atm. > > > > > > > > > > > > So poison is overwritten at e5acda58 with almost its own address, (reading > > backwards) e5 ac da 60, twice. > > looks like something (32bit?)is pointing to itself twice, maybe a linked > > list node next and prev pointer > > being set to point to itself as last item was removed from list. > > > > The cancelled_td_list is part of struct xhci_virt_ep, so that should be > > fine. > > But td_list is part of struct xhci_ring, which was freed. and we removed > > the URBs tds from the td_list when > > flushing the ring after ring was freed > > > > I changed the patch (attached) to make sure it doesn't touch the td_list > > when canceling a URB after > > ring is freed. > > > > How about this one, any improvements? > > Yes, it worked. :D > > So, cycle-1 = no change, just to make sure I can still reproduce the error. > cycle-2 and cycle-3 with your patch, and there was no problem, > slub debug was also happy. > I am starting an autotest with this patch now, and I will have almost > 50 cycles tested by tomorrow morning. I can confirm that your bandaid patch has worked. Total of 67 cycles tested till now and there was no error. Its continuing to test over the weekend. Thank you very much for this one. :) I guess you will start with the proper fix, that you and Alan had been discussing, after you are fully back to work. -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Mathias, On Fri, Jul 20, 2018 at 02:10:58PM +0300, Mathias Nyman wrote: > On 19.07.2018 20:32, Sudip Mukherjee wrote: > > Hi Mathias, > > > > On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote: > > > > > As first aid I could try to implement checks that make sure the > > > > > flushed URBs > > > > > trb pointers really are on the current endpoint ring, and also add > > > > > some warning > > > > > if we are we are dropping endpoints with URBs still queued. > > > > > > > > Yes, please. I think your first-aid will be a much better option than > > > > the hacky patch I am using atm. > > > > > > > > So poison is overwritten at e5acda58 with almost its own address, (reading > backwards) e5 ac da 60, twice. > looks like something (32bit?)is pointing to itself twice, maybe a linked list > node next and prev pointer > being set to point to itself as last item was removed from list. > > The cancelled_td_list is part of struct xhci_virt_ep, so that should be fine. > But td_list is part of struct xhci_ring, which was freed. and we removed the > URBs tds from the td_list when > flushing the ring after ring was freed > > I changed the patch (attached) to make sure it doesn't touch the td_list when > canceling a URB after > ring is freed. > > How about this one, any improvements? Yes, it worked. :D So, cycle-1 = no change, just to make sure I can still reproduce the error. cycle-2 and cycle-3 with your patch, and there was no problem, slub debug was also happy. I am starting an autotest with this patch now, and I will have almost 50 cycles tested by tomorrow morning. -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Mathias, On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote: > > > As first aid I could try to implement checks that make sure the flushed > > > URBs > > > trb pointers really are on the current endpoint ring, and also add some > > > warning > > > if we are we are dropping endpoints with URBs still queued. > > > > Yes, please. I think your first-aid will be a much better option than > > the hacky patch I am using atm. > > > > Attached a patch that checks canceled URB td/trb pointers. > I haven't tested it at all (well compiles and boots, but new code never > exercised) > > Does it work for you? No, not exactly. :( I can see your message getting printed. [ 249.518394] xhci_hcd :00:14.0: Canceled URB td not found on endpoint ring [ 249.518431] xhci_hcd :00:14.0: Canceled URB td not found on endpoint ring But I can see the message from slub debug again: [ 348.279986] = [ 348.279993] BUG kmalloc-96 (Tainted: G U O ): Poison overwritten [ 348.279995] - [ 348.279997] Disabling lock debugging due to kernel taint [ 348.28] INFO: 0xe5acda60-0xe5acda67. First byte 0x60 instead of 0x6b [ 348.280012] INFO: Allocated in xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd] age=129264 cpu=0 pid=33 [ 348.280019] ___slab_alloc.constprop.24+0x1fc/0x292 [ 348.280023] __slab_alloc.isra.18.constprop.23+0x1c/0x25 [ 348.280026] kmem_cache_alloc_trace+0x78/0x141 [ 348.280032] xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd] [ 348.280038] xhci_endpoint_init+0x25f/0x30a [xhci_hcd] [ 348.280044] xhci_add_endpoint+0x126/0x149 [xhci_hcd] [ 348.280057] usb_hcd_alloc_bandwidth+0x26a/0x2a0 [usbcore] [ 348.280067] usb_set_interface+0xeb/0x25d [usbcore] [ 348.280071] btusb_work+0xeb/0x324 [btusb] [ 348.280076] process_one_work+0x163/0x2b2 [ 348.280080] worker_thread+0x1a9/0x25c [ 348.280083] kthread+0xf8/0xfd [ 348.280087] ret_from_fork+0x2e/0x38 [ 348.280095] INFO: Freed in xhci_ring_free+0xa7/0xc6 [xhci_hcd] age=98722 cpu=0 pid=33 [ 348.280098] __slab_free+0x4b/0x27a [ 348.280100] kfree+0x12e/0x155 [ 348.280106] xhci_ring_free+0xa7/0xc6 [xhci_hcd] [ 348.280112] xhci_free_endpoint_ring+0x16/0x20 [xhci_hcd] [ 348.280118] xhci_check_bandwidth+0x1c2/0x211 [xhci_hcd] [ 348.280129] usb_hcd_alloc_bandwidth+0x205/0x2a0 [usbcore] [ 348.280139] usb_set_interface+0xeb/0x25d [usbcore] [ 348.280142] btusb_work+0x228/0x324 [btusb] [ 348.280145] process_one_work+0x163/0x2b2 [ 348.280148] worker_thread+0x1a9/0x25c [ 348.280151] kthread+0xf8/0xfd [ 348.280154] ret_from_fork+0x2e/0x38 [ 348.280158] INFO: Slab 0xf46e0fe0 objects=29 used=29 fp=0x (null) flags=0x40008100 [ 348.280160] INFO: Object 0xe5acda48 @offset=6728 fp=0xe5acd700 [ 348.280164] Redzone e5acda40: bb bb bb bb bb bb bb bb [ 348.280167] Object e5acda48: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 348.280169] Object e5acda58: 6b 6b 6b 6b 6b 6b 6b 6b 60 da ac e5 60 da ac e5 `...`... [ 348.280171] Object e5acda68: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 348.280174] Object e5acda78: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 348.280176] Object e5acda88: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 348.280179] Object e5acda98: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkk. [ 348.280181] Redzone e5acdaa8: bb bb bb bb [ 348.280183] Padding e5acdb50: 5a 5a 5a 5a 5a 5a 5a 5a [ 348.280188] CPU: 0 PID: 133 Comm: weston Tainted: GBU O 4.14.55-20180712+ #2 [ 348.280190] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017 [ 348.280192] Call Trace: [ 348.280199] dump_stack+0x47/0x5b [ 348.280202] print_trailer+0x12b/0x133 [ 348.280206] check_bytes_and_report+0x6c/0xae [ 348.280210] check_object+0x10a/0x1db [ 348.280214] alloc_debug_processing+0x79/0x123 [ 348.280218] ___slab_alloc.constprop.24+0x1fc/0x292 [ 348.280224] ? drm_mode_atomic_ioctl+0x374/0x75e [ 348.280227] ? drm_mode_atomic_ioctl+0x374/0x75e [ 348.280231] ? drm_mode_object_get+0x28/0x3a [ 348.280235] ? __radix_tree_lookup+0x27/0x7e [ 348.280238] ? drm_mode_object_get+0x28/0x3a [ 348.280242] ? drm_mode_object_put+0x28/0x4c [ 348.280246] __slab_alloc.isra.18.constprop.23+0x1c/0x25 [ 348.280249] ? __slab_alloc.isra.18.constprop.23+0x1c/0x25 [ 348.280253] kmem_cache_alloc_trace+0x78/0x141 [ 348.280257] ? drm_mode_atomic_ioctl+0x374/0x75e [ 348.280261] drm_mode_atomic_ioctl+0x374/0x75e [ 348.280267] ? drm_atomic_set_property+0x442/0x442 [ 348.280272] drm_ioctl_kernel+0x52/0x88 [ 348.280275] drm_ioctl+0x1fc/0x2c1 [ 348.280279] ? drm_atomic_set_property+0x442/0x442 [
Re: usb HC busted?
Hi Mathias, On Thu, Jul 19, 2018 at 01:59:01PM +0300, Mathias Nyman wrote: > On 17.07.2018 18:10, Sudip Mukherjee wrote: > > Hi Alan, Greg, > > > > On Tue, Jul 17, 2018 at 03:49:18PM +0100, Sudip Mukherjee wrote: > > > On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote: > > > > Hi Alan, > > > > > > > > On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote: > > > > > On Tue, 17 Jul 2018, Sudip Mukherjee wrote: > > > > > > > > > > > I did some more debugging. Tested with a KASAN enabled kernel and > > > > > > that > > > > > > shows the problem. The report is attached. > > > > > > > > > > And, my hacky patch worked as I prevented it from calling > > usb_disable_interface() in this particular case. > > > > Back for a few days, looking at this I hope you had a good holiday. :) > > xhci driver will set up all the endpoints for the new altsetting already in > usb_hcd_alloc_bandwidth(). > > > As first aid I could try to implement checks that make sure the flushed URBs > trb pointers really are on the current endpoint ring, and also add some > warning > if we are we are dropping endpoints with URBs still queued. Yes, please. I think your first-aid will be a much better option than the hacky patch I am using atm. > > But we need to fix this properly as well. > xhci needs to be more in sync with usb core in usb_set_interface(), currently > xhci > has the altssetting up and running when usb core hasn't event started > flushing endpoints. I am able to reproduce this on almost all cycles, so I can always test the fix for you after you are fully back from your holiday. -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
On Tue, Jul 17, 2018 at 04:59:01PM +0100, Sudip Mukherjee wrote: > On Tue, Jul 17, 2018 at 05:52:59PM +0200, Greg KH wrote: > > On Tue, Jul 17, 2018 at 10:31:38AM -0400, Alan Stern wrote: > > > On Tue, 17 Jul 2018, Greg KH wrote: > > > > > > > > From: Sudip Mukherjee > > > > > Date: Tue, 10 Jul 2018 09:50:00 +0100 > > > > > Subject: [PATCH] hacky solution to mem-corruption > > > > > > > > > > Signed-off-by: Sudip Mukherjee > > > > > --- > > > > > > > No, neither of these is right. It's possible to use > > > usb_set_interface() as a kind of "soft" reset. Even when the new > > > altsetting is specified to be the same as the current one, we still > > > have to tell the lower-layer drivers and hardware about it. > > > > You are right, it's a hacky soft reset, I was just trying to figure out > > what the bluetooth driver was trying to do. I wouldn't expect it to be > > calling that function a lot, but I guess it does :( > > usb_set_interface() is being called two times from bluetooth event. But > I am now adding more debugs to see why your patch did not work. So, a very simple debug to see the sequence of functions being called. I have attached the patch I used. In a good case: [ 124.287991] sudip: xhci_urb_dequeue [ 124.287997] sudip: xhci_queue_stop_endpoint cmd=ee032950 [ 124.288016] sudip: handle_cmd_completion cmd=ee032950 [ 124.288173] sudip: xhci_urb_dequeue [ 124.288176] sudip: xhci_queue_stop_endpoint cmd=ee032950 [ 124.288189] sudip: handle_cmd_completion cmd=ee032950 [ 124.290647] sudip: usb_hcd_flush_endpoint [ 124.290652] sudip: usb_hcd_flush_endpoint But in a bad case: [ 186.786900] sudip: xhci_urb_dequeue [ 186.786905] sudip: xhci_queue_stop_endpoint cmd=ebe47cb0 [ 186.786923] sudip: handle_cmd_completion cmd=ebe47cb0 [ 186.789040] sudip: xhci_urb_dequeue [ 186.789047] sudip: xhci_queue_stop_endpoint cmd=ebe47cb0 [ 186.789069] sudip: handle_cmd_completion cmd=ebe47cb0 [ 186.790082] sudip: usb_hcd_flush_endpoint [ 186.790094] sudip: xhci_urb_dequeue [ 186.790097] sudip: xhci_queue_stop_endpoint cmd=ebe47290 [ 186.790150] sudip: handle_cmd_completion cmd=ebe47290 [ 186.790202] sudip: usb_hcd_flush_endpoint So, when usb_hcd_flush_endpoint() is called by usb_disable_endpoint() it finds urbs still on the urb_list of the ep. And in the process of unlinking them, it again sends the command to stop the endpoint, although that endpoint has already been stopped. So Greg's patch did not work as the memory got corrupted on the first call to usb_set_interface(), whereas that patch was preventing the second call to usb_set_interface(). -- Regards Sudip diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c index 467bedeb542a..8d28f120ec0a 100644 --- a/drivers/usb/core/hcd.c +++ b/drivers/usb/core/hcd.c @@ -1885,6 +1885,7 @@ void usb_hcd_flush_endpoint(struct usb_device *udev, might_sleep(); hcd = bus_to_hcd(udev->bus); + pr_err("sudip: %s\n", __func__); /* No more submits can occur */ spin_lock_irq(_urb_list_lock); rescan: diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 6996235e34a9..4f80791fdfc5 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -1450,6 +1450,7 @@ static void handle_cmd_completion(struct xhci_hcd *xhci, case TRB_STOP_RING: WARN_ON(slot_id != TRB_TO_SLOT_ID( le32_to_cpu(cmd_trb->generic.field[3]))); + pr_err("sudip: %s cmd=%p\n", __func__, cmd); xhci_handle_cmd_stop_ep(xhci, slot_id, cmd_trb, event); break; case TRB_SET_DEQ: @@ -4009,6 +4010,7 @@ int xhci_queue_stop_endpoint(struct xhci_hcd *xhci, struct xhci_command *cmd, u32 type = TRB_TYPE(TRB_STOP_RING); u32 trb_suspend = SUSPEND_PORT_FOR_TRB(suspend); + pr_err("sudip: %s cmd=%p\n", __func__, cmd); return queue_command(xhci, cmd, 0, 0, 0, trb_slot_id | trb_ep_index | type | trb_suspend, false); } diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c index db1de6113db2..3832128107ff 100644 --- a/drivers/usb/host/xhci.c +++ b/drivers/usb/host/xhci.c @@ -1516,6 +1516,7 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status) ep->stop_cmd_timer.expires = jiffies + XHCI_STOP_EP_CMD_TIMEOUT * HZ; add_timer(>stop_cmd_timer); + pr_err("sudip: %s\n", __func__); xhci_queue_stop_endpoint(xhci, command, urb->dev->slot_id, ep_index, 0); xhci_ring_cmd_db(xhci); ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
On Tue, Jul 17, 2018 at 05:52:59PM +0200, Greg KH wrote: > On Tue, Jul 17, 2018 at 10:31:38AM -0400, Alan Stern wrote: > > On Tue, 17 Jul 2018, Greg KH wrote: > > > > > > From: Sudip Mukherjee > > > > Date: Tue, 10 Jul 2018 09:50:00 +0100 > > > > Subject: [PATCH] hacky solution to mem-corruption > > > > > > > > Signed-off-by: Sudip Mukherjee > > > > --- > > > > No, neither of these is right. It's possible to use > > usb_set_interface() as a kind of "soft" reset. Even when the new > > altsetting is specified to be the same as the current one, we still > > have to tell the lower-layer drivers and hardware about it. > > You are right, it's a hacky soft reset, I was just trying to figure out > what the bluetooth driver was trying to do. I wouldn't expect it to be > calling that function a lot, but I guess it does :( usb_set_interface() is being called two times from bluetooth event. But I am now adding more debugs to see why your patch did not work. -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Alan, Greg, On Tue, Jul 17, 2018 at 03:49:18PM +0100, Sudip Mukherjee wrote: > On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote: > > Hi Alan, > > > > On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote: > > > On Tue, 17 Jul 2018, Sudip Mukherjee wrote: > > > > > > > I did some more debugging. Tested with a KASAN enabled kernel and that > > > > shows the problem. The report is attached. > > > > > > > > To my understanding: > > > > > > > > btusb_work() is calling usb_set_interface() with alternate = 0. which > > > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by > > > > xhci_free_endpoint_ring(). > > > > > > That doesn't sound like the right thing to do. The rings shouldn't be > > > freed until xhci_endpoint_disable() is called. > > > > > > On the other hand, there doesn't appear to be any > > > xhci_endpoint_disable() routine, although a comment refers to it. > > > Maybe this is the real problem? > > > > one of your old mail might help :) > > > > https://www.spinics.net/lists/linux-usb/msg98123.html > > Wrote too soon. > > Is it the one you are looking for - > usb_disable_endpoint() is in drivers/usb/core/message.c I think now I understand what the problem is. usb_set_interface() calls usb_disable_interface() which again calls usb_disable_endpoint(). This usb_disable_endpoint() gets the pointer to 'ep', marks it as NULL and sends the pointer to usb_hcd_flush_endpoint(). After flushing the endpoints usb_disable_endpoint() calls usb_hcd_disable_endpoint() which tries to do: if (hcd->driver->endpoint_disable) hcd->driver->endpoint_disable(hcd, ep); but there is no endpoint_disable() callback in xhci, so the endpoint is never marked as disabled. So, next time usb_hcd_flush_endpoint() is called I get this corruption. And this is exactly where I used to see the problem happening. And, my hacky patch worked as I prevented it from calling usb_disable_interface() in this particular case. Greg - answering your question here. My hacky patch was based on the fact that usb_hcd_alloc_bandwidth() is calling hcd->driver->drop_endpoint() and hcd->driver->add_endpoint() if (cur_alt && new_alt). So, I prevented usb_disable_interface() to be called for that same condition. And that worked as the call to usb_hcd_flush_endpoint() was not executed. I know it is not correct and I might be having memory leaks for this, but I have the system working till we get the actual fix. -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote: > Hi Alan, > > On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote: > > On Tue, 17 Jul 2018, Sudip Mukherjee wrote: > > > > > I did some more debugging. Tested with a KASAN enabled kernel and that > > > shows the problem. The report is attached. > > > > > > To my understanding: > > > > > > btusb_work() is calling usb_set_interface() with alternate = 0. which > > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by > > > xhci_free_endpoint_ring(). > > > > That doesn't sound like the right thing to do. The rings shouldn't be > > freed until xhci_endpoint_disable() is called. > > > > On the other hand, there doesn't appear to be any > > xhci_endpoint_disable() routine, although a comment refers to it. > > Maybe this is the real problem? > > one of your old mail might help :) > > https://www.spinics.net/lists/linux-usb/msg98123.html Wrote too soon. Is it the one you are looking for - usb_disable_endpoint() is in drivers/usb/core/message.c -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Alan, On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote: > On Tue, 17 Jul 2018, Sudip Mukherjee wrote: > > > I did some more debugging. Tested with a KASAN enabled kernel and that > > shows the problem. The report is attached. > > > > To my understanding: > > > > btusb_work() is calling usb_set_interface() with alternate = 0. which > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by > > xhci_free_endpoint_ring(). > > That doesn't sound like the right thing to do. The rings shouldn't be > freed until xhci_endpoint_disable() is called. > > On the other hand, there doesn't appear to be any > xhci_endpoint_disable() routine, although a comment refers to it. > Maybe this is the real problem? one of your old mail might help :) https://www.spinics.net/lists/linux-usb/msg98123.html -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Greg, On Tue, Jul 17, 2018 at 02:04:11PM +0200, Greg KH wrote: > On Tue, Jul 17, 2018 at 12:41:04PM +0100, Sudip Mukherjee wrote: > > Hi Mathias, > > > > On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote: > > > Hi Mathias, > > > > > > On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote: > > > > On 27.06.2018 14:59, Sudip Mukherjee wrote: > > > > > > > Can you share a bit more details on the platform you are using, > > > > > > > and what types of test you are running. > > > > > > > > > > > Then to track what is going on, I added the slub debugging and :( > > > I have attached part of dmesg for you to check. > > > Will appreciate your help in finding out the problem. > > > > I did some more debugging. Tested with a KASAN enabled kernel and that > > shows the problem. The report is attached. > > > > To my understanding: > > > > btusb_work() is calling usb_set_interface() with alternate = 0. which > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by > > xhci_free_endpoint_ring(). But then usb_set_interface() continues and > > calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()-> > > xhci_urb_dequeue() which at the end gives the command to stop endpoint. > > > > In all the cycles I have tested I see that only in the fail case > > handle_cmd_completion() gets called, but in the cycles where the error > > is not there handle_cmd_completion() is not called with that command. > > > > I am not sure what is happening, and you are the best person to understand > > what is happening. :) > > > > But for now (untill you are back from holiday and suggest a proper > > solution), > > I made a hacky patch (attached) which is working and I donot get any > > corruption after that. Both KASAN and slub debug are also happy. > > > > So, now waiting for you to analyze what is going on and suggest a proper > > fix. > > > > Thanks in advance. > > > > -- > > Regards > > Sudip > > > [ 236.814156] > > == > > [ 236.814187] BUG: KASAN: use-after-free in xhci_trb_virt_to_dma+0x2e/0x74 > > [xhci_hcd] > > [ 236.814193] Read of size 8 at addr 8800789329c8 by task weston/138 > > > > [ 236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U W O > > 4.14.47-20180606+ #7 > > [ 236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017 > > [ 236.814209] Call Trace: > > [ 236.814214] > > [ 236.814226] dump_stack+0x46/0x59 > > [ 236.814238] print_address_description+0x6b/0x23b > > [ 236.814255] ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] > > [ 236.814262] kasan_report+0x220/0x246 > > [ 236.814278] xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] > > [ 236.814294] trb_in_td+0x3b/0x1cd [xhci_hcd] > > [ 236.814311] handle_cmd_completion+0x1181/0x2c9b [xhci_hcd] > > [ 236.814329] ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd] > > [ 236.814337] ? drm_handle_vblank+0x4ec/0x590 > > [ 236.814352] xhci_irq+0x529/0x3294 [xhci_hcd] > > [ 236.814362] ? __accumulate_pelt_segments+0x24/0x33 > > [ 236.814378] ? finish_td.isra.40+0x223/0x223 [xhci_hcd] > > [ 236.814384] ? __accumulate_pelt_segments+0x24/0x33 > > [ 236.814390] ? __accumulate_pelt_segments+0x24/0x33 > > [ 236.814405] ? xhci_irq+0x3294/0x3294 [xhci_hcd] > > [ 236.814412] __handle_irq_event_percpu+0x149/0x3db > > [ 236.814421] handle_irq_event_percpu+0x65/0x109 > > [ 236.814428] ? __handle_irq_event_percpu+0x3db/0x3db > > [ 236.814436] ? ttwu_do_wakeup.isra.18+0x3a2/0x3ce > > [ 236.814442] handle_irq_event+0xa8/0x10a > > [ 236.814449] handle_edge_irq+0x4b2/0x538 > > [ 236.814458] handle_irq+0x3e/0x45 > > [ 236.814465] do_IRQ+0x5c/0x126 > > [ 236.814474] common_interrupt+0x7a/0x7a > > [ 236.814478] > > [ 236.814483] RIP: 0023:0xf79d3d82 > > [ 236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: > > ffdc > > [ 236.814493] RAX: RBX: f7bebd5c RCX: > > > > [ 236.814496] RDX: 08d4197c RSI: RDI: > > f746c020 > > [ 236.814499] RBP: ffc588e8 R08: R09: > > > > [ 236.814503] R10: R11: 00200206 R12: > > > > [ 236.814506] R13: R14:
Re: usb HC busted?
Hi Mathias, On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote: > Hi Mathias, > > On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote: > > On 27.06.2018 14:59, Sudip Mukherjee wrote: > > > > > Can you share a bit more details on the platform you are using, and > > > > > what types of test you are running. > > > > > Then to track what is going on, I added the slub debugging and :( > I have attached part of dmesg for you to check. > Will appreciate your help in finding out the problem. I did some more debugging. Tested with a KASAN enabled kernel and that shows the problem. The report is attached. To my understanding: btusb_work() is calling usb_set_interface() with alternate = 0. which again calls usb_hcd_alloc_bandwidth() and that frees the rings by xhci_free_endpoint_ring(). But then usb_set_interface() continues and calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()-> xhci_urb_dequeue() which at the end gives the command to stop endpoint. In all the cycles I have tested I see that only in the fail case handle_cmd_completion() gets called, but in the cycles where the error is not there handle_cmd_completion() is not called with that command. I am not sure what is happening, and you are the best person to understand what is happening. :) But for now (untill you are back from holiday and suggest a proper solution), I made a hacky patch (attached) which is working and I donot get any corruption after that. Both KASAN and slub debug are also happy. So, now waiting for you to analyze what is going on and suggest a proper fix. Thanks in advance. -- Regards Sudip [ 236.814156] == [ 236.814187] BUG: KASAN: use-after-free in xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] [ 236.814193] Read of size 8 at addr 8800789329c8 by task weston/138 [ 236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U W O 4.14.47-20180606+ #7 [ 236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017 [ 236.814209] Call Trace: [ 236.814214] [ 236.814226] dump_stack+0x46/0x59 [ 236.814238] print_address_description+0x6b/0x23b [ 236.814255] ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] [ 236.814262] kasan_report+0x220/0x246 [ 236.814278] xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] [ 236.814294] trb_in_td+0x3b/0x1cd [xhci_hcd] [ 236.814311] handle_cmd_completion+0x1181/0x2c9b [xhci_hcd] [ 236.814329] ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd] [ 236.814337] ? drm_handle_vblank+0x4ec/0x590 [ 236.814352] xhci_irq+0x529/0x3294 [xhci_hcd] [ 236.814362] ? __accumulate_pelt_segments+0x24/0x33 [ 236.814378] ? finish_td.isra.40+0x223/0x223 [xhci_hcd] [ 236.814384] ? __accumulate_pelt_segments+0x24/0x33 [ 236.814390] ? __accumulate_pelt_segments+0x24/0x33 [ 236.814405] ? xhci_irq+0x3294/0x3294 [xhci_hcd] [ 236.814412] __handle_irq_event_percpu+0x149/0x3db [ 236.814421] handle_irq_event_percpu+0x65/0x109 [ 236.814428] ? __handle_irq_event_percpu+0x3db/0x3db [ 236.814436] ? ttwu_do_wakeup.isra.18+0x3a2/0x3ce [ 236.814442] handle_irq_event+0xa8/0x10a [ 236.814449] handle_edge_irq+0x4b2/0x538 [ 236.814458] handle_irq+0x3e/0x45 [ 236.814465] do_IRQ+0x5c/0x126 [ 236.814474] common_interrupt+0x7a/0x7a [ 236.814478] [ 236.814483] RIP: 0023:0xf79d3d82 [ 236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: ffdc [ 236.814493] RAX: RBX: f7bebd5c RCX: [ 236.814496] RDX: 08d4197c RSI: RDI: f746c020 [ 236.814499] RBP: ffc588e8 R08: R09: [ 236.814503] R10: R11: 00200206 R12: [ 236.814506] R13: R14: R15: [ 236.814513] Allocated by task 2082: [ 236.814521] kasan_kmalloc.part.1+0x51/0xc7 [ 236.814526] kmem_cache_alloc_trace+0x178/0x187 [ 236.814540] xhci_segment_alloc.isra.11+0x9d/0x3bf [xhci_hcd] [ 236.814553] xhci_alloc_segments_for_ring+0x9e/0x176 [xhci_hcd] [ 236.814566] xhci_ring_alloc.constprop.16+0x197/0x4ba [xhci_hcd] [ 236.814579] xhci_endpoint_init+0x77a/0x9ba [xhci_hcd] [ 236.814592] xhci_add_endpoint+0x3bc/0x43b [xhci_hcd] [ 236.814615] usb_hcd_alloc_bandwidth+0x7ef/0x857 [usbcore] [ 236.814637] usb_set_interface+0x294/0x681 [usbcore] [ 236.814645] btusb_work+0x2e6/0x981 [btusb] [ 236.814651] process_one_work+0x579/0x9e9 [ 236.814656] worker_thread+0x68f/0x804 [ 236.814662] kthread+0x31c/0x32b [ 236.814668] ret_from_fork+0x35/0x40 [ 236.814672] Freed by task 1533: [ 236.814678] kasan_slab_free+0xb3/0x15e [ 236.814683] kfree+0x103/0x1a9 [ 236.814696] xhci_ring_free+0x205/0x286 [xhci_hcd] [ 236.814709] xhci_free_endpoint_ring+0x4d/0x83 [xhci_hcd] [ 236.814722] xhci_check_bandwidth+0x57b/0x
Re: usb HC busted?
Hi Mathias, On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote: > On 27.06.2018 14:59, Sudip Mukherjee wrote: > > > > Can you share a bit more details on the platform you are using, and > > > > what types of test you are running. > > > > > > It is a board based on "Intel(R) Atom(TM) CPU E3840 @ 1.91GHz". > > > The usb device in question is a bluetooth device: > > > > > > > There is however freeing of the same dma address: > > <...>-28448 [003] 492.025808: xhci_ring_free: ISOC f1ffb700: enq > 0x2d31bcc0(0x2d31b000) deq > 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 305 bounce > 17 cycle 0 > <...>-28448 [003] 492.025818: xhci_ring_mem_detail: MATTU xhci segment > free seg->dma @ 0x2d31b000 > <...>-28448 [003] 492.025823: xhci_ring_mem_detail: MATTU xhci segment > free seg->dma @ 0x2d31b000 > <...>-28448 [003] 492.025826: xhci_ring_free: ISOC f1f9b380: enq > 0x2d31b140(0x2d31b000) deq > 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 489 bounce > 17 cycle 1 > <...>-28448 [003] 492.025828: xhci_ring_mem_detail: MATTU xhci segment > free seg->dma @ 0x2d31b000 > <...>-28448 [003] 492.025830: xhci_ring_mem_detail: MATTU xhci segment > free seg->dma @ 0x2d31b000 > > I'd guess it's still the same cause, maybe trace is not complete? It is either mutiple freeing of the same address or mutiple allocation of the same address or a combination of both. To track the mutiple allocation I added few extra debugging and it seems that the mutiple allocation is only happening when someone accesses that memory and makes the first 4 bytes (which holds the offset data) as 0. I have not yet checked in what condition does it try to free the same address more than once. Then to track what is going on, I added the slub debugging and :( I have attached part of dmesg for you to check. Will appreciate your help in finding out the problem. -- Regards Sudip [ 383.096204] = [ 383.096212] BUG kmalloc-96 (Tainted: G U O ): Poison overwritten [ 383.096213] - [ 383.096215] Disabling lock debugging due to kernel taint [ 383.096218] INFO: 0xdccd1b78-0xdccd1b7f. First byte 0x78 instead of 0x6b [ 383.096232] INFO: Allocated in xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd] age=227516 cpu=2 pid=21 [ 383.096240] ___slab_alloc.constprop.24+0x1fc/0x292 [ 383.096243] __slab_alloc.isra.18.constprop.23+0x1c/0x25 [ 383.096246] kmem_cache_alloc_trace+0x78/0x141 [ 383.096252] xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd] [ 383.096259] xhci_endpoint_init+0x25f/0x30a [xhci_hcd] [ 383.096265] xhci_add_endpoint+0x126/0x149 [xhci_hcd] [ 383.096276] usb_hcd_alloc_bandwidth+0x26a/0x2a0 [usbcore] [ 383.096287] usb_set_interface+0xeb/0x25d [usbcore] [ 383.096292] btusb_work+0xeb/0x324 [btusb] [ 383.096296] process_one_work+0x163/0x2b2 [ 383.096299] worker_thread+0x1a9/0x25c [ 383.096301] kthread+0xf8/0xfd [ 383.096306] ret_from_fork+0x2e/0x38 [ 383.096314] INFO: Freed in xhci_ring_free+0xa7/0xc6 [xhci_hcd] age=197020 cpu=0 pid=324 [ 383.096317] __slab_free+0x4b/0x27a [ 383.096319] kfree+0x12e/0x155 [ 383.096325] xhci_ring_free+0xa7/0xc6 [xhci_hcd] [ 383.096331] xhci_free_endpoint_ring+0x16/0x20 [xhci_hcd] [ 383.096338] xhci_check_bandwidth+0x1bf/0x20e [xhci_hcd] [ 383.096348] usb_hcd_alloc_bandwidth+0x205/0x2a0 [usbcore] [ 383.096358] usb_set_interface+0xeb/0x25d [usbcore] [ 383.096361] btusb_work+0x228/0x324 [btusb] [ 383.096364] process_one_work+0x163/0x2b2 [ 383.096367] worker_thread+0x1a9/0x25c [ 383.096370] kthread+0xf8/0xfd [ 383.096373] ret_from_fork+0x2e/0x38 [ 383.096376] INFO: Slab 0xf457e080 objects=29 used=29 fp=0x (null) flags=0x40008100 [ 383.096379] INFO: Object 0xdccd1b60 @offset=7008 fp=0xdccd0350 [ 383.096383] Redzone dccd1b58: bb bb bb bb bb bb bb bb [ 383.096386] Object dccd1b60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 383.096388] Object dccd1b70: 6b 6b 6b 6b 6b 6b 6b 6b 78 1b cd dc 78 1b cd dc x...x... [ 383.096390] Object dccd1b80: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 383.096393] Object dccd1b90: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 383.096395] Object dccd1ba0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 383.096397] Object dccd1bb0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkk. [ 383.096400] Redzone dccd1bc0: bb bb bb bb
Re: usb HC busted?
On Wed, Jun 27, 2018 at 12:59:48PM +0100, Sudip Mukherjee wrote: > Hi Mathias, > > On Mon, Jun 25, 2018 at 05:15:00PM +0100, Sudip Mukherjee wrote: > > Hi Mathias, > > > > On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote: > > > On 21.06.2018 03:53, Sudip Mukherjee wrote: > > > > Hi Mathias, Andy, > > > > > > > > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote: > > > > > On 06.06.2018 19:45, Sudip Mukherjee wrote: > > > > > > > Can you share a bit more details on the platform you are using, and what > > > types of test you are running. > > > > Sorry for the delayed reply, I was in Tokyo for the OSS. > > > > It is a board based on "Intel(R) Atom(TM) CPU E3840 @ 1.91GHz". > > The usb device in question is a bluetooth device: > > > > Bus 001 Device 012: ID 8087:07dc Intel Corp. > > > > > And the problem that we are seeing is with phone calls via bluetooth. > > > > > Does my test above trigger the case? (show "MATTU dmatest match!") > > > > I have kept it for tonight, will see the results tomorrow morning. > > And I am using that same device in the usb script to change "authrized". > > No, your test did not trigger the error. :( > > But, my last night's test (with an added debug to get some extra trace for > addresses) showed the same error of - > "Looking for event-dma", but looking at the ftrace, I could not see it > getting same address from dma_pool_zalloc(). > > Can you please have a look at the dmesg and ftrace at: > https://drive.google.com/open?id=1nMy_qVxOQzcZNYa9bw7az9WiS2MZzdKo And to add to my previous mail, in another cycle where I do see the same problem and my extra debugs give the following: <...>-23974 [002] 495.991276: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000 <...>-23974 [002] 495.991285: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=1, offset=0 <...>-23974 [002] 495.991289: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000 <...>-23974 [002] 495.991292: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=2, offset=0 <...>-23974 [002] 495.991295: xhci_ring_alloc: ISOC f0b62900: enq 0x2d21c000(0x2d21c000) deq 0x2d21c000(0x2d21c000) segs 2 stream 0 free_trbs 509 bounce 17 cycle 1 <...>-23974 [002] 495.991298: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000 <...>-23974 [002] 495.991301: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=3, offset=0 <...>-23974 [002] 495.991304: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000 <...>-23974 [002] 495.991306: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=4, offset=0 I am totally lost now. Are we looking at two different issues? This log shows same addresses, my previous mail and log did not show the same addresses. :( -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Mathias, On Mon, Jun 25, 2018 at 05:15:00PM +0100, Sudip Mukherjee wrote: > Hi Mathias, > > On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote: > > On 21.06.2018 03:53, Sudip Mukherjee wrote: > > > Hi Mathias, Andy, > > > > > > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote: > > > > On 06.06.2018 19:45, Sudip Mukherjee wrote: > > > > Can you share a bit more details on the platform you are using, and what > > types of test you are running. > > Sorry for the delayed reply, I was in Tokyo for the OSS. > > It is a board based on "Intel(R) Atom(TM) CPU E3840 @ 1.91GHz". > The usb device in question is a bluetooth device: > > Bus 001 Device 012: ID 8087:07dc Intel Corp. > > And the problem that we are seeing is with phone calls via bluetooth. > > > Does my test above trigger the case? (show "MATTU dmatest match!") > > I have kept it for tonight, will see the results tomorrow morning. > And I am using that same device in the usb script to change "authrized". No, your test did not trigger the error. :( But, my last night's test (with an added debug to get some extra trace for addresses) showed the same error of - "Looking for event-dma", but looking at the ftrace, I could not see it getting same address from dma_pool_zalloc(). Can you please have a look at the dmesg and ftrace at: https://drive.google.com/open?id=1nMy_qVxOQzcZNYa9bw7az9WiS2MZzdKo -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Mathias, On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote: > On 21.06.2018 03:53, Sudip Mukherjee wrote: > > Hi Mathias, Andy, > > > > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote: > > > On 06.06.2018 19:45, Sudip Mukherjee wrote: > > git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git dmapool-test > https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=dmapool-test > > Tested by just leaving the following running for a few days: > > while true; do echo 0 > authorized; sleep 3; echo 1 > authorized; sleep 3; > done; > For some usb device (for example: /sys/bus/usb/devices/1-8) > > Then grep logs for "MATTU dmatest match! " > > Can you share a bit more details on the platform you are using, and what > types of test you are running. Sorry for the delayed reply, I was in Tokyo for the OSS. It is a board based on "Intel(R) Atom(TM) CPU E3840 @ 1.91GHz". The usb device in question is a bluetooth device: Bus 001 Device 012: ID 8087:07dc Intel Corp. Device Descriptor: bLength18 bDescriptorType 1 bcdUSB 2.00 bDeviceClass 224 Wireless bDeviceSubClass 1 Radio Frequency bDeviceProtocol 1 Bluetooth bMaxPacketSize064 idVendor 0x8087 Intel Corp. idProduct 0x07dc bcdDevice0.01 iManufacturer 0 iProduct0 iSerial 0 bNumConfigurations 1 And the problem that we are seeing is with phone calls via bluetooth. > Does my test above trigger the case? (show "MATTU dmatest match!") I have kept it for tonight, will see the results tomorrow morning. And I am using that same device in the usb script to change "authrized". But looking at the code for dma_pool_alloc(), it seems 'dma' can have same value again only if "*(int *)(page->vaddr + offset)" gets a value of 0 in pool_initialise_page(). But I can't think of anyway how it can be 0. I have also added some more debugs in the kernel to see what might be going wrong there. -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Mathias, Andy, On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote: > On 06.06.2018 19:45, Sudip Mukherjee wrote: > > Hi Andy, > > > > And we meet again. :) > > > > On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote: > > > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote: > > > > On 04.06.2018 18:28, Sudip Mukherjee wrote: > > > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote: > > > > > > > > > > > > > Odd and unlikely, but to me this looks like some issue in allocating > > > > dma memory > > > > from pool using dma_pool_zalloc() > > > > > > > > > > > > Here's the story: > > > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel. > > > > All tracing points to dma_pool_zalloc() returning the same dma address > > > > block on > > > > consecutive calls. > > > > > > > > In the failing case dma_pool_zalloc() is called 3 - 6us apart. > > > > > > > > <...>-26362 [002] 1186.756739: xhci_ring_mem_detail: MATTU > > > > xhci_segment_alloc dma @ 0x2d92b000 > > > > <...>-26362 [002] 1186.756745: xhci_ring_mem_detail: MATTU > > > > xhci_segment_alloc dma @ 0x2d92b000 > > > > <...>-26362 [002] 1186.756748: xhci_ring_mem_detail: MATTU > > > > xhci_segment_alloc dma @ 0x2d92b000 > > > > > > > > dma_pool_zalloc() is called from xhci_segment_alloc() in > > > > drivers/usb/host/xhci-mem.c > > > > see: > > > > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci- > > > > mem.c#L52 > > > > > > > > prints above are custom traces added right after dma_pool_zalloc() > > > > > > For better understanding it would be good to have dma_pool_free() calls > > > debugged as well. > > > > Sudip has a full (394M unpacked) trace at: > https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx- > > But then it gets stuck, for the whole ring2 dma_pool_zalloc() just returns > the same dma address as the last segment for > ring1:0x2d92b000. Last part of trace snippet is just another ring being freed. A gentle ping on this. Any idea on what the problem might be and any possible fix? -- regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi All, On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote: > On 06.06.2018 19:45, Sudip Mukherjee wrote: > > Hi Andy, > > > > And we meet again. :) > > > > On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote: > > > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote: > > > > On 04.06.2018 18:28, Sudip Mukherjee wrote: > > > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote: > > > > > > > > > > > > > Odd and unlikely, but to me this looks like some issue in allocating > > > > dma memory > > > > from pool using dma_pool_zalloc() > > > > > > > > Adding people with DMA knowledge to cc, maybe someone knows what is > > > > going on. > > > > > > > > Here's the story: > > > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel. > > > > All tracing points to dma_pool_zalloc() returning the same dma address > > > > block on > > > > consecutive calls. We have started testing with v4.14.47 now and we are seeing the issue with it also. :( -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Andy, And we meet again. :) On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote: > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote: > > On 04.06.2018 18:28, Sudip Mukherjee wrote: > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote: > > > > > > > Odd and unlikely, but to me this looks like some issue in allocating > > dma memory > > from pool using dma_pool_zalloc() > > > > Adding people with DMA knowledge to cc, maybe someone knows what is > > going on. > > > > Here's the story: > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel. > > All tracing points to dma_pool_zalloc() returning the same dma address > > block on > > consecutive calls. > > > > In the failing case dma_pool_zalloc() is called 3 - 6us apart. > > > > <...>-26362 [002] 1186.756739: xhci_ring_mem_detail: MATTU > > xhci_segment_alloc dma @ 0x2d92b000 > > <...>-26362 [002] 1186.756745: xhci_ring_mem_detail: MATTU > > xhci_segment_alloc dma @ 0x2d92b000 > > <...>-26362 [002] 1186.756748: xhci_ring_mem_detail: MATTU > > xhci_segment_alloc dma @ 0x2d92b000 > > > > dma_pool_zalloc() is called from xhci_segment_alloc() in > > drivers/usb/host/xhci-mem.c > > see: > > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci- > > mem.c#L52 > > > > prints above are custom traces added right after dma_pool_zalloc() > > For better understanding it would be good to have dma_pool_free() calls > debugged as well. So, I am adding another trace event for dma_pool_free() and continuing with the test. Is there anything else that I should be adding as debug? -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
On Wed, Jun 06, 2018 at 05:12:21PM +0300, Mathias Nyman wrote: > On 04.06.2018 18:28, Sudip Mukherjee wrote: > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote: > > > > > > > Will request you to have a look at it. > > > > Odd and unlikely, but to me this looks like some issue in allocating dma > memory > from pool using dma_pool_zalloc() > > Adding people with DMA knowledge to cc, maybe someone knows what is going on. Thanks Mathias. -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] iommu/vt-d: fix memory leak
We are returning NULL if we are not able to attach the iommu to the domain but while returning we missed freeing info. Signed-off-by: Sudip Mukherjee <su...@vectorindia.org> --- drivers/iommu/intel-iommu.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 8a74fcc..668acf7 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -2305,6 +2305,7 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu, if (ret) { spin_unlock_irqrestore(_domain_lock, flags); + free_devinfo_mem(info); return NULL; } -- 1.9.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu