Re: usb HC busted?

2018-07-21 Thread Sudip Mukherjee
Hi Mathias,

On Fri, Jul 20, 2018 at 01:54:21PM +0100, Sudip Mukherjee wrote:
> Hi Mathias,
> 
> On Fri, Jul 20, 2018 at 02:10:58PM +0300, Mathias Nyman wrote:
> > On 19.07.2018 20:32, Sudip Mukherjee wrote:
> > > Hi Mathias,
> > > 
> > > On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote:
> > > > > > As first aid I could try to implement checks that make sure the 
> > > > > > flushed URBs
> > > > > > trb pointers really are on the current endpoint ring, and also add 
> > > > > > some warning
> > > > > > if we are we are dropping endpoints with URBs still queued.
> > > > > 
> > > > > Yes, please. I think your first-aid will be a much better option than
> > > > > the hacky patch I am using atm.
> > > > > 
> > > > 
> 
> > So poison is overwritten at e5acda58 with almost its own address, (reading 
> > backwards) e5 ac da 60, twice.
> > looks like something (32bit?)is pointing to itself twice, maybe a linked 
> > list node next and prev pointer
> > being set to point to itself as last item was removed from list.
> > 
> > The cancelled_td_list is part of struct xhci_virt_ep, so that should be 
> > fine.
> > But td_list is part of struct xhci_ring, which was freed. and we removed 
> > the URBs tds from the td_list when
> > flushing the ring after ring was freed
> > 
> > I changed the patch (attached) to make sure it doesn't touch the td_list 
> > when canceling a URB after
> > ring is freed.
> > 
> > How about this one, any improvements?
> 
> Yes, it worked. :D
> 
> So, cycle-1 = no change, just to make sure I can still reproduce the error.
> cycle-2 and cycle-3 with your patch, and there was no problem,
> slub debug was also happy.
> I am starting an autotest with this patch now, and I will have almost
> 50 cycles tested by tomorrow morning.

I can confirm that your bandaid patch has worked. Total of 67 cycles
tested till now and there was no error. Its continuing to test over the
weekend.
Thank you very much for this one. :)

I guess you will start with the proper fix, that you and Alan had been
discussing, after you are fully back to work.

--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-07-20 Thread Sudip Mukherjee
Hi Mathias,

On Fri, Jul 20, 2018 at 02:10:58PM +0300, Mathias Nyman wrote:
> On 19.07.2018 20:32, Sudip Mukherjee wrote:
> > Hi Mathias,
> > 
> > On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote:
> > > > > As first aid I could try to implement checks that make sure the 
> > > > > flushed URBs
> > > > > trb pointers really are on the current endpoint ring, and also add 
> > > > > some warning
> > > > > if we are we are dropping endpoints with URBs still queued.
> > > > 
> > > > Yes, please. I think your first-aid will be a much better option than
> > > > the hacky patch I am using atm.
> > > > 
> > > 

> So poison is overwritten at e5acda58 with almost its own address, (reading 
> backwards) e5 ac da 60, twice.
> looks like something (32bit?)is pointing to itself twice, maybe a linked list 
> node next and prev pointer
> being set to point to itself as last item was removed from list.
> 
> The cancelled_td_list is part of struct xhci_virt_ep, so that should be fine.
> But td_list is part of struct xhci_ring, which was freed. and we removed the 
> URBs tds from the td_list when
> flushing the ring after ring was freed
> 
> I changed the patch (attached) to make sure it doesn't touch the td_list when 
> canceling a URB after
> ring is freed.
> 
> How about this one, any improvements?

Yes, it worked. :D

So, cycle-1 = no change, just to make sure I can still reproduce the error.
cycle-2 and cycle-3 with your patch, and there was no problem,
slub debug was also happy.
I am starting an autotest with this patch now, and I will have almost
50 cycles tested by tomorrow morning.

--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-07-19 Thread Sudip Mukherjee
Hi Mathias,

On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote:
> > > As first aid I could try to implement checks that make sure the flushed 
> > > URBs
> > > trb pointers really are on the current endpoint ring, and also add some 
> > > warning
> > > if we are we are dropping endpoints with URBs still queued.
> > 
> > Yes, please. I think your first-aid will be a much better option than
> > the hacky patch I am using atm.
> > 
> 
> Attached a patch that checks canceled URB td/trb pointers.
> I haven't tested it at all (well compiles and boots, but new code never 
> exercised)
> 
> Does it work for you?

No, not exactly. :(

I can see your message getting printed.
[  249.518394] xhci_hcd :00:14.0: Canceled URB td not found on endpoint ring
[  249.518431] xhci_hcd :00:14.0: Canceled URB td not found on endpoint ring

But I can see the message from slub debug again:

[  348.279986] 
=
[  348.279993] BUG kmalloc-96 (Tainted: G U O   ): Poison overwritten
[  348.279995] 
-

[  348.279997] Disabling lock debugging due to kernel taint
[  348.28] INFO: 0xe5acda60-0xe5acda67. First byte 0x60 instead of 0x6b
[  348.280012] INFO: Allocated in xhci_ring_alloc.constprop.14+0x31/0x125 
[xhci_hcd] age=129264 cpu=0 pid=33
[  348.280019]  ___slab_alloc.constprop.24+0x1fc/0x292
[  348.280023]  __slab_alloc.isra.18.constprop.23+0x1c/0x25
[  348.280026]  kmem_cache_alloc_trace+0x78/0x141
[  348.280032]  xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd]
[  348.280038]  xhci_endpoint_init+0x25f/0x30a [xhci_hcd]
[  348.280044]  xhci_add_endpoint+0x126/0x149 [xhci_hcd]
[  348.280057]  usb_hcd_alloc_bandwidth+0x26a/0x2a0 [usbcore]
[  348.280067]  usb_set_interface+0xeb/0x25d [usbcore]
[  348.280071]  btusb_work+0xeb/0x324 [btusb]
[  348.280076]  process_one_work+0x163/0x2b2
[  348.280080]  worker_thread+0x1a9/0x25c
[  348.280083]  kthread+0xf8/0xfd
[  348.280087]  ret_from_fork+0x2e/0x38
[  348.280095] INFO: Freed in xhci_ring_free+0xa7/0xc6 [xhci_hcd] age=98722 
cpu=0 pid=33
[  348.280098]  __slab_free+0x4b/0x27a
[  348.280100]  kfree+0x12e/0x155
[  348.280106]  xhci_ring_free+0xa7/0xc6 [xhci_hcd]
[  348.280112]  xhci_free_endpoint_ring+0x16/0x20 [xhci_hcd]
[  348.280118]  xhci_check_bandwidth+0x1c2/0x211 [xhci_hcd]
[  348.280129]  usb_hcd_alloc_bandwidth+0x205/0x2a0 [usbcore]
[  348.280139]  usb_set_interface+0xeb/0x25d [usbcore]
[  348.280142]  btusb_work+0x228/0x324 [btusb]
[  348.280145]  process_one_work+0x163/0x2b2
[  348.280148]  worker_thread+0x1a9/0x25c
[  348.280151]  kthread+0xf8/0xfd
[  348.280154]  ret_from_fork+0x2e/0x38
[  348.280158] INFO: Slab 0xf46e0fe0 objects=29 used=29 fp=0x  (null) 
flags=0x40008100
[  348.280160] INFO: Object 0xe5acda48 @offset=6728 fp=0xe5acd700

[  348.280164] Redzone e5acda40: bb bb bb bb bb bb bb bb
  
[  348.280167] Object e5acda48: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  348.280169] Object e5acda58: 6b 6b 6b 6b 6b 6b 6b 6b 60 da ac e5 60 da ac e5 
 `...`...
[  348.280171] Object e5acda68: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  348.280174] Object e5acda78: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  348.280176] Object e5acda88: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  348.280179] Object e5acda98: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 
 kkk.
[  348.280181] Redzone e5acdaa8: bb bb bb bb
  
[  348.280183] Padding e5acdb50: 5a 5a 5a 5a 5a 5a 5a 5a
  
[  348.280188] CPU: 0 PID: 133 Comm: weston Tainted: GBU O
4.14.55-20180712+ #2
[  348.280190] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017
[  348.280192] Call Trace:
[  348.280199]  dump_stack+0x47/0x5b
[  348.280202]  print_trailer+0x12b/0x133
[  348.280206]  check_bytes_and_report+0x6c/0xae
[  348.280210]  check_object+0x10a/0x1db
[  348.280214]  alloc_debug_processing+0x79/0x123
[  348.280218]  ___slab_alloc.constprop.24+0x1fc/0x292
[  348.280224]  ? drm_mode_atomic_ioctl+0x374/0x75e
[  348.280227]  ? drm_mode_atomic_ioctl+0x374/0x75e
[  348.280231]  ? drm_mode_object_get+0x28/0x3a
[  348.280235]  ? __radix_tree_lookup+0x27/0x7e
[  348.280238]  ? drm_mode_object_get+0x28/0x3a
[  348.280242]  ? drm_mode_object_put+0x28/0x4c
[  348.280246]  __slab_alloc.isra.18.constprop.23+0x1c/0x25
[  348.280249]  ? __slab_alloc.isra.18.constprop.23+0x1c/0x25
[  348.280253]  kmem_cache_alloc_trace+0x78/0x141
[  348.280257]  ? drm_mode_atomic_ioctl+0x374/0x75e
[  348.280261]  drm_mode_atomic_ioctl+0x374/0x75e
[  348.280267]  ? drm_atomic_set_property+0x442/0x442
[  348.280272]  drm_ioctl_kernel+0x52/0x88
[  348.280275]  drm_ioctl+0x1fc/0x2c1
[  348.280279]  ? drm_atomic_set_property+0x442/0x442
[  

Re: usb HC busted?

2018-07-19 Thread Sudip Mukherjee
Hi Mathias,

On Thu, Jul 19, 2018 at 01:59:01PM +0300, Mathias Nyman wrote:
> On 17.07.2018 18:10, Sudip Mukherjee wrote:
> > Hi Alan, Greg,
> > 
> > On Tue, Jul 17, 2018 at 03:49:18PM +0100, Sudip Mukherjee wrote:
> > > On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
> > > > Hi Alan,
> > > > 
> > > > On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> > > > > On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> > > > > 
> > > > > > I did some more debugging. Tested with a KASAN enabled kernel and 
> > > > > > that
> > > > > > shows the problem. The report is attached.
> > > > > > 

> > 
> > And, my hacky patch worked as I prevented it from calling
> > usb_disable_interface() in this particular case.
> > 
> 
> Back for a few days, looking at this

I hope you had a good holiday. :)

> 
> xhci driver will set up all the endpoints for the new altsetting already in
> usb_hcd_alloc_bandwidth().
> 

> 
> As first aid I could try to implement checks that make sure the flushed URBs
> trb pointers really are on the current endpoint ring, and also add some 
> warning
> if we are we are dropping endpoints with URBs still queued.

Yes, please. I think your first-aid will be a much better option than
the hacky patch I am using atm.

> 
> But we need to fix this properly as well.
> xhci needs to be more in sync with usb core in usb_set_interface(), currently 
> xhci
> has the altssetting up and running when usb core hasn't event started 
> flushing endpoints.

I am able to reproduce this on almost all cycles, so I can always test
the fix for you after you are fully back from your holiday.

--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
On Tue, Jul 17, 2018 at 04:59:01PM +0100, Sudip Mukherjee wrote:
> On Tue, Jul 17, 2018 at 05:52:59PM +0200, Greg KH wrote:
> > On Tue, Jul 17, 2018 at 10:31:38AM -0400, Alan Stern wrote:
> > > On Tue, 17 Jul 2018, Greg KH wrote:
> > > 
> > > > > From: Sudip Mukherjee 
> > > > > Date: Tue, 10 Jul 2018 09:50:00 +0100
> > > > > Subject: [PATCH] hacky solution to mem-corruption
> > > > > 
> > > > > Signed-off-by: Sudip Mukherjee 
> > > > > ---
> 
> > > 
> > > No, neither of these is right.  It's possible to use 
> > > usb_set_interface() as a kind of "soft" reset.  Even when the new 
> > > altsetting is specified to be the same as the current one, we still 
> > > have to tell the lower-layer drivers and hardware about it.
> > 
> > You are right, it's a hacky soft reset, I was just trying to figure out
> > what the bluetooth driver was trying to do.  I wouldn't expect it to be
> > calling that function a lot, but I guess it does :(
> 
> usb_set_interface() is being called two times from bluetooth event. But
> I am now adding more debugs to see why your patch did not work.

So, a very simple debug to see the sequence of functions being called.
I have attached the patch I used.

In a good case:
[  124.287991] sudip: xhci_urb_dequeue
[  124.287997] sudip: xhci_queue_stop_endpoint cmd=ee032950
[  124.288016] sudip: handle_cmd_completion cmd=ee032950
[  124.288173] sudip: xhci_urb_dequeue
[  124.288176] sudip: xhci_queue_stop_endpoint cmd=ee032950
[  124.288189] sudip: handle_cmd_completion cmd=ee032950
[  124.290647] sudip: usb_hcd_flush_endpoint
[  124.290652] sudip: usb_hcd_flush_endpoint

But in a bad case:
[  186.786900] sudip: xhci_urb_dequeue
[  186.786905] sudip: xhci_queue_stop_endpoint cmd=ebe47cb0
[  186.786923] sudip: handle_cmd_completion cmd=ebe47cb0
[  186.789040] sudip: xhci_urb_dequeue
[  186.789047] sudip: xhci_queue_stop_endpoint cmd=ebe47cb0
[  186.789069] sudip: handle_cmd_completion cmd=ebe47cb0
[  186.790082] sudip: usb_hcd_flush_endpoint
[  186.790094] sudip: xhci_urb_dequeue
[  186.790097] sudip: xhci_queue_stop_endpoint cmd=ebe47290
[  186.790150] sudip: handle_cmd_completion cmd=ebe47290
[  186.790202] sudip: usb_hcd_flush_endpoint

So, when usb_hcd_flush_endpoint() is called by usb_disable_endpoint() it
finds urbs still on the urb_list of the ep. And in the process of unlinking
them, it again sends the command to stop the endpoint, although that endpoint
has already been stopped.
So Greg's patch did not work as the memory got corrupted on the first call
to usb_set_interface(), whereas that patch was preventing the second call
to usb_set_interface().

--
Regards
Sudip
diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 467bedeb542a..8d28f120ec0a 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1885,6 +1885,7 @@ void usb_hcd_flush_endpoint(struct usb_device *udev,
might_sleep();
hcd = bus_to_hcd(udev->bus);
 
+   pr_err("sudip: %s\n", __func__);
/* No more submits can occur */
spin_lock_irq(_urb_list_lock);
 rescan:
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 6996235e34a9..4f80791fdfc5 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1450,6 +1450,7 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
case TRB_STOP_RING:
WARN_ON(slot_id != TRB_TO_SLOT_ID(
le32_to_cpu(cmd_trb->generic.field[3])));
+   pr_err("sudip: %s cmd=%p\n", __func__, cmd);
xhci_handle_cmd_stop_ep(xhci, slot_id, cmd_trb, event);
break;
case TRB_SET_DEQ:
@@ -4009,6 +4010,7 @@ int xhci_queue_stop_endpoint(struct xhci_hcd *xhci, 
struct xhci_command *cmd,
u32 type = TRB_TYPE(TRB_STOP_RING);
u32 trb_suspend = SUSPEND_PORT_FOR_TRB(suspend);
 
+   pr_err("sudip: %s cmd=%p\n", __func__, cmd);
return queue_command(xhci, cmd, 0, 0, 0,
trb_slot_id | trb_ep_index | type | trb_suspend, false);
 }
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index db1de6113db2..3832128107ff 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -1516,6 +1516,7 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct 
urb *urb, int status)
ep->stop_cmd_timer.expires = jiffies +
XHCI_STOP_EP_CMD_TIMEOUT * HZ;
add_timer(>stop_cmd_timer);
+   pr_err("sudip: %s\n", __func__);
xhci_queue_stop_endpoint(xhci, command, urb->dev->slot_id,
 ep_index, 0);
xhci_ring_cmd_db(xhci);
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
On Tue, Jul 17, 2018 at 05:52:59PM +0200, Greg KH wrote:
> On Tue, Jul 17, 2018 at 10:31:38AM -0400, Alan Stern wrote:
> > On Tue, 17 Jul 2018, Greg KH wrote:
> > 
> > > > From: Sudip Mukherjee 
> > > > Date: Tue, 10 Jul 2018 09:50:00 +0100
> > > > Subject: [PATCH] hacky solution to mem-corruption
> > > > 
> > > > Signed-off-by: Sudip Mukherjee 
> > > > ---

> > 
> > No, neither of these is right.  It's possible to use 
> > usb_set_interface() as a kind of "soft" reset.  Even when the new 
> > altsetting is specified to be the same as the current one, we still 
> > have to tell the lower-layer drivers and hardware about it.
> 
> You are right, it's a hacky soft reset, I was just trying to figure out
> what the bluetooth driver was trying to do.  I wouldn't expect it to be
> calling that function a lot, but I guess it does :(

usb_set_interface() is being called two times from bluetooth event. But
I am now adding more debugs to see why your patch did not work.

--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
Hi Alan, Greg,

On Tue, Jul 17, 2018 at 03:49:18PM +0100, Sudip Mukherjee wrote:
> On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
> > Hi Alan,
> > 
> > On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> > > On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> > > 
> > > > I did some more debugging. Tested with a KASAN enabled kernel and that
> > > > shows the problem. The report is attached.
> > > > 
> > > > To my understanding:
> > > > 
> > > > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > > > xhci_free_endpoint_ring().
> > > 
> > > That doesn't sound like the right thing to do.  The rings shouldn't be 
> > > freed until xhci_endpoint_disable() is called.  
> > > 
> > > On the other hand, there doesn't appear to be any 
> > > xhci_endpoint_disable() routine, although a comment refers to it.  
> > > Maybe this is the real problem?
> > 
> > one of your old mail might help :)
> > 
> > https://www.spinics.net/lists/linux-usb/msg98123.html
> 
> Wrote too soon.
> 
> Is it the one you are looking for -
> usb_disable_endpoint() is in drivers/usb/core/message.c

I think now I understand what the problem is.
usb_set_interface() calls usb_disable_interface() which again calls
usb_disable_endpoint(). This usb_disable_endpoint() gets the pointer
to 'ep', marks it as NULL and sends the pointer to usb_hcd_flush_endpoint().
After flushing the endpoints usb_disable_endpoint() calls
usb_hcd_disable_endpoint() which tries to do:
if (hcd->driver->endpoint_disable)
hcd->driver->endpoint_disable(hcd, ep);
but there is no endpoint_disable() callback in xhci, so the endpoint is
never marked as disabled. So, next time usb_hcd_flush_endpoint() is
called I get this corruption. 
And this is exactly where I used to see the problem happening.

And, my hacky patch worked as I prevented it from calling
usb_disable_interface() in this particular case.

Greg - answering your question here. My hacky patch was based on the
fact that usb_hcd_alloc_bandwidth() is calling hcd->driver->drop_endpoint()
and hcd->driver->add_endpoint() if (cur_alt && new_alt). So, I prevented
usb_disable_interface() to be called for that same condition. And that
worked as the call to usb_hcd_flush_endpoint() was not executed.
I know it is not correct and I might be having memory leaks for this, but
I have the system working till we get the actual fix.

--
Regards
Sudip

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
> Hi Alan,
> 
> On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> > On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> > 
> > > I did some more debugging. Tested with a KASAN enabled kernel and that
> > > shows the problem. The report is attached.
> > > 
> > > To my understanding:
> > > 
> > > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > > xhci_free_endpoint_ring().
> > 
> > That doesn't sound like the right thing to do.  The rings shouldn't be 
> > freed until xhci_endpoint_disable() is called.  
> > 
> > On the other hand, there doesn't appear to be any 
> > xhci_endpoint_disable() routine, although a comment refers to it.  
> > Maybe this is the real problem?
> 
> one of your old mail might help :)
> 
> https://www.spinics.net/lists/linux-usb/msg98123.html

Wrote too soon.

Is it the one you are looking for -
usb_disable_endpoint() is in drivers/usb/core/message.c

--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
Hi Alan,

On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> 
> > I did some more debugging. Tested with a KASAN enabled kernel and that
> > shows the problem. The report is attached.
> > 
> > To my understanding:
> > 
> > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > xhci_free_endpoint_ring().
> 
> That doesn't sound like the right thing to do.  The rings shouldn't be 
> freed until xhci_endpoint_disable() is called.  
> 
> On the other hand, there doesn't appear to be any 
> xhci_endpoint_disable() routine, although a comment refers to it.  
> Maybe this is the real problem?

one of your old mail might help :)

https://www.spinics.net/lists/linux-usb/msg98123.html

--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
Hi Greg,

On Tue, Jul 17, 2018 at 02:04:11PM +0200, Greg KH wrote:
> On Tue, Jul 17, 2018 at 12:41:04PM +0100, Sudip Mukherjee wrote:
> > Hi Mathias,
> > 
> > On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote:
> > > Hi Mathias,
> > > 
> > > On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote:
> > > > On 27.06.2018 14:59, Sudip Mukherjee wrote:
> > > > > > > Can you share a bit more details on the platform you are using, 
> > > > > > > and what types of test you are running.
> > > > > > 
> > 
> > > Then to track what is going on, I added the slub debugging and :(
> > > I have attached part of dmesg for you to check.
> > > Will appreciate your help in finding out the problem.
> > 
> > I did some more debugging. Tested with a KASAN enabled kernel and that
> > shows the problem. The report is attached.
> > 
> > To my understanding:
> > 
> > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > xhci_free_endpoint_ring(). But then usb_set_interface() continues and
> > calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()->
> > xhci_urb_dequeue() which at the end gives the command to stop endpoint.
> > 
> > In all the cycles I have tested I see that only in the fail case
> > handle_cmd_completion() gets called, but in the cycles where the error
> > is not there handle_cmd_completion() is not called with that command.
> > 
> > I am not sure what is happening, and you are the best person to understand
> > what is happening. :)
> > 
> > But for now (untill you are back from holiday and suggest a proper 
> > solution),
> > I made a hacky patch (attached) which is working and I donot get any
> > corruption after that. Both KASAN and slub debug are also happy.
> > 
> > So, now waiting for you to analyze what is going on and suggest a proper
> > fix.
> > 
> > Thanks in advance.
> > 
> > --
> > Regards
> > Sudip
> 
> > [  236.814156] 
> > ==
> > [  236.814187] BUG: KASAN: use-after-free in xhci_trb_virt_to_dma+0x2e/0x74 
> > [xhci_hcd]
> > [  236.814193] Read of size 8 at addr 8800789329c8 by task weston/138
> > 
> > [  236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U  W  O
> > 4.14.47-20180606+ #7
> > [  236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017
> > [  236.814209] Call Trace:
> > [  236.814214]  
> > [  236.814226]  dump_stack+0x46/0x59
> > [  236.814238]  print_address_description+0x6b/0x23b
> > [  236.814255]  ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
> > [  236.814262]  kasan_report+0x220/0x246
> > [  236.814278]  xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
> > [  236.814294]  trb_in_td+0x3b/0x1cd [xhci_hcd]
> > [  236.814311]  handle_cmd_completion+0x1181/0x2c9b [xhci_hcd]
> > [  236.814329]  ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd]
> > [  236.814337]  ? drm_handle_vblank+0x4ec/0x590
> > [  236.814352]  xhci_irq+0x529/0x3294 [xhci_hcd]
> > [  236.814362]  ? __accumulate_pelt_segments+0x24/0x33
> > [  236.814378]  ? finish_td.isra.40+0x223/0x223 [xhci_hcd]
> > [  236.814384]  ? __accumulate_pelt_segments+0x24/0x33
> > [  236.814390]  ? __accumulate_pelt_segments+0x24/0x33
> > [  236.814405]  ? xhci_irq+0x3294/0x3294 [xhci_hcd]
> > [  236.814412]  __handle_irq_event_percpu+0x149/0x3db
> > [  236.814421]  handle_irq_event_percpu+0x65/0x109
> > [  236.814428]  ? __handle_irq_event_percpu+0x3db/0x3db
> > [  236.814436]  ? ttwu_do_wakeup.isra.18+0x3a2/0x3ce
> > [  236.814442]  handle_irq_event+0xa8/0x10a
> > [  236.814449]  handle_edge_irq+0x4b2/0x538
> > [  236.814458]  handle_irq+0x3e/0x45
> > [  236.814465]  do_IRQ+0x5c/0x126
> > [  236.814474]  common_interrupt+0x7a/0x7a
> > [  236.814478]  
> > [  236.814483] RIP: 0023:0xf79d3d82
> > [  236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: 
> > ffdc
> > [  236.814493] RAX:  RBX: f7bebd5c RCX: 
> > 
> > [  236.814496] RDX: 08d4197c RSI:  RDI: 
> > f746c020
> > [  236.814499] RBP: ffc588e8 R08:  R09: 
> > 
> > [  236.814503] R10:  R11: 00200206 R12: 
> > 
> > [  236.814506] R13:  R14: 

Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
Hi Mathias,

On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote:
> Hi Mathias,
> 
> On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote:
> > On 27.06.2018 14:59, Sudip Mukherjee wrote:
> > > > > Can you share a bit more details on the platform you are using, and 
> > > > > what types of test you are running.
> > > > 

> Then to track what is going on, I added the slub debugging and :(
> I have attached part of dmesg for you to check.
> Will appreciate your help in finding out the problem.

I did some more debugging. Tested with a KASAN enabled kernel and that
shows the problem. The report is attached.

To my understanding:

btusb_work() is calling usb_set_interface() with alternate = 0. which
again calls usb_hcd_alloc_bandwidth() and that frees the rings by
xhci_free_endpoint_ring(). But then usb_set_interface() continues and
calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()->
xhci_urb_dequeue() which at the end gives the command to stop endpoint.

In all the cycles I have tested I see that only in the fail case
handle_cmd_completion() gets called, but in the cycles where the error
is not there handle_cmd_completion() is not called with that command.

I am not sure what is happening, and you are the best person to understand
what is happening. :)

But for now (untill you are back from holiday and suggest a proper solution),
I made a hacky patch (attached) which is working and I donot get any
corruption after that. Both KASAN and slub debug are also happy.

So, now waiting for you to analyze what is going on and suggest a proper
fix.

Thanks in advance.

--
Regards
Sudip
[  236.814156] 
==
[  236.814187] BUG: KASAN: use-after-free in xhci_trb_virt_to_dma+0x2e/0x74 
[xhci_hcd]
[  236.814193] Read of size 8 at addr 8800789329c8 by task weston/138

[  236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U  W  O
4.14.47-20180606+ #7
[  236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017
[  236.814209] Call Trace:
[  236.814214]  
[  236.814226]  dump_stack+0x46/0x59
[  236.814238]  print_address_description+0x6b/0x23b
[  236.814255]  ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
[  236.814262]  kasan_report+0x220/0x246
[  236.814278]  xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
[  236.814294]  trb_in_td+0x3b/0x1cd [xhci_hcd]
[  236.814311]  handle_cmd_completion+0x1181/0x2c9b [xhci_hcd]
[  236.814329]  ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd]
[  236.814337]  ? drm_handle_vblank+0x4ec/0x590
[  236.814352]  xhci_irq+0x529/0x3294 [xhci_hcd]
[  236.814362]  ? __accumulate_pelt_segments+0x24/0x33
[  236.814378]  ? finish_td.isra.40+0x223/0x223 [xhci_hcd]
[  236.814384]  ? __accumulate_pelt_segments+0x24/0x33
[  236.814390]  ? __accumulate_pelt_segments+0x24/0x33
[  236.814405]  ? xhci_irq+0x3294/0x3294 [xhci_hcd]
[  236.814412]  __handle_irq_event_percpu+0x149/0x3db
[  236.814421]  handle_irq_event_percpu+0x65/0x109
[  236.814428]  ? __handle_irq_event_percpu+0x3db/0x3db
[  236.814436]  ? ttwu_do_wakeup.isra.18+0x3a2/0x3ce
[  236.814442]  handle_irq_event+0xa8/0x10a
[  236.814449]  handle_edge_irq+0x4b2/0x538
[  236.814458]  handle_irq+0x3e/0x45
[  236.814465]  do_IRQ+0x5c/0x126
[  236.814474]  common_interrupt+0x7a/0x7a
[  236.814478]  
[  236.814483] RIP: 0023:0xf79d3d82
[  236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: 
ffdc
[  236.814493] RAX:  RBX: f7bebd5c RCX: 
[  236.814496] RDX: 08d4197c RSI:  RDI: f746c020
[  236.814499] RBP: ffc588e8 R08:  R09: 
[  236.814503] R10:  R11: 00200206 R12: 
[  236.814506] R13:  R14:  R15: 

[  236.814513] Allocated by task 2082:
[  236.814521]  kasan_kmalloc.part.1+0x51/0xc7
[  236.814526]  kmem_cache_alloc_trace+0x178/0x187
[  236.814540]  xhci_segment_alloc.isra.11+0x9d/0x3bf [xhci_hcd]
[  236.814553]  xhci_alloc_segments_for_ring+0x9e/0x176 [xhci_hcd]
[  236.814566]  xhci_ring_alloc.constprop.16+0x197/0x4ba [xhci_hcd]
[  236.814579]  xhci_endpoint_init+0x77a/0x9ba [xhci_hcd]
[  236.814592]  xhci_add_endpoint+0x3bc/0x43b [xhci_hcd]
[  236.814615]  usb_hcd_alloc_bandwidth+0x7ef/0x857 [usbcore]
[  236.814637]  usb_set_interface+0x294/0x681 [usbcore]
[  236.814645]  btusb_work+0x2e6/0x981 [btusb]
[  236.814651]  process_one_work+0x579/0x9e9
[  236.814656]  worker_thread+0x68f/0x804
[  236.814662]  kthread+0x31c/0x32b
[  236.814668]  ret_from_fork+0x35/0x40

[  236.814672] Freed by task 1533:
[  236.814678]  kasan_slab_free+0xb3/0x15e
[  236.814683]  kfree+0x103/0x1a9
[  236.814696]  xhci_ring_free+0x205/0x286 [xhci_hcd]
[  236.814709]  xhci_free_endpoint_ring+0x4d/0x83 [xhci_hcd]
[  236.814722]  xhci_check_bandwidth+0x57b/0x

Re: usb HC busted?

2018-06-30 Thread Sudip Mukherjee
Hi Mathias,

On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote:
> On 27.06.2018 14:59, Sudip Mukherjee wrote:
> > > > Can you share a bit more details on the platform you are using, and 
> > > > what types of test you are running.
> > > 
> > > It is a board based on "Intel(R) Atom(TM) CPU  E3840  @ 1.91GHz".
> > > The usb device in question is a bluetooth device:
> > > 

> > 
> 
> There is however freeing of the same dma address:
> 
> <...>-28448 [003]    492.025808: xhci_ring_free: ISOC f1ffb700: enq 
> 0x2d31bcc0(0x2d31b000) deq 
> 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 305 bounce 
> 17 cycle 0
> <...>-28448 [003]    492.025818: xhci_ring_mem_detail: MATTU xhci segment 
> free seg->dma @ 0x2d31b000
> <...>-28448 [003]    492.025823: xhci_ring_mem_detail: MATTU xhci segment 
> free seg->dma @ 0x2d31b000
> <...>-28448 [003]    492.025826: xhci_ring_free: ISOC f1f9b380: enq 
> 0x2d31b140(0x2d31b000) deq 
> 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 489 bounce 
> 17 cycle 1
> <...>-28448 [003]    492.025828: xhci_ring_mem_detail: MATTU xhci segment 
> free seg->dma @ 0x2d31b000
> <...>-28448 [003]    492.025830: xhci_ring_mem_detail: MATTU xhci segment 
> free seg->dma @ 0x2d31b000
> 
> I'd guess it's still the same cause, maybe trace is not complete?

It is either mutiple freeing of the same address or mutiple allocation
of the same address or a combination of both. To track the mutiple
allocation I added few extra debugging and it seems that the mutiple
allocation is only happening when someone accesses that memory and
makes the first 4 bytes (which holds the offset data) as 0. I have not
yet checked in what condition does it try to free the same address more
than once.
Then to track what is going on, I added the slub debugging and :(
I have attached part of dmesg for you to check.
Will appreciate your help in finding out the problem.


--
Regards
Sudip
[  383.096204] 
=
[  383.096212] BUG kmalloc-96 (Tainted: G U O   ): Poison overwritten
[  383.096213] 
-

[  383.096215] Disabling lock debugging due to kernel taint
[  383.096218] INFO: 0xdccd1b78-0xdccd1b7f. First byte 0x78 instead of 0x6b
[  383.096232] INFO: Allocated in xhci_ring_alloc.constprop.14+0x31/0x125 
[xhci_hcd] age=227516 cpu=2 pid=21
[  383.096240]  ___slab_alloc.constprop.24+0x1fc/0x292
[  383.096243]  __slab_alloc.isra.18.constprop.23+0x1c/0x25
[  383.096246]  kmem_cache_alloc_trace+0x78/0x141
[  383.096252]  xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd]
[  383.096259]  xhci_endpoint_init+0x25f/0x30a [xhci_hcd]
[  383.096265]  xhci_add_endpoint+0x126/0x149 [xhci_hcd]
[  383.096276]  usb_hcd_alloc_bandwidth+0x26a/0x2a0 [usbcore]
[  383.096287]  usb_set_interface+0xeb/0x25d [usbcore]
[  383.096292]  btusb_work+0xeb/0x324 [btusb]
[  383.096296]  process_one_work+0x163/0x2b2
[  383.096299]  worker_thread+0x1a9/0x25c
[  383.096301]  kthread+0xf8/0xfd
[  383.096306]  ret_from_fork+0x2e/0x38
[  383.096314] INFO: Freed in xhci_ring_free+0xa7/0xc6 [xhci_hcd] age=197020 
cpu=0 pid=324
[  383.096317]  __slab_free+0x4b/0x27a
[  383.096319]  kfree+0x12e/0x155
[  383.096325]  xhci_ring_free+0xa7/0xc6 [xhci_hcd]
[  383.096331]  xhci_free_endpoint_ring+0x16/0x20 [xhci_hcd]
[  383.096338]  xhci_check_bandwidth+0x1bf/0x20e [xhci_hcd]
[  383.096348]  usb_hcd_alloc_bandwidth+0x205/0x2a0 [usbcore]
[  383.096358]  usb_set_interface+0xeb/0x25d [usbcore]
[  383.096361]  btusb_work+0x228/0x324 [btusb]
[  383.096364]  process_one_work+0x163/0x2b2
[  383.096367]  worker_thread+0x1a9/0x25c
[  383.096370]  kthread+0xf8/0xfd
[  383.096373]  ret_from_fork+0x2e/0x38
[  383.096376] INFO: Slab 0xf457e080 objects=29 used=29 fp=0x  (null) 
flags=0x40008100
[  383.096379] INFO: Object 0xdccd1b60 @offset=7008 fp=0xdccd0350

[  383.096383] Redzone dccd1b58: bb bb bb bb bb bb bb bb
  
[  383.096386] Object dccd1b60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  383.096388] Object dccd1b70: 6b 6b 6b 6b 6b 6b 6b 6b 78 1b cd dc 78 1b cd dc 
 x...x...
[  383.096390] Object dccd1b80: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  383.096393] Object dccd1b90: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  383.096395] Object dccd1ba0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  383.096397] Object dccd1bb0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 
 kkk.
[  383.096400] Redzone dccd1bc0: bb bb bb bb 

Re: usb HC busted?

2018-06-27 Thread Sudip Mukherjee


On Wed, Jun 27, 2018 at 12:59:48PM +0100, Sudip Mukherjee wrote:
> Hi Mathias,
> 
> On Mon, Jun 25, 2018 at 05:15:00PM +0100, Sudip Mukherjee wrote:
> > Hi Mathias,
> > 
> > On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote:
> > > On 21.06.2018 03:53, Sudip Mukherjee wrote:
> > > > Hi Mathias, Andy,
> > > > 
> > > > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote:
> > > > > On 06.06.2018 19:45, Sudip Mukherjee wrote:
> 
> > > 
> > > Can you share a bit more details on the platform you are using, and what 
> > > types of test you are running.
> > 
> > Sorry for the delayed reply, I was in Tokyo for the OSS.
> > 
> > It is a board based on "Intel(R) Atom(TM) CPU  E3840  @ 1.91GHz".
> > The usb device in question is a bluetooth device:
> > 
> > Bus 001 Device 012: ID 8087:07dc Intel Corp.
> 
> > 
> > And the problem that we are seeing is with phone calls via bluetooth.
> > 
> > > Does my test above trigger the case? (show "MATTU dmatest match!")
> > 
> > I have kept it for tonight, will see the results tomorrow morning.
> > And I am using that same device in the usb script to change "authrized".
> 
> No, your test did not trigger the error. :(
> 
> But, my last night's test (with an added debug to get some extra trace for 
> addresses) showed the same error of -
> "Looking for event-dma", but looking at the ftrace, I could not see it 
> getting same address from dma_pool_zalloc().
> 
> Can you please have a look at the dmesg and ftrace at:
> https://drive.google.com/open?id=1nMy_qVxOQzcZNYa9bw7az9WiS2MZzdKo

And to add to my previous mail, in another cycle where I do see the
same problem and my extra debugs give the following:

   <...>-23974 [002]    495.991276: xhci_ring_mem_detail: MATTU 
xhci_segment_alloc dma @ 0x2d21c000
   <...>-23974 [002]    495.991285: xhci_ring_mem_detail: SUDIP 
page details dma=0x2d21c000, vaddr=ed21c000, inuse=1, offset=0
   <...>-23974 [002]    495.991289: xhci_ring_mem_detail: MATTU 
xhci_segment_alloc dma @ 0x2d21c000
   <...>-23974 [002]    495.991292: xhci_ring_mem_detail: SUDIP 
page details dma=0x2d21c000, vaddr=ed21c000, inuse=2, offset=0
   <...>-23974 [002]    495.991295: xhci_ring_alloc: ISOC f0b62900: 
enq 0x2d21c000(0x2d21c000) deq 
0x2d21c000(0x2d21c000) segs 2 stream 0 free_trbs 509 bounce 17 
cycle 1
   <...>-23974 [002]    495.991298: xhci_ring_mem_detail: MATTU 
xhci_segment_alloc dma @ 0x2d21c000
   <...>-23974 [002]    495.991301: xhci_ring_mem_detail: SUDIP 
page details dma=0x2d21c000, vaddr=ed21c000, inuse=3, offset=0
   <...>-23974 [002]    495.991304: xhci_ring_mem_detail: MATTU 
xhci_segment_alloc dma @ 0x2d21c000
   <...>-23974 [002]    495.991306: xhci_ring_mem_detail: SUDIP 
page details dma=0x2d21c000, vaddr=ed21c000, inuse=4, offset=0


I am totally lost now. Are we looking at two different issues?
This log shows same addresses, my previous mail and log did not show
the same addresses. :(


--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-06-27 Thread Sudip Mukherjee
Hi Mathias,

On Mon, Jun 25, 2018 at 05:15:00PM +0100, Sudip Mukherjee wrote:
> Hi Mathias,
> 
> On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote:
> > On 21.06.2018 03:53, Sudip Mukherjee wrote:
> > > Hi Mathias, Andy,
> > > 
> > > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote:
> > > > On 06.06.2018 19:45, Sudip Mukherjee wrote:

> > 
> > Can you share a bit more details on the platform you are using, and what 
> > types of test you are running.
> 
> Sorry for the delayed reply, I was in Tokyo for the OSS.
> 
> It is a board based on "Intel(R) Atom(TM) CPU  E3840  @ 1.91GHz".
> The usb device in question is a bluetooth device:
> 
> Bus 001 Device 012: ID 8087:07dc Intel Corp.

> 
> And the problem that we are seeing is with phone calls via bluetooth.
> 
> > Does my test above trigger the case? (show "MATTU dmatest match!")
> 
> I have kept it for tonight, will see the results tomorrow morning.
> And I am using that same device in the usb script to change "authrized".

No, your test did not trigger the error. :(

But, my last night's test (with an added debug to get some extra trace for 
addresses) showed the same error of -
"Looking for event-dma", but looking at the ftrace, I could not see it getting 
same address from dma_pool_zalloc().

Can you please have a look at the dmesg and ftrace at:
https://drive.google.com/open?id=1nMy_qVxOQzcZNYa9bw7az9WiS2MZzdKo


--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-06-25 Thread Sudip Mukherjee
Hi Mathias,

On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote:
> On 21.06.2018 03:53, Sudip Mukherjee wrote:
> > Hi Mathias, Andy,
> > 
> > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote:
> > > On 06.06.2018 19:45, Sudip Mukherjee wrote:

> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git dmapool-test
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=dmapool-test
> 
> Tested by just leaving the following running for a few days:
> 
> while true; do echo 0 > authorized; sleep 3; echo 1 > authorized; sleep 3; 
> done;
> For some usb device (for example: /sys/bus/usb/devices/1-8)
> 
> Then grep logs for "MATTU dmatest match! "
> 
> Can you share a bit more details on the platform you are using, and what 
> types of test you are running.

Sorry for the delayed reply, I was in Tokyo for the OSS.

It is a board based on "Intel(R) Atom(TM) CPU  E3840  @ 1.91GHz".
The usb device in question is a bluetooth device:

Bus 001 Device 012: ID 8087:07dc Intel Corp.
Device Descriptor:
  bLength18
  bDescriptorType 1
  bcdUSB   2.00
  bDeviceClass  224 Wireless
  bDeviceSubClass 1 Radio Frequency
  bDeviceProtocol 1 Bluetooth
  bMaxPacketSize064
  idVendor   0x8087 Intel Corp.
  idProduct  0x07dc
  bcdDevice0.01
  iManufacturer   0
  iProduct0
  iSerial 0
  bNumConfigurations  1

And the problem that we are seeing is with phone calls via bluetooth.

> Does my test above trigger the case? (show "MATTU dmatest match!")

I have kept it for tonight, will see the results tomorrow morning.
And I am using that same device in the usb script to change "authrized".

But looking at the code for dma_pool_alloc(), it seems 'dma' can have
same value again only if "*(int *)(page->vaddr + offset)" gets a value
of 0 in pool_initialise_page(). But I can't think of anyway how it
can be 0. I have also added some more debugs in the kernel to see what
might be going wrong there.

--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-06-20 Thread Sudip Mukherjee
Hi Mathias, Andy,

On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote:
> On 06.06.2018 19:45, Sudip Mukherjee wrote:
> > Hi Andy,
> > 
> > And we meet again. :)
> > 
> > On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote:
> > > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote:
> > > > On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > > > > 
> > > 
> > > > Odd and unlikely, but to me this looks like some issue in allocating
> > > > dma memory
> > > > from pool using dma_pool_zalloc()
> > > > 
> > > > 
> > > > Here's the story:
> > > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel.
> > > > All tracing points to dma_pool_zalloc() returning the same dma address
> > > > block on
> > > > consecutive calls.
> > > > 
> > > > In the failing case dma_pool_zalloc() is called 3 - 6us apart.
> > > > 
> > > > <...>-26362 [002]   1186.756739: xhci_ring_mem_detail: MATTU
> > > > xhci_segment_alloc dma @ 0x2d92b000
> > > > <...>-26362 [002]   1186.756745: xhci_ring_mem_detail: MATTU
> > > > xhci_segment_alloc dma @ 0x2d92b000
> > > > <...>-26362 [002]   1186.756748: xhci_ring_mem_detail: MATTU
> > > > xhci_segment_alloc dma @ 0x2d92b000
> > > > 
> > > > dma_pool_zalloc() is called from xhci_segment_alloc() in
> > > > drivers/usb/host/xhci-mem.c
> > > > see:
> > > > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-
> > > > mem.c#L52
> > > > 
> > > > prints above are custom traces added right after dma_pool_zalloc()
> > > 
> > > For better understanding it would be good to have dma_pool_free() calls
> > > debugged as well.
> > 
> 
> Sudip has a full (394M unpacked) trace at:
> https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx-
> 



> But then it gets stuck, for the whole ring2 dma_pool_zalloc() just returns 
> the same dma address as the last segment for
> ring1:0x2d92b000. Last part of trace snippet is just another ring being freed.

A gentle ping on this. Any idea on what the problem might be and any
possible fix?

--
regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-06-08 Thread Sudip Mukherjee
Hi All,

On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote:
> On 06.06.2018 19:45, Sudip Mukherjee wrote:
> > Hi Andy,
> > 
> > And we meet again. :)
> > 
> > On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote:
> > > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote:
> > > > On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > > > > 
> > > 
> > > > Odd and unlikely, but to me this looks like some issue in allocating
> > > > dma memory
> > > > from pool using dma_pool_zalloc()
> > > > 
> > > > Adding people with DMA knowledge to cc, maybe someone knows what is
> > > > going on.
> > > > 
> > > > Here's the story:
> > > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel.
> > > > All tracing points to dma_pool_zalloc() returning the same dma address
> > > > block on
> > > > consecutive calls.

We have started testing with v4.14.47 now and we are seeing the issue
with it also. :(

--
Regards
Sudip

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-06-06 Thread Sudip Mukherjee
Hi Andy,

And we meet again. :)

On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote:
> On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote:
> > On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > > 
> 
> > Odd and unlikely, but to me this looks like some issue in allocating
> > dma memory
> > from pool using dma_pool_zalloc()
> > 
> > Adding people with DMA knowledge to cc, maybe someone knows what is
> > going on.
> > 
> > Here's the story:
> > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel.
> > All tracing points to dma_pool_zalloc() returning the same dma address
> > block on
> > consecutive calls.
> > 
> > In the failing case dma_pool_zalloc() is called 3 - 6us apart.
> > 
> > <...>-26362 [002]   1186.756739: xhci_ring_mem_detail: MATTU
> > xhci_segment_alloc dma @ 0x2d92b000
> > <...>-26362 [002]   1186.756745: xhci_ring_mem_detail: MATTU
> > xhci_segment_alloc dma @ 0x2d92b000
> > <...>-26362 [002]   1186.756748: xhci_ring_mem_detail: MATTU
> > xhci_segment_alloc dma @ 0x2d92b000
> > 
> > dma_pool_zalloc() is called from xhci_segment_alloc() in
> > drivers/usb/host/xhci-mem.c
> > see:
> > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-
> > mem.c#L52
> > 
> > prints above are custom traces added right after dma_pool_zalloc()
> 
> For better understanding it would be good to have dma_pool_free() calls
> debugged as well.

So, I am adding another trace event for dma_pool_free() and continuing
with the test. Is there anything else that I should be adding as debug?

--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-06-06 Thread Sudip Mukherjee
On Wed, Jun 06, 2018 at 05:12:21PM +0300, Mathias Nyman wrote:
> On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > 

> > 
> > Will request you to have a look at it.
> > 
> 
> Odd and unlikely, but to me this looks like some issue in allocating dma 
> memory
> from pool using dma_pool_zalloc()
> 
> Adding people with DMA knowledge to cc, maybe someone knows what is going on.

Thanks Mathias.

--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] iommu/vt-d: fix memory leak

2015-09-21 Thread Sudip Mukherjee
We are returning NULL if we are not able to attach the iommu to the
domain but while returning we missed freeing info.

Signed-off-by: Sudip Mukherjee <su...@vectorindia.org>
---
 drivers/iommu/intel-iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 8a74fcc..668acf7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2305,6 +2305,7 @@ static struct dmar_domain 
*dmar_insert_one_dev_info(struct intel_iommu *iommu,
 
if (ret) {
spin_unlock_irqrestore(_domain_lock, flags);
+   free_devinfo_mem(info);
return NULL;
}
 
-- 
1.9.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu