[PATCH] Add USB quirk for HVR-950q to avoid intermittent device resets

2017-06-27 Thread Devin Heitmueller
The USB core and sysfs will attempt to enumerate certain parameters
which are unsupported by the au0828 - causing inconsistent behavior
and sometimes causing the chip to reset.  Avoid making these calls.

This problem manifested as intermittent cases where the au8522 would
be reset on analog video startup, in particular when starting up ALSA
audio streaming in parallel - the sysfs entries created by
snd-usb-audio on streaming startup would result in unsupported control
messages being sent during tuning which would put the chip into an
unknown state.

Signed-off-by: Devin Heitmueller 
---
 drivers/usb/core/quirks.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/usb/core/quirks.c b/drivers/usb/core/quirks.c
index 96b21b0..3116edf 100644
--- a/drivers/usb/core/quirks.c
+++ b/drivers/usb/core/quirks.c
@@ -223,6 +223,10 @@
/* Blackmagic Design UltraStudio SDI */
{ USB_DEVICE(0x1edb, 0xbd4f), .driver_info = USB_QUIRK_NO_LPM },
 
+   /* Hauppauge HVR-950q */
+   { USB_DEVICE(0x2040, 0x7200), .driver_info =
+   USB_QUIRK_CONFIG_INTF_STRINGS },
+
/* INTEL VALUE SSD */
{ USB_DEVICE(0x8086, 0xf1a5), .driver_info = USB_QUIRK_RESET_RESUME },
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/12] Add USB quirk for HVR-950q to avoid intermittent device resets

2017-06-19 Thread Devin Heitmueller
> I've accepted the other patches in this patch series for the media subsystem,
> but this patch should go through the USB subsystem. Cc-ed linux-usb.
>
> Acked-by: Hans Verkuil 



I'm not sure who on the linux-usb mailing list would need to deal with
this, but would be great if we could get this merged.

Thanks,

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] em28xx TerraTec Cinergy Hybrid T USB XS with demodulator MT352 is not detect by em28xx

2017-06-11 Thread Devin Heitmueller
> It's been a long while since I've touched EM28xx stuff, but if I
> remember correctly there should be an EEPROM byte which identifies the
> demodulator used. There's some EEPROM hashing code in the driver but
> it's only used for devices with generic (non-manufacturer
> specific/default) USB IDs.

You would probably need to see the eeprom dump from both devices to
see if they are actually different.  Some vendors actually have a
field in the eeprom which indicates what demodulator and tuner model
is present (Hauppauge does this), but there are other strategies I've
seen used such as:

- having a GPIO input which acts as a strapping option (i.e. a
resistor gets populated which indicates a different tuner/demod, and
hence you have to read the GPIO value to know which board revision it
is).

- relying on info stored in the eeprom (see above)

- having the driver probe for one of the demodulators, and if the
probing fails falling back and probing for the other demod.  In some
cases it's just working off the I2C address, in other cases it probes
specific registers for values like a chip ID.

You would probably have to have both board revisions in your
possession so you could do some comparison.  Alternatively if you know
two people who have the boards you could potentially get a USB trace
under Windows and determine the heuristic used based on the traffic.

For those of you who might be asking, "Why the hell didn't the
manufacturer just give the new board a different USB ID?",
unfortunately the answer tends to be because they wanted to quietly
change the hardware design without having to go through a second round
of WHQL certification.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: xhci_hcd error Transfer event TRB DMA ptr not part of current TD ep_index 8 comp_code 1

2015-03-23 Thread Devin Heitmueller
> At the moment I only have access to laptops, not a desktop,
> so the EHCI testing was on a different machine, with the Ubuntu
> 3.13 kernel.  If you'd like me to test on a more recent kernel,
> please let me know.

Ok.  That's a useful data point.

>> It's also worth mentioning that the process you're exercising isn't
>> just the code path for device insertion.  The udev process and
>> PulseAudio will both try to connect to the device as soon as the
>> underlying device nodes appear, so it's possible there is some race
>> condition where the device is being accessed and registers are being
>> poked before initialization is complete.  I cannot say for certain
>> this would be an issue with cx231xx, but I've definitely seen it with
>> other V4L2 drivers.  That might also explain why you see different
>> behavior at boot - the driver loads early enough in the boot process
>> such that initialization completes before processes like udev and
>> pulseaudio get a chance to interact with it.
>
> Good point, although the problems occur when the device is present
> during boot.  It works fine if it is plugged in after boot.  However it
> does appear to be some sort of timing / race condition.
>
> Any other suggestions?

My guess would be that there's some bug in the cx231xx that
exacerbates an edge case in the XHCI core - like prematurely setting
the USB alternate back to zero when stopping streaming and not
canceling all the URBs first.

> Do you know if the cx23102 datasheet is available?  All I've been able
> to find is a two page sheet with a block diagram.

It is not publicly available.  I have it under NDA.  That said, I
suspect we're not dealing with anything that is really specific to the
cx231xx chip, but rather just some general bug in the URB handling in
the cx231xx driver.

As a test, you may wish to try disabling the cx231xx-audio module.
That will help narrow down an interaction with ALSA, pulseaudio, and
one of the two sources of URB queueing.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: xhci_hcd error Transfer event TRB DMA ptr not part of current TD ep_index 8 comp_code 1

2015-03-23 Thread Devin Heitmueller
Hi Alistair,

> There are some small differences in packet ordering, however the first
> major difference is an isochronous in packet where the Live2 returns
> "URB status: No such file or directory (-ENOENT) (-2)".
>
> Devin, I'm try to learn a bit about USB and the Live2 protocol, however I'm
> not sure how far I'll get.  Are you able to offer any hints on what to do
> next?

I'm sorry if I'm asking something already answered in your original
description of the problems - but have you confirmed the device works
properly on a system with a standard EHCI controller (as opposed to
XHCI)?

The reason I ask is because that the cx231xx driver is pretty unstable
in general, and I'm wondering if perhaps you're just hitting some
issue completely unrelated to the recent XHCI problems (which
obviously needed to be addressed in order for you to get this far into
testing).

If you haven't tried it yet on a standard ECHI controller because you
don't have one in your PC, it might be worth spending the $20 to buy a
PCI card with an USB 2.0 host controller on it for testing.

It's also worth mentioning that the process you're exercising isn't
just the code path for device insertion.  The udev process and
PulseAudio will both try to connect to the device as soon as the
underlying device nodes appear, so it's possible there is some race
condition where the device is being accessed and registers are being
poked before initialization is complete.  I cannot say for certain
this would be an issue with cx231xx, but I've definitely seen it with
other V4L2 drivers.  That might also explain why you see different
behavior at boot - the driver loads early enough in the boot process
such that initialization completes before processes like udev and
pulseaudio get a chance to interact with it.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/1] usb: xhci: apply XHCI_AVOID_BEI quirk to all Intel xHCI controllers

2015-03-12 Thread Devin Heitmueller
On Thu, Mar 12, 2015 at 6:15 AM, Lu Baolu  wrote:
> When a device with an isochronous endpoint is plugged into the Intel
> xHCI host controller, and the driver submits multiple frames per URB,
> the xHCI driver will set the Block Event Interrupt (BEI) flag on all
> but the last TD for the URB. This causes the host controller to place
> an event on the event ring, but not send an interrupt. When the last
> TD for the URB completes, BEI is cleared, and we get an interrupt for
> the whole URB.
>
> However, under Intel xHCI host controllers, if the event ring is full
> of events from transfers with BEI set,  an "Event Ring is Full" event
> will be posted to the last entry of the event ring,  but no interrupt
> is generated. Host will cease all transfer and command executions and
> wait until software completes handling the pending events in the event
> ring.  That means xHC stops, but event of "event ring is full" is not
> notified. As the result, the xHC looks like dead to user.
>
> This patch is to apply XHCI_AVOID_BEI quirk to Intel xHC devices. And
> it should be backported to kernels as old as 3.0, that contains the
> commit 69e848c2090a ("Intel xhci: Support EHCI/xHCI port switching.").
>
> Signed-off-by: Lu Baolu 
> Cc: sta...@vger.kernel.org
> ---
>  drivers/usb/host/xhci-pci.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
> index fd53c9e..2af32e2 100644
> --- a/drivers/usb/host/xhci-pci.c
> +++ b/drivers/usb/host/xhci-pci.c
> @@ -115,6 +115,7 @@ static void xhci_pci_quirks(struct device *dev, struct 
> xhci_hcd *xhci)
> if (pdev->vendor == PCI_VENDOR_ID_INTEL) {
> xhci->quirks |= XHCI_LPM_SUPPORT;
> xhci->quirks |= XHCI_INTEL_HOST;
> +   xhci->quirks |= XHCI_AVOID_BEI;
> }
> if (pdev->vendor == PCI_VENDOR_ID_INTEL &&
> pdev->device == 
> PCI_DEVICE_ID_INTEL_PANTHERPOINT_XHCI) {
> @@ -130,7 +131,6 @@ static void xhci_pci_quirks(struct device *dev, struct 
> xhci_hcd *xhci)
>  * PPT chipsets.
>  */
> xhci->quirks |= XHCI_SPURIOUS_REBOOT;
> -   xhci->quirks |= XHCI_AVOID_BEI;
> }
> if (pdev->vendor == PCI_VENDOR_ID_INTEL &&
> pdev->device == PCI_DEVICE_ID_INTEL_LYNXPOINT_LP_XHCI) {
> --
> 2.1.0
>

Looks good for me to (tested with an HVR-950q on a 2013 Macbook Pro (Haswell).

Tested by: Devin Heitmueller 

Thanks for your help,

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Control message failures kill entire XHCI stack

2015-03-11 Thread Devin Heitmueller
On Tue, Mar 10, 2015 at 8:34 PM, Lu, Baolu  wrote:
> Devin, I'd like you to provide output of "lspci -xv". I want to know which
> xHC devices (PCI vendor and device ID) you are working with.

Sorry, I misread your email.  The lspci output is below.

Devin

00:00.0 Host bridge: Intel Corporation Haswell-ULT DRAM Controller (rev 09)
Subsystem: Apple Inc. Device 011a
Flags: bus master, fast devsel, latency 0
Capabilities: [e0] Vendor Specific Information: Len=0c 
00: 86 80 04 0a 06 00 90 20 09 00 00 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 6b 10 1a 01
30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00

00:02.0 VGA compatible controller: Intel Corporation Device 0a2e (rev
09) (prog-if 00 [VGA controller])
Subsystem: Apple Inc. Device 011a
Flags: bus master, fast devsel, latency 0, IRQ 51
Memory at b000 (64-bit, non-prefetchable) [size=4M]
Memory at a000 (64-bit, prefetchable) [size=256M]
I/O ports at 2000 [size=64]
Expansion ROM at  [disabled]
Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [d0] Power Management version 2
Capabilities: [a4] PCI Advanced Features
Kernel driver in use: i915
00: 86 80 2e 0a 07 04 90 00 09 00 00 03 00 00 00 00
10: 04 00 00 b0 00 00 00 00 0c 00 00 a0 00 00 00 00
20: 01 20 00 00 00 00 00 00 00 00 00 00 6b 10 1a 01
30: 00 00 00 00 90 00 00 00 00 00 00 00 00 01 00 00

00:03.0 Audio device: Intel Corporation Haswell-ULT HD Audio Controller (rev 09)
Subsystem: Apple Inc. Device 011a
Flags: bus master, fast devsel, latency 0, IRQ 52
Memory at b0a1 (64-bit, non-prefetchable) [size=16K]
Capabilities: [50] Power Management version 2
Capabilities: [60] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
Kernel driver in use: snd_hda_intel
00: 86 80 0c 0a 06 04 10 00 09 00 03 04 40 00 00 00
10: 04 00 a1 b0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 6b 10 1a 01
30: 00 00 00 00 50 00 00 00 00 00 00 00 00 01 00 00

00:14.0 USB controller: Intel Corporation Lynx Point-LP USB xHCI HC
(rev 04) (prog-if 30 [XHCI])
Subsystem: Intel Corporation Device 7270
Flags: bus master, medium devsel, latency 0, IRQ 49
Memory at b0a0 (64-bit, non-prefetchable) [size=64K]
Capabilities: [70] Power Management version 2
Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+
Kernel driver in use: xhci_hcd
00: 86 80 31 9c 06 04 90 02 04 30 03 0c 00 00 00 00
10: 04 00 a0 b0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 70 72
30: 00 00 00 00 70 00 00 00 00 00 00 00 00 01 00 00

00:16.0 Communication controller: Intel Corporation Lynx Point-LP HECI
#0 (rev 04)
Subsystem: Intel Corporation Device 7270
Flags: bus master, fast devsel, latency 0, IRQ 48
Memory at b0a2a100 (64-bit, non-prefetchable) [size=32]
Capabilities: [50] Power Management version 3
Capabilities: [8c] MSI: Enable+ Count=1/1 Maskable- 64bit+
Kernel driver in use: mei_me
00: 86 80 3a 9c 06 04 10 00 04 00 80 07 00 00 80 00
10: 04 a1 a2 b0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 70 72
30: 00 00 00 00 50 00 00 00 00 00 00 00 00 01 00 00

00:1b.0 Audio device: Intel Corporation Lynx Point-LP HD Audio
Controller (rev 04)
Subsystem: Intel Corporation Device 7270
Flags: bus master, fast devsel, latency 0, IRQ 50
Memory at b0a14000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [50] Power Management version 3
Capabilities: [60] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Kernel driver in use: snd_hda_intel
00: 86 80 20 9c 06 04 10 00 04 00 03 04 40 00 00 00
10: 04 40 a1 b0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 70 72
30: 00 00 00 00 50 00 00 00 00 00 00 00 00 01 00 00

00:1c.0 PCI bridge: Intel Corporation Lynx Point-LP PCI Express Root
Port 1 (rev e4) (prog-if 00 [Normal decode])
Flags: fast devsel
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
Capabilities: [40] Express Root Port (Slot-), MSI 00
Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
Capabilities: [90] Subsystem: Intel Corporation Device 7270
Capabilities: [a0] Power Management version 3
Kernel driver in use: pcieport
00: 86 80 10 9c 00 00 10 00 e4 00 04 06 40 00 81 00
10: 00 00 00 00 00 00 00 00 00 01 01 00 f0 00 00 20
20: f0 ff 00 00 f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00

00:1c.1 PCI bridge: Intel Corporation Lynx Point-LP PCI Express Root
Port 2 (rev e4) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
Memory behind bridge: b08

Re: Control message failures kill entire XHCI stack

2015-03-10 Thread Devin Heitmueller
riptorType 5
bEndpointAddress 0x81  EP 1 IN
bmAttributes3
  Transfer TypeInterrupt
  Synch Type   None
  Usage Type   Data
wMaxPacketSize 0x0008  1x 8 bytes
bInterval  11
  Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x82  EP 2 IN
bmAttributes2
  Transfer TypeBulk
  Synch Type   None
  Usage Type   Data
wMaxPacketSize 0x0200  1x 512 bytes
bInterval   0
  Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x03  EP 3 OUT
bmAttributes2
  Transfer TypeBulk
  Synch Type   None
  Usage Type   Data
wMaxPacketSize 0x0200  1x 512 bytes
bInterval   0
Device Qualifier (for other device speed):
  bLength10
  bDescriptorType 6
  bcdUSB   2.00
  bDeviceClass  255 Vendor Specific Class
  bDeviceSubClass   255 Vendor Specific Subclass
  bDeviceProtocol 0
  bMaxPacketSize0 8
  bNumConfigurations  1
Device Status: 0x
  (Bus Powered)

Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Device Descriptor:
  bLength18
  bDescriptorType 1
  bcdUSB   2.00
  bDeviceClass9 Hub
  bDeviceSubClass 0 Unused
  bDeviceProtocol 1 Single TT
  bMaxPacketSize064
  idVendor   0x1d6b Linux Foundation
  idProduct  0x0002 2.0 root hub
  bcdDevice3.19
  iManufacturer   3 Linux 3.19.0-rc5djh+ xhci-hcd
  iProduct2 xHCI Host Controller
  iSerial 1 :00:14.0
  bNumConfigurations  1
  Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength   25
bNumInterfaces  1
bConfigurationValue 1
iConfiguration  0
bmAttributes 0xe0
  Self Powered
  Remote Wakeup
MaxPower0mA
Interface Descriptor:
  bLength 9
  bDescriptorType 4
  bInterfaceNumber0
  bAlternateSetting   0
  bNumEndpoints   1
  bInterfaceClass 9 Hub
  bInterfaceSubClass  0 Unused
  bInterfaceProtocol  0 Full speed (or root) hub
  iInterface  0
  Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81  EP 1 IN
bmAttributes3
  Transfer TypeInterrupt
  Synch Type   None
  Usage Type   Data
wMaxPacketSize 0x0004  1x 4 bytes
bInterval  12
Hub Descriptor:
  bLength  11
  bDescriptorType  41
  nNbrPorts 9
  wHubCharacteristic 0x000a
No power switching (usb 1.0)
Per-port overcurrent protection
TT think time 8 FS bits
  bPwrOn2PwrGood   10 * 2 milli seconds
  bHubContrCurrent  0 milli Ampere
  DeviceRemovable0x00 0x00
  PortPwrCtrlMask0xff 0xff
 Hub Port Status:
   Port 1: .0503 highspeed power enable connect
   Port 2: .0100 power
   Port 3: .0103 power enable connect
   Port 4: .0100 power
   Port 5: .0103 power enable connect
   Port 6: .0100 power
   Port 7: .0100 power
   Port 8: .0100 power
   Port 9: .0100 power
Device Status: 0x0001
  Self Powered

Devin

On Tue, Mar 10, 2015 at 5:55 AM, Lu, Baolu  wrote:
> Hi Devin,
>
> Do you mind to post output of "lspci -xv" on the machine where you saw this
> problem?
>
> We are facing the same problems in other cases. I could provide a patch for
> it later.
>
> Thanks,
> Baolu
>
>
> On 01/19/2015 04:55 AM, Devin Heitmueller wrote:
>>
>> Hello,
>>
>> I'm debugging some issues on a couple of different USB TV tuners which
>> boil down to the following error:
>>
>> xhci_hcd :00:14.0: xHCI host not responding to stop endpoint command.
>>
>> This is followed by the XHCI driver disconnecting *all* USB devices
>> from the controller.
>>
>> I've done a bit of debugging, and the root of the issue appears to be
>> an intermittent control message timing out, and then the call to
>> usb_kill_urb() that occurs inside of usb_control_msg() when the
>> timeout expires is what causes the disconnect.  Specifically, it would
>> appear that xhci_urb_dequeue tries to stop the endpoint using
>> xhci_queue_stop_endpoint(), the command gets queued but the IRQ never
>> fires to perform the

Re: URB IRQ fires on URB after usb_kill_urb() already completed

2015-02-25 Thread Devin Heitmueller
Hi Alan,

I think I see what's going on.  Permit me to comment on your
explanation of urb->use_count first, since it's relevant later on.

> Here's the story:
>
> A new URB has its use_count set to 0 by usb_init_urb.
>
> The use_count is incremented when the URB is submitted,
> in usb_hcd_submit_urb.  (If the submission fails then
> the use_count is decremented again.)
>
> The use_count is decremented after the completion routine
> returns, in usb_hcd_giveback_urb.
>
> This means that an active URB should never have its use_count equal to
> 0.  The value is normally 1; it can go up to 2 temporarily if the
> completion routine resubmits the URB.

There's an additional state here that you're not mentioning - a URB
which has been submitted, serviced, but then *not* rescheduled in the
completion handler.  In this case the use_count will drop back down to
zero.

Now in many drivers the completion handler doesn't resubmit the URB
because of some error condition (such as the urb->status being ENOENT
because some thread of execution called usb_kill_urb()), or in other
drivers the completion handler will blindly call usb_submit_urb()
regardless of urb->status but the call will fail (again for example
because somebody called usb_kill_urb() and thus urb->reject was set to
1).  However, there's a third condition which can cause the completion
handler to not resubmit the URB, and that's because the driver has a
flag which simply says not to (the au0828 driver does this to stop
streaming).  Hence you can have a case where the URB completion
handler was called with (urb->status == 0) but the URB will not still
be resubmitted, and thus when usb_kill_urb() is eventually called
against the URB, the use_count is going to be zero when entering the
function.

Hence the actual lifecycle is:
create: 0
submit: 1
complete:
 - if completion handler calls usb_submit_urb() then 2 during
usb_submit_urb(), then back to 1 after completion returns.
 - if completion handler doesn't call usb_submit_urb(), then 0 after
completion returns.

Why does this matter?  Asynchronous unlinking.  But let's first look
at how usb_kill_urb() works 99.99% of the time, which is against a URB
which has been submitted and is actively being serviced:

1.  A task calls usb_kill_urb() against a URB with use count of 1 (or
sometimes 2).
2.  usb_kill_urb() increments urb->reject so any attempt to resubmit
the URB fails
3.  usb_kill_urb() calls usb_hcd_unlink_urb()
***
4.  usb_kill_urb() spins on &usb_kill_urb_queue waiting for use_count
to reach zero.
5.  usb_kill_urb() returns to the caller, which typically then calls
usb_free_urb(urb).

Between steps 3 and 4, the following things occur:
3a.  usb_hcd_unlink_urb() sees urb->use_count is nonzero and calls
unlink1() against the URB
3b.  unlink1() calls ehci_urb_dequeue(), which calls
usb_hcd_check_unlink_urb, which sets the urb->unlinked field to
-ENOENT
3c.  ehci_urb_dequeue() calls start_unlink_async() for the QH tied to the URB

Sometime while usb_kill_urb() is spinning on urb->use_count, the IRQ
handler eventually fires:
3d.  ehci_irq() calls end_unlink_async()
3e.  end_unlink_async() calls qh_completions()
3f.  qh_completions() calls ehci_urb_done()
3g.  ehci_urb_done() calls __usb_hcd_giveback_urb
3h.   __usb_hcd_giveback_urb calls the completion handler for the URB
3i. Because the URB isn't resubmitted, use_count drops to zero and
wake_up(&usb_kill_urb_queue) is called.

===

For the failure case, I see the completion handler firing for the URB,
but the URB isn't being resubmitted because of the flag in the
completion handler.  Then when the kill comes around, the
urb->use_count is zero and thus it returns immediately rather than
blocking on the wait queue.  But this doesn't change the fact that the
end_unlink_async() hasn't necessarily been invoked yet for the URB.
When the end_unlink_async() does finally do its housekeeping and call
ehci_urb_done(), it's being called against a URB that has already been
freed.

The problem is exacerbated by the fact that I immediately call
usb_free_urb() right after the usb_kill_urb() call for a given URB,
and hence create a larger window of opportunity to hit the
use-after-free condition.  Even a short sleep would likely allow for
enough time after the usb_kill_urb() call to allow end_unlink_async()
to finish its work before the underlying URB was freed.

Below is a trace which shows the crash, and how during the completion
handler the use count is good for URB ebc54410, but because the URB
won't be resubmitted it drops to zero by the time the usb_kill_urb()
is called (note, URB ebc54410 is the second to last URB killed before
the crash).  You can also see at the time the completion handler is
called for URB ebc54410 that while I've set the flag that the
completion handler looks for to return without resubmitting, I haven't
actually issued the usb_kill_urb() call, which is why the status is 0
instead of -2.

One other data p

Re: URB IRQ fires on URB after usb_kill_urb() already completed

2015-02-23 Thread Devin Heitmueller
> Good suggestion.  I'll add code to print it out before and after
> usb_kill_urb(), as well as from within the the IRQ handler and right
> before the call to usb_free_urb().

Added a quick debug line right before the call to usb_urb_kill(),
thinking that perhaps one of the things checked for at the top of that
function was causing the function to return without actually doing the
work (the function starts with a check of urb, dev, and ep for NULL).
That said though, it looks like the usage count is screwed up.  The
URB has a usage count of zero before the call to usb_kill_urb(), and
that's the exact URB that is referenced in the panic trace (ecb07680).

Is the usage_count member an indication that the URB has gone through
usb_submit_urb()?  That it's been linked into the hardware transfer
ring?

Devin

[ 3686.780204] au0828/0: stop_urb_transfer()
[ 3686.780213] au0828/0: killing urb ecb07340 dev=f69a74d0 ep=f6450780 uc=1
[ 3686.782979] au0828/0: killing urb ecb07680 dev=f69a74d0 ep=f6450780 uc=0
[ 3686.783002] au0828/0: killing urb ecb070d0 dev=f69a74d0 ep=f6450780 uc=1
[ 3686.785767] BUG: unable to handle kernel paging request at 6b6b6b6f
[ 3686.785960] IP: [] usb_hcd_unlink_urb_from_ep+0x1e/0x40
[ 3686.786129] *pdpt = 31c9f001 *pde = 
[ 3686.786283] Oops: 0002 [#1] SMP
[ 3686.786379] Modules linked in: cuse nfsd(F) auth_rpcgss(F)
nfs_acl(F) nfs(F) lockd(F) sunrpc(F) fscache(F) snd_usb_audio
snd_usbmidi_lib au8522_dig(OF) btusb joydev(F) bluetooth xc5000(OF)
tuner(OF) au8522_decoder(OF) au8522_common(OF) hid_generic kvm_amd(F)
kvm(F) microcode(F) snd_hda_codec_realtek sp5100_tco
snd_hda_codec_hdmi k10temp arc4(F) iwldvm mac80211 au0828(OF) iwlwifi
tveeprom(OF) videobuf_vmalloc(OF) videobuf_core(OF) v4l2_common(OF)
dvb_core(OF) videodev(OF) snd_hda_intel media(OF) cfg80211
snd_seq_midi(F) snd_seq_midi_event(F) usbhid snd_hda_codec
snd_rawmidi(F) hid snd_hwdep(F) snd_seq(F) snd_pcm(F)
snd_seq_device(F) snd_page_alloc(F) snd_timer(F) i2c_piix4snd(F)
ohci_pci radeon soundcore(F) ttm drm_kms_helper drm i2c_algo_bit
mac_hid lp(F) parport(F) r8169 ahci(F) libahci(F) mii(F)
[ 3686.788502]
[ 3686.788520] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GF  O
3.12.0-031200rc7-generic #201310271935
[ 3686.788741] Hardware name: To be filled by O.E.M. To be filled by
O.E.M./Inagua CRB, BIOS 4.6.5 02/12/2014
[ 3686.788971] task: f70ce800 ti: f7108000 task.ti: f70c6000
[ 3686.789106] EIP: 0060:[] EFLAGS: 00210046 CPU: 1
[ 3686.789243] EIP is at usb_hcd_unlink_urb_from_ep+0x1e/0x40
[ 3686.789378] EAX: ecb07694 EBX: ecb07680 ECX: 6b6b6b6b EDX: 6b6b6b6b
[ 3686.789529] ESI: ecb07680 EDI: f6839860 EBP: f7109e50 ESP: f7109e4c
[ 3686.789680]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 3686.789811] CR0: 8005003b CR2: 6b6b6b6f CR3: 364a CR4: 07f0
[ 3686.789961] Stack:
[ 3686.790015]  f6839a64 f7109e68 c14ae94a ff8d ec92d3c0 ec92d99c
f4bf43f0 f7109ecc
[ 3686.790255]  c14af863 2c92d060  00200082 000f4240 0007
f68399e0 f7109ea0
[ 3686.790492]  c1075a26 000f4240 0301 f686bcc0 ec92d3c0 f4bf4400
0001 f68399a8
[ 3686.790730] Call Trace:
[ 3686.790804]  [] ehci_urb_done+0x4a/0x90
[ 3686.790926]  [] qh_completions+0x203/0x580
[ 3686.791056]  [] ? hrtimer_start_range_ns+0x26/0x30
[ 3686.791202]  [] end_unlink_async+0x108/0x1f0
[ 3686.791333]  [] ehci_irq+0xf1/0x430
[ 3686.791445]  [] ? __wake_up+0x3f/0x50
[ 3686.791563]  [] ? serial8250_tx_chars+0xe8/0x130
[ 3686.791701]  [] ? serial8250_handle_irq.part.14+0x75/0xa0
[ 3686.791857]  [] usb_hcd_irq+0x33/0x50
[ 3686.791975]  [] handle_irq_event_percpu+0x35/0x1a0
[ 3686.792119]  [] ? __unmask_ioapic+0x2d/0x40
[ 3686.792248]  [] handle_irq_event+0x31/0x50
[ 3686.792375]  [] ? unmask_irq+0x30/0x30
[ 3686.792494]  [] handle_fasteoi_irq+0x4e/0xe0
[ 3686.792620]  
[ 3686.792675]
[ 3686.792727]  [] ? do_IRQ+0x3c/0xb0
[ 3686.792806]  [] ? common_interrupt+0x33/0x38
[ 3686.792940]  [] ? cpuidle_enter_state+0x3e/0xd0
[ 3686.793077]  [] ? cpuidle_idle_call+0x9e/0x1d0
[ 3686.793213]  [] ? arch_cpu_idle+0xd/0x30
[ 3686.793336]  [] ? cpu_startup_entry+0x9b/0x200
[ 3686.793470]  [] ? setup_APIC_timer+0xab/0x130
[ 3686.793603]  [] ? start_secondary+0x208/0x2d0
[ 3686.793731] Code: eb d6 83 cb ff eb d1 8d b6 00 00 00 00 55 89 e5
53 3e 8d 74 26 00 b8 26 8b b5 c1 89 d3 e8 ab 95 1a 00 8b 4b 14 8d 43
14 8b 53 18 <89> 51 04 89 0a 89 43 14 89 43 18 b8 26 8b b5 c1 e8 1d 95
1a 00
[ 3686.800948] EIP: [] usb_hcd_unlink_urb_from_ep+0x1e/0x40
SS:ESP 0068:f7109e4c
[ 3686.807525] CR2: 6b6b6b6f



-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: URB IRQ fires on URB after usb_kill_urb() already completed

2015-02-20 Thread Devin Heitmueller
> But the fact that the call stack passes through qh_completions and
> ehci_unlink_async means that the URB in question is _not_ isochronous
> (or interrupt for that matter).  Does that help?

Doh.  My mistake.  Of course it's a bulk endpoint.  This is the price
I pay for working on two different USB drivers on the same day.  :-)

> If you want to trace things in greater detail, look at the value of
> urb->use_count at various points.  usb_kill_urb won't return until the
> value reaches 0, which should not occur until well after
> usb_hcd_unlink_urb_from_ep returns.

Good suggestion.  I'll add code to print it out before and after
usb_kill_urb(), as well as from within the the IRQ handler and right
before the call to usb_free_urb().

I was kind of hoping you would chime in and say, "Oh that?  I fixed
that six months ago in 3.15 via this patch."  No such luck, I guess...

It definitely seems to be timing sensitive.  I've got a script to
reproduce the problem and on average it takes anywhere from 30-80
restarts of the stream (where each restart kills and frees the URBs as
well as the transfer buffer for that URB).

Thanks,

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


URB IRQ fires on URB after usb_kill_urb() already completed

2015-02-20 Thread Devin Heitmueller
Hi there,

I've been trying to track down some memory corruption in my driver for
the HVR-950q related to dequeueing of ISOC URBs when shutting down the
stream.  I enabled SLUB poisoning, and I'm seeing a use-after-free
condition in the EHCI driver.  See below.

Based on my read of the stack, it looks like the IRQ handler comes
along and tries to access the URB eba088f0 *after* I've already killed
it.  In fact, I can see that the usb_kill_urb() call for that URB has
returned, since it's already moved onto killing the next URB in the
list when the crash occurs.  My assumption would have to be that the
URB wasn't properly removed from the ISOC schedule.

Now I'm on 3.12-rc7 which I know is a bit old, and even my typical
response to complaining users is "upgrade to the latest and see if
it's already fixed".  That said, really all I'm asking at this point
is if anybody recalls having hit this before and fixed it in some
later revision.  For reasons I cannot really get into right now,
upgrading to the last kernel isn't practical, but if I know for sure
it's something fixed in some later release then I can look at
backporting the fix.

Thanks in advance,

Devin

[  540.930055] au0828/0: au0828_dvb_stop_feed(), start_count: 124,
stop_count: 124
[  540.930062] au0828/0: stop_urb_transfer()
[  540.930069] au0828/0: killing urb eba08340
[  540.933270] au0828/0: killing urb eba088f0
[  540.933293] au0828/0: killing urb f431ca90
[  540.936242] BUG: unable to handle kernel paging request at 6b6b6b6f
[  540.936445] IP: [] usb_hcd_unlink_urb_from_ep+0x1e/0x40
[  540.936614] *pdpt = 2f1e8001 *pde = 
[  540.936770] Oops: 0002 [#1] SMP
[  540.936866] Modules linked in: cuse nfsd(F) auth_rpcgss(F)
nfs_acl(F) nfs(F) lockd(F) sunrpc(F) fscache(F) snd_usb_audio
snd_usbmidi_lib au8522_dig(OF) btusb bluetooth joydev(F) hid_generic
xc5000(OF) tuner(OF) au8522_decoder(OF) au8522_common(OF) kvm_amd(F)
kvm(F) microcode(F) snd_hda_codec_realtek snd_hda_codec_hdmi
sp5100_tco k10temp arc4(F) iwldvm au0828(OF) tveeprom(OF)
videobuf_vmalloc(OF) videobuf_core(OF) v4l2_common(OF) mac80211
dvb_core(OF) videodev(OF) media(OF) iwlwifi usbhid hid snd_hda_intel
snd_hda_codec snd_hwdep(F) cfg80211 snd_seq_midi(F)
snd_seq_midi_event(F) snd_pcm(F) snd_rawmidi(F) i2c_piix4
snd_page_alloc(F) ohci_pci radeon snd_seq(F) snd_seq_device(F) ttm
snd_timer(F) drm_kms_helper drm snd(F) i2c_algo_bit mac_hid
soundcore(F) lp(F) parport(F) r8169 ahci(F) libahci(F) mii(F)
[  540.939014]
[  540.939033] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GF  O
3.12.0-031200rc7-generic #201310271935
[  540.939256] Hardware name: To be filled by O.E.M. To be filled by
O.E.M./Inagua CRB, BIOS 4.6.5 02/12/2014
[  540.939487] task: c1910980 ti: f7008000 task.ti: c1904000
[  540.939621] EIP: 0060:[] EFLAGS: 00210046 CPU: 0
[  540.939758] EIP is at usb_hcd_unlink_urb_from_ep+0x1e/0x40
[  540.939894] EAX: eba08904 EBX: eba088f0 ECX: 6b6b6b6b EDX: 6b6b6b6b
[  540.940046] ESI: eba088f0 EDI: f6831860 EBP: f7009e50 ESP: f7009e4c
[  540.940197]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[  540.940329] CR0: 8005003b CR2: 6b6b6b6f CR3: 2ba72000 CR4: 07f0
[  540.940480] Stack:
[  540.940534]  f6831a64 f7009e68 c14ae94a ff8d eb96c8a0 eb96c39c
ef9d1d90 f7009ecc
[  540.940773]  c14af863 2b96c5a0  00200082 000f4240 0007
f68319e0 f7009ea0
[  540.941010]  c1075a26 000f4240 0301 f685d960 eb96c8a0 ef9d1da0
0001 f68319a8
[  540.941247] Call Trace:
[  540.941322]  [] ehci_urb_done+0x4a/0x90
[  540.941443]  [] qh_completions+0x203/0x580
[  540.941573]  [] ? hrtimer_start_range_ns+0x26/0x30
[  540.941720]  [] end_unlink_async+0x108/0x1f0
[  540.941852]  [] ehci_irq+0xf1/0x430
[  540.941967]  [] ? add_interrupt_randomness+0x120/0x170
[  540.942118]  [] ? __mix_pool_bytes+0x39/0x80
[  540.942248]  [] ? add_interrupt_randomness+0x149/0x170
[  540.942400]  [] usb_hcd_irq+0x33/0x50
[  540.942517]  [] handle_irq_event_percpu+0x35/0x1a0
[  540.942663]  [] ? __unmask_ioapic+0x2d/0x40
[  540.942792]  [] handle_irq_event+0x31/0x50
[  540.942921]  [] ? unmask_irq+0x30/0x30
[  540.943040]  [] handle_fasteoi_irq+0x4e/0xe0
[  540.943166]  
[  540.943221]
[  540.943274]  [] ? do_IRQ+0x3c/0xb0
[  540.943354]  [] ? common_interrupt+0x33/0x38
[  540.943489]  [] ? cpuidle_enter_state+0x3e/0xd0
[  540.943626]  [] ? cpuidle_idle_call+0x9e/0x1d0
[  540.943762]  [] ? arch_cpu_idle+0xd/0x30
[  540.943885]  [] ? cpu_startup_entry+0x9b/0x200
[  540.944019]  [] ? complete+0x48/0x50
[  540.944136]  [] ? rest_init+0x62/0x70
[  540.944254]  [] ? start_kernel+0x397/0x39d
[  540.944380]  [] ? repair_env_string+0x51/0x51
[  540.950487]  [] ? i386_start_kernel+0x137/0x13a
[  540.956536] Code: eb d6 83 cb ff eb d1 8d b6 00 00 00 00 55 89 e5
53 3e 8d 74 26 00 b8 26 8b b5 c1 89 d3 e8 ab 95 1a 00 8b 4b 14 8d 43
14 8b 53 18 <89> 51 04 89 0a 89 43 14 89 43 18 b8 26 8b b5 c1 e8 1d 95
1a 00
[  540.969778] EIP: [] usb_hcd_unlink_urb_from_ep+0x1e/0x40
SS:ESP 0068:f7009e4c
[  540.976118]

Re: Control message failures kill entire XHCI stack

2015-02-20 Thread Devin Heitmueller
> Got one more patch added to the for-usb-next-branch.
> It makes sure we allocate enough scratchpad memory for xhci.
>
> It's one possible cause.
> Patch will anyway go to 3.20, but you can try it out first to see if it helps

Hi Mathias,

Thanks, I'll see if I can get this tested out over the weekend.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Safe to call usb_kill_urb() against valid URB that hasn't been submitted?

2015-02-17 Thread Devin Heitmueller
Hi there,

I've got a driver which maintains a pool of URBs, properly created via
usb_alloc_urb().  Separately I have a cleanup function which destroys
the pool by looping through the list and calling usb_kill_urb(),
freeing the urb->transfer_buffer, and then calling usb_free_urb().

There are some edge cases where not all of the URBs may be in a
submitted state when calling the cleanup function.  Could this cause a
problem?  Is it legal to call usb_kill_urb() against a URB that hasn't
had usb_submit_urb() called against it (or where the URB had been
submitted but it has already completed and wasn't resubmitted)?

I can certainly add additional logic to keep track of exactly which
URBs in the pool are in a submitted state (so I call usb_kill_urb()
against only submitted URBs but usb_free_urb against all URBs), but I
would obviously prefer to avoid the headache and risk of additional
bugs if it isn't necessary.

Thanks,

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Control message failures kill entire XHCI stack

2015-01-25 Thread Devin Heitmueller
Hi Mathias,

Here's an interesting development:  as a result of a related thread on
linux-media, I came across a patch they are distributing in openelec:

https://github.com/OpenELEC/OpenELEC.tv/commit/b636927dec20652ff020e54ed7838a2e9be51e03

Now I'm not saying that reverting the commit in question is the
"right" thing to do, but I applied this patch and for the first time
in 100+ tests it started to work (i.e. I'm not seeing the XHCI hcd
tear down all the attached devices).

Given what I've seen of the bug I cannot really explain why the
scatter gather list sizes would have any bearing on TRBs for USB
control messages to be added to the queue.  Perhaps we're hitting the
upper bound of the list?  Any further speculation on my part would
just make me look clueless...

It would be great if you could offer any insight as to why the patch
in question could be responsible for the behavior we're seeing.  I
would really rather not just blindly check this patch into my local
tree and declare "victory" without understanding the underlying issue
and whether it's likely to cause other problems.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Control message failures kill entire XHCI stack

2015-01-25 Thread Devin Heitmueller
Hi Mathias,

I've done additional debugging and found a few things that might be noteworthy.

When I originally reported the issue, I noted that it was some sort of
recovery problem related to failed control messages.  So while you've
been focused on why stopping the endpoint failed when it came time to
kill the URB, I started looking at why the control message failed in
the first place.  This prompted me to hook up the USB analyzer in
order to see the exact USB control message it failed on.  The
interesting thing is that control message which never succeeded never
actually made it onto the bus.  While I was expecting to see the setup
message followed by either a missing in or out packet, or some other
error such as a stall condition, I don't actually ever see the the
setup message appear on the bus.

The reason I'm bringing this up is because perhaps the reason the IRQ
for the stop endpoint completion handler is failing to fire is because
the TRB never actually got queued in the first place (hence we're
trying to unlink TRBs that don't exist).  It also suggests this isn't
a case where the USB device isn't properly responding to the control
message, since it never gets received in the first place.

I do see the URB gets all the way to xhci_queue_ctrl_tx(), however I
don't fully understand yet the internals of that function so cannot
speculate on how much further it gets.

One other thing worth mentioning:  just as a hack I jammed a
udelay(1000) right before xhci_queue_ctrl_tx() (outside the
spin_lock_irqsave), thinking that this might be some sort of issue
with bursts of control messages being queued too fast.  However the
problem still occurs even in that case.

I'm not sure if any of the above offers any clue as to what's going on.

Devin


On Thu, Jan 22, 2015 at 6:32 AM, Alistair Grant  wrote:
> Hi Matthias,
>
> On Thu, Jan 22, 2015 at 12:22 PM, Mathias Nyman
>  wrote:
>> On 21.01.2015 21:16, Alistair Grant wrote:
>>> Hi Matthias,
>>>
>>> On Tue, Jan 20, 2015 at 10:22 AM, Mathias Nyman
>>>  wrote:
>>>> On 19.01.2015 22:02, Devin Heitmueller wrote:
>>>>> Hi Mathias,
>>>>>
>>>>> Thanks for getting back to me.
>>>>>
>>>>>> There are a couple of xhci bugs triggered by dvb devices:
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=75521
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=65021
>>>>>>
>>>>>> The first one (75521) I believe is mostly fixed by patches in 3.18 and 
>>>>>> early
>>>>>> 3.19-rc, so work on a 3.19-rc kernel to eliminate those issues.
>>>>> ...
>>>>>> The second bug (65021) looks more like your case, it queues two 
>>>>>> stop_endpoints
>>>>>> commands almost simultaneously, which end up never completing, ->timeout 
>>>>>> and tear down xhci.
>>>>>> That bug has a debug patch for command ring status, you could try it out 
>>>>>> to check if
>>>>>> the command queue is running among other details.
>>>>>
>>>>> The second bug definitely looks like what I'm seeing.  I'll try your
>>>>> tree and report back my findings.
>>>>
>>>> Ah, let me rebase that tree on top of a more current kernel first.
>>>
>>> I've been following this as it somewhat resembles the problems I'm
>>> experiencing, which I've just formally reported in "Hauppauge
>>> USB-Live2 recording fails on USB3 port".
>>>
>>> It looks like you did the rebase yesterday, so I tried running your
>>> updated kernel.  It didn't solve the problems, but was an incremental
>>> improvement since the mainline 3.19rc5 kernel hangs during shutdown,
>>> while your modified kernel shutdown cleanly.
>>>
>>> I've got a syslog with verbose xhci logging (echo -n 'module xhci_hcd
>>> =p' > /sys/kernel/debug/dynamic_debug/control as you suggested above)
>>> if you would like to take a look.  It's 2.4M so please let me know how
>>> you would like me to pass it on.
>>
>> I'm Interested yes, anyway is fine. Add it somewhere (bugreport, webpage) 
>> and send me the link,
>> or just sending it as an email attachent to me works fine as well.
>
> You should be able to download it from:
>
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1412121/+attachment/4303563/+files/syslog-14.10-3.19rc5mathias0.2-2.log
>
> Please let me know if you have any trouble accessing it, or would like
> more testing, information, etc.
>
> Thanks,
> Alistair
> --
> To unsubscribe from this list: send the line "unsubscribe linux-usb" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Control message failures kill entire XHCI stack

2015-01-21 Thread Devin Heitmueller
> I've been following this as it somewhat resembles the problems I'm
> experiencing, which I've just formally reported in "Hauppauge
> USB-Live2 recording fails on USB3 port".

Yeah, it happens with the cx231xx as well, although the sequencing of
events leading to the failure is a bit different (I wrote the original
support for the USBLive2 back in 2011).

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Control message failures kill entire XHCI stack

2015-01-19 Thread Devin Heitmueller
Hi Mathias,

Thanks for getting back to me.

> There are a couple of xhci bugs triggered by dvb devices:
> https://bugzilla.kernel.org/show_bug.cgi?id=75521
> https://bugzilla.kernel.org/show_bug.cgi?id=65021
>
> The first one (75521) I believe is mostly fixed by patches in 3.18 and early
> 3.19-rc, so work on a 3.19-rc kernel to eliminate those issues.

Yeah, I'm already running 3.19-rc1 so I should be in good shape for
the first issue (and I've never seen any errors that resemble those in
the bug report).

> The second bug (65021) looks more like your case, it queues two stop_endpoints
> commands almost simultaneously, which end up never completing, ->timeout and 
> tear down xhci.
> That bug has a debug patch for command ring status, you could try it out to 
> check if
> the command queue is running among other details.

The second bug definitely looks like what I'm seeing.  I'll try your
tree and report back my findings.

> A verbose xhci dmesg log log using:
> echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
> could give some insight to what's happening

Just to be clear, the log I provided in my original email actually was
with "xhci_hcd +p", but I'm assuming you mean to provide that log
output after loading up your "for-usb-next-test" tree.

I'm happy to ship you a device if you want to reproduce the issue
locally.  On my 2014 Macbook Pro all I have to do is plug in the
device and run the tvtime application that ships with most distros,
and it fails 100% of the time within 5 seconds (you don't actually
need a DVB or analog TV source connected to hit the bug).  Just email
me your shipping info off-list if you're interested.

Thanks,

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Control message failures kill entire XHCI stack

2015-01-18 Thread Devin Heitmueller
Hello,

I'm debugging some issues on a couple of different USB TV tuners which
boil down to the following error:

xhci_hcd :00:14.0: xHCI host not responding to stop endpoint command.

This is followed by the XHCI driver disconnecting *all* USB devices
from the controller.

I've done a bit of debugging, and the root of the issue appears to be
an intermittent control message timing out, and then the call to
usb_kill_urb() that occurs inside of usb_control_msg() when the
timeout expires is what causes the disconnect.  Specifically, it would
appear that xhci_urb_dequeue tries to stop the endpoint using
xhci_queue_stop_endpoint(), the command gets queued but the IRQ never
fires to perform the TRB_STOP_RING completion code. The function
xhci_stop_endpoint_command_watchdog() fires after five seconds, which
tears down the entire driver.

Below is the dmesg output with the xhci_hcd debugging enabled.  The
dump_stack() call is something I added (i.e. it's not an OOPS) so I
could see which code path was making the usb_kill_urb() call that was
failing.  Note that the caller is using usb_control_msg() with 1000ms
timeout, and we can see from the timestamps that the timer expires
which is what causes the call to usb_kill_urb().

I would imagine that explicitly killing URBs is a pretty uncommon task
for control endpoint messages (compared to ISOC or BULK endpoints
where it's done regularly).  Is it possible a exception case has been
missed?

Independent of the usb_kill_urb() killing the entire stack, I still
don't really understand yet why the control message failed in the
first place.  This is a well-exercised code path in the au0828 driver
(related to I2C transfers) and I've never seen this when using the
EHCI driver.  My assumption is that either the HCD is getting sick
which is causing both the control message to fail as well as putting
it into an inconsistent state such that we never get the TRB_STOP_RING
IRQ, or we've got two separate bugs - the control message failing for
some "legitimate" reason (i.e. I screwed something up in my au0828
driver), followed by the usb_kill_urb() error simply not handling
killing of URBs on a control endpoint (which causes the entire stack
to go down).

Thoughts/suggestions/recommendations are welcome.

Thanks in advance,

Devin

Jan 18 14:04:05 devin-MacBookPro kernel: [ 9119.647249] au0828:
au0828_writereg(0x0100, 0x00)
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.645091] xhci_hcd
:00:14.0: Cancel URB 8802543c36c0, dev 1, ep 0x0, starting at
offset 0x25c358940
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.645365] djh dequeue
pending=0 ep_index=0
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.645632] CPU: 1 PID:
2782 Comm: tvtime Tainted: P   OE  3.18.0-rc4djh+ #33
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.645921] Hardware name:
Apple Inc. MacBookPro11,1/Mac-189A3D4F975D5FFC, BIOS
MBP111.88Z.0138.B11.1408291433 08/29/2014
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.646236]
88025b9b 88023ea2f9d8 817445c1 
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.646570]
8802543c36c0 88023ea2fa58 a0080b2e 00025c358940
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.646909]
88023ea2fa48 3ea2fa18 88023e8a22a0 
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.647256] Call Trace:
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.647605]
[] dump_stack+0x46/0x58
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.647981]
[] xhci_urb_dequeue+0x28e/0x420 [xhci_hcd]
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.648357]
[] unlink1+0x2d/0x130
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.648743]
[] ? internal_add_timer+0xb0/0xb0
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.649133]
[] ? get_device+0x17/0x30
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.649526]
[] usb_hcd_unlink_urb+0x5d/0xf0
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.649928]
[] usb_kill_urb+0x3a/0xa0
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.650334]
[] ? wake_up_state+0x20/0x20
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.650746]
[] usb_start_wait_urb+0xc8/0x150
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.651166]
[] ? __kmalloc+0x55/0x190
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.651586]
[] usb_control_msg+0xc5/0x110
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.652011]
[] au0828_writereg+0x79/0xf0 [au0828]
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.652447]
[] ? try_to_del_timer_sync+0x4f/0x70
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.652894]
[] au0828_analog_stream_disable+0x29/0x50 [au0828]
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.653354]
[] vidioc_streamoff+0xd9/0x1c0 [au0828]
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.653823]
[] v4l_streamoff+0x1a/0x20 [videodev]
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.654298]
[] __video_do_ioctl+0x274/0x2f0 [videodev]
Jan 18 14:04:06 devin-MacBookPro kernel: [ 9120.654786]
[] ? commit_charge+0x1f/0x80
Jan 18 14:04:06 dev

Re: Upgrading just the USB core from 3.7 to tip?

2014-02-07 Thread Devin Heitmueller
Hi Alan,

Thanks for taking the time to respond.

On Thu, Feb 6, 2014 at 5:08 PM, Alan Stern  wrote:
> Is it possible to get any debugging information from the machines in
> the field?

At this point all I know is the usb_submit_urb() is returning -EBIGF.
In the 3.7 ehci_hcd driver that can come from one of three places and
I'm in the process of deploying some additional logging so at least I
know which of those three is causing it.  I do know that when it
occurs, it's not necessarily at the start of streaming (i.e. it's not
some sort of initial bandwidth reservation issue).  I see it happening
when the URBs are being actively serviced by the completion handler
and resubmitted.

>> I know this is an atypical thing to consider since most devs work on
>> the trunk, but can anybody give me a feel for how much breakage is
>> likely to occur If I move the tree between those two revs?  Have there
>> been a ton of changes to common frameworks in the kernel that
>> drivers/usb relies on?
>>
>> Suggestions welcome.
>
> That's a pretty hard question to answer.  Probably the best way to find
> out is try it and see what breaks.  It may be easier to back-port just
> the EHCI driver family and not the whole stack.

Yeah, I'll take a crack at it.  Figured I would ask to see if anybody
would come out and say "don't even try it - we know it to be a lost
cause."

> I can point out that in the last one or two kernel releases there were
> significant changes to the snd-usb-audio driver that affect the way it
> interacts with ehci-hcd.  Just to make your life more difficult...

Supposedly they never had this issue in 2.6.33, so presumably it's
some sort of regression in either the EHCI driver or snd-usb-audio
between 2.6.33 and 3.7.  That said, I'm hoping it's something already
fixed in 3.13 and I just need to find the right change to backport.

I know the ISO_ASAP behavior changed, but aside from that do you see
any reason that I wouldn't be able to backport the entire
snd-usb-audio driver from 3.13 (leaving the ehci driver at the 3.7
level)?

Cheers,

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Upgrading just the USB core from 3.7 to tip?

2014-02-06 Thread Devin Heitmueller
Hi there,

So I have a problem in the field on kernel 3.7 which is looking more
and more like some sort of EHCI scheduler bug (very occasional
usb_submit_urb failures in snd-usb-audio wedge the entire USB stack
until reboot).  Because it's only happening in the field and I cannot
repro locally (and it's happening on lots of machines), I was
speculating on how hard it would be to just backport in the entire
drivers/usb tree from the latest kernel back to 3.7.

I know this is an atypical thing to consider since most devs work on
the trunk, but can anybody give me a feel for how much breakage is
likely to occur If I move the tree between those two revs?  Have there
been a ton of changes to common frameworks in the kernel that
drivers/usb relies on?

Suggestions welcome.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Video corruption varies by system load

2013-07-11 Thread Devin Heitmueller
On Thu, Jul 11, 2013 at 5:02 PM, Alan Stern  wrote:
> You could try doing the opposite test: Leave the driver otherwise
> intact, but have it not deliver the video buffers to the userspace
> client.

Interesting.  Only one error in 72 frames of video.  Suggests the
problem is somewhere that is anywhere other than where I've been
focusing all my attention.  :-(

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Video corruption varies by system load

2013-07-11 Thread Devin Heitmueller
On Thu, Jul 11, 2013 at 3:02 PM, Johannes Stezenbach  wrote:
> I'm not familiar with musb but a quick glance at the code
> shows it uses software packet scheduling.  If you do too
> much processing in the URB completion callback (which
> is called from irq context), then the next bus transaction
> is delayed by that time.  See musb_advance_schedule() vs.
> musb_giveback() and also the big comment at the start
> of musb_core.c.  I guess if you want to make em28xx work
> with musb you need to move all processing to a tasklet
> or similar.  I think the issue you are seeing on musb has
> nothing to do with the issue on Intel EHCI.

Funny - I had the same thought a few hours ago about moving the actual
URB processing to a tasklet.  However I don't think it will help.
Here's why:

Expanding on the test where I cut out everything in the URB handler
except for the submit, I tweaked the code so that it still does the
accounting so that video buffers get delivered but it skips the
memcpy() call (which is probably 99% of the work).  This results in
the URB handler itself having essentially zero cost but empty buffers
get delivered to userland so the rest of the system behaves as it
normally would.  In this case I was still seeing corruption and
skipped microframes via the Beagle, suggesting that the reason the
corruption went away when I commented out the body of the completion
handler wasn't because the cost of executing the completion handler
went to zero, but rather because the rest of the system load had gone
down as well because no video buffers were being delivered to the rest
of the system.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Video corruption varies by system load

2013-07-11 Thread Devin Heitmueller
On Thu, Jul 11, 2013 at 11:28 AM, Alan Stern  wrote:
> On Thu, 11 Jul 2013, Johannes Stezenbach wrote:
>
>> I took a peek at the usbmon log, and there is one thing I don't get.
>>
>>   $ grep "C Zi:1:005:2" em28xx_usbmon.log | less -S
>>
>> There are several cases where the isoc descriptor actual length is
>> short (< 2892, e.g. 0, 552 or 1928), yet the actual_length for the
>> whole URB at the end of the line is still 64*2892 -- is that like it should 
>> be?
>
> Yes.  When one packet is short, the following packets don't get moved
> up in the buffer -- their offsets are fixed in advance.  The unused
> space at the end of the short packet remains, and it contributes to the
> total actual_length value for the entire URB.
>
>> Also, is it expected that the em28xx doesn't have data to send for
>> some microframes (for analog video capture e.g. during horizontal
>> or vertical blank interval)?  I guess in the analyzer that should
>> show as an IN token with no reply instead of DATA[012] replies.
>
> Not necessarily.  It might send random data, or a short (maybe even
> 0-length) reply.
>
> In fact, the usbmon trace does show some short packets (as you
> mentioned above) and even some 0-length packets.  For example, see
> timestamp 3976305153.
>
>> Can you confirm that in the error case there is no IN token at all?
>
> That's what the analyzer log showed.  Of course, it's possible that the
> analyzer itself was faulty, but I don't think so because Devin said the
> missing packets matched exactly with the corrupted data in the
> following microframes.
>
>> And another dumb question: Is there another USB video grabber
>> with different hardware which also uses isoc and works?
>> E.g. the cx231xx?  Or do you think it might have the same missing
>> microframe issue but copes better with it due to larger
>> buffer space on the device?
>
> The uvcvideo driver works.  But as you point out, the circumstances may
> be significantly different.
>
> Alan Stern
>

Sorry for the delayed response in contributing to this thread.  I've
been testing and gathering data all morning.

So a few new things:

It looks like I may have created some confusion by attempting to
correlate the usbmon trace against the Beagle data.  I've got two
environments I'm debugging on:  an x86 box with the stock Intel EHCI
HCD, and a Davinci TI8147 ARM system which has the musb HCD.   Because
I was seeing what visually looks like identical behavior on both
platforms, I had been actively going back and forth between them
during debugging.  In this case, the Beagle trace was collected
against the Davinci board but the usbmon traces were collected against
the x86 system.

I finally ran the Beagle against the x86 system, and found that in
fact I'm not getting the intermittent 250us gaps between microframes
when corruption occurs (which BTW is entirely consistent with the
usbmon trace).  Hence we've been trying to figure out why there was a
missing microframe in the Beagle trace but not in the usbmon trace,
when in fact that was a red-herring and there's no lost microframe to
begin with.

Looking closer at the Beagle trace on the x86 system -- while there
aren't 250us gaps prior to the corruption, there *is* a delay between
the SOF and the IN packet on the order of 30-50us.  This is present in
all instances where there is corruption, and in all other cases the IN
packet occurs within 1us of the SOF.

I suspect we've got the same basic issue on both platforms where there
is a delay between SOF and the IN packet, but the x86 system is so
much faster than the ti8147 platform that there's still time to grab
the microframe within the frame interval (albeit delayed), while on
the 8147 platform the HCD skips the microframe entirely (the HCD is
expected to skip a microframe if it concludes that it would not be
finished before the next SOF).

Because of the difference in behavior between x86 and ti1847, I
figured it would be a worthwhile exercise to go back to the 8147
platform and apply the ISO_ASAP fix that Alan mentioned yesterday, as
well as a fix for an incorrect spin_lock() call which I had tried on
the x86 box but hadn't yet applied to my 8147 tree (which out of dumb
luck, Ming Lei happened to submit a patch for this morning as well).

http://www.mail-archive.com/linux-usb@vger.kernel.org/msg24178.html

The ISO_ASAP change had no visible effect on the URB handling.
However the change from spin_lock() to spin_lock_irqsave() had a
unexpectedly *huge* effect.  Now essentially every frame is corrupted
and it would appear that the system can no longer keep up with the
stream.  It's not clear to me why this would cause any change in
behavior.  Probably also worth noting that with that change, I was not
getting *multiple* microframes being skipped (e.g. 250, 500, 750us
gaps), at a much higher frequency than before.

Another strange thing:  If I change the completion handler on the 8147
to comment out the actual processing of the URB (such that all the
completion han

Re: Video corruption varies by system load

2013-07-11 Thread Devin Heitmueller
On Thu, Jul 11, 2013 at 4:24 AM, Johannes Stezenbach  wrote:
> There are several cases where the isoc descriptor actual length is
> short (< 2892, e.g. 0, 552 or 1928), yet the actual_length for the
> whole URB at the end of the line is still 64*2892 -- is that like it should 
> be?

Yup, as Alan stated, this is expected behavior.

> Also, is it expected that the em28xx doesn't have data to send for
> some microframes (for analog video capture e.g. during horizontal
> or vertical blank interval)?  I guess in the analyzer that should
> show as an IN token with no reply instead of DATA[012] replies.
> Can you confirm that in the error case there is no IN token at all?
>
> And another dumb question: Is there another USB video grabber
> with different hardware which also uses isoc and works?
> E.g. the cx231xx?  Or do you think it might have the same missing
> microframe issue but copes better with it due to larger
> buffer space on the device?

I've been actively testing also with an au0828 device (HVR-950q),
which has a virtually identical URB completion handler (not shocking,
since I copied it from the em28xx driver when I wrote it in 2009).
The HVR-950q doesn't exhibit any of the problems described in this
thread though (presumably because it has a larger on-chip FIFO).

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 36/50] media: usb: em28xx: spin_lock in complete() cleanup

2013-07-11 Thread Devin Heitmueller
On Thu, Jul 11, 2013 at 5:05 AM, Ming Lei  wrote:
> Complete() will be run with interrupt enabled, so change to
> spin_lock_irqsave().
>
> Cc: Mauro Carvalho Chehab 
> Cc: linux-me...@vger.kernel.org
> Signed-off-by: Ming Lei 
> ---
>  drivers/media/usb/em28xx/em28xx-core.c |5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/media/usb/em28xx/em28xx-core.c 
> b/drivers/media/usb/em28xx/em28xx-core.c
> index fc157af..0d698f9 100644
> --- a/drivers/media/usb/em28xx/em28xx-core.c
> +++ b/drivers/media/usb/em28xx/em28xx-core.c
> @@ -941,6 +941,7 @@ static void em28xx_irq_callback(struct urb *urb)
>  {
> struct em28xx *dev = urb->context;
> int i;
> +   unsigned long flags;
>
> switch (urb->status) {
> case 0: /* success */
> @@ -956,9 +957,9 @@ static void em28xx_irq_callback(struct urb *urb)
> }
>
> /* Copy data from URB */
> -   spin_lock(&dev->slock);
> +   spin_lock_irqsave(&dev->slock, flags);
> dev->usb_ctl.urb_data_copy(dev, urb);
> -   spin_unlock(&dev->slock);
> +   spin_unlock_irqrestore(&dev->slock, flags);
>
> /* Reset urb buffers */
> for (i = 0; i < urb->number_of_packets; i++) {
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

I actually stumbled across this a couple of weeks ago, and have had an
identical patch running in a local dev tree.

Reviewed-by: Devin Heitmueller 
Tested-by: Devin Heitmueller 

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Video corruption varies by system load

2013-07-10 Thread Devin Heitmueller
On Wed, Jul 10, 2013 at 5:21 PM, Alan Stern  wrote:
>> Yeah, I tried that a few days ago.  I did a memset() on the
>> transfer_buffer prior to every resubmit, because at one point I
>> thought perhaps I was getting back the buffer without it having been
>> filled in.
>
> And you found that the buffer was getting filled in, although some of
> the data values were wrong?

Correct.

>> > Finally, compare the data received in the packets surrounding a gap
>> > with the data you see from the Beagle-480.  Look especially for signs
>> > indicating whether the gap represents a packet that was skipped over
>> > entirely, or whether the packet was delayed by one microframe and then
>> > all the following packets in the URB shifted back accordingly.  To do
>> > this, you'll need the full data stream recorded by wireshark.
>>
>> From a data perspective, I don't think that the data is being skipped.
>>  If the URB were being lost entirely then the rest of the video frame
>> would be skewed by the amount of the missing data.  The raw video
>> frame only tells me where the frame starts and ends, so if data is
>> missing then the video data ends up being out of alignment from that
>> point forward.  Throw a URB on the floor and the entire rest of the
>> frame will be shifted.  I'm receiving the correct number of bytes, but
>> the buffer contains the wrong data.
>>
>> It's probably also worth noting that I added some debug code to the
>> URB completion handler to inject a couple of pixels to mark where the
>> isoc packets start/end in the resulting video.  That allowed me to see
>> that the corrupt bytes always start on a packet boundary, and extend
>> only partially into that packet.  It's not like the entire packet is
>> corrupt - only a portion of it.  Once you get past the corrupt portion
>> of the packet, the correct video picks up right where it left off.
>
> Could that happen if no packet was transmitted?  The only way I can see
> would be if the buffer already contained valid-looking data and it
> didn't get overwritten, or if the buffer was overwritten with stale
> data left over from the preceding packet.  Can you rule out both of
> these possibilities?
>
> So far, your analysis of the received data buffer seems to show that
> the computer does receive data from the device, but the Beagle-480
> shows that no data was sent.  This is inconsistent.

I think I did not explain this clearly.  Let me try again:

It would appear that the data received by the host and announced in
the URB completion handler matches the data received by the Beagle.  I
see the incorrect bytes in the a microframe in both cases.  In fact
I've got a script that reconstructs the video frames based solely on
the Beagle trace, and I see the same video corruption as I'm seeing on
the host.  Given that, it is my assertion that the problem has nothing
to do with the way the HCD receives the buffers and passes them off to
the completion handler.

So one might ask:  why is the em28xx device sending a microframe with
corrupt bytes?  One thing I've noticed is immediately prior to any
microframe containing corruption, there was a missing microframe - the
time between the microframe containing corruption and the previously
received microframe was 250us, and not 125us as expected.  My
speculation is that the failure of the host controller to ask for that
missing microframe causes some confusion in the em28xx, which results
in it sending the wrong bytes in the next microframe.  Alternatively
the em28xx has some sort of circular buffer, and failing to ask for
the microframe causes the circular buffer to wrap around (so the next
microframe sent contains bad data).

>> This all raises a good question - does the usbmon or wireshark trace
>> show isoc data arriving every 125us?  If it does, then the host
>> controller is magically making up packets and announcing them to
>> em28xx even though they were never actually sent across the wire.
>> I'll have to take a closer look into that.
>
> usbmon and wireshark show essentially the same things; the only
> difference being that usbmon abbreviates the data (on the theory that
> excessive detail is rarely useful).  They both look only at URB
> submissions and completions; neither one can show what happened while
> an URB was in progress.
>
> So for example, your usbmon trace shows that the URBs did complete at
> intervals of 8000 us, although the timing wasn't entirely regular --
> some completions were delayed by as much as 120 us or so relative to
> their expected time.  The delays did not accumulate, however, and with
> a pipeline 40-ms long, they should not have mattered.  The microframe
> numbers in the trace show that the URBs were scheduled at regular
> intervals of 64 microframes.  Lack of errors indicates the controller
> hardware reported that it executed the transfers at the proper times.

Wireshark appears to be showing the URB timing, but really what we're
interested in is the microframe timing.  That sai

Re: Video corruption varies by system load

2013-07-10 Thread Devin Heitmueller
On Wed, Jul 10, 2013 at 3:37 PM, Alan Stern  wrote:
> 20-30 times each second?  Okay...  I didn't realize the errors were
> that frequent.

Yeah, now you see why I'm freaking out.  If this were one corrupt line
every 20 seconds, I wouldn't care less.  But there are lines every few
frames of video (and we're doing 30 frames/second).  This isn't the
sort of problem that is only apparent to a video engineer who knows
what to look for - It's highly visible even to non-technical viewers.

> It's rather disturbing that they don't show up at all in the usbmon
> trace.  This suggests two possibilities:
>
> The ehci-hcd driver is so messed up that not only did it fail
> to tell the hardware to send those missing packets; it also
> thought they had been sent and gotten replies!
>
> The controller hardware is so messed up that it failed to send
> the missing packets but then stored a status value indicating
> that they had been sent and replies had been received!

Either of these are possibilities.

> Both possibilities are rather unsettling.  Fortunately, they aren't
> exhaustive -- something else might be going on.
>
>> > If you want to investigate further, you can capture the entire data
>> > stream using Wireshark.  (Of course, that means capturing 22 MB/s of
>> > data.)
>>
>> Will that get me any new data/context I'm not already getting with the
>> bus analyzer?  I'm not against doing a Wireshark capture if it gets me
>> access to some internal state of the USB stack that wouldn't be
>> available to the Beagle, but at this point I've got scripts written to
>> parse the Beagle data so it's not clear what the benefit would be.
>
> Here's an approach that might prove informative.  Clear the transfer
> buffer to all zeros before resubmitting each URB (or store some other
> recognizable pattern in the buffer).  When checking the completed URB,
> verify that each packet not only has a 0 status value but also has the
> right actual_length value and has new data in the transfer buffer.

Yeah, I tried that a few days ago.  I did a memset() on the
transfer_buffer prior to every resubmit, because at one point I
thought perhaps I was getting back the buffer without it having been
filled in.

> Finally, compare the data received in the packets surrounding a gap
> with the data you see from the Beagle-480.  Look especially for signs
> indicating whether the gap represents a packet that was skipped over
> entirely, or whether the packet was delayed by one microframe and then
> all the following packets in the URB shifted back accordingly.  To do
> this, you'll need the full data stream recorded by wireshark.

>From a data perspective, I don't think that the data is being skipped.
 If the URB were being lost entirely then the rest of the video frame
would be skewed by the amount of the missing data.  The raw video
frame only tells me where the frame starts and ends, so if data is
missing then the video data ends up being out of alignment from that
point forward.  Throw a URB on the floor and the entire rest of the
frame will be shifted.  I'm receiving the correct number of bytes, but
the buffer contains the wrong data.

It's probably also worth noting that I added some debug code to the
URB completion handler to inject a couple of pixels to mark where the
isoc packets start/end in the resulting video.  That allowed me to see
that the corrupt bytes always start on a packet boundary, and extend
only partially into that packet.  It's not like the entire packet is
corrupt - only a portion of it.  Once you get past the corrupt portion
of the packet, the correct video picks up right where it left off.

This all raises a good question - does the usbmon or wireshark trace
show isoc data arriving every 125us?  If it does, then the host
controller is magically making up packets and announcing them to
em28xx even though they were never actually sent across the wire.
I'll have to take a closer look into that.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Video corruption varies by system load

2013-07-10 Thread Devin Heitmueller
Hi Alan,

On Wed, Jul 10, 2013 at 2:27 PM, Alan Stern  wrote:
> You inspired me to take a closer look at the usbmon log you made
> available.  There _is_ an error indication after all; the line with
> timestamp 397263317 got an error in one of its 64 packets (but this
> was the only error in the entire trace).  The trace doesn't say what
> the error was or which packet it occurred in.  This error should have
> shown up in your driver as a nonzero value for the packet status.

I probably should have better qualified my previous statement.  Every
once in a while there is an actual -EOVERFLOW condition, which will
result in that one frame being a bit corrupted.  Those cases do show
up in the URB handler via the status field.  That said though, the
problem I'm debugging happens 20-30 times a second, so I'm looking for
something much more prevalent than one error in a five second capture.

> If you want to investigate further, you can capture the entire data
> stream using Wireshark.  (Of course, that means capturing 22 MB/s of
> data.)

Will that get me any new data/context I'm not already getting with the
bus analyzer?  I'm not against doing a Wireshark capture if it gets me
access to some internal state of the USB stack that wouldn't be
available to the Beagle, but at this point I've got scripts written to
parse the Beagle data so it's not clear what the benefit would be.

>> > It could be that you're facing some sort of hardware limitation of the
>> > host controller.  Can you try running the test on a computer with a
>> > different brand of motherboard?
>>
>> Ruled that out already.  I'm seeing the behavior on my 2010 MacBook
>> Pro (Intel EHCI), a Dell XPS system (also Intel EHCI), and a TI 8147
>> Davinci embedded system (using the musb HCD),  I've also tried three
>> different Empia designs and they all exhibit the same behavior
>> (em2820, em2861, em2883).
>
> It's hard to see how they could all suffer from the same hardware bug.

Agreed.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Video corruption varies by system load

2013-07-10 Thread Devin Heitmueller
On Wed, Jul 10, 2013 at 12:30 PM, Alan Stern  wrote:
> If you use the bus analyzer at the same time, you could compare
> microframe numbers.

Ah, again, good suggestion.  I'll get a usbmon trace in parallel to
the Beagle.  I'll have to move some stuff around though because I
don't want to run the Beagle on the same system the em28xx device is
connected to.

> You're using 64 packets/URB, so each URB lasts 8 ms.  In addition, you
> have a pipeline of 5 URBs, for a total of 40 ms.  That should be
> plenty.

Yeah, it felt pretty conservative.  64 packets/URB is what the Windows
driver does, and I tried higher numbers of URBs (up to 12) with no
change in behavior.

> I don't see any problems in the usbmon trace, but they might not show
> up there.  In particular, the trace includes the status values only for
> the first 5 packets in each URB.  Does the driver encounter any packets
> with a nonzero status?

No.  I added code to the driver to do a printk() on any error
conditions (both at the URB level and at the isoc packet level), and
the log has been clean.

> It could be that you're facing some sort of hardware limitation of the
> host controller.  Can you try running the test on a computer with a
> different brand of motherboard?

Ruled that out already.  I'm seeing the behavior on my 2010 MacBook
Pro (Intel EHCI), a Dell XPS system (also Intel EHCI), and a TI 8147
Davinci embedded system (using the musb HCD),  I've also tried three
different Empia designs and they all exhibit the same behavior
(em2820, em2861, em2883).

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Video corruption varies by system load

2013-07-10 Thread Devin Heitmueller
On Wed, Jul 10, 2013 at 11:40 AM, Alan Stern  wrote:
>> Nope, that wasn't it.  I am now only setting ISO_ASAP in the first
>> packet, and I tried both leaving it as in on resubmit and clearing the
>> flag prior to resubmit.
>
> usbmon is the best debugging tool at this point.

http://www.devinheitmueller.com/em28xx_usbmon.log

(only about 290KB).

I'm just starting to read through it now, but figured it would be
better to send it along while doing such.  Probably the biggest issue
at this point is that while I was definitely seeing the corruption on
the screen while creating the trace, I don't yet have a way to
correlate that corruption to a particular spot in the usbmon trace.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Video corruption varies by system load

2013-07-10 Thread Devin Heitmueller
On Wed, Jul 10, 2013 at 10:58 AM, Devin Heitmueller
 wrote:
>> I bet the problem is related to the usage of the URB_ISO_ASAP flag.
>> em28xx_alloc_urbs() sets URB_ISO_ASAP in urb->transfer_flags, and the
>> value never gets cleared.  In fact, that flag bit is supposed to be set
>> only in the first URB of a stream, not in the following URBs.  (The
>> same mistake is present for the URBs in the audio stream.)
>
> Wow, REALLY?  Ok, if that's the case then I will fix that and see if
> it makes any difference.

Nope, that wasn't it.  I am now only setting ISO_ASAP in the first
packet, and I tried both leaving it as in on resubmit and clearing the
flag prior to resubmit.

In retrospect, it seems a bit strange that if that were the problem
the behavior would have changed when introducing system load.

Regardless, I'm happy to fix any conformance issues you see with the
driver (even if they don't fix the problem at hand).

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Video corruption varies by system load

2013-07-10 Thread Devin Heitmueller
Hi Alan,

On Wed, Jul 10, 2013 at 10:48 AM, Alan Stern  wrote:
> Digging into the scheduling code probably won't help much.  However you
> could try collecting a usbmon trace (see Documentation/usb/usbmon.txt).
> This would clearly show the timing of URB submissions and completions.

Good suggestion.  I'll do that.

>> I'll be the first to admit that this isn't my particular area of
>> expertise - so if I've made some stupid assumption about the expected
>> behavior for the URB timing on the bus, don't hesitate to point that
>> out.
>
> The transfers should occur regularly at 1-microframe intervals.  If
> they don't then something is wrong somewhere.  But I think the problem
> is more likely to lie in the upper-level driver than in ehci-hcd.
> You're using the em28xx driver?

Yes.

> At first glance, there is one obvious bug in that driver (probably not
> at all related to your problem, though).  The em28xx_irq_callback()
> routine should not set urb->status.

I noticed that already and have it fixed in my local tree.  It'll be
in my next pull request.  That said, fixing that had no effect.

> I bet the problem is related to the usage of the URB_ISO_ASAP flag.
> em28xx_alloc_urbs() sets URB_ISO_ASAP in urb->transfer_flags, and the
> value never gets cleared.  In fact, that flag bit is supposed to be set
> only in the first URB of a stream, not in the following URBs.  (The
> same mistake is present for the URBs in the audio stream.)

Wow, REALLY?  Ok, if that's the case then I will fix that and see if
it makes any difference.  That really should be documented somewhere
because I've seen it done that way in a bunch of different drivers
(and in fact done it myself that way in several drivers I wrote in the
media tree).

Just so I'm understanding what is supposed to be the expected behavior
- so it should be set in the first URB, but what about when we
resubmit the URBs? Should I be clearing the flag prior to resubmitting
the first URB (since it will be unchanged)?  Or is the expected
behavior that I set it on the first URB, and then nothing is supposed
to ever touch transfer flags on any URB from that point forward?

While on that topic, I'm clearing the status and actual_length fields
in all of the iso_frame_desc[] fields of the URB prior to resubmitting
- should I be doing that?  Are there other fields I should be
resetting?

Thank you so much for your help!

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Video corruption varies by system load

2013-07-09 Thread Devin Heitmueller
On Sun, Jul 7, 2013 at 9:39 PM, Devin Heitmueller
 wrote:
> I finally dug out my Beagle 480 USB, so I will get that hooked up this
> week, write a decoder to reassemble the video frames based on the USB
> trace, and know once and for all whether the device is delivering
> correct video or not.  If the video being delivered by the device has
> no corruption, then we're talking about some sort of memory
> consistency or DMA issue (or perhaps some sort of problem with the USB
> core populating the finished URBs before calling the completion
> handler).  If the video coming down the bus is corrupted, then we're
> probably talking about some sort of timing problem with the URB
> submission (combined with the FIFO on the chip poorly handling the
> incorrect timing).

So I hooked up the video and wrote a bit of Perl to parse the ISOC
stream and render the underlying video frames.  I can see definitively
that the video returned from the device contains the corruption.  This
rules out any sort of DMA or memory related issue (proving that the
data is not being mangled by the host on receipt).

Now that I have the raw USB trace though including timing data, I
started looking at the actual underlying ISOC traffic at the time of
the corruption, and found something interesting:  Despite having five
URBs queued at all times with an interval of 1, there are cases where
the URB isn't being sent.  The corruption consistently follows one of
these intervals where a URB was skipped.  We're expecting the host
controller to request to pull the buffer every 125us, and in instances
where the corruption is exhibited immediately follow a 250us gap
between URBs.

See attached screenshot:

http://devinheitmueller.com/isoc_loss.png

Packet 27082 is the packet that contains the corruption.  The previous
URB was received exactly 250us prior (whereas it should have been only
125us).  349.594 - 349.344 = 250.

I suspect the FIFO is overflowing on the chip as a result of the host
controller not asking for the buffer when it's supposed to.  It's
worth mentioning that the "corrupt bytes" are actually also found
several packets later in the correct place, suggesting the chip is
probably employing some sort of circular buffer which is wrapping
around.

So should I be digging into the EHCI URB scheduling code?  Any
suggestions on where else I should be poking around would be very
welcome.

I'll be the first to admit that this isn't my particular area of
expertise - so if I've made some stupid assumption about the expected
behavior for the URB timing on the bus, don't hesitate to point that
out.

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Video corruption varies by system load

2013-07-07 Thread Devin Heitmueller
Hi Alan,

Thanks for taking the time to provide feedback.  I'm just noticing now
that I left off the subject line, which all the more reason makes me
thankful that you bothered to read an email with as uninteresting a
subject line as is possible.  :-)

On Tue, Jul 2, 2013 at 11:21 AM, Alan Stern  wrote:
> That's weird.  But it might not be the scheduler so much; it could be
> related more to the total CPU load.

Sorry, I didn't mean to suggest the scheduler itself was at fault -
just that the high context switching may be changing the timing in a
way that exposes a missing spinlock in usb core or increases the
likelihood that a call that is sleepable is taking the path of the
context switch.

>> The problem has been seen on both the stock EHCI driver (on x86) as
>> well as the musb driver used on the TI Davinci platform (ARM).  The
>> transfer buffer itself is being allocated using usb_alloc_coherent(),
>> and I've seen it when allocating with vmalloc() as well.
>
> Do you mean kmalloc()?  Memory allocated with vmalloc() is generally
> not suitable for DMA mapping.

Yeah, typo.  Sorry.

>> This feels like some sort of DMA or cache related issue, since the
>> behavior of the URB completion handler itself appears completely
>> consistent regardless of the system load.  I'm seeing the issue on
>> 3.10-rc6 all the way back to 2.6.31 (the earliest I can go on my
>> Ubuntu box given some udev related dependencies).
>>
>> I've done plenty of work on USB drivers under Linux over the years,
>> but haven't dug too much into the USB core.  Anybody who has any
>> suggestions on how to debug such a timing problem, such suggestions
>> would be very welcomed.
>
> This is an interesting problem, but I don't think you'll get much
> insight from looking at the USB side of things.  You could try asking
> the people in charge of the DMA- and cache-related parts of the kernel.

I finally dug out my Beagle 480 USB, so I will get that hooked up this
week, write a decoder to reassemble the video frames based on the USB
trace, and know once and for all whether the device is delivering
correct video or not.  If the video being delivered by the device has
no corruption, then we're talking about some sort of memory
consistency or DMA issue (or perhaps some sort of problem with the USB
core populating the finished URBs before calling the completion
handler).  If the video coming down the bus is corrupted, then we're
probably talking about some sort of timing problem with the URB
submission (combined with the FIFO on the chip poorly handling the
incorrect timing).

Anyway, thanks for your suggestions.

Cheers,

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[no subject]

2013-07-01 Thread Devin Heitmueller
Hi all,

I've been doing some debugging of a video corruption problem in the
em28xx video capture driver, and after a couple of weeks of digging
in, I think I might have exposed some sort race condition in the USB
core.

http://devinheitmueller.com/out17.png

I'm submitting URBs of 64 x 3072, I've got five URBs queued at any
given time, and I'm getting highly intermittent lines in the video
(example shown above).  The offset for the corruption is inconsistent,
but the actual corruption appears to be a different part of the same
transfer_buffer, and it is always at the beginning of the individual
packet within the URB.  I've stripped down the URB handler to the bare
minimum, removing essentially everything except for a simple memcpy()
to the output buffer (and incrementing the output buffer pointer by
the isoc actual_length received).  I've also done memset(0) calls for
the URB transfer buffer before submitting, as well as the output
buffer (in an effort to distinguish between writing the wrong data and
not writing any data to the output buffer for the area in question).
And I've checked the math to ensure that the offsets and lengths being
sent to in the usb_submit_urb() call are all what's expected.

Here's where things get interesting/scary.  If I run the following
command while video is streaming, the whole problem magically
disappears:

while [ 1 ]; do /bin/true; done

There's something going on in the scheduler which is effecting the
state of the urb->transfer_buffer being sent to the completion
handler.

The problem has been seen on both the stock EHCI driver (on x86) as
well as the musb driver used on the TI Davinci platform (ARM).  The
transfer buffer itself is being allocated using usb_alloc_coherent(),
and I've seen it when allocating with vmalloc() as well.

This feels like some sort of DMA or cache related issue, since the
behavior of the URB completion handler itself appears completely
consistent regardless of the system load.  I'm seeing the issue on
3.10-rc6 all the way back to 2.6.31 (the earliest I can go on my
Ubuntu box given some udev related dependencies).

I've done plenty of work on USB drivers under Linux over the years,
but haven't dug too much into the USB core.  Anybody who has any
suggestions on how to debug such a timing problem, such suggestions
would be very welcomed.

Thanks in advance,

Devin

-- 
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html