Re: reeze while write on external usb 3.0 hard disk [Bug 204095]

2019-08-26 Thread Piergiorgio Sartor
On Tue, Aug 20, 2019 at 06:37:22PM +0200, Piergiorgio Sartor wrote:
> On Tue, Aug 20, 2019 at 09:23:26AM +0200, Christoph Hellwig wrote:
> > On Mon, Aug 19, 2019 at 10:14:25AM -0400, Alan Stern wrote:
> > > Let's bring this to the attention of some more people.
> > > 
> > > It looks like the bug that was supposed to be fixed by commit
> > > d74ffae8b8dd ("usb-storage: Add a limitation for
> > > blk_queue_max_hw_sectors()"), which is part of 5.2.5, but apparently
> > > the bug still occurs.
> > 
> > Piergiorgio,
> > 
> > can you dump the content of max_hw_sectors_kb file for your USB storage
> > device and send that to this thread?
> 
> Hi all,
> 
> for both kernels, 5.1.20 (working) and 5.2.8 (not working),
> the content of /sys/dev/x:y/queue/max_hw_sectors_kb is 512
> for USB storage devices (2.0 and 3.0).
> 
> This is for the PC showing the issue.
> 
> In an other PC, which does not show the issus at the moment,
> the values are 120, for USB2.0, and 256, for USB3.0.

Hi again,

any news on this?

Is there anything I can do to help?

Should I report this somewhere else too?

Currently this is quite a huge problem for me,
since the only working external storage is an
old 1394 HDD...

Thanks,

bye,

-- 

piergiorgio


Re: reeze while write on external usb 3.0 hard disk [Bug 204095]

2019-08-20 Thread Piergiorgio Sartor
On Tue, Aug 20, 2019 at 09:23:26AM +0200, Christoph Hellwig wrote:
> On Mon, Aug 19, 2019 at 10:14:25AM -0400, Alan Stern wrote:
> > Let's bring this to the attention of some more people.
> > 
> > It looks like the bug that was supposed to be fixed by commit
> > d74ffae8b8dd ("usb-storage: Add a limitation for
> > blk_queue_max_hw_sectors()"), which is part of 5.2.5, but apparently
> > the bug still occurs.
> 
> Piergiorgio,
> 
> can you dump the content of max_hw_sectors_kb file for your USB storage
> device and send that to this thread?

Hi all,

for both kernels, 5.1.20 (working) and 5.2.8 (not working),
the content of /sys/dev/x:y/queue/max_hw_sectors_kb is 512
for USB storage devices (2.0 and 3.0).

This is for the PC showing the issue.

In an other PC, which does not show the issus at the moment,
the values are 120, for USB2.0, and 256, for USB3.0.

bye,

-- 

piergiorgio


reeze while write on external usb 3.0 hard disk [Bug 204095]

2019-08-17 Thread Piergiorgio Sartor
Hi all,

bug 204095 from bugzilla.kernel.org was closed,
but apparently something is left unfixed.

Ref.: https://bugzilla.kernel.org/show_bug.cgi?id=204095

Below the two last entries I did about the topic:

--- --- ---

I've Fedora 30 and there was an update, from kernel 5.1.20
to 5.2.5, and now I've the logs full of:

ehci-pci :00:13.2: swiotlb buffer is full (sz: 327680 bytes), total 32768 
(slots), used 97 (slots)

Or similarly:

ehci-pci :00:13.2: swiotlb buffer is full (sz: 315392 bytes), total 32768 
(slots), used 103 (slots)

This happens whenever I access an external USB3.0 drive
connected to a USB2.0 port, I'm not sure if this makes
any difference.

It is enough something like "hdparm -t /dev/sdX" to trigger
the above (and cause a lock-up of the command and 100% CPU
load for "usb-storage").
Nevertheless, it seems that the first access, likely few bytes
read, is successful since the device is recognize by the file
manager.
Anything else, (large) file copy causes the issue.

Could it be the same bug? That is the fix for 5.2.5 does not
fix it completely?

--- --- ---

Some more comments.

I tested a different PC, this does not show any problem, neither
with the "old" 5.1.20 nor with the new 5.2.5 kernel.

Second, the original PC still have problems, here below the log
output at the moment of the "crash" due to "hdparm -t /dev/sdX":

[   47.212609] xhci_hcd :02:00.0: swiotlb buffer is full (sz: 421888 
bytes), total 32768 (slots), used 1 (slots)
[   47.212620] xhci_hcd :02:00.0: overflow 0x000383a19000+421888 of DMA 
mask  bus mask 0
[   47.212646] WARNING: CPU: 0 PID: 454 at kernel/dma/direct.c:43 
report_addr+0x33/0x60
[   47.212649] Modules linked in: fuse cfg80211 rfkill nf_log_ipv4 
nf_log_common ipt_REJECT nf_reject_ipv4 xt_state xt_conntrack xt_owner xt_LOG 
iptable_filter nf_conntrack_ftp xt_CT nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 iptable_raw hwmon_vid sunrpc dm_crypt dm_cache_smq 
nvidia_drm(POE) dm_cache nvidia_modeset(POE) dm_persistent_data dm_bio_prison 
libcrc32c nvidia_uvm(OE) amd64_edac_mod edac_mce_amd kvm_amd ccp kvm 
snd_hda_codec_hdmi irqbypass joydev crct10dif_pclmul nvidia(POE) crc32_pclmul 
ghash_clmulni_intel snd_hda_codec_via snd_hda_codec_generic ledtrig_audio 
snd_hda_intel drm_kms_helper wmi_bmof snd_hda_codec fam15h_power k10temp 
snd_hda_core sp5100_tco snd_hwdep i2c_piix4 snd_seq snd_seq_device drm snd_pcm 
snd_timer ipmi_devintf snd ipmi_msghandler asus_atk0110 soundcore pcc_cpufreq 
acpi_cpufreq vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) binfmt_misc 
ip_tables firewire_sbp2 raid1 raid10 crc32c_intel firewire_ohci firewire_core 
serio_raw crc_itu_t r8169 wmi uas
[   47.212719]  usb_storage hid_logitech ff_memless ecryptfs
[   47.212734] CPU: 0 PID: 454 Comm: usb-storage Tainted: P   OE 
5.2.6-200.fc30.x86_64 #1
[   47.212738] Hardware name: System manufacturer System Product 
Name/M5A78L-M/USB3, BIOS 210112/02/2014
[   47.212744] RIP: 0010:report_addr+0x33/0x60
[   47.212750] Code: 48 8b 87 28 02 00 00 48 89 34 24 48 85 c0 74 2d 4c 8b 00 
b8 fe ff ff ff 49 39 c0 76 14 80 3d a3 ae 41 01 00 0f 84 a9 06 00 00 <0f> 0b 48 
83 c4 08 c3 48 83 bf 38 02 00 00 00 74 ef eb e0 80 3d 84
[   47.212754] RSP: 0018:a85941bffbc8 EFLAGS: 00010246
[   47.212758] RAX:  RBX: 0002 RCX: 
[   47.212762] RDX: 0007 RSI: 0092 RDI: 9a45af617900
[   47.212765] RBP: 9a45accc00b0 R08: a85941bff92d R09: 03a3
[   47.212769] R10: b6bece08 R11: a85941bff92d R12: 00067000
[   47.212772] R13: 0002 R14:  R15: 9a45a40502b0
[   47.212777] FS:  () GS:9a45af60() 
knlGS:
[   47.212780] CS:  0010 DS:  ES:  CR0: 80050033
[   47.212784] CR2: 55b215f34c50 CR3: 0003d622e000 CR4: 000406f0
[   47.212787] Call Trace:
[   47.212797]  dma_direct_map_page+0xdf/0xf0
[   47.212803]  dma_direct_map_sg+0x67/0xb0
[   47.212811]  usb_hcd_map_urb_for_dma+0x343/0x530
[   47.212817]  usb_hcd_submit_urb+0x9a/0xbb0
[   47.212824]  ? schedule_timeout+0x209/0x300
[   47.212829]  ? usb_hcd_submit_urb+0xbf/0xbb0
[   47.212835]  ? __switch_to_asm+0x40/0x70
[   47.212840]  ? __switch_to_asm+0x34/0x70
[   47.212845]  ? _cond_resched+0x15/0x30
[   47.212852]  ? __kmalloc+0x16c/0x210
[   47.212857]  ? _cond_resched+0x15/0x30
[   47.212863]  ? usb_alloc_urb+0x35/0x60
[   47.212867]  usb_sg_wait+0x7e/0x150
[   47.212879]  usb_stor_bulk_transfer_sglist.part.0+0x64/0xb0 [usb_storage]
[   47.212886]  usb_stor_bulk_srb+0x49/0x80 [usb_storage]
[   47.212893]  usb_stor_Bulk_transport+0x163/0x3e0 [usb_storage]
[   47.212898]  ? schedule+0x33/0x90
[   47.212905]  usb_stor_invoke_transport+0x3a/0x500 [usb_storage]
[   47.212910]  ? wait_for_completion_interruptible+0x156/0x1a0
[   47.212915]  ? wake_up_q+0x60/0x60
[   47.212922]  usb_stor_control_thread+0x183/0x280 [u

Re: USB issue with kernel 3.6

2013-01-26 Thread Piergiorgio Sartor
On Sat, Jan 26, 2013 at 10:00:28AM -0500, Alan Stern wrote:
[...]
> What's the kernel bugzilla entry number?

Hi Alan,

sorry, forgot that, here it is the link:

https://bugzilla.kernel.org/show_bug.cgi?id=50381

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2013-01-26 Thread Piergiorgio Sartor
On Fri, Jan 25, 2013 at 03:43:38PM -0500, Alan Stern wrote:
[...]
> Okay, good.  I'll submit this for inclusion in the stable kernels.

Hi Alan,

one more item.

How we do proceed from the administrative point of view?

There is a bug open in Fedora (Red Hat) Bugzilla and one
in the Linux kernel Bugzilla.

I guess these will have to be closed, sooner or later.

For the Fedora one, I can take care to track the situation
and prompt the action.
How about the Linux kernel one?
Will you do? Should I do?

Thanks,

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2013-01-25 Thread Piergiorgio Sartor
On Tue, Jan 22, 2013 at 12:45:32PM -0500, Alan Stern wrote:
[...]
> Yes, remove all the previous patches before applying this one.

Hi Alan,

I tested the kernel with the latest patch you provide.

I was not able to reproduce the problem, so it seems
this change is fine.

The only comment I've is that I'm under the impression
that unlinking multiple queue element together was
giving a slightly better perfomance, under certain
trafic conditions (when working).

I guess, if you're confident the code does not have
other different issues, you can push it upstream.

Thanks again for the support,

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2013-01-22 Thread Piergiorgio Sartor
Hi Alan,

On Tue, Jan 22, 2013 at 11:09:43AM -0500, Alan Stern wrote:
> On Fri, 18 Jan 2013, Alan Stern wrote:
> 
> > Well, I guess there's no choice but to go back to the one-at-a-time 
> > approach.  I'll write a proper patch for that.
> 
> And here it is at last.  This completely replaces the earlier patch.

thanks for the patch, this will probably be week-end task.

Usual question, should I remove the previous two patches
or does this apply on top of them?
You wrote "replaces", I assume the second option is correct,
but I would prefere to confirm...

Thanks,

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2013-01-18 Thread Piergiorgio Sartor
Hi Alan,

On Wed, Jan 16, 2013 at 10:55:50AM -0500, Alan Stern wrote:
[...]
> I changed my mind -- it turns out that adding the fix is somewhat
> easier than unlinking one QH at a time.
> 
> So now we're ready for some serious testing.  The patch below is based 
> on the 3.7 kernel, and it doesn't include any of the debugging stuff 
> you have been using.  Remove all the old patches and apply this one 
> instead.  It has two changes: the increase in the schedule polling time 
> and the fix for multiple unlinks.
> 
> If this causes the problem to go away then I will submit it for 
> inclusion in the stable kernel series.

I tried the patch below with kernel 3.7, I updated
my local git copy to that version.

I'm sorry to say it does not seem to fix the issue.

Some explanation, I made three trials, with reboot
inbetween, of course.
The first one got the error -110 almost immediately,
the second did not show any problem, even with some
load of the CPU, the third did show the error, but
after some time, specifically after I loaded the
I/O sub system with a kernel recompilation.
Note that, during the compilation (make -j5) the transfer
rate of the USB HDDs went up and down, with about a
decrease of 10% from the usual one (from 4100 KB/sec to
3700 KB/sec) as minimum. This I did not notice before,
but maybe I just did not notice...

I also tried to run twice the USB HDDs reading, in
this case, while the problem did not show up, I could
see that the transfer rate did not change, maybe the
multiple queue operation did have some positive effects.

Nevertheless, bottom line is that I got the problem two
out of three times, so it seems to me the fix was not
completely effective.

I've some log, but there's a lot of garbage in it, I
esitate to post here or on the bugzilla page, if you
want I can dig something or I can retry in order to
get a cleaner one.

Hope this helps,

bye,

pg

> Alan Stern
> 
> 
> 
> Index: 3.7/drivers/usb/host/ehci-timer.c
> ===
> --- 3.7.orig/drivers/usb/host/ehci-timer.c
> +++ 3.7/drivers/usb/host/ehci-timer.c
> @@ -113,21 +113,22 @@ static void ehci_poll_ASS(struct ehci_hc
>  
>   if (want != actual) {
>  
> - /* Poll again later, but give up after about 20 ms */
> - if (ehci->ASS_poll_count++ < 20) {
> - ehci_enable_event(ehci, EHCI_HRTIMER_POLL_ASS, true);
> - return;
> - }
> - ehci_dbg(ehci, "Waited too long for the async schedule status 
> (%x/%x), giving up\n",
> - want, actual);
> + /* Poll again later */
> + ehci_enable_event(ehci, EHCI_HRTIMER_POLL_ASS, true);
> + ++ehci->ASS_poll_count;
> + return;
>   }
> +
> + if (ehci->ASS_poll_count > 20)
> + ehci_dbg(ehci, "ASS poll count reached %d\n",
> + ehci->ASS_poll_count);
>   ehci->ASS_poll_count = 0;
>  
>   /* The status is up-to-date; restart or stop the schedule as needed */
>   if (want == 0) {/* Stopped */
> - if (ehci->async_count > 0)
> + if (ehci->async_count > 0) {
>   ehci_set_command_bit(ehci, CMD_ASE);
> -
> + }
>   } else {/* Running */
>   if (ehci->async_count == 0) {
>  
> @@ -159,14 +160,14 @@ static void ehci_poll_PSS(struct ehci_hc
>  
>   if (want != actual) {
>  
> - /* Poll again later, but give up after about 20 ms */
> - if (ehci->PSS_poll_count++ < 20) {
> - ehci_enable_event(ehci, EHCI_HRTIMER_POLL_PSS, true);
> - return;
> - }
> - ehci_dbg(ehci, "Waited too long for the periodic schedule 
> status (%x/%x), giving up\n",
> - want, actual);
> + /* Poll again later */
> + ehci_enable_event(ehci, EHCI_HRTIMER_POLL_PSS, true);
> + return;
>   }
> +
> + if (ehci->PSS_poll_count > 20)
> + ehci_dbg(ehci, "PSS poll count reached %d\n",
> + ehci->PSS_poll_count);
>   ehci->PSS_poll_count = 0;
>  
>   /* The status is up-to-date; restart or stop the schedule as needed */
> Index: 3.7/drivers/usb/host/ehci-q.c
> ===
> --- 3.7.orig/drivers/usb/host/ehci-q.c
> +++ 3.7/drivers/usb/host/ehci-q.c
> @@ -1174,6 +1174,18 @@ submit_async (
>  static void single_unlink_async(struct ehci_hcd *ehci, struct ehci_qh *qh)
>  {
>   struct ehci_qh  *prev;
> + __hc32  dma = QH_NEXT(ehci, qh->qh_dma);
> + __hc32  dma_next = qh->hw->hw_next;
> +
> + /* No QH on the unlink lists should point to qh */
> + for (prev = ehci->async_unlink; prev; prev = prev->unlink_next) {
> + if (prev->hw->hw_next == dma)
> +   

Re: USB issue with kernel 3.6

2013-01-14 Thread Piergiorgio Sartor
Hi Alan,

On Mon, Jan 14, 2013 at 10:17:52AM -0500, Alan Stern wrote:
> On Sun, 13 Jan 2013, Piergiorgio Sartor wrote:
> 
> > Hi Alan,
> > 
> > Happy New Year!
> 
> I hope you had a good holiday break.

thanks, holiday was fine, I trust you had some rest too.

[...]
> > Next step?
> 
> Okay, I figured out what the problem is.  It really is a bug in 
> ehci-hcd; the driver doesn't do what the spec requires when unlinking 
> more than one QH at a time.

Well, I'm happy to read this. Good that you find it!

It puzzles me I'm the only one, it seems, affected by this!

> Although the driver could be fixed, it will be easier to stick with the
> old approach and never unlink more than one QH.  Doing two or more at
> once is pretty rare, so it doesn't much matter if they take a little
> longer.  I'll send a patch that does this properly in the near future,
> for you to test.

Cool! I'm looking forward to it.

Thanks for the support,

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2013-01-13 Thread Piergiorgio Sartor
Hi Alan,

Happy New Year!

I removed the previous short patch to ehci-q.c and applied
this second one.

After some extensive testing, I could not reproduce the
issue and dmesg showed the ususl "async off" and "async on"
messages, the second as soon as I started to read the HDDs.

Performance wise, it seems to me better than ever, I mean
I saw regular transfer of above 4000 KB/sec per disk.

Even with 2 processes reading the 10 HDDs, the data rate
was about 3800 KB/sec.

Next step?

Thanks,

bye,

pg

On Mon, Dec 17, 2012 at 02:59:21PM -0500, Alan Stern wrote:
> On Mon, 17 Dec 2012, Piergiorgio Sartor wrote:
> 
> > Hi Alan,
> > 
> > I applied the latest patch, changing the unlinking
> > behavior of EHCI.
> > 
> > I tested quite extensively and I could not see the
> > issue (as expected). At a certain point I lost 3 HDDs,
> > but this is something I saw before (it happens very
> > seldom, maybe future debugging) and it was not really
> > locking the EHCI driver.
> > 
> > Another thing is that the read performance was around
> > about 10% slower that the usual one.
> > I saw 3700~3800 KB/sec per HDD, while, under normal
> > conditions, I saw 4100~4200 KB/sec per HDD.
> 
> The slower performance might be caused by all the idle QHs remaining on 
> the async list.  The controller has to look at them because they are on 
> the list, even though they aren't being used, and that could slow it 
> down.
> 
> > I think that's all, how do you want to proceed now?
> 
> Below is another patch, which is meant to replace the previous one.  
> This changes the behavior to be like the driver was before the
> troublesome commit: Only one idle QH will be unlinked from the async
> schedule at a time.
> 
> > Please note that shortly I'll not be anymore with
> > my PC, until January, so we can take it easy.
> > 
> > In this respect, I wish you Merry Christmas and
> > Happy New Year.
> 
> Enjoy your holidays too.
> 
> Alan Stern
> 
> 
> 
> Index: usb-3.7/drivers/usb/host/ehci-q.c
> ===
> --- usb-3.7.orig/drivers/usb/host/ehci-q.c
> +++ usb-3.7/drivers/usb/host/ehci-q.c
> @@ -1279,8 +1279,9 @@ static void unlink_empty_async(struct eh
>  
>   if (list_empty(&qh->qtd_list) &&
>   qh->qh_state == QH_STATE_LINKED) {
> - if (!stopped && qh->unlink_cycle ==
> - ehci->async_unlink_cycle)
> + if (!stopped && (qh->unlink_cycle ==
> + ehci->async_unlink_cycle ||
> + ehci->async_unlink))
>   check_unlinks_later = true;
>   else
>   single_unlink_async(ehci, qh);

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-12-18 Thread Piergiorgio Sartor
Hi Alan,

just one short question before holidays (and forgetting it...).

Is this latest patch to be applied on top of all the others, or
should I remove the previous one? Was same file, I think.

Thanks,

bye,

pg

On Mon, Dec 17, 2012 at 02:59:21PM -0500, Alan Stern wrote:
> On Mon, 17 Dec 2012, Piergiorgio Sartor wrote:
> 
> > Hi Alan,
> > 
> > I applied the latest patch, changing the unlinking
> > behavior of EHCI.
> > 
> > I tested quite extensively and I could not see the
> > issue (as expected). At a certain point I lost 3 HDDs,
> > but this is something I saw before (it happens very
> > seldom, maybe future debugging) and it was not really
> > locking the EHCI driver.
> > 
> > Another thing is that the read performance was around
> > about 10% slower that the usual one.
> > I saw 3700~3800 KB/sec per HDD, while, under normal
> > conditions, I saw 4100~4200 KB/sec per HDD.
> 
> The slower performance might be caused by all the idle QHs remaining on 
> the async list.  The controller has to look at them because they are on 
> the list, even though they aren't being used, and that could slow it 
> down.
> 
> > I think that's all, how do you want to proceed now?
> 
> Below is another patch, which is meant to replace the previous one.  
> This changes the behavior to be like the driver was before the
> troublesome commit: Only one idle QH will be unlinked from the async
> schedule at a time.
> 
> > Please note that shortly I'll not be anymore with
> > my PC, until January, so we can take it easy.
> > 
> > In this respect, I wish you Merry Christmas and
> > Happy New Year.
> 
> Enjoy your holidays too.
> 
> Alan Stern
> 
> 
> 
> Index: usb-3.7/drivers/usb/host/ehci-q.c
> ===
> --- usb-3.7.orig/drivers/usb/host/ehci-q.c
> +++ usb-3.7/drivers/usb/host/ehci-q.c
> @@ -1279,8 +1279,9 @@ static void unlink_empty_async(struct eh
>  
>   if (list_empty(&qh->qtd_list) &&
>   qh->qh_state == QH_STATE_LINKED) {
> - if (!stopped && qh->unlink_cycle ==
> - ehci->async_unlink_cycle)
> + if (!stopped && (qh->unlink_cycle ==
> + ehci->async_unlink_cycle ||
> + ehci->async_unlink))
>   check_unlinks_later = true;
>   else
>   single_unlink_async(ehci, qh);

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-12-17 Thread Piergiorgio Sartor
Hi Alan,

I applied the latest patch, changing the unlinking
behavior of EHCI.

I tested quite extensively and I could not see the
issue (as expected). At a certain point I lost 3 HDDs,
but this is something I saw before (it happens very
seldom, maybe future debugging) and it was not really
locking the EHCI driver.

Another thing is that the read performance was around
about 10% slower that the usual one.
I saw 3700~3800 KB/sec per HDD, while, under normal
conditions, I saw 4100~4200 KB/sec per HDD.

I think that's all, how do you want to proceed now?

Please note that shortly I'll not be anymore with
my PC, until January, so we can take it easy.

In this respect, I wish you Merry Christmas and
Happy New Year.

bye,

pg

On Mon, Dec 17, 2012 at 10:57:48AM -0500, Alan Stern wrote:
> On Sun, 16 Dec 2012, Piergiorgio Sartor wrote:
> 
> > Hi Alan,
> > 
> > I applied the modifications you suggested, this is the output:
> > 
> > [  110.922009] ehci_hcd :00:0b.1: async off
> > [  209.362134] ehci_hcd :00:0b.1: async on
> > [  241.794760] ehci_hcd :00:0b.1: alan start  cur time 4294908992 last 
> > scan 4294878606
> > [  241.794774] ehci_hcd :00:0b.1: command 10035 cmd reg 10035 status 
> > reg e008
> > [  241.794779] ehci_hcd :00:0b.1: async list:
> > [  241.794787] ehci_hcd :00:0b.1:  qh 880149fe7800 hw 
> > 8800af80 dma af80 next 880146ce3f80 hw_next af800c62
> > [  241.794795] ehci_hcd :00:0b.1:  qh 880146ce3f80 hw 
> > 8800af800c60 dma af800c60 next 880146ce3780 hw_next af800d22
> > [  241.794801] ehci_hcd :00:0b.1:  qh 880146ce3780 hw 
> > 8800af800d20 dma af800d20 next 880147776680 hw_next af800ae2
> > [  241.794808] ehci_hcd :00:0b.1:  qh 880147776680 hw 
> > 8800af800ae0 dma af800ae0 next 8801478bf380 hw_next af800a22
> > [  241.794815] ehci_hcd :00:0b.1:  qh 8801478bf380 hw 
> > 8800af800a20 dma af800a20 next 88014070da00 hw_next af800f02
> > [  241.794821] ehci_hcd :00:0b.1:  qh 88014070da00 hw 
> > 8800af800f00 dma af800f00 next 8801407d3800 hw_next af800de2
> > [  241.794828] ehci_hcd :00:0b.1:  qh 8801407d3800 hw 
> > 8800af800de0 dma af800de0 next 8801478bfc80 hw_next af800782
> > [  241.794834] ehci_hcd :00:0b.1:  qh 8801478bfc80 hw 
> > 8800af800780 dma af800780 next 88014079e480 hw_next af800ba2
> > [  241.794841] ehci_hcd :00:0b.1:  qh 88014079e480 hw 
> > 8800af800ba0 dma af800ba0 next 8801478bf780 hw_next af8008a2
> > [  241.794847] ehci_hcd :00:0b.1:  qh 8801478bf780 hw 
> > 8800af8008a0 dma af8008a0 next 8801478bfd00 hw_next af800962
> > [  241.794854] ehci_hcd :00:0b.1:  qh 8801478bfd00 hw 
> > 8800af800960 dma af800960 next   (null) hw_next af82
> > [  241.794861] ehci_hcd :00:0b.1: This qh link time 4294878614 enqueue 
> > time 4294878614 td token 1f8c80 ov token 0
> > [  241.794869] ehci_hcd :00:0b.1:   qh 8801478bfd00 naf82 info 
> > 4200220e 4000 qtd af820540
> > [  241.794876] ehci_hcd :00:0b.1: overlay td 8800af800970 naf820d80 
> > 0001 t p0=af883a9f
> > [  241.794884] ehci_hcd :00:0b.1:   dma af820d80 td 8800af820d80 
> > naf820960 0001 t001f8c80 p0=af883a80
> > [  241.794892] ehci_hcd :00:0b.1:   dummy af820960 td 8800af820960 
> > n0001 0001 t0040 p0=
> > [  241.805935] ehci_hcd :00:0b.1: giveback urb 8801455fa300 actual 0
> > [  241.805943] ehci_hcd :00:0b.1: alan end
> 
> This confirms it.  The async schedule isn't running, even though it is 
> turned on and the status register says that it is running.
> 
> > The first "async off" happened, I guess correctly,
> > when the hot-plug settled and all the HDDs where
> > in place as expected.
> > The "async on" happened as soon as I started to
> > read 10 HDDs in parallel, I think correctly too.
> > 
> > After few seconds, this time without any "help",
> > the problem appeared with the log as reported.
> > 
> > I noticed several "async on/off" in dmesg, it
> > seems to me all were consistent.
> 
> This is all what one would expect in normal operation.
> 
> It seems clear that we aren't going to be able to find the cause of 
> the bug.  So let's try another approach: We'll change the commit you 
> identified as the one where the problems began.
> 
> The patch below will prevent the driver from unlinking any idle QHs 
> from the async schedule.  This is a little overboard, because even 
&g

Re: USB issue with kernel 3.6

2012-12-16 Thread Piergiorgio Sartor
Hi Alan,

I applied the modifications you suggested, this is the output:

[  110.922009] ehci_hcd :00:0b.1: async off
[  209.362134] ehci_hcd :00:0b.1: async on
[  241.794760] ehci_hcd :00:0b.1: alan start  cur time 4294908992 last scan 
4294878606
[  241.794774] ehci_hcd :00:0b.1: command 10035 cmd reg 10035 status reg 
e008
[  241.794779] ehci_hcd :00:0b.1: async list:
[  241.794787] ehci_hcd :00:0b.1:  qh 880149fe7800 hw 8800af80 
dma af80 next 880146ce3f80 hw_next af800c62
[  241.794795] ehci_hcd :00:0b.1:  qh 880146ce3f80 hw 8800af800c60 
dma af800c60 next 880146ce3780 hw_next af800d22
[  241.794801] ehci_hcd :00:0b.1:  qh 880146ce3780 hw 8800af800d20 
dma af800d20 next 880147776680 hw_next af800ae2
[  241.794808] ehci_hcd :00:0b.1:  qh 880147776680 hw 8800af800ae0 
dma af800ae0 next 8801478bf380 hw_next af800a22
[  241.794815] ehci_hcd :00:0b.1:  qh 8801478bf380 hw 8800af800a20 
dma af800a20 next 88014070da00 hw_next af800f02
[  241.794821] ehci_hcd :00:0b.1:  qh 88014070da00 hw 8800af800f00 
dma af800f00 next 8801407d3800 hw_next af800de2
[  241.794828] ehci_hcd :00:0b.1:  qh 8801407d3800 hw 8800af800de0 
dma af800de0 next 8801478bfc80 hw_next af800782
[  241.794834] ehci_hcd :00:0b.1:  qh 8801478bfc80 hw 8800af800780 
dma af800780 next 88014079e480 hw_next af800ba2
[  241.794841] ehci_hcd :00:0b.1:  qh 88014079e480 hw 8800af800ba0 
dma af800ba0 next 8801478bf780 hw_next af8008a2
[  241.794847] ehci_hcd :00:0b.1:  qh 8801478bf780 hw 8800af8008a0 
dma af8008a0 next 8801478bfd00 hw_next af800962
[  241.794854] ehci_hcd :00:0b.1:  qh 8801478bfd00 hw 8800af800960 
dma af800960 next   (null) hw_next af82
[  241.794861] ehci_hcd :00:0b.1: This qh link time 4294878614 enqueue time 
4294878614 td token 1f8c80 ov token 0
[  241.794869] ehci_hcd :00:0b.1:   qh 8801478bfd00 naf82 info 
4200220e 4000 qtd af820540
[  241.794876] ehci_hcd :00:0b.1: overlay td 8800af800970 naf820d80 
0001 t p0=af883a9f
[  241.794884] ehci_hcd :00:0b.1:   dma af820d80 td 8800af820d80 
naf820960 0001 t001f8c80 p0=af883a80
[  241.794892] ehci_hcd :00:0b.1:   dummy af820960 td 8800af820960 
n0001 0001 t0040 p0=
[  241.805935] ehci_hcd :00:0b.1: giveback urb 8801455fa300 actual 0
[  241.805943] ehci_hcd :00:0b.1: alan end

The first "async off" happened, I guess correctly,
when the hot-plug settled and all the HDDs where
in place as expected.
The "async on" happened as soon as I started to
read 10 HDDs in parallel, I think correctly too.

After few seconds, this time without any "help",
the problem appeared with the log as reported.

I noticed several "async on/off" in dmesg, it
seems to me all were consistent.

bye,

pg

On Sat, Dec 15, 2012 at 03:53:09PM -0500, Alan Stern wrote:
> On Sat, 15 Dec 2012, Piergiorgio Sartor wrote:
> 
> > Hi Alan,
> > 
> > thanks for the patch, I applied it and tested.
> > 
> > I've bad news...
> > Unless I made some mistake (which can always be), I did not
> > see any of the logs of this patch in the output, despite
> > having seen the problem occuring (this time I used "cpuburn"
> > to overload the system).
> 
> That's disappointing.
> 
> > The only reasonable log was like the previous one:
> > 
> > [  248.833978] ehci_hcd :00:0b.1: alan start  cur time 4294916032 last 
> > scan 4294885464
> > [  248.833991] ehci_hcd :00:0b.1: command 10035 cmd reg 10075 io 
> > watchdog 1 async count 10
> > [  248.833998] ehci_hcd :00:0b.1: async list:
> > [  248.834006] ehci_hcd :00:0b.1:  qh 880149fcfe00 hw 
> > 8800af80 dma af80 next 880146d70380 hw_next af8009c2
> > [  248.834013] ehci_hcd :00:0b.1:  qh 880146d70380 hw 
> > 8800af8009c0 dma af8009c0 next 880146c95b00 hw_next af800a82
> > [  248.834020] ehci_hcd :00:0b.1:  qh 880146c95b00 hw 
> > 8800af800a80 dma af800a80 next 880149b51f00 hw_next af800cc2
> > [  248.834027] ehci_hcd :00:0b.1:  qh 880149b51f00 hw 
> > 8800af800cc0 dma af800cc0 next 880145d17e80 hw_next af800f02
> > [  248.834033] ehci_hcd :00:0b.1:  qh 880145d17e80 hw 
> > 8800af800f00 dma af800f00 next 88013d137180 hw_next af800c02
> > [  248.834040] ehci_hcd :00:0b.1:  qh 88013d137180 hw 
> > 8800af800c00 dma af800c00 next 880148bc hw_next af800d82
> > [  248.834046] ehci_hcd :00:0b.1:  qh 880148bc hw 
> > 8800af800d80 dma af800d80 next 8801410f9f00 hw_next af800b42
> > [  248.834053] ehci_hcd :00:

Re: USB issue with kernel 3.6

2012-12-15 Thread Piergiorgio Sartor
On Fri, Dec 14, 2012 at 11:02:07AM -0500, Alan Stern wrote:
> On Mon, 10 Dec 2012, Piergiorgio Sartor wrote:
> 
> > > Here's an idea.  This just occurred to me.  Maybe when the driver is
> > > waiting for the async schedule to turn off, new QH's should not be
> > > added to the schedule.  The driver could wait and add them after the
> > > schedule was off.  I didn't do it that way because it would slow things
> > > down and add complexity, but maybe that's what the nVidia hardware
> > > needs.
> > 
> > How difficult is it?
> > Would it be possible to have as a patch I could try?
> 
> Instead of doing this, I wrote a simpler patch that removes the limit
> on the number of polls.  It also writes a debugging message whenever
> more than 20 ms of polling is needed, so we can see what your 
> controller is actually doing.

Hi Alan,

thanks for the patch, I applied it and tested.

I've bad news...
Unless I made some mistake (which can always be), I did not
see any of the logs of this patch in the output, despite
having seen the problem occuring (this time I used "cpuburn"
to overload the system).

The only reasonable log was like the previous one:

[  248.833978] ehci_hcd :00:0b.1: alan start  cur time 4294916032 last scan 
4294885464
[  248.833991] ehci_hcd :00:0b.1: command 10035 cmd reg 10075 io watchdog 1 
async count 10
[  248.833998] ehci_hcd :00:0b.1: async list:
[  248.834006] ehci_hcd :00:0b.1:  qh 880149fcfe00 hw 8800af80 
dma af80 next 880146d70380 hw_next af8009c2
[  248.834013] ehci_hcd :00:0b.1:  qh 880146d70380 hw 8800af8009c0 
dma af8009c0 next 880146c95b00 hw_next af800a82
[  248.834020] ehci_hcd :00:0b.1:  qh 880146c95b00 hw 8800af800a80 
dma af800a80 next 880149b51f00 hw_next af800cc2
[  248.834027] ehci_hcd :00:0b.1:  qh 880149b51f00 hw 8800af800cc0 
dma af800cc0 next 880145d17e80 hw_next af800f02
[  248.834033] ehci_hcd :00:0b.1:  qh 880145d17e80 hw 8800af800f00 
dma af800f00 next 88013d137180 hw_next af800c02
[  248.834040] ehci_hcd :00:0b.1:  qh 88013d137180 hw 8800af800c00 
dma af800c00 next 880148bc hw_next af800d82
[  248.834046] ehci_hcd :00:0b.1:  qh 880148bc hw 8800af800d80 
dma af800d80 next 8801410f9f00 hw_next af800b42
[  248.834053] ehci_hcd :00:0b.1:  qh 8801410f9f00 hw 8800af800b40 
dma af800b40 next 880147b8bc80 hw_next af8007e2
[  248.834060] ehci_hcd :00:0b.1:  qh 880147b8bc80 hw 8800af8007e0 
dma af8007e0 next 880140b49180 hw_next af800e42
[  248.834066] ehci_hcd :00:0b.1:  qh 880140b49180 hw 8800af800e40 
dma af800e40 next 880141f8ac00 hw_next af8008a2
[  248.834073] ehci_hcd :00:0b.1:  qh 880141f8ac00 hw 8800af8008a0 
dma af8008a0 next   (null) hw_next af82
[  248.834080] ehci_hcd :00:0b.1: This qh link time 4294885466 enqueue time 
4294885466 td token 1f8c80 ov token 0
[  248.834087] ehci_hcd :00:0b.1:   qh 880146d70380 naf800a82 info 
42002210 4000 qtd af81ccc0
[  248.834095] ehci_hcd :00:0b.1: overlay td 8800af8009d0 nae2531e0 
0001 t p0=af87fa9f
[  248.834103] ehci_hcd :00:0b.1:   dma ae2531e0 td 8800ae2531e0 
naf81ef60 0001 t001f8c80 p0=af87fa80
[  248.834111] ehci_hcd :00:0b.1:   dummy af81ef60 td 8800af81ef60 
n0001 0001 t0040 p0=
[  248.845154] ehci_hcd :00:0b.1: giveback urb 88014980dcc0 actual 0
[  248.845162] ehci_hcd :00:0b.1: alan end

Note that there was not anymore the log line before "alan start".

I suspect the previous error was something else, maybe something
it happens seldom, since sometimes I saw some timeout errors, but
they were almost always not a problem.

Any ideas?

Thanks again,

bye,

pg

> 
> Alan Stern
> 
> 
> 
> 
> Index: usb-3.7/drivers/usb/host/ehci-timer.c
> ===
> --- usb-3.7.orig/drivers/usb/host/ehci-timer.c
> +++ usb-3.7/drivers/usb/host/ehci-timer.c
> @@ -113,14 +113,15 @@ static void ehci_poll_ASS(struct ehci_hc
>  
>   if (want != actual) {
>  
> - /* Poll again later, but give up after about 20 ms */
> - if (ehci->ASS_poll_count++ < 20) {
> - ehci_enable_event(ehci, EHCI_HRTIMER_POLL_ASS, true);
> - return;
> - }
> - ehci_dbg(ehci, "Waited too long for the async schedule status 
> (%x/%x), giving up\n",
> - want, actual);
> + /* Poll again later */
> + ehci_enable_event(ehci, EHCI_HRTIMER_POLL_ASS, true);
> + ++ehci->ASS_poll_count;
> + return;
>   }
> +
> +   

Re: USB issue with kernel 3.6

2012-12-10 Thread Piergiorgio Sartor
Hi Alan,

On Sat, Dec 08, 2012 at 10:54:00PM -0500, Alan Stern wrote:
[...]
> > Is this the reason why the "ehci", after the problem
> > is triggered, it does not work anymore?
> 
> Yes.  Actually, it might have started working again if you unplugged 
> all your high-speed devices and then plugged them back in.  Maybe.

last time I tried, it did not work.
The ehci was dead and kept staying that way.

> > One more question, why the heavy traffic triggers it?
> > Could it be the controller is too busy and it does not
> > answer (within 20ms) to the request?
> 
> It's hard to say.  At the moment the controller was told to turn off
> the schedule, it was not under heavy load.  This is because the
> schedule gets turned off when there has been no traffic at all (no QH's
> in the schedule) for at least 15 ms.
> 
> On the other hand, we know that new traffic did get started before the
> schedule actually turned off.  So maybe the new load caused the
> controller to be too busy -- we don't know how much time passed between
> the "turn off" command and the start of the new traffic.  I could write
> a patch to find out...

It might be useful.
I tried quickly to put 5 instead of 20, but still
the issue was difficult to reproduce (3.7.0-rc4+).
It happened (again) only after I increased the CPU
(or I/O) load by "make -j5" in "/usr/src/linux".
And even not immediately.

Of course, it was just a quick test, it could have
been I made some mistakes. Or not.
 
> Here's an idea.  This just occurred to me.  Maybe when the driver is
> waiting for the async schedule to turn off, new QH's should not be
> added to the schedule.  The driver could wait and add them after the
> schedule was off.  I didn't do it that way because it would slow things
> down and add complexity, but maybe that's what the nVidia hardware
> needs.

How difficult is it?
Would it be possible to have as a patch I could try?

> > More of that, is it "sane" to just increase a timeout
> > in order to workaround the issue?
> 
> I don't know.  If a small increase in the timeout fixes the problem
> then maybe it is.  The problem is that I don't understand exactly what
> causes the bug, so I can't tell the right way to work around it.

Maybe more verbosity is needed somewhere?
 
> > Would it be better, after the timeout, to re-try to turn
> > on the async schedule for a couple of times? With some
> > wait inbetween, of course.
> 
> I doubt that would work.  It would be better to make the timeout 
> longer.

What I mean is that instead of waiting for 500ms
at once, it could be better to wait 50 times 10ms.
As soon as the switch occurs, the system can just
continue. In other words to have a flexible timeout,
instead of a fixed one.
Assuming it is realiably possible to verify the
status of the EHCI hardware.

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-12-08 Thread Piergiorgio Sartor
Hi alan,

On Sat, Dec 08, 2012 at 05:33:27PM -0500, Alan Stern wrote:
> On Sat, 8 Dec 2012, Piergiorgio Sartor wrote:
> 
> > Hi Alan,
> > 
> > the output of lspci:
> > 
> > 00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3) 
> > (prog-if 20 [EHCI])
> > Subsystem: ASUSTeK Computer Inc. Device 81c0
> > Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 22
> > Memory at fe02e000 (32-bit, non-prefetchable) [size=256]
> > Capabilities: [44] Debug port: BAR=1 offset=0098
> > Capabilities: [80] Power Management version 2
> > Kernel driver in use: ehci_hcd
> 
> nVidia, eh?  Their stuff does have some weird bugs...

I already heard that... :-)

> > The output of "dmesg" with your patch:
> > 
> > [  773.387144] ehci_hcd :00:0b.1: Waited too long for the async 
> > schedule status (0/8000), giving up
> 
> Believe it or not, this one line explains everything -- and the rest of
> the debugging log confirms it.  This didn't show up in your earlier
> kernel log because this message wasn't added until the 3.6 release, and
> your earlier test used 3.5.

I was using exactly the point bisect found as
problematic, I was afraid to change things.
Anyway, good that I kept it in the log.

> Basically, this is a bug in nVidia's EHCI controller hardware.  The
> driver told the controller to turn off its async schedule, and 20 ms
> later the schedule still was running.  Although the EHCI specification
> doesn't put any time limit on how long turning off the schedule may
> take, in practice it shouldn't be any more than a couple of ms (and
> normally much less).

Question is, why it was working before kernel 3.6.0?
Or, at least, why it did not show up.
We know the patch triggering it (it seems), what was
done before instead of that?

> But then the driver needed to turn the schedule back on.  The command
> to do so was ignored by the controller, since it was still trying to
> carry out the earlier command to turn off the schedule.  Eventually --
> no way to know exactly when -- the schedule _did_ turn off.  And then
> it never turned back on!

Is this the reason why the "ehci", after the problem
is triggered, it does not work anymore?

One more question, why the heavy traffic triggers it?
Could it be the controller is too busy and it does not
answer (within 20ms) to the request?
 
> That's why none of the transfers from that point on were successful.
> 
> > [  803.804848] ehci_hcd :00:0b.1: alan start  cur time 4295471040 last 
> > scan 4295440395
> > [  803.804862] ehci_hcd :00:0b.1: command 10035 cmd reg 10035 io 
> > watchdog 1 async count 10
> 
> The "Async Schedule Enable" bit is bit 0x20 in the command register;  
> the debugging output shows that it is turned on.  Nevertheless the 
> async schedule isn't running, as shown by the rest of the debugging 
> output.
> 
> > One comment, I was using, apparently, kernel 3.7.0-rc4+
> 
> It's a good thing you did!
> 
> > and it was quite difficult to reproduce the error.
> > It happened only after I increased the CPU (and I/O) load
> > doing a "make clean" in /usr/src/linux
> 
> 3.7 waits longer for the schedule to turn off before giving up than 3.5
> does; that probably explains why the bug was harder to trigger.
> 
> The question now is what to do about this.  I suppose the waiting time 
> could be increased -- but how much?

More of that, is it "sane" to just increase a timeout
in order to workaround the issue?
Again, I think the change in the code that caused this
to show up must be somehow reconsidered, or not?

Would it be better, after the timeout, to re-try to turn
on the async schedule for a couple of times? With some
wait inbetween, of course.
 
> You can experiment with different values easily enough.  In the source
> file drivers/usb/host/ehci-timer.c, in the ehci_poll_ASS() function
> near line 117, you'll see this:
> 
>   /* Poll again later, but give up after about 20 ms */
>   if (ehci->ASS_poll_count++ < 20) {
>   ehci_enable_event(ehci, EHCI_HRTIMER_POLL_ASS, true);
>   return;
>   }
>   ehci_dbg(ehci, "Waited too long for the async schedule status 
> (%x/%x), giving up\n",
>   want, actual);
> 
> Change the 20's to something bigger, like 50 or 100, and see if that
> makes the problem go away.  (For thoroughness, also try changing them
> to something smaller, like 5, and see if that makes the problem occur
> more readily.)

I'll give it a try.

Thanks for the analysis and the explanation, I hope
to have this fixed soon.

bye,
 
-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-12-08 Thread Piergiorgio Sartor
On Tue, Dec 04, 2012 at 04:55:26PM -0500, Alan Stern wrote:
> On Tue, 4 Dec 2012, Piergiorgio Sartor wrote:
> 
> > Hi Alan,
> > 
> > thanks for the patch.
> > 
> > One question, which kernel version should I use
> > for this patch?
> > 
> > It's the latest from git?
> 
> I think it should work with any fairly recent kernel.  The patch was 
> written against linux-next from about six weeks ago.  I don't think 
> much has changed in this area for a while.
> 
> Alan Stern

Hi Alan,

the output of lspci:

00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3) 
(prog-if 20 [EHCI])
Subsystem: ASUSTeK Computer Inc. Device 81c0
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 22
Memory at fe02e000 (32-bit, non-prefetchable) [size=256]
Capabilities: [44] Debug port: BAR=1 offset=0098
Capabilities: [80] Power Management version 2
Kernel driver in use: ehci_hcd

--- --- --- ---

The output of "dmesg" with your patch:

[  773.387144] ehci_hcd :00:0b.1: Waited too long for the async schedule 
status (0/8000), giving up
[  803.804848] ehci_hcd :00:0b.1: alan start  cur time 4295471040 last scan 
4295440395
[  803.804862] ehci_hcd :00:0b.1: command 10035 cmd reg 10035 io watchdog 1 
async count 10
[  803.804868] ehci_hcd :00:0b.1: async list:
[  803.804876] ehci_hcd :00:0b.1:  qh 880149fe7800 hw 8800af80 
dma af80 next 880148305600 hw_next af800de2
[  803.804884] ehci_hcd :00:0b.1:  qh 880148305600 hw 8800af800de0 
dma af800de0 next 88013d09b200 hw_next af800d22
[  803.804891] ehci_hcd :00:0b.1:  qh 88013d09b200 hw 8800af800d20 
dma af800d20 next 8801498aca00 hw_next af800ae2
[  803.804897] ehci_hcd :00:0b.1:  qh 8801498aca00 hw 8800af800ae0 
dma af800ae0 next 880147a7cf00 hw_next af800c62
[  803.804904] ehci_hcd :00:0b.1:  qh 880147a7cf00 hw 8800af800c60 
dma af800c60 next 880147b28680 hw_next af800a22
[  803.804911] ehci_hcd :00:0b.1:  qh 880147b28680 hw 8800af800a20 
dma af800a20 next 88013d1f4c00 hw_next af800f02
[  803.804917] ehci_hcd :00:0b.1:  qh 88013d1f4c00 hw 8800af800f00 
dma af800f00 next 880147733100 hw_next af800ba2
[  803.804924] ehci_hcd :00:0b.1:  qh 880147733100 hw 8800af800ba0 
dma af800ba0 next 880147971100 hw_next af800782
[  803.804930] ehci_hcd :00:0b.1:  qh 880147971100 hw 8800af800780 
dma af800780 next 880147971e00 hw_next af8008a2
[  803.804937] ehci_hcd :00:0b.1:  qh 880147971e00 hw 8800af8008a0 
dma af8008a0 next 880147b24b80 hw_next af800962
[  803.804944] ehci_hcd :00:0b.1:  qh 880147b24b80 hw 8800af800960 
dma af800960 next   (null) hw_next af82
[  803.804951] ehci_hcd :00:0b.1: This qh link time 4295440580 enqueue time 
4295440580 td token 1f8c80 ov token 0
[  803.804959] ehci_hcd :00:0b.1:   qh 880148305600 naf800d22 info 
42002215 4000 qtd 96f91ae0
[  803.804967] ehci_hcd :00:0b.1: overlay td 8800af800df0 n96f910c0 
0001 t p0=af824d9f
[  803.804975] ehci_hcd :00:0b.1:   dma 96f910c0 td 880096f910c0 
n96f91de0 0001 t001f8c80 p0=af824d80
[  803.804982] ehci_hcd :00:0b.1:   dummy 96f91de0 td 880096f91de0 
n0001 0001 t0040 p0=
[  803.816026] ehci_hcd :00:0b.1: giveback urb 88014755e6c0 actual 0
[  803.816034] ehci_hcd :00:0b.1: alan end

--- --- --- ---

I updated the bug report too.

One comment, I was using, apparently, kernel 3.7.0-rc4+
and it was quite difficult to reproduce the error.
It happened only after I increased the CPU (and I/O) load
doing a "make clean" in /usr/src/linux

I do not know if this has any meaning, but better not to
forget such details.

Hope this helps,

bye,

--

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-12-04 Thread Piergiorgio Sartor
Hi Alan,

thanks for the patch.

One question, which kernel version should I use
for this patch?

It's the latest from git?

Thanks,

bye,

pg

On Tue, Dec 04, 2012 at 11:45:14AM -0500, Alan Stern wrote:
> On Mon, 3 Dec 2012, Piergiorgio Sartor wrote:
> 
> > Hi Alan,
> > 
> > I updated the bug report with dmesg dump.
> > 
> > I hope this time it is correct...
> 
> Yes, it is.  It shows a couple of strange things; I need more 
> information.
> 
> Below is a new test patch; try running the test with this patch in 
> place of the previous one.  I don't need to see the usbmon trace, just 
> the dmesg log.  And not even all of that; just the part from "alan 
> start" to "alan end" -- this will be near the end of the log.
> 
> Also, what output do you get from "lspci -v -s b.1"?
> 
> Alan Stern
> 
> 
> 
> Index: usb-3.7/drivers/usb/host/ehci-hcd.c
> ===
> --- usb-3.7.orig/drivers/usb/host/ehci-hcd.c
> +++ usb-3.7/drivers/usb/host/ehci-hcd.c
> @@ -825,8 +825,10 @@ dead:
>   bh = 0;
>   }
>  
> - if (bh)
> + if (bh) {
> + ehci->last_scan_time = jiffies;
>   ehci_work (ehci);
> + }
>   spin_unlock (&ehci->lock);
>   if (pcd_status)
>   usb_hcd_poll_rh_status(hcd);
> @@ -884,6 +886,53 @@ static int ehci_urb_enqueue (
>   }
>  }
>  
> +static void alan_debug(struct ehci_hcd *ehci, struct urb *urb, struct 
> ehci_qh *qh)
> +{
> + static int done;
> + struct ehci_qh *qh2;
> + struct ehci_qtd *td;
> + char label[24];
> +
> + if (done)
> + return;
> + if (urb->transfer_buffer_length != 31) {
> + ehci_dbg(ehci, "unlink urb len %d\n", 
> urb->transfer_buffer_length);
> + return;
> + }
> +
> + ehci->alan_urb = urb;
> + done = 1;
> + ehci_dbg(ehci, "alan start  cur time %lu last scan %lu\n",
> + jiffies, ehci->last_scan_time);
> + ehci_dbg(ehci, "command %x cmd reg %x io watchdog %d async count %d\n",
> + ehci->command, ehci_readl(ehci, &ehci->regs->command),
> + ehci->need_io_watchdog, ehci->async_count);
> +
> + ehci_dbg(ehci, "async list:\n");
> + qh2 = ehci->async;
> + while (qh2) {
> + ehci_dbg(ehci, " qh %p hw %p dma %x next %p hw_next %x\n",
> + qh2, qh2->hw, (u32) qh2->qh_dma, qh2->qh_next.qh,
> + hc32_to_cpu(ehci, qh2->hw->hw_next));
> + qh2 = qh2->qh_next.qh;
> + }
> +
> + ehci_dbg(ehci, "This qh link time %lu enqueue time %lu td token %x ov 
> token %x\n",
> + qh->link_time, qh->enqueue_time,
> + hc32_to_cpu(ehci, qh->tdtoken),
> + hc32_to_cpu(ehci, qh->ovtoken));
> + dbg_qh(" ", ehci, qh);
> +
> + list_for_each_entry(td, &qh->qtd_list, qtd_list) {
> + sprintf(label, "  dma %x", (u32) td->qtd_dma);
> + dbg_qtd(label, ehci, td);
> + }
> +
> + sprintf(label, "  dummy %x", (u32) qh->dummy->qtd_dma);
> + dbg_qtd(label, ehci, qh->dummy);
> +}
> +
> +
>  /* remove from hardware lists
>   * completions normally happen asynchronously
>   */
> @@ -907,6 +956,10 @@ static int ehci_urb_dequeue(struct usb_h
>   qh = (struct ehci_qh *) urb->hcpriv;
>   if (!qh)
>   break;
> +
> + if (usb_pipetype(urb->pipe) == PIPE_BULK)
> + alan_debug(ehci, urb, qh);
> +
>   switch (qh->qh_state) {
>   case QH_STATE_LINKED:
>   case QH_STATE_COMPLETING:
> Index: usb-3.7/drivers/usb/host/ehci.h
> ===
> --- usb-3.7.orig/drivers/usb/host/ehci.h
> +++ usb-3.7/drivers/usb/host/ehci.h
> @@ -221,6 +221,10 @@ struct ehci_hcd {/* one per 
> controlle
>  #ifdef DEBUG
>   struct dentry   *debug_dir;
>  #endif
> +
> + struct urb  *alan_urb;
> + unsigned long   last_scan_time;
> +
>  };
>  
>  /* convert between an HCD pointer and the corresponding EHCI_HCD */
> @@ -400,6 +404,9 @@ struct ehci_qh {
>   struct usb_device   *dev;   /* access to TT */
>   unsignedis_out:1;   /* bulk or intr OUT */
>   unsignedclearing_tt:1;  /* Clear

Re: USB issue with kernel 3.6

2012-12-03 Thread Piergiorgio Sartor
On Sun, Dec 02, 2012 at 03:54:39PM -0500, Alan Stern wrote:
> On Sun, 2 Dec 2012, Piergiorgio Sartor wrote:
> 
> > I'll give it another round, with dmesg
> > 
> > How about the registers? Any hint there?
> 
> No, not really.  At least, not that I can tell at this point; maybe 
> later on it will be more useful.
> 
> Alan Stern

Hi Alan,

I updated the bug report with dmesg dump.

I hope this time it is correct...

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-12-02 Thread Piergiorgio Sartor
On Sun, Dec 02, 2012 at 01:03:22PM -0500, Alan Stern wrote:
> On Sat, 1 Dec 2012, Piergiorgio Sartor wrote:
> 
> > Hi Alan,
> > 
> > I updated the bugzilla entry with some attachments.
> > 
> > One (split in two) is the log from usbmon with the patch
> > you provided (I hope I did it correctly).
> 
> This usbmon trace doesn't contain anything significantly different from
> the previous one.  You can delete the two big files from the bug report 
> if you want.

OK, I'll do that.
 
> More importantly, where's the dmesg output from the patch?

Damm it! I knew I missed something.
I understood the patch was adding verbosity to usbmon.
Sorry, I got it wrong.

I'll give it another round, with dmesg

How about the registers? Any hint there?
 
> > The other is an archive with all the register information
> > I could think of.
> > 
> > I tested with 4 HDDs only and I was not able to see a crash
> > of the USB subsystem.
> > 
> > About using only one HUB, that might be a bit tricky, but
> > since each HUB has 4 ports (the 1-to-7 has internally two
> > 1-to-4 cascaded) I guess I've only to make sure the 4 HDDs
> > was *not* running on single HUB.
> > If it was, then a different connection must be tested,
> > otherwise, I guess, we will not get more information by
> > connecting the 4 HDDs to a single HUB.
> 
> It's not clear...  In fact, using fewer drives and hubs may not provide 
> any useful information at all.  This is just trying whatever I can 
> think of to see what's going wrong.
> 
> > The problem I've is that it is not really easy to
> > convert from /dev/sdX to lsusb to SCSI mappings...
> 
> The necessary information is all there in the dmesg log.

OK, so I'll try both cases again.

Thanks for the support, sorry for the waste of time.

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-12-01 Thread Piergiorgio Sartor
On Mon, Nov 26, 2012 at 03:46:38PM -0500, Alan Stern wrote:
> On Mon, 26 Nov 2012, Piergiorgio Sartor wrote:
> 
> > Hi Alan,
> > 
> > thanks for the patch.
> > 
> > I'll try to apply it and report the results,
> > including the USB EHCI register files.
> > 
> > Unfortunately, I cannot provide a "deadline"
> > (so to speak), I'm quite busy at the moment.
> 
> Don't worry about it; I'm not in any hurry.
> 
> When you do get around to testing, don't forget to try combinations 
> with fewer drives and/or only one hub.

Hi Alan,

I updated the bugzilla entry with some attachments.

One (split in two) is the log from usbmon with the patch
you provided (I hope I did it correctly).

The other is an archive with all the register information
I could think of.

I tested with 4 HDDs only and I was not able to see a crash
of the USB subsystem.

About using only one HUB, that might be a bit tricky, but
since each HUB has 4 ports (the 1-to-7 has internally two
1-to-4 cascaded) I guess I've only to make sure the 4 HDDs
was *not* running on single HUB.
If it was, then a different connection must be tested,
otherwise, I guess, we will not get more information by
connecting the 4 HDDs to a single HUB.

The problem I've is that it is not really easy to
convert from /dev/sdX to lsusb to SCSI mappings...

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-11-26 Thread Piergiorgio Sartor
On Mon, Nov 26, 2012 at 11:51:39AM -0500, Alan Stern wrote:
> On Thu, 22 Nov 2012, Piergiorgio Sartor wrote:
> 
> > OK, I got a log with working and then non working system,
> > so there should be a transition,
> 
> The trace attached to Kernel Bugzilla #50381 didn't give any details.  
> All it showed was that at one point the computer stopped getting data
> from the drives.
> 
> Does the same problem occur if you use fewer drives or only one hub?
> 
> Below is a diagnostic patch which should add extra information to the 
> dmesg log when the problem occurs.  It requires CONFIG_USB_DEBUG to be 
> set.  Let's see what it shows.
> 

Hi Alan,

thanks for the patch.

I'll try to apply it and report the results,
including the USB EHCI register files.

Unfortunately, I cannot provide a "deadline"
(so to speak), I'm quite busy at the moment.

Thanks again,

bye,

pg

> 
> 
> 
> Index: usb-3.7/drivers/usb/host/ehci-hcd.c
> ===
> --- usb-3.7.orig/drivers/usb/host/ehci-hcd.c
> +++ usb-3.7/drivers/usb/host/ehci-hcd.c
> @@ -907,6 +907,18 @@ static int ehci_urb_dequeue(struct usb_h
>   qh = (struct ehci_qh *) urb->hcpriv;
>   if (!qh)
>   break;
> +
> + if (usb_pipetype(urb->pipe) == PIPE_BULK) {
> + char label[40];
> +
> + sprintf(label, "dev %d ep %x state %d",
> + usb_pipedevice(urb->pipe),
> + usb_pipein(urb->pipe) +
> + usb_pipeendpoint(urb->pipe),
> + qh->qh_state);
> + dbg_qh(label, ehci, qh);
> + }
> +
>   switch (qh->qh_state) {
>   case QH_STATE_LINKED:
>   case QH_STATE_COMPLETING:

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-11-22 Thread Piergiorgio Sartor
On Tue, Nov 20, 2012 at 03:40:44PM -0500, Alan Stern wrote:
> On Tue, 20 Nov 2012, Piergiorgio Sartor wrote:
> 
> > This is the output of usbmod just when the problem happened:
> > 
> > 8801232fd6c0 815566692 C Co:1:012:0 -2 0
> > 880130cd9000 815566767 S Co:1:012:0 s 23 03 0004 0001  0
> > 880130cd9000 816577053 C Co:1:012:0 -2 0
> ...
> > 880026fb2780 845911126 S Co:1:013:0 s 23 03 0004 0002  0
> > 880026fb2780 846922109 C Co:1:013:0 -2 0
> 
> All this shows is that the computer isn't able to communicate with two 
> of your hubs.
> 
> > After dmesg reported the error -110 I stopped it.
> > 
> > Interesting enough, with usbmod running (cat /sys/.../1u > usb1.txt)
> > the transfer rate, from the HDDs, was around about 10% slower than
> > usual and the problem did not show up at first.
> 
> No doubt that was caused by the overhead of using usbmon.
> 
> > I do not know if it was just by random, but this could hint a race
> > condition somewhere.
> > 
> > Any idea?
> 
> I need to see more of the context.  What shows up in the usbmon trace 
> starting somewhat _before_ the problem happens?

OK, I got a log with working and then non working system,
so there should be a transition,
Problem is, the file is 1.2MB, bizp2 reduces it to 200K.
How do I pass it to you?

Furthermore, this time it was quite hard to get the error.
The system was working several minutes before it happened.

I noticed, maybe unrelated, that when working, the CPU
clock was at max (2.6GHz), just before stopping to work,
the CPU clock was at min (800MHz).

Another, possibly related item, in this test run, the
transfer speed was always at max, monitoring the USB
was not slowing it down as seen before (same kernel).

I might think that the clock at max speed was responsible.

> Also, after the problem occurs, you should go into the 
> /sys/kernel/debug/usb/ehci/ directory and find the subdirectory 
> corresponding to the controller for bus 1.  Let's see what the files in 
> that subdirectory say.

Of the 4 files I found there, 2 were empty, the others were
"periodic" and "registers", with following content:

size = 512
   1:  qh256-0001/88014a55a5a0 (h4 ep1in [1/0] q1 p1)
 257:  qh256-0001/88014a55a5a0

bus pci, device :00:0b.1
EHCI Host Controller
EHCI 1.00, rh state running
ownership 0001
SMI sts/enable 0xc008
structural params 0x00101888
capability params 0xa086
status c008 Async Periodic FLR
command 0010015 (park)=0 ithresh=1 Periodic period=512 RUN
intrenable 37 IAA FATAL PCD ERR INT
uframe 0deb
port:1 status 003400 0  ACK POWER OWNER sig=k
port:2 status 003400 0  ACK POWER OWNER sig=k
port:3 status 001005 0  ACK POWER sig=se0 PE CONNECT
port:4 status 001000 0  ACK POWER sig=se0
port:5 status 001000 0  ACK POWER sig=se0
port:6 status 001000 0  ACK POWER sig=se0
port:7 status 003400 0  ACK POWER OWNER sig=k
port:8 status 003400 0  ACK POWER OWNER sig=k
irq normal 523806 err 103 iaa 29127 (lost 0)
complete 1188957 unlink 51

There were both taken after the error and after disconnecting
the HUBs, I'm not sure if there're meaningful to you.
In case not, please let me know what should be the exact
procedure you need.

Thanks,

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-11-20 Thread Piergiorgio Sartor
On Sun, Nov 18, 2012 at 12:44:37PM -0500, Alan Stern wrote:
> On Sun, 18 Nov 2012, Piergiorgio Sartor wrote:
> 
> > 32830f207691176234b4c4dd17f0d7ab6d87d94b is the first bad commit
> > commit 32830f207691176234b4c4dd17f0d7ab6d87d94b
> > Author: Alan Stern 
> > Date:   Wed Jul 11 11:22:53 2012 -0400
> > 
> > USB: EHCI: use hrtimer for unlinking empty async QHs
> > 
> > This patch (as1583) changes ehci-hcd to use an hrtimer event for
> > unlinking empty (unused) async QHs instead of using a kernel timer.
> > 
> > The check for empty QHs is moved to a new routine, where it doesn't
> > require going through an entire scan of both the async and periodic
> > schedules.  And it can unlink multiple QHs at once, unlike the current
> > code.
> > 
> > Signed-off-by: Alan Stern 
> > Signed-off-by: Greg Kroah-Hartman 
> > 
> > :04 04 9767ca0f3b8c9fa0530e90730dc3106420412822
> > 6b567d1137df3cdec9f19f835ea7c546491d9ca8 Mdrivers
> > 
> > Which looks like to me more consistent with the problem.
> 
> Here's what you should do.  Build a kernel that exhibits the problem
> with CONFIG_USB_DEBUG enabled and post a copy of the dmesg log showing 
> what happens when the problem occurs.  Also, it would help if you can 
> collect a usbmon trace at the same time (see the instructions in 
> Documentation/usb/usbmon.txt).
> 
> Alan Stern

This is the output of usbmod just when the problem happened:

8801232fd6c0 815566692 C Co:1:012:0 -2 0
880130cd9000 815566767 S Co:1:012:0 s 23 03 0004 0001  0
880130cd9000 816577053 C Co:1:012:0 -2 0
8800649db6c0 816577128 S Co:1:012:0 s 23 03 0004 0001  0
8800649db6c0 817588087 C Co:1:012:0 -2 0
8800649bcb40 817588171 S Co:1:012:0 s 23 03 0004 0001  0
8800649bcb40 818599529 C Co:1:012:0 -2 0
8800276fc900 818599606 S Co:1:012:0 s 23 03 0004 0001  0
8800276fc900 819611470 C Co:1:012:0 -2 0
8800649bc9c0 819611555 S Co:1:012:0 s 23 01 0001 0001  0
8800649bc9c0 820622047 C Co:1:012:0 -2 0
88010b2ea300 820622129 S Co:1:012:0 s 23 03 0004 0001  0
88010b2ea300 821633095 C Co:1:012:0 -2 0
8800273ed6c0 821633152 S Co:1:012:0 s 23 03 0004 0001  0
8800273ed6c0 822644294 C Co:1:012:0 -2 0
880130f02cc0 822644367 S Co:1:012:0 s 23 03 0004 0001  0
880130f02cc0 823655689 C Co:1:012:0 -2 0
8801193c3300 823655742 S Co:1:012:0 s 23 03 0004 0001  0
8801193c3300 824666926 C Co:1:012:0 -2 0
880130f02600 824667018 S Co:1:012:0 s 23 03 0004 0001  0
880130f02600 825678105 C Co:1:012:0 -2 0
8801193c3840 825678165 S Co:1:012:0 s 23 01 0001 0001  0
8801193c3840 826690597 C Co:1:012:0 -2 0
880130f02a80 826690671 S Co:1:012:0 s 23 03 0004 0001  0
880130f02a80 827701091 C Co:1:012:0 -2 0
880027340a80 827701145 S Co:1:012:0 s 23 03 0004 0001  0
880027340a80 828712752 C Co:1:012:0 -2 0
88010b2ea240 828712814 S Co:1:012:0 s 23 03 0004 0001  0
88010b2ea240 829723041 C Co:1:012:0 -2 0
880133cb8600 829723083 S Co:1:012:0 s 23 03 0004 0001  0
880133cb8600 830734734 C Co:1:012:0 -2 0
8800649bc780 830734808 S Co:1:012:0 s 23 03 0004 0001  0
8800649bc780 831746050 C Co:1:012:0 -2 0
880133cb86c0 831746129 S Co:1:012:0 s 23 01 0001 0001  0
880133cb86c0 832757150 C Co:1:012:0 -2 0
880026d5c6c0 832757231 S Co:1:012:0 s 23 03 0004 0001  0
880026d5c6c0 833768771 C Co:1:012:0 -2 0
880133cb8000 833768846 S Co:1:012:0 s 23 03 0004 0001  0
880133cb8000 834779719 C Co:1:012:0 -2 0
880071bd0300 834779783 S Co:1:012:0 s 23 03 0004 0001  0
880071bd0300 835790173 C Co:1:012:0 -2 0
88010b2ea480 835790244 S Co:1:012:0 s 23 03 0004 0001  0
88010b2ea480 836801136 C Co:1:012:0 -2 0
880064bb3900 836801194 S Co:1:012:0 s 23 03 0004 0001  0
880064bb3900 837812753 C Co:1:012:0 -2 0
88010b2ea840 837812827 S Co:1:012:0 s 23 01 0001 0001  0
88010b2ea840 838823775 C Co:1:012:0 -2 0
880064bb3600 838823839 S Co:1:012:0 s 23 01 0001 0001  0
880133cb8b40 838823856 S Co:1:013:0 s 23 03 0004 0002  0
880064bb3600 839834571 C Co:1:012:0 -2 0
88010b2ea600 839834661 S Ci:1:012:0 s a3 00  0001 0004 4 <
880133cb8b40 839845616 C Co:1:013:0 -2 0
880071bd 839845675 S Co:1:013:0 s 23 03 0004 0002  0
88010b2ea600 840845592 C Ci:1:012:0 -2 0
8800273edd80 840845632 S Ci:1:012:0 s a3 00  0001 0004 4 <
880071bd 840856651 C Co:1:013:0 -2 0
880062d5bb40 840856720 S Co:1:013:0 s 23 03 0004 0002  0
8800273edd80 841856072 C Ci:1:012:0 -2 0
88014822eb40 841856108 S Ci:1:012:0 s a3 00  0001 0004 4 <
880062d5bb40 841867394 C Co:1:013:0 -2 0
880064bb3240 841867456 S Co:1:013:0 s 23 03 0004 0002  0
ff

Re: USB issue with kernel 3.6

2012-11-19 Thread Piergiorgio Sartor
On Sun, Nov 18, 2012 at 12:44:37PM -0500, Alan Stern wrote:
> On Sun, 18 Nov 2012, Piergiorgio Sartor wrote:
> 
> > 32830f207691176234b4c4dd17f0d7ab6d87d94b is the first bad commit
> > commit 32830f207691176234b4c4dd17f0d7ab6d87d94b
> > Author: Alan Stern 
> > Date:   Wed Jul 11 11:22:53 2012 -0400
> > 
> > USB: EHCI: use hrtimer for unlinking empty async QHs
> > 
> > This patch (as1583) changes ehci-hcd to use an hrtimer event for
> > unlinking empty (unused) async QHs instead of using a kernel timer.
> > 
> > The check for empty QHs is moved to a new routine, where it doesn't
> > require going through an entire scan of both the async and periodic
> > schedules.  And it can unlink multiple QHs at once, unlike the current
> > code.
> > 
> > Signed-off-by: Alan Stern 
> > Signed-off-by: Greg Kroah-Hartman 
> > 
> > :04 04 9767ca0f3b8c9fa0530e90730dc3106420412822
> > 6b567d1137df3cdec9f19f835ea7c546491d9ca8 Mdrivers
> > 
> > Which looks like to me more consistent with the problem.
> 
> Here's what you should do.  Build a kernel that exhibits the problem
> with CONFIG_USB_DEBUG enabled and post a copy of the dmesg log showing 
> what happens when the problem occurs.  Also, it would help if you can 
> collect a usbmon trace at the same time (see the instructions in 
> Documentation/usb/usbmon.txt).

There is not too much after the DEBUG enabled:

[...]
[  236.867416] usb 1-5.4: usb-storage timed out on ep0out len=0/0
[  236.867426] hub 1-5.4:1.0: cannot reset port 4 (err = -110)
[  236.867431] hub 1-5.4:1.0: port 4 not enabled, trying reset again...
[  237.878947] usb 1-5.4: usb-storage timed out on ep0out len=0/0
[  237.878960] hub 1-5.4:1.0: cannot reset port 4 (err = -110)
[  237.879069] hub 1-5.4:1.0: port 4 not enabled, trying reset again...
[  238.890004] usb 1-5.4: usb-storage timed out on ep0out len=0/0
[  238.890017] hub 1-5.4:1.0: cannot reset port 4 (err = -110)
[  238.890126] hub 1-5.4:1.0: port 4 not enabled, trying reset again...
[  239.900652] usb 1-5.4: usb-storage timed out on ep0out len=0/0
[  239.900665] hub 1-5.4:1.0: cannot reset port 4 (err = -110)
[  239.900773] hub 1-5.4:1.0: port 4 not enabled, trying reset again...
[  240.912037] usb 1-5.4: usb-storage timed out on ep0out len=0/0
[  240.912094] hub 1-5.4:1.0: cannot reset port 4 (err = -110)
[  240.912204] hub 1-5.4:1.0: port 4 not enabled, trying reset again...
[  240.912209] hub 1-5.4:1.0: Cannot enable port 4.  Maybe the USB cable is bad?
[  241.923720] usb 1-5.4: usb-storage timed out on ep0out len=0/0
[  241.923731] hub 1-5.4:1.0: cannot disable port 4 (err = -110)
[  242.934802] usb 1-5.4: usb-storage timed out on ep0out len=0/0
[  242.934814] hub 1-5.4:1.0: cannot reset port 4 (err = -110)
[  242.934993] hub 1-5.4:1.0: port 4 not enabled, trying reset again...
[...]

This is going on until the storage device reports and I/O error.

Note the "Maybe the USB cable is bad?", which is not, since it works
with other kernels.

I'll try the "usbmon" maybe tomorrow.

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-11-18 Thread Piergiorgio Sartor
On Sat, Nov 17, 2012 at 01:06:42PM -0800, Greg KH wrote:
[...]
> Yes, but again, that has nothing to do with the USB subsystem at all, so
> I think you got a false-positive here somehow.
> 

Thanks for the feedback, I did it again.
So, it seems I've no beginners luck, in fact it looks like I took a wrong
turn, in the first try.

New bisect log is:

Bisecting: 5071 revisions left to test after this (roughly 13 steps)
[b13bc8dda81c54a66a1c84e66f60b8feba659f28] Merge tag 'staging-3.6-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Bisecting: 2384 revisions left to test after this (roughly 12 steps)
[3c4cfadef6a1665d9cd02a543782d03d3e6740c6] Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Bisecting: 1118 revisions left to test after this (roughly 10 steps)
[9fc377799bc9bfd8d5cb35d0d1ea2e2458cbdbb3] Merge tag 'usb-3.6-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Bisecting: 670 revisions left to test after this (roughly 9 steps)
[f14121ab35912e3d2e57ac9a4ce1f9d4b7baeffb] Merge tag 'dt-for-3.6' of
git://sources.calxeda.com/kernel/linux
Bisecting: 340 revisions left to test after this (roughly 8 steps)
[801b03653fc04de2cc5bc83c06de504d41345b63] Merge
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Bisecting: 170 revisions left to test after this (roughly 7 steps)
[b015cb79ce84944076a8a849055f898f2a4d6be8] USB: EHCI: return void instead of 0
Bisecting: 85 revisions left to test after this (roughly 7 steps)
[94c122ab01f84332c371dc4ae0f5b70e879173d6] [SCSI] ufs: fix incorrect return
value about SUCCESS and FAILED
Bisecting: 42 revisions left to test after this (roughly 6 steps)
[d81a5d1956731c453b85c141458d4ff5d6cc5366] USB: add
USB_VENDOR_AND_INTERFACE_INFO() macro
Bisecting: 19 revisions left to test after this (roughly 5 steps)
[6470cbc486652942dcf6c4b6420e3a521e766bbf] Merge tag 'for-usb-next-2012-07-11'
of git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci into usb-next
Bisecting: 11 revisions left to test after this (roughly 4 steps)
[569b394f53f0abd177cc665c9b4ace89e3f4c7fb] USB: EHCI: always scan each
interrupt QH
Bisecting: 5 revisions left to test after this (roughly 3 steps)
[55934eb3b9fa52eb53b9d7342267fc73c38206aa] USB: EHCI: use hrtimer for (s)iTD
deallocation
Bisecting: 2 revisions left to test after this (roughly 2 steps)
[3c273a056bf46167f0a1309c2ba72282a17d2541] USB: EHCI: unlink multiple async QHs
together
Bisecting: 0 revisions left to test after this (roughly 1 step)
[361aabf395e4a23cf554cf4ec0c0c6963b8beb01] USB: EHCI: don't lose events during
a scan
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[32830f207691176234b4c4dd17f0d7ab6d87d94b] USB: EHCI: use hrtimer for unlinking
empty async QHs
32830f207691176234b4c4dd17f0d7ab6d87d94b is the first bad commit
commit 32830f207691176234b4c4dd17f0d7ab6d87d94b
Author: Alan Stern 
Date:   Wed Jul 11 11:22:53 2012 -0400

USB: EHCI: use hrtimer for unlinking empty async QHs

This patch (as1583) changes ehci-hcd to use an hrtimer event for
unlinking empty (unused) async QHs instead of using a kernel timer.

The check for empty QHs is moved to a new routine, where it doesn't
require going through an entire scan of both the async and periodic
schedules.  And it can unlink multiple QHs at once, unlike the current
code.

Signed-off-by: Alan Stern 
Signed-off-by: Greg Kroah-Hartman 

:04 04 9767ca0f3b8c9fa0530e90730dc3106420412822
6b567d1137df3cdec9f19f835ea7c546491d9ca8 Mdrivers

Which looks like to me more consistent with the problem.

Your turn again,

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-11-17 Thread Piergiorgio Sartor
On Sat, Nov 17, 2012 at 10:24:19AM -0800, Greg KH wrote:
> On Sat, Nov 17, 2012 at 06:40:07PM +0100, Piergiorgio Sartor wrote:
> > Hi all,
> > 
> > I followed this instructions: http://wiki.gentoo.org/wiki/Kernel_git-bisect
> > The result, unless I did something incorrect, is:
> > 
> > Bisecting: a merge base must be tested
> > [28a33cbc24e4256c143dce96c7d93bf423229f92] Linux 3.5
> > Bisecting: 5128 revisions left to test after this (roughly 13 steps)
> > [b13bc8dda81c54a66a1c84e66f60b8feba659f28] Merge tag 'staging-3.6-rc1' of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
> > Bisecting: 2384 revisions left to test after this (roughly 12 steps)
> > [3c4cfadef6a1665d9cd02a543782d03d3e6740c6] Merge 
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
> > Bisecting: 1118 revisions left to test after this (roughly 10 steps)
> > [9fc377799bc9bfd8d5cb35d0d1ea2e2458cbdbb3] Merge tag 'usb-3.6-rc1' of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
> > Bisecting: 559 revisions left to test after this (roughly 9 steps)
> > [07c7f79ee1c7e6288c614ba88005a8de6dbaadff] staging:iio:imu:adis16400 drop 
> > sysfs interface for manual device reset.
> > Bisecting: 279 revisions left to test after this (roughly 8 steps)
> > [a2b520b7867c964bb629633c1f909f6ac901f91b] staging: comedi: rtd520: remove 
> > RtdInterruptClear macro
> > Bisecting: 139 revisions left to test after this (roughly 7 steps)
> > [b94cea48ab8428ae803f4e27131367574295853d] staging: comedi: adl_pci7296: 
> > remove the private data
> > Bisecting: 69 revisions left to test after this (roughly 6 steps)
> > [f2a64902e3c2ffa34b52e6ad13e8e3c40f6919c7] staging: comedi: dyna_pci10xx: 
> > fix detach
> > Bisecting: 34 revisions left to test after this (roughly 5 steps)
> > [f58bfd88be490c0f9e25d980abe1976ceef00647] staging: csr: remove 
> > CsrUtf16String typedef
> > Bisecting: 17 revisions left to test after this (roughly 4 steps)
> > [6a4f6d38a34a992f8fd1572fc8d5706c11089d5e] staging: csr: remove CsrStrChr()
> > Bisecting: 8 revisions left to test after this (roughly 3 steps)
> > [4fe9db37104f833972486355fe86d7dcd29279b5] staging: csr: remove 
> > CsrMemFree() and CsrMemFreeDma()
> > Bisecting: 4 revisions left to test after this (roughly 2 steps)
> > [55a27055b9ea5aabf8206ed6bc79f4e840c3] staging: csr: remove 
> > CsrPmemFree()
> > Bisecting: 2 revisions left to test after this (roughly 1 step)
> > [eaae2e9223e859cd968b58bd3173a3f17b50538d] staging: csr: remove csr_pmem.h
> > Bisecting: 0 revisions left to test after this (roughly 1 step)
> > [419e9266884fa853179ab726c27a63a9d3ae46e3] staging: csr: delete a bunch of 
> > unused library functions
> > b13bc8dda81c54a66a1c84e66f60b8feba659f28 is the first bad commit
> > 
> > Now is your turn... :-)
> 
> Are you using the csr driver?  I somehow doubt that, given that I know
> what hardware that supports, and it isn't USB.

No, the "csr" has nothing to do with it, since the kernel with the
"csr" modifications are working.
Commit b13bc8dda81c54a66a1c84e66f60b8feba659f28 was the first bisect step,
that would be:

'staging-3.6-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

If I'm not wrong.
 
> How did you test each kernel for failure or not?  Did you install it and
> boot with it?

Of course (and it lasted forever).
The first bisect was giving *immediately* the error, all the others
were fine. I do not think I took the wrong branch, since that should
have delivered, if I get it correctly, always error.

The test was always the same: start the system, attach the USB
chain, access all the disks in parallel.
This was stopping with error -110 in the first pass, all the
others were, as wrote above, fine.

What was exactly commit b13bc8dda81c54a66a1c84e66f60b8feba659f28 ?

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-11-17 Thread Piergiorgio Sartor
Hi all,

I followed this instructions: http://wiki.gentoo.org/wiki/Kernel_git-bisect
The result, unless I did something incorrect, is:

Bisecting: a merge base must be tested
[28a33cbc24e4256c143dce96c7d93bf423229f92] Linux 3.5
Bisecting: 5128 revisions left to test after this (roughly 13 steps)
[b13bc8dda81c54a66a1c84e66f60b8feba659f28] Merge tag 'staging-3.6-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Bisecting: 2384 revisions left to test after this (roughly 12 steps)
[3c4cfadef6a1665d9cd02a543782d03d3e6740c6] Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Bisecting: 1118 revisions left to test after this (roughly 10 steps)
[9fc377799bc9bfd8d5cb35d0d1ea2e2458cbdbb3] Merge tag 'usb-3.6-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Bisecting: 559 revisions left to test after this (roughly 9 steps)
[07c7f79ee1c7e6288c614ba88005a8de6dbaadff] staging:iio:imu:adis16400 drop sysfs 
interface for manual device reset.
Bisecting: 279 revisions left to test after this (roughly 8 steps)
[a2b520b7867c964bb629633c1f909f6ac901f91b] staging: comedi: rtd520: remove 
RtdInterruptClear macro
Bisecting: 139 revisions left to test after this (roughly 7 steps)
[b94cea48ab8428ae803f4e27131367574295853d] staging: comedi: adl_pci7296: remove 
the private data
Bisecting: 69 revisions left to test after this (roughly 6 steps)
[f2a64902e3c2ffa34b52e6ad13e8e3c40f6919c7] staging: comedi: dyna_pci10xx: fix 
detach
Bisecting: 34 revisions left to test after this (roughly 5 steps)
[f58bfd88be490c0f9e25d980abe1976ceef00647] staging: csr: remove CsrUtf16String 
typedef
Bisecting: 17 revisions left to test after this (roughly 4 steps)
[6a4f6d38a34a992f8fd1572fc8d5706c11089d5e] staging: csr: remove CsrStrChr()
Bisecting: 8 revisions left to test after this (roughly 3 steps)
[4fe9db37104f833972486355fe86d7dcd29279b5] staging: csr: remove CsrMemFree() 
and CsrMemFreeDma()
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[55a27055b9ea5aabf8206ed6bc79f4e840c3] staging: csr: remove CsrPmemFree()
Bisecting: 2 revisions left to test after this (roughly 1 step)
[eaae2e9223e859cd968b58bd3173a3f17b50538d] staging: csr: remove csr_pmem.h
Bisecting: 0 revisions left to test after this (roughly 1 step)
[419e9266884fa853179ab726c27a63a9d3ae46e3] staging: csr: delete a bunch of 
unused library functions
b13bc8dda81c54a66a1c84e66f60b8feba659f28 is the first bad commit

Now is your turn... :-)

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-11-14 Thread Piergiorgio Sartor
Hi all again,

I would like just to re-state the issue and try
to rationalize a possible debugging strategy.

So my questions would be:

1) Is there any possibility, with a Fedora kernel,
to activate some run-time debugging/verbosity
option for the USB/ehci/usb-storage subsystem in
order to try to see where the problem is?

2) About compiling the kernel and git bisect:
  a) As mentioned, this is a Fedora system, I'm not
sure how many, if any, changes they did. Should I
follow the Fedora wiki instruction on how recompile
a kernel or should I simply use one from kernel.org?
  b) While I do not like it, I'm ready to give it a
try if really, really necessary. Where do I find the
instruction of how to proceed, specifically (again)
for a Fedora system?

Thanks a lot in advance,

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-11-04 Thread Piergiorgio Sartor
On Sun, Nov 04, 2012 at 03:07:35PM -0500, Alan Stern wrote:
[...]
> The first thing you should do is test a 3.6 kernel with 
> CONFIG_USB_DEBUG enabled and post the dmesg log showing the problem.  
> That will help indicate what is going wrong.
> 
> If that doesn't suggest a solution, the next thing you should do is use 
> git bisect to find which change to the kernel is responsible for the 
> errors.  For this to work, you have to be able to tell reliably whether 
> the problem is present or not with a particular kernel, but it sounds 
> like that should be easy enough.
> 
> Alan Stern

Forgot to thank you for the reply...

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB issue with kernel 3.6

2012-11-04 Thread Piergiorgio Sartor
On Sun, Nov 04, 2012 at 03:07:35PM -0500, Alan Stern wrote:
[...]
> The first thing you should do is test a 3.6 kernel with 
> CONFIG_USB_DEBUG enabled and post the dmesg log showing the problem.  
> That will help indicate what is going wrong.

I guess this is a compile time option.
Is it any run-time possibility to achieve a similar result?

> If that doesn't suggest a solution, the next thing you should do is use 
> git bisect to find which change to the kernel is responsible for the 
> errors.  For this to work, you have to be able to tell reliably whether 
> the problem is present or not with a particular kernel, but it sounds 
> like that should be easy enough.

This seems a bit out of reach, for several reasons.
I'll have to use a Fedora kernel, which might have its
own patches and I suspect things will become complicated
beyond my capability to manage them.
I might try on a different PC, with a vanilla kernel,
not sure...

I was hoping someone had in mind what could be the hotspot
and the possibility to check it quickly.

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


USB issue with kernel 3.6

2012-11-04 Thread Piergiorgio Sartor
Hi all,

I've a strange problem with the USB subsystem, under
kernel 3.6, this was not happening under 3.5.
The system is Fedora, so I'll use their kernel numbers,
and the related bug report is this:

https://bugzilla.redhat.com/show_bug.cgi?id=866166

I've some USB hubs cascaded in order to have 10 ports
available.
Two configurations: one is with 3 hubs, 4 ports each,
where 2 hubs are connected to the first one.
Second configuration is similar, it only uses 1 hubs
with 7 ports followed by a 5 ports one.

To the resulting 10 ports are connected to 10 USB hard
disks. This is a "feasibility study" on the possibility
to create a storage device in this condition.

The 10 HDDs are combined in RAID-6.

Now, until kernel 3.5.6, everything was working fine,
I can achieve around 43MB/sec transfer rate (read) from
such setup, with high reliability.
Of course, some times one HDD would flight away, but
nothing really serious as a problem. In fact this was
one of the point I wanted to check.

Since kernel 3.6.x, after short time accessing the
RAID (or in parallel the 10 HDDs, independently from
the md layer), I get error -110 from the USB subsystem
and everything dies.
It seems to me, but I'm not sure, reading a single
HDD does not lead to this problem.

Furthermore, the USB ehci subsystem does not work
any more, even after waiting for the usual timeouts,
connecting a USB stick to some other USB port in the
PC results in absolute nothing.

Removing ehci, via sysfs since it is compiled in,
result in a system working, but at USB 1.1 speed.

My suspect is that something in kernel 3.6 changed
the behaviour under load of the USB subsystem, showing
up only in such extreme cases, I guess.
Probably it is a borderline test case, nevertheless,
still a failure.

The USB HDD controller are JMicron, I do not know if
it does matter.

Any idea or suggestion on how to test further?

Thanks a lot in advance,

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html