On Tue, 11 Nov 2014 19:05:37 +0100
Hans Verkuil <hverk...@xs4all.nl> wrote:
> On 11/11/2014 06:46 PM, Andrey Utkin wrote:
> > At Bluecherry, we have issues with servers which have 3 solo6110
> > cards (each card has up to 16 analog video cameras connected, all
> > being actively read).
> > The kernel I last tested on such a server is based on linux-next
> > of October 31, with a few patches of mine (all are in review for
> > upstream):
> > https://github.com/krieger-od/linux/ . The HEAD commit is
> > 949e18db86ebf45acab91d188b247abd40b6e2a1 at the moment.
> > 
> > The problem is the following: after ~1 hour of uptime with working
> > application reading the streams, one card (the same one every time)
> > stops producing interrupts (counter in /proc/interrupts freezes),
> > and all threads reading from that card hang forever in
> > ioctl(VIDIOC_DQBUF). The application uses libavformat (ffmpeg) API
> > to read the corresponding /dev/videoX devices of H264 encoders.
> > Restarting the application doesn't help; the interrupt counter
> > just increases by 64. To recover, we need a reboot or a
> > programmatic PCI device reset via
> > "echo 1 > /sys/bus/pci/devices/0000\:03\:05.0/reset",
> > which requires stopping the app and unloading the driver, and is
> > obviously not a solution.
> > 
> > We had this issue for a long time, even before we used libavformat
> > for reading from such sources.
> > A few days ago, we had standalone ffmpeg processes running stably
> > for several days. The kernel was 3.17; the only probably-relevant
> > code change compared to the above-mentioned revision is an
> > additional bool variable set in solo_enc_v4l2_isr() and checked in
> > solo_ring_thread() to decide whether to run or skip
> > solo_handle_ring(). The variable was guarded with
> > spin_lock_irqsave(). I am not sure if it makes any difference; I
> > will try it again eventually.
> > 
> > Any thoughts? Could a bug in the driver code be causing this
> > (please point out which areas of code to review/fix)? Or is it a
> > hopeless hardware issue? How can we determine for sure whether it
> > is the former or the latter?
> 
> I would first try to exclude hardware issues: since you say it is
> always the same card, try either replacing it or swapping it with
> another solo card and see if the problem follows the card or not. If
> it does, then it is likely a hardware problem. If it doesn't, then it
> suggests a race condition in the interrupt handling somewhere.
> 
> Regards,
> 
>       Hans

CC'ing Curtis, hope you don't mind.

It's just coincidence. This has been a long-standing issue, and it
depends only on having enough cards.

One of the problems I had in trying to weed this one out was that I
didn't have the right hardware (only one 16-port card), and my guess
is that Andrey is in the same position.
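
For anyone trying to catch the moment a card goes quiet: the frozen
/proc/interrupts counter described in the report can be watched for with
a small script. What follows is only a rough sketch in Python, not
something from the driver tree. It assumes the relevant IRQ lines carry
the name "solo6x10" (check your own /proc/interrupts), and the function
names parse_interrupts, stalled_irqs and watch are made up for
illustration.

```python
import re
import time


def parse_interrupts(text, pattern):
    """Return {irq_label: total_count} for /proc/interrupts lines whose
    description matches `pattern`. Counts are summed across all CPUs."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = fields[:ncpu]
        desc = " ".join(fields[ncpu:])
        if not re.search(pattern, desc):
            continue
        if not counts or not all(c.isdigit() for c in counts):
            continue
        totals[label.strip()] = sum(int(c) for c in counts)
    return totals


def stalled_irqs(before, after):
    """IRQ labels present in both snapshots whose total did not change."""
    return sorted(irq for irq, n in after.items() if before.get(irq) == n)


def watch(pattern="solo6x10", interval=10.0, path="/proc/interrupts"):
    """Poll forever and report matching IRQs whose counter stopped moving.
    The pattern default is an assumption about how the IRQ is labelled."""
    with open(path) as f:
        prev = parse_interrupts(f.read(), pattern)
    while True:
        time.sleep(interval)
        with open(path) as f:
            cur = parse_interrupts(f.read(), pattern)
        for irq in stalled_irqs(prev, cur):
            print(f"{time.ctime()}: IRQ {irq} stuck at {cur[irq]}")
        prev = cur
```

Calling watch() on the affected server prints a line whenever a matching
IRQ count has not moved over one interval, which gives a timestamp to
correlate with dmesg. On a shared IRQ line another device can keep the
counter moving, so a stall on one card may go unnoticed there.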
