I sent a bug report to linux-b...@nvidia.com regarding the problem. I had a very nice conversation with Pierre-Loup A. Griffais, who reports that the problem has been fixed and that the fix can be expected in the next 304.xx release.

/Daniel

Our conversation:

Quoting "Pierre-Loup A. Griffais":
Hi Daniel,

I don't have an estimate, but it made it into the current release series (304.xx), so whenever there is a new release from that.

Feel free to update the bug report with this information.

Thanks a lot,
- Pierre-Loup

On 08/24/2012 08:27 PM, Daniel Anderberg wrote:
Hi Pierre-Loup,

Excellent! Do you dare to venture a guess as to when this will be
available in a released driver?

Would it be OK if I attached our conversation below to the Debian bug report?

Best regards
Daniel Anderberg

Quoting "Pierre-Loup A. Griffais":

Hi Daniel,

This should now be fixed; again, thanks a lot for your detailed report.
 - Pierre-Loup

On 08/20/2012 06:49 PM, Pierre-Loup Griffais wrote:
Hi Daniel,

Thanks a lot for your bug report and your in-depth research! I agree
with your initial findings; the result of the LoaderSymbol call not
being cached in this case just looks like an oversight. I'm sorry for
the inconvenience that this has caused you and will look into fixing it.

Thanks,
 - Pierre-Loup

On 08/20/2012 09:02 AM, Daniel Anderberg wrote:
Hello,

I have been having problems with X locking up, requiring me to log in
from another computer and kill X. This lockup is sporadic but quite
frequent.
nvidia-bug-report.log.gz attached.

Problems verified both with 302.17 and 304.37 (the current Debian
testing and unstable packages).

I have already filed a bug report with Debian
(http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=684941). The Debian
maintainer suggested I start a thread over at nvnews.net, but since I
still can't post there (nvnews support staff have been contacted), I
figured that a mail to linux-bugs might not be a bad idea.

In analyzing the problem I have arrived at the following points of data:

When X freezes, if I attach gdb and get a back-trace I ALWAYS get the same:
  - X calls into nvidia_drv.so
  - nvidia_drv.so calls LoaderSymbol("rrPrivKeyRec")
  - LoaderSymbol calls dlsym that locks a mutex
  - Signal interrupts
  - X calls into nvidia_drv.so
  - nvidia_drv.so calls LoaderSymbol("rrPrivKeyRec")
  - LoaderSymbol calls dlsym that attempts to lock the same mutex
  * Deadlock

Example bt attached.
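
The re-entry in the sequence above can be shown in miniature without
hanging a process. This is only a sketch under assumptions: lookup_lock
is a hypothetical stand-in for whatever lock dlsym takes internally, and
an error-checking mutex is used so that the second lock attempt reports
EDEADLK instead of blocking forever as a default mutex could.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the lock taken inside the symbol lookup. */
static pthread_mutex_t lookup_lock;

/*
 * Locks the mutex twice from the same thread, as happens when a signal
 * handler re-enters the lookup while the interrupted code holds the lock.
 * Returns the result of the second lock attempt, or -1 on setup failure.
 */
int relock_from_same_thread(void)
{
    pthread_mutexattr_t attr;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    /* Error-checking type: a relock by the owner fails instead of hanging. */
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK) != 0 ||
        pthread_mutex_init(&lookup_lock, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&lookup_lock);       /* interrupted code path */
    rc = pthread_mutex_lock(&lookup_lock);  /* re-entry from the handler */

    pthread_mutex_unlock(&lookup_lock);
    pthread_mutex_destroy(&lookup_lock);
    return rc;
}
```

With a default (non-checking) mutex the second call would typically
never return, which matches the frozen X server seen in the back-trace.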

Instrumenting dlsym via LD_AUDIT shows that while I move the mouse
pointer, X will resolve the symbol "rrPrivKeyRec" approx. 1000 times
every second.
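
A minimal sketch of this kind of instrumentation, assuming glibc's
rtld-audit interface on a 64-bit system. It is not the tool used for the
measurement above; watched_hits() and the reporting interval are
hypothetical names and choices made for illustration.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Number of times the watched symbol has been bound or looked up. */
static unsigned long hits;

/* Hypothetical accessor, added only so the counter can be inspected. */
unsigned long watched_hits(void) { return hits; }

/* Required entry point: accept the audit interface version offered. */
unsigned int la_version(unsigned int version) { return version; }

/* Ask to be told about bindings to and from every loaded object. */
unsigned int la_objopen(struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
{
    (void)map; (void)lmid; (void)cookie;
    return LA_FLG_BINDTO | LA_FLG_BINDFROM;
}

/* Called on symbol binding; dlsym lookups carry LA_SYMB_DLSYM in *flags. */
uintptr_t la_symbind64(Elf64_Sym *sym, unsigned int ndx,
                       uintptr_t *refcook, uintptr_t *defcook,
                       unsigned int *flags, const char *symname)
{
    (void)ndx; (void)refcook; (void)defcook; (void)flags;
    if (strcmp(symname, "rrPrivKeyRec") == 0 && ++hits % 1000 == 0)
        fprintf(stderr, "rrPrivKeyRec resolved %lu times\n", hits);
    return sym->st_value;  /* leave the binding unchanged */
}
```

Built as a shared object (for example with gcc -shared -fPIC) and named
in the LD_AUDIT environment variable when starting the process, it
reports every thousandth resolution of the symbol on stderr.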

If I LD_PRELOAD the attached (admittedly crude and not generally
applicable) nvidia_workaround when starting the X server the problem
goes away (used to occur every few hours, now running well for half a
week).

My conclusions are the following:

- nvidia_drv.so indirectly uses dlsym in a signal handler.
- dlsym is not among the async-signal-safe functions according to
   POSIX.1-2008.
- Violating the async-signal-safe rule hundreds of times per second greatly
   increases the risk of deadlock.
- From an efficiency standpoint, using LoaderSymbol in a hot path seems
   suboptimal, and caching the return value from
   LoaderSymbol("rrPrivKeyRec") seems prudent.
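
The caching suggested in the last point could look roughly like the
sketch below. It is an illustration only, not the driver's code:
slow_lookup() is a hypothetical stand-in for LoaderSymbol(), and
driver_init(), hot_path_get_key() and lookup_call_count() are invented
names. The lookup runs once during initialization, outside any signal
handler, and the hot path only reads the stored pointer.

```c
#include <stddef.h>
#include <string.h>

static int lookup_calls;          /* how often the slow lookup ran */
static int rr_priv_key_storage;   /* dummy object the symbol refers to */
static void *cached_rr_priv_key;  /* result stored at initialization */

/* Hypothetical stand-in for LoaderSymbol(), which ends up in dlsym. */
static void *slow_lookup(const char *name)
{
    lookup_calls++;
    return strcmp(name, "rrPrivKeyRec") == 0 ? &rr_priv_key_storage : NULL;
}

/* Resolve the symbol once, before any signal handler can run. */
void driver_init(void)
{
    cached_rr_priv_key = slow_lookup("rrPrivKeyRec");
}

/* Hot path: no lookup and no lock, just a pointer read. */
void *hot_path_get_key(void)
{
    return cached_rr_priv_key;
}

/* Accessor for the lookup counter, for inspection only. */
int lookup_call_count(void)
{
    return lookup_calls;
}
```

However often hot_path_get_key() is called afterwards, the slow lookup
has run only once, so neither the per-event cost nor the call to a
function that is not async-signal-safe remains in the handler.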

Best wishes
Daniel Anderberg

