On Linux, this kernel timeout is only an issue if the GPU you compute on is also driving a display. If you have multiple cards, or onboard graphics, run CUDA on the device that is not subject to the timeout. A cheap graphics card for the display in an open slot (CUDA-capable or otherwise) is a simple hardware solution. I have done this on three Linux machines.
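
With PyCUDA you can check which devices the watchdog applies to before picking one. A rough sketch, assuming the driver reports the watchdog through the KERNEL_EXEC_TIMEOUT device attribute (nonzero meaning a run-time limit is in force). The helper names are my own and not part of PyCUDA:

```python
def devices_without_timeout(devices, timeout_attr):
    """Return the devices that report no kernel run-time limit.

    `devices` is any iterable of objects with a get_attribute() method
    (pycuda.driver.Device qualifies); `timeout_attr` is the attribute key,
    normally pycuda.driver.device_attribute.KERNEL_EXEC_TIMEOUT.
    """
    return [dev for dev in devices if not dev.get_attribute(timeout_attr)]


def find_compute_device():
    """Hypothetical helper: first CUDA device without the watchdog, or None."""
    # Imported lazily so the filter above can be used without a GPU present.
    import pycuda.driver as cuda

    cuda.init()
    all_devices = [cuda.Device(i) for i in range(cuda.Device.count())]
    untimed = devices_without_timeout(
        all_devices, cuda.device_attribute.KERNEL_EXEC_TIMEOUT
    )
    return untimed[0] if untimed else None
```

If find_compute_device() returns a device, calling make_context() on it gives you a context on the card that is not driving the display. If it returns None, every card on the machine is subject to the timeout.
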
I do not know whether the timeout behaves the same way on Windows, but maybe somebody with a Windows box can chime in here...

On Wed, Nov 17, 2010 at 3:00 PM, <pycuda-requ...@tiker.net> wrote:
> Today's Topics:
>
>    1. Dealing with driver timeouts in long running kernels (Dan Goodman)
>    2. Re: Dealing with driver timeouts in long running kernels (Cyrus Omar)
>    3. Re: Dealing with driver timeouts in long running kernels (Frédéric Bastien)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 17 Nov 2010 02:25:29 +0100
> From: Dan Goodman <dg.pyc...@thesamovar.net>
> To: "pycuda@tiker.net" <pycuda@tiker.net>
> Subject: [PyCUDA] Dealing with driver timeouts in long running kernels
>
> Hi all,
>
> I have a problem that can be split into pieces of different sizes.
> Essentially, the larger the size is, the more efficiently it runs.
> However, on Windows (and I understand similar things happen on Linux) a
> single GPU kernel launch cannot take more than 5 seconds on XP or 2
> seconds on Vista/Win7, or the Timeout Detection and Recovery (TDR)
> system will terminate it and raise an error (also causing the screen to
> flash). My problem is that I want to run my kernels for as long as
> possible for maximum efficiency, but I don't know how long the kernel
> launch will take as a function of problem size until I run it. I could
> profile my functions and work out something that would probably work,
> but this is for a software package that will be used by third parties,
> and I'd like it to be handled automatically (and preferably without the
> screen flashes, which will disturb users).
>
> Has anyone worked out a good way of dealing with this?
>
> One option is to increase the TDR window as detailed in:
>
> http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
>
> This might have adverse effects though, and I'm not sure all users of my
> package would be happy changing these values (it's also not automatic).
>
> Another option is to have two GPUs, one of which is not attached to a
> monitor and only used in compute mode (as discussed at
> http://forums.nvidia.com/index.php?showtopic=171630). Again, fine for me
> (I have two), but not so good for users who I guess in many cases will
> only have one.
>
> A final option that I thought of would be to check for a launch timeout
> failure after each kernel launch, and if it happens, divide my problem
> size by two and try again, repeating until I don't get any launch
> failures. The trouble with this approach is that I'll get multiple
> failures and screen flashes before it settles down to a value that
> works, wasting a little bit of time but more importantly being quite
> alarming. It also doesn't feel very elegant... ;-)
>
> Any other ideas or experiences dealing with this problem?
>
> Dan
>
> ------------------------------
>
> Message: 2
> Date: Tue, 16 Nov 2010 21:33:34 -0500
> From: Cyrus Omar <cy...@cmu.edu>
> To: Dan Goodman <dg.pyc...@thesamovar.net>
> Cc: "pycuda@tiker.net" <pycuda@tiker.net>
> Subject: Re: [PyCUDA] Dealing with driver timeouts in long running kernels
>
> On Tue, Nov 16, 2010 at 20:25, Dan Goodman <dg.pyc...@thesamovar.net> wrote:
>
> > A final option that I thought of would be to check for a launch timeout
> > failure after each kernel launch, and if it happens, divide my problem
> > size by two and try again, repeating until I don't get any launch
> > failures. The trouble with this approach is that I'll get multiple
> > failures and screen flashes before it settles down to a value that
> > works, wasting a little bit of time but more importantly being quite
> > alarming. It also doesn't feel very elegant... ;-)
>
> This is risky, as per the TDR webpage you linked to:
>
> > Minor changes were made in Windows Vista SP1 to improve the user
> > experience in cases of frequent and rapidly occurring GPU hangs.
> > Repetitive GPU hangs indicate that the graphics hardware has not
> > recovered successfully. In these instances, the system must be shut
> > down and restarted to fully reset the graphics hardware. If the
> > operating system detects that six or more GPU hangs and subsequent
> > recoveries occur within 1 minute, then the following GPU hang is
> > treated as a system bug check.
>
> Seems the best option is to just disable TDR through the registry while the
> application is running and inform the user that that is what you're doing
> and what it means.
>
> ------------------------------
>
> Message: 3
> Date: Wed, 17 Nov 2010 09:46:21 -0500
> From: Frédéric Bastien <no...@nouiz.org>
> To: Cyrus Omar <cy...@cmu.edu>
> Cc: "pycuda@tiker.net" <pycuda@tiker.net>
> Subject: Re: [PyCUDA] Dealing with driver timeouts in long running kernels
>
> On Tue, Nov 16, 2010 at 9:33 PM, Cyrus Omar <cy...@cmu.edu> wrote:
>
> > On Tue, Nov 16, 2010 at 20:25, Dan Goodman <dg.pyc...@thesamovar.net> wrote:
> >
> >> A final option that I thought of would be to check for a launch timeout
> >> failure after each kernel launch, and if it happens, divide my problem
> >> size by two and try again, repeating until I don't get any launch
> >> failures. The trouble with this approach is that I'll get multiple
> >> failures and screen flashes before it settles down to a value that
> >> works, wasting a little bit of time but more importantly being quite
> >> alarming. It also doesn't feel very elegant... ;-)
> >
> > This is risky, as per the TDR webpage you linked to:
>
> Why not starting with a small size and if it take less then half of 2 or 5
> seconds depending of the os, you double the size for the next time?
>
> Fred
>
> ------------------------------
>
> End of PyCUDA Digest, Vol 29, Issue 7
> *************************************
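
Fred's suggestion in the quoted digest (start small, double while a launch takes under half the limit) could look roughly like the following. This is only a sketch: run_chunk is a hypothetical callable that launches the kernel on n items starting at a given offset and blocks until it has finished, and time_limit is whatever the OS watchdog allows (2 or 5 seconds in Dan's message). Kernel time is not guaranteed to scale linearly with size, so the half-limit margin is a heuristic and not a guarantee.

```python
import time


def next_chunk_size(size, elapsed, time_limit, max_size):
    """Double the chunk size only if the last launch used under half the limit."""
    if elapsed < time_limit / 2.0:
        return min(size * 2, max_size)
    return size


def run_adaptively(run_chunk, total, start_size, time_limit,
                   clock=time.perf_counter):
    """Process `total` items in chunks, growing the chunk size while it looks safe.

    `run_chunk(offset, n)` is a hypothetical callable supplied by the caller.
    Returns the list of chunk sizes that were launched.
    """
    done, size, sizes = 0, start_size, []
    while done < total:
        n = min(size, total - done)   # last chunk may be a short remainder
        t0 = clock()
        run_chunk(done, n)            # must block until the kernel has finished
        elapsed = clock() - t0
        sizes.append(n)
        done += n
        size = next_chunk_size(size, elapsed, time_limit, total)
    return sizes
```

Unlike halving after a failure, this never triggers the watchdog on purpose, so it avoids both the screen flashes and the repeated-hang bug check Cyrus quotes. The cost is a few inefficient small launches at the start.
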
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda