Re: [PyCUDA] PyCUDA Digest, Vol 29, Issue 7

GARRETT B WRIGHT Wed, 17 Nov 2010 13:06:03 -0800

On Linux this kernel timeout is only an issue if the GPU you compute on is
also driving a display.  If you have multiple cards, or an onboard graphics
solution, run CUDA on the device that is not subject to the timeout.  A cheap
graphics card for the display in an open slot (CUDA capable or otherwise) is a
simple hardware solution.  I have done this on 3 Linux machines.
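
As a rough sketch of how a program could find such a device itself: PyCUDA
exposes a per-device attribute, KERNEL_EXEC_TIMEOUT, that reports whether
kernels on that device have a run time limit. The GpuInfo record and the
function names below are made up for illustration and are not part of PyCUDA.

```python
from collections import namedtuple

# Hypothetical record type for this sketch; not part of PyCUDA.
GpuInfo = namedtuple("GpuInfo", ["index", "name", "has_timeout"])


def list_gpus():
    """Ask PyCUDA for every device and whether it has a kernel run time limit."""
    # Imported lazily so pick_compute_gpu() can be used without a GPU present.
    import pycuda.driver as cuda

    cuda.init()
    attr = cuda.device_attribute.KERNEL_EXEC_TIMEOUT
    gpus = []
    for i in range(cuda.Device.count()):
        dev = cuda.Device(i)
        gpus.append(GpuInfo(i, dev.name(), bool(dev.get_attribute(attr))))
    return gpus


def pick_compute_gpu(gpus):
    """Return the first GPU without a run time limit, or None if all have one."""
    for gpu in gpus:
        if not gpu.has_timeout:
            return gpu
    return None
```

On a machine with PyCUDA installed, pick_compute_gpu(list_gpus()) would give
the card to create the context on; a result of None means every card is
subject to the timeout and kernels have to be kept short.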


I do not know if this is the same on Windows, but maybe somebody with a
Windows box can chime in here...

On Wed, Nov 17, 2010 at 3:00 PM, <pycuda-requ...@tiker.net> wrote:

> Today's Topics:
>
>   1. Dealing with driver timeouts in long running kernels (Dan Goodman)
>   2. Re: Dealing with driver timeouts in long running kernels
>      (Cyrus Omar)
>   3. Re: Dealing with driver timeouts in long running kernels
>      (Frédéric Bastien)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 17 Nov 2010 02:25:29 +0100
> From: Dan Goodman <dg.pyc...@thesamovar.net>
> To: "pycuda@tiker.net" <pycuda@tiker.net>
> Subject: [PyCUDA] Dealing with driver timeouts in long running kernels
> Message-ID: <4ce32f09.7000...@thesamovar.net>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi all,
>
> I have a problem that can be split into pieces of different sizes.
> Essentially, the larger the size is, the more efficiently it runs.
> However, on Windows (and I understand similar things happen on Linux) a
> single GPU kernel launch cannot take more than 5 seconds on XP or 2
> seconds on Vista/Win7, or the Timeout Detection and Recovery (TDR)
> system will terminate it and raise an error (also causing the screen to
> flash). My problem is that I want to run my kernels for as long as
> possible for maximum efficiency, but I don't know how long the kernel
> launch will take as a function of problem size until I run it. I could
> profile my functions and work out something that would probably work,
> but this is for a software package that will be used by third parties,
> and I'd like it to be handled automatically (and preferably without the
> screen flashes, which will disturb users).
>
> Has anyone worked out a good way of dealing with this?
>
> One option is to increase the TDR window as detailed in:
>
> http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
>
> This might have adverse effects though, and I'm not sure all users of my
> package would be happy changing these values (it's also not automatic).
>
> Another option is to have two GPUs, one of which is not attached to a
> monitor and only used in compute mode (as discussed at
> http://forums.nvidia.com/index.php?showtopic=171630). Again, fine for me
> (I have two), but not so good for users who I guess in many cases will
> only have one.
>
> A final option that I thought of would be to check for a launch timeout
> failure after each kernel launch, and if it happens, divide my problem
> size by two and try again, repeating until I don't get any launch
> failures. The trouble with this approach is that I'll get multiple
> failures and screen flashes before it settles down to a value that
> works, wasting a little bit of time but more importantly being quite
> alarming. It also doesn't feel very elegant... ;-)
>
> Any other ideas or experiences dealing with this problem?
>
> Dan
>
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 16 Nov 2010 21:33:34 -0500
> From: Cyrus Omar <cy...@cmu.edu>
> To: Dan Goodman <dg.pyc...@thesamovar.net>
> Cc: "pycuda@tiker.net" <pycuda@tiker.net>
> Subject: Re: [PyCUDA] Dealing with driver timeouts in long running
>        kernels
> Message-ID:
>        <aanlktikfyj56rvn3ailxquu5jbwbds=bzosk7d_n_...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> On Tue, Nov 16, 2010 at 20:25, Dan Goodman <dg.pyc...@thesamovar.net>
> wrote:
>
> > A final option that I thought of would be to check for a launch timeout
> > failure after each kernel launch, and if it happens, divide my problem
> > size by two and try again, repeating until I don't get any launch
> > failures. The trouble with this approach is that I'll get multiple
> > failures and screen flashes before it settles down to a value that
> > works, wasting a little bit of time but more importantly being quite
> > alarming. It also doesn't feel very elegant... ;-)
>
>
> This is risky, as per the TDR webpage you linked to:
>
> > Minor changes were made in Windows Vista SP1 to improve the user
> > experience in cases of frequent and rapidly occurring GPU hangs.
> > Repetitive GPU hangs indicate that the graphics hardware has not
> > recovered successfully. In these instances, the system must be shut
> > down and restarted to fully reset the graphics hardware. If the
> > operating system detects that six or more GPU hangs and subsequent
> > recoveries occur within 1 minute, then the following GPU hang is
> > treated as a system bug check.
> >
> Seems the best option is to just disable TDR through the registry while the
> application is running and inform the user that that is what you're doing
> and what it means.
>
> ------------------------------
>
> Message: 3
> Date: Wed, 17 Nov 2010 09:46:21 -0500
> From: Frédéric Bastien <no...@nouiz.org>
> To: Cyrus Omar <cy...@cmu.edu>
> Cc: "pycuda@tiker.net" <pycuda@tiker.net>
> Subject: Re: [PyCUDA] Dealing with driver timeouts in long running
>        kernels
> Message-ID:
>        <aanlktinr_ynpgfgw0lky1jenm_m+4x5+3qnfyzrgg...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> On Tue, Nov 16, 2010 at 9:33 PM, Cyrus Omar <cy...@cmu.edu> wrote:
>
> > On Tue, Nov 16, 2010 at 20:25, Dan Goodman <dg.pyc...@thesamovar.net>
> > wrote:
> >
> >> A final option that I thought of would be to check for a launch timeout
> >> failure after each kernel launch, and if it happens, divide my problem
> >> size by two and try again, repeating until I don't get any launch
> >> failures. The trouble with this approach is that I'll get multiple
> >> failures and screen flashes before it settles down to a value that
> >> works, wasting a little bit of time but more importantly being quite
> >> alarming. It also doesn't feel very elegant... ;-)
> >
> >
> > This is risky, as per the TDR webpage you linked to:
> >
>
> Why not start with a small size and, if a launch takes less than half of
> the 2 or 5 second limit (depending on the OS), double the size for the
> next launch?
>
> Fred
>
> ------------------------------
>
> _______________________________________________
> PyCUDA mailing list
> PyCUDA@tiker.net
> http://lists.tiker.net/listinfo/pycuda
>
>
> End of PyCUDA Digest, Vol 29, Issue 7
> *************************************
>
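
Frédéric's suggestion in the quoted digest (start small, and double the size
while a launch finishes in under half of the limit) could be sketched roughly
as below. This is only an illustration: process_adaptively, run_chunk and
limit_s are made-up names, run_chunk has to block until the kernel has really
finished, and doubling is only safe if the run time grows no faster than
linearly with the problem size. With PyCUDA the elapsed time could instead be
measured with pycuda.driver.Event (record, synchronize, time_till).

```python
import time


def process_adaptively(total, run_chunk, limit_s, start_size=1,
                       timer=time.perf_counter):
    """Process `total` items in chunks and return the chunk sizes used.

    The chunk size doubles after each full-size chunk that finished in
    under half of `limit_s` seconds, so the next chunk should still fit
    inside the limit if run time scales at most linearly with size.
    """
    sizes = []
    size = start_size
    done = 0
    while done < total:
        n = min(size, total - done)      # last chunk may be smaller
        t0 = timer()
        run_chunk(done, n)               # must block until the work is done
        elapsed = timer() - t0
        sizes.append(n)
        done += n
        # Grow only on evidence from a full-size chunk.
        if n == size and elapsed < limit_s / 2:
            size *= 2
    return sizes
```

Passing limit_s=2.0 would correspond to the Vista/Win7 limit mentioned in the
digest and 5.0 to XP. Since the size never grows after a launch that took more
than half the limit, this avoids the repeated failures and screen flashes of
the halving approach, at the cost of a few short launches at the start.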

