Look at the info presented here:

http://stackoverflow.com/questions/10557816/scheduling-gpu-resources-using-the-sun-grid-engine-sge
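
Roughly, the consumable-resource setup discussed there looks like the
sketch below (the complex name "gpu", the per-host counts, and job.sh
are only illustrative):

  # 1) define a consumable complex via "qconf -mc", one line like:
  #    gpu    gpu    INT    <=    YES    YES    0    0

  # 2) tell each exec host how many GPUs it offers:
  qconf -aattr exechost complex_values gpu=4 node1
  qconf -aattr exechost complex_values gpu=4 node2
  qconf -aattr exechost complex_values gpu=2 node3

  # 3) jobs then request GPUs and the scheduler decrements the
  #    per-host count:
  qsub -l gpu=1 job.sh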

Ian

On Mon, Apr 14, 2014 at 1:29 PM, Feng Zhang <[email protected]> wrote:
> Thanks, Ian and Gowtham!
>
>
> Those are very nice instructions. One problem I have is, for example:
>
> node1,  number of gpu=4
> node2,  number of gpu=4
> node3,  number of gpu=2
>
> So in total I have 10 GPUs.
>
> Right now, user A has a serial GPU job, which takes one GPU on
> node1 (I don't know which GPU, though). So 3 GPUs on node1, 4 on
> node2, and 2 on node3 are still free for jobs.
>
> I submit one job with PE=8. SGE allocates all 3 nodes to me, with 8
> GPU slots. The problem now is: how does my job know which GPUs it can
> use on node1?
>
> Best
>
>
>
>
> On Mon, Apr 14, 2014 at 4:13 PM, Ian Kaufman <[email protected]> wrote:
>> Again, look into using it as a consumable resource as Gowtham posted above.
>>
>> Ian
>>
>> On Mon, Apr 14, 2014 at 11:57 AM, Feng Zhang <[email protected]> wrote:
>>> Thanks, Reuti,
>>>
>>> The socket solution looks like it only works well for serial jobs,
>>> not PE jobs, right?
>>>
>>> Our cluster has different nodes: some have 2 GPUs each, while others
>>> have 4 GPUs each. Most of the user jobs are PE jobs; some are serial.
>>>
>>> The socket solution can even work for PE jobs, but as I understand
>>> it, it is not efficient? Since each node has, for example, 4 queues,
>>> if one user submits a PE job to one queue, he/she cannot use the
>>> other GPUs in the other queues?
>>>
>>> On Mon, Apr 14, 2014 at 2:16 PM, Reuti <[email protected]> wrote:
>>>> On 14.04.2014, at 20:06, Feng Zhang wrote:
>>>>
>>>>> Thanks, Ian!
>>>>>
>>>>> I haven't checked the GPU load sensor in detail, either. It sounds
>>>>> to me like it only handles the number of GPUs allocated to a job,
>>>>> but the job doesn't know which GPUs it actually gets, so it cannot
>>>>> set CUDA_VISIBLE_DEVICES (some programs need this env variable to be
>>>>> set). This can be done by writing some scripts/programs, but to me
>>>>> that is not an accurate solution, since some jobs may still happen
>>>>> to collide with each other on the same GPU on a multi-GPU node. If
>>>>> GE could keep a record of the GPUs allocated to a job, that would be
>>>>> perfect.
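>>>>>
>>>>> (To illustrate the kind of script I mean -- and why it is only a
>>>>> best-effort fix: assuming nvidia-smi is available on the node, the
>>>>> job could pick the currently least-busy device at start, but two
>>>>> jobs starting at the same time can still choose the same GPU:)
>>>>>
>>>>>   # pick the GPU index with the lowest current utilization (racy)
>>>>>   idx=$(nvidia-smi --query-gpu=index,utilization.gpu \
>>>>>                    --format=csv,noheader,nounits \
>>>>>         | sort -t, -k2 -n | head -1 | cut -d, -f1)
>>>>>   export CUDA_VISIBLE_DEVICES=$idx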
>>>>
>>>> Like the option to request sockets instead of cores, which I posted
>>>> about in the last couple of days, you can use a similar approach and
>>>> get the number of the granted GPU out of the queue name.
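>>>>
>>>> A rough sketch of that idea (the per-GPU queue names gpu0.q, gpu1.q,
>>>> ..., each with a single slot, are just an example, not a required
>>>> naming):
>>>>
>>>>   # serial job: $QUEUE holds the cluster queue this task runs in,
>>>>   # e.g. "gpu2.q" -> GPU 2
>>>>   q=${QUEUE#gpu}
>>>>   export CUDA_VISIBLE_DEVICES=${q%.q}
>>>>
>>>>   # PE job: $PE_HOSTFILE has one line per granted host/queue
>>>>   # ("host  slots  queue@host  ..."), so each node can collect the
>>>>   # queue names -- and hence GPU numbers -- granted on it (adjust
>>>>   # the hostname form if the file contains FQDNs):
>>>>   ids=$(awk -v h="$(hostname)" '$1==h {split($3,a,"@"); q=a[1];
>>>>         sub("gpu","",q); sub("\\.q$","",q); printf "%s,",q}' "$PE_HOSTFILE")
>>>>   export CUDA_VISIBLE_DEVICES=${ids%,}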
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> On Mon, Apr 14, 2014 at 1:46 PM, Ian Kaufman <[email protected]> 
>>>>> wrote:
>>>>>> I believe there already is support for GPUs - there is a GPU Load
>>>>>> Sensor in Open Grid Engine. You may have to build it yourself; I
>>>>>> haven't checked to see whether it comes pre-packaged.
>>>>>>
>>>>>> Univa has Phi support, and I believe OGE/OGS has it as well, or at
>>>>>> least has been working on it.
>>>>>>
>>>>>> Ian
>>>>>>
>>>>>> On Mon, Apr 14, 2014 at 10:35 AM, Feng Zhang <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is there any plan to implement GPU resource management in SGE in
>>>>>>> the near future, like Slurm or Torque have? There are some ways to
>>>>>>> do this using scripts/programs, but I wonder whether SGE itself can
>>>>>>> recognize and manage GPUs (and Phi). It doesn't need to be
>>>>>>> complicated or powerful, just do the basic work.
>>>>>>>
>>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>>
>>>>
>>
>>
>>



-- 
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
