Re: [OMPI users] Open MPI collectives algorithm selection

2015-05-21 Thread Khalid Hasanov
George,

Thank you a lot. It makes more sense now.

Best regards,
Khalid

On Thu, May 21, 2015 at 4:14 AM, George Bosilca  wrote:

> Khalid,
>
> Rule number zero is always selected by default. If the size you are looking
> for (message or communicator) is larger than zero, then another rule will be
> selected; otherwise zero is the best selection. The same applies to both
> communicator size and message size, which is a consistent approach from my
> perspective.
>
> If you don't want to define the behavior for a particular range you should
> set the algorithm of the range to zero, in which case the control will be
> given back (for that particular range) to the default algorithm selection
> function (the one hardcoded in Open MPI).
>
> So, going back to your example: if what you expect is for no algorithm to be
> selected for communicator sizes smaller than 5, add a rule for communicator
> size zero that uses algorithm zero. In this case, rule 0 will automatically
> be the default until another rule is matched.
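>
> For illustration, a minimal bcast rules file along those lines might look
> like the sketch below; the collective ID and the per-rule fields vary
> between Open MPI versions, so treat the exact numbers as assumptions and
> check the coll/tuned sources for your installation. The inline comments are
> annotations for this message and may need to be removed depending on what
> your version's parser accepts:
>
>   1          # number of collectives described in this file
>   7          # collective ID (bcast in the coll enum of many versions)
>   2          # number of communicator-size sections, in increasing size
>   0          #   comm size 0: applies until a larger section matches
>   1          #   one message-size rule in this section
>   0 0 0 0    #   from msg size 0: algorithm 0 (built-in decision), faninout 0, segsize 0
>   5          #   comm size 5 and above
>   1          #   one message-size rule in this section
>   0 1 0 0    #   from msg size 0: algorithm 1 (IDs are per collective), faninout 0, segsize 0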
>
>   George.
>
>
> On Wed, May 20, 2015 at 7:52 PM, Khalid Hasanov  wrote:
>
>> George,
>>
>> Thank you for your answer.
>>
>> Another confusing thing is that if I use a communicator size which does not
>> exist in the configuration file, some rule from the configuration file will
>> be used anyway.
>> For example, let's say I have a configuration file with two communicator
>> sizes, 5 and 16. If I execute mpirun with any number of processes from 2 up
>> to 15, then the rule for the communicator of size 5 (the first in the config
>> file) is used. If I use mpirun with -n 16 or greater, then the rule for 16
>> (the last in the config file) is used.
>>
>> I am not sure if the exclusive approach you mentioned applies here as
>> well.
>>
>> Thanks.
>>
>> Best regards,
>> Khalid
>>
>>
>>
>> On Thu, May 21, 2015 at 12:05 AM, George Bosilca 
>> wrote:
>>
>>> Khalid,
>>>
>>> The way we designed these rules was to define intervals in a two-dimensional
>>> space (communicator size, message size). You should think of these rules as
>>> exclusive: you match them in the order defined by the configuration file,
>>> and you use the algorithm defined by the last matching rule.
>>>
>>>   George.
>>>
>>>
>>> On Tue, May 19, 2015 at 9:30 PM, Khalid Hasanov 
>>> wrote:
>>>
 Hi Gilles,

 Thank you a lot, it works now.

 Just one minor thing I have noticed. If I use a communicator size which does
 not exist in the configuration file, it will still use the configuration
 file. For example, if I use the previous config file with mpirun -n 4, it
 will use the config for comm size 5 (the first one). The same happens for any
 n less than 16. If n > 16, it will use the config for communicator size 16
 (the second one). I am writing this just in case it is not expected
 behaviour.

 Thanks again.

 Best regards,
 Khalid


 On Wed, May 20, 2015 at 2:12 AM, Gilles Gouaillardet wrote:
>  Hi Khalid,
>
> I checked the source code, and it turns out the rules must be ordered:
> - first by communicator size
> - second by message size
>
> Attached is an updated version of the ompi_tuned_file.conf that you
> should use.
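>
> For reference, the file is laid out as nested counts and sections, roughly
> as sketched below; take this as an outline rather than a specification, and
> check the coll/tuned sources of your Open MPI version for the exact per-rule
> fields:
>
>   <number of collectives>
>   <collective ID>                  (one block per collective)
>   <number of communicator-size sections>
>   <comm size>                      (sections in increasing communicator size)
>   <number of message-size rules>
>   <msg size> <algorithm> <faninout> <segsize>   (rules in increasing message size)
>   ...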
>
> Cheers,
>
> Gilles
>
>
> On 5/20/2015 8:39 AM, Khalid Hasanov wrote:
>
>  Hello,
>
> I am trying to use the coll_tuned_dynamic_rules_filename option.
>
> I am not sure whether I am doing everything right, but my impression is
> that the config file feature does not work as expected.
>
> For example, if I specify the config file as in the attached
> ompi_tuned_file.conf and execute the attached simple broadcast example as:
>
>
>>   mpirun -n 16 --mca coll_tuned_use_dynamic_rules 1  --mca
>> coll_tuned_dynamic_rules_filename ompi_tuned_file.conf   -mca
>> coll_base_verbose 1  bcast_example
>>
>>
>> 
>> I would expect the config file to be ignored at run time, as it does not
>> contain any configuration for communicator size 16. However, it uses the
>> configuration for the last communicator, whose size is 5. I have attached
>> the tuned_output file for more information.
>>
>>  A similar problem exists even if the configuration file contains a config
>> for communicator size 16. For example, I put communicator size 16 first in
>> the configuration file and communicator size 5 second, but it still used the
>> configuration for communicator size 5.
>>
>>  Another interesting thing is that if the second communicator size in the
>> config file is greater than the first one, then it seems to work correctly.
>> At least I tested the case where the first communicator size was 16 and the
>> second was 55.
>>
>>
>>  I used a develo

Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-21 Thread Lev Givon
Received from Rolf vandeVaart on Wed, May 20, 2015 at 07:48:15AM EDT:

(snip)

> I see that you mentioned you are starting 4 MPS daemons.  Are you following
> the instructions here?
> 
> http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
>  

Yes - also
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

> This relies on setting CUDA_VISIBLE_DEVICES which can cause problems for CUDA
> IPC. Since you are using CUDA 7 there is no more need to start multiple
> daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and start a single
> MPS control daemon which will handle all GPUs.  Can you try that?  

I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value should be
passed to all MPI processes. 

Several questions related to your comment above:

- Should the MPI processes select and initialize the GPUs they respectively need
  to access as they normally would when MPS is not in use?
- Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to MPS (and
  hence the client processes)? I ask because SLURM uses CUDA_VISIBLE_DEVICES to
  control GPU resource allocation, and I would like to run my program (and the
  MPS control daemon) on a cluster via SLURM.
- Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply that
  MPS and CUDA IPC cannot reliably be used simultaneously in a multi-GPU setting
  with CUDA 6.5 even when one starts multiple MPS control daemons as described
  in the aforementioned blog post?

> Because of this question, we realized we need to update our documentation as
> well.
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-21 Thread Lev Givon
Received from Lev Givon on Thu, May 21, 2015 at 11:32:33AM EDT:
> Received from Rolf vandeVaart on Wed, May 20, 2015 at 07:48:15AM EDT:
> 
> (snip)
> 
> > I see that you mentioned you are starting 4 MPS daemons.  Are you following
> > the instructions here?
> > 
> > http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
> >  
> 
> Yes - also
> https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
> 
> > This relies on setting CUDA_VISIBLE_DEVICES which can cause problems for CUDA
> > IPC. Since you are using CUDA 7 there is no more need to start multiple
> > daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and start a single
> > MPS control daemon which will handle all GPUs.  Can you try that?
> 
> I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value should be
> passed to all MPI processes. 
> 
> Several questions related to your comment above:
> 
> - Should the MPI processes select and initialize the GPUs they respectively
>   need to access as they normally would when MPS is not in use?
> - Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to MPS (and
>   hence the client processes)? I ask because SLURM uses CUDA_VISIBLE_DEVICES to
>   control GPU resource allocation, and I would like to run my program (and the
>   MPS control daemon) on a cluster via SLURM.
> - Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply that
>   MPS and CUDA IPC cannot reliably be used simultaneously in a multi-GPU setting
>   with CUDA 6.5 even when one starts multiple MPS control daemons as described
>   in the aforementioned blog post?

Using a single control daemon with CUDA_VISIBLE_DEVICES unset appears to solve
the problem when IPC is enabled.
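
Roughly, the setup that works here looks like the following (the process count
and program name are placeholders):

  # start one MPS control daemon for all of the GPUs (CUDA 7), leaving
  # CUDA_VISIBLE_DEVICES and CUDA_MPS_PIPE_DIRECTORY unset
  nvidia-cuda-mps-control -d

  # run the MPI job as usual; each rank selects its GPU with cudaSetDevice()
  mpirun -n 4 ./my_cuda_mpi_app

  # stop the control daemon afterwards
  echo quit | nvidia-cuda-mps-control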
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-21 Thread Rolf vandeVaart
Answers below...
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Thursday, May 21, 2015 2:19 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] cuIpcOpenMemHandle failure when using
>OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
>
>Received from Lev Givon on Thu, May 21, 2015 at 11:32:33AM EDT:
>> Received from Rolf vandeVaart on Wed, May 20, 2015 at 07:48:15AM EDT:
>>
>> (snip)
>>
>> > I see that you mentioned you are starting 4 MPS daemons.  Are you
>> > following the instructions here?
>> >
>> > http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
>>
>> Yes - also
>> https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
>>
>> > This relies on setting CUDA_VISIBLE_DEVICES which can cause problems
>> > for CUDA IPC. Since you are using CUDA 7 there is no more need to
>> > start multiple daemons. You simply leave CUDA_VISIBLE_DEVICES
>> > untouched and start a single MPS control daemon which will handle all
>> > GPUs.  Can you try that?
>>
>> I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value
>> should be passed to all MPI processes.
There is no need to do anything with CUDA_MPS_PIPE_DIRECTORY with CUDA 7.  

>>
>> Several questions related to your comment above:
>>
>> - Should the MPI processes select and initialize the GPUs they respectively
>>   need to access as they normally would when MPS is not in use?
Yes.  

>> - Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to MPS (and
>>   hence the client processes)? I ask because SLURM uses CUDA_VISIBLE_DEVICES to
>>   control GPU resource allocation, and I would like to run my program (and the
>>   MPS control daemon) on a cluster via SLURM.
Yes, I believe that is true.  

>> - Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply that
>>   MPS and CUDA IPC cannot reliably be used simultaneously in a multi-GPU setting
>>   with CUDA 6.5 even when one starts multiple MPS control daemons as described
>>   in the aforementioned blog post?
>
>Using a single control daemon with CUDA_VISIBLE_DEVICES unset appears to
>solve the problem when IPC is enabled.
>--
Glad to see this worked.  And you are correct that CUDA IPC will not work 
between devices if they are segregated by the use of CUDA_VISIBLE_DEVICES as we 
do with MPS in 6.5.

Rolf