Re: [gmx-users] running g_tune_pme on stampede

2014-12-06 Thread Carsten Kutzner

On 06 Dec 2014, at 00:16, Kevin Chen  wrote:

> Hi,
> 
> Has anybody tried g_tune_pme on stampede before? It appears Stampede only 
> supports ibrun, not the usual mpirun -np style of launching. So I assume one could 
> launch g_tune_pme with MPI using a command like this (without the -np option):
> 
> ibrun g_tune_pme -s cutoff.tpr -launch
Try 

export MPIRUN=ibrun
export MDRUN=$(which mdrun)
g_tune_pme -s …
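
For example, a minimal Stampede batch-script sketch along those lines (the
SBATCH settings, the rank count, and the -npstring choice are assumptions you
will need to adapt; check g_tune_pme -h for the exact option names in your
version):

   #!/bin/bash
   #SBATCH -J tune_pme          # hypothetical job name
   #SBATCH -p normal            # partition name is an assumption
   #SBATCH -N 2                 # nodes (adjust to your allocation)
   #SBATCH -n 32                # total MPI ranks (adjust)
   #SBATCH -t 01:00:00

   # g_tune_pme itself runs serially; it launches the parallel mdrun
   # benchmarks through the command given in $MPIRUN.
   export MPIRUN=ibrun
   export MDRUN=$(which mdrun)

   # ibrun takes the rank count from the SLURM allocation, so tell
   # g_tune_pme not to append a -np flag to the launcher command line.
   g_tune_pme -np 32 -npstring none -s cutoff.tpr -launch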

Carsten

> 
> Unfortunately, it failed. Any suggestion is welcome!
> 
> Thanks in advance
> 
> Kevin Chen

Re: [gmx-users] running g_tune_pme on stampede

2014-12-06 Thread Mark Abraham
On Sat, Dec 6, 2014 at 12:16 AM, Kevin Chen  wrote:

> Hi,
>
> Has anybody tried g_tune_pme on stampede before? It appears Stampede only
> supports ibrun, not the usual mpirun -np style of launching. So I assume one could
> launch g_tune_pme with MPI using a command like this (without the -np option):
>
> ibrun g_tune_pme -s cutoff.tpr -launch
>

You should be trying to run mdrun from g_tune_pme in parallel, not trying
to run g_tune_pme itself in parallel. Make sure you've read g_tune_pme -h to find
out which environment variables and command-line options you should be setting.
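
A minimal sketch of the difference (this uses the MPIRUN/MDRUN environment
variables mentioned elsewhere in this thread; check g_tune_pme -h for what
your version actually honours):

   # Wrong: this starts many copies of g_tune_pme itself.
   ibrun g_tune_pme -s cutoff.tpr -launch

   # Right: g_tune_pme runs serially and starts the parallel mdrun
   # benchmarks through the command given in $MPIRUN.
   export MPIRUN=ibrun
   export MDRUN=$(which mdrun)
   g_tune_pme -s cutoff.tpr -launch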

> Unfortunately, it failed. Any suggestion is welcome!
>

More information than "it failed" is needed to get a useful suggestion.

Mark


> Thanks in advance
>
> Kevin Chen

[gmx-users] running g_tune_pme on stampede

2014-12-05 Thread Kevin Chen
Hi,

Has anybody tried g_tune_pme on stampede before? It appears Stampede only 
supports ibrun, not the usual mpirun -np style of launching. So I assume one could 
launch g_tune_pme with MPI using a command like this (without the -np option):

ibrun g_tune_pme -s cutoff.tpr -launch

Unfortunately, it failed. Any suggestion is welcome!

Thanks in advance

Kevin Chen






-Original Message-
From: gromacs.org_gmx-users-boun...@maillist.sys.kth.se 
[mailto:gromacs.org_gmx-users-boun...@maillist.sys.kth.se] On Behalf Of Szilárd 
Páll
Sent: Friday, December 5, 2014 12:54 PM
To: Discussion list for GROMACS users
Subject: Re: [gmx-users] multinode issue

On second thought (and a quick googling), it _seems_ that this is an issue 
caused by the following:
- the OpenMP runtime gets initialized outside mdrun and its threads (or just 
the master thread) get their affinity set;
- mdrun then executes the sanity check, at which point omp_get_num_procs() 
reports 1 CPU, most probably because the master thread is bound to a single core.

This alone should not be a big deal as long as the affinity settings get 
correctly overridden in mdrun. However, this can have the ugly side-effect that, 
if mdrun's affinity setting gets disabled (if mdrun detects the externally set 
affinities it backs off, or if not all cores/hardware threads are used), all 
compute threads will inherit the affinity set previously and multiple threads 
will run on the same core.

Note that this warning should typically not cause a crash, but it is telling 
you that something is not quite right, so it may be best to start with 
eliminating this warning (hints: I_MPI_PIN for Intel MPI, -cc for Cray's aprun, 
--cpu-bind for slurm).
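
For instance, hedged sketches of what such settings might look like (exact
option names and values vary with the MPI/launcher version, so check your
site's documentation):

   # Intel MPI: disable its own pinning so mdrun can set affinities itself
   export I_MPI_PIN=off
   mpirun -np 32 -machinefile nodelist $EXE -v -deffnm $INPUT

   # Cray aprun:  aprun -cc none ...
   # SLURM srun:  srun --cpu-bind=none ...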

Cheers,
--
Szilárd


On Fri, Dec 5, 2014 at 7:35 PM, Szilárd Páll  wrote:
> I don't think this is a sysconf issue. As you seem to have 16-core (hw
> thread?) nodes, it looks like sysconf returned the correct value 
> (16), but the OpenMP runtime actually returned 1. This typically means 
> that the OpenMP runtime was initialized outside mdrun and for some 
> reason (which I'm not sure about) it returns 1.
>
> My guess is that your job scheduler is multi-threading aware and by 
> default assumes 1 core/hardware thread per rank so you may want to set 
> some rank depth/width option.
>
> --
> Szilárd
>
>
> On Fri, Dec 5, 2014 at 1:37 PM, Éric Germaneau  wrote:
>> Thank you Mark,
>>
>> Yes this was the end of the log.
>> I tried another input and got the same issue:
>>
>>Number of CPUs detected (16) does not match the number reported by
>>OpenMP (1).
>>Consider setting the launch configuration manually!
>>Reading file yukuntest-70K.tpr, VERSION 4.6.3 (single precision)
>>[16:node328] unexpected disconnect completion event from [0:node299]
>>Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
>>internal ABORT - process 16
>>
>> Actually, I'm running some tests for our users. I'll talk with the 
>> admin about how to return information to the standard sysconf() 
>> routine in the usual way.
>> Thank you,
>>
>>Éric.
>>
>>
>> On 12/05/2014 07:38 PM, Mark Abraham wrote:
>>>
>>> On Fri, Dec 5, 2014 at 9:15 AM, Éric Germaneau 
>>> 
>>> wrote:
>>>
 Dear all,

 I use impi and when I submit a job (via LSF) to more than one node 
 I get the following message:

 Number of CPUs detected (16) does not match the number reported by
 OpenMP (1).

>>> That suggests this machine has not been set up to return information 
>>> to the standard sysconf() routine in the usual way. What kind of machine is 
>>> this?
>>>
>>> Consider setting the launch configuration manually!

 Reading file test184000atoms_verlet.tpr, VERSION 4.6.2 (single
 precision)

>>> I hope that's just a 4.6.2-era .tpr, but nobody should be using 
>>> 4.6.2 mdrun because there was a bug in only that version affecting 
>>> precisely these kinds of issues...
>>>
>>> [16:node319] unexpected disconnect completion event from 
>>> [11:node328]

 Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
 internal ABORT - process 16

 I submit doing

 mpirun -np 32 -machinefile nodelist $EXE -v -deffnm $INPUT

 The machinefile looks like this

 node328:16
 node319:16

 I'm running the release 4.6.7.
 I do not set anything about OpenMP for this job; I'd like to have 
 32 MPI processes.

 Using one node it works fine.
 Any hints here?

>>> Everything seems fine. What was the end of the .log file? Can you 
>>> run another MPI test program thus?
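>>>
>>> For example, a quick launcher sanity check (a hedged suggestion, not an
>>> MPI program as such) would be to start a trivial command over the same
>>> machinefile and rank count and see whether both nodes respond:
>>>
>>>    mpirun -np 32 -machinefile nodelist hostname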
>>>
>>> Mark
>>>
>>>
   Éric.

 --
 Éric Germaneau (???), Specialist
 Center for High Performance Computing Shanghai Jiao Tong University 
 Room 205 Network Center, 800 Dongchuan Road, Shanghai 200240 China 
 M:german...@sjtu.edu.cn P:+86-136-4161-6480