Re: [gmx-users] multinode issue

2014-12-06 Thread Éric Germaneau

Thanks, Mark, for trying to help.


Re: [gmx-users] multinode issue

2014-12-06 Thread Mark Abraham
On Sat, Dec 6, 2014 at 9:29 AM, Éric Germaneau wrote:

> Dear Mark, Dear Szilárd,
>
> Thank you for your help.
> I did try different I_MPI... options without success.
> Something I can't figure out is that I can run jobs with 2 or more OpenMP
> threads per MPI process, but not with just one.
> It crashes with one OpenMP thread per MPI process, even if I disable
> I_MPI_PIN.
>

OK, well that points to something being configured incorrectly in IMPI,
rather than any of the other theories. Try OpenMPI ;-)
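
For anyone trying that, a minimal sketch of an equivalent OpenMPI launch (assuming $EXE and $INPUT are the same variables as in the original job script; the hostfile syntax and the -npernode option are OpenMPI's, and exact spellings can vary between OpenMPI versions):

    # OpenMPI hostfile syntax uses "slots=" instead of "host:count"
    printf 'node328 slots=16\nnode319 slots=16\n' > nodelist.openmpi
    # -npernode 16 places 16 ranks on each node, 32 in total
    mpirun -np 32 -hostfile nodelist.openmpi -npernode 16 $EXE -v -deffnm $INPUT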

Mark



Re: [gmx-users] multinode issue

2014-12-06 Thread Éric Germaneau

Dear Mark, Dear Szilárd,

Thank you for your help.
I did try different I_MPI... options without success.
Something I can't figure out is that I can run jobs with 2 or more OpenMP
threads per MPI process, but not with just one.

It crashes with one OpenMP thread per MPI process, even if I disable I_MPI_PIN.
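
For reference, a sketch of the two launch configurations being contrasted here, assuming the same machinefile as in the first post and an MPI-enabled mdrun in $EXE (-ntomp is mdrun 4.6's OpenMP-threads-per-rank option; rank placement is left to the machinefile):

    # Working case: 2 OpenMP threads per MPI rank (16 ranks in total)
    export OMP_NUM_THREADS=2
    mpirun -np 16 -machinefile nodelist $EXE -ntomp 2 -v -deffnm $INPUT

    # Crashing case: 1 OpenMP thread per MPI rank (32 ranks, 16 per node)
    export OMP_NUM_THREADS=1
    mpirun -np 32 -machinefile nodelist $EXE -ntomp 1 -v -deffnm $INPUT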

  Éric.


--
Éric Germaneau (艾海克), Specialist
Center for High Performance Computing
Shanghai Jiao Tong University
Room 205 Network Center, 800 Dongchuan Road, Shanghai 200240 China
Email:german...@sjtu.edu.cn Mobi:+86-136-4161-6480 http://hpc.sjtu.edu.cn

Re: [gmx-users] multinode issue

2014-12-05 Thread Szilárd Páll
On second thought (and after a quick googling), it _seems_ that this
issue is caused by the following:
- the OpenMP runtime gets initialized outside mdrun and its threads
(or just the master thread) get their affinity set;
- mdrun then executes the sanity check, at which point
omp_get_num_procs() reports 1 CPU, most probably because the master
thread is bound to a single core.
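
One way to check whether that is what happens is to print the affinity mask each rank inherits from the launcher before mdrun starts. A sketch (taskset is a standard Linux tool; PMI_RANK is the rank variable Intel MPI's mpirun typically exports, so adjust for your launcher):

    # Print the CPU affinity mask each rank's shell inherits from the launcher.
    # If every rank reports a single CPU, omp_get_num_procs() will also see 1.
    mpirun -np 32 -machinefile nodelist \
        sh -c 'echo "rank ${PMI_RANK:-?} on $(hostname): $(taskset -cp $$)"'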

This alone should not be a big deal as long as the affinity settings
get correctly overridden in mdrun. However, this can have the ugly
side-effect that, if mdrun's affinity setting gets disabled (mdrun
backs off if it detects externally set affinities, or if not all
cores/hardware threads are used), all compute threads will inherit the
previously set affinity and multiple threads will end up running on the
same core.

Note that this warning should typically not cause a crash, but it is
telling you that something is not quite right, so it may be best to
start by eliminating this warning (hints: I_MPI_PIN for Intel MPI,
-cc for Cray's aprun, --cpu-bind for slurm); see the sketch below.
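
A sketch of what those hints translate to on the command line; the exact option spellings depend on the launcher version, so treat these as starting points only:

    # Intel MPI: let mdrun handle pinning instead of the MPI runtime
    export I_MPI_PIN=off
    mpirun -np 32 -machinefile nodelist $EXE -v -deffnm $INPUT

    # Cray aprun: disable CPU binding
    aprun -n 32 -cc none $EXE -v -deffnm $INPUT

    # Slurm: disable srun's CPU binding
    srun -n 32 --cpu_bind=none $EXE -v -deffnm $INPUT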

Cheers,
--
Szilárd


Re: [gmx-users] multinode issue

2014-12-05 Thread Szilárd Páll
I don't think this is a sysconf issue. As you seem to have 16-core (hw
thread?) nodes, it looks like sysconf returned the correct value
(16), but the OpenMP runtime actually returned 1. This typically means
that the OpenMP runtime was initialized outside mdrun and for some
reason (which I'm not sure about) it returns 1.

My guess is that your job scheduler is multi-threading aware and by
default assumes 1 core/hardware thread per rank, so you may want to set
a rank depth/width option; see the sketch below.
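
With LSF that could look something like the sketch below; the resource strings are only illustrative, since span[ptile=...] depends on the site configuration and affinity[...] requires a reasonably recent LSF:

    # Ask for 32 slots, 16 per node, and one core bound per rank
    bsub -n 32 -R "span[ptile=16]" -R "affinity[core(1)]" \
         mpirun -np 32 -machinefile nodelist $EXE -v -deffnm $INPUT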

--
Szilárd




Re: [gmx-users] multinode issue

2014-12-05 Thread Mark Abraham
On Fri, Dec 5, 2014 at 1:37 PM, Éric Germaneau wrote:

> Thank you Mark,
>
> Yes this was the end of the log.
>

No, that's not the end of the .log file; it's the end of the stdout. The end
of the .log file will give us more clues about where mdrun couldn't cope
with life on this system. And the start of the .log file will confirm that
you're not accidentally running the broken 4.6.2 ;-)
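
A quick way to pull out both pieces of information (a sketch, assuming -deffnm wrote the log as $INPUT.log):

    # The log header states the mdrun version actually used
    grep -i -m1 "VERSION" $INPUT.log
    # The tail shows how far mdrun got before the MPI layer aborted
    tail -n 60 $INPUT.log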

> I tried another input and got the same issue:
>
>Number of CPUs detected (16) does not match the number reported by
>OpenMP (1).
>Consider setting the launch configuration manually!
>Reading file yukuntest-70K.tpr, VERSION 4.6.3 (single precision)
>[16:node328] unexpected disconnect completion event from [0:node299]
>Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
>internal ABORT - process 16
>
> Actually, I'm running some tests for our users; I'll talk with the admin
> about how to return information to the standard sysconf() routine in the
> usual way.
>

OK. But if this is Linux on x86, then that really should be hard to get
wrong. So, "What kind of machine is this?" Does non-Intel MPI do a better
job? (Hint: this has been true...)

Mark




Re: [gmx-users] multinode issue

2014-12-05 Thread Éric Germaneau

Thank you Mark,

Yes this was the end of the log.
I tried another input and got the same issue:

   Number of CPUs detected (16) does not match the number reported by
   OpenMP (1).
   Consider setting the launch configuration manually!
   Reading file yukuntest-70K.tpr, VERSION 4.6.3 (single precision)
   [16:node328] unexpected disconnect completion event from [0:node299]
   Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
   internal ABORT - process 16

Actually, I'm running some tests for our users; I'll talk with the admin
about how to return information to the standard sysconf() routine in the
usual way.
Thank you,

   Éric.

--
Éric Germaneau (艾海克), Specialist
Center for High Performance Computing
Shanghai Jiao Tong University
Room 205 Network Center, 800 Dongchuan Road, Shanghai 200240 China
Email:german...@sjtu.edu.cn Mobi:+86-136-4161-6480 http://hpc.sjtu.edu.cn


Re: [gmx-users] multinode issue

2014-12-05 Thread Mark Abraham
On Fri, Dec 5, 2014 at 9:15 AM, Éric Germaneau wrote:

> Dear all,
>
> I use impi, and when I submit a job (via LSF) to more than one node I get
> the following message:
>
>Number of CPUs detected (16) does not match the number reported by
>OpenMP (1).
>

That suggests this machine has not been set up to return information to the
standard sysconf() routine in the usual way. What kind of machine is this?
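
A quick check of what the OS itself reports on a compute node (these are standard Linux tools, so this is only a sketch of what to compare against the 16 that mdrun detected):

    # What sysconf(_SC_NPROCESSORS_ONLN) should correspond to
    getconf _NPROCESSORS_ONLN
    # nproc honours the process affinity mask, so inside a batch job it can
    # legitimately report fewer CPUs than the node has
    nproc
    grep -c ^processor /proc/cpuinfo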

>    Consider setting the launch configuration manually!
>Reading file test184000atoms_verlet.tpr, VERSION 4.6.2 (single
>precision)
>

I hope that's just a 4.6.2-era .tpr, but nobody should be using 4.6.2 mdrun
because there was a bug in only that version affecting precisely these
kinds of issues...

>    [16:node319] unexpected disconnect completion event from [11:node328]
>Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
>internal ABORT - process 16
>
> I submit the job with
>
>mpirun -np 32 -machinefile nodelist $EXE -v -deffnm $INPUT
>
> The machinefile looks like this
>
>node328:16
>node319:16
>
> I'm running the release 4.6.7.
> I do not set anything related to OpenMP for this job; I'd like to have 32
> MPI processes.
>
> Using one node it works fine.
> Any hints here?
>

Everything seems fine. What was the end of the .log file? Can you run
another MPI test program the same way?
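
For example, a trivial multi-node launch with the same machinefile would already tell us whether plain MPI start-up works (just a sketch; any MPI hello-world would do):

    # 32 ranks should report 16 hostnames from each of the two nodes
    mpirun -np 32 -machinefile nodelist hostname | sort | uniq -c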

Mark


>  Éric.
>


[gmx-users] multinode issue

2014-12-05 Thread Éric Germaneau

Dear all,

I use impi, and when I submit a job (via LSF) to more than one node I get
the following message:


   Number of CPUs detected (16) does not match the number reported by
   OpenMP (1).
   Consider setting the launch configuration manually!
   Reading file test184000atoms_verlet.tpr, VERSION 4.6.2 (single
   precision)
   [16:node319] unexpected disconnect completion event from [11:node328]
   Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
   internal ABORT - process 16

I submit the job with

   mpirun -np 32 -machinefile nodelist $EXE -v -deffnm $INPUT

The machinefile looks like this

   node328:16
   node319:16

I'm running the release 4.6.7.
I do not set anything related to OpenMP for this job; I'd like to have 32
MPI processes.
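
For the record, a sketch of what setting the launch configuration manually could look like for this job, assuming $EXE is an MPI-enabled mdrun 4.6 binary (-ntomp sets the number of OpenMP threads per rank):

    # Be explicit: 32 MPI ranks, one OpenMP thread each
    export OMP_NUM_THREADS=1
    mpirun -np 32 -machinefile nodelist $EXE -ntomp 1 -v -deffnm $INPUT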


Using one node it works fine.
Any hints here?

 Éric.

--
Éric Germaneau (艾海克), Specialist
Center for High Performance Computing
Shanghai Jiao Tong University
Room 205 Network Center, 800 Dongchuan Road, Shanghai 200240 China
M:german...@sjtu.edu.cn P:+86-136-4161-6480 W:http://hpc.sjtu.edu.cn