Re: [OMPI users] first cluster

2010-07-15 Thread Douglas Guptill
On Wed, Jul 14, 2010 at 04:27:11PM -0400, Jeff Squyres wrote:
> On Jul 9, 2010, at 12:43 PM, Douglas Guptill wrote:
> 
> > After some lurking and reading, I plan this:
> >   Debian (lenny)
> >   + fai   - for compute-node operating system install
> >   + Torque- job scheduler/manager
> >   + MPI (Intel MPI)   - for the application
> > >   + MPI (Open MPI)  - alternative MPI
> > 
> > Does anyone see holes in this plan?
> 
> HPC is very much a "what is best for *your* requirements" kind of
> environment.  There are many different recipes out there for
> different kinds of HPC environments.

Very wise words.

We will be running only one application, and have one, maybe two, users.

> What you listed above is a reasonable list of meta software
> packages.

Thanks,
Douglas.
-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada



Re: [hwloc-users] hwloc_set/get_thread_cpubind

2010-07-15 Thread Jeff Squyres
Fixed -- thanks for the heads-up!

On Jul 14, 2010, at 2:28 PM, Αλέξανδρος Παπαδογιαννάκης wrote:

> 
> hwloc_set_thread_cpubind and hwloc_get_thread_cpubind are missing from the 
> html documentation
> http://www.open-mpi.org/projects/hwloc/doc/v1.0.1/group__hwlocality__binding.php
>  
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Open MPI runtime parameter tuning on a custom cluster

2010-07-15 Thread Eugene Loh

Simone Pellegrini wrote:


> Dear Open MPI community,
> I would like to know from expert system administrators if they know of any
> "standardized" way for tuning Open MPI runtime parameters.
>
> I need to tune the performance on a custom cluster, so I would like to
> have some hints in order to proceed in the correct direction.


(I'll try to save Jeff a message as he goes through his inbox...)

I believe there is no standardized way for tuning these parameters.

On the discouraging side, there are very many parameters and they can be 
difficult to understand, particularly as there is little or no 
documentation in a number of cases.  In some cases, the trade-offs they 
make entail issues that require a detailed understanding of your 
particular configuration and application.


On the positive side, the default values of many of these parameters 
have been chosen reasonably and so most of the parameters don't need 
your attention.  There is on-going, gradual progress in helping users 
understand tuning -- for example in the OMPI mpirun man page and on the 
FAQ http://www.open-mpi.org/faq/ -- see the section on "Tuning".


Re: [OMPI users] Highly variable performance

2010-07-15 Thread Jed Brown
On Thu, 15 Jul 2010 13:03:31 -0400, Jeff Squyres  wrote:
> Given the oversubscription on the existing HT links, could contention
> account for the difference?  (I have no idea how HT's contention
> management works) Meaning: if the stars line up in a given run, you
> could end up with very little/no contention and you get good
> bandwidth.  But if there's a bit of jitter, you could end up with
> quite a bit of contention that ends up cascading into a bunch of
> additional delay.

What contention?  Many sockets needing to access memory on another
socket via HT links?  Then yes, perhaps that could be a lot.  As shown in
the diagram, it's pretty non-uniform, and if, say, sockets 0, 1, and 3
all found memory on socket 0 (say socket 2 had local memory), then there
are two ways for messages to get from 3 to 0 (via 1 or via 2).  I don't
know if there is hardware support to re-route to avoid contention, but
if not, then socket 3 could be sharing the 1->0 HT link (which has max
throughput of 8 GB/s, therefore 4 GB/s would be available per socket,
provided it was still operating at peak).  Note that this 4 GB/s is
still less than splitting the 10.7 GB/s three ways.

> I fail to see how that could add up to 70-80 (or more) seconds of
> difference -- 13 secs vs. 90+ seconds (and more), though...  70-80
> seconds sounds like an IO delay -- perhaps paging due to the ramdisk
> or somesuch...?  That's a SWAG.

This problem should have had a significantly smaller resident set than would
cause paging, but these were very short jobs so a relatively small
amount of paging would cause a big performance hit.  We have also seen
up to a factor of 10 variability in longer jobs (e.g. 1 hour for a
"fast" run), with larger working sets, but once the pages are faulted,
this kernel (2.6.18 from RHEL5) won't migrate them around, so even if
you eventually swap out all the ramdisk, pages faulted before and after
will be mapped to all sorts of inconvenient places.

But, I don't have any systematic testing with a guaranteed clean
ramdisk, and I'm not going to overanalyze the extra factors when there's
an understood factor of 3 hanging in the way.  I'll give an update if
there is any news.

Jed


Re: [OMPI users] Highly variable performance

2010-07-15 Thread Jeff Squyres
Given the oversubscription on the existing HT links, could contention account 
for the difference?  (I have no idea how HT's contention management works)  
Meaning: if the stars line up in a given run, you could end up with very 
little/no contention and you get good bandwidth.  But if there's a bit of 
jitter, you could end up with quite a bit of contention that ends up cascading 
into a bunch of additional delay.

I fail to see how that could add up to 70-80 (or more) seconds of difference -- 
13 secs vs. 90+ seconds (and more), though...  70-80 seconds sounds like an IO 
delay -- perhaps paging due to the ramdisk or somesuch...?  That's a SWAG.



On Jul 15, 2010, at 10:40 AM, Jed Brown wrote:

> On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres  wrote:
> > Per my other disclaimer, I'm trolling through my disastrous inbox and
> > finding some orphaned / never-answered emails.  Sorry for the delay!
> 
> No problem, I should have followed up on this with further explanation.
> 
> > Just to be clear -- you're running 8 procs locally on an 8 core node,
> > right?
> 
> These are actually 4-socket quad-core nodes, so there are 16 cores
> available, but we are only running on 8, -npersocket 2 -bind-to-socket.
> This was a greatly simplified case, but is still sufficient to show the
> variability.  It tends to be somewhat worse if we use all cores of a
> node.
> 
>   (Cisco is an Intel partner -- I don't follow the AMD line
> > much) So this should all be local communication with no external
> > network involved, right?
> 
> Yes, this was the greatly simplified case, contained entirely within a
> 
> > > lsf.o240562 killed   8*a6200
> > > lsf.o240563 9.2110e+01   8*a6200
> > > lsf.o240564 1.5638e+01   8*a6237
> > > lsf.o240565 1.3873e+01   8*a6228
> >
> > Am I reading that right that it's 92 seconds vs. 13 seconds?  Woof!
> 
> Yes, and the "killed" means it wasn't done after 120 seconds.  This
> factor of 10 is about the worst we see, but of course very surprising.
> 
> > Nice and consistent, as you mentioned.  And I assume your notation
> > here means that it's across 2 nodes.
> 
> Yes, the Quadrics nodes are 2-socket dual core, so 8 procs needs two
> nodes.
> 
> The rest of your observations are consistent with my understanding.  We
> identified two other issues, neither of which accounts for a factor of
> 10, but which account for at least a factor of 3.
> 
> 1. The administrators mounted a 16 GB ramdisk on /scratch, but did not
>ensure that it was wiped before the next task ran.  So if you got a
>node after some job that left stinky feces there, you could
>effectively only have 16 GB (before the old stuff would be swapped
>out).  More importantly, the physical pages backing the ramdisk may
>not be uniformly distributed across the sockets, and rather than
>preemptively swap out those old ramdisk pages, the kernel would find
>a page on some other socket (instead of locally, this could be
>confirmed, for example, by watching the numa_foreign and numa_miss
>counts with numastat).  Then when you went to use that memory
>(typically in a bandwidth-limited application), it was easy to have 3
>sockets all waiting on one bus, thus taking a factor of 3+
>performance hit despite a resident set much less than 50% of the
>available memory.  I have a rather complete analysis of this in case
>someone is interested.  Note that this can affect programs with
>static or dynamic allocation (the kernel looks for local pages when
>you fault it, not when you allocate it), the only way I know of to
>circumvent the problem is to allocate memory with libnuma
>(e.g. numa_alloc_local) which will fail if local memory isn't
>available (instead of returning and subsequently faulting remote
>pages).
> 
> 2. The memory bandwidth is 16-18% different between sockets, with
>sockets 0,3 being slow and sockets 1,2 having much faster available
>bandwidth.  This is fully reproducible and acknowledged by
>Sun/Oracle, their response to an early inquiry:
> 
>  http://59A2.org/files/SunBladeX6440STREAM-20100616.pdf
> 
>I am not completely happy with this explanation because the issue
>persists even with full software prefetch, packed SSE2, and
>non-temporal stores; as long as the working set does not fit within
>(per-socket) L3.  Note that the software prefetch allows for several
>hundred cycles of latency, so the extra hop for snooping shouldn't be
>a problem.  If the working set fits within L3, then all sockets are
>the same speed (and of course much faster due to improved bandwidth).
>Some disassembly here:
> 
>  http://gist.github.com/476942
> 
>The three with prefetch and movntpd run within 2% of each other, the
>other is much faster within cache and much slower when it breaks out
>of cache (obviously).  The performance numbers are higher than with
>the reference implementation 

Re: [OMPI users] error in (Open MPI) 1.3.3r21324-ct8.2-b09b-r31

2010-07-15 Thread Don Kerr
There is a slightly newer version available, 8.2.1c at 
http://www.oracle.com/goto/ompt


You should be able to install side by side without interfering with a 
previously installed version.


If that does not alleviate the issue, additional information as Scott 
asked for would be useful: the full mpirun line or list of mca parameters 
that were set, number of processes, number of nodes, version of Solaris,
version of compiler, and which interconnect.

If that does not shed some light then maybe a small test case would be 
the next step.
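
For what it's worth, a minimal test case along these lines (a sketch only: the 
buffer size, tag, and ring-style exchange are illustrative and not taken from 
the failing application) would exercise MPI_Sendrecv_replace, which may 
allocate a temporary buffer internally through MPI_Alloc_mem:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        double buf[1024];                 /* arbitrary 8 KiB payload */
        int rank, size, i, next, prev;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (i = 0; i < 1024; i++)
            buf[i] = (double)rank;

        /* cyclic exchange: send to the next rank, receive from the previous */
        next = (rank + 1) % size;
        prev = (rank - 1 + size) % size;
        MPI_Sendrecv_replace(buf, 1024, MPI_DOUBLE, next, 0, prev, 0,
                             MPI_COMM_WORLD, &status);

        printf("rank %d now holds data from rank %d\n", rank, (int)buf[0]);
        MPI_Finalize();
        return 0;
    }

Run over the same interconnect and compiler as the failing job, that should 
quickly show whether the problem is reproducible outside the application.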


-DON

On 07/15/10 09:56, Scott Atchley wrote:

Lydia,

Which interconnect is this running over?

Scott

On Jul 15, 2010, at 5:19 AM, Lydia Heck wrote:


We are running Sun's build of Open MPI 1.3.3r21324-ct8.2-b09b-r31
(HPC8.2) and one code that runs perfectly fine under
HPC8.1 (Open MPI) 1.3r19845-ct8.1-b06b-r21 and before fails with



[oberon:08454] *** Process received signal ***
[oberon:08454] Signal: Segmentation Fault (11)
[oberon:08454] Signal code: Address not mapped (1)
[oberon:08454] Failing at address: 0
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libopen-pal.so.0.0.0:0x4b89e
/lib/amd64/libc.so.1:0xd0f36
/lib/amd64/libc.so.1:0xc5a72
0x0 [ Signal 11 (SEGV)]
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi.so.0.0.0:MPI_Alloc_mem+0x7f
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi.so.0.0.0:MPI_Sendrecv_replace+0x31e
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi_f77.so.0.0.0:PMPI_SENDRECV_REPLACE+0x94
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:mpi_cyclic_transfer_+0xd9
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:cycle_particles_and_interpolate_+0x94b
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:interpolate_field_+0xc30
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:MAIN_+0xe68
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:main+0x3d
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:0x62ac
[oberon:08454] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 8454 on node oberon exited on 
signal 11 (Segmentation Fault).



I have not tried to get and build a newer Open MPI, so I do not know if the 
problem propagates into the more recent versions.


If the developers are interested, I could ask the user to prepare the code for 
you to have a look at the problem, which looks to be in MPI_Alloc_mem.

Best wishes,
Lydia Heck

--
Dr E L  Heck

University of Durham Institute for Computational Cosmology
Ogden Centre
Department of Physics South Road

DURHAM, DH1 3LE United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Edgar Gabriel
On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> hi edgar,
> 
> thanks for the tips, I'm gonna try this option as well. the segmentation 
> fault i'm observing always happened during a collective communication 
> indeed...
> it basically switches all collective communication to basic mode, right?
> 
> sorry for my ignorance, but what's an NCA?

sorry, I meant to type HCA (InfiniBand networking card)

Thanks
Edgar

> 
> thanks,
> éloi
> 
> On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
>> you could try first to use the algorithms in the basic module, e.g.
>>
>> mpirun -np x --mca coll basic ./mytest
>>
>> and see whether this makes a difference. I used to observe sometimes a
>> (similar ?) problem in the openib btl triggered from the tuned
>> collective component, in cases where the ofed libraries were installed
>> but no NCA was found on a node. It used to work however with the basic
>> component.
>>
>> Thanks
>> Edgar
>>
>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
>>> hi Rolf,
>>>
>>> unfortunately, i couldn't get rid of that annoying segmentation fault
>>> when selecting another bcast algorithm. i'm now going to replace
>>> MPI_Bcast with a naive implementation (using MPI_Send and MPI_Recv) and
>>> see if that helps.
>>>
>>> regards,
>>> éloi
>>>
>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
 Hi Rolf,

 thanks for your input. You're right, I missed the
 coll_tuned_use_dynamic_rules option.

 I'll check if the segmentation fault disappears when using the basic
 bcast linear algorithm using the proper command line you provided.

 Regards,
 Eloi

 On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> Hi Eloi:
> To select the different bcast algorithms, you need to add an extra mca
> parameter that tells the library to use dynamic selection.
> --mca coll_tuned_use_dynamic_rules 1
>
> One way to make sure you are typing this in correctly is to use it with
> ompi_info.  Do the following:
> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
>
> You should see lots of output with all the different algorithms that
> can be selected for the various collectives.
> Therefore, you need this:
>
> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
>
> Rolf
>
> On 07/13/10 11:28, Eloi Gaudry wrote:
>> Hi,
>>
>> I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch
>> to the basic linear algorithm. Anyway whatever the algorithm used, the
>> segmentation fault remains.
>>
>> Could anyone give some advice on ways to diagnose the issue I'm
>> facing?
>>
>> Regards,
>> Eloi
>>
>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
>>> Hi,
>>>
>>> I'm focusing on the MPI_Bcast routine that seems to randomly segfault
>>> when using the openib btl. I'd like to know if there is any way to
>>> make OpenMPI switch to a different algorithm than the default one
>>> being selected for MPI_Bcast.
>>>
>>> Thanks for your help,
>>> Eloi
>>>
>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
 Hi,

 I'm observing a random segmentation fault during an internode
 parallel computation involving the openib btl and OpenMPI-1.4.2 (the
 same issue can be observed with OpenMPI-1.3.3).

mpirun (Open MPI) 1.4.2
Report bugs to http://www.open-mpi.org/community/help/
[pbn08:02624] *** Process received signal ***
[pbn08:02624] Signal: Segmentation fault (11)
[pbn08:02624] Signal code: Address not mapped (1)
[pbn08:02624] Failing at address: (nil)
[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
[pbn08:02624] *** End of error message ***
sh: line 1:  2624 Segmentation fault

 \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/
 x 86 _6 4\ /bin\/actranpy_mp
 '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86
 _ 64 /A c tran_11.0.rc2.41872'
 '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.d
 a t' '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
 '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
 '--parallel=domain'

 If I choose not to use the openib btl (by using --mca btl
 self,sm,tcp on the command line, for instance), I don't encounter
 any problem and the parallel computation runs flawlessly.

 I would like to get some help to be able:
 - to diagnose the issue I'm facing with the openib btl
 - understand why this issue is observed only when using the openib
 btl and not when using self,sm,tcp

 Any help would be very much appreciated.

 The outputs of 

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Eloi Gaudry
hi edgar,

thanks for the tips, I'm gonna try this option as well. the segmentation fault 
i'm observing always happened during a collective communication indeed...
it basically switches all collective communication to basic mode, right?

sorry for my ignorance, but what's an NCA?

thanks,
éloi

On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> you could try first to use the algorithms in the basic module, e.g.
> 
> mpirun -np x --mca coll basic ./mytest
> 
> and see whether this makes a difference. I used to observe sometimes a
> (similar ?) problem in the openib btl triggered from the tuned
> collective component, in cases where the ofed libraries were installed
> but no NCA was found on a node. It used to work however with the basic
> component.
> 
> Thanks
> Edgar
> 
> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> > hi Rolf,
> > 
> > unfortunately, i couldn't get rid of that annoying segmentation fault
> > when selecting another bcast algorithm. i'm now going to replace
> > MPI_Bcast with a naive implementation (using MPI_Send and MPI_Recv) and
> > see if that helps.
> > 
> > regards,
> > éloi
> > 
> > On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> >> Hi Rolf,
> >> 
> >> thanks for your input. You're right, I missed the
> >> coll_tuned_use_dynamic_rules option.
> >> 
> >> I'll check if the segmentation fault disappears when using the basic
> >> bcast linear algorithm using the proper command line you provided.
> >> 
> >> Regards,
> >> Eloi
> >> 
> >> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> >>> Hi Eloi:
> >>> To select the different bcast algorithms, you need to add an extra mca
> >>> parameter that tells the library to use dynamic selection.
> >>> --mca coll_tuned_use_dynamic_rules 1
> >>> 
> >>> One way to make sure you are typing this in correctly is to use it with
> >>> ompi_info.  Do the following:
> >>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> >>> 
> >>> You should see lots of output with all the different algorithms that
> >>> can be selected for the various collectives.
> >>> Therefore, you need this:
> >>> 
> >>> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
> >>> 
> >>> Rolf
> >>> 
> >>> On 07/13/10 11:28, Eloi Gaudry wrote:
>  Hi,
>  
>  I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch
>  to the basic linear algorithm. Anyway whatever the algorithm used, the
>  segmentation fault remains.
>  
>  Could anyone give some advice on ways to diagnose the issue I'm
>  facing?
>  
>  Regards,
>  Eloi
>  
>  On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> > Hi,
> > 
> > I'm focusing on the MPI_Bcast routine that seems to randomly segfault
> > when using the openib btl. I'd like to know if there is any way to
> > make OpenMPI switch to a different algorithm than the default one
> > being selected for MPI_Bcast.
> > 
> > Thanks for your help,
> > Eloi
> > 
> > On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> >> Hi,
> >> 
> >> I'm observing a random segmentation fault during an internode
> >> parallel computation involving the openib btl and OpenMPI-1.4.2 (the
> >> same issue can be observed with OpenMPI-1.3.3).
> >> 
> >>mpirun (Open MPI) 1.4.2
> >>Report bugs to http://www.open-mpi.org/community/help/
> >>[pbn08:02624] *** Process received signal ***
> >>[pbn08:02624] Signal: Segmentation fault (11)
> >>[pbn08:02624] Signal code: Address not mapped (1)
> >>[pbn08:02624] Failing at address: (nil)
> >>[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> >>[pbn08:02624] *** End of error message ***
> >>sh: line 1:  2624 Segmentation fault
> >> 
> >> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/
> >> x 86 _6 4\ /bin\/actranpy_mp
> >> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86
> >> _ 64 /A c tran_11.0.rc2.41872'
> >> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.d
> >> a t' '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> >> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
> >> '--parallel=domain'
> >> 
> >> If I choose not to use the openib btl (by using --mca btl
> >> self,sm,tcp on the command line, for instance), I don't encounter
> >> any problem and the parallel computation runs flawlessly.
> >> 
> >> I would like to get some help to be able:
> >> - to diagnose the issue I'm facing with the openib btl
> >> - understand why this issue is observed only when using the openib
> >> btl and not when using self,sm,tcp
> >> 
> >> Any help would be very much appreciated.
> >> 
> >> The outputs of ompi_info and the configure scripts of OpenMPI are
> >> enclosed to this email, and some information on the infiniband
> 

Re: [OMPI users] Highly variable performance

2010-07-15 Thread Jed Brown
On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres  wrote:
> Per my other disclaimer, I'm trolling through my disastrous inbox and
> finding some orphaned / never-answered emails.  Sorry for the delay!

No problem, I should have followed up on this with further explanation.

> Just to be clear -- you're running 8 procs locally on an 8 core node,
> right?

These are actually 4-socket quad-core nodes, so there are 16 cores
available, but we are only running on 8, -npersocket 2 -bind-to-socket.
This was a greatly simplified case, but is still sufficient to show the
variability.  It tends to be somewhat worse if we use all cores of a
node.

>   (Cisco is an Intel partner -- I don't follow the AMD line
> much) So this should all be local communication with no external
> network involved, right?

Yes, this was the greatly simplified case, contained entirely within a 

> > lsf.o240562 killed   8*a6200
> > lsf.o240563 9.2110e+01   8*a6200
> > lsf.o240564 1.5638e+01   8*a6237
> > lsf.o240565 1.3873e+01   8*a6228
>
> Am I reading that right that it's 92 seconds vs. 13 seconds?  Woof!

Yes, and the "killed" means it wasn't done after 120 seconds.  This
factor of 10 is about the worst we see, but of course very surprising.

> Nice and consistent, as you mentioned.  And I assume your notation
> here means that it's across 2 nodes.

Yes, the Quadrics nodes are 2-socket dual core, so 8 procs needs two
nodes.

The rest of your observations are consistent with my understanding.  We
identified two other issues, neither of which accounts for a factor of
10, but which account for at least a factor of 3.

1. The administrators mounted a 16 GB ramdisk on /scratch, but did not
   ensure that it was wiped before the next task ran.  So if you got a
   node after some job that left stinky feces there, you could
   effectively only have 16 GB (before the old stuff would be swapped
   out).  More importantly, the physical pages backing the ramdisk may
   not be uniformly distributed across the sockets, and rather than
   preemptively swap out those old ramdisk pages, the kernel would find
   a page on some other socket (instead of locally, this could be
   confirmed, for example, by watching the numa_foreign and numa_miss
   counts with numastat).  Then when you went to use that memory
   (typically in a bandwidth-limited application), it was easy to have 3
   sockets all waiting on one bus, thus taking a factor of 3+
   performance hit despite a resident set much less than 50% of the
   available memory.  I have a rather complete analysis of this in case
   someone is interested.  Note that this can affect programs with
   static or dynamic allocation (the kernel looks for local pages when
   you fault it, not when you allocate it), the only way I know of to
   circumvent the problem is to allocate memory with libnuma
   (e.g. numa_alloc_local) which will fail if local memory isn't
   available (instead of returning and subsequently faulting remote
   pages).

2. The memory bandwidth is 16-18% different between sockets, with
   sockets 0,3 being slow and sockets 1,2 having much faster available
   bandwidth.  This is fully reproducible and acknowledged by
   Sun/Oracle, their response to an early inquiry:

 http://59A2.org/files/SunBladeX6440STREAM-20100616.pdf

   I am not completely happy with this explanation because the issue
   persists even with full software prefetch, packed SSE2, and
   non-temporal stores; as long as the working set does not fit within
   (per-socket) L3.  Note that the software prefetch allows for several
   hundred cycles of latency, so the extra hop for snooping shouldn't be
   a problem.  If the working set fits within L3, then all sockets are
   the same speed (and of course much faster due to improved bandwidth).
   Some disassembly here:

 http://gist.github.com/476942

   The three with prefetch and movntpd run within 2% of each other, the
   other is much faster within cache and much slower when it breaks out
   of cache (obviously).  The performance numbers are higher than with
   the reference implementation (quoted in Sun/Oracle's response), but
   (run with taskset to each of the four sockets):

 Triad:   5842.5814   0.0329   0.0329   0.0330
 Triad:   6843.4206   0.0281   0.0281   0.0282
 Triad:   6827.6390   0.0282   0.0281   0.0283
 Triad:   5862.0601   0.0329   0.0328   0.0331

   This is almost exclusively due to the prefetching, the packed
   arithmetic is almost completely inconsequential when waiting on
   memory bandwidth.

Jed
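
As a concrete illustration of the libnuma approach mentioned in point 1 above, 
a minimal sketch (assuming the libnuma headers are installed and the program 
is linked with -lnuma; the allocation size is arbitrary) might look like:

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t bytes = 256UL * 1024 * 1024;   /* 256 MiB, arbitrary */
        double *buf;

        if (numa_available() < 0) {
            fprintf(stderr, "libnuma not available on this system\n");
            return 1;
        }

        /* Ask for pages on the local NUMA node up front instead of relying
           on the first-touch policy at fault time. */
        buf = numa_alloc_local(bytes);
        if (buf == NULL) {
            fprintf(stderr, "numa_alloc_local failed\n");
            return 1;
        }

        memset(buf, 0, bytes);                /* touch the pages */
        /* ... bandwidth-limited work on buf ... */

        numa_free(buf, bytes);
        return 0;
    }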


Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Edgar Gabriel
you could try first to use the algorithms in the basic module, e.g.

mpirun -np x --mca coll basic ./mytest

and see whether this makes a difference. I used to observe sometimes a
(similar ?) problem in the openib btl triggered from the tuned
collective component, in cases where the ofed libraries were installed
but no NCA was found on a node. It used to work however with the basic
component.

Thanks
Edgar


On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> hi Rolf,
> 
> unfortunately, i couldn't get rid of that annoying segmentation fault when 
> selecting another bcast algorithm.
> i'm now going to replace MPI_Bcast with a naive implementation (using 
> MPI_Send and MPI_Recv) and see if that helps.
> 
> regards,
> éloi
> 
> 
> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
>> Hi Rolf,
>>
>> thanks for your input. You're right, I missed the
>> coll_tuned_use_dynamic_rules option.
>>
>> I'll check if the segmentation fault disappears when using the basic
>> bcast linear algorithm using the proper command line you provided.
>>
>> Regards,
>> Eloi
>>
>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
>>> Hi Eloi:
>>> To select the different bcast algorithms, you need to add an extra mca
>>> parameter that tells the library to use dynamic selection.
>>> --mca coll_tuned_use_dynamic_rules 1
>>>
>>> One way to make sure you are typing this in correctly is to use it with
>>> ompi_info.  Do the following:
>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
>>>
>>> You should see lots of output with all the different algorithms that can
>>> be selected for the various collectives.
>>> Therefore, you need this:
>>>
>>> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
>>>
>>> Rolf
>>>
>>> On 07/13/10 11:28, Eloi Gaudry wrote:
 Hi,

 I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch
 to the basic linear algorithm. Anyway whatever the algorithm used, the
 segmentation fault remains.

 Could anyone give some advice on ways to diagnose the issue I'm
 facing?

 Regards,
 Eloi

 On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> Hi,
>
> I'm focusing on the MPI_Bcast routine that seems to randomly segfault
> when using the openib btl. I'd like to know if there is any way to
> make OpenMPI switch to a different algorithm than the default one
> being selected for MPI_Bcast.
>
> Thanks for your help,
> Eloi
>
> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
>> Hi,
>>
>> I'm observing a random segmentation fault during an internode
>> parallel computation involving the openib btl and OpenMPI-1.4.2 (the
>> same issue can be observed with OpenMPI-1.3.3).
>>
>>mpirun (Open MPI) 1.4.2
>>Report bugs to http://www.open-mpi.org/community/help/
>>[pbn08:02624] *** Process received signal ***
>>[pbn08:02624] Signal: Segmentation fault (11)
>>[pbn08:02624] Signal code: Address not mapped (1)
>>[pbn08:02624] Failing at address: (nil)
>>[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
>>[pbn08:02624] *** End of error message ***
>>sh: line 1:  2624 Segmentation fault
>>
>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x
>> 86 _6 4\ /bin\/actranpy_mp
>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_
>> 64 /A c tran_11.0.rc2.41872'
>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.da
>> t' '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
>> '--parallel=domain'
>>
>> If I choose not to use the openib btl (by using --mca btl self,sm,tcp
>> on the command line, for instance), I don't encounter any problem and
>> the parallel computation runs flawlessly.
>>
>> I would like to get some help to be able:
>> - to diagnose the issue I'm facing with the openib btl
>> - understand why this issue is observed only when using the openib
>> btl and not when using self,sm,tcp
>>
>> Any help would be very much appreciated.
>>
>> The outputs of ompi_info and the configure scripts of OpenMPI are
>> enclosed to this email, and some information on the infiniband
>> drivers as well.
>>
>> Here is the command line used when launching a parallel computation
>>
>> using infiniband:
>>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
>>--mca
>>
>> btl openib,sm,self,tcp  --display-map --verbose --version --mca
>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
>>
>> and the command line used if not using infiniband:
>>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
>>--mca
>>
>> btl self,sm,tcp  --display-map --verbose --version --mca
>> 

Re: [OMPI users] Adding libraries to wrapper compiler at run-time

2010-07-15 Thread Jeff Squyres
On Jul 7, 2010, at 2:53 PM, Jeremiah Willcock wrote:

> The Open MPI FAQ shows how to add libraries to the Open MPI wrapper
> compilers when building them (using configure flags), but I would like to
> add flags for a specific run of the wrapper compiler.  Setting OMPI_LIBS
> overrides the necessary MPI libraries, and it does not appear that there
> is an easy way to get just the flags that OMPI_LIBS contains by default
> (either using -showme:link or ompi_info).  Is there a way to add to the
> default set of OMPI_LIBS rather than overriding it entirely?  Thank you
> for your help.

Sorry for the high latency reply!

There isn't currently a better way to do this other than editing or extracting 
the information from the wrapper data text files directly (it might not be too 
hard to parse the information from the wrapper text file -- a little 
grep/awk/sed might do ya...?).

If you have a better suggestion, a patch would be welcome!  :-)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Low Open MPI performance on InfiniBand and shared memory?

2010-07-15 Thread Jeff Squyres
(still trolling through the history in my INBOX...)

On Jul 9, 2010, at 8:56 AM, Andreas Schäfer wrote:

> On 14:39 Fri 09 Jul , Peter Kjellstrom wrote:
> > 8x pci-express gen2 5GT/s should show figures like mine. If it's pci-express
> > gen1 or gen2 2.5GT/s or 4x or if the IB only came up with two lanes then 
> > 1500
> > is expected.
> 
> lspci and ibv_devinfo tell me it's PCIe 2.0 x8 and InfiniBand 4x QDR
> (active_width 4X, active_speed 10.0 Gbps), so I /should/ be able to
> get about twice the throughput of what I'm currently seeing.

You'll get different shared memory performance if you bind both the local procs 
to a single socket or two different sockets.  I don't know much about AMDs, so 
I can't say exactly what it'll do offhand.

As for the IB performance, you want to make sure that your MPI process is bound 
to a core that is "near" the HCA for minimum latency and max bandwidth.  Then 
also check that your IB fabric is clean, etc.  I believe that OFED comes with a 
bunch of verbs-level latency and bandwidth unit tests that can measure what 
you're getting across your fabric (i.e., raw network performance without MPI).  
It's been a while since I've worked deeply with OFED stuff; I don't remember 
the command names offhand -- perhaps ibv_rc_pingpong, or somesuch?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] error in (Open MPI) 1.3.3r21324-ct8.2-b09b-r31

2010-07-15 Thread Scott Atchley
Lydia,

Which interconnect is this running over?

Scott

On Jul 15, 2010, at 5:19 AM, Lydia Heck wrote:

> We are running Sun's build of Open MPI 1.3.3r21324-ct8.2-b09b-r31
> (HPC8.2) and one code that runs perfectly fine under
> HPC8.1 (Open MPI) 1.3r19845-ct8.1-b06b-r21 and before fails with
> 
> 
> 
> [oberon:08454] *** Process received signal ***
> [oberon:08454] Signal: Segmentation Fault (11)
> [oberon:08454] Signal code: Address not mapped (1)
> [oberon:08454] Failing at address: 0
> /opt/SUNWhpc/HPC8.2/sun/lib/amd64/libopen-pal.so.0.0.0:0x4b89e
> /lib/amd64/libc.so.1:0xd0f36
> /lib/amd64/libc.so.1:0xc5a72
> 0x0 [ Signal 11 (SEGV)]
> /opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi.so.0.0.0:MPI_Alloc_mem+0x7f
> /opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi.so.0.0.0:MPI_Sendrecv_replace+0x31e
> /opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi_f77.so.0.0.0:PMPI_SENDRECV_REPLACE+0x94
> /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:mpi_cyclic_transfer_+0xd9
> /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:cycle_particles_and_interpolate_+0x94b
> /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:interpolate_field_+0xc30
> /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:MAIN_+0xe68
> /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:main+0x3d
> /home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:0x62ac
> [oberon:08454] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 8454 on node oberon exited on 
> signal 11 (Segmentation Fault).
> 
> 
> 
> I have not tried to get and build a newer Open MPI, so I do not know if the 
> problem propagates into the more recent versions.
> 
> 
> If the developers are interested, I could ask the user to prepare the code 
> for you to have a look at the problem, which looks to be in 
> MPI_Alloc_mem.
> 
> Best wishes,
> Lydia Heck
> 
> --
> Dr E L  Heck
> 
> University of Durham Institute for Computational Cosmology
> Ogden Centre
> Department of Physics South Road
> 
> DURHAM, DH1 3LE United Kingdom
> 
> e-mail: lydia.h...@durham.ac.uk
> 
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645
> ___
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI flags conditions

2010-07-15 Thread Jeff Squyres
On Jul 15, 2010, at 9:27 AM, Gabriele Fatigati wrote:

> Mm, at the moment no,
> 
> but i think it's a good idea to insert this feature in future OpenMPI release 
> :)

Agreed.

> We can have a parameter set that works well with a precise number of procs, 
> and not with a larger (or smaller) number.  The same goes for 
> message size.

We've actually (very briefly) talked internally about this kind of idea, but no 
one ever had the time / resources to come up with a logic language that would 
be necessary to implement such a thing (i.e., at a minimum, you'd have to have 
some kind of "if" logic in the mca-params.conf file).

But here's another idea: perhaps it would be a better idea to leverage an 
existing language (like perl or python or ...) that Open MPI could somehow 
"eval" at runtime -- the output of which would be the set of .params values 
that you want to use for the current job.  

That way, Open MPI stays out of the "language" business (there's lots of fine 
languages today; why should an MPI implementation invent another one?).  OMPI 
just adds some bootstrapping to be able to call down to some scripting language 
during job startup, and perhaps inject some pre-defined variables that 
perl/python/whatever can use (e.g., num MPI procs, host list, argv, existing 
MCA params, ...etc.).  

Lots of hand-waving in there, but if someone wants to run with this idea...  :-)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Highly variable performance

2010-07-15 Thread Jeff Squyres
Per my other disclaimer, I'm trolling through my disastrous inbox and finding 
some orphaned / never-answered emails.  Sorry for the delay!


On Jun 2, 2010, at 4:36 PM, Jed Brown wrote:

> The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), 
> connected
> with QDR InfiniBand.  The benchmark loops over
> 
>   
> MPI_Allgather(localdata,nlocal,MPI_DOUBLE,globaldata,nlocal,MPI_DOUBLE,MPI_COMM_WORLD);
> 
> with nlocal=1 (80 KiB messages) 1 times, so it normally runs in
> a few seconds.  
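
Written out in full, the quoted benchmark is essentially the following sketch 
(nlocal and the iteration count are placeholders chosen so that each rank 
contributes roughly 80 KiB per call; they are not the exact values used above):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        /* Placeholder sizes: nlocal doubles per rank, repeated niter times. */
        int nlocal = 10000, niter = 10000, i, rank, size;
        double *localdata, *globaldata;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        localdata  = malloc(nlocal * sizeof(double));
        globaldata = malloc((size_t)size * nlocal * sizeof(double));
        for (i = 0; i < nlocal; i++)
            localdata[i] = rank + i;

        for (i = 0; i < niter; i++)
            MPI_Allgather(localdata, nlocal, MPI_DOUBLE,
                          globaldata, nlocal, MPI_DOUBLE, MPI_COMM_WORLD);

        free(localdata);
        free(globaldata);
        MPI_Finalize();
        return 0;
    }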

Just to be clear -- you're running 8 procs locally on an 8 core node, right?  
(Cisco is an Intel partner -- I don't follow the AMD line much)  So this should 
all be local communication with no external network involved, right?

> #  JOB   TIME (s)  HOST
> 
> ompirun
> lsf.o240562 killed   8*a6200
> lsf.o240563 9.2110e+01   8*a6200
> lsf.o240564 1.5638e+01   8*a6237
> lsf.o240565 1.3873e+01   8*a6228

Am I reading that right that it's 92 seconds vs. 13 seconds?  Woof!

> ompirun -mca btl self,sm
> lsf.o240574 1.6916e+01   8*a6237
> lsf.o240575 1.7456e+01   8*a6200
> lsf.o240576 1.4183e+01   8*a6161
> lsf.o240577 1.3254e+01   8*a6203
> lsf.o240578 1.8848e+01   8*a6274

13 vs. 18 seconds.  Better, but still dodgy.

> prun (quadrics)
> lsf.o240602 1.6168e+01   4*a2108+4*a2109
> lsf.o240603 1.6746e+01   4*a2110+4*a2111
> lsf.o240604 1.6371e+01   4*a2108+4*a2109
> lsf.o240606 1.6867e+01   4*a2110+4*a2111

Nice and consistent, as you mentioned.  And I assume your notation here means 
that it's across 2 nodes.

> ompirun -mca btl self,openib
> lsf.o240776 3.1463e+01   8*a6203
> lsf.o240777 3.0418e+01   8*a6264
> lsf.o240778 3.1394e+01   8*a6203
> lsf.o240779 3.5111e+01   8*a6274

Also much better.  Probably because all messages are equally penalized by going 
out to the HCA and back.

> ompirun -mca self,sm,openib
> lsf.o240851 1.3848e+01   8*a6244
> lsf.o240852 1.7362e+01   8*a6237
> lsf.o240854 1.3266e+01   8*a6204
> lsf.o240855 1.3423e+01   8*a6276

This should be pretty much the same as sm,self, because openib shouldn't be 
used for any of the communication (i.e., Open MPI should determine that sm is 
the "best" transport between all the peers and silently discard openib).

> ompirun
> lsf.o240858 1.4415e+01   8*a6244
> lsf.o240859 1.5092e+01   8*a6237
> lsf.o240860 1.3940e+01   8*a6204
> lsf.o240861 1.5521e+01   8*a6276
> lsf.o240903 1.3273e+01   8*a6234
> lsf.o240904 1.6700e+01   8*a6206
> lsf.o240905 1.4636e+01   8*a6269
> lsf.o240906 1.5056e+01   8*a6234

Strange that this would be different than the first one.  It should be 
functionally equivalent to --mca self,sm,openib.

> ompirun -mca self,tcp
> lsf.o240948 1.8504e+01   8*a6234
> lsf.o240949 1.9317e+01   8*a6207
> lsf.o240950 1.8964e+01   8*a6234
> lsf.o240951 2.0764e+01   8*a6207

Variation here isn't too bad.  The slowdown here (compared to sm) is likely 
because it's going through the TCP loopback stack vs. "directly" going to the 
peer in shared memory.

...a quick look through the rest seems to indicate that they're more-or-less 
consistent with what you showed above.

Your later mail says:

> Following up on this, I have partial resolution.  The primary culprit
> appears to be stale files in a ramdisk non-uniformly distributed across
> the sockets, thus interacting poorly with NUMA.  The slow runs
> invariably have high numa_miss and numa_foreign counts.  I still have
> trouble making it explain up to a factor of 10 degradation, but it
> certainly explains a factor of 3.

Try playing with Open MPI's process affinity options, like --bind-to-core (see 
mpirun(1)). This may help prevent some OS jitter in moving processes around, 
and allow pinning memory locally to each NUMA node.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] OpenMPI flags conditions

2010-07-15 Thread Gabriele Fatigati
Mm, at the moment no,

but i think it's a good idea to insert this feature in future OpenMPI
release :)

We can have a parameter set that works well with a precise number of procs,
and not with a larger (or smaller) number.  The same goes for
message size.

Thanks for the quick reply! :D



2010/7/15 Jeff Squyres 

> We don't have any kind of logic language like that for the params files.
>
> Got any suggestions / patches?
>
>
> On Jul 15, 2010, at 8:37 AM, Gabriele Fatigati wrote:
>
> > Dear OpenMPI users,
> >
> > is it possible to define some set of parameters for a given range of
> > processor counts and message sizes in openmpi-mca-params.conf? For example:
> >
> > if nprocs<256
> >
> > some mca parameters...
> >
> > if nprocs > 256
> >
> > other mca parameters..
> >
> > and the same related to message size?
> >
> > Thanks in advance.
> >
> > --
> > Ing. Gabriele Fatigati
> >
> > Parallel programmer
> >
> > CINECA Systems & Tecnologies Department
> >
> > Supercomputing Group
> >
> > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> >
> > www.cineca.itTel:   +39 051 6171722
> >
> > g.fatigati [AT] cineca.it
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>


-- 
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.itTel:   +39 051 6171722

g.fatigati [AT] cineca.it


Re: [OMPI users] 1.4.2 build problem

2010-07-15 Thread Jeff Squyres
On Jun 2, 2010, at 10:14 AM, John Cary wrote:

> It seems that the rpath arg is something that bites me over and again.
> What are your thoughts about making this automatic?

I'm trolling through the disaster that is my inbox and finding some orphaned 
email threads -- sorry for the delay, folks!

We have some developers in the OMPI community who are passionate about adding 
only the bare minimum set of compiler / linker flags that are necessary for 
correctness.  Rpath has come up again and again (i.e., add rpath into the set 
of flags that is added by the wrappers) and has been soundly rejected because 
it is not part of that "bare minimum set."

That being said, I think the objections would be *much* less if adding rpath 
was optional (and probably not the default) -- e.g., perhaps rpath is added to 
wrapper compiler flags if OMPI was configured with a specific command line 
flag, or a specific MCA parameter is enabled.

We'd be very happy to accept patches for this kind of thing...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] OpenMPI flags conditions

2010-07-15 Thread Jeff Squyres
We don't have any kind of logic language like that for the params files.

Got any suggestions / patches?


On Jul 15, 2010, at 8:37 AM, Gabriele Fatigati wrote:

> Dear OpenMPI users,
> 
> is it possible to define some set of parameters for a given range of 
> processor counts and message sizes in openmpi-mca-params.conf? For example:
> 
> if nprocs<256
> 
> some mca parameters...
> 
> if nprocs > 256
> 
> other mca parameters..
> 
> and the same related to message size?
> 
> Thanks in advance.
> 
> -- 
> Ing. Gabriele Fatigati
> 
> Parallel programmer
> 
> CINECA Systems & Tecnologies Department
> 
> Supercomputing Group
> 
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> 
> www.cineca.itTel:   +39 051 6171722
> 
> g.fatigati [AT] cineca.it   
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] orted unknown option "--daemonize"

2010-07-15 Thread Jeff Squyres
This usually means that you have mis-matched versions of Open MPI across your 
machines.  Double check that you have the same version of Open MPI installed on 
all the machines that you'll be running (e.g., perhaps birg-desktop-10 has a 
different version?).


On Jul 15, 2010, at 5:18 AM, TH Chew wrote:

> Hi all,
> 
> I am setting up a 7+1 nodes cluster for MD simulation, specifically using 
> GROMACS. I am using Ubuntu Lucid 64-bit on all machines. Installed gromacs, 
> gromacs-openmpi, and gromacs-mpich from the repository. MPICH version of 
> gromacs runs fine without any error. However, when I ran OpenMPI version of 
> gromacs by
> 
> ###
> mpirun.openmpi -np 8 -wdir /home/birg/Desktop/nfs/ -hostfile 
> ~/Desktop/mpi_settings/hostfile mdrun_mpi.openmpi -v
> ###
> 
> an error occurs, something like this
> 
> ###
> [birg-desktop-10:02101] Error: unknown option "--daemonize"
> Usage: orted [OPTION]...
> Start an Open RTE Daemon
> 
>--bootproxy Run as boot proxy for 
> -d|--debug   Debug the OpenRTE
> -d|--spinHave the orted spin until we can connect a debugger
>  to it
>--debug-daemons   Enable debugging of OpenRTE daemons
>--debug-daemons-file  Enable debugging of OpenRTE daemons, storing output
>  in files
>--gprreplicaRegistry contact information.
> -h|--helpThis help message
>--mpi-call-yield   
>  Have MPI (or similar) applications call yield when
>  idle
>--name  Set the orte process name
>--no-daemonizeDon't daemonize into the background
>--nodename  Node name as specified by host/resource
>  description.
>--ns-ndsset sds/nds component to use for daemon (normally
>  not needed)
>--nsreplica Name service contact information.
>--num_procs Set the number of process in this job
>--persistent  Remain alive after the application process
>  completes
>--report-uriReport this process' uri on indicated pipe
>--scope Set restrictions on who can connect to this
>  universe
>--seedHost replicas for the core universe services
>--set-sid Direct the orted to separate from the current
>  session
>--tmpdirSet the root for the session directory tree
>--universe  Set the universe name as
>  username@hostname:universe_name for this
>  application
>--vpid_startSet the starting vpid for this job
> --
> A daemon (pid 5598) died unexpectedly with status 251 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun.openmpi noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> --
> mpirun.openmpi was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --
> birg-desktop-04 - daemon did not report back when launched
> birg-desktop-07 - daemon did not report back when launched
> birg-desktop-10 - daemon did not report back when launched
> ###
> 
> It is strange that it only happens on one of the compute nodes 
> (birg-desktop-10). If I remove birg-desktop-10 from the hostfile, I can run 
> OpenMPI gromacs successfully. Any idea?
> 
> Thanks.
> 
> -- 
> Regards,
> THChew
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] error in (Open MPI) 1.3.3r21324-ct8.2-b09b-r31

2010-07-15 Thread Lydia Heck


We are running Sun's build of Open MPI 1.3.3r21324-ct8.2-b09b-r31
(HPC8.2) and one code that runs perfectly fine under
HPC8.1 (Open MPI) 1.3r19845-ct8.1-b06b-r21 and before fails with



[oberon:08454] *** Process received signal ***
[oberon:08454] Signal: Segmentation Fault (11)
[oberon:08454] Signal code: Address not mapped (1)
[oberon:08454] Failing at address: 0
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libopen-pal.so.0.0.0:0x4b89e
/lib/amd64/libc.so.1:0xd0f36
/lib/amd64/libc.so.1:0xc5a72
0x0 [ Signal 11 (SEGV)]
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi.so.0.0.0:MPI_Alloc_mem+0x7f
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi.so.0.0.0:MPI_Sendrecv_replace+0x31e
/opt/SUNWhpc/HPC8.2/sun/lib/amd64/libmpi_f77.so.0.0.0:PMPI_SENDRECV_REPLACE+0x94
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:mpi_cyclic_transfer_+0xd9
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:cycle_particles_and_interpolate_+0x94b
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:interpolate_field_+0xc30
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:MAIN_+0xe68
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:main+0x3d
/home/arj/code_devel/ic_gen_2lpt_v3.5/comp_disp.x:0x62ac
[oberon:08454] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 8454 on node oberon exited on 
signal 11 (Segmentation Fault).




I have not tried to get and build a newer Open MPI, so I do not know if the 
problem propagates into the more recent versions.



If the developers are interested, I could ask the user to prepare the code for 
you to have a look at the problem, which looks to be in MPI_Alloc_mem.


Best wishes,
Lydia Heck

--
Dr E L  Heck

University of Durham 
Institute for Computational Cosmology

Ogden Centre
Department of Physics 
South Road


DURHAM, DH1 3LE 
United Kingdom


e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___



[OMPI users] orted unknown option "--daemonize"

2010-07-15 Thread TH Chew
Hi all,

I am setting up a 7+1 nodes cluster for MD simulation, specifically using
GROMACS. I am using Ubuntu Lucid 64-bit on all machines. Installed gromacs,
gromacs-openmpi, and gromacs-mpich from the repository. MPICH version of
gromacs runs fine without any error. However, when I ran OpenMPI version of
gromacs by

###
mpirun.openmpi -np 8 -wdir /home/birg/Desktop/nfs/ -hostfile
~/Desktop/mpi_settings/hostfile mdrun_mpi.openmpi -v
###

an error occurs, something like this

###
[birg-desktop-10:02101] Error: unknown option "--daemonize"
Usage: orted [OPTION]...
Start an Open RTE Daemon

   --bootproxy Run as boot proxy for 
-d|--debug   Debug the OpenRTE
-d|--spinHave the orted spin until we can connect a debugger
 to it
   --debug-daemons   Enable debugging of OpenRTE daemons
   --debug-daemons-file  Enable debugging of OpenRTE daemons, storing output
 in files
   --gprreplicaRegistry contact information.
-h|--helpThis help message
   --mpi-call-yield 
 Have MPI (or similar) applications call yield when
 idle
   --name  Set the orte process name
   --no-daemonizeDon't daemonize into the background
   --nodename  Node name as specified by host/resource
 description.
   --ns-ndsset sds/nds component to use for daemon (normally
 not needed)
   --nsreplica Name service contact information.
   --num_procs Set the number of process in this job
   --persistent  Remain alive after the application process
 completes
   --report-uriReport this process' uri on indicated pipe
   --scope Set restrictions on who can connect to this
 universe
   --seedHost replicas for the core universe services
   --set-sid Direct the orted to separate from the current
 session
   --tmpdirSet the root for the session directory tree
   --universe  Set the universe name as
 username@hostname:universe_name for this
 application
   --vpid_startSet the starting vpid for this job
--
A daemon (pid 5598) died unexpectedly with status 251 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun.openmpi noticed that the job aborted, but has no info as to the
process
that caused that situation.
--
--
mpirun.openmpi was unable to cleanly terminate the daemons on the nodes
shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
birg-desktop-04 - daemon did not report back when launched
birg-desktop-07 - daemon did not report back when launched
birg-desktop-10 - daemon did not report back when launched
###

It is strange that it only happens on one of the compute nodes
(birg-desktop-10). If I remove birg-desktop-10 from the hostfile, I can run
OpenMPI gromacs successfully. Any idea?

Thanks.

-- 
Regards,
THChew


Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Eloi Gaudry
hi Rolf,

unfortunately, i couldn't get rid of that annoying segmentation fault when 
selecting another bcast algorithm.
i'm now going to replace MPI_Bcast with a naive implementation (using MPI_Send 
and MPI_Recv) and see if that helps.
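
For reference, a naive linear broadcast of that kind (a sketch only, assuming a 
contiguous buffer and plain blocking point-to-point calls; this is not the 
actual replacement code) could look like:

    #include <mpi.h>
    #include <stdio.h>

    /* Naive linear broadcast: the root sends to every other rank in turn. */
    static void naive_bcast(void *buf, int count, MPI_Datatype type,
                            int root, MPI_Comm comm)
    {
        int rank, size, i;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        if (rank == root) {
            for (i = 0; i < size; i++)
                if (i != root)
                    MPI_Send(buf, count, type, i, 0, comm);
        } else {
            MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
        }
    }

    int main(int argc, char **argv)
    {
        int value = 0, rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            value = 42;                       /* arbitrary payload */
        naive_bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d has value %d\n", rank, value);
        MPI_Finalize();
        return 0;
    }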

regards,
éloi


On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> Hi Rolf,
> 
> thanks for your input. You're right, I missed the
> coll_tuned_use_dynamic_rules option.
> 
> I'll check if the segmentation fault disappears when using the basic
> bcast linear algorithm using the proper command line you provided.
> 
> Regards,
> Eloi
> 
> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> > Hi Eloi:
> > To select the different bcast algorithms, you need to add an extra mca
> > parameter that tells the library to use dynamic selection.
> > --mca coll_tuned_use_dynamic_rules 1
> > 
> > One way to make sure you are typing this in correctly is to use it with
> > ompi_info.  Do the following:
> > ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> > 
> > You should see lots of output with all the different algorithms that can
> > be selected for the various collectives.
> > Therefore, you need this:
> > 
> > --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
> > 
> > Rolf
> > 
> > On 07/13/10 11:28, Eloi Gaudry wrote:
> > > Hi,
> > > 
> > > I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch
> > > to the basic linear algorithm. Anyway whatever the algorithm used, the
> > > segmentation fault remains.
> > > 
> > > Could anyone give some advice on ways to diagnose the issue I'm
> > > facing?
> > > 
> > > Regards,
> > > Eloi
> > > 
> > > On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> > >> Hi,
> > >> 
> > >> I'm focusing on the MPI_Bcast routine that seems to randomly segfault
> > >> when using the openib btl. I'd like to know if there is any way to
> > >> make OpenMPI switch to a different algorithm than the default one
> > >> being selected for MPI_Bcast.
> > >> 
> > >> Thanks for your help,
> > >> Eloi
> > >> 
> > >> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> > >>> Hi,
> > >>> 
> > >>> I'm observing a random segmentation fault during an internode
> > >>> parallel computation involving the openib btl and OpenMPI-1.4.2 (the
> > >>> same issue can be observed with OpenMPI-1.3.3).
> > >>> 
> > >>>mpirun (Open MPI) 1.4.2
> > >>>Report bugs to http://www.open-mpi.org/community/help/
> > >>>[pbn08:02624] *** Process received signal ***
> > >>>[pbn08:02624] Signal: Segmentation fault (11)
> > >>>[pbn08:02624] Signal code: Address not mapped (1)
> > >>>[pbn08:02624] Failing at address: (nil)
> > >>>[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> > >>>[pbn08:02624] *** End of error message ***
> > >>>sh: line 1:  2624 Segmentation fault
> > >>> 
> > >>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
> > >>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
> > >>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> > >>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> > >>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
> > >>> '--parallel=domain'
> > >>> 
> > >>> If I choose not to use the openib btl (by using --mca btl self,sm,tcp
> > >>> on the command line, for instance), I don't encounter any problem and
> > >>> the parallel computation runs flawlessly.
> > >>> 
> > >>> I would like to get some help to be able:
> > >>> - to diagnose the issue I'm facing with the openib btl
> > >>> - to understand why this issue is observed only when using the openib
> > >>> btl and not when using self,sm,tcp
> > >>> 
> > >>> Any help would be very much appreciated.
> > >>> 
> > >>> The outputs of ompi_info and the configure scripts of OpenMPI are
> > >>> enclosed to this email, and some information on the infiniband
> > >>> drivers as well.
> > >>> 
> > >>> Here is the command line used when launching a parallel computation
> > >>> 
> > >>> using infiniband:
> > >>>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> > >>>--mca
> > >>> 
> > >>> btl openib,sm,self,tcp  --display-map --verbose --version --mca
> > >>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> > >>> 
> > >>> and the command line used if not using infiniband:
> > >>>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> > >>>--mca
> > >>> 
> > >>> btl self,sm,tcp  --display-map --verbose --version --mca
> > >>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> > >>> 
> > >>> Thanks,
> > >>> Eloi
> > > 

-- 


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959



Re: [OMPI users] Error while compiling openMPI 1.4.2 in Cygwin 1.7.5-1. Library missing?

2010-07-15 Thread Shiqing Fan

 Hi Miguel,

Cygwin is not actively supported, as we are now focusing on the native 
Windows build using CMake and Visual Studio. But I remember emails from 
some time ago reporting that people had done Cygwin builds with the 1.3 
series, see here: http://www.open-mpi.org/community/lists/users/2008/11/7294.php 
-- but it's difficult and might be different for 1.4.2.


However, I would recommend using the CMake+VS build; it's much 
easier and faster than the Cygwin build. Is there a particular reason 
you have to use Cygwin?



Regards,
Shiqing

On 2010-7-9 5:50 PM, Miguel Rubio-Roy wrote:

Hi all,
   I'm trying to compile openMPI 1.4.2 in Cygwin 1.7.5-1.
   After ./configure I do make and after some time I always get this
error. I've tried "make clean" and "make" again, but that doesn't
help. It looks to me like I have all the requirements of the
README.Windows file (Cygwin and libtool 2.2.7a > 2.2.6).
   I guess my installation is missing some library, but which one? Find
attached the "configure" log.

Thanks!

Miguel

Error:

make[2]: Entering directory
`/home/miguel/openmpi-1.4.2/opal/mca/installdirs/windows'
depbase=`echo opal_installdirs_windows.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
/bin/sh ../../../../libtool --tag=CC   --mode=compile gcc
-DHAVE_CONFIG_H -I. -I../../../../opal/include
-I../../../../orte/include -I../../../../ompi/include
-I../../../../opal/mca/paffinity/linux/plpa/src/libplpa
-I../../../..  -D_REENTRANT  -O3 -DNDEBUG -finline-functions
-fno-strict-aliasing  -MT opal_installdirs_windows.lo -MD -MP -MF
$depbase.Tpo -c -o opal_installdirs_windows.lo
opal_installdirs_windows.c&&\
mv -f $depbase.Tpo $depbase.Plo
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include
-I../../../../orte/include -I../../../../ompi/include
-I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../..
-D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT
opal_installdirs_windows.lo -MD -MP -MF
.deps/opal_installdirs_windows.Tpo -c opal_installdirs_windows.c
-DDLL_EXPORT -DPIC -o .libs/opal_installdirs_windows.o
opal_installdirs_windows.c: In function ‘installdirs_windows_open’:
opal_installdirs_windows.c:69: error: ‘HKEY’ undeclared (first use in
this function)
opal_installdirs_windows.c:69: error: (Each undeclared identifier is
reported only once
opal_installdirs_windows.c:69: error: for each function it appears in.)
opal_installdirs_windows.c:69: error: expected ‘;’ before ‘ompi_key’
opal_installdirs_windows.c:79: error: ‘ERROR_SUCCESS’ undeclared
(first use in this function)
opal_installdirs_windows.c:79: error: ‘HKEY_LOCAL_MACHINE’ undeclared
(first use in this function)
opal_installdirs_windows.c:79: error: ‘KEY_READ’ undeclared (first use
in this function)
opal_installdirs_windows.c:79: error: ‘ompi_key’ undeclared (first use
in this function)
opal_installdirs_windows.c:85: error: ‘DWORD’ undeclared (first use in
this function)
opal_installdirs_windows.c:85: error: expected ‘;’ before ‘cbData’
opal_installdirs_windows.c:85: error: ‘valueLength’ undeclared (first
use in this function)
opal_installdirs_windows.c:85: error: ‘cbData’ undeclared (first use
in this function)
opal_installdirs_windows.c:85: error: ‘keyType’ undeclared (first use
in this function)
opal_installdirs_windows.c:85: error: ‘LPBYTE’ undeclared (first use
in this function)
opal_installdirs_windows.c:85: error: expected ‘)’ before ‘vData’
opal_installdirs_windows.c:85: error: ‘REG_EXPAND_SZ’ undeclared
(first use in this function)
opal_installdirs_windows.c:85: error: ‘REG_SZ’ undeclared (first use
in this function)
opal_installdirs_windows.c:86: error: expected ‘;’ before ‘cbData’
opal_installdirs_windows.c:86: error: expected ‘)’ before ‘vData’
opal_installdirs_windows.c:87: error: expected ‘;’ before ‘cbData’
opal_installdirs_windows.c:87: error: expected ‘)’ before ‘vData’
opal_installdirs_windows.c:88: error: expected ‘;’ before ‘cbData’
opal_installdirs_windows.c:88: error: expected ‘)’ before ‘vData’
opal_installdirs_windows.c:89: error: expected ‘;’ before ‘cbData’
opal_installdirs_windows.c:89: error: expected ‘)’ before ‘vData’
opal_installdirs_windows.c:90: error: expected ‘;’ before ‘cbData’
opal_installdirs_windows.c:90: error: expected ‘)’ before ‘vData’
opal_installdirs_windows.c:91: error: expected ‘;’ before ‘cbData’
opal_installdirs_windows.c:91: error: expected ‘)’ before ‘vData’
opal_installdirs_windows.c:92: error: expected ‘;’ before ‘cbData’
opal_installdirs_windows.c:92: error: expected ‘)’ before ‘vData’
opal_installdirs_windows.c:93: error: expected ‘;’ before ‘cbData’
opal_installdirs_windows.c:93: error: expected ‘)’ before ‘vData’
opal_installdirs_windows.c:94: error: expected ‘;’ before ‘cbData’
opal_installdirs_windows.c:94: error: expected ‘)’ before ‘vData’
opal_installdirs_windows.c:95: error: expected ‘;’ before ‘cbData’
opal_installdirs_windows.c:95: error: expected ‘)’ before ‘vData’
opal_installdirs_windows.c:96: error: expected ‘;’ before ‘cbData’
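
For what it's worth, the undeclared identifiers in that log (HKEY, DWORD,
ERROR_SUCCESS, KEY_READ, REG_SZ, ...) are Win32 registry types and constants
that normally come from <windows.h>. The snippet below is only a standalone
illustration of where those symbols live; it is a hypothetical test program,
not a patch for this build failure, and it is meant to be compiled with a
Windows toolchain (MinGW or MSVC) and linked against advapi32:

    /* Minimal use of the Win32 registry symbols named in the errors above.
     * Requires the Windows API headers; link with -ladvapi32. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HKEY key;
        LONG rc = RegOpenKeyExA(HKEY_LOCAL_MACHINE, "SOFTWARE", 0,
                                KEY_READ, &key);
        if (rc == ERROR_SUCCESS) {
            printf("HKLM\\SOFTWARE opened; REG_SZ = %lu, REG_EXPAND_SZ = %lu\n",
                   (unsigned long)REG_SZ, (unsigned long)REG_EXPAND_SZ);
            RegCloseKey(key);
        }
        return 0;
    }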

Re: [hwloc-users] hwloc_set/get_thread_cpubind

2010-07-15 Thread Brice Goglin
On 14/07/2010 20:28, Αλέξανδρος Παπαδογιαννάκης wrote:
> hwloc_set_thread_cpubind and hwloc_get_thread_cpubind are missing from the 
> html documentation
> http://www.open-mpi.org/projects/hwloc/doc/v1.0.1/group__hwlocality__binding.php
> 
>   

It may be related to the way doxygen handles #ifdef, but we seem to have the
right PREDEFINED variable in the config, and the HTML doc seems properly
generated on my machine (Debian with doxygen 1.6.3).
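
For anyone landing here while those entries are missing from the HTML page,
here is a minimal usage sketch of the thread-binding call (my own example,
not taken from the hwloc docs; it assumes a POSIX platform where
hwloc_thread_t maps to pthread_t and a 1.0-era API):

    #include <hwloc.h>
    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topology;
        hwloc_obj_t core;

        hwloc_topology_init(&topology);
        hwloc_topology_load(topology);

        /* Bind the calling thread to the first core of the machine. */
        core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
        if (core == NULL ||
            hwloc_set_thread_cpubind(topology, pthread_self(),
                                     core->cpuset, 0) < 0)
            fprintf(stderr, "thread binding failed or not available\n");

        hwloc_topology_destroy(topology);
        return 0;
    }

hwloc_get_thread_cpubind takes the same arguments, filling a cpuset that the
caller has allocated beforehand.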

Brice



Re: [OMPI users] Question on checkpoint overhead in Open MPI

2010-07-15 Thread Nguyen Toan
Could somebody help, please? I am sorry to spam the mailing list, but I really
need your help.
Thanks in advance.

Best Regards,
Nguyen Toan

On Thu, Jul 8, 2010 at 1:25 AM, Nguyen Toan wrote:

> Hello everyone,
> I have a question concerning the checkpoint overhead in Open MPI, which I
> take to be the difference between the application runtime with and without
> checkpointing.
> I observe that as the data size and the number of processes increase, the
> runtime of BLCR becomes very small compared to the overall checkpoint
> overhead in Open MPI. Is this because the coordination time for the
> checkpoint increases? And what is included in the overall checkpoint
> overhead besides BLCR's checkpoint time and the coordination time?
> Thank you.
>
> Best Regards,
> Nguyen Toan
>
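
To frame the question, with coordinated checkpointing and BLCR one can
roughly decompose the measured overhead as follows (an illustrative
breakdown under those assumptions, not Open MPI's exact accounting):

    T_{overhead} = T_{with\ ckpt} - T_{without\ ckpt}
                 \approx T_{coord} + T_{BLCR} + T_{I/O}

where T_{coord} is the time spent coordinating the processes and draining
in-flight MPI traffic before the image is taken, T_{BLCR} is the
process-image capture itself, and T_{I/O} is the time to write the
checkpoint files to storage; the last term in particular tends to grow with
the data size and the number of processes.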