[Pw_forum] openmp vs mpich performance with MKL 10.x

2008-05-09 Thread Eduardo Ariel Menendez Proupin
Thank you Axel, for all the time and the explanations of the deep numerical
details.
I will give up at this point. I am satisfied to know that I am not
wasting CPU resources due to bad configuration, and I will keep both MPI and
OpenMP compilations. I will try to add my results to Nicola's benchmark with CP.
Best regards
-- 
Eduardo Menendez


[Pw_forum] openmp vs mpich performance with MKL 10.x

2008-05-07 Thread Axel Kohlmeyer
On Wed, 7 May 2008, Eduardo Ariel Menendez Proupin wrote:

EAM> Hi,
EAM> Please, find attached my best make.sys, to be run serially. Try this in your
EAM> system. My timings are close to yours. Below are the details. However, it

ok, i tried running on my machine: with the intel wrapper
i get a wall time of 13m43s, and using a multi-threaded fftw3
i need a wall time of 16m13s to complete the job. i have
not yet added the additional tunings that i added to CPMD
that finally made fftw3 faster there.

in summary it looks as if on my hardware MPI is the winner. 
i would be interested to see if you get different timings 
with OpenMPI instead of MPICH. 


EAM> runs faster serially than using mpiexec -n 1.


[...]


EAM> >
EAM> > obviously, switching to the intel fft didn't help.
EAM> 
EAM> FOR ME, IT HELPS ONLY WHEN RUNNING SERIAL.

on CPMD i found that, actually, using the multi-threaded fftw3
is _even_ faster. you will need to add one function call to
tell the fftw3 planner that all future plans should be generated
for $OMP_NUM_THREADS threads (see the sketch below).
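
for illustration, a minimal C sketch of that one call (not taken from CPMD
or QE; the grid size and reading the thread count from OMP_NUM_THREADS are
just assumptions for the example), assuming an fftw3 built with threading
support:

#include <stdlib.h>
#include <fftw3.h>

int main(void)
{
    /* pick the thread count from OMP_NUM_THREADS (assumption for this sketch) */
    int nthreads = 1;
    const char *env = getenv("OMP_NUM_THREADS");
    if (env) nthreads = atoi(env);

    fftw_init_threads();                /* one-time initialization            */
    fftw_plan_with_nthreads(nthreads);  /* all plans created after this call  */
                                        /* will use nthreads threads          */

    int n = 128;                        /* hypothetical 3d grid dimension     */
    fftw_complex *data = fftw_alloc_complex((size_t)n * n * n);
    fftw_plan p = fftw_plan_dft_3d(n, n, n, data, data,
                                   FFTW_FORWARD, FFTW_MEASURE);
    fftw_execute(p);                    /* this 3d transform runs threaded    */

    fftw_destroy_plan(p);
    fftw_free(data);
    fftw_cleanup_threads();
    return 0;
}

link with something like -lfftw3_threads -lfftw3 (plus -fopenmp or -lpthread,
depending on how the threaded fftw3 was built).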

the fact that it helps only in the serial code is easily
understandable if you look at what QE's FFT modules do differently
when running in serial or in parallel.

if you run in serial, QE calls a 3d-FFT directly instead
of a sequence of 1d/2d-FFTs. with the 3d-fft you have the
chance to parallelize in the same way as with MPI by using
threads. if you run in parallel, you already call many
small 1d-ffts and those don't parallelize well. instead,
those calls would have to be distributed across threads
to get a similar gain (see the sketch after this paragraph).
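
roughly, what i mean looks like this (a C sketch of my own, not the actual
QE or CPMD code): plan one 1d transform up front and let OpenMP distribute
the independent executions:

#include <stddef.h>
#include <fftw3.h>

/* transform 'ncols' independent columns of length 'n', stored contiguously */
void batch_fft_1d(fftw_complex *columns, int n, int ncols)
{
    /* plan once, outside the parallel region (planning is not thread-safe) */
    fftw_plan p = fftw_plan_dft_1d(n, columns, columns,
                                   FFTW_FORWARD, FFTW_ESTIMATE);

    #pragma omp parallel for
    for (int i = 0; i < ncols; i++) {
        fftw_complex *col = columns + (size_t)i * n;
        /* new-array execution is thread-safe as long as the arrays differ
           and share the alignment of the array used for planning          */
        fftw_execute_dft(p, col, col);
    }

    fftw_destroy_plan(p);
}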

EAM> > your system with many states and only gamma point
EAM> > is definitely a case that benefits the most from
EAM> > multi-threaded BLAS/LAPACK.
EAM> 
EAM> TYPICAL FOR BO MOLECULAR DYNAMICS.
EAM> I WOULD SAY, AVOID MIXING MPI AND OPENMP. ALSO AVOID INTEL  FFTW WRAPPERS
EAM> WITH MPI, EVEN IF OMP_NUM_THREADS=1.
EAM> USE THREADED BLAS/LAPACK/FFTW2(3) FOR SERIAL RUNS.

i don't think that this can be said in general, because
your system is a best case scenario. in my experience a
serial executable is about 10% faster than a parallel one
for one task with plane-wave pseudopotential calculations.
the fact that you have a large system with only gamma
point gives you the maximum benefit from parallel LAPACK/BLAS
and the multi-threaded FFT. however, if you want to do
BO-dynamics i suspect that you may lose the performance
advantage, since the wavefunction extrapolation will cut
down the number of SCF cycles needed and at the same time
the force calculation is not multi-threaded at all.

to get a real benefit from a multi-core machine, additional
OpenMP directives need to be added to the QE code (a small
illustration follows below). the fact that OpenMP libraries and
MPI parallelization are somewhat comparable could indicate that
there is some more room to improve the MPI parallelization.
luckily, for most QE users the first, simple level of
parallelization across k-points will apply and give them a lot
of speedup without much effort, and only _then_ should the
parallelization across G-space, task groups and finally
threads/libraries/OpenMP directives come into play.
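
just to illustrate what "adding OpenMP directives" means, a trivial sketch
(C instead of QE's Fortran, with a made-up routine name) of the kind of
independent loop that a single directive can parallelize:

#include <complex.h>

/* accumulate |psi(g)|^2 into a density-like array; every iteration is
   independent, so one directive is enough to spread the loop over threads */
void accumulate_density(int npw, const double complex *psi, double *rho)
{
    #pragma omp parallel for
    for (int ig = 0; ig < npw; ig++) {
        double re = creal(psi[ig]);
        double im = cimag(psi[ig]);
        rho[ig] += re * re + im * im;
    }
}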

cheers,
   axel.

EAM> 
EAM> ANYWAY, THE DIFFERENCE BETWEEN THE BEST MPI AND THE BEST OPENMP IS LESS THAN
EAM> 10% (11m30s vs 12m43s)
EAM> 
EAM> >
EAM> >
EAM> > i'm curious to learn how these number match up
EAM> > with your performance measurements.
EAM> >
EAM> > cheers,
EAM> >   axel.
EAM> >
EAM> >
EAM> >
EAM> 
EAM> 
EAM> 

-- 
===
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
===
If you make something idiot-proof, the universe creates a better idiot.


[Pw_forum] openmp vs mpich performance with MKL 10.x

2008-05-06 Thread Axel Kohlmeyer
On Tue, 6 May 2008, Eduardo Ariel Menendez Proupin wrote:

EAMP> Dear Axel,
EAMP> 
EAMP> > you should also compare against the parallel executable
EAMP> > run with -np 1 against the serial executable.
EAMP> 
EAMP> It is a bit slower.

eduardo,

ok. thanks, here are some timing results on my desktop.
the machine was not entirely idle and freshly booted, so 
take the numbers with a bit of caution.

i have a one year old two-socket intel dual core 2.66GHz machine
(i.e. more or less equivalent to a single socket 
intel quad-core, with two dual-core dies in one case).

this is using the latest cvs code:

with serial MKL, serial FFTW-2.1.5 and OpenMPI with 4 mpi tasks,
i get a wall time of 12m12s and a cpu time of 10m40s.
changing MKL to threaded MKL using 4 threads and 1 mpi task,
i get a wall time of 18m8s and a cpu time of 28m30s
(which means that roughly 40% of the time the code
was running multi-threaded BLAS/LAPACK).
with serial FFT, threaded MKL using 2 threads and 2 mpi tasks,
i get a wall time of 12m45s and a cpu time of 14m42s.

now when i swap the serial FFTW2 against the
intel MKL FFTW2 wrapper, i get with 2 threads and 2 MPI tasks
a wall time of 15m2s and a cpu time of 24m11s,
and with 4 threads and 1 MPI task i get
a wall time of 0h19m and a cpu time of 1h 2m.
finally, when disabling threading and with
4 MPI tasks, i get a 12m38s wall time and 11m14s cpu time.

obviously, switching to the intel fft didn't help.

your system with many states and only gamma point 
is definitely a case that benefits the most from
multi-threaded BLAS/LAPACK. 

i'm curious to learn how these numbers match up
with your performance measurements. 

cheers,
   axel.


EAMP> 
EAMP> Attached is my input.
EAMP> 
EAMP> 
EAMP> 
EAMP> 

-- 
===
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
===
If you make something idiot-proof, the universe creates a better idiot.


[Pw_forum] openmp vs mpich performance with MKL 10.x

2008-05-06 Thread Eduardo Ariel Menendez Proupin
> there are two issue that need to be considered.
>
> 1) how large are your test jobs? if they are not large enough, timings are
> pointless.


about 15 minutes on an Intel Quadcore. 66 atoms: Cd_30Te_30O_6. 576 electrons
in total.
My test may be very particular. If you have a balanced benchmark, I would
like to run it.


> 2) it is most likely, that you are still tricked by the
>   auto-parallelization of intel MKL. the export OMP_NUM_THREADS
>   will usually only work for the _local_ copy, for some
>   MPI startup mechanisms not at all. thus your MPI jobs will
>   be slowed down.

I am using only SMP. Sorry, I still don't have a cluster of Quadcores.

>
>
>   to make certain that you only link the serial version of
>   MKL with your MPI executable, please replace  -lmkl_em64t
>   in your make.sys file with
>   -lmkl_intel_lp64 -lmkl_sequential -lmkl_core


Yes, I also tried that. The test runs in 14m2s. Using only -lmkl_em64t it
runs in 14m31s. Using the serial compilation it ran in 12m20s.



Thanks,
Eduardo


[Pw_forum] openmp vs mpich performance with MKL 10.x

2008-05-06 Thread Axel Kohlmeyer
On Tue, 6 May 2008, Eduardo Ariel Menendez Proupin wrote:

EAMP> > there are two issue that need to be considered.
EAMP> >
EAMP> > 1) how large are your test jobs? if they are not large enough, timings are
EAMP> > pointless.


EAMP> about 15 minutes in Intel Quadcore. 66 atoms: Cd_30Te_30O_6. 576 
EAMP> electrons in total. My test may be very particular. If you a have 

hmmm... that is pretty large. would you mind sending me the input?
i'd like to run some verification tests on my machine (two-socket dual core).

EAMP> a balanced benchmark, I would like to run it.

i've only done these kinds of benchmarks systematically with CPMD,
and only a few confirmation tests with QE. in general the G-space
parallelization is comparable. while the individual performance for
a specific problem can be quite different (QE is far superior with
ultra-soft and k-points, CPMD outruns cp.x with norm-conserving
pseudos), the scaling behavior was always quite similar for small
to medium numbers of nodes.

EAMP> > 2) it is most likely, that you are still tricked by the
EAMP> >   auto-parallelization of intel MKL. the export OMP_NUM_THREADS
EAMP> >   will usually only work for the _local_ copy, for some
EAMP> >   MPI startup mechanisms not at all. thus your MPI jobs will
EAMP> >   be slowed down.
EAMP> 
EAMP> I am using only SMP. Sorry, I still haven't a cluster of Quadcores.

that still does not mean that the environment is exported. some
MPICH versions have pretty awkward ways of starting MPI jobs
and do not always forward the environment at all.
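
an easy way to check this is a tiny test of my own (not part of QE): let
every MPI rank report the environment it actually sees:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* if a rank reports "(not set)", the launcher did not forward the variable */
    const char *env = getenv("OMP_NUM_THREADS");
    printf("rank %d: OMP_NUM_THREADS=%s, omp_get_max_threads()=%d\n",
           rank, env ? env : "(not set)", omp_get_max_threads());

    MPI_Finalize();
    return 0;
}

if some ranks report "(not set)" or a different thread count than the local
one, the launcher is not forwarding the environment and MKL will pick its
own defaults.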

EAMP> >   to make certain that you only link the serial version of
EAMP> >   MKL with your MPI executable, please replace  -lmkl_em64t
EAMP> >   in your make.sys file with
EAMP> >   -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
EAMP> 
EAMP> 
EAMP> Yes, I also tried that. The test runs in 14m2s. Using only -lmkl_em64t it
EAMP> runs in 14m31s. Using serial compilations it ran in 12m20s.

you should also compare the parallel executable
run with -np 1 against the serial executable.

depending on your hardware (memory speed), and given that the
10.0 MKL has about a 20% speed improvement on recent cpus, that is
quite possible. since your problem is quite large, i guess that
a lot of time is spent in the libraries.

with a single quad-core cpu you also have the maximum amount of
memory contention when running 4 individual mpi tasks, whereas
using multi-threading may take better advantage of data locality
and reduce the load on the memory bus.

cheers,
axel.

EAMP> 
EAMP> 
EAMP> 
EAMP> Thanks,
EAMP> Eduardo
EAMP> 

-- 
===
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
===
If you make something idiot-proof, the universe creates a better idiot.


[Pw_forum] openmp vs mpich performance with MKL 10.x

2008-05-06 Thread Axel Kohlmeyer
On Tue, 6 May 2008, Nicola Marzari wrote:

NM> 
NM> Dear Eduardo,


hi nicola,

NM> 1) no improvements with the Intel fftw2 wrapper, as opposed to fftw2
NM> Q-E sources, when using mpi. I also never managed to successfully run
NM> with the Intel fftw3 wrapper (or with fftw3 - that probably says 
NM> something about me).

no it doesn't.


NM> 2) great improvements of a serial code (different from Q-E) when using
NM> the automatic parallelism of MKL in quad-cores.

nod. just yesterday i made some tests with different BLAS/LAPACK
implementations, and it turns out that the 10.0 MKL is pretty
efficient in parallelizing tasks like DGEMM. through use of
SSE2/3/4 and multi-threading you can easily get a factor of 6
improvement on a 4-core node (a quick sketch of such a test follows).
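
for reference, the kind of test i mean is nothing more than timing a big
dgemm with a varying number of MKL threads; a rough sketch (matrix size and
thread counts are arbitrary choices, not my actual benchmark):

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const int N = 4000;                       /* arbitrary benchmark size */
    double *a = (double *) mkl_malloc((size_t)N * N * sizeof(double), 64);
    double *b = (double *) mkl_malloc((size_t)N * N * sizeof(double), 64);
    double *c = (double *) mkl_malloc((size_t)N * N * sizeof(double), 64);
    for (long i = 0; i < (long)N * N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    for (int nt = 1; nt <= 4; nt *= 2) {
        mkl_set_num_threads(nt);              /* same knob MKL_NUM_THREADS sets */
        double t0 = dsecnd();                 /* MKL wall-clock timer */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, a, N, b, N, 0.0, c, N);
        printf("%d thread(s): %.2f s\n", nt, dsecnd() - t0);
    }

    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}

link against the threaded MKL (e.g. -lmkl_intel_lp64 -lmkl_intel_thread
-lmkl_core -liomp5) to see the multi-threaded numbers.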

NM> 
NM> 3) btw, MPICH has always been for us the slower protocol, compared with
NM> LAMMPI or OpenMPI
NM> 
NM> I actually wonder if the best solution on a quad-core would be, say,
NM> to use two cores for MPI, and the other two for the openmp threads.

this is a _very_ tricky issue. usually, for plane-wave pseudopotential
codes, the distributed-data parallelization is pretty efficient, except
for the 3d-fourier transforms across the whole data set, which are very
sensitive to network latencies. for jobs using k-points, you also have
the option to parallelize over k-points, which is very efficient, even on
not so fast networks. with the CVS versions, you have another level of
parallelism added (parallelization over function instead of data = task
groups). thus, given an ideal network, you first want to exploit MPI
parallelism maximally, and then what is left is rather small, and - sadly
- OpenMP doesn't work very efficiently on that. the overhead of
spawning, synchronizing and joining threads is too high compared to
the gain through parallelism.

but we live in the real world and there are unexpected side effects and
non-ideal machines and networks. e.g. when using nodes with many cores,
e.g. two-socket quad-core, you have to "squeeze" a lot of communication
through just one network card (be it infiniband, myrinet or ethernet),
which will serialize communication and add unwanted conflicts and latencies.
i've seen this happen particularly when using a very large number
of nodes, where you can run out of (physical) memory simply because
of the way the low-level communication was programmed.

in that case you may indeed be better off using only half or a
quarter of the cores with MPI and then setting OMP_NUM_THREADS to 2,
or even keeping it at 1 (because that will, provided you have an
MPI with processor affinity and optimal job placement, double the
cpu cache available per task). this perspective is particularly
interesting for multi-core nodes connected by a high-latency
TCP/IP network (e.g. gigabit ethernet). here, with one MPI task per
node, you reach the limit of scaling pretty fast, and using
multiple MPI tasks per node mostly multiplies the latencies,
which is not helping. under those circumstances the data set is still
rather large and then OpenMP parallelism can help to get the most
out of a given machine. as noted before, it would be _even_ better
if OpenMP directives were added to time-critical and multi-threadable
parts of QE. i have experienced this in CPMD, where i managed to
get about 80% of the MPI performance with the latest (extensively
threaded) development sources and a fully multi-threaded toolchain
on a single node. however, running across multiple nodes quickly
reduces the effectiveness of the OpenMP support. with just two nodes
you are down to only 60%.

now, deciding on the best combination of options is a
very tricky multi-dimensional optimization problem. you have to
consider the following:

- the size of the typical problem and job type
- whether you can benefit from k-point parallelism
- whether you prefer faster execution over cost 
  efficiency and throughput.
- the total amount of money you want to spend
- the skillset of people that have to run the machine
- how many people have to share the machine.
- how I/O bound the jobs are.
- how much memory you need and how much money you 
  are willing to invest in faster memory.
- failure rates and the level of service
  (gigabit equipment is easily available).

also some of those parameters are (non-linearly) coupled
which makes the decision making process even nastier.

cheers,
   axel.

NM> 
NM> I eagerly await Axel's opinion.
NM> 
NM> nicola
NM> 
NM> Eduardo Ariel Menendez Proupin wrote:
NM> > Hi,
NM> > I have noted recently that I am able to obtain faster binaries of pw.x 
NM> > using the the OpenMP paralellism implemented in the Intel MKL libraries 
NM> > of version 10.xxx, than using MPICH, in the Intel cpus. Previously I had 
NM> > always gotten better performance using MPI. I would like to know of 
NM> > other experience on how to make the machines faster. Let me explain in 
NM> > more details.
NM> > 
NM> > Compiling using MPI means using mpif90 as link

[Pw_forum] openmp vs mpich performance with MKL 10.x

2008-05-06 Thread Nicola Marzari


Dear Eduardo,


our own experiences are summarized here:
http://quasiamore.mit.edu/pmwiki/index.php?n=Main.CP90Timings

It would be great if you could contribute your own data, either for
pw.x or cp.x under the conditions you describe.

I noticed indeed, informally, a few of the things you mention:

1) no improvements with the Intel fftw2 wrapper, as opposed to fftw2
Q-E sources, when using mpi. I also never managed to successfully run
with the Intel fftw3 wrapper (or with fftw3 - that probably says 
something about me).

2) great improvements of a serial code (different from Q-E) when using
the automatic parallelism of MKL in quad-cores.

3) btw, MPICH has always been for us the slower protocol, compared with
LAMMPI or OpenMPI

I actually wonder if the best solution on a quad-core would be, say,
to use two cores for MPI, and the other two for the openmp threads.

I eagerly await Axel's opinion.

nicola

Eduardo Ariel Menendez Proupin wrote:
> Hi,
> I have noted recently that I am able to obtain faster binaries of pw.x 
> using the OpenMP parallelism implemented in the Intel MKL libraries 
> of version 10.xxx than using MPICH, on Intel cpus. Previously I had 
> always gotten better performance using MPI. I would like to know of 
> other experience on how to make the machines faster. Let me explain in 
> more details.
> 
> Compiling using MPI means using mpif90 as linker and compiler, linking 
> against mkl_ia32 or mkl_em64t, and using link flags -i-static -openmp. 
> This is just what appears in the make.sys after running configure 
> in version 4cvs,
> 
> At runtime, I set
> export OMP_NUM_THREADS=1
> export MKL_NUM_THREADS=1
> and run using
> mpiexec -n $NCPUs pw.x output
> where NCPUs  is the number of cores available in the system.
> 
> The second choice is
> ./configure --disable-parallel
> 
> and at runtime
> export OMP_NUM_THREADS=$NCPU
> export MKL_NUM_THREADS=$NCPU
> and run using 
> pw.x output
> 
> I have tested it in Quadcores (NCPU=4) and with an old Dual Xeon B.C. 
> (before cores) (NCPU=2).
> 
> Before April 2007, the first choice had always worked faster. After 
> that, when I came to use the MKL 10.xxx, the second choice is working 
> faster. I have found no significant difference between version 3.2.3 and 
> 4cvs.
> 
> A special comment is for the FFT library. MKL has a wrapper for 
> FFTW, that must be compiled after installation (it is very easy). This 
> creates additional libraries named libfftw3xf_intel.a and 
> libfftw2xf_intel.a.
> This improves the performance in the second choice, especially 
> with libfftw3xf_intel.a.
> 
> Using MPI, libfftw2xf_intel.a is as fast as using the FFTW source 
> distributed with espresso, i.e., there is no gain in using 
> libfftw2xf_intel.a. With  libfftw3xf_intel.a and MPI, I have never been 
> able to run pw.x successfully; it just aborts.
> 
> I would like to hear of your experiences.
>  
> Best regards
> Eduardo Menendez
> University of Chile
> 
> 
> 
> 
> ___
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum


-- 
-
Prof Nicola Marzari   Department of Materials Science and Engineering
13-5066   MIT   77 Massachusetts Avenue   Cambridge MA 02139-4307 USA
tel 617.4522758 fax 2586534 marzari at mit.edu http://quasiamore.mit.edu


[Pw_forum] openmp vs mpich performance with MKL 10.x

2008-05-06 Thread Eduardo Ariel Menendez Proupin
Hi,
I have noted recently that I am able to obtain faster binaries of pw.x using
the OpenMP parallelism implemented in the Intel MKL libraries of version
10.xxx than using MPICH, on Intel cpus. Previously I had always gotten
better performance using MPI. I would like to know of other experiences on
how to make the machines faster. Let me explain in more detail.

Compiling using MPI means using mpif90 as linker and compiler, linking
against mkl_ia32 or mkl_em64t, and using the link flags -i-static -openmp.
This is just what appears in the make.sys after running configure in version
4cvs.

At runtime, I set
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
and run using
mpiexec -n $NCPUs pw.x < input > output
where NCPUs is the number of cores available in the system.

The second choice is
./configure --disable-parallel

and at runtime
export OMP_NUM_THREADS=$NCPU
export MKL_NUM_THREADS=$NCPU
and run using
pw.x < input > output

I have tested it in Quadcores (NCPU=4) and with an old Dual Xeon B.C.
(before cores) (NCPU=2).

Before April 2007, the first choice had always worked faster. After that,
when I came to use the MKL 10.xxx, the second choice is working faster. I
have found no significant difference between version 3.2.3 and 4cvs.

A special comment is for the FFT library. MKL has a wrapper for FFTW
that must be compiled after installation (it is very easy). This creates
additional libraries named libfftw3xf_intel.a and libfftw2xf_intel.a.
This improves the performance in the second choice, especially with
libfftw3xf_intel.a.

Using MPI, libfftw2xf_intel.a is as fast as using the FFTW source
distributed with espresso, i.e., there is no gain in using
libfftw2xf_intel.a. With libfftw3xf_intel.a and MPI, I have never been able
to run pw.x successfully; it just aborts.

I would like to hear of your experiences.

Best regards
Eduardo Menendez
University of Chile


[Pw_forum] openmp vs mpich performance with MKL 10.x

2008-05-06 Thread Axel Kohlmeyer
On Tue, 6 May 2008, Eduardo Ariel Menendez Proupin wrote:

EAMP> Hi,
EAMP> I have noted recently that I am able to obtain faster binaries of pw.x using
EAMP> the OpenMP parallelism implemented in the Intel MKL libraries of version
EAMP> 10.xxx than using MPICH, on Intel cpus. Previously I had always gotten
EAMP> better performance using MPI. I would like to know of other experiences on
EAMP> how to make the machines faster. Let me explain in more detail.
EAMP> 
EAMP> Compiling using MPI means using mpif90 as linker and compiler, linking
EAMP> against mkl_ia32 or mkl_em64t, and using link flags -i-static -openmp. This
EAMP> is just what appears in the make.sys after running configure in version
EAMP> 4cvs.
EAMP> 
EAMP> At runtime, I set
EAMP> export OMP_NUM_THREADS=1
EAMP> export MKL_NUM_THREADS=1
EAMP> and run using
EAMP> mpiexec -n $NCPUs pw.x < input > output
EAMP> where NCPUs is the number of cores available in the system.




EAMP> 
EAMP> The second choice is
EAMP> ./configure --disable-parallel
EAMP> 
EAMP> and at runtime
EAMP> export OMP_NUM_THREADS=$NCPU
EAMP> export MKL_NUM_THREADS=$NCPU
EAMP> and run using
EAMP> pw.x < input > output
EAMP> 
EAMP> I have tested it in Quadcores (NCPU=4) and with an old Dual Xeon B.C.
EAMP> (before cores) (NCPU=2).
EAMP> 
EAMP> Before April 2007, the first choice had always worked faster. After that,
EAMP> when I came to use the MKL 10.xxx, the second choice is working faster. I
EAMP> have found no significant difference between version 3.2.3 and 4cvs.
EAMP> 
EAMP> A special comment is for the FFT library. MKL has a wrapper for FFTW
EAMP> that must be compiled after installation (it is very easy). This creates
EAMP> additional libraries named libfftw3xf_intel.a and libfftw2xf_intel.a.
EAMP> This improves the performance in the second choice, especially with
EAMP> libfftw3xf_intel.a.
EAMP> 
EAMP> Using MPI, libfftw2xf_intel.a is as fast as using the FFTW source
EAMP> distributed with espresso, i.e., there is no gain in using
EAMP> libfftw2xf_intel.a. With libfftw3xf_intel.a and MPI, I have never been able
EAMP> to run pw.x successfully; it just aborts.
EAMP> 
EAMP> I would like to hear of your experiences.

eduardo,

there are two issues that need to be considered.

1) how large are your test jobs? if they are not large enough, timings are 
pointless.

2) it is most likely that you are still tricked by the 
   auto-parallelization of intel MKL. the exported OMP_NUM_THREADS
   will usually only reach the _local_ copy, and for some
   MPI startup mechanisms not at all. thus your MPI jobs will
   be slowed down.

   to make certain that you only link the serial version of
   MKL with your MPI executable, please replace  -lmkl_em64t
   in your make.sys file with 
   -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
   you may have to add:
   -Wl,-rpath,/opt/intel/path/to/your/mkl
   to make your executable find the libraries at runtime.

   with that executable you can try again, and i would be
   _very_ surprised if using MPI is slower than serial and
   multi-threading. i made tests with intel FFT vs. FFTW
   in a number of plane wave codes and the intel FFT was 
   always slower.

cheers,
axel.



EAMP> 
EAMP> Best regards
EAMP> Eduardo Menendez
EAMP> University of Chile
EAMP> 

-- 
===
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
===
If you make something idiot-proof, the universe creates a better idiot.