[Pw_forum] Problem with QE 4.2.1 and AMD Opteron 6200 / 6300

2013-12-09 Thread Fabricio Cannini
Hi All


I'm facing a very strange situation with the above version of QE, 
compiled with :

- intel 12.1.3
- mkl 10.0,
- fftw 3.2.2 ( OS package )
- openmpi 1.4.5

Running the 'prace-medium' benchmark as a test:
http://qe-forge.org/gf/project/q-e/frs/?action=FrsReleaseView&release_id=47

The OS is CentOS 5.6 x86-64.

And the results were :

- Intel Xeon E5430 2.66GHz / 8 cores / 16 GB RAM = 00h:38m:21s

- Intel Core i7-2600 @ 3.40GHz / 8 cores / 16 GB = 00h:19m:40s

- AMD Opteron Processor 6276 / 8 cores / 256 GB = more than 8h (process killed)



Then I tried another build:

- pgi 12.5
- acml 5.1.0 64
- fftw 3.2.2 ( OS package )
- openmpi 1.4.5

And the results were even worse. None of the machines above was able 
to finish the test in *24h*.




My third attempt was the following :

- intel 13.2
- mkl 11.0
- fftw 3.3.3 ( OS package )
- openmpi 1.6.5

- OS = Ubuntu 12.04 LTS

- AMD Opteron 6380 / 8 cores / 64 GB RAM

- Same "prace medium benchmark" test input.


Result : Also didn't finish in *24h* .




I was suspicious of the Intel compiler, so I set up a 4th test:

- gfortran 4.6 ( OS package )
- openblas 0.2.8 ( compiled with gcc 4.6 )
- fftw 3.3.3 ( OS package )
- openmpi 1.4.3 ( OS package )


Same machine as the third test, and the result was the same too, except 
that the binary compiled with gfortran used *much more* memory, going 
as far as 15 GB into swap before I killed the process (it took some 
30 min to reach this point).

It should be noted that when running the 'small size' benchmark on the 
Opteron 6380 machine, the gfortran/openblas binary is faster than the 
intel 13.2/mkl binary (by up to a minute, comparing the 3rd and 4th 
tests).


Do you have any clue what could be happening?

Should I attach the 'make.sys' files to another message or paste them 
somewhere?

TIA
Fabricio


[Pw_forum] Problem with QE 4.2.1 and AMD Opteron 6200 / 6300

2013-12-10 Thread Paolo Giannozzi
First of all: why do you want to use an old QE version? P.


-- 
 Paolo Giannozzi, Dept. Chemistry&Physics&Environment, 
 Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
 Phone +39-0432-558216, fax +39-0432-558222 



[Pw_forum] Problem with QE 4.2.1 and AMD Opteron 6200 / 6300

2013-12-10 Thread Fabricio Cannini
On 10-12-2013 06:01, Paolo Giannozzi wrote:
> First of all: why do you want to use an old QE version? P.

Because the problem was reported by a client of mine, and he is using 
this very version, so I'm trying to reproduce it as exactly as possible.
Also, I'm a sysadmin, not a scientist. ;)

Should I re-run the tests with the newest QE?


TIA,
Fabricio


[Pw_forum] Problem with QE 4.2.1 and AMD Opteron 6200 / 6300

2013-12-10 Thread Paolo Giannozzi
First of all you should verify if multi-threading libraries 
are conflicting with MPI parallelization. 

P.
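
One way to sanity-check this point is to compare the number of MPI
processes times the threads each one may start against the cores
available. The sketch below is only an illustration: the variable names
are the ones commonly read by OpenMP, MKL and OpenBLAS, which of them
matters depends on the library actually linked, and the rank count of 8
is taken from the tests in this thread.

```python
import os

# Environment variables commonly read by threaded math libraries.
# Which one applies depends on the library actually linked (assumption).
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def threads_per_rank(env):
    """Return the largest thread count requested in env, or None if unset."""
    counts = [int(env[v]) for v in THREAD_VARS if env.get(v, "").isdigit()]
    return max(counts) if counts else None


def is_oversubscribed(mpi_ranks, cores, env):
    """True if ranks * threads exceeds the available cores.

    If no variable is set, a threaded library may default to one thread
    per core, so that case is treated as the worst case (threads = cores).
    """
    threads = threads_per_rank(env)
    if threads is None:
        threads = cores
    return mpi_ranks * threads > cores


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    ranks = 8  # number of MPI processes used in the tests in this thread
    print("threads per rank:", threads_per_rank(os.environ))
    print("oversubscribed:", is_oversubscribed(ranks, cores, os.environ))
```

Run it in the same shell that launches mpirun, so that it sees the same
environment as the QE processes.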




[Pw_forum] Problem with QE 4.2.1 and AMD Opteron 6200 / 6300

2013-12-10 Thread Fabricio Cannini
On 10-12-2013 18:34, Paolo Giannozzi wrote:
> First of all you should verify if multi-threading libraries
> are conflicting with MPI parallelization.

Yes, I already looked into it.
I can send you the 'make.sys' files of all the builds if it helps.

TIA,
Fabricio


[Pw_forum] Problem with QE 4.2.1 and AMD Opteron 6200 / 6300

2013-12-11 Thread Fabricio Cannini
On 11-12-2013 15:51, Paolo Giannozzi wrote:
> On Tue, 2013-12-10 at 19:49 -0200, Fabricio Cannini wrote:
>> On 10-12-2013 18:34, Paolo Giannozzi wrote:
>>> First of all you should verify if multi-threading libraries
>>> are conflicting with MPI parallelization.
>>
>> Yes, i did look into it already.


Hi there


So, what else can I look into?
I ran more tests on the same Opteron 6380 machine with the same 
binaries, this time using the "DEISA medium benchmark", and the results 
were interesting.

http://qe-forge.org/gf/project/q-e/frs/?action=FrsReleaseView&release_id=45


ifort 13.2 + mkl 11.0 / 8 cores = 1h8m
gfortran 4.6 + openblas 0.2.8 / 8 cores = 46m57.62s



This makes me even more suspicious that the Intel compiler is the problem.

TIA,
Fabricio


[Pw_forum] Problem with QE 4.2.1 and AMD Opteron 6200 / 6300

2013-12-12 Thread Ivan Girotto
Dear Fabricio,

I reckon there is some inconsistency in the results you are obtaining.
The AMD 6380 is a 16-core model. I'm wondering how you map the 
process affinity while running on 8 cores.
Without controlling that mapping you can get substantial performance 
variation from one execution to the next.
Indeed, cache memory and FPU are shared among a given set of cores, and 
increasing concurrency on shared resources goes along with a 
degradation in performance.
The AMD 6380 also supports the AVX instruction set extension for 256-bit 
vector operations. Does your OS support that too?
Compile a simple source with -mavx and see whether you can run it. Or 
check whether the "avx" flag is present in your /proc/cpuinfo.
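
The /proc/cpuinfo check can be scripted. This is a minimal sketch that
assumes the usual Linux layout, where each processor entry carries a
"flags" line listing the features the kernel reports; the function names
are made up for the example.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx(cpuinfo_text):
    """True only for the exact 'avx' flag, not for e.g. 'avx2'."""
    return "avx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc not available
    print("avx reported by kernel:", has_avx(text))
```

A missing flag on a CPU that is known to support AVX would point at the
kernel rather than at the compiler or the math library.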

In my experience benchmarking QE on the same CPU system, the 
combination of the Intel compiler + MKL has always turned out to be the 
best option.

Regards,

Ivan




[Pw_forum] Problem with QE 4.2.1 and AMD Opteron 6200 / 6300

2013-12-17 Thread Fabricio Cannini
On 12-12-2013 08:20, Ivan Girotto wrote:
> Dear Fabricio,
>
> I reckon there is some inconsistency in the results you are obtaining.
> The AMD 6380 is a 16-core model. I'm wondering how you map the
> process affinity while running on 8 cores.
> Without controlling that mapping you can get substantial performance
> variation from one execution to the next.
> Indeed, cache memory and FPU are shared among a given set of cores, and
> increasing concurrency on shared resources goes along with a
> degradation in performance.

I tried using 'taskset' and 'hwloc-bind', and the results were *worse* 
than without either of them.
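
For reference, the mapping under discussion (one process per set of
cores sharing an FPU) could be sketched as below. The pairing of cores
(0,1), (2,3), ... is purely an assumption for illustration; the real
numbering has to be read from the machine, for example with hwloc's
lstopo. The helper names are invented for the example.

```python
import os


def one_cpu_per_group(sibling_groups, count):
    """Pick the lowest-numbered CPU from each group until `count` are chosen.

    sibling_groups: iterable of CPU-id collections that share resources.
    Raises ValueError if there are fewer groups than requested CPUs.
    """
    firsts = sorted({min(group) for group in sibling_groups})
    if len(firsts) < count:
        raise ValueError("not enough independent groups for %d CPUs" % count)
    return firsts[:count]


def bind_self(cpus):
    """Restrict the calling process to `cpus` (Linux only)."""
    os.sched_setaffinity(0, set(cpus))  # pid 0 means the calling process
    return sorted(os.sched_getaffinity(0))


# Hypothetical layout: 16 cores where (0,1), (2,3), ... share an FPU.
ASSUMED_GROUPS = [(i, i + 1) for i in range(0, 16, 2)]

if __name__ == "__main__":
    print("one CPU per assumed pair:", one_cpu_per_group(ASSUMED_GROUPS, 8))
```

Note that an affinity mask is inherited by child processes, so a mask
applied to the launcher is shared by every MPI process it starts; each
process needs its own CPU for a mapping like this to help.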

binary 1 :
intel 13.2 + mkl 11.0 + openmpi 1.6.5 / 8 cores

binary 2:
gfortran 4.6 + openblas 0.2.8 + openmpi 1.6.5 / 8 cores


binary 1 time = 1h8m
binary 2 time = 46m57.62s

binary 1 with hwloc = 1h22m
binary 2 with hwloc = 56m40.50s

binary 1 with taskset = 2h27m
binary 2 with taskset = 1h48m



As I understand it, both are adding overhead to the execution, which 
I'm sure is not what either of them is meant to do.
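
To put numbers on it, the timings above can be turned into slowdown
ratios. The sketch below only does the arithmetic on the figures already
listed; the dictionary layout and function names are made up for the
example.

```python
import re


def to_seconds(text):
    """Convert strings like '1h8m' or '46m57.62s' to seconds."""
    units = {"h": 3600.0, "m": 60.0, "s": 1.0}
    parts = re.findall(r"(\d+(?:\.\d+)?)([hms])", text)
    if not parts:
        raise ValueError("unrecognised duration: %r" % text)
    return sum(float(value) * units[unit] for value, unit in parts)


# Timings copied from the list above.
RUNS = {
    "binary 1": {"plain": "1h8m", "hwloc": "1h22m", "taskset": "2h27m"},
    "binary 2": {"plain": "46m57.62s", "hwloc": "56m40.50s", "taskset": "1h48m"},
}


def slowdown(binary, mode):
    """Ratio of the pinned run time to the unpinned run time."""
    times = RUNS[binary]
    return to_seconds(times[mode]) / to_seconds(times["plain"])


if __name__ == "__main__":
    for binary in RUNS:
        for mode in ("hwloc", "taskset"):
            print("%s with %s: %.2fx" % (binary, mode, slowdown(binary, mode)))
```

Both binaries slow down by about the same factor: roughly 1.2x with
hwloc and between 2.2x and 2.3x with taskset.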

> The AMD 6380 also supports the AVX instruction set extension for 256-bit
> vector operations. Does your OS support that too?
> Compile a simple source with -mavx and see whether you can run it. Or
> check whether the "avx" flag is present in your /proc/cpuinfo.

I tried that too, but it still wasn't enough to make binary 1 faster.


I'm not sure what to make of these results. Any clues?



TIA,
Fabricio




[Pw_forum] Problem with QE 4.2.1 and AMD Opteron 6200 / 6300

2014-01-13 Thread Fabricio Cannini
On 12-12-2013 08:20, Ivan Girotto wrote:

Hi there

I've finally figured out the problem.

First, I changed the benchmark to DEISA Medium so that I could use a 
wider range of machines, as it (predictably) used less than half of 
the memory that the PRACE medium benchmark demanded. Then I started to 
get results.

http://qe-forge.org/gf/project/q-e/frs/?action=FrsReleaseView&release_id=45


Second, I noticed that no matter what, the machines running Ubuntu 12.04 
were faster than the machines running Debian 6.0, and that this was 
caused by a series of factors. In order:

- OS runtime :
http://www.eglibc.org/cgi-bin/viewvc.cgi/branches/eglibc-2_15/libc/NEWS?view=markup
( "Lots of generic, 64-bit, and x86-64-specific performance 
optimizations to math functions." )


- Math Library :
The MKL (10.0) and ACML (5.1.0) versions were old. OpenBLAS gave much 
better results on both Debian and Ubuntu, and better still on Ubuntu.


- Compiler :
This is why OpenBLAS ran even better on Ubuntu:
http://gcc.gnu.org/gcc-4.6/changes.html
( "Support for AMD Bulldozer (family 15) processors is now available 
through the -march=bdver1 and -mtune=bdver1 options" )

The machines were running old versions of Intel ifort (11.1), which 
could not be expected to perform well on a much newer CPU.
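
A quick way to compare a machine against the two thresholds quoted above
(eglibc 2.15 for the math function optimizations, gcc 4.6 for
-march=bdver1) is a numeric version comparison. This is a minimal
sketch; the helper names are invented, and platform.libc_ver() reports
the C library of the running Python binary, which is assumed here to be
the system one.

```python
import platform
import re


def version_tuple(text):
    """Turn '4.6.3' or '2.15' into a tuple of ints, ignoring trailing text."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if not match:
        raise ValueError("no version number in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def at_least(version, minimum):
    """Compare dotted versions numerically, so '4.10' is newer than '4.6'."""
    a, b = version_tuple(version), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so '4.6' equals '4.6.0'
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    # libc_ver() returns empty strings if it cannot identify the C library.
    lib, ver = platform.libc_ver()
    if ver:
        print("%s %s, >= 2.15: %s" % (lib, ver, at_least(ver, "2.15")))
    else:
        print("C library version not detected")
```

The same at_least() helper can be fed the version string printed by
'gcc -dumpversion' to check the compiler against 4.6.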


Once again, Thanks for your time.
Fabricio