Thanks Lorenzo, I hope so too. I think the best references are Examples 4 and 
10. I have this tendency to just push ahead once I get something working; I 
need to work on that :P


    Indeed, I have reproduced almost exactly what you described. Here is what 
I can confirm when using bp_c_phase (no electric field):


- all gdir values work, but only gdir=3 shows a notable improvement in performance.

- when gdir=3, scaling is good up to 4 processors; on 8 it is terrible, the run 
actually takes longer, and WALL time is notably larger than CPU time.

- the call to 'CALL mp_sum(aux_g(:), intra_bgrp_comm )' is made when gdir != 3.


My current understanding is that mp_sum performs an element-wise sum of 'aux_g' 
across the processors in intra_bgrp_comm (an MPI reduction), whereas for gdir=3 
significantly less code is executed in building the matrix 'aux', which is 
finally used to build 'mat'. The matrix 'evc' represents the wavefunctions in 
the plane-wave basis; 'evc' is used in many files, but since bp_c_phase is 
executed last, 'evc' has already been built and is only read in this file. With 
this, and comparing the output, I notice that performance when gdir=3 is better 
for almost all routines. I will continue debugging tomorrow on the 8-processor 
machine, where the differences are much more noticeable. Do you think I should 
contact Paolo Giannozzi directly to better understand what is going on here?
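(For reference, here is a plain-Python sketch of the semantics of mp_sum as I 
understand them; this is illustrative only and not QE code. Each process holds 
a partial array, and after the call every process holds the element-wise sum 
over all processes, which is why the call is a synchronization point and can 
dominate WALL time.)

```python
# Illustrative sketch (NOT Quantum ESPRESSO code): mp_sum behaves like an
# MPI allreduce with MPI_SUM. Lists stand in for MPI ranks here.

def mp_sum_semantics(per_process_arrays):
    """Element-wise sum across 'ranks'; every rank gets the reduced array."""
    n = len(per_process_arrays[0])
    total = [0.0] * n
    for arr in per_process_arrays:
        for i, v in enumerate(arr):
            total[i] += v
    # after the reduction, every rank holds the same summed array
    return [list(total) for _ in per_process_arrays]

# Example: two "ranks", each holding a partial aux_g
ranks = [[1.0, 2.0], [3.0, 4.0]]
print(mp_sum_semantics(ranks))  # [[4.0, 6.0], [4.0, 6.0]]
```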


Thanks so much,

Louis

________________________________
From: pw_forum-boun...@pwscf.org <pw_forum-boun...@pwscf.org> on behalf of 
Lorenzo Paulatto <lorenzo.paula...@impmc.upmc.fr>
Sent: 13 February 2017 13:04:22
To: PWSCF Forum
Subject: Re: [Pw_forum] PW.x homogeneous electric field berry phase calculation 
in trigonal cell

On Monday, February 13, 2017 11:43:08 AM CET Louis Fry-Bouriaux wrote:
> Finally, when you were talking about the bottleneck, I suppose you were
> talking about the efield code. My impression so far is that this is not a
> problem using 4 processors; I will also test using 8 and compare the time
> taken. I have no idea how fast it 'should' be with proper parallelisation,
> assuming it is possible to parallelise.

When you increase the number of CPUs, you would expect the time to decrease
linearly. If, over a certain number of CPUs, it stops decreasing, or if it
decreases more slowly than linearly, that is a bottleneck. This will always
happen eventually, but with berry/lelfield it happens much sooner.
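[Editorial sketch, not from the original thread: the behaviour Lorenzo
describes follows Amdahl's law. If a fraction s of the run is serial or
communication-bound (such as the mp_sum reduction discussed above), the
speedup on n CPUs is capped at 1 / (s + (1 - s)/n); the serial fraction here
is an assumed example value, not measured from PW.x.]

```python
# Amdahl's-law sketch: why speedup stops growing with more CPUs.
# serial_fraction is a hypothetical example value, not a measurement.

def speedup(n_cpus, serial_fraction):
    """Ideal speedup bound for a run with a fixed serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

for n in (1, 2, 4, 8):
    # with 20% serial work, 8 CPUs give only ~3.3x, far from 8x
    print(n, round(speedup(n, 0.2), 2))
```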

Thank you for reporting back! I hope this information will be useful to future
users.

--
Dr. Lorenzo Paulatto
IdR @ IMPMC -- CNRS & Université Paris 6
phone: +33 (0)1 442 79822 / skype: paulatz
www:   http://www-int.impmc.upmc.fr/~paulatto/
mail:  23-24/423 Boîte courrier 115, 4 place Jussieu 75252 Paris Cédex 05

_______________________________________________
Pw_forum mailing list
Pw_forum@pwscf.org
http://pwscf.org/mailman/listinfo/pw_forum
