Shared memory can be used ONLY if you have ONE machine with multiple cores (historically, this option was used, e.g., on an SGI Origin machine with 64 shared-memory CPUs).

So if you have just ONE 4-core PC, you can use "shared memory"; but when you want to couple TWO different PCs, you cannot. Please read the requirements for k-point parallelization in the UG.
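For illustration, these switches live in $WIENROOT/parallel_options. A minimal sketch of the two setups (the file generated by siteconfig may contain additional variables):

    # shared-memory setup: all k-parallel jobs are spawned locally
    # without ssh -- this works on ONE multi-core machine only
    setenv USE_REMOTE 0
    setenv MPI_REMOTE 0

    # distributed setup: jobs are started on the nodes listed in
    # .machines via ssh, which must work WITHOUT a password
    setenv USE_REMOTE 1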
Kakhaber Jandieri wrote:
>
>> No, there was no change!
>>
>> Did you set "shared memory" ?? This would also explain why everything
>> runs on one machine ??
>
> Yes, I set "shared memory" for both versions of WIEN2k, and accordingly
> I have setenv USE_REMOTE 0 in parallel_options.
>
>>
>> Kakhaber Jandieri wrote:
>>> Dear Prof. Blaha,
>>>
>>>> I do NOT believe that k-point parallelization with an older WIEN2k
>>>> was possible (unless you set it up with "rsh" instead of "ssh" and
>>>> defined a .rhosts file).
>>>
>>> But it really is possible. I checked again. I even reinstalled
>>> WIEN2k_08.1 to verify that the options are the same as those used
>>> for WIEN2k_09.1. I did not set "rsh" and did not define a .rhosts
>>> file. The behaviour of WIEN2k_08.1 is the same (as I described in
>>> my previous letters). According to the dayfile, k-points are
>>> distributed among all reserved nodes. Here is a little fragment of
>>> the dayfile:
>>>
>>> Calculating GaAsB in /home/kakhaber/wien_work/GaAsB
>>> on node112 with PID 10597
>>>
>>>     start (Wed Jun 16 09:50:23 CEST 2010) with lapw0 (40/99 to go)
>>>
>>>     cycle 1 (Wed Jun 16 09:50:23 CEST 2010) (40/99 to go)
>>>
>>> >   lapw0 -p    (09:50:23) starting parallel lapw0 at Wed Jun 16 09:50:23 CEST 2010
>>> --------
>>> running lapw0 in single mode
>>> 77.496u 0.628s 1:18.47 99.5% 0+0k 0+7008io 0pf+0w
>>> >   lapw1 -c -p (09:51:42) starting parallel lapw1 at Wed Jun 16 09:51:42 CEST 2010
>>> ->  starting parallel LAPW1 jobs at Wed Jun 16 09:51:42 CEST 2010
>>> running LAPW1 in parallel mode (using .machines)
>>> 4 number_of_parallel_jobs
>>>      node112(1) 2091.6u 2.3s 37:18.76 93.5% 0+0k 0+205296io 0pf+0w
>>>      node105(1) 2024.2u 2.3s 34:26.54 98.0% 0+0k 0+198376io 0pf+0w
>>>      node122(1) 2115.4u 5.1s 36:08.08 97.8% 0+0k 0+197808io 0pf+0w
>>>      node131(1) 2041.3u 2.6s 35:19.70 96.4% 0+0k 0+202912io 0pf+0w
>>>    Summary of lapw1para:
>>>    node112    k=1    user=2091.6    wallclock=2238.76
>>>    node105    k=1    user=2024.2    wallclock=2066.54
>>>    node122    k=1    user=2115.4    wallclock=2168.08
>>>    node131    k=1    user=2041.3    wallclock=2119.7
>>> 8274.113u 15.744s 37:20.53 369.9% 0+0k 8+805440io 0pf+0w
>>> >   lapw2 -c -p (10:29:02) running LAPW2 in parallel mode
>>>       node112 87.4u 0.5s 1:37.85 89.9% 0+0k 0+8104io 0pf+0w
>>>       node105 86.8u 0.9s 1:30.90 96.5% 0+0k 198064+8096io 0pf+0w
>>>       node122 84.7u 0.6s 1:27.71 97.3% 0+0k 0+8088io 0pf+0w
>>>       node131 87.9u 1.0s 1:31.00 97.7% 0+0k 0+8088io 0pf+0w
>>>    Summary of lapw2para:
>>>    node112    user=87.4    wallclock=97.85
>>>    node105    user=86.8    wallclock=90.9
>>>    node122    user=84.7    wallclock=87.71
>>>    node131    user=87.9    wallclock=91
>>> 349.001u 3.592s 1:41.96 345.8% 0+0k 204504+42240io 0pf+0w
>>> >   lcore (10:30:44) 0.176u 0.060s 0:01.05 21.9% 0+0k 0+5336io 0pf+0w
>>> >   mixer (10:30:46) 1.436u 0.168s 0:01.99 79.8% 0+0k 0+11920io 0pf+0w
>>> :ENERGY convergence:  0 0.001 0
>>> :CHARGE convergence:  0 0.001 0
>>> ec cc and fc_conv 0 0 0
>>>
>>> In spite of that, when I log in to the nodes, I see the following:
>>>
>>> node112:~> nice top -c -u kakhaber    (this is the master node)
>>>
>>> top - 10:58:42 up 116 days, 23:19, 1 user, load average: 8.01, 7.77, 7.33
>>> Tasks: 110 total, 10 running, 100 sleeping, 0 stopped, 0 zombie
>>> Cpu(s): 90.5%us, 0.3%sy, 9.1%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>> Mem: 16542480k total, 16144412k used, 398068k free, 105896k buffers
>>> Swap: 4000144k total, 18144k used, 3982000k free, 11430460k cached
>>>
>>>   PID USER      PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>>> 14474 kakhaber  20  0  937m 918m 2020 R  100  5.7 26:06.48 lapw1c lapw1_4.def
>>> 14458 kakhaber  20  0  926m 907m 2020 R   98  5.6 26:01.62 lapw1c lapw1_3.def
>>> 14443 kakhaber  20  0  934m 915m 2028 R   98  5.7 25:46.37 lapw1c lapw1_2.def
>>> 14428 kakhaber  20  0  936m 917m 2028 R   66  5.7 24:26.43 lapw1c lapw1_1.def
>>>  5952 kakhaber  20  0 13724 1360  820 S    0  0.0  0:00.00 /bin/tcsh /var/spoo
>>>  6077 kakhaber  20  0  3920  736  540 S    0  0.0  0:00.00 /bin/csh -f /home/k
>>> 10597 kakhaber  20  0  3928  780  572 S    0  0.0  0:00.00 /bin/csh -f /home/k
>>> 14320 kakhaber  20  0 11252 1180  772 S    0  0.0  0:00.00 /bin/tcsh -f /home/
>>> 14336 kakhaber  20  0  3920  800  604 S    0  0.0  0:00.62 /bin/csh -f /home/k
>>> 14427 kakhaber  20  0  3920  440  244 S    0  0.0  0:00.00 /bin/csh -f /home/k
>>> 14442 kakhaber  20  0  3920  432  236 S    0  0.0  0:00.00 /bin/csh -f /home/k
>>> 14457 kakhaber  20  0  3920  432  236 S    0  0.0  0:00.00 /bin/csh -f /home/k
>>> 14472 kakhaber  20  0  3920  432  236 S    0  0.0  0:00.00 /bin/csh -f /home/k
>>> 16499 kakhaber  20  0 77296 1808 1100 R    0  0.0  0:00.00 sshd: kakhaber@pts/
>>> 16500 kakhaber  20  0 16212 2032 1080 S    0  0.0  0:00.02 -tcsh
>>> 16603 kakhaber  24  4 10620 1120  848 R    0  0.0  0:00.02 top -c -u kakhaber
>>>
>>> node105:~> nice top -c -u kakhaber
>>>
>>> top - 11:01:37 up 116 days, 23:23, 1 user, load average: 3.00, 3.00, 3.00
>>> Tasks: 99 total, 3 running, 96 sleeping, 0 stopped, 0 zombie
>>> Cpu(s): 2.9%us, 20.6%sy, 49.9%ni, 25.0%id, 0.0%wa, 0.1%hi, 1.6%si, 0.0%st
>>> Mem: 16542480k total, 6277020k used, 10265460k free, 233364k buffers
>>> Swap: 4000144k total, 13512k used, 3986632k free, 5173312k cached
>>>
>>>   PID USER      PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>>> 18955 kakhaber  20  0 77296 1808 1100 S    0  0.0  0:00.00 sshd: kakhaber@pts/
>>> 18956 kakhaber  20  0 16212 2032 1080 S    0  0.0  0:00.02 -tcsh
>>> 19071 kakhaber  24  4 10620 1112  848 R    0  0.0  0:00.00 top -c -u kakhaber
>>>
>>> For node122 and node131 the output is the same as for node105.
>>>
>>>> Anyway, k-parallel does not use mpi at all and you have to read the
>>>> requirements specified in the UG.
>>>
>>> I know, but I meant the following: if k-point parallelization in
>>> WIEN2k_09.1 does not work because of a problem with the
>>> interconnection between different nodes, then I thought that
>>> MPI-parallelization should also be impossible. But the MPI-parallel
>>> jobs run without any problem.
>>>
>>> Let me suggest one possibility (it may be trivial or wrong). In
>>> WIEN2k_08.1 k-point parallelization works, but all processes run on
>>> the master node. In WIEN2k_09.1 k-point parallelization does not
>>> work at all. Maybe there is some restriction in WIEN2k_09.1
>>> preventing different k-processes from running on the same node, and
>>> this is the reason for the crash in parallel lapw1?
>>>
>>> Is this suggestion reasonable? I would be extremely thankful for
>>> your additional advice.
>>>
>>>> Kakhaber Jandieri wrote:
>>>>> Dear Prof. Blaha,
>>>>>
>>>>> Thank you for your reply.
>>>>>
>>>>>> Can you    ssh node120 ps
>>>>>> without supplying a password ?
>>>>>
>>>>> No, I can't ssh to the nodes without supplying a password, but in
>>>>> my parallel_options I have setenv MPI_REMOTE 0. I thought that our
>>>>> cluster has a shared-memory architecture, since
>>>>> MPI-parallelization works without any problem for 1 k-point.
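This is the actual requirement: k-point parallelization across nodes needs ssh to work without a password. A minimal sketch of the usual key-based setup, assuming your home directory is shared across the nodes (file names may differ on your cluster):

    ssh-keygen -t rsa        # accept the defaults; use an EMPTY passphrase
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
    ssh node120 ps           # should now run without a password prompt

If the home directory is not shared, the contents of id_rsa.pub have to be appended to ~/.ssh/authorized_keys on every node instead.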
>>>>> I checked the corresponding nodes; they were all loaded. Maybe I
>>>>> misunderstood something. Are the requirements for
>>>>> MPI-parallelization different from those for k-point
>>>>> parallelization?
>>>>>
>>>>>> Try    x lapw1 -p    on the command line.
>>>>>> What exactly is the "error" ?
>>>>>
>>>>> Just now, to try your suggestions, I ran a new task with k-point
>>>>> parallelization. The .machines file is:
>>>>>
>>>>> granularity:1
>>>>> 1:node120
>>>>> 1:node127
>>>>> 1:node121
>>>>> 1:node123
>>>>>
>>>>> with node120 as the master node.
>>>>>
>>>>> The output of    x lapw1 -p    is:
>>>>>
>>>>> starting parallel lapw1 at Sun Jun 13 22:44:08 CEST 2010
>>>>> ->  starting parallel LAPW1 jobs at Sun Jun 13 22:44:08 CEST 2010
>>>>> running LAPW1 in parallel mode (using .machines)
>>>>> 4 number_of_parallel_jobs
>>>>> [1] 31314
>>>>> [2] 31341
>>>>> [3] 31357
>>>>> [4] 31373
>>>>> Permission denied, please try again.
>>>>> Permission denied, please try again.
>>>>> Received disconnect from 172.26.6.120: 2: Too many authentication failures for kakhaber
>>>>> [1]    Done          ( ( $remote $machine[$p] ...
>>>>> Permission denied, please try again.
>>>>> Permission denied, please try again.
>>>>> Received disconnect from 172.26.6.127: 2: Too many authentication failures for kakhaber
>>>>> Permission denied, please try again.
>>>>> Permission denied, please try again.
>>>>> Received disconnect from 172.26.6.121: 2: Too many authentication failures for kakhaber
>>>>> [3]  - Done          ( ( $remote $machine[$p] ...
>>>>> [2]  - Done          ( ( $remote $machine[$p] ...
>>>>> Permission denied, please try again.
>>>>> Permission denied, please try again.
>>>>> Received disconnect from 172.26.6.123: 2: Too many authentication failures for kakhaber
>>>>> [4]    Done          ( ( $remote $machine[$p] ...
>>>>>      node120(1) node127(1) node121(1) node123(1) **  LAPW1 crashed!
>>>>> cat: No match.
>>>>> 0.116u 0.324s 0:11.88 3.6% 0+0k 0+864io 0pf+0w
>>>>> error: command   /home/kakhaber/WIEN2K_09/lapw1cpara -c lapw1.def   failed
>>>>>
>>>>>> How many k-points do you have ? ( 4 ?)
>>>>>
>>>>> Yes, I have 4 k-points.
>>>>>
>>>>>> Content of .machine1 and .processes
>>>>>
>>>>> marc-hn:~/wien_work/GaAsB> cat .machine1
>>>>> node120
>>>>> marc-hn:~/wien_work/GaAsB> cat .machine2
>>>>> node127
>>>>> marc-hn:~/wien_work/GaAsB> cat .machine3
>>>>> node121
>>>>> marc-hn:~/wien_work/GaAsB> cat .machine4
>>>>> node123
>>>>>
>>>>> marc-hn:~/wien_work/GaAsB> cat .processes
>>>>> init:node120
>>>>> init:node127
>>>>> init:node121
>>>>> init:node123
>>>>> 1 : node120 :  1 : 1 : 1
>>>>> 2 : node127 :  1 : 1 : 2
>>>>> 3 : node121 :  1 : 1 : 3
>>>>> 4 : node123 :  1 : 1 : 4
>>>>>
>>>>>> While x lapw1 -p is running, do a    ps -ef | grep lapw
>>>>>
>>>>> I did not have enough time to do it - the program crashed first.
>>>>>
>>>>>> Your .machines file is most likely a rather "useless" one. The
>>>>>> mpi-lapw1 diagonalization (SCALAPACK) is almost a factor of 2
>>>>>> slower than the serial version, thus your speedup from using 2
>>>>>> processors in mpi mode will be very small.
>>>>>
>>>>> Yes, I know, but I am simply trying to get the calculations
>>>>> running with WIEN2k. For "real" calculations I will use many more
>>>>> processors.
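As an aside: an mpi-style .machines file looks different from a pure k-point one. A rough sketch with hypothetical node names (a line that lists several machines, or a machine with a core count, starts one lapw1_mpi job for that group):

    granularity:1
    # first k-job: one lapw1_mpi run spanning two nodes
    1:node120 node127
    # second k-job: one lapw1_mpi run spanning two nodes
    1:node121 node123
    # alternative on a multi-core node: 4 mpi processes on node120
    # 1:node120:4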
>>>>>
>>>>> And finally, for additional information: as I wrote in my previous
>>>>> letters, in WIEN2k_08.1 k-point parallelization works, but all
>>>>> processes run on the master node and all other reserved nodes are
>>>>> idle. I forgot to mention: this is true for lapw1 only. Lapw2 is
>>>>> distributed among all reserved nodes.
>>>>>
>>>>> Thank you once again. I am looking forward to your further advice.
>>>>>
>>>>> Dr. Kakhaber Jandieri
>>>>> Department of Physics
>>>>> Philipps University Marburg
>>>>> Tel: +49 6421 2824159 (2825704)

-- 
P. Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671    FAX: +43-1-58801-15698
Email: blaha@theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------