Re: [Wien] MPI Problem

2013-05-02 Thread Laurence Marks
I think these are semi-harmless, and you can add ",iostat=i" to the
relevant lines. You may need to add the same to any write statements to
unit 99 in errclr.f.
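A minimal sketch of the kind of change this suggests, assuming a typical write statement in errclr.f (the variable name `i` and the message text are illustrative, not the actual source):

```fortran
! Hypothetical sketch: adding an iostat specifier to a write on unit 99
! so that an I/O failure is recorded in i instead of aborting the run.
      INTEGER :: i
      WRITE (99, '(A)', IOSTAT=i) 'error message text'
!     i is zero on success and nonzero if the write failed; either way,
!     execution continues past the statement.
```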

However, your timing seems strange: 6.5 minutes serial versus 9.5 minutes
parallel. Is this CPU time? The WALL time may be more reliable.

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Gyorgi
 On May 2, 2013 7:25 PM, "Oliver Albertini"  wrote:

>  Dear W2K,
>
>  On an AIX 560 server with 16 processors, I have been running scf for NiO
> supercell (2x2x2) in serial as well as MPI parallel (one kpoint). The
> serial version runs fine. When running in parallel, the following error
> appears:
>
>  STOP LAPW2 - FERMI; weighs written
> "errclr.f", line 64: 1525-014 The I/O operation on unit 99 cannot be
> completed because an errno value of 2 (A file or directory in the path name
> does not exist.) was received while opening the file.  The program will
> stop.
>
>  A similar error, which does not stop the program, is the
> following:
>
>   STOP  LAPW0 END
> "inilpw.f", line 233: 1525-142 The CLOSE statement on unit 200 cannot be
> completed because an errno value of 2 (A file or directory in the path name
> does not exist.) was received while closing the file.  The program will
> stop.
> STOP  LAPW1 END
>
>
> The second error is always there, while the first appears only with more
> than 2 (4, 8, or 16) processors. Running the scf in serial took ~6.5
> minutes; in parallel with two processors, ~9.5 minutes. The problem occurs
> regardless of MPI/USER_REMOTE set to 0 or 1.
>
>
>  My compile options:
>
>  FC = xlf90
> MPF = mpxlf90
> CC = xlc -q64
> FOPT =  -O5 -qarch=pwr6 -q64 -qextname=flush:w2k_catch_signal
> FPOPT =  -O5 -qarch=pwr6 -q64 -qfree=f90
> -qextname=flush:w2k_catch_signal:fftw_mpi_execute_dft
> #DParallel = '-WF,-DParallel'
> FGEN = $(PARALLEL)
> LDFLAGS = -L /lapack-3.4.2/ -L /usr/lpp/ppe.poe/lib/ -L /usr/local/lib -I
> /usr/include -q64 -bnoquiet
> R_LIBS = -llapack -lessl -lfftw3 -lm -lfftw3_essl_64
> RP_LIBS = $(R_LIBS) -lpessl -lmpi -lfftw3_mpi
>
>  WIEN_MPI_RUN='poe _EXEC_ -procs _NP_'
>
>  .machines and host.list attached.
>
>  As always, any advice on this matter would be great,
>
>  Oliver Albertini
>
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] MPI Problem

2013-05-02 Thread Gavin Abo



STOP  LAPW0 END
"inilpw.f", line 233: 1525-142 The CLOSE statement on unit 200 cannot 
be completed because an errno value of 2 (A file or directory in the 
path name does not exist.) was received while closing the file.  The 
program will stop.

STOP  LAPW1 END
If this is on operating system AIX 6.1 
[http://zeus.theochem.tuwien.ac.at/pipermail/wien/2013-March/018560.html], 
the following link mentions that a fix might be needed for some release 
levels:


http://www-01.ibm.com/support/docview.wss?uid=isg1IZ23555


Re: [Wien] MPI Problem

2013-05-03 Thread Oliver Albertini
Thanks to you both for the suggestions. The OS was recently updated beyond
those versions mentioned in the link (now 6100-08).

Adding the iostat statement to all the errclr.f files prevents the program
from stopping, although error messages still appear in the output:

STOP  LAPW0 END
STOP  LAPW0 END
STOP  LAPW0 END
STOP  LAPW0 END
STOP  LAPW0 END
STOP LAPW1 - Error
STOP  LAPW1 END
STOP  LAPW1 END
STOP  LAPW1 END
STOP  LAPW1 END
STOP LAPW1 - Error
STOP  LAPW1 END
STOP  LAPW1 END
STOP  LAPW1 END
STOP  LAPW1 END
STOP LAPW2 - FERMI; weighs written
STOP  LAPW2 END
STOP  LAPW2 END
STOP  LAPW2 END
STOP  LAPW2 END
STOP  LAPW2 END
STOP  SUMPARA END
STOP LAPW2 - FERMI; weighs written
STOP  LAPW2 END
STOP  LAPW2 END
STOP  LAPW2 END
STOP  LAPW2 END
STOP  LAPW2 END
STOP  SUMPARA END
STOP  CORE  END
STOP  CORE  END
STOP  MIXER END


which are more prevalent when using higher processor counts. After
completing a few runs with more processors, the times have continually
increased:

real    6m43.33s
user    6m19.18s    serial
sys     0m13.59s

real    10m36.03s
user    1m4.68s     2proc
sys     0m47.79s

real    11m11.25s
user    1m5.24s     4proc
sys     0m52.17s

real    11m39.17s
user    1m6.18s     8proc
sys     1m10.65s

real    14m31.16s
user    1m7.95s     16proc
sys     2m7.63s

After looking into various IBM Parallel Operating Environment (POE)
environment variables (MP_SHARED_MEMORY, MP_IO_BUFFER_SIZE, MP_EAGER_LIMIT),
it seems that none of them improve performance. Any ideas why this is
getting slower?
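For reference, a sketch of how those POE variables would typically be set before launching; the values shown are illustrative assumptions, not tuned recommendations for this machine:

```shell
# Illustrative values only - assumptions, not recommendations.
export MP_SHARED_MEMORY=yes     # use shared memory for on-node messages
export MP_EAGER_LIMIT=65536     # eager-send message-size threshold (bytes)
export MP_IO_BUFFER_SIZE=4M     # buffer size for MPI-IO operations
# then launch as before, e.g.:
# poe lapw1_mpi -procs 4
```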




Re: [Wien] MPI Problem

2013-05-03 Thread Laurence Marks
Please have a look at the end of case.outputup_*, which gives the real CPU
and wall times, and post those. It may be that the times being reported are
misleading.

In addition, I do not understand why you are seeing an error while the
script continues - it should not. Maybe some of the tasks are not working,
or there are bugs in the csh scripts. It may be useful to post the dayfile.



Re: [Wien] MPI Problem

2013-05-04 Thread Laurence Marks
It looks as if your .machines file is OK; I assume that you added the
A*** in front for emailing. Note that Wien2k does not use a hosts file
itself. I guess that you are using a server at IBM in Almaden.
Unfortunately, very few people that I know of are running Wien2k on
IBM/AIX machines, which is going to make it very hard for anyone to
give useful advice remotely by guessing.

I suggest that you download the benchmarks from
http://www.wien2k.at/reg_user/benchmark/, run them, and compare the
times. Beyond that, get help from someone at IBM who knows the poe
command, or try something more standard such as OpenMPI, which many
people know.



