It cannot initialize an MPI job because the interface software is missing.

You need to ask the computing center / system administrators how one executes an MPI job on this computer.

It could be that "mpirun" is not supported on this machine. You may try a WIEN2k installation with system "LS" in siteconfig. This configures the parallel environment/commands using "slurm" commands like srun -K -N_nodes_ -n_NP_ ..., replacing mpirun. We used it once on our HPC machine, since it was recommended by the computing center people. However, it turned out that the standard mpirun installation was more stable, because the "slurm controller" died too often, leading to many random crashes. Anyway, if your system has what is called "tight integration of MPI", it might be necessary.
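
For illustration, with system "LS" siteconfig ends up writing srun-based commands into $WIENROOT/parallel_options in place of mpirun. A minimal sketch of what that might look like (the exact lines are an assumption and vary by installation; check your own file):

# hypothetical $WIENROOT/parallel_options after siteconfig with system "LS"
# (csh syntax; _nodes_, _NP_ and _EXEC_ are WIEN2k placeholders)
setenv WIEN_MPIRUN "srun -K -N_nodes_ -n_NP_ _EXEC_"
setenv USE_REMOTE 0
setenv MPI_REMOTE 0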

On 13.04.2021 at 21:47, leila mollabashi wrote:
Dear Prof. Peter Blaha and WIEN2k users,

Then, when I run x lapw1 -p:

starting parallel lapw1 at Tue Apr 13 21:04:15 CEST 2021

->  starting parallel LAPW1 jobs at Tue Apr 13 21:04:15 CEST 2021

running LAPW1 in parallel mode (using .machines)

2 number_of_parallel_jobs

[1] 14530

[e0467:14538] mca_base_component_repository_open: unable to open mca_btl_uct: libucp.so.0: cannot open shared object file: No such file or directory (ignored)

WARNING: There was an error initializing an OpenFabrics device.

   Local host:   e0467

   Local device: mlx4_0

MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD

with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on

exactly when Open MPI kills them.

--------------------------------------------------------------------------

[e0467:14567] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init

[e0467:14567] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

[warn] Epoll MOD(1) on fd 27 failed.  Old events were 6; read change was 0 (none); write change was 2 (del): Bad file descriptor
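
One common workaround for such openib/uct failures is to tell Open MPI to skip those transports entirely. The MCA options below are standard Open MPI syntax, but whether they cure this particular crash is an assumption:

# sketch: exclude the failing openib and uct btl components so Open MPI
# falls back to TCP / shared memory (possibly slower across nodes)
export OMPI_MCA_btl="^openib,uct"
# equivalent on the command line, mirroring the mpirun call from the log:
mpirun --mca btl ^openib,uct -np 4 -machinefile .machine0 $WIENROOT/lapw0_mpi lapw0.def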

> Somewhere there should be some documentation on how one runs an MPI job on your system.

I only found this:

Before submitting a task, it should be encapsulated in an appropriate script understandable to the queue system, e.g.:

/home/users/user/submit_script.sl

Sample SLURM script:

#!/bin/bash -l
#SBATCH -N 1
#SBATCH --mem 5000
#SBATCH --time=20:00:00
/path/to/binary/binary_file.in > /path/to/output_file.out

To submit a task to a specific queue, use the #SBATCH -p parameter, e.g.:

#!/bin/bash -l
#SBATCH -N 1
#SBATCH --mem 5000
#SBATCH --time=20:00:00
#SBATCH -p standard
/path/to/binary/binary_file.in > /path/to/output_file.out

The task is then submitted using the *sbatch* command:

sbatch /home/users/user/submit_script.sl
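
For a WIEN2k run, one would wrap the parallel SCF cycle in such a script and build the .machines file from the nodes SLURM actually assigns, since the nodenames are only known at run time. A minimal, untested sketch (partition name, core count, and the plain k-point-parallel .machines layout are assumptions):

#!/bin/bash -l
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --time=20:00:00
#SBATCH -p standard

cd $SLURM_SUBMIT_DIR    # the case directory the job was submitted from

# build .machines dynamically: one "1:host" line per allocated task
# gives one k-point-parallel job per core (single-node case, so all
# $SLURM_NTASKS tasks land on the same host)
rm -f .machines
echo "granularity:1" >> .machines
for host in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    for i in $(seq 1 $SLURM_NTASKS); do
        echo "1:$host" >> .machines
    done
done
echo "extrafine:1" >> .machines

run_lapw -p    # standard WIEN2k parallel SCF driver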

*Submitting interactive tasks*


Interactive tasks can be divided into two groups:

· interactive task (working in text mode)

· interactive task

*Interactive task (working in text mode)*


Submitting an interactive task is very simple; in the simplest case it comes down to issuing the command below:

srun --pty /bin/bash
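
In practice one usually also requests resources explicitly; the values below (node count, tasks, partition, walltime) are just an illustration:

# example: an interactive shell with 8 tasks on one node in the
# "standard" partition for two hours (all values are assumptions)
srun -N 1 -n 8 -p standard --time=02:00:00 --pty /bin/bash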

Sincerely yours,

Leila Mollabashi


On Wed, Apr 14, 2021 at 12:03 AM leila mollabashi <le.mollaba...@gmail.com> wrote:

    Dear Prof. Peter Blaha and WIEN2k users,

    Thank you for your assistances.

    > At least now the error: "lapw0 not found" is gone. Do you understand why??

    Yes, I think it is because now the path is clearly known.

    > How many slots do you get by this srun command?

    Usually I get a node with 28 CPUs.

    > Is this the node with the name e0591???

    Yes, it is.

    > Of course the .machines file must be consistent (dynamically adapted)
    > with the actual nodename.

    Yes, to do this I use my script.
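
    For reference, a .machines file adapted to the allocated node might look
    like this (hypothetical content for two k-point-parallel jobs on e0467,
    matching the "2 number_of_parallel_jobs" line in the log above):

    granularity:1
    1:e0467
    1:e0467
    extrafine:1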

    When I use “srun --pty -n 8 /bin/bash”, it goes to the node with 8 free
    cores, and when I run x lapw0 -p this happens:

    starting parallel lapw0 at Tue Apr 13 20:50:49 CEST 2021

    -------- .machine0 : 4 processors

    [1] 12852

    [e0467:12859] mca_base_component_repository_open: unable to open
    mca_btl_uct: libucp.so.0: cannot open shared object file: No such
    file or directory (ignored)

    [e0467][[56319,1],1][btl_openib_component.c:1699:init_one_device]
    error obtaining device attributes for mlx4_0 errno says Protocol not
    supported

    [e0467:12859] mca_base_component_repository_open: unable to open
    mca_pml_ucx: libucp.so.0: cannot open shared object file: No such
    file or directory (ignored)

    LAPW0 END

    [1]    Done                          mpirun -np 4 -machinefile
    .machine0 /home/users/mollabashi/v19.2/lapw0_mpi lapw0.def >> .time00

    Sincerely yours,

    Leila Mollabashi




--
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at
-------------------------------------------------------------------------
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
