I have an MPI program (C code for a school project) that I want to
run on more than one node (two nodes for now), but it doesn't work:
it waits forever.
First I tried to run it on both machines with the command `mpirun -np 2 --host 192.168.0.147,192.168.0.116 ./mandelbrot_mpi_omp`,
and then I tried to be more specific about the subnet and to enable
logging with: `mpirun --mca oob_base_verbose 100 --mca oob_tcp_if_include 192.168.0.0/24 --mca btl_tcp_if_include 192.168.0.0/24 -np 2 --host 192.168.0.147,192.168.0.116 ./mandelbrot_mpi_omp`
Those local IP addresses are the addresses of the two computers, and
they are listed in the same order on both PCs, so 192.168.0.147 is
rank 0, the master.
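As far as I understand, the equivalent setup with a hostfile would look like this (the file name `myhosts` is just an example):
```
# myhosts: one line per machine; slots = max number of ranks on that host
192.168.0.147 slots=1
192.168.0.116 slots=1
```
followed by `mpirun -np 2 --hostfile myhosts ./mandelbrot_mpi_omp`, though I have been passing `--host` directly.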
The first command waits forever without printing any text or error.
The second one prints a log and stops at the step "get transports for
component tcp" on both machines.
log from both machines: https://pastebin.com/bt32ZddX
lsof from both machines: https://pastebin.com/s3HHFWZB
The lsof output is not trimmed; it is everything that had any port
open at that moment, and nothing else was running. The weird thing is
the CLOSE_WAIT state that the ssh connection shows on both sides.

Here is my code:
```
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>

// SCALE_X, SCALE_Y, PIXEL, send_buffer, recv_buffer, mandelbrot() and
// save_to_png() are defined elsewhere in the project.

int main(int argc, char* argv[]){
    int width = SCALE_X;
    int height = SCALE_Y;

    // MPI init & setup
    MPI_Init(&argc, &argv);

    int world_size;
    int rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // calculate size of buffer according to server count; the +1 leaves
    // headroom in case the dimensions are not divisible by world_size
    int part_height = SCALE_Y/world_size;
    int buffer_size = (width+1)*(part_height+1)*3;

    // dynamically allocate arrays for image data according to server count
    send_buffer = calloc(buffer_size, sizeof(PIXEL));
    recv_buffer = calloc(buffer_size*world_size, sizeof(PIXEL));

    if(rank == 0) printf("MPI node count: %i\n", world_size);
    MPI_Barrier(MPI_COMM_WORLD);

    // OpenMP setup
    int cpu_count = omp_get_num_procs();
    omp_set_num_threads(cpu_count);
    printf("OpenMP cpu count on node %i: %i\n", rank, cpu_count);
    // omp_get_max_threads() rather than omp_get_num_threads(), which
    // always returns 1 outside a parallel region
    printf("OpenMP (max) thread count on node %i: %i\n", rank,
           omp_get_max_threads());
    MPI_Barrier(MPI_COMM_WORLD);

    // generate a part of the mandelbrot set according to world size and
    // the rank of this server
    mandelbrot(rank, world_size, width, part_height);

    // gather the parts of the mandelbrot set from all nodes
    MPI_Gather(send_buffer, width*part_height*3, MPI_CHAR,
               recv_buffer, width*part_height*3, MPI_CHAR,
               0, MPI_COMM_WORLD);

    // save the raster array of mandelbrot data to a png file
    if(rank == 0) save_to_png(width, height);

    printf("Process %i finished.\n", rank);

    MPI_Finalize();

    return 0;
}
```
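mandelbrot() itself is not shown above; it fills send_buffer with this rank's strip of rows. Simplified, with my real coloring left out and PIXEL treated as a plain 3-field RGB struct, it looks roughly like this:
```
// Simplified sketch of mandelbrot(): each rank renders its own strip of
// rows into send_buffer; the actual coloring in my code is different.
void mandelbrot(int rank, int world_size, int width, int part_height){
    int y0 = rank * part_height;  // first row of this rank's strip
    #pragma omp parallel for schedule(dynamic)
    for(int y = 0; y < part_height; y++){
        for(int x = 0; x < width; x++){
            // map the pixel to a point c in the complex plane
            double cr = -2.0 + 3.0 * x / width;
            double ci = -1.5 + 3.0 * (y0 + y) / (part_height * world_size);
            double zr = 0.0, zi = 0.0;
            int it = 0;
            while(zr*zr + zi*zi < 4.0 && it < 255){  // escape-time iteration
                double t = zr*zr - zi*zi + cr;
                zi = 2.0*zr*zi + ci;
                zr = t;
                it++;
            }
            PIXEL p = { it, it, it };  // grayscale by iteration count
            send_buffer[y * width + x] = p;
        }
    }
}
```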

My OS is Debian 11 and Open MPI (v4.1.0) is installed from the official
Debian repositories on both machines. Neither iptables nor nftables is
installed on either system, so IP blocking should not be the problem
right now. The machines are connected to one router and can reach each
other: I can ping each one from the other and connect over ssh in both
directions. I also tried connecting them directly with an ethernet
cable and setting the IP addresses manually, but that didn't work
either. Both systems also have the same username and password.
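If it helps to narrow this down: a test with nothing but MPI startup should show whether the wire-up itself is the problem, independently of my application code, e.g.:
```
#include <stdio.h>
#include <mpi.h>

// Bare init/finalize test: if this also hangs across the two hosts,
// the problem is in the Open MPI startup, not in the program itself.
int main(int argc, char* argv[]){
    MPI_Init(&argc, &argv);

    int world_size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Hello from rank %i of %i\n", rank, world_size);

    MPI_Finalize();
    return 0;
}
```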

What am I missing?
I am new to MPI and not very savvy about networking to begin with.

Thanks in advance.
