Thank you both for your interest.

$ mpirun --use-hwthread-cpus --bind-to core -np 64 --mca btl vader,self mpijr.out /run/user/10002/bigarray.in > mpijr-bindto-vader-64.log 2>&1
$ mpirun --use-hwthread-cpus --bind-to core -np 60 --mca btl vader,self mpijr.out /run/user/10002/bigarray.in > mpijr-bindto-vader-60.log 2>&1
$ mpirun --use-hwthread-cpus --bind-to core -np 32 --mca btl vader,self mpijr.out /run/user/10002/bigarray.in > mpijr-bindto-vader-32.log 2>&1

Unfortunately, using 60 cores instead of 64 did not really help. And yes, even with --bind-to core, it still suffers from the performance degradation when adding
MPI_Barrier(MPI_COMM_WORLD);
before
MPI_Finalize();
Note, however, that the elapsed time is measured after MPI_Barrier() returns.
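
For reference, this is roughly how the barrier and the timing are arranged in mpi_just_read_barrier.c (a minimal sketch rather than the exact gist source; the fscanf parsing and the data types are simplified assumptions):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    if (rank == 0) {
        /* only rank 0 parses the input file */
        FILE *fp = fopen(argv[1], "r");
        long n = 0, x;
        if (fp != NULL && fscanf(fp, "%ld", &n) == 1) {
            for (long i = 0; i < n && fscanf(fp, "%ld", &x) == 1; i++)
                ;   /* values are discarded in this sketch */
            fclose(fp);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);            /* the added barrier */
    double elapsed = MPI_Wtime() - start;   /* measured after the barrier */

    if (rank == 0)
        printf("read elapsed: %f s\n", elapsed);

    MPI_Finalize();
    return 0;
}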

Cheers,
Ali.

On Mar 6, 2020, at 4:11 PM, Gabriel, Edgar <egabr...@central.uh.edu> wrote:

How is the performance if you leave a few cores for the OS, e.g. running with 60 processes instead of 64? The reasoning is that the file read operation is really executed by the OS and could potentially be quite resource intensive.

Thanks
Edgar

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ali Cherry via users
Sent: Friday, March 6, 2020 8:06 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ali Cherry <ali.che...@lau.edu>
Subject: Re: [OMPI users] Read from file performance degradation when increasing number of processors in some cases

Hello,

Thank you for your replies.
Yes, it is only a single node with 64 cores.
The input file is copied from nfs to a tmpfs when I start the node.
The mpirun command lines were:
$ mpirun -np 64 --mca btl vader,self pms.out /run/user/10002/bigarray.in > pms-vader-64.log 2>&1
$ mpirun -np 32 --mca btl vader,self pms.out /run/user/10002/bigarray.in > pms-vader-32.log 2>&1
$ mpirun -np 32 --mca btl tcp,self pms.out /run/user/10002/bigarray.in > pms-tcp-32.log 2>&1
$ mpirun -np 64 --mca btl tcp,self pms.out /run/user/10002/bigarray.in > pms-tcp-64.log 2>&1
$ mpirun -np 32 --mca btl vader,self mpijr.out /run/user/10002/bigarray.in > mpijr-vader-32.log 2>&1
$ mpirun -np 64 --mca btl vader,self mpijr.out /run/user/10002/bigarray.in > mpijr-vader-64.log 2>&1

I added mpi_just_read_barrier.c: 
https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-mpi_just_read_barrier-c
Unfortunately, although mpi_just_read_barrier runs with 32 cores and --bind-to core, I was unable to run it with 64 cores for the following reason:
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        compute-0
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------


I will solve this and get back to you soon.
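
For completeness, my understanding is that the override mentioned in the message is passed as a qualifier on the binding directive, along the lines of:

$ mpirun --bind-to core:overload-allowed ...

though I have not tried that yet.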


Best regards,
Ali Cherry.



On Mar 6, 2020, at 3:24 PM, Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:

 Also, in mpi_just_read.c, what if you add
MPI_Barrier(MPI_COMM_WORLD);
right before invoking
MPI_Finalize();

Can you observe a similar performance degradation when moving from 32 to 64 tasks?

Cheers,

Gilles
----- Original Message -----
 Hi,

The log filenames suggest you are always running on a single node, is that correct?
Do you create the input file on the tmpfs once and for all, or before each run?
Can you please post your mpirun command lines?
If you did not bind the tasks, can you try again with
mpirun --bind-to core ...

Cheers,

Gilles
----- Original Message -----
Hi,

We faced an issue when testing the scalability of a parallel merge sort that uses a reduction tree on an array of size 1024^3.
Currently, only the master opens the input file, parses it into an array using fscanf, and then distributes the array to the other processors.
When using 32 processors, it took ~109 seconds to read the file.
When using 64 processors, it took ~216 seconds to read the file.
Despite the varying number of processors, only one processor (the master) reads the file.
The input file is stored in a tmpfs; it is made up of 1024^3 + 1 numbers (where the first number is the array size).
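
Schematically, the read-and-distribute phase looks roughly like this (a simplified sketch rather than the exact parallel_ms.c code; the int element type, the even chunk split, and the omitted error handling and reduction-tree merge are assumptions):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long n = 0;
    int *data = NULL;

    if (rank == 0) {
        FILE *fp = fopen(argv[1], "r");
        fscanf(fp, "%ld", &n);                  /* first number = array size */
        data = malloc(n * sizeof(int));
        for (long i = 0; i < n; i++)
            fscanf(fp, "%d", &data[i]);         /* the slow ~109-216 s phase */
        fclose(fp);
    }

    /* while rank 0 parses the file, ranks 1..size-1 are already blocked here */
    MPI_Bcast(&n, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    long chunk = n / size;                      /* assumes size divides n evenly */
    int *local = malloc(chunk * sizeof(int));
    MPI_Scatter(data, (int)chunk, MPI_INT, local, (int)chunk, MPI_INT,
                0, MPI_COMM_WORLD);

    /* ... local sort and reduction-tree merge omitted ... */

    free(local);
    free(data);
    MPI_Finalize();
    return 0;
}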

Additionally, I ran a plain C program that only reads the file; it took ~104 seconds.
I also ran an MPI program that only reads the file; it took ~116 and ~118 seconds on 32 and 64 processors, respectively.

Code at  https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf
parallel_ms.c:  
https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-parallel_ms-c
mpi_just_read.c:  
https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-mpi_just_read-c
just_read.c:  
https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-just_read-c

Clearly, increasing the number of processors did not severely affect the elapsed time of mpi_just_read.c.
For parallel_ms.c, is it possible that the 63 processors blocked in a receive from processor 0 are somehow affecting the file-read elapsed time?

Any assistance or clarification would be appreciated.
Ali.

Attachment: mpijr-bindto-vader-32.log
Attachment: mpijr-bindto-vader-60.log
Attachment: mpijr-bindto-vader-64.log
