Re: [OMPI users] Slow collective MPI File IO

2020-04-08 Thread Dong-In Kang via users
Thank you for your suggestion.
I have more information, a possible explanation, and more questions.
It looks like NUMA plays a big role here.
In summary, the synchronization overhead of MPI file I/O across sockets appears
to be much higher than the overhead among processes within a socket.
But even so, the overhead looks too big.

I ran IOR on a Lustre parallel file system, testing single-shared-file write
performance.
I used "--map-by socket --bind-to socket", which resulted in poor performance.
Write performance gets worse as the number of MPI processes increases.

When I switched to "--map-by core --bind-to socket", the results improved.
Runs were a couple of times faster as the number of MPI processes increased,
but it is still not scalable.
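
(For reference, the two kinds of runs look roughly like the following; the IOR
options and process count are illustrative, not the exact ones used.)

  mpirun -np 16 --map-by socket --bind-to socket ior -a MPIIO -w -t 1m -b 1g -o /lustre/testfile
  mpirun -np 16 --map-by core   --bind-to socket ior -a MPIIO -w -t 1m -b 1g -o /lustre/testfile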

I think the (synchronization) overhead of MPI file I/O across sockets is too big.
I'm running the test on a big single shared-memory machine.
I'm not sure whether this happens only on a big single shared-memory machine.
Is there any way to reduce the synchronization overhead of MPI file I/O across
sockets on a shared-memory machine?

Thanks,
David


Re: [OMPI users] Slow collective MPI File IO

2020-04-07 Thread George Reeke via users
On Mon, 2020-04-06 at 10:02 -0400, Dong-In Kang via users wrote:
> 
> Thank you Edgar for the information.
> 
> I also tried MPI_File_write_at_all(), but it usually makes the
> performance worse.
> My program is very simple.
> Each MPI process writes a consecutive portion of a file.
> No interleaving among the MPI processes.
> I think in this case I can use MPI_File_write_at().
> 
> 
FWIW, what I do is have each process send its data to node 0 (my term,
I think the MPI term is rank 0), and that process does nothing but collect
those messages (using prompts to get them in order without filling all
the buffer space) and write them to the file.  I have not timed this against
other methods, but it might be worth a try.
George Reeke
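
A minimal sketch of this funnel-through-rank-0 pattern, for illustration; the
chunk size, chunk count, output file name, and the simple integer "prompt" are
assumptions standing in for whatever the actual program uses:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK   (1 << 20)   /* 1 MB per message (assumed) */
#define NCHUNKS 16          /* chunks per rank (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(CHUNK);
    memset(buf, rank, CHUNK);

    if (rank == 0) {
        FILE *fp = fopen("funneled.out", "wb");   /* assumed output file */
        /* Rank 0 writes its own data, then prompts the other ranks one chunk
           at a time, so only one message is in flight and buffers don't fill. */
        for (int c = 0; c < NCHUNKS; c++)
            fwrite(buf, 1, CHUNK, fp);
        for (int src = 1; src < nprocs; src++) {
            for (int c = 0; c < NCHUNKS; c++) {
                int prompt = c;
                MPI_Send(&prompt, 1, MPI_INT, src, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, CHUNK, MPI_BYTE, src, 1, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                fwrite(buf, 1, CHUNK, fp);
            }
        }
        fclose(fp);
    } else {
        for (int c = 0; c < NCHUNKS; c++) {
            int prompt;
            MPI_Recv(&prompt, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, CHUNK, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}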




Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Benson Muite via users

If possible, consider changing to a non-blocking (split collective) write using
MPI_FILE_WRITE_ALL_BEGIN, so that work can continue while the file is being
written to disk. You may need to make a copy of the data being written if the
buffer will be reused for another purpose while the write is in progress.
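
A minimal C sketch of the split collective form (MPI_File_write_all_begin /
MPI_File_write_all_end are the C equivalents of MPI_FILE_WRITE_ALL_BEGIN / _END;
the file name, element count, and offset layout are assumptions):

#include <mpi.h>

/* Start a collective write, overlap it with other work, then complete it. */
void overlapped_write(MPI_Comm comm, const char *path, const double *buf, int count)
{
    MPI_File fh;
    MPI_Status status;
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Each rank writes a contiguous, non-overlapping block of doubles. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_seek(fh, offset, MPI_SEEK_SET);

    MPI_File_write_all_begin(fh, buf, count, MPI_DOUBLE);

    /* ... unrelated computation can run here; buf must not be modified ... */

    MPI_File_write_all_end(fh, buf, &status);
    MPI_File_close(&fh);
}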



Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Collin Strassburger via users
Gilles,

I just checked the write implementation of the Fortran codes with which I have 
noticed the issue; while they are compiled with MPI, they are not using MPI-IO. 
 Thank you for pointing out the important distinction!

Thanks,
Collin




Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Gilles Gouaillardet via users
David,

I suggest you rely on well-established benchmarks such as IOR or iozone.

As Edgar already pointed out, you first need to make sure you are not
benchmarking your (memory) cache, by comparing the bandwidth you measure
against the performance you can expect from your hardware.

As a side note, unless you are running on Lustre, the default IO component is 
ompio.
You can give ROMIO a try by running
mpirun --mca io ^ompio ...
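
For example, with an assumed benchmark binary and process count:

  mpirun --mca io ^ompio -np 8 ./ior_benchmark

Running "ompi_info | grep 'MCA io'" should list the io components that were
built (ompio and, if available, a ROMIO component whose exact name depends on
the Open MPI release).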


Cheers,

Gilles

Sent from my iPod


Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Gilles GOUAILLARDET via users
Collin,

Do you have any data to back up your claim?

As long as MPI-IO is used to perform file I/O, the Fortran bindings  overhead 
should be hardly noticeable.

Cheers,

Gilles


Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Dong-In Kang via users
Yes, I agree with you.
I did run the test with one file per MPI process.
Each MPI process opens its own file, named with its rank appended, using
MPI_File_open(MPI_COMM_SELF, ...).
That showed a few times better performance (with np = 4 or 8 on my workstation)
than a single MPI process (np = 1) can achieve.
As I mentioned before, I could get:
"As for the local disk, at least 2 times faster than a single MPI process can
achieve.
As for the ramdisk, at least 5 times faster.
Lustre, I know, is at least 7-8 times faster or more, depending on the
configuration."
However, when a single file is shared by multiple MPI processes (np > 1), the
aggregate write speed of all MPI processes is at most that of a single MPI
process run (np = 1).
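
A minimal sketch of the two open modes being compared (file names and access
flags are assumptions; the writes themselves would use MPI_File_write_at with
per-rank offsets in the shared-file case):

#include <mpi.h>
#include <stdio.h>

/* File per process: every rank opens its own rank-suffixed file on MPI_COMM_SELF. */
void open_per_process(int rank, MPI_File *fh)
{
    char path[256];
    snprintf(path, sizeof(path), "testfile.%d", rank);
    MPI_File_open(MPI_COMM_SELF, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, fh);
}

/* Single shared file: all ranks open the same file and write to disjoint
   regions via explicit offsets. */
void open_shared(MPI_File *fh)
{
    MPI_File_open(MPI_COMM_WORLD, "testfile.shared",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, fh);
}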

I expected simple MPI file I/O to be scalable, at least for a small number of
processes, but I don't see that at all now.
I ran it on a shared-memory machine with tens of cores and saw the same results.
Any idea?

David




Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Gabriel, Edgar via users
The one test that would give you a good idea of the upper bound for your
scenario would be to write a benchmark where each process writes to a separate
file, and look at the overall bandwidth achieved across all processes. The MPI
I/O performance will be less than or equal to the bandwidth achieved in this
scenario, as long as the number of processes is moderate.

Thanks
Edgar


Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Dong-In Kang via users
Hi Collin,

It is written in C.
So, I think it is OK.

Thank you,
David



Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Collin Strassburger via users
Hello,

Just a quick comment on this; is your code written in C/C++ or Fortran?  
Fortran has issues with writing at a decent speed regardless of MPI setup and 
as such should be avoided for file IO (yet I still occasionally see it 
implemented).

Collin


Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Dong-In Kang via users
Thank you Edgar for the information.

I also tried MPI_File_write_at_all(), but it usually makes the
performance worse.
My program is very simple.
Each MPI process writes a consecutive portion of a file.
No interleaving among the MPI processes.
I think in this case I can use MPI_File_write_at().

I tested the maximum bandwidth of the target devices, and it is at least a few
times higher than what a single process can achieve.
I measured it using the same program, but opening individual files with
MPI_COMM_SELF.
I tested 32MB chunks, but they didn't show noticeable changes; I also tried
512MB chunks, with no noticeable difference.
(There are performance differences between using 32MB chunks and using 512MB
chunks, but neither makes multi-process file I/O exceed the performance of
single-process file I/O.)
The local disk is at least 2 times faster than a single MPI process can achieve.
The ramdisk is at least 5 times faster.
Lustre, I know, is at least 7-8 times faster or more, depending on the
configuration.

About the caching effect: that would apply to MPI_File_read().
I can see very high bandwidth for MPI_File_read(), which I believe comes from
caches in RAM.
But I don't think MPI_File_write() is affected by caching.
Also, I create a new file for each test and remove the file at the end of the
test.

I may be making a very simple mistake, but I don't know what it is.
From a few reports on the internet, I saw that MPI file I/O could achieve a
multi-fold speedup over single-process file I/O when a faster file system such
as Lustre is used.
I started this experiment because I couldn't get a speedup on a Lustre file
system, and then I moved the experiment to the ramdisk and local disk because
that removes the issue of Lustre configuration.

Any comments are welcome.

David

Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Gabriel, Edgar via users
Hi,

A couple of comments. First, if you use MPI_File_write_at, this is usually not 
considered collective I/O, even if executed by multiple processes. 
MPI_File_write_at_all would be collective I/O.

Second, MPI I/O cannot do ‘magic’, but is bound by the hardware that you are
providing. If a single process is already able to saturate the bandwidth of
your file system and hardware, you will not be able to see performance
improvements from multiple processes (with some minor exceptions due to caching
effects, but only for smaller problem sizes; the larger the amount of data that
you try to write, the smaller the caching effects become in file I/O). So the
first question you have to answer is: what is the sustained bandwidth of your
hardware, and can you already saturate it with a single process? If you are
using a single hard drive (or even 2 or 3 hard drives in a RAID 0
configuration), this is almost certainly the case.

Lastly, the configuration parameters of your tests also play a major role. As a
general rule, the larger the amount of data you are able to provide per file
I/O call, the better the performance will be. 1MB of data per call is probably
on the smaller side. The ompio implementation of MPI I/O internally breaks
large individual I/O operations (e.g. MPI_File_write_at) into chunks of 512MB
for performance reasons. Large collective I/O operations (e.g.
MPI_File_write_at_all) are broken into chunks of 32 MB. This gives you some
hints on the quantities of data that you would have to use for performance
reasons.

Along the same lines, one final comment. You say you did 1000 writes of 1MB
each. For a single process that is about 1GB of data. Depending on how much
main memory your PC has, this amount of data can still be cached by modern
systems, and you might have an unrealistically high bandwidth value for the
1-process case that you are comparing against (it depends a bit on what your
benchmark does, and whether you force flushing the data to disk inside your
measurement loop).
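
A minimal sketch of forcing that flush inside the timed region (fh and the
timing variable are assumed from the surrounding benchmark):

    /* after the write loop, before stopping the clock */
    MPI_File_sync(fh);            /* push this rank's writes to storage */
    MPI_Barrier(MPI_COMM_WORLD);  /* wait until every rank has flushed */
    t_end = MPI_Wtime();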

Hope this gives you some pointers on where to start to look.
Thanks
Edgar


[OMPI users] Slow collective MPI File IO

2020-04-06 Thread Dong-In Kang via users
Hi,

I am running an MPI program where N processes write to a single file on a
single shared-memory machine.
I'm using Open MPI v4.0.2.
Each MPI process writes a 1MB chunk of data 1K times, sequentially.
There is no overlap in the file between any two MPI processes.
I ran the program for -np = {1, 2, 4, 8}.
I am seeing that the speed of the collective write to a file for -np = {2, 4, 8}
never exceeds the speed of -np = {1}.
I did the experiment with a few different file systems {local disk, ram disk,
Lustre FS}.
For all of them, I see similar results.
The speed of the collective write to a single shared file never exceeds the
speed of the single-MPI-process case.
Any tips or suggestions?

I used the MPI_File_write_at() routine with the proper offset for each MPI
process.
(I also tried the MPI_File_write_at_all() routine, which makes the performance
worse as np gets bigger.)

Before writing, MPI_Barrier() is called.
The start time is taken right after MPI_Barrier() using MPI_Wtime().
The end time is taken right after another MPI_Barrier().
The speed of the collective write is calculated as
(total amount of data written to the file) / (time between the first
MPI_Barrier() and the second MPI_Barrier()).
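
A minimal sketch of that measurement loop (chunk size, write count, and file
name are assumptions; the MPI_File_sync call is not in the description above,
but follows Edgar's point about flushing inside the timed region):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK   (1 << 20)   /* 1 MB per write (assumed) */
#define NWRITES 1024        /* 1K writes per rank (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(CHUNK);
    memset(buf, rank, CHUNK);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank owns a contiguous, non-overlapping region of the file. */
    MPI_Offset base = (MPI_Offset)rank * NWRITES * CHUNK;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < NWRITES; i++) {
        /* independent write; MPI_File_write_at_all would be the collective form */
        MPI_File_write_at(fh, base + (MPI_Offset)i * CHUNK, buf, CHUNK,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    }

    MPI_File_sync(fh);   /* force data out so the cache isn't what gets measured */
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("aggregate write speed: %.2f MB/s\n",
               (double)nprocs * NWRITES * CHUNK / (1024.0 * 1024.0) / (t1 - t0));

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}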



Any idea to increase the speed?

Thanks,
David