Let me amend my last email by making clear that I do not recommend using NFS 
for parallel I/O. But if you have to, make sure your code avoids patterns such 
as read-after-write, or multiple processes writing data that ends up in the 
same file system block (the latter can often be avoided by using collective 
I/O, for example).
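
For what it is worth, here is a minimal sketch of the collective I/O pattern I 
mean (the file name and data layout are made up for illustration): every rank 
writes its own contiguous chunk through MPI_File_write_at_all, so the MPI 
library can aggregate the requests instead of several processes independently 
touching the same file system block.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* each rank owns one contiguous 1024-int chunk of the file */
        enum { COUNT = 1024 };
        int buf[COUNT];
        for (int i = 0; i < COUNT; i++)
            buf[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "output.dat",   /* example file name */
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* collective write: all ranks call this together, so the library
           can aggregate the accesses instead of each process writing into
           (potentially) the same file system block on its own */
        MPI_Offset off = (MPI_Offset) rank * COUNT * sizeof(int);
        MPI_File_write_at_all(fh, off, buf, COUNT, MPI_INT,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }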



-----Original Message-----
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Gabriel, Edgar via 
users
Sent: Thursday, September 23, 2021 5:31 PM
To: Eric Chamberland <eric.chamberl...@giref.ulaval.ca>; Open MPI Users 
<users@lists.open-mpi.org>
Cc: Gabriel, Edgar <egabr...@central.uh.edu>; Louis Poirel 
<louis.poi...@michelin.com>; Vivien Clauzon <vivien.clau...@michelin.com>
Subject: Re: [OMPI users] Status of pNFS, CephFS and MPI I/O

-----Original Message-----
From: Eric Chamberland <eric.chamberl...@giref.ulaval.ca> 

Thanks for your answer, Edgar!

In fact, we are able to use NFS, and certainly any POSIX file system, on a 
single-node basis.

I should have asked instead: which file systems are supported for *multi-node* 
read/write access to files?

-> We have tested it on BeeGFS, GPFS, Lustre, PVFS2/OrangeFS, and NFS, but 
again, if a parallel file system offers the POSIX functions I would expect it 
to work (and yes, I am aware that strict POSIX semantics are not necessarily 
available on parallel file systems). Internally, we are using open, close, 
(p)readv, (p)writev, lock, unlock, and seek.

For NFS, MPI I/O is known to *not* work when using multiple nodes ... except 
for NFS v3 with the "noac" mount option (we are about to test the "actimeo=0" 
option to see if it works).

-> Well, it depends on how you define that. I would suspect that our largest 
user base is actually using NFS v3/v4 with the noac option plus an nfslock 
server. Most I/O patterns will probably work in this scenario, and in fact we 
are passing our entire test suite (which does some nasty things) on a 
multi-node NFS setup. However, it is true that there could be corner cases 
that fail. In addition, parallel I/O on multi-node NFS can be outrageously 
slow, since we lock the *entire* file before every operation (in contrast to 
ROMIO, which only locks the file range that is currently being accessed).
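
For reference, a client-side mount matching that description would look 
roughly like this (server name, export path and mount point are placeholders; 
the actimeo=0 variant mentioned above would simply replace noac in the option 
list):

    mount -t nfs -o vers=3,noac nfsserver:/export/scratch /scratch

The NFS lock service also has to be running on the server and the clients, 
since ompio takes file locks in this scenario.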

Btw, does Open MPI's MPI I/O have some "hidden" (MCA?) options to make a 
multi-node NFS cluster work?

-> ompio recognizes the NFS file system automatically, without requiring an 
MCA parameter. To improve the performance of their code, I usually recommend 
that users try relaxing the locking options and check whether they still 
produce correct data, since most I/O patterns do not require this super-strict 
locking behavior. The relevant parameter is fs_ufs_lock_algorithm.
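
For example (a sketch only; the application name and process count are 
placeholders, and the meaning of each value should be checked with ompi_info 
on your installation):

    # show the fs/ufs parameters, including fs_ufs_lock_algorithm and its default
    ompi_info --param fs ufs --level 9

    # run with the lock algorithm set to one of the reported values
    mpirun --mca fs_ufs_lock_algorithm 1 -np 64 ./my_mpi_io_app

If the results are still correct with a relaxed setting, NFS performance can 
improve considerably, since it is the whole-file locking that makes things 
slow.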

Thanks
Edgar


Thanks,

Eric

On 2021-09-23 1:57 p.m., Gabriel, Edgar wrote:
> Eric,
>
> Generally speaking, ompio should be able to operate correctly on all file 
> systems that have support for POSIX functions. The generic ufs component is, 
> for example, being used on BeeGFS parallel file systems without problems; we 
> are using it on a daily basis. For GPFS, the only reason we handle that file 
> system separately is because of some custom info objects that can be used to 
> configure the file during file_open. If one did not use these info objects, 
> the generic ufs component would be as good as the GPFS-specific component.
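>
> As an illustration of the mechanism only (the hint key below is purely 
> hypothetical, not one of the actual GPFS keys), such info objects are passed 
> at open time through the standard MPI calls:
>
>     MPI_Info info;
>     MPI_Info_create(&info);
>     /* hypothetical hint key, only to show how hints reach file_open */
>     MPI_Info_set(info, "example_gpfs_hint", "true");
>
>     MPI_File fh;
>     MPI_File_open(MPI_COMM_WORLD, "data.out",
>                   MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
>     MPI_Info_free(&info);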
>
> Note that the generic ufs component is also being used for NFS; it has logic 
> built in to recognize an NFS file system and handle some operations slightly 
> differently (but still relying on POSIX functions). The one big exception is 
> Lustre: due to its different file locking strategy, we are required to use a 
> different collective I/O component (dynamic_gen2 vs. vulcan). Generic ufs 
> would work on Lustre, too, but it would be horribly slow.
>
> I cannot comment on CephFS and pNFS since I do not have access to those file 
> systems; it would come down to testing them.
>
> Thanks
> Edgar
>
>
> -----Original Message-----
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Eric 
> Chamberland via users
> Sent: Thursday, September 23, 2021 9:28 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Eric Chamberland <eric.chamberl...@giref.ulaval.ca>; Vivien 
> Clauzon <vivien.clau...@michelin.com>
> Subject: [OMPI users] Status of pNFS, CephFS and MPI I/O
>
> Hi,
>
> I am looking around for information about parallel filesystems supported for 
> MPI I/O.
>
> Clearly, GPFS and Lustre are fully supported, but what about others?
>
> - CephFS
>
> - pNFS
>
> - Other?
>
> When I "grep" for "pnfs\|cephfs" in the ompi source code, I find nothing...
>
> Otherwise, I found this in ompi/mca/common/ompio/common_ompio.h:
>
> enum ompio_fs_type
> {
>       NONE = 0,
>       UFS = 1,
>       PVFS2 = 2,
>       LUSTRE = 3,
>       PLFS = 4,
>       IME = 5,
>       GPFS = 6
> };
>
> Does that mean that other fs types (pNFS, CephFS) do not need special 
> treatment, or that they are not supported, or not optimally supported?
>
> Thanks,
>
> Eric
>
> --
> Eric Chamberland, ing., M. Ing
> Professionnel de recherche
> GIREF/Université Laval
> (418) 656-2131 poste 41 22 42
>
--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
