On Wed, Nov 12, 2014 at 07:56:08PM +0900, Gilles Gouaillardet wrote:
> Folks,
>
> I found (at least) two issues with oshmem put if btl/vader is used with
> knem enabled :
>
> $ oshrun -np 2 --mca btl vader,self ./oshmem_max_reduction
> --------------------------------------------------------------------------
> SHMEM_ABORT was invoked on rank 0 (pid 11936, host=soleil) with
> errorcode -1.
> --------------------------------------------------------------------------
> [soleil.iferc.local:11934] 1 more process has sent help message
> help-shmem-api.txt / shmem-abort
> [soleil.iferc.local:11934] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
>
>
> the error message is not helpful at all ...
> the abort happens in the vader btl in mca_btl_vader_put_knem
>     if (OPAL_UNLIKELY(0 != ioctl (mca_btl_vader.knem_fd,
> KNEM_CMD_INLINE_COPY, &icopy))) {
>         return OPAL_ERROR;
>     }
> ioctl fails with EACCES
>
> the root cause is the symmetric memory was "prepared" with
> vader_prepare_src that uses
>     knem_cr.protection = PROT_READ;
>
> a trivial workaround (probably not good for production) is to
>     knem_cr.protection = PROT_READ|PROT_WRITE;

I should have commented on this earlier. I think this is the appropriate
fix for 1.8. With the old btl interface there is no way to register a
region for both remote read and remote write. The openib btl gets around
this by always registering for read and write.

The new btl interface does not have this problem :). Permissions are
specified when the region is registered, though I still need to work
modifications into the mpool so that the flags get passed down to the
mpool if the btl chooses to use one.

-Nathan
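
To illustrate why registering with PROT_READ only makes the put fail, here is
a minimal toy model in C. It is a sketch under stated assumptions, not the
vader or knem code: struct toy_region, toy_register and toy_put are
hypothetical names invented for this example. Only PROT_READ, PROT_WRITE and
EACCES are real symbols.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>   /* PROT_READ, PROT_WRITE */

/* Hypothetical stand-in for a registered region (not a knem/Open MPI type). */
struct toy_region {
    void  *base;        /* start of the registered buffer */
    size_t len;         /* length of the registered buffer in bytes */
    int    protection;  /* PROT_* bits granted to remote peers */
};

/* Register a buffer; the permissions are fixed at registration time. */
static struct toy_region toy_register(void *base, size_t len, int protection)
{
    struct toy_region r = { base, len, protection };
    return r;
}

/*
 * Model of a remote write ("put") into a registered region.
 * Returns 0 on success, -EACCES if the region was not registered for
 * write, and -EINVAL if the requested range does not fit in the region.
 */
static int toy_put(const struct toy_region *r, size_t off,
                   const void *src, size_t n)
{
    assert(r != NULL && src != NULL);
    if (!(r->protection & PROT_WRITE)) {
        return -EACCES;   /* same errno the quoted ioctl reported */
    }
    if (off > r->len || n > r->len - off) {
        return -EINVAL;
    }
    memcpy((char *)r->base + off, src, n);
    return 0;
}
```

In this model a region created with toy_register(buf, len, PROT_READ) rejects
toy_put with -EACCES, while one created with PROT_READ | PROT_WRITE accepts
it. That mirrors the workaround quoted above: the region has to carry write
permission at registration time for a later remote write to succeed.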