On Wed, Nov 12, 2014 at 07:56:08PM +0900, Gilles Gouaillardet wrote:
> Folks,
>
> I found (at least) two issues with oshmem put if btl/vader is used with
> knem enabled :
>
> $ oshrun -np 2 --mca btl vader,self ./oshmem_max_reduction
> --------------------------------------------------------------------------
> SHMEM_ABORT was invoked on rank 0 (pid 11936, host=soleil) with
> errorcode -1.
> --------------------------------------------------------------------------
> [soleil.iferc.local:11934] 1 more process has sent help message
> help-shmem-api.txt / shmem-abort
> [soleil.iferc.local:11934] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
>
>
> the error message is not helpful at all ...
> the abort happens in the vader btl in mca_btl_vader_put_knem
>     if (OPAL_UNLIKELY(0 != ioctl (mca_btl_vader.knem_fd,
> KNEM_CMD_INLINE_COPY, &icopy))) {
>         return OPAL_ERROR;
>     }
> ioctl fails with EACCES
>
> the root cause is the symmetric memory was "prepared" with
> vader_prepare_src that uses
>     knem_cr.protection = PROT_READ;
>
> a trivial workaround (probably not good for production) is to
>     knem_cr.protection = PROT_READ|PROT_WRITE;
>
>
> then we run into the second issue :
>
> in mca_btl_vader_put_knem :
>     icopy.remote_offset = 0;

I am fixing this one now. Vader should have been calculating the offset (as it
does in the new btl interface). I never test shmem so I didn't catch this one
before.

-Nathan
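
For illustration only, here is a rough sketch of the kind of offset
calculation meant above: instead of hard-coding 0, the offset is the distance
between the remote target address and the start of the registered region. The
struct and function names are hypothetical and are not taken from the vader
btl or from knem. The assumption is that the resulting value is what would be
assigned to icopy.remote_offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of a registered memory region.
 * This is NOT an Open MPI or knem type; it only illustrates the idea. */
struct region {
    uintptr_t base;    /* address of the first registered byte */
    size_t    length;  /* number of bytes registered */
};

/* Compute the offset of addr within the registered region.
 * Returns 0 and stores the offset on success.
 * Returns -1 if [addr, addr + len) is not fully inside the region. */
static int region_offset(const struct region *r, uintptr_t addr,
                         size_t len, uint64_t *offset)
{
    assert(r != NULL && offset != NULL);

    if (addr < r->base) {
        return -1;                      /* starts before the region */
    }

    uintptr_t off = addr - r->base;
    if (off > r->length || len > r->length - off) {
        return -1;                      /* runs past the end of the region */
    }

    *offset = (uint64_t) off;           /* assumed value for the remote offset */
    return 0;
}
```

With a region based at 0x1000 and 256 bytes long, a put of 8 bytes to address
0x1040 yields offset 64, while a put that starts before the base or runs past
the end is rejected rather than silently written at offset 0.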