Re: [OMPI devel] oshmem: put does not work with btl/vader if knem is enabled

2014-11-12 Thread Nathan Hjelm
On Wed, Nov 12, 2014 at 07:56:08PM +0900, Gilles Gouaillardet wrote:
> Folks,
> 
> I found (at least) two issues with oshmem put if btl/vader is used with
> knem enabled :
> 
> $ oshrun -np 2 --mca btl vader,self ./oshmem_max_reduction
> --
> SHMEM_ABORT was invoked on rank 0 (pid 11936, host=soleil) with
> errorcode -1.
> --
> [soleil.iferc.local:11934] 1 more process has sent help message
> help-shmem-api.txt / shmem-abort
> [soleil.iferc.local:11934] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
> 
> 
> the error message is not helpful at all ...
> the abort happens in the vader btl in mca_btl_vader_put_knem
>     if (OPAL_UNLIKELY(0 != ioctl (mca_btl_vader.knem_fd,
>                                   KNEM_CMD_INLINE_COPY, ))) {
>         return OPAL_ERROR;
>     }
> ioctl fails with EACCES
> 
> the root cause is the symmetric memory was "prepared" with
> vader_prepare_src that uses
> knem_cr.protection = PROT_READ;
> 
> a trivial workaround (probably not good for production) is to
> knem_cr.protection = PROT_READ|PROT_WRITE;

I should have commented on this earlier. I think this is the appropriate
fix for 1.8. There is no way with the old btl interface to register a
region for both remote read and remote write. The openib btl gets around
this by always registering for read and write.
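To illustrate why a region registered with PROT_READ only rejects a remote
write, here is a minimal sketch of a permission-mask check. It is not the knem
or vader code: struct region and region_check_access() are hypothetical names
made up for this example, and only the PROT_* constants come from <sys/mman.h>.

```c
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>   /* PROT_READ, PROT_WRITE */

/* Hypothetical stand-in for a registered region (not a knem/vader type). */
struct region {
    int protection;     /* bitmask of PROT_READ / PROT_WRITE */
};

/* Return 0 if every access bit in `wanted` was granted at registration
 * time, or -EACCES if at least one requested bit is missing. */
static int region_check_access(const struct region *r, int wanted)
{
    return ((r->protection & wanted) == wanted) ? 0 : -EACCES;
}
```

With protection = PROT_READ, a check for PROT_WRITE fails with -EACCES, while
a region registered with PROT_READ|PROT_WRITE accepts both reads and writes.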

The new btl interface does not have this problem :). Permissions are
specified when the region is registered... Though I still need to work
in modifications to the mpool to make sure flags get passed down to the
mpool if the btl chooses to use one.

-Nathan




Re: [OMPI devel] oshmem: put does not work with btl/vader if knem is enabled

2014-11-12 Thread Nathan Hjelm

On Wed, Nov 12, 2014 at 07:56:08PM +0900, Gilles Gouaillardet wrote:
> Folks,
> 
> I found (at least) two issues with oshmem put if btl/vader is used with
> knem enabled :
> 
> $ oshrun -np 2 --mca btl vader,self ./oshmem_max_reduction
> --
> SHMEM_ABORT was invoked on rank 0 (pid 11936, host=soleil) with
> errorcode -1.
> --
> [soleil.iferc.local:11934] 1 more process has sent help message
> help-shmem-api.txt / shmem-abort
> [soleil.iferc.local:11934] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
> 
> 
> the error message is not helpful at all ...
> the abort happens in the vader btl in mca_btl_vader_put_knem
>     if (OPAL_UNLIKELY(0 != ioctl (mca_btl_vader.knem_fd,
>                                   KNEM_CMD_INLINE_COPY, ))) {
>         return OPAL_ERROR;
>     }
> ioctl fails with EACCES
> 
> the root cause is the symmetric memory was "prepared" with
> vader_prepare_src that uses
> knem_cr.protection = PROT_READ;
> 
> a trivial workaround (probably not good for production) is to
> knem_cr.protection = PROT_READ|PROT_WRITE;
> 
> 
> then we run into the second issue :
> 
> in mca_btl_vader_put_knem :
> icopy.remote_offset = 0;


I am fixing this one now. Vader should have been calculating the offset
(as it does in the new btl interface). I never test shmem so I didn't
catch this one before.

-Nathan





Re: [OMPI devel] oshmem: put does not work with btl/vader if knem is enabled

2014-11-12 Thread Alexander Mikheev
It looks like we need to use prepare_dst() instead of prepare_src().
I also remember that there was a reason why prepare_src() is used with the
openib btl.

I will take another look.

Alex



[OMPI devel] oshmem: put does not work with btl/vader if knem is enabled

2014-11-12 Thread Gilles Gouaillardet
Folks,

I found (at least) two issues with oshmem put if btl/vader is used with
knem enabled :

$ oshrun -np 2 --mca btl vader,self ./oshmem_max_reduction
--
SHMEM_ABORT was invoked on rank 0 (pid 11936, host=soleil) with
errorcode -1.
--
[soleil.iferc.local:11934] 1 more process has sent help message
help-shmem-api.txt / shmem-abort
[soleil.iferc.local:11934] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages


the error message is not helpful at all ...
the abort happens in the vader btl, in mca_btl_vader_put_knem:

    if (OPAL_UNLIKELY(0 != ioctl (mca_btl_vader.knem_fd,
                                  KNEM_CMD_INLINE_COPY, ))) {
        return OPAL_ERROR;
    }

ioctl fails with EACCES
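
Since the abort message hides the underlying errno, one way to make such a
failure easier to diagnose is to format the errno value at the point where the
call fails. The following is only a sketch: format_syscall_error() is a
hypothetical helper, not part of Open MPI, and it uses nothing beyond
snprintf() and strerror() from the C standard library.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "<call> failed: <reason> (errno N)" into buf.
 * `err` should be the errno value saved right after the failing call.
 * Returns the snprintf() result (length the full message needs). */
static int format_syscall_error(char *buf, size_t len,
                                const char *call, int err)
{
    return snprintf(buf, len, "%s failed: %s (errno %d)",
                    call, strerror(err), err);
}
```

A caller would save errno immediately after the failing ioctl, before any
other library call can overwrite it, and pass that saved value as err.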

the root cause is that the symmetric memory was "prepared" with
vader_prepare_src, which sets
    knem_cr.protection = PROT_READ;

a trivial workaround (probably not good for production) is to set
    knem_cr.protection = PROT_READ|PROT_WRITE;


then we run into the second issue :

in mca_btl_vader_put_knem :
icopy.remote_offset = 0;

and this is clearly not what we want ...
in my environment, we want to put to 0x0600df0, so the remote_offset
should be 0xdf0 since the symmetric memory was "prepared" starting at
0x0600000
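
The offset arithmetic above can be sketched as follows. This is not the vader
code: region_offset() is a hypothetical helper, the base address 0x0600000 is
the value implied by the put address 0x0600df0 and the expected offset 0xdf0,
and the region length used in the usage note is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: compute the offset of `target` inside a registered
 * region that starts at `base` and spans `len` bytes.
 * Returns 0 and stores the offset on success, or -1 if `target` lies
 * outside the region. */
static int region_offset(uintptr_t base, size_t len,
                         uintptr_t target, uint64_t *offset)
{
    if (target < base || target - base >= len) {
        return -1;
    }
    *offset = (uint64_t)(target - base);
    return 0;
}
```

For base = 0x0600000, an assumed length of 0x1000 and target = 0x0600df0, the
helper stores 0xdf0, which is the value remote_offset should carry instead
of 0.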

I do not think the vader btl is to blame here ... I'd rather think the
way yoda uses the btl is not correct (but only for put with the vader btl
when knem is used)

I can get the test program to run correctly by manually setting
icopy.remote_offset with a debugger.

please note I fixed a typo in the vader btl, so make sure you update to the
latest master.


in the meantime, what about forcing put_via_send to 1 in
mca_spml_yoda_put_internal ?
/* another option is to unset the MCA_BTL_FLAGS_PUT flag in the vader
btl if knem is used, but I do not believe this is a vader issue */

Cheers,

Gilles