Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins
Hummm, i intentionally did not swap the two 32 bits (!) from the top level. What we have is:

    typedef struct {
        union {
            uint64_t opal;
            struct {
                uint32_t jobid;
                uint32_t vpid;
            } orte;
        };
    } meta_process_name_t;

OPAL is agnostic about jobid and vpid. jobid and vpid are set in ORTE/MPI and OPAL is used only to transport the 64 bits
/* opal_process_name_t and orte_process_name_t are often casted into each other */
At the ORTE/MPI level, jobid and vpid are set individually
/* e.g. we do *not* do something like opal = jobid | (vpid<<32) */
This is why everything works fine on homogeneous clusters regardless of endianness. Now on a heterogeneous cluster, things get a bit trickier ...

i was initially unhappy with my commit and i think i found out why: this is an abstraction violation! The two 32 bits are not swapped by OPAL because this is what is expected by ORTE/OMPI.

Now i'd like to suggest the following lightweight approach:
- at the OPAL level, use #if-protected htonll/ntohll (e.g. swap the two 32 bits) to do the trick
- at the ORTE level, simply replace

    struct orte_process_name_t {
        orte_jobid_t jobid;
        orte_vpid_t vpid;
    };

with

    #if OPAL_ENABLE_HETEROGENEOUS_SUPPORT && !defined(WORDS_BIGENDIAN)
    struct orte_process_name_t {
        orte_vpid_t vpid;
        orte_jobid_t jobid;
    };
    #else
    struct orte_process_name_t {
        orte_jobid_t jobid;
        orte_vpid_t vpid;
    };
    #endif

so we keep OPAL agnostic about how the uint64_t is really used at the upper level.

Another option is to make OPAL aware of jobid and vpid, but this is a bit more heavyweight imho.

i'll try this today and make sure it works. Any thoughts?

Cheers,

Gilles

On Wed, Aug 6, 2014 at 8:17 AM, Ralph Castain wrote:
> Ah yes, so it is - sorry I missed that last test :-/
>
> On Aug 5, 2014, at 10:50 AM, George Bosilca wrote:
>
> The code committed by Gilles is correctly protected for big endian
> (https://svn.open-mpi.org/trac/ompi/changeset/32425). I was merely
> pointing out that I think he should also swap the 2 32 bits in his
> implementation.
>
> George.
> > > On Tue, Aug 5, 2014 at 1:32 PM, Ralph Castain wrote:
> >>
> >> On Aug 5, 2014, at 10:23 AM, George Bosilca wrote:
> >>
> >> On Tue, Aug 5, 2014 at 1:15 PM, Ralph Castain wrote:
> >>
> >>> Hmmm...wouldn't that then require that you know (a) the other side is
> >>> little endian, and (b) that you are on a big endian? Otherwise, you wind up
> >>> with the same issue in reverse, yes?
> >>
> >> This is similar to the 32 bits ntohl that we are using in other parts of
> >> the project. Any little endian participant will do the conversion, while
> >> every big endian participant will use an empty macro instead.
> >>
> >>> In the ORTE methods, we explicitly set the fields (e.g., jobid =
> >>> ntohl(remote->jobid)) to get around this problem. I missed that he did it by
> >>> location instead of named fields - perhaps we should do that instead?
> >>
> >> As soon as we impose the ORTE naming scheme at the OPAL level (aka. the
> >> notion of jobid and vpid) this approach will become possible.
> >>
> >> Not proposing that at all so long as the other method will work without
> >> knowing the other side's endianness. Sounds like your approach should work
> >> fine as long as Gilles adds a #if so big endian defines the macro away
> >>
> >> George.
> >>
> >>> On Aug 5, 2014, at 10:06 AM, George Bosilca wrote:
> >>>
> >>> Technically speaking, converting a 64 bits value to a big endian
> >>> representation requires the swap of the 2 32 bits parts. So the correct
> >>> approach would have been:
> >>>
> >>>     uint64_t htonll(uint64_t v)
> >>>     {
> >>>         return (((uint64_t)ntohl(v)) << 32) | ((uint64_t)ntohl(v >> 32));
> >>>     }
> >>>
> >>> George.
> >>>
> >>> On Tue, Aug 5, 2014 at 5:52 AM, Ralph Castain wrote:
> >>> FWIW: that's exactly how we do it in ORTE
> >>>
> >>> On Aug 4, 2014, at 10:25 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
> >>>
> >>> George,
> >>>
> >>> i confirm there was a problem when running on an heterogeneous cluster,
> >>> this is now fixed in r32425.
> >>>
> >>> i am not convinced i chose the most elegant way to achieve the desired result ...
could you please double check this commit ? Thanks, Gilles On 2014/08/02 0:14, George Bosilca wrote: Gilles, The design of the BTL move was to let the opal_process_name_t be agnostic to what is stored inside, and all accesses should be done through the provided accessors. Thus, big endian or little endian doesn’t make a difference, as long as everything goes through the accessors. I’m skeptical about the support of heterogeneous environments in the current code, so I didn’t pay much attention to handling the case in the TCP BTL. But in case we do care it is enough to make the 2 macros point
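Gilles's field-swap proposal above can be sketched as a self-contained demo. Names like `name_sketch`, `htonll_sketch`, and `name_to_wire` are illustrative, not the actual ORTE/OPAL symbols, and autoconf's `WORDS_BIGENDIAN` is approximated here with GCC's predefined `__BYTE_ORDER__`. The point of the sketch: with the compile-time field flip on little endian plus an OPAL-level 64-bit swap, the jobid always lands first on the wire, on either host.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Sketch only (not the real ORTE header): field order is flipped on
 * little-endian hosts so that a plain 64-bit byte swap at the OPAL level
 * leaves jobid/vpid in the expected wire positions. */
#define HETERO_SKETCH 1  /* stand-in for OPAL_ENABLE_HETEROGENEOUS_SUPPORT */

#if HETERO_SKETCH && defined(__BYTE_ORDER__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
struct name_sketch { uint32_t vpid; uint32_t jobid; };
#else
struct name_sketch { uint32_t jobid; uint32_t vpid; };
#endif

_Static_assert(sizeof(struct name_sketch) == sizeof(uint64_t),
               "name must transport as a single 64-bit value");

/* OPAL-level swap of the two 32-bit halves, as discussed in the thread */
static uint64_t htonll_sketch(uint64_t v)
{
    return ((uint64_t)htonl((uint32_t)v) << 32) | htonl((uint32_t)(v >> 32));
}

/* Serialize a name: view it as one uint64_t, swap, emit the raw bytes.
 * jobid must come out first, in network byte order. */
static void name_to_wire(struct name_sketch n, unsigned char out[8])
{
    uint64_t as64;
    memcpy(&as64, &n, sizeof(as64));
    as64 = htonll_sketch(as64);
    memcpy(out, &as64, sizeof(as64));
}
```

Whether the host is big or little endian, the wire bytes come out identical, which is exactly the property the struct reordering is meant to preserve.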
Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins
Ah yes, so it is - sorry I missed that last test :-/

On Aug 5, 2014, at 10:50 AM, George Bosilca wrote:
> The code committed by Gilles is correctly protected for big endian
> (https://svn.open-mpi.org/trac/ompi/changeset/32425). I was merely pointing
> out that I think he should also swap the 2 32 bits in his implementation.
>
> George.
>
> On Tue, Aug 5, 2014 at 1:32 PM, Ralph Castain wrote:
>
> On Aug 5, 2014, at 10:23 AM, George Bosilca wrote:
>
>> On Tue, Aug 5, 2014 at 1:15 PM, Ralph Castain wrote:
>> Hmmm...wouldn't that then require that you know (a) the other side is little
>> endian, and (b) that you are on a big endian? Otherwise, you wind up with
>> the same issue in reverse, yes?
>>
>> This is similar to the 32 bits ntohl that we are using in other parts of the
>> project. Any little endian participant will do the conversion, while every
>> big endian participant will use an empty macro instead.
>>
>> In the ORTE methods, we explicitly set the fields (e.g., jobid =
>> ntohl(remote->jobid)) to get around this problem. I missed that he did it by
>> location instead of named fields - perhaps we should do that instead?
>>
>> As soon as we impose the ORTE naming scheme at the OPAL level (aka. the
>> notion of jobid and vpid) this approach will become possible.
>
> Not proposing that at all so long as the other method will work without
> knowing the other side's endianness. Sounds like your approach should work
> fine as long as Gilles adds a #if so big endian defines the macro away
>
>> George.
>>
>> On Aug 5, 2014, at 10:06 AM, George Bosilca wrote:
>>
>>> Technically speaking, converting a 64 bits value to a big endian
>>> representation requires the swap of the 2 32 bits parts. So the correct
>>> approach would have been:
>>>
>>>     uint64_t htonll(uint64_t v)
>>>     {
>>>         return (((uint64_t)ntohl(v)) << 32) | ((uint64_t)ntohl(v >> 32));
>>>     }
>>>
>>> George.
>>> >>> >>> >>> On Tue, Aug 5, 2014 at 5:52 AM, Ralph Castain wrote: >>> FWIW: that's exactly how we do it in ORTE >>> >>> On Aug 4, 2014, at 10:25 PM, Gilles Gouaillardet >>> wrote: >>> George, i confirm there was a problem when running on an heterogeneous cluster, this is now fixed in r32425. i am not convinced i chose the most elegant way to achieve the desired result ... could you please double check this commit ? Thanks, Gilles On 2014/08/02 0:14, George Bosilca wrote: > Gilles, > > The design of the BTL move was to let the opal_process_name_t be agnostic > to what is stored inside, and all accesses should be done through the > provided accessors. Thus, big endian or little endian doesn’t make a > difference, as long as everything goes through the accessors. > > I’m skeptical about the support of heterogeneous environments in the > current code, so I didn’t pay much attention to handling the case in the > TCP BTL. But in case we do care it is enough to make the 2 macros point > to something meaningful instead of being empty (bswap_64 or something). > > George. > > On Aug 1, 2014, at 06:52 , Gilles Gouaillardet > wrote: > >> George and Ralph, >> >> i am very confused whether there is an issue or not. >> >> >> anyway, today Paul and i ran basic tests on big endian machines and did >> not face any issue related to big endianness. >> >> so i made my homework, digged into the code, and basically, >> opal_process_name_t is used as an orte_process_name_t. >> for example, in ompi_proc_init : >> >> OMPI_CAST_ORTE_NAME(>super.proc_name)->jobid = >> OMPI_PROC_MY_NAME->jobid; >> OMPI_CAST_ORTE_NAME(>super.proc_name)->vpid = i; >> >> and with >> >> #define OMPI_CAST_ORTE_NAME(a) ((orte_process_name_t*)(a)) >> >> so as long as an opal_process_name_t is used as an orte_process_name_t, >> there is no problem, >> regardless the endianness of the homogenous cluster we are running on. 
>>
>> for the sake of readability (and for being pedantic too ;-) ) in r32357,
>> _temp->super.proc_name
>> could be replaced with
>> OMPI_CAST_ORTE_NAME(_temp->super.proc_name)
>>
>> That being said, in btl/tcp, i noticed :
>>
>> in mca_btl_tcp_component_recv_handler :
>>
>>     opal_process_name_t guid;
>>     [...]
>>     /* recv the process identifier */
>>     retval = recv(sd, (char *)&guid, sizeof(guid), 0);
>>     if (retval != sizeof(guid)) {
>>         CLOSE_THE_SOCKET(sd);
>>         return;
>>     }
>>     OPAL_PROCESS_NAME_NTOH(guid);
>>
>> and in
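The empty OPAL_PROCESS_NAME_NTOH/HTON macros quoted above are the suspected problem on heterogeneous clusters. A hedged sketch of the fix George suggests (make them a real 64-bit swap on little endian, a no-op on big endian); the macro names below are stand-ins, not the actual OPAL headers:

```c
#include <stdint.h>

/* Illustrative only: real code would key this off autoconf's
 * WORDS_BIGENDIAN and OPAL_ENABLE_HETEROGENEOUS_SUPPORT; here we use
 * GCC's predefined __BYTE_ORDER__ to keep the demo self-contained. */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#  define PROC_NAME_HTON(g) do { } while (0)  /* already network order */
#  define PROC_NAME_NTOH(g) do { } while (0)
#else
#  define PROC_NAME_HTON(g) ((g) = __builtin_bswap64(g))
#  define PROC_NAME_NTOH(g) ((g) = __builtin_bswap64(g))
#endif
```

With these definitions, the guid bytes on the wire are identical regardless of which side is big endian, which is what the TCP BTL handshake needs.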
Re: [OMPI devel] [1.8.2rc3] static linking fails on linux when not building ROMIO
Gilles,

I have not tested your patch. I've only read it. It looks like it could work, except that libopen-rte.a depends on libsocket and libnsl on Solaris. So one probably needs to add $LIBS to the ORTE wrapper libs as well.

Additionally, if your approach is the correct one, then I think one can fold:

      OPAL_FLAGS_APPEND_UNIQ([OPAL_WRAPPER_EXTRA_LIBS], [$wrapper_extra_libs])
      OPAL_WRAPPER_EXTRA_LIBS="$OPAL_WRAPPER_EXTRA_LIBS $with_wrapper_libs"
    + OPAL_FLAGS_APPEND_UNIQ([OPAL_WRAPPER_EXTRA_LIBS], [$LIBS])
    + OPAL_WRAPPER_EXTRA_LIBS="$OPAL_WRAPPER_EXTRA_LIBS $with_wrapper_libs"

into just

    - OPAL_FLAGS_APPEND_UNIQ([OPAL_WRAPPER_EXTRA_LIBS], [$wrapper_extra_libs])
    + OPAL_FLAGS_APPEND_UNIQ([OPAL_WRAPPER_EXTRA_LIBS], [$wrapper_extra_libs $LIBS])

which merges the two calls to OPAL_FLAGS_APPEND_UNIQ and avoids double-adding the user's $with_wrapper_libs. And of course the same 1-line change would apply for the OMPI and eventually ORTE variables too.

I'd like to wait until Jeff has had a chance to look this over before I devote time to testing. Since I've determined already that 1.6.5 did not have the problem while 1.7.x does, the possibility exists that some smaller change might restore whatever was lost between the v1.6 and v1.7 branches.

-Paul

On Tue, Aug 5, 2014 at 1:33 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
> Here is a patch that has been minimally tested.
>
> this is likely an overkill (at least when dynamic libraries can be used),
> but it does the job so far ...
>
> Cheers,
>
> Gilles
>
> On 2014/08/05 16:56, Gilles Gouaillardet wrote:
> from libopen-pal.la :
> dependency_libs=' -lrdmacm -libverbs -lscif -lnuma -ldl -lrt -lnsl -lutil -lm'
>
> i confirm mpicc fails linking
>
> but FWIW, using libtool does work (!)
>
> could the bug come from the mpicc (and other) wrappers ?
> > Gilles > > $ gcc -g -O0 -o hw /csc/home1/gouaillardet/hw.c > -I/tmp/install/ompi.noromio/include -pthread -L/usr/lib64 -Wl,-rpath > -Wl,/usr/lib64 -Wl,-rpath -Wl,/tmp/install/ompi.noromio/lib > -Wl,--enable-new-dtags -L/tmp/install/ompi.noromio/lib -lmpi -lopen-rte > -lopen-pal -lm -lnuma -libverbs -lscif -lrdmacm -ldl -llustreapi > > $ /tmp/install/ompi.noromio/bin/mpicc -g -O0 -o hw -show ~/hw.c > gcc -g -O0 -o hw /csc/home1/gouaillardet/hw.c > -I/tmp/install/ompi.noromio/include -pthread -L/usr/lib64 -Wl,-rpath > -Wl,/usr/lib64 -Wl,-rpath -Wl,/tmp/install/ompi.noromio/lib > -Wl,--enable-new-dtags -L/tmp/install/ompi.noromio/lib -lmpi -lopen-rte > -lopen-pal -lm -lnuma -libverbs -lscif -lrdmacm -ldl -llustreapi > [gouaillardet@soleil build]$ /tmp/install/ompi.noromio/bin/mpicc -g -O0 > -o hw ~/hw.c > /tmp/install/ompi.noromio/lib/libmpi.a(fbtl_posix_ipwritev.o): In > function `mca_fbtl_posix_ipwritev': > fbtl_posix_ipwritev.c:(.text+0x17b): undefined reference to `aio_write' > fbtl_posix_ipwritev.c:(.text+0x237): undefined reference to `aio_write' > fbtl_posix_ipwritev.c:(.text+0x3f4): undefined reference to `aio_write' > fbtl_posix_ipwritev.c:(.text+0x48e): undefined reference to `aio_write' > /tmp/install/ompi.noromio/lib/libopen-pal.a(opal_pty.o): In function > `opal_openpty': > opal_pty.c:(.text+0x1): undefined reference to `openpty' > /tmp/install/ompi.noromio/lib/libopen-pal.a(event.o): In function > `event_add_internal': > event.c:(.text+0x288d): undefined reference to `clock_gettime' > > $ /bin/sh ./static/libtool --silent --tag=CC --mode=compile gcc > -std=gnu99 -I/tmp/install/ompi.noromio/include -c ~/hw.c > $ /bin/sh ./static/libtool --silent --tag=CC --mode=link gcc > -std=gnu99 -o hw hw.o -L/tmp/install/ompi.noromio/lib -lmpi > $ ldd hw > linux-vdso.so.1 => (0x7fff7530d000) > librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x7f0ed541e000) > libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x7f0ed521) > libscif.so.0 => /usr/lib64/libscif.so.0 
(0x003b9c60) > libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x003ba560) > libdl.so.2 => /lib64/libdl.so.2 (0x003b9be0) > librt.so.1 => /lib64/librt.so.1 (0x003b9ca0) > libnsl.so.1 => /lib64/libnsl.so.1 (0x003bae20) > libutil.so.1 => /lib64/libutil.so.1 (0x003bac60) > libm.so.6 => /lib64/libm.so.6 (0x003b9ba0) > libpthread.so.0 => /lib64/libpthread.so.0 (0x003b9c20) > libc.so.6 => /lib64/libc.so.6 (0x003b9b60) > /lib64/ld-linux-x86-64.so.2 (0x003b9b20) > > > > > On 2014/08/05 7:56, Ralph Castain wrote: > > My thought was to post initially as a blocker, pending a discussion with > Jeff at tomorrow's telecon. If he thinks this is something we can fix in some > central point (thus catching it everywhere), then it could be quick and worth > doing. However, I'm skeptical as I tried to do that in the most obvious > place, and it failed (could be operator error). > > Will let you know tomorrow. Truly appreciate your digging on this! > Ralph > > On Aug 4,
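The folding Paul proposes above relies on the dedup behavior of OPAL_FLAGS_APPEND_UNIQ: a flag is appended only if it is not already present, so passing `$wrapper_extra_libs $LIBS` in one call cannot double-add anything. A toy model of that behavior (illustrative C; the real macro is m4/shell in Open MPI's configure machinery):

```c
#include <stdio.h>
#include <string.h>

/* Toy model of OPAL_FLAGS_APPEND_UNIQ's dedup rule: append `flag` to the
 * space-separated `flags` string only if it is not already present as a
 * whole word. `cap` is the capacity of the flags buffer. */
static void append_uniq(char *flags, size_t cap, const char *flag)
{
    char haystack[256], needle[256];

    /* pad with spaces so "-lrt" cannot match inside "-lrte" */
    snprintf(haystack, sizeof(haystack), " %s ", flags);
    snprintf(needle, sizeof(needle), " %s ", flag);
    if (strstr(haystack, needle) != NULL)
        return;  /* already there: avoid double-adding */

    if (flags[0] != '\0')
        strncat(flags, " ", cap - strlen(flags) - 1);
    strncat(flags, flag, cap - strlen(flags) - 1);
}
```

Under this rule, appending `-lrt` to a list that already contains it is a no-op, which is why folding the two macro calls into one is safe.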
Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value
Thanks to Paul's help all the inlined atomics have been tested. The new patch is attached below. However, this only fixes the inline atomics; all those generated from the *.asm files have not been updated. Any volunteer?

George.

atomics.patch
Description: Binary data

On Aug 1, 2014, at 18:09, Paul Hargrove wrote:

I have confirmed that George's latest version works on both SPARC ABIs. ARMv7 and three MIPS ABIs still pending...

-Paul

On Fri, Aug 1, 2014 at 9:40 AM, George Bosilca wrote:

Another version of the atomic patch. Paul has tested it on a bunch of platforms. At this point we have confirmation from all architectures except SPARC (v8+ and v9).

George.

On Jul 31, 2014, at 19:13, George Bosilca wrote:
> All,
>
> Here is the patch that changes the meaning of the atomics to make them always
> return the previous value (similar to __sync_fetch_and_<*>). I tested this
> with the following atomics: OS X, gcc style intrinsics and AMD64.
>
> I did not change the base assembly files used when GCC style assembly
> operations are not supported. If someone feels like fixing them, feel free.
>
> Paul, I know you have a pretty diverse range of computers. Can you try to
> compile and run a "make check" with the following patch?
>
> George.
>
> On Jul 30, 2014, at 15:21, Nathan Hjelm wrote:
>> That is what I would prefer. I was trying to not disturb things too
>> much :). Please bring the changes over!
>> -Nathan
>>
>> On Wed, Jul 30, 2014 at 03:18:44PM -0400, George Bosilca wrote:
>>> Why do you want to add new versions? This will lead to having two, almost
>>> identical, sets of atomics that are conceptually equivalent but different
>>> in terms of code. And we will have to maintain both!
>>> I did a similar change in a fork of OPAL in another project but instead of
>>> adding another flavor of atomics, I completely replaced the available ones
>>> with a set returning the old value. I can bring the code over.
>>> George.
>>
>> On Tue, Jul 29, 2014 at 5:29 PM, Paul Hargrove wrote:
>>> On Tue, Jul 29, 2014 at 2:10 PM, Nathan Hjelm wrote:
>>>> Is there a reason why the current implementations of opal atomics
>>>> (add, cmpset) do not return the old value?
>>>
>>> Because some CPUs don't implement such an atomic instruction?
>>> On any CPU one *can* certainly synthesize the desired operation with an
>>> added read before the compare-and-swap to return a value that was
>>> present at some time before a failed cmpset. That is almost certainly
>>> sufficient for your purposes. However, the added load makes it
>>> (marginally) more expensive on some CPUs that only have the native
>>> equivalent of gcc's __sync_bool_compare_and_swap().
>>> -Paul
>>>
>>> --
>>> Paul H. Hargrove phhargr...@lbl.gov
>>> Future Technologies Group
>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

--
Paul H. Hargrove phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15465.php
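The distinction the thread keeps returning to — a compare-and-swap that returns the old value versus one that returns only success/failure — can be shown with gcc's `__sync` builtins. `cas_fetch` below is a hypothetical helper name, a sketch of the "always return the previous value" semantics George proposes, not the actual OPAL atomics API:

```c
#include <stdint.h>

/* Value-returning CAS: on success the return equals `expected`; on
 * failure it reports what was actually in memory, with no extra load
 * (this is what __sync_val_compare_and_swap provides natively). */
static int32_t cas_fetch(volatile int32_t *addr, int32_t expected, int32_t desired)
{
    return __sync_val_compare_and_swap(addr, expected, desired);
}
```

As Paul notes, on CPUs whose native primitive is only the boolean flavor (`__sync_bool_compare_and_swap` style), this has to be synthesized with an added read, making it marginally more expensive there.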
Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins
The code committed by Gilles is correctly protected for big endian
(https://svn.open-mpi.org/trac/ompi/changeset/32425). I was merely pointing
out that I think he should also swap the 2 32 bits in his implementation.

George.

On Tue, Aug 5, 2014 at 1:32 PM, Ralph Castain wrote:
>
> On Aug 5, 2014, at 10:23 AM, George Bosilca wrote:
>
> On Tue, Aug 5, 2014 at 1:15 PM, Ralph Castain wrote:
>
>> Hmmm...wouldn't that then require that you know (a) the other side is
>> little endian, and (b) that you are on a big endian? Otherwise, you wind up
>> with the same issue in reverse, yes?
>
> This is similar to the 32 bits ntohl that we are using in other parts of
> the project. Any little endian participant will do the conversion, while
> every big endian participant will use an empty macro instead.
>
>> In the ORTE methods, we explicitly set the fields (e.g., jobid =
>> ntohl(remote->jobid)) to get around this problem. I missed that he did it by
>> location instead of named fields - perhaps we should do that instead?
>
> As soon as we impose the ORTE naming scheme at the OPAL level (aka. the
> notion of jobid and vpid) this approach will become possible.
>
> Not proposing that at all so long as the other method will work without
> knowing the other side's endianness. Sounds like your approach should work
> fine as long as Gilles adds a #if so big endian defines the macro away
>
> George.
>
>> On Aug 5, 2014, at 10:06 AM, George Bosilca wrote:
>>
>> Technically speaking, converting a 64 bits value to a big endian
>> representation requires the swap of the 2 32 bits parts. So the correct
>> approach would have been:
>>
>>     uint64_t htonll(uint64_t v)
>>     {
>>         return (((uint64_t)ntohl(v)) << 32) | ((uint64_t)ntohl(v >> 32));
>>     }
>>
>> George.
>> >> >> >> On Tue, Aug 5, 2014 at 5:52 AM, Ralph Castain wrote: >> >>> FWIW: that's exactly how we do it in ORTE >>> >>> On Aug 4, 2014, at 10:25 PM, Gilles Gouaillardet < >>> gilles.gouaillar...@iferc.org> wrote: >>> >>> George, >>> >>> i confirm there was a problem when running on an heterogeneous cluster, >>> this is now fixed in r32425. >>> >>> i am not convinced i chose the most elegant way to achieve the desired >>> result ... >>> could you please double check this commit ? >>> >>> Thanks, >>> >>> Gilles >>> >>> On 2014/08/02 0:14, George Bosilca wrote: >>> >>> Gilles, >>> >>> The design of the BTL move was to let the opal_process_name_t be agnostic >>> to what is stored inside, and all accesses should be done through the >>> provided accessors. Thus, big endian or little endian doesn’t make a >>> difference, as long as everything goes through the accessors. >>> >>> I’m skeptical about the support of heterogeneous environments in the >>> current code, so I didn’t pay much attention to handling the case in the >>> TCP BTL. But in case we do care it is enough to make the 2 macros point to >>> something meaningful instead of being empty (bswap_64 or something). >>> >>> George. >>> >>> On Aug 1, 2014, at 06:52 , Gilles Gouaillardet >>> wrote: >>> >>> >>> George and Ralph, >>> >>> i am very confused whether there is an issue or not. >>> >>> >>> anyway, today Paul and i ran basic tests on big endian machines and did not >>> face any issue related to big endianness. >>> >>> so i made my homework, digged into the code, and basically, >>> opal_process_name_t is used as an orte_process_name_t. 
>>> for example, in ompi_proc_init :
>>>
>>>     OMPI_CAST_ORTE_NAME(>super.proc_name)->jobid = OMPI_PROC_MY_NAME->jobid;
>>>     OMPI_CAST_ORTE_NAME(>super.proc_name)->vpid = i;
>>>
>>> and with
>>>
>>>     #define OMPI_CAST_ORTE_NAME(a) ((orte_process_name_t*)(a))
>>>
>>> so as long as an opal_process_name_t is used as an orte_process_name_t,
>>> there is no problem,
>>> regardless the endianness of the homogenous cluster we are running on.
>>>
>>> for the sake of readability (and for being pedantic too ;-) ) in r32357,
>>> _temp->super.proc_name
>>> could be replaced with
>>> OMPI_CAST_ORTE_NAME(_temp->super.proc_name)
>>>
>>> That being said, in btl/tcp, i noticed :
>>>
>>> in mca_btl_tcp_component_recv_handler :
>>>
>>>     opal_process_name_t guid;
>>>     [...]
>>>     /* recv the process identifier */
>>>     retval = recv(sd, (char *)&guid, sizeof(guid), 0);
>>>     if (retval != sizeof(guid)) {
>>>         CLOSE_THE_SOCKET(sd);
>>>         return;
>>>     }
>>>     OPAL_PROCESS_NAME_NTOH(guid);
>>>
>>> and in mca_btl_tcp_endpoint_send_connect_ack :
>>>
>>>     /* send process identifier to remote endpoint */
>>>     opal_process_name_t guid = btl_proc->proc_opal->proc_name;
>>>
>>>     OPAL_PROCESS_NAME_HTON(guid);
>>>     if (mca_btl_tcp_endpoint_send_blocking(btl_endpoint, &guid, sizeof(guid)) !=
>>>
>>> and with
>>>
>>>     #define OPAL_PROCESS_NAME_NTOH(guid)
>>>     #define
Re: [OMPI devel] canceling buffered send request with pml/cm
Yossi,

I think you raised an interesting corner-case, and a possible bug in the MTL implementation. As the request is marked as complete by the CM/PML, the cancel should never succeed. As the CM/PML is forcing the completion on all Bsend requests, it should also enforce that all completed requests cannot be cancelled (instead of leaving this task to the MTL).

I think the cleanest approach will be to allow the MTL itself to handle the complete case, by moving the code you pinpointed (MCA_PML_CM_HVY_SEND_REQUEST_START) from the CM/PML down into each MTL send case (they can check for buffered send requests). This approach will possibly allow an MTL to implement cancel of sends.

George.

On Aug 4, 2014, at 09:49, Yossi Etigin wrote:
> Hi,
>
> Seems like it's impossible to cancel buffered sends with pml/cm.
>
> From one hand, pml/cm completes the buffered send immediately
> (MCA_PML_CM_HVY_SEND_REQUEST_START):
>
>     if (OMPI_SUCCESS == ret &&                                               \
>         sendreq->req_send.req_send_mode == MCA_PML_BASE_SEND_BUFFERED) {     \
>         sendreq->req_send.req_base.req_ompi.req_status.MPI_ERROR = 0;        \
>         ompi_request_complete(&(sendreq)->req_send.req_base.req_ompi, true); \
>     }
>
> So, if the user is doing Bsend()/Cancel()/Wait()/Test_canceled(), the Wait()
> would be a no-op.
> Therefore when mtl_cancel() was called, it had to either cancel or guarantee
> completion *immediately*, otherwise the return from Test_canceled would be
> undefined.
> However, it's not always possible to cancel immediately, because we need to
> make sure the peer has not matched it yet (for example, with mtl mxm).
>
> IMHO it's wrong for pml_cm to complete a buffered send immediately.
> What do you think?
>
> --Yossi
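The interplay Yossi describes can be reduced to a toy model: if a buffered send is marked complete at start time, a later cancel must be refused rather than left to the MTL. The `toy_*` names below are hypothetical, a sketch of the request-state logic only, not the actual pml/cm data structures:

```c
/* Minimal model of the request lifecycle under discussion */
typedef struct {
    int complete;   /* request already completed? */
    int cancelled;  /* cancel succeeded? */
} toy_req_t;

static void toy_bsend_start(toy_req_t *r)
{
    r->cancelled = 0;
    r->complete  = 1;  /* CM completes buffered sends immediately */
}

/* Returns 1 if the cancel succeeded, 0 if it must be refused */
static int toy_cancel(toy_req_t *r)
{
    if (r->complete)
        return 0;      /* already complete: cancel can never succeed */
    r->cancelled = 1;
    return 1;
}
```

In this model a Bsend()/Cancel()/Test_canceled() sequence always reports "not cancelled", which is the behavior George argues the PML should enforce uniformly instead of each MTL.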
Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins
On Aug 5, 2014, at 10:23 AM, George Bosilca wrote:

> On Tue, Aug 5, 2014 at 1:15 PM, Ralph Castain wrote:
> Hmmm...wouldn't that then require that you know (a) the other side is little
> endian, and (b) that you are on a big endian? Otherwise, you wind up with the
> same issue in reverse, yes?
>
> This is similar to the 32 bits ntohl that we are using in other parts of the
> project. Any little endian participant will do the conversion, while every
> big endian participant will use an empty macro instead.
>
> In the ORTE methods, we explicitly set the fields (e.g., jobid =
> ntohl(remote->jobid)) to get around this problem. I missed that he did it by
> location instead of named fields - perhaps we should do that instead?
>
> As soon as we impose the ORTE naming scheme at the OPAL level (aka. the
> notion of jobid and vpid) this approach will become possible.

Not proposing that at all so long as the other method will work without knowing the other side's endianness. Sounds like your approach should work fine as long as Gilles adds a #if so big endian defines the macro away

> George.
>
> On Aug 5, 2014, at 10:06 AM, George Bosilca wrote:
>
>> Technically speaking, converting a 64 bits value to a big endian
>> representation requires the swap of the 2 32 bits parts. So the correct
>> approach would have been:
>>
>>     uint64_t htonll(uint64_t v)
>>     {
>>         return (((uint64_t)ntohl(v)) << 32) | ((uint64_t)ntohl(v >> 32));
>>     }
>>
>> George.
>>
>> On Tue, Aug 5, 2014 at 5:52 AM, Ralph Castain wrote:
>> FWIW: that's exactly how we do it in ORTE
>>
>> On Aug 4, 2014, at 10:25 PM, Gilles Gouaillardet wrote:
>>
>>> George,
>>>
>>> i confirm there was a problem when running on an heterogeneous cluster,
>>> this is now fixed in r32425.
>>>
>>> i am not convinced i chose the most elegant way to achieve the desired
>>> result ...
>>> could you please double check this commit ?
>>> >>> Thanks, >>> >>> Gilles >>> >>> On 2014/08/02 0:14, George Bosilca wrote: Gilles, The design of the BTL move was to let the opal_process_name_t be agnostic to what is stored inside, and all accesses should be done through the provided accessors. Thus, big endian or little endian doesn’t make a difference, as long as everything goes through the accessors. I’m skeptical about the support of heterogeneous environments in the current code, so I didn’t pay much attention to handling the case in the TCP BTL. But in case we do care it is enough to make the 2 macros point to something meaningful instead of being empty (bswap_64 or something). George. On Aug 1, 2014, at 06:52 , Gilles Gouaillardet wrote: > George and Ralph, > > i am very confused whether there is an issue or not. > > > anyway, today Paul and i ran basic tests on big endian machines and did > not face any issue related to big endianness. > > so i made my homework, digged into the code, and basically, > opal_process_name_t is used as an orte_process_name_t. > for example, in ompi_proc_init : > > OMPI_CAST_ORTE_NAME(>super.proc_name)->jobid = > OMPI_PROC_MY_NAME->jobid; > OMPI_CAST_ORTE_NAME(>super.proc_name)->vpid = i; > > and with > > #define OMPI_CAST_ORTE_NAME(a) ((orte_process_name_t*)(a)) > > so as long as an opal_process_name_t is used as an orte_process_name_t, > there is no problem, > regardless the endianness of the homogenous cluster we are running on. > > for the sake of readability (and for being pedantic too ;-) ) in r32357, > _temp->super.proc_name > could be replaced with > OMPI_CAST_ORTE_NAME(_temp->super.proc_name) > > > > That being said, in btl/tcp, i noticed : > > in mca_btl_tcp_component_recv_handler : > > opal_process_name_t guid; > [...] 
> /* recv the process identifier */
> retval = recv(sd, (char *)&guid, sizeof(guid), 0);
> if (retval != sizeof(guid)) {
>     CLOSE_THE_SOCKET(sd);
>     return;
> }
> OPAL_PROCESS_NAME_NTOH(guid);
>
> and in mca_btl_tcp_endpoint_send_connect_ack :
>
> /* send process identifier to remote endpoint */
> opal_process_name_t guid = btl_proc->proc_opal->proc_name;
>
> OPAL_PROCESS_NAME_HTON(guid);
> if (mca_btl_tcp_endpoint_send_blocking(btl_endpoint, &guid, sizeof(guid)) !=
>
> and with
>
> #define OPAL_PROCESS_NAME_NTOH(guid)
> #define OPAL_PROCESS_NAME_HTON(guid)
>
> i had no time yet to test, but for now, i can only suspect :
> - there will be an issue with the tcp btl on an heterogeneous cluster
> - for this case,
Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins
On Tue, Aug 5, 2014 at 1:15 PM, Ralph Castain wrote:

> Hmmm...wouldn't that then require that you know (a) the other side is
> little endian, and (b) that you are on a big endian? Otherwise, you wind up
> with the same issue in reverse, yes?

This is similar to the 32 bits ntohl that we are using in other parts of the project. Any little endian participant will do the conversion, while every big endian participant will use an empty macro instead.

> In the ORTE methods, we explicitly set the fields (e.g., jobid =
> ntohl(remote->jobid)) to get around this problem. I missed that he did it by
> location instead of named fields - perhaps we should do that instead?

As soon as we impose the ORTE naming scheme at the OPAL level (aka. the notion of jobid and vpid) this approach will become possible.

George.

> On Aug 5, 2014, at 10:06 AM, George Bosilca wrote:
>
> Technically speaking, converting a 64 bits value to a big endian
> representation requires the swap of the 2 32 bits parts. So the correct
> approach would have been:
>
>     uint64_t htonll(uint64_t v)
>     {
>         return (((uint64_t)ntohl(v)) << 32) | ((uint64_t)ntohl(v >> 32));
>     }
>
> George.
>
> On Tue, Aug 5, 2014 at 5:52 AM, Ralph Castain wrote:
>
>> FWIW: that's exactly how we do it in ORTE
>>
>> On Aug 4, 2014, at 10:25 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>
>> George,
>>
>> i confirm there was a problem when running on an heterogeneous cluster,
>> this is now fixed in r32425.
>>
>> i am not convinced i chose the most elegant way to achieve the desired
>> result ...
>> could you please double check this commit ?
>>
>> Thanks,
>>
>> Gilles
>>
>> On 2014/08/02 0:14, George Bosilca wrote:
>>
>> Gilles,
>>
>> The design of the BTL move was to let the opal_process_name_t be agnostic to
>> what is stored inside, and all accesses should be done through the provided
>> accessors. Thus, big endian or little endian doesn’t make a difference, as
>> long as everything goes through the accessors.
>> >> I’m skeptical about the support of heterogeneous environments in the current >> code, so I didn’t pay much attention to handling the case in the TCP BTL. >> But in case we do care it is enough to make the 2 macros point to something >> meaningful instead of being empty (bswap_64 or something). >> >> George. >> >> On Aug 1, 2014, at 06:52 , Gilles Gouaillardet >> wrote: >> >> >> George and Ralph, >> >> i am very confused whether there is an issue or not. >> >> >> anyway, today Paul and i ran basic tests on big endian machines and did not >> face any issue related to big endianness. >> >> so i made my homework, digged into the code, and basically, >> opal_process_name_t is used as an orte_process_name_t. >> for example, in ompi_proc_init : >> >> OMPI_CAST_ORTE_NAME(>super.proc_name)->jobid = >> OMPI_PROC_MY_NAME->jobid; >> OMPI_CAST_ORTE_NAME(>super.proc_name)->vpid = i; >> >> and with >> >> #define OMPI_CAST_ORTE_NAME(a) ((orte_process_name_t*)(a)) >> >> so as long as an opal_process_name_t is used as an orte_process_name_t, >> there is no problem, >> regardless the endianness of the homogenous cluster we are running on. >> >> for the sake of readability (and for being pedantic too ;-) ) in r32357, >> _temp->super.proc_name >> could be replaced with >> OMPI_CAST_ORTE_NAME(_temp->super.proc_name) >> >> >> >> That being said, in btl/tcp, i noticed : >> >> in mca_btl_tcp_component_recv_handler : >> >> opal_process_name_t guid; >> [...] 
>> /* recv the process identifier */ >> retval = recv(sd, (char *), sizeof(guid), 0); >> if(retval != sizeof(guid)) { >> CLOSE_THE_SOCKET(sd); >> return; >> } >> OPAL_PROCESS_NAME_NTOH(guid); >> >> and in mca_btl_tcp_endpoint_send_connect_ack : >> >> /* send process identifier to remote endpoint */ >> opal_process_name_t guid = btl_proc->proc_opal->proc_name; >> >> OPAL_PROCESS_NAME_HTON(guid); >> if(mca_btl_tcp_endpoint_send_blocking(btl_endpoint, , sizeof(guid)) >> != >> >> and with >> >> #define OPAL_PROCESS_NAME_NTOH(guid) >> #define OPAL_PROCESS_NAME_HTON(guid) >> >> >> i had no time yet to test yet, but for now, i can only suspect : >> - there will be an issue with the tcp btl on an heterogeneous cluster >> - for this case, the fix is to have a different version of the >> OPAL_PROCESS_NAME_xTOy >> on little endian arch if heterogeneous mode is supported. >> >> >> >> does that make sense ? >> >> Cheers, >> >> Gilles >> >> >> On 2014/07/31 1:29, George Bosilca wrote: >> >> The underlying structure changed, so a little bit of fiddling is normal. >> Instead of using a field in the ompi_proc_t you are now using a field down >> in opal_proc_t, a field that simply cannot have the same type as before >> (orte_process_name_t). >> >> George. >> >> >> >> On Wed, Jul 30, 2014 at 12:19 PM, Ralph
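[Editor's note] The thread above observes that OPAL_PROCESS_NAME_NTOH and OPAL_PROCESS_NAME_HTON are currently empty macros, and George suggests pointing them at "bswap_64 or something" when heterogeneous support matters. A minimal sketch of that idea follows; the portable swap helper, the `WORDS_BIGENDIAN` guard, and treating `opal_process_name_t` as an opaque `uint64_t` are assumptions for illustration, not the actual Open MPI implementation:

```c
#include <stdint.h>

/* Sketch only: the thread treats opal_process_name_t as an opaque 64-bit
 * value transported by OPAL; real code has its own type and config macros. */
typedef uint64_t opal_process_name_t;

/* Portable 64-bit byte swap (stand-in for glibc's bswap_64). */
static inline uint64_t my_bswap_64(uint64_t v)
{
    return ((v & 0x00000000000000ffULL) << 56) |
           ((v & 0x000000000000ff00ULL) << 40) |
           ((v & 0x0000000000ff0000ULL) << 24) |
           ((v & 0x00000000ff000000ULL) <<  8) |
           ((v & 0x000000ff00000000ULL) >>  8) |
           ((v & 0x0000ff0000000000ULL) >> 24) |
           ((v & 0x00ff000000000000ULL) >> 40) |
           ((v & 0xff00000000000000ULL) >> 56);
}

#if !defined(WORDS_BIGENDIAN)
/* Little endian: swap to/from network (big-endian) byte order. */
#define OPAL_PROCESS_NAME_HTON(guid) ((guid) = my_bswap_64(guid))
#define OPAL_PROCESS_NAME_NTOH(guid) ((guid) = my_bswap_64(guid))
#else
/* Big endian: already in network order, so the macros stay empty. */
#define OPAL_PROCESS_NAME_HTON(guid)
#define OPAL_PROCESS_NAME_NTOH(guid)
#endif
```

Note that, as Gilles points out later in the thread, a whole-64-bit byte swap also exchanges the two 32-bit halves, which interacts with how ORTE lays out jobid and vpid inside the value.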
Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins
Hmmm...wouldn't that then require that you know (a) the other side is little endian, and (b) that you are on a big endian? Otherwise, you wind up with the same issue in reverse, yes? In the ORTE methods, we explicitly set the fields (e.g., jobid = ntohl(remote-jobid)) to get around this problem. I missed that he did it by location instead of named fields - perhaps we should do that instead? On Aug 5, 2014, at 10:06 AM, George Bosilcawrote: > Technically speaking, converting a 64 bits to a big endian representation > requires the swap of the 2 32 bits parts. So the correct approach would have > been: > uint64_t htonll(uint64_t v) > { > return uint64_t)ntohl(n)) << 32 | (uint64_t)ntohl(n >> 32)); > } > > George. > > > > On Tue, Aug 5, 2014 at 5:52 AM, Ralph Castain wrote: > FWIW: that's exactly how we do it in ORTE > > On Aug 4, 2014, at 10:25 PM, Gilles Gouaillardet > wrote: > >> George, >> >> i confirm there was a problem when running on an heterogeneous cluster, >> this is now fixed in r32425. >> >> i am not convinced i chose the most elegant way to achieve the desired >> result ... >> could you please double check this commit ? >> >> Thanks, >> >> Gilles >> >> On 2014/08/02 0:14, George Bosilca wrote: >>> Gilles, >>> >>> The design of the BTL move was to let the opal_process_name_t be agnostic >>> to what is stored inside, and all accesses should be done through the >>> provided accessors. Thus, big endian or little endian doesn’t make a >>> difference, as long as everything goes through the accessors. >>> >>> I’m skeptical about the support of heterogeneous environments in the >>> current code, so I didn’t pay much attention to handling the case in the >>> TCP BTL. But in case we do care it is enough to make the 2 macros point to >>> something meaningful instead of being empty (bswap_64 or something). >>> >>> George. >>> >>> On Aug 1, 2014, at 06:52 , Gilles Gouaillardet >>> wrote: >>> George and Ralph, i am very confused whether there is an issue or not. 
anyway, today Paul and i ran basic tests on big endian machines and did not face any issue related to big endianness. so i made my homework, digged into the code, and basically, opal_process_name_t is used as an orte_process_name_t. for example, in ompi_proc_init : OMPI_CAST_ORTE_NAME(>super.proc_name)->jobid = OMPI_PROC_MY_NAME->jobid; OMPI_CAST_ORTE_NAME(>super.proc_name)->vpid = i; and with #define OMPI_CAST_ORTE_NAME(a) ((orte_process_name_t*)(a)) so as long as an opal_process_name_t is used as an orte_process_name_t, there is no problem, regardless the endianness of the homogenous cluster we are running on. for the sake of readability (and for being pedantic too ;-) ) in r32357, _temp->super.proc_name could be replaced with OMPI_CAST_ORTE_NAME(_temp->super.proc_name) That being said, in btl/tcp, i noticed : in mca_btl_tcp_component_recv_handler : opal_process_name_t guid; [...] /* recv the process identifier */ retval = recv(sd, (char *), sizeof(guid), 0); if(retval != sizeof(guid)) { CLOSE_THE_SOCKET(sd); return; } OPAL_PROCESS_NAME_NTOH(guid); and in mca_btl_tcp_endpoint_send_connect_ack : /* send process identifier to remote endpoint */ opal_process_name_t guid = btl_proc->proc_opal->proc_name; OPAL_PROCESS_NAME_HTON(guid); if(mca_btl_tcp_endpoint_send_blocking(btl_endpoint, , sizeof(guid)) != and with #define OPAL_PROCESS_NAME_NTOH(guid) #define OPAL_PROCESS_NAME_HTON(guid) i had no time yet to test yet, but for now, i can only suspect : - there will be an issue with the tcp btl on an heterogeneous cluster - for this case, the fix is to have a different version of the OPAL_PROCESS_NAME_xTOy on little endian arch if heterogeneous mode is supported. does that make sense ? Cheers, Gilles On 2014/07/31 1:29, George Bosilca wrote: > The underlying structure changed, so a little bit of fiddling is normal. 
> Instead of using a field in the ompi_proc_t you are now using a field down > in opal_proc_t, a field that simply cannot have the same type as before > (orte_process_name_t). > > George. > > > > On Wed, Jul 30, 2014 at 12:19 PM, Ralph Castain wrote: > >> George - my point was that we regularly tested using the method in that >> routine, and now we have to do something a little different. So it is an >> "issue" in that we have to make changes across
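[Editor's note] Ralph's alternative above is to convert by named field, as the ORTE methods do (e.g. `jobid = ntohl(remote->jobid)`), rather than by position in the 64-bit value. A hedged sketch of that field-wise approach, with the struct layout assumed for illustration:

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Assumed layout matching the thread's description of orte_process_name_t:
 * two named 32-bit fields rather than one opaque 64-bit value. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} orte_process_name_t;

/* Convert each field by name; position in memory never changes, so this
 * works identically regardless of how the enclosing 64 bits are laid out. */
static inline void orte_name_hton(orte_process_name_t *name)
{
    name->jobid = htonl(name->jobid);
    name->vpid  = htonl(name->vpid);
}

static inline void orte_name_ntoh(orte_process_name_t *name)
{
    name->jobid = ntohl(name->jobid);
    name->vpid  = ntohl(name->vpid);
}
```

Doing it this way avoids the abstraction question entirely, but requires OPAL to know about jobid and vpid, which is exactly the heavier option the thread debates.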
Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins
Technically speaking, converting a 64 bits to a big endian representation requires the swap of the 2 32 bits parts. So the correct approach would have been: uint64_t htonll(uint64_t v) { return uint64_t)ntohl(n)) << 32 | (uint64_t)ntohl(n >> 32)); } George. On Tue, Aug 5, 2014 at 5:52 AM, Ralph Castainwrote: > FWIW: that's exactly how we do it in ORTE > > On Aug 4, 2014, at 10:25 PM, Gilles Gouaillardet < > gilles.gouaillar...@iferc.org> wrote: > > George, > > i confirm there was a problem when running on an heterogeneous cluster, > this is now fixed in r32425. > > i am not convinced i chose the most elegant way to achieve the desired > result ... > could you please double check this commit ? > > Thanks, > > Gilles > > On 2014/08/02 0:14, George Bosilca wrote: > > Gilles, > > The design of the BTL move was to let the opal_process_name_t be agnostic to > what is stored inside, and all accesses should be done through the provided > accessors. Thus, big endian or little endian doesn’t make a difference, as > long as everything goes through the accessors. > > I’m skeptical about the support of heterogeneous environments in the current > code, so I didn’t pay much attention to handling the case in the TCP BTL. But > in case we do care it is enough to make the 2 macros point to something > meaningful instead of being empty (bswap_64 or something). > > George. > > On Aug 1, 2014, at 06:52 , Gilles Gouaillardet > wrote: > > > George and Ralph, > > i am very confused whether there is an issue or not. > > > anyway, today Paul and i ran basic tests on big endian machines and did not > face any issue related to big endianness. > > so i made my homework, digged into the code, and basically, > opal_process_name_t is used as an orte_process_name_t. 
> for example, in ompi_proc_init : > > OMPI_CAST_ORTE_NAME(>super.proc_name)->jobid = OMPI_PROC_MY_NAME->jobid; > OMPI_CAST_ORTE_NAME(>super.proc_name)->vpid = i; > > and with > > #define OMPI_CAST_ORTE_NAME(a) ((orte_process_name_t*)(a)) > > so as long as an opal_process_name_t is used as an orte_process_name_t, there > is no problem, > regardless the endianness of the homogenous cluster we are running on. > > for the sake of readability (and for being pedantic too ;-) ) in r32357, > _temp->super.proc_name > could be replaced with > OMPI_CAST_ORTE_NAME(_temp->super.proc_name) > > > > That being said, in btl/tcp, i noticed : > > in mca_btl_tcp_component_recv_handler : > > opal_process_name_t guid; > [...] > /* recv the process identifier */ > retval = recv(sd, (char *), sizeof(guid), 0); > if(retval != sizeof(guid)) { > CLOSE_THE_SOCKET(sd); > return; > } > OPAL_PROCESS_NAME_NTOH(guid); > > and in mca_btl_tcp_endpoint_send_connect_ack : > > /* send process identifier to remote endpoint */ > opal_process_name_t guid = btl_proc->proc_opal->proc_name; > > OPAL_PROCESS_NAME_HTON(guid); > if(mca_btl_tcp_endpoint_send_blocking(btl_endpoint, , sizeof(guid)) > != > > and with > > #define OPAL_PROCESS_NAME_NTOH(guid) > #define OPAL_PROCESS_NAME_HTON(guid) > > > i had no time yet to test yet, but for now, i can only suspect : > - there will be an issue with the tcp btl on an heterogeneous cluster > - for this case, the fix is to have a different version of the > OPAL_PROCESS_NAME_xTOy > on little endian arch if heterogeneous mode is supported. > > > > does that make sense ? > > Cheers, > > Gilles > > > On 2014/07/31 1:29, George Bosilca wrote: > > The underlying structure changed, so a little bit of fiddling is normal. > Instead of using a field in the ompi_proc_t you are now using a field down > in opal_proc_t, a field that simply cannot have the same type as before > (orte_process_name_t). > > George. 
> > > > On Wed, Jul 30, 2014 at 12:19 PM, Ralph Castain > wrote: > > > George - my point was that we regularly tested using the method in that > routine, and now we have to do something a little different. So it is an > "issue" in that we have to make changes across the code base to ensure we > do things the "new" way, that's all > > On Jul 30, 2014, at 9:17 AM, George Bosilca > wrote: > > No, this is not going to be an issue if the opal_identifier_t is used > correctly (aka only via the exposed accessors). > > George. > > > > On Wed, Jul 30, 2014 at 12:09 PM, Ralph Castain > wrote: > > > Yeah, my fix won't work for big endian machines - this is going to be an > issue across the code base now, so we'll have to troll and fix it. I was > doing the minimal change required to fix the trunk in the meantime. > > On Jul 30, 2014, at 9:06 AM, George Bosilca > wrote: > > Yes. opal_process_name_t has basically no meaning by itself, it is a 64 > bits
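[Editor's note] The htonll snippet George posts above lost some characters in the archive (the parameter is declared `v` but the body uses `n`, and opening parentheses are missing). A compilable reconstruction, with the parameter renamed `n` to match the body; this is only meaningful on little-endian hosts, where ntohl actually swaps (on big-endian hosts ntohl is the identity and no conversion is needed):

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Convert a 64-bit value to network byte order by swapping the two 32-bit
 * halves and byte-swapping each half through ntohl(). */
static inline uint64_t my_htonll(uint64_t n)
{
    return (((uint64_t)ntohl((uint32_t)n)) << 32) |
            ((uint64_t)ntohl((uint32_t)(n >> 32)));
}
```

Applying the function twice returns the original value, so the same routine serves as ntohll, mirroring how ntohl/htonl pair up for 32 bits.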
Re: [OMPI devel] oshmem enabled by default
Sounds like clearer language - done! On Aug 4, 2014, at 7:58 PM, Paul Hargrovewrote: > Since "disabled by default" is just part of a macro argument we can say > anything we want. > I propose the following: > > Index: config/oshmem_configure_options.m4 > === > --- config/oshmem_configure_options.m4 (revision 32424) > +++ config/oshmem_configure_options.m4 (working copy) > @@ -22,7 +22,7 @@ > AC_MSG_CHECKING([if want oshmem]) > AC_ARG_ENABLE([oshmem], >[AC_HELP_STRING([--enable-oshmem], > - [Enable building the OpenSHMEM interface > (disabled by default)])], > + [Enable building the OpenSHMEM interface > (available on Linux only, where it is enabled by default)])], >[oshmem_arg_given=yes], >[oshmem_arg_given=no]) > if test "$oshmem_arg_given" = "yes"; then > > > -Paul > > > > > On Mon, Aug 4, 2014 at 7:34 PM, Gilles Gouaillardet > wrote: > Paul, > > this is a bit trickier ... > > on a Linux platform oshmem is built by default, > on a non Linux platform, oshmem is *not* built by default. > > so the configure message (disabled by default) is correct on non Linux > platform, and incorrect on Linux platform ... > > i do not know what should be done, here are some options : > - have a different behaviour on Linux vs non Linux platforms (by the way, > does autotools support this ?) > - disable by default, provide only the --enable-oshmem option (so configure > abort if --enable-oshmem on non Linux platforms) > - provide only the --disable-oshmem option, useful only on Linux platforms. > on non Linux platforms do not build oshmem and this is not an error > - other ? > > Cheers, > > Gilles > > r31155 | rhc | 2014-03-20 05:32:15 +0900 (Thu, 20 Mar 2014) | 5 lines > > As per the thread on ticket #4399, OSHMEM does not support non-Linux > platforms. So provide a check for Linux and error out if --enable-oshmem is > given on a non-supported platform. If no OSHMEM option is given (enable or > disable), then don't attempt to build OSHMEM unless we are on a Linux > platform. 
Default to building if we are on Linux for now, pending the outcome > of the Debian situation. > > > On 2014/08/05 6:41, Paul Hargrove wrote: >> In both trunk and 1.8.2rc3 the behavior is to enable oshmem by default. >> >> In the 1.8.2rc3 tarball the configure help output matches the behavior. >> HOWEVER, in the trunk the configure help output still says oshmem is >> DISabled by default. >> >> {~/OMPI/ompi-trunk}$ svn info | grep "Revision" >> Revision: 32422 >> {~/OMPI/ompi-trunk}$ ./configure --help | grep -A1 'enable-oshmem ' >> --enable-oshmem Enable building the OpenSHMEM interface (disabled >> by >> default) >> >> -Paul >> >> >> On Thu, Jul 24, 2014 at 2:09 PM, Ralph Castain wrote: >> >>> Actually, it already is set correctly - the help message was out of date, >>> so I corrected that. >>> >>> On Jul 24, 2014, at 10:58 AM, Marco Atzeri wrote: >>> On 24/07/2014 15:52, Ralph Castain wrote: > Oshmem should be enabled by default now Ok, so please reverse the configure switch --enable-oshmem Enable building the OpenSHMEM interface >>> (disabled by default) I will test enabling it in the meantime. Regards Marco ___ devel mailing list de...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel Link to this post: >>> http://www.open-mpi.org/community/lists/devel/2014/07/15254.php >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> Link to this post: >>> http://www.open-mpi.org/community/lists/devel/2014/07/15261.php >>> >> >> >> ___ >> devel mailing list >> de...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2014/08/15502.php > > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15507.php > > > > -- > Paul H. 
Hargrove phhargr...@lbl.gov > Future Technologies Group > Computer and Data Sciences Department Tel: +1-510-495-2352 > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900 > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link
Re: [OMPI devel] minor atomics nit
Done - thanks! On Aug 4, 2014, at 7:15 PM, Paul Hargrove wrote: > Running "make dist" on trunk I see: > > --> Generating assembly for "SPARC" "default-.text-.globl-:--.L-#-1-0-1-0-0" > Could not open ../../../opal/asm/base/SPARC.asm: No such file or directory > > Which apparently happens because the following lines were never removed from > opal/asm/asm-data.txt > > # default compile mode on Solaris. Evil. equiv to about Sparc v8 > SPARC default-.text-.globl-:--.L-#-1-0-1-0-0 sparc-solaris > > README is clear about having dropped support for SPARC < v8plus. > > > -Paul
Re: [OMPI devel] 1.8.2rc3 cosmetic issues in configure
Got these cleaned up - will commit and CMR. Thanks! Ralph On Aug 4, 2014, at 12:47 AM, Paul Hargrove wrote: > It looks like four instances of AC_MSG_CHECKING are missing an AC_MSG_RESULT > or have other configure macros improperly nested between the two: > > checking for epoll support... checking for epoll_ctl... yes > yes > checking for working epoll library interface... yes > yes > > checking if user requested CMA build... checking --with-knem value... simple > ok (unspecified) > > checking if user requested CMA build... checking if MCA component btl:vader > can compile... yes > > checking orte configuration args... checking if MCA component dpm:orte can > compile... yes > > -Paul
Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins
FWIW: that's exactly how we do it in ORTE On Aug 4, 2014, at 10:25 PM, Gilles Gouaillardetwrote: > George, > > i confirm there was a problem when running on an heterogeneous cluster, > this is now fixed in r32425. > > i am not convinced i chose the most elegant way to achieve the desired result > ... > could you please double check this commit ? > > Thanks, > > Gilles > > On 2014/08/02 0:14, George Bosilca wrote: >> Gilles, >> >> The design of the BTL move was to let the opal_process_name_t be agnostic to >> what is stored inside, and all accesses should be done through the provided >> accessors. Thus, big endian or little endian doesn’t make a difference, as >> long as everything goes through the accessors. >> >> I’m skeptical about the support of heterogeneous environments in the current >> code, so I didn’t pay much attention to handling the case in the TCP BTL. >> But in case we do care it is enough to make the 2 macros point to something >> meaningful instead of being empty (bswap_64 or something). >> >> George. >> >> On Aug 1, 2014, at 06:52 , Gilles Gouaillardet >> wrote: >> >>> George and Ralph, >>> >>> i am very confused whether there is an issue or not. >>> >>> >>> anyway, today Paul and i ran basic tests on big endian machines and did not >>> face any issue related to big endianness. >>> >>> so i made my homework, digged into the code, and basically, >>> opal_process_name_t is used as an orte_process_name_t. >>> for example, in ompi_proc_init : >>> >>> OMPI_CAST_ORTE_NAME(>super.proc_name)->jobid = >>> OMPI_PROC_MY_NAME->jobid; >>> OMPI_CAST_ORTE_NAME(>super.proc_name)->vpid = i; >>> >>> and with >>> >>> #define OMPI_CAST_ORTE_NAME(a) ((orte_process_name_t*)(a)) >>> >>> so as long as an opal_process_name_t is used as an orte_process_name_t, >>> there is no problem, >>> regardless the endianness of the homogenous cluster we are running on. 
>>> >>> for the sake of readability (and for being pedantic too ;-) ) in r32357, >>> _temp->super.proc_name >>> could be replaced with >>> OMPI_CAST_ORTE_NAME(_temp->super.proc_name) >>> >>> >>> >>> That being said, in btl/tcp, i noticed : >>> >>> in mca_btl_tcp_component_recv_handler : >>> >>> opal_process_name_t guid; >>> [...] >>> /* recv the process identifier */ >>> retval = recv(sd, (char *), sizeof(guid), 0); >>> if(retval != sizeof(guid)) { >>> CLOSE_THE_SOCKET(sd); >>> return; >>> } >>> OPAL_PROCESS_NAME_NTOH(guid); >>> >>> and in mca_btl_tcp_endpoint_send_connect_ack : >>> >>> /* send process identifier to remote endpoint */ >>> opal_process_name_t guid = btl_proc->proc_opal->proc_name; >>> >>> OPAL_PROCESS_NAME_HTON(guid); >>> if(mca_btl_tcp_endpoint_send_blocking(btl_endpoint, , >>> sizeof(guid)) != >>> >>> and with >>> >>> #define OPAL_PROCESS_NAME_NTOH(guid) >>> #define OPAL_PROCESS_NAME_HTON(guid) >>> >>> >>> i had no time yet to test yet, but for now, i can only suspect : >>> - there will be an issue with the tcp btl on an heterogeneous cluster >>> - for this case, the fix is to have a different version of the >>> OPAL_PROCESS_NAME_xTOy >>> on little endian arch if heterogeneous mode is supported. >>> >>> >>> >>> does that make sense ? >>> >>> Cheers, >>> >>> Gilles >>> >>> >>> On 2014/07/31 1:29, George Bosilca wrote: The underlying structure changed, so a little bit of fiddling is normal. Instead of using a field in the ompi_proc_t you are now using a field down in opal_proc_t, a field that simply cannot have the same type as before (orte_process_name_t). George. On Wed, Jul 30, 2014 at 12:19 PM, Ralph Castain wrote: > George - my point was that we regularly tested using the method in that > routine, and now we have to do something a little different. 
So it is an > "issue" in that we have to make changes across the code base to ensure we > do things the "new" way, that's all > > On Jul 30, 2014, at 9:17 AM, George Bosilca wrote: > > No, this is not going to be an issue if the opal_identifier_t is used > correctly (aka only via the exposed accessors). > > George. > > > > On Wed, Jul 30, 2014 at 12:09 PM, Ralph Castain wrote: > >> Yeah, my fix won't work for big endian machines - this is going to be an >> issue across the code base now, so we'll have to troll and fix it. I was >> doing the minimal change required to fix the trunk in the meantime. >> >> On Jul 30, 2014, at 9:06 AM, George Bosilca wrote: >> >> Yes. opal_process_name_t has basically no meaning by itself, it is a 64 >> bits storage location used by the upper layer to save some local key that >> can be later used to extract information. Calling the
Re: [OMPI devel] [1.8.2rc3] static linking fails on linux when not building ROMIO
Here is a patch that has been minimally tested. this is likely an overkill (at least when dynamic libraries can be used), but it does the job so far ... Cheers, Gilles On 2014/08/05 16:56, Gilles Gouaillardet wrote: > from libopen-pal.la : > dependency_libs=' -lrdmacm -libverbs -lscif -lnuma -ldl -lrt -lnsl > -lutil -lm' > > > i confirm mpicc fails linking > > but FWIT, using libtool does work (!) > > could the bug come from the mpicc (and other) wrappers ? > > Gilles > > $ gcc -g -O0 -o hw /csc/home1/gouaillardet/hw.c > -I/tmp/install/ompi.noromio/include -pthread -L/usr/lib64 -Wl,-rpath > -Wl,/usr/lib64 -Wl,-rpath -Wl,/tmp/install/ompi.noromio/lib > -Wl,--enable-new-dtags -L/tmp/install/ompi.noromio/lib -lmpi -lopen-rte > -lopen-pal -lm -lnuma -libverbs -lscif -lrdmacm -ldl -llustreapi > > $ /tmp/install/ompi.noromio/bin/mpicc -g -O0 -o hw -show ~/hw.c > gcc -g -O0 -o hw /csc/home1/gouaillardet/hw.c > -I/tmp/install/ompi.noromio/include -pthread -L/usr/lib64 -Wl,-rpath > -Wl,/usr/lib64 -Wl,-rpath -Wl,/tmp/install/ompi.noromio/lib > -Wl,--enable-new-dtags -L/tmp/install/ompi.noromio/lib -lmpi -lopen-rte > -lopen-pal -lm -lnuma -libverbs -lscif -lrdmacm -ldl -llustreapi > [gouaillardet@soleil build]$ /tmp/install/ompi.noromio/bin/mpicc -g -O0 > -o hw ~/hw.c > /tmp/install/ompi.noromio/lib/libmpi.a(fbtl_posix_ipwritev.o): In > function `mca_fbtl_posix_ipwritev': > fbtl_posix_ipwritev.c:(.text+0x17b): undefined reference to `aio_write' > fbtl_posix_ipwritev.c:(.text+0x237): undefined reference to `aio_write' > fbtl_posix_ipwritev.c:(.text+0x3f4): undefined reference to `aio_write' > fbtl_posix_ipwritev.c:(.text+0x48e): undefined reference to `aio_write' > /tmp/install/ompi.noromio/lib/libopen-pal.a(opal_pty.o): In function > `opal_openpty': > opal_pty.c:(.text+0x1): undefined reference to `openpty' > /tmp/install/ompi.noromio/lib/libopen-pal.a(event.o): In function > `event_add_internal': > event.c:(.text+0x288d): undefined reference to `clock_gettime' > > $ /bin/sh 
./static/libtool --silent --tag=CC --mode=compile gcc > -std=gnu99 -I/tmp/install/ompi.noromio/include -c ~/hw.c > $ /bin/sh ./static/libtool --silent --tag=CC --mode=link gcc > -std=gnu99 -o hw hw.o -L/tmp/install/ompi.noromio/lib -lmpi > $ ldd hw > linux-vdso.so.1 => (0x7fff7530d000) > librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x7f0ed541e000) > libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x7f0ed521) > libscif.so.0 => /usr/lib64/libscif.so.0 (0x003b9c60) > libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x003ba560) > libdl.so.2 => /lib64/libdl.so.2 (0x003b9be0) > librt.so.1 => /lib64/librt.so.1 (0x003b9ca0) > libnsl.so.1 => /lib64/libnsl.so.1 (0x003bae20) > libutil.so.1 => /lib64/libutil.so.1 (0x003bac60) > libm.so.6 => /lib64/libm.so.6 (0x003b9ba0) > libpthread.so.0 => /lib64/libpthread.so.0 (0x003b9c20) > libc.so.6 => /lib64/libc.so.6 (0x003b9b60) > /lib64/ld-linux-x86-64.so.2 (0x003b9b20) > > > > > On 2014/08/05 7:56, Ralph Castain wrote: >> My thought was to post initially as a blocker, pending a discussion with >> Jeff at tomorrow's telecon. If he thinks this is something we can fix in >> some central point (thus catching it everywhere), then it could be quick and >> worth doing. However, I'm skeptical as I tried to do that in the most >> obvious place, and it failed (could be operator error). >> >> Will let you know tomorrow. Truly appreciate your digging on this! >> Ralph >> >> On Aug 4, 2014, at 3:50 PM, Paul Hargrovewrote: >> >>> Ralph and Jeff, >>> >>> I've been digging and find the problem is wider than just the one library >>> and has manifestations specific to FreeBSD, NetBSD and Solaris. I am >>> adding new info to the ticket as I unearth it. >>> >>> Additionally, it appears this existed in 1.8, 1.8.1 and in the 1.7 series >>> as well. >>> So, would suggest this NOT be a blocker for a 1.8.2 release. >>> >>> Of course I am willing to provide testing if you still want to push for a >>> quick resolution. 
>>> >>> -Paul >>> >>> >>> On Mon, Aug 4, 2014 at 1:27 PM, Ralph Castain wrote: >>> Okay, I filed a blocker on this for 1.8.2 and assigned it to Jeff. I took a >>> crack at fixing it, but came up short :-( >>> >>> >>> On Aug 3, 2014, at 10:46 PM, Paul Hargrove wrote: >>> I've identified the difference between the platform that does link libutil and the one that does not. 1) libutil is linked (as an OMPI dependency) only on the working system: Working system: $ grep 'checking for .* LIBS' configure.out checking for OPAL LIBS... -lm -lpciaccess -ldl checking for ORTE LIBS... -lm -lpciaccess -ldl -ltorque checking for OMPI LIBS... -lm -lpciaccess -ldl -ltorque -lrt -lnsl -lutil NON-working system: $ grep 'checking for .* LIBS' configure.out
Re: [OMPI devel] [1.8.2rc3] static linking fails on linux when not building ROMIO
from libopen-pal.la : dependency_libs=' -lrdmacm -libverbs -lscif -lnuma -ldl -lrt -lnsl -lutil -lm' i confirm mpicc fails linking but FWIT, using libtool does work (!) could the bug come from the mpicc (and other) wrappers ? Gilles $ gcc -g -O0 -o hw /csc/home1/gouaillardet/hw.c -I/tmp/install/ompi.noromio/include -pthread -L/usr/lib64 -Wl,-rpath -Wl,/usr/lib64 -Wl,-rpath -Wl,/tmp/install/ompi.noromio/lib -Wl,--enable-new-dtags -L/tmp/install/ompi.noromio/lib -lmpi -lopen-rte -lopen-pal -lm -lnuma -libverbs -lscif -lrdmacm -ldl -llustreapi $ /tmp/install/ompi.noromio/bin/mpicc -g -O0 -o hw -show ~/hw.c gcc -g -O0 -o hw /csc/home1/gouaillardet/hw.c -I/tmp/install/ompi.noromio/include -pthread -L/usr/lib64 -Wl,-rpath -Wl,/usr/lib64 -Wl,-rpath -Wl,/tmp/install/ompi.noromio/lib -Wl,--enable-new-dtags -L/tmp/install/ompi.noromio/lib -lmpi -lopen-rte -lopen-pal -lm -lnuma -libverbs -lscif -lrdmacm -ldl -llustreapi [gouaillardet@soleil build]$ /tmp/install/ompi.noromio/bin/mpicc -g -O0 -o hw ~/hw.c /tmp/install/ompi.noromio/lib/libmpi.a(fbtl_posix_ipwritev.o): In function `mca_fbtl_posix_ipwritev': fbtl_posix_ipwritev.c:(.text+0x17b): undefined reference to `aio_write' fbtl_posix_ipwritev.c:(.text+0x237): undefined reference to `aio_write' fbtl_posix_ipwritev.c:(.text+0x3f4): undefined reference to `aio_write' fbtl_posix_ipwritev.c:(.text+0x48e): undefined reference to `aio_write' /tmp/install/ompi.noromio/lib/libopen-pal.a(opal_pty.o): In function `opal_openpty': opal_pty.c:(.text+0x1): undefined reference to `openpty' /tmp/install/ompi.noromio/lib/libopen-pal.a(event.o): In function `event_add_internal': event.c:(.text+0x288d): undefined reference to `clock_gettime' $ /bin/sh ./static/libtool --silent --tag=CC --mode=compile gcc -std=gnu99 -I/tmp/install/ompi.noromio/include -c ~/hw.c $ /bin/sh ./static/libtool --silent --tag=CC --mode=link gcc -std=gnu99 -o hw hw.o -L/tmp/install/ompi.noromio/lib -lmpi $ ldd hw linux-vdso.so.1 => (0x7fff7530d000) librdmacm.so.1 => 
/usr/lib64/librdmacm.so.1 (0x7f0ed541e000) libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x7f0ed521) libscif.so.0 => /usr/lib64/libscif.so.0 (0x003b9c60) libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x003ba560) libdl.so.2 => /lib64/libdl.so.2 (0x003b9be0) librt.so.1 => /lib64/librt.so.1 (0x003b9ca0) libnsl.so.1 => /lib64/libnsl.so.1 (0x003bae20) libutil.so.1 => /lib64/libutil.so.1 (0x003bac60) libm.so.6 => /lib64/libm.so.6 (0x003b9ba0) libpthread.so.0 => /lib64/libpthread.so.0 (0x003b9c20) libc.so.6 => /lib64/libc.so.6 (0x003b9b60) /lib64/ld-linux-x86-64.so.2 (0x003b9b20) On 2014/08/05 7:56, Ralph Castain wrote: > My thought was to post initially as a blocker, pending a discussion with Jeff > at tomorrow's telecon. If he thinks this is something we can fix in some > central point (thus catching it everywhere), then it could be quick and worth > doing. However, I'm skeptical as I tried to do that in the most obvious > place, and it failed (could be operator error). > > Will let you know tomorrow. Truly appreciate your digging on this! > Ralph > > On Aug 4, 2014, at 3:50 PM, Paul Hargrovewrote: > >> Ralph and Jeff, >> >> I've been digging and find the problem is wider than just the one library >> and has manifestations specific to FreeBSD, NetBSD and Solaris. I am adding >> new info to the ticket as I unearth it. >> >> Additionally, it appears this existed in 1.8, 1.8.1 and in the 1.7 series as >> well. >> So, would suggest this NOT be a blocker for a 1.8.2 release. >> >> Of course I am willing to provide testing if you still want to push for a >> quick resolution. >> >> -Paul >> >> >> On Mon, Aug 4, 2014 at 1:27 PM, Ralph Castain wrote: >> Okay, I filed a blocker on this for 1.8.2 and assigned it to Jeff. I took a >> crack at fixing it, but came up short :-( >> >> >> On Aug 3, 2014, at 10:46 PM, Paul Hargrove wrote: >> >>> I've identified the difference between the platform that does link libutil >>> and the one that does not. 
>>> >>> 1) libutil is linked (as an OMPI dependency) only on the working system: >>> >>> Working system: >>> $ grep 'checking for .* LIBS' configure.out >>> checking for OPAL LIBS... -lm -lpciaccess -ldl >>> checking for ORTE LIBS... -lm -lpciaccess -ldl -ltorque >>> checking for OMPI LIBS... -lm -lpciaccess -ldl -ltorque -lrt -lnsl -lutil >>> >>> NON-working system: >>> $ grep 'checking for .* LIBS' configure.out >>> checking for OPAL LIBS... -lm -ldl >>> checking for ORTE LIBS... -lm -ldl -ltorque >>> checking for OMPI LIBS... -lm -ldl -ltorque >>> >>> So, the working system that does link libutil is doing so as an OMPI >>> dependency. >>> However it is also needed for opal (only caller of openpty is >>> opal/util/open_pty.c). >>> >>> 2) Only the working system is building ROMIO: >>>
Re: [OMPI devel] [vt] --with-openmpi-inside configure argument
Bert, It is just an observation of something that could easily break in the future. The code is correct as written. So, no immediate action is required. -Paul On Tue, Aug 5, 2014 at 12:04 AM, Bert Wesargwrote: > On 08/05/2014 02:40 AM, Paul Hargrove wrote: > >> I noticed that Open MPI is passing >> --with-openmpi-inside=1.7 >> in the arguments passed to >> ompi/contrib/vt/vt/configure >> and >> ompi/contrib/vt/vt/extlib/otf/configure >> >> The extlib/otf case just tests if the value is set, but the top-level >> vt/configure is checking for the specific string "1.7": >> >> # Check whether we are inside Open MPI package >> inside_openmpi="no" >> AC_ARG_WITH(openmpi-inside, [], >> [ >> AS_IF([test x"$withval" = "xyes" -o x"$withval" = "x1.7"], >> [ >> inside_openmpi="$withval" >> CPPFLAGS="-DINSIDE_OPENMPI $CPPFLAGS" >> >> # Set FC to F77 if Open MPI version < 1.7 >> AS_IF([test x"$withval" = "xyes" -a x"$FC" = x -a x"$F77" >> != x], >> [FC="$F77"]) >> ]) >> ]) >> >> That logic looks a bit fragile with respect to any future changes. >> Specifically the inner AS_IF is true for the desired condition "version < >> 1.7" only because the outer AS_IF currently ensures the only possible >> values of "$withval" are "yes" and "1.7". >> > > Noted. But this is not my field. May take some time, because Matthias is > still in vacation. > > Bert > > >> -Paul >> >> >> > -- > Dipl.-Inf. Bert Wesarg > wiss. Mitarbeiter > > Technische Universität Dresden > Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH) > 01062 Dresden > Tel.: +49 (351) 463-42451 > Fax: +49 (351) 463-37773 > E-Mail: bert.wes...@tu-dresden.de > > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15510.php > -- Paul H. 
Hargrove phhargr...@lbl.gov Future Technologies Group Computer and Data Sciences Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins
George, i confirm there was a problem when running on a heterogeneous cluster, this is now fixed in r32425. i am not convinced i chose the most elegant way to achieve the desired result ... could you please double check this commit ? Thanks, Gilles On 2014/08/02 0:14, George Bosilca wrote: > Gilles, > > The design of the BTL move was to let the opal_process_name_t be agnostic to > what is stored inside, and all accesses should be done through the provided > accessors. Thus, big endian or little endian doesn't make a difference, as > long as everything goes through the accessors. > > I'm skeptical about the support of heterogeneous environments in the current > code, so I didn't pay much attention to handling the case in the TCP BTL. But > in case we do care it is enough to make the 2 macros point to something > meaningful instead of being empty (bswap_64 or something). > > George. > > On Aug 1, 2014, at 06:52, Gilles Gouaillardet wrote: > >> George and Ralph, >> >> i am very confused about whether there is an issue or not. >> >> >> anyway, today Paul and i ran basic tests on big endian machines and did not >> face any issue related to big endianness. >> >> so i did my homework, dug into the code, and basically, >> opal_process_name_t is used as an orte_process_name_t. >> for example, in ompi_proc_init : >> >> OMPI_CAST_ORTE_NAME(&proc->super.proc_name)->jobid = >> OMPI_PROC_MY_NAME->jobid; >> OMPI_CAST_ORTE_NAME(&proc->super.proc_name)->vpid = i; >> >> and with >> >> #define OMPI_CAST_ORTE_NAME(a) ((orte_process_name_t*)(a)) >> >> so as long as an opal_process_name_t is used as an orte_process_name_t, >> there is no problem, >> regardless of the endianness of the homogeneous cluster we are running on.
>> >> for the sake of readability (and for being pedantic too ;-) ) in r32357, >> &proc_temp->super.proc_name >> could be replaced with >> OMPI_CAST_ORTE_NAME(&proc_temp->super.proc_name) >> >> >> >> That being said, in btl/tcp, i noticed : >> >> in mca_btl_tcp_component_recv_handler : >> >> opal_process_name_t guid; >> [...] >> /* recv the process identifier */ >> retval = recv(sd, (char *)&guid, sizeof(guid), 0); >> if(retval != sizeof(guid)) { >> CLOSE_THE_SOCKET(sd); >> return; >> } >> OPAL_PROCESS_NAME_NTOH(guid); >> >> and in mca_btl_tcp_endpoint_send_connect_ack : >> >> /* send process identifier to remote endpoint */ >> opal_process_name_t guid = btl_proc->proc_opal->proc_name; >> >> OPAL_PROCESS_NAME_HTON(guid); >> if(mca_btl_tcp_endpoint_send_blocking(btl_endpoint, &guid, sizeof(guid)) >> != >> >> and with >> >> #define OPAL_PROCESS_NAME_NTOH(guid) >> #define OPAL_PROCESS_NAME_HTON(guid) >> >> >> i have not had time to test yet, but for now, i can only suspect : >> - there will be an issue with the tcp btl on a heterogeneous cluster >> - in this case, the fix is to have a different version of the >> OPAL_PROCESS_NAME_xTOy macros >> on little endian arch if heterogeneous mode is supported. >> >> >> >> does that make sense ? >> >> Cheers, >> >> Gilles >> >> >> On 2014/07/31 1:29, George Bosilca wrote: >>> The underlying structure changed, so a little bit of fiddling is normal. >>> Instead of using a field in the ompi_proc_t you are now using a field down >>> in opal_proc_t, a field that simply cannot have the same type as before >>> (orte_process_name_t). >>> >>> George. >>> >>> >>> >>> On Wed, Jul 30, 2014 at 12:19 PM, Ralph Castain wrote: >>> George - my point was that we regularly tested using the method in that routine, and now we have to do something a little different.
So it is an "issue" in that we have to make changes across the code base to ensure we do things the "new" way, that's all On Jul 30, 2014, at 9:17 AM, George Bosilca wrote: No, this is not going to be an issue if the opal_identifier_t is used correctly (aka only via the exposed accessors). George. On Wed, Jul 30, 2014 at 12:09 PM, Ralph Castain wrote: > Yeah, my fix won't work for big endian machines - this is going to be an > issue across the code base now, so we'll have to troll and fix it. I was > doing the minimal change required to fix the trunk in the meantime. > > On Jul 30, 2014, at 9:06 AM, George Bosilca wrote: > > Yes. opal_process_name_t has basically no meaning by itself, it is a 64 > bits storage location used by the upper layer to save some local key that > can be later used to extract information. Calling the OPAL level compare > function might be a better fit there. > > George. > > > > On Wed, Jul 30, 2014 at 11:50 AM, Gilles Gouaillardet < > gilles.gouaillar...@gmail.com> wrote: > >> Ralph, >> >> was it really that simple ? >> >> proc_temp->super.proc_name has type opal_process_name_t