On Sep 17, 2013, at 2:01 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> Ralph,
>
> On Sep 17, 2013, at 20:13 , Ralph Castain <r...@open-mpi.org> wrote:
>
>> I guess we could argue this for awhile, but I personally don't care how it
>> gets fixed. The issue here is that (a) you promised to provide a "better"
>> fix nearly a year ago, (b) it never happened, and (c) a user who has patiently
>> waited all this time has asked if we could please fix it.
>
> There seems to be some misunderstanding here. Believe me, I have neither time
> nor interest in vain arguments but in this case I was not arguing, I was just
> trying to be polite and explain the problem so that people can understand the
> real issue and the fact that there was no room for argument. Dot. Your patch
> is not correct, as it addresses a nonexistent issue in MPI. But it appears
> somehow I failed to make myself clear here, and you took my message as some
> kind of joke.

Email sucks - I wasn't upset or disturbed at all. Only trying to explain why I finally addressed it.

>
>> It now works, but if you want to provide a better solution, please do - I
>> have no issue with it. However, until you do, I propose to use what we have.
>
> It does not work. It fixes a minimalistic corner case without addressing the
> real problem. You can leave it in the trunk if so you wish, but it
> definitely should not make it in the 1.7.

I'm happy to replace my patch IF someone will take the time to offer a better solution. I can always continue to improve it as required in the meantime. Believe me, I have no skin in this issue, but do feel that a year is more than enough time to wait.

>
> Let me try another approach. A complete test case for this is to add an
> MPI_Barrier on the intercomm before the call to MPI_Intercomm_merge, the one
> merging the communicator created by MPI_Intercomm_create (this MPI_Barrier
> should be added on both the parent and the children code). If this test
> passes with the current patch, then I misunderstood your patch and I'm
> entirely in the wrong.

I very much doubt that it would work, though I can give it a try, as the patch addresses Intercomm_merge and not Intercomm_create. I debated about putting the patch into "create" instead, but nobody was citing that as being a problem. In my opinion, it makes more sense for it to be in "create", and I can certainly shift it to that location easily enough.

My concern with your approach is that I'm not convinced it will work. The problem is that not all the MPI procs can communicate via MPI at this point because they lack the required info and haven't added the procs into the BTLs yet. So packing modex info into a buffer and attempting to send it via MPI could just cause the lockup to occur sooner. Hence the approach of ensuring all procs have the required info. Not optimal, I agree, but performance isn't an issue with this function, and the trivial amount of RTE effort didn't seem worth worrying about.

>
>> As for the commit message, I really have no interest in spending time
>> debating the proper way to say something. :-)
>
> I do, words have meaning for a good reason. Please read Section 6.6.2 in
> the MPI standard, and you will understand why your commit message was
> slightly distorting the reality of the MPI standard.

Ah, George - we have debated this for a long time. I'm not a fan of wordy commit messages, nor do I worry much about them. If people read commit messages as part of understanding the MPI standard document...well, that is beyond my concern. :-)

>
> George.
>
>>
>>
>> On Sep 17, 2013, at 10:40 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>>> Ralph,
>>>
>>> I don't think your patch is addressing the right issue. In fact your commit
>>> treats the wrong symptom instead of addressing the core issue that generates
>>> the problem. Let me explain this in terms of MPI.
>>>
>>> The MPI_Intercomm_merge function transforms an inter-comm into an
>>> intra-comm, basically a two groups world into a single group world.
>>> Under the MPI standard the two groups handled by this function should be
>>> able to talk to each other in this inter-comm. So, your patch fixes a
>>> nonexistent problem, as the processes were already supposed to be able to
>>> communicate together before the MPI_Intercomm_merge. The real issue (which
>>> was highlighted in the original email exchange) is that during the
>>> MPI_Intercomm_create the bridge communicator is not used to correctly
>>> exchange the modex of the two groups of processes.
>>>
>>> In addition I have two smaller issues related to this patch.
>>>
>>> 1. The commit message is misleading, at least from the MPI standpoint.
>>>
>>> 2. This function is one of the few MPI-2 dynamic processing functions that
>>> can be solved purely at the OMPI layer, without a need for extra
>>> functionality from the RTE. The infrastructure of the correct solution is
>>> already in the trunk, what is missing is the correct exchange of the
>>> complete modex information of the two groups instead of exchanging their
>>> OMPI_ARCH.
>>>
>>> Based on the fact that the band-aid is not really solving the right problem
>>> I propose the removal of this patch from the trunk, and the blocking of the
>>> pending CMR until a better solution is found.
>>>
>>> Thanks,
>>> George.
>>>
>>>
>>> On Sep 15, 2013, at 17:01 , Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> I fixed it and have filed a cmr to move it to 1.7.3
>>>>
>>>> Thanks for your patience, and for reminding me
>>>> Ralph
>>>>
>>>> On Sep 13, 2013, at 12:05 PM, Suraj Prabhakaran
>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>
>>>>> Dear Ralph, that would be great if you could give it a try. We have been
>>>>> hoping for it for a year now and it could greatly benefit us if this is
>>>>> fixed!! :-)
>>>>>
>>>>> Thanks!
>>>>> Suraj
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 13, 2013 at 5:39 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> It has been a low priority issue, and hence not resolved yet.
>>>>> I doubt it will make 1.7.3, though if you need it, I'll give it a try.
>>>>>
>>>>> On Sep 13, 2013, at 7:21 AM, Suraj Prabhakaran
>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>>
>>>>> > Hello,
>>>>> >
>>>>> > Is there a plan to fix the problem with MPI_Intercomm_merge with 1.7.3
>>>>> > as stated in this ticket? We are really in need of this at the moment.
>>>>> > Any hints?
>>>>> >
>>>>> > We face the following problem.
>>>>> >
>>>>> > Parents (x and y) spawn child (z). (all of them execute on separate
>>>>> > nodes)
>>>>> > x is the root.
>>>>> > x, y and z do an MPI_Intercomm_merge.
>>>>> > x and z are able to communicate properly.
>>>>> > But y and z are not able to communicate after the merge.
>>>>> >
>>>>> > Is this bug in high priority for the next release?
>>>>> >
>>>>> > https://svn.open-mpi.org/trac/ompi/ticket/2904
>>>>> >
>>>>> > Best,
>>>>> > Suraj
>>>>> >
>>>>> >
>>>>> > _______________________________________________
>>>>> > devel mailing list
>>>>> > de...@open-mpi.org
>>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Suraj Prabhakaran
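
For readers following the thread, the reported symptom comes down to which processes hold connection (modex) information for which peers. The Python sketch below is a toy model under that assumption and is not Open MPI code: the `known` table and the helpers `can_talk` and `exchange_full_modex` are hypothetical names invented for illustration, and the starting state is simply the one Suraj describes (x and z can communicate, y and z cannot).

```python
# Toy model only: this is NOT Open MPI code. Each process keeps the set of
# peers whose connection info ("modex") it holds. In this model two processes
# can talk only when each one holds the other's info (an assumption).


def can_talk(known, a, b):
    """Return True when a and b each hold the other's connection info."""
    return b in known[a] and a in known[b]


def exchange_full_modex(known, local_group, remote_group):
    """Hypothetical remedy: every process in each group ends up holding the
    info of every process in the other group. Returns a new table."""
    updated = {proc: set(peers) for proc, peers in known.items()}
    for proc in local_group:
        updated[proc] |= set(remote_group)
    for proc in remote_group:
        updated[proc] |= set(local_group)
    return updated


parents, children = ["x", "y"], ["z"]

# State reported in the thread: root x and child z can talk, y and z cannot.
known = {
    "x": {"x", "y", "z"},
    "y": {"x", "y"},  # y never received z's info
    "z": {"z", "x"},  # z never received y's info
}

pairs = [("x", "z"), ("y", "z")]
before = {pair: can_talk(known, *pair) for pair in pairs}

after_known = exchange_full_modex(known, parents, children)
after = {pair: can_talk(after_known, *pair) for pair in pairs}

print(before)  # {('x', 'z'): True, ('y', 'z'): False}
print(after)   # {('x', 'z'): True, ('y', 'z'): True}
```

Both remedies discussed in the thread aim at the second state. They differ in where the exchange happens (through the RTE, or at the OMPI layer during MPI_Intercomm_create), which this model does not capture.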