Oops!! You put the fix in the wrong place, I'm afraid: /* leave space for length field */ memset(c, ' ', PMII_COMMANDLEN_SIZE); c += PMII_COMMANDLEN_SIZE;
PMI2U_ERR_CHKANDJUMP(strlen(cmd) > PMI2_MAX_VALLEN, pmi2_errno, PMI2_ERR_OTHER, "**cmd_too_long"); ret = snprintf(c, remaining_len, "cmd=%s;", cmd); <<<<< THIS IS THE PROBLEM POINT PMI2U_ERR_CHKANDJUMP(ret >= remaining_len, pmi2_errno, PMI2_ERR_OTHER, "**intern %s", "Ran out of room for command"); c += ret; remaining_len -= ret; /* Subtract the PMII_COMMANDLEN_SIZE to prevent * certain implementation of snprintf() to * segfault when zero out the buffer. * PMII_COMMANDLEN_SIZE must be added later on * back again to send out the right protocol * message size. */ remaining_len -= PMII_COMMANDLEN_SIZE; That change needed to go *before* the highlighted snprintf. I'm afraid 2.6.3 continue to segfault :-( On Oct 3, 2013, at 12:12 PM, Ralph Castain <r...@open-mpi.org> wrote: > > Cool - thanks David! > > On Oct 3, 2013, at 11:31 AM, David Bigagli <da...@schedmd.com> wrote: > >> Fixed. I chose the method proposed by Michael, subtract first, add later. :-) >> >> On 10/03/2013 10:29 AM, Ralph Castain wrote: >>> >>> >>> On Oct 3, 2013, at 10:16 AM, David Bigagli <da...@schedmd.com> wrote: >>> >>>> >>>> >>>> I am not saying that remaining_len is correct or that mpich is bugless :-) >>>> I am only saying that decrementing remaining_len as proposed breaks the >>>> pmi2 protocol as since the client sends a wrong length to the server. We >>>> actually applied the very same fix in the past and then had to undo it >>>> commit: 1e25eb10df0f. >>>> >>>> Funny enough Ralph was one of those that reported that broken protocol bug >>>> :-) "mpi/pmi2: no value for key %s in req". >>> >>> Ha! Yes indeed, we do see that reappear. :-/ >>> >>>> >>>> We can either use sprintf() as there is no need to use snprintf() here or >>>> do snprintf(c, remaining_len - PMII_COMMANDLEN_SIZE, "thrid=%p;", resp); >>> >>> Either way is fine with me - feel free to choose. >>> Thanks! >>> >>>> >>>> On 10/03/2013 09:50 AM, Ralph Castain wrote: >>>>> >>>>> I'm afraid that isn't quite correct, David. If you walk thru the code, >>>>> you'll see that the snprintf is being given an incorrect length for the >>>>> buffer. When using some libc versions, snprintf automatically fills the >>>>> given buffer with NULLs to ensure that the string is NULL-terminated. >>>>> When it does so, the code as currently written causes a segfault. >>>>> >>>>> We have tested the provided change and it fixes the problem. If your code >>>>> review indicates that remaining_len should not be changed where we did >>>>> it, then you need to at least change the snprintf command to reflect the >>>>> reduced size of the "c" buffer. >>>>> >>>>> On Oct 3, 2013, at 9:44 AM, David Bigagli <da...@schedmd.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> Hi, >>>>>> I don't know the details of the segfault but the code in question is >>>>>> correct. If you decrease the length then the file cmdlen: >>>>>> >>>>>> cmdlen = PMII_MAX_COMMAND_LEN - remaining_len; >>>>>> >>>>>> will not be correct and wrong length will be sent to the pmi2 server. >>>>>> >>>>>> This code is taken verbatim from mpich2-1.5, precisely from: >>>>>> mpich2-1.5/src/pmi/pmi2/simple2pmi.c >>>>>> >>>>>> On 10/03/2013 08:47 AM, Ralph Castain wrote: >>>>>>> >>>>>>> Hi folks >>>>>>> >>>>>>> We have uncovered a segfault-causing bug in contribs/pmi2/pmi2_api.c. >>>>>>> Looking at the code in the current trunk, you need to add the following >>>>>>> at line 1499: >>>>>>> >>>>>>> remaining_len -= PMII_COMMANDLEN_SIZE; >>>>>>> >>>>>>> as you have changed the location of the pointer to the buffer, but >>>>>>> failed to account for that change in the size param being later passed >>>>>>> to snprintf. >>>>>>> >>>>>>> Could you please fix this and include it in 2.6.3? >>>>>>> >>>>>>> Thanks >>>>>>> Ralph >>>>>>> >>>>>> >>>>>> -- >>>>>> >>>>>> Thanks, >>>>>> /David >>>> >>>> -- >>>> >>>> Thanks, >>>> /David >> >> -- >> >> Thanks, >> /David