I think I know why it didn't cause problems with SLURM and TORQUE. The routing was wrong, so at some point the message was forwarded to the HNP. As the HNP has direct connections to all other processes, it was able to deliver the message correctly. The only visible impact was two extra hops for every message directed to the last daemon, which should have only a minimal effect on performance.
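
To see why the fallback saves the day, here is a toy sketch (plain C, invented names, not ORTE's actual routing code): a daemon that has no routing entry for the target simply forwards to the HNP, and since the HNP is directly connected to every process, the message still arrives, only with extra hops.

/* Toy sketch, not ORTE's real routing code: all names are invented.
 * A daemon with no route to the target falls back to the HNP, which
 * has direct connections to everyone, so delivery still succeeds. */
#include <stdio.h>

#define HNP 0
#define NUM_DAEMONS 4

/* hypothetical table: next_hop[from][to]; -1 marks a missing entry */
static int next_hop[NUM_DAEMONS][NUM_DAEMONS];

static int route(int self, int target)
{
    int hop = next_hop[self][target];
    return (hop < 0) ? HNP : hop;   /* fall back to the HNP */
}

int main(void)
{
    int last = NUM_DAEMONS - 1;

    /* simulate the miscount: only the HNP has a route to the last daemon */
    for (int d = 0; d < NUM_DAEMONS; d++)
        for (int t = 0; t < NUM_DAEMONS; t++)
            next_hop[d][t] = (d != HNP && t == last) ? -1 : t;

    int at = 1, hops = 0;
    while (at != last) {
        int nh = route(at, last);
        printf("daemon %d forwards toward %d via %d\n", at, last, nh);
        at = nh;
        hops++;
    }
    printf("delivered in %d hops\n", hops);
    return 0;
}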

Based on the content of the email related to the commit, I think this will fix the problem. Unfortunately, our svn server seems to be having some trouble right now (it doesn't respond at all), so I can't test it. I'll do so as soon as the svn server is back online.

  Thanks,
    george.

On Wed, 1 Jul 2009, Ralph Castain wrote:

I believe this is now fixed with r21582 - let me know if it works for you now.

Sorry for the problem. It was indeed miscounting the number of daemons in the system, though apparently this wasn't causing problems for slurm and torque (still investigating why since it should have). Unfortunately, just changing the index caused shared memory to think everyone was remote, so the fix was a tad more involved - though not particularly difficult.
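
To illustrate the shared-memory side (a toy sketch with invented names, not the real nidmap/locality code): a proc decides whether a peer is local by comparing the daemon vpid it knows for its own node against the vpid recorded for the peer's node; if the map is unpacked one slot off, the vpids disagree even for a co-located peer, so everyone looks remote.

/* Toy sketch, invented names: an off-by-one in the unpacked map makes
 * the locality test fail, so shared memory is never selected. */
#include <stdio.h>

int main(void)
{
    int my_daemon  = 1;            /* vpid learned directly at startup */
    int good_map[] = {0, 1, 2};    /* daemon vpid per node, correct    */
    int bad_map[]  = {1, 2, 3};    /* same data stored one slot off    */
    int peer_node  = 1;            /* the peer really shares my node   */

    printf("correct map: peer is %s\n",
           good_map[peer_node] == my_daemon ? "local" : "remote");
    printf("shifted map: peer is %s\n",
           bad_map[peer_node] == my_daemon ? "local" : "remote");
    return 0;
}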

Ralph


On Wed, Jul 1, 2009 at 2:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
      Hmmm...I'll take a look. It seems to be working for me under Torque and SLURM, though I cannot vouch for the tree launch. The problem with letting the index start at 0 is that it breaks other things, so I'll have to see about fixing the routing schemes, or find some compromise.

      Thanks for the heads up.
      Ralph



On Wed, Jul 1, 2009 at 1:49 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
      Ralph,

      This commit breaks several components in OMPI, mainly the routing schemes and the tree launch. The problematic part is in the second half of the commit, where you change the lower bound of the for loop from 0 to 1. As a result the number of declared daemons is decreased by one (I guess in order to exclude the HNP), which is not something the routing implementations tolerate.
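
      To make the miscount concrete, a toy version (made-up values, not the real nidmap code): with the lower bound at 1, the entry at index 0 (the HNP) is never counted, so num_daemons comes out one short.

      /* toy version of the counting loop, not the real code */
      #include <stdio.h>

      int main(void)
      {
          int vpids[] = {0, 1, 2, 3};   /* 4 daemons, HNP at index 0 */
          int num_nodes = 4;
          int num_daemons = 0;

          for (int i = 1; i < num_nodes; i++)   /* buggy bound: skips index 0 */
              if (vpids[i] >= 0)                /* stand-in for ORTE_VPID_INVALID check */
                  num_daemons++;

          printf("counted %d daemons, expected %d\n", num_daemons, num_nodes);
          return 0;
      }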

      Setting the loop boundary back to 0 seems to fix all the problems. Please reconsider your patch.

       george.

      On Fri, 26 Jun 2009, r...@osl.iu.edu wrote:

            Author: rhc
            Date: 2009-06-26 18:07:25 EDT (Fri, 26 Jun 2009)
            New Revision: 21548
            URL: https://svn.open-mpi.org/trac/ompi/changeset/21548

            Log:
            Cleanup some indexing bugs so that shared memory can function

            Text files modified:
             trunk/orte/util/nidmap.c |    12 +++++++-----
             1 files changed, 7 insertions(+), 5 deletions(-)

            Modified: trunk/orte/util/nidmap.c
            ==============================================================================
            --- trunk/orte/util/nidmap.c    (original)
            +++ trunk/orte/util/nidmap.c    2009-06-26 18:07:25 EDT (Fri, 26 Jun 2009)
            @@ -341,10 +341,10 @@

               /* pack every nodename individually */
               for (i=1; i < orte_node_pool->size; i++) {
            +        if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
            +            continue;
            +        }
                   if (!orte_keep_fqdn_hostnames) {
            -            if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
            -                continue;
            -            }
                       nodename = strdup(node->name);
                       if (NULL != (ptr = strchr(nodename, '.'))) {
                           *ptr = '\0';
            @@ -553,6 +553,8 @@
                   ORTE_ERROR_LOG(rc);
                   return rc;
               }
            +    /* set the daemon to 0 */
            +    node->daemon = 0;

               /* loop over nodes and unpack the raw nodename */
               for (i=1; i < num_nodes; i++) {
            @@ -570,7 +572,7 @@
                   }
               }

            -    /* unpack the daemon names */
            +    /* unpack the daemon vpids */
               vpids = (orte_vpid_t*)malloc(num_nodes * sizeof(orte_vpid_t));
               n=num_nodes;
               if (ORTE_SUCCESS != (rc = opal_dss.unpack(&buf, vpids, &n, ORTE_VPID))) {
            @@ -581,7 +583,7 @@
                * daemons in the system
                */
               num_daemons = 0;
            -    for (i=0; i < num_nodes; i++) {
            +    for (i=1; i < num_nodes; i++) {
                   if (NULL != (ndptr = (orte_nid_t*)opal_pointer_array_get_item(&orte_nidmap, i))) {
                       ndptr->daemon = vpids[i];
                       if (ORTE_VPID_INVALID != vpids[i]) {
            _______________________________________________
            svn mailing list
            s...@open-mpi.org
            http://www.open-mpi.org/mailman/listinfo.cgi/svn


      "We must accept finite disappointment, but we must never lose infinite
      hope."
                                       Martin Luther King

      _______________________________________________
      devel mailing list
      de...@open-mpi.org
      http://www.open-mpi.org/mailman/listinfo.cgi/devel
