Ok, I was assuming that setting the ranks was done the same way for the plist case as for the sparse groups (which go through translate_ranks). For the plist case I simply forgot to check that the rank is not MPI_UNDEFINED before doing the lookup and setting the rank.
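To make the intent clearer, the change is along these lines. This is only a sketch from memory, not the literal diff: ompi_group_peer_lookup and MPI_UNDEFINED are the real names from the thread, but the variable names (group_pointer, new_group_pointer, my_group_rank, my_proc_pointer), the grp_my_rank field, and the ompi_set_group_rank() call are just how I am illustrating it here; substitute whatever the plist code actually uses.

    /* Sketch of the guard in the plist path: the calling process may not be
     * a member of the source group at all, in which case its rank there is
     * MPI_UNDEFINED and must not be used as an index for the proc lookup. */
    my_group_rank = group_pointer->grp_my_rank;

    if (MPI_UNDEFINED == my_group_rank) {
        /* we are not in the source group, so we have no rank
         * in the new group either */
        new_group_pointer->grp_my_rank = MPI_UNDEFINED;
    } else {
        /* safe to look ourselves up now and record our rank in the new group */
        my_proc_pointer = ompi_group_peer_lookup(group_pointer, my_group_rank);
        ompi_set_group_rank(new_group_pointer, my_proc_pointer);
    }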
I just committed the fix. (For anyone who wants to reproduce this outside the ibm test suite, I have put a minimal stand-alone sketch of the failing pattern at the very bottom of this mail, below the quoted thread.)

On Thu, August 16, 2007 10:53 am, Tim Prins wrote:
> Mohamad,
>
> 2 processes was plenty. Like I said, when running in debug mode, it tends
> to 'work' since memory is initialized to \0 and we fall through. In an
> optimized build, looking at the mtt results it looks like it segfaults
> about 10% of the time.
>
> But if you apply the patch I sent, it will tell you when an invalid
> lookup happens, which should be every time it runs.
>
> Tim
>
> Mohamad Chaarawi wrote:
>> Hey Tim,
>>
>> I understand what you are talking about.
>> I'm trying to reproduce the problem. How many processes are you running
>> with? With the default (4 for the group) it's passing.
>>
>> Thanks,
>> Mohamad
>>
>> On Thu, August 16, 2007 7:49 am, Tim Prins wrote:
>>> Sorry, I pushed the wrong button and sent this before it was ready....
>>>
>>> Tim Prins wrote:
>>>> Hi folks,
>>>>
>>>> I am running into a problem with the ibm test 'group'. I will try to
>>>> explain what I think is going on, but I do not really understand the
>>>> group code, so please forgive me if it is wrong...
>>>>
>>>> The test creates a group based on MPI_COMM_WORLD (group1), and a group
>>>> that has half the procs in group1 (newgroup). Next, all the processes
>>>> do:
>>>>
>>>> MPI_Group_intersection(newgroup, group1, &group2)
>>>>
>>>> ompi_group_intersection figures out what procs are needed for group2,
>>>> then calls
>>>>
>>>> ompi_group_incl, passing 'newgroup' and '&group2'
>>>>
>>>> This then calls (since I am not using sparse groups)
>>>> ompi_group_incl_plist.
>>>>
>>>> However, ompi_group_incl_plist assumes that the current process is a
>>>> member of the passed group ('newgroup'). Thus when it calls
>>>> ompi_group_peer_lookup on 'newgroup', half of the processes get garbage
>>>> back since they are not in 'newgroup'. In most cases, memory is
>>>> initialized to \0 and things fall through, but we get intermittent
>>>> segfaults in optimized builds.
>>>>
>>> Here is a patch to an error check which highlights the problem:
>>>
>>> Index: group/group.h
>>> ===================================================================
>>> --- group/group.h (revision 15869)
>>> +++ group/group.h (working copy)
>>> @@ -308,7 +308,7 @@
>>>  static inline struct ompi_proc_t* ompi_group_peer_lookup(ompi_group_t *group, int peer_id)
>>>  {
>>>  #if OMPI_ENABLE_DEBUG
>>> -    if (peer_id >= group->grp_proc_count) {
>>> +    if (peer_id >= group->grp_proc_count || peer_id < 0) {
>>>          opal_output(0, "ompi_group_lookup_peer: invalid peer index (%d)", peer_id);
>>>
>>>> Thanks,
>>>>
>>>> Tim

--
Mohamad Chaarawi
Instructional Assistant           http://www.cs.uh.edu/~mschaara
Department of Computer Science    University of Houston
4800 Calhoun, PGH Room 526        Houston, TX 77204, USA
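P.S. Here is the minimal stand-alone reproducer I mentioned above. It is my own sketch of the pattern the ibm 'group' test exercises (the real test does more), but it should be enough to hit the bad lookup on the upper half of the processes in an optimized build. Run it with an even number of processes, e.g. mpirun -np 4.

    #include <mpi.h>
    #include <stdlib.h>

    /* Every process calls MPI_Group_intersection() with a first group
     * ('newgroup') that only the lower half of the processes belong to. */
    int main(int argc, char **argv)
    {
        MPI_Group group1, newgroup, group2;
        int size, i, *ranks;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_group(MPI_COMM_WORLD, &group1);

        /* newgroup = ranks 0 .. size/2 - 1 of MPI_COMM_WORLD */
        ranks = (int *) malloc((size / 2) * sizeof(int));
        for (i = 0; i < size / 2; i++) {
            ranks[i] = i;
        }
        MPI_Group_incl(group1, size / 2, ranks, &newgroup);

        /* Processes in the upper half are not members of newgroup, which
         * is where the invalid lookup used to happen inside
         * ompi_group_incl_plist. */
        MPI_Group_intersection(newgroup, group1, &group2);

        MPI_Group_free(&group2);
        MPI_Group_free(&newgroup);
        MPI_Group_free(&group1);
        free(ranks);
        MPI_Finalize();
        return 0;
    }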