Hi,
Seqf bug fixed in r18706.

Best Regards
Lenny.

On Thu, Jun 19, 2008 at 5:37 PM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
> Sorry,
> I checked it without sm.
>
> Please ignore this mail.
>
> On Thu, Jun 19, 2008 at 4:32 PM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
>
>> Hi,
>> I found what caused the problem in both cases.
>>
>> --- ompi/mca/btl/sm/btl_sm.c (revision 18675)
>> +++ ompi/mca/btl/sm/btl_sm.c (working copy)
>> @@ -812,7 +812,7 @@
>>       */
>>      MCA_BTL_SM_FIFO_WRITE(endpoint, endpoint->my_smp_rank,
>>                            endpoint->peer_smp_rank, frag->hdr, false, rc);
>> -    return (rc < 0 ? rc : 1);
>> +    return OMPI_SUCCESS;
>>  }
>>
>> I am just not sure if it's OK.
>>
>> Lenny.
>>
>> On Wed, Jun 18, 2008 at 3:21 PM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
>>
>>> Hi,
>>> I am not sure if it is related,
>>> but I applied your patch (r18667) to r18656 (one before NUMA)
>>> together with disabling sendi.
>>> The result is still the same (hanging).
>>>
>>> On Tue, Jun 17, 2008 at 2:10 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>
>>>> Lenny,
>>>>
>>>> I guess you're running the latest version. If not, please update; Galen
>>>> and I corrected some bugs last week. If you're using the latest (and
>>>> greatest) then ... well, I imagine there is at least one bug left.
>>>>
>>>> There is a quick test you can do. In btl_sm.c, in the module
>>>> structure at the beginning of the file, please replace the sendi function
>>>> with NULL. If this fixes the problem, then at least we know that it's an
>>>> sm send immediate problem.
>>>>
>>>> Thanks,
>>>> george.
>>>>
>>>> On Jun 17, 2008, at 7:54 AM, Lenny Verkhovsky wrote:
>>>>
>>>>> Hi, George,
>>>>>
>>>>> I have a problem running the BW benchmark on a 100-rank cluster after r18551.
>>>>> The BW is mpi_p, which runs mpi_bandwidth with 100K between all pairs.
>>>>>
>>>>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18549 -t bw -s 100000
>>>>> BW (100) (size min max avg) 100000 576.734030 2001.882416 1062.698408
>>>>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18551 -t bw -s 100000
>>>>> mpirun: killing job...
>>>>> ( it hangs even after 10 hours ).
>>>>>
>>>>> It doesn't happen if I run --bynode or btl openib,self only.
>>>>>
>>>>> Lenny.