I took a look at the code, but I'm afraid I don't see anything wrong.

p.
On Thu, Aug 19, 2010 at 2:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Yes, that is correct - we reserve the first port in the range for a daemon,
> should one exist.
>
> The problem is clearly that get_node_rank is returning the wrong value for
> the second process (your rank=1). If you want to dig deeper, look at the
> orte/mca/ess/generic code where it generates the nidmap and pidmap. There is
> a bug down there somewhere that gives the wrong answer when ppn > 1.
>
> On Thu, Aug 19, 2010 at 12:12 PM, Philippe <phil...@mytoaster.net> wrote:
>>
>> Ralph,
>>
>> somewhere in ./orte/mca/oob/tcp/oob_tcp.c, there is this code:
>>
>>     orte_node_rank_t nrank;
>>     /* do I know my node_local_rank yet? */
>>     if (ORTE_NODE_RANK_INVALID != (nrank =
>>             orte_ess.get_node_rank(ORTE_PROC_MY_NAME)) &&
>>         (nrank+1) < opal_argv_count(mca_oob_tcp_component.tcp4_static_ports)) {
>>         /* any daemon takes the first entry, so we start
>>            with the second */
>>
>> which seems consistent with process #0 listening on 10001. The question
>> would be why process #1 attempts to connect to port 10000 then? Or maybe
>> it's totally unrelated :-)
>>
>> btw, if I trick process #1 into opening the connection to 10001 by
>> shifting the range, I now get this error and the process terminates
>> immediately:
>>
>> [c0301b10e1:03919] [[0,9999],1]-[[0,0],0]
>> mca_oob_tcp_peer_recv_connect_ack: received unexpected process
>> identifier [[0,9999],0]
>>
>> good luck with the surgery and wishing you a prompt recovery!
>>
>> p.
>>
>> On Thu, Aug 19, 2010 at 2:02 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> > Something doesn't look right - here is what the algo attempts to do:
>> > given a port range of 10000-12000, the lowest-ranked process on the node
>> > should open port 10000. The next lowest rank on the node will open
>> > 10001, etc.
>> >
>> > So it looks to me like there is some confusion in the local rank algo.
>> > I'll have to look at the generic module - must be a bug in it somewhere.
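For reference, the port-selection scheme being debated above (first entry of the static range reserved for a daemon, the process with node rank n taking entry n+1) can be sketched as a standalone snippet. This is illustrative only, not Open MPI's actual code; the function name is made up:

```c
/* Illustrative sketch, not Open MPI code: with a static port range
 * starting at base_port, entry 0 is reserved for a daemon, so the
 * process with node-local rank n listens on entry n+1 -- the
 * "(nrank+1)" index in the snippet quoted above. */
int static_port_for(int base_port, int node_rank)
{
    return base_port + node_rank + 1;
}
```

With the 10000-12000 range from the thread, `static_port_for(10000, 0)` yields 10001, matching what rank #0 is observed to do. A peer would therefore need to connect to 10001 to reach rank #0; rank #1 attempting 10000 instead is consistent with get_node_rank returning the wrong value for it.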
>> > This might take a couple of days as I have surgery tomorrow morning, so
>> > please forgive the delay.
>> >
>> > On Thu, Aug 19, 2010 at 11:13 AM, Philippe <phil...@mytoaster.net> wrote:
>> >>
>> >> Ralph,
>> >>
>> >> I'm able to use the generic module when the processes are on different
>> >> machines.
>> >>
>> >> What would the values of the EVs be when two processes are on the same
>> >> machine (hopefully talking over SHM)?
>> >>
>> >> I've played with combinations of nodelist and ppn but no luck. I get
>> >> errors like:
>> >>
>> >> [c0301b10e1:03172] [[0,9999],1] -> [[0,0],0] (node: c0301b10e1)
>> >> oob-tcp: Number of attempts to create TCP connection has been
>> >> exceeded. Can not communicate with peer
>> >> [c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file
>> >> grpcomm_hier_module.c at line 303
>> >> [c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file
>> >> base/grpcomm_base_modex.c at line 470
>> >> [c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file
>> >> grpcomm_hier_module.c at line 484
>> >> --------------------------------------------------------------------------
>> >> It looks like MPI_INIT failed for some reason; your parallel process is
>> >> likely to abort. There are many reasons that a parallel process can
>> >> fail during MPI_INIT; some of which are due to configuration or
>> >> environment problems. This failure appears to be an internal failure;
>> >> here's some additional information (which may only be relevant to an
>> >> Open MPI developer):
>> >>
>> >>   orte_grpcomm_modex failed
>> >>   --> Returned "Unreachable" (-12) instead of "Success" (0)
>> >> --------------------------------------------------------------------------
>> >> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> >> *** This is disallowed by the MPI standard.
>> >> *** Your MPI job will now abort.
>> >> [c0301b10e1:3172] Abort before MPI_INIT completed successfully; not
>> >> able to guarantee that all other processes were killed!
>> >>
>> >> Maybe a related question is how to assign the TCP port range, and how
>> >> is it used? When the processes are on different machines, I use the
>> >> same range and that's ok as long as the range is free. But when the
>> >> processes are on the same node, what value should the range be for
>> >> each process? My range is 10000-12000 (for both processes), and I see
>> >> that the process with rank #0 listens on port 10001 while the process
>> >> with rank #1 tries to establish a connection to port 10000.
>> >>
>> >> Thanks so much!
>> >> p. still here... still trying... ;-)
>> >>
>> >> On Tue, Jul 27, 2010 at 12:58 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> >> > Use what hostname returns - don't worry about IP addresses, as we'll
>> >> > discover them.
>> >> >
>> >> > On Jul 26, 2010, at 10:45 PM, Philippe wrote:
>> >> >
>> >> >> Thanks a lot!
>> >> >>
>> >> >> Now, for the EV "OMPI_MCA_orte_nodes", what do I put exactly? Our
>> >> >> nodes have a short/long name (it's RHEL 5.x, so the command hostname
>> >> >> returns the long name) and at least 2 IP addresses.
>> >> >>
>> >> >> p.
>> >> >>
>> >> >> On Tue, Jul 27, 2010 at 12:06 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> >> >>> Okay, fixed in r23499. Thanks again...
>> >> >>>
>> >> >>> On Jul 26, 2010, at 9:47 PM, Ralph Castain wrote:
>> >> >>>
>> >> >>>> Doh - yes it should! I'll fix it right now.
>> >> >>>>
>> >> >>>> Thanks!
>> >> >>>>
>> >> >>>> On Jul 26, 2010, at 9:28 PM, Philippe wrote:
>> >> >>>>
>> >> >>>>> Ralph,
>> >> >>>>>
>> >> >>>>> I was able to test the generic module and it seems to be working.
>> >> >>>>>
>> >> >>>>> One question, though: the function orte_ess_generic_component_query
>> >> >>>>> in "orte/mca/ess/generic/ess_generic_component.c" calls getenv with
>> >> >>>>> the argument "OMPI_MCA_env", which seems to cause the module to
>> >> >>>>> fail to load. Shouldn't it be "OMPI_MCA_ess"?
>> >> >>>>>
>> >> >>>>>     .....
>> >> >>>>>
>> >> >>>>>     /* only pick us if directed to do so */
>> >> >>>>>     if (NULL != (pick = getenv("OMPI_MCA_env")) &&
>> >> >>>>>         0 == strcmp(pick, "generic")) {
>> >> >>>>>         *priority = 1000;
>> >> >>>>>         *module = (mca_base_module_t *)&orte_ess_generic_module;
>> >> >>>>>
>> >> >>>>>     ...
>> >> >>>>>
>> >> >>>>> p.
>> >> >>>>>
>> >> >>>>> On Thu, Jul 22, 2010 at 5:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> >> >>>>>> Dev trunk looks okay right now - I think you'll be fine using it.
>> >> >>>>>> My new component -might- work with 1.5, but probably not with 1.4.
>> >> >>>>>> I haven't checked either of them.
>> >> >>>>>>
>> >> >>>>>> Anything at r23478 or above will have the new module. Let me know
>> >> >>>>>> how it works for you. I haven't tested it myself, but am pretty
>> >> >>>>>> sure it should work.
>> >> >>>>>>
>> >> >>>>>> On Jul 22, 2010, at 3:22 PM, Philippe wrote:
>> >> >>>>>>
>> >> >>>>>>> Ralph,
>> >> >>>>>>>
>> >> >>>>>>> Thank you so much!!
>> >> >>>>>>>
>> >> >>>>>>> I'll give it a try and let you know.
>> >> >>>>>>>
>> >> >>>>>>> I know it's a tough question, but how stable is the dev trunk?
>> >> >>>>>>> Can I just grab the latest and run, or am I better off taking
>> >> >>>>>>> your changes and copying them back into a stable release? (If so,
>> >> >>>>>>> which one? 1.4? 1.5?)
>> >> >>>>>>>
>> >> >>>>>>> p.
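The one-character nature of the fix Philippe spotted (committed as r23499 earlier in the thread) can be shown with a minimal stand-in for the selection check. The function name and return convention here are illustrative, not the actual component code:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the check in orte_ess_generic_component_query:
 * the component must consult OMPI_MCA_ess. The misspelled "OMPI_MCA_env"
 * key is never set by anyone, so the module was never selected. */
int generic_is_selected(void)
{
    const char *pick = getenv("OMPI_MCA_ess");  /* was "OMPI_MCA_env" */
    return (NULL != pick && 0 == strcmp(pick, "generic"));
}
```

With `OMPI_MCA_ess=generic` in the environment the check succeeds and the real code would then raise the component's priority so it wins selection.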
>> >> >>>>>>>
>> >> >>>>>>> On Thu, Jul 22, 2010 at 3:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> >> >>>>>>>> It was easier for me to just construct this module than to
>> >> >>>>>>>> explain how to do so :-)
>> >> >>>>>>>>
>> >> >>>>>>>> I will commit it this evening (couple of hours from now), as
>> >> >>>>>>>> that is our standard practice. You'll need to use the
>> >> >>>>>>>> developer's trunk, though, to use it.
>> >> >>>>>>>>
>> >> >>>>>>>> Here are the envars you'll need to provide:
>> >> >>>>>>>>
>> >> >>>>>>>> Each process needs to get the same following values:
>> >> >>>>>>>>
>> >> >>>>>>>> * OMPI_MCA_ess=generic
>> >> >>>>>>>> * OMPI_MCA_orte_num_procs=<number of MPI procs>
>> >> >>>>>>>> * OMPI_MCA_orte_nodes=<a comma-separated list of nodenames
>> >> >>>>>>>>   where MPI procs reside>
>> >> >>>>>>>> * OMPI_MCA_orte_ppn=<number of procs/node>
>> >> >>>>>>>>
>> >> >>>>>>>> Note that I have assumed this last value is a constant for
>> >> >>>>>>>> simplicity. If that isn't the case, let me know - you could
>> >> >>>>>>>> instead provide it as a comma-separated list of values with an
>> >> >>>>>>>> entry for each node.
>> >> >>>>>>>>
>> >> >>>>>>>> In addition, you need to provide the following value that will
>> >> >>>>>>>> be unique to each process:
>> >> >>>>>>>>
>> >> >>>>>>>> * OMPI_MCA_orte_rank=<MPI rank>
>> >> >>>>>>>>
>> >> >>>>>>>> Finally, you have to provide a range of static TCP ports for
>> >> >>>>>>>> use by the processes. Pick any range that you know will be
>> >> >>>>>>>> available across all the nodes.
>> >> >>>>>>>> You then need to ensure that each process sees the following
>> >> >>>>>>>> envar:
>> >> >>>>>>>>
>> >> >>>>>>>> * OMPI_MCA_oob_tcp_static_ports=6000-6010 <== obviously,
>> >> >>>>>>>>   replace this with your range
>> >> >>>>>>>>
>> >> >>>>>>>> You will need a port range that is at least equal to the ppn
>> >> >>>>>>>> for the job (each proc on a node will take one of the provided
>> >> >>>>>>>> ports).
>> >> >>>>>>>>
>> >> >>>>>>>> That should do it. I compute everything else I need from those
>> >> >>>>>>>> values.
>> >> >>>>>>>>
>> >> >>>>>>>> Does that work for you?
>> >> >>>>>>>> Ralph
>> >> >>
>> >> >> _______________________________________________
>> >> >> users mailing list
>> >> >> us...@open-mpi.org
>> >> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
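Pulling Ralph's recipe together, a minimal launch script for two procs on one node might look like the following. The node name and port range are placeholders, and an echo stands in for exec'ing the real MPI binary so the sketch is self-contained:

```shell
#!/bin/sh
# Settings shared by every process in the job.
export OMPI_MCA_ess=generic
export OMPI_MCA_orte_num_procs=2
export OMPI_MCA_orte_nodes=node01                  # comma-separated hostnames
export OMPI_MCA_orte_ppn=2                         # procs per node (constant)
export OMPI_MCA_oob_tcp_static_ports=10000-12000   # range at least ppn wide

# Per-process setting: each process must see its own unique rank.
for rank in 0 1; do
    # In a real job, replace the echo with the MPI application binary.
    OMPI_MCA_orte_rank=$rank sh -c \
        'echo "rank $OMPI_MCA_orte_rank using ports $OMPI_MCA_oob_tcp_static_ports"'
done
```

Running it prints one line per rank, confirming each process sees the shared settings plus its own OMPI_MCA_orte_rank.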