Okay, fixed in r23499. Thanks again...

On Jul 26, 2010, at 9:47 PM, Ralph Castain wrote:

> Doh - yes it should! I'll fix it right now.
> 
> Thanks!
> 
> On Jul 26, 2010, at 9:28 PM, Philippe wrote:
> 
>> Ralph,
>> 
>> I was able to test the generic module and it seems to be working.
>> 
>> One question, though: the function orte_ess_generic_component_query in
>> "orte/mca/ess/generic/ess_generic_component.c" calls getenv with the
>> argument "OMPI_MCA_env", which seems to cause the module to fail to
>> load. Shouldn't it be "OMPI_MCA_ess"?
>> 
>> .....
>> 
>>   /* only pick us if directed to do so */
>>   if (NULL != (pick = getenv("OMPI_MCA_env")) &&
>>                0 == strcmp(pick, "generic")) {
>>       *priority = 1000;
>>       *module = (mca_base_module_t *)&orte_ess_generic_module;
>> 
>> ...
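>> 
>> For reference, a sketch of how I would expect that check to read (the
>> same excerpt with only the envar name changed -- just my suggestion,
>> not the actual commit):
>> 
>>   /* only pick us if directed to do so */
>>   if (NULL != (pick = getenv("OMPI_MCA_ess")) &&
>>                0 == strcmp(pick, "generic")) {
>>       *priority = 1000;
>>       *module = (mca_base_module_t *)&orte_ess_generic_module;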
>> 
>> p.
>> 
>> On Thu, Jul 22, 2010 at 5:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Dev trunk looks okay right now - I think you'll be fine using it. My new 
>>> component -might- work with 1.5, but probably not with 1.4. I haven't 
>>> checked either of them.
>>> 
>>> Anything at r23478 or above will have the new module. Let me know how it 
>>> works for you. I haven't tested it myself, but am pretty sure it should 
>>> work.
>>> 
>>> 
>>> On Jul 22, 2010, at 3:22 PM, Philippe wrote:
>>> 
>>>> Ralph,
>>>> 
>>>> Thank you so much!!
>>>> 
>>>> I'll give it a try and let you know.
>>>> 
>>>> I know it's a tough question, but how stable is the dev trunk? Can I
>>>> just grab the latest and run, or am I better off taking your changes
>>>> and copying them back into a stable release? (if so, which one? 1.4? 1.5?)
>>>> 
>>>> p.
>>>> 
>>>> On Thu, Jul 22, 2010 at 3:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> It was easier for me to just construct this module than to explain how to 
>>>>> do so :-)
>>>>> 
>>>>> I will commit it this evening (couple of hours from now) as that is our 
>>>>> standard practice. You'll need to use the developer's trunk, though, to 
>>>>> use it.
>>>>> 
>>>>> Here are the envars you'll need to provide:
>>>>> 
>>>>> Each process needs to get the following values, the same on every process:
>>>>> 
>>>>> * OMPI_MCA_ess=generic
>>>>> * OMPI_MCA_orte_num_procs=<number of MPI procs>
>>>>> * OMPI_MCA_orte_nodes=<a comma-separated list of nodenames where MPI 
>>>>> procs reside>
>>>>> * OMPI_MCA_orte_ppn=<number of procs/node>
>>>>> 
>>>>> Note that I have assumed this last value is a constant for simplicity. If 
>>>>> that isn't the case, let me know - you could instead provide it as a 
>>>>> comma-separated list of values with an entry for each node.
>>>>> 
>>>>> In addition, you need to provide the following value that will be unique 
>>>>> to each process:
>>>>> 
>>>>> * OMPI_MCA_orte_rank=<MPI rank>
>>>>> 
>>>>> Finally, you have to provide a range of static TCP ports for use by the 
>>>>> processes. Pick any range that you know will be available across all the 
>>>>> nodes. You then need to ensure that each process sees the following envar:
>>>>> 
>>>>> * OMPI_MCA_oob_tcp_static_ports=6000-6010  <== obviously, replace this 
>>>>> with your range
>>>>> 
>>>>> You will need a port range at least as large as the ppn for the job
>>>>> (each proc on a node will take one of the provided ports).
>>>>> 
>>>>> That should do it. I compute everything else I need from those values.
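>>>>> 
>>>>> In case it helps, here is a minimal sketch of how a custom launcher
>>>>> might export these values before exec'ing one rank. The process count,
>>>>> node list, ppn, port range, and binary path below are made-up
>>>>> placeholders -- substitute your own:
>>>>> 
>>>>> /* hypothetical per-rank launcher: ./launch_one <mpi-rank> */
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <unistd.h>
>>>>> 
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     if (argc < 2) {
>>>>>         fprintf(stderr, "usage: %s <mpi-rank>\n", argv[0]);
>>>>>         return 1;
>>>>>     }
>>>>> 
>>>>>     /* identical for every process in the job */
>>>>>     setenv("OMPI_MCA_ess", "generic", 1);
>>>>>     setenv("OMPI_MCA_orte_num_procs", "4", 1);               /* total MPI procs */
>>>>>     setenv("OMPI_MCA_orte_nodes", "node001,node002", 1);     /* comma-separated */
>>>>>     setenv("OMPI_MCA_orte_ppn", "2", 1);                     /* procs per node  */
>>>>>     setenv("OMPI_MCA_oob_tcp_static_ports", "6000-6010", 1); /* >= ppn ports    */
>>>>> 
>>>>>     /* unique to each process */
>>>>>     setenv("OMPI_MCA_orte_rank", argv[1], 1);
>>>>> 
>>>>>     /* hand off to the MPI binary (placeholder path) */
>>>>>     execl("./my_mpi_app", "my_mpi_app", (char *)NULL);
>>>>>     perror("execl");
>>>>>     return 1;
>>>>> }
>>>>> 
>>>>> Your job manager would do the equivalent of this once per rank, on the
>>>>> node where that rank is supposed to run.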
>>>>> 
>>>>> Does that work for you?
>>>>> Ralph
>>>>> 
>>>>> 
>>>>> On Jul 22, 2010, at 6:48 AM, Philippe wrote:
>>>>> 
>>>>>> On Wed, Jul 21, 2010 at 10:44 AM, Ralph Castain <r...@open-mpi.org> 
>>>>>> wrote:
>>>>>>> 
>>>>>>> On Jul 21, 2010, at 7:44 AM, Philippe wrote:
>>>>>>> 
>>>>>>>> Ralph,
>>>>>>>> 
>>>>>>>> Sorry for the late reply -- I was away on vacation.
>>>>>>> 
>>>>>>> no problem at all!
>>>>>>> 
>>>>>>>> 
>>>>>>>> Regarding your earlier question about how many processes were
>>>>>>>> involved when the memory was entirely allocated, it was only two: a
>>>>>>>> sender and a receiver. I'm still trying to pinpoint what can be
>>>>>>>> different between the standalone case and the "integrated" case. I
>>>>>>>> will try to find out what part of the code is allocating memory in a
>>>>>>>> loop.
>>>>>>> 
>>>>>>> hmmm....that sounds like a bug in your program. let me know what you 
>>>>>>> find
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Jul 20, 2010 at 12:51 AM, Ralph Castain <r...@open-mpi.org> 
>>>>>>>> wrote:
>>>>>>>>> Well, I finally managed to make this work without the required 
>>>>>>>>> ompi-server rendezvous point. The fix is only in the devel trunk 
>>>>>>>>> right now - I'll have to ask the release managers for 1.5 and 1.4 if 
>>>>>>>>> they want it ported to those series.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> great -- i'll give it a try
>>>>>>>> 
>>>>>>>>> On the notion of integrating OMPI to your launch environment: 
>>>>>>>>> remember that we don't necessarily require that you use mpiexec for 
>>>>>>>>> that purpose. If your launch environment provides just a little info 
>>>>>>>>> in the environment of the launched procs, we can usually devise a 
>>>>>>>>> method that allows the procs to perform an MPI_Init as a single job 
>>>>>>>>> without all this work you are doing.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> I'm working on creating operators using MPI for the IBM product
>>>>>>>> "InfoSphere Streams". It has its own launching mechanism to start the
>>>>>>>> processes. However, I can pass some information to the processes that
>>>>>>>> belong to the same job (a Streams job -- which should neatly map to an
>>>>>>>> MPI job).
>>>>>>>> 
>>>>>>>>> Only difference is that your procs will all block in MPI_Init until 
>>>>>>>>> they -all- have executed that function. If that isn't a problem, this 
>>>>>>>>> would be a much more scalable and reliable method than doing it
>>>>>>>>> through massive calls to MPI_Comm_connect.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> in the general case, that would be a problem, but for my prototype,
>>>>>>>> this is acceptable.
>>>>>>>> 
>>>>>>>> In general, each process is composed of operators; some may be
>>>>>>>> MPI-related and some may not. But in my case, I know ahead of time
>>>>>>>> which processes will be part of the MPI job, so I can easily deal with
>>>>>>>> the fact that they would block on MPI_Init (actually -- MPI_Init_thread,
>>>>>>>> since it's using a lot of threads).
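>>>>>>>> 
>>>>>>>> Just to make that concrete, the init in the MPI operators would look
>>>>>>>> roughly like this (a sketch, not our actual operator code):
>>>>>>>> 
>>>>>>>> #include <mpi.h>
>>>>>>>> #include <stdio.h>
>>>>>>>> 
>>>>>>>> int main(int argc, char **argv)
>>>>>>>> {
>>>>>>>>     int provided;
>>>>>>>> 
>>>>>>>>     /* several threads may make MPI calls, so ask for full support */
>>>>>>>>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>>>>>>>     if (provided < MPI_THREAD_MULTIPLE) {
>>>>>>>>         fprintf(stderr, "only got thread level %d\n", provided);
>>>>>>>>     }
>>>>>>>> 
>>>>>>>>     /* ... connect/accept/merge and the actual work go here ... */
>>>>>>>> 
>>>>>>>>     MPI_Finalize();
>>>>>>>>     return 0;
>>>>>>>> }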
>>>>>>> 
>>>>>>> We have talked in the past about creating a non-blocking MPI_Init as an 
>>>>>>> extension to the standard. It would lock you to Open MPI, though...
>>>>>>> 
>>>>>>> Regardless, at some point you would have to know how many processes are 
>>>>>>> going to be part of the job so you can know when MPI_Init is complete. 
>>>>>>> I would think you would require that info for the singleton wireup 
>>>>>>> anyway - yes? Otherwise, how would you know when to quit running 
>>>>>>> connect-accept?
>>>>>>> 
>>>>>> 
>>>>>> The short answer is yes... although the longer answer is a bit more
>>>>>> complicated. Currently I do know the number of connects I need to do on
>>>>>> a per-port basis. A job can contain an arbitrary number of MPI
>>>>>> processes, each opening one or more ports, so I know the count port by
>>>>>> port, but I don't need to worry about how many MPI processes there are
>>>>>> globally. To make things a bit more complicated, each MPI operator can
>>>>>> be "fused" with other operators to make a process, and each fused
>>>>>> operator may or may not require MPI. The bottom line is, to get the
>>>>>> total number of processes to calculate rank & size, I need to
>>>>>> reverse-engineer the fusing that the compiler may do.
>>>>>> 
>>>>>> But that's OK, I'm willing to do that for our prototype :-)
>>>>>> 
>>>>>>>> 
>>>>>>>> Is there documentation or an example I can use to see what information
>>>>>>>> I can pass to the processes to enable that? Is it just environment
>>>>>>>> variables?
>>>>>>> 
>>>>>>> No real documentation - a lack I should probably fill. At the moment, 
>>>>>>> we don't have a "generic" module for standalone launch, but I can 
>>>>>>> create one as it is pretty trivial. I would then need you to pass each 
>>>>>>> process envars telling it the total number of processes in the MPI job, 
>>>>>>> its rank within that job, and a file where some rendezvous process (can 
>>>>>>> be rank=0) has provided that port string. Armed with that info, I can 
>>>>>>> wire up the job.
>>>>>>> 
>>>>>>> Won't be as scalable as an mpirun-initiated startup, but will be much 
>>>>>>> better than doing it from singletons.
>>>>>> 
>>>>>> that would be great. I can definitely pass environment variables to
>>>>>> each process.
>>>>>> 
>>>>>>> 
>>>>>>> Or if you prefer, we could set up an "infosphere" module that we could 
>>>>>>> customize for this system. Main thing here would be to provide us with 
>>>>>>> some kind of regex (or access to a file containing the info) that 
>>>>>>> describes the map of rank to node so we can construct the wireup 
>>>>>>> communication pattern.
>>>>>>> 
>>>>>> 
>>>>>> I think for our prototype we are fine with the first method. I'd leave
>>>>>> the cleaner implementation as a task for the product team ;-)
>>>>>> 
>>>>>> Regarding the "generic" module, is that something you can put together
>>>>>> quickly? Can I help in any way?
>>>>>> 
>>>>>> Thanks!
>>>>>> p
>>>>>> 
>>>>>>> Either way would work. The second is more scalable, but I don't know if 
>>>>>>> you have (or can construct) the map info.
>>>>>>> 
>>>>>>>> 
>>>>>>>> Many thanks!
>>>>>>>> p.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Jul 18, 2010, at 4:09 PM, Philippe wrote:
>>>>>>>>> 
>>>>>>>>>> Ralph,
>>>>>>>>>> 
>>>>>>>>>> thanks for investigating.
>>>>>>>>>> 
>>>>>>>>>> I've applied the two patches you mentioned earlier and ran with the
>>>>>>>>>> ompi server. Although I was able to run our standalone test, when I
>>>>>>>>>> integrated the changes into our code, the processes entered a crazy
>>>>>>>>>> loop and allocated all the available memory when calling
>>>>>>>>>> MPI_Comm_connect. I was not able to identify why it works standalone
>>>>>>>>>> but not integrated with our code. If I find out why, I'll let you
>>>>>>>>>> know.
>>>>>>>>>> 
>>>>>>>>>> looking forward to your findings. We'll be happy to test any patches
>>>>>>>>>> if you have some!
>>>>>>>>>> 
>>>>>>>>>> p.
>>>>>>>>>> 
>>>>>>>>>> On Sat, Jul 17, 2010 at 9:47 PM, Ralph Castain <r...@open-mpi.org> 
>>>>>>>>>> wrote:
>>>>>>>>>>> Okay, I can reproduce this problem. Frankly, I don't think this 
>>>>>>>>>>> ever worked with OMPI, and I'm not sure how the choice of BTL makes 
>>>>>>>>>>> a difference.
>>>>>>>>>>> 
>>>>>>>>>>> The program is crashing in the communicator definition, which 
>>>>>>>>>>> involves a communication over our internal out-of-band messaging 
>>>>>>>>>>> system. That system has zero connection to any BTL, so it should 
>>>>>>>>>>> crash either way.
>>>>>>>>>>> 
>>>>>>>>>>> Regardless, I will play with this a little as time allows. Thanks 
>>>>>>>>>>> for the reproducer!
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Jun 25, 2010, at 7:23 AM, Philippe wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm trying to run a test program which consists of a server 
>>>>>>>>>>>> creating a
>>>>>>>>>>>> port using MPI_Open_port and N clients using MPI_Comm_connect to
>>>>>>>>>>>> connect to the server.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm able to do so with 1 server and 2 clients, but with 1 server + 
>>>>>>>>>>>> 3
>>>>>>>>>>>> clients, I get the following error message:
>>>>>>>>>>>> 
>>>>>>>>>>>>  [node003:32274] [[37084,0],0]:route_callback tried routing message
>>>>>>>>>>>> from [[37084,1],0] to [[40912,1],0]:102, can't find route
>>>>>>>>>>>> 
>>>>>>>>>>>> This is only happening with the openib BTL. With the tcp BTL it
>>>>>>>>>>>> works perfectly fine (ofud also works, as a matter of fact...). This
>>>>>>>>>>>> has been tested on two completely different clusters, with identical
>>>>>>>>>>>> results. In both cases, the IB fabric works normally.
>>>>>>>>>>>> 
>>>>>>>>>>>> Any help would be greatly appreciated! Several people in my team
>>>>>>>>>>>> looked at the problem. Google and the mailing list archive did not
>>>>>>>>>>>> provide any clue. I believe that from an MPI standpoint, my test
>>>>>>>>>>>> program is valid (and it works with TCP, which makes me feel better
>>>>>>>>>>>> about the sequence of MPI calls).
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Philippe.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Background:
>>>>>>>>>>>> 
>>>>>>>>>>>> I intend to use Open MPI to transport data inside a much larger
>>>>>>>>>>>> application. Because of that, I cannot use mpiexec. Each process is
>>>>>>>>>>>> started by our own "job management" and uses a name server to find
>>>>>>>>>>>> out about the others. Once all the clients are connected, I would
>>>>>>>>>>>> like the server to do MPI_Recv to get the data from all the clients.
>>>>>>>>>>>> I don't care about the order or which client is sending data, as
>>>>>>>>>>>> long as I can receive it with one call. To do that, the clients and
>>>>>>>>>>>> the server go through a series of
>>>>>>>>>>>> Comm_accept/Comm_connect/Intercomm_merge calls so that at the end,
>>>>>>>>>>>> all the clients and the server are inside the same intracomm.
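>>>>>>>>>>>> 
>>>>>>>>>>>> For illustration, the pattern looks roughly like this (a sketch, not
>>>>>>>>>>>> the actual ben12.c -- the port-file exchange and all error checking
>>>>>>>>>>>> are left out):
>>>>>>>>>>>> 
>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>> 
>>>>>>>>>>>> /* index 0 = server, 1..nclients = clients; 'port' is the string from
>>>>>>>>>>>>  * MPI_Open_port on the server, which the clients read from the file */
>>>>>>>>>>>> static MPI_Comm build_intracomm(int my_index, int nclients, char *port)
>>>>>>>>>>>> {
>>>>>>>>>>>>     MPI_Comm intra = MPI_COMM_SELF;   /* grows with every merge */
>>>>>>>>>>>>     MPI_Comm inter;
>>>>>>>>>>>> 
>>>>>>>>>>>>     if (my_index > 0) {
>>>>>>>>>>>>         /* client: connect to the server, then merge into the intracomm */
>>>>>>>>>>>>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
>>>>>>>>>>>>         MPI_Intercomm_merge(inter, 1 /* high side */, &intra);
>>>>>>>>>>>>         MPI_Comm_free(&inter);
>>>>>>>>>>>>     }
>>>>>>>>>>>> 
>>>>>>>>>>>>     /* everyone already in 'intra' collectively accepts each later
>>>>>>>>>>>>      * client; rank 0 of 'intra' is always the server, which owns
>>>>>>>>>>>>      * the port (the port argument only matters at that root) */
>>>>>>>>>>>>     for (int i = my_index + 1; i <= nclients; i++) {
>>>>>>>>>>>>         MPI_Comm merged;
>>>>>>>>>>>>         MPI_Comm_accept(port, MPI_INFO_NULL, 0, intra, &inter);
>>>>>>>>>>>>         MPI_Intercomm_merge(inter, 0 /* low side */, &merged);
>>>>>>>>>>>>         MPI_Comm_free(&inter);
>>>>>>>>>>>>         if (intra != MPI_COMM_SELF)
>>>>>>>>>>>>             MPI_Comm_free(&intra);
>>>>>>>>>>>>         intra = merged;
>>>>>>>>>>>>     }
>>>>>>>>>>>>     return intra;   /* the server and all clients, one intracomm */
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> Each process calls MPI_Init first; the server also does MPI_Open_port
>>>>>>>>>>>> and writes the string to the shared file before the clients read it
>>>>>>>>>>>> back and call this.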
>>>>>>>>>>>> 
>>>>>>>>>>>> Steps:
>>>>>>>>>>>> 
>>>>>>>>>>>> I have a sample program that shows the issue. I tried to make it as
>>>>>>>>>>>> short as possible. It needs to be executed on a shared file system
>>>>>>>>>>>> like NFS because the server writes the port info to a file that the
>>>>>>>>>>>> clients will read. To reproduce the issue, the following steps
>>>>>>>>>>>> should be performed:
>>>>>>>>>>>> 
>>>>>>>>>>>> 0. compile the test with "mpicc -o ben12 ben12.c"
>>>>>>>>>>>> 1. ssh to the machine that will be the server
>>>>>>>>>>>> 2. run ./ben12 3 1
>>>>>>>>>>>> 3. ssh to the machine that will be the client #1
>>>>>>>>>>>> 4. run ./ben12 3 0
>>>>>>>>>>>> 5. repeat step 3-4 for client #2 and #3
>>>>>>>>>>>> 
>>>>>>>>>>>> The server accepts the connection from client #1 and merges it into
>>>>>>>>>>>> a new intracomm. It then accepts the connection from client #2 and
>>>>>>>>>>>> merges it. When client #3 arrives, the server accepts the
>>>>>>>>>>>> connection, but that causes clients #1 and #2 to die with the error
>>>>>>>>>>>> above (see the complete trace in the tarball).
>>>>>>>>>>>> 
>>>>>>>>>>>> The exact steps are:
>>>>>>>>>>>> 
>>>>>>>>>>>>    - server open port
>>>>>>>>>>>>    - server does accept
>>>>>>>>>>>>    - client #1 does connect
>>>>>>>>>>>>    - server and client #1 do merge
>>>>>>>>>>>>    - server does accept
>>>>>>>>>>>>    - client #2 does connect
>>>>>>>>>>>>    - server, client #1 and client #2 do merge
>>>>>>>>>>>>    - server does accept
>>>>>>>>>>>>    - client #3 does connect
>>>>>>>>>>>>    - server, client #1, client #2 and client #3 do merge
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> My InfiniBand network works normally with other test programs and
>>>>>>>>>>>> applications (MPI or others, like Verbs).
>>>>>>>>>>>> 
>>>>>>>>>>>> Info about my setup:
>>>>>>>>>>>> 
>>>>>>>>>>>>   Open MPI version = 1.4.1 (I also tried 1.4.2, nightly snapshot of
>>>>>>>>>>>> 1.4.3, nightly snapshot of 1.5 --- all show the same error)
>>>>>>>>>>>>   config.log in the tarball
>>>>>>>>>>>>   "ompi_info --all" in the tarball
>>>>>>>>>>>>   OFED version = 1.3 installed from RHEL 5.3
>>>>>>>>>>>>   Distro = Red Hat Enterprise Linux 5.3
>>>>>>>>>>>>   Kernel = 2.6.18-128.4.1.el5 x86_64
>>>>>>>>>>>>   subnet manager = built-in SM from the cisco/topspin switch
>>>>>>>>>>>>   output of ibv_devinfo included in the tarball (there are no 
>>>>>>>>>>>> "bad" nodes)
>>>>>>>>>>>>   "ulimit -l" says "unlimited"
>>>>>>>>>>>> 
>>>>>>>>>>>> The tarball contains:
>>>>>>>>>>>> 
>>>>>>>>>>>>  - ben12.c: my test program showing the behavior
>>>>>>>>>>>>  - config.log / config.out / make.out / make-install.out /
>>>>>>>>>>>> ifconfig.txt / ibv-devinfo.txt / ompi_info.txt
>>>>>>>>>>>>  - trace-tcp.txt: output of the server and each client when it 
>>>>>>>>>>>> works
>>>>>>>>>>>> with TCP (I added "btl = tcp,self" in ~/.openmpi/mca-params.conf)
>>>>>>>>>>>>  - trace-ib.txt: output of the server and each client when it fails
>>>>>>>>>>>> with IB (I added "btl = openib,self" in ~/.openmpi/mca-params.conf)
>>>>>>>>>>>> 
>>>>>>>>>>>> I hope I provided enough info for somebody to reproduce the 
>>>>>>>>>>>> problem...
>>>>>>>>>>>> <ompi-output.tar.bz2>