Re: [OMPI devel] openib btl header caching
Ok, here are the numbers on my machines (0-byte latency, usec):

mvapich with header caching: 1.56
mvapich without header caching: 1.79
ompi 1.2: 1.59

So on zero bytes ompi is not so bad. We can also see that header caching decreases the mvapich latency by 0.23.

1-byte latency:

mvapich with header caching: 1.58
mvapich without header caching: 1.83
ompi 1.2: 1.73

> Is this just convertor initialization cost? - Galen

And here ompi makes some latency jump. In mvapich the header caching decreases the header size from 56 bytes to 12 bytes. What is the header size (pml + btl) in ompi?

> The match header size is 16 bytes, so it looks like ours is already optimized ... george.

So for a 0-byte message we are sending only 16 bytes on the wire, is that correct?

Pasha.
___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] openib btl header caching
I think we need to take a step back from micro-optimizations such as header caching. Rich, George, Brian, and I are currently looking into latency improvements. We came up with several areas of performance enhancement that can be done with minimal disruption. The progress issue that Christian and others have pointed out does appear to be a problem, but will take a bit more work. I would like to see progress in these areas first, as I really don't like the idea of caching more endpoint state in OMPI for micro-benchmark latency improvements until we are certain we have done the groundwork for improving latency in the general case. Here are the items we have identified:

1) Remove the 0-byte optimization of not initializing the convertor. This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an "if" in mca_pml_ob1_send_request_start_copy.
+++ Measure the convertor initialization before taking any other action.

2) Get rid of mca_pml_ob1_send_request_start_prepare and mca_pml_ob1_send_request_start_copy by removing the MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send return OMPI_SUCCESS if the fragment can be marked as completed and OMPI_NOT_ON_WIRE if it cannot. This solves another problem: with IB, if there are a bunch of isends outstanding we end up buffering them all in the BTL, marking completion, and never getting them on the wire because the BTL runs out of credits; we never get credits back until finalize because we never call progress, since the requests are complete. There is one issue here: start_prepare calls prepare_src and start_copy calls alloc. I think we can work around this by always using prepare_src; the OpenIB BTL will give a fragment off the free list anyway because the fragment is less than the eager limit.
+++ Make the BTL return different return codes for the send. If the fragment is gone, then the PML is responsible for marking the MPI request as completed and so on. 
Only the updated BTLs will get any benefit from this feature. Add a flag into the descriptor to allow (or not) the BTL to free the fragment. Add a 3-level flag:
- BTL_HAVE_OWNERSHIP: the fragment can be released by the BTL after the send, and then it reports back a special return code to the PML.
- BTL_HAVE_OWNERSHIP_AFTER_CALLBACK: the fragment will be released by the BTL once the completion callback has been triggered.
- PML_HAVE_OWNERSHIP: the BTL is not allowed to release the fragment at all (the PML is responsible for this).
Return codes:
- done, and there will be no callbacks
- not done, wait for a callback later
- error state

3) Change the remote callback function (and tag value, based on what data we are sending); don't use mca_pml_ob1_recv_frag_callback for everything! I think we need:
mca_pml_ob1_recv_frag_match
mca_pml_ob1_recv_frag_rndv
mca_pml_ob1_recv_frag_rget
mca_pml_ob1_recv_match_ack_copy
mca_pml_ob1_recv_match_ack_pipeline
mca_pml_ob1_recv_copy_frag
mca_pml_ob1_recv_put_request
mca_pml_ob1_recv_put_fin
+++ Passing the callback as a parameter to the match function will save us 2 switches. Add more registrations in the BTL in order to jump directly into the correct function (the first 3 require a match while the others don't). Use 4 & 4 bits on the tag so each layer will have 4 bits of tags (i.e. the first 4 bits for the protocol tag, and the lower 4 bits are up to the protocol), and the registration table will still be local to each component.

4) Get rid of mca_pml_ob1_recv_request_progress; this does the same switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback! I think what we can do here is modify mca_pml_ob1_recv_frag_match to take a function pointer for what it should call on a successful match. So based on the receive callback we can pass the correct scheduling function to invoke into the generic mca_pml_ob1_recv_frag_match. Recv_request_progress is called in a generic way from multiple places, and we do a big switch inside. 
In the match function we might want to pass a function pointer to the successful-match progress function. This way we will be able to specialize what happens after the match in a more optimized way. Or recv_request_match can return the match, and then the caller will have to specialize its action. ---
[OMPI devel] OMPI_FREE_LIST improvements
In working on my changes in the ib_multifrag branch I modified the ompi_free_list. The change enables a free list to have a bit more personality than what is dictated by the type of the item on the free list. The overall problem was that we often use different free list item types simply to distinguish sizes of the free list items. In an ideal world we would just have constructors that accepted arguments. There are numerous problems with this approach, but mostly it would require a major change to the object system, and I don't think we want that. So instead I modified the free list to allow an optional "post constructor" initialization function to be run on each free list item, with optional opaque data passed to the initialization routine. Here is the signature of the initialization routine:

typedef void (*ompi_free_list_item_init_fn_t) (struct ompi_free_list_item_t*, void* ctx);

I also added two new items to the free list struct:

struct ompi_free_list_t {
    ...
    ompi_free_list_item_init_fn_t item_init;
    void* ctx;
};

The current ompi_free_list_init function didn't change at all; instead I added these optional params to ompi_free_list_init_ex:

OMPI_DECLSPEC int ompi_free_list_init_ex(
    ompi_free_list_t *free_list,
    size_t element_size,
    size_t alignment,
    opal_class_t* element_class,
    int num_elements_to_alloc,
    int max_elements_to_alloc,
    int num_elements_per_alloc,
    struct mca_mpool_base_module_t*,
    ompi_free_list_item_init_fn_t item_init,
    void *ctx );

So all the free list does is run the function specified by "item_init" on each created free list item (after calling OBJ_CONSTRUCT_INTERNAL). For those that don't need this new functionality, simply pass two NULLs to ompi_free_list_init_ex:

ompi_free_list_init_ex(&btl->udapl_frag_eager,
    sizeof(mca_btl_udapl_frag_eager_t) +
        mca_btl_udapl_component.udapl_eager_frag_size,
    mca_btl_udapl_component.udapl_buffer_alignment,
    OBJ_CLASS(mca_btl_udapl_frag_eager_t),
    mca_btl_udapl_component.udapl_free_list_num,
    mca_btl_udapl_component.udapl_free_list_max,
    mca_btl_udapl_component.udapl_free_list_inc,
    btl->super.btl_mpool,
    NULL, NULL);

Again, if you are using ompi_free_list_init you won't be affected. I think this functionality makes sense; it reduced the number of different free list item types in the OpenIB BTL and allows me to have numerous free lists of the same item type but with slightly different characteristics. Here is an example of how I use this in the OpenIB BTL:

init_data = (mca_btl_openib_frag_init_data_t*)
    malloc(sizeof(mca_btl_openib_frag_init_data_t));
init_data->length = length;
init_data->type = MCA_BTL_OPENIB_FRAG_SEND_USER;
init_data->order = mca_btl_openib_component.rdma_qp;
init_data->list = &openib_btl->send_user_free;

ompi_free_list_init_ex(&openib_btl->send_user_free,
    length, 2,
    OBJ_CLASS(mca_btl_openib_send_user_frag_t),
    mca_btl_openib_component.ib_free_list_num,
    mca_btl_openib_component.ib_free_list_max,
    mca_btl_openib_component.ib_free_list_inc,
    NULL, mca_btl_openib_frag_init, (void*)init_data);

Thanks, Galen
Re: [OMPI devel] OpenIB BTL and SRQs
On Jul 12, 2007, at 10:29 AM, Don Kerr wrote: Through MCA parameters one can select the use of shared receive queues in the openib BTL; other than having fewer queues, I am wondering what the benefits of using this option are. Can anyone elaborate on using them vs. the default? In the trunk the number of queue pairs is the same regardless of SRQ or non-SRQ, henceforth named PP (per-peer). The difference is that PP receive resources scale with the number of active QP connections; SRQ receive resources do not. So the real difference is the memory footprint of the receive resources: SRQ is potentially much smaller. This comes at a cost; SRQ does not have flow control, as we cannot reserve resources for a particular peer, so we do have the possibility of an RNR (receiver not ready) NAK if all the shared receive resources are consumed and some peer is still transmitting messages. This has a performance penalty, as an RNR NAK stalls the IB pipeline. With PP we can guarantee that resources are available to the peer and thereby avoid RNR (although there is a bug in the trunk right now in that sometimes we get RNR even with PP, but this is being worked on). I have been working on a modification to the OpenIB BTL which allows the user to specify SRQ and PP QPs arbitrarily. That is, we can use a mix of PP and SRQ with a mix of receive sizes for each. This is coming into the trunk very soon, perhaps tomorrow, but we need to verify the branch with some additional testing. I hope this helps. I have a paper at EuroPVM/MPI that discusses much of this; I will send you a copy off-list. - Galen
Re: [OMPI devel] [devel-core] Collective Communications Optimization - MeetingScheduled in Albuquerque!
Hotels near the airport / university area; I pulled this off of this site: http://www.airnav.com/airport/KABQ

(miles from airport, price $)
FAIRFIELD INN BY MARRIOTT ALBUQUERQUE UNIVERSITY AREA  (4.8, 79-80)
COMFORT INN AIRPORT  (1.3, 52-101)
COURTYARD BY MARRIOTT ALBUQUERQUE AIRPORT  (1.6, 74-139)
SLEEP INN AIRPORT  (1.6, 45-71)
LA QUINTA INN ALBUQUERQUE AIRPORT  (1.4, 69-91)
AMERISUITES ALBUQUERQUE AIRPORT  (1.5, 59-109)
RAMADA LTD ALBUQUERQUE AIRPORT  (1.7, 62-89)
SUBURBAN EXTENDED STAY  (4.6, 51-52)
HOWARD JOHNSON - ALBUQUERQUE (EAST)  (5.4, 54-69)
WYNDHAM ALBUQUERQUE AIRPORT  (1.0, 80-109)
COMFORT INN EAST  (6.3, 55-56)
PARK PLAZA HOTEL ALBUQUERQUE  (4.7, 60-61)

Other hotels near Albuquerque International Sunport Airport (miles, price $):
HILTON GARDEN INN ALBUQUERQUE AIRPORT  (1.2, 85-180)
BEST WESTERN INNSUITES (AIRPORT)  (1.2, 49-71)
HAMPTON INN ALBUQERQUE AP  (1.4, 71-105)
ESA ALBUQUERQUE-AIRPORT  (1.6, 54-70)
HAWTHORN INN & SUITES - ALBUQUERQUE (AIRPORT)  (1.8, 69-70)
QUALITY SUITES ALBUQUERQUE  (1.8, 58-65)
VAGABOND EXECUTIVE INN - FORMERLY THE AIRPORT UNIVERSITY INN  (1.9, 49-56)
COUNTRY INN & SUITES BY CARLSON - ALBUQUERQUE AIRPORT  (1.9, 79-109)
HOMEWOOD STE ALBUQUERQUE ARPT

On Jul 10, 2007, at 2:49 PM, Gil Bloch wrote: What time do we plan to start on Aug. 6? I am trying to figure out if I have to be there the day before. Also, is there any specific hotel you would recommend? Regards, Gil Bloch

-----Original Message-----
From: devel-core-boun...@open-mpi.org [mailto:devel-core-boun...@open-mpi.org] On Behalf Of Galen Shipman
Sent: Mon, 09 July 2007 15:44
To: Open MPI Developers
Subject: Re: [devel-core] Collective Communications Optimization - MeetingScheduled in Albuquerque!

All, I have confirmed the meeting to be held at the HPC facility at UNM on Aug 6, 7, 8. Here is a link to the HPC center: http://www.hpc.unm.edu/ Here is the visitor information link: http://www.hpc.unm.edu/info/visitor-information I hope everyone who expressed interest is able to attend! 
Thanks, Galen

On Jun 29, 2007, at 6:23 PM, Galen Shipman wrote: So we are looking at a change of venue for this meeting. Santa Fe turned out to be a bit too costly in terms of hotel rooms for some participants. I am looking into getting the HPC conference room in Albuquerque. This is a convenient location for most, and the hotels are cheaper. I am firming up the details with the new HPC director at UNM; the dates will remain August 6, 7, 8. Thanks, Galen

On Jun 6, 2007, at 2:43 PM, Galen Shipman wrote: Updated attendees as of June 6th (5 tentative, 12 confirmed):
Cisco: Jeff (tentative)
IU: Tim, Andrew (tentative), Josh (tentative), Torsten
LANL: Brian, Ollie, Galen
Mellanox: Gil
Myricom: Patrick (tentative)
ORNL: Rich
SNL: Ron
UH: Edgar
UT: George, Jelena (tentative)
SUN: Rolf
QLogic: Christian

On Jun 5, 2007, at 10:10 AM, Galen Shipman wrote: Sorry for the duplicate (included a reasonable subject line): Okay, so we tried to get the Hilton at a reasonable rate; it didn't happen. Instead we got the Eldorado hotel: http://www.eldoradohotel.com/ So the meeting will be held there. The room rates at the hotel are probably a bit high, but there are a number of other hotels in and around the area; I will try to get a list from our admin. I have the following attendees so far. If you are on the list and marked as tentative, please let me know ASAP if you are definitely coming. If you are on the list and not marked as tentative, then we are expecting you, so please let me know today if you are unable to make it. This should be a good meeting; we will be located in the heart of Santa Fe, so travel will be easier (you still need a car from ABQ but it is less than one hour) and there are lots of things to do/see before and after the meetings. 
Thanks, Galen

Updated attendees (15 in total):
Cisco: Jeff (tentative)
IU: Tim, Andrew (tentative), Josh (tentative), Torsten
LANL: Brian, Ollie, Galen
Mellanox: Gil
Myricom: Patrick (tentative)
ORNL: Rich
SNL: Ron
UH: Edgar
UT: George, Jelena

___ devel-core mailing list devel-c...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel-core
Re: [OMPI devel] flags in openib btl
These two:

MCA_BTL_FLAGS_NEED_ACK
MCA_BTL_FLAGS_NEED_CSUM

are used by DR. They aren't used by OB1. - Galen

On Jun 15, 2007, at 9:27 AM, Jeff Squyres wrote: I notice that our help message for the btl_openib_flags MCA parameter seems to be a bit out of date:

CHECK(reg_int("flags", "BTL flags, added together: SEND=1, PUT=2, GET=4 "
              "(cannot be 0)",
              MCA_BTL_FLAGS_RDMA | MCA_BTL_FLAGS_NEED_ACK |
              MCA_BTL_FLAGS_NEED_CSUM, &ival, REGINT_GE_ZERO));
mca_btl_openib_module.super.btl_flags = (uint32_t) ival;

Specifically, we only list values of 1, 2, and 4, but the default value is 54. So clearly there are quite a few more flags that can be set there. What are they? -- Jeff Squyres Cisco Systems
Re: [OMPI devel] Problem with openib on demand connection bring up.
The patch applies to ib_multifrag as is, without a conflict. But the branch doesn't compile with or without the patch, so I was not able to test it. Do you have some uncommitted changes that may generate a conflict? Can you commit them so they can be resolved? If there is no conflict between your work and this patch, maybe it is a good idea to commit it to your branch and the trunk for testing? I have a whole pile of changes that need to be committed, and even with these changes it still doesn't compile, as I am reworking names, data structures, etc. I will commit what I have now, and will work on this a bit more over the weekend. - Galen Thanks, Galen On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote: Hello everyone, I encountered a problem with the openib on-demand connection code. Basically it works only by pure luck if you have more than one endpoint for the same proc, and sometimes it breaks in mysterious ways. The algorithm works like this: A wants to connect to B, so it creates a QP and sends it to B. B receives the QP from A, looks for an endpoint that is not yet associated with a remote endpoint, creates a QP for it, and sends the info back. Now A receives the QP and goes through the same logic as B, i.e. it looks for an endpoint that is not yet connected, BUT there is no guarantee that it will find the endpoint that initiated the connection in the first place! And if it finds another one, it will create a QP for it and send it back to B, and so on and so forth. In the end I sometimes get a peculiar mesh of connections where no QP has a connection back to it from the peer process. To overcome this problem B needs to send back some info that will allow A to determine the endpoint that initiated the connection request. The lid:qp pair will allow for this. But even then the problem remains if two procs initiate a connection at the same time. To deal with simultaneous connections an asymmetric protocol has to be used: one peer becomes the master, the other the slave. 
The slave always initiates a connection to the master. The master chooses a local endpoint to satisfy the incoming request and sends the info back to the slave. If the master wants to initiate a connection, it sends a message to the slave and the slave initiates a connection back to the master. The included patch implements the algorithm described above and works for all scenarios in which the current code fails to create a connection. -- Gleb.
Re: [OMPI devel] openib coord teleconf (was: Problem with openib on demand connection bring up)
On Jun 13, 2007, at 12:52 PM, Gleb Natapov wrote: On Wed, Jun 13, 2007 at 02:48:02PM -0400, Jeff Squyres wrote: On Jun 13, 2007, at 2:41 PM, Gleb Natapov wrote: Pasha tells me that the best times for Ishai and him are: - 2000-2030 Israel time - 1300-1300 US Eastern - 1100-1130 US Mountain - 2230-2300 India (Bangalore) Although they could also do the preceding half hour as well. Depends on the date. The closest I can do at 20:00 is June 19. Oops! I left out the date -- sorry. I meant to say Monday, June 18th. And I got the US eastern time wrong; that should have been noon, not 1300. 20:00 Israel June 19th is right after the weekly OMPI teleconf; want to do it then? Yes. On my calendar. - Galen -- Gleb.
Re: [OMPI devel] openib coord teleconf (was: Problem with openib on demand connection bring up)
On Jun 13, 2007, at 12:23 PM, Jeff Squyres wrote: On Jun 13, 2007, at 1:40 PM, Gleb Natapov wrote: [snip] coordination kind of teleconference. If people think this is a good idea, I can set up the call. sounds good to me. Sounds good to me too. Pasha is also working on the async event thread. This patch is not something I planned to work on; this problem prevented me from testing my changes to OB1 and is serious enough to be fixed in v1.2. Pasha tells me that the best times for Ishai and him are: - 2000-2030 Israel time - 1300-1300 US Eastern - 1100-1130 US Mountain - 2230-2300 India (Bangalore) These times work for me, but not until next week. - Galen Although they could also do the preceding half hour as well. Does this work for everyone? -- Jeff Squyres Cisco Systems
Re: [OMPI devel] Problem with openib on demand connection bring up.
On Jun 13, 2007, at 12:07 PM, Gleb Natapov wrote: On Wed, Jun 13, 2007 at 02:05:00PM -0400, Jeff Squyres wrote: On Jun 13, 2007, at 1:54 PM, Jeff Squyres wrote: With today's trunk, I still see the problem: Same thing happens on the v1.2 branch. I'll re-open #548. I am sure it was never tested with multiple subnets. I'll try to get such a setup. I tested this with multiple subnets, but it was quite some time ago. - Galen -- Gleb.
Re: [OMPI devel] Problem with openib on demand connection bring up.
On Jun 13, 2007, at 11:33 AM, Jeff Squyres wrote: On Jun 13, 2007, at 1:15 PM, Nysal Jan wrote: There is a ticket (closed) here: https://svn.open-mpi.org/trac/ompi/ticket/548 It was fixed by Galen for 1.2. Ah -- I forgot to look at closed tickets. I think we broke it again; it certainly fails on the trunk (perhaps related to what Gleb found?). I did not test 1.2. There is a FAQ entry also about this: http://www.open-mpi.org/faq/?category=openfabrics#ofa-port-wireup That's what it *should* be doing, but I wonder if that's what it *actually* is doing. It has been a while, but we tested this on our local cluster with differing numbers of ports and it worked; I was doing simple ping-pongs, though. If both sides try to open a connection at the same time, however, badness can occur, from my understanding of this. - Galen -- Jeff Squyres Cisco Systems
Re: [OMPI devel] Problem with openib on demand connection bring up.
On Jun 13, 2007, at 11:15 AM, Nysal Jan wrote: I was just bitten yesterday by a problem that I've known about for a while but had never gotten around to looking into (I could have sworn that there was an open trac ticket on this, but I can't find one anywhere). I have 2 hosts: one with 3 active ports and one with 2 active ports. If I run an MPI job between them, the openib BTL wireup goes badly and it aborts. So a heterogeneous number of ports is not currently handled properly in the code. I don't know if Gleb's patch addresses this situation or not; I'll look at his patch this afternoon. There is a ticket (closed) here: https://svn.open-mpi.org/trac/ompi/ticket/548 It was fixed by Galen for 1.2. There is a FAQ entry also about this: http://www.open-mpi.org/faq/?category=openfabrics#ofa-port-wireup I think Gleb's patch addresses a potential race condition when both sides attempt to connect at the same time.
Re: [OMPI devel] Problem with openib on demand connection bring up.
On Jun 13, 2007, at 10:48 AM, Jeff Squyres wrote: I wonder if this is bringing up the point that there are several of us working in the openib code base -- I wonder if it would be worthwhile to have a [short] teleconference to discuss what we're all doing in openib, where we're doing it (trunk, branch, whatever), when we expect to have it done, what version we need it in, etc. Just a coordination kind of teleconference. If people think this is a good idea, I can set up the call. sounds good to me. - Galen For example, don't forget that Nysal and I have the openib btl port-selection stuff off in /tmp/jnysal-openib-wireup (the btl_openib_if_[in|ex]clude MCA params). Per my prior e-mail, if no one objects, I will be bringing that stuff in to the trunk tomorrow evening (I'm pretty sure it won't conflict with what Galen is doing; Galen and I discussed it on the phone this morning). On Jun 13, 2007, at 11:38 AM, Galen Shipman wrote: Hi Gleb, As we have discussed before, I am working on adding support for multiple QPs with either per-peer resources or shared resources. As a result I am trying to clean up a lot of the OpenIB code. It has grown up organically over the years and needs some attention. Perhaps we can coordinate on commits, or even work from the same temp branch to do an overall cleanup as well as address the issue you describe in this email. I bring this up because this commit will conflict quite a bit with what I am working on; I can always merge it by hand, but it may make sense for us to get this all done in one area and then bring it all over? Thanks, Galen On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote: Hello everyone, I encountered a problem with the openib on-demand connection code. Basically it works only by pure luck if you have more than one endpoint for the same proc, and sometimes it breaks in mysterious ways. The algorithm works like this: A wants to connect to B, so it creates a QP and sends it to B. B receives the QP from A, looks for an endpoint that is not yet associated with a remote endpoint, creates a QP for it, and sends the info back. Now A receives the QP and goes through the same logic as B, i.e. it looks for an endpoint that is not yet connected, BUT there is no guarantee that it will find the endpoint that initiated the connection in the first place! And if it finds another one, it will create a QP for it and send it back to B, and so on and so forth. In the end I sometimes get a peculiar mesh of connections where no QP has a connection back to it from the peer process. To overcome this problem B needs to send back some info that will allow A to determine the endpoint that initiated the connection request. The lid:qp pair will allow for this. But even then the problem remains if two procs initiate a connection at the same time. To deal with simultaneous connections an asymmetric protocol has to be used: one peer becomes the master, the other the slave. The slave always initiates a connection to the master. The master chooses a local endpoint to satisfy the incoming request and sends the info back to the slave. If the master wants to initiate a connection, it sends a message to the slave and the slave initiates a connection back to the master. The included patch implements the algorithm described above and works for all scenarios in which the current code fails to create a connection. -- Gleb. -- Jeff Squyres Cisco Systems
Re: [OMPI devel] Problem with openib on demand connection bring up.
On Jun 13, 2007, at 9:49 AM, Torsten Hoefler wrote: Hi Galen, Gleb, there is also something weird going on if I call the basic alltoall during the module_init() of a collective module (I need to wire up my own QPs in my coll component). It takes 7 seconds for 4 nodes and more than 30 minutes for 120 nodes. It seems to be an OpenIB wireup issue, because if I start with -mca btl tcp,self this goes as fast as expected (<2 seconds). Will this issue be fixed with your patch? No, this is a separate issue. Try: -mca mpi_preconnect_oob 1 then try: -mca mpi_preconnect_all 1 and let us know what the times are. thx, galen Thanks, Torsten -- bash$ :(){ :|:&};: - http://www.unixer.de/ - Indiana University | http://www.indiana.edu Open Systems Lab | http://osl.iu.edu/ 150 S. Woodlawn Ave. | Bloomington, IN, 474045-7104 | USA Lindley Hall Room 135 | +01 (812) 855-3608
Re: [OMPI devel] Problem with openib on demand connection bring up.
Hi Gleb, As we have discussed before, I am working on adding support for multiple QPs with either per-peer resources or shared resources. As a result I am trying to clean up a lot of the OpenIB code. It has grown up organically over the years and needs some attention. Perhaps we can coordinate on commits, or even work from the same temp branch to do an overall cleanup as well as address the issue you describe in this email. I bring this up because this commit will conflict quite a bit with what I am working on; I can always merge it by hand, but it may make sense for us to get this all done in one area and then bring it all over? Thanks, Galen On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote: Hello everyone, I encountered a problem with the openib on-demand connection code. Basically it works only by pure luck if you have more than one endpoint for the same proc, and sometimes it breaks in mysterious ways. The algorithm works like this: A wants to connect to B, so it creates a QP and sends it to B. B receives the QP from A, looks for an endpoint that is not yet associated with a remote endpoint, creates a QP for it, and sends the info back. Now A receives the QP and goes through the same logic as B, i.e. it looks for an endpoint that is not yet connected, BUT there is no guarantee that it will find the endpoint that initiated the connection in the first place! And if it finds another one, it will create a QP for it and send it back to B, and so on and so forth. In the end I sometimes get a peculiar mesh of connections where no QP has a connection back to it from the peer process. To overcome this problem B needs to send back some info that will allow A to determine the endpoint that initiated the connection request. The lid:qp pair will allow for this. But even then the problem remains if two procs initiate a connection at the same time. To deal with simultaneous connections an asymmetric protocol has to be used: one peer becomes the master, the other the slave. The slave always initiates a connection to the master. The master chooses a local endpoint to satisfy the incoming request and sends the info back to the slave. If the master wants to initiate a connection, it sends a message to the slave and the slave initiates a connection back to the master. The included patch implements the algorithm described above and works for all scenarios in which the current code fails to create a connection. -- Gleb.
Re: [OMPI devel] threaded builds
On Jun 11, 2007, at 8:25 AM, Jeff Squyres wrote: I leave it to the thread subgroup to decide... Should we discuss on the call tomorrow? I don't have a strong opinion; I was just testing both because it was easy to do so. If we want to concentrate on the trunk, I can adjust my MTT setup. I think trying to worry about 1.2 would just be a time sink. We know that there are architectural issues with threads in some parts of the code. I don't see us re-architecting 1.2 in this regard. Seems we should only focus on the trunk. - Galen On Jun 11, 2007, at 10:17 AM, Brian Barrett wrote: Yes, this is a known issue. I don't know -- are we trying to make threads work on the 1.2 branch, or just the trunk? I had thought just the trunk? Brian On Jun 11, 2007, at 8:13 AM, Tim Prins wrote: I had similar problems on the trunk, which was fixed by Brian with r14877. Perhaps 1.2 needs something similar? Tim On Monday 11 June 2007 10:08:15 am Jeff Squyres wrote: Per the teleconf last week, I have started to revamp the Cisco MTT infrastructure to do simplistic thread testing. Specifically, I'm building the OMPI trunk and v1.2 branches with "--with-threads --enable-mpi-threads". I haven't switched this into my production MTT setup yet, but in the first trial runs I'm noticing a segv in the test/threads/opal_condition program. It seems that in the thr1 test on the v1.2 branch, when it calls opal_progress() underneath the condition variable wait, at some point in there current_base becomes NULL. Hence, the following segv's because the passed-in value of "base" is NULL (event.c):

int opal_event_base_loop(struct event_base *base, int flags)
{
    const struct opal_eventop *evsel = base->evsel;
    ...

Here's the full call stack:

#0 0x002a955a020e in opal_event_base_loop (base=0x0, flags=5) at event.c:520
#1 0x002a955a01f9 in opal_event_loop (flags=5) at event.c:514
#2 0x002a95599111 in opal_progress () at runtime/opal_progress.c:259
#3 0x004012c8 in opal_condition_wait (c=0x5025a0, m=0x502600) at ../../opal/threads/condition.h:81
#4 0x00401146 in thr1_run (obj=0x503110) at opal_condition.c:46
#5 0x0036e290610a in start_thread () from /lib64/tls/libpthread.so.0
#6 0x0036e1ec68c3 in clone () from /lib64/tls/libc.so.6
#7 0x in ?? ()

This test seems to work fine on the trunk (at least, it didn't segv in my small number of trial runs). Is this a known problem in the 1.2 branch? Should I skip the thread testing on the 1.2 branch and concentrate on the trunk? -- Jeff Squyres Cisco Systems
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
Everyone: George thought this was okay after the discussion. I should have made the wiki prior to my commit, as it did look very Open IB specific. Please review: https://svn.open-mpi.org/trac/ompi/wiki/BTLSemantics Let me know if you want to discuss this further and we can set up a call early next week. Thanks, Galen On Jun 7, 2007, at 12:49 PM, Don Kerr wrote: It would be difficult for me to attend this afternoon. Tomorrow is much better for me. -DON George Bosilca wrote: I'm available this afternoon. george. On Jun 7, 2007, at 2:35 PM, Galen Shipman wrote: Are people available today to discuss this over the phone? - Galen On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote: On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote: ) I expect you to revise the patch in order to propose a generic solution or I'll trigger a vote against the patch. I vote for it to be backed out of the trunk, as it exports way too much knowledge from the Open IB BTL into the PML layer. The patch solves a real problem. If we want to back it out we need to find another solution. I also didn't like this change too much, but I thought about other solutions and haven't found anything better than what Galen did. If you have something in mind, let's discuss it. As a general comment, this kind of discussion is why I prefer to send significant changes as a patch to the list for discussion before committing. george. PS: With Gleb's changes the problem is the same. The following snippet reflects exactly the same behavior as the original patch. I didn't try to change the semantics; I just made the code match the semantics that Galen described. -- Gleb. 
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
Call-in details: I have scheduled your requested audio conference "Open MPI" for today from 2:30pm to 3:30pm mountain time with 7 ports. Dial-in number: 5-4165 local, 866-260-0475 toll free - Galen On Jun 7, 2007, at 1:47 PM, Galen Shipman wrote: On Jun 7, 2007, at 12:49 PM, Don Kerr wrote: It would be difficult for me to attend this afternoon. Tomorrow is much better for me. Brian and I are both out tomorrow. I think what we will do is have a call today, report back to the group, and then if necessary have another call on Monday/Tuesday. - Galen -DON George Bosilca wrote: I'm available this afternoon. george. On Jun 7, 2007, at 2:35 PM, Galen Shipman wrote: Are people available today to discuss this over the phone? - Galen On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote: On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote: ) I expect you to revise the patch in order to propose a generic solution or I'll trigger a vote against the patch. I vote for it to be backed out of the trunk, as it exports way too much knowledge from the Open IB BTL into the PML layer. The patch solves a real problem. If we want to back it out we need to find another solution. I also didn't like this change too much, but I thought about other solutions and haven't found anything better than what Galen did. If you have something in mind, let's discuss it. As a general comment, this kind of discussion is why I prefer to send significant changes as a patch to the list for discussion before committing. george. PS: With Gleb's changes the problem is the same. The following snippet reflects exactly the same behavior as the original patch. I didn't try to change the semantics; I just made the code match the semantics that Galen described. -- Gleb. 
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
On Jun 7, 2007, at 12:49 PM, Don Kerr wrote: It would be difficult for me to attend this afternoon. Tomorrow is much better for me. Brian and I are both out tomorrow. I think what we will do is have a call today, report back to the group, and then if necessary have another call on Monday/Tuesday. - Galen -DON George Bosilca wrote: I'm available this afternoon. george. On Jun 7, 2007, at 2:35 PM, Galen Shipman wrote: Are people available today to discuss this over the phone? - Galen On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote: On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote: ) I expect you to revise the patch in order to propose a generic solution or I'll trigger a vote against the patch. I vote for it to be backed out of the trunk, as it exports way too much knowledge from the Open IB BTL into the PML layer. The patch solves a real problem. If we want to back it out we need to find another solution. I also didn't like this change too much, but I thought about other solutions and haven't found anything better than what Galen did. If you have something in mind, let's discuss it. As a general comment, this kind of discussion is why I prefer to send significant changes as a patch to the list for discussion before committing. george. PS: With Gleb's changes the problem is the same. The following snippet reflects exactly the same behavior as the original patch. I didn't try to change the semantics; I just made the code match the semantics that Galen described. -- Gleb.
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
Okay, how is 2:30 mountain time for everyone? I will set up a call-in if this works. Thanks, Galen On Jun 7, 2007, at 12:39 PM, George Bosilca wrote: I'm available this afternoon. george. On Jun 7, 2007, at 2:35 PM, Galen Shipman wrote: Are people available today to discuss this over the phone? - Galen On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote: On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote: ) I expect you to revise the patch in order to propose a generic solution or I'll trigger a vote against the patch. I vote for it to be backed out of the trunk, as it exports way too much knowledge from the Open IB BTL into the PML layer. The patch solves a real problem. If we want to back it out we need to find another solution. I also didn't like this change too much, but I thought about other solutions and haven't found anything better than what Galen did. If you have something in mind, let's discuss it. As a general comment, this kind of discussion is why I prefer to send significant changes as a patch to the list for discussion before committing. george. PS: With Gleb's changes the problem is the same. The following snippet reflects exactly the same behavior as the original patch. I didn't try to change the semantics; I just made the code match the semantics that Galen described. -- Gleb.
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
Are people available today to discuss this over the phone? - Galen On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote: On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote: ) I expect you to revise the patch in order to propose a generic solution or I'll trigger a vote against the patch. I vote for it to be backed out of the trunk, as it exports way too much knowledge from the Open IB BTL into the PML layer. The patch solves a real problem. If we want to back it out we need to find another solution. I also didn't like this change too much, but I thought about other solutions and haven't found anything better than what Galen did. If you have something in mind, let's discuss it. As a general comment, this kind of discussion is why I prefer to send significant changes as a patch to the list for discussion before committing. george. PS: With Gleb's changes the problem is the same. The following snippet reflects exactly the same behavior as the original patch. I didn't try to change the semantics; I just made the code match the semantics that Galen described. -- Gleb.
[OMPI devel] BTL Semantics Teleconference: was : Re: [OMPI svn] svn:open-mpi r14768
I just had a discussion with Rich regarding the BTL semantics. I think what might be helpful here is for us to have a telecon to discuss this further. I only have one goal out of this, and that is to firmly define the ordering semantics of the BTL, or alternatively the local/remote completion semantics of the BTL, whatever they may be. I have created a wiki here to help describe the issue as I currently see it; please feel free to add to it with suggestions, etc.: https://svn.open-mpi.org/trac/ompi/wiki/BTLSemantics - Galen On Jun 7, 2007, at 9:55 AM, Galen Shipman wrote: On Jun 7, 2007, at 9:11 AM, George Bosilca wrote: There is something weird with this change, and the patch reflects it. The new argument "order" comes from the PML level and might be MCA_BTL_NO_ORDER (which is kind of global) or BTL_OPENIB_LP_QP or BTL_OPENIB_HP_QP (which are definitely Open IB related). Do you really intend to let the PML know about Open IB internal constants? No, the PML knows only one thing about the order tag: it is either MCA_BTL_NO_ORDER or it is something that the BTL assigns. The PML has no idea about BTL_OPENIB_LP_QP or BTL_OPENIB_HP_QP; to the PML it is just an order tag assigned to a fragment by the BTL. So the semantics are that after a btl_send/put/get an order tag may be assigned by the BTL to the descriptor. This order tag can then be specified to subsequent calls to btl_alloc or btl_prepare. The PML has no idea what the value means, other than that it is requesting a descriptor that will be ordered w.r.t. a previously transmitted descriptor. If it's the case (which seems to be true from the following snippet if(MCA_BTL_NO_ORDER == order) { frag->base.order = BTL_OPENIB_LP_QP; } else { frag->base.order = order; } So I am choosing some ordering to use here because the PML told me it doesn't care; what is wrong with this? ) I expect you to revise the patch in order to propose a generic solution or I'll trigger a vote against the patch. 
This exports no knowledge of the Open IB BTL to the PML layer; the PML doesn't know that this is a QP index, and it doesn't care! The PML simply uses this value (if it wants to) to request ordering with subsequent fragments. We use the QP index only as a BTL optimization; it could have been anything. So the only new knowledge that the PML has is how to request that ordering of fragments be enforced, and the BTL doesn't even have to provide this if it doesn't want to; that is the reason for MCA_BTL_NO_ORDER. Please describe a use case where this is not a generic solution. Keep in mind that MX, TCP, and GM can all provide ordering guarantees if they wish; in fact, for MX you can simply always assign an order tag, say the value 1. MX can then guarantee ordering of all fragments sent over the same BTL. I vote for it to be backed out of the trunk, as it exports way too much knowledge from the Open IB BTL into the PML layer. The only other option that I have identified that doesn't push PML-level protocol into the BTL is to require that BTLs always guarantee ordering of fragments sent/put/get over the same BTL. george. PS: With Gleb's changes the problem is the same. The following snippet reflects exactly the same behavior as the original patch. Gleb's changes don't change the semantic guarantees that I have described above. frag->base.order = order; assert(frag->base.order != BTL_OPENIB_HP_QP); On Jun 7, 2007, at 9:49 AM, Gleb Natapov wrote: Hi Galen, On Sun, May 27, 2007 at 10:19:09AM -0600, Galen Shipman wrote: With the current code this is not the case. The order tag is set during fragment allocation. That seems wrong according to your description. The attached patch fixes this. If no specific ordering tag is provided to the allocation function, the order of the fragment is set to MCA_BTL_NO_ORDER. After a call to send/put/get, the order is set to whatever QP was used for communication. If the order is set before the send call, it is used to choose the QP. 
I do set the order tag during allocation/prepare, but the defined semantics are that the tag is only valid after send/put/get. We can set it up anywhere we wish in the BTL; the PML, however, cannot rely on anything until after the send/put/get call. So really this is an issue of semantics versus implementation. The implementation, I believe, does conform to the semantics, as the upper layer (PML) doesn't use the tag value until after a call to send/put/get. I will look over the patch, however; it might make more sense to delay setting the value until the actual send/put/get call. Have you had a chance to look over the patch? -- Gleb.
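The order-tag contract being argued over can be shown with a toy model. Everything here is hypothetical and simplified (the real interface lives in ompi/mca/btl/btl.h and uses descriptors, endpoints, and QPs): a BTL with two internal channels stamps each descriptor with the channel it actually used, and the PML may hand that tag back to a later allocation to get ordering w.r.t. the earlier descriptor, without ever knowing what the tag means.

```c
#include <assert.h>
#include <stdint.h>

#define BTL_NO_ORDER 0xff   /* stand-in for MCA_BTL_NO_ORDER */

typedef struct { uint8_t order; } descriptor_t;

/* alloc: honor a requested tag; otherwise leave the choice to send() */
static descriptor_t btl_alloc(uint8_t order)
{
    descriptor_t d = { order };
    return d;
}

/* send: pick a channel if the PML didn't care.  Per the stated
 * semantics, the tag is only guaranteed valid after this returns. */
static uint8_t btl_send(descriptor_t *d)
{
    if (d->order == BTL_NO_ORDER)
        d->order = 0;   /* e.g. the low-priority QP */
    return d->order;
}
```

PML-side usage follows the thread's description: send the first fragment, read back its (now valid) tag, and pass that tag to btl_alloc for any fragment that must follow it in order.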
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
On Jun 7, 2007, at 9:11 AM, George Bosilca wrote: There is something weird with this change, and the patch reflects it. The new argument "order" comes from the PML level and might be MCA_BTL_NO_ORDER (which is kind of global) or BTL_OPENIB_LP_QP or BTL_OPENIB_HP_QP (which are definitely Open IB related). Do you really intend to let the PML know about Open IB internal constants? No, the PML knows only one thing about the order tag: it is either MCA_BTL_NO_ORDER or it is something that the BTL assigns. The PML has no idea about BTL_OPENIB_LP_QP or BTL_OPENIB_HP_QP; to the PML it is just an order tag assigned to a fragment by the BTL. So the semantics are that after a btl_send/put/get an order tag may be assigned by the BTL to the descriptor. This order tag can then be specified to subsequent calls to btl_alloc or btl_prepare. The PML has no idea what the value means, other than that it is requesting a descriptor that will be ordered w.r.t. a previously transmitted descriptor. If it's the case (which seems to be true from the following snippet if(MCA_BTL_NO_ORDER == order) { frag->base.order = BTL_OPENIB_LP_QP; } else { frag->base.order = order; } So I am choosing some ordering to use here because the PML told me it doesn't care; what is wrong with this? ) I expect you to revise the patch in order to propose a generic solution or I'll trigger a vote against the patch. This exports no knowledge of the Open IB BTL to the PML layer; the PML doesn't know that this is a QP index, and it doesn't care! The PML simply uses this value (if it wants to) to request ordering with subsequent fragments. We use the QP index only as a BTL optimization; it could have been anything. So the only new knowledge that the PML has is how to request that ordering of fragments be enforced, and the BTL doesn't even have to provide this if it doesn't want to; that is the reason for MCA_BTL_NO_ORDER. Please describe a use case where this is not a generic solution. 
Keep in mind that MX, TCP, and GM can all provide ordering guarantees if they wish; in fact, for MX you can simply always assign an order tag, say the value 1. MX can then guarantee ordering of all fragments sent over the same BTL. I vote for it to be backed out of the trunk, as it exports way too much knowledge from the Open IB BTL into the PML layer. The only other option that I have identified that doesn't push PML-level protocol into the BTL is to require that BTLs always guarantee ordering of fragments sent/put/get over the same BTL. george. PS: With Gleb's changes the problem is the same. The following snippet reflects exactly the same behavior as the original patch. Gleb's changes don't change the semantic guarantees that I have described above. frag->base.order = order; assert(frag->base.order != BTL_OPENIB_HP_QP); On Jun 7, 2007, at 9:49 AM, Gleb Natapov wrote: Hi Galen, On Sun, May 27, 2007 at 10:19:09AM -0600, Galen Shipman wrote: With the current code this is not the case. The order tag is set during fragment allocation. That seems wrong according to your description. The attached patch fixes this. If no specific ordering tag is provided to the allocation function, the order of the fragment is set to MCA_BTL_NO_ORDER. After a call to send/put/get, the order is set to whatever QP was used for communication. If the order is set before the send call, it is used to choose the QP. I do set the order tag during allocation/prepare, but the defined semantics are that the tag is only valid after send/put/get. We can set it up anywhere we wish in the BTL; the PML, however, cannot rely on anything until after the send/put/get call. So really this is an issue of semantics versus implementation. The implementation, I believe, does conform to the semantics, as the upper layer (PML) doesn't use the tag value until after a call to send/put/get. I will look over the patch, however; it might make more sense to delay setting the value until the actual send/put/get call. Have you had a chance to look over the patch? -- Gleb. 
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14782
Actually, we still need MCA_BTL_FLAGS_FAKE_RDMA , it can be used as a hint for components such as one-sided. Galen On May 27, 2007, at 5:25 AM, g...@osl.iu.edu wrote: Author: gleb Date: 2007-05-27 07:25:39 EDT (Sun, 27 May 2007) New Revision: 14782 URL: https://svn.open-mpi.org/trac/ompi/changeset/14782 Log: No need for MCA_BTL_FLAGS_NEED_ACK any more. As of commit r14768 this is the default behaviour. Text files modified: trunk/ompi/mca/btl/btl.h | 3 --- trunk/ompi/mca/btl/tcp/btl_tcp_component.c | 3 +-- 2 files changed, 1 insertions(+), 5 deletions(-) Modified: trunk/ompi/mca/btl/btl.h == --- trunk/ompi/mca/btl/btl.h(original) +++ trunk/ompi/mca/btl/btl.h 2007-05-27 07:25:39 EDT (Sun, 27 May 2007) @@ -157,9 +157,6 @@ #define MCA_BTL_FLAGS_NEED_ACK 0x10 #define MCA_BTL_FLAGS_NEED_CSUM 0x20 -/* btl can report put/get completion before data hits the other side */ -#define MCA_BTL_FLAGS_FAKE_RDMA 0x40 - /* btl needs local rdma completion */ #define MCA_BTL_FLAGS_RDMA_COMPLETION 0x80 Modified: trunk/ompi/mca/btl/tcp/btl_tcp_component.c == --- trunk/ompi/mca/btl/tcp/btl_tcp_component.c (original) +++ trunk/ompi/mca/btl/tcp/btl_tcp_component.c 2007-05-27 07:25:39 EDT (Sun, 27 May 2007) @@ -224,8 +224,7 @@ mca_btl_tcp_module.super.btl_flags = MCA_BTL_FLAGS_PUT | MCA_BTL_FLAGS_SEND_INPLACE | MCA_BTL_FLAGS_NEED_CSUM | - MCA_BTL_FLAGS_NEED_ACK | - MCA_BTL_FLAGS_FAKE_RDMA; + MCA_BTL_FLAGS_NEED_ACK; mca_btl_tcp_module.super.btl_bandwidth = 100; mca_btl_tcp_module.super.btl_latency = 0; mca_btl_base_param_register (&mca_btl_tcp_component.super.btl_version, ___ svn mailing list s...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/svn
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14780
Can we get rid of mca_pml_ob1_send_fin_btl and just have mca_pml_ob1_send_fin? It seems we should just always send the fin over the same btl and this would clean up the code a bit. Thanks, Galen On May 27, 2007, at 2:29 AM, g...@osl.iu.edu wrote: Author: gleb Date: 2007-05-27 04:29:38 EDT (Sun, 27 May 2007) New Revision: 14780 URL: https://svn.open-mpi.org/trac/ompi/changeset/14780 Log: Fix out of resource handling for FIN packets broken by r14768. Text files modified: trunk/ompi/mca/pml/ob1/pml_ob1.c | 7 +++ trunk/ompi/mca/pml/ob1/pml_ob1.h |14 -- 2 files changed, 15 insertions(+), 6 deletions(-) Modified: trunk/ompi/mca/pml/ob1/pml_ob1.c == --- trunk/ompi/mca/pml/ob1/pml_ob1.c(original) +++ trunk/ompi/mca/pml/ob1/pml_ob1.c 2007-05-27 04:29:38 EDT (Sun, 27 May 2007) @@ -249,7 +249,7 @@ MCA_PML_OB1_PROGRESS_PENDING(bml_btl); } -int mca_pml_ob1_send_fin( +int mca_pml_ob1_send_fin_btl( ompi_proc_t* proc, mca_bml_base_btl_t* bml_btl, void *hdr_des, @@ -260,9 +260,8 @@ mca_pml_ob1_fin_hdr_t* hdr; int rc; -MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order, sizeof (mca_pml_ob1_fin_hdr_t)); +MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order, sizeof (mca_pml_ob1_fin_hdr_t)); if(NULL == fin) { -MCA_PML_OB1_ADD_FIN_TO_PENDING(proc, hdr_des, bml_btl, order); return OMPI_ERR_OUT_OF_RESOURCE; } fin->des_flags |= MCA_BTL_DES_FLAGS_PRIORITY; @@ -349,7 +348,7 @@ } break; case MCA_PML_OB1_HDR_TYPE_FIN: -rc = mca_pml_ob1_send_fin(pckt->proc, send_dst, +rc = mca_pml_ob1_send_fin_btl(pckt->proc, send_dst, pckt- >hdr.hdr_fin.hdr_des.pval, pckt->order); MCA_PML_OB1_PCKT_PENDING_RETURN(pckt); Modified: trunk/ompi/mca/pml/ob1/pml_ob1.h == --- trunk/ompi/mca/pml/ob1/pml_ob1.h(original) +++ trunk/ompi/mca/pml/ob1/pml_ob1.h 2007-05-27 04:29:38 EDT (Sun, 27 May 2007) @@ -283,9 +283,19 @@ } while(0) -int mca_pml_ob1_send_fin(ompi_proc_t* proc, mca_bml_base_btl_t* bml_btl, - void *hdr_des, uint8_t order); +int mca_pml_ob1_send_fin_btl(ompi_proc_t* proc, mca_bml_base_btl_t* bml_btl, +void *hdr_des, uint8_t 
order); +static inline int mca_pml_ob1_send_fin(ompi_proc_t* proc, void *hdr_des, +mca_bml_base_btl_t* bml_btl, uint8_t order) +{ + if(mca_pml_ob1_send_fin_btl(proc, bml_btl, hdr_des, order) == OMPI_SUCCESS) + return OMPI_SUCCESS; + +MCA_PML_OB1_ADD_FIN_TO_PENDING(proc, hdr_des, bml_btl, order); + +return OMPI_ERR_OUT_OF_RESOURCE; +} /* This function tries to resend FIN/ACK packets from pckt_pending queue. * Packets are added to the queue when sending of FIN or ACK is failed due to ___ svn mailing list s...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/svn
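The r14780 patch above splits FIN sending into a "try once" function and an inline wrapper that queues the packet only on resource exhaustion. The shape of that pattern can be sketched as follows; this is a hypothetical simplification (the real functions take proc/bml_btl/hdr_des/order arguments, and the pending queue is pckt_pending, retried from progress):

```c
#include <stdbool.h>

#define OMPI_SUCCESS             0
#define OMPI_ERR_OUT_OF_RESOURCE (-2)

static int  pending_fins      = 0;     /* stand-in for the pending queue */
static bool btl_has_resources = true;  /* toggled to simulate exhaustion */

/* the "try once" half: attempt to allocate and send the FIN */
static int send_fin_btl(void)
{
    return btl_has_resources ? OMPI_SUCCESS : OMPI_ERR_OUT_OF_RESOURCE;
}

/* the wrapper: on failure, park the FIN for a later retry from
 * the progress loop rather than dropping it */
static int send_fin(void)
{
    if (send_fin_btl() == OMPI_SUCCESS)
        return OMPI_SUCCESS;
    pending_fins++;
    return OMPI_ERR_OUT_OF_RESOURCE;
}
```

This separation is what lets the resend path (mca_pml_ob1_process_pending_packets-style code) call the non-queueing variant directly without re-queueing on every failure.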
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
Doh, correction... On May 27, 2007, at 10:23 AM, Galen Shipman wrote: The problem is that MCA_BTL_DES_FLAGS_PRIORITY was meant to indicate that the fragment was higher priority, but the fragment isn't higher priority. It simply needs to be ordered w.r.t. a previous fragment, an RDMA in this case. But after the change the priority flag is totally ignored. So the priority flag was broken, I think. In OpenIB we used the priority flag to choose a QP on which to send the fragment, but there was no checking of whether the fragment could be sent on the specified QP. So a max-send-size fragment could be marked as priority and the BTL would use the high-priority QP, and we would go bang. This is how I read the code; I might have missed something. If the priority flag is simply a "hint" and has *NO* hard requirements, then we can still use it in the OpenIB BTL, but it won't have any effect, as only eager-size fragments can be marked high priority and we already send these over the high-priority QP. - Galen -- Gleb.
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
The problem is that MCA_BTL_DES_FLAGS_PRIORITY was meant to indicate that the fragment was higher priority, but the fragment isn't higher priority. It simply needs to be ordered w.r.t. a previous fragment, an RDMA in this case. But after the change the priority flag is totally ignored. So the priority flag was broken, I think. In OpenIB we used the priority flag to choose a QP on which to send the fragment, but there was no checking of whether the fragment could be sent on the specified QP. So a max-send-size fragment could be marked as priority and the BTL would use the high-priority QP, and we would go bang. This is how I read the code; I might have missed something. If the priority flag is simply a "hint" and has hard requirements, then we can still use it in the OpenIB BTL, but it won't have any effect, as only eager-size fragments can be marked high priority and we already send these over the high-priority QP. - Galen -- Gleb.
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
With the current code this is not the case. The order tag is set during fragment allocation. That seems wrong according to your description. The attached patch fixes this. If no specific ordering tag is provided to the allocation function, the order of the fragment is set to MCA_BTL_NO_ORDER. After a call to send/put/get, the order is set to whatever QP was used for communication. If the order is set before the send call, it is used to choose the QP. I do set the order tag during allocation/prepare, but the defined semantics are that the tag is only valid after send/put/get. We can set it up anywhere we wish in the BTL; the PML, however, cannot rely on anything until after the send/put/get call. So really this is an issue of semantics versus implementation. The implementation, I believe, does conform to the semantics, as the upper layer (PML) doesn't use the tag value until after a call to send/put/get. I will look over the patch, however; it might make more sense to delay setting the value until the actual send/put/get call. Thanks, Galen -- Gleb.
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
On May 24, 2007, at 2:48 PM, George Bosilca wrote: I see the problem this patch tries to solve, but I fail to correctly understand the implementation. The patch affects all PMLs and BTLs in the code base by adding one more argument to some of the most often called functions. And there is only one BTL (openib) that seems to use it, while all others completely ignore it. Moreover, there already seems to be a very similar mechanism based on the MCA_BTL_DES_FLAGS_PRIORITY flag, which can be set by the PML level in the btl_descriptor. So what's the difference between the additional argument and a correct usage of the MCA_BTL_DES_FLAGS_PRIORITY flag? The problem is that MCA_BTL_DES_FLAGS_PRIORITY was meant to indicate that the fragment was higher priority, but the fragment isn't higher priority. It simply needs to be ordered w.r.t. a previous fragment, an RDMA in this case. That being said, we could have just added an RDMA FIN flag, but that would mix protocol a bit too much between the BTL and the PML, in my opinion. What we have with this fix is that the BTL can assign an order tag to any descriptor if it wishes; this order tag is only valid after a call to btl_send or btl_put/get. This order tag can then be used to request another descriptor later that will enforce ordering. The semantics here are clear, and the BTL doesn't have to do anything if it doesn't wish to (w.r.t. assigning a valid order tag). So this was the clearest semantics I could come up with that allowed for numerous implementations at the BTL level. For example, even specifying an RDMA FIN flag directly to the BTL would restrict the BTL further than these semantics, because then all RDMAs must be sent on the same endpoint/QP: all the PML would be able to indicate is that a FIN is being sent, and the BTL wouldn't have the context to know which RDMA the FIN belonged to, and hence couldn't easily enforce ordering. 
The only reason OpenIB is the only one to use this new functionality is that I haven't had a chance to fix up udapl, which I plan to do next week. Note that GM's semantics expose a similar problem (ordering is only guaranteed for messages of the same priority), but Myrinet doesn't buffer like some of the IB/iWARP stuff can, so we won't see it there. There are also a number of optimizations that these semantics allow; for example, the BTL no longer has to give a local completion callback on an RDMA, as the FIN message can be used for local completion of both. I am also looking at adding a BTL_PUT_IMMEDIATE which provides remote completion via an active message tag callback along with 64 bits of data; this would allow us to bypass the FIN entirely if the network supports it, which would be useful for MX, for example. OpenIB also supports a similar mechanism, but there are problems that would need to be addressed, as OpenIB only delivers 32 bits with the remote completion. - Galen george. On May 24, 2007, at 3:51 PM, gship...@osl.iu.edu wrote: Author: gshipman Date: 2007-05-24 15:51:26 EDT (Thu, 24 May 2007) New Revision: 14768 URL: https://svn.open-mpi.org/trac/ompi/changeset/14768 Log: Add optional ordering to the BTL interface. This is required to tighten up the BTL semantics. Ordering is not guaranteed, but if the BTL returns an order tag in a descriptor (other than MCA_BTL_NO_ORDER) then we may request another descriptor that will obey ordering w.r.t. the other descriptor. This will allow sane behavior for RDMA networks, where local completion of an RDMA operation on the active side does not imply remote completion on the passive side. If we send a FIN message after local completion and the FIN is not ordered w.r.t. the RDMA operation, then badness may occur, as the passive side may now try to deregister the memory while the RDMA operation may still be pending on the passive side. 
Note that this has no impact on networks that don't suffer from this limitation, as the ORDER tag can simply always be specified as MCA_BTL_NO_ORDER. Text files modified: trunk/ompi/mca/bml/bml.h | 29, trunk/ompi/mca/btl/btl.h | 10, trunk/ompi/mca/btl/gm/btl_gm.c | 8, trunk/ompi/mca/btl/gm/btl_gm.h | 3, trunk/ompi/mca/btl/mx/btl_mx.c | 8, trunk/ompi/mca/btl/mx/btl_mx.h | 3, trunk/ompi/mca/btl/openib/btl_openib.c | 49, trunk/ompi/mca/btl/openib/btl_openib.h | 3, trunk/ompi/mca/btl/openib/btl_openib_endpoint.c | 7, trunk/ompi/mca/btl/openib/btl_openib_frag.c | 7, trunk/ompi/mca/btl/portals/btl_portals.c | 8, trunk/ompi/mca/btl/portals/btl_portals.h
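The hazard the commit log describes can be replayed as a toy timeline (all names hypothetical, no real RDMA involved): on the passive side, a FIN triggers memory deregistration, so if the FIN can overtake the still-in-flight RDMA write, the RDMA lands on deregistered memory.

```c
#include <stdbool.h>

/* Passive-side state in the toy model */
static bool mem_registered = true;  /* target buffer still registered? */
static bool rdma_landed    = false;

/* The RDMA write arriving at the passive side.
 * Returns 0 on success, -1 if the target was already deregistered. */
static int deliver_rdma(void)
{
    if (!mem_registered)
        return -1;        /* write to deregistered memory: badness */
    rdma_landed = true;
    return 0;
}

/* The FIN arriving: the passive side tears down the registration. */
static void deliver_fin(void)
{
    mem_registered = false;
}
```

In the ordered case (RDMA delivered, then FIN) everything succeeds; in the unordered case (FIN overtakes the RDMA) the write faults. That is precisely why the FIN descriptor must be allocated with the RDMA's order tag on networks where local completion does not imply remote completion.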
Re: [OMPI devel] OMPI over ofed udapl over iwarp
As an aside, my personal feeling is that even when running over IB, the preposting of recvs is worth the small overhead of piggybacking a credit system on the messages that already cross the wire. If nothing else, this avoids the added congestion of RNR-NAKs and the resends they trigger. Put another way, I favor programming for IB as if it lacked the link-level flow control that the current BTL apparently assumes. We avoid the RNR-NAKs in the Open IB BTL via a credit system. I would have to review the udapl BTL, but I believe it does something similar. I believe the problem only exists during lazy connection establishment, when credits are probably initialized to the defaults on both ends. We should really just set the credits as part of the handshake (after the receiver has posted the receive buffers). - Galen -Paul -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group HPC Research Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
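The piggybacked credit scheme mentioned above can be sketched minimally (hypothetical names and numbers; the real openib BTL tracks credits per endpoint and QP): each side starts with as many send credits as the peer has preposted recv buffers, a send consumes a credit, and the receiver returns accumulated credits on messages it sends back, so a sender can never overrun the peer's recv queue and RNR-NAKs never occur.

```c
/* One direction of a flow-controlled channel (toy model) */
typedef struct {
    int send_credits;  /* sends we may still issue to the peer */
    int credits_owed;  /* recvs we've reposted but not yet advertised */
} channel_t;

/* Consume a credit, or tell the caller to queue locally. */
static int try_send(channel_t *c)
{
    if (c->send_credits == 0)
        return -1;     /* queue instead of overrunning the peer */
    c->send_credits--;
    return 0;
}

/* Receiver side: a reposted recv buffer becomes a credit to return. */
static void on_recv_reposted(channel_t *c) { c->credits_owed++; }

/* Sender side: the peer piggybacked returned credits on a message. */
static void on_credit_update(channel_t *c, int returned)
{
    c->send_credits += returned;
}
```

The design point from the thread: because credits ride on traffic that crosses the wire anyway, the overhead is small, and the sends that would have drawn RNR-NAKs are simply queued locally until credits return.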
Re: [OMPI devel] OMPI over ofed udapl over iwarp
More like trying to work around the race condition that exists: the server side sends an RDMA message first, thus violating the iWARP protocol. For those who want the gory details: when the server sends first -and- that RDMA message arrives at the client _before_ the client transitions into RDMA mode, then that RDMA message gets passed up to the host driver as streaming-mode data. So when I originally ran OMPI/udapl on Chelsio's RNIC, the client actually received the MPA start response -and- the first FPDU from the server while still in streaming mode. This results in a connection abort because it violates the MPA startup protocol... I would recommend making a state transition diagram of the lazy connection establishment. I did this during implementation of the Open IB BTL. Of course, I threw it out as soon as the code was committed :-). This shouldn't take very long, and then you can simply modify the state diagram to prevent these race conditions, perhaps identifying an existing state where you can post any buffers that you need. Don't forget to throw away the state diagram as soon as the code is committed ;-). - Galen By causing the server to sleep just after accepting the connection, I give the client time to transition into RDMA mode... It was just a hack to continue testing OMPI/udapl/chelsio. And it revealed the problem being discussed in this thread: the OMPI udapl BTL doesn't pre-post the recvs for the sendrecv exchange. Neither the client nor the server side of the udapl BTL connection setup pre-posts RECV buffers before connecting. This can allow a SEND to arrive before a RECV buffer is available. I _think_ IB will handle this issue by retransmitting the SEND. Chelsio's iWARP device, however, TERMINATEs the connection. My sleep() makes this condition happen every time. A compliant DAPL program also ensures that there are adequate receive buffers in place before the remote peer Sends. 
It is explicitly noted that failure to follow this rule will invoke a transport/device dependent penalty. It may be that the sendq will be fenced, or it may be that the connection will be terminated. So any RDMA BTL should pre-post recv buffers before initiating or accepting a connection. I know of no udapl restriction saying a recv must be posted before a send. And yes, we do pre-post recv buffers, but since the BTL creates 2 connections per peer, one for eager size messages and one for max size messages, the BTL needs to know which connection the current endpoint is to service so that it can post the proper size recv buffer. Also, I agree in theory the btl could potentially post the recv which currently occurs in mca_btl_udapl_sendrecv before the connect or accept, but I think in practice we had issues doing this and we had to wait until a DAT_CONNECTION_EVENT_ESTABLISHED was received. Well I'm trying it now. It should work. If it doesn't, then dapl or the underlying providers are broken. From what I can tell, the udapl btl exchanges memory info as a first order of business after connection establishment (mca_btl_udapl_sendrecv()). The RECV buffer post for this exchange, however, should really be done _before_ the dat_ep_connect() on the active side, and _before_ the dat_cr_accept() on the server side. Currently it's done after the ESTABLISHED event is dequeued, thus allowing the race condition. I believe the rules are the ULP must ensure that a RECV is posted before the client can post a SEND for that buffer. And further, the ULP must enforce flow control somehow so that a SEND never arrives without a RECV buffer being available. Maybe this is a rule iWARP imposes on its ULPs but not uDAPL. Perhaps this is just a bug and I opened it up with my sleep(). Or is the uDAPL btl assuming the transport will deal with the lack of a RECV buffer at the time a SEND arrives? There may be a race condition here but you really have to try hard to see it. 
I agree it's a small race condition that I made very large by my sleep! :-) But I can kill 2 birds with one stone here: I'm testing now a change to the sendrecv exchange to: 1) prepost the recvs just after dat_ep_create 2) force the side that issues the dat_ep_connect to send its addr-data first 3) force the side that issues the dat_cr_accept to wait for the send from the peer, then post its addr-data send This will plug both race conditions. I'll post the patch once I debug it and we can discuss if you think it's a good solution and/or if it works for Solaris udapl. From Steve previously. "Also: Given there is a message exchange _always_ after connection setup, then we can change that exchange to support the client-must-send-first issue..." I agree, I am sure we can do something, but if it includes an additional message we should consider an MCA parameter to govern this because the connection wireup is already costly enough. It won't add an additional message. Stay tuned for a patch! Steve
Re: [OMPI devel] Create success (r1.3a1r13481)
Expect this rev to die all over the place.. I had a bug in my r13481 checkin that prevented OB1 from getting selected, I corrected this in r13482. sorry bout that.. - Galen On Feb 2, 2007, at 7:44 PM, MPI Team wrote: Creating nightly snapshot SVN tarball was a success. Snapshot: 1.3a1r13481 Start time: Fri Feb 2 21:21:47 EST 2007 End time: Fri Feb 2 21:44:05 EST 2007 Your friendly daemon, Cyrador ___ testing mailing list test...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/testing
[OMPI devel] For Open MPI + BPROC users
We have found a potential issue with BPROC that may affect Open MPI. Open MPI by default uses PTYs for I/O forwarding; if PTYs aren't set up on the compute nodes, Open MPI will revert to using pipes. Recently (today) we found a potential issue with PTYs and BPROC. A simple reader/writer using PTYs causes the writer to hang in uninterruptible sleep. The consistency of the process table between the head node and the back-end nodes is also affected; that is, "bps" shows no writer process, while "bpsh NODE ps aux" shows the writer process in uninterruptible sleep. Since Open MPI uses PTYs by default on BPROC, this results in ORTED or MPI processes being orphaned on compute nodes. The workaround for this issue is to configure Open MPI with --disable-pty-support and rebuild. - Open MPI Team
[O-MPI devel] build warnings..
Current build warnings: mca_base_parse_paramfile_lex.c:1664: warning: 'yy_flex_realloc' defined but not used qsort.c:163: warning: cast from pointer to integer of different size show_help_lex.c:1606: warning: 'yy_flex_realloc' defined but not used rmgr_proxy.c:237: warning: ISO C forbids conversion of object pointer to function pointer type rmgr_proxy.c:356: warning: ISO C forbids conversion of function pointer to object pointer type rmgr_urm.c:184: warning: ISO C forbids conversion of object pointer to function pointer type rmgr_urm.c:309: warning: ISO C forbids conversion of function pointer to object pointer type comm_cid.c:167: warning: comparison between signed and unsigned fake_stack.c:46: warning: no previous prototype for 'ompi_convertor_create_stack_with_pos_general'
Re: [O-MPI devel] couple of problems in openib mpool.
Hey Gleb, Sorry for the delay.. we have been doing a bit of reworking of the pml/btl so that the btls can be shared outside of just the pml (collectives, etc). I have added the bug fix (old_reg). Will look at the assumption of non-NULL registration next. Thanks (and keep them coming ;-), Galen On Aug 11, 2005, at 8:27 AM, Gleb Natapov wrote: Hello, There are a couple of bugs/typos in the openib mpool. The first one is fixed by the included patch. The second one is in the function mca_mpool_openib_free(). This function assumes that registration is never NULL, but there are callers that think differently (ompi/class/ompi_fifo.h, ompi/class/ompi_circular_buffer_fifo.h) Index: ompi/mca/mpool/openib/mpool_openib_module.c === --- ompi/mca/mpool/openib/mpool_openib_module.c (revision 6806) +++ ompi/mca/mpool/openib/mpool_openib_module.c (working copy) @@ -127,7 +127,7 @@ mca_mpool_base_registration_t* old_reg = *registration; void* new_mem = mpool->mpool_alloc(mpool, size, 0, registration); memcpy(new_mem, addr, old_reg->bound - old_reg->base); -mpool->mpool_free(mpool, addr, &old_reg); +mpool->mpool_free(mpool, addr, old_reg); return new_mem; } -- Gleb.
Re: [O-MPI devel] Fwd: Regarding MVAPI Component in Open MPI
Hi Sridhar, I have committed changes that allow you to set the debug verbosity, OMPI_MCA_btl_base_debug: 0 - no debug output, 1 - standard debug output, 2 - very verbose debug output. Also we have run the Pallas tests and are not able to reproduce your failures. We do see a warning in the Reduce test but it does not hang and runs to completion. Attached is a simple ping pong program, try running this and let us know the results. Thanks, Galen /* * MPI ping program * * Patterned after the example in the Quadrics documentation */ #define MPI_ALLOC_MEM 0 #include #include #include #include #include #include #include "mpi.h" static int str2size(char *str) { int size; char mod[32]; switch (sscanf(str, "%d%1[mMkK]", &size, mod)) { case 1: return (size); case 2: switch (*mod) { case 'm': case 'M': return (size << 20); case 'k': case 'K': return (size << 10); default: return (size); } default: return (-1); } } static void usage(void) { fprintf(stderr, "Usage: mpi-ping [flags] [] []\n" " mpi-ping -h\n"); exit(EXIT_FAILURE); } static void help(void) { printf ("Usage: mpi-ping [flags] [] []\n" "\n" " Flags may be any of\n" " -B use blocking send/recv\n" " -C check data\n" " -O overlapping pings\n" " -W perform warm-up phase\n" " -r number repetitions to time\n" " -A use MPI_Alloc_mem to register memory\n" " -h print this info\n" "\n" " Numbers may be postfixed with 'k' or 'm'\n\n"); exit(EXIT_SUCCESS); } int main(int argc, char *argv[]) { MPI_Status status; MPI_Request recv_request; MPI_Request send_request; unsigned char *rbuf; unsigned char *tbuf; int c; int i; int bytes; int nproc; int peer; int proc; int r; int tag = 0x666; /* * default options / arguments */ int reps = 1; int blocking = 0; int check = 0; int overlap = 0; int warmup = 0; int inc_bytes = 0; int max_bytes = 0; int min_bytes = 0; int alloc_mem = 0; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &proc); MPI_Comm_size(MPI_COMM_WORLD, &nproc); while ((c = getopt(argc, argv, "BCOWAr:h")) != -1) { switch (c) { case 
'B': blocking = 1; break; case 'C': check = 1; break; case 'O': overlap = 1; break; case 'W': warmup = 1; break; case 'A': alloc_mem=1; break; case 'r': if ((reps = str2size(optarg)) <= 0) { usage(); } break; case 'h': help(); default: usage(); } } if (optind == argc) { min_bytes = 0; } else if ((min_bytes = str2size(argv[optind++])) < 0) { usage(); } if (optind == argc) { max_bytes = min_bytes; } else if ((max_bytes = str2size(argv[optind++])) < min_bytes) { usage(); } if (optind == argc) { inc_bytes = 0; } else if ((inc_bytes = str2size(argv[optind++])) < 0) { usage(); } if (nproc == 1) { exit(EXIT_SUCCESS); } #if MPI_ALLOC_MEM if(alloc_mem) { MPI_Alloc_mem(max_bytes ? max_bytes: 8, MPI_INFO_NULL, &rbuf); MPI_Alloc_mem(max_bytes ? max_bytes: 8, MPI_INFO_NULL, &tbuf); } else { #endif if ((rbuf = (unsigned char *) malloc(max_bytes ? max_bytes : 8)) == NULL) { perror("malloc"); exit(EXIT_FAILURE); } if ((tbuf = (unsigned char *) malloc(max_bytes ? max_bytes : 8)) == NULL) { perror("malloc"); exit(EXIT_FAILURE); } #if MPI_ALLOC_MEM } #endif if (check) { for (i = 0; i < max_bytes; i++) { tbuf[i] = i & 255; rbuf[i] = 0; } } if (proc == 0) { if (overlap) { printf("mpi-ping: overlapping ping-pong\n"); } else if (blocking) { printf("mpi-ping: ping-pong (using blocking send/recv)\n"); } else { printf("mpi-ping: ping-pong\n"); } if (check) { printf("data checking enabled\n"); } printf("nprocs=%d, reps=%d, min bytes=%d, max bytes=%d inc bytes=%d\n", nproc, reps, min_bytes, max_bytes, inc_bytes); fflush(stdout); } MPI_Barrier(MPI_COMM_WORLD); peer = proc ^ 1; if ((peer < nproc) && (peer & 1)) { printf("%d pings %d\n", proc, peer); fflush(stdout
Re: [O-MPI devel] Fwd: Regarding MVAPI Component in Open MPI
Hi On Aug 9, 2005, at 8:15 AM, Sridhar Chirravuri wrote: The same kind of output while running the Pallas "pingpong" test. -Sridhar -Original Message- From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Sridhar Chirravuri Sent: Tuesday, August 09, 2005 7:44 PM To: Open MPI Developers Subject: Re: [O-MPI devel] Fwd: Regarding MVAPI Component in Open MPI I have run the sendrecv function in Pallas but it failed to run. Here is the output: [root@micrompi-2 SRC_PMB]# mpirun -np 2 PMB-MPI1 sendrecv Could not join a running, existing universe Establishing a new one named: default-universe-5097 [0,1,1][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub [0,1,1][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub [0,1,0][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub [0,1,0][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub [0,1,0][btl_mvapi_endpoint.c:542:mca_btl_mvapi_endpoint_send] Connection to endpoint closed ... connecting ... [0,1,0][btl_mvapi_endpoint.c:318:mca_btl_mvapi_endpoint_start_connect] Initialized High Priority QP num = 263177, Low Priority QP num = 263178, LID = 785 [0,1,0][btl_mvapi_endpoint.c:190:mca_btl_mvapi_endpoint_send_connect_req] Sending High Priority QP num = 263177, Low Priority QP num = 263178, LID = 785 [0,1,0][btl_mvapi_endpoint.c:542:mca_btl_mvapi_endpoint_send] Connection to endpoint closed ... connecting ... 
[0,1,0][btl_mvapi_endpoint.c:318:mca_btl_mvapi_endpoint_start_connect] Initialized High Priority QP num = 263179, Low Priority QP num = 263180, LID = 786 [0,1,0][btl_mvapi_endpoint.c:190:mca_btl_mvapi_endpoint_send_connect_req] Sending High Priority QP num = 263179, Low Priority QP num = 263180, LID = 786 #--- # PALLAS MPI Benchmark Suite V2.2, MPI-1 part #--- # Date : Tue Aug 9 07:11:25 2005 # Machine: x86_64 # System : Linux # Release: 2.6.9-5.ELsmp # Version: #1 SMP Wed Jan 5 19:29:47 EST 2005 # # Minimum message length in bytes: 0 # Maximum message length in bytes: 4194304 # # MPI_Datatype : MPI_BYTE # MPI_Datatype for reductions: MPI_FLOAT # MPI_Op : MPI_SUM # # # List of Benchmarks to run: # Sendrecv [0,1,1][btl_mvapi_endpoint.c:368:mca_btl_mvapi_endpoint_reply_start_connect] Initialized High Priority QP num = 263177, Low Priority QP num = 263178, LID = 777 [0,1,1][btl_mvapi_endpoint.c:266:mca_btl_mvapi_endpoint_set_remote_info] Received High Priority QP num = 263177, Low Priority QP num = 263178, LID = 785 [0,1,1][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query] Modified to init..Qp 7080096 [0,1,1][btl_mvapi_endpoint.c:791:mca_btl_mvapi_endpoint_qp_init_query] Modified to RTR..Qp 7080096 [0,1,1][btl_mvapi_endpoint.c:814:mca_btl_mvapi_endpoint_qp_init_query] Modified to RTS..Qp 7080096 [0,1,1][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query] Modified to init..Qp 7240736 [0,1,1][btl_mvapi_endpoint.c:791:mca_btl_mvapi_endpoint_qp_init_query] Modified to RTR..Qp 7240736 [0,1,1][btl_mvapi_endpoint.c:814:mca_btl_mvapi_endpoint_qp_init_query] Modified to RTS..Qp 7240736 [0,1,1][btl_mvapi_endpoint.c:190:mca_btl_mvapi_endpoint_send_connect_req] Sending High Priority QP num = 263177, Low Priority QP num = 263178, LID = 777 [0,1,0][btl_mvapi_endpoint.c:266:mca_btl_mvapi_endpoint_set_remote_info] Received High Priority QP num = 263177, Low Priority QP num = 263178, LID = 777 
[0,1,0][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query] Modified to init..Qp 7081440 [0,1,0][btl_mvapi_endpoint.c:791:mca_btl_mvapi_endpoint_qp_init_query] Modified to RTR..Qp 7081440 [0,1,0][btl_mvapi_endpoint.c:814:mca_btl_mvapi_endpoint_qp_init_query] Modified to RTS..Qp 7081440 [0,1,0][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query] Modified to init..Qp 7241888 [0,1,0][btl_mvapi_endpoint.c:791:mca_btl_mvapi_endpoint_qp_init_query] Modified to RTR..Qp 7241888 [0,1,0][btl_mvapi_endpoint.c:814:mca_btl_mvapi_endpoint_qp_init_query] Modified to RTS..Qp 7241888 [0,1,1][btl_mvapi_component.c:523:mca_btl_mvapi_component_progress] Got a recv completion Thanks -Sridhar -Original Message- From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Brian Barrett Sent: Tuesday, August 09, 2005 7:35 PM To: Open MPI Developers Subject: Re: [O-MPI devel] Fwd: Regarding MVAPI Component in Open MPI On Aug 9, 2005, at 8:48 AM, Sridhar Chirravuri wrote: Does r6774 have a lot of changes related to the 3rd generation point-to-point? I am trying to run some benchmark tests (e.g. Pallas) with the Open MPI stack and just want to compare the performance figures with MVAPICH 095 and MVAPICH 092. In order to use 3rd generation p2p communication, I have added the following line in /openmpi/etc/openmpi-mca-params.conf: pml=ob1 I also exported (as a double check) OMPI_MCA_pml=ob1. Then, I
Re: [O-MPI devel] [PATCH] wrong variable type in OpenIB.
Gleb, Changes are in the trunk. Thanks, Galen On Aug 7, 2005, at 4:32 AM, Gleb Natapov wrote: Hello Galen, The included patch changes the type of the value returned from ibv_poll_cq. It should be signed because we check if it is less than zero later in the code. Index: ompi/mca/btl/openib/btl_openib_component.c === --- ompi/mca/btl/openib/btl_openib_component.c (revision 6757) +++ ompi/mca/btl/openib/btl_openib_component.c (working copy) @@ -492,8 +492,8 @@ mca_btl_base_module_t** mca_btl_openib_c int mca_btl_openib_component_progress() { -uint32_t i, ne; -int count = 0; +uint32_t i; +int count = 0, ne; mca_btl_openib_frag_t* frag; mca_btl_openib_endpoint_t* endpoint; /* Poll for completions */ -- Gleb.
Re: [O-MPI devel] [PATCH] for ompi_free_list.c
Changes are in the trunk. Thanks, Galen On Aug 8, 2005, at 7:38 AM, Gleb Natapov wrote: Hello, The included patch fixes bugs in ompi_free_list for the case where the ompi_free_list was created with NULL class and/or mpool parameters. Index: ompi/class/ompi_free_list.c === --- ompi/class/ompi_free_list.c (revision 6760) +++ ompi/class/ompi_free_list.c (working copy) @@ -75,7 +75,7 @@ int ompi_free_list_grow(ompi_free_list_t unsigned char* ptr; size_t i; size_t mod; -mca_mpool_base_registration_t* user_out; +mca_mpool_base_registration_t* user_out = NULL; if (flist->fl_max_to_alloc > 0 && flist->fl_num_allocated + num_elements > flist->fl_max_to_alloc) return OMPI_ERR_TEMP_OUT_OF_RESOURCE; @@ -97,7 +97,10 @@ int ompi_free_list_grow(ompi_free_list_t item->user_data = user_out; if (NULL != flist->fl_elem_class) { OBJ_CONSTRUCT_INTERNAL(item, flist->fl_elem_class); -} +} else { + OBJ_CONSTRUCT (&item->super, opal_list_item_t); + } + opal_list_append(&(flist->super), &(item->super)); ptr += flist->fl_elem_size; } -- Gleb.
[O-MPI devel] Mellanox VAPI SRQ.
Using the mvapi btl you can now set OMPI_MCA_btl_mvapi_use_srq=1, which will cause mvapi to use a shared receive queue. This allows much better scaling, as receives are posted per interface port and not per queue pair. Note: older versions of Mellanox firmware may see a substantial performance impact on small message latency, but the latest firmware shows only a small cost on the order of 0.2 usec. - Galen