Hi,

This is probably irrelevant by now, but I just want to close the issue.
Updating UCX to 1.4.0 and rebuilding OpenMPI against it solves the freezing
problem. I am not sure whether UCX is part of OFED, but we do have a
relatively old version of it on the cluster, so I will try to ask the
sysadmins to update it centrally on the system as well. I am currently
testing across several nodes and it is definitely working.

I still cannot get the CUDA-aware version to work, but I'll send a separate
email about this when I get my head around what's going on.

Best wishes,
Robert
--
Dr Robert Sawko
Research Staff Member, IBM
Daresbury Laboratory
Keckwick Lane, Warrington
WA4 4AD
United Kingdom
--
Email (IBM): rsa...@uk.ibm.com
Email (STFC): robert.sa...@stfc.ac.uk
Phone (office): +44 (0) 1925 60 3967
Phone (mobile): +44 778 830 8522
Profile page: http://researcher.watson.ibm.com/researcher/view.php?person=uk-RSawko
--

-----pyfrmailinglist@googlegroups.com wrote: -----
To: pyfrmailinglist@googlegroups.com
From: Freddie Witherden
Sent by: pyfrmailinglist@googlegroups.com
Date: 10/29/2018 08:55PM
Subject: Re: [pyfrmailinglist] Cylinder 3D case freezes on P100

Hi Robert,

On 29/10/2018 20:39, Robert Sawko wrote:
> We got the code to work by reverting to OPAL hooks. Your suggestion was
> correct, but I fear some more work is needed. The code runs with this
> command:
>
>     mpirun \
>         --mca pml_ucx_opal_mem_hooks 1 \
>         -report-bindings \
>         pyfr run -b cuda mesh.pyfrm ../config.ini
>
> For details, please read below. Are you running PyFR on Summit? I am not
> 100% sure, but I think this may become relevant for you at some point.

I suspect the reason you only encounter this problem for the larger (3D)
cases is a consequence of how Python manages memory. Small allocations are
handled by a memory pool and thus never result in a malloc/free operation.
It is therefore possible that the issue is only triggered when running
larger cases whose allocations bypass the pool. Either way, this is almost
certainly a bug in UCX.

In terms of Summit, we have run successfully with Spectrum MPI. Performance
and scaling were both very impressive. I do not believe that any special
modifications or MPI parameters were required.

Regards,
Freddie.

> I actually build OpenMPI myself, so in my build the following transport
> layers are enabled:
>
>     Transports
>     -----------------------
>     Cisco usNIC: no
>     Cray uGNI (Gemini/Aries): no
>     Intel Omnipath (PSM2): no
>     Intel SCIF: no
>     Intel TrueScale (PSM): no
>     Mellanox MXM: yes
>     Open UCX: yes            <- This guy seems to be the culprit.
>     OpenFabrics Libfabric: no
>     OpenFabrics Verbs: yes
>     Portals4: no
>     Shared memory/copy in+copy out: yes
>     Shared memory/Linux CMA: yes
>     Shared memory/Linux KNEM: yes
>     Shared memory/XPMEM: no
>     TCP: yes
>
> When I build OpenMPI I need to point it to the Mellanox libraries; to
> maximise performance on InfiniBand, OMPI needs to be able to find these
> libs. I validate my build of OMPI with the Intel MPI Benchmarks.
>
> Now here is how the train of thought on UCX went:
>
> 1) As per your suggestion, memory hooks are an issue here.
>
> 2) The topmost frame of the gdb backtrace said:
>
>     #0 0x00003fff82994740 in ucm_malloc_mmaped_ptr_remove_if_exists
>        (ptr=0x3eff0dd9bdd0) at malloc/malloc_hook.c:153
>
> 3) What is ucm? We go into the OpenMPI source and look for "ucm_":
>
>     openmpi-3.1.2$ grep -r ucm_*
>     ompi/mca/pml/ucx/pml_ucx_component.c:#include <ucm/api/ucm.h>
>     ompi/mca/pml/ucx/pml_ucx_component.c:    ucm_vm_munmap(buf, length);
>     ompi/mca/pml/ucx/pml_ucx_component.c:    ucm_set_external_event(UCM_EVENT_VM_UNMAPPED);
>
> So the UCX component is the only thing that uses it.
>
> 4) We run the command:
>
>     ompi_info --param pml ucx --level 9
>
>     MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v3.1.2)
>     MCA pml ucx: ---------------------------------------------------
>     MCA pml ucx: parameter "pml_ucx_verbose" (current value: "0", data
>                  source: default, level: 9 dev/all, type: int)
>                  Verbose level of the UCX component
>     MCA pml ucx: parameter "pml_ucx_priority" (current value: "51", data
>                  source: default, level: 3 user/all, type: int)
>                  Priority of the UCX component
>     MCA pml ucx: parameter "pml_ucx_num_disconnect" (current value: "1", data
>                  source: default, level: 3 user/all, type: int)
>                  How may disconnects go in parallel
>     MCA pml ucx: parameter "pml_ucx_opal_mem_hooks" (current value: "false",
>                  data source: default, level: 3 user/all, type: bool)
>                  Use OPAL memory hooks, instead of UCX internal memory hooks
>                  Valid values: 0: f|false|disabled|no|n, 1: t|true|enabled|yes|y
>
> We use the last one to suppress the UCX memory hooks. The code seems to
> work. Elementary?
>
> Now I am going to test a few more examples. It is still not clear why this
> manifests itself in the 3D cylinder case but not in the 2D examples you
> provide, and it baffles me that it works on the login node.
>
> I need to test it with Spectrum MPI too, but DL has an older version of
> Spectrum and I think it may take a while to get the new one onto the
> system, and I want to test CUDA-aware comms.
>
> Hope this report helps. I think UCX may be important for you in the
> future, so it would be good to test PyFR with it. It is possible that my
> old builds of OpenMPI did not include it, which is why I had a
> recollection that it all worked smoothly in the past, but I have this
> hopeless habit of installing the latest software whenever I start a new
> project...
>
> Best wishes,
> Robert
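
Where updating UCX straight away is not an option, the OPAL memory-hooks
workaround quoted above can be applied either on the mpirun command line or
via Open MPI's OMPI_MCA_ environment-variable prefix. A minimal sketch,
assuming the same UCX-enabled Open MPI 3.1.x build discussed in the thread:

    # Ask OPAL, rather than UCX, to install the memory hooks for this run.
    mpirun --mca pml_ucx_opal_mem_hooks 1 pyfr run -b cuda mesh.pyfrm ../config.ini

    # Equivalently, export the parameter once, e.g. in a batch script.
    export OMPI_MCA_pml_ucx_opal_mem_hooks=1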
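
For anyone retracing the actual fix described at the top of the thread, the
lines below are a rough sketch of checking the installed UCX release and
rebuilding Open MPI against a newer one. The paths are placeholders and the
configure options (--with-ucx, --with-cuda) are assumptions for a typical
UCX- and CUDA-enabled build, not the exact commands used on this cluster:

    # Print the version of the UCX installation (ucx_info ships with UCX).
    ucx_info -v

    # Rebuild Open MPI against a newer UCX; paths here are purely illustrative.
    ./configure --with-ucx=/opt/ucx-1.4.0 --with-cuda=/usr/local/cuda
    make -j 8
    make install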