Did you receive my email?

timesir <mrlong...@gmail.com> wrote on Tue, Nov 15, 2022 at 12:33:

> (py3.9) ➜  /share   mpirun -n 2 --machinefile hosts --mca
> rmaps_base_verbose 100 --mca ras_base_verbose 100  which mpirun
> [computer01:39342] mca: base: component_find: searching NULL for ras
> components
> [computer01:39342] mca: base: find_dyn_components: checking NULL for ras
> components
> [computer01:39342] pmix:mca: base: components_register: registering
> framework ras components
> [computer01:39342] pmix:mca: base: components_register: found loaded
> component simulator
> [computer01:39342] pmix:mca: base: components_register: component
> simulator register function successful
> [computer01:39342] pmix:mca: base: components_register: found loaded
> component pbs
> [computer01:39342] pmix:mca: base: components_register: component pbs
> register function successful
> [computer01:39342] pmix:mca: base: components_register: found loaded
> component slurm
> [computer01:39342] pmix:mca: base: components_register: component slurm
> register function successful
> [computer01:39342] mca: base: components_open: opening ras components
> [computer01:39342] mca: base: components_open: found loaded component
> simulator
> [computer01:39342] mca: base: components_open: found loaded component pbs
> [computer01:39342] mca: base: components_open: component pbs open function
> successful
> [computer01:39342] mca: base: components_open: found loaded component slurm
> [computer01:39342] mca: base: components_open: component slurm open
> function successful
> [computer01:39342] mca:base:select: Auto-selecting ras components
> [computer01:39342] mca:base:select:(  ras) Querying component [simulator]
> [computer01:39342] mca:base:select:(  ras) Querying component [pbs]
> [computer01:39342] mca:base:select:(  ras) Querying component [slurm]
> [computer01:39342] mca:base:select:(  ras) No component selected!
> [computer01:39342] mca: base: component_find: searching NULL for rmaps
> components
> [computer01:39342] mca: base: find_dyn_components: checking NULL for rmaps
> components
> [computer01:39342] pmix:mca: base: components_register: registering
> framework rmaps components
> [computer01:39342] pmix:mca: base: components_register: found loaded
> component ppr
> [computer01:39342] pmix:mca: base: components_register: component ppr
> register function successful
> [computer01:39342] pmix:mca: base: components_register: found loaded
> component rank_file
> [computer01:39342] pmix:mca: base: components_register: component
> rank_file has no register or open function
> [computer01:39342] pmix:mca: base: components_register: found loaded
> component round_robin
> [computer01:39342] pmix:mca: base: components_register: component
> round_robin register function successful
> [computer01:39342] pmix:mca: base: components_register: found loaded
> component seq
> [computer01:39342] pmix:mca: base: components_register: component seq
> register function successful
> [computer01:39342] mca: base: components_open: opening rmaps components
> [computer01:39342] mca: base: components_open: found loaded component ppr
> [computer01:39342] mca: base: components_open: component ppr open function
> successful
> [computer01:39342] mca: base: components_open: found loaded component
> rank_file
> [computer01:39342] mca: base: components_open: found loaded component
> round_robin
> [computer01:39342] mca: base: components_open: component round_robin open
> function successful
> [computer01:39342] mca: base: components_open: found loaded component
> seq
> [computer01:39342] mca: base: components_open: component seq open function
> successful
> [computer01:39342] mca:rmaps:select: checking available component ppr
> [computer01:39342] mca:rmaps:select: Querying component [ppr]
> [computer01:39342] mca:rmaps:select: checking available component rank_file
> [computer01:39342] mca:rmaps:select: Querying component [rank_file]
> [computer01:39342] mca:rmaps:select: checking available component
> round_robin
> [computer01:39342] mca:rmaps:select: Querying component [round_robin]
> [computer01:39342] mca:rmaps:select: checking available component seq
> [computer01:39342] mca:rmaps:select: Querying component [seq]
> [computer01:39342] [prterun-computer01-39342@0,0]: Final mapper priorities
> [computer01:39342]      Mapper: ppr Priority: 90
> [computer01:39342]      Mapper: seq Priority: 60
> [computer01:39342]      Mapper: round_robin Priority: 10
> [computer01:39342]      Mapper: rank_file Priority: 0
>
> ======================   ALLOCATED NODES   ======================
>     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.180.48
>     192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>         Flags: SLOTS_GIVEN
>         aliases: NONE
> =================================================================
>
> ======================   ALLOCATED NODES   ======================
>     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.180.48
>     hepslustretest03: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.60.203,hepslustretest03.ihep.ac.cn,172.17.180.203,172.168.10.23,172.168.10.143
> =================================================================
> [computer01:39342] mca:rmaps: mapping job prterun-computer01-39342@1
> [computer01:39342] mca:rmaps: setting mapping policies for job
> prterun-computer01-39342@1 inherit TRUE hwtcpus FALSE
> [computer01:39342] mca:rmaps[358] mapping not given - using bycore
> [computer01:39342] setdefaultbinding[365] binding not given - using bycore
> [computer01:39342] mca:rmaps:ppr: job prterun-computer01-39342@1 not
> using ppr mapper PPR NULL policy PPR NOTSET
> [computer01:39342] mca:rmaps:seq: job prterun-computer01-39342@1 not
> using seq mapper
> [computer01:39342] mca:rmaps:rr: mapping job prterun-computer01-39342@1
> [computer01:39342] AVAILABLE NODES FOR MAPPING:
> [computer01:39342]     node: computer01 daemon: 0 slots_available: 1
> [computer01:39342] mca:rmaps:rr: mapping by Core for job
> prterun-computer01-39342@1 slots 1 num_procs 2
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 2
> slots that were requested by the application:
>
>   which
>
> Either request fewer procs for your application, or make more slots
> available for use.
>
> A "slot" is the PRRTE term for an allocatable unit where we can
> launch a process.  The number of slots available are defined by the
> environment in which PRRTE processes are run:
>
>   1. Hostfile, via "slots=N" clauses (N defaults to number of
>      processor cores if not provided)
>   2. The --host command line parameter, via a ":N" suffix on the
>      hostname (N defaults to 1 if not provided)
>   3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
>   4. If none of a hostfile, the --host command line parameter, or an
>      RM is present, PRRTE defaults to the number of processor cores
>
> In all the above cases, if you want PRRTE to default to the number
> of hardware threads instead of the number of processor cores, use the
> --use-hwthread-cpus option.
>
> Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
> number of available slots when deciding the number of processes to
> launch.
> --------------------------------------------------------------------------
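>
> For reference, the two remedies the message suggests would look like this
> (a sketch, assuming the same two-node hosts file shown later in this thread):
>
>   # ignore the slot counts and oversubscribe the nodes
>   mpirun -n 2 --map-by :OVERSUBSCRIBE --machinefile hosts which mpirun
>
>   # or declare more slots per node in the hosts file
>   192.168.180.48 slots=2
>   192.168.60.203 slots=2
>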
> On 2022/11/15 02:04, users-requ...@lists.open-mpi.org wrote:
>
> Send users mailing list submissions to
>       users@lists.open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>       https://lists.open-mpi.org/mailman/listinfo/users
> or, via email, send a message with subject or body 'help' to
>       users-requ...@lists.open-mpi.org
>
> You can reach the person managing the list at
>       users-ow...@lists.open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
>    1. Re: [OMPI devel] There are not enough slots available in the
>       system to satisfy the 2, slots that were requested by the
>       application (Jeff Squyres (jsquyres))
>    2. Re: Tracing of openmpi internal functions
>       (Jeff Squyres (jsquyres))
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 14 Nov 2022 17:04:24 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> <jsquy...@cisco.com>
> To: Open MPI Users <users@lists.open-mpi.org> <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] [OMPI devel] There are not enough slots
>       available in the system to satisfy the 2, slots that were requested by
>       the application
> Message-ID:
>       <bl0pr11mb29801261edb4fd0e9ef2f4ecc0...@bl0pr11mb2980.namprd11.prod.outlook.com>
> Content-Type: text/plain; charset="utf-8"
>
> Yes, somehow I'm not seeing all the output that I expect to see.  Can you
> ensure that, if you're copy-and-pasting from the email, it's actually
> using "dash dash" in front of "mca" and "machinefile" (vs. a
> copy-and-pasted "em dash")?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> ________________________________
> From: users <users-boun...@lists.open-mpi.org> on behalf of Gilles Gouaillardet via users <users@lists.open-mpi.org>
> Sent: Sunday, November 13, 2022 9:18 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
> Subject: Re: [OMPI users] [OMPI devel] There are not enough slots available 
> in the system to satisfy the 2, slots that were requested by the application
>
> There is a typo in your command line.
> You should use --mca (minus minus) instead of -mca
>
> Also, you can try --machinefile instead of -machinefile
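>
> That is, the corrected invocation would be:
>
>   mpirun -n 2 --machinefile hosts --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 which mpirun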
>
> Cheers,
>
> Gilles
>
> There are not enough slots available in the system to satisfy the 2
> slots that were requested by the application:
>
>   —mca
>
> On Mon, Nov 14, 2022 at 11:04 AM timesir via users
> <users@lists.open-mpi.org> wrote:
>
> (py3.9) ➜  /share  mpirun -n 2 -machinefile hosts —mca rmaps_base_verbose 100
> --mca ras_base_verbose 100  which mpirun
> [computer01:04570] mca: base: component_find: searching NULL for ras 
> components
> [computer01:04570] mca: base: find_dyn_components: checking NULL for ras 
> components
> [computer01:04570] pmix:mca: base: components_register: registering framework 
> ras components
> [computer01:04570] pmix:mca: base: components_register: found loaded 
> component simulator
> [computer01:04570] pmix:mca: base: components_register: component simulator 
> register function successful
> [computer01:04570] pmix:mca: base: components_register: found loaded 
> component pbs
> [computer01:04570] pmix:mca: base: components_register: component pbs 
> register function successful
> [computer01:04570] pmix:mca: base: components_register: found loaded 
> component slurm
> [computer01:04570] pmix:mca: base: components_register: component slurm 
> register function successful
> [computer01:04570] mca: base: components_open: opening ras components
> [computer01:04570] mca: base: components_open: found loaded component 
> simulator
> [computer01:04570] mca: base: components_open: found loaded component pbs
> [computer01:04570] mca: base: components_open: component pbs open function 
> successful
> [computer01:04570] mca: base: components_open: found loaded component slurm
> [computer01:04570] mca: base: components_open: component slurm open function 
> successful
> [computer01:04570] mca:base:select: Auto-selecting ras components
> [computer01:04570] mca:base:select:(  ras) Querying component [simulator]
> [computer01:04570] mca:base:select:(  ras) Querying component [pbs]
> [computer01:04570] mca:base:select:(  ras) Querying component [slurm]
> [computer01:04570] mca:base:select:(  ras) No component selected!
>
> ======================   ALLOCATED NODES   ======================
>     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.180.48
>     192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>         Flags: SLOTS_GIVEN
>         aliases: NONE
> =================================================================
>
> ======================   ALLOCATED NODES   ======================
>     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.180.48
>     hepslustretest03: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.60.203,172.17.180.203,172.168.10.23,172.168.10.143
> =================================================================
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 2
> slots that were requested by the application:
>
>   —mca
>
> Either request fewer procs for your application, or make more slots
> available for use.
>
> A "slot" is the PRRTE term for an allocatable unit where we can
> launch a process.  The number of slots available are defined by the
> environment in which PRRTE processes are run:
>
>   1. Hostfile, via "slots=N" clauses (N defaults to number of
>      processor cores if not provided)
>   2. The --host command line parameter, via a ":N" suffix on the
>      hostname (N defaults to 1 if not provided)
>   3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
>   4. If none of a hostfile, the --host command line parameter, or an
>      RM is present, PRRTE defaults to the number of processor cores
>
> In all the above cases, if you want PRRTE to default to the number
> of hardware threads instead of the number of processor cores, use the
> --use-hwthread-cpus option.
>
> Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
> number of available slots when deciding the number of processes to
> launch.
> --------------------------------------------------------------------------
>
>
>
> On 2022/11/13 23:42, Jeff Squyres (jsquyres) wrote:
> Interesting.  It says:
>
> [computer01:106117] AVAILABLE NODES FOR MAPPING:
> [computer01:106117] node: computer01 daemon: 0 slots_available: 1
>
> This is why it tells you you're out of slots: you're asking for 2, but it 
> only found 1.  This means it's not seeing your hostfile somehow.
>
> I should have asked you to run with 2 variables last time -- can you re-run
> with "mpirun --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 ..."?
>
> Turning on the RAS verbosity should show us what the hostfile component is 
> doing.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> ________________________________
> From: mrlong <mrlong...@gmail.com>
> Sent: Sunday, November 13, 2022 3:13 AM
> To: Jeff Squyres (jsquyres) <jsquy...@cisco.com>; Open MPI Users
> <users@lists.open-mpi.org>
> Subject: Re: [OMPI devel] There are not enough slots available in the system 
> to satisfy the 2, slots that were requested by the application
>
>
> (py3.9) ➜ /share mpirun --version
>
> mpirun (Open MPI) 5.0.0rc9
>
> Report bugs to https://www.open-mpi.org/community/help/
>
> (py3.9) ➜ /share cat hosts
>
> 192.168.180.48 slots=1
> 192.168.60.203 slots=1
>
> (py3.9) ➜ /share mpirun -n 2 -machinefile hosts --mca rmaps_base_verbose 100
> which mpirun
>
> [computer01:106117] mca: base: component_find: searching NULL for rmaps 
> components
> [computer01:106117] mca: base: find_dyn_components: checking NULL for rmaps 
> components
> [computer01:106117] pmix:mca: base: components_register: registering 
> framework rmaps components
> [computer01:106117] pmix:mca: base: components_register: found loaded 
> component ppr
> [computer01:106117] pmix:mca: base: components_register: component ppr 
> register function successful
> [computer01:106117] pmix:mca: base: components_register: found loaded 
> component rank_file
> [computer01:106117] pmix:mca: base: components_register: component rank_file 
> has no register or open function
> [computer01:106117] pmix:mca: base: components_register: found loaded 
> component round_robin
> [computer01:106117] pmix:mca: base: components_register: component 
> round_robin register function successful
> [computer01:106117] pmix:mca: base: components_register: found loaded 
> component seq
> [computer01:106117] pmix:mca: base: components_register: component seq 
> register function successful
> [computer01:106117] mca: base: components_open: opening rmaps components
> [computer01:106117] mca: base: components_open: found loaded component ppr
> [computer01:106117] mca: base: components_open: component ppr open function 
> successful
> [computer01:106117] mca: base: components_open: found loaded component 
> rank_file
> [computer01:106117] mca: base: components_open: found loaded component 
> round_robin
> [computer01:106117] mca: base: components_open: component round_robin open 
> function successful
> [computer01:106117] mca: base: components_open: found loaded component seq
> [computer01:106117] mca: base: components_open: component seq open function 
> successful
> [computer01:106117] mca:rmaps:select: checking available component ppr
> [computer01:106117] mca:rmaps:select: Querying component [ppr]
> [computer01:106117] mca:rmaps:select: checking available component rank_file
> [computer01:106117] mca:rmaps:select: Querying component [rank_file]
> [computer01:106117] mca:rmaps:select: checking available component round_robin
> [computer01:106117] mca:rmaps:select: Querying component [round_robin]
> [computer01:106117] mca:rmaps:select: checking available component seq
> [computer01:106117] mca:rmaps:select: Querying component [seq]
> [computer01:106117] [prterun-computer01-106117@0,0]: Final mapper priorities
> [computer01:106117] Mapper: ppr Priority: 90
> [computer01:106117] Mapper: seq Priority: 60
> [computer01:106117] Mapper: round_robin Priority: 10
> [computer01:106117] Mapper: rank_file Priority: 0
> [computer01:106117] mca:rmaps: mapping job prterun-computer01-106117@1
>
> [computer01:106117] mca:rmaps: setting mapping policies for job 
> prterun-computer01-106117@1 inherit TRUE hwtcpus FALSE [9/1957]
> [computer01:106117] mca:rmaps[358] mapping not given - using bycore
> [computer01:106117] setdefaultbinding[365] binding not given - using bycore
> [computer01:106117] mca:rmaps:ppr: job prterun-computer01-106117@1 not using 
> ppr mapper PPR NULL policy PPR NOTSET
> [computer01:106117] mca:rmaps:seq: job prterun-computer01-106117@1 not using 
> seq mapper
> [computer01:106117] mca:rmaps:rr: mapping job prterun-computer01-106117@1
> [computer01:106117] AVAILABLE NODES FOR MAPPING:
> [computer01:106117] node: computer01 daemon: 0 slots_available: 1
> [computer01:106117] mca:rmaps:rr: mapping by Core for job 
> prterun-computer01-106117@1 slots 1 num_procs 2
>
> ________________________________
>
> There are not enough slots available in the system to satisfy the 2
> slots that were requested by the application:
>
> which
>
> Either request fewer procs for your application, or make more slots
> available for use.
>
> A "slot" is the PRRTE term for an allocatable unit where we can
> launch a process. The number of slots available are defined by the
> environment in which PRRTE processes are run:
>
>   1.  Hostfile, via "slots=N" clauses (N defaults to number of
> processor cores if not provided)
>   2.  The --host command line parameter, via a ":N" suffix on the
> hostname (N defaults to 1 if not provided)
>   3.  Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
>   4.  If none of a hostfile, the --host command line parameter, or an
> RM is present, PRRTE defaults to the number of processor cores
>
> In all the above cases, if you want PRRTE to default to the number
> of hardware threads instead of the number of processor cores, use the
> --use-hwthread-cpus option.
>
> Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
> number of available slots when deciding the number of processes to
> launch.
>
> ________________________________
> On 2022/11/8 05:46, Jeff Squyres (jsquyres) wrote:
> In the future, can you please just mail one of the lists?  This particular 
> question is probably more of a users type of question (since we're not 
> talking about the internals of Open MPI itself), so I'll reply just on the 
> users list.
>
> For what it's worth, I'm unable to replicate your error:
>
>
> $ mpirun --version
>
> mpirun (Open MPI) 5.0.0rc9
>
>
> Report bugs to https://www.open-mpi.org/community/help/
>
> $ cat hostfile
>
> mpi002 slots=1
>
> mpi005 slots=1
>
> $ mpirun -n 2 --machinefile hostfile hostname
>
> mpi002
>
> mpi005
>
> Can you try running with "--mca rmaps_base_verbose 100" so that we can get 
> some debugging output and see why the slots aren't working for you?  Show the 
> full output, like I did above (e.g., cat the hostfile, and then mpirun with 
> the MCA param and all the output).  Thanks!
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> ________________________________
> From: devel <devel-boun...@lists.open-mpi.org> on behalf of mrlong via devel
> <de...@lists.open-mpi.org>
> Sent: Monday, November 7, 2022 3:37 AM
> To: de...@lists.open-mpi.org <de...@lists.open-mpi.org>; Open MPI Users
> <users@lists.open-mpi.org>
> Cc: mrlong <mrlong...@gmail.com>
> Subject: [OMPI devel] There are not enough slots available in the system to 
> satisfy the 2, slots that were requested by the application
>
>
> Two machines, each with 64 cores. The contents of the hosts file are:
>
> 192.168.180.48 slots=1
> 192.168.60.203 slots=1
>
> Why do I get the following error when running with Open MPI 5.0.0rc9?
>
> (py3.9) [user@machine01 share]$ mpirun -n 2 --machinefile hosts hostname
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 2
> slots that were requested by the application:
>
>   hostname
>
> Either request fewer procs for your application, or make more slots
> available for use.
>
> A "slot" is the PRRTE term for an allocatable unit where we can
> launch a process.  The number of slots available are defined by the
> environment in which PRRTE processes are run:
>
>   1. Hostfile, via "slots=N" clauses (N defaults to number of
>      processor cores if not provided)
>   2. The --host command line parameter, via a ":N" suffix on the
>      hostname (N defaults to 1 if not provided)
>   3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
>   4. If none of a hostfile, the --host command line parameter, or an
>      RM is present, PRRTE defaults to the number of processor cores
>
> In all the above cases, if you want PRRTE to default to the number
> of hardware threads instead of the number of processor cores, use the
> --use-hwthread-cpus option.
>
> Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
> number of available slots when deciding the number of processes to
> launch.
>
> ------------------------------
>
> Message: 2
> Date: Mon, 14 Nov 2022 18:04:06 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> <jsquy...@cisco.com>
> To: "users@lists.open-mpi.org" <users@lists.open-mpi.org> 
> <users@lists.open-mpi.org> <users@lists.open-mpi.org>
> Cc: arun c <arun.edar...@gmail.com> <arun.edar...@gmail.com>
> Subject: Re: [OMPI users] Tracing of openmpi internal functions
> Message-ID:
>       <bl0pr11mb2980b144bc115f202701558dc0...@bl0pr11mb2980.namprd11.prod.outlook.com>
> Content-Type: text/plain; charset="us-ascii"
>
> Open MPI uses plug-in modules for its implementations of the MPI collective 
> algorithms.  From that perspective, once you understand that infrastructure, 
> it's exactly the same regardless of whether the MPI job is using intra-node 
> or inter-node collectives.
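>
> For example (a sketch; ompi_info ships with Open MPI, the exact component
> list depends on your build, and ./my_alltoall_test is a placeholder for
> your own program), you can list the collective components and watch the
> selection logic at run time:
>
>   # list the coll components compiled into this installation
>   ompi_info | grep coll
>
>   # show the coll framework's component selection while the job runs
>   mpirun -n 2 --mca coll_base_verbose 100 ./my_alltoall_test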
>
> We don't have much in the way of detailed internal function call tracing 
> inside Open MPI itself, due to performance considerations.  You might want to 
> look into flamegraphs, or something similar...?
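>
> A minimal sketch of that approach, assuming Linux perf plus the
> stackcollapse-perf.pl and flamegraph.pl scripts from Brendan Gregg's
> FlameGraph repository (./my_alltoall_test is again a placeholder):
>
>   # sample call stacks of mpirun and its child ranks on one node
>   perf record -g -- mpirun -n 4 ./my_alltoall_test
>
>   # fold the stacks and render an interactive SVG flame graph
>   perf script | stackcollapse-perf.pl | flamegraph.pl > mpi.svg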
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> ________________________________
> From: users <users-boun...@lists.open-mpi.org> on behalf of arun c via users
> <users@lists.open-mpi.org>
> Sent: Saturday, November 12, 2022 9:46 AM
> To: users@lists.open-mpi.org <users@lists.open-mpi.org>
> Cc: arun c <arun.edar...@gmail.com>
> Subject: [OMPI users] Tracing of openmpi internal functions
>
> Hi All,
>
> I am new to Open MPI and trying to learn the internals (at the source
> code level) of data transfer during collective operations. At first, I
> will limit it to intra-node transfers (between CPU cores and sockets) to
> minimize the scope of learning.
>
> What are the best options (looking only for free and open methods) for
> tracing the Open MPI code? Say I want to execute an alltoall collective
> and trace all the function calls and event callbacks that happen inside
> libmpi.so on all the cores.
>
> The Linux kernel has something called ftrace; it gives a neat call graph
> of all the internal functions inside the kernel, with timing. Is
> something similar available?
>
> --Arun
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> users mailing list
> us...@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
> ------------------------------
>
> End of users Digest, Vol 4818, Issue 1
> **************************************
>
>
