I tweaked your scripts a bunch so that I could run a bunch of different variations on my cluster.
I have lots of jobs queued up (I have 29 nodes in my cluster -- 3 have died over time); they'll take a bunch of time to execute.

  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
3204131   jenkins alltoall jsquyres PD  0:00     8 (Resources)
3204132   jenkins alltoall jsquyres PD  0:00     8 (Resources)
3204133   jenkins  barrier jsquyres PD  0:00     8 (Resources)
3204134   jenkins    bcast jsquyres PD  0:00     8 (Resources)
3204135   jenkins   gather jsquyres PD  0:00     8 (Resources)
3204136   jenkins   reduce jsquyres PD  0:00     8 (Resources)
3204137   jenkins reduce_s jsquyres PD  0:00     8 (Resources)
3204138   jenkins reduce_s jsquyres PD  0:00     8 (Resources)
3204139   jenkins  scatter jsquyres PD  0:00     8 (Resources)
3204140   jenkins allgathe jsquyres PD  0:00     8 (Resources)
3204141   jenkins allgathe jsquyres PD  0:00     8 (Resources)
3204142   jenkins allreduc jsquyres PD  0:00     8 (Resources)
3204143   jenkins alltoall jsquyres PD  0:00     8 (Resources)
3204144   jenkins alltoall jsquyres PD  0:00     8 (Resources)
3204145   jenkins  barrier jsquyres PD  0:00     8 (Resources)
3204146   jenkins    bcast jsquyres PD  0:00     8 (Resources)
3204147   jenkins   gather jsquyres PD  0:00     8 (Resources)
3204148   jenkins   reduce jsquyres PD  0:00     8 (Resources)
3204149   jenkins reduce_s jsquyres PD  0:00     8 (Resources)
3204150   jenkins reduce_s jsquyres PD  0:00     8 (Resources)
3204151   jenkins  scatter jsquyres PD  0:00     8 (Resources)
3204152   jenkins allgathe jsquyres PD  0:00    16 (Resources)
3204153   jenkins allgathe jsquyres PD  0:00    16 (Resources)
3204154   jenkins allreduc jsquyres PD  0:00    16 (Resources)
3204155   jenkins alltoall jsquyres PD  0:00    16 (Resources)
3204156   jenkins alltoall jsquyres PD  0:00    16 (Resources)
3204157   jenkins  barrier jsquyres PD  0:00    16 (Resources)
3204158   jenkins    bcast jsquyres PD  0:00    16 (Resources)
3204159   jenkins   gather jsquyres PD  0:00    16 (Resources)
3204160   jenkins   reduce jsquyres PD  0:00    16 (Resources)
3204161   jenkins reduce_s jsquyres PD  0:00    16 (Resources)
3204162   jenkins reduce_s jsquyres PD  0:00    16 (Resources)
3204163   jenkins  scatter jsquyres PD  0:00    16 (Resources)
3204164   jenkins allgathe jsquyres PD  0:00    16 (Resources)
3204165   jenkins allgathe jsquyres PD  0:00    16 (Resources)
3204166   jenkins allreduc jsquyres PD  0:00    16 (Resources)
3204167   jenkins alltoall jsquyres PD  0:00    16 (Resources)
3204168   jenkins alltoall jsquyres PD  0:00    16 (Resources)
3204169   jenkins  barrier jsquyres PD  0:00    16 (Resources)
3204170   jenkins    bcast jsquyres PD  0:00    16 (Resources)
3204171   jenkins   gather jsquyres PD  0:00    16 (Resources)
3204172   jenkins   reduce jsquyres PD  0:00    16 (Resources)
3204173   jenkins reduce_s jsquyres PD  0:00    16 (Resources)
3204174   jenkins reduce_s jsquyres PD  0:00    16 (Resources)
3204175   jenkins  scatter jsquyres PD  0:00    16 (Resources)
3204176   jenkins allgathe jsquyres PD  0:00    16 (Resources)
3204177   jenkins allgathe jsquyres PD  0:00    16 (Resources)
3204178   jenkins allreduc jsquyres PD  0:00    16 (Resources)
3204179   jenkins alltoall jsquyres PD  0:00    16 (Resources)
3204180   jenkins alltoall jsquyres PD  0:00    16 (Resources)
3204181   jenkins  barrier jsquyres PD  0:00    16 (Resources)
3204182   jenkins    bcast jsquyres PD  0:00    16 (Resources)
3204183   jenkins   gather jsquyres PD  0:00    16 (Resources)
3204184   jenkins   reduce jsquyres PD  0:00    16 (Resources)
3204185   jenkins reduce_s jsquyres PD  0:00    16 (Resources)
3204186   jenkins reduce_s jsquyres PD  0:00    16 (Resources)
3204187   jenkins  scatter jsquyres PD  0:00    16 (Resources)
3204188   jenkins allgathe jsquyres PD  0:00    29 (Resources)
3204189   jenkins allgathe jsquyres PD  0:00    29 (Resources)
3204190   jenkins allreduc jsquyres PD  0:00    29 (Resources)
3204191   jenkins alltoall jsquyres PD  0:00    29 (Resources)
3204192   jenkins alltoall jsquyres PD  0:00    29 (Resources)
3204193   jenkins  barrier jsquyres PD  0:00    29 (Resources)
3204194   jenkins    bcast jsquyres PD  0:00    29 (Resources)
3204195   jenkins   gather jsquyres PD  0:00    29 (Resources)
3204196   jenkins   reduce jsquyres PD  0:00    29 (Resources)
3204197   jenkins reduce_s jsquyres PD  0:00    29 (Resources)
3204198   jenkins reduce_s jsquyres PD  0:00    29 (Resources)
3204199   jenkins  scatter jsquyres PD  0:00    29 (Resources)
3204200   jenkins allgathe jsquyres PD  0:00    29 (Resources)
3204201   jenkins allgathe jsquyres PD  0:00    29 (Resources)
3204202   jenkins allreduc jsquyres PD  0:00    29 (Resources)
3204203   jenkins alltoall jsquyres PD  0:00    29 (Resources)
3204204   jenkins alltoall jsquyres PD  0:00    29 (Resources)
3204205   jenkins  barrier jsquyres PD  0:00    29 (Resources)
3204206   jenkins    bcast jsquyres PD  0:00    29 (Resources)
3204207   jenkins   gather jsquyres PD  0:00    29 (Resources)
3204208   jenkins   reduce jsquyres PD  0:00    29 (Resources)
3204209   jenkins reduce_s jsquyres PD  0:00    29 (Resources)
3204210   jenkins reduce_s jsquyres PD  0:00    29 (Resources)
3204211   jenkins  scatter jsquyres PD  0:00    29 (Resources)
3204212   jenkins allgathe jsquyres PD  0:00    29 (Resources)
3204213   jenkins allgathe jsquyres PD  0:00    29 (Resources)
3204214   jenkins allreduc jsquyres PD  0:00    29 (Resources)
3204215   jenkins alltoall jsquyres PD  0:00    29 (Resources)
3204216   jenkins alltoall jsquyres PD  0:00    29 (Resources)
3204217   jenkins  barrier jsquyres PD  0:00    29 (Resources)
3204218   jenkins    bcast jsquyres PD  0:00    29 (Resources)
3204219   jenkins   gather jsquyres PD  0:00    29 (Resources)
3204220   jenkins   reduce jsquyres PD  0:00    29 (Resources)
3204221   jenkins reduce_s jsquyres PD  0:00    29 (Resources)
3204222   jenkins reduce_s jsquyres PD  0:00    29 (Resources)
3204223   jenkins  scatter jsquyres PD  0:00    29 (Resources)
3204128   jenkins allgathe jsquyres  R  5:10     8 mpi[004-011]
3204129   jenkins allgathe jsquyres  R  5:10     8 mpi[016-023]
3204130   jenkins allreduc jsquyres  R  5:10     8 mpi[024-031]

> On Apr 13, 2020, at 6:35 PM, Zhang, William via devel
> <devel@lists.open-mpi.org> wrote:
>
> Hello all,
>
> I have created a --with-slurm option when running (See updated README). In
> order to set new defaults for collective algorithms, we will need data from
> those who wish to provide it. We have created the following package that
> allows for collecting data:
> https://github.com/open-mpi/ompi-collectives-tuning
>
> Please run the package as soon as possible. Details on how to run are in the
> README.md. If data collection fails, the output of the analyze script (either
> analyze.sh.o* for SGE or the output of ./run_and_analyze if using slurm) will
> report "Error parsing <filename>. Data format doesn't match. Exiting..".
> Please make sure data collection succeeds and a decision file is written
> entirely.
>
> Please provide me with either the output directory or if it’s inconvenient to
> share this data, provide me a list of optimal switchover points at different
> message sizes for each algorithm (This can be in the form of the
> output/decision.file which only contains switchover points and no specific
> performance numbers)
>
> Thanks,
> William Zhang

--
Jeff Squyres
jsquy...@cisco.com