Hi,
  Ray Sheppard here, wearing my SPEC hat.  We received an email from AMD that 
we are not sure how to deal with, so I thought I would pass it along in case 
anyone has relevant thoughts about it.  It looks like Jeff S. filed the issue 
they cite.  We are sort of fishing for a response to them, so any info is 
appreciated.  Thanks.
       Ray

Dear Support,
 
I am an engineer at AMD currently running the SPEC MPI2007 benchmarks, and we 
are experiencing issues with the 122.Tachyon benchmark when built and run with 
OpenMPI 5. Our goal is to run SPEC MPI2007 with OpenMPI 5 to minimize the 
overhead of MPI in our benchmarking.
 
In our usual configuration, running the benchmark on 256 ranks using OpenMPI 5 
with the cross-memory attach (CMA) fabric, the 122.Tachyon benchmark deadlocks. 
When running Tachyon with OpenMPI 4.1.8 and the UCX fabric, this issue does 
not occur.
 
On investigating further, we observe the following:

- With MPICH v4.3.0 the benchmark fails to run: MPICH detects an MPI error 
  because an 'MPI_Allgather()' call uses the same array as both the send and 
  receive buffer, which is disallowed by the MPI specification.
- After modifying the benchmark to correct the Allgather call:
  - MPICH runs to completion, then crashes at finalization.
  - OpenMPI still deadlocks.
- The deadlock is only observed when running on more than 35 ranks and is 
  present in multiple versions of OpenMPI (v5.0.5, v5.0.8).
- We discovered the OpenMPI issue while investigating 
  https://github.com/open-mpi/ompi/issues/12979, which may be relevant.
 
Is this a known issue with the 122.Tachyon benchmark, and are you able to help 
us run 122.Tachyon on OpenMPI 5?
 
Thank you in advance for your help. If you require any further information, 
please do not hesitate to reach out to me.
 
Thanks
James
