Hi Timothée,

Timothee Mathieu <[email protected]> writes:

> After further investigations, it seems that the problem was not from
> the simulator itself but with the fact that it simulates contact which
> are very sensitive to even a small difference in the input actions. I
> discovered that pytorch (and maybe other dependencies) has a
> reproducibility problem of order 1e-5 when on AVX512 compared to
> AVX2. I first tried to solve the problem by disabling AVX512 at the
> level of pytorch, but it did not work. The dev of pytorch said that it
> may be because some components dispatch computation to MKL-DNN, I
> tried to disable AVX512 on MKL, and still the results were not
> reproducible, I also tried to deactivate in openmpi without success.
> I finally concluded that there was a problem with AVX512 somewhere in
> the dependencies graph but I gave up identifying where, as this seems
> very complicated.

Oh, not fully satisfactory then.  :-)

> Instead, I found a tool https://github.com/twosigma/libvirtcpuid/
> which allows me to mask avx512 from the process and this worked! I was
> able to use it to modify glibc with a graft in the guix shell command
> to disable AVX512 in a guix shell command and get the exact same
> result on both AVX512 and non-AVX512 computers without much of an
> overhead (there is no vm, the only difference seems to be a slight
> acceleration when using AVX512 as expected).

Interesting, thanks!

Ludo’.
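
For readers wondering how the instruction set alone could change results: the thread never pinned down the actual cause, so the following is only a sketch of one common mechanism, not a diagnosis. Floating-point addition is not associative, and a vectorized sum typically keeps one partial accumulator per SIMD lane, so a different register width can mean a different summation order. The helper names (`f32`, `lane_sum`) and the toy 2-lane versus 4-lane setup are assumptions made up for illustration; they stand in for narrower and wider registers and do not model what pytorch or MKL-DNN really do.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lane_sum(values, lanes):
    """Sum values in float32 using `lanes` partial accumulators.

    Hypothetical model of a SIMD reduction: element i is added to
    accumulator i % lanes, then the accumulators are summed in order.
    """
    acc = [0.0] * lanes
    for i, v in enumerate(values):
        acc[i % lanes] = f32(acc[i % lanes] + v)
    total = 0.0
    for a in acc:
        total = f32(total + a)
    return total


# Same inputs, two lane counts (toy stand-ins for narrow vs. wide registers).
values = [1e8, 1.0, -1e8, 1.0]
narrow = lane_sum(values, 2)
wide = lane_sum(values, 4)
print(narrow, wide)  # prints "2.0 1.0"
```

With two lanes the large terms cancel inside one accumulator before the small ones are added, while with four lanes a small term is absorbed by 1e8 and lost. A simulator with contact dynamics can amplify far smaller discrepancies than this one, which would be consistent with masking AVX512 for the whole process making the runs agree.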
