To address the determinism item: I know for a fact that training will not be
deterministic in some cases where the "parallel random" class is used across
parallel threads (e.g., OMP) and the number of cores differs. Even with the
same seed, each thread is seeded independently, so a different number of
threads ends up generating different random number sequences. The Dropout
operator is one example.
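A minimal sketch of that mechanism (plain NumPy rather than MXNet's actual
RNG internals; the per-thread seeding scheme here is purely illustrative):

    import numpy as np

    def parallel_draws(seed, num_threads, total_draws):
        # Illustrative "parallel random" scheme: each thread owns a
        # generator seeded from (seed, thread_id) and produces its own
        # slice of the output. Not MXNet's actual implementation.
        per_thread = total_draws // num_threads
        streams = [np.random.default_rng([seed, t])
                   for t in range(num_threads)]
        return np.concatenate([s.random(per_thread) for s in streams])

    a = parallel_draws(seed=42, num_threads=2, total_draws=8)
    b = parallel_draws(seed=42, num_threads=4, total_draws=8)
    # Same seed, different "core counts": the combined sequences differ,
    # so anything consuming them (e.g. a dropout mask) also differs.
    print(np.allclose(a, b))  # -> False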
On Mon, Nov 18, 2019 at 6:39 PM Alfredo Luque
<alfredo.lu...@airbnb.com.invalid> wrote:

> For AMD CPUs, you'd want to perform validation because MKL-DNN would now
> be enabled by default. Historically, other Intel libraries (along with
> the ICC compiler) have had performance issues on AMD CPUs. It's just
> worth double checking to make sure that's not the case here; perhaps
> some MKL-DNN authors can chime in. It's not sufficient to double check
> that an AVX2 package passes tests.
>
> Agreed, in the case that we're not releasing ARM binaries.
>
> The reproducibility argument is about the results being numerically
> reproducible. That is, e.g., if I train a model with some fixed set of
> data, some random seed, etc., and then run inference on it, do I get the
> exact same floating-point values for the weights and results? Does MXNet
> already offer this without MKL-DNN?
>
> On November 18, 2019 at 6:32:07 PM, Tao Lv (mutou...@gmail.com) wrote:
>
> Regarding the cases listed by Marco:
>
> - AMD CPU
> From my architecture knowledge, what works on C4 instances (with AVX2
> support) should also work well on m5a, right? I think the mxnet-mkl and
> mxnet-cuxxmkl packages have been fully validated on AVX2 machines.
> Also, we didn't perform any validation on AMD CPUs before; why do we
> need to do that this time?
>
> - ARM CPU
> I don't think we're releasing any convenience binaries for ARM CPUs.
> This proposal mainly targets the PyPI packages.
>
> - Windows
> Already validated by CI. We're also releasing mxnet-mkl packages for
> Windows.
>
> - GPU and MKLDNN enabled
> Already validated by CI, and mxnet-cuxxmkl packages have been released
> for several versions.
>
> - Fully reproducible results (medical and financial sector requested
> that and we have some flags for CUDA)
> Not sure I understand this case. We have had the MKL-DNN backend for a
> while now; its functionality and correctness have been verified by MXNet
> users.
>
> -tao
>
> On Tue, Nov 19, 2019 at 4:41 AM Marco de Abreu <marco.g.ab...@gmail.com>
> wrote:
>
> > Sorry, my intent with the "non-standard" phrase was not about general
> > MXNet but rather about MKLDNN's point of view: considering that it's
> > developed by Intel, I assumed that MKLDNN might consider non-Intel
> > use cases non-standard.
> >
> > -Marco
> >
> > Skalicky, Sam <sska...@amazon.com.invalid> schrieb am Mo., 18. Nov.
> > 2019, 21:34:
> >
> > > Thanks Alfredo, if you can create a GitHub issue with notes/steps we
> > > can add this to the to-do list for integrating with the MXNet CI to
> > > test on m5a instances too. Then we can start tracking this on a
> > > regular basis. It would be great to actually test on ARM instances
> > > now that AWS has A1 instances too… I'll add it to the wish list ;-D
> > >
> > > Sam
> > >
> > > > On Nov 18, 2019, at 12:32 PM, Alfredo Luque
> > > > <alfredo.lu...@airbnb.com.INVALID> wrote:
> > > >
> > > > Happy to run some benchmarks on an AWS m5a instance (Epyc) and a
> > > > first-generation AMD Threadripper if someone has something easy to
> > > > run and representative.
> > > >
> > > > On November 18, 2019 at 12:29:31 PM, Skalicky, Sam (
> > > > sska...@amazon.com.invalid) wrote:
> > > >
> > > > Thanks, good idea Alfredo. Are you able to help test on AMD CPUs?
> > > > Or is there someone else in the MXNet dev@ community who can help?
> > > >
> > > > Sam
> > > >
> > > > > On Nov 18, 2019, at 12:27 PM, Alfredo Luque
> > > > > <alfredo.lu...@airbnb.com.INVALID> wrote:
> > > > >
> > > > > Verifying that there isn't a slowdown on AMD CPUs (e.g., Ryzen /
> > > > > Epyc) would definitely make sense as a requirement. It seems odd
> > > > > to classify that as a "non-standard" use case.
> > > > >
> > > > > On November 18, 2019 at 12:20:33 PM, Skalicky, Sam (
> > > > > sska...@amazon.com.invalid) wrote:
> > > > >
> > > > > Thanks Patric & team for your work over the years to make MXNet
> > > > > fast with MKLDNN!
> > > > >
> > > > > I think it would be great to make MKLDNN enabled by default. We
> > > > > will need to continue producing variants without MKLDNN for those
> > > > > who don't want it (Marco enumerated some use cases). How do you
> > > > > propose to identify the pip wheels with/without MKLDNN?
> > > > > Previously we had mxnet-mkl and mxnet-cu101mkl with MKLDNN. If
> > > > > the plain "mxnet" pip wheel now contains MKLDNN, what do you
> > > > > propose we call the build without MKLDNN? mxnet-nomkl?
> > > > >
> > > > > Thanks!
> > > > > Sam
> > > > >
> > > > > > On Nov 18, 2019, at 11:08 AM, Marco de Abreu
> > > > > > <marco.g.ab...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Patric,
> > > > > >
> > > > > > First of all, thanks a lot to you and your team for all the
> > > > > > effort on MXNet and mkldnn!
> > > > > >
> > > > > > Generally I'm inclined towards your proposal, but I'm thinking
> > > > > > about the non-standard use cases:
> > > > > > - AMD CPU
> > > > > > - ARM CPU
> > > > > > - Windows
> > > > > > - GPU and MKLDNN enabled
> > > > > > - Fully reproducible results (the medical and financial sectors
> > > > > > requested that, and we have some flags for CUDA)
> > > > > >
> > > > > > Is mkldnn fully compatible with these use cases? If not, what
> > > > > > would happen? If yes, do we have performance numbers?
> > > > > >
> > > > > > Best regards,
> > > > > > Marco
> > > > > >
> > > > > > Zhao, Patric <patric.z...@intel.com> schrieb am Mo., 18. Nov.
> > > > > > 2019, 14:00:
> > > > > >
> > > > > > > Hi MXNet community,
> > > > > > >
> > > > > > > Since the first MKLDNN backend was integrated in release 1.2,
> > > > > > > the community has continuously improved the quality and
> > > > > > > performance of the MKLDNN CPU backend. Nowadays, the MKLDNN
> > > > > > > backend is widely used for inference, especially INT8
> > > > > > > inference, and we have received lots of very positive
> > > > > > > feedback from MXNet users.
> > > > > > >
> > > > > > > Milestones achieved so far:
> > > > > > >
> > > > > > > - MKLDNN integrated into Apache MXNet in release 1.2, Feb
> > > > > > > 2018 [1]
> > > > > > > - MKLDNN as the default CPU backend when building from
> > > > > > > source, Jan 2019 [2]
> > > > > > > - MKLDNN subgraph optimization enabled by default for
> > > > > > > inference, Jul 2019 [3]
> > > > > > > - MKLDNN major version upgrade in release 1.6, Oct 2019 [4]
> > > > > > >
> > > > > > > To strengthen Apache MXNet's technical leadership in the
> > > > > > > industry, I propose to make MKLDNN the default CPU backend in
> > > > > > > all binary distributions from the next release.
> > > > > > > The new milestone includes:
> > > > > > >
> > > > > > > - Statically link the MKLDNN library into the binary to avoid
> > > > > > > version mismatches at runtime [5]
> > > > > > > - Make MKLDNN the default in nightly builds from master
> > > > > > > before the 1.7 release
> > > > > > > - Distribute binaries with MKLDNN by default from the 1.7
> > > > > > > release
> > > > > > >
> > > > > > > What will be changed:
> > > > > > >
> > > > > > > - The mxnet and mxnet-cuXX binaries will be built with
> > > > > > > MKLDNN=1
> > > > > > > - mxnet-mkl and mxnet-cuXXmkl will not be changed in the
> > > > > > > minor releases (1.x) and are planned for removal in the next
> > > > > > > major release (2.0)
> > > > > > >
> > > > > > > Suggestions and comments are highly appreciated.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > --Patric
> > > > > > >
> > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/9677
> > > > > > > [2] https://lists.apache.org/thread.html/bfeae6ee46374112eb4dff1470c262959101e4bffb19930926963535@%3Cdev.mxnet.apache.org%3E
> > > > > > > [3] https://github.com/apache/incubator-mxnet/pull/15518
> > > > > > > [4] https://lists.apache.org/thread.html/f46ab920f18795496eafe713e6e9e561c684e06189085cec17b401dc@%3Cdev.mxnet.apache.org%3E
> > > > > > > [5] https://github.com/apache/incubator-mxnet/pull/16731
> > > > >
> > > > > —
> > > > > Alfredo Luque
> > > > > Software Engineer
> > > > > Machine Learning Infrastructure
> > > > > Airbnb
> > > > > San Francisco, CA
> > > >
> > > > —
> > > > Alfredo Luque
> > > > Software Engineer
> > > > Machine Learning Infrastructure
> > > > Airbnb
> > > > San Francisco, CA
>
> —
> Alfredo Luque
> Software Engineer
> Machine Learning Infrastructure
> Airbnb
> San Francisco, CA
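Alfredo's reproducibility question is easy to state as a concrete check:
train twice under fixed seeds and compare the resulting weights bitwise. A
toy Gluon sketch of that check (the model, data, and hyperparameters are
placeholders, and per the note at the top of this thread a fixed thread
count is also required for the comparison to be meaningful):

    import mxnet as mx
    import numpy as np
    from mxnet import autograd, gluon

    def train_once(seed):
        # Fix the seed sources we control; a toy stand-in for a real job.
        mx.random.seed(seed)
        np.random.seed(seed)
        net = gluon.nn.Dense(1)
        net.initialize(mx.init.Xavier())
        trainer = gluon.Trainer(net.collect_params(), 'sgd',
                                {'learning_rate': 0.1})
        loss_fn = gluon.loss.L2Loss()
        x = mx.nd.random.uniform(shape=(64, 8))
        y = mx.nd.random.uniform(shape=(64, 1))
        for _ in range(10):
            with autograd.record():
                loss = loss_fn(net(x), y)
            loss.backward()
            trainer.step(64)
        return net.weight.data().asnumpy()

    w1, w2 = train_once(0), train_once(0)
    # Bitwise equality, not np.allclose: "numerically reproducible"
    # means every float is identical across runs.
    print('reproducible:', np.array_equal(w1, w2))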
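On the packaging question, users can also tell at runtime whether the wheel
they installed was built with MKLDNN: recent 1.x releases expose the
compile-time feature flags through mxnet.runtime (assuming a build new
enough to ship that module):

    import mxnet as mx
    from mxnet.runtime import Features

    # Compile-time feature flags baked into the installed binary.
    features = Features()
    print('MKLDNN:', features.is_enabled('MKLDNN'))
    print('CUDA:  ', features.is_enabled('CUDA'))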