Hey all, I'm also supportive of making MKLDNN the default build for MXNet, but a few questions asked in this thread were, as far as I can tell, never answered. It would be great if Alex and the others who worked on MKLDNN and are proposing it as the default could answer them clearly:

- What is the story when the CPU has no AVX instructions? Do we get an illegal-instruction error, or does it fall back gracefully? (asked by Kellen)
- Are there any outstanding issues when MKLDNN is enabled? (asked by Naveen)
- MKLDNN is a submodule dependency; are we pulling the latest commit or releases? If not, we should move to releases before we make it a default. (asked by Naveen)

There was also a discussion about the MKLDNN version used by MXNet; it would be great if it could be summarized.
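[Editor's aside: Kellen's first question above boils down to CPU feature detection — a binary built with AVX/AVX2/AVX512 code paths will die with an illegal instruction on CPUs lacking those flags unless it dispatches at runtime. As a minimal, hypothetical sketch (not MXNet code; the function name and sample strings are invented for illustration), this is how one might inspect the avx* flags a /proc/cpuinfo dump advertises:]

```python
# Hypothetical illustration only: extract the avx* feature flags from a
# /proc/cpuinfo-style text dump. Neither the function nor the sample
# strings come from MXNet or MKL-DNN.
def avx_flags(cpuinfo_text):
    """Return the set of avx* feature flags found in a /proc/cpuinfo dump."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # The 'flags' line looks like: "flags : fpu sse avx ..."
            return {f for f in line.split(":", 1)[1].split() if f.startswith("avx")}
    return set()

# Invented sample 'flags' lines: an older SSE-only CPU vs. an AVX-512 machine.
old_cpu = "flags : fpu sse sse2 ssse3 sse4_1 sse4_2"
avx512_cpu = "flags : fpu sse sse2 avx avx2 avx512f avx512cd"

print(avx_flags(old_cpu))     # set() -> non-JIT AVX code could crash here
print(avx_flags(avx512_cpu))
```

[Running the same check against the real /proc/cpuinfo of a target machine indicates whether an AVX-optimized build can be expected to run there at all.]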
Hagay

On Tue, Nov 27, 2018 at 6:21 PM Lv, Tao A <tao.a...@intel.com> wrote:

> Hi Anirudh, please find the statement from the MKL-DNN manager on the versioning mechanism of the MKL-DNN library below:
>
> "That's a valid request, and I would expect that as the software matures more and more applications will rely on stable versions. I would expect that for MXNet there is a stable branch that would rely on stable MKL-DNN and a development branch that would rely on master.
>
> MKL-DNN relies on semantic versioning. We do maintain release branches in addition to master that can be used to release patches. In particular, we are planning v0.17.1 this week to deliver a fix for the reorders that you requested. This works in the following way:
> * master contains the latest development (typically the next release)
> * rls-v0.17 contains v0.17 and will be used to create minor releases (v0.17.1 and so on)"
>
> I also restate my initial concern here: if the master (development) branch of MXNet relies on a bleeding-edge commit from the MKL-DNN master branch, then when MXNet comes to release we will need to revert many changes in MXNet if MKL-DNN does not have a new release at that time, since we will need to fall back to a previous release version of the dependency. That might mess up or slow down the development and release of MXNet. To avoid that, we would always need to negotiate the release pace with the MKL-DNN team before every release.
>
> If you have any other questions about MKL-DNN's versioning, feel free to let me know. If you want to change and re-define the dependency behavior of MKL-DNN, please propose a solution for my concern and start a vote on it.
> Thanks,
> -tao
>
> -----Original Message-----
> From: Anirudh Subramanian [mailto:anirudh2...@gmail.com]
> Sent: Wednesday, November 28, 2018 8:26 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: Include MKLDNN into default mxnet pip package
>
> Hi Tao,
>
> I was suggesting that we start using a release tag from MKLDNN for major and minor releases of MXNet, starting with 1.4.0. But this would require a versioning mechanism similar to semver for MKLDNN, and for MKLDNN to do patch releases to backport bug fixes and regressions. I don't know if this is going to happen anytime soon (it would be nice if you could obtain a timeline from the MKLDNN team on this). As long as PIP still has two different packages, with and without mkl, my vote is +1 for adding it as a default.
>
> Anirudh
>
> On Tue, Nov 27, 2018 at 5:04 AM Lv, Tao A <tao.a...@intel.com> wrote:
>
> > Hi Anirudh,
> >
> > Just to confirm: are you focusing on the 1.4.0 release of MXNet and wanting a release version of MKL-DNN there, or do you mean that all future development should be based on release versions of MKL-DNN? For the former, I think the 0.17 release of MKL-DNN is a good choice, but it will not have the fix for the LSTM regression mentioned in a previous email.
> >
> > I'm talking with the MKL-DNN maintainers about the versioning mechanism and will get back to you if I get any response. But from the release history of MKL-DNN, I cannot find any evidence of patch releases.
> >
> > -tao
> >
> > -----Original Message-----
> > From: Anirudh Subramanian [mailto:anirudh2...@gmail.com]
> > Sent: Tuesday, November 27, 2018 6:16 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: Include MKLDNN into default mxnet pip package
> >
> > Hi Tao,
> >
> > I agree with Steffen that we can start with a stable release of MKLDNN for 1.4.0. For your suggestion of using 0.17, can you provide info on what versioning mechanism MKLDNN uses?
> > Once a MKLDNN release is out and regressions are found, like the LSTM regression, would it be possible to do a patch release for it or to maintain a release branch for it?
> >
> > Anirudh
> >
> > On Sun, Nov 25, 2018 at 5:03 PM Lv, Tao A <tao.a...@intel.com> wrote:
> >
> > > Hi Steffen,
> > >
> > > I think all the commits on the MKL-DNN master branch are well tested by the MKL-DNN development team. If we really want to have a release commit in the coming 1.4 MXNet release, my suggestion is the 0.17 MKL-DNN release.
> > >
> > > Thank you,
> > > Tao
> > >
> > > Sent from my iPhone
> > >
> > > > On Nov 26, 2018, at 8:09 AM, Steffen Rochel <steffenroc...@gmail.com> wrote:
> > > >
> > > > +1 to make MKL-DNN default.
> > > > I'm tracking https://github.com/apache/incubator-mxnet/issues/13369 as an open issue to be addressed for 1.4.0. I do agree that we should move to a model of including released dependencies instead of just taking bleeding-edge snapshots. However, speed of development is important as well.
> > > > As a compromise for the 1.4.0 release with MKL-DNN: can the MKL-DNN development team provide us with a well-tested tag/commit id to include in the 1.4.0 release?
> > > > Steffen
> > > >
> > > >> On Wed, Nov 21, 2018 at 11:42 PM Lv, Tao A <tao.a...@intel.com> wrote:
> > > >>
> > > >> Thanks for the information, Kellen and Naveen.
> > > >>
> > > >> Unlike onnx-tensorrt, MKL-DNN already provides versioning and release tags. My concern is that, as MKL-DNN is still under intensive development, if it has a new feature or bug fix on its master branch, do we really want to wait for the next release to get it supported in MXNet?
> > > >> Take the LSTM regression as an example: MKL-DNN will probably provide a fix or improvement on its master branch soon; do we need to wait for the 0.18 release to get it fixed for MXNet users? AFAIK, TensorFlow also uses a plain commit id, not a release, as its dependency on MKL-DNN.
> > > >>
> > > >> Regarding the LSTM regression, we are using internal JIRA tickets rather than GitHub issues to track MKL-DNN defects. But I agree with you: we need to update its progress in Alex's issue.
> > > >>
> > > >> Thanks,
> > > >> -tao
> > > >>
> > > >> -----Original Message-----
> > > >> From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> > > >> Sent: Thursday, November 22, 2018 10:55 AM
> > > >> To: dev@mxnet.incubator.apache.org
> > > >> Subject: Re: Include MKLDNN into default mxnet pip package
> > > >>
> > > >> Agree with your point about other repos also not being based on versioning, Tao. I would point out that I've given similar feedback to some repos I've worked with: https://github.com/onnx/onnx-tensorrt/issues/68
> > > >>
> > > >>> On Wed, Nov 21, 2018 at 6:48 PM Naveen Swamy <mnnav...@gmail.com> wrote:
> > > >>>
> > > >>> Tao,
> > > >>>
> > > >>> You are right, there are many submodules in 3rdparty. We have to start somewhere, and I believe this one is a good candidate to start with. This is not to cater to MXNet releases or to tie them to the releases of the submodules, but to pick only stable releases instead of bleeding-edge commits from the tip of master. This gives us confidence in the submodules that MXNet users depend on, especially if we make MKLDNN the default.
> > > >>> Good to know it is already a known regression. Alex has created this issue: https://github.com/apache/incubator-mxnet/issues/13369. Please add details there and link the corresponding issue in MKLDNN (I couldn't find it).
> > > >>>
> > > >>> -Naveen
> > > >>>
> > > >>>> On Wed, Nov 21, 2018 at 6:04 PM Lv, Tao A <tao.a...@intel.com> wrote:
> > > >>>>
> > > >>>> Here are my answers to the questions from Kellen and Naveen about MKL-DNN. This doesn't mean I'm supportive of making MKL-DNN the default here.
> > > >>>>
> > > >>>> @Kellen,
> > > >>>>
> > > >>>> FYI, here is the list of platforms officially supported by MKL-DNN: https://github.com/intel/mkl-dnn#system-requirements
> > > >>>>
> > > >>>> Most of the computation-intensive kernels in MKL-DNN are JITed, so they generate code according to the platform at runtime. Non-JIT code in MKL-DNN, like other code in MXNet, generates instructions according to the compiler options/flags. We can set -DARCH_OPT_FLAGS when building MKL-DNN to avoid optimizing for the machine it is compiled on; that's exactly what we are doing for the MKL-DNN build in MXNet. Even without MKL-DNN, I noticed there were illegal-instruction issues in MXNet when users imported the pip package on a lower-end machine that probably only supports SSE.
> > > >>>>
> > > >>>> @Naveen,
> > > >>>>
> > > >>>> The LSTM issue has already been identified as a regression from a recent version of MKL-DNN. Hopefully it will be fixed soon with a new update of MKL-DNN.
> > > >>>>
> > > >>>> MXNet has many submodule dependencies under the 3rdparty folder.
> > > >>>> It seems we don't require release versions for most of these dependencies. The release cadences of MKL-DNN and MXNet don't match very well, and I think it would be a risk for an MXNet release to depend strictly on the release of one submodule, let alone on the releases of all submodules.
> > > >>>>
> > > >>>> -tao
> > > >>>>
> > > >>>> -----Original Message-----
> > > >>>> From: Naveen Swamy [mailto:mnnav...@gmail.com]
> > > >>>> Sent: Thursday, November 22, 2018 9:08 AM
> > > >>>> To: dev@mxnet.incubator.apache.org
> > > >>>> Cc: d...@mxnet.apache.org
> > > >>>> Subject: Re: Include MKLDNN into default mxnet pip package
> > > >>>>
> > > >>>> Hi Alex,
> > > >>>>
> > > >>>> Thanks for promptly running the numbers on AMD and reporting them here. Can you please update the AMD numbers, for posterity, at https://cwiki.apache.org/confluence/display/MXNET/MXNet+with+Intel+MKL-DNN+-+Performance+Benchmarking ?
> > > >>>>
> > > >>>> Are there any outstanding issues when MKLDNN is enabled? From my offline conversations I am briefly aware of performance issues with LSTM; is there a GitHub issue for it?
> > > >>>>
> > > >>>> MKLDNN is a submodule dependency; are we pulling the latest commit or releases? If not, we should move to releases before we make it a default. Ideally we should use platform-specific distributions (-dev packages); at the least we should rely on well-tested releases.
> > > >>>>
> > > >>>> Thanks, Naveen
> > > >>>>
> > > >>>> On Wed, Nov 21, 2018 at 4:55 PM Zai, Alexander <alex...@amazon.com.invalid> wrote:
> > > >>>>
> > > >>>>> AMD benchmarks have been published.
> > > >>>>> We are seeing a 15.8x speedup with Resnet50 (batch size 32) on AWS's new m5a.24xlarge machine. With a smaller network (Mobilenet, batch size 32) the speedup is more significant, at 38.7x. Let's have a vote to see if the PR to have MKLDNN enabled by default (https://github.com/apache/incubator-mxnet/pull/12591) can be merged before the 1.4.0 release.
> > > >>>>>
> > > >>>>> On 10/19/18, 9:17 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> > > >>>>>
> > > >>>>> I did pip install mxnet-mkl==1.3.1b20181018 on an AMD Ryzen 1950X and unit tests are passing.
> > > >>>>>
> > > >>>>> Is this build using AVX512? In /proc/cpuinfo I see only the "avx" flag; there's no "avx2" like on recent Intel CPUs.
> > > >>>>>
> > > >>>>> Pedro.
> > > >>>>>
> > > >>>>> On Fri, Oct 19, 2018 at 5:12 PM Hagay Lupesko <lupe...@gmail.com> wrote:
> > > >>>>>
> > > >>>>>> Awesome collaborative effort across many contributors and companies!
> > > >>>>>>
> > > >>>>>> The boost is impressive, and for MXNet users to get this boost "out of the box" is a great benefit and makes MXNet an even better choice.
> > > >>>>>>
> > > >>>>>> Alex - can you clarify whether there are any downsides with regard to non-AVX-512 architectures, AMD CPUs, etc.? Will it gracefully fall back?
> > > >>>>>> Hagay
> > > >>>>>>
> > > >>>>>> On Fri, Oct 19, 2018, 15:46 Sergio Fernández <wik...@apache.org> wrote:
> > > >>>>>>
> > > >>>>>>> If there is no downside on platforms not supporting AVX512 instructions, then +1
> > > >>>>>>>
> > > >>>>>>> On Wed, Oct 17, 2018, 14:10 Alex Zai <aza...@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hey all,
> > > >>>>>>>> We have been working hard these past few months to integrate and stabilize Intel's MKLDNN deep learning CPU accelerator into MXNet and have made incredible progress. On CPUs with AVX512 instructions (such as c5.18x) we have seen performance increases of up to 12x, and on other platforms (Macs, AVX2) we have seen speedups of 1.5x+. The full list of benchmarks can be found here (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95650764 and https://github.com/apache/incubator-mxnet/pull/12591).
> > > >>>>>>>>
> > > >>>>>>>> Currently, using this accelerator requires the developer either to pip install the mxnet-mkl version of mxnet or to build it themselves from source. Given that we should try to provide the best performance "out of the box" with mxnet, we should include this in the default build. The mkldnn library is included within the pip package build, so it does not require an external dependency.
> > > >>>>>>>> There were concerns that MKLDNN could cause regressions on certain platforms (as it did with the TensorFlow version a while back), but we added an env flag (MXNET_MKLDNN_ENABLED) that allows users to turn off this feature at runtime. Please bring up any other concerns you may have, and your thoughts on including this accelerator in the default build.
> > > >>>>>>>>
> > > >>>>>>>> Best,
> > > >>>>>>>> Alex
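[Editor's aside: the runtime opt-out Alex describes, MXNET_MKLDNN_ENABLED, is read inside MXNet's C++ runtime; the sketch below is a hypothetical illustration of how such a boolean environment flag is typically interpreted, not MXNet's actual implementation.]

```python
# Hypothetical helper (not MXNet code): interpret an environment variable
# as a boolean feature flag that defaults to enabled.
import os

def env_flag_enabled(name, default=True):
    """Return False only if the variable is explicitly set to 0/false/off."""
    value = os.environ.get(name)
    if value is None:
        return default  # unset -> keep the feature on by default
    return value.strip().lower() not in ("0", "false", "off")

os.environ["MXNET_MKLDNN_ENABLED"] = "0"  # a user opting out at runtime
print(env_flag_enabled("MXNET_MKLDNN_ENABLED"))  # False
```

[The point of such a flag is that users who hit an MKLDNN-specific regression can disable the accelerator without rebuilding or reinstalling a different pip package.]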