Re: [apache/incubator-mxnet] [RFC] Custom Operator Part 2 (#17006)

2019-12-08 Thread Ziyi Mu
Need to include a fix for the test error 
https://github.com/apache/incubator-mxnet/pull/15921#pullrequestreview-328686634

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17006#issuecomment-563085461

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Chris Olivier
please answer the questions in my last email regarding the suspected issue
in mxnet as well as on that PR you opened.


On Sun, Dec 8, 2019 at 7:00 PM Lausen, Leonard 
wrote:

> The assertion failure in the MXNet DEBUG build goes away by updating LLVM
> OpenMP
> to the latest released version. All evidence I have points to the assertion
> failure being due to a bug in the 2 years old UNRELEASED version of LLVM
> OpenMP.
> that we are using currently in CMake builds.
>
> Thus I'm requesting 3 commiters to approve
> https://github.com/apache/incubator-mxnet/pull/17012 to update to a
> released
> version of LLVM OpenMP.
>
> As described in the PR, the assertion is still part of LLVM OpenMP 9.0
> codebase.
> In particular look at lines
>
> https://github.com/llvm-mirror/openmp/blob/release_90/runtime/src/kmp_runtime.cpp#L6616
> and
>
> https://github.com/llvm-mirror/openmp/blob/37c72127e90360a020f351f18d9cccfc30e5145a/runtime/src/kmp_runtime.cpp#L6481
> where the latter is the line that currently fails in MXNet DEBUG build and
> the
> former is the equivalent line that doesn't fail in MXNet DEBUG builds after
> updating LLVM OpenMP.
>
>
>
> There is also a crash with Intel OpenMP as well as both the old UNRELEASED
> and
> the new, released version LLVM OpenMP that happens after forking. That
> crash
> doesn't go away and needs to be root-caused
> https://github.com/apache/incubator-mxnet/issues/14979
>
>
> On Sun, 2019-12-08 at 16:27 -0800, Pedro Larroy wrote:
> > Hi Leonard.
> >
> > Are you saying that you have updated this library and the problems
> desribed
> > in the related tickets are no longer present?
> >
> > P.
> >
> > On Sunday, December 8, 2019, Lausen, Leonard 
> > wrote:
> > > Thanks Pedro and Chris for your responses.
> > >
> > > After further investigation I find:
> > >
> > > 1) I don't think
> https://github.com/apache/incubator-mxnet/issues/14979 is
> > > caused by any incompatibility between gomp and llvm / intel omp. Rather
> > it's
> > > simply a problem of llvm / intel omp. See my comment to the issue for
> the
> > > methodology to arrive at this claim.
> > >
> > > 2) Regarding the assertion failure when compiling with (llvm)
> > 3rdparty/openmp,
> > > it can be fixed by updating the by now 2 years old llvm openmp code to
> the
> > > newest released version. I went ahead and opened a PR
> > > https://github.com/apache/incubator-mxnet/pull/17012
> > >
> > > Based on the investigation described in 1), I think Chris is right that
> > the
> > > assertion failure is not due to some interaction between gomp and llvm
> > omp.
> > > However, I'm not sure about Chris's suggestion that the assertion fa
>
> > > ilure
> > is due
> > > to a bug in MXNet. In fact, the failure goes away when updating the
> llvm
> > openmp
> > > code. So I think it's just due to a bug in the 2 years old code.
> > >
> > > @Chris, I think updating 3rdparty/openmp to fix the assertion issue is
> not
> > > contentious. Thus let's do it via lazy consensus (72 hours) or just
> > approve the
> > > PR and merge it.
> > >
> > > Please also take a look at my comment at #14979 and let everyone know
> if
> > you see
> > > any option to fix the bug while keeping 3rdparty/openmp. As this bug
> > affects an
> > > important use-case, I beleive we need to remove 3rdparty/openmp from
> the
> > CMake
> > > build as long as we don't find a solution for making #14979 work with
> > > 3rdparty/openmp.
> > >
> > > In fact, removing 3rdparty/openmp will then match the current Makefile
> > setup
> > > that according to my understanding is used to build the nightly
> releases
> > used by
> > > the majority of developers. Ie. most users actually don't use the CMake
> > build
> > > with 3rdparty/openmp. You can consider rescinding your veto on removing
> > > 3rdparty/openmp after reading through the evidence in that issue. If
> you
> > don't
> > > provide any evidence for why the methodology/conclusion in #14979 is
> > flawed, I
> > > will assume your previous veto is void based on Apache Voting rule as
> it
> > lacks
> > > technical justification and in any case was motivated by the assertion
> > issue,
> > > which I agree with you, is likely not due to gomp / omp interaction.
> > >
> > > Thank you
> > > Leonard
> > >
> > >
> > > On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
> > > > Stop disseminating false information:
> > > >
> > > > https://github.com/apache/incubator-mxnet/issues/14979
> > > >
> > > >
> > > > On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
> > wrote:
> > > > > -1
> > > > >
> > > > > mkldnn removed omp5 for licencing issues
> > > > > no bugs have actually been traced to the use of llvm openmp. only
> an
> > assert
> > > > > caused by an actual bug in mxnet code. there are suitable
> workarounds.
> > > > >
> > > > > over time llvm omp has simply been used as a “catch all” for random
> > > > > problems that aren’t related at all (such as getenv race condition
> in
> > an
> > > > > atfork call that isn’t even part of an omp 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Lausen, Leonard
Thanks Chris for the elaboration.

> What the assert in question is trying to say is that mxnet code is calling
> into omp library after a fork, but before the omp library’s atfork()
> handler is called, so the omp library has not yet initialized a new team if
> threads.  This looks to be the case in one of the call stacks on that
> issue. This is problematic for any openmp library which supports omp after
> a fork, and may not be deterministic from build to build, since the order
> of static init calls for a given module is undefined (i think mxnet is
> initializing omp during static init, but this may not matter).

I disagree. The assert fails without any forking. You can trigger it by running
`python3 -c 'import mxnet'` a few times.

As described in https://github.com/apache/incubator-mxnet/pull/17012 the 
previously failing assertion is still part of LLVM OpenMP 9.0 codebase. In 
particular you can compare line 
https://github.com/llvm-mirror/openmp/blob/release_90/runtime/src/kmp_runtime.cpp#L6616
to 
https://github.com/llvm-mirror/openmp/blob/37c72127e90360a020f351f18d9cccfc30e5145a/runtime/src/kmp_runtime.cpp#L6481
where the latter is the line that currently fails in MXNet.

I would like to reiterate, we are currently using a random and UNRELEASED LLVM
OpenMP version from 2017. There is no evidence that the assertion failure is due
to a bug in MXNet, but rather there is strong evidence that it is due to a bug
in the previously used LLVM version. (See the latter part of this email
regarding your suggestion about bug in MXNet fork handling).

> It stands to reason that linking this or that library may affect the assert
> occurring because it’s not known at what time one of the dependent
> libraries initializes omp (thus causing it to hook its atfork handler), so
> it is not surprising that mucking with dependencies may cause the assert to
> occur or not occur.

With respect to the Assert failure, there is no difference in "linking this or
that library". We are only updating the version of LLVM OpenMP codebase. See 
https://github.com/apache/incubator-mxnet/pull/17012

Thus your veto on updating the shipped LLVM OpenMP lacks technical justification
and is void. I'm requesting 3 commiters to approve the PR to go ahead with the
update.

> So if mxnet is doing that, it is a bug and remains a problem regardless of
> the omp library and probably should be fixed.  llvm omp happens to be nice
> enough to tell you you’re doing something wrong, at least when built in
> debug mode.

Yes, we need to investigate if there is a bug. Let's do this based on 
https://github.com/apache/incubator-mxnet/issues/14979 which is a "real" crash
that happens both with the old unreleased and the new 9.0 LLVM OpenMP versions.

> Once this issue is resolved, we can discuss the library inclusion itself.
> My objection is “fixing” what appears to be a bug by effectively
> “commenting out the assert” which is what i stated in the very beginning.

This is a reasonable approach, as long as you are willing to help fix it. As
made evident by your comments, you have more experience with advanced OpenMP
libraries. Thus let's work together on fixing the root cause, assuming it's
indeed a bug in MXNet.

If we run into any roadblock with finding a root cause in MXNet, the only
alternative I see is to remove the 3rdparty/openmp library, as we can not rule
out the possibility of an LLVM OpenMP bug either.

> Is there another explanation for the call stack with the assert?  Can this
> bug be ruled out?

Yes. The other explanation is that it's due to a bug in the old LLVM OpenMP
version (see above).
The stack trace you refer to is
https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-505162890
relies on mkl and is hard to reproduce.

With respect to establishing and fixing the bug in MXNet, let's focus on the
crash described in https://github.com/apache/incubator-mxnet/issues/14979 which
is happening with Intel OpenMP, old unreleased LLVM OpenMP and the current LLVM
OpenMP 9.0 release. It involves forking and thus seems quite related to the
first paragraph of your email compared to the assertion failure which happens
without any forking.

Best regards
Leonard

On Sun, 2019-12-08 at 07:58 -0800, Chris Olivier wrote:
> btw the call stack I am referring to below is the one where I explained
> this problem before and after I got a hostile response, I locked the issue.
> 
> On Sun, Dec 8, 2019 at 7:24 AM Chris Olivier  wrote:
> 
> > Again, here is what I suspect the bug is in mxnet:
> > 
> > The way that advanced openmp libraries handle a fork is that they hook an
> > atfork() callback in which, in the new process, it creates a new “team” of
> > threads to use for its thread pool (since all of the thread handles in its
> > data structure belong to the previous process). atfork() callback order is
> > the order at which the callbacks are registered, which will tend to be the
> > first call to the openmp library.  For this reason, the 

Nightly releases on Jenkins CD

2019-12-08 Thread Lausen, Leonard
I agree with Marco that we should build as part of the Jenkins CD. Thanks Sam
and everyone involved to get an intermediate proprietary solution set-up, that
can ease developers lives.

When setting up the Jenkins CD, I suggest we use CMake to build the releases,
given the recent progress to identify and get rid of the OpenMP bugs in the
CMake pipeline. Then we can finally drop the Makefile setup which will lower our
maintenance effort.

Would anyone have concerns about that (72 hour lazy consensus)?

Best regards
Leonard

On Sat, 2019-12-07 at 19:08 +, Skalicky, Sam wrote:
> Hi Marco,
> 
> I believe this CodeBuild solution is a stop-gap until the Jenkins CD project
> that Per and Sheng have been driving is finished. There were some failing
> builds with the Jenkins CD that were preventing some nightly builds from being
> available. The long term goal is to get back to the Jenkins CD so that the
> community can access & maintain. Remember the builds for MXNet to PyPI were
> originally done in a private travis CI, not available to the community. Thanks
> to Sheng for donating the code, the Jenkins CD is the long term plan.
> 
> The Jenkins CI is still testing MXNet, nothing changed there.
> 
> If someone wants to work on adding support to the Jenkins CD [2] to publish
> nightly builds into an S3 bucket, please go ahead and take the lead.
> 
> Sam
> 
> [1] http://jenkins.mxnet-ci.amazon-ml.com/job/restricted-mxnet-cd/
> 
> On Dec 7, 2019, at 10:04 AM, Marco de Abreu  marco.g.ab...@gmail.com>> wrote:
> 
> Could you elaborate how a non-Amazonian is able to access, maintain and
> review the CodeBuild pipeline? How come we've diverted from the community
> agreed-on standard where the public Jenkins serves for the purpose of
> testing and releasing MXNet? I'd be curious about the issues you're
> encountering with Jenkins CI that led to a non-standard solution.
> 
> -Marco
> 
> 
> Skalicky, Sam mailto:sska...@amazon.com.invalid>>
> schrieb am Sa., 7. Dez. 2019,
> 18:39:
> 
> Hi MXNet Community,
> 
> We have been working on getting nightly builds fixed and made available
> again. We’ve made another system using AWS CodeBuild & S3 to work around
> the problems with Jenkins CI, PyPI, etc. It is currently building all the
> flavors and publishing to an S3 bucket here:
> 
> https://us-west-2.console.aws.amazon.com/s3/buckets/apache-mxnet/dist/?region=us-west-2
> 
> There are folders for each set of nightly builds, try out the wheels
> starting today 2019-12-07. Builds start at 1:30am PT (9:30am GMT) and
> arrive in the bucket 30min-2hours later. Inside each folder are the wheels
> for each flavor of MXNet. Currently we’re only building for linux, builds
> for windows/Mac will come later.
> 
> If you want to download the wheels easily you can use a URL in the form of:
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/
> /dist/-1.6.0b-py2.py3-none-
> manylinux1_x86_64.whl
> 
> Heres a set of links for today’s builds
> 
> (Plain mxnet, no mkl no cuda)
> 
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> (mxnet-mkl
> <
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl(mxnet-mkl
> >
> )
> 
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> (mxnet-cuXXX
> <
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl(mxnet-cuXXX
> >
> )
> 
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> 
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> 
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> 
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> (mxnet-cuXXXmkl
> <
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl(mxnet-cuXXXmkl
> >
> )
> 
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> 
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> 
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> 
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> 
> You can easily install these pip wheels in your system either by
> downloading them to your machine first and then installing 

Re: Stopping nightly releases to Pypi

2019-12-08 Thread Lausen, Leonard
From Shanghai, the closest endpoint (automatically chosen endpoint) is in Tokyo
and download speed for mxnet-mkl was on average 1.7 MB/s with a maximum of 5
MB/s during my test.

On Sun, 2019-12-08 at 01:30 +, Sheng Zha wrote:
> > Heres a set of links for today’s builds
> > 
> > (Plain mxnet, no mkl no cuda)
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > (mxnet-mkl)
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > (mxnet-cuXXX)
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > (mxnet-cuXXXmkl)
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> 
> These links are not utilizing the s3 accelerate feature (i.e. not backed by
> cloudfront edges). Please use repo.mxnet.io instead. The updated links are:
> (Plain mxnet, no mkl no cuda)
> https://repo.mxnet.io/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> (mxnet-mkl)
> https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> (mxnet-cuXXX)
> https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu90-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu92-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu100-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu101-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> (mxnet-cuXXXmkl)
> https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu90mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu92mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu100mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu101mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> 
> When updating the installation doc we should use repo.mxnet.io domain name
> too.
> 
> Best,
> -sz
> 
> On 2019/12/07 17:39:40, "Skalicky, Sam"  wrote: 
> > Hi MXNet Community,
> > 
> > We have been working on getting nightly builds fixed and made available
> > again. We’ve made another system using AWS CodeBuild & S3 to work around the
> > problems with Jenkins CI, PyPI, etc. It is currently building all the
> > flavors and publishing to an S3 bucket here:
> > https://us-west-2.console.aws.amazon.com/s3/buckets/apache-mxnet/dist/?region=us-west-2
> > 
> > There are folders for each set of nightly builds, try out the wheels
> > starting today 2019-12-07. Builds start at 1:30am PT (9:30am GMT) and arrive
> > in the bucket 30min-2hours later. Inside each folder are the wheels for each
> > flavor of MXNet. Currently we’re only building for linux, builds for
> > windows/Mac will come later.
> > 
> > If you want to download the wheels easily you can use a URL in the form of:
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist//dist/-1.6.0b-py2.py3-none-manylinux1_x86_64.whl
> > 
> > Heres a set of links for today’s builds
> > 
> > (Plain mxnet, no mkl no cuda)
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > (mxnet-mkl)
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > (mxnet-cuXXX)
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
Great investigation thank you. I have to agree with your analysis and for
helping resolving this long standing issue.

This will not repair the damage made to the community of losing 3-4
valuable contributors. Introducing a library that causes bugs then blocking
changes and locking gh issues which attempt to remove or workaround the
issues in addition to making rude comments and worse things that are better
left out is still not acceptable and begs for an apology from Chris.

P.




On Sunday, December 8, 2019, Lausen, Leonard 
wrote:
> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that
the
> assertion failure is not due to some interaction between gomp and llvm
omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
affects an
> important use-case, I beleive we need to remove 3rdparty/openmp from the
CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
setup
> that according to my understanding is used to build the nightly releases
used by
> the majority of developers. Ie. most users actually don't use the CMake
build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
don't
> provide any evidence for why the methodology/conclusion in #14979 is
flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
lacks
> technical justification and in any case was motivated by the assertion
issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>> Stop disseminating false information:
>>
>> https://github.com/apache/incubator-mxnet/issues/14979
>>
>>
>> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
wrote:
>>
>> > -1
>> >
>> > mkldnn removed omp5 for licencing issues
>> > no bugs have actually been traced to the use of llvm openmp. only an
assert
>> > caused by an actual bug in mxnet code. there are suitable workarounds.
>> >
>> > over time llvm omp has simply been used as a “catch all” for random
>> > problems that aren’t related at all (such as getenv race condition in
an
>> > atfork call that isn’t even part of an omp parallel region).
>> >
>> > proposal is now and has always been roughly equivalent to the idea of
>> > “comment out an assert rather than fix the bug it’s reporting”.
>> >
>> > Up until very recently, Makefile version of mxnet used libomp5 for
YEARS
>> > and not libgomp, with no issue reported (omp not built in debug mode),
so
>> > the equivalent configuration from CMake mysteriously causing myriads if
>> > problems has questionable merit and smells more like a hubris
situation.
>> >
>> > I use tensorflow as well and it links to libomp5 rather than libgomp.
>> >
>> > if the assert problem is really a problem, the bug being reported
would be
>> > prioritized and fixed. it should be fixed regardless. all the time
spent by
>> > some CI people trying to remove this could have simply fixed the
actual bug
>> > in a small fraction of the time.
>> >
>> >
>> > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard

>> > wrote:
>> >
>> > > I think it's reasonable to assume that the Intel MKLDNN team is an
>> > > "authorative"
>> > > source about the issue of compilation with OpenMP and the OpenMP
runtime
>> > > library
>> > > related issues. Thus I suggest we follow the recommendation of Intel
>> > > MKLDNN team
>> > > within the MXNet project.
>> > >
>> > > Looking through the Intel MKLDNN documentation, I find [1]:
>> > >
>> > > > DNNL uses OpenMP runtime library provided by the compiler.
>> > >
>> > > as well as
>> > >
>> > > > it's important to ensure that only one OpenMP runtime is used
>> > throughout
>> > > the
>> > > > 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
Hi Leonard.

Are you saying that you have updated this library and the problems desribed
in the related tickets are no longer present?

P.

On Sunday, December 8, 2019, Lausen, Leonard 
wrote:
> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that
the
> assertion failure is not due to some interaction between gomp and llvm
omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
affects an
> important use-case, I beleive we need to remove 3rdparty/openmp from the
CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
setup
> that according to my understanding is used to build the nightly releases
used by
> the majority of developers. Ie. most users actually don't use the CMake
build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
don't
> provide any evidence for why the methodology/conclusion in #14979 is
flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
lacks
> technical justification and in any case was motivated by the assertion
issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>> Stop disseminating false information:
>>
>> https://github.com/apache/incubator-mxnet/issues/14979
>>
>>
>> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
wrote:
>>
>> > -1
>> >
>> > mkldnn removed omp5 for licencing issues
>> > no bugs have actually been traced to the use of llvm openmp. only an
assert
>> > caused by an actual bug in mxnet code. there are suitable workarounds.
>> >
>> > over time llvm omp has simply been used as a “catch all” for random
>> > problems that aren’t related at all (such as getenv race condition in
an
>> > atfork call that isn’t even part of an omp parallel region).
>> >
>> > proposal is now and has always been roughly equivalent to the idea of
>> > “comment out an assert rather than fix the bug it’s reporting”.
>> >
>> > Up until very recently, Makefile version of mxnet used libomp5 for
YEARS
>> > and not libgomp, with no issue reported (omp not built in debug mode),
so
>> > the equivalent configuration from CMake mysteriously causing myriads if
>> > problems has questionable merit and smells more like a hubris
situation.
>> >
>> > I use tensorflow as well and it links to libomp5 rather than libgomp.
>> >
>> > if the assert problem is really a problem, the bug being reported
would be
>> > prioritized and fixed. it should be fixed regardless. all the time
spent by
>> > some CI people trying to remove this could have simply fixed the
actual bug
>> > in a small fraction of the time.
>> >
>> >
>> > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard

>> > wrote:
>> >
>> > > I think it's reasonable to assume that the Intel MKLDNN team is an
>> > > "authorative"
>> > > source about the issue of compilation with OpenMP and the OpenMP
runtime
>> > > library
>> > > related issues. Thus I suggest we follow the recommendation of Intel
>> > > MKLDNN team
>> > > within the MXNet project.
>> > >
>> > > Looking through the Intel MKLDNN documentation, I find [1]:
>> > >
>> > > > DNNL uses OpenMP runtime library provided by the compiler.
>> > >
>> > > as well as
>> > >
>> > > > it's important to ensure that only one OpenMP runtime is used
>> > throughout
>> > > the
>> > > > application. Having more than one OpenMP runtime linked to an
>> > executable
>> > > may
>> > > > lead to undefined behavior including incorrect results or crashes.
>> > >
>> > > To keep our project maintainable and error free, I thus suggest we
follow
>> > > DNNL
>> > > and use the OpenMP runtime library provided by the compiler.
>> > > We have 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
This is actually useful information, thanks.

Still I don't see a justificqtion for vetoing being able to choose the
library at compile time. Fixing the issue you reasonably describe and being
able to choose are two orthogonal topics.


Thanks for the constructive information.

On Sunday, December 8, 2019, Chris Olivier  wrote:
> Again, here is what I suspect the bug is in mxnet:
>
> The way that advanced openmp libraries handle a fork is that they hook an
> atfork() callback in which, in the new process, it creates a new “team” of
> threads to use for its thread pool (since all of the thread handles in its
> data structure belong to the previous process). atfork() callback order is
> the order at which the callbacks are registered, which will tend to be the
> first call to the openmp library.  For this reason, the fork order will
> vary depending upon what other libraries might be linked in and whether
> they make omp calls before mxnet starts its static init.
>
> What the assert in question is trying to say is that mxnet code is calling
> into omp library after a fork, but before the omp library’s atfork()
> handler is called, so the omp library has not yet initialized a new team
if
> threads.  This looks to be the case in one of the call stacks on that
> issue. This is problematic for any openmp library which supports omp after
> a fork, and may not be deterministic from build to build, since the order
> of static init calls for a given module is undefined (i think mxnet is
> initializing omp during static init, but this may not matter).
>
> So if mxnet is doing that, it is a bug and remains a problem regardless of
> the omp library and probably should be fixed.  llvm omp happens to be nice
> enough to tell you you’re doing something wrong, at least when built in
> debug mode.
>
> Once this issue is resolved, we can discuss the library inclusion itself.
> My objection is “fixing” what appears to be a bug by effectively
> “commenting out the assert” which is what i stated in the very beginning.
>
> It stands to reason that linking this or that library may affect the
assert
> occurring because it’s not known at what time one of the dependent
> libraries initializes omp (thus causing it to hook its atfork handler), so
> it is not surprising that mucking with dependencies may cause the assert
to
> occur or not occur.
>
> Is there another explanation for the call stack with the assert?  Can this
> bug be ruled out?
>
>
> Here is an example of the atfork team concept with libgomp as well.
> Probably you can check the current libgomp code itself but this explains
> the code:
> https://patchwork.ozlabs.org/patch/319827/
>
>
>
>
>
>
>
>
> On Sun, Dec 8, 2019 at 2:21 AM Lausen, Leonard 
> wrote:
>
>> Thanks Pedro and Chris for your responses.
>>
>> After further investigation I find:
>>
>> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979
is
>> caused by any incompatibility between gomp and llvm / intel omp. Rather
>> it's
>> simply a problem of llvm / intel omp. See my comment to the issue for the
>> methodology to arrive at this claim.
>>
>> 2) Regarding the assertion failure when compiling with (llvm)
>> 3rdparty/openmp,
>> it can be fixed by updating the by now 2 years old llvm openmp code to
the
>> newest released version. I went ahead and opened a PR
>> https://github.com/apache/incubator-mxnet/pull/17012
>>
>> Based on the investigation described in 1), I think Chris is right that
the
>> assertion failure is not due to some interaction between gomp and llvm
omp.
>> However, I'm not sure about Chris's suggestion that the assertion failure
>> is due
>> to a bug in MXNet. In fact, the failure goes away when updating the llvm
>> openmp
>> code. So I think it's just due to a bug in the 2 years old code.
>>
>> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is
not
>> contentious. Thus let's do it via lazy consensus (72 hours) or just
>> approve the
>> PR and merge it.
>>
>> Please also take a look at my comment at #14979 and let everyone know if
>> you see
>> any option to fix the bug while keeping 3rdparty/openmp. As this bug
>> affects an
>> important use-case, I beleive we need to remove 3rdparty/openmp from the
>> CMake
>> build as long as we don't find a solution for making #14979 work with
>> 3rdparty/openmp.
>>
>> In fact, removing 3rdparty/openmp will then match the current Makefile
>> setup
>> that according to my understanding is used to build the nightly releases
>> used by
>> the majority of developers. Ie. most users actually don't use the CMake
>> build
>> with 3rdparty/openmp. You can consider rescinding your veto on removing
>> 3rdparty/openmp after reading through the evidence in that issue. If you
>> don't
>> provide any evidence for why the methodology/conclusion in #14979 is
>> flawed, I
>> will assume your previous veto is void based on Apache Voting rule as it
>> lacks
>> technical justification and in any case was motivated by the assertion
>> 

Re: Custom C++ Operators

2019-12-08 Thread Skalicky, Sam
Thanks Ciyong,

Absolutely! Heres how a backward function is registered [1] and here’s an 
example backward function for GEMM [2]. We’ll be working on documentation and a 
blog post/tutorial soon, hopefully that will  help clarify things as well.

Keep the questions coming!

Thanks,
Sam

[1] 
https://github.com/apache/incubator-mxnet/blob/master/example/extensions/lib_custom_op/gemm_lib.cc#L171
[2] 
https://github.com/apache/incubator-mxnet/blob/master/example/extensions/lib_custom_op/gemm_lib.cc#L90-L116

On Dec 8, 2019, at 6:48 AM, Chen, Ciyong 
mailto:ciyong.c...@intel.com>> wrote:

Really great features, it will provide a more convenient way for deployment.
BTW, does it support backward ops too?

-Ciyong

-Original Message-
From: Marco de Abreu mailto:marco.g.ab...@gmail.com>>
Sent: Sunday, December 8, 2019 2:56 AM
To: dev@mxnet.incubator.apache.org
Subject: Re: Custom C++ Operators

Awesome project, love it! It really seems easy to use, great job!

-Marco

Skalicky, Sam mailto:sska...@amazon.com.invalid>> 
schrieb am Sa., 7. Dez. 2019,
19:50:

Hi MXNet Community,

We have been working on adding support for custom C++ operators for a
while and are happy to announce that the initial functionality is now
available for you to try out in the master branch!

CustomOp support in MXNet began with allowing users to write custom
operators in Python and has been available for years. If you wanted to
write a high-performance C++ operator you had to do it by adding it to
the MXNet source code, recompiling a custom version of MXNet, and
distributing that custom build. The Custom C++ operator support
enhances this by enabling users to write high-performance C++
operators and compile them separately from MXNet. This frees up users
from having to recompile MXNet from source and makes it easier to add custom 
operators to suit their needs.

Heres a few pointers to get started:
1. Check out the overview in the cwiki [1] 2. Check out the PR [2] 3.
You can try this out using the new nightly builds that are available
in
S3 [3]
4. Leave feedback on features to add or things to fix in a followup PR
here [4]

Credit goes to everyone involved (in no particular order) Manu Seth
Sheng Zha Jackie Wu Junru Shao Ziyi Mu

Special thanks to all the PR reviewers!

Thanks!
Sam


[1]
https://cwiki.apache.org/confluence/display/MXNET/Dynamic+CustomOp+Sup
port [2] https://github.com/apache/incubator-mxnet/pull/15921
[3]
https://lists.apache.org/thread.html/0a22e10b290b4ad322ed50024d778c373
6b0a772811caea317790732%40%3Cdev.mxnet.apache.org%3E
<
https://lists.apache.org/thread.html/0a22e10b290b4ad322ed50024d778c373
6b0a772811caea317790732@
>
[4] https://github.com/apache/incubator-mxnet/issues/17006




Re: Stopping nightly releases to Pypi

2019-12-08 Thread Skalicky, Sam
Thanks Sheng,

Looks like 12/8 builds are working as expected too:

(Plain mxnet, no mkl no cuda)
https://repo.mxnet.io/dist/2019-12-08/dist/mxnet-1.6.0b20191208-py2.py3-none-manylinux1_x86_64.whl
(mxnet-mkl)
https://repo.mxnet.io/dist/2019-12-08/dist/mxnet_mkl-1.6.0b20191208-py2.py3-none-manylinux1_x86_64.whl
(mxnet-cuXXX)
https://repo.mxnet.io/dist/2019-12-08/dist/mxnet_cu90-1.6.0b20191208-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-08/dist/mxnet_cu92-1.6.0b20191208-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-08/dist/mxnet_cu100-1.6.0b20191208-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-08/dist/mxnet_cu101-1.6.0b20191208-py2.py3-none-manylinux1_x86_64.whl
(mxnet-cuXXXmkl)
https://repo.mxnet.io/dist/2019-12-08/dist/mxnet_cu90mkl-1.6.0b20191208-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-08/dist/mxnet_cu92mkl-1.6.0b20191208-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-08/dist/mxnet_cu100mkl-1.6.0b20191208-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-08/dist/mxnet_cu101mkl-1.6.0b20191208-py2.py3-none-manylinux1_x86_64.whl

I’ll try and get this updated on the installation docs page tomorrow.

Sam

On Dec 7, 2019, at 5:30 PM, Sheng Zha 
mailto:zhash...@apache.org>> wrote:

Heres a set of links for today’s builds

(Plain mxnet, no mkl no cuda)
https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
(mxnet-mkl)
https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
(mxnet-cuXXX)
https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
(mxnet-cuXXXmkl)
https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl

These links are not utilizing the s3 accelerate feature (i.e. not backed by 
cloudfront edges). Please use repo.mxnet.io instead. The 
updated links are:
(Plain mxnet, no mkl no cuda)
https://repo.mxnet.io/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
(mxnet-mkl)
https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
(mxnet-cuXXX)
https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu90-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu92-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu100-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu101-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
(mxnet-cuXXXmkl)
https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu90mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu92mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu100mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
https://repo.mxnet.io/dist/2019-12-07/dist/mxnet_cu101mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl

When updating the installation doc we should use 
repo.mxnet.io domain name too.

Best,
-sz

On 2019/12/07 17:39:40, "Skalicky, Sam" 
mailto:sska...@amazon.com.INVALID>> wrote:
Hi MXNet Community,

We have been working on getting nightly builds fixed and made available again. 
We’ve made another system using AWS CodeBuild & S3 to work around the problems 
with Jenkins CI, PyPI, etc. It is currently building all the flavors and 
publishing to an S3 bucket here:
https://us-west-2.console.aws.amazon.com/s3/buckets/apache-mxnet/dist/?region=us-west-2

There are folders for each set of nightly builds, try out the wheels starting 
today 2019-12-07. Builds start at 1:30am PT (9:30am GMT) and arrive in the 
bucket 30min-2hours later. Inside each folder are the wheels for each flavor of 
MXNet. Currently we’re only building for linux, builds for windows/Mac will 
come later.

If you want to download 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Chris Olivier
btw the call stack I am referring to below is the one where I explained
this problem before and after I got a hostile response, I locked the issue.

On Sun, Dec 8, 2019 at 7:24 AM Chris Olivier  wrote:

> Again, here is what I suspect the bug is in mxnet:
>
> The way that advanced openmp libraries handle a fork is that they hook an
> atfork() callback in which, in the new process, it creates a new “team” of
> threads to use for its thread pool (since all of the thread handles in its
> data structure belong to the previous process). atfork() callback order is
> the order at which the callbacks are registered, which will tend to be the
> first call to the openmp library.  For this reason, the fork order will
> vary depending upon what other libraries might be linked in and whether
> they make omp calls before mxnet starts its static init.
>
> What the assert in question is trying to say is that mxnet code is calling
> into omp library after a fork, but before the omp library’s atfork()
> handler is called, so the omp library has not yet initialized a new team if
> threads.  This looks to be the case in one of the call stacks on that
> issue. This is problematic for any openmp library which supports omp after
> a fork, and may not be deterministic from build to build, since the order
> of static init calls for a given module is undefined (i think mxnet is
> initializing omp during static init, but this may not matter).
>
> So if mxnet is doing that, it is a bug and remains a problem regardless of
> the omp library and probably should be fixed.  llvm omp happens to be nice
> enough to tell you you’re doing something wrong, at least when built in
> debug mode.
>
> Once this issue is resolved, we can discuss the library inclusion itself.
> My objection is “fixing” what appears to be a bug by effectively
> “commenting out the assert” which is what i stated in the very beginning.
>
> It stands to reason that linking this or that library may affect the
> assert occurring because it’s not known at what time one of the dependent
> libraries initializes omp (thus causing it to hook its atfork handler), so
> it is not surprising that mucking with dependencies may cause the assert to
> occur or not occur.
>
> Is there another explanation for the call stack with the assert?  Can this
> bug be ruled out?
>
>
> Here is an example of the atfork team concept with libgomp as well.
> Probably you can check the current libgomp code itself but this explains
> the code:
> https://patchwork.ozlabs.org/patch/319827/
>
>
>
>
>
>
>
>
> On Sun, Dec 8, 2019 at 2:21 AM Lausen, Leonard 
> wrote:
>
>> Thanks Pedro and Chris for your responses.
>>
>> After further investigation I find:
>>
>> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979
>> is
>> caused by any incompatibility between gomp and llvm / intel omp. Rather
>> it's
>> simply a problem of llvm / intel omp. See my comment to the issue for the
>> methodology to arrive at this claim.
>>
>> 2) Regarding the assertion failure when compiling with (llvm)
>> 3rdparty/openmp,
>> it can be fixed by updating the by now 2 years old llvm openmp code to the
>> newest released version. I went ahead and opened a PR
>> https://github.com/apache/incubator-mxnet/pull/17012
>>
>> Based on the investigation described in 1), I think Chris is right that
>> the
>> assertion failure is not due to some interaction between gomp and llvm
>> omp.
>> However, I'm not sure about Chris's suggestion that the assertion failure
>> is due
>> to a bug in MXNet. In fact, the failure goes away when updating the llvm
>> openmp
>> code. So I think it's just due to a bug in the 2 years old code.
>>
>> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
>> contentious. Thus let's do it via lazy consensus (72 hours) or just
>> approve the
>> PR and merge it.
>>
>> Please also take a look at my comment at #14979 and let everyone know if
>> you see
>> any option to fix the bug while keeping 3rdparty/openmp. As this bug
>> affects an
>> important use-case, I beleive we need to remove 3rdparty/openmp from the
>> CMake
>> build as long as we don't find a solution for making #14979 work with
>> 3rdparty/openmp.
>>
>> In fact, removing 3rdparty/openmp will then match the current Makefile
>> setup
>> that according to my understanding is used to build the nightly releases
>> used by
>> the majority of developers. Ie. most users actually don't use the CMake
>> build
>> with 3rdparty/openmp. You can consider rescinding your veto on removing
>> 3rdparty/openmp after reading through the evidence in that issue. If you
>> don't
>> provide any evidence for why the methodology/conclusion in #14979 is
>> flawed, I
>> will assume your previous veto is void based on Apache Voting rule as it
>> lacks
>> technical justification and in any case was motivated by the assertion
>> issue,
>> which I agree with you, is likely not due to gomp / omp interaction.
>>
>> Thank you
>> Leonard
>>
>>
>> On 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Chris Olivier
Again, here is what I suspect the bug is in mxnet:

The way that advanced openmp libraries handle a fork is that they hook an
atfork() callback in which, in the new process, it creates a new “team” of
threads to use for its thread pool (since all of the thread handles in its
data structure belong to the previous process). atfork() callback order is
the order at which the callbacks are registered, which will tend to be the
first call to the openmp library.  For this reason, the fork order will
vary depending upon what other libraries might be linked in and whether
they make omp calls before mxnet starts its static init.

What the assert in question is trying to say is that mxnet code is calling
into omp library after a fork, but before the omp library’s atfork()
handler is called, so the omp library has not yet initialized a new team if
threads.  This looks to be the case in one of the call stacks on that
issue. This is problematic for any openmp library which supports omp after
a fork, and may not be deterministic from build to build, since the order
of static init calls for a given module is undefined (i think mxnet is
initializing omp during static init, but this may not matter).

So if mxnet is doing that, it is a bug and remains a problem regardless of
the omp library and probably should be fixed.  llvm omp happens to be nice
enough to tell you you’re doing something wrong, at least when built in
debug mode.

Once this issue is resolved, we can discuss the library inclusion itself.
My objection is “fixing” what appears to be a bug by effectively
“commenting out the assert” which is what i stated in the very beginning.

It stands to reason that linking this or that library may affect the assert
occurring because it’s not known at what time one of the dependent
libraries initializes omp (thus causing it to hook its atfork handler), so
it is not surprising that mucking with dependencies may cause the assert to
occur or not occur.

Is there another explanation for the call stack with the assert?  Can this
bug be ruled out?


Here is an example of the atfork team concept with libgomp as well.
Probably you can check the current libgomp code itself but this explains
the code:
https://patchwork.ozlabs.org/patch/319827/








On Sun, Dec 8, 2019 at 2:21 AM Lausen, Leonard 
wrote:

> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
> it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
> 3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that the
> assertion failure is not due to some interaction between gomp and llvm omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
> is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
> openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
> approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
> you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
> affects an
> important use-case, I beleive we need to remove 3rdparty/openmp from the
> CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
> setup
> that according to my understanding is used to build the nightly releases
> used by
> the majority of developers. Ie. most users actually don't use the CMake
> build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
> don't
> provide any evidence for why the methodology/conclusion in #14979 is
> flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
> lacks
> technical justification and in any case was motivated by the assertion
> issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
> > Stop disseminating false information:
> >
> > https://github.com/apache/incubator-mxnet/issues/14979
> >
> >
> > On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
> wrote:
> >
> > > -1
> > >
> > > mkldnn removed omp5 for licencing issues
> > > no bugs have actually been traced to the use of llvm openmp. only an
> 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Lausen, Leonard
Thanks Pedro and Chris for your responses.

After further investigation I find:

1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
caused by any incompatibility between gomp and llvm / intel omp. Rather it's
simply a problem of llvm / intel omp. See my comment to the issue for the
methodology to arrive at this claim.

2) Regarding the assertion failure when compiling with (llvm) 3rdparty/openmp,
it can be fixed by updating the by now 2 years old llvm openmp code to the
newest released version. I went ahead and opened a PR 
https://github.com/apache/incubator-mxnet/pull/17012

Based on the investigation described in 1), I think Chris is right that the
assertion failure is not due to some interaction between gomp and llvm omp.
However, I'm not sure about Chris's suggestion that the assertion failure is due
to a bug in MXNet. In fact, the failure goes away when updating the llvm openmp
code. So I think it's just due to a bug in the 2 years old code.

@Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
contentious. Thus let's do it via lazy consensus (72 hours) or just approve the
PR and merge it.

Please also take a look at my comment at #14979 and let everyone know if you see
any option to fix the bug while keeping 3rdparty/openmp. As this bug affects an
important use-case, I beleive we need to remove 3rdparty/openmp from the CMake
build as long as we don't find a solution for making #14979 work with
3rdparty/openmp.

In fact, removing 3rdparty/openmp will then match the current Makefile setup
that according to my understanding is used to build the nightly releases used by
the majority of developers. Ie. most users actually don't use the CMake build
with 3rdparty/openmp. You can consider rescinding your veto on removing
3rdparty/openmp after reading through the evidence in that issue. If you don't
provide any evidence for why the methodology/conclusion in #14979 is flawed, I
will assume your previous veto is void based on Apache Voting rule as it lacks
technical justification and in any case was motivated by the assertion issue,
which I agree with you, is likely not due to gomp / omp interaction.

Thank you
Leonard


On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
> Stop disseminating false information:
> 
> https://github.com/apache/incubator-mxnet/issues/14979
> 
> 
> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier  wrote:
> 
> > -1
> > 
> > mkldnn removed omp5 for licencing issues
> > no bugs have actually been traced to the use of llvm openmp. only an assert
> > caused by an actual bug in mxnet code. there are suitable workarounds.
> > 
> > over time llvm omp has simply been used as a “catch all” for random
> > problems that aren’t related at all (such as getenv race condition in an
> > atfork call that isn’t even part of an omp parallel region).
> > 
> > proposal is now and has always been roughly equivalent to the idea of
> > “comment out an assert rather than fix the bug it’s reporting”.
> > 
> > Up until very recently, Makefile version of mxnet used libomp5 for YEARS
> > and not libgomp, with no issue reported (omp not built in debug mode), so
> > the equivalent configuration from CMake mysteriously causing myriads if
> > problems has questionable merit and smells more like a hubris situation.
> > 
> > I use tensorflow as well and it links to libomp5 rather than libgomp.
> > 
> > if the assert problem is really a problem, the bug being reported would be
> > prioritized and fixed. it should be fixed regardless. all the time spent by
> > some CI people trying to remove this could have simply fixed the actual bug
> > in a small fraction of the time.
> > 
> > 
> > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard 
> > wrote:
> > 
> > > I think it's reasonable to assume that the Intel MKLDNN team is an
> > > "authorative"
> > > source about the issue of compilation with OpenMP and the OpenMP runtime
> > > library
> > > related issues. Thus I suggest we follow the recommendation of Intel
> > > MKLDNN team
> > > within the MXNet project.
> > > 
> > > Looking through the Intel MKLDNN documentation, I find [1]:
> > > 
> > > > DNNL uses OpenMP runtime library provided by the compiler.
> > > 
> > > as well as
> > > 
> > > > it's important to ensure that only one OpenMP runtime is used
> > throughout
> > > the
> > > > application. Having more than one OpenMP runtime linked to an
> > executable
> > > may
> > > > lead to undefined behavior including incorrect results or crashes.
> > > 
> > > To keep our project maintainable and error free, I thus suggest we follow
> > > DNNL
> > > and use the OpenMP runtime library provided by the compiler.
> > > We have limited ressources and finding the root cause for any bugs
> > > resulting
> > > from linking multiple OpenMP libraries as currently done is, in my
> > > opinion. not
> > > a good use of time. We know it's due to undefined behavior and we know
> > > it's best
> > > practice to use OpenMP runtime library provided by