RE: v1.4.0 status 11/29

2018-11-29 Thread Lv, Tao A
Credit belongs to Alex.

Hi Alex, would you mind porting your fix to the v1.4.x branch?

Thanks,
-Tao

-----Original Message-----
From: Steffen Rochel [mailto:steffenroc...@gmail.com] 
Sent: Friday, November 30, 2018 12:48 PM
To: dev@mxnet.incubator.apache.org
Subject: Re: v1.4.0 status 11/29

Hi Tao - thanks for fixing the crash. Please create PR on v1.4.x branch with 
[v1.4.x] in title and add me to the PR.
Steffen

On Thu, Nov 29, 2018 at 8:44 PM Lv, Tao A  wrote:

> Hi Steffen, I would like to have
> https://github.com/apache/incubator-mxnet/pull/13433  into the coming
> 1.4.0 release. It fixed a crash of deconvolution with certain input 
> size for MKL-DNN backend. This PR is well reviewed and already merged 
> into the master branch. New test case is also included there.
>
> Please find the corresponding issue here:
> https://github.com/apache/incubator-mxnet/issues/13421 .
>
> Thanks,
> -Tao
>
> -----Original Message-----
> From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> Sent: Friday, November 30, 2018 12:05 PM
> To: dev@mxnet.incubator.apache.org
> Subject: v1.4.0 status 11/29
>
> Dear MXNet community -
> I would like to provide update on v1.4.0 status, details will be 
> tracked here < 
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> >
> .
>
> 1. Sergey created v1.4.x branch
> 2. As expected, additional requests have been made for inclusion in 
> v1.4.0 release. Critical PR are tracked here < 
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> >
> .
> 3. PR to update README.md is blocked by flaky test failures, 
> retriggered check.
> 4. PR to upgrade version on master to v1.5.0 has been submitted.
> 5. CI is setup and first run passed.
>
> Note: if you want to add selected fixes or enhancements, please reply 
> to this email. Please provide justification, add me as approver to the 
> v1.4.x PR and make sure your changes have tests included in PR and get 
> properly reviewed.
>
> Regards,
> Steffen
>


Re: v1.4.0 status 11/29

2018-11-29 Thread Steffen Rochel
Hi Tao - thanks for fixing the crash. Please create PR on v1.4.x branch
with [v1.4.x] in title and add me to the PR.
Steffen

On Thu, Nov 29, 2018 at 8:44 PM Lv, Tao A  wrote:

> Hi Steffen, I would like to have
> https://github.com/apache/incubator-mxnet/pull/13433  into the coming
> 1.4.0 release. It fixed a crash of deconvolution with certain input size
> for MKL-DNN backend. This PR is well reviewed and already merged into the
> master branch. New test case is also included there.
>
> Please find the corresponding issue here:
> https://github.com/apache/incubator-mxnet/issues/13421 .
>
> Thanks,
> -Tao
>
> -----Original Message-----
> From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> Sent: Friday, November 30, 2018 12:05 PM
> To: dev@mxnet.incubator.apache.org
> Subject: v1.4.0 status 11/29
>
> Dear MXNet community -
> I would like to provide update on v1.4.0 status, details will be tracked
> here <
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> >
> .
>
> 1. Sergey created v1.4.x branch
> 2. As expected, additional requests have been made for inclusion in v1.4.0
> release. Critical PR are tracked here <
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> >
> .
> 3. PR to update README.md is blocked by flaky test failures, retriggered
> check.
> 4. PR to upgrade version on master to v1.5.0 has been submitted.
> 5. CI is setup and first run passed.
>
> Note: if you want to add selected fixes or enhancements, please reply to
> this email. Please provide justification, add me as approver to the v1.4.x
> PR and make sure your changes have tests included in PR and get properly
> reviewed.
>
> Regards,
> Steffen
>


Re: Apache Infra tickets for MXNet

2018-11-29 Thread Steffen Rochel
Thanks Marco. Cwiki seems a good place to document the policy.
Steffen

On Thu, Nov 29, 2018 at 8:06 PM Marco de Abreu
 wrote:

> Hello everyone,
>
> I have just had a nice conversation with Greg Stein, VP of Apache Infra,
> about the topic of creating tickets against Apache Infra.
>
> In the past, we had the restriction that only IPMC members (speak, mentors)
> were allowed to file tickets against Apache Infra. This was due past issues
> where tickets have been created without previous discussions on dev@ and
> from people who were not PPMC members, thus creating too much churn.
>
> During the last year, the MXNet community has shown that we are able to
> adhere to the Apache ways. Thus the restrictions are being lifted and the
> following policy get set in place:
>
> - Only PPMC members are allowed to create tickets (if you can see
> priv...@mxnet.apache.org, you're good to go)
> - Committers are not allowed to create tickets (if you have write access to
> GitHub but can't see priv...@mxnet.apache.org, you're not a PPMC member but
> a committer)
> - Contributors are not allowed to create tickets (if you're neither a PPMC
> member, nor a committer, then you're a contributor)
> - There always has to be a dev@ thread before a ticket can be created. That
> thread has to be linked in that said ticket.
> - Always search for a solution yourself (self-service) before engaging with
> Apache Infra.
>
> I'm not sure about a good place to document these guidelines. If somebody
> has a good idea where we should write them down, please feel free to drop
> me a link and I'll paste them in there.
>
> Thanks everybody for the great collaboration around Apache Infra tickets!
> This was a prime example of a community working together.
>
> Best regards,
> Marco
>


RE: v1.4.0 status 11/29

2018-11-29 Thread Lv, Tao A
Hi Steffen, I would like to have 
https://github.com/apache/incubator-mxnet/pull/13433  into the coming 1.4.0 
release. It fixed a crash of deconvolution with certain input size for MKL-DNN 
backend. This PR is well reviewed and already merged into the master branch. 
New test case is also included there.

Please find the corresponding issue here: 
https://github.com/apache/incubator-mxnet/issues/13421 .

Thanks,
-Tao

-----Original Message-----
From: Steffen Rochel [mailto:steffenroc...@gmail.com] 
Sent: Friday, November 30, 2018 12:05 PM
To: dev@mxnet.incubator.apache.org
Subject: v1.4.0 status 11/29

Dear MXNet community -
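I would like to provide update on v1.4.0 status, details will be tracked here 
<https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status>.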

1. Sergey created v1.4.x branch
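2. As expected, additional requests have been made for inclusion in v1.4.0 
release. Critical PR are tracked here 
<https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack>.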
3. PR to update README.md is blocked by flaky test failures, retriggered check.
4. PR to upgrade version on master to v1.5.0 has been submitted.
5. CI is setup and first run passed.

Note: if you want to add selected fixes or enhancements, please reply to this 
email. Please provide justification, add me as approver to the v1.4.x PR and 
make sure your changes have tests included in PR and get properly reviewed.

Regards,
Steffen


Apache Infra tickets for MXNet

2018-11-29 Thread Marco de Abreu
Hello everyone,

I have just had a nice conversation with Greg Stein, VP of Apache Infra,
about the topic of creating tickets against Apache Infra.

In the past, we had the restriction that only IPMC members (that is, mentors)
were allowed to file tickets against Apache Infra. This was due to past issues
where tickets were created without previous discussion on dev@ and by
people who were not PPMC members, thus creating too much churn.

During the last year, the MXNet community has shown that we are able to
adhere to the Apache ways. Thus the restrictions are being lifted and the
following policy is being put in place:

- Only PPMC members are allowed to create tickets (if you can see
priv...@mxnet.apache.org, you're good to go)
- Committers are not allowed to create tickets (if you have write access to
GitHub but can't see priv...@mxnet.apache.org, you're not a PPMC member but
a committer)
- Contributors are not allowed to create tickets (if you're neither a PPMC
member, nor a committer, then you're a contributor)
- There always has to be a dev@ thread before a ticket can be created. That
thread has to be linked in the ticket.
- Always search for a solution yourself (self-service) before engaging with
Apache Infra.

I'm not sure about a good place to document these guidelines. If somebody
has a good idea where we should write them down, please feel free to drop
me a link and I'll paste them in there.

Thanks everybody for the great collaboration around Apache Infra tickets!
This was a prime example of a community working together.

Best regards,
Marco


v1.4.0 status 11/29

2018-11-29 Thread Steffen Rochel
Dear MXNet community -
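I would like to provide an update on the v1.4.0 status. Details will be tracked
here
<https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status>.

1. Sergey created the v1.4.x branch.
2. As expected, additional requests have been made for inclusion in the v1.4.0
release. Critical PRs are tracked here
<https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack>.
3. The PR to update README.md is blocked by flaky test failures; the check
has been retriggered.
4. The PR to upgrade the version on master to v1.5.0 has been submitted.
5. CI is set up and the first run passed.

Note: if you want to add selected fixes or enhancements, please reply
to this email. Please provide justification, add me as an approver to the
v1.4.x PR, and make sure your changes have tests included in the PR and get
properly reviewed.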

Regards,
Steffen


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Lin Yuan
Hi Steffen,

Can we add the following PR to 1.4.0 release:

https://github.com/apache/incubator-mxnet/pull/13452

It's just a Python API returning the header path, so it should not cause any
regression issues. But it is required for Horovod to integrate with MXNet. It's
better to have this in a minor release than in a patch release.

Thanks,

Lin

On Thu, Nov 29, 2018 at 6:46 PM Steffen Rochel 
wrote:

> Hi Zhi - thanks for the improvement, which we should consider for 1.4.0.
> However, I don't see any tests with the PR and think it is too risky to add
> changes without tests. I will add your PR to the tracking list, but would
> like to ask you to add functional tests before completing the PR to master
> and v1.4.x branch.
>
> Steffen
>
> On Thu, Nov 29, 2018 at 5:01 PM Joshua Z. Zhang 
> wrote:
>
> > Hi, I would like to bring a critical performance and stability patch of
> > existing gluon dataloader to 1.4.0:
> > https://github.com/apache/incubator-mxnet/pull/13447 <
> > https://github.com/apache/incubator-mxnet/pull/13447>.
> >
> > This PR is finished, waiting for CI to pass.
> >
> > Steffen, could you help me add that to the tracked list?
> >
> > Best,
> > Zhi
> >
> > > On Nov 29, 2018, at 4:25 PM, Naveen Swamy  wrote:
> > >
> > > the tests are randomly failing in different stages
> > >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13105/
> > > This PR has failed 8 times so far
> > >
> > > On Thu, Nov 29, 2018 at 3:43 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > wrote:
> > >
> > >> Pedro - ok. Please add PR to v1.4.x branch after merge to master and
> > please
> > >> update tracking page
> > >> <
> > >>
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> > >>>
> > >> .
> > >> Steffen
> > >>
> > >> On Thu, Nov 29, 2018 at 3:00 PM Pedro Larroy <
> > pedro.larroy.li...@gmail.com
> > >>>
> > >> wrote:
> > >>
> > >>> PR is ready from my side and passes the tests, unless somebody raises
> > >>> any concerns it's good to go.
> > >>> On Thu, Nov 29, 2018 at 9:50 PM Steffen Rochel <
> > steffenroc...@gmail.com>
> > >>> wrote:
> > 
> >  Pedro - added  to 1.4.0 tracking list
> >  <
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> > 
> > 
> >  Do you have already ETA?
> >  Steffen
> > 
> >  On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> > >>> pedro.larroy.li...@gmail.com>
> >  wrote:
> > 
> > > Hi all.
> > >
> > > There are two important issues / fixes that should go in the next
> > > release in my radar:
> > >
> > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > There is a bug in shape inference on CPU when not using MKL, also we
> > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > I'm finishing a fix for these issues in the above PR.
> > >
> > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > Setenv / getenv from multiple threads is not safe and is causing
> > > segfaults. This piece of code (the handlers in pthread_atfork) already
> > > caused a very difficult to diagnose hang in a previous release, where
> > > a fork inside cudnn would deadlock the engine.
> > >
> > > I would remove setenv from 2) as a mitigation, but we would need to
> > > check for regressions as we could be creating additional threads
> > > inside the engine.
> > >
> > > I would suggest that we address these two major issues before the next
> > > release.
> > >
> > > Pedro
> > >
> > >
> > >
> > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
> > >>> steffenroc...@gmail.com>
> > > wrote:
> > >>
> > >> Dear MXNet community,
> > >>
> > >> I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > release.
> > >> Sergey Kolychev will be co-managing the release and providing help
> > >>> from
> > > the
> > >> committers side.
> > >> A release candidate will be cut on November 29, 2018 and voting
> > >> will
> > > start
> > >> December 7, 2018. Release notes have been drafted here [1]. If you
> > >>> have
> > > any
> > >> additional features in progress and would like to include it in
> > >> this
> > >> release, please assure they have been merged by November 27, 2018.
> > > Release
> > >> schedule is available here [2].
> > >>
> > >> Feel free to add any other comments/suggestions. Please help to
> > >>> review
> > > and
> > >> merge outstanding PR's and resolve issues impacting the quality of
> > >>> the
> > >> 

Re: Adding AMD CPU to CI

2018-11-29 Thread Hao Jin
For CPUs, the supported instruction sets may also vary between different
product lines of the same manufacturer and generation (Skylake-SP
versus Skylake).
For the same instruction set, the two manufacturers should both have a
working version of the hardware implementation. If any of the
implementations does not work, then the chip would not even be considered
functioning properly.
If some AMD CPUs only support up to AVX2 instruction sets, they would just
function in the same way as an Intel CPU that supports up to AVX2
instruction sets. The performance may vary, but the capability and behavior
of the two chips would be the same when given the same machine code.
For AMD GPUs it's a totally different story, as AMD GPUs do not share the
same instruction sets with the NVIDIA ones, thus testing on AMD GPUs (if we
do have support for them) would definitely add value.
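
The instruction-set point above can be checked per machine. The following is a minimal sketch, assuming a Linux x86 host that exposes feature flags on a `flags` line in `/proc/cpuinfo`; the function names and the default list of required flags (`avx2`, `f16c`) are illustrative assumptions, not part of MXNet or its CI.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text.

    Assumption: Linux x86 layout, where each core has a line such as
    'flags : fpu sse avx avx2 ...'. Other architectures use different keys.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, required=("avx2", "f16c")):
    """List the required flags (hypothetical default set) the CPU does not report."""
    flags = parse_cpu_flags(cpuinfo_text)
    return sorted(name for name in required if name not in flags)
```

On a CI worker, one would read `/proc/cpuinfo` and pass its contents to `missing_features`. An empty list on both an Intel and an AMD instance would mean both report the same required extensions, which is the case where identical machine code is expected to behave identically.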
Hao

On Thu, Nov 29, 2018 at 8:37 PM Anirudh Subramanian 
wrote:

> Instruction set extensions support like AVX2, AVX512 etc. can vary between
> AMD and Intel and there can also be a time lag between when Intel supports
> it versus when AMD supports it.
> Also, in the future this setup may be useful in case MXNet supports AMD
> GPUs and AWS also happens to have support for it.
>
> Anirudh
>
>
> On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
>  wrote:
>
> > I think it's worth a discussion to do a sanity check. While generally these
> > instructions are standardized, we also made the experience with ARM that
> > the theory and reality sometimes don't match. Thus, it's always good to
> > check.
> >
> > In the next months we are going to refactor our slave creation processes.
> > Chance Bair has been working on rewriting Windows slaves from scratch (we
> > used images that haven't really been updated for 2 years - we still don't
> > know what was done on them) and they're ready soon. In the following
> > months, we will also port our Ubuntu slaves to the new method (don't have a
> > timeline yet). Ideally, the integration of AMD instances will only be a
> > matter of running the same pipeline on a different instance type. In that
> > case, it should not be a big deal.
> >
> > If there are big differences, that's already a yellow flag for
> > compatibility, but that's unlikely. But in that case, we would have to make
> > a more thorough time analysis and whether it's worth the effort. Maybe,
> > somebody else could also lend us a hand and help us with adding AMD
> > support.
> >
> > -Marco
> >
> > Am Fr., 30. Nov. 2018, 01:22 hat Hao Jin 
> > geschrieben:
> >
> > > f16c is also an instruction set supported by both brands' recent CPUs just
> > > like x86, AVX, SSE etc., and any difference in behaviors (quite impossible
> > > to happen or it will be a major defect) would most likely be caused by the
> > > underlying hardware implementation, so still, adding AMD instances is not
> > > adding much value here.
> > > Hao
> > >
> > > On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Just looked at the mf16c work and wanted to mention Rahul clearly
> _was_
> > > > thinking about AMD users in that PR.
> > > >
> > > > On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > From my perspective we're developing a few features like mf16c and
> > > MKLDNN
> > > > > integration specifically for Intel CPUs.  It wouldn't hurt to make
> > sure
> > > > > those changes also run properly on AMD cpus.
> > > > >
> > > > > On Thu, Nov 29, 2018, 3:38 PM Hao Jin  > > > >
> > > > >> I'm a bit confused about why we need extra functionality tests
> just
> > > for
> > > > >> AMD
> > > > >> CPUs, aren't AMD CPUs supporting roughly the same instruction sets
> > as
> > > > the
> > > > >> Intel ones? In the very impossible case that something working on
> > > Intel
> > > > >> CPUs being not functioning on AMD CPUs (or vice versa), it would
> > > mostly
> > > > >> likely be related to the underlying hardware implementation of the
> > > same
> > > > >> ISA, to which we definitely do not have a good solution. So I
> don't
> > > > think
> > > > >> performing extra tests on functional aspect of the system on AMD
> > CPUs
> > > is
> > > > >> adding any values.
> > > > >> Hao
> > > > >>
> > > > >> On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu
> >  > > >
> > > > >> wrote:
> > > > >>
> > > > >> > +1
> > > > >> >
> > > > >> > On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
> > > > >> >
> > > > >> > What are people's thoughts on having AMD machines tested on
> > the
> > > > CI?
> > > > >> AMD
> > > > >> > machines are now available on AWS.
> > > > >> >
> > > > >> > Best,
> > > > >> > Alex
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>


MKLDNN dynamically linked

2018-11-29 Thread Alex Zai
I created a new thread for this because my emails were not going through from
my work email. The original thread can be found here (
https://lists.apache.org/list.html?d...@mxnet.apache.org:lte=1M:dynamically)

There seems to be an issue linking static libraries (MKLDNN) on Windows, as
not all VS compilers support static linking (
https://stackoverflow.com/questions/18901128/link-static-library-using-cmake
)

The PR can be tracked here (
https://github.com/apache/incubator-mxnet/pull/13197). Jenkins fails on the
Windows build.

There are two routes I see here (both non-ideal):

1. Keep MKLDNN as a dynamically linked library. This will cause
issues, especially since the MKLDNN version has been incremented to 0.17
and will soon be 0.17.1.

2. Change the build files such that MKLDNN is statically linked on
Linux/Mac but remains dynamically linked on Windows. This will complicate
our build files (CMakeLists.txt and Makefile), but it may be easier to
resolve MKLDNN issues on Mac/Linux since we'll know which version of MKLDNN
they are using.


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Marco de Abreu
Hi everyone,

Would you mind prepending [1.4.x] to the titles of your PRs so we can see
cherry-picks at a glance? That'd allow me to better classify the load we
have on our CI (release branches have a higher load than master due to
cache mismatches).
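
A small sketch of how such a title prefix could be recognized when classifying CI load. It assumes both spellings seen on this list, `[1.4.x]` and `[v1.4.x]`, should count as the same branch; the function name and the normalization to a leading `v` are hypothetical choices, not an existing CI feature.

```python
import re

# Matches a leading "[1.4.x]" or "[v1.4.x]" style tag (any major.minor).
_BRANCH_TAG = re.compile(r"^\s*\[(v?\d+\.\d+\.x)\]", re.IGNORECASE)


def release_branch_tag(title):
    """Return the normalized branch tag (e.g. 'v1.4.x') or None if the title has no prefix."""
    match = _BRANCH_TAG.match(title)
    if match is None:
        return None
    tag = match.group(1).lower()
    return tag if tag.startswith("v") else "v" + tag
```

Only a tag at the start of the title is recognized, so a PR that merely mentions a branch name later in its title is still counted as a master PR.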

Best regards,
Marco

On Fri, Nov 30, 2018 at 2:17 AM Marco de Abreu 
wrote:

> Hi Naveen,
>
> yeah sorry, that's DockerHub acting up again (this happens every now and
> then unfortunately). Basically docker pull starts multiple download threads
> and it seems like sometimes a single web server request sits in the queue
> forever which then slows down the docker pull (for the cache retrieval).
>
> Chance will be assisting with CI issues this week and I explained him my
> proposed solution: Basically wrap the 'docker pull' into a timeout in
> combination with a retry with backoff. Anton proposed, in case that retry
> fails after a few times, we are falling back to local cache and cache
> regeneration to avoid the job failing. That would solve the problem you're
> encountering. We would basically wrap [1] into the timeout-retry-mechanism.
>
> Best regards,
> Marco
>
> [1]:
> https://github.com/apache/incubator-mxnet/blob/master/ci/docker_cache.py#L107
>
> On Fri, Nov 30, 2018 at 2:01 AM Joshua Z. Zhang 
> wrote:
>
>> Hi, I would like to bring a critical performance and stability patch of
>> existing gluon dataloader to 1.4.0:
>> https://github.com/apache/incubator-mxnet/pull/13447 <
>> https://github.com/apache/incubator-mxnet/pull/13447>.
>>
>> This PR is finished, waiting for CI to pass.
>>
>> Steffen, could you help me add that to the tracked list?
>>
>> Best,
>> Zhi
>>
>> > On Nov 29, 2018, at 4:25 PM, Naveen Swamy  wrote:
>> >
>> > the tests are randomly failing in different stages
>> >
>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13105/
>> > This PR has failed 8 times so far
>> >
>> > On Thu, Nov 29, 2018 at 3:43 PM Steffen Rochel > >
>> > wrote:
>> >
>> >> Pedro - ok. Please add PR to v1.4.x branch after merge to master and
>> please
>> >> update tracking page
>> >> <
>> >>
>> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
>> >>>
>> >> .
>> >> Steffen
>> >>
>> >> On Thu, Nov 29, 2018 at 3:00 PM Pedro Larroy <
>> pedro.larroy.li...@gmail.com
>> >>>
>> >> wrote:
>> >>
>> >>> PR is ready from my side and passes the tests, unless somebody raises
>> >>> any concerns it's good to go.
>> >>> On Thu, Nov 29, 2018 at 9:50 PM Steffen Rochel <
>> steffenroc...@gmail.com>
>> >>> wrote:
>> 
>>  Pedro - added  to 1.4.0 tracking list
>>  <
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
>> 
>> 
>>  Do you have already ETA?
>>  Steffen
>> 
>>  On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
>> >>> pedro.larroy.li...@gmail.com>
>>  wrote:
>> 
>> > Hi all.
>> >
>> > There are two important issues / fixes that should go in the next
>> > release in my radar:
>> >
>> > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
>> > There is a bug in shape inference on CPU when not using MKL, also we
>> > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
>> > I'm finishing a fix for these issues in the above PR.
>> >
>> > 2) https://github.com/apache/incubator-mxnet/issues/13438
>> > We are seeing crashes due to unsafe setenv in multithreaded code.
>> > Setenv / getenv from multiple threads is not safe and is causing
>> > segfaults. This piece of code (the handlers in pthread_atfork) already
>> > caused a very difficult to diagnose hang in a previous release, where
>> > a fork inside cudnn would deadlock the engine.
>> >
>> > I would remove setenv from 2) as a mitigation, but we would need to
>> > check for regressions as we could be creating additional threads
>> > inside the engine.
>> >
>> > I would suggest that we address these two major issues before the next
>> > release.
>> >
>> > Pedro
>> >
>> >
>> >
>> > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
>> >>> steffenroc...@gmail.com>
>> > wrote:
>> >>
>> >> Dear MXNet community,
>> >>
>> >> I will be the release manager for the upcoming Apache MXNet 1.4.0
>> > release.
>> >> Sergey Kolychev will be co-managing the release and providing help
>> >>> from
>> > the
>> >> committers side.
>> >> A release candidate will be cut on November 29, 2018 and voting
>> >> will
>> > start
>> >> December 7, 2018. Release notes have been drafted here [1]. If you
>> >>> have
>> > any
>> >> additional features in progress and would like to include it in
>> >> this
>> >> 

Re: Adding AMD CPU to CI

2018-11-29 Thread Anirudh Subramanian
Support for instruction set extensions such as AVX2, AVX512, etc. can vary
between AMD and Intel, and there can also be a time lag between when Intel
supports an extension and when AMD supports it.
Also, in the future this setup may be useful in case MXNet supports AMD
GPUs and AWS also happens to have support for it.

Anirudh


On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
 wrote:

> I think it's worth a discussion to do a sanity check. While generally these
> instructions are standardized, we also made the experience with ARM that
> the theory and reality sometimes don't match. Thus, it's always good to
> check.
>
> In the next months we are going to refactor our slave creation processes.
> Chance Bair has been working on rewriting Windows slaves from scratch (we
> used images that haven't really been updated for 2 years - we still don't
> know what was done on them) and they're ready soon. In the following
> months, we will also port our Ubuntu slaves to the new method (don't have a
> timeline yet). Ideally, the integration of AMD instances will only be a
> matter of running the same pipeline on a different instance type. In that
> case, it should not be a big deal.
>
> If there are big differences, that's already a yellow flag for
> compatibility, but that's unlikely. But in that case, we would have to make
> a more thorough time analysis and whether it's worth the effort. Maybe,
> somebody else could also lend us a hand and help us with adding AMD
> support.
>
> -Marco
>
> Am Fr., 30. Nov. 2018, 01:22 hat Hao Jin 
> geschrieben:
>
> > f16c is also an instruction set supported by both brands' recent CPUs just
> > like x86, AVX, SSE etc., and any difference in behaviors (quite impossible
> > to happen or it will be a major defect) would most likely be caused by the
> > underlying hardware implementation, so still, adding AMD instances is not
> > adding much value here.
> > Hao
> >
> > On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Just looked at the mf16c work and wanted to mention Rahul clearly _was_
> > > thinking about AMD users in that PR.
> > >
> > > On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > From my perspective we're developing a few features like mf16c and
> > MKLDNN
> > > > integration specifically for Intel CPUs.  It wouldn't hurt to make
> sure
> > > > those changes also run properly on AMD cpus.
> > > >
> > > > On Thu, Nov 29, 2018, 3:38 PM Hao Jin  > > >
> > > >> I'm a bit confused about why we need extra functionality tests just
> > for
> > > >> AMD
> > > >> CPUs, aren't AMD CPUs supporting roughly the same instruction sets
> as
> > > the
> > > >> Intel ones? In the very impossible case that something working on
> > Intel
> > > >> CPUs being not functioning on AMD CPUs (or vice versa), it would
> > mostly
> > > >> likely be related to the underlying hardware implementation of the
> > same
> > > >> ISA, to which we definitely do not have a good solution. So I don't
> > > think
> > > >> performing extra tests on functional aspect of the system on AMD
> CPUs
> > is
> > > >> adding any values.
> > > >> Hao
> > > >>
> > > >> On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu
>  > >
> > > >> wrote:
> > > >>
> > > >> > +1
> > > >> >
> > > >> > On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
> > > >> >
> > > >> > What are people's thoughts on having AMD machines tested on
> the
> > > CI?
> > > >> AMD
> > > >> > machines are now available on AWS.
> > > >> >
> > > >> > Best,
> > > >> > Alex
> > > >> >
> > > >> >
> > > >> >
> > > >>
> > > >
> > >
> >
>


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Marco de Abreu
Hi Naveen,

yeah sorry, that's DockerHub acting up again (this happens every now and
then unfortunately). Basically docker pull starts multiple download threads
and it seems like sometimes a single web server request sits in the queue
forever which then slows down the docker pull (for the cache retrieval).

Chance will be assisting with CI issues this week and I explained my
proposed solution to him: basically, wrap the 'docker pull' in a timeout in
combination with a retry with backoff. Anton proposed that, in case the retry
fails after a few attempts, we fall back to the local cache and cache
regeneration to avoid the job failing. That would solve the problem you're
encountering. We would basically wrap [1] in the timeout-retry mechanism.
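
The proposal above could look roughly like the following. This is only a sketch under stated assumptions, not the code in ci/docker_cache.py: the function names, the default attempt count, timeout, and backoff values, and the `rebuild` callback standing in for "local cache and cache regeneration" are all hypothetical.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, timeout_s=600, backoff_s=5.0,
                   sleep=time.sleep, run=subprocess.run):
    """Run cmd with a per-attempt timeout; retry with exponential backoff.

    Returns True as soon as one attempt exits with code 0, otherwise False.
    The sleep and run parameters exist only so the logic can be tested.
    """
    delay = backoff_s
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired on timeout.
            if run(cmd, timeout=timeout_s).returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            pass  # a stalled download is treated like any other failed attempt
        if attempt < attempts:
            sleep(delay)
            delay *= 2  # double the wait before the next attempt
    return False


def pull_or_rebuild(image, rebuild, **retry_kwargs):
    """Try 'docker pull' with retries; on total failure call rebuild(image) instead."""
    if run_with_retry(["docker", "pull", image], **retry_kwargs):
        return "pulled"
    rebuild(image)  # hypothetical fallback: regenerate the cache locally
    return "rebuilt"
```

With this shape the job fails only if the fallback itself fails, so a single stuck request on the registry side costs at most one timeout period per attempt.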

Best regards,
Marco

[1]:
https://github.com/apache/incubator-mxnet/blob/master/ci/docker_cache.py#L107

On Fri, Nov 30, 2018 at 2:01 AM Joshua Z. Zhang 
wrote:

> Hi, I would like to bring a critical performance and stability patch of
> existing gluon dataloader to 1.4.0:
> https://github.com/apache/incubator-mxnet/pull/13447 <
> https://github.com/apache/incubator-mxnet/pull/13447>.
>
> This PR is finished, waiting for CI to pass.
>
> Steffen, could you help me add that to the tracked list?
>
> Best,
> Zhi
>
> > On Nov 29, 2018, at 4:25 PM, Naveen Swamy  wrote:
> >
> > the tests are randomly failing in different stages
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13105/
> > This PR has failed 8 times so far
> >
> > On Thu, Nov 29, 2018 at 3:43 PM Steffen Rochel 
> > wrote:
> >
> >> Pedro - ok. Please add PR to v1.4.x branch after merge to master and
> please
> >> update tracking page
> >> <
> >>
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> >>>
> >> .
> >> Steffen
> >>
> >> On Thu, Nov 29, 2018 at 3:00 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com
> >>>
> >> wrote:
> >>
> >>> PR is ready from my side and passes the tests, unless somebody raises
> >>> any concerns it's good to go.
> >>> On Thu, Nov 29, 2018 at 9:50 PM Steffen Rochel <
> steffenroc...@gmail.com>
> >>> wrote:
> 
>  Pedro - added  to 1.4.0 tracking list
>  <
> >>>
> >>
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> 
> 
>  Do you have already ETA?
>  Steffen
> 
>  On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> >>> pedro.larroy.li...@gmail.com>
>  wrote:
> 
> > Hi all.
> >
> > There are two important issues / fixes that should go in the next
> > release in my radar:
> >
> > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > There is a bug in shape inference on CPU when not using MKL, also we
> > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > I'm finishing a fix for these issues in the above PR.
> >
> > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > We are seeing crashes due to unsafe setenv in multithreaded code.
> > Setenv / getenv from multiple threads is not safe and is causing
> > segfaults. This piece of code (the handlers in pthread_atfork) already
> > caused a very difficult to diagnose hang in a previous release, where
> > a fork inside cudnn would deadlock the engine.
> >
> > I would remove setenv from 2) as a mitigation, but we would need to
> > check for regressions as we could be creating additional threads
> > inside the engine.
> >
> > I would suggest that we address these two major issues before the next
> > release.
> >
> > Pedro
> >
> >
> >
> > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <steffenroc...@gmail.com>
> > wrote:
> >>
> >> Dear MXNet community,
> >>
> >> I will be the release manager for the upcoming Apache MXNet 1.4.0 release.
> >> Sergey Kolychev will be co-managing the release and providing help from the
> >> committers side.
> >> A release candidate will be cut on November 29, 2018 and voting will start
> >> December 7, 2018. Release notes have been drafted here [1]. If you have any
> >> additional features in progress and would like to include it in this
> >> release, please assure they have been merged by November 27, 2018. Release
> >> schedule is available here [2].
> >>
> >> Feel free to add any other comments/suggestions. Please help to review and
> >> merge outstanding PR's and resolve issues impacting the quality of the
> >> 1.4.0 release.
> >>
> >> Regards,
> >>
> >> Steffen
> >>
> >> [1]

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Joshua Z. Zhang
Hi, I would like to bring a critical performance and stability patch for the
existing Gluon DataLoader into 1.4.0:
https://github.com/apache/incubator-mxnet/pull/13447 .

This PR is finished, waiting for CI to pass. 

Steffen, could you help me add that to the tracked list?

Best,
Zhi

> On Nov 29, 2018, at 4:25 PM, Naveen Swamy  wrote:
> 
> the tests are randomly failing in different stages
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13105/
> This PR has failed 8 times so far
> 
> On Thu, Nov 29, 2018 at 3:43 PM Steffen Rochel 
> wrote:
> 
>> Pedro - ok. Please add PR to v1.4.x branch after merge to master and please
>> update tracking page
>> <https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack>.
>> Steffen
>> 
>> On Thu, Nov 29, 2018 at 3:00 PM Pedro Larroy
>> wrote:
>> 
>>> PR is ready from my side and passes the tests, unless somebody raises
>>> any concerns it's good to go.
>>> On Thu, Nov 29, 2018 at 9:50 PM Steffen Rochel 
>>> wrote:
 
>>>> Pedro - added to 1.4.0 tracking list
>>>> <https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack>
>>>>
>>>> Do you have already ETA?
>>>> Steffen
>>>>
>>>> On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <pedro.larroy.li...@gmail.com>
>>>> wrote:
 
> Hi all.
> 
> There are two important issues / fixes that should go in the next
> release in my radar:
> 
> 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> There is a bug in shape inference on CPU when not using MKL, also we
> are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> I'm finishing a fix for these issues in the above PR.
> 
> 2) https://github.com/apache/incubator-mxnet/issues/13438
> We are seeing crashes due to unsafe setenv in multithreaded code.
> Setenv / getenv from multiple threads is not safe and is causing
> segfaults. This piece of code (the handlers in pthread_atfork) already
> caused a very difficult to diagnose hang in a previous release, where
> a fork inside cudnn would deadlock the engine.
> 
> I would remove setenv from 2) as a mitigation, but we would need to
> check for regressions as we could be creating additional threads
> inside the engine.
> 
> I would suggest that we address these two major issues before the next
> release.
> 
> Pedro
> 
> 
> 
> On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <steffenroc...@gmail.com>
> wrote:
>>
>> Dear MXNet community,
>>
>> I will be the release manager for the upcoming Apache MXNet 1.4.0 release.
>> Sergey Kolychev will be co-managing the release and providing help from the
>> committers side.
>> A release candidate will be cut on November 29, 2018 and voting will start
>> December 7, 2018. Release notes have been drafted here [1]. If you have any
>> additional features in progress and would like to include it in this
>> release, please assure they have been merged by November 27, 2018. Release
>> schedule is available here [2].
>>
>> Feel free to add any other comments/suggestions. Please help to review and
>> merge outstanding PR's and resolve issues impacting the quality of the
>> 1.4.0 release.
>>
>> Regards,
>>
>> Steffen
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
>>
>> [2]
>> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
>> 
>> 
>> 
>> 
>> On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
>> kellen.sunderl...@gmail.com> wrote:
>> 
>>> Spoke too soon[1], looks like others have been adding Turing support as
>>> well (thanks to those helping with this).  I believe there's still a few
>>> changes we'd have to make to claim support though (mshadow CMake changes,
>>> PyPi package creation tweaks).
>>>
>>> 1:
>>> https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
>>> 
>>> On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
>>> kellen.sunderl...@gmail.com> wrote:
>>> 
>>>> Hey Steffen, I'd like to be able to merge this PR for version 1.4:
>>>> https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
>>>> regression in master which causes incorrect feature vectors to be output
>>>> when

Re: [Launch Announcement] Dynamic training with Apache MXNet

2018-11-29 Thread Rahul Huilgol
This is great stuff. Well done! A few questions:

   - Do you plan to maintain this as a separate fork, or merge it back into
   the main repository?
   - Is the number of parameter servers fixed at the start, or can we add
   more parameter servers later?
   - I see that you cannot remove any of the nodes that the cluster was
   initialized with. Why are these initial nodes treated differently? Is it
   because they host the parameter servers that update the weights (and hold
   the optimizer states)?


On Thu, Nov 29, 2018 at 4:04 PM Marco de Abreu
 wrote:

> Awesome project! Great job everyone.
>
> On Thu, Nov 29, 2018, 19:55, Kumar, Vikas
> wrote:
>
> > A big thanks to Qi Qiao < https://github.com/mirocody > for making it
> > easy for users to set up a cluster for dynamic training using
> > cloudformation.
> >
> > From: "Kumar, Vikas" 
> > Date: Thursday, November 29, 2018 at 10:26 AM
> > To: "dev@mxnet.incubator.apache.org" 
> > Subject: [Launch Announcement] Dynamic training with Apache MXNet
> >
> > Hello MXNet community,
> >
> > MXNet users can now use Dynamic Training (DT) for deep learning models with
> > Apache MXNet. DT helps reduce training cost and training time by
> > adding elasticity to the distributed training cluster. DT also helps
> > increase instance pool utilization. With DT, unused instances can be used
> > to speed up training, and instances can later be removed from the training
> > cluster to be used by some other application.
> > For details, refer to the DT blog
> > <https://aws.amazon.com/blogs/machine-learning/introducing-dynamic-training-for-deep-learning-with-amazon-ec2/>.
> > Developers should be able to integrate Dynamic Training into their existing
> > distributed training code by adding a few extra lines of code
> > <https://github.com/awslabs/dynamic-training-with-apache-mxnet-on-aws#writing-a-distributed-training-script>.
> >
> > Thank you to all the contributors: Vikas Kumar <
> > https://github.com/Vikas89 >, Haibin Lin <
> > https://github.com/eric-haibin-lin>, Andrea Olgiati <
> > https://github.com/andreaolgiati/> ,
> > Mu Li < https://github.com/mli >, Hagay Lupesko <
> > https://github.com/lupesko>, Markham Aaron <
> > https://github.com/aaronmarkham > , Sergey Sokolov <
> > https://github.com/Ishitori> , Qi Qiao < https://github.com/mirocody >
> >
> > This is an effort towards making training neural networks cheap and fast.
> > We welcome your contributions to the repo -
> > https://github.com/awslabs/dynamic-training-with-apache-mxnet-on-aws . We
> > would love to hear feedback and ideas in this direction.
> >
> > Thanks
> > Vikas
> >
>


-- 
Rahul Huilgol


Re: CI impaired

2018-11-29 Thread Marco de Abreu
Hello,

Since the release branch has now been cut, I would like to move forward
with the CI improvements for the master branch. This would include the
following actions:
1. Re-enable the new Jenkins job
2. Request Apache Infra to move the protected branch check from the main
pipeline to our new ones
3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this
finalizes the deprecation process

If nobody objects, I would like to start with #1 soon. Mentors, could you
please help create the Apache Infra ticket? I would then take it from
there and talk to Infra.

Best regards,
Marco

On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Sorry, [1] meant to reference
> https://issues.jenkins-ci.org/browse/JENKINS-37984 .
>
> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Marco and I ran into another urgent issue over the weekend that was
> > causing builds to fail.  This issue was unrelated to any feature
> > development work, or other CI fixes applied recently, but it did require
> > quite a bit of work from Marco (and a little from me) to fix.
> >
> > We spent enough time on the problem that it caused us to take a step back
> > and consider how we could both fix issues in CI and support the 1.4 release
> > with the least impact possible on MXNet devs.  Marco had planned to make a
> > significant change to the CI to fix a long-standing Jenkins error [1], but
> > we feel that most developers would prioritize having a stable build
> > environment for the next few weeks over having this fix in place.
> >
> > To properly introduce a new CI system the intent was to do a gradual
> > blue/green roll out of the fix.  To manage this rollout would have taken
> > operational effort and double compute load as we run systems in parallel.
> > This risks outages due to scaling limits, and we’d rather make this change
> > during a period of low-developer activity, i.e. shortly after the 1.4
> > release.
> >
> > This means that from now until the 1.4 release, in order to reduce
> > complexity MXNet developers should only see a single Jenkins verification
> > check, and a single Travis check.
> >
> >
>


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Naveen Swamy
The tests are randomly failing in different stages:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13105/
This PR has failed 8 times so far.

On Thu, Nov 29, 2018 at 3:43 PM Steffen Rochel 
wrote:

> Pedro - ok. Please add PR to v1.4.x branch after merge to master and please
> update tracking page
> <
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> >
> .
> Steffen
>
> On Thu, Nov 29, 2018 at 3:00 PM Pedro Larroy
> wrote:
>
> > PR is ready from my side and passes the tests, unless somebody raises
> > any concerns it's good to go.
> > On Thu, Nov 29, 2018 at 9:50 PM Steffen Rochel 
> > wrote:
> > >
> > > Pedro - added  to 1.4.0 tracking list
> > > <
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> > >
> > >
> > > Do you have already ETA?
> > > Steffen
> > >
> > > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> > pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > Hi all.
> > > >
> > > > There are two important issues / fixes that should go in the next
> > > > release in my radar:
> > > >
> > > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > > There is a bug in shape inference on CPU when not using MKL, also we
> > > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > > I'm finishing a fix for these issues in the above PR.
> > > >
> > > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > > Setenv / getenv from multiple threads is not safe and is causing
> > > > segfaults. This piece of code (the handlers in pthread_atfork)
> already
> > > > caused a very difficult to diagnose hang in a previous release, where
> > > > a fork inside cudnn would deadlock the engine.
> > > >
> > > > I would remove setenv from 2) as a mitigation, but we would need to
> > > > check for regressions as we could be creating additional threads
> > > > inside the engine.
> > > >
> > > > I would suggest that we address these two major issues before the
> next
> > > > release.
> > > >
> > > > Pedro
> > > >
> > > >
> > > >
> > > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
> > steffenroc...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Dear MXNet community,
> > > > >
> > > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > > release.
> > > > > Sergey Kolychev will be co-managing the release and providing help
> > from
> > > > the
> > > > > committers side.
> > > > > A release candidate will be cut on November 29, 2018 and voting
> will
> > > > start
> > > > > December 7, 2018. Release notes have been drafted here [1]. If you
> > have
> > > > any
> > > > > additional features in progress and would like to include it in
> this
> > > > > release, please assure they have been merged by November 27, 2018.
> > > > Release
> > > > > schedule is available here [2].
> > > > >
> > > > > Feel free to add any other comments/suggestions. Please help to
> > review
> > > > and
> > > > > merge outstanding PR's and resolve issues impacting the quality of
> > the
> > > > > 1.4.0 release.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Steffen
> > > > >
> > > > > [1]
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> > > > >
> > > > > [2]
> > > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > > > > kellen.sunderl...@gmail.com> wrote:
> > > > >
> > > > > > Spoke too soon[1], looks like others have been adding Turing
> > support as
> > > > > > well (thanks to those helping with this).  I believe there's
> still
> > a
> > > > few
> > > > > > changes we'd have to make to claim support though (mshadow CMake
> > > > changes,
> > > > > > PyPi package creation tweaks).
> > > > > >
> > > > > > 1:
> > > > > >
> > > > > >
> > > >
> >
> https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > > > > >
> > > > > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > > > > > kellen.sunderl...@gmail.com> wrote:
> > > > > >
> > > > > > > Hey Steffen, I'd like to be able to merge this PR for version
> > 1.4:
> > > > > > > https://github.com/apache/incubator-mxnet/pull/13310 . It
> fixes
> > a
> > > > > > > regression in master which causes incorrect feature vectors to
> be
> > > > output
> > > > > > > when using the TensorRT feature.  (Thanks to Nathalie for
> > helping me
> > > > > > track
> > > > > > > down the root cause of the issue).   I'm currently blocked on a
> > CI
> > > > issue
> > > > > > I
> > > > > > > 

Re: Adding AMD CPU to CI

2018-11-29 Thread Hao Jin
F16C is also an instruction set extension supported by recent CPUs from both
vendors, just like x86, AVX, SSE, etc. Any difference in behavior (very
unlikely, and a major defect if it did happen) would most likely be caused by
the underlying hardware implementation, so adding AMD instances still does not
add much value here.
Hao
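
Whether a given CI host advertises an extension such as F16C can be checked
before any functional test runs. The following is only a minimal sketch, not
part of MXNet or its CI: it assumes a Linux host that exposes a "flags" line in
/proc/cpuinfo, and the helper names and the default list of required features
are hypothetical choices for illustration.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # The first "flags" line is enough: all cores normally match.
            return set(value.split())
    return set()


def missing_features(flags, required=("f16c", "avx", "sse4_2")):
    """Return the required features the CPU does not advertise, sorted."""
    # The default tuple is an illustrative assumption, not an MXNet list.
    return sorted(set(required) - set(flags))


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux only; absent on other systems
    if cpuinfo.exists():
        flags = parse_cpu_flags(cpuinfo.read_text())
        print("missing features:", missing_features(flags))
```

A check like this only confirms that both vendors' hosts report the same
feature flags; it says nothing about whether the hardware implements them
identically, which is the point under discussion in this thread.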

On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Just looked at the mf16c work and wanted to mention Rahul clearly _was_
> thinking about AMD users in that PR.
>
> On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > From my perspective we're developing a few features like mf16c and MKLDNN
> > integration specifically for Intel CPUs.  It wouldn't hurt to make sure
> > those changes also run properly on AMD cpus.
> >
> > On Thu, Nov 29, 2018, 3:38 PM Hao Jin wrote:
> >
> >> I'm a bit confused about why we need extra functionality tests just for
> >> AMD
> >> CPUs, aren't AMD CPUs supporting roughly the same instruction sets as
> the
> >> Intel ones? In the very impossible case that something working on Intel
> >> CPUs being not functioning on AMD CPUs (or vice versa), it would mostly
> >> likely be related to the underlying hardware implementation of the same
> >> ISA, to which we definitely do not have a good solution. So I don't
> think
> >> performing extra tests on functional aspect of the system on AMD CPUs is
> >> adding any values.
> >> Hao
> >>
> >> On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu 
> >> wrote:
> >>
> >> > +1
> >> >
> >> > On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
> >> >
> >> > What are people's thoughts on having AMD machines tested on the
> CI?
> >> AMD
> >> > machines are now available on AWS.
> >> >
> >> > Best,
> >> > Alex
> >> >
> >> >
> >> >
> >>
> >
>


Re: Adding AMD CPU to CI

2018-11-29 Thread Rahul Huilgol
+1
I do think it would be valuable to add an AMD step to our CI. As we
continue to improve performance, we might have to consider more
instructions that are faster but specific to a hardware
architecture. We are doing a lot of Intel-specific work, so it would be a good
sanity check that we continue to support AMD.


On Thu, Nov 29, 2018 at 4:03 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Just looked at the mf16c work and wanted to mention Rahul clearly _was_
> thinking about AMD users in that PR.
>
> On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > From my perspective we're developing a few features like mf16c and MKLDNN
> > integration specifically for Intel CPUs.  It wouldn't hurt to make sure
> > those changes also run properly on AMD cpus.
> >
> > On Thu, Nov 29, 2018, 3:38 PM Hao Jin wrote:
> >
> >> I'm a bit confused about why we need extra functionality tests just for
> >> AMD
> >> CPUs, aren't AMD CPUs supporting roughly the same instruction sets as
> the
> >> Intel ones? In the very impossible case that something working on Intel
> >> CPUs being not functioning on AMD CPUs (or vice versa), it would mostly
> >> likely be related to the underlying hardware implementation of the same
> >> ISA, to which we definitely do not have a good solution. So I don't
> think
> >> performing extra tests on functional aspect of the system on AMD CPUs is
> >> adding any values.
> >> Hao
> >>
> >> On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu 
> >> wrote:
> >>
> >> > +1
> >> >
> >> > On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
> >> >
> >> > What are people's thoughts on having AMD machines tested on the
> CI?
> >> AMD
> >> > machines are now available on AWS.
> >> >
> >> > Best,
> >> > Alex
> >> >
> >> >
> >> >
> >>
> >
>


-- 
Rahul Huilgol


Re: [Launch Announcement] Dynamic training with Apache MXNet

2018-11-29 Thread Marco de Abreu
Awesome project! Great job everyone.

On Thu, Nov 29, 2018, 19:55, Kumar, Vikas
wrote:

> A big thanks to Qi Qiao < https://github.com/mirocody > for making it
> easy for users to set up a cluster for dynamic training using
> cloudformation.
>
> From: "Kumar, Vikas" 
> Date: Thursday, November 29, 2018 at 10:26 AM
> To: "dev@mxnet.incubator.apache.org" 
> Subject: [Launch Announcement] Dynamic training with Apache MXNet
>
> Hello MXNet community,
>
> MXNet users can now use Dynamic Training (DT) for deep learning models with
> Apache MXNet. DT helps reduce training cost and training time by
> adding elasticity to the distributed training cluster. DT also helps
> increase instance pool utilization. With DT, unused instances can be used
> to speed up training, and instances can later be removed from the training
> cluster to be used by some other application.
> For details, refer to the DT blog
> <https://aws.amazon.com/blogs/machine-learning/introducing-dynamic-training-for-deep-learning-with-amazon-ec2/>.
> Developers should be able to integrate Dynamic Training into their existing
> distributed training code by adding a few extra lines of code
> <https://github.com/awslabs/dynamic-training-with-apache-mxnet-on-aws#writing-a-distributed-training-script>.
>
> Thank you to all the contributors: Vikas Kumar <
> https://github.com/Vikas89 >, Haibin Lin <
> https://github.com/eric-haibin-lin>, Andrea Olgiati <
> https://github.com/andreaolgiati/> ,
> Mu Li < https://github.com/mli >, Hagay Lupesko <
> https://github.com/lupesko>, Markham Aaron <
> https://github.com/aaronmarkham > , Sergey Sokolov <
> https://github.com/Ishitori> , Qi Qiao < https://github.com/mirocody >
>
> This is an effort towards making training neural networks cheap and fast.
> We welcome your contributions to the repo -
> https://github.com/awslabs/dynamic-training-with-apache-mxnet-on-aws . We
> would love to hear feedback and ideas in this direction.
>
> Thanks
> Vikas
>


Re: Adding AMD CPU to CI

2018-11-29 Thread kellen sunderland
Just looked at the mf16c work and wanted to mention Rahul clearly _was_
thinking about AMD users in that PR.

On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> From my perspective we're developing a few features like mf16c and MKLDNN
> integration specifically for Intel CPUs.  It wouldn't hurt to make sure
> those changes also run properly on AMD cpus.
>
> On Thu, Nov 29, 2018, 3:38 PM Hao Jin wrote:
>
>> I'm a bit confused about why we need extra functionality tests just for
>> AMD
>> CPUs, aren't AMD CPUs supporting roughly the same instruction sets as the
>> Intel ones? In the very impossible case that something working on Intel
>> CPUs being not functioning on AMD CPUs (or vice versa), it would mostly
>> likely be related to the underlying hardware implementation of the same
>> ISA, to which we definitely do not have a good solution. So I don't think
>> performing extra tests on functional aspect of the system on AMD CPUs is
>> adding any values.
>> Hao
>>
>> On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu 
>> wrote:
>>
>> > +1
>> >
>> > On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
>> >
>> > What are people's thoughts on having AMD machines tested on the CI?
>> AMD
>> > machines are now available on AWS.
>> >
>> > Best,
>> > Alex
>> >
>> >
>> >
>>
>


Re: You've been removed from the The Apache Software Foundation team mxnet committers

2018-11-29 Thread Marco de Abreu
Found the link: https://gitbox.apache.org/setup/

This is what it should look like:
https://photos.app.goo.gl/ZgkMSGrXWv5FMuJn8

On Fri, Nov 30, 2018, 00:39, Marco de Abreu <marco.g.ab...@googlemail.com>
wrote:

> Hello Sergey,
>
> feel free to create the ticket yourself in that case. For quick support,
> you can go the ASF infra channel on slack to get immediate support.
>
> Try to log in to Gitbox at Apache (don't have the link on hand). After
> authentication, you should see three green check marks.
>
> Could it be possible that you recently removed 2FA or changed your
> password? That will trigger an automatic revocation of all permissions
> until you have reauthenticated.
>
> Best regards,
> Marco
>
> On Fri, Nov 30, 2018, 00:27, Steffen Rochel
> wrote:
>
>> Dear Mentors - please file ticket with Infra to restore Sergey's
>> permissions as "The Apache Software Foundation team
>> mxnet committers". As he is co-release manager for v1.4.0, it is
>> time-sensitive to re-enable his write permissions on
>> https://github.com/apache/incubator-mxnet.
>>
>> Thanks,
>> Steffen
>>
>> On Thu, Nov 29, 2018 at 3:23 PM Tianqi Chen 
>> wrote:
>>
>> > Hmm, this does not sound right. I have no idea what is going on. As far
>> as
>> > I know, only apache infra have the admin right to the org. You are
>> still in
>> > the roster https://whimsy.apache.org/roster/ppmc/mxnet
>> >
>> > Tianqi
>> >
>> > On Thu, Nov 29, 2018 at 2:45 PM Sergey Kolychev <
>> > sergeykolychev.git...@gmail.com> wrote:
>> >
>> > > Hello there,
>> > > Could any mentors please help me understand why this happened and how to
>> > > revert this change?
>> > > Thanks in advance.
>> > > I'm in the process of helping to release version 1.4.0 of MXNet, and this
>> > > is likely related to the release branch I created and pushed today.
>> > > -- Forwarded message -
>> > > From: The Apache Software Foundation 
>> > > Date: Thu, Nov 29, 2018 at 2:31 PM
>> > > Subject: You've been removed from the The Apache Software Foundation
>> team
>> > > mxnet committers
>> > > To: Sergey Kolychev 
>> > >
>> > >
>> > > You’ve been removed from the mxnet committers team on the The Apache
>> > > Software Foundation organization.
>> > >
>> > > Cheers & Octocats,
>> > > GitHub Support
>> > >
>> >
>>
>


Re: Adding AMD CPU to CI

2018-11-29 Thread kellen sunderland
From my perspective we're developing a few features like mf16c and MKLDNN
integration specifically for Intel CPUs.  It wouldn't hurt to make sure
those changes also run properly on AMD CPUs.

On Thu, Nov 29, 2018, 3:38 PM Hao Jin wrote:

> I'm a bit confused about why we need extra functionality tests just for AMD
> CPUs. Don't AMD CPUs support roughly the same instruction sets as the
> Intel ones? In the very unlikely case that something working on Intel
> CPUs does not function on AMD CPUs (or vice versa), it would most
> likely be related to the underlying hardware implementation of the same
> ISA, for which we definitely do not have a good solution. So I don't think
> performing extra functional tests of the system on AMD CPUs
> adds any value.
> Hao
>
> On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu 
> wrote:
>
> > +1
> >
> > On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
> >
> > What are people's thoughts on having AMD machines tested on the CI?
> AMD
> > machines are now available on AWS.
> >
> > Best,
> > Alex
> >
> >
> >
>


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Steffen Rochel
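Pedro - OK. Please add the PR to the v1.4.x branch after it merges to master,
and please update the tracking page
<https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack>.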
Steffen

On Thu, Nov 29, 2018 at 3:00 PM Pedro Larroy 
wrote:

> PR is ready from my side and passes the tests, unless somebody raises
> any concerns it's good to go.
> On Thu, Nov 29, 2018 at 9:50 PM Steffen Rochel 
> wrote:
> >
> > Pedro - added  to 1.4.0 tracking list
> > <
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> >
> >
> > Do you have already ETA?
> > Steffen
> >
> > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > Hi all.
> > >
> > > There are two important issues / fixes that should go in the next
> > > release in my radar:
> > >
> > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > There is a bug in shape inference on CPU when not using MKL, also we
> > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > I'm finishing a fix for these issues in the above PR.
> > >
> > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > Setenv / getenv from multiple threads is not safe and is causing
> > > segfaults. This piece of code (the handlers in pthread_atfork) already
> > > caused a very difficult to diagnose hang in a previous release, where
> > > a fork inside cudnn would deadlock the engine.
> > >
> > > I would remove setenv from 2) as a mitigation, but we would need to
> > > check for regressions as we could be creating additional threads
> > > inside the engine.
> > >
> > > I would suggest that we address these two major issues before the next
> > > release.
> > >
> > > Pedro
> > >
> > >
> > >
> > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > wrote:
> > > >
> > > > Dear MXNet community,
> > > >
> > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > release.
> > > > Sergey Kolychev will be co-managing the release and providing help
> from
> > > the
> > > > committers side.
> > > > A release candidate will be cut on November 29, 2018 and voting will
> > > start
> > > > December 7, 2018. Release notes have been drafted here [1]. If you
> have
> > > any
> > > > additional features in progress and would like to include it in this
> > > > release, please assure they have been merged by November 27, 2018.
> > > Release
> > > > schedule is available here [2].
> > > >
> > > > Feel free to add any other comments/suggestions. Please help to
> review
> > > and
> > > > merge outstanding PR's and resolve issues impacting the quality of
> the
> > > > 1.4.0 release.
> > > >
> > > > Regards,
> > > >
> > > > Steffen
> > > >
> > > > [1]
> > > >
> > >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> > > >
> > > > [2]
> > >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Spoke too soon[1], looks like others have been adding Turing
> support as
> > > > > well (thanks to those helping with this).  I believe there's still
> a
> > > few
> > > > > changes we'd have to make to claim support though (mshadow CMake
> > > changes,
> > > > > PyPi package creation tweaks).
> > > > >
> > > > > 1:
> > > > >
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > > > >
> > > > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > > > > kellen.sunderl...@gmail.com> wrote:
> > > > >
> > > > > > Hey Steffen, I'd like to be able to merge this PR for version
> 1.4:
> > > > > > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes
> a
> > > > > > regression in master which causes incorrect feature vectors to be
> > > output
> > > > > > when using the TensorRT feature.  (Thanks to Nathalie for
> helping me
> > > > > track
> > > > > > down the root cause of the issue).   I'm currently blocked on a
> CI
> > > issue
> > > > > I
> > > > > > haven't seen before, but hope to have it resolved by EOW.
> > > > > >
> > > > > > One call-out I would make is that we currently don't support
> Turing
> > > > > > architecture (sm_75).  I've been slowly trying to add support,
> but I
> > > > > don't
> > > > > > think I'd have capacity to do this done by EOW.  Does anyone feel
> > > > > strongly
> > > > > > we need this in the 1.4 release?  From my perspective this will
> > > already
> > > > > be
> > > > > > a strong release without it.
> > > > > >
> > > > > > On Tue, Nov 20, 2018 at 

Re: You've been removed from the The Apache Software Foundation team mxnet committers

2018-11-29 Thread Marco de Abreu
Hello Sergey,

feel free to create the ticket yourself in that case. For immediate support,
you can go to the ASF infra channel on Slack.

Try to log in to Gitbox at Apache (don't have the link on hand). After
authentication, you should see three green check marks.

Could it be possible that you recently removed 2FA or changed your
password? That will trigger an automatic revocation of all permissions
until you have reauthenticated.

Best regards,
Marco

On Fri, Nov 30, 2018, 00:27 Steffen Rochel wrote:

> Dear Mentors - please file a ticket with Infra to restore Sergey's
> permissions as "The Apache Software Foundation team
> mxnet committers". As he is co-release manager for v1.4.0, it is
> time-sensitive to re-enable his write permissions on
> https://github.com/apache/incubator-mxnet.
>
> Thanks,
> Steffen
>
> On Thu, Nov 29, 2018 at 3:23 PM Tianqi Chen 
> wrote:
>
> > Hmm, this does not sound right. I have no idea what is going on. As far as
> > I know, only apache infra have the admin right to the org. You are still in
> > the roster https://whimsy.apache.org/roster/ppmc/mxnet
> >
> > Tianqi
> >
> > On Thu, Nov 29, 2018 at 2:45 PM Sergey Kolychev <
> > sergeykolychev.git...@gmail.com> wrote:
> >
> > > Hello there,
> > > Could any mentors please help me understand why this happened and how to
> > > revert this change?
> > > Thanks in advance.
> > > I'm in the process of helping to release version 1.4.0 of MXNet and it is
> > > likely related to the release branch I created and pushed today.
> > > -- Forwarded message -
> > > From: The Apache Software Foundation 
> > > Date: Thu, Nov 29, 2018 at 2:31 PM
> > > Subject: You've been removed from the The Apache Software Foundation
> team
> > > mxnet committers
> > > To: Sergey Kolychev 
> > >
> > >
> > > You’ve been removed from the mxnet committers team on the The Apache
> > > Software Foundation organization.
> > >
> > > Cheers & Octocats,
> > > GitHub Support
> > >
> >
>


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Steffen Rochel
Qing - ok. Please merge to the v1.4.x branch after it is merged to master.
Steffen

On Thu, Nov 29, 2018 at 3:17 PM Qing Lan  wrote:

> Hi all,
> I have a critical bug-fix PR
> https://github.com/apache/incubator-mxnet/pull/13330 that essentially fixes
> the problems with supporting inference with different shapes in Scala/Java
> (introduced in v1.1). I would like to request to cherry-pick this one into
> 1.4.
>
> Thanks,
> Qing
>
> On 11/29/18, 3:00 PM, "Pedro Larroy" 
> wrote:
>
> PR is ready from my side and passes the tests, unless somebody raises
> any concerns it's good to go.
> On Thu, Nov 29, 2018 at 9:50 PM Steffen Rochel <
> steffenroc...@gmail.com> wrote:
> >
> > Pedro - added  to 1.4.0 tracking list
> > <
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> >
> >
> > Do you have already ETA?
> > Steffen
> >
> > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > Hi all.
> > >
> > > There are two important issues / fixes that should go in the next
> > > release in my radar:
> > >
> > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > There is a bug in shape inference on CPU when not using MKL, also
> we
> > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > I'm finishing a fix for these issues in the above PR.
> > >
> > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > Setenv / getenv from multiple threads is not safe and is causing
> > > segfaults. This piece of code (the handlers in pthread_atfork)
> already
> > > caused a very difficult to diagnose hang in a previous release,
> where
> > > a fork inside cudnn would deadlock the engine.
> > >
> > > I would remove setenv from 2) as a mitigation, but we would need to
> > > check for regressions as we could be creating additional threads
> > > inside the engine.
> > >
> > > I would suggest that we address these two major issues before the
> next
> > > release.
> > >
> > > Pedro
> > >
> > >
> > >
> > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > wrote:
> > > >
> > > > Dear MXNet community,
> > > >
> > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > release.
> > > > Sergey Kolychev will be co-managing the release and providing
> help from
> > > the
> > > > committers side.
> > > > A release candidate will be cut on November 29, 2018 and voting
> will
> > > start
> > > > December 7, 2018. Release notes have been drafted here [1]. If
> you have
> > > any
> > > > additional features in progress and would like to include it in
> this
> > > > release, please assure they have been merged by November 27,
> 2018.
> > > Release
> > > > schedule is available here [2].
> > > >
> > > > Feel free to add any other comments/suggestions. Please help to
> review
> > > and
> > > > merge outstanding PR's and resolve issues impacting the quality
> of the
> > > > 1.4.0 release.
> > > >
> > > > Regards,
> > > >
> > > > Steffen
> > > >
> > > > [1]
> > > >
> > >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> > > >
> > > > [2]
> > >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Spoke too soon[1], looks like others have been adding Turing
> support as
> > > > > well (thanks to those helping with this).  I believe there's
> still a
> > > few
> > > > > changes we'd have to make to claim support though (mshadow
> CMake
> > > changes,
> > > > > PyPi package creation tweaks).
> > > > >
> > > > > 1:
> > > > >
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > > > >
> > > > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > > > > kellen.sunderl...@gmail.com> wrote:
> > > > >
> > > > > > Hey Steffen, I'd like to be able to merge this PR for
> version 1.4:
> > > > > > https://github.com/apache/incubator-mxnet/pull/13310 . It
> fixes a
> > > > > > regression in master which causes incorrect feature vectors
> to be
> > > output
> > > > > > when using the TensorRT feature.  (Thanks to Nathalie for
> helping me
> > > > > track
> > > > > > down the root cause of the issue).   I'm currently 

Re: Adding AMD CPU to CI

2018-11-29 Thread Hao Jin
I'm a bit confused about why we need extra functionality tests just for AMD
CPUs. Don't AMD CPUs support roughly the same instruction sets as the
Intel ones? In the very unlikely case that something working on Intel
CPUs does not function on AMD CPUs (or vice versa), it would most
likely be related to the underlying hardware implementation of the same
ISA, for which we definitely do not have a good solution. So I don't think
running extra functional tests on AMD CPUs adds any value.
Hao

On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu 
wrote:

> +1
>
> On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
>
> What are people's thoughts on having AMD machines tested on the CI? AMD
> machines are now available on AWS.
>
> Best,
> Alex
>
>
>


Re: Adding AMD CPU to CI

2018-11-29 Thread Tianqi Chen
I am not sure if it is necessary, as AMD CPUs also support x86, and it
would not add additional information.

Tianqi

On Thu, Nov 29, 2018 at 3:35 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> +1
>
> On Thu, Nov 29, 2018 at 2:50 PM Seth, Manu 
> wrote:
>
> > +1
> >
> > On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
> >
> > What are people's thoughts on having AMD machines tested on the CI?
> AMD
> > machines are now available on AWS.
> >
> > Best,
> > Alex
> >
> >
> >
>


Re: Adding AMD CPU to CI

2018-11-29 Thread kellen sunderland
+1

On Thu, Nov 29, 2018 at 2:50 PM Seth, Manu 
wrote:

> +1
>
> On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
>
> What are people's thoughts on having AMD machines tested on the CI? AMD
> machines are now available on AWS.
>
> Best,
> Alex
>
>
>


Re: You've been removed from the The Apache Software Foundation team mxnet committers

2018-11-29 Thread Tianqi Chen
Hmm, this does not sound right. I have no idea what is going on. As far as
I know, only apache infra have the admin right to the org. You are still in
the roster https://whimsy.apache.org/roster/ppmc/mxnet

Tianqi

On Thu, Nov 29, 2018 at 2:45 PM Sergey Kolychev <
sergeykolychev.git...@gmail.com> wrote:

> Hello there,
> Could any mentors please help me understand why this happened and how to
> revert this change?
> Thanks in advance.
> I'm in the process of helping to release version 1.4.0 of MXNet and it is
> likely related to the release branch I created and pushed today.
> -- Forwarded message -
> From: The Apache Software Foundation 
> Date: Thu, Nov 29, 2018 at 2:31 PM
> Subject: You've been removed from the The Apache Software Foundation team
> mxnet committers
> To: Sergey Kolychev 
>
>
> You’ve been removed from the mxnet committers team on the The Apache
> Software Foundation organization.
>
> Cheers & Octocats,
> GitHub Support
>


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Qing Lan
Hi all,
I have a critical bug-fix PR
https://github.com/apache/incubator-mxnet/pull/13330 that essentially fixes the
problems with supporting inference with different shapes in Scala/Java
(introduced in v1.1). I would like to request to cherry-pick this one into 1.4.

Thanks,
Qing

On 11/29/18, 3:00 PM, "Pedro Larroy"  wrote:

PR is ready from my side and passes the tests, unless somebody raises
any concerns it's good to go.
On Thu, Nov 29, 2018 at 9:50 PM Steffen Rochel  
wrote:
>
> Pedro - added  to 1.4.0 tracking list
> 

>
> Do you have already ETA?
> Steffen
>
> On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy 

> wrote:
>
> > Hi all.
> >
> > There are two important issues / fixes that should go in the next
> > release in my radar:
> >
> > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > There is a bug in shape inference on CPU when not using MKL, also we
> > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > I'm finishing a fix for these issues in the above PR.
> >
> > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > We are seeing crashes due to unsafe setenv in multithreaded code.
> > Setenv / getenv from multiple threads is not safe and is causing
> > segfaults. This piece of code (the handlers in pthread_atfork) already
> > caused a very difficult to diagnose hang in a previous release, where
> > a fork inside cudnn would deadlock the engine.
> >
> > I would remove setenv from 2) as a mitigation, but we would need to
> > check for regressions as we could be creating additional threads
> > inside the engine.
> >
> > I would suggest that we address these two major issues before the next
> > release.
> >
> > Pedro
> >
> >
> >
> > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel 

> > wrote:
> > >
> > > Dear MXNet community,
> > >
> > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > release.
> > > Sergey Kolychev will be co-managing the release and providing help 
from
> > the
> > > committers side.
> > > A release candidate will be cut on November 29, 2018 and voting will
> > start
> > > December 7, 2018. Release notes have been drafted here [1]. If you 
have
> > any
> > > additional features in progress and would like to include it in this
> > > release, please assure they have been merged by November 27, 2018.
> > Release
> > > schedule is available here [2].
> > >
> > > Feel free to add any other comments/suggestions. Please help to review
> > and
> > > merge outstanding PR's and resolve issues impacting the quality of the
> > > 1.4.0 release.
> > >
> > > Regards,
> > >
> > > Steffen
> > >
> > > [1]
> > >
> > 
https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> > >
> > > [2]
> > 
https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> > >
> > >
> > >
> > >
> > > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Spoke too soon[1], looks like others have been adding Turing 
support as
> > > > well (thanks to those helping with this).  I believe there's still a
> > few
> > > > changes we'd have to make to claim support though (mshadow CMake
> > changes,
> > > > PyPi package creation tweaks).
> > > >
> > > > 1:
> > > >
> > > >
> > 
https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > > >
> > > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > > > > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > > > > regression in master which causes incorrect feature vectors to be
> > output
> > > > > when using the TensorRT feature.  (Thanks to Nathalie for helping 
me
> > > > track
> > > > > down the root cause of the issue).   I'm currently blocked on a CI
> > issue
> > > > I
> > > > > haven't seen before, but hope to have it resolved by EOW.
> > > > >
> > > > > One call-out I would make is that we currently don't support 
Turing
> > > > > architecture (sm_75).  I've been slowly trying to add support, 
but I
> > > > don't
> > > > > think I'd have capacity to do this done by EOW.  Does anyone feel
> > > > strongly
> > > > > we need this in the 1.4 release?  From my 

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Pedro Larroy
The PR is ready from my side and passes the tests; unless somebody raises
any concerns, it's good to go.
On Thu, Nov 29, 2018 at 9:50 PM Steffen Rochel  wrote:
>
> Pedro - added  to 1.4.0 tracking list
> 
>
> Do you have already ETA?
> Steffen
>
> On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy 
> wrote:
>
> > Hi all.
> >
> > There are two important issues / fixes that should go in the next
> > release in my radar:
> >
> > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > There is a bug in shape inference on CPU when not using MKL, also we
> > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > I'm finishing a fix for these issues in the above PR.
> >
> > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > We are seeing crashes due to unsafe setenv in multithreaded code.
> > Setenv / getenv from multiple threads is not safe and is causing
> > segfaults. This piece of code (the handlers in pthread_atfork) already
> > caused a very difficult to diagnose hang in a previous release, where
> > a fork inside cudnn would deadlock the engine.
> >
> > I would remove setenv from 2) as a mitigation, but we would need to
> > check for regressions as we could be creating additional threads
> > inside the engine.
> >
> > I would suggest that we address these two major issues before the next
> > release.
> >
> > Pedro
> >
> >
> >
> > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel 
> > wrote:
> > >
> > > Dear MXNet community,
> > >
> > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > release.
> > > Sergey Kolychev will be co-managing the release and providing help from
> > the
> > > committers side.
> > > A release candidate will be cut on November 29, 2018 and voting will
> > start
> > > December 7, 2018. Release notes have been drafted here [1]. If you have
> > any
> > > additional features in progress and would like to include it in this
> > > release, please assure they have been merged by November 27, 2018.
> > Release
> > > schedule is available here [2].
> > >
> > > Feel free to add any other comments/suggestions. Please help to review
> > and
> > > merge outstanding PR's and resolve issues impacting the quality of the
> > > 1.4.0 release.
> > >
> > > Regards,
> > >
> > > Steffen
> > >
> > > [1]
> > >
> > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> > >
> > > [2]
> > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> > >
> > >
> > >
> > >
> > > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Spoke too soon[1], looks like others have been adding Turing support as
> > > > well (thanks to those helping with this).  I believe there's still a
> > few
> > > > changes we'd have to make to claim support though (mshadow CMake
> > changes,
> > > > PyPi package creation tweaks).
> > > >
> > > > 1:
> > > >
> > > >
> > https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > > >
> > > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > > > > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > > > > regression in master which causes incorrect feature vectors to be
> > output
> > > > > when using the TensorRT feature.  (Thanks to Nathalie for helping me
> > > > track
> > > > > down the root cause of the issue).   I'm currently blocked on a CI
> > issue
> > > > I
> > > > > haven't seen before, but hope to have it resolved by EOW.
> > > > >
> > > > > One call-out I would make is that we currently don't support Turing
> > > > > architecture (sm_75).  I've been slowly trying to add support, but I
> > > > don't
> > > > > think I'd have capacity to do this done by EOW.  Does anyone feel
> > > > strongly
> > > > > we need this in the 1.4 release?  From my perspective this will
> > already
> > > > be
> > > > > a strong release without it.
> > > > >
> > > > > On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel <
> > steffenroc...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Thanks Patrick, lets target to get the PR's merged this week.
> > > > >>
> > > > >> Call for contributions from the community: Right now we have 10 PR
> > > > >> awaiting
> > > > >> merge
> > > > >> <
> > > > >>
> > > >
> > https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+
> > > > >> >
> > > > >> and
> > > > >> we have 61 open PR awaiting review.
> > > > >> <
> > > > >>
> > > >
> > https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-review
> > > > >> >
> > > > >> I would 

Re: Adding AMD CPU to CI

2018-11-29 Thread Seth, Manu
+1

On 11/29/18, 2:39 PM, "Alex Zai"  wrote:

What are people's thoughts on having AMD machines tested on the CI? AMD
machines are now available on AWS.

Best,
Alex




Fwd: You've been removed from the The Apache Software Foundation team mxnet committers

2018-11-29 Thread Sergey Kolychev
Hello there,
Could any mentors please help me understand why this happened and how to
revert this change?
Thanks in advance.
I'm in the process of helping to release version 1.4.0 of MXNet and it is
likely related to the release branch I created and pushed today.
-- Forwarded message -
From: The Apache Software Foundation 
Date: Thu, Nov 29, 2018 at 2:31 PM
Subject: You've been removed from the The Apache Software Foundation team
mxnet committers
To: Sergey Kolychev 


You’ve been removed from the mxnet committers team on the The Apache
Software Foundation organization.

Cheers & Octocats,
GitHub Support


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Steffen Rochel
All - Sergey has created the v1.4.x branch and I opened the first PR:
https://github.com/apache/incubator-mxnet/pull/13469

Please add critical - and only critical - bug fixes to the v1.4.x branch and
add me as approver.

Regards,
Steffen

On Thu, Nov 29, 2018 at 2:17 PM Lin Yuan  wrote:

> https://github.com/apache/incubator-mxnet/pull/13452 is needed in 1.4.0 to
> support Horovod integration project.
>
> Thanks!
>
> Lin
>
>
> On Thu, Nov 29, 2018 at 1:40 PM Davydenko, Denis <
> dzianis.davydze...@gmail.com> wrote:
>
> > I suggest to include this issue into tracked ones for the release:
> > https://github.com/apache/incubator-mxnet/issues/12255. It has proven to
> > be a problem with MXNet start up time and it will cause even more
> problems
> > down the line with Elastic Training, EIA where MXNet is a commodity
> rather
> > than statically running process. Also it already causes noticeable issues
> > with MMS (MXNet Model Server [1]). MMS users already noticed significant
> > lag with MMS start up time, especially on beefy instances like C5.18xl
> with
> > 72 vCPUs. MMS spins up multiple MXNet instances during its start up to
> > ensure full utilization of CPU or GPU resources on the host. By default
> it
> > spins up as many MXNet instances as there are cores (either CPU or GPU
> > cores) and the bigger the host the more MXNet instances are spun up. And
> > the more MXNet instances spun up - the more each instance takes time to
> > start. For example, on C5.4xl users reported waiting for as long as 2
> > minutes to have just 8 MXNet instances spun up with MXNet 1.3. Same
> efforts
> > with MXNet 1.1 take less than 0.5 sec.
> >
> > This is quite a significant regression in MXNet when it comes to start up
> > experience. I suggest to consider this as a blocker for 1.4.
> >
> > [1] https://github.com/awslabs/mxnet-model-server
> >
> > On 11/29/18, 12:51 PM, "Steffen Rochel" 
> wrote:
> >
> > added to 1.4.0 tracking list
> > <
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> > >
> > .
> > Steffen
> >
> > On Thu, Nov 29, 2018 at 9:32 AM Zheng, Da  >
> > wrote:
> >
> > > Hello Steffen,
> > >
> > > Can this bug be fixed in 1.4.0 release? It's a significant
> > performance
> > > regression on sparse matrix multiplication.
> > > https://github.com/apache/incubator-mxnet/issues/13449
> > >
> > > Thanks,
> > > Da
> > >
> > > On 11/26/18, 6:42 AM, "Steffen Rochel" 
> > wrote:
> > >
> > > Dear MXNet community,
> > >
> > > I will be the release manager for the upcoming Apache MXNet
> 1.4.0
> > > release.
> > > Sergey Kolychev will be co-managing the release and providing
> > help
> > > from the
> > > committers side.
> > > A release candidate will be cut on November 29, 2018 and voting
> > will
> > > start
> > > December 7, 2018. Release notes have been drafted here [1]. If
> > you
> > > have any
> > > additional features in progress and would like to include it in
> > this
> > > release, please assure they have been merged by November 27,
> > 2018.
> > > Release
> > > schedule is available here [2].
> > >
> > > Feel free to add any other comments/suggestions. Please help to
> > review
> > > and
> > > merge outstanding PR's and resolve issues impacting the quality
> > of the
> > > 1.4.0 release.
> > >
> > > Regards,
> > >
> > > Steffen
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> > >
> > > [2]
> > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> > >
> > >
> > >
> > >
> > > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Spoke too soon[1], looks like others have been adding Turing
> > support
> > > as
> > > > well (thanks to those helping with this).  I believe there's
> > still a
> > > few
> > > > changes we'd have to make to claim support though (mshadow
> > CMake
> > > changes,
> > > > PyPi package creation tweaks).
> > > >
> > > > 1:
> > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > > >
> > > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Hey Steffen, I'd like to be able to merge this PR for
> > version 1.4:
> > > > > https://github.com/apache/incubator-mxnet/pull/13310 . It
> > fixes a
> > > > > regression in master 

Adding AMD CPU to CI

2018-11-29 Thread Alex Zai
What are people's thoughts on having AMD machines tested on the CI? AMD
machines are now available on AWS.

Best,
Alex


Re: Adding AMD CPU to CI

2018-11-29 Thread Anirudh Subramanian
+1

On Thu, Nov 29, 2018 at 2:38 PM Alex Zai  wrote:

> What are people's thoughts on having AMD machines tested on the CI? AMD
> machines are now available on AWS.
>
> Best,
> Alex
>


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Steffen Rochel
Denis - added.

On Thu, Nov 29, 2018 at 1:40 PM Davydenko, Denis <
dzianis.davydze...@gmail.com> wrote:

> I suggest to include this issue into tracked ones for the release:
> https://github.com/apache/incubator-mxnet/issues/12255. It has proven to
> be a problem with MXNet start up time and it will cause even more problems
> down the line with Elastic Training, EIA where MXNet is a commodity rather
> than statically running process. Also it already causes noticeable issues
> with MMS (MXNet Model Server [1]). MMS users already noticed significant
> lag with MMS start up time, especially on beefy instances like C5.18xl with
> 72 vCPUs. MMS spins up multiple MXNet instances during its start up to
> ensure full utilization of CPU or GPU resources on the host. By default it
> spins up as many MXNet instances as there are cores (either CPU or GPU
> cores) and the bigger the host the more MXNet instances are spun up. And
> the more MXNet instances spun up - the more each instance takes time to
> start. For example, on C5.4xl users reported waiting for as long as 2
> minutes to have just 8 MXNet instances spun up with MXNet 1.3. Same efforts
> with MXNet 1.1 take less than 0.5 sec.
>
> This is quite a significant regression in MXNet when it comes to start up
> experience. I suggest to consider this as a blocker for 1.4.
>
> [1] https://github.com/awslabs/mxnet-model-server
>
> On 11/29/18, 12:51 PM, "Steffen Rochel"  wrote:
>
> added to 1.4.0 tracking list
> <
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> >
> .
> Steffen
>
> On Thu, Nov 29, 2018 at 9:32 AM Zheng, Da 
> wrote:
>
> > Hello Steffen,
> >
> > Can this bug be fixed in 1.4.0 release? It's a significant
> performance
> > regression on sparse matrix multiplication.
> > https://github.com/apache/incubator-mxnet/issues/13449
> >
> > Thanks,
> > Da
> >
> > On 11/26/18, 6:42 AM, "Steffen Rochel" 
> wrote:
> >
> > Dear MXNet community,
> >
> > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > release.
> > Sergey Kolychev will be co-managing the release and providing
> help
> > from the
> > committers side.
> > A release candidate will be cut on November 29, 2018 and voting
> will
> > start
> > December 7, 2018. Release notes have been drafted here [1]. If
> you
> > have any
> > additional features in progress and would like to include it in
> this
> > release, please assure they have been merged by November 27,
> 2018.
> > Release
> > schedule is available here [2].
> >
> > Feel free to add any other comments/suggestions. Please help to
> review
> > and
> > merge outstanding PR's and resolve issues impacting the quality
> of the
> > 1.4.0 release.
> >
> > Regards,
> >
> > Steffen
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> >
> > [2]
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> >
> >
> >
> >
> > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Spoke too soon[1], looks like others have been adding Turing
> support
> > as
> > > well (thanks to those helping with this).  I believe there's
> still a
> > few
> > > changes we'd have to make to claim support though (mshadow
> CMake
> > changes,
> > > PyPi package creation tweaks).
> > >
> > > 1:
> > >
> > >
> >
> https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > >
> > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Hey Steffen, I'd like to be able to merge this PR for
> version 1.4:
> > > > https://github.com/apache/incubator-mxnet/pull/13310 . It
> fixes a
> > > > regression in master which causes incorrect feature vectors
> to be
> > output
> > > > when using the TensorRT feature.  (Thanks to Nathalie for
> helping
> > me
> > > track
> > > > down the root cause of the issue).   I'm currently blocked
> on a CI
> > issue
> > > I
> > > > haven't seen before, but hope to have it resolved by EOW.
> > > >
> > > > One call-out I would make is that we currently don't support
> Turing
> > > > architecture (sm_75).  I've been slowly trying to add
> support, but
> > I
> > > don't
> > > > think I'd have capacity to do this done by EOW.  Does anyone
> feel
>   

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Lin Yuan
https://github.com/apache/incubator-mxnet/pull/13452 is needed in 1.4.0 to
support the Horovod integration project.

Thanks!

Lin


On Thu, Nov 29, 2018 at 1:40 PM Davydenko, Denis <
dzianis.davydze...@gmail.com> wrote:

> I suggest to include this issue into tracked ones for the release:
> https://github.com/apache/incubator-mxnet/issues/12255. It has proven to
> be a problem with MXNet start up time and it will cause even more problems
> down the line with Elastic Training, EIA where MXNet is a commodity rather
> than statically running process. Also it already causes noticeable issues
> with MMS (MXNet Model Server [1]). MMS users already noticed significant
> lag with MMS start up time, especially on beefy instances like C5.18xl with
> 72 vCPUs. MMS spins up multiple MXNet instances during its start up to
> ensure full utilization of CPU or GPU resources on the host. By default it
> spins up as many MXNet instances as there are cores (either CPU or GPU
> cores) and the bigger the host the more MXNet instances are spun up. And
> the more MXNet instances spun up - the more each instance takes time to
> start. For example, on C5.4xl users reported waiting for as long as 2
> minutes to have just 8 MXNet instances spun up with MXNet 1.3. Same efforts
> with MXNet 1.1 take less than 0.5 sec.
>
> This is quite a significant regression in MXNet when it comes to start up
> experience. I suggest to consider this as a blocker for 1.4.
>
> [1] https://github.com/awslabs/mxnet-model-server
>
> On 11/29/18, 12:51 PM, "Steffen Rochel"  wrote:
>
> added to 1.4.0 tracking list
> <
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> >
> .
> Steffen
>
> On Thu, Nov 29, 2018 at 9:32 AM Zheng, Da 
> wrote:
>
> > Hello Steffen,
> >
> > Can this bug be fixed in 1.4.0 release? It's a significant
> performance
> > regression on sparse matrix multiplication.
> > https://github.com/apache/incubator-mxnet/issues/13449
> >
> > Thanks,
> > Da
> >
> > On 11/26/18, 6:42 AM, "Steffen Rochel" 
> wrote:
> >
> > Dear MXNet community,
> >
> > I will be the release manager for the upcoming Apache MXNet 1.4.0 release.
> > Sergey Kolychev will be co-managing the release and providing help from the
> > committers' side.
> > A release candidate will be cut on November 29, 2018 and voting will start
> > December 7, 2018. Release notes have been drafted here [1]. If you have any
> > additional features in progress and would like to include them in this
> > release, please ensure they have been merged by November 27, 2018. The
> > release schedule is available here [2].
> >
> > Feel free to add any other comments/suggestions. Please help to review and
> > merge outstanding PRs and resolve issues impacting the quality of the
> > 1.4.0 release.
> >
> > Regards,
> >
> > Steffen
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> >
> > [2]
> > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> >
> >
> >
> >
> > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Spoke too soon[1], looks like others have been adding Turing support as
> > > well (thanks to those helping with this).  I believe there's still a few
> > > changes we'd have to make to claim support though (mshadow CMake changes,
> > > PyPi package creation tweaks).
> > >
> > > 1:
> > > https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > >
> > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > > > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > > > regression in master which causes incorrect feature vectors to be output
> > > > when using the TensorRT feature.  (Thanks to Nathalie for helping me track
> > > > down the root cause of the issue).   I'm currently blocked on a CI issue I
> > > > haven't seen before, but hope to have it resolved by EOW.
> > > >
> > > > One call-out I would make is that we currently don't support Turing
> > > > architecture (sm_75).  I've been slowly trying to add support, but
> 

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Davydenko, Denis
I suggest adding this issue to the ones tracked for the release:
https://github.com/apache/incubator-mxnet/issues/12255. It has proven to be a
problem for MXNet start-up time, and it will cause even more problems down the
line with Elastic Training and EIA, where MXNet is a commodity rather than a
statically running process. It also already causes noticeable issues with MMS
(MXNet Model Server [1]). MMS users have noticed significant lag in MMS
start-up time, especially on beefy instances like C5.18xl with 72 vCPUs. MMS
spins up multiple MXNet instances during its start-up to ensure full
utilization of CPU or GPU resources on the host. By default it spins up as many
MXNet instances as there are cores (either CPU or GPU cores), so the bigger the
host, the more MXNet instances are spun up. And the more MXNet instances are
spun up, the longer each instance takes to start. For example, on C5.4xl users
reported waiting as long as 2 minutes to have just 8 MXNet instances spun
up with MXNet 1.3. The same effort with MXNet 1.1 takes less than 0.5 sec.

This is quite a significant regression in the MXNet start-up experience. I
suggest considering this a blocker for 1.4.

[1] https://github.com/awslabs/mxnet-model-server 
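
The start-up lag described above could be measured with a small script along
these lines. This is only a rough sketch, not MMS code: the function name
time_parallel_startup and the worker counts are made up for illustration, and
WORKER_CMD is a placeholder that starts an empty Python interpreter. To look
at the reported regression one would presumably replace "pass" with
"import mxnet" on a host where MXNet is installed.

```python
import subprocess
import sys
import time

# Placeholder worker: an interpreter that exits immediately.
# Assumption: replacing "pass" with "import mxnet" would time MXNet start-up.
WORKER_CMD = [sys.executable, "-c", "pass"]


def time_parallel_startup(num_workers, cmd=WORKER_CMD):
    """Launch num_workers processes at once; return seconds until all exit."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(num_workers)]
    for proc in procs:
        proc.wait()
        if proc.returncode != 0:
            raise RuntimeError(f"worker exited with code {proc.returncode}")
    return time.perf_counter() - start


if __name__ == "__main__":
    # If per-worker time grows with the worker count, start-up does not scale.
    for n in (1, 2, 4, 8):
        elapsed = time_parallel_startup(n)
        print(f"{n} workers: {elapsed:.2f} s total, {elapsed / n:.2f} s per worker")
```

Comparing the per-worker figure across worker counts, and across MXNet
versions, would show whether each instance really takes longer to start as
more instances are launched together.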

On 11/29/18, 12:51 PM, "Steffen Rochel"  wrote:

Added to the 1.4.0 tracking list.
Steffen

On Thu, Nov 29, 2018 at 9:32 AM Zheng, Da  wrote:

> Hello Steffen,
>
> Can this bug be fixed in 1.4.0 release? It's a significant performance
> regression on sparse matrix multiplication.
> https://github.com/apache/incubator-mxnet/issues/13449
>
> Thanks,
> Da
>
> On 11/26/18, 6:42 AM, "Steffen Rochel"  wrote:
>
> Dear MXNet community,
>
> I will be the release manager for the upcoming Apache MXNet 1.4.0 release.
> Sergey Kolychev will be co-managing the release and providing help from the
> committers' side.
> A release candidate will be cut on November 29, 2018 and voting will start
> December 7, 2018. Release notes have been drafted here [1]. If you have any
> additional features in progress and would like to include them in this
> release, please ensure they have been merged by November 27, 2018. The
> release schedule is available here [2].
>
> Feel free to add any other comments/suggestions. Please help to review and
> merge outstanding PRs and resolve issues impacting the quality of the
> 1.4.0 release.
>
> Regards,
>
> Steffen
>
> [1]
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
>
> [2]
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
>
>
>
>
> On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Spoke too soon[1], looks like others have been adding Turing support as
> > well (thanks to those helping with this).  I believe there's still a few
> > changes we'd have to make to claim support though (mshadow CMake changes,
> > PyPi package creation tweaks).
> >
> > 1:
> > https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> >
> > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > > regression in master which causes incorrect feature vectors to be output
> > > when using the TensorRT feature.  (Thanks to Nathalie for helping me track
> > > down the root cause of the issue).   I'm currently blocked on a CI issue I
> > > haven't seen before, but hope to have it resolved by EOW.
> > >
> > > One call-out I would make is that we currently don't support Turing
> > > architecture (sm_75).  I've been slowly trying to add support, but I don't
> > > think I'd have capacity to get this done by EOW.  Does anyone feel strongly
> > > we need this in the 1.4 release?  From my perspective this will already be
> > > a strong release without it.
> > >
> > > On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > wrote:
> > >
> > >> Thanks Patrick, lets 

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Steffen Rochel
Added to the 1.4.0 tracking list.
Steffen

On Thu, Nov 29, 2018 at 9:32 AM Zheng, Da  wrote:

> Hello Steffen,
>
> Can this bug be fixed in 1.4.0 release? It's a significant performance
> regression on sparse matrix multiplication.
> https://github.com/apache/incubator-mxnet/issues/13449
>
> Thanks,
> Da
>
> On 11/26/18, 6:42 AM, "Steffen Rochel"  wrote:
>
> Dear MXNet community,
>
> I will be the release manager for the upcoming Apache MXNet 1.4.0 release.
> Sergey Kolychev will be co-managing the release and providing help from the
> committers' side.
> A release candidate will be cut on November 29, 2018 and voting will start
> December 7, 2018. Release notes have been drafted here [1]. If you have any
> additional features in progress and would like to include them in this
> release, please ensure they have been merged by November 27, 2018. The
> release schedule is available here [2].
>
> Feel free to add any other comments/suggestions. Please help to review and
> merge outstanding PRs and resolve issues impacting the quality of the
> 1.4.0 release.
>
> Regards,
>
> Steffen
>
> [1]
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
>
> [2]
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
>
>
>
>
> On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Spoke too soon[1], looks like others have been adding Turing support as
> > well (thanks to those helping with this).  I believe there's still a few
> > changes we'd have to make to claim support though (mshadow CMake changes,
> > PyPi package creation tweaks).
> >
> > 1:
> > https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> >
> > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > > regression in master which causes incorrect feature vectors to be output
> > > when using the TensorRT feature.  (Thanks to Nathalie for helping me track
> > > down the root cause of the issue).   I'm currently blocked on a CI issue I
> > > haven't seen before, but hope to have it resolved by EOW.
> > >
> > > One call-out I would make is that we currently don't support Turing
> > > architecture (sm_75).  I've been slowly trying to add support, but I don't
> > > think I'd have capacity to get this done by EOW.  Does anyone feel strongly
> > > we need this in the 1.4 release?  From my perspective this will already be
> > > a strong release without it.
> > >
> > > On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > wrote:
> > >
> > >> Thanks Patrick, let's aim to get the PRs merged this week.
> > >>
> > >> Call for contributions from the community: Right now we have 10 PRs
> > >> awaiting merge
> > >> <https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+>
> > >> and 61 open PRs awaiting review
> > >> <https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-review>.
> > >> I would appreciate it if you all could help review the open PRs and the
> > >> committers could drive the merges before the code freeze for 1.4.0.
> > >>
> > >> The contributors on the Java API are making progress, but not all
> > >> performance issues are resolved. With some luck it should be possible to
> > >> code freeze towards the end of this week.
> > >>
> > >> Are there other critical features/bugs/PRs you think need to be included
> > >> in 1.4.0? If so, please communicate as soon as possible.
> > >>
> > >> Regards,
> > >> Steffen
> > >>
> > >> On Mon, Nov 19, 2018 at 8:26 PM Zhao, Patric <
> patric.z...@intel.com>
> > >> wrote:
> > >>
> > >> > Thanks, Steffen. I think there is NO open issue blocking MKLDNN from
> > >> > going GA now.
> > >> >
> > >> > BTW, several quantization-related PRs (#13297, #13260) are under
> > >> > review and I think they can be merged this week.
> > >> >
> > >> > Thanks,
> > >> >
> > >> > --Patric
> > >> >
> > >> >
> > >> > > -Original Message-
> > >> > > From: Steffen 

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Steffen Rochel
Pedro - added to the 1.4.0 tracking list.

Do you already have an ETA?
Steffen

On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy 
wrote:

> Hi all.
>
> There are two important issues / fixes on my radar that should go in the
> next release:
>
> 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> There is a bug in shape inference on CPU when not using MKL; also, we
> are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> I'm finishing a fix for these issues in the above PR.
>
> 2) https://github.com/apache/incubator-mxnet/issues/13438
> We are seeing crashes due to unsafe setenv in multithreaded code.
> Setenv / getenv from multiple threads is not safe and is causing
> segfaults. This piece of code (the handlers in pthread_atfork) already
> caused a very difficult-to-diagnose hang in a previous release, where
> a fork inside cudnn would deadlock the engine.
>
> I would remove setenv from 2) as a mitigation, but we would need to
> check for regressions as we could be creating additional threads
> inside the engine.
>
> I would suggest that we address these two major issues before the next
> release.
>
> Pedro
>
>
>
> On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel 
> wrote:
> >
> > Dear MXNet community,
> >
> > I will be the release manager for the upcoming Apache MXNet 1.4.0
> release.
> > Sergey Kolychev will be co-managing the release and providing help from
> the
> > committers side.
> > A release candidate will be cut on November 29, 2018 and voting will
> start
> > December 7, 2018. Release notes have been drafted here [1]. If you have
> any
> > additional features in progress and would like to include it in this
> > release, please assure they have been merged by November 27, 2018.
> Release
> > schedule is available here [2].
> >
> > Feel free to add any other comments/suggestions. Please help to review
> and
> > merge outstanding PR's and resolve issues impacting the quality of the
> > 1.4.0 release.
> >
> > Regards,
> >
> > Steffen
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> >
> > [2]
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> >
> >
> >
> >
> > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Spoke too soon[1], looks like others have been adding Turing support as
> > > well (thanks to those helping with this).  I believe there's still a
> few
> > > changes we'd have to make to claim support though (mshadow CMake
> changes,
> > > PyPi package creation tweaks).
> > >
> > > 1:
> > >
> > >
> https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > >
> > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > > > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > > > regression in master which causes incorrect feature vectors to be
> output
> > > > when using the TensorRT feature.  (Thanks to Nathalie for helping me
> > > track
> > > > down the root cause of the issue).   I'm currently blocked on a CI
> issue
> > > I
> > > > haven't seen before, but hope to have it resolved by EOW.
> > > >
> > > > One call-out I would make is that we currently don't support Turing
> > > > architecture (sm_75).  I've been slowly trying to add support, but I
> > > don't
> > > > think I'd have capacity to do this done by EOW.  Does anyone feel
> > > strongly
> > > > we need this in the 1.4 release?  From my perspective this will
> already
> > > be
> > > > a strong release without it.
> > > >
> > > > On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > > wrote:
> > > >
> > > >> Thanks Patrick, lets target to get the PR's merged this week.
> > > >>
> > > >> Call for contributions from the community: Right now we have 10 PR
> > > >> awaiting
> > > >> merge
> > > >> <
> > > >>
> > >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+
> > > >> >
> > > >> and
> > > >> we have 61 open PR awaiting review.
> > > >> <
> > > >>
> > >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-review
> > > >> >
> > > >> I would appreciate if you all can help to review the open PR and the
> > > >> committers can drive the merge before code freeze for 1.4.0.
> > > >>
> > > >> The contributors on the Java API are making progress, but not all
> > > >> performance issues are resolved. With some luck it should be
> possible to
> > > >> code freeze towards end of this week.
> > > >>
> > > >> Are there other critical features/bugs/PR you think need to 

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread kellen sunderland
I believe these PRs are ready to merge, but so far I don't have any approvals.
I would appreciate it if someone could do a quick review:

https://github.com/apache/incubator-mxnet/pull/13311
and
https://github.com/apache/incubator-mxnet/pull/13310

-Kellen

On Thu, Nov 29, 2018 at 12:43 PM Steffen Rochel 
wrote:

> Kellen - please merge your PR before v1.4.x branch is created or integrate
> afterwards.
> Steffen
>
> On Tue, Nov 20, 2018 at 7:01 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > regression in master which causes incorrect feature vectors to be output
> > when using the TensorRT feature.  (Thanks to Nathalie for helping me
> track
> > down the root cause of the issue).   I'm currently blocked on a CI issue
> I
> > haven't seen before, but hope to have it resolved by EOW.
> >
> > One call-out I would make is that we currently don't support Turing
> > architecture (sm_75).  I've been slowly trying to add support, but I
> don't
> > think I'd have capacity to do this done by EOW.  Does anyone feel
> strongly
> > we need this in the 1.4 release?  From my perspective this will already
> be
> > a strong release without it.
> >
> > On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel 
> > wrote:
> >
> > > Thanks Patrick, lets target to get the PR's merged this week.
> > >
> > > Call for contributions from the community: Right now we have 10 PR
> > awaiting
> > > merge
> > > <
> > >
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+
> > > >
> > > and
> > > we have 61 open PR awaiting review.
> > > <
> > >
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-review
> > > >
> > > I would appreciate if you all can help to review the open PR and the
> > > committers can drive the merge before code freeze for 1.4.0.
> > >
> > > The contributors on the Java API are making progress, but not all
> > > performance issues are resolved. With some luck it should be possible
> to
> > > code freeze towards end of this week.
> > >
> > > Are there other critical features/bugs/PR you think need to be included
> > in
> > > 1.4.0? If so, please communicate as soon as possible.
> > >
> > > Regards,
> > > Steffen
> > >
> > > On Mon, Nov 19, 2018 at 8:26 PM Zhao, Patric 
> > > wrote:
> > >
> > > > Thanks, Steffen. I think there is NO open issue to block the MKLDNN
> to
> > GA
> > > > now.
> > > >
> > > > BTW, several quantization related PRs (#13297,#13260) are under the
> > > review
> > > > and I think it can be merged in this week.
> > > >
> > > > Thanks,
> > > >
> > > > --Patric
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> > > > > Sent: Tuesday, November 20, 2018 2:57 AM
> > > > > To: dev@mxnet.incubator.apache.org
> > > > > Subject: Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0
> > > release
> > > > >
> > > > > On Friday the contributors working on Java API discovered a
> potential
> > > > > performance problem with inference using Java API vs. Python.
> > > > Investigation
> > > > > is ongoing.
> > > > > As the Java API is one of the main features for the upcoming
> > release, I
> > > > > suggest to post-pone the code freeze towards end of this week.
> > > > >
> > > > > Please provide feedback and concern about the change in dates for
> > code
> > > > > freeze and 1.4.0 release. I will provide updates on progress
> > resolving
> > > > the
> > > > > potential performance problem.
> > > > >
> > > > > Patrick - do you think it is possible to resolve the remaining
> issues
> > > on
> > > > MKL-
> > > > > DNN this week, so we can consider GA for MKL-DNN with 1.4.0?
> > > > >
> > > > > Regards,
> > > > > Steffen
> > > > >
> > > > > On Thu, Nov 15, 2018 at 5:26 AM Anton Chernov  >
> > > > > wrote:
> > > > >
> > > > > > I'd like to remind everyone that 'code freeze' would mean
> cutting a
> > > > > > v1.4.x release branch and all following fixes would need to be
> > > > backported.
> > > > > > Development on master can be continued as usual.
> > > > > >
> > > > > > Best
> > > > > > Anton
> > > > > >
> > > > > > ср, 14 нояб. 2018 г. в 6:04, Steffen Rochel <
> > steffenroc...@gmail.com
> > > >:
> > > > > >
> > > > > > > Dear MXNet community,
> > > > > > > the agreed plan was to establish code freeze for 1.4.0 release
> > > > > > > today. As the 1.3.1 patch release is still ongoing I suggest to
> > > > > > > post-pone the code freeze to Friday 16th November 2018.
> > > > > > >
> > > > > > > Sergey Kolychev has agreed to act as co-release manager for all
> > > > > > > tasks
> > > > > > which
> > > > > > > require committer privileges. If anybody is interested to
> > volunteer
> > > > > > > as release manager - now is the time to speak up. Otherwise I
> > will
> > > > > > > manage
> > > > > > 

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Steffen Rochel
Kellen - please merge your PR before the v1.4.x branch is created, or
integrate it afterwards.
Steffen

On Tue, Nov 20, 2018 at 7:01 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> regression in master which causes incorrect feature vectors to be output
> when using the TensorRT feature.  (Thanks to Nathalie for helping me track
> down the root cause of the issue).   I'm currently blocked on a CI issue I
> haven't seen before, but hope to have it resolved by EOW.
>
> One call-out I would make is that we currently don't support Turing
> architecture (sm_75).  I've been slowly trying to add support, but I don't
> think I'd have capacity to do this done by EOW.  Does anyone feel strongly
> we need this in the 1.4 release?  From my perspective this will already be
> a strong release without it.
>
> On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel 
> wrote:
>
> > Thanks Patrick, lets target to get the PR's merged this week.
> >
> > Call for contributions from the community: Right now we have 10 PR
> awaiting
> > merge
> > <
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+
> > >
> > and
> > we have 61 open PR awaiting review.
> > <
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-review
> > >
> > I would appreciate if you all can help to review the open PR and the
> > committers can drive the merge before code freeze for 1.4.0.
> >
> > The contributors on the Java API are making progress, but not all
> > performance issues are resolved. With some luck it should be possible to
> > code freeze towards end of this week.
> >
> > Are there other critical features/bugs/PR you think need to be included
> in
> > 1.4.0? If so, please communicate as soon as possible.
> >
> > Regards,
> > Steffen
> >
> > On Mon, Nov 19, 2018 at 8:26 PM Zhao, Patric 
> > wrote:
> >
> > > Thanks, Steffen. I think there is NO open issue to block the MKLDNN to
> GA
> > > now.
> > >
> > > BTW, several quantization related PRs (#13297,#13260) are under the
> > review
> > > and I think it can be merged in this week.
> > >
> > > Thanks,
> > >
> > > --Patric
> > >
> > >
> > > > -Original Message-
> > > > From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> > > > Sent: Tuesday, November 20, 2018 2:57 AM
> > > > To: dev@mxnet.incubator.apache.org
> > > > Subject: Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0
> > release
> > > >
> > > > On Friday the contributors working on Java API discovered a potential
> > > > performance problem with inference using Java API vs. Python.
> > > Investigation
> > > > is ongoing.
> > > > As the Java API is one of the main features for the upcoming
> release, I
> > > > suggest to post-pone the code freeze towards end of this week.
> > > >
> > > > Please provide feedback and concern about the change in dates for
> code
> > > > freeze and 1.4.0 release. I will provide updates on progress
> resolving
> > > the
> > > > potential performance problem.
> > > >
> > > > Patrick - do you think it is possible to resolve the remaining issues
> > on
> > > MKL-
> > > > DNN this week, so we can consider GA for MKL-DNN with 1.4.0?
> > > >
> > > > Regards,
> > > > Steffen
> > > >
> > > > On Thu, Nov 15, 2018 at 5:26 AM Anton Chernov 
> > > > wrote:
> > > >
> > > > > I'd like to remind everyone that 'code freeze' would mean cutting a
> > > > > v1.4.x release branch and all following fixes would need to be
> > > backported.
> > > > > Development on master can be continued as usual.
> > > > >
> > > > > Best
> > > > > Anton
> > > > >
> > > > > ср, 14 нояб. 2018 г. в 6:04, Steffen Rochel <
> steffenroc...@gmail.com
> > >:
> > > > >
> > > > > > Dear MXNet community,
> > > > > > the agreed plan was to establish code freeze for 1.4.0 release
> > > > > > today. As the 1.3.1 patch release is still ongoing I suggest to
> > > > > > post-pone the code freeze to Friday 16th November 2018.
> > > > > >
> > > > > > Sergey Kolychev has agreed to act as co-release manager for all
> > > > > > tasks
> > > > > which
> > > > > > require committer privileges. If anybody is interested to
> volunteer
> > > > > > as release manager - now is the time to speak up. Otherwise I
> will
> > > > > > manage
> > > > > the
> > > > > > release.
> > > > > >
> > > > > > Regards,
> > > > > > Steffen
> > > > > >
> > > > >
> > >
> >
>


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Pedro Larroy
I see. There's also an OpenMP primitive to change this. I see a way to
fix this issue with a bit of refactoring.

Thanks.

Pedro.
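
The direction discussed in this thread, reading settings once at startup and
handing them to threads through an API instead of calling setenv from engine
threads, could be sketched roughly as follows. This is an illustration in
Python, not MXNet's actual code: EngineConfig, load_config, worker and
run_workers are hypothetical names, and OMP_NUM_THREADS is used only as an
example variable.

```python
import os
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineConfig:
    """Immutable snapshot of settings (hypothetical, for illustration)."""
    omp_num_threads: int


def load_config(environ=os.environ):
    """Read the environment once, on the main thread, before workers start."""
    return EngineConfig(omp_num_threads=int(environ.get("OMP_NUM_THREADS", "1")))


def worker(config, results, index):
    # Workers only read the immutable snapshot. They never assign to
    # os.environ (which would call putenv/setenv), so the process
    # environment is not mutated while other threads may be reading it.
    results[index] = config.omp_num_threads


def run_workers(config, count):
    """Start count threads that each receive the same config object."""
    results = [None] * count
    threads = [threading.Thread(target=worker, args=(config, results, i))
               for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    cfg = load_config({"OMP_NUM_THREADS": "4"})
    print(run_workers(cfg, 3))  # prints [4, 4, 4]
```

The point of the frozen dataclass is that no thread can change the settings
after startup, which sidesteps the setenv/getenv race entirely.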
On Thu, Nov 29, 2018 at 6:24 PM Chris Olivier  wrote:
>
> I don’t think that does anything at all, as stated in my other email.
> Someone can look into the omp code to be sure but my suspicion is that the
> environment variable is only read on startup, and at any rate, better to be
> set through the api at runtime
>
> On Thu, Nov 29, 2018 at 8:11 AM Pedro Larroy 
> wrote:
>
> > To be precise, what would be the consequences of not having these env
> > variables set in the engine threads related to OMP?
> > Given your experience with OpenMP I hope you can help us answer these
> > questions.
> >
> > Hopefully we can get the same effect (if any) of these setenvs using
> > some openmp call or a pragma. Definitely we shouldn't be mutating the
> > environment from a different thread from what I understand, which is
> > the likely cause of the random crashes some users are experiencing.
> >
> > Pedro
> > On Thu, Nov 29, 2018 at 5:00 PM Pedro Larroy
> >  wrote:
> > >
> > > Chris.  The problem is with setenv, not with getenv. We don't want to
> > > remove any getenv call, just these misplaced setenvs:
> > >
> > > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L61
> > >
> > > Please check the code above carefully and give us your feedback. Based
> > > on your email I think we don't yet have a common understanding of the
> > > root cause of this issue.
> > >
> > > Pedro.
> > > On Thu, Nov 29, 2018 at 4:02 PM Chris Olivier 
> > wrote:
> > > >
> > > > - getenv should be thread safe as long as nothing is calling
> > > > putenv/setenv in another thread (the environment doesn’t change) as
> > > > stated here:
> > > >
> > > > http://www.cplusplus.com/reference/cstdlib/getenv/
> > > >
> > > > it’s a simple library call, so to be sure either way, one can check the
> > > > actual source and see (in case some particular implementation is acting
> > > > in a particularly thread-unsafe manner). This should be vetted before
> > > > making any high-impact decisions such as trying to go remove every
> > > > getenv call in the whole system.
> > > >
> > > > - locking after fork is possibly due to libgomp not supporting forking
> > > > such that after a fork, a call is made to release the blocked omp
> > > > threads and the main thread waits for the omp threads to finish, but the
> > > > omp threads belong to the pre-forked process and thus never execute,
> > > > causing that forked process to freeze.  This behavior has been witnessed
> > > > before.
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> > pedro.larroy.li...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all.
> > > > >
> > > > > There are two important issues / fixes on my radar that should go in
> > > > > the next release:
> > > > >
> > > > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > > > There is a bug in shape inference on CPU when not using MKL; also, we
> > > > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > > > I'm finishing a fix for these issues in the above PR.
> > > > >
> > > > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > > > Setenv / getenv from multiple threads is not safe and is causing
> > > > > segfaults. This piece of code (the handlers in pthread_atfork) already
> > > > > caused a very difficult-to-diagnose hang in a previous release, where
> > > > > a fork inside cudnn would deadlock the engine.
> > > > > I would remove setenv from 2) as a mitigation, but we would need to
> > > > > check for regressions as we could be creating additional threads
> > > > > inside the engine.
> > > > >
> > > > > I would suggest that we address these two major issues before the
> > next
> > > > > release.
> > > > >
> > > > > Pedro
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
> > steffenroc...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Dear MXNet community,
> > > > > >
> > > > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > > > release.
> > > > > > Sergey Kolychev will be co-managing the release and providing help
> > from
> > > > > the
> > > > > > committers side.
> > > > > > A release candidate will be cut on November 29, 2018 and voting
> > will
> > > > > start
> > > > > > December 7, 2018. Release notes have been drafted here [1]. If you
> > have
> > > > > any
> > > > > > additional features in progress and would like to include it in
> > this
> > > > > > release, please assure they have been merged by November 27, 2018.
> > > > > Release
> > > > > > schedule is available here [2].
> > > > > >
> > > > > > Feel free to add any other comments/suggestions. Please help to
> > review
> > > > > and
> > > > > > merge outstanding 

Re: [Launch Announcement] Dynamic training with Apache MXNet

2018-11-29 Thread Kumar, Vikas
A big thanks to Qi Qiao < https://github.com/mirocody > for making it easy for
users to set up a cluster for dynamic training using CloudFormation.

From: "Kumar, Vikas" 
Date: Thursday, November 29, 2018 at 10:26 AM
To: "dev@mxnet.incubator.apache.org" 
Subject: [Launch Announcement] Dynamic training with Apache MXNet

Hello MXNet community,

MXNet users can now use Dynamic Training (DT) for deep learning models with
Apache MXNet. DT helps reduce training cost and training time by adding
elasticity to the distributed training cluster. DT also helps increase
instance pool utilization. With DT, unused instances can be used to speed up
training, and those instances can later be removed from the training cluster
to be used by some other application.
For details, refer to the DT blog.
Developers should be able to integrate Dynamic Training into their existing
distributed training code by adding a few extra lines of code.

Thank you to all the contributors: Vikas Kumar,
Haibin Lin < https://github.com/eric-haibin-lin >, Andrea Olgiati <
https://github.com/andreaolgiati/ >, Mu Li <
https://github.com/mli >, Hagay Lupesko, Markham
Aaron < https://github.com/aaronmarkham >, Sergey Sokolov <
https://github.com/Ishitori >, Qi Qiao < https://github.com/mirocody >

This is an effort towards making training neural networks cheap and fast. We 
welcome your contributions to the repo - 
https://github.com/awslabs/dynamic-training-with-apache-mxnet-on-aws . We would 
love to hear feedback and ideas in this direction.

Thanks
Vikas


[Launch Announcement] Dynamic training with Apache MXNet

2018-11-29 Thread Kumar, Vikas
Hello MXNet community,

MXNet users can now use Dynamic Training (DT) for deep learning models with 
Apache MXNet. DT helps reduce training cost and training time by adding 
elasticity to the distributed training cluster. DT also helps increase 
instance pool utilization. With DT, unused instances can be used to speed up 
training, and those instances can later be removed from the training cluster 
to be used by some other application.
For details, refer to the DT blog.
Developers should be able to integrate Dynamic Training into their existing 
distributed training code with just a few extra lines of code.

Thank you to all the contributors: Vikas Kumar, 
Haibin Lin < https://github.com/eric-haibin-lin >, Andrea Olgiati < 
https://github.com/andreaolgiati/ >, Mu Li < https://github.com/mli >, 
Hagay Lupesko, Markham Aaron < https://github.com/aaronmarkham >, 
Sergey Sokolov < https://github.com/Ishitori >

This is an effort towards making training neural networks cheap and fast. We 
welcome your contributions to the repo - 
https://github.com/awslabs/dynamic-training-with-apache-mxnet-on-aws . We would 
love to hear feedback and ideas in this direction.

Thanks
Vikas


Build from Source Instruction Changes

2018-11-29 Thread Zachary
I would like to raise a PR for discussion here. In the mxnet installation
docs, we currently have three inconsistent ways of compiling the mxnet
backend:

1. Use make by passing in the configuration directly

make USE_BLAS=openblas

2. Use make with config.mk

echo "USE_BLAS=openblas" >> ./config.mk
make

3. Use cmake (which takes flags only on the command line and does not read config.mk)

cmake USE_BLAS=open



I investigated this because we found that passing configuration
directly can cause problems for building the Scala frontend. The Scala
frontend is compiled using make because it currently requires that the
build and link flags used for the backend also be passed to the
frontend. Make computes the flags for both backend and frontend, but if
the make configurations differ between the two, the build can end up
not using various libraries and performing poorly. There are two ways
to fix that: pass the same configuration to both invocations ("make
USE_BLAS=openblas; make scalapkg USE_BLAS=openblas"), or pass the
configuration through config.mk (which unifies the
configuration). As it is far simpler, I opted for the config.mk solution.
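
As a rough sketch of the config.mk route: the helper below records a flag
in config.mk exactly once, so that repeated runs do not keep appending the
same line the way a bare "echo ... >> ./config.mk" would. The function name
set_config_flag is hypothetical and not part of MXNet; USE_BLAS=openblas is
the flag used in the examples above.

```shell
#!/bin/sh
# Hypothetical helper (not part of MXNet): store KEY=VALUE in config.mk,
# replacing any earlier assignment of KEY instead of appending a duplicate.
set_config_flag() {
    key="$1"
    value="$2"
    file="${3:-./config.mk}"
    touch "$file"
    # Keep every line that is not a plain "KEY = ..." assignment.
    # grep exits non-zero when nothing is kept, so ignore its status.
    grep -v "^${key}[[:space:]]*=" "$file" > "${file}.tmp" || true
    mv "${file}.tmp" "$file"
    # Append the new assignment last, where make will read it.
    echo "${key}=${value}" >> "$file"
}

# Intended use (commented out so the snippet has no side effects):
#   set_config_flag USE_BLAS openblas
#   make
#   make scalapkg
```

With the flag stored in config.mk, both make invocations read the same
value without it being repeated on each command line.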



However, there is also a movement to migrate from make to cmake. For this
reason, cmake has already begun appearing in some of the installation docs.
However, cmake does not use config.mk and has some differences in the
configurations from make.



The temporary fix that I implemented was to migrate all direct calls to
make (1) to uses of config.mk (2). For every cmake build in the docs, I
also redundantly add the flags to config.mk as well as passing them
directly to cmake:

echo "USE_BLAS=openblas" >> ./config.mk
cmake USE_BLAS=open



You can see the specific changes in
https://github.com/apache/incubator-mxnet/pull/13364. There are also other
options for a temporary fix. We could remove all cmake usage for now and
then switch later. We could also switch entirely to cmake with redundant
config.mk flags.



The permanent fix would be to remove the Scala compilation requirement for
the build flags. We are working on this, but it may take some time so we
want to make a temporary fix. Once it is done, then we will not have
problems with migrating to cmake. Because this affects the general build
instructions, I want to post everything here in case there is some input.


Thanks,
Zach


- https://github.com/apache/incubator-mxnet/pull/13364


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Zheng, Da
Hello Steffen,

Can this bug be fixed in the 1.4.0 release? It's a significant performance 
regression in sparse matrix multiplication.
https://github.com/apache/incubator-mxnet/issues/13449

Thanks,
Da

On 11/26/18, 6:42 AM, "Steffen Rochel"  wrote:

Dear MXNet community,

I will be the release manager for the upcoming Apache MXNet 1.4.0 release.
Sergey Kolychev will be co-managing the release and providing help from the
committers side.
A release candidate will be cut on November 29, 2018 and voting will start
December 7, 2018. Release notes have been drafted here [1]. If you have any
additional features in progress and would like to include it in this
release, please assure they have been merged by November 27, 2018. Release
schedule is available here [2].

Feel free to add any other comments/suggestions. Please help to review and
merge outstanding PR's and resolve issues impacting the quality of the
1.4.0 release.

Regards,

Steffen

[1]

https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes

[2] 
https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status




On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Spoke too soon[1], looks like others have been adding Turing support as
> well (thanks to those helping with this).  I believe there's still a few
> changes we'd have to make to claim support though (mshadow CMake changes,
> PyPi package creation tweaks).
>
> 1:
>
> 
https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
>
> On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > regression in master which causes incorrect feature vectors to be output
> > when using the TensorRT feature.  (Thanks to Nathalie for helping me
> track
> > down the root cause of the issue).   I'm currently blocked on a CI issue
> I
> > haven't seen before, but hope to have it resolved by EOW.
> >
> > One call-out I would make is that we currently don't support Turing
> > architecture (sm_75).  I've been slowly trying to add support, but I
> don't
> > think I'd have capacity to do this done by EOW.  Does anyone feel
> strongly
> > we need this in the 1.4 release?  From my perspective this will already
> be
> > a strong release without it.
> >
> > On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel 
> > wrote:
> >
> >> Thanks Patrick, lets target to get the PR's merged this week.
> >>
> >> Call for contributions from the community: Right now we have 10 PR
> >> awaiting
> >> merge
> >> <
> >>
> 
https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+
> >> >
> >> and
> >> we have 61 open PR awaiting review.
> >> <
> >>
> 
https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-review
> >> >
> >> I would appreciate if you all can help to review the open PR and the
> >> committers can drive the merge before code freeze for 1.4.0.
> >>
> >> The contributors on the Java API are making progress, but not all
> >> performance issues are resolved. With some luck it should be possible to
> >> code freeze towards end of this week.
> >>
> >> Are there other critical features/bugs/PR you think need to be included
> in
> >> 1.4.0? If so, please communicate as soon as possible.
> >>
> >> Regards,
> >> Steffen
> >>
> >> On Mon, Nov 19, 2018 at 8:26 PM Zhao, Patric 
> >> wrote:
> >>
> >> > Thanks, Steffen. I think there is NO open issue to block the MKLDNN to GA
> >> > now.
> >> >
> >> > BTW, several quantization related PRs (#13297,#13260) are under the
> >> review
> >> > and I think it can be merged in this week.
> >> >
> >> > Thanks,
> >> >
> >> > --Patric
> >> >
> >> >
> >> > > -Original Message-
> >> > > From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> >> > > Sent: Tuesday, November 20, 2018 2:57 AM
> >> > > To: dev@mxnet.incubator.apache.org
> >> > > Subject: Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0
> >> release
> >> > >
> >> > > On Friday the contributors working on Java API discovered a
> potential
> >> > > performance problem with inference using Java API vs. Python.
> >> > Investigation
> >> > > is ongoing.
> >> > > As the Java API is one of the main features for the upcoming
> 

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Chris Olivier
I don’t think that does anything at all, as stated in my other email.
Someone can look into the OpenMP code to be sure, but my suspicion is that the
environment variable is only read on startup, and at any rate, it is better
set through the API at runtime.
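
If the suspicion above holds (it is not verified here), the variable has
to be in place before the process starts. A minimal sketch of that
startup-time route follows. OMP_NUM_THREADS is the standard OpenMP
variable and is used only as an assumed example, since the thread does not
name the variables involved. run_with_omp_threads is a hypothetical
wrapper.

```shell
#!/bin/sh
# Hypothetical wrapper: launch a command with OMP_NUM_THREADS already set,
# so nothing has to call setenv inside the running process.
run_with_omp_threads() {
    threads="$1"
    shift
    # `env` sets the variable only for the command being launched;
    # the calling shell's own environment is left untouched.
    env OMP_NUM_THREADS="$threads" "$@"
}

# Example: the child sees the value from its very first instruction.
run_with_omp_threads 1 sh -c 'echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"'
```

The runtime alternative mentioned above would be the OpenMP C API call
omp_set_num_threads(), which avoids touching the environment at all.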

On Thu, Nov 29, 2018 at 8:11 AM Pedro Larroy 
wrote:

> To be precise, what would be the consequences of not having these env
> variables set in the engine threads related to OMP?
> Given your experience with OpenMP I hope you can help us answer these
> questions.
>
> Hopefully we can get the same effect (if any) of these setenvs using
> some openmp call or a pragma. Definitely we shouldn't be mutating the
> environment from a different thread from what I understand, which is
> the likely cause of the random crashes some users are experiencing.
>
> Pedro
> On Thu, Nov 29, 2018 at 5:00 PM Pedro Larroy
>  wrote:
> >
> > Chris.  The problem is with setenv, not with getenv. We don't want to
> > remove any getenv call, just these misplaced setenvs:
> >
> >
> >
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L61
> >
> > Please check the code above carefully and give us your feedback. Based
> > on your email I think we don't yet have a common understanding of the
> > root cause of this issue.
> >
> > Pedro.
> > On Thu, Nov 29, 2018 at 4:02 PM Chris Olivier 
> wrote:
> > >
> > > - getenv should be thread safe as long as nothing is calling
> putenv/setenv
> > > in another thread (the environment doesn’t change) as stated here:
> > >
> > > http://www.cplusplus.com/reference/cstdlib/getenv/
> > >
> > > it’s a simple library call, so to be sure either way, one can check the
> > > actual source and see (in case some particular implementation is
> acting in
> > > a particularly thread-unsafe manner). This should be vetted before
> making
> > > any high-impact decisions such as trying to go remove every getenv
> call in
> > > the whole system.
> > >
> > > - locking after fork is possibly due to libgomp not supporting forking
> such
> > > that after a fork, a call is made to release the blocked omp threads
> and
> > > the main thread waits for the omp threads to finish, but the omp
> threads
> > > belong to the pre-forked process and thus never execute, causing that
> > > forked process to freeze.  This behavior has been witnessed before.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > Hi all.
> > > >
> > > > There are two important issues / fixes that should go in the next
> > > > release in my radar:
> > > >
> > > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > > There is a bug in shape inference on CPU when not using MKL, also we
> > > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > > I'm finishing a fix for these issues in the above PR.
> > > >
> > > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > > Setenv / getenv from multiple threads is not safe and is causing
> > > > segfaults. This piece of code (the handlers in pthread_atfork)
> already
> > > > caused a very difficult to diagnose hang in a previous release, where
> > > > a fork inside cudnn would deadlock the engine.
> > > >
> > > > I would remove setenv from 2) as a mitigation, but we would need to
> > > > check for regressions as we could be creating additional threads
> > > > inside the engine.
> > > >
> > > > I would suggest that we address these two major issues before the
> next
> > > > release.
> > > >
> > > > Pedro
> > > >
> > > >
> > > >
> > > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Dear MXNet community,
> > > > >
> > > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > > release.
> > > > > Sergey Kolychev will be co-managing the release and providing help
> from
> > > > the
> > > > > committers side.
> > > > > A release candidate will be cut on November 29, 2018 and voting
> will
> > > > start
> > > > > December 7, 2018. Release notes have been drafted here [1]. If you
> have
> > > > any
> > > > > additional features in progress and would like to include it in
> this
> > > > > release, please assure they have been merged by November 27, 2018.
> > > > Release
> > > > > schedule is available here [2].
> > > > >
> > > > > Feel free to add any other comments/suggestions. Please help to
> review
> > > > and
> > > > > merge outstanding PR's and resolve issues impacting the quality of
> the
> > > > > 1.4.0 release.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Steffen
> > > > >
> > > > > [1]
> > > > >
> > > >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> > > > >
> > > > > [2]
> > > >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> > > > 

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Chris Olivier
By the way, have you traced a problem to these calls?

I am a bit skeptical that this is problematic here for the following reason:

At the time of atfork(), the new process doesn’t have any other threads to
speak of that are calling getenv(). Any globals from the last process are
owned by that process and copy-on-write in the new process. This would mean
that the getenv() in the old process wouldn’t be affected by putenv() in
the newly forked process and like I said, at this time, the newly forked
process tends to be single-threaded.
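
The cross-process half of this argument can be shown in a few lines of
shell. This is only an illustration: DEMO_VAR is a made-up variable, and
the sketch says nothing about whether setenv is safe when several threads
share one process, which is the separate concern raised in this thread.

```shell
#!/bin/sh
# DEMO_VAR is a made-up variable used only for this illustration.
DEMO_VAR=parent
export DEMO_VAR

# Command substitution runs in a child process that works on its own copy
# of the environment, so the assignment inside it stays in the child.
child_value="$(DEMO_VAR=child; export DEMO_VAR; sh -c 'echo "$DEMO_VAR"')"
parent_value="$DEMO_VAR"

echo "child saw:  $child_value"
echo "parent has: $parent_value"
```

The parent's value is unchanged after the child exits, which matches the
point that a putenv() in a forked process cannot disturb getenv() in the
process it was forked from.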



On Thu, Nov 29, 2018 at 8:11 AM Pedro Larroy 
wrote:

> To be precise, what would be the consequences of not having these env
> variables set in the engine threads related to OMP?
> Given your experience with OpenMP I hope you can help us answer these
> questions.
>
> Hopefully we can get the same effect (if any) of these setenvs using
> some openmp call or a pragma. Definitely we shouldn't be mutating the
> environment from a different thread from what I understand, which is
> the likely cause of the random crashes some users are experiencing.
>
> Pedro
> On Thu, Nov 29, 2018 at 5:00 PM Pedro Larroy
>  wrote:
> >
> > Chris.  The problem is with setenv, not with getenv. We don't want to
> > remove any getenv call, just these misplaced setenvs:
> >
> >
> >
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L61
> >
> > Please check the code above carefully and give us your feedback. Based
> > on your email I think we don't yet have a common understanding of the
> > root cause of this issue.
> >
> > Pedro.
> > On Thu, Nov 29, 2018 at 4:02 PM Chris Olivier 
> wrote:
> > >
> > > - getenv should be thread safe as long as nothing is calling
> putenv/setenv
> > > in another thread (the environment doesn’t change) as stated here:
> > >
> > > http://www.cplusplus.com/reference/cstdlib/getenv/
> > >
> > > it’s a simple library call, so to be sure either way, one can check the
> > > actual source and see (in case some particular implementation is
> acting in
> > > a particularly thread-unsafe manner). This should be vetted before
> making
> > > any high-impact decisions such as trying to go remove every getenv
> call in
> > > the whole system.
> > >
> > > - locking after fork is possibly due to libgomp not supporting forking
> such
> > > that after a fork, a call is made to release the blocked omp threads
> and
> > > the main thread waits for the omp threads to finish, but the omp
> threads
> > > belong to the pre-forked process and thus never execute, causing that
> > > forked process to freeze.  This behavior has been witnessed before.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > Hi all.
> > > >
> > > > There are two important issues / fixes that should go in the next
> > > > release in my radar:
> > > >
> > > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > > There is a bug in shape inference on CPU when not using MKL, also we
> > > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > > I'm finishing a fix for these issues in the above PR.
> > > >
> > > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > > Setenv / getenv from multiple threads is not safe and is causing
> > > > segfaults. This piece of code (the handlers in pthread_atfork)
> already
> > > > caused a very difficult to diagnose hang in a previous release, where
> > > > a fork inside cudnn would deadlock the engine.
> > > >
> > > > I would remove setenv from 2) as a mitigation, but we would need to
> > > > check for regressions as we could be creating additional threads
> > > > inside the engine.
> > > >
> > > > I would suggest that we address these two major issues before the
> next
> > > > release.
> > > >
> > > > Pedro
> > > >
> > > >
> > > >
> > > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Dear MXNet community,
> > > > >
> > > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > > release.
> > > > > Sergey Kolychev will be co-managing the release and providing help
> from
> > > > the
> > > > > committers side.
> > > > > A release candidate will be cut on November 29, 2018 and voting
> will
> > > > start
> > > > > December 7, 2018. Release notes have been drafted here [1]. If you
> have
> > > > any
> > > > > additional features in progress and would like to include it in
> this
> > > > > release, please assure they have been merged by November 27, 2018.
> > > > Release
> > > > > schedule is available here [2].
> > > > >
> > > > > Feel free to add any other comments/suggestions. Please help to
> review
> > > > and
> > > > > merge outstanding PR's and resolve issues impacting the quality of
> the
> > > > > 1.4.0 release.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Steffen
> > > 

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Chris Olivier
I see. Yeah probably those can be removed. I haven’t checked the source,
but I would be surprised if omp even looked at the environment variable
after initial startup since looking up environment variables is a slow
linear search each time.

On Thu, Nov 29, 2018 at 8:09 AM Pedro Larroy 
wrote:

> Chris.  The problem is with setenv, not with getenv. We don't want to
> remove any getenv call, just these misplaced setenvs:
>
>
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L61
>
> Please check the code above carefully and give us your feedback. Based
> on your email I think we don't yet have a common understanding of the
> root cause of this issue.
>
> Pedro.
> On Thu, Nov 29, 2018 at 4:02 PM Chris Olivier 
> wrote:
> >
> > - getenv should be thread safe as long as nothing is calling
> putenv/setenv
> > in another thread (the environment doesn’t change) as stated here:
> >
> > http://www.cplusplus.com/reference/cstdlib/getenv/
> >
> > it’s a simple library call, so to be sure either way, one can check the
> > actual source and see (in case some particular implementation is acting
> in
> > a particularly thread-unsafe manner). This should be vetted before making
> > any high-impact decisions such as trying to go remove every getenv call
> in
> > the whole system.
> >
> > - locking after fork is possibly due to libgomp not supporting forking
> such
> > that after a fork, a call is made to release the blocked omp threads and
> > the main thread waits for the omp threads to finish, but the omp threads
> > belong to the pre-forked process and thus never execute, causing that
> > forked process to freeze.  This behavior has been witnessed before.
> >
> >
> >
> >
> > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > Hi all.
> > >
> > > There are two important issues / fixes that should go in the next
> > > release in my radar:
> > >
> > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > There is a bug in shape inference on CPU when not using MKL, also we
> > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > I'm finishing a fix for these issues in the above PR.
> > >
> > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > Setenv / getenv from multiple threads is not safe and is causing
> > > segfaults. This piece of code (the handlers in pthread_atfork) already
> > > caused a very difficult to diagnose hang in a previous release, where
> > > a fork inside cudnn would deadlock the engine.
> > >
> > > I would remove setenv from 2) as a mitigation, but we would need to
> > > check for regressions as we could be creating additional threads
> > > inside the engine.
> > >
> > > I would suggest that we address these two major issues before the next
> > > release.
> > >
> > > Pedro
> > >
> > >
> > >
> > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > wrote:
> > > >
> > > > Dear MXNet community,
> > > >
> > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > release.
> > > > Sergey Kolychev will be co-managing the release and providing help
> from
> > > the
> > > > committers side.
> > > > A release candidate will be cut on November 29, 2018 and voting will
> > > start
> > > > December 7, 2018. Release notes have been drafted here [1]. If you
> have
> > > any
> > > > additional features in progress and would like to include it in this
> > > > release, please assure they have been merged by November 27, 2018.
> > > Release
> > > > schedule is available here [2].
> > > >
> > > > Feel free to add any other comments/suggestions. Please help to
> review
> > > and
> > > > merge outstanding PR's and resolve issues impacting the quality of
> the
> > > > 1.4.0 release.
> > > >
> > > > Regards,
> > > >
> > > > Steffen
> > > >
> > > > [1]
> > > >
> > >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> > > >
> > > > [2]
> > >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Spoke too soon[1], looks like others have been adding Turing
> support as
> > > > > well (thanks to those helping with this).  I believe there's still
> a
> > > few
> > > > > changes we'd have to make to claim support though (mshadow CMake
> > > changes,
> > > > > PyPi package creation tweaks).
> > > > >
> > > > > 1:
> > > > >
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > > > >
> > > > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > > > > kellen.sunderl...@gmail.com> wrote:
> > > > >
> > > > > > Hey Steffen, I'd like to be able to merge this PR for version
> 1.4:

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Pedro Larroy
To be precise, what would be the consequences of not having these env
variables set in the engine threads related to OMP?
Given your experience with OpenMP I hope you can help us answer these questions.

Hopefully we can get the same effect (if any) of these setenvs using
some OpenMP call or a pragma. From what I understand, we definitely
shouldn't be mutating the environment from a different thread, which is
the likely cause of the random crashes some users are experiencing.

Pedro
On Thu, Nov 29, 2018 at 5:00 PM Pedro Larroy
 wrote:
>
> Chris.  The problem is with setenv, not with getenv. We don't want to
> remove any getenv call, just these misplaced setenvs:
>
>
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L61
>
> Please check the code above carefully and give us your feedback. Based
> on your email I think we don't yet have a common understanding of the
> root cause of this issue.
>
> Pedro.
> On Thu, Nov 29, 2018 at 4:02 PM Chris Olivier  wrote:
> >
> > - getenv should be thread safe as long as nothing is calling putenv/setenv
> > in another thread (the environment doesn’t change) as stated here:
> >
> > http://www.cplusplus.com/reference/cstdlib/getenv/
> >
> > it’s a simple library call, so to be sure either way, one can check the
> > actual source and see (in case some particular implementation is acting in
> > a particularly thread-unsafe manner). This should be vetted before making
> > any high-impact decisions such as trying to go remove every getenv call in
> > the whole system.
> >
> > - locking after fork is possibly due to libgomp not supporting forking such
> > that after a fork, a call is made to release the blocked omp threads and
> > the main thread waits for the omp threads to finish, but the omp threads
> > belong to the pre-forked process and thus never execute, causing that
> > forked process to freeze.  This behavior has been witnessed before.
> >
> >
> >
> >
> > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy 
> > wrote:
> >
> > > Hi all.
> > >
> > > There are two important issues / fixes that should go in the next
> > > release in my radar:
> > >
> > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > There is a bug in shape inference on CPU when not using MKL, also we
> > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > I'm finishing a fix for these issues in the above PR.
> > >
> > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > Setenv / getenv from multiple threads is not safe and is causing
> > > segfaults. This piece of code (the handlers in pthread_atfork) already
> > > caused a very difficult to diagnose hang in a previous release, where
> > > a fork inside cudnn would deadlock the engine.
> > >
> > > I would remove setenv from 2) as a mitigation, but we would need to
> > > check for regressions as we could be creating additional threads
> > > inside the engine.
> > >
> > > I would suggest that we address these two major issues before the next
> > > release.
> > >
> > > Pedro
> > >
> > >
> > >
> > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel 
> > > wrote:
> > > >
> > > > Dear MXNet community,
> > > >
> > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > release.
> > > > Sergey Kolychev will be co-managing the release and providing help from
> > > the
> > > > committers side.
> > > > A release candidate will be cut on November 29, 2018 and voting will
> > > start
> > > > December 7, 2018. Release notes have been drafted here [1]. If you have
> > > any
> > > > additional features in progress and would like to include it in this
> > > > release, please assure they have been merged by November 27, 2018.
> > > Release
> > > > schedule is available here [2].
> > > >
> > > > Feel free to add any other comments/suggestions. Please help to review
> > > and
> > > > merge outstanding PR's and resolve issues impacting the quality of the
> > > > 1.4.0 release.
> > > >
> > > > Regards,
> > > >
> > > > Steffen
> > > >
> > > > [1]
> > > >
> > > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> > > >
> > > > [2]
> > > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Spoke too soon[1], looks like others have been adding Turing support 
> > > > > as
> > > > > well (thanks to those helping with this).  I believe there's still a
> > > few
> > > > > changes we'd have to make to claim support though (mshadow CMake
> > > changes,
> > > > > PyPi package creation tweaks).
> > > > >
> > > > > 1:
> > > > >
> > > > >
> > > https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > > > >
> > > > > On Tue, Nov 20, 2018 

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Pedro Larroy
Chris.  The problem is with setenv, not with getenv. We don't want to
remove any getenv call, just these misplaced setenvs:


https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L61

Please check the code above carefully and give us your feedback. Based
on your email I think we don't yet have a common understanding of the
root cause of this issue.

Pedro.
On Thu, Nov 29, 2018 at 4:02 PM Chris Olivier  wrote:
>
> - getenv should be thread safe as long as nothing is calling putenv/setenv
> in another thread (the environment doesn’t change) as stated here:
>
> http://www.cplusplus.com/reference/cstdlib/getenv/
>
> it’s a simple library call, so to be sure either way, one can check the
> actual source and see (in case some particular implementation is acting in
> a particularly thread-unsafe manner). This should be vetted before making
> any high-impact decisions such as trying to go remove every getenv call in
> the whole system.
>
> - locking after fork is possibly due to libgomp not supporting forking such
> that after a fork, a call is made to release the blocked omp threads and
> the main thread waits for the omp threads to finish, but the omp threads
> belong to the pre-forked process and thus never execute, causing that
> forked process to freeze.  This behavior has been witnessed before.

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Chris Olivier
- getenv should be thread safe as long as nothing is calling putenv/setenv
in another thread (the environment doesn’t change) as stated here:

http://www.cplusplus.com/reference/cstdlib/getenv/

It’s a simple library call, so to be sure either way, one can check the
actual source (in case some particular implementation is acting in a
particularly thread-unsafe manner). This should be vetted before making
any high-impact decisions such as trying to remove every getenv call in
the whole system.

- Locking after fork is possibly due to libgomp not supporting forking:
after a fork, a call is made to release the blocked OMP threads and the
main thread waits for them to finish, but those threads belong to the
pre-fork process and thus never execute, causing the forked process to
freeze.  This behavior has been witnessed before.
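
The fork behaviour in the second point can be illustrated with a minimal
sketch. It is written in Python for a POSIX system and does not use libgomp
or MXNet; the helper thread merely stands in for a pool thread, and the
function name count_threads_after_fork is hypothetical. It shows only that
threads other than the one calling fork() do not exist in the child, so a
child that waits on them would wait forever.

```python
import os
import threading


def count_threads_after_fork():
    """Return (thread count in parent, thread count in child) around fork()."""
    stop = threading.Event()
    helper = threading.Thread(target=stop.wait)  # stands in for a pool thread
    helper.start()
    parent_count = threading.active_count()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() was copied.
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)  # leave immediately, skipping interpreter cleanup

    # Parent: read the child's report, then clean up.
    os.close(write_fd)
    child_count = int(os.read(read_fd, 16).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    helper.join()
    return parent_count, child_count
```

The parent sees at least two threads, while the child reports one. Any
synchronisation object that the missing threads were expected to signal
stays unsignalled in the child.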





Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Pedro Larroy
Hi all.

There are two important issues / fixes on my radar that should go in the
next release:

1) https://github.com/apache/incubator-mxnet/pull/13409/files
There is a bug in shape inference on CPU when not using MKL; also, we
are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
I'm finishing a fix for these issues in the above PR.

2) https://github.com/apache/incubator-mxnet/issues/13438
We are seeing crashes due to unsafe setenv in multithreaded code.
Calling setenv / getenv from multiple threads is not safe and is causing
segfaults. This piece of code (the handlers in pthread_atfork) already
caused a very difficult-to-diagnose hang in a previous release, where
a fork inside cudnn would deadlock the engine.

I would remove setenv from 2) as a mitigation, but we would need to
check for regressions as we could be creating additional threads
inside the engine.
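
One way to picture the mitigation is the pattern below: write to the
environment only before any threads exist, read the needed variables once,
and hand the threads a read-only snapshot. This is a Python-level sketch of
that pattern, not MXNet code and not the fix in the PR; the variable
EXAMPLE_NUM_THREADS and the helpers snapshot_env and run_workers are
hypothetical.

```python
import os
import threading
from types import MappingProxyType


def snapshot_env(names):
    """Read the named variables once and return a read-only mapping."""
    return MappingProxyType({name: os.environ.get(name) for name in names})


def run_workers(config, count):
    """Start `count` threads that read only from the snapshot."""
    results = [None] * count

    def worker(index):
        # Workers never touch os.environ, so nothing can race with a setenv.
        results[index] = config["EXAMPLE_NUM_THREADS"]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# All environment writes happen here, before any worker thread is created.
os.environ["EXAMPLE_NUM_THREADS"] = "4"
config = snapshot_env(["EXAMPLE_NUM_THREADS"])
results = run_workers(config, 8)
```

Because the snapshot is a MappingProxyType, an accidental write from a
worker raises TypeError instead of silently changing shared state.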

I would suggest that we address these two major issues before the next release.

Pedro



On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel  wrote:
>
> Dear MXNet community,
>
> I will be the release manager for the upcoming Apache MXNet 1.4.0 release.
> Sergey Kolychev will be co-managing the release and providing help from the
> committers' side.
> A release candidate will be cut on November 29, 2018 and voting will start
> December 7, 2018. Release notes have been drafted here [1]. If you have any
> additional features in progress and would like to include them in this
> release, please ensure they have been merged by November 27, 2018. The
> release schedule is available here [2].
>
> Feel free to add any other comments/suggestions. Please help to review and
> merge outstanding PR's and resolve issues impacting the quality of the
> 1.4.0 release.
>
> Regards,
>
> Steffen
>
> [1]
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
>
> [2] 
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
>
>
>
>
> On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Spoke too soon[1], looks like others have been adding Turing support as
> > well (thanks to those helping with this).  I believe there's still a few
> > changes we'd have to make to claim support though (mshadow CMake changes,
> > PyPi package creation tweaks).
> >
> > 1:
> >
> > https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> >
> > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > > regression in master which causes incorrect feature vectors to be output
> > > when using the TensorRT feature.  (Thanks to Nathalie for helping me
> > > track down the root cause of the issue.)  I'm currently blocked on a CI
> > > issue I haven't seen before, but hope to have it resolved by EOW.
> > >
> > > One call-out I would make is that we currently don't support Turing
> > > architecture (sm_75).  I've been slowly trying to add support, but I
> > > don't think I'd have capacity to get this done by EOW.  Does anyone feel
> > > strongly we need this in the 1.4 release?  From my perspective this will
> > > already be a strong release without it.
> > >
> > > On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel 
> > > wrote:
> > >
> > >> Thanks Patric, let's target to get the PRs merged this week.
> > >>
> > >> Call for contributions from the community: Right now we have 10 PRs
> > >> awaiting merge
> > >> <https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+>
> > >> and we have 61 open PRs awaiting review
> > >> <https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-review>.
> > >> I would appreciate it if you all could help review the open PRs and if
> > >> the committers could drive the merges before code freeze for 1.4.0.
> > >>
> > >> The contributors on the Java API are making progress, but not all
> > >> performance issues are resolved. With some luck it should be possible to
> > >> code freeze towards the end of this week.
> > >>
> > >> Are there other critical features/bugs/PRs you think need to be included
> > >> in 1.4.0? If so, please communicate as soon as possible.
> > >>
> > >> Regards,
> > >> Steffen
> > >>
> > >> On Mon, Nov 19, 2018 at 8:26 PM Zhao, Patric 
> > >> wrote:
> > >>
> > >> > Thanks, Steffen. I think there is NO open issue blocking MKLDNN from
> > >> > going GA now.
> > >> >
> > >> > BTW, several quantization-related PRs (#13297, #13260) are under
> > >> > review and I think they can be merged this week.
> > >> >
> > >> > Thanks,
> > >> >
> > >> > --Patric

[ANNOUNCE] Release Apache MXNet (incubating) version 1.3.1

2018-11-29 Thread Anton Chernov
Dear all,

The Apache MXNet (incubating) community is happy to announce Apache MXNet
(incubating) version 1.3.1!

Apache MXNet (incubating) is a deep learning framework designed for both
efficiency and flexibility. It allows you to mix symbolic and imperative
programming to maximize efficiency and productivity.

1.3.1 is a maintenance release incorporating important bug fixes and
performance improvements.

A full list of the changes in this release can be found in the release
notes:
https://cwiki.apache.org/confluence/x/eZGzBQ

A link to the download can be found here:
http://mxnet.incubator.apache.org/install/download.html

If you prefer to build from source and experiment with various compile-time
configuration options, use this link to get the instructions:
http://mxnet.incubator.apache.org/install/index.html

Or you can download and play with MXNet easily using one of the options
below:

1. The Pip packages can be found here:
https://pypi.python.org/pypi/mxnet

2. The Docker Images can be found here:
https://hub.docker.com/r/mxnet/python/

Links in Maven to the published Scala packages:

https://repository.apache.org/content/repositories/releases/org/apache/mxnet/
https://repository.apache.org/#nexus-search;quick~org.apache.mxnet

and to the experimental Clojure packages:
https://repository.apache.org/content/repositories/releases/org/apache/mxnet/contrib/clojure/

The Docker images:
https://hub.docker.com/u/mxnet/

The Pip package:
https://pypi.python.org/pypi/mxnet

The Release Tag:
https://github.com/apache/incubator-mxnet/tree/1.3.1

MXNet Resources
- Our discussion forum (https://discuss.mxnet.io)
- MXNet user mailing list (
https://lists.apache.org/list.html?u...@mxnet.apache.org)
- MXNet dev mailing list (
https://lists.apache.org/list.html?d...@mxnet.apache.org)
- StackOverflow mxnet tag (https://stackoverflow.com/questions/tagged/mxnet)
- MXNet website (https://mxnet.incubator.apache.org/faq/)
- Github issues (https://github.com/apache/incubator-mxnet/issues)
- Wiki (https://cwiki.apache.org/confluence/display/MXNET)

Attend one of the regular user group meetings:
https://cwiki.apache.org/confluence/x/7BY0BQ

For more information on Apache MXNet (incubating), please see:
https://mxnet.io


Best regards,
Apache MXNet (incubating) Team

___

DISCLAIMER:

Apache MXNet (incubating) is an effort undergoing incubation at The Apache
Software Foundation (ASF), sponsored by the Apache Incubator PMC.
Incubation is required of all newly accepted projects until a further
review indicates that the infrastructure, communications, and decision
making process have stabilized in a manner consistent with other successful
ASF projects. While incubation status is not necessarily a reflection of
the completeness or stability of the code, it does indicate that the
project has yet to be fully endorsed by the ASF.

https://cwiki.apache.org/confluence/x/BINjB