Re: Requesting slack access

2018-09-12 Thread Naveen Swamy
done.

On Wed, Sep 12, 2018 at 11:09 AM, Chaitanya Bapat 
wrote:

> Hello,
>
> Chaitanya here. Requesting slack access.
> Thanks
>
> --
> *Chaitanya Prakash Bapat*
> *+1 (973) 953-6299*
>


Requesting slack access

2018-09-12 Thread Chaitanya Bapat
Hello,

Chaitanya here. Requesting slack access.
Thanks

-- 
*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*




Requesting slack access

2018-09-12 Thread Chaitanya Bapat
Requesting slack access

-- 
*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*




Re: [DISCUSS] Build OSX builds in CI (possibly with TravisCI).

2018-09-12 Thread kellen sunderland
We've got fairly limited ability to change what's reported by Travis.  Most
administration is done by the ASF Infra crew, so it's tough for us to
experiment with settings.  It'd be great if you could bear with us for a
few days.  It shouldn't take too long to either (1) get happy-feeling green
checks back, or (2) decide we don't care as much as we thought we did about
MacOS support.

On Wed, Sep 12, 2018 at 9:53 PM Aaron Markham 
wrote:

> Is there any way to make it not show a red X failure in the GitHub UI when
> TravisCI fails? I keep going back to check what flakey test failed this
> time and realizing that Jenkins is still running and it was the "not
> required" Travis fail. The green checkmark makes me happy and it's easier
> to keep an eye on what's going on. If Travis times out a lot of the time,
> then most of our PRs will look red/bad/sad when they're not.
>
> What about no failure flag set, but add a label that Travis failed or
> if we can't control the flag, auto-set labels for each Travis and Jenkins
> pass/fail so we still get the benefit of at-a-glance status checks.
>
> On Wed, Sep 12, 2018 at 6:04 AM Marco de Abreu
>  wrote:
>
> > Hello,
> >
> > Travis CI has successfully been enabled just now. This means you will now
> > see a new status under your PR which is called
> > "continuous-integration/travis-ci/pr".
> >
> > The job only compiles MXNet on Mac and currently does not run unit tests
> -
> > we expect the overall execution duration to be around 6 minutes and thus
> > faster than the full Jenkins pipeline. The status is set to "not
> required"
> > which means that it does not block merging if that job fails since the
> > pipeline is still in beta. But in general, it would be good if committers
> > review the results in case the job shows a failure. Our last known state
> is
> > that the pipeline works properly, but we will keep everybody up to date
> in
> > case we get aware of any problems.
> >
> > The next step will be integration of Python CPU unit tests. There will
> be a
> > separate email if we got an update on that manner.
> >
> > Special thanks to Kellen Sunderland for the contribution of this Travis
> CI
> > pipeline.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Sep 5, 2018 at 8:19 PM Tianqi Chen 
> > wrote:
> >
> > > Alrite, then I think it is fine as long as we can kept up with build
> > speed
> > > without timeout.
> > >
> > >
> > > Tianqi
> > >
> > > On Wed, Sep 5, 2018 at 9:14 AM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Travis actually has explicit support for ccache, it's a platform
> > feature.
> > > > I've run it and it seems to work quite well.  See for example this
> > build:
> > > >
> > https://travis-ci.org/KellenSunderland/incubator-mxnet/builds/424768656
> > > >
> > > > On Wed, Sep 5, 2018 at 7:10 PM Tianqi Chen  >
> > > > wrote:
> > > >
> > > > > Travis it self is stateless, which means ccache is not likely going
> > to
> > > > > work. As far as I understand, if jenkins master is in the public
> > > domain,
> > > > > you do not need to setup a vpn to the subset of the master.
> > > > >
> > > > > As for versions of MacOS, we are likely going to be fine with one
> > > > version,
> > > > > as usually the problems exhibits on mac are similar
> > > > >
> > > > > Tianqi
> > > > > On Wed, Sep 5, 2018 at 9:04 AM kellen sunderland <
> > > > > kellen.sunderl...@gmail.com> wrote:
> > > > >
> > > > > > @Tianqi: Yeah there's going to be a lot of trade-offs to using
> > > > Travis.  I
> > > > > > hope we can get it running fast enough with ccache that it won't
> > > > timeout
> > > > > > when running tests, but even that is questionable.  In my private
> > > > testing
> > > > > > it was running in about 35 minutes and the global timeout for
> > Travis
> > > > jobs
> > > > > > is 45 minutes.  I'd say let's run it for a few builds and see how
> > it
> > > > > goes.
> > > > > > It won't be enabled in a mode that blocks PRs any time soon.
> > > > > >
> > > > > > I don't think physical hardware is a great solution.  We would
> have
> > > to
> > > > > > purchase the hardware, then maintain security updates, install
> > > > different
> > > > > > versions of XCode / MacOS, setup a vpn to our jenkins master,
> > etc.  I
> > > > > would
> > > > > > also worry that if the machine goes down for whatever reason it
> > would
> > > > > block
> > > > > > PRs, and someone would have to be physically present to turn it
> > back
> > > > on.
> > > > > > Even assuming we set all the hardware up it's still not scalable
> so
> > > > we'd
> > > > > > have to over-provision.
> > > > > >
> > > > > > I'm hoping the Travis solution works for the time being. If it
> > > doesn't
> > > > > > we'll have to take a look at a few other options, but I've spent
> a
> > > fair
> > > > > > amount of time thinking about this and I don't think there are
> any
> > > good
> > > > > > options that don't have trade-offs.
> > > > > >
> > > > > > @Lin: Great!  Thanks for the 

Re: Acknowledgement for 1.3.0 release (was: Re: [RESULT][VOTE] Release MXNet version 1.3.0)

2018-09-12 Thread Steffen Rochel
Thanks to all contributors, and to Roshani and Sheng for managing the 1.3
release and getting it into the hands of our users.

Is anybody interested in managing the next release?
Steffen

On Wed, Sep 12, 2018 at 9:56 AM Zha, Sheng 
wrote:

> Hi all,
>
> We would like to thank to all who contributed to the 1.3.0 release:
> Aaron Markham, Alex Li, Alexander Zai, Amol Lele, Andrew Ayres, Anirudh
> Acharya, Anirudh Subramanian, Ankit Khedia, Anton Chernov, starimpact,
> Asmus Hetzel, Aston Zhang, brli, Burness Duan, cclauss, chinakook, ctcyang,
> Da Zheng, Deokjae Lee, Dick Carter, Eric Junyuan Xie, Felix Hieber, Hagay
> Lupesko, Haibin Lin, Hang Zhang, Hao Jin, Hao Li, Haozhi Qi, Hu Shiwen,
> Indhu Bharathi, Istvan Fehervari, JackieWu, James MacGlashan, jeremiedb,
> Jerry Zhang, Jian Guo, Jin Huang, Jun Wu, Kalyanee Chendke, Kellen
> Sunderland, kpmurali, Leonard Lausen, Lin Yuan, Marco de Abreu, Marek
> Kolodziej, Mu Li, Nan Zhu, Naveen Swamy, Nehal J Wani, PatricZhao, Pedro
> Larroy, Pracheer Gupta, Przemyslaw Tredak, Qiang Kou, Qing Lan, Rahul
> Huilgol, Robert Stone, Roshani Nagmote, Sandeep Krishnamurthy, Sebastian
> Bodenstein, Sergey Kolychev, Sergey Sokolov, Sheng Zha, Sheng-Ying, Simon,
> Sina Afrooze, solin319, Soonhwan-Kwon, Steffen Rochel, Taliesin Beynon, Tao
> Lv, Thom Lane, ThomasDelteil, Tianqi Chen, Tong He, Wei Wu, Wen-Yang Chu,
> Xingjian Shi, Xinyu Chen, yifeim, Yizhi Liu, Yu-Xiang Wang, Yuan Tang,
> Yuntao Chen, Zhi Zhang, Ziyue Huang, Shuai Zheng, Junru Shao, Philip Hyunsu
> Cho
>
> Especially we would like to thank for first time contributions from:
> bl0, Abhinav Sharma, access2rohit, Alexander Alexandrov, Arunkumar V
> Ramanan, Burin Choomnuan, Carin Meier, Carl Tsai, Chance Bair, Chudong
> Tian, ciyong, Dang Trung Kien, Francisco Facioni, Frank Liu, Gnanesh,
> Huilin Qu, Jake Lee, jimdunn, Jingbei Li, Lai Wei, Milan Desai, Mingkun
> Huang, Paul Stadig, perdasilva, Piyush Ghai, qiuhan, Rakesh Vasudevan, Ray
> Zhang, Sam Skalicky, Soji Adeshina, Todd Sundsted, Vishaal Kapoor,
> YouRancestor, Yuelin Zhang, Zach Kimberg, zhiyuan-huang, Zhuo Zhang, Ziyi
> Mu, luobao-intel, Manu Seth, Matthew Brookhart, Vandana Kannan, vdantu
>
> And a friendly welcome and thank you for everybody who provided PR
> feedback for the first time:
> aplikaplik, Ben Kamphaus, Caenorst, Cliff Woolley, Didier A., Faldict,
> hasanmua, Kovas Boguta, Kurman Karabukaev, Lianmin Zheng, lufenamazon,
> miteshyh, Philip Hyunsu Cho, Pishen Tsai, Shen Zhu, slitsey, wangzhe,
> xcgoner, Zhennan Qin
>
> Best regards,
> -sz
>
> On 9/7/18, 11:19 AM, "Roshani Nagmote"  wrote:
>
> Hi All,
>
> So, this vote passes with *seven* +1, *two* 0  and *three* -1 votes.
>
> *+1 votes*
> *Committers:*
> - Joshua Zhang
> - Carin
> - Naveen
> - Indu
> - Haibin
>
> *Community:*
> - Pigeon Lucky
> - Steffen
> *0 votes:*
> *Community:*
> - Thomas
> - Aaron
> *-1 votes:*
> *Committers:*
> - Sandeep
> - Anirudh
>
> *Community:*
> - Hagay
>
> *Vote Thread:*
>
>
> https://lists.apache.org/thread.html/8ad6f14811be465cdf663d6962980fd95e12193626292631a21ec6f1@%3Cdev.mxnet.apache.org%3E
>
>
> I will continue with the release process on general@ and the release
> announcement will follow in the next few days.
>
> Thanks,
> Roshani
>
>
>


Re: [DISCUSS] Build OSX builds in CI (possibly with TravisCI).

2018-09-12 Thread Aaron Markham
Is there any way to make it not show a red X failure in the GitHub UI when
TravisCI fails? I keep going back to check which flaky test failed this
time, only to realize that Jenkins is still running and it was the "not
required" Travis failure. The green checkmark makes me happy and it's easier
to keep an eye on what's going on. If Travis times out a lot of the time,
then most of our PRs will look red/bad/sad when they're not.

What about setting no failure flag, but adding a label that Travis failed?
Or, if we can't control the flag, auto-setting labels for each Travis and
Jenkins pass/fail, so we still get the benefit of at-a-glance status checks?
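
The label idea above could be sketched roughly as follows. This is only a
minimal Python illustration: the label names, the set of required checks, and
the shape of the `results` dict are all assumptions, and nothing here calls
the GitHub, Travis, or Jenkins APIs.

```python
# Hypothetical sketch: derive at-a-glance PR labels from CI results.
# Label names and the `results` layout are assumptions for illustration.

REQUIRED = {"jenkins"}  # checks that are allowed to block merging


def labels_for(results):
    """Turn {'jenkins': 'pass', 'travis': 'fail'} into one label per check."""
    return [f"ci/{name}-{results[name]}" for name in sorted(results)]


def overall_status(results):
    """Only required checks decide the red/green status of the PR."""
    states = [results.get(name, "pending") for name in sorted(REQUIRED)]
    if "fail" in states:
        return "fail"
    if "pending" in states:
        return "pending"
    return "pass"
```

With this split, a failing "not required" Travis job would still show up as a
label, but it would not turn the overall status red.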

On Wed, Sep 12, 2018 at 6:04 AM Marco de Abreu
 wrote:

> Hello,
>
> Travis CI has successfully been enabled just now. This means you will now
> see a new status under your PR which is called
> "continuous-integration/travis-ci/pr".
>
> The job only compiles MXNet on Mac and currently does not run unit tests -
> we expect the overall execution duration to be around 6 minutes and thus
> faster than the full Jenkins pipeline. The status is set to "not required"
> which means that it does not block merging if that job fails since the
> pipeline is still in beta. But in general, it would be good if committers
> review the results in case the job shows a failure. Our last known state is
> that the pipeline works properly, but we will keep everybody up to date in
> case we get aware of any problems.
>
> The next step will be integration of Python CPU unit tests. There will be a
> separate email if we got an update on that manner.
>
> Special thanks to Kellen Sunderland for the contribution of this Travis CI
> pipeline.
>
> Best regards,
> Marco
>
> On Wed, Sep 5, 2018 at 8:19 PM Tianqi Chen 
> wrote:
>
> > Alrite, then I think it is fine as long as we can kept up with build
> speed
> > without timeout.
> >
> >
> > Tianqi
> >
> > On Wed, Sep 5, 2018 at 9:14 AM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Travis actually has explicit support for ccache, it's a platform
> feature.
> > > I've run it and it seems to work quite well.  See for example this
> build:
> > >
> https://travis-ci.org/KellenSunderland/incubator-mxnet/builds/424768656
> > >
> > > On Wed, Sep 5, 2018 at 7:10 PM Tianqi Chen 
> > > wrote:
> > >
> > > > Travis it self is stateless, which means ccache is not likely going
> to
> > > > work. As far as I understand, if jenkins master is in the public
> > domain,
> > > > you do not need to setup a vpn to the subset of the master.
> > > >
> > > > As for versions of MacOS, we are likely going to be fine with one
> > > version,
> > > > as usually the problems exhibits on mac are similar
> > > >
> > > > Tianqi
> > > > On Wed, Sep 5, 2018 at 9:04 AM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > @Tianqi: Yeah there's going to be a lot of trade-offs to using
> > > Travis.  I
> > > > > hope we can get it running fast enough with ccache that it won't
> > > timeout
> > > > > when running tests, but even that is questionable.  In my private
> > > testing
> > > > > it was running in about 35 minutes and the global timeout for
> Travis
> > > jobs
> > > > > is 45 minutes.  I'd say let's run it for a few builds and see how
> it
> > > > goes.
> > > > > It won't be enabled in a mode that blocks PRs any time soon.
> > > > >
> > > > > I don't think physical hardware is a great solution.  We would have
> > to
> > > > > purchase the hardware, then maintain security updates, install
> > > different
> > > > > versions of XCode / MacOS, setup a vpn to our jenkins master,
> etc.  I
> > > > would
> > > > > also worry that if the machine goes down for whatever reason it
> would
> > > > block
> > > > > PRs, and someone would have to be physically present to turn it
> back
> > > on.
> > > > > Even assuming we set all the hardware up it's still not scalable so
> > > we'd
> > > > > have to over-provision.
> > > > >
> > > > > I'm hoping the Travis solution works for the time being. If it
> > doesn't
> > > > > we'll have to take a look at a few other options, but I've spent a
> > fair
> > > > > amount of time thinking about this and I don't think there are any
> > good
> > > > > options that don't have trade-offs.
> > > > >
> > > > > @Lin: Great!  Thanks for the offer.  There'll be a few features we
> > want
> > > > to
> > > > > re-enable once the Job gets hooked up again.  I'll ping you when
> it's
> > > > ready
> > > > > and see if there's anything you think would be interesting to help
> > > with.
> > > > >
> > > > > -Kellen
> > > > >
> > > > > On Wed, Sep 5, 2018 at 6:58 PM Lin Yuan 
> wrote:
> > > > >
> > > > > > Hi Kellen,
> > > > > >
> > > > > > I would love to contribute. Please let me know if you have any
> > > > particular
> > > > > > work item that I can help.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Lin
> > > > > >
> > > > > > On Wed, Sep 5, 2018 at 9:51 AM Tianqi Chen <
> > tqc...@cs.washington.edu
> > > >
> > > > > > 

Re: Off-Heap Memory Management in MXNet Scala

2018-09-12 Thread Naveen Swamy
Thank you all for your feedback.

@Chris: Yes, one of the Amazon users (Calum Leslie) contributed the
Dispose Pattern, removing the free of native handles from Finalizers and
adding a log message instead. This was done because calling free in
Finalizers was segfaulting the application at random points and was very
hard to reproduce and debug.
The dispose pattern worked for some cases but made code cumbersome to read,
because you have to keep track of all the objects that were created (imagine
slice/reshape: instead of writing expressions, you are now creating
unnecessary variables and calling dispose on them).
As the 1st graph in the design shows, despite carefully calling dispose on
most objects, there was a constant memory leak, and diagnosing leaks wasn't
straightforward. Note that Finalizers run on a separate thread, some time
after the object was found unreachable.
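
To make the readability cost of the dispose pattern concrete, here is a small
Python sketch. It is not the MXNet-Scala API: `NativeArray`, `slice`, `scale`
and the `live` counter are hypothetical stand-ins for wrappers around native
memory.

```python
# Illustrative sketch of an explicit dispose pattern (NOT the MXNet API).
# NativeArray is a hypothetical wrapper around a native allocation.

class NativeArray:
    live = 0  # number of allocations not yet disposed (demo bookkeeping)

    def __init__(self, data):
        self.data = list(data)
        self.disposed = False
        NativeArray.live += 1

    def slice(self, start, stop):
        return NativeArray(self.data[start:stop])

    def scale(self, k):
        return NativeArray([k * x for x in self.data])

    def dispose(self):
        # Idempotent: releasing twice must not double-free.
        if not self.disposed:
            self.disposed = True
            NativeArray.live -= 1


def chained(a):
    # Reads like an expression, but the result of slice() is never disposed.
    return a.slice(0, 2).scale(10)


def careful(a):
    # Leak-free, but every intermediate needs a name and a dispose() call.
    tmp = a.slice(0, 2)
    try:
        return tmp.scale(10)
    finally:
        tmp.dispose()


def leaked_by(fn):
    """Count allocations still live after the caller disposes what it can see."""
    before = NativeArray.live
    a = NativeArray([1, 2, 3])
    result = fn(a)
    result.dispose()
    a.dispose()
    return NativeArray.live - before
```

Here `leaked_by(chained)` reports one leaked intermediate, while
`leaked_by(careful)` reports none, at the price of the extra variable and the
try/finally.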

@Timur, thanks for the feedback.
1) No, the goal here is to manage native memory that is created for various
operations. In MXNet-Scala most objects live in the C++ heap and the Scala
objects are wrappers around them; when the MXNet engine runs operations, it
expects the objects to be accessible in the C++ heap.

2) Agreed, MNIST is not representative. The goal was to understand and show
that the existing code has hard-to-debug memory leaks (even for MNIST). I
was aiming to test my prototype code and see if my changes make a
difference. Yizhi suggested I run tests against the RESNET50 model, which I
will do as a part of my implementation. I think this is a standard benchmark
model that is widely used. Also note that most of the MXNet-Scala use cases
that we have seen are for inference.

3) No, we haven't created a branch for Java-API work, please look at this
design and kindly leave your feedback:
https://cwiki.apache.org/confluence/display/MXNET/MXNet+Java+Inference+API

4) Calling System.gc() will be configurable (including not calling GC at
all). One piece of feedback that I got from a user is that calling System.gc
on the user's behalf is intrusive, which I think is also the point you are
making.

5) Understood and agreed. I see calling GC as only a part of the solution
and as a configurable option. For GPUs, training, and other memory-intensive
applications, ResourceScope would be a very good option.

Another alternative is to create ByteBuffers in Java and map the C++
pointers to the JVM heap by tapping into the native malloc/free. That way the
JVM is aware of all the memory that is allocated and can free it
appropriately whenever the objects become unreachable. I have to note that
this still does not solve the problem of memory accumulating until GC has
kicked in. This approach is also very involved and might not be tenable.

@Marco, thanks for your comments.
1) The JVM kicks off GC when it feels pressure on the JVM heap, not on CPU
RAM. Objects on the GPU are not special: they are still off-heap (outside
the JVM heap), so this would work. Look at the graph in the doc that shows
the GAN example running on GPUs.

2) I am not looking to rewrite the memory allocation in MXNet; that will
still be handled by the C++ backend. The goal here is to free native memory
(reduce the shared pointer count) when JVM objects go out of scope (become
unreachable).
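
The idea of releasing native memory once a wrapper becomes unreachable can be
sketched in Python with `weakref.finalize`. This is an analogy only, not the
proposed Scala design: `NativePool` and `Wrapper` are hypothetical names, and
CPython's reference counting runs the callback almost immediately, whereas on
the JVM the release would wait until the collector notices the object.

```python
# Analogy in Python (NOT the proposed Scala implementation).
# NativePool is a hypothetical stand-in for the C++ backend's allocator.
import gc
import weakref


class NativePool:
    def __init__(self):
        self.next_id = 0
        self.live = set()  # handles that have been allocated and not freed

    def alloc(self):
        self.next_id += 1
        self.live.add(self.next_id)
        return self.next_id

    def free(self, handle):
        self.live.discard(handle)


class Wrapper:
    """Managed-side object holding a handle to native memory."""

    def __init__(self, pool):
        self.handle = pool.alloc()
        # The callback receives the handle, not `self`; referencing `self`
        # here would keep the wrapper reachable forever.
        weakref.finalize(self, pool.free, self.handle)


pool = NativePool()
w = Wrapper(pool)
held = set(pool.live)   # one handle is live while the wrapper is reachable
del w                   # wrapper becomes unreachable
gc.collect()            # not needed in CPython, shown for the JVM analogy
released = set(pool.live)
```

The allocator itself is untouched; the wrapper only tells it when a handle is
no longer needed.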


@Carin, yes hopefully this would alleviate the memory management headache
for our users.

Hope that makes sense.

Thanks, Naveen


On Wed, Sep 12, 2018 at 6:06 AM, Carin Meier  wrote:

> Naveen,
>
> Thanks for putting together the detailed document and kickstarting this
> effort. It will benefit all the MXNet JVM users and will help solve a
> current pain point for them.
>
> - Carin
>
> On Tue, Sep 11, 2018 at 5:37 PM Naveen Swamy  wrote:
>
> > Hi All,
> >
> > I am working on managing Off-Heap Memory Management and have written a
> > proposal here based on my prototype and research I did.
> >
> > Please review the doc and provide your feedback ?
> >
> > https://cwiki.apache.org/confluence/display/MXNET/JVM+Memory+Management
> >
> > I had offline discussion with a few people I work with and added their
> > feedback to the doc as well.
> >
> > Thanks, Naveen
> >
>


Acknowledgement for 1.3.0 release (was: Re: [RESULT][VOTE] Release MXNet version 1.3.0)

2018-09-12 Thread Zha, Sheng
Hi all,

We would like to thank all who contributed to the 1.3.0 release:
Aaron Markham, Alex Li, Alexander Zai, Amol Lele, Andrew Ayres, Anirudh 
Acharya, Anirudh Subramanian, Ankit Khedia, Anton Chernov, starimpact, Asmus 
Hetzel, Aston Zhang, brli, Burness Duan, cclauss, chinakook, ctcyang, Da Zheng, 
Deokjae Lee, Dick Carter, Eric Junyuan Xie, Felix Hieber, Hagay Lupesko, Haibin 
Lin, Hang Zhang, Hao Jin, Hao Li, Haozhi Qi, Hu Shiwen, Indhu Bharathi, Istvan 
Fehervari, JackieWu, James MacGlashan, jeremiedb, Jerry Zhang, Jian Guo, Jin 
Huang, Jun Wu, Kalyanee Chendke, Kellen Sunderland, kpmurali, Leonard Lausen, 
Lin Yuan, Marco de Abreu, Marek Kolodziej, Mu Li, Nan Zhu, Naveen Swamy, Nehal 
J Wani, PatricZhao, Pedro Larroy, Pracheer Gupta, Przemyslaw Tredak, Qiang Kou, 
Qing Lan, Rahul Huilgol, Robert Stone, Roshani Nagmote, Sandeep Krishnamurthy, 
Sebastian Bodenstein, Sergey Kolychev, Sergey Sokolov, Sheng Zha, Sheng-Ying, 
Simon, Sina Afrooze, solin319, Soonhwan-Kwon, Steffen Rochel, Taliesin Beynon, 
Tao Lv, Thom Lane, ThomasDelteil, Tianqi Chen, Tong He, Wei Wu, Wen-Yang Chu, 
Xingjian Shi, Xinyu Chen, yifeim, Yizhi Liu, Yu-Xiang Wang, Yuan Tang, Yuntao 
Chen, Zhi Zhang, Ziyue Huang, Shuai Zheng, Junru Shao, Philip Hyunsu Cho
 
We would especially like to thank the first-time contributors:
bl0, Abhinav Sharma, access2rohit, Alexander Alexandrov, Arunkumar V Ramanan, 
Burin Choomnuan, Carin Meier, Carl Tsai, Chance Bair, Chudong Tian, ciyong, 
Dang Trung Kien, Francisco Facioni, Frank Liu, Gnanesh, Huilin Qu, Jake Lee, 
jimdunn, Jingbei Li, Lai Wei, Milan Desai, Mingkun Huang, Paul Stadig, 
perdasilva, Piyush Ghai, qiuhan, Rakesh Vasudevan, Ray Zhang, Sam Skalicky, 
Soji Adeshina, Todd Sundsted, Vishaal Kapoor, YouRancestor, Yuelin Zhang, Zach 
Kimberg, zhiyuan-huang, Zhuo Zhang, Ziyi Mu, luobao-intel, Manu Seth, Matthew 
Brookhart, Vandana Kannan, vdantu
 
And a friendly welcome and thank you to everybody who provided PR feedback for
the first time:
aplikaplik, Ben Kamphaus, Caenorst, Cliff Woolley, Didier A., Faldict, 
hasanmua, Kovas Boguta, Kurman Karabukaev, Lianmin Zheng, lufenamazon, 
miteshyh, Philip Hyunsu Cho, Pishen Tsai, Shen Zhu, slitsey, wangzhe, xcgoner, 
Zhennan Qin

Best regards,
-sz

On 9/7/18, 11:19 AM, "Roshani Nagmote"  wrote:

Hi All,

So, this vote passes with *seven* +1, *two* 0  and *three* -1 votes.

*+1 votes*
*Committers:*
- Joshua Zhang
- Carin
- Naveen
- Indu
- Haibin

*Community:*
- Pigeon Lucky
- Steffen
*0 votes:*
*Community:*
- Thomas
- Aaron
*-1 votes:*
*Committers:*
- Sandeep
- Anirudh

*Community:*
- Hagay
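
The totals in the summary line can be checked against the lists above. A
small Python sketch, with the names copied from this message and the dict
layout chosen purely for illustration:

```python
# Tally of the votes as listed in this message (layout is illustrative).
votes = {
    "+1": {
        "committers": ["Joshua Zhang", "Carin", "Naveen", "Indu", "Haibin"],
        "community": ["Pigeon Lucky", "Steffen"],
    },
    "0": {
        "committers": [],
        "community": ["Thomas", "Aaron"],
    },
    "-1": {
        "committers": ["Sandeep", "Anirudh"],
        "community": ["Hagay"],
    },
}

# Sum committer and community votes for each vote value.
totals = {
    value: sum(len(names) for names in groups.values())
    for value, groups in votes.items()
}
```

The resulting totals are seven +1, two 0 and three -1, matching the summary.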

*Vote Thread:*


https://lists.apache.org/thread.html/8ad6f14811be465cdf663d6962980fd95e12193626292631a21ec6f1@%3Cdev.mxnet.apache.org%3E


I will continue with the release process on general@ and the release
announcement will follow in the next few days.

Thanks,
Roshani




Re: [RESULT][VOTE] Release MXNet version 1.3.0

2018-09-12 Thread Roshani Nagmote
Thanks Chris. Will keep it in mind next time. :)

Regards,
Roshani

On Fri, Sep 7, 2018 at 12:07 PM Chris Olivier  wrote:

> nit: using the "-" before peoples' names makes this kind of hard to read
> for me, since "-" is part of "-1"
>
> On Fri, Sep 7, 2018 at 11:18 AM Roshani Nagmote  >
> wrote:
>
> > Hi All,
> >
> > So, this vote passes with *seven* +1, *two* 0  and *three* -1 votes.
> >
> > *+1 votes*
> > *Committers:*
> > - Joshua Zhang
> > - Carin
> > - Naveen
> > - Indu
> > - Haibin
> >
> > *Community:*
> > - Pigeon Lucky
> > - Steffen
> > *0 votes:*
> > *Community:*
> > - Thomas
> > - Aaron
> > *-1 votes:*
> > *Committers:*
> > - Sandeep
> > - Anirudh
> >
> > *Community:*
> > - Hagay
> >
> > *Vote Thread:*
> >
> >
> >
> https://lists.apache.org/thread.html/8ad6f14811be465cdf663d6962980fd95e12193626292631a21ec6f1@%3Cdev.mxnet.apache.org%3E
> >
> >
> > I will continue with the release process on general@ and the release
> > announcement will follow in the next few days.
> >
> > Thanks,
> > Roshani
> >
>


Re: [VOTE] Release MXNet version 1.3.0.RC0

2018-09-12 Thread Roshani Nagmote
Thanks everyone for testing and voting for the release. I am working with
Sheng to finalize and post the release. Announcement will follow soon.

Regards,
Roshani

On Mon, Sep 10, 2018 at 7:03 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Tracked down the issue referred to above and it's not a bug.   I'll update
> the ticket.
>
> Changing to +1.
>
> On Mon, Sep 10, 2018 at 3:00 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > -0.1
> >
> > There's one test failure I've run into (details below).  Following
> Indhu's
> > logic I don't think this should block the release as it's not relating
> to a
> > release feature introduced in this version.
> >
> > I'm trying to use the cpp-package examples as reference code for how to
> > run MXNet models from a native context. I'd like to run them with ASAN
> as a
> > sanity check for memory leaks and pointer errors.  I was continually
> > running into segfaults and crashes w/ and w/o ASAN.  A little googling
> > shows me that this issue has already been reported, and is related to
> > running tests on CPU, not to any changes I made:
> > https://github.com/apache/incubator-mxnet/issues/9814  Having what our
> > effectively our reference examples crash is not a good practice IMO.
> >
> > I also share some concerns around the fp16 failures.  I know developers
> > who are currently porting their models to Gluon who use fp16.  They'll be
> > disappointed with the error.
> >
> > In general though, release looks good.  Big thanks to Sheng and Roshani
> > for putting it together (and sorry for the late testing).
> >
> > -Kellen
> >
> >
> > On Fri, Sep 7, 2018 at 4:31 AM Anirudh  wrote:
> >
> >> -1 Considering that using fp16 with gluon is much easier than the
> >> alternative where you need access to the model code, this fix is really
> >> useful. I understand the pain of doing mxnet release and appreciate
> >> Roshani
> >> and Shengs efforts, but this seems like something we should fix.
> >>
> >> On Thu, Sep 6, 2018, 4:57 PM Haibin Lin 
> wrote:
> >>
> >> > +1 built from source and passes dist_sync_kvstore test on Ubuntu.
> >> >
> >> > Best,
> >> > Haibin
> >> >
> >> > On Thu, Sep 6, 2018 at 1:32 PM Indhu  wrote:
> >> >
> >> > > +1
> >> > >
> >> > > The release candidate looks good. I'm able to build and run basic
> >> models.
> >> > >
> >> > > One the FP16 issue:
> >> > >
> >> > > Like others have pointed out, releases on expensive in terms of time
> >> and
> >> > > effort. There needs to be a high and more objective bar on what
> >> qualifies
> >> > > as a release blocker to make sure we are not setting precedence for
> a
> >> lot
> >> > > of release blockers in future.
> >> > >
> >> > > I think a release blocker is justified only if there is a serious
> bug
> >> > > discovered in one of the features included in the release or if
> there
> >> is
> >> > a
> >> > > regression. Given FP16 supports is not a new feature claimed in this
> >> > > release and this is not a regression in this release candidate, I'm
> >> > > inclined to release this candidate and include the FP16 fix in a
> >> > subsequent
> >> > > release.
> >> > >
> >> > > Thanks,
> >> > > Indu
> >> > >
> >> > > On Wed, Sep 5, 2018 at 10:21 AM Aaron Markham <
> >> aaron.s.mark...@gmail.com
> >> > >
> >> > > wrote:
> >> > >
> >> > > > 0 (non-binding) If we have a problem that blocks users, and a
> >> solution
> >> > in
> >> > > > hand... then we should fix it, but not at the expense of starting
> >> the
> >> > > > release cycle again just for one fix. Users can cherry pick or
> build
> >> > from
> >> > > > master if they want the fix right away, right? I'd change my mind
> >> to -1
> >> > > if
> >> > > > this wasn't the case, with good reason, and if the user impact was
> >> > > critical
> >> > > > to adoption or risks abandonment.
> >> > > >
> >> > > >
> >> > > > On Wed, Sep 5, 2018 at 9:57 AM Roshani Nagmote <
> >> > > roshaninagmo...@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > I believe everyone here is working hard to make MXNet a better
> >> > > framework
> >> > > > > for users. It's completely okay to have different opinions, we
> can
> >> > > decide
> >> > > > > together if this issue is a blocker or not after voting time is
> >> over.
> >> > > > >
> >> > > > > As I mentioned before, voting will end at 7 pm today. So there
> is
> >> > still
> >> > > > > time to test the release. If there are any other issues anyone
> >> > finds, I
> >> > > > > will be happy to start the process again and work on RC1. For
> >> now, I
> >> > > want
> >> > > > > to encourage everyone to utilize this time and vote. :)
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Roshani
> >> > > > >
> >> > > > > On Tue, Sep 4, 2018 at 10:35 PM sandeep krishnamurthy <
> >> > > > > sandeep.krishn...@gmail.com> wrote:
> >> > > > >
> >> > > > > >1. As a Apache MXNet community member, I raised the concern
> >> of
> >> > > > broken
> >> > > > > >functionality for the user. I explained and provided 

Re: Off-Heap Memory Management in MXNet Scala

2018-09-12 Thread Carin Meier
Naveen,

Thanks for putting together the detailed document and kickstarting this
effort. It will benefit all the MXNet JVM users and will help solve a
current pain point for them.

- Carin

On Tue, Sep 11, 2018 at 5:37 PM Naveen Swamy  wrote:

> Hi All,
>
> I am working on managing Off-Heap Memory Management and have written a
> proposal here based on my prototype and research I did.
>
> Please review the doc and provide your feedback ?
>
> https://cwiki.apache.org/confluence/display/MXNET/JVM+Memory+Management
>
> I had offline discussion with a few people I work with and added their
> feedback to the doc as well.
>
> Thanks, Naveen
>


Re: [DISCUSS] Build OSX builds in CI (possibly with TravisCI).

2018-09-12 Thread Marco de Abreu
Hello,

Travis CI has successfully been enabled just now. This means you will now
see a new status under your PR which is called
"continuous-integration/travis-ci/pr".

The job only compiles MXNet on Mac and currently does not run unit tests -
we expect the overall execution duration to be around 6 minutes and thus
faster than the full Jenkins pipeline. The status is set to "not required"
which means that it does not block merging if that job fails since the
pipeline is still in beta. But in general, it would be good if committers
review the results in case the job shows a failure. Our last known state is
that the pipeline works properly, but we will keep everybody up to date if
we become aware of any problems.

The next step will be integration of Python CPU unit tests. We will send a
separate email once we have an update on that matter.

Special thanks to Kellen Sunderland for the contribution of this Travis CI
pipeline.

Best regards,
Marco

On Wed, Sep 5, 2018 at 8:19 PM Tianqi Chen  wrote:

> Alrite, then I think it is fine as long as we can kept up with build speed
> without timeout.
>
>
> Tianqi
>
> On Wed, Sep 5, 2018 at 9:14 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Travis actually has explicit support for ccache, it's a platform feature.
> > I've run it and it seems to work quite well.  See for example this build:
> > https://travis-ci.org/KellenSunderland/incubator-mxnet/builds/424768656
> >
> > On Wed, Sep 5, 2018 at 7:10 PM Tianqi Chen 
> > wrote:
> >
> > > Travis it self is stateless, which means ccache is not likely going to
> > > work. As far as I understand, if jenkins master is in the public
> domain,
> > > you do not need to setup a vpn to the subset of the master.
> > >
> > > As for versions of MacOS, we are likely going to be fine with one
> > version,
> > > as usually the problems exhibits on mac are similar
> > >
> > > Tianqi
> > > On Wed, Sep 5, 2018 at 9:04 AM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > @Tianqi: Yeah there's going to be a lot of trade-offs to using
> > Travis.  I
> > > > hope we can get it running fast enough with ccache that it won't
> > timeout
> > > > when running tests, but even that is questionable.  In my private
> > testing
> > > > it was running in about 35 minutes and the global timeout for Travis
> > jobs
> > > > is 45 minutes.  I'd say let's run it for a few builds and see how it
> > > goes.
> > > > It won't be enabled in a mode that blocks PRs any time soon.
> > > >
> > > > I don't think physical hardware is a great solution.  We would have
> to
> > > > purchase the hardware, then maintain security updates, install
> > different
> > > > versions of XCode / MacOS, setup a vpn to our jenkins master, etc.  I
> > > would
> > > > also worry that if the machine goes down for whatever reason it would
> > > block
> > > > PRs, and someone would have to be physically present to turn it back
> > on.
> > > > Even assuming we set all the hardware up it's still not scalable so
> > we'd
> > > > have to over-provision.
> > > >
> > > > I'm hoping the Travis solution works for the time being. If it
> doesn't
> > > > we'll have to take a look at a few other options, but I've spent a
> fair
> > > > amount of time thinking about this and I don't think there are any
> good
> > > > options that don't have trade-offs.
> > > >
> > > > @Lin: Great!  Thanks for the offer.  There'll be a few features we
> want
> > > to
> > > > re-enable once the Job gets hooked up again.  I'll ping you when it's
> > > ready
> > > > and see if there's anything you think would be interesting to help
> > with.
> > > >
> > > > -Kellen
> > > >
> > > > On Wed, Sep 5, 2018 at 6:58 PM Lin Yuan  wrote:
> > > >
> > > > > Hi Kellen,
> > > > >
> > > > > I would love to contribute. Please let me know if you have any
> > > particular
> > > > > work item that I can help.
> > > > >
> > > > > Best,
> > > > >
> > > > > Lin
> > > > >
> > > > > On Wed, Sep 5, 2018 at 9:51 AM Tianqi Chen <
> tqc...@cs.washington.edu
> > >
> > > > > wrote:
> > > > >
> > > > > > is it possible for us to get a MacBook and hook it to the current
> > > > Jenkins
> > > > > > CI? Travis OSX usually builds from scratch and that was pretty
> slow
> > > > > >
> > > > > > Tianqi
> > > > > >
> > > > > >
> > > > > > On Wed, Sep 5, 2018 at 8:49 AM kellen sunderland <
> > > > > > kellen.sunderl...@gmail.com> wrote:
> > > > > >
> > > > > > > Great you feel that way Lin, please feel free to contribute if
> > you
> > > > have
> > > > > > any
> > > > > > > features you'd like tested.  We are using the travis image
> > xcode9.4
> > > > > which
> > > > > > > is based on MacOS 10.13.
> > > > > > >
> > > > > > > On Wed, Sep 5, 2018 at 6:40 PM Lin Yuan 
> > > wrote:
> > > > > > >
> > > > > > > > Hi Kellen,
> > > > > > > >
> > > > > > > > Many thanks for your and Marco's effort! I think this is a
> very
> > > > > crucial
> > > > > > > > piece to improve MXNet stability.
> > > > > > > >
> > > > > > > > 

Re: Off-Heap Memory Management in MXNet Scala

2018-09-12 Thread Marco de Abreu
Interesting and detailed document!

The JVM garbage collector gets executed depending on the memory pressure
for CPU RAM (or a different custom strategy). It was mentioned that this
document also supports disposing GPU objects. Sorry if I missed it, but how
exactly are we ensuring we don't run out of memory on GPU?

There are quite a lot of cases where we get close to the limit of the
available GPU RAM even with explicit disposes. If we run out of memory, we
get a fatal exception and it's basically game over (as of the current
state). How do we handle these cases where we can't rely on paging and the
benefits of virtual memory?
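
To make the concern concrete: the JVM collector only sees the small on-heap
wrapper, not the native or GPU bytes behind it, so heap pressure says nothing
about how close the device is to its limit. One possible mitigation, sketched
below, is to track native bytes explicitly and refuse an allocation before
the device would fail. This is only an illustration, not what the proposal
document specifies: NativeBudget, Handle, reserve and used are hypothetical
names, and nothing here calls into MXNet.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: an explicit byte budget for off-heap/GPU allocations.
// None of these names come from MXNet or from the proposal document.
class NativeBudget {
    private final long limitBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    NativeBudget(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // A reservation that gives its bytes back when closed.
    public final class Handle implements AutoCloseable {
        private final long size;
        private boolean closed;

        private Handle(long size) {
            this.size = size;
        }

        @Override
        public synchronized void close() {
            if (!closed) {            // closing twice must not double-count
                closed = true;
                usedBytes.addAndGet(-size);
            }
        }
    }

    // Reserve bytes, or fail with a recoverable exception instead of
    // letting the (assumed) native allocator hit a fatal out-of-memory.
    Handle reserve(long size) {
        while (true) {
            long current = usedBytes.get();
            if (current + size > limitBytes) {
                throw new IllegalStateException(
                    "budget exceeded: " + current + " + " + size
                        + " > " + limitBytes);
            }
            if (usedBytes.compareAndSet(current, current + size)) {
                return new Handle(size);
            }
        }
    }

    long used() {
        return usedBytes.get();
    }

    public static void main(String[] args) {
        NativeBudget budget = new NativeBudget(100);
        try (NativeBudget.Handle h = budget.reserve(60)) {
            System.out.println("in use: " + budget.used());
        }
        System.out.println("after close: " + budget.used());
    }
}
```

A budget like this only helps if every native allocation goes through it, and
it still needs a policy for what to do on rejection (wait, free caches, or
surface the error to the caller).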

Best regards,
Marco

Timur Shenkao  schrieb am Mi., 12. Sep. 2018, 09:59:

> Thanks for the great job!
>
> My questions / proposals.
> 1) Have you considered Java collections with a low memory footprint, like
> Fastutil, Koloboke, etc.? They are much more memory efficient and they have
> "better correspondence" with low level data types.
> 2) The MNIST example on the page is "bad" because MNIST is handled pretty fast
> even on a laptop, i.e. we won't catch GC & off-heap problems.
> 3) Is the 1.2.0-java branch where the Java API work happens?
> 4) System.gc() behaves differently on various JVM platforms, JDK
> implementations, GC types. So, I am sure that we will get users' requests
> to eliminate this approach in the future.
> 5) Frameworks like mxnet aren't used separately. Folks have to integrate
> Spark or Spring with DL libraries. In that case, they often use CMS or
> even more archaic GCs, since G1GC isn't always a good fit for streaming or
> long-lived jobs.
>
>
>
> On Wednesday, September 12, 2018, Chris Olivier 
> wrote:
>
> > do you log on finalize() if the object wasn’t properly freed (ie
> > NDArray.finalize())? is that available in Scala?
> >
> > On Tue, Sep 11, 2018 at 6:12 PM Qing Lan  wrote:
> >
> > > Nice document! Way better than current .dispose() in Scala!
> > >
> > > Thanks,
> > > Qing
> > >
> > > On 9/11/18, 6:04 PM, "Chris Olivier"  wrote:
> > >
> > > wow, incredible document!
> > >
> > > On Tue, Sep 11, 2018 at 2:37 PM Naveen Swamy 
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > I am working on managing Off-Heap Memory Management and have
> > written
> > > a
> > > > proposal here based on my prototype and research I did.
> > > >
> > > > Please review the doc and provide your feedback ?
> > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/MXNET/JVM+Memory+Management
> > > >
> > > > I had offline discussion with a few people I work with and added
> > > their
> > > > feedback to the doc as well.
> > > >
> > > > Thanks, Naveen
> > > >
> > >
> > >
> > >
> >
>


Re: Off-Heap Memory Management in MXNet Scala

2018-09-12 Thread Timur Shenkao
Thanks for the great job!

My questions / proposals.
1) Have you considered Java collections with a low memory footprint, like
Fastutil, Koloboke, etc.? They are much more memory efficient and they have
"better correspondence" with low level data types.
2) The MNIST example on the page is "bad" because MNIST is handled pretty fast
even on a laptop, i.e. we won't catch GC & off-heap problems.
3) Is the 1.2.0-java branch where the Java API work happens?
4) System.gc() behaves differently on various JVM platforms, JDK
implementations, GC types. So, I am sure that we will get users' requests
to eliminate this approach in the future.
5) Frameworks like mxnet aren't used separately. Folks have to integrate
Spark or Spring with DL libraries. In that case, they often use CMS or
even more archaic GCs, since G1GC isn't always a good fit for streaming or
long-lived jobs.
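
As an illustration of point 4, here is a rough sketch of releasing off-heap
resources deterministically instead of depending on when System.gc() runs,
with a log line as a safety net if an object is collected without being
closed. It assumes Java 9+ for java.lang.ref.Cleaner. TrackedBuffer and its
counters are hypothetical names, and the "free" is only a counter standing in
for a native call; this is not the mechanism described in the proposal.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: explicit close() is the primary release path;
// the Cleaner only reports (and cleans up) objects that were leaked.
class TrackedBuffer implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // Counters stand in for real native frees and a real logger.
    static final AtomicInteger releases = new AtomicInteger();
    static final AtomicInteger leaks = new AtomicInteger();

    // The cleanup state must not reference the TrackedBuffer itself,
    // otherwise the buffer could never become unreachable.
    private static final class State implements Runnable {
        final AtomicBoolean closedExplicitly = new AtomicBoolean(false);
        final String label;

        State(String label) {
            this.label = label;
        }

        @Override
        public void run() {
            if (!closedExplicitly.get()) {
                leaks.incrementAndGet();
                System.err.println("leak: " + label + " was never closed");
            }
            releases.incrementAndGet(); // stand-in for the native free
        }
    }

    private final State state;
    private final Cleaner.Cleanable cleanable;

    TrackedBuffer(String label) {
        this.state = new State(label);
        this.cleanable = CLEANER.register(this, state);
    }

    @Override
    public void close() {
        state.closedExplicitly.set(true);
        cleanable.clean(); // runs State.run() at most once
    }
}
```

With try-with-resources the release happens at a known point regardless of
the collector in use; the Cleaner path runs only when (and if) the GC gets to
the object, which is exactly the behavior that varies between JVMs.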



On Wednesday, September 12, 2018, Chris Olivier 
wrote:

> do you log on finalize() if the object wasn’t properly freed (ie
> NDArray.finalize())? is that available in Scala?
>
> On Tue, Sep 11, 2018 at 6:12 PM Qing Lan  wrote:
>
> > Nice document! Way better than current .dispose() in Scala!
> >
> > Thanks,
> > Qing
> >
> > On 9/11/18, 6:04 PM, "Chris Olivier"  wrote:
> >
> > wow, incredible document!
> >
> > On Tue, Sep 11, 2018 at 2:37 PM Naveen Swamy 
> > wrote:
> >
> > > Hi All,
> > >
> > > I am working on managing Off-Heap Memory Management and have
> written
> > a
> > > proposal here based on my prototype and research I did.
> > >
> > > Please review the doc and provide your feedback ?
> > >
> > >
> > https://cwiki.apache.org/confluence/display/MXNET/JVM+Memory+Management
> > >
> > > I had offline discussion with a few people I work with and added
> > their
> > > feedback to the doc as well.
> > >
> > > Thanks, Naveen
> > >
> >
> >
> >
>