Re: Release plan - MXNET 1.3

2018-08-16 Thread Afrooze, Sina
Hi Roshani - Can you please make sure that this fix (which is already merged to 
master) is also merged to the stable branch for 1.3.0: 
https://github.com/apache/incubator-mxnet/pull/11493 - Thanks, Sina


On 8/16/18, 10:51 AM, "Roshani Nagmote"  wrote:

Hi all,

Release status:

Currently, for release 1.3.0 there are a couple of open issues which need
to be resolved before cutting the RC.

The current date we are looking at for cutting RC0 is 08/17 (Friday).



Open issues which need to be looked at before cutting RC:

   1. Topk regression issue - PR #12202 with fix
   2. Excessive memory allocation issue - PR #12184 with fix
   3. Test_io.test_csvIter breaks on CentOS - PR #12189 with fix



@committers, could you please help review these PRs and get them merged?



Thanks,

Roshani

On Tue, Aug 14, 2018 at 12:46 PM Roshani Nagmote 
wrote:

> Talked offline to the person who ran the resnet50 benchmarks. The build flag
> was not set properly, which caused the difference in the observed performance
> numbers. No issue was found, and he was able to get the same results as
> mentioned here https://mxnet.incubator.apache.org/faq/perf.html
> 
>
> We are good here.
>
> Thanks,
> Roshani
>
> On Mon, Aug 13, 2018 at 4:08 PM Roshani Nagmote
> wrote:
>
>> Hi Dom,
>>
>> I verified resnet50 run on MXNet master branch. Checked on single gpu
>> machine. Numbers match. I didn't see any performance degradation.
>> https://mxnet.incubator.apache.org/faq/perf.html#scoring-results
>>
>> Can you please give me more details on the instance type and script you
>> ran exactly so that I can try to reproduce it again?
>>
>> Thanks,
>> Roshani
>>
>>
>> On Mon, Aug 13, 2018 at 12:31 PM Roshani Nagmote <
>> roshaninagmo...@gmail.com> wrote:
>>
>>> This is not a major feature. I meant that PRs for other new feature requests
>>> won't be accepted in the 1.3 release now.
>>> Bug fixes will be accepted. I will be trying to reproduce the regression
>>> Dom mentioned today. :)
>>>
>>> Thanks,
>>> Roshani
>>>
>>> On Mon, Aug 13, 2018 at 12:06 PM Naveen Swamy 
>>> wrote:
>>>
 Is this a major feature? This is a regression that Dom is reporting
 wrt performance

 On Mon, Aug 13, 2018 at 11:38 AM, Roshani Nagmote <
 roshaninagmo...@gmail.com
 > wrote:

 > Thanks for reporting this issue Dom.
 > 08/10 (Friday) was the major feature freeze date. We won't be accepting any
 > new features now for the MXNet 1.3 release.
 > RC0 will be cut on 08/17 (Friday).
 >
 > Will be verifying the performance degradation issue mentioned.
 >
 > Thanks,
 > Roshani
 >
 > On Mon, Aug 13, 2018 at 8:45 AM Divakaruni, Dominic
 >  wrote:
 >
 > > Hi all, We tested resnet50 on MXNet built from master branch on Friday and
 > > were seeing degraded performance on GPU - about 50% slower compared to
 > > these values here https://mxnet.incubator.apache.org/faq/perf.html. FWIW
 > > this slowdown was seen for both MXNet as well as the TRT integrated MXNet.
 > >
 > > Something for you all to verify before or after you cut the RC.
 > >
 > > Thx!
 > >
 > > On 8/13/18, 4:34 AM, "kellen sunderland" <
 kellen.sunderl...@gmail.com>
 > > wrote:
 > >
 > > Hey Roshani,
 > >
 > > Has an RC branch already been cut?  If so, a quick heads up that I think
 > > this commit should probably get into RC0 for 1.3.
 > >
 > > https://github.com/apache/incubator-mxnet/commit/ee8755a2531b322fec29c9c3d2aa3b8738da41f3
 > >
 > > It won't cause issues for users, but from a versioning compatibility
 > > perspective it's probably better that we remove these functions in this
 > > release. This way we don't have to worry about major bumps in the next
 > > release if they're removed.
 > >
 > > -Kellen
 > >
 > >
 > > On Fri, Aug 10, 2018 at 7:24 PM 

Re: Testing examples in nightly build

2018-08-16 Thread Marco de Abreu
These are great ideas, sounds very good to me. Thanks for your efforts
around the user experience!

-Marco

Vandana Kannan wrote on Thu, Aug 16, 2018, 18:27:

> @Marco: Most of the examples are standalone applications and not
> notebooks. We could start off by enabling tests for examples that are
> either notebooks or those that can be executed with command line options.
>
> Some of the problems that we may come across are with examples that load
> large datasets (we could probably use a smaller dataset, but it depends on
> the example), or examples that have a long training time (we could probably
> execute them for 1 epoch). Straight Dope nightly tests cap the execution
> time to 10 minutes per notebook. We could do the same here wherever
> possible.
>
> @Kellen: Good idea to include flake8/pylint checks for examples in nightly
> builds - we would need to fix the backlog (
> https://github.com/apache/incubator-mxnet/issues/12205) before enabling.
> Yes, any contribution from the community on any of these TODO items would
> be great.
>
> - Vandana
>
> On 2018/08/16 06:40:56, kellen sunderland 
> wrote:
> > I think it would be very beneficial to start fleshing out the nightlies.
> > They provide a lot of value at a relatively small cost.  Any contributions
> > from the community would be appreciated.
> >
> > Things I could see being beneficial:
> > * Long running tests
> > * In depth flake8/pylint linters
> > * cudamemcheck builds
> > * Asan builds.
> >
> > On Thu, Aug 16, 2018, 12:24 AM Marco de Abreu
> >  wrote:
> >
> > > Hello,
> > >
> > > I think this is a great idea! Thanks a lot for fixing all the problems in
> > > our examples.
> > >
> > > Do you see any problems that could come up if we just run them one by one?
> > > We already got a pipeline that allows us to verify jupyter notebooks, but
> > > afaik the examples are standalone files with a main function, right?
> > >
> > > Best regards,
> > > Marco
> > >
> > > Vandana Kannan wrote on Wed, Aug 15, 2018, 23:44:
> > >
> > > > Hi All,
> > > >
> > > > Recently we saw that there were quite a few Pylint undefined-variable
> > > > errors in MXNet code, a majority of them in the example folder. These
> > > > errors highlight that there are paths in the examples that are broken and
> > > > these have not been caught during testing.
> > > >
> > > > These errors are later reported as issues by users. For example,
> > > > https://github.com/apache/incubator-mxnet/issues/11278.
> > > >
> > > > It would be a good idea to include tests for all examples in the nightly
> > > > builds (Similar to
> > > > incubator-mxnet/tests/nightly/test_image_classification.sh). Side note: We
> > > > might have to see what to do about examples that include special setup
> > > > instructions or take too long to execute.
> > > >
> > > > Any thoughts or suggestions on writing these tests, feasibility or
> > > > previous attempts at including these tests, would be helpful.
> > > >
> > > > Thanks,
> > > > Vandana
> > > >
> > >
> >
>
>


Re: Release blocker? - buggy topk Op

2018-08-16 Thread Roshani Nagmote
Thanks Leonard for raising this issue.
@ciyong Thanks for submitting the fix. I will be tracking the mentioned PR
#12202 for the release.

-Roshani

On Thu, Aug 16, 2018 at 6:45 AM Zhao, Patric  wrote:

> Hi Leonard,
>
> Thanks for raising the topk op issue.
>
> The root cause is the current API design, which uses a float data type
> to represent the integer index, and as we know, a float type can NOT
> represent large integers precisely.
> (No offense intended. I know I am missing some background, and I think the
> current design is very good.)
>
> The new change in #12085 alters the computation order and makes this issue
> look more significant. Essentially, the bug occurs whenever the index is
> large, with or without the new change.
> A one-line example can trigger the issue:
> 'print(mx.nd.topk(mx.nd.array(np.arange(256*300096).reshape(8, -1)), k=4))'.
>
> Thus, the real fix is to change the API and use INT for the index. But that
> might introduce compatibility issues for current frameworks/topologies due
> to the API change.
> I am not sure we should make this change in the last minutes of release 1.3
> (although we can contribute to it).
>
> For now, we have submitted a fix (#12202) that restores the previous
> computation order and is still much faster :)
>
> Apologies for the confusion, and feel free to send us any feedback.
>
> Thanks,
>
> --Patric
>
>
> > -Original Message-
> > From: Leonard Lausen [mailto:l-softw...@lausen.nl]
> > Sent: Thursday, August 16, 2018 9:51 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Release blocker? - buggy topk Op
> >
> > Recent changes in mxnet master introduced a bug into the topk operator.
> >  Below code example will output [ 274232. 179574. 274233. 274231.] with
> >  mxnet-cu90==1.3.0b20180810 but [ 274232. 179574. 274232. 274232.] with
> > mxnet-cu90==1.3.0b20180814. Likely #12085 is at fault.
> >
> > See https://github.com/apache/incubator-mxnet/issues/12197 for more
> info.
> >
> > I think this should be considered a release blocker for the 1.3 release.
> >
> > Note this breaks some parts of the KDD 18 MXNet / Gluon tutorial, which is
> > scheduled for next Tuesday:
> > http://www.kdd.org/kdd2018/hands-on-tutorials/view/mxnet-with-a-focus-on-nlp
> > (We can work around this by asking people to install the 0810 version,
> > though.)
>
>


Re: Release plan - MXNET 1.3

2018-08-16 Thread Roshani Nagmote
Hi all,

Release status:

Currently, for release 1.3.0 there are a couple of open issues which need
to be resolved before cutting the RC.

The current date we are looking at for cutting RC0 is 08/17 (Friday).



Open issues which need to be looked at before cutting RC:

   1. Topk regression issue - PR #12202 with fix
   2. Excessive memory allocation issue - PR #12184 with fix
   3. Test_io.test_csvIter breaks on CentOS - PR #12189 with fix



@committers, could you please help review these PRs and get them merged?



Thanks,

Roshani

On Tue, Aug 14, 2018 at 12:46 PM Roshani Nagmote 
wrote:

> Talked offline to the person who ran the resnet50 benchmarks. The build flag
> was not set properly, which caused the difference in the observed performance
> numbers. No issue was found, and he was able to get the same results as
> mentioned here https://mxnet.incubator.apache.org/faq/perf.html
> 
>
> We are good here.
>
> Thanks,
> Roshani
>
> On Mon, Aug 13, 2018 at 4:08 PM Roshani Nagmote 
> wrote:
>
>> Hi Dom,
>>
>> I verified resnet50 run on MXNet master branch. Checked on single gpu
>> machine. Numbers match. I didn't see any performance degradation.
>> https://mxnet.incubator.apache.org/faq/perf.html#scoring-results
>>
>> Can you please give me more details on the instance type and script you
>> ran exactly so that I can try to reproduce it again?
>>
>> Thanks,
>> Roshani
>>
>>
>> On Mon, Aug 13, 2018 at 12:31 PM Roshani Nagmote <
>> roshaninagmo...@gmail.com> wrote:
>>
>>> This is not a major feature. I meant that PRs for other new feature requests
>>> won't be accepted in the 1.3 release now.
>>> Bug fixes will be accepted. I will be trying to reproduce the regression
>>> Dom mentioned today. :)
>>>
>>> Thanks,
>>> Roshani
>>>
>>> On Mon, Aug 13, 2018 at 12:06 PM Naveen Swamy 
>>> wrote:
>>>
 Is this a major feature? This is a regression that Dom is reporting
 wrt performance

 On Mon, Aug 13, 2018 at 11:38 AM, Roshani Nagmote <
 roshaninagmo...@gmail.com
 > wrote:

 > Thanks for reporting this issue Dom.
 > 08/10 (Friday) was the major feature freeze date. We won't be accepting any
 > new features now for the MXNet 1.3 release.
 > RC0 will be cut on 08/17 (Friday).
 >
 > Will be verifying the performance degradation issue mentioned.
 >
 > Thanks,
 > Roshani
 >
 > On Mon, Aug 13, 2018 at 8:45 AM Divakaruni, Dominic
 >  wrote:
 >
 > > Hi all, We tested resnet50 on MXNet built from master branch on Friday and
 > > were seeing degraded performance on GPU - about 50% slower compared to
 > > these values here https://mxnet.incubator.apache.org/faq/perf.html. FWIW
 > > this slowdown was seen for both MXNet as well as the TRT integrated MXNet.
 > >
 > > Something for you all to verify before or after you cut the RC.
 > >
 > > Thx!
 > >
 > > On 8/13/18, 4:34 AM, "kellen sunderland" <
 kellen.sunderl...@gmail.com>
 > > wrote:
 > >
 > > Hey Roshani,
 > >
 > > Has an RC branch already been cut?  If so, a quick heads up that I think
 > > this commit should probably get into RC0 for 1.3.
 > >
 > > https://github.com/apache/incubator-mxnet/commit/ee8755a2531b322fec29c9c3d2aa3b8738da41f3
 > >
 > > It won't cause issues for users, but from a versioning compatibility
 > > perspective it's probably better that we remove these functions in this
 > > release. This way we don't have to worry about major bumps in the next
 > > release if they're removed.
 > >
 > > -Kellen
 > >
 > >
 > > On Fri, Aug 10, 2018 at 7:24 PM Roshani Nagmote <
 > > roshaninagmo...@gmail.com>
 > > wrote:
 > >
 > > > Thanks Kellen and everyone else for working to get TensorRT PR
 > > merged!
 > > > @Sina, I will be keeping track of that issue and fixes to get
 in
 > the
 > > > release.
 > > >
 > > > We are starting code freeze for 1.3 release today. A release
 > > candidate will
 > > > be cut on 08/17.
 > > > Feel free to add any other comments/suggestions.
 > > >
 > > > Thanks,
 > > > Roshani
 > > >
 > > > On Fri, Aug 10, 2018 at 5:39 AM kellen sunderland <
 > > > kellen.sunderl...@gmail.com> wrote:
 > > >
 > > > > All merged and ready to go from my side Roshani (the
 TensorRT
 > PR).
 > >  

Re: Testing examples in nightly build

2018-08-16 Thread Vandana Kannan
@Marco: Most of the examples are standalone applications and not notebooks. We 
could start off by enabling tests for examples that are either notebooks or 
those that can be executed with command line options.

Some of the problems that we may come across are with examples that load large 
datasets (we could probably use a smaller dataset, but it depends on the 
example), or examples that have a long training time (we could probably execute 
them for 1 epoch). Straight Dope nightly tests cap the execution time to 10 
minutes per notebook. We could do the same here wherever possible. 
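
The 10-minute cap suggested above could be enforced by a small runner along the following lines. This is only a minimal sketch, not part of the existing nightly scripts: the function name `run_with_cap`, the status strings, and the example path and `--num-epochs` flag shown in the comment are hypothetical. Only `subprocess.run` and its `timeout` argument are standard Python.

```python
import subprocess
import sys


def run_with_cap(command, timeout_seconds=600):
    """Run one example command and report 'passed', 'failed', or 'timeout'.

    The default of 600 seconds mirrors the 10-minute cap mentioned for the
    Straight Dope notebooks; the value itself is an assumption here.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timeout"
    return "passed" if completed.returncode == 0 else "failed"


if __name__ == "__main__":
    # Hypothetical invocation; the script path and flag are placeholders:
    # run_with_cap([sys.executable, "example/some_example.py", "--num-epochs", "1"])
    print(run_with_cap([sys.executable, "-c", "print('hello')"]))
```

Examples that need a smaller dataset or special setup would still need per-example handling; the runner only bounds execution time and reports the exit status.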

@Kellen: Good idea to include flake8/pylint checks for examples in nightly 
builds - we would need to fix the backlog 
(https://github.com/apache/incubator-mxnet/issues/12205) before enabling. Yes, 
any contribution from the community on any of these TODO items would be great.

- Vandana

On 2018/08/16 06:40:56, kellen sunderland  wrote: 
> I think it would be very beneficial to start fleshing out the nightlies.
> They provide a lot of value at a relatively small cost.  Any contributions
> from the community would be appreciated.
> 
> Things I could see being beneficial:
> * Long running tests
> * In depth flake8/pylint linters
> * cudamemcheck builds
> * Asan builds.
> 
> On Thu, Aug 16, 2018, 12:24 AM Marco de Abreu
>  wrote:
> 
> > Hello,
> >
> > I think this is a great idea! Thanks a lot for fixing all the problems in
> > our examples.
> >
> > Do you see any problems that could come up if we just run them one by one?
> > We already got a pipeline that allows us to verify jupyter notebooks, but
> > afaik the examples are standalone files with a main function, right?
> >
> > Best regards,
> > Marco
> >
> > Vandana Kannan wrote on Wed, Aug 15, 2018, 23:44:
> >
> > > Hi All,
> > >
> > > Recently we saw that there were quite a few Pylint undefined-variable
> > > errors in MXNet code, a majority of them in the example folder. These
> > > errors highlight that there are paths in the examples that are broken and
> > > these have not been caught during testing.
> > >
> > > These errors are later reported as issues by users. For example,
> > > https://github.com/apache/incubator-mxnet/issues/11278.
> > >
> > > It would be a good idea to include tests for all examples in the nightly
> > > builds (Similar to
> > > incubator-mxnet/tests/nightly/test_image_classification.sh). Side note: We
> > > might have to see what to do about examples that include special setup
> > > instructions or take too long to execute.
> > >
> > > Any thoughts or suggestions on writing these tests, feasibility or
> > > previous attempts at including these tests, would be helpful.
> > >
> > > Thanks,
> > > Vandana
> > >
> >
> 



RE: Release blocker? - buggy topk Op

2018-08-16 Thread Zhao, Patric
Hi Leonard,

Thanks for raising the topk op issue.

The root cause is the current API design, which uses a float data type to 
represent the integer index, and as we know, a float type can NOT represent 
large integers precisely.
(No offense intended. I know I am missing some background, and I think the 
current design is very good.)

The new change in #12085 alters the computation order and makes this issue look 
more significant. Essentially, the bug occurs whenever the index is large, with 
or without the new change. 
A one-line example can trigger the issue: 
'print(mx.nd.topk(mx.nd.array(np.arange(256*300096).reshape(8, -1)), k=4))'.
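
The precision limit described above can be seen without MXNet. The following is a minimal sketch using only the Python standard library; the helper names `to_float32` and `survives_float32` are made up for illustration and are not part of any MXNet API. It assumes the index is stored as a 32-bit float, which has a 24-bit significand, so consecutive integers stop being distinguishable above 2**24.

```python
import struct


def to_float32(value):
    """Round a number to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


def survives_float32(index):
    """Return True if an integer index is unchanged by a float32 round trip."""
    return int(to_float32(index)) == index


LIMIT = 2 ** 24  # float32 has a 24-bit significand

print(survives_float32(LIMIT))      # still exactly representable
print(survives_float32(LIMIT + 1))  # rounded to a neighbouring value

# Several consecutive integers above the limit collapse onto the same float.
distinct = {to_float32(i) for i in range(LIMIT, LIMIT + 8)}
print(len(distinct), "distinct float32 values for 8 consecutive integers")
```

This only illustrates why a float-typed index cannot identify every position in a large array; it does not reproduce the topk computation itself.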

Thus, the real fix is to change the API and use INT for the index. 
But that might introduce compatibility issues for current frameworks/topologies 
due to the API change.
I am not sure we should make this change in the last minutes of release 1.3 
(although we can contribute to it).

For now, we have submitted a fix (#12202) that restores the previous 
computation order and is still much faster :)

Apologies for the confusion, and feel free to send us any feedback.

Thanks,

--Patric


> -Original Message-
> From: Leonard Lausen [mailto:l-softw...@lausen.nl]
> Sent: Thursday, August 16, 2018 9:51 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Release blocker? - buggy topk Op
> 
> Recent changes in mxnet master introduced a bug into the topk operator.
>  Below code example will output [ 274232. 179574. 274233. 274231.] with
>  mxnet-cu90==1.3.0b20180810 but [ 274232. 179574. 274232. 274232.] with
> mxnet-cu90==1.3.0b20180814. Likely #12085 is at fault.
> 
> See https://github.com/apache/incubator-mxnet/issues/12197 for more info.
> 
> I think this should be considered a release blocker for the 1.3 release.
> 
> Note this breaks some parts of the KDD 18 MXNet / Gluon tutorial, which is
> scheduled for next Tuesday:
> http://www.kdd.org/kdd2018/hands-on-tutorials/view/mxnet-with-a-focus-on-nlp
> (We can work around this by asking people to install the 0810 version,
> though.)



Re: Testing examples in nightly build

2018-08-16 Thread kellen sunderland
I think it would be very beneficial to start fleshing out the nightlies.
They provide a lot of value at a relatively small cost.  Any contributions
from the community would be appreciated.

Things I could see being beneficial:
* Long running tests
* In depth flake8/pylint linters
* cudamemcheck builds
* Asan builds.
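
To illustrate why linters catch what ordinary test runs can miss, here is a minimal sketch of the kind of undefined-variable bug discussed in this thread. The function and variable names are hypothetical and not taken from any MXNet example: the broken branch only fails when it is actually executed, while a static check such as Pylint's undefined-variable flags it without running the code.

```python
def build_optimizer_name(use_momentum):
    """Return an optimizer name; the momentum branch contains a latent bug."""
    if use_momentum:
        # Hypothetical bug: 'optimizer_prefix' is never defined anywhere, so
        # this line raises NameError, but only when this branch is taken.
        return optimizer_prefix + "_momentum"
    return "sgd"


# The commonly exercised path works, so a simple smoke test passes.
print(build_optimizer_name(False))

# The rarely used path fails at run time.
try:
    build_optimizer_name(True)
except NameError as error:
    print("broken path:", error)
```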

On Thu, Aug 16, 2018, 12:24 AM Marco de Abreu
 wrote:

> Hello,
>
> I think this is a great idea! Thanks a lot for fixing all the problems in
> our examples.
>
> Do you see any problems that could come up if we just run them one by one?
> We already got a pipeline that allows us to verify jupyter notebooks, but
> afaik the examples are standalone files with a main function, right?
>
> Best regards,
> Marco
>
> Vandana Kannan wrote on Wed, Aug 15, 2018, 23:44:
>
> > Hi All,
> >
> > Recently we saw that there were quite a few Pylint undefined-variable
> > errors in MXNet code, a majority of them in the example folder. These
> > errors highlight that there are paths in the examples that are broken and
> > these have not been caught during testing.
> >
> > These errors are later reported as issues by users. For example,
> > https://github.com/apache/incubator-mxnet/issues/11278.
> >
> > It would be a good idea to include tests for all examples in the nightly
> > builds (Similar to
> > incubator-mxnet/tests/nightly/test_image_classification.sh). Side note: We
> > might have to see what to do about examples that include special setup
> > instructions or take too long to execute.
> >
> > Any thoughts or suggestions on writing these tests, feasibility or
> > previous attempts at including these tests, would be helpful.
> >
> > Thanks,
> > Vandana
> >
>


Re: [DISCUSS] improve MXNet Scala release process

2018-08-16 Thread Qing Lan
Hi all,

I have created a design document on automated Scala publishing here:
https://cwiki.apache.org/confluence/display/MXNET/Automated+MXNet+Scala+release+design

Please kindly review it and leave any thoughts you may have.

Thanks,
Qing

On 8/3/18, 10:26 AM, "Naveen Swamy"  wrote:

Hi Carin,

The thinking right now is to publish nightly to the Apache Snapshot
repository, building and validating with integration tests. We (Qing, Andrew
Ayres, and I) will work on a detailed document describing the process and
we'll loop you in as well.

Thanks, Naveen

On Fri, Aug 3, 2018 at 8:35 AM, Carin Meier  wrote:

> Hi,
>
> I was thinking about the process for publishing the Clojure jar as well. I
> think, since it will be published to Nexus/Maven and the building of it
> depends on the Scala jar artifact, it might make sense to combine the
> publishing of the Clojure jar at the same time as the Scala Jar. I haven't
> worked out all the details yet, but I'm thinking with that it would be a
> couple command lines in the same script that handles the Scala deployment.
>
> Or if it is a separate process, it might be able to share common parts.
>
> Could you please keep me in the loop of the deployment process so we can
> figure out the best place to work in Clojure as well?
>
> Thanks,
> Carin
>
> On Tue, Jul 31, 2018 at 3:49 PM, Qing Lan  wrote:
>
> > Upon offline discussion with Marco,
> >
> > He proposed a plan that can actually help us conduct 3):
> > 1. This job will not be triggered when a PR runs, and it is strictly
> > limited so that only committers can run the restricted job.
> > 2. The code run there will only cover the code from the
> > branch you choose, so it will be the committers' responsibility not to
> > merge any trivial credential-grabber code.
> > 3. Testing this is simple. The restricted job uses an architecture
> > similar to the current CI. You can send a PR with dockerfiles, scripts
> > and configurations on Jenkins to test running the job with a mock
> > credential. Finally, please contact the people working on CI to give it a
> > test run, and they will do the last step of merging your change to CI.
> > 4. Marco also mentioned the security level of the credentials. The
> > credentials stored in the AWS credential service will be assigned
> > an individual IAM role that only allows access to the credentials
> > assigned to that role, and that role is used in the restricted job you
> > have set up.
> >
> > I would also like to encourage people on this list to join the
> > https://cwiki.apache.org/confluence/display/MXNET/MXNet+Berlin+Office+Hours
> > as the people who are working on improving the CI are there, ready to help.
> >
> > Thanks,
> > Qing
> >
> >
> > On 7/28/18, 11:44 PM, "Qing Lan"  wrote:
> >
> > Thanks Marco, Naveen and Sheng's feedback.
> >
> > About 1): The Scala side only packs the mxnet binary and uses
> > dynamic links to all the remaining dependencies. So it will indeed require
> > users to install all deps at the same versions as the builder platform,
> > which makes the package hard to use. Let's collaborate and create a (set
> > of) general CI script(s) to install the deps and bring static linking to
> > the package.
> >
> > About 3): It is indeed a general problem for both Scala and Python
> > publishing. If there is a good way to safely store the credentials, we can
> > definitely give automated publishing a go. Thanks again for the option
> > Marco provided below; I think we can make use of the restricted slaves and
> > give it a test run. And to Marco:
> > 1. Will these restricted jobs be triggered on every PR run, or does
> > it just depend on where you put them (e.g. if I put one in nightly, it will
> > never be triggered in a PR)? Is there a potential risk like a PR attack
> > (creating a PR to grab credentials)?
> >
> > 2. How do we make sure the code being run there is under control
> > and cannot be changed by just anyone?
> >
> > 3. If I want to test this functionality, where is the best place
> > to create the job and make a test run?
> > Thanks,
> > Qing
> >
> >
> >
> > On 7/27/18, 5:44 PM, "Marco de Abreu" wrote:
> >
> > Hi all,
> >
> > about the credential management: We already have a solution based on
> > restricted slaves [1] and AWS Secrets Manager [2] that is generally
> > classified to generate binaries and handle credentials. It was designed
> > with continuous