Re: [Announcement] New Committer - Junru Shao

2019-09-08 Thread Qin, Zhennan
Congratulations, Junru!

Zhennan


On Sun, 2019-09-08 at 03:14 +, Sheng Zha wrote:

Hi all,


Please join me in welcoming Junru Shao as a new committer of Apache MXNet 
(incubating)!


Junru made a number of contributions to this project such as cond and 
while_loop control-flow

operators, enabling dynamic shape in control-flow ops, and zero-copy array 
creation from numpy

array. He's also been actively working on numpy-compatible programming 
experience and

has helped the community recently by driving the 1.4.1 patch release.


Welcome, Junru!


-sz



Re: Making new operators and AMP lists

2019-05-30 Thread Qin, Zhennan
How about change the below line in amp.py:164

wrap_list = fp32_ops if fp32_ops is not None else
lists.symbol.FP32_FUNCS

to be

plist = ctypes.POINTER(ctypes.c_char_p)()
size = ctypes.c_uint()

check_call(_LIB.MXListAllOpNames(ctypes.byref(size),
 ctypes.byref(plist)))
op_names = []
for i in range(size.value):
op_names.append(py_str(plist[i]))

wrap_list = []
fp16_op_list = lists.symbol.FP16_FUNCS +
lists.symbol.WIDEST_TYPE_CASTS + lists.symbol.FP16_FP32_FUNCS +
list(map(lambda x: x[0], lists.symbol.CONDITIONAL_FP32_FUNCS))
for op_name in op_names:
if not op_name.startswith("_backward_") and not
op_name.startswith("_contrib_backward_") and op_name not in
fp16_op_list:
wrap_list.append(op_name)
wrap_list = fp32_ops if fp32_ops is not None else wrap_list


I've checked, the changed code can produce identity wrap_list as
lists.symbol.FP16_FP32_FUNCS.

The check code is,

print("op that in wrap_list but not in FP32_FUNCS:")
for i in wrap_list:
if i not in lists.symbol.FP32_FUNCS:
print(i)

print("op that in FP32_FUNCS but not in wrap_list:")
for i in lists.symbol.FP32_FUNCS:
if i not in wrap_list:
print(i)

The output is,

op that in infered_fp32_op_list but not in FP32_FUNCS:
op that in FP32_FUNCS but not in infered_fp32_op_list:
op that in infered_fp32_op_list but not in FP32_FUNCS:
op that in FP32_FUNCS but not in infered_fp32_op_list:

This change should be able to automatically set new operator as fp32
only.

Thanks,
Zhennan

On Wed, 2019-05-29 at 19:13 +, Skalicky, Sam wrote:
> Thanks Przemek for the additional explanation, but I’m still confused
> on this part. I don’t understand the explanation of the optimizer’s
> interaction here. 
> 
> > > The technical reason for that is the only place from which one
> > > can get MXNet operators is MXListAllOps call, which gives back
> > > all operators without any way of differentiating them, including
> > > for example calls to optimizers. If somebody added a new
> > > optimizer and AMP inserts a FP32 cast before its input, the
> > > optimizer would actually update the FP32 copy of the tensor
> > > instead of the tensor itself. This is not a performance problem
> > > (which, I agree, would be perfectly fine to solve later) - it is
> > > a correctness problem.
> > > That is why the error message in the test explicitly mentions
> > > this case as one where the operator should be put in the list
> > > that does not make any casts.
> 
> 
> If this is third list that Anirudh mentions:
> 
> > > 3. Not sure, Don't know or dont care: FP32 list.
> 
> 
> and this list simply contains all operators that have not been
> categorized, why do we need the list at all? 
> 
> Thanks,
> Sam
> 
> 
> > On May 29, 2019, at 11:58 AM, Sheng Zha 
> > wrote:
> > 
> > > The second misunderstanding is that the ask from the test is to
> > > somehow "add support" for AMP in the operator. That is definitely
> > > not true and adding a single line in either FP32_FUNCS (cast me
> > > always to FP32) or FP16_FP32_FUNCS (do not do anything with me
> > > because I'm not relevant for this usecase) is perfectly fine.
> > 
> > I respectfully disagree with this statement. In this case, the
> > "support to add" is exactly to differentiate how the new operator
> > should be treated. It was not there before, so it's new. And
> > requiring other developers to add any line is totally not fine.
> > 
> > > The test ensures that the list is updated at the same time as new
> > > operator is introduced.
> > 
> > It would indeed have that effect. The problem is that the test
> > requires the wrong person to add such support. A developer who adds
> > a new operator should be under no obligation to do anything with
> > AMP.
> > 
> > Having it as warning both in runtime and in the test has the effect
> > to prompt people who actually use and care about the performance
> > from AMP to take the steps and decide where a new operator fits.
> > And like you said, "It is up to people like me actively working on
> > better FP16 support to go through the FP32 list to introduce
> > support and then move such operator to another list." This is also
> > exactly how it should work with AMP.
> > 
> > -sz
> > 
> > On 2019/05/29 18:25:46, Przemys��aw Tr��dak 
> > wrote: 
> > > Thank you all for responding.
> > > 
> > > Let me address a few misunderstandings about AMP in the
> > > discussion so far:
> > > - I think the main misunderstanding in the discussion so far is
> > > that it would be somehow possible to implicitly cast to FP32 all
> > > previously unseen operators. I would much prefer such solution
> > > myself, however this is actually (unfortunately) not really
> > > possible. The technical reason for that is the only place from
> > > which one can get MXNet operators is MXListAllOps call, which
> > > gives back all operators without any way of differentiating 

RE: Order of includes in cpplint

2019-01-08 Thread Qin, Zhennan
Ah, I see the problem. The real problem is, we shouldn't use diamond include 
instead of quotes include for project header, which is cheating cpplint to 
consider them as system header. Diamond include should only be used for system 
header, including OS header, STL header and host-installed library 
header(opencv for example). For others, quote include should be used.

Thanks,
Zhennan

-Original Message-
From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com] 
Sent: Wednesday, January 9, 2019 9:29 AM
To: dev@mxnet.incubator.apache.org
Subject: Re: Order of includes in cpplint

I worked around the case, I saw several cases where project and third party 
headers were included with diamond includes instead of quotes.
Agree I think is too much hassle to change the order of includes in the whole 
project anyway. But this convention can cause some headers not to be self 
sufficient as system headers are being included before other library headers 
which might be using a system header like  without including it 
themselves.

Pedro.

On Wed, Jan 9, 2019 at 2:12 AM Qin, Zhennan  wrote:
>
> Hi Pedro,
>
> Interesting topic. Google style does have guidance for this:
>
> https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_I
> ncludes
>
> According to it, the order is,
>
> dir2/foo2.h.
> A blank line
> C system files.
> C++ system files.
> A blank line
> Other libraries' .h files.
> Your project's .h files.
>
> As MXNet follows this style, I guess we shouldn't break it unless we have 
> some problems. Do you have such a case that need the change?
>
> Thanks,
> Zhennan
>
> -Original Message-
> From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> Sent: Wednesday, January 9, 2019 6:44 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Order of includes in cpplint
>
> Hi MXNet community
>
> cpplint seems to complain when the order of includes is not  [own, 
> system, other]
>
> But the general best practice in C++ is [own, project, 3rd party, 
> system] for the reasons explained in this stackoverflow answer:  ( 
> https://stackoverflow.com/questions/614302/c-header-order )
>
> A contribution to cpplint could be made to make this configurable:
>
> https://github.com/cpplint/cpplint/blob/master/cpplint.py#L605
>
> Thoughts?
>
> Pedro.


RE: Order of includes in cpplint

2019-01-08 Thread Qin, Zhennan
Hi Pedro,

Interesting topic. Google style does have guidance for this:

https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes

According to it, the order is,

dir2/foo2.h.
A blank line
C system files.
C++ system files.
A blank line
Other libraries' .h files.
Your project's .h files.

As MXNet follows this style, I guess we shouldn't break it unless we have some 
problems. Do you have such a case that need the change?

Thanks,
Zhennan

-Original Message-
From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com] 
Sent: Wednesday, January 9, 2019 6:44 AM
To: dev@mxnet.incubator.apache.org
Subject: Order of includes in cpplint

Hi MXNet community

cpplint seems to complain when the order of includes is not  [own, system, 
other]

But the general best practice in C++ is [own, project, 3rd party, system] for 
the reasons explained in this stackoverflow answer:  ( 
https://stackoverflow.com/questions/614302/c-header-order )

A contribution to cpplint could be made to make this configurable:

https://github.com/cpplint/cpplint/blob/master/cpplint.py#L605

Thoughts?

Pedro.


RE: Time out for Travis CI

2018-09-29 Thread Qin, Zhennan
Hi YiZhi and Kellen,

From my point of view, travis should be able to get passed from a scratch 
build. Pending result on ccache hit/miss is not a good idea. For this PR, as it 
changed many header file, lots of files need be recompiled, just like a scratch 
build. I think that's the reason that travis timeout. This should be fixed 
before enabling travis, as it will block any change to those base header file. 
Again, it's not a special case with this PR only, you can find same problem on 
other PRs:

https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification
https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification


Thanks,
Zhennan

-Original Message-
From: YiZhi Liu [mailto:eazhi@gmail.com] 
Sent: Sunday, September 30, 2018 5:15 AM
To: eazhi@gmail.com
Cc: dev@mxnet.incubator.apache.org
Subject: Re: Time out for Travis CI

while other PRs are all good.
On Sat, Sep 29, 2018 at 2:13 PM YiZhi Liu  wrote:
>
> Honestly I don't know yet. I can help to investigate. Just given the 
> evidence that, travis timeout every time it gets re-triggered - 2 
> times at least. Correct me if I'm wrong @ Zhennan On Sat, Sep 29, 2018 
> at 1:54 PM kellen sunderland  wrote:
> >
> > Reading over the PR I don't see what aspects would cause extra 
> > runtime YiZhi, could you point them out?
> >
> > On Sat, Sep 29, 2018 at 8:46 PM YiZhi Liu  wrote:
> >
> > > Kellen, I think this PR introduces extra runtime in CI, thus 
> > > causes the timeout. Which means, once merged, every PR later will 
> > > see same timeout in travis.
> > >
> > > So shall we modify the changes to decrease the test running time? 
> > > or just disable the Travis CI?
> > >
> > >
> > > On Fri, Sep 28, 2018 at 9:17 PM Qin, Zhennan 
> > > 
> > > wrote:
> > > >
> > > > Hi Kellen,
> > > >
> > > > Thanks for your explanation. Do you have a time plan to solve 
> > > > the
> > > timeout issue? Rebasing can't work for my case. Or shall we run it 
> > > silently to disallow it voting X for overall CI result? Because 
> > > most developers are used to ignore the PRs with 'X'.
> > > >
> > > > Thanks,
> > > > Zhennan
> > > >
> > > > -Original Message-
> > > > From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> > > > Sent: Friday, September 28, 2018 10:38 PM
> > > > To: dev@mxnet.incubator.apache.org
> > > > Subject: Re: Time out for Travis CI
> > > >
> > > > Hey Zhennan, you're safe to ignore Travis failures for now.  
> > > > They're
> > > just informational.
> > > >
> > > > The reason you sometimes see quick builds and sometimes see slow 
> > > > builds
> > > is that we're making use of ccache in between builds.  If your PR 
> > > is similar to what's in master you should build very quickly, if 
> > > not it's going to take a while and likely time out.  If you see 
> > > timeouts rebasing may speed things up.  Unfortunately the timeouts 
> > > are global and we're not able to increase them.  I'm hoping that 
> > > adding artifact caching will speed up future builds to the point 
> > > that test runs and builds can be executed in under the global limit 
> > > (which is ~50 minutes).
> > > >
> > > > -Kellen
> > > >
> > > >
> > > > On Fri, Sep 28, 2018 at 4:05 PM Qin, Zhennan 
> > > > 
> > > wrote:
> > > >
> > > > > Hi MXNet devs,
> > > > >
> > > > > I'm struggled with new Travis CI for a while, it always run 
> > > > > time out for this PR:
> > > > > https://github.com/apache/incubator-mxnet/pull/12530
> > > > >
> > > > > Most of the time, Jenkins CI can pass, while Travis can't be 
> > > > > finished within 50 minutes. For this PR, it shouldn't affect 
> > > > > much on the build time or unit test time. Also, I saw other PR has 
> > > > > same problem, eg.
> > > > >
> > > > >
> > > > > https://travis-ci.org/apache/incubator-mxnet/builds/433172088?
> > > > > utm_sour ce=github_status_medium=notification
> > > > >
> > > > > https://travis-ci.org/apache/incubator-mxnet/builds/434404305?
> > > > > utm_sour ce=github_status_medium=notification
> > > > >
> > > > > According to the time stamp from Travis, all passed PR are 
> > > > > within small code change, and can complete `make -j2` within 
> > > > > 25s. But for timeout case, 'make -j2' will need about 1600s. 
> > > > > Does Travis do incremental build for each test? Shall we 
> > > > > increase time limit for large PR? Can we add more time stamp 
> > > > > for build and unites stage to
> > > help understand what's going on there?
> > > > >
> > > > > Thanks in advance,
> > > > > Zhennan
> > > > >
> > >
> > >
> > >
> > > --
> > > Yizhi Liu
> > > DMLC member
> > > Amazon Web Services
> > > Vancouver, Canada
> > >
>
>
>
> --
> Yizhi Liu
> DMLC member
> Amazon Web Services
> Vancouver, Canada



--
Yizhi Liu
DMLC member
Amazon Web Services
Vancouver, Canada


RE: Time out for Travis CI

2018-09-28 Thread Qin, Zhennan
Hi Kellen,

Thanks for your explanation. Do you have a time plan to solve the timeout 
issue? Rebasing can't work for my case. Or shall we run it silently to disallow 
it voting X for overall CI result? Because most developers are used to ignore 
the PRs with 'X'.

Thanks,
Zhennan

-Original Message-
From: kellen sunderland [mailto:kellen.sunderl...@gmail.com] 
Sent: Friday, September 28, 2018 10:38 PM
To: dev@mxnet.incubator.apache.org
Subject: Re: Time out for Travis CI

Hey Zhennan, you're safe to ignore Travis failures for now.  They're just 
informational.

The reason you sometimes see quick builds and sometimes see slow builds is that 
we're making use of ccache in between builds.  If your PR is similar to what's 
in master you should build very quickly, if not it's going to take a while and 
likely time out.  If you see timeouts rebasing may speed things up.  
Unfortunately the timeouts are global and we're not able to increase them.  I'm 
hoping that adding artifact caching will speed up future builds to the point 
that test runs and builds can be executed in under the global limit (which is 
~50 minutes).

-Kellen


On Fri, Sep 28, 2018 at 4:05 PM Qin, Zhennan  wrote:

> Hi MXNet devs,
>
> I'm struggled with new Travis CI for a while, it always run time out 
> for this PR:
> https://github.com/apache/incubator-mxnet/pull/12530
>
> Most of the time, Jenkins CI can pass, while Travis can't be finished 
> within 50 minutes. For this PR, it shouldn't affect much on the build 
> time or unit test time. Also, I saw other PR has same problem, eg.
>
>
> https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_sour
> ce=github_status_medium=notification
>
> https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_sour
> ce=github_status_medium=notification
>
> According to the time stamp from Travis, all passed PR are within 
> small code change, and can complete `make -j2` within 25s. But for 
> timeout case, 'make -j2' will need about 1600s. Does Travis do 
> incremental build for each test? Shall we increase time limit for 
> large PR? Can we add more time stamp for build and unites stage to help 
> understand what's going on there?
>
> Thanks in advance,
> Zhennan
>


Time out for Travis CI

2018-09-28 Thread Qin, Zhennan
Hi MXNet devs,

I'm struggled with new Travis CI for a while, it always run time out for this 
PR:
https://github.com/apache/incubator-mxnet/pull/12530

Most of the time, Jenkins CI can pass, while Travis can't be finished within 50 
minutes. For this PR, it shouldn't affect much on the build time or unit test 
time. Also, I saw other PR has same problem, eg.

https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification
https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification

According to the time stamp from Travis, all passed PR are within small code 
change, and can complete `make -j2` within 25s. But for timeout case, 'make 
-j2' will need about 1600s. Does Travis do incremental build for each test? 
Shall we increase time limit for large PR? Can we add more time stamp for build 
and unites stage to help understand what's going on there?

Thanks in advance,
Zhennan