Re: [Announcement] New Committer - Junru Shao
Congratulations, Junru! Zhennan On Sun, 2019-09-08 at 03:14 +, Sheng Zha wrote: Hi all, Please join me in welcoming Junru Shao as a new committer of Apache MXNet (incubating)! Junru made a number of contributions to this project such as cond and while_loop control-flow operators, enabling dynamic shape in control-flow ops, and zero-copy array creation from numpy array. He's also been actively working on numpy-compatible programming experience and has helped the community recently by driving the 1.4.1 patch release. Welcome, Junru! -sz
Re: Making new operators and AMP lists
How about change the below line in amp.py:164 wrap_list = fp32_ops if fp32_ops is not None else lists.symbol.FP32_FUNCS to be plist = ctypes.POINTER(ctypes.c_char_p)() size = ctypes.c_uint() check_call(_LIB.MXListAllOpNames(ctypes.byref(size), ctypes.byref(plist))) op_names = [] for i in range(size.value): op_names.append(py_str(plist[i])) wrap_list = [] fp16_op_list = lists.symbol.FP16_FUNCS + lists.symbol.WIDEST_TYPE_CASTS + lists.symbol.FP16_FP32_FUNCS + list(map(lambda x: x[0], lists.symbol.CONDITIONAL_FP32_FUNCS)) for op_name in op_names: if not op_name.startswith("_backward_") and not op_name.startswith("_contrib_backward_") and op_name not in fp16_op_list: wrap_list.append(op_name) wrap_list = fp32_ops if fp32_ops is not None else wrap_list I've checked, the changed code can produce identity wrap_list as lists.symbol.FP16_FP32_FUNCS. The check code is, print("op that in wrap_list but not in FP32_FUNCS:") for i in wrap_list: if i not in lists.symbol.FP32_FUNCS: print(i) print("op that in FP32_FUNCS but not in wrap_list:") for i in lists.symbol.FP32_FUNCS: if i not in wrap_list: print(i) The output is, op that in infered_fp32_op_list but not in FP32_FUNCS: op that in FP32_FUNCS but not in infered_fp32_op_list: op that in infered_fp32_op_list but not in FP32_FUNCS: op that in FP32_FUNCS but not in infered_fp32_op_list: This change should be able to automatically set new operator as fp32 only. Thanks, Zhennan On Wed, 2019-05-29 at 19:13 +, Skalicky, Sam wrote: > Thanks Przemek for the additional explanation, but I’m still confused > on this part. I don’t understand the explanation of the optimizer’s > interaction here. > > > > The technical reason for that is the only place from which one > > > can get MXNet operators is MXListAllOps call, which gives back > > > all operators without any way of differentiating them, including > > > for example calls to optimizers. If somebody added a new > > > optimizer and AMP inserts a FP32 cast before its input, the > > > optimizer would actually update the FP32 copy of the tensor > > > instead of the tensor itself. This is not a performance problem > > > (which, I agree, would be perfectly fine to solve later) - it is > > > a correctness problem. > > > That is why the error message in the test explicitly mentions > > > this case as one where the operator should be put in the list > > > that does not make any casts. > > > If this is third list that Anirudh mentions: > > > > 3. Not sure, Don't know or dont care: FP32 list. > > > and this list simply contains all operators that have not been > categorized, why do we need the list at all? > > Thanks, > Sam > > > > On May 29, 2019, at 11:58 AM, Sheng Zha > > wrote: > > > > > The second misunderstanding is that the ask from the test is to > > > somehow "add support" for AMP in the operator. That is definitely > > > not true and adding a single line in either FP32_FUNCS (cast me > > > always to FP32) or FP16_FP32_FUNCS (do not do anything with me > > > because I'm not relevant for this usecase) is perfectly fine. > > > > I respectfully disagree with this statement. In this case, the > > "support to add" is exactly to differentiate how the new operator > > should be treated. It was not there before, so it's new. And > > requiring other developers to add any line is totally not fine. > > > > > The test ensures that the list is updated at the same time as new > > > operator is introduced. > > > > It would indeed have that effect. The problem is that the test > > requires the wrong person to add such support. A developer who adds > > a new operator should be under no obligation to do anything with > > AMP. > > > > Having it as warning both in runtime and in the test has the effect > > to prompt people who actually use and care about the performance > > from AMP to take the steps and decide where a new operator fits. > > And like you said, "It is up to people like me actively working on > > better FP16 support to go through the FP32 list to introduce > > support and then move such operator to another list." This is also > > exactly how it should work with AMP. > > > > -sz > > > > On 2019/05/29 18:25:46, Przemys��aw Tr��dak > > wrote: > > > Thank you all for responding. > > > > > > Let me address a few misunderstandings about AMP in the > > > discussion so far: > > > - I think the main misunderstanding in the discussion so far is > > > that it would be somehow possible to implicitly cast to FP32 all > > > previously unseen operators. I would much prefer such solution > > > myself, however this is actually (unfortunately) not really > > > possible. The technical reason for that is the only place from > > > which one can get MXNet operators is MXListAllOps call, which > > > gives back all operators without any way of differentiating
RE: Order of includes in cpplint
Ah, I see the problem. The real problem is, we shouldn't use diamond include instead of quotes include for project header, which is cheating cpplint to consider them as system header. Diamond include should only be used for system header, including OS header, STL header and host-installed library header(opencv for example). For others, quote include should be used. Thanks, Zhennan -Original Message- From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com] Sent: Wednesday, January 9, 2019 9:29 AM To: dev@mxnet.incubator.apache.org Subject: Re: Order of includes in cpplint I worked around the case, I saw several cases where project and third party headers were included with diamond includes instead of quotes. Agree I think is too much hassle to change the order of includes in the whole project anyway. But this convention can cause some headers not to be self sufficient as system headers are being included before other library headers which might be using a system header like without including it themselves. Pedro. On Wed, Jan 9, 2019 at 2:12 AM Qin, Zhennan wrote: > > Hi Pedro, > > Interesting topic. Google style does have guidance for this: > > https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_I > ncludes > > According to it, the order is, > > dir2/foo2.h. > A blank line > C system files. > C++ system files. > A blank line > Other libraries' .h files. > Your project's .h files. > > As MXNet follows this style, I guess we shouldn't break it unless we have > some problems. Do you have such a case that need the change? > > Thanks, > Zhennan > > -Original Message- > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com] > Sent: Wednesday, January 9, 2019 6:44 AM > To: dev@mxnet.incubator.apache.org > Subject: Order of includes in cpplint > > Hi MXNet community > > cpplint seems to complain when the order of includes is not [own, > system, other] > > But the general best practice in C++ is [own, project, 3rd party, > system] for the reasons explained in this stackoverflow answer: ( > https://stackoverflow.com/questions/614302/c-header-order ) > > A contribution to cpplint could be made to make this configurable: > > https://github.com/cpplint/cpplint/blob/master/cpplint.py#L605 > > Thoughts? > > Pedro.
RE: Order of includes in cpplint
Hi Pedro, Interesting topic. Google style does have guidance for this: https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes According to it, the order is, dir2/foo2.h. A blank line C system files. C++ system files. A blank line Other libraries' .h files. Your project's .h files. As MXNet follows this style, I guess we shouldn't break it unless we have some problems. Do you have such a case that need the change? Thanks, Zhennan -Original Message- From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com] Sent: Wednesday, January 9, 2019 6:44 AM To: dev@mxnet.incubator.apache.org Subject: Order of includes in cpplint Hi MXNet community cpplint seems to complain when the order of includes is not [own, system, other] But the general best practice in C++ is [own, project, 3rd party, system] for the reasons explained in this stackoverflow answer: ( https://stackoverflow.com/questions/614302/c-header-order ) A contribution to cpplint could be made to make this configurable: https://github.com/cpplint/cpplint/blob/master/cpplint.py#L605 Thoughts? Pedro.
RE: Time out for Travis CI
Hi YiZhi and Kellen, From my point of view, travis should be able to get passed from a scratch build. Pending result on ccache hit/miss is not a good idea. For this PR, as it changed many header file, lots of files need be recompiled, just like a scratch build. I think that's the reason that travis timeout. This should be fixed before enabling travis, as it will block any change to those base header file. Again, it's not a special case with this PR only, you can find same problem on other PRs: https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification Thanks, Zhennan -Original Message- From: YiZhi Liu [mailto:eazhi@gmail.com] Sent: Sunday, September 30, 2018 5:15 AM To: eazhi@gmail.com Cc: dev@mxnet.incubator.apache.org Subject: Re: Time out for Travis CI while other PRs are all good. On Sat, Sep 29, 2018 at 2:13 PM YiZhi Liu wrote: > > Honestly I don't know yet. I can help to investigate. Just given the > evidence that, travis timeout every time it gets re-triggered - 2 > times at least. Correct me if I'm wrong @ Zhennan On Sat, Sep 29, 2018 > at 1:54 PM kellen sunderland wrote: > > > > Reading over the PR I don't see what aspects would cause extra > > runtime YiZhi, could you point them out? > > > > On Sat, Sep 29, 2018 at 8:46 PM YiZhi Liu wrote: > > > > > Kellen, I think this PR introduces extra runtime in CI, thus > > > causes the timeout. Which means, once merged, every PR later will > > > see same timeout in travis. > > > > > > So shall we modify the changes to decrease the test running time? > > > or just disable the Travis CI? > > > > > > > > > On Fri, Sep 28, 2018 at 9:17 PM Qin, Zhennan > > > > > > wrote: > > > > > > > > Hi Kellen, > > > > > > > > Thanks for your explanation. Do you have a time plan to solve > > > > the > > > timeout issue? Rebasing can't work for my case. Or shall we run it > > > silently to disallow it voting X for overall CI result? Because > > > most developers are used to ignore the PRs with 'X'. > > > > > > > > Thanks, > > > > Zhennan > > > > > > > > -Original Message- > > > > From: kellen sunderland [mailto:kellen.sunderl...@gmail.com] > > > > Sent: Friday, September 28, 2018 10:38 PM > > > > To: dev@mxnet.incubator.apache.org > > > > Subject: Re: Time out for Travis CI > > > > > > > > Hey Zhennan, you're safe to ignore Travis failures for now. > > > > They're > > > just informational. > > > > > > > > The reason you sometimes see quick builds and sometimes see slow > > > > builds > > > is that we're making use of ccache in between builds. If your PR > > > is similar to what's in master you should build very quickly, if > > > not it's going to take a while and likely time out. If you see > > > timeouts rebasing may speed things up. Unfortunately the timeouts > > > are global and we're not able to increase them. I'm hoping that > > > adding artifact caching will speed up future builds to the point > > > that test runs and builds can be executed in under the global limit > > > (which is ~50 minutes). > > > > > > > > -Kellen > > > > > > > > > > > > On Fri, Sep 28, 2018 at 4:05 PM Qin, Zhennan > > > > > > > wrote: > > > > > > > > > Hi MXNet devs, > > > > > > > > > > I'm struggled with new Travis CI for a while, it always run > > > > > time out for this PR: > > > > > https://github.com/apache/incubator-mxnet/pull/12530 > > > > > > > > > > Most of the time, Jenkins CI can pass, while Travis can't be > > > > > finished within 50 minutes. For this PR, it shouldn't affect > > > > > much on the build time or unit test time. Also, I saw other PR has > > > > > same problem, eg. > > > > > > > > > > > > > > > https://travis-ci.org/apache/incubator-mxnet/builds/433172088? > > > > > utm_sour ce=github_status_medium=notification > > > > > > > > > > https://travis-ci.org/apache/incubator-mxnet/builds/434404305? > > > > > utm_sour ce=github_status_medium=notification > > > > > > > > > > According to the time stamp from Travis, all passed PR are > > > > > within small code change, and can complete `make -j2` within > > > > > 25s. But for timeout case, 'make -j2' will need about 1600s. > > > > > Does Travis do incremental build for each test? Shall we > > > > > increase time limit for large PR? Can we add more time stamp > > > > > for build and unites stage to > > > help understand what's going on there? > > > > > > > > > > Thanks in advance, > > > > > Zhennan > > > > > > > > > > > > > > > > > -- > > > Yizhi Liu > > > DMLC member > > > Amazon Web Services > > > Vancouver, Canada > > > > > > > -- > Yizhi Liu > DMLC member > Amazon Web Services > Vancouver, Canada -- Yizhi Liu DMLC member Amazon Web Services Vancouver, Canada
RE: Time out for Travis CI
Hi Kellen, Thanks for your explanation. Do you have a time plan to solve the timeout issue? Rebasing can't work for my case. Or shall we run it silently to disallow it voting X for overall CI result? Because most developers are used to ignore the PRs with 'X'. Thanks, Zhennan -Original Message- From: kellen sunderland [mailto:kellen.sunderl...@gmail.com] Sent: Friday, September 28, 2018 10:38 PM To: dev@mxnet.incubator.apache.org Subject: Re: Time out for Travis CI Hey Zhennan, you're safe to ignore Travis failures for now. They're just informational. The reason you sometimes see quick builds and sometimes see slow builds is that we're making use of ccache in between builds. If your PR is similar to what's in master you should build very quickly, if not it's going to take a while and likely time out. If you see timeouts rebasing may speed things up. Unfortunately the timeouts are global and we're not able to increase them. I'm hoping that adding artifact caching will speed up future builds to the point that test runs and builds can be executed in under the global limit (which is ~50 minutes). -Kellen On Fri, Sep 28, 2018 at 4:05 PM Qin, Zhennan wrote: > Hi MXNet devs, > > I'm struggled with new Travis CI for a while, it always run time out > for this PR: > https://github.com/apache/incubator-mxnet/pull/12530 > > Most of the time, Jenkins CI can pass, while Travis can't be finished > within 50 minutes. For this PR, it shouldn't affect much on the build > time or unit test time. Also, I saw other PR has same problem, eg. > > > https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_sour > ce=github_status_medium=notification > > https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_sour > ce=github_status_medium=notification > > According to the time stamp from Travis, all passed PR are within > small code change, and can complete `make -j2` within 25s. But for > timeout case, 'make -j2' will need about 1600s. Does Travis do > incremental build for each test? Shall we increase time limit for > large PR? Can we add more time stamp for build and unites stage to help > understand what's going on there? > > Thanks in advance, > Zhennan >
Time out for Travis CI
Hi MXNet devs, I'm struggled with new Travis CI for a while, it always run time out for this PR: https://github.com/apache/incubator-mxnet/pull/12530 Most of the time, Jenkins CI can pass, while Travis can't be finished within 50 minutes. For this PR, it shouldn't affect much on the build time or unit test time. Also, I saw other PR has same problem, eg. https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification According to the time stamp from Travis, all passed PR are within small code change, and can complete `make -j2` within 25s. But for timeout case, 'make -j2' will need about 1600s. Does Travis do incremental build for each test? Shall we increase time limit for large PR? Can we add more time stamp for build and unites stage to help understand what's going on there? Thanks in advance, Zhennan