Re: Enable Timestamp in CI Logging

2019-09-27 Thread Pedro Larroy
Sheng, you should have admin access to Jenkins as of now.

Why wouldn't it be persistent through reboots?

Pedro.

On Sat, Sep 14, 2019 at 10:07 PM Sheng Zha  wrote:

> Thank you, Philip. Looks like xgboost is using the same plugin for the
> timestamps.
>
> Unfortunately, I don't have admin access to the CI right now so I cannot
> add the plugin or view whether the plugin is already added. What's also
> unclear to me is how to add plugins to the CI in a persistent way that
> lasts through the next reboot.
>
> It would be great if someone could help in these aspects.
>
> -sz
>
> On 2019/09/15 04:27:47, Philip Cho  wrote:
> > Hi Sheng:
> >
> > Take a look at
> >
> https://github.com/dmlc/xgboost/blob/c89bcc4de5368b3f8a7fa170d8348287dab44caf/Jenkinsfile#L21
> > .
> >
> > Philip.
> >
> > On Sat, Sep 14, 2019 at 9:26 PM Sheng Zha  wrote:
> >
> > > Hi,
> > >
> > > There have been timeouts in the build step of CI in PRs. To help
> identify
> > > the steps where most time is taken, I suggest that we enable timestamp
> in
> > > the CI logging. With the help of a simple Jenkins plugin [1], we
> should be
> > > able to turn it on with some simple changes in our Jenkinsfiles.
> > >
> > > If you're already familiar with how to proceed, help would be much
> > > appreciated. Otherwise, I will start looking into how to proceed.
> > >
> > > -sz
> > >
> > > [1] http://wiki.jenkins.io/display/JENKINS/Timestamper
> > >
> >
>
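For reference, the change Sheng suggests above is small once the Timestamper plugin [1] is installed on the Jenkins master. A minimal sketch for a scripted pipeline (the node label and build command below are placeholders, not the actual MXNet Jenkinsfile contents):

    // Sketch only: the timestamps {} step comes from the Timestamper plugin and
    // prefixes every console log line of the enclosed stages with a timestamp.
    node('placeholder-linux-cpu') {
      timestamps {
        stage('Build') {
          sh 'make -j$(nproc)'  // placeholder build command
        }
      }
    }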


Re: [DISCUSS] CI Access Control

2019-09-27 Thread Pedro Larroy
We will address the shortcomings that Marco outlined by using a pipeline to
deploy the CI infrastructure, which will allow for contributions and easy
redeployment and rollback in case of issues.

I would recommend planning a migration towards Drone IO or similar, with an
initial prototype to validate that the main use cases are covered.

Pedro.

On Thu, Sep 19, 2019 at 2:29 PM Sheng Zha  wrote:

> Hi Marco,
>
> Thank you for sharing the insights. The discussion is intended for setting
> goals so that future design improvement to the CI can take these goals into
> consideration. Thus, while I fully recognize that there could be difficulty
> in implementation, I'd still like to confirm with the community if the
> outlined access control recommendation is at the right level.
>
> To summarize your concerns:
> - opening up access control should be conditioned on having good version
> control and roll-back mechanism to ease the operation burden from breakage,
> which is more likely given larger user base.
> - upgrades to the system would be better managed as planned and collective
> efforts instead of adhoc tasks performed by uncoordinated individuals.
>
> You also mentioned that "changes to the system should only be done by the
> administrators". It's exactly the intention of this thread is to define who
> would qualify as administrators. Currently, such qualification is opaque,
> and only happens within a group in Amazon.
>
> On the other hand, this current way can, and already has, caused friction.
> When this project's daily activity of validating and merging code is
> affected due to the system's instability, the community members have no
> choice but to wait for the issues to be resolved by the current system
> administrators. Other affected community members have no way to help even
> if they wish to.
>
> Given the existing Apache project governance model, I'd recommend that the
> goal for CI access control be set so that committers and PMC members who
> wish to be involved have the right to help.
>
> -sz
>
> On 2019/09/17 12:49:20, Marco de Abreu  wrote:
> > Ah, with regards to #1 and #2: Currently, we don't have any plugins that
> > control the actions of a single user and allows us to monitor and rate
> > limit them. Just giving trigger permission (which is also tied with
> > abort-permission if I recall correctly), would allow a malicious user to
> > start a huge number of jobs and thus either create immense costs or bring
> > down the system. Also, we'd have to check how we can restrict the trigger
> > permission to specific jobs.
> >
> > -Marco
> >
> > On Tue, Sep 17, 2019 at 2:47 PM Marco de Abreu 
> > wrote:
> >
> > > Hi Sheng,
> > >
> > > while I'm in general all in favour of widening the access to distribute
> the
> > > tasks, the situation around the CI system in particular is a bit more
> > > difficult.
> > >
> > > As far as I know, the creation of the CI system is neither automated,
> > > versioned nor backed up or safeguarded. This means that if somebody
> makes a
> > > change that breaks something, we're left with a broken system we can't
> > > recover from. Thus, I preferred it in the past to restrict the access
> as
> > > much as possible (at least to Prod) to avoid these situations from
> > > happening. While #1 and #2 are already possible today (we have two
> roles
> > > for committers and regular users that allow this already), #3 and #4
> come
> > > with a significant risk for the stability of the system.
> > >
> > > As soon as a job is added or changed, a lot of things happen in
> Jenkins -
> > > one of these tasks is the SCM scan which tries to determine the
> branches
> > > the job should run on. For somebody who is inexperienced, the first
> pitfall
> > > is that suddenly hundreds of jobs are being spawned which will
> certainly
> > > overload Jenkins and render it unusable. There are a lot of tricks and
> I
> > > could elaborate them, but basically the bottom line is that the
> > > configuration interface of Jenkins is far from fail-proof and exposes a
> > > significant risk if accessed by somebody who doesn't exactly know what
> > > they're doing - that is, we would need to design some kind of training
> and
> > > even that would not safeguard us from these fatal events.
> > >
> > > There's the whole security aspect around user-facing artifact
> generation
> > > of CI/CD and the possibility of them being tampered with, but I don't think
> I
> > > have to elaborate on that.
> > >
> > > With regards to #4 especially, I'd say that somebody just
> > > upgrading the system or changing plugins carries an even bigger risk.
> > > Plugins are notoriously unsafe and system updates have also been shown to
> not
> > > really go like a breeze. I'd argue that changes to the system should
> only
> > > be done by the administrators of it since they have a bigger overview
> over
> > > all the things that are currently going on while also having the full
> > > access (backups before making

Re: [Discuss] MXNet Python < 3.6 Support Deprecation

2019-11-06 Thread Pedro Larroy
In Numpy they are considering dropping 3.5 support for 1.18 or 1.19.

P.

On Tue, Nov 5, 2019 at 11:15 PM Xingjian SHI  wrote:

> I don’t think we should drop Python 3.5 now because Ubuntu 16.04 ships
> with that version. I suggest that we should revisit it next year.
>
> Best,
> Xingjian
> 
> From: Sheng Zha 
> Sent: Tuesday, August 27, 2019 10:49 AM
> To: d...@mxnet.apache.org
> Subject: Re: [Discuss] MXNet Python < 3.6 Support Deprecation
>
> Good summary. At the start of this discussion thread, my ask was to announce the
> intention of py2 deprecation in the next release, and then actually
> deprecate py2 in the next major release. Thus, the appropriate timing for
> dropping py2 support in CI should be the start of the next major release.
> The py35 vs py36 discussion will not affect the outcome of py2 deprecation.
>
> BTW, one alternative option to a formal vote in the Apache way is to go
> through lazy consensus [1], which could apply more readily in our project. Given
> the positive feedback in this discussion thread, I will assume lazy
> consensus in 72hrs on py2 deprecation as defined above.
>
> [1] https://community.apache.org/committers/lazyConsensus.html
>
> On 2019/08/27 00:19:14, Marco de Abreu  wrote:
> > Pedro,
> >
> > thanks for already starting these efforts, but it might be too early for
> > that. Right now, this is a discussion thread where we try to gather
> > different opinions in order to lay a good base for a future voting
> thread.
> > In there, we would define the detailed timeline, versions etc. Until the
> > vote has passed, I'd say that it's too early to draw any conclusions. So
> > far, there are two open discussion points:
> >
> > 1. Which Python version to support. 3.5 vs 3.6 is currently in the
> > discussion due to Ubuntu 16.04 shipping with 3.5 while 3.6 has the biggest
> > market share as of now.
> > 2. When to do the deprecation. EOY to match with official Python 2
> > deprecation, in 1.5 years to be in line with Ubuntu 16.04 LTS or with the
> > next major release (2.0) to adhere to semantic versioning.
> >
> > Once these points (and any future ones) have been properly discussed and
> > the community has come to an agreement, we can formalize it with a voting
> > thread. Until then, I'd recommend refraining from any actions or
> > user-facing communication regarding this topic.
> >
> > Best regards,
> > Marco
> >
> > On Tue, Aug 27, 2019 at 1:29 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > I have sent a PR that removes Python2 from CI, but it was closed. I
> thought
> > > everyone was +1 on this one. This would remove quite a bit of load on
> CI:
> > >
> > > https://github.com/apache/incubator-mxnet/pull/15990
> > >
> > > If it's not the right time to do this, what steps do we need to take?
> > >
> > > Pedro.
> > >
> > >
> > > On Mon, Aug 26, 2019 at 1:27 AM Leonard Lausen 
> wrote:
> > >
> > > > Lieven Govaerts  writes:
> > > > > Hi,
> > > > >
> > > > > On Thu, 22 Aug 2019 at 17:01, Leonard Lausen 
> > > wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> Pedro stated "Seems 3.6 is a reasonable choice." and there have
> been a
> > > > >> few +1 after Chaitanya's reply to Pedro. I would like to check if
> > > these
> > > > >> only refer to Chaitanya's mail about a dedicated "improvement"
> effort
> > > or
> > > > >> about dropping 3.5.
> > > > >>
> > > > >> Thus two questions:
> > > > >>
> > > > >> 1) Are there any concerns about dropping Python 3.5? Now is your
> > > chance
> > > > to
> > > > >> speak up if you think so.
> > > > >>
> > > > >>
> > > > > Ubuntu 16.04 LTS defaults to Python 3.5.x . The LTS releases are
> > > > supported
> > > > > for 5 years, so for 16.04 LTS it ends in 1.5 years.
> > > > >
> > > > > I'm not saying you should wait for 1.5 more years, people can
> upgrade
> > > to
> > > > > 18.04 LTS after all, but may I suggest you make this switch in a
> major
> > > > > release only? More specifically, ensure that Python 3.6-only code
> > > doesn't
> > > > > accidentally get merged into a 

Please remove conflicting Open MP version from CMake builds

2019-11-30 Thread Pedro Larroy
(py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$ ldd
build/libmxnet.so| grep -i openmp
libomp.so =>
/home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
(0x7fde0991d000)
(py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$ python
~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec
--batch-size 256 --dtype float32 --num-data-workers 40 --mode hybrid
--model resnet50_v2 --use-pretrained --kvstore local --log-interval 1
--rec-val ~/data/val-passthrough.rec --rec-val-idx
~/data/val-passthrough.idx
INFO:root:Namespace(batch_norm=False, batch_size=256,
data_dir='~/.mxnet/datasets/imagenet', dataset_size=32, dtype='float32',
kvstore='local', last_gamma=False, log_interval=1, logging_dir='logs',
lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9,
num_epochs=3, num_gpus=0, num_workers=40,
rec_val='/home/piotr/data/val-passthrough.rec',
rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params',
save_frequency=0, top_k=0, use_pretrained=True, use_rec=True, use_se=False,
warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
[10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2:
/home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
INFO:root:Batch [0]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
INFO:root:Batch [1]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
INFO:root:Batch [2]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363







(py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp (upstream_master)+$
git st
On branch upstream_master
Your branch is up to date with 'origin/upstream_master'.

Changes not staged for commit:
  (use "git add/rm ..." to update what will be committed)
  (use "git checkout -- ..." to discard changes in working directory)

deleted:3rdparty/openmp

no changes added to commit (use "git add" and/or "git commit -a")
(py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp (upstream_master)+$
ldd build/libmxnet.so | grep -i omp
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x7f941241c000)

(py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp (upstream_master)+$
python ~/deeplearning-benchmark/image_classification/infer_imagenet.py
--use-rec --batch-size 256 --dtype float32 --num-data-workers 40 --mode
hybrid --model resnet50_v2 --use-pretrained --kvstore local --log-interval
1 --rec-val ~/data/val-passthrough.rec --rec-val-idx
~/data/val-passthrough.idx
INFO:root:warmup_throughput: 147 samples/sec warmup_time 1.735117
INFO:root:Batch [16]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 143 samples/sec warmup_time 1.785760
INFO:root:Batch [17]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 148 samples/sec warmup_time 1.729033


CI Update

2019-12-02 Thread Pedro Larroy
Small update about CI, which is blocked.

Seems there's an nvidia driver compatibility problem between the base AMI that
is running on GPU instances and the nvidia docker images that we use for
building and testing.

We are working on providing a fix by updating the base images, as it doesn't
seem to be easy to fix by just changing the container.

Thanks.

Pedro.


Re: CI Update

2019-12-03 Thread Pedro Larroy
Hi MXNet community. We are in the process of updating the base AMIs for CI
with an updated CUDA driver to fix the CI blockage.

We would need help from the community to diagnose some of the build errors
which don't seem related to the infrastructure.

I have observed this build failure with tvm when not installing the cuda
driver in the container:


https://pastebin.com/bQA0W2U4

centos gpu builds and tests seem to run with the updated AMI and changes to
the container.


Thanks.


On Mon, Dec 2, 2019 at 12:11 PM Pedro Larroy 
wrote:

> Small update about CI, which is blocked.
>
> Seems there's a nvidia driver compatibility problem in the base AMI that
> is running in GPU instances and the nvidia docker images that we use for
> building and testing.
>
> We are working on providing a fix by updating the base images as doesn't
> seem to be easy to fix by just changing the container.
>
> Thanks.
>
> Pedro.
>


Re: CI Update

2019-12-03 Thread Pedro Larroy
Also please take note that there's a stage building TVM which compiles
serially and takes a lot of time, which impacts CI turnaround time:

https://github.com/apache/incubator-mxnet/issues/16962

Pedro

On Tue, Dec 3, 2019 at 9:49 AM Pedro Larroy 
wrote:

> Hi MXNet community. We are in the process of updating the base AMIs for CI
> with an updated CUDA driver to fix the CI blockage.
>
> We would need help from the community to diagnose some of the build errors
> which don't seem related to the infrastructure.
>
> I have observed this build failure with tvm when not installing the cuda
> driver in the container:
>
>
> https://pastebin.com/bQA0W2U4
>
> centos gpu builds and tests seem to run with the updated AMI and changes
> to the container.
>
>
> Thanks.
>
>
> On Mon, Dec 2, 2019 at 12:11 PM Pedro Larroy 
> wrote:
>
>> Small update about CI, which is blocked.
>>
>> Seems there's a nvidia driver compatibility problem in the base AMI that
>> is running in GPU instances and the nvidia docker images that we use for
>> building and testing.
>>
>> We are working on providing a fix by updating the base images as doesn't
>> seem to be easy to fix by just changing the container.
>>
>> Thanks.
>>
>> Pedro.
>>
>


Re: CI Update

2019-12-03 Thread Pedro Larroy
Some PRs were experiencing build timeouts in the past. I have diagnosed
this to be a saturation of the EFS volume holding the compilation cache.
Once CI is back online this problem is very likely to be solved and you
should not see any more build timeout issues.

On Tue, Dec 3, 2019 at 10:18 AM Pedro Larroy 
wrote:

> Also please take note that there's a stage building TVM which is executing
> compilation serially and takes a lot of time which impacts CI turnaround
> time:
>
> https://github.com/apache/incubator-mxnet/issues/16962
>
> Pedro
>
> On Tue, Dec 3, 2019 at 9:49 AM Pedro Larroy 
> wrote:
>
>> Hi MXNet community. We are in the process of updating the base AMIs for
>> CI with an updated CUDA driver to fix the CI blockage.
>>
>> We would need help from the community to diagnose some of the build
>> errors which don't seem related to the infrastructure.
>>
>> I have observed this build failure with tvm when not installing the cuda
>> driver in the container:
>>
>>
>> https://pastebin.com/bQA0W2U4
>>
>> centos gpu builds and tests seem to run with the updated AMI and changes
>> to the container.
>>
>>
>> Thanks.
>>
>>
>> On Mon, Dec 2, 2019 at 12:11 PM Pedro Larroy <
>> pedro.larroy.li...@gmail.com> wrote:
>>
>>> Small update about CI, which is blocked.
>>>
>>> Seems there's a nvidia driver compatibility problem in the base AMI that
>>> is running in GPU instances and the nvidia docker images that we use for
>>> building and testing.
>>>
>>> We are working on providing a fix by updating the base images as doesn't
>>> seem to be easy to fix by just changing the container.
>>>
>>> Thanks.
>>>
>>> Pedro.
>>>
>>


Re: CI Update

2019-12-06 Thread Pedro Larroy
Hi all. CI is back to normal after Jake's commit:
https://github.com/apache/incubator-mxnet/pull/16968; please merge from
master. If someone could look into the TVM building issues described above,
that would be great.

On Tue, Dec 3, 2019 at 11:11 AM Pedro Larroy 
wrote:

> Some PRs were experiencing build timeouts in the past. I have diagnosed
> this to be a saturation of the EFS volume holding the compilation cache.
> Once CI is back online this problem is very likely to be solved and you
> should not see any more build timeout issues.
>
> On Tue, Dec 3, 2019 at 10:18 AM Pedro Larroy 
> wrote:
>
>> Also please take note that there's a stage building TVM which is
>> executing compilation serially and takes a lot of time which impacts CI
>> turnaround time:
>>
>> https://github.com/apache/incubator-mxnet/issues/16962
>>
>> Pedro
>>
>> On Tue, Dec 3, 2019 at 9:49 AM Pedro Larroy 
>> wrote:
>>
>>> Hi MXNet community. We are in the process of updating the base AMIs for
>>> CI with an updated CUDA driver to fix the CI blockage.
>>>
>>> We would need help from the community to diagnose some of the build
>>> errors which don't seem related to the infrastructure.
>>>
>>> I have observed this build failure with tvm when not installing the cuda
>>> driver in the container:
>>>
>>>
>>> https://pastebin.com/bQA0W2U4
>>>
>>> centos gpu builds and tests seem to run with the updated AMI and changes
>>> to the container.
>>>
>>>
>>> Thanks.
>>>
>>>
>>> On Mon, Dec 2, 2019 at 12:11 PM Pedro Larroy <
>>> pedro.larroy.li...@gmail.com> wrote:
>>>
>>>> Small update about CI, which is blocked.
>>>>
>>>> Seems there's a nvidia driver compatibility problem in the base AMI
>>>> that is running in GPU instances and the nvidia docker images that we use
>>>> for building and testing.
>>>>
>>>> We are working on providing a fix by updating the base images as
>>>> doesn't seem to be easy to fix by just changing the container.
>>>>
>>>> Thanks.
>>>>
>>>> Pedro.
>>>>
>>>


Re: Can upgrade windows CI cmake?

2019-12-06 Thread Pedro Larroy
The CMake shipped with Ubuntu has issues when compiling with CUDA on GPU
instances. I wouldn't recommend anything older than 3.12 for Linux GPU:

https://github.com/apache/incubator-mxnet/blob/master/ci/docker/install/ubuntu_core.sh#L63

I don't know about the Windows CMake version, but it would make sense to
require a newer one.

On Thu, Dec 5, 2019 at 7:26 PM Lausen, Leonard 
wrote:

> Currently we declare cmake_minimum_required(VERSION 3.0.2)
>
> I'm in favor of updating our CMake requirement. The main question may be
> what
> new version to pick as minimum requirement.
>
> In general, there is the guideline
>
> > You really should at least use a version of CMake that came out after
> your
> > compiler, since it needs to know compiler flags, etc, for that version.
> And,
> > since CMake will dumb itself down to the minimum required version in your
> > CMake file, installing a new CMake, even system wide, is pretty safe. You
> > should at least install it locally. It's easy (1-2 lines in many cases),
> and
> > you'll find that 5 minutes of work will save you hundreds of lines and
> hours
> > of CMakeLists.txt writing, and will be much easier to maintain in the
> long
> > run.
> https://cliutils.gitlab.io/modern-cmake/
>
> https://cliutils.gitlab.io/modern-cmake/chapters/intro/newcmake.html
> gives a
> short overview of all the improvements made to CMake over the past 6 years.
>
> It's easy for users to upgrade their cmake version with pip:
>   pip install --upgrade --user cmake
> Thus it wouldn't be overly problematic to rely on a very recent version of
> cmake, if indeed it's required.
>
> Nevertheless, if an earlier version fixes the problems, let's rather pick
> that
> one. Did you confirm which version is required to fix the problem?
>
> For now you could try if the CMake version shipped in the oldest supported
> Ubuntu LTS release (Ubuntu 16.04) is fixing your problem (CMake 3.5)? If
> not,
> please test if CMake version shipped in Ubuntu 18.04 (CMake 3.10) fixes
> your
> issue.
>
> Thanks
> Leonard
>
> On Fri, 2019-12-06 at 08:45 +0800, shiwen hu wrote:
> > I am sending a PR https://github.com/apache/incubator-mxnet/pull/16980 to
> > change the Windows build system, but now the CI CMake version seems to have
> > a bug and can't compile. Can we upgrade to 3.16.0?
>
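For illustration, the change under discussion is a one-line bump of the declared floor in the top-level CMakeLists.txt; the version below is only an example, since picking the actual minimum is what this thread is about:

    # Illustrative sketch only -- the concrete minimum version is still being decided.
    cmake_minimum_required(VERSION 3.12)
    project(mxnet C CXX)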


Re: Please remove conflicting Open MP version from CMake builds

2019-12-06 Thread Pedro Larroy
I will try to stay on the sidelines for now, since previous conversations
about OMP have not been productive here and I have spent way too much time
on this already. I'm not the first one to give up on trying to help with
this topic.

I would be glad if you guys can work together and find a solution. I will
just lay out my understanding of the big picture, hoping that it helps move
things forward.


Recently the Intel OMP library, which seemed to have the best performance of
the three, was removed from MKL.

- There are 3 libraries in play: GNU OpenMP, which is shipped with gcc (gomp);
LLVM OpenMP in 3rdparty (llvm-omp); and Intel OMP when using MKL, which was
recently removed (iomp).

- IOMP seems to have the best performance. There are stability issues that
sometimes produce crashes, but the impact seems relatively small for users
and developers. In general, linking with a different OMP version than the one
shipped with the compiler is known to cause stability issues, but it's done
anyway.

- LLVM-OMP is used when building with CMake, and not used in the PIP releases
or when building with Make. It has stability issues: it hangs during test
execution in debug mode and produces tons of assertion failures in debug mode.
It might have some small performance gains, but there is no clear-cut data
that showcases significant gains.

- GOMP is the version shipped with GCC and used in the PIP wheels without MKL;
it has no stability problems.

As a ballpark, IOMP might give a 10% performance improvement in some cases.

We need to document well how users should tune and configure MXNet when
using OMP.

As a developer, the safest bet is to use GOMP, to be able to debug and
develop without issues. As a user of CPU inference / training you want to
run MKL, so it depends on how the Intel guys want to do things. My preference
as an engineer is always stability > speed.
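To make the compiler-provided option concrete, here is a minimal CMake sketch (an illustration, not the project's current build logic) that links against whatever OpenMP runtime the toolchain ships instead of building the bundled copy:

    # Sketch: use the toolchain's own OpenMP runtime (libgomp for GCC, libomp for
    # Clang) rather than building 3rdparty/openmp from source.
    find_package(OpenMP REQUIRED)
    target_link_libraries(mxnet PUBLIC OpenMP::OpenMP_CXX)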

Related tickets:

https://github.com/apache/incubator-mxnet/issues/16891

https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931


https://github.com/apache/incubator-mxnet/issues/11417

https://github.com/apache/incubator-mxnet/issues/15690



On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard 
wrote:

> Is this related to https://github.com/apache/incubator-mxnet/issues/10856?
>
> I unlocked that Github issue based on the Apache Code of Conduct
> https://www.apache.org/foundation/policies/conduct#specific-guidelines
>
>
> On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$ ldd
> > build/libmxnet.so| grep -i openmp
> > libomp.so =>
> > /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
> > (0x7fde0991d000)
> > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$ python
> > ~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec
> > --batch-size 256 --dtype float32 --num-data-workers 40 --mode hybrid
> > --model resnet50_v2 --use-pretrained --kvstore local --log-interval 1
> > --rec-val ~/data/val-passthrough.rec --rec-val-idx
> > ~/data/val-passthrough.idx
> > INFO:root:Namespace(batch_norm=False, batch_size=256,
> > data_dir='~/.mxnet/datasets/imagenet', dataset_size=32, dtype='float32',
> > kvstore='local', last_gamma=False, log_interval=1, logging_dir='logs',
> > lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
> > lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9,
> > num_epochs=3, num_gpus=0, num_workers=40,
> > rec_val='/home/piotr/data/val-passthrough.rec',
> > rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params',
> > save_frequency=0, top_k=0, use_pretrained=True, use_rec=True,
> use_se=False,
> > warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
> > [10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2:
> > /home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
> > INFO:root:Batch [0]
> > INFO:root:Top 1 accuracy: 0
> > INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
> > INFO:root:Batch [1]
> > INFO:root:Top 1 accuracy: 0
> > INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
> > INFO:root:Batch [2]
> > INFO:root:Top 1 accuracy: 0
> > INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363
> >
> >
> >
> >
> >
> >
> >
> > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp
> (upstream_master)+$
> > git st
> > On branch upstream_master
> > Your branch is up to date with 'origin/upstream_master'.
> >
> > Changes not staged for commit:
> >   (use "git add/rm ..." to update what will be 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-07 Thread Pedro Larroy
Stop disseminating false information:

https://github.com/apache/incubator-mxnet/issues/14979


On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier  wrote:

> -1
>
> mkldnn removed omp5 for licencing issues
> no bugs have actually been traced to the use of llvm openmp. only an assert
> caused by an actual bug in mxnet code. there are suitable workarounds.
>
> over time llvm omp has simply been used as a “catch all” for random
> problems that aren’t related at all (such as getenv race condition in an
> atfork call that isn’t even part of an omp parallel region).
>
> proposal is now and has always been roughly equivalent to the idea of
> “comment out an assert rather than fix the bug it’s reporting”.
>
> Up until very recently, Makefile version of mxnet used libomp5 for YEARS
> and not libgomp, with no issue reported (omp not built in debug mode), so
> > the equivalent configuration from CMake mysteriously causing myriads of
> problems has questionable merit and smells more like a hubris situation.
>
> I use tensorflow as well and it links to libomp5 rather than libgomp.
>
> if the assert problem is really a problem, the bug being reported would be
> prioritized and fixed. it should be fixed regardless. all the time spent by
> some CI people trying to remove this could have simply fixed the actual bug
> in a small fraction of the time.
>
>
> On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard 
> wrote:
>
> > I think it's reasonable to assume that the Intel MKLDNN team is an
> > "authorative"
> > source about the issue of compilation with OpenMP and the OpenMP runtime
> > library
> > related issues. Thus I suggest we follow the recommendation of Intel
> > MKLDNN team
> > within the MXNet project.
> >
> > Looking through the Intel MKLDNN documentation, I find [1]:
> >
> > > DNNL uses OpenMP runtime library provided by the compiler.
> >
> > as well as
> >
> > > it's important to ensure that only one OpenMP runtime is used
> throughout
> > the
> > > application. Having more than one OpenMP runtime linked to an
> executable
> > may
> > > lead to undefined behavior including incorrect results or crashes.
> >
> > To keep our project maintainable and error free, I thus suggest we follow
> > DNNL
> > and use the OpenMP runtime library provided by the compiler.
> > We have limited resources and finding the root cause for any bugs
> > resulting
> > from linking multiple OpenMP libraries as currently done is, in my
> > opinion. not
> > a good use of time. We know it's due to undefined behavior and we know
> > it's best
> > practice to use OpenMP runtime library provided by the compiler. So let's
> > just
> > do that.
> >
> > I think given that MKL-DNN has also adopted the "OpenMP runtime library
> > provided
> > by the compiler" approach, this issue is not contentious anymore and
> > qualifies
> > for lazy consensus.
> >
> > Thus if there is no objection within 72 hours (lazy consensus), let's
> drop
> > bundled LLVM OpenMP from master [2]. If we find any issues due to
> > dropping the
> > bundled LLVM OpenMP, we can always add it back prior to the next release.
> >
> > Best regards
> > Leonard
> >
> > [1]:
> >
> >
> https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> > (This is the updated reference from Anton's previous comment, based on
> the
> > changes in MKLDNN done in the meantime
> >
> https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066
> > )
> > [2]: Alike https://github.com/apache/incubator-mxnet/pull/12160
> >
> >
> > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > > I will try to stay on the sidelines for now since previous
> conversations
> > > about OMP have not been productive here and I have spent way too much
> > time
> > > on this already, I'm not the first one giving up on trying to help with
> > > this topic.
> > >
> > > I would be glad if you guys can work together and find a solution. I
> will
> > > just put my understanding of the big picture hoping that it helps move
> it
> > > forward.
> > >
> > >
> > > Recently the intel omp library which seemed to have the best
> performance
> > of
> > > the 3 was removed from MKL.
> > >
> > > - There's 3 libraries in play, GNU Omp which is shipped with gcc
> (gomp),
> > > LLVM openmp in 3rd

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
rs. Ie. most users actually don't use the CMake
>> build
>> with 3rdparty/openmp. You can consider rescinding your veto on removing
>> 3rdparty/openmp after reading through the evidence in that issue. If you
>> don't
>> provide any evidence for why the methodology/conclusion in #14979 is
>> flawed, I
>> will assume your previous veto is void based on Apache Voting rule as it
>> lacks
>> technical justification and in any case was motivated by the assertion
>> issue,
>> which I agree with you, is likely not due to gomp / omp interaction.
>>
>> Thank you
>> Leonard
>>
>>
>> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>> > Stop disseminating false information:
>> >
>> > https://github.com/apache/incubator-mxnet/issues/14979
>> >
>> >
>> > On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
>> wrote:
>> >
>> > > -1
>> > >
>> > > mkldnn removed omp5 for licencing issues
>> > > no bugs have actually been traced to the use of llvm openmp. only an
>> assert
>> > > caused by an actual bug in mxnet code. there are suitable
workarounds.
>> > >
>> > > over time llvm omp has simply been used as a “catch all” for random
>> > > problems that aren’t related at all (such as getenv race condition in
>> an
>> > > atfork call that isn’t even part of an omp parallel region).
>> > >
>> > > proposal is now and has always been roughly equivalent to the idea of
>> > > “comment out an assert rather than fix the bug it’s reporting”.
>> > >
>> > > Up until very recently, Makefile version of mxnet used libomp5 for
>> YEARS
>> > > and not libgomp, with no issue reported (omp not built in debug
mode),
>> so
>> > > the equivalent configuration from CMake mysteriously causing myriads
if
>> > > problems has questionable merit and smells more like a hubris
>> situation.
>> > >
>> > > I use tensorflow as well and it links to libomp5 rather than libgomp.
>> > >
>> > > if the assert problem is really a problem, the bug being reported
>> would be
>> > > prioritized and fixed. it should be fixed regardless. all the time
>> spent by
>> > > some CI people trying to remove this could have simply fixed the
>> actual bug
>> > > in a small fraction of the time.
>> > >
>> > >
>> > > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard
>> 
>> > > wrote:
>> > >
>> > > > I think it's reasonable to assume that the Intel MKLDNN team is an
>> > > > "authorative"
>> > > > source about the issue of compilation with OpenMP and the OpenMP
>> runtime
>> > > > library
>> > > > related issues. Thus I suggest we follow the recommendation of
Intel
>> > > > MKLDNN team
>> > > > within the MXNet project.
>> > > >
>> > > > Looking through the Intel MKLDNN documentation, I find [1]:
>> > > >
>> > > > > DNNL uses OpenMP runtime library provided by the compiler.
>> > > >
>> > > > as well as
>> > > >
>> > > > > it's important to ensure that only one OpenMP runtime is used
>> > > throughout
>> > > > the
>> > > > > application. Having more than one OpenMP runtime linked to an
>> > > executable
>> > > > may
>> > > > > lead to undefined behavior including incorrect results or
crashes.
>> > > >
>> > > > To keep our project maintainable and error free, I thus suggest we
>> follow
>> > > > DNNL
>> > > > and use the OpenMP runtime library provided by the compiler.
>> > > > We have limited ressources and finding the root cause for any bugs
>> > > > resulting
>> > > > from linking multiple OpenMP libraries as currently done is, in my
>> > > > opinion. not
>> > > > a good use of time. We know it's due to undefined behavior and we
>> know
>> > > > it's best
>> > > > practice to use OpenMP runtime library provided by the compiler. So
>> let's
>> > > > just
>> > > > do that.
>> > > >
>> > > > I think given that MKL-DNN has also adopted the "OpenMP runtime
>> library
>> > > > provided
>> > &

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
Hi Leonard.

Are you saying that you have updated this library and the problems described
in the related tickets are no longer present?

P.

On Sunday, December 8, 2019, Lausen, Leonard 
wrote:
> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that
the
> assertion failure is not due to some interaction between gomp and llvm
omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
affects an
> important use-case, I believe we need to remove 3rdparty/openmp from the
CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
setup
> that according to my understanding is used to build the nightly releases
used by
> the majority of developers. Ie. most users actually don't use the CMake
build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
don't
> provide any evidence for why the methodology/conclusion in #14979 is
flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
lacks
> technical justification and in any case was motivated by the assertion
issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>> Stop disseminating false information:
>>
>> https://github.com/apache/incubator-mxnet/issues/14979
>>
>>
>> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
wrote:
>>
>> > -1
>> >
>> > mkldnn removed omp5 for licencing issues
>> > no bugs have actually been traced to the use of llvm openmp. only an
assert
>> > caused by an actual bug in mxnet code. there are suitable workarounds.
>> >
>> > over time llvm omp has simply been used as a “catch all” for random
>> > problems that aren’t related at all (such as getenv race condition in
an
>> > atfork call that isn’t even part of an omp parallel region).
>> >
>> > proposal is now and has always been roughly equivalent to the idea of
>> > “comment out an assert rather than fix the bug it’s reporting”.
>> >
>> > Up until very recently, Makefile version of mxnet used libomp5 for
YEARS
>> > and not libgomp, with no issue reported (omp not built in debug mode),
so
>> > the equivalent configuration from CMake mysteriously causing myriads if
>> > problems has questionable merit and smells more like a hubris
situation.
>> >
>> > I use tensorflow as well and it links to libomp5 rather than libgomp.
>> >
>> > if the assert problem is really a problem, the bug being reported
would be
>> > prioritized and fixed. it should be fixed regardless. all the time
spent by
>> > some CI people trying to remove this could have simply fixed the
actual bug
>> > in a small fraction of the time.
>> >
>> >
>> > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard

>> > wrote:
>> >
>> > > I think it's reasonable to assume that the Intel MKLDNN team is an
>> > > "authorative"
>> > > source about the issue of compilation with OpenMP and the OpenMP
runtime
>> > > library
>> > > related issues. Thus I suggest we follow the recommendation of Intel
>> > > MKLDNN team
>> > > within the MXNet project.
>> > >
>> > > Looking through

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
Great investigation, thank you. I have to agree with your analysis, and I
appreciate your help in resolving this long-standing issue.

This will not repair the damage done to the community by losing 3-4 valuable
contributors. Introducing a library that causes bugs, then blocking changes
and locking GitHub issues that attempt to remove or work around the issues,
in addition to making rude comments and worse things that are better left
out, is still not acceptable and begs for an apology from Chris.

P.




On Sunday, December 8, 2019, Lausen, Leonard 
wrote:
> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that
the
> assertion failure is not due to some interaction between gomp and llvm
omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
affects an
> important use-case, I beleive we need to remove 3rdparty/openmp from the
CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
setup
> that according to my understanding is used to build the nightly releases
used by
> the majority of developers. Ie. most users actually don't use the CMake
build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
don't
> provide any evidence for why the methodology/conclusion in #14979 is
flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
lacks
> technical justification and in any case was motivated by the assertion
issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>> Stop disseminating false information:
>>
>> https://github.com/apache/incubator-mxnet/issues/14979
>>
>>
>> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
wrote:
>>
>> > -1
>> >
>> > mkldnn removed omp5 for licencing issues
>> > no bugs have actually been traced to the use of llvm openmp. only an
assert
>> > caused by an actual bug in mxnet code. there are suitable workarounds.
>> >
>> > over time llvm omp has simply been used as a “catch all” for random
>> > problems that aren’t related at all (such as getenv race condition in
an
>> > atfork call that isn’t even part of an omp parallel region).
>> >
>> > proposal is now and has always been roughly equivalent to the idea of
>> > “comment out an assert rather than fix the bug it’s reporting”.
>> >
>> > Up until very recently, Makefile version of mxnet used libomp5 for
YEARS
>> > and not libgomp, with no issue reported (omp not built in debug mode),
so
>> > the equivalent configuration from CMake mysteriously causing myriads if
>> > problems has questionable merit and smells more like a hubris
situation.
>> >
>> > I use tensorflow as well and it links to libomp5 rather than libgomp.
>> >
>> > if the assert problem is really a problem, the bug being reported
would be
>> > prioritized and fixed. it should be fixed regardless. all the time
spent by
>> > some CI people trying to remove this could have simply fixed the
actual bug
>> > in a small fraction of the time.
>> >
>> >
>> > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard

>> > wrote:
>> >
>> > > I think it's reasonable to assume that the Intel MKLDNN team is an
>> > > &qu

The essence of deep learning, autodiff and higher order gradients

2019-12-18 Thread Pedro Larroy
Hi

I published the slides I presented at the last MXNet meetup on automatic
differentiation and higher-order gradients, in case you want more insight
into some of the PRs that have been sent or into future directions in this
area for 2.0. I also compare implementations across major deep learning
frameworks. Let me know if you have any questions or feedback, and please
like or share my post.

https://www.linkedin.com/posts/pedrolarroy_the-essence-of-deep-learning-automatic-differentiation-activity-6613142805923536896-PuI5/

Pedro.


[discuss] add lgtm.com to mxnet

2019-12-18 Thread Pedro Larroy
Shall we add lgtm to mxnet?  https://lgtm.com/


Re: [apache/incubator-mxnet] [RFC] Custom Operator Part 2 (#17006)

2019-12-26 Thread Pedro Larroy
@wkcn could you explain your suggestion? Calling gemm back into the framework,
which then gets dispatched to GPU or CPU?

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17006#issuecomment-569131388

Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc0

2019-12-26 Thread Pedro Larroy
https://github.com/apache/incubator-mxnet/pull/17012 should also be ported
to the release branch.

On Fri, Dec 20, 2019 at 1:39 PM Przemysław Trędak 
wrote:

> That issue is now fixed in master, I am in the process of cherry-picking
> the fix to v1.6.x branch. I will prepare the RC1 once that is ready.
>
> Thanks
> Przemek
>
> On 2019/12/20 20:07:36, Lin Yuan  wrote:
> > What's the next step for the release? Should we continue testing this and
> > vote or wait until the
> > https://github.com/apache/incubator-mxnet/issues/17105 is fixed?
> >
> > Thanks!
> >
> > Lin
> >
> > On Wed, Dec 18, 2019 at 12:55 AM Lausen, Leonard
> 
> > wrote:
> >
> > > Thanks Przemysław for managing this release and everyone who
> contributed
> > > to it.
> > >
> > > Unfortunately Zechen Wang just discovered another issue with GPU
> Pointwise
> > > Fusion: https://github.com/apache/incubator-mxnet/issues/17105
> > >
> > > Thus, -1.
> > >
> > > Unfortunately, as the nightly release pipeline was broken until
> recently
> > > (and
> > > still isn't re-set up completely yet), the issue hasn't been discovered
> > > earlier.
> > >
> > > Przemysław may have a quick fix for the issue. Another option would be
> to
> > > release 1.6 with MXNET_USE_FUSION default to 0.
> > >
> > > Best regards
> > > Leonard
> > >
> > > On Wed, 2019-12-18 at 05:30 +, Chen, Ciyong wrote:
> > > > Appreciate Tredak to push out voting for 1.6 release.
> > > >
> > > > +1 as we've done lots of tests with expected performance in many
> > > different
> > > > scenarios including both single-node and multi-node (horovod based),
> > > both FP32
> > > > and INT8 precision on many topologies.
> > > >
> > > > -Ciyong
> > > >
> > > > -Original Message-
> > > > From: Zhao, Patric 
> > > > Sent: Tuesday, December 17, 2019 8:51 AM
> > > > To: dev@mxnet.incubator.apache.org; d...@mxnet.apache.org
> > > > Subject: RE: [VOTE] Release Apache MXNet (incubating) version
> 1.6.0.rc0
> > > >
> > > > Thanks, Tredak, I will add some words for the new feature in the
> release
> > > note.
> > > >
> > > > +1 for voting because we have ran multiple time of tests in local and
> > > got the
> > > > expected performance boost.
> > > >
> > > > --Patric
> > > >
> > > > > -Original Message-
> > > > > From: Przemysław Trędak 
> > > > > Sent: Tuesday, December 17, 2019 4:49 AM
> > > > > To: d...@mxnet.apache.org
> > > > > Subject: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc0
> > > > >
> > > > > Dear MXNet community,
> > > > >
> > > > > This is the vote to release Apache MXNet (incubating) version
> 1.6.0.
> > > > > Voting starts now and will close on Friday, 20th December 2019
> > > 23:59:59 PST.
> > > > >
> > > > > Link to release notes:
> > > > >
> https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes
> > > > >
> > > > > Link to release candidate:
> > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc0
> > > > >
> > > > > Link to source and signatures on apache dist server:
> > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc0/
> > > > >
> > > > > Please remember to TEST first before voting accordingly:
> > > > > +1 = approve
> > > > > +0 = no opinion
> > > > > -1 = disapprove (provide reason)
> > > > >
> > > > > Additional notes:
> > > > >  - There was an issue[1] raised that 1.6.0.rc0 does not build with
> > > > > clang on FreeBSD - I decided to not block the voting for this and
> > > > > instead let the Community decide whether this is a blocker for the
> > > release.
> > > > >  - Patric Zhao and Tao Lv - could you help preparing a paragraph on
> > > > > MKLDNN
> > > > > 1.0 update in the New features section in the release notes?
> > > > >
> > > > > [1] https://github.com/apache/incubator-mxnet/issues/17076
> > > > >
> > > > > Best regards,
> > > > > Przemyslaw Tredak
> > >
> >
>


Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-26 Thread Pedro Larroy
What's the point of having an API if you type-erase it? Then you might as
well have a single-function API with a type-erased callback name to select
the function to call. In the end you move the burden away from the API to
the callers, and inside the API to the dispatchers. Rather than going the
route of uber-clever template tricks to generate code, I think it's better
to put proper code generation in place for maintainability. Could you provide
a bit more detail about the tradeoffs? Everything has tradeoffs; I don't
believe in any solution that is sold as a panacea, and there's no silver bullet.
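For readers unfamiliar with the convention discussed below, here is a rough sketch of how a single function is exposed through TVM's type-erased PackedFunc registry (simplified, based on the public TVM runtime headers; the registered name is made up):

    // Sketch: one entry in the global function registry. Arguments arrive
    // type-erased in TVMArgs; checking and conversion happen here at runtime
    // instead of in a dedicated C API entry point per function.
    #include <tvm/runtime/packed_func.h>
    #include <tvm/runtime/registry.h>

    using namespace tvm::runtime;

    TVM_REGISTER_GLOBAL("example.add")
        .set_body([](TVMArgs args, TVMRetValue* rv) {
          int a = args[0];
          int b = args[1];
          *rv = a + b;
        });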

On Thu, Dec 19, 2019 at 10:21 AM Tianqi Chen 
wrote:

> I have another candidate that would highly recommend: adopt TVM's FFI
> convention.
>
> The historical problem of MXNet FFI was the ballooning number of C API
> bindings as we add new features. This creates a huge amount of maintenance
> burden.
>
> The real problem was not really about which FFI system to adopt(cython and
> pybind are fine in that end, except for the cost of compilation), but more
> of the cost to maintain the FFI. MXNet used to have a fast cython binding,
> but that was abandoned because we kept adding new APIs and could not keep both
> ctypes and cython up to date.
>
> When developing TVM we learnt from the lesson and restrict the API to a
> limited set of runtime APIs that does not change, and have a stable cython,
> ctypes binding for them. The runtime support a type-erased
> function(PackedFunc), which can be efficiently called from any of the
> frontend language, and all the APIs are exposed through the PackedFunc. On
> the python side an additional wrapping is created for better documentation
> and call into the PackedFunc. See more in
> https://docs.tvm.ai/dev/runtime.html The system works great for over a
> few years now.
>
> Of course I understand there has been legacy issues in MXNet that is why I
> did not bring this proposal up. But given this is a proposal for 2.0, I
> would encourage everyone to give a serious thought about this possibility.
>


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569135511

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-26 Thread Pedro Larroy
Pybind is nice; I used Boost.Python many years ago, which I think it is based
on. The problem with this is the hourglass C bindings: you have to go from
Python to C++ / pybind, down to C and into the engine, which seems like a lot
of boilerplate.
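For context, a minimal pybind11 binding looks roughly like the sketch below (a generic illustration, not MXNet's actual operator API; the module and function names are made up):

    // Generic pybind11 sketch: PYBIND11_MODULE generates the Python module init,
    // and m.def() exposes a C++ function directly to Python without a C shim.
    #include <pybind11/pybind11.h>

    int add(int a, int b) { return a + b; }

    PYBIND11_MODULE(example, m) {
      m.def("add", &add, "Add two integers");
    }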

On Mon, Dec 16, 2019 at 10:02 PM reminisce  wrote:

> MXNet imperative operator invocation overhead is as large as 30-60us,
> which is significant compared to the official NumPy operators with ~600ns
> overhead. This has negatively impacted the performance of applying MXNet to
> the models where many operators' kernel runtime duration is short,
> especially in the area of classic machine learning. We plan to address the
> problem in two steps:
>
>1.
>
>Short term: Use pybind11 to replace Python op API and ctypes/c api.
>Preliminary experiments show that the pure Python-C++ turnaround time by
>using Pybind is between 400-600ns, while the current Python op API using
>ctypes/c api costs more than 10us. We believe with the correct
>implementation, we can reduce the op invocation overhead to 2us including
>the time on FFI and engine.
>2.
>
>Long term: Adopt Python's C extension interface. NumPy did this by
>developing its own C API. This provides considerably less overhead compared
>to other solutions. However, it would cost much more engineering efforts by
>integrating this with our existing operator workflow in C++.
>
> @hzfan  @hgt312 
>


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569135990

Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc0

2019-12-27 Thread Pedro Larroy
Agree with Sheng. I think it would be good to have the nice fixes that
Leonard has done for 1.6 and not delay them to later releases, since they
are beneficial to users and developers. Thanks Leonard for helping fix
these long-standing issues.

On Fri, Dec 27, 2019 at 11:03 AM Lin Yuan  wrote:

> No, I just wanted to call it out because the title of the issue says
> "Failed
> OpenMP assertion when loading MXNet compiled with DEBUG=1
> <https://github.com/apache/incubator-mxnet/issues/10856#>".
> If this is considered a release blocker, I think we should backport it to
> 1.6.
>
> Thanks,
> Lin
>
> On Fri, Dec 27, 2019 at 10:47 AM Sheng Zha  wrote:
>
> > Reading these issues it’s pretty clear to me that these are fixes for
> > broken builds. I think we do consider broken builds to be release
> blockers.
> >
> > Lin, am I missing something on which you base your suggestion for
> delaying
> > these changes?
> >
> > -sz
> >
> > > On Dec 27, 2019, at 10:30 AM, Lin Yuan  wrote:
> > >
> > > Are these release blocker? It's very risky to make such last-minute
> big
> > > change after code freeze.
> > >
> > > Can we do this in the next release?
> > >
> > > Lin
> > >
> > >> On Fri, Dec 27, 2019 at 7:37 AM Lausen, Leonard
> > 
> > >> wrote:
> > >>
> > >> In case of backporting #17012, also
> > >> https://github.com/apache/incubator-mxnet/pull/17098 must be
> > backported.
> > >> The
> > >> updated OpenMP added a new target which is not used by MXNet but
> breaks
> > the
> > >> build on some systems with nvptx. #17098 disables building this unused
> > and
> > >> broken feature.
> > >>
> > >>> On Thu, 2019-12-26 at 12:55 -0800, Pedro Larroy wrote:
> > >>> https://github.com/apache/incubator-mxnet/pull/17012  should be also
> > >> ported
> > >>> to the release branch.
> > >>>
> > >>> On Fri, Dec 20, 2019 at 1:39 PM Przemysław Trędak <
> ptre...@apache.org>
> > >>> wrote:
> > >>>
> > >>>> That issue is now fixed in master, I am in the process of
> > >> cherry-picking
> > >>>> the fix to v1.6.x branch. I will prepare the RC1 once that is ready.
> > >>>>
> > >>>> Thanks
> > >>>> Przemek
> > >>>>
> > >>>> On 2019/12/20 20:07:36, Lin Yuan  wrote:
> > >>>>> What's the next step for the release? Should we continue testing
> > >> this and
> > >>>>> vote or wait until the
> > >>>>> https://github.com/apache/incubator-mxnet/issues/17105 is fixed?
> > >>>>>
> > >>>>> Thanks!
> > >>>>>
> > >>>>> Lin
> > >>>>>
> > >>>>> On Wed, Dec 18, 2019 at 12:55 AM Lausen, Leonard
> > >>>> 
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Thanks Przemysław for managing this release and everyone who
> > >>>> contributed
> > >>>>>> to it.
> > >>>>>>
> > >>>>>> Unfortunately Zechen Wang just discovered another issue with GPU
> > >>>> Pointwise
> > >>>>>> Fusion: https://github.com/apache/incubator-mxnet/issues/17105
> > >>>>>>
> > >>>>>> Thus, -1.
> > >>>>>>
> > >>>>>> Unfortunately, as the nightly release pipeline was broken until
> > >>>> recently
> > >>>>>> (and
> > >>>>>> still isn't re-set up completely yet), the issue hasn't been
> > >> discovered
> > >>>>>> earlier.
> > >>>>>>
> > >>>>>> Przemysław may have a quick fix for the issue. Another option
> > >> would be
> > >>>> to
> > >>>>>> release 1.6 with MXNET_USE_FUSION default to 0.
> > >>>>>>
> > >>>>>> Best regards
> > >>>>>> Leonard
> > >>>>>>
> > >>>>>> On Wed, 2019-12-18 at 05:30 +, Chen, Ciyong wrote:
> > >>>>>>> Appreciate Tredak to push out voting for 1.6 release.
> > >>>>>>>
> > >>>

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-27 Thread Pedro Larroy
Thanks for the explanation. I'm not so concerned about the complexity of
dispatching. If I understood you correctly, the main benefit you describe
for the TVM project was not having to change the C API, but you still need
to do type checking on both ends, or at least on the receiving end of the
API, correct? I think we have discussed similar things in the past and we
might have different views on strongly typed vs dynamically typed APIs. A
priori I prefer to see an API which can be evolved and changed; I find it
more explicit and clearer than what I think you do with PackedFunc, which I
have looked at briefly but not used extensively. If one is going to call
into the C API using pybind, does it make sense to layer a C++ API on top
of the C API for this?

Also, these microbenchmarks are nice, but we also need to consider the
overhead in typical workloads and see if it's still significant.

CFFI is another alternative.
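
To illustrate what I mean, here is a minimal cffi sketch for a single,
made-up C entry point (MXDummyInvoke and the library name below are
placeholders, not the actual MXNet C API); the point is only that type
checking and translation then live on the Python side of the boundary:

    from cffi import FFI

    ffi = FFI()
    # Hypothetical C API declaration -- not the real MXNet C API.
    ffi.cdef("int MXDummyInvoke(const char* op_name, int num_args, float* args);")
    lib = ffi.dlopen("libmxnet.so")  # assumed library name/location

    def invoke(op_name, args):
        # Type checking and translation happen here, on the Python side,
        # before crossing the FFI boundary into the C entry point.
        buf = ffi.new("float[]", [float(a) for a in args])
        ret = lib.MXDummyInvoke(op_name.encode("utf-8"), len(args), buf)
        if ret != 0:
            raise RuntimeError("C API call failed")
        return [buf[i] for i in range(len(args))]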

I couldn't access some of the links you pointed to, such as:

https://github.com/tqchen/tvm/tree/pyffi

On Thu, Dec 26, 2019 at 2:00 PM Tianqi Chen 
wrote:

> @larroy indeed every solution has trade-offs, and these tradeoffs are
> discussed in the above posts when we compare solutions, and they are backed
> by benchmarks :) it would be great if you can also suggest potential
> tradeoffs here.
>
> When you expose an API from typed language(c++) to a dynamic
> language(python), you have to type erase it, given that the python
> functions don't have the type, and you have to pass the information along.
>
> The only difference is where you do the type checking(that the python type
> corresponds to the right c++ type), and translation(translating to the c++
> type).
>
> For example, in the case of pybind, the erasure is done implicitly when
> you call the python function, then checking and translation happens when
> you call into the c++ function.
>
> In the case of creating a C API for each feature and wrap things in the
> python side, the type checking is done in the python side, and translation
> as well.
>
> In the case of tvm ffi, the type translation is done in the python/cython
> side,  while the type checking is done in the c++.
>
> To dive deeper into the tradeoffs for PackedFunc calling convention. The
> convention erases the type by having the type code stored into the
> arguments. This brings additional cost of passing arguments into heap, as
> opposed to registers. So they might not be designed for inline functions
> that needs to happen at the order of 1e-9s, however, for API functions that
> needs to run around 1e-7 or even 1e-8 level, this convention is pretty good.
>
> In terms of the calling cost, it really depends on whether the caller and
> callee are strongly typed.
> - If caller is strongly typed, then assigning type code is O(1)
> - If caller is a dynamic type(like python) then we need to have a
> dispatcher to dispatch and select the right type code
> - If callee is strongly typed, then the cost of checking is O(1) by just
> check the code to be the correct one
> - If the callee is dynamic type, then a dispatching need to happen, which
> have another level of hashtable lookup O(1)
>
> As we can see, the only place where dispatching is necessary is the
> dynamic type handling case. Even in these cases, if there is a strong need
> of specialization, we can directly force the type by running checking on
> the caller, and pass in the right type code (the engineering burden is the
> same as wrapping the C API). However, the benchmark suggests that the
> dynamic dispatching cost is reasonable, and satisfies the API speed.
>
> Coming back to the tradeoff, the main tradeoff here is the engineering
> burden to keep an hourglass design(with fixed set of API) vs efficiency.
> While my post did not suggest that TVM's ffi is a silver bullet, it does
> works pretty well for our use cases. hope it helps
>
>
> --
> You are receiving this because you are subscribed to this thread.
> Reply to this email directly or view it on GitHub:
>
> https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569139957


Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-27 Thread Pedro Larroy
LOL, the last one was my comment, not @szha :-D

-- 
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569358758

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-27 Thread Pedro Larroy
Test



windows ci, Cmake update, diverging scripts

2019-12-30 Thread Pedro Larroy
Hi

I was looking at a request from Leonard for updating CMake on Windows, and
I see that the post-install.py script which sets up the Windows environment
in CI has diverged significantly from the incubator-mxnet-ci repository and
the private repository that is used to deploy to production CI.

https://github.com/apache/incubator-mxnet/pull/17031

The diff is quite large: there's also a different directory structure,
which Marco committed to incubator-mxnet-ci, and MKL seems to have been
removed. My question is why this has diverged so much; I was expecting to
transplant just a single patch to update CMake.


http://ix.io/25WQ


Pedro.


Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc0

2019-12-30 Thread Pedro Larroy
Agree.

On Sat, Dec 28, 2019 at 12:43 PM Lausen, Leonard 
wrote:

> When including the OMP fixes in 1.6, Chris's fix for a race condition
> should be
> included as well. So it's 3 PRs:
>
> https://github.com/apache/incubator-mxnet/pull/17012
> https://github.com/apache/incubator-mxnet/pull/17039
> https://github.com/apache/incubator-mxnet/pull/17098
>
> While all of these don't affect the binary Python builds that will be
> distributed for 1.6 release, they do affect any users building the 1.6
> release
> from source with cmake. So it's beneficial to backport the 3 PRs.
>
> On Fri, 2019-12-27 at 11:24 -0800, Pedro Larroy wrote:
> > Agree with Sheng, I think it would be good to have the nice fixes that
> > Leonard has done for 1.6 and not delay them to further releases since
> they
> > are beneficial to users and developers. Thanks Leonard for helping fix
> > these long standing issues.
> >
> > On Fri, Dec 27, 2019 at 11:03 AM Lin Yuan  wrote:
> >
> > > No, I just wanted to call it out because the title of the issue says
> > > "Failed
> > > OpenMP assertion when loading MXNet compiled with DEBUG=1
> > > <https://github.com/apache/incubator-mxnet/issues/10856#>;".
> > > If this is considered a release blocker, I think we should backport it
> to
> > > 1.6.
> > >
> > > Thanks,
> > > Lin
> > >
> > > On Fri, Dec 27, 2019 at 10:47 AM Sheng Zha  wrote:
> > >
> > > > Reading these issues it’s pretty clear to me that these are fixes for
> > > > broken builds. I think we do consider broken builds to be release
> > > blockers.
> > > > Lin, am I missing something on which you base your suggestion for
> > > delaying
> > > > these changes?
> > > >
> > > > -sz
> > > >
> > > > > On Dec 27, 2019, at 10:30 AM, Lin Yuan 
> wrote:
> > > > >
> > > > > Are these release blocker? It's very risky to make such
> last-minute
> > > big
> > > > > change after code freeze.
> > > > >
> > > > > Can we do this in the next release?
> > > > >
> > > > > Lin
> > > > >
> > > > > > On Fri, Dec 27, 2019 at 7:37 AM Lausen, Leonard
> > > > 
> > > > > > wrote:
> > > > > >
> > > > > > In case of backporting #17012, also
> > > > > > https://github.com/apache/incubator-mxnet/pull/17098 must be
> > > > backported.
> > > > > > The
> > > > > > updated OpenMP added a new target which is not used by MXNet but
> > > breaks
> > > > the
> > > > > > build on some systems with nvptx. #17098 disables building this
> unused
> > > > and
> > > > > > broken feature.
> > > > > >
> > > > > > > On Thu, 2019-12-26 at 12:55 -0800, Pedro Larroy wrote:
> > > > > > > https://github.com/apache/incubator-mxnet/pull/17012  should
> be also
> > > > > > ported
> > > > > > > to the release branch.
> > > > > > >
> > > > > > > On Fri, Dec 20, 2019 at 1:39 PM Przemysław Trędak <
> > > ptre...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > That issue is now fixed in master, I am in the process of
> > > > > > cherry-picking
> > > > > > > > the fix to v1.6.x branch. I will prepare the RC1 once that is
> > > > > > > > ready.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Przemek
> > > > > > > >
> > > > > > > > On 2019/12/20 20:07:36, Lin Yuan 
> wrote:
> > > > > > > > > What's the next step for the release? Should we continue
> testing
> > > > > > this and
> > > > > > > > > vote or wait until the
> > > > > > > > > https://github.com/apache/incubator-mxnet/issues/17105 is
> fixed?
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > Lin
> > > > > > > > >
> > > > > > > > > On Wed, Dec 18, 2019 at 12:55 AM Lausen, Leonard
> > > > &g

Re: windows ci, Cmake update, diverging scripts

2019-12-30 Thread Pedro Larroy
It's automated but broken, as the execution is in a failed state. I think
we will need an engineer to do repairs there.

It uses Systems Manager Automation to produce these AMIs.
On Mon, Dec 30, 2019 at 1:44 PM Lausen, Leonard 
wrote:

> Some more background:
>
> Since a few days, CI downloads and installs a more recent cmake version in
> the
> Windows job based on
>
> https://github.com/leezu/mxnet/blob/230ceee5d9e0e02e58be69dad1c4ffdadbaa1bd9/ci/build_windows.py#L148-L153
>
> This ad-hoc download and installation is not ideal and in fact a workaround
> until the base Windows AMI used by the CI server is updated. The script
> generating the base Windows AMI is tracked at
> https://github.com/apache/incubator-mxnet-ci and Shiwen Hu recently
> updated the
> script to include the updated cmake version:
> https://github.com/apache/incubator-mxnet-ci/pull/17
>
> It seems that this change needs to be deployed manually, which Pedro is
> attempting to do. But if I understand correctly Pedro found the public
> version
> of the AMI generation script and some currently used script diverged:
> http://ix.io/25WQ
>
>
>
> Questions:
> 1) Is there a git history associated with the version of the script that
> diverged?
>
> 2) According to
>
> https://github.com/apache/incubator-mxnet-ci/tree/master/services/jenkins-slave-creation-windows
> the Windows Base AMI should be created automatically. Why is it not done
> automatically anymore / why does the documentation claim it happens
> automatically but it doesn't?
>
> On Mon, 2019-12-30 at 12:11 -0800, Pedro Larroy wrote:
> > Hi
> >
> > I was looking at a request from Leonard for updating CMake on windows,
> and
> > I see that the post-install.py script which setups the windows
> environment
> > in CI has diverged significantly from the incubator-mxnet-ci and the
> > private repository that is used to deploy to production CI.
> >
> > https://github.com/apache/incubator-mxnet/pull/17031
> >
> > I see quite some patch of differences, there's also different directory
> > structure which Marco committed to incubator-mxnet-ci  and MKL seems to
> be
> > removed. My question why has this diverged so much, I was expecting to
> > transplant just a single patch to update CMake.
> >
> >
> > http://ix.io/25WQ
> >
> >
> > Pedro.
>


Re: windows ci, Cmake update, diverging scripts

2019-12-30 Thread Pedro Larroy
I have looked into this a bit, and it seems the open source version which
is in https://github.com/apache/incubator-mxnet-ci is older than what's
already deployed.
The root cause of the failure in the update job seems to be a hardcoded AMI
which is no longer available. There is now a way to query for the latest
Windows AMI via the Systems Manager Parameter Store:
https://aws.amazon.com/blogs/mt/query-for-the-latest-windows-ami-using-systems-manager-parameter-store/
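
A minimal boto3 sketch of that query (assuming AWS credentials and region
are configured; the parameter name is the public one for Windows Server
2019 Full Base):

    import boto3

    # Query the public SSM parameter that tracks the latest Windows AMI,
    # instead of hardcoding an AMI id in the automation document.
    ssm = boto3.client("ssm", region_name="us-west-2")
    resp = ssm.get_parameter(
        Name="/aws/service/ami-windows-latest/"
             "Windows_Server-2019-English-Full-Base"
    )
    print(resp["Parameter"]["Value"])  # prints the current AMI id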

On Mon, Dec 30, 2019 at 3:12 PM Pedro Larroy 
wrote:

> It's automated but broken as the execution is in failed state. I think we
> will need an engineer to do repairs there.
>
> It's using systems manager automation to produce these AMIs.
>
> On Mon, Dec 30, 2019 at 1:44 PM Lausen, Leonard 
> wrote:
>
>> Some more background:
>>
>> Since a few days, CI downloads and installs a more recent cmake version
>> in the
>> Windows job based on
>>
>> https://github.com/leezu/mxnet/blob/230ceee5d9e0e02e58be69dad1c4ffdadbaa1bd9/ci/build_windows.py#L148-L153
>>
>> This ad-hoc download and installation is not ideal and in fact a
>> workaround
>> until the base Windows AMI used by the CI server is updated. The script
>> generating the base Windows AMI is tracked at
>> https://github.com/apache/incubator-mxnet-ci and Shiwen Hu recently
>> updated the
>> script to include the updated cmake version:
>> https://github.com/apache/incubator-mxnet-ci/pull/17
>>
>> It seems that this change needs to be deployed manually, which Pedro is
>> attempting to do. But if I understand correctly Pedro found the public
>> version
>> of the AMI generation script and some currently used script diverged:
>> http://ix.io/25WQ
>>
>>
>>
>> Questions:
>> 1) Is there a git history associated with the version of the script that
>> diverged?
>>
>> 2) According to
>>
>> https://github.com/apache/incubator-mxnet-ci/tree/master/services/jenkins-slave-creation-windows
>> the Windows Base AMI should be created automatically. Why is it not done
>> automatically anymore / why does the documentation claim it happens
>> automatically but it doesn't?
>>
>> On Mon, 2019-12-30 at 12:11 -0800, Pedro Larroy wrote:
>> > Hi
>> >
>> > I was looking at a request from Leonard for updating CMake on windows,
>> and
>> > I see that the post-install.py script which setups the windows
>> environment
>> > in CI has diverged significantly from the incubator-mxnet-ci and the
>> > private repository that is used to deploy to production CI.
>> >
>> > https://github.com/apache/incubator-mxnet/pull/17031
>> >
>> > I see quite some patch of differences, there's also different directory
>> > structure which Marco committed to incubator-mxnet-ci  and MKL seems to
>> be
>> > removed. My question why has this diverged so much, I was expecting to
>> > transplant just a single patch to update CMake.
>> >
>> >
>> > http://ix.io/25WQ
>> >
>> >
>> > Pedro.
>>
>


Re: windows ci, Cmake update, diverging scripts

2020-01-02 Thread Pedro Larroy
I cleaned up the Windows setup and installation scripts. Building MXNet on
Windows can now be done by executing just *two* scripts: one to set up the
dependencies and the other to build.
I also updated the install instructions to reflect this simplified setup.
Please help review the PR. It also updates CMake to 3.15 as requested by
the developers.

https://github.com/apache/incubator-mxnet/pull/17206

Afterwards I will configure the Windows AMI pipeline to use this
environment so we can have CMake 3.15 in the Windows AMI.

This is a streamlined workflow for developers using MXNet on Windows who
might want to integrate it with games or other commercial packages that
need deep learning.

Thanks.


On Mon, Dec 30, 2019 at 4:19 PM Pedro Larroy 
wrote:

> I have looked into this a bit, and seems the open source version which is
> in https://github.com/apache/incubator-mxnet-ci is older than what's
> already deployed.
> The root cause of the failure in the update job seems to be a hardcoded
> AMI which is no longer available. There seems to be a way now to query for
> the latest windows AMI:
> https://aws.amazon.com/blogs/mt/query-for-the-latest-windows-ami-using-systems-manager-parameter-store/
>
> On Mon, Dec 30, 2019 at 3:12 PM Pedro Larroy 
> wrote:
>
>> It's automated but broken as the execution is in failed state. I think we
>> will need an engineer to do repairs there.
>>
>> It's using systems manager automation to produce these AMIs.
>>
>> On Mon, Dec 30, 2019 at 1:44 PM Lausen, Leonard 
>> wrote:
>>
>>> Some more background:
>>>
>>> Since a few days, CI downloads and installs a more recent cmake version
>>> in the
>>> Windows job based on
>>>
>>> https://github.com/leezu/mxnet/blob/230ceee5d9e0e02e58be69dad1c4ffdadbaa1bd9/ci/build_windows.py#L148-L153
>>>
>>> This ad-hoc download and installation is not ideal and in fact a
>>> workaround
>>> until the base Windows AMI used by the CI server is updated. The script
>>> generating the base Windows AMI is tracked at
>>> https://github.com/apache/incubator-mxnet-ci and Shiwen Hu recently
>>> updated the
>>> script to include the updated cmake version:
>>> https://github.com/apache/incubator-mxnet-ci/pull/17
>>>
>>> It seems that this change needs to be deployed manually, which Pedro is
>>> attempting to do. But if I understand correctly Pedro found the public
>>> version
>>> of the AMI generation script and some currently used script diverged:
>>> http://ix.io/25WQ
>>>
>>>
>>>
>>> Questions:
>>> 1) Is there a git history associated with the version of the script that
>>> diverged?
>>>
>>> 2) According to
>>>
>>> https://github.com/apache/incubator-mxnet-ci/tree/master/services/jenkins-slave-creation-windows
>>> the Windows Base AMI should be created automatically. Why is it not done
>>> automatically anymore / why does the documentation claim it happens
>>> automatically but it doesn't?
>>>
>>> On Mon, 2019-12-30 at 12:11 -0800, Pedro Larroy wrote:
>>> > Hi
>>> >
>>> > I was looking at a request from Leonard for updating CMake on windows,
>>> and
>>> > I see that the post-install.py script which setups the windows
>>> environment
>>> > in CI has diverged significantly from the incubator-mxnet-ci and the
>>> > private repository that is used to deploy to production CI.
>>> >
>>> > https://github.com/apache/incubator-mxnet/pull/17031
>>> >
>>> > I see quite some patch of differences, there's also different directory
>>> > structure which Marco committed to incubator-mxnet-ci  and MKL seems
>>> to be
>>> > removed. My question why has this diverged so much, I was expecting to
>>> > transplant just a single patch to update CMake.
>>> >
>>> >
>>> > http://ix.io/25WQ
>>> >
>>> >
>>> > Pedro.
>>>
>>


Re: Stopping nightly releases to Pypi

2020-01-02 Thread Pedro Larroy
CD should be separate from CI for security reasons in any case.


On Sat, Dec 7, 2019 at 10:04 AM Marco de Abreu 
wrote:

> Could you elaborate how a non-Amazonian is able to access, maintain and
> review the CodeBuild pipeline? How come we've diverted from the community
> agreed-on standard where the public Jenkins serves for the purpose of
> testing and releasing MXNet? I'd be curious about the issues you're
> encountering with Jenkins CI that led to a non-standard solution.
>
> -Marco
>
>
> Skalicky, Sam  schrieb am Sa., 7. Dez. 2019,
> 18:39:
>
> > Hi MXNet Community,
> >
> > We have been working on getting nightly builds fixed and made available
> > again. We’ve made another system using AWS CodeBuild & S3 to work around
> > the problems with Jenkins CI, PyPI, etc. It is currently building all the
> > flavors and publishing to an S3 bucket here:
> >
> >
> https://us-west-2.console.aws.amazon.com/s3/buckets/apache-mxnet/dist/?region=us-west-2
> >
> > There are folders for each set of nightly builds, try out the wheels
> > starting today 2019-12-07. Builds start at 1:30am PT (9:30am GMT) and
> > arrive in the bucket 30min-2hours later. Inside each folder are the
> wheels
> > for each flavor of MXNet. Currently we’re only building for linux, builds
> > for windows/Mac will come later.
> >
> > If you want to download the wheels easily you can use a URL in the form
> of:
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/
> >
> /dist/-1.6.0b-py2.py3-none-manylinux1_x86_64.whl
> >
> > Heres a set of links for today’s builds
> >
> > (Plain mxnet, no mkl no cuda)
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > (mxnet-mkl
> > <
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl(mxnet-mkl
> >
> > )
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > (mxnet-cuXXX
> > <
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl(mxnet-cuXXX
> >
> > )
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > (mxnet-cuXXXmkl
> > <
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl(mxnet-cuXXXmkl
> >
> > )
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> > You can easily install these pip wheels in your system either by
> > downloading them to your machine first and then installing by doing:
> >
> > pip install /path/to/downloaded/wheel.whl
> >
> > Or you can install directly by just giving the link to pip like this:
> >
> > pip install
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> > Credit goes to everyone involved (in no particular order)
> > Rakesh Vasudevan
> > Zach Kimberg
> > Manu Seth
> > Sheng Zha
> > Jun Wu
> > Pedro Larroy
> > Chaitanya Bapat
> >
> > Thanks!
> > Sam
> >
> >
> > On Dec 5, 2019, at 1:16 AM, Lausen, Leonard  > <mailto:lau...@amazon.com.INVALID>> wrote:
> >
> > We don't loose pip by hosting on S3. We just don't host nightly releases
> > on Pypi
> > s

Re: Stopping nightly releases to Pypi

2020-01-03 Thread Pedro Larroy
I'm not involved in such efforts, but one possibility is to have the YAML
files that describe the pipelines for CD in the Apache repositories; would
that be acceptable from the Apache POV? In the end they should be very
thin, just calling the scripts that are part of the CD packages.

On Fri, Jan 3, 2020 at 6:56 AM Marco de Abreu 
wrote:

> Agree, but the question how a non Amazonian is able to maintain and access
> the system is still open. As it stands right now, the community has taken a
> step back and loses some control if we continue down that road.
>
> I personally am disapproving of that approach since committers are no
> longer in control of that process. So far it seems like my questions were
> skipped and further actions have been taken. As openness and the community
> having control are part of our graduation criteria, I'm putting in my veto
> with a grace period until 15th of January. Please bring the system into a
> state that aligns with Apache values or revert the changes.
>
> -Marco
>
> Pedro Larroy  schrieb am Fr., 3. Jan. 2020,
> 03:33:
>
> > CD should be separate from CI for security reasons in any case.
> >
> >
> > On Sat, Dec 7, 2019 at 10:04 AM Marco de Abreu 
> > wrote:
> >
> > > Could you elaborate how a non-Amazonian is able to access, maintain and
> > > review the CodeBuild pipeline? How come we've diverted from the
> community
> > > agreed-on standard where the public Jenkins serves for the purpose of
> > > testing and releasing MXNet? I'd be curious about the issues you're
> > > encountering with Jenkins CI that led to a non-standard solution.
> > >
> > > -Marco
> > >
> > >
> > > Skalicky, Sam  schrieb am Sa., 7. Dez.
> 2019,
> > > 18:39:
> > >
> > > > Hi MXNet Community,
> > > >
> > > > We have been working on getting nightly builds fixed and made
> available
> > > > again. We’ve made another system using AWS CodeBuild & S3 to work
> > around
> > > > the problems with Jenkins CI, PyPI, etc. It is currently building all
> > the
> > > > flavors and publishing to an S3 bucket here:
> > > >
> > > >
> > >
> >
> https://us-west-2.console.aws.amazon.com/s3/buckets/apache-mxnet/dist/?region=us-west-2
> > > >
> > > > There are folders for each set of nightly builds, try out the wheels
> > > > starting today 2019-12-07. Builds start at 1:30am PT (9:30am GMT) and
> > > > arrive in the bucket 30min-2hours later. Inside each folder are the
> > > wheels
> > > > for each flavor of MXNet. Currently we’re only building for linux,
> > builds
> > > > for windows/Mac will come later.
> > > >
> > > > If you want to download the wheels easily you can use a URL in the
> form
> > > of:
> > > > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/
> > > >
> > >
> >
> /dist/-1.6.0b-py2.py3-none-manylinux1_x86_64.whl
> > > >
> > > > Heres a set of links for today’s builds
> > > >
> > > > (Plain mxnet, no mkl no cuda)
> > > >
> > > >
> > >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > > > (mxnet-mkl
> > > > <
> > >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl(mxnet-mkl
> > > >
> > > > )
> > > >
> > > >
> > >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > > > (mxnet-cuXXX
> > > > <
> > >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl(mxnet-cuXXX
> > > >
> > > > )
> > > >
> > > >
> > >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > > >
> > > >
> > >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > > >
> > > >
> > >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > > >
> &g

Re: Stopping nightly releases to Pypi

2020-01-03 Thread Pedro Larroy
Hey Marco.

As far as I have learned from lurking on other Apache mailing lists, Apache
only cares about making source releases; binaries are a courtesy to users
that some projects decide to provide. I'm not sure I understand your
concerns regarding the PMC and what exactly you are vetoing here, since
everyone can compile, build and package our project as per the open source
license. I would suggest taking a constructive approach and seeing how we
can make this happen for the best of the project, especially since somebody
is volunteering to help with this and to dedicate valuable compute
resources and people's time.

Regarding manual changes, I don't see any need for *anybody* to have access
to a CodeBuild control plane, for several reasons. First, manual access to
a production account is a discouraged practice; such accounts are best
managed through pipeline deployments. Second, CodeBuild is a hosted service
which basically just uses a build description file to do the work, so
there's no need for any manual fiddling or triggering. If all the CD
scripts and description files are in the Apache repository, you can use
your own account or compute resources to do your own build flavor if you so
desire.

Is your proposal to host this on Apache infrastructure? Maybe I'm missing
something in this conversation.

Pedro.


On Fri, Jan 3, 2020 at 3:21 PM Marco de Abreu 
wrote:

> Sam, while I understand that this solution was developed out of necessity,
> my question why a new system has been developed instead of fixing the
> existing one or adapting the solution. CodeBuild is a scheduler in the same
> fashion as Jenkins is. It runs code. So you can adapt it to Jenkins without
> much hassle.
>
> I'm not volunteering for this - why should I? The role of a PMC member is
> to steer the direction of the project. Just because a manager points
> towards a certain direction, if doesn't mean that they're going to do it.
>
> Apparently there was enough time at some point to develop a new solution
> from scratch. It might have been a solution for your internal team and
> that's fine, but upgrading it "temporarily" to be the advertised way on the
> official website is something different.
>
> I won't argue about how the veto can be enforced. I think it's in the best
> interest of the project if we try working on a solution instead of spending
> time on trying to figure out the power of the PMC.
>
> Pedro, that's certainly a step towards the right direction. But committers
> would also need access to the control plane of the system - to trigger,
> stop and audit builds. We could go down that road, but i think the fewer
> systems, the better - also for the sake of maintainability.
>
> Best regards,
> Marco
>
>
>
> Pedro Larroy  schrieb am Fr., 3. Jan. 2020,
> 20:55:
>
> > I'm not involved in such efforts, but one possibility is to have the yaml
> > files that describe the pipelines for CD in the Apache repositories,
> would
> > that be acceptable from the Apache POV? In the end they should be very
> thin
> > and calling the scripts that are part of the CD packages.
> >
> > On Fri, Jan 3, 2020 at 6:56 AM Marco de Abreu 
> > wrote:
> >
> > > Agree, but the question how a non Amazonian is able to maintain and
> > access
> > > the system is still open. As it stands right now, the community has
> > taken a
> > > step back and loses some control if we continue down that road.
> > >
> > > I personally am disapproving of that approach since committers are no
> > > longer in control of that process. So far it seems like my questions
> were
> > > skipped and further actions have been taken. As openness and the
> > community
> > > having control are part of our graduation criteria, I'm putting in my
> > veto
> > > with a grace period until 15th of January. Please bring the system
> into a
> > > state that aligns with Apache values or revert the changes.
> > >
> > > -Marco
> > >
> > > Pedro Larroy  schrieb am Fr., 3. Jan.
> > 2020,
> > > 03:33:
> > >
> > > > CD should be separate from CI for security reasons in any case.
> > > >
> > > >
> > > > On Sat, Dec 7, 2019 at 10:04 AM Marco de Abreu <
> > marco.g.ab...@gmail.com>
> > > > wrote:
> > > >
> > > > > Could you elaborate how a non-Amazonian is able to access, maintain
> > and
> > > > > review the CodeBuild pipeline? How come we've diverted from the
> > > community
> > > > > agreed-on standard where the public Jenkins serves for the purpose

Re: Stopping nightly releases to Pypi

2020-01-03 Thread Pedro Larroy
Just to clarify, the current CI is quite an overhead to maintain for
several reasons, and this complexity is overkill for CD. Jenkins also has
constant plugin upgrades and security vulnerabilities, and has to be
restarted from time to time as it stops working... and making binary builds
from an environment which runs unsafe code is not good practice, I think.
So for that, having a separate Jenkins, CodeBuild, Drone, or a separate
Jenkins node is the right solution. I agree with you that it is just a
scheduler, but somebody is making the effort to keep it running. If you
have the appetite and resources to duplicate it for CD, please go ahead.

On Fri, Jan 3, 2020 at 3:25 PM Marco de Abreu 
wrote:

> Regarding your point of finding somebody to maintain the solution: At
> Apache we usually retire things if there's no maintainer, since that
> indicates that the feature/system is not of enough interest to warrant
> maintenance - otherwise, someone would step up.
>
> While assistance in the form of a fix is always appreciated, the fix still
> has to conform with the way this project and Apache operates. Next time I'd
> recommend to contribute time on improving the existing community solution
> instead of developing an internal system.
>
> -Marco
>
> Marco de Abreu  schrieb am Sa., 4. Jan. 2020,
> 00:21:
>
> > Sam, while I understand that this solution was developed out of
> necessity,
> > my question why a new system has been developed instead of fixing the
> > existing one or adapting the solution. CodeBuild is a scheduler in the
> same
> > fashion as Jenkins is. It runs code. So you can adapt it to Jenkins
> without
> > much hassle.
> >
> > I'm not volunteering for this - why should I? The role of a PMC member is
> > to steer the direction of the project. Just because a manager points
> > towards a certain direction, if doesn't mean that they're going to do it.
> >
> > Apparently there was enough time at some point to develop a new solution
> > from scratch. It might have been a solution for your internal team and
> > that's fine, but upgrading it "temporarily" to be the advertised way on
> the
> > official website is something different.
> >
> > I won't argue about how the veto can be enforced. I think it's in the
> best
> > interest of the project if we try working on a solution instead of
> spending
> > time on trying to figure out the power of the PMC.
> >
> > Pedro, that's certainly a step towards the right direction. But
> committers
> > would also need access to the control plane of the system - to trigger,
> > stop and audit builds. We could go down that road, but i think the fewer
> > systems, the better - also for the sake of maintainability.
> >
> > Best regards,
> > Marco
> >
> >
> >
> > Pedro Larroy  schrieb am Fr., 3. Jan.
> 2020,
> > 20:55:
> >
> >> I'm not involved in such efforts, but one possibility is to have the
> yaml
> >> files that describe the pipelines for CD in the Apache repositories,
> would
> >> that be acceptable from the Apache POV? In the end they should be very
> >> thin
> >> and calling the scripts that are part of the CD packages.
> >>
> >> On Fri, Jan 3, 2020 at 6:56 AM Marco de Abreu 
> >> wrote:
> >>
> >> > Agree, but the question how a non Amazonian is able to maintain and
> >> access
> >> > the system is still open. As it stands right now, the community has
> >> taken a
> >> > step back and loses some control if we continue down that road.
> >> >
> >> > I personally am disapproving of that approach since committers are no
> >> > longer in control of that process. So far it seems like my questions
> >> were
> >> > skipped and further actions have been taken. As openness and the
> >> community
> >> > having control are part of our graduation criteria, I'm putting in my
> >> veto
> >> > with a grace period until 15th of January. Please bring the system
> into
> >> a
> >> > state that aligns with Apache values or revert the changes.
> >> >
> >> > -Marco
> >> >
> >> > Pedro Larroy  schrieb am Fr., 3. Jan.
> >> 2020,
> >> > 03:33:
> >> >
> >> > > CD should be separate from CI for security reasons in any case.
> >> > >
> >> > >
> >> > > On Sat, Dec 7, 2019 at 10:04 AM Marco de Abreu <
> >> marco.g.ab...@gmail.com>
> >> > > wrote:
> &g

Re: CD with windows need a special jenkins slave machine like restricted-utility

2020-01-07 Thread Pedro Larroy
I'm putting in some effort on the side to improve the state of this:

If you want to help:

https://github.com/apache/incubator-mxnet/pull/17206

https://github.com/aiengines/ci/tree/master/windows

Which of the CUDA versions you listed does it need? I did some work on the
side to update VS and CMake to 3.16.2; you can test this by running the
three scripts in the windows folder above on a fresh Windows instance. The
older CMake version has a bug which introduces a newline in the path and
renders everything unusable. I installed 3.16.2, but it still needs to be
added to the path by the install script.

You can start a fresh GPU instance with the AMI returned by:  aws ssm get-parameter
--name /aws/service/ami-windows-latest/Windows_Server-2019-English-Full-Base

Once this is working, we can update the AMI from CI. Also, this needs to be
adjusted for the new VS 2019:

https://github.com/apache/incubator-mxnet/blob/master/ci/build_windows.py#L42

To update CUDA and the NVIDIA driver, these two bundles should be added to
the script
https://github.com/aiengines/ci/blob/master/windows/windows_deps_headless_installer.py

https://windows-post-install.s3-us-west-2.amazonaws.com/cuda.zip

https://windows-post-install.s3-us-west-2.amazonaws.com/nv_driver_418.81.zip
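
A rough sketch of how the installer script could fetch and unpack them (the
target directory and helper name here are only illustrative):

    import io
    import urllib.request
    import zipfile

    BUNDLES = [
        "https://windows-post-install.s3-us-west-2.amazonaws.com/cuda.zip",
        "https://windows-post-install.s3-us-west-2.amazonaws.com/nv_driver_418.81.zip",
    ]

    def download_and_extract(url, target_dir="C:\\deps"):
        # Download the zip into memory and extract it into target_dir.
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            zf.extractall(target_dir)

    for url in BUNDLES:
        download_and_extract(url)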

Send PRs if you want to collaborate.

Pedro.




On Tue, Jan 7, 2020 at 6:13 AM Lausen, Leonard 
wrote:

> Regarding visual studio 2019: It seems we currently support Visual Studio
> 2015?
> Is there anything that Visual Studio 2015 can't do? If so, code and
> documentation should also be updated based on the new minimum version.
>
> On Tue, 2020-01-07 at 14:19 +0800, shiwen hu wrote:
> > it need visual studio 2019, cuda 9.0 9.2 10.0 10.1 10.2,
> > cmake 3.16.2,jom,opencv,openblas.
> > What do I need to do? Who should I contact?
>


Re: Stopping nightly releases to Pypi

2020-01-08 Thread Pedro Larroy
I understand your point, but you don't provide an alternative, and building
binary releases from the CI Jenkins as it is today is a very bad idea since
it's an unsafe environment. I think it's fair to ask that, if you are
vetoing the use of CodeBuild for nightly releases, you provide an
alternative solution (for example an Apache-hosted Jenkins) or anything
else. As you are well aware, non-committers can't communicate with Apache
Infra or make requests, so the onus is on you or another Apache person to
provide a solution that aligns with Apache values.

So far I see Sam trying to help with CodeBuild-managed binary releases, and
this is taken as a tinfoil-hat corporate conspiracy. It's a pity that you
claim to endorse Apache values but don't support what's best for the
project, which is to have things clean and in working order. I don't think
users care where the binary releases are hosted.

Pedro.

On Sun, Jan 5, 2020 at 5:56 AM Marco de Abreu 
wrote:

> Apache only cares about source releases as far as official releases are
> concerned. But Apache also cares about it's brand and image. You are right
> that anybody can compile an Apache project and distribute it, but it's
> under the PMCs control what can be advertised as official. This includes
> the following examples:
>
> - The official MXNet pypi, dockerhub, maven, etc account
> - The MXNet website
> - anything advertising to be MXNet
>
> If you publish a binary release and call it "AwesomeSpaghettiBolognese"
> while it's MXNet under the hood, that's totally in line with the Apache
> license. But if you decide to publish an MXNet branded package, then that's
> covered by the brand protection. I won't go into much more detail about
> legal reasons since that's not helping this discussion.
>
> I personally am vetoing a company-owned distribution channel to be
> advertised on the MXNet website or any official documentation. Also, I'd
> like to make sure that users do not mistake it for being a release that is
> affiliated or endorsed by Apache MXNet.
>
> We are taking a step back here and it's a pity to see that some people are
> still not endorsing the Apache values. This will be my last email regarding
> that topic and I will only follow up with actions after the 15th of January
> has been reached.
>
> Best regards
> Marco
>
>
> Pedro Larroy  schrieb am Sa., 4. Jan. 2020,
> 02:38:
>
> > Hey Marco.
> >
> > As far as I have learned from other Apache mailing lists while lurking is
> > that Apache only cares about making source releases, binaries are a
> > courtesy to users that some projects decide to do, but I'm not sure I
> > understand your concerns regarding the PMC and what exactly are you
> vetoing
> > here, since everyone can compile, build and package our project as per
> the
> > open source license. I would suggest to have a constructive approach and
> > see how we can make this happen for the best of the project, specially
> > since somebody is volunteering to help with this and dedicate valuable
> > compute resources and people's time.
> >
> > Regarding manual changes I don't see any need to have access to a code
> > build control plane for *anybody*, for several reasons, first is that
> > manual access to production account is a discouraged practice and are
> best
> > managed through pipeline deployments, second is that Code build is a
> hosted
> > service which is basically just using a build description file to do the
> > work, there's no need to do any manual fiddling or triggering. If all the
> > CD and description files are in the apache repository you can use your
> own
> > account or compute resources to do your own build flavor if you so
> desire.
> >
> > Is your proposal to host this in Apache infrastructure?  Maybe I'm
> missing
> > something on this conversation
> >
> > Pedro.
> >
> >
> > On Fri, Jan 3, 2020 at 3:21 PM Marco de Abreu 
> > wrote:
> >
> > > Sam, while I understand that this solution was developed out of
> > necessity,
> > > my question why a new system has been developed instead of fixing the
> > > existing one or adapting the solution. CodeBuild is a scheduler in the
> > same
> > > fashion as Jenkins is. It runs code. So you can adapt it to Jenkins
> > without
> > > much hassle.
> > >
> > > I'm not volunteering for this - why should I? The role of a PMC member
> is
> > > to steer the direction of the project. Just because a manager points
> > > towards a certain direction, if doesn't mean that they're going t

Re: Stopping nightly releases to Pypi

2020-01-08 Thread Pedro Larroy
It's not about Jenkins the software, it's about the CI environment, which
is not secure. Last week there was crypto-mining activity in the dev
environment; code can be injected into binary releases very easily. There
should be a separate instance for CD, so maybe you can facilitate that with
Apache as part of your suggestion.

On Wed, Jan 8, 2020 at 1:01 PM Marco de Abreu 
wrote:

> The risk of the current CD via Jenkins is known and was accepted as part of
> adopting Jenkins. The solution for the initial issue - no longer publishing
> to pypi - is to add a step to the existing CD pipeline which publishes the
> package to the s3 bucket instead of pypi.
>
> -Marco
>
> Pedro Larroy  schrieb am Mi., 8. Jan. 2020,
> 21:55:
>
> > I understand your point. But you don't provide an alternative, and
> building
> > binary releases from the CI jenkins as it is today is a very bad idea
> since
> > it's an unsafe environment. I think it's fair to ask if you are vetoing
> > using codebuild for nightly releases you could provide an alternative
> > solution (for example Apache hosted Jenkins) or anything else. As you are
> > well aware non-committers can't communicate with Apache Infra or make
> > requests, so the onus is on you or other Apache person to provide a
> > solution that aligns with Apache values.
> >
> > So far I see Sam trying to help with codebuild managed binary releases
> and
> > this is taken as a tinfoil hat corporate conspiracy. It's a pity that you
> > claim to endorse Apache values but not support what's best for the
> project,
> > which is to have things clean and in working order. I don't think users
> > care where the binary releases are hosted.
> >
> > Pedro.
> >
> > On Sun, Jan 5, 2020 at 5:56 AM Marco de Abreu 
> > wrote:
> >
> > > Apache only cares about source releases as far as official releases are
> > > concerned. But Apache also cares about it's brand and image. You are
> > right
> > > that anybody can compile an Apache project and distribute it, but it's
> > > under the PMCs control what can be advertised as official. This
> includes
> > > the following examples:
> > >
> > > - The official MXNet pypi, dockerhub, maven, etc account
> > > - The MXNet website
> > > - anything advertising to be MXNet
> > >
> > > If you publish a binary release and call it "AwesomeSpaghettiBolognese"
> > > while it's MXNet under the hood, that's totally in line with the Apache
> > > license. But if you decide to publish an MXNet branded package, then
> > that's
> > > covered by the brand protection. I won't go into much more detail about
> > > legal reasons since that's not helping this discussion.
> > >
> > > I personally am vetoing a company-owned distribution channel to be
> > > advertised on the MXNet website or any official documentation. Also,
> I'd
> > > like to make sure that users do not mistake it for being a release that
> > is
> > > affiliated or endorsed by Apache MXNet.
> > >
> > > We are taking a step back here and it's a pity to see that some people
> > are
> > > still not endorsing the Apache values. This will be my last email
> > regarding
> > > that topic and I will only follow up with actions after the 15th of
> > January
> > > has been reached.
> > >
> > > Best regards
> > > Marco
> > >
> > >
> > > Pedro Larroy  schrieb am Sa., 4. Jan.
> > 2020,
> > > 02:38:
> > >
> > > > Hey Marco.
> > > >
> > > > As far as I have learned from other Apache mailing lists while
> lurking
> > is
> > > > that Apache only cares about making source releases, binaries are a
> > > > courtesy to users that some projects decide to do, but I'm not sure I
> > > > understand your concerns regarding the PMC and what exactly are you
> > > vetoing
> > > > here, since everyone can compile, build and package our project as
> per
> > > the
> > > > open source license. I would suggest to have a constructive approach
> > and
> > > > see how we can make this happen for the best of the project,
> specially
> > > > since somebody is volunteering to help with this and dedicate
> valuable
> > > > compute resources and people's time.
> > > >
> > > > Regarding manual changes I don't see any need to have access to a
> code
> > > &g

Re: Stopping nightly releases to Pypi

2020-01-08 Thread Pedro Larroy
Marco, if you are fine with publishing to an S3 bucket, what's your
concern? Using a CodeBuild pipeline? The build logs could be pushed to the
S3 bucket if that is your concern.

As I said before, building binary releases in the current CI doesn't stand
a chance of passing a security review as it is today; it's not safe and is
a bad idea. The alternatives are:
1 - CodeBuild (you don't support this because it's company-owned, did I
understand that correctly?)
2 - An Apache-owned Jenkins (can you help with this?)
3 - Travis CI or similar, which in the end is similar to CodeBuild.
4 - Another Jenkins just for CD (who owns it?)

Pedro.

On Wed, Jan 8, 2020 at 1:01 PM Marco de Abreu 
wrote:

> The risk of the current CD via Jenkins is known and was accepted as part of
> adopting Jenkins. The solution for the initial issue - no longer publishing
> to pypi - is to add a step to the existing CD pipeline which publishes the
> package to the s3 bucket instead of pypi.
>
> -Marco
>
> Pedro Larroy  schrieb am Mi., 8. Jan. 2020,
> 21:55:
>
> > I understand your point. But you don't provide an alternative, and
> building
> > binary releases from the CI jenkins as it is today is a very bad idea
> since
> > it's an unsafe environment. I think it's fair to ask if you are vetoing
> > using codebuild for nightly releases you could provide an alternative
> > solution (for example Apache hosted Jenkins) or anything else. As you are
> > well aware non-committers can't communicate with Apache Infra or make
> > requests, so the onus is on you or other Apache person to provide a
> > solution that aligns with Apache values.
> >
> > So far I see Sam trying to help with codebuild managed binary releases
> and
> > this is taken as a tinfoil hat corporate conspiracy. It's a pity that you
> > claim to endorse Apache values but not support what's best for the
> project,
> > which is to have things clean and in working order. I don't think users
> > care where the binary releases are hosted.
> >
> > Pedro.
> >
> > On Sun, Jan 5, 2020 at 5:56 AM Marco de Abreu 
> > wrote:
> >
> > > Apache only cares about source releases as far as official releases are
> > > concerned. But Apache also cares about it's brand and image. You are
> > right
> > > that anybody can compile an Apache project and distribute it, but it's
> > > under the PMCs control what can be advertised as official. This
> includes
> > > the following examples:
> > >
> > > - The official MXNet pypi, dockerhub, maven, etc account
> > > - The MXNet website
> > > - anything advertising to be MXNet
> > >
> > > If you publish a binary release and call it "AwesomeSpaghettiBolognese"
> > > while it's MXNet under the hood, that's totally in line with the Apache
> > > license. But if you decide to publish an MXNet branded package, then
> > that's
> > > covered by the brand protection. I won't go into much more detail about
> > > legal reasons since that's not helping this discussion.
> > >
> > > I personally am vetoing a company-owned distribution channel to be
> > > advertised on the MXNet website or any official documentation. Also,
> I'd
> > > like to make sure that users do not mistake it for being a release that
> > is
> > > affiliated or endorsed by Apache MXNet.
> > >
> > > We are taking a step back here and it's a pity to see that some people
> > are
> > > still not endorsing the Apache values. This will be my last email
> > regarding
> > > that topic and I will only follow up with actions after the 15th of
> > January
> > > has been reached.
> > >
> > > Best regards
> > > Marco
> > >
> > >
> > > Pedro Larroy  schrieb am Sa., 4. Jan.
> > 2020,
> > > 02:38:
> > >
> > > > Hey Marco.
> > > >
> > > > As far as I have learned from other Apache mailing lists while
> lurking
> > is
> > > > that Apache only cares about making source releases, binaries are a
> > > > courtesy to users that some projects decide to do, but I'm not sure I
> > > > understand your concerns regarding the PMC and what exactly are you
> > > vetoing
> > > > here, since everyone can compile, build and package our project as
> per
> > > the
> > > > open source license. I would suggest to have a constructive approach
> > and
> > > > see how we can make this happen for the best of the pr

Re: Stopping nightly releases to Pypi

2020-01-08 Thread Pedro Larroy
Thanks for your detailed responses.

Having CodeBuild execute a recipe that lives in the Apache repository gives
the same effect and control that you would have with a service such as
Travis CI, and the builds are fully reproducible. So it's under the full
control of Apache in the same way that any other hosted build solution is.
Any modification to the recipe would be picked up on the next commit. There
would be no configuration in CodeBuild outside of the Apache MXNet
repository in this case, since the pipeline and its config would live in
the git repo.

And as you rightly pointed out, the Jenkins master is a weak point with
respect to the restricted slaves. This was strongly criticized during the
system review, and there is precedent for security flaws in the master.
Insisting on mixing CI and CD is not a good recommendation, for the reasons
explained above.

Pedro.

On Wed, Jan 8, 2020 at 2:41 PM Marco de Abreu 
wrote:

> Correct, I'm not bothered by the s3 bucket but by way how it gets
> published. It's not in Jenkins, so it's outside of the projects control.
>
> The security design due to the restricted nodes makes sure that no third
> party can gain access to these machines. They use separate caches, separate
> volumes, different instance profiles etc - I personally would consider the
> restricted slaves safe. If you're telling me that restricted slaves have
> been compromised with a crypto Miner, I'd be happy to discuss that matter
> and assist.
>
> Another attack vector is the Jenkins master, correct. If somebody
> infiltrates the Jenkins master, they can use that to jump onto the
> restricted slaves. They might modify the created artifacts, but once the
> system gets cleaned up, we're good to go again (You might rather want to
> consider a virus scan on the machines and created artifacts).
>
> But now let's say Jenkins master gets comprised. In that case, the
> artifacts are not the issue but the credentials. Jenkins contains committer
> credentials, which would allow to inject malware into our repository. Don't
> forget that a committer can add commits to other PRs, manually fake the CI
> status and then squash the PR to basically hide most of the traces. Unless
> someone reviews every single commit on master, we're basically out of luck.
>
> So yeah, that attack vector through the Jenkins master is valid, but
> considering that there are bigger risks involved in the system and the
> slaves themselves are pretty well protected, I'd not consider CD a severe
> issue in relation to the overall risk score of our system.
>
> So in order to make sure that we're well protected, I'd recommend to spend
> a bit of time on adapting the Jenkins pipeline to upload to s3 and then use
> all the remaining time to actually harden the Jenkins master and make sure
> that everything is constantly kept up to date. Security-wise, I'd consider
> that a way better investment than developing a new CD.
>
> -Marco
>
> Pedro Larroy  schrieb am Mi., 8. Jan. 2020,
> 22:49:
>
> > Marco, if you are fine publishing to an S3 bucket, what's your concern?
> > using a codebuild pipeline? The build logs could be push to the s3 bucket
> > if this is your concern.
> >
> > As I said before, having binary releases in the current CI doesn't stand
> a
> > chance to pass security review as it is today, it's not safe and is a bad
> > idea, alternatives are
> > 1 -Code Build (you don't support this because it's company owned, did I
> > understand this correctly?)
> > 2 - Apache owned Jenkins (can you help with this?)
> > 3 - Travis CI or similar, which in the end is similar to code build.
> > 4- Another Jenkins just for CD (who owns?)
> >
> > Pedro.
> >
> > On Wed, Jan 8, 2020 at 1:01 PM Marco de Abreu 
> > wrote:
> >
> > > The risk of the current CD via Jenkins is known and was accepted as
> part
> > of
> > > adopting Jenkins. The solution for the initial issue - no longer
> > publishing
> > > to pypi - is to add a step to the existing CD pipeline which publishes
> > the
> > > package to the s3 bucket instead of pypi.
> > >
> > > -Marco
> > >
> > > Pedro Larroy  schrieb am Mi., 8. Jan.
> > 2020,
> > > 21:55:
> > >
> > > > I understand your point. But you don't provide an alternative, and
> > > building
> > > > binary releases from the CI jenkins as it is today is a very bad idea
> > > since
> > > > it's an unsafe environment. I think it's fair to ask if you are
> vetoing
> > > > using codebuil

Re: CD with windows need a special jenkins slave machine like restricted-utility

2020-01-09 Thread Pedro Larroy
Is there a solution for this error in VS2017?

c:\users\administrator\mxnet\src\operator\mxnet_op.h(943) : fatal error
C1002: compiler is out of heap space in pass 2



On Tue, Jan 7, 2020 at 5:11 PM shiwen hu  wrote:

> >
> > I personally encountered the problem that 2015 can't compile in high
> > version cuda. But I can't remember the details. We can continue to use
> 2015
> > until we encounter problems.
> >
>


Re: CD with windows need a special jenkins slave machine like restricted-utility

2020-01-13 Thread Pedro Larroy
Isn't this something that gets selected through vcvars?

On Fri, Jan 10, 2020 at 6:46 PM shiwen hu  wrote:

> use x64 host msvc. cmake -T host=x64
>
> Pedro Larroy  于2020年1月10日周五 上午7:28写道:
>
> > Is there a solution for this error in VS2017?
> >
> > c:\users\administrator\mxnet\src\operator\mxnet_op.h(943) : fatal error
> > C1002: compiler is out of heap space in pass 2
> >
> >
> >
> > On Tue, Jan 7, 2020 at 5:11 PM shiwen hu  wrote:
> >
> > > >
> > > > I personally encountered the problem that 2015 can't compile in high
> > > > version cuda. But I can't remember the details. We can continue to
> use
> > > 2015
> > > > until we encounter problems.
> > > >
> > >
> >
>


Re: CD with windows need a special jenkins slave machine like restricted-utility

2020-01-13 Thread Pedro Larroy
Thanks, it's working after updating to a 64 bit compiler.
https://github.com/apache/incubator-mxnet/pull/17206

On Mon, Jan 13, 2020 at 4:55 PM Pedro Larroy 
wrote:

> Isn't this something that gets selected through vcvars?
>
> On Fri, Jan 10, 2020 at 6:46 PM shiwen hu  wrote:
>
>> use x64 host msvc. cmake -T host=x64
>>
>> Pedro Larroy  于2020年1月10日周五 上午7:28写道:
>>
>> > Is there a solution for this error in VS2017?
>> >
>> > c:\users\administrator\mxnet\src\operator\mxnet_op.h(943) : fatal error
>> > C1002: compiler is out of heap space in pass 2
>> >
>> >
>> >
>> > On Tue, Jan 7, 2020 at 5:11 PM shiwen hu  wrote:
>> >
>> > > >
>> > > > I personally encountered the problem that 2015 can't compile in high
>> > > > version cuda. But I can't remember the details. We can continue to
>> use
>> > > 2015
>> > > > until we encounter problems.
>> > > >
>> > >
>> >
>>
>


Re: Stop redistributing source code of 3rdparty dependencies to avoid licensing issues

2020-01-19 Thread Pedro Larroy
-1

I think it is brittle for a source release to require network connectivity in
order to build. The network is always in flux. Source archives that need to
download too many dependencies to build will end up breaking over time. I
would expect the source to build with a reasonable set of well-known system
dependencies.


On Friday, January 17, 2020, Marco de Abreu  wrote:
> I agree with Tianqi. We may change our build system, but this won't free
us
> from the necessity to validate the licenses of our dependencies.
>
> The question at this point is whether we are allowed to differentiate
> between our main-source and hold it to the strict standards while treating
> the third party folder as dependency, where we only have to verify that
the
> projects are licensed with an Apache compatible license.
>
> At the moment, the project already treats them different: our license
> checks exclude third party. I think this is where the disparity is coming
> from. I'd recommend we discuss with Apache how we can handle this
> situation: package third party code for user convenience while limiting
> responsibility.
>
> In the end, we still have to ensure that everything is licensed properly,
> so maybe we should try to align both processes to match the real world
> instead of changing the real world to match the process.
>
> -Marco
>
> Tianqi Chen  schrieb am Fr., 17. Jan. 2020,
20:44:
>
>> I don't have an opinion, but would like to list pros and cons of doing
so.
>>
>> The pro of doing so is that it indeed simplifies the release process, as
>> these additional dependencies becomes category-B level dependencies as in
>> https://www.apache.org/legal/resolved.html
>>
>> The con of doing so is that it brings additional burden to the users of
the
>> software to check the license of these dependencies, in some sense,
>> including these information in the
>> license actually gives an extra level of transparency.
>>
>> The copyright message in some of the dependencies are a bit unfortunate,
>> one potential way to run the check is to write a python script to go
>> through the files and detect the line Copyright and cross match and add
>> them.
>>
>> Note that good models to follow are
>> - hadoop: https://github.com/apache/hadoop/tree/trunk/licenses
>> - flink: https://github.com/apache/flink
>>
>> Each of the repo have a licenses folder that contains licenses, and
things
>> points to them.
>>
>> I am not a lawyer, but the case for ps-lite seems can be resolved as long
>> as we can confirm these files follows Apache-2.0, as
>> https://www.apache.org/licenses/LICENSE-2.0 only requires us to
>> redistribute
>> the license and anything in the NOTICE, but we do not have the obligation
>> to list all the copyright messages in the source content.
>>
>> TQ
>>
>> On Fri, Jan 17, 2020 at 11:10 AM Yuan Tang 
>> wrote:
>>
>> > +1
>> >
>> > On Fri, Jan 17, 2020 at 1:59 PM Chris Olivier 
>> > wrote:
>> >
>> > > +1
>> > >
>> > > On Fri, Jan 17, 2020 at 10:19 AM Lausen, Leonard
>> > > > > >
>> > > wrote:
>> > >
>> > > > Dear MXNet community,
>> > > >
>> > > > as per recent mail on gene...@incubator.apache.org [1] there are a
>> > > number
>> > > > of
>> > > > licensing issues in MXNet 1.6rc1. Based on anecdotal evidence I
>> believe
>> > > > there
>> > > > has been no release so far without any licensing issues, which is a
>> > > > blocker to
>> > > > MXNet graduating from it's incubating status. One contributing
factor
>> > is
>> > > > that we
>> > > > bundle 3rdparty source code in our releases [2].
>> > > >
>> > > > One key factor is that 3rdparty projects don't always enforce
>> licensing
>> > > > best
>> > > > practice in the way we do. For example, 3rdparty/ps-lite doesn't
>> > enforce
>> > > > license
>> > > > headers in the source files and there has been confusion about the
>> > > license
>> > > > of
>> > > > recent contributions by ByteDance (See [1]).
>> > > >
>> > > > To avoid such licensing issues in MXNet releases a simple solution
is
>> > to
>> > > > stop
>> > > > distributing the 3rdparty code in our source releases. Instead, we
>> can
>> > > > adapt our
>> > > > buildsystem to download 3rdparty code as part of the build
>> > configuration
>> > > > process. CMake makes this very easy with the FetchContent module
[3].
>> > > >
>> > > > For development purpose involving changes to the 3rdparty source or
>> > build
>> > > > systems that can't access the internet, there are easy means for
>> > > > specifying the
>> > > > location of local sources (instead of downloading), via the
>> > > > FETCHCONTENT_SOURCE_DIR_ variable [4].
>> > > >
>> > > > Would there be any concerns about such approach? Obviously it can
>> only
>> > be
>> > > > fully
>> > > > implemented as soon as the CMake build system is feature complete
and
>> > the
>> > > > Makefile build can be dropped. (Note that the Makefile build is
being
>> > > > deprecated
>> > > > and removed as part of MXNet 2 roadmap [5])
>> > > >
>> > > > Best regards
>> > > > Leonard
>> > > >
>>

Re: MXNet 1.6 as last release with Python 2 support?

2020-01-23 Thread Pedro Larroy
This is not a good user experience. I have heard of impacts to some users /
projects.

Thanks.

On Tue, Jan 21, 2020 at 10:44 PM Skalicky, Sam 
wrote:

> Also, it has been reported that pip wheel installation with latest pip
> version 20.0.1 breaks installation of MXNet pip wheels which have py2.py3
> in the wheel name. This breaks all existing released versions. Work around
> is to install the older version of pip "pip install pip==19.3.1”.
>
> Sam
>
> > On Jan 21, 2020, at 4:35 PM, Chung, Alex 
> wrote:
> >
> > +1
> >
> > Sincerely,
> >
> > Alex Chung
> > Senior Product Manager | AWS AI
> >
> > 
> > From: shiwen hu 
> > Sent: Tuesday, January 21, 2020 4:26 PM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: MXNet 1.6 as last release with Python 2 support?
> >
> > +1
> >
> > Lai Wei  于2020年1月18日周六 上午2:51写道:
> >
> >> +1
> >>
> >>
> >> Best Regards
> >>
> >> Lai
> >>
> >>
> >> On Fri, Jan 17, 2020 at 10:39 AM Lin Yuan  wrote:
> >>
> >>> +1
> >>>
> >>> On Fri, Jan 17, 2020 at 10:04 AM Xingjian SHI 
> >>> wrote:
> >>>
>  +1. We should move to support Python>=3.5 only.
> 
>  Get Outlook for iOS
>  
>  From: Lausen, Leonard 
>  Sent: Friday, January 17, 2020 10:02:30 AM
>  To: d...@mxnet.apache.org 
>  Subject: Re: MXNet 1.6 as last release with Python 2 support?
> 
>  If the lazy consensus passes, I believe the minimum Python version
>  supported
>  would be Python 3.5.
> 
>  Python 3.5 because it seems to be the minimum Python 3 version tested
> >> by
>  our CI,
>  specifically in the jobs running on Ubuntu 16.04.
> 
>  Best regards
>  Leonard
> 
>  On Fri, 2020-01-17 at 17:36 +, Lausen, Leonard wrote:
> > Dear MXNet community,
> >
> > as effective January 1, 2020, no new bug reports, fixes, or changes
> >>> will
>  be
> > made
> > to Python 2, and as MXNet 1.6 will be released after January 1,
> >> 2020, I
> > suggest
> > to announce in the MXNet 1.6 release notes that MXNet 1.6 is the last
>  release
> > supporting Python 2.
> >
> > We have previously reached consensus on announcing that Python 2 is
>  dropped in
> > the next major release (ie. MXNet 2), however, given the delay in 1.6
>  release,
> > the plan to release 1.7 in the future and that Python 2 is dead
> >>> already I
> > think
> > we can revisit this assumption.
> >
> > Advantages are
> > - Time savings for developers, as Python 3 standard library contains
> >>> more
> >  features than Python 2, and it is more efficient to target only 1
>  language
> >  (Python 3) instead of 2 languages (Python 2 & 3)
> > - Simplification and cost savings for CI
> >
> > I thus suggest 72h lazy consensus for announcing dropping of Python 2
> >>> as
> > described above. If you disagree, please veto (send "-1") and we can
>  continue
> > supporting Python 2 in all 1.x releases as per previous consensus.
> >> Note
>  that
> > at
> > the time of previous consensus, no 1.7 release was planned.
> >
> > Best regards
> > Leonard
> 
> >>>
> >>
>
>


[ANNOUNCE] Python2 is no longer supported after MXNet 1.6 release

2020-02-03 Thread Pedro Larroy
Hi all

As per the merge of https://github.com/apache/incubator-mxnet/pull/15990, and
as agreed with the community, we will no longer support Python 2 in upcoming
releases of MXNet.

Special thanks to Leonard for facilitating this.

Pedro.


Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc2

2020-02-03 Thread Pedro Larroy
-1

Unit tests passed in CPU build.

I observe crashes related to openmp using cpp unit tests:

https://github.com/apache/incubator-mxnet/issues/17043

Pedro.

On Mon, Feb 3, 2020 at 6:44 PM Chaitanya Bapat  wrote:

> +1
> Successfully built MXNet 1.6.0rc2 on Linux
> Tested for OpPerf utility
> For CPU -
> https://gist.github.com/ChaiBapchya/d5ecc3e971c5a3c558d672477b4b6b9c
>
> Works well!
>
>
>
> On Mon, 3 Feb 2020 at 15:43, Lin Yuan  wrote:
>
> > +1
> >
> > Tested Horovod with mnist example. My compiler flags are below:
> >
> > [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2,
> ✔
> > CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖
> CPU_AVX2, ✔
> > OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖
> BLAS_MKL, ✖
> > BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔
> > DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖ DEBUG, ✖
> > TVM_OP]
> >
> > Lin
> >
> > On Sat, Feb 1, 2020 at 9:55 PM Tao Lv  wrote:
> >
> > > +1
> > >
> > > I tested below items:
> > > 1. download artifacts from Apache dist repo;
> > > 2. the signature looks good;
> > > 3. build from source code with MKL-DNN and MKL on centos;
> > > 4. run fp32 and int8 inference of ResNet50 under
> /example/quantization/.
> > >
> > > thanks,
> > > -tao
> > >
> > > On Sun, Feb 2, 2020 at 11:00 AM Tao Lv  wrote:
> > >
> > > > I see. I was looking at this page:
> > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > >
> > > > On Sun, Feb 2, 2020 at 4:54 AM Przemysław Trędak  >
> > > > wrote:
> > > >
> > > >> Hi Tao,
> > > >>
> > > >> Could you tell me where did you look for it and did not find it? I
> > just
> > > >> checked and both
> > > >> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
> and
> > > >> draft of the release on GitHub have them.
> > > >>
> > > >> Thank you
> > > >> Przemek
> > > >>
> > > >> On 2020/02/01 14:23:11, Tao Lv  wrote:
> > > >> > It seems the src tar and signature are missing from the tag.
> > > >> >
> > > >> > On Fri, Jan 31, 2020 at 11:09 AM Przemysław Trędak <
> > > ptre...@apache.org>
> > > >> > wrote:
> > > >> >
> > > >> > > Dear MXNet community,
> > > >> > >
> > > >> > > This is the vote to release Apache MXNet (incubating) version
> > 1.6.0.
> > > >> > > Voting starts today and will close on Monday 2/3/2020 23:59 PST.
> > > >> > >
> > > >> > > Link to release notes:
> > > >> > >
> > > https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes
> > > >> > >
> > > >> > > Link to release candidate:
> > > >> > >
> https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > >> > >
> > > >> > > Link to source and signatures on apache dist server:
> > > >> > >
> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
> > > >> > >
> > > >> > > The differences comparing to previous release candidate
> 1.6.0.rc1:
> > > >> > >  * Fixes for license issues (#17361, #17375, #17370, #17460)
> > > >> > >  * Bugfix for saving LSTM layer parameter (#17288)
> > > >> > >  * Bugfix for downloading the model from model zoo from multiple
> > > >> processes
> > > >> > > (#17372)
> > > >> > >  * Fixed a symbol.py in AMP for GluonNLP (#17408)
> > > >> > >
> > > >> > >
> > > >> > > Please remember to TEST first before voting accordingly:
> > > >> > > +1 = approve
> > > >> > > +0 = no opinion
> > > >> > > -1 = disapprove (provide reason)
> > > >> > >
> > > >> > >
> > > >> > > Best regards,
> > > >> > > Przemyslaw Tredak
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
>
> --
> *Chaitanya Prakash Bapat*
> *+1 (973) 953-6299*
>
> 
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc2

2020-02-04 Thread Pedro Larroy
Right. Would it be possible to have the CMake build also use libgomp for
consistency with the releases until these issues are resolved?
This can affect anyone compiling the distribution with CMake and also
happens randomly in CI, worsening the contributor experience due to CI
failures.

On Tue, Feb 4, 2020 at 9:33 AM Przemysław Trędak  wrote:

> Hi Pedro,
>
> From the issue that you linked it seems that you are using the LLVM
> OpenMP, whereas I believe the actual release uses libgomp (at least that's
> what seems to be the conclusion from this issue:
> https://github.com/apache/incubator-mxnet/issues/16891)?
>
> Przemek
>
> On 2020/02/04 03:42:30, Pedro Larroy 
> wrote:
> > -1
> >
> > Unit tests passed in CPU build.
> >
> > I observe crashes related to openmp using cpp unit tests:
> >
> > https://github.com/apache/incubator-mxnet/issues/17043
> >
> > Pedro.
> >
> > On Mon, Feb 3, 2020 at 6:44 PM Chaitanya Bapat 
> wrote:
> >
> > > +1
> > > Successfully built MXNet 1.6.0rc2 on Linux
> > > Tested for OpPerf utility
> > > For CPU -
> > > https://gist.github.com/ChaiBapchya/d5ecc3e971c5a3c558d672477b4b6b9c
> > >
> > > Works well!
> > >
> > >
> > >
> > > On Mon, 3 Feb 2020 at 15:43, Lin Yuan  wrote:
> > >
> > > > +1
> > > >
> > > > Tested Horovod with mnist example. My compiler flags are below:
> > > >
> > > > [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔
> CPU_SSE2,
> > > ✔
> > > > CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖
> > > CPU_AVX2, ✔
> > > > OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖
> > > BLAS_MKL, ✖
> > > > BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔
> > > > DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖
> DEBUG, ✖
> > > > TVM_OP]
> > > >
> > > > Lin
> > > >
> > > > On Sat, Feb 1, 2020 at 9:55 PM Tao Lv  wrote:
> > > >
> > > > > +1
> > > > >
> > > > > I tested below items:
> > > > > 1. download artifacts from Apache dist repo;
> > > > > 2. the signature looks good;
> > > > > 3. build from source code with MKL-DNN and MKL on centos;
> > > > > 4. run fp32 and int8 inference of ResNet50 under
> > > /example/quantization/.
> > > > >
> > > > > thanks,
> > > > > -tao
> > > > >
> > > > > On Sun, Feb 2, 2020 at 11:00 AM Tao Lv  wrote:
> > > > >
> > > > > > I see. I was looking at this page:
> > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > > > >
> > > > > > On Sun, Feb 2, 2020 at 4:54 AM Przemysław Trędak <
> ptre...@apache.org
> > > >
> > > > > > wrote:
> > > > > >
> > > > > >> Hi Tao,
> > > > > >>
> > > > > >> Could you tell me where did you look for it and did not find
> it? I
> > > > just
> > > > > >> checked and both
> > > > > >>
> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
> > > and
> > > > > >> draft of the release on GitHub have them.
> > > > > >>
> > > > > >> Thank you
> > > > > >> Przemek
> > > > > >>
> > > > > >> On 2020/02/01 14:23:11, Tao Lv  wrote:
> > > > > >> > It seems the src tar and signature are missing from the tag.
> > > > > >> >
> > > > > >> > On Fri, Jan 31, 2020 at 11:09 AM Przemysław Trędak <
> > > > > ptre...@apache.org>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Dear MXNet community,
> > > > > >> > >
> > > > > >> > > This is the vote to release Apache MXNet (incubating)
> version
> > > > 1.6.0.
> > > > > >> > > Voting starts today and will close on Monday 2/3/2020 23:59
> PST.
> > > > > >> > >
> > > > > >> > > Link to release notes:
> > > > > >> > >
> > > > >
> https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes
> >

Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc2

2020-02-04 Thread Pedro Larroy
@Chris: If you actually go and read the issue that I linked above, you can
see that I was using gdb. Maybe you can have a look into the issue if you
have an idea for a fix. The backtrace points to a segfault in the OpenMP
library. While the root cause could be somewhere else that triggers undefined
behaviour, taking into consideration that this is not happening with libgomp,
and that other engineers believe mixing OpenMP implementations at runtime can
cause UB, it's reasonable to believe there's a good chance it is related to
this. I personally don't have time to investigate this further, as I don't
think introducing this dependency is worth the trouble it is causing, when
the one provided by the platform works well enough.

0x743b284a in __kmp_fork_call () from
/home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so
(gdb) bt


@Lin: I personally wouldn't be comfortable releasing a version that
segfaults; I don't think that meets the quality bar. But this is up to the
community to decide, and I'm only reporting what I observe.

Releasing with indications of this kind of problem causes issues later in
downstream projects and running services.
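
For anyone who wants to check the mixed-runtime hypothesis on their own
build, a minimal sketch (assuming Linux and an importable mxnet build) is to
list which OpenMP runtimes end up mapped into the process:

    # Sketch: list OpenMP runtimes loaded after importing mxnet. Seeing both
    # libomp (LLVM) and libgomp (GNU) at once supports the mixed-runtime theory.
    import mxnet  # noqa: F401  -- importing pulls in libmxnet and its dependencies

    loaded = set()
    with open('/proc/self/maps') as maps:
        for line in maps:
            fields = line.split()
            path = fields[-1] if len(fields) > 5 else ''
            if any(name in path for name in ('libomp', 'libgomp', 'libiomp')):
                loaded.add(path)

    print('OpenMP runtimes loaded:', sorted(loaded) or 'none found')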

On Tue, Feb 4, 2020 at 11:07 AM Chris Olivier  wrote:

> When "fixing", please "fix" through actual root-cause analysis (use gdb,
> for instance) and not simply by guesswork and cutting out things which
> probably aren't actually at fault (blaming an OMP library that's in
> worldwide distribution int he billions should be treated with great
> skepticism).
>
> On Tue, Feb 4, 2020 at 10:44 AM Lin Yuan  wrote:
>
> > Pedro,
> >
> > While I agree with you we need to fix this usability issue, I don't think
> > this is a release blocker as Przemek mentioned above. Could we fix this
> in
> > the next minor release?
> >
> > Thanks,
> >
> > Lin
> >
> > On Tue, Feb 4, 2020 at 10:38 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > Right. Would it be possible to have the CMake build also use libgomp
> for
> > > consistency with the releases until these issues are resolved?
> > > This can affect anyone compiling the distribution with CMake and also
> > > happens randomly in CI, worsening the contributor experience due to CI
> > > failures.
> > >
> > > On Tue, Feb 4, 2020 at 9:33 AM Przemysław Trędak 
> > > wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > From the issue that you linked it seems that you are using the LLVM
> > > > OpenMP, whereas I believe the actual release uses libgomp (at least
> > > that's
> > > > what seems to be the conclusion from this issue:
> > > > https://github.com/apache/incubator-mxnet/issues/16891)?
> > > >
> > > > Przemek
> > > >
> > > > On 2020/02/04 03:42:30, Pedro Larroy 
> > > > wrote:
> > > > > -1
> > > > >
> > > > > Unit tests passed in CPU build.
> > > > >
> > > > > I observe crashes related to openmp using cpp unit tests:
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/issues/17043
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Mon, Feb 3, 2020 at 6:44 PM Chaitanya Bapat <
> chai.ba...@gmail.com
> > >
> > > > wrote:
> > > > >
> > > > > > +1
> > > > > > Successfully built MXNet 1.6.0rc2 on Linux
> > > > > > Tested for OpPerf utility
> > > > > > For CPU -
> > > > > >
> > https://gist.github.com/ChaiBapchya/d5ecc3e971c5a3c558d672477b4b6b9c
> > > > > >
> > > > > > Works well!
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, 3 Feb 2020 at 15:43, Lin Yuan 
> wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > Tested Horovod with mnist example. My compiler flags are below:
> > > > > > >
> > > > > > > [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔
> > > > CPU_SSE2,
> > > > > > ✔
> > > > > > > CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖
> > > > > > CPU_AVX2, ✔
> > > > > > > OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖
> > > > > > BLAS_MKL, ✖
> > > > >

Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc2

2020-02-04 Thread Pedro Larroy
Hi Przemek

I'm fine if we add it to the release notes and try to fix it for the next
release. Changing my vote to +1

Pedro.

On Mon, Feb 3, 2020 at 7:42 PM Pedro Larroy 
wrote:

>
> -1
>
> Unit tests passed in CPU build.
>
> I observe crashes related to openmp using cpp unit tests:
>
> https://github.com/apache/incubator-mxnet/issues/17043
>
> Pedro.
>
> On Mon, Feb 3, 2020 at 6:44 PM Chaitanya Bapat 
> wrote:
>
>> +1
>> Successfully built MXNet 1.6.0rc2 on Linux
>> Tested for OpPerf utility
>> For CPU -
>> https://gist.github.com/ChaiBapchya/d5ecc3e971c5a3c558d672477b4b6b9c
>>
>> Works well!
>>
>>
>>
>> On Mon, 3 Feb 2020 at 15:43, Lin Yuan  wrote:
>>
>> > +1
>> >
>> > Tested Horovod with mnist example. My compiler flags are below:
>> >
>> > [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔
>> CPU_SSE2, ✔
>> > CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖
>> CPU_AVX2, ✔
>> > OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖
>> BLAS_MKL, ✖
>> > BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔
>> > DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖ DEBUG, ✖
>> > TVM_OP]
>> >
>> > Lin
>> >
>> > On Sat, Feb 1, 2020 at 9:55 PM Tao Lv  wrote:
>> >
>> > > +1
>> > >
>> > > I tested below items:
>> > > 1. download artifacts from Apache dist repo;
>> > > 2. the signature looks good;
>> > > 3. build from source code with MKL-DNN and MKL on centos;
>> > > 4. run fp32 and int8 inference of ResNet50 under
>> /example/quantization/.
>> > >
>> > > thanks,
>> > > -tao
>> > >
>> > > On Sun, Feb 2, 2020 at 11:00 AM Tao Lv  wrote:
>> > >
>> > > > I see. I was looking at this page:
>> > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
>> > > >
>> > > > On Sun, Feb 2, 2020 at 4:54 AM Przemysław Trędak <
>> ptre...@apache.org>
>> > > > wrote:
>> > > >
>> > > >> Hi Tao,
>> > > >>
>> > > >> Could you tell me where did you look for it and did not find it? I
>> > just
>> > > >> checked and both
>> > > >> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
>> and
>> > > >> draft of the release on GitHub have them.
>> > > >>
>> > > >> Thank you
>> > > >> Przemek
>> > > >>
>> > > >> On 2020/02/01 14:23:11, Tao Lv  wrote:
>> > > >> > It seems the src tar and signature are missing from the tag.
>> > > >> >
>> > > >> > On Fri, Jan 31, 2020 at 11:09 AM Przemysław Trędak <
>> > > ptre...@apache.org>
>> > > >> > wrote:
>> > > >> >
>> > > >> > > Dear MXNet community,
>> > > >> > >
>> > > >> > > This is the vote to release Apache MXNet (incubating) version
>> > 1.6.0.
>> > > >> > > Voting starts today and will close on Monday 2/3/2020 23:59
>> PST.
>> > > >> > >
>> > > >> > > Link to release notes:
>> > > >> > >
>> > > https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes
>> > > >> > >
>> > > >> > > Link to release candidate:
>> > > >> > >
>> https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
>> > > >> > >
>> > > >> > > Link to source and signatures on apache dist server:
>> > > >> > >
>> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
>> > > >> > >
>> > > >> > > The differences comparing to previous release candidate
>> 1.6.0.rc1:
>> > > >> > >  * Fixes for license issues (#17361, #17375, #17370, #17460)
>> > > >> > >  * Bugfix for saving LSTM layer parameter (#17288)
>> > > >> > >  * Bugfix for downloading the model from model zoo from
>> multiple
>> > > >> processes
>> > > >> > > (#17372)
>> > > >> > >  * Fixed a symbol.py in AMP for GluonNLP (#17408)
>> > > >> > >
>> > > >> > >
>> > > >> > > Please remember to TEST first before voting accordingly:
>> > > >> > > +1 = approve
>> > > >> > > +0 = no opinion
>> > > >> > > -1 = disapprove (provide reason)
>> > > >> > >
>> > > >> > >
>> > > >> > > Best regards,
>> > > >> > > Przemyslaw Tredak
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > >
>> >
>>
>>
>> --
>> *Chaitanya Prakash Bapat*
>> *+1 (973) 953-6299*
>>
>>
>


Re: Cuda 10.2 Wheels

2020-02-06 Thread Pedro Larroy
Hi Alfredo.

Isn't "mxnet_cu102mkl-1.6.0
"
what you are looking for? I see it on the second link you posted.

Pedro

On Tue, Feb 4, 2020 at 3:29 PM Alfredo Luque
 wrote:

> Hi folks,
>
> Are there any blockers on releasing CUDA 10.2 compatible wheels? Based on
> this
> readme
> <
> https://github.com/apache/incubator-mxnet/blob/master/tools/pip/doc/CU102_ADDITIONAL.md
> >
> the
> packages should be available on PyPi already but they don’t appear to exist
> yet.
>
> On the other thread, someone posted this static page
>  that has
> nightly builds hosted on S3 but it appears CUDA 10.2 wheels aren’t on
> there.
>
> —
> Alfredo Luque
> Software Engineer
> Machine Learning Infrastructure
> Airbnb
> San Francisco, CA
>


Re: Join request for MXNet Swift support

2020-02-10 Thread Pedro Larroy
Welcome Rahul! Excited to have you join us.

I was wondering how fast and effective it is, and what options exist, to call
from Python into Swift, and from Swift into C, to execute the dataflow graph
or call into operators. There was a thread before about microbenchmarking
calls into the C++ engine from Python using different methods. Not sure if
you have done experiments in that direction.

Pedro.
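
The kind of microbenchmark referred to above can be as simple as the sketch
below (assuming an installed mxnet build); it measures the per-call dispatch
overhead of a trivial operator rather than real compute:

    import timeit
    import mxnet as mx

    x = mx.nd.zeros((1,))

    def tiny_call():
        # A trivial elementwise op; wait_to_read() forces the async engine to
        # finish, so the timing captures the Python -> C API -> engine round trip.
        (x + 1).wait_to_read()

    n = 10000
    total = timeit.timeit(tiny_call, number=n)
    print('average per-call overhead: %.1f us' % (total / n * 1e6))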

On Mon, Feb 10, 2020 at 3:57 AM Tao Lv  wrote:

> Hi Rahul,
>
> Invite is sent to rahulbhal...@protonmail.com. Welcome to the community
> and
> looking forward to your contribution.
>
> -tao
>
> On Mon, Feb 10, 2020 at 1:10 PM Rahul  .invalid>
> wrote:
>
> > Hello,
> >
> > As per the conversation with [Pedro Larroy](https://twitter.com/plarroy)
> > on [Twitter thread](
> https://twitter.com/plarroy/status/1226408543621771264)
> > I would like to join this Slack channel for contributing to MXNet in
> Swift.
> >
> > Regards
> > Rahul Bhalley
> > [ORCID](https://orcid.org/-0002-4574-0390)
>


Re: Cuda 10.2 Wheels

2020-02-17 Thread Pedro Larroy
I would suggest updating the pip page descriptions or the website with a link
to the new distribution channel. Right now it's ungoogleable how to find the
pre-release wheels. Google directs to pip, so it would also be useful to link
to this from the website if possible. If I find it confusing, I can't imagine
how a random user feels.

On Tue, Feb 11, 2020 at 7:25 PM Sheng Zha  wrote:

> Thanks for bringing this up. That table is misleading and is not an
> acceptable solution for a static reference of the latest pre-releases (more
> in [1]). I’m currently working on the replacement that provides similar
> experiences as pytorch nightly builds page.
>
> -sz
>
> [1]
> https://github.com/apache/incubator-mxnet/issues/17537#issuecomment-584683578
>
>
> > On Feb 11, 2020, at 10:06 PM, Lv, Tao A  wrote:
> >
> > Hi Sheng,
> >
> > It seems the top latest build table is not well updated. I see there are
> 2020-2-12 builds for different variants but the latest build are still
> 2020-2-10 - the build date is not reflected in the link but can be got
> through `pip list`.
> >
> > Thanks,
> > -tao
> >
> > -Original Message-
> > From: Sheng Zha 
> > Sent: Tuesday, February 11, 2020 11:37 PM
> > To: d...@mxnet.apache.org
> > Subject: Re: Cuda 10.2 Wheels
> >
> > The static page is now accessible from
> https://repo.mxnet.io/dist/index.html. Note that the previous links may
> have been moved as part of reorganizing the file store namespaces. Please
> refer to the latest page.
> >
> > -sz
> >
> >> On 2020/02/06 23:21:21, Alfredo Luque 
> wrote:
> >> Looks like it updated since I last posted. Thanks!
> >>
> >> On February 6, 2020 at 3:20:34 PM, Pedro Larroy (
> >> pedro.larroy.li...@gmail.com) wrote:
> >>
> >> Hi Alfredo.
> >>
> >> Isn't "mxnet_cu102mkl-1.6.0
> >> <
> >>
> https://repo.mxnet.io/dist/mxnet_cu102mkl-1.6.0-py2.py3-none-manylinux1_x86_64.whl
> >"
> >>
> >> what you are looking for? I see it on the second link you posted.
> >>
> >> Pedro
> >>
> >> On Tue, Feb 4, 2020 at 3:29 PM Alfredo Luque
> >>  wrote:
> >>
> >>> Hi folks,
> >>>
> >>> Are there any blockers on releasing CUDA 10.2 compatible wheels?
> >>> Based on this readme <
> >>>
> >> https://github.com/apache/incubator-mxnet/blob/master/tools/pip/doc/CU
> >> 102_ADDITIONAL.md
> >>>>
> >>> the
> >>> packages should be available on PyPi already but they don’t appear
> >>> to
> >> exist
> >>> yet.
> >>>
> >>> On the other thread, someone posted this static page
> >>> <https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/index.html>
> >>> that
> >> has
> >>> nightly builds hosted on S3 but it appears CUDA 10.2 wheels aren’t
> >>> on there.
> >>>
> >>> —
> >>> Alfredo Luque
> >>> Software Engineer
> >>> Machine Learning Infrastructure
> >>> Airbnb
> >>> San Francisco, CA
> >>>
> >>
> >> —
> >> Alfredo Luque
> >> Software Engineer
> >> Machine Learning Infrastructure
> >> Airbnb
> >> San Francisco, CA
> >>
>


New AMIs for CI

2020-02-18 Thread Pedro Larroy
Hi

Tomorrow I will be updating the CI environment with new AMIs and deploying
updated autoscaling logic with fixes; expect some disruptions in CI runs.

The Linux AMIs will be updated to Ubuntu 18.04 with updated GPU drivers;
this won't affect Linux container builds.

The new Windows AMI comes with a reproducible environment, VS2017, Visual
C++ updated from VC14 to VC15.

CMake 3.16.2, Perl and LLVM which are required for MXNet and TVM. Cuda is
still 9.2, but now it's easier to update as the installation is automated.

Once the environment is updated, my PR needs to be merged to bring Windows
compilation back into working order:

https://github.com/apache/incubator-mxnet/pull/17206

Thanks to Leonard and Joe for helping with various issues.

Pedro.


Re: New AMIs for CI

2020-02-19 Thread Pedro Larroy
I reverted the CI rollout due to the following issues:

https://github.com/apache/incubator-mxnet/issues/17633

https://github.com/apache/incubator-mxnet/issues/17635

I would need help from the community to fix them, as we can't even compile in
debug mode on Windows (see above), and also because of the older CMake used
with VS2017.

For updating to VS2019 we would need to update CUDA.

Pedro.



On Tue, Feb 18, 2020 at 5:31 PM Pedro Larroy 
wrote:

> Hi
>
> Tomorrow I will be updating the CI environment with new AMIs, and
> deploying updated autoscaling logic with fixes, expect some disruptions in
> CI runs.
>
> The Linux AMIs will be updated to Ubuntu 18.04 with updated GPU drivers,
> this won't affect Linux container builds.
>
> The new Windows AMI comes with a reproducible environment, VS2017, Visual
> C++ updated from VC14 to VC15.
>
> CMake 3.16.2, Perl and LLVM which are required for MXNet and TVM. Cuda is
> still 9.2, but now it's easier to update as the installation is automated.
>
>  Once the environment is updated, my PR needs to be merged to bring back
> windows compilation in working order:
>
> https://github.com/apache/incubator-mxnet/pull/17206
>
> Thanks to Leonard and Joe for helping with various issues.
>
> Pedro.
>


Re: New AMIs for CI

2020-02-21 Thread Pedro Larroy
CI is back to normal. We haven't updated Windows AMIs due to issues with
GPU unit tests.

You might need to retrigger your PRs.

Thanks for your patience.

On Wed, Feb 19, 2020 at 5:54 PM Pedro Larroy 
wrote:

> I reverted the CI rollout due to the following issues:
>
> https://github.com/apache/incubator-mxnet/issues/17633
>
> https://github.com/apache/incubator-mxnet/issues/17635
>
> I would need help from the community to fix them as we can't even compile
> in debug mode in windows as the above, and also due to older cmake being
> used in vs2017.
>
> For updating to vs2019 we would need to update cuda.
>
> Pedro.
>
>
>
> On Tue, Feb 18, 2020 at 5:31 PM Pedro Larroy 
> wrote:
>
>> Hi
>>
>> Tomorrow I will be updating the CI environment with new AMIs, and
>> deploying updated autoscaling logic with fixes, expect some disruptions in
>> CI runs.
>>
>> The Linux AMIs will be updated to Ubuntu 18.04 with updated GPU drivers,
>> this won't affect Linux container builds.
>>
>> The new Windows AMI comes with a reproducible environment, VS2017, Visual
>> C++ updated from VC14 to VC15.
>>
>> CMake 3.16.2, Perl and LLVM which are required for MXNet and TVM. Cuda is
>> still 9.2, but now it's easier to update as the installation is automated.
>>
>>  Once the environment is updated, my PR needs to be merged to bring back
>> windows compilation in working order:
>>
>> https://github.com/apache/incubator-mxnet/pull/17206
>>
>> Thanks to Leonard and Joe for helping with various issues.
>>
>> Pedro.
>>
>


Workflow proposal

2020-03-11 Thread Pedro Larroy
Hi

I talked to some people about this and they thought it would be a good idea,
so I'm sharing it here:

I would propose using a staging or "dev" branch against which nightly &
performance tests are run periodically, and which is then merged to master.
The goal of this workflow would be to avoid regressions and nightly failures
creeping into master. PRs would get merged into dev, and dev would be
promoted periodically / nightly into master.

The names can be swapped as well between dev and master, so PRs get merged
into master and the workflow doesn't change, and staging is the branch that
the nightly promotion merges into.

Has this been considered?

Pedro.


Re: Workflow proposal

2020-03-16 Thread Pedro Larroy
The original idea is that the promotion to the other branch is automated by
nightly CI, so it shouldn't have the problems that were mentioned: there
shouldn't be any manual merging on that branch.

On Wed, Mar 11, 2020 at 7:43 PM Chris Olivier  wrote:

> My $0.02
>
> We had this model dual-branch when I was at GE and it was problematic.
> Among other things, the two branches would tend to diverge to a large
> degree and you ended up just cherry-picking in stuff here and there, which
> caused even more problems, as well as the model allows the secondary branch
> to get pretty buggy -- human nature being what it is -- to the point where
> it's difficult to merge it into master without freezing them both and
> stabilizing, merging into master, then stabilizing again (small things
> almost certainly went into master in the meantime -- hotfixes, critical
> features, etc, while everything was on hold stabilizing the secondary
> branch).  It just double the work in the end, is what I experienced.
>
> On Wed, Mar 11, 2020 at 5:47 PM Yuan Tang  wrote:
>
> > Second to not introduce a dev branch. We should try to improve our
> release
> > process instead and avoid another branch that may introduce confusion
> > around the source of truth.
> >
> > On Wed, Mar 11, 2020 at 8:39 PM Tianqi Chen 
> > wrote:
> >
> > > While the idea of staging seems to be reasonable.
> > > Most OSS projects choose not to do so because having a complicated
> > staging
> > > will likely confuse the contributors, and increase the change of
> > > divergence(between dev and master).
> > >
> > > Given that we have a release model, so in some sense the release itself
> > > serves as a staging pt.
> > > A good approach would simply setup the nightly if necessary strive to
> fix
> > > regressions and make sure the formal release addresses the issues.
> > >
> > > TQ
> > >
> > > On Wed, Mar 11, 2020 at 5:32 PM Pedro Larroy <
> > pedro.larroy.li...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > I talk to some people about this and they thought it would be a good
> > > idea,
> > > > so sharing it here:
> > > >
> > > > I would propose to use a staging or "dev" branch into which nightly &
> > > > performance tests are done periodically and then this branch is
> merged
> > to
> > > > master. The goal of this workflow would be to avoid having
> regressions
> > > and
> > > > nightly failures creeping into master. PRs would get merged into dev
> > and
> > > > dev promoted periodically / nightly into master.
> > > >
> > > > The names can be swapped as well, between dev and master, so PRS get
> > > merged
> > > > into master and it doesn't change the workflow, and staging is the
> > branch
> > > > where nightly changes are merged to.
> > > >
> > > > Have this been considered?
> > > >
> > > > Pedro.
> > > >
> > >
> >
> >
> > --
> > Yuan Tang
> > https://terrytangyuan.github.io/about/ <http://twitter.com/TerryTangYuan
> >
> > <https://terrytangyuan.github.io/about/>
> >
>


Re: Workflow proposal

2020-03-17 Thread Pedro Larroy
The idea is that it would be rolled back automatically to the previous
successful nightly. PRs would then be rebased and would address that nightly
test failure. This also ties in with the manual trigger of CI, which could
also be used to retrigger nightly runs or benchmarks.

On Mon, Mar 16, 2020 at 11:53 AM Marco de Abreu 
wrote:

> Considering how unstable our PR as well as our nightly jobs have been so
> far, is that an assumption we can rightfully make? Also, who'd be
> responsible for fixing that branch in case a PR actually breaks a nightly
> test?
>
> -Marco
>
> On Mon, Mar 16, 2020 at 7:41 PM Pedro Larroy  >
> wrote:
>
> > The original idea is that the promotion to the other branch is automated
> by
> > nightly CI, so it shouldn't have those problems that are mentioned, so
> > there shouldn't be any manual merging on that branch.
> >
> > On Wed, Mar 11, 2020 at 7:43 PM Chris Olivier 
> > wrote:
> >
> > > My $0.02
> > >
> > > We had this model dual-branch when I was at GE and it was problematic.
> > > Among other things, the two branches would tend to diverge to a large
> > > degree and you ended up just cherry-picking in stuff here and there,
> > which
> > > caused even more problems, as well as the model allows the secondary
> > branch
> > > to get pretty buggy -- human nature being what it is -- to the point
> > where
> > > it's difficult to merge it into master without freezing them both and
> > > stabilizing, merging into master, then stabilizing again (small things
> > > almost certainly went into master in the meantime -- hotfixes, critical
> > > features, etc, while everything was on hold stabilizing the secondary
> > > branch).  It just double the work in the end, is what I experienced.
> > >
> > > On Wed, Mar 11, 2020 at 5:47 PM Yuan Tang 
> > wrote:
> > >
> > > > Second to not introduce a dev branch. We should try to improve our
> > > release
> > > > process instead and avoid another branch that may introduce confusion
> > > > around the source of truth.
> > > >
> > > > On Wed, Mar 11, 2020 at 8:39 PM Tianqi Chen <
> tqc...@cs.washington.edu>
> > > > wrote:
> > > >
> > > > > While the idea of staging seems to be reasonable.
> > > > > Most OSS projects choose not to do so because having a complicated
> > > > staging
> > > > > will likely confuse the contributors, and increase the change of
> > > > > divergence(between dev and master).
> > > > >
> > > > > Given that we have a release model, so in some sense the release
> > itself
> > > > > serves as a staging pt.
> > > > > A good approach would simply setup the nightly if necessary strive
> to
> > > fix
> > > > > regressions and make sure the formal release addresses the issues.
> > > > >
> > > > > TQ
> > > > >
> > > > > On Wed, Mar 11, 2020 at 5:32 PM Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > I talk to some people about this and they thought it would be a
> > good
> > > > > idea,
> > > > > > so sharing it here:
> > > > > >
> > > > > > I would propose to use a staging or "dev" branch into which
> > nightly &
> > > > > > performance tests are done periodically and then this branch is
> > > merged
> > > > to
> > > > > > master. The goal of this workflow would be to avoid having
> > > regressions
> > > > > and
> > > > > > nightly failures creeping into master. PRs would get merged into
> > dev
> > > > and
> > > > > > dev promoted periodically / nightly into master.
> > > > > >
> > > > > > The names can be swapped as well, between dev and master, so PRS
> > get
> > > > > merged
> > > > > > into master and it doesn't change the workflow, and staging is
> the
> > > > branch
> > > > > > where nightly changes are merged to.
> > > > > >
> > > > > > Have this been considered?
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Yuan Tang
> > > > https://terrytangyuan.github.io/about/ <
> > http://twitter.com/TerryTangYuan
> > > >
> > > > <https://terrytangyuan.github.io/about/>
> > > >
> > >
> >
>


Re: Profiler Broken?

2020-05-28 Thread Pedro Larroy
Yes, the profiler seems to be broken / has some concurrency issues. I have
seen corrupted profile results.
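
For reference, the basic flow that should produce a loadable trace looks
roughly like the sketch below (assuming MXNet >= 1.3; the exact config
arguments may differ slightly between versions). The text summary is a useful
cross-check, since very large JSON traces often fail to render in
chrome://tracing even when they are well-formed:

    import mxnet as mx

    # Configure the output file before starting; aggregate_stats also enables
    # mx.profiler.dumps() for a text summary that avoids chrome://tracing entirely.
    mx.profiler.set_config(profile_all=True,
                           filename='profile_output.json',
                           aggregate_stats=True)

    mx.profiler.set_state('run')
    a = mx.nd.random.uniform(shape=(1024, 1024))
    b = mx.nd.dot(a, a)
    mx.nd.waitall()                 # ensure async work is recorded before stopping
    mx.profiler.set_state('stop')
    mx.profiler.dump()              # flush the JSON trace to disk
    print(mx.profiler.dumps())      # text summary, useful when the JSON is too big to view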

On Thu, May 28, 2020 at 12:30 PM Naveen Swamy  wrote:

> I am attempting to profile one of our models, I used the profiler.state to
> run/stop in code and also used the environment variables to autostart the
> profiler. It creates a 600MB json file, however when I view in chrome
> tracing it comes out to be blank screen (loading seems to be fine, didn't
> get any errors)
>
> Wondering if anyone has recently tried or if aware of profiler being
> broken?
>
> ENVIRON: Ubuntu 18.04
> MXNet : mxnet-cu101mkl
> Deep Learning AMI (Ubuntu 18.04) Version 29.0 (ami-043f9aeaf108ebc37)
>
> Thanks, Naveen
>


Slack channel

2017-06-20 Thread Pedro Larroy
Hi

Please add me to the slack channel.

-- 
Pedro Larroy Tovar


Re: Apache MXNet build failures are mostly valid - verify before merge

2017-09-28 Thread Pedro Larroy
Given the cost of running all the tests for all the build flavors and
architectures, I would propose the following:



   - Have a staging branch where PRs are merged by committers, which runs all
   the integration tests with the appropriate frequency (say nightly).
   - Have Jenkins automatically fast-forward from the staging branch when the
   latest head passes all the tests on all platforms.



With this we would always have a stable master branch which is well tested,
while being able to adjust the tradeoff between correctness and quick
feedback for PRs.

Another improvement would be to split the feedback into stages: one would be
the multi-platform / multi-flavor build, which should take around 20 minutes,
followed by two or more stages of quick tests and extensive tests. And as
explained above, we wouldn't need to run extensive tests on every PR, just
nightly on staging.

What do you think?

Pedro.

On Thu, Sep 28, 2017 at 2:02 PM, Joern Kottmann  wrote:

> At Apache OpenNLP we just established among committers that you check
> that the status indicator is green before you merge,
> and if it wasn't the case then we would ask the committer to take
> responsibility and repair things. Works very well our build is never
> broken.
>
> We also strongly prefer if each PR gets reviewed by another committer.
>
> Overall this works quite well. We don't use any of the protections
> against merging, it is important that you can trust each of the
> committers not to break things, if there are problems it is better to
> resolve them with talking to each other, rather than enforcing green
> status checks.
>
> Jörn
>
> On Thu, Sep 28, 2017 at 8:21 AM, Chris Olivier 
> wrote:
> > +1 on that
> >
> > On Wed, Sep 27, 2017 at 11:15 PM Gautam  wrote:
> >
> >> Hi Chris,
> >>
> >>   Here  is
> >> user
> >> document on semantics of protected branch.
> >> In short when a branch is protected following applies to that branch.
> >>
> >>- Can't be force pushed
> >>- Can't be deleted
> >>- Can't have changes merged into it until required status checks
> >> pass
> >>- Can't have changes merged into it until required reviews are
> approved
> >><
> >> https://help.github.com/articles/approving-a-pull-
> request-with-required-reviews
> >> >
> >>- Can't be edited or have files uploaded to it from the web
> >>- Can't have changes merged into it until changes to files that
> >> have a designated
> >>code owner  have
> >>been approved by that owner
> >>
> >>  I am sure many of us might not want to have all these but we can
> debate on
> >> it. My main motive was to "*Can't have changes merged into it until
> >> required status checks pass*"
> >>
> >>
> >> -Gautam
> >>
> >>
> >>
> >> On Wed, Sep 27, 2017 at 11:09 PM, Chris Olivier 
> >> wrote:
> >>
> >> > What does that mean? "Protected"? Protected from what?
> >> >
> >> > On Wed, Sep 27, 2017 at 11:08 PM Gautam  wrote:
> >> >
> >> > > Hi Chris,
> >> > >
> >> > >I mean make "master branch protected" of  MXNet.
> >> > >
> >> > > -Gautam
> >> > >
> >> > > On Wed, Sep 27, 2017 at 11:04 PM, Chris Olivier <
> cjolivie...@gmail.com
> >> >
> >> > > wrote:
> >> > >
> >> > > > What does this mean? "Mx-net branch protected"?
> >> > > >
> >> > > > On Wed, Sep 27, 2017 at 9:59 PM Tsuyoshi OZAWA <
> >> > ozawa.tsuyo...@gmail.com
> >> > > >
> >> > > > wrote:
> >> > > >
> >> > > > > +1,
> >> > > > >
> >> > > > > While I'm checking the recent build failures, and I think the
> >> > decision
> >> > > > > of making the mx-net branch protected is necessary for stable
> >> > > > > building.
> >> > > > > Thanks Kumar for resuming important discussion.
> >> > > > >
> >> > > > > Best regards
> >> > > > > - Tsuyoshi
> >> > > > >
> >> > > > > On Thu, Sep 28, 2017 at 12:56 PM, Kumar, Gautam <
> ga...@amazon.com>
> >> > > > wrote:
> >> > > > > > Reviving the discussion.
> >> > > > > >
> >> > > > > > At this point of time we have couple of stable builds
> >> > > > > >
> >> > > > > https://builds.apache.org/view/Incubator%20Projects/job/
> >> > > > incubator-mxnet/job/master/448/
> >> > > > > >
> >> > > > > https://builds.apache.org/view/Incubator%20Projects/job/
> >> > > > incubator-mxnet/job/master/449/
> >> > > > > >
> >> > > > > > Should we have a quick discussion or polling on making the
> mx-net
> >> > > > branch
> >> > > > > protected? If you still think we shouldn’t make it protected
> please
> >> > > > provide
> >> > > > > a reason to support your claim.
> >> > > > > >
> >> > > > > > Few of us have concern over Jenkin’s stability. If I look two
> >> weeks
> >> > > > > back, after upgrading Linux slave to g2.8x and new windows AMI,
> we
> >> > have
> >> > > > not
> >> > > > > seen any case where instance died due to high memory usage or
> any
> >> > > process
> >> > > > > got killed due to h

Improving and rationalizing unit tests

2017-10-16 Thread Pedro Larroy
Hi

Some of the unit tests are extremely costly in terms of memory and compute.

As an example, in the gluon tests we load all the datasets:

test_gluon_data.test_datasets

We also run huge networks like ResNets in test_gluon_model_zoo.

This is ridiculously slow, simply impossible on some embedded /
memory-constrained devices, and in any case makes the tests run longer than
needed.

Unit tests should be small, self-contained and, if possible, pure (avoiding
this kind of dataset I/O).

I think it would be better to split them into real unit tests and extended
integration test suites that do the more intensive computation. This would
also help with feedback time for PRs and the CI infrastructure.
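
One way to make that split concrete, as a sketch only (assuming nose, which
the test suite already uses, and its built-in attrib plugin; the
"integration" tag name is made up for the example):

    from nose.plugins.attrib import attr
    import mxnet as mx

    def test_elementwise_add_small():
        # Fast, self-contained unit test: no downloads, runs in milliseconds.
        a = mx.nd.array([1, 2, 3])
        b = mx.nd.array([4, 5, 6])
        assert (a + b).asnumpy().tolist() == [5, 7, 9]

    @attr('integration')
    def test_model_zoo_resnet_forward():
        # Heavy test: downloads pretrained weights and runs a large network;
        # meant for the nightly / extended suite only.
        from mxnet.gluon.model_zoo import vision
        net = vision.resnet18_v1(pretrained=True)
        net(mx.nd.zeros((1, 3, 224, 224))).wait_to_read()

The PR job would then run "nosetests -a '!integration'" while the nightly job
runs everything.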


Thoughts?


Re: Improving and rationalizing unit tests

2017-10-16 Thread Pedro Larroy
That's not true. random() and similar functions are based on a PRNG. It can
be debugged and it's completely deterministic; a good practice is to use a
known seed for this.

More info: https://en.wikipedia.org/wiki/Pseudorandom_number_generator
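
A minimal sketch of that practice (the seed value below is arbitrary; what
matters is that it is fixed, or logged on failure so the run can be replayed):

    import random
    import numpy as np
    import mxnet as mx

    SEED = 1234  # arbitrary but fixed; log it on failure so the run can be reproduced
    random.seed(SEED)
    np.random.seed(SEED)
    mx.random.seed(SEED)

    # Every run with the same seed produces exactly the same "random" inputs,
    # so a failing test can be debugged deterministically.
    data = mx.nd.random.uniform(shape=(3, 4))
    print(data.asnumpy())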

On Mon, Oct 16, 2017 at 5:42 PM, pracheer gupta 
wrote:

> @Chris: Any particular reason for -1? Randomness just prevents in writing
> tests that you can rely on and/or debug later on in case of failure.
>
> On Oct 16, 2017, at 8:28 AM, Chris Olivier  cjolivie...@gmail.com>> wrote:
>
> -1 for "must not use random numbers for input"
>
> On Mon, Oct 16, 2017 at 7:56 AM, Bhavin Thaker  mailto:bhavintha...@gmail.com>>
> wrote:
>
> I agree with Pedro.
>
> Based on various observations on unit test failures, I would like to
> propose a few guidelines to follow for the unit tests. Even though I use
> the word, “must” for my humble opinions below, please feel free to suggest
> alternatives or modifications to these guidelines:
>
> 1) 1a) Each unit test must have a run time budget <= X minutes. Say, X = 2
> minutes max.
> 1b) The total run time budget for all unit tests <= Y minutes. Say, Y = 60
> minutes max.
>
> 2) All Unit tests must have deterministic (not Stochastic) behavior. That
> is, instead of using the random() function to test a range of input values,
> each input test value must be carefully hand-picked to represent the
> commonly used input scenarios. The correct place to stochastically test
> random input values is to have continuously running nightly tests and NOT
> the sanity/smoke/unit tests for each PR.
>
> 3) All Unit tests must be as much self-contained and independent of
> external components as possible. For example, datasets required for the
> unit test must NOT be present on external website which, if unreachable,
> can cause test run failures. Instead, all datasets must be available
> locally.
>
> 4) It is impossible to test everything in unit tests and so only common
> use-cases and code-paths must be tested in unit-tests. Less common
> scenarios like integration with 3rd party products must be tested in
> nightly/weekly tests.
>
> 5) A unit test must NOT be disabled on a failure unless proven to exhibit
> unreliable behavior. The burden-of-proof for a test failure must be on the
> PR submitter and the PR must NOT be merged without a opening a new github
> issue explaining the problem. If the unit test is disabled for some reason,
> then the unit test must NOT be removed from the unit tests list; instead
> the unit test must be modified to add the following lines at the start of
> the test:
>Print(“Unit Test DISABLED; see GitHub issue: ”)
>Exit(0)
>
> Please suggest modifications to the above proposal such that we can make
> the unit tests framework to be the rock-solid foundation for the active
> development of Apache MXNet (Incubating).
>
> Regards,
> Bhavin Thaker.
>
>
> On Mon, Oct 16, 2017 at 5:56 AM Pedro Larroy  <mailto:pedro.larroy.li...@gmail.com>
>
> wrote:
>
> Hi
>
> Some of the unit tests are extremely costly in terms of memory and
> compute.
>
> As an example in the gluon tests we are loading all the datasets.
>
> test_gluon_data.test_datasets
>
> Also running huge networks like resnets in test_gluon_model_zoo.
>
> This is ridiculously slow, and straight impossible on some embedded /
> memory constrained devices, and anyway is making tests run for longer
> than
> needed.
>
> Unit tests should be small, self contained, if possible pure (avoiding
> this
> kind of dataset IO if possible).
>
> I think it would be better to split them in real unit tests and extended
> integration test suites that do more intensive computation. This would
> also
> help with the feedback time with PRs and CI infrastructure.
>
>
> Thoughts?
>
>
>


Re: Improving and rationalizing unit tests

2017-10-16 Thread Pedro Larroy
I think using a properly seeded and initialized (pseudo)random generator is
actually beneficial (and deterministic); handpicked examples are usually too
simplistic and miss corner cases.

Better yet is to use property-based testing, which picks corner cases and
does fuzzing automatically to check with a high degree of confidence that a
testing condition holds.

It would probably be good to use a property-based testing library in addition
to nose to check invariants.

A quick googling yields this one for python for example:
https://hypothesis.readthedocs.io/en/latest/quickstart.html does anyone
have experience or can recommend a nice property based testing library for
python?
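
For illustration only, here is a minimal sketch of what a Hypothesis-based
property test could look like next to our nose tests. The invariant and the
operator under test are made up for the example; this is not an existing
MXNet test, just a sketch of the idea:

# Sketch of a property-based test with Hypothesis (illustrative only).
# Invariant checked: MXNet elementwise addition agrees with the NumPy
# reference for a wide range of generated inputs.
import numpy as np
import mxnet as mx
from hypothesis import given, settings
from hypothesis import strategies as st

# Keep the floats well behaved so shrunk counterexamples stay readable.
floats = st.floats(min_value=-1e3, max_value=1e3,
                   allow_nan=False, allow_infinity=False)

@settings(max_examples=200)
@given(st.lists(floats, min_size=1, max_size=32))
def test_elemwise_add_matches_numpy(values):
    a = np.array(values, dtype=np.float32)
    out = (mx.nd.array(a) + mx.nd.array(a)).asnumpy()
    np.testing.assert_allclose(out, a + a, rtol=1e-5, atol=1e-6)

Hypothesis generates the inputs, looks for corner cases, and shrinks any
failing example to a minimal counterexample, which also helps with
reproducibility.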


Regards

On Mon, Oct 16, 2017 at 4:56 PM, Bhavin Thaker 
wrote:

> I agree with Pedro.
>
> Based on various observations on unit test failures, I would like to
> propose a few guidelines to follow for the unit tests. Even though I use
> the word, “must” for my humble opinions below, please feel free to suggest
> alternatives or modifications to these guidelines:
>
> 1) 1a) Each unit test must have a run time budget <= X minutes. Say, X = 2
> minutes max.
> 1b) The total run time budget for all unit tests <= Y minutes. Say, Y = 60
> minutes max.
>
> 2) All Unit tests must have deterministic (not Stochastic) behavior. That
> is, instead of using the random() function to test a range of input values,
> each input test value must be carefully hand-picked to represent the
> commonly used input scenarios. The correct place to stochastically test
> random input values is to have continuously running nightly tests and NOT
> the sanity/smoke/unit tests for each PR.
>
> 3) All Unit tests must be as much self-contained and independent of
> external components as possible. For example, datasets required for the
> unit test must NOT be present on external website which, if unreachable,
> can cause test run failures. Instead, all datasets must be available
> locally.
>
> 4) It is impossible to test everything in unit tests and so only common
> use-cases and code-paths must be tested in unit-tests. Less common
> scenarios like integration with 3rd party products must be tested in
> nightly/weekly tests.
>
> 5) A unit test must NOT be disabled on a failure unless proven to exhibit
> unreliable behavior. The burden-of-proof for a test failure must be on the
> PR submitter and the PR must NOT be merged without opening a new GitHub
> issue explaining the problem. If the unit test is disabled for some reason,
> then the unit test must NOT be removed from the unit tests list; instead
> the unit test must be modified to add the following lines at the start of
> the test:
> Print(“Unit Test DISABLED; see GitHub issue: ”)
> Exit(0)
>
> Please suggest modifications to the above proposal such that we can make
> the unit tests framework to be the rock-solid foundation for the active
> development of Apache MXNet (Incubating).
>
> Regards,
> Bhavin Thaker.
>
>
> On Mon, Oct 16, 2017 at 5:56 AM Pedro Larroy  >
> wrote:
>
> > Hi
> >
> > Some of the unit tests are extremely costly in terms of memory and
> compute.
> >
> > As an example in the gluon tests we are loading all the datasets.
> >
> > test_gluon_data.test_datasets
> >
> > Also running huge networks like resnets in test_gluon_model_zoo.
> >
> > This is ridiculously slow, and straight impossible on some embedded /
> > memory constrained devices, and anyway is making tests run for longer
> than
> > needed.
> >
> > Unit tests should be small, self contained, if possible pure (avoiding
> this
> > kind of dataset IO if possible).
> >
> > I think it would be better to split them in real unit tests and extended
> > integration test suites that do more intensive computation. This would
> also
> > help with the feedback time with PRs and CI infrastructure.
> >
> >
> > Thoughts?
> >
>


Re: Improving and rationalizing unit tests

2017-10-16 Thread Pedro Larroy
It's always going to be deterministic one way or another unless you draw
randomness from the entropy pool, such as /dev/random. I don't think it's good
practice to skip proper seeding and have values depend on execution order,
parallelism, time, or whatever, but that's just my opinion. I would want the
same values used for all test runs for reproducibility.

I think your argument goes more towards the previously mentioned "property
based testing" approach, which is in the spirit of what you are supporting,
if I'm not mistaken.
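
To make the seeding point concrete, here is a minimal sketch (the decorator
and the environment variable are hypothetical helpers, not existing MXNet
test utilities) of seeding the RNGs per test and logging the seed so any
failing run can be reproduced:

# Sketch only: seed NumPy and MXNet RNGs per test and log the seed, so a
# failure seen on CI can be reproduced locally by re-running with that seed.
# with_reproducible_seed and MXNET_TEST_SEED are hypothetical names.
import functools
import logging
import os
import random

import mxnet as mx
import numpy as np

def with_reproducible_seed(test_fn):
    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        # Honour an explicit seed when reproducing a failure, else pick one.
        seed = int(os.getenv("MXNET_TEST_SEED", random.getrandbits(32)))
        logging.warning("%s: using seed %d", test_fn.__name__, seed)
        np.random.seed(seed)
        mx.random.seed(seed)
        return test_fn(*args, **kwargs)
    return wrapper

@with_reproducible_seed
def test_something_random():
    data = np.random.uniform(size=(3, 4))
    assert mx.nd.array(data).asnumpy().shape == (3, 4)

That way test data can still be drawn randomly, but every run is reproducible
from the logged seed.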

On Mon, Oct 16, 2017 at 6:22 PM, Chris Olivier 
wrote:

> My take on the suggestion of purely deterministic inputs is (including
> deterministic seeding):
>
> "I want the same values to be used for all test runs because it is
> inconvenient when a unit test fails for some edge cases.  I prefer that
> unforseen edge case failures occur in the field and not during testing".
>
> Is this the motivation?  Seems strange to me.
>
>
> On Mon, Oct 16, 2017 at 9:09 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > I think using a properly seeded and initialized (pseudo)random is
> actually
> > beneficial (and deterministic), handpicked examples are usually too
> > simplistic and miss corner cases.
> >
> > Better yet is to use property based testing, which will pick corner cases
> > and do fuzzing automatically to check with high degree of confidence
> that a
> > testing condition holds.
> >
> > Probably it would be good if we use a property based testing library in
> > addition to nose to check invariants.
> >
> > A quick googling yields this one for python for example:
> > https://hypothesis.readthedocs.io/en/latest/quickstart.html does anyone
> > have experience or can recommend a nice property based testing library
> for
> > python?
> >
> >
> > Regards
> >
> > On Mon, Oct 16, 2017 at 4:56 PM, Bhavin Thaker 
> > wrote:
> >
> > > I agree with Pedro.
> > >
> > > Based on various observations on unit test failures, I would like to
> > > propose a few guidelines to follow for the unit tests. Even though I
> use
> > > the word, “must” for my humble opinions below, please feel free to
> > suggest
> > > alternatives or modifications to these guidelines:
> > >
> > > 1) 1a) Each unit test must have a run time budget <= X minutes. Say, X
> =
> > 2
> > > minutes max.
> > > 1b) The total run time budget for all unit tests <= Y minutes. Say, Y =
> > 60
> > > minutes max.
> > >
> > > 2) All Unit tests must have deterministic (not Stochastic) behavior.
> That
> > > is, instead of using the random() function to test a range of input
> > values,
> > > each input test value must be carefully hand-picked to represent the
> > > commonly used input scenarios. The correct place to stochastically test
> > > random input values is to have continuously running nightly tests and
> NOT
> > > the sanity/smoke/unit tests for each PR.
> > >
> > > 3) All Unit tests must be as much self-contained and independent of
> > > external components as possible. For example, datasets required for the
> > > unit test must NOT be present on external website which, if
> unreachable,
> > > can cause test run failures. Instead, all datasets must be available
> > > locally.
> > >
> > > 4) It is impossible to test everything in unit tests and so only common
> > > use-cases and code-paths must be tested in unit-tests. Less common
> > > scenarios like integration with 3rd party products must be tested in
> > > nightly/weekly tests.
> > >
> > > 5) A unit test must NOT be disabled on a failure unless proven to
> exhibit
> > > unreliable behavior. The burden-of-proof for a test failure must be on
> > the
> > > PR submitter and the PR must NOT be merged without opening a new
> github
> > > issue explaining the problem. If the unit test is disabled for some
> > reason,
> > > then the unit test must NOT be removed from the unit tests list;
> instead
> > > the unit test must be modified to add the following lines at the start
> of
> > > the test:
> > > Print(“Unit Test DISABLED; see GitHub issue: ”)
> > > Exit(0)
> > >
> > > Please suggest modifications to the above proposal such that we can
> make
> > > the unit tests framework to be the rock-solid foundation for the active
> > > development of Apache MXNet (Incubating).
> > >
&

Jenkins build is back to normal : mxnet_incubator_master » ubuntu-17.04 #137

2017-10-16 Thread Pedro Larroy
See 




Build failed in Jenkins: mxnet_incubator_master » arm64 #136

2017-10-16 Thread Pedro Larroy
See 


--
Started by upstream project "mxnet_incubator_master" build number 136
originally caused by:
 Started by an SCM change
Building in workspace 

[WS-CLEANUP] Deleting project workspace...
Cloning the remote Git repository
Cloning repository g...@github.com:apache/incubator-mxnet.git
 > git init 
 > 
 >  # timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
 > git --version # timeout=10
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # 
 > timeout=10
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/master:refs/remotes/origin/master
Checking out Revision 7d0204015f548ed146018dc5348e9169451c3c89 (origin/v0.12.0)
org.eclipse.jgit.errors.MissingObjectException: Missing unknown 
7d0204015f548ed146018dc5348e9169451c3c89
at 
org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:158)
at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:227)
at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:859)
at org.eclipse.jgit.revwalk.RevWalk.parseCommit(RevWalk.java:772)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:25)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:13)
at 
org.jenkinsci.plugins.gitclient.AbstractGitAPIImpl.withRepository(AbstractGitAPIImpl.java:29)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.withRepository(CliGitAPIImpl.java:71)
at hudson.plugins.git.GitSCM.printCommitMessageToLog(GitSCM.java:1195)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1159)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1212)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:566)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:491)
at hudson.model.Run.execute(Run.java:1737)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:419)


Build failed in Jenkins: mxnet_incubator_master » ubuntu-16.04-cuda_8.0_cudnn5 #136

2017-10-16 Thread Pedro Larroy
See 


--
Started by upstream project "mxnet_incubator_master" build number 136
originally caused by:
 Started by an SCM change
Building in workspace 

[WS-CLEANUP] Deleting project workspace...
Cloning the remote Git repository
Cloning repository g...@github.com:apache/incubator-mxnet.git
 > git init 
 > 
 >  # timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
 > git --version # timeout=10
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # 
 > timeout=10
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/master:refs/remotes/origin/master
Checking out Revision 7d0204015f548ed146018dc5348e9169451c3c89 (origin/v0.12.0)
org.eclipse.jgit.errors.MissingObjectException: Missing unknown 
7d0204015f548ed146018dc5348e9169451c3c89
at 
org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:158)
at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:227)
at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:859)
at org.eclipse.jgit.revwalk.RevWalk.parseCommit(RevWalk.java:772)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:25)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:13)
at 
org.jenkinsci.plugins.gitclient.AbstractGitAPIImpl.withRepository(AbstractGitAPIImpl.java:29)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.withRepository(CliGitAPIImpl.java:71)
at hudson.plugins.git.GitSCM.printCommitMessageToLog(GitSCM.java:1195)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1159)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1212)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:566)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:491)
at hudson.model.Run.execute(Run.java:1737)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:419)


Jenkins build is back to normal : mxnet_incubator_master » arm64 #137

2017-10-16 Thread Pedro Larroy
See 




Build failed in Jenkins: mxnet_incubator_master » armv7 #136

2017-10-16 Thread Pedro Larroy
See 


--
Started by upstream project "mxnet_incubator_master" build number 136
originally caused by:
 Started by an SCM change
Building in workspace 

[WS-CLEANUP] Deleting project workspace...
Cloning the remote Git repository
Cloning repository g...@github.com:apache/incubator-mxnet.git
 > git init 
 > 
 >  # timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
 > git --version # timeout=10
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # 
 > timeout=10
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/master:refs/remotes/origin/master
Checking out Revision 7d0204015f548ed146018dc5348e9169451c3c89 (origin/v0.12.0)
org.eclipse.jgit.errors.MissingObjectException: Missing unknown 
7d0204015f548ed146018dc5348e9169451c3c89
at 
org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:158)
at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:227)
at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:859)
at org.eclipse.jgit.revwalk.RevWalk.parseCommit(RevWalk.java:772)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:25)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:13)
at 
org.jenkinsci.plugins.gitclient.AbstractGitAPIImpl.withRepository(AbstractGitAPIImpl.java:29)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.withRepository(CliGitAPIImpl.java:71)
at hudson.plugins.git.GitSCM.printCommitMessageToLog(GitSCM.java:1195)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1159)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1212)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:566)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:491)
at hudson.model.Run.execute(Run.java:1737)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:419)


Jenkins build is back to normal : mxnet_incubator_master » android.armv7 #137

2017-10-16 Thread Pedro Larroy
See 




Jenkins build is back to normal : mxnet_incubator_master » ubuntu-16.04-cuda_8.0_cudnn5 #137

2017-10-16 Thread Pedro Larroy
See 




Build failed in Jenkins: mxnet_incubator_master » cmake.ubuntu-17.04 #136

2017-10-16 Thread Pedro Larroy
See 


--
Started by upstream project "mxnet_incubator_master" build number 136
originally caused by:
 Started by an SCM change
Building in workspace 

[WS-CLEANUP] Deleting project workspace...
Cloning the remote Git repository
Cloning repository g...@github.com:apache/incubator-mxnet.git
 > git init 
 > 
 >  # timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
 > git --version # timeout=10
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # 
 > timeout=10
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/master:refs/remotes/origin/master
Checking out Revision 7d0204015f548ed146018dc5348e9169451c3c89 (origin/v0.12.0)
org.eclipse.jgit.errors.MissingObjectException: Missing unknown 
7d0204015f548ed146018dc5348e9169451c3c89
at 
org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:158)
at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:227)
at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:859)
at org.eclipse.jgit.revwalk.RevWalk.parseCommit(RevWalk.java:772)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:25)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:13)
at 
org.jenkinsci.plugins.gitclient.AbstractGitAPIImpl.withRepository(AbstractGitAPIImpl.java:29)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.withRepository(CliGitAPIImpl.java:71)
at hudson.plugins.git.GitSCM.printCommitMessageToLog(GitSCM.java:1195)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1159)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1212)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:566)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:491)
at hudson.model.Run.execute(Run.java:1737)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:419)


Build failed in Jenkins: mxnet_incubator_master » android.armv7 #136

2017-10-16 Thread Pedro Larroy
See 


--
Started by upstream project "mxnet_incubator_master" build number 136
originally caused by:
 Started by an SCM change
Building in workspace 

[WS-CLEANUP] Deleting project workspace...
Cloning the remote Git repository
Cloning repository g...@github.com:apache/incubator-mxnet.git
 > git init 
 > 
 >  # timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
 > git --version # timeout=10
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # 
 > timeout=10
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/master:refs/remotes/origin/master
Checking out Revision 7d0204015f548ed146018dc5348e9169451c3c89 (origin/v0.12.0)
org.eclipse.jgit.errors.MissingObjectException: Missing unknown 
7d0204015f548ed146018dc5348e9169451c3c89
at 
org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:158)
at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:227)
at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:859)
at org.eclipse.jgit.revwalk.RevWalk.parseCommit(RevWalk.java:772)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:25)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:13)
at 
org.jenkinsci.plugins.gitclient.AbstractGitAPIImpl.withRepository(AbstractGitAPIImpl.java:29)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.withRepository(CliGitAPIImpl.java:71)
at hudson.plugins.git.GitSCM.printCommitMessageToLog(GitSCM.java:1195)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1159)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1212)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:566)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:491)
at hudson.model.Run.execute(Run.java:1737)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:419)


Jenkins build is back to normal : mxnet_incubator_master » armv7 #137

2017-10-16 Thread Pedro Larroy
See 




Jenkins build is back to normal : mxnet_incubator_master » armv6 #137

2017-10-16 Thread Pedro Larroy
See 




Jenkins build is back to normal : mxnet_incubator_master » cmake.ubuntu-17.04 #137

2017-10-16 Thread Pedro Larroy
See 




Build failed in Jenkins: mxnet_incubator_master » armv6 #136

2017-10-16 Thread Pedro Larroy
See 


--
Started by upstream project "mxnet_incubator_master" build number 136
originally caused by:
 Started by an SCM change
Building in workspace 

[WS-CLEANUP] Deleting project workspace...
Cloning the remote Git repository
Cloning repository g...@github.com:apache/incubator-mxnet.git
 > git init 
 > 
 >  # timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
 > git --version # timeout=10
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # 
 > timeout=10
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/master:refs/remotes/origin/master
Checking out Revision 7d0204015f548ed146018dc5348e9169451c3c89 (origin/v0.12.0)
org.eclipse.jgit.errors.MissingObjectException: Missing unknown 
7d0204015f548ed146018dc5348e9169451c3c89
at 
org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:158)
at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:227)
at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:859)
at org.eclipse.jgit.revwalk.RevWalk.parseCommit(RevWalk.java:772)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:25)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:13)
at 
org.jenkinsci.plugins.gitclient.AbstractGitAPIImpl.withRepository(AbstractGitAPIImpl.java:29)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.withRepository(CliGitAPIImpl.java:71)
at hudson.plugins.git.GitSCM.printCommitMessageToLog(GitSCM.java:1195)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1159)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1212)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:566)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:491)
at hudson.model.Run.execute(Run.java:1737)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:419)


Build failed in Jenkins: mxnet_incubator_master » ubuntu-17.04 #136

2017-10-16 Thread Pedro Larroy
See 


--
Started by upstream project "mxnet_incubator_master" build number 136
originally caused by:
 Started by an SCM change
Building in workspace 

[WS-CLEANUP] Deleting project workspace...
Cloning the remote Git repository
Cloning repository g...@github.com:apache/incubator-mxnet.git
 > git init 
 > 
 >  # timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
 > git --version # timeout=10
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # 
 > timeout=10
 > git config remote.origin.url g...@github.com:apache/incubator-mxnet.git # 
 > timeout=10
Fetching upstream changes from g...@github.com:apache/incubator-mxnet.git
using GIT_SSH to set credentials 
 > git fetch --tags --progress g...@github.com:apache/incubator-mxnet.git 
 > +refs/heads/master:refs/remotes/origin/master
Checking out Revision 7d0204015f548ed146018dc5348e9169451c3c89 (origin/v0.12.0)
org.eclipse.jgit.errors.MissingObjectException: Missing unknown 
7d0204015f548ed146018dc5348e9169451c3c89
at 
org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:158)
at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:227)
at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:859)
at org.eclipse.jgit.revwalk.RevWalk.parseCommit(RevWalk.java:772)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:25)
at 
hudson.plugins.git.util.RevCommitRepositoryCallback.invoke(RevCommitRepositoryCallback.java:13)
at 
org.jenkinsci.plugins.gitclient.AbstractGitAPIImpl.withRepository(AbstractGitAPIImpl.java:29)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.withRepository(CliGitAPIImpl.java:71)
at hudson.plugins.git.GitSCM.printCommitMessageToLog(GitSCM.java:1195)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1159)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1212)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:566)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:491)
at hudson.model.Run.execute(Run.java:1737)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:419)


Re: [Proposal] Stabilizing Apache MXNet CI build system

2017-10-23 Thread Pedro Larroy
+1

We (with Kellen and Marco) are already working on a CI system that verifies
MXNet on devices. It is still a work in progress, but at least we are checking
that the build is sane on Android, different ARM flavors, and Ubuntu, and we
are also building PRs. We are still working on getting the unit tests to pass
on some architectures like the Jetson TX2 and ARM / Raspberry Pi.

http://ci.mxnet.amazon-ml.com/

I agree with Steffen on creating a document with requirements and high-level
architecture. I would also like quicker feedback and, as we discussed before,
saner unit tests. I think there's a big and nontrivial amount of effort
required here.

Pedro.

On Mon, Oct 23, 2017 at 6:43 AM, Steffen Rochel 
wrote:

> +1
> I support Option 1 - Set up separate Jenkins CI build system. While the
> Apache service is appropriate for some projects, our experience over the
> last 6 months has not been meeting the needs of the MXNet (incubating)
> project. AWS has been and will continue provide resources for such project.
> Agree we should create a document summarizing the requirements and high
> level architecture, which should answer the question of Jenkins or
> alternative.
>
> Steffen
>
> On Sat, Oct 21, 2017 at 6:51 PM shiwen hu  wrote:
>
> > +1
> >
> >
> > 2017-10-21 9:48 GMT+08:00 Chris Olivier :
> >
> > > Ok, just looking for anything that can cut a task out if possible. I do
> > > support not using Apache Jenkins server anyMore — it’s really not been
> > > working out for various reasons.  But having a person full time is
> > > something that Steffen would have to address, I imagine.
> > >
> > > On Fri, Oct 20, 2017 at 6:03 PM Mu Li  wrote:
> > >
> > > > I didn't see the clear advantage of CodePipline over pure jenkins,
> > > because
> > > > we don't need to deploy here.
> > > >
> > > > On Fri, Oct 20, 2017 at 5:34 PM, Chris Olivier <
> cjolivie...@gmail.com>
> > > > wrote:
> > > >
> > > > > CodePipeline, then.  You can point it to Jenkins instances.
> > > > >
> > > > >
> > > > > On Fri, Oct 20, 2017 at 4:49 PM Mu Li  wrote:
> > > > >
> > > > > > AWS CodeBuild is not an option. It doesn't support GPU instances,
> > mac
> > > > os
> > > > > x,
> > > > > > and windows. Not even mention the edge devices.
> > > > > >
> > > > > > On Fri, Oct 20, 2017 at 4:07 PM, Chris Olivier <
> > > cjolivie...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Why don;t we look into fully managed AWS CodeBuild?  It
> maintains
> > > > > > > everything. It's also compatible with Jenkins.
> > > > > > >
> > > > > > > On Fri, Oct 20, 2017 at 1:51 PM, Tianqi Chen <
> > > > tqc...@cs.washington.edu
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > Tianqi
> > > > > > > > On Fri, Oct 20, 2017 at 1:39 PM Mu Li 
> > > wrote:
> > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > It seems that the Apache CI is quite overloaded these days,
> > and
> > > > > > MXNet's
> > > > > > > > CI
> > > > > > > > > pipeline is too complex to run there. In addition, we may
> > need
> > > to
> > > > > add
> > > > > > > > more
> > > > > > > > > devices, e.g. macpro and rasbperry pi, into the server, and
> > > more
> > > > > > tasks
> > > > > > > > such
> > > > > > > > > as pip build. It means a lot of requests to the Infra team.
> > > > > > > > >
> > > > > > > > > We can reuse our previous Jenkins server at
> > > http://ci.mxnet.io/.
> > > > > But
> > > > > > > we
> > > > > > > > > probably need a dedicate developer to maintain it.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Oct 20, 2017 at 1:01 PM, sandeep krishnamurthy <
> > > > > > > > > sandeep.krishn...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hello all,
> > > > > > > > > >
> > > > > > > > > > I am hereby opening up a discussion thread on how we can
> > > > > stabilize
> > > > > > > > Apache
> > > > > > > > > > MXNet CI build system.
> > > > > > > > > >
> > > > > > > > > > Problems:
> > > > > > > > > >
> > > > > > > > > > 
> > > > > > > > > >
> > > > > > > > > > Recently, we have seen following issues with Apache MXNet
> > CI
> > > > > build
> > > > > > > > > systems:
> > > > > > > > > >
> > > > > > > > > >1. Apache Jenkins master is overloaded and we see
> issues
> > > > like
> > > > > -
> > > > > > > > unable
> > > > > > > > > >to trigger builds, difficult to load and view the blue
> > > ocean
> > > > > and
> > > > > > > > other
> > > > > > > > > >Jenkins build status page.
> > > > > > > > > >2. We are generating too many request/interaction on
> > > Apache
> > > > > > Infra
> > > > > > > > > team.
> > > > > > > > > >   1. Addition/deletion of new slave: Caused from
> > scaling
> > > > > > > activity,
> > > > > > > > > >   recycling, troubleshooting or any actions leading
> to
> > > > change
> > > > > > of
> > > > > > > > > slave
> > > > > > > > > >   machines.
> > > > > > > > > >   2. Plugins / other Jenkins Master configurations.
> 

Fix slicing for 0.12

2017-10-24 Thread Pedro Larroy
Hi

Can we get this PR in for 0.12?

https://github.com/apache/incubator-mxnet/pull/8400

It's a critical fix for undefined behaviour, which shows up especially on ARM
platforms.

--
Pedro.


Re: Fix slicing for 0.12

2017-10-24 Thread Pedro Larroy
We could also get this one in:

https://github.com/apache/incubator-mxnet/issues/8383

We are working on a fix with Kellen.

How much time until the time window closes?

Pedro.

On Tue, Oct 24, 2017 at 4:50 PM, Chris Olivier 
wrote:

> Does anyone else want to make the case that they have a critical fix that
> should go into 0.12.0.rc1?  Hopefully the PR already passed CI or is in
> master already.
>
> On Tue, Oct 24, 2017 at 6:31 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > Hi
> >
> > Can we get this PR in for 0.12?
> >
> > https://github.com/apache/incubator-mxnet/pull/8400
> >
> > It's a critical fix with undefined behaviour, which shows itself
> specially
> > in ARM platforms.
> >
> > --
> > Pedro.
> >
>


array bounds check in nnvm/Tuple

2017-10-25 Thread Pedro Larroy
Hi

I'm adding array bounds check in nnvm/Tuple:

commit 87eee62cce5dda325f8a66447b6746a1cc1ed90b (HEAD -> mxnet)
Author: Pedro Larroy 
Date:   Tue Oct 24 02:59:46 2017 +0200

Check array bounds in tuple in operator[]

diff --git a/include/nnvm/tuple.h b/include/nnvm/tuple.h
index f613858..420f1f7 100644
--- a/include/nnvm/tuple.h
+++ b/include/nnvm/tuple.h
@@ -160,6 +160,7 @@ class Tuple {
    * \return the corresponding dimension size
    */
   inline ValueType& operator[](size_t i) {
+    CHECK_LT(i, ndim_) << "tuple index out of bounds";
     return begin()[i];
   }
   /*!
@@ -168,6 +169,7 @@ class Tuple {
    * \return the corresponding dimension size
    */
   inline const ValueType& operator[](size_t i) const {
+    CHECK_LT(i, ndim_) << "tuple index out of bounds";
     return begin()[i];
   }
   /*!




I am also fixing places where we index one past the end, even where that was
technically harmless, for example changing &tuple[end] to tuple.end(), so that
indexed access is always bounds-checked.

For example:

-  R[0] = mshadow::Shape1(rshape[0]);
-  R[1] = rshape.ndim() > 1 ? TShape(&rshape[1], &rshape[rshape.ndim()]) : TShape(1);
+  R[0] = mshadow::Shape1(*rshape.begin());
+  R[1] = rshape.ndim() > 1 ? TShape(rshape.begin()+1, rshape.end()) : TShape(1);


Are there any concerns with this approach?


Pedro.


Re: [Proposal] Stabilizing Apache MXNet CI build system

2017-10-26 Thread Pedro Larroy
Thanks for your input, everyone; I think we are on a good track to get this
fixed. I'm confident that Meghna and Marco are going to drive this to
success. We are collecting ideas and requirements for the document on how
we will revamp the testing infrastructure. My only question right now is
where to store this document for collaboration. I don't seem to have
permission to edit the Confluence wiki:
https://cwiki.apache.org/confluence/display/MXNET/Continuous+Integration

Should we otherwise use a shared Google Doc or a GitHub wiki?

Please advise.

Pedro.

On Thu, Oct 26, 2017 at 8:14 AM, Meghna Baijal 
wrote:

> Thanks Sandeep for driving this discussion. I am also in contact with Pedro
> and his team to include their requirements.
> And thank you Sebastian, I will let you know!
>
> Meghna
>
> On Wed, Oct 25, 2017 at 11:05 PM, Sebastian 
> wrote:
>
> > @meghana @pedro let me know if you need someone with a mentor hat to open
> > tickets or send mail to infra, happy to help here.
> >
> > Best,
> > Sebastian
> >
> >
> > On 25.10.2017 23:18, sandeep krishnamurthy wrote:
> >
> >> Thank you, everyone, for the discussion, proposal, and the vote.
> >>
> >> Here majority community members see current CI system for Apache MXNet
> is
> >> having issues in scaling and diverse test environments. And the common
> >> suggestion is to have a separate CI setup for Apache MXNet.
> >>
> >> Following are the next steps:
> >>
> >> 1. Meghana proposed she would like to take the lead on this and come up
> >> with an initial tech design write up covering requirements, use-cases,
> >> alternate solutions and a proposed solution on how we could set up the
> CI
> >> system for MXNet.
> >> 2. This tech design will be reviewed in the community and following
> that,
> >> collaborate with Infra team and mentors to complete setup in the
> >> integration of the new system with Repo and Website and more.
> >>
> >> @Pedro Larry - We should sync up on understanding how we can unify the
> set
> >> up you have for various devices and the new set up being proposed and
> >> built. Ideally, we should have a unified CI setup for the project
> >> accessible to the community.
> >>
> >> Regards,
> >> Sandeep
> >>
> >> On Mon, Oct 23, 2017 at 7:29 AM, Pedro Larroy <
> >> pedro.larroy.li...@gmail.com>
> >> wrote:
> >>
> >> +1
> >>>
> >>> We (with Kellen and Marco) are already working on a CI system that
> >>> verifies
> >>> MXNet on devices, so far a work in progress, but at least we are
> checking
> >>> that the build is sane on Android, different arm flavors and ubuntu,
> also
> >>> building PRs. So far we are still working on having the unit tests pass
> >>> on
> >>> some architectures like Jetson TX2 and ARM / Raspberry PI.
> >>>
> >>> http://ci.mxnet.amazon-ml.com/
> >>>
> >>> Agree with Steffen on creating a document with requirements and high
> >>> level
> >>> architecture. Also I would like to have quicker feedback and as we
> >>> discussed before, saner unit tests. I think there's a big and
> nontrivial
> >>> amount of effort required here.
> >>>
> >>> Pedro.
> >>>
> >>> On Mon, Oct 23, 2017 at 6:43 AM, Steffen Rochel <
> steffenroc...@gmail.com
> >>> >
> >>> wrote:
> >>>
> >>> +1
> >>>> I support Option 1 - Set up separate Jenkins CI build system. While
> the
> >>>> Apache service is appropriate for some projects, our experience over
> the
> >>>> last 6 months has not been meeting the needs of the MXNet (incubating)
> >>>> project. AWS has been and will continue provide resources for such
> >>>>
> >>> project.
> >>>
> >>>> Agree we should create a document summarizing the requirements and
> high
> >>>> level architecture, which should answer the question of Jenkins or
> >>>> alternative.
> >>>>
> >>>> Steffen
> >>>>
> >>>> On Sat, Oct 21, 2017 at 6:51 PM shiwen hu 
> >>>> wrote:
> >>>>
> >>>> +1
> >>>>>
> >>>>>
> >>>>> 2017-10-21 9:48 GMT+08:00 Chris Olivier :
> >>>>>
> >>>>> Ok, just looking for anything that can cut a 

Re: [Proposal] Stabilizing Apache MXNet CI build system

2017-10-27 Thread Pedro Larroy
Just to provide a high level overview of the ideas and proposals
coming from different sources for the requirements for testing and
validation of builds:

* Have Terraform files for the testing infrastructure, i.e. infrastructure
as code (IaC), except for the embedded hardware that is neither emulated nor
cloud based ("single command" replication of the testing infrastructure,
no manual steps).

* CI software based on Jenkins, unless someone thinks there's a better
alternative.

* Use autoscaling groups and improve staggered build + test steps to
achieve higher parallelism and shorter feedback times.

* Switch to a branching model based on a stable master plus an integration
branch. PRs are merged into dev/integration, which runs extended nightly
tests; commits are then merged into master, preferably in an automated way
after successful extended testing. Master is always tested and always
buildable. Release branches or tags are cut from master as usual for
releases.

* Target a build + test feedback time of less than 15 minutes (currently a
build on a 16-core machine takes 7m). This involves a lot of test
refactoring: moving expensive tests / big smoke tests to nightlies on the
integration branch, along with tests on IoT devices and power and
performance regressions.

* Add code coverage and other quality metrics.

* Eliminate warnings and treat warnings as errors. We have spent time
tracking down "undefined behaviour" bugs that could have been caught
by compiler warnings.

Is there anything I'm missing, or anything else you would like to add?

Pedro.


Re: [Proposal] Stabilizing Apache MXNet CI build system

2017-11-01 Thread Pedro Larroy
+1  That would be great.

On Mon, Oct 30, 2017 at 5:35 PM, Hen  wrote:
> How about we ask for a new mxnet repo to store all the config in?
>
> On Fri, Oct 27, 2017 at 05:30 Pedro Larroy 
> wrote:
>
>> Just to provide a high level overview of the ideas and proposals
>> coming from different sources for the requirements for testing and
>> validation of builds:
>>
>> * Have terraform files for the testing infrastructure. Infrastructure
>> as code (IaC). Minus not emulated / nor cloud based, embedded
>> hardware. ("single command" replication of the testing infrastructure,
>> no manual steps).
>>
>> * CI software based on Jenkins, unless someone thinks there's a better
>> alternative.
>>
>> * Use autoscaling groups and improve staggered build + test steps to
>> achieve higher parallelism and shorter feedback times.
>>
>> * Switch to a branching model based on stable master + integration
>> branch. PRs are merged into dev/integration which runs extended
>> nightly tests, which are
>> then merged into master, preferably in an automated way after
>> successful extended testing.
>> Master is always tested, and always buildable. Release branches or
>> tags in master as usual for releases.
>>
>> * Build + test feedback time targeting less than 15 minutes.
>> (Currently a build in a 16x core takes 7m). This involves lot of
>> refactoring of tests, move expensive tests / big smoke tests to
>> nightlies on the integration branch, also tests on IoT devices / power
>> and performance regressions...
>>
>> * Add code coverage and other quality metrics.
>>
>> * Eliminate warnings and treat warnings as errors. We have spent time
>> tracking down "undefined behaviour" bugs that could have been caught
>> by compiler warnings.
>>
>> Is there something I'm missing or additional things that come to your
>> mind that you would wish to add?
>>
>> Pedro.
>>


Re: [Proposal] Stabilizing Apache MXNet CI build system

2017-11-01 Thread Pedro Larroy
Hi Bhavin

Good suggestions.

I wanted to respond to your point #5

The promotion from integration to master would be done automatically by
Jenkins once a commit passes the nightly tests, so it should not impose any
additional burden on developers, as there is no manual step or human
gatekeeper involved.

It would be equivalent to your suggestion with tags. You can do the same with
branches; in any case, a git branch is just a pointer to some commit, so I
think we are talking about the same thing.

Pedro.




On Wed, Nov 1, 2017 at 5:41 PM, Bhavin Thaker  wrote:
> Few comments/suggestions:
>
> 1) Can  we have this nice list of todo items on the Apache MXNet wiki page
> to track them better?
>
> 2) Can we have a set of owners for each set of tests and source code
> directory? One of the problems I have observed is that when there is a test
> failure, it is difficult to find an owner who will take the responsibility
> of fixing the test OR identifying the culprit code promptly -- this causes
> the master to continue to fail for many days.
>
> 3) Specifically, we need an owner for the Windows setup -- nobody seems to
> know much about it -- please feel free to correct me if required.
>
> 4) +1 to have a list of all feature requests on Jira or a similar commonly
> and easily accessible system.
>
> 5) -1 to the branching model -- I was the gatekeeper for the branching
> model at Informix for the database kernel code to be merged to master along
> with my day-job of being a database kernel engineer for around 9 months and
> hence have the opinion that a branching model just shifts the burden from
> one place to another. We don't have a dedicated team to do the branching
> model. If we really need a buildable master everyday, then we could just
> tag every successful build as last_clean_build on master -- use this tag to
> get a clean master at any time. How many Apache projects are doing
> development on separate branches?
>
> 6) FYI: Rahul (rahul003@) has fixed various warnings with this PR:
> https://github.com/apache/incubator-mxnet/pull/7109 and has a test added
> that fails for any warning found. We can build on top of his work.
>
> 7) FYI: For the unit-tests problems, Meghna identified that some of the
> unit-test run times have increased significantly in the recent builds. We
> need volunteers to help diagnose the root-cause here:
>
> Unit Test Task       Build #337   Build #500   Build #556
> Python 2: GPU Win     25           38           40
> Python 3: GPU Win     15           38           46
> Python 2: CPU         25           35           80
> Python 3: CPU         14           28           72
> R: CPU                20           34           24
> R: GPU                5            24           24
>
>
> 8) Ensure that all PRs submitted have corresponding documentation on
> http://mxnet.io for it.  It may be fine to have documentation follow the
> code changes as long as there is ownership that this task will be done in a
> timely manner.  For example, I have requested the Nvidia team to submit PRs
> to update documentation on http://mxnet.io for the Volta changes to MXNet.
>
>
> 9) Ensure that mega-PRs have some level of design or architecture
> document(s) shared on the Apache MXNet wiki. The mega-PR must have both
> unit-tests and nightly/integration tests submitted to demonstrate
> high-quality level.
>
>
> 10) Finally, how do we get ownership for code submitted to MXNet? When
> something fails in a code segment that only a small set of folks know
> about, what is the expected SLA for a response from them? When users deploy
> MXNet in production environments, they will expect some form of SLA for
> support and a patch release.
>
>
> Regards,
> Bhavin Thaker.
>
>
>
>
>
>
> On Wed, Nov 1, 2017 at 8:20 AM, Pedro Larroy 
> wrote:
>
>> +1  That would be great.
>>
>> On Mon, Oct 30, 2017 at 5:35 PM, Hen  wrote:
>> > How about we ask for a new mxnet repo to store all the config in?
>> >
>> > On Fri, Oct 27, 2017 at 05:30 Pedro Larroy > >
>> > wrote:
>> >
>> >> Just to provide a high level overview of the ideas and proposals
>> >> coming from different sources for the requirements for testing and
>> >> validation of builds:
>> >>
>> >> * Have terraform files for the testing infrastructure. Infrastructure
>> >> as code (IaC). Minus not emulated / nor cloud based, embedded
>> >> hardware. ("single command" replication of the testing infrastructure,
>> >> no manual steps).
>> >>
>> >> * CI software based on Jenkins, unless someone thinks the

update build instructions

2017-11-02 Thread Pedro Larroy
Hi

I would like to update the MXNet build instructions.

In particular, I was thinking it would be a good idea to update the
instructions to use CMake + Ninja, and to add more information about the
different build flavours.


https://mxnet.incubator.apache.org/install/index.html


Thoughts?


Re: update build instructions

2017-11-02 Thread Pedro Larroy
Hi

For me it's more about correctness and reproducibility than build times;
nonetheless, the Ninja build seems significantly faster than the Make
build:

Make:

real    4m32.779s
user    43m33.236s
sys     0m52.940s

CMake + Ninja:

real    3m30.794s
user    36m2.564s
sys     0m56.368s

Compiled on a g3.4xlarge with an EBS Provisioned IOPS SSD volume, io (115000 iops).


Pedro.

On Thu, Nov 2, 2017 at 4:07 PM, Bhavin Thaker  wrote:
> Hi Pedro,
>
> Using Ninja to improve build times is a good suggestion. Can you share the
> build times you have observed with and without using Ninja? I presume you
> have enabled compile-time options for GPU builds and Distributed MXNet for
> the builds you have experimented with.
>
> See also:
> https://ninja-build.org/manual.html
>
> Thanks,
> Bhavin Thaker.
>
> On Thu, Nov 2, 2017 at 7:57 AM, Pedro Larroy 
> wrote:
>
>> Hi
>>
>> I would like to update the MXNet build instructions.
>>
>> In particular I was thinking that it would be a good idea to update
>> the instructions to use CMake + Ninja. And add more information about
>> the different build flavours.
>>
>>
>> https://mxnet.incubator.apache.org/install/index.html
>>
>>
>> Thoughts?
>>


Re: update build instructions

2017-11-02 Thread Pedro Larroy
Right.

I tried now with the flavour that you requested and I have problems
generating the build files:

It seems that I need the variable pslite_LINKER_LIBS_DEBUG, which is not
set. Any idea how to compile with this flavour (distributed KVStore)?




ubuntu@ip-172-31-35-161:~/mxnet/build$ cmake -DUSE_CUDA=ON
-DUSE_DIST_KVSTORE=ON -GNinja ..
-- Found MKL (include: /usr/local/include, lib: /usr/local/lib/libmklml_gnu.so
-- Found OpenBLAS libraries: /usr/local/lib/libopenblas.so
-- Found OpenBLAS include: /usr/local/include
-- CUDA detected: 8.0
-- Found cuDNN (include: /usr/local/cuda/include, library:
/usr/local/cuda/lib64/libcudnn.so)
-- Added CUDA NVCC flags for: sm_52
-- Could NOT find Gperftools (missing:  GPERFTOOLS_LIBRARIES
GPERFTOOLS_INCLUDE_DIR)
-- Could NOT find Jemalloc (missing:  JEMALLOC_LIBRARY JEMALLOC_INCLUDE_DIR)
--  OpenCV_LIBS=opencv_core;opencv_highgui;opencv_imgproc;opencv_imgcodecs
-- OpenCV found (/usr/local/share/OpenCV)
-- Could NOT find Jemalloc (missing:  JEMALLOC_LIBRARY JEMALLOC_INCLUDE_DIR)
-- Found cuDNN (include: /usr/local/cuda/include, library:
/usr/local/cuda/lib64/libcudnn.so)
You have called ADD_LIBRARY for library mxnet without any source
files. This typically indicates a problem with your CMakeLists.txt
file
-- Found PROTOBUF Compiler: /usr/local/bin/protoc
CMake Error at CMakeLists.txt:446 (target_link_libraries):
  The "debug" argument must be followed by a library.


-- Configuring incomplete, errors occurred!
See also "/home/ubuntu/mxnet/build/CMakeFiles/CMakeOutput.log".
See also "/home/ubuntu/mxnet/build/CMakeFiles/CMakeError.log".


Regards.

Pedro.

On Thu, Nov 2, 2017 at 4:52 PM, Bhavin Thaker  wrote:
> I agree about your point on correctness -- do you know of any known
> correctness issues with Ninja?
>
> These build times seem to be NOT with GPU builds and distributed kvstore
> enabled -- could you please confirm? nvcc builds take a significant time.
>
> Bhavin Thaker.
>
> On Thu, Nov 2, 2017 at 8:45 AM, Pedro Larroy 
> wrote:
>
>> Hi
>>
>> For me it's more about correctness and reproducibility than build
>> times, nonetheless, seems that the ninja build is significantly faster
>> than the Make build:
>>
>> Make:
>>
>> real4m32.779s
>> user43m33.236s
>> sys 0m52.940s
>>
>> CMake + Ninja:
>>
>> real3m30.794s
>> user36m2.564s
>> sys 0m56.368s
>>
>> Compiled on an g3.4xlarge with ebs
>>
>> Provisioned IOPS SSD
>>
>> io (115000 iops)
>>
>>
>> Pedro.
>>
>> On Thu, Nov 2, 2017 at 4:07 PM, Bhavin Thaker 
>> wrote:
>> > Hi Pedro,
>> >
>> > Using Ninja to improve build times is a good suggestion. Can you share
>> the
>> > build times you have observed with and without using Ninja? I presume you
>> > have enabled compile-time options for GPU builds and Distributed MXNet
>> for
>> > the builds you have experimented with.
>> >
>> > See also:
>> > https://ninja-build.org/manual.html
>> >
>> > Thanks,
>> > Bhavin Thaker.
>> >
>> > On Thu, Nov 2, 2017 at 7:57 AM, Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> > wrote:
>> >
>> >> Hi
>> >>
>> >> I would like to update the MXNet build instructions.
>> >>
>> >> In particular I was thinking that it would be a good idea to update
>> >> the instructions to use CMake + Ninja. And add more information about
>> >> the different build flavours.
>> >>
>> >>
>> >> https://mxnet.incubator.apache.org/install/index.html
>> >>
>> >>
>> >> Thoughts?
>> >>
>>


Re: [Proposal] Stabilizing Apache MXNet CI build system

2017-11-06 Thread Pedro Larroy
Thanks for setting up the document, everyone; it looks like a solid basis
to start working from!

Marco, Kellen and I have already added some comments.

Pedro


On Sun, Nov 5, 2017 at 3:43 AM, Meghna Baijal
 wrote:
> Kellen, Thank you for your comments in the doc.
> Sure Steffen, I will continue to merge everyone’s comments into the doc and
> work with Pedro to finalize it.
> And then we can vote on the options.
>
> Thanks,
> Meghna Baijal
>
>
> On Sat, Nov 4, 2017 at 6:34 AM, Steffen Rochel 
> wrote:
>
>> Sandeep and Meghna have been working in background collecting input and
>> preparing a doc. I suggest to drive discussion forward and would like to
>> ask everybody to contribute to
>> https://docs.google.com/document/d/17PEasQ2VWrXi2Cf7IGZSWGZMawxDk
>> dlavUDASzUmLjk/edit?usp=sharing
>>
>> Lets converge on requirements and architecture, so we can move forward with
>> implementation.
>>
>> I would like to suggest for Pedro  and Meghna to lead the discussion and
>> help to resolve suggestions.
>>
>> I assume we need a vote once we are converged on a good draft to call it a
>> plan and move forward with implementation. As we all are unhappy with the
>> current CI situation I would also suggest a phased approach, so we can get
>> back to reliable and efficient basic CI quickly and add advanced
>> capabilities over time.
>>
>> Steffen
>>
>> On Wed, Nov 1, 2017 at 1:14 PM kellen sunderland <
>> kellen.sunderl...@gmail.com> wrote:
>>
>> > Hey Henri, I think that's what a few of us are advocating.  Running a set
>> > of quick tests as part of the PR process, and then a more detailed
>> > regression test suite periodically (say every 4 hours). This fits nicely
>> > into a tagging or 2 branch development system.  Commits will be tagged
>> (or
>> > merged into a stable branch) as soon as they pass the detailed regression
>> > testing.
>> >
>> > On Wed, Nov 1, 2017 at 9:07 PM, Hen  wrote:
>> >
>> > > Random question - can the CI be split such that the Apache CI is doing
>> a
>> > > basic set of checks on that hardware, and is hooked to a PR, while
>> there
>> > is
>> > > a larger "Is trunk good for release?" test that is running periodically
>> > > rather than on every PR?
>> > >
>> > > ie: do we need each PR to be run on varied hardware, or can we have
>> this
>> > > two tier approach?
>> > >
>> > > Hen
>> > >
>> > > On Fri, Oct 20, 2017 at 1:01 PM, sandeep krishnamurthy <
>> > > sandeep.krishn...@gmail.com> wrote:
>> > >
>> > > > Hello all,
>> > > >
>> > > > I am hereby opening up a discussion thread on how we can stabilize
>> > Apache
>> > > > MXNet CI build system.
>> > > >
>> > > > Problems:
>> > > >
>> > > > 
>> > > >
>> > > > Recently, we have seen following issues with Apache MXNet CI build
>> > > systems:
>> > > >
>> > > >1. Apache Jenkins master is overloaded and we see issues like -
>> > unable
>> > > >to trigger builds, difficult to load and view the blue ocean and
>> > other
>> > > >Jenkins build status page.
>> > > >2. We are generating too many request/interaction on Apache Infra
>> > > team.
>> > > >   1. Addition/deletion of new slave: Caused from scaling
>> activity,
>> > > >   recycling, troubleshooting or any actions leading to change of
>> > > slave
>> > > >   machines.
>> > > >   2. Plugins / other Jenkins Master configurations.
>> > > >   3. Experimentation on CI pipelines.
>> > > >3. Harder to debug and resolve issues - Since access to master and
>> > > slave
>> > > >is not with the same community, it requires Infra and community to
>> > > dive
>> > > >deep together on all action items.
>> > > >
>> > > > Possible Solutions:
>> > > >
>> > > > ==
>> > > >
>> > > >1. Can we set up a separate Jenkins CI build system for Apache
>> MXNet
>> > > >outside Apache Infra?
>> > > >2. Can we have a separate Jenkins Master in Apache Infra for
>> MXNet?
>> > > >3. Review design of current setup, refine and fill the gaps.
>> > > >
>> > > > @ Mentors/Infra team/Community:
>> > > >
>> > > > ==
>> > > >
>> > > > Please provide your suggestions on how we can proceed further and work
>> > > > on stabilizing the CI build systems for MXNet.
>> > > >
>> > > > Also, if the community decides on a separate Jenkins CI build system,
>> > > > what important points should be taken care of apart from the below:
>> > > >
>> > > >1. The community being able to access the build page for build statuses.
>> > > >2. Committers being able to log in with Apache credentials.
>> > > >3. Hook setup from the apache/incubator-mxnet repo to the Jenkins master.
>> > > >
>> > > >
>> > > > Irrespective of the solution we come up with, I think we should
>> > > > initiate a technical design discussion on how to set up the CI build
>> > > > system - probably a one- or two-page document with the architecture,
>> > > > to review with Infra and community members.
>> > > >
>> > > > ***There were few proposal and discussion on the slack channe

Re: update build instructions

2017-11-07 Thread Pedro Larroy
Hi

I updated the build instructions to use CMake and python3, and fixed other
minor inaccuracies in this PR:
https://github.com/apache/incubator-mxnet/pull/8578

Please have a look and comment.

I also added a file "DEVELOPMENT.md" in the root, which summarizes how to
build MXNet and run the unit tests to get started making changes.

Your feedback is welcome. If you can, please test the instructions to check
that they work properly and that nothing was missed.

Pedro.

On Thu, Nov 2, 2017 at 5:07 PM, Pedro Larroy
 wrote:
> Right.
>
> I tried now with the flavor that you requested and I have problems
> generating the build files:
>
> It seems that I need the variable pslite_LINKER_LIBS_DEBUG, which is not
> set. Any idea on how to compile with this flavor? (dist KVSTORE)
>
>
>
>
> ubuntu@ip-172-31-35-161:~/mxnet/build$ cmake -DUSE_CUDA=ON
> -DUSE_DIST_KVSTORE=ON -GNinja ..
> -- Found MKL (include: /usr/local/include, lib: /usr/local/lib/libmklml_gnu.so
> -- Found OpenBLAS libraries: /usr/local/lib/libopenblas.so
> -- Found OpenBLAS include: /usr/local/include
> -- CUDA detected: 8.0
> -- Found cuDNN (include: /usr/local/cuda/include, library:
> /usr/local/cuda/lib64/libcudnn.so)
> -- Added CUDA NVCC flags for: sm_52
> -- Could NOT find Gperftools (missing:  GPERFTOOLS_LIBRARIES
> GPERFTOOLS_INCLUDE_DIR)
> -- Could NOT find Jemalloc (missing:  JEMALLOC_LIBRARY JEMALLOC_INCLUDE_DIR)
> --  OpenCV_LIBS=opencv_core;opencv_highgui;opencv_imgproc;opencv_imgcodecs
> -- OpenCV found (/usr/local/share/OpenCV)
> -- Could NOT find Jemalloc (missing:  JEMALLOC_LIBRARY JEMALLOC_INCLUDE_DIR)
> -- Found cuDNN (include: /usr/local/cuda/include, library:
> /usr/local/cuda/lib64/libcudnn.so)
> You have called ADD_LIBRARY for library mxnet without any source
> files. This typically indicates a problem with your CMakeLists.txt
> file
> -- Found PROTOBUF Compiler: /usr/local/bin/protoc
> CMake Error at CMakeLists.txt:446 (target_link_libraries):
>   The "debug" argument must be followed by a library.
>
>
> -- Configuring incomplete, errors occurred!
> See also "/home/ubuntu/mxnet/build/CMakeFiles/CMakeOutput.log".
> See also "/home/ubuntu/mxnet/build/CMakeFiles/CMakeError.log".
>
>
> Regards.
>
> Pedro.
>
> On Thu, Nov 2, 2017 at 4:52 PM, Bhavin Thaker  wrote:
>> I agree with your point on correctness -- do you know of any known
>> correctness issues with Ninja?
>>
>> These build times seem to be NOT with GPU builds and distributed kvstore
>> enabled -- could you please confirm? nvcc builds take a significant time.
>>
>> Bhavin Thaker.
>>
>> On Thu, Nov 2, 2017 at 8:45 AM, Pedro Larroy 
>> wrote:
>>
>>> Hi
>>>
>>> For me it's more about correctness and reproducibility than build
>>> times; nonetheless, it seems that the Ninja build is significantly faster
>>> than the Make build:
>>>
>>> Make:
>>>
>>> real    4m32.779s
>>> user    43m33.236s
>>> sys     0m52.940s
>>>
>>> CMake + Ninja:
>>>
>>> real    3m30.794s
>>> user    36m2.564s
>>> sys     0m56.368s
>>>
>>> Compiled on a g3.4xlarge with EBS
>>>
>>> Provisioned IOPS SSD
>>>
>>> io (115000 iops)
>>>
>>>
>>> Pedro.
>>>
>>> On Thu, Nov 2, 2017 at 4:07 PM, Bhavin Thaker 
>>> wrote:
>>> > Hi Pedro,
>>> >
>>> > Using Ninja to improve build times is a good suggestion. Can you share
>>> > the build times you have observed with and without using Ninja? I presume
>>> > you have enabled compile-time options for GPU builds and Distributed
>>> > MXNet for the builds you have experimented with.
>>> >
>>> > See also:
>>> > https://ninja-build.org/manual.html
>>> >
>>> > Thanks,
>>> > Bhavin Thaker.
>>> >
>>> > On Thu, Nov 2, 2017 at 7:57 AM, Pedro Larroy <
>>> pedro.larroy.li...@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi
>>> >>
>>> >> I would like to update the MXNet build instructions.
>>> >>
>>> >> In particular I was thinking that it would be a good idea to update
>>> >> the instructions to use CMake + Ninja, and to add more information
>>> >> about the different build flavours.
>>> >>
>>> >>
>>> >> https://mxnet.incubator.apache.org/install/index.html
>>> >>
>>> >>
>>> >> Thoughts?
>>> >>
>>>


Re: [VOTE] A Separate CI System for Apache MXNet (incubating)

2017-11-13 Thread Pedro Larroy
+1 for [1]  (A setup separated from Apache Jenkins)

On Mon, Nov 13, 2017 at 4:50 AM, sandeep krishnamurthy
 wrote:
> +1 for [1] Jenkins (A setup separated from Apache Jenkins) - preferably
> with AWS CodeBuild integration to reduce the size of the infrastructure
> we need to maintain.
>
> Thanks,
> Sandeep
>
> On Fri, Nov 10, 2017 at 11:57 AM, Bhavin Thaker 
> wrote:
>
>> +1 for [1] Jenkins (A setup separated from Apache Jenkins) - with various
>> plugins.
>>
>> Bhavin Thaker.
>>
>> On Fri, Nov 10, 2017 at 11:39 AM, Madan Jampani 
>> wrote:
>>
>> > +1 for (1)
>> >
>> > On Thu, Nov 9, 2017 at 4:41 PM, Meghna Baijal <
>> meghnabaijal2...@gmail.com>
>> > wrote:
>> >
>> > > Hi All,
>> > > A need has been identified for MXNet’s CI/CD setup to move away from
>> the
>> > > Apache Jenkins Service. Over the past few days there has been active
>> > > discussion on the necessary and advanced features for such a system and
>> > > the various options available. These are being tracked in this Google Doc
>> > > > > dlavUDASzUmLjk/edit> (and
>> > > are also in the pdf attached).
>> > >
>> > > I would like to start a vote to choose the framework for this new CI/CD
>> > > system. The options are -
>> > > [1] Jenkins (A setup separated from Apache Jenkins) - with various
>> > > plugins
>> > > [2] TeamCity
>> > > [3] Travis CI
>> > > [4] GitLabCI
>> > > [5] BuildBot
>> > > [6] Other - Please Name
>> > >
>> > > Please feel free to add a comment to support your choice.
>> > > This vote will be open from now until the end of the day on Monday
>> > > 11/13/2017
>> > >
>> > > Thanks,
>> > > Meghna Baijal
>> > >
>> > >
>> > >
>> >
>>
>
>
>
> --
> Sandeep Krishnamurthy


[RFQ] Deprecate amalgamation

2017-11-20 Thread Pedro Larroy
Hi all

Given that we have working builds for ARM, Android, TX2 and the main
architectures, and after considering how amalgamation is done, I would
like to propose that we deprecate and remove amalgamation.

I don't think the cost of maintaining this feature and how it's done
justifies the ROI, given that we can now produce binary builds for
embedded platforms in a comfortable way. It's also consuming build &
test resources.

We should strive to simplify our build system and development process.

Pedro.


Re: [RFQ] Deprecate amalgamation

2017-11-20 Thread Pedro Larroy
I like the idea of amalgamation; I have used it in SQLite, as it makes it
very easy to just drop one header file and one source file into another
project to use the library.
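
As a concrete illustration of that pattern, here is a minimal sketch of a
program consuming the SQLite amalgamation - the whole library is just the
sqlite3.h / sqlite3.c pair dropped next to your own sources. The snippet
assumes those two files are present locally and is only meant to show why the
pattern is convenient for embedding:

    // Build (assumption): cc -c sqlite3.c && c++ main.cc sqlite3.o -lpthread -ldl
    #include <cstdio>
    #include "sqlite3.h"  // the single header from the SQLite amalgamation

    int main() {
      sqlite3* db = nullptr;
      if (sqlite3_open(":memory:", &db) != SQLITE_OK) {
        std::fprintf(stderr, "open failed\n");
        return 1;
      }
      char* err = nullptr;
      // Create a table and insert one row; errors are reported through `err`.
      int rc = sqlite3_exec(db,
                            "CREATE TABLE t(x INTEGER); INSERT INTO t VALUES(42);",
                            nullptr, nullptr, &err);
      if (rc != SQLITE_OK) {
        std::fprintf(stderr, "exec failed: %s\n", err);
        sqlite3_free(err);
      }
      sqlite3_close(db);
      return 0;
    }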

But SQLite is often used as a library embedded in platforms / other libraries.

What's the use case of amalgamation in MXNet when we can build the
binary library for all the platforms with MXNet's build system?  Who
is using MXNet as an embedded library that can't use the shared
library + headers or specific language bindings?

Can't we call emscripten from CMake? I'm not familiar with our JS
bindings, but I don't see why we can't compile for emscripten as for
any other platform.

Pedro.

On Mon, Nov 20, 2017 at 11:59 PM, Tianqi Chen  wrote:
> We could resort to a middle ground. Instead of having an amalgamation
> script that generates a single file, simply have a file that includes
> everything and compile that one, which should also work.
>
> The JavaScript port can likely be superseded with some form of support in
> the nnvm compiler, which transpiles and likely generates more efficient
> code than the current version. We can enable that feature now, except that
> there is no dedicated developer on it yet. We can talk about full
> deprecation after this.
>
>
> Tianqi
>
> On Mon, Nov 20, 2017 at 2:47 PM, Pedro Larroy 
> wrote:
>
>> Hi all
>>
>> Given that we have working builds for ARM, Android, TX2 and the main
>> architectures, and after considering how amalgamation is done, I would
>> like to propose that we deprecate and remove amalgamation.
>>
>> I don't think the cost of maintaining this feature and how it's done
>> justifies the ROI, given that we can now produce binary builds for
>> embedded platforms in a comfortable way. It's also consuming build &
>> test resources.
>>
>> We should strive to simplify our build system and development process.
>>
>> Pedro.
>>


Re: 3rdparty packages as submodules

2017-11-20 Thread Pedro Larroy
We could also add gtest, for example.



I would like to point out that it is quite cumbersome to get your code
tested and ready before sending a PR; this includes installing
cpplint, pylint, gtest…

Installing gtest and bootstrapping it is not completely trivial.



Kind regards.

On Mon, Nov 20, 2017 at 11:23 AM, Eric Xie  wrote:
> I'm fine with a 3rdparty folder. Not sure about apache legal.
>
> On 2017-11-17 10:25, Chris Olivier  wrote:
>> All,
>>
>> I often find it desirable to have a method for 3rdparty packages to be
>> included (possibly optionally) in a 3rdparty directory.   We do this with
>> 'cub' to some degree, but it's in the root and is actually a fork in the
>> dmlc repository.  Some samples of what might go in there:
>>
>> 1) Intel OpenMP (llvm-openmp) -- In order to use Intel OMP by default
>> 2) gperftools -- In order to build statically with -fPIC, which isn't the
>> case with the general distribution
>> 3) mkl-dnn -- In order to build and have debug information available for
>> mkl-dnn (and possibly submit bugfixes)
>>
>> What do you all think?
>>
>> -Chris
>>


Re: [Important] Please Help make the Apache MXNet (incubating) 1.0 Release Notes Better!

2017-11-20 Thread Pedro Larroy
Thank you Meghna

Added notes about ARM & Nvidia Jetson support (beta) to the document.

On Mon, Nov 20, 2017 at 2:19 PM, Meghna Baijal
 wrote:
> Apologies. Done.
>
> On Mon, Nov 20, 2017 at 2:18 PM, Chris Olivier 
> wrote:
>
>> No write access :(
>>
>> On Mon, Nov 20, 2017 at 2:16 PM, Meghna Baijal > >
>> wrote:
>>
>> > Hello All,
>> > As you know I am currently working on finalizing the Release Candidate
>> > for the Apache MXNet (incubating) 1.0 Release. Anyone who has contributed
>> > to this release, could you please go through the release notes in the
>> > shared doc linked below and review/make changes as needed.
>> >
>> > Link -
>> > https://docs.google.com/document/d/1SdFwiTXlFBMmyVfEHpe7s3jteWxqfUzsxTgbagcZWzo/edit?usp=sharing
>> >
>> > The notes are very limited in detail, so feel free to add any
>> > details/links to tutorials or documentation that you think will be useful.
>> > Changes can be made until *EOD tonight (Monday, 11/20)*, after which they
>> > need to be merged into NEWS.md.
>> >
>> > Thanks,
>> > Meghna Baijal
>> >
>>


Re: [RFQ] Deprecate amalgamation

2017-11-21 Thread Pedro Larroy
Is anybody against removing amalgamation then? The emscripten build is
already using CMake.

On Tue, Nov 21, 2017 at 9:22 AM, Tianqi Chen  wrote:
> Yes, you can call emscripten from CMake
>
> Tianqi
>
> On Mon, Nov 20, 2017 at 5:42 PM, Pedro Larroy 
> wrote:
>
>> I like the idea of amalgamation; I have used it in SQLite, as it makes it
>> very easy to just drop one header file and one source file into another
>> project to use the library.
>>
>> But SQLite is often used as a library embedded in platforms / other
>> libraries.
>>
>> What's the use case of amalgamation in MXNet when we can build the
>> binary library for all the platforms with MXNet's build system?  Who
>> is using MXNet as an embedded library that can't use the shared
>> library + headers or specific language bindings?
>>
>> Can't we call emscripten from CMake? I'm not familiar with our JS
>> bindings, but I don't see why we can't compile for emscripten as for
>> any other platform.
>>
>> Pedro.
>>
>> On Mon, Nov 20, 2017 at 11:59 PM, Tianqi Chen 
>> wrote:
>> > We could resort to a middle ground. Instead of having an amalgamation
>> > script that generates a single file, simply have a file that includes
>> > everything and compile that one, which should also work.
>> >
>> > The JavaScript port can likely be superseded with some form of support in
>> > the nnvm compiler, which transpiles and likely generates more efficient
>> > code than the current version. We can enable that feature now, except
>> > that there is no dedicated developer on it yet. We can talk about full
>> > deprecation after this.
>> >
>> >
>> > Tianqi
>> >
>> > On Mon, Nov 20, 2017 at 2:47 PM, Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> > wrote:
>> >
>> >> Hi all
>> >>
>> >> Given that we have working builds for ARM, Android, TX2 and the main
>> >> architectures, and after considering how amalgamation is done, I would
>> >> like to propose that we deprecate and remove amalgamation.
>> >>
>> >> I don't think the cost of maintaining this feature and how it's done
>> >> justifies the ROI, given that we can now produce binary builds for
>> >> embedded platforms in a comfortable way. It's also consuming build &
>> >> test resources.
>> >>
>> >> We should strive to simplify our build system and development process.
>> >>
>> >> Pedro.
>> >>
>>


Use unique_ptr on Executor creation

2017-11-22 Thread Pedro Larroy
Hi

I would like to make a minor change to the cpp package for 1.0 by
returning a unique_ptr when creating an Engine.

This is a C++ idiom that prevents memory leaks and fixes one leak in
the examples. Until now the call has required an explicit delete, as the
ownership of the returned pointer is not clear from the API call.
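
To make the idiom concrete, here is a minimal sketch (the Executor /
CreateExecutor names below are illustrative only, not the actual cpp-package
API):

    #include <memory>

    struct Executor {
      void Forward() { /* run the graph */ }
    };

    // Old style: the caller owns a raw pointer and must remember to delete
    // it; forgetting to do so is exactly the kind of leak mentioned above.
    Executor* CreateExecutorRaw() { return new Executor(); }

    // Proposed style: ownership is explicit in the return type and the
    // object is released automatically when the unique_ptr goes out of scope.
    std::unique_ptr<Executor> CreateExecutor() {
      return std::unique_ptr<Executor>(new Executor());  // C++11-compatible
    }

    int main() {
      auto exec = CreateExecutor();
      exec->Forward();
      return 0;  // no explicit delete needed
    }

Callers that really want to manage the lifetime themselves can still call
release() on the returned unique_ptr, so the change should not be intrusive.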


https://github.com/apache/incubator-mxnet/pull/8737/files#diff-c0e1fcfe1619faa4ff5f59d94e8bL341

Please raise any comments if you disagree with this change.

I think it's especially appropriate to make this change before the 1.0 release.

Pedro.


Re: Protected master needs to be turned off

2017-12-01 Thread Pedro Larroy
CI catches problems all the time. I don't think many of us can afford
to build all the flavors and architectures on our laptops or
workstations, so we have to rely on CI to catch all kinds of errors,
from compilation errors to bugs and regressions, especially in a
project which has so many build flavors.

I have had this experience in big projects several times and I can
tell you it's always the same.

So, from extensive software development experience, I can say that we will
be able to develop and merge much faster once we have a reliable CI
running in short cycles; any other approach or shortcut just accumulates
technical debt that somebody will have to clean up later and that will
slow down development. It is better to have a CI with a reduced scope
working reliably than to bypass CI.

This is irrespective of whether we merge through a dev branch or use an
unprotected master.

We can't afford increased warnings, bugs creeping into the codebase
unnoticed, build system problems, performance regressions, etc., and we
have to rely on a solid CI for this. If we are not ready for this, we
should halt feature development, or at least stop merging new features,
until we have a stable codebase and build system.


  1   2   3   4   5   >