Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-07-05 Thread Lai Wei
 (without setting the env variables, I got a close time (<1%) with
> >> v1.5
> >> > and
> >> > > > v1.4)
> >> > > > export
> >> KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> >> > > > export OMP_NUM_THREADS=18
> >> > > >
> >> > > > Did you set any env variables during running?
> >> > > >
> >> > > > The performance result I got as below:
> >> > > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> >> > > > real 12m10.856s
> >> > > > user 234m49.576s
> >> > > > sys 4m38.044s
> >> > > >
> >> > > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> >> > > > real 12m52.140s
> >> > > > user 246m30.740s
> >> > > > sys 5m8.188s
> >> > > >
> >> > > > Looking at the profiling data, most of the ops have the same perf between
> >> > > > v1.4 and v1.5, but some ops like "_backward_BatchNorm" and "Pooling" are
> >> > > > ~1.37x slower on v1.5 compared with v1.4.
> >> > > > Will do further analysis on these ops.
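As a sanity check on the `time` output quoted above, the wall-clock (`real`) figures can be converted to seconds and compared. This is only an illustrative sketch: the helper names `parse_time` and `slowdown_pct` are made up for this example and are not part of MXNet or the benchmark script.

```python
import re


def parse_time(text):
    """Convert a `time`-style duration such as '12m10.856s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def slowdown_pct(baseline, candidate):
    """Relative slowdown of `candidate` versus `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


# `real` times copied from the quoted results for 1.4.1.rc0 and 1.5.0.rc1.
real_v14 = parse_time("12m10.856s")
real_v15 = parse_time("12m52.140s")
print(f"v1.5 slowdown vs v1.4 (real): {slowdown_pct(real_v14, real_v15):.2f}%")
```

The result lands between 5.6% and 5.7%, which is in line with the roughly 5.6% gap reported elsewhere in this thread for the pinned 18-core run.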
> >> > > >
> >> > > > Here's the hardware/OS info from my side:
> >> > > > --Python Info--
> >> > > > Version  : 3.6.8
> >> > > > Compiler : GCC 7.3.0
> >> > > > Build: ('default', 'Dec 30 2018 01:22:34')
> >> > > > Arch : ('64bit', '')
> >> > > > Pip Info---
> >> > > > Version  : 19.0.3
> >> > > > Directory:
> >> > > >
> >> /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> >> > > > --MXNet Info---
> >> > > > Version  : 1.5.0
> >> > > > Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
> >> > > > Hashtag not found. Not installed from pre-built package.
> >> > > > --System Info--
> >> > > > Platform : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> >> > > > system   : Linux
> >> > > > node : ip-172-31-32-129
> >> > > > release  : 4.4.0-1085-aws
> >> > > > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> >> > > > --Hardware Info--
> >> > > > machine  : x86_64
> >> > > > processor: x86_64
> >> > > > Architecture:  x86_64
> >> > > > CPU op-mode(s):32-bit, 64-bit
> >> > > > Byte Order:    Little Endian
> >> > > > CPU(s):72
> >> > > > On-line CPU(s) list:   0-71
> >> > > > Thread(s) per core:2
> >> > > > Core(s) per socket:18
> >> > > > Socket(s): 2
> >> > > > NUMA node(s):  2
> >> > > > Vendor ID: GenuineIntel
> >> > > > CPU family:    6
> >> > > > Model:     85
> >> > > > Model name:Intel(R) Xeon(R) Platinum 8124M CPU @
> 3.00GHz
> >> > > > Stepping:  3
> >> > > > CPU MHz:   3000.000
> >> > > > BogoMIPS:  6000.00
> >> > > > Hypervisor vendor: KVM
> >> > > > Virtualization type:   full
> >> > > > L1d cache: 32K
> >> > > > L1i cache: 32K
> >> > > > L2 cache:  1024K
> >> > > > L3 cache:  25344K
> >> > > > NUMA node0 CPU(s): 0-17,36-53
> >> > > > NUMA node1 CPU(s): 18-35,54-71
> >> > > > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep
> >> mtrr
> >> > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx
> >> > pdpe1gb
> >> > > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
> >> nonstop_tsc
> >> > > > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16
> pcid
> >> > sse4_1
> >> > > > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsav

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-29 Thread Chris Olivier
n v1.5 compared with v1.4.
>> > > > Will do further analysis on these ops.
>> > > >
>> > > > Here's the hardware/OS info from my side:
>> > > > --Python Info--
>> > > > Version  : 3.6.8
>> > > > Compiler : GCC 7.3.0
>> > > > Build: ('default', 'Dec 30 2018 01:22:34')
>> > > > Arch : ('64bit', '')
>> > > > Pip Info---
>> > > > Version  : 19.0.3
>> > > > Directory:
>> > > >
>> /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
>> > > > --MXNet Info---
>> > > > Version  : 1.5.0
>> > > > Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
>> > > > Hashtag not found. Not installed from pre-built package.
>> > > > --System Info--
>> > > > Platform : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
>> > > > system   : Linux
>> > > > node : ip-172-31-32-129
>> > > > release  : 4.4.0-1085-aws
>> > > > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
>> > > > --Hardware Info--
>> > > > machine  : x86_64
>> > > > processor: x86_64
>> > > > Architecture:  x86_64
>> > > > CPU op-mode(s):32-bit, 64-bit
>> > > > Byte Order:Little Endian
>> > > > CPU(s):72
>> > > > On-line CPU(s) list:   0-71
>> > > > Thread(s) per core:2
>> > > > Core(s) per socket:18
>> > > > Socket(s): 2
>> > > > NUMA node(s):  2
>> > > > Vendor ID:     GenuineIntel
>> > > > CPU family:6
>> > > > Model: 85
>> > > > Model name:Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
>> > > > Stepping:  3
>> > > > CPU MHz:   3000.000
>> > > > BogoMIPS:  6000.00
>> > > > Hypervisor vendor: KVM
>> > > > Virtualization type:   full
>> > > > L1d cache: 32K
>> > > > L1i cache: 32K
>> > > > L2 cache:  1024K
>> > > > L3 cache:  25344K
>> > > > NUMA node0 CPU(s): 0-17,36-53
>> > > > NUMA node1 CPU(s): 18-35,54-71
>> > > > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep
>> mtrr
>> > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx
>> > pdpe1gb
>> > > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
>> nonstop_tsc
>> > > > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid
>> > sse4_1
>> > > > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
>> rdrand
>> > > > hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase
>> > > > tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f
>> rdseed
>> > adx
>> > > > smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
>> > > > --Network Test--
>> > > >
>> > > >
>> > > > -Ciyong
>> > > >
>> > > >
>> > > > -Original Message-
>> > > > From: Zhao, Patric [mailto:patric.z...@intel.com]
>> > > > Sent: Thursday, June 27, 2019 9:55 AM
>> > > > To: dev@mxnet.incubator.apache.org
>> > > > Cc: d...@mxnet.apache.org
>> > > > Subject: RE: [VOTE] Release Apache MXNet (incubating) version
>> 1.5.0.rc1
>> > > >
>> > > > Could we run more epochs to see the performance difference or
>> profiling
>> > > > the difference between good and bad run?
>> > > >
>> > > > > -Original Message-
>> > > > > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
>> > > > > Sent: Thursday, June 27, 2019 9:35 AM
>> > > > > To: dev@mxnet.incubator.apache.org
>> > > > > Cc: d...@mxnet.apache.org
>> > > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version
>> > > > > 1.5.0.rc1
>> > > > >
>> > > > > I run again and the gap is a

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-29 Thread Chris Olivier
> /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > > > --MXNet Info---
> > > > Version  : 1.5.0
> > > > Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
> > > > Hashtag not found. Not installed from pre-built package.
> > > > --System Info--
> > > > Platform : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > > > system   : Linux
> > > > node : ip-172-31-32-129
> > > > release  : 4.4.0-1085-aws
> > > > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> > > > --Hardware Info--
> > > > machine  : x86_64
> > > > processor: x86_64
> > > > Architecture:  x86_64
> > > > CPU op-mode(s):32-bit, 64-bit
> > > > Byte Order:Little Endian
> > > > CPU(s):72
> > > > On-line CPU(s) list:   0-71
> > > > Thread(s) per core:2
> > > > Core(s) per socket:18
> > > > Socket(s): 2
> > > > NUMA node(s):  2
> > > > Vendor ID: GenuineIntel
> > > > CPU family:6
> > > > Model: 85
> > > > Model name:Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> > > > Stepping:  3
> > > > CPU MHz:   3000.000
> > > > BogoMIPS:  6000.00
> > > > Hypervisor vendor: KVM
> > > > Virtualization type:   full
> > > > L1d cache: 32K
> > > > L1i cache: 32K
> > > > L2 cache:  1024K
> > > > L3 cache:  25344K
> > > > NUMA node0 CPU(s): 0-17,36-53
> > > > NUMA node1 CPU(s): 18-35,54-71
> > > > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep
> mtrr
> > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx
> > pdpe1gb
> > > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
> nonstop_tsc
> > > > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid
> > sse4_1
> > > > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
> rdrand
> > > > hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase
> > > > tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f
> rdseed
> > adx
> > > > smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
> > > > --Network Test--
> > > >
> > > >
> > > > -Ciyong
> > > >
> > > >
> > > > -Original Message-
> > > > From: Zhao, Patric [mailto:patric.z...@intel.com]
> > > > Sent: Thursday, June 27, 2019 9:55 AM
> > > > To: dev@mxnet.incubator.apache.org
> > > > Cc: d...@mxnet.apache.org
> > > > Subject: RE: [VOTE] Release Apache MXNet (incubating) version
> 1.5.0.rc1
> > > >
> > > > Could we run more epochs to see the performance difference or
> profiling
> > > > the difference between good and bad run?
> > > >
> > > > > -Original Message-
> > > > > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> > > > > Sent: Thursday, June 27, 2019 9:35 AM
> > > > > To: dev@mxnet.incubator.apache.org
> > > > > Cc: d...@mxnet.apache.org
> > > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version
> > > > > 1.5.0.rc1
> > > > >
> > > > > I run again and the gap is again bigger, I guess we need to average
> > > > > out the times across several runs:
> > > > >
> > > > > piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> > > > > (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py
> --epochs 5
> > > > > && time ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> > > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> > > > > ImageRecordIOParser2:
> > > > > /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4
> > threads
> > > > > for decoding..
> > > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-28 Thread sandeep krishnamurthy
> > > > > >
> > > > > > > > I was able to reproduce a similar result (v1.5 is ~5.6% slower than
> > > > > > > > v1.4; I was using 18 cores for computing) with your script on C5.18xlarge.
> > > > > > > > But the cores need to be bound with the commands below when running the script
> > > > > > > > (without setting the env variables, I got a close time (<1%) with v1.5 and
> > > > > > > > v1.4):
> > > > > > > > export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> > > > > > > > export OMP_NUM_THREADS=18
> > > > > > >
> > > > > > > Did you set any env variables during running?
> > > > > > >
> > > > > > > The performance result I got as below:
> > > > > > > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > > > > > > > real 12m10.856s
> > > > > > > > user 234m49.576s
> > > > > > > sys 4m38.044s
> > > > > > >
> > > > > > > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > > > > > > > real 12m52.140s
> > > > > > > > user 246m30.740s
> > > > > > > sys 5m8.188s
> > > > > > >
> > > > > > > > Looking at the profiling data, most of the ops have the same perf between
> > > > > > > > v1.4 and v1.5, but some ops like "_backward_BatchNorm" and "Pooling" are
> > > > > > > > ~1.37x slower on v1.5 compared with v1.4.
> > > > > > > Will do further analysis on these ops.
> > > > > > >
> > > > > > > Here's the hardware/OS info from my side:
> > > > > > > --Python Info--
> > > > > > > Version  : 3.6.8
> > > > > > > Compiler : GCC 7.3.0
> > > > > > > Build: ('default', 'Dec 30 2018 01:22:34')
> > > > > > > Arch : ('64bit', '')
> > > > > > > Pip Info---
> > > > > > > Version  : 19.0.3
> > > > > > > Directory:
> > > > > > >
> > > >
> >
> /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > > > > > > --MXNet Info---
> > > >     > > > Version  : 1.5.0
> > > > > > > Directory:
> > /home/ubuntu/ws/incubator-mxnet/python/mxnet
> > > > > > > Hashtag not found. Not installed from pre-built
> package.
> > > > > > > --System Info--
> > > > > > > Platform :
> > > Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > > > > > > system   : Linux
> > > > > > > node : ip-172-31-32-129
> > > > > > > release  : 4.4.0-1085-aws
> > > > > > > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32
> UTC
> > 2019
> > > > > > > --Hardware Info--
> > > > > > > machine  : x86_64
> > > > > > > processor: x86_64
> > > > > > > Architecture:  x86_64
> > > > > > > CPU op-mode(s):32-bit, 64-bit
> > > > > > > Byte Order:Little Endian
> > > > > > > CPU(s):  

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-28 Thread Marco de Abreu
> On Thu, Jun 27, 2019 at 10:01 AM Lai Wei <
> roywei...@gmail.com>
> > wrote:
> > > > >
> > > > > Dear @dev,
> > > > >
> > > > > > I'm cancelling the vote for the cached op fix:
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/pull/15298
> > > > >
> > > > > > As for the possible CPU training regression, it does not look like a blocker
> > > > > > for now.
> > > > >
> > > > > I will start a new rc2 vote, please help to validate.
> > > > >
> > > > > Thanks!
> > > > >
> > > > >
> > > > > On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong <
> > ciyong.c...@intel.com
> > > >
> > > > wrote:
> > > > >
> > > > > > Hi Pedro,
> > > > > >
> > > > > > > I was able to reproduce a similar result (v1.5 is ~5.6% slower than
> > > > > > > v1.4; I was using 18 cores for computing) with your script on C5.18xlarge.
> > > > > > > But the cores need to be bound with the commands below when running the script
> > > > > > > (without setting the env variables, I got a close time (<1%) with v1.5 and
> > > > > > > v1.4):
> > > > > > > export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> > > > > > > export OMP_NUM_THREADS=18
> > > > > >
> > > > > > Did you set any env variables during running?
> > > > > >
> > > > > > The performance result I got as below:
> > > > > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > > > > > > real 12m10.856s
> > > > > > > user 234m49.576s
> > > > > > sys 4m38.044s
> > > > > >
> > > > > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > > > > > > real 12m52.140s
> > > > > > > user 246m30.740s
> > > > > > sys 5m8.188s
> > > > > >
> > > > > > > Looking at the profiling data, most of the ops have the same perf between
> > > > > > > v1.4 and v1.5, but some ops like "_backward_BatchNorm" and "Pooling" are
> > > > > > > ~1.37x slower on v1.5 compared with v1.4.
> > > > > > Will do further analysis on these ops.
> > > > > >
> > > > > > Here's the hardware/OS info from my side:
> > > > > > --Python Info--
> > > > > > Version  : 3.6.8
> > > > > > Compiler : GCC 7.3.0
> > > > > > Build: ('default', 'Dec 30 2018 01:22:34')
> > > > > > Arch : ('64bit', '')
> > > > > > Pip Info---
> > > > > > Version  : 19.0.3
> > > > > > Directory:
> > > > > >
> > >
> /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > > > > > --MXNet Info---
> > > > > > Version  : 1.5.0
> > > > > > Directory:
> /home/ubuntu/ws/incubator-mxnet/python/mxnet
> > > > > > Hashtag not found. Not installed from pre-built package.
> > > > > > --System Info--
> > > > > > Platform :
> > Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > > > > > system   : Linux
> > > > > > node : ip-172-31-32-129
> > > > > > release  : 4.4.0-1085-aws
> > > > > > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC
> 2019
> > > > > > -

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-28 Thread Davydenko, Denis
>
> > > > > The performance result I got as below:
> > > > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > > > > > real 12m10.856s
> > > > > > user 234m49.576s
> > > > > sys 4m38.044s
> > > > >
> > > > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > > > > > real 12m52.140s
> > > > > > user 246m30.740s
> > > > > sys 5m8.188s
> > > > >
> > > > > > Looking at the profiling data, most of the ops have the same perf between
> > > > > > v1.4 and v1.5, but some ops like "_backward_BatchNorm" and "Pooling" are
> > > > > > ~1.37x slower on v1.5 compared with v1.4.
> > > > > Will do further analysis on these ops.
> > > > >
> > > > > Here's the hardware/OS info from my side:
> > > > > --Python Info--
> > > > > Version  : 3.6.8
> > > > > Compiler : GCC 7.3.0
> > > > > Build: ('default', 'Dec 30 2018 01:22:34')
> > > > > Arch : ('64bit', '')
> > > > > Pip Info---
> > > > > Version  : 19.0.3
> > > > > Directory:
> > > > >
> > /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > > > > --MXNet Info---
> > > > > Version  : 1.5.0
> > > > > Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
> > > > > Hashtag not found. Not installed from pre-built package.
> > > > > --System Info--
> > > > > Platform :
> Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > > > > system   : Linux
> > > > > node : ip-172-31-32-129
> > > > > release  : 4.4.0-1085-aws
> > > > > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> > > > > --Hardware Info--
> > > > > machine  : x86_64
> > > > > processor: x86_64
> > > > > Architecture:  x86_64
> > > > > CPU op-mode(s):32-bit, 64-bit
> > > > > Byte Order:Little Endian
> > > > > CPU(s):72
> > > > > On-line CPU(s) list:   0-71
> > > > > Thread(s) per core:2
> > > > > Core(s) per socket:18
> > > > > Socket(s): 2
> > > > > NUMA node(s):  2
> > > > > Vendor ID:     GenuineIntel
> > > > > CPU family:6
> > > > > Model: 85
> > > > > Model name:Intel(R) Xeon(R) Platinum 8124M CPU @
> > 3.00GHz
> > > > > Stepping:  3
> > > > > CPU MHz:   3000.000
> > > > > BogoMIPS:  6000.00
> > > > > Hypervisor vendor: KVM
> > > > > Virtualization type:   full
    > >     > > > L1d cache:     32K
> >     > > > L1i cache: 32K
> > > > > L2 cache:  1024K
> > > > > L3 cache:  25344K
> > > > > NUMA node0 CPU(s): 0-17,36-53
> > > > > NUMA node1 CPU(s): 18-35,54-71
> > > > > Flags: fpu vme de pse tsc msr pae mce cx8 apic
> > sep mtrr
> > > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall
> nx
> > > pdpe1gb
> > > > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
> > nonstop_tsc
> > > > > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16
> > pcid
> > > sse4_1
> > > > > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
> f16c
> > rdrand
> > > > > hypervisor lahf_lm abm 3dnowprefetch invpcid_si

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-28 Thread Lai Wei
latform :
> Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > > > > system   : Linux
> > > > > node : ip-172-31-32-129
> > > > > release  : 4.4.0-1085-aws
> > > > > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> > > > > --Hardware Info--
> > > > > machine  : x86_64
> > > > > processor: x86_64
> > > > > Architecture:  x86_64
> > > > > CPU op-mode(s):32-bit, 64-bit
> > > > > Byte Order:Little Endian
> > > > > CPU(s):72
> > > > > On-line CPU(s) list:   0-71
> > > > > Thread(s) per core:2
> >     > > > Core(s) per socket:18
> >     > > > Socket(s): 2
> > > > > NUMA node(s):  2
> > > > > Vendor ID: GenuineIntel
> > > > > CPU family:6
> > > > > Model: 85
> > > > > Model name:Intel(R) Xeon(R) Platinum 8124M CPU @
> > 3.00GHz
> > > > > Stepping:  3
> > > > > CPU MHz:           3000.000
> >     > > > BogoMIPS:  6000.00
> > > > > Hypervisor vendor: KVM
> > > > > Virtualization type:   full
> > > > > L1d cache: 32K
> > > > > L1i cache: 32K
> > > > > L2 cache:  1024K
> > > > > L3 cache:  25344K
> > > > > NUMA node0 CPU(s): 0-17,36-53
> > > > > NUMA node1 CPU(s): 18-35,54-71
> > > > > Flags: fpu vme de pse tsc msr pae mce cx8 apic
> > sep mtrr
> > > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall
> nx
> > > pdpe1gb
> > > > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
> > nonstop_tsc
> > > > > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16
> > pcid
> > > sse4_1
> > > > > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
> f16c
> > rdrand
> > > > > hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser
> > fsgsbase
> > > > > tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f
> > rdseed
> > > adx
> > > > > smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat
> pku
> > > > > --Network Test--
> > > > >
> > > > >
> > > > > -Ciyong
> > > > >
> > > > >
> > > > > -Original Message-
> > > > > From: Zhao, Patric [mailto:patric.z...@intel.com]
> > > > > Sent: Thursday, June 27, 2019 9:55 AM
> > > > > To: dev@mxnet.incubator.apache.org
> > > > > Cc: d...@mxnet.apache.org
> > > > > Subject: RE: [VOTE] Release Apache MXNet (incubating) version
> > 1.5.0.rc1
> > > > >
> > > > > Could we run more epochs to see the performance difference or
> > profiling
> > > > > the difference between good and bad run?
> > > > >
> > > > > > -Original Message-
> > > > > > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> > > > > > Sent: Thursday, June 27, 2019 9:35 AM
> > > > > > To: dev@mxnet.incubator.apache.org
> > > > > > Cc: d...@mxnet.apache.org
> > > > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version
> > > > > > 1.5.0.rc1
> > > > > >
> > > > > > I run again and the gap is again bigger, I guess we need to
> > average
> > > > > > out the times across several runs:
> > > > > >
> > > > > > piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> > > > > > (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py
> > --epochs 5
> > > > > > && time ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> > > > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> > > > > > ImageRecordIOParser2:
> > > > > > /home/piotr/de

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-28 Thread Pedro Larroy
> > > I will start a new rc2 vote, please help to validate.
> > >
> > > Thanks!
> > >
> > >
> > > On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong  >
> > wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > > I was able to reproduce a similar result (v1.5 is ~5.6% slower than
> > > > > v1.4; I was using 18 cores for computing) with your script on C5.18xlarge.
> > > > > But the cores need to be bound with the commands below when running the script
> > > > > (without setting the env variables, I got a close time (<1%) with v1.5 and
> > > > > v1.4):
> > > > > export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> > > > > export OMP_NUM_THREADS=18
> > > >
> > > > Did you set any env variables during running?
> > > >
> > > > The performance result I got as below:
> > > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > > > > real 12m10.856s
> > > > > user 234m49.576s
> > > > sys 4m38.044s
> > > >
> > > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > > > > real 12m52.140s
> > > > > user 246m30.740s
> > > > sys 5m8.188s
> > > >
> > > > > Looking at the profiling data, most of the ops have the same perf between
> > > > > v1.4 and v1.5, but some ops like "_backward_BatchNorm" and "Pooling" are
> > > > > ~1.37x slower on v1.5 compared with v1.4.
> > > > Will do further analysis on these ops.
> > > >
> > > > Here's the hardware/OS info from my side:
> > > > --Python Info--
> > > > Version  : 3.6.8
> > > > Compiler : GCC 7.3.0
> > > > Build: ('default', 'Dec 30 2018 01:22:34')
> > > > Arch : ('64bit', '')
> > > > Pip Info---
> > > > Version  : 19.0.3
> > > > Directory:
> > > >
> /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > > > --MXNet Info---
> > > > Version  : 1.5.0
> > > > Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
> > > > Hashtag not found. Not installed from pre-built package.
> > > > --System Info--
> > > > Platform : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > > > system   : Linux
> > > > node : ip-172-31-32-129
> > > > release  : 4.4.0-1085-aws
> > > > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> > > > --Hardware Info--
> > > > machine  : x86_64
> > > > processor: x86_64
> > > > Architecture:  x86_64
> > > > CPU op-mode(s):32-bit, 64-bit
> > > > Byte Order:Little Endian
> > > > CPU(s):72
> > > > On-line CPU(s) list:   0-71
> > > > Thread(s) per core:2
> > > > Core(s) per socket:18
> > > > Socket(s): 2
> > > > NUMA node(s):  2
> > > > Vendor ID: GenuineIntel
> > > > CPU family:6
> > > > Model: 85
> > > > Model name:Intel(R) Xeon(R) Platinum 8124M CPU @
> 3.00GHz
> > > > Stepping:  3
> > > > CPU MHz:   3000.000
> > > > BogoMIPS:  6000.00
> > > > Hypervisor vendor:     KVM
> > > > Virtualization type:   full
> > > > L1d cache: 32K
> > > > L1i cache: 32K
> > > > L2 cache:  1024K
> > > > L3 cache:  25344K
> > > > NUMA node0 CPU(s): 0-17,36-53
> > > > NUMA node1 CPU(s): 18-35,54-71
> > > > Flags: fpu vme de pse tsc msr pae mce cx8 apic
> sep mtrr
> > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx
> > pdpe1gb
> > > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
> n

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-27 Thread Manu Seth
> > > Will do further analysis on these ops.
> > >
> > > Here's the hardware/OS info from my side:
> > > --Python Info--
> > > Version  : 3.6.8
> > > Compiler : GCC 7.3.0
> > > Build: ('default', 'Dec 30 2018 01:22:34')
> > > Arch : ('64bit', '')
> > > Pip Info---
> > > Version  : 19.0.3
> > > Directory:
> > >
/home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > > --MXNet Info---
> > > Version  : 1.5.0
> > > Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
> > > Hashtag not found. Not installed from pre-built package.
> > > --System Info--
> > > Platform : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > > system   : Linux
> > > node : ip-172-31-32-129
> > > release  : 4.4.0-1085-aws
> > > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> > > --Hardware Info--
> > > machine  : x86_64
> > > processor: x86_64
> > > Architecture:  x86_64
> > > CPU op-mode(s):32-bit, 64-bit
> > > Byte Order:Little Endian
> > > CPU(s):72
> > > On-line CPU(s) list:   0-71
> > > Thread(s) per core:2
> > > Core(s) per socket:18
> > > Socket(s): 2
> > > NUMA node(s):  2
> > > Vendor ID: GenuineIntel
> > > CPU family:6
> > > Model: 85
> > > Model name:Intel(R) Xeon(R) Platinum 8124M CPU @
3.00GHz
> > > Stepping:  3
> > > CPU MHz:   3000.000
> > > BogoMIPS:  6000.00
> > > Hypervisor vendor: KVM
> > > Virtualization type:   full
> > > L1d cache: 32K
> > > L1i cache: 32K
> > > L2 cache:  1024K
> > > L3 cache:  25344K
> > > NUMA node0 CPU(s): 0-17,36-53
> > > NUMA node1 CPU(s): 18-35,54-71
> > > Flags: fpu vme de pse tsc msr pae mce cx8 apic
sep mtrr
> > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx
> pdpe1gb
    > > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
nonstop_tsc
> > > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16
pcid
> sse4_1
> > > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
rdrand
> > > hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser
fsgsbase
> > > tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f
rdseed
> adx
> > > smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
> > > --Network Test--
> > >
> > >
> > > -Ciyong
> > >
> > >
> > > -Original Message-
> > > From: Zhao, Patric [mailto:patric.z...@intel.com]
> > > Sent: Thursday, June 27, 2019 9:55 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Cc: d...@mxnet.apache.org
> > > Subject: RE: [VOTE] Release Apache MXNet (incubating) version
1.5.0.rc1
> > >
> > > Could we run more epochs to see the performance difference or
profiling
> > > the difference between good and bad run?
> > >
> > > > -Original Message-
> > > > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> > > > Sent: Thursday, June 27, 2019 9:35 AM
> > > > To: dev@mxnet.incubator.apache.org
> > > > Cc: d...@mxnet.apache.org
> > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version
> > > > 1.5.0.rc1
> > > >
> > > > I run again and the gap is again bigger, I guess we need to
average
> > > > out the times across several runs:
> > > >
> > > > piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> > > > (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py
--epochs 5
> > > > && time ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> > > > ImageRecor

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-27 Thread sandeep krishnamurthy
72
> > > On-line CPU(s) list:   0-71
> > > Thread(s) per core:2
> > > Core(s) per socket:18
> > > Socket(s): 2
> > > NUMA node(s):  2
> > > Vendor ID: GenuineIntel
> > > CPU family:6
> > > Model: 85
> > > Model name:Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> > > Stepping:  3
> > > CPU MHz:   3000.000
> > > BogoMIPS:  6000.00
> > > Hypervisor vendor: KVM
> > > Virtualization type:   full
> > > L1d cache: 32K
> > > L1i cache: 32K
> > > L2 cache:  1024K
> > > L3 cache:  25344K
> > > NUMA node0 CPU(s): 0-17,36-53
> > > NUMA node1 CPU(s): 18-35,54-71
> > > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx
> pdpe1gb
> > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc
> > > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid
> sse4_1
> > > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
> > > hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase
> > > tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed
> adx
> > > smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
> > > --Network Test--
> > >
> > >
> > > -Ciyong
> > >
> > >
> > > -Original Message-
> > > From: Zhao, Patric [mailto:patric.z...@intel.com]
> > > Sent: Thursday, June 27, 2019 9:55 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Cc: d...@mxnet.apache.org
> > > Subject: RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> > >
> > > Could we run more epochs to see the performance difference or profiling
> > > the difference between good and bad run?
> > >
> > > > -Original Message-
> > > > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> > > > Sent: Thursday, June 27, 2019 9:35 AM
> > > > To: dev@mxnet.incubator.apache.org
> > > > Cc: d...@mxnet.apache.org
> > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version
> > > > 1.5.0.rc1
> > > >
> > > > I run again and the gap is again bigger, I guess we need to average
> > > > out the times across several runs:
> > > >
> > > > piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> > > > (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5
> > > > && time ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> > > > ImageRecordIOParser2:
> > > > /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4
> threads
> > > > for decoding..
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> > > > ImageRecordIOParser2:
> > > > /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads
> > > > for decoding..
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > > > [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > > > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> > > > lr_schedule: {0: 0.05, 82: 0.005001, 123: 0.0005, 300:
> > > > 0.0001} Epoch 0, Changed learning rate to 0.05 [23:17:09]
> > > > ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > 147456 bytes with malloc directly
> > > > [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > 589824 bytes with malloc directly
> > > > [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > 2359296 bytes with malloc directly
> > > > [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> > > > 9437184 bytes with malloc directly
> > > > 

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-27 Thread Pedro Larroy
I will try to run a few benchmarks on a bare-metal instance tonight to
remove virtualization variance from the measurements and provide some
numbers.

Please propose a set of models / examples that would be desirable to
run before the release, and provide a link to an easy-to-run script
with instructions so we can validate the release better.

Thank you.

On Thu, Jun 27, 2019 at 10:01 AM Lai Wei  wrote:
>
> Dear @dev,
>
> I m cancelling the vote for cached op fix:
>
> https://github.com/apache/incubator-mxnet/pull/15298
>
> As for the possible cpu training regression, it looks like not a blocker
> for now.
>
> I will start a new rc2 vote, please help to validate.
>
> Thanks!
>
>
> On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong  wrote:
>
> > Hi Pedro,
> >
> > I was able to reproduced the similar result (v1.5 is ~%5.6 slower than
> > v1.4, I was using 18 cores for computing) with your script on C5.18xlarge.
> > But need to bind the cores with below command when running the script,
> > (without setting the env variables, I got a close time (<1%) with v1.5 and
> > v1.4)
> > export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> > export OMP_NUM_THREADS=18
> >
> > Did you set any env variables during running?
> >
> > The performance result I got as below:
> > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > real12m10.856s
> > user234m49.576s
> > sys 4m38.044s
> >
> > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > real12m52.140s
> > user246m30.740s
> > sys 5m8.188s
> >
> > As I looked at the profiling data, most of the ops have same perf between
> > v1.4 and v1.5. But some ops like " _backward_BatchNorm" and "Pooling" is
> > ~1.37x slower on v1.5 compared with v1.4.
> > Will do further analysis on these ops.
> >
> > Here's the hardware/OS info from my side:
> > --Python Info--
> > Version  : 3.6.8
> > Compiler : GCC 7.3.0
> > Build: ('default', 'Dec 30 2018 01:22:34')
> > Arch : ('64bit', '')
> > Pip Info---
> > Version  : 19.0.3
> > Directory:
> > /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > --MXNet Info---
> > Version  : 1.5.0
> > Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
> > Hashtag not found. Not installed from pre-built package.
> > --System Info--
> > Platform : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > system   : Linux
> > node : ip-172-31-32-129
> > release  : 4.4.0-1085-aws
> > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> > --Hardware Info--
> > machine  : x86_64
> > processor: x86_64
> > Architecture:  x86_64
> > CPU op-mode(s):32-bit, 64-bit
> > Byte Order:Little Endian
> > CPU(s):72
> > On-line CPU(s) list:   0-71
> > Thread(s) per core:2
> > Core(s) per socket:18
> > Socket(s): 2
> > NUMA node(s):  2
> > Vendor ID: GenuineIntel
> > CPU family:6
> > Model: 85
> > Model name:Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> > Stepping:  3
> > CPU MHz:   3000.000
> > BogoMIPS:  6000.00
> > Hypervisor vendor: KVM
> > Virtualization type:   full
> > L1d cache: 32K
> > L1i cache: 32K
> > L2 cache:  1024K
> > L3 cache:  25344K
> > NUMA node0 CPU(s): 0-17,36-53
> > NUMA node1 CPU(s): 18-35,54-71
> > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb
> > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc
> > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1
> > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
> > hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase
> > tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx
> > smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
> > --Network Test--
> >
> >
> > -Ciyong
> >
> >
> > -Original Message-
> > From: Zhao, Patric [mailto:patric.z...@intel.com]
> > Sent: Thursday, June 27, 2019 9:55 AM
> >

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-27 Thread Lai Wei
Dear @dev,

I'm cancelling the vote for the cached op fix:

https://github.com/apache/incubator-mxnet/pull/15298

As for the possible CPU training regression, it does not look like a
blocker for now.

I will start a new rc2 vote; please help to validate.

Thanks!


On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong  wrote:

> Hi Pedro,
>
> I was able to reproduced the similar result (v1.5 is ~%5.6 slower than
> v1.4, I was using 18 cores for computing) with your script on C5.18xlarge.
> But need to bind the cores with below command when running the script,
> (without setting the env variables, I got a close time (<1%) with v1.5 and
> v1.4)
> export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> export OMP_NUM_THREADS=18
>
> Did you set any env variables during running?
>
> The performance result I got as below:
> 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> real12m10.856s
> user234m49.576s
> sys 4m38.044s
>
> 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> real12m52.140s
> user246m30.740s
> sys 5m8.188s
>
> As I looked at the profiling data, most of the ops have same perf between
> v1.4 and v1.5. But some ops like " _backward_BatchNorm" and "Pooling" is
> ~1.37x slower on v1.5 compared with v1.4.
> Will do further analysis on these ops.
>
> Here's the hardware/OS info from my side:
> --Python Info--
> Version  : 3.6.8
> Compiler : GCC 7.3.0
> Build: ('default', 'Dec 30 2018 01:22:34')
> Arch : ('64bit', '')
> Pip Info---
> Version  : 19.0.3
> Directory:
> /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> --MXNet Info---
> Version  : 1.5.0
> Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
> Hashtag not found. Not installed from pre-built package.
> --System Info--
> Platform : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> system   : Linux
> node : ip-172-31-32-129
> release  : 4.4.0-1085-aws
> version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> --Hardware Info--
> machine  : x86_64
> processor: x86_64
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):72
> On-line CPU(s) list:   0-71
> Thread(s) per core:2
> Core(s) per socket:18
> Socket(s): 2
> NUMA node(s):  2
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 85
> Model name:Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> Stepping:  3
> CPU MHz:   3000.000
> BogoMIPS:  6000.00
> Hypervisor vendor: KVM
> Virtualization type:   full
> L1d cache: 32K
> L1i cache: 32K
> L2 cache:  1024K
> L3 cache:  25344K
> NUMA node0 CPU(s): 0-17,36-53
> NUMA node1 CPU(s): 18-35,54-71
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb
> rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc
> aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1
> sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
> hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase
> tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx
> smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
> --Network Test------
>
>
> -Ciyong
>
>
> -Original Message-
> From: Zhao, Patric [mailto:patric.z...@intel.com]
> Sent: Thursday, June 27, 2019 9:55 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
>
> Could we run more epochs to see the performance difference or profiling
> the difference between good and bad run?
>
> > -Original Message-
> > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> > Sent: Thursday, June 27, 2019 9:35 AM
> > To: dev@mxnet.incubator.apache.org
> > Cc: d...@mxnet.apache.org
> > Subject: Re: [VOTE] Release Apache MXNet (incubating) version
> > 1.5.0.rc1
> >
> > I run again and the gap is again bigger, I guess we need to average
> > out the times across several runs:
> >
> > piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> > (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5
> > && time ~/mxnet_1.5/py3_venv/b

RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-27 Thread Chen, Ciyong
Hi Pedro,

I was able to reproduce a similar result (v1.5 is ~5.6% slower than v1.4; I 
was using 18 cores for computing) with your script on C5.18xlarge.
However, I needed to bind the cores with the commands below when running the 
script (without setting the env variables, the times for v1.5 and v1.4 were 
within 1% of each other):
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=18
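
For anyone driving the benchmark from a script rather than an interactive shell, the same two variables can be passed to a child process. This is only a sketch: the helper name pinned_env is hypothetical, and the child command is a stand-in that echoes one variable, not the actual cifar10.py invocation.

```python
import os
import subprocess
import sys


def pinned_env(num_threads=18):
    """Copy the current environment and add the two settings quoted above."""
    env = dict(os.environ)
    env["KMP_AFFINITY"] = "granularity=fine,noduplicates,compact,1,0"
    env["OMP_NUM_THREADS"] = str(num_threads)
    return env


# Stand-in child process: it only prints the variable it inherited.
# Replace the command list with the real benchmark command when using this.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"],
    env=pinned_env(),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

Passing env= explicitly keeps the settings local to the benchmark process, so a run with and a run without binding can be launched from the same shell.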

Did you set any env variables when running the script?

The performance results I got are below:
1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
real    12m10.856s
user    234m49.576s
sys     4m38.044s

2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
real    12m52.140s
user    246m30.740s
sys     5m8.188s

Looking at the profiling data, most of the ops have the same performance on 
v1.4 and v1.5, but some ops, such as "_backward_BatchNorm" and "Pooling", are 
~1.37x slower on v1.5 than on v1.4.
I will do further analysis on these ops.
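
For reference, the ~5.6% figure can be rechecked from the two wall-clock ("real") times reported in this message. The snippet below is a minimal sketch; parse_time is a hypothetical helper that only handles the "<minutes>m<seconds>s" form printed by the shell's time builtin.

```python
import re


def parse_time(value):
    """Convert a duration such as '12m10.856s' to seconds."""
    match = re.fullmatch(r"(\d+)m([0-9.]+)s", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Wall-clock times copied from the two runs in this message.
real_v14 = parse_time("12m10.856s")  # 1.4.1.rc0
real_v15 = parse_time("12m52.140s")  # 1.5.0.rc1

# Relative slowdown of v1.5 with v1.4 as the baseline.
slowdown = (real_v15 - real_v14) / real_v14
print(f"v1.5 is {slowdown:.1%} slower than v1.4 (wall-clock)")
```

The same helper can be applied to the user and sys lines if CPU time rather than wall-clock time is of interest.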

Here's the hardware/OS info from my side:
--Python Info--
Version  : 3.6.8
Compiler : GCC 7.3.0
Build    : ('default', 'Dec 30 2018 01:22:34')
Arch     : ('64bit', '')
--Pip Info--
Version  : 19.0.3
Directory: /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
--MXNet Info--
Version  : 1.5.0
Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
--System Info--
Platform : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
system   : Linux
node     : ip-172-31-32-129
release  : 4.4.0-1085-aws
version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
--Hardware Info--
machine  : x86_64
processor: x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:              3
CPU MHz:               3000.000
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm 
constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf 
tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic 
movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm 
abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep 
bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd 
xsaveopt xsavec xgetbv1 ida arat pku
--Network Test--


-Ciyong


-Original Message-
From: Zhao, Patric [mailto:patric.z...@intel.com] 
Sent: Thursday, June 27, 2019 9:55 AM
To: dev@mxnet.incubator.apache.org
Cc: d...@mxnet.apache.org
Subject: RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

Could we run more epochs to see the performance difference or profiling the 
difference between good and bad run?

> -Original Message-
> From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> Sent: Thursday, June 27, 2019 9:35 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: Re: [VOTE] Release Apache MXNet (incubating) version 
> 1.5.0.rc1
> 
> I run again and the gap is again bigger, I guess we need to average 
> out the times across several runs:
> 
> piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5 
> && time ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5 
> [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> ImageRecordIOParser2:
> /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads 
> for decoding..
> [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image 
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image 
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed 
> [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> ImageRecordIOParser2:
> /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads 
> for decoding..
> [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image 
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:17:09] ../src/io/

RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-26 Thread Zhao, Patric
Could we run more epochs to see the performance difference, or profile the 
difference between a good and a bad run?
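
One way to average the times across several runs, as suggested in the quoted message below, is a small timing harness. This is a sketch under stated assumptions: time_command and summarize are hypothetical helper names, and the command shown is a trivial placeholder rather than the real cifar10.py invocation.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (mean, sample standard deviation) of the timings."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, stdev


# Placeholder command so the sketch runs anywhere; substitute the benchmark
# command for each MXNet virtualenv to compare the two versions.
samples = time_command([sys.executable, "-c", "pass"], runs=3)
mean, stdev = summarize(samples)
print(f"mean={mean:.3f}s stdev={stdev:.3f}s over {len(samples)} runs")
```

Reporting the standard deviation next to the mean makes it easier to tell whether a gap of a few percent is larger than the run-to-run noise.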

> -Original Message-
> From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> Sent: Thursday, June 27, 2019 9:35 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> 
> I run again and the gap is again bigger, I guess we need to average out the
> times across several runs:
> 
> piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5
> && time ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> ImageRecordIOParser2:
> /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads for
> decoding..
> [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image from
> /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image from
> /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> ImageRecordIOParser2:
> /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads for
> decoding..
> [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image from
> /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image from
> /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> lr_schedule: {0: 0.05, 82: 0.005001, 123: 0.0005, 300: 0.0001}
> Epoch 0, Changed learning rate to 0.05
> [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> 147456 bytes with malloc directly
> [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> 589824 bytes with malloc directly
> [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> 2359296 bytes with malloc directly
> [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> 9437184 bytes with malloc directly
> Epoch 0, Batch 199, Speed=384.149839
> Epoch 0, Duration=140.919567
> Epoch 0, Training accuracy=0.115169
> Epoch 0, Validation accuracy=0.141317
> Epoch 1, Batch 199, Speed=433.380512
> Epoch 1, Duration=119.553233
> Epoch 1, Training accuracy=0.170956
> Epoch 1, Validation accuracy=0.216146
> Epoch 2, Batch 199, Speed=434.864699
> Epoch 2, Duration=123.278490
> Epoch 2, Training accuracy=0.209455
> Epoch 2, Validation accuracy=0.247296
> Epoch 3, Batch 199, Speed=433.401854
> Epoch 3, Duration=118.327797
> Epoch 3, Training accuracy=0.248701
> Epoch 3, Validation accuracy=0.302083
> Epoch 4, Batch 199, Speed=419.713707
> Epoch 4, Duration=126.468409
> Epoch 4, Training accuracy=0.260949
> Epoch 4, Validation accuracy=0.269030
> 
> real    10m55.796s
> user    399m33.567s
> sys     13m55.904s
> [23:28:04] ../src/io/iter_image_recordio_2.cc:172:
> ImageRecordIOParser2:
> /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads for
> decoding..
> [23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean image from
> /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean image from
> /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> [23:28:04] ../src/io/iter_image_recordio_2.cc:172:
> ImageRecordIOParser2:
> /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads for
> decoding..
> [23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean image from
> /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean image from
> /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> lr_schedule: {0: 0.05, 82: 0.005001, 123: 0.0005, 300: 0.0001}
> Epoch 0, Changed learning rate to 0.05
> Epoch 0, Batch 199, Speed=419.039188
> Epoch 0, Duration=143.934903
> Epoch 0, Training accuracy=0.122542
> Epoch 0, Validation accuracy=0.164359
> Epoch 1, Batch 199, Speed=445.257048
> Epoch 1, Duration=135.248399
> Epoch 1, Training accuracy=0.178828
> Epoch 1, Validation accuracy=0.199419
> Epoch 2, Batch 199, Speed=447.115215
> Epoch 2, Duration=132.003770
> Epoch 2, Training accuracy=0.217808
> Epoch 2, Validation accuracy=0.233073
> Epoch 3, Batch 199, Speed=441.079477
> Epoch 3, Duration=126.543316
> Epoch 3, Training accuracy=0.248102
> Epoch 3, Validation accuracy=0.293870
> Epoch 4, Batch 199, Speed=449.329787
> Epoch 4, Duration=138.398325
> Epoch 4, Training accuracy=0.270021
> Epoch 4, Validation accuracy=0.311498
> 
> real    11m45.329s
> user    426m13.908s
> sys     16m45.093s
> 
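
The two quoted runs can also be compared epoch by epoch without rerunning anything, by pulling the Duration entries out of the logs. The sketch below is illustrative only: epoch_durations and the regular expression are assumptions, not part of the benchmark script, and the log strings are the Duration lines copied from the quoted output (1.4 run first, then 1.5).

```python
import re

DURATION_RE = re.compile(r"Epoch (\d+), Duration=([0-9.]+)")


def epoch_durations(log_text):
    """Return {epoch: seconds} for every 'Epoch N, Duration=...' entry."""
    return {int(e): float(s) for e, s in DURATION_RE.findall(log_text)}


# Duration entries copied from the quoted 1.4 run.
LOG_V14 = """
Epoch 0, Duration=140.919567
Epoch 1, Duration=119.553233
Epoch 2, Duration=123.278490
Epoch 3, Duration=118.327797
Epoch 4, Duration=126.468409
"""

# Duration entries copied from the quoted 1.5 run.
LOG_V15 = """
Epoch 0, Duration=143.934903
Epoch 1, Duration=135.248399
Epoch 2, Duration=132.003770
Epoch 3, Duration=126.543316
Epoch 4, Duration=138.398325
"""

v14 = epoch_durations(LOG_V14)
v15 = epoch_durations(LOG_V15)

# Per-epoch ratio (values above 1.0 mean the 1.5 run took longer).
for epoch in sorted(v14):
    print(f"epoch {epoch}: v1.5/v1.4 = {v15[epoch] / v14[epoch]:.3f}")

total_ratio = sum(v15.values()) / sum(v14.values())
print(f"total: v1.5/v1.4 = {total_ratio:.3f}")
```

Because the regular expression does not depend on line breaks, it also works on log output that a mail client has rewrapped into a single paragraph.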

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-26 Thread Pedro Larroy
mance.
> > >
> > > Here's the command I used to collect the time:
> > > python train_cifar10.py --num-epoch=5
> > >
> > > 1) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > >     real9m4.880s
> > > user333m13.340s
> > > sys 14m36.100s
> > >
> > > 2) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > > real9m2.155s
> > > user329m37.092s
> > > sys 16m8.668s
> > >
> > > -Ciyong
> > >
> > >
> > > -Original Message-
> > > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> > > Sent: Wednesday, June 26, 2019 6:28 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Cc: d...@mxnet.apache.org
> > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> > >
> > > Hi these were my build flags and system info:
> > >
> > >
> > > --- # CMake configuration
> > > USE_CUDA: "OFF" # Build with CUDA support
> > > USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
> > > USE_NCCL: "OFF" # Use NVidia NCCL with CUDA
> > > USE_OPENCV: "ON" # Build with OpenCV support
> > > USE_OPENMP: "ON" # Build with Openmp support
> > > USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT 
> > > for search path
> > > USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
> > > USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects 
> > > support if "ON"
> > > USE_LAPACK: "ON" # Build with lapack support
> > > USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found
> > > USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL found) IF 
> > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL found) IF 
> > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
> > > USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
> > > USE_JEMALLOC: "ON" # Build with Jemalloc support
> > > USE_PROFILER: "ON" # Build with Profiler support
> > > USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
> > > USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
> > > USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
> > > USE_CPP_PACKAGE: "OFF" # Build C++ Package
> > > USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
> > > USE_GPROF: "OFF" # Compile with gprof (profiling) flag
> > > USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports 
> > > it
> > > USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could 
> > > set VTUNE_ROOT for search path
> > > ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
> > > BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
> > > INSTALL_EXAMPLES: "OFF" # Install the example source files.
> > > USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
> > > USE_TENSORRT: "OFF" # Enable infeference optimization with TensorRT.
> > > USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
> > > ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric 
> > > output
> > > CMAKE_BUILD_TYPE: "Release"
> > > CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
> > > CMAKE_C_COMPILER_LAUNCHER: "ccache"
> > > CMAKE_CXX_COMPILER_LAUNCHER: "ccache"
> > >
> > > commit 4d9667121ae6fb643f2a02ab15e25231ed756cde (HEAD, tag: 1.5.0.rc1,
> > > upstream/v1.5.x)
> > > commit 1a7199691f5cbc6012bb53eecbf884bed5ae6590 (HEAD, tag: 1.4.1.rc0,
> > > upstream/v1.4.x)
> > >
> > > curl http://169.254.169.254/latest/meta-data/instance-type
> > > c5d.18xlarge
> > >
> > >
> > > Version  : 3.6.7
> > > Compiler : GCC 8.2.0
> > > Build: ('default', 'Oct 22 2018 11:32:17')
> > > Arch : ('64bit', 'ELF')
> > > Pip Info---
> > > Version  : 19.1.1
> > > Directory: 
> > > /home/piotr/mxnet_1.5/py3_venv/lib/python3.6/site-packages/pip
> > > --

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-26 Thread Pedro Larroy
 you point me 
> > the link to get your script (cifar10.py)?
> > Or you can also have a try with MXNet's script (train_cifar10.py) and see 
> > the performance.
> >
> > Here's the command I used to collect the time:
> > python train_cifar10.py --num-epoch=5
> >
> > 1) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > real9m4.880s
> > user333m13.340s
> > sys 14m36.100s
> >
> > 2) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > real9m2.155s
> > user329m37.092s
> > sys 16m8.668s
> >
> > -Ciyong
> >
> >
> > -----Original Message-
> > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> > Sent: Wednesday, June 26, 2019 6:28 AM
> > To: dev@mxnet.incubator.apache.org
> > Cc: d...@mxnet.apache.org
> > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> >
> > Hi these were my build flags and system info:
> >
> >
> > --- # CMake configuration
> > USE_CUDA: "OFF" # Build with CUDA support
> > USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
> > USE_NCCL: "OFF" # Use NVidia NCCL with CUDA
> > USE_OPENCV: "ON" # Build with OpenCV support
> > USE_OPENMP: "ON" # Build with Openmp support
> > USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT for 
> > search path
> > USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
> > USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects 
> > support if "ON"
> > USE_LAPACK: "ON" # Build with lapack support
> > USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found
> > USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL found) IF 
> > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL found) IF 
> > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
> > USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
> > USE_JEMALLOC: "ON" # Build with Jemalloc support
> > USE_PROFILER: "ON" # Build with Profiler support
> > USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
> > USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
> > USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
> > USE_CPP_PACKAGE: "OFF" # Build C++ Package
> > USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
> > USE_GPROF: "OFF" # Compile with gprof (profiling) flag
> > USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
> > USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could 
> > set VTUNE_ROOT for search path
> > ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
> > BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
> > INSTALL_EXAMPLES: "OFF" # Install the example source files.
> > USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
> > USE_TENSORRT: "OFF" # Enable infeference optimization with TensorRT.
> > USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
> > ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric 
> > output
> > CMAKE_BUILD_TYPE: "Release"
> > CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
> > CMAKE_C_COMPILER_LAUNCHER: "ccache"
> > CMAKE_CXX_COMPILER_LAUNCHER: "ccache"
> >
> > commit 4d9667121ae6fb643f2a02ab15e25231ed756cde (HEAD, tag: 1.5.0.rc1,
> > upstream/v1.5.x)
> > commit 1a7199691f5cbc6012bb53eecbf884bed5ae6590 (HEAD, tag: 1.4.1.rc0,
> > upstream/v1.4.x)
> >
> > curl http://169.254.169.254/latest/meta-data/instance-type
> > c5d.18xlarge
> >
> >
> > Version  : 3.6.7
> > Compiler : GCC 8.2.0
> > Build: ('default', 'Oct 22 2018 11:32:17')
> > Arch : ('64bit', 'ELF')
> > Pip Info---
> > Version  : 19.1.1
> > Directory: 
> > /home/piotr/mxnet_1.5/py3_venv/lib/python3.6/site-packages/pip
> > --MXNet Info---
> > Version  : 1.5.0
> > Directory: /home/piotr/mxnet_1.5/python/mxnet
> > Hashtag not found. Not installed from pre-built package.
> > --System Info--
> > Platform : Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-26 Thread Pedro Larroy
Hi Ciyong, thanks for trying to reproduce:

I used this one:
https://github.com/awslabs/deeplearning-benchmark/blob/master/dawnbench/cifar10.py

Could you provide hardware and OS details?

I will rerun and repost numbers in a few minutes.

Pedro.

On Wed, Jun 26, 2019 at 4:18 AM Chen, Ciyong  wrote:
>
> Hi Pedro,
>
> I'm looking at this case, and using the script of 
> "incubator-mxnet/example/image-classification/train_cifar10.py" to get
> the timing data, but seems there's not much difference between mxnet 
> 1.4.1.rc0 and 1.5.0.rc1 on C5.18xlarge.
>
> Not sure if there's any difference in the python script, can you point me the 
> link to get your script (cifar10.py)?
> Or you can also have a try with MXNet's script (train_cifar10.py) and see the 
> performance.
>
> Here's the command I used to collect the time:
> python train_cifar10.py --num-epoch=5
>
> 1) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> real9m4.880s
> user333m13.340s
> sys 14m36.100s
>
> 2) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> real9m2.155s
> user329m37.092s
> sys 16m8.668s
>
> -Ciyong
>
>
> -Original Message-
> From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> Sent: Wednesday, June 26, 2019 6:28 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
>
> Hi these were my build flags and system info:
>
>
> --- # CMake configuration
> USE_CUDA: "OFF" # Build with CUDA support
> USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
> USE_NCCL: "OFF" # Use NVidia NCCL with CUDA
> USE_OPENCV: "ON" # Build with OpenCV support
> USE_OPENMP: "ON" # Build with Openmp support
> USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT for 
> search path
> USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
> USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects 
> support if "ON"
> USE_LAPACK: "ON" # Build with lapack support
> USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found
> USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL found) IF 
> USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL found) IF 
> USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
> USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
> USE_JEMALLOC: "ON" # Build with Jemalloc support
> USE_PROFILER: "ON" # Build with Profiler support
> USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
> USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
> USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
> USE_CPP_PACKAGE: "OFF" # Build C++ Package
> USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
> USE_GPROF: "OFF" # Compile with gprof (profiling) flag
> USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
> USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could set 
> VTUNE_ROOT for search path
> ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
> BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
> INSTALL_EXAMPLES: "OFF" # Install the example source files.
> USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
> USE_TENSORRT: "OFF" # Enable infeference optimization with TensorRT.
> USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
> ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric 
> output
> CMAKE_BUILD_TYPE: "Release"
> CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
> CMAKE_C_COMPILER_LAUNCHER: "ccache"
> CMAKE_CXX_COMPILER_LAUNCHER: "ccache"
>
> commit 4d9667121ae6fb643f2a02ab15e25231ed756cde (HEAD, tag: 1.5.0.rc1,
> upstream/v1.5.x)
> commit 1a7199691f5cbc6012bb53eecbf884bed5ae6590 (HEAD, tag: 1.4.1.rc0,
> upstream/v1.4.x)
>
> curl http://169.254.169.254/latest/meta-data/instance-type
> c5d.18xlarge
>
>
> Version  : 3.6.7
> Compiler : GCC 8.2.0
> Build: ('default', 'Oct 22 2018 11:32:17')
> Arch : ('64bit', 'ELF')
> Pip Info---
> Version  : 19.1.1
> Directory: /home/piotr/mxnet_1.5/py3_venv/lib/python3.6/site-packages/pip
> --MXNet Info---
> Version  : 1.5.0
> Directory: /home/pio

RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-26 Thread Chen, Ciyong
Hi Pedro,

I'm looking at this case, using the script 
"incubator-mxnet/example/image-classification/train_cifar10.py" to get
the timing data, but there does not seem to be much difference between MXNet 
1.4.1.rc0 and 1.5.0.rc1 on C5.18xlarge.

I'm not sure whether the Python scripts differ; can you point me to the link 
for your script (cifar10.py)?
Or you could also try MXNet's script (train_cifar10.py) and see the 
performance.

Here's the command I used to collect the time: 
python train_cifar10.py --num-epoch=5

1) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
real    9m4.880s
user    333m13.340s
sys     14m36.100s

2) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
real    9m2.155s
user    329m37.092s
sys     16m8.668s

-Ciyong


-Original Message-
From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com] 
Sent: Wednesday, June 26, 2019 6:28 AM
To: dev@mxnet.incubator.apache.org
Cc: d...@mxnet.apache.org
Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

Hi these were my build flags and system info:


--- # CMake configuration
USE_CUDA: "OFF" # Build with CUDA support
USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
USE_NCCL: "OFF" # Use NVidia NCCL with CUDA
USE_OPENCV: "ON" # Build with OpenCV support
USE_OPENMP: "ON" # Build with Openmp support
USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT for search path
USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects support if "ON"
USE_LAPACK: "ON" # Build with lapack support
USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found
USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
USE_JEMALLOC: "ON" # Build with Jemalloc support
USE_PROFILER: "ON" # Build with Profiler support
USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
USE_CPP_PACKAGE: "OFF" # Build C++ Package
USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
USE_GPROF: "OFF" # Compile with gprof (profiling) flag
USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could set VTUNE_ROOT for search path
ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
INSTALL_EXAMPLES: "OFF" # Install the example source files.
USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
USE_TENSORRT: "OFF" # Enable inference optimization with TensorRT.
USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric output
CMAKE_BUILD_TYPE: "Release"
CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
CMAKE_C_COMPILER_LAUNCHER: "ccache"
CMAKE_CXX_COMPILER_LAUNCHER: "ccache"
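
As an illustration of how a configuration like the one above maps onto a CMake invocation, here is a minimal sketch. It assumes the options are passed straight to cmake as `-D<NAME>=<VALUE>` definitions and that the build directory sits one level below the source tree; the actual build may have used a wrapper script instead. `cmake_args` is a hypothetical helper name.

```python
def cmake_args(options: dict) -> list:
    """Turn option/value pairs into -D<NAME>=<VALUE> arguments for cmake."""
    return [f"-D{name}={value}" for name, value in options.items()]


# A few of the options listed above; the rest follow the same pattern.
options = {
    "USE_CUDA": "OFF",
    "USE_OPENMP": "ON",
    "USE_MKLDNN": "ON",
    "CMAKE_BUILD_TYPE": "Release",
}

# Assumption: cmake is run from a build directory inside the source tree.
command = ["cmake", *cmake_args(options), ".."]
print(" ".join(command))
```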

commit 4d9667121ae6fb643f2a02ab15e25231ed756cde (HEAD, tag: 1.5.0.rc1,
upstream/v1.5.x)
commit 1a7199691f5cbc6012bb53eecbf884bed5ae6590 (HEAD, tag: 1.4.1.rc0,
upstream/v1.4.x)

curl http://169.254.169.254/latest/meta-data/instance-type
c5d.18xlarge


Version  : 3.6.7
Compiler : GCC 8.2.0
Build: ('default', 'Oct 22 2018 11:32:17')
Arch : ('64bit', 'ELF')
Pip Info---
Version  : 19.1.1
Directory: /home/piotr/mxnet_1.5/py3_venv/lib/python3.6/site-packages/pip
--MXNet Info---
Version  : 1.5.0
Directory: /home/piotr/mxnet_1.5/python/mxnet
Hashtag not found. Not installed from pre-built package.
--System Info--
Platform : Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-18.04-bionic
system   : Linux
node : ip-172-31-63-171
release  : 4.15.0-1035-aws
version  : #37-Ubuntu SMP Mon Mar 18 16:15:14 UTC 2019
--Hardware Info--
machine  : x86_64
processor: x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              72
On-line CPU(s) list: 0-71
Thread(s) per core:  2
Core(s) per socket:  18
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
St

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-25 Thread Pedro Larroy
-Pip Info---
Version  : 19.1.1
Directory: /home/piotr/mxnet_1.4/py3_venv/lib/python3.6/site-packages/pip
--MXNet Info---
Version  : 1.4.1
Directory: /home/piotr/mxnet_1.4/python/mxnet
Hashtag not found. Not installed from pre-built package.
--System Info--
Platform : Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-18.04-bionic
system   : Linux
node : ip-172-31-63-171
release  : 4.15.0-1035-aws
version  : #37-Ubuntu SMP Mon Mar 18 16:15:14 UTC 2019
--Hardware Info--
machine  : x86_64
processor: x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              72
On-line CPU(s) list: 0-71
Thread(s) per core:  2
Core(s) per socket:  18
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:            4
CPU MHz:             1223.344
BogoMIPS:            6000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-17,36-53
NUMA node1 CPU(s):   18-35,54-71
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
nonstop_tsc cpuid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid
sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti
fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx
avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw
avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
--Network Test--

On Tue, Jun 25, 2019 at 2:35 PM Pedro Larroy
 wrote:
>
> I trained cifar10 on CPU and there seems to be a regression: training
> time increased by around 7% compared with 1.4.1:
>
> (py3_venv) piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> (master)+$ time python cifar10.py --epochs 5
> real 11m30.388s
> user 417m7.766s
> sys 16m57.315s
>
> VS 1.4.1:
> real 10m41.994s
> user 392m40.646s
> sys 12m30.601s
>
>
> On Thu, Jun 20, 2019 at 10:15 PM Lai Wei  wrote:
> >
> > Hi Anirudh,
> >
> > Thanks for jumping into this quickly, I followed up on the issue.
> >
> > I meant for the sockeye developers/maintainers to help set up nightly tests
> > and raise issues early.
> >
> > Thanks!
> >
> > On Fri, Jun 21, 2019 at 10:10 AM Haibin Lin 
> > wrote:
> >
> > > In GluonNLP we test against the MXNet nightly build for each PR, and the
> > > CI did catch some MXNet-related issues.
> > > I recommend that other toolkits also add integration tests against the
> > > MXNet nightly build. It helps identify issues early.
> > >
> > > Best,
> > > Haibin
> > >
> > > On Thu, Jun 20, 2019 at 18:52 Zhao, Patric  wrote:
> > >
> > > > Thanks for raising the issue; we will take a look ASAP.
> > > >
> > > > The downstream test cases are not in the MXNet CI, so it's hard for
> > > > MXNet developers to catch potential bugs or performance degradation.
> > > >
> > > > In the future, I suggest adding the major downstream test cases, such as
> > > > those from sockeye, GluonNLP, GluonCV, DGL, and Gluon-TS, to the nightly
> > > > tests. If that's still too heavy, maybe test them weekly or monthly :)
> > > >
> > > > Thanks,
> > > >
> > > > --Patric
> > > >
> > > > > -Original Message-
> > > > > From: Anirudh Subramanian [mailto:anirudh2...@gmail.com]
> > > > > Sent: Friday, June 21, 2019 9:31 AM
> > > > > To: dev@mxnet.incubator.apache.org
> > > > > Cc: d...@mxnet.apache.org
> > > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 
> > > > > 1.5.0.rc1
> > > > >
> > > > > Hi Lai,
> > > > >
> > > > > I have opened an issue:
> > > > > https://github.com/apache/incubator-mxnet/issues/15297
> > > > > I came to know about this issue only today and I have not been
> > > monitoring
> > > > > sockeye.
> > > > > I jumped onto this issue to make sure it wasn't caused by the dlpack
> > > > changes.
> > > > > Also, I don't  think sockeye CI checks against master,

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-25 Thread Pedro Larroy
I trained cifar10 on CPU and there seems to be a regression: training
time increased by around 7% compared with 1.4.1:

(py3_venv) piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
(master)+$ time python cifar10.py --epochs 5
real 11m30.388s
user 417m7.766s
sys 16m57.315s

VS 1.4.1:
real 10m41.994s
user 392m40.646s
sys 12m30.601s


On Thu, Jun 20, 2019 at 10:15 PM Lai Wei  wrote:
>
> Hi Anirudh,
>
> Thanks for jumping into this quickly, I followed up on the issue.
>
> I meant for the sockeye developers/maintainers to help set up nightly tests
> and raise issues early.
>
> Thanks!
>
> On Fri, Jun 21, 2019 at 10:10 AM Haibin Lin 
> wrote:
>
> > In GluonNLP we test against the MXNet nightly build for each PR, and the
> > CI did catch some MXNet-related issues.
> > I recommend that other toolkits also add integration tests against the
> > MXNet nightly build. It helps identify issues early.
> >
> > Best,
> > Haibin
> >
> > On Thu, Jun 20, 2019 at 18:52 Zhao, Patric  wrote:
> >
> > > Thanks for raising the issue; we will take a look ASAP.
> > >
> > > The downstream test cases are not in the MXNet CI, so it's hard for
> > > MXNet developers to catch potential bugs or performance degradation.
> > >
> > > In the future, I suggest adding the major downstream test cases, such as
> > > those from sockeye, GluonNLP, GluonCV, DGL, and Gluon-TS, to the nightly
> > > tests. If that's still too heavy, maybe test them weekly or monthly :)
> > >
> > > Thanks,
> > >
> > > --Patric
> > >
> > > > -Original Message-
> > > > From: Anirudh Subramanian [mailto:anirudh2...@gmail.com]
> > > > Sent: Friday, June 21, 2019 9:31 AM
> > > > To: dev@mxnet.incubator.apache.org
> > > > Cc: d...@mxnet.apache.org
> > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> > > >
> > > > Hi Lai,
> > > >
> > > > I have opened an issue:
> > > > https://github.com/apache/incubator-mxnet/issues/15297
> > > > I came to know about this issue only today and I have not been
> > monitoring
> > > > sockeye.
> > > > I jumped onto this issue to make sure it wasn't caused by the dlpack
> > > changes.
> > > > Also, I don't  think sockeye CI checks against master, it is using
> > 1.4.1.
> > > >
> > > > Anirudh
> > > >
> > > >
> > > > On Thu, Jun 20, 2019 at 6:17 PM Lai Wei  wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Could you share which test failed and what’s the crash? How to
> > > > > reproduce it?
> > > > >
> > > > > I was able to install sockeye and run all tests passed. Using python
> > > > > setup.py test
> > > > >
> > > > > I have tested both nightly pip package and 1.5.0.rc1
> > > > >
> > > > > It would be great to create an issue with reproducible steps and move
> > > > > the discussion there.
> > > > >
> > > > > Also I see sockeye nightly build[1] has been failing for some time,
> > if
> > > > > it’s due to MXNet change, please raise this early so we can track and
> > > > > solve it in time rather than block the release during vote time.
> > > > >
> > > > > [1] https://travis-ci.org/awslabs/sockeye
> > > > >
> > > > >
> > > > > On Fri, Jun 21, 2019 at 7:01 AM Anirudh Subramanian
> > > > > wrote:
> > > > >
> > > > > > I was able to reproduce a crash with the commit
> > > > > > 09202f7f261954383aa387144524d38f83f18d06 but not with the commit
> > > > > > a862270beb2d796c1ba311183f7f4a766a18ad6c.
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > > On Thu, Jun 20, 2019 at 3:53 PM Lai Wei 
> > wrote:
> > > > > >
> > > > > > > Hi Przemyslaw,
> > > > > > >
> > > > > > > Is there an issue with more details to track the problem?
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak
> > > > > > > 
> > > > > > > wrote:
> > > > > > >
> > >

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Lai Wei
Hi Anirudh,

Thanks for jumping into this quickly, I followed up on the issue.

I meant for the sockeye developers/maintainers to help set up nightly tests
and raise issues early.

Thanks!

On Fri, Jun 21, 2019 at 10:10 AM Haibin Lin 
wrote:

> In GluonNLP we test against the MXNet nightly build for each PR, and the
> CI did catch some MXNet-related issues.
> I recommend that other toolkits also add integration tests against the
> MXNet nightly build. It helps identify issues early.
>
> Best,
> Haibin
>
> On Thu, Jun 20, 2019 at 18:52 Zhao, Patric  wrote:
>
> > Thanks for raising the issue; we will take a look ASAP.
> >
> > The downstream test cases are not in the MXNet CI, so it's hard for
> > MXNet developers to catch potential bugs or performance degradation.
> >
> > In the future, I suggest adding the major downstream test cases, such as
> > those from sockeye, GluonNLP, GluonCV, DGL, and Gluon-TS, to the nightly
> > tests. If that's still too heavy, maybe test them weekly or monthly :)
> >
> > Thanks,
> >
> > --Patric
> >
> > > -Original Message-
> > > From: Anirudh Subramanian [mailto:anirudh2...@gmail.com]
> > > Sent: Friday, June 21, 2019 9:31 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Cc: d...@mxnet.apache.org
> > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> > >
> > > Hi Lai,
> > >
> > > I have opened an issue:
> > > https://github.com/apache/incubator-mxnet/issues/15297
> > > I came to know about this issue only today and I have not been
> monitoring
> > > sockeye.
> > > I jumped onto this issue to make sure it wasn't caused by the dlpack
> > changes.
> > > Also, I don't  think sockeye CI checks against master, it is using
> 1.4.1.
> > >
> > > Anirudh
> > >
> > >
> > > On Thu, Jun 20, 2019 at 6:17 PM Lai Wei  wrote:
> > >
> > > > Hi,
> > > >
> > > > Could you share which test failed and what’s the crash? How to
> > > > reproduce it?
> > > >
> > > > I was able to install sockeye and run all tests passed. Using python
> > > > setup.py test
> > > >
> > > > I have tested both nightly pip package and 1.5.0.rc1
> > > >
> > > > It would be great to create an issue with reproducible steps and move
> > > > the discussion there.
> > > >
> > > > Also I see sockeye nightly build[1] has been failing for some time,
> if
> > > > it’s due to MXNet change, please raise this early so we can track and
> > > > solve it in time rather than block the release during vote time.
> > > >
> > > > [1] https://travis-ci.org/awslabs/sockeye
> > > >
> > > >
> > > > On Fri, Jun 21, 2019 at 7:01 AM Anirudh Subramanian
> > > > wrote:
> > > >
> > > > > I was able to reproduce a crash with the commit
> > > > > 09202f7f261954383aa387144524d38f83f18d06 but not with the commit
> > > > > a862270beb2d796c1ba311183f7f4a766a18ad6c.
> > > > >
> > > > > Anirudh
> > > > >
> > > > > On Thu, Jun 20, 2019 at 3:53 PM Lai Wei 
> wrote:
> > > > >
> > > > > > Hi Przemyslaw,
> > > > > >
> > > > > > Is there an issue with more details to track the problem?
> > > > > >
> > > > > >
> > > > > > On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak
> > > > > > 
> > > > > > wrote:
> > > > > >
> > > > > > > -1
> > > > > > >
> > > > > > > There is a crash in sockeye unit test (python setup.py test)
> > > > > > > observed starting with nightly 1.5 build from 6/13 and still
> > > > > > > occurring in 1.5rc1. I don't yet have the exact commit that is
> > > > > > > responsible for it, but it is either
> > > > > > > a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or
> > > > > > > 09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).
> > > > > > >
> > > > > > > On 2019/06/20 06:36:22, Lai Wei  wrote:
> > > > > > > > Dear MXNet community,
> > > > > > > >
> > > > 

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Haibin Lin
In GluonNLP we test against the MXNet nightly build for each PR, and the
CI did catch some MXNet-related issues.
I recommend that other toolkits also add integration tests against the
MXNet nightly build. It helps identify issues early.

Best,
Haibin

On Thu, Jun 20, 2019 at 18:52 Zhao, Patric  wrote:

> Thanks for raising the issue; we will take a look ASAP.
>
> The downstream test cases are not in the MXNet CI, so it's hard for
> MXNet developers to catch potential bugs or performance degradation.
>
> In the future, I suggest adding the major downstream test cases, such as
> those from sockeye, GluonNLP, GluonCV, DGL, and Gluon-TS, to the nightly
> tests. If that's still too heavy, maybe test them weekly or monthly :)
>
> Thanks,
>
> --Patric
>
> > -Original Message-
> > From: Anirudh Subramanian [mailto:anirudh2...@gmail.com]
> > Sent: Friday, June 21, 2019 9:31 AM
> > To: dev@mxnet.incubator.apache.org
> > Cc: d...@mxnet.apache.org
> > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> >
> > Hi Lai,
> >
> > I have opened an issue:
> > https://github.com/apache/incubator-mxnet/issues/15297
> > I came to know about this issue only today and I have not been monitoring
> > sockeye.
> > I jumped onto this issue to make sure it wasn't caused by the dlpack
> changes.
> > Also, I don't  think sockeye CI checks against master, it is using 1.4.1.
> >
> > Anirudh
> >
> >
> > On Thu, Jun 20, 2019 at 6:17 PM Lai Wei  wrote:
> >
> > > Hi,
> > >
> > > Could you share which test failed and what’s the crash? How to
> > > reproduce it?
> > >
> > > I was able to install sockeye and run all tests passed. Using python
> > > setup.py test
> > >
> > > I have tested both nightly pip package and 1.5.0.rc1
> > >
> > > It would be great to create an issue with reproducible steps and move
> > > the discussion there.
> > >
> > > Also I see sockeye nightly build[1] has been failing for some time, if
> > > it’s due to MXNet change, please raise this early so we can track and
> > > solve it in time rather than block the release during vote time.
> > >
> > > [1] https://travis-ci.org/awslabs/sockeye
> > >
> > >
> > > On Fri, Jun 21, 2019 at 7:01 AM Anirudh Subramanian
> > > wrote:
> > >
> > > > I was able to reproduce a crash with the commit
> > > > 09202f7f261954383aa387144524d38f83f18d06 but not with the commit
> > > > a862270beb2d796c1ba311183f7f4a766a18ad6c.
> > > >
> > > > Anirudh
> > > >
> > > > On Thu, Jun 20, 2019 at 3:53 PM Lai Wei  wrote:
> > > >
> > > > > Hi Przemyslaw,
> > > > >
> > > > > Is there an issue with more details to track the problem?
> > > > >
> > > > >
> > > > > On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak
> > > > > 
> > > > > wrote:
> > > > >
> > > > > > -1
> > > > > >
> > > > > > There is a crash in sockeye unit test (python setup.py test)
> > > > > > observed starting with nightly 1.5 build from 6/13 and still
> > > > > > occurring in 1.5rc1. I don't yet have the exact commit that is
> > > > > > responsible for it, but it is either
> > > > > > a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or
> > > > > > 09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).
> > > > > >
> > > > > > On 2019/06/20 06:36:22, Lai Wei  wrote:
> > > > > > > Dear MXNet community,
> > > > > > >
> > > > > > > This is the 3-day vote to release Apache MXNet (incubating)
> > > > > > > version 1.5.0.
> > > > > > > Voting on dev@ will start June 19, 23:59:59(PST) and close on
> > > > > > > June 22, 23:59:59.
> > > > > > >
> > > > > > > 1) Link to release notes:
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > > >
> > > > > > >
> > > > > > > 2) Link to release candidate:
> > > > > > >
> > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1
> > > > > > >
> > > > > > >
> > > > > > > 3) Link to source and signatures on apache dist server:
> > > > > > >
> > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/
> > > > > > >
> > > > > > >
> > > > > > > Please remember to TEST first before voting accordingly:
> > > > > > >
> > > > > > > +1 = approve
> > > > > > > +0 = no opinion
> > > > > > > -1 = disapprove (provide reason)
> > > > > > > --
> > > > > > > Best Regards
> > > > > > >
> > > > > > > Lai
> > > > > > >
> > > > > >
> > > > > --
> > > > > Best Regards
> > > > >
> > > > > Lai
> > > > >
> > > >
> > > --
> > > Best Regards
> > >
> > > Lai
> > >
>


RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Zhao, Patric
Thanks for raising the issue; we will take a look ASAP.

The downstream test cases are not in the MXNet CI, so it's hard for MXNet
developers to catch potential bugs or performance degradation.

In the future, I suggest adding the major downstream test cases, such as those
from sockeye, GluonNLP, GluonCV, DGL, and Gluon-TS, to the nightly tests.
If that's still too heavy, maybe test them weekly or monthly :)
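
One way to picture the nightly/weekly/monthly split suggested above is a small scheduling helper. This is a sketch only: the project names come from the list above, while the cadence assigned to each project, the choice of Monday for weekly runs, and the function name `suites_due` are assumptions made for illustration.

```python
from datetime import date

# Hypothetical cadence per downstream project (illustration only).
CADENCE = {
    "sockeye": "nightly",
    "GluonNLP": "nightly",
    "GluonCV": "weekly",
    "DGL": "weekly",
    "Gluon-TS": "monthly",
}


def suites_due(day: date) -> list:
    """Return the downstream test suites that should run on the given day."""
    due = []
    for project, cadence in CADENCE.items():
        if cadence == "nightly":
            due.append(project)
        elif cadence == "weekly" and day.weekday() == 0:  # Mondays
            due.append(project)
        elif cadence == "monthly" and day.day == 1:  # first day of the month
            due.append(project)
    return due


print(suites_due(date(2019, 7, 1)))
print(suites_due(date(2019, 7, 2)))
```

Heavier suites can be moved to a slower cadence by editing the table, without touching the scheduling logic.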

Thanks,

--Patric

> -Original Message-
> From: Anirudh Subramanian [mailto:anirudh2...@gmail.com]
> Sent: Friday, June 21, 2019 9:31 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> 
> Hi Lai,
> 
> I have opened an issue:
> https://github.com/apache/incubator-mxnet/issues/15297
> I came to know about this issue only today and I have not been monitoring
> sockeye.
> I jumped onto this issue to make sure it wasn't caused by the dlpack changes.
> Also, I don't  think sockeye CI checks against master, it is using 1.4.1.
> 
> Anirudh
> 
> 
> On Thu, Jun 20, 2019 at 6:17 PM Lai Wei  wrote:
> 
> > Hi,
> >
> > Could you share which test failed and what’s the crash? How to
> > reproduce it?
> >
> > I was able to install sockeye and run all tests passed. Using python
> > setup.py test
> >
> > I have tested both nightly pip package and 1.5.0.rc1
> >
> > It would be great to create an issue with reproducible steps and move
> > the discussion there.
> >
> > Also I see sockeye nightly build[1] has been failing for some time, if
> > it’s due to MXNet change, please raise this early so we can track and
> > solve it in time rather than block the release during vote time.
> >
> > [1] https://travis-ci.org/awslabs/sockeye
> >
> >
> > On Fri, Jun 21, 2019 at 7:01 AM Anirudh Subramanian
> > wrote:
> >
> > > I was able to reproduce a crash with the commit
> > > 09202f7f261954383aa387144524d38f83f18d06 but not with the commit
> > > a862270beb2d796c1ba311183f7f4a766a18ad6c.
> > >
> > > Anirudh
> > >
> > > On Thu, Jun 20, 2019 at 3:53 PM Lai Wei  wrote:
> > >
> > > > Hi Przemyslaw,
> > > >
> > > > Is there an issue with more details to track the problem?
> > > >
> > > >
> > > > On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak
> > > > 
> > > > wrote:
> > > >
> > > > > -1
> > > > >
> > > > > There is a crash in sockeye unit test (python setup.py test)
> > > > > observed starting with nightly 1.5 build from 6/13 and still
> > > > > occurring in 1.5rc1. I don't yet have the exact commit that is
> > > > > responsible for it, but it is either
> > > > > a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or
> > > > > 09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).
> > > > >
> > > > > On 2019/06/20 06:36:22, Lai Wei  wrote:
> > > > > > Dear MXNet community,
> > > > > >
> > > > > > This is the 3-day vote to release Apache MXNet (incubating)
> > > > > > version 1.5.0.
> > > > > > Voting on dev@ will start June 19, 23:59:59(PST) and close on
> > > > > > June 22, 23:59:59.
> > > > > >
> > > > > > 1) Link to release notes:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > >
> > > > > >
> > > > > > 2) Link to release candidate:
> > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1
> > > > > >
> > > > > >
> > > > > > 3) Link to source and signatures on apache dist server:
> > > > > >
> > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/
> > > > > >
> > > > > >
> > > > > > Please remember to TEST first before voting accordingly:
> > > > > >
> > > > > > +1 = approve
> > > > > > +0 = no opinion
> > > > > > -1 = disapprove (provide reason)
> > > > > > --
> > > > > > Best Regards
> > > > > >
> > > > > > Lai
> > > > > >
> > > > >
> > > > --
> > > > Best Regards
> > > >
> > > > Lai
> > > >
> > >
> > --
> > Best Regards
> >
> > Lai
> >


Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Anirudh Subramanian
Hi Lai,

I have opened an issue:
https://github.com/apache/incubator-mxnet/issues/15297
I came to know about this issue only today and I have not been monitoring
sockeye.
I jumped onto this issue to make sure it wasn't caused by the dlpack
changes.
Also, I don't  think sockeye CI checks against master, it is using 1.4.1.

Anirudh


On Thu, Jun 20, 2019 at 6:17 PM Lai Wei  wrote:

> Hi,
>
> Could you share which test failed and what’s the crash? How to reproduce
> it?
>
> I was able to install sockeye and run all tests passed. Using
> python setup.py test
>
> I have tested both nightly pip package and 1.5.0.rc1
>
> It would be great to create an issue with reproducible steps and move the
> discussion there.
>
> Also I see sockeye nightly build[1] has been failing for some time, if it’s
> due to MXNet change, please raise this early so we can track and solve it
> in time rather than block the release during vote time.
>
> [1] https://travis-ci.org/awslabs/sockeye
>
>
> On Fri, Jun 21, 2019 at 7:01 AM Anirudh Subramanian
> wrote:
>
> > I was able to reproduce a crash with the commit
> > 09202f7f261954383aa387144524d38f83f18d06 but not with the commit
> > a862270beb2d796c1ba311183f7f4a766a18ad6c.
> >
> > Anirudh
> >
> > On Thu, Jun 20, 2019 at 3:53 PM Lai Wei  wrote:
> >
> > > Hi Przemyslaw,
> > >
> > > Is there an issue with more details to track the problem?
> > >
> > >
> > > On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak 
> > > wrote:
> > >
> > > > -1
> > > >
> > > > There is a crash in sockeye unit test (python setup.py test) observed
> > > > starting with nightly 1.5 build from 6/13 and still occurring in 1.5rc1. I
> > > > don't yet have the exact commit that is responsible for it, but it is
> > > > either a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or
> > > > 09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).
> > > >
> > > > On 2019/06/20 06:36:22, Lai Wei  wrote:
> > > > > Dear MXNet community,
> > > > >
> > > > > This is the 3-day vote to release Apache MXNet (incubating) version
> > > > > 1.5.0.
> > > > > Voting on dev@ will start June 19, 23:59:59(PST) and close on
> > > > > June 22, 23:59:59.
> > > > >
> > > > > 1) Link to release notes:
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > >
> > > > >
> > > > > 2) Link to release candidate:
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1
> > > > >
> > > > >
> > > > > 3) Link to source and signatures on apache dist server:
> > > > >
> > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/
> > > > >
> > > > >
> > > > > Please remember to TEST first before voting accordingly:
> > > > >
> > > > > +1 = approve
> > > > > +0 = no opinion
> > > > > -1 = disapprove (provide reason)
> > > > > --
> > > > > Best Regards
> > > > >
> > > > > Lai
> > > > >
> > > >
> > > --
> > > Best Regards
> > >
> > > Lai
> > >
> >
> --
> Best Regards
>
> Lai
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Lai Wei
Hi,

Could you share which test failed and what the crash is? How can it be reproduced?

I was able to install sockeye, and all tests passed when I ran
python setup.py test

I have tested both the nightly pip package and 1.5.0.rc1.

It would be great to create an issue with reproducible steps and move the
discussion there.

Also, I see the sockeye nightly build[1] has been failing for some time. If it's
due to an MXNet change, please raise this early so we can track and solve it
in time rather than block the release during the vote.

[1] https://travis-ci.org/awslabs/sockeye


On Fri, Jun 21, 2019 at 7:01 AM Anirudh Subramanian 
wrote:

> I was able to reproduce a crash with the commit
> 09202f7f261954383aa387144524d38f83f18d06 but not with the commit
> a862270beb2d796c1ba311183f7f4a766a18ad6c.
>
> Anirudh
>
> On Thu, Jun 20, 2019 at 3:53 PM Lai Wei  wrote:
>
> > Hi Przemyslaw,
> >
> > Is there an issue with more details to track the problem?
> >
> >
> > On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak 
> > wrote:
> >
> > > -1
> > >
> > > There is a crash in sockeye unit test (python setup.py test) observed
> > > starting with nightly 1.5 build from 6/13 and still occurring in 1.5rc1. I
> > > don't yet have the exact commit that is responsible for it, but it is
> > > either a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or
> > > 09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).
> > >
> > > On 2019/06/20 06:36:22, Lai Wei  wrote:
> > > > Dear MXNet community,
> > > >
> > > > This is the 3-day vote to release Apache MXNet (incubating) version
> > > > 1.5.0.
> > > > Voting on dev@ will start June 19, 23:59:59(PST) and close on June 22,
> > > > 23:59:59.
> > > >
> > > > 1) Link to release notes:
> > > >
> > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > >
> > > >
> > > > 2) Link to release candidate:
> > > >
> > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1
> > > >
> > > >
> > > > 3) Link to source and signatures on apache dist server:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/
> > > >
> > > >
> > > > Please remember to TEST first before voting accordingly:
> > > >
> > > > +1 = approve
> > > > +0 = no opinion
> > > > -1 = disapprove (provide reason)
> > > > --
> > > > Best Regards
> > > >
> > > > Lai
> > > >
> > >
> > --
> > Best Regards
> >
> > Lai
> >
>
-- 
Best Regards

Lai


Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Anirudh Subramanian
I was able to reproduce a crash with the commit
09202f7f261954383aa387144524d38f83f18d06 but not with the commit
a862270beb2d796c1ba311183f7f4a766a18ad6c.

Anirudh

On Thu, Jun 20, 2019 at 3:53 PM Lai Wei  wrote:

> Hi Przemyslaw,
>
> Is there an issue with more details to track the problem?
>
>
> On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak 
> wrote:
>
> > -1
> >
> > There is a crash in sockeye unit test (python setup.py test) observed
> > starting with nightly 1.5 build from 6/13 and still occurring in 1.5rc1. I
> > don't yet have the exact commit that is responsible for it, but it is
> > either a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or
> > 09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).
> >
> > On 2019/06/20 06:36:22, Lai Wei  wrote:
> > > Dear MXNet community,
> > >
> > > This is the 3-day vote to release Apache MXNet (incubating) version
> > > 1.5.0.
> > > Voting on dev@ will start June 19, 23:59:59(PST) and close on June 22,
> > > 23:59:59.
> > >
> > > 1) Link to release notes:
> > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > >
> > >
> > > 2) Link to release candidate:
> > >
> > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1
> > >
> > >
> > > 3) Link to source and signatures on apache dist server:
> > >
> > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/
> > >
> > >
> > > Please remember to TEST first before voting accordingly:
> > >
> > > +1 = approve
> > > +0 = no opinion
> > > -1 = disapprove (provide reason)
> > > --
> > > Best Regards
> > >
> > > Lai
> > >
> >
> --
> Best Regards
>
> Lai
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Lai Wei
Hi Przemyslaw,

Is there an issue with more details to track the problem?


On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak 
wrote:

> -1
>
> There is a crash in sockeye unit test (python setup.py test) observed
> starting with nightly 1.5 build from 6/13 and still occurring in 1.5rc1. I
> don't yet have the exact commit that is responsible for it, but it is
> either a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or
> 09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).
>
> On 2019/06/20 06:36:22, Lai Wei  wrote:
> > Dear MXNet community,
> >
> > This is the 3-day vote to release Apache MXNet (incubating) version
> > 1.5.0.
> > Voting on dev@ will start June 19, 23:59:59(PST)  and close on June 22,
> > 23:59:59.
> >
> > 1) Link to release notes:
> > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> >
> >
> > 2) Link to release candidate:
> >
> > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1
> >
> >
> > 3) Link to source and signatures on apache dist server:
> >
> > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/
> >
> >
> > Please remember to TEST first before voting accordingly:
> >
> > +1 = approve
> > +0 = no opinion
> > -1 = disapprove (provide reason)
> > --
> > Best Regards
> >
> > Lai
> >
>
-- 
Best Regards

Lai


Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Przemysław Trędak
-1

There is a crash in the sockeye unit tests (python setup.py test), observed
starting with the nightly 1.5 build from 6/13 and still occurring in 1.5rc1. I
don't yet have the exact commit that is responsible for it, but it is either
a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or
09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).
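
When more than two candidate commits are involved, narrowing down the responsible one amounts to a binary search over the history, which is what `git bisect` automates. Below is a minimal sketch of that search. It assumes a linear, oldest-to-newest commit list in which the first commit is good and the last is bad. The `is_bad` callable is hypothetical (for example, a function that builds the commit and runs the sockeye tests), and the commit labels are placeholders rather than real MXNet commits.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an oldest-to-newest commit list for the first failing commit.

    Assumes the first commit passes, the last one fails, and every commit
    after the first bad one also fails.
    """
    low, high = 0, len(commits) - 1  # commits[low] is good, commits[high] is bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(commits[mid]):
            high = mid
        else:
            low = mid
    return commits[high]


# Placeholder history; in this made-up example the crash first appears in "c4".
history = ["c0", "c1", "c2", "c3", "c4", "c5"]
culprit = first_bad_commit(history, lambda c: history.index(c) >= history.index("c4"))
print(culprit)
```

With only two candidates, as in this case, testing each commit directly is just as quick.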

On 2019/06/20 06:36:22, Lai Wei  wrote: 
> Dear MXNet community,
> 
> This is the 3-day vote to release Apache MXNet (incubating) version 1.5.0.
> Voting on dev@ will start June 19, 23:59:59(PST)  and close on June 22,
> 23:59:59.
> 
> 1) Link to release notes:
> https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> 
> 
> 2) Link to release candidate:
> 
> https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1
> 
> 
> 3) Link to source and signatures on apache dist server:
> 
> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/
> 
> 
> Please remember to TEST first before voting accordingly:
> 
> +1 = approve
> +0 = no opinion
> -1 = disapprove (provide reason)
> -- 
> Best Regards
> 
> Lai
>
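
The vote values in the announcement above can be tallied along the following lines. This is a sketch under stated assumptions: the passing rule used here (at least three +1 votes and more +1 than -1) is the majority-approval rule commonly applied to ASF release votes, binding and non-binding votes are not distinguished, and the ballots are made up for illustration.

```python
from collections import Counter


def tally(votes):
    """Count '+1', '+0' and '-1' votes from (voter, vote) pairs."""
    counts = Counter(vote for _, vote in votes)
    return {key: counts.get(key, 0) for key in ("+1", "+0", "-1")}


def passes(counts, min_approvals=3):
    """Assumed rule: at least `min_approvals` +1 votes and more +1 than -1."""
    return counts["+1"] >= min_approvals and counts["+1"] > counts["-1"]


# Hypothetical ballots, for illustration only.
ballots = [("voter-a", "+1"), ("voter-b", "+1"), ("voter-c", "-1"), ("voter-d", "+0")]
result = tally(ballots)
print(result, passes(result))
```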