Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> (without setting the env variables, I got a close time (<1%) with v1.5 and
> v1.4)
>         export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
>         export OMP_NUM_THREADS=18
>
> Did you set any env variables during running?
>
> The performance results I got are below:
> 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> real    12m10.856s
> user    234m49.576s
> sys     4m38.044s
>
> 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> real    12m52.140s
> user    246m30.740s
> sys     5m8.188s
>
> As I looked at the profiling data, most of the ops have the same perf
> between v1.4 and v1.5. But some ops like "_backward_BatchNorm" and
> "Pooling" are ~1.37x slower on v1.5 compared with v1.4.
> Will do further analysis on these ops.
>
> Here's the hardware/OS info from my side:
> --Python Info--
> Version      : 3.6.8
> Compiler     : GCC 7.3.0
> Build        : ('default', 'Dec 30 2018 01:22:34')
> Arch         : ('64bit', '')
> --Pip Info--
> Version      : 19.0.3
> Directory    : /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> --MXNet Info--
> Version      : 1.5.0
> Directory    : /home/ubuntu/ws/incubator-mxnet/python/mxnet
> Hashtag not found. Not installed from pre-built package.
> --System Info--
> Platform     : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> system       : Linux
> node         : ip-172-31-32-129
> release      : 4.4.0-1085-aws
> version      : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> --Hardware Info--
> machine      : x86_64
> processor    : x86_64
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                72
> On-line CPU(s) list:   0-71
> Thread(s) per core:    2
> Core(s) per socket:    18
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 85
> Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> Stepping:              3
> CPU MHz:               3000.000
> BogoMIPS:              6000.00
> Hypervisor vendor:     KVM
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              1024K
> L3 cache:              25344K
> NUMA node0 CPU(s):     0-17,36-53
> NUMA node1 CPU(s):     18-35,54-71
> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb
> rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc
> aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1
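
The core binding described in the message above can also be applied per process rather than through shell `export`. The following is a minimal sketch, assuming the benchmark is launched as a child process; `run_with_thread_binding` is a hypothetical helper name, and `KMP_AFFINITY` only has an effect when the program is linked against the Intel OpenMP runtime.

```python
import os
import subprocess
import sys


def run_with_thread_binding(cmd, num_threads=18):
    """Run `cmd` with the OpenMP settings quoted in the thread.

    Hypothetical helper: the variables are set only for the child
    process, so the parent environment is left untouched.
    """
    env = dict(os.environ)
    env["KMP_AFFINITY"] = "granularity=fine,noduplicates,compact,1,0"
    env["OMP_NUM_THREADS"] = str(num_threads)
    return subprocess.run(cmd, env=env, check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Placeholder command that just echoes one variable back; in practice
    # this would be the benchmark invocation (e.g. `python cifar10.py ...`).
    probe = [sys.executable, "-c",
             "import os; print(os.environ['OMP_NUM_THREADS'])"]
    print(run_with_thread_binding(probe).stdout.strip())
```

Running both MXNet versions through the same helper keeps the thread settings identical between the two measurements, which matters here because the gap reportedly shrinks to under 1% without them.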
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> [...] on v1.5 compared with v1.4.
> Will do further analysis on these ops.
>
> Here's the hardware/OS info from my side:
> --Python Info--
> Version      : 3.6.8
> Compiler     : GCC 7.3.0
> Build        : ('default', 'Dec 30 2018 01:22:34')
> Arch         : ('64bit', '')
> --Pip Info--
> Version      : 19.0.3
> Directory    : /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> --MXNet Info--
> Version      : 1.5.0
> Directory    : /home/ubuntu/ws/incubator-mxnet/python/mxnet
> Hashtag not found. Not installed from pre-built package.
> --System Info--
> Platform     : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> system       : Linux
> node         : ip-172-31-32-129
> release      : 4.4.0-1085-aws
> version      : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> --Hardware Info--
> machine      : x86_64
> processor    : x86_64
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                72
> On-line CPU(s) list:   0-71
> Thread(s) per core:    2
> Core(s) per socket:    18
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 85
> Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> Stepping:              3
> CPU MHz:               3000.000
> BogoMIPS:              6000.00
> Hypervisor vendor:     KVM
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              1024K
> L3 cache:              25344K
> NUMA node0 CPU(s):     0-17,36-53
> NUMA node1 CPU(s):     18-35,54-71
> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb
> rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc
> aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1
> sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
> hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase
> tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx
> smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
> --Network Test--
>
>
> -Ciyong
>
>
> -Original Message-
> From: Zhao, Patric [mailto:patric.z...@intel.com]
> Sent: Thursday, June 27, 2019 9:55 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
>
> Could we run more epochs to see the performance difference or profiling
> the difference between good and bad run?
>
> > -Original Message-
> > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> > Sent: Thursday, June 27, 2019 9:35 AM
> > To: dev@mxnet.incubator.apache.org
> > Cc: d...@mxnet.apache.org
> > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Re: FW: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                72
> On-line CPU(s) list:   0-71
> Thread(s) per core:    2
> Core(s) per socket:    18
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 85
> Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> Stepping:              3
> CPU MHz:               3000.000
> BogoMIPS:              6000.00
> Hypervisor vendor:     KVM
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              1024K
> L3 cache:              25344K
> NUMA node0 CPU(s):     0-17,36-53
> NUMA node1 CPU(s):     18-35,54-71
> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb
> rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc
> aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1
> sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
> hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase
> tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx
> smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
> --Network Test--
>
>
> -Ciyong
>
>
> -Original Message-
> From: Zhao, Patric [mailto:patric.z...@intel.com]
> Sent: Thursday, June 27, 2019 9:55 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
>
> Could we run more epochs to see the performance difference or profiling
> the difference between good and bad run?
>
> > -Original Message-
> > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> > Sent: Thursday, June 27, 2019 9:35 AM
> > To: dev@mxnet.incubator.apache.org
> > Cc: d...@mxnet.apache.org
> > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> >
> > I ran again and the gap is bigger again; I guess we need to average
> > out the times across several runs:
> >
> > piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> > (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5
> > && time ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> > [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> > ImageRecordIOParser2:
> > /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads
> > for decoding..
> > [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> > [23:17:09] ../src/io/iter_image_recordio_2.cc:172:
> > ImageRecordIOParser2:
> > /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads
> > for decoding..
> > [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> > [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> > from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> > lr_schedule: {0: 0.05, 82: 0.005001, 123: 0.0005, 300: 0.0001}
> > Epoch 0, Changed learning rate to 0.05
> > [23:17:09]
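
Pedro's suggestion to average the times across several runs can be sketched as follows. This is only an illustration under stated assumptions: `time_command` and `summarize` are hypothetical helper names, wall-clock time is measured from the parent process rather than with the shell's `time`, and the command in the demo is a placeholder for the two `cifar10.py` invocations shown above.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run `cmd` `runs` times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, sample standard deviation) of the durations."""
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return mean, stdev


if __name__ == "__main__":
    # Placeholder command; substitute the benchmark command for each
    # MXNet virtualenv to compare the two versions.
    cmd = [sys.executable, "-c", "pass"]
    mean, stdev = summarize(time_command(cmd, runs=3))
    print(f"mean={mean:.3f}s stdev={stdev:.3f}s over 3 runs")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a gap of a few percent between v1.4 and v1.5 exceeds the run-to-run noise.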
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> Hi Pedro,
>
> I was able to reproduce a similar result (v1.5 is ~5.6% slower than
> v1.4; I was using 18 cores for computing) with your script on C5.18xlarge.
> But I need to bind the cores with the command below when running the script
> (without setting the env variables, I got a close time (<1%) with v1.5 and
> v1.4):
>         export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
>         export OMP_NUM_THREADS=18
>
> Did you set any env variables during running?
>
> The performance results I got are below:
> 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> real    12m10.856s
> user    234m49.576s
> sys     4m38.044s
>
> 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> real    12m52.140s
> user    246m30.740s
> sys     5m8.188s
>
> As I looked at the profiling data, most of the ops have the same perf
> between v1.4 and v1.5. But some ops like "_backward_BatchNorm" and
> "Pooling" are ~1.37x slower on v1.5 compared with v1.4.
> Will do further analysis on these ops.
>
> Here's the hardware/OS info from my side:
> --Python Info--
> Version      : 3.6.8
> Compiler     : GCC 7.3.0
> Build        : ('default', 'Dec 30 2018 01:22:34')
> Arch         : ('64bit', '')
> --Pip Info--
> Version      : 19.0.3
> Directory    : /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> --MXNet Info--
> Version      : 1.5.0
> Directory    : /home/ubuntu/ws/incubator-mxnet/python/mxnet
> Hashtag not found. Not installed from pre-built package.
> --System Info--
> Platform     : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> system       : Linux
> node         : ip-172-31-32-129
> release      : 4.4.0-1085-aws
> version      : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> --Hardware Info--
> machine      : x86_64
> processor    : x86_64
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
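
The roughly 5.6% slowdown quoted in the message follows from the two `real` times. A minimal sketch of the arithmetic, assuming durations in the `XmY.YYYs` form printed by the shell's `time`; `parse_time` and `slowdown_pct` are hypothetical helper names:

```python
import re


def parse_time(text):
    """Convert a `time`-style duration such as '12m10.856s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def slowdown_pct(baseline, candidate):
    """Percentage by which `candidate` is slower than `baseline`."""
    return (candidate - baseline) / baseline * 100.0


v14 = parse_time("12m10.856s")   # 1.4.1.rc0 wall-clock time, 730.856 s
v15 = parse_time("12m52.140s")   # 1.5.0.rc1 wall-clock time, 772.140 s
print(f"{slowdown_pct(v14, v15):.2f}%")  # prints 5.65%
```

This uses wall-clock (`real`) time only; the `user` and `sys` figures aggregate CPU time over all threads and are not directly comparable to it.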
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> Thank you.
>
> On Thu, Jun 27, 2019 at 10:01 AM Lai Wei <roywei...@gmail.com> wrote:
> >
> > Dear @dev,
> >
> > I'm cancelling the vote for cached op fix:
> >
> > https://github.com/apache/incubator-mxnet/pull/15298
> >
> > As for the possible CPU training regression, it does not look like a
> > blocker for now.
> >
> > I will start a new rc2 vote, please help to validate.
> >
> > Thanks!
> >
> >
> > On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong <ciyong.c...@intel.com>
> > wrote:
> >
> > > Hi Pedro,
> > >
> > > I was able to reproduce a similar result (v1.5 is ~5.6% slower than
> > > v1.4; I was using 18 cores for computing) with your script on
> > > C5.18xlarge.
> > > But I need to bind the cores with the command below when running the
> > > script (without setting the env variables, I got a close time (<1%)
> > > with v1.5 and v1.4):
> > >         export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> > >         export OMP_NUM_THREADS=18
> > >
> > > Did you set any env variables during running?
> > >
> > > The performance results I got are below:
> > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > > real    12m10.856s
> > > user    234m49.576s
> > > sys     4m38.044s
> > >
> > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > > real    12m52.140s
> > > user    246m30.740s
> > > sys     5m8.188s
> > >
> > > As I looked at the profiling data, most of the ops have the same perf
> > > between v1.4 and v1.5. But some ops like "_backward_BatchNorm" and
> > > "Pooling" are ~1.37x slower on v1.5 compared with v1.4.
> > > Will do further analysis on these ops.
> > >
> > > Here's the hardware/OS info from my side:
> > > --Python Info--
> > > Version      : 3.6.8
> > > Compiler     : GCC 7.3.0
> > > Build        : ('default', 'Dec 30 2018 01:22:34')
> > > Arch         : ('64bit', '')
> > > --Pip Info--
> > > Version      : 19.0.3
> > > Directory    : /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > > --MXNet Info--
> > > Version      : 1.5.0
> > > Directory    : /home/ubuntu/ws/incubator-mxnet/python/mxnet
> > > Hashtag not found. Not installed from pre-built package.
> > > --System Info--
> > > Platform     : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > > system       : Linux
> > > node         : ip-172-31-32-129
> > > release      : 4.4.0-1085-aws
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
variables during running? > > > > > > > > > > The performance result I got as below: > > > > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590) > > > > > real12m10.856s > > > > > user234m49.576s > > > > > sys 4m38.044s > > > > > > > > > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde) > > > > > real12m52.140s > > > > > user246m30.740s > > > > > sys 5m8.188s > > > > > > > > > > As I looked at the profiling data, most of the ops have same > perf > > > between > > > > > v1.4 and v1.5. But some ops like " _backward_BatchNorm" and > > "Pooling" > > > is > > > > > ~1.37x slower on v1.5 compared with v1.4. > > > > > Will do further analysis on these ops. > > > > > > > > > > Here's the hardware/OS info from my side: > > > > > --Python Info-- > > > > > Version : 3.6.8 > > > > > Compiler : GCC 7.3.0 > > > > > Build: ('default', 'Dec 30 2018 01:22:34') > > > > > Arch : ('64bit', '') > > > > > Pip Info--- > > > > > Version : 19.0.3 > > > > > Directory: > > > > > > > /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip > > > > > --MXNet Info--- > > > > > Version : 1.5.0 > > > > > Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet > > > > > Hashtag not found. Not installed from pre-built package. 
> > > > > --System Info-- > > > > > Platform : > Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid > > > > > system : Linux > > > > > node : ip-172-31-32-129 > > > > > release : 4.4.0-1085-aws > > > > > version : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019 > > > > > --Hardware Info-- > > > > > machine : x86_64 > > > > > processor: x86_64 > > > > > Architecture: x86_64 > > > > > CPU op-mode(s):32-bit, 64-bit > > > > > Byte Order:Little Endian > > > > > CPU(s):72 > > > > > On-line CPU(s) list: 0-71 > > > > > Thread(s) per core:2 > > > > > Core(s) per socket:18 > > > > > Socket(s): 2 > > > > > NUMA node(s): 2 > > > > > Vendor ID: GenuineIntel > > > > > CPU family:6 > > > > > Model: 85 > > > > > Model name:Intel(R) Xeon(R) Platinum 8124M CPU @ > > 3.00GHz > > > > > Stepping: 3 > > > > > CPU MHz: 3000.000 > > > > > BogoMIPS: 6000.00 > > > > > Hypervisor vendor: KVM > > > > > Virtualization type: full > > > > > L1d cache: 32K > > > > > L1i cache: 32K > > > > > L2 cache: 1024K > > > > > L3 cache: 25344K > > > > > NUMA node0 CPU(s): 0-17,36-53 > > > > > NUMA node1 CPU(s): 18-35,54-71 > > > > > Flags: fpu vme de pse tsc msr pae mce cx8 apic > > sep mtrr > > > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall > nx > > > pdpe1gb > > > > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology > > nonstop_tsc > > > > > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 > > pcid > > > sse4_1 > > > > > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx > f16c
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
I will try to run a few benchmarks on a bare-metal instance tonight to remove virtualization variance from the measurements and provide some numbers.

Please propose a set of models / examples that would be desirable to run before the release, and provide a link to an easy-to-run script with instructions so we can validate the release better.

Thank you.

On Thu, Jun 27, 2019 at 10:01 AM Lai Wei wrote:
>
> Dear @dev,
>
> I'm cancelling the vote for the cached op fix:
>
> https://github.com/apache/incubator-mxnet/pull/15298
>
> As for the possible CPU training regression, it does not look like a blocker for now.
>
> I will start a new rc2 vote, please help to validate.
>
> Thanks!
>
>
> On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong wrote:
>
> > Hi Pedro,
> >
> > I was able to reproduce a similar result (v1.5 is ~5.6% slower than v1.4; I was using 18 cores for computing) with your script on C5.18xlarge.
> > But I needed to bind the cores with the commands below when running the script (without setting the env variables, the times for v1.5 and v1.4 were close, within 1%):
> > export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> > export OMP_NUM_THREADS=18
> >
> > Did you set any env variables when running?
> >
> > The performance results I got are below:
> > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > real 12m10.856s
> > user 234m49.576s
> > sys 4m38.044s
> >
> > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > real 12m52.140s
> > user 246m30.740s
> > sys 5m8.188s
> >
> > Looking at the profiling data, most of the ops have the same perf between v1.4 and v1.5. But some ops, like "_backward_BatchNorm" and "Pooling", are ~1.37x slower on v1.5 compared with v1.4.
> > I will do further analysis on these ops.
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Dear @dev,

I'm cancelling the vote for the cached op fix:

https://github.com/apache/incubator-mxnet/pull/15298

As for the possible CPU training regression, it does not look like a blocker for now.

I will start a new rc2 vote, please help to validate.

Thanks!

On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong wrote:

> Hi Pedro,
>
> I was able to reproduce a similar result (v1.5 is ~5.6% slower than v1.4; I was using 18 cores for computing) with your script on C5.18xlarge.
> But I needed to bind the cores with the commands below when running the script (without setting the env variables, the times for v1.5 and v1.4 were close, within 1%):
> export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> export OMP_NUM_THREADS=18
>
> Did you set any env variables when running?
>
> The performance results I got are below:
> 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> real 12m10.856s
> user 234m49.576s
> sys 4m38.044s
>
> 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> real 12m52.140s
> user 246m30.740s
> sys 5m8.188s
>
> Looking at the profiling data, most of the ops have the same perf between v1.4 and v1.5. But some ops, like "_backward_BatchNorm" and "Pooling", are ~1.37x slower on v1.5 compared with v1.4.
> I will do further analysis on these ops.
>
> Here's the hardware/OS info from my side:
> --Python Info--
> Version : 3.6.8
> Compiler : GCC 7.3.0
> Build: ('default', 'Dec 30 2018 01:22:34')
> Arch : ('64bit', '')
> Pip Info---
> Version : 19.0.3
> Directory: /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> --MXNet Info---
> Version : 1.5.0
> Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
> Hashtag not found. Not installed from pre-built package.
RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Hi Pedro,

I was able to reproduce a similar result (v1.5 is ~5.6% slower than v1.4; I was using 18 cores for computing) with your script on C5.18xlarge.
But I needed to bind the cores with the commands below when running the script (without setting the env variables, the times for v1.5 and v1.4 were close, within 1%):
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=18

Did you set any env variables when running?

The performance results I got are below:
1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
real 12m10.856s
user 234m49.576s
sys 4m38.044s

2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
real 12m52.140s
user 246m30.740s
sys 5m8.188s

Looking at the profiling data, most of the ops have the same perf between v1.4 and v1.5. But some ops, like "_backward_BatchNorm" and "Pooling", are ~1.37x slower on v1.5 compared with v1.4.
I will do further analysis on these ops.

Here's the hardware/OS info from my side:
--Python Info--
Version : 3.6.8
Compiler : GCC 7.3.0
Build: ('default', 'Dec 30 2018 01:22:34')
Arch : ('64bit', '')
Pip Info---
Version : 19.0.3
Directory: /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
--MXNet Info---
Version : 1.5.0
Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
--System Info--
Platform : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
system : Linux
node : ip-172-31-32-129
release : 4.4.0-1085-aws
version : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
--Hardware Info--
machine : x86_64
processor: x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping: 3
CPU MHz: 3000.000
BogoMIPS: 6000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb
rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc
aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase
tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx
smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
--Network Test--

-Ciyong


-----Original Message-----
From: Zhao, Patric [mailto:patric.z...@intel.com]
Sent: Thursday, June 27, 2019 9:55 AM
To: dev@mxnet.incubator.apache.org
Cc: d...@mxnet.apache.org
Subject: RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

Could we run more epochs to see the performance difference, or profile the difference between the good and bad runs?

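A minimal sketch, not part of the original message, of how the "~5.6% slower" figure follows from the two `real` wall-clock times Ciyong reported with the cores bound. The helper names `parse_real` and `slowdown_percent` are made up for illustration; only the two timings come from the thread.

```python
import re


def parse_real(value: str) -> float:
    """Convert a `time`-style duration such as '12m10.856s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def slowdown_percent(baseline: str, candidate: str) -> float:
    """Relative increase of candidate over baseline, in percent."""
    base = parse_real(baseline)
    return (parse_real(candidate) - base) / base * 100.0


# `real` times from the message: 1.4.1.rc0 first, then 1.5.0.rc1.
bound_cores = slowdown_percent("12m10.856s", "12m52.140s")
print(f"v1.5 vs v1.4 with cores bound: {bound_cores:.2f}% slower")
```

Only wall-clock (`real`) time is compared here; the `user` and `sys` figures sum CPU time across threads, so they are not a direct measure of how long the run took.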
RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Could we run more epochs to see the performance difference, or profile the difference between the good and bad runs?

> -----Original Message-----
> From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> Sent: Thursday, June 27, 2019 9:35 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
>
> I ran it again and the gap is bigger again; I guess we need to average out the
> times across several runs:
>
> piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5
> && time ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> [23:17:09] ../src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads for decoding..
> [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> [23:17:09] ../src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads for decoding..
> [23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> lr_schedule: {0: 0.05, 82: 0.005001, 123: 0.0005, 300: 0.0001}
> Epoch 0, Changed learning rate to 0.05
> [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 147456 bytes with malloc directly
> [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 589824 bytes with malloc directly
> [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 2359296 bytes with malloc directly
> [23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 9437184 bytes with malloc directly
> Epoch 0, Batch 199, Speed=384.149839
> Epoch 0, Duration=140.919567
> Epoch 0, Training accuracy=0.115169
> Epoch 0, Validation accuracy=0.141317
> Epoch 1, Batch 199, Speed=433.380512
> Epoch 1, Duration=119.553233
> Epoch 1, Training accuracy=0.170956
> Epoch 1, Validation accuracy=0.216146
> Epoch 2, Batch 199, Speed=434.864699
> Epoch 2, Duration=123.278490
> Epoch 2, Training accuracy=0.209455
> Epoch 2, Validation accuracy=0.247296
> Epoch 3, Batch 199, Speed=433.401854
> Epoch 3, Duration=118.327797
> Epoch 3, Training accuracy=0.248701
> Epoch 3, Validation accuracy=0.302083
> Epoch 4, Batch 199, Speed=419.713707
> Epoch 4, Duration=126.468409
> Epoch 4, Training accuracy=0.260949
> Epoch 4, Validation accuracy=0.269030
>
> real 10m55.796s
> user 399m33.567s
> sys 13m55.904s
> [23:28:04] ../src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads for decoding..
> [23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> [23:28:04] ../src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads for decoding..
> [23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> lr_schedule: {0: 0.05, 82: 0.005001, 123: 0.0005, 300: 0.0001}
> Epoch 0, Changed learning rate to 0.05
> Epoch 0, Batch 199, Speed=419.039188
> Epoch 0, Duration=143.934903
> Epoch 0, Training accuracy=0.122542
> Epoch 0, Validation accuracy=0.164359
> Epoch 1, Batch 199, Speed=445.257048
> Epoch 1, Duration=135.248399
> Epoch 1, Training accuracy=0.178828
> Epoch 1, Validation accuracy=0.199419
> Epoch 2, Batch 199, Speed=447.115215
> Epoch 2, Duration=132.003770
> Epoch 2, Training accuracy=0.217808
> Epoch 2, Validation accuracy=0.233073
> Epoch 3, Batch 199, Speed=441.079477
> Epoch 3, Duration=126.543316
> Epoch 3, Training accuracy=0.248102
> Epoch 3, Validation accuracy=0.293870
> Epoch 4, Batch 199, Speed=449.329787
> Epoch 4, Duration=138.398325
> Epoch 4, Training accuracy=0.270021
> Epoch 4, Validation accuracy=0.311498
>
> real 11m45.329s
> user 426m13.908s
> sys 16m45.093s
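One possible way to do the averaging Pedro suggests, sketched here as an illustration and not taken from anyone's actual tooling: the per-epoch `Duration` values are copied by hand from the two runs in the quoted log (the first run is the mxnet_1.4 venv, the second the mxnet_1.5 venv), and the mean and spread are computed with the standard library. The variable names are assumptions.

```python
from statistics import mean, stdev

# Per-epoch "Duration" values in seconds, copied from the quoted log.
durations = {
    "1.4": [140.919567, 119.553233, 123.278490, 118.327797, 126.468409],
    "1.5": [143.934903, 135.248399, 132.003770, 126.543316, 138.398325],
}

# Mean and sample standard deviation per version.
summary = {version: (mean(d), stdev(d)) for version, d in durations.items()}
for version, (avg, spread) in summary.items():
    epochs = len(durations[version])
    print(f"v{version}: mean {avg:.1f}s, stdev {spread:.1f}s over {epochs} epochs")

# Relative gap between the mean epoch times.
gap_percent = (summary["1.5"][0] / summary["1.4"][0] - 1.0) * 100.0
print(f"v1.5 mean epoch time is {gap_percent:.1f}% higher than v1.4")
```

Five epochs from a single run each is a small sample, and the spread within each run is of the same order as the gap, which is consistent with the request to repeat the measurement across several runs before drawing conclusions.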
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> > > the performance.
> > >
> > > Here's the command I used to collect the time:
> > > python train_cifar10.py --num-epoch=5
> > >
> > > 1) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > > real 9m4.880s
> > > user 333m13.340s
> > > sys 14m36.100s
> > >
> > > 2) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > > real 9m2.155s
> > > user 329m37.092s
> > > sys 16m8.668s
> > >
> > > -Ciyong
> > >
> > >
> > > -----Original Message-----
> > > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> > > Sent: Wednesday, June 26, 2019 6:28 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Cc: d...@mxnet.apache.org
> > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> > >
> > > Hi these were my build flags and system info:
> > >
> > >
> > > --- # CMake configuration
> > > USE_CUDA: "OFF" # Build with CUDA support
> > > USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
> > > USE_NCCL: "OFF" # Use NVidia NCCL with CUDA
> > > USE_OPENCV: "ON" # Build with OpenCV support
> > > USE_OPENMP: "ON" # Build with Openmp support
> > > USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT for search path
> > > USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
> > > USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects support if "ON"
> > > USE_LAPACK: "ON" # Build with lapack support
> > > USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found
> > > USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
> > > USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
> > > USE_JEMALLOC: "ON" # Build with Jemalloc support
> > > USE_PROFILER: "ON" # Build with Profiler support
> > > USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
> > > USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
> > > USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
> > > USE_CPP_PACKAGE: "OFF" # Build C++ Package
> > > USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
> > > USE_GPROF: "OFF" # Compile with gprof (profiling) flag
> > > USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
> > > USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could set VTUNE_ROOT for search path
> > > ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
> > > BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
> > > INSTALL_EXAMPLES: "OFF" # Install the example source files.
> > > USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
> > > USE_TENSORRT: "OFF" # Enable infeference optimization with TensorRT.
> > > USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
> > > ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric output
> > > CMAKE_BUILD_TYPE: "Release"
> > > CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
> > > CMAKE_C_COMPILER_LAUNCHER: "ccache"
> > > CMAKE_CXX_COMPILER_LAUNCHER: "ccache"
> > >
> > > commit 4d9667121ae6fb643f2a02ab15e25231ed756cde (HEAD, tag: 1.5.0.rc1, upstream/v1.5.x)
> > > commit 1a7199691f5cbc6012bb53eecbf884bed5ae6590 (HEAD, tag: 1.4.1.rc0, upstream/v1.4.x)
> > >
> > > curl http://169.254.169.254/latest/meta-data/instance-type
> > > c5d.18xlarge
> > >
> > >
> > > Version : 3.6.7
> > > Compiler : GCC 8.2.0
> > > Build: ('default', 'Oct 22 2018 11:32:17')
> > > Arch : ('64bit', 'ELF')
> > > Pip Info---
> > > Version : 19.1.1
> > > Directory: /home/piotr/mxnet_1
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Hi Ciyong, thanks for trying to reproduce:

I used this one:
https://github.com/awslabs/deeplearning-benchmark/blob/master/dawnbench/cifar10.py

Could you provide hardware and OS details?

I will rerun and repost numbers in a few minutes.

Pedro.

On Wed, Jun 26, 2019 at 4:18 AM Chen, Ciyong wrote:
>
> Hi Pedro,
>
> I'm looking at this case, and using the script of
> "incubator-mxnet/example/image-classification/train_cifar10.py" to get
> the timing data, but it seems there's not much difference between mxnet
> 1.4.1.rc0 and 1.5.0.rc1 on C5.18xlarge.
>
> Not sure if there's any difference in the python script, can you point me to
> the link to get your script (cifar10.py)?
> Or you can also have a try with MXNet's script (train_cifar10.py) and see the
> performance.
>
> Here's the command I used to collect the time:
> python train_cifar10.py --num-epoch=5
>
> 1) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> real    9m4.880s
> user    333m13.340s
> sys     14m36.100s
>
> 2) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> real    9m2.155s
> user    329m37.092s
> sys     16m8.668s
>
> -Ciyong
RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Hi Pedro,

I'm looking at this case, and using the script of "incubator-mxnet/example/image-classification/train_cifar10.py" to get the timing data, but it seems there's not much difference between mxnet 1.4.1.rc0 and 1.5.0.rc1 on C5.18xlarge.

Not sure if there's any difference in the python script, can you point me to the link to get your script (cifar10.py)?
Or you can also have a try with MXNet's script (train_cifar10.py) and see the performance.

Here's the command I used to collect the time:
python train_cifar10.py --num-epoch=5

1) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
real    9m4.880s
user    333m13.340s
sys     14m36.100s

2) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
real    9m2.155s
user    329m37.092s
sys     16m8.668s

-Ciyong


-----Original Message-----
From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
Sent: Wednesday, June 26, 2019 6:28 AM
To: dev@mxnet.incubator.apache.org
Cc: d...@mxnet.apache.org
Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

Hi these were my build flags and system info:


--- # CMake configuration
USE_CUDA: "OFF" # Build with CUDA support
USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
USE_NCCL: "OFF" # Use NVidia NCCL with CUDA
USE_OPENCV: "ON" # Build with OpenCV support
USE_OPENMP: "ON" # Build with Openmp support
USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT for search path
USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects support if "ON"
USE_LAPACK: "ON" # Build with lapack support
USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found
USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
USE_JEMALLOC: "ON" # Build with Jemalloc support
USE_PROFILER: "ON" # Build with Profiler support
USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
USE_CPP_PACKAGE: "OFF" # Build C++ Package
USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
USE_GPROF: "OFF" # Compile with gprof (profiling) flag
USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could set VTUNE_ROOT for search path
ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
INSTALL_EXAMPLES: "OFF" # Install the example source files.
USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
USE_TENSORRT: "OFF" # Enable inference optimization with TensorRT.
USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric output
CMAKE_BUILD_TYPE: "Release"
CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
CMAKE_C_COMPILER_LAUNCHER: "ccache"
CMAKE_CXX_COMPILER_LAUNCHER: "ccache"

commit 4d9667121ae6fb643f2a02ab15e25231ed756cde (HEAD, tag: 1.5.0.rc1, upstream/v1.5.x)
commit 1a7199691f5cbc6012bb53eecbf884bed5ae6590 (HEAD, tag: 1.4.1.rc0, upstream/v1.4.x)

curl http://169.254.169.254/latest/meta-data/instance-type
c5d.18xlarge


Version   : 3.6.7
Compiler  : GCC 8.2.0
Build     : ('default', 'Oct 22 2018 11:32:17')
Arch      : ('64bit', 'ELF')
Pip Info---
Version   : 19.1.1
Directory : /home/piotr/mxnet_1.5/py3_venv/lib/python3.6/site-packages/pip
--MXNet Info---
Version   : 1.5.0
Directory : /home/piotr/mxnet_1.5/python/mxnet
Hashtag not found. Not installed from pre-built package.
--System Info--
Platform  : Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-18.04-bionic
system    : Linux
node      : ip-172-31-63-171
release   : 4.15.0-1035-aws
version   : #37-Ubuntu SMP Mon Mar 18 16:15:14 UTC 2019
--Hardware Info--
machine   : x86_64
processor : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              72
On-line CPU(s) list: 0-71
Thread(s) per core:  2
Core(s) per socket:  18
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Mode
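A note on the measurement method used above: timings like these can vary between runs, so it may help to repeat the command and compare medians rather than single numbers. The following is only a minimal sketch; `time_command` is a hypothetical helper (not part of MXNet), and the placeholder workload stands in for the real training command.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Run `argv` `repeats` times and return the wall-clock seconds of each run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True)  # raises if the command fails
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder workload. For the benchmark in this thread one would pass e.g.
    # [sys.executable, "train_cifar10.py", "--num-epoch=5"] instead.
    runs = time_command([sys.executable, "-c", "pass"], repeats=3)
    spread = max(runs) - min(runs)
    print(f"median {statistics.median(runs):.3f}s, spread {spread:.3f}s")
```

Reporting the spread next to the median makes it easier to judge whether a difference of a few seconds between two releases is larger than the run-to-run variation.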
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
2018 11:32:17')
Arch      : ('64bit', 'ELF')
Pip Info---
Version   : 19.1.1
Directory : /home/piotr/mxnet_1.4/py3_venv/lib/python3.6/site-packages/pip
--MXNet Info---
Version   : 1.4.1
Directory : /home/piotr/mxnet_1.4/python/mxnet
Hashtag not found. Not installed from pre-built package.
--System Info--
Platform  : Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-18.04-bionic
system    : Linux
node      : ip-172-31-63-171
release   : 4.15.0-1035-aws
version   : #37-Ubuntu SMP Mon Mar 18 16:15:14 UTC 2019
--Hardware Info--
machine   : x86_64
processor : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              72
On-line CPU(s) list: 0-71
Thread(s) per core:  2
Core(s) per socket:  18
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:            4
CPU MHz:             1223.344
BogoMIPS:            6000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-17,36-53
NUMA node1 CPU(s):   18-35,54-71
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
--Network Test--

On Tue, Jun 25, 2019 at 2:35 PM Pedro Larroy wrote:
>
> I did a training of cifar10 on CPU and it seems there is a regression
> of roughly 7% in training time against 1.4.1:
>
> (py3_venv) piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench (master)+$ time python cifar10.py --epochs 5
> real    11m30.388s
> user    417m7.766s
> sys     16m57.315s
>
> VS 1.4.1:
> real    10m41.994s
> user    392m40.646s
> sys     12m30.601s
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
I did a training of cifar10 on CPU and it seems there is a regression of roughly 7% in training time against 1.4.1:

(py3_venv) piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench (master)+$ time python cifar10.py --epochs 5
real    11m30.388s
user    417m7.766s
sys     16m57.315s

VS 1.4.1:
real    10m41.994s
user    392m40.646s
sys     12m30.601s

On Thu, Jun 20, 2019 at 10:15 PM Lai Wei wrote:
>
> Hi Anirudh,
>
> Thanks for jumping into this quickly, I followed up on the issue.
>
> It was meant for sockeye developers/maintainers to help set up nightly tests
> and raise issues early.
>
> Thanks!
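For reference, the "7%" figure can be checked directly from the `time` output quoted in this message. The snippet below is a small sketch, not a benchmark tool; `parse_time` and `percent_change` are hypothetical helper names introduced here for illustration.

```python
import re


def parse_time(text):
    """Convert a `time`-style duration such as '11m30.388s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def percent_change(baseline, candidate):
    """Relative change of `candidate` against `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Wall-clock ("real") times reported above for cifar10.py --epochs 5.
real_1_4_1 = parse_time("10m41.994s")
real_1_5_0 = parse_time("11m30.388s")

print(f"{percent_change(real_1_4_1, real_1_5_0):.1f}%")  # prints 7.5%
```

The same two helpers can be applied to the `user` and `sys` rows, or to the train_cifar10.py numbers posted elsewhere in the thread, to compare the two releases on an equal footing.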
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Hi Anirudh,

Thanks for jumping into this quickly, I followed up on the issue.

It was meant for sockeye developers/maintainers to help set up nightly tests and raise issues early.

Thanks!

On Fri, Jun 21, 2019 at 10:10 AM Haibin Lin wrote:

> In GluonNLP we are testing with MXNet nightly build for each PR, and we did
> find some MXNet related issue caught by the CI.
> I recommend other toolkits also add integration tests with MXNet nightly.
> It helps identify issues early.
>
> Best,
> Haibin
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
In GluonNLP we are testing with MXNet nightly build for each PR, and we did find some MXNet related issue caught by the CI.
I recommend other toolkits also add integration tests with MXNet nightly.
It helps identify issues early.

Best,
Haibin

On Thu, Jun 20, 2019 at 18:52 Zhao, Patric wrote:

> Thanks for raising the issue and we will take a look ASAP.
>
> The downstream cases are not in the MXNet CI so it's hard to catch the
> potential bugs or performance degradation for MXNet developers.
>
> In the future, I suggest adding the major downstream test cases, like from
> sockeye, GluonNLP, GluonCV, DGL, Gluon-TS, into the nightly test.
> If it's still too heavy, maybe testing it weekly or monthly :)
>
> Thanks,
>
> --Patric
RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Thanks for raising the issue and we will take a look ASAP.

The downstream cases are not in the MXNet CI so it's hard to catch the potential bugs or performance degradation for MXNet developers.

In the future, I suggest adding the major downstream test cases, like from sockeye, GluonNLP, GluonCV, DGL, Gluon-TS, into the nightly test.
If it's still too heavy, maybe testing it weekly or monthly :)

Thanks,

--Patric

> -----Original Message-----
> From: Anirudh Subramanian [mailto:anirudh2...@gmail.com]
> Sent: Friday, June 21, 2019 9:31 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
>
> Hi Lai,
>
> I have opened an issue:
> https://github.com/apache/incubator-mxnet/issues/15297
> I came to know about this issue only today and I have not been monitoring
> sockeye.
> I jumped onto this issue to make sure it wasn't caused by the dlpack changes.
> Also, I don't think sockeye CI checks against master, it is using 1.4.1.
>
> Anirudh
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Hi Lai,

I have opened an issue: https://github.com/apache/incubator-mxnet/issues/15297

I only learned about this issue today, and I have not been monitoring sockeye. I jumped onto this issue to make sure it wasn't caused by the dlpack changes. Also, I don't think sockeye CI checks against master; it is using 1.4.1.

Anirudh

On Thu, Jun 20, 2019 at 6:17 PM Lai Wei wrote:
> Hi,
>
> Could you share which test failed and what the crash is? How can it be
> reproduced?
>
> I was able to install sockeye and run all the tests, and they passed, using
> python setup.py test
>
> I have tested both the nightly pip package and 1.5.0.rc1.
>
> It would be great to create an issue with reproducible steps and move the
> discussion there.
>
> Also, I see the sockeye nightly build [1] has been failing for some time.
> If it's due to an MXNet change, please raise this early so we can track and
> solve it in time rather than blocking the release during the vote.
>
> [1] https://travis-ci.org/awslabs/sockeye
>
> --
> Best Regards
>
> Lai
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Hi,

Could you share which test failed and what the crash is? How can it be reproduced?

I was able to install sockeye and run all the tests, and they passed, using
python setup.py test

I have tested both the nightly pip package and 1.5.0.rc1.

It would be great to create an issue with reproducible steps and move the discussion there.

Also, I see the sockeye nightly build [1] has been failing for some time. If it's due to an MXNet change, please raise this early so we can track and solve it in time rather than blocking the release during the vote.

[1] https://travis-ci.org/awslabs/sockeye

On Fri, Jun 21, 2019 at 7:01 AM Anirudh Subramanian wrote:
> I was able to reproduce a crash with the commit
> 09202f7f261954383aa387144524d38f83f18d06 but not with the commit
> a862270beb2d796c1ba311183f7f4a766a18ad6c.
>
> Anirudh

--
Best Regards

Lai
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
I was able to reproduce a crash with the commit 09202f7f261954383aa387144524d38f83f18d06 but not with the commit a862270beb2d796c1ba311183f7f4a766a18ad6c.

Anirudh

On Thu, Jun 20, 2019 at 3:53 PM Lai Wei wrote:
> Hi Przemyslaw,
>
> Is there an issue with more details to track the problem?
>
> --
> Best Regards
>
> Lai
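The check reported in this message (the crash reproduces at one candidate commit but not at the other) could be scripted roughly as below. This is only a minimal sketch, not the procedure anyone in the thread actually used: `sockeye_tests_pass`, `classify_commits`, and the `tests_pass_at` callback are hypothetical names, and building and installing MXNet at a given commit is left entirely to the caller. The only detail taken from the thread is the `python setup.py test` command.

```python
import subprocess
from typing import Callable, Dict, Iterable


def sockeye_tests_pass(sockeye_dir: str) -> bool:
    """Run `python setup.py test` inside a sockeye checkout.

    Returns True when the test run exits with status 0. A crash in the
    interpreter shows up as a non-zero (possibly negative) return code.
    """
    result = subprocess.run(["python", "setup.py", "test"], cwd=sockeye_dir)
    return result.returncode == 0


def classify_commits(
    commits: Iterable[str],
    tests_pass_at: Callable[[str], bool],
) -> Dict[str, str]:
    """Label each candidate MXNet commit as "ok" or "crash".

    `tests_pass_at` is a hypothetical callback supplied by the caller. It is
    assumed to build and install MXNet at the given commit and then report
    whether the downstream tests pass (for example via `sockeye_tests_pass`).
    """
    return {
        commit: "ok" if tests_pass_at(commit) else "crash"
        for commit in commits
    }
```

Keeping the build step behind a callback means the labelling logic can be exercised without an MXNet build, and the same function works for more than two candidate commits.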
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Hi Przemyslaw,

Is there an issue with more details to track the problem?

On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak wrote:
> -1
>
> There is a crash in the sockeye unit tests (python setup.py test), observed
> starting with the nightly 1.5 build from 6/13 and still occurring in 1.5rc1.
> I don't yet have the exact commit that is responsible for it, but it is
> either a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or
> 09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).

--
Best Regards

Lai
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
-1

There is a crash in the sockeye unit tests (python setup.py test), observed starting with the nightly 1.5 build from 6/13 and still occurring in 1.5rc1. I don't yet have the exact commit that is responsible for it, but it is either a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or 09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).

On 2019/06/20 06:36:22, Lai Wei wrote:
> Dear MXNet community,
>
> This is the 3-day vote to release Apache MXNet (incubating) version 1.5.0.
> Voting on dev@ will start June 19, 23:59:59 (PST) and close on June 22,
> 23:59:59.
>
> 1) Link to release notes:
> https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
>
> 2) Link to release candidate:
> https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1
>
> 3) Link to source and signatures on apache dist server:
> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/
>
> Please remember to TEST first before voting accordingly:
>
> +1 = approve
> +0 = no opinion
> -1 = disapprove (provide reason)
> --
> Best Regards
>
> Lai
[VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Dear MXNet community,

This is the 3-day vote to release Apache MXNet (incubating) version 1.5.0. Voting on dev@ will start June 19, 23:59:59 (PST) and close on June 22, 23:59:59.

1) Link to release notes:
https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes

2) Link to release candidate:
https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1

3) Link to source and signatures on apache dist server:
https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/

Please remember to TEST first before voting accordingly:

+1 = approve
+0 = no opinion
-1 = disapprove (provide reason)

--
Best Regards

Lai