I ran it again and the gap is bigger again. I guess we need to average the times across several runs:
piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench (master)+$ time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5 && time ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
[23:17:09] ../src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads for decoding..
[23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
[23:17:09] ../src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads for decoding..
[23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123: 0.0005, 300: 0.0001}
Epoch 0, Changed learning rate to 0.05
[23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 147456 bytes with malloc directly
[23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 589824 bytes with malloc directly
[23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 2359296 bytes with malloc directly
[23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 9437184 bytes with malloc directly
Epoch 0, Batch 199, Speed=384.149839
Epoch 0, Duration=140.919567
Epoch 0, Training accuracy=0.115169
Epoch 0, Validation accuracy=0.141317
Epoch 1, Batch 199, Speed=433.380512
Epoch 1, Duration=119.553233
Epoch 1, Training accuracy=0.170956
Epoch 1, Validation accuracy=0.216146
Epoch 2, Batch 199, Speed=434.864699
Epoch 2, Duration=123.278490
Epoch 2, Training accuracy=0.209455
Epoch 2, Validation accuracy=0.247296
Epoch 3, Batch 199, Speed=433.401854
Epoch 3, Duration=118.327797
Epoch 3, Training accuracy=0.248701
Epoch 3, Validation accuracy=0.302083
Epoch 4, Batch 199, Speed=419.713707
Epoch 4, Duration=126.468409
Epoch 4, Training accuracy=0.260949
Epoch 4, Validation accuracy=0.269030

real 10m55.796s
user 399m33.567s
sys 13m55.904s
[23:28:04] ../src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads for decoding..
[23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
[23:28:04] ../src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads for decoding..
[23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean image from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123: 0.0005, 300: 0.0001}
Epoch 0, Changed learning rate to 0.05
Epoch 0, Batch 199, Speed=419.039188
Epoch 0, Duration=143.934903
Epoch 0, Training accuracy=0.122542
Epoch 0, Validation accuracy=0.164359
Epoch 1, Batch 199, Speed=445.257048
Epoch 1, Duration=135.248399
Epoch 1, Training accuracy=0.178828
Epoch 1, Validation accuracy=0.199419
Epoch 2, Batch 199, Speed=447.115215
Epoch 2, Duration=132.003770
Epoch 2, Training accuracy=0.217808
Epoch 2, Validation accuracy=0.233073
Epoch 3, Batch 199, Speed=441.079477
Epoch 3, Duration=126.543316
Epoch 3, Training accuracy=0.248102
Epoch 3, Validation accuracy=0.293870
Epoch 4, Batch 199, Speed=449.329787
Epoch 4, Duration=138.398325
Epoch 4, Training accuracy=0.270021
Epoch 4, Validation accuracy=0.311498

real 11m45.329s
user 426m13.908s
sys 16m45.093s

On Wed, Jun 26, 2019 at 4:18 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>
> The difference looks smaller now, more like your numbers. I wonder if
> something happened during the previous benchmark like a system
> update...
>
>
> piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench (master)+$
> time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5 && time
> ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> [22:49:41] ../src/io/iter_image_recordio_2.cc:172:
> ImageRecordIOParser2:
> /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads
> for decoding..
> [22:49:41] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [22:49:41] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> [22:49:41] ../src/io/iter_image_recordio_2.cc:172:
> ImageRecordIOParser2:
> /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads
> for decoding..
> [22:49:41] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [22:49:41] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123: 0.0005, 300: 0.0001}
> Epoch 0, Changed learning rate to 0.05
> [22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> 147456 bytes with malloc directly
> [22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> 589824 bytes with malloc directly
> [22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> 2359296 bytes with malloc directly
> [22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
> 9437184 bytes with malloc directly
> Epoch 0, Batch 199, Speed=426.182733
> Epoch 0, Duration=134.868458
> Epoch 0, Training accuracy=0.127238
> Epoch 0, Validation accuracy=0.206388
> Epoch 1, Batch 199, Speed=313.127156
> Epoch 1, Duration=128.041775
> Epoch 1, Training accuracy=0.182065
> Epoch 1, Validation accuracy=0.202524
> Epoch 2, Batch 199, Speed=410.931187
> Epoch 2, Duration=124.920588
> Epoch 2, Training accuracy=0.202584
> Epoch 2, Validation accuracy=0.245693
> Epoch 3, Batch 199, Speed=419.119335
> Epoch 3, Duration=120.948349
> Epoch 3, Training accuracy=0.235854
> Epoch 3, Validation accuracy=0.291066
> Epoch 4, Batch 199, Speed=430.473733
> Epoch 4, Duration=130.181724
> Epoch 4, Training accuracy=0.257773
> Epoch 4, Validation accuracy=0.304988
>
> real 11m7.356s
> user 406m9.910s
> sys 14m18.349s
> [23:00:49] ../src/io/iter_image_recordio_2.cc:172:
> ImageRecordIOParser2:
> /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads
> for decoding..
> [23:00:49] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:00:49] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> [23:00:49] ../src/io/iter_image_recordio_2.cc:172:
> ImageRecordIOParser2:
> /home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads
> for decoding..
> [23:00:49] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [23:00:49] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
> lr_schedule: {0: 0.05, 82: 0.005000000000000001, 123: 0.0005, 300: 0.0001}
> Epoch 0, Changed learning rate to 0.05
> Epoch 0, Batch 199, Speed=348.618154
> Epoch 0, Duration=146.469352
> Epoch 0, Training accuracy=0.124121
> Epoch 0, Validation accuracy=0.167227
> Epoch 1, Batch 199, Speed=452.790825
> Epoch 1, Duration=130.199421
> Epoch 1, Training accuracy=0.183863
> Epoch 1, Validation accuracy=0.237079
> Epoch 2, Batch 199, Speed=451.406559
> Epoch 2, Duration=126.320823
> Epoch 2, Training accuracy=0.214844
> Epoch 2, Validation accuracy=0.244692
> Epoch 3, Batch 199, Speed=403.161873
> Epoch 3, Duration=125.331660
> Epoch 3, Training accuracy=0.243506
> Epoch 3, Validation accuracy=0.301182
> Epoch 4, Batch 199, Speed=450.826598
> Epoch 4, Duration=126.426253
> Epoch 4, Training accuracy=0.266424
> Epoch 4, Validation accuracy=0.311899
>
> real 11m21.930s
> user 415m3.855s
> sys 13m53.975s
>
> On Wed, Jun 26, 2019 at 3:50 PM Pedro Larroy
> <pedro.larroy.li...@gmail.com> wrote:
> >
> > Hi Ciyong, thanks for trying to reproduce:
> >
> > I used this one:
> > https://github.com/awslabs/deeplearning-benchmark/blob/master/dawnbench/cifar10.py
> >
> > Could you provide hardware and OS details?
> >
> > I will rerun and repost numbers in a few minutes.
> >
> > Pedro.
> > > > On Wed, Jun 26, 2019 at 4:18 AM Chen, Ciyong <ciyong.c...@intel.com> wrote: > > > > > > Hi Pedro, > > > > > > I'm looking at this case, and using the script of > > > "incubator-mxnet/example/image-classification/train_cifar10.py" to get > > > the timing data, but seems there's not much difference between mxnet > > > 1.4.1.rc0 and 1.5.0.rc1 on C5.18xlarge. > > > > > > Not sure if there's any difference in the python script, can you point me > > > the link to get your script (cifar10.py)? > > > Or you can also have a try with MXNet's script (train_cifar10.py) and see > > > the performance. > > > > > > Here's the command I used to collect the time: > > > python train_cifar10.py --num-epoch=5 > > > > > > 1) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde) > > > real 9m4.880s > > > user 333m13.340s > > > sys 14m36.100s > > > > > > 2) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590) > > > real 9m2.155s > > > user 329m37.092s > > > sys 16m8.668s > > > > > > -Ciyong > > > > > > > > > -----Original Message----- > > > From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com] > > > Sent: Wednesday, June 26, 2019 6:28 AM > > > To: dev@mxnet.incubator.apache.org > > > Cc: d...@mxnet.apache.org > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1 > > > > > > Hi these were my build flags and system info: > > > > > > > > > --- # CMake configuration > > > USE_CUDA: "OFF" # Build with CUDA support > > > USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda > > > USE_NCCL: "OFF" # Use NVidia NCCL with CUDA > > > USE_OPENCV: "ON" # Build with OpenCV support > > > USE_OPENMP: "ON" # Build with Openmp support > > > USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT > > > for search path > > > USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM > > > USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects > > > support if "ON" > > > USE_LAPACK: "ON" # Build with lapack support > > > 
USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found > > > USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL found) IF > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE) > > > USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL found) IF > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE) > > > USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC > > > USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found) > > > USE_JEMALLOC: "ON" # Build with Jemalloc support > > > USE_PROFILER: "ON" # Build with Profiler support > > > USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support > > > USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins > > > USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin > > > USE_CPP_PACKAGE: "OFF" # Build C++ Package > > > USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions. > > > USE_GPROF: "OFF" # Compile with gprof (profiling) flag > > > USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports > > > it > > > USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could > > > set VTUNE_ROOT for search path > > > ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support > > > BUILD_CPP_EXAMPLES: "ON" # Build cpp examples > > > INSTALL_EXAMPLES: "OFF" # Install the example source files. > > > USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults. > > > USE_TENSORRT: "OFF" # Enable infeference optimization with TensorRT. > > > USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers. 
> > > ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric > > > output > > > CMAKE_BUILD_TYPE: "Release" > > > CMAKE_CUDA_COMPILER_LAUNCHER: "ccache" > > > CMAKE_C_COMPILER_LAUNCHER: "ccache" > > > CMAKE_CXX_COMPILER_LAUNCHER: "ccache" > > > > > > commit 4d9667121ae6fb643f2a02ab15e25231ed756cde (HEAD, tag: 1.5.0.rc1, > > > upstream/v1.5.x) > > > commit 1a7199691f5cbc6012bb53eecbf884bed5ae6590 (HEAD, tag: 1.4.1.rc0, > > > upstream/v1.4.x) > > > > > > curl http://169.254.169.254/latest/meta-data/instance-type > > > c5d.18xlarge > > > > > > > > > Version : 3.6.7 > > > Compiler : GCC 8.2.0 > > > Build : ('default', 'Oct 22 2018 11:32:17') > > > Arch : ('64bit', 'ELF') > > > ------------Pip Info----------- > > > Version : 19.1.1 > > > Directory : > > > /home/piotr/mxnet_1.5/py3_venv/lib/python3.6/site-packages/pip > > > ----------MXNet Info----------- > > > Version : 1.5.0 > > > Directory : /home/piotr/mxnet_1.5/python/mxnet > > > Hashtag not found. Not installed from pre-built package. 
> > > ----------System Info---------- > > > Platform : Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-18.04-bionic > > > system : Linux > > > node : ip-172-31-63-171 > > > release : 4.15.0-1035-aws > > > version : #37-Ubuntu SMP Mon Mar 18 16:15:14 UTC 2019 > > > ----------Hardware Info---------- > > > machine : x86_64 > > > processor : x86_64 > > > Architecture: x86_64 > > > CPU op-mode(s): 32-bit, 64-bit > > > Byte Order: Little Endian > > > CPU(s): 72 > > > On-line CPU(s) list: 0-71 > > > Thread(s) per core: 2 > > > Core(s) per socket: 18 > > > Socket(s): 2 > > > NUMA node(s): 2 > > > Vendor ID: GenuineIntel > > > CPU family: 6 > > > Model: 85 > > > Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz > > > Stepping: 4 > > > CPU MHz: 1326.446 > > > BogoMIPS: 6000.00 > > > Hypervisor vendor: KVM > > > Virtualization type: full > > > L1d cache: 32K > > > L1i cache: 32K > > > L2 cache: 1024K > > > L3 cache: 25344K > > > NUMA node0 CPU(s): 0-17,36-53 > > > NUMA node1 CPU(s): 18-35,54-71 > > > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb > > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc > > > cpuid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid > > > sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c > > > rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase > > > tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq > > > rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt > > > xsavec xgetbv1 xsaves ida arat pku ospke ----------Network Test---------- > > > > > > ----------Python Info---------- > > > Version : 3.6.7 > > > Compiler : GCC 8.2.0 > > > Build : ('default', 'Oct 22 2018 11:32:17') > > > Arch : ('64bit', 'ELF') > > > ------------Pip Info----------- > > > Version : 19.1.1 > > > Directory : > > > /home/piotr/mxnet_1.4/py3_venv/lib/python3.6/site-packages/pip > > 
> ----------MXNet Info----------- > > > Version : 1.4.1 > > > Directory : /home/piotr/mxnet_1.4/python/mxnet > > > Hashtag not found. Not installed from pre-built package. > > > ----------System Info---------- > > > Platform : Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-18.04-bionic > > > system : Linux > > > node : ip-172-31-63-171 > > > release : 4.15.0-1035-aws > > > version : #37-Ubuntu SMP Mon Mar 18 16:15:14 UTC 2019 > > > ----------Hardware Info---------- > > > machine : x86_64 > > > processor : x86_64 > > > Architecture: x86_64 > > > CPU op-mode(s): 32-bit, 64-bit > > > Byte Order: Little Endian > > > CPU(s): 72 > > > On-line CPU(s) list: 0-71 > > > Thread(s) per core: 2 > > > Core(s) per socket: 18 > > > Socket(s): 2 > > > NUMA node(s): 2 > > > Vendor ID: GenuineIntel > > > CPU family: 6 > > > Model: 85 > > > Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz > > > Stepping: 4 > > > CPU MHz: 1223.344 > > > BogoMIPS: 6000.00 > > > Hypervisor vendor: KVM > > > Virtualization type: full > > > L1d cache: 32K > > > L1i cache: 32K > > > L2 cache: 1024K > > > L3 cache: 25344K > > > NUMA node0 CPU(s): 0-17,36-53 > > > NUMA node1 CPU(s): 18-35,54-71 > > > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr > > > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb > > > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc > > > cpuid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid > > > sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c > > > rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase > > > tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq > > > rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt > > > xsavec xgetbv1 xsaves ida arat pku ospke ----------Network Test---------- > > > > > > On Tue, Jun 25, 2019 at 2:35 PM Pedro Larroy > > > <pedro.larroy.li...@gmail.com> wrote: > > > > > > > > I did a training of cifar10 in CPU 
and seems there's some regressions > > > > in the range of 7% increase of training time against 1.4.1: > > > > > > > > (py3_venv) piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench > > > > (master)+$ time python cifar10.py --epochs 5 > > > > real 11m30.388s > > > > user 417m7.766s > > > > sys 16m57.315s > > > > > > > > VS 1.4.1: > > > > real 10m41.994s > > > > user 392m40.646s > > > > sys 12m30.601s > > > > > > > > > > > > On Thu, Jun 20, 2019 at 10:15 PM Lai Wei <roywei...@gmail.com> wrote: > > > > > > > > > > Hi Anirudh, > > > > > > > > > > Thanks for jumping into this quickly, I followed up on the issue. > > > > > > > > > > I was meant for sockeye developer/maintainers to help setup nightly > > > > > tests and raise issues early. > > > > > > > > > > Thanks! > > > > > > > > > > On Fri, Jun 21, 2019 at 10:10 AM Haibin Lin > > > > > <haibin.lin....@gmail.com> > > > > > wrote: > > > > > > > > > > > In GluonNLP we are testing with MXNET nightly build for each PR, > > > > > > and we did find some MXNet related issue caught by the CI. > > > > > > I recommend other toolkits also add integration tests with MXNet > > > > > > nightly. > > > > > > It helps identify issues early. > > > > > > > > > > > > Best, > > > > > > Haibin > > > > > > > > > > > > On Thu, Jun 20, 2019 at 18:52 Zhao, Patric <patric.z...@intel.com> > > > > > > wrote: > > > > > > > > > > > > > Thanks to raise the issue and we will take a look ASAP. > > > > > > > > > > > > > > The downstream cases is not in the MXNet CI so it's hard to > > > > > > > catch the potential bugs or performance degradation for MXNet > > > > > > > developers. > > > > > > > > > > > > > > In the future, I suggest adding the major downstream test cases, > > > > > > > like > > > > > > from > > > > > > > sockeye, GluonNLP, GLuonCV, DGL, Gluon-TS, into the nightly test. 
> > > > > > > If it's still too heavy, maybe testing it weekly or monthly :) > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > --Patric > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Anirudh Subramanian [mailto:anirudh2...@gmail.com] > > > > > > > > Sent: Friday, June 21, 2019 9:31 AM > > > > > > > > To: dev@mxnet.incubator.apache.org > > > > > > > > Cc: d...@mxnet.apache.org > > > > > > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version > > > > > > > > 1.5.0.rc1 > > > > > > > > > > > > > > > > Hi Lai, > > > > > > > > > > > > > > > > I have opened an issue: > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15297 > > > > > > > > I came to know about this issue only today and I have not been > > > > > > monitoring > > > > > > > > sockeye. > > > > > > > > I jumped onto this issue to make sure it wasn't caused by the > > > > > > > > dlpack > > > > > > > changes. > > > > > > > > Also, I don't think sockeye CI checks against master, it is > > > > > > > > using > > > > > > 1.4.1. > > > > > > > > > > > > > > > > Anirudh > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jun 20, 2019 at 6:17 PM Lai Wei <roywei...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > Could you share which test failed and what’s the crash? How > > > > > > > > > to reproduce it? > > > > > > > > > > > > > > > > > > I was able to install sockeye and run all tests passed. > > > > > > > > > Using python setup.py test > > > > > > > > > > > > > > > > > > I have tested both nightly pip package and 1.5.0.rc1 > > > > > > > > > > > > > > > > > > It would be great to create an issue with reproducible steps > > > > > > > > > and move the discussion there. 
> > > > > > > > > > > > > > > > > > Also I see sockeye nightly build[1] has been failing for > > > > > > > > > some time, > > > > > > if > > > > > > > > > it’s due to MXNet change, please raise this early so we can > > > > > > > > > track and solve it in time rather than block the release > > > > > > > > > during vote time. > > > > > > > > > > > > > > > > > > [1] https://travis-ci.org/awslabs/sockeye > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Jun 21, 2019 at 7:01 AM Anirudh Subramanian > > > > > > > > > <anirudh2...@gmail.com > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > I was able to reproduce a crash with the commit > > > > > > > > > > 09202f7f261954383aa387144524d38f83f18d06 but not with the > > > > > > > > > > commit a862270beb2d796c1ba311183f7f4a766a18ad6c. > > > > > > > > > > > > > > > > > > > > Anirudh > > > > > > > > > > > > > > > > > > > > On Thu, Jun 20, 2019 at 3:53 PM Lai Wei > > > > > > > > > > <roywei...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi Przemyslaw, > > > > > > > > > > > > > > > > > > > > > > Is there an issue with more details to track the problem? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak > > > > > > > > > > > <ptre...@apache.org> > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > -1 > > > > > > > > > > > > > > > > > > > > > > > > There is a crash in sockeye unit test (python setup.py > > > > > > > > > > > > test) observed starting with nightly 1.5 build from > > > > > > > > > > > > 6/13 and still occuring in > > > > > > > > > > 1.5rc1. 
I > > > > > > > > > > > > don't yet have the exact commit that is responsible > > > > > > > > > > > > for it, but it is either > > > > > > > > > > > > a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack > > > > > > > > > > > > related) or > > > > > > > > > > > > 09202f7f261954383aa387144524d38f83f18d06 (cached op > > > > > > > > optimization). > > > > > > > > > > > > > > > > > > > > > > > > On 2019/06/20 06:36:22, Lai Wei <roywei...@gmail.com> > > > > > > > > > > > > wrote: > > > > > > > > > > > > > Dear MXNet community, > > > > > > > > > > > > > > > > > > > > > > > > > > This is the 3-day vote to release Apache MXNet > > > > > > > > > > > > > (incubating) version > > > > > > > > > > > > 1.5.0. > > > > > > > > > > > > > Voting on dev@ will start June 19, 23:59:59(PST) > > > > > > > > > > > > > and close > > > > > > on > > > > > > > > > June > > > > > > > > > > > 22, > > > > > > > > > > > > > 23:59:59. > > > > > > > > > > > > > > > > > > > > > > > > > > 1) Link to release notes: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+No > > > > > > te > > > > > > > > > > s > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2) Link to release candidate: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.r > > > > > > > > > > > > > c1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3) Link to source and signatures on apache dist > > > > > > > > > > > > > server: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.r > > > > > > > > > > > > > c1/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Please remember to TEST first before voting > > > > > > > > > > > > > accordingly: > > > > > > > > > > > > > > > > > > > > > > > > > > +1 = approve > > > > > > > > > > > > > +0 = no 
opinion > > > > > > > > > > > > > -1 = disapprove (provide reason) > > > > > > > > > > > > > -- > > > > > > > > > > > > > Best Regards > > > > > > > > > > > > > > > > > > > > > > > > > > Lai > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Best Regards > > > > > > > > > > > > > > > > > > > > > > Lai > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Best Regards > > > > > > > > > > > > > > > > > > Lai > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Best Regards > > > > > > > > > > Lai
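
The suggestion at the top of this thread, averaging the timings across several runs, could be sketched roughly as follows. This is only an illustration, not part of the benchmark script: the names parse_real, RUNS, summarize and relative_slowdown are hypothetical, and the 'real' values are copied from the three cifar10.py --epochs 5 runs per version reported in this thread. Three runs per version is a small sample, so the result is an indication rather than a verdict.

```python
import re
from statistics import mean, stdev

# Matches the wall-clock line printed by bash's `time`, e.g. "real 10m55.796s".
_REAL = re.compile(r"real\s+(\d+)m([\d.]+)s")


def parse_real(line):
    """Convert a `time` line such as 'real 10m55.796s' to seconds."""
    match = _REAL.search(line)
    if match is None:
        raise ValueError(f"no 'real' time found in {line!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Wall-clock lines copied from the runs quoted in this thread
# (hypothetical container; the versions are the two builds being compared).
RUNS = {
    "1.4.1": ["real 10m41.994s", "real 11m7.356s", "real 10m55.796s"],
    "1.5.0": ["real 11m30.388s", "real 11m21.930s", "real 11m45.329s"],
}


def summarize(runs):
    """Return {version: (mean_seconds, sample_stdev_seconds)}."""
    summary = {}
    for version, lines in runs.items():
        seconds = [parse_real(line) for line in lines]
        summary[version] = (mean(seconds), stdev(seconds))
    return summary


def relative_slowdown(summary, base, new):
    """Fractional increase of the mean wall-clock time of `new` over `base`."""
    return (summary[new][0] - summary[base][0]) / summary[base][0]


if __name__ == "__main__":
    stats = summarize(RUNS)
    for version, (avg, spread) in stats.items():
        print(f"{version}: mean {avg:.1f}s, stdev {spread:.1f}s")
    print(f"slowdown: {relative_slowdown(stats, '1.4.1', '1.5.0'):.1%}")
```

Comparing the difference between the two means against the per-version standard deviation gives a rough sense of whether the gap exceeds run-to-run noise. More runs, ideally interleaved between the two builds, would make that comparison more trustworthy.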