Hi Matthias,

I am OK with removing this flag, but would prefer that we keep the JIRA open until we are sure that caching is not a bottleneck. I have noticed that the gradients turn sparse as we execute more iterations. Also, cache release time depends on the memory budget. Here are the statistics from running LeNet on MNIST using https://github.com/apache/incubator-systemml/tree/master/scripts/staging/SystemML-NN/examples

With 20G driver memory, the statistics after running 10 epochs are as follows:

Epoch: 10, Iter: 700, Train Loss: 0.20480149054528493, Train Accuracy: 0.984375, Val Loss: 0.026928755962383588, Val Accuracy: 0.9922
Epoch: 10, Iter: 800, Train Loss: 0.20165772217976913, Train Accuracy: 1.0, Val Loss: 0.027878978005867083, Val Accuracy: 0.9922
17/02/14 16:06:58 INFO DMLScript: SystemML Statistics:
Total elapsed time: 12687.863 sec.
Total compilation time: 2.168 sec.
Total execution time: 12685.694 sec.
Number of compiled Spark inst: 147.
Number of executed Spark inst: 4.
Cache hits (Mem, WB, FS, HDFS): 1096424/0/0/2.
Cache writes (WB, FS, HDFS): 603950/15/8.
Cache times (ACQr/m, RLS, EXP): 3.704/0.336/61.831/1.242 sec.
HOP DAGs recompiled (PRED, SB): 0/154885.
HOP DAGs recompile time: 28.663 sec.
Functions recompiled: 1.
Functions recompile time: 0.024 sec.
Spark ctx create time (lazy): 1.009 sec.
Spark trans counts (par,bc,col):0/0/2.
Spark trans times (par,bc,col): 0.000/0.000/3.433 secs.
Total JIT compile time: 44.711 sec.
Total JVM GC count: 7459.
Total JVM GC time: 166.26 sec.
Heavy hitter instructions (name, time, count):
-- 1) train 12138.979 sec 1
-- 2) conv2d_bias_add 10876.708 sec 17362
-- 3) conv2d_backward_filter 421.303 sec 17200
-- 4) sel+ 239.660 sec 25881
-- 5) update 226.687 sec 68800
-- 6) update_nesterov 223.775 sec 68800
-- 7) maxpooling_backward 136.709 sec 17200
-- 8) conv2d_backward_data 134.315 sec 8600
-- 9) ba+* 118.897 sec 51762
-- 10) relu_maxpooling 112.283 sec 17362
-- 11) relu_backward 107.483 sec 34400
-- 12) uack+ 89.258 sec 34400
-- 13) r' 74.304 sec 43000
-- 14) +* 57.193 sec 34400
-- 15) * 16.493 sec 95178
-- 16) rand 16.038 sec 8613
-- 17) / 8.352 sec 86492
-- 18) rangeReIndex 6.628 sec 17208
-- 19) + 3.054 sec 96528
-- 20) uark+ 2.219 sec 43241
-- 21) sp_csvrblk 2.183 sec 2
-- 22) rmvar 1.517 sec 1451571
-- 23) write 1.250 sec 9
-- 24) - 1.059 sec 86486
-- 25) createvar 1.026 sec 587259
-- 26) exp 0.663 sec 17281
-- 27) *2 0.361 sec 2
-- 28) uasqk+ 0.277 sec 320
-- 29) log 0.200 sec 160
-- 30) uarmax 0.191 sec 17281

With 5G driver memory, the statistics after running 10 epochs are as follows:

Epoch: 10, Iter: 700, Train Loss: 0.19313544015858036, Train Accuracy: 1.0, Val Loss: 0.025943927403263182, Val Accuracy: 0.993
Epoch: 10, Iter: 800, Train Loss: 0.1883995965207449, Train Accuracy: 1.0, Val Loss: 0.0260796819319468, Val Accuracy: 0.9916
17/02/14 20:16:40 INFO DMLScript: SystemML Statistics:
Total elapsed time: 13886.763 sec.
Total compilation time: 2.148 sec.
Total execution time: 13884.615 sec.
Number of compiled Spark inst: 147.
Number of executed Spark inst: 4.
Cache hits (Mem, WB, FS, HDFS): 1096422/0/2/2.
Cache writes (WB, FS, HDFS): 603868/2176/8.
Cache times (ACQr/m, RLS, EXP): 3.883/0.343/271.757/1.312 sec.
HOP DAGs recompiled (PRED, SB): 0/154885.
HOP DAGs recompile time: 28.290 sec.
Functions recompiled: 1.
Functions recompile time: 0.023 sec.
Spark ctx create time (lazy): 0.981 sec.
Spark trans counts (par,bc,col):0/0/2.
Spark trans times (par,bc,col): 0.000/0.000/3.501 secs.
Total JIT compile time: 45.131 sec.
Total JVM GC count: 7605.
Total JVM GC time: 157.716 sec.
Heavy hitter instructions (name, time, count):
-- 1) train 13301.811 sec 1
-- 2) conv2d_bias_add 11890.291 sec 17362
-- 3) conv2d_backward_filter 416.645 sec 17200
-- 4) ba+* 252.966 sec 51762
-- 5) sel+ 237.334 sec 25881
-- 6) update 228.261 sec 68800
-- 7) update_nesterov 225.383 sec 68800
-- 8) maxpooling_backward 134.260 sec 17200
-- 9) +* 133.959 sec 34400
-- 10) conv2d_backward_data 128.046 sec 8600
-- 11) relu_maxpooling 106.499 sec 17362
-- 12) relu_backward 104.062 sec 34400
-- 13) uack+ 90.104 sec 34400
-- 14) r' 70.932 sec 43000
-- 15) * 16.203 sec 95178
-- 16) rand 16.131 sec 8613
-- 17) / 7.988 sec 86492
-- 18) rangeReIndex 7.640 sec 17208
-- 19) sp_csvrblk 2.220 sec 2
-- 20) + 2.121 sec 96528
-- 21) uark+ 2.079 sec 43241
-- 22) rmvar 1.580 sec 1451571
-- 23) rshape 1.533 sec 17200
-- 24) write 1.322 sec 9
-- 25) createvar 0.976 sec 587259
-- 26) - 0.961 sec 86486
-- 27) exp 0.659 sec 17281
-- 28) uasqk+ 0.314 sec 320
-- 29) *2 0.312 sec 2
-- 30) log 0.200 sec 160

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

From: Matthias Boehm <mboe...@googlemail.com>
To: dev@systemml.incubator.apache.org
Date: 02/13/2017 04:29 PM
Subject: Re: Removal of workaround flags

Well, I used exactly the mnist_lenet scenario discussed in the JIRA, but what I've observed are eviction times <2.5% of total execution time, almost no sparse intermediates, and the script execution time being dominated by conv2d_bias_add. Again, the discrepancy might very well stem from changes made since the JIRA was created. In any case, I would rather address any existing performance issues than globally disable evictions (which could easily lead to OOMs) or sparse matrix formats. Hence, I'd like to remove these workaround flags in order to prevent shortcuts that do not apply to all users.

Regards,
Matthias

On Mon, Feb 13, 2017 at 9:19 AM, <dusenberr...@gmail.com> wrote:

> Thanks for bringing up the topic. Our deep learning scripts (i.e.
> algorithms with several intermediate transformations) have shown cache
> release times to be a major bottleneck, thus leading to the creation of
> SYSTEMML-1140. Specifically, what did you use to attempt to reproduce 1140?
>
>
> -Mike
>
> --
>
> Mike Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
>
> Sent from my iPhone.
>
>
> > On Feb 12, 2017, at 12:30 AM, Matthias Boehm <mboe...@googlemail.com>
> wrote:
> >
> > SYSTEMML-1140
>
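
For anyone comparing the two runs in this thread, the cache release time can be expressed as a share of total execution time. The following is only a rough sketch: the helper name `release_share` is hypothetical, the two input strings are copied from the statistics quoted above, and it assumes the four "Cache times" values are acquire-read, acquire-modify, release, and export, in that order, as the "(ACQr/m, RLS, EXP)" label suggests.

```python
import re


def release_share(stats: str) -> float:
    """Return cache release (RLS) time as a fraction of total execution time.

    Hypothetical helper; assumes the 'Cache times' field order is
    acquire-read / acquire-modify / release / export.
    """
    exec_time = float(
        re.search(r"Total execution time:\s*([\d.]+) sec", stats).group(1)
    )
    cache_field = re.search(
        r"Cache times \(ACQr/m, RLS, EXP\):\s*([\d./]+) sec", stats
    ).group(1)
    _acq_read, _acq_modify, release, _export = (
        float(value) for value in cache_field.split("/")
    )
    return release / exec_time


# Values copied from the two statistics blocks in this thread.
run_20g = (
    "Total execution time: 12685.694 sec. "
    "Cache times (ACQr/m, RLS, EXP): 3.704/0.336/61.831/1.242 sec."
)
run_5g = (
    "Total execution time: 13884.615 sec. "
    "Cache times (ACQr/m, RLS, EXP): 3.883/0.343/271.757/1.312 sec."
)

for label, stats in (("20G", run_20g), ("5G", run_5g)):
    print(f"{label}: release share = {release_share(stats):.2%}")
```

Under that reading of the fields, both runs stay below the 2.5% figure mentioned earlier in the thread, while the release time itself grows roughly fourfold when the driver memory drops from 20G to 5G.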