[ https://issues.apache.org/jira/browse/SYSTEMML-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mike Dusenberry updated SYSTEMML-633: ------------------------------------- Priority: Blocker (was: Critical) > Improve Left-Indexing Performance with (Nested) Parfor Loops in UDFs > -------------------------------------------------------------------- > > Key: SYSTEMML-633 > URL: https://issues.apache.org/jira/browse/SYSTEMML-633 > Project: SystemML > Issue Type: Improvement > Components: ParFor > Reporter: Mike Dusenberry > Priority: Blocker > Attachments: Im2colWrapper.java, log.txt, log.txt, perf-dml.dml, > perf-tf.py, perf.sh, run.sh, systemml-nn-05.16.16.zip, systemml-nn.zip, > time.txt > > > In the experimental deep learning DML library I've been building > ([https://github.com/dusenberrymw/systemml-nn|https://github.com/dusenberrymw/systemml-nn]), > I've experienced severe bottlenecks due to *left-indexing* in parfor loops. > Here, I will highlight a few particular instances with simplified examples, > but the same issue is shared across many areas of the library, particularly > in the convolution and max pooling layers, and is exaggerated in real > use-cases. > *Quick note* on setup for any of the below experiments. Please grab a copy > of the above repo (particularly the {{nn}} directory), and run any > experiments with the {{nn}} package available at the base directory of the > experiment. > Scenario: *Convolution* > * In the library above, the forward pass of the convolution function > ([{{conv::forward(...)}} | > https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L8] > in {{nn/layers/conv.dml}}) essentially accepts a matrix {{X}} of images, a > matrix of weights {{W}}, and several other parameters corresponding to image > sizes, filter sizes, etc. It then loops through the images with a {{parfor}} > loop, and for each image it pads the image with {{util::pad_image}}, extracts > "patches" of the image into columns of a matrix in a sliding fashion across > the image with {{util::im2col}}, performs a matrix multiplication between the > matrix of patch columns and the weight matrix, and then saves the result into > a matrix defined outside of the parfor loop using left-indexing. > * Left-indexing has been identified as the bottleneck by a wide margin. > * Left-indexing is used in the main {{conv::forward(...)}} function in the > [last line in the parfor > loop|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L61], > in the > [{{util::pad_image(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L196] > function used by {{conv::forward(...)}}, as well as in the > [{{util::im2col(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L96] > function used by {{conv::forward(...)}}. > * Test script (assuming the {{nn}} package is available): > ** {{speed-633.dml}} {code} > source("nn/layers/conv.dml") as conv > source("nn/util.dml") as util > # Generate data > N = 64 # num examples > C = 30 # num channels > Hin = 28 # input height > Win = 28 # input width > F = 20 # num filters > Hf = 3 # filter height > Wf = 3 # filter width > stride = 1 > pad = 1 > X = rand(rows=N, cols=C*Hin*Win) > # Create layer > [W, b] = conv::init(F, C, Hf, Wf) > # Forward > [out, Hout, Wout] = conv::forward(X, W, b, C, Hin, Win, Hf, Wf, stride, > stride, pad, pad) > print("Out: " + nrow(out) + "x" + ncol(out)) > print("Hout: " + Hout) > print("Wout: " + Wout) > print("") > print(sum(out)) > {code} > * Invocation: > ** {{java -jar > $SYSTEMML_HOME/target/systemml-0.10.0-incubating-SNAPSHOT-standalone.jar -f > speed-633.dml -stats -explain -exec singlenode}} > * Stats output (modified to output up to 100 instructions): > ** {code} > ... > Total elapsed time: 26.834 sec. > Total compilation time: 0.529 sec. > Total execution time: 26.304 sec. > Number of compiled MR Jobs: 0. > Number of executed MR Jobs: 0. > Cache hits (Mem, WB, FS, HDFS): 9196235/0/0/0. > Cache writes (WB, FS, HDFS): 3070724/0/0. > Cache times (ACQr/m, RLS, EXP): 1.474/1.120/26.998/0.000 sec. > HOP DAGs recompiled (PRED, SB): 0/0. > HOP DAGs recompile time: 0.268 sec. > Functions recompiled: 129. > Functions recompile time: 0.841 sec. > ParFor loops optimized: 1. > ParFor optimize time: 0.032 sec. > ParFor initialize time: 0.015 sec. > ParFor result merge time: 0.028 sec. > ParFor total update in-place: 0/0/1559360 > Total JIT compile time: 14.235 sec. > Total JVM GC count: 94. > Total JVM GC time: 0.366 sec. > Heavy hitter instructions (name, time, count): > -- 1) leftIndex 41.670 sec 1559360 > -- 2) forward 26.212 sec 1 > -- 3) im2col_t45 25.919 sec 8 > -- 4) im2col_t41 25.850 sec 8 > -- 5) im2col_t48 25.831 sec 8 > -- 6) im2col_t43 25.752 sec 8 > -- 7) im2col_t44 25.736 sec 8 > -- 8) im2col_t42 25.695 sec 8 > -- 9) im2col_t47 25.691 sec 8 > -- 10) im2col_t46 25.519 sec 8 > -- 11) rangeReIndex 13.392 sec 3012544 > -- 12) rshape 8.197 sec 3064704 > -- 13) rmvar 4.988 sec 20363504 > -- 14) createvar 4.954 sec 7688965 > -- 15) ncol 1.148 sec 3014529 > -- 16) - 0.961 sec 3112834 > -- 17) + 0.878 sec 3124617 > -- 18) rand 0.839 sec 52228 > -- 19) * 0.480 sec 1764229 > -- 20) cpvar 0.366 sec 1607812 > -- 21) ba+* 0.257 sec 64 > -- 22) pad_image_t42 0.187 sec 8 > -- 23) pad_image_t47 0.181 sec 8 > -- 24) pad_image_t44 0.168 sec 8 > -- 25) pad_image_t46 0.164 sec 8 > -- 26) pad_image_t43 0.156 sec 8 > -- 27) pad_image_t48 0.153 sec 8 > -- 28) pad_image_t45 0.152 sec 8 > -- 29) pad_image_t41 0.152 sec 8 > -- 30) nrow 0.036 sec 50307 > -- 31) assignvar 0.016 sec 52235 > -- 32) uak+ 0.015 sec 1 > -- 33) castvti 0.000 sec 130 > -- 34) print 0.000 sec 5 > -- 35) / 0.000 sec 130 > -- 36) sqrt 0.000 sec 1 > {code} > ** *Full log file attached* (including a {{log=DEBUG}} modification to the > parfor loop in {{conv::forward(...)}}. > ** Note again that {{forward}}, {{im2col}}, and {{pad_image}} all involve > left-indexing. > * Other notes: > ** Further experiments involved replacing the {{util::im2col(...)}} function > with an external Java function using a basic, nested for-loop approach with > no regard for optimization. Compared with the fastest parfor DML version, I > experienced at least a *100x* speed improvement. When compared to the same > naive for-loop approach in DML, the speedup was even greater. > ** Even with this external version of {{im2col}}, and with padding disabled, > the left-indexing within the parfor loop of {{conv::forward(...)}} still > dominated the execution time, acting as the major bottleneck. > ** For all described experiments, logging indicated that parfor update in > place was *not* applied. -- This message was sent by Atlassian JIRA (v6.3.4#6332)