[ https://issues.apache.org/jira/browse/SYSTEMML-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias Boehm updated SYSTEMML-2398:
-------------------------------------
    Summary: Paramserv ASP aggregation overhead on update per epoch  (was: Paramserv ASP aggregation overhead in on update per epoch)

> Paramserv ASP aggregation overhead on update per epoch
> ------------------------------------------------------
>
>                 Key: SYSTEMML-2398
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2398
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Priority: Major
>
> Here are the statistics for mnist60K with 2 epochs and 80 workers in ASP:
> {code}
> SystemML Statistics:
> Total elapsed time: 449.548 sec.
> Total compilation time: 1.995 sec.
> Total execution time: 447.553 sec.
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 970241/0/0/2.
> Cache writes (WB, FS, HDFS): 55191/0/0.
> Cache times (ACQr/m, RLS, EXP): 1.048/0.120/1.087/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/13582.
> HOP DAGs recompile time: 24.473 sec.
> Functions recompiled: 1.
> Functions recompile time: 0.013 sec.
> Paramserv func number of workers: 79.
> Paramserv func total gradients compute time: 1714.962 secs.
> Paramserv func total aggregation time: 428.499 secs.
> Paramserv func model broadcasting time: 2.080 secs.
> Paramserv func total batch slicing time: 0.0190000000 secs.
> Total JIT compile time: 37.461 sec.
> Total JVM GC count: 66.
> Total JVM GC time: 7.098 sec.
> Heavy hitter instructions:
>   #  Instruction             Time(s)  Count
>   1  conv2d_bias_add         719.111  13768
>   2  paramserv               437.051      1
>   3  relu_backward           210.414  20370
>   4  ba+*                    180.001  40928
>   5  conv2d_backward_filter  175.104  13580
>   6  +*                      156.714  81480
>   7  conv2d_backward_data    140.779   6790
>   8  *                       123.502  95173
>   9  -*                      104.058  54320
>  10  -                        94.502  74985
> {code}
> As we can see, aggregation is a major bottleneck. This is unexpected given
> the coarse-grained update per epoch. [~Guobao], could you please have a look
> and profile where this is coming from?
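To put a number on the bottleneck, the following minimal Python sketch relates the reported aggregation time to the total execution time. It is only back-of-the-envelope arithmetic on four lines copied from the statistics above. The names `STATS` and `parse_stats` are hypothetical helpers, not part of SystemML. It also assumes that the gradient compute time is accumulated across all workers (it exceeds the total elapsed time, so it cannot be wall-clock time) and that the aggregation time is comparable to wall-clock time; the report itself does not state either.

```python
import re

# Four lines copied verbatim from the statistics in the issue description.
STATS = """\
Total execution time: 447.553 sec.
Paramserv func number of workers: 79.
Paramserv func total gradients compute time: 1714.962 secs.
Paramserv func total aggregation time: 428.499 secs.
"""


def parse_stats(text):
    """Map each 'label: value' line to a float, dropping the unit and final period.

    Hypothetical helper for this sketch; it only handles single-valued lines.
    """
    pattern = re.compile(r"^\s*(.+?):\s*(\d+(?:\.\d+)?)\s*(?:secs?)?\.?\s*$")
    stats = {}
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            stats[match.group(1)] = float(match.group(2))
    return stats


stats = parse_stats(STATS)

# Share of the total execution time reported as aggregation
# (assumes the aggregation time is not summed over threads).
agg_share = (
    stats["Paramserv func total aggregation time"]
    / stats["Total execution time"]
)

# Average gradient compute time per worker
# (assumes the reported total is summed over all workers).
grad_per_worker = (
    stats["Paramserv func total gradients compute time"]
    / stats["Paramserv func number of workers"]
)

print(f"aggregation share of execution time: {agg_share:.1%}")
print(f"average gradient compute time per worker: {grad_per_worker:.1f} s")
```

Under these assumptions, aggregation accounts for roughly 96% of the execution time, while each worker spends only about 22 seconds on average computing gradients.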