[ https://issues.apache.org/jira/browse/SPARK-34015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-34015:
---------------------------------
    Fix Version/s: 3.1.1
                   3.2.0

> SparkR partition timing summary reports input time correctly
> ------------------------------------------------------------
>
>                 Key: SPARK-34015
>                 URL: https://issues.apache.org/jira/browse/SPARK-34015
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.3.2, 3.0.1
>         Environment: Observed on CentOS-7 running Spark 2.3.1 and on my Mac running master
>            Reporter: Tom Howland
>            Priority: Major
>             Fix For: 3.2.0, 3.1.1
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> When SparkR is run at log level INFO, a summary of how the worker spent its time processing the partition is printed. There is a logic error that causes it to over-report the time spent inputting rows.
>
> In detail: the variable inputElap is used in a wider context to mark the beginning of reading rows, but in the part changed here it was also reused as a local variable for measuring compute time. As a result, each group's compute time is counted into the read-input total as well. The error is therefore not observable when there is only one group per partition, which is what unit tests produce.
>
> For our application, here's what a log entry looked like before these changes were applied:
>
> {{20/10/09 04:08:58 WARN RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s}}
>
> This indicates that we were apparently spending more time reading rows than operating on them.
>
> After these changes, it looks like this:
>
> {{20/12/15 06:43:29 WARN RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s}}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
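The accounting error described in the issue can be sketched in a few lines. This is not SparkR's actual worker.R code (the real fix lives in the SparkR worker); it is a minimal Python illustration, with `partition_times`, `fn`, and the `buggy` flag all hypothetical names, of how reusing the input-start timestamp as the compute-start marker makes read-input absorb each group's compute time once a partition has more than one group:

```python
import time

def partition_times(groups, fn=sum, buggy=False):
    """Tally read-input vs. compute time over the groups of one partition.

    Hypothetical sketch of the timing logic described in SPARK-34015,
    not SparkR's actual worker code. With buggy=True, the single
    input-start timestamp is also reused as the compute-start marker,
    so from the second group onward the previous group's compute time
    is counted into read-input as well.
    """
    read_input = 0.0
    compute = 0.0
    input_start = time.time()            # mark the beginning of reading rows
    for rows in groups:
        # Time attributed to "reading" this group's rows.
        read_input += time.time() - input_start
        if buggy:
            input_start = time.time()    # reuse of the input marker: the bug
            compute_start = input_start
        else:
            compute_start = time.time()  # a separate local marker: the fix
        fn(rows)                         # stand-in for the user's function
        compute += time.time() - compute_start
        if not buggy:
            input_start = time.time()    # re-mark before reading the next group
    return read_input, compute
```

With a single group per partition the two variants report the same numbers, which is why the bug escaped the unit tests; with two or more groups, the buggy variant's read-input figure grows by roughly one compute interval per extra group.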