[jira] [Commented] (SPARK-2585) Remove special handling of Hadoop JobConf
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183722#comment-14183722 ]

Apache Spark commented on SPARK-2585:
-------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2935

> Remove special handling of Hadoop JobConf
> -----------------------------------------
>
>          Key: SPARK-2585
>          URL: https://issues.apache.org/jira/browse/SPARK-2585
>      Project: Spark
>   Issue Type: Improvement
>   Components: Spark Core
>     Reporter: Patrick Wendell
>     Assignee: Josh Rosen
>     Priority: Critical
>
> This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the
> implementation does not use shared conf objects). We no longer need to
> specially broadcast the Hadoop configuration since we are broadcasting RDD
> data anyways.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161383#comment-14161383 ]

Josh Rosen commented on SPARK-2585:
-----------------------------------

[~pwendell] I obtained these numbers by adding this benchmark to the Spark Examples package and running it through {{./bin/run-example}}, which I think should have given it a pretty big classpath.
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161382#comment-14161382 ]

Patrick Wendell commented on SPARK-2585:
----------------------------------------

Hey [~joshrosen], what happens if you run your benchmark inside of the Spark shell? If the JobConf constructor searches the classpath, this could end up taking a lot longer in that environment. It would be good to make sure.
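(For reference, a quick way to check this from the shell is to paste a timing loop directly into {{spark-shell}}. This snippet is not from the thread; it is a sketch that assumes a working shell with the Hadoop classes on its classpath:)

{code}
// Paste into spark-shell: times JobConf creation under the shell's
// much larger classpath, to compare against the run-example numbers.
import org.apache.hadoop.mapred.JobConf

val numConfs = 1000
val start = System.currentTimeMillis()
for (_ <- 1 to numConfs) { new JobConf() }
println(s"Took ${System.currentTimeMillis() - start} ms to create $numConfs JobConfs")
{code}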
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161166#comment-14161166 ]

Apache Spark commented on SPARK-2585:
-------------------------------------

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2683
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160642#comment-14160642 ]

Andrew Ash commented on SPARK-2585:
-----------------------------------

I also vote for "correct by default"; there are various potentially dangerous knobs you can turn to squeeze out more performance at your own risk if you care to. Note that the {{new JobConf()}} constructor loads defaults out of the Hadoop config files (I think {{core-site.xml}} and {{hdfs-site.xml}}), and you can disable that with the {{new JobConf(false)}} constructor. I'm not sure whether we need the local config files per server, or whether all the config options come solely from the driver's {{JobConf}} when it is instantiated.
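(To illustrate the two constructors being discussed — this sketch is not from the thread, and assumes Hadoop is on the classpath with its bundled default resource files:)

{code}
import org.apache.hadoop.mapred.JobConf

// Default constructor: scans the classpath and loads defaults from the
// Hadoop config files (core-default.xml, core-site.xml, etc.).
val withDefaults = new JobConf()

// loadDefaults = false skips that file scan entirely, so the conf
// starts empty and only contains what you set on it explicitly.
val bare = new JobConf(false)

// A key like fs.defaultFS is populated in the first conf but not the second.
println(withDefaults.get("fs.defaultFS")) // e.g. "file:///" from core-default.xml
println(bare.get("fs.defaultFS"))         // null
{code}

The classpath scan is what makes the default constructor comparatively expensive, which is why it matters for per-task JobConf creation.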
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160589#comment-14160589 ]

Josh Rosen commented on SPARK-2585:
-----------------------------------

I tried benchmarking the time needed to create a new JobConf() object, and it looks like each one takes ~2.3 milliseconds:

{code}
import org.apache.hadoop.mapred.JobConf

object HadoopConfBenchmark {
  def main(args: Array[String]) {
    val numConfs = 10000
    val start = System.currentTimeMillis()
    for (i <- 1 to numConfs) {
      new JobConf()
    }
    val end = System.currentTimeMillis()
    println(s"Took ${end - start} ms to create $numConfs new JobConfs")
  }
}
{code}

On my laptop, this outputs:

{code}
Took 23492 ms to create 10000 new JobConfs
{code}

Since the correlation optimizer tests ran ~7 seconds slower with this PR, this slowdown could be explained if those tests were running ~3000 tasks. That's actually plausible, since the default parallelism was pretty high in those tests (~200 partitions, if I recall) and the queries were very complicated.

For most real deployments (i.e. not running in local mode), the extra ~2 ms will probably be masked by other latencies (e.g. RPC), so I'd say we should merge this patch for now and try to regain the performance elsewhere if it turns out to be a problem. There's the option of putting this behind a configuration option, but I don't like that approach, because I feel it's important to be "correct by default" and not have options that sacrifice correctness for performance.
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099234#comment-14099234 ]

Patrick Wendell commented on SPARK-2585:
----------------------------------------

Unfortunately, after a lot of effort we still can't get the test times down on this one, and it's still unclear whether it will cause performance regressions. Since this isn't particularly critical from a user perspective (it's mostly about simplifying internals), I think it's best to punt this to 1.2. One unfortunate thing is that it means SPARK-2546 will remain broken in 1.1.
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078879#comment-14078879 ]

Apache Spark commented on SPARK-2585:
-------------------------------------

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1648