[jira] [Commented] (SPARK-2585) Remove special handling of Hadoop JobConf

2014-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183722#comment-14183722
 ] 

Apache Spark commented on SPARK-2585:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2935

 Remove special handling of Hadoop JobConf
 -

 Key: SPARK-2585
 URL: https://issues.apache.org/jira/browse/SPARK-2585
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Josh Rosen
Priority: Critical

 This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the 
 implementation does not use shared conf objects). We no longer need to 
 specially broadcast the Hadoop configuration since we are broadcasting RDD 
 data anyways.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2585) Remove special handling of Hadoop JobConf

2014-10-06 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160589#comment-14160589
 ] 

Josh Rosen commented on SPARK-2585:
---

I tried benchmarking the time needed to create a {{new JobConf()}} object and it 
looks like it takes ~2.3 milliseconds:

{code}
import org.apache.hadoop.mapred.JobConf

object HadoopConfBenchmark {
  def main(args: Array[String]): Unit = {
    // Number of JobConf objects to construct in the timed loop.
    val numConfs = 10000
    val start = System.currentTimeMillis()
    for (i <- 1 to numConfs) {
      // Each construction loads the Hadoop defaults from the classpath.
      new JobConf()
    }
    val end = System.currentTimeMillis()
    println(s"Took ${end - start} ms to create $numConfs new JobConfs")
  }
}
{code}

On my laptop, this outputs

{code}
Took 23492 ms to create 10000 new JobConfs
{code}

Since the correlation optimizer tests ran ~7 seconds slower with this PR, this 
slowdown might be explained if those tests were running ~3000 tasks.  This is 
actually plausible, since the default parallelism was pretty high in those 
tests (~200 partitions, if I recall) and the queries were very complicated.
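
The back-of-the-envelope estimate above can be reproduced with a short sketch. It assumes one {{new JobConf()}} per task and uses only the approximate figures quoted in this comment (~2.3 ms per JobConf, ~7 seconds of slowdown); the object and method names are hypothetical and are not part of Spark.

```scala
object SlowdownEstimate {
  // Approximate figures quoted in the comment; neither is an exact measurement.
  val perJobConfMs: Double = 2.3   // cost of constructing one JobConf
  val slowdownMs: Double = 7000.0  // extra test time observed with the PR

  // Number of JobConf constructions (assumed: one per task) needed to
  // account for a slowdown of `slowdownMs` milliseconds.
  def tasksToExplain(slowdownMs: Double, perConfMs: Double): Long =
    math.round(slowdownMs / perConfMs)

  def main(args: Array[String]): Unit = {
    val tasks = tasksToExplain(slowdownMs, perJobConfMs)
    println(s"~$tasks tasks would account for a ${slowdownMs / 1000} s slowdown")
  }
}
```

With these inputs the estimate comes out close to the ~3000 tasks mentioned above.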

For most real deployments (i.e. not running in local mode), the extra ~2 ms will 
probably be masked by other latencies (e.g. RPC), so I'd say that we should 
merge this patch for now and try to regain the performance elsewhere if it 
turns out to be a problem.

There's the option of putting this behind a configuration option, but I don't 
like that approach because I feel that it's important to be correct by 
default and not have options that sacrifice correctness for performance.




[jira] [Commented] (SPARK-2585) Remove special handling of Hadoop JobConf

2014-10-06 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160642#comment-14160642
 ] 

Andrew Ash commented on SPARK-2585:
---

I also vote for correct by default; there are various potentially dangerous 
knobs you can turn to squeeze out more performance, at some risk, if you 
care to.

Note that the {{new JobConf()}} constructor loads defaults out of the Hadoop 
config files (I think {{core-site.xml}} and {{hdfs-site.xml}}) and you can 
disable that with the {{JobConf(false)}} constructor.  Not sure if we need the 
local config files per server or if all the config options come solely from the 
driver's {{JobConf}} when instantiated.




[jira] [Commented] (SPARK-2585) Remove special handling of Hadoop JobConf

2014-10-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161166#comment-14161166
 ] 

Apache Spark commented on SPARK-2585:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2683




[jira] [Commented] (SPARK-2585) Remove special handling of Hadoop JobConf

2014-10-06 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161383#comment-14161383
 ] 

Josh Rosen commented on SPARK-2585:
---

[~pwendell] I obtained these numbers by adding this benchmark to the Spark 
Examples package and running it through {{./bin/run-example}}, which I think 
should have given it a pretty big classpath.




[jira] [Commented] (SPARK-2585) Remove special handling of Hadoop JobConf

2014-08-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099234#comment-14099234
 ] 

Patrick Wendell commented on SPARK-2585:


Unfortunately after a lot of effort we still can't get the test times down on 
this one and it's still unclear whether it will cause performance regressions.

Since this isn't particularly critical from a user perspective (it's mostly 
about simplifying internals) I think it's best to punt this to 1.2. One 
unfortunate thing is that it means SPARK-2546 will remain broken in 1.1.




[jira] [Commented] (SPARK-2585) Remove special handling of Hadoop JobConf

2014-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078879#comment-14078879
 ] 

Apache Spark commented on SPARK-2585:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1648

 Remove special handling of Hadoop JobConf
 -

 Key: SPARK-2585
 URL: https://issues.apache.org/jira/browse/SPARK-2585
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Reynold Xin
Priority: Critical

 This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the 
 implementation does not use shared conf objects). We no longer need to 
 specially broadcast the Hadoop configuration since we are broadcasting RDD 
 data anyways.


