[ 
https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538784#comment-14538784
 ] 

Sandy Ryza commented on SPARK-7410:
-----------------------------------

Thanks for the pointer, [~joshrosen].  I looked over that JIRA, and it's hard 
to understand what's going on at first glance.  Are you saying it's a 
performance issue, a correctness issue, or both?  I can look deeper if you 
don't remember.

Regarding performance, broadcasting the conf is probably faster in the majority 
of cases.  But when an RDD has only a couple of partitions, the reverse is 
true.  So what I'm advocating for is the ability to turn broadcasting off in 
the latter case.
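
To make the tradeoff concrete, here is a rough toy cost model, not a 
measurement.  It assumes broadcasting costs a fixed amount per RDD at creation 
time, while not broadcasting costs a smaller amount per task because the conf 
ships with each task.  The function names and the numbers are hypothetical and 
chosen only for illustration.

```python
# Toy cost model for broadcasting a Hadoop conf vs. shipping it with each task.
# All names and constants are hypothetical illustrations, not Spark APIs or
# measured values.


def cost_with_broadcast(num_rdds, broadcast_setup_cost):
    """Assumed cost: one fixed broadcast setup per RDD, independent of partitions."""
    return num_rdds * broadcast_setup_cost


def cost_without_broadcast(num_rdds, partitions_per_rdd, per_task_conf_cost):
    """Assumed cost: the conf is serialized once per task (one task per partition)."""
    return num_rdds * partitions_per_rdd * per_task_conf_cost


def broadcast_is_cheaper(partitions_per_rdd, broadcast_setup_cost, per_task_conf_cost):
    """True when the fixed broadcast cost is below the summed per-task cost."""
    return broadcast_setup_cost < partitions_per_rdd * per_task_conf_cost


if __name__ == "__main__":
    # Made-up cost units: 60 per broadcast, 5 per task-level conf copy.
    for partitions in (1, 2, 100, 1000):
        print(partitions, broadcast_is_cheaper(partitions, 60.0, 5.0))
```

Under these assumed costs, broadcasting only pays off once an RDD has more 
than 12 partitions, which is why an RDD with one or two tasks would be better 
off without it.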

> Add option to avoid broadcasting configuration with newAPIHadoopFile
> --------------------------------------------------------------------
>
>                 Key: SPARK-7410
>                 URL: https://issues.apache.org/jira/browse/SPARK-7410
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Sandy Ryza
>
> I'm working with a Spark application that creates thousands of HadoopRDDs and 
> unions them together.  Certain details of the way the data is stored require 
> this.
> Creating ten thousand of these RDDs takes about 10 minutes, even before any 
> of them is used in an action.  I dug into why this takes so long and it looks 
> like the overhead of broadcasting the Hadoop configuration is taking up most 
> of the time.  In this case, the broadcasting isn't helpful because each 
> HadoopRDD only corresponds to one or two tasks.  When I reverted the original 
> change that switched to broadcasting configurations, the time it took to 
> instantiate these RDDs improved 10x.
> It would be nice if there were a way to turn this broadcasting off, either 
> through a Spark configuration option, a Hadoop configuration option, or an 
> argument to hadoopFile / newAPIHadoopFile.


