[ https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538784#comment-14538784 ]
Sandy Ryza commented on SPARK-7410:
-----------------------------------

Thanks for the pointer, [~joshrosen]. I looked over that JIRA, and it's hard to tell at first glance what's going on. Are you saying it's a performance issue, a correctness issue, or both? I can dig deeper if you don't remember. Regarding performance: broadcasting the conf is probably faster in the majority of cases, but when an RDD has only a couple of partitions the reverse is true. So what I'm advocating is the ability to turn broadcasting off in the latter case.

> Add option to avoid broadcasting configuration with newAPIHadoopFile
> --------------------------------------------------------------------
>
>                 Key: SPARK-7410
>                 URL: https://issues.apache.org/jira/browse/SPARK-7410
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Sandy Ryza
>
> I'm working with a Spark application that creates thousands of HadoopRDDs and unions them together. Certain details of the way the data is stored require this.
> Creating ten thousand of these RDDs takes about 10 minutes, even before any of them is used in an action. I dug into why this takes so long, and it looks like the overhead of broadcasting the Hadoop configuration accounts for most of the time. In this case the broadcasting isn't helpful, because each HadoopRDD corresponds to only one or two tasks. When I reverted the original change that switched to broadcasting configurations, the time it took to instantiate these RDDs improved 10x.
> It would be nice if there were a way to turn this broadcasting off, either through a Spark configuration option, a Hadoop configuration option, or an argument to hadoopFile / newAPIHadoopFile.
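The trade-off in this thread can be sketched without Spark at all: broadcasting the Hadoop configuration pays a fixed per-RDD setup cost on the driver, which is amortized when an RDD has many tasks but becomes pure overhead when you create ten thousand RDDs with one or two partitions each. A minimal Python sketch of the cost model (the cost constants are illustrative assumptions, not measured Spark internals):

```python
def driver_setup_cost(num_rdds, tasks_per_rdd, broadcast):
    """Rough driver-side cost (arbitrary units) of instantiating the RDDs.

    broadcast=True  -> one broadcast (fixed setup + one serialization of the
                       conf) is created per RDD, as Spark does today.
    broadcast=False -> no broadcast; the conf is serialized into every task
                       closure instead.
    All constants below are assumed for illustration only.
    """
    broadcast_setup = 50  # assumed fixed cost to register one broadcast
    serialize_conf = 2    # assumed cost to serialize the conf once
    if broadcast:
        return num_rdds * (broadcast_setup + serialize_conf)
    return num_rdds * tasks_per_rdd * serialize_conf

# The reported scenario: ten thousand RDDs, one or two tasks each.
# The per-RDD broadcast setup dominates, so broadcasting is the slow path.
many_small_bcast = driver_setup_cost(10_000, 2, broadcast=True)
many_small_plain = driver_setup_cost(10_000, 2, broadcast=False)

# The common scenario: one RDD with many partitions.
# One broadcast replaces thousands of per-task serializations and wins.
one_big_bcast = driver_setup_cost(1, 10_000, broadcast=True)
one_big_plain = driver_setup_cost(1, 10_000, broadcast=False)
```

Under these assumptions the first case favors turning broadcasting off and the second favors keeping it on, which is why a per-call or per-job switch (rather than a global default change) is the shape of the request.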
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org