[ https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538813#comment-14538813 ]

Josh Rosen commented on SPARK-7410:
-----------------------------------

The correctness issue might be slightly overblown: an old Hadoop version had 
some unsynchronized mutable global state in the Configuration class, which 
could lead to crashes or hangs if multiple threads constructed Configuration 
instances concurrently. As a result, Spark synchronizes on a lock when 
constructing Configuration objects. My concern was that if that construction 
happens while an RDD is being deserialized, we would have to hold the lock for 
the duration of the RDD deserialization, which takes much longer than just 
constructing a Configuration. That lock would then become a huge point of 
contention and hurt task-launching performance on executors with many threads. 
This synchronization shouldn't be necessary for newer Hadoop versions; it 
would be good to trace through the Hadoop JIRAs referenced (transitively) from 
that other ticket to see whether we're stuck supporting those older versions.
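
As a rough sketch of the locking pattern described above (the object and 
method names below are illustrative, not Spark's actual internals):

{code:scala}
import org.apache.hadoop.conf.Configuration

// Illustrative only: serialize construction of Configuration so that older,
// non-thread-safe Hadoop versions don't race on the class's shared mutable
// state when many task threads start at once.
object HadoopConfHelper {
  private val constructionLock = new Object

  def newConfiguration(): Configuration = constructionLock.synchronized {
    // The classpath scan for default resources happens inside the
    // constructor, so every thread queues up behind this lock.
    new Configuration()
  }
}
{code}

The worry is what happens if the entire RDD deserialization, rather than just 
the constructor call, ends up inside that synchronized block.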

The performance issue was that our patch to remove the broadcast resulted in a 
huge slowdown of certain Spark SQL test cases.  My theory at the time was that 
Configuration objects are expensive to instantiate because they have to scan 
the entire classpath for the default configuration files.  When we broadcast the 
configuration, we only end up constructing O(numExecutors) configuration 
objects rather than O(numTasks) configurations, masking this overhead.  This 
probably doesn't apply to your scenario, though, since it sounds like you'll 
rarely or never end up running multiple tasks on the same executor with the 
same Hadoop configuration.
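
To make the O(numExecutors) vs. O(numTasks) distinction concrete, here's a 
rough sketch; the wrapper class is a simplified stand-in for the serializable 
wrapper Spark uses, since Configuration itself isn't java.io.Serializable:

{code:scala}
import java.io.{ObjectInputStream, ObjectOutputStream}

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext

// Simplified stand-in for a serializable wrapper around Configuration.
class SerializableConf(@transient var value: Configuration) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)                    // Configuration is a Hadoop Writable
  }
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

object BroadcastConfSketch {
  def run(sc: SparkContext, hadoopConf: Configuration): Unit = {
    // With a broadcast, the wrapper is deserialized roughly once per executor,
    // so we pay for O(numExecutors) Configuration objects.
    val broadcastConf = sc.broadcast(new SerializableConf(hadoopConf))
    sc.parallelize(1 to 10000, numSlices = 1000).foreach { _ =>
      val conf = broadcastConf.value.value   // reuse the executor-local copy
      conf.get("fs.defaultFS")
    }
    // Shipping the configuration with each task instead would mean O(numTasks)
    // deserializations/constructions, each repeating the classpath scan for
    // the default configuration files.
  }
}
{code}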

Given all of this, I'd be on board with adding a configuration option that 
lets advanced users bypass the broadcast for use cases like yours.  It might 
also be worth re-benchmarking to see whether we can remove the broadcast 
entirely, since the correctness and performance concerns may no longer be 
relevant.

> Add option to avoid broadcasting configuration with newAPIHadoopFile
> --------------------------------------------------------------------
>
>                 Key: SPARK-7410
>                 URL: https://issues.apache.org/jira/browse/SPARK-7410
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Sandy Ryza
>
> I'm working with a Spark application that creates thousands of HadoopRDDs and 
> unions them together.  Certain details of the way the data is stored require 
> this.
> Creating ten thousand of these RDDs takes about 10 minutes, even before any 
> of them is used in an action.  I dug into why this takes so long and it looks 
> like the overhead of broadcasting the Hadoop configuration is taking up most 
> of the time.  In this case, the broadcasting isn't helpful because each 
> HadoopRDD only corresponds to one or two tasks.  When I reverted the original 
> change that switched to broadcasting configurations, the time it took to 
> instantiate these RDDs improved 10x.
> It would be nice if there were a way to turn this broadcasting off, whether 
> through a Spark configuration option, a Hadoop configuration option, or an 
> argument to hadoopFile / newAPIHadoopFile.
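
For context, a rough sketch of the usage pattern the reporter describes (the 
paths and key/value/input-format types are made up, since the actual storage 
format isn't stated in the ticket):

{code:scala}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Thousands of small newAPIHadoopFile RDDs unioned into one. Each call
// currently broadcasts the Hadoop configuration, which is what dominates the
// ~10 minutes of setup time reported above.
object ManySmallHadoopRDDs {
  def build(sc: SparkContext, paths: Seq[String]): RDD[(LongWritable, Text)] = {
    val rdds = paths.map { path =>
      sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](path)
    }
    sc.union(rdds)   // one logical RDD over all of the inputs
  }
}
{code}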



