[ 
https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147896#comment-15147896
 ] 

Nezih Yigitbasi commented on SPARK-13328:
-----------------------------------------

This delay can be reduced by lowering the values of the 
{{spark.shuffle.io.maxRetries}} and {{spark.shuffle.io.retryWait}} parameters, 
but neither change is ideal: reducing the number of retries applies globally, 
and reducing the retry wait may increase the load on the serving block manager. 
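
To make the trade-off concrete, here is a rough back-of-the-envelope sketch in 
Python (not Spark code). It assumes, as the issue description does, that each 
stale location costs max retries times retry wait; the function name and its 
arguments are hypothetical and only illustrate the arithmetic.

```python
def worst_case_delay_s(stale_locations: int,
                       max_retries: int = 3,
                       retry_wait_s: float = 5.0) -> float:
    """Seconds spent on stale locations before a live one is reached.

    Simplified model taken from the issue description: every stale
    location costs max_retries * retry_wait_s (15s with the defaults).
    """
    return stale_locations * max_retries * retry_wait_s


# 70 failed attempts with the default settings quoted in the issue.
delay = worst_case_delay_s(70)
print(f"{delay:.0f} s = {delay / 60:.1f} min")
```

Under this model the delay grows linearly with the number of stale locations, 
so lowering either parameter only scales the cost per location and does not 
remove the stale entries themselves.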

I already have a fix that adds a new config parameter, 
{{spark.block.failures.beforeLocationRefresh}}, which determines when to 
refresh the list of block locations from the driver while iterating over these 
locations. In my fix, this parameter is honored only when dynamic allocation is 
enabled, and its default value is Int.MaxValue, so the behavior does not change 
even if dynamic allocation is enabled (refreshing the locations may not be 
necessary in small clusters).
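
The idea can be sketched roughly as follows. This is an illustrative Python 
sketch, not the actual patch: {{fetch_block}}, {{get_locations}} and 
{{fetch_from}} are hypothetical names, the threshold is assumed to count 
failed locations since the last refresh, and {{sys.maxsize}} stands in for 
Int.MaxValue. The check for whether dynamic allocation is enabled is left out.

```python
import sys
from typing import Callable, List, Optional


def fetch_block(get_locations: Callable[[], List[str]],
                fetch_from: Callable[[str], Optional[bytes]],
                failures_before_refresh: int = sys.maxsize) -> Optional[bytes]:
    """Try each location in turn, refreshing the list after N failures.

    get_locations: hypothetical call that asks the driver for locations.
    fetch_from: hypothetical fetch that returns None when a location fails.
    """
    tried = set()
    pending = list(get_locations())
    failures = 0
    while pending:
        loc = pending.pop(0)
        tried.add(loc)
        data = fetch_from(loc)
        if data is not None:
            return data
        failures += 1
        if failures >= failures_before_refresh:
            # Ask the driver again and skip locations that already failed.
            pending = [l for l in get_locations() if l not in tried]
            failures = 0
    return None


def demo(threshold: int):
    """Simulate a stale first snapshot followed by a fresh one."""
    snapshots = [["a", "b", "c", "d", "live"], ["live"]]
    calls = {"n": 0}
    attempts = []

    def get_locations():
        snap = snapshots[min(calls["n"], len(snapshots) - 1)]
        calls["n"] += 1
        return snap

    def fetch_from(loc):
        attempts.append(loc)
        return b"payload" if loc == "live" else None

    return fetch_block(get_locations, fetch_from, threshold), attempts


print(demo(2))            # refreshes after two failed locations
print(demo(sys.maxsize))  # default: never refreshes, walks the whole list
```

With the default threshold the loop behaves like the current code and walks 
the whole (possibly stale) list; with a small threshold it gives up on the 
stale snapshot early and continues with fresh locations.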

> Poor read performance for broadcast variables with dynamic resource allocation
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-13328
>                 URL: https://issues.apache.org/jira/browse/SPARK-13328
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Nezih Yigitbasi
>
> When dynamic resource allocation is enabled, fetching broadcast variables from 
> removed executors was causing job failures, and SPARK-9591 fixed this problem 
> by trying all locations of a block before giving up. However, the locations 
> of a block are retrieved only once from the driver in this process, and the 
> locations in this list can be stale due to dynamic resource allocation. This 
> situation gets worse when running on a large cluster, as the size of this 
> location list can be on the order of several hundred, of which tens of 
> entries may be stale. What we have observed is that, with the default settings 
> of 3 max retries and 5s between retries (that's 15s per location), the time it 
> takes to read a broadcast variable can be as high as ~17m (the log below shows 
> the failed 70th block fetch attempt, where each attempt takes 15s):
> {code}
> ...
> 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block 
> broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 
> 60675) (failed attempt 70)
> ...
> 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 18 took 1051049 ms
> {code}


