Github user f7753 commented on the issue:

https://github.com/apache/spark/pull/14239

@tgravescs To make this more readable, here are answers to the questions above.

**1. Are you saying that you are loading all the data for all the maps from disk into memory and caching it waiting for the reducer to fetch it?**

**2. does it conditionally do this or always do it?**

I use the parameter `spark.shuffle.prepare.open` to switch this mechanism on or off, and `spark.shuffle.prepare.count` to control the number of blocks to cache. This lets users control the memory used for pre-fetched blocks based on their machine conditions.

**3. How exactly does the timing work on this, aren't you going to send the prepare immediately before sending the fetch? does the fetch block on waiting on the prepare to cache the data?**

I changed the logic of the shuffle message transfer process. Each time I send a `FetchRequest`, I also send the next one, so the server side knows exactly which blockIds belong to the next fetch loop and caches them. In the `FetchRequest` success callback the cache is released, since all of those blocks have been sent to the map side and are no longer needed.

When the `PrepareRequest` arrives, the server takes a thread from the thread pool to perform the read (in fact, I use a `FutureTask` for this). If the `FetchRequest` arrives before the data has been fully cached, the request blocks as before, but it is still more efficient than before because the data is already being loaded into memory before the request actually arrives.

**4. what testing have you done with this and what size of data? What type of load was on the nodes when testing, etc?**

I implemented this and tested it on branch 1.4 and branch 1.6, using Intel HiBench 4.0 terasort with a 1 TB data size. I got about a 30% performance improvement on a 5-node cluster where each node has 96 GB of memory, a Xeon E5 v3 CPU, and a 7200 RPM disk.
But note that a benchmark like terasort shuffles all the data that has been read, so in other cases the gain may not be as large.
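
To make the prepare/fetch timing in answer 3 concrete, here is a minimal Java sketch of the server-side idea. It is not the code from the pull request: the class `PreparedBlockCache`, its methods, and the `diskReader` function are hypothetical names invented for illustration, and the `maxPreparedBlocks` argument only plays the role that `spark.shuffle.prepare.count` is described as having. As a simplification, the cache entry is released when the fetch is served rather than in a success callback.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.function.Function;

/** Hypothetical server-side cache of blocks announced by a prepare request. */
class PreparedBlockCache {
    private final Map<String, FutureTask<byte[]>> prepared = new ConcurrentHashMap<>();
    private final ExecutorService pool;
    private final Function<String, byte[]> diskReader; // stands in for the real disk read
    private final int maxPreparedBlocks;               // assumed analogue of spark.shuffle.prepare.count

    PreparedBlockCache(int threads, int maxPreparedBlocks, Function<String, byte[]> diskReader) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.maxPreparedBlocks = maxPreparedBlocks;
        this.diskReader = diskReader;
    }

    /** Prepare request: start loading the next batch of blocks in the background. */
    void prepare(List<String> blockIds) {
        for (String id : blockIds) {
            // Soft limit: the size check and the insert are not atomic together.
            if (prepared.size() >= maxPreparedBlocks) {
                break; // cache is full; the remaining blocks are read on demand
            }
            FutureTask<byte[]> task = new FutureTask<>(() -> diskReader.apply(id));
            if (prepared.putIfAbsent(id, task) == null) {
                pool.execute(task);
            }
        }
    }

    /** Fetch request: wait for a pending load, or read directly on a cache miss. */
    byte[] fetch(String blockId) throws InterruptedException, ExecutionException {
        // Removing the entry releases the cached block once it has been served.
        FutureTask<byte[]> task = prepared.remove(blockId);
        return task != null ? task.get() : diskReader.apply(blockId);
    }

    int preparedCount() {
        return prepared.size();
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

In this sketch a fetch that arrives while the load is still running blocks in `task.get()`, but only for the remaining read time, which is the effect described above. A fetch for a block that was never prepared falls back to a direct read.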