Hi +baibing3 +huangtao6, I came across your presentation on Alluxio, including shuffling. Would you be interested in this?
________________________________
From: Matt Cheah <mch...@palantir.com>
Sent: Tuesday, September 4, 2018 2:54 PM
To: Yuanjian Li
Cc: Spark dev list
Subject: Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

Yuanjian,

Thanks for sharing your progress! I was wondering if there was any prototype code that we could read to get an idea of what the implementation looks like? We can evaluate the design together and also benchmark workloads from across the community - that is, we can collect more data from more Spark users.

The experience would be greatly appreciated in the discussion.

-Matt Cheah

From: Yuanjian Li <xyliyuanj...@gmail.com>
Date: Friday, August 31, 2018 at 8:29 PM
To: Matt Cheah <mch...@palantir.com>
Cc: Spark dev list <dev@spark.apache.org>
Subject: Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

Hi Matt,

Thanks for the great document and proposal. I want to +1 reliable shuffle data and give some feedback.

I think a reliable shuffle service based on DFS is necessary for Spark, especially when running Spark jobs in an unstable environment. For example, when Spark is co-deployed with online services, Spark executors can be killed at any time. The current stage retry strategy then makes the job many times slower than a normal job.

We (Baidu Inc.) actually solved this problem with a stable shuffle service over Hadoop, and we are now integrating Spark with this shuffle service. The POC work is expected to be done in October. We'll post more benchmarks and details of the work at that time. I'm still reading your discussion document and am happy to give more feedback in the doc.
Thanks,
Yuanjian Li

On Saturday, September 1, 2018 at 8:42 AM, Matt Cheah <mch...@palantir.com> wrote:

Hi everyone,

I filed SPARK-25299 [issues.apache.org]<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_SPARK-2D25299&d=DwMFaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=hzwIMNQ9E99EMYGuqHI0kXhVbvX3nU3OSDadUnJxjAs&m=aWBmhsrm7S7YT8YUwf0fphAsQ-piBw9ENlRn2ojrs9U&s=QmUpw5K6D-6ot7Kel1_RhXKdr7Rk_fXgqoaeIZN-kes&e=> to promote discussion on how we can improve the shuffle operation in Spark. The basic premise is to discuss the ways we can leverage distributed storage to improve the reliability and isolation of Spark’s shuffle architecture.

A few designs and a full problem statement are outlined in this architecture discussion document [docs.google.com]<https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.google.com_document_d_1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-2DrVHSM_edit-23heading-3Dh.btqugnmt2h40&d=DwMFaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=hzwIMNQ9E99EMYGuqHI0kXhVbvX3nU3OSDadUnJxjAs&m=aWBmhsrm7S7YT8YUwf0fphAsQ-piBw9ENlRn2ojrs9U&s=d60j5-gfmUL6SeNwkEdWAR8IYOQd3UXHJ20XwUtteew&e=>.

This is a complex problem and it would be great to get feedback from the community about the right direction to take this work in. Note that we have not yet committed to a specific implementation and architecture - there’s a lot that needs to be discussed for this improvement, so we hope to get as much input as possible before moving forward with a design.

Please feel free to leave comments and suggestions on the JIRA ticket or on the discussion document.

Thank you!

-Matt Cheah