Curious how SPARK-25299 (where file tracking is pushed to Spark drivers, at least in option-5) interacts with Splash. The shuffle data location in SPARK-25299 would now have additional "fallback" logic for recovering from executor loss.

On Thu, Jan 3, 2019 at 6:24 AM Peter Rudenko <petro.rude...@gmail.com> wrote:
> Hi Matt, I'm a developer of the SparkRDMA shuffle manager:
> https://github.com/Mellanox/SparkRDMA
> Thanks for your effort on improving the Spark Shuffle API. We are very
> interested in participating in this. For now I have several comments:
>
> 1. I went through these 4 documents:
>
> https://docs.google.com/document/d/1tglSkfblFhugcjFXZOxuKsCdxfrHBXfxgTs-sbbNB3c/edit#
>
> https://docs.google.com/document/d/1TA-gDw3ophy-gSu2IAW_5IMbRK_8pWBeXJwngN9YB80/edit
>
> https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40
>
> https://docs.google.com/document/d/1kSpbBB-sDk41LeORm3-Hfr-up98Ozm5wskvB49tUhSs/edit#
>
> As I understand it, there are 2 discussions: improving the shuffle manager API
> itself (Splash manager) and improving the external shuffle service
> <https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.9o9f7nm01fz6>
>
> 2. We may consider revisiting SPIP: RDMA Accelerated Shuffle Engine
> <https://issues.apache.org/jira/browse/SPARK-22229>, whether to support
> RDMA in the main codebase or at least as a first-class shuffle plugin
> (not many other open source shuffle plugins exist). We actively
> develop it, adding new features. RDMA is now available on Azure (
> https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/),
> Alibaba and other cloud providers. For now we support only memory <->
> memory transfer, but RDMA is extensible to NVM and GPU data transfer.
>
> 3. We have users that are interested in having this feature (
> https://issues.apache.org/jira/browse/SPARK-12196) - we can consider
> adding it to this new API.
>
> Let me know if you need help with review / testing / benchmarking.
> I'll look further into the documents and PR.
>
> Thanks,
> Peter Rudenko
> Software engineer at Mellanox Technologies.
>
>
> On Wed, Dec 19, 2018 at 20:54, John Zhuge <john.zh...@gmail.com> wrote:
>
>> Matt, appreciate the update!
>>
>> On Wed, Dec 19, 2018 at 10:51 AM Matt Cheah <mch...@palantir.com> wrote:
>>
>>> Hi everyone,
>>>
>>> Earlier this year, we proposed SPARK-25299
>>> <https://issues.apache.org/jira/browse/SPARK-25299>, proposing the idea
>>> of using other storage systems for persisting shuffle files. Since that
>>> time, we have been continuing to work on prototypes for this project. In
>>> the interest of increasing transparency into our work, we have created a
>>> progress report document
>>> <https://docs.google.com/document/d/1tglSkfblFhugcjFXZOxuKsCdxfrHBXfxgTs-sbbNB3c/edit?usp=sharing>
>>> where you may find a summary of the work we have been doing, as well as
>>> links to our prototypes on Github. We would ask that anyone who is very
>>> familiar with the inner workings of Spark’s shuffle could provide feedback
>>> and comments on our work thus far. We welcome any further discussion in
>>> this space. You may comment in this e-mail thread or by commenting on the
>>> progress report document.
>>>
>>> Looking forward to hearing from you. Thanks,
>>>
>>> -Matt Cheah
>>>
>>
>> --
>> John
>>