Production results of push-based shuffle after rolling out to 100% of Spark workloads at LinkedIn

2021-04-15 Thread mshen
Hi,

We previously raised the SPIP for push-based shuffle in SPARK-30602
(https://issues.apache.org/jira/browse/SPARK-30602).
Thanks to the reviews from the community, a significant portion of the code
has already been merged.

In the meantime, we have continued to improve the solution at LinkedIn,
scaling it to cover 100% of our offline Spark workloads, and we reached that
milestone last month.
We have observed significant improvements in both shuffle efficiency and job
runtime across our clusters; the results are shared in the following blog
post:
https://www.linkedin.com/pulse/bringing-next-gen-shuffle-architecture-data-linkedin-scale-min-shen/

We would like to get feedback from the community on the content covered in
the blog post.
In addition, since the release timeline for Spark 3.2 has been postponed
until September, we believe it would be reasonable to include push-based
shuffle in the Spark 3.2 release itself, given that this feature has already
been validated in production at scale.
We also want to draw attention to the various patches currently under or
pending review under SPARK-30602, so we can get more eyes on the remaining
patches.
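
For anyone who wants to experiment once the remaining patches land, below is
a minimal sketch of how a job could opt in. The configuration names follow
our current patches under SPARK-30602 and may change before the release;
treat this as illustrative rather than final.

import org.apache.spark.sql.SparkSession

// A minimal sketch of opting in to push-based shuffle. The configuration
// names follow the current SPARK-30602 patches and may change before the
// Spark 3.2 release.
val spark = SparkSession.builder()
  .appName("push-based-shuffle-example")
  // Client side: push shuffle blocks to remote shuffle services while the
  // map stage is still running.
  .config("spark.shuffle.push.enabled", "true")
  // Push-based shuffle builds on top of the external shuffle service.
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()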



-
Min Shen
Sr. Staff Software Engineer
LinkedIn



Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-15 Thread mshen
Hi Joseph,

We would be interested in discussing your thoughts on how push-based shuffle
could help with continuous mode in SS.
We have discussed the applicability of push-based shuffle to streaming
engines, especially for continuous operation mode, internally at LinkedIn
with our Samza peers as well as with Alibaba's Flink team, and would like to
hear how you see it applying to SS continuous mode.



-
Min Shen
Staff Software Engineer
LinkedIn



Re: Push-based shuffle SPIP

2020-08-24 Thread mshen
The linked doc with detailed information about the branch does not seem to
be publicly shareable.
We have created a copy of the doc which should be publicly accessible:
https://docs.google.com/document/d/1Q5m7YAp0HyG_TNFL4p_bjQgzzw33ik5i49Vr86UNZgg/edit?usp=sharing



-
Min Shen
Staff Software Engineer
LinkedIn



Push-based shuffle SPIP

2020-08-24 Thread mshen
We raised this SPIP ticket in
https://issues.apache.org/jira/browse/SPARK-30602 earlier this year.
Since then, we have made progress on multiple fronts, including:

* Our work was published in VLDB 2020. The final version of the paper is
attached to the SPIP ticket.
* We have further enhanced and productionized this work at LinkedIn, and
have enabled production flows to adopt the new push-based shuffle mechanism,
with good results.
* We have also recently ported our push-based shuffle changes to the OSS
Spark master branch, so others can try it out. Details of this branch are in
this doc.
* The SPIP doc has also been further updated to reflect more recent designs.
* We have also talked with multiple companies that share a similar interest
in this work.

We would like to resume the discussion of this SPIP with the community and
move toward a vote.




-
Min Shen
Staff Software Engineer
LinkedIn



Re: Enabling push-based shuffle in Spark

2020-06-24 Thread mshen
Our paper summarizing this work on push-based shuffle was recently accepted
to VLDB 2020.
We have uploaded a preprint version of the paper to the JIRA ticket
(SPARK-30602), along with the production results we have so far.



-
Min Shen
Staff Software Engineer
LinkedIn



Re: Enabling push-based shuffle in Spark

2020-01-23 Thread mshen
Hi Wenchen,

Glad to know that you like this idea.
We also looked into making this pluggable in our early design phase.
While the ShuffleManager API for pluggable shuffle systems does provide
considerable room for customizing Spark shuffle behavior, we feel it is
still not enough for this case.

Right now, shuffle block location information is tracked inside
MapOutputTracker and updated by the DAGScheduler.
Since we are relocating shuffle blocks to improve overall shuffle throughput
and efficiency, we need to be able to update the information tracked inside
MapOutputTracker so that reducers can access their shuffle input more
efficiently.
Letting the DAGScheduler orchestrate this process also helps it cope better
with stragglers.
If the DAGScheduler has no control over, or is agnostic of, the block push
progress, that leaves a few gaps.
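
To make this concrete, here is a hypothetical sketch of the kind of
driver-side bookkeeping we have in mind; the trait and method names are
illustrative and not the actual Spark API:

import org.apache.spark.storage.BlockManagerId

// Hypothetical sketch, not the actual Spark API: the driver tracks merged
// shuffle partition locations alongside the per-mapper outputs that
// MapOutputTracker already records.
trait MergedShuffleOutputTracker {
  // Invoked as the DAGScheduler learns that pushed blocks for a reduce
  // partition have been merged at a remote shuffle service.
  def registerMergedLocation(shuffleId: Int,
                             reduceId: Int,
                             location: BlockManagerId): Unit

  // Reducers prefer the merged location when present and fall back to the
  // original map output locations otherwise.
  def getMergedLocation(shuffleId: Int, reduceId: Int): Option[BlockManagerId]
}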

On the shuffle Netty protocol side, there is a lot that can be leveraged
from the existing code.
With the improvements in SPARK-24355 and SPARK-30512, the shuffle service
Netty server is becoming much more reliable.
The work in SPARK-6237 also provides a good foundation for streaming pushes
of shuffle blocks.
Instead of building all of this from scratch, we took the alternative route
of building on top of the existing Netty protocol to implement the shuffle
block push operation.
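
As a rough illustration, the push operation adds a control message in the
spirit of the existing BlockTransferMessage hierarchy. The sketch below uses
illustrative names, not the actual wire format in our patches:

// Hypothetical control message for pushing a shuffle block; the block
// bytes themselves travel as a stream, reusing the streaming support
// introduced in SPARK-6237.
case class PushBlockRequest(
    appId: String,   // application that produced the block
    shuffleId: Int,  // shuffle the block belongs to
    mapId: Int,      // mapper that produced the block
    reduceId: Int)   // reduce partition the block should be merged into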

We feel this design has the potential to further improve the Spark shuffle
system's scalability and efficiency, making Spark an even better compute
engine.
We would like to explore how we can leverage the shuffle plugin API to make
this design more acceptable.



-
Min Shen
Staff Software Engineer
LinkedIn



Re: Enabling push-based shuffle in Spark

2020-01-21 Thread mshen
Hi Reynold,

Thanks for the comments.
Although a big portion of the problem motivation in the SPIP doc is around
optimizing small random shuffle reads, I believe the benefits of this design
go beyond that.

In terms of the approach we take, it is true that the map phase still needs
to materialize the intermediate shuffle data and the reduce phase does not
start until the map phase is done.
However, pushing the shuffle data to the locations where the reducers will
run while the map stage is still running does provide additional latency
reduction beyond the disk read improvements.
In fact, for the benchmark job we used to measure improvements, the
reduction in shuffle fetch wait time accounts for only about 30% of the
total task runtime reduction.

Another benefit we can expect is improved shuffle reliability. By reducing
the number of blocks that need to be fetched remotely and by effectively
keeping two replicas of the intermediate shuffle data, we can also reduce
the likelihood of encountering shuffle fetch failures that lead to expensive
retries.
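
As a rough sketch of what the two copies buy us (illustrative signatures,
not our actual implementation):

import java.nio.ByteBuffer
import scala.util.{Failure, Success, Try}

// Hypothetical fallback: if fetching the merged copy of a partition fails,
// the reducer retries against the original per-mapper blocks instead of
// failing the stage.
def fetchPartition(
    fetchMerged: () => Try[ByteBuffer],
    fetchOriginals: () => Try[Seq[ByteBuffer]]): Try[Seq[ByteBuffer]] =
  fetchMerged() match {
    case Success(merged) => Success(Seq(merged)) // fast path: one merged read
    case Failure(_)      => fetchOriginals()     // fall back to unmerged blocks
  }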

For the alternative approach of merging map outputs for each node locally,
which is similar to Facebook's SOS shuffle service, there are a few
downsides:
1. The merge ratio might not be high enough, depending on the average number
of mapper tasks per node.
2. It does not deliver the shuffle partition data to the reducers. Most of
the reducer task input still needs to be fetched remotely.
3. At least in Facebook's Riffle paper, the local merge is performed by the
shuffle service, since it needs to read multiple mappers' output. This means
the memory buffering happens on the shuffle service side instead of on the
executor side. While our approach also does memory buffering right now, we
do it on the executor side, which makes it much less constrained than doing
it inside the shuffle service (see the sketch below).
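
A minimal sketch of the executor-side buffering idea mentioned in point 3,
with illustrative names rather than our actual implementation:

import java.nio.ByteBuffer
import scala.collection.mutable

// Hypothetical executor-side buffer: blocks bound for the same remote
// shuffle service accumulate in a bounded in-memory buffer and are pushed
// as one batch, keeping the memory cost on the executor rather than on the
// shared shuffle service.
class BlockPushBuffer(maxBufferBytes: Long, push: Seq[ByteBuffer] => Unit) {
  private val pending = mutable.ArrayBuffer.empty[ByteBuffer]
  private var bufferedBytes = 0L

  def add(block: ByteBuffer): Unit = {
    pending += block
    bufferedBytes += block.remaining()
    // Flush once the bound is reached, keeping executor memory predictable.
    if (bufferedBytes >= maxBufferBytes) flush()
  }

  def flush(): Unit = if (pending.nonEmpty) {
    push(pending.toSeq)
    pending.clear()
    bufferedBytes = 0L
  }
}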



-
Min Shen
Staff Software Engineer
LinkedIn



Enabling push-based shuffle in Spark

2020-01-21 Thread mshen
I'd like to start a discussion on enabling push-based shuffle in Spark.
This is meant to address shuffle inefficiencies in large-scale Spark compute
infrastructure deployments.
Facebook's previous talks on SOS shuffle and the Cosco shuffle service
describe solutions to a similar problem.
Note that this is somewhat orthogonal to the work in SPARK-25299
(https://issues.apache.org/jira/browse/SPARK-25299), which is about using
remote storage to store shuffle data.
More details of our proposed design are in SPARK-30602
(https://issues.apache.org/jira/browse/SPARK-30602), with the SPIP attached.
We would appreciate comments and discussion from the community.



-
Min Shen
Staff Software Engineer
LinkedIn