[ 
https://issues.apache.org/jira/browse/FLINK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Liao updated FLINK-22672:
-------------------------------
    Comment: was deleted

(was: Hi [~maguowei], [~jinxing6...@126.com], [~kevin.cyj],

Currently we're trying to migrate our shuffle data from local disk to an 
external shuffle service, which is very similar to Cosco at Facebook. The 
mapper and reducer should be initialized with our own {{ShuffleServiceClient}} 
implementation, but I found that the current construction of 
{{BoundedData}} is based on local files. See the factory method of 
{{BoundedBlockingSubpartition}} below:


{code:java}
public abstract BoundedBlockingSubpartition create(
        int index,
        ResultPartition parent,
        File tempFile,
        int readBufferSize,
        boolean sslEnabled)
        throws IOException;
{code}

Is there any possibility to extend the concept of 
{{BoundedBlockingSubpartition}} so that it is not limited to local resources?)
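
One way to express that extension is to hide the storage destination behind an 
interface instead of a {{java.io.File}} parameter, so the factory stays 
storage-agnostic. The sketch below is purely illustrative; every name in it 
({{BoundedDataTarget}}, {{InMemoryTarget}}, {{SubpartitionFactory}}) is 
hypothetical and not part of Flink's actual API:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical illustration only -- none of these types exist in Flink.

/** Abstracts where bounded shuffle data goes: local file, remote service, etc. */
interface BoundedDataTarget {
    void write(byte[] data) throws IOException;
    long bytesWritten();
}

/** Trivial in-memory target, standing in for a local-file implementation. */
class InMemoryTarget implements BoundedDataTarget {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    public void write(byte[] data) {
        out.write(data, 0, data.length);
    }

    public long bytesWritten() {
        return out.size();
    }
}

/** A factory in the spirit of the one above, minus the hard-coded File. */
abstract class SubpartitionFactory {
    public abstract Object create(int index, BoundedDataTarget target, int readBufferSize)
            throws IOException;
}
{code}

A remote implementation of {{BoundedDataTarget}} would then forward the 
{{write()}} calls to an external shuffle service client rather than a local file.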

> Some enhancements for pluggable shuffle service framework
> ---------------------------------------------------------
>
>                 Key: FLINK-22672
>                 URL: https://issues.apache.org/jira/browse/FLINK-22672
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: Jin Xing
>            Assignee: Jin Xing
>            Priority: Major
>             Fix For: 1.14.0
>
>
> "Pluggable shuffle service" in Flink provides a unified architecture for both 
> streaming and batch jobs, allowing users to customize the process of data 
> transfer between shuffle stages according to their scenarios.
> There are already a number of "remote shuffle service" implementations on 
> Spark, e.g. [1][2][3]. Remote shuffle makes it possible to shuffle data 
> from/to a remote cluster and brings benefits like:
>  # The lifecycle of computing resources can be decoupled from shuffle data: 
> once a computing task finishes, idle computing nodes can be released while 
> their completed shuffle data is accommodated on the remote shuffle cluster.
>  # There is no need to reserve disk capacity for shuffle on computing nodes. 
> The remote shuffle cluster serves shuffle requests with better scaling ability 
> and alleviates local disk pressure on computing nodes when data is skewed.
> Based on "pluggable shuffle service", we built our own "remote shuffle 
> service" on Flink -- Lattice -- which aims to provide additional functionality 
> and better performance for batch processing jobs. Basically it works as below:
>  # The Lattice cluster works as an independent service for shuffling requests;
>  # LatticeShuffleMaster extends ShuffleMaster; it works inside the JM and 
> talks with the remote Lattice cluster for shuffle resource requests and 
> shuffle data lifecycle management;
>  # LatticeShuffleEnvironment extends ShuffleEnvironment; it works inside the 
> TM and provides an environment for shuffling data from/to the remote Lattice 
> cluster;
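> 
> The three roles above can be sketched in heavily simplified form as follows. 
> ShuffleMaster and ShuffleEnvironment are real Flink interfaces, but every 
> name and signature below (SimpleShuffleMaster, RemoteShuffleMaster, etc.) is 
> invented for illustration and is not Flink's or Lattice's actual API:
> 
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> 
> // Simplified, hypothetical illustration of the plugin split described above.
> 
> /** JM side: allocates shuffle resources for result partitions. */
> interface SimpleShuffleMaster {
>     String registerPartition(String partitionId);
>     void releasePartition(String partitionId);
> }
> 
> /** TM side: produces writer handles for a registered partition. */
> interface SimpleShuffleEnvironment {
>     String createWriterFor(String descriptor);
> }
> 
> class RemoteShuffleMaster implements SimpleShuffleMaster {
>     private final Map<String, String> allocated = new HashMap<>();
> 
>     public String registerPartition(String partitionId) {
>         // In a real system this would talk to the remote shuffle cluster.
>         String descriptor = "remote://shuffle-cluster/" + partitionId;
>         allocated.put(partitionId, descriptor);
>         return descriptor;
>     }
> 
>     public void releasePartition(String partitionId) {
>         allocated.remove(partitionId);
>     }
> }
> 
> class RemoteShuffleEnvironment implements SimpleShuffleEnvironment {
>     public String createWriterFor(String descriptor) {
>         // A real implementation would open a channel to the remote cluster.
>         return "writer(" + descriptor + ")";
>     }
> }
> {code}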
> During the process of building Lattice we found some potential enhancements to 
> "pluggable shuffle service". I will enumerate them and create sub-JIRAs under 
> this umbrella.
>  
> [1] 
> [https://www.alibabacloud.com/blog/emr-remote-shuffle-service-a-powerful-elastic-tool-of-serverless-spark_597728]
> [2] [https://bestoreo.github.io/post/cosco/cosco/]
> [3] [https://github.com/uber/RemoteShuffleService]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
