[ 
https://issues.apache.org/jira/browse/FLINK-9182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451976#comment-16451976
 ] 

ASF GitHub Bot commented on FLINK-9182:
---------------------------------------

GitHub user makeyang opened a pull request:

    https://github.com/apache/flink/pull/5908

    [FLINK-9182]async checkpoints for timer service

    ## What is the purpose of the change
    This PR is WIP, and is need finish unit tests which are marked as TODO.
    It is opened to collect feedback for a proposed solution for FLINK-9182
    
    ## Brief change log
    1. add one more state called rawkeyedstatemeta, which is help to store 
verion info of timerservice and tierm size
    2. make timer state snapshot async
    
    ## Does this pull request potentially affect one of the following parts:
        Dependencies (does it add or upgrade a dependency): (yes / (**no**)
        The public API, i.e., is any changed class annotated with 
@Public(Evolving): (yes / **no**)
        The serializers: (yes / **no** / don't know)
        The runtime per-record code paths (performance sensitive): (yes / 
**no** / don't know)
        Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
        The S3 file system connector: (yes / **no** / don't know)
    
    ## Documentation
        Does this pull request introduce a new feature? (yes / **no**)
        If yes, how is the feature documented? (not applicable / docs / 
JavaDocs / not documented)
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/makeyang/flink FLINK-9182

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5908.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5908
    
----
commit 7eca3bebd2b92ffb53a2058d10df966ffd3d4875
Author: makeyang <makeyang@...>
Date:   2018-04-25T09:24:26Z

    [FLINK-9182]async checkpoints for timer service
    there are some unit tests still need to fix which I marked as TODO.
    the package has passed the integration test on my test env so please take a 
look at the code to verified the init thoughts first

----


> async checkpoints for timer service
> -----------------------------------
>
>                 Key: FLINK-9182
>                 URL: https://issues.apache.org/jira/browse/FLINK-9182
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.5.0, 1.4.2
>            Reporter: makeyang
>            Assignee: makeyang
>            Priority: Minor
>             Fix For: 1.4.3, 1.5.1
>
>
> # problem description:
>  ## with the increase in the number of  'InternalTimer' object the checkpoint 
> more and more slowly
>  # improvement desgin
>  ## maintain a stateTableVersion, which is exactly the same thing as 
> CopyOnWriteStateTable and snapshotVersions which is exactly the same thing as 
> CopyOnWriteStateTable in InternalTimeServiceManager. one more thing: a 
> readwrite lock, which is used to protect snapshotVersions and 
> stateTableVersion
>  ## for each InternalTimer, add 2 more properties: create version and delete 
> version beside 3 existing properties: timestamp, key and namespace. each time 
> a Timer is registered in timerservice, it is created with stateTableVersion 
> as its create version while delete version is -1. each time when timer is 
> deleted in timerservice, it is marked delete for giving it a delete verison 
> equals to stateTableVersion without physically delete it from timerservice.
>  ## each time when try to snapshot timers, InternalTimeServiceManager 
> increase its stateTableVersion and add this stateTableVersion in 
> snapshotVersions. these 2 operators are protected by write lock of 
> InternalTimeServiceManager. that current stateTableVersion take as snapshot 
> version of this snapshot
>  ## shallow copy <String,HeapInternalTimerService> tuples
>  ## then use a another thread asynchronous snapshot whole things: 
> keyserialized, namespaceserializer and timers. for timers which is not 
> deleted(delete version is -1) and create version less than snapshot version, 
> serialized it. for timers whose delete version is not -1 and is bigger than 
> or equals snapshot version, serialized it. otherwise, it will not be 
> serialized by this snapshot.
>  ## when everything is serialized, remove snapshot version in 
> snapshotVersions, which is still in another thread and this action is guarded 
> by write lock.
>  ## last thing: timer physical deletion. 2 places to physically delete 
> timers: each time when timer is deleted in timerservice, it is marked delete 
> for giving it a delete verison equals to stateTableVersion without physically 
> delete it from timerservice. after this, check if snapshotVersions size is 0 
> (which means there is no running snapshot) and if true, delete timer .the 
> other place to delete is in snapshot timer's iterat: when timer's delete 
> version is less than min value of snapshotVersions, which means the timer is 
> deleted and no running snapshot should keep it.
>  ## some more additions: processingTimeTimers and eventTimeTimers for each 
> group used to be hashset and now it is changed to concurrenthashmap with 
> key+namesapce+timestamp as its hash key.
>  # related mail list thread
>  ## 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Slow-flink-checkpoint-td18946.html
>  # github pull request
>  ## //coming soon



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to