[ 
https://issues.apache.org/jira/browse/KAFKA-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847179#comment-15847179
 ] 

ASF GitHub Bot commented on KAFKA-4317:
---------------------------------------

GitHub user dguy opened a pull request:

    https://github.com/apache/kafka/pull/2471

    KAFKA-4317: Checkpoint State Stores on commit/flush

    Currently the checkpoint file is deleted at state store initialization and 
it is only ever written again during a clean shutdown. This can result in 
significant delays during restarts as the entire store needs to be loaded from 
the changelog. 
    We can mitigate this by frequently checkpointing the offsets. The 
checkpointing happens only during the commit phase, i.e., after we have manually 
flushed the store and the producer. So we guarantee that the checkpointed 
offsets are never greater than what has been flushed. 
    In the event of hard failure we can recover by reading the checkpoints and 
consuming from the stored offsets.
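    The recovery path can be sketched as below. This is a minimal illustration, 
not the PR's implementation: it assumes the checkpoint file layout used by 
Kafka's OffsetCheckpoint files (a version line, an entry count, then one 
`topic partition offset` triple per line), and the class and topic names are 
placeholders.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class CheckpointReader {
    // Parses checkpoint contents into a (topic-partition -> offset) map,
    // which a restoring task could use as the position to resume consuming from.
    public static Map<String, Long> read(BufferedReader reader) throws IOException {
        int version = Integer.parseInt(reader.readLine().trim());  // format version line
        int expected = Integer.parseInt(reader.readLine().trim()); // number of entries
        Map<String, Long> offsets = new HashMap<>();
        for (int i = 0; i < expected; i++) {
            // Each entry line: <topic> <partition> <offset>
            String[] parts = reader.readLine().trim().split("\\s+");
            offsets.put(parts[0] + "-" + parts[1], Long.parseLong(parts[2]));
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical checkpoint contents: version 0, one entry for
        // partition 0 of a changelog topic, checkpointed at offset 42.
        String contents = "0\n1\nmy-changelog 0 42\n";
        Map<String, Long> offsets = read(new BufferedReader(new StringReader(contents)));
        System.out.println(offsets.get("my-changelog-0")); // prints 42
    }
}
```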
    The checkpoint interval can be controlled by the config 
`statestore.checkpoint.interval.ms` - if this is set to a value <= 0 it 
effectively turns checkpoints off. The interval is only a guide, in that the 
minimum checkpoint interval will always be the commit interval (as checkpointing 
must follow a commit to guarantee consistency).
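    Setting the config might look like the sketch below. Only the 
`statestore.checkpoint.interval.ms` key comes from this PR's description; the 
class name and the other property values are placeholders for illustration.

```java
import java.util.Properties;

public class CheckpointConfigExample {
    public static Properties buildStreamsConfig() {
        Properties props = new Properties();
        // Standard Kafka Streams settings (placeholder values).
        props.put("application.id", "checkpoint-demo");
        props.put("bootstrap.servers", "localhost:9092");
        // Key from the PR description: checkpoint state stores every 60s.
        // A value <= 0 would effectively turn checkpoints off, and the
        // effective minimum is the commit interval regardless of this value.
        props.put("statestore.checkpoint.interval.ms", "60000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildStreamsConfig();
        System.out.println(props.getProperty("statestore.checkpoint.interval.ms")); // prints 60000
    }
}
```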

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dguy/kafka kafka-4317

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/2471.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2471
    
----
commit 6743dc63293e2d0fca57dcb7d1a0ace5237837b0
Author: Damian Guy <damian....@gmail.com>
Date:   2017-01-31T13:37:00Z

    checkpoint statestores

----


> RocksDB checkpoint files lost on kill -9
> ----------------------------------------
>
>                 Key: KAFKA-4317
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4317
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>    Affects Versions: 0.10.0.1
>            Reporter: Greg Fodor
>            Assignee: Damian Guy
>            Priority: Critical
>              Labels: architecture, user-experience
>
> Right now, the checkpoint files for logged RocksDB stores are written during 
> a graceful shutdown, and removed upon restoration. Unfortunately this means 
> that in a scenario where the process is forcibly killed, the checkpoint files 
> are not there, so all RocksDB stores are rematerialized from scratch on the 
> next launch.
> In a way, this is good, because it simulates bootstrapping a new node (for 
> example, it's a good way to see how much I/O is used to rematerialize the 
> stores); however, it leads to longer recovery times when a non-graceful 
> shutdown occurs and we want to get the job up and running again.
> There seem to be two possible things to consider:
> - Simply do not remove checkpoint files on restoring. This way a kill -9 will 
> result in only repeating the restoration of all the data generated in the 
> source topics since the last graceful shutdown.
> - Continually update the checkpoint files (perhaps on commit) -- this would 
> result in the least amount of overhead/latency in restarting, but the 
> additional complexity may not be worth it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
