[jira] [Commented] (KAFKA-10005) Decouple RestoreListener from RestoreCallback and not enable bulk loading for RocksDB

Guozhang Wang (Jira) Fri, 22 May 2020 12:12:30 -0700


    [ https://issues.apache.org/jira/browse/KAFKA-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114342#comment-17114342 ]


Guozhang Wang commented on KAFKA-10005:
---------------------------------------

Here's a rough idea for replacing the bulk loading configs. The key idea 
behind the configs

{code}
dbOptions.setMaxBackgroundFlushes(4);
columnFamilyOptions.setDisableAutoCompactions(true);
columnFamilyOptions.setLevel0FileNumCompactionTrigger(1 << 30);
columnFamilyOptions.setLevel0SlowdownWritesTrigger(1 << 30);
columnFamilyOptions.setLevel0StopWritesTrigger(1 << 30);
{code}

is primarily to achieve two things: 1) disable compaction, and 2) never slow 
down or stall writes.

These configs could be replaced entirely if we use the 
{{db.addFileWithFileInfo(externalSstFileInfo);}} API to add SST files directly. 
That API has been refactored and is available since 5.5.2, and while loading 
this way the state store can still be read, just less efficiently. Given that, 
the following steps can be considered:

1) RocksDB's restore procedure, which calls {{putAllInternal}}, would be 
implemented with db.addFileWithFileInfo instead. The key requirement is that 
each batch of writes is sorted, so that it can be added as-is to the SST file 
directory.
2) We remove the per-store restore listener that toggles bulk loading on and 
off. Instead, we add a new function {{toggle}} to the internal 
{{BulkLoadingStore}} interface, which is triggered in 
{{StreamTask#completeRestoration}}. The RocksDB store implementing this 
interface would open the store in {{init}} with configs that disable 
compaction, and then in {{toggle}} would close and reopen the store with the 
configs reset to their normal values, and also run a one-off big compaction.
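
To make the lifecycle in step 2 concrete, here is a minimal sketch. Only the names {{BulkLoadingStore}}, {{toggle}} and {{init}} come from the proposal above; the method signatures, the {{SketchStore}} class and its event log are assumptions made up for illustration, and no RocksDB calls are involved. It also assumes the compaction runs after the reopen, which the proposal does not pin down.

```java
import java.util.ArrayList;
import java.util.List;

// Interface name and method name are from the proposal; the signature is an assumption.
interface BulkLoadingStore {
    void toggle();
}

// Hypothetical in-memory stand-in for a RocksDB-backed store.
// It only records lifecycle events instead of touching a real database.
class SketchStore implements BulkLoadingStore {
    final List<String> events = new ArrayList<>();
    private boolean bulkLoading;

    // Open once with compaction disabled, so restoration starts in bulk-loading mode.
    void init() {
        bulkLoading = true;
        events.add("open(bulk-loading)");
    }

    // Meant to be called once when restoration completes:
    // close, reopen with normal configs, then run a single compaction.
    @Override
    public void toggle() {
        if (!bulkLoading) {
            return; // already in normal mode; nothing to do
        }
        events.add("close");
        events.add("open(normal)");
        events.add("compact");
        bulkLoading = false;
    }

    boolean isBulkLoading() {
        return bulkLoading;
    }
}
```

In this sketch a store that goes through {{init}} and then {{toggle}} is opened twice and closed once, and a second {{toggle}} call is a no-op.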

Then, outside the scope of this JIRA ticket, when we move restoration to 
separate threads we can batch records across more than a single poll call, so 
that the sorted write batch is large enough to form a single SST file.
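
As a rough illustration of batching across polls, the sketch below buffers records until a threshold is reached and then hands out one sorted batch. The class {{SortedBatchBuffer}}, its methods and the entry-count threshold are hypothetical and not part of Kafka Streams or RocksDB. It assumes the store uses RocksDB's default bytewise comparator, which orders keys as unsigned bytes, and that an SST file cannot hold the same key twice, so a later record for a key replaces the earlier one.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: buffers changelog records across polls and emits sorted batches.
class SortedBatchBuffer {
    // Unsigned lexicographic order; Java bytes are signed, so a plain signed compare would differ.
    private static final Comparator<byte[]> UNSIGNED = Arrays::compareUnsigned;

    private final TreeMap<byte[], byte[]> buffer = new TreeMap<>(UNSIGNED);
    private final int maxEntries;

    SortedBatchBuffer(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    // Add the records from one poll. A later value for the same key overwrites the earlier one.
    // Returns a sorted batch once the buffer holds at least maxEntries keys, otherwise null.
    List<Map.Entry<byte[], byte[]>> add(List<Map.Entry<byte[], byte[]>> polled) {
        for (Map.Entry<byte[], byte[]> record : polled) {
            buffer.put(record.getKey(), record.getValue());
        }
        return buffer.size() >= maxEntries ? drain() : null;
    }

    // Copy the buffered entries out in key order and reset the buffer.
    List<Map.Entry<byte[], byte[]>> drain() {
        List<Map.Entry<byte[], byte[]>> batch = new ArrayList<>(buffer.size());
        for (Map.Entry<byte[], byte[]> e : buffer.entrySet()) {
            batch.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        buffer.clear();
        return batch;
    }
}
```

Each returned batch is already in ascending key order with unique keys, so it could be written out as one SST file and then added to the store. A real implementation would more likely use a byte-size threshold than an entry count.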

> Decouple RestoreListener from RestoreCallback and not enable bulk loading for 
> RocksDB
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10005
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10005
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Guozhang Wang
>            Priority: Major
>
> In Kafka Streams we have two restoration callbacks:
> * RestoreCallback (BatchingRestoreCallback): specified per-store via 
> registration to specify the logic of applying a batch of records read from 
> the changelog to the store. Used for both updating standby tasks and 
> restoring active tasks.
> * RestoreListener: specified per-instance via `setRestoreListener`, to 
> specify the logic for `onRestoreStart / onRestoreEnd / onBatchRestored`.
> As we can see, these two callbacks serve quite different purposes; however, 
> today we allow users to register a per-store RestoreCallback that also 
> implements the RestoreListener. This odd mixing is actually motivated by 
> Streams' internal usage to enable / disable bulk loading inside RocksDB. For 
> users, however, it is less meaningful to make a callback a listener as well, 
> since `onRestoreStart / End` has the storeName passed in, so users 
> can just define different listening logic for different stores if needed.
> On the other hand, this mixing of two callbacks forces Streams to check 
> internally whether the passed-in per-store callback also implements the 
> listener, and if so to trigger its calls, which increases complexity. Besides, 
> toggling RocksDB for bulk loading requires us to open / close / reopen / 
> reclose the store 4 times during restoration, which could also be costly.
> Given that we have KIP-441 in place, I think we should consider 
> alternatives to toggling bulk loading during restoration for Streams (e.g. 
> using different threads for restoration).
> The proposal for this ticket is to completely decouple the listener from 
> callback -- i.e. we would not presume users passing in a callback function 
> that implements both RestoreCallback and RestoreListener, and also for 
> RocksDB we replace the bulk loading mechanism with other ways of 
> optimization: https://rockset.com/blog/optimizing-bulk-load-in-rocksdb/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
