[jira] [Comment Edited] (KAFKA-8037) KTable restore may load bad data

Sophie Blee-Goldman (Jira) Wed, 22 Jul 2020 16:53:35 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163142#comment-17163142
 ]


Sophie Blee-Goldman edited comment on KAFKA-8037 at 7/22/20, 11:52 PM:
-----------------------------------------------------------------------

Ah, thanks for bringing us back to the question of double-topic vs restore-time, 
[~guozhang]. I don't think I touched on this earlier, and I may have taken my 
thoughts on it for granted without explaining them. If we can agree that the 
asymmetric/side-effect serdes are not a problem here (and that is a big "if"), 
then in the case where we may have corrupt data (non-default DEH, i.e. a 
non-default deserialization exception handler) I think we should just 
deserialize during restoration instead of adding a separate changelog topic. 
Since we only have to deserialize and not serialize, the performance hit might 
not be as bad.
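
To make the restore-time option concrete, here is a rough sketch in plain Java. 
It is not Kafka Streams code: RestoreSketch, RawRecord, restoreRaw and 
restoreChecked are hypothetical names, and the "serde" is a stand-in that only 
accepts integer strings. It just contrasts copying bytes on restore with 
deserializing each value first and dropping the ones that fail, assuming a 
handler configured to skip bad records.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class RestoreSketch {

    /** Hypothetical raw changelog record: a key plus the value bytes as read from the topic. */
    public static final class RawRecord {
        final String key;
        final byte[] value;

        public RawRecord(String key, byte[] value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Stand-in deserializer: throws NumberFormatException unless the bytes spell an integer. */
    public static Integer deserialize(byte[] value) {
        return Integer.valueOf(new String(value, StandardCharsets.UTF_8));
    }

    /** Restore by copying plain bytes: corrupted records land in the store unchecked. */
    public static Map<String, byte[]> restoreRaw(List<RawRecord> changelog) {
        Map<String, byte[]> store = new LinkedHashMap<>();
        for (RawRecord r : changelog) {
            store.put(r.key, r.value);
        }
        return store;
    }

    /**
     * Restore with a deserialization check: each value is deserialized once and
     * dropped if that fails. The store still holds the original bytes, so nothing
     * has to be re-serialized.
     */
    public static Map<String, byte[]> restoreChecked(List<RawRecord> changelog,
                                                     Function<byte[], ?> deserializer) {
        Map<String, byte[]> store = new LinkedHashMap<>();
        for (RawRecord r : changelog) {
            try {
                deserializer.apply(r.value);
                store.put(r.key, r.value);
            } catch (RuntimeException e) {
                // Assumed "skip and continue" behaviour: the bad record never reaches the store.
            }
        }
        return store;
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<RawRecord> changelog = List.of(
                new RawRecord("a", bytes("1")),
                new RawRecord("b", bytes("oops")), // corrupted value
                new RawRecord("c", bytes("3")));

        System.out.println("raw restore keys:     " + restoreRaw(changelog).keySet());
        System.out.println("checked restore keys: "
                + restoreChecked(changelog, RestoreSketch::deserialize).keySet());
    }
}
```

With the raw copy, a later lookup of "b" would hit the deserialization error 
described in the ticket; with the checked restore the record is simply absent, 
as it would have been during normal processing.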

Also, we have a number of improvements to restoration, some already 
implemented and some soon to be, that make restoration performance less of a 
pain point. For one thing, with KIP-441 most of the restoration will occur in 
the background anyway, as long as there is at least one caught-up client. 
Moving restoration to a separate thread will (hopefully) speed up restoration, 
but more importantly it means that the main thread can continue to process 
other active tasks rather than being completely blocked on recovery. On top of 
that, there are all the RocksDB optimizations being considered.


was (Author: ableegoldman):
Ah, thanks for bringing us back to the question of double-topic vs 
restore-time. I don't think I touched on this earlier and may have taken my 
thoughts on this question for granted without explaining them. If we can agree 
that the asymmetric/side effect serdes are not a problem here (and that is a 
big "if") then in the case that we may have corrupt data (non-default DEH) I 
think we should just deserialize during restoration instead of adding the 
changelog. Since we only have to deserialize and not serialize, the performance 
hit might not be as bad. Also we have a number of improvements to restoration 
implementation and soon-to-be implemented that make restoration performance 
less of a pain point. For one thing, with KIP-441 most of the restoration will 
occur in the background anyway as long as there is one caught up client. Moving 
restoration to a separate thread will speed up restoration (hopefully) but more 
importantly it means that the main thread can continue to process other active 
tasks rather than being completely blocked on recovery. Plus all the rocksdb 
optimizations being considered

> KTable restore may load bad data
> --------------------------------
>
>                 Key: KAFKA-8037
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8037
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Minor
>              Labels: pull-request-available
>
> If an input topic contains bad data, users can specify a 
> `deserialization.exception.handler` to drop corrupted records on read. 
> However, this mechanism may be by-passed on restore. Assume a 
> `builder.table()` call reads and drops a corrupted record. If the table state 
> is lost and restored from the changelog topic, the corrupted record may be 
> copied into the store, because on restore plain bytes are copied.
> If the KTable is used in a join, an internal `store.get()` call to lookup the 
> record would fail with a deserialization exception if the value part cannot 
> be deserialized.
> GlobalKTables are affected, too (cf. KAFKA-7663 that may allow a fix for 
> GlobalKTable case). It's unclear to me atm, how this issue could be addressed 
> for KTables though.
> Note, that user state stores are not affected, because they always have a 
> dedicated changelog topic (and don't reuse an input topic) and thus the 
> corrupted record would not be written into the changelog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
