[ https://issues.apache.org/jira/browse/KAFKA-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163142#comment-17163142 ]

Sophie Blee-Goldman commented on KAFKA-8037:
--------------------------------------------

Ah, thanks for bringing us back to the question of double-topic vs 
restore-time. I don't think I touched on this earlier and may have taken my 
thoughts on this question for granted without explaining them. If we can agree 
that the asymmetric/side-effect serdes are not a problem here (and that is a 
big "if"), then in the case that we may have corrupt data (non-default DEH) I 
think we should just deserialize during restoration instead of adding the 
changelog. Since we only have to deserialize and not serialize, the performance 
hit might not be as bad. Also, we have a number of improvements to restoration, 
both already implemented and soon to be implemented, that make restoration 
performance less of a pain point. For one thing, with KIP-441 most of the 
restoration will occur in the background anyway as long as there is one 
caught-up client. Moving restoration to a separate thread will (hopefully) 
speed it up, but more importantly it means that the main thread can continue to 
process other active tasks rather than being completely blocked on recovery. 
Plus there are all the RocksDB optimizations being considered.
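The restore-time deserialization proposed above can be sketched as follows. This is a hypothetical stand-in in plain Python, not the Kafka Streams API; `restore_with_validation` and its arguments are illustrative names. The key point is that only a deserialize is needed to validate each record, while the store still receives the original raw bytes:

```python
import json

def restore_with_validation(changelog_records, deserialize, store):
    """Copy only records whose value deserializes; skip corrupt ones."""
    skipped = 0
    for key, raw_value in changelog_records:
        try:
            deserialize(raw_value)    # validate only; the store keeps raw bytes
        except ValueError:
            skipped += 1              # analogous to "log and continue" in the DEH
            continue
        store[key] = raw_value        # no re-serialization needed
    return skipped

# Demo: one valid JSON value, one corrupt value.
store = {}
skipped = restore_with_validation(
    [(b"k1", b'{"v": 1}'), (b"k2", b"not-json")], json.loads, store)
```

Because serialization is never invoked, the asymmetric-serde concern only applies on the deserialize side, which is why the performance cost should be smaller than a full round-trip through the serdes.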

> KTable restore may load bad data
> --------------------------------
>
>                 Key: KAFKA-8037
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8037
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Minor
>              Labels: pull-request-available
>
> If an input topic contains bad data, users can specify a 
> `deserialization.exception.handler` to drop corrupted records on read. 
> However, this mechanism may be bypassed on restore. Assume a 
> `builder.table()` call reads and drops a corrupted record. If the table state 
> is lost and restored from the changelog topic, the corrupted record may be 
> copied into the store, because on restore plain bytes are copied.
> If the KTable is used in a join, an internal `store.get()` call to lookup the 
> record would fail with a deserialization exception if the value part cannot 
> be deserialized.
> GlobalKTables are affected, too (cf. KAFKA-7663, which may allow a fix for 
> the GlobalKTable case). It's unclear to me atm how this issue could be 
> addressed for KTables, though.
> Note, that user state stores are not affected, because they always have a 
> dedicated changelog topic (and don't reuse an input topic) and thus the 
> corrupted record would not be written into the changelog.
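The failure mode described in the issue can be illustrated with a small stand-in sketch (plain Python, not the Kafka Streams API; all names are illustrative). Restore copies raw bytes into the store without consulting the deserialization exception handler, so a corrupt value only surfaces later when a lookup tries to deserialize it:

```python
import json

store = {}
changelog = [(b"k1", b'{"v": 1}'), (b"k2", b"not-json")]

# Today's restore path: a plain byte-for-byte copy, so the corrupt record
# for k2 lands in the store and the exception handler never runs.
for key, raw in changelog:
    store[key] = raw

def lookup(key):
    # Stand-in for a join's internal store.get() plus value deserialization;
    # raises ValueError for the corrupt bytes under k2.
    return json.loads(store[key])

try:
    lookup(b"k2")
    hit_corrupt_record = False
except ValueError:
    hit_corrupt_record = True
```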



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
