[ https://issues.apache.org/jira/browse/KAFKA-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163142#comment-17163142 ]
Sophie Blee-Goldman edited comment on KAFKA-8037 at 7/22/20, 11:52 PM:
-----------------------------------------------------------------------

Ah, thanks for bringing us back to the question of double-topic vs restore-time [~guozhang]. I don't think I touched on this earlier, and I may have taken my thoughts on this question for granted without explaining them.

If we can agree that the asymmetric/side-effect serdes are not a problem here (and that is a big "if"), then in the case where we may have corrupt data (non-default DEH) I think we should just deserialize during restoration instead of adding the changelog. Since we only have to deserialize and not serialize, the performance hit might not be as bad.

Also, we have a number of improvements to restoration, both implemented and soon-to-be implemented, that make restoration performance less of a pain point. For one thing, with KIP-441 most of the restoration will occur in the background anyway, as long as there is one caught-up client. Moving restoration to a separate thread will (hopefully) speed up restoration, but more importantly it means that the main thread can continue to process other active tasks rather than being completely blocked on recovery. Plus all the RocksDB optimizations being considered

> KTable restore may load bad data
> --------------------------------
>
>                 Key: KAFKA-8037
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8037
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Minor
>              Labels: pull-request-available
>
> If an input topic contains bad data, users can specify a
> `deserialization.exception.handler` to drop corrupted records on read.
> However, this mechanism may be bypassed on restore. Assume a
> `builder.table()` call reads and drops a corrupted record. If the table state
> is lost and restored from the changelog topic, the corrupted record may be
> copied into the store, because on restore plain bytes are copied.
> If the KTable is used in a join, an internal `store.get()` call to look up the
> record would fail with a deserialization exception if the value part cannot
> be deserialized.
> GlobalKTables are affected, too (cf. KAFKA-7663 that may allow a fix for
> GlobalKTable case). It's unclear to me atm how this issue could be addressed
> for KTables though.
> Note that user state stores are not affected, because they always have a
> dedicated changelog topic (and don't reuse an input topic) and thus the
> corrupted record would not be written into the changelog.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
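
To make the difference between the two restore behaviours discussed in this thread concrete, here is a minimal, self-contained Java sketch. It is not Kafka Streams code: `RestoreSketch`, `RawRecord`, `ValueDeserializer`, `restoreRawBytes` and `restoreWithCheck` are hypothetical names invented for illustration, the "store" is just a map, and tombstones (null values) are not modeled. It only assumes what the issue description states: restore copies plain bytes, while a restore-time check would deserialize each value (without re-serializing it) and skip records that fail.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of restoring a table store from a topic.
// None of these types are Kafka Streams APIs.
public class RestoreSketch {

    /** Stand-in for a value deserializer; throws on bytes it cannot decode. */
    public interface ValueDeserializer<V> {
        V deserialize(byte[] bytes);
    }

    /** A raw record as it sits in the topic: key plus undecoded value bytes. */
    public static final class RawRecord {
        final String key;
        final byte[] value;

        public RawRecord(String key, byte[] value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Decodes a 4-byte big-endian int; anything else counts as corrupt. */
    public static final ValueDeserializer<Integer> INT_DESERIALIZER = bytes -> {
        if (bytes == null || bytes.length != 4) {
            throw new IllegalArgumentException("corrupt value");
        }
        return ByteBuffer.wrap(bytes).getInt();
    };

    /** Restore as described in the issue: plain bytes are copied, nothing is checked. */
    public static Map<String, byte[]> restoreRawBytes(List<RawRecord> topic) {
        Map<String, byte[]> store = new LinkedHashMap<>();
        for (RawRecord r : topic) {
            store.put(r.key, r.value);
        }
        return store;
    }

    /** Restore-time alternative: deserialize each value and drop records that fail. */
    public static <V> Map<String, byte[]> restoreWithCheck(
            List<RawRecord> topic, ValueDeserializer<V> deserializer) {
        Map<String, byte[]> store = new LinkedHashMap<>();
        for (RawRecord r : topic) {
            try {
                // The decoded value is discarded: we only deserialize, never re-serialize.
                deserializer.deserialize(r.value);
                store.put(r.key, r.value);
            } catch (RuntimeException e) {
                // Mimics a "skip and continue" handler: the corrupted record is dropped.
            }
        }
        return store;
    }

    public static void main(String[] args) {
        List<RawRecord> topic = new ArrayList<>();
        topic.add(new RawRecord("a", ByteBuffer.allocate(4).putInt(1).array()));
        topic.add(new RawRecord("b", new byte[] {0x7f})); // corrupted value

        System.out.println(restoreRawBytes(topic).keySet());                    // [a, b]
        System.out.println(restoreWithCheck(topic, INT_DESERIALIZER).keySet()); // [a]
    }
}
```

With the raw-bytes restore, the corrupted record for key `b` ends up in the store and would only surface later, on a lookup that has to deserialize it. With the checked restore it is filtered out at the cost of one extra deserialization per record.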