[ https://issues.apache.org/jira/browse/KAFKA-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163134#comment-17163134 ]

Guozhang Wang commented on KAFKA-8037:
--------------------------------------

Great comments from all of you, thank you so much!

I'd like to see if I can grasp the core of this conversation so far: it seems we 
all agree that Streams has to ask for the users' input rather than making "some 
decision" itself, since it cannot do so without incurring edge cases, but we do 
not yet agree on what that decision should be. So far I've seen two options:

1) Per source table, we ask the user whether we can optimize away the changelog 
topic and just reuse the source topic; if "yes", we treat the source topic as a 
normal changelog topic during restoration, i.e. no serdes, no exception handler, 
etc. (see the sketch below).
2) Per source table, we always optimize away the changelog and piggy-back on the 
source topic, but we ask the user whether the restoration process should be 
handled just like normal processing, i.e. "deserialize, handle exceptions, 
serialize", or not.
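
To make option 1) more concrete, here is a minimal Java sketch. The topic name, 
serdes, and in particular the per-table "withReuseSourceTopicAsChangelog" flag 
are illustrative assumptions only; today, reusing the source topic is an 
all-or-nothing topology optimization enabled via config, not a per-table API.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class SourceTableOptimizationSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-restore-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Existing behavior: source-topic reuse is enabled globally via the
            // topology optimization config (constant names vary slightly by version);
            // it only takes effect when the topology is built with builder.build(props).
            props.put("topology.optimization", StreamsConfig.OPTIMIZE);

            StreamsBuilder builder = new StreamsBuilder();
            // Option 1) would instead ask this question per source table, e.g. via a
            // hypothetical Materialized flag (NOT an existing API):
            //     .withReuseSourceTopicAsChangelog(true)
            KTable<String, Long> table = builder.table(
                    "input-topic",
                    Consumed.with(Serdes.String(), Serdes.Long()),
                    Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("table-store"));
        }
    }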

Note that I've excluded global state stores from this discussion so far, since 
for that case I now think we should always piggy-back on the source topic and 
always treat restoration just like normal processing.
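
For reference, global stores already have no dedicated changelog topic and are 
always (re)built directly from their source topic. Continuing the builder from 
the sketch above (topic name and serdes are again assumptions):

            // Requires org.apache.kafka.streams.kstream.GlobalKTable in the imports.
            // Global stores are always restored from the source topic itself; there
            // is no separate changelog to fall back on.
            GlobalKTable<String, Long> globalTable = builder.globalTable(
                    "reference-topic",
                    Consumed.with(Serdes.String(), Serdes.Long()),
                    Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("global-store"));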

Aside from the user's perspective, the difference between 1) and 2) raises a 
performance / footprint question: which is worse, keeping an extra duplicated 
topic (on which you never do serde), or a longer restoration latency? There are 
of course other considerations, e.g. if we keep a separate topic then we do not 
rely on the source topic's retention settings (illustrated below), but I'd like 
to ask this trade-off question first because I cannot say that I have a clear 
winner in my mind :) For now I'm leaning a bit towards having an extra topic for 
better restoration latency, since in the long term storing more topics in Kafka 
may not be a huge deal (though I have to admit that the topic count is still a 
top argument against Kafka Streams today, as [~ableegoldman] said). I very much 
welcome any counter arguments here.
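
To illustrate the retention point: with a dedicated changelog topic, the 
application controls that topic's cleanup policy independently of the source 
topic, e.g. via Materialized#withLoggingEnabled. A minimal sketch continuing the 
code above (the dirty-ratio value is an arbitrary example; additional imports: 
java.util.HashMap, java.util.Map, org.apache.kafka.common.config.TopicConfig):

            Map<String, String> changelogConfig = new HashMap<>();
            changelogConfig.put(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT);
            changelogConfig.put(TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG, "0.1");

            // With a dedicated, compacted changelog the store's recoverability does
            // not depend on the source topic's retention settings; when piggy-backing
            // on the source topic, it does.
            KTable<String, Long> tableWithChangelog = builder.table(
                    "input-topic",
                    Consumed.with(Serdes.String(), Serdes.Long()),
                    Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("compacted-store")
                            .withLoggingEnabled(changelogConfig));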

> KTable restore may load bad data
> --------------------------------
>
>                 Key: KAFKA-8037
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8037
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Minor
>              Labels: pull-request-available
>
> If an input topic contains bad data, users can specify a 
> `deserialization.exception.handler` to drop corrupted records on read. 
> However, this mechanism may be bypassed on restore. Assume a `builder.table()` 
> call reads and drops a corrupted record. If the table state is lost and 
> restored from the changelog topic, the corrupted record may be copied into the 
> store, because on restore plain bytes are copied.
>
> If the KTable is used in a join, an internal `store.get()` call to look up the 
> record would fail with a deserialization exception if the value part cannot be 
> deserialized.
>
> GlobalKTables are affected, too (cf. KAFKA-7663, which may allow a fix for the 
> GlobalKTable case). It's unclear to me at the moment how this issue could be 
> addressed for KTables, though.
>
> Note that user state stores are not affected, because they always have a 
> dedicated changelog topic (and don't reuse an input topic) and thus the 
> corrupted record would not be written into the changelog.
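
For context, a minimal sketch of the mechanism described in the report above 
(topic name and serdes are assumptions): the handler below drops corrupted 
records on the normal read path, but it is not consulted during restoration, 
where plain bytes are copied into the store.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class DeserializationHandlerBypassSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "handler-bypass-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // On the normal read path this handler skips records that cannot be
            // deserialized instead of failing the application.
            props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
                    LogAndContinueExceptionHandler.class);

            StreamsBuilder builder = new StreamsBuilder();
            // If this source table's state is rebuilt from the reused source topic,
            // restoration copies plain bytes and the handler above is never invoked,
            // so a corrupted record can end up in the store.
            KTable<String, Long> table = builder.table(
                    "input-topic",
                    Consumed.with(Serdes.String(), Serdes.Long()),
                    Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("table-store"));
        }
    }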



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
