Thanks for your work on this KIP, Eno -- much appreciated!

- I think it would help to improve the KIP by adding an end-to-end code example that demonstrates, with the DSL and with the Processor API, how the user would write a simple application that would then be augmented with the proposed KIP changes to handle exceptions. It should also become much clearer then that, for example, the KIP would lead to different code paths for the happy case and any failure scenarios.
- Do we have sufficient information available to make informed decisions on what to do next? For example, do we know in which part of the topology the record failed? `ConsumerRecord` gives us access to topic, partition, offset, timestamp, etc., but what about topology-related information (e.g. what is the associated state store, if any)?

- Only partly on-topic for the scope of this KIP, but this is about the bigger picture: this KIP would give users the option to send corrupted records to a dead letter queue (quarantine topic). But what pattern would we advocate to process such a dead letter queue then, e.g. how to allow for retries with backoff ("if the first record in the dead letter queue fails again, then try the second record for the time being and go back to the first record at a later time")? Jay and Jan already alluded to ordering problems that will be caused by dead letter queues. As I said, retries might be out of scope, but perhaps the implications should be considered if possible?

Also, I wrote the text below before reaching the point in the conversation where this KIP's scope will be limited to exceptions in the category of poison pills / deserialization errors. But since Jay brought up user code errors again, I decided to include it again.

----------------------------snip----------------------------

A meta comment: I am not sure about this split between the code for the happy path (e.g. map/filter/... in the DSL) and the failure path (using exception handlers).
In Scala, for example, we can do:

    scala> val computation = scala.util.Try(1 / 0)
    computation: scala.util.Try[Int] = Failure(java.lang.ArithmeticException: / by zero)

    scala> computation.getOrElse(42)
    res2: Int = 42

Another example with Scala's pattern matching, which is similar to `KStream#branch()`:

    computation match {
      case scala.util.Success(x) => x * 5
      case scala.util.Failure(_) => 42
    }

(The above isn't the most idiomatic way to handle this in Scala, but that's not the point I'm trying to make here.)

Hence the question I'm raising here is: do we want to have an API where you code "the happy path", and then have a different code path for failures (using exceptions and handlers); or should we treat both Success and Failure in the same way?

I think the failure/exception handling approach (as proposed in this KIP) is well-suited for errors in the category of deserialization problems aka poison pills, partly because the (default) serdes are defined through configuration (explicit serdes, however, are defined through API calls). However, I'm not yet convinced that the failure/exception handling approach is the best idea for user code exceptions, e.g. if you fail to guard against NPE in your lambdas or divide a number by zero.

    scala> val stream = Seq(1, 2, 3, 4, 5)
    stream: Seq[Int] = List(1, 2, 3, 4, 5)

    // Here: fall back to a sane default when encountering failed records
    scala> stream.map(x => Try(1 / (3 - x))).flatMap(t => Seq(t.getOrElse(42)))
    res19: Seq[Int] = List(0, 1, 42, -1, 0)

    // Here: skip over failed records
    scala> stream.map(x => Try(1 / (3 - x))).collect { case Success(s) => s }
    res20: Seq[Int] = List(0, 1, -1, 0)

The above is more natural to me than using error handlers to define how to deal with failed records (here, the value `3` causes an arithmetic exception). Again, it might help the KIP if we added an end-to-end example for such user code errors.
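To make the "treat Success and Failure in the same pipeline" idea a bit more concrete, here is a minimal runnable sketch in plain Scala (not the Streams DSL; the record values and names are made up for illustration). Deserialization failures become ordinary values that are routed to an in-memory stand-in for a dead letter queue, while the happy path continues in the same pipeline:

```scala
import scala.util.{Try, Success}

// Hypothetical raw records; "oops" plays the role of a poison pill
// that fails deserialization.
val rawRecords: Seq[String] = Seq("1", "2", "oops", "4")

// Deserialization as an ordinary value -- Success(n) or Failure(e) --
// instead of an exception escaping into a separate handler code path.
// We keep the raw record alongside so failures can be quarantined.
val parsed: Seq[(Try[Int], String)] =
  rawRecords.map(r => (Try(r.toInt), r))

// Happy path and failure path are handled by the same pipeline:
val (ok, failed) = parsed.partition { case (t, _) => t.isSuccess }

val processed: Seq[Int] = ok.collect { case (Success(n), _) => n * 10 }
val deadLetterQueue: Seq[String] = failed.map { case (_, raw) => raw }

println(s"processed=$processed dlq=$deadLetterQueue")
// processed=List(10, 20, 40) dlq=List(oops)
```

Note that the ordering problem Jan mentions is visible even in this toy version: once "oops" is set aside, any later reprocessing of it happens out of order relative to "4".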
----------------------------snip----------------------------

On Tue, May 30, 2017 at 9:24 AM, Jan Filipiak <jan.filip...@trivago.com> wrote:

> Hi Jay,
>
> Eno mentioned that he will narrow down the scope to only ConsumerRecord
> deserialisation.
>
> I am working with database changelogs only. I would really not like to see
> a dead letter queue or something similar. How am I expected to get these
> back in order? Just grind to a halt and call me on the weekend -- I'll fix
> it then in a few minutes rather than spend 2 weeks ordering dead letters
> (where reprocessing might be even the faster fix).
>
> Best, Jan
>
>
> On 29.05.2017 20:23, Jay Kreps wrote:
>
>> - I think we should hold off on retries unless we have worked out the
>> full usage pattern; people can always implement their own. I think the
>> idea is that you send the message to some kind of dead letter queue and
>> then replay these later. This obviously destroys all semantic guarantees
>> we are working hard to provide right now, which may be okay.