We are trying to understand the order of commits when processing each
message in a Samza job.

T1: input offset commit
T2: changelog commit
T3: output commit

By looking at the code snippet in
https://github.com/apache/samza/blob/master/samza-core/src/main/scala/org/apache/samza/container/TaskInstance.scala#L155-L171,
my understanding is that for each input message, Samza always send update
message on changelog, send the output message and then commit the input
offset. It makes sense to me at the high level in terms of at least once
processing.

Specifically, we have two dumb questions:

1. When implementing our Samza task, does each call of process method
triggers a call to TaskInstance.commit?
2. Is there a way to buffer these commit activities in memory and flush
periodically? Our job is joining >1mm messages per second using a KV store
and we have a lot of concern for the changelog size, as in the worst case,
the change log will grow as fast as the input log.

Chen

-- 
Chen Song

Reply via email to