I want to achieve exactly-once semantics in my job and want to make sure I 
understand the current behaviour correctly:

1. Only one Kafka transaction at a time (no concurrent checkpoints)
2. Only one transaction per checkpoint


My job has a very large amount of state (>100GB), so I have no option but to 
use unaligned checkpoints. With the above limitations, it seems to me that if 
the checkpoint interval is 1 minute and a checkpoint takes about 10 seconds to 
complete, then only one Kafka transaction can happen in 70 seconds. None of the 
output records will be visible until the transaction completes. This way, a 
steady stream of inputs results in a buffered output stream where data only 
becomes visible after a minute, which defeats any real-time streaming use 
case. Reducing the checkpoint interval is not really an option given the size 
of the checkpoint. The only way out would be to allow multiple transactions 
per checkpoint.
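
To make the arithmetic above concrete, here is a rough sketch in Python. It 
is only my mental model, not Flink code: I am assuming one transaction per 
checkpoint, that a transaction opens when a checkpoint is triggered and is 
committed only when the following checkpoint completes, and that the interval 
is measured from trigger to trigger. The function name is made up for 
illustration.

```python
def visibility_delay_seconds(checkpoint_interval_s, checkpoint_duration_s):
    """Return (best_case, worst_case) delay before an output record is visible.

    Assumed model (not taken from Flink source):
    - a record written just before a checkpoint is triggered waits only for
      that checkpoint to complete;
    - a record written just after a checkpoint is triggered waits a full
      interval for the next trigger, plus that checkpoint's duration.
    """
    best_case = checkpoint_duration_s
    worst_case = checkpoint_interval_s + checkpoint_duration_s
    return best_case, worst_case


# Numbers from my job: 1 minute interval, roughly 10 seconds per checkpoint.
best, worst = visibility_delay_seconds(60, 10)
print(f"best case: {best}s, worst case: {worst}s")
```

If this model is right, shortening the checkpoint duration barely helps, 
because the interval dominates the worst case.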

Thanks,
Vishal
