Hi Samza Devs,

The most significant concern I have run into recently is container leaks. The Samza-based data pipeline can guarantee at-least-once delivery, but the duplicate rate is over 1.0% and I am getting alerts right now. Container leaks push a lot of alerts to me.
So we need to find out whether running Samza on Mesos would avoid that problem, or whether Spark Streaming would be free of it. In the worst case, building our own distributed coordination might be more predictable than running YARN on EMR. What about standalone Samza? If that is plausible and the best solution in the near future, I would like to contribute. Could you share your thoughts or plans?

I would really appreciate some guidelines on implementing a custom cluster management interface for Samza. If possible, I would like to look into replacing the YARN support with EC2 ASG.

Thank you.

Best,
Jae
