We use EMR rather than ECS, but if EMR is an option for your team, you can configure auto scaling rules in your CloudFormation template so that your task/job load dynamically controls cluster sizing.
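
A rough sketch of what such rules could look like, built as a plain Python dict in the shape of an EMR automatic scaling policy (the kind of structure that goes under an instance group's AutoScalingPolicy in a CloudFormation template). Everything specific here is an assumption for illustration: the helper name build_scaling_policy, the capacity limits, the thresholds, the cooldown, and the choice of YARNMemoryAvailablePercentage as the driving metric. Check the field names against the current AWS documentation before relying on them.

```python
import json


def build_scaling_policy(min_nodes, max_nodes):
    """Return a scale-out/scale-in policy as a dict (illustrative values only)."""

    def rule(name, operator, threshold, adjustment):
        # One rule = one CloudWatch alarm trigger plus one capacity change.
        return {
            "Name": name,
            "Action": {
                "SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": adjustment,  # nodes to add or remove
                    "CoolDown": 300,  # seconds to wait between scaling actions
                }
            },
            "Trigger": {
                "CloudWatchAlarmDefinition": {
                    "ComparisonOperator": operator,
                    "EvaluationPeriods": 1,
                    "MetricName": "YARNMemoryAvailablePercentage",
                    "Namespace": "AWS/ElasticMapReduce",
                    "Period": 300,
                    "Statistic": "AVERAGE",
                    "Threshold": threshold,
                    "Unit": "PERCENT",
                }
            },
        }

    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            # Little free YARN memory: the cluster is busy, add a node.
            rule("ScaleOut", "LESS_THAN", 15.0, 1),
            # Plenty of free YARN memory: the cluster is idle, remove a node.
            rule("ScaleIn", "GREATER_THAN", 75.0, -1),
        ],
    }


policy = build_scaling_policy(2, 10)
print(json.dumps(policy, indent=2))
```

The printed JSON is only a starting point; the thresholds and the metric would need tuning against how the job actually consumes memory.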

Sent from my iPhone

> On Nov 8, 2019, at 1:40 AM, Navneeth Krishnan <reachnavnee...@gmail.com> wrote:
>
> Hello All,
>
> I have a streaming job running in production which processes over 2 billion events per day and does some heavy processing on each event. We have been facing some challenges in managing Flink in production, such as scaling in and out and restarting the job from a savepoint. Flink provides a lot of features, which made it seem an obvious choice at the time, but with all the operational overhead we are now wondering whether we should still use Flink for our stream processing requirements or choose Kafka Streams.
>
> We currently deploy Flink on ECR. Bringing up a new cluster for another stream job is too expensive, but on the flip side running it on the same cluster becomes difficult, since there is no way to say that one job has to run on a dedicated server while another can run on a shared instance. Also, taking a savepoint, cancelling, and submitting a new job results in some downtime. The most critical part is that there is no state shared among all tasks, a sort of global state. We more or less achieve this today using an external Redis cache, but that incurs cost as well.
>
> If we move to Kafka Streams, it makes our deployment life much easier: each new stream job will be a microservice that can scale independently. With global state it's much easier to share state without using an external cache. The disadvantage is that we have to rely on partitions for parallelism. Although this might initially sound easier, it will become a bottleneck when we need to scale much higher.
>
> Do you guys have any suggestions on this? We need to decide which way to move forward, and any suggestions would be of great help.
>
> Thanks