Thank you very much for the new release that makes auto-scaling possible. I'm
currently running multiple flink jobs and I've hand tuned the parallelism of
each of the operators to achieve the best throughput. I would much rather
use the auto-scaling capabilities of flink than have to hand tune my jobs
but it seems there are a few gaps:

1. Setting max parallelism seems to be the only user controlled knob at the
moment. As flink tries to achieve operator chaining by launching same number
of sub-tasks for each operator, I'm afraid the current auto-scaling will be
very inefficient. At a minimum, we need to support user provided ratios that
will be used to distribute sub-tasks among operators. E.g. O1:O2 = 4:1 will
mean that 4 sub-tasks of O1 should be started for each sub-task of O2. 

2. Allow for external system to set parallelism of operators. Perhaps job
manager's rest api can be extended to support such scaling requests

3. The doc says that local recovery doesn't work. This makes sense when a
restart is due to a scaling action but I couldn't quite understand why that
needs to be the case when a task manager is recovering from a crash

4. Is there any metric that allows us to distinguish between restart due to
scaling as opposed to restart due to some other reason? Based on the section
on limitations, there isn't but it would be good to add this as people will
eventually want to monitor and alert based on restarts due to failures
alone.

5. Suppose the number of containers are fixed and the job is running. Will
flink internally rebalance by adding sub-tasks of one operator and removing
sub-tasks of another? This could be driven by back-pressure for instance.
The doc doesn't mention this so I'm assuming that current scaling is
designed to maximize operator chaining. However, it does make sense to
incorporate back-pressure to rebalance. Should this be how future versions
of auto-scaling will work then we'll need to have some toggles to avoid
restart loops.

6. How is the implementation different from taking a savepoint and manually
rescaling? Are there any operator specific gotchas that we should watch out
for? For instance, we use AsyncIO operator and wanted to know how inflight
requests to a database would be handled when it's parallelism changes.

Once again, thank you for your continued support!



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Reply via email to