Re: Flink AutoScaling EMR

Rex Fenley Wed, 11 Nov 2020 11:57:09 -0800

Another thought, would it be possible to
* Spin up new core or task nodes.
* Run a new copy of the same job on these new nodes from a savepoint.
* Have the new job *not* write to the sink until the other job is torn down?


This would allow us to be eventually consistent and maintain writes going
through without downtime. As long as whatever is buffering the sink doesn't
run out of space it should work just fine.

We're hoping to achieve consistency in less than 30s ideally.

Again though, if we could get savepoints to restore in less than 30s that
would likely be sufficient for our purposes.

Thanks!

On Wed, Nov 11, 2020 at 11:42 AM Rex Fenley <r...@remind101.com> wrote:

> Hello,
>
> I'm trying to find a solution for auto scaling our Flink EMR cluster with
> 0 downtime using RocksDB as state storage and S3 backing store.
>
> My current thoughts are like so:
> * Scaling an Operator dynamically would require all keyed state to be
> available to the set of subtasks for that operator, therefore a set of
> subtasks must be reading to and writing from the same RocksDB. I.e. to
> scale in and out subtasks in that set, they need to read from the same
> Rocks.
> * Since subtasks can run on different core nodes, is it possible to have
> different core nodes read/write to the same RocksDB?
> * When's the safe point to scale in and out an operator? Only right after
> a checkpoint possibly?
>
> If the above is not possible then we'll have to use save points which
> means some downtime, therefore:
> * Scaling out during high traffic is arguably more important to react
> quickly to than scaling in during low traffic. Is it possible to add more
> core nodes to EMR without disturbing a job? If so then maybe we can
> orchestrate running a new job on new nodes and then loading a savepoint
> from a currently running job.
>
> Lastly
> * Save Points for ~70Gib of data take on the order of minutes to tens of
> minutes for us to restore from, is there any way to speed up restoration?
>
> Thanks!
>
> --
>
> Rex Fenley  |  Software Engineer - Mobile and Backend
>
>
> Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>  |
>  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
> <https://www.facebook.com/remindhq>
>


-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
 |  FOLLOW
US <https://twitter.com/remindhq>  |  LIKE US
<https://www.facebook.com/remindhq>

Re: Flink AutoScaling EMR

Reply via email to