Thanks for all the input!

On Sun, Nov 15, 2020 at 6:59 PM Xintong Song <tonysong...@gmail.com> wrote:

> Is there a way to run the new yarn job only on the new hardware?
>
> I think you can simply decommission the nodes from Yarn, so that new
> containers will not be allocated from those nodes. You might also need a
> large decommission timeout, upon which all the remaining running
> containers on the decommissioning nodes will be killed.
>
> Thank you~
>
> Xintong Song
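That sounds like the right first step for us. For posterity, I think the
decommissioning would look roughly like the following on our cluster
(untested; the hostnames are placeholders, the exclude file is whatever
yarn.resourcemanager.nodes.exclude-path points at in yarn-site.xml, and the
"-g" flag assumes a Hadoop version that supports graceful decommission):

    # Add the old nodes to the ResourceManager's exclude file
    echo "old-node-1.internal" >> /etc/hadoop/conf/yarn.exclude
    echo "old-node-2.internal" >> /etc/hadoop/conf/yarn.exclude

    # Gracefully decommission with a generous timeout (in seconds), so
    # containers keep running until the job has been moved off the nodes
    yarn rmadmin -refreshNodes -g 3600 -client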
> On Fri, Nov 13, 2020 at 2:57 PM Robert Metzger <rmetz...@apache.org> wrote:
>
>> Hi,
>> it seems that YARN has a feature for targeting specific hardware:
>> https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html
>> In any case, you'll need enough spare resources for some time to be able
>> to run your job twice for this kind of "zero downtime handover".
>>
>> On Thu, Nov 12, 2020 at 6:10 PM Rex Fenley <r...@remind101.com> wrote:
>>
>>> Awesome, thanks! Is there a way to run the new yarn job only on the new
>>> hardware? Or would the two jobs have to run on intersecting hardware and
>>> then be switched on/off, which means we'll need a buffer of resources
>>> for our orchestration?
>>>
>>> Also, good point on recovery. I'll spend some time looking into this.
>>>
>>> Thanks
>>>
>>> On Wed, Nov 11, 2020 at 11:53 PM Robert Metzger <rmetz...@apache.org> wrote:
>>>
>>>> Hey Rex,
>>>>
>>>> the second approach (spinning up a standby job and then doing a
>>>> handover) sounds more promising to implement without rewriting half of
>>>> the Flink codebase ;)
>>>> What you need is a tool that orchestrates creating a savepoint, starting
>>>> a second job from the savepoint, and then communicating with a custom
>>>> sink implementation that can be switched on/off in the two jobs. With
>>>> that approach, you should have almost no downtime, just increased
>>>> resource requirements during such a handover.
>>>>
>>>> What you also need to consider is that this handover process only works
>>>> for scheduled maintenance. If you have a system failure, you'll have
>>>> downtime until the last checkpoint is restored.
>>>> If you are trying to reduce the potential downtime overall, I would
>>>> rather recommend optimizing the checkpoint restore time, as this covers
>>>> both scheduled maintenance and system failures.
>>>>
>>>> Best,
>>>> Robert
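The switchable sink is the piece I'll prototype first. A minimal sketch of
what I have in mind is below; GatedSink and the marker-file gate are my own
placeholders rather than anything Flink ships (any shared flag store such as
ZooKeeper or a database row would work), and it assumes our sink is
idempotent/upserting so that re-emitted records are harmless:

    import java.io.File;

    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;

    /** Wraps a real sink and only forwards records while a gate is open. */
    public class GatedSink<T> extends RichSinkFunction<T> {

        private final SinkFunction<T> delegate;
        private final String gateMarkerPath; // hypothetical "writes enabled" flag

        public GatedSink(SinkFunction<T> delegate, String gateMarkerPath) {
            this.delegate = delegate;
            this.gateMarkerPath = gateMarkerPath;
        }

        @Override
        public void invoke(T value, Context context) throws Exception {
            // The standby job runs with its marker absent; the orchestrator
            // removes the old job's marker and creates the new job's marker
            // to hand over. A real version would cache this check instead of
            // hitting the filesystem per record, and would forward open()/
            // close() if the delegate is a rich function; both are elided
            // here for brevity.
            if (new File(gateMarkerPath).exists()) {
                delegate.invoke(value, context);
            }
        }
    }

Both jobs would then attach the sink with something like
stream.addSink(new GatedSink<>(realSink, "/mnt/flags/job-a-writes")), and the
orchestration tool flips the marker files at handover time.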
>>>> On Wed, Nov 11, 2020 at 8:56 PM Rex Fenley <r...@remind101.com> wrote:
>>>>
>>>>> Another thought, would it be possible to
>>>>> * Spin up new core or task nodes.
>>>>> * Run a new copy of the same job on these new nodes from a savepoint.
>>>>> * Have the new job *not* write to the sink until the other job is torn
>>>>> down?
>>>>>
>>>>> This would allow us to be eventually consistent and maintain writes
>>>>> going through without downtime. As long as whatever is buffering the
>>>>> sink doesn't run out of space, it should work just fine.
>>>>>
>>>>> We're hoping to achieve consistency in less than 30s, ideally.
>>>>>
>>>>> Again though, if we could get savepoints to restore in less than 30s,
>>>>> that would likely be sufficient for our purposes.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Wed, Nov 11, 2020 at 11:42 AM Rex Fenley <r...@remind101.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm trying to find a solution for auto scaling our Flink EMR cluster
>>>>>> with 0 downtime, using RocksDB as state storage and S3 as the backing
>>>>>> store.
>>>>>>
>>>>>> My current thoughts are like so:
>>>>>> * Scaling an operator dynamically would require all keyed state to be
>>>>>> available to the set of subtasks for that operator, therefore the
>>>>>> subtasks in that set must be writing to and reading from the same
>>>>>> RocksDB. I.e., to scale that set of subtasks in and out, they all need
>>>>>> to read from the same RocksDB.
>>>>>> * Since subtasks can run on different core nodes, is it possible to
>>>>>> have different core nodes read/write to the same RocksDB?
>>>>>> * When is it safe to scale an operator in or out? Only right after a
>>>>>> checkpoint, possibly?
>>>>>>
>>>>>> If the above is not possible, then we'll have to use savepoints, which
>>>>>> means some downtime, therefore:
>>>>>> * Scaling out during high traffic is arguably more important to react
>>>>>> to quickly than scaling in during low traffic. Is it possible to add
>>>>>> more core nodes to EMR without disturbing a job? If so, then maybe we
>>>>>> can orchestrate running a new job on the new nodes and then loading a
>>>>>> savepoint taken from the currently running job.
>>>>>>
>>>>>> Lastly:
>>>>>> * Savepoints of ~70 GiB take on the order of minutes to tens of
>>>>>> minutes for us to restore from. Is there any way to speed up
>>>>>> restoration?
>>>>>>
>>>>>> Thanks!
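For that last restore-time question, these are the settings I'm planning to
experiment with first (the config keys are from the Flink docs; whether they
actually move the needle for our ~70 GiB of state is on us to measure):

    # flink-conf.yaml
    state.backend: rocksdb
    # Incremental checkpoints: only changed RocksDB SST files get uploaded,
    # which mainly helps checkpoint creation time and size
    state.backend.incremental: true
    # Task-local recovery: failover restores from a local copy of the state
    # instead of re-downloading everything from S3. Note this helps failure
    # recovery only, not restores from a savepoint or rescaling.
    state.backend.local-recovery: true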