Re: Flink cluster deployment strategy

2020-08-13 Thread sidhant gupta
Thanks, I will check it out.

On Thu, 13 Aug, 2020, 7:55 PM Arvid Heise wrote:

> Hi Sidhant,
>
> If you are starting fresh with Flink, I strongly recommend skipping ECS and
> EMR and going directly to a Kubernetes-based solution. Scaling is much
> easier on K8s, some kind of autoscaling is coming in the next release, and
> best of all, you even have the option to move to a different cloud provider
> if needed.
>
> The easiest option for you is to use EKS on AWS together with Ververica
> community edition [1] or with one of the many Kubernetes operators.
>
> [1] https://www.ververica.com/getting-started
>
> On Tue, Aug 11, 2020 at 3:23 PM Till Rohrmann 
> wrote:
>
>> Hi Sidhant,
>>
>> see the inline comments for answers
>>
>> On Tue, Aug 11, 2020 at 3:10 PM sidhant gupta 
>> wrote:
>>
>>> Hi Till,
>>>
>>> Thanks for your response.
>>> I have a few queries, though, as mentioned below:
>>> (1) Can Flink be used in a map-reduce fashion with the data streaming API?
>>>
>>
>> What do you understand as map-reduce fashion? You can use Flink's DataSet
>> API for processing batch workloads (consisting not only of map and reduce
>> operations but also other operations such as groupReduce, flatMap, etc.).
>> Flink's DataStream API can be used to process bounded and unbounded
>> streaming data.
>>
>>> (2) Does it make sense to use AWS EMR if we are not using Flink in a
>>> map-reduce fashion with the streaming API?
>>>
>>
>> I think I don't fully understand what you mean with map-reduce fashion.
>> Do you mean multiple stages of map and reduce operations?
>>
>>
>>> (3) Can flink cluster be auto scaled using EMR Managed Scaling when used
>>> with yarn as per this link
>>> https://aws.amazon.com/blogs/big-data/introducing-amazon-emr-managed-scaling-automatically-resize-clusters-to-lower-cost/
>>>  ?
>>>
>>
>> I am no expert on EMR managed scaling but I believe that it would need
>> some custom tooling to scale a Flink job down (by taking a savepoint and
>> resuming from it with a lower parallelism) before downsizing the EMR
>> cluster.
>>
>>
>>> (4) If we set an explicit max parallelism, set the current parallelism
>>> (which might be less than the max parallelism) equal to the maximum number
>>> of slots, and set the slots per task manager while starting the YARN
>>> session, would the parallelism then increase (up to the max parallelism)
>>> when auto scaling adds task managers, and would the load be distributed
>>> across the newly spun-up task managers? Refer:
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/production_ready.html#set-an-explicit-max-parallelism
>>>
>>>
>>
>> At the moment, Flink does not support this out of the box but the
>> community is working on this feature.
>>
>>>
>>> Regards
>>> Sidhant Gupta
>>>
>>> On Tue, 11 Aug, 2020, 5:19 PM Till Rohrmann, 
>>> wrote:
>>>
>>>> Hi Sidhant,
>>>>
>>>> I am not an expert on AWS services but I believe that EMR might be a
>>>> bit easier to start with since AWS EMR comes with Flink support out of
>>>> the box [1]. On ECS I believe that you would have to set up the
>>>> containers yourself. Another interesting deployment option could be to
>>>> use Flink's native Kubernetes integration [2] which would work on AWS EKS.
>>>>
>>>> [1]
>>>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/flink-create-cluster.html
>>>> [2]
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Tue, Aug 11, 2020 at 9:16 AM sidhant gupta wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm kind of new to Flink cluster deployment. I wanted to know which
>>>>> Flink cluster deployment and which job mode on AWS is better in terms
>>>>> of ease of deployment, maintenance, HA, cost, etc. As of now I am
>>>>> considering AWS EMR vs ECS (Docker containers). We have a use case of
>>>>> setting up a data streaming API which reads records from a Kafka
>>>>> topic, processes them, and then writes them to another Kafka topic.
>>>>> Please let me know your thoughts on this.
>>>>>
>>>>> Thanks
>>>>> Sidhant Gupta
>>>>>



Re: Flink cluster deployment strategy

2020-08-13 Thread Arvid Heise
Hi Sidhant,

If you are starting fresh with Flink, I strongly recommend skipping ECS and
EMR and going directly to a Kubernetes-based solution. Scaling is much easier
on K8s, some kind of autoscaling is coming in the next release, and best of
all, you even have the option to move to a different cloud provider if
needed.

The easiest option for you is to use EKS on AWS together with Ververica
community edition [1] or with one of the many Kubernetes operators.

[1] https://www.ververica.com/getting-started

On Tue, Aug 11, 2020 at 3:23 PM Till Rohrmann  wrote:

> Hi Sidhant,
>
> see the inline comments for answers
>
> On Tue, Aug 11, 2020 at 3:10 PM sidhant gupta  wrote:
>
>> Hi Till,
>>
>> Thanks for your response.
>> I have a few queries, though, as mentioned below:
>> (1) Can Flink be used in a map-reduce fashion with the data streaming API?
>>
>
> What do you understand as map-reduce fashion? You can use Flink's DataSet
> API for processing batch workloads (consisting not only of map and reduce
> operations but also other operations such as groupReduce, flatMap, etc.).
> Flink's DataStream API can be used to process bounded and unbounded
> streaming data.
>
>> (2) Does it make sense to use AWS EMR if we are not using Flink in a
>> map-reduce fashion with the streaming API?
>>
>
> I think I don't fully understand what you mean with map-reduce fashion. Do
> you mean multiple stages of map and reduce operations?
>
>
>> (3) Can flink cluster be auto scaled using EMR Managed Scaling when used
>> with yarn as per this link
>> https://aws.amazon.com/blogs/big-data/introducing-amazon-emr-managed-scaling-automatically-resize-clusters-to-lower-cost/
>>  ?
>>
>
> I am no expert on EMR managed scaling but I believe that it would need
> some custom tooling to scale a Flink job down (by taking a savepoint and
> resuming from it with a lower parallelism) before downsizing the EMR
> cluster.
>
>
>> (4) If we set an explicit max parallelism, set the current parallelism
>> (which might be less than the max parallelism) equal to the maximum number
>> of slots, and set the slots per task manager while starting the YARN
>> session, would the parallelism then increase (up to the max parallelism)
>> when auto scaling adds task managers, and would the load be distributed
>> across the newly spun-up task managers? Refer:
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/production_ready.html#set-an-explicit-max-parallelism
>>
>>
>
> At the moment, Flink does not support this out of the box but the
> community is working on this feature.
>
>>
>> Regards
>> Sidhant Gupta
>>
>> On Tue, 11 Aug, 2020, 5:19 PM Till Rohrmann, 
>> wrote:
>>
>>> Hi Sidhant,
>>>
>>> I am not an expert on AWS services but I believe that EMR might be a bit
>>> easier to start with since AWS EMR comes with Flink support out of the box
>>> [1]. On ECS I believe that you would have to set up the containers
>>> yourself. Another interesting deployment option could be to use Flink's
>>> native Kubernetes integration [2] which would work on AWS EKS.
>>>
>>> [1]
>>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/flink-create-cluster.html
>>> [2]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Aug 11, 2020 at 9:16 AM sidhant gupta 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm kind of new to Flink cluster deployment. I wanted to know which
>>>> Flink cluster deployment and which job mode on AWS is better in terms
>>>> of ease of deployment, maintenance, HA, cost, etc. As of now I am
>>>> considering AWS EMR vs ECS (Docker containers). We have a use case of
>>>> setting up a data streaming API which reads records from a Kafka topic,
>>>> processes them, and then writes them to another Kafka topic. Please let
>>>> me know your thoughts on this.
>>>>
>>>> Thanks
>>>> Sidhant Gupta

>>>

-- 

Arvid Heise | Senior Java Developer




Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
(Toni) Cheng


Re: Flink cluster deployment strategy

2020-08-11 Thread Till Rohrmann
Hi Sidhant,

see the inline comments for answers

On Tue, Aug 11, 2020 at 3:10 PM sidhant gupta  wrote:

> Hi Till,
>
> Thanks for your response.
> I have a few queries, though, as mentioned below:
> (1) Can Flink be used in a map-reduce fashion with the data streaming API?
>

What do you mean by map-reduce fashion? You can use Flink's DataSet API for
processing batch workloads (consisting not only of map and reduce operations
but also other operations such as groupReduce, flatMap, etc.). Flink's
DataStream API can be used to process bounded and unbounded streaming data.
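
As a rough illustration of what "map-reduce fashion" could look like on a
stream, here is a minimal sketch in plain Python. It does not use any Flink
API; the function names and sample records are made up for illustration. It
only shows the idea that a keyed reduce on a stream emits an updated result
per incoming record instead of a single final result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def flat_map(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Split each record into (word, 1) pairs, like a flatMap stage."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def keyed_running_reduce(
    pairs: Iterable[Tuple[str, int]]
) -> Iterator[Tuple[str, int]]:
    """Keep one running sum per key and emit the new total for each record."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
        yield (key, totals[key])


# Hypothetical input records, standing in for messages read from a topic.
records = ["kafka flink", "flink kafka flink"]
results = list(keyed_running_reduce(flat_map(records)))
print(results)
```

Because both stages are generators, the same pipeline works whether the
input is a finite list or a source that never ends.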

> (2) Does it make sense to use AWS EMR if we are not using Flink in a
> map-reduce fashion with the streaming API?
>

I think I don't fully understand what you mean by map-reduce fashion. Do
you mean multiple stages of map and reduce operations?


> (3) Can flink cluster be auto scaled using EMR Managed Scaling when used
> with yarn as per this link
> https://aws.amazon.com/blogs/big-data/introducing-amazon-emr-managed-scaling-automatically-resize-clusters-to-lower-cost/
>  ?
>

I am no expert on EMR managed scaling but I believe that it would need some
custom tooling to scale a Flink job down (by taking a savepoint and resuming
from it with a lower parallelism) before downsizing the EMR cluster.
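
Such custom tooling could start from something like the sketch below, which
only builds the two Flink CLI invocations (stop with a savepoint, then run
from that savepoint with a new parallelism) and prints them. The helper
names, job ID, bucket, and jar are hypothetical placeholders, and the
options needed to target a specific YARN session are deliberately left out.

```python
from typing import List


def stop_with_savepoint_cmd(job_id: str, savepoint_dir: str) -> List[str]:
    """Build `flink stop`, which takes a savepoint and then stops the job."""
    return ["flink", "stop", "--savepointPath", savepoint_dir, job_id]


def resume_cmd(savepoint_path: str, parallelism: int, jar: str) -> List[str]:
    """Build `flink run` that restores from a savepoint with a new parallelism."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return [
        "flink", "run", "--detached",
        "--fromSavepoint", savepoint_path,
        "--parallelism", str(parallelism),
        jar,
    ]


# Hypothetical values, for illustration only; nothing is executed here.
stop = stop_with_savepoint_cmd("<jobId>", "s3://my-bucket/savepoints")
resume = resume_cmd("s3://my-bucket/savepoints/<savepoint>", 2, "job.jar")
print(" ".join(stop))
print(" ".join(resume))
```

A real tool would run the first command, read the savepoint path from its
output, and only shrink the cluster once the resumed job is running.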


> (4) If we set an explicit max parallelism, set the current parallelism
> (which might be less than the max parallelism) equal to the maximum number
> of slots, and set the slots per task manager while starting the YARN
> session, would the parallelism then increase (up to the max parallelism)
> when auto scaling adds task managers, and would the load be distributed
> across the newly spun-up task managers? Refer:
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/production_ready.html#set-an-explicit-max-parallelism
>
>

At the moment, Flink does not support this out of the box but the community
is working on this feature.
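
To make the role of max parallelism more concrete, here is a small sketch
in plain Python of how keyed state is split into key groups and how those
groups map to parallel subtasks. The operator-index formula mirrors the one
Flink uses (key group * parallelism / max parallelism); the CRC32 hash is
only a stand-in for Flink's internal hash, and `usable_parallelism` is a
hypothetical helper that assumes one slot per parallel subtask.

```python
import zlib
from collections import Counter


def key_group(key: str, max_parallelism: int) -> int:
    """Assign a key to a key group (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % max_parallelism


def operator_index(key_group_id: int, parallelism: int, max_parallelism: int) -> int:
    """Map a key group to the parallel subtask that owns it."""
    return key_group_id * parallelism // max_parallelism


def usable_parallelism(task_managers: int, slots_per_tm: int, max_parallelism: int) -> int:
    """Upper bound on parallelism: limited by total slots and by max parallelism."""
    return min(task_managers * slots_per_tm, max_parallelism)


def groups_per_subtask(parallelism: int, max_parallelism: int) -> Counter:
    """Count how many key groups each subtask owns at a given parallelism."""
    return Counter(
        operator_index(g, parallelism, max_parallelism)
        for g in range(max_parallelism)
    )


# With max parallelism 8, going from 2 to 4 subtasks halves each subtask's share.
print(groups_per_subtask(2, 8))
print(groups_per_subtask(4, 8))
print(usable_parallelism(task_managers=3, slots_per_tm=4, max_parallelism=8))
```

The number of key groups is fixed by the max parallelism, so adding task
managers only helps once the job is restarted with a higher parallelism,
and never beyond the max parallelism.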

>
> Regards
> Sidhant Gupta
>
> On Tue, 11 Aug, 2020, 5:19 PM Till Rohrmann,  wrote:
>
>> Hi Sidhant,
>>
>> I am not an expert on AWS services but I believe that EMR might be a bit
>> easier to start with since AWS EMR comes with Flink support out of the box
>> [1]. On ECS I believe that you would have to set up the containers
>> yourself. Another interesting deployment option could be to use Flink's
>> native Kubernetes integration [2] which would work on AWS EKS.
>>
>> [1]
>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/flink-create-cluster.html
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
>>
>> Cheers,
>> Till
>>
>> On Tue, Aug 11, 2020 at 9:16 AM sidhant gupta 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm kind of new to Flink cluster deployment. I wanted to know which Flink
>>> cluster deployment and which job mode on AWS is better in terms of ease
>>> of deployment, maintenance, HA, cost, etc. As of now I am considering AWS
>>> EMR vs ECS (Docker containers). We have a use case of setting up a data
>>> streaming API which reads records from a Kafka topic, processes them, and
>>> then writes them to another Kafka topic. Please let me know your thoughts
>>> on this.
>>>
>>> Thanks
>>> Sidhant Gupta
>>>
>>


Re: Flink cluster deployment strategy

2020-08-11 Thread sidhant gupta
Hi Till,

Thanks for your response.
I have a few queries, though, as mentioned below:
(1) Can Flink be used in a map-reduce fashion with the data streaming API?
(2) Does it make sense to use AWS EMR if we are not using Flink in a
map-reduce fashion with the streaming API?
(3) Can a Flink cluster be auto-scaled using EMR Managed Scaling when used
with YARN, as per this link?
https://aws.amazon.com/blogs/big-data/introducing-amazon-emr-managed-scaling-automatically-resize-clusters-to-lower-cost/
(4) If we set an explicit max parallelism, set the current parallelism
(which might be less than the max parallelism) equal to the maximum number
of slots, and set the slots per task manager while starting the YARN
session, would the parallelism then increase (up to the max parallelism)
when auto scaling adds task managers, and would the load be distributed
across the newly spun-up task managers? Refer:
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/production_ready.html#set-an-explicit-max-parallelism


Regards
Sidhant Gupta

On Tue, 11 Aug, 2020, 5:19 PM Till Rohrmann,  wrote:

> Hi Sidhant,
>
> I am not an expert on AWS services but I believe that EMR might be a bit
> easier to start with since AWS EMR comes with Flink support out of the box
> [1]. On ECS I believe that you would have to set up the containers
> yourself. Another interesting deployment option could be to use Flink's
> native Kubernetes integration [2] which would work on AWS EKS.
>
> [1]
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/flink-create-cluster.html
> [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
>
> Cheers,
> Till
>
> On Tue, Aug 11, 2020 at 9:16 AM sidhant gupta  wrote:
>
>> Hi all,
>>
>> I'm kind of new to Flink cluster deployment. I wanted to know which Flink
>> cluster deployment and which job mode on AWS is better in terms of ease of
>> deployment, maintenance, HA, cost, etc. As of now I am considering AWS EMR
>> vs ECS (Docker containers). We have a use case of setting up a data
>> streaming API which reads records from a Kafka topic, processes them, and
>> then writes them to another Kafka topic. Please let me know your thoughts
>> on this.
>>
>> Thanks
>> Sidhant Gupta
>>
>


Re: Flink cluster deployment strategy

2020-08-11 Thread Till Rohrmann
Hi Sidhant,

I am not an expert on AWS services but I believe that EMR might be a bit
easier to start with since AWS EMR comes with Flink support out of the box
[1]. On ECS I believe that you would have to set up the containers
yourself. Another interesting deployment option could be to use Flink's
native Kubernetes integration [2] which would work on AWS EKS.

[1]
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/flink-create-cluster.html
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html

Cheers,
Till

On Tue, Aug 11, 2020 at 9:16 AM sidhant gupta  wrote:

> Hi all,
>
> I'm kind of new to Flink cluster deployment. I wanted to know which Flink
> cluster deployment and which job mode on AWS is better in terms of ease of
> deployment, maintenance, HA, cost, etc. As of now I am considering AWS EMR
> vs ECS (Docker containers). We have a use case of setting up a data
> streaming API which reads records from a Kafka topic, processes them, and
> then writes them to another Kafka topic. Please let me know your thoughts
> on this.
>
> Thanks
> Sidhant Gupta
>


Flink cluster deployment strategy

2020-08-11 Thread sidhant gupta
Hi all,

I'm kind of new to Flink cluster deployment. I wanted to know which Flink
cluster deployment and which job mode on AWS is better in terms of ease of
deployment, maintenance, HA, cost, etc. As of now I am considering AWS EMR
vs ECS (Docker containers). We have a use case of setting up a data
streaming API which reads records from a Kafka topic, processes them, and
then writes them to another Kafka topic. Please let me know your thoughts
on this.

Thanks
Sidhant Gupta