Re: Parallelism in Production: Best Practices

2021-05-22 Thread Yaroslav Tkachenko
Hi Robert,

Thanks for the advice! Checking Flink Forward talks seems like a good idea,
will do đź‘Ť

On Sat, May 22, 2021 at 4:19 AM Robert Metzger  wrote:

> Hi Yaroslav,
>
> My recommendation is to go with the 2nd pattern you've described, but I
> only have limited insights into real world production workloads.
>
> Besides the parallelism configuration, I also recommend looking into slot
> sharing groups, and maybe disabling operator chaining.
> I'm pretty sure some of Flink's large production users have shared
> information about this in past Flink Forward talks .. but it is difficult
> to find answers (unless you spend a lot of time on YouTube).
>
> Best,
> Robert
>
>
> On Wed, May 19, 2021 at 8:01 PM Yaroslav Tkachenko <
> yaroslav.tkache...@shopify.com> wrote:
>
>> Hi everyone,
>>
>> I'd love to learn more about how different companies approach specifying
>> Flink parallelism. I'm specifically interested in real, production
>> workloads.
>>
>> I can see a few common patterns:
>>
>> - Rely on default parallelism, scale by changing parallelism for the
>> whole pipeline. I guess it only works if the pipeline doesn't have obvious
>> bottlenecks. Also, it looks like the new reactive mode makes specifying
>> parallelism for an operator obsolete (
>> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/elastic_scaling/#configuration
>> )
>>
>> - Rely on default parallelism for most of the operators, but override it
>> for some. For example, it doesn't make sense for a Kafka source to have
>> parallelism higher than the number of partitions it consumes. Some custom
>> sinks could choose lower parallelism to avoid overloading their
>> destinations. Some transformation steps could choose higher parallelism to
>> distribute the work better, etc.
>>
>> - Don't rely on default parallelism and configure parallelism
>> explicitly for each operator. This requires very good knowledge of each
>> operator in the pipeline, but it could lead to very good performance.
>>
>> Is there a different pattern that I miss? What do you use? Feel free to
>> share any resources.
>>
>> If you do specify it explicitly, what do you think about the reactive
>> mode? Will you use it?
>>
>> Also, how often do you change parallelism? Do you set it once and forget
>> once the pipeline is stable? Do you keep re-evaluating it?
>>
>> Thanks.
>>
>


Re: Parallelism in Production: Best Practices

2021-05-22 Thread Robert Metzger
Hi Yaroslav,

My recommendation is to go with the 2nd pattern you've described, but I
only have limited insights into real world production workloads.

Besides the parallelism configuration, I also recommend looking into slot
sharing groups, and maybe disabling operator chaining.
I'm pretty sure some of Flink's large production users have shared
information about this in past Flink Forward talks .. but it is difficult
to find answers (unless you spend a lot of time on YouTube).

Best,
Robert


On Wed, May 19, 2021 at 8:01 PM Yaroslav Tkachenko <
yaroslav.tkache...@shopify.com> wrote:

> Hi everyone,
>
> I'd love to learn more about how different companies approach specifying
> Flink parallelism. I'm specifically interested in real, production
> workloads.
>
> I can see a few common patterns:
>
> - Rely on default parallelism, scale by changing parallelism for the whole
> pipeline. I guess it only works if the pipeline doesn't have obvious
> bottlenecks. Also, it looks like the new reactive mode makes specifying
> parallelism for an operator obsolete (
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/elastic_scaling/#configuration
> )
>
> - Rely on default parallelism for most of the operators, but override it
> for some. For example, it doesn't make sense for a Kafka source to have
> parallelism higher than the number of partitions it consumes. Some custom
> sinks could choose lower parallelism to avoid overloading their
> destinations. Some transformation steps could choose higher parallelism to
> distribute the work better, etc.
>
> - Don't rely on default parallelism and configure parallelism
> explicitly for each operator. This requires very good knowledge of each
> operator in the pipeline, but it could lead to very good performance.
>
> Is there a different pattern that I miss? What do you use? Feel free to
> share any resources.
>
> If you do specify it explicitly, what do you think about the reactive
> mode? Will you use it?
>
> Also, how often do you change parallelism? Do you set it once and forget
> once the pipeline is stable? Do you keep re-evaluating it?
>
> Thanks.
>


Re: Parallelism in Production: Best Practices

2021-05-20 Thread Yaroslav Tkachenko
Hi Jan, thanks for sharing this!

Just wanted to confirm: this approach works because of the task slot
sharing feature in Flink, doesn't it?

On Thu, May 20, 2021 at 1:12 AM Jan Brusch 
wrote:

>
> Hi Yaroslav,
>
> here's a fourth option that we usually use: We set the default
> parallelism once when we initially deploy the app (maybe change it a few
> times in the beginning). From that point on rescale by either resizing
> the TaskManager-Nodes or redistributing the parallelism over more / less
> TaskManager-Nodes.
>
> For example: We start to run an app with a default parallelism of 64 and
> we initially distribute this over 16 TaskManager-Nodes with 4 taskSlots
> each. Then we see that we have scaled way too high for the actual
> workload. We now have two options: Either reduce the hardware on the 16
> Nodes (CPU and RAM) or re-scale horizontally by re-distributing the
> workload over 8 TaskManager-Nodes with 8 taskSlots each.
>
> Since we leave the parallelism of the Job untouched in each case, we can
> easily rescale by re-deploying the whole cluster and let it resume from
> the last checkpoint. A cleaner way would probably be to do this
> re-deployment with explicit savepoints.
>
> We are doing this in kubernetes where both scaling options are really
> easy to carry out. But the same concepts should work on any other setup,
> too.
>
>
> Hope that helps
>
> Jan
>
> On 19.05.21 20:00, Yaroslav Tkachenko wrote:
> > Hi everyone,
> >
> > I'd love to learn more about how different companies approach
> > specifying Flink parallelism. I'm specifically interested in real,
> > production workloads.
> >
> > I can see a few common patterns:
> >
> > - Rely on default parallelism, scale by changing parallelism for the
> > whole pipeline. I guess it only works if the pipeline doesn't have
> > obvious bottlenecks. Also, it looks like the new reactive mode makes
> > specifying parallelism for an operator obsolete
> > (
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/elastic_scaling/#configuration
> > <
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/elastic_scaling/#configuration
> >)
> >
> > - Rely on default parallelism for most of the operators, but override
> > it for some. For example, it doesn't make sense for a Kafka source to
> > have parallelism higher than the number of partitions it consumes.
> > Some custom sinks could choose lower parallelism to avoid overloading
> > their destinations. Some transformation steps could choose higher
> > parallelism to distribute the work better, etc.
> >
> > - Don't rely on default parallelism and configure parallelism
> > explicitly for each operator. This requires very good knowledge of
> > each operator in the pipeline, but it could lead to very good
> performance.
> >
> > Is there a different pattern that I miss? What do you use? Feel free
> > to share any resources.
> >
> > If you do specify it explicitly, what do you think about the reactive
> > mode? Will you use it?
> >
> > Also, how often do you change parallelism? Do you set it once and
> > forget once the pipeline is stable? Do you keep re-evaluating it?
> >
> > Thanks.
>
> --
> neuland  – Büro für Informatik GmbH
> Konsul-Smidt-Str. 8g, 28217 Bremen
>
> Telefon (0421) 380107 57
> Fax (0421) 380107 99
> https://www.neuland-bfi.de
>
> https://twitter.com/neuland
> https://facebook.com/neulandbfi
> https://xing.com/company/neulandbfi
>
>
> Geschäftsführer: Thomas Gebauer, Jan Zander
> Registergericht: Amtsgericht Bremen, HRB 23395 HB
> USt-ID. DE 246585501
>
>


Re: Parallelism in Production: Best Practices

2021-05-20 Thread Jan Brusch



Hi Yaroslav,

here's a fourth option that we usually use: We set the default 
parallelism once when we initially deploy the app (maybe change it a few 
times in the beginning). From that point on rescale by either resizing 
the TaskManager-Nodes or redistributing the parallelism over more / less 
TaskManager-Nodes.


For example: We start to run an app with a default parallelism of 64 and 
we initially distribute this over 16 TaskManager-Nodes with 4 taskSlots 
each. Then we see that we have scaled way too high for the actual 
workload. We now have two options: Either reduce the hardware on the 16 
Nodes (CPU and RAM) or re-scale horizontally by re-distributing the 
workload over 8 TaskManager-Nodes with 8 taskSlots each.


Since we leave the parallelism of the Job untouched in each case, we can 
easily rescale by re-deploying the whole cluster and let it resume from 
the last checkpoint. A cleaner way would probably be to do this 
re-deployment with explicit savepoints.


We are doing this in kubernetes where both scaling options are really 
easy to carry out. But the same concepts should work on any other setup, 
too.



Hope that helps

Jan

On 19.05.21 20:00, Yaroslav Tkachenko wrote:

Hi everyone,

I'd love to learn more about how different companies approach 
specifying Flink parallelism. I'm specifically interested in real, 
production workloads.


I can see a few common patterns:

- Rely on default parallelism, scale by changing parallelism for the 
whole pipeline. I guess it only works if the pipeline doesn't have 
obvious bottlenecks. Also, it looks like the new reactive mode makes 
specifying parallelism for an operator obsolete 
(https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/elastic_scaling/#configuration 
)


- Rely on default parallelism for most of the operators, but override 
it for some. For example, it doesn't make sense for a Kafka source to 
have parallelism higher than the number of partitions it consumes. 
Some custom sinks could choose lower parallelism to avoid overloading 
their destinations. Some transformation steps could choose higher 
parallelism to distribute the work better, etc.


- Don't rely on default parallelism and configure parallelism 
explicitly for each operator. This requires very good knowledge of 
each operator in the pipeline, but it could lead to very good performance.


Is there a different pattern that I miss? What do you use? Feel free 
to share any resources.


If you do specify it explicitly, what do you think about the reactive 
mode? Will you use it?


Also, how often do you change parallelism? Do you set it once and 
forget once the pipeline is stable? Do you keep re-evaluating it?


Thanks.


--
neuland  – Büro für Informatik GmbH
Konsul-Smidt-Str. 8g, 28217 Bremen

Telefon (0421) 380107 57
Fax (0421) 380107 99
https://www.neuland-bfi.de

https://twitter.com/neuland
https://facebook.com/neulandbfi
https://xing.com/company/neulandbfi


Geschäftsführer: Thomas Gebauer, Jan Zander
Registergericht: Amtsgericht Bremen, HRB 23395 HB
USt-ID. DE 246585501