Hi, 

> I think setting taskmanager.network.sort-shuffle.min-parallelism to 1 and 
> using sort-shuffle for all cases by default is a good suggestion. I did not 
> choose this value mainly for two reasons:

> 1. The first one is that it increases the usage of network memory, which may 
> cause an "insufficient network buffer" exception, and users may have to 
> increase the total network buffers.
> 2. Several (not many) TPC-DS queries suffer a performance regression on SSD.

> For the first issue, I will test more settings on TPC-DS and see their 
> influence. 
> For the second issue, I will try to find the cause and solve it in 1.15.

> I am open to your suggestion, but I still need some more tests and analysis 
> to guarantee that it works well.

Many thanks, Yingjie, for the detailed explanation and the further investigation!
Overall, if all the issues are solvable, I think it would be better for us to 
change directly to the most suitable final values according to the experiment 
results, so that we do not have to modify the parameters again in the future. 

Best,
Yun



------------------------------------------------------------------
From:Till Rohrmann <trohrm...@apache.org>
Send Time:2021 Dec. 14 (Tue.) 18:57
To:Jingsong Li <jingsongl...@gmail.com>
Cc:Yingjie Cao <kevin.ying...@gmail.com>; 刘建刚 <liujiangangp...@gmail.com>; dev 
<d...@flink.apache.org>; Yun Gao <yungao...@aliyun.com>; user 
<u...@flink.apache.org>; user-zh <user-zh@flink.apache.org>
Subject:Re: [DISCUSS] Change some default config values of blocking shuffle

As part of this FLIP, does it make sense to also extend the documentation for 
the sort shuffle [1] to include a tuning guide? I am thinking of a more 
in-depth description of what things you might observe and how to influence 
them with the configuration options.

[1] 
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/batch/blocking_shuffle/#sort-shuffle

Cheers,
Till
On Tue, Dec 14, 2021 at 8:43 AM Jingsong Li <jingsongl...@gmail.com> wrote:
Hi Yingjie,

 Thanks for your explanation. I have no more questions. +1

 On Tue, Dec 14, 2021 at 3:31 PM Yingjie Cao <kevin.ying...@gmail.com> wrote:
 >
 > Hi Jingsong,
 >
 > Thanks for your feedback.
 >
 > >>> My question is, what is the maximum parallelism a job can have with the 
 > >>> default configuration? (Does this break the out-of-the-box experience?)
 >
 > Yes, you are right, these two options are related to network memory and 
 > framework off-heap memory. Generally, these changes will not break the 
 > out-of-the-box experience, but in some extreme cases, for example, when 
 > there are too many ResultPartitions per task, users may need to increase 
 > the network memory to avoid the "insufficient network buffer" error. For 
 > framework off-heap memory, I believe users do not need to change the 
 > default value.
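 >
 > For reference, if that error occurs, a user can raise the total network 
 > memory via the standard memory options, e.g. (illustrative values only, to 
 > be sized per setup):
 >
 >     taskmanager.memory.network.fraction: 0.2
 >     taskmanager.memory.network.min: 256mb
 >     taskmanager.memory.network.max: 2gb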
 >
 > In fact, I have a basic goal when changing these config values: when running 
 > TPC-DS at medium parallelism with the default values, all queries must pass 
 > without any error and, at the same time, performance must improve. I think 
 > if we achieve this goal, most common use cases can be covered.
 >
 > Currently, for the default configuration, the number of exclusive buffers 
 > required on the input gate side still depends on the parallelism (though 
 > since 1.14 we can decouple the network buffer consumption from parallelism 
 > by setting a config value, at a slight performance cost for streaming 
 > jobs), which means that the default configuration cannot support very large 
 > parallelism. Roughly, I would say the default values can support jobs with 
 > several hundreds of parallelism.
 >
 > >>> I do feel that this correspondence is a bit difficult to control at the 
 > >>> moment, and it would be best if a rough table could be provided.
 >
 > I think this is a good suggestion; we can provide such guidance in the 
 > documentation.
 >
 > Best,
 > Yingjie
 >
 > Jingsong Li <jingsongl...@gmail.com> wrote on Tue, Dec 14, 2021 at 14:39:
 >>
 >> Hi Yingjie,
 >>
 >> +1 for this FLIP. I'm pretty sure this will greatly improve the ease of 
 >> use of batch jobs.
 >>
 >> Looks like "taskmanager.memory.framework.off-heap.batch-shuffle.size"
 >> and "taskmanager.network.sort-shuffle.min-buffers" are related to
 >> network memory and framework.off-heap.size.
 >>
 >> My question is, what is the maximum parallelism a job can have with
 >> the default configuration? (Does this break the out-of-the-box experience?)
 >>
 >> How much network memory and framework.off-heap.size is required for a
 >> given parallelism under the default configuration?
 >>
 >> I do feel that this correspondence is a bit difficult to control at
 >> the moment, and it would be best if a rough table could be provided.
 >>
 >> Best,
 >> Jingsong
 >>
 >> On Tue, Dec 14, 2021 at 2:16 PM Yingjie Cao <kevin.ying...@gmail.com> wrote:
 >> >
 >> > Hi Jiangang,
 >> >
 >> > Thanks for your suggestion.
 >> >
 >> > >>> The config can affect the memory usage. Will the related memory 
 >> > >>> configs be changed?
 >> >
 >> > I think we will not change the default network memory settings. My best 
 >> > expectation is that the default values can work for most cases (though 
 >> > maybe not optimally), and for other cases, users may need to tune the 
 >> > memory settings.
 >> >
 >> > >>> Can you share the TPC-DS results for different configs? Even though 
 >> > >>> we change the default values, it is helpful for different users to 
 >> > >>> know how to tune them, and this experience can help a lot.
 >> >
 >> > I did not keep all of the previous TPC-DS results, but from the results I 
 >> > have, I can tell that on HDD, always using sort-shuffle is a good choice. 
 >> > For small jobs, using sort-shuffle may not bring much performance gain, 
 >> > probably because all shuffle data can be cached in memory (page cache) 
 >> > when the cluster has enough resources. However, if the whole cluster is 
 >> > under heavy load or you are running large-scale jobs, the performance of 
 >> > those small jobs can also be affected. For large-scale jobs, the 
 >> > configurations suggested for tuning are 
 >> > taskmanager.network.sort-shuffle.min-buffers and 
 >> > taskmanager.memory.framework.off-heap.batch-shuffle.size; you can 
 >> > increase these values for large-scale batch jobs, as sketched below.
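 >> >
 >> > For example (hypothetical values for illustration only; the right 
 >> > numbers depend on the job scale and available memory):
 >> >
 >> >     taskmanager.network.sort-shuffle.min-buffers: 2048
 >> >     taskmanager.memory.framework.off-heap.batch-shuffle.size: 128m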
 >> >
 >> > BTW, I am still running TPC-DS tests these days, and I can share the 
 >> > results soon.
 >> >
 >> > Best,
 >> > Yingjie
 >> >
 >> > 刘建刚 <liujiangangp...@gmail.com> wrote on Fri, Dec 10, 2021 at 18:30:
 >> >>
 >> >> Glad to see the suggestion. In our tests, we found that, just as in 
 >> >> your tests, the changed configs cannot improve the performance of small 
 >> >> jobs much. I have some suggestions:
 >> >>
 >> >> 1. The config can affect the memory usage. Will the related memory 
 >> >> configs be changed?
 >> >> 2. Can you share the TPC-DS results for different configs? Even though 
 >> >> we change the default values, it is helpful for different users to know 
 >> >> how to tune them, and this experience can help a lot.
 >> >>
 >> >> Best,
 >> >> Liu Jiangang
 >> >>
 >> >> Yun Gao <yungao...@aliyun.com.invalid> wrote on Fri, Dec 10, 2021 at 17:20:
 >> >>>
 >> >>> Hi Yingjie,
 >> >>>
 >> >>> Many thanks for drafting the FLIP and initiating the discussion!
 >> >>>
 >> >>> May I double-check one point about 
 >> >>> taskmanager.network.sort-shuffle.min-parallelism: since other 
 >> >>> frameworks like Spark use sort-based shuffle for all cases, is our 
 >> >>> current situation still different from theirs?
 >> >>>
 >> >>> Best,
 >> >>> Yun
 >> >>>
 >> >>>
 >> >>>
 >> >>>
 >> >>> ------------------------------------------------------------------
 >> >>> From:Yingjie Cao <kevin.ying...@gmail.com>
 >> >>> Send Time:2021 Dec. 10 (Fri.) 16:17
 >> >>> To:dev <d...@flink.apache.org>; user <u...@flink.apache.org>; user-zh 
 >> >>> <user-zh@flink.apache.org>
 >> >>> Subject:Re: [DISCUSS] Change some default config values of blocking 
 >> >>> shuffle
 >> >>>
 >> >>> Hi dev & users:
 >> >>>
 >> >>> I have created a FLIP [1] for it; feedback is highly appreciated.
 >> >>>
 >> >>> Best,
 >> >>> Yingjie
 >> >>>
 >> >>> [1] 
 >> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-199%3A+Change+some+default+config+values+of+blocking+shuffle+for+better+usability
 >> >>> Yingjie Cao <kevin.ying...@gmail.com> wrote on Fri, Dec 3, 2021 at 17:02:
 >> >>>
 >> >>> Hi dev & users,
 >> >>>
 >> >>> We propose to change some default config values of blocking shuffle to 
 >> >>> improve the out-of-the-box user experience (streaming is not 
 >> >>> affected). The default values we want to change are as follows:
 >> >>>
 >> >>> 1. Data compression 
 >> >>> (taskmanager.network.blocking-shuffle.compression.enabled): Currently, 
 >> >>> the default value is 'false'. Usually, data compression reduces both 
 >> >>> disk and network IO, which is good for performance; at the same time, 
 >> >>> it saves storage space. We propose to change the default value to 
 >> >>> 'true'.
 >> >>>
 >> >>> 2. Default shuffle implementation 
 >> >>> (taskmanager.network.sort-shuffle.min-parallelism): Currently, the 
 >> >>> default value is 'Integer.MAX_VALUE', which means that by default, 
 >> >>> Flink jobs will always use hash-shuffle. In fact, for high 
 >> >>> parallelism, sort-shuffle is better for both stability and 
 >> >>> performance. So we propose to reduce the default value to a smaller, 
 >> >>> more suitable one, for example, 128. (We tested 128, 256, 512 and 1024 
 >> >>> with TPC-DS, and 128 was the best.)
 >> >>>
 >> >>> 3. Read buffer of sort-shuffle 
 >> >>> (taskmanager.memory.framework.off-heap.batch-shuffle.size): Currently, 
 >> >>> the default value is '32M'. Previously, when choosing the default 
 >> >>> value, both '32M' and '64M' were OK in our tests, and we cautiously 
 >> >>> chose the smaller one. However, it was recently reported on the 
 >> >>> mailing list that the default value is not enough, which caused a 
 >> >>> buffer request timeout issue. We have already created a ticket to 
 >> >>> improve the behavior. At the same time, we propose to increase this 
 >> >>> default value to '64M', which can also help.
 >> >>>
 >> >>> 4. Sort buffer size of sort-shuffle 
 >> >>> (taskmanager.network.sort-shuffle.min-buffers): Currently, the default 
 >> >>> value is '64', which means 64 network buffers (32K per buffer by 
 >> >>> default). This default value is quite modest, and performance can 
 >> >>> suffer. We propose to increase this value to a larger one, for 
 >> >>> example, 512 (with which the default TM and network buffer 
 >> >>> configuration can still serve more than 10 result partitions 
 >> >>> concurrently). The proposed defaults are summarized in the sketch 
 >> >>> after this list.
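 >> >>>
 >> >>> To summarize, these proposals are equivalent to setting the following 
 >> >>> in flink-conf.yaml (values taken from the items above):
 >> >>>
 >> >>>     taskmanager.network.blocking-shuffle.compression.enabled: true
 >> >>>     taskmanager.network.sort-shuffle.min-parallelism: 128
 >> >>>     taskmanager.memory.framework.off-heap.batch-shuffle.size: 64m
 >> >>>     taskmanager.network.sort-shuffle.min-buffers: 512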
 >> >>>
 >> >>> We have already tested these default values together with the TPC-DS 
 >> >>> benchmark in a cluster, and both performance and stability improved a 
 >> >>> lot. These changes can help improve the out-of-the-box experience of 
 >> >>> blocking shuffle. What do you think about these changes? Are there any 
 >> >>> concerns? If there are no objections, I will make these changes soon.
 >> >>>
 >> >>> Best,
 >> >>> Yingjie
 >> >>>
 >>
 >>
 >> --
 >> Best, Jingsong Lee



 -- 
 Best, Jingsong Lee
