Hi Till,

Thanks for the suggestion. I think it makes a lot of sense to also extend
the documentation for the sort shuffle to include a tuning guide.

Best,
Yingjie

Till Rohrmann <trohrm...@apache.org> wrote on Tue, Dec 14, 2021 at 18:57:

> As part of this FLIP, does it make sense to also extend the documentation
> for the sort shuffle [1] to include a tuning guide? I am thinking of a
> more in-depth description of what things you might observe and how to
> influence them with the configuration options.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/batch/blocking_shuffle/#sort-shuffle
>
> Cheers,
> Till
>
> On Tue, Dec 14, 2021 at 8:43 AM Jingsong Li <jingsongl...@gmail.com>
> wrote:
>
>> Hi Yingjie,
>>
>> Thanks for your explanation. I have no more questions. +1
>>
>> On Tue, Dec 14, 2021 at 3:31 PM Yingjie Cao <kevin.ying...@gmail.com>
>> wrote:
>> >
>> > Hi Jingsong,
>> >
>> > Thanks for your feedback.
>> >
>> > >>> My question is, what is the maximum parallelism a job can have
>> > >>> with the default configuration? (Does this break the
>> > >>> out-of-the-box experience?)
>> >
>> > Yes, you are right, these two options are related to network memory
>> > and framework off-heap memory. Generally, these changes will not break
>> > the out-of-the-box experience, but in some extreme cases, for example,
>> > when there are too many ResultPartitions per task, users may need to
>> > increase the network memory to avoid an "insufficient network buffer"
>> > error. For framework off-heap memory, I believe users do not need to
>> > change the default value.
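>> >
>> > For example, the network memory budget can be raised via the standard
>> > TaskManager memory options in flink-conf.yaml (the values below are
>> > only illustrative; the right numbers depend on the job):
>> >
>> >     # the defaults are 0.1 for the fraction and 1gb for the upper bound
>> >     taskmanager.memory.network.fraction: 0.2
>> >     taskmanager.memory.network.max: 2g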
>> >
>> > In fact, I have a basic goal when changing these config values: when
>> > running TPC-DS at medium parallelism with the default values, all
>> > queries must pass without any error and, at the same time, the
>> > performance should improve. I think if we achieve this goal, most
>> > common use cases can be covered.
>> >
>> > Currently, with the default configuration, the number of exclusive
>> > buffers required at the input gate side still depends on the
>> > parallelism (though since 1.14 we can decouple the network buffer
>> > consumption from the parallelism by setting a config value, at a
>> > slight performance cost for streaming jobs), which means the default
>> > configuration cannot support very large parallelism. Roughly, I would
>> > say the default values can support jobs with a parallelism of several
>> > hundred.
>> >
>> > >>> I do feel that this correspondence is a bit difficult to control
>> > >>> at the moment, and it would be best if a rough table could be
>> > >>> provided.
>> >
>> > I think this is a good suggestion, we can provide those suggestions in
>> > the documentation.
>> >
>> > Best,
>> > Yingjie
>> >
>> > Jingsong Li <jingsongl...@gmail.com> wrote on Tue, Dec 14, 2021 at
>> > 14:39:
>> >>
>> >> Hi Yingjie,
>> >>
>> >> +1 for this FLIP. I'm pretty sure this will greatly improve the ease
>> >> of use of batch jobs.
>> >>
>> >> Looks like "taskmanager.memory.framework.off-heap.batch-shuffle.size"
>> >> and "taskmanager.network.sort-shuffle.min-buffers" are related to
>> >> network memory and framework.off-heap.size.
>> >>
>> >> My question is, what is the maximum parallelism a job can have with
>> >> the default configuration? (Does this break the out-of-the-box
>> >> experience?)
>> >>
>> >> How much network memory and framework.off-heap.size are required for
>> >> how much parallelism in the default configuration?
>> >>
>> >> I do feel that this correspondence is a bit difficult to control at
>> >> the moment, and it would be best if a rough table could be provided.
>> >>
>> >> Best,
>> >> Jingsong
>> >>
>> >> On Tue, Dec 14, 2021 at 2:16 PM Yingjie Cao <kevin.ying...@gmail.com>
>> >> wrote:
>> >> >
>> >> > Hi Jiangang,
>> >> >
>> >> > Thanks for your suggestion.
>> >> >
>> >> > >>> The config can affect the memory usage. Will the related memory
>> >> > >>> configs be changed?
>> >> >
>> >> > I think we will not change the default network memory settings. My
>> >> > best expectation is that the default values can work for most cases
>> >> > (though maybe not optimally) and for the remaining cases, users may
>> >> > need to tune the memory settings.
>> >> >
>> >> > >>> Can you share the tpcds results for different configs? Although
>> >> > >>> we change the default values, it is helpful to change them for
>> >> > >>> different users. In this case, the experience can help a lot.
>> >> >
>> >> > I did not keep all previous TPC-DS results, but from those results
>> >> > I can tell that on HDD, always using the sort-shuffle is a good
>> >> > choice. For small jobs, using the sort-shuffle may not bring much
>> >> > performance gain, probably because all shuffle data can be cached
>> >> > in memory (page cache); this is the case if the cluster has enough
>> >> > resources. However, if the whole cluster is under heavy load or you
>> >> > are running large-scale jobs, the performance of those small jobs
>> >> > can also be influenced. For large-scale jobs, the configurations
>> >> > suggested to be tuned are
>> >> > taskmanager.network.sort-shuffle.min-buffers and
>> >> > taskmanager.memory.framework.off-heap.batch-shuffle.size; you can
>> >> > increase these values for large-scale batch jobs.
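>> >> >
>> >> > For example, a starting point for a large-scale batch job could
>> >> > look as follows in flink-conf.yaml (the values are illustrative
>> >> > only and should be adjusted to the job scale and available memory):
>> >> >
>> >> >     # more buffers for the sort-based shuffle writer
>> >> >     taskmanager.network.sort-shuffle.min-buffers: 2048
>> >> >     # larger read buffer pool for the sort-based shuffle reader
>> >> >     taskmanager.memory.framework.off-heap.batch-shuffle.size: 128m
>> >> >     # the read buffers are cut from framework off-heap memory,
>> >> >     # so that budget is raised accordingly
>> >> >     taskmanager.memory.framework.off-heap.size: 256m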
>> >> >
>> >> > BTW, I am still running TPC-DS tests these days and I can share
>> >> > those results soon.
>> >> >
>> >> > Best,
>> >> > Yingjie
>> >> >
>> >> > 刘建刚 <liujiangangp...@gmail.com> wrote on Fri, Dec 10, 2021 at
>> >> > 18:30:
>> >> >>
>> >> >> Glad to see the suggestion. In our tests, we found that changing
>> >> >> these configs does not improve the performance of small jobs much,
>> >> >> just as in your tests. I have some suggestions:
>> >> >>
>> >> >> The configs can affect the memory usage. Will the related memory
>> >> >> configs be changed?
>> >> >>
>> >> >> Can you share the TPC-DS results for the different configs?
>> >> >> Although we change the default values, it is helpful for different
>> >> >> users to change them, and in that case the shared experience can
>> >> >> help a lot.
>> >> >>
>> >> >> Best,
>> >> >> Liu Jiangang
>> >> >>
>> >> >> Yun Gao <yungao...@aliyun.com.invalid> wrote on Fri, Dec 10, 2021
>> >> >> at 17:20:
>> >> >>>
>> >> >>> Hi Yingjie,
>> >> >>>
>> >> >>> Many thanks for drafting the FLIP and initiating the discussion!
>> >> >>>
>> >> >>> May I double-check one point regarding
>> >> >>> taskmanager.network.sort-shuffle.min-parallelism: since other
>> >> >>> frameworks like Spark use sort-based shuffle for all cases, does
>> >> >>> our current situation still differ from theirs?
>> >> >>>
>> >> >>> Best,
>> >> >>> Yun
>> >> >>>
>> >> >>> ------------------------------------------------------------------
>> >> >>> From: Yingjie Cao <kevin.ying...@gmail.com>
>> >> >>> Send Time: 2021 Dec. 10 (Fri.) 16:17
>> >> >>> To: dev <d...@flink.apache.org>; user <u...@flink.apache.org>;
>> >> >>> user-zh <user-zh@flink.apache.org>
>> >> >>> Subject: Re: [DISCUSS] Change some default config values of
>> >> >>> blocking shuffle
>> >> >>>
>> >> >>> Hi dev & users:
>> >> >>>
>> >> >>> I have created a FLIP [1] for it, feedback is highly appreciated.
>> >> >>>
>> >> >>> Best,
>> >> >>> Yingjie
>> >> >>>
>> >> >>> [1]
>> >> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-199%3A+Change+some+default+config+values+of+blocking+shuffle+for+better+usability
>> >> >>>
>> >> >>> Yingjie Cao <kevin.ying...@gmail.com> wrote on Fri, Dec 3, 2021
>> >> >>> at 17:02:
>> >> >>>
>> >> >>> Hi dev & users,
>> >> >>>
>> >> >>> We propose to change some default values of blocking shuffle to
>> >> >>> improve the out-of-the-box user experience (streaming is not
>> >> >>> influenced). The default values we want to change are as follows:
>> >> >>>
>> >> >>> 1. Data compression
>> >> >>> (taskmanager.network.blocking-shuffle.compression.enabled):
>> >> >>> Currently, the default value is 'false'. Usually, data
>> >> >>> compression reduces both disk and network IO, which is good for
>> >> >>> performance. At the same time, it saves storage space. We propose
>> >> >>> to change the default value to true.
>> >> >>>
>> >> >>> 2. Default shuffle implementation
>> >> >>> (taskmanager.network.sort-shuffle.min-parallelism): Currently,
>> >> >>> the default value is 'Integer.MAX', which means that, by default,
>> >> >>> Flink jobs will always use hash-shuffle. In fact, for high
>> >> >>> parallelism, sort-shuffle is better for both stability and
>> >> >>> performance. So we propose to reduce the default value to a
>> >> >>> suitably smaller one, for example, 128. (We tested 128, 256, 512
>> >> >>> and 1024 with TPC-DS and 128 was the best one.)
>> >> >>>
>> >> >>> 3. Read buffer of sort-shuffle
>> >> >>> (taskmanager.memory.framework.off-heap.batch-shuffle.size):
>> >> >>> Currently, the default value is '32M'. Previously, when choosing
>> >> >>> the default value, both '32M' and '64M' were OK in our tests and
>> >> >>> we cautiously chose the smaller one. However, it was recently
>> >> >>> reported on the mailing list that the default value is not
>> >> >>> enough, which caused a buffer request timeout issue. We already
>> >> >>> created a ticket to improve the behavior. At the same time, we
>> >> >>> propose to increase this default value to '64M', which can also
>> >> >>> help.
>> >> >>>
>> >> >>> 4. Sort buffer size of sort-shuffle
>> >> >>> (taskmanager.network.sort-shuffle.min-buffers): Currently, the
>> >> >>> default value is '64', which means 64 network buffers (32k per
>> >> >>> buffer by default). This default value is quite modest and the
>> >> >>> performance can suffer. We propose to increase this value to a
>> >> >>> larger one, for example, 512 (the default TM and network buffer
>> >> >>> configuration can then serve more than 10 result partitions
>> >> >>> concurrently).
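>> >> >>>
>> >> >>> Taken together, the proposed defaults correspond to the following
>> >> >>> flink-conf.yaml entries (a sketch; today these values must be set
>> >> >>> explicitly to get the same effect):
>> >> >>>
>> >> >>>     taskmanager.network.blocking-shuffle.compression.enabled: true
>> >> >>>     taskmanager.network.sort-shuffle.min-parallelism: 128
>> >> >>>     taskmanager.memory.framework.off-heap.batch-shuffle.size: 64m
>> >> >>>     taskmanager.network.sort-shuffle.min-buffers: 512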
>> >> >>>
>> >> >>> We already tested these default values together with the TPC-DS
>> >> >>> benchmark in a cluster, and both the performance and the
>> >> >>> stability improved a lot. These changes can help to improve the
>> >> >>> out-of-the-box experience of blocking shuffle. What do you think
>> >> >>> about these changes? Is there any concern? If there are no
>> >> >>> objections, I will make these changes soon.
>> >> >>>
>> >> >>> Best,
>> >> >>> Yingjie
>> >>
>> >>
>> >> --
>> >> Best, Jingsong Lee
>>
>>
>> --
>> Best, Jingsong Lee
>