Re: [DISCUSS] FLIP-329: Add operator attribute to specify support for object-reuse
Hi Martijn, Sorry for the late reply. I don't think it is feasible to always enable object reuse. If I understand correctly, object reuse is disabled by default to guarantee correctness because we cannot assume that the custom operator/function is safe to enable object reuse. The method proposed in the FLIP is to let the operator inform the Flink runtime whether it is safe to reuse the emitted records. It provides a fine-grained way of controlling the object reuse behavior at the operator level. In the long term, instead of always enabling object reuse, it is better to remove the object-reuse configuration and let the runtime determine whether to enable object reuse for each operator. I hope that addresses your question. Please let me know if you have further comments. Best regards, Xuannan On Fri, Sep 29, 2023 at 8:47 AM Martijn Visser wrote: > > Hi Xuannan, > > I have one question more from a strategic point of view: given that > we're working on Flink 2.0, wouldn't we actually want to be in a > situation where object-reuse is always used and don't make it > configurable anymore? IIRC, the only reason why it's a configuration > is for backward compatibility. > > Best regards, > > Martijn > > On Tue, Sep 26, 2023 at 1:32 AM Xuannan Su wrote: > > > > Hi all, > > > > We would like to revive the discussion and provide a quick update on > > the recent work of the FLIP. We have implemented a POC[1], run cases > > in the flink-benchmarks[2] against the POC, and verified that many of > > the operators in the benchmark will enable object-reuse without code > > changes, while the global object-reuse is disabled. > > > > Please let me know if you have any further comments on the FLIP. If > > there are no more comments, we will open the voting in 3 days. 
> > > > Best regards, > > Xuannan > > > > [1] https://github.com/apache/flink/pull/22897 > > [2] https://github.com/apache/flink-benchmarks > > > > > > On Fri, Jul 7, 2023 at 9:18 AM Dong Lin wrote: > > > > > > Hi Jing, > > > > > > Thank you for the suggestion. Yes, we can extend it to support null if in > > > the future we find any use-case for this flexibility. > > > > > > Best, > > > Dong > > > > > > On Thu, Jul 6, 2023 at 7:55 PM Jing Ge wrote: > > > > > > > Hi Dong, > > > > > > > > one scenario I could imagine is that users could enable global object > > > > reuse features but force deep copy for some user defined specific > > > > functions > > > > because of any limitations. But that is only my gut feeling. And agree, > > > > we > > > > could keep the solution simple for now as FLIP described and upgrade to > > > > 3VL > > > > once there are such real requirements that are rising. > > > > > > > > Best regards, > > > > Jing > > > > > > > > On Thu, Jul 6, 2023 at 12:30 PM Dong Lin wrote: > > > > > > > >> Hi Jing, > > > >> > > > >> Thank you for the detailed explanation. Please see my reply inline. > > > >> > > > >> On Thu, Jul 6, 2023 at 3:17 AM Jing Ge wrote: > > > >> > > > >>> Hi Xuannan, Hi Dong, > > > >>> > > > >>> Thanks for your clarification. > > > >>> > > > >>> @Xuannan > > > >>> > > > >>> A Jira ticket has been created for the doc update: > > > >>> https://issues.apache.org/jira/browse/FLINK-32546 > > > >>> > > > >>> @Dong > > > >>> > > > >>> I don't have a concrete example. I just thought about it from a > > > >>> conceptual or pattern's perspective. Since we have 1. coarse-grained > > > >>> global > > > >>> switch(CGS as abbreviation), i.e. the pipeline.object-reuse and 2. > > > >>> fine-grained local switch(FGS as abbreviation), i.e. 
the > > > >>> objectReuseCompliant variable for specific operators/functions, there > > > >>> will > > > >>> be the following patterns with appropriate combinations: > > > >>> > > > >>> pattern 1: coarse-grained switch only. Local object reuse will be > > > >>> controlled by the coarse-grained switch: > > > >>> 1.1 cgs == true -> local object reused enabled > > > >>> 1.2 cgs == true -> local object reused enabled > > > >>> 1.3 cgs == false -> local object reused disabled, i.e. deep copy > > > >>> enabled > > > >>> 1.4 cgs == false -> local object reused disabled, i.e. deep copy > > > >>> enabled > > > >>> > > > >>> afaiu, this is the starting point. I wrote 4 on purpose to make the > > > >>> regression check easier. We can consider it as the combinations with > > > >>> cgs(true/false) and fgs(true/false) while fgs is ignored. > > > >>> > > > >>> Now we introduce fine-grained switch. There will be two patterns: > > > >>> > > > >>> pattern 2: fine-grained switch over coarse-grained switch. > > > >>> Coarse-grained switch will be ignored when the local fine-grained > > > >>> switch > > > >>> has different value: > > > >>> 2.1 cgs == true and fgs == true -> local object reused enabled > > > >>> 2.2 cgs == true and fgs == false -> local object reused disabled, i.e. > > > >>> deep copy enabled > > > >>> 2.3 cgs == false and fgs == true -> local object reused enabled > > > >>> 2.4 cgs == false and fgs == fa
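The patterns listed so far can be written down as a small truth table. This is only an illustration of the rules as stated in the mail. The class and method names are hypothetical, `cgs` stands for `pipeline.object-reuse`, and `fgs` stands for the per-operator flag. Row 2.4 is cut off in the archive, so the sketch only applies the stated rule (the fine-grained value wins when the two differ) and does not claim to reproduce that row.

```java
/** Illustration only: names are hypothetical and not part of Flink's API. */
final class ObjectReusePatternsSketch {

    /** Pattern 1: only the coarse-grained switch (cgs) counts; fgs is ignored. */
    static boolean pattern1(boolean cgs, boolean fgs) {
        return cgs;
    }

    /** Pattern 2: when the two switches differ, the fine-grained switch (fgs) wins. */
    static boolean pattern2(boolean cgs, boolean fgs) {
        return cgs != fgs ? fgs : cgs;
    }

    public static void main(String[] args) {
        boolean[] values = {true, false};
        for (boolean cgs : values) {
            for (boolean fgs : values) {
                System.out.println("cgs=" + cgs + " fgs=" + fgs
                        + " -> pattern1 reuse=" + pattern1(cgs, fgs)
                        + ", pattern2 reuse=" + pattern2(cgs, fgs));
            }
        }
    }
}
```

Under pattern 2 the result always equals `fgs`, which is why a third "unset" value would be needed if an operator should be able to defer to the global switch.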
[DISCUSS] FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway
Hi, there, We'd like to start a discussion thread on "FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway"[1], where we propose adding a separate configuration option to specify the Java options for the SQL Gateway. This would allow users to fine-tune the memory settings, garbage collection behavior, and other relevant Java parameters specific to the SQL Gateway, ensuring optimal performance and stability in production environments. Looking forward to your feedback. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-374%3A+Adding+a+separate+configuration+for+specifying+Java+Options+of+the+SQL+Gateway Best, Yangze Guo
[jira] [Created] (FLINK-33203) FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway
Yangze Guo created FLINK-33203:

Summary: FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway
Key: FLINK-33203
URL: https://issues.apache.org/jira/browse/FLINK-33203
Project: Flink
Issue Type: Improvement
Components: Table SQL / Gateway
Reporter: Yangze Guo
Assignee: Yangze Guo
Fix For: 1.19.0

The SQL Gateway is an essential component of Flink in OLAP scenarios, and its performance and stability determine the SLA of Flink as an OLAP service. Just like other components in Flink, we propose adding a separate configuration option to specify the Java options for the SQL Gateway. This would allow users to fine-tune the memory settings, garbage collection behavior, and other relevant Java parameters specific to the SQL Gateway, ensuring optimal performance and stability in production environments.
Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling
Hi Shammon,

IIUC, you want more flexibility in controlling the two-phase strategy, right?

> I want this because we would like to add a new slot to TM strategy such as SLOTS_NUM in the future for OLAP to improve the performance for olap jobs, which will use TASKS strategy for task to slot. cc Guoyangze

Actually, a single option can meet your requirement, because it can control both phases. We can add a new enum value for this option; it would use the new strategy for slot to TM and the task_balanced strategy for task to slot.

Of course, I think 2 options are more flexible. If there are too many strategies, 2 options are easier for users.

Also, I have a question: what is the SLOTS_NUM strategy? Isn't it slot balancing at the TM level? I want to check whether it's similar to `cluster.evenly-spread-out-slots`. If they are similar or the same, there aren't too many strategies, and one option may be enough.

Best,
Rui

On Sat, Oct 7, 2023 at 11:29 AM Shammon FY wrote: > Thanks Rui, I checked the code and you're right. > > As you described above, the entire process is actually two independent > steps from slot to TM and task to slot. Currently we use option > `cluster.evenly-spread-out-slots` for both of them. Can we provide > different options for the two steps, such as ANY/SLOTS for RM and ANY/TASKS > for slot pool? > > I want this because we would like to add a new slot to TM strategy such as > SLOTS_NUM in the future for OLAP to improve the performance for olap jobs, > which will use TASKS strategy for task to slot. cc Guoyangze > > Best, > Shammon FY > > On Fri, Oct 6, 2023 at 6:19 PM xiangyu feng wrote: > >> Thanks Yuepeng and Rui for driving this Discussion. >> >> Internally when we try to use Flink 1.17.1 in production, we are also >> suffering from the unbalanced task distribution problem for jobs with high >> qps and complex dag. So +1 for the overall proposal. 
>> >> Some questions about the details: >> >> 1, About the waiting mechanism: Will the waiting mechanism happen only in >> the second level 'assigning slots to TM'? IIUC, the first level >> 'assigning >> Tasks to Slots' needs only the asynchronous slot result from slotpool. >> >> 2, About the slot LoadingWeight: it is reasonable to use the number of >> tasks by default in the beginning, but it would be better if this could be >> easily extended in future to distinguish between CPU-intensive and >> IO-intensive workloads. In some cases, TMs may have IO bottlenecks but >> others have CPU bottlenecks. >> >> Regards, >> Xiangyu >> >> >> On Thu, Oct 5, 2023 at 6:34 PM Yuepeng Pan wrote: >> >> > Hi, Zhu Zhu, >> > >> > Thanks for your feedback! >> > >> > > I think we can introduce a new config option >> > > `taskmanager.load-balance.mode`, >> > > which accepts "None"/"Slots"/"Tasks". >> `cluster.evenly-spread-out-slots` >> > > can be superseded by the "Slots" mode and get deprecated. In the >> future >> > > it can support more mode, e.g. "CpuCores", to work better for jobs >> with >> > > fine-grained resources. The proposed config option >> > > `slot.request.max-interval` >> > > then can be renamed to >> > `taskmanager.load-balance.request-stablizing-timeout` >> > > to show its relation with the feature. The proposed >> > `slot.sharing-strategy` >> > > is not needed, because the configured "Tasks" mode will do the work. >> > >> > The new proposed configuration option sounds good to me. >> > >> > I have a small question: if we set our configuration value to 'Tasks', >> it >> > will initiate two processes: balancing the allocation of task >> quantities at >> > the slot level and balancing the number of tasks across TaskManagers >> (TMs). >> > Alternatively, if we configure it as 'Slots', the system will employ the >> > LocalPreferred allocation policy (which is the default) when assigning >> > tasks to slots, and it will ensure that the number of slots used across >> TMs >> > is balanced. 
>> > So this configuration essentially combines a balanced selection >> strategy >> > across two dimensions into fixed configuration items, right? >> > >> > Please correct me if I've made any errors. >> > >> > Best, >> > Yuepeng. >> > >> >
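The reading of the "None"/"Slots"/"Tasks" modes discussed above can be sketched as one user-facing mode that selects a strategy for each of the two phases. This is a hypothetical illustration and not Flink code: the enum and method names are made up, and the behaviour of "None" (defaults in both phases) is an assumption that the thread does not spell out.

```java
import java.util.Locale;

/** Hypothetical sketch: one user-facing mode selecting both scheduling phases. */
final class LoadBalanceModeSketch {

    /** Values proposed in the thread for `taskmanager.load-balance.mode`. */
    enum Mode { NONE, SLOTS, TASKS }

    /** Phase 1 (slot pool in the JobMaster): how tasks are put into slots. Illustrative names. */
    enum TaskToSlot { LOCAL_PREFERRED, TASK_BALANCED }

    /** Phase 2 (ResourceManager): how slots are placed on TaskManagers. Illustrative names. */
    enum SlotToTm { ANY, SLOT_BALANCED, TASK_BALANCED }

    static Mode parse(String value) {
        return Mode.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /** Only "Tasks" changes the task-to-slot phase; the other modes keep LocalPreferred. */
    static TaskToSlot taskToSlot(Mode mode) {
        return mode == Mode.TASKS ? TaskToSlot.TASK_BALANCED : TaskToSlot.LOCAL_PREFERRED;
    }

    /** "Slots" balances slot counts across TMs, "Tasks" balances task counts across TMs. */
    static SlotToTm slotToTm(Mode mode) {
        switch (mode) {
            case SLOTS:
                return SlotToTm.SLOT_BALANCED;
            case TASKS:
                return SlotToTm.TASK_BALANCED;
            default:
                return SlotToTm.ANY; // assumption: "None" keeps today's default behaviour
        }
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + taskToSlot(mode) + " / " + slotToTm(mode));
        }
    }
}
```

With two separate options, `taskToSlot` and `slotToTm` would each read their own key instead of sharing `Mode`, which is the extra flexibility asked for in the thread.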
Re: [DISCUSS] FLIP-368 Reorganize the exceptions thrown in state interfaces
Hi Jing, Sorry for the late reply! I agree with you that we do not expect users to do anything with Flink and we won't "bother" them with those exceptions. However, users can still catch the `Throwable` and perform any necessary logging activities, similar to how they use Java Collection interfaces. Thanks for your insights! Best, Zakelly On Thu, Sep 21, 2023 at 8:43 PM Jing Ge wrote: > > Fair enough! Thanks Zakelly for the information. Afaic, even users can do > nothing with Flink, they still can do something in their territory, at > least doing some logging and metrics stuff, or triggering some other > services in their ecosystem. After all, the Flink jobs they build are part > of their service component. It didn't change the fact that we are going to > use the anti-pattern. Just because we didn't expect users to do > anything with Flink, does not mean users don't expect to do something with > the expected exception. Anyway, I am open to hearing different opinions. > > Best regards, > Jing > > On Thu, Sep 21, 2023 at 7:02 AM Zakelly Lan wrote: > > > Hi Martijn, > > > > Thanks for the reminder! > > > > This FLIP proposes a change to the state API that is annotated as > > @PublicEvolving and targets version 1.19. I have clarified this in > > the "Proposed Change" section of the FLIP. > > > > > > Hi Jing, > > > > Thanks for sharing your thoughts! Here are my opinions: > > > > 1. The exceptions of the state API are usually treated as critical > > ones. In other words, if anything goes wrong with state accessing, the > > element processing cannot proceed and the job should fail. Flink users > > may not know what to do when they encounter these exceptions. I > > believe this is the main reason why we want to replace them with > > unchecked exceptions. > > 2. There have also been some further discussions[1][2] from Stephan > > and Shixiaogang below the one you pointed out [3], and it seems they > > come to an agreement to use unchecked exceptions. 
After reviewing the > > entire discussion on that PR, I think their arguments are reasonable > > given the use case. > > > > Looking forward to your feedback. > > > > > > Best, > > Zakelly > > > > [1] https://github.com/apache/flink/pull/3380#issuecomment-286807853 > > [2] https://github.com/apache/flink/pull/3380#issuecomment-286932133 > > [3] https://github.com/apache/flink/pull/3380#issuecomment-281631160 > > > > On Thu, Sep 21, 2023 at 1:27 AM Jing Ge > > wrote: > > > > > > sorry, typo: It is a known "anti-pattern" instead of "ant-pattern" > > > > > > Best regards, > > > Jing > > > > > > On Wed, Sep 20, 2023 at 7:23 PM Jing Ge wrote: > > > > > > > Hi Zakelly, > > > > > > > > Thanks for driving this topic. From good software engineering's > > > > perspective, I have different thoughts: > > > > > > > > 1. The idea to get rid of all checked Exceptions and replace them with > > > > unchecked Exceptions is a known ant-pattern: "Generally speaking, do > > not > > > > throw a RuntimeException or create a subclass of RuntimeException > > simply > > > > because you don't want to be bothered with specifying the exceptions > > your > > > > methods can throw." [1] Checked Exceptions mean expected exceptions > > that > > > > can help developers find a way to catch them and decide what to do. It > > is > > > > part of the public API signature that can help developers build robust > > > > systems. We should not mix concepts and build expected exceptions with > > > > unchecked Java Exception classes. > > > > 2. The comment Stephan left [2] clearly pointed out that we should > > avoid > > > > using generic Java Exceptions, and "find some more 'specific' > > exceptions > > > > for the signature, like throws IOException or throws > > StateAccessException." > > > > So, the idea is to define/use specific checked Exception classes > > instead of > > > > using unchecked Exceptions. > > > > > > > > Looking forward to your thoughts. 
> > > > > > > > Best regards, > > > > Jing > > > > > > > > > > > > [1] > > > > > > https://docs.oracle.com/javase/tutorial/essential/exceptions/runtime.html > > > > [2] https://github.com/apache/flink/pull/3380#issuecomment-281631160 > > > > > > > > On Wed, Sep 20, 2023 at 4:52 PM Zakelly Lan > > wrote: > > > > > > > >> Hi Yanfei, > > > >> > > > >> Thanks for your reply! > > > >> > > > >> Yes, this FLIP aims to change all state-related exceptions to > > > >> unchecked exceptions and remove all exceptions from the signature. So > > > >> I believe we have come to an agreement to keep the interfaces simple. > > > >> > > > >> > > > >> Best regards, > > > >> Zakelly > > > >> > > > >> On Wed, Sep 20, 2023 at 2:26 PM Zakelly Lan > > > >> wrote: > > > >> > > > > >> > Hi Hangxiang, > > > >> > > > > >> > Thank you for your response! Here are my thoughts: > > > >> > > > > >> > 1. Regarding the exceptions thrown by internal interfaces, I suggest > > > >> > keeping them as checked exceptions. Since these exceptions will be > > > >> > handled by the internal cal
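To make the two positions in this thread concrete, here is a minimal sketch of what user code could look like if state access threw an unchecked exception. Everything in it is hypothetical: `StateAccessException` borrows the name mentioned in the quoted PR comment but is defined locally, and `ValueStateSketch` is a stand-in interface, not Flink's `ValueState`.

```java
/** Hypothetical sketch: unchecked state exceptions that callers may still catch for logging. */
final class StateExceptionSketch {

    /** Stand-in unchecked exception; not an existing Flink class. */
    static class StateAccessException extends RuntimeException {
        StateAccessException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Stand-in state interface: no `throws` clause in the signature. */
    interface ValueStateSketch<T> {
        T value();
    }

    /** Logs the failure on the user's side, then rethrows so that the job still fails. */
    static <T> T readAndLog(ValueStateSketch<T> state, StringBuilder log) {
        try {
            return state.value();
        } catch (StateAccessException e) {
            log.append("state access failed: ").append(e.getMessage());
            throw e;
        }
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        ValueStateSketch<String> broken = () -> {
            throw new StateAccessException("backend unavailable", null);
        };
        try {
            readAndLog(broken, log);
        } catch (StateAccessException e) {
            System.out.println(log);
        }
    }
}
```

The compiler does not force the `try`/`catch` here, which is the trade-off under discussion: simpler signatures, but the expected failure is no longer visible in the API.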
Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling
Hi Yangze,

> 2. From my understanding, if user enable the cluster.evenly-spread-out-slots, LeastUtilizationResourceMatchingStrategy will be used to determine the slot distribution and the slot allocation in the three TM will be (taskmanager.numberOfTaskSlots=3):
> TM1: 3 slot
> TM2: 2 slot
> TM3: 2 slot

When all TMs are ready in advance, the three TMs will have:
TM1: 3 slots
TM2: 2 slots
TM3: 2 slots

In application mode, the resource manager doesn't request TMs in advance, and there aren't enough slots before the third TM is ready, so all slots of the second TM will be used up. The three TMs will have:
TM1: 3 slots
TM2: 3 slots
TM3: 1 slot

That's why the FLIP adds some notes:

- All *free* slots are in the last TM, because ResourceManager doesn’t have the waiting mechanism, and it just requests 7 slots for this JobMaster.
- Why is it acceptable?
  - If we just add the waiting mechanism to JobMaster but not in ResourceManager, all *free* slots will be in the last TM. All slots of other TMs are offered to JM.
  - That is, only one TM may have fewer tasks than the other TMs. The difference between the number of tasks of other TMs is at most 1. So when *p* >> *slotsPerTM*, the problem can be ignored.
  - We can also suggest that users, when *p* is small, configure *slotsPerTM* to 1 or make *p % slotsPerTM* == 0.

Please correct me if my understanding is wrong, thanks~

Best,
Rui

On Sun, Oct 1, 2023 at 7:38 PM Yangze Guo wrote: > Hi, Rui, > > 1. With the current mechanism, when physical slots are offered from > TM, the JobMaster will start deploying tasks and synchronizing their > states. With the addition of the waiting mechanism, IIUC, the > JobMaster will deploy and synchronize the states of all tasks only > after all resources are available. The task deployment and state > synchronization both occupy the JobMaster's RPC main thread. 
In > complex jobs with a lot of tasks, this waiting mechanism may increase > the pressure on the JobMaster and increase the end-to-end job > deployment time. > > 2. From my understanding, if user enable the > cluster.evenly-spread-out-slots, > LeastUtilizationResourceMatchingStrategy will be used to determine the > slot distribution and the slot allocation in the three TM will be > (taskmanager.numberOfTaskSlots=3): > TM1: 3 slot > TM2: 2 slot > TM3: 2 slot > > Best, > Yangze Guo > > On Sun, Oct 1, 2023 at 6:14 PM Rui Fan <1996fan...@gmail.com> wrote: > > > > Hi Shammon, > > > > Thanks for your feedback as well! > > > > > IIUC, the overall balance is divided into two parts: slot to TM and > task > > to slot. > > > 1. Slot to TM is guaranteed by SlotManager in ResourceManager > > > 2. Task to slot is guaranteed by the slot pool in JM > > > > > > These two are completely independent, what are the benefits of unifying > > > these two into one option? Also, do we want to share the same > > > option between SlotPool in JM and SlotManager in RM? This sounds a bit > > > strange. > > > > Your understanding is totally right, the balance needs 2 parts: slot to > TM > > and task to slot. > > > > As I understand, the following are benefits of unifying them into one > > option: > > > > - Flink users don't care about these principles inside of flink, they > don't > > know these 2 parts. > > - If flink provides 2 options, flink users need to set 2 options for > their > > job. > > - If one option is missed, the final result may not be good. (Users may > > have questions when using) > > - If flink just provides 1 option, enabling one option is enough. (Reduce > > the probability of misconfiguration) > > > > Also, Flink’s options are user-oriented. Each option represents a switch > or > > parameter of a feature. > > A feature may be composed of multiple components inside Flink. > > It might be better to keep only one switch per feature. 
> > > > Actually, the cluster.evenly-spread-out-slots option is used between > > SlotPool in JM and SlotManager in RM. 2 components to ensure > > this feature works well. > > > > Please correct me if my understanding is wrong, > > and looking forward to your feedback, thanks! > > > > Best, > > Rui > > > > On Sun, Oct 1, 2023 at 5:52 PM Rui Fan <1996fan...@gmail.com> wrote: > > > > > Hi Yangze, > > > > > > Thanks for your feedback! > > > > > > > 1. Is it possible for the SlotPool to get the slot allocation results > > > > from the SlotManager in advance instead of waiting for the actual > > > > physical slots to be registered, and perform pre-allocation? The > > > > benefit of doing this is to make the task deployment process > smoother, > > > > especially when there are a large number of tasks in the job. > > > > > > Could you elaborate on that? I didn't understand what's the benefit and > > > smoother. > > > > > > > 2. If user enable the cluster.evenly-spread-out-slots, the issue in > > > > example 2 of section 2.2.3 can be resolved. Do I understand it > > > > correctly? > > > > > > The exam
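The two slot distributions from the example in this thread (7 requested slots, 3 slots per TM) can be reproduced with a small sketch. It is a simplified model for illustration only and not the actual `SlotManager` logic: "spread evenly" picks the least-used TM for each request, and "fill as TMs arrive" assumes each TM is used up before the next one registers. All names are hypothetical.

```java
import java.util.Arrays;

/** Simplified model of the 7-slot example; not Flink's real allocation code. */
final class SlotSpreadSketch {

    /** All TMs are registered in advance: every request goes to the least-used TM. */
    static int[] spreadEvenly(int requestedSlots, int slotsPerTm, int tmCount) {
        int[] used = new int[tmCount];
        for (int i = 0; i < requestedSlots; i++) {
            int target = 0;
            for (int tm = 1; tm < tmCount; tm++) {
                if (used[tm] < used[target]) {
                    target = tm;
                }
            }
            if (used[target] == slotsPerTm) {
                throw new IllegalArgumentException("not enough slots");
            }
            used[target]++;
        }
        return used;
    }

    /** TMs become ready one after another: each one is filled before the next arrives. */
    static int[] fillAsTmsArrive(int requestedSlots, int slotsPerTm) {
        int tmCount = (requestedSlots + slotsPerTm - 1) / slotsPerTm;
        int[] used = new int[tmCount];
        int remaining = requestedSlots;
        for (int tm = 0; tm < tmCount; tm++) {
            used[tm] = Math.min(slotsPerTm, remaining);
            remaining -= used[tm];
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(spreadEvenly(7, 3, 3)));  // [3, 2, 2]
        System.out.println(Arrays.toString(fillAsTmsArrive(7, 3)));  // [3, 3, 1]
    }
}
```

In the second model only the last TM can end up short, which matches the note that the imbalance matters less when p is much larger than slotsPerTM and disappears when p % slotsPerTM == 0.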
Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling
Thanks for the clarification, Rui.

I believe the root cause of this issue is that in the current DefaultResourceAllocationStrategy, slot allocation begins before the decision to request PendingTaskManagers is made. That can be fixed within the strategy without introducing another waiting mechanism. I think it would be better to address this issue within the scope of this FLIP. However, I don't have a strong opinion on it; it depends on your bandwidth.

Best,
Yangze Guo

On Sat, Oct 7, 2023 at 4:16 PM Rui Fan <1996fan...@gmail.com> wrote: > > Hi Yangze, > > > 2. From my understanding, if user enable the > > cluster.evenly-spread-out-slots, > > LeastUtilizationResourceMatchingStrategy will be used to determine the > > slot distribution and the slot allocation in the three TM will be > > (taskmanager.numberOfTaskSlots=3): > > TM1: 3 slot > > TM2: 2 slot > > TM3: 2 slot > > When all tms are ready in advance, the three TM will be: > TM1: 3 slot > TM2: 2 slot > TM3: 2 slot > > For application mode, the resource manager doesn't apply for > TM in advance, and slots aren't enough before the third TM is ready. > So all slots of the second TM will be used up. The three TM will be: > TM1: 3 slot > TM2: 3 slot > TM3: 1 slot > > That's why the FLIP add some notes: > > All free slots are in the last TM, because ResourceManager doesn’t have the > waiting mechanism, and it just requests 7 slots for this JobMaster. > Why is it acceptable? > > If we just add the waiting mechanism to JobMaster but not in ResourceManager, > all free slots will be in the last TM. All slots of other TMs are offered to > JM. > That is, only one TM may have fewer tasks than the other TMs. The difference > between the number of tasks of other TMs is at most 1. So when p >> > slotsPerTM, the problem can be ignored. > We can also suggest users, in cases that p is small, it's better to configure > slotsPerTM to 1, or let p % slotsPerTM == 0. 
> > Please correct me if my understanding is wrong, thanks~ > > Best, > Rui > > On Sun, Oct 1, 2023 at 7:38 PM Yangze Guo wrote: >> >> Hi, Rui, >> >> 1. With the current mechanism, when physical slots are offered from >> TM, the JobMaster will start deploying tasks and synchronizing their >> states. With the addition of the waiting mechanism, IIUC, the >> JobMaster will deploy and synchronize the states of all tasks only >> after all resources are available. The task deployment and state >> synchronization both occupy the JobMaster's RPC main thread. In >> complex jobs with a lot of tasks, this waiting mechanism may increase >> the pressure on the JobMaster and increase the end-to-end job >> deployment time. >> >> 2. From my understanding, if user enable the >> cluster.evenly-spread-out-slots, >> LeastUtilizationResourceMatchingStrategy will be used to determine the >> slot distribution and the slot allocation in the three TM will be >> (taskmanager.numberOfTaskSlots=3): >> TM1: 3 slot >> TM2: 2 slot >> TM3: 2 slot >> >> Best, >> Yangze Guo >> >> On Sun, Oct 1, 2023 at 6:14 PM Rui Fan <1996fan...@gmail.com> wrote: >> > >> > Hi Shammon, >> > >> > Thanks for your feedback as well! >> > >> > > IIUC, the overall balance is divided into two parts: slot to TM and task >> > to slot. >> > > 1. Slot to TM is guaranteed by SlotManager in ResourceManager >> > > 2. Task to slot is guaranteed by the slot pool in JM >> > > >> > > These two are completely independent, what are the benefits of unifying >> > > these two into one option? Also, do we want to share the same >> > > option between SlotPool in JM and SlotManager in RM? This sounds a bit >> > > strange. >> > >> > Your understanding is totally right, the balance needs 2 parts: slot to TM >> > and task to slot. >> > >> > As I understand, the following are benefits of unifying them into one >> > option: >> > >> > - Flink users don't care about these principles inside of flink, they don't >> > know these 2 parts. 
>> > - If flink provides 2 options, flink users need to set 2 options for their >> > job. >> > - If one option is missed, the final result may not be good. (Users may >> > have questions when using) >> > - If flink just provides 1 option, enabling one option is enough. (Reduce >> > the probability of misconfiguration) >> > >> > Also, Flink’s options are user-oriented. Each option represents a switch or >> > parameter of a feature. >> > A feature may be composed of multiple components inside Flink. >> > It might be better to keep only one switch per feature. >> > >> > Actually, the cluster.evenly-spread-out-slots option is used between >> > SlotPool in JM and SlotManager in RM. 2 components to ensure >> > this feature works well. >> > >> > Please correct me if my understanding is wrong, >> > and looking forward to your feedback, thanks! >> > >> > Best, >> > Rui >> > >> > On Sun, Oct 1, 2023 at 5:52 PM Rui Fan <1996fan...@gmail.com> wrote: >> > >> > > Hi Yangze, >> > > >> > > Thanks for your feedback! >> > > >> > > > 1. Is it possible for the SlotPool to ge
Re: [DISCUSS] FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway
Thanks for initiating this discussion. Within the development towards Streaming Warehousing, SQL Gateway will become more and more important. Big +1 to specify Java Options separately for SQL Gateway. Regards, Xiangyu On Sat, Oct 7, 2023 at 3:24 PM Yangze Guo wrote: > Hi, there, > > We'd like to start a discussion thread on "FLIP-374: Adding a separate > configuration for specifying Java Options of the SQL Gateway"[1], > where we propose adding a separate configuration option to specify the > Java options for the SQL Gateway. This would allow users to fine-tune > the memory settings, garbage collection behavior, and other relevant > Java parameters specific to the SQL Gateway, ensuring optimal > performance and stability in production environments. > > Looking forward to your feedback. > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-374%3A+Adding+a+separate+configuration+for+specifying+Java+Options+of+the+SQL+Gateway > > Best, > Yangze Guo >
Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling
Hi Yangze,

Thanks for your quick response!

Sorry, I re-read section 2.2.2[1] about the waiting mechanism and found that it isn't clear. The root cause for introducing the waiting mechanism is that slot requests are sent from the JobMaster to the SlotPool one by one instead of as one whole batch.

I have rewritten section 2.2.2; please read it again when you have time.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-370%3A+Support+Balanced+Tasks+Scheduling#FLIP370:SupportBalancedTasksScheduling-2.2.2Waitingmechanism

Best,
Rui

On Sat, Oct 7, 2023 at 4:34 PM Yangze Guo wrote: > Thanks for the clarification, Rui. > > I believe the root cause of this issue is that in the current > DefaultResourceAllocationStrategy, slot allocation begins before the > decision to PendingTaskManagers requesting is made. That can be fixed > within the strategy without introducing another waiting mechanism. I > think it would be better to address this issue within the scope of > this FLIP. However, I don't have a strong opinion on it, it depends on > your bandwidth. > > > Best, > Yangze Guo > > On Sat, Oct 7, 2023 at 4:16 PM Rui Fan <1996fan...@gmail.com> wrote: > > > > Hi Yangze, > > > > > 2. From my understanding, if user enable the > > > cluster.evenly-spread-out-slots, > > > LeastUtilizationResourceMatchingStrategy will be used to determine the > > > slot distribution and the slot allocation in the three TM will be > > > (taskmanager.numberOfTaskSlots=3): > > > TM1: 3 slot > > > TM2: 2 slot > > > TM3: 2 slot > > > > When all tms are ready in advance, the three TM will be: > > TM1: 3 slot > > TM2: 2 slot > > TM3: 2 slot > > > > For application mode, the resource manager doesn't apply for > > TM in advance, and slots aren't enough before the third TM is ready. > > So all slots of the second TM will be used up. 
The three TM will be: > > TM1: 3 slot > > TM2: 3 slot > > TM3: 1 slot > > > > That's why the FLIP add some notes: > > > > All free slots are in the last TM, because ResourceManager doesn’t have > the waiting mechanism, and it just requests 7 slots for this JobMaster. > > Why is it acceptable? > > > > If we just add the waiting mechanism to JobMaster but not in > ResourceManager, all free slots will be in the last TM. All slots of other > TMs are offered to JM. > > That is, only one TM may have fewer tasks than the other TMs. The > difference between the number of tasks of other TMs is at most 1.So When p > >> slotsPerTM, the problem can be ignored. > > We can also suggest users, in cases that p is small, it's better to > configure slotsPerTM to 1, or let p % slotsPerTM == 0. > > > > Please correct me if my understanding is wrong, thanks~ > > > > Best, > > Rui > > > > On Sun, Oct 1, 2023 at 7:38 PM Yangze Guo wrote: > >> > >> Hi, Rui, > >> > >> 1. With the current mechanism, when physical slots are offered from > >> TM, the JobMaster will start deploying tasks and synchronizing their > >> states. With the addition of the waiting mechanism, IIUC, the > >> JobMaster will deploy and synchronize the states of all tasks only > >> after all resources are available. The task deployment and state > >> synchronization both occupy the JobMaster's RPC main thread. In > >> complex jobs with a lot of tasks, this waiting mechanism may increase > >> the pressure on the JobMaster and increase the end-to-end job > >> deployment time. > >> > >> 2. 
From my understanding, if user enable the > >> cluster.evenly-spread-out-slots, > >> LeastUtilizationResourceMatchingStrategy will be used to determine the > >> slot distribution and the slot allocation in the three TM will be > >> (taskmanager.numberOfTaskSlots=3): > >> TM1: 3 slot > >> TM2: 2 slot > >> TM3: 2 slot > >> > >> Best, > >> Yangze Guo > >> > >> On Sun, Oct 1, 2023 at 6:14 PM Rui Fan <1996fan...@gmail.com> wrote: > >> > > >> > Hi Shammon, > >> > > >> > Thanks for your feedback as well! > >> > > >> > > IIUC, the overall balance is divided into two parts: slot to TM and > task > >> > to slot. > >> > > 1. Slot to TM is guaranteed by SlotManager in ResourceManager > >> > > 2. Task to slot is guaranteed by the slot pool in JM > >> > > > >> > > These two are completely independent, what are the benefits of > unifying > >> > > these two into one option? Also, do we want to share the same > >> > > option between SlotPool in JM and SlotManager in RM? This sounds a > bit > >> > > strange. > >> > > >> > Your understanding is totally right, the balance needs 2 parts: slot > to TM > >> > and task to slot. > >> > > >> > As I understand, the following are benefits of unifying them into one > >> > option: > >> > > >> > - Flink users don't care about these principles inside of flink, they > don't > >> > know these 2 parts. > >> > - If flink provides 2 options, flink users need to set 2 options for > their > >> > job. > >> > - If one option is missed, the final result may not be good. (Users > may > >> > have questions when using) > >> > - If flink just provides 1 option, enabling one option is en
Re: [DISCUSS] FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway
It's quite intuitive to provide such a configuration for the SQL Gateway. Thanks Yangze for bringing this up; looking forward to it.

Best,
Zakelly

On Sat, Oct 7, 2023 at 4:35 PM xiangyu feng wrote:
>
> Thanks for initiating this discussion. Within the development towards Streaming Warehousing, SQL Gateway will become more and more important.
> Big +1 to specify Java Options separately for SQL Gateway.
>
> Regards,
> Xiangyu
>
> Yangze Guo wrote on Sat, Oct 7, 2023 at 15:24:
>
> > Hi, there,
> >
> > We'd like to start a discussion thread on "FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway"[1], where we propose adding a separate configuration option to specify the Java options for the SQL Gateway. This would allow users to fine-tune the memory settings, garbage collection behavior, and other relevant Java parameters specific to the SQL Gateway, ensuring optimal performance and stability in production environments.
> >
> > Looking forward to your feedback.
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-374%3A+Adding+a+separate+configuration+for+specifying+Java+Options+of+the+SQL+Gateway
> >
> > Best,
> > Yangze Guo
> >
Re: [DISCUSS] FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway
Thanks to Yangze for driving this proposal.

`env.java.opts.xxx` is already supported for client, historyserver, jobmanager and taskmanager, and it's very useful for troubleshooting. So +1 for `env.java.opts.sql-gateway`.

I have a minor question: doesn't `env.java.opts.all` support sql-gateway? If yes, it's fine. If no, it would be better to make it a subtask of this FLIP.

Best,
Rui

On Sat, Oct 7, 2023 at 4:35 PM xiangyu feng wrote:

> Thanks for initiating this discussion. Within the development towards Streaming Warehousing, SQL Gateway will become more and more important.
> Big +1 to specify Java Options separately for SQL Gateway.
>
> Regards,
> Xiangyu
>
> Yangze Guo wrote on Sat, Oct 7, 2023 at 15:24:
>
> > Hi, there,
> >
> > We'd like to start a discussion thread on "FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway"[1], where we propose adding a separate configuration option to specify the Java options for the SQL Gateway. This would allow users to fine-tune the memory settings, garbage collection behavior, and other relevant Java parameters specific to the SQL Gateway, ensuring optimal performance and stability in production environments.
> >
> > Looking forward to your feedback.
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-374%3A+Adding+a+separate+configuration+for+specifying+Java+Options+of+the+SQL+Gateway
> >
> > Best,
> > Yangze Guo
> >
>
Re: [DISCUSS] FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway
Thanks, Yangze, for preparing this FLIP. It's good to have this ability for the gateway, since we already have it for other JVM processes (client/JM/TM), as Rui mentioned.

Rui Fan <1996fan...@gmail.com> wrote on Sat, Oct 7, 2023 at 18:02:
>
> Thanks to Yangze for driving this proposal.
>
> `env.java.opts.xxx` is already supported for client, historyserver, jobmanager and taskmanager, and it's very useful for troubleshooting. So +1 for `env.java.opts.sql-gateway`.
>
> I have a minor question: doesn't `env.java.opts.all` support sql-gateway? If yes, it's fine. If no, it would be better to make it a subtask of this FLIP.
>
> Best,
> Rui
>
>
> On Sat, Oct 7, 2023 at 4:35 PM xiangyu feng wrote:
>
> > Thanks for initiating this discussion. Within the development towards Streaming Warehousing, SQL Gateway will become more and more important.
> > Big +1 to specify Java Options separately for SQL Gateway.
> >
> > Regards,
> > Xiangyu
> >
> > Yangze Guo wrote on Sat, Oct 7, 2023 at 15:24:
> >
> > > Hi, there,
> > >
> > > We'd like to start a discussion thread on "FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway"[1], where we propose adding a separate configuration option to specify the Java options for the SQL Gateway. This would allow users to fine-tune the memory settings, garbage collection behavior, and other relevant Java parameters specific to the SQL Gateway, ensuring optimal performance and stability in production environments.
> > >
> > > Looking forward to your feedback.
> > >
> > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-374%3A+Adding+a+separate+configuration+for+specifying+Java+Options+of+the+SQL+Gateway
> > >
> > > Best,
> > > Yangze Guo
> > >
> >

--
Best,
Benchao Li
Re: [DISCUSS] FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway
Thanks for all your comments.

@Rui
Thanks for the reminder. `env.java.opts.all` already takes effect for the SQL Gateway.

Best,
Yangze Guo

On Sat, Oct 7, 2023 at 6:45 PM Benchao Li wrote:
>
> Thanks, Yangze, for preparing this FLIP. It's good to have this ability for the gateway, since we already have it for other JVM processes (client/JM/TM), as Rui mentioned.
>
> Rui Fan <1996fan...@gmail.com> wrote on Sat, Oct 7, 2023 at 18:02:
> >
> > Thanks to Yangze for driving this proposal.
> >
> > `env.java.opts.xxx` is already supported for client, historyserver, jobmanager and taskmanager, and it's very useful for troubleshooting. So +1 for `env.java.opts.sql-gateway`.
> >
> > I have a minor question: doesn't `env.java.opts.all` support sql-gateway? If yes, it's fine. If no, it would be better to make it a subtask of this FLIP.
> >
> > Best,
> > Rui
> >
> >
> > On Sat, Oct 7, 2023 at 4:35 PM xiangyu feng wrote:
> >
> > > Thanks for initiating this discussion. Within the development towards Streaming Warehousing, SQL Gateway will become more and more important.
> > > Big +1 to specify Java Options separately for SQL Gateway.
> > >
> > > Regards,
> > > Xiangyu
> > >
> > > Yangze Guo wrote on Sat, Oct 7, 2023 at 15:24:
> > >
> > > > Hi, there,
> > > >
> > > > We'd like to start a discussion thread on "FLIP-374: Adding a separate configuration for specifying Java Options of the SQL Gateway"[1], where we propose adding a separate configuration option to specify the Java options for the SQL Gateway. This would allow users to fine-tune the memory settings, garbage collection behavior, and other relevant Java parameters specific to the SQL Gateway, ensuring optimal performance and stability in production environments.
> > > >
> > > > Looking forward to your feedback.
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-374%3A+Adding+a+separate+configuration+for+specifying+Java+Options+of+the+SQL+Gateway
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > >
>
> --
>
> Best,
> Benchao Li
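To illustrate the question raised in this thread, here is a minimal Python sketch of how a shared option and a component-specific option could combine. The keys `env.java.opts.all` and `env.java.opts.sql-gateway` come from the thread; the helper `effective_java_opts`, the sample JVM flags, and the assumption that the component-specific value is appended after the shared one are illustrative only and are not taken from Flink's startup scripts.

```python
def effective_java_opts(config, component):
    """Join the shared JVM options with the component-specific ones.

    Assumption for illustration: shared options come first and the
    component-specific options are appended after them.
    """
    shared = config.get("env.java.opts.all", "")
    specific = config.get(f"env.java.opts.{component}", "")
    return " ".join(part for part in (shared, specific) if part)


if __name__ == "__main__":
    # Hypothetical configuration values, chosen only for the example.
    config = {
        "env.java.opts.all": "-XX:+UseG1GC",
        "env.java.opts.sql-gateway": "-Xmx2g",
    }
    print(effective_java_opts(config, "sql-gateway"))
    print(effective_java_opts(config, "taskmanager"))
```

With a separate key, the SQL Gateway can receive its own memory or GC flags while other processes only see the shared ones.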
RE: [ANNOUNCE] Apache Flink Stateful Functions Release 3.3.0 released
Hello Martijn,

I do not see the images on Dockerhub yet. Is there an alternative source we can use in the meantime?

Thanks,
Frans

-Original Message-
From: Martijn Visser
Sent: Tuesday, September 26, 2023 5:34 AM
To: dev@flink.apache.org
Subject: Re: [ANNOUNCE] Apache Flink Stateful Functions Release 3.3.0 released

Hi Frans,

Good remark. I still need to provide the images to those who have access to the Dockerhub, but I haven't been able to do that yet. Hopefully I can do that at the end of the week.

Best regards,

Martijn

On Mon, Sep 25, 2023 at 2:04 PM wrote:
>
> Hi Martijn.
>
> Thanks for this. Should there also be docker images available?
> https://hub.docker.com/r/apache/flink-statefun/tags goes up to 3.2.0.
>
> Frans
>
> -Original Message-
> From: Martijn Visser
> Sent: Tuesday, September 19, 2023 11:37 AM
> To: dev@flink.apache.org; user ; user-zh ; n...@flink.apache.org; annou...@apache.org
> Subject: [ANNOUNCE] Apache Flink Stateful Functions Release 3.3.0 released
>
> Stateful Functions is a cross-platform stack for building Stateful Serverless applications, making it radically simpler to develop scalable, consistent, and elastic distributed applications. This new release upgrades the Flink runtime to 1.16.2.
>
> Release highlight:
> - Upgrade underlying Flink dependency to 1.16.2
>
> Release blogpost:
> https://flink.apache.org/2023/09/19/stateful-functions-3.3.0-release-announcement/
>
> The release is available for download at:
> https://flink.apache.org/downloads/
>
> Java SDK can be found at:
> https://search.maven.org/artifact/org.apache.flink/statefun-sdk-java/3.3.0/jar
>
> Python SDK can be found at:
> https://pypi.org/project/apache-flink-statefun/
>
> GoLang SDK can be found at:
> https://github.com/apache/flink-statefun/tree/statefun-sdk-go/v3.3.0
>
> JavaScript SDK can be found at:
> https://www.npmjs.com/package/apache-flink-statefun
>
> Official Docker image for Flink Stateful Functions can be found at:
> https://hub.docker.com/r/apache/flink-statefun
>
> The full release notes are available in Jira:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351276
>
> We would like to thank all contributors of the Apache Flink community who made this release possible!
>
> Regards,
> Martijn Visser
>
>
[jira] [Created] (FLINK-33204) Add description for missing options in the all jobmanager/taskmanager options section in document
Zhanghao Chen created FLINK-33204:

Summary: Add description for missing options in the all jobmanager/taskmanager options section in document
Key: FLINK-33204
URL: https://issues.apache.org/jira/browse/FLINK-33204
Project: Flink
Issue Type: Technical Debt
Components: Runtime / Configuration
Affects Versions: 1.17.0, 1.18.0
Reporter: Zhanghao Chen
Fix For: 1.19.0

There are 4 options which are excluded from the all jobmanager/taskmanager options section in the configuration document:
# taskmanager.bind-host
# taskmanager.rpc.bind-port
# jobmanager.bind-host
# jobmanager.rpc.bind-port

We should add them to the document under the all jobmanager/taskmanager options section for completeness.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33205) Replace Akka with Pekko in the description of "pekko.ssl.enabled"
Zhanghao Chen created FLINK-33205:

Summary: Replace Akka with Pekko in the description of "pekko.ssl.enabled"
Key: FLINK-33205
URL: https://issues.apache.org/jira/browse/FLINK-33205
Project: Flink
Issue Type: Technical Debt
Components: Runtime / Configuration
Affects Versions: 1.18.0
Reporter: Zhanghao Chen
Fix For: 1.19.0

--
This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling
Thanks for the updates, Rui.

It does seem challenging to ensure evenness in slot deployment unless we introduce batch slot requests in SlotPool. However, one possibility is to add a delay of around 50ms during the SlotPool's resource requirement declaration to the ResourceManager, similar to the checkResourceRequirementsWithDelay in the SlotManager. In most cases, this delay would allow the SlotManager to see all resource requirements, so that it can allocate the slots more evenly. As a side effect, it could also significantly reduce the number of RPC messages to the ResourceManager, which could become a single-point bottleneck in OLAP scenarios. WDYT?

Best,
Yangze Guo

On Sat, Oct 7, 2023 at 5:52 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> Hi Yangze,
>
> Thanks for your quick response!
>
> Sorry, I re-read the 2.2.2 part[1] about the Waiting Mechanism and found it isn't clear. The root cause of introducing the waiting mechanism is that the slot requests are sent from JobMaster to SlotPool one by one instead of in one whole batch. I have rewritten the 2.2.2 part, please read it again in your free time.
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-370%3A+Support+Balanced+Tasks+Scheduling#FLIP370:SupportBalancedTasksScheduling-2.2.2Waitingmechanism
>
> Best,
> Rui
>
> On Sat, Oct 7, 2023 at 4:34 PM Yangze Guo wrote:
>>
>> Thanks for the clarification, Rui.
>>
>> I believe the root cause of this issue is that in the current DefaultResourceAllocationStrategy, slot allocation begins before the decision on requesting PendingTaskManagers is made. That can be fixed within the strategy without introducing another waiting mechanism. I think it would be better to address this issue within the scope of this FLIP. However, I don't have a strong opinion on it, it depends on your bandwidth.
>>
>>
>> Best,
>> Yangze Guo
>>
>> On Sat, Oct 7, 2023 at 4:16 PM Rui Fan <1996fan...@gmail.com> wrote:
>> >
>> > Hi Yangze,
>> >
>> > > 2. From my understanding, if the user enables cluster.evenly-spread-out-slots, LeastUtilizationResourceMatchingStrategy will be used to determine the slot distribution, and the slot allocation in the three TMs will be (taskmanager.numberOfTaskSlots=3):
>> > > TM1: 3 slots
>> > > TM2: 2 slots
>> > > TM3: 2 slots
>> >
>> > When all TMs are ready in advance, the three TMs will be:
>> > TM1: 3 slots
>> > TM2: 2 slots
>> > TM3: 2 slots
>> >
>> > For application mode, the resource manager doesn't apply for TMs in advance, and slots aren't enough before the third TM is ready. So all slots of the second TM will be used up. The three TMs will be:
>> > TM1: 3 slots
>> > TM2: 3 slots
>> > TM3: 1 slot
>> >
>> > That's why the FLIP adds some notes:
>> >
>> > All free slots are in the last TM, because ResourceManager doesn’t have the waiting mechanism, and it just requests 7 slots for this JobMaster.
>> > Why is it acceptable?
>> >
>> > If we just add the waiting mechanism to JobMaster but not to ResourceManager, all free slots will be in the last TM. All slots of other TMs are offered to JM.
>> > That is, only one TM may have fewer tasks than the other TMs. The difference between the number of tasks of other TMs is at most 1. So when p >> slotsPerTM, the problem can be ignored.
>> > We can also suggest to users that, when p is small, it's better to configure slotsPerTM to 1, or let p % slotsPerTM == 0.
>> >
>> > Please correct me if my understanding is wrong, thanks~
>> >
>> > Best,
>> > Rui
>> >
>> > On Sun, Oct 1, 2023 at 7:38 PM Yangze Guo wrote:
>> >>
>> >> Hi, Rui,
>> >>
>> >> 1. With the current mechanism, when physical slots are offered from TM, the JobMaster will start deploying tasks and synchronizing their states. With the addition of the waiting mechanism, IIUC, the JobMaster will deploy and synchronize the states of all tasks only after all resources are available. The task deployment and state synchronization both occupy the JobMaster's RPC main thread. In complex jobs with a lot of tasks, this waiting mechanism may increase the pressure on the JobMaster and increase the end-to-end job deployment time.
>> >>
>> >> 2. From my understanding, if the user enables cluster.evenly-spread-out-slots, LeastUtilizationResourceMatchingStrategy will be used to determine the slot distribution, and the slot allocation in the three TMs will be (taskmanager.numberOfTaskSlots=3):
>> >> TM1: 3 slots
>> >> TM2: 2 slots
>> >> TM3: 2 slots
>> >>
>> >> Best,
>> >> Yangze Guo
>> >>
>> >> On Sun, Oct 1, 2023 at 6:14 PM Rui Fan <1996fan...@gmail.com> wrote:
>> >> >
>> >> > Hi Shammon,
>> >> >
>> >> > Thanks for your feedback as well!
>> >> >
>> >> > > IIUC, the overall balance is divided into two parts: slot to TM and task to slot.
>> >> > > 1. Slot to TM is guaranteed by SlotManager in ResourceManager
>> >> > > 2. Task to slot is guarant