[RESULT] [VOTE] FLIP-418: Show data skew score on Flink Dashboard
I am happy to announce that “FLIP-418: Show data skew score on Flink Dashboard” has been accepted with consensus.

FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard

Votes:
* Aleksandr Pilipenko +1 (non-binding)
* Danny Cranmer +1 (binding)
* Hong Liang +1 (binding)
* Rui Fan +1 (binding)
* Yuepeng Pan +1 (non-binding)

There are no disapproving votes. Thanks all!

Emre
RE: [VOTE] FLIP-418: Show data skew score on Flink Dashboard
Thanks all, this vote is now closed. I will announce the results on a separate thread.

On 2024/01/29 10:09:10 "Kartoglu, Emre" wrote:
> Hello,
>
> I'd like to call votes on FLIP-418: Show data skew score on Flink Dashboard.
>
> FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
> Discussion: https://lists.apache.org/thread/m5ockoork0h2zr78h77dcrn71rbt35ql
>
> Kind regards,
> Emre
Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
Hi Rui,

Thanks for the useful feedback and for caring about the user experience. I will update the FLIP based on one comment; I consider this a minor update. Please find my detailed responses below.

"numRecordsInPerSecond sounds reasonable to me, and I think it's necessary to mention it in the FLIP wiki. It will let other developers easily understand it. WDYT?"

I feel like this might be touching implementation details. No objections though, I will update the FLIP with this as one of the ways in which we can achieve the proposal.

"After reading the FLIP and average absolute deviation in detail, we know 0% is the best and 100% is the worst."

Correct.

"I guess it is difficult for users who have not read the documentation to know the meaning of 50%. We hope that the designed data skew score will be easy for users to understand without reading or learning a series of background materials."

I think I understand where you're coming from. My thought is that the user won't have to know exactly how the skew percentage/score is calculated; the score will act as a warning sign. Upon seeing a skew score of 80% for an operator, as a user I will go and click on the operator to see that many of my subtasks are not receiving any data at all. So it acts as a metric that draws the user's attention to the skewed operator so they can fix issues.

"For example, as you mentioned before, Flink has a metric: numRecordsInPerSecond. I believe users know what numRecordsInPerSecond means even if they haven't read any documentation."

The FLIP suggests that we will provide an explanation of the data skew score under the proposed Data Skew tab. I would like the exact wording to be left to the code review process, to prevent it from blocking the implementation work/progress. This will be a user-friendly explanation with an option for the curious user to see the exact formula.
Kind regards,
Emre

On 01/02/2024, 03:26, "Rui Fan" <1996fan...@gmail.com> wrote:

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.

> I was thinking about using the existing numRecordsInPerSecond metric

numRecordsInPerSecond sounds reasonable to me, and I think it's necessary to mention it in the FLIP wiki. It will let other developers easily understand it. WDYT? BTW, that's why I asked whether the data skew score means total received records.

> this would always give you a score higher than 1, with no way to cap the score.

Yeah, you are right. max/mean is not a score, it's the data skew multiple. And I guess max/mean is easier to understand than average absolute deviation.

> I'm more used to working with percentages. The problem with the max/mean metric is I wouldn't immediately know whether a score of 300 is bad for instance.
> Whereas if users saw above 50% as suggested in the FLIP for instance, they would consider taking action. I'm tempted to push back on this suggestion. Happy to discuss further, there is a chance I'm not seeing the downside of the proposed percentage based metric yet. Please let me know.

After reading the FLIP and average absolute deviation in detail, we know 0% is the best and 100% is the worst. I guess it is difficult for users who have not read the documentation to know the meaning of 50%. We hope that the designed data skew score will be easy for users to understand without reading or learning a series of background materials.

For example, as you mentioned before, Flink has a metric: numRecordsInPerSecond. I believe users know what numRecordsInPerSecond means even if they haven't read any documentation.

Of course, I'm open to it. I may have missed something. I'd like to hear more feedback from the community.
Best,
Rui

On Thu, Feb 1, 2024 at 4:13 AM Kartoglu, Emre <kar...@amazon.co.uk.INVALID> wrote:

> Hi Rui,
>
> "and provide the total and current score in the detailed tab. I didn't see the detailed design in the FLIP, would you mind improving the design doc? Thanks."
>
> It will essentially be a basic list view similar to the "Checkpoints" tab. I only briefly mentioned this in the FLIP because it will be a basic list view. No problem though, I will update the FLIP.
>
> Please find my responses below quotations.
>
> "1. About the current skew score, I still don't understand how to get the list_of_number_of_records_received_by_each_subtask for each subtask.
>
> the list_of_number_of_records_received_by_each_subtask of subtask 1 is total received records of subtask 1 from beginning to now - total received records of subtask 1 from beginning to (now - 1min), right?"
>
> Yes, essential
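To make the 0%–100% score debated in this thread concrete, here is a minimal sketch of one possible reading of the FLIP's percentage-based metric — assuming the score is the average absolute deviation of the per-subtask record counts, normalised by the mean and capped at 100%. The exact formula lives in the FLIP; this is only an interpretation for illustration, not Flink code:

```python
# Hedged sketch of one reading of the FLIP-418 percentage score.
# Assumption: score = (average absolute deviation / mean) * 100, capped at 100.
# 0% = perfectly balanced subtasks, 100% = heavily skewed.

def skew_score_percent(records_per_subtask):
    n = len(records_per_subtask)
    mean = sum(records_per_subtask) / n
    if mean == 0:
        return 0.0  # no records received yet, so no observable skew
    avg_abs_dev = sum(abs(x - mean) for x in records_per_subtask) / n
    return min(avg_abs_dev / mean * 100, 100.0)

print(skew_score_percent([10, 10, 10]))       # balanced -> 0.0
print(skew_score_percent([100, 0, 0, 0, 0]))  # one hot subtask -> 100.0
```

Under this reading, a score of 50% would mean subtask loads deviate from the mean by half the mean on average — which matches the "warning sign" framing Emre describes, without the user needing to know the formula.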
Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
Hi Rui,

"and provide the total and current score in the detailed tab. I didn't see the detailed design in the FLIP, would you mind improving the design doc? Thanks."

It will essentially be a basic list view similar to the "Checkpoints" tab. I only briefly mentioned this in the FLIP because it will be a basic list view. No problem though, I will update the FLIP.

Please find my responses below quotations.

"1. About the current skew score, I still don't understand how to get the list_of_number_of_records_received_by_each_subtask for each subtask. the list_of_number_of_records_received_by_each_subtask of subtask 1 is total received records of subtask 1 from beginning to now - total received records of subtask 1 from beginning to (now - 1min), right?"

Yes, essentially correct. I was thinking about using the existing numRecordsInPerSecond metric (see https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/); this would give us per-second granularity and would be more "current/live" than per minute.

"IIUC, your proposed score is between 0% and 100%, and 0% is the best. And 100% is the worst."

Correct.

"For data skew, I'm not sure whether a multiple value is more intuitive. It means data skew score = max / mean. The data skew score is between 1 and infinity. 1 is the best, and the bigger, the worse."

I'm not sure I follow you here. Yes, this would always give you a score higher than 1, with no way to cap the score. I'm more used to working with percentages. The problem with the max/mean metric is I wouldn't immediately know whether a score of 300 is bad, for instance. Whereas if users saw above 50%, as suggested in the FLIP, they would consider taking action. I'm tempted to push back on this suggestion. Happy to discuss further; there is a chance I'm not seeing the downside of the proposed percentage-based metric yet. Please let me know.
Kind regards,
Emre

On 31/01/2024, 10:57, "Rui Fan" <1996fan...@gmail.com> wrote:

Sorry for the late reply.

> So you would have a high data skew while 1 subtask is receiving all the data, but on average (say over 1-2 days) data skew would come down to 0 because all subtasks would have received their portion of the data.
> I'm inclined to think that the current proposal might still be fair, as you do indeed have a skew by definition (but an intentional one). We can have a few ways forward:
>
> 0) We can keep the behaviour as proposed. My thoughts are that data skew is data skew, however intentional it may be. It is not necessarily bad, like in your example.

It makes sense to me. Flink should show data skew correctly regardless of whether the skew is intentional or not.

> 1) Show data skew based on the beginning of time (not a live/current score). I mentioned some downsides to this in the FLIP: If you break or fix your data skew recently, the historical data might hide the recent fix/breakage, and it is inconsistent with the other metrics shown on the vertices e.g. Backpressure/Busy metrics show the live/current score.
>
> 2) We can choose not to put data skew score on the vertices on the job graph. And instead just use the new proposed Data Skew tab which could show live/current skew score and the total data skew score from the beginning of the job.

It makes sense, we can show the current skew score in the DAG WebUI by default, and provide the total and current score in the detailed tab. I didn't see the detailed design in the FLIP, would you mind improving the design doc? Thanks.

Also, I have 2 questions for now:

1. About the current skew score, I still don't understand how to get the list_of_number_of_records_received_by_each_subtask for each subtask.
The list_of_number_of_records_received_by_each_subtask of subtask 1 is: total received records of subtask 1 from beginning to now - total received records of subtask 1 from beginning to (now - 1min), right? Note: 1min is an example. 30s or 2min is fine for me.

2. The skew score is a percentage.

I'm not sure whether showing the score in percentage format is reasonable. For the busy ratio or backpressure ratio, the percentage format is intuitive. IIUC, your proposed score is between 0% and 100%, and 0% is the best. And 100% is the worst.

For data skew, I'm not sure whether a multiple value is more intuitive. It means data skew score = max / mean. For example, if we have 5 subtasks and the received record counts are [10, 10, 10, 100, 10], then data skew score = max / mean = 100 / (140/5) = 100/28 = 3.57. The data skew score is between 1 and infinity. 1 is the best, and the bigger, the worse.
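Rui's max/mean alternative is simple enough to sketch directly; the snippet below just reproduces the arithmetic from the email (an illustration only, not Flink code):

```python
# The "data skew multiple" from Rui's email: max / mean of per-subtask counts.
# 1.0 means perfectly balanced; the bigger the value, the worse the skew.

def skew_multiple(records_per_subtask):
    mean = sum(records_per_subtask) / len(records_per_subtask)
    return max(records_per_subtask) / mean

# Rui's example: 5 subtasks with received record counts [10, 10, 10, 100, 10].
print(round(skew_multiple([10, 10, 10, 100, 10]), 2))  # 100 / 28 = 3.57
```

As the thread notes, this value is unbounded above, which is the crux of the percentage-vs-multiple debate: a multiple of 300 has no obvious "bad" threshold, whereas a capped percentage does.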
[VOTE] FLIP-418: Show data skew score on Flink Dashboard
Hello,

I'd like to call votes on FLIP-418: Show data skew score on Flink Dashboard.

FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
Discussion: https://lists.apache.org/thread/m5ockoork0h2zr78h77dcrn71rbt35ql

Kind regards,
Emre
Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
Hi Krzysztof,

Thank you for the feedback! Please find my comments below.

1. Configurability

Adding a feature flag/configuration to enable this is still on the table as far as I am concerned. However, I believe adding a new metric shouldn't warrant a flag/configuration. One might argue that we should have one for showing the metrics on the Flink UI, and I'd appreciate input on this. My default position is to not have a configuration/flag unless there is a good reason (e.g. it turns out there is an impact on the Flink UI for a so far unknown reason). This is because the proposed change should only improve the experience, without any unwanted side effect.

2. Metrics

I agree the new metrics should be compatible with the rest of the Flink metric reporting mechanism. I will update the FLIP and propose names for the metrics.

Kind regards,
Emre

On 23/01/2024, 10:31, "Krzysztof Dziołak" <kdzio...@live.com> wrote:

Hi Emre,

Thank you for driving this proposal. I've got two questions about extensions to the proposal that are not captured in the FLIP.

1. Configurability - what kind of configuration would you propose to maintain for this feature? Would an on/off switch and/or the aggregation period length be configurable? Should we capture the toggles in the FLIP?

2. Metrics - are we planning to emit the skew metric via the metric reporters mechanism? Should we capture the proposed metric schema in the FLIP?

Kind regards,
Krzysztof

From: Kartoglu, Emre <kar...@amazon.co.uk.INVALID>
Sent: Monday, January 15, 2024 4:59 PM
To: dev@flink.apache.org
Subject: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

Hello,

I’m opening this thread to discuss a FLIP[1] to make data skew more visible on the Flink Dashboard.
Data skew is currently not as visible as it should be. Users have to click each operator and check how much data each sub-task is processing, and compare the sub-tasks against each other. This is especially cumbersome and error-prone for jobs with big job graphs and high parallelism. I’m proposing this FLIP to improve this.

Kind regards,
Emre

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
Re: Re:[DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
Hi Xuyang,

Thanks for the feedback! Please find my response below.

> 1. How will the colors of vertices with high data skew scores be unified with the existing backpressure and high busyness colors on the UI? Users should be able to distinguish at a glance which vertices in the entire job graph are skewed.

The current proposal does not suggest changing the colours of the vertices based on data skew. In another exchange with Rui, we touch on why data skew might not necessarily be bad (for instance, if data skew is the designed behaviour). The colours are currently dedicated to the Busy/Backpressure metrics. I would not be keen on introducing another colour, or using the same colours for data skew, as I am not sure whether that would help or confuse users. I am also keen to keep the scope of this FLIP as minimal as possible, with as few contentious points as possible. We could revisit this point in future FLIPs, if it does not become a blocker for this one. Please let me know your thoughts.

> 2. Could you tell me whether you prefer to unify the Data Skew Score and the Exception tab? In my opinion, Data Skew Score is in the same category as the existing Backpressured and Busy metrics.

The FLIP does not propose to unify the Data Skew tab and the Exception tab. The proposed Data Skew tab would sit next to the Exception tab (but I'm not too opinionated on where it sits). Backpressure and Busy metrics are somewhat special in that they have high visibility thanks to the vertices changing colours based on their value. I agree that Data Skew is in the same category, in that it can be used as an indicator of the job's health. I'm not sure if the suggestion here is to not introduce a tab for data skew? I'd appreciate some clarification here.

Look forward to hearing your thoughts.

Emre

On 16/01/2024, 06:05, "Xuyang" <xyzhong...@163.com> wrote:

Hi, Emre.

In large-scale production jobs, the phenomenon of data skew often occurs. Having a metric on the UI that reflects data skew, without the need for manual inspection of each vertex by clicking on them, would be quite cool. This could help users quickly identify problematic nodes, simplifying development and operations.

I'm mainly curious about two minor points:
1. How will the colors of vertices with high data skew scores be unified with the existing backpressure and high busyness colors on the UI? Users should be able to distinguish at a glance which vertices in the entire job graph are skewed.
2. Could you tell me whether you prefer to unify the Data Skew Score and the Exception tab? In my opinion, Data Skew Score is in the same category as the existing Backpressured and Busy metrics.

Looking forward to your reply.

--
Best!
Xuyang

At 2024-01-16 00:59:57, "Kartoglu, Emre" <kar...@amazon.co.uk.INVALID> wrote:
>Hello,
>
>I’m opening this thread to discuss a FLIP[1] to make data skew more visible on the Flink Dashboard.
>
>Data skew is currently not as visible as it should be. Users have to click each operator and check how much data each sub-task is processing, and compare the sub-tasks against each other. This is especially cumbersome and error-prone for jobs with big job graphs and high parallelism. I’m proposing this FLIP to improve this.
>
>Kind regards,
>Emre
>
>[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
Hi Rui,

Thanks for the feedback. Please find my response below:

> The number_of_records_received_by_each_subtask is the total received records, right?

No, it's not the total. I understand why this is confusing. I had initially wanted to name it "the list of number of records received by each subtask", so its type is a list. Example: [10, 10, 10] => 3 sub-tasks, each of which received 10 records.

In your example, you have subtasks each designed to receive records at different times of the day. I hadn't thought about this use case! So you would have a high data skew while 1 subtask is receiving all the data, but on average (say over 1-2 days) data skew would come down to 0, because all subtasks would have received their portion of the data. I'm inclined to think that the current proposal might still be fair, as you do indeed have a skew by definition (but an intentional one). We have a few ways forward:

0) We can keep the behaviour as proposed. My thought is that data skew is data skew, however intentional it may be. It is not necessarily bad, as in your example.

1) Show data skew based on the beginning of time (not a live/current score). I mentioned some downsides to this in the FLIP: if you broke or fixed your data skew recently, the historical data might hide the recent fix/breakage, and it is inconsistent with the other metrics shown on the vertices, e.g. the Backpressure/Busy metrics show the live/current score.

2) We can choose not to put the data skew score on the vertices on the job graph, and instead just use the newly proposed Data Skew tab, which could show the live/current skew score and the total data skew score from the beginning of the job.

Keen to hear your thoughts.

Kind regards,
Emre

On 16/01/2024, 06:44, "Rui Fan" <1996fan...@gmail.com> wrote:
Thanks Emre for driving this proposal! It's very useful for troubleshooting.

I have a question: the number_of_records_received_by_each_subtask is the total received records, right? I'm not sure whether we should check data skew based on the latest duration period. In production, I found that the total received records of all subtasks are balanced, but in each time period they are skewed. For example, a Flink job has `group by` or `keyBy` based on an hour field. It means:

- In the 0-1 o'clock, subtask A is busy and the rest of the subtasks are idle.
- In the 1-2 o'clock, subtask B is busy and the rest of the subtasks are idle.
- In the next hour, the busy subtask changes.

Looking forward to your opinions~

Best,
Rui

On Tue, Jan 16, 2024 at 2:05 PM Xuyang <xyzhong...@163.com> wrote:

> Hi, Emre.
>
> In large-scale production jobs, the phenomenon of data skew often occurs. Having a metric on the UI that reflects data skew, without the need for manual inspection of each vertex by clicking on them, would be quite cool. This could help users quickly identify problematic nodes, simplifying development and operations.
>
> I'm mainly curious about two minor points:
> 1. How will the colors of vertices with high data skew scores be unified with the existing backpressure and high busyness colors on the UI? Users should be able to distinguish at a glance which vertices in the entire job graph are skewed.
> 2. Could you tell me whether you prefer to unify the Data Skew Score and the Exception tab? In my opinion, Data Skew Score is in the same category as the existing Backpressured and Busy metrics.
>
> Looking forward to your reply.
>
> --
> Best!
> Xuyang
>
> At 2024-01-16 00:59:57, "Kartoglu, Emre" <kar...@amazon.co.uk.INVALID> wrote:
> >Hello,
> >
> >I’m opening this thread to discuss a FLIP[1] to make data skew more visible on the Flink Dashboard.
> >
> >Data skew is currently not as visible as it should be. Users have to click each operator and check how much data each sub-task is processing, and compare the sub-tasks against each other. This is especially cumbersome and error-prone for jobs with big job graphs and high parallelism. I’m proposing this FLIP to improve this.
> >
> >Kind regards,
> >Emre
> >
> >[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
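The "current score" mechanics Rui asks about — per-subtask deltas between two snapshots of the cumulative record counters — can be sketched as follows. The snapshot values are hypothetical, purely for illustration; in a real implementation the counts would come from each subtask's record-count metric (e.g. numRecordsIn):

```python
# Sketch of the windowed ("current") skew input Rui describes: subtract each
# subtask's cumulative record count from one minute ago from its count now.
# Snapshot values below are hypothetical, for illustration only.

def window_counts(counts_now, counts_one_min_ago):
    return [now - before for now, before in zip(counts_now, counts_one_min_ago)]

# Rui's hourly keyBy example: cumulative totals look balanced over the day,
# but within the last minute only one subtask received data.
cumulative_now  = [2000, 2060, 2000]
cumulative_then = [2000, 2000, 2000]
print(window_counts(cumulative_now, cumulative_then))  # [0, 60, 0]
```

Scoring these window deltas, rather than the cumulative totals, is what makes the metric "live" and consistent with the Backpressure/Busy behaviour discussed in the thread.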
[DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
Hello,

I’m opening this thread to discuss a FLIP[1] to make data skew more visible on the Flink Dashboard.

Data skew is currently not as visible as it should be. Users have to click each operator and check how much data each sub-task is processing, and compare the sub-tasks against each other. This is especially cumbersome and error-prone for jobs with big job graphs and high parallelism. I’m proposing this FLIP to improve this.

Kind regards,
Emre

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
Show data skew score on Flink Dashboard?
Hello,

Is there a reason why some kind of data skew score (probably just a percentage) is not shown on the Flink Dashboard/UI? Currently users have to click on each operator and check how much data each subtask is processing to tell whether there is skew. This is inefficient, especially cumbersome, and error-prone for big job graphs.

It would be useful to have this shown on the operator. Possibly also a warning message at the top, or somewhere “more meta”, if a significant amount of skew is detected (so that users don’t have to zoom in on each and every operator to check the skew score).

I’d be happy to create a ticket for it, if there are no objections?

Kind regards,
Emre
Re: Maven plugin to detect issues early on
Hi Jing,

The proposed plugin would be used by Flink application developers when they are writing their Flink job. It would trigger during compilation/packaging and would look for known incompatibilities, bad practices, or bugs. For instance, one cause of frustration for our customers is connector incompatibilities (specifically Kafka and Kinesis) with certain Flink versions. This plugin would be a quick way to maintain a list of known incompatibilities, bugs, and bad practices, so customers get errors during compilation/packaging and not after they've deployed their Flink job.

From what you're saying, the FLIP route might not be the best way to go. We might publish this plugin in our own GitHub namespace/group first, and then get community acknowledgement/support for it. I believe working with the Flink community on this is key, as we'd need their support/opinion to do this the right way and reach more Flink users.

Thanks,
Emre

On 21/05/2023, 16:48, "Jing Ge" <j...@ververica.com.INVALID> wrote:

Hi Emre,

Thanks for your proposal. It looks very interesting! Please note that most connectors have been externalized. Will your proposed plugin be used for building Flink connectors, or Flink itself? Furthermore, it would be great if you could elaborate on the features with respect to best practices, so that we could understand how the plugin will help us. Afaik, a FLIP is recommended for improvement ideas that will change public APIs; I am not sure a new Maven plugin belongs to it.
Best regards,
Jing

On Tue, May 16, 2023 at 11:29 AM Kartoglu, Emre <kar...@amazon.co.uk.INVALID> wrote:

> Hello all,
>
> Myself and 2 colleagues developed a Maven plugin (no support for Gradle or other build tools yet) that we use internally to detect potential issues in Flink apps at the compilation/packaging stage:
>
> * Known connector version incompatibilities – so far covering Kafka and Kinesis
> * Best practices, e.g. setting operator IDs
>
> We’d like to make this open-source, ideally with the Flink community’s support/mention of it on the Flink website, so more people use it.
>
> Going forward, I believe we have at least the following options:
>
> * Get community support: create a FLIP to discuss where the plugin should live, what kind of problems it should detect, etc.
> * We still open-source it, but without the community support (if the community has objections to officially supporting it, for instance).
>
> Just wanted to gauge the feeling/thoughts towards this tool from the community before going ahead.
>
> Thanks,
> Emre
RE: Call for help on the Web UI (In-Place Rescaling)
Hi David,

This looks awesome. I am no expert on UI/UX, but I still have opinions 😊

I normally use the Overview tab for monitoring Flink jobs, and having control inputs there breaks my assumption that Overview is “read-only” and for “watching”. Having said that, for “educational purposes” that might actually be a good place - I am imagining there would be an “educationalMode: true” flag or something somewhere to enable these buttons (and other educational bits in future). The “educational purpose” bit makes me a lot more relaxed about having those buttons as they are in the video!

A couple of other things to consider:

* Confirming the new parallelism before actually applying it, e.g. having a “Deploy/Commit/Save” button
* Allowing users to enter the parallelism directly, without having to increment/decrement one by one

Thanks,
Emre

On 2023/05/19 06:49:08 David Morávek wrote:
> Hi Everyone,
>
> In FLINK-31471 [1], we've introduced new "in-place rescaling features" to the Web UI that show up when the scheduler supports FLIP-291 REST endpoints.
>
> I expect this to be a significant feature for user education (they have an easy way to try out how rescaling behaves, especially in combination with a backpressure monitor) and marketing (read as "we can do fancy demos").
>
> However, the current sketch is not optimal due to my lack of UI/UX skills.
>
> Are there any volunteers that could and would like to help polish this?
>
> Here is a short demo [2] of what the current implementation can do.
>
> [1] https://issues.apache.org/jira/browse/FLINK-31471
> [2] https://www.youtube.com/watch?v=B1NVDTazsZY
>
> Best,
> D.
Maven plugin to detect issues early on
Hello all,

Myself and 2 colleagues developed a Maven plugin (no support for Gradle or other build tools yet) that we use internally to detect potential issues in Flink apps at the compilation/packaging stage:

* Known connector version incompatibilities – so far covering Kafka and Kinesis
* Best practices, e.g. setting operator IDs

We’d like to make this open-source, ideally with the Flink community’s support/mention of it on the Flink website, so more people use it.

Going forward, I believe we have at least the following options:

* Get community support: create a FLIP to discuss where the plugin should live, what kind of problems it should detect, etc.
* We still open-source it, but without the community support (if the community has objections to officially supporting it, for instance).

Just wanted to gauge the feeling/thoughts towards this tool from the community before going ahead.

Thanks,
Emre