Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

Kartoglu, Emre Thu, 01 Feb 2024 01:53:02 -0800

Hi Rui,

Thanks for the useful feedback and for caring about the user experience.
I will update the FLIP based on one of your comments. I consider this a minor update.


Please find my detailed responses below. 

"numRecordsInPerSecond sounds make sense to me, and I think
it's necessary to mention it in the FLIP wiki. It will let other developers
to easily understand. WDYT?"

I feel this might be touching on implementation details. No objections though,
I will update the FLIP to mention this as one of the ways in which we can
achieve the proposal.
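
As a rough illustration of the per-subtask input discussed in this thread
(records received up to now minus records received up to the start of the
window), here is a minimal Python sketch. The function names and the sampled
counter lists are hypothetical and are not part of the FLIP or of Flink's
metric API. It only shows the arithmetic, not how numRecordsInPerSecond is
actually maintained.

```python
def records_in_window(totals_now, totals_before):
    """Per-subtask records received during the window.

    Both arguments are cumulative per-subtask record counts (hypothetical
    inputs), sampled at the end and at the start of the window.
    """
    if len(totals_now) != len(totals_before):
        raise ValueError("both samples must cover the same subtasks")
    # Clamp at zero in case a counter went backwards between the two samples.
    return [max(now - before, 0) for now, before in zip(totals_now, totals_before)]


def records_per_second(totals_now, totals_before, window_seconds):
    """Average per-subtask input rate over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return [n / window_seconds for n in records_in_window(totals_now, totals_before)]
```

For example, with cumulative counts [110, 250, 90] now and [100, 150, 90] a
minute earlier, the window values are [10, 100, 0].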


"After I detailed read the FLIP and Average_absolute_deviation, we know
0% is the best, 100% is worst."

Correct.


"I guess it is difficult for users who have not read the documentation to
know the meaning of 50%. We hope that the designed Data skew will
be easy for users to understand without reading or learning a series
of backgrounds."

I think I understand where you're coming from. My thought is that the user
won't have to know exactly how the skew percentage/score is calculated. The
score will act as a warning sign for them. Upon seeing a skew score of 80% for
an operator, as a user I will click on the operator and see that many of my
subtasks are not receiving any data at all. So it acts as a metric that draws
the user's attention to the skewed operator so they can fix the issue.
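
To make the 0% to 100% range concrete, below is a minimal Python sketch of one
way a percentage score could be derived from the average absolute deviation.
The normalisation by the worst case (a single subtask receiving every record)
is an assumption made for illustration. The exact formula is the one in the
FLIP and may differ. The function name is hypothetical.

```python
def skew_percentage(records_per_subtask):
    """Skew score in [0, 100] based on the average absolute deviation.

    Assumption: the deviation is normalised by its largest possible value,
    which is reached when a single subtask receives every record.
    """
    n = len(records_per_subtask)
    total = sum(records_per_subtask)
    if n < 2 or total == 0:
        return 0.0  # a single subtask, or no input at all, is treated as unskewed
    mean = total / n
    # Average absolute deviation of the per-subtask counts from their mean.
    avg_abs_dev = sum(abs(x - mean) for x in records_per_subtask) / n
    # Average absolute deviation when one subtask receives the whole total.
    worst_case = 2 * mean * (n - 1) / n
    return 100.0 * avg_abs_dev / worst_case
```

With this sketch, evenly distributed input such as [10, 10, 10, 10] scores 0%,
[0, 0, 0, 100] scores 100%, and [10, 10, 10, 100, 10] scores roughly 64%.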


"For example, as you mentioned before, flink has a metric:
numRecordsInPerSecond.
I believe users know what numRecordsInPerSecond means even if they
didn't read any documentation."

The FLIP suggests that we will provide an explanation of the data skew score
under the proposed Data Skew tab. I would like to leave the exact wording to
the code review process so that it does not block the implementation work.
It will be a user-friendly explanation, with an option for the curious user to
see the exact formula.


Kind regards,
Emre


On 01/02/2024, 03:26, "Rui Fan" <1996fan...@gmail.com> wrote:


> I was thinking about using the existing numRecordsInPerSecond metric


numRecordsInPerSecond sounds make sense to me, and I think
it's necessary to mention it in the FLIP wiki. It will let other developers
to easily understand. WDYT?


BTW, that's why I asked whether the data skew score is based on the total
number of received records.


> this would always give you a score higher than 1, with no way to cap the
> score.


Yeah, you are right. max/mean is not a score, it's the data skew multiple.
And I guess max/mean is easier to understand than
Average_absolute_deviation.
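
For comparison, the max/mean multiple can be sketched in Python as below. The
function name and the choice to return 1.0 for empty or all-zero input are
assumptions for illustration only.

```python
def skew_multiple(records_per_subtask):
    """Data skew multiple: the busiest subtask's count divided by the mean."""
    total = sum(records_per_subtask)
    if not records_per_subtask or total == 0:
        return 1.0  # assumed convention: no input means no skew
    mean = total / len(records_per_subtask)
    return max(records_per_subtask) / mean
```

For received counts [10, 10, 10, 100, 10] this gives 100 / 28, about 3.57.
With non-negative counts the ratio cannot exceed the number of subtasks, which
it reaches when one subtask receives everything, so its range depends on the
parallelism.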


> I'm more used to working with percentages. The problem with the max/mean
> metric is I wouldn't immediately know whether a score of 300 is bad for
> instance.
> Whereas if users saw above 50% as suggested in the FLIP for instance,
> they would consider taking action. I'm tempted to push back on this
> suggestion. Happy to discuss further, there is a chance I'm not seeing the
> downside of the proposed percentage based metric yet. Please let me know.


After I detailed read the FLIP and Average_absolute_deviation, we know
0% is the best, 100% is worst.


I guess it is difficult for users who have not read the documentation to
know the meaning of 50%. We hope that the designed Data skew will
be easy for users to understand without reading or learning a series
of backgrounds.


For example, as you mentioned before, flink has a metric:
numRecordsInPerSecond.
I believe users know what numRecordsInPerSecond means even if they
didn't read any documentation.


Of course, I'm open to it. I may have missed something. I'd like to hear
more feedback from the community.


Best,
Rui


On Thu, Feb 1, 2024 at 4:13 AM Kartoglu, Emre <kar...@amazon.co.uk.invalid>
wrote:


> Hi Rui,
>
> " and provide the total and current score in the detailed tab. I didn't
> see the detailed design in the FLIP, would you mind
> improve the design doc? Thanks".
>
> It will essentially be a basic list view similar to the "Checkpoints" tab.
> I only briefly mentioned this in the FLIP because it will be a basic list
> view.
> No problem though, I will update the FLIP.
>
>
> Please find my responses below the quotations.
>
> " 1. About the current skew score, I still don't understand how to get
> the list_of_number_of_records_received_by_each_subtask for
> each subtask.
>
> the list_of_number_of_records_received_by_each_subtask of subtask1
> is
>
> total received records of subtask 1 from beginning to now -
> total received records of subtask 1 from beginning to (now - 1min), right?"
>
> Yes, essentially correct. I was thinking about using the existing
> numRecordsInPerSecond metric (see
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/),
> this would give us per second granularity and this would be more
> "current/live" than per minute.
>
>
> "IIUC, you proposed score is between 0% to 100%, and 0% is the best.
> And the 100% is the worst."
>
> Correct.
>
>
> " For data skew, I'm not sure whether a multiple value is more intuitive.
> It means data skew score = max / mean.
> The data skew score is between 1 and infinity. 1 is the best, and
> the bigger the worse."
>
> I'm not sure I follow you here. Yes, this would always give you a score
> higher than 1, with no way to cap the score.
> I'm more used to working with percentages. The problem with the max/mean
> metric is I wouldn't immediately know whether a score of 300 is bad for
> instance.
> Whereas if users saw above 50% as suggested in the FLIP for instance, they
> would consider taking action. I'm tempted to push back on this suggestion.
> Happy to discuss further, there is a chance I'm not seeing the downside of
> the proposed percentage based metric yet. Please let me know.
>
> Kind regards,
> Emre
>
> On 31/01/2024, 10:57, "Rui Fan" <1996fan...@gmail.com> wrote:
>
>
> Sorry for the late reply.
>
>
>
>
> > So you would have a high data skew while 1 subtask is receiving all the
> > data, but on average (say over 1-2 days) data skew would come down to 0
> > because all subtasks would have received their portion of the data.
> > I'm inclined to think that the current proposal might still be fair, as
> > you do indeed have a skew by definition (but an intentional one). We can
> > have a few ways forward:
> >
> > 0) We can keep the behaviour as proposed. My thoughts are that data skew
> > is data skew, however intentional it may be. It is not necessarily bad,
> > like in your example.
>
>
> It makes sense to me. Flink should show data skew correctly
> regardless of whether the user is intentional or not.
>
>
>
>
> > 1) Show data skew based on the beginning of time (not a live/current
> > score).
> > I mentioned some downsides to this in the FLIP: If you break or fix your
> > data skew recently, the historical data might hide the recent fix/breakage,
> > and it is inconsistent with the other metrics shown on the vertices e.g.
> > Backpressure/Busy metrics show the live/current score.
> >
> > 2) We can choose not to put data skew score on the vertices on the job
> > graph. And instead just use the new proposed Data Skew tab which could show
> > live/current skew score and the total data skew score from the beginning of
> > job.
>
>
> It makes sense, we can show the current skew score in the DAG WebUI by
> default, and provide the total and current score in the detailed tab.
>
>
> I didn't see the detailed design in the FLIP, would you mind
> improve the design doc? Thanks
>
>
> Also, I have 2 questions for now:
>
>
> 1. About the current skew score, I still don't understand how to get
> the list_of_number_of_records_received_by_each_subtask for
> each subtask.
>
>
> the list_of_number_of_records_received_by_each_subtask of subtask1
> is total received records of subtask 1 from beginning to now -
> total received records of subtask 1 from beginning to (now - 1min), right?
>
>
> Note: 1min is an example. 30s or 2min is fine for me.
>
>
> 2. The skew score as a percentage
>
>
> I'm not sure whether showing the score in percent format is reasonable.
> For the busy ratio or backpressure ratio, showing them in percent format
> is intuitive.
>
>
> IIUC, you proposed score is between 0% to 100%, and 0% is the best.
> And the 100% is the worst.
>
>
> For data skew, I'm not sure whether a multiple value is more intuitive.
> It means data skew score = max / mean.
>
>
> For example, we have 5 subtasks, the received record numbers are
> [10, 10, 10, 100, 10].
> data skew score = max / mean = 100 / (140/5) = 100 / 28 = 3.57.
>
>
> The data skew score is between 1 and infinity. 1 is the best, and
> the bigger the worse.
>
>
> Looking forward to your opinions.
>
>
> Best,
> Rui
>
>
> On Tue, Jan 23, 2024 at 6:41 PM Kartoglu, Emre <kar...@amazon.co.uk.invalid>
> wrote:
>
>
> > Hi Krzysztof,
> >
> > Thank you for the feedback! Please find my comments below.
> >
> > 1. Configurability
> >
> > Adding a feature flag / configuration to enable this is still on the
> > table as far as I am concerned. However I believe adding a new metric
> > shouldn't warrant a flag/configuration. One might argue that we should
> > have it for showing the metrics on the Flink UI, and I'd appreciate input
> > on this. My default position is to not have a configuration/flag unless
> > there is a good reason (e.g. it turns out there is impact on Flink UI for
> > so far unknown reason). This is because the proposed change should only
> > be improving the experience without any unwanted side effect.
> >
> > 2. Metrics
> >
> > I agree the new metrics should be compatible with the rest of the Flink
> > metric reporting mechanism. I will update the FLIP and propose names for
> > the metrics.
> >
> > Kind regards,
> > Emre
> >
> > On 23/01/2024, 10:31, "Krzysztof Dziołak" <kdzio...@live.com> wrote:
> >
> >
> > Hi Emre,
> >
> >
> > Thank you for driving this proposal. I've got two questions about the
> > extensions to the proposal that are not captured in the FLIP.
> >
> >
> >
> >
> > 1. Configurability - what kind of configuration would you propose to
> > maintain for this feature? Would an on/off switch and/or the aggregation
> > period length be configurable? Should we capture the toggles in the FLIP?
> > 2. Metrics - are we planning to emit the skew metric via the metric
> > reporters mechanism? Should we capture the proposed metric schema in the
> > FLIP?
> >
> >
> > Kind regards,
> > Krzysztof
> >
> >
> > ________________________________
> > From: Kartoglu, Emre <kar...@amazon.co.uk.invalid>
> > Sent: Monday, January 15, 2024 4:59 PM
> > To: dev@flink.apache.org <dev@flink.apache.org>
> > Subject: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
> >
> >
> > Hello,
> >
> >
> > I’m opening this thread to discuss a FLIP[1] to make data skew more
> > visible on Flink Dashboard.
> >
> >
> > Data skew is currently not as visible as it should be. Users have to
> > click each operator and check how much data each sub-task is processing
> > and compare the sub-tasks against each other. This is especially
> > cumbersome and error-prone for jobs with big job graphs and high
> > parallelism. I’m proposing this FLIP to improve this.
> >
> >
> > Kind regards,
> > Emre
> >
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard