Hi Rui,

Thanks for the feedback. Please find my response below:

> The number_of_records_received_by_each_subtask is the total received records, 
> right?

No, it's not the total. I understand why this is confusing. I had initially 
wanted to name it "the list of the number of records received by each subtask", 
so its type is a list. Example: [10, 10, 10] => 3 subtasks, each of which 
received 10 records.
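For illustration, here is a minimal sketch of how such a list could be collapsed into a single skew score. To be clear, the exact formula is whatever the FLIP specifies; the coefficient of variation used below is only an assumption for the sake of the example:

```python
# Hypothetical skew score over the per-subtask record counts.
# Coefficient of variation (stddev / mean) is an illustrative choice,
# not necessarily the formula proposed in FLIP-418.
from statistics import mean, pstdev

def skew_score(records_per_subtask):
    """Return a 0-100 score: 0 = perfectly balanced, higher = more skew."""
    avg = mean(records_per_subtask)
    if avg == 0:
        return 0.0
    return min(100.0, pstdev(records_per_subtask) / avg * 100)

print(skew_score([10, 10, 10]))  # balanced subtasks -> 0.0
print(skew_score([30, 0, 0]))    # all data on one subtask -> 100.0
```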

In your example, each subtask is designed to receive records at different times 
of the day. I hadn't thought about this use case! 
You would see high data skew while one subtask is receiving all the data, but 
on average (say over 1-2 days) the skew would come down to 0, because all 
subtasks would have received their portion of the data.
I'm inclined to think the current proposal might still be fair, as you do 
indeed have skew by definition (albeit an intentional one). We have a few ways 
forward:

0) We can keep the behaviour as proposed. My thinking is that data skew is 
data skew, however intentional it may be, and, as your example shows, it is 
not necessarily bad.

1) Show data skew computed since the beginning of the job (not a live/current 
score). I mentioned some downsides to this in the FLIP: if you recently broke 
or fixed your data skew, the historical data might hide the recent 
breakage/fix, and it would be inconsistent with the other metrics shown on the 
vertices, e.g. the Backpressure/Busy metrics show a live/current score.

2) We can choose not to put a data skew score on the vertices of the job 
graph, and instead use only the newly proposed Data Skew tab, which could show 
both the live/current skew score and the total skew score since the beginning 
of the job.

Keen to hear your thoughts.

Kind regards,
Emre


On 16/01/2024, 06:44, "Rui Fan" <1996fan...@gmail.com> wrote:


Thanks Emre for driving this proposal!

It's very useful for troubleshooting.

I have a question:

The number_of_records_received_by_each_subtask is the
total received records, right?

I'm not sure whether we should check data skew based on
the latest duration period.

In production, I have found that the total received records of
all subtasks are balanced, but within each time period they
are skewed.

For example, a Flink job has `group by` or `keyBy` based on
an hour field. That means:
- From 0-1 o'clock, subtaskA is busy and the rest of the subtasks are idle.
- From 1-2 o'clock, subtaskB is busy and the rest of the subtasks are idle.
- The next hour, the busy subtask changes.

Looking forward to your opinions~

Best,
Rui


On Tue, Jan 16, 2024 at 2:05 PM Xuyang <xyzhong...@163.com> wrote:


> Hi, Emre.
>
>
> In large-scale production jobs, the phenomenon of data skew often occurs.
> Having a metric on the UI that reflects data skew, without the need to
> manually inspect each vertex by clicking on it, would be quite cool.
> This could help users quickly identify problematic nodes, simplifying
> development and operations.
>
>
> I'm mainly curious about two minor points:
> 1. How will the colors of vertices with high data skew scores be unified
> with the existing backpressure and high-busyness colors on the UI? Users
> should be able to distinguish at a glance which vertices in the entire
> job graph are skewed.
> 2. Could you explain why you prefer to unify the Data Skew Score with the
> Exception tab? In my opinion, the Data Skew Score is in the same category
> as the existing Backpressured and Busy metrics.
>
>
> Looking forward to your reply.
>
>
>
> --
>
> Best!
> Xuyang
>
>
>
>
>
> At 2024-01-16 00:59:57, "Kartoglu, Emre" <kar...@amazon.co.uk.INVALID>
> wrote:
> >Hello,
> >
> >I’m opening this thread to discuss a FLIP[1] to make data skew more
> visible on Flink Dashboard.
> >
> >Data skew is currently not as visible as it should be. Users have to
> click each operator and check how much data each sub-task is processing and
> compare the sub-tasks against each other. This is especially cumbersome and
> error-prone for jobs with big job graphs and high parallelism. I’m
> proposing this FLIP to improve this.
> >
> >Kind regards,
> >Emre
> >
> >[1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
> >
> >
> >
>


