Startrekzky commented on issue #6346: URL: https://github.com/apache/incubator-devlake/issues/6346#issuecomment-1790481479
Hi @roberioguedelho , the definition of median is the 50th percentile value of a value set. ### DevLake's algorithm of calculating median Since MySQL doesn't support the `median` function, we wrote our own way to measure median. Below is the SQL to measure MTTR. Check the full doc here: https://devlake.apache.org/docs/Metrics/MTTR#how-is-it-calculated ``` -- Metric 3: median time to restore service - MTTR with _incidents as ( -- get the number of incidents created each month SELECT distinct i.id, date_format(i.created_date,'%y/%m') as month, cast(lead_time_minutes as signed) as lead_time_minutes FROM issues i join board_issues bi on i.id = bi.issue_id join boards b on bi.board_id = b.id join project_mapping pm on b.id = pm.row_id and pm.`table` = 'boards' WHERE pm.project_name in ($project) and i.type = 'INCIDENT' and i.lead_time_minutes is not null ), _find_median_mttr_each_month_ranks as( SELECT *, percent_rank() over(PARTITION BY month order by lead_time_minutes) as ranks FROM _incidents ), _mttr as( SELECT month, max(lead_time_minutes) as median_time_to_resolve FROM _find_median_mttr_each_month_ranks WHERE ranks <= 0.5 GROUP BY month ) SELECT cm.month, case when m.median_time_to_resolve is null then 0 else m.median_time_to_resolve/60 end as median_time_to_resolve_in_hour FROM calendar_months cm LEFT JOIN _mttr m on cm.month = m.month WHERE $__timeFilter(cm.month_timestamp) ``` In the above SQL, the median value is the `MAX value` with `ranks` equal or less than 0.5. ``` _find_median_mttr_each_month_ranks as( SELECT *, percent_rank() over(PARTITION BY month order by lead_time_minutes) as ranks FROM _incidents ), _mttr as( SELECT month, max(lead_time_minutes) as median_time_to_resolve FROM _find_median_mttr_each_month_ranks WHERE ranks <= 0.5 GROUP BY month ) ``` ### Why the median value in your case is 0? You have a value set with 4 values. |value | rank| |------| ----| | 0|0.25| | **0**|0.5| | 315|0.75| | 670|1| According to the above algorithm, the median value is 0. ### How to change the algorithm? Normally, if you have a large dataset, this won't be a problem. However, if you find the above algorithm doesn't make sense on your existing data set, you'll have to manually update the SQL. One easy way is to use `mean` to replace `median`. The way you proposed might be OK but it you have a dataset like |value | rank| |------| ----| | 0|0.2| | **0**|0.4| | **0**|0.6| | 315|0.8| | 670|1| The median value would be (0+0)/2 = 0. Not sure if this works for you. To sum up, all algorithms of calculating `median` or using `mean` to replace `median` all have constraints when they apply to different datasets. I hope it makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
