Startrekzky commented on issue #6346:
URL: 
https://github.com/apache/incubator-devlake/issues/6346#issuecomment-1790481479

   Hi @roberioguedelho , the definition of median is the 50th percentile value 
of a value set. 
   
   ### DevLake's algorithm of calculating median
   
   Since MySQL doesn't support the `median` function, we wrote our own way to 
measure median. 
   
   
   Below is the SQL to measure MTTR. Check the full doc here: 
https://devlake.apache.org/docs/Metrics/MTTR#how-is-it-calculated
   ```
   -- Metric 3: median time to restore service - MTTR
   with _incidents as (
   -- get the number of incidents created each month
       SELECT
         distinct i.id,
           date_format(i.created_date,'%y/%m') as month,
           cast(lead_time_minutes as signed) as lead_time_minutes
       FROM
           issues i
         join board_issues bi on i.id = bi.issue_id
         join boards b on bi.board_id = b.id
         join project_mapping pm on b.id = pm.row_id and pm.`table` = 'boards'
       WHERE
         pm.project_name in ($project)
           and i.type = 'INCIDENT'
           and i.lead_time_minutes is not null
   ),
   
   _find_median_mttr_each_month_ranks as(
       SELECT *, percent_rank() over(PARTITION BY month order by 
lead_time_minutes) as ranks
       FROM _incidents
   ),
   
   _mttr as(
       SELECT month, max(lead_time_minutes) as median_time_to_resolve
       FROM _find_median_mttr_each_month_ranks
       WHERE ranks <= 0.5
       GROUP BY month
   )
   
   SELECT
       cm.month,
       case
           when m.median_time_to_resolve is null then 0
           else m.median_time_to_resolve/60 end as 
median_time_to_resolve_in_hour
   FROM
       calendar_months cm
       LEFT JOIN _mttr m on cm.month = m.month
     WHERE $__timeFilter(cm.month_timestamp)
   ```
   
   In the above SQL, the median value is the `MAX value` with `ranks` equal or 
less than 0.5.
   
   ```
   _find_median_mttr_each_month_ranks as(
       SELECT *, percent_rank() over(PARTITION BY month order by 
lead_time_minutes) as ranks
       FROM _incidents
   ),
   
   _mttr as(
       SELECT month, max(lead_time_minutes) as median_time_to_resolve
       FROM _find_median_mttr_each_month_ranks
       WHERE ranks <= 0.5
       GROUP BY month
   )
   ```
   
   ### Why the median value in your case is 0?
   
   You have a value set with 4 values.
   
   |value | rank|
   |------| ----|
   | 0|0.25|
   | **0**|0.5|
   | 315|0.75|
   | 670|1|
   
   According to the above algorithm, the median value is 0. 
   
   
   ### How to change the algorithm?
   Normally, if you have a large dataset, this won't be a problem. However, if 
you find the above algorithm doesn't make sense on your existing data set, 
you'll have to manually update the SQL. 
   
   One easy way is to use `mean` to replace `median`.
   
   The way you proposed might be OK but it you have a dataset like 
   |value | rank|
   |------| ----|
   | 0|0.2|
   | **0**|0.4|
   | **0**|0.6|
   | 315|0.8|
   | 670|1|
   
   The median value would be (0+0)/2 = 0. Not sure if this works for you.
   
   
   To sum up, all algorithms of calculating `median` or using `mean` to replace 
`median` all have constraints when they apply to different datasets. 
   
   I hope it makes sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to