realno edited a comment on issue #1486:
URL: 
https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016037596


   I have an update/question for this issue: `stddev` and `corr` were 
merged a few days ago. I took a look at `median` today and have an initial idea 
of how to implement it. The problem is that the implementation might be somewhat 
controversial. 
   
   Calculating the median needs three operations: 1. sort, 2. count, 3. 
get the nth value. I thought about a few options:
   1. Implement a new function/operator. The function would look like `sort`, 
and we may need to add a new type of expression for it. That feels like a lot of 
overhead just for median.
   
   - Pros: No need to touch the plan builder or query parsing logic
   - Cons: Needs a dedicated expression just for median; some sort and count 
logic may be duplicated
   
   2. Reuse some of the existing code by rewriting the logical plan. If `sort` 
could keep track of the total number of records, we could add a new function 
that rewrites the logical plan to something like ` FIND(n) | SORT(col)`. 
Currently there is no such behavior in the code base. 
   
   - Pros: Better code reuse; may provide a way to handle compound 
(higher-order) functions, e.g. `Function1` -> `FunctionA(FunctionB(col))`
   - Cons: Need to modify the logic for building plans
   
   3. Use an approximate algorithm like KLL or t-digest. This fits the 
existing aggregator API, but the result will be an approximation. There is 
already a PR: https://github.com/apache/arrow-datafusion/pull/1539. We can help 
merge that, then implement `median` as `quantile(0.5)`.
   
   - Pros: Easy to implement; uses the existing aggregator API; much better 
performance
   - Cons: The result is an approximation
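
   For reference, the three steps that options 1 and 2 need to express (sort, 
count, get the nth value) look roughly like this on a plain in-memory slice. 
This is only a sketch: `exact_median` is a hypothetical helper, not an existing 
DataFusion function, and it assumes `f64` input, a copy of the data, and 
averaging of the two middle values for an even count.

```rust
/// Exact median of a slice: sort, count, then take the middle value(s).
/// Hypothetical helper for illustration; not part of DataFusion.
/// Returns `None` for empty input.
fn exact_median(values: &[f64]) -> Option<f64> {
    // Step 2: count.
    let n = values.len();
    if n == 0 {
        return None;
    }

    // Step 1: sort a copy so the caller's data is left untouched.
    // `total_cmp` gives a total order over f64, including NaN.
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Step 3: get the nth value (or the mean of the two middle values).
    let mid = n / 2;
    if n % 2 == 1 {
        Some(sorted[mid])
    } else {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}
```

   With this sketch, `exact_median(&[5.0, 1.0, 3.0])` gives `Some(3.0)` and 
`exact_median(&[4.0, 1.0, 3.0, 2.0])` gives `Some(2.5)`. Note that the whole 
input has to be available before the result can be produced.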
   
   My preference order is 3 > 2 > 1. I'd like to hear more opinions 
before moving forward.
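
   To make the trade-off behind that preference more concrete, here is a toy 
sketch of an accumulator-style aggregate. `ToyAccumulator`, `MeanAcc` and 
`ExactMedianAcc` are made-up names for illustration and are not DataFusion's 
actual aggregator API. The point is only that an exact median can be squeezed 
into an update/merge/evaluate shape, but its state has to hold every input 
value, whereas sketches like t-digest or KLL keep a bounded, mergeable summary 
(not implemented here).

```rust
/// Hypothetical accumulator shape, for illustration only.
/// This is NOT DataFusion's real aggregator trait.
trait ToyAccumulator {
    /// Fold one batch of input values into the state.
    fn update(&mut self, batch: &[f64]);
    /// Combine a partial state computed elsewhere into this one.
    fn merge(&mut self, other: &Self);
    /// Produce the final result, or `None` if no input was seen.
    fn evaluate(&self) -> Option<f64>;
}

/// Mean: the state is two numbers, no matter how much input is seen.
struct MeanAcc {
    sum: f64,
    count: u64,
}

impl ToyAccumulator for MeanAcc {
    fn update(&mut self, batch: &[f64]) {
        self.sum += batch.iter().sum::<f64>();
        self.count += batch.len() as u64;
    }
    fn merge(&mut self, other: &Self) {
        self.sum += other.sum;
        self.count += other.count;
    }
    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

/// Exact median: the state grows with the input, because every value
/// must be kept until the final sort.
struct ExactMedianAcc {
    values: Vec<f64>,
}

impl ToyAccumulator for ExactMedianAcc {
    fn update(&mut self, batch: &[f64]) {
        self.values.extend_from_slice(batch);
    }
    fn merge(&mut self, other: &Self) {
        self.values.extend_from_slice(&other.values);
    }
    fn evaluate(&self) -> Option<f64> {
        let n = self.values.len();
        if n == 0 {
            return None;
        }
        // Sort a copy of the buffered values, then pick the middle.
        let mut sorted = self.values.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        let mid = n / 2;
        if n % 2 == 1 {
            Some(sorted[mid])
        } else {
            Some((sorted[mid - 1] + sorted[mid]) / 2.0)
        }
    }
}
```

   In this sketch, merging two `ExactMedianAcc` partial states concatenates 
their buffers, so memory use scales with the number of rows rather than 
staying constant as it does for `MeanAcc`.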
   
   @matthewmturner @alamb 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

