realno edited a comment on issue #1486: URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016037596
I have an update and a question for this issue: `stddev` and `corr` were merged a few days ago. I took a look at `median` today and have an initial idea of how to implement it. The problem is that the implementation might be somewhat controversial.

Calculating the median needs three operations:

1. sort
2. count
3. get the nth value

I thought about a few options:

1. Implement a new function/operator. The function would look like `sort`, and we may need to add a new type of expression for it. It feels like a lot of overhead just for median.
   - Pros: No messing around with the plan builder and query parsing logic
   - Cons: Need to add a specific expression just for median; there may be some duplicated logic around sort and count
2. Reuse some of the existing code by rewriting the logical plan. If `sort` could keep the total number of records, we could add a new function that rewrites the logical plan to something like `FIND(n) | SORT(col)`. Currently there is no such behavior in the code base.
   - Pros: Better code reuse; may potentially provide a way to handle compound (higher-order) functions, e.g. `Function1` -> `FunctionA(FunctionB(col))`
   - Cons: Need to modify the logic for building plans
3. Use an approximate algorithm like KLL or t-digest. This fits the existing aggregator API, but the result will be an approximation. There is already a PR, https://github.com/apache/arrow-datafusion/pull/1539; we can help merge that and then implement `median` as `quantile(0.5)`.
   - Pros: Easy to implement; uses the existing aggregator API; much better performance
   - Cons: The result is an approximation

My preference is in the order 3 > 2 > 1. I'd like to see more opinions before moving forward. @matthewmturner @alamb
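For reference, the three operations listed above (sort, count, get the nth value) can be sketched in plain Rust as follows. This is only an illustration of the exact computation, not DataFusion code: `exact_quantile` and `exact_median` are hypothetical helpers, and dropping NaNs and interpolating linearly between neighboring ranks are assumed conventions. It also shows median expressed as `quantile(0.5)`, though unlike the approximate algorithms in option 3 this version buffers and sorts every value.

```rust
/// Exact quantile over an in-memory slice (hypothetical helper, not a DataFusion API).
/// Returns `None` for empty input or when `q` is outside [0, 1].
fn exact_quantile(values: &[f64], q: f64) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) {
        return None;
    }

    // Assumption: NaNs are ignored so that the remaining values have a total order.
    let mut sorted: Vec<f64> = values.iter().copied().filter(|v| !v.is_nan()).collect();

    // Count.
    let n = sorted.len();
    if n == 0 {
        return None;
    }

    // Sort.
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Get the nth value, interpolating linearly when the rank is fractional.
    let pos = q * (n - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    Some(sorted[lo] + (sorted[hi] - sorted[lo]) * frac)
}

/// Median expressed as the 0.5 quantile.
fn exact_median(values: &[f64]) -> Option<f64> {
    exact_quantile(values, 0.5)
}
```

Because the whole column has to be held and sorted, this does not map onto an aggregator that only keeps a small mergeable state, which is the trade-off behind preferring the approximate route.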
