[GitHub] [spark] hagerf commented on issue #26762: [SPARK-30131] add array_median function
hagerf commented on issue #26762: [SPARK-30131] add array_median function
URL: https://github.com/apache/spark/pull/26762#issuecomment-562107475

@srowen Ok, I see. If it's really that restrictive, then users can use other functions for this, even though I think it could be a popular addition used by many. So should I close this PR, or ask for other people's opinions on the matter?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards, Apache Git Services

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hagerf commented on issue #26762: [SPARK-30131] add array_median function
hagerf commented on issue #26762: [SPARK-30131] add array_median function
URL: https://github.com/apache/spark/pull/26762#issuecomment-562097725

Yes, of course. But we have the prefix `approx` because calculating an exact median over a whole dataset is difficult to do efficiently. So users who want an exact median are forced to use RDDs, or UDFs etc. on arrays if the data fits in an array. My point was: there is no exact median or percentile functionality at all in Spark. This would help for some subset of those use cases.
[GitHub] [spark] hagerf commented on issue #26762: [SPARK-30131] add array_median function
hagerf commented on issue #26762: [SPARK-30131] add array_median function
URL: https://github.com/apache/spark/pull/26762#issuecomment-562059403

@HyukjinKwon I added some links; I think they should be relevant. We already have `approxQuantile`, but this would be an exact function, limited to arrays. This function only calculates the median, which is (probably) the most common use case. I can extend it to support exact quantiles, if people think that would be better.
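
For readers following the thread, here is a rough sketch of what "exact quantiles limited to arrays" could mean. It is plain Python, not code from the PR. The name `array_quantile` is hypothetical, and the linear interpolation between the two closest ranks is an assumed convention (other quantile definitions exist). The median is the `q = 0.5` case.

```python
import math
from typing import Sequence


def array_quantile(values: Sequence[float], q: float) -> float:
    """Exact quantile of a single in-memory array (hypothetical helper)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")

    ordered = sorted(values)
    # Assumption: fractional rank on a 0-based index, interpolated linearly.
    pos = q * (len(ordered) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac
```

Because the whole array is sorted in memory, this is only reasonable when each array is small, which matches the restriction described in the comment.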
[GitHub] [spark] hagerf commented on issue #26762: [SPARK-30131] add array_median function
hagerf commented on issue #26762: [SPARK-30131] add array_median function
URL: https://github.com/apache/spark/pull/26762#issuecomment-561898965

@srowen From some quick googling, I see it in AWS Redshift and in IBM DB2 as an aggregate function. I've seen several tickets in Spark requesting a median, and I know from my work that people use the median frequently, so my intention was to solve a common request. But yes, this can of course be done with a UDF or a combination of other functions, though that can be a bit cumbersome.
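
To make the UDF route concrete, the per-array logic might look roughly like the sketch below. It is plain Python, not the PR's implementation. The name `array_median` is borrowed from the PR title purely for illustration, and the handling of nulls and empty arrays is an assumption.

```python
from typing import Optional, Sequence


def array_median(values: Sequence[Optional[float]]) -> Optional[float]:
    """Exact median of one array; illustrative helper, not Spark's API."""
    # Assumption: null entries (None) are skipped rather than propagated.
    present = sorted(v for v in values if v is not None)
    if not present:
        return None  # assumption: an empty array yields null

    mid = len(present) // 2
    if len(present) % 2 == 1:
        # Odd length: the single middle element.
        return float(present[mid])
    # Even length: the mean of the two middle elements.
    return (present[mid - 1] + present[mid]) / 2.0
```

In PySpark, a function like this could be wrapped with `pyspark.sql.functions.udf` and a `DoubleType` return type and applied to an array column. That extra wrapping step is the sort of overhead the comment calls cumbersome.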