[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904397#comment-15904397 ]

Zhenhua Wang edited comment on SPARK-16283 at 3/10/17 4:09 AM:
---

[~erlu] I think it's been made clear from the above discussions: Spark's result doesn't have to be the same as Hive's result.

was (Author: zenwzh):
[~erlu] I think it's been made clear from above discussions, Spark' result doesn't have to be the same as Hive's result.

> Implement percentile_approx SQL function
> ----------------------------------------
>
>         Key: SPARK-16283
>         URL: https://issues.apache.org/jira/browse/SPARK-16283
>     Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Reynold Xin
>    Assignee: Sean Zhong
>     Fix For: 2.1.0
>

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901132#comment-15901132 ]

chenerlu edited comment on SPARK-16283 at 3/9/17 1:55 AM:
--

Hi, I am a little confused about percentile_approx. Is it different from Hive's now? Will we get different results when the input is the same?

For example, I run

select percentile_approx(c4_double, array(0.1, 0.2, 0.3, 0.4)) from test;

and get different results. c4_double is shown below:

1.0001
2.0001
3.0001
4.0001
5.0001
6.0001
7.0001
8.0001
9.0001
NULL
-8.952
-96.0

Hive: [-87.2952, -6.9615799, 1.30009998, 2.40010003]
Spark 2.x: [-8.952, 1.0001, 2.0001, 3.0001]

So which result is right? Could you please reply when you are free. [~rxin] [~lwlin]

was (Author: erlu):
Hi, I am little confused about percentile_approx, is it different from hive's now ? will we get different result when the input is same ? for example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) from test; and get different result. c4_double is show below:
1.0001 2.0001 3.0001 4.0001 5.0001 6.0001 7.0001 8.0001 9.0001 NULL -8.952 -96.0
Hive: [-87.2952,-6.9615799,1.30009998,2.40010003]
spark 2.x: [-8.952,1.0001,2.0001,3.0001]
so which result is right ? Could you pls reply me when you are free. [~rxin] [~linwei]
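As a point of comparison, the exact percentiles of the eleven non-NULL values can be computed in plain Python. This is only a back-of-the-envelope sanity check, assuming the common linear-interpolation convention between order statistics (NumPy's default); it is not the algorithm either Hive or Spark runs internally:

```python
import math

def exact_percentiles(values, ps):
    """Exact percentiles via linear interpolation between order statistics.

    A sanity-check sketch using the common 'linear' convention; NULLs
    (None) are dropped, matching how both engines ignore NULL inputs.
    """
    s = sorted(v for v in values if v is not None)
    n = len(s)
    result = []
    for p in ps:
        rank = p * (n - 1)            # fractional index into the sorted data
        lo = int(math.floor(rank))
        hi = min(lo + 1, n - 1)
        frac = rank - lo
        result.append(s[lo] + frac * (s[hi] - s[lo]))
    return result

c4_double = [1.0001, 2.0001, 3.0001, 4.0001, 5.0001, 6.0001, 7.0001,
             8.0001, 9.0001, None, -8.952, -96.0]
print(exact_percentiles(c4_double, [0.1, 0.2, 0.3, 0.4]))
# approximately [-8.952, 1.0001, 2.0001, 3.0001]
```

Under this convention the exact values happen to coincide with Spark 2.x's answer on this tiny dataset, while Hive's histogram-based estimate drifts; on larger inputs with more distinct values than the sketch or histogram can hold, both engines are only approximate.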
[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901132#comment-15901132 ]

chenerlu edited comment on SPARK-16283 at 3/9/17 1:55 AM:
--

Hi, I am a little confused about percentile_approx. Is it different from Hive's now? Will we get different results when the input is the same?

For example, I run

select percentile_approx(c4_double, array(0.1, 0.2, 0.3, 0.4)) from test;

and get different results. c4_double is shown below:

1.0001
2.0001
3.0001
4.0001
5.0001
6.0001
7.0001
8.0001
9.0001
NULL
-8.952
-96.0

Hive: [-87.2952, -6.9615799, 1.30009998, 2.40010003]
Spark 2.x: [-8.952, 1.0001, 2.0001, 3.0001]

So which result is right? Could you please reply when you are free. [~rxin] [~linwei]

was (Author: erlu):
Hi, I am little confused about percentile_approx, is it different from hive's now ? will we get different result when the input is same ? for example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) from test; and get different result. c4_double is show below:
1.0001 2.0001 3.0001 4.0001 5.0001 6.0001 7.0001 8.0001 9.0001 NULL -8.952 -96.0
Hive: [-87.2952,-6.9615799,1.30009998,2.40010003]
spark 2.x: [-8.952,1.0001,2.0001,3.0001]
so which result is right ? Could you pls reply me when you are free.
[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431136#comment-15431136 ]

Sean Zhong edited comment on SPARK-16283 at 8/22/16 4:35 PM:
-

Created a sub-task, SPARK-17188, to move QuantileSummaries to the package org.apache.spark.sql.util of the catalyst project.

was (Author: clockfly):
Created a sub-task to move QuantileSummaries to package org.apache.spark.sql.util of catalyst project
[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374295#comment-15374295 ]

Liwei Lin edited comment on SPARK-16283 at 7/13/16 6:05 AM:

Hive's percentile_approx implementation computes approximate percentile values from a histogram (please refer to [Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java] and [Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java] for details):
- Hive's percentile_approx's signature is {{\_FUNC\_(expr, pc, \[nb\])}}
- the parameter \[nb\] -- the number of histogram bins to use -- is optionally specified by users
- if the number of unique values in the actual dataset is less than or equal to \[nb\], we can expect an exact result; otherwise there are no approximation guarantees

Our Dataset's approxQuantile() implementation is not histogram-based (and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like {{\_FUNC\_(expr, pc, relativeError)}}
- the parameter relativeError is specified by users and should be in \[0, 1\]; our approximation is deterministically bounded by this relativeError -- please refer to [Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39] for details

Since there's no direct deterministic relationship between \[nb\] and relativeError, it seems hard to build Hive's percentile_approx on top of our Dataset's approxQuantile(). So should we:
(a) port Hive's implementation into Spark, and provide {{\_FUNC\_(expr, pc, \[nb\])}} on top of it, or
(b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top of our Dataset's approxQuantile() implementation, even though this might be incompatible with Hive?

[~rxin], [~thunterdb], could you share some thoughts? Thanks!

was (Author: proflin):
Hive's percentile_approx implementation computes approximate percentile values from a histogram (please refer to [Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java] and [Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java] for details):
- Hive's percentile_approx's signature is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally specified by users
- if the number of unique values in the actual dataset is less than or equals to this \[nb\], we can expect an exact result; otherwise there are no approximation guarantees

Our Dataset's approxQuantile() implementation is not really histogram-based (and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like: {{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our approximation is deterministicly bounded by this relativeError -- please refer to [Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39] for details

Since there's no direct deterministic relationship between \[nb\] and relativeError, it seems hard to build Hive's percentile_approx on top of our Dataset's approxQuantile(). So, [~rxin], [~thunterdb], should we: (a) port Hive' implementation into Spark, and provide {{\_FUNC\_(expr, pc, \[nb\])}} on top of it, or (b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top of our Dataset's approxQuantile() implementation, but this might be incompatible with Hive? Could you share some thoughts? Thanks !
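To make the relativeError bound concrete: Spark's approxQuantile documentation states that for target quantile p and relative error err, the returned value x is an element of the data whose rank satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). A small checker along these lines, as a sketch (the helper name within_rank_guarantee is made up for illustration, and the default err = 1e-4 corresponds to percentile_approx's default accuracy of 10000):

```python
import bisect
import math

def within_rank_guarantee(values, p, x, eps):
    """Check a Spark-style approximate-quantile rank guarantee:
    x must be an element of the (non-null) data whose 1-based rank lies in
    [floor((p - eps) * N), ceil((p + eps) * N)].
    Illustrative helper, not part of Spark's API.
    """
    s = sorted(v for v in values if v is not None)
    n = len(s)
    # 1-based rank range occupied by x (handles duplicate values)
    lo_rank = bisect.bisect_left(s, x) + 1
    hi_rank = bisect.bisect_right(s, x)
    if hi_rank < lo_rank:              # x is not an element of the data
        return False
    lo_ok = math.floor((p - eps) * n)
    hi_ok = math.ceil((p + eps) * n)
    return lo_rank <= hi_ok and hi_rank >= lo_ok

data = [1.0001, 2.0001, 3.0001, 4.0001, 5.0001, 6.0001, 7.0001,
        8.0001, 9.0001, None, -8.952, -96.0]

# Spark 2.x's answer for p = 0.1 satisfies the rank bound...
print(within_rank_guarantee(data, 0.1, -8.952, 1e-4))
# ...while Hive's histogram output is an interpolated value that need not
# even appear in the data, so a rank guarantee of this form cannot hold:
print(within_rank_guarantee(data, 0.1, -6.9615799, 1e-4))
```

This is exactly why the two results can differ without either engine being "wrong": Spark promises a deterministic rank bound over actual data elements, while Hive's histogram interpolates synthetic values with no such bound.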
[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374295#comment-15374295 ]

Liwei Lin edited comment on SPARK-16283 at 7/13/16 4:01 AM:

Hive's percentile_approx implementation computes approximate percentile values from a histogram (please refer to [Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java] and [Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java] for details):
- Hive's percentile_approx's signature is {{\_FUNC\_(expr, pc, \[nb\])}}
- the parameter \[nb\] -- the number of histogram bins to use -- is optionally specified by users
- if the number of unique values in the actual dataset is less than or equal to \[nb\], we can expect an exact result; otherwise there are no approximation guarantees

Our Dataset's approxQuantile() implementation is not histogram-based (and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like {{\_FUNC\_(expr, pc, relativeError)}}
- the parameter relativeError is specified by users and should be in \[0, 1\]; our approximation is deterministically bounded by this relativeError -- please refer to [Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39] for details

Since there's no direct deterministic relationship between \[nb\] and relativeError, it seems hard to build Hive's percentile_approx on top of our Dataset's approxQuantile().

So, [~rxin], [~thunterdb], should we:
(a) port Hive's implementation into Spark, and provide {{\_FUNC\_(expr, pc, \[nb\])}} on top of it, or
(b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top of our Dataset's approxQuantile() implementation, even though this might be incompatible with Hive?

Could you share some thoughts? Thanks!

was (Author: proflin):
Hive's percentile_approx implementation computes approximate percentile values from a histogram (please refer to [Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java] and [Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java] for details):
- Hive's percentile_approx's signature is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally specified by users
- if the number of unique values in the actual dataset is less than or equals to this \[nb\], we can expect an exact result; otherwise there are no approximation guarantees

Our Dataset's approxQuantile() implementation is not really histogram-based (and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like: {{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our approximation is deterministicly bounded by this relativeError -- please refer to [Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39] for details

Since there's no direct deterministic relationship between \[nb\] and relativeError, it seems hard to build Hive's percentile_approx on top of our Dataset's approxQuantile().

So, [~rxin], [~thunterdb], should we: (a) port Hive' implementation into Spark, and provide {{\_FUNC\_(expr, pc, \[nb\])}} on top of it, or (b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top of our Dataset's approxQuantile() implementation, but this might be incompatible with Hive? Thanks !