[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2017-03-09 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904397#comment-15904397
 ] 

Zhenhua Wang commented on SPARK-16283:
--

[~erlu] I think it's been made clear from above discussions, Spark' result 
doesn't have to be the same as Hive's result.

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2017-03-08 Thread chenerlu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901132#comment-15901132
 ] 

chenerlu commented on SPARK-16283:
--

Hi, I am little confused about percentile_approx, is it different from hive's 
now ? will we get different result when the input is same ?

for example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) 
from test; and get different result.

c4_double is show below:
1.0001
2.0001
3.0001
4.0001
5.0001
6.0001
7.0001
8.0001
9.0001
NULL
-8.952
-96.0

Hive:
[-87.2952,-6.9615799,1.30009998,2.40010003]

spark 2.x:
[-8.952,1.0001,2.0001,3.0001]

so which result is right ? Could you pls reply me when you are free.



> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15643008#comment-15643008
 ] 

Apache Spark commented on SPARK-16283:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14237

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-08-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450696#comment-15450696
 ] 

Apache Spark commented on SPARK-16283:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14868

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-08-22 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431136#comment-15431136
 ] 

Sean Zhong commented on SPARK-16283:


Created a sub-task to move QuantileSummaries to package 
org.apache.spark.sql.util of catalyst project

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387602#comment-15387602
 ] 

Apache Spark commented on SPARK-16283:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14237

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387386#comment-15387386
 ] 

Apache Spark commented on SPARK-16283:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14298

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387382#comment-15387382
 ] 

Apache Spark commented on SPARK-16283:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14237

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15381335#comment-15381335
 ] 

Apache Spark commented on SPARK-16283:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14237

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-14 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378634#comment-15378634
 ] 

Liwei Lin commented on SPARK-16283:
---

Thanks for the clarification. I'm working on this one, thanks!

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378299#comment-15378299
 ] 

Reynold Xin commented on SPARK-16283:
-

We just need a function, and doesn't need it to be identical to Hive's result.


> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-14 Thread Tim Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378296#comment-15378296
 ] 

Tim Hunter commented on SPARK-16283:


Are we trying to reproduce Hive's results here? In this case, then yes there is 
no choice but port Hive's code. If we just want to have an equivalent result, 
then we can use the following pseudo-python-code:

{code}
def percentile_approx(df, x, num_hist):
  return quantile_approx(df, x, max(1/num_hist, 1e-3) )
{code}

The final result has the advantage over hive to have theoretical bounds on the 
result. The only issue is that the runtime in this case is O(num_hist ^ 2) 
(instead of linear) if I remember correctly.

Also, if we want to spend more time on improving the algorithms, I would prefer 
something that has some known guarantees rather than something completely novel.

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-13 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375431#comment-15375431
 ] 

Kai Jiang commented on SPARK-16283:
---

I also noticed that there is an inconsistency between hive's approach and 
dataset's approach. Which one should we go with? Cause it's a function passed 
over to hive, I vote to port hive's implementation to spark.  [~rxin], 
[~thunterdb] could you share some ideas on this? Thanks!  I also would love to 
try on this one once we decide which way to go with.

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-12 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374295#comment-15374295
 ] 

Liwei Lin commented on SPARK-16283:
---

Hive's percentile_approx implementation computes approximate percentile values 
from a histogram (please refer to 
[Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java]
 and 
[Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java]
 for details):
- Hive's percentile_approx's signature is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally 
specified by users
- if the number of unique values in the actual dataset is less than or equals 
to this \[nb\], we can expect an exact result; otherwise there are no 
approximation guarantees


Our Dataset's approxQuantile() implementation is not really histogram-based 
(and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like: 
{{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our 
approximation is deterministicly bounded by this relativeError -- please refer 
to 
[Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39]
 for details


Since there's no direct deterministic relationship between \[nb\] and 
relativeError, it seems hard to build Hive's percentile_approx on top of our 
Dataset's approxQuantile(). So, [~rxin], [~thunterdb], should we: (a) port 
Hive' implementation into Spark, and provide {{\_FUNC\_(expr, pc, \[nb\])}} on 
top of it, or (b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top 
of our Dataset's approxQuantile() implementation, but this might be 
incompatible with Hive? Thanks !

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-11 Thread Tim Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371112#comment-15371112
 ] 

Tim Hunter commented on SPARK-16283:


We should, the algorithm picked is optimized for this use case.

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-07-10 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370051#comment-15370051
 ] 

Reynold Xin commented on SPARK-16283:
-

[~thunterdb] can we use your implementation for percentile_approx?


> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org