subject:"Use rank with distribute by in HiveContext"

Use rank with distribute by in HiveContext

2015-07-16 Thread Lior Chaga

Does spark HiveContext support the rank() ... distribute by syntax (as in
the following article-
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive
)?

If not, how can it be achieved?

Thanks,
Lior

RE: Use rank with distribute by in HiveContext

2015-07-16 Thread java8964

Yes. The HIVE UDF and distribute by both supported by Spark SQL.
If you are using Spark 1.4, you can try Hive analytics windows function 
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics),most
 of which are already supported in Spark 1.4, so you don't need the customize 
UDF of rank.
Yong
Date: Thu, 16 Jul 2015 15:10:58 +0300
Subject: Use rank with distribute by in HiveContext
From: lio...@taboola.com
To: user@spark.apache.org

Does spark HiveContext support the rank() ... distribute by syntax (as in the 
following article- 
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive 
)?
If not, how can it be achieved?
Thanks,Lior

Re: Use rank with distribute by in HiveContext

2015-07-16 Thread Todd Nist

Did you take a look at the excellent write up by Yin Huai and Michael
Armbrust?  It appears that rank is supported in the 1.4.x release.

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Snippet from above article for your convenience:

To answer the first question “*What are the best-selling and the second
best-selling products in every category?*”, we need to rank products in a
category based on their revenue, and to pick the best selling and the
second best-selling products based the ranking. Below is the SQL query used
to answer this question by using window function dense_rank (we will
explain the syntax of using window functions in next section).

SELECT
  product,
  category,
  revenueFROM (
  SELECT
product,
category,
revenue,
dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank
  FROM productRevenue) tmpWHERE
  rank = 2



The result of this query is shown below. Without using window functions, it
is very hard to express the query in SQL, and even if a SQL query can be
expressed, it is hard for the underlying engine to efficiently evaluate the
query.

[image: 1-2]


SQLDataFrame APIRanking functionsrankrankdense_rankdenseRankpercent_rank
percentRankntilentilerow_numberrowNumber

 HTH.

-Todd

On Thu, Jul 16, 2015 at 8:10 AM, Lior Chaga lio...@taboola.com wrote:

 Does spark HiveContext support the rank() ... distribute by syntax (as in
 the following article-
 http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive
 )?

 If not, how can it be achieved?

 Thanks,
 Lior

Use rank with distribute by in HiveContext

RE: Use rank with distribute by in HiveContext

Re: Use rank with distribute by in HiveContext

3 matches

Site Navigation

Mail list logo

Footer information