Re: Use rank with distribute by in HiveContext

Todd Nist Thu, 16 Jul 2015 06:31:45 -0700

Did you take a look at the excellent write-up by Yin Huai and Michael
Armbrust?  It appears that rank is supported as a window function in the
1.4.x release.

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Snippet from above article for your convenience:

To answer the first question “*What are the best-selling and the second
best-selling products in every category?*”, we need to rank products in a
category based on their revenue, and to pick the best-selling and the
second best-selling products based on the ranking. Below is the SQL query
used to answer this question using the window function dense_rank (we will
explain the syntax of using window functions in the next section).

SELECT
  product,
  category,
  revenue
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank
  FROM productRevenue) tmp
WHERE
  rank <= 2

The result of this query is shown below. Without using window functions, it
is very hard to express the query in SQL, and even if a SQL query can be
expressed, it is hard for the underlying engine to efficiently evaluate the
query.

[image: 1-2]

                    SQL            DataFrame API
Ranking functions   rank           rank
                    dense_rank     denseRank
                    percent_rank   percentRank
                    ntile          ntile
                    row_number     rowNumber
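
Not from the article: in case the semantics are unclear, here is a minimal plain-Python sketch (no Spark involved) of what the dense_rank() query above computes. The rows and the helper name dense_rank_top are made up for illustration; they are not the article's data or any Spark API.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (product, category, revenue) rows; NOT the article's data.
rows = [
    ("A", "phone", 6000),
    ("B", "phone", 6000),
    ("C", "phone", 3000),
    ("D", "phone", 1500),
    ("E", "tablet", 5500),
    ("F", "tablet", 2500),
    ("G", "tablet", 1500),
]


def dense_rank_top(rows, n):
    """Emulate dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC)
    followed by WHERE rank <= n. Illustrative helper, not a Spark function."""
    out = []
    # PARTITION BY category: sort first so groupby sees each category once.
    by_category = sorted(rows, key=itemgetter(1))
    for category, part in groupby(by_category, key=itemgetter(1)):
        # ORDER BY revenue DESC within the partition.
        ordered = sorted(part, key=itemgetter(2), reverse=True)
        rank, prev = 0, None
        for product, _, revenue in ordered:
            # Dense rank: advance by one only when the revenue changes,
            # so tied rows share a rank and no rank values are skipped.
            if revenue != prev:
                rank += 1
                prev = revenue
            if rank <= n:
                out.append((product, category, revenue, rank))
    return out


top2 = dense_rank_top(rows, 2)
for row in top2:
    print(row)
```

With these rows, A and B tie at rank 1 and C still gets rank 2. With rank() instead of dense_rank(), C would get rank 3 because ties leave a gap, so it would be dropped by the rank <= 2 filter.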

 HTH.

-Todd

On Thu, Jul 16, 2015 at 8:10 AM, Lior Chaga <lio...@taboola.com> wrote:

> Does spark HiveContext support the rank() ... distribute by syntax (as in
> the following article-
> http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive
> )?
>
> If not, how can it be achieved?
>
> Thanks,
> Lior
>
