Hi, I can try to guess what is wrong, but I might be incorrect.
You should be careful with window frames (you define them via the rowsBetween() method). In my understanding, all window functions can be divided into two groups:

- functions defined by the org.apache.spark.sql.catalyst.expressions.WindowFunction trait ("true" window functions)
- all other supported functions, which are marked as window functions by providing a window specification

The main distinction is that functions from the first group can have a predefined internal frame, and that is exactly your case: both row_number() and rank() belong to the first group (i.e., they have predefined internal frames). To make your case work, you have two options:

- remove your own frame specification (i.e., rowsBetween(0, 49)) and use only Window.partitionBy(hivetable.col("location")) (a sketch follows below)
- state the correct window frame explicitly, for instance rowsBetween(Long.MinValue, 0) for rank() and row_number()

By the way, there is not much documentation on how Spark resolves window frames. For that reason, I created a small pull request (https://github.com/apache/spark/pull/14050) that may help. It would be nice if someone experienced could take a look at it, since it is based only on my own analysis.
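For illustration, here is a minimal sketch of the first option against your code below. Note that the orderBy(desc("pv")) call is my own addition and an assumption: as far as I know, row_number() needs an explicit ordering in the window specification, and the pre-sorted order of the source table is not guaranteed to survive the shuffle that partitionBy() introduces.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{desc, row_number}

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    val hivetable = hc.sql("select * from house_sale_pv_location")

    // no rowsBetween() here: row_number() brings its own internal frame
    val overLocation = Window
      .partitionBy(hivetable.col("location"))
      .orderBy(desc("pv"))  // assumed sort key; adjust if pv is not it

    // number the rows inside each location partition, then keep the top 50
    val top50 = hivetable
      .withColumn("rowNumber", row_number().over(overLocation))
      .where("rowNumber <= 50")
      .select("id", "location")

    top50.where("location = 30").show()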
2016-07-07 13:26 GMT+02:00 <luohui20...@sina.com>:

> hi Anton:
>     I checked the docs you mentioned and coded accordingly; however, I
> met an exception like "org.apache.spark.sql.AnalysisException: Window
> function row_number does not take a frame specification.;"
>     It seems that the row_number API gives global row numbers to every
> row across all frames, by my understanding. If that is wrong, please
> correct me.
>     I checked all the window function APIs at
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$,
> and found that maybe row_number() matches. I am not quite sure about it.
>
> here is my code:
>     val hc = new org.apache.spark.sql.hive.HiveContext(sc)
>     val hivetable = hc.sql("select * from house_sale_pv_location")
>     val overLocation = Window.partitionBy(hivetable.col("location")).rowsBetween(0, 49)
>     val sortedDF = hivetable.withColumn("rowNumber", row_number().over(overLocation))
>     sortedDF.registerTempTable("sortedDF")
>     val top50 = hc.sql("select id,location from sortedDF where rowNumber<=50")
>     top50.registerTempTable("top50")
>     hc.sql("select * from top50 where location=30").collect.foreach(println)
>
> here, hivetable is the DF I mentioned, with 3 columns "id, pv, location",
> which is already sorted by pv in descending order. So I didn't call
> orderBy in the 3rd line of my code. I just want the first 50 rows of each
> frame, based on their physical position.
>
> To Tal:
>     I tried the rank API; however, it is not the API I want, because rows
> that have the same pv are ranked with the same value, and the first 50
> rows of each frame is what I'm expecting. The attached file shows what I
> got by using rank.
>     Thank you anyway; I learned what rank could provide from your advice.
>
> --------------------------------
>
> Thanks&Best regards!
> San.Luo
>
> ----- Original Message -----
> From: Anton Okolnychyi <anton.okolnyc...@gmail.com>
> To: user <user@spark.apache.org>
> Cc: luohui20...@sina.com
> Subject: Re: how to select first 50 value of each group after group by?
> Date: 2016-07-06 23:22
>
> The following resources should be useful:
>
> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html
>
> The last link should have the exact solution.
>
> 2016-07-06 16:55 GMT+02:00 Tal Grynbaum <tal.grynb...@gmail.com>:
>
> You can use the rank window function to rank each row in the group, and
> then filter the rows with rank <= 50.
>
> On Wed, Jul 6, 2016, 14:07 <luohui20...@sina.com> wrote:
>
> hi there
>     I have a DF with 3 columns: id, pv, location (the rows are already
> grouped by location and sorted by pv in descending order). I want to get
> the first 50 id values of each group, grouped by location. I checked the
> APIs of DataFrame, GroupedData, and PairRDD, and found no match.
>     Is there a way to do this naturally?
>     Any info will be appreciated.
>
> --------------------------------
>
> Thanks&Best regards!
> San.Luo
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org