Re: Get complete row with latest timestamp after a groupBy?

2015-11-06 Thread bghit
You are trying to get the top-k most recent records for each user (k=1 in
your case). You should avoid groupBy (groupByKey on an RDD) here: it
shuffles every record for a key across the network before anything is
reduced, which hurts performance in Spark; see [1] for more details.
Instead, you can use the combineByKey function with a custom combiner
that keeps an ordered list of the k most recent transactions per user, so
each partition is reduced locally before the shuffle.

[1]
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
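
A minimal sketch of the custom combiner described above, written in plain
Python so it runs without a cluster. The record layout (timestamp, payload),
the sample data, and the helper combine_by_key_local are assumptions made for
illustration; only the three combiner functions are what you would hand to
combineByKey.

```python
import heapq

K = 1  # number of most recent records to keep per user (k=1 here)

def create_combiner(record):
    # First record seen for a key in a partition starts the list.
    return [record]

def merge_value(acc, record):
    # Fold one more record into a partition-local list, keeping the K newest.
    return heapq.nlargest(K, acc + [record], key=lambda r: r[0])

def merge_combiners(a, b):
    # Merge two partition-local lists after the shuffle, keeping the K newest.
    return heapq.nlargest(K, a + b, key=lambda r: r[0])

def combine_by_key_local(partitions):
    # Hypothetical local stand-in for combineByKey: reduce inside each
    # partition first, then merge the per-partition results.
    partials = []
    for part in partitions:
        acc = {}
        for key, rec in part:
            acc[key] = merge_value(acc[key], rec) if key in acc else create_combiner(rec)
        partials.append(acc)
    merged = {}
    for acc in partials:
        for key, comb in acc.items():
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged

# Assumed sample data: (user, (timestamp, payload)) pairs in two partitions.
partitions = [
    [("alice", (100, "login")), ("bob", (90, "view")), ("alice", (120, "buy"))],
    [("alice", (110, "view")), ("bob", (95, "buy"))],
]
latest = combine_by_key_local(partitions)
```

On a real pair RDD the same functions would be passed as
rdd.combineByKey(create_combiner, merge_value, merge_combiners). Each list
never grows past K entries, so only K records per user per partition cross
the network.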



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Get-complete-row-with-latest-timestamp-after-a-groupBy-tp25304p25305.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Get complete row with latest timestamp after a groupBy?

2015-11-06 Thread bghit
I asked the same question a few days ago, but did not receive an answer.
You may want to look into user-defined aggregate functions (UDAFs) for that.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Get-complete-row-with-latest-timestamp-after-a-groupBy-tp25304p25308.html