You are trying to get the top-k most recent records for each user (k=1 in your case). Avoid groupBy for this: it shuffles every record for a key to one place and is an expensive operation in Spark -- see [1] for details. Instead, use combineByKey with a custom combiner that keeps a bounded, ordered list of the most recent transactions per user, so only the top-k candidates survive each partition-local merge.
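A sketch of what those combiner functions might look like, in Python. The three functions are shaped the way PySpark's `rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)` expects; the record layout (a dict with a `ts` timestamp field), the field names, and the `local_combine_by_key` driver that stands in for an actual RDD are assumptions for illustration, so the logic can be checked without a cluster:

```python
K = 1  # keep the top-k most recent records per user (k=1 in the question)

def create_combiner(record):
    # start a combiner: a list of (timestamp, record), newest first
    return [(record["ts"], record)]

def merge_value(acc, record):
    # fold one more record into a partition-local combiner, keeping only top-k
    acc.append((record["ts"], record))
    acc.sort(key=lambda p: p[0], reverse=True)
    return acc[:K]

def merge_combiners(a, b):
    # merge combiners produced by different partitions
    merged = a + b
    merged.sort(key=lambda p: p[0], reverse=True)
    return merged[:K]

def local_combine_by_key(pairs, n_partitions=2):
    # toy stand-in for rdd.combineByKey: combine within each "partition",
    # then merge the per-partition combiners across partitions
    partitions = [dict() for _ in range(n_partitions)]
    for i, (key, rec) in enumerate(pairs):
        part = partitions[i % n_partitions]
        part[key] = merge_value(part[key], rec) if key in part else create_combiner(rec)
    out = {}
    for part in partitions:
        for key, comb in part.items():
            out[key] = merge_combiners(out[key], comb) if key in out else comb
    return out

rows = [
    ("alice", {"ts": 3, "item": "c"}),
    ("alice", {"ts": 1, "item": "a"}),
    ("bob",   {"ts": 5, "item": "e"}),
    ("alice", {"ts": 2, "item": "b"}),
    ("bob",   {"ts": 4, "item": "d"}),
]
latest = local_combine_by_key(rows)
# latest["alice"] -> [(3, {"ts": 3, "item": "c"})]
```

Because each combiner is trimmed to K entries as it goes, only k records per user per partition ever cross the network, instead of every record as with groupBy.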
[1] https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.