You are trying to get the top-k most recent records for each user (k=1 in your case). Avoid groupBy for this: it shuffles every record for a key to one place and is an expensive operation in Spark -- see [1] for details. Instead, use combineByKey with a custom combiner that keeps a bounded, ordered list of the most recent transactions per user, so only the top-k candidates survive each partition-local merge.
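A sketch of what those combiner functions might look like, in Python. The three functions are shaped the way PySpark's `rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)` expects; the record layout (a dict with a `ts` timestamp field), the field names, and the `local_combine_by_key` driver that stands in for an actual RDD are assumptions for illustration, so the logic can be checked without a cluster:

```python
K = 1  # keep the top-k most recent records per user (k=1 in the question)

def create_combiner(record):
    # start a combiner: a list of (timestamp, record), newest first
    return [(record["ts"], record)]

def merge_value(acc, record):
    # fold one more record into a partition-local combiner, keeping only top-k
    acc.append((record["ts"], record))
    acc.sort(key=lambda p: p[0], reverse=True)
    return acc[:K]

def merge_combiners(a, b):
    # merge combiners produced by different partitions
    merged = a + b
    merged.sort(key=lambda p: p[0], reverse=True)
    return merged[:K]

def local_combine_by_key(pairs, n_partitions=2):
    # toy stand-in for rdd.combineByKey: combine within each "partition",
    # then merge the per-partition combiners across partitions
    partitions = [dict() for _ in range(n_partitions)]
    for i, (key, rec) in enumerate(pairs):
        part = partitions[i % n_partitions]
        part[key] = merge_value(part[key], rec) if key in part else create_combiner(rec)
    out = {}
    for part in partitions:
        for key, comb in part.items():
            out[key] = merge_combiners(out[key], comb) if key in out else comb
    return out

rows = [
    ("alice", {"ts": 3, "item": "c"}),
    ("alice", {"ts": 1, "item": "a"}),
    ("bob",   {"ts": 5, "item": "e"}),
    ("alice", {"ts": 2, "item": "b"}),
    ("bob",   {"ts": 4, "item": "d"}),
]
latest = local_combine_by_key(rows)
# latest["alice"] -> [(3, {"ts": 3, "item": "c"})]
```

Because each combiner is trimmed to K entries as it goes, only k records per user per partition ever cross the network, instead of every record as with groupBy.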
[1] https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.