Hi, I need to process some events in a specific order based on a timestamp, for each user in my data.
I implemented this by using the DataFrame sort method to sort by user ID and then by timestamp, followed by a groupBy().mapValues() to process the events for each user. However, on re-reading the docs I see that groupByKey() does not guarantee any ordering of the values. Even so, my code (which will fall over on out-of-order events) seems to run OK so far, though only in local mode on a machine with 8 CPUs.

I guess the easiest way to be certain would be to sort the values after the groupByKey(), but I'm wondering whether using mapPartitions() to process all entries in a partition would work instead, given that I had pre-ordered the data. This would require a bit more work to track when I switch from one user to the next as I process the events, but if the original order is preserved when the events are read in, it should work.

Does anyone know definitively whether this is the case?

Regards,
James
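
P.S. To make the two options concrete, here is a rough sketch in plain Python of the functions I have in mind. All names are made up for illustration, and process_user_events is just a stand-in for my real per-user logic. The second function is only correct if every event for a user lands in the same partition and arrives in sorted order, which is exactly what I'm unsure about.

```python
from itertools import groupby
from operator import itemgetter


def process_user_events(events):
    """Hypothetical per-user logic: events is a list of (timestamp, payload).

    Like my real code, it falls over on out-of-order events.
    """
    timestamps = [ts for ts, _ in events]
    assert timestamps == sorted(timestamps), "events out of order"
    return timestamps


def process_sorted_values(values):
    """Option 1: meant for groupByKey().mapValues(...).

    Sorts each user's events by timestamp explicitly, so it does not
    depend on any ordering coming out of the shuffle.
    """
    return process_user_events(sorted(values, key=itemgetter(0)))


def process_partition(rows):
    """Option 2: meant for mapPartitions(...).

    rows is an iterator of (user_id, (timestamp, payload)) pairs.
    groupby starts a new group each time the user_id changes, so this
    assumes the rows arrive already sorted by user and then timestamp.
    """
    for user_id, group in groupby(rows, key=itemgetter(0)):
        yield user_id, process_user_events([value for _, value in group])
```

Assuming a pair RDD called events keyed by user ID, the first would be used as events.groupByKey().mapValues(process_sorted_values) and the second as events.mapPartitions(process_partition).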