Hi,

I need to process some events in a specific order based on a timestamp, for
each user in my data.

I implemented this by using the DataFrame sort method to sort by user id
first and by timestamp second, and then doing a groupBy().mapValues() to
process the events for each user.

However, on re-reading the docs I see that groupByKey() does not guarantee
any ordering of the values. Yet my code (which will fall over on
out-of-order events) seems to run OK so far, in local mode on a machine
with 8 CPUs.

I guess the easiest way to be certain would be to sort the values after the
groupByKey, but I'm wondering if using mapPartitions() to process all
entries in a partition would work, given that I have pre-sorted the data?
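
To make the first option concrete, here is a minimal sketch of what I mean by
sorting the values after the grouping. It is plain Python so it runs without
Spark; the function name process_in_order and the (timestamp, payload) event
layout are just placeholders I made up for illustration, and the "processing"
is a stand-in for my real logic.

```python
from operator import itemgetter


def process_in_order(events):
    """Sort one user's (timestamp, payload) events, then process them in order."""
    # Do not rely on the order in which the grouped values arrive:
    # sort explicitly by timestamp first.
    ordered = sorted(events, key=itemgetter(0))
    # Placeholder processing: return the payloads in timestamp order.
    return [payload for _, payload in ordered]


# Stand-in for the grouped data, with values deliberately out of order.
# With Spark I would expect to apply the same function per key, e.g.
#   rdd.groupByKey().mapValues(process_in_order)
# where rdd holds (user_id, (timestamp, payload)) pairs.
grouped = {
    "u1": [(3, "c"), (1, "a"), (2, "b")],
    "u2": [(2, "y"), (1, "x")],
}
result = {user: process_in_order(events) for user, events in grouped.items()}
print(result)
```

The downside is that each user's events get sorted a second time, and all of
a user's events have to fit in memory at once.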

This would require a bit more work to track when I switch from one user to
the next as I process the events, but if the original order has been
preserved on reading the events in, this should work.
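
Roughly, the per-partition function I have in mind would look like the sketch
below. Again this is plain Python, and process_partition and the
(user_id, timestamp, payload) row layout are hypothetical names of mine. It
assumes the rows in a partition really do arrive sorted by user id and then
timestamp, which is exactly the part I am unsure about, so it fails loudly
if that assumption is violated.

```python
from itertools import groupby
from operator import itemgetter


def process_partition(rows):
    """Process (user_id, timestamp, payload) rows from one partition.

    Assumes rows are already sorted by (user_id, timestamp).
    """
    # groupby starts a new group each time the user id changes, which
    # handles the "switch from one user to the next" bookkeeping.
    for user_id, user_rows in groupby(rows, key=itemgetter(0)):
        last_ts = None
        payloads = []
        for _, ts, payload in user_rows:
            # Guard against the ordering assumption being wrong.
            if last_ts is not None and ts < last_ts:
                raise ValueError(f"out-of-order event for user {user_id}")
            last_ts = ts
            payloads.append(payload)  # placeholder for the real processing
        yield user_id, payloads


# Stand-in for one pre-sorted partition. With Spark I would expect to pass
# the function to mapPartitions, e.g. rdd.mapPartitions(process_partition).
rows = [("u1", 1, "a"), ("u1", 2, "b"), ("u2", 1, "x")]
out = list(process_partition(iter(rows)))
print(out)
```

One thing this does not handle is a single user's events being split across
two partitions; in that case the user would show up once per partition.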

Anyone know definitively if this is the case?

Regards,

James
