Hi Magnus,

Yes, I was also thinking about a partitioning approach, and I think it is the best solution for this type of scenario.
My scenario also matches your last paragraph: the incoming dates are very random. I can get updates from 2012 as well as from 2019. Therefore this strategy might not be the best, because when I join on, say, month = month AND year = year, I may not get much of a performance gain. But I will try this approach first, and if it doesn't work, I will experiment with different partitioning schemes.

Thanks
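To make the concern concrete, here is a minimal plain-Python sketch (hypothetical data, not Spark code) of why randomly distributed update dates limit the benefit of a (year, month) partition key: if updates arrive with dates scattered across 2012 through 2019, nearly every (year, month) partition is touched, so a join on year = year AND month = month still has to read most of the table and partition pruning buys little.

```python
from datetime import date
import random

random.seed(42)

# Hypothetical stream of 1000 update rows whose dates are spread
# randomly over 2012-2019, mirroring the scenario in the thread.
updates = [date(random.randint(2012, 2019), random.randint(1, 12), 1)
           for _ in range(1000)]

# The proposed partition/join key: (year, month).
touched = {(d.year, d.month) for d in updates}

# 8 years x 12 months of partitions in total.
total_partitions = 8 * 12

# With random dates, almost every partition is touched, so a
# partition-pruned join on (year, month) skips very little data.
print(f"partitions touched: {len(touched)} of {total_partitions}")
```

If instead the updates clustered into a few recent months, `touched` would be a small set and the same join would prune most partitions, which is exactly the situation where this strategy pays off.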