Hi Magnus,

Yes, I was also thinking about a partitioning approach, and I think it is
the best solution for this type of scenario.

My scenario also matches your last paragraph: the incoming dates are very
random, so I can get updates from 2012 as well as from 2019. Therefore,
this strategy might not be the best, because when I join on, say,
month = month AND year = year, I don't think I will get much of a
performance gain. But I will try this approach first.
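
Roughly what I have in mind is something along these lines (the paths and
the column names id/year/month are just placeholders for my data, not the
real schema):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-join-sketch")
      .getOrCreate()

    // Existing data, written partitioned by year and month (placeholder path).
    val existing = spark.read.parquet("/data/events")

    // Incoming updates whose dates are spread across many years.
    val updates = spark.read.parquet("/data/updates")

    // Join on the partition columns plus the record key. Because the updates
    // touch many different year/month partitions, partition pruning may not
    // help much here.
    val joined = existing.join(updates, Seq("year", "month", "id"), "inner")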

If that doesn't work, I will experiment with different partitioning
schemes.
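
For example, one alternative I might try is partitioning the output by year
only, or bucketing by the join key so the randomly dated updates don't
scatter across too many small partitions (again, paths, table names, and
column names here are only placeholders):

    // Coarser partitioning: by year only.
    existing.write
      .partitionBy("year")
      .mode("overwrite")
      .parquet("/data/events_by_year")

    // Or bucket by the key, so joins on "id" can avoid a full shuffle.
    // Note that bucketBy requires saveAsTable rather than a plain path.
    existing.write
      .bucketBy(8, "id")
      .sortBy("id")
      .mode("overwrite")
      .saveAsTable("events_bucketed")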

Thanks


