I have two tables in Spark:

    T1
     |-- x1
     |-- x2
    T2
     |-- z1
     |-- z2

- T1 is much larger than T2.
- The values in column z2 are *very large*.
- There is a many-to-one relationship from T1 to T2 (via the x2 and z1 columns).

I perform the following query:

    select T1.x1, T2.z2
    from T1
    join T2 on T1.x2 = T2.z1

In the resulting data set, the same value from T2.z2 will be repeated across many values of T1.x1. Since this value is very large, I am concerned whether the data is actually duplicated, or whether there are internal optimisations that keep only references.

p.s. Originally posted on SO: <https://stackoverflow.com/q/49716385/180650>
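To make the question concrete, here is a minimal pure-Python sketch of the join's *logical* result (an illustration, not Spark internals): a many-to-one join repeats each z2 value once per matching T1 row. In plain Python the repeated entries are references to one shared object; whether Spark does the same depends on the physical plan (e.g. a broadcast hash join shares the in-memory build side, but rows serialized through a shuffle or written out generally carry their own copy). The table contents below are made-up placeholders.

```python
# Hypothetical miniature versions of the two tables from the question.
t1 = [("a", 1), ("b", 1), ("c", 2)]      # T1 rows: (x1, x2), many rows per key
t2 = {1: "Z" * 10, 2: "Y" * 10}          # T2 as z1 -> z2; z2 stands in for a large value

# Logical result of: select T1.x1, T2.z2 from T1 join T2 on T1.x2 = T2.z1
result = [(x1, t2[x2]) for (x1, x2) in t1 if x2 in t2]

# The same z2 value appears in multiple output rows. In this in-memory
# sketch those rows share one object; a distributed engine may or may
# not preserve that sharing once rows are serialized.
for row in result:
    print(row)
```

Rows ("a", …) and ("b", …) both carry the z2 value for key 1, which is exactly the duplication the question asks about.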