I have two tables in Spark:

T1
|--x1
|--x2

T2
|--z1
|--z2


   - T1 is much larger than T2
   - The values in column z2 are *very large*
   - There is a many-to-one relationship between T1 and T2 (via the x2 and
   z1 columns).
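
For concreteness, here is a rough sketch of how the two tables could be set
up (the column types and sample values below are only illustrative
assumptions, not my real data):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-example").getOrCreate()
import spark.implicits._

// T1: the large table; x2 is the key that points into T2.z1
val t1 = Seq((1L, "a"), (2L, "a"), (3L, "b")).toDF("x1", "x2")

// T2: the small table; z2 holds the very large values
val t2 = Seq(("a", "very large payload A"), ("b", "very large payload B")).toDF("z1", "z2")

t1.createOrReplaceTempView("T1")
t2.createOrReplaceTempView("T2")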

I perform the following query:

select T1.x1, T2.z2 from T1
join T2 on T1.x2 = T2.z1
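
For reference, the same join expressed in code, plus explain() to inspect
which physical join strategy Spark chooses (e.g. BroadcastHashJoin vs
SortMergeJoin); the table and column names match the sketch above:

// Same query as above, via spark.sql
val joined = spark.sql("select T1.x1, T2.z2 from T1 join T2 on T1.x2 = T2.z1")

// Equivalent DataFrame API form
// val joined = t1.join(t2, t1("x2") === t2("z1")).select("x1", "z2")

// Print the physical plan to see which join strategy is chosen
joined.explain()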

In the resulting data set, the same value of T2.z2 will be repeated for many
values of T1.x1.

Since these values are very large, is the data actually duplicated, or are
there internal optimisations that keep only references?
P.S. Originally posted on SO: <https://stackoverflow.com/q/49716385/180650>
