I have two DataFrames; let's call them A and B.

A consists of the columns [unique_id, field1];
B consists of the columns [unique_id, field2].

They have exactly the same number of rows, and every unique_id in A is
also present in B.

If I execute a join like this: A.join(B,
Seq("unique_id")).select($"unique_id", $"field1"), then Spark will
perform an expensive join even though it does not have to, because all
the columns it needs are already in A. Is there some trick I can use so
that Catalyst will optimise this join away?
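
For concreteness, here is a minimal sketch of what I mean (the data and
the local SparkSession setup are made up purely for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("JoinPruneExample")
  .getOrCreate()
import spark.implicits._

// Made-up data: one row per id, identical id sets on both sides.
val A = Seq((1L, "a1"), (2L, "a2"), (3L, "a3")).toDF("unique_id", "field1")
val B = Seq((1L, "b1"), (2L, "b2"), (3L, "b3")).toDF("unique_id", "field2")

// Only columns from A are selected, yet the physical plan still
// contains a join (e.g. SortMergeJoin or BroadcastHashJoin).
val result = A.join(B, Seq("unique_id")).select($"unique_id", $"field1")
result.explain(true)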
