I have two DataFrames, let's call them A and B. A consists of [unique_id, field1] and B consists of [unique_id, field2].
They have exactly the same number of rows, and every id in A is also present in B. If I execute a join like A.join(B, Seq("unique_id")).select($"unique_id", $"field1"), then Spark performs an expensive join even though it doesn't have to, because all the fields it needs are already in A. Is there some trick I can use so that Catalyst will optimise this join away?
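For reference, here is a minimal, self-contained sketch of the setup (the data and the local-mode SparkSession are illustrative, not my real job) that shows the join surviving optimisation when you inspect the plan with explain:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; my real job runs on a cluster.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-elimination-question")
  .getOrCreate()
import spark.implicits._

// Toy stand-ins for A and B: same row count, same set of ids.
val A = Seq((1L, "x"), (2L, "y")).toDF("unique_id", "field1")
val B = Seq((1L, 10), (2L, 20)).toDF("unique_id", "field2")

val q = A.join(B, Seq("unique_id")).select($"unique_id", $"field1")

// The optimized logical plan still contains the join, even though
// no column from B is selected.
q.explain(true)
```

My understanding is that Catalyst can't eliminate the join on its own because, in general, an inner join can drop or duplicate rows of A (if B were missing some ids, or contained duplicates); the guarantee that B matches A one-to-one exists only in my head, not in anything the optimiser can see.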