Hi, I'm using PySpark (1.6.2) to do a little bit of ETL and have noticed a very odd situation. I have two DataFrames, base and updated. The "updated" DataFrame contains a constrained subset of the data from "base" that I wish to exclude. Something like this:
    updated = base.where(base.X == F.lit(1000))

It's more complicated than that, but you get the idea. Later, I do a left join:

    base.join(updated, 'Core_Column', 'left_outer')

This should return all rows from base, with nulls where updated has no equality match. And that's almost true, but here's where it gets strange:

    base.join(updated, 'Core_Column', 'left_outer') \
        .select(base.FieldId, updated.FieldId, 'updated.*').show()

    +-------+-------+-------+---+---+---+
    |FieldId|FieldId|FieldId|  x|  y|  z|
    +-------+-------+-------+---+---+---+
    |    123|    123|   null|  1|  2|  3|
    +-------+-------+-------+---+---+---+

Now I understand why base.FieldId shows 123, but why does updated.FieldId show 123 as well, when the expansion of 'updated.*' shows null for the same column? I can do what I want by dropping down to an RDD, but I was hoping to avoid bypassing Tungsten. It almost feels like the column is being resolved against the wrong side of the join. But I tested other fields as well, and they also came back with values from base. Very odd. Any thoughts?

Aaron
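P.S. In case it helps, here's a minimal standalone sketch of what I'm doing. The column names and values are made up to mirror the example above, and I've added .alias('updated') so that 'updated.*' resolves, but I believe it shows the same behavior:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F

    sc = SparkContext(appName="join-column-repro")
    sqlContext = SQLContext(sc)

    # Toy stand-in for my real data: 'B' survives the filter, 'A' does not
    base = sqlContext.createDataFrame(
        [('A', 123, 500), ('B', 456, 1000)],
        ['Core_Column', 'FieldId', 'X'])

    # "updated" is derived from "base", so its columns share base's lineage
    updated = base.where(base.X == F.lit(1000)).alias('updated')

    joined = base.join(updated, 'Core_Column', 'left_outer')

    # For row 'A', updated.FieldId prints base's value (123) even though
    # the expanded 'updated.*' columns print null for that row
    joined.select(base.FieldId, updated.FieldId, 'updated.*').show()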