Hello, I'm trying to link records from two large data sources. Both datasets have almost the same number of rows. The goal is to match records based on multiple columns.
val matchId = SFAccountDF.as("SF").join(ELAccountDF.as("EL"))
  .where($"SF.Email" === $"EL.EmailAddress" || $"SF.Phone" === $"EL.Phone")

Joining with an OR (||) results in a CartesianProduct, which I'm trying to avoid. One way to do this is to join on each column separately and UNION the results:

val phoneMatch = SFAccountDF.as("SF").filter("Phone != ''")
  .join(ELAccountDF.as("EL").filter("BusinessPhone != ''"))
  .where($"SF.Phone" === $"EL.BusinessPhone")

val emailMatch = SFAccountDF.as("SF").filter("Email != ''")
  .join(ELAccountDF.as("EL").filter("EmailAddress != ''"))
  .where($"SF.Email" === $"EL.EmailAddress")

// faxMatch and mobileMatch are built the same way
val matchId = phoneMatch.unionAll(emailMatch.unionAll(faxMatch.unionAll(mobileMatch)))
matchId.cache().registerTempTable("matchId")

Is there a more elegant way to do this?

On a related note, has anyone worked on record linkage in Spark using Bloom filters, Levenshtein distance, etc.?

Srikanth
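One way to tidy the chain of unionAlls is to fold a per-key join over a list of key extractors. The sketch below shows that shape in plain Scala collections (no Spark dependency); the record type, fields, and sample data are made-up stand-ins for the SF/EL account rows. In Spark the same structure would map each column pair to a join and reduce the resulting DataFrames with unionAll:

```scala
// Hypothetical stand-in for one account row from either side.
case class Rec(id: String, email: String, phone: String)

// For each key extractor, inner-join the two sides on that key
// (skipping empty values), then union and deduplicate the id pairs --
// the same shape as mapping each column pair to a DataFrame join
// and reducing with unionAll.
def matchOnAny(left: Seq[Rec], right: Seq[Rec],
               keys: Seq[Rec => String]): Seq[(String, String)] =
  keys.flatMap { key =>
    for {
      l <- left  if key(l).nonEmpty
      r <- right if key(r).nonEmpty && key(l) == key(r)
    } yield (l.id, r.id)
  }.distinct

// Usage: match on email OR phone without a cross product.
val sf = Seq(Rec("sf1", "a@x.com", "111"), Rec("sf2", "", "222"))
val el = Seq(Rec("el1", "a@x.com", "999"), Rec("el2", "", "222"))
val matched = matchOnAny(sf, el, Seq(_.email, _.phone))
// matched == Seq(("sf1", "el1"), ("sf2", "el2"))
```

Adding a new match column then means appending one extractor to the list rather than another nested unionAll.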