Hello,

I'm trying to link records across two large data sources. Both datasets have
roughly the same number of rows. The goal is to match records based on
multiple columns.

    val matchId = SFAccountDF.as("SF").join(ELAccountDF.as("EL"))
      .where($"SF.Email" === $"EL.EmailAddress" || $"SF.Phone" === $"EL.Phone")


Joining with an OR (||) results in a CartesianProduct, which I'm trying to
avoid. One way around it is to join on each column separately and UNION the
results.


    val phoneMatch = SFAccountDF.as("SF").filter("Phone != ''")
      .join(ELAccountDF.as("EL").filter("BusinessPhone != ''"))
      .where($"SF.Phone" === $"EL.BusinessPhone")
    val emailMatch = SFAccountDF.as("SF").filter("Email != ''")
      .join(ELAccountDF.as("EL").filter("EmailAddress != ''"))
      .where($"SF.Email" === $"EL.EmailAddress")
    // faxMatch and mobileMatch are built the same way
    val matchId = phoneMatch.unionAll(emailMatch.unionAll(faxMatch.unionAll(mobileMatch)))
    matchId.cache().registerTempTable("matchId")


Is there a more elegant way to do this?

On a related note, has anyone worked on record linkage using Bloom filters,
Levenshtein distance, etc., in Spark?
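For context, a minimal sketch of what I mean by Levenshtein distance -- plain
edit distance in Scala, the kind of function one might wrap in a UDF to score
candidate pairs after a cheap blocking join (newer Spark versions also ship a
built-in levenshtein in org.apache.spark.sql.functions; the function name and
approach here are just illustrative):

```scala
// Levenshtein (edit) distance via dynamic programming, keeping only
// the previous row of the DP matrix to stay O(min-memory).
def levenshtein(a: String, b: String): Int = {
  // prev(j) = distance between the first i-1 chars of a and first j chars of b
  var prev = (0 to b.length).toArray
  for (i <- 1 to a.length) {
    val curr = new Array[Int](b.length + 1)
    curr(0) = i // deleting i chars from a
    for (j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      // min of: insertion, deletion, substitution/match
      curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
    }
    prev = curr
  }
  prev(b.length)
}
```

Applied pairwise after blocking (e.g. on a phonetic key or name prefix), this
avoids computing distances across the full cross product.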

Srikanth
