Hi guys,

I'm hoping someone can help me make my setup more efficient. I'm trying to do
record linkage across 2.5 billion records and have set myself up in Spark to
handle the data. Right now I'm relying on R (with the stringdist and
RecordLinkage packages) to do the actual linkage: I transfer small batches at
a time from Spark to R, do the linkage there, and send the resulting IDs
linking the records back. What I'd like to do is set up Spark so that the
string distance measures (which I'm currently computing in R) can be computed
directly in Spark, avoiding the data transfers. How can I go about doing this?
An example of my data is provided below, along with example R code that I'm
using below that (n.b. myMatchingFunction calls functions from the stringdist
and RecordLinkage packages). I'm open to switching to Python or Scala if I
need to, or to incorporating the C++ code for the string comparators into
Spark. Thanks.
—Linh

FIRSTNAME  LASTNAME  EMAIL                 PHONE    ADDRESS           DESIRED_ID
John       Smith     johnsm...@domain.com           1234 Main St.     1
John       Smith                           1234567                    1
J          Smith     johnsm...@domain.com  1234567  1234 Main Street  1
Jane       Smith                           2345678                    2
Jane       Smith     janesm...@domain.com  2345678  5678 1st Street   2
Jane       Smith                                    5678 First St.    2
Jane       Smith                                                      3


# Read the break-code table from Parquet, cache it, and register it for SQL
uk_breakcodes <- read.df(sqlContext, "LT_IDMRQ_BREAK_CODES_UK.parquet", "parquet")
cache(uk_breakcodes)
registerTempTable(uk_breakcodes, "LT_IDMRQ_BREAK_CODES_UK")

# Pull one batch of candidate records (sharing a break code) into local R
tmp <- collect(sql(sqlContext, "select * from LT_IDMRQ_BREAK_CODES_UK where
  ADDRBRKCD = 'XXXXXXX' or BADDRBRKCD = 'XXXXXXX' or OPBRKCD = 'XXXXXX'"))

# Run the stringdist/RecordLinkage-based matching on the batch in R
matched <- myMatchingFunction(tmp)
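
For what it's worth, the direction I've been imagining (but haven't got
working) is leaning on Spark SQL's built-in string functions, e.g.
levenshtein() and soundex() (available since Spark 1.5), so the distances are
computed inside Spark and only scored candidate pairs ever come back to R. A
minimal sketch, pretending the registered table has the columns from my
example data above, with a placeholder blocking key and threshold:

# Sketch only: block on soundex(LASTNAME), then score candidate pairs with
# levenshtein() inside Spark; nothing is collected into local R here.
candidate_pairs <- sql(sqlContext, "
  select a.DESIRED_ID as id_a, b.DESIRED_ID as id_b,
         levenshtein(a.FIRSTNAME, b.FIRSTNAME) as firstname_dist,
         levenshtein(a.ADDRESS,   b.ADDRESS)   as address_dist
  from   LT_IDMRQ_BREAK_CODES_UK a
  join   LT_IDMRQ_BREAK_CODES_UK b
    on   soundex(a.LASTNAME) = soundex(b.LASTNAME) -- crude blocking key
  where  levenshtein(a.FIRSTNAME, b.FIRSTNAME) <= 3 -- placeholder threshold
")
# Inspect a handful of scored pairs; collect() only this reduced result if needed
showDF(candidate_pairs, numRows = 10)

That only covers the comparators Spark ships with, though; the other measures
I use from stringdist/RecordLinkage would still need a UDF of some kind, which
is part of what I'm asking about.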
