How does one link two datasets using the compare.linkage function in the RecordLinkage package? This is a follow-up to my original posting from earlier today: https://stat.ethz.ch/pipermail/r-help/2016-January/435736.html

In that posting I suggested that I should perhaps have added the identity arguments. But if I add them, I unexpectedly get 5 matches, 47885 non-matches and 0 pairs with unknown status.

For example, I get a match for row 4256, which is unexpected because the matching variable bm does not agree: its comparison value is 0 in the resulting pair (bm is 1 for BERND JUNG and 4 for BERND MUELLER). Also, is_match in row 1 changes from unknown (NA) to no match (0), which is unexpected since the matching variable bm agrees there (comparison value bm=1).

Here are the major new R commands that I ran, with their output:

> rpairs <- compare.linkage(RLdata500, RLdata10000, blockfld=c(1),
+     identity1=identity.RLdata500, identity2=identity.RLdata10000,
+     exclude=c(2:5,7))
> subset(rpairs$pairs, is_match=="1")  # Why these 5 matches?
      id1  id2 fname_c1 bm is_match
4256   59 1394        1  0        1
5811  174 3684        1  0        1
14699 139 4199        1  0        1
16453  92 4580        1  0        1
21840  73  737        1  0        1
> RLdata500[c(17, 59), ]  # first obs, and first matching obs
    fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
17 ALEXANDER     <NA>  MUELLER     <NA> 1974  9  9
59     BERND     <NA>     JUNG    KLEIN 1935  1 14
> RLdata10000[c(343, 1394), ]  # first obs, and first matching obs
      fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
343  ALEXANDER     <NA>  BAUMANN     <NA> 1957  9  7
1394     BERND     <NA>  MUELLER     <NA> 1942  4  4
> rpairs$pairs[1:2, ]  # list first 2 obs
  id1  id2 fname_c1 bm is_match
1  17  343        1  1        0
2  17 2385        1  0        0

What am I missing? How do I probabilistically link two datasets using the compare.linkage function in the RecordLinkage package?

Anders Alexandersson
andersa...@gmail.com
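
P.S. One check that might narrow this down, offered only as a sketch: it assumes that is_match is filled in purely from the two identity vectors (that is, identity1[id1] == identity2[id2]) and does not depend on the comparison columns fname_c1 and bm. That is my reading of the behaviour, not something I have confirmed in the documentation. The object names derived and p are mine, introduced only for this illustration.

```r
# Sketch only: needs the RecordLinkage package and its bundled example data.
library(RecordLinkage)
data(RLdata500)    # also provides identity.RLdata500
data(RLdata10000)  # also provides identity.RLdata10000

# Same call as in the session above.
rpairs <- compare.linkage(RLdata500, RLdata10000, blockfld = c(1),
                          identity1 = identity.RLdata500,
                          identity2 = identity.RLdata10000,
                          exclude = c(2:5, 7))
p <- rpairs$pairs

# ASSUMPTION: is_match is derived only from the identity vectors,
# indexed by the row numbers stored in id1 and id2.
derived <- as.numeric(identity.RLdata500[p$id1] ==
                      identity.RLdata10000[p$id2])

# TRUE here would support the assumption.
print(all(derived == p$is_match))

# Counts of non-matches (0), matches (1) and unknown status (NA).
print(table(p$is_match, useNA = "ifany"))
```

If the comparison comes out TRUE, the 5 "matches" would simply be pairs whose identity numbers happen to coincide, regardless of whether bm agrees. That in turn would suggest identity.RLdata500 and identity.RLdata10000 may not share a common numbering and so may not be meaningful as identity1/identity2 in the same call, but that too is a guess on my part.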