I am using the compare.linkage function in the RecordLinkage package and getting a result that I know is wrong, so I must be misunderstanding something. I am using R 3.2.3 for x64 Windows. I am very familiar with Stata but not so much with R.
I can create record pairs from the blocking fields, but all pairs have unknown status (NA). I cannot create matches or non-matches. I want a simple working example of how to link datasets using the RecordLinkage package.

It seems that the manual and the R Journal Vol. 2/2 only show how to de-duplicate a single dataset using the compare.dedup function, not how to link two datasets together using the compare.linkage function. I can reproduce the examples in the R Journal article, so my R installation is fine. The example datasets in the manual have 500 and 10000 observations on 7 variables, but 1 observation and 2 variables will be enough to show the problem.

My first comparison pattern looks like this:

  id1 id2 fname_c1 bm is_match
1  17 343        1  1       NA

Instead, I want and expect a comparison pattern that looks like this:

  id1 id2 fname_c1 bm is_match
1  17 343        1  1        1

My blocking variable is fname_c1 (first component of first name). My matching variable is bm (birth month). My understanding is that row 1 in my example output is the first row where fname_c1 matched in the underlying datasets. I want and expect is_match to be 1 when the matching variable bm=1 in both linkage datasets, as in the example.

For more details, this is what I typed and the R output:

> library(RecordLinkage)
> data(RLdata500)
> data(RLdata10000)
> RLdata500[17, ]
    fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
17 ALEXANDER     <NA>  MUELLER     <NA> 1974  9  9
> RLdata10000[343, ]
     fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
343 ALEXANDER     <NA>  BAUMANN     <NA> 1957  9  7
> rpairs <- compare.linkage(RLdata500,RLdata10000,blockfld=c(1), exclude=c(2:5,7))
> rpairs$pairs[c(1:2), ]  # Why is_match=NA? (should be 1)
  id1  id2 fname_c1 bm is_match
1  17  343        1  1       NA
2  17 2385        1  0       NA
> rpairs <- epiWeights(rpairs)  # (Weight calculation)
> summary(rpairs)  # (0 matches in Linkage Dataset)

Linkage Data Set

500 records in data set 1
10000 records in data set 2
47890 record pairs

0 matches
0 non-matches
47890 pairs with unknown status

Weight distribution:
[omitted here to save space]

References:
1. Manual for Package ‘RecordLinkage’ (available online at https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf)
2. R Journal article "The RecordLinkage Package: Detecting Errors in Data" (available online in PDF at https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Sariyar+Borg.pdf)

I saw something in the manual and the R Journal article about an identity argument for true match results, but I guess I only need that for reference ("gold standard") datasets. There is a non-missing value (bm=1) for my example in both underlying datasets, so that is not why the result is NA.

What am I missing? How does one link two simple datasets using compare.linkage?

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.