H Roark wrote:
I'm wondering about the behavior of the merge function when using factors as by variables. I know that when you combine two factors using c() the results can be odd, as in:

    c(factor(1:5),factor(6:10))

which prints:

    [1] 1 2 3 4 5 1 2 3 4 5

I presume this is because factors are actually stored as integers, with 6,7,8,9,10 stored internally as 1,2,3,4,5. This concerns me somewhat, as I often merge data frames using factors as the by variables. From what I can tell, the merge function creates matches based on factor labels (i.e. the result of as.character(factor_var)) and not the internally stored integers, but I'm wondering if there are particular lurking problems that I should be aware of?

I'm especially curious as to how R recalculates the levels of the by variables in outer joins where not every observation is matched, as in:

    df1<-data.frame(a=factor(c("a","b")),b=1:2)
    df2<-data.frame(a=factor(c("b","c")),c=2:3)
    df3<-merge(df1,df2,by="a",all=T)
As far as I know, there is no reason to be concerned when using merge as you do. The magic that merge performs for the unmatched rows of an outer join is actually done by rbind, and you should read ?rbind, particularly the section "Data frame methods". You can also study the code of rbind.data.frame (in the base package) to see what it's actually doing.

--Erik
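
A minimal sketch for checking this yourself, using the data frames from the question. The variable names codes and labels are introduced here only for illustration. It looks at a factor's integer codes and its labels separately, then runs the outer join and inspects the levels of the key column. What c() prints for two factors can differ between R versions, so the sketch inspects the codes directly instead.

```r
# A factor stores integer codes plus a vector of labels (its levels).
f <- factor(6:10)
codes  <- as.integer(f)    # internal codes: 1 2 3 4 5
labels <- as.character(f)  # labels: "6" "7" "8" "9" "10"

# The data frames from the question.
df1 <- data.frame(a = factor(c("a", "b")), b = 1:2)
df2 <- data.frame(a = factor(c("b", "c")), c = 2:3)

# Full outer join on the factor column; unmatched rows are kept
# and filled with NA in the columns from the other data frame.
df3 <- merge(df1, df2, by = "a", all = TRUE)

print(df3)
print(levels(df3$a))  # levels of the key column in the merged result
```

If the levels printed for df3$a cover the labels from both inputs, the key column was combined by label rather than by internal code, which is the behavior described in the reply.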