H Roark wrote:
I'm wondering about the behavior of the merge function when using factors as by 
variables. I know that when you combine two factors using c() the results can 
be odd, as in:

c(factor(1:5),factor(6:10))

which prints: [1] 1 2 3 4 5 1 2 3 4 5

I presume this is because factors are actually stored as integers, with 
6,7,8,9,10 stored internally as 1,2,3,4,5.
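
A minimal sketch of that storage behaviour, using only base R (the variable
names f1, f2 and combined are just for illustration). It avoids calling c()
directly on factors, since what that returns may depend on the R version:

```r
# Two factors whose labels differ but whose internal codes coincide.
f1 <- factor(1:5)
f2 <- factor(6:10)

as.integer(f2)   # internal codes: 1 2 3 4 5
levels(f2)       # labels: "6" "7" "8" "9" "10"

# Combine on the labels rather than on the codes.
combined <- c(as.character(f1), as.character(f2))
as.integer(combined)   # 1 2 3 4 5 6 7 8 9 10
```

Going through as.character() first is the usual way to make sure the labels,
not the codes, are what get combined.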

This concerns me somewhat, as I often merge data frames using factors as the by 
variables. From what I can tell, the merge function creates matches based on 
factor labels (i.e. the result of as.character(factor_var)) and not the 
internally stored integers, but I'm wondering if there are particular lurking 
problems that I should be aware of?  I'm especially curious as to how R 
recalculates the levels of the by variables in outer joins where not every 
observation is matched, as in:

df1<-data.frame(a=factor(c("a","b")),b=1:2)
df2<-data.frame(a=factor(c("b","c")),c=2:3)
df3<-merge(df1,df2,by="a",all=TRUE)

As far as I know, there is no reason to be concerned when using merge
as you do.
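
One way to check this for yourself is to run the example from the question
and inspect the key column afterwards. This is only a quick sanity check
under the assumption that both data frames use a factor key named "a", as in
the question; it is not a proof for every case:

```r
df1 <- data.frame(a = factor(c("a", "b")), b = 1:2)
df2 <- data.frame(a = factor(c("b", "c")), c = 2:3)

# Full outer join: keep unmatched rows from both sides.
df3 <- merge(df1, df2, by = "a", all = TRUE)

as.character(df3$a)   # labels of the key column: "a" "b" "c"
levels(df3$a)         # the levels cover the labels from both inputs
df3$b                 # 1 2 NA  (no row for "c" in df1)
df3$c                 # NA 2 3  (no row for "a" in df2)
```

The rows line up by label: "b" is the only key present in both inputs, and
it is the only row with values in both b and c.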

The magic that ?merge is performing is actually being done in ?rbind,
and you should read the help for that, particularly under "Data frame
methods". You can also study the code of base::rbind.data.frame to see
what it's actually doing.
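
As a small illustration of what the data frame method of rbind does with
factor columns (the objects x, y and z are made up for this sketch), the
levels are expanded as needed and the values are matched by label:

```r
x <- data.frame(a = factor(c("a", "b")))
y <- data.frame(a = factor(c("b", "c")))

# Stack the two data frames; the factor levels are merged by label.
z <- rbind(x, y)

as.character(z$a)   # "a" "b" "b" "c"
levels(z$a)         # covers "a", "b" and "c"
```

Note that "b" has internal code 2 in x but code 1 in y, and it still comes
out as "b" in both rows of the result.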

--Erik

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.