Good morning, I am trying to use R to estimate a fixed effects model (i.e., a panel regression model controlling for unobserved time-invariant heterogeneities across agents) using different estimation approaches (e.g. replicating xtreg from Stata, see e.g. https://www.stata.com/support/faqs/statistics/intercept-in-fixed-effects-model/). I have already asked this question on different stacks exchange forums and contacted package creators who dealt with this issue before, but I wasn't able to obtain an answer to my doubts. I hope to have better luck on this list.
Let me introduce the problem, and note that I am using an unbalanced panel. The easiest way to estimate my fixed effect model is using the function lm. Example: # load packages library(dplyr) # set seed for replication purposes set.seed(123) # create toy dataset x <- rnorm(4000) x2 <- rnorm(length(x)) id <- factor(sample(500,length(x),replace=TRUE)) firm <- data.frame(id = id) %>% group_by(id) %>% mutate(firm = 1:n()) %>% pull(firm) id.eff <- rlnorm(nlevels(id)) firm.eff <- rexp(length(unique(firm))) y <- x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x)) db = data.frame(y = y, x = x, id = id, firm = firm) rm <- db %>% group_by(id) %>% summarise(firm = max(firm)) %>% filter(firm == 1) %>% pull(id) db = db[-which(db$id %in% rm), ] # Run regression test <- lm(y ~ x + id, data = db) Another approach is demeaning the variables included into the model specification. In this way, one can exclude the fixed effects from the model. Of course, point estimates will be correct, while standard errors will be not (because we are not accounting for the degrees of freedom used in the demeaning). # demean data dbm <- as_tibble(db) %>% group_by(id) %>% mutate(y = y - mean(y), x = x - mean(x)) %>% ungroup() # run regression test2 <- lm(y ~ x, data = dbm) # compare results summary(test)$coefficients[2,1] > 0.9753364 summary(test2)$coefficients[2,1] > 0.9753364 Another way to do this is to demean the variables and add their grand average (I believe that this is what xtreg from Stata does) # create data n = length(unique(db$id)) dbh <- dbm %>% mutate(yh = y + (sum(db$y)/n), xh = x + (sum(db$x)/n)) # run regression test3 <- lm(yh ~ xh, dbh) # compare results summary(test)$coefficients[2,1] > 0.9753364 summary(test2)$coefficients[2,1] > 0.9753364 summary(test3)$coefficients[2,1] > 0.9753364 As one can see, the three approaches report the same point estimates (again, standard errors will be different instead). When I include an additional set of fixed effects in the model specification, the three approaches no longer return the same point estimate. However, differences seem to be negligible and they could be due to rounding. db$firm <- as.factor(db$firm) dbm$firm <- as.factor(dbm$firm) dbh$firm <- as.factor(dbh$firm) testB <- lm(y ~ x + id + firm, data = db) testB2 <- lm(y ~ x + firm, data = dbm) testB3 <- lm(yh ~ xh + firm, data = dbh) summary(testB)$coefficients[2,1] > 0.9834414 summary(testB2)$coefficients[2,1] > 0.9833334 summary(testB3)$coefficients[2,1] > 0.9833334 A similar behavior occurs if I use a dummy variable rather than a continous one. For the only purpose of the example, I show this by transforming my target variable x from a continuous to a dummy variable. # create data x3 <- ifelse(db$x > 0, 1, 0) db <- db %>% mutate(x3 = x3) dbm <- dbm %>% mutate(x3 = x3) %>% group_by(id) %>% mutate(x3 = x3 - mean(x3)) %>% ungroup() dbh <- dbh %>% mutate(x3 = dbm$x3) %>% mutate(x3 = x3 + (sum(db$x3)/n)) %>% ungroup() # Run regressions testC <- lm(y ~ x3 + id + firm, data = db) testC2 <- lm(y ~ x3 + firm, data = dbm) testC3 <- lm(yh ~ x3 + firm, data = dbh) summary(testC)$coefficients[2, 1] > 1.579883 summary(testC2)$coefficients[2, 1] > 1.579159 summary(testC3)$coefficients[2, 1] > 1.579159 Now, I want to estimate both the impact of x when this is higher than 0 (i.e., x3) and when this is lower or equal to zero (call it x4). Again, observe that x3 might as well be a real dummy variable, not a transformation of a continuous variable. In order to do that, I exclude the intercept from my model. Specifically, I do the following: # create data x4 <- ifelse(db$x <= 0, 1, 0) db <- db %>% mutate(x4 = x4) dbm <- dbm %>% mutate(x4 = x4) %>% group_by(id) %>% mutate(x4 = x4 - mean(x4)) %>% ungroup() dbh <- dbh %>% mutate(x4 = dbm$x4) %>% mutate(x4 = x4 + (sum(db$x4)/n)) %>% ungroup() testD <- lm(y ~ x3 + x4 + id + firm - 1, data = db) testD2 <- lm(y ~ x3 + x4 + firm - 1, data = dbm) testD3 <- lm(yh ~ x3 + x4 + firm - 1, data = dbh) summary(testD)$coefficients[1:2, ] > 1.2372816 -0.3426011 summary(testD2) > Call: lm(formula = y ~ x3 + x4 + firm - 1, data = dbm) Residuals: Min 1Q Median 3Q Max -3.8794 -0.7497 0.0010 0.7442 3.8486 Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) x3 1.57916 0.03779 41.788 < 2e-16 *** x4 NA NA NA NA ... redacted summary(testD3)$coefficients[1:2] > 3.254654 1.675495 As you can see, the second approach is not able to estimate the impact of x4 on y. At the same time, the first and the third approach return very different point estimates. Is anyone able to explain me why I cannot obtain the same point estimates for this last exercise? Is there anything wrong in the way I include the second set of fixed effects? Is there anything wrong in the way I include the variables x3 and x4? Or this is simply a problem due to some internal functions in R? Any hint would be much appreciated. Best, Valerio ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.