I sent this to Iñaki personally by mistake. Thank you for notifying me.

On Wed, Aug 8, 2018 at 7:53 PM Iñaki Úcar <i.uca...@gmail.com> wrote:
>
> For what it's worth, I always thought about factors as fundamentally
> characters, but with restrictions: a subspace of all possible strings.
> And I'd say that a non-negligible number of R users may think about
> them in a similar way.
>

That idea has been a common source of bugs, and it is the most important reason why I always explain to my students that factors are a special kind of numeric (integer), not character. People coming from SPSS in particular immediately see the link with categorical variables that way, and understand that a factor is a modeling aid rather than an alternative for characters. It is a categorical variable and a more readable way of representing a set of dummy variables.

I do agree that some of the factor behaviour is confusing at best, but that doesn't change the appropriate use and meaning of factors as categorical variables.

Even more, I oppose the ideas that:

1) factors with different levels should be concatenated.
2) when combining factors, the union of the levels would somehow be a good choice.

Factors with different levels are variables with different information, not more or less information. If one factor codes low and high and another codes low, mid and high, you can't say whether mid in the second factor would be low or high in the first one. The second has a higher resolution, and that's exactly the reason why they should NOT be combined. Different levels indicate a different grouping, and hence that data should never be used as one set of dummy variables in any model.

Even when combining factors, the union of levels only makes sense to me if there's no overlap between the levels of both factors. In all other cases, a researcher will need to determine whether levels with the same label really do mean the same thing in both factors, and that's not guaranteed.
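
To make the integer view and the "different resolution" problem concrete, here is a minimal sketch in R. Note that `combine_strict` is a hypothetical helper written for this illustration, not something base R provides, and mapping mid onto low is an arbitrary recoding chosen only to show the mechanics; in real data that decision belongs to the researcher.

```r
# A factor stores integer codes plus a "levels" attribute holding the labels.
f <- factor(c("10", "20", "10"))
typeof(f)                     # storage type of the underlying vector
levels(f)                     # the labels
as.integer(f)                 # the codes, not the labels

# Treating a factor as "strings with restrictions" invites this mistake:
as.numeric(f)                 # returns the codes
as.numeric(as.character(f))   # returns the values the labels spell out

# Two factors with a different resolution.
coarse <- factor(c("low", "high"), levels = c("low", "high"))
fine   <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))

# Hypothetical helper (NOT part of base R): refuse to combine two factors
# unless their levels are identical, including their order.
combine_strict <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y))
  if (!identical(levels(x), levels(y))) {
    stop("factors have different levels; recode to a common definition first")
  }
  factor(c(as.character(x), as.character(y)), levels = levels(x))
}

# Recode the finer factor to the coarser resolution before merging.
# Assumption for illustration only: "mid" is treated as "low".
fine_recoded <- fine
levels(fine_recoded) <- list(low = c("low", "mid"), high = "high")

combined <- combine_strict(coarse, fine_recoded)
```

Calling `combine_strict(coarse, fine)` directly stops with an error, which is the behaviour argued for here; only the explicitly recoded factor can be merged.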
And when we're talking about one factor with a higher resolution and one with a lower resolution, the correct thing to do modelwise is to recode one of the factors so that they have the same resolution and every level has the same definition before you merge that data.

So imho the combination of two factors with different levels (or even levels in a different order) should give an error. R currently doesn't throw one, so I get there's room for improvement.

Cheers
Joris

--
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel