On Tue, Mar 17, 2009 at 10:04:39AM -0400, Stavros Macrakis wrote:
...
> 1) Factor allows repeated levels, e.g. factor(c(1),c(1,1,1)), with no
> warning or error.

Yes, this is confusing behavior, since repeated levels are never meaningful.

> 2) Even from distinct inputs, factor of a numeric vector may generate
> repeated levels, because it only uses 15 digits.

I think 15 digits is a reasonable choice. A mapping between double precision
numbers and character strings with a given decimal precision is never
bijective. With 15 digits, every character string has a unique double
precision representation, but several doubles may share the same string.
With 17 digits, every double precision number has a unique character string,
but several strings may map to the same double. Which is better? The
specification of as.character() says that numbers are represented with
15 significant digits. So, I think, if as.factor() applied signif(,digits=15)
to a numeric vector before determining the levels using
sort(unique.default(x)), this could help to eliminate most of the problems
without being in conflict with the existing specification.

> 3) The algorithm to determine the shortest format is inconsistent with
> the algorithm to actually print, giving pathological cases like 0.3
> vs. 0.300000000000000.

I do not understand exactly what you mean by inconsistent. If you do

  nums <- (.3 + 2e-16 * c(-2,-1,1,2))
  options(digits=15)
  for (x in nums) print(x)
  # [1] 0.300000000000000
  # [1] 0.3
  # [1] 0.3
  # [1] 0.300000000000000
  as.character(nums)
  # [1] "0.300000000000000" "0.3"               "0.3"
  # [4] "0.300000000000000"

then print and as.character are consistent. Printing the whole vector behaves
differently, since it uses the same format for all numbers.

> The original problem was testing whether a floating-point number was a
> member of a vector. rounding and then converting to a factor seem
> like a very poor way of doing that, even if the above problems were
> resolved. Comparing with a tolerance seems much more robust, clean,
> and efficient.

Definitely, using a comparison tolerance is a meaningful approach. Its
disadvantage is that the relation abs(x - y) <= eps is not transitive.
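A minimal sketch of the non-transitivity, assuming a tolerance of 1e-8. The
helper near_eq() and the values of eps, x, y and z are made up for
illustration and do not come from R or any package:

```r
# Hypothetical helper: TRUE when a and b differ by at most eps.
near_eq <- function(a, b, eps) abs(a - b) <= eps

eps <- 1e-8      # illustrative tolerance
x <- 0
y <- 0.75e-8     # within eps of x
z <- 1.5e-8      # within eps of y, but not within eps of x

near_eq(x, y, eps)   # TRUE
near_eq(y, z, eps)   # TRUE
near_eq(x, z, eps)   # FALSE: x ~ y and y ~ z do not imply x ~ z
```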
So, it may also produce confusing results in some situations. I think that
one has to choose the right solution depending on the application.

Petr.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel