As a side note, for floating point values, the IEEE 754 standard provides for a large set of NaN values, making it possible to have multiple types of NAs for floating point values...
Sent from my iPad

> On May 25, 2021, at 3:03 PM, Avi Gross via R-devel <r-devel@r-project.org> wrote:
>
> That helps get more understanding of what you want to do, Adrian. Getting anyone to switch is always a challenge, but changing R enough to tempt them may be a bigger challenge. This is an old story. I was the first adopter for C++ in my area and at first had to have my code built with an all-C project, making me reinvent some wheels so the same “make” system knew how to build the two compatibly and link them. Of course, they all eventually had to join me in a later release, but I had moved forward by then.
>
> I have changed (or more accurately added) lots of languages in my life and continue to do so. The biggest challenge is not just to adapt and use a new language similarly to the ones already mastered, but to understand WHY someone designed the language this way and what kind of idioms are common and useful, even if that means a new way of thinking. But, of course, any “older” language has evolved and often drifted in multiple directions. Many now borrow heavily from others even when the philosophy is different, and often the results are not pretty. Making major changes in R might have serious impacts on existing programs, including just by making them fail as they run out of memory.
>
> If you look at R, there is plenty you can do in base R, sometimes by standing on your head. Yet you see package after package coming along that offers not just new things but sometimes a reworking and even a remodeling of old things. R has a base graphics system I now rarely use, and another called lattice I have no reason to use again because I can do so much quite easily in ggplot. Similarly, the evolving tidyverse group of packages approaches things from an interesting direction, to the point where many people mainly use it and not base R.
> So if they were to teach a class in how to gather your data and analyze it and draw pretty pictures, the students might walk away thinking they had learned R but actually have learned these packages.
>
> Your scenario seems related to a common one: how can we have values that signal beyond some range in an out-of-band manner? Years ago we had functions in languages like C that would return a -1 on failure when only non-negative results were otherwise possible. That can work fine but fails in cases when any possible value in the range can be returned. We have languages that deal with this kind of thing using error-handling constructs like exceptions. Sometimes you bundle up multiple items into a structure and return that, with one element of the structure holding some kind of return status and another holding the payload. A variation on this theme, as in languages like Go, is to have functions that return multiple values, with one of them containing nil on success and an error structure on failure.
>
> The situation here that seems to be of concern to you is that you would like each item in a structure to have attributes that are recognized and propagated as it is being processed. Older languages tended not to even have such a concept, so basic types simply existed, and two instances of the number 5 might even be the same underlying one, or two strings with the same contents, and so on. You could of course play the game of making a struct, as mentioned above, but then you needed your own code to do all the handling, as nothing else knew it contained multiple items and which ones had which purpose.
>
> R did add generalized attributes, and some are fairly well integrated, or at least partially. “Names” were discussed as not being easy to keep around. Factors use their own tagging method that seems to work fairly well, but probably not everywhere.
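The Go-style convention described above is easy to sketch in Python (`parse_age` is a made-up example, not any real API): the function returns a `(value, error)` pair, so failure travels out-of-band instead of masquerading as a legal value like -1.

```python
def parse_age(text):
    """Return (value, error): exactly one side is meaningful, Go-style."""
    try:
        age = int(text)
    except ValueError:
        return None, f"not a number: {text!r}"
    if age < 0:
        return None, f"negative age: {age}"
    return age, None

# Failure is signalled out-of-band, so every int remains a legal result:
print(parse_age("42"))      # (42, None)
print(parse_age("-1"))      # (None, 'negative age: -1')
print(parse_age("oops"))    # (None, "not a number: 'oops'")
```

The in-band sentinel (`-1`) breaks the moment -1 becomes a valid answer; the pair does not, which is exactly the problem the paragraph above describes.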
> But what you want may be more general and not built on similar foundations.
>
> I look at languages like Python that are arguably more object-oriented now than R is, and in some ways can be extended better, albeit not in others. If I wanted to create an object to hold the number 5, I could add methods that allow it to participate in various ways with other objects using the hidden payload. I might then pair it with the string “five”, but also with dozens of other strings for the word representing 5 in many languages. So I might have it act like a number in numerical situations and like text when someone is using it in writing a novel in any of many languages.
>
> You seem to want the original text visible that gives a reason something is missing (or something like that), but have the software TREAT it like it is missing in calculations. In effect, you want is.na() to be a bit more like is.numeric() or is.character() and care more about the TYPE of what is being stored. An item may contain a 999 and yet not be seen as a number but as an NA. The problem I see is that you also may want the item to be a string like “DELETED” and yet include it in a vector that R insists can only hold integers. R does have a built-in data structure called a list that indeed allows that. You can easily store data as a list of lists rather than a list of vectors, and many other structures. Some of those structures might handle your needs BUT may only work properly if you build your own packages, as with the tidyverse, and break as soon as any other functions encounter them!
>
> But then you would arguably no longer be in R but in your own universe based on R.
>
> I have written much code that does things a bit sideways.
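A minimal Python sketch of that wrapper-object idea (the class and its vocabulary are invented for illustration): the object carries the number 5 as its payload plus per-language words as metadata, acting like a number in arithmetic and like text when printed.

```python
class Polyglot:
    """A number that also knows its name in several languages.
    The numeric payload participates in arithmetic; the words are metadata."""

    def __init__(self, value, words):
        self.value = value
        self.words = words    # e.g. {"en": "five", "ro": "cinci"}

    def __add__(self, other):
        # Unbox the payload(s) so arithmetic just works.
        return self.value + (other.value if isinstance(other, Polyglot) else other)

    def __str__(self):
        # In a text context, present the word instead of the digit.
        return self.words.get("en", str(self.value))

five = Polyglot(5, {"en": "five", "ro": "cinci", "de": "fünf"})
print(five + 2)     # acts like a number: 7
print(str(five))    # acts like text: five
```

As the email notes, the edge cases are the catch: any library that does not go through these methods sees only the wrapper, not the payload.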
> For example, I might have a treelike structure in which you do some form of search till you encounter a leaf node and return that value to be used in a calculation. To perform a calculation using multiple trees, such as taking an average, you always use find_value(tree) and never hand over the tree itself. As I think I pointed out earlier, you can do things like that in many places and hand over a variation of your data. In the ggplot example, you might have:
>
> ggplot(data=mydata, aes(x=abs(col1), y=convert_string_to_numeric(col2))) …
>
> ggplot would not use the original data in plotting but the view it is asked to use. The function I made up above would know what values are some form of NA and convert all others like “12.3” to numeric form. BUT it would not act as simply or smoothly as when your data is already in the format everyone else uses.
>
> So how does R know what something is? Presumably there is some overhead associated with a vector, or some table that records the type. A list presumably depends on each internal item having such a type. So maybe what you want is for each item in a vector to have a type, where one type is some form of NA. But as noted, R often does not give a damn about an NA and happily uses it to create more nonsense. The mean of a bunch of numbers that includes one or more copies of things like NA (or NaN or Inf) can pollute them all. Generally R is not designed to give a darn. When people complain, they may get a function like mean() to add an na.rm=TRUE option, or remove the values some way before asking for a mean, or perhaps reset them to something like zero.
>
> So if you want to leave your variables in place with assorted meanings but a tag saying they are to be treated as NA, much in R might have to change. Your suggested approach, though, is not yet clear, but might mean doing something analogous to using extra bits and hoping nobody will notice.
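A Python sketch of what that made-up convert_string_to_numeric might do (the set of missingness codes here is hypothetical): tagged values become NaN, so downstream numeric code treats them as missing, while everything else parses normally.

```python
import math

MISSING_CODES = {"DELETED", "REFUSED", "999"}   # hypothetical missingness tags

def convert_string_to_numeric(values):
    """Hand the plotting/modelling code a numeric *view* of the data:
    real numbers pass through, tagged values become NaN."""
    out = []
    for v in values:
        if v in MISSING_CODES:
            out.append(math.nan)   # treated as missing downstream
        else:
            out.append(float(v))
    return out

print(convert_string_to_numeric(["12.3", "DELETED", "7"]))   # [12.3, nan, 7.0]
```

The original data keeps its human-readable tags; only the view handed to the calculation is numeric.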
> So, the solution is both blindingly obvious and even more blindingly stupid. Use complex numbers! All normal content shall be stored as numbers like 5.3+0i, and any variant on NA shall be stored as something like 0+3i where 3 means an NA of type 3.
>
> OK, humor aside: since the social sciences do not tend to even know what complex numbers are, this should provide another dimension to hide lots of meaningless info. Heck, you could convert a message like “LATE” into some numeric form. Assuming an English-centered world (which I do not!) you could store it with L replaced by 12 and A by 01 and so on, so the imaginary component might look like 0+12012005i and be easily decoded back into LATE when needed. Again, not a serious proposal. The storage probably would be twice the size of a numeric, albeit you can extract the real part when needed for normal calculations and the imaginary part when you want to know about NA type or whatever.
>
> What R really is missing is quaternions and octonions, which are the only two other variations on complex numbers that are possible. They are sort of complex numbers on steroids, with either three or seven distinct imaginary units, so they allow storage along additional axes in other dimensions.
>
> Yes, I am sure someone wrote a package for that! LOL!
>
> Ah, here is one: https://cran.r-project.org/web/packages/onion/onion.pdf
>
> I will end by saying my experience is that enticing people to do something new is just a start. After they start, you often get lots of complaints and requests for help and even requests to help them move back! Unless you make some popular package everyone runs to, NOBODY else will be able to help them on some things. The reality is that some of the more common tasks these people do are sometimes already optimized for them and often do not make them know more. I have had to use these systems, and for some common tasks they are easy.
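For what it's worth, that letter-to-digits game can be sketched as follows (helpers invented for illustration). With L→12, A→01, T→20, E→05, “LATE” packs into 12012005 and decodes back out:

```python
def encode_word(word):
    """Pack a short ASCII word into digits: A→01 ... Z→26, concatenated."""
    return int("".join(f"{ord(c) - ord('A') + 1:02d}" for c in word.upper()))

def decode_word(n):
    """Reverse the packing; restore a leading zero if the word starts A–I."""
    digits = str(n)
    if len(digits) % 2:
        digits = "0" + digits
    return "".join(chr(int(digits[i:i + 2]) + ord("A") - 1)
                   for i in range(0, len(digits), 2))

print(encode_word("LATE"))     # 12012005
print(decode_word(12012005))   # LATE
```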
> Dialog boxes can pop up and let you check off various options and off you go. No need to learn lots of programming details like the names of various functions that do a Tukey test, what arguments they need, what errors might have to be handled, and so on. I know SPSS often produces LOTS of output, including many things you do not want, and then lets you remove parts you don’t need or even know what they mean. Sure, R can have similar functionality, but often you are expected to sort of stitch various parts together as well as ADD your own bits. I love that and value being able to be creative. In my experience, most normal people just want to get the job done and be fairly certain others accept the results, and then do other activities they are better suited for, or at least think they are.
>
> There are intermediates I have used where I let them do various kinds of processing in SPSS and save the result in some format I can read into R for additional processing. The latter may not be stuff that requires keeping track of multiple NA equivalents. Of course, if you want to save the results and move them back, that is a challenge. Hybrid approaches may tempt them to try something and maybe later do more and more and move over.
>
> From: Adrian Dușa <dusa.adr...@unibuc.ro>
> Sent: Tuesday, May 25, 2021 2:17 AM
> To: Avi Gross <avigr...@verizon.net>
> Cc: r-devel <r-devel@r-project.org>
> Subject: Re: [Rd] [External] Re: 1954 from NA
>
> Dear Avi,
>
> Thank you so much for the extended messages, I read them carefully.
> While partially offering a solution (I've already been there), it creates additional work for the user, and some of that is unnecessary.
>
> What I am trying to achieve is best described in this draft vignette:
>
> devtools::install_github("dusadrian/mixed")
> vignette("mixed")
>
> Once a value is declared to be missing, the user should not do anything else.
> Despite being present, the value should automatically be treated as missing by the software. That is the way it's done in all major statistical packages like SAS, Stata and even SPSS.
>
> My end goal is to make R attractive for my faculty peers (and beyond), almost all of whom are massively using SPSS and sometimes Stata. But in order to convince them to (finally) make the switch, I need to provide similar functionality, not additional work.
>
> Re. the first part of your message, I am definitely not trying to change the R internals. The NA will still be NA, exactly as currently defined.
> My initial proposal was based on the observation that the 1954 payload was stored as an unsigned int (thus occupying 32 bits) when it obviously doesn't need more than 16. That was the only proposed modification; everything else stays the same.
>
> I have now learned, thanks to all contributors on this list, that building something around that payload is risky because we do not know exactly what the compilers will do. One possible solution that I can think of, while (still) maintaining the current functionality around the NA, is to use a different high word for the NA that would not trigger compilation issues. But I have absolutely no idea what that implies for the other inner workings of R.
>
> I very much trust that R Core will eventually find a robust solution; they've solved much more complicated problems than this. I just hope the current thread will put the idea of tagged NAs on the table for when they discuss this.
>
> Once that is solved, and despite the current advice discouraging this route, I believe tagging NAs is a valuable idea that should not be discarded.
> After all, the NA is nothing but a tagged NaN.
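Adrian's point about the payload is easy to illustrate. As I understand the R sources, NA_real_ is an IEEE 754 double with an all-ones exponent and the value 1954 in its low word; a Python sketch of that bit pattern (not R's actual code, and exactly the fragile bit-level layout this thread warns about):

```python
import struct

# Adrian's observation: the payload would indeed fit in 16 bits.
assert 1954 < 2**16

# A NaN with 1954 in the low word -- the NA_real_ layout as I understand it:
na_bits = 0x7FF0000000000000 | 1954
na_real = struct.unpack("<d", struct.pack("<Q", na_bits))[0]

print(na_real != na_real)    # NaN semantics: True, an NA really is "not a number"
low_word = struct.unpack("<Q", struct.pack("<d", na_real))[0] & 0xFFFFFFFF
print(low_word)              # 1954 -- the tag survives in the bits
```

Whether that payload survives a round trip through arbitrary C arithmetic is precisely the compiler-dependent behavior the rest of the thread cautions against.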
> All the best,
> Adrian
>
> On Tue, May 25, 2021 at 7:05 AM Avi Gross via R-devel <r-devel@r-project.org> wrote:
>
> I was thinking about how one does things in a language that is properly object-oriented versus R, which makes various half-assed attempts at being such.
>
> Clearly in some such languages you can make an object that is a wrapper, allowing you to save an item that is the main payload as well as anything else you want. You might need a way to convince everything else to let you make things like lists and vectors and other collections of the objects, and perhaps automatically unbox them for many purposes. As an example, in a language like Python you might provide methods so that adding A and B actually gets the value out of A and/or B and adds them properly. But there may be too many edge cases to handle, and some software may not pay attention to what you want, including some libraries written in other languages.
>
> I mention Python for the odd reason that it is now possible to combine Python and R in the same program and sort of switch back and forth between data representations. This may provide some openings for preserving and accessing metadata when needed.
>
> Realistically, if R were being designed from scratch TODAY, many things might be done differently. But I recall it being developed at Bell Labs for purposes where it was sort of revolutionary at the time (back when it was S), designed to do things in a vectorized way and probably primarily for the kinds of scientific and mathematical operations where a single NA (of several types depending on the data) was enough when augmented by a few things like a NaN and Inf and -Inf. I doubt they seriously saw a need for an unlimited number of NAs that were all the same AND also all different, which they felt had to be built in.
> As noted, had they had a reason to make it fully object-oriented too, and made the base types such as integer into full-fledged objects with room for additional metadata, then things might be different. I note I have seen languages which have both a data type called integer in lower case and Integer in upper case. One of them is regularly boxed and unboxed automagically when used in a context that needs the other. As far as efficiency goes, this invisibly adds many steps. So do languages that sometimes take a variable that is a pointer and invisibly dereference it to provide the underlying field rather than make you do extra typing, and so on.
>
> So is there any reason only an NA should have such metadata? Why not have reasons associated with Inf stating it was an Inf because you asked for one, or the result of a calculation such as dividing by zero (albeit maybe that might be a NaN), and so on. Maybe I could annotate integers with whether they are prime, or even versus odd, or a factor of 144, or anything else I can imagine. But at some point, the overhead from allowing all this can become substantial. I was amused at how Python allows a function to be annotated, including by itself, since it is an object. So it can store such metadata, perhaps in an attached dictionary, so a complex costly calculation can have the results cached; when you ask for the same thing in the same session, it checks if it has done it and just returns the result in constant time. But after a while, how many cached results can there be?
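That caching trick works because a Python function is itself an object and can carry arbitrary attributes; a small sketch (`costly` is a stand-in for any expensive calculation):

```python
def costly(n):
    """Cache results on the function object itself, as described above."""
    if n in costly.cache:                      # hit: a dict lookup, no recomputation
        return costly.cache[n]
    result = sum(i * i for i in range(n))      # stand-in for the expensive work
    costly.cache[n] = result
    return result

costly.cache = {}   # the function is an object, so it carries its own metadata

print(costly(10_000))
print(costly(10_000))   # second call is served from the attached dictionary
```

The email's closing question applies directly: nothing here ever evicts old entries, so the cache grows for the life of the session.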
> -----Original Message-----
> From: R-devel <r-devel-boun...@r-project.org> On Behalf Of luke-tier...@uiowa.edu
> Sent: Monday, May 24, 2021 9:15 AM
> To: Adrian Dușa <dusa.adr...@unibuc.ro>
> Cc: Greg Minshall <minsh...@umich.edu>; r-devel <r-devel@r-project.org>
> Subject: Re: [Rd] [External] Re: 1954 from NA
>
>> On Mon, 24 May 2021, Adrian Dușa wrote:
>>
>>> On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minsh...@umich.edu> wrote:
>>>
>>> [...]
>>> if you have 500 columns of possibly-NA'd variables, you could have
>>> one column of 500 "bits", where each bit has one of N values, N being
>>> the number of explanations the corresponding column has for why the
>>> NA exists.
>>>
>
> PLEASE DO NOT DO THIS!
>
> It will not work reliably, as has been explained to you ad nauseam in this thread.
>
> If you distribute code that does this it will only lead to bug reports on R that will waste R-core time.
>
> As Alex explained, you can use attributes for this. If you need operations to preserve attributes across subsetting you can define subsetting methods that do that.
>
> If you are dead set on doing something in C you can try to develop an ALTREP class that provides augmented missing value information.
>
> Best,
>
> luke
>
>> The mere thought of implementing something like that gives me shivers.
>> Not to mention such a solution should also be robust when subsetting,
>> splitting, column and row binding, etc., and everything can be lost if
>> the user deletes that particular column without realising its importance.
>> Social science datasets are much more alive and complex than one might
>> first think: there are multi-wave studies with tens of countries, and
>> aggregating such data is already a complex process without adding even
>> more complexity on top of that.
>>
>> As undocumented as they may be, or even subject to change, I think the
>> R internals are much more reliable than this.
>>
>> Best wishes,
>> Adrian
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                    Phone: 319-335-3386
> Department of Statistics and          Fax: 319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                    email: luke-tier...@uiowa.edu
> Iowa City, IA 52242                   WWW: http://www.stat.uiowa.edu
>
> --
> Adrian Dusa
> University of Bucharest
> Romanian Social Data Archive
> Soseaua Panduri nr. 90-92
> 050663 Bucharest sector 5
> Romania
> https://adriandusa.eu

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
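As a footnote: Luke's suggestion above (keep the reasons in attributes, and define subsetting methods that preserve them) translates roughly into Python as a sequence that re-aligns its tags when sliced. The class and field names here are invented for illustration.

```python
class Tagged(list):
    """A vector-like list carrying per-position missingness reasons, with
    subsetting that keeps the tags aligned -- a rough analogue of R
    attributes plus a `[` method that preserves them."""

    def __init__(self, data, reasons=None):
        super().__init__(data)
        self.reasons = reasons or {}   # position -> reason string

    def __getitem__(self, key):
        if isinstance(key, slice):
            idx = range(*key.indices(len(self)))
            # Re-map each surviving tag from its old position to its new one.
            kept = {new: self.reasons[old]
                    for new, old in enumerate(idx) if old in self.reasons}
            return Tagged(super().__getitem__(key), kept)
        return super().__getitem__(key)

x = Tagged([1, None, 3, None], {1: "REFUSED", 3: "DELETED"})
y = x[1:]                  # the tags follow the subset
print(list(y), y.reasons)  # [None, 3, None] {0: 'REFUSED', 2: 'DELETED'}
```

As with R attributes, every other operation (concatenation, sorting, etc.) would need the same treatment, which is exactly the maintenance burden the thread describes.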