As a side note, for floating point values, the IEEE 754 standard provides for a large set of NaN values, making it possible to have multiple types of NAs for floating point values...
Sent from my iPad

> On May 25, 2021, at 3:03 PM, Avi Gross via R-devel <r-devel@r-project.org> wrote:
>
> That helps get more understanding of what you want to do, Adrian. Getting anyone to switch is always a challenge, but changing R enough to tempt them may be a bigger challenge. This is an old story. I was the first adopter for C++ in my area and at first had to have my code built with an all-C project, making me reinvent some wheels so the same “make” system knew how to build the two compatibly and link them. Of course, they all eventually had to join me in a later release, but I had moved forward by then.
>
> I have changed (or more accurately added) lots of languages in my life and continue to do so. The biggest challenge is not just to adapt and use a new language similarly to the ones already mastered, but to understand WHY someone designed the language this way and what kind of idioms are common and useful, even if that means a new way of thinking. But, of course, any “older” language has evolved and often drifted in multiple directions. Many now borrow heavily from others even when the philosophy is different, and often the results are not pretty. Making major changes in R might have serious impacts on existing programs, including just by making them fail as they run out of memory.
>
> If you look at R, there is plenty you can do in base R, sometimes by standing on your head. Yet you see package after package coming along that offers not just new things but sometimes a reworking and even a remodeling of old things. R has a base graphics system I now rarely use, and another called lattice I have no reason to use again because I can do so much quite easily in ggplot. Similarly, the evolving tidyverse group of packages approaches things from an interesting direction, to the point where many people mainly use it and not base R.
> So if they were to teach a class in how to gather your data and analyze it and draw pretty pictures, the students might walk away thinking they had learned R but actually have learned these packages.
>
> Your scenario seems related to a common one: how can we have values that signal beyond some range in an out-of-band manner? Years ago we had functions in languages like C that would return a -1 on failure when only non-negative results were otherwise possible. That can work fine but fails in cases when any possible value in the range can be returned. We have languages that deal with this kind of thing using error-handling constructs like exceptions. Sometimes you bundle up multiple items into a structure and return that, with one element of the structure holding some kind of return status and another holding the payload. A variation on this theme, as in languages like Go, is to have functions that return multiple values, with one of them containing nil on success and an error structure on failure.
>
> The situation here that seems to be of concern to you is that you would like each item in a structure to have attributes that are recognized and propagated as it is being processed. Older languages tended not to even have such a concept, so basic types simply existed, and two instances of the number 5 might even be the same underlying one, or two strings with the same contents, and so on. You could of course play the game of making a struct, as mentioned above, but then you needed your own code to do all the handling, as nothing else knew it contained multiple items and which ones had which purpose.
>
> R did add generalized attributes, and some are fairly well integrated, or at least partially. “Names” were discussed as not being easy to keep around. Factors use their own tagging method that seems to work fairly well, but probably not everywhere.
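The Go-style convention described above is easy to sketch in Python (`parse_age` is a made-up example, not any real API): the function returns a `(value, error)` pair, so failure travels out-of-band instead of masquerading as a legal value like -1.

```python
def parse_age(text):
    """Return (value, error): exactly one side is meaningful, Go-style."""
    try:
        age = int(text)
    except ValueError:
        return None, f"not a number: {text!r}"
    if age < 0:
        return None, f"negative age: {age}"
    return age, None

# Failure is signalled out-of-band, so every int remains a legal result:
print(parse_age("42"))      # (42, None)
print(parse_age("-1"))      # (None, 'negative age: -1')
print(parse_age("oops"))    # (None, "not a number: 'oops'")
```

The in-band sentinel (`-1`) breaks the moment -1 becomes a valid answer; the pair does not, which is exactly the problem the paragraph above describes.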
> But what you want may be more general and not built on similar foundations.
>
> I look at languages like Python that are arguably more object-oriented now than R is, and in some ways can be extended better, albeit not in others. If I wanted to create an object to hold the number 5, I could add methods that allow it to participate in various ways with other objects using the hidden payload. I might then pair it with the string “five”, but also with dozens of other strings for the word representing 5 in many languages. So I might have it act like a number in numerical situations and like text when someone is using it in writing a novel in any of many languages.
>
> You seem to want the original text visible that gives a reason something is missing (or something like that), but have the software TREAT it like it is missing in calculations. In effect, you want is.na() to be a bit more like is.numeric() or is.character() and care more about the TYPE of what is being stored. An item may contain a 999 and yet not be seen as a number but as an NA. The problem I see is that you also may want the item to be a string like “DELETED” and yet include it in a vector that R insists can only hold integers. R does have a built-in data structure called a list that indeed allows that. You can easily store data as a list of lists rather than a list of vectors, and many other structures. Some of those structures might handle your needs BUT may only work properly if you build your own packages, as with the tidyverse, and break as soon as any other functions encounter them!
>
> But then you would arguably no longer be in R but in your own universe based on R.
>
> I have written much code that does things a bit sideways.
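A minimal Python sketch of that wrapper-object idea (the class and its vocabulary are invented for illustration): the object carries the number 5 as its payload plus per-language words as metadata, acting like a number in arithmetic and like text when printed.

```python
class Polyglot:
    """A number that also knows its name in several languages.
    The numeric payload participates in arithmetic; the words are metadata."""

    def __init__(self, value, words):
        self.value = value
        self.words = words    # e.g. {"en": "five", "ro": "cinci"}

    def __add__(self, other):
        # Unbox the payload(s) so arithmetic just works.
        return self.value + (other.value if isinstance(other, Polyglot) else other)

    def __str__(self):
        # In a text context, present the word instead of the digit.
        return self.words.get("en", str(self.value))

five = Polyglot(5, {"en": "five", "ro": "cinci", "de": "fünf"})
print(five + 2)     # acts like a number: 7
print(str(five))    # acts like text: five
```

As the email notes, the edge cases are the catch: any library that does not go through these methods sees only the wrapper, not the payload.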
> For example, I might have a treelike structure in which you do some form of search till you encounter a leaf node and return that value to be used in a calculation. To perform a calculation using multiple trees, such as taking an average, you always use find_value(tree) and never hand over the tree itself. As I think I pointed out earlier, you can do things like that in many places and hand over a variation of your data. In the ggplot example, you might have:
>
> ggplot(data=mydata, aes(x=abs(col1), y=convert_string_to_numeric(col2))) …
>
> ggplot would not use the original data in plotting but the view it is asked to use. The function I made up above would know what values are some form of NA and convert all others like “12.3” to numeric form. BUT it would not act as simply or smoothly as when your data is already in the format everyone else uses.
>
> So how does R know what something is? Presumably there is some overhead associated with a vector, or some table that records the type. A list presumably depends on each internal item having such a type. So maybe what you want is for each item in a vector to have a type, where one type is some form of NA. But as noted, R often does not give a damn about an NA and happily uses it to create more nonsense. The mean of a bunch of numbers that includes one or more copies of things like NA (or NaN or Inf) can pollute them all. Generally R is not designed to give a darn. When people complain, they may get a function like mean() to add an na.rm=TRUE option, or remove the values some way before asking for a mean, or perhaps reset them to something like zero.
>
> So if you want to leave your variables in place with assorted meanings but a tag saying they are to be treated as NA, much in R might have to change. Your suggested approach, though, is not yet clear, but might mean doing something analogous to using extra bits and hoping nobody will notice.
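A Python sketch of what that made-up convert_string_to_numeric might do (the set of missingness codes here is hypothetical): tagged values become NaN, so downstream numeric code treats them as missing, while everything else parses normally.

```python
import math

MISSING_CODES = {"DELETED", "REFUSED", "999"}   # hypothetical missingness tags

def convert_string_to_numeric(values):
    """Hand the plotting/modelling code a numeric *view* of the data:
    real numbers pass through, tagged values become NaN."""
    out = []
    for v in values:
        if v in MISSING_CODES:
            out.append(math.nan)   # treated as missing downstream
        else:
            out.append(float(v))
    return out

print(convert_string_to_numeric(["12.3", "DELETED", "7"]))   # [12.3, nan, 7.0]
```

The original data keeps its human-readable tags; only the view handed to the calculation is numeric.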
> So, the solution is both blindingly obvious and even more blindingly stupid. Use complex numbers! All normal content shall be stored as numbers like 5.3+0i, and any variant on NA shall be stored as something like 0+3i where 3 means an NA of type 3.
>
> OK, humor aside: since the social sciences do not tend to even know what complex numbers are, this should provide another dimension to hide lots of meaningless info. Heck, you could convert a message like “LATE” into some numeric form. Assuming an English-centered world (which I do not!) you could store it with L replaced by 12 and A by 01 and so on, so the imaginary component might look like 0+12012005i and be easily decoded back into LATE when needed. Again, not a serious proposal. The storage probably would be twice the size of a numeric, albeit you can extract the real part when needed for normal calculations and the imaginary part when you want to know about NA type or whatever.
>
> What R really is missing is quaternions and octonions, which are the only two other variations on complex numbers that are possible. They are sort of complex numbers on steroids, with either three or seven distinct imaginary units, so they allow storage along additional axes in other dimensions.
>
> Yes, I am sure someone wrote a package for that! LOL!
>
> Ah, here is one: https://cran.r-project.org/web/packages/onion/onion.pdf
>
> I will end by saying my experience is that enticing people to do something new is just a start. After they start, you often get lots of complaints and requests for help and even requests to help them move back! Unless you make some popular package everyone runs to, NOBODY else will be able to help them on some things. The reality is that some of the more common tasks these people do are sometimes already optimized for them and often do not make them know more. I have had to use these systems, and for some common tasks they are easy.
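For what it's worth, that letter-to-digits game can be sketched as follows (helpers invented for illustration). With L→12, A→01, T→20, E→05, “LATE” packs into 12012005 and decodes back out:

```python
def encode_word(word):
    """Pack a short ASCII word into digits: A→01 ... Z→26, concatenated."""
    return int("".join(f"{ord(c) - ord('A') + 1:02d}" for c in word.upper()))

def decode_word(n):
    """Reverse the packing; restore a leading zero if the word starts A–I."""
    digits = str(n)
    if len(digits) % 2:
        digits = "0" + digits
    return "".join(chr(int(digits[i:i + 2]) + ord("A") - 1)
                   for i in range(0, len(digits), 2))

print(encode_word("LATE"))     # 12012005
print(decode_word(12012005))   # LATE
```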
> Dialog boxes can pop up and let you check off various options and off you go. No need to learn lots of programming details like the names of various functions that do a Tukey test, what arguments they need, what errors might have to be handled, and so on. I know SPSS often produces LOTS of output, including many things you do not want, and then lets you remove parts you don’t need or even know what they mean. Sure, R can have similar functionality, but often you are expected to sort of stitch various parts together as well as ADD your own bits. I love that and value being able to be creative. In my experience, most normal people just want to get the job done and be fairly certain others accept the results, and then do other activities they are better suited for, or at least think they are.
>
> There are intermediates I have used where I let them do various kinds of processing in SPSS and save the result in some format I can read into R for additional processing. The latter may not be stuff that requires keeping track of multiple NA equivalents. Of course, if you want to save the results and move them back, that is a challenge. Hybrid approaches may tempt them to try something and maybe later do more and more and move over.
>
> From: Adrian Dușa <dusa.adr...@unibuc.ro>
> Sent: Tuesday, May 25, 2021 2:17 AM
> To: Avi Gross <avigr...@verizon.net>
> Cc: r-devel <r-devel@r-project.org>
> Subject: Re: [Rd] [External] Re: 1954 from NA
>
> Dear Avi,
>
> Thank you so much for the extended messages, I read them carefully.
> While partially offering a solution (I've already been there), it creates additional work for the user, and some of that is unnecessary.
>
> What I am trying to achieve is best described in this draft vignette:
>
> devtools::install_github("dusadrian/mixed")
> vignette("mixed")
>
> Once a value is declared to be missing, the user should not do anything else.
> Despite being present, the value should automatically be treated as missing by the software. That is the way it's done in all major statistical packages like SAS, Stata and even SPSS.
>
> My end goal is to make R attractive for my faculty peers (and beyond), almost all of whom are massively using SPSS and sometimes Stata. But in order to convince them to (finally) make the switch, I need to provide similar functionality, not additional work.
>
> Re. the first part of your message, I am definitely not trying to change the R internals. The NA will still be NA, exactly as currently defined.
> My initial proposal was based on the observation that the 1954 payload was stored as an unsigned int (thus occupying 32 bits) when it obviously doesn't need more than 16. That was the only proposed modification; everything else stays the same.
>
> I have now learned, thanks to all contributors on this list, that building something around that payload is risky because we do not know exactly what the compilers will do. One possible solution that I can think of, while (still) maintaining the current functionality around the NA, is to use a different high word for the NA that would not trigger compilation issues. But I have absolutely no idea what that implies for the other inner workings of R.
>
> I very much trust that R Core will eventually find a robust solution; they've solved much more complicated problems than this. I just hope the current thread will put the idea of tagged NAs on the table for when they discuss this.
>
> Once that is solved, and despite the current advice discouraging this route, I believe tagging NAs is a valuable idea that should not be discarded.
> After all, the NA is nothing but a tagged NaN.
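Adrian's point about the payload is easy to illustrate. As I understand the R sources, NA_real_ is an IEEE 754 double with an all-ones exponent and the value 1954 in its low word; a Python sketch of that bit pattern (not R's actual code, and exactly the fragile bit-level layout this thread warns about):

```python
import struct

# Adrian's observation: the payload would indeed fit in 16 bits.
assert 1954 < 2**16

# A NaN with 1954 in the low word -- the NA_real_ layout as I understand it:
na_bits = 0x7FF0000000000000 | 1954
na_real = struct.unpack("<d", struct.pack("<Q", na_bits))[0]

print(na_real != na_real)    # NaN semantics: True, an NA really is "not a number"
low_word = struct.unpack("<Q", struct.pack("<d", na_real))[0] & 0xFFFFFFFF
print(low_word)              # 1954 -- the tag survives in the bits
```

Whether that payload survives a round trip through arbitrary C arithmetic is precisely the compiler-dependent behavior the rest of the thread cautions against.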
> All the best,
> Adrian
>
> On Tue, May 25, 2021 at 7:05 AM Avi Gross via R-devel <r-devel@r-project.org> wrote:
>
> I was thinking about how one does things in a language that is properly object-oriented versus R, which makes various half-assed attempts at being such.
>
> Clearly in some such languages you can make an object that is a wrapper, allowing you to save an item that is the main payload as well as anything else you want. You might need a way to convince everything else to let you make things like lists and vectors and other collections of the objects, and perhaps automatically unbox them for many purposes. As an example, in a language like Python you might provide methods so that adding A and B actually gets the value out of A and/or B and adds them properly. But there may be too many edge cases to handle, and some software may not pay attention to what you want, including some libraries written in other languages.
>
> I mention Python for the odd reason that it is now possible to combine Python and R in the same program and sort of switch back and forth between data representations. This may provide some openings for preserving and accessing metadata when needed.
>
> Realistically, if R were being designed from scratch TODAY, many things might be done differently. But I recall it being developed at Bell Labs for purposes where it was sort of revolutionary at the time (back when it was S), designed to do things in a vectorized way and probably primarily for the kinds of scientific and mathematical operations where a single NA (of several types depending on the data) was enough when augmented by a few things like a NaN and Inf and -Inf. I doubt they seriously saw a need for an unlimited number of NAs that were all the same AND also all different, which they felt had to be built in.
> As noted, had they had a reason to make it fully object-oriented too, and made the base types such as integer into full-fledged objects with room for additional metadata, then things might be different. I note I have seen languages which have both a data type called integer in lower case and Integer in upper case. One of them is regularly boxed and unboxed automagically when used in a context that needs the other. As far as efficiency goes, this invisibly adds many steps. So do languages that sometimes take a variable that is a pointer and invisibly dereference it to provide the underlying field rather than make you do extra typing, and so on.
>
> So is there any reason only an NA should have such metadata? Why not have reasons associated with Inf stating it was an Inf because you asked for one, or the result of a calculation such as dividing by zero (albeit maybe that might be a NaN), and so on. Maybe I could annotate integers with whether they are prime, or even versus odd, or a factor of 144, or anything else I can imagine. But at some point, the overhead from allowing all this can become substantial. I was amused at how Python allows a function to be annotated, including by itself, since it is an object. So it can store such metadata, perhaps in an attached dictionary, so a complex costly calculation can have the results cached; when you ask for the same thing in the same session, it checks if it has done it and just returns the result in constant time. But after a while, how many cached results can there be?
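That caching trick works because a Python function is itself an object and can carry arbitrary attributes; a small sketch (`costly` is a stand-in for any expensive calculation):

```python
def costly(n):
    """Cache results on the function object itself, as described above."""
    if n in costly.cache:                      # hit: a dict lookup, no recomputation
        return costly.cache[n]
    result = sum(i * i for i in range(n))      # stand-in for the expensive work
    costly.cache[n] = result
    return result

costly.cache = {}   # the function is an object, so it carries its own metadata

print(costly(10_000))
print(costly(10_000))   # second call is served from the attached dictionary
```

The email's closing question applies directly: nothing here ever evicts old entries, so the cache grows for the life of the session.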
> -----Original Message-----
> From: R-devel <r-devel-boun...@r-project.org> On Behalf Of luke-tier...@uiowa.edu
> Sent: Monday, May 24, 2021 9:15 AM
> To: Adrian Dușa <dusa.adr...@unibuc.ro>
> Cc: Greg Minshall <minsh...@umich.edu>; r-devel <r-devel@r-project.org>
> Subject: Re: [Rd] [External] Re: 1954 from NA
>
>> On Mon, 24 May 2021, Adrian Dușa wrote:
>>
>>> On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minsh...@umich.edu> wrote:
>>>
>>> [...]
>>> if you have 500 columns of possibly-NA'd variables, you could have
>>> one column of 500 "bits", where each bit has one of N values, N being
>>> the number of explanations the corresponding column has for why the
>>> NA exists.
>>>
>
> PLEASE DO NOT DO THIS!
>
> It will not work reliably, as has been explained to you ad nauseam in this thread.
>
> If you distribute code that does this it will only lead to bug reports on R that will waste R-core time.
>
> As Alex explained, you can use attributes for this. If you need operations to preserve attributes across subsetting you can define subsetting methods that do that.
>
> If you are dead set on doing something in C you can try to develop an ALTREP class that provides augmented missing value information.
>
> Best,
>
> luke
>
>> The mere thought of implementing something like that gives me shivers.
>> Not to mention such a solution should also be robust when subsetting,
>> splitting, column and row binding, etc., and everything can be lost if
>> the user deletes that particular column without realising its importance.
>> Social science datasets are much more alive and complex than one might
>> first think: there are multi-wave studies with tens of countries, and
>> aggregating such data is already a complex process without adding even
>> more complexity on top of that.
>>
>> As undocumented as they may be, or even subject to change, I think the
>> R internals are much more reliable than this.
>>
>> Best wishes,
>> Adrian
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                    Phone: 319-335-3386
> Department of Statistics and          Fax: 319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                    email: luke-tier...@uiowa.edu
> Iowa City, IA 52242                   WWW: http://www.stat.uiowa.edu
>
> --
> Adrian Dusa
> University of Bucharest
> Romanian Social Data Archive
> Soseaua Panduri nr. 90-92
> 050663 Bucharest sector 5
> Romania
> https://adriandusa.eu

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
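As a footnote: Luke's suggestion above (keep the reasons in attributes, and define subsetting methods that preserve them) translates roughly into Python as a sequence that re-aligns its tags when sliced. The class and field names here are invented for illustration.

```python
class Tagged(list):
    """A vector-like list carrying per-position missingness reasons, with
    subsetting that keeps the tags aligned -- a rough analogue of R
    attributes plus a `[` method that preserves them."""

    def __init__(self, data, reasons=None):
        super().__init__(data)
        self.reasons = reasons or {}   # position -> reason string

    def __getitem__(self, key):
        if isinstance(key, slice):
            idx = range(*key.indices(len(self)))
            # Re-map each surviving tag from its old position to its new one.
            kept = {new: self.reasons[old]
                    for new, old in enumerate(idx) if old in self.reasons}
            return Tagged(super().__getitem__(key), kept)
        return super().__getitem__(key)

x = Tagged([1, None, 3, None], {1: "REFUSED", 3: "DELETED"})
y = x[1:]                  # the tags follow the subset
print(list(y), y.reasons)  # [None, 3, None] {0: 'REFUSED', 2: 'DELETED'}
```

As with R attributes, every other operation (concatenation, sorting, etc.) would need the same treatment, which is exactly the maintenance burden the thread describes.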