Re: [Rd] [External] Re: 1954 from NA

Duncan Murdoch Tue, 25 May 2021 16:27:24 -0700

You've already been told how to solve this: just add attributes to the objects. Use the standard NA to indicate that there is some kind of missingness, and the attribute to describe exactly what it is. Stick a class on those objects and define methods so that subsetting and arithmetic preserve the extra info you've added. If you do some operation that turns those NAs into NaNs, big deal: the attribute will still be there, and is.na(NaN) still returns TRUE.
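
A minimal sketch of the approach described above, assuming a hypothetical S3 class named tagged_na and a hypothetical attribute named na_reason (neither is part of base R or taken from any particular CRAN package). It only covers subsetting; arithmetic methods would need the same kind of treatment.

```r
# Hypothetical constructor: a numeric vector plus a parallel character
# vector of reasons, stored in the (made-up) attribute "na_reason".
tagged_na <- function(x, reason = rep(NA_character_, length(x))) {
  stopifnot(length(reason) == length(x))
  x[!is.na(reason)] <- NA          # tagged positions become the standard NA
  structure(x, na_reason = reason, class = "tagged_na")
}

# Subsetting method: keep the reasons aligned with the values.
`[.tagged_na` <- function(x, i) {
  structure(unclass(x)[i],
            na_reason = attr(x, "na_reason")[i],
            class = "tagged_na")
}

# Accessor for the reasons.
na_reason <- function(x) attr(x, "na_reason")

age <- tagged_na(c(34, 999, 51, 998),
                 reason = c(NA, "refused", NA, "not applicable"))

is.na(age)               # the tagged entries are ordinary NAs to base R
mean(age, na.rm = TRUE)  # base R functions work without knowing the class
na_reason(age[c(2, 4)])  # the reasons survive subsetting
```

The user of such a package would only ever call tagged_na() and na_reason(); the attribute itself stays an implementation detail.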


Base R doesn't need anything else.

You complained that users shouldn't need to know about attributes, and they won't: you, as the author of the package that does this, will handle all those details. Working in your subject area you know all the different kinds of NAs that people care about, and how they code them in input data, so you can make it all totally transparent. If you do it well, someone in some other subject area with a completely different set of kinds of missingness will be able to adapt your code to their use.

I imagine this has all been done in one of the thousands of packages on CRAN, but if it hasn't been done well enough for you, do it better.


Duncan Murdoch

On 25/05/2021 7:01 p.m., Adrian Dușa wrote:

Dear Avi,

That was quite a lengthy email...
What you write makes sense of course. I try hard not to deviate from the
base R, and thought my solution does just that but apparently no such luck.

I suspect, however, that something will have to eventually change: since
one of the R building blocks (such as an NA) is questioned by compilers, it
is serious enough to attract attention from the R core and maintainers.
And if that happens, my fingers are crossed the solution would allow users
to declare existing values as missing.

The importance of that, for the social sciences, cannot be stressed enough.

Best wishes, thanks once again to everyone,
Adrian

On Tue, May 25, 2021 at 10:03 PM Avi Gross via R-devel <
r-devel@r-project.org> wrote:

That helps get more understanding of what you want to do, Adrian. Getting
anyone to switch is always a challenge but changing R enough to tempt them
may be a bigger challenge. This is an old story. I was the first adopter for
C++ in my area and at first had to have my code be built with an all C
project making me reinvent some wheels so the same “make” system knew how
to build the two compatibly and link them. Of course, they all eventually
had to join me in a later release but I had moved forward by then.



I have changed (or more accurately added) lots of languages in my life and
continue to do so. The biggest challenge is not to just adapt and use it
similarly to the previous ones already mastered but to understand WHY
someone designed the language this way and what kind of idioms are common
and useful even if that means a new way of thinking. But, of course, any
“older” language has evolved and often drifted in multiple directions. Many
now borrow heavily from others even when the philosophy is different and
often the results are not pretty. Making major changes in R might have
serious impacts on existing programs including just by making them fail as
they run out of memory.



If you look at R, there is plenty you can do in base R, sometimes by
standing on your head. Yet you see package after package coming along that
offers not just new things but sometimes a reworking and even remodeling of
old things. R has a base graphics system I now rarely use and another
called lattice I have no reason to use again because I can do so much quite
easily in ggplot. Similarly, the evolving tidyverse group of packages
approaches things from an interesting direction to the point where many
people mainly use it and not base R. So if they were to teach a class in
how to gather your data and analyze it and draw pretty pictures, the
students might walk away thinking they had learned R but actually have
learned these packages.



Your scenario seems related to a common scenario of how we can have values
that signal beyond some range in an out-of-band manner. Years ago we had
functions in languages like C that would return a -1 on failure when only
non-negative results were otherwise possible. That can work fine but fails
in cases when any possible value in the range can be returned. We have
languages that deal with this kind of thing using error handling constructs
like exceptions.  Sometimes you bundle up multiple items into a structure
and return that with one element of the structure holding some kind of
return status and another holding the payload. A variation on this theme,
as in languages like Go, is to have functions that return multiple values
with one of them containing nil on success and an error structure on
failure.
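
As a rough illustration of the "status plus payload" idea in R, here is a sketch with a hypothetical helper, safe_log(), that returns a list with a value element and an error element instead of overloading a single return value. The function name and the element names are made up for this example, and a scalar input is assumed.

```r
# Hypothetical helper: return the payload and an explicit status together,
# rather than using a magic value such as -1 to signal failure.
# Assumes x is a single value.
safe_log <- function(x) {
  if (!is.numeric(x) || x <= 0) {
    return(list(value = NA_real_,
                error = "input must be a positive number"))
  }
  list(value = log(x), error = NULL)
}

ok  <- safe_log(1)
bad <- safe_log(-5)

is.null(ok$error)   # TRUE: success, so ok$value can be trusted
bad$error           # the failure reason travels alongside the NA payload
```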



The situation we have here that seems to be of concern to you is that you
would like each item in a structure to have attributes that are recognized
and propagated as it is being processed. Older languages tended not to even
have such a concept, so basic types simply existed and two instances of the
number 5 might even be the same underlying one or two strings with the same
contents and so on. You could of course play the game of making a struct,
as mentioned above, but then you needed your own code to do all the
handling as nothing else knew it contained multiple items and which ones
had which purpose.



R did add generalized attributes and some are fairly well integrated or at
least partially. “Names” were discussed as not being easy to keep around.
Factors used their own tagging method that seems to work fairly well but
probably not everywhere. But what you want may be more general and not
built on similar foundations.



I look at languages like Python that are arguably more object-oriented now
than R is and in some ways can be extended better, albeit not in others. If
I wanted to create an object to hold the number 5 and I add methods to the
object that allow it to participate in various ways with other objects
using the visible number but also sometimes using a hidden payload, then
I might pair it with the string “five” but also with dozens of other
strings for the word representing 5 in many languages. So I might have it
act like a number in numerical situations and like text when someone is
using it in writing a novel in any of many languages.



You seem to want to have the original text visible that gives a reason
something is missing (or something like that) but have the software TREAT
it like it is missing in calculations. In effect, you want is.na() to be
a bit more like is.numeric() or is.character() and care more about the TYPE
of what is being stored. An item may contain a 999 and yet not be seen as a
number but as an NA. The problem I see is that you also may want the item
to be a string like “DELETED” and yet include it in the vector that R
insists can only hold integers. R does have a built-in data structure
called a list that indeed allows that. You can easily store data as a list
of lists rather than a list of vectors and many other structures. Some of
those structures might handle your needs BUT may only work properly if you
build your own packages as with the tidyverse and break as soon as any
other functions encounter them!
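
A small sketch of the vector-versus-list point, using made-up values: an atomic vector silently coerces everything to one type, while a list keeps each element's own type.

```r
# Atomic vector: mixing integers and a string coerces everything to character.
v <- c(5L, 999L, "DELETED")
class(v)

# List: each element keeps its own type, at the cost of no longer being
# something most vectorised functions accept directly.
l <- list(5L, 999L, "DELETED")
sapply(l, class)
```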



But then you would arguably no longer be in R but in your own universe
based on R.



I have written much code that does things a bit sideways. For example, I
might have a treelike structure in which you do some form of search till
you encounter a leaf node and return that value to be used in a
calculation. To perform a calculation using multiple trees such as taking
an average, you always use find_value(tree) and never hand over the tree
itself. As I think I pointed out earlier, you can do things like that in
many places and hand over a variation of your data. In the ggplot example,
you might have:



ggplot(data=mydata, aes(x=abs(col1), y=convert_string_to_numeric(col2)) …



Ggplot would not use the original data in plotting but the view it is
asked to use. The function I made up above would know what values are some
form of NA and convert all others like “12.3” to numeric form. BUT it would
not act as simply or smoothly as when your data is already in the format
everyone else uses.
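
One way the made-up convert_string_to_numeric() could look, as a sketch only. The na_codes argument and its default values are assumptions for illustration; real data would need its own list of codes.

```r
# Hypothetical converter: strings listed in na_codes become NA,
# everything else is parsed as a number.
convert_string_to_numeric <- function(x,
                                      na_codes = c("DELETED", "LATE", "REFUSED")) {
  x <- as.character(x)
  x[x %in% na_codes] <- NA_character_   # known "kinds of NA" become plain NA
  as.numeric(x)                         # e.g. "12.3" becomes 12.3
}

col2 <- c("12.3", "LATE", "7")
convert_string_to_numeric(col2)
```

Any string that is neither a number nor in na_codes would still come back as NA, with the usual coercion warning from as.numeric().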



So how does R know what something is? Presumably there is some overhead
associated with a vector or some table that records the type. A list
presumably depends on each internal item to have such a type. So maybe what
you want is for each item in a vector to have a type where one type is some
form of NA. But as noted, R often does not give a damn about an NA and
happily uses it to create more nonsense. The mean of a bunch of numbers
that includes one or more copies of things like NA (or NaN or Inf) can
pollute them all. Generally R is not designed to give a darn. When people
complain, they may be told to add na.rm=TRUE to the call to mean(), or to
remove them some way before asking for a mean, or perhaps to reset them to
something like zero.
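
A short sketch of that propagation, with made-up numbers:

```r
x <- c(2, 4, NA, 6)

mean(x)                # NA: a single missing value pollutes the result
mean(x, na.rm = TRUE)  # 4: the NA is dropped before averaging

mean(c(1, NaN))        # NaN propagates in the same way
mean(c(1, Inf))        # Inf
```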



So if you want to leave your variables in place with assorted meanings but
a tag saying they are to be treated as NA, much in R might have to change.
Your suggested approach though is not yet clear but might mean doing
something analogous to using extra bits and hoping nobody will notice.



So, the solution is both blindingly obvious and even more blindingly
stupid. Use complex numbers! All normal content shall be stored as numbers
like 5.3+0i and any variant on NA shall be stored as something like 0+3i
where 3 means an NA of type 3.



OK, humor aside, since the social sciences do not tend to even know what
complex numbers are, this should provide another dimension to hide lots of
meaningless info. Heck, you could convert a message like “LATE” into some
numeric form. Assuming an English centered world (which I do not!) you
could store it with L replaced by 12 and A by 01 and so on so the imaginary
component might look like 0+12012005i and be easily decoded back into LATE
when needed. Again, not a serious proposal. The storage probably would be
twice the size of a numeric albeit you can extract the real part when
needed for normal calculations and the imaginary part when you want to know
about NA type or whatever.
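
In the same not-serious spirit, a sketch of what that encoding could look like. The helper names encode_word() and decode_word() are made up here, and the scheme assumes plain A-Z words of at most seven letters so the digits still fit exactly in a double.

```r
# Hypothetical encoder: each letter becomes its two-digit position in the
# alphabet, so "LATE" becomes 12 01 20 05, i.e. 12012005.
encode_word <- function(word) {
  idx <- match(strsplit(toupper(word), "")[[1]], LETTERS)
  as.numeric(paste(sprintf("%02d", idx), collapse = ""))
}

# Hypothetical decoder: split the digits back into pairs and look them up.
decode_word <- function(code) {
  digits <- sprintf("%.0f", code)
  if (nchar(digits) %% 2 == 1) digits <- paste0("0", digits)  # restore a lost leading zero
  starts <- seq(1, nchar(digits), by = 2)
  paste(LETTERS[as.integer(substring(digits, starts, starts + 1))], collapse = "")
}

# Real part carries ordinary data, imaginary part carries the "kind of NA".
x <- c(complex(real = 5.3, imaginary = 0),
       complex(real = 0,   imaginary = encode_word("LATE")))

is_missing <- Im(x) != 0
values <- ifelse(is_missing, NA, Re(x))  # what normal calculations would use
decode_word(Im(x[2]))                    # recovers the reason
```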



What R really is missing is quaternions and octonions which are the only
two other variations on complex numbers that are possible and are sort of
complex numbers on steroids with either three or seven distinct square
roots of minus-one, so they allow storage along additional axes in other
dimensions.



Yes, I am sure someone wrote a package for that! LOL!



Ah, here is one: https://cran.r-project.org/web/packages/onion/onion.pdf



I will end by saying my experience is that enticing people to do something
new is just a start. After they start, you often get lots of complaints and
requests for help and even requests to help them move back! Unless you make
some popular package everyone runs to, NOBODY else will be able to help
them on some things. The reality is that some of the more common tasks
these people do are sometimes already optimized for them and often do not
require them to know more. I have had to use these systems and for some common
tasks they are easy. Dialog boxes can pop up and let you check off various
options and off you go. No need to learn lots of programming details like
the names of various functions that do a Tukey test and what arguments they
need and what errors might have to be handled and so on. I know SPSS often
produces LOTS of output including many things you do not want and then lets
you remove parts you don’t need or don’t even understand. Sure, R can
have similar functionality but often you are expected to sort of stitch
various parts together as well as ADD your own bits. I love that and value
being able to be creative. In my experience, most normal people just want
to get the job done and be fairly certain others accept the results and then
do other activities they are better suited for, or at least think they are.



There are intermediates I have used where I let them do various kinds of
processing on SPSS and save the result in some format I can read into R for
additional processing. The latter may not be stuff that requires keeping
track of multiple NA equivalents. Of course if you want to save the results
and move them back, that is a challenge. Hybrid approaches may tempt them
to try something and maybe later do more and more and move over.



______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


