Hi Adrian,

Have a look at the vctrs package: it provides low-level primitives that might simplify your life a bit. I think you can get quite far by creating a custom type that stores NAs in an attribute and uses the vctrs proxy/restore machinery to preserve those attributes across operations. Going that route will likely give you a much more flexible and robust solution.
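As a minimal sketch of this idea (the `tagged_dbl` class name and `new_tagged_dbl()` constructor are hypothetical, invented for illustration; this assumes the vctrs package is installed and relies on its `new_rcrd()`/`field()` record machinery, which routes subsetting, sorting, and combining through `vec_proxy()` so the tags stay aligned with the values):

```r
library(vctrs)

# Hypothetical record type: a double vector plus a parallel character
# vector of "missing tags" ("" means not missing). Storing them together
# as a record keeps values and tags in sync under subsetting, sorting,
# splitting, and combining.
new_tagged_dbl <- function(x = double(), tag = character()) {
  vec_assert(x, double())
  vec_assert(tag, character())
  new_rcrd(list(x = x, tag = tag), class = "tagged_dbl")
}

# Display tagged values as NA(<tag>) instead of the stored code.
format.tagged_dbl <- function(x, ...) {
  out <- format(field(x, "x"))
  tag <- field(x, "tag")
  out[tag != ""] <- paste0("NA(", tag[tag != ""], ")")
  out
}

income <- new_tagged_dbl(c(1200, 999, 998),
                         c("", "refused", "dont_know"))
income[c(3, 1)]  # subsetting keeps values and tags together
```

The point of the record representation is that there is no side table to keep in step manually: vctrs applies every structural operation to both fields at once.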
Best,
Taras

> On 24 May 2021, at 15:09, Adrian Dușa <dusa.adr...@gmail.com> wrote:
>
> Dear Alex,
>
> Thanks for piping in, I am learning with each new message.
> The problem is clear, but the solution escapes me. I've already tried
> the attributes route: it is going to triple the data size. Along with the
> additional (logical) variable that specifies which level is missing, one
> also needs to store an index so that sorting the data would still
> maintain the correct information.
>
> One also needs to think about subsetting (subset the attributes as well),
> splitting (the same), aggregating multiple datasets (even more attention),
> creating custom vectors out of multiple variables... complexity quickly
> grows towards infinity.
>
> R factors are nice indeed, but:
> - there are numerical variables which can hold multiple missing values
>   (for instance income)
> - factors convert the original questionnaire values: if a missing value
>   was coded 999, turning that into a factor would convert that value
>   into something else
>
> I really, and wholeheartedly, do appreciate all the advice, but please be
> assured that I have been thinking about this for more than 10 years and
> still haven't found a satisfactory solution.
>
> Which makes it even more intriguing, since other software like SAS or
> Stata solved this decades ago: what is their implementation, and how come
> they don't seem to be affected by the new M1 architecture?
> When the "haven" package introduced tagged NA values I thought: ah-haa...
> so that is how it's done... only to learn that the implementation is just
> as fragile as the R internals.
>
> There really should be a robust solution for this seemingly mundane
> problem, but apparently it is far from mundane...
>
> Best wishes,
> Adrian
>
>
> On Mon, May 24, 2021 at 3:29 PM Bertram, Alexander <a...@bedatadriven.com>
> wrote:
>
>> Dear Adrian,
>> I just wanted to pipe in and underscore Thomas' point: the payload bits
>> of IEEE 754 floating point values are no place to store data that you
>> care about or need to keep. That is related not only to the R APIs, but
>> also to how processors handle floating point values and signaling and
>> quiet NaNs. It is very difficult to reason about when and under which
>> circumstances these bits are preserved. I spent a lot of time working on
>> Renjin's handling of these values and I can assure you that any such
>> scheme will end in tears.
>>
>> A far, far better option is to use R's attributes to store this kind of
>> metadata. That is exactly what this language feature is for. There is
>> already a standard 'levels' attribute that holds the labels of factors
>> like "Yes", "No", "Refused", "Interviewer error", etc. In the past, I've
>> worked on projects where we stored an additional attribute like
>> "missingLevels" that holds extra metadata on which levels should be used
>> in which kind of analysis. That way, you can preserve all the
>> information, and then write a utility function which automatically
>> applies certain logic to a whole data frame just before passing the data
>> to an analysis function. This is also important because in surveys like
>> this, different values should be excluded at different times. For
>> example, you might want to include all responses in a data quality
>> report, but exclude interviewer error and refusals when conducting a PCA
>> or fitting a model.
>>
>> Best,
>> Alex
>>
>> On Mon, May 24, 2021 at 2:03 PM Adrian Dușa <dusa.adr...@gmail.com> wrote:
>>
>>> On Mon, May 24, 2021 at 1:31 PM Tomas Kalibera <tomas.kalib...@gmail.com>
>>> wrote:
>>>
>>>> [...]
>>>>
>>>> For the reasons I explained, I would be against such a change.
>>>> Keeping the data on the side, as also recommended by others on this
>>>> list, would allow you a reliable implementation. I don't want to
>>>> support fragile package code building on unspecified R internals, and
>>>> in this case particularly internals that themselves have not stood the
>>>> test of time, so are at high risk of change.
>>>>
>>> I understand, and it makes sense.
>>> We'll have to wait for the R internals to settle (this really is
>>> surprising; I wonder how other software has solved this). In the
>>> meantime, I will probably go ahead with NaNs.
>>>
>>> Thank you again,
>>> Adrian
>>>
>>> ______________________________________________
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>> --
>> Alexander Bertram
>> Technical Director
>> *BeDataDriven BV*
>>
>> Web: http://bedatadriven.com
>> Email: a...@bedatadriven.com
>> Tel. Nederlands: +31(0)647205388
>> Skype: akbertram
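[Editorial note: the attribute-based approach Alex describes above can be sketched in a few lines of base R. The "missingLevels" attribute name follows his email; the `exclude_missing()` helper is an illustrative name, not an existing API.]

```r
# A factor keeps all original response levels, and an extra attribute
# records which levels count as missing for analysis purposes.
x <- factor(c("Yes", "No", "Refused", "Yes", "Interviewer error"),
            levels = c("Yes", "No", "Refused", "Interviewer error"))
attr(x, "missingLevels") <- c("Refused", "Interviewer error")

# Turn the designated levels into real NAs just before analysis,
# leaving the stored data (and the data quality report) untouched.
exclude_missing <- function(f) {
  drop <- attr(f, "missingLevels")
  f[f %in% drop] <- NA
  droplevels(f)
}

table(x, useNA = "ifany")                   # full view, for quality checks
table(exclude_missing(x), useNA = "ifany")  # analysis view, missings as NA
```

This matches the workflow in Alex's email: keep every response in the stored data, and apply the exclusion logic in one utility function immediately before modelling.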