skolenik opened a new issue, #49149:
URL: https://github.com/apache/arrow/issues/49149

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
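   It appears that `arrow::write_parquet()` does not preserve 
`haven::tagged_na()` missing values and converts them to regular missing 
values instead, so the tag is lost (loss of data). The reprex below goes 
through `arrow::write_dataset()`, which writes Parquet by default.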
   
   ```
   library(arrow)
   library(haven)
   library(dplyr)
   library(labelled)
   
   this_pq <- tempfile(fileext = "parquet")
   
   mydf <- data.frame(x = c(1,NA, haven::tagged_na("a")))
   mydf
   
   # save as parquet and reopen
   arrow::write_dataset(dataset = mydf, path = this_pq)
   this_ds <- open_dataset(sources = this_pq)
   mydf2 <- collect(this_ds)
   
   # expected result
   is_tagged_na(mydf$x, "a")
   
   # actual
   is_tagged_na(mydf2$x, "a")
   is_regular_na(mydf2$x)
   
   packageVersion("arrow")
   ```
   
   Output:
   
   ```
   > # expected result 
   > is_tagged_na(mydf$x, "a")
   [1] FALSE FALSE  TRUE 
   > # actual 
   > is_tagged_na(mydf2$x, "a") 
   [1] FALSE FALSE FALSE 
   > is_regular_na(mydf2$x) 
   [1] FALSE  TRUE  TRUE
   > packageVersion("arrow")
   [1] ‘23.0.0’
   ```
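   
   To narrow down where the tag is lost, it may help to separate the in-memory conversion from the file round trip. The following is a minimal sketch using `Array$create()`, `write_parquet()`, and `read_parquet()` from `arrow` and `na_tag()` from `haven`; the names `via_array`, `via_parquet`, and `pq_file` are made up for illustration. It makes no claim about which step drops the tag, it only prints what each step returns.
   
   ```r
   library(arrow)
   library(haven)
   
   x <- c(1, NA, tagged_na("a"))
   
   # Step 1: R vector -> Arrow array -> R vector, with no file involved.
   via_array <- Array$create(x)$as_vector()
   na_tag(via_array)
   
   # Step 2: single-file Parquet round trip.
   pq_file <- tempfile(fileext = ".parquet")
   write_parquet(data.frame(x = x), pq_file)
   via_parquet <- read_parquet(pq_file)
   na_tag(via_parquet$x)
   ```
   
   If the tag is already gone after step 1, the loss would sit in the R-to-Arrow conversion rather than in the Parquet writer itself.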
    
   https://haven.tidyverse.org/reference/tagged_na.html
   
   Background: In SAS, the special missing values `.a`, `.b`, ..., `.z` are 
implemented as values near negative infinity. In Stata, they are implemented 
as values near positive infinity. Both systems literally reserve a few extreme 
values of a given storage type to be interpreted as special missing values 
rather than numbers. For example, Stata's `byte` (int8) type stores valid 
values from -127 to 100 and reserves 101 for `.`, 102 for `.a`, ..., and 127 
for `.z`; see https://www.stata.com/help.cgi?datatypes and 
https://www.stata.com/help.cgi?missing. I don't really know how `haven` 
implements tagged missing values. The value labels are stored as 
`attributes()` and are more or less retained (see the somewhat extended reprex 
below). The main value of the whole concept is that you can distinguish the 
reasons for missing values with labels such as 
`haven::labelled(your_numeric_vector, labels = c("Don't know" = tagged_na("d"), 
"Refused" = tagged_na("r"), "Valid skip" = tagged_na("s"), 
"Not in universe" = tagged_na("u")))`.
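   
   For readers unfamiliar with tagged missing values, here is a minimal sketch of how they behave in R. It uses only functions exported by `haven` (`tagged_na()`, `na_tag()`, `labelled()`, `print_tagged_na()`); the variable name `income` and its values are made up for illustration.
   
   ```r
   library(haven)
   
   # One valid value, one regular NA, and two tagged NAs.
   income <- c(50000, NA, tagged_na("d"), tagged_na("r"))
   
   # Base R treats every missing value the same way.
   is.na(income)
   
   # haven can still recover the tag (NA for elements without a tag).
   na_tag(income)
   
   # Value labels attach a reason for missingness to each tag.
   income_lbl <- labelled(
     income,
     labels = c("Don't know" = tagged_na("d"), "Refused" = tagged_na("r"))
   )
   
   # Print the vector with the tags made visible.
   print_tagged_na(income)
   ```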
   
   Labels are OK-ish:
   
   ```
   mydf <- data.frame(x = labelled(c(1, NA, haven::tagged_na("a")),
                                   labels = c("Blah" = 1, "aaa" = tagged_na("a"))))
   arrow::write_dataset(dataset = mydf, path = this_pq, format="parquet")
   this_ds <- open_dataset(sources = this_pq)
   mydf2 <- collect(this_ds)
   get_value_labels(mydf$x) |> labelled::print_tagged_na()
   get_value_labels(mydf2$x) |> labelled::print_tagged_na()
   ```
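   
   Until the tags survive the round trip, one possible workaround is to carry them in a companion character column and rebuild the tagged values after reading. The sketch below is an assumption-laden illustration, not an `arrow` feature: the column name `x_tag` and the helper `restore_tags()` are hypothetical, and the Parquet round trip is replaced by a stand-in line that collapses every missing value to a regular `NA`.
   
   ```r
   library(haven)
   
   x <- c(1, NA, tagged_na("a"))
   
   # Before writing: keep the tags in a plain character column.
   df <- data.frame(x = x, x_tag = na_tag(x))
   
   # Stand-in for the Parquet round trip (assumption): every missing
   # value comes back as a regular NA.
   df$x[is.na(df$x)] <- NA_real_
   
   # After reading: rebuild tagged NAs wherever a tag was recorded.
   # restore_tags() is a hypothetical helper, not part of haven or arrow.
   restore_tags <- function(values, tags) {
     has_tag <- !is.na(tags)
     values[has_tag] <- tagged_na(tags[has_tag])
     values
   }
   df$x <- restore_tags(df$x, df$x_tag)
   is_tagged_na(df$x, "a")
   ```
   
   This only covers plain numeric columns; a `labelled()` column would also need its label attributes checked after the rebuild.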
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
