skolenik opened a new issue, #49149:
URL: https://github.com/apache/arrow/issues/49149
### Describe the bug, including details regarding any error messages, version, and platform.
It appears that `arrow::write_parquet()` does not preserve
`haven::tagged_na()` missing values: they come back as regular missing
values (loss of data). The reprex below goes through
`arrow::write_dataset()`, which writes Parquet by default.
```
library(arrow)
library(haven)
library(dplyr)
library(labelled)
# write_dataset() treats `path` as a directory, so the extension is cosmetic
this_pq <- tempfile(fileext = ".parquet")
mydf <- data.frame(x = c(1, NA, haven::tagged_na("a")))
mydf
# save as parquet and reopen
arrow::write_dataset(dataset = mydf, path = this_pq)
this_ds <- open_dataset(sources = this_pq)
mydf2 <- collect(this_ds)
# expected result
is_tagged_na(mydf$x, "a")
# actual
is_tagged_na(mydf2$x, "a")
is_regular_na(mydf2$x)
packageVersion("arrow")
```
Output:
```
> # expected result
> is_tagged_na(mydf$x, "a")
[1] FALSE FALSE TRUE
> # actual
> is_tagged_na(mydf2$x, "a")
[1] FALSE FALSE FALSE
> is_regular_na(mydf2$x)
[1] FALSE TRUE TRUE
> packageVersion("arrow")
[1] ‘23.0.0’
```
https://haven.tidyverse.org/reference/tagged_na.html
Background: In SAS, the special missing values `.a`, `.b`, ..., `.z` sort below
every number (near negative infinity). In Stata, the special missing values
`.a`, `.b`, ..., `.z` sort above every number (near positive infinity): Stata
literally reserves the top few values of each storage type and interprets them
as missing values rather than numbers. For the int8 (`byte`) type, valid values
run from -127 to 100, 101 is the system missing value `.`, 102 is `.a`, ..., and
127 is `.z`; see https://www.stata.com/help.cgi?datatypes and
https://www.stata.com/help.cgi?missing. I don't really know how `haven`
implements tagged missing values (the labels are stored in `attributes()` and
are more or less retained; see the somewhat extended reprex below). The main
benefit of the whole concept is that you can distinguish the reasons for
missing values with labels such as
`haven::labelled(your_numeric_vector, labels = c("Don't know" = tagged_na("d"), "Refused" = tagged_na("r"), "Valid skip" = tagged_na("s"), "Not in universe" = tagged_na("u")))`.
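To make the labelling idea concrete, here is a minimal sketch that needs only `haven` and does not touch arrow. The names `values`, `survey_item`, and `tags` are made up for illustration; `haven::na_tag()` returns the tag letter of each tagged missing value and `NA` for everything else, including regular `NA`.

```r
library(haven)

# Two valid answers, two tagged missing values, one regular NA
values <- c(1, 2, tagged_na("d"), tagged_na("r"), NA)

# Attach a reason to each tag (illustrative labels taken from the text above)
survey_item <- labelled(
  values,
  labels = c(
    "Don't know"      = tagged_na("d"),
    "Refused"         = tagged_na("r"),
    "Valid skip"      = tagged_na("s"),
    "Not in universe" = tagged_na("u")
  )
)

# All three missing entries look the same to is.na() ...
is.na(values)

# ... but na_tag() recovers which tag, if any, each one carries
tags <- na_tag(values)
tags
```

Running `na_tag()` on the same column before and after the Parquet round trip would be one way to check whether the tags survive.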
Labels are OK-ish:
```
mydf <- data.frame(x = labelled(c(1,NA, haven::tagged_na("a")), labels =
c("Blah" = 1, "aaa" = tagged_na("a"))))
arrow::write_dataset(dataset = mydf, path = this_pq, format="parquet")
this_ds <- open_dataset(sources = this_pq)
mydf2 <- collect(this_ds)
get_value_labels(mydf$x) |> labelled::print_tagged_na()
get_value_labels(mydf2$x) |> labelled::print_tagged_na()
```
### Component(s)
R