jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817932243
##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
tibble(a = 1:2)
)
})
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+ test_df <- tibble::tibble(
+ posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+ date_var = as.Date("2022-02-25"),
+ character_ymd_var = "2022-02-25 00:00:01",
+ character_ydm_var = "2022/25/02 00:00:01",
+ integer_var = 32L,
+ double_var = 34.56
+ )
+
+ # casting from POSIXct treated separately so we can skip on Windows
+ # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+ compare_dplyr_binding(
+ .input %>%
+ mutate(
+ date_dv = as.Date(date_var),
+ date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+ date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+ date_int = as.Date(integer_var, origin = "1970-01-01")
+ ) %>%
+ collect(),
+ test_df
+ )
+
+ # the way we go about it is a bit different, but the result is the same =>
+ # testing without compare_dplyr_binding()
+ expect_equal(
+ test_df %>%
+ arrow_table() %>%
+ mutate(date_double = as.Date(double_var)) %>%
+ collect(),
+ test_df %>%
+ arrow_table() %>%
+ mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+ collect()
+ )
+
+ expect_equal(
+ test_df %>%
+ record_batch() %>%
+ mutate(date_double = as.Date(double_var)) %>%
+ collect(),
+ test_df %>%
+ arrow_table() %>%
+ mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+ collect()
+ )
+
+ # actual and expected differ due to doubles are accounted for (floored in
+ # arrow and rounded to the next decimal in R)
+ expect_error(
+ compare_dplyr_binding(
+ .input %>%
+ mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+ collect(),
+ test_df
+ )
+ )
Review comment:
Thanks for the explanation. Is this part of the comment still accurate
then: `(floored in arrow and rounded to the next decimal in R)`?
I suspect (but don't know for certain!) what's going on is that you're
running into how R stores dates and how that differs from
[`date32()`](https://arrow.apache.org/docs/cpp/api/datatype.html?highlight=date32#classarrow_1_1_date32_type)
in Arrow. In R, a date object can be a float (I haven't looked at the source
to see if it's _always_ stored as a float, but that would be interesting to
know!) and that number is the number of days since the epoch [1]. So in R you can
have fractional days:
```
> as.Date(36.54, origin = "1970-01-01")
[1] "1970-02-06"
> as.numeric(as.Date(36.54, origin = "1970-01-01"))
[1] 36.54
```
So if you add a small amount (but enough to get to the next whole number),
you'll see a new date:
```
> as.Date(36.54, origin = "1970-01-01") + 0.46
[1] "1970-02-07"
```
But if we actually floored here, we would get the integer, and adding the
same amount won't get you to the next day (just to a bit before noon here):
```
> as.numeric(as.Date(floor(36.54), origin = "1970-01-01"))
[1] 36
> as.Date(floor(36.54), origin = "1970-01-01") + 0.46
[1] "1970-02-06"
```
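To make the mismatch concrete, here is a minimal sketch in plain R (no Arrow involved) using the test's `double_var` value of 34.56. It assumes Arrow's result is equivalent to flooring the double before conversion, which is the hypothesis above rather than something confirmed against the Arrow source.

```r
# Plain-R sketch: a fractional-day Date vs. a floored one.
# Assumption: flooring mimics what ends up in an Arrow date32() value.
x <- 34.56  # same value as double_var in the test

r_date <- as.Date(x, origin = "1970-01-01")               # keeps the fraction
floored_date <- as.Date(floor(x), origin = "1970-01-01")  # whole days only

# Both print as the same calendar day ...
same_label <- format(r_date) == format(floored_date)

# ... but the underlying numbers differ, so a strict equality check fails.
same_value <- as.numeric(r_date) == as.numeric(floored_date)

# Storage mode of a Date built from a double
storage <- typeof(unclass(r_date))

print(c(same_label = same_label, same_value = same_value))
print(storage)
```

If that assumption holds, it would explain why the two results look identical when printed while `compare_dplyr_binding()` still reports a difference.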
Soooo, this means for us that we need to choose from (in order of best to
worst IMO, but all would be fine I think):
* Store a more precise value (e.g. as
[`date64()`](https://arrow.apache.org/docs/cpp/api/datatype.html?highlight=date32#classarrow_1_1_date64_type)),
though we can't simply `cast(x, date64())` because `date64()` stores
milliseconds since the epoch. We also might still have some complications
with comparison: I haven't experimented with `date64()` objects getting pulled
back into R and whether they come in as Dates backed by floats.
* Reject non-integers with an error and make a Jira to clean this up later.
* Accept that Arrow simply floors, and that the actual numeric values are
different.
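The second option could be sketched roughly as below. `assert_whole_days()` is a hypothetical helper, not an existing arrow function, and it only handles plain R vectors; inside a dplyr binding the input would be an Arrow Expression, so the real check would have to be expressed differently.

```r
# Hypothetical helper (not part of the arrow package): reject fractional day counts.
# Works on plain R numeric vectors only; missing values are allowed through.
assert_whole_days <- function(x) {
  is_fractional <- !is.na(x) & x != floor(x)
  if (any(is_fractional)) {
    stop("as.Date() from a double with fractional days is not supported", call. = FALSE)
  }
  invisible(x)
}

assert_whole_days(c(32, 34, NA))  # whole days (and NA) pass silently

# A fractional value such as the test's double_var triggers the error.
result <- tryCatch(
  assert_whole_days(34.56),
  error = function(e) conditionMessage(e)
)
print(result)
```

The error message text here is made up for illustration; the point is only that fractional input fails loudly instead of being silently floored.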
[1] And it actually is the epoch; R converts from a different origin:
```
> as.numeric(as.Date(36.54, origin = "1999-12-31"))
[1] 10992.54
```
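The arithmetic behind that footnote can be checked with a short sketch: the stored number is the origin's own offset from 1970-01-01 plus the value passed in. The variable names are only for illustration.

```r
# The stored number is days since 1970-01-01; `origin` only shifts the input.
origin_offset <- as.numeric(as.Date("1999-12-31"))  # days from 1970-01-01 to the origin
shifted <- as.numeric(as.Date(36.54, origin = "1999-12-31"))

print(origin_offset)
print(shifted - origin_offset)  # approximately the 36.54 that was passed in
```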