[ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957884#comment-16957884 ]
Shannon C Lewis commented on ARROW-3543: ---------------------------------------- Just checking in any updates on this? > [R] Better support for timestamp format and time zones in R > ----------------------------------------------------------- > > Key: ARROW-3543 > URL: https://issues.apache.org/jira/browse/ARROW-3543 > Project: Apache Arrow > Issue Type: Bug > Components: R > Reporter: Olaf > Priority: Major > Fix For: 1.0.0 > > > See below for original description and reports. In sum, there is a mismatch > between how the C++ library and R interpret data without a timezone, and it > turns out that we're not passing the timezone to R if it is set in Arrow C++ > anyway. > The [C++ library > docs|http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE] > say "If a timezone-aware field contains a recognized timezone, its values > may be localized to that locale upon display; the values of timezone-naive > fields must always be displayed “as is”, with no localization performed on > them." But R's print default, as well as the parsing default, is the current > time zone: > https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html > The C++ library seems to parse timestamp strings that don't have timezone > information as if they are UTC, so when you read timezone-naive timestamps > from Arrow and print them in R, they are shifted to be localized to the > current timezone. If you print timestamp data from Arrow with > {{print(timestamp_var, tz="GMT")}} it would look as you expect. > On further inspection, the [arrow-to-vector code for > timestamp|https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L504-L514] > doesn't seem to consider time zone information even if it does exist. So we > don't have the means currently in R to display timestamp data faithfully, > whether or not it is timezone-aware. > Among the tasks here: > * Include the timezone attribute in the POSIXct R vector that gets created > from a timestamp Arrow array > * Ensure that timezone-naive data from Arrow is printed in R "as is" with no > localization > ----- > Original description: > Hello the dream team, > Pasting from [https://github.com/wesm/feather/issues/351] > Thanks for this wonderful package. I was playing with feather and some > timestamps and I noticed some dangerous behavior. Maybe it is a bug. > Consider this > > {code:java} > import pandas as pd > import feather > import numpy as np > df = pd.DataFrame( > {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), > pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 > 14:01:02.200')]} > ) > df['timestamp_est'] = > pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None) > df > Out[17]: > string_time_utc timestamp_est > 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531 > 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 > 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 > {code} > Here I create the corresponding `EST` timestamp of my original timestamps (in > `UTC` time). > Now saving the dataframe to `csv` or to `feather` will generate two > completely different results. > > {code:java} > df.to_csv('P://testing.csv') > df.to_feather('P://testing.feather') > {code} > Switching to R. > Using the good old `csv` gives me something a bit annoying, but expected. R > thinks my timezone is `UTC` by default, and wrongly attached this timezone to > `timestamp_est`. No big deal, I can always use `with_tz` or even better: > import as character and process as timestamp while in R. > > {code:java} > > dataframe <- read_csv('P://testing.csv') > Parsed with column specification: > cols( > X1 = col_integer(), > string_time_utc = col_datetime(format = ""), > timestamp_est = col_datetime(format = "") > ) > Warning message: > Missing column names filled in: 'X1' [1] > > > > dataframe %>% mutate(mytimezone = tz(timestamp_est)) > A tibble: 3 x 4 > X1 string_time_utc timestamp_est > <int> <dttm> <dttm> > 1 0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 > 2 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 > 3 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 > mytimezone > <chr> > 1 UTC > 2 UTC > 3 UTC {code} > {code:java} > #Now look at what happens with feather: > > > dataframe <- read_feather('P://testing.feather') > > > > dataframe %>% mutate(mytimezone = tz(timestamp_est)) > A tibble: 3 x 3 > string_time_utc timestamp_est mytimezone > <dttm> <dttm> <chr> > 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 "" > 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 "" > 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 "" {code} > My timestamps have been converted!!! pure insanity. > Am I missing something here? > Thanks!! -- This message was sent by Atlassian Jira (v8.3.4#803005)