[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11358: ARROW-12820: [C++] Support zone offset in ISO8601, strptime parser

GitBox Wed, 20 Oct 2021 07:00:38 -0700


jorisvandenbossche commented on a change in pull request #11358:
URL: https://github.com/apache/arrow/pull/11358#discussion_r732809375




##########
File path: docs/source/cpp/csv.rst
##########
@@ -190,6 +190,70 @@ dictionary-encoded string-like array.  It switches to a plain string-like
 array when the threshold in :member:`ConvertOptions::auto_dict_max_cardinality`
 is reached.
 
+Timestamp inference/parsing
+---------------------------
+
+If type inference is enabled, the CSV reader first tries to interpret
+string-like columns as timestamps. If all rows have some zone offset
+(e.g. ``Z`` or ``+0100``), even if the offsets are inconsistent, then the
+inferred type will be UTC timestamp. If no rows have a zone offset, then the
+inferred type will be timestamp without timezone. A mix of rows with/without
+offsets will result in a string column.
+
+If the type is explicitly specified as a timestamp without timezone ("naive"),
+then the reader will error on values with zone offsets in that column. Else, if
+the type is timestamp with timezone, the column values must either all have
+zone offsets or all lack zone offsets. In the former case, values are
+unambiguous, since each row specifies a precise time in UTC, but in the latter
+case, Arrow will currently interpret the timestamps as specifying values in UTC
+(i.e. as if they had the zone offset "Z" or "+0000"), *not* as values in the
+local time of the timezone.

Review comment:
       So currently we have this behaviour:
   
   ```python
   In [26]: s = """col
       ...: 2021-01-01 09:00:00
       ...: """
   
   In [27]: csv.read_csv(io.BytesIO(s.encode()))
   Out[27]: 
   pyarrow.Table
   col: timestamp[s]
   ----
   col: [[2021-01-01 09:00:00]]
   
   In [28]: s2 = """col
       ...: 2021-01-01 09:00:00+01:00
       ...: """
   
   In [29]: csv.read_csv(io.BytesIO(s2.encode()))
   Out[29]: 
   pyarrow.Table
   col: string
   ----
   col: [["2021-01-01 09:00:00+01:00"]]
   ```
   
   So with an offset, the "inference" doesn't actually infer a timestamp (does this PR change that?).
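
For reference, the inference rule described in the proposed docs text (all offsets, no offsets, or a mix) can be sketched in plain Python. This is only an illustration of the documented rule, not Arrow's actual implementation: the function name `infer_column_type`, the returned labels, and the regex are hypothetical, and the sketch assumes every value is otherwise a well-formed timestamp string.

```python
import re

# Hypothetical pattern: a time-of-day followed by "Z" or a numeric offset
# such as "+01", "+0100" or "+01:00" at the end of the string.
_OFFSET = re.compile(
    r"\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?\s*(?:Z|[+-]\d{2}(?::?\d{2})?)$"
)


def infer_column_type(values):
    """Return a label for the type the documented rule would infer."""
    has_offset = [bool(_OFFSET.search(v.strip())) for v in values]
    if all(has_offset):
        # Every row carries an offset, even if the offsets differ.
        return "timestamp[UTC]"
    if not any(has_offset):
        # No row carries an offset.
        return "timestamp (naive)"
    # A mix of rows with and without offsets falls back to strings.
    return "string"


print(infer_column_type(["2021-01-01 09:00:00"]))
print(infer_column_type(["2021-01-01 09:00:00+01:00", "2021-01-01 09:00:00Z"]))
print(infer_column_type(["2021-01-01 09:00:00", "2021-01-01 09:00:00+0100"]))
```

Under that rule, the `s2` example above would be expected to come back as a UTC timestamp column rather than a string column.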
   
   And when explicitly mentioning the type for values without a timezone offset:
   
   ```python
   In [35]: csv.read_csv(io.BytesIO(s.encode()), convert_options=csv.ConvertOptions(column_types={"col": pa.timestamp('s')}))
   Out[35]: 
   pyarrow.Table
   col: timestamp[s]
   ----
   col: [[2021-01-01 09:00:00]]
   
   In [36]: csv.read_csv(io.BytesIO(s.encode()), convert_options=csv.ConvertOptions(column_types={"col": pa.timestamp('s', tz="Europe/Brussels")}))
   Out[36]: 
   pyarrow.Table
   col: timestamp[s, tz=Europe/Brussels]
   ----
   col: [[2021-01-01 09:00:00]]
   ```
   So here we indeed "ignore" the timezone of the specified type and effectively assume the naive strings are in UTC.
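
To make the difference between the two interpretations concrete, here is a small standard-library sketch (no pyarrow involved). It assumes a fixed `+01:00` offset as a stand-in for Europe/Brussels in January; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# The naive wall-clock value from the CSV example above.
naive = datetime.strptime("2021-01-01 09:00:00", "%Y-%m-%d %H:%M:%S")

# Assumption: a fixed +01:00 offset, matching Brussels winter time.
BRUSSELS_WINTER = timezone(timedelta(hours=1))

# Interpretation 1: treat the naive value as if it were already UTC.
as_utc = naive.replace(tzinfo=timezone.utc)

# Interpretation 2: treat the naive value as local Brussels time,
# then convert it to UTC.
as_local = naive.replace(tzinfo=BRUSSELS_WINTER).astimezone(timezone.utc)

print(as_utc.isoformat())    # 2021-01-01T09:00:00+00:00
print(as_local.isoformat())  # 2021-01-01T08:00:00+00:00
```

The two readings differ by one hour for this value, which is why it seems worth stating explicitly in the docs which one applies.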




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
