[GitHub] [arrow] rok commented on a change in pull request #11358: ARROW-12820: [C++] Support zone offset in ISO8601, strptime parser

GitBox Sat, 09 Oct 2021 03:17:22 -0700


rok commented on a change in pull request #11358:
URL: https://github.com/apache/arrow/pull/11358#discussion_r725374276




##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1429,6 +1430,17 @@ TYPED_TEST(TestStringKernels, Strptime) {
   this->CheckUnary("strptime", input1, timestamp(TimeUnit::MICRO), output1, 
&options);
 }
 
+TYPED_TEST(TestStringKernels, StrptimeZoneOffset) {
+  if (!arrow::internal::kStrptimeSupportsZone) {
+    GTEST_SKIP() << "strptime does not support %z on this platform";
+  }
+  std::string input1 = R"(["5/1/2020 +01", null, "12/11/1900 -01:30"])";
+  std::string output1 =
+      R"(["2020-04-30T23:00:00.000000", null, "1900-12-11T01:30:00.000000"])";
+  StrptimeOptions options("%m/%d/%Y %z", TimeUnit::MICRO);
+  this->CheckUnary("strptime", input1, timestamp(TimeUnit::MICRO), output1, 
&options);

Review comment:
       > Now, I think the first one is reasonable to continue interpreting in 
this way. And the second seems reasonable to interpret as `timestamp[s, 
"UTC"]`. However, this is a user-visible change in behavior. Do we also want to 
reflect that in other places we use the ISO8601 parser?
   
   Do we know how breaking this would be? On that note [comparison doesn't 
error when comparing timezone-aware to timezone-naive 
timestamp](https://issues.apache.org/jira/browse/ARROW-13081) at the moment.
   
   > And what about the third? Do we error? (This wasn't previously parseable 
before.) Joris touches on this above as well.
   
   I'd find it odd to see mixed formats in my data and would want to know about 
it. So I'd propose erroring and allow for error handling options in strptime 
(return null if ambiguous or assume a given timezone if missing).

##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1429,6 +1430,17 @@ TYPED_TEST(TestStringKernels, Strptime) {
   this->CheckUnary("strptime", input1, timestamp(TimeUnit::MICRO), output1, 
&options);
 }
 
+TYPED_TEST(TestStringKernels, StrptimeZoneOffset) {
+  if (!arrow::internal::kStrptimeSupportsZone) {
+    GTEST_SKIP() << "strptime does not support %z on this platform";
+  }
+  std::string input1 = R"(["5/1/2020 +01", null, "12/11/1900 -01:30"])";
+  std::string output1 =
+      R"(["2020-04-30T23:00:00.000000", null, "1900-12-11T01:30:00.000000"])";
+  StrptimeOptions options("%m/%d/%Y %z", TimeUnit::MICRO);
+  this->CheckUnary("strptime", input1, timestamp(TimeUnit::MICRO), output1, 
&options);

Review comment:
       > Now, I think the first one is reasonable to continue interpreting in 
this way. And the second seems reasonable to interpret as `timestamp[s, 
"UTC"]`. However, this is a user-visible change in behavior. Do we also want to 
reflect that in other places we use the ISO8601 parser?
   
   Do we know how breaking this would be? On that note [comparison doesn't 
error when comparing timezone-aware to timezone-naive 
timestamp](https://issues.apache.org/jira/browse/ARROW-13081) at the moment.
   
   > And what about the third? Do we error? (This wasn't previously parseable 
before.) Joris touches on this above as well.
   
   I'd find it odd to see mixed formats in my data and would want to know about 
it. So I'd propose erroring and allow for error handling options in strptime 
(return null if ambiguous or assume a user specified timezone if missing).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] rok commented on a change in pull request #11358: ARROW-12820: [C++] Support zone offset in ISO8601, strptime parser

Reply via email to