Re: [PR] [SPARK-53383][PYTHON][TESTS] Add tests to check the timezone handling in Arrow UDF [spark]

via GitHub Tue, 26 Aug 2025 04:35:40 -0700


zhengruifeng commented on code in PR #52124:
URL: https://github.com/apache/spark/pull/52124#discussion_r2300706392



##########
python/pyspark/sql/tests/arrow/test_arrow_udf.py:
##########
@@ -97,6 +99,65 @@ def foo(x):
         self.assertEqual(foo.returnType, DoubleType())
         self.assertEqual(foo.evalType, PythonEvalType.SQL_SCALAR_ARROW_UDF)
 
+    def test_time_zone_against_map_in_arrow(self):
+        import pyarrow as pa
+
+        for tz in [
+            "Asia/Shanghai",
+            "Asia/Hong_Kong",
+            "America/Los_Angeles",
+            "Pacific/Honolulu",
+            "Europe/Amsterdam",
+            "US/Pacific",
+        ]:
+            with self.sql_conf({"spark.sql.session.timeZone": tz}):
+                # There is a time-zone conversion in df.collect:
+                # ts.astimezone().replace(tzinfo=None)
+                # it is controlled by env os.environ["TZ"].
+                # Note that if the env is not equivalent to spark.sql.session.timeZone,
+                # then there is a mismatch between the internal arrow data and df.collect.
+                os.environ["TZ"] = tz
+                time.tzset()
+
+                df = self.spark.sql("SELECT TIMESTAMP('2019-04-12 15:50:01') AS ts")
+
+                def check_value(t):
+                    assert isinstance(t, pa.Array)
+                    assert isinstance(t, pa.TimestampArray)
+                    assert isinstance(t[0], pa.Scalar)
+                    assert isinstance(t[0], pa.TimestampScalar)
+                    ts = t[0].as_py()
+                    assert isinstance(ts, datetime.datetime)
+                    assert ts.year == 2019
+                    assert ts.month == 4
+                    assert ts.day == 12
+                    assert ts.hour == 15
+                    assert ts.minute == 50
+                    assert ts.second == 1
+                    # the timezone is still kept in the internal arrow data
+                    assert ts.tzinfo is not None
+                    assert str(ts.tzinfo) == tz, str(ts.tzinfo)
+
+                @arrow_udf("timestamp")
+                def identity(t):
+                    check_value(t)
+                    return t
+
+                expected = [Row(ts=datetime.datetime(2019, 4, 12, 15, 50, 1))]
+                self.assertEqual(expected, df.collect())
+
+                result1 = df.select(identity("ts").alias("ts"))
+                self.assertEqual(expected, result1.collect())
+
+                def identity2(iter):
+                    for batch in iter:
+                        t = batch["ts"]
+                        check_value(t)
+                        yield batch
+
+                result2 = df.mapInArrow(identity2, df.schema)

Review Comment:
   Compare the results with `mapInArrow`; the behaviors are the same.
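
   The comment in the test explains why `TZ` is set alongside `spark.sql.session.timeZone`: the conversion it quotes, `ts.astimezone().replace(tzinfo=None)`, depends on the process-local zone. A minimal standalone sketch of that effect follows, using only the standard library and no Spark. The helper name `to_naive_local` and the POSIX-style `TZ` strings are assumptions made for illustration (they avoid depending on an installed zone database); they are not part of the PR.

```python
import datetime
import os
import time


def to_naive_local(ts: datetime.datetime, process_tz: str) -> datetime.datetime:
    """Apply the conversion quoted in the test comment,
    ts.astimezone().replace(tzinfo=None), after pointing the process at process_tz.
    Hypothetical helper for illustration only."""
    os.environ["TZ"] = process_tz
    time.tzset()  # Unix only: makes the C library re-read TZ
    return ts.astimezone().replace(tzinfo=None)


# POSIX TZ strings (note the inverted sign): assumed stand-ins for the IANA names.
SHANGHAI_LIKE = "CST-8"  # UTC+8, no DST (same offset as Asia/Shanghai)
HONOLULU_LIKE = "HST10"  # UTC-10, no DST (same offset as Pacific/Honolulu)

# A timezone-aware timestamp at UTC+8, similar to what the Arrow data carries.
aware = datetime.datetime(
    2019, 4, 12, 15, 50, 1,
    tzinfo=datetime.timezone(datetime.timedelta(hours=8)),
)

matched = to_naive_local(aware, SHANGHAI_LIKE)     # TZ agrees with the data's zone
mismatched = to_naive_local(aware, HONOLULU_LIKE)  # TZ disagrees with the data's zone

print(matched)     # 2019-04-12 15:50:01
print(mismatched)  # 2019-04-11 21:50:01
```

   When the process zone matches, the wall-clock fields survive the conversion, which is what lets the test compare against a naive `datetime.datetime(2019, 4, 12, 15, 50, 1)`. When it does not match, the same instant is rendered 18 hours earlier.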



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


