[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20537 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r167266079 --- Diff: python/pyspark/sql/types.py --- @@ -1744,8 +1744,27 @@ def _check_series_convert_timestamps_internal(s, timezone): from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype # TODO: handle nested timestamps, such as ArrayType(TimestampType())? if is_datetime64_dtype(s.dtype): +# tz_localize with ambiguous=False has the same behavior of pytz.localize +# >>> import datetime +# >>> import pandas as pd +# >>> import pytz +# >>> +# >>> t = datetime.datetime(2015, 11, 1, 1, 23, 24) +# >>> ts = pd.Series([t]) +# >>> tz = pytz.timezone('America/New_York') +# >>> +# >>> ts.dt.tz_localize(tz, ambiguous=False) +# 0 2015-11-01 01:23:24-05:00 +# dtype: datetime64[ns, America/New_York] +# >>> +# >>> ts.dt.tz_localize(tz, ambiguous=True) +# 0 2015-11-01 01:23:24-04:00 +# dtype: datetime64[ns, America/New_York] +# >>> +# >>> str(tz.localize(t)) +# '2015-11-01 01:23:24-05:00' --- End diff -- I added a comment to explain this.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r167133597 --- Diff: python/pyspark/sql/types.py --- @@ -1744,8 +1744,27 @@ def _check_series_convert_timestamps_internal(s, timezone): from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype # TODO: handle nested timestamps, such as ArrayType(TimestampType())? if is_datetime64_dtype(s.dtype): +# tz_localize with ambiguous=False has the same behavior of pytz.localize +# >>> import datetime +# >>> import pandas as pd +# >>> import pytz +# >>> +# >>> t = datetime.datetime(2015, 11, 1, 1, 23, 24) +# >>> ts = pd.Series([t]) +# >>> tz = pytz.timezone('America/New_York') +# >>> +# >>> ts.dt.tz_localize(tz, ambiguous=False) +# 0 2015-11-01 01:23:24-05:00 +# dtype: datetime64[ns, America/New_York] +# >>> +# >>> ts.dt.tz_localize(tz, ambiguous=True) +# 0 2015-11-01 01:23:24-04:00 +# dtype: datetime64[ns, America/New_York] +# >>> +# >>> str(tz.localize(t)) +# '2015-11-01 01:23:24-05:00' --- End diff -- @icexelloss, I get that it's good to know, but shall we describe it as prose? This comment is formatted like a doctest, but it is actually just a comment. It would be nicer to have a plain explanation in the comments rather than in a doctest format.
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r166975718 --- Diff: python/pyspark/sql/types.py --- @@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone): # TODO: handle nested timestamps, such as ArrayType(TimestampType())? if is_datetime64_dtype(s.dtype): tz = timezone or 'tzlocal()' -return s.dt.tz_localize(tz).dt.tz_convert('UTC') +""" +tz_localize with ambiguous=False has the same behavior of pytz.localize +>>> import datetime +>>> import pandas as pd +>>> import pytz +>>> +>>> t = datetime.datetime(2015, 11, 1, 1, 23, 24) +>>> ts = pd.Series([t]) +>>> tz = pytz.timezone('America/New_York') +>>> +>>> ts.dt.tz_localize(tz, ambiguous=False) +>>> 0 2015-11-01 01:23:24-05:00 +>>> dtype: datetime64[ns, America/New_York] +>>> +>>> ts.dt.tz_localize(tz, ambiguous=True) +>>> 0 2015-11-01 01:23:24-04:00 +>>> dtype: datetime64[ns, America/New_York] +>>> +>>> str(tz.localize(t)) +>>> '2015-11-01 01:23:24-05:00' +""" +return s.dt.tz_localize(tz, ambiguous=False).dt.tz_convert('UTC') --- End diff -- Yes, I will create a new test for `pandas_udf`. `ambiguous=False` seems to be undocumented in the method doc. @jreback, can you please confirm this usage is correct?
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r166974650 --- Diff: python/pyspark/sql/types.py --- @@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone): # TODO: handle nested timestamps, such as ArrayType(TimestampType())? if is_datetime64_dtype(s.dtype): tz = timezone or 'tzlocal()' -return s.dt.tz_localize(tz).dt.tz_convert('UTC') +""" +tz_localize with ambiguous=False has the same behavior of pytz.localize +>>> import datetime +>>> import pandas as pd +>>> import pytz +>>> +>>> t = datetime.datetime(2015, 11, 1, 1, 23, 24) +>>> ts = pd.Series([t]) +>>> tz = pytz.timezone('America/New_York') +>>> +>>> ts.dt.tz_localize(tz, ambiguous=False) +>>> 0 2015-11-01 01:23:24-05:00 +>>> dtype: datetime64[ns, America/New_York] +>>> +>>> ts.dt.tz_localize(tz, ambiguous=True) +>>> 0 2015-11-01 01:23:24-04:00 +>>> dtype: datetime64[ns, America/New_York] +>>> +>>> str(tz.localize(t)) +>>> '2015-11-01 01:23:24-05:00' --- End diff -- Yeah, let me clean up the format...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r166974367 --- Diff: python/pyspark/sql/types.py --- @@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone): # TODO: handle nested timestamps, such as ArrayType(TimestampType())? if is_datetime64_dtype(s.dtype): tz = timezone or 'tzlocal()' -return s.dt.tz_localize(tz).dt.tz_convert('UTC') +""" +tz_localize with ambiguous=False has the same behavior of pytz.localize --- End diff -- Oh, definitely not a doctest. Let me change it to comments.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r166826644 --- Diff: python/pyspark/sql/types.py --- @@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone): # TODO: handle nested timestamps, such as ArrayType(TimestampType())? if is_datetime64_dtype(s.dtype): tz = timezone or 'tzlocal()' -return s.dt.tz_localize(tz).dt.tz_convert('UTC') +""" +tz_localize with ambiguous=False has the same behavior of pytz.localize +>>> import datetime +>>> import pandas as pd +>>> import pytz +>>> +>>> t = datetime.datetime(2015, 11, 1, 1, 23, 24) +>>> ts = pd.Series([t]) +>>> tz = pytz.timezone('America/New_York') +>>> +>>> ts.dt.tz_localize(tz, ambiguous=False) +>>> 0 2015-11-01 01:23:24-05:00 +>>> dtype: datetime64[ns, America/New_York] +>>> +>>> ts.dt.tz_localize(tz, ambiguous=True) +>>> 0 2015-11-01 01:23:24-04:00 +>>> dtype: datetime64[ns, America/New_York] +>>> +>>> str(tz.localize(t)) +>>> '2015-11-01 01:23:24-05:00' --- End diff -- Hm .. this one seems a bit weird. Shouldn't it be `... '2015-11-01 01:23:24-05:00'`?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r166826468 --- Diff: python/pyspark/sql/tests.py --- @@ -3638,6 +3638,21 @@ def test_createDataFrame_with_int_col_names(self): self.assertEqual(pdf_col_names, df.columns) self.assertEqual(pdf_col_names, df_arrow.columns) +def test_timestamp_dst(self): +""" +SPARK-23314: Test daylight saving time +""" --- End diff -- Shall we just leave this as a comment (just to follow the majority)?
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r166811270 --- Diff: python/pyspark/sql/types.py --- @@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone): # TODO: handle nested timestamps, such as ArrayType(TimestampType())? if is_datetime64_dtype(s.dtype): tz = timezone or 'tzlocal()' -return s.dt.tz_localize(tz).dt.tz_convert('UTC') +""" +tz_localize with ambiguous=False has the same behavior of pytz.localize +>>> import datetime +>>> import pandas as pd +>>> import pytz +>>> +>>> t = datetime.datetime(2015, 11, 1, 1, 23, 24) +>>> ts = pd.Series([t]) +>>> tz = pytz.timezone('America/New_York') +>>> +>>> ts.dt.tz_localize(tz, ambiguous=False) +>>> 0 2015-11-01 01:23:24-05:00 +>>> dtype: datetime64[ns, America/New_York] +>>> +>>> ts.dt.tz_localize(tz, ambiguous=True) +>>> 0 2015-11-01 01:23:24-04:00 +>>> dtype: datetime64[ns, America/New_York] +>>> +>>> str(tz.localize(t)) +>>> '2015-11-01 01:23:24-05:00' +""" +return s.dt.tz_localize(tz, ambiguous=False).dt.tz_convert('UTC') --- End diff -- I think for a `pd.Series`, `ambiguous` can also take an ndarray. Can you also add a `pandas_udf` test case?
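As a side note on the ndarray point above: besides a scalar bool, pandas documents `ambiguous` as also accepting a bool ndarray (one DST flag per element), `'infer'`, or `'NaT'`. The sketch below (timezone and timestamps chosen here for illustration, not taken from the PR) contrasts the scalar and per-element forms:

```python
import datetime

import numpy as np
import pandas as pd
import pytz

tz = pytz.timezone('America/New_York')

# Both timestamps fall in the repeated 1:00-2:00 hour on 2015-11-01,
# when America/New_York falls back out of daylight saving time.
ts = pd.Series([datetime.datetime(2015, 11, 1, 1, 30),
                datetime.datetime(2015, 11, 1, 1, 45)])

# A scalar bool applies one DST choice to every element:
# False -> standard time (EST, UTC-05:00) for both rows.
all_std = ts.dt.tz_localize(tz, ambiguous=False)

# A bool ndarray chooses per element:
# True -> daylight time (EDT, UTC-04:00), False -> standard time (EST).
mixed = ts.dt.tz_localize(tz, ambiguous=np.array([True, False]))
print(mixed[0].utcoffset(), mixed[1].utcoffset())
```

Spark's conversion path only needs a single policy for the whole series, which is why a scalar `ambiguous=False` suffices there.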
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r166810415 --- Diff: python/pyspark/sql/types.py --- @@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone): # TODO: handle nested timestamps, such as ArrayType(TimestampType())? if is_datetime64_dtype(s.dtype): tz = timezone or 'tzlocal()' -return s.dt.tz_localize(tz).dt.tz_convert('UTC') +""" +tz_localize with ambiguous=False has the same behavior of pytz.localize --- End diff -- I'm not sure we want this doctest
GitHub user icexelloss opened a pull request: https://github.com/apache/spark/pull/20537 [SPARK-23314][PYTHON] Add ambiguous=False when localizing tz-naive timestamps to deal with dst ## What changes were proposed in this pull request? When tz_localize-ing a tz-naive timestamp, pandas will throw an exception if the timestamp falls in the ambiguous period at the end of daylight saving time, e.g., 2015-11-01 01:30:00. This PR fixes the issue by setting `ambiguous=False` when calling tz_localize, which matches the default behavior of pytz.localize (is_dst=False). ## How was this patch tested? Added `test_timestamp_dst`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/icexelloss/spark SPARK-23314 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20537.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20537 commit 6435feffdc056a8744848e367a585d32e8734b5f Author: Li Jin Date: 2018-02-07T22:38:19Z Add ambiguous=False when localizing tz-naive timestamps to deal with dst
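The failure mode the PR description refers to can be reproduced in plain pandas/pytz, outside Spark. A minimal sketch (using an example timestamp in the repeated hour at the 2015 America/New_York fall-back transition; not the PR's actual test code): with the default `ambiguous='raise'`, pandas raises rather than guess which of the two wall-clock occurrences is meant, while `ambiguous=False` picks standard time, matching pytz's `localize` default of `is_dst=False`.

```python
import datetime

import pandas as pd
import pytz

tz = pytz.timezone('America/New_York')
# 2015-11-01 01:30 occurs twice in America/New_York: once in EDT, once in EST.
t = datetime.datetime(2015, 11, 1, 1, 30)
ts = pd.Series([t])

# Default is ambiguous='raise': pandas refuses to guess which occurrence is meant.
try:
    ts.dt.tz_localize(tz)
except pytz.exceptions.AmbiguousTimeError as e:
    print("raised:", e)

# ambiguous=False resolves to standard time (EST, UTC-05:00), which is what
# pytz.localize does by default (is_dst=False).
localized = ts.dt.tz_localize(tz, ambiguous=False)
print(localized[0])         # 2015-11-01 01:30:00-05:00
print(str(tz.localize(t)))  # 2015-11-01 01:30:00-05:00
```

This is why the fix is a one-argument change: after localizing with `ambiguous=False`, the subsequent `tz_convert('UTC')` proceeds without hitting the ambiguity.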