[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...

2018-02-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20537


---




[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...

2018-02-09 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/20537#discussion_r167266079
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1744,8 +1744,27 @@ def _check_series_convert_timestamps_internal(s, timezone):
 from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
 # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
 if is_datetime64_dtype(s.dtype):
+# tz_localize with ambiguous=False has the same behavior of pytz.localize
+# >>> import datetime
+# >>> import pandas as pd
+# >>> import pytz
+# >>>
+# >>> t = datetime.datetime(2015, 11, 1, 1, 23, 24)
+# >>> ts = pd.Series([t])
+# >>> tz = pytz.timezone('America/New_York')
+# >>>
+# >>> ts.dt.tz_localize(tz, ambiguous=False)
+# 0   2015-11-01 01:23:24-05:00
+# dtype: datetime64[ns, America/New_York]
+# >>>
+# >>> ts.dt.tz_localize(tz, ambiguous=True)
+# 0   2015-11-01 01:23:24-04:00
+# dtype: datetime64[ns, America/New_York]
+# >>>
+# >>> str(tz.localize(t))
+# '2015-11-01 01:23:24-05:00'
--- End diff --

I added a comment to explain this.


---




[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...

2018-02-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20537#discussion_r167133597
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1744,8 +1744,27 @@ def _check_series_convert_timestamps_internal(s, timezone):
 from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
 # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
 if is_datetime64_dtype(s.dtype):
+# tz_localize with ambiguous=False has the same behavior of pytz.localize
+# >>> import datetime
+# >>> import pandas as pd
+# >>> import pytz
+# >>>
+# >>> t = datetime.datetime(2015, 11, 1, 1, 23, 24)
+# >>> ts = pd.Series([t])
+# >>> tz = pytz.timezone('America/New_York')
+# >>>
+# >>> ts.dt.tz_localize(tz, ambiguous=False)
+# 0   2015-11-01 01:23:24-05:00
+# dtype: datetime64[ns, America/New_York]
+# >>>
+# >>> ts.dt.tz_localize(tz, ambiguous=True)
+# 0   2015-11-01 01:23:24-04:00
+# dtype: datetime64[ns, America/New_York]
+# >>>
+# >>> str(tz.localize(t))
+# '2015-11-01 01:23:24-05:00'
--- End diff --

@icexelloss, I get that it's good to know, but shall we describe it as prose? This comment is written in the format of a doctest, but it is actually just a comment.

It would be nicer to have a plain explanation in the comments, not in doctest format.


---




[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...

2018-02-08 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/20537#discussion_r166975718
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone):
 # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
 if is_datetime64_dtype(s.dtype):
 tz = timezone or 'tzlocal()'
-return s.dt.tz_localize(tz).dt.tz_convert('UTC')
+"""
+tz_localize with ambiguous=False has the same behavior of pytz.localize
+>>> import datetime
+>>> import pandas as pd
+>>> import pytz
+>>>
+>>> t = datetime.datetime(2015, 11, 1, 1, 23, 24)
+>>> ts = pd.Series([t])
+>>> tz = pytz.timezone('America/New_York')
+>>>
+>>> ts.dt.tz_localize(tz, ambiguous=False)
+>>> 0   2015-11-01 01:23:24-05:00
+>>> dtype: datetime64[ns, America/New_York]
+>>>
+>>> ts.dt.tz_localize(tz, ambiguous=True)
+>>> 0   2015-11-01 01:23:24-04:00
+>>> dtype: datetime64[ns, America/New_York]
+>>>
+>>> str(tz.localize(t))
+>>> '2015-11-01 01:23:24-05:00'
+"""
+return s.dt.tz_localize(tz, ambiguous=False).dt.tz_convert('UTC')
--- End diff --

Yes, I will create a new test case for `pandas_udf`.

It seems `ambiguous=False` is undocumented in the method doc. @jreback, can you please confirm this usage is correct?
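
A hypothetical sketch of such a `pandas_udf` test (an illustration only; it assumes the existing harness in python/pyspark/sql/tests.py, which provides `self.spark`):

    import datetime

    from pyspark.sql.functions import pandas_udf

    def test_timestamp_dst(self):
        # SPARK-23314: 2015-11-01 01:30:00 is ambiguous because the wall
        # clock repeats it during the DST fall-back transition.
        dt = [datetime.datetime(2015, 11, 1, 0, 30),
              datetime.datetime(2015, 11, 1, 1, 30),
              datetime.datetime(2015, 11, 1, 2, 30)]
        df = self.spark.createDataFrame(dt, 'timestamp').toDF('time')

        # An identity pandas_udf round-trips the timestamps through
        # Arrow/pandas, exercising the tz_localize path fixed here.
        identity = pandas_udf(lambda x: x, 'timestamp')
        result = df.withColumn('time', identity(df.time))
        self.assertEqual(df.collect(), result.collect())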


---




[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...

2018-02-08 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/20537#discussion_r166974650
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone):
 # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
 if is_datetime64_dtype(s.dtype):
 tz = timezone or 'tzlocal()'
-return s.dt.tz_localize(tz).dt.tz_convert('UTC')
+"""
+tz_localize with ambiguous=False has the same behavior of pytz.localize
+>>> import datetime
+>>> import pandas as pd
+>>> import pytz
+>>>
+>>> t = datetime.datetime(2015, 11, 1, 1, 23, 24)
+>>> ts = pd.Series([t])
+>>> tz = pytz.timezone('America/New_York')
+>>>
+>>> ts.dt.tz_localize(tz, ambiguous=False)
+>>> 0   2015-11-01 01:23:24-05:00
+>>> dtype: datetime64[ns, America/New_York]
+>>>
+>>> ts.dt.tz_localize(tz, ambiguous=True)
+>>> 0   2015-11-01 01:23:24-04:00
+>>> dtype: datetime64[ns, America/New_York]
+>>>
+>>> str(tz.localize(t))
+>>> '2015-11-01 01:23:24-05:00'
--- End diff --

Yeah, let me clean up the format...


---




[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...

2018-02-08 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/20537#discussion_r166974367
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone):
 # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
 if is_datetime64_dtype(s.dtype):
 tz = timezone or 'tzlocal()'
-return s.dt.tz_localize(tz).dt.tz_convert('UTC')
+"""
+tz_localize with ambiguous=False has the same behavior of pytz.localize
--- End diff --

Oh, definitely not a doctest. Let me change it to comments.


---




[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...

2018-02-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20537#discussion_r166826644
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone):
 # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
 if is_datetime64_dtype(s.dtype):
 tz = timezone or 'tzlocal()'
-return s.dt.tz_localize(tz).dt.tz_convert('UTC')
+"""
+tz_localize with ambiguous=False has the same behavior of pytz.localize
+>>> import datetime
+>>> import pandas as pd
+>>> import pytz
+>>>
+>>> t = datetime.datetime(2015, 11, 1, 1, 23, 24)
+>>> ts = pd.Series([t])
+>>> tz = pytz.timezone('America/New_York')
+>>>
+>>> ts.dt.tz_localize(tz, ambiguous=False)
+>>> 0   2015-11-01 01:23:24-05:00
+>>> dtype: datetime64[ns, America/New_York]
+>>>
+>>> ts.dt.tz_localize(tz, ambiguous=True)
+>>> 0   2015-11-01 01:23:24-04:00
+>>> dtype: datetime64[ns, America/New_York]
+>>>
+>>> str(tz.localize(t))
+>>> '2015-11-01 01:23:24-05:00'
--- End diff --

Hm .. this one seems a bit weird. Shouldn't it be `... '2015-11-01 01:23:24-05:00'`?
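
For reference, in standard doctest syntax the expected output appears on its own line with no prefix at all:

    >>> str(tz.localize(t))
    '2015-11-01 01:23:24-05:00'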


---




[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...

2018-02-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20537#discussion_r166826468
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3638,6 +3638,21 @@ def test_createDataFrame_with_int_col_names(self):
 self.assertEqual(pdf_col_names, df.columns)
 self.assertEqual(pdf_col_names, df_arrow.columns)
 
+def test_timestamp_dst(self):
+"""
+SPARK-23314: Test daylight saving time
+"""
--- End diff --

Shall we just leave this as a comment (just to follow the majority)?


---




[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...

2018-02-07 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20537#discussion_r166811270
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone):
 # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
 if is_datetime64_dtype(s.dtype):
 tz = timezone or 'tzlocal()'
-return s.dt.tz_localize(tz).dt.tz_convert('UTC')
+"""
+tz_localize with ambiguous=False has the same behavior of pytz.localize
+>>> import datetime
+>>> import pandas as pd
+>>> import pytz
+>>>
+>>> t = datetime.datetime(2015, 11, 1, 1, 23, 24)
+>>> ts = pd.Series([t])
+>>> tz = pytz.timezone('America/New_York')
+>>>
+>>> ts.dt.tz_localize(tz, ambiguous=False)
+>>> 0   2015-11-01 01:23:24-05:00
+>>> dtype: datetime64[ns, America/New_York]
+>>>
+>>> ts.dt.tz_localize(tz, ambiguous=True)
+>>> 0   2015-11-01 01:23:24-04:00
+>>> dtype: datetime64[ns, America/New_York]
+>>>
+>>> str(tz.localize(t))
+>>> '2015-11-01 01:23:24-05:00'
+"""
+return s.dt.tz_localize(tz, ambiguous=False).dt.tz_convert('UTC')
--- End diff --

I think for a `pd.Series`, `ambiguous` takes an ndarray. Can you also add a `pandas_udf` test case?
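
A minimal sketch of the ndarray form (an illustration, not code from the PR):

    import datetime

    import numpy as np
    import pandas as pd
    import pytz

    # 2015-11-01 01:30:00 is an ambiguous wall time in America/New_York.
    t = datetime.datetime(2015, 11, 1, 1, 30)
    ts = pd.Series([t, t])
    tz = pytz.timezone('America/New_York')

    # One boolean per element: True resolves to DST (-04:00),
    # False to standard time (-05:00).
    print(ts.dt.tz_localize(tz, ambiguous=np.array([True, False])))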


---




[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...

2018-02-07 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20537#discussion_r166810415
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone):
 # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
 if is_datetime64_dtype(s.dtype):
 tz = timezone or 'tzlocal()'
-return s.dt.tz_localize(tz).dt.tz_convert('UTC')
+"""
+tz_localize with ambiguous=False has the same behavior of pytz.localize
--- End diff --

I'm not sure we want this doctest.


---




[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...

2018-02-07 Thread icexelloss
GitHub user icexelloss opened a pull request:

https://github.com/apache/spark/pull/20537

[SPARK-23314][PYTHON] Add ambiguous=False when localizing tz-naive timestamps to deal with dst

## What changes were proposed in this pull request?
When tz_localize is called on a tz-naive timestamp, pandas throws an exception if the timestamp falls in the ambiguous period around a daylight saving time transition, e.g., 2015-11-01 01:30:00. This PR fixes the issue by setting `ambiguous=False` when calling tz_localize, which matches the default behavior of pytz.
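
A minimal, self-contained sketch of the failure and the fix (an illustration, not code from the PR):

    import datetime

    import pandas as pd
    import pytz

    # 2015-11-01 01:30:00 occurs twice in America/New_York (DST fall-back).
    ts = pd.Series([datetime.datetime(2015, 11, 1, 1, 30)])
    tz = pytz.timezone('America/New_York')

    try:
        ts.dt.tz_localize(tz)  # default ambiguous='raise'
    except Exception as e:
        print(type(e).__name__)  # AmbiguousTimeError

    # ambiguous=False picks the non-DST offset (-05:00), matching
    # pytz's default is_dst=False in tz.localize().
    print(ts.dt.tz_localize(tz, ambiguous=False))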

## How was this patch tested?
Add `test_timestamp_dst`


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/icexelloss/spark SPARK-23314

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20537.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20537


commit 6435feffdc056a8744848e367a585d32e8734b5f
Author: Li Jin 
Date:   2018-02-07T22:38:19Z

Add ambiguous=False when localizing tz-naive timestamps to deal with dst




---
