Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18664#discussion_r146922937
  
    --- Diff: python/pyspark/serializers.py ---
    @@ -224,7 +225,13 @@ def _create_batch(series):
         # If a nullable integer series has been promoted to floating point with NaNs, need to cast
         # NOTE: this is not necessary with Arrow >= 0.7
         def cast_series(s, t):
    -        if t is None or s.dtype == t.to_pandas_dtype():
    +        if type(t) == pa.TimestampType:
    +            # NOTE: convert to 'us' with astype here, unit ignored in `from_pandas` see ARROW-1680
    +            return _series_convert_timestamps_internal(s).values.astype('datetime64[us]')
    --- End diff ---
    
    hmmm, that's strange: `s.dt.tz_localize('tzlocal()')` raises an `OverflowError: Python int too large to convert to C long` when printed, while `s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')` works but produces a bogus time where the NaT was.  I agree that `fillna(0)` is safer for avoiding the overflow.
    
    ```
    In [44]: s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC')
    Out[44]:
    0      2017-10-24 17:44:51.483694+00:00
    1   1677-09-21 08:12:43.145224192+00:00
    dtype: datetime64[ns, UTC]
    ```
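
    For reference, here's a minimal standalone sketch of the `fillna(0)` idea (not the actual `serializers.py` code; `pd.Timestamp(0)`, the explicit mask, and using `dateutil`'s `tzlocal()` directly are just one way to write it):

    ```
    import pandas as pd
    from dateutil import tz

    # NaT is stored internally as the minimum int64 value, which is why
    # localizing it can overflow or, as above, come out as a bogus
    # 1677-09-21 timestamp after conversion.
    s = pd.Series([pd.Timestamp('2017-10-24 17:44:51.483694'), pd.NaT])

    # Swap NaT for the epoch before localizing so the underlying int64
    # stays in range, then restore NaT at the masked positions.
    mask = s.isnull()
    converted = s.fillna(pd.Timestamp(0)) \
        .dt.tz_localize(tz.tzlocal()) \
        .dt.tz_convert('UTC')
    converted = converted.where(~mask)
    ```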

