Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18664#discussion_r131487406
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -3036,6 +3052,9 @@ def test_toPandas_arrow_toggle(self):
             pdf = df.toPandas()
             self.spark.conf.set("spark.sql.execution.arrow.enable", "true")
             pdf_arrow = df.toPandas()
    +        # need to remove timezone for comparison
    +        pdf_arrow["7_timestamp_t"] = \
    +            pdf_arrow["7_timestamp_t"].apply(lambda ts: ts.tz_localize(None))
    --- End diff --
    
    Let me explain it a little bit more. We cannot break backward compatibility: when users upgrade to a new version of Spark, we should not change the way we read or write their data.
    
    In this specific case, `toPandas()` should respect the timezone, as @icexelloss said. I believe we have reached consensus on that in this PR. The issue now is how to fix it without breaking existing users/applications that rely on the current Spark behavior. The solution is to introduce a new external configuration for this. Unfortunately, it has to be off by default; maybe in Spark 3.0 we can turn it on by default?
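
    To make the proposal concrete, here is a rough usage sketch from the user side. The config name `spark.sql.execution.pandas.respectSessionTimeZone` below is just a placeholder I made up for whatever name we end up picking; the default shown matches the plan of keeping the old behavior unless users opt in:

    ```python
    from datetime import datetime
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

    df = spark.createDataFrame([(datetime(2017, 7, 31, 12, 0, 0),)], ["7_timestamp_t"])

    # Default off: existing users keep the behavior they rely on today.
    spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "false")
    pdf_legacy = df.toPandas()

    # Opt in: toPandas() adjusts timestamps to the session timezone.
    spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "true")
    pdf_tz = df.toPandas()
    ```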
    
    Second, with or without Arrow enabled, Spark should have exactly the same external behavior (except for performance). This is the rule we have to follow. Do you agree?
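
    As a quick illustration of that rule (reusing `spark` and `df` from the sketch above), once the timezone handling is aligned between the two paths, this comparison should pass without the `tz_localize(None)` workaround the test needs today:

    ```python
    from pandas.testing import assert_frame_equal

    spark.conf.set("spark.sql.execution.arrow.enable", "false")
    pdf = df.toPandas()

    spark.conf.set("spark.sql.execution.arrow.enable", "true")
    pdf_arrow = df.toPandas()

    # With and without Arrow, the resulting pandas DataFrames must be identical.
    assert_frame_equal(pdf, pdf_arrow)
    ```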


