Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664

> The baseline should be (as said above): Internal optimisation should not introduce any behaviour change, and we are discouraged to change the previous behaviour unless it has bugs in general.

I am not sure I totally agree with this. Take struct columns, for instance: in the non-Arrow version, `toPandas()` converts a struct into a `pyspark.sql.Row` object. I wouldn't call this a bug, because it's a design choice. However, it is maybe not the best design choice, because if the user pickles the `pandas.DataFrame` to a file and sends it to someone, the receiver won't be able to deserialize that `pandas.DataFrame` without having pyspark as a dependency.

Now I am **not** trying to argue that we should or should not turn struct columns into `pyspark.sql.Row`; my point is that some of the design choices in the non-Arrow version may not be ideal, so making behavior 100% identical between the non-Arrow and Arrow versions is maybe not a hard requirement.