Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21546#discussion_r204960838

    --- Diff: python/pyspark/serializers.py ---
    @@ -184,27 +184,67 @@ def loads(self, obj):
             raise NotImplementedError


    -class ArrowSerializer(FramedSerializer):
    +class BatchOrderSerializer(Serializer):
    --- End diff --

    Ah, okay. I think I understand the benefit. But my impression is that this is something we were already doing. Also, if this is something we could apply to other functionality too, it sounds to me like orthogonal work that should be done separately.

    Another concern is how likely we are to hit this OOM in practice, since I usually expect the data for createDataFrame from a Pandas DataFrame, or for toPandas, to be small. If the changes were small, this would have been okay with me, but they are fairly large and seem to affect a lot of code on both the Scala and Python sides.