Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21546#discussion_r204960838

    --- Diff: python/pyspark/serializers.py ---
    @@ -184,27 +184,67 @@ def loads(self, obj):
             raise NotImplementedError


    -class ArrowSerializer(FramedSerializer):
    +class BatchOrderSerializer(Serializer):
    --- End diff --

    Ah, okay. I think I understand the benefit. But my impression is that this is something we were already doing. Also, if this is something we could apply to other functionality too, it sounds to me like orthogonal work that should be done separately.

    Another concern is how likely we are to hit this OOM in practice, since I usually expect the data for createDataFrame from a Pandas DataFrame, or for toPandas, to be small. If the changes were small, this would have been okay with me, but they are fairly large and seem to affect a lot of code on both the Scala and Python sides.