Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22275#discussion_r219556033

    --- Diff: python/pyspark/serializers.py ---
    @@ -208,8 +214,26 @@ def load_stream(self, stream):
                 for batch in reader:
                     yield batch
     
    +        if self.load_batch_order:
    +            num = read_int(stream)
    +            self.batch_order = []
    +            for i in xrange(num):
    +                index = read_int(stream)
    +                self.batch_order.append(index)
    +
    +    def get_batch_order_and_reset(self):
    --- End diff --
    
    Looking at `_load_from_socket` I think I understand why this was done as a separate function here, but what if the serializer itself returned a tuple, or reordered the batches itself? I'm just trying to get a better understanding, not saying those are better designs.
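    For concreteness, a rough sketch of the second option I mean, using the module's `read_int` as in the diff above (purely illustrative; the `_load_batches` helper and the exact meaning of the order indices are my assumptions, not something from this PR):

    ```python
    def load_stream(self, stream):
        # Buffer every batch first, since the ordering indices are written
        # to the stream only after all of the batch data.
        batches = list(self._load_batches(stream))  # hypothetical helper wrapping the Arrow reader
        if not self.load_batch_order:
            for batch in batches:
                yield batch
            return
        # Read the ordering that the JVM side appended after the batches.
        num = read_int(stream)
        batch_order = [read_int(stream) for _ in xrange(num)]
        # Yield batches in their final order; I'm assuming here that
        # batch_order[i] is the stream position of the i-th result batch.
        for index in batch_order:
            yield batches[index]
    ```

    The obvious tradeoff is that this shape has to buffer all the batches before yielding anything, whereas keeping the reorder in `_load_from_socket` lets `load_stream` stay streaming, so I can see why it ended up as a separate function.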