[GitHub] [spark] zhengruifeng commented on a diff in pull request #38468: [WIP][CONNECT][PYTHON] Arrow-based collect

GitBox Thu, 03 Nov 2022 03:46:59 -0700


zhengruifeng commented on code in PR #38468:
URL: https://github.com/apache/spark/pull/38468#discussion_r1012742215



##########
python/pyspark/sql/connect/client.py:
##########
@@ -251,6 +263,13 @@ def _execute_and_fetch(self, req: pb2.Request) -> typing.Optional[pandas.DataFra
 
         if len(result_dfs) > 0:
             df = pd.concat(result_dfs)
+
+            # pd.concat generates non-consecutive index like:
+            #   Int64Index([0, 1, 0, 1, 2, 0, 1, 0, 1, 2], dtype='int64')
+            # set it to RangeIndex to be consistent with pyspark
+            n = len(df)
+            df = df.set_index(pd.RangeIndex(start=0, stop=n, step=1))

Review Comment:
   This change is needed; otherwise some tests will fail.
   
   Those tests only generate a single JSON batch, so they worked with JSON.
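
   To illustrate the index issue described in the code comment, here is a minimal sketch. The `batches` list is a hypothetical stand-in for `result_dfs`; the real client builds those frames from server responses, and the sizes here are chosen only for illustration. It also shows `ignore_index=True` as a possible alternative, which is not what the diff uses.

```python
import pandas as pd

# Hypothetical per-batch results (stand-in for result_dfs in the diff).
batches = [
    pd.DataFrame({"id": [0, 1]}),
    pd.DataFrame({"id": [2, 3, 4]}),
]

# Each batch keeps its own 0-based labels, so the concatenated
# index repeats instead of running consecutively.
df = pd.concat(batches)
concat_index = df.index.tolist()

# Same approach as the diff: replace the index with a fresh RangeIndex.
df = df.set_index(pd.RangeIndex(start=0, stop=len(df), step=1))
fixed_index = df.index.tolist()

# Possible alternative (an assumption, not part of the PR):
# let pd.concat discard the per-batch labels directly.
alt = pd.concat(batches, ignore_index=True)
alt_index = alt.index.tolist()
```

   With a single batch the concatenated index is already consecutive, which would explain why single-batch tests did not surface the problem.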





