[GitHub] [spark] itholic opened a new pull request, #40507: [SPARK-42662][CONNECT][PS] Add `_distributed_sequence_id` for distributed-sequence index.

via GitHub Tue, 21 Mar 2023 02:03:45 -0700


itholic opened a new pull request, #40507:
URL: https://github.com/apache/spark/pull/40507


   ### What changes were proposed in this pull request?
   
   This PR proposes adding the `_distributed_sequence_id` function to support pandas API on Spark in Spark Connect. `_distributed_sequence_id` creates the distributed-sequence column, which is used to generate the [default index type](https://spark.apache.org/docs/3.2.0/api/python/user_guide/pandas_on_spark/options.html#default-index-type) for pandas API on Spark.
   
   ```python
   >>> import pyspark.sql.connect.functions as CF
   >>> data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
   >>> sdf = spark.createDataFrame(data, ["name", "age"])
   >>> sdf.show()
   +-------+---+
   |   name|age|
   +-------+---+
   |  Alice|  1|
   |    Bob|  2|
   |Charlie|  3|
   +-------+---+
   
   >>> sdf.select(CF._distributed_sequence_id().alias("sequence-index"), "*").show()
   +--------------+-------+---+
   |sequence-index|   name|age|
   +--------------+-------+---+
   |             0|  Alice|  1|
   |             1|    Bob|  2|
   |             2|Charlie|  3|
   +--------------+-------+---+
   ```
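
For intuition only, here is a minimal pure-Python sketch of one common way to build a gap-free sequence over partitioned data: count the rows in each partition, turn the counts into starting offsets, then number each partition's rows from its offset. It is not the code added by this PR; the helper name `distributed_sequence_ids` and the two-partition split of the example rows are assumptions made for illustration.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def distributed_sequence_ids(
    partitions: Sequence[Sequence[object]],
) -> List[List[Tuple[int, object]]]:
    """Attach a global, gap-free, 0-based id to every row (hypothetical helper)."""
    # Step 1: count the rows in each partition.
    counts = [len(part) for part in partitions]
    # Step 2: an exclusive prefix sum gives each partition its starting offset.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: each partition numbers its own rows, starting from its offset.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


# Assumed partitioning of the three example rows into two partitions.
parts = [[("Alice", 1)], [("Bob", 2), ("Charlie", 3)]]
indexed = distributed_sequence_ids(parts)
```

Only the per-partition counts need to be gathered in one place; the numbering itself stays local to each partition, which is what lets the ids increase one by one without moving every row to a single node.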
   
   ### Why are the changes needed?
   
   Spark Connect cannot reuse the existing logic for pandas API on Spark, because that logic uses Py4J to call functions in the JVM, and the Spark Connect client has no direct access to the JVM.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, this is an internal function.
   
   ### How was this patch tested?
   
   The patch was tested by adding unit tests and manually verifying the results.


