itholic opened a new pull request, #40507: URL: https://github.com/apache/spark/pull/40507
### What changes were proposed in this pull request?

This PR proposes adding the `_distributed_sequence_id` function to support pandas API on Spark in Spark Connect.

`_distributed_sequence_id` creates the distributed-sequence column, which is used to generate the [default index type](https://spark.apache.org/docs/3.2.0/api/python/user_guide/pandas_on_spark/options.html#default-index-type) for pandas API on Spark.

```python
>>> import pyspark.sql.connect.functions as CF
>>> data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
>>> sdf = spark.createDataFrame(data, ["name", "age"])
>>> sdf.show()
+-------+---+
|   name|age|
+-------+---+
|  Alice|  1|
|    Bob|  2|
|Charlie|  3|
+-------+---+
>>> sdf.select(CF._distributed_sequence_id().alias("sequence-index"), "*").show()
+--------------+-------+---+
|sequence-index|   name|age|
+--------------+-------+---+
|             0|  Alice|  1|
|             1|    Bob|  2|
|             2|Charlie|  3|
+--------------+-------+---+
```

### Why are the changes needed?

Spark Connect cannot reuse the existing logic for pandas API on Spark, because that logic relies on Py4J to call functions in the JVM.

### Does this PR introduce _any_ user-facing change?

No, this is an internal function.

### How was this patch tested?

The patch was tested by adding unit tests and manually verifying the results.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
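
For readers unfamiliar with the idea of a distributed-sequence column, the following is a conceptual sketch only, written in plain Python with no Spark dependency. It is not the implementation in this PR or in Spark. It assumes one common approach: count the rows in each partition, turn those counts into starting offsets, and then number the rows within each partition from that offset. The name `assign_sequence_ids` and the list-of-lists stand-in for partitions are hypothetical and used purely for illustration.

```python
from itertools import accumulate
from typing import Any, List, Tuple

Row = Tuple[Any, ...]


def assign_sequence_ids(partitions: List[List[Row]]) -> List[List[Tuple[int, Row]]]:
    """Attach a globally increasing, zero-based id to every row.

    Hypothetical helper: each inner list stands in for one partition.
    """
    # Pass 1: count the rows in each partition.
    counts = [len(part) for part in partitions]
    # Starting offset of a partition = total rows in all earlier partitions.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: number rows locally and shift by the partition's offset.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


# The three rows from the example in the PR description,
# split across two pretend partitions.
partitions = [[("Alice", 1), ("Bob", 2)], [("Charlie", 3)]]
indexed = assign_sequence_ids(partitions)
```

Under these assumptions, the ids run 0, 1, 2 across the partitions in order, which matches the `sequence-index` column shown in the PR description.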