HyukjinKwon commented on code in PR #36683: URL: https://github.com/apache/spark/pull/36683#discussion_r894061258
##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -596,7 +596,7 @@ def _create_from_pandas_with_arrow(
         ]

         # Slice the DataFrame to be batched
-        step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
+        step = self._jconf.arrowMaxRecordsPerBatch()

Review Comment:
   Yeah, that's true, but I wonder whether the default number of partitions is something we should consider, given that it wasn't configurable before and `SparkSession.createDataFrame` does not expose the number of partitions either. If users really need it, they might want to create an RDD with an explicit parallelism, although we don't support this now (see also https://github.com/apache/spark/pull/29719).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
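
For readers following the hunk above, here is a rough sketch of how the two `step` choices change the slicing. It is not the code in `conversion.py`: it uses plain Python lists instead of a pandas DataFrame, and `step_by_parallelism`, `slice_rows`, and the `max(1, step)` guard are hypothetical names and additions made only for illustration. The value 10000 is assumed to be the default of `spark.sql.execution.arrow.maxRecordsPerBatch`.

```python
def step_by_parallelism(n_rows, parallelism):
    """Ceiling division, as in the removed line: -(-n // p) rounds up."""
    return -(-n_rows // parallelism)


def slice_rows(rows, step):
    """Cut rows into consecutive slices of at most `step` elements.

    The max(1, step) guard is an addition for this sketch so that an
    empty input (step == 0) does not make range() raise.
    """
    step = max(1, step)
    return [rows[start:start + step] for start in range(0, len(rows), step)]


rows = list(range(10))

# Old behaviour: the step is derived from the default parallelism (4 here),
# so the number of slices tracks the parallelism.
old_slices = slice_rows(rows, step_by_parallelism(len(rows), 4))
# 4 slices of sizes 3, 3, 3, 1

# New behaviour: the step is a fixed maximum batch size (10000 assumed),
# so a small input ends up in a single slice regardless of parallelism.
new_slices = slice_rows(rows, 10000)
# 1 slice of size 10
```

Under these assumptions, the slice count with the new `step` depends only on the row count and the batch-size setting, which is why the default number of partitions comes up in the comment above.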