Jelmer Kuperus created SPARK-45675:
--------------------------------------

             Summary: Specify number of partitions when creating spark dataframe from pandas dataframe
                 Key: SPARK-45675
                 URL: https://issues.apache.org/jira/browse/SPARK-45675
             Project: Spark
          Issue Type: Improvement
          Components: Pandas API on Spark
    Affects Versions: 3.5.0
            Reporter: Jelmer Kuperus
When converting a large pandas DataFrame to a Spark DataFrame like so:

{code:java}
import pandas as pd

pdf = pd.DataFrame([{"board_id": "3074457346698037360_0", "file_name": "board-content", "value": "A" * 119251} for i in range(0, 20000)])
spark.createDataFrame(pdf).write.mode("overwrite").format("delta").saveAsTable("catalog.schema.table")
{code}

you can encounter the following error:

{noformat}
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 11:1 was 366405365 bytes, which exceeds max allowed: spark.rpc.message.maxSize (268435456 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
{noformat}

As far as I can tell, Spark first converts the pandas DataFrame into a Python list and then constructs an RDD out of that list. This means the parallelism is determined by the value of spark.sparkContext.defaultParallelism. If the pandas DataFrame is very large and the number of available cores is low, you end up with very large tasks that exceed the limit imposed on task size.

Methods like spark.sparkContext.parallelize allow you to pass in the number of partitions of the resulting dataset. I think having a similar capability when creating a DataFrame from a pandas DataFrame makes a lot of sense, as right now the only workaround I can think of is changing the value of spark.default.parallelism, but that is a system-wide setting.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
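As a sketch of the workaround idea (not part of the PySpark API — the helper names below are hypothetical), one can route the conversion through spark.sparkContext.parallelize, which does accept a partition count, instead of calling createDataFrame on the pandas DataFrame directly. The slice_bounds helper just illustrates how a row list splits into roughly equal slices:

```python
def slice_bounds(n_rows, num_partitions):
    """Return (start, end) index pairs splitting n_rows into
    num_partitions roughly equal contiguous slices, similar in spirit
    to how parallelize(data, numSlices) slices a Python list."""
    return [((i * n_rows) // num_partitions,
             ((i + 1) * n_rows) // num_partitions)
            for i in range(num_partitions)]


def pandas_to_spark(spark, pdf, num_partitions):
    """Hypothetical helper: convert a pandas DataFrame to a Spark
    DataFrame with an explicit partition count, bypassing the
    defaultParallelism-based split. Assumes a live SparkSession `spark`.
    Note: unlike the Arrow-based fast path in createDataFrame, this
    materializes the rows as Python tuples first."""
    rows = [tuple(r) for r in pdf.itertuples(index=False)]
    # parallelize accepts numSlices, so each slice becomes its own task,
    # keeping the serialized task size bounded.
    rdd = spark.sparkContext.parallelize(rows, num_partitions)
    # Pass the column names; Spark infers the types from the data.
    return spark.createDataFrame(rdd, schema=list(pdf.columns))
```

With 20000 large rows and, say, 64 partitions, each task then carries only ~313 rows instead of the handful of oversized tasks produced by defaultParallelism on a small cluster.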