[ https://issues.apache.org/jira/browse/SPARK-32479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weichen Xu resolved SPARK-32479.
--------------------------------
    Resolution: Not A Bug

Fix the slicing logic in createDataFrame when converting pandas dataframe to arrow table
-----------------------------------------------------------------------------------------

                Key: SPARK-32479
                URL: https://issues.apache.org/jira/browse/SPARK-32479
            Project: Spark
         Issue Type: Story
         Components: PySpark
   Affects Versions: 3.1.0
           Reporter: Liang Zhang
           Assignee: Liang Zhang
           Priority: Major

h1. Problem:
In https://github.com/databricks/runtime/blob/84a952313ae73e3df32f065eb00cc0bcb024af14/python/pyspark/sql/pandas/conversion.py#L418, the slicing logic may produce fewer partitions than specified.

h1. Example:
Assume:
{noformat}
length = 100 -> [0, 1, ..., 99]
num_slices = 99 = self.sparkContext.defaultParallelism{noformat}
Old method:
step = math.ceil(length / num_slices) = 2
start = i * step, end = (i + 1) * step:
output: [0,1] [2,3] [4,5] ... [98,99] -> 50 slices != num_slices

h1. Solution:
We can use similar logic to that in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L125
{code:python}
# replace conversion.py#L418
pdf_slices = (pdf.iloc[i * length // num_slices: (i + 1) * length // num_slices]
              for i in range(0, num_slices))
{code}
New method:
start = i * length // num_slices, end = (i + 1) * length // num_slices:
output: [0] [1] [2] ... [98,99] -> 99 slices
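Below is a minimal standalone sketch (plain pandas, outside Spark) that compares the two slicing schemes described above for length = 100 and num_slices = 99; the variable names are illustrative and not the ones used in conversion.py.
{code:python}
import math
import pandas as pd

length = 100
num_slices = 99
pdf = pd.DataFrame({"x": range(length)})

# Old scheme: fixed step of math.ceil(length / num_slices), starts advancing by step.
step = math.ceil(length / num_slices)  # step = 2
old_slices = [pdf.iloc[start:start + step] for start in range(0, length, step)]
print(len(old_slices))  # 50 -- fewer slices than num_slices

# Proposed scheme: the same index arithmetic as ParallelCollectionRDD.slice.
new_slices = [
    pdf.iloc[i * length // num_slices:(i + 1) * length // num_slices]
    for i in range(num_slices)
]
print(len(new_slices))  # 99 -- one slice per requested partition
{code}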