[ https://issues.apache.org/jira/browse/SPARK-12774 ]
holdenk closed SPARK-12774.
---------------------------
    Resolution: Won't Fix

In some ways, yes, avoiding unnecessary iteration can be good, but allowing Spark to spill is also important. That being said, map and mapPartitions have been temporarily removed from DataFrame while the Dataset API is sorted out in Python, so I don't think this is likely to get in (although it may be worth being involved in the new DataFrame API discussions if you are interested).

> DataFrame.mapPartitions apply function operates on Pandas DataFrame instead
> of a generator of rows
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12774
>                 URL: https://issues.apache.org/jira/browse/SPARK-12774
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Josh
>              Labels: dataframe, mapPartitions, pandas
>
> Currently DataFrame.mapPartitions is analogous to DataFrame.rdd.mapPartitions
> in both Spark and PySpark. The function _f_ that is applied to each partition
> must operate on a generator of rows. This is, however, very inefficient in Python.
> It would be more logical and efficient if the apply function _f_ operated on
> Pandas DataFrames instead and also returned a DataFrame. This avoids
> unnecessary iteration in Python, which is slow.
> Currently:
> {code}
> import pandas as pd
>
> def apply_function(rows):
>     # Materialise the partition's rows into a pandas DataFrame
>     df = pd.DataFrame(list(rows))
>     df = df % 100  # Do something on df
>     return df.values.tolist()
>
> table = sqlContext.read.parquet("")
> table = table.mapPartitions(apply_function)
> {code}
> The new apply function would accept a Pandas DataFrame and return a DataFrame:
> {code}
> def apply_function(df):
>     df = df % 100  # Do something on df
>     return df
> {code}
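As a minimal sketch of how the proposed behaviour can be approximated with the existing RDD-based mapPartitions, the wrapper below collects each partition's rows into a pandas DataFrame, applies a pandas-style function, and converts the result back to plain rows. The helper name pandas_map_partitions and the usage lines are illustrative assumptions, not part of the Spark API.

{code}
import pandas as pd

def pandas_map_partitions(df, func):
    """Illustrative helper (not a Spark API): apply a pandas-based
    function to each partition of a Spark DataFrame.

    `func` receives one partition as a pandas DataFrame and must return
    a pandas DataFrame; the result comes back as an RDD of plain rows.
    """
    columns = df.columns  # capture the schema's column names on the driver

    def apply_on_partition(rows):
        # Materialise the partition's rows (pyspark Rows are tuples)
        pdf = pd.DataFrame(list(rows), columns=columns)
        result = func(pdf)            # user code works on pandas, not a generator
        return result.values.tolist() # back to plain Python rows for Spark

    return df.rdd.mapPartitions(apply_on_partition)

# Usage sketch (path and transformation are placeholders):
# table = sqlContext.read.parquet("/path/to/data")
# result_rdd = pandas_map_partitions(table, lambda pdf: pdf % 100)
{code}

Note that this still pays the cost of building the pandas DataFrame from Python row objects on each partition; the efficiency gain the issue asks for would come from Spark handing partitions to Python in a columnar form directly.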