[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alberto updated SPARK-26611:
----------------------------
    Description: 

The following snippet crashes with the error: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)

{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [("1", "2"), ("2", "2"), ("2", "3"), ("3", "4"), ("5", "6")],
    ("first", "second"))

@pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
def filter_pandas(df):
    return df[df['first'] == "9"]

df.groupby("second").apply(filter_pandas).count()
{code}

while this one does not:

{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 2), (2, 2), (2, 3), (3, 4), (5, 6)],
    ("first", "second"))

@pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
def filter_pandas(df):
    return df[df['first'] == 9]

df.groupby("second").apply(filter_pandas).count()
{code}

and neither does this one:

{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [("1", "2"), ("2", "2"), ("2", "3"), ("3", "4"), ("5", "6")],
    ("first", "second"))

@pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
def filter_pandas(df):
    df = df[df['first'] == "9"]
    if len(df) > 0:
        return df
    else:
        return pd.DataFrame({"first": [], "second": []})

df.groupby("second").apply(filter_pandas).count()
{code}

See the stack trace [here|https://gist.github.com/afumagallireply/02d4c1355bc64a9d2129cdd6d0e9d9f3].
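Since the third snippet (which builds a fresh empty pandas DataFrame instead of returning the empty filtered slice) does not crash, a hedged workaround sketch based on that observation follows. The helper {{empty_like}} is purely illustrative, not a Spark or pandas API, and this is not a fix for the underlying crash:

{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

def empty_like(pdf):
    # Illustrative helper (not a library API): build a fresh empty frame with
    # the same column names, mirroring the third snippet above, which the
    # report says does not crash.
    return pd.DataFrame({name: [] for name in pdf.columns})

@pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
def filter_pandas(pdf):
    # Return the filtered slice when non-empty; otherwise return a freshly
    # constructed empty frame rather than the empty slice itself.
    out = pdf[pdf['first'] == "9"]
    return out if len(out) > 0 else empty_like(pdf)

df.groupby("second").apply(filter_pandas).count()
{code}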
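As a side note, when the grouped-map UDF only filters rows, as in these snippets, the same count can be obtained without the pandas UDF code path at all. A minimal sketch, assuming an active SparkSession bound to {{spark}} and the same {{df}} as above:

{code:python}
from pyspark.sql.functions import col

# groupby().apply(<row filter>).count() yields the same total as counting the
# filtered rows directly, without sending any data to a Python worker.
df.filter(col("first") == "9").count()
{code}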
> GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
> -------------------------------------------------------------------
>
>                 Key: SPARK-26611
>                 URL: https://issues.apache.org/jira/browse/SPARK-26611
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.4.0
>            Reporter: Alberto
>            Priority: Major
>              Labels: UDF, pandas, pyspark
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org