[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742735#comment-16742735 ]
Hyukjin Kwon commented on SPARK-26611:
--------------------------------------

Which OS do you use? Can you try setting the environment variable {{OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES}}?

> GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
> -------------------------------------------------------------------
>
>                 Key: SPARK-26611
>                 URL: https://issues.apache.org/jira/browse/SPARK-26611
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Alberto
>            Priority: Major
>              Labels: UDF, pandas, pyspark
>
> The following snippet crashes with the error org.apache.spark.SparkException: Python worker exited unexpectedly (crashed):
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> df = spark.createDataFrame([("1", "2"), ("2", "2"), ("2", "3"), ("3", "4"), ("5", "6")], ("first", "second"))
>
> @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
> def filter_pandas(df):
>     return df[df['first'] == "9"]
>
> df.groupby("second").apply(filter_pandas).count()
> {code}
> while this one does not:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> df = spark.createDataFrame([(1, 2), (2, 2), (2, 3), (3, 4), (5, 6)], ("first", "second"))
>
> @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
> def filter_pandas(df):
>     return df[df['first'] == 9]
>
> df.groupby("second").apply(filter_pandas).count()
> {code}
> and neither does this one:
> {code:java}
> import pandas as pd  # needed for the empty-DataFrame fallback below
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> df = spark.createDataFrame([("1", "2"), ("2", "2"), ("2", "3"), ("3", "4"), ("5", "6")], ("first", "second"))
>
> @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
> def filter_pandas(df):
>     df = df[df['first'] == "9"]
>     if len(df) > 0:
>         return df
>     else:
>         return pd.DataFrame({"first": [], "second": []})
>
> df.groupby("second").apply(filter_pandas).count()
> {code}
>
> See the stacktrace [here|https://gist.github.com/afumagallireply/02d4c1355bc64a9d2129cdd6d0e9d9f3]
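For readers hitting this on macOS: a minimal sketch of applying the suggested workaround, assuming the variable is exported in the shell session (or in conf/spark-env.sh) before the driver starts, so that the Python workers Spark forks inherit it. The python3 check line is only illustrative.

```shell
# macOS (High Sierra and later) can abort forked processes that touch
# Objective-C frameworks; this flag relaxes that fork-safety check.
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

# Verify the variable is visible to child processes, as it would be to
# the forked Python workers.
python3 -c 'import os; print(os.environ["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"])'
```

Whether this resolves the crash depends on the crash actually being the macOS fork-safety abort, which is why the comment first asks which OS is in use.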
>
> Using:
> spark 2.4.0
> Pandas 0.19.2
> Pyarrow 0.8.0

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)