[ https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Cutler resolved SPARK-25147.
----------------------------------
    Resolution: Cannot Reproduce

Going to resolve this for now, please reopen if the above suggestion does not fix the issue

> GroupedData.apply pandas_udf crashing
> -------------------------------------
>
>                 Key: SPARK-25147
>                 URL: https://issues.apache.org/jira/browse/SPARK-25147
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: OS: Mac OS 10.13.6
> Python: 2.7.15, 3.6.6
> PyArrow: 0.10.0
> Pandas: 0.23.4
> Numpy: 1.15.0
>            Reporter: Mike Sukmanowsky
>            Priority: Major
>
> Running the following example taken straight from the docs results in
> {{org.apache.spark.SparkException: Python worker exited unexpectedly
> (crashed)}} for reasons that aren't clear from any logs I can see:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = (
>     SparkSession
>     .builder
>     .appName("pandas_udf")
>     .getOrCreate()
> )
>
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v")
> )
>
> @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
>
> (
>     df
>     .groupby("id")
>     .apply(normalize)
>     .show()
> )
> {code}
> See output.log for
> [stacktrace|https://gist.github.com/msukmanowsky/b9cb6700e8ccaf93f265962000403f28].
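
One way to narrow down a crash like this is to run the body of normalize in plain pandas, with no Spark or Arrow involved. This step is not taken from the ticket. If the function behaves locally, the failure is more likely in the Python worker or its PyArrow/pandas environment than in the UDF logic. The sketch below is only that: it assumes pandas is installed, rebuilds the report's five rows by hand, and uses a made-up variable name (result). It does not reproduce or fix the reported crash.

```python
import pandas as pd


def normalize(pdf):
    # Same body as the UDF in the report, minus the pandas_udf decorator.
    # Series.std() uses the sample standard deviation (ddof=1) by default.
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())


# The same five rows the report passes to spark.createDataFrame.
pdf = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})

# Apply the function one group at a time, which is roughly what a
# GROUPED_MAP UDF receives: one pandas DataFrame per "id" value.
result = pd.concat([normalize(group) for _, group in pdf.groupby("id")])

print(result)
```

Each group's v column should come back centered on zero. If this runs cleanly under the same interpreter the Spark workers use, the versions of PyArrow and pandas visible to the workers are the next thing to check.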