[jira] [Commented] (SPARK-25147) GroupedData.apply pandas_udf crashing
[ https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743590#comment-16743590 ]

Hyukjin Kwon commented on SPARK-25147:
--------------------------------------

[~msukmanowsky], can you set the environment variable {{OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES}} and try again?

> GroupedData.apply pandas_udf crashing
> -------------------------------------
>
>                 Key: SPARK-25147
>                 URL: https://issues.apache.org/jira/browse/SPARK-25147
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: OS: Mac OS 10.13.6
> Python: 2.7.15, 3.6.6
> PyArrow: 0.10.0
> Pandas: 0.23.4
> Numpy: 1.15.0
>            Reporter: Mike Sukmanowsky
>            Priority: Major
>
> Running the following example, taken straight from the docs, results in {{org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)}} for reasons that aren't clear from any logs I can see:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = (
>     SparkSession
>     .builder
>     .appName("pandas_udf")
>     .getOrCreate()
> )
>
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v")
> )
>
> @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
>
> (
>     df
>     .groupby("id")
>     .apply(normalize)
>     .show()
> )
> {code}
> See output.log for the [stacktrace|https://gist.github.com/msukmanowsky/b9cb6700e8ccaf93f265962000403f28].

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
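The variable suggested above disables a macOS fork-safety check in the Objective-C runtime that can abort forked Python worker processes. A minimal sketch of applying it from the driver script (only the variable name comes from the comment above; the placement and surrounding code are assumptions, and it must be set before Spark forks any workers):

```python
import os

# Set before the SparkSession is created and before any Python
# workers are forked; alternatively, export it in the shell that
# launches pyspark or spark-submit.
os.environ["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES"

print(os.environ["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"])
```

Setting it in the shell profile instead keeps the workaround out of application code, at the cost of being easy to forget on a new machine.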
[jira] [Commented] (SPARK-25147) GroupedData.apply pandas_udf crashing
[ https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589177#comment-16589177 ]

Bryan Cutler commented on SPARK-25147:
--------------------------------------

Works for me on Linux with:
Python 3.6.6
pyarrow 0.10.0
pandas 0.23.4
numpy 1.14.3

Maybe it only happens on MacOS?
[jira] [Commented] (SPARK-25147) GroupedData.apply pandas_udf crashing
[ https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589027#comment-16589027 ]

Mike Sukmanowsky commented on SPARK-25147:
------------------------------------------

[~hyukjin.kwon] should I take any other action here? I'm not sure how else to debug the issue without digging into Spark internals.
[jira] [Commented] (SPARK-25147) GroupedData.apply pandas_udf crashing
[ https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585911#comment-16585911 ]

Mike Sukmanowsky commented on SPARK-25147:
------------------------------------------

Confirmed, works with:
Python 2.7.15
PyArrow==0.9.0.post1
numpy==1.11.3
pandas==0.19.2
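When comparing working and failing combinations like the ones above, it helps to record exactly which versions are installed in the environment that crashes. A stdlib-only sketch (requires Python 3.8+ for {{importlib.metadata}}; the package list is illustrative, not part of the report):

```python
from importlib.metadata import version, PackageNotFoundError

def pkg_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None

for pkg in ("pyspark", "pyarrow", "pandas", "numpy"):
    print(pkg, pkg_version(pkg))
```

Pinning the reported-good set (e.g. {{pyarrow==0.9.0.post1 pandas==0.19.2 numpy==1.11.3}}) in a requirements file makes the workaround reproducible across machines.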
[jira] [Commented] (SPARK-25147) GroupedData.apply pandas_udf crashing
[ https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585374#comment-16585374 ]

Hyukjin Kwon commented on SPARK-25147:
--------------------------------------

It works on:
Python 2.7.14
Pandas 0.19.2
PyArrow 0.9.0
Numpy 1.11.3

Sounds specific to the PyArrow or Pandas version from a cursory look.
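Independent of which Spark/PyArrow/pandas versions are in play, the expected output of the {{normalize}} UDF from the report can be sanity-checked with the standard library: pandas' {{.std()}} defaults to the sample standard deviation (ddof=1), which is what {{statistics.stdev}} computes. A sketch over the five-row example from the issue:

```python
from collections import defaultdict
from statistics import mean, stdev

rows = [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]

# Group values by id, mirroring df.groupby("id")
groups = defaultdict(list)
for gid, v in rows:
    groups[gid].append(v)

# Standardize each group, mirroring (v - v.mean()) / v.std()
normalized = {}
for gid, vs in groups.items():
    m, s = mean(vs), stdev(vs)  # sample std (ddof=1), matching pandas
    normalized[gid] = [(v - m) / s for v in vs]

print(normalized)
# group 1 -> [-0.7071..., 0.7071...]
# group 2 -> [-0.8320..., -0.2773..., 1.1094...]
```

If the Spark job runs at all, its {{show()}} output should match these values; the crash in this issue happens before any result is produced, which is why the comparison only helps once a working version combination is found.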