[jira] [Comment Edited] (SPARK-26611) GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742872#comment-16742872 ] Alberto edited comment on SPARK-26611 at 1/15/19 8:56 AM:

Tried both with a Debian docker container running on Ubuntu 18.04 and on Databricks runtime 5.0. The environment has no effect.

was (Author: afumagalli): Tried both with a Debian docker container running on Ubuntu and on Databricks runtime 5.0. The environment has no effect.

> GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
> ---
>
> Key: SPARK-26611
> URL: https://issues.apache.org/jira/browse/SPARK-26611
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.0
> Reporter: Alberto
> Priority: Major
> Labels: UDF, pandas, pyspark
>
> The following snippet crashes with the error org.apache.spark.SparkException: Python worker exited unexpectedly (crashed):
> {code:python}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> df = spark.createDataFrame([("1", "2"), ("2", "2"), ("2", "3"), ("3", "4"), ("5", "6")], ("first", "second"))
>
> @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
> def filter_pandas(df):
>     return df[df['first'] == "9"]
>
> df.groupby("second").apply(filter_pandas).count()
> {code}
> while this one does not:
> {code:python}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> df = spark.createDataFrame([(1, 2), (2, 2), (2, 3), (3, 4), (5, 6)], ("first", "second"))
>
> @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
> def filter_pandas(df):
>     return df[df['first'] == 9]
>
> df.groupby("second").apply(filter_pandas).count()
> {code}
> and neither does this:
> {code:python}
> import pandas as pd
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> df = spark.createDataFrame([("1", "2"), ("2", "2"), ("2", "3"), ("3", "4"), ("5", "6")], ("first", "second"))
>
> @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)
> def filter_pandas(df):
>     df = df[df['first'] == "9"]
>     if len(df) > 0:
>         return df
>     else:
>         return pd.DataFrame({"first": [], "second": []})
>
> df.groupby("second").apply(filter_pandas).count()
> {code}
>
> See the stacktrace [here|https://gist.github.com/afumagallireply/02d4c1355bc64a9d2129cdd6d0e9d9f3]
>
> Using:
> spark 2.4.0
> Pandas 0.19.2
> Pyarrow 0.8.0

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
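The pattern in the report is that the crash appears only when a group's result is empty and the declared schema contains string columns. A plain-pandas sketch (my illustration, not from the ticket; `empty_like_schema` is a hypothetical helper) of why the naive `pd.DataFrame({"first": [], "second": []})` fallback does not even match the declared string schema, and how an empty frame with explicit dtypes can be built instead:

```python
import pandas as pd

# UDF schema is "first string, second string"; build an empty frame whose
# columns have explicit object (string) dtype instead of pandas' float64 default.
def empty_like_schema(columns):
    return pd.DataFrame({c: pd.Series(dtype=object) for c in columns})

df = pd.DataFrame({"first": ["1", "2"], "second": ["2", "3"]})

# Filtering to an empty result preserves the original object dtype.
filtered = df[df["first"] == "9"]

# Explicit-dtype fallback, matching the declared string columns.
fallback = empty_like_schema(["first", "second"])

# The naive fallback from the ticket defaults to float64 columns:
naive = pd.DataFrame({"first": [], "second": []})
```

Whether explicit dtypes avoid the crash on pyarrow 0.8.x is an assumption; the ticket only reports that all three string-column variants crash.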
[jira] [Commented] (SPARK-26611) GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742872#comment-16742872 ] Alberto commented on SPARK-26611: Tried both with a Debian docker container running on Ubuntu and on Databricks runtime 5.0. The environment has no effect.
[jira] [Commented] (SPARK-26611) GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742140#comment-16742140 ] Alberto commented on SPARK-26611: Updating pyarrow to version 0.9 seems to solve the problem.
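Since upgrading pyarrow from 0.8.0 to 0.9 reportedly resolves the crash, a minimal guard (my sketch, not part of the ticket) could fail fast on older installs before registering GROUPED_MAP UDFs:

```python
def version_tuple(v):
    # "0.8.0" -> (0, 8, 0); ignores non-numeric suffixes like "0.9.0.post1"
    parts = []
    for p in v.split("."):
        digits = "".join(ch for ch in p if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)

def pyarrow_ok(installed, minimum="0.9.0"):
    # Tuple comparison handles multi-digit components ("0.12.1" > "0.9.0").
    return version_tuple(installed) >= version_tuple(minimum)

# Usage (assuming pyarrow is importable):
#   import pyarrow
#   assert pyarrow_ok(pyarrow.__version__), "upgrade pyarrow to >= 0.9"
```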
[jira] [Updated] (SPARK-26611) GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto updated SPARK-26611: revised the issue description (duplicate of the text quoted in the first message above).
[jira] [Updated] (SPARK-26611) GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto updated SPARK-26611: revised the issue description (duplicate of the text quoted in the first message above).
[jira] [Updated] (SPARK-26611) GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto updated SPARK-26611: revised the issue description (duplicate of the text quoted in the first message above).
[jira] [Updated] (SPARK-26611) GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto updated SPARK-26611: Component/s: (was: SQL)
[jira] [Updated] (SPARK-26611) GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto updated SPARK-26611: revised the issue description (duplicate of the text quoted in the first message above).
[jira] [Updated] (SPARK-26611) GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto updated SPARK-26611: revised the issue description (duplicate of the text quoted in the first message above).
[jira] [Updated] (SPARK-26611) GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
[ https://issues.apache.org/jira/browse/SPARK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto updated SPARK-26611: Description: The following snippet crashes with error: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed) {code:java} df = spark.createDataFrame([("1","2"),("2","2"),("2","3"), ("3","4"), ("5","6")], ("first","second")) @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) def filter_pandas(df): return df[df['first']=="9"] df.groupby("second").apply(filter_pandas).count() {code} while this one does not: {code:java} df = spark.createDataFrame([(1,2),(2,2),(2,3), (3,4), (5,6)], ("first","second")) @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) def filter_pandas(df): return df[df['first']==9] df.groupby("second").apply(filter_pandas).count() {code} and neither does this: {code:java} df = spark.createDataFrame([("1","2"),("2","2"),("2","3"), ("3","4"), ("5","6")], ("first","second")) @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) def filter_pandas(df): if len(df)>0: return df else: return pd.DataFrame({"first":[],"second":[]}) df.groupby("second").apply(filter_pandas).count() {code} See stacktrace [here|https://gist.github.com/afumagallireply/02d4c1355bc64a9d2129cdd6d0e9d9f3] was: The following snippet crashes with error: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed) {code:java} df = spark.createDataFrame([("1","2"),("2","2"),("2","3"), ("3","4"), ("5","6")], ("first","second")) @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) def filter_pandas(df): return df[df['first']=="9"] df.groupby("second").apply(filter_pandas).count() {code} while this one does not: {code:java} df = spark.createDataFrame([(1,2),(2,2),(2,3), (3,4), (5,6)], ("first","second")) @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) def filter_pandas(df): return df[df['first']==9] 
df.groupby("second").apply(filter_pandas).count() {code} and neither does this: {code:java} df = spark.createDataFrame([("1","2"),("2","2"),("2","3"), ("3","4"), ("5","6")], ("first","second")) @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) def filter_pandas(df): if len(df)>0: return df else: return pd.DataFrame({"first":[],"second":[]}) df.groupby("second").apply(filter_pandas).count() {code} See stacktrace [here|https://gist.github.com/afumagallireply/02d4c1355bc64a9d2129cdd6d0e9d9f3] > GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly" > --- > > Key: SPARK-26611 > URL: https://issues.apache.org/jira/browse/SPARK-26611 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Alberto >Priority: Major > Labels: UDF, pandas, pyspark > > The following snippet crashes with error: org.apache.spark.SparkException: > Python worker exited unexpectedly (crashed) > {code:java} > df = spark.createDataFrame([("1","2"),("2","2"),("2","3"), ("3","4"), > ("5","6")], ("first","second")) > @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) > def filter_pandas(df): > return df[df['first']=="9"] > df.groupby("second").apply(filter_pandas).count() > {code} > while this one does not: > {code:java} > df = spark.createDataFrame([(1,2),(2,2),(2,3), (3,4), (5,6)], > ("first","second")) > @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) > def filter_pandas(df): > return df[df['first']==9] > df.groupby("second").apply(filter_pandas).count() > {code} > and neither does this: > {code:java} > df = spark.createDataFrame([("1","2"),("2","2"),("2","3"), ("3","4"), > ("5","6")], ("first","second")) > @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) > def filter_pandas(df): > if len(df)>0: > return df > else: > return pd.DataFrame({"first":[],"second":[]}) > df.groupby("second").apply(filter_pandas).count() > {code} > > > See stacktrace > 
[here|https://gist.github.com/afumagallireply/02d4c1355bc64a9d2129cdd6d0e9d9f3] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26611) GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly"
Alberto created SPARK-26611: --- Summary: GROUPED_MAP pandas_udf crashing "Python worker exited unexpectedly" Key: SPARK-26611 URL: https://issues.apache.org/jira/browse/SPARK-26611 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.4.0 Reporter: Alberto The following snippet crashes with error: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed) {code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType df = spark.createDataFrame([("1","2"),("2","2"),("2","3"), ("3","4"), ("5","6")], ("first","second")) @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) def filter_pandas(df): return df[df['first']=="9"] df.groupby("second").apply(filter_pandas).count() {code} while this one does not: {code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType df = spark.createDataFrame([(1,2),(2,2),(2,3), (3,4), (5,6)], ("first","second")) @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) def filter_pandas(df): return df[df['first']==9] df.groupby("second").apply(filter_pandas).count() {code} and neither does this: {code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType df = spark.createDataFrame([("1","2"),("2","2"),("2","3"), ("3","4"), ("5","6")], ("first","second")) @pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP) def filter_pandas(df): if len(df)>0: return df else: return df[df['first']=="9"] df.groupby("second").apply(filter_pandas).count() {code} See stacktrace [here|https://gist.github.com/afumagallireply/02d4c1355bc64a9d2129cdd6d0e9d9f3] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
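The failing variants above share one trait: for some groups the function returns an empty pandas DataFrame, and the columns of that empty frame may not carry the dtypes the declared schema ("first string, second string") expects. As a pandas-only sketch of a defensive rewrite of `filter_pandas` (the explicit dtype normalization is a workaround idea, not the ticket's resolution, and `SCHEMA_DTYPES` is an illustrative name):

```python
import pandas as pd

# Declared Spark schema is "first string, second string", so force string-like
# (object) columns on whatever the filter returns, including empty frames.
SCHEMA_DTYPES = {"first": "object", "second": "object"}

def filter_pandas(pdf: pd.DataFrame) -> pd.DataFrame:
    out = pdf[pdf["first"] == "9"]
    # Even when the filter leaves zero rows, keep the declared column set and
    # dtypes instead of whatever pandas would default to for an empty frame.
    return out.astype(SCHEMA_DTYPES)
```

In Spark this function would still be wrapped with `@pandas_udf("first string, second string", PandasUDFType.GROUPED_MAP)`; the pandas part alone can be exercised locally.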
[jira] [Comment Edited] (SPARK-21640) Method mode with String parameters within DataFrameWriter is error prone
[ https://issues.apache.org/jira/browse/SPARK-21640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114317#comment-16114317 ] Alberto edited comment on SPARK-21640 at 8/4/17 12:49 PM: -- Yes, that's it, [~srowen]. I don't like to use strings to do such things but in this case I have to. If there is an enum I think the "best worst" scenario is to use the values to call the method. And if the other values work I guess it should also be valid for the errorifexists case. was (Author: ardlema): Yes, that's is [~srowen]. I don't like to use string to do such things but in the case I have to. If there is an enum I think the "best worst" scenario is to use the values to call it. And if the other values work I guess it should be also valid for the errorifexists case. > Method mode with String parameters within DataFrameWriter is error prone > > > Key: SPARK-21640 > URL: https://issues.apache.org/jira/browse/SPARK-21640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Alberto >Priority: Trivial > > The following method: > {code:java} > def mode(saveMode: String): DataFrameWriter[T] > {code} > sets the SaveMode of the DataFrameWriter depending on the string that is > passed in as a parameter. > There is a Java enum with all the save modes, which are Append, Overwrite, > ErrorIfExists and Ignore. 
In my current project I was writing some code that > was using this enum to get the string value that I use to call the mode > method: > {code:java} > private[utils] val configModeAppend = SaveMode.Append.toString.toLowerCase > private[utils] val configModeErrorIfExists = > SaveMode.ErrorIfExists.toString.toLowerCase > private[utils] val configModeIgnore = SaveMode.Ignore.toString.toLowerCase > private[utils] val configModeOverwrite = > SaveMode.Overwrite.toString.toLowerCase > {code} > The configModeErrorIfExists val contains the value "errorifexists" and when I > call the mode method using this string it does not match. I suggest > including "errorifexists" as a valid match for the ErrorIfExists SaveMode. > Will create a PR to address this issue ASAP. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21640) Method mode with String parameters within DataFrameWriter is error prone
[ https://issues.apache.org/jira/browse/SPARK-21640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114317#comment-16114317 ] Alberto commented on SPARK-21640: - Yes, that's it, [~srowen]. I don't like to use strings to do such things but in this case I have to. If there is an enum I think the "best worst" scenario is to use the values to call it. And if the other values work I guess it should also be valid for the errorifexists case. > Method mode with String parameters within DataFrameWriter is error prone > > > Key: SPARK-21640 > URL: https://issues.apache.org/jira/browse/SPARK-21640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Alberto >Priority: Trivial > > The following method: > {code:java} > def mode(saveMode: String): DataFrameWriter[T] > {code} > sets the SaveMode of the DataFrameWriter depending on the string that is > passed in as a parameter. > There is a Java enum with all the save modes, which are Append, Overwrite, > ErrorIfExists and Ignore. In my current project I was writing some code that > was using this enum to get the string value that I use to call the mode > method: > {code:java} > private[utils] val configModeAppend = SaveMode.Append.toString.toLowerCase > private[utils] val configModeErrorIfExists = > SaveMode.ErrorIfExists.toString.toLowerCase > private[utils] val configModeIgnore = SaveMode.Ignore.toString.toLowerCase > private[utils] val configModeOverwrite = > SaveMode.Overwrite.toString.toLowerCase > {code} > The configModeErrorIfExists val contains the value "errorifexists" and when I > call the mode method using this string it does not match. I suggest > including "errorifexists" as a valid match for the ErrorIfExists SaveMode. > Will create a PR to address this issue ASAP. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21640) Method mode with String parameters within DataFrameWriter is error prone
Alberto created SPARK-21640: --- Summary: Method mode with String parameters within DataFrameWriter is error prone Key: SPARK-21640 URL: https://issues.apache.org/jira/browse/SPARK-21640 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Alberto The following method: {code:java} def mode(saveMode: String): DataFrameWriter[T] {code} sets the SaveMode of the DataFrameWriter depending on the string that is passed in as a parameter. There is a Java enum with all the save modes, which are Append, Overwrite, ErrorIfExists and Ignore. In my current project I was writing some code that was using this enum to get the string value that I use to call the mode method: {code:java} private[utils] val configModeAppend = SaveMode.Append.toString.toLowerCase private[utils] val configModeErrorIfExists = SaveMode.ErrorIfExists.toString.toLowerCase private[utils] val configModeIgnore = SaveMode.Ignore.toString.toLowerCase private[utils] val configModeOverwrite = SaveMode.Overwrite.toString.toLowerCase {code} The configModeErrorIfExists val contains the value "errorifexists" and when I call the mode method using this string it does not match. I suggest including "errorifexists" as a valid match for the ErrorIfExists SaveMode. Will create a PR to address this issue ASAP. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
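The mismatch described in this ticket — `SaveMode.ErrorIfExists.toString.toLowerCase` yields "errorifexists", which the string-based `mode()` then fails to match — can be sketched with a small normalization table. This is an illustrative helper, not Spark's actual implementation; the point is that the string-matching side should accept every spelling the enum side can produce:

```python
# Illustrative save-mode resolver (names are hypothetical, not Spark API).
# Lower-casing an enum constant like "ErrorIfExists" must round-trip, so
# "errorifexists" is listed as an accepted spelling alongside "error".
_SAVE_MODES = {
    "append": "Append",
    "overwrite": "Overwrite",
    "ignore": "Ignore",
    "error": "ErrorIfExists",
    "errorifexists": "ErrorIfExists",  # the alias this ticket proposes
}

def resolve_save_mode(mode: str) -> str:
    try:
        return _SAVE_MODES[mode.lower()]
    except KeyError:
        raise ValueError(f"Unknown save mode: {mode!r}") from None
```

With the alias in place, `resolve_save_mode(SaveMode.ErrorIfExists.toString.toLowerCase)`-style call sites stop failing.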
[jira] [Closed] (SPARK-9687) System.exit() still disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto closed SPARK-9687. -- > System.exit() still disrupt applications embedding Spark > > > Key: SPARK-9687 > URL: https://issues.apache.org/jira/browse/SPARK-9687 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: Alberto >Priority: Minor > > This issue was already reported in #SPARK-4783. It was addressed in the > following PR: 5492 but we are still having the same issue. > The TaskSchedulerImpl class is now throwing a SparkException, this exception > is caught by the SparkUncaughtExceptionHandler which is again invoking a > System.exit() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9687) System.exit() still disrupt applications embedding Spark
Alberto created SPARK-9687: -- Summary: System.exit() still disrupt applications embedding Spark Key: SPARK-9687 URL: https://issues.apache.org/jira/browse/SPARK-9687 Project: Spark Issue Type: Improvement Affects Versions: 1.4.1 Reporter: Alberto Priority: Minor This issue was already reported in #SPARK-4783. It was addressed in the following PR: 5492 but we are still having the same issue. The TaskSchedulerImpl class is now throwing a SparkException, this exception is caught by the SparkUncaughtExceptionHandler which is again invoking a System.exit() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9687) System.exit() still disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659860#comment-14659860 ] Alberto commented on SPARK-9687: Sorry about that, let's discuss in SPARK-4783 then System.exit() still disrupt applications embedding Spark Key: SPARK-9687 URL: https://issues.apache.org/jira/browse/SPARK-9687 Project: Spark Issue Type: Improvement Affects Versions: 1.4.1 Reporter: Alberto Priority: Minor This issue was already reported in #SPARK-4783. It was addressed in the following PR: 5492 but we are still having the same issue. The TaskSchedulerImpl class is now throwing a SparkException, this exception is caught by the SparkUncaughtExceptionHandler which is again invoking a System.exit() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659864#comment-14659864 ] Alberto commented on SPARK-4783: Still having this issue. We've found out that the exception thrown by TaskSchedulerImpl is being caught by SparkUncaughtExceptionHandler, which is calling System.exit() again. Would it make sense just logging the error and not throwing the exception? See https://github.com/apache/spark/pull/7993 System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria Assignee: Sean Owen Priority: Minor Fix For: 1.4.0 A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running sparkcontexts. A typical server is created with the following pseudo code: var continue = true while (continue){ try { server.run() } catch (e) { continue = log_and_examine_error(e) } The problem is that sparkcontext frequently calls System.exit when it encounters a problem which means the server can only be re-spawned at the process level, which is much more messy than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in sparkcontext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
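The gateway pseudo code in the description can be made concrete. The sketch below uses stand-in names (`server` and `log_and_examine_error` are hypothetical); it only illustrates why a raised exception is recoverable for the embedding application while a System.exit() inside the library is not:

```python
# If a scheduler failure surfaces as an exception, the embedding server can
# catch it and decide whether to retry; a System.exit() deep inside Spark
# would kill this loop (and the whole process) before the handler ever runs.
def run_gateway(server, log_and_examine_error):
    keep_running = True
    while keep_running:
        try:
            server.run()
            return  # clean shutdown
        except Exception as exc:
            keep_running = log_and_examine_error(exc)
```

The handler's boolean return mirrors the `continue = log_and_examine_error(e)` line of the original pseudo code.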
[jira] [Commented] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491993#comment-14491993 ] Alberto commented on SPARK-4783: Does it mean that you guys are going to create a PR with a fix/change proposal for this? Or just asking someone to create that PR? If so I am willing to create it. System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running sparkcontexts. A typical server is created with the following pseudo code: var continue = true while (continue){ try { server.run() } catch (e) { continue = log_and_examine_error(e) } The problem is that sparkcontext frequently calls System.exit when it encounters a problem which means the server can only be re-spawned at the process level, which is much more messy than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in sparkcontext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6804) System.exit(1) on error
Alberto created SPARK-6804: -- Summary: System.exit(1) on error Key: SPARK-6804 URL: https://issues.apache.org/jira/browse/SPARK-6804 Project: Spark Issue Type: Improvement Reporter: Alberto We are developing a web application that is using Spark under the hood. Testing our app we have found out that when our spark master is not up and running and we try to connect with it, Spark is killing our app. We've been having a look at the code and we have noticed that the TaskSchedulerImpl class is just killing the JVM and our web application is obviously also killed. See following the code snippet I am talking about: else { // No task sets are active but we still got an error. Just exit since this // must mean the error is during registration. // It might be good to do something smarter here in the future. logError("Exiting due to error from cluster scheduler: " + message) System.exit(1) } IMHO this guy should not invoke System.exit(1). Instead, it should throw an exception so the applications will be able to handle the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6804) System.exit(1) on error
[ https://issues.apache.org/jira/browse/SPARK-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto updated SPARK-6804: --- Description: We are developing a web application that is using Spark under the hood. Testing our app we have found out that when our spark master is not up and running and we try to connect with it, Spark is killing our app. We've been having a look at the code and we have noticed that the TaskSchedulerImpl class is just killing the JVM and our web application is obviously also killed. See following the code snippet I am talking about: {code} else { // No task sets are active but we still got an error. Just exit since this // must mean the error is during registration. // It might be good to do something smarter here in the future. logError("Exiting due to error from cluster scheduler: " + message) System.exit(1) } {code} IMHO this guy should not invoke System.exit(1). Instead, it should throw an exception so the applications will be able to handle the error. was: We are developing a web application that is using Spark under the hood. Testing our app we have found out that when our spark master is not up and running and we try to connect with it, Spark is killing our app. We've been having a look at the code and we have noticed that the TaskSchedulerImpl class is just killing the JVM and our web application is obviously also killed. See following the code snippet I am talking about: else { // No task sets are active but we still got an error. Just exit since this // must mean the error is during registration. // It might be good to do something smarter here in the future. logError("Exiting due to error from cluster scheduler: " + message) System.exit(1) } IMHO this guy should not invoke System.exit(1). Instead, it should throw an exception so the applications will be able to handle the error. 
System.exit(1) on error --- Key: SPARK-6804 URL: https://issues.apache.org/jira/browse/SPARK-6804 Project: Spark Issue Type: Improvement Reporter: Alberto We are developing a web application that is using Spark under the hood. Testing our app we have found out that when our spark master is not up and running and we try to connect with it, Spark is killing our app. We've been having a look at the code and we have noticed that the TaskSchedulerImpl class is just killing the JVM and our web application is obviously also killed. See following the code snippet I am talking about: {code} else { // No task sets are active but we still got an error. Just exit since this // must mean the error is during registration. // It might be good to do something smarter here in the future. logError("Exiting due to error from cluster scheduler: " + message) System.exit(1) } {code} IMHO this guy should not invoke System.exit(1). Instead, it should throw an exception so the applications will be able to handle the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
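The change requested in this ticket — raise instead of exit — can be sketched as follows. The exception class and function names are illustrative, not the actual Spark patch:

```python
# Instead of logError(...) followed by System.exit(1), surface the failure as
# an exception so an embedding application can catch and handle it.
class ClusterSchedulerError(RuntimeError):
    """Raised when an error arrives while no task sets are active."""

def on_scheduler_error(message: str) -> None:
    raise ClusterSchedulerError(
        "Exiting due to error from cluster scheduler: " + message
    )
```

A web application embedding Spark can then wrap its startup in try/except rather than being killed outright.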
[jira] [Commented] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream
[ https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482752#comment-14482752 ] Alberto commented on SPARK-6431: I think you're right Cody. I've been having a look at my code and I've found out that I'm not creating the topic before creating the DirectStream. If you are interested here is the test I am running: https://github.com/ardlema/big-brother/blob/master/src/test/scala/org/ardlema/spark/DwellDetectorTest.scala I completely agree with you: if the topic doesn't exist it should be returning an error and not a misleading empty set. Couldn't find leader offsets exception when creating KafkaDirectStream -- Key: SPARK-6431 URL: https://issues.apache.org/jira/browse/SPARK-6431 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Alberto When I try to create an InputDStream using the createDirectStream method of the KafkaUtils class and the Kafka topic does not have any messages yet I am getting the following error: org.apache.spark.SparkException: Couldn't find leader offsets for Set() org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413) If I put a message in the topic before creating the DirectStream everything works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
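As the comment notes, an empty Set() of leader offsets here usually means the topic simply does not exist on the brokers yet. A pre-flight guard along these lines (illustrative names, not the KafkaUtils API) would fail fast with an actionable message instead of the opaque "Couldn't find leader offsets for Set()":

```python
# Hypothetical pre-flight check before building a direct stream: compare the
# requested topics against those the brokers actually know about.
def check_topics_exist(requested_topics, broker_topics):
    missing = sorted(set(requested_topics) - set(broker_topics))
    if missing:
        raise ValueError(
            "Cannot create direct stream, topics not found on brokers: "
            + ", ".join(missing)
        )
```

Running the check before the equivalent of `createDirectStream` turns a confusing runtime failure into an immediate, named error.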
[jira] [Comment Edited] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream
[ https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482752#comment-14482752 ] Alberto edited comment on SPARK-6431 at 4/7/15 7:43 AM: You're absolutely right Cody. I've been having a look at my code and I've found out that I'm not creating the topic before creating the DirectStream. If you are interested here is the test I am running: https://github.com/ardlema/big-brother/blob/master/src/test/scala/org/ardlema/spark/DwellDetectorTest.scala I completely agree with you: if the topic doesn't exist it should be returning an error and not a misleading empty set. was (Author: ardlema): I think you're right Cody. I've been having a look at my code and I've found out that I'm not creating the topic before creating the DirectStream. If you are interested here is the test I am running: https://github.com/ardlema/big-brother/blob/master/src/test/scala/org/ardlema/spark/DwellDetectorTest.scala I completely agree with you, If the topic doesn't exist it should be returning an error and not a misleading empty set. Couldn't find leader offsets exception when creating KafkaDirectStream -- Key: SPARK-6431 URL: https://issues.apache.org/jira/browse/SPARK-6431 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Alberto When I try to create an InputDStream using the createDirectStream method of the KafkaUtils class and the Kafka topic does not have any messages yet I am getting the following error: org.apache.spark.SparkException: Couldn't find leader offsets for Set() org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413) If I put a message in the topic before creating the DirectStream everything works fine. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream
Alberto created SPARK-6431: -- Summary: Couldn't find leader offsets exception when creating KafkaDirectStream Key: SPARK-6431 URL: https://issues.apache.org/jira/browse/SPARK-6431 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Alberto When I try to create an InputDStream using the createDirectStream method of the KafkaUtils class and the Kafka topic does not have any messages yet I am getting the following error: org.apache.spark.SparkException: Couldn't find leader offsets for Set() org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413) If I put a message in the topic before creating the DirectStream everything works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334942#comment-14334942 ] Alberto commented on SPARK-5281: Having this problem as well trying to migrate to 1.2.1. Any updates? Registering table on RDD is giving MissingRequirementError -- Key: SPARK-5281 URL: https://issues.apache.org/jira/browse/SPARK-5281 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sarsol Priority: Critical Application crashes on this line rdd.registerTempTable(temp) in 1.2 version when using sbt or Eclipse SCALA IDE Stacktrace Exception in thread main scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.collection.immutable.List.foreach(List.scala:318) at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.App$class.main(App.scala:71) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334942#comment-14334942 ] Alberto edited comment on SPARK-5281 at 2/24/15 2:45 PM: -- Having this problem as well trying to migrate from 1.1.1 to 1.2.1. Any updates? was (Author: ardlema): Having this problem as well trying to migrate to 1.2.1. Any updates?
> Registering table on RDD is giving MissingRequirementError
> ---
>
> Key: SPARK-5281
> URL: https://issues.apache.org/jira/browse/SPARK-5281
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0
> Reporter: sarsol
> Priority: Critical
>
> Application crashes on the line rdd.registerTempTable(temp) in version 1.2 when run via sbt or the Eclipse Scala IDE.
> Stacktrace:
> Exception in thread main scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
> at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
> at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
> at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
> at scala.reflect.api.Universe.typeOf(Universe.scala:59)
> at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
> at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
> at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
> at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
> at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
> at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
> at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
> at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
> at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
> at scala.App$$anonfun$main$1.apply(App.scala:71)
> at scala.App$$anonfun$main$1.apply(App.scala:71)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> at scala.App$class.main(App.scala:71)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
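For context, the crash pattern in the report above can be sketched as a minimal Spark 1.2-era program. This is an illustrative reconstruction, not the reporter's actual code: the names (Person, ReproApp, the table name) are hypothetical, and it relies on the long-removed Spark 1.2 API (SQLContext.createSchemaRDD, SchemaRDD.registerTempTable), so it will not compile against modern Spark. The shape matches the stack trace: registerTempTable triggers the implicit createSchemaRDD conversion, which calls ScalaReflection.schemaFor, and the scala.App/delayedInit frames indicate the reporter's main object extended App.

```scala
// Hypothetical reconstruction of the reported crash (Spark 1.2 API, since removed).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

// Extending App matches the scala.App$class.main / delayedInit frames in the trace.
object ReproApp extends App {
  val sc = new SparkContext(new SparkConf().setAppName("repro").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD // implicit RDD[Product] -> SchemaRDD conversion (Spark 1.2)

  val rdd = sc.parallelize(Seq(Person("a", 1)))
  // Reported to throw MissingRequirementError when launched via sbt or the
  // Eclipse Scala IDE: schema inference (ScalaReflection.schemaFor) runs under a
  // classloader that cannot see the Spark SQL classes on the boot classpath.
  rdd.registerTempTable("temp")
}
```

The MissingRequirementError here is a reflection/classloader failure rather than a bug in the user code itself, which is consistent with the reporters only hitting it under specific launchers (sbt, Eclipse) after upgrading to 1.2.x.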