[ https://issues.apache.org/jira/browse/SPARK-30473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Max Härtwig updated SPARK-30473:
--------------------------------
    Description: 
PySpark enum subclasses crash when used inside a UDF.

Example:
{code:python}
from enum import Enum

class Direction(Enum):
    NORTH = 0
    SOUTH = 1
{code}

Working:
{code:python}
Direction.NORTH
{code}

Crashing:
{code:python}
@udf
def fn(a):
    Direction.NORTH
    return ""

df.withColumn("test", fn("a"))
{code}

Stacktrace:
{noformat}
SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__
    enum_members = {k: classdict[k] for k in classdict._member_names}
AttributeError: 'dict' object has no attribute '_member_names'{noformat}

I suspect the problem is in *python/pyspark/cloudpickle.py*. On line 586, in the function *_save_dynamic_enum*, the attribute *_member_names* is removed from the enum's class dictionary. Yet this attribute is required by the *Enum* machinery when the class is rebuilt during deserialization, so every Enum subclass crashes inside a UDF.

was:
PySpark enum subclasses crash when used inside a UDF.
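The suspected root cause can be reproduced without Spark or cloudpickle: rebuilding an enum class by passing a plain dict to the metaclass fails with exactly the error in the stacktrace, while the Enum functional API (which builds a proper _EnumDict internally) succeeds. A minimal sketch; the Direction class mirrors the example above:

```python
from enum import Enum, EnumMeta

# Simulate what a pickler does when it rebuilds a dynamically defined enum
# from a plain dict: EnumMeta.__new__ expects a classdict carrying a
# _member_names attribute (normally an _EnumDict), so a plain dict raises
# the same AttributeError as in the stacktrace above.
try:
    EnumMeta("Direction", (Enum,), {"NORTH": 0, "SOUTH": 1})
    error = None
except AttributeError as exc:
    error = str(exc)
print(error)  # 'dict' object has no attribute '_member_names'

# Rebuilding through the Enum functional API works, because it constructs
# a proper _EnumDict internally -- one way a fix could restore the class.
Direction = Enum("Direction", {"NORTH": 0, "SOUTH": 1})
print(Direction.NORTH.value)  # 0
```

This suggests that either *_member_names* must not be stripped before pickling, or the class should be reconstructed via a path (such as the functional API) that does not depend on it.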
Example:
{code:python}
from enum import Enum

class Direction(Enum):
    NORTH = 0
    SOUTH = 1
{code}

Working:
{code:python}
Direction.NORTH
{code}

Crashing:
{code:python}
@udf
def fn(a):
    Direction.NORTH
    return ""

df.withColumn("test", fn("a"))
{code}

Stacktrace:
{noformat}
SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__
    enum_members = {k: classdict[k] for k in classdict._member_names}
AttributeError: 'dict' object has no attribute '_member_names'{noformat}

I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586, in the function `_save_dynamic_enum`, the attribute `_member_names` is removed from the enum, yet this attribute is required by the `Enum` class, and Enum subclasses will crash.


> PySpark enum subclass crashes when used inside UDF
> --------------------------------------------------
>
>                 Key: SPARK-30473
>                 URL: https://issues.apache.org/jira/browse/SPARK-30473
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.4
>        Environment: Databricks Runtime 6.2 (includes Apache Spark 2.4.4, Scala 2.11)
>            Reporter: Max Härtwig
>            Priority: Major
>
> PySpark enum subclass crashes when used inside a UDF.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org