[ https://issues.apache.org/jira/browse/SPARK-32079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447747#comment-17447747 ]
Hyukjin Kwon commented on SPARK-32079:
--------------------------------------

I'm working on this.

> PySpark <> Beam pickling issues for collections.namedtuple
> -----------------------------------------------------------
>
>                 Key: SPARK-32079
>                 URL: https://issues.apache.org/jira/browse/SPARK-32079
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Gerard Casas Saez
>            Priority: Major
>
> PySpark's monkeypatching of collections.namedtuple makes it difficult or
> impossible to unpickle namedtuple instances from outside a PySpark environment.
>
> Once PySpark has been loaded into the environment, any namedtuple you pickle
> can only be unpickled in an environment where the
> [hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385]
> has also been applied.
> This conflicts directly with running Beam on a non-Spark runner (namely Flink
> or Dataflow), making a pipeline unusable if it has a namedtuple loaded anywhere.
>
> {code:python}
> import collections
> import dill
>
> ColumnInfo = collections.namedtuple(
>     "ColumnInfo",
>     [
>         "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
>         "type",  # type: Optional[ColumnType]  # pytype: disable=ignored-type-comment
>     ])
>
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}}
> {code:python}
> import pyspark
> import collections
> import dill
>
> ColumnInfo = collections.namedtuple(
>     "ColumnInfo",
>     [
>         "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
>         "type",  # type: Optional[ColumnType]  # pytype: disable=ignored-type-comment
>     ])
>
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}}
> The second pickled object can only be unpickled in an environment where PySpark is installed.
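
For context on why the second payload is not portable: it names {{pyspark.serializers._restore}} as its reconstruction callable, so unpickling requires that module to be importable. The snippet below is a minimal, simplified sketch of the mechanism, not the actual PySpark source linked above; it only assumes that the monkeypatch overrides {{__reduce__}} on namedtuple classes so that pickling routes through a module-level restore function.

{code:python}
# Simplified sketch of the namedtuple hijack described above -- NOT the real
# pyspark.serializers code, just an illustration of the mechanism implied by
# the second pickle payload (which references pyspark.serializers._restore).
import collections
import pickle


def _restore(name, fields, values):
    # Rebuild the namedtuple class, then the instance, at unpickle time.
    # Unpickling only works where this function can be imported, which is
    # why a payload referencing pyspark.serializers._restore cannot be
    # loaded in an environment without PySpark.
    cls = collections.namedtuple(name, fields)
    return cls(*values)


def _hack_namedtuple(cls):
    # Route pickling of the generated class through _restore by overriding
    # __reduce__ on the class itself.
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls


if __name__ == "__main__":
    Point = _hack_namedtuple(collections.namedtuple("Point", ["x", "y"]))
    payload = pickle.dumps(Point(1, 2))
    # The payload references __main__._restore, analogous to the
    # pyspark.serializers._restore reference in the second dump above.
    print(pickle.loads(payload))  # Point(x=1, y=2)
{code}

Running the sketch produces a payload that pins deserialization to this module's {{_restore}}, mirroring how PySpark-pickled namedtuples pin {{pyspark.serializers._restore}} and therefore fail to load in a Flink or Dataflow worker where PySpark is absent.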