[ https://issues.apache.org/jira/browse/SPARK-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-18161:
------------------------------------

    Assignee:  (was: Apache Spark)

> Default PickleSerializer pickle protocol doesn't handle > 4GB objects
> ---------------------------------------------------------------------
>
>                 Key: SPARK-18161
>                 URL: https://issues.apache.org/jira/browse/SPARK-18161
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: Sloane Simmons
>
> When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, serializing the object fails with
> {{OverflowError: cannot serialize a bytes object larger than 4 GiB}}
> in the stack trace.
> This is because Python's pickle serialization (with protocol <= 3) uses a 32-bit integer for the size of a bytes object, and so cannot handle objects larger than 4 GiB. Protocol 4 of pickle (https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) lifted this limit and is available in Python 3.4+.
> I would like to use this protocol for broadcasting and in the default PickleSerializer where available, to make PySpark more robust when broadcasting large variables.
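A minimal, self-contained sketch (illustrative only, not Spark's actual serializer code) of the failure mode and of picking protocol 4 "where available" as proposed above; the 4 GiB allocation exists only to trigger the limit and needs a machine with enough free memory:

{code:python}
import pickle
import sys

# Pickle protocols <= 3 encode the length of a bytes object as a 32-bit
# integer, so any bytes object of 4 GiB or more cannot be written.
# WARNING: this allocates 4 GiB of memory; illustrative only (Python 3).
big = bytes(2**32)

try:
    pickle.dumps(big, protocol=3)
except OverflowError as err:
    print(err)  # cannot serialize a bytes object larger than 4 GiB

# Protocol 4 (PEP 3154) adds 64-bit length opcodes, so the same object
# pickles successfully on Python 3.4+. Version-gating it keeps older
# interpreters on the highest protocol they support.
protocol = 4 if sys.version_info >= (3, 4) else 2
if protocol == 4:
    data = pickle.dumps(big, protocol=protocol)  # succeeds
{code}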