[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307482#comment-14307482 ]
Josh Rosen commented on SPARK-4897:
-----------------------------------

By the way, it might be nice to see if we can figure out a good way of subdividing this task across multiple PRs so that the pieces that we have already figured out don't end up bitrotting or turning into merge conflicts. For instance, if we can test the `cloudpickle.py` file separately from the other modules, then we could submit a PR that only adds 3.4 support to that file.

If you can spot any other natural subproblems here, leave a comment or create a sub-task on this JIRA ticket.

> Python 3 support
> ----------------
>
>                 Key: SPARK-4897
>                 URL: https://issues.apache.org/jira/browse/SPARK-4897
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Josh Rosen
>            Priority: Minor
>
> It would be nice to have Python 3 support in PySpark, provided that we can do
> it in a way that maintains backwards compatibility with Python 2.6.
> I started looking into porting this; my WIP work can be found at
> https://github.com/JoshRosen/spark/compare/python3
> I was able to use the
> [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1]
> tool to handle the basic conversion of things like {{print}} statements, etc.
> and had to manually fix up a few imports for packages that moved / were
> renamed, but the major blocker that I hit was {{cloudpickle}}:
> {code}
> [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark
> Python 3.4.2 (default, Oct 19 2014, 17:52:17)
> [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in <module>
>     import pyspark
>   File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, in <module>
>     from pyspark.context import SparkContext
>   File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, in <module>
>     from pyspark import accumulators
>   File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 97, in <module>
>     from pyspark.cloudpickle import CloudPickler
>   File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 120, in <module>
>     class CloudPickler(pickle.Pickler):
>   File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 122, in CloudPickler
>     dispatch = pickle.Pickler.dispatch.copy()
> AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch'
> {code}
> This code looks like it will be difficult to port to Python 3, so this
> might be a good reason to switch to
> [Dill|https://github.com/uqfoundation/dill] for Python serialization.
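
For context on the AttributeError in the traceback: on Python 3, `pickle.Pickler` resolves to the C-accelerated `_pickle.Pickler`, which has no class-level `dispatch` dict to copy. Below is a minimal sketch of one possible workaround, assuming it is acceptable to depend on the private pure-Python class `pickle._Pickler`, which still carries `dispatch`. The names `BasePickler` and `SketchPickler` are hypothetical, and this is not necessarily the approach taken in the WIP branch or in any later fix.

```python
import io
import pickle

# Assumption: on Python 3 the pure-Python pickler is reachable under the
# private name pickle._Pickler. On Python 2 that name does not exist, so
# getattr falls back to pickle.Pickler, which already has `dispatch`.
BasePickler = getattr(pickle, "_Pickler", pickle.Pickler)


class SketchPickler(BasePickler):
    """Hypothetical stand-in for CloudPickler's class header."""

    # Copy the per-type save table so that overrides added here do not
    # leak into the base pickler.
    dispatch = BasePickler.dispatch.copy()


# Round-trip a small object through the subclass to check that it works.
buf = io.BytesIO()
SketchPickler(buf, protocol=2).dump({"a": [1, 2, 3]})
restored = pickle.loads(buf.getvalue())
```

The trade-off is that the pure-Python pickler is slower than the C implementation and `_Pickler` is not part of the documented API, so this would only be a stopgap to get `cloudpickle.py` importable while the rest of the port is split into separate PRs.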