[ 
https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303154#comment-14303154
 ] 

Ian Ozsvald commented on SPARK-4897:
------------------------------------

If I can cast a vote...

I note that Python 2.6 is the lowest version of Python that Spark supports. Some 
recent data suggests that Python 2.6 support isn't very useful in the wider 
ecosystem and so might be slowing Spark development. A "Python 2 vs 3" survey 
was conducted before Christmas and the results were published recently:
http://www.randalolson.com/2015/01/30/python-usage-survey-2014/

Of 6,746 respondents, fewer than 10% use Python 2.6 day-to-day. 81% use Python 
2.7 (and 43% Python 3.4 - including me) for day-to-day use (presumably for 
work), and there's an approximate 50/50 split between Python 2 and 3 for 
personal projects. I'd humbly suggest that supporting Python 2.6 will slow 
development and that avoiding Python 3.4 will hinder wider adoption.

The same survey a year earlier had 4,790 respondents; the second diagram on 
randalolson's site compares 2013 to 2014 - fewer people now write Python 2 
day-to-day and more write Python 3 (though Python 2.7 is still significantly 
dominant). Given that Python 2.7 will reach end-of-life by 2020, the trend 
towards Python 3.4 is clear. The core scientific libraries (e.g. scipy, numpy, 
pandas, matplotlib) all work on Python 3.4 and have done for several years.

The survey doesn't ask respondents whether they are web-devs, data scientists, 
ETL-folk, dev-ops etc., so it is hard to extrapolate whether Spark users are 
predominantly on Python 2.6, 2.7 or 3.4, but I'd suggest that a local survey in 
this community might provide useful guidance.

Although they move on a longer cycle, the major Linux distros are switching 
away from Python 2.7 to Python 3+:
https://www.archlinux.org/news/python-is-now-python-3/  # switched 2010
http://www.phoronix.com/scan.php?page=news_item&px=Fedora-22-Python-3-Status  # Fedora to Python 3 around May 2015
https://wiki.ubuntu.com/Python/3  # work ongoing; maybe the switch occurs in 2015?

What is the use case for Python 2.6 support? Personally I'd vote for supporting 
2.7 as a minimum, with a strong push for Python 3.4 compatibility, to reduce 
the hours wasted supporting older Python versions. Supporting older Pythons 
will also hinder the creation of a single Python 2.7/3.4 compatible code-base 
due to cross-version complications.
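For what it's worth, a single 2.7/3.4 code-base is usually achievable with a 
handful of idioms (a sketch of the usual pattern, not Spark's actual code):

```python
# Compatibility idioms that run unchanged on Python 2.7 and 3.4.
from __future__ import print_function, division

import sys

PY3 = sys.version_info[0] >= 3

# Thanks to the __future__ import, division behaves the same on both
# versions: / is true division, // is floor division.
assert 7 / 2 == 3.5
assert 7 // 2 == 3

# print is a function on both versions.
print("running on Python", sys.version_info[0])
```

Tools like futurize/six automate most of the remaining differences (renamed 
stdlib modules, string handling, iterator methods).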

About me - long-time speaker/teacher at Python conferences, O'Reilly author 
(High Performance Python), co-organiser of the 1000+ member PyDataLondon meetup 
and conference series, and a Python 3.4 proponent since April 2014. At my 
PyData meetup I regularly poll the user group (approx. 100 attendees each 
month): 1% use Python 2.6, the majority use Python 2.7, and each month more 
people switch up to Python 3.4 (mainly to get away from unicode errors during 
text processing).
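Those unicode errors are easy to demonstrate: Python 2 silently coerces between 
byte strings and text, producing mojibake or late failures, while Python 3 
keeps str and bytes separate and fails immediately. A minimal Python 3 
illustration:

```python
# Python 3 keeps text (str) and binary data (bytes) strictly separate,
# turning Python 2's silent coercion into an immediate, clear error.
text = "café"
data = text.encode("utf-8")   # bytes: b'caf\xc3\xa9'

try:
    combined = "prefix: " + data   # mixing str and bytes
except TypeError:
    # An explicit decode is required - no implicit ASCII coercion.
    combined = "prefix: " + data.decode("utf-8")

assert combined == "prefix: café"
```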

> Python 3 support
> ----------------
>
>                 Key: SPARK-4897
>                 URL: https://issues.apache.org/jira/browse/SPARK-4897
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Josh Rosen
>            Priority: Minor
>
> It would be nice to have Python 3 support in PySpark, provided that we can do 
> it in a way that maintains backwards-compatibility with Python 2.6.
> I started looking into porting this; my WIP work can be found at 
> https://github.com/JoshRosen/spark/compare/python3
> I was able to use the 
> [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] 
> tool to handle the basic conversion of things like {{print}} statements, etc. 
> and had to manually fix up a few imports for packages that moved / were 
> renamed, but the major blocker that I hit was {{cloudpickle}}:
> {code}
> [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark
> Python 3.4.2 (default, Oct 19 2014, 17:52:17)
> [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, 
> in <module>
>     import pyspark
>   File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 
> 41, in <module>
>     from pyspark.context import SparkContext
>   File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, 
> in <module>
>     from pyspark import accumulators
>   File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", 
> line 97, in <module>
>     from pyspark.cloudpickle import CloudPickler
>   File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 
> 120, in <module>
>     class CloudPickler(pickle.Pickler):
>   File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 
> 122, in CloudPickler
>     dispatch = pickle.Pickler.dispatch.copy()
> AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch'
> {code}
> This code looks like it will be difficult to port to Python 3, so this might 
> be a good reason to switch to 
> [Dill|https://github.com/uqfoundation/dill] for Python serialization.
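The AttributeError quoted above has a known cause: on Python 3, pickle.Pickler 
is the C implementation (_pickle.Pickler), which exposes no per-type dispatch 
table, while the pure-Python pickle._Pickler still does. A minimal sketch of 
the subclassing workaround (illustrative names, not Spark's actual code):

```python
import io
import pickle

# The C pickler (the Python 3 default) has no 'dispatch' attribute -
# exactly what the quoted traceback reports.
assert not hasattr(pickle.Pickler, "dispatch")

# The pure-Python pickler still exposes the table, so a cloudpickle-style
# subclass can keep the "copy and extend dispatch" pattern by targeting it.
class CompatPickler(pickle._Pickler):
    dispatch = pickle._Pickler.dispatch.copy()

    def save_complex(self, obj):
        # Hypothetical override: pickle complex numbers via an explicit reduce.
        self.save_reduce(complex, (obj.real, obj.imag), obj=obj)

    dispatch[complex] = save_complex

# Round-trip through the custom pickler.
buf = io.BytesIO()
CompatPickler(buf, protocol=2).dump(3 + 4j)
assert pickle.loads(buf.getvalue()) == 3 + 4j
```

This keeps the copied-dispatch design intact; switching to Dill, as suggested 
above, would avoid maintaining such a table at all.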



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
