executor lost when running sparksql

2016-01-05 Thread qinggangwa...@gmail.com
Hi all,
  I am running Spark SQL in the HiveQL dialect. The SQL is like "select * from 
(select * from t1 order by t1.id desc) as ff". The query succeeds when it runs 
only once, but it fails when I run it five times at the same time. It seems 
that the threads are dumped and executors are lost. The problem is not caused 
by memory or GC. The shuffle data is relatively large, but the whole shuffle 
size is less than 3 GB for a single run and 15 GB for five runs. Does anyone 
have a good idea?

Thanks



qinggangwa...@gmail.com
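For reference, a minimal Scala sketch (not the original poster's code) of firing the same HiveQL query five times concurrently from a single driver. It assumes Spark 1.x with a HiveContext and an existing Hive table t1; the object name, app name, and use of count() to force execution are illustrative assumptions.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ConcurrentHiveQlRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("concurrent-hiveql-repro"))
    val hiveContext = new HiveContext(sc)  // HiveQL dialect

    val query = "select * from (select * from t1 order by t1.id desc) as ff"

    // Fire the same query five times in parallel and force execution with count().
    val runs = (1 to 5).map { i =>
      Future {
        val rows = hiveContext.sql(query).count()
        println(s"run $i returned $rows rows")
      }
    }
    runs.foreach(Await.result(_, Duration.Inf))
    sc.stop()
  }
}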


Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Hamel Kothari
The "Too Many Files" part of the exception is just indicative of the fact
that when that call was made, too many files were already open. It doesn't
necessarily mean that that line is the source of all of the open files,
that's just the point at which it hit its limit.

What I would recommend is to run this code again and use "lsof" on one of
the Spark executors (perhaps run it in a loop, writing the output to separate
files) until the job fails, and then see which files are being opened. If
anything seems to account for a clear majority of them, that might point you
to the culprit.
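A minimal sketch of that kind of monitoring loop, written here in Scala via scala.sys.process to stay in the language of this thread (a shell loop works just as well). The executor pid argument, output directory, and 5-second interval are assumptions.

import java.io.File
import scala.sys.process._

// Hypothetical helper: snapshot the open files of one executor JVM every few
// seconds until that process exits, writing each snapshot to its own file.
object LsofMonitor {
  def main(args: Array[String]): Unit = {
    val executorPid = args(0)                       // pid of a Spark executor on this host
    val outDir = new File(args(1))
    outDir.mkdirs()

    var i = 0
    while (Seq("kill", "-0", executorPid).! == 0) { // is the executor still alive?
      val snapshot = new File(outDir, f"lsof-$i%05d.txt")
      (Seq("lsof", "-p", executorPid) #> snapshot).!
      i += 1
      Thread.sleep(5000)
    }
  }
}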

On Tue, Jan 5, 2016 at 9:48 PM Priya Ch 
wrote:

> Yes, the FileInputStream is closed. Maybe I didn't show it in the screen
> shot.
>
> As Spark implements sort-based shuffle, there is a parameter (a maximum
> merge factor) which decides the number of files that can be merged at once,
> and this avoids too many open files. I suspect it is something related to
> this.
>
> Can someone confirm this?
>
> On Tue, Jan 5, 2016 at 11:19 PM, Annabel Melongo <
> melongo_anna...@yahoo.com> wrote:
>
>> Vijay,
>>
>> Are you closing the FileInputStream at the end of each loop (
>> in.close())? My guess is those streams aren't closed, and hence the "too
>> many open files" exception.
>>
>>
>> On Tuesday, January 5, 2016 8:03 AM, Priya Ch <
>> learnings.chitt...@gmail.com> wrote:
>>
>>
>> Can some one throw light on this ?
>>
>> Regards,
>> Padma Ch
>>
>> On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch 
>> wrote:
>>
>> Chris, we are using Spark 1.3.0. We have not set the
>> spark.streaming.concurrentJobs parameter; it takes the default value.
>>
>> Vijay,
>>
>>   From the stack trace it is evident that 
>> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
>> is throwing the exception. I opened the spark source code and visited the
>> line which is throwing this exception i.e
>>
>> [image: Inline image 1]
>>
>> The line marked in red is throwing the exception. The file is
>> ExternalSorter.scala in the org.apache.spark.util.collection package.
>>
>> I went through the following blog
>> http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
>> and understood that there is a merge factor which decides the number of
>> on-disk files that can be merged. Is it in some way related to this?
>>
>> Regards,
>> Padma CH
>>
>> On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly  wrote:
>>
>> and which version of Spark/Spark Streaming are you using?
>>
>> are you explicitly setting the spark.streaming.concurrentJobs to
>> something larger than the default of 1?
>>
>> if so, please try setting that back to 1 and see if the problem still
>> exists.
>>
>> this is a dangerous parameter to modify from the default - which is why
>> it's not well-documented.
>>
>>
>> On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge 
>> wrote:
>>
>> Few indicators -
>>
>> 1) during execution time - check total number of open files using lsof
>> command. Need root permissions. If it is cluster not sure much !
>> 2) which exact line in the code is triggering this error ? Can you paste
>> that snippet ?
>>
>>
>> On Wednesday 23 December 2015, Priya Ch 
>> wrote:
>>
>> ulimit -n 65000
>>
>> fs.file-max = 65000 ( in etc/sysctl.conf file)
>>
>> Thanks,
>> Padma Ch
>>
>> On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma  wrote:
>>
>> Could you share the ulimit for your setup please ?
>> - Thanks, via mobile,  excuse brevity.
>> On Dec 22, 2015 6:39 PM, "Priya Ch"  wrote:
>>
>> Jakob,
>>
>>Increased the settings like fs.file-max in /etc/sysctl.conf and also
>> increased user limit in /etc/security/limits.conf. But still see the
>> same issue.
>>
>> On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky 
>> wrote:
>>
>> It might be a good idea to see how many files are open and try increasing
>> the open file limit (this is done on an os level). In some application
>> use-cases it is actually a legitimate need.
>>
>> If that doesn't help, make sure you close any unused files and streams in
>> your code. It will also be easier to help diagnose the issue if you send an
>> error-reproducing snippet.
>>
>>
>>
>>
>>
>> --
>> Regards,
>> Vijay Gharge
>>
>>
>>
>>
>>
>>
>> --
>>
>> *Chris Fregly*
>> Principal Data Solutions Engineer
>> IBM Spark Technology Center, San Francisco, CA
>> http://spark.tc | http://advancedspark.com
>>
>>
>>
>>
>>
>>
>


Re: UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Tathagata Das
Both mapWithState and updateStateByKey use the HashPartitioner by default,
and hash the key in the key-value DStream on which the state operation is
applied. The new data and the state are partitioned with the exact same
partitioner, so that the same keys from the new data (from the input DStream)
get shuffled and colocated with the already-partitioned state RDDs. The new
data is thus brought to the corresponding old state on the same machine, and
then the state mapping/updating function is applied. The state itself is not
shuffled every time; only the new data is shuffled in each batch.
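A minimal Scala sketch of the pattern Tathagata describes: updateStateByKey with an explicit HashPartitioner, so the state stays put and only each batch's new data is shuffled into that partitioning. The socket source, host/port, checkpoint path, and batch interval are assumptions standing in for the Kafka input.

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RunningCounts {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("running-counts"), Seconds(10))
    ssc.checkpoint("/tmp/running-counts-checkpoint")  // required for stateful operations

    // Stand-in for the Kafka input: a (word, 1) pair DStream from a socket.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // New values and existing state for a key are combined on the same node;
    // only the new batch data is shuffled into this partitioning each interval.
    val updateFunc: (Seq[Int], Option[Long]) => Option[Long] =
      (newValues, state) => Some(state.getOrElse(0L) + newValues.sum)

    val counts = pairs.updateStateByKey(
      updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

If I recall the 1.6 API correctly, mapWithState takes the equivalent setting through StateSpec.function(...).partitioner(...).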




On Tue, Jan 5, 2016 at 5:21 PM, Soumitra Johri  wrote:

> Hi,
>
> I am relatively new to Spark and am using updateStateByKey() operation to
> maintain state in my Spark Streaming application. The input data is coming
> through a Kafka topic.
>
>1. I want to understand how DStreams are partitioned.
>2. How does the partitioning work with the mapWithState() or
>updateStateByKey() method?
>3. In updateStateByKey(), are the old state and the new values against
>a given key processed on the same node?
>4. How frequent is the shuffle for the updateStateByKey() method?
>
> The state I have to maintain contains ~10 keys and I want to avoid a
> shuffle every time I update the state; any tips on how to do this?
>
> Warm Regards
> Soumitra
>


Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Priya Ch
Yes, the FileInputStream is closed. Maybe I didn't show it in the screen
shot.

As Spark implements sort-based shuffle, there is a parameter (a maximum merge
factor) which decides the number of files that can be merged at once, and this
avoids too many open files. I suspect it is something related to this.

Can someone confirm this?
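As an aside, here is a minimal Scala sketch of the explicit-close pattern Annabel asks about below, ensuring the FileInputStream is released even when processing throws. The withStream and processAll helpers and the byte-level loop are purely illustrative.

import java.io.{BufferedInputStream, FileInputStream, InputStream}

object StreamClosing {
  // Run `body` against a freshly opened stream and always close it afterwards,
  // even if `body` throws, so the file descriptor is released promptly.
  def withStream[T](path: String)(body: InputStream => T): T = {
    val in = new BufferedInputStream(new FileInputStream(path))
    try body(in) finally in.close()
  }

  // Illustrative loop over many files; processByte is a placeholder.
  def processAll(paths: Seq[String])(processByte: Int => Unit): Unit =
    paths.foreach { path =>
      withStream(path) { in =>
        var b = in.read()
        while (b != -1) { processByte(b); b = in.read() }
      }
    }
}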

On Tue, Jan 5, 2016 at 11:19 PM, Annabel Melongo 
wrote:

> Vijay,
>
> Are you closing the FileInputStream at the end of each loop ( in.close())?
> My guess is those streams aren't closed, and hence the "too many open files"
> exception.
>
>
> On Tuesday, January 5, 2016 8:03 AM, Priya Ch <
> learnings.chitt...@gmail.com> wrote:
>
>
> Can some one throw light on this ?
>
> Regards,
> Padma Ch
>
> On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch 
> wrote:
>
> Chris, we are using Spark 1.3.0. We have not set the
> spark.streaming.concurrentJobs parameter; it takes the default value.
>
> Vijay,
>
>   From the stack trace it is evident that 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
> is throwing the exception. I opened the spark source code and visited the
> line which is throwing this exception i.e
>
> [image: Inline image 1]
>
> The line marked in red is throwing the exception. The file is
> ExternalSorter.scala in the org.apache.spark.util.collection package.
>
> I went through the following blog
> http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
> and understood that there is a merge factor which decides the number of
> on-disk files that can be merged. Is it in some way related to this?
>
> Regards,
> Padma CH
>
> On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly  wrote:
>
> and which version of Spark/Spark Streaming are you using?
>
> are you explicitly setting the spark.streaming.concurrentJobs to
> something larger than the default of 1?
>
> if so, please try setting that back to 1 and see if the problem still
> exists.
>
> this is a dangerous parameter to modify from the default - which is why
> it's not well-documented.
>
>
> On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge 
> wrote:
>
> Few indicators -
>
> 1) during execution time - check total number of open files using lsof
> command. Need root permissions. If it is cluster not sure much !
> 2) which exact line in the code is triggering this error ? Can you paste
> that snippet ?
>
>
> On Wednesday 23 December 2015, Priya Ch 
> wrote:
>
> ulimit -n 65000
>
> fs.file-max = 65000 ( in etc/sysctl.conf file)
>
> Thanks,
> Padma Ch
>
> On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma  wrote:
>
> Could you share the ulimit for your setup please ?
> - Thanks, via mobile,  excuse brevity.
> On Dec 22, 2015 6:39 PM, "Priya Ch"  wrote:
>
> Jakob,
>
>Increased the settings like fs.file-max in /etc/sysctl.conf and also
> increased user limit in /etc/security/limits.conf. But still see the same
> issue.
>
> On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky 
> wrote:
>
> It might be a good idea to see how many files are open and try increasing
> the open file limit (this is done on an os level). In some application
> use-cases it is actually a legitimate need.
>
> If that doesn't help, make sure you close any unused files and streams in
> your code. It will also be easier to help diagnose the issue if you send an
> error-reproducing snippet.
>
>
>
>
>
> --
> Regards,
> Vijay Gharge
>
>
>
>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>
>
>
>
>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jeff Zhang
+1

On Wed, Jan 6, 2016 at 9:18 AM, Juliet Hougland 
wrote:

> Most admins I talk to about python and spark are already actively (or on
> their way to) managing their cluster python installations. Even if people
> begin using the system python with pyspark, there is eventually a user who
> needs a complex dependency (like pandas or sklearn) on the cluster. No
> admin would muck around installing libs into system python, so you end up
> with other python installations.
>
> Installing a non-system python is something users intending to use pyspark
> on a real cluster should be thinking about, eventually, anyway. It would
> work in situations where people are running pyspark locally or actively
> managing python installations on a cluster. There is an awkward middle
> point where someone has installed spark but not configured their cluster
> (by installing non default python) in any other way. Most clusters I see
> are RHEL/CentOS and have something other than system python used by spark.
>
> What libraries stopped supporting python 2.6 and where does spark use
> them? The "ease of transitioning to pyspark onto a cluster" problem may be
> an easier pill to swallow if it only affected something like mllib or spark
> sql and not parts of the core api. You end up hoping numpy or pandas are
> installed in the runtime components of spark anyway. At that point people
> really should just go install a non system python. There are tradeoffs to
> using pyspark and I feel pretty fine explaining to people that managing
> their cluster's python installations is something that comes with using
> pyspark.
>
> RHEL/CentOS is so common that this would probably be a little work for a
> lot of people.
>
> --Juliet
>
> On Tue, Jan 5, 2016 at 4:07 PM, Koert Kuipers  wrote:
>
>> hey evil admin:)
>> i think the bit about java was from me?
>> if so, i meant to indicate that the reality for us is java is 1.7 on most
>> (all?) clusters. i do not believe spark prefers java 1.8. my point was that
>> even though java 1.7 is getting old as well, it would be a major issue for
>> me if spark dropped java 1.7 support.
>>
>> On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken 
>> wrote:
>>
>>> As one of the evil administrators that runs a RHEL 6 cluster, we already
>>> provide quite a few different version of python on our cluster pretty darn
>>> easily. All you need is a separate install directory and to set the
>>> PYTHON_HOME environment variable to point to the correct python, then have
>>> the users make sure the correct python is in their PATH. I understand that
>>> other administrators may not be so compliant.
>>>
>>> Saw a small bit about the java version in there; does Spark currently
>>> prefer Java 1.8.x?
>>>
>>> —Ken
>>>
>>> On Jan 5, 2016, at 6:08 PM, Josh Rosen  wrote:
>>>
>>> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
 while continuing to use a vanilla `python` executable on the executors
>>>
>>>
>>> Whoops, just to be clear, this should actually read "while continuing to
>>> use a vanilla `python` 2.7 executable".
>>>
>>> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen 
>>> wrote:
>>>
 Yep, the driver and executors need to have compatible Python versions.
 I think that there are some bytecode-level incompatibilities between 2.6
 and 2.7 which would impact the deserialization of Python closures, so I
 think you need to be running the same 2.x version for all communicating
 Spark processes. Note that you _can_ use a Python 2.7 `ipython` executable
 on the driver while continuing to use a vanilla `python` executable on the
 executors (we have environment variables which allow you to control these
 separately).

 On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> I think all the slaves need the same (or a compatible) version of
> Python installed since they run Python code in PySpark jobs natively.
>
> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers 
> wrote:
>
>> interesting i didnt know that!
>>
>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> even if python 2.7 was needed only on this one machine that launches
>>> the app we can not ship it with our software because its gpl licensed
>>>
>>> Not to nitpick, but maybe this is important. The Python license is 
>>> GPL-compatible
>>> but not GPL :
>>>
>>> Note GPL-compatible doesn’t mean that we’re distributing Python
>>> under the GPL. All Python licenses, unlike the GPL, let you distribute a
>>> modified version without making your changes open source. The
>>> GPL-compatible licenses make it possible to combine Python with other
>>> software that is released under the GPL; the others don’t.
>>>
>>> Nick
>>> ​
>>>
>>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers 
>>> wrote:

UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Soumitra Johri
Hi,

I am relatively new to Spark and am using updateStateByKey() operation to
maintain state in my Spark Streaming application. The input data is coming
through a Kafka topic.

   1. I want to understand how DStreams are partitioned.
   2. How does the partitioning work with the mapWithState() or
   updateStateByKey() method?
   3. In updateStateByKey(), are the old state and the new values against a
   given key processed on the same node?
   4. How frequent is the shuffle for the updateStateByKey() method?

The state I have to maintain contains ~10 keys and I want to avoid a shuffle
every time I update the state; any tips on how to do this?

Warm Regards
Soumitra


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Juliet Hougland
Most admins I talk to about python and spark are already actively (or on
their way to) managing their cluster python installations. Even if people
begin using the system python with pyspark, there is eventually a user who
needs a complex dependency (like pandas or sklearn) on the cluster. No
admin would muck around installing libs into system python, so you end up
with other python installations.

Installing a non-system python is something users intending to use pyspark
on a real cluster should be thinking about, eventually, anyway. It would
work in situations where people are running pyspark locally or actively
managing python installations on a cluster. There is an awkward middle
point where someone has installed spark but not configured their cluster
(by installing non default python) in any other way. Most clusters I see
are RHEL/CentOS and have something other than system python used by spark.

What libraries stopped supporting python 2.6 and where does spark use them?
The "ease of transitioning to pyspark onto a cluster" problem may be an
easier pill to swallow if it only affected something like mllib or spark
sql and not parts of the core api. You end up hoping numpy or pandas are
installed in the runtime components of spark anyway. At that point people
really should just go install a non system python. There are tradeoffs to
using pyspark and I feel pretty fine explaining to people that managing
their cluster's python installations is something that comes with using
pyspark.

RHEL/CentOS is so common that this would probably be a little work for a
lot of people.

--Juliet

On Tue, Jan 5, 2016 at 4:07 PM, Koert Kuipers  wrote:

> hey evil admin:)
> i think the bit about java was from me?
> if so, i meant to indicate that the reality for us is java is 1.7 on most
> (all?) clusters. i do not believe spark prefers java 1.8. my point was that
> even though java 1.7 is getting old as well, it would be a major issue for
> me if spark dropped java 1.7 support.
>
> On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken 
> wrote:
>
>> As one of the evil administrators that runs a RHEL 6 cluster, we already
>> provide quite a few different version of python on our cluster pretty darn
>> easily. All you need is a separate install directory and to set the
>> PYTHON_HOME environment variable to point to the correct python, then have
>> the users make sure the correct python is in their PATH. I understand that
>> other administrators may not be so compliant.
>>
>> Saw a small bit about the java version in there; does Spark currently
>> prefer Java 1.8.x?
>>
>> —Ken
>>
>> On Jan 5, 2016, at 6:08 PM, Josh Rosen  wrote:
>>
>> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
>>> while continuing to use a vanilla `python` executable on the executors
>>
>>
>> Whoops, just to be clear, this should actually read "while continuing to
>> use a vanilla `python` 2.7 executable".
>>
>> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen 
>> wrote:
>>
>>> Yep, the driver and executors need to have compatible Python versions. I
>>> think that there are some bytecode-level incompatibilities between 2.6 and
>>> 2.7 which would impact the deserialization of Python closures, so I think
>>> you need to be running the same 2.x version for all communicating Spark
>>> processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
>>> driver while continuing to use a vanilla `python` executable on the
>>> executors (we have environment variables which allow you to control these
>>> separately).
>>>
>>> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 I think all the slaves need the same (or a compatible) version of
 Python installed since they run Python code in PySpark jobs natively.

 On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:

> interesting i didnt know that!
>
> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> even if python 2.7 was needed only on this one machine that launches
>> the app we can not ship it with our software because its gpl licensed
>>
>> Not to nitpick, but maybe this is important. The Python license is 
>> GPL-compatible
>> but not GPL :
>>
>> Note GPL-compatible doesn’t mean that we’re distributing Python under
>> the GPL. All Python licenses, unlike the GPL, let you distribute a 
>> modified
>> version without making your changes open source. The GPL-compatible
>> licenses make it possible to combine Python with other software that is
>> released under the GPL; the others don’t.
>>
>> Nick
>> ​
>>
>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers 
>> wrote:
>>
>>> i do not think so.
>>>
>>> does the python 2.7 need to be installed on all slaves? if so, we do
>>> not have direct access to those.
>>>
>>> also, spark is

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
I don't think that we're planning to drop Java 7 support for Spark 2.0.

Personally, I would recommend using Java 8 if you're running Spark 1.5.0+
and are using SQL/DataFrames so that you can benefit from improvements to
code cache flushing in the Java 8 JVMs. Spark SQL's generated classes can
fill up the JVM's code cache, which causes JIT to stop working for new
bytecode. Empirically, it looks like the Java 8 JVMs have an improved
ability to flush this code cache, thereby avoiding this problem.

TL;DR: I'd prefer to run Java 8 with Spark if given the choice.
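If you are stuck on Java 7 for now, one possible mitigation (my own assumption, not something prescribed in this thread) is to give the code cache more headroom via the standard HotSpot flags, set through Spark's extraJavaOptions configs; the 512m value below is an arbitrary example.

import org.apache.spark.{SparkConf, SparkContext}

object CodeCacheTuning {
  def main(args: Array[String]): Unit = {
    // Give the JIT more code-cache headroom so Spark SQL's generated classes
    // are less likely to fill it up; 512m is an arbitrary example value.
    // In client mode the driver JVM is already running by this point, so the
    // driver-side flag is normally passed via spark-submit
    // --driver-java-options rather than SparkConf.
    val conf = new SparkConf()
      .setAppName("codegen-heavy-job")
      .set("spark.executor.extraJavaOptions",
        "-XX:ReservedCodeCacheSize=512m -XX:+UseCodeCacheFlushing")
    val sc = new SparkContext(conf)
    // ... SQL/DataFrame workload ...
    sc.stop()
  }
}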

On Tue, Jan 5, 2016 at 4:07 PM, Koert Kuipers  wrote:

> hey evil admin:)
> i think the bit about java was from me?
> if so, i meant to indicate that the reality for us is java is 1.7 on most
> (all?) clusters. i do not believe spark prefers java 1.8. my point was that
> even though java 1.7 is getting old as well, it would be a major issue for
> me if spark dropped java 1.7 support.
>
> On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken 
> wrote:
>
>> As one of the evil administrators that runs a RHEL 6 cluster, we already
>> provide quite a few different version of python on our cluster pretty darn
>> easily. All you need is a separate install directory and to set the
>> PYTHON_HOME environment variable to point to the correct python, then have
>> the users make sure the correct python is in their PATH. I understand that
>> other administrators may not be so compliant.
>>
>> Saw a small bit about the java version in there; does Spark currently
>> prefer Java 1.8.x?
>>
>> —Ken
>>
>> On Jan 5, 2016, at 6:08 PM, Josh Rosen  wrote:
>>
>> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
>>> while continuing to use a vanilla `python` executable on the executors
>>
>>
>> Whoops, just to be clear, this should actually read "while continuing to
>> use a vanilla `python` 2.7 executable".
>>
>> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen 
>> wrote:
>>
>>> Yep, the driver and executors need to have compatible Python versions. I
>>> think that there are some bytecode-level incompatibilities between 2.6 and
>>> 2.7 which would impact the deserialization of Python closures, so I think
>>> you need to be running the same 2.x version for all communicating Spark
>>> processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
>>> driver while continuing to use a vanilla `python` executable on the
>>> executors (we have environment variables which allow you to control these
>>> separately).
>>>
>>> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 I think all the slaves need the same (or a compatible) version of
 Python installed since they run Python code in PySpark jobs natively.

 On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:

> interesting i didnt know that!
>
> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> even if python 2.7 was needed only on this one machine that launches
>> the app we can not ship it with our software because its gpl licensed
>>
>> Not to nitpick, but maybe this is important. The Python license is 
>> GPL-compatible
>> but not GPL :
>>
>> Note GPL-compatible doesn’t mean that we’re distributing Python under
>> the GPL. All Python licenses, unlike the GPL, let you distribute a 
>> modified
>> version without making your changes open source. The GPL-compatible
>> licenses make it possible to combine Python with other software that is
>> released under the GPL; the others don’t.
>>
>> Nick
>> ​
>>
>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers 
>> wrote:
>>
>>> i do not think so.
>>>
>>> does the python 2.7 need to be installed on all slaves? if so, we do
>>> not have direct access to those.
>>>
>>> also, spark is easy for us to ship with our software since its
>>> apache 2 licensed, and it only needs to be present on the machine that
>>> launches the app (thanks to yarn).
>>> even if python 2.7 was needed only on this one machine that launches
>>> the app we can not ship it with our software because its gpl licensed, 
>>> so
>>> the client would have to download it and install it themselves, and this
>>> would mean its an independent install which has to be audited and 
>>> approved
>>> and now you are in for a lot of fun. basically it will never happen.
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen >> > wrote:
>>>
 If users are able to install Spark 2.0 on their RHEL clusters, then
 I imagine that they're also capable of installing a standalone Python
 alongside that Spark version (without changing Python systemwide). For
 instance, Anaconda/Miniconda make it really easy to install Python
 2.7.x/3.x without impacting / changing th

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
hey evil admin:)
i think the bit about java was from me?
if so, i meant to indicate that the reality for us is java is 1.7 on most
(all?) clusters. i do not believe spark prefers java 1.8. my point was that
even though java 1.7 is getting old as well, it would be a major issue for
me if spark dropped java 1.7 support.

On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken 
wrote:

> As one of the evil administrators that runs a RHEL 6 cluster, we already
> provide quite a few different version of python on our cluster pretty darn
> easily. All you need is a separate install directory and to set the
> PYTHON_HOME environment variable to point to the correct python, then have
> the users make sure the correct python is in their PATH. I understand that
> other administrators may not be so compliant.
>
> Saw a small bit about the java version in there; does Spark currently
> prefer Java 1.8.x?
>
> —Ken
>
> On Jan 5, 2016, at 6:08 PM, Josh Rosen  wrote:
>
> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
>> while continuing to use a vanilla `python` executable on the executors
>
>
> Whoops, just to be clear, this should actually read "while continuing to
> use a vanilla `python` 2.7 executable".
>
> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen 
> wrote:
>
>> Yep, the driver and executors need to have compatible Python versions. I
>> think that there are some bytecode-level incompatibilities between 2.6 and
>> 2.7 which would impact the deserialization of Python closures, so I think
>> you need to be running the same 2.x version for all communicating Spark
>> processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
>> driver while continuing to use a vanilla `python` executable on the
>> executors (we have environment variables which allow you to control these
>> separately).
>>
>> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I think all the slaves need the same (or a compatible) version of Python
>>> installed since they run Python code in PySpark jobs natively.
>>>
>>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:
>>>
 interesting i didnt know that!

 On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> even if python 2.7 was needed only on this one machine that launches
> the app we can not ship it with our software because its gpl licensed
>
> Not to nitpick, but maybe this is important. The Python license is 
> GPL-compatible
> but not GPL :
>
> Note GPL-compatible doesn’t mean that we’re distributing Python under
> the GPL. All Python licenses, unlike the GPL, let you distribute a 
> modified
> version without making your changes open source. The GPL-compatible
> licenses make it possible to combine Python with other software that is
> released under the GPL; the others don’t.
>
> Nick
> ​
>
> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers 
> wrote:
>
>> i do not think so.
>>
>> does the python 2.7 need to be installed on all slaves? if so, we do
>> not have direct access to those.
>>
>> also, spark is easy for us to ship with our software since its apache
>> 2 licensed, and it only needs to be present on the machine that launches
>> the app (thanks to yarn).
>> even if python 2.7 was needed only on this one machine that launches
>> the app we can not ship it with our software because its gpl licensed, so
>> the client would have to download it and install it themselves, and this
>> would mean its an independent install which has to be audited and 
>> approved
>> and now you are in for a lot of fun. basically it will never happen.
>>
>>
>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
>> wrote:
>>
>>> If users are able to install Spark 2.0 on their RHEL clusters, then
>>> I imagine that they're also capable of installing a standalone Python
>>> alongside that Spark version (without changing Python systemwide). For
>>> instance, Anaconda/Miniconda make it really easy to install Python
>>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>>> require any special permissions to install (you don't need root / sudo
>>> access). Does this address the Python versioning concerns for RHEL 
>>> users?
>>>
>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers 
>>> wrote:
>>>
 yeah, the practical concern is that we have no control over java or
 python version on large company clusters. our current reality for the 
 vast
 majority of them is java 7 and python 2.6, no matter how outdated that 
 is.

 i dont like it either, but i cannot change it.

 we currently don't use pyspark so i have no stake in this, but if
 we did i can assure you we would

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
>
> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
> while continuing to use a vanilla `python` executable on the executors


Whoops, just to be clear, this should actually read "while continuing to
use a vanilla `python` 2.7 executable".

On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen  wrote:

> Yep, the driver and executors need to have compatible Python versions. I
> think that there are some bytecode-level incompatibilities between 2.6 and
> 2.7 which would impact the deserialization of Python closures, so I think
> you need to be running the same 2.x version for all communicating Spark
> processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
> driver while continuing to use a vanilla `python` executable on the
> executors (we have environment variables which allow you to control these
> separately).
>
> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I think all the slaves need the same (or a compatible) version of Python
>> installed since they run Python code in PySpark jobs natively.
>>
>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:
>>
>>> interesting i didnt know that!
>>>
>>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 even if python 2.7 was needed only on this one machine that launches
 the app we can not ship it with our software because its gpl licensed

 Not to nitpick, but maybe this is important. The Python license is 
 GPL-compatible
 but not GPL :

 Note GPL-compatible doesn’t mean that we’re distributing Python under
 the GPL. All Python licenses, unlike the GPL, let you distribute a modified
 version without making your changes open source. The GPL-compatible
 licenses make it possible to combine Python with other software that is
 released under the GPL; the others don’t.

 Nick
 ​

 On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:

> i do not think so.
>
> does the python 2.7 need to be installed on all slaves? if so, we do
> not have direct access to those.
>
> also, spark is easy for us to ship with our software since its apache
> 2 licensed, and it only needs to be present on the machine that launches
> the app (thanks to yarn).
> even if python 2.7 was needed only on this one machine that launches
> the app we can not ship it with our software because its gpl licensed, so
> the client would have to download it and install it themselves, and this
> would mean its an independent install which has to be audited and approved
> and now you are in for a lot of fun. basically it will never happen.
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
> wrote:
>
>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>> imagine that they're also capable of installing a standalone Python
>> alongside that Spark version (without changing Python systemwide). For
>> instance, Anaconda/Miniconda make it really easy to install Python
>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>> require any special permissions to install (you don't need root / sudo
>> access). Does this address the Python versioning concerns for RHEL users?
>>
>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers 
>> wrote:
>>
>>> yeah, the practical concern is that we have no control over java or
>>> python version on large company clusters. our current reality for the 
>>> vast
>>> majority of them is java 7 and python 2.6, no matter how outdated that 
>>> is.
>>>
>>> i dont like it either, but i cannot change it.
>>>
>>> we currently don't use pyspark so i have no stake in this, but if we
>>> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>> dropped. no point in developing something that doesnt run for majority 
>>> of
>>> customers.
>>>
>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 As I pointed out in my earlier email, RHEL will support Python 2.6
 until 2020. So I'm assuming these large companies will have the option 
 of
 riding out Python 2.6 until then.

 Are we seriously saying that Spark should likewise support Python
 2.6 for the next several years? Even though the core Python devs 
 stopped
 supporting it in 2013?

 If that's not what we're suggesting, then when, roughly, can we
 drop support? What are the criteria?

 I understand the practical concern here. If companies are stuck
 using 2.6, it doesn't matter to them that it is deprecated. But 
 balancing
 that concern against the maintenance burden on this project, I would 
>>>

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
Yep, the driver and executors need to have compatible Python versions. I
think that there are some bytecode-level incompatibilities between 2.6 and
2.7 which would impact the deserialization of Python closures, so I think
you need to be running the same 2.x version for all communicating Spark
processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
driver while continuing to use a vanilla `python` executable on the
executors (we have environment variables which allow you to control these
separately).

On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas  wrote:

> I think all the slaves need the same (or a compatible) version of Python
> installed since they run Python code in PySpark jobs natively.
>
> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:
>
>> interesting i didnt know that!
>>
>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> even if python 2.7 was needed only on this one machine that launches the
>>> app we can not ship it with our software because its gpl licensed
>>>
>>> Not to nitpick, but maybe this is important. The Python license is 
>>> GPL-compatible
>>> but not GPL :
>>>
>>> Note GPL-compatible doesn’t mean that we’re distributing Python under
>>> the GPL. All Python licenses, unlike the GPL, let you distribute a modified
>>> version without making your changes open source. The GPL-compatible
>>> licenses make it possible to combine Python with other software that is
>>> released under the GPL; the others don’t.
>>>
>>> Nick
>>> ​
>>>
>>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:
>>>
 i do not think so.

 does the python 2.7 need to be installed on all slaves? if so, we do
 not have direct access to those.

 also, spark is easy for us to ship with our software since its apache 2
 licensed, and it only needs to be present on the machine that launches the
 app (thanks to yarn).
 even if python 2.7 was needed only on this one machine that launches
 the app we can not ship it with our software because its gpl licensed, so
 the client would have to download it and install it themselves, and this
 would mean its an independent install which has to be audited and approved
 and now you are in for a lot of fun. basically it will never happen.


 On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
 wrote:

> If users are able to install Spark 2.0 on their RHEL clusters, then I
> imagine that they're also capable of installing a standalone Python
> alongside that Spark version (without changing Python systemwide). For
> instance, Anaconda/Miniconda make it really easy to install Python
> 2.7.x/3.x without impacting / changing the system Python and doesn't
> require any special permissions to install (you don't need root / sudo
> access). Does this address the Python versioning concerns for RHEL users?
>
> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers 
> wrote:
>
>> yeah, the practical concern is that we have no control over java or
>> python version on large company clusters. our current reality for the 
>> vast
>> majority of them is java 7 and python 2.6, no matter how outdated that 
>> is.
>>
>> i dont like it either, but i cannot change it.
>>
>> we currently don't use pyspark so i have no stake in this, but if we
>> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
>> dropped. no point in developing something that doesnt run for majority of
>> customers.
>>
>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> As I pointed out in my earlier email, RHEL will support Python 2.6
>>> until 2020. So I'm assuming these large companies will have the option 
>>> of
>>> riding out Python 2.6 until then.
>>>
>>> Are we seriously saying that Spark should likewise support Python
>>> 2.6 for the next several years? Even though the core Python devs stopped
>>> supporting it in 2013?
>>>
>>> If that's not what we're suggesting, then when, roughly, can we drop
>>> support? What are the criteria?
>>>
>>> I understand the practical concern here. If companies are stuck
>>> using 2.6, it doesn't matter to them that it is deprecated. But 
>>> balancing
>>> that concern against the maintenance burden on this project, I would say
>>> that "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable
>>> position to take. There are many tiny annoyances one has to put up with 
>>> to
>>> support 2.6.
>>>
>>> I suppose if our main PySpark contributors are fine putting up with
>>> those annoyances, then maybe we don't need to drop support just yet...
>>>
>>> Nick
>>> 2016년 1월 5일 (화) 오후 2:27, Julio Antonio Soto de Vicente <
>>> ju...@esbet.es>님이 작성:
>>>

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
I think all the slaves need the same (or a compatible) version of Python
installed since they run Python code in PySpark jobs natively.

On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers  wrote:

> interesting i didnt know that!
>
> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> even if python 2.7 was needed only on this one machine that launches the
>> app we can not ship it with our software because its gpl licensed
>>
>> Not to nitpick, but maybe this is important. The Python license is 
>> GPL-compatible
>> but not GPL :
>>
>> Note GPL-compatible doesn’t mean that we’re distributing Python under the
>> GPL. All Python licenses, unlike the GPL, let you distribute a modified
>> version without making your changes open source. The GPL-compatible
>> licenses make it possible to combine Python with other software that is
>> released under the GPL; the others don’t.
>>
>> Nick
>> ​
>>
>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:
>>
>>> i do not think so.
>>>
>>> does the python 2.7 need to be installed on all slaves? if so, we do not
>>> have direct access to those.
>>>
>>> also, spark is easy for us to ship with our software since its apache 2
>>> licensed, and it only needs to be present on the machine that launches the
>>> app (thanks to yarn).
>>> even if python 2.7 was needed only on this one machine that launches the
>>> app we can not ship it with our software because its gpl licensed, so the
>>> client would have to download it and install it themselves, and this would
>>> mean its an independent install which has to be audited and approved and
>>> now you are in for a lot of fun. basically it will never happen.
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
>>> wrote:
>>>
 If users are able to install Spark 2.0 on their RHEL clusters, then I
 imagine that they're also capable of installing a standalone Python
 alongside that Spark version (without changing Python systemwide). For
 instance, Anaconda/Miniconda make it really easy to install Python
 2.7.x/3.x without impacting / changing the system Python and doesn't
 require any special permissions to install (you don't need root / sudo
 access). Does this address the Python versioning concerns for RHEL users?

 On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers 
 wrote:

> yeah, the practical concern is that we have no control over java or
> python version on large company clusters. our current reality for the vast
> majority of them is java 7 and python 2.6, no matter how outdated that is.
>
> i dont like it either, but i cannot change it.
>
> we currently don't use pyspark so i have no stake in this, but if we
> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
> dropped. no point in developing something that doesnt run for majority of
> customers.
>
> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> As I pointed out in my earlier email, RHEL will support Python 2.6
>> until 2020. So I'm assuming these large companies will have the option of
>> riding out Python 2.6 until then.
>>
>> Are we seriously saying that Spark should likewise support Python 2.6
>> for the next several years? Even though the core Python devs stopped
>> supporting it in 2013?
>>
>> If that's not what we're suggesting, then when, roughly, can we drop
>> support? What are the criteria?
>>
>> I understand the practical concern here. If companies are stuck using
>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>> concern against the maintenance burden on this project, I would say that
>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position 
>> to
>> take. There are many tiny annoyances one has to put up with to support 
>> 2.6.
>>
>> I suppose if our main PySpark contributors are fine putting up with
>> those annoyances, then maybe we don't need to drop support just yet...
>>
>> Nick
>> 2016년 1월 5일 (화) 오후 2:27, Julio Antonio Soto de Vicente <
>> ju...@esbet.es>님이 작성:
>>
>>> Unfortunately, Koert is right.
>>>
>>> I've been in a couple of projects using Spark (banking industry)
>>> where CentOS + Python 2.6 is the toolbox available.
>>>
>>> That said, I believe it should not be a concern for Spark. Python
>>> 2.6 is old and busted, which is totally opposite to the Spark philosophy
>>> IMO.
>>>
>>>
>>> El 5 ene 2016, a las 20:07, Koert Kuipers 
>>> escribió:
>>>
>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>
>>> if so, i still know plenty of large companies where python 2.6 is
>>> the only option. asking them for python 2.7 is not going to work
>>>
>>> so i think its a bad idea
>>>
>>>

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
if python 2.7 only has to be present on the node that launches the app
(does it?) then that could be important indeed.

On Tue, Jan 5, 2016 at 6:02 PM, Koert Kuipers  wrote:

> interesting i didnt know that!
>
> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> even if python 2.7 was needed only on this one machine that launches the
>> app we can not ship it with our software because its gpl licensed
>>
>> Not to nitpick, but maybe this is important. The Python license is 
>> GPL-compatible
>> but not GPL :
>>
>> Note GPL-compatible doesn’t mean that we’re distributing Python under the
>> GPL. All Python licenses, unlike the GPL, let you distribute a modified
>> version without making your changes open source. The GPL-compatible
>> licenses make it possible to combine Python with other software that is
>> released under the GPL; the others don’t.
>>
>> Nick
>> ​
>>
>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:
>>
>>> i do not think so.
>>>
>>> does the python 2.7 need to be installed on all slaves? if so, we do not
>>> have direct access to those.
>>>
>>> also, spark is easy for us to ship with our software since its apache 2
>>> licensed, and it only needs to be present on the machine that launches the
>>> app (thanks to yarn).
>>> even if python 2.7 was needed only on this one machine that launches the
>>> app we can not ship it with our software because its gpl licensed, so the
>>> client would have to download it and install it themselves, and this would
>>> mean its an independent install which has to be audited and approved and
>>> now you are in for a lot of fun. basically it will never happen.
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
>>> wrote:
>>>
 If users are able to install Spark 2.0 on their RHEL clusters, then I
 imagine that they're also capable of installing a standalone Python
 alongside that Spark version (without changing Python systemwide). For
 instance, Anaconda/Miniconda make it really easy to install Python
 2.7.x/3.x without impacting / changing the system Python and doesn't
 require any special permissions to install (you don't need root / sudo
 access). Does this address the Python versioning concerns for RHEL users?

 On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers 
 wrote:

> yeah, the practical concern is that we have no control over java or
> python version on large company clusters. our current reality for the vast
> majority of them is java 7 and python 2.6, no matter how outdated that is.
>
> i dont like it either, but i cannot change it.
>
> we currently don't use pyspark so i have no stake in this, but if we
> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
> dropped. no point in developing something that doesnt run for majority of
> customers.
>
> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> As I pointed out in my earlier email, RHEL will support Python 2.6
>> until 2020. So I'm assuming these large companies will have the option of
>> riding out Python 2.6 until then.
>>
>> Are we seriously saying that Spark should likewise support Python 2.6
>> for the next several years? Even though the core Python devs stopped
>> supporting it in 2013?
>>
>> If that's not what we're suggesting, then when, roughly, can we drop
>> support? What are the criteria?
>>
>> I understand the practical concern here. If companies are stuck using
>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>> concern against the maintenance burden on this project, I would say that
>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position 
>> to
>> take. There are many tiny annoyances one has to put up with to support 
>> 2.6.
>>
>> I suppose if our main PySpark contributors are fine putting up with
>> those annoyances, then maybe we don't need to drop support just yet...
>>
>> Nick
>> 2016년 1월 5일 (화) 오후 2:27, Julio Antonio Soto de Vicente <
>> ju...@esbet.es>님이 작성:
>>
>>> Unfortunately, Koert is right.
>>>
>>> I've been in a couple of projects using Spark (banking industry)
>>> where CentOS + Python 2.6 is the toolbox available.
>>>
>>> That said, I believe it should not be a concern for Spark. Python
>>> 2.6 is old and busted, which is totally opposite to the Spark philosophy
>>> IMO.
>>>
>>>
>>> El 5 ene 2016, a las 20:07, Koert Kuipers 
>>> escribió:
>>>
>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>
>>> if so, i still know plenty of large companies where python 2.6 is
>>> the only option. asking them for python 2.7 is not going to work
>>>
>>> so i think its a bad idea
>>>
>>> On Tue, Jan 5, 

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
interesting i didnt know that!

On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas  wrote:

> even if python 2.7 was needed only on this one machine that launches the
> app we can not ship it with our software because its gpl licensed
>
> Not to nitpick, but maybe this is important. The Python license is 
> GPL-compatible
> but not GPL :
>
> Note GPL-compatible doesn’t mean that we’re distributing Python under the
> GPL. All Python licenses, unlike the GPL, let you distribute a modified
> version without making your changes open source. The GPL-compatible
> licenses make it possible to combine Python with other software that is
> released under the GPL; the others don’t.
>
> Nick
> ​
>
> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:
>
>> i do not think so.
>>
>> does the python 2.7 need to be installed on all slaves? if so, we do not
>> have direct access to those.
>>
>> also, spark is easy for us to ship with our software since its apache 2
>> licensed, and it only needs to be present on the machine that launches the
>> app (thanks to yarn).
>> even if python 2.7 was needed only on this one machine that launches the
>> app we can not ship it with our software because its gpl licensed, so the
>> client would have to download it and install it themselves, and this would
>> mean its an independent install which has to be audited and approved and
>> now you are in for a lot of fun. basically it will never happen.
>>
>>
>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
>> wrote:
>>
>>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>>> imagine that they're also capable of installing a standalone Python
>>> alongside that Spark version (without changing Python systemwide). For
>>> instance, Anaconda/Miniconda make it really easy to install Python
>>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>>> require any special permissions to install (you don't need root / sudo
>>> access). Does this address the Python versioning concerns for RHEL users?
>>>
>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:
>>>
 yeah, the practical concern is that we have no control over java or
 python version on large company clusters. our current reality for the vast
 majority of them is java 7 and python 2.6, no matter how outdated that is.

 i dont like it either, but i cannot change it.

 we currently don't use pyspark so i have no stake in this, but if we
 did i can assure you we would not upgrade to spark 2.x if python 2.6 was
 dropped. no point in developing something that doesnt run for majority of
 customers.

 On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> As I pointed out in my earlier email, RHEL will support Python 2.6
> until 2020. So I'm assuming these large companies will have the option of
> riding out Python 2.6 until then.
>
> Are we seriously saying that Spark should likewise support Python 2.6
> for the next several years? Even though the core Python devs stopped
> supporting it in 2013?
>
> If that's not what we're suggesting, then when, roughly, can we drop
> support? What are the criteria?
>
> I understand the practical concern here. If companies are stuck using
> 2.6, it doesn't matter to them that it is deprecated. But balancing that
> concern against the maintenance burden on this project, I would say that
> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
> take. There are many tiny annoyances one has to put up with to support 
> 2.6.
>
> I suppose if our main PySpark contributors are fine putting up with
> those annoyances, then maybe we don't need to drop support just yet...
>
> Nick
> 2016년 1월 5일 (화) 오후 2:27, Julio Antonio Soto de Vicente 님이
> 작성:
>
>> Unfortunately, Koert is right.
>>
>> I've been in a couple of projects using Spark (banking industry)
>> where CentOS + Python 2.6 is the toolbox available.
>>
>> That said, I believe it should not be a concern for Spark. Python 2.6
>> is old and busted, which is totally opposite to the Spark philosophy IMO.
>>
>>
>> El 5 ene 2016, a las 20:07, Koert Kuipers 
>> escribió:
>>
>> rhel/centos 6 ships with python 2.6, doesnt it?
>>
>> if so, i still know plenty of large companies where python 2.6 is the
>> only option. asking them for python 2.7 is not going to work
>>
>> so i think its a bad idea
>>
>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>> juliet.hougl...@gmail.com> wrote:
>>
>>> I don't see a reason Spark 2.0 would need to support Python 2.6. At
>>> this point, Python 3 should be the default that is encouraged.
>>> Most organizations acknowledge the 2.7 is common, but lagging behind
>>> the version they should theoretically us

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
even if python 2.7 was needed only on this one machine that launches the
app we can not ship it with our software because its gpl licensed

Not to nitpick, but maybe this is important. The Python license is
GPL-compatible
but not GPL :

Note GPL-compatible doesn’t mean that we’re distributing Python under the
GPL. All Python licenses, unlike the GPL, let you distribute a modified
version without making your changes open source. The GPL-compatible
licenses make it possible to combine Python with other software that is
released under the GPL; the others don’t.

Nick
​

On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers  wrote:

> i do not think so.
>
> does the python 2.7 need to be installed on all slaves? if so, we do not
> have direct access to those.
>
> also, spark is easy for us to ship with our software since its apache 2
> licensed, and it only needs to be present on the machine that launches the
> app (thanks to yarn).
> even if python 2.7 was needed only on this one machine that launches the
> app we can not ship it with our software because its gpl licensed, so the
> client would have to download it and install it themselves, and this would
> mean its an independent install which has to be audited and approved and
> now you are in for a lot of fun. basically it will never happen.
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen 
> wrote:
>
>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>> imagine that they're also capable of installing a standalone Python
>> alongside that Spark version (without changing Python systemwide). For
>> instance, Anaconda/Miniconda make it really easy to install Python
>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>> require any special permissions to install (you don't need root / sudo
>> access). Does this address the Python versioning concerns for RHEL users?
>>
>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:
>>
>>> yeah, the practical concern is that we have no control over java or
>>> python version on large company clusters. our current reality for the vast
>>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>
>>> i dont like it either, but i cannot change it.
>>>
>>> we currently don't use pyspark so i have no stake in this, but if we did
>>> i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>> dropped. no point in developing something that doesnt run for majority of
>>> customers.
>>>
>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 As I pointed out in my earlier email, RHEL will support Python 2.6
 until 2020. So I'm assuming these large companies will have the option of
 riding out Python 2.6 until then.

 Are we seriously saying that Spark should likewise support Python 2.6
 for the next several years? Even though the core Python devs stopped
 supporting it in 2013?

 If that's not what we're suggesting, then when, roughly, can we drop
 support? What are the criteria?

 I understand the practical concern here. If companies are stuck using
 2.6, it doesn't matter to them that it is deprecated. But balancing that
 concern against the maintenance burden on this project, I would say that
 "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
 take. There are many tiny annoyances one has to put up with to support 2.6.

 I suppose if our main PySpark contributors are fine putting up with
 those annoyances, then maybe we don't need to drop support just yet...

 Nick
 On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:

> Unfortunately, Koert is right.
>
> I've been in a couple of projects using Spark (banking industry) where
> CentOS + Python 2.6 is the toolbox available.
>
> That said, I believe it should not be a concern for Spark. Python 2.6
> is old and busted, which is totally opposite to the Spark philosophy IMO.
>
>
> El 5 ene 2016, a las 20:07, Koert Kuipers 
> escribió:
>
> rhel/centos 6 ships with python 2.6, doesnt it?
>
> if so, i still know plenty of large companies where python 2.6 is the
> only option. asking them for python 2.7 is not going to work
>
> so i think its a bad idea
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
> juliet.hougl...@gmail.com> wrote:
>
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At
>> this point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge the 2.7 is common, but lagging behind
>> the version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>>
>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Red Hat supp

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
Created JIRA: https://issues.apache.org/jira/browse/SPARK-12661

On Tue, Jan 5, 2016 at 2:49 PM, Koert Kuipers  wrote:
> i do not think so.
>
> does the python 2.7 need to be installed on all slaves? if so, we do not
> have direct access to those.
>
> also, spark is easy for us to ship with our software since its apache 2
> licensed, and it only needs to be present on the machine that launches the
> app (thanks to yarn).
> even if python 2.7 was needed only on this one machine that launches the app
> we can not ship it with our software because its gpl licensed, so the client
> would have to download it and install it themselves, and this would mean its
> an independent install which has to be audited and approved and now you are
> in for a lot of fun. basically it will never happen.
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen  wrote:
>>
>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>> imagine that they're also capable of installing a standalone Python
>> alongside that Spark version (without changing Python systemwide). For
>> instance, Anaconda/Miniconda make it really easy to install Python 2.7.x/3.x
>> without impacting / changing the system Python and doesn't require any
>> special permissions to install (you don't need root / sudo access). Does
>> this address the Python versioning concerns for RHEL users?
>>
>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:
>>>
>>> yeah, the practical concern is that we have no control over java or
>>> python version on large company clusters. our current reality for the vast
>>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>
>>> i dont like it either, but i cannot change it.
>>>
>>> we currently don't use pyspark so i have no stake in this, but if we did
>>> i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>> dropped. no point in developing something that doesnt run for majority of
>>> customers.
>>>
>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas
>>>  wrote:

 As I pointed out in my earlier email, RHEL will support Python 2.6 until
 2020. So I'm assuming these large companies will have the option of riding
 out Python 2.6 until then.

 Are we seriously saying that Spark should likewise support Python 2.6
 for the next several years? Even though the core Python devs stopped
 supporting it in 2013?

 If that's not what we're suggesting, then when, roughly, can we drop
 support? What are the criteria?

 I understand the practical concern here. If companies are stuck using
 2.6, it doesn't matter to them that it is deprecated. But balancing that
 concern against the maintenance burden on this project, I would say that
 "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
 take. There are many tiny annoyances one has to put up with to support 2.6.

 I suppose if our main PySpark contributors are fine putting up with
 those annoyances, then maybe we don't need to drop support just yet...

 Nick
 On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>
> Unfortunately, Koert is right.
>
> I've been in a couple of projects using Spark (banking industry) where
> CentOS + Python 2.6 is the toolbox available.
>
> That said, I believe it should not be a concern for Spark. Python 2.6
> is old and busted, which is totally opposite to the Spark philosophy IMO.
>
>
> El 5 ene 2016, a las 20:07, Koert Kuipers  escribió:
>
> rhel/centos 6 ships with python 2.6, doesnt it?
>
> if so, i still know plenty of large companies where python 2.6 is the
> only option. asking them for python 2.7 is not going to work
>
> so i think its a bad idea
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland
>  wrote:
>>
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At
>> this point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge the 2.7 is common, but lagging behind
>> the version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>>
>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas
>>  wrote:
>>>
>>> +1
>>>
>>> Red Hat supports Python 2.6 on REHL 5 until 2020, but otherwise yes,
>>> Python 2.6 is ancient history and the core Python developers stopped
>>> supporting it in 2013. REHL 5 is not a good enough reason to continue
>>> support for Python 2.6 IMO.
>>>
>>> We should aim to support Python 2.7 and Python 3.3+ (which I believe
>>> we currently do).
>>>
>>> Nick
>>>
>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
>>> wrote:

 plus 1,

 we are currently using python 2.7.2 in production environment.





 On 2016-01-05 18:11:45

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
i do not think so.

does the python 2.7 need to be installed on all slaves? if so, we do not
have direct access to those.

also, spark is easy for us to ship with our software since its apache 2
licensed, and it only needs to be present on the machine that launches the
app (thanks to yarn).
even if python 2.7 was needed only on this one machine that launches the
app we can not ship it with our software because its gpl licensed, so the
client would have to download it and install it themselves, and this would
mean its an independent install which has to be audited and approved and
now you are in for a lot of fun. basically it will never happen.


On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen  wrote:

> If users are able to install Spark 2.0 on their RHEL clusters, then I
> imagine that they're also capable of installing a standalone Python
> alongside that Spark version (without changing Python systemwide). For
> instance, Anaconda/Miniconda make it really easy to install Python
> 2.7.x/3.x without impacting / changing the system Python and doesn't
> require any special permissions to install (you don't need root / sudo
> access). Does this address the Python versioning concerns for RHEL users?
>
> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:
>
>> yeah, the practical concern is that we have no control over java or
>> python version on large company clusters. our current reality for the vast
>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>
>> i dont like it either, but i cannot change it.
>>
>> we currently don't use pyspark so i have no stake in this, but if we did
>> i can assure you we would not upgrade to spark 2.x if python 2.6 was
>> dropped. no point in developing something that doesnt run for majority of
>> customers.
>>
>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> As I pointed out in my earlier email, RHEL will support Python 2.6 until
>>> 2020. So I'm assuming these large companies will have the option of riding
>>> out Python 2.6 until then.
>>>
>>> Are we seriously saying that Spark should likewise support Python 2.6
>>> for the next several years? Even though the core Python devs stopped
>>> supporting it in 2013?
>>>
>>> If that's not what we're suggesting, then when, roughly, can we drop
>>> support? What are the criteria?
>>>
>>> I understand the practical concern here. If companies are stuck using
>>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>>> concern against the maintenance burden on this project, I would say that
>>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
>>> take. There are many tiny annoyances one has to put up with to support 2.6.
>>>
>>> I suppose if our main PySpark contributors are fine putting up with
>>> those annoyances, then maybe we don't need to drop support just yet...
>>>
>>> Nick
>>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>>>
 Unfortunately, Koert is right.

 I've been in a couple of projects using Spark (banking industry) where
 CentOS + Python 2.6 is the toolbox available.

 That said, I believe it should not be a concern for Spark. Python 2.6
 is old and busted, which is totally opposite to the Spark philosophy IMO.


 El 5 ene 2016, a las 20:07, Koert Kuipers  escribió:

 rhel/centos 6 ships with python 2.6, doesnt it?

 if so, i still know plenty of large companies where python 2.6 is the
 only option. asking them for python 2.7 is not going to work

 so i think its a bad idea

 On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
 juliet.hougl...@gmail.com> wrote:

> I don't see a reason Spark 2.0 would need to support Python 2.6. At
> this point, Python 3 should be the default that is encouraged.
> Most organizations acknowledge the 2.7 is common, but lagging behind
> the version they should theoretically use. Dropping python 2.6
> support sounds very reasonable to me.
>
> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> +1
>>
>> Red Hat supports Python 2.6 on REHL 5 until 2020
>> ,
>> but otherwise yes, Python 2.6 is ancient history and the core Python
>> developers stopped supporting it in 2013. REHL 5 is not a good enough
>> reason to continue support for Python 2.6 IMO.
>>
>> We should aim to support Python 2.7 and Python 3.3+ (which I believe
>> we currently do).
>>
>> Nick
>>
>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
>> wrote:
>>
>>> plus 1,
>>>
>>> we are currently using python 2.7.2 in production environment.
>>>
>>>
>>>
>>>
>>>
>>> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>>>
>>> +1
>>> We use Python 2.7
>>>
>>> Regards,

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
If users are able to install Spark 2.0 on their RHEL clusters, then I
imagine that they're also capable of installing a standalone Python
alongside that Spark version (without changing Python systemwide). For
instance, Anaconda/Miniconda make it really easy to install Python
2.7.x/3.x without impacting / changing the system Python and doesn't
require any special permissions to install (you don't need root / sudo
access). Does this address the Python versioning concerns for RHEL users?
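
For example, a minimal sketch of what that could look like (the environment
name and install path are illustrative, and this assumes conda/Miniconda is
already available on the node):

# create an isolated Python 2.7 environment without touching the system Python
conda create -y -n pyspark27 python=2.7
# point PySpark at that interpreter, e.g. in conf/spark-env.sh
export PYSPARK_PYTHON=$HOME/miniconda2/envs/pyspark27/bin/python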

On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:

> yeah, the practical concern is that we have no control over java or python
> version on large company clusters. our current reality for the vast
> majority of them is java 7 and python 2.6, no matter how outdated that is.
>
> i dont like it either, but i cannot change it.
>
> we currently don't use pyspark so i have no stake in this, but if we did i
> can assure you we would not upgrade to spark 2.x if python 2.6 was dropped.
> no point in developing something that doesnt run for majority of customers.
>
> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> As I pointed out in my earlier email, RHEL will support Python 2.6 until
>> 2020. So I'm assuming these large companies will have the option of riding
>> out Python 2.6 until then.
>>
>> Are we seriously saying that Spark should likewise support Python 2.6 for
>> the next several years? Even though the core Python devs stopped supporting
>> it in 2013?
>>
>> If that's not what we're suggesting, then when, roughly, can we drop
>> support? What are the criteria?
>>
>> I understand the practical concern here. If companies are stuck using
>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>> concern against the maintenance burden on this project, I would say that
>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
>> take. There are many tiny annoyances one has to put up with to support 2.6.
>>
>> I suppose if our main PySpark contributors are fine putting up with those
>> annoyances, then maybe we don't need to drop support just yet...
>>
>> Nick
>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>>
>>> Unfortunately, Koert is right.
>>>
>>> I've been in a couple of projects using Spark (banking industry) where
>>> CentOS + Python 2.6 is the toolbox available.
>>>
>>> That said, I believe it should not be a concern for Spark. Python 2.6 is
>>> old and busted, which is totally opposite to the Spark philosophy IMO.
>>>
>>>
>>> El 5 ene 2016, a las 20:07, Koert Kuipers  escribió:
>>>
>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>
>>> if so, i still know plenty of large companies where python 2.6 is the
>>> only option. asking them for python 2.7 is not going to work
>>>
>>> so i think its a bad idea
>>>
>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>>> juliet.hougl...@gmail.com> wrote:
>>>
 I don't see a reason Spark 2.0 would need to support Python 2.6. At
 this point, Python 3 should be the default that is encouraged.
 Most organizations acknowledge the 2.7 is common, but lagging behind
 the version they should theoretically use. Dropping python 2.6
 support sounds very reasonable to me.

 On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> +1
>
> Red Hat supports Python 2.6 on REHL 5 until 2020
> ,
> but otherwise yes, Python 2.6 is ancient history and the core Python
> developers stopped supporting it in 2013. REHL 5 is not a good enough
> reason to continue support for Python 2.6 IMO.
>
> We should aim to support Python 2.7 and Python 3.3+ (which I believe
> we currently do).
>
> Nick
>
> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
> wrote:
>
>> plus 1,
>>
>> we are currently using python 2.7.2 in production environment.
>>
>>
>>
>>
>>
>> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>>
>> +1
>> We use Python 2.7
>>
>> Regards,
>>
>> Meethu Mathew
>>
>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin 
>> wrote:
>>
>>> Does anybody here care about us dropping support for Python 2.6 in
>>> Spark 2.0?
>>>
>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>> parsing) when compared with Python 2.7. Some libraries that Spark 
>>> depend on
>>> stopped supporting 2.6. We can still convince the library maintainers to
>>> support 2.6, but it will be extra work. I'm curious if anybody still 
>>> uses
>>> Python 2.6 to run Spark.
>>>
>>> Thanks.
>>>
>>>
>>>
>>

>>>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
yeah, the practical concern is that we have no control over java or python
version on large company clusters. our current reality for the vast
majority of them is java 7 and python 2.6, no matter how outdated that is.

i dont like it either, but i cannot change it.

we currently don't use pyspark so i have no stake in this, but if we did i
can assure you we would not upgrade to spark 2.x if python 2.6 was dropped.
no point in developing something that doesnt run for majority of customers.

On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas  wrote:

> As I pointed out in my earlier email, RHEL will support Python 2.6 until
> 2020. So I'm assuming these large companies will have the option of riding
> out Python 2.6 until then.
>
> Are we seriously saying that Spark should likewise support Python 2.6 for
> the next several years? Even though the core Python devs stopped supporting
> it in 2013?
>
> If that's not what we're suggesting, then when, roughly, can we drop
> support? What are the criteria?
>
> I understand the practical concern here. If companies are stuck using 2.6,
> it doesn't matter to them that it is deprecated. But balancing that concern
> against the maintenance burden on this project, I would say that "upgrade
> to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to take.
> There are many tiny annoyances one has to put up with to support 2.6.
>
> I suppose if our main PySpark contributors are fine putting up with those
> annoyances, then maybe we don't need to drop support just yet...
>
> Nick
> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>
>> Unfortunately, Koert is right.
>>
>> I've been in a couple of projects using Spark (banking industry) where
>> CentOS + Python 2.6 is the toolbox available.
>>
>> That said, I believe it should not be a concern for Spark. Python 2.6 is
>> old and busted, which is totally opposite to the Spark philosophy IMO.
>>
>>
>> El 5 ene 2016, a las 20:07, Koert Kuipers  escribió:
>>
>> rhel/centos 6 ships with python 2.6, doesnt it?
>>
>> if so, i still know plenty of large companies where python 2.6 is the
>> only option. asking them for python 2.7 is not going to work
>>
>> so i think its a bad idea
>>
>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>> juliet.hougl...@gmail.com> wrote:
>>
>>> I don't see a reason Spark 2.0 would need to support Python 2.6. At this
>>> point, Python 3 should be the default that is encouraged.
>>> Most organizations acknowledge the 2.7 is common, but lagging behind the
>>> version they should theoretically use. Dropping python 2.6
>>> support sounds very reasonable to me.
>>>
>>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 +1

 Red Hat supports Python 2.6 on REHL 5 until 2020
 ,
 but otherwise yes, Python 2.6 is ancient history and the core Python
 developers stopped supporting it in 2013. REHL 5 is not a good enough
 reason to continue support for Python 2.6 IMO.

 We should aim to support Python 2.7 and Python 3.3+ (which I believe we
 currently do).

 Nick

 On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
 wrote:

> plus 1,
>
> we are currently using python 2.7.2 in production environment.
>
>
>
>
>
> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>
> +1
> We use Python 2.7
>
> Regards,
>
> Meethu Mathew
>
> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin 
> wrote:
>
>> Does anybody here care about us dropping support for Python 2.6 in
>> Spark 2.0?
>>
>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>> parsing) when compared with Python 2.7. Some libraries that Spark depend 
>> on
>> stopped supporting 2.6. We can still convince the library maintainers to
>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>> Python 2.6 to run Spark.
>>
>> Thanks.
>>
>>
>>
>
>>>
>>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
As I pointed out in my earlier email, RHEL will support Python 2.6 until
2020. So I'm assuming these large companies will have the option of riding
out Python 2.6 until then.

Are we seriously saying that Spark should likewise support Python 2.6 for
the next several years? Even though the core Python devs stopped supporting
it in 2013?

If that's not what we're suggesting, then when, roughly, can we drop
support? What are the criteria?

I understand the practical concern here. If companies are stuck using 2.6,
it doesn't matter to them that it is deprecated. But balancing that concern
against the maintenance burden on this project, I would say that "upgrade
to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to take.
There are many tiny annoyances one has to put up with to support 2.6.

I suppose if our main PySpark contributors are fine putting up with those
annoyances, then maybe we don't need to drop support just yet...

Nick
On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:

> Unfortunately, Koert is right.
>
> I've been in a couple of projects using Spark (banking industry) where
> CentOS + Python 2.6 is the toolbox available.
>
> That said, I believe it should not be a concern for Spark. Python 2.6 is
> old and busted, which is totally opposite to the Spark philosophy IMO.
>
>
> El 5 ene 2016, a las 20:07, Koert Kuipers  escribió:
>
> rhel/centos 6 ships with python 2.6, doesnt it?
>
> if so, i still know plenty of large companies where python 2.6 is the only
> option. asking them for python 2.7 is not going to work
>
> so i think its a bad idea
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland  > wrote:
>
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At this
>> point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge the 2.7 is common, but lagging behind the
>> version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>>
>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Red Hat supports Python 2.6 on REHL 5 until 2020
>>> ,
>>> but otherwise yes, Python 2.6 is ancient history and the core Python
>>> developers stopped supporting it in 2013. REHL 5 is not a good enough
>>> reason to continue support for Python 2.6 IMO.
>>>
>>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
>>> currently do).
>>>
>>> Nick
>>>
>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
>>> wrote:
>>>
 plus 1,

 we are currently using python 2.7.2 in production environment.





 On 2016-01-05 18:11:45, "Meethu Mathew" wrote:

 +1
 We use Python 2.7

 Regards,

 Meethu Mathew

 On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin 
 wrote:

> Does anybody here care about us dropping support for Python 2.6 in
> Spark 2.0?
>
> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
> parsing) when compared with Python 2.7. Some libraries that Spark depend 
> on
> stopped supporting 2.6. We can still convince the library maintainers to
> support 2.6, but it will be extra work. I'm curious if anybody still uses
> Python 2.6 to run Spark.
>
> Thanks.
>
>
>

>>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Julio Antonio Soto de Vicente
Unfortunately, Koert is right.

I've been in a couple of projects using Spark (banking industry) where CentOS + 
Python 2.6 is the toolbox available. 

That said, I believe it should not be a concern for Spark. Python 2.6 is old 
and busted, which is totally opposite to the Spark philosophy IMO.


> El 5 ene 2016, a las 20:07, Koert Kuipers  escribió:
> 
> rhel/centos 6 ships with python 2.6, doesnt it?
> 
> if so, i still know plenty of large companies where python 2.6 is the only 
> option. asking them for python 2.7 is not going to work
> 
> so i think its a bad idea
> 
>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland  
>> wrote:
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At this 
>> point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge the 2.7 is common, but lagging behind the 
>> version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>> 
>>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas 
>>>  wrote:
>> 
>>> +1
>>> 
>>> Red Hat supports Python 2.6 on REHL 5 until 2020, but otherwise yes, Python 
>>> 2.6 is ancient history and the core Python developers stopped supporting it 
>>> in 2013. REHL 5 is not a good enough reason to continue support for Python 
>>> 2.6 IMO.
>>> 
>>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we 
>>> currently do).
>>> 
>>> Nick
>>> 
 On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:
 plus 1,
 
 we are currently using python 2.7.2 in production environment.
 
 
 
 
 
 On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
 +1
 We use Python 2.7
 
 Regards,
  
 Meethu Mathew
 
> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
> Does anybody here care about us dropping support for Python 2.6 in Spark 
> 2.0? 
> 
> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json 
> parsing) when compared with Python 2.7. Some libraries that Spark depend 
> on stopped supporting 2.6. We can still convince the library maintainers 
> to support 2.6, but it will be extra work. I'm curious if anybody still 
> uses Python 2.6 to run Spark.
> 
> Thanks.
> 


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Ted Yu
+1

> On Jan 5, 2016, at 10:49 AM, Davies Liu  wrote:
> 
> +1
> 
> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas
>  wrote:
>> +1
>> 
>> Red Hat supports Python 2.6 on REHL 5 until 2020, but otherwise yes, Python
>> 2.6 is ancient history and the core Python developers stopped supporting it
>> in 2013. REHL 5 is not a good enough reason to continue support for Python
>> 2.6 IMO.
>> 
>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
>> currently do).
>> 
>> Nick
>> 
>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:
>>> 
>>> plus 1,
>>> 
>>> we are currently using python 2.7.2 in production environment.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>>> 
>>> +1
>>> We use Python 2.7
>>> 
>>> Regards,
>>> 
>>> Meethu Mathew
>>> 
 On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
 
 Does anybody here care about us dropping support for Python 2.6 in Spark
 2.0?
 
 Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
 parsing) when compared with Python 2.7. Some libraries that Spark depend on
 stopped supporting 2.6. We can still convince the library maintainers to
 support 2.6, but it will be extra work. I'm curious if anybody still uses
 Python 2.6 to run Spark.
 
 Thanks.
> 
> 




Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
rhel/centos 6 ships with python 2.6, doesnt it?

if so, i still know plenty of large companies where python 2.6 is the only
option. asking them for python 2.7 is not going to work

so i think its a bad idea

On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland 
wrote:

> I don't see a reason Spark 2.0 would need to support Python 2.6. At this
> point, Python 3 should be the default that is encouraged.
> Most organizations acknowledge the 2.7 is common, but lagging behind the
> version they should theoretically use. Dropping python 2.6
> support sounds very reasonable to me.
>
> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> +1
>>
>> Red Hat supports Python 2.6 on REHL 5 until 2020
>> , but
>> otherwise yes, Python 2.6 is ancient history and the core Python developers
>> stopped supporting it in 2013. REHL 5 is not a good enough reason to
>> continue support for Python 2.6 IMO.
>>
>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
>> currently do).
>>
>> Nick
>>
>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:
>>
>>> plus 1,
>>>
>>> we are currently using python 2.7.2 in production environment.
>>>
>>>
>>>
>>>
>>>
>>> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>>>
>>> +1
>>> We use Python 2.7
>>>
>>> Regards,
>>>
>>> Meethu Mathew
>>>
>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin 
>>> wrote:
>>>
 Does anybody here care about us dropping support for Python 2.6 in
 Spark 2.0?

 Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
 parsing) when compared with Python 2.7. Some libraries that Spark depend on
 stopped supporting 2.6. We can still convince the library maintainers to
 support 2.6, but it will be extra work. I'm curious if anybody still uses
 Python 2.6 to run Spark.

 Thanks.



>>>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Juliet Hougland
I don't see a reason Spark 2.0 would need to support Python 2.6. At this
point, Python 3 should be the default that is encouraged.
Most organizations acknowledge that 2.7 is common, but it lags behind the
version they should theoretically use. Dropping Python 2.6
support sounds very reasonable to me.

On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas  wrote:

> +1
>
> Red Hat supports Python 2.6 on REHL 5 until 2020
> , but
> otherwise yes, Python 2.6 is ancient history and the core Python developers
> stopped supporting it in 2013. REHL 5 is not a good enough reason to
> continue support for Python 2.6 IMO.
>
> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
> currently do).
>
> Nick
>
> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:
>
>> plus 1,
>>
>> we are currently using python 2.7.2 in production environment.
>>
>>
>>
>>
>>
>> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>>
>> +1
>> We use Python 2.7
>>
>> Regards,
>>
>> Meethu Mathew
>>
>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
>>
>>> Does anybody here care about us dropping support for Python 2.6 in Spark
>>> 2.0?
>>>
>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>> parsing) when compared with Python 2.7. Some libraries that Spark depend on
>>> stopped supporting 2.6. We can still convince the library maintainers to
>>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>>> Python 2.6 to run Spark.
>>>
>>> Thanks.
>>>
>>>
>>>
>>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
+1

On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas
 wrote:
> +1
>
> Red Hat supports Python 2.6 on REHL 5 until 2020, but otherwise yes, Python
> 2.6 is ancient history and the core Python developers stopped supporting it
> in 2013. REHL 5 is not a good enough reason to continue support for Python
> 2.6 IMO.
>
> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
> currently do).
>
> Nick
>
> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:
>>
>> plus 1,
>>
>> we are currently using python 2.7.2 in production environment.
>>
>>
>>
>>
>>
>> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>>
>> +1
>> We use Python 2.7
>>
>> Regards,
>>
>> Meethu Mathew
>>
>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
>>>
>>> Does anybody here care about us dropping support for Python 2.6 in Spark
>>> 2.0?
>>>
>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>> parsing) when compared with Python 2.7. Some libraries that Spark depend on
>>> stopped supporting 2.6. We can still convince the library maintainers to
>>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>>> Python 2.6 to run Spark.
>>>
>>> Thanks.
>>>
>>>
>>
>




Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Rachana Srivastava
I have a very simple program: it reads input from Kafka, saves the input to
a file, and counts the records received. My code looks like this; when I run
it, I get two accumulator counts for each input.

HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
kafkaParams.put("zookeeper.connect", "localhost:2181");
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, topicsSet);

final Accumulator<Integer> accum = jssc.sparkContext().accumulator(0);

// Transformation: the accumulator is incremented once per record seen by map().
JavaDStream<String> lines = messages.map(
    new Function<Tuple2<String, String>, String>() {
      public String call(Tuple2<String, String> tuple2) {
        accum.add(1);
        return tuple2._2();
      }
    });

// Output operation: writes each batch to HDFS and prints the running count.
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
  public Void call(JavaRDD<String> rdd) throws Exception {
    if (!rdd.isEmpty() || !rdd.partitions().isEmpty()) {
      rdd.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text");
    }
    System.out.println(" & COUNT OF ACCUMULATOR IS " + accum.value());
    return null;
  }
});

jssc.start();

If I comment out rdd.saveAsTextFile I get the correct count, but with
rdd.saveAsTextFile in place I get multiple accumulator counts for each input.

Thanks,

Rachana
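
A note on the likely cause: accum.add(1) happens inside a transformation
(map), and Spark only guarantees exactly-once application of accumulator
updates performed inside actions. Each action over the un-cached RDD above
(rdd.isEmpty() and then rdd.saveAsTextFile(...)) re-runs the map for the
partitions it touches, so records can be counted more than once. A minimal
sketch of counting on the action side instead (reusing the jssc, messages,
and HDFS path from the program above; illustrative, not a tested fix):

final Accumulator<Integer> accum = jssc.sparkContext().accumulator(0);

// Keep the transformation free of accumulator updates.
JavaDStream<String> lines = messages.map(
    new Function<Tuple2<String, String>, String>() {
      public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
      }
    });

lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
  public Void call(JavaRDD<String> rdd) throws Exception {
    if (!rdd.isEmpty()) {
      rdd.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text");
      // Count once per batch on the driver side, after the write.
      accum.add((int) rdd.count());
    }
    System.out.println(" & COUNT OF ACCUMULATOR IS " + accum.value());
    return null;
  }
});

Caching the RDD (rdd.cache()) before the first action is another way to keep
the map from being re-evaluated by every action.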


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
+1

Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python
2.6 is ancient history and the core Python developers stopped supporting it
in 2013. RHEL 5 is not a good enough reason to continue support for Python
2.6 IMO.

We should aim to support Python 2.7 and Python 3.3+ (which I believe we
currently do).

Nick

On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:

> plus 1,
>
> we are currently using python 2.7.2 in production environment.
>
>
>
>
>
> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>
> +1
> We use Python 2.7
>
> Regards,
>
> Meethu Mathew
>
> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
>
>> Does anybody here care about us dropping support for Python 2.6 in Spark
>> 2.0?
>>
>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>> parsing) when compared with Python 2.7. Some libraries that Spark depend on
>> stopped supporting 2.6. We can still convince the library maintainers to
>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>> Python 2.6 to run Spark.
>>
>> Thanks.
>>
>>
>>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Allen Zhang
plus 1,


we are currently using python 2.7.2 in production environment.






On 2016-01-05 18:11:45, "Meethu Mathew" wrote:

+1
We use Python 2.7


Regards,
 
Meethu Mathew


On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:

Does anybody here care about us dropping support for Python 2.6 in Spark 2.0? 


Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json parsing) 
when compared with Python 2.7. Some libraries that Spark depend on stopped 
supporting 2.6. We can still convince the library maintainers to support 2.6, 
but it will be extra work. I'm curious if anybody still uses Python 2.6 to run 
Spark.


Thanks.







RE: Support off-loading computations to a GPU

2016-01-05 Thread Kazuaki Ishizaki
Hi Alexander,
Thank you for your interest.

We used an LR implementation derived from a Spark sample program
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkLR.scala
(not from mllib or ml). Here are the Scala source files for the GPU and
non-GPU versions.
GPU: 
https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkGPULR.scala
non-GPU: 
https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkLR.scala

Best Regards,
Kazuaki Ishizaki



From:   "Ulanov, Alexander" 
To: Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" 

Date:   2016/01/05 06:13
Subject:RE: Support off-loading computations to a GPU



Hi Kazuaki,
 
Sounds very interesting! Could you elaborate on your benchmark with 
regards to logistic regression (LR)? Did you compare your implementation 
with the current implementation of LR in Spark?
 
Best regards, Alexander
 
From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] 
Sent: Sunday, January 03, 2016 7:52 PM
To: dev@spark.apache.org
Subject: Support off-loading computations to a GPU
 
Dear all,

We reopened the existing JIRA entry 
https://issues.apache.org/jira/browse/SPARK-3785 to support off-loading 
computations to a GPU by adding a description for our prototype. We are 
working to effectively and easily exploit GPUs on Spark at 
http://github.com/kiszk/spark-gpu. Please also visit our project page 
http://kiszk.github.io/spark-gpu/.

For now, we added a new format for a partition in an RDD, which is a 
column-based structure in an array format, in addition to the current 
Iterator[T] format with Seq[T]. This reduces data 
serialization/deserialization and copy overhead between CPU and GPU.

Our prototype achieved more than 3x performance improvement for a simple 
logistic regression program using an NVIDIA K40 card.

This JIRA entry (SPARK-3785) includes a link to a design document. We are 
very glad to hear valuable feedback/suggestions/comments and to have great 
discussions to exploit GPUs in Spark.

Best Regards,
Kazuaki Ishizaki




Re:Support off-loading computations to a GPU

2016-01-05 Thread Kazuaki Ishizaki
Hi Allen,
Thank you for your interest.

To get started quickly, I prepared a new "Quick Start" page at
https://github.com/kiszk/spark-gpu/wiki/Quick-Start. You can install the
package with two commands and run a sample program with one.

By "off-loading" we mean exploiting the GPU to execute Spark tasks. For
this, a task has to be mapped onto GPU kernels (the current version requires
the programmer to write CUDA code; future versions will generate the GPU
code from a Spark program automatically). Executing GPU kernels requires
copying data between CPU and GPU. To reduce that copy overhead, our
prototype keeps data in the RDD as a binary representation using a column
format.

The current version has no command-line option for specifying the number of
CUDA cores used by a job. There are two ways to specify GPU resources:
1) specify which GPU cards are used by setting CUDA_VISIBLE_DEVICES in
conf/spark-env.sh (refer to
http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
and the sketch after this list)
2) specify the number of CUDA threads used to process a partition in the
program, as in
https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkGPULR.scala#L89
(sorry, there is no documentation for this yet).
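
For example, a minimal sketch for option 1) (the device indices are
illustrative; list whichever GPUs the Spark processes on that node should
see):

# conf/spark-env.sh: expose only GPU devices 0 and 1 to Spark on this node
export CUDA_VISIBLE_DEVICES=0,1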

We are glad to support requested features and look forward to receiving
pull requests.

Best Regards,
Kazuaki Ishizaki



From:   "Allen Zhang" 
To: Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: dev@spark.apache.org
Date:   2016/01/04 13:29
Subject:Re:Support off-loading computations to a GPU



Hi Kazuaki,

I am looking at http://kiszk.github.io/spark-gpu/ ; can you point me to the
kick-start scripts so that I can give it a go?

More specifically, what does *"off-loading"* mean? Does it aim to reduce the
copy overhead between CPU and GPU?
I am a GPU newbie; how can I specify how many GPU cores I want to use (like
--executor-cores)?





At 2016-01-04 11:52:01, "Kazuaki Ishizaki"  wrote:
Dear all,

We reopened the existing JIRA entry 
https://issues.apache.org/jira/browse/SPARK-3785 to support off-loading 
computations to a GPU by adding a description for our prototype. We are 
working to effectively and easily exploit GPUs on Spark at 
http://github.com/kiszk/spark-gpu. Please also visit our project page 
http://kiszk.github.io/spark-gpu/.

For now, we added a new format for a partition in an RDD, which is a 
column-based structure in an array format, in addition to the current 
Iterator[T] format with Seq[T]. This reduces data 
serialization/deserialization and copy overhead between CPU and GPU.

Our prototype achieved more than 3x performance improvement for a simple 
logistic regression program using an NVIDIA K40 card.

This JIRA entry (SPARK-3785) includes a link to a design document. We are 
very glad to hear valuable feedback/suggestions/comments and to have great 
discussions to exploit GPUs in Spark.

Best Regards,
Kazuaki Ishizaki


 




Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Meethu Mathew
+1
We use Python 2.7

Regards,

Meethu Mathew

On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:

> Does anybody here care about us dropping support for Python 2.6 in Spark
> 2.0?
>
> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
> parsing) when compared with Python 2.7. Some libraries that Spark depend on
> stopped supporting 2.6. We can still convince the library maintainers to
> support 2.6, but it will be extra work. I'm curious if anybody still uses
> Python 2.6 to run Spark.
>
> Thanks.
>
>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Sean Owen
+juliet for an additional opinion, but FWIW I think it's safe to say
that future CDH will have a more consistent Python story and that
story will support 2.7 rather than 2.6.

On Tue, Jan 5, 2016 at 7:17 AM, Reynold Xin  wrote:
> Does anybody here care about us dropping support for Python 2.6 in Spark
> 2.0?
>
> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
> parsing) when compared with Python 2.7. Some libraries that Spark depend on
> stopped supporting 2.6. We can still convince the library maintainers to
> support 2.6, but it will be extra work. I'm curious if anybody still uses
> Python 2.6 to run Spark.
>
> Thanks.
>
>




Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread yash datta
+1

On Tue, Jan 5, 2016 at 1:57 PM, Jian Feng Zhang 
wrote:

> +1
>
> We use Python 2.7+ and 3.4+ to call PySpark.
>
> 2016-01-05 15:58 GMT+08:00 Kushal Datta :
>
>> +1
>>
>> 
>> Dr. Kushal Datta
>> Senior Research Scientist
>> Big Data Research & Pathfinding
>> Intel Corporation, USA.
>>
>> On Mon, Jan 4, 2016 at 11:52 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1
>>>
>>> no problem for me to remove Python 2.6 in 2.0.
>>>
>>> Thanks
>>> Regards
>>> JB
>>>
>>>
>>> On 01/05/2016 08:17 AM, Reynold Xin wrote:
>>>
 Does anybody here care about us dropping support for Python 2.6 in Spark
 2.0?

 Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
 parsing) when compared with Python 2.7. Some libraries that Spark depend
 on stopped supporting 2.6. We can still convince the library maintainers
 to support 2.6, but it will be extra work. I'm curious if anybody still
 uses Python 2.6 to run Spark.

 Thanks.



>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>>
>>
>
>
> --
> Best,
> Jian Feng
>



-- 
When events unfold with calm and ease
When the winds that blow are merely breeze
Learn from nature, from birds and bees
Live your life in love, and let joy not cease.


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jian Feng Zhang
+1

We use Python 2.7+ and 3.4+ to call PySpark.

2016-01-05 15:58 GMT+08:00 Kushal Datta :

> +1
>
> 
> Dr. Kushal Datta
> Senior Research Scientist
> Big Data Research & Pathfinding
> Intel Corporation, USA.
>
> On Mon, Jan 4, 2016 at 11:52 PM, Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> no problem for me to remove Python 2.6 in 2.0.
>>
>> Thanks
>> Regards
>> JB
>>
>>
>> On 01/05/2016 08:17 AM, Reynold Xin wrote:
>>
>>> Does anybody here care about us dropping support for Python 2.6 in Spark
>>> 2.0?
>>>
>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>> parsing) when compared with Python 2.7. Some libraries that Spark depend
>>> on stopped supporting 2.6. We can still convince the library maintainers
>>> to support 2.6, but it will be extra work. I'm curious if anybody still
>>> uses Python 2.6 to run Spark.
>>>
>>> Thanks.
>>>
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>
>>
>


-- 
Best,
Jian Feng