Re: Block Transfer Service encryption support

2015-11-10 Thread Tim Preece
So it appears the tests fail because of an SSLHandshakeException. 

Tracing the failure I see:
Using SSLEngineImpl.

Is initial handshake: true
Ignoring unsupported cipher suite: SSL_RSA_WITH_DES_CBC_SHA for TLSv1.2
No available cipher suite for TLSv1.2
shuffle-client-4, fatal error: 40: Couldn't kickstart handshaking
javax.net.ssl.SSLHandshakeException: No appropriate protocol, may be no appropriate cipher suite specified or protocols are deactivated
shuffle-client-4, SEND TLSv1.2 ALERT: fatal, description = handshake_failure
shuffle-client-4, WRITE: TLSv1.2 Alert, length = 2
Using SSLEngineImpl.
shuffle-client-4, called closeOutbound()
shuffle-client-4, closeOutboundInternal()
[Raw write]: length = 7
: 15 03 03 00 02 02 28
...

Is initial handshake: true
Ignoring unsupported cipher suite: SSL_RSA_WITH_DES_CBC_SHA for TLSv1.2
No available cipher suite for TLSv1.2
shuffle-server-5, fatal error: 80: problem unwrapping net record
javax.net.ssl.SSLHandshakeException: No appropriate protocol, may be no appropriate cipher suite specified or protocols are deactivated
shuffle-server-5, SEND TLSv1.2 ALERT: fatal, description = internal_error
shuffle-server-5, WRITE: TLSv1.2 Alert, length = 2
shuffle-server-5, called closeOutbound()
shuffle-server-5, closeOutboundInternal()
shuffle-server-5, called closeInbound()
shuffle-server-5, closeInboundInternal()
shuffle-client-4, called closeOutbound()
shuffle-client-4, closeOutboundInternal()
shuffle-client-4, called closeInbound()
shuffle-client-4, closeInboundInternal()
shuffle-server-5, called closeOutbound()
shuffle-server-5, closeOutboundInternal()
shuffle-server-5, called closeInbound()
shuffle-server-5, closeInboundInternal()

So this fails because of the use of DES. From
https://www-01.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.security.component.71.doc/security-component/jsse2Docs/ciphersuites.html
I see: "RFC 5246 TLS 1.2 forbids the use of these suites. These can be used
in the SSLv3/TLS1.0/TLS1.1 protocols, but cannot be used in TLS 1.2 and
later."

Note: I'm using the IBM Java SDK.
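
For reference, here is a minimal JSSE sketch of the kind of filtering that keeps the
DES-based suite out of the enabled set before the handshake starts. This is only an
illustration under my own assumptions (the filter and the client-mode setup are mine);
it is not what the Spark tests actually do:

    import javax.net.ssl.{SSLContext, SSLEngine}

    val ctx = SSLContext.getDefault
    val engine: SSLEngine = ctx.createSSLEngine()
    engine.setUseClientMode(true)

    // Restrict the handshake to TLSv1.2 and enable every supported suite
    // except DES-based ones such as SSL_RSA_WITH_DES_CBC_SHA, which TLS 1.2 forbids.
    val allowed = engine.getSupportedCipherSuites.filterNot(_.contains("_DES_"))
    engine.setEnabledProtocols(Array("TLSv1.2"))
    engine.setEnabledCipherSuites(allowed)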



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Block-Transfer-Service-encryption-support-tp10934p15116.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Reynold Xin
Hi All,

Spark 1.5.2 is a maintenance release containing stability fixes. This
release is based on the branch-1.5 maintenance branch of Spark. We
*strongly recommend* that all 1.5.x users upgrade to this release.

The full list of bug fixes is here: http://s.apache.org/spark-1.5.2

http://spark.apache.org/releases/spark-release-1-5-2.html


Re: Support for views/ virtual tables in SparkSQL

2015-11-10 Thread Michael Armbrust
We do support hive style views, though all tables have to be visible to
Hive.  You can also turn on the experimental native view support (but it
does not canonicalize the query).

set spark.sql.nativeView = true
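
For example, a rough sketch against a HiveContext (the table, column, and view names
are made up, and the temp-table alternative mentioned below is also shown):

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)  // sc: an existing SparkContext

    // Experimental native view support; note it does not canonicalize the query.
    sqlContext.setConf("spark.sql.nativeView", "true")
    sqlContext.sql(
      "CREATE VIEW lineitem_view AS SELECT l_orderkey, l_quantity FROM lineitem")

    // Alternative suggested below: register a DataFrame as a temporary table
    // instead of relying on views.
    sqlContext.table("lineitem").registerTempTable("lineitem_tmp")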


On Mon, Nov 9, 2015 at 10:24 PM, Zhan Zhang  wrote:

> I think you can rewrite those TPC-H queries not using view, for example
> registerTempTable
>
> Thanks.
>
> Zhan Zhang
>
> On Nov 9, 2015, at 9:34 PM, Sudhir Menon  wrote:
>
> > Team:
> >
> > Do we plan to add support for views/ virtual tables in SparkSQL anytime
> soon?
> > Trying to run the TPC-H workload and failing on queries that assume
> support for views in the underlying database
> >
> > Thanks in advance
> >
> > Suds
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Support for views/ virtual tables in SparkSQL

2015-11-10 Thread Sudhir Menon
Thanks Zhan, thanks Michael. I was already going down the temp table path,
will check out the experimental native view support

Suds

On Tue, Nov 10, 2015 at 11:22 AM, Michael Armbrust 
wrote:

> We do support hive style views, though all tables have to be visible to
> Hive.  You can also turn on the experimental native view support (but it
> does not canonicalize the query).
>
> set spark.sql.nativeView = true
>
>
> On Mon, Nov 9, 2015 at 10:24 PM, Zhan Zhang 
> wrote:
>
>> I think you can rewrite those TPC-H queries not using view, for example
>> registerTempTable
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>> On Nov 9, 2015, at 9:34 PM, Sudhir Menon  wrote:
>>
>> > Team:
>> >
>> > Do we plan to add support for views/ virtual tables in SparkSQL anytime
>> soon?
>> > Trying to run the TPC-H workload and failing on queries that assume
>> support for views in the underlying database
>> >
>> > Thanks in advance
>> >
>> > Suds
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

>
> > 3. Assembly-free distribution of Spark: don’t require building an
> enormous assembly jar in order to run Spark.
>
> Could you elaborate a bit on this? I'm not sure what an assembly-free
> distribution means.
>
>
Right now we ship Spark using a single assembly jar, which causes a few
different problems:

- total number of classes are limited on some configurations

- dependency swapping is harder


The proposal is to just avoid a single fat jar.
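
One concrete example of the dependency-swapping pain (a hedged build.sbt sketch, not
anything prescribed by Spark: the versions and the shaded package name are placeholders,
and the sbt-assembly plugin is assumed):

    // build.sbt (requires the sbt-assembly plugin)
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
      "com.google.guava" % "guava" % "18.0"
    )

    // Shade the application's Guava so it cannot clash with the copy bundled
    // inside the Spark assembly on the cluster.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.google.common.**" -> "shadedguava.@1").inAll
    )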


SPARK-11638: Run Spark on Mesos, in Docker with Bridge networking

2015-11-10 Thread Rad Gruchalski
Dear Team,  

We, Virdata, would like to present the result of the last few months of our
work with Mesos and Spark. Our requirement was to run Spark on Mesos in Docker
in a multi-tenant setup.
This required adapting Spark to run in Docker with Bridge networking.

The result (and patches) of our work is presented in the following JIRA ticket: 
https://issues.apache.org/jira/browse/SPARK-11638. The PR is: 
https://github.com/apache/spark/pull/9608.

The Summary

Provides spark.driver.advertisedPort, spark.fileserver.advertisedPort, 
spark.broadcast.advertisedPort and spark.replClassServer.advertisedPort 
settings to enable running Spark in Mesos on Docker with Bridge networking. 
Provides patches for Akka Remote to enable Spark driver advertisement using an
alternative host and port.
With these settings, it is possible to run the Spark Master in a Docker container
and have the executors running on Mesos talk back correctly to that Master.
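
As a rough sketch, the new settings would be supplied like any other Spark
configuration (the port numbers below are placeholders, and the advertisedPort keys
come from the patch above rather than from upstream Spark):

    import org.apache.spark.SparkConf

    // Ports published on the Docker bridge, as reachable from the Mesos agents.
    val conf = new SparkConf()
      .set("spark.driver.advertisedPort", "31500")
      .set("spark.fileserver.advertisedPort", "31501")
      .set("spark.broadcast.advertisedPort", "31502")
      .set("spark.replClassServer.advertisedPort", "31503")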

The problem is discussed on the Mesos mailing list here: 
https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E

We would like to contribute this to Apache Spark.

Happy to provide any further information.

Kind regards,

Radek Gruchalski

ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

Confidentiality:
This communication is intended for the above-named person and may be 
confidential and/or legally privileged.
If it has come to you in error you must take no action based on it, nor must 
you copy or show it to anyone; please delete/destroy and inform the sender 
immediately.




Re: A proposal for Spark 2.0

2015-11-10 Thread Nicholas Chammas
> For this reason, I would *not* propose doing major releases to break
substantial API's or perform large re-architecting that prevent users from
upgrading. Spark has always had a culture of evolving architecture
incrementally and making changes - and I don't think we want to change this
model.

+1 for this. The Python community went through a lot of turmoil over the
Python 2 -> Python 3 transition because the upgrade process was too painful
for too long. The Spark community will benefit greatly from our explicitly
looking to avoid a similar situation.

> 3. Assembly-free distribution of Spark: don’t require building an
enormous assembly jar in order to run Spark.

Could you elaborate a bit on this? I'm not sure what an assembly-free
distribution means.

Nick

On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin  wrote:

> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature request in this thread. Not
> that we shouldn’t be adding features, but we can always add features in
> 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to
> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
> major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing
> in Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s dependency
> on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low level developer API’s. In
> particular, we have some DeveloperApi (mostly various listener-related
> classes) added over the years. Some packages include only one or two public
> classes but a lot of private classes. A better structure is to have public
> classes isolated to a few public packages, and these public packages should
> have minimal private classes for low level developer APIs.
>
> 5. Consolidate task metric and accumulator API. Although having some
> subtle differences, these two are very similar but have completely
> different code path.
>
> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming also.
>
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but it has been end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>
>


Re: A proposal for Spark 2.0

2015-11-10 Thread Shivaram Venkataraman
+1

On a related note I think making it lightweight will ensure that we
stay on the current release schedule and don't unnecessarily delay 2.0
to wait for new features / big architectural changes.

In terms of fixes to 1.x, I think our current policy of back-porting
fixes to older releases would still apply. I don't think developing
new features on both 1.x and 2.x makes a lot of sense as we would like
users to switch to 2.x.

Shivaram

On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis  wrote:
> +1 on a lightweight 2.0
>
> What is the thinking around the 1.x line after Spark 2.0 is released? If not
> terminated, how will we determine what goes into each major version line?
> Will 1.x only be for stability fixes?
>
> Thanks,
> Kostas
>
> On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell  wrote:
>>
>> I also feel the same as Reynold. I agree we should minimize API breaks and
>> focus on fixing things around the edge that were mistakes (e.g. exposing
>> Guava and Akka) rather than any overhaul that could fragment the community.
>> Ideally a major release is a lightweight process we can do every couple of
>> years, with minimal impact for users.
>>
>> - Patrick
>>
>> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>>  wrote:
>>>
>>> > For this reason, I would *not* propose doing major releases to break
>>> > substantial API's or perform large re-architecting that prevent users from
>>> > upgrading. Spark has always had a culture of evolving architecture
>>> > incrementally and making changes - and I don't think we want to change 
>>> > this
>>> > model.
>>>
>>> +1 for this. The Python community went through a lot of turmoil over the
>>> Python 2 -> Python 3 transition because the upgrade process was too painful
>>> for too long. The Spark community will benefit greatly from our explicitly
>>> looking to avoid a similar situation.
>>>
>>> > 3. Assembly-free distribution of Spark: don’t require building an
>>> > enormous assembly jar in order to run Spark.
>>>
>>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>>> distribution means.
>>>
>>> Nick
>>>
>>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin  wrote:

 I’m starting a new thread since the other one got intermixed with
 feature requests. Please refrain from making feature request in this 
 thread.
 Not that we shouldn’t be adding features, but we can always add features in
 1.7, 2.1, 2.2, ...

 First - I want to propose a premise for how to think about Spark 2.0 and
 major releases in Spark, based on discussion with several members of the
 community: a major release should be low overhead and minimally disruptive
 to the Spark community. A major release should not be very different from a
 minor release and should not be gated based on new features. The main
 purpose of a major release is an opportunity to fix things that are broken
 in the current API and remove certain deprecated APIs (examples follow).

 For this reason, I would *not* propose doing major releases to break
 substantial API's or perform large re-architecting that prevent users from
 upgrading. Spark has always had a culture of evolving architecture
 incrementally and making changes - and I don't think we want to change this
 model. In fact, we’ve released many architectural changes on the 1.X line.

 If the community likes the above model, then to me it seems reasonable
 to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or 
 immediately
 after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
 major releases every 2 years seems doable within the above model.

 Under this model, here is a list of example things I would propose doing
 in Spark 2.0, separated into APIs and Operation/Deployment:


 APIs

 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
 Spark 1.x.

 2. Remove Akka from Spark’s API dependency (in streaming), so user
 applications can use Akka (SPARK-5293). We have gotten a lot of complaints
 about user applications being unable to use Akka due to Spark’s dependency
 on Akka.

 3. Remove Guava from Spark’s public API (JavaRDD Optional).

 4. Better class package structure for low level developer API’s. In
 particular, we have some DeveloperApi (mostly various listener-related
 classes) added over the years. Some packages include only one or two public
 classes but a lot of private classes. A better structure is to have public
 classes isolated to a few public packages, and these public packages should
 have minimal private classes for low level developer APIs.

 5. Consolidate task metric and accumulator API. Although having some
 subtle differences, these two are very similar but have completely 
 different

Re: A proposal for Spark 2.0

2015-11-10 Thread Josh Rosen
There's a proposal / discussion of the assembly-less distributions at
https://github.com/vanzin/spark/pull/2/files /
https://issues.apache.org/jira/browse/SPARK-11157.

On Tue, Nov 10, 2015 at 3:53 PM, Reynold Xin  wrote:

>
> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>>
>> > 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>> distribution means.
>>
>>
> Right now we ship Spark using a single assembly jar, which causes a few
> different problems:
>
> - total number of classes are limited on some configurations
>
> - dependency swapping is harder
>
>
> The proposal is to just avoid a single fat jar.
>
>
>


Re: A proposal for Spark 2.0

2015-11-10 Thread Kostas Sakellis
+1 on a lightweight 2.0

What is the thinking around the 1.x line after Spark 2.0 is released? If
not terminated, how will we determine what goes into each major version
line? Will 1.x only be for stability fixes?

Thanks,
Kostas

On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell  wrote:

> I also feel the same as Reynold. I agree we should minimize API breaks and
> focus on fixing things around the edge that were mistakes (e.g. exposing
> Guava and Akka) rather than any overhaul that could fragment the community.
> Ideally a major release is a lightweight process we can do every couple of
> years, with minimal impact for users.
>
> - Patrick
>
> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> > For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model.
>>
>> +1 for this. The Python community went through a lot of turmoil over the
>> Python 2 -> Python 3 transition because the upgrade process was too painful
>> for too long. The Spark community will benefit greatly from our explicitly
>> looking to avoid a similar situation.
>>
>> > 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>> distribution means.
>>
>> Nick
>>
>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin  wrote:
>>
>>> I’m starting a new thread since the other one got intermixed with
>>> feature requests. Please refrain from making feature request in this
>>> thread. Not that we shouldn’t be adding features, but we can always add
>>> features in 1.7, 2.1, 2.2, ...
>>>
>>> First - I want to propose a premise for how to think about Spark 2.0 and
>>> major releases in Spark, based on discussion with several members of the
>>> community: a major release should be low overhead and minimally disruptive
>>> to the Spark community. A major release should not be very different from a
>>> minor release and should not be gated based on new features. The main
>>> purpose of a major release is an opportunity to fix things that are broken
>>> in the current API and remove certain deprecated APIs (examples follow).
>>>
>>> For this reason, I would *not* propose doing major releases to break
>>> substantial API's or perform large re-architecting that prevent users from
>>> upgrading. Spark has always had a culture of evolving architecture
>>> incrementally and making changes - and I don't think we want to change this
>>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>>
>>> If the community likes the above model, then to me it seems reasonable
>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>> immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>>> cadence of major releases every 2 years seems doable within the above model.
>>>
>>> Under this model, here is a list of example things I would propose doing
>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>>
>>>
>>> APIs
>>>
>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>>> Spark 1.x.
>>>
>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>>> about user applications being unable to use Akka due to Spark’s dependency
>>> on Akka.
>>>
>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>>
>>> 4. Better class package structure for low level developer API’s. In
>>> particular, we have some DeveloperApi (mostly various listener-related
>>> classes) added over the years. Some packages include only one or two public
>>> classes but a lot of private classes. A better structure is to have public
>>> classes isolated to a few public packages, and these public packages should
>>> have minimal private classes for low level developer APIs.
>>>
>>> 5. Consolidate task metric and accumulator API. Although having some
>>> subtle differences, these two are very similar but have completely
>>> different code path.
>>>
>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>>> moving them to other package(s). They are already used beyond SQL, e.g. in
>>> ML pipelines, and will be used by streaming also.
>>>
>>>
>>> Operation/Deployment
>>>
>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>> but it has been end-of-life.
>>>
>>> 2. Remove Hadoop 1 support.
>>>
>>> 3. Assembly-free distribution of Spark: don’t require building an
>>> enormous assembly jar in order to run Spark.
>>>
>>>
>


Re: A proposal for Spark 2.0

2015-11-10 Thread Mridul Muralidharan
It would also be good to fix API breakages introduced as part of 1.0
(where functionality is now missing), to overhaul and remove all
deprecated configs/features/combinations, and to make the public API
changes that have been deferred across the minor releases.

Regards,
Mridul

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin  wrote:
> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature request in this thread. Not
> that we shouldn’t be adding features, but we can always add features in 1.7,
> 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to do
> Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after
> Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major
> releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing in
> Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark
> 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s dependency
> on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low level developer API’s. In
> particular, we have some DeveloperApi (mostly various listener-related
> classes) added over the years. Some packages include only one or two public
> classes but a lot of private classes. A better structure is to have public
> classes isolated to a few public packages, and these public packages should
> have minimal private classes for low level developer APIs.
>
> 5. Consolidate task metric and accumulator API. Although having some subtle
> differences, these two are very similar but have completely different code
> path.
>
> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming also.
>
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
> it has been end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
Echoing Shivaram here. I don't think it makes a lot of sense to add more
features to the 1.x line. We should still do critical bug fixes though.


On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> +1
>
> On a related note I think making it lightweight will ensure that we
> stay on the current release schedule and don't unnecessarily delay 2.0
> to wait for new features / big architectural changes.
>
> In terms of fixes to 1.x, I think our current policy of back-porting
> fixes to older releases would still apply. I don't think developing
> new features on both 1.x and 2.x makes a lot of sense as we would like
> users to switch to 2.x.
>
> Shivaram
>
> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis 
> wrote:
> > +1 on a lightweight 2.0
> >
> > What is the thinking around the 1.x line after Spark 2.0 is released? If
> not
> > terminated, how will we determine what goes into each major version line?
> > Will 1.x only be for stability fixes?
> >
> > Thanks,
> > Kostas
> >
> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell 
> wrote:
> >>
> >> I also feel the same as Reynold. I agree we should minimize API breaks
> and
> >> focus on fixing things around the edge that were mistakes (e.g. exposing
> >> Guava and Akka) rather than any overhaul that could fragment the
> community.
> >> Ideally a major release is a lightweight process we can do every couple
> of
> >> years, with minimal impact for users.
> >>
> >> - Patrick
> >>
> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
> >>  wrote:
> >>>
> >>> > For this reason, I would *not* propose doing major releases to break
> >>> > substantial API's or perform large re-architecting that prevent
> users from
> >>> > upgrading. Spark has always had a culture of evolving architecture
> >>> > incrementally and making changes - and I don't think we want to
> change this
> >>> > model.
> >>>
> >>> +1 for this. The Python community went through a lot of turmoil over
> the
> >>> Python 2 -> Python 3 transition because the upgrade process was too
> painful
> >>> for too long. The Spark community will benefit greatly from our
> explicitly
> >>> looking to avoid a similar situation.
> >>>
> >>> > 3. Assembly-free distribution of Spark: don’t require building an
> >>> > enormous assembly jar in order to run Spark.
> >>>
> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
> >>> distribution means.
> >>>
> >>> Nick
> >>>
> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin 
> wrote:
> 
>  I’m starting a new thread since the other one got intermixed with
>  feature requests. Please refrain from making feature request in this
> thread.
>  Not that we shouldn’t be adding features, but we can always add
> features in
>  1.7, 2.1, 2.2, ...
> 
>  First - I want to propose a premise for how to think about Spark 2.0
> and
>  major releases in Spark, based on discussion with several members of
> the
>  community: a major release should be low overhead and minimally
> disruptive
>  to the Spark community. A major release should not be very different
> from a
>  minor release and should not be gated based on new features. The main
>  purpose of a major release is an opportunity to fix things that are
> broken
>  in the current API and remove certain deprecated APIs (examples
> follow).
> 
>  For this reason, I would *not* propose doing major releases to break
>  substantial API's or perform large re-architecting that prevent users
> from
>  upgrading. Spark has always had a culture of evolving architecture
>  incrementally and making changes - and I don't think we want to
> change this
>  model. In fact, we’ve released many architectural changes on the 1.X
> line.
> 
>  If the community likes the above model, then to me it seems reasonable
>  to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
> immediately
>  after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
> cadence of
>  major releases every 2 years seems doable within the above model.
> 
>  Under this model, here is a list of example things I would propose
> doing
>  in Spark 2.0, separated into APIs and Operation/Deployment:
> 
> 
>  APIs
> 
>  1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>  Spark 1.x.
> 
>  2. Remove Akka from Spark’s API dependency (in streaming), so user
>  applications can use Akka (SPARK-5293). We have gotten a lot of
> complaints
>  about user applications being unable to use Akka due to Spark’s
> dependency
>  on Akka.
> 
>  3. Remove Guava from Spark’s public API (JavaRDD Optional).
> 
>  4. Better class package structure for low level developer API’s. In
>  particular, we have some DeveloperApi (mostly various listener-related
> 

Re: PMML version in MLLib

2015-11-10 Thread selvinsource
Thank you Fazlan, looks good!



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PMML-version-in-MLLib-tp14944p15112.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: A proposal for Spark 2.0

2015-11-10 Thread Sandy Ryza
Another +1 to Reynold's proposal.

Maybe this is obvious, but I'd like to advocate against a blanket removal
of deprecated / developer APIs.  Many APIs can likely be removed without
material impact (e.g. the SparkContext constructor that takes preferred
node location data), while others likely see heavier usage (e.g. I wouldn't
be surprised if mapPartitionsWithContext was baked into a number of apps)
and merit a little extra consideration.

Maybe also obvious, but I think a migration guide with API equivalents and
the like would be incredibly useful in easing the transition.
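
For instance, one migration-guide entry could look like the sketch below (assuming an
existing SparkContext sc and a placeholder input path; TaskContext.get() is the
replacement I would expect, but the exact equivalent should be confirmed):

    import org.apache.spark.TaskContext

    val lines = sc.textFile("hdfs:///tmp/input.txt")  // placeholder path

    // Before (deprecated):
    //   lines.mapPartitionsWithContext((ctx, iter) =>
    //     iter.map(line => (ctx.partitionId, line)))

    // After: plain mapPartitions plus TaskContext.get()
    val tagged = lines.mapPartitions { iter =>
      val ctx = TaskContext.get()
      iter.map(line => (ctx.partitionId, line))
    }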

-Sandy

On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin  wrote:

> Echoing Shivaram here. I don't think it makes a lot of sense to add more
> features to the 1.x line. We should still do critical bug fixes though.
>
>
> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> +1
>>
>> On a related note I think making it lightweight will ensure that we
>> stay on the current release schedule and don't unnecessarily delay 2.0
>> to wait for new features / big architectural changes.
>>
>> In terms of fixes to 1.x, I think our current policy of back-porting
>> fixes to older releases would still apply. I don't think developing
>> new features on both 1.x and 2.x makes a lot of sense as we would like
>> users to switch to 2.x.
>>
>> Shivaram
>>
>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis 
>> wrote:
>> > +1 on a lightweight 2.0
>> >
>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>> If not
>> > terminated, how will we determine what goes into each major version
>> line?
>> > Will 1.x only be for stability fixes?
>> >
>> > Thanks,
>> > Kostas
>> >
>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell 
>> wrote:
>> >>
>> >> I also feel the same as Reynold. I agree we should minimize API breaks
>> and
>> >> focus on fixing things around the edge that were mistakes (e.g.
>> exposing
>> >> Guava and Akka) rather than any overhaul that could fragment the
>> community.
>> >> Ideally a major release is a lightweight process we can do every
>> couple of
>> >> years, with minimal impact for users.
>> >>
>> >> - Patrick
>> >>
>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>> >>  wrote:
>> >>>
>> >>> > For this reason, I would *not* propose doing major releases to break
>> >>> > substantial API's or perform large re-architecting that prevent
>> users from
>> >>> > upgrading. Spark has always had a culture of evolving architecture
>> >>> > incrementally and making changes - and I don't think we want to
>> change this
>> >>> > model.
>> >>>
>> >>> +1 for this. The Python community went through a lot of turmoil over
>> the
>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>> painful
>> >>> for too long. The Spark community will benefit greatly from our
>> explicitly
>> >>> looking to avoid a similar situation.
>> >>>
>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>> >>> > enormous assembly jar in order to run Spark.
>> >>>
>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>> >>> distribution means.
>> >>>
>> >>> Nick
>> >>>
>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin 
>> wrote:
>> 
>>  I’m starting a new thread since the other one got intermixed with
>>  feature requests. Please refrain from making feature request in this
>> thread.
>>  Not that we shouldn’t be adding features, but we can always add
>> features in
>>  1.7, 2.1, 2.2, ...
>> 
>>  First - I want to propose a premise for how to think about Spark 2.0
>> and
>>  major releases in Spark, based on discussion with several members of
>> the
>>  community: a major release should be low overhead and minimally
>> disruptive
>>  to the Spark community. A major release should not be very different
>> from a
>>  minor release and should not be gated based on new features. The main
>>  purpose of a major release is an opportunity to fix things that are
>> broken
>>  in the current API and remove certain deprecated APIs (examples
>> follow).
>> 
>>  For this reason, I would *not* propose doing major releases to break
>>  substantial API's or perform large re-architecting that prevent
>> users from
>>  upgrading. Spark has always had a culture of evolving architecture
>>  incrementally and making changes - and I don't think we want to
>> change this
>>  model. In fact, we’ve released many architectural changes on the 1.X
>> line.
>> 
>>  If the community likes the above model, then to me it seems
>> reasonable
>>  to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>> immediately
>>  after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>> cadence of
>>  major releases every 2 years seems doable within the above 

Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
Really, Sandy?  "Extra consideration" even for already-deprecated API?  If
we're not going to remove these with a major version change, then just when
will we remove them?

On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza  wrote:

> Another +1 to Reynold's proposal.
>
> Maybe this is obvious, but I'd like to advocate against a blanket removal
> of deprecated / developer APIs.  Many APIs can likely be removed without
> material impact (e.g. the SparkContext constructor that takes preferred
> node location data), while others likely see heavier usage (e.g. I wouldn't
> be surprised if mapPartitionsWithContext was baked into a number of apps)
> and merit a little extra consideration.
>
> Maybe also obvious, but I think a migration guide with API equivalents and
> the like would be incredibly useful in easing the transition.
>
> -Sandy
>
> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin  wrote:
>
>> Echoing Shivaram here. I don't think it makes a lot of sense to add more
>> features to the 1.x line. We should still do critical bug fixes though.
>>
>>
>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1
>>>
>>> On a related note I think making it lightweight will ensure that we
>>> stay on the current release schedule and don't unnecessarily delay 2.0
>>> to wait for new features / big architectural changes.
>>>
>>> In terms of fixes to 1.x, I think our current policy of back-porting
>>> fixes to older releases would still apply. I don't think developing
>>> new features on both 1.x and 2.x makes a lot of sense as we would like
>>> users to switch to 2.x.
>>>
>>> Shivaram
>>>
>>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis 
>>> wrote:
>>> > +1 on a lightweight 2.0
>>> >
>>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>>> If not
>>> > terminated, how will we determine what goes into each major version
>>> line?
>>> > Will 1.x only be for stability fixes?
>>> >
>>> > Thanks,
>>> > Kostas
>>> >
>>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell 
>>> wrote:
>>> >>
>>> >> I also feel the same as Reynold. I agree we should minimize API
>>> breaks and
>>> >> focus on fixing things around the edge that were mistakes (e.g.
>>> exposing
>>> >> Guava and Akka) rather than any overhaul that could fragment the
>>> community.
>>> >> Ideally a major release is a lightweight process we can do every
>>> couple of
>>> >> years, with minimal impact for users.
>>> >>
>>> >> - Patrick
>>> >>
>>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>>> >>  wrote:
>>> >>>
>>> >>> > For this reason, I would *not* propose doing major releases to
>>> break
>>> >>> > substantial API's or perform large re-architecting that prevent
>>> users from
>>> >>> > upgrading. Spark has always had a culture of evolving architecture
>>> >>> > incrementally and making changes - and I don't think we want to
>>> change this
>>> >>> > model.
>>> >>>
>>> >>> +1 for this. The Python community went through a lot of turmoil over
>>> the
>>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>>> painful
>>> >>> for too long. The Spark community will benefit greatly from our
>>> explicitly
>>> >>> looking to avoid a similar situation.
>>> >>>
>>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>>> >>> > enormous assembly jar in order to run Spark.
>>> >>>
>>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>>> >>> distribution means.
>>> >>>
>>> >>> Nick
>>> >>>
>>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin 
>>> wrote:
>>> 
>>>  I’m starting a new thread since the other one got intermixed with
>>>  feature requests. Please refrain from making feature request in
>>> this thread.
>>>  Not that we shouldn’t be adding features, but we can always add
>>> features in
>>>  1.7, 2.1, 2.2, ...
>>> 
>>>  First - I want to propose a premise for how to think about Spark
>>> 2.0 and
>>>  major releases in Spark, based on discussion with several members
>>> of the
>>>  community: a major release should be low overhead and minimally
>>> disruptive
>>>  to the Spark community. A major release should not be very
>>> different from a
>>>  minor release and should not be gated based on new features. The
>>> main
>>>  purpose of a major release is an opportunity to fix things that are
>>> broken
>>>  in the current API and remove certain deprecated APIs (examples
>>> follow).
>>> 
>>>  For this reason, I would *not* propose doing major releases to break
>>>  substantial API's or perform large re-architecting that prevent
>>> users from
>>>  upgrading. Spark has always had a culture of evolving architecture
>>>  incrementally and making changes - and I don't think we want to
>>> change this
>>>  model. In fact, 

Re: A proposal for Spark 2.0

2015-11-10 Thread Sudhir Menon
Agree. If it is deprecated, get rid of it in 2.0.
If the deprecation was a mistake, let's fix that.

Suds
Sent from my iPhone

On Nov 10, 2015, at 5:04 PM, Reynold Xin  wrote:

Maybe a better idea is to un-deprecate an API if it is too important to be
removed.

I don't think we can drop Java 7 support. It's way too soon.



On Tue, Nov 10, 2015 at 4:59 PM, Mark Hamstra 
wrote:

> Really, Sandy?  "Extra consideration" even for already-deprecated API?  If
> we're not going to remove these with a major version change, then just when
> will we remove them?
>
> On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza 
> wrote:
>
>> Another +1 to Reynold's proposal.
>>
>> Maybe this is obvious, but I'd like to advocate against a blanket removal
>> of deprecated / developer APIs.  Many APIs can likely be removed without
>> material impact (e.g. the SparkContext constructor that takes preferred
>> node location data), while others likely see heavier usage (e.g. I wouldn't
>> be surprised if mapPartitionsWithContext was baked into a number of apps)
>> and merit a little extra consideration.
>>
>> Maybe also obvious, but I think a migration guide with API equivalents and
>> the like would be incredibly useful in easing the transition.
>>
>> -Sandy
>>
>> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin  wrote:
>>
>>> Echoing Shivaram here. I don't think it makes a lot of sense to add more
>>> features to the 1.x line. We should still do critical bug fixes though.
>>>
>>>
>>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
 +1

 On a related note I think making it lightweight will ensure that we
 stay on the current release schedule and don't unnecessarily delay 2.0
 to wait for new features / big architectural changes.

 In terms of fixes to 1.x, I think our current policy of back-porting
 fixes to older releases would still apply. I don't think developing
 new features on both 1.x and 2.x makes a lot of sense as we would like
 users to switch to 2.x.

 Shivaram

 On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis 
 wrote:
 > +1 on a lightweight 2.0
 >
 > What is the thinking around the 1.x line after Spark 2.0 is released?
 If not
 > terminated, how will we determine what goes into each major version
 line?
 > Will 1.x only be for stability fixes?
 >
 > Thanks,
 > Kostas
 >
 > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell 
 wrote:
 >>
 >> I also feel the same as Reynold. I agree we should minimize API
 breaks and
 >> focus on fixing things around the edge that were mistakes (e.g.
 exposing
 >> Guava and Akka) rather than any overhaul that could fragment the
 community.
 >> Ideally a major release is a lightweight process we can do every
 couple of
 >> years, with minimal impact for users.
 >>
 >> - Patrick
 >>
 >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
 >>  wrote:
 >>>
 >>> > For this reason, I would *not* propose doing major releases to
 break
 >>> > substantial API's or perform large re-architecting that prevent
 users from
 >>> > upgrading. Spark has always had a culture of evolving architecture
 >>> > incrementally and making changes - and I don't think we want to
 change this
 >>> > model.
 >>>
 >>> +1 for this. The Python community went through a lot of turmoil
 over the
 >>> Python 2 -> Python 3 transition because the upgrade process was too
 painful
 >>> for too long. The Spark community will benefit greatly from our
 explicitly
 >>> looking to avoid a similar situation.
 >>>
 >>> > 3. Assembly-free distribution of Spark: don’t require building an
 >>> > enormous assembly jar in order to run Spark.
 >>>
 >>> Could you elaborate a bit on this? I'm not sure what an
 assembly-free
 >>> distribution means.
 >>>
 >>> Nick
 >>>
 >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin 
 wrote:
 
  I’m starting a new thread since the other one got intermixed with
  feature requests. Please refrain from making feature request in
 this thread.
  Not that we shouldn’t be adding features, but we can always add
 features in
  1.7, 2.1, 2.2, ...
 
  First - I want to propose a premise for how to think about Spark
 2.0 and
  major releases in Spark, based on discussion with several members
 of the
  community: a major release should be low overhead and minimally
 disruptive
  to the Spark community. A major release should not be very
 different from a
  minor release and should not be gated based on new 

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
Maybe a better idea is to un-deprecate an API if it is too important to be
removed.

I don't think we can drop Java 7 support. It's way too soon.



On Tue, Nov 10, 2015 at 4:59 PM, Mark Hamstra 
wrote:

> Really, Sandy?  "Extra consideration" even for already-deprecated API?  If
> we're not going to remove these with a major version change, then just when
> will we remove them?
>
> On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza 
> wrote:
>
>> Another +1 to Reynold's proposal.
>>
>> Maybe this is obvious, but I'd like to advocate against a blanket removal
>> of deprecated / developer APIs.  Many APIs can likely be removed without
>> material impact (e.g. the SparkContext constructor that takes preferred
>> node location data), while others likely see heavier usage (e.g. I wouldn't
>> be surprised if mapPartitionsWithContext was baked into a number of apps)
>> and merit a little extra consideration.
>>
>> Maybe also obvious, but I think a migration guide with API equivalents and
>> the like would be incredibly useful in easing the transition.
>>
>> -Sandy
>>
>> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin  wrote:
>>
>>> Echoing Shivaram here. I don't think it makes a lot of sense to add more
>>> features to the 1.x line. We should still do critical bug fixes though.
>>>
>>>
>>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
 +1

 On a related note I think making it lightweight will ensure that we
 stay on the current release schedule and don't unnecessarily delay 2.0
 to wait for new features / big architectural changes.

 In terms of fixes to 1.x, I think our current policy of back-porting
 fixes to older releases would still apply. I don't think developing
 new features on both 1.x and 2.x makes a lot of sense as we would like
 users to switch to 2.x.

 Shivaram

 On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis 
 wrote:
 > +1 on a lightweight 2.0
 >
 > What is the thinking around the 1.x line after Spark 2.0 is released?
 If not
 > terminated, how will we determine what goes into each major version
 line?
 > Will 1.x only be for stability fixes?
 >
 > Thanks,
 > Kostas
 >
 > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell 
 wrote:
 >>
 >> I also feel the same as Reynold. I agree we should minimize API
 breaks and
 >> focus on fixing things around the edge that were mistakes (e.g.
 exposing
 >> Guava and Akka) rather than any overhaul that could fragment the
 community.
 >> Ideally a major release is a lightweight process we can do every
 couple of
 >> years, with minimal impact for users.
 >>
 >> - Patrick
 >>
 >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
 >>  wrote:
 >>>
 >>> > For this reason, I would *not* propose doing major releases to
 break
 >>> > substantial API's or perform large re-architecting that prevent
 users from
 >>> > upgrading. Spark has always had a culture of evolving architecture
 >>> > incrementally and making changes - and I don't think we want to
 change this
 >>> > model.
 >>>
 >>> +1 for this. The Python community went through a lot of turmoil
 over the
 >>> Python 2 -> Python 3 transition because the upgrade process was too
 painful
 >>> for too long. The Spark community will benefit greatly from our
 explicitly
 >>> looking to avoid a similar situation.
 >>>
 >>> > 3. Assembly-free distribution of Spark: don’t require building an
 >>> > enormous assembly jar in order to run Spark.
 >>>
 >>> Could you elaborate a bit on this? I'm not sure what an
 assembly-free
 >>> distribution means.
 >>>
 >>> Nick
 >>>
 >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin 
 wrote:
 
  I’m starting a new thread since the other one got intermixed with
  feature requests. Please refrain from making feature request in
 this thread.
  Not that we shouldn’t be adding features, but we can always add
 features in
  1.7, 2.1, 2.2, ...
 
  First - I want to propose a premise for how to think about Spark
 2.0 and
  major releases in Spark, based on discussion with several members
 of the
  community: a major release should be low overhead and minimally
 disruptive
  to the Spark community. A major release should not be very
 different from a
  minor release and should not be gated based on new features. The
 main
  purpose of a major release is an opportunity to fix things that
 are broken
  in the current API and remove certain deprecated APIs (examples
 follow).

Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-10 Thread Sasaki Kai
Do you mean the CsvRelation in the spark-csv package? LibSVMRelation is included
in the Spark core repository, but CsvRelation (spark-csv) is not.
Would we also need to modify spark-csv for what you proposed in SPARK-11622?

Regards

Kai 

> On Nov 5, 2015, at 11:30 AM, Jeff Zhang  wrote:
> 
> 
> Not sure of the reason; it seems LibSVMRelation and CsvRelation could extend
> HadoopFsRelation and leverage the features from HadoopFsRelation. Any other
> consideration for that?
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
Mark,

I think we are in agreement, although I wouldn't go to the extreme and say
"a release with no new features might even be best."

Can you elaborate on "anticipatory changes"? A concrete example would be
helpful.

On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra 
wrote:

> I'm liking the way this is shaping up, and I'd summarize it this way (let
> me know if I'm misunderstanding or misrepresenting anything):
>
>- New features are not at all the focus of Spark 2.0 -- in fact, a
>release with no new features might even be best.
>- Remove deprecated API that we agree really should be deprecated.
>- Fix/change publicly-visible things that anyone who has spent any
>time looking at already knows are mistakes or should be done better, but
>that can't be changed within 1.x.
>
> Do we want to attempt anticipatory changes at all?  In other words, are
> there things we want to do in 2.x for which we already know that we'll want
> to make publicly-visible changes or that, if we don't add or change it now,
> will fall into the "everybody knows it shouldn't be that way" category when
> it comes time to discuss the Spark 3.0 release?  I'd be fine if we don't
> try at all to anticipate what is needed -- working from the premise that
> being forced into a 3.x release earlier than we expect would be less
> painful than trying to back out a mistake made at the outset of 2.0 while
> trying to guess what we'll need.
>
> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin  wrote:
>
>> I’m starting a new thread since the other one got intermixed with feature
>> requests. Please refrain from making feature request in this thread. Not
>> that we shouldn’t be adding features, but we can always add features in
>> 1.7, 2.1, 2.2, ...
>>
>> First - I want to propose a premise for how to think about Spark 2.0 and
>> major releases in Spark, based on discussion with several members of the
>> community: a major release should be low overhead and minimally disruptive
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>> For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>
>> If the community likes the above model, then to me it seems reasonable to
>> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
>> major releases every 2 years seems doable within the above model.
>>
>> Under this model, here is a list of example things I would propose doing
>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>
>>
>> APIs
>>
>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 1.x.
>>
>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>> about user applications being unable to use Akka due to Spark’s dependency
>> on Akka.
>>
>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>
>> 4. Better class package structure for low level developer API’s. In
>> particular, we have some DeveloperApi (mostly various listener-related
>> classes) added over the years. Some packages include only one or two public
>> classes but a lot of private classes. A better structure is to have public
>> classes isolated to a few public packages, and these public packages should
>> have minimal private classes for low level developer APIs.
>>
>> 5. Consolidate task metric and accumulator API. Although having some
>> subtle differences, these two are very similar but have completely
>> different code path.
>>
>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>> moving them to other package(s). They are already used beyond SQL, e.g. in
>> ML pipelines, and will be used by streaming also.
>>
>>
>> Operation/Deployment
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but it has been end-of-life.
>>
>> 2. Remove Hadoop 1 support.
>>
>> 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>>
>


Re: A proposal for Spark 2.0

2015-11-10 Thread Sandy Ryza
Oh and another question - should Spark 2.0 support Java 7?

On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza  wrote:

> Another +1 to Reynold's proposal.
>
> Maybe this is obvious, but I'd like to advocate against a blanket removal
> of deprecated / developer APIs.  Many APIs can likely be removed without
> material impact (e.g. the SparkContext constructor that takes preferred
> node location data), while others likely see heavier usage (e.g. I wouldn't
> be surprised if mapPartitionsWithContext was baked into a number of apps)
> and merit a little extra consideration.
>
> Maybe also obvious, but I think a migration guide with API equivalents and
> the like would be incredibly useful in easing the transition.
>
> -Sandy
>
> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin  wrote:
>
>> Echoing Shivaram here. I don't think it makes a lot of sense to add more
>> features to the 1.x line. We should still do critical bug fixes though.
>>
>>
>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1
>>>
>>> On a related note I think making it lightweight will ensure that we
>>> stay on the current release schedule and don't unnecessarily delay 2.0
>>> to wait for new features / big architectural changes.
>>>
>>> In terms of fixes to 1.x, I think our current policy of back-porting
>>> fixes to older releases would still apply. I don't think developing
>>> new features on both 1.x and 2.x makes a lot of sense as we would like
>>> users to switch to 2.x.
>>>
>>> Shivaram
>>>
>>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis 
>>> wrote:
>>> > +1 on a lightweight 2.0
>>> >
>>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>>> If not
>>> > terminated, how will we determine what goes into each major version
>>> line?
>>> > Will 1.x only be for stability fixes?
>>> >
>>> > Thanks,
>>> > Kostas
>>> >
>>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell 
>>> wrote:
>>> >>
>>> >> I also feel the same as Reynold. I agree we should minimize API
>>> breaks and
>>> >> focus on fixing things around the edge that were mistakes (e.g.
>>> exposing
>>> >> Guava and Akka) rather than any overhaul that could fragment the
>>> community.
>>> >> Ideally a major release is a lightweight process we can do every
>>> couple of
>>> >> years, with minimal impact for users.
>>> >>
>>> >> - Patrick
>>> >>
>>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>>> >>  wrote:
>>> >>>
>>> >>> > For this reason, I would *not* propose doing major releases to
>>> break
>>> >>> > substantial API's or perform large re-architecting that prevent
>>> users from
>>> >>> > upgrading. Spark has always had a culture of evolving architecture
>>> >>> > incrementally and making changes - and I don't think we want to
>>> change this
>>> >>> > model.
>>> >>>
>>> >>> +1 for this. The Python community went through a lot of turmoil over
>>> the
>>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>>> painful
>>> >>> for too long. The Spark community will benefit greatly from our
>>> explicitly
>>> >>> looking to avoid a similar situation.
>>> >>>
>>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>>> >>> > enormous assembly jar in order to run Spark.
>>> >>>
>>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>>> >>> distribution means.
>>> >>>
>>> >>> Nick
>>> >>>
>>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin 
>>> wrote:
>>> 
>>>  I’m starting a new thread since the other one got intermixed with
>>>  feature requests. Please refrain from making feature request in
>>> this thread.
>>>  Not that we shouldn’t be adding features, but we can always add
>>> features in
>>>  1.7, 2.1, 2.2, ...
>>> 
>>>  First - I want to propose a premise for how to think about Spark
>>> 2.0 and
>>>  major releases in Spark, based on discussion with several members
>>> of the
>>>  community: a major release should be low overhead and minimally
>>> disruptive
>>>  to the Spark community. A major release should not be very
>>> different from a
>>>  minor release and should not be gated based on new features. The
>>> main
>>>  purpose of a major release is an opportunity to fix things that are
>>> broken
>>>  in the current API and remove certain deprecated APIs (examples
>>> follow).
>>> 
>>>  For this reason, I would *not* propose doing major releases to break
>>>  substantial API's or perform large re-architecting that prevent
>>> users from
>>>  upgrading. Spark has always had a culture of evolving architecture
>>>  incrementally and making changes - and I don't think we want to
>>> change this
>>>  model. In fact, we’ve released many architectural changes on the
>>> 1.X line.
>>> 
>>>  If the community likes the above 

Re: [ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Fengdong Yu
This is the simplest announcement I've seen.



> On Nov 11, 2015, at 12:49 AM, Reynold Xin  wrote:
> 
> Hi All,
> 
> Spark 1.5.2 is a maintenance release containing stability fixes. This release 
> is based on the branch-1.5 maintenance branch of Spark. We *strongly 
> recommend* all 1.5.x users to upgrade to this release.
> 
> The full list of bug fixes is here: http://s.apache.org/spark-1.5.2 
> 
> 
> http://spark.apache.org/releases/spark-release-1-5-2.html 
> 
> 
> 



Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-10 Thread Jeff Zhang
Yes Kai, I also plan to do this for CsvRelation; I will create a PR for spark-csv.

On Wed, Nov 11, 2015 at 9:10 AM, Sasaki Kai  wrote:

> Did you indicate CsvRelation in spark-csv package? LibSVMRelation is
> included in spark core package, but CsvRelation(spark-csv) is not.
> Is it necessary for us to modify also spark-csv as you proposed in
> SPARK-11622?
>
> Regards
>
> Kai
>
> > On Nov 5, 2015, at 11:30 AM, Jeff Zhang  wrote:
> >
> >
> > Not sure the reason,  it seems LibSVMRelation and CsvRelation can
> extends HadoopFsRelation and leverage the features from HadoopFsRelation.
> Any other consideration for that ?
> >
> >
> > --
> > Best Regards
> >
> > Jeff Zhang
>
>


-- 
Best Regards

Jeff Zhang
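
(For context, both relations end up being consumed through the same generic data source reader, which is part of why sharing a common base such as HadoopFsRelation is attractive. Below is a minimal, hedged Java usage sketch; the file path, options, and the spark-csv package coordinates are illustrative, and it assumes spark-csv is on the classpath, e.g. via --packages com.databricks:spark-csv_2.10:1.3.0.)

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.sql.DataFrame;
  import org.apache.spark.sql.SQLContext;

  public class CsvReadExample {
      public static void main(String[] args) {
          // Local context for the sketch; any running Spark 1.4+ setup works the same way.
          SparkConf conf = new SparkConf().setAppName("csv-read-example").setMaster("local[*]");
          JavaSparkContext jsc = new JavaSparkContext(conf);
          SQLContext sqlContext = new SQLContext(jsc.sc());

          // spark-csv is plugged in through the generic data source API by format name.
          DataFrame df = sqlContext.read()
              .format("com.databricks.spark.csv")
              .option("header", "true")        // treat the first line as column names
              .option("inferSchema", "true")   // let spark-csv infer column types
              .load("cars.csv");               // illustrative path

          df.printSchema();
          df.show();
          jsc.stop();
      }
  }

(If your build includes it, LibSVMRelation is registered the same way under the "libsvm" format name; whether both should share HadoopFsRelation underneath is exactly what this thread is about.)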


Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread danielcsant
You can also evaluate Stratio Sparkta. It is a real-time aggregation tool
based on Spark Streaming.
It can write to Cassandra and to other databases such as MongoDB and
Elasticsearch, and it is prepared to deploy these aggregations on Mesos, so
it may fit your needs.

There is no query layer yet that abstracts the analytics (OLAP) part, but
one is on the roadmap.

Disclaimer: I work on this product.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/OLAP-query-using-spark-dataframe-with-cassandra-tp15082p15113.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
I'm liking the way this is shaping up, and I'd summarize it this way (let
me know if I'm misunderstanding or misrepresenting anything):

   - New features are not at all the focus of Spark 2.0 -- in fact, a
   release with no new features might even be best.
   - Remove deprecated API that we agree really should be deprecated.
   - Fix/change publicly-visible things that anyone who has spent any time
   looking at already knows are mistakes or should be done better, but that
   can't be changed within 1.x.

Do we want to attempt anticipatory changes at all?  In other words, are
there things we want to do in 2.x for which we already know that we'll want
to make publicly-visible changes or that, if we don't add or change it now,
will fall into the "everybody knows it shouldn't be that way" category when
it comes time to discuss the Spark 3.0 release?  I'd be fine if we don't
try at all to anticipate what is needed -- working from the premise that
being forced into a 3.x release earlier than we expect would be less
painful than trying to back out a mistake made at the outset of 2.0 while
trying to guess what we'll need.

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin  wrote:

> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature request in this thread. Not
> that we shouldn’t be adding features, but we can always add features in
> 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to
> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
> major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing
> in Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s dependency
> on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low level developer API’s. In
> particular, we have some DeveloperApi (mostly various listener-related
> classes) added over the years. Some packages include only one or two public
> classes but a lot of private classes. A better structure is to have public
> classes isolated to a few public packages, and these public packages should
> have minimal private classes for low level developer APIs.
>
> 5. Consolidate task metric and accumulator API. Although having some
> subtle differences, these two are very similar but have completely
> different code path.
>
> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming also.
>
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but it has been end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>
>


Re: A proposal for Spark 2.0

2015-11-10 Thread Patrick Wendell
I also feel the same as Reynold. I agree we should minimize API breaks and
focus on fixing things around the edge that were mistakes (e.g. exposing
Guava and Akka) rather than any overhaul that could fragment the community.
Ideally a major release is a lightweight process we can do every couple of
years, with minimal impact for users.

- Patrick

On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> > For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model.
>
> +1 for this. The Python community went through a lot of turmoil over the
> Python 2 -> Python 3 transition because the upgrade process was too painful
> for too long. The Spark community will benefit greatly from our explicitly
> looking to avoid a similar situation.
>
> > 3. Assembly-free distribution of Spark: don’t require building an
> enormous assembly jar in order to run Spark.
>
> Could you elaborate a bit on this? I'm not sure what an assembly-free
> distribution means.
>
> Nick
>
> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin  wrote:
>
>> I’m starting a new thread since the other one got intermixed with feature
>> requests. Please refrain from making feature request in this thread. Not
>> that we shouldn’t be adding features, but we can always add features in
>> 1.7, 2.1, 2.2, ...
>>
>> First - I want to propose a premise for how to think about Spark 2.0 and
>> major releases in Spark, based on discussion with several members of the
>> community: a major release should be low overhead and minimally disruptive
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>> For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>
>> If the community likes the above model, then to me it seems reasonable to
>> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
>> major releases every 2 years seems doable within the above model.
>>
>> Under this model, here is a list of example things I would propose doing
>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>
>>
>> APIs
>>
>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 1.x.
>>
>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>> about user applications being unable to use Akka due to Spark’s dependency
>> on Akka.
>>
>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>
>> 4. Better class package structure for low level developer API’s. In
>> particular, we have some DeveloperApi (mostly various listener-related
>> classes) added over the years. Some packages include only one or two public
>> classes but a lot of private classes. A better structure is to have public
>> classes isolated to a few public packages, and these public packages should
>> have minimal private classes for low level developer APIs.
>>
>> 5. Consolidate task metric and accumulator API. Although having some
>> subtle differences, these two are very similar but have completely
>> different code path.
>>
>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>> moving them to other package(s). They are already used beyond SQL, e.g. in
>> ML pipelines, and will be used by streaming also.
>>
>>
>> Operation/Deployment
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but it has been end-of-life.
>>
>> 2. Remove Hadoop 1 support.
>>
>> 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>>


Re: A proposal for Spark 2.0

2015-11-10 Thread Jean-Baptiste Onofré

Hi,

I fully agree with that. Actually, I'm working on a PR to add "client" and
"exploded" profiles to the Maven build.


The client profile creates a spark-client-assembly jar that is much more
lightweight than the spark-assembly. In our case, we build jobs that don't
require the whole Spark server side. This means the minimal size of the
generated jar is about 120MB, which is painful at spark-submit time. That's
why I started removing unnecessary dependencies from the spark-assembly.


On the other hand, I'm also working on an "exploded" mode: instead of using
a fat, monolithic spark-assembly jar file, the dependencies are laid out
individually, allowing users to view and change them.


For the client profile, I already have something ready and will propose the
PR very soon (by the end of this week, hopefully). For the exploded profile,
I need more time.


My $0.02

Regards
JB

On 11/11/2015 12:53 AM, Reynold Xin wrote:


On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
> wrote:


> 3. Assembly-free distribution of Spark: don’t require building an 
enormous assembly jar in order to run Spark.

Could you elaborate a bit on this? I'm not sure what an
assembly-free distribution means.


Right now we ship Spark using a single assembly jar, which causes a few
different problems:

- total number of classes are limited on some configurations

- dependency swapping is harder


The proposal is to just avoid a single fat jar.




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com




Re: Block Transfer Service encryption support

2015-11-10 Thread Tim Preece
NB: I did notice some test failures when I ran a quick test on the pull
request (not sure whether they are related; I haven't looked at the cause in
any detail).

Failed tests:
  SslChunkFetchIntegrationSuite>ChunkFetchIntegrationSuite.fetchBothChunks:201 expected:<[]> but was:<[0, 1]>
  SslChunkFetchIntegrationSuite>ChunkFetchIntegrationSuite.fetchBufferChunk:175 expected:<[]> but was:<[0]>
  SslChunkFetchIntegrationSuite>ChunkFetchIntegrationSuite.fetchChunkAndNonExistent:210 expected:<[]> but was:<[0]>
  SslChunkFetchIntegrationSuite>ChunkFetchIntegrationSuite.fetchFileChunk:184 expected:<[]> but was:<[1]>
  SslTransportClientFactorySuite>TransportClientFactorySuite.neverReturnInactiveClients:165 null
  SslTransportClientFactorySuite>TransportClientFactorySuite.returnDifferentClientsForDifferentServers:145 null

Tim
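
(Unrelated to the failures above, but often handy when chasing TLS handshake problems in these suites: a small, self-contained JSSE sketch -- not part of the pull request -- that prints which protocols and cipher suites the default SSLEngine of the running JRE enables. Note that a suite listed here can still be rejected for a specific protocol version at handshake time, e.g. DES-based suites under TLSv1.2.)

  import javax.net.ssl.SSLContext;
  import javax.net.ssl.SSLEngine;

  public class ListTlsDefaults {
      public static void main(String[] args) throws Exception {
          // Default JSSE provider of the running JRE (IBM SDK or Oracle).
          SSLEngine engine = SSLContext.getDefault().createSSLEngine();

          System.out.println("Enabled protocols:");
          for (String p : engine.getEnabledProtocols()) {
              System.out.println("  " + p);
          }

          System.out.println("Enabled cipher suites:");
          for (String c : engine.getEnabledCipherSuites()) {
              System.out.println("  " + c);
          }
      }
  }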




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Block-Transfer-Service-encryption-support-tp10934p15114.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: A proposal for Spark 2.0

2015-11-10 Thread Marcelo Vanzin
On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin  wrote:
> I think we are in agreement, although I wouldn't go to the extreme and say
> "a release with no new features might even be best."
>
> Can you elaborate "anticipatory changes"? A concrete example or so would be
> helpful.

I don't know if that's what Mark had in mind, but I'd count the
"remove Guava Optional from Java API" in that category. It would be
nice to have an alternative before that API is removed, although I
have no idea how you'd do it nicely, given that they're all in return
types (so overloading doesn't really work).
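
(To make the return-type point concrete: Java overload resolution ignores return types, so a second leftOuterJoin that differs only by returning java.util.Optional could not live next to the existing Guava-based one. In the meantime, user code can at least bridge the values it gets back today; the helper below is a made-up sketch, not anything in Spark or Guava, and it assumes Java 8.)

  import com.google.common.base.Optional;   // the Guava Optional exposed by the Spark 1.x Java API

  public final class OptionalBridge {
      private OptionalBridge() {}

      // Convert a Guava Optional (as returned by the current Java API) into java.util.Optional.
      public static <T> java.util.Optional<T> toJava(Optional<T> guava) {
          return guava.isPresent() ? java.util.Optional.of(guava.get())
                                   : java.util.Optional.<T>empty();
      }

      public static void main(String[] args) {
          // Stand-in for a value a Java API call such as leftOuterJoin would hand back.
          Optional<String> fromSparkApi = Optional.of("value");
          System.out.println(toJava(fromSparkApi).orElse("absent"));
      }
  }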




Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-10 Thread Sasaki Kai
Great, thank you!

> On Nov 11, 2015, at 11:41 AM, Jeff Zhang  wrote:
> 
> Yes Kai, I also to plan to do for CsvRelation, will create PR for spark-csv
> 
> On Wed, Nov 11, 2015 at 9:10 AM, Sasaki Kai  > wrote:
> Did you indicate CsvRelation in spark-csv package? LibSVMRelation is included 
> in spark core package, but CsvRelation(spark-csv) is not.
> Is it necessary for us to modify also spark-csv as you proposed in 
> SPARK-11622?
> 
> Regards
> 
> Kai
> 
> > On Nov 5, 2015, at 11:30 AM, Jeff Zhang  > > wrote:
> >
> >
> > Not sure the reason,  it seems LibSVMRelation and CsvRelation can extends 
> > HadoopFsRelation and leverage the features from HadoopFsRelation.  Any 
> > other consideration for that ?
> >
> >
> > --
> > Best Regards
> >
> > Jeff Zhang
> 
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang



Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
Heh... ok, I was intentionally pushing those bullet points to be extreme to
find where people would start pushing back, and I'll agree that we do
probably want some new features in 2.0 -- but I think we've got good
agreement that new features aren't really the main point of doing a 2.0
release.

I don't really have a concrete example of an anticipatory change, and
that's actually kind of the problem with trying to anticipate what we'll
need in the way of new public API and the like: Until what we already have
is clearly inadequate, it's hard to concretely imagine how things really
should be.  At this point I don't have anything specific where I can say "I
really want to do __ with Spark in the future, and I think it should be
changed in this way in 2.0 to allow me to do that."  I'm just wondering
whether we want to even entertain those kinds of change requests if people
have them, or whether we can just delay making those kinds of decisions
until it is really obvious that what we have doesn't work and that there is
clearly something better that should be done.

On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin  wrote:

> Mark,
>
> I think we are in agreement, although I wouldn't go to the extreme and say
> "a release with no new features might even be best."
>
> Can you elaborate "anticipatory changes"? A concrete example or so would
> be helpful.
>
> On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra 
> wrote:
>
>> I'm liking the way this is shaping up, and I'd summarize it this way (let
>> me know if I'm misunderstanding or misrepresenting anything):
>>
>>- New features are not at all the focus of Spark 2.0 -- in fact, a
>>release with no new features might even be best.
>>- Remove deprecated API that we agree really should be deprecated.
>>- Fix/change publicly-visible things that anyone who has spent any
>>time looking at already knows are mistakes or should be done better, but
>>that can't be changed within 1.x.
>>
>> Do we want to attempt anticipatory changes at all?  In other words, are
>> there things we want to do in 2.x for which we already know that we'll want
>> to make publicly-visible changes or that, if we don't add or change it now,
>> will fall into the "everybody knows it shouldn't be that way" category when
>> it comes time to discuss the Spark 3.0 release?  I'd be fine if we don't
>> try at all to anticipate what is needed -- working from the premise that
>> being forced into a 3.x release earlier than we expect would be less
>> painful than trying to back out a mistake made at the outset of 2.0 while
>> trying to guess what we'll need.
>>
>> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin  wrote:
>>
>>> I’m starting a new thread since the other one got intermixed with
>>> feature requests. Please refrain from making feature request in this
>>> thread. Not that we shouldn’t be adding features, but we can always add
>>> features in 1.7, 2.1, 2.2, ...
>>>
>>> First - I want to propose a premise for how to think about Spark 2.0 and
>>> major releases in Spark, based on discussion with several members of the
>>> community: a major release should be low overhead and minimally disruptive
>>> to the Spark community. A major release should not be very different from a
>>> minor release and should not be gated based on new features. The main
>>> purpose of a major release is an opportunity to fix things that are broken
>>> in the current API and remove certain deprecated APIs (examples follow).
>>>
>>> For this reason, I would *not* propose doing major releases to break
>>> substantial API's or perform large re-architecting that prevent users from
>>> upgrading. Spark has always had a culture of evolving architecture
>>> incrementally and making changes - and I don't think we want to change this
>>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>>
>>> If the community likes the above model, then to me it seems reasonable
>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>> immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>>> cadence of major releases every 2 years seems doable within the above model.
>>>
>>> Under this model, here is a list of example things I would propose doing
>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>>
>>>
>>> APIs
>>>
>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>>> Spark 1.x.
>>>
>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>>> about user applications being unable to use Akka due to Spark’s dependency
>>> on Akka.
>>>
>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>>
>>> 4. Better class package structure for low level developer API’s. In
>>> particular, we have some DeveloperApi (mostly various listener-related
>>> classes) added over the years. Some 

Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
To take a stab at an example of something concrete and anticipatory I can
go back to something I mentioned previously.  It's not really a good
example because I don't mean to imply that I believe that its premises are
true, but try to go with it. If we were to decide that real-time,
event-based streaming is something that we really think we'll want to do in
the 2.x cycle and that the current API (after having deprecations removed
and clear mistakes/inadequacies remedied) isn't adequate to support that,
would we want to "take our best shot" at defining a new API at the outset
of 2.0?  Another way of looking at it is whether API changes in 2.0 should
be entirely backward-looking, trying to fix problems that we've already
identified or whether there is room for some forward-looking changes that
are intended to open new directions for Spark development.

On Tue, Nov 10, 2015 at 7:04 PM, Mark Hamstra 
wrote:

> Heh... ok, I was intentionally pushing those bullet points to be extreme
> to find where people would start pushing back, and I'll agree that we do
> probably want some new features in 2.0 -- but I think we've got good
> agreement that new features aren't really the main point of doing a 2.0
> release.
>
> I don't really have a concrete example of an anticipatory change, and
> that's actually kind of the problem with trying to anticipate what we'll
> need in the way of new public API and the like: Until what we already have
> is clearly inadequate, it hard to concretely imagine how things really
> should be.  At this point I don't have anything specific where I can say "I
> really want to do __ with Spark in the future, and I think it should be
> changed in this way in 2.0 to allow me to do that."  I'm just wondering
> whether we want to even entertain those kinds of change requests if people
> have them, or whether we can just delay making those kinds of decisions
> until it is really obvious that what we have doesn't work and that there is
> clearly something better that should be done.
>
> On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin  wrote:
>
>> Mark,
>>
>> I think we are in agreement, although I wouldn't go to the extreme and
>> say "a release with no new features might even be best."
>>
>> Can you elaborate "anticipatory changes"? A concrete example or so would
>> be helpful.
>>
>> On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra 
>> wrote:
>>
>>> I'm liking the way this is shaping up, and I'd summarize it this way
>>> (let me know if I'm misunderstanding or misrepresenting anything):
>>>
>>>- New features are not at all the focus of Spark 2.0 -- in fact, a
>>>release with no new features might even be best.
>>>- Remove deprecated API that we agree really should be deprecated.
>>>- Fix/change publicly-visible things that anyone who has spent any
>>>time looking at already knows are mistakes or should be done better, but
>>>that can't be changed within 1.x.
>>>
>>> Do we want to attempt anticipatory changes at all?  In other words, are
>>> there things we want to do in 2.x for which we already know that we'll want
>>> to make publicly-visible changes or that, if we don't add or change it now,
>>> will fall into the "everybody knows it shouldn't be that way" category when
>>> it comes time to discuss the Spark 3.0 release?  I'd be fine if we don't
>>> try at all to anticipate what is needed -- working from the premise that
>>> being forced into a 3.x release earlier than we expect would be less
>>> painful than trying to back out a mistake made at the outset of 2.0 while
>>> trying to guess what we'll need.
>>>
>>> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin 
>>> wrote:
>>>
 I’m starting a new thread since the other one got intermixed with
 feature requests. Please refrain from making feature request in this
 thread. Not that we shouldn’t be adding features, but we can always add
 features in 1.7, 2.1, 2.2, ...

 First - I want to propose a premise for how to think about Spark 2.0
 and major releases in Spark, based on discussion with several members of
 the community: a major release should be low overhead and minimally
 disruptive to the Spark community. A major release should not be very
 different from a minor release and should not be gated based on new
 features. The main purpose of a major release is an opportunity to fix
 things that are broken in the current API and remove certain deprecated
 APIs (examples follow).

 For this reason, I would *not* propose doing major releases to break
 substantial API's or perform large re-architecting that prevent users from
 upgrading. Spark has always had a culture of evolving architecture
 incrementally and making changes - and I don't think we want to change this
 model. In fact, we’ve released many architectural changes on the 1.X line.

 If the 

Re: A proposal for Spark 2.0

2015-11-10 Thread Jean-Baptiste Onofré

Agree, it makes sense.

Regards
JB

On 11/11/2015 01:28 AM, Reynold Xin wrote:

Echoing Shivaram here. I don't think it makes a lot of sense to add more
features to the 1.x line. We should still do critical bug fixes though.


On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman
> wrote:

+1

On a related note I think making it lightweight will ensure that we
stay on the current release schedule and don't unnecessarily delay 2.0
to wait for new features / big architectural changes.

In terms of fixes to 1.x, I think our current policy of back-porting
fixes to older releases would still apply. I don't think developing
new features on both 1.x and 2.x makes a lot of sense as we would like
users to switch to 2.x.

Shivaram

On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis
> wrote:
 > +1 on a lightweight 2.0
 >
 > What is the thinking around the 1.x line after Spark 2.0 is
released? If not
 > terminated, how will we determine what goes into each major
version line?
 > Will 1.x only be for stability fixes?
 >
 > Thanks,
 > Kostas
 >
 > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell
> wrote:
 >>
 >> I also feel the same as Reynold. I agree we should minimize API
breaks and
 >> focus on fixing things around the edge that were mistakes (e.g.
exposing
 >> Guava and Akka) rather than any overhaul that could fragment the
community.
 >> Ideally a major release is a lightweight process we can do every
couple of
 >> years, with minimal impact for users.
 >>
 >> - Patrick
 >>
 >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
 >> >
wrote:
 >>>
 >>> > For this reason, I would *not* propose doing major releases
to break
 >>> > substantial API's or perform large re-architecting that
prevent users from
 >>> > upgrading. Spark has always had a culture of evolving
architecture
 >>> > incrementally and making changes - and I don't think we want
to change this
 >>> > model.
 >>>
 >>> +1 for this. The Python community went through a lot of turmoil
over the
 >>> Python 2 -> Python 3 transition because the upgrade process was
too painful
 >>> for too long. The Spark community will benefit greatly from our
explicitly
 >>> looking to avoid a similar situation.
 >>>
 >>> > 3. Assembly-free distribution of Spark: don’t require building an
 >>> > enormous assembly jar in order to run Spark.
 >>>
 >>> Could you elaborate a bit on this? I'm not sure what an
assembly-free
 >>> distribution means.
 >>>
 >>> Nick
 >>>
 >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin
> wrote:
 
  I’m starting a new thread since the other one got intermixed with
  feature requests. Please refrain from making feature request
in this thread.
  Not that we shouldn’t be adding features, but we can always
add features in
  1.7, 2.1, 2.2, ...
 
  First - I want to propose a premise for how to think about
Spark 2.0 and
  major releases in Spark, based on discussion with several
members of the
  community: a major release should be low overhead and
minimally disruptive
  to the Spark community. A major release should not be very
different from a
  minor release and should not be gated based on new features.
The main
  purpose of a major release is an opportunity to fix things
that are broken
  in the current API and remove certain deprecated APIs
(examples follow).
 
  For this reason, I would *not* propose doing major releases to
break
  substantial API's or perform large re-architecting that
prevent users from
  upgrading. Spark has always had a culture of evolving architecture
  incrementally and making changes - and I don't think we want
to change this
  model. In fact, we’ve released many architectural changes on
the 1.X line.
 
  If the community likes the above model, then to me it seems
reasonable
  to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7)
or immediately
  after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
cadence of
  major releases every 2 years seems doable within the above model.
 
  Under this model, here is a list of example things I would
propose doing
  in Spark 2.0, separated into APIs and Operation/Deployment:
 
 
  APIs
 
  1. Remove interfaces,