Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-08 Thread Krishna Sankar
Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
Distributions X ...

Maybe one option is to have a minimum basic set (which I know is what we
are discussing) and move the rest to spark-packages.org. There the vendors
can add the latest downloads - for example, when 1.4 is released, HDP can
build a release of an HDP Spark 1.4 bundle.

Cheers
k/

On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:

 We probably want to revisit the way we do binaries in general for
 1.4+. IMO, something worth forking a separate thread for.

 I've been hesitating to add new binaries because people
 (understandably) complain if you ever stop packaging older ones, but
 on the other hand the ASF has complained that we have too many
 binaries already and that we need to pare it down because of the large
 volume of files. Doubling the number of binaries we produce for Scala
 2.11 seemed like it would be too much.

 One potential solution is to package "Hadoop provided"
 binaries and encourage users to use these by simply setting
 HADOOP_HOME, or have instructions for specific distros. I've heard
 that our existing packages don't work well on HDP for instance, since
 there are some configuration quirks that differ from the upstream
 Hadoop.

 If we cut down on the cross building for Hadoop versions, then it is
 more tenable to cross build for Scala versions without exploding the
 number of binaries.

 - Patrick

 On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen so...@cloudera.com wrote:
  Yeah, interesting question of what is the better default for the
  single set of artifacts published to Maven. I think there's an
  argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
  and cons discussed more at
 
  https://issues.apache.org/jira/browse/SPARK-5134
  https://github.com/apache/spark/pull/3917
 
  On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  +1
 
  Tested it on Mac OS X.
 
  One small issue I noticed is that the Scala 2.11 build is using Hadoop
 1 without Hive, which is kind of weird because people will more likely want
 Hadoop 2 with Hive. So it would be good to publish a build for that
 configuration instead. We can do it if we do a new RC, or it might be that
 binary builds may not need to be voted on (I forgot the details there).
 
  Matei





Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-08 Thread Sean Owen
Yeah, interesting question of what is the better default for the
single set of artifacts published to Maven. I think there's an
argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
and cons discussed more at

https://issues.apache.org/jira/browse/SPARK-5134
https://github.com/apache/spark/pull/3917

On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 +1

 Tested it on Mac OS X.

 One small issue I noticed is that the Scala 2.11 build is using Hadoop 1 
 without Hive, which is kind of weird because people will more likely want 
 Hadoop 2 with Hive. So it would be good to publish a build for that 
 configuration instead. We can do it if we do a new RC, or it might be that 
 binary builds may not need to be voted on (I forgot the details there).

 Matei




Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-08 Thread Patrick Wendell
We probably want to revisit the way we do binaries in general for
1.4+. IMO, something worth forking a separate thread for.

I've been hesitating to add new binaries because people
(understandably) complain if you ever stop packaging older ones, but
on the other hand the ASF has complained that we have too many
binaries already and that we need to pare it down because of the large
volume of files. Doubling the number of binaries we produce for Scala
2.11 seemed like it would be too much.

One potential solution is to package "Hadoop provided"
binaries and encourage users to use these by simply setting
HADOOP_HOME, or have instructions for specific distros. I've heard
that our existing packages don't work well on HDP for instance, since
there are some configuration quirks that differ from the upstream
Hadoop.

If we cut down on the cross building for Hadoop versions, then it is
more tenable to cross build for Scala versions without exploding the
number of binaries.

- Patrick

On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen so...@cloudera.com wrote:
 Yeah, interesting question of what is the better default for the
 single set of artifacts published to Maven. I think there's an
 argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
 and cons discussed more at

 https://issues.apache.org/jira/browse/SPARK-5134
 https://github.com/apache/spark/pull/3917

 On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 +1

 Tested it on Mac OS X.

 One small issue I noticed is that the Scala 2.11 build is using Hadoop 1 
 without Hive, which is kind of weird because people will more likely want 
 Hadoop 2 with Hive. So it would be good to publish a build for that 
 configuration instead. We can do it if we do a new RC, or it might be that 
 binary builds may not need to be voted on (I forgot the details there).

 Matei




Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-08 Thread Matei Zaharia
+1

Tested it on Mac OS X.

One small issue I noticed is that the Scala 2.11 build is using Hadoop 1 
without Hive, which is kind of weird because people will more likely want 
Hadoop 2 with Hive. So it would be good to publish a build for that 
configuration instead. We can do it if we do a new RC, or it might be that 
binary builds may not need to be voted on (I forgot the details there).

Matei

 On Mar 5, 2015, at 9:52 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.0!
 
 The tag to be voted on is v1.3.0-rc3 (commit 4aaf48d4):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc3/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 Staging repositories for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1078
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/
 
 Please vote on releasing this package as Apache Spark 1.3.0!
 
 The vote is open until Monday, March 09, at 02:52 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.3.0
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 == How does this compare to RC2 ==
 This release includes the following bug fixes:
 
 https://issues.apache.org/jira/browse/SPARK-6144
 https://issues.apache.org/jira/browse/SPARK-6171
 https://issues.apache.org/jira/browse/SPARK-5143
 https://issues.apache.org/jira/browse/SPARK-6182
 https://issues.apache.org/jira/browse/SPARK-6175
 
 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.2 workload and running it on this release candidate,
 then reporting any regressions.
 
 If you are happy with this release based on your own testing, give a +1 vote.
 
 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.3 QA period,
 so -1 votes should only occur for significant regressions from 1.2.1.
 Bugs already present in 1.2.X, minor regressions, or bugs related
 to new features will not block this release.
 
 





Re: Loading previously serialized object to Spark

2015-03-08 Thread Akhil Das
Can you paste the complete code?

Thanks
Best Regards

On Sat, Mar 7, 2015 at 2:25 AM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

 Hi,

 I've implemented a class MyClass in MLlib that does some operation on
 LabeledPoint. MyClass extends Serializable, so I can map this operation over
 data of type RDD[LabeledPoint], such as data.map(lp => MyClass.operate(lp)). I
 write this class to a file with ObjectOutputStream.writeObject. Then I stop
 and restart Spark. I load this class from the file with
 ObjectInputStream.readObject.asInstanceOf[MyClass]. When I try to map the
 same operation of this class over the RDD, Spark throws a not serializable
 exception:
 org.apache.spark.SparkException: Task not serializable
 at
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
 at
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1453)
 at org.apache.spark.rdd.RDD.map(RDD.scala:273)

 Could you suggest why it throws this exception while MyClass is
 serializable by definition?

 Best regards, Alexander
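
A minimal, hypothetical Scala sketch of the pattern described above (the class
body, file path, and dataset are placeholders and assumptions, not the actual
code from the question):

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.regression.LabeledPoint

    // Placeholder for the MyClass described above.
    class MyClass extends Serializable {
      def operate(lp: LabeledPoint): LabeledPoint = lp
    }

    object SerializationRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("serialization-repro"))

        // Write an instance to disk with ObjectOutputStream, as described.
        val out = new ObjectOutputStream(new FileOutputStream("/tmp/myclass.bin"))
        out.writeObject(new MyClass)
        out.close()

        // ... stop and restart Spark, then read the instance back ...
        val in = new ObjectInputStream(new FileInputStream("/tmp/myclass.bin"))
        val loaded = in.readObject().asInstanceOf[MyClass]
        in.close()

        // Map the loaded object's operation over an RDD[LabeledPoint].
        // Note: if `loaded` were a field of a non-serializable enclosing object,
        // the closure would capture that object and fail with
        // "Task not serializable"; holding it in a local val avoids that capture.
        val data = sc.parallelize(Seq.empty[LabeledPoint])
        data.map(lp => loaded.operate(lp)).count()

        sc.stop()
      }
    }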



Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-08 Thread Sean Owen
Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
Maven artifacts.

Patrick I see you just commented on SPARK-5134 and will follow up
there. Sounds like this may accidentally not be a problem.

On binary tarball releases, I wonder if anyone has an opinion on my
opinion that these shouldn't be distributed for specific Hadoop
*distributions* to begin with. (Won't repeat the argument here yet.)
That resolves this n x m explosion too.

Vendors already provide their own distribution, yes, that's their job.


On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar ksanka...@gmail.com wrote:
 Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
 Distributions X ...

 Maybe one option is to have a minimum basic set (which I know is what we
 are discussing) and move the rest to spark-packages.org. There the vendors
 can add the latest downloads - for example, when 1.4 is released, HDP can
 build a release of an HDP Spark 1.4 bundle.

 Cheers
 k/

 On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:

 We probably want to revisit the way we do binaries in general for
 1.4+. IMO, something worth forking a separate thread for.

 I've been hesitating to add new binaries because people
 (understandably) complain if you ever stop packaging older ones, but
 on the other hand the ASF has complained that we have too many
 binaries already and that we need to pare it down because of the large
 volume of files. Doubling the number of binaries we produce for Scala
 2.11 seemed like it would be too much.

 One potential solution is to package "Hadoop provided"
 binaries and encourage users to use these by simply setting
 HADOOP_HOME, or have instructions for specific distros. I've heard
 that our existing packages don't work well on HDP for instance, since
 there are some configuration quirks that differ from the upstream
 Hadoop.

 If we cut down on the cross building for Hadoop versions, then it is
 more tenable to cross build for Scala versions without exploding the
 number of binaries.

 - Patrick

 On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen so...@cloudera.com wrote:
  Yeah, interesting question of what is the better default for the
  single set of artifacts published to Maven. I think there's an
  argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
  and cons discussed more at
 
  https://issues.apache.org/jira/browse/SPARK-5134
  https://github.com/apache/spark/pull/3917
 
  On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia matei.zaha...@gmail.com
  wrote:
  +1
 
  Tested it on Mac OS X.
 
  One small issue I noticed is that the Scala 2.11 build is using Hadoop
  1 without Hive, which is kind of weird because people will more likely 
  want
  Hadoop 2 with Hive. So it would be good to publish a build for that
  configuration instead. We can do it if we do a new RC, or it might be that
  binary builds may not need to be voted on (I forgot the details there).
 
  Matei







Re: Block Transfer Service encryption support

2015-03-08 Thread Patrick Wendell
I think that yes, longer term we want to have encryption of all
communicated data. However Jeff, can you open a JIRA to discuss the
design before opening a pull request (it's fine to link to a WIP
branch if you'd like)? I'd like to better understand the performance
and operational complexity of using SSL for this in comparison with
alternatives. It would also be good to look at how the Hadoop
encryption works for their shuffle service, in terms of the design
decisions made there.

- Patrick

On Sun, Mar 8, 2015 at 5:42 PM, Jeff Turpin turp1t...@gmail.com wrote:
 I have already written most of the code, just finishing up the unit tests
 right now...

 Jeff


 On Sun, Mar 8, 2015 at 5:39 PM, Andrew Ash and...@andrewash.com wrote:

 I'm interested in seeing this data transfer occurring over encrypted
 communication channels as well.  Many customers require that all network
 transfer occur encrypted to prevent the soft underbelly that's often
 found inside a corporate network.

 On Fri, Mar 6, 2015 at 4:20 PM, turp1twin turp1t...@gmail.com wrote:

 Is there a plan to implement SSL support for the Block Transfer Service
 (specifically, the NettyBlockTransferService implementation)? I can
 volunteer if needed...

 Jeff












Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-08 Thread Matei Zaharia
Yeah, my concern is that people should get Apache Spark from *Apache*, not from 
a vendor. It helps everyone use the latest features no matter where they are. 
In the Hadoop distro case, Hadoop made all this effort to have standard APIs 
(e.g. YARN), so it should be easy. But it is a problem if we're not packaging 
for the newest versions of some distros; I think we just fell behind at Hadoop 
2.4.

Matei

 On Mar 8, 2015, at 8:02 PM, Sean Owen so...@cloudera.com wrote:
 
 Yeah it's not much overhead, but here's an example of where it causes
 a little issue.
 
 I like that reasoning. However, the released builds don't track the
 later versions of Hadoop that vendors would be distributing -- there's
 no Hadoop 2.6 build for example. CDH4 is here, but not the
 far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't
 actually work with many CDH4 versions.
 
 I agree with the goal of maximizing the reach of Spark, but I don't
 know how much these builds advance that goal.
 
 Anyone can roll-their-own exactly-right build, and the docs and build
 have been set up to make that as simple as can be expected. So these
 aren't *required* to let me use latest Spark on distribution X.
 
 I had thought these existed to sorta support 'legacy' distributions,
 like CDH4, and that build was justified as a
 quasi-Hadoop-2.0.x-flavored build. But then I don't understand what
 the MapR profiles are for.
 
 I think it's too much work to correctly, in parallel, maintain any
 customizations necessary for any major distro, and it might be better
 not to do it at all than to do it incompletely. You could say it's also an
 enabler for distros to vary in ways that require special
 customization.
 
 Maybe there's a concern that, if lots of people consume Spark on
 Hadoop, and most people consume Hadoop through distros, and distros
 alone manage Spark distributions, then you de facto 'have to' go
 through a distro instead of getting bits from Spark? Different
 conversation but I think this sort of effect does not end up being a
 negative.
 
 Well anyway, I like the idea of seeing how far Hadoop-provided
 releases can help. It might kill several birds with one stone.
 
 On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 Our goal is to let people use the latest Apache release even if vendors fall 
 behind or don't want to package everything, so that's why we put out 
 releases for vendors' versions. It's fairly low overhead.
 
 Matei
 
 On Mar 8, 2015, at 5:56 PM, Sean Owen so...@cloudera.com wrote:
 
 Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
 at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
 Maven artifacts.
 
 Patrick I see you just commented on SPARK-5134 and will follow up
 there. Sounds like this may accidentally not be a problem.
 
 On binary tarball releases, I wonder if anyone has an opinion on my
 opinion that these shouldn't be distributed for specific Hadoop
 *distributions* to begin with. (Won't repeat the argument here yet.)
 That resolves this n x m explosion too.
 
 Vendors already provide their own distribution, yes, that's their job.
 
 
 On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar ksanka...@gmail.com wrote:
 Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
 Distributions X ...
 
 Maybe one option is to have a minimum basic set (which I know is what we
 are discussing) and move the rest to spark-packages.org. There the vendors
 can add the latest downloads - for example, when 1.4 is released, HDP can
 build a release of an HDP Spark 1.4 bundle.
 
 Cheers
 k/
 
 On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 We probably want to revisit the way we do binaries in general for
 1.4+. IMO, something worth forking a separate thread for.
 
 I've been hesitating to add new binaries because people
 (understandably) complain if you ever stop packaging older ones, but
 on the other hand the ASF has complained that we have too many
 binaries already and that we need to pare it down because of the large
 volume of files. Doubling the number of binaries we produce for Scala
 2.11 seemed like it would be too much.
 
 One potential solution is to package "Hadoop provided"
 binaries and encourage users to use these by simply setting
 HADOOP_HOME, or have instructions for specific distros. I've heard
 that our existing packages don't work well on HDP for instance, since
 there are some configuration quirks that differ from the upstream
 Hadoop.
 
 If we cut down on the cross building for Hadoop versions, then it is
 more tenable to cross build for Scala versions without exploding the
 number of binaries.
 
 - Patrick
 
 On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen so...@cloudera.com wrote:
 Yeah, interesting question of what is the better default for the
 single set of artifacts published to Maven. I think there's an
 argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
 and cons 

Re: Block Transfer Service encryption support

2015-03-08 Thread Andrew Ash
I'm interested in seeing this data transfer occurring over encrypted
communication channels as well.  Many customers require that all network
transfer occur encrypted to prevent the soft underbelly that's often
found inside a corporate network.

On Fri, Mar 6, 2015 at 4:20 PM, turp1twin turp1t...@gmail.com wrote:

 Is there a plan to implement SSL support for the Block Transfer Service
 (specifically, the NettyBlockTransferService implementation)? I can
 volunteer if needed...

 Jeff









Re: Block Transfer Service encryption support

2015-03-08 Thread Jeff Turpin
I have already written most of the code, just finishing up the unit tests
right now...

Jeff


On Sun, Mar 8, 2015 at 5:39 PM, Andrew Ash and...@andrewash.com wrote:

 I'm interested in seeing this data transfer occurring over encrypted
 communication channels as well.  Many customers require that all network
 transfer occur encrypted to prevent the soft underbelly that's often
 found inside a corporate network.

 On Fri, Mar 6, 2015 at 4:20 PM, turp1twin turp1t...@gmail.com wrote:

 Is there a plan to implement SSL support for the Block Transfer Service
 (specifically, the NettyBlockTransferService implementation)? I can
 volunteer if needed...

 Jeff










Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-08 Thread Patrick Wendell
I think it's important to separate the goals from the implementation.
I agree with Matei on the goal - I think the goal needs to be to allow
people to download Apache Spark and use it with CDH, HDP, MapR,
whatever... This is the whole reason why HDFS and YARN have stable
API's, so that other projects can build on them in a way that works
across multiple versions. I wouldn't want to force users to upgrade
according only to some vendor timetable, that doesn't seem from the
ASF perspective like a good thing for the project. If users want to
get packages from Bigtop, or the vendors, that's totally fine too.

My point earlier was - I am not sure we are actually accomplishing
that goal now, because I've heard in some cases our Hadoop 2.X
packages actually don't work on certain distributions, even those that
are based on that Hadoop version. So one solution is to move towards
"bring your own Hadoop" binaries and have users just set HADOOP_HOME
and maybe document any vendor-specific configs that need to be set.
That also happens to solve the "too many binaries" problem, but only
incidentally.

- Patrick

On Sun, Mar 8, 2015 at 4:07 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Our goal is to let people use the latest Apache release even if vendors fall 
 behind or don't want to package everything, so that's why we put out releases 
 for vendors' versions. It's fairly low overhead.

 Matei

 On Mar 8, 2015, at 5:56 PM, Sean Owen so...@cloudera.com wrote:

 Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
 at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
 Maven artifacts.

 Patrick I see you just commented on SPARK-5134 and will follow up
 there. Sounds like this may accidentally not be a problem.

 On binary tarball releases, I wonder if anyone has an opinion on my
 opinion that these shouldn't be distributed for specific Hadoop
 *distributions* to begin with. (Won't repeat the argument here yet.)
 That resolves this n x m explosion too.

 Vendors already provide their own distribution, yes, that's their job.


 On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar ksanka...@gmail.com wrote:
 Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
 Distributions X ...

 Maybe one option is to have a minimum basic set (which I know is what we
 are discussing) and move the rest to spark-packages.org. There the vendors
 can add the latest downloads - for example, when 1.4 is released, HDP can
 build a release of an HDP Spark 1.4 bundle.

 Cheers
 k/

 On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:

 We probably want to revisit the way we do binaries in general for
 1.4+. IMO, something worth forking a separate thread for.

 I've been hesitating to add new binaries because people
 (understandably) complain if you ever stop packaging older ones, but
 on the other hand the ASF has complained that we have too many
 binaries already and that we need to pare it down because of the large
 volume of files. Doubling the number of binaries we produce for Scala
 2.11 seemed like it would be too much.

 One potential solution is to package "Hadoop provided"
 binaries and encourage users to use these by simply setting
 HADOOP_HOME, or have instructions for specific distros. I've heard
 that our existing packages don't work well on HDP for instance, since
 there are some configuration quirks that differ from the upstream
 Hadoop.

 If we cut down on the cross building for Hadoop versions, then it is
 more tenable to cross build for Scala versions without exploding the
 number of binaries.

 - Patrick

 On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen so...@cloudera.com wrote:
 Yeah, interesting question of what is the better default for the
 single set of artifacts published to Maven. I think there's an
 argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
 and cons discussed more at

 https://issues.apache.org/jira/browse/SPARK-5134
 https://github.com/apache/spark/pull/3917

 On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 +1

 Tested it on Mac OS X.

 One small issue I noticed is that the Scala 2.11 build is using Hadoop
 1 without Hive, which is kind of weird because people will more likely 
 want
 Hadoop 2 with Hive. So it would be good to publish a build for that
 configuration instead. We can do it if we do a new RC, or it might be 
 that
 binary builds may not need to be voted on (I forgot the details there).

 Matei








Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-08 Thread Matei Zaharia
Our goal is to let people use the latest Apache release even if vendors fall 
behind or don't want to package everything, so that's why we put out releases 
for vendors' versions. It's fairly low overhead.

Matei

 On Mar 8, 2015, at 5:56 PM, Sean Owen so...@cloudera.com wrote:
 
 Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
 at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
 Maven artifacts.
 
 Patrick I see you just commented on SPARK-5134 and will follow up
 there. Sounds like this may accidentally not be a problem.
 
 On binary tarball releases, I wonder if anyone has an opinion on my
 opinion that these shouldn't be distributed for specific Hadoop
 *distributions* to begin with. (Won't repeat the argument here yet.)
 That resolves this n x m explosion too.
 
 Vendors already provide their own distribution, yes, that's their job.
 
 
 On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar ksanka...@gmail.com wrote:
 Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
 Distributions X ...
 
 Maybe one option is to have a minimum basic set (which I know is what we
 are discussing) and move the rest to spark-packages.org. There the vendors
 can add the latest downloads - for example, when 1.4 is released, HDP can
 build a release of an HDP Spark 1.4 bundle.
 
 Cheers
 k/
 
 On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 We probably want to revisit the way we do binaries in general for
 1.4+. IMO, something worth forking a separate thread for.
 
 I've been hesitating to add new binaries because people
 (understandably) complain if you ever stop packaging older ones, but
 on the other hand the ASF has complained that we have too many
 binaries already and that we need to pare it down because of the large
 volume of files. Doubling the number of binaries we produce for Scala
 2.11 seemed like it would be too much.
 
 One potential solution is to package "Hadoop provided"
 binaries and encourage users to use these by simply setting
 HADOOP_HOME, or have instructions for specific distros. I've heard
 that our existing packages don't work well on HDP for instance, since
 there are some configuration quirks that differ from the upstream
 Hadoop.
 
 If we cut down on the cross building for Hadoop versions, then it is
 more tenable to cross build for Scala versions without exploding the
 number of binaries.
 
 - Patrick
 
 On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen so...@cloudera.com wrote:
 Yeah, interesting question of what is the better default for the
 single set of artifacts published to Maven. I think there's an
 argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
 and cons discussed more at
 
 https://issues.apache.org/jira/browse/SPARK-5134
 https://github.com/apache/spark/pull/3917
 
 On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 +1
 
 Tested it on Mac OS X.
 
 One small issue I noticed is that the Scala 2.11 build is using Hadoop
 1 without Hive, which is kind of weird because people will more likely 
 want
 Hadoop 2 with Hive. So it would be good to publish a build for that
 configuration instead. We can do it if we do a new RC, or it might be that
 binary builds may not need to be voted on (I forgot the details there).
 
 Matei
 
 
 





Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-08 Thread Sean Owen
Yeah it's not much overhead, but here's an example of where it causes
a little issue.

I like that reasoning. However, the released builds don't track the
later versions of Hadoop that vendors would be distributing -- there's
no Hadoop 2.6 build for example. CDH4 is here, but not the
far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't
actually work with many CDH4 versions.

I agree with the goal of maximizing the reach of Spark, but I don't
know how much these builds advance that goal.

Anyone can roll-their-own exactly-right build, and the docs and build
have been set up to make that as simple as can be expected. So these
aren't *required* to let me use latest Spark on distribution X.

I had thought these existed to sorta support 'legacy' distributions,
like CDH4, and that build was justified as a
quasi-Hadoop-2.0.x-flavored build. But then I don't understand what
the MapR profiles are for.

I think it's too much work to correctly, in parallel, maintain any
customizations necessary for any major distro, and it might be better
not to do it at all than to do it incompletely. You could say it's also an
enabler for distros to vary in ways that require special
customization.

Maybe there's a concern that, if lots of people consume Spark on
Hadoop, and most people consume Hadoop through distros, and distros
alone manage Spark distributions, then you de facto 'have to' go
through a distro instead of getting bits from Spark? Different
conversation but I think this sort of effect does not end up being a
negative.

Well anyway, I like the idea of seeing how far Hadoop-provided
releases can help. It might kill several birds with one stone.

On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Our goal is to let people use the latest Apache release even if vendors fall 
 behind or don't want to package everything, so that's why we put out releases 
 for vendors' versions. It's fairly low overhead.

 Matei

 On Mar 8, 2015, at 5:56 PM, Sean Owen so...@cloudera.com wrote:

 Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
 at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
 Maven artifacts.

 Patrick I see you just commented on SPARK-5134 and will follow up
 there. Sounds like this may accidentally not be a problem.

 On binary tarball releases, I wonder if anyone has an opinion on my
 opinion that these shouldn't be distributed for specific Hadoop
 *distributions* to begin with. (Won't repeat the argument here yet.)
 That resolves this n x m explosion too.

 Vendors already provide their own distribution, yes, that's their job.


 On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar ksanka...@gmail.com wrote:
 Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
 Distributions X ...

 Maybe one option is to have a minimum basic set (which I know is what we
 are discussing) and move the rest to spark-packages.org. There the vendors
 can add the latest downloads - for example, when 1.4 is released, HDP can
 build a release of an HDP Spark 1.4 bundle.

 Cheers
 k/

 On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:

 We probably want to revisit the way we do binaries in general for
 1.4+. IMO, something worth forking a separate thread for.

 I've been hesitating to add new binaries because people
 (understandably) complain if you ever stop packaging older ones, but
 on the other hand the ASF has complained that we have too many
 binaries already and that we need to pare it down because of the large
 volume of files. Doubling the number of binaries we produce for Scala
 2.11 seemed like it would be too much.

 One potential solution is to package "Hadoop provided"
 binaries and encourage users to use these by simply setting
 HADOOP_HOME, or have instructions for specific distros. I've heard
 that our existing packages don't work well on HDP for instance, since
 there are some configuration quirks that differ from the upstream
 Hadoop.

 If we cut down on the cross building for Hadoop versions, then it is
 more tenable to cross build for Scala versions without exploding the
 number of binaries.

 - Patrick

 On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen so...@cloudera.com wrote:
 Yeah, interesting question of what is the better default for the
 single set of artifacts published to Maven. I think there's an
 argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
 and cons discussed more at

 https://issues.apache.org/jira/browse/SPARK-5134
 https://github.com/apache/spark/pull/3917

 On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 +1

 Tested it on Mac OS X.

 One small issue I noticed is that the Scala 2.11 build is using Hadoop
 1 without Hive, which is kind of weird because people will more likely 
 want
 Hadoop 2 with Hive. So it would be good to publish a build for that
 configuration instead. We can do it if we do a new RC, or it might be 
 that
 binary builds