[RESULT] [VOTE] Release Apache Spark 1.0.0 (rc8)

2014-05-17 Thread Patrick Wendell
Cancelled in favor of rc9.

On Sat, May 17, 2014 at 12:51 AM, Patrick Wendell pwend...@gmail.com wrote:
 Due to the issue discovered by Michael, this vote is cancelled in favor of 
 rc9.

 On Fri, May 16, 2014 at 6:22 PM, Michael Armbrust
 mich...@databricks.com wrote:
 -1

 We found a regression in the way configuration is passed to executors.

 https://issues.apache.org/jira/browse/SPARK-1864
 https://github.com/apache/spark/pull/808

 Michael


 On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

 +1


 On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell pwend...@gmail.com
 wrote:

   [Due to ASF e-mail outage, I'm not sure if anyone will actually receive
   this.]
 
  Please vote on releasing the following candidate as Apache Spark version
  1.0.0!
  This has only minor changes on top of rc7.
 
  The tag to be voted on is v1.0.0-rc8 (commit 80eea0f):
 
 
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.0.0-rc8/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1016/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.0!
 
  The vote is open until Monday, May 19, at 10:15 UTC and passes if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == API Changes ==
  We welcome users to compile Spark applications against 1.0. There are
  a few API changes in this release. Here are links to the associated
  upgrade guides - user facing changes have been kept as small as
  possible.
 
  changes to ML vector specification:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
 
  changes to the Java API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
  changes to the streaming API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
 
  changes to the GraphX API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
 
  coGroup and related functions now return Iterable[T] instead of Seq[T]
  == Call toSeq on the result to restore the old behavior
 
  SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  == Call toSeq on the result to restore old behavior
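
   As a rough, minimal illustration of those two notes (a local-mode sketch,
   not taken from the upgrade guides; object and variable names are made up),
   assuming simple key/value RDDs:

     import org.apache.spark.{SparkConf, SparkContext}
     import org.apache.spark.SparkContext._  // pair-RDD implicits needed in 1.0

     object MigrationSketch {
       def main(args: Array[String]): Unit = {
         val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("migration"))
         val left  = sc.parallelize(Seq(("k", 1), ("k", 2)))
         val right = sc.parallelize(Seq(("k", "x")))

         // 1.0: cogroup values are Iterable; toSeq restores the pre-1.0 Seq shape.
         val grouped = left.cogroup(right).mapValues { case (is, ss) => (is.toSeq, ss.toSeq) }
         grouped.collect().foreach(println)

         // 1.0: jarOfClass returns Option[String]; toSeq restores the old Seq[String].
         val jars: Seq[String] = SparkContext.jarOfClass(this.getClass).toSeq

         sc.stop()
       }
     }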
 




Re: [VOTE] Release Apache Spark 1.0.0 (rc8)

2014-05-17 Thread Patrick Wendell
Due to the issue discovered by Michael, this vote is cancelled in favor of rc9.

On Fri, May 16, 2014 at 6:22 PM, Michael Armbrust
mich...@databricks.com wrote:
 -1

 We found a regression in the way configuration is passed to executors.

 https://issues.apache.org/jira/browse/SPARK-1864
 https://github.com/apache/spark/pull/808

 Michael


 On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

 +1


 On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell pwend...@gmail.com
 wrote:

   [Due to ASF e-mail outage, I'm not sure if anyone will actually receive
   this.]
 
  Please vote on releasing the following candidate as Apache Spark version
  1.0.0!
  This has only minor changes on top of rc7.
 
  The tag to be voted on is v1.0.0-rc8 (commit 80eea0f):
 
 
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.0.0-rc8/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1016/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.0!
 
  The vote is open until Monday, May 19, at 10:15 UTC and passes if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == API Changes ==
  We welcome users to compile Spark applications against 1.0. There are
  a few API changes in this release. Here are links to the associated
  upgrade guides - user facing changes have been kept as small as
  possible.
 
  changes to ML vector specification:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
 
  changes to the Java API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
  changes to the streaming API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
 
  changes to the GraphX API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
 
  coGroup and related functions now return Iterable[T] instead of Seq[T]
  == Call toSeq on the result to restore the old behavior
 
  SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  == Call toSeq on the result to restore old behavior
 




Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-17 Thread Patrick Wendell
I'll start the voting with a +1.

On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.0.0!
 This has one bug fix and one minor feature on top of rc8:
 SPARK-1864: https://github.com/apache/spark/pull/808
 SPARK-1808: https://github.com/apache/spark/pull/799

 The tag to be voted on is v1.0.0-rc9 (commit 920f947):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.0.0-rc9/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1017/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Tuesday, May 20, at 08:56 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 changes to ML vector specification:
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10

 changes to the Java API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 changes to the streaming API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 changes to the GraphX API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 coGroup and related functions now return Iterable[T] instead of Seq[T]
 == Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 == Call toSeq on the result to restore old behavior


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Sean Owen
On this note, non-binding commentary:

Releases happen in local minima of change, usually created by
internally enforced code freeze. Spark is incredibly busy now due to
external factors -- recently a TLP, recently discovered by a large new
audience, ease of contribution enabled by Github. It's getting like
the first year of mainstream battle-testing in a month. It's been very
hard to freeze anything! I see a number of non-trivial issues being
reported, and I don't think it has been possible to triage all of
them, even.

Given the high rate of change, my instinct would have been to release
0.10.0 now. But won't it always be very busy? I do think the rate of
significant issues will slow down.

Version ain't nothing but a number, but if it has any meaning it's the
semantic versioning meaning. 1.0 imposes extra handicaps around
striving to maintain backwards-compatibility. That may end up being
bent to fit in important changes that are going to be required in this
continuing period of change. Hadoop does this all the time
unfortunately and gets away with it, I suppose -- minor version
releases are really major. (On the other extreme, HBase is at 0.98 and
quite production-ready.)

Just consider this a second vote for focus on fixes and 1.0.x rather
than new features and 1.x. I think there are a few steps that could
streamline triage of this flood of contributions, and make all of this
easier, but that's for another thread.


On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra m...@clearstorydata.com wrote:
 +1, but just barely.  We've got quite a number of outstanding bugs
 identified, and many of them have fixes in progress.  I'd hate to see those
 efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 --
 in other words, I'd like to see 1.0.1 retain a high priority relative to
 1.1.0.

 Looking through the unresolved JIRAs, it doesn't look like any of the
 identified bugs are show-stoppers or strictly regressions (although I will
 note that one that I have in progress, SPARK-1749, is a bug that we
 introduced with recent work -- it's not strictly a regression because we
 had equally bad but different behavior when the DAGScheduler exceptions
 weren't previously being handled at all vs. being slightly mis-handled
 now), so I'm not currently seeing a reason not to release.


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mridul Muralidharan
I had echoed similar sentiments a while back when there was a discussion
around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
changes, add missing functionality, go through a hardening release before
1.0

But the community preferred a 1.0 :-)

Regards,
Mridul

On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:

 On this note, non-binding commentary:

 Releases happen in local minima of change, usually created by
 internally enforced code freeze. Spark is incredibly busy now due to
 external factors -- recently a TLP, recently discovered by a large new
 audience, ease of contribution enabled by Github. It's getting like
 the first year of mainstream battle-testing in a month. It's been very
 hard to freeze anything! I see a number of non-trivial issues being
 reported, and I don't think it has been possible to triage all of
 them, even.

 Given the high rate of change, my instinct would have been to release
 0.10.0 now. But won't it always be very busy? I do think the rate of
 significant issues will slow down.

 Version ain't nothing but a number, but if it has any meaning it's the
 semantic versioning meaning. 1.0 imposes extra handicaps around
 striving to maintain backwards-compatibility. That may end up being
 bent to fit in important changes that are going to be required in this
 continuing period of change. Hadoop does this all the time
 unfortunately and gets away with it, I suppose -- minor version
 releases are really major. (On the other extreme, HBase is at 0.98 and
 quite production-ready.)

 Just consider this a second vote for focus on fixes and 1.0.x rather
 than new features and 1.x. I think there are a few steps that could
 streamline triage of this flood of contributions, and make all of this
 easier, but that's for another thread.


 On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra m...@clearstorydata.com
wrote:
  +1, but just barely.  We've got quite a number of outstanding bugs
  identified, and many of them have fixes in progress.  I'd hate to see
those
  efforts get lost in a post-1.0.0 flood of new features targeted at
1.1.0 --
  in other words, I'd like to see 1.0.1 retain a high priority relative to
  1.1.0.
 
  Looking through the unresolved JIRAs, it doesn't look like any of the
  identified bugs are show-stoppers or strictly regressions (although I
will
  note that one that I have in progress, SPARK-1749, is a bug that we
  introduced with recent work -- it's not strictly a regression because we
  had equally bad but different behavior when the DAGScheduler exceptions
  weren't previously being handled at all vs. being slightly mis-handled
  now), so I'm not currently seeing a reason not to release.


Re: [jira] [Created] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-17 Thread Mridul Muralidharan
I suspect this is an issue we have fixed internally here as part of a
larger change - the issue we fixed was not a config issue but bugs in spark.

Unfortunately we plan to contribute this as part of 1.1

Regards,
Mridul
On 17-May-2014 4:09 pm, sam (JIRA) j...@apache.org wrote:

 sam created SPARK-1867:
 --

  Summary: Spark Documentation Error causes
 java.lang.IllegalStateException: unread block data
  Key: SPARK-1867
  URL: https://issues.apache.org/jira/browse/SPARK-1867
  Project: Spark
   Issue Type: Bug
 Reporter: sam


 I've employed two System Administrators on a contract basis (for quite a
 bit of money), and both contractors have independently hit the following
 exception.  What we are doing is:

 1. Installing Spark 0.9.1 according to the documentation on the website,
 along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
 2. Building a fat jar with a Spark app with sbt then trying to run it on
 the cluster

 I've also included code snippets, and sbt deps at the bottom.

  When I've Googled this, there seem to be two somewhat vague responses:
 a) Mismatching spark versions on nodes/user code
 b) Need to add more jars to the SparkConf

 Now I know that (b) is not the problem having successfully run the same
 code on other clusters while only including one jar (it's a fat jar).

 But I have no idea how to check for (a) - it appears Spark doesn't have
 any version checks or anything - it would be nice if it checked versions
 and threw a mismatching version exception: you have user code using
 version X and node Y has version Z.
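
  For illustration only, a hypothetical sketch of that kind of check -- the
  names below are made up and are not an actual Spark API:

    object VersionCheck {
      // Hypothetical helper: fail fast when the application and the node disagree.
      def requireMatchingVersions(appVersion: String, nodeVersion: String): Unit =
        if (appVersion != nodeVersion)
          throw new IllegalStateException(
            s"user code built against Spark $appVersion but node runs Spark $nodeVersion")

      def main(args: Array[String]): Unit = {
        requireMatchingVersions("0.9.1", "0.9.1") // passes silently
        // requireMatchingVersions("0.9.1", "1.0.0") would throw with a clear message
      }
    }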

 I would be very grateful for advice on this.

 The exception:

  Exception in thread "main" org.apache.spark.SparkException: Job aborted:
 Task 0.0:1 failed 32 times (most recent failure: Exception failure:
 java.lang.IllegalStateException: unread block data)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to
 java.lang.IllegalStateException: unread block data [duplicate 59]

 My code snippet:

 val conf = new SparkConf()
.setMaster(clusterMaster)
.setAppName(appName)
.setSparkHome(sparkHome)
.setJars(SparkContext.jarOfClass(this.getClass))

  println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())

 My SBT dependencies:

 // relevant
  "org.apache.spark" % "spark-core_2.10" % "0.9.1",
  "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",

  // standard, probably unrelated
  "com.github.seratch" %% "awscala" % "[0.2,)",
  "org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
  "org.specs2" %% "specs2" % "1.14" % "test",
  "org.scala-lang" % "scala-reflect" % "2.10.3",
  "org.scalaz" %% "scalaz-core" % "7.0.5",
  "net.minidev" % "json-smart" % "1.2"



 --
 This message was sent by Atlassian JIRA
 (v6.2#6252)



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mark Hamstra
Which of the unresolved bugs in spark-core do you think will require an
API-breaking change to fix?  If there are none of those, then we are still
essentially on track for a 1.0.0 release.

The number of contributions and pace of change now is quite high, but I
don't think that waiting for the pace to slow before releasing 1.0 is
viable.  If Spark's short history is any guide to its near future, the pace
will not slow by any significant amount for any noteworthy length of time,
but rather will continue to increase.  What we need to be aiming for, I
think, is to have the great majority of those new contributions being made
  to MLlib, GraphX, SparkSQL and other areas of the code that we have
clearly marked as not frozen in 1.x. I think we are already seeing that,
but if I am just not recognizing breakage of our semantic versioning
guarantee that will be forced on us by some pending changes, now would be a
good time to set me straight.


On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan mri...@gmail.comwrote:

 I had echoed similar sentiments a while back when there was a discussion
 around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
 changes, add missing functionality, go through a hardening release before
 1.0

 But the community preferred a 1.0 :-)

 Regards,
 Mridul

 On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:
 
  On this note, non-binding commentary:
 
  Releases happen in local minima of change, usually created by
  internally enforced code freeze. Spark is incredibly busy now due to
  external factors -- recently a TLP, recently discovered by a large new
  audience, ease of contribution enabled by Github. It's getting like
  the first year of mainstream battle-testing in a month. It's been very
  hard to freeze anything! I see a number of non-trivial issues being
  reported, and I don't think it has been possible to triage all of
  them, even.
 
  Given the high rate of change, my instinct would have been to release
  0.10.0 now. But won't it always be very busy? I do think the rate of
  significant issues will slow down.
 
  Version ain't nothing but a number, but if it has any meaning it's the
  semantic versioning meaning. 1.0 imposes extra handicaps around
  striving to maintain backwards-compatibility. That may end up being
  bent to fit in important changes that are going to be required in this
  continuing period of change. Hadoop does this all the time
  unfortunately and gets away with it, I suppose -- minor version
  releases are really major. (On the other extreme, HBase is at 0.98 and
  quite production-ready.)
 
  Just consider this a second vote for focus on fixes and 1.0.x rather
  than new features and 1.x. I think there are a few steps that could
  streamline triage of this flood of contributions, and make all of this
  easier, but that's for another thread.
 
 
  On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra m...@clearstorydata.com
 wrote:
   +1, but just barely.  We've got quite a number of outstanding bugs
   identified, and many of them have fixes in progress.  I'd hate to see
 those
   efforts get lost in a post-1.0.0 flood of new features targeted at
 1.1.0 --
   in other words, I'd like to see 1.0.1 retain a high priority relative
 to
   1.1.0.
  
   Looking through the unresolved JIRAs, it doesn't look like any of the
   identified bugs are show-stoppers or strictly regressions (although I
 will
   note that one that I have in progress, SPARK-1749, is a bug that we
   introduced with recent work -- it's not strictly a regression because
 we
   had equally bad but different behavior when the DAGScheduler exceptions
   weren't previously being handled at all vs. being slightly mis-handled
   now), so I'm not currently seeing a reason not to release.



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Andrew Ash
+1 on the next release feeling more like a 0.10 than a 1.0
On May 17, 2014 4:38 AM, Mridul Muralidharan mri...@gmail.com wrote:

 I had echoed similar sentiments a while back when there was a discussion
 around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
 changes, add missing functionality, go through a hardening release before
 1.0

 But the community preferred a 1.0 :-)

 Regards,
 Mridul

 On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:
 
  On this note, non-binding commentary:
 
  Releases happen in local minima of change, usually created by
  internally enforced code freeze. Spark is incredibly busy now due to
  external factors -- recently a TLP, recently discovered by a large new
  audience, ease of contribution enabled by Github. It's getting like
  the first year of mainstream battle-testing in a month. It's been very
  hard to freeze anything! I see a number of non-trivial issues being
  reported, and I don't think it has been possible to triage all of
  them, even.
 
  Given the high rate of change, my instinct would have been to release
  0.10.0 now. But won't it always be very busy? I do think the rate of
  significant issues will slow down.
 
  Version ain't nothing but a number, but if it has any meaning it's the
  semantic versioning meaning. 1.0 imposes extra handicaps around
  striving to maintain backwards-compatibility. That may end up being
  bent to fit in important changes that are going to be required in this
  continuing period of change. Hadoop does this all the time
  unfortunately and gets away with it, I suppose -- minor version
  releases are really major. (On the other extreme, HBase is at 0.98 and
  quite production-ready.)
 
  Just consider this a second vote for focus on fixes and 1.0.x rather
  than new features and 1.x. I think there are a few steps that could
  streamline triage of this flood of contributions, and make all of this
  easier, but that's for another thread.
 
 
  On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra m...@clearstorydata.com
 wrote:
   +1, but just barely.  We've got quite a number of outstanding bugs
   identified, and many of them have fixes in progress.  I'd hate to see
 those
   efforts get lost in a post-1.0.0 flood of new features targeted at
 1.1.0 --
   in other words, I'd like to see 1.0.1 retain a high priority relative
 to
   1.1.0.
  
   Looking through the unresolved JIRAs, it doesn't look like any of the
   identified bugs are show-stoppers or strictly regressions (although I
 will
   note that one that I have in progress, SPARK-1749, is a bug that we
   introduced with recent work -- it's not strictly a regression because
 we
   had equally bad but different behavior when the DAGScheduler exceptions
   weren't previously being handled at all vs. being slightly mis-handled
   now), so I'm not currently seeing a reason not to release.



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Sean Owen
On Sat, May 17, 2014 at 4:52 PM, Mark Hamstra m...@clearstorydata.com wrote:
 Which of the unresolved bugs in spark-core do you think will require an
 API-breaking change to fix?  If there are none of those, then we are still
 essentially on track for a 1.0.0 release.

I don't have a particular one in mind, but look at
https://issues.apache.org/jira/browse/SPARK-1817?filter=12327229 for
example. There are 10 issues marked blocker or critical, that are
targeted at Core / 1.0.0 (or unset). Many are probably not critical,
not for 1.0, or wouldn't require a big change to fix. But has this
been reviewed then -- can you tell? I'd be happy for someone to tell
me to stop worrying, yeah, there's nothing too big here.


 The number of contributions and pace of change now is quite high, but I
 don't think that waiting for the pace to slow before releasing 1.0 is
 viable.  If Spark's short history is any guide to its near future, the pace
 will not slow by any significant amount for any noteworthy length of time,

I think we'd agree core is the most important part. I'd humbly suggest
fixes and improvements to core remain exceptionally important after
1.0 and there is a long line of proposed changes, most good. Would be
great to really burn that down. Maybe that is the kind of thing I
personally would have preferred to see before a 1.0, but it's not up
to me and there are other factors at work here. I don't object
strongly or anything.


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mridul Muralidharan
We made incompatible API changes whose impact we don't know yet completely:
both from an implementation and a usage point of view.

We had the option of getting real-world feedback from the user community if
we had gone to 0.10 but the spark developers seemed to be in a hurry to get
to 1.0 - so I made my opinion known but left it to the wisdom of larger
group of committers to decide ... I did not think it was critical enough to
do a binding -1 on.

Regards
Mridul
On 17-May-2014 9:43 pm, Mark Hamstra m...@clearstorydata.com wrote:

 Which of the unresolved bugs in spark-core do you think will require an
 API-breaking change to fix?  If there are none of those, then we are still
 essentially on track for a 1.0.0 release.

 The number of contributions and pace of change now is quite high, but I
 don't think that waiting for the pace to slow before releasing 1.0 is
 viable.  If Spark's short history is any guide to its near future, the pace
 will not slow by any significant amount for any noteworthy length of time,
 but rather will continue to increase.  What we need to be aiming for, I
 think, is to have the great majority of those new contributions being made
  to MLlib, GraphX, SparkSQL and other areas of the code that we have
 clearly marked as not frozen in 1.x. I think we are already seeing that,
 but if I am just not recognizing breakage of our semantic versioning
 guarantee that will be forced on us by some pending changes, now would be a
 good time to set me straight.


 On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan mri...@gmail.com
 wrote:

  I had echoed similar sentiments a while back when there was a discussion
  around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
  changes, add missing functionality, go through a hardening release before
  1.0
 
  But the community preferred a 1.0 :-)
 
  Regards,
  Mridul
 
  On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:
  
   On this note, non-binding commentary:
  
   Releases happen in local minima of change, usually created by
   internally enforced code freeze. Spark is incredibly busy now due to
   external factors -- recently a TLP, recently discovered by a large new
   audience, ease of contribution enabled by Github. It's getting like
   the first year of mainstream battle-testing in a month. It's been very
   hard to freeze anything! I see a number of non-trivial issues being
   reported, and I don't think it has been possible to triage all of
   them, even.
  
   Given the high rate of change, my instinct would have been to release
   0.10.0 now. But won't it always be very busy? I do think the rate of
   significant issues will slow down.
  
   Version ain't nothing but a number, but if it has any meaning it's the
   semantic versioning meaning. 1.0 imposes extra handicaps around
   striving to maintain backwards-compatibility. That may end up being
   bent to fit in important changes that are going to be required in this
   continuing period of change. Hadoop does this all the time
   unfortunately and gets away with it, I suppose -- minor version
   releases are really major. (On the other extreme, HBase is at 0.98 and
   quite production-ready.)
  
   Just consider this a second vote for focus on fixes and 1.0.x rather
   than new features and 1.x. I think there are a few steps that could
   streamline triage of this flood of contributions, and make all of this
   easier, but that's for another thread.
  
  
   On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra m...@clearstorydata.com
 
  wrote:
+1, but just barely.  We've got quite a number of outstanding bugs
identified, and many of them have fixes in progress.  I'd hate to see
  those
efforts get lost in a post-1.0.0 flood of new features targeted at
  1.1.0 --
in other words, I'd like to see 1.0.1 retain a high priority relative
  to
1.1.0.
   
Looking through the unresolved JIRAs, it doesn't look like any of the
identified bugs are show-stoppers or strictly regressions (although I
  will
note that one that I have in progress, SPARK-1749, is a bug that we
introduced with recent work -- it's not strictly a regression because
  we
had equally bad but different behavior when the DAGScheduler
 exceptions
weren't previously being handled at all vs. being slightly
 mis-handled
now), so I'm not currently seeing a reason not to release.
 



Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-17 Thread Andrew Or
+1


2014-05-17 8:53 GMT-07:00 Mark Hamstra m...@clearstorydata.com:

 +1


 On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  I'll start the voting with a +1.
 
  On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com
  wrote:
   Please vote on releasing the following candidate as Apache Spark
 version
  1.0.0!
   This has one bug fix and one minor feature on top of rc8:
   SPARK-1864: https://github.com/apache/spark/pull/808
   SPARK-1808: https://github.com/apache/spark/pull/799
  
   The tag to be voted on is v1.0.0-rc9 (commit 920f947):
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75
  
   The release files, including signatures, digests, etc. can be found at:
   http://people.apache.org/~pwendell/spark-1.0.0-rc9/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/pwendell.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1017/
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/
  
   Please vote on releasing this package as Apache Spark 1.0.0!
  
   The vote is open until Tuesday, May 20, at 08:56 UTC and passes if
   a majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Spark 1.0.0
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
  
   == API Changes ==
   We welcome users to compile Spark applications against 1.0. There are
   a few API changes in this release. Here are links to the associated
   upgrade guides - user facing changes have been kept as small as
   possible.
  
   changes to ML vector specification:
  
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
  
   changes to the Java API:
  
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
  
   changes to the streaming API:
  
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
  
   changes to the GraphX API:
  
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
  
   coGroup and related functions now return Iterable[T] instead of Seq[T]
   == Call toSeq on the result to restore the old behavior
  
   SparkContext.jarOfClass returns Option[String] instead of Seq[String]
   == Call toSeq on the result to restore old behavior
 



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mark Hamstra
That is a past issue that we don't need to be re-opening now.  The present
issue, and what I am asking, is which pending bug fixes does anyone
anticipate will require breaking the public API guaranteed in rc9?


On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan mri...@gmail.comwrote:

 We made incompatible api changes whose impact we don't know yet completely
 : both from implementation and usage point of view.

 We had the option of getting real-world feedback from the user community if
 we had gone to 0.10 but the spark developers seemed to be in a hurry to get
 to 1.0 - so I made my opinion known but left it to the wisdom of larger
 group of committers to decide ... I did not think it was critical enough to
 do a binding -1 on.

 Regards
 Mridul
 On 17-May-2014 9:43 pm, Mark Hamstra m...@clearstorydata.com wrote:

  Which of the unresolved bugs in spark-core do you think will require an
  API-breaking change to fix?  If there are none of those, then we are
 still
  essentially on track for a 1.0.0 release.
 
  The number of contributions and pace of change now is quite high, but I
  don't think that waiting for the pace to slow before releasing 1.0 is
  viable.  If Spark's short history is any guide to its near future, the
 pace
  will not slow by any significant amount for any noteworthy length of
 time,
  but rather will continue to increase.  What we need to be aiming for, I
  think, is to have the great majority of those new contributions being
 made
   to MLlib, GraphX, SparkSQL and other areas of the code that we have
  clearly marked as not frozen in 1.x. I think we are already seeing that,
  but if I am just not recognizing breakage of our semantic versioning
  guarantee that will be forced on us by some pending changes, now would
 be a
  good time to set me straight.
 
 
  On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan mri...@gmail.com
  wrote:
 
   I had echoed similar sentiments a while back when there was a
 discussion
   around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
   changes, add missing functionality, go through a hardening release
 before
   1.0
  
   But the community preferred a 1.0 :-)
  
   Regards,
   Mridul
  
   On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:
   
On this note, non-binding commentary:
   
Releases happen in local minima of change, usually created by
internally enforced code freeze. Spark is incredibly busy now due to
external factors -- recently a TLP, recently discovered by a large
 new
audience, ease of contribution enabled by Github. It's getting like
the first year of mainstream battle-testing in a month. It's been
 very
hard to freeze anything! I see a number of non-trivial issues being
reported, and I don't think it has been possible to triage all of
them, even.
   
Given the high rate of change, my instinct would have been to release
0.10.0 now. But won't it always be very busy? I do think the rate of
significant issues will slow down.
   
Version ain't nothing but a number, but if it has any meaning it's
 the
semantic versioning meaning. 1.0 imposes extra handicaps around
striving to maintain backwards-compatibility. That may end up being
bent to fit in important changes that are going to be required in
 this
continuing period of change. Hadoop does this all the time
unfortunately and gets away with it, I suppose -- minor version
releases are really major. (On the other extreme, HBase is at 0.98
 and
quite production-ready.)
   
Just consider this a second vote for focus on fixes and 1.0.x rather
than new features and 1.x. I think there are a few steps that could
streamline triage of this flood of contributions, and make all of
 this
easier, but that's for another thread.
   
   
On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra 
 m...@clearstorydata.com
  
   wrote:
 +1, but just barely.  We've got quite a number of outstanding bugs
 identified, and many of them have fixes in progress.  I'd hate to
 see
   those
 efforts get lost in a post-1.0.0 flood of new features targeted at
   1.1.0 --
 in other words, I'd like to see 1.0.1 retain a high priority
 relative
   to
 1.1.0.

 Looking through the unresolved JIRAs, it doesn't look like any of
 the
 identified bugs are show-stoppers or strictly regressions
 (although I
   will
 note that one that I have in progress, SPARK-1749, is a bug that we
 introduced with recent work -- it's not strictly a regression
 because
   we
 had equally bad but different behavior when the DAGScheduler
  exceptions
 weren't previously being handled at all vs. being slightly
  mis-handled
 now), so I'm not currently seeing a reason not to release.
  
 



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Kan Zhang
+1 on the running commentary here, non-binding of course :-)


On Sat, May 17, 2014 at 8:44 AM, Andrew Ash and...@andrewash.com wrote:

 +1 on the next release feeling more like a 0.10 than a 1.0
 On May 17, 2014 4:38 AM, Mridul Muralidharan mri...@gmail.com wrote:

  I had echoed similar sentiments a while back when there was a discussion
  around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
  changes, add missing functionality, go through a hardening release before
  1.0
 
  But the community preferred a 1.0 :-)
 
  Regards,
  Mridul
 
  On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:
  
   On this note, non-binding commentary:
  
   Releases happen in local minima of change, usually created by
   internally enforced code freeze. Spark is incredibly busy now due to
   external factors -- recently a TLP, recently discovered by a large new
   audience, ease of contribution enabled by Github. It's getting like
   the first year of mainstream battle-testing in a month. It's been very
   hard to freeze anything! I see a number of non-trivial issues being
   reported, and I don't think it has been possible to triage all of
   them, even.
  
   Given the high rate of change, my instinct would have been to release
   0.10.0 now. But won't it always be very busy? I do think the rate of
   significant issues will slow down.
  
   Version ain't nothing but a number, but if it has any meaning it's the
   semantic versioning meaning. 1.0 imposes extra handicaps around
   striving to maintain backwards-compatibility. That may end up being
   bent to fit in important changes that are going to be required in this
   continuing period of change. Hadoop does this all the time
   unfortunately and gets away with it, I suppose -- minor version
   releases are really major. (On the other extreme, HBase is at 0.98 and
   quite production-ready.)
  
   Just consider this a second vote for focus on fixes and 1.0.x rather
   than new features and 1.x. I think there are a few steps that could
   streamline triage of this flood of contributions, and make all of this
   easier, but that's for another thread.
  
  
   On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra m...@clearstorydata.com
 
  wrote:
+1, but just barely.  We've got quite a number of outstanding bugs
identified, and many of them have fixes in progress.  I'd hate to see
  those
efforts get lost in a post-1.0.0 flood of new features targeted at
  1.1.0 --
in other words, I'd like to see 1.0.1 retain a high priority relative
  to
1.1.0.
   
Looking through the unresolved JIRAs, it doesn't look like any of the
identified bugs are show-stoppers or strictly regressions (although I
  will
note that one that I have in progress, SPARK-1749, is a bug that we
introduced with recent work -- it's not strictly a regression because
  we
had equally bad but different behavior when the DAGScheduler
 exceptions
weren't previously being handled at all vs. being slightly
 mis-handled
now), so I'm not currently seeing a reason not to release.
 



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mridul Muralidharan
On 17-May-2014 11:40 pm, Mark Hamstra m...@clearstorydata.com wrote:

 That is a past issue that we don't need to be re-opening now.  The present

Huh? If we need to revisit based on changed circumstances, we must - the
scope of changes introduced in this release was definitely not anticipated
when the 1.0 vs 0.10 discussion happened.

If folks are worried about the stability of core, it is a valid concern IMO.

Having said that, I am still OK with going to 1.0; but if a conversation
starts about the need for 1.0 vs going to 0.10, I want to hear more and
possibly allay the concerns, not try to muzzle the discussion.


Regards
Mridul

 issue, and what I am asking, is which pending bug fixes does anyone
 anticipate will require breaking the public API guaranteed in rc9


 On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan mri...@gmail.com
wrote:

  We made incompatible api changes whose impact we don't know yet
completely
  : both from implementation and usage point of view.
 
  We had the option of getting real-world feedback from the user
community if
  we had gone to 0.10 but the spark developers seemed to be in a hurry to
get
  to 1.0 - so I made my opinion known but left it to the wisdom of larger
  group of committers to decide ... I did not think it was critical
enough to
  do a binding -1 on.
 
  Regards
  Mridul
  On 17-May-2014 9:43 pm, Mark Hamstra m...@clearstorydata.com wrote:
 
   Which of the unresolved bugs in spark-core do you think will require
an
   API-breaking change to fix?  If there are none of those, then we are
  still
   essentially on track for a 1.0.0 release.
  
   The number of contributions and pace of change now is quite high, but
I
   don't think that waiting for the pace to slow before releasing 1.0 is
   viable.  If Spark's short history is any guide to its near future, the
  pace
   will not slow by any significant amount for any noteworthy length of
  time,
   but rather will continue to increase.  What we need to be aiming for,
I
   think, is to have the great majority of those new contributions being
  made
    to MLlib, GraphX, SparkSQL and other areas of the code that we have
   clearly marked as not frozen in 1.x. I think we are already seeing
that,
   but if I am just not recognizing breakage of our semantic versioning
   guarantee that will be forced on us by some pending changes, now would
  be a
   good time to set me straight.
  
  
   On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan mri...@gmail.com
   wrote:
  
I had echoed similar sentiments a while back when there was a
  discussion
around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the
api
changes, add missing functionality, go through a hardening release
  before
1.0
   
But the community preferred a 1.0 :-)
   
Regards,
Mridul
   
On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:

 On this note, non-binding commentary:

 Releases happen in local minima of change, usually created by
 internally enforced code freeze. Spark is incredibly busy now due
to
 external factors -- recently a TLP, recently discovered by a large
  new
 audience, ease of contribution enabled by Github. It's getting
like
 the first year of mainstream battle-testing in a month. It's been
  very
 hard to freeze anything! I see a number of non-trivial issues
being
 reported, and I don't think it has been possible to triage all of
 them, even.

 Given the high rate of change, my instinct would have been to
release
 0.10.0 now. But won't it always be very busy? I do think the rate
of
 significant issues will slow down.

 Version ain't nothing but a number, but if it has any meaning it's
  the
 semantic versioning meaning. 1.0 imposes extra handicaps around
 striving to maintain backwards-compatibility. That may end up
being
 bent to fit in important changes that are going to be required in
  this
 continuing period of change. Hadoop does this all the time
 unfortunately and gets away with it, I suppose -- minor version
 releases are really major. (On the other extreme, HBase is at 0.98
  and
 quite production-ready.)

 Just consider this a second vote for focus on fixes and 1.0.x
rather
 than new features and 1.x. I think there are a few steps that
could
 streamline triage of this flood of contributions, and make all of
  this
 easier, but that's for another thread.


 On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra 
  m...@clearstorydata.com
   
wrote:
  +1, but just barely.  We've got quite a number of outstanding
bugs
  identified, and many of them have fixes in progress.  I'd hate
to
  see
those
  efforts get lost in a post-1.0.0 flood of new features targeted
at
1.1.0 --
  in other words, I'd like to see 1.0.1 retain a high priority
  relative
to
  1.1.0.
 
  Looking through the unresolved JIRAs, it doesn't look like 

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mark Hamstra
I'm not trying to muzzle the discussion.  All I am saying is that we don't
need to have the same discussion about 0.10 vs. 1.0 that we already had.
 If you can tell me about specific changes in the current release candidate
that occasion new arguments for why a 1.0 release is an unacceptable idea,
then I'm listening.


On Sat, May 17, 2014 at 11:59 AM, Mridul Muralidharan mri...@gmail.comwrote:

 On 17-May-2014 11:40 pm, Mark Hamstra m...@clearstorydata.com wrote:
 
  That is a past issue that we don't need to be re-opening now.  The
 present

 Huh ? If we need to revisit based on changed circumstances, we must - the
 scope of changes introduced in this release was definitely not anticipated
 when 1.0 vs 0.10 discussion happened.

 If folks are worried about stability of core; it is a valid concern IMO.

 Having said that, I am still ok with going to 1.0; but if a conversation
 starts about need for 1.0 vs going to 0.10 I want to hear more and possibly
 allay the concerns and not try to muzzle the discussion.


 Regards
 Mridul

  issue, and what I am asking, is which pending bug fixes does anyone
  anticipate will require breaking the public API guaranteed in rc9
 
 
  On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan mri...@gmail.com
 wrote:
 
   We made incompatible api changes whose impact we don't know yet
 completely
   : both from implementation and usage point of view.
  
   We had the option of getting real-world feedback from the user
 community if
   we had gone to 0.10 but the spark developers seemed to be in a hurry to
 get
   to 1.0 - so I made my opinion known but left it to the wisdom of larger
   group of committers to decide ... I did not think it was critical
 enough to
   do a binding -1 on.
  
   Regards
   Mridul
   On 17-May-2014 9:43 pm, Mark Hamstra m...@clearstorydata.com
 wrote:
  
Which of the unresolved bugs in spark-core do you think will require
 an
API-breaking change to fix?  If there are none of those, then we are
   still
essentially on track for a 1.0.0 release.
   
The number of contributions and pace of change now is quite high, but
 I
don't think that waiting for the pace to slow before releasing 1.0 is
viable.  If Spark's short history is any guide to its near future,
 the
   pace
will not slow by any significant amount for any noteworthy length of
   time,
but rather will continue to increase.  What we need to be aiming for,
 I
think, is to have the great majority of those new contributions being
   made
 to MLlib, GraphX, SparkSQL and other areas of the code that we have
clearly marked as not frozen in 1.x. I think we are already seeing
 that,
but if I am just not recognizing breakage of our semantic versioning
guarantee that will be forced on us by some pending changes, now
 would
   be a
good time to set me straight.
   
   
On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan 
 mri...@gmail.com
wrote:
   
 I had echoed similar sentiments a while back when there was a
   discussion
 around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the
 api
 changes, add missing functionality, go through a hardening release
   before
 1.0

 But the community preferred a 1.0 :-)

 Regards,
 Mridul

 On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:
 
  On this note, non-binding commentary:
 
  Releases happen in local minima of change, usually created by
  internally enforced code freeze. Spark is incredibly busy now due
 to
  external factors -- recently a TLP, recently discovered by a
 large
   new
  audience, ease of contribution enabled by Github. It's getting
 like
  the first year of mainstream battle-testing in a month. It's been
   very
  hard to freeze anything! I see a number of non-trivial issues
 being
  reported, and I don't think it has been possible to triage all of
  them, even.
 
  Given the high rate of change, my instinct would have been to
 release
  0.10.0 now. But won't it always be very busy? I do think the rate
 of
  significant issues will slow down.
 
  Version ain't nothing but a number, but if it has any meaning
 it's
   the
  semantic versioning meaning. 1.0 imposes extra handicaps around
  striving to maintain backwards-compatibility. That may end up
 being
  bent to fit in important changes that are going to be required in
   this
  continuing period of change. Hadoop does this all the time
  unfortunately and gets away with it, I suppose -- minor version
  releases are really major. (On the other extreme, HBase is at
 0.98
   and
  quite production-ready.)
 
  Just consider this a second vote for focus on fixes and 1.0.x
 rather
  than new features and 1.x. I think there are a few steps that
 could
  streamline triage of this flood of contributions, and make all of
   this
  easier, but that's for 

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Matei Zaharia
As others have said, the 1.0 milestone is about API stability, not about saying 
“we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner users can 
confidently build on Spark, knowing that the application they build today will 
still run on Spark 1.9.9 three years from now. This is something that I’ve seen 
done badly (and experienced the effects thereof) in other big data projects, 
such as MapReduce and even YARN. The result is that you annoy users, you end up 
with a fragmented userbase where everyone is building against a different 
version, and you drastically slow down development.

With a project as fast-growing as Spark in particular, there 
will be new bugs discovered and reported continuously, especially in the 
non-core components. Look at the graph of # of contributors over time to Spark: 
https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits” changed when 
we started merging each patch as a single commit). This is not slowing down, 
and we need to have the culture now that we treat API stability and release 
numbers at the level expected for a 1.0 project instead of having people come 
in and randomly change the API.

I’ll also note that the issues marked “blocker” were marked so by their 
reporters, since the reporter can set the priority. I don’t consider stuff like 
parallelize() not partitioning ranges in the same way as other collections a 
blocker — it’s a bug, it would be good to fix it, but it only affects a small 
number of use cases. Of course if we find a real blocker (in particular a 
regression from a previous version, or a feature that’s just completely 
broken), we will delay the release for that, but at some point you have to say 
“okay, this fix will go into the next maintenance release”. Maybe we need to 
write a clear policy for what the issue priorities mean.

Finally, I believe it’s much better to have a culture where you can make 
releases on a regular schedule, and have the option to make a maintenance 
release in 3-4 days if you find new bugs, than one where you pile up stuff into 
each release. This is what much larger projects than us, like Linux, do, and it's 
the only way to avoid indefinite stalling with a large contributor base. In the 
worst case, if you find a new bug that warrants immediate release, it goes into 
1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in three days with just your bug 
fix in it). And if you find an API that you’d like to improve, just add a new 
one and maybe deprecate the old one — at some point we have to respect our 
users and let them know that code they write today will still run tomorrow.
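
For instance, a minimal Scala sketch of that add-and-deprecate pattern (the
method names here are hypothetical, not Spark APIs):

  object DeprecationSketch {
    // New entry point going forward.
    def saveAsText(path: String, overwrite: Boolean = false): Unit =
      println(s"writing to $path (overwrite=$overwrite)")

    // Old entry point kept so existing callers still compile; they only see a warning.
    @deprecated("use saveAsText(path, overwrite)", "1.0.0")
    def save(path: String): Unit = saveAsText(path)

    def main(args: Array[String]): Unit = {
      save("/tmp/out")       // still works, with a deprecation warning
      saveAsText("/tmp/out") // preferred call
    }
  }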

Matei

On May 17, 2014, at 10:32 AM, Kan Zhang kzh...@apache.org wrote:

 +1 on the running commentary here, non-binding of course :-)
 
 
 On Sat, May 17, 2014 at 8:44 AM, Andrew Ash and...@andrewash.com wrote:
 
 +1 on the next release feeling more like a 0.10 than a 1.0
 On May 17, 2014 4:38 AM, Mridul Muralidharan mri...@gmail.com wrote:
 
 I had echoed similar sentiments a while back when there was a discussion
 around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
 changes, add missing functionality, go through a hardening release before
 1.0
 
 But the community preferred a 1.0 :-)
 
 Regards,
 Mridul
 
 On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:
 
 On this note, non-binding commentary:
 
 Releases happen in local minima of change, usually created by
 internally enforced code freeze. Spark is incredibly busy now due to
 external factors -- recently a TLP, recently discovered by a large new
 audience, ease of contribution enabled by Github. It's getting like
 the first year of mainstream battle-testing in a month. It's been very
 hard to freeze anything! I see a number of non-trivial issues being
 reported, and I don't think it has been possible to triage all of
 them, even.
 
 Given the high rate of change, my instinct would have been to release
 0.10.0 now. But won't it always be very busy? I do think the rate of
 significant issues will slow down.
 
 Version ain't nothing but a number, but if it has any meaning it's the
 semantic versioning meaning. 1.0 imposes extra handicaps around
 striving to maintain backwards-compatibility. That may end up being
 bent to fit in important changes that are going to be required in this
 continuing period of change. Hadoop does this all the time
 unfortunately and gets away with it, I suppose -- minor version
 releases are really major. (On the other extreme, HBase is at 0.98 and
 quite production-ready.)
 
 Just consider this a second vote for focus on fixes and 1.0.x rather
 than new features and 1.x. I think there are a few steps that could
 streamline triage of this flood of contributions, and make all of this
 easier, but that's for another thread.
 
 
 On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra m...@clearstorydata.com
 
 wrote:
 +1, but just barely.  We've got quite a number of outstanding bugs
 identified, and many of 

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mridul Muralidharan
On 18-May-2014 1:45 am, Mark Hamstra m...@clearstorydata.com wrote:

 I'm not trying to muzzle the discussion.  All I am saying is that we don't
 need to have the same discussion about 0.10 vs. 1.0 that we already had.

Agreed, no point in repeating the same discussion ... I am also trying to
understand what the concerns are.

Specifically though, the scope of 1.0 (in terms of changes) went up quite a
bit - much of it new changes and features, not just the initially
envisioned api changes and stability fixes.

If this is raising concerns, particularly since a lot of users are depending
on stability of spark interfaces (api, env, scripts, behavior); I want to
understand better what they are - and if they are legitimately serious
enough, we will need to revisit decision to go to 1.0 instead of 0.10 ...
I hope we don't need to, though, given how late we are in the dev cycle

Regards
Mridul

  If you can tell me about specific changes in the current release
candidate
 that occasion new arguments for why a 1.0 release is an unacceptable idea,
 then I'm listening.


 On Sat, May 17, 2014 at 11:59 AM, Mridul Muralidharan mri...@gmail.com
wrote:

  On 17-May-2014 11:40 pm, Mark Hamstra m...@clearstorydata.com wrote:
  
   That is a past issue that we don't need to be re-opening now.  The
  present
 
  Huh ? If we need to revisit based on changed circumstances, we must -
the
  scope of changes introduced in this release was definitely not
anticipated
  when 1.0 vs 0.10 discussion happened.
 
  If folks are worried about stability of core; it is a valid concern IMO.
 
  Having said that, I am still ok with going to 1.0; but if a conversation
  starts about need for 1.0 vs going to 0.10 I want to hear more and
possibly
  allay the concerns and not try to muzzle the discussion.
 
 
  Regards
  Mridul
 
   issue, and what I am asking, is which pending bug fixes does anyone
   anticipate will require breaking the public API guaranteed in rc9
  
  
   On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan mri...@gmail.com
  wrote:
  
We made incompatible api changes whose impact we don't know yet
  completely
: both from implementation and usage point of view.
   
We had the option of getting real-world feedback from the user
  community if
we had gone to 0.10 but the spark developers seemed to be in a
hurry to
  get
to 1.0 - so I made my opinion known but left it to the wisdom of
larger
group of committers to decide ... I did not think it was critical
  enough to
do a binding -1 on.
   
Regards
Mridul
On 17-May-2014 9:43 pm, Mark Hamstra m...@clearstorydata.com
  wrote:
   
 Which of the unresolved bugs in spark-core do you think will
require
  an
 API-breaking change to fix?  If there are none of those, then we
are
still
 essentially on track for a 1.0.0 release.

 The number of contributions and pace of change now is quite high,
but
  I
 don't think that waiting for the pace to slow before releasing
1.0 is
 viable.  If Spark's short history is any guide to its near future,
  the
pace
 will not slow by any significant amount for any noteworthy length
of
time,
 but rather will continue to increase.  What we need to be aiming
for,
  I
 think, is to have the great majority of those new contributions
being
made
  to MLlib, GraphX, SparkSQL and other areas of the code that we have
have
 clearly marked as not frozen in 1.x. I think we are already seeing
  that,
 but if I am just not recognizing breakage of our semantic
versioning
 guarantee that will be forced on us by some pending changes, now
  would
be a
 good time to set me straight.


 On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan 
  mri...@gmail.com
 wrote:

  I had echoed similar sentiments a while back when there was a
discussion
  around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize
the
  api
  changes, add missing functionality, go through a hardening
release
before
  1.0
 
  But the community preferred a 1.0 :-)
 
  Regards,
  Mridul
 
  On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:
  
   On this note, non-binding commentary:
  
   Releases happen in local minima of change, usually created by
   internally enforced code freeze. Spark is incredibly busy now
due
  to
   external factors -- recently a TLP, recently discovered by a
  large
new
   audience, ease of contribution enabled by Github. It's getting
  like
   the first year of mainstream battle-testing in a month. It's
been
very
   hard to freeze anything! I see a number of non-trivial issues
  being
   reported, and I don't think it has been possible to triage
all of
   them, even.
  
   Given the high rate of change, my instinct would have been to
  release
   0.10.0 now. But won't it always be very busy? I do think the
rate
  of
 

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-17 Thread Matei Zaharia
BTW, for what it’s worth, I agree this is a good option to add; the only tricky 
thing will be making sure the checkpoint blocks are not garbage-collected by 
the block store. I don’t think they will be, though.
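
For reference, a minimal sketch of how the existing HDFS-based checkpointing is driven from user code, which is the API a memory-and-local-disk variant would presumably mirror (the paths and data here are made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]"))
    // In production the checkpoint directory would typically live on HDFS.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val derived = sc.parallelize(1 to 1000000).map(_ * 2).filter(_ % 3 == 0)
    derived.cache()        // keep it in memory so checkpointing does not recompute the lineage
    derived.checkpoint()   // marks the RDD; the actual write happens when a job first materializes it
    derived.count()        // runs the job, writes the checkpoint, and truncates the lineage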

Matei
On May 17, 2014, at 2:20 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 We do actually have replicated StorageLevels in Spark. You can use 
 MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom 
 replication factor.
 
 BTW you guys should probably have this discussion on the JIRA rather than the 
 dev list; I think the replies somehow ended up on the dev list.
 
 Matei
 
 On May 17, 2014, at 1:36 AM, Mridul Muralidharan mri...@gmail.com wrote:
 
 We don't have 3x replication in spark :-)
 And if we use replicated storagelevel, while decreasing odds of failure, it
 does not eliminate it (since we are not doing a great job with replication
 anyway from a fault tolerance point of view).
 Also it does take a nontrivial performance hit with replicated levels.
 
 Regards,
 Mridul
 On 17-May-2014 8:16 am, Xiangrui Meng men...@gmail.com wrote:
 
 With 3x replication, we should be able to achieve fault tolerance.
 This checkPointed RDD can be cleared if we have another in-memory
 checkPointed RDD down the line. It can avoid hitting disk if we have
 enough memory to use. We need to investigate more to find a good
 solution. -Xiangrui
 
 On Fri, May 16, 2014 at 4:00 PM, Mridul Muralidharan mri...@gmail.com
 wrote:
 Effectively this is persist without fault tolerance.
 Failure of any node means complete lack of fault tolerance.
 I would be very skeptical of truncating lineage if it is not reliable.
 On 17-May-2014 3:49 am, Xiangrui Meng (JIRA) j...@apache.org wrote:
 
 Xiangrui Meng created SPARK-1855:
 
 
  Summary: Provide memory-and-local-disk RDD checkpointing
  Key: SPARK-1855
  URL: https://issues.apache.org/jira/browse/SPARK-1855
  Project: Spark
  Issue Type: New Feature
  Components: MLlib, Spark Core
  Affects Versions: 1.0.0
  Reporter: Xiangrui Meng
 
 
 Checkpointing is used to cut long lineage while maintaining fault
 tolerance. The current implementation is HDFS-based. Using the BlockRDD
 we
 can create in-memory-and-local-disk (with replication) checkpoints that
 are
  not as reliable as the HDFS-based solution but faster.
 
 It can help applications that require many iterations.
 
 
 
 --
 This message was sent by Atlassian JIRA
 (v6.2#6252)
 
 
 



Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-17 Thread Matei Zaharia
We do actually have replicated StorageLevels in Spark. You can use 
MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom 
replication factor.
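
For reference, a short sketch of the two options mentioned above; the StorageLevel factory arguments shown (useDisk, useMemory, deserialized, replication) are the commonly used overload, but the exact signature may vary by version:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("replicated-persist").setMaster("local[2]"))

    // Built-in replicated level: memory first, spill to disk, two copies cluster-wide.
    val rdd = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK_2)

    // Custom level via the factory: (useDisk, useMemory, deserialized, replication).
    val threeCopies = StorageLevel(true, true, false, 3)
    val rdd2 = sc.parallelize(100 to 200).persist(threeCopies)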

BTW you guys should probably have this discussion on the JIRA rather than the 
dev list; I think the replies somehow ended up on the dev list.

Matei

On May 17, 2014, at 1:36 AM, Mridul Muralidharan mri...@gmail.com wrote:

 We don't have 3x replication in spark :-)
 And if we use replicated storagelevel, while decreasing odds of failure, it
 does not eliminate it (since we are not doing a great job with replication
 anyway from a fault tolerance point of view).
 Also it does take a nontrivial performance hit with replicated levels.
 
 Regards,
 Mridul
 On 17-May-2014 8:16 am, Xiangrui Meng men...@gmail.com wrote:
 
 With 3x replication, we should be able to achieve fault tolerance.
 This checkPointed RDD can be cleared if we have another in-memory
 checkPointed RDD down the line. It can avoid hitting disk if we have
 enough memory to use. We need to investigate more to find a good
 solution. -Xiangrui
 
 On Fri, May 16, 2014 at 4:00 PM, Mridul Muralidharan mri...@gmail.com
 wrote:
 Effectively this is persist without fault tolerance.
 Failure of any node means complete lack of fault tolerance.
 I would be very skeptical of truncating lineage if it is not reliable.
 On 17-May-2014 3:49 am, Xiangrui Meng (JIRA) j...@apache.org wrote:
 
 Xiangrui Meng created SPARK-1855:
 
 
  Summary: Provide memory-and-local-disk RDD checkpointing
  Key: SPARK-1855
  URL: https://issues.apache.org/jira/browse/SPARK-1855
  Project: Spark
  Issue Type: New Feature
  Components: MLlib, Spark Core
  Affects Versions: 1.0.0
  Reporter: Xiangrui Meng
 
 
 Checkpointing is used to cut long lineage while maintaining fault
 tolerance. The current implementation is HDFS-based. Using the BlockRDD
 we
 can create in-memory-and-local-disk (with replication) checkpoints that
 are
  not as reliable as the HDFS-based solution but faster.
 
 It can help applications that require many iterations.
 
 
 
 --
 This message was sent by Atlassian JIRA
 (v6.2#6252)
 
 



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mridul Muralidharan
I would make the case for interface stability not just api stability.
Particularly given that we have significantly changed some of our
interfaces, I want to ensure developers/users are not seeing red flags.

Bugs and code stability can be addressed in minor releases if found, but
behavioral change and/or interface changes would be a much more invasive
issue for our users.

Regards
Mridul
On 18-May-2014 2:19 am, Matei Zaharia matei.zaha...@gmail.com wrote:

 As others have said, the 1.0 milestone is about API stability, not about
 saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner
 users can confidently build on Spark, knowing that the application they
 build today will still run on Spark 1.9.9 three years from now. This is
 something that I’ve seen done badly (and experienced the effects thereof)
 in other big data projects, such as MapReduce and even YARN. The result is
 that you annoy users, you end up with a fragmented userbase where everyone
 is building against a different version, and you drastically slow down
 development.

 With a project as fast-growing as Spark in particular,
 there will be new bugs discovered and reported continuously, especially in
 the non-core components. Look at the graph of # of contributors in time to
 Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits”
 changed when we started merging each patch as a single commit). This is not
 slowing down, and we need to have the culture now that we treat API
 stability and release numbers at the level expected for a 1.0 project
 instead of having people come in and randomly change the API.

 I’ll also note that the issues marked “blocker” were marked so by their
 reporters, since the reporter can set the priority. I don’t consider stuff
 like parallelize() not partitioning ranges in the same way as other
 collections a blocker — it’s a bug, it would be good to fix it, but it only
 affects a small number of use cases. Of course if we find a real blocker
 (in particular a regression from a previous version, or a feature that’s
 just completely broken), we will delay the release for that, but at some
 point you have to say “okay, this fix will go into the next maintenance
 release”. Maybe we need to write a clear policy for what the issue
 priorities mean.

 Finally, I believe it’s much better to have a culture where you can make
 releases on a regular schedule, and have the option to make a maintenance
 release in 3-4 days if you find new bugs, than one where you pile up stuff
 into each release. This is what much larger projects than us, like Linux, do,
 and it’s the only way to avoid indefinite stalling with a large contributor
 base. In the worst case, if you find a new bug that warrants immediate
 release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in
 three days with just your bug fix in it). And if you find an API that you’d
 like to improve, just add a new one and maybe deprecate the old one — at
 some point we have to respect our users and let them know that code they
 write today will still run tomorrow.

 Matei

 On May 17, 2014, at 10:32 AM, Kan Zhang kzh...@apache.org wrote:

  +1 on the running commentary here, non-binding of course :-)
 
 
  On Sat, May 17, 2014 at 8:44 AM, Andrew Ash and...@andrewash.com
 wrote:
 
  +1 on the next release feeling more like a 0.10 than a 1.0
  On May 17, 2014 4:38 AM, Mridul Muralidharan mri...@gmail.com
 wrote:
 
  I had echoed similar sentiments a while back when there was a
 discussion
  around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
  changes, add missing functionality, go through a hardening release
 before
  1.0
 
  But the community preferred a 1.0 :-)
 
  Regards,
  Mridul
 
  On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:
 
  On this note, non-binding commentary:
 
  Releases happen in local minima of change, usually created by
  internally enforced code freeze. Spark is incredibly busy now due to
  external factors -- recently a TLP, recently discovered by a large new
  audience, ease of contribution enabled by Github. It's getting like
  the first year of mainstream battle-testing in a month. It's been very
  hard to freeze anything! I see a number of non-trivial issues being
  reported, and I don't think it has been possible to triage all of
  them, even.
 
  Given the high rate of change, my instinct would have been to release
  0.10.0 now. But won't it always be very busy? I do think the rate of
  significant issues will slow down.
 
  Version ain't nothing but a number, but if it has any meaning it's the
  semantic versioning meaning. 1.0 imposes extra handicaps around
  striving to maintain backwards-compatibility. That may end up being
  bent to fit in important changes that are going to be required in this
  continuing period of change. Hadoop does this all the time
  unfortunately and gets away with it, I suppose -- minor version
  releases are really 

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Michael Malak
While developers may appreciate 1.0 == API stability, I'm not sure that will 
be the understanding of the VP who gives the green light to a Spark-based 
development effort.

I fear a bug that silently produces erroneous results will be perceived like 
the FDIV bug, but in this case without the momentum of an existing large 
installed base and with a number of competitors (GridGain, H2O, 
Stratosphere). Despite the stated intention of API stability, the perception 
(which becomes the reality) of 1.0 is that it's ready for production use -- 
not bullet-proof, but also not with known silent generation of erroneous 
results. Exceptions and crashes are much more tolerated than silent corruption 
of data. The result may be a reputation that the Spark team is unconcerned about 
data integrity.

I ran into (and submitted) https://issues.apache.org/jira/browse/SPARK-1817 due 
to the lack of zipWithIndex(). zip() with a self-created partitioned range was 
the way I was trying to assign IDs to a collection of nodes in preparation 
for the GraphX constructor. For the record, it was a frequent Spark committer 
who escalated it to blocker; I did not submit it as such. Partitioning a 
Scala range isn't just a toy example; it has a real-life use.
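
A rough sketch of the numbering pattern described above, with made-up input; zip() assumes both RDDs have the same partition count and the same number of elements per partition, which is the property at play in SPARK-1817:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("zip-ids").setMaster("local[2]"))

    // Hypothetical node list; in the real case it came from an upstream source with unknown partition sizes.
    val nodes = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), 3)
    val n = nodes.count()

    // Build an ID range partitioned like the node RDD, then pair them up.
    // zip() requires identical partition counts and per-partition element counts on both sides.
    val ids = sc.parallelize(0L until n, nodes.partitions.length)
    val numbered = ids.zip(nodes)   // RDD[(Long, String)], e.g. input for a graph constructor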

I also wonder about the REPL. Cloudera, for example, touts it as key to making 
Spark a crossover tool that Data Scientists can also use. The REPL can be 
considered an API of sorts -- not a traditional Scala or Java API, of course, 
but the API that a human data analyst would use. With the Scala REPL 
exhibiting some of the same bad behaviors as the Spark REPL, there is a 
question of whether the Spark REPL can even be fixed. If the Spark REPL has to 
be eliminated after 1.0 due to an inability to repair it, that would constitute 
API instability.


 
On Saturday, May 17, 2014 2:49 PM, Matei Zaharia matei.zaha...@gmail.com 
wrote:
 
As others have said, the 1.0 milestone is about API stability, not about saying 
“we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner users can 
confidently build on Spark, knowing that the application they build today will 
still run on Spark 1.9.9 three years from now. This is something that I’ve seen 
done badly (and experienced the effects thereof) in other big data projects, 
such as MapReduce and even YARN. The result is that you annoy users, you end up 
with a fragmented userbase where everyone is building against a different 
version, and you drastically slow down development.

With a project as fast-growing as Spark in particular, there 
will be new bugs discovered and reported continuously, especially in the 
non-core components. Look at the graph of # of contributors in time to Spark: 
https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits” changed when 
we started merging each patch as a single commit). This is not slowing down, 
and we need to have the culture now that we treat API stability and release 
numbers at the level expected for a 1.0 project instead of having people come 
in and randomly change the API.

I’ll also note that the issues marked “blocker” were marked so by their 
reporters, since the reporter can set the priority. I don’t consider stuff like 
parallelize() not partitioning ranges in the same way as other collections a 
blocker — it’s a bug, it would be good to fix it, but it only affects a small 
number of use cases. Of course if we find a real blocker (in particular a 
regression from a previous version, or a feature that’s just completely 
broken), we will delay the release for that, but at some point you have to say 
“okay, this fix will go into the next maintenance release”. Maybe we need to 
write a clear policy for what the issue priorities mean.

Finally, I believe it’s much better to have a culture where you can make 
releases on a regular schedule, and have the option to make a maintenance 
release in 3-4 days if you find new bugs, than one where you pile up stuff into 
each release. This is what much larger projects than us, like Linux, do, and it’s 
the only way to avoid indefinite stalling with a large contributor base. In the 
worst case, if you find a new bug that warrants immediate release, it goes into 
1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in three days with just your bug 
fix in it). And if you find an API that you’d like to improve, just add a new 
one and maybe deprecate the old one — at some point we have to respect our 
users and let them know that code they write today will still run tomorrow.

Matei


On May 17, 2014, at 10:32 AM, Kan Zhang kzh...@apache.org wrote:

 +1 on the running commentary here, non-binding of course :-)
 
 
 On Sat, May 17, 2014 at 8:44 AM, Andrew Ash and...@andrewash.com wrote:
 
 +1 on the next release feeling more like a 0.10 than a 1.0
 On May 17, 2014 4:38 AM, Mridul Muralidharan mri...@gmail.com wrote:
 
 I had echoed similar sentiments a while back when there was a discussion
 around 0.10 vs 1.0 ... I would have 

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Matei Zaharia
Yup, this is a good point: the interface includes stuff like launch scripts and 
environment variables. However I do think that the current features of 
spark-submit can all be supported in future releases. We’ll definitely have a 
very strict standard for modifying these later on.

Matei

On May 17, 2014, at 2:05 PM, Mridul Muralidharan mri...@gmail.com wrote:

 I would make the case for interface stability not just api stability.
 Particularly given that we have significantly changed some of our
 interfaces, I want to ensure developers/users are not seeing red flags.
 
 Bugs and code stability can be addressed in minor releases if found, but
 behavioral change and/or interface changes would be a much more invasive
 issue for our users.
 
 Regards
 Mridul
 On 18-May-2014 2:19 am, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 As others have said, the 1.0 milestone is about API stability, not about
 saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner
 users can confidently build on Spark, knowing that the application they
 build today will still run on Spark 1.9.9 three years from now. This is
 something that I’ve seen done badly (and experienced the effects thereof)
 in other big data projects, such as MapReduce and even YARN. The result is
 that you annoy users, you end up with a fragmented userbase where everyone
 is building against a different version, and you drastically slow down
 development.
 
 With a project as fast-growing as Spark in particular,
 there will be new bugs discovered and reported continuously, especially in
 the non-core components. Look at the graph of # of contributors in time to
 Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits”
 changed when we started merging each patch as a single commit). This is not
 slowing down, and we need to have the culture now that we treat API
 stability and release numbers at the level expected for a 1.0 project
 instead of having people come in and randomly change the API.
 
 I’ll also note that the issues marked “blocker” were marked so by their
 reporters, since the reporter can set the priority. I don’t consider stuff
 like parallelize() not partitioning ranges in the same way as other
 collections a blocker — it’s a bug, it would be good to fix it, but it only
 affects a small number of use cases. Of course if we find a real blocker
 (in particular a regression from a previous version, or a feature that’s
 just completely broken), we will delay the release for that, but at some
 point you have to say “okay, this fix will go into the next maintenance
 release”. Maybe we need to write a clear policy for what the issue
 priorities mean.
 
 Finally, I believe it’s much better to have a culture where you can make
 releases on a regular schedule, and have the option to make a maintenance
 release in 3-4 days if you find new bugs, than one where you pile up stuff
 into each release. This is what much larger projects than us, like Linux, do,
 and it’s the only way to avoid indefinite stalling with a large contributor
 base. In the worst case, if you find a new bug that warrants immediate
 release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in
 three days with just your bug fix in it). And if you find an API that you’d
 like to improve, just add a new one and maybe deprecate the old one — at
 some point we have to respect our users and let them know that code they
 write today will still run tomorrow.
 
 Matei
 
 On May 17, 2014, at 10:32 AM, Kan Zhang kzh...@apache.org wrote:
 
 +1 on the running commentary here, non-binding of course :-)
 
 
 On Sat, May 17, 2014 at 8:44 AM, Andrew Ash and...@andrewash.com
 wrote:
 
 +1 on the next release feeling more like a 0.10 than a 1.0
 On May 17, 2014 4:38 AM, Mridul Muralidharan mri...@gmail.com
 wrote:
 
 I had echoed similar sentiments a while back when there was a
 discussion
 around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
 changes, add missing functionality, go through a hardening release
 before
 1.0
 
 But the community preferred a 1.0 :-)
 
 Regards,
 Mridul
 
 On 17-May-2014 3:19 pm, Sean Owen so...@cloudera.com wrote:
 
 On this note, non-binding commentary:
 
 Releases happen in local minima of change, usually created by
 internally enforced code freeze. Spark is incredibly busy now due to
 external factors -- recently a TLP, recently discovered by a large new
 audience, ease of contribution enabled by Github. It's getting like
 the first year of mainstream battle-testing in a month. It's been very
 hard to freeze anything! I see a number of non-trivial issues being
 reported, and I don't think it has been possible to triage all of
 them, even.
 
 Given the high rate of change, my instinct would have been to release
 0.10.0 now. But won't it always be very busy? I do think the rate of
 significant issues will slow down.
 
 Version ain't nothing but a number, but if it has any meaning it's the
 

Re: can RDD be shared across multiple spark applications?

2014-05-17 Thread Andy Konwinski
RDDs cannot currently be shared across multiple SparkContexts without using
something like the Tachyon project (which is a separate project/codebase).
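
Absent something like Tachyon, the usual workaround is to materialize the data to shared storage in one application and re-read it in another; a minimal sketch with made-up paths:

    import org.apache.spark.{SparkConf, SparkContext}

    // Application A: compute once and materialize to shared storage.
    val scA = new SparkContext(new SparkConf().setAppName("producer").setMaster("local[2]"))
    scA.parallelize(1 to 1000).map(_ * 2).saveAsTextFile("/tmp/shared/doubled")
    scA.stop()

    // Application B (a separate driver/JVM in practice): re-read what A produced.
    val scB = new SparkContext(new SparkConf().setAppName("consumer").setMaster("local[2]"))
    val reloaded = scB.textFile("/tmp/shared/doubled").map(_.toInt)
    println(reloaded.count())
    scB.stop()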

Andy
On May 16, 2014 2:14 PM, qingyang li liqingyang1...@gmail.com wrote:





Re: can RDD be shared across multiple spark applications?

2014-05-17 Thread Christopher Nguyen
Qing Yang, Andy is correct in answering your direct question.

At the same time, depending on your context, you may be able to apply a
pattern where you turn the single Spark application into a service, and
multiple clients of that service can indeed share access to the same RDDs.
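
A rough in-process sketch of that pattern, with hypothetical names: one long-lived SparkContext owns cached RDDs registered by name, and multiple client calls reuse the same cached data.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    // Hypothetical long-lived service owning the single SparkContext.
    object RddService {
      private val sc = new SparkContext(new SparkConf().setAppName("rdd-service").setMaster("local[2]"))
      private val registry = mutable.Map.empty[String, RDD[String]]

      // Register a dataset once; it stays cached for every later caller.
      def load(name: String, path: String): Unit =
        registry(name) = sc.textFile(path).cache()

      // Clients (e.g. REST or Thrift handlers in a real service) query the shared RDD.
      def countMatching(name: String, needle: String): Long =
        registry(name).filter(_.contains(needle)).count()
    }

    // Two "clients" of the service reuse the same cached RDD.
    RddService.load("logs", "/tmp/app.log")
    val errors = RddService.countMatching("logs", "ERROR")
    val warns  = RddService.countMatching("logs", "WARN")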

Several groups have built apps based on this pattern, and we will also show
something with this behavior at the upcoming Spark Summit (multiple users
collaborating on named DDFs with the same underlying RDDs).

Sent while mobile. Pls excuse typos etc.
On May 18, 2014 9:40 AM, Andy Konwinski andykonwin...@gmail.com wrote:

 RDDs cannot currently be shared across multiple SparkContexts without using
 something like the Tachyon project (which is a separate project/codebase).

 Andy
 On May 16, 2014 2:14 PM, qingyang li liqingyang1...@gmail.com wrote: