SPARK-964 related issues

2014-02-06 Thread Heiko Braun


Hi everybody, 
two questions, somehow related to the discussion about JDK 8 support [1].

- Has support for Spores [2] been considered/discussed already?
- Scala 2.11 is due in the next few months [3]. Would this be a reasonable
target version for Spark 1.0?

As far as I know, both JDK 8 support and Spores should become part of
Scala 2.11.

Regards, Heiko

[1] https://spark-project.atlassian.net/browse/SPARK-964
[2] http://docs.scala-lang.org/sips/pending/spores.html
[3] 
https://issues.scala-lang.org/browse/SI/component/10600?selectedTab=com.atlassian.jira.plugin.system.project%3Acomponent-roadmap-panel
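
For reference, the spores SIP [2] proposes closures whose captured values are
declared explicitly in a header block, roughly like this (a sketch based on the
SIP; spores are a proposal / external library, not part of the Scala standard
library, and the names below are illustrative):

import scala.spores._   // assumed import from the proposed spores library

def sendToWorkers(work: Spore[Int, Int]): Unit = ???  // hypothetical API

val h = 41
val s = spore {
  val captured = h           // everything the body uses must be captured here
  (x: Int) => x + captured   // the body may only refer to its parameter and the header vals
}
sendToWorkers(s)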

Re: Proposal for Spark Release Strategy

2014-02-06 Thread Mridul Muralidharan
The reason I explicitly mentioned binary compatibility was that it was
sort of hand-waved in the proposal as a nice-to-have.
My understanding is that Scala does make it painful to ensure binary
compatibility - but stability of interfaces is vital to ensure
dependable platforms.
Recompilation might be a viable option for developers - not for users.

Regards,
Mridul


On Thu, Feb 6, 2014 at 12:08 PM, Patrick Wendell pwend...@gmail.com wrote:
 If people feel that merging the intermediate SNAPSHOT number is
 significant, let's just defer merging that until this discussion
 concludes.

 That said - the decision to settle on 1.0 for the next release is not
 just because it happens to come after 0.9. It's a conscientious
 decision based on the development of the project to this point. A
 major focus of the 0.9 release was tying off loose ends in terms of
 backwards compatibility (e.g. spark configuration). There was some
 discussion back then of maybe cutting a 1.0 release but the decision
 was deferred until after 0.9.

 @mridul - please see the original post for discussion about binary
 compatibility.

 On Wed, Feb 5, 2014 at 10:20 PM, Andy Konwinski andykonwin...@gmail.com 
 wrote:
 +1 for 0.10.0 now with the option to switch to 1.0.0 after further
 discussion.
 On Feb 5, 2014 9:53 PM, Andrew Ash and...@andrewash.com wrote:

 Agree on timeboxed releases as well.

 Is there a vision for where we want to be as a project before declaring the
 first 1.0 release?  While we're in the 0.x days per semver we can break
 backcompat at will (though we try to avoid it where possible), and that
 luxury goes away with 1.x.  I just don't want to release a 1.0 simply
 because it seems to follow after 0.9 rather than making an intentional
 decision that we're at the point where we can stand by the current APIs and
 binary compatibility for the next year or so of the major release.

 Until that decision is made as a group I'd rather we do an immediate
 version bump to 0.10.0-SNAPSHOT and then if discussion warrants it later,
 replace that with 1.0.0-SNAPSHOT.  It's very easy to go from 0.10 to 1.0
 but not the other way around.

 https://github.com/apache/incubator-spark/pull/542

 Cheers!
 Andrew


 On Wed, Feb 5, 2014 at 9:49 PM, Heiko Braun ike.br...@googlemail.com
 wrote:

  +1 on time boxed releases and compatibility guidelines
 
 
   On 06.02.2014 at 01:20, Patrick Wendell pwend...@gmail.com wrote:
  
   Hi Everyone,
  
   In an effort to coordinate development amongst the growing list of
   Spark contributors, I've taken some time to write up a proposal to
   formalize various pieces of the development process. The next release
   of Spark will likely be Spark 1.0.0, so this message is intended in
   part to coordinate the release plan for 1.0.0 and future releases.
   I'll post this on the wiki after discussing it on this thread as
   tentative project guidelines.
  
   == Spark Release Structure ==
   Starting with Spark 1.0.0, the Spark project will follow the semantic
   versioning guidelines (http://semver.org/) with a few deviations.
   These small differences account for Spark's nature as a multi-module
   project.
  
   Each Spark release will be versioned:
   [MAJOR].[MINOR].[MAINTENANCE]
  
   All releases with the same major version number will have API
   compatibility, defined as [1]. Major version numbers will remain
   stable over long periods of time. For instance, 1.X.Y may last 1 year
   or more.
  
   Minor releases will typically contain new features and improvements.
   The target frequency for minor releases is every 3-4 months. One
   change we'd like to make is to announce fixed release dates and merge
   windows for each release, to facilitate coordination. Each minor
   release will have a merge window where new patches can be merged, a QA
   window when only fixes can be merged, then a final period where voting
   occurs on release candidates. These windows will be announced
   immediately after the previous minor release to give people plenty of
   time, and over time, we might make the whole release process more
   regular (similar to Ubuntu). At the bottom of this document is an
   example window for the 1.0.0 release.
  
   Maintenance releases will occur more frequently and depend on specific
   patches introduced (e.g. bug fixes) and their urgency. In general
   these releases are designed to patch bugs. However, higher level
   libraries may introduce small features, such as a new algorithm,
   provided they are entirely additive and isolated from existing code
   paths. Spark core may not introduce any features.
  
   When new components are added to Spark, they may initially be marked
   as alpha. Alpha components do not have to abide by the above
   guidelines, however, to the maximum extent possible, they should try
   to. Once they are marked stable they have to follow these
   guidelines. At present, GraphX is the only alpha component of Spark.
  
   [1] API compatibility:
 

Re: SPARK-964 related issues

2014-02-06 Thread Prashant Sharma
Hi,

Well, in the context of JDK 8 support alone, there is actually no need to
migrate to Scala 2.11 or to JDK 8 itself. The trick is to use interfaces
instead of abstract classes for accepting functions.
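
In other words (a sketch with made-up names, not Spark's actual Java API): a
Java 8 lambda can implement an interface with a single abstract method, but
not an abstract class, so exposing the function types as interfaces is enough.

// A pure-abstract Scala trait compiles to a plain Java interface, so a Java 8
// caller could pass a lambda for it,
// e.g. rdd.map((MyFunction<String, Integer>) s -> s.length()).
trait MyFunction[T, R] {
  def call(t: T): R
}

// An abstract class cannot be the target of a Java 8 lambda; Java callers
// would have to write an anonymous subclass instead.
abstract class MyAbstractFunction[T, R] {
  def call(t: T): R
}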


On Thu, Feb 6, 2014 at 2:11 PM, Heiko Braun ike.br...@googlemail.com wrote:



 Hi everybody,
 two questions, somehow related to the discussion about JDK 8 support [1].

 - Has support for Spores [2] been considered/discussed already?
 - Scala 2.11 is due in the next few months [3]. Would this be a reasonable
 target version for Spark 1.0?

 As far as I know, both JDK 8 support and Spores should become part of
 Scala 2.11.

 Regards, Heiko

 [1] https://spark-project.atlassian.net/browse/SPARK-964
 [2] http://docs.scala-lang.org/sips/pending/spores.html
 [3]
 https://issues.scala-lang.org/browse/SI/component/10600?selectedTab=com.atlassian.jira.plugin.system.project%3Acomponent-roadmap-panel




-- 
Prashant


Re: [0.9.0] Possible deadlock in shutdown hook?

2014-02-06 Thread Mridul Muralidharan
Shutdown hooks should not take 15 mins as you mentioned!
On the other hand, how busy was your disk when this was happening?
(either due to Spark or something else?)

It might just be that there was a lot of stuff to remove?

Regards,
Mridul


On Thu, Feb 6, 2014 at 3:50 PM, Andrew Ash and...@andrewash.com wrote:
 Hi Spark devs,

 Occasionally when hitting Ctrl-C in the scala spark shell on 0.9.0 one of
 my workers goes dead in the spark master UI.  I'm using the standalone
 cluster and didn't ever see this while using 0.8.0 so I think it may be a
 regression.

 When I prod on the hung CoarseGrainedExecutorBackend JVM with jstack and
 jmap -heap, it doesn't respond unless I add the -F force flag.  The heap
 isn't full, but there are some interesting bits in the jstack.  Poking
 around a little, I think there may be some kind of deadlock in the shutdown
 hooks.

 Below are the threads I think are most interesting:

 Thread 14308: (state = BLOCKED)
  - java.lang.Shutdown.exit(int) @bci=96, line=212 (Interpreted frame)
  - java.lang.Runtime.exit(int) @bci=14, line=109 (Interpreted frame)
  - java.lang.System.exit(int) @bci=4, line=962 (Interpreted frame)
  -
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(java.lang.Object,
 scala.Function1) @bci=352, line=81 (Interpreted frame)
  - akka.actor.ActorCell.receiveMessage(java.lang.Object) @bci=25, line=498
 (Interpreted frame)
  - akka.actor.ActorCell.invoke(akka.dispatch.Envelope) @bci=39, line=456
 (Interpreted frame)
  - akka.dispatch.Mailbox.processMailbox(int, long) @bci=24, line=237
 (Interpreted frame)
  - akka.dispatch.Mailbox.run() @bci=20, line=219 (Interpreted frame)
  - akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec()
 @bci=4, line=386 (Interpreted frame)
  - scala.concurrent.forkjoin.ForkJoinTask.doExec() @bci=10, line=260
 (Compiled frame)
  -
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(scala.concurrent.forkjoin.ForkJoinTask)
 @bci=10, line=1339 (Compiled frame)
  -
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue)
 @bci=11, line=1979 (Compiled frame)
  - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107
 (Interpreted frame)

 Thread 3865: (state = BLOCKED)
  - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
  - java.lang.Thread.join(long) @bci=38, line=1280 (Interpreted frame)
  - java.lang.Thread.join() @bci=2, line=1354 (Interpreted frame)
  - java.lang.ApplicationShutdownHooks.runHooks() @bci=87, line=106
 (Interpreted frame)
  - java.lang.ApplicationShutdownHooks$1.run() @bci=0, line=46 (Interpreted
 frame)
  - java.lang.Shutdown.runHooks() @bci=39, line=123 (Interpreted frame)
  - java.lang.Shutdown.sequence() @bci=26, line=167 (Interpreted frame)
  - java.lang.Shutdown.exit(int) @bci=96, line=212 (Interpreted frame)
  - java.lang.Terminator$1.handle(sun.misc.Signal) @bci=8, line=52
 (Interpreted frame)
  - sun.misc.Signal$1.run() @bci=8, line=212 (Interpreted frame)
  - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)


 Thread 3987: (state = BLOCKED)
  - java.io.UnixFileSystem.list(java.io.File) @bci=0 (Interpreted frame)
  - java.io.File.list() @bci=29, line=1116 (Interpreted frame)
  - java.io.File.listFiles() @bci=1, line=1201 (Compiled frame)
  - org.apache.spark.util.Utils$.listFilesSafely(java.io.File) @bci=1,
 line=466 (Interpreted frame)
  - org.apache.spark.util.Utils$.deleteRecursively(java.io.File) @bci=9,
 line=478 (Compiled frame)
  -
 org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(java.io.File)
 @bci=4, line=479 (Compiled frame)
  -
 org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(java.lang.Object)
 @bci=5, line=478 (Compiled frame)
  -
 scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized,
 scala.Function1) @bci=22, line=33 (Compiled frame)
  - scala.collection.mutable.WrappedArray.foreach(scala.Function1) @bci=2,
 line=34 (Compiled frame)
  - org.apache.spark.util.Utils$.deleteRecursively(java.io.File) @bci=19,
 line=478 (Interpreted frame)
  -
 org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(java.io.File)
 @bci=14, line=141 (Interpreted frame)
  -
 org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(java.lang.Object)
 @bci=5, line=139 (Interpreted frame)
  -
 scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized,
 scala.Function1) @bci=22, line=33 (Compiled frame)
  - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1) @bci=2,
 line=108 (Interpreted frame)
  - org.apache.spark.storage.DiskBlockManager$$anon$1.run() @bci=39,
 line=139 (Interpreted frame)


 I think what happened here is that thread 14308 received the akka
 shutdown message and called System.exit().  This started thread 3865,
 which is the JVM shutting itself down.  Part of that process is running the
 shutdown hooks, so it started thread 3987.  
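
A minimal, self-contained sketch (plain Scala, not Spark code; names are
illustrative) of the mechanics visible in these traces: the first System.exit()
starts the shutdown hooks, and any later System.exit() blocks inside
java.lang.Shutdown until those hooks finish.

object SlowShutdownHookDemo {
  def main(args: Array[String]): Unit = {
    // A hook simulating slow cleanup, e.g. recursively deleting a large temp dir.
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      def run(): Unit = { Thread.sleep(10000); println("cleanup hook finished") }
    }))
    // Another thread that also asks the JVM to exit while the hook is running;
    // it shows up as BLOCKED in java.lang.Shutdown.exit, like thread 14308 above.
    new Thread(new Runnable {
      def run(): Unit = { Thread.sleep(1000); System.exit(0) }
    }).start()
    System.exit(0)  // first exit: triggers the shutdown hooks
  }
}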

Re: [0.9.0] Possible deadlock in shutdown hook?

2014-02-06 Thread Andrew Ash
Per the book Java Concurrency in Practice the already-running threads
continue running while the shutdown hooks run.  So I think the race between
the writing thread and the deleting thread could be a very real possibility
:/

http://stackoverflow.com/a/3332925/120915


On Thu, Feb 6, 2014 at 2:49 AM, Andrew Ash and...@andrewash.com wrote:

 Got a repro locally on my MBP (the other was on a CentOS machine).

 Build spark, run a master and a worker with the sbin/start-all.sh script,
 then run this in a shell:

 import org.apache.spark.storage.StorageLevel._
 val s = sc.parallelize(1 to 10).persist(MEMORY_AND_DISK_SER);
 s.count

 After about a minute, this line appears in the shell logging output:

 14/02/06 02:44:44 WARN BlockManagerMasterActor: Removing BlockManager
 BlockManagerId(0, aash-mbp.dyn.yojoe.local, 57895, 0) with no recent heart
 beats: 57510ms exceeds 45000ms

 Ctrl-C the shell.  In jps there is now a worker, a master, and a
 CoarseGrainedExecutorBackend.

 Run jstack on the CGEBackend JVM, and I got the attached stacktraces.  I
 waited around for 15min then kill -9'd the JVM and restarted the process.

 I wonder if what's happening here is that the threads that are spewing
 data to disk (as that parallelize and persist would do) can write to disk
 faster than the cleanup threads can delete from disk.

 What do you think of that theory?


 Andrew



 On Thu, Feb 6, 2014 at 2:30 AM, Mridul Muralidharan mri...@gmail.com wrote:

 Shutdown hooks should not take 15 mins as you mentioned!
 On the other hand, how busy was your disk when this was happening?
 (either due to Spark or something else?)

 It might just be that there was a lot of stuff to remove?

 Regards,
 Mridul



Re: Proposal for Spark Release Strategy

2014-02-06 Thread Sandy Ryza
Thanks for all this Patrick.

I like Heiko's proposal that requires every pull request to reference a
JIRA.  This is how things are done in Hadoop and it makes it much easier
to, for example, find out whether an issue you came across when googling
for an error is in a release.

I agree with Mridul about binary compatibility.  It can be a dealbreaker
for organizations that are considering an upgrade. The two ways I'm aware
of that break binary compatibility are Scala version upgrades and messing
around with inheritance.  Are these not avoidable at least for minor
releases?
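
As an aside, binary compatibility can also break in less obvious ways. A
hypothetical sketch (made-up class, not Spark code): adding a parameter with a
default value is source compatible but binary incompatible, because the
previously compiled method signature disappears.

// Version A of a hypothetical library class.
class Counter {
  def add(n: Int): Int = n
}

// Version B (shown as a second class here only so the sketch compiles):
// callers written in source as add(1) still compile, because the compiler
// fills in the default...
class CounterV2 {
  def add(n: Int, times: Int = 1): Int = n * times
}
// ...but bytecode compiled against version A links to add(I)I, which no longer
// exists in version B, so already-compiled applications fail with
// NoSuchMethodError unless they are recompiled.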

-Sandy





Re: [0.9.0] Possible deadlock in shutdown hook?

2014-02-06 Thread Reynold Xin
Is it safe if we interrupt the running thread during shutdown?




On Thu, Feb 6, 2014 at 3:27 AM, Andrew Ash and...@andrewash.com wrote:

 Per the book Java Concurrency in Practice the already-running threads
 continue running while the shutdown hooks run.  So I think the race between
 the writing thread and the deleting thread could be a very real possibility
 :/

 http://stackoverflow.com/a/3332925/120915



Re: Proposal for Spark Release Strategy

2014-02-06 Thread Patrick Wendell
 I like Heiko's proposal that requires every pull request to reference a
 JIRA.  This is how things are done in Hadoop and it makes it much easier
 to, for example, find out whether an issue you came across when googling
 for an error is in a release.

I think this is a good idea and something on which there is wide
consensus. I separately was going to suggest this in a later e-mail
(it's not directly tied to versioning). One of many reasons this is
necessary is because it's becoming hard to track which features ended
up in which releases.

 I agree with Mridul about binary compatibility.  It can be a dealbreaker
 for organizations that are considering an upgrade. The two ways I'm aware
 of that break binary compatibility are Scala version upgrades and messing
 around with inheritance.  Are these not avoidable at least for minor
 releases?

This is clearly a goal, but I'm hesitant to codify it until we
understand all of the reasons why it might not work. I've heard that in
general with Scala there are many non-obvious things that can break
binary compatibility, and we need to understand what they are. I'd
propose we add the migration tool [1] to our build and use it for
a few months and see what happens (hat tip to Michael Armbrust).
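
For reference, wiring MiMa into an sbt build looks roughly like this (a
sketch; the key names follow the current sbt-mima-plugin, and the version and
artifact below are illustrative, not Spark's actual build):

// project/plugins.sbt
addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "1.1.0")  // version is illustrative

// build.sbt
mimaPreviousArtifacts := Set("org.apache.spark" %% "spark-core" % "0.9.0")

// `sbt mimaReportBinaryIssues` then lists every binary-incompatible change
// relative to the previous release.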

It's easy to formalize this as a requirement later; it's impossible to
go the other direction. For Scala major versions it's possible we can
cross-build between 2.10 and 2.11 to retain link-level compatibility.
It's just entirely uncharted territory, and AFAIK no one who's
suggesting this is speaking from experience maintaining this guarantee
for a Scala project.
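
The mechanical side of cross-building is handled by sbt itself, roughly (a
sketch; version numbers are illustrative):

// build.sbt
scalaVersion := "2.10.3"
crossScalaVersions := Seq("2.10.3", "2.11.0")

// `sbt +package` (or +test, +publishLocal, ...) runs the task once per entry
// in crossScalaVersions; whether the resulting artifacts stay link-compatible
// for users across Scala versions is the open question discussed here.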

That would be the most convincing reason for me - if someone has
actually done this in the past in a Scala project and speaks from
experience. Most of us are speaking from the perspective of Java
projects, where we understand well the trade-offs and costs of
maintaining this guarantee.

[1] https://github.com/typesafehub/migration-manager

- Patrick


Re: Proposal for Spark Release Strategy

2014-02-06 Thread Henry Saputra
Thanks Patrick for initiating the discussion about the next road map for Apache Spark.

I am +1 for 0.10.0 as the next version.

It will give us as a community some time to digest the process and the
vision and make adjustments accordingly.

Releasing a 1.0.0 is a huge milestone, and if we do need to break APIs
or modify internal behavior dramatically, that would be a good
opportunity to step up to 1.0.0.


- Henry



On Wed, Feb 5, 2014 at 9:52 PM, Andrew Ash and...@andrewash.com wrote:
 Agree on timeboxed releases as well.

 Is there a vision for where we want to be as a project before declaring the
 first 1.0 release?  While we're in the 0.x days per semver we can break
 backcompat at will (though we try to avoid it where possible), and that
 luxury goes away with 1.x.  I just don't want to release a 1.0 simply
 because it seems to follow after 0.9 rather than making an intentional
 decision that we're at the point where we can stand by the current APIs and
 binary compatibility for the next year or so of the major release.

 Until that decision is made as a group I'd rather we do an immediate
 version bump to 0.10.0-SNAPSHOT and then if discussion warrants it later,
 replace that with 1.0.0-SNAPSHOT.  It's very easy to go from 0.10 to 1.0
 but not the other way around.

 https://github.com/apache/incubator-spark/pull/542

 Cheers!
 Andrew


 On Wed, Feb 5, 2014 at 9:49 PM, Heiko Braun ike.br...@googlemail.com wrote:

 +1 on time boxed releases and compatibility guidelines


  On 06.02.2014 at 01:20, Patrick Wendell pwend...@gmail.com wrote:
 
  Hi Everyone,
 
  In an effort to coordinate development amongst the growing list of
  Spark contributors, I've taken some time to write up a proposal to
  formalize various pieces of the development process. The next release
  of Spark will likely be Spark 1.0.0, so this message is intended in
  part to coordinate the release plan for 1.0.0 and future releases.
  I'll post this on the wiki after discussing it on this thread as
  tentative project guidelines.
 
  == Spark Release Structure ==
  Starting with Spark 1.0.0, the Spark project will follow the semantic
  versioning guidelines (http://semver.org/) with a few deviations.
  These small differences account for Spark's nature as a multi-module
  project.
 
  Each Spark release will be versioned:
  [MAJOR].[MINOR].[MAINTENANCE]
 
  All releases with the same major version number will have API
  compatibility, defined as [1]. Major version numbers will remain
  stable over long periods of time. For instance, 1.X.Y may last 1 year
  or more.
 
  Minor releases will typically contain new features and improvements.
  The target frequency for minor releases is every 3-4 months. One
  change we'd like to make is to announce fixed release dates and merge
  windows for each release, to facilitate coordination. Each minor
  release will have a merge window where new patches can be merged, a QA
  window when only fixes can be merged, then a final period where voting
  occurs on release candidates. These windows will be announced
  immediately after the previous minor release to give people plenty of
  time, and over time, we might make the whole release process more
  regular (similar to Ubuntu). At the bottom of this document is an
  example window for the 1.0.0 release.
 
  Maintenance releases will occur more frequently and depend on specific
  patches introduced (e.g. bug fixes) and their urgency. In general
  these releases are designed to patch bugs. However, higher level
  libraries may introduce small features, such as a new algorithm,
  provided they are entirely additive and isolated from existing code
  paths. Spark core may not introduce any features.
 
  When new components are added to Spark, they may initially be marked
  as alpha. Alpha components do not have to abide by the above
  guidelines, however, to the maximum extent possible, they should try
  to. Once they are marked stable they have to follow these
  guidelines. At present, GraphX is the only alpha component of Spark.
 
  [1] API compatibility:
 
  An API is any public class or interface exposed in Spark that is not
  marked as semi-private or experimental. Release A is API compatible
  with release B if code compiled against release A *compiles cleanly*
  against B. This does not guarantee that a compiled application that is
  linked against version A will link cleanly against version B without
  re-compiling. Link-level compatibility is something we'll try to
   guarantee as well, and we might make it a requirement in the
  future, but challenges with things like Scala versions have made this
  difficult to guarantee in the past.
 
  == Merging Pull Requests ==
  To merge pull requests, committers are encouraged to use this tool [2]
  to collapse the request into one commit rather than manually
  performing git merges. It will also format the commit message nicely
  in a way that can be easily parsed later when writing credits.
  

Re: [0.9.0] Possible deadlock in shutdown hook?

2014-02-06 Thread Mridul Muralidharan
Looks like a pathological corner case here - where the delete
thread is not getting run while the OS is busy prioritizing the thread
writing data (probably with heavy GC too).
Ideally, the delete thread would list files, remove them, and then fail
when it tries to remove the non-empty directory (since another thread
might be creating more files in parallel).
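
A sketch of that approach (illustrative only, not Spark's actual
Utils.deleteRecursively):

import java.io.File

// Best-effort recursive delete for a shutdown hook: tolerate files that vanish
// and directories that are still being written to, instead of failing hard or
// retrying forever.
def deleteBestEffort(dir: File): Boolean = {
  val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
  children.foreach { f =>
    if (f.isDirectory) deleteBestEffort(f) else f.delete()  // ignore individual failures
  }
  // Returns false if another thread created new files in the meantime; the
  // hook simply stops there rather than looping.
  dir.delete()
}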


Regards,
Mridul


On Thu, Feb 6, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com wrote:
 Got a repro locally on my MBP (the other was on a CentOS machine).

 Build spark, run a master and a worker with the sbin/start-all.sh script,
 then run this in a shell:

 import org.apache.spark.storage.StorageLevel._
 val s = sc.parallelize(1 to 10).persist(MEMORY_AND_DISK_SER);
 s.count

 After about a minute, this line appears in the shell logging output:

 14/02/06 02:44:44 WARN BlockManagerMasterActor: Removing BlockManager
 BlockManagerId(0, aash-mbp.dyn.yojoe.local, 57895, 0) with no recent heart
 beats: 57510ms exceeds 45000ms

 Ctrl-C the shell.  In jps there is now a worker, a master, and a
 CoarseGrainedExecutorBackend.

 Run jstack on the CGEBackend JVM, and I got the attached stacktraces.  I
 waited around for 15min then kill -9'd the JVM and restarted the process.

 I wonder if what's happening here is that the threads that are spewing data
 to disk (as that parallelize and persist would do) can write to disk faster
 than the cleanup threads can delete from disk.

 What do you think of that theory?


 Andrew




Re: Proposal for Spark Release Strategy

2014-02-06 Thread Sandy Ryza
Bleh, hit send too early again.  My second paragraph was to argue for 1.0.0
instead of 0.10.0, not to hammer on the binary compatibility point.


On Thu, Feb 6, 2014 at 11:04 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 *Would it make sense to put in something that strongly discourages binary
 incompatible changes when possible?


 On Thu, Feb 6, 2014 at 11:03 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Not codifying binary compatibility as a hard rule sounds fine to me.
  Would it make sense to put in something that strongly discourages
 binary-incompatible changes when possible? I.e. avoid making needless
 changes to class hierarchies.

 Whether Spark considers itself stable or not, users are beginning to
 treat it so.  A responsible project will acknowledge this and provide the
 stability needed by its user base.  I think some projects have made the
 mistake of waiting too long to release a 1.0.0.  It allows them to put off
 making the hard decisions, but users and downstream projects suffer.

 If Spark needs to go through dramatic changes, there's always the option
 of a 2.0.0 that allows for this.

 -Sandy



 On Thu, Feb 6, 2014 at 10:56 AM, Matei Zaharia 
 matei.zaha...@gmail.com wrote:

 I think it's important to do 1.0 next. The project has been around for 4
 years, and I'd be comfortable maintaining the current codebase for a long
 time in an API and binary compatible way through 1.x releases. Over the
 past 4 years we haven't actually had major changes to the user-facing API --
 the only ones were changing the package to org.apache.spark, and upgrading
 the Scala version. I'd be okay leaving 1.x to always use Scala 2.10 for
 example, or later cross-building it for Scala 2.11. Updating to 1.0 says
 two things: it tells users that they can be confident that version will be
 maintained for a long time, which we absolutely want to do, and it lets
 outsiders see that the project is now fairly mature (for many people,
 pre-1.0 might still cause them not to try it). I think both are good for
 the community.

 Regarding binary compatibility, I agree that it's what we should strive
 for, but it just seems premature to codify now. Let's see how it works
 between, say, 1.0 and 1.1, and then we can codify it.

 Matei


Re: Proposal for Spark Release Strategy

2014-02-06 Thread Sandy Ryza
*Would it make sense to put in something that strongly discourages binary
incompatible changes when possible?



Re: Proposal for Spark Release Strategy

2014-02-06 Thread Sandy Ryza
Not codifying binary compatibility as a hard rule sounds fine to me.  Would
it make sense to put in something that strongly discourages binary-incompatible
changes when possible? I.e. avoid making needless changes to class hierarchies.

Whether Spark considers itself stable or not, users are beginning to treat
it so.  A responsible project will acknowledge this and provide the
stability needed by its user base.  I think some projects have made the
mistake of waiting too long to release a 1.0.0.  It allows them to put off
making the hard decisions, but users and downstream projects suffer.

If Spark needs to go through dramatic changes, there's always the option of
a 2.0.0 that allows for this.

-Sandy




Re: Proposal for Spark Release Strategy

2014-02-06 Thread Evan Chan
+1 for 0.10.0.

It would give more time to study things (such as the new SparkConf)
and let the community decide if any breaking API changes are needed.

Also, a +1 for minor revisions not breaking code compatibility,
including Scala versions.   (I guess this would mean that 1.x would
stay on Scala 2.10.x)

On Thu, Feb 6, 2014 at 11:05 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
 Bleh, hit send too early again.  My second paragraph was to argue for 1.0.0
 instead of 0.10.0, not to hammer on the binary compatibility point.


 On Thu, Feb 6, 2014 at 11:04 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 *Would it make sense to put in something that strongly discourages binary
 incompatible changes when possible?


 On Thu, Feb 6, 2014 at 11:03 AM, Sandy Ryza sandy.r...@cloudera.comwrote:

 Not codifying binary compatibility as a hard rule sounds fine to me.
  Would it make sense to put something in that . I.e. avoid making needless
 changes to class hierarchies.

 Whether Spark considers itself stable or not, users are beginning to
 treat it so.  A responsible project will acknowledge this and provide the
 stability needed by its user base.  I think some projects have made the
 mistake of waiting too long to release a 1.0.0.  It allows them to put off
 making the hard decisions, but users and downstream projects suffer.

 If Spark needs to go through dramatic changes, there's always the option
 of a 2.0.0 that allows for this.

 -Sandy



 On Thu, Feb 6, 2014 at 10:56 AM, Matei Zaharia 
 matei.zaha...@gmail.comwrote:

 I think it's important to do 1.0 next. The project has been around for 4
 years, and I'd be comfortable maintaining the current codebase for a long
 time in an API and binary compatible way through 1.x releases. Over the
 past 4 years we haven't actually had major changes to the user-facing API 
 --
 the only ones were changing the package to org.apache.spark, and upgrading
 the Scala version. I'd be okay leaving 1.x to always use Scala 2.10 for
 example, or later cross-building it for Scala 2.11. Updating to 1.0 says
 two things: it tells users that they can be confident that version will be
 maintained for a long time, which we absolutely want to do, and it lets
 outsiders see that the project is now fairly mature (for many people,
 pre-1.0 might still cause them not to try it). I think both are good for
 the community.

 Regarding binary compatibility, I agree that it's what we should strive
 for, but it just seems premature to codify now. Let's see how it works
 between, say, 1.0 and 1.1, and then we can codify it.

 Matei

 On Feb 6, 2014, at 10:43 AM, Henry Saputra henry.sapu...@gmail.com
 wrote:

  Thanks Patrick for initiating the discussion about the next roadmap for
 Apache Spark.
 
  I am +1 for 0.10.0 for the next version.
 
  It will give us as a community some time to digest the process and the
  vision and make adjustments accordingly.
 
  Releasing a 1.0.0 is a huge milestone, and if we do need to break the API
  somehow or modify internal behavior dramatically, we could take
  advantage of that and release 1.0.0 as a good step to go to.
 
 
  - Henry
 
 
 
  On Wed, Feb 5, 2014 at 9:52 PM, Andrew Ash and...@andrewash.com
 wrote:
  Agree on timeboxed releases as well.
 
  Is there a vision for where we want to be as a project before
 declaring the
  first 1.0 release?  While we're in the 0.x days per semver we can
 break
  backcompat at will (though we try to avoid it where possible), and
 that
  luxury goes away with 1.x  I just don't want to release a 1.0 simply
  because it seems to follow after 0.9 rather than making an intentional
  decision that we're at the point where we can stand by the current
 APIs and
  binary compatibility for the next year or so of the major release.
 
  Until that decision is made as a group I'd rather we do an immediate
  version bump to 0.10.0-SNAPSHOT and then if discussion warrants it
 later,
  replace that with 1.0.0-SNAPSHOT.  It's very easy to go from 0.10 to
 1.0
  but not the other way around.
 
  https://github.com/apache/incubator-spark/pull/542
 
  Cheers!
  Andrew
 
 
  On Wed, Feb 5, 2014 at 9:49 PM, Heiko Braun ike.br...@googlemail.com
 wrote:
 
  +1 on time boxed releases and compatibility guidelines
 
 
  Am 06.02.2014 um 01:20 schrieb Patrick Wendell pwend...@gmail.com
 :
 
  Hi Everyone,
 
  In an effort to coordinate development amongst the growing list of
  Spark contributors, I've taken some time to write up a proposal to
  formalize various pieces of the development process. The next
 release
  of Spark will likely be Spark 1.0.0, so this message is intended in
  part to coordinate the release plan for 1.0.0 and future releases.
  I'll post this on the wiki after discussing it on this thread as
  tentative project guidelines.
 
  == Spark Release Structure ==
  Starting with Spark 1.0.0, the Spark project will follow the
 semantic
  versioning guidelines (http://semver.org/) with a few deviations.
  These small differences account for Spark's nature as a 

Re: Proposal for Spark Release Strategy

2014-02-06 Thread Evan Chan
The other reason for waiting is stability.

It would be great to have as a goal for 1.0.0 that under most heavy
use scenarios, workers and executors don't just die, which is not true
today.
Also, there should be minimal silent failures which are difficult to debug.

On Thu, Feb 6, 2014 at 11:54 AM, Evan Chan e...@ooyala.com wrote:
 +1 for 0.10.0.

 It would give more time to study things (such as the new SparkConf)
 and let the community decide if any breaking API changes are needed.

 Also, a +1 for minor revisions not breaking code compatibility,
 including Scala versions.   (I guess this would mean that 1.x would
 stay on Scala 2.10.x)

 On Thu, Feb 6, 2014 at 11:05 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
 Bleh, hit send to early again.  My second paragraph was to argue for 1.0.0
 instead of 0.10.0, not to hammer on the binary compatibility point.


 On Thu, Feb 6, 2014 at 11:04 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 *Would it make sense to put in something that strongly discourages binary
 incompatible changes when possible?


 On Thu, Feb 6, 2014 at 11:03 AM, Sandy Ryza sandy.r...@cloudera.comwrote:

 Not codifying binary compatibility as a hard rule sounds fine to me.
  Would it make sense to put something in that . I.e. avoid making needless
 changes to class hierarchies.

 Whether Spark considers itself stable or not, users are beginning to
 treat it so.  A responsible project will acknowledge this and provide the
 stability needed by its user base.  I think some projects have made the
 mistake of waiting too long to release a 1.0.0.  It allows them to put off
 making the hard decisions, but users and downstream projects suffer.

 If Spark needs to go through dramatic changes, there's always the option
 of a 2.0.0 that allows for this.

 -Sandy



 On Thu, Feb 6, 2014 at 10:56 AM, Matei Zaharia 
 matei.zaha...@gmail.comwrote:

 I think it's important to do 1.0 next. The project has been around for 4
 years, and I'd be comfortable maintaining the current codebase for a long
 time in an API and binary compatible way through 1.x releases. Over the
 past 4 years we haven't actually had major changes to the user-facing API 
 --
 the only ones were changing the package to org.apache.spark, and upgrading
 the Scala version. I'd be okay leaving 1.x to always use Scala 2.10 for
 example, or later cross-building it for Scala 2.11. Updating to 1.0 says
 two things: it tells users that they can be confident that version will be
 maintained for a long time, which we absolutely want to do, and it lets
 outsiders see that the project is now fairly mature (for many people,
 pre-1.0 might still cause them not to try it). I think both are good for
 the community.

 Regarding binary compatibility, I agree that it's what we should strive
 for, but it just seems premature to codify now. Let's see how it works
 between, say, 1.0 and 1.1, and then we can codify it.

 Matei

 On Feb 6, 2014, at 10:43 AM, Henry Saputra henry.sapu...@gmail.com
 wrote:

  Thanks Patick to initiate the discussion about next road map for
 Apache Spark.
 
  I am +1 for 0.10.0 for next version.
 
  It will give us as community some time to digest the process and the
  vision and make adjustment accordingly.
 
  Release a 1.0.0 is a huge milestone and if we do need to break API
  somehow or modify internal behavior dramatically we could take
  advantage to release 1.0.0 as good step to go to.
 
 
  - Henry
 
 
 
  On Wed, Feb 5, 2014 at 9:52 PM, Andrew Ash and...@andrewash.com
 wrote:
  Agree on timeboxed releases as well.
 
  Is there a vision for where we want to be as a project before
 declaring the
  first 1.0 release?  While we're in the 0.x days per semver we can
 break
  backcompat at will (though we try to avoid it where possible), and
 that
  luxury goes away with 1.x  I just don't want to release a 1.0 simply
  because it seems to follow after 0.9 rather than making an intentional
  decision that we're at the point where we can stand by the current
 APIs and
  binary compatibility for the next year or so of the major release.
 
  Until that decision is made as a group I'd rather we do an immediate
  version bump to 0.10.0-SNAPSHOT and then if discussion warrants it
 later,
  replace that with 1.0.0-SNAPSHOT.  It's very easy to go from 0.10 to
 1.0
  but not the other way around.
 
  https://github.com/apache/incubator-spark/pull/542
 
  Cheers!
  Andrew
 
 
  On Wed, Feb 5, 2014 at 9:49 PM, Heiko Braun ike.br...@googlemail.com
 wrote:
 
  +1 on time boxed releases and compatibility guidelines
 
 
  Am 06.02.2014 um 01:20 schrieb Patrick Wendell pwend...@gmail.com
 :
 
  Hi Everyone,
 
  In an effort to coordinate development amongst the growing list of
  Spark contributors, I've taken some time to write up a proposal to
  formalize various pieces of the development process. The next
 release
  of Spark will likely be Spark 1.0.0, so this message is intended in
  part to coordinate the release plan for 1.0.0 

Re: Proposal for Spark Release Strategy

2014-02-06 Thread Reynold Xin
+1 for 1.0


The point of 1.0 is for us to self-enforce API compatibility in the context
of longer-term support. If we continue down the 0.xx road, we will always
have an excuse for breaking APIs. That said, a major focus of 0.9 and some of
the work that is happening for 1.0 (e.g. configuration, Java 8 closure
support, security) are aimed at better API compatibility support in 1.x releases.

While not perfect, Spark as it stands is already more mature than many (ASF)
projects that are versioned 1.x, 2.x, or even 10.x. Software releases are
always a moving target. 1.0 doesn't mean it is perfect and final. The
project will still evolve.




On Thu, Feb 6, 2014 at 11:54 AM, Evan Chan e...@ooyala.com wrote:

 +1 for 0.10.0.

 It would give more time to study things (such as the new SparkConf)
 and let the community decide if any breaking API changes are needed.

 Also, a +1 for minor revisions not breaking code compatibility,
 including Scala versions.   (I guess this would mean that 1.x would
 stay on Scala 2.10.x)

 On Thu, Feb 6, 2014 at 11:05 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
  Bleh, hit send to early again.  My second paragraph was to argue for
 1.0.0
  instead of 0.10.0, not to hammer on the binary compatibility point.
 
 
  On Thu, Feb 6, 2014 at 11:04 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
 
  *Would it make sense to put in something that strongly discourages
 binary
  incompatible changes when possible?
 
 
  On Thu, Feb 6, 2014 at 11:03 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
 
  Not codifying binary compatibility as a hard rule sounds fine to me.
   Would it make sense to put something in that . I.e. avoid making
 needless
  changes to class hierarchies.
 
  Whether Spark considers itself stable or not, users are beginning to
  treat it so.  A responsible project will acknowledge this and provide
 the
  stability needed by its user base.  I think some projects have made the
  mistake of waiting too long to release a 1.0.0.  It allows them to put
 off
  making the hard decisions, but users and downstream projects suffer.
 
  If Spark needs to go through dramatic changes, there's always the
 option
  of a 2.0.0 that allows for this.
 
  -Sandy
 
 
 
  On Thu, Feb 6, 2014 at 10:56 AM, Matei Zaharia 
 matei.zaha...@gmail.comwrote:
 
  I think it's important to do 1.0 next. The project has been around
 for 4
  years, and I'd be comfortable maintaining the current codebase for a
 long
  time in an API and binary compatible way through 1.x releases. Over
 the
  past 4 years we haven't actually had major changes to the user-facing
 API --
  the only ones were changing the package to org.apache.spark, and
 upgrading
  the Scala version. I'd be okay leaving 1.x to always use Scala 2.10
 for
  example, or later cross-building it for Scala 2.11. Updating to 1.0
 says
  two things: it tells users that they can be confident that version
 will be
  maintained for a long time, which we absolutely want to do, and it
 lets
  outsiders see that the project is now fairly mature (for many people,
  pre-1.0 might still cause them not to try it). I think both are good
 for
  the community.
 
  Regarding binary compatibility, I agree that it's what we should
 strive
  for, but it just seems premature to codify now. Let's see how it works
  between, say, 1.0 and 1.1, and then we can codify it.
 
  Matei
 
  On Feb 6, 2014, at 10:43 AM, Henry Saputra henry.sapu...@gmail.com
  wrote:
 
   Thanks Patick to initiate the discussion about next road map for
  Apache Spark.
  
   I am +1 for 0.10.0 for next version.
  
   It will give us as community some time to digest the process and the
   vision and make adjustment accordingly.
  
   Release a 1.0.0 is a huge milestone and if we do need to break API
   somehow or modify internal behavior dramatically we could take
   advantage to release 1.0.0 as good step to go to.
  
  
   - Henry
  
  
  
   On Wed, Feb 5, 2014 at 9:52 PM, Andrew Ash and...@andrewash.com
  wrote:
   Agree on timeboxed releases as well.
  
   Is there a vision for where we want to be as a project before
  declaring the
   first 1.0 release?  While we're in the 0.x days per semver we can
  break
   backcompat at will (though we try to avoid it where possible), and
  that
   luxury goes away with 1.x  I just don't want to release a 1.0
 simply
   because it seems to follow after 0.9 rather than making an
 intentional
   decision that we're at the point where we can stand by the current
  APIs and
   binary compatibility for the next year or so of the major release.
  
   Until that decision is made as a group I'd rather we do an
 immediate
   version bump to 0.10.0-SNAPSHOT and then if discussion warrants it
  later,
   replace that with 1.0.0-SNAPSHOT.  It's very easy to go from 0.10
 to
  1.0
   but not the other way around.
  
   https://github.com/apache/incubator-spark/pull/542
  
   Cheers!
   Andrew
  
  
   On Wed, Feb 5, 2014 at 9:49 PM, Heiko Braun 
 ike.br...@googlemail.com
  wrote:
  

Re: Proposal for Spark Release Strategy

2014-02-06 Thread Imran Rashid
I don't really agree with this logic.  I think we haven't broken the API so far
because we just keep adding stuff on to it, and we haven't bothered to
clean the API up, specifically to *avoid* breaking things.  Here's a
handful of API-breaking things that we might want to consider:

* should we look at all the various configuration properties, and maybe
some of them should get renamed for consistency / clarity?
* do all of the functions on RDD need to be in core?  or do some of them
that are simple additions built on top of the primitives really belong in a
utils package or something?  E.g., maybe we should get rid of all the
variants of the mapPartitions / mapWith / etc. and just have map and
mapPartitionsWithIndex  (too many choices in the API can also be confusing
to the user)
* are the right things getting tracked in SparkListener?  Do we need to add
or remove anything?

This is probably not the right list of questions, that's just an idea of
the kind of thing we should be thinking about.

It's also fine with me if 1.0 is next; I just think that we ought to be
asking these kinds of questions up and down the entire API before we
release 1.0.  And given that we haven't even started that discussion, it
seems possible that there could be new features we'd like to release in
0.10 before that discussion is finished.
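 
To make the point about the map variants concrete: a mapWith-style operation
can be expressed entirely in terms of mapPartitionsWithIndex, which is one
argument for trimming the extra variants. The sketch below is illustrative
only -- the signatures are simplified and the Partitioned type is a toy
stand-in for an RDD, not the actual Spark API:
 
    // Toy stand-in for an RDD split into partitions; illustrative only.
    object MapVariantsSketch {
      type Partitioned[A] = Seq[Seq[A]]
 
      // The primitive: map over each partition together with its index.
      def mapPartitionsWithIndex[A, B](rdd: Partitioned[A])(
          f: (Int, Iterator[A]) => Iterator[B]): Partitioned[B] =
        rdd.zipWithIndex.map { case (part, i) => f(i, part.iterator).toSeq }
 
      // A mapWith-like helper (constructA builds per-partition state), expressed
      // purely in terms of the primitive above -- convenience, not new power.
      def mapWith[A, S, B](rdd: Partitioned[A])(constructA: Int => S)(
          f: (A, S) => B): Partitioned[B] =
        mapPartitionsWithIndex(rdd) { (i, iter) =>
          val state = constructA(i)
          iter.map(a => f(a, state))
        }
 
      def main(args: Array[String]): Unit = {
        val data: Partitioned[Int] = Seq(Seq(1, 2), Seq(3, 4))
        // Tag each element with its partition index via the mapWith-style helper.
        println(mapWith(data)(i => i)((x, pid) => (pid, x)))
        // List(List((0,1), (0,2)), List((1,3), (1,4)))
      }
    }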



On Thu, Feb 6, 2014 at 12:56 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 I think it's important to do 1.0 next. The project has been around for 4
 years, and I'd be comfortable maintaining the current codebase for a long
 time in an API and binary compatible way through 1.x releases. Over the
 past 4 years we haven't actually had major changes to the user-facing API --
 the only ones were changing the package to org.apache.spark, and upgrading
 the Scala version. I'd be okay leaving 1.x to always use Scala 2.10 for
 example, or later cross-building it for Scala 2.11. Updating to 1.0 says
 two things: it tells users that they can be confident that version will be
 maintained for a long time, which we absolutely want to do, and it lets
 outsiders see that the project is now fairly mature (for many people,
 pre-1.0 might still cause them not to try it). I think both are good for
 the community.

 Regarding binary compatibility, I agree that it's what we should strive
 for, but it just seems premature to codify now. Let's see how it works
 between, say, 1.0 and 1.1, and then we can codify it.

 Matei

 On Feb 6, 2014, at 10:43 AM, Henry Saputra henry.sapu...@gmail.com
 wrote:

  Thanks Patick to initiate the discussion about next road map for Apache
 Spark.
 
  I am +1 for 0.10.0 for next version.
 
  It will give us as community some time to digest the process and the
  vision and make adjustment accordingly.
 
  Release a 1.0.0 is a huge milestone and if we do need to break API
  somehow or modify internal behavior dramatically we could take
  advantage to release 1.0.0 as good step to go to.
 
 
  - Henry
 
 
 
  On Wed, Feb 5, 2014 at 9:52 PM, Andrew Ash and...@andrewash.com wrote:
  Agree on timeboxed releases as well.
 
  Is there a vision for where we want to be as a project before declaring
 the
  first 1.0 release?  While we're in the 0.x days per semver we can break
  backcompat at will (though we try to avoid it where possible), and that
  luxury goes away with 1.x  I just don't want to release a 1.0 simply
  because it seems to follow after 0.9 rather than making an intentional
  decision that we're at the point where we can stand by the current APIs
 and
  binary compatibility for the next year or so of the major release.
 
  Until that decision is made as a group I'd rather we do an immediate
  version bump to 0.10.0-SNAPSHOT and then if discussion warrants it
 later,
  replace that with 1.0.0-SNAPSHOT.  It's very easy to go from 0.10 to 1.0
  but not the other way around.
 
  https://github.com/apache/incubator-spark/pull/542
 
  Cheers!
  Andrew
 
 
  On Wed, Feb 5, 2014 at 9:49 PM, Heiko Braun ike.br...@googlemail.com
 wrote:
 
  +1 on time boxed releases and compatibility guidelines
 
 
  Am 06.02.2014 um 01:20 schrieb Patrick Wendell pwend...@gmail.com:
 
  Hi Everyone,
 
  In an effort to coordinate development amongst the growing list of
  Spark contributors, I've taken some time to write up a proposal to
  formalize various pieces of the development process. The next release
  of Spark will likely be Spark 1.0.0, so this message is intended in
  part to coordinate the release plan for 1.0.0 and future releases.
  I'll post this on the wiki after discussing it on this thread as
  tentative project guidelines.
 
  == Spark Release Structure ==
  Starting with Spark 1.0.0, the Spark project will follow the semantic
  versioning guidelines (http://semver.org/) with a few deviations.
  These small differences account for Spark's nature as a multi-module
  project.
 
  Each Spark release will be versioned:
  [MAJOR].[MINOR].[MAINTENANCE]
 
  All releases with the same major version 

Re: Proposal for Spark Release Strategy

2014-02-06 Thread Sandy Ryza
If the APIs are usable, stability and continuity are much more important
than perfection.  With many already relying on the current APIs, I think
trying to clean them up will just cause pain for users and integrators.
 Hadoop made this mistake when they decided the original MapReduce APIs
were ugly and introduced a new set of APIs to do the same thing.  Even
though this happened in a pre-1.0 release, three years down the road, both
the old and new APIs are still supported, causing endless confusion for
users.  If individual functions or configuration properties have unclear
names, they can be deprecated and replaced, but redoing the APIs or
breaking compatibility at this point is simply not worth it.


On Thu, Feb 6, 2014 at 12:39 PM, Imran Rashid im...@quantifind.com wrote:

 I don't really agree with this logic.  I think we haven't broken API so far
 because we just keep adding stuff on to it, and we haven't bothered to
 clean the api up, specifically to *avoid* breaking things.  Here's a
 handful of api breaking things that we might want to consider:

 * should we look at all the various configuration properties, and maybe
 some of them should get renamed for consistency / clarity?
 * do all of the functions on RDD need to be in core?  or do some of them
 that are simple additions built on top of the primitives really belong in a
 utils package or something?  Eg., maybe we should get rid of all the
 variants of the mapPartitions / mapWith / etc.  just have map, and
 mapPartitionsWithIndex  (too many choices in the api can also be confusing
 to the user)
 * are the right things getting tracked in SparkListener?  Do we need to add
 or remove anything?

 This is probably not the right list of questions, that's just an idea of
 the kind of thing we should be thinking about.

 Its also fine with me if 1.0 is next, I just think that we ought to be
 asking these kinds of questions up and down the entire api before we
 release 1.0.  And given that we haven't even started that discussion, it
 seems possible that there could be new features we'd like to release in
 0.10 before that discussion is finished.



 On Thu, Feb 6, 2014 at 12:56 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  I think it's important to do 1.0 next. The project has been around for 4
  years, and I'd be comfortable maintaining the current codebase for a long
  time in an API and binary compatible way through 1.x releases. Over the
  past 4 years we haven't actually had major changes to the user-facing
 API --
  the only ones were changing the package to org.apache.spark, and
 upgrading
  the Scala version. I'd be okay leaving 1.x to always use Scala 2.10 for
  example, or later cross-building it for Scala 2.11. Updating to 1.0 says
  two things: it tells users that they can be confident that version will
 be
  maintained for a long time, which we absolutely want to do, and it lets
  outsiders see that the project is now fairly mature (for many people,
  pre-1.0 might still cause them not to try it). I think both are good for
  the community.
 
  Regarding binary compatibility, I agree that it's what we should strive
  for, but it just seems premature to codify now. Let's see how it works
  between, say, 1.0 and 1.1, and then we can codify it.
 
  Matei
 
  On Feb 6, 2014, at 10:43 AM, Henry Saputra henry.sapu...@gmail.com
  wrote:
 
   Thanks Patick to initiate the discussion about next road map for Apache
  Spark.
  
   I am +1 for 0.10.0 for next version.
  
   It will give us as community some time to digest the process and the
   vision and make adjustment accordingly.
  
   Release a 1.0.0 is a huge milestone and if we do need to break API
   somehow or modify internal behavior dramatically we could take
   advantage to release 1.0.0 as good step to go to.
  
  
   - Henry
  
  
  
   On Wed, Feb 5, 2014 at 9:52 PM, Andrew Ash and...@andrewash.com
 wrote:
   Agree on timeboxed releases as well.
  
   Is there a vision for where we want to be as a project before
 declaring
  the
   first 1.0 release?  While we're in the 0.x days per semver we can
 break
   backcompat at will (though we try to avoid it where possible), and
 that
   luxury goes away with 1.x  I just don't want to release a 1.0 simply
   because it seems to follow after 0.9 rather than making an intentional
   decision that we're at the point where we can stand by the current
 APIs
  and
   binary compatibility for the next year or so of the major release.
  
   Until that decision is made as a group I'd rather we do an immediate
   version bump to 0.10.0-SNAPSHOT and then if discussion warrants it
  later,
   replace that with 1.0.0-SNAPSHOT.  It's very easy to go from 0.10 to
 1.0
   but not the other way around.
  
   https://github.com/apache/incubator-spark/pull/542
  
   Cheers!
   Andrew
  
  
   On Wed, Feb 5, 2014 at 9:49 PM, Heiko Braun ike.br...@googlemail.com
  wrote:
  
   +1 on time boxed releases and compatibility guidelines
  
  
   Am 06.02.2014 um 

Re: Proposal for Spark Release Strategy

2014-02-06 Thread Patrick Wendell
Just to echo others - The relevant question is whether we want to
advertise stable APIs for users that we will support over a long time
horizon. And doing this is critical to being taken seriously as a
mature project.

The question is not whether or not there are things we want to improve
about Spark (further reduce dependencies, runtime stability, etc) - of
course everyone wants to improve those things!

In the next few months ahead of 1.0 the plan would be to invest effort
in finishing off loose ends in the API and of course, no 1.0 release
candidate will pass muster if these aren't addressed. I only see a few
fairly small blockers though wrt API issues:

- We should mark things that may evolve and change as semi-private
developer APIs (e.g. the SparkListener); one possible way to do this is
sketched below.
- We need to standardize the Java API in a way that supports Java 8 lambdas.
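 
As an illustration of the first point only: one hypothetical shape for such a
marker (the package and name are invented; this is not an existing Spark
annotation at the time of this thread) that readers and API-stability tooling
could key off when deciding what the 1.x compatibility promise covers:
 
    // Hypothetical, for illustration only: flags interfaces that may change
    // between minor releases and are exempt from the 1.x compatibility promise.
    package org.apache.spark.annotation
 
    import scala.annotation.StaticAnnotation
 
    class DeveloperApi extends StaticAnnotation
 
    // Example usage on an evolving listener-style trait:
    //   @DeveloperApi
    //   trait SparkListener { ... }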

Other than that - I don't see many blockers in terms of API changes we
might want to make. A lot of those were dealt with in 0.9 specifically
to prepare for this.

The broader question of API clean-up brings up a debate about the trade-off
of compatibility with older pre-1.0 versions of Spark. This is not
the primary issue under discussion and can be debated separately.

The primary issue at hand is whether to have 1.0 in ~3 months vs
pushing it to ~6 months from now or more.

- Patrick

On Thu, Feb 6, 2014 at 12:49 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 If the APIs are usable, stability and continuity are much more important
 than perfection.  With many already relying on the current APIs, I think
 trying to clean them up will just cause pain for users and integrators.
  Hadoop made this mistake when they decided the original MapReduce APIs
 were ugly and introduced a new set of APIs to do the same thing.  Even
 though this happened in a pre-1.0 release, three years down the road, both
 the old and new APIs are still supported, causing endless confusion for
 users.  If individual functions or configuration properties have unclear
 names, they can be deprecated and replaced, but redoing the APIs or
 breaking compatibility at this point is simply not worth it.


 On Thu, Feb 6, 2014 at 12:39 PM, Imran Rashid im...@quantifind.com wrote:

 I don't really agree with this logic.  I think we haven't broken API so far
 because we just keep adding stuff on to it, and we haven't bothered to
 clean the api up, specifically to *avoid* breaking things.  Here's a
 handful of api breaking things that we might want to consider:

 * should we look at all the various configuration properties, and maybe
 some of them should get renamed for consistency / clarity?
 * do all of the functions on RDD need to be in core?  or do some of them
 that are simple additions built on top of the primitives really belong in a
 utils package or something?  Eg., maybe we should get rid of all the
 variants of the mapPartitions / mapWith / etc.  just have map, and
 mapPartitionsWithIndex  (too many choices in the api can also be confusing
 to the user)
 * are the right things getting tracked in SparkListener?  Do we need to add
 or remove anything?

 This is probably not the right list of questions, that's just an idea of
 the kind of thing we should be thinking about.

 Its also fine with me if 1.0 is next, I just think that we ought to be
 asking these kinds of questions up and down the entire api before we
 release 1.0.  And given that we haven't even started that discussion, it
 seems possible that there could be new features we'd like to release in
 0.10 before that discussion is finished.



 On Thu, Feb 6, 2014 at 12:56 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  I think it's important to do 1.0 next. The project has been around for 4
  years, and I'd be comfortable maintaining the current codebase for a long
  time in an API and binary compatible way through 1.x releases. Over the
  past 4 years we haven't actually had major changes to the user-facing
 API --
  the only ones were changing the package to org.apache.spark, and
 upgrading
  the Scala version. I'd be okay leaving 1.x to always use Scala 2.10 for
  example, or later cross-building it for Scala 2.11. Updating to 1.0 says
  two things: it tells users that they can be confident that version will
 be
  maintained for a long time, which we absolutely want to do, and it lets
  outsiders see that the project is now fairly mature (for many people,
  pre-1.0 might still cause them not to try it). I think both are good for
  the community.
 
  Regarding binary compatibility, I agree that it's what we should strive
  for, but it just seems premature to codify now. Let's see how it works
  between, say, 1.0 and 1.1, and then we can codify it.
 
  Matei
 
  On Feb 6, 2014, at 10:43 AM, Henry Saputra henry.sapu...@gmail.com
  wrote:
 
   Thanks Patick to initiate the discussion about next road map for Apache
  Spark.
  
   I am +1 for 0.10.0 for next version.
  
   It will give us as community some time to digest the process and the
   

Re: Proposal for Spark Release Strategy

2014-02-06 Thread Mark Hamstra
Imran:

 Its also fine with me if 1.0 is next, I just think that we ought to be
 asking these kinds of questions up and down the entire api before we
 release 1.0.


And moving master to 1.0.0-SNAPSHOT doesn't preclude that.  If anything, it
turns that "ought to" into "must" -- which is another way of saying what
Reynold said: The point of 1.0 is for us to self-enforce API compatibility
in the context of longer term support. If we continue down the 0.xx road,
we will always have excuse for breaking APIs.

1.0.0-SNAPSHOT doesn't mean that the API is final right now.  It means that
what is released next will be final over what is intended to be the lengthy
scope of a major release.  That means that adding new features and
functionality (at least to core spark) should be a very low priority for
this development cycle, and establishing the 1.0 API from what is already
in 0.9.0 should be our first priority.  It wouldn't trouble me at all if
not-strictly-necessary new features were left to hang out on the pull
request queue for quite a while until we are ready to add them in 1.1.0, if
we were to do pretty much nothing else during this cycle except to get the
1.0 API to where most of us agree that it is in good shape.

If we're not adding new features and extending the 0.9.0 API, then there
really is no need for a 0.10.0 minor-release, whose main purpose would be
to collect the API additions from 0.9.0.  Bug-fixes go in 0.9.1-SNAPSHOT;
bug-fixes and finalized 1.0 API go in 1.0.0-SNAPSHOT; almost all new
features are put on hold and wait for 1.1.0-SNAPSHOT.

... it seems possible that there could be new features we'd like to release
 in 0.10...


We certainly can add new features to 1.0.0, but they will have to go
through a rigorous review to be certain that they are things that we really
want to commit to keeping going forward.  But after 1.0, that is true for
any new feature proposal unless we create specifically experimental
branches.  So what moving to 1.0.0-SNAPSHOT really means is that we are
saying that we have gone beyond the development phase where more-or-less
experimental features can be added to Spark releases only to be withdrawn
later -- that time is done after 1.0.0-SNAPSHOT.  Now to be fair,
tentative/experimental features have not been added willy-nilly to Spark
over recent releases, and withdrawal/replacement has been about as limited
in scope as could be fairly expected, so this shouldn't be a radically new
and different development paradigm.  There are, though, some experiments
that were added in the past and should probably now be withdrawn (or at
least deprecated in 1.0.0, withdrawn in 1.1.0.)  I'll put my own
contribution of mapWith, filterWith, et al. on the chopping block as an
effort that, at least in its present form, doesn't provide enough extra
over mapPartitionsWithIndex, and whose syntax is awkward enough that I
don't believe these methods have ever been widely used, so that their
inclusion in the 1.0 API is probably not warranted.

There are other elements of Spark that also should be culled and/or
refactored before 1.0.  Imran has listed a few. I'll also suggest that
there are at least parts of alternative Broadcast variable implementations
that should probably be left behind.  In any event, Imran is absolutely
correct that we need to have a discussion about these issues.  Moving to
1.0.0-SNAPSHOT forces us to begin that discussion.

So, I'm +1 for 1.0.0-incubating-SNAPSHOT (and looking forward to losing the
incubating!)




On Thu, Feb 6, 2014 at 12:39 PM, Imran Rashid im...@quantifind.com wrote:

 I don't really agree with this logic.  I think we haven't broken API so far
 because we just keep adding stuff on to it, and we haven't bothered to
 clean the api up, specifically to *avoid* breaking things.  Here's a
 handful of api breaking things that we might want to consider:

 * should we look at all the various configuration properties, and maybe
 some of them should get renamed for consistency / clarity?
 * do all of the functions on RDD need to be in core?  or do some of them
 that are simple additions built on top of the primitives really belong in a
 utils package or something?  Eg., maybe we should get rid of all the
 variants of the mapPartitions / mapWith / etc.  just have map, and
 mapPartitionsWithIndex  (too many choices in the api can also be confusing
 to the user)
 * are the right things getting tracked in SparkListener?  Do we need to add
 or remove anything?

 This is probably not the right list of questions, that's just an idea of
 the kind of thing we should be thinking about.

 Its also fine with me if 1.0 is next, I just think that we ought to be
 asking these kinds of questions up and down the entire api before we
 release 1.0.  And given that we haven't even started that discussion, it
 seems possible that there could be new features we'd like to release in
 0.10 before that discussion is finished.



 On Thu, Feb 6, 2014 at 12:56 PM, Matei Zaharia 

Re: Proposal for Spark Release Strategy

2014-02-06 Thread Mark Hamstra
I'm not sure that that is the conclusion that I would draw from the Hadoop
example.  I would certainly agree that maintaining and supporting both an
old and a new API is a cause of endless confusion for users.  If we are
going to change or drop things from the API to reach 1.0, then we shouldn't
be maintaining and supporting the prior way of doing things beyond a 1.0.0 to
1.1.0 deprecation cycle.
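 
For illustration, that deprecation cycle maps directly onto Scala's
@deprecated annotation: mark the old entry point in 1.0.0 and remove it in
1.1.0. The sketch below uses an invented helper and signature, not the actual
RDD code:
 
    // Hypothetical sketch of the cycle: the old entry point stays through 1.0.x
    // with a warning, then is removed in 1.1.0.
    object DeprecationCycleSketch {
      @deprecated("use mapPartitionsWithIndex instead", "1.0.0")
      def mapWith[T, A, U](xs: Seq[T])(constructA: Int => A)(f: (T, A) => U): Seq[U] = {
        val a = constructA(0)
        xs.map(f(_, a))
      }
 
      def main(args: Array[String]): Unit =
        // Compiles with a deprecation warning on 1.0.x; the method disappears in 1.1.0.
        println(mapWith(Seq(1, 2, 3))(_ => 10)((x, a) => x + a)) // List(11, 12, 13)
    }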


On Thu, Feb 6, 2014 at 12:49 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 If the APIs are usable, stability and continuity are much more important
 than perfection.  With many already relying on the current APIs, I think
 trying to clean them up will just cause pain for users and integrators.
  Hadoop made this mistake when they decided the original MapReduce APIs
 were ugly and introduced a new set of APIs to do the same thing.  Even
 though this happened in a pre-1.0 release, three years down the road, both
 the old and new APIs are still supported, causing endless confusion for
 users.  If individual functions or configuration properties have unclear
 names, they can be deprecated and replaced, but redoing the APIs or
 breaking compatibility at this point is simply not worth it.


 On Thu, Feb 6, 2014 at 12:39 PM, Imran Rashid im...@quantifind.com
 wrote:

  I don't really agree with this logic.  I think we haven't broken API so
 far
  because we just keep adding stuff on to it, and we haven't bothered to
  clean the api up, specifically to *avoid* breaking things.  Here's a
  handful of api breaking things that we might want to consider:
 
  * should we look at all the various configuration properties, and maybe
  some of them should get renamed for consistency / clarity?
  * do all of the functions on RDD need to be in core?  or do some of them
  that are simple additions built on top of the primitives really belong
 in a
  utils package or something?  Eg., maybe we should get rid of all the
  variants of the mapPartitions / mapWith / etc.  just have map, and
  mapPartitionsWithIndex  (too many choices in the api can also be
 confusing
  to the user)
  * are the right things getting tracked in SparkListener?  Do we need to
 add
  or remove anything?
 
  This is probably not the right list of questions, that's just an idea of
  the kind of thing we should be thinking about.
 
  Its also fine with me if 1.0 is next, I just think that we ought to be
  asking these kinds of questions up and down the entire api before we
  release 1.0.  And given that we haven't even started that discussion, it
  seems possible that there could be new features we'd like to release in
  0.10 before that discussion is finished.
 
 
 
  On Thu, Feb 6, 2014 at 12:56 PM, Matei Zaharia matei.zaha...@gmail.com
  wrote:
 
   I think it's important to do 1.0 next. The project has been around for
 4
   years, and I'd be comfortable maintaining the current codebase for a
 long
   time in an API and binary compatible way through 1.x releases. Over the
   past 4 years we haven't actually had major changes to the user-facing
  API --
   the only ones were changing the package to org.apache.spark, and
  upgrading
   the Scala version. I'd be okay leaving 1.x to always use Scala 2.10 for
   example, or later cross-building it for Scala 2.11. Updating to 1.0
 says
   two things: it tells users that they can be confident that version will
  be
   maintained for a long time, which we absolutely want to do, and it lets
   outsiders see that the project is now fairly mature (for many people,
   pre-1.0 might still cause them not to try it). I think both are good
 for
   the community.
  
   Regarding binary compatibility, I agree that it's what we should strive
   for, but it just seems premature to codify now. Let's see how it works
   between, say, 1.0 and 1.1, and then we can codify it.
  
   Matei
  
   On Feb 6, 2014, at 10:43 AM, Henry Saputra henry.sapu...@gmail.com
   wrote:
  
Thanks Patick to initiate the discussion about next road map for
 Apache
   Spark.
   
I am +1 for 0.10.0 for next version.
   
It will give us as community some time to digest the process and the
vision and make adjustment accordingly.
   
Release a 1.0.0 is a huge milestone and if we do need to break API
somehow or modify internal behavior dramatically we could take
advantage to release 1.0.0 as good step to go to.
   
   
- Henry
   
   
   
On Wed, Feb 5, 2014 at 9:52 PM, Andrew Ash and...@andrewash.com
  wrote:
Agree on timeboxed releases as well.
   
Is there a vision for where we want to be as a project before
  declaring
   the
first 1.0 release?  While we're in the 0.x days per semver we can
  break
backcompat at will (though we try to avoid it where possible), and
  that
luxury goes away with 1.x  I just don't want to release a 1.0 simply
because it seems to follow after 0.9 rather than making an
 intentional
decision that we're at the point where we can stand by the current
  APIs
   and

Re: [0.9.0] Possible deadlock in shutdown hook?

2014-02-06 Thread Andrew Ash
I think the solution where we stop the writing threads and then let the
deleting threads completely clean up is the best option since the final
state doesn't have half-deleted temp dirs scattered across the cluster.

How feasible do you think it'd be to interrupt the other threads?
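 
A minimal sketch of the stop-writers-then-delete ordering (every name here is
hypothetical -- this is not the actual Spark shutdown-hook or DiskBlockManager
code, just an illustration of the sequencing being discussed):
 
    import java.io.File
 
    // Hypothetical: interrupt and join the writer threads first, then delete the
    // temp directories, so cleanup cannot race against ongoing writes.
    object ShutdownCleanupSketch {
      def registerCleanup(writers: Seq[Thread], tempDirs: Seq[File]): Unit =
        Runtime.getRuntime.addShutdownHook(new Thread("cleanup-temp-dirs") {
          override def run(): Unit = {
            writers.foreach(_.interrupt())      // stop producing new files
            writers.foreach(_.join(10000))      // bounded wait so the hook cannot hang forever
            tempDirs.foreach(deleteRecursively) // now deletion can actually finish
          }
        })
 
      private def deleteRecursively(f: File): Unit = {
        Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
        f.delete()
      }
    }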


On Thu, Feb 6, 2014 at 10:54 AM, Mridul Muralidharan mri...@gmail.com wrote:

 Looks like a pathological corner case here - where the delete
 thread is not getting run while the OS is busy prioritizing the thread
 writing data (probably with heavy GC too).
 Ideally, the delete thread would list files, remove them, and then fail
 when it tries to remove the non-empty directory (since the other thread
 might be creating more in parallel).


 Regards,
 Mridul


 On Thu, Feb 6, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com wrote:
  Got a repro locally on my MBP (the other was on a CentOS machine).
 
  Build spark, run a master and a worker with the sbin/start-all.sh script,
  then run this in a shell:
 
  import org.apache.spark.storage.StorageLevel._
  val s = sc.parallelize(1 to 10).persist(MEMORY_AND_DISK_SER);
  s.count
 
  After about a minute, this line appears in the shell logging output:
 
  14/02/06 02:44:44 WARN BlockManagerMasterActor: Removing BlockManager
  BlockManagerId(0, aash-mbp.dyn.yojoe.local, 57895, 0) with no recent
 heart
  beats: 57510ms exceeds 45000ms
 
  Ctrl-C the shell.  In jps there is now a worker, a master, and a
  CoarseGrainedExecutorBackend.
 
  Run jstack on the CGEBackend JVM, and I got the attached stacktraces.  I
  waited around for 15min then kill -9'd the JVM and restarted the process.
 
  I wonder if what's happening here is that the threads that are spewing
 data
  to disk (as that parallelize and persist would do) can write to disk
 faster
  than the cleanup threads can delete from disk.
 
  What do you think of that theory?
 
 
  Andrew
 
 
 
  On Thu, Feb 6, 2014 at 2:30 AM, Mridul Muralidharan mri...@gmail.com
  wrote:
 
  Shutdown hooks should not take 15 mins as you mentioned!
  On the other hand, how busy was your disk when this was happening?
  (either due to Spark or something else?)
 
  It might just be that there was a lot of stuff to remove?
 
  Regards,
  Mridul
 
 
  On Thu, Feb 6, 2014 at 3:50 PM, Andrew Ash and...@andrewash.com
 wrote:
   Hi Spark devs,
  
   Occasionally when hitting Ctrl-C in the scala spark shell on 0.9.0 one
   of
   my workers goes dead in the spark master UI.  I'm using the standalone
   cluster and didn't ever see this while using 0.8.0 so I think it may
 be
   a
   regression.
  
   When I prod on the hung CoarseGrainedExecutorBackend JVM with jstack
 and
   jmap -heap, it doesn't respond unless I add the -F force flag.  The
 heap
   isn't full, but there are some interesting bits in the jstack.  Poking
   around a little, I think there may be some kind of deadlock in the
   shutdown
   hooks.
  
   Below are the threads I think are most interesting:
  
    Thread 14308: (state = BLOCKED)
     - java.lang.Shutdown.exit(int) @bci=96, line=212 (Interpreted frame)
     - java.lang.Runtime.exit(int) @bci=14, line=109 (Interpreted frame)
     - java.lang.System.exit(int) @bci=4, line=962 (Interpreted frame)
     - org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(java.lang.Object, scala.Function1) @bci=352, line=81 (Interpreted frame)
     - akka.actor.ActorCell.receiveMessage(java.lang.Object) @bci=25, line=498 (Interpreted frame)
     - akka.actor.ActorCell.invoke(akka.dispatch.Envelope) @bci=39, line=456 (Interpreted frame)
     - akka.dispatch.Mailbox.processMailbox(int, long) @bci=24, line=237 (Interpreted frame)
     - akka.dispatch.Mailbox.run() @bci=20, line=219 (Interpreted frame)
     - akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec() @bci=4, line=386 (Interpreted frame)
     - scala.concurrent.forkjoin.ForkJoinTask.doExec() @bci=10, line=260 (Compiled frame)
     - scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(scala.concurrent.forkjoin.ForkJoinTask) @bci=10, line=1339 (Compiled frame)
     - scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=11, line=1979 (Compiled frame)
     - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107 (Interpreted frame)
 
    Thread 3865: (state = BLOCKED)
     - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
     - java.lang.Thread.join(long) @bci=38, line=1280 (Interpreted frame)
     - java.lang.Thread.join() @bci=2, line=1354 (Interpreted frame)
     - java.lang.ApplicationShutdownHooks.runHooks() @bci=87, line=106 (Interpreted frame)
     - java.lang.ApplicationShutdownHooks$1.run() @bci=0, line=46 (Interpreted frame)
     - java.lang.Shutdown.runHooks() @bci=39, line=123 (Interpreted frame)
     - java.lang.Shutdown.sequence() @bci=26, line=167 (Interpreted frame)
     - java.lang.Shutdown.exit(int) @bci=96, 

Is there any way to make a quick test on some pre-commit code?

2014-02-06 Thread Nan Zhu
Hi, all 

Is it always necessary to run sbt assembly when you want to test some code?
 
Sometimes you just repeatedly change one or two lines for a failing test
case, and it is really time-consuming to run sbt assembly every time.
 
Is there any faster way?

Best, 

-- 
Nan Zhu



Re: Is there any way to make a quick test on some pre-commit code?

2014-02-06 Thread Reynold Xin
You can do

sbt/sbt assemble-deps


and then just run

sbt/sbt package

each time.


You can even do

sbt/sbt ~package

for automatic incremental compilation.



On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hi, all

 Is it always necessary to run sbt assembly when you want to test some code,

 Sometimes you just repeatedly change one or two lines for some failed test
 case, it is really time-consuming to sbt assembly every time

 any faster way?

 Best,

 --
 Nan Zhu




Re: Is there any way to make a quick test on some pre-commit code?

2014-02-06 Thread Nan Zhu
Thank you very much, Reynold 

-- 
Nan Zhu



On Thursday, February 6, 2014 at 7:50 PM, Reynold Xin wrote:

 You can do
 
 sbt/sbt assemble-deps
 
 
 and then just run
 
 sbt/sbt package
 
 each time.
 
 
 You can even do
 
 sbt/sbt ~package
 
 for automatic incremental compilation.
 
 
 
 On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
 
  Hi, all
  
  Is it always necessary to run sbt assembly when you want to test some code,
  
  Sometimes you just repeatedly change one or two lines for some failed test
  case, it is really time-consuming to sbt assembly every time
  
  any faster way?
  
  Best,
  
  --
  Nan Zhu
  
 
 
 




Re: Discussion on strategy or roadmap should happen on dev@ list

2014-02-06 Thread Andrew Ash
I'd prefer a spark-jira-activity list so I can filter appropriately.  A
separate request, but a spark-commits list that has all commits emailed
to it has been helpful at work to keep a pulse on activity.


On Thu, Feb 6, 2014 at 6:27 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Henry (or anyone else), do you have any preference on sending these
 directly to “dev” versus creating another list for “issues”? I guess we can
 try “dev” for a while and let people decide if it gets too spammy. We’ll
 just have to advertise it in advance.

 Matei

 On Feb 6, 2014, at 9:55 AM, Henry Saputra henry.sapu...@gmail.com wrote:

  HI Matei, yeah please subscribe it for now. Once we have ASF JIRA
  setup for Spark it will happen automatically.
 
  - Henry
 
  On Wed, Feb 5, 2014 at 2:56 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Hey Henry, this makes sense. I’d like to add that one other vehicle for
 discussion has been JIRA at
 https://spark-project.atlassian.net/browse/SPARK. Right now the dev list
 is not subscribed to JIRA, but we’d be happy to subscribe it anytime if
 that helps. We were hoping to do this only when JIRA has been moved to the
 ASF, since infra can set up the forwarding automatically. But most major
 discussions (e.g. https://spark-project.atlassian.net/browse/SPARK-964,
 https://spark-project.atlassian.net/browse/SPARK-969) happen there. I
 think this is the model we want to have in the future — most other projects
 I’ve participated in also used JIRA for their discussion, and mirrored to
 either the “dev” list or an “issues” list.
 
  Matei
 
  On Feb 5, 2014, at 2:49 PM, Henry Saputra henry.sapu...@gmail.com
 wrote:
 
  Hi Guys,
 
  Just friendly reminder, some of you guys may work closely or
  collaborate outside the dev@ list and sometimes it is easier.
  But, as part of Apache Software Foundation project, any decision or
  outcome that could or will be implemented in the Apache Spark need to
  happen in the dev@ list as we are open and collaborative as community.
 
  If offline discussions happen please forward the history or potential
  solution to the dev@ list before any action taken.
 
  Most of us work remote so email is the official channel of discussion
  about stuff related to development in Spark.
 
  Github pull request is not the appropriate vehicle for technical
  discussions. It is used primarily for review of proposed patch which
  means initial problem most of the times had been identified and
  discussed.
 
  Thanks for understanding.
 
  - Henry
 




Notice: JIRA messages will be forwarded to this list

2014-02-06 Thread Matei Zaharia
As part of ASF policy, we need to archive all development discussions on a 
mailing list, so I’m going to do this on the dev list for now. You can filter 
these messages by the sender, which will be j...@spark-project.atlassian.net. 
If the list becomes too spammy as a result, we can create a separate “issues” 
list later, but I just want to have something in place to archive them for now.

Matei

Re: Discussion on strategy or roadmap should happen on dev@ list

2014-02-06 Thread Nan Zhu
Hi, Matei,  

Does it mean that I will receive notifications on every issue, instead of 
just the ones I’m watching?  

Best,  

--  
Nan Zhu


On Thursday, February 6, 2014 at 10:56 PM, Matei Zaharia wrote:

 You can already get commits on comm...@spark.incubator.apache.org 
 (mailto:comm...@spark.incubator.apache.org) actually.
  
 Regarding JIRA issues, I think I’ll just try putting them on dev for now, and 
 we can move to a separate list later. If you’d like to filter them, the 
 address is j...@spark-project.atlassian.net 
 (mailto:j...@spark-project.atlassian.net).
  
 Matei
  
 On Feb 6, 2014, at 6:30 PM, Andrew Ash and...@andrewash.com 
 (mailto:and...@andrewash.com) wrote:
  
  I'd prefer a spark-jira-activity list so I can filter appropriately. A
  separate request, but a spark-commits list that has all commits emailed
  to it has been helpful at work to keep a pulse on activity.
   
   
  On Thu, Feb 6, 2014 at 6:27 PM, Matei Zaharia matei.zaha...@gmail.com 
  (mailto:matei.zaha...@gmail.com)wrote:
   
   Henry (or anyone else), do you have any preference on sending these
   directly to “dev versus creating another list for “issues”? I guess we 
   can
   try “dev” for a while and let people decide if it gets too spammy. We’ll
   just have to advertise it in advance.

   Matei

   On Feb 6, 2014, at 9:55 AM, Henry Saputra henry.sapu...@gmail.com 
   (mailto:henry.sapu...@gmail.com) wrote:

HI Matei, yeah please subscribe it for now. Once we have ASF JIRA
setup for Spark it will happen automatically.
 
- Henry
 
On Wed, Feb 5, 2014 at 2:56 PM, Matei Zaharia matei.zaha...@gmail.com 
(mailto:matei.zaha...@gmail.com)
   wrote:
 Hey Henry, this makes sense. I’d like to add that one other vehicle 
 for
 

   discussion has been JIRA at
   https://spark-project.atlassian.net/browse/SPARK. Right now the dev list
   is not subscribed to JIRA, but we’d be happy to subscribe it anytime if
   that helps. We were hoping to do this only when JIRA has been moved to the
   ASF, since infra can set up the forwarding automatically. But most major
   discussions (e.g. https://spark-project.atlassian.net/browse/SPARK-964,
   https://spark-project.atlassian.net/browse/SPARK-969) happen there. I
   think this is the model we want to have in the future — most other 
   projects
   I’ve participated in also used JIRA for their discussion, and mirrored to
   either the “dev” list or an “issues” list.
  
 Matei
  
 On Feb 5, 2014, at 2:49 PM, Henry Saputra henry.sapu...@gmail.com 
 (mailto:henry.sapu...@gmail.com)
   wrote:
  
  Hi Guys,
   
  Just friendly reminder, some of you guys may work closely or
  collaborate outside the dev@ list and sometimes it is easier.
  But, as part of Apache Software Foundation project, any decision or
  outcome that could or will be implemented in the Apache Spark need 
  to
  happen in the dev@ list as we are open and collaborative as 
  community.
   
  If offline discussions happen please forward the history or 
  potential
  solution to the dev@ list before any action taken.
   
  Most of us work remote so email is the official channel of 
  discussion
  about stuff related to development in Spark.
   
  Github pull request is not the appropriate vehicle for technical
  discussions. It is used primarily for review of proposed patch which
  means initial problem most of the times had been identified and
  discussed.
   
  Thanks for understanding.
   
  - Henry  



Test

2014-02-06 Thread Matei Zaharia
Just sending a test email from another email address to check for bounces..


Discussion on strategy or roadmap should happen on dev@ list

2014-02-06 Thread Henry Saputra
I'd say let's try with the dev@ list and see how spammy it is.
 
If it gets too noisy we could always create an issues@ list.

- Henry

On Thursday, February 6, 2014, Matei Zaharia matei.zaha...@gmail.com wrote:

 Henry (or anyone else), do you have any preference on sending these
 directly to “dev versus creating another list for “issues”? I guess we can
 try “dev” for a while and let people decide if it gets too spammy. We’ll
 just have to advertise it in advance.

 Matei

 On Feb 6, 2014, at 9:55 AM, Henry Saputra henry.sapu...@gmail.com wrote:

  HI Matei, yeah please subscribe it for now. Once we have ASF JIRA
  setup for Spark it will happen automatically.
 
  - Henry
 
  On Wed, Feb 5, 2014 at 2:56 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Hey Henry, this makes sense. I’d like to add that one other vehicle for
 discussion has been JIRA at
 https://spark-project.atlassian.net/browse/SPARK. Right now the dev list
 is not subscribed to JIRA, but we’d be happy to subscribe it anytime if
 that helps. We were hoping to do this only when JIRA has been moved to the
 ASF, since infra can set up the forwarding automatically. But most major
 discussions (e.g. https://spark-project.atlassian.net/browse/SPARK-964,
 https://spark-project.atlassian.net/browse/SPARK-969) happen there. I
 think this is the model we want to have in the future — most other projects
 I’ve participated in also used JIRA for their discussion, and mirrored to
 either the “dev” list or an “issues” list.
 
  Matei
 
  On Feb 5, 2014, at 2:49 PM, Henry Saputra henry.sapu...@gmail.com
 wrote:
 
  Hi Guys,
 
  Just friendly reminder, some of you guys may work closely or
  collaborate outside the dev@ list and sometimes it is easier.
  But, as part of Apache Software Foundation project, any decision or
  outcome that could or will be implemented in the Apache Spark need to
  happen in the dev@ list as we are open and collaborative as community.
 
  If offline discussions happen please forward the history or potential
  solution to the dev@ list before any action taken.
 
  Most of us work remote so email is the official channel of discussion
  about stuff related to development in Spark.
 
  Github pull request is not the appropriate vehicle for technical
  discussions. It is used primarily for review of proposed patch which
  means initial problem most of the times had been identified and
  discussed.
 
  Thanks for understanding.
 
  - Henry
 




Re: Discussion on strategy or roadmap should happen on dev@ list

2014-02-06 Thread Matei Zaharia
Yes, notifications on every issue will be sent to 
dev@spark.incubator.apache.org. You can filter them out though by matching on 
(sender = j...@spark-project.atlassian.net AND to = 
dev@spark.incubator.apache.org). Should be fairly straightforward with Gmail.

Matei

On Feb 6, 2014, at 8:01 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hi, Matei,  
 
 Does it mean that I will receive the notification on every issue, instead of 
 just the ones I’m watching on?  
 
 Best,  
 
 --  
 Nan Zhu
 
 
 On Thursday, February 6, 2014 at 10:56 PM, Matei Zaharia wrote:
 
 You can already get commits on comm...@spark.incubator.apache.org 
 (mailto:comm...@spark.incubator.apache.org) actually.
 
 Regarding JIRA issues, I think I’ll just try putting them on dev for now, 
 and we can move to a separate list later. If you’d like to filter them, the 
 address is j...@spark-project.atlassian.net 
 (mailto:j...@spark-project.atlassian.net).
 
 Matei
 
 On Feb 6, 2014, at 6:30 PM, Andrew Ash and...@andrewash.com 
 (mailto:and...@andrewash.com) wrote:
 
 I'd prefer a spark-jira-activity list so I can filter appropriately. A
 separate request, but a spark-commits list that has all commits emailed
 to it has been helpful at work to keep a pulse on activity.
 
 
 On Thu, Feb 6, 2014 at 6:27 PM, Matei Zaharia matei.zaha...@gmail.com 
 (mailto:matei.zaha...@gmail.com)wrote:
 
 Henry (or anyone else), do you have any preference on sending these
 directly to “dev versus creating another list for “issues”? I guess we can
 try “dev” for a while and let people decide if it gets too spammy. We’ll
 just have to advertise it in advance.
 
 Matei
 
 On Feb 6, 2014, at 9:55 AM, Henry Saputra henry.sapu...@gmail.com 
 (mailto:henry.sapu...@gmail.com) wrote:
 
 HI Matei, yeah please subscribe it for now. Once we have ASF JIRA
 setup for Spark it will happen automatically.
 
 - Henry
 
 On Wed, Feb 5, 2014 at 2:56 PM, Matei Zaharia matei.zaha...@gmail.com 
 (mailto:matei.zaha...@gmail.com)
 wrote:
 Hey Henry, this makes sense. I’d like to add that one other vehicle for
 
 
 discussion has been JIRA at
 https://spark-project.atlassian.net/browse/SPARK. Right now the dev list
 is not subscribed to JIRA, but we’d be happy to subscribe it anytime if
 that helps. We were hoping to do this only when JIRA has been moved to the
 ASF, since infra can set up the forwarding automatically. But most major
 discussions (e.g. https://spark-project.atlassian.net/browse/SPARK-964,
 https://spark-project.atlassian.net/browse/SPARK-969) happen there. I
 think this is the model we want to have in the future — most other projects
 I’ve participated in also used JIRA for their discussion, and mirrored to
 either the “dev” list or an “issues” list.
 
 Matei
 
 On Feb 5, 2014, at 2:49 PM, Henry Saputra henry.sapu...@gmail.com 
 (mailto:henry.sapu...@gmail.com)
 wrote:
 
 Hi Guys,
 
 Just friendly reminder, some of you guys may work closely or
 collaborate outside the dev@ list and sometimes it is easier.
 But, as part of Apache Software Foundation project, any decision or
 outcome that could or will be implemented in the Apache Spark need to
 happen in the dev@ list as we are open and collaborative as community.
 
 If offline discussions happen please forward the history or potential
 solution to the dev@ list before any action taken.
 
 Most of us work remote so email is the official channel of discussion
 about stuff related to development in Spark.
 
 Github pull request is not the appropriate vehicle for technical
 discussions. It is used primarily for review of proposed patch which
 means initial problem most of the times had been identified and
 discussed.
 
 Thanks for understanding.
 
 - Henry  
 



Re: Discussion on strategy or roadmap should happen on dev@ list

2014-02-06 Thread Reynold Xin
We can try it on dev, but I personally find the JIRA notifications pretty
spammy ... It will clutter the dev list, and make it harder to search for
useful information here.


On Thu, Feb 6, 2014 at 6:27 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Henry (or anyone else), do you have any preference on sending these
 directly to dev versus creating another list for issues? I guess we can
 try dev for a while and let people decide if it gets too spammy. We'll
 just have to advertise it in advance.

 Matei

 On Feb 6, 2014, at 9:55 AM, Henry Saputra henry.sapu...@gmail.com wrote:

  HI Matei, yeah please subscribe it for now. Once we have ASF JIRA
  setup for Spark it will happen automatically.
 
  - Henry
 
  On Wed, Feb 5, 2014 at 2:56 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Hey Henry, this makes sense. I'd like to add that one other vehicle for
 discussion has been JIRA at
 https://spark-project.atlassian.net/browse/SPARK. Right now the dev list
 is not subscribed to JIRA, but we'd be happy to subscribe it anytime if
 that helps. We were hoping to do this only when JIRA has been moved to the
 ASF, since infra can set up the forwarding automatically. But most major
 discussions (e.g. https://spark-project.atlassian.net/browse/SPARK-964,
 https://spark-project.atlassian.net/browse/SPARK-969) happen there. I
 think this is the model we want to have in the future -- most other projects
 I've participated in also used JIRA for their discussion, and mirrored to
 either the dev list or an issues list.
 
  Matei
 
  On Feb 5, 2014, at 2:49 PM, Henry Saputra henry.sapu...@gmail.com
 wrote:
 
  Hi Guys,
 
  Just friendly reminder, some of you guys may work closely or
  collaborate outside the dev@ list and sometimes it is easier.
  But, as part of Apache Software Foundation project, any decision or
  outcome that could or will be implemented in the Apache Spark need to
  happen in the dev@ list as we are open and collaborative as community.
 
  If offline discussions happen please forward the history or potential
  solution to the dev@ list before any action taken.
 
  Most of us work remote so email is the official channel of discussion
  about stuff related to development in Spark.
 
  Github pull request is not the appropriate vehicle for technical
  discussions. It is used primarily for review of proposed patch which
  means initial problem most of the times had been identified and
  discussed.
 
  Thanks for understanding.
 
  - Henry
 




Re: Discussion on strategy or roadmap should happen on dev@ list

2014-02-06 Thread Matei Zaharia
Henry (or anyone else), do you have any preference on sending these directly to 
“dev” versus creating another list for “issues”? I guess we can try “dev” for a 
while and let people decide if it gets too spammy. We’ll just have to advertise 
it in advance.

Matei

On Feb 6, 2014, at 9:55 AM, Henry Saputra henry.sapu...@gmail.com wrote:

 HI Matei, yeah please subscribe it for now. Once we have ASF JIRA
 setup for Spark it will happen automatically.
 
 - Henry
 
 On Wed, Feb 5, 2014 at 2:56 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Hey Henry, this makes sense. I’d like to add that one other vehicle for 
 discussion has been JIRA at 
 https://spark-project.atlassian.net/browse/SPARK. Right now the dev list is 
 not subscribed to JIRA, but we’d be happy to subscribe it anytime if that 
 helps. We were hoping to do this only when JIRA has been moved to the ASF, 
 since infra can set up the forwarding automatically. But most major 
 discussions (e.g. https://spark-project.atlassian.net/browse/SPARK-964, 
 https://spark-project.atlassian.net/browse/SPARK-969) happen there. I think 
 this is the model we want to have in the future — most other projects I’ve 
 participated in also used JIRA for their discussion, and mirrored to either 
 the “dev” list or an “issues” list.
 
 Matei
 
 On Feb 5, 2014, at 2:49 PM, Henry Saputra henry.sapu...@gmail.com wrote:
 
 Hi Guys,
 
 Just friendly reminder, some of you guys may work closely or
 collaborate outside the dev@ list and sometimes it is easier.
 But, as part of Apache Software Foundation project, any decision or
 outcome that could or will be implemented in the Apache Spark need to
 happen in the dev@ list as we are open and collaborative as community.
 
 If offline discussions happen please forward the history or potential
 solution to the dev@ list before any action taken.
 
 Most of us work remote so email is the official channel of discussion
 about stuff related to development in Spark.
 
 Github pull request is not the appropriate vehicle for technical
 discussions. It is used primarily for review of proposed patch which
 means initial problem most of the times had been identified and
 discussed.
 
 Thanks for understanding.
 
 - Henry
 



Re: Is there any way to make a quick test on some pre-commit code?

2014-02-06 Thread Mridul Muralidharan
This is neat, thanks Reynold!

Regards,
Mridul

On Fri, Feb 7, 2014 at 6:20 AM, Reynold Xin r...@databricks.com wrote:
 You can do

 sbt/sbt assemble-deps


 and then just run

 sbt/sbt package

 each time.


 You can even do

 sbt/sbt ~package

 for automatic incremental compilation.



 On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hi, all

 Is it always necessary to run sbt assembly when you want to test some code,

 Sometimes you just repeatedly change one or two lines for some failed test
 case, it is really time-consuming to sbt assembly every time

 any faster way?

 Best,

 --
 Nan Zhu




Re: [0.9.0] Possible deadlock in shutdown hook?

2014-02-06 Thread Andrew Ash
There is probably just one threadpool that has task threads -- is it
possible to enumerate and interrupt just those?  We may need to string a
reference to that threadpool through to the shutdown thread to make that
happen.


On Thu, Feb 6, 2014 at 10:36 PM, Mridul Muralidharan mri...@gmail.com wrote:

 Ideally, interrupting the thread writing to disk should be sufficient
 - though since we are in middle of shutdown when this is happening, it
 is best case effort anyway.
 Identifying which threads to interrupt will be interesting since most
 of them are driven by threadpool's and we cant list all threads and
 interrupt all of them !


 Regards,
 Mridul


 On Fri, Feb 7, 2014 at 5:57 AM, Andrew Ash and...@andrewash.com wrote:
  I think the solution where we stop the writing threads and then let the
  deleting threads completely clean up is the best option since the final
  state doesn't have half-deleted temp dirs scattered across the cluster.
 
  How feasible do you think it'd be to interrupt the other threads?
 
 
  On Thu, Feb 6, 2014 at 10:54 AM, Mridul Muralidharan mri...@gmail.com
 wrote:
 
  Looks like a pathological corner case here - where the the delete
  thread is not getting run while the OS is busy prioritizing the thread
  writing data (probably with heavy gc too).
  Ideally, the delete thread would list files, remove them and then fail
  when it tries to remove the non empty directory (since other thread
  might be creating more in parallel).
 
 
  Regards,
  Mridul
 
 
  On Thu, Feb 6, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com
 wrote:
   Got a repro locally on my MBP (the other was on a CentOS machine).
  
   Build spark, run a master and a worker with the sbin/start-all.sh
 script,
   then run this in a shell:
  
   import org.apache.spark.storage.StorageLevel._
   val s = sc.parallelize(1 to 10).persist(MEMORY_AND_DISK_SER);
   s.count
  
   After about a minute, this line appears in the shell logging output:
  
   14/02/06 02:44:44 WARN BlockManagerMasterActor: Removing BlockManager
   BlockManagerId(0, aash-mbp.dyn.yojoe.local, 57895, 0) with no recent
  heart
   beats: 57510ms exceeds 45000ms
  
   Ctrl-C the shell.  In jps there is now a worker, a master, and a
   CoarseGrainedExecutorBackend.
  
   Run jstack on the CGEBackend JVM, and I got the attached stacktraces.
  I
   waited around for 15min then kill -9'd the JVM and restarted the
 process.
  
   I wonder if what's happening here is that the threads that are spewing
  data
   to disk (as that parallelize and persist would do) can write to disk
  faster
   than the cleanup threads can delete from disk.
  
   What do you think of that theory?
  
  
   Andrew
  
  
  
   On Thu, Feb 6, 2014 at 2:30 AM, Mridul Muralidharan mri...@gmail.com
 
   wrote:
  
   shutdown hooks should not take 15 mins are you mentioned !
   On the other hand, how busy was your disk when this was happening ?
   (either due to spark or something else ?)
  
   It might just be that there was a lot of stuff to remove ?
  
   Regards,
   Mridul
  
  
   On Thu, Feb 6, 2014 at 3:50 PM, Andrew Ash and...@andrewash.com
  wrote:
Hi Spark devs,
   
Occasionally when hitting Ctrl-C in the scala spark shell on 0.9.0
 one
of
my workers goes dead in the spark master UI.  I'm using the
 standalone
cluster and didn't ever see this while using 0.8.0 so I think it
 may
  be
a
regression.
   
When I prod on the hung CoarseGrainedExecutorBackend JVM with
 jstack
  and
jmap -heap, it doesn't respond unless I add the -F force flag.  The
  heap
isn't full, but there are some interesting bits in the jstack.
  Poking
around a little, I think there may be some kind of deadlock in the
shutdown
hooks.
   
Below are the threads I think are most interesting:
   
Thread 14308: (state = BLOCKED)
 - java.lang.Shutdown.exit(int) @bci=96, line=212 (Interpreted
 frame)
 - java.lang.Runtime.exit(int) @bci=14, line=109 (Interpreted
 frame)
 - java.lang.System.exit(int) @bci=4, line=962 (Interpreted frame)
 -
   
   
 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(java.lang.Object,
scala.Function1) @bci=352, line=81 (Interpreted frame)
 - akka.actor.ActorCell.receiveMessage(java.lang.Object) @bci=25,
line=498
(Interpreted frame)
 - akka.actor.ActorCell.invoke(akka.dispatch.Envelope) @bci=39,
  line=456
(Interpreted frame)
 - akka.dispatch.Mailbox.processMailbox(int, long) @bci=24,
 line=237
(Interpreted frame)
 - akka.dispatch.Mailbox.run() @bci=20, line=219 (Interpreted
 frame)
 -
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec()
@bci=4, line=386 (Interpreted frame)
 - scala.concurrent.forkjoin.ForkJoinTask.doExec() @bci=10,
 line=260
(Compiled frame)
 -
   
   
 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(scala.concurrent.forkjoin.ForkJoinTask)
@bci=10, line=1339 

Re: Is there any way to make a quick test on some pre-commit code?

2014-02-06 Thread Henry Saputra
+1

On Thu, Feb 6, 2014 at 10:46 PM, Patrick Wendell pwend...@gmail.com wrote:
 We should document this on the wiki!

 On Thu, Feb 6, 2014 at 10:37 PM, Mridul Muralidharan mri...@gmail.com wrote:
 This is neat, thanks Reynold !

 Regards,
 Mridul

 On Fri, Feb 7, 2014 at 6:20 AM, Reynold Xin r...@databricks.com wrote:
 You can do

 sbt/sbt assemble-deps


 and then just run

 sbt/sbt package

 each time.


 You can even do

 sbt/sbt ~package

 for automatic incremental compilation.



 On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hi, all

 Is it always necessary to run sbt assembly when you want to test some code,

 Sometimes you just repeatedly change one or two lines for some failed test
 case, it is really time-consuming to sbt assembly every time

 any faster way?

 Best,

 --
 Nan Zhu




Re: Is there any way to make a quick test on some pre-commit code?

2014-02-06 Thread Nan Zhu
+1 

-- 
Nan Zhu


On Friday, February 7, 2014 at 1:59 AM, Henry Saputra wrote:

 +1
 
 On Thu, Feb 6, 2014 at 10:46 PM, Patrick Wendell pwend...@gmail.com 
 (mailto:pwend...@gmail.com) wrote:
  We should document this on the wiki!
  
  On Thu, Feb 6, 2014 at 10:37 PM, Mridul Muralidharan mri...@gmail.com 
  (mailto:mri...@gmail.com) wrote:
   This is neat, thanks Reynold !
   
   Regards,
   Mridul
   
   On Fri, Feb 7, 2014 at 6:20 AM, Reynold Xin r...@databricks.com 
   (mailto:r...@databricks.com) wrote:
You can do

sbt/sbt assemble-deps


and then just run

sbt/sbt package

each time.


You can even do

sbt/sbt ~package

for automatic incremental compilation.



On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu zhunanmcg...@gmail.com 
(mailto:zhunanmcg...@gmail.com) wrote:

 Hi, all
 
 Is it always necessary to run sbt assembly when you want to test some 
 code,
 
 Sometimes you just repeatedly change one or two lines for some failed 
 test
 case, it is really time-consuming to sbt assembly every time
 
 any faster way?
 
 Best,
 
 --
 Nan Zhu
 


   
   
  
  
 
 
 




Re: [0.9.0] Possible deadlock in shutdown hook?

2014-02-06 Thread Tathagata Das
It's highly likely that the executor's threadpool that runs the tasks is the
only set of threads that writes to disk. The tasks are designed to be
interrupted when the corresponding job is cancelled. So a reasonably simple
way could be to actually cancel the currently active jobs, which would send
the signal to the worker to stop the tasks. Currently, the DAGScheduler
(https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L610)
does not seem to actually cancel the jobs, only mark them as failed. So it
may be a simple addition.

There may be some complications with the external spilling of shuffle data
to disk not stopping immediately when the task is marked for killing. Gotta
try it out.

TD

On Thu, Feb 6, 2014 at 10:39 PM, Andrew Ash and...@andrewash.com wrote:

 There is probably just one threadpool that has task threads -- is it
 possible to enumerate and interrupt just those?  We may need to keep string
 a reference to that threadpool through to the shutdown thread to make that
 happen.


 On Thu, Feb 6, 2014 at 10:36 PM, Mridul Muralidharan mri...@gmail.com
 wrote:

  Ideally, interrupting the thread writing to disk should be sufficient
  - though since we are in middle of shutdown when this is happening, it
  is best case effort anyway.
  Identifying which threads to interrupt will be interesting since most
  of them are driven by threadpool's and we cant list all threads and
  interrupt all of them !
 
 
  Regards,
  Mridul
 
 
  On Fri, Feb 7, 2014 at 5:57 AM, Andrew Ash and...@andrewash.com wrote:
   I think the solution where we stop the writing threads and then let the
   deleting threads completely clean up is the best option since the final
   state doesn't have half-deleted temp dirs scattered across the cluster.
  
   How feasible do you think it'd be to interrupt the other threads?
  
  
   On Thu, Feb 6, 2014 at 10:54 AM, Mridul Muralidharan mri...@gmail.com
  wrote:
  
   Looks like a pathological corner case here - where the the delete
   thread is not getting run while the OS is busy prioritizing the thread
   writing data (probably with heavy gc too).
   Ideally, the delete thread would list files, remove them and then fail
   when it tries to remove the non empty directory (since other thread
   might be creating more in parallel).
  
  
   Regards,
   Mridul
  
  
   On Thu, Feb 6, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com
  wrote:
Got a repro locally on my MBP (the other was on a CentOS machine).
   
Build spark, run a master and a worker with the sbin/start-all.sh
  script,
then run this in a shell:
   
import org.apache.spark.storage.StorageLevel._
val s = sc.parallelize(1 to
 10).persist(MEMORY_AND_DISK_SER);
s.count
   
After about a minute, this line appears in the shell logging output:
   
14/02/06 02:44:44 WARN BlockManagerMasterActor: Removing
 BlockManager
BlockManagerId(0, aash-mbp.dyn.yojoe.local, 57895, 0) with no recent
   heart
beats: 57510ms exceeds 45000ms
   
Ctrl-C the shell.  In jps there is now a worker, a master, and a
CoarseGrainedExecutorBackend.
   
Run jstack on the CGEBackend JVM, and I got the attached
 stacktraces.
   I
waited around for 15min then kill -9'd the JVM and restarted the
  process.
   
I wonder if what's happening here is that the threads that are
 spewing
   data
to disk (as that parallelize and persist would do) can write to disk
   faster
than the cleanup threads can delete from disk.
   
What do you think of that theory?
   
   
Andrew
   
   
   
On Thu, Feb 6, 2014 at 2:30 AM, Mridul Muralidharan 
 mri...@gmail.com
  
wrote:
   
shutdown hooks should not take 15 mins are you mentioned !
On the other hand, how busy was your disk when this was happening ?
(either due to spark or something else ?)
   
It might just be that there was a lot of stuff to remove ?
   
Regards,
Mridul
   
   
On Thu, Feb 6, 2014 at 3:50 PM, Andrew Ash and...@andrewash.com
   wrote:
 Hi Spark devs,

 Occasionally when hitting Ctrl-C in the scala spark shell on
 0.9.0
  one
 of
 my workers goes dead in the spark master UI.  I'm using the
  standalone
 cluster and didn't ever see this while using 0.8.0 so I think it
  may
   be
 a
 regression.

 When I prod on the hung CoarseGrainedExecutorBackend JVM with
  jstack
   and
 jmap -heap, it doesn't respond unless I add the -F force flag.
  The
   heap
 isn't full, but there are some interesting bits in the jstack.
   Poking
 around a little, I think there may be some kind of deadlock in
 the
 shutdown
 hooks.

 Below are the threads I think are most interesting:

 Thread 14308: (state = BLOCKED)
  - java.lang.Shutdown.exit(int) @bci=96, line=212 (Interpreted
  frame)
  - java.lang.Runtime.exit(int) @bci=14, line=109 (Interpreted
  frame)

Re: [0.9.0] Possible deadlock in shutdown hook?

2014-02-06 Thread Matei Zaharia
I don’t think we necessarily want to do this through the DAGScheduler because 
the worker might also shut down due to some unusual termination condition, like 
the driver node crashing. Can’t we do it at the top of the shutdown hook 
instead? If all the threads are in the same thread pool it might be possible to 
interrupt or stop the whole pool.

Matei

On Feb 6, 2014, at 11:30 PM, Andrew Ash and...@andrewash.com wrote:

 That's genius.  Of course when a worker is told to shutdown it should
 interrupt its worker threads -- I think that would address this issue.
 
 Are you thinking to put
 
 running.map(_.jobId).foreach { handleJobCancellation }
 
 at the top of the StopDAGScheduler block?
 
 
 On Thu, Feb 6, 2014 at 11:05 PM, Tathagata Das
 tathagata.das1...@gmail.comwrote:
 
 Its highly likely that the executor with the threadpool that runs the tasks
 are the only set of threads that writes to disk. The tasks are designed to
 be interrupted when the corresponding job is cancelled. So a reasonably
 simple way could be to actually cancel the currently active jobs, which
 would send the signal to the worker to stop the tasks. Currently, the
 DAGScheduler
 https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L610
 does
 not seem to actually cancel the jobs, only mark them as failed. So it
 may be a simple addition.
 
 There may be some complications with the external spilling of shuffle data
 to disk not stopping immediately when the task is marked for killing. Gotta
 try it out.
 
 TD
 
 On Thu, Feb 6, 2014 at 10:39 PM, Andrew Ash and...@andrewash.com wrote:
 
 There is probably just one threadpool that has task threads -- is it
 possible to enumerate and interrupt just those?  We may need to keep
 string
 a reference to that threadpool through to the shutdown thread to make
 that
 happen.
 
 
 On Thu, Feb 6, 2014 at 10:36 PM, Mridul Muralidharan mri...@gmail.com
 wrote:
 
 Ideally, interrupting the thread writing to disk should be sufficient
 - though since we are in middle of shutdown when this is happening, it
 is best case effort anyway.
 Identifying which threads to interrupt will be interesting since most
 of them are driven by threadpool's and we cant list all threads and
 interrupt all of them !
 
 
 Regards,
 Mridul
 
 
 On Fri, Feb 7, 2014 at 5:57 AM, Andrew Ash and...@andrewash.com
 wrote:
 I think the solution where we stop the writing threads and then let
 the
 deleting threads completely clean up is the best option since the
 final
 state doesn't have half-deleted temp dirs scattered across the
 cluster.
 
 How feasible do you think it'd be to interrupt the other threads?
 
 
 On Thu, Feb 6, 2014 at 10:54 AM, Mridul Muralidharan 
 mri...@gmail.com
 wrote:
 
 Looks like a pathological corner case here - where the the delete
 thread is not getting run while the OS is busy prioritizing the
 thread
 writing data (probably with heavy gc too).
 Ideally, the delete thread would list files, remove them and then
 fail
 when it tries to remove the non empty directory (since other thread
 might be creating more in parallel).
 
 
 Regards,
 Mridul
 
 
 On Thu, Feb 6, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com
 wrote:
 Got a repro locally on my MBP (the other was on a CentOS machine).
 
 Build spark, run a master and a worker with the sbin/start-all.sh
 script,
 then run this in a shell:
 
 import org.apache.spark.storage.StorageLevel._
 val s = sc.parallelize(1 to
 10).persist(MEMORY_AND_DISK_SER);
 s.count
 
 After about a minute, this line appears in the shell logging
 output:
 
 14/02/06 02:44:44 WARN BlockManagerMasterActor: Removing
 BlockManager
 BlockManagerId(0, aash-mbp.dyn.yojoe.local, 57895, 0) with no
 recent
 heart
 beats: 57510ms exceeds 45000ms
 
 Ctrl-C the shell.  In jps there is now a worker, a master, and a
 CoarseGrainedExecutorBackend.
 
 Run jstack on the CGEBackend JVM, and I got the attached
 stacktraces.
 I
 waited around for 15min then kill -9'd the JVM and restarted the
 process.
 
 I wonder if what's happening here is that the threads that are
 spewing
 data
 to disk (as that parallelize and persist would do) can write to
 disk
 faster
 than the cleanup threads can delete from disk.
 
 What do you think of that theory?
 
 
 Andrew
 
 
 
 On Thu, Feb 6, 2014 at 2:30 AM, Mridul Muralidharan 
 mri...@gmail.com
 
 wrote:
 
 shutdown hooks should not take 15 mins are you mentioned !
 On the other hand, how busy was your disk when this was
 happening ?
 (either due to spark or something else ?)
 
 It might just be that there was a lot of stuff to remove ?
 
 Regards,
 Mridul
 
 
 On Thu, Feb 6, 2014 at 3:50 PM, Andrew Ash and...@andrewash.com
 
 wrote:
 Hi Spark devs,
 
 Occasionally when hitting Ctrl-C in the scala spark shell on
 0.9.0
 one
 of
 my workers goes dead in the spark master UI.  I'm using the
 standalone
 cluster and didn't ever see this while using 0.8.0 so I think
 it
 may
 be
 a
 

Re: rough date for spark summit 2014

2014-02-06 Thread Andy Konwinski
We just announced this and tickets are available now. June 30 thru July 2
in downtown SF. Details at http://spark-summit.org
On Jan 30, 2014 11:13 AM, Ameet Kini ameetk...@gmail.com wrote:

I know this is still a few months off and folks are rushing towards 0.9
release, but do the devs have a rough date for Spark Summit 2014? Looks
like it'll be in summer, but is it Jun / July / Aug / Sep ? Even
late-summer would help.

Summer being a popular vacation time, a few months advance notice would be
greatly appreciated (read: I missed last summit due to a pre-scheduled
vacation and would hate to miss this one :)

Thanks,
Ameet



0.9.0 forces log4j usage

2014-02-06 Thread Paul Brown
We have a few applications that embed Spark, and in 0.8.0 and 0.8.1, we
were able to use slf4j, but 0.9.0 broke that and unintentionally forces
direct use of log4j as the logging backend.

The issue is here in the org.apache.spark.Logging trait:

https://github.com/apache/incubator-spark/blame/master/core/src/main/scala/org/apache/spark/Logging.scala#L107

log4j-over-slf4j *always* returns an empty enumeration for appenders to the
ROOT logger:

https://github.com/qos-ch/slf4j/blob/master/log4j-over-slf4j/src/main/java/org/apache/log4j/Category.java?source=c#L81

And this causes an infinite loop and an eventual stack overflow.
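
One hypothetical way to break that loop (sketched below; this is not the
actual Spark code, and the helper name is made up) would be to check which
backend slf4j is really bound to before doing any log4j-specific
initialization:

  import org.slf4j.impl.StaticLoggerBinder

  object LoggingBackendCheck {
    // True only when the slf4j binding on the classpath is slf4j-log4j12,
    // i.e. log4j really is the backend. An application routing log4j calls
    // through log4j-over-slf4j to another backend (as described above) sees
    // a different factory class here, so any log4j default initialization
    // could simply be skipped for it.
    def slf4jBackedByLog4j: Boolean =
      StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
        .contains("Log4jLoggerFactory")
  }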

I'm happy to submit a JIRA and a patch, but it would be a significant enough
reversal of recent changes that it's probably worth discussing before I
sink a half hour into it.  My suggestion would be that initialization (or
not) should be left to the user, with reasonable default behavior supplied
by the Spark command-line tooling and not forced on applications that
incorporate Spark.

Thoughts/opinions?

-- Paul
—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


Re: [0.9.0] Possible deadlock in shutdown hook?

2014-02-06 Thread Tathagata Das
That definitely sounds more reliable. Worth trying out if there is a
reliable way of reproducing the deadlock-like scenario.

TD


On Thu, Feb 6, 2014 at 11:38 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 I don't think we necessarily want to do this through the DAGScheduler
 because the worker might also shut down due to some unusual termination
 condition, like the driver node crashing. Can't we do it at the top of the
 shutdown hook instead? If all the threads are in the same thread pool it
 might be possible to interrupt or stop the whole pool.

 Matei

 On Feb 6, 2014, at 11:30 PM, Andrew Ash and...@andrewash.com wrote:

  That's genius.  Of course when a worker is told to shutdown it should
  interrupt its worker threads -- I think that would address this issue.
 
  Are you thinking to put
 
  running.map(_.jobId).foreach { handleJobCancellation }
 
  at the top of the StopDAGScheduler block?
 
 
  On Thu, Feb 6, 2014 at 11:05 PM, Tathagata Das
  tathagata.das1...@gmail.comwrote:
 
  Its highly likely that the executor with the threadpool that runs the
 tasks
  are the only set of threads that writes to disk. The tasks are designed
 to
  be interrupted when the corresponding job is cancelled. So a reasonably
  simple way could be to actually cancel the currently active jobs, which
  would send the signal to the worker to stop the tasks. Currently, the
  DAGScheduler
 
 https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L610
  does
  not seem to actually cancel the jobs, only mark them as failed. So it
  may be a simple addition.
 
  There may be some complications with the external spilling of shuffle
 data
  to disk not stopping immediately when the task is marked for killing.
 Gotta
  try it out.
 
  TD
 
  On Thu, Feb 6, 2014 at 10:39 PM, Andrew Ash and...@andrewash.com
 wrote:
 
  There is probably just one threadpool that has task threads -- is it
  possible to enumerate and interrupt just those?  We may need to keep
  string
  a reference to that threadpool through to the shutdown thread to make
  that
  happen.
 
 
  On Thu, Feb 6, 2014 at 10:36 PM, Mridul Muralidharan mri...@gmail.com
  wrote:
 
  Ideally, interrupting the thread writing to disk should be sufficient
  - though since we are in middle of shutdown when this is happening, it
  is best case effort anyway.
  Identifying which threads to interrupt will be interesting since most
  of them are driven by threadpool's and we cant list all threads and
  interrupt all of them !
 
 
  Regards,
  Mridul
 
 
  On Fri, Feb 7, 2014 at 5:57 AM, Andrew Ash and...@andrewash.com
  wrote:
  I think the solution where we stop the writing threads and then let
  the
  deleting threads completely clean up is the best option since the
  final
  state doesn't have half-deleted temp dirs scattered across the
  cluster.
 
  How feasible do you think it'd be to interrupt the other threads?
 
 
  On Thu, Feb 6, 2014 at 10:54 AM, Mridul Muralidharan 
  mri...@gmail.com
  wrote:
 
  Looks like a pathological corner case here - where the the delete
  thread is not getting run while the OS is busy prioritizing the
  thread
  writing data (probably with heavy gc too).
  Ideally, the delete thread would list files, remove them and then
  fail
  when it tries to remove the non empty directory (since other thread
  might be creating more in parallel).
 
 
  Regards,
  Mridul
 
 
  On Thu, Feb 6, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com
  wrote:
  Got a repro locally on my MBP (the other was on a CentOS machine).
 
  Build spark, run a master and a worker with the sbin/start-all.sh
  script,
  then run this in a shell:
 
  import org.apache.spark.storage.StorageLevel._
  val s = sc.parallelize(1 to
  10).persist(MEMORY_AND_DISK_SER);
  s.count
 
  After about a minute, this line appears in the shell logging
  output:
 
  14/02/06 02:44:44 WARN BlockManagerMasterActor: Removing
  BlockManager
  BlockManagerId(0, aash-mbp.dyn.yojoe.local, 57895, 0) with no
  recent
  heart
  beats: 57510ms exceeds 45000ms
 
  Ctrl-C the shell.  In jps there is now a worker, a master, and a
  CoarseGrainedExecutorBackend.
 
  Run jstack on the CGEBackend JVM, and I got the attached
  stacktraces.
  I
  waited around for 15min then kill -9'd the JVM and restarted the
  process.
 
  I wonder if what's happening here is that the threads that are
  spewing
  data
  to disk (as that parallelize and persist would do) can write to
  disk
  faster
  than the cleanup threads can delete from disk.
 
  What do you think of that theory?
 
 
  Andrew
 
 
 
  On Thu, Feb 6, 2014 at 2:30 AM, Mridul Muralidharan 
  mri...@gmail.com
 
  wrote:
 
  shutdown hooks should not take 15 mins are you mentioned !
  On the other hand, how busy was your disk when this was
  happening ?
  (either due to spark or something else ?)
 
  It might just be that there was a lot of stuff to remove ?
 
  Regards,
  Mridul
 
 
  On Thu, 

Re: [0.9.0] Possible deadlock in shutdown hook?

2014-02-06 Thread Andrew Ash
I think we can enumerate all current threads with the ThreadMXBean, filter
to those whose names match the executor pool's naming pattern, and interrupt them.

http://docs.oracle.com/javase/6/docs/api/java/lang/management/ManagementFactory.html#getThreadMXBean%28%29

The executor threads are currently named according to the pattern "Executor
task launch worker-X".
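
A rough sketch of that approach (one caveat: ThreadMXBean only exposes
ThreadInfo, which has no interrupt(), so the sketch walks
Thread.getAllStackTraces to get actual Thread objects; the name prefix is the
one mentioned above and could change between releases):

  import scala.collection.JavaConverters._

  object InterruptTaskThreadsSketch {
    // Best-effort: interrupt every live thread whose name marks it as an
    // executor task thread. Tasks blocked in non-interruptible I/O may still
    // take a while to stop.
    def interruptTaskThreads(): Unit =
      Thread.getAllStackTraces.keySet.asScala
        .filter(_.getName.startsWith("Executor task launch worker"))
        .foreach(_.interrupt())
  }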


On Thu, Feb 6, 2014 at 11:45 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:

 That definitely sound more reliable. Worth trying out if there is a
 reliable way of reproducing the deadlock-like scenario.

 TD


 On Thu, Feb 6, 2014 at 11:38 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  I don't think we necessarily want to do this through the DAGScheduler
  because the worker might also shut down due to some unusual termination
  condition, like the driver node crashing. Can't we do it at the top of
 the
  shutdown hook instead? If all the threads are in the same thread pool it
  might be possible to interrupt or stop the whole pool.
 
  Matei
 
  On Feb 6, 2014, at 11:30 PM, Andrew Ash and...@andrewash.com wrote:
 
   That's genius.  Of course when a worker is told to shutdown it should
   interrupt its worker threads -- I think that would address this issue.
  
   Are you thinking to put
  
   running.map(_.jobId).foreach { handleJobCancellation }
  
   at the top of the StopDAGScheduler block?
  
  
   On Thu, Feb 6, 2014 at 11:05 PM, Tathagata Das
   tathagata.das1...@gmail.comwrote:
  
   Its highly likely that the executor with the threadpool that runs the
  tasks
   are the only set of threads that writes to disk. The tasks are
 designed
  to
   be interrupted when the corresponding job is cancelled. So a
 reasonably
   simple way could be to actually cancel the currently active jobs,
 which
   would send the signal to the worker to stop the tasks. Currently, the
   DAGScheduler
  
 
 https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L610
   does
   not seem to actually cancel the jobs, only mark them as failed. So it
   may be a simple addition.
  
   There may be some complications with the external spilling of shuffle
  data
   to disk not stopping immediately when the task is marked for killing.
  Gotta
   try it out.
  
   TD
  
   On Thu, Feb 6, 2014 at 10:39 PM, Andrew Ash and...@andrewash.com
  wrote:
  
   There is probably just one threadpool that has task threads -- is it
   possible to enumerate and interrupt just those?  We may need to keep
   string
   a reference to that threadpool through to the shutdown thread to make
   that
   happen.
  
  
   On Thu, Feb 6, 2014 at 10:36 PM, Mridul Muralidharan 
 mri...@gmail.com
   wrote:
  
   Ideally, interrupting the thread writing to disk should be
 sufficient
   - though since we are in middle of shutdown when this is happening,
 it
   is best case effort anyway.
   Identifying which threads to interrupt will be interesting since
 most
   of them are driven by threadpool's and we cant list all threads and
   interrupt all of them !
  
  
   Regards,
   Mridul
  
  
   On Fri, Feb 7, 2014 at 5:57 AM, Andrew Ash and...@andrewash.com
   wrote:
   I think the solution where we stop the writing threads and then let
   the
   deleting threads completely clean up is the best option since the
   final
   state doesn't have half-deleted temp dirs scattered across the
   cluster.
  
   How feasible do you think it'd be to interrupt the other threads?
  
  
   On Thu, Feb 6, 2014 at 10:54 AM, Mridul Muralidharan 
   mri...@gmail.com
   wrote:
  
   Looks like a pathological corner case here - where the the delete
   thread is not getting run while the OS is busy prioritizing the
   thread
   writing data (probably with heavy gc too).
   Ideally, the delete thread would list files, remove them and then
   fail
   when it tries to remove the non empty directory (since other
 thread
   might be creating more in parallel).
  
  
   Regards,
   Mridul
  
  
   On Thu, Feb 6, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com
   wrote:
   Got a repro locally on my MBP (the other was on a CentOS
 machine).
  
   Build spark, run a master and a worker with the sbin/start-all.sh
   script,
   then run this in a shell:
  
   import org.apache.spark.storage.StorageLevel._
   val s = sc.parallelize(1 to
   10).persist(MEMORY_AND_DISK_SER);
   s.count
  
   After about a minute, this line appears in the shell logging
   output:
  
   14/02/06 02:44:44 WARN BlockManagerMasterActor: Removing
   BlockManager
   BlockManagerId(0, aash-mbp.dyn.yojoe.local, 57895, 0) with no
   recent
   heart
   beats: 57510ms exceeds 45000ms
  
   Ctrl-C the shell.  In jps there is now a worker, a master, and a
   CoarseGrainedExecutorBackend.
  
   Run jstack on the CGEBackend JVM, and I got the attached
   stacktraces.
   I
   waited around for 15min then kill -9'd the JVM and restarted the
   process.
  
   I wonder if what's happening here is