RE: MIMA Compatibility Checks

2014-07-10 Thread Liu, Raymond
So how do I run the check locally?

On the master tree, sbt mimaReportBinaryIssues seems to report a lot of errors. 
Do we need to modify SparkBuild.scala etc. to run it locally? I could not 
figure out how Jenkins runs the check from its console output.


Best Regards,
Raymond Liu

-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Monday, June 09, 2014 3:40 AM
To: dev@spark.apache.org
Subject: MIMA Compatibility Checks

Hey All,

Some people may have noticed PR failures due to binary compatibility checks. 
We've had these enabled in several of the sub-modules since the 0.9.0 release, 
but we've turned them on in Spark core post-1.0.0, which has much higher churn.

The checks are based on the Migration Manager (MIMA) tool from Typesafe.
One issue is that the tool doesn't support package-private declarations of 
classes or methods. Prashant Sharma has built instrumentation that adds partial 
support for package-privacy (via a workaround), but since there isn't really 
native support for this in MIMA, we are still finding cases in which we trigger 
false positives.

In the next week or two we'll make it a priority to handle more of these 
false-positive cases. In the meantime, users can add manual excludes to:

project/MimaExcludes.scala

to avoid triggering warnings for certain issues.
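
For reference, an entry in that file looks roughly like the sketch below; the 
excluded class and method names are made-up placeholders rather than real Spark 
exclusions.

import com.typesafe.tools.mima.core._

// Hypothetical excludes; each entry silences one reported incompatibility.
Seq(
  // A method that is package-private or was intentionally removed:
  ProblemFilters.exclude[MissingMethodProblem](
    "org.apache.spark.example.SomeClass.someMethod"),
  // An internal class that is not part of the public API:
  ProblemFilters.exclude[MissingClassProblem](
    "org.apache.spark.example.InternalHelper")
)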

This is definitely annoying - sorry about that. Unfortunately we are the first 
open source Scala project to ever do this, so we are dealing with uncharted 
territory.

Longer term, I'd actually like to see us just write our own sbt-based tool to do 
this in a better way (we've had trouble trying to extend MIMA itself; for 
example, it has copy-pasted code from an old version of the Scala compiler). If 
someone in the community is a Scala fan and wants to take that on, I'm happy to 
give more details.

- Patrick


When inserting data into a table that is on Tachyon, how can I control the data placement?

2014-07-10 Thread qingyang li
When inserting data (the data is small, so it will not be partitioned
automatically) into a table that is on Tachyon, how can I control the data
placement? I mean, how can I specify which machine the data should reside on?
If we cannot control it, what is the data assignment strategy of Tachyon or Spark?


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread RJ Nowling
I went ahead and created JIRAs.

JIRA for Hierarchical Clustering:
https://issues.apache.org/jira/browse/SPARK-2429

JIRA for Standardized Clustering APIs:
https://issues.apache.org/jira/browse/SPARK-2430

Before submitting a PR for the standardized API, I want to implement a
few clustering algorithms for myself to get a good feel for how to
structure such an API.  If others with more experience want to dive
into designing the API, though, that would allow us to get moving more
quickly.


On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath nick.pentre...@gmail.com wrote:
 Cool, seems like a good initiative. Adding a couple of extra high-quality 
 clustering implementations will be great.


 I'd say it would make most sense to submit a PR for the Standardised API 
 first, agree on that with everyone, and then build on it for the specific 
 implementations.
 —
 Sent from Mailbox

 On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thanks everyone for the input.
 So it seems what people want is:
 * Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
 conquer approach, look at DecisionTree implementation as a reference)
 * Restructure 3 Kmeans clustering algorithm implementations to prevent
 code duplication and conform to a consistent API where possible
 If this is correct, I'll start work on that.  How would it be best to
 structure it? Should I submit separate JIRAs / PRs for refactoring of
 current code, MiniBatch KMeans, and Hierarchical or keep my current
 JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
 for Hierarchical KMeans that builds on top of that?
 Thanks!
 On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee hector@gmail.com wrote:
 Yeah if one were to replace the objective function in decision tree with
 minimizing the variance of the leaf nodes it would be a hierarchical
 clusterer.


 On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 If you're thinking along these lines, have a look at the DecisionTree
 implementation in MLlib. It uses the same idea and is optimized to prevent
 multiple passes over the data by computing several splits at each level of
 tree building. The tradeoff is increased model state and computation per
 pass over the data, but fewer total passes and hopefully lower
 communication overheads than, say, shuffling data around that belongs to
 one cluster or another. Something like that could work here as well.

 I'm not super-familiar with hierarchical K-Means so perhaps there's a more
 efficient way to implement it, though.


 On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote:

   No, I was thinking more top-down:
  
   Assuming a distributed k-means system already exists, recursively apply
   the k-means algorithm to data already partitioned by the previous level
   of k-means.
  
   I haven't been much of a fan of bottom-up approaches like HAC, mainly
   because they assume there is already an item-to-item distance metric.
   This makes it hard to cluster new content. The distances between sibling
   clusters are also hard to compute (if you have thrown away the similarity
   matrix): do you count paths to the same parent node when computing
   distances between items in two adjacent nodes, for example? It is also a
   bit harder to distribute the computation for bottom-up approaches, as you
   have to already find the nearest neighbor of an item to begin the process.
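
A minimal sketch of that top-down recursion, assuming MLlib's existing KMeans as
the building block; the method name, the fixed iteration count, and the depth
cutoff are illustrative choices rather than an agreed design:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Run k-means, split the data by predicted cluster, then recurse on each
// subset until the depth limit is reached or the data gets too small.
def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int): Unit = {
  if (depth <= 0 || data.count() < k) return
  val model = KMeans.train(data, k, 20)
  (0 until k).foreach { cluster =>
    val subset = data.filter(point => model.predict(point) == cluster)
    hierarchicalKMeans(subset, k, depth - 1)
  }
}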
 
 
  On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:
 
   The scikit-learn implementation may be of interest:
  
  
  
 
 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
  
    It's a bottom-up approach. The pair of clusters to merge is
    chosen to minimize variance.
  
   Their code is under a BSD license so it can be used as a template.
  
   Is something like that you were thinking Hector?
  
   On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
    Sure. The more interesting problem here is choosing k at each level.
    Kernel methods seem to be the most promising.
   
   
On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com
  wrote:
   
     No idea, never looked it up. I've always just implemented it as running
     k-means again on each cluster.
    
     FWIW, standard k-means with Euclidean distance has problems too with some
     dimensionality reduction methods. Swapping out the distance metric for
     negative dot product or cosine may help.
    
     Another more useful clustering approach would be hierarchical SVD. The
     reason why I like hierarchical clustering is that it makes for faster
     inference, especially over billions of users.
   
   
On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
 
wrote:
   
 Hector, could you share the references for hierarchical K-means?
   thanks.


 On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com
   wrote:

  I would say for bigdata applications 

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread Nick Pentreath
Might be worth checking out scikit-learn and Mahout to get some broad ideas.
—
Sent from Mailbox

On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling rnowl...@gmail.com wrote:

 I went ahead and created JIRAs.
 JIRA for Hierarchical Clustering:
 https://issues.apache.org/jira/browse/SPARK-2429
 JIRA for Standardized Clustering APIs:
 https://issues.apache.org/jira/browse/SPARK-2430
 Before submitting a PR for the standardized API, I want to implement a
 few clustering algorithms for myself to get a good feel for how to
 structure such an API.  If others with more experience want to dive
 into designing the API, though, that would allow us to get moving more
 quickly.

Feature selection interface

2014-07-10 Thread Ulanov, Alexander
Hi,

I've implemented a class that does Chi-squared feature selection for 
RDD[LabeledPoint]. It also computes basic class/feature occurrence statistics, 
and other methods like mutual information or information gain can easily be 
implemented on top of it. I would like to make a pull request. However, the 
MLlib master branch doesn't have any feature selection methods implemented, so 
I need to create a proper interface that my class will extend or mix in. It 
should be easy to use from both a developer's and a user's perspective.

I was thinking that there should be a FeatureEvaluator that, for each feature in 
an RDD[LabeledPoint], returns RDD[((featureIndex: Int, label: Double), value: 
Double)].
Then there should be a FeatureSelector that selects the top N features, or the 
top N features grouped by class, etc.
And the simplest one, a FeatureFilter that filters the data based on a set of 
feature indices.

Additionally, there should be an interface for FeatureEvaluators that don't use 
class labels, i.e. that operate on RDD[Vector].

I am concerned that such a design looks rather fragmented, because there are 
three separate objects.

From the usage side, I would like to end up with something like
val filteredData = Filter(data, ChiSquared(data).selectTop(100)).
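
For concreteness, here is a rough sketch of what those three pieces could look
like; the trait names and signatures are one illustration of the proposal, not
an existing MLlib API:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Scores each (featureIndex, label) pair, e.g. with a chi-squared statistic.
trait FeatureEvaluator {
  def evaluate(data: RDD[LabeledPoint]): RDD[((Int, Double), Double)]
}

// Turns the scores into a chosen subset of feature indices.
trait FeatureSelector {
  def selectTop(n: Int): Set[Int]
}

// Keeps only the selected feature indices in each point.
object Filter {
  def apply(data: RDD[LabeledPoint], indices: Set[Int]): RDD[LabeledPoint] = {
    val sorted = indices.toArray.sorted
    data.map { lp =>
      LabeledPoint(lp.label, Vectors.dense(sorted.map(i => lp.features(i))))
    }
  }
}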

Any ideas or suggestions?

Best regards, Alexander


Changes to sbt build have been merged

2014-07-10 Thread Patrick Wendell
Just a heads up, we merged Prashant's work on having the sbt build read all
dependencies from Maven. Please report any issues you find on the dev list
or on JIRA.

One note here for developers, going forward the sbt build will use the same
configuration style as the maven build (-D for options and -P for maven
profiles). So this will be a change for developers:

sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

For now, we'll continue to support the old env-var options with a
deprecation warning.

- Patrick


Re: Changes to sbt build have been merged

2014-07-10 Thread Sandy Ryza
Woot!


On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell patr...@databricks.com
wrote:

 Just a heads up, we merged Prashant's work on having the sbt build read all
 dependencies from Maven. Please report any issues you find on the dev list
 or on JIRA.

 One note here for developers, going forward the sbt build will use the same
 configuration style as the maven build (-D for options and -P for maven
 profiles). So this will be a change for developers:

 sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

 For now, we'll continue to support the old env-var options with a
 deprecation warning.

 - Patrick



Re: Changes to sbt build have been merged

2014-07-10 Thread yao
Cool~


On Thu, Jul 10, 2014 at 1:29 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Woot!


 On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell patr...@databricks.com
 wrote:

  Just a heads up, we merged Prashant's work on having the sbt build read
 all
  dependencies from Maven. Please report any issues you find on the dev
 list
  or on JIRA.
 
  One note here for developers, going forward the sbt build will use the
 same
  configuration style as the maven build (-D for options and -P for maven
  profiles). So this will be a change for developers:
 
  sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly
 
  For now, we'll continue to support the old env-var options with a
  deprecation warning.
 
  - Patrick
 



EC2 clusters ready in launch time + 30 seconds

2014-07-10 Thread Nicholas Chammas
Hi devs!

Right now it takes a non-trivial amount of time to launch EC2 clusters.
Part of this time is spent starting the EC2 instances, which is out of our
control. Another part of this time is spent installing stuff on and
configuring the instances. This, we can control.

I’d like to explore approaches to upgrading spark-ec2 so that launching a
cluster of any size generally takes only 30 seconds on top of the time to
launch the base EC2 instances. Since Amazon can launch instances
concurrently, I believe this means we should be able to launch a fully
operational Spark cluster of any size in constant time. Is that correct?

Do we already have an idea of what it would take to get to that point?

Nick
​


sparkSQL thread safe?

2014-07-10 Thread Ian O'Connell
Had a few quick questions...

Just wondering if right now spark sql is expected to be thread safe on
master?

doing a simple hadoop file -> RDD -> schema RDD -> write parquet

will fail in the reflection code if I run these in a thread pool.

The SparkSqlSerializer seems to create a new Kryo instance each time it
wants to serialize anything. I got a huge speedup, when I had any
non-primitive type in my SchemaRDD, by using the ResourcePools from Chill to
provide the KryoSerializer to it. (I can open an RB if there is some
reason not to re-use them?)
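
The reuse idea, sketched here with a plain thread-local Kryo instead of Chill's
ResourcePool to keep it short; the object name and configuration hook are made
up:

import com.esotericsoftware.kryo.Kryo

// One Kryo per thread instead of a fresh instance per serialization call.
object KryoHolder {
  private val local = new ThreadLocal[Kryo] {
    override def initialValue(): Kryo = new Kryo() // register classes here
  }
  def get: Kryo = local.get()
}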



With the distinct count operator there are no map-side operations, and there is
a test to check for this. Is there any reason not to do a map-side combine
into a set and then merge the sets later (similar to the approximate
distinct count operator)?
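
In plain RDD terms, the suggestion amounts to something like the sketch below
(illustrative code, not the actual Spark SQL operator):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Per-key distinct counts: build sets map-side, merge them reduce-side.
def distinctCounts(pairs: RDD[(String, String)]): RDD[(String, Int)] =
  pairs
    .combineByKey[Set[String]](
      v => Set(v),        // create a one-element set on the map side
      (s, v) => s + v,    // add values within a partition
      (a, b) => a ++ b    // merge partition sets during the shuffle
    )
    .mapValues(_.size)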

===

Another thing while I'm mailing: the 1.0.1 docs have a section like:

// Note: Case classes in Scala 2.10 can support only up to 22 fields. To
// work around this limit, you can use custom classes that implement the
// Product interface.


Which sounds great: we have lots of data in Thrift, so via scrooge
(https://github.com/twitter/scrooge) we ultimately end up with instances of
traits which implement Product. However, the reflection code appears to look
for the constructor of the class and derive the types from those constructor
parameters?
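
For reference, the Product workaround from the docs amounts to hand-writing
something like the class below; the record and field names are made up, and a
real use would carry more than 22 fields:

// A custom class implementing Product, usable where a >22-field case class
// cannot be; only two fields are shown to keep the sketch short.
class WideRecord(val id: Long, val name: String)
  extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
  def productArity: Int = 2
  def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}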


Ian.


Re: sparkSQL thread safe?

2014-07-10 Thread Michael Armbrust
Hey Ian,

Thanks for bringing these up!  Responses in-line:

Just wondering if right now spark sql is expected to be thread safe on
 master?
 doing a simple hadoop file -> RDD -> schema RDD -> write parquet
 will fail in the reflection code if I run these in a thread pool.


You are probably hitting SPARK-2178
(https://issues.apache.org/jira/browse/SPARK-2178), which is caused by
SI-6240 (https://issues.scala-lang.org/browse/SI-6240). We have a plan to
fix this by moving the schema introspection to compile time, using macros.


 The SparkSqlSerializer, seems to create a new Kryo instance each time it
 wants to serialize anything. I got a huge speedup when I had any
 non-primitive type in my SchemaRDD using the ResourcePool's from Chill for
 providing the KryoSerializer to it. (I can open an RB if there is some
 reason not to re-use them?)


Sounds like SPARK-2102 (https://issues.apache.org/jira/browse/SPARK-2102).
There is no reason AFAIK to not reuse the instance. A PR would be greatly
appreciated!


 With the Distinct Count operator there is no map-side operations, and a
 test to check for this. Is there any reason not to do a map side combine
 into a set and then merge the sets later? (similar to the approximate
 distinct count operator)


That's just not an optimization that we had implemented yet... but I've just
done it here (https://github.com/apache/spark/pull/1366) and it'll be in
master soon :)


 Another thing while i'm mailing.. the 1.0.1 docs have a section like:
 
 // Note: Case classes in Scala 2.10 can support only up to 22 fields. To
 work around this limit, // you can use custom classes that implement the
 Product interface.
 

 Which sounds great, we have lots of data in thrift.. so via scrooge (
 https://github.com/twitter/scrooge), we end up with ultimately instances
 of
 traits which implement product. Though the reflection code appears to look
 for the constructor of the class and base the types based on those
 parameters?


Yeah, it's true that we only look at the constructor at the moment, but I
don't think there is a really good reason for that (other than, I guess, we
will need to add code to make sure we skip built-in object methods). If you
want to open a JIRA, we can try fixing this.

Michael


RE: EC2 clusters ready in launch time + 30 seconds

2014-07-10 Thread Nate D'Amico
You are partially correct.

It's not terribly complex, but it's also not easy to accomplish. Sounds like you 
want to manage some partially/fully baked AMIs with the core Spark libs and 
dependencies already on the image. The main issues that crop up are:

1) Image sprawl: as libs/config/defaults/etc. change, images need to be 
rebuilt and references updated.
2) Cross-region support (not too huge a deal now with the copy functionality, 
just more complex image management).

If you don't want to restrict which instance types/sizes one can use, you also 
have an uptick in image management complexity with:

3) Instance type (you need both standard and HVM images).

I'm starting to work through some automation/config stuff for the Spark stack 
on EC2 with a project. I'll be focusing the work through the Apache Bigtop 
effort to start, and can then share it with the Spark community directly as 
things progress, if people are interested.

Nate


-Original Message-
From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com] 
Sent: Thursday, July 10, 2014 3:06 PM
To: dev
Subject: EC2 clusters ready in launch time + 30 seconds




Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-10 Thread Gary Malouf
-1 I honestly do not know the voting rules for the Spark community, so
please excuse me if I am out of line or if Mesos compatibility is not a
concern at this point.

We just tried to run this version built against 2.3.0-cdh5.0.2 on Mesos
0.18.2.  All of our jobs with data above a few gigabytes hung indefinitely.
 Downgrading back to the 1.0.0 stable release of Spark built the same way
worked for us.


On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves tgraves...@yahoo.com.invalid
wrote:

 +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with
 authentication on.

 Tom


 On Friday, July 4, 2014 2:39 PM, Patrick Wendell pwend...@gmail.com
 wrote:



 Please vote on releasing the following candidate as Apache Spark version
 1.0.1!

 The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.0.1-rc2/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1021/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/

 Please vote on releasing this package as Apache Spark 1.0.1!

 The vote is open until Monday, July 07, at 20:45 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 === Differences from RC1 ===
 This release includes only one blocking patch from rc1:
 https://github.com/apache/spark/pull/1255

 There are also smaller fixes which came in over the last week.

 === About this release ===
 This release fixes a few high-priority bugs in 1.0 and has a variety
 of smaller fixes. The full list is here: http://s.apache.org/b45. Some
 of the more visible patches are:

 SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
 SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size.
 SPARK-1790: Support r3 instance types on EC2.

 This is the first maintenance release on the 1.0 line. We plan to make
 additional maintenance releases as new fixes come in.



Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-10 Thread Gary Malouf
Just realized the deadline was Monday; my apologies. The issue
nevertheless stands.


On Thu, Jul 10, 2014 at 9:28 PM, Gary Malouf malouf.g...@gmail.com wrote:

 -1 I honestly do not know the voting rules for the Spark community, so
 please excuse me if I am out of line or if Mesos compatibility is not a
 concern at this point.

 We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos
 0.18.2.  All of our jobs with data above a few gigabytes hung indefinitely.
  Downgrading back to the 1.0.0 stable release of Spark built the same way
 worked for us.

