Performance & Memory Issues When Creating Many Columns in GROUP BY (spark-sql)

2015-05-19 Thread daniel.mescheder
Dear List,
We have run into serious problems trying to run a larger than average number of 
aggregations in a GROUP BY query. Symptoms of this problem are OutOfMemory 
exceptions and unreasonably long processing times due to GC. The problem occurs 
when the following two conditions are met:
 - The number of groups is relatively large (growing with the size of the 
dataset)
 - The number of columns is relatively large
To reproduce, paste the following gist into your spark-shell (I'm running 
1.3.1): https://gist.github.com/DanielMe/9467bb0d9ad3aa639429 

This example is relatively small in size:
 - The size of the input is 10^6 * 64 bit = 8 MB
 - The size of the output should be around 3 * 10^8 * 64 bit = 2.4 GB
 - The aggregations themselves are just "count(1)" and hence not so difficult 
to compute
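
For reference, the query shape boils down to roughly the following (an
illustrative sketch only, not the linked gist; `df`, the column name "key"
and the value of num_columns are placeholders):

import org.apache.spark.sql.functions._

val num_columns = 300
// one count(1) aggregate per requested output column
val aggs = (1 to num_columns).map(i => count(lit(1)).as(s"c$i"))
val result = df.groupBy("key").agg(aggs.head, aggs.tail: _*)
result.count()  // forces the aggregation
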
I am running this on a cluster with three 61GB worker machines and an equally 
equipped master with the following spark-defaults.conf:
spark.executor.memory=55g
spark.driver.memory=55g
The result: The workers will choke with "java.lang.OutOfMemoryError: GC 
overhead limit exceeded". In fact, if you play with the num_columns parameter 
you should observe an unreasonable amount of time spent on GC even for lower 
values. If you run this on a desktop machine, low values of num_columns should 
already lead to OOM crashes.
My questions are:
 - What causes this behaviour?
 - Can/should catalyst be able to automatically optimize queries of this kind 
to run in reasonable time or at least not crash?
 - What are  possible workarounds to achieve the desired effect? (Even if that 
means not using DataFrames but going down to the raw RDD level)
Our preliminary analysis of the situation concluded that what is blowing up is 
in fact the hashTable in Aggregate::doExecute which will try to store the cross 
product of groups and columns on each partition. In fact, we managed to 
mitigate the issue a bit by
 - reducing the size of the partitions (which will make these hash tables 
smaller)
 - pre-partitioning the data using a HashPartitioner on the key (which will 
reduce the number of different groups per partition)
The latter actually seems to be a sensible thing to do whenever 
num_columns*num_groups > num_rows because in this setting the amount of data we 
have to shuffle around after the first aggregation step is actually larger than 
the amount of data we had initially. Could this be something that catalyst 
should take into account when creating a physical plan?
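
One way to express the pre-partitioning workaround at the RDD level is roughly
the following (a sketch under assumptions: the key is the first column and a
Long, the partition count of 200 is just a tuning knob, and `aggs` is as in
the sketch above):

import org.apache.spark.HashPartitioner

// Hash-partition the underlying RDD on the key so that each partition only
// holds a subset of the groups, then rebuild a DataFrame and aggregate.
val prePartitioned = df.rdd
  .map(row => (row.getLong(0), row))
  .partitionBy(new HashPartitioner(200))
  .values
val repartitioned = sqlContext.createDataFrame(prePartitioned, df.schema)
val result = repartitioned.groupBy("key").agg(aggs.head, aggs.tail: _*)
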
Thanks in advance.
Kind regards,

Daniel 






[Catalyst] RFC: Using PartialFunction literals instead of objects

2015-05-19 Thread Edoardo Vacchi
Hi everybody,

At the moment, Catalyst rules are defined using two different types of rules:
`Rule[LogicalPlan]` and `Strategy` (which in turn maps to
`GenericStrategy[SparkPlan]`).

I propose to introduce utility methods to

  a) reduce the boilerplate to define rewrite rules
  b) turn them back into what they essentially represent: function types.

These changes would be backwards compatible, and would greatly help in
understanding what the code does. Personally, I feel like the current
use of objects is redundant and possibly confusing.

## `Rule[LogicalPlan]`

The analyzer and optimizer use `Rule[LogicalPlan]`, which, besides defining a
default `val ruleName`, only defines the method `apply(plan: TreeType): TreeType`.
Because the body of such a method is always supposed to read `plan match pf`,
with `pf` being some `PartialFunction[LogicalPlan, LogicalPlan]`, we can
conclude that `Rule[LogicalPlan]` might be substituted by a PartialFunction.

I propose the following:

a) Introduce the utility method

def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] =
  new Rule[LogicalPlan] {
def apply (plan: LogicalPlan): LogicalPlan = plan transform pf
  }

b) progressively replace the boilerplate-y object definitions; e.g.

object MyRewriteRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case ... => ...
  }
}

with

// define a Rule[LogicalPlan]
val MyRewriteRule = rule {
  case ... => ...
}

It might also be possible to make the `rule` method `implicit`, thereby
further reducing MyRewriteRule to:

// define a PartialFunction[LogicalPlan, LogicalPlan]
// the implicit would convert it into a Rule[LogicalPlan] at the use sites
val MyRewriteRule = {
  case ... => ...
}
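
To make the shape concrete, here is a purely illustrative rule written with
the proposed helper (the pattern is made up for the example and assumes the
usual catalyst imports, e.g. org.apache.spark.sql.catalyst.plans.logical._):

// Illustrative only: drop a Project node whose output is exactly its child's output.
val RemoveIdentityProject = rule {
  case p @ Project(_, child) if p.output == child.output => child
}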


## Strategies

A similar solution could be applied to shorten the code for Strategies, which
are total functions only because they are all supposed to handle the default
case, possibly returning `Nil`. In this case we might introduce the following
utility methods:

/**
 * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan].
 * The partial function must therefore return *one single* SparkPlan
for each case.
 * The method will automatically wrap them in a [[Seq]].
 * Unhandled cases will automatically return Seq.empty
 */
protected def rule(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy =
  new Strategy {
def apply(plan: LogicalPlan): Seq[SparkPlan] =
  if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty
  }

/**
 * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] ].
 * The partial function must therefore return a Seq[SparkPlan] for each case.
 * Unhandled cases will automatically return Seq.empty
 */
protected def seqrule(pf: PartialFunction[LogicalPlan,
Seq[SparkPlan]]): Strategy =
  new Strategy {
def apply(plan: LogicalPlan): Seq[SparkPlan] =
  if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan]
  }
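
For illustration, a one-to-one strategy could then read as follows (a sketch
only; `logical.Limit`, `IntegerLiteral`, `execution.Limit` and `planLater` are
the names used in the current SparkStrategies and are assumed to be in scope):

// Illustrative: plan a logical Limit into a physical Limit node with the helper above.
val PlanLimit: Strategy = rule {
  case logical.Limit(IntegerLiteral(limit), child) =>
    execution.Limit(limit, planLater(child))
}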

Thanks in advance
e.v.




Re: Problem building master on 2.11

2015-05-19 Thread Iulian Dragoș
There's an open PR to fix it. If you could try it and report back on the PR,
it'd be great; that makes it more likely to get in fast.

https://github.com/apache/spark/pull/6260

On Mon, May 18, 2015 at 6:43 PM, Fernando O.  wrote:

> I just noticed I sent this to users instead of dev:
>
> -- Forwarded message --
> From: Fernando O. 
> Date: Sat, May 16, 2015 at 4:09 PM
> Subject: Problem building master on 2.11
> To: "u...@spark.apache.org" 
>
>
> Is anyone else having issues when building spark from git?
> I created a jira ticket with a Docker file that reproduces the issue.
>
> The error:
> /spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:56:
> error: not found: type Type
>   protected Type type() { return Type.UPLOAD_BLOCK; }
>
>
> https://issues.apache.org/jira/browse/SPARK-7670
>
>


-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Re: Contribute code to MLlib

2015-05-19 Thread Trevor Grant
 There are most likely advantages and disadvantages to Tarek's algorithm
compared with the current implementation, and different scenarios where each is
more appropriate.

Would we not offer multiple PCA algorithms and let the user choose?

Trevor

Trevor Grant
Data Scientist

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Mon, May 18, 2015 at 4:18 PM, Joseph Bradley 
wrote:

> Hi Tarek,
>
> Thanks for your interest & for checking the guidelines first!  On 2 points:
>
> Algorithm: PCA is of course a critical algorithm.  The main question is
> how your algorithm/implementation differs from the current PCA.  If it's
> different and potentially better, I'd recommend opening up a JIRA for
> explaining & discussing it.
>
> Java/Scala: We really do require that algorithms be in Scala, for the sake
> of maintainability.  The conversion should be doable if you're willing
> since Scala is a pretty friendly language.  If you create the JIRA, you
> could also ask for help there to see if someone can collaborate with you to
> convert the code to Scala.
>
> Thanks!
> Joseph
>
> On Mon, May 18, 2015 at 3:13 AM, Tarek Elgamal 
> wrote:
>
>> Hi,
>>
>> I would like to contribute an algorithm to the MLlib project. I have
>> implemented a scalable PCA algorithm on spark. It is scalable for both tall
>> and fat matrices and the paper around it is accepted for publication in
>> SIGMOD 2015 conference. I looked at the guidelines in the following link:
>>
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>
>> I believe that most of the guidelines applies in my case, however, the
>> code is written in java and it was not clear in the guidelines whether
>> MLLib project accepts java code or not.
>> My algorithm can be found under this repository:
>> https://github.com/Qatar-Computing-Research-Institute/sPCA
>>
>> Any help on how to make it suitable for MLlib project will be greatly
>> appreciated.
>>
>> Best Regards,
>> Tarek Elgamal
>>
>>
>>
>>
>


Re: [ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-19 Thread Tim Ellison
Sean,

Did the JIRA get created? If so, I can't find it, so a pointer would be
helpful.

Regards,
Tim

On 06/05/15 06:59, Reynold Xin wrote:
> Sean - Please do.
> 
> On Tue, May 5, 2015 at 10:57 PM, Sean Owen  wrote:
> 
>> OK to file a JIRA to scrape out a few Java 6-specific things in the
>> code? and/or close issues about working with Java 6 if they're not
>> going to be resolved for 1.4?
>>
>> I suppose this means the master builds and PR builder in Jenkins
>> should simply continue to use Java 7 then.
>>
>> On Tue, May 5, 2015 at 11:25 PM, Reynold Xin  wrote:
>>> Hi all,
>>>
>>> We will drop support for Java 6 starting Spark 1.5, tentative scheduled
>> to
>>> be released in Sep 2015. Spark 1.4, scheduled to be released in June
>> 2015,
>>> will be the last minor release that supports Java 6. That is to say:
>>>
>>> Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8.
>>>
>>> Spark 1.5+ (~ Sep 2015): will NOT work with Java 6, but work with Java
>> 7, 8.
>>>
>>>
>>> PS: Oracle ended Java 6 updates in Feb 2013.
>>>
>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
> 




Re: RDD split into multiple RDDs

2015-05-19 Thread Justin Uang
To do it in one pass, conceptually what you would need to do is to consume
the entire parent iterator and store the values either in memory or on
disk, which is generally something you want to avoid given that the parent
iterator length is unbounded. If you need to start spilling to disk, you
might actually get better performance just from doing multiple passes,
provided that you don't have that many unique keys. In fact, the filter
approach that you mentioned earlier is conceptually the same as the
implementation of randomSplit, where each of the split RDDs has access to
the full parent RDD and then does the sample.

In addition, building the map is actually very cheap. Since it's lazy, you
only do the filters when you need to iterate across the RDD of a specific
key.
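
A minimal sketch of what that looks like (concrete types assumed for
readability; this mirrors the snippet Juan posted below, with the parent
cached so each per-key filter does not recompute the source):

// Assuming pairRDD: RDD[(Int, String)].
val cached = pairRDD.cache()
val perKey: Map[Int, RDD[String]] =
  cached.keys.distinct().collect().map { k =>
    k -> cached.filter(_._1 == k).values
  }.toMap
// Nothing is computed for any key yet; a filter only runs when its RDD is
// actually used, e.g. perKey(1).count()
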
On Wed, Apr 29, 2015 at 9:57 AM Sébastien Soubré-Lanabère <
s.sou...@gmail.com> wrote:

> Hi Juan, Daniel,
>
> thank you for your explanations. Indeed, I don't have a big number of keys,
> at least not enough to stuck the scheduler.
>
> I was using a method quite similar as what you post, Juan, and yes it
> works, but I think this would be more efficient to not call filter on each
> key. So, I was thinking something like :
> - get the iterator of the KV rdd
> - distribute each value into a subset by key and then recreate a rdd from
> this subset
>
> Because spark context parallelize method cannot be used inside a
> transformation, I wonder if I could do it by creating a custom RDD and then
> try to implement something like PairRDDFunctions.lookup method, but
> remplacing Seq[V] of course by a RDD
>
> def lookup(key: K): Seq[V] = {
>   self.partitioner match {
>     case Some(p) =>
>       val index = p.getPartition(key)
>       val process = (it: Iterator[(K, V)]) => {
>         val buf = new ArrayBuffer[V]
>         for (pair <- it if pair._1 == key) {
>           buf += pair._2
>         }
>         buf
>       } : Seq[V]
>       val res = self.context.runJob(self, process, Array(index), false)
>       res(0)
>     case None =>
>       self.filter(_._1 == key).map(_._2).collect()
>   }
> }
>
>
> 2015-04-29 15:02 GMT+02:00 Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com>:
>
> > Hi Daniel,
> >
> > I understood Sébastien was talking about having a high number of keys, I
> > guess I was prejudiced by my own problem! :) Anyway I don't think you
> need
> > to use disk or a database to generate a RDD per key, you can use filter
> > which I guess would be more efficient because IO is avoided, especially
> if
> > the RDD was cached. For example:
> >
> > // in the spark shell
> > import org.apache.spark.rdd.RDD
> > import org.apache.spark.rdd.RDD._
> > import scala.reflect.ClassTag
> >
> > // generate a map from key to rdd of values
> > def groupByKeyToRDDs[K, V](pairRDD: RDD[(K, V)]) (implicit kt:
> > ClassTag[K], vt: ClassTag[V], ord: Ordering[K]): Map[K, RDD[V]] = {
> > val keys = pairRDD.keys.distinct.collect
> > (for (k <- keys) yield
> > k -> (pairRDD filter(_._1 == k) values)
> > ) toMap
> > }
> >
> > // simple demo
> > val xs = sc.parallelize(1 to 1000)
> > val ixs = xs map(x => (x % 10, x))
> > val gs = groupByKeyToRDDs(ixs)
> > gs(1).collect
> >
> > Just an idea.
> >
> > Greetings,
> >
> > Juan Rodriguez
> >
> >
> >
> > 2015-04-29 14:20 GMT+02:00 Daniel Darabos <
> > daniel.dara...@lynxanalytics.com>:
> >
> >> Check out http://stackoverflow.com/a/26051042/3318517. It's a nice
> >> method for saving the RDD into separate files by key in a single pass.
> Then
> >> you can read the files into separate RDDs.
> >>
> >> On Wed, Apr 29, 2015 at 2:10 PM, Juan Rodríguez Hortalá <
> >> juan.rodriguez.hort...@gmail.com> wrote:
> >>
> >>> Hi Sébastien,
> >>>
> >>> I came with a similar problem some time ago, you can see the discussion
> >>> in
> >>> the Spark users mailing list at
> >>>
> >>>
> http://markmail.org/message/fudmem4yy63p62ar#query:+page:1+mid:qv4gw6czf6lb6hpq+state:results
> >>> . My experience was that when you create too many RDDs the Spark
> >>> scheduler
> >>> gets stuck, so if you have many keys in the map you are creating you'll
> >>> probably have problems. On the other hand, the latest example I
> proposed
> >>> in
> >>> that mailing thread was a batch job in which we start from a single RDD
> >>> of
> >>> time tagged data, transform the RDD in a list of RDD corresponding to
> >>> generating windows according to the time tag of the records, and then
> >>> apply
> >>> a transformation of RDD to each window RDD, like for example KMeans.run
> >>> of
> >>> MLlib. This is very similar to what you propose.
> >>> So in my humble opinion the approach of generating thousands of RDDs by
> >>> filtering doesn't work, and a new RDD class should be implemented for
> >>> this.
> >>> I have never implemented a custom RDD, but if you want some help I
> would
> >>> be
> >>> happy to join you in this task
> >>>
> >>
> >> Sebastien said nothing about thousands of keys. This is a valid prob

Re: [ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-19 Thread Sean Owen
No, I didn't yet. I was hoping to change the default version and make
a few obvious changes to take advantage of it all at once. Go ahead
with a JIRA. I can look into it this evening.

We have just a little actual Java code so the new language features
might be nice to use there but won't have a big impact.

However we might do well to replace some Guava usages with standard
JDK equivalents. I'd have to see just how much disruption it would
cause.

On Tue, May 19, 2015 at 3:20 PM, Tim Ellison  wrote:
> Sean,
>
> Did the JIRA get created?  If so I can't find it so a pointer would be
> helpful.
>
> Regards,
> Tim
>
> On 06/05/15 06:59, Reynold Xin wrote:
>> Sean - Please do.
>>
>> On Tue, May 5, 2015 at 10:57 PM, Sean Owen  wrote:
>>
>>> OK to file a JIRA to scrape out a few Java 6-specific things in the
>>> code? and/or close issues about working with Java 6 if they're not
>>> going to be resolved for 1.4?
>>>
>>> I suppose this means the master builds and PR builder in Jenkins
>>> should simply continue to use Java 7 then.
>>>
>>> On Tue, May 5, 2015 at 11:25 PM, Reynold Xin  wrote:
 Hi all,

 We will drop support for Java 6 starting Spark 1.5, tentative scheduled
>>> to
 be released in Sep 2015. Spark 1.4, scheduled to be released in June
>>> 2015,
 will be the last minor release that supports Java 6. That is to say:

 Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8.

 Spark 1.5+ (~ Sep 2015): will NOT work with Java 6, but work with Java
>>> 7, 8.


 PS: Oracle ended Java 6 updates in Feb 2013.



>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>




Re: Contribute code to MLlib

2015-05-19 Thread Ram Sriharsha
Hi Trevor, Tarek

You can make non-standard algorithms (PCA or otherwise) available to users of
Spark as Spark Packages.
http://spark-packages.org
https://databricks.com/blog/2014/12/22/announcing-spark-packages.html

With the availability of Spark Packages, adding powerful experimental /
alternative machine learning algorithms to the pipeline has never been
easier. I would suggest that route in scenarios where a machine learning
algorithm is not clearly better than an existing implementation in MLlib
for the common cases.

If your algorithm is better than the existing PCA implementation for a large
class of use cases, then we should open a JIRA and discuss the relative
strengths/weaknesses (perhaps with some benchmarks) so we can better
understand if it makes sense to switch out the existing PCA implementation
and make yours the default.

Ram

On Tue, May 19, 2015 at 6:56 AM, Trevor Grant 
wrote:

>  There are most likely advantages and disadvantages to Tarek's algorithm
> against the current implementation, and different scenarios where each is
> more appropriate.
>
> Would we not offer multiple PCA algorithms and let the user choose?
>
> Trevor
>
> Trevor Grant
> Data Scientist
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Mon, May 18, 2015 at 4:18 PM, Joseph Bradley 
> wrote:
>
>> Hi Tarek,
>>
>> Thanks for your interest & for checking the guidelines first!  On 2
>> points:
>>
>> Algorithm: PCA is of course a critical algorithm.  The main question is
>> how your algorithm/implementation differs from the current PCA.  If it's
>> different and potentially better, I'd recommend opening up a JIRA for
>> explaining & discussing it.
>>
>> Java/Scala: We really do require that algorithms be in Scala, for the
>> sake of maintainability.  The conversion should be doable if you're willing
>> since Scala is a pretty friendly language.  If you create the JIRA, you
>> could also ask for help there to see if someone can collaborate with you to
>> convert the code to Scala.
>>
>> Thanks!
>> Joseph
>>
>> On Mon, May 18, 2015 at 3:13 AM, Tarek Elgamal 
>> wrote:
>>
>>> Hi,
>>>
>>> I would like to contribute an algorithm to the MLlib project. I have
>>> implemented a scalable PCA algorithm on spark. It is scalable for both tall
>>> and fat matrices and the paper around it is accepted for publication in
>>> SIGMOD 2015 conference. I looked at the guidelines in the following link:
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>>
>>> I believe that most of the guidelines applies in my case, however, the
>>> code is written in java and it was not clear in the guidelines whether
>>> MLLib project accepts java code or not.
>>> My algorithm can be found under this repository:
>>> https://github.com/Qatar-Computing-Research-Institute/sPCA
>>>
>>> Any help on how to make it suitable for MLlib project will be greatly
>>> appreciated.
>>>
>>> Best Regards,
>>> Tarek Elgamal
>>>
>>>
>>>
>>>
>>
>


[VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.4.0!

The tag to be voted on is v1.4.0-rc1 (commit 777a081):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.4.0-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1092/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/

Please vote on releasing this package as Apache Spark 1.4.0!

The vote is open until Friday, May 22, at 17:03 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by
taking a Spark 1.3 workload and running on this release candidate,
then reporting any regressions.

== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.4 QA period,
so -1 votes should only occur for significant regressions from 1.3.1.
Bugs already present in 1.3.X, minor regressions, or bugs related
to new features will not block this release.




Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
A couple of other process things:

1. Please *keep voting* (+1/-1) on this thread even if we find some
issues, until we cut RC2. This lets us pipeline the QA.
2. The SQL team owes a JIRA clean-up (forthcoming shortly)... there
are still a few "Blockers" that aren't.


On Tue, May 19, 2015 at 9:10 AM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.4.0!
>
> The tag to be voted on is v1.4.0-rc1 (commit 777a081):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.4.0-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1092/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.0!
>
> The vote is open until Friday, May 22, at 17:03 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.3 workload and running on this release candidate,
> then reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.4 QA period,
> so -1 votes should only occur for significant regressions from 1.3.1.
> Bugs already present in 1.3.X, minor regressions, or bugs related
> to new features will not block this release.




Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Sean Owen
Before I vote, I wanted to point out there are still 9 Blockers for 1.4.0.
I'd like to use this status to really mean "must happen before the
release". Many of these may be already fixed, or aren't really blockers --
can just be updated accordingly.

I bet at least one will require further work if it's really meant for 1.4,
so all this means is there is likely to be another RC. We should still kick
the tires on RC1.

(I also assume we should be extra conservative about what is merged into
1.4 at this point.)


SPARK-6784 [SQL] Clean up all the inbound/outbound conversions for DateType (Adrian Wang)
SPARK-6811 [SparkR] Building binary R packages for SparkR (Shivaram Venkataraman)
SPARK-6941 [SQL] Provide a better error message to explain that tables created from RDDs are immutable
SPARK-7158 [SQL] collect and take return different results
SPARK-7478 [SQL] Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext (Tathagata Das)
SPARK-7616 [SQL] Overwriting a partitioned parquet table corrupt data (Cheng Lian)
SPARK-7654 [SQL] DataFrameReader and DataFrameWriter for input/output API (Reynold Xin)
SPARK-7662 [SQL] Exception of multi-attribute generator anlysis in projection
SPARK-7713 [SQL] Use shared broadcast hadoop conf for partitioned table scan (Yin Huai)


On Tue, May 19, 2015 at 5:10 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.4.0!
>
> The tag to be voted on is v1.4.0-rc1 (commit 777a081):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.4.0-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1092/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.0!
>
> The vote is open until Friday, May 22, at 17:03 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.3 workload and running on this release candidate,
> then reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.4 QA period,
> so -1 votes should only occur for significant regressions from 1.3.1.
> Bugs already present in 1.3.X, minor regressions, or bugs related
> to new features will not block this release.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Punyashloka Biswal
When publishing future RCs to the staging repository, would it be possible
to use a version number that includes the "rc1" designation? In the current
setup, when I run a build against the artifacts at
https://repository.apache.org/content/repositories/orgapachespark-1092/org/apache/spark/spark-core_2.10/1.4.0/,
my local Maven cache will get polluted with things that claim to be 1.4.0
but aren't. It would be preferable for the version number to be 1.4.0-rc1
instead.

Thanks!
Punya

On Tue, May 19, 2015 at 12:20 PM Sean Owen  wrote:

> Before I vote, I wanted to point out there are still 9 Blockers for 1.4.0.
> I'd like to use this status to really mean "must happen before the
> release". Many of these may be already fixed, or aren't really blockers --
> can just be updated accordingly.
>
> I bet at least one will require further work if it's really meant for 1.4,
> so all this means is there is likely to be another RC. We should still kick
> the tires on RC1.
>
> (I also assume we should be extra conservative about what is merged into
> 1.4 at this point.)
>
>
> SPARK-6784 SQL Clean up all the inbound/outbound conversions for DateType 
> Adrian
> Wang
>
> SPARK-6811 SparkR Building binary R packages for SparkR Shivaram
> Venkataraman
>
> SPARK-6941 SQL Provide a better error message to explain that tables
> created from RDDs are immutable
> SPARK-7158 SQL collect and take return different results
> SPARK-7478 SQL Add a SQLContext.getOrCreate to maintain a singleton
> instance of SQLContext Tathagata Das
>
> SPARK-7616 SQL Overwriting a partitioned parquet table corrupt data Cheng
> Lian
>
> SPARK-7654 SQL DataFrameReader and DataFrameWriter for input/output API 
> Reynold
> Xin
>
> SPARK-7662 SQL Exception of multi-attribute generator anlysis in
> projection
>
> SPARK-7713 SQL Use shared broadcast hadoop conf for partitioned table
> scan. Yin Huai
>
>
> On Tue, May 19, 2015 at 5:10 PM, Patrick Wendell 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.4.0!
>>
>> The tag to be voted on is v1.4.0-rc1 (commit 777a081):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.4.0-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1092/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.4.0!
>>
>> The vote is open until Friday, May 22, at 17:03 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == How can I help test this release? ==
>> If you are a Spark user, you can help us test this release by
>> taking a Spark 1.3 workload and running on this release candidate,
>> then reporting any regressions.
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening towards the end of the 1.4 QA period,
>> so -1 votes should only occur for significant regressions from 1.3.1.
>> Bugs already present in 1.3.X, minor regressions, or bugs related
>> to new features will not block this release.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
Punya,

Let me see if I can publish these under rc1 as well. In the future
this will all be automated, but currently it's a somewhat manual task.

- Patrick

On Tue, May 19, 2015 at 9:32 AM, Punyashloka Biswal
 wrote:
> When publishing future RCs to the staging repository, would it be possible
> to use a version number that includes the "rc1" designation? In the current
> setup, when I run a build against the artifacts at
> https://repository.apache.org/content/repositories/orgapachespark-1092/org/apache/spark/spark-core_2.10/1.4.0/,
> my local Maven cache will get polluted with things that claim to be 1.4.0
> but aren't. It would be preferable for the version number to be 1.4.0-rc1
> instead.
>
> Thanks!
> Punya
>
>
> On Tue, May 19, 2015 at 12:20 PM Sean Owen  wrote:
>>
>> Before I vote, I wanted to point out there are still 9 Blockers for 1.4.0.
>> I'd like to use this status to really mean "must happen before the release".
>> Many of these may be already fixed, or aren't really blockers -- can just be
>> updated accordingly.
>>
>> I bet at least one will require further work if it's really meant for 1.4,
>> so all this means is there is likely to be another RC. We should still kick
>> the tires on RC1.
>>
>> (I also assume we should be extra conservative about what is merged into
>> 1.4 at this point.)
>>
>>
>> SPARK-6784 SQL Clean up all the inbound/outbound conversions for DateType
>> Adrian Wang
>>
>> SPARK-6811 SparkR Building binary R packages for SparkR Shivaram
>> Venkataraman
>>
>> SPARK-6941 SQL Provide a better error message to explain that tables
>> created from RDDs are immutable
>> SPARK-7158 SQL collect and take return different results
>> SPARK-7478 SQL Add a SQLContext.getOrCreate to maintain a singleton
>> instance of SQLContext Tathagata Das
>>
>> SPARK-7616 SQL Overwriting a partitioned parquet table corrupt data Cheng
>> Lian
>>
>> SPARK-7654 SQL DataFrameReader and DataFrameWriter for input/output API
>> Reynold Xin
>>
>> SPARK-7662 SQL Exception of multi-attribute generator anlysis in
>> projection
>>
>> SPARK-7713 SQL Use shared broadcast hadoop conf for partitioned table
>> scan. Yin Huai
>>
>>
>> On Tue, May 19, 2015 at 5:10 PM, Patrick Wendell 
>> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.4.0!
>>>
>>> The tag to be voted on is v1.4.0-rc1 (commit 777a081):
>>>
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.4.0-rc1/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1092/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/
>>>
>>> Please vote on releasing this package as Apache Spark 1.4.0!
>>>
>>> The vote is open until Friday, May 22, at 17:03 UTC and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.4.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> == How can I help test this release? ==
>>> If you are a Spark user, you can help us test this release by
>>> taking a Spark 1.3 workload and running on this release candidate,
>>> then reporting any regressions.
>>>
>>> == What justifies a -1 vote for this release? ==
>>> This vote is happening towards the end of the 1.4 QA period,
>>> so -1 votes should only occur for significant regressions from 1.3.1.
>>> Bugs already present in 1.3.X, minor regressions, or bugs related
>>> to new features will not block this release.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>




Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Punyashloka Biswal
Thanks! I realize that manipulating the published version in the pom is a
bit inconvenient but it's really useful to have clear version identifiers
when we're juggling different versions and testing them out. For example,
this will come in handy when we compare 1.4.0-rc1 and 1.4.0-rc2 in a couple
of weeks :)

Punya

On Tue, May 19, 2015 at 12:39 PM Patrick Wendell  wrote:

> Punya,
>
> Let me see if I can publish these under rc1 as well. In the future
> this will all be automated but current it's a somewhat manual task.
>
> - Patrick
>
> On Tue, May 19, 2015 at 9:32 AM, Punyashloka Biswal
>  wrote:
> > When publishing future RCs to the staging repository, would it be
> possible
> > to use a version number that includes the "rc1" designation? In the
> current
> > setup, when I run a build against the artifacts at
> >
> https://repository.apache.org/content/repositories/orgapachespark-1092/org/apache/spark/spark-core_2.10/1.4.0/
> ,
> > my local Maven cache will get polluted with things that claim to be 1.4.0
> > but aren't. It would be preferable for the version number to be 1.4.0-rc1
> > instead.
> >
> > Thanks!
> > Punya
> >
> >
> > On Tue, May 19, 2015 at 12:20 PM Sean Owen  wrote:
> >>
> >> Before I vote, I wanted to point out there are still 9 Blockers for
> 1.4.0.
> >> I'd like to use this status to really mean "must happen before the
> release".
> >> Many of these may be already fixed, or aren't really blockers -- can
> just be
> >> updated accordingly.
> >>
> >> I bet at least one will require further work if it's really meant for
> 1.4,
> >> so all this means is there is likely to be another RC. We should still
> kick
> >> the tires on RC1.
> >>
> >> (I also assume we should be extra conservative about what is merged into
> >> 1.4 at this point.)
> >>
> >>
> >> SPARK-6784 SQL Clean up all the inbound/outbound conversions for
> DateType
> >> Adrian Wang
> >>
> >> SPARK-6811 SparkR Building binary R packages for SparkR Shivaram
> >> Venkataraman
> >>
> >> SPARK-6941 SQL Provide a better error message to explain that tables
> >> created from RDDs are immutable
> >> SPARK-7158 SQL collect and take return different results
> >> SPARK-7478 SQL Add a SQLContext.getOrCreate to maintain a singleton
> >> instance of SQLContext Tathagata Das
> >>
> >> SPARK-7616 SQL Overwriting a partitioned parquet table corrupt data
> Cheng
> >> Lian
> >>
> >> SPARK-7654 SQL DataFrameReader and DataFrameWriter for input/output API
> >> Reynold Xin
> >>
> >> SPARK-7662 SQL Exception of multi-attribute generator anlysis in
> >> projection
> >>
> >> SPARK-7713 SQL Use shared broadcast hadoop conf for partitioned table
> >> scan. Yin Huai
> >>
> >>
> >> On Tue, May 19, 2015 at 5:10 PM, Patrick Wendell 
> >> wrote:
> >>>
> >>> Please vote on releasing the following candidate as Apache Spark
> version
> >>> 1.4.0!
> >>>
> >>> The tag to be voted on is v1.4.0-rc1 (commit 777a081):
> >>>
> >>>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e
> >>>
> >>> The release files, including signatures, digests, etc. can be found at:
> >>> http://people.apache.org/~pwendell/spark-1.4.0-rc1/
> >>>
> >>> Release artifacts are signed with the following key:
> >>> https://people.apache.org/keys/committer/pwendell.asc
> >>>
> >>> The staging repository for this release can be found at:
> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1092/
> >>>
> >>> The documentation corresponding to this release can be found at:
> >>> http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/
> >>>
> >>> Please vote on releasing this package as Apache Spark 1.4.0!
> >>>
> >>> The vote is open until Friday, May 22, at 17:03 UTC and passes
> >>> if a majority of at least 3 +1 PMC votes are cast.
> >>>
> >>> [ ] +1 Release this package as Apache Spark 1.4.0
> >>> [ ] -1 Do not release this package because ...
> >>>
> >>> To learn more about Apache Spark, please see
> >>> http://spark.apache.org/
> >>>
> >>> == How can I help test this release? ==
> >>> If you are a Spark user, you can help us test this release by
> >>> taking a Spark 1.3 workload and running on this release candidate,
> >>> then reporting any regressions.
> >>>
> >>> == What justifies a -1 vote for this release? ==
> >>> This vote is happening towards the end of the 1.4 QA period,
> >>> so -1 votes should only occur for significant regressions from 1.3.1.
> >>> Bugs already present in 1.3.X, minor regressions, or bugs related
> >>> to new features will not block this release.
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> >>
> >
>


branch-1.4 merge etiquette

2015-05-19 Thread Patrick Wendell
Hey All,

Since we are now voting, please tread very carefully with branch-1.4 merges.

For instance, bug fixes that don't represent regressions from 1.3.X
probably shouldn't be merged unless they are extremely simple
and well reviewed.

As usual mature/core components (e.g. Spark core) are more sensitive
than newer/edge ones (e.g. Dataframes).

I'm happy to provide guidance to people if they are on the fence about
patches. Ultimately this ends up being a matter of judgement and
assessing risk of specific patches. Just ping me on github.

- Patrick




Re: [Catalyst] RFC: Using PartialFunction literals instead of objects

2015-05-19 Thread Michael Armbrust
Overall this seems like a reasonable proposal to me.  Here are a few
thoughts:

 - There is some debugging utility to the ruleName, so we would probably
want to at least make that an argument to the rule function.
 - We also have had rules that operate on SparkPlan, though since there is
only one ATM maybe we don't need sugar there.
 - I would not call the sugar for creating Strategies rule/seqrule, as I
think the one-to-one vs one-to-many distinction is useful.
 - I'm generally pro-refactoring to make the code nicer, especially when
it's not official public API, but I do think it's important to maintain
source compatibility (which I think you are) when possible, as there are
other projects using Catalyst.
 - Finally, we'll have to balance this with other code changes / conflicts.

You should probably open a JIRA and we can continue the discussion there.
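
For the first point, the helper could simply take the name as an explicit
argument, e.g. (a sketch, assuming `ruleName` stays an overridable val on
Rule as it is today):

def rule(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] =
  new Rule[LogicalPlan] {
    override val ruleName: String = name
    def apply(plan: LogicalPlan): LogicalPlan = plan transform pf
  }

// e.g. val MyRewriteRule = rule("MyRewriteRule") { case ... => ... }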

On Tue, May 19, 2015 at 4:16 AM, Edoardo Vacchi 
wrote:

> Hi everybody,
>
> At the moment, Catalyst rules are defined using two different types of
> rules:
> `Rule[LogicalPlan]` and `Strategy` (which in turn maps to
> `GenericStrategy[SparkPlan]`).
>
> I propose to introduce utility methods to
>
>   a) reduce the boilerplate to define rewrite rules
>   b) turning them back into what they essentially represent: function
> types.
>
> These changes would be backwards compatible, and would greatly help in
> understanding what the code does. Personally, I feel like the current
> use of objects is redundant and possibly confusing.
>
> ## `Rule[LogicalPlan]`
>
> The analyzer and optimizer use `Rule[LogicalPlan]`, which, besides
> defining a default `val ruleName`
> only defines the method `apply(plan: TreeType): TreeType`.
> Because the body of such method is always supposed to read `plan match
> pf`, with `pf`
> being some `PartialFunction[LogicalPlan, LogicalPlan]`, we can
> conclude that `Rule[LogicalPlan]`
> might be substituted by a PartialFunction.
>
> I propose the following:
>
> a) Introduce the utility method
>
> def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]):
> Rule[LogicalPlan] =
>   new Rule[LogicalPlan] {
> def apply (plan: LogicalPlan): LogicalPlan = plan transform pf
>   }
>
> b) progressively replace the boilerplate-y object definitions; e.g.
>
> object MyRewriteRule extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> case ... => ...
> }
>
> with
>
> // define a Rule[LogicalPlan]
> val MyRewriteRule = rule {
>   case ... => ...
> }
>
> it might also be possible to make rule method `implicit`, thereby
> further reducing MyRewriteRule to:
>
> // define a PartialFunction[LogicalPlan, LogicalPlan]
> // the implicit would convert it into a Rule[LogicalPlan] at the use
> sites
> val MyRewriteRule = {
>   case ... => ...
> }
>
>
> ## Strategies
>
> A similar solution could be applied to shorten the code for
> Strategies, which are total functions
> only because they are all supposed to manage the default case,
> possibly returning `Nil`. In this case
> we might introduce the following utility methods:
>
> /**
>  * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan].
>  * The partial function must therefore return *one single* SparkPlan
> for each case.
>  * The method will automatically wrap them in a [[Seq]].
>  * Unhandled cases will automatically return Seq.empty
>  */
> protected def rule(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy =
>   new Strategy {
> def apply(plan: LogicalPlan): Seq[SparkPlan] =
>   if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty
>   }
>
> /**
>  * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan]
> ].
>  * The partial function must therefore return a Seq[SparkPlan] for each
> case.
>  * Unhandled cases will automatically return Seq.empty
>  */
> protected def seqrule(pf: PartialFunction[LogicalPlan,
> Seq[SparkPlan]]): Strategy =
>   new Strategy {
> def apply(plan: LogicalPlan): Seq[SparkPlan] =
>   if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan]
>   }
>
> Thanks in advance
> e.v.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Resource usage of a spark application

2015-05-19 Thread Ryan Williams
Hi Peter, a few months ago I was using MetricsSystem to export to Graphite
and then view in Grafana; relevant scripts and some instructions are here
 if you want to
take a look.
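
For reference, the Graphite part boils down to a few lines in
conf/metrics.properties (the host, port and prefix below are placeholders):

# Send all metric sources to a Graphite sink every 10 seconds.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark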

On Sun, May 17, 2015 at 8:48 AM Peter Prettenhofer <
peter.prettenho...@gmail.com> wrote:

> Hi all,
>
> I'm looking for a way to measure the current memory / cpu usage of a spark
> application to provide users feedback how much resources are actually being
> used.
> It seems that the metric system provides this information to some extend.
> It logs metrics on application level (nr of cores granted) and on the JVM
> level (memory usage).
> Is this the recommended way to gather this kind of information? If so, how
> do i best map a spark application to the corresponding JVM processes?
>
> If not, should i rather request this information from the resource manager
> (e.g. Mesos/YARN)?
>
> thanks,
>  Peter
>
> --
> Peter Prettenhofer
>


OT: Key types which have potential issues

2015-05-19 Thread Mridul Muralidharan
Hi,

  I vaguely remember issues with using float/double as keys in MR (and Spark?),
but can't seem to find documentation/analysis about the same.
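
The kind of pitfall I mean, in a couple of lines (standard JVM/Scala
semantics, nothing Spark-specific):

// NaN is never equal to itself, so records keyed by NaN never group or join.
Double.NaN == Double.NaN                      // false
// 0.0 and -0.0 compare equal as primitives but box to different hash codes,
// so a hash partitioner can send "equal" keys to different partitions.
0.0 == -0.0                                   // true
new java.lang.Double(0.0).hashCode == new java.lang.Double(-0.0).hashCode  // false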

Does anyone have some resource/link I can refer to ?


Thanks,
Mridul




Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Krishna Sankar
Quick tests from my side - looks OK. The results are the same or very similar
to 1.3.1. Will add DataFrames et al. in future tests.

+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:42 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests
2. Tested pyspark, MLlib - running as well as comparing results with 1.3.1
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK

Cheers


On Tue, May 19, 2015 at 9:10 AM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.4.0!
>
> The tag to be voted on is v1.4.0-rc1 (commit 777a081):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.4.0-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1092/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.0!
>
> The vote is open until Friday, May 22, at 17:03 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.3 workload and running on this release candidate,
> then reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.4 QA period,
> so -1 votes should only occur for significant regressions from 1.3.1.
> Bugs already present in 1.3.X, minor regressions, or bugs related
> to new features will not block this release.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
HI all,

I've created another release repository where the release is
identified with the version 1.4.0-rc1:

https://repository.apache.org/content/repositories/orgapachespark-1093/

On Tue, May 19, 2015 at 5:36 PM, Krishna Sankar  wrote:
> Quick tests from my side - looks OK. The results are same or very similar to
> 1.3.1. Will add dataframes et al in future tests.
>
> +1 (non-binding, of course)
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:42 min
>  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
> -Dhadoop.version=2.6.0 -Phive -DskipTests
> 2. Tested pyspark, mlib - running as well as compare results with 1.3.1
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Laso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>Center And Scale OK
> 2.5. RDD operations OK
>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>Model evaluation/optimization (rank, numIter, lambda) with itertools
> OK
>
> Cheers
> 
>
> On Tue, May 19, 2015 at 9:10 AM, Patrick Wendell  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.4.0!
>>
>> The tag to be voted on is v1.4.0-rc1 (commit 777a081):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.4.0-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1092/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.4.0!
>>
>> The vote is open until Friday, May 22, at 17:03 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == How can I help test this release? ==
>> If you are a Spark user, you can help us test this release by
>> taking a Spark 1.3 workload and running on this release candidate,
>> then reporting any regressions.
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening towards the end of the 1.4 QA period,
>> so -1 votes should only occur for significant regressions from 1.3.1.
>> Bugs already present in 1.3.X, minor regressions, or bugs related
>> to new features will not block this release.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
