[Catalyst] RFC: Using PartialFunction literals instead of objects

2015-05-19 Thread Edoardo Vacchi
Hi everybody, At the moment, Catalyst rules are defined using two different types: `Rule[LogicalPlan]` and `Strategy` (which in turn maps to `GenericStrategy[SparkPlan]`). I propose to introduce utility methods to a) reduce the boilerplate needed to define rewrite rules b) turning them

Performance and Memory Issues When Creating Many Columns in GROUP BY (spark-sql)

2015-05-19 Thread daniel.mescheder
Dear List, We have run into serious problems trying to run a larger than average number of aggregations in a GROUP BY query. Symptoms of this problem are OutOfMemory exceptions and unreasonably long processing times due to GC. The problem occurs when the following two conditions are met: - The

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Krishna Sankar
Quick tests from my side - looks OK. The results are the same or very similar to 1.3.1. Will add dataframes et al in future tests. +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:42 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
Hi all, I've created another release repository where the release is identified with the version 1.4.0-rc1: https://repository.apache.org/content/repositories/orgapachespark-1093/ On Tue, May 19, 2015 at 5:36 PM, Krishna Sankar ksanka...@gmail.com wrote: Quick tests from my side - looks OK.

Re: Contribute code to MLlib

2015-05-19 Thread Trevor Grant
There are most likely advantages and disadvantages to Tarek's algorithm against the current implementation, and different scenarios where each is more appropriate. Would we not offer multiple PCA algorithms and let the user choose? Trevor Trevor Grant Data Scientist *Fortunate is he, who is

Re: [ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-19 Thread Tim Ellison
Sean, Did the JIRA get created? If so I can't find it so a pointer would be helpful. Regards, Tim On 06/05/15 06:59, Reynold Xin wrote: Sean - Please do. On Tue, May 5, 2015 at 10:57 PM, Sean Owen so...@cloudera.com wrote: OK to file a JIRA to scrape out a few Java 6-specific things in

Re: RDD split into multiple RDDs

2015-05-19 Thread Justin Uang
To do it in one pass, conceptually what you would need to do is to consume the entire parent iterator and store the values either in memory or on disk, which is generally something you want to avoid given that the parent iterator length is unbounded. If you need to start spilling to disk, you
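The trade-off described in this message can be illustrated with a small sketch in plain Python. It is not Spark code: `split_one_pass` and `split_two_pass` are hypothetical helper names, and ordinary iterators stand in for a parent RDD partition. It only shows why a single pass forces buffering of the outputs, while lazy outputs require re-reading the parent once per output.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def split_one_pass(
    parent: Iterable[T], pred: Callable[[T], bool]
) -> Tuple[List[T], List[T]]:
    """Consume the parent exactly once.

    Both outputs must be fully buffered (here: in memory), so memory use
    grows with the length of the parent, which may be unbounded.
    """
    matched: List[T] = []
    rest: List[T] = []
    for item in parent:
        (matched if pred(item) else rest).append(item)
    return matched, rest


def split_two_pass(
    make_parent: Callable[[], Iterable[T]], pred: Callable[[T], bool]
) -> Tuple[Iterator[T], Iterator[T]]:
    """Buffer nothing, but traverse the parent once per output.

    `make_parent` must be able to produce the parent data again,
    which corresponds to recomputing (or caching) the parent.
    """
    first = (x for x in make_parent() if pred(x))
    second = (x for x in make_parent() if not pred(x))
    return first, second


# One pass over a single-use iterator: everything ends up buffered.
evens, odds = split_one_pass(iter(range(10)), lambda n: n % 2 == 0)

# Two lazy passes: the parent is regenerated for each output.
lazy_evens, lazy_odds = split_two_pass(lambda: iter(range(6)), lambda n: n % 2 == 0)
```

In the one-pass version the cost is storage proportional to the parent; in the two-pass version the cost is reading the parent twice.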

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
Punya, Let me see if I can publish these under rc1 as well. In the future this will all be automated but currently it's a somewhat manual task. - Patrick On Tue, May 19, 2015 at 9:32 AM, Punyashloka Biswal punya.bis...@gmail.com wrote: When publishing future RCs to the staging repository, would

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Punyashloka Biswal
Thanks! I realize that manipulating the published version in the pom is a bit inconvenient but it's really useful to have clear version identifiers when we're juggling different versions and testing them out. For example, this will come in handy when we compare 1.4.0-rc1 and 1.4.0-rc2 in a couple

branch-1.4 merge etiquette

2015-05-19 Thread Patrick Wendell
Hey All, Since we are now voting, please tread very carefully with branch-1.4 merges. For instance, bug fixes that don't represent regressions from 1.3.X probably shouldn't be merged unless they are extremely simple and well reviewed. As usual mature/core components (e.g. Spark core)

Re: [Catalyst] RFC: Using PartialFunction literals instead of objects

2015-05-19 Thread Michael Armbrust
Overall this seems like a reasonable proposal to me. Here are a few thoughts: - There is some debugging utility to the ruleName, so we would probably want to at least make that an argument to the rule function. - We also have had rules that operate on SparkPlan, though since there is only one

[VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc1 (commit 777a081): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e The release files, including signatures, digests, etc.

Re: Problem building master on 2.11

2015-05-19 Thread Iulian Dragoș
There's an open PR to fix it. If you could try it and report back on the PR it'd be great. More likely to get in fast. https://github.com/apache/spark/pull/6260 On Mon, May 18, 2015 at 6:43 PM, Fernando O. fot...@gmail.com wrote: I just noticed I sent this to users instead of dev: --

OT: Key types which have potential issues

2015-05-19 Thread Mridul Muralidharan
Hi, I vaguely remember issues with using float/double as keys in MR (and Spark?). But I can't seem to find documentation/analysis about the same. Does anyone have some resource/link I can refer to? Thanks, Mridul
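The message does not say which issues are meant. As a general illustration only, and assuming the concern is ordinary floating-point behaviour rather than anything specific to MapReduce or Spark, the plain-Python sketch below shows two well-known pitfalls of using floating-point values as grouping keys: rounding error splits keys that were meant to be equal, and NaN never compares equal to itself. The `quantize` helper is hypothetical and shows one possible mitigation.

```python
from collections import defaultdict

# Pitfall 1: rounding error. 0.1 + 0.2 is not exactly 0.3 in binary
# floating point, so the two values land in different groups.
counts = defaultdict(int)
for key in (0.1 + 0.2, 0.3):
    counts[key] += 1

# Pitfall 2: NaN is never equal to NaN, so each NaN value created
# separately becomes its own key and the entries are never merged.
nan_counts = {}
for key in (float("nan"), float("nan")):
    nan_counts[key] = nan_counts.get(key, 0) + 1


def quantize(value: float, scale: int = 1_000_000) -> int:
    """Hypothetical mitigation: map a float to an integer key at a fixed
    resolution (here one millionth) before grouping."""
    return round(value * scale)


quantized_counts = defaultdict(int)
for key in (0.1 + 0.2, 0.3):
    quantized_counts[quantize(key)] += 1
```

Quantizing only helps when a fixed resolution is acceptable for the data, and it does not address NaN, which would need to be filtered or mapped to a sentinel key separately.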

Re: Resource usage of a spark application

2015-05-19 Thread Ryan Williams
Hi Peter, a few months ago I was using MetricsSystem to export to Graphite and then view in Grafana; relevant scripts and some instructions are here https://github.com/hammerlab/grafana-spark-dashboards/ if you want to take a look. On Sun, May 17, 2015 at 8:48 AM Peter Prettenhofer

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
A couple of other process things: 1. Please *keep voting* (+1/-1) on this thread even if we find some issues, until we cut RC2. This lets us pipeline the QA. 2. The SQL team owes a JIRA clean-up (forthcoming shortly)... there are still a few Blockers that aren't. On Tue, May 19, 2015 at 9:10

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Sean Owen
Before I vote, I wanted to point out there are still 9 Blockers for 1.4.0. I'd like to use this status to really mean must happen before the release. Many of these may be already fixed, or aren't really blockers -- can just be updated accordingly. I bet at least one will require further work if