Performance & Memory Issues When Creating Many Columns in GROUP BY (spark-sql)

2015-05-19 Thread daniel.mescheder
Dear List, We have run into serious problems trying to run a larger-than-average number of aggregations in a GROUP BY query. Symptoms of this problem are OutOfMemory exceptions and unreasonably long processing times due to GC. The problem occurs when the following two conditions are met: - The
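
A minimal sketch of the query shape described above (Spark 1.x DataFrame API; the data and column names are made up, and it is paste-able into a 1.x spark-shell where `sc` and `sqlContext` are provided):

```scala
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Toy data: one grouping key and one metric column (names are hypothetical).
val df = sc.parallelize(1 to 1000).map(i => (i % 10, i.toDouble)).toDF("key", "metric")

// Several hundred aggregate expressions over the same grouping key; the size of
// the generated plan and the aggregation buffers grows with this number, which
// is the regime the report describes.
val aggs = (1 to 500).map(i => sum($"metric").as(s"sum_$i"))
val result = df.groupBy($"key").agg(aggs.head, aggs.tail: _*)
result.explain()
```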

[Catalyst] RFC: Using PartialFunction literals instead of objects

2015-05-19 Thread Edoardo Vacchi
Hi everybody, At the moment, Catalyst rules are defined using two different types of rules: `Rule[LogicalPlan]` and `Strategy` (which in turn maps to `GenericStrategy[SparkPlan]`). I propose to introduce utility methods to a) reduce the boilerplate needed to define rewrite rules and b) turn them bac
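
For readers skimming the archive, a sketch of the contrast being proposed; the `rule` helper and the example rewrite are illustrative, not the actual patch:

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Today: each rewrite is an object extending Rule[LogicalPlan].
object SimplifyNoopFilter extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(condition, child) if condition == Literal(true) => child
  }
}

// Proposed shape: a utility that lifts a partial function into a Rule, keeping
// a ruleName for debugging (a point Michael raises later in the thread).
def rule(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] =
  new Rule[LogicalPlan] {
    override val ruleName: String = name
    def apply(plan: LogicalPlan): LogicalPlan = plan transform pf
  }

val simplifyNoopFilter = rule("SimplifyNoopFilter") {
  case Filter(condition, child) if condition == Literal(true) => child
}
```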

Re: Problem building master on 2.11

2015-05-19 Thread Iulian Dragoș
There's an open PR to fix it. If you could try it and report back on the PR it'd be great. More likely to get in fast. https://github.com/apache/spark/pull/6260 On Mon, May 18, 2015 at 6:43 PM, Fernando O. wrote: > I just noticed I sent this to users instead of dev: > > -- Forwarded mes

Re: Contribute code to MLlib

2015-05-19 Thread Trevor Grant
There are most likely advantages and disadvantages to Tarek's algorithm compared with the current implementation, and different scenarios where each is more appropriate. Would we not offer multiple PCA algorithms and let the user choose? Trevor Trevor Grant Data Scientist *"Fortunate is he, who is a
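
For context, the existing MLlib PCA entry point that an alternative implementation would sit alongside is RowMatrix.computePrincipalComponents. A toy, paste-able example for a 1.x spark-shell (where `sc` is provided):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Three small dense rows, just to exercise the API.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))
val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(2) // top 2 principal components
val projected = mat.multiply(pc)           // rows projected onto those components
projected.rows.collect().foreach(println)
```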

Re: [ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-19 Thread Tim Ellison
Sean, Did the JIRA get created? If so, I can't find it, so a pointer would be helpful. Regards, Tim On 06/05/15 06:59, Reynold Xin wrote: > Sean - Please do. > > On Tue, May 5, 2015 at 10:57 PM, Sean Owen wrote: > >> OK to file a JIRA to scrape out a few Java 6-specific things in the >> code?

Re: RDD split into multiple RDDs

2015-05-19 Thread Justin Uang
To do it in one pass, conceptually what you would need to do is to consume the entire parent iterator and store the values either in memory or on disk, which is generally something you want to avoid given that the parent iterator length is unbounded. If you need to start spilling to disk, you might
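
For contrast, a sketch of the usual multi-pass workaround (helper name and key type are made up): persist the parent once and derive each split with its own filter, trading extra passes over the data for bounded memory.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// One filter pass per key; the persisted parent avoids recomputing the lineage
// for every split, but nothing needs to buffer the whole parent iterator.
def splitByKey[V](parent: RDD[(Int, V)], keys: Seq[Int]): Map[Int, RDD[(Int, V)]] = {
  val cached = parent.persist(StorageLevel.MEMORY_AND_DISK)
  keys.map(k => k -> cached.filter { case (key, _) => key == k }).toMap
}
```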

Re: [ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-19 Thread Sean Owen
No, I didn't yet. I was hoping to change the default version and make a few obvious changes to take advantage of it all at once. Go ahead with a JIRA. I can look into it this evening. We have just a little actual Java code so the new language features might be nice to use there but won't have a bi

Re: Contribute code to MLlib

2015-05-19 Thread Ram Sriharsha
Hi Trevor, Tarek You can make non-standard algorithms (PCA or otherwise) available to users of Spark as Spark Packages. http://spark-packages.org https://databricks.com/blog/2014/12/22/announcing-spark-packages.html With the availability of spark packages, adding powerful experimental / alternative m

[VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc1 (commit 777a081): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e The release files, including signatures, digests, etc. can

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
A couple of other process things: 1. Please *keep voting* (+1/-1) on this thread even if we find some issues, until we cut RC2. This lets us pipeline the QA. 2. The SQL team owes a JIRA clean-up (forthcoming shortly)... there are still a few "Blockers" that aren't. On Tue, May 19, 2015 at 9:10

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Sean Owen
Before I vote, I wanted to point out there are still 9 Blockers for 1.4.0. I'd like to use this status to really mean "must happen before the release". Many of these may be already fixed, or aren't really blockers -- can just be updated accordingly. I bet at least one will require further work if

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Punyashloka Biswal
When publishing future RCs to the staging repository, would it be possible to use a version number that includes the "rc1" designation? In the current setup, when I run a build against the artifacts at https://repository.apache.org/content/repositories/orgapachespark-1092/org/apache/spark/spark-cor

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
Punya, Let me see if I can publish these under rc1 as well. In the future this will all be automated but currently it's a somewhat manual task. - Patrick On Tue, May 19, 2015 at 9:32 AM, Punyashloka Biswal wrote: > When publishing future RCs to the staging repository, would it be possible > to us

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Punyashloka Biswal
Thanks! I realize that manipulating the published version in the pom is a bit inconvenient but it's really useful to have clear version identifiers when we're juggling different versions and testing them out. For example, this will come in handy when we compare 1.4.0-rc1 and 1.4.0-rc2 in a couple o

branch-1.4 merge etiquette

2015-05-19 Thread Patrick Wendell
Hey All, Since we are now voting, please tread very carefully with branch-1.4 merges. For instance, bug fixes that don't represent regressions from 1.3.X probably shouldn't be merged unless they are extremely simple and well reviewed. As usual mature/core components (e.g. Spark core) are

Re: [Catalyst] RFC: Using PartialFunction literals instead of objects

2015-05-19 Thread Michael Armbrust
Overall this seems like a reasonable proposal to me. Here are a few thoughts: - There is some debugging utility to the ruleName, so we would probably want to at least make that an argument to the rule function. - We also have had rules that operate on SparkPlan, though since there is only one A

Re: Resource usage of a spark application

2015-05-19 Thread Ryan Williams
Hi Peter, a few months ago I was using MetricsSystem to export to Graphite and then view in Grafana; relevant scripts and some instructions are here if you want to take a look. On Sun, May 17, 2015 at 8:48 AM Peter Prettenhofer < peter.prett
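
For anyone searching the archive, a hedged sketch of the setup Ryan describes: point the application at a metrics config that enables Spark's built-in GraphiteSink, then chart the exported metrics in Grafana. The file path and host below are placeholders; the sink keys follow conf/metrics.properties.template.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("metrics-demo")
  // The referenced metrics.properties would contain entries such as:
  //   *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
  //   *.sink.graphite.host=graphite.example.com
  //   *.sink.graphite.port=2003
  //   *.sink.graphite.period=10
  //   *.sink.graphite.unit=seconds
  .set("spark.metrics.conf", "/path/to/metrics.properties")

val sc = new SparkContext(conf)
```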

OT: Key types which have potential issues

2015-05-19 Thread Mridul Muralidharan
Hi, I vaguely remember issues with using float/double as keys in MR (and Spark?). But I can't seem to find documentation/analysis about the same. Does anyone have some resource/link I can refer to? Thanks, Mridul
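
Not the documentation Mridul is asking for, but a minimal illustration of why float/double keys are risky: primitive `==` and boxed `equals`/`hashCode` disagree on NaN and on 0.0 vs -0.0, so equality- or hash-based grouping can split or merge such keys unexpectedly.

```scala
object DoubleKeyPitfalls extends App {
  // NaN: never equal to itself as a primitive, but equal once boxed.
  println(Double.NaN == Double.NaN)                         // false
  println(java.lang.Double.valueOf(Double.NaN)
            .equals(java.lang.Double.valueOf(Double.NaN)))  // true

  // Signed zero: equal as primitives, but distinct once boxed.
  println(0.0 == -0.0)                                      // true
  println(java.lang.Double.valueOf(0.0)
            .equals(java.lang.Double.valueOf(-0.0)))        // false
  println(java.lang.Double.valueOf(0.0).hashCode ==
          java.lang.Double.valueOf(-0.0).hashCode)          // false: different buckets
}
```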

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Krishna Sankar
Quick tests from my side - looks OK. The results are same or very similar to 1.3.1. Will add dataframes et al in future tests. +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:42 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Patrick Wendell
Hi all, I've created another release repository where the release is identified with the version 1.4.0-rc1: https://repository.apache.org/content/repositories/orgapachespark-1093/ On Tue, May 19, 2015 at 5:36 PM, Krishna Sankar wrote: > Quick tests from my side - looks OK. The results are same
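
A sketch of how one might test against the rc1-versioned staging artifacts from an sbt build (build.sbt); the exact artifact coordinates should be confirmed against the repository above:

```scala
// Add the staging repository announced in this thread as a resolver.
resolvers += "Spark 1.4.0 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1093/"

// Depend on the rc1-identified version rather than plain 1.4.0.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0-rc1"
```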