[GitHub] spark pull request: SPARK-2553. CoGroupedRDD unnecessarily allocat...
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/1461 SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency... ... per key My humble opinion is that avoiding allocations in this performance-critical section is worth the extra code. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sryza/spark sandy-spark-2553 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1461.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1461 commit 7eaf7f2a18ddea7ee47aacb5c5559c278a924899 Author: Sandy Ryza Date: 2014-07-17T08:19:48Z SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency per key --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1361#issuecomment-49272423 QA tests have started for PR 1361. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16774/consoleFull
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1447#issuecomment-49272425 QA tests have started for PR 1447. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16773/consoleFull
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1361#issuecomment-49272169 Jenkins, test this please.
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1361#issuecomment-49272156 Jenkins, add to whitelist.
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49271938 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16772/consoleFull
[GitHub] spark pull request: [SPARK-2299] Consolidate various stageIdTo* ha...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1262#issuecomment-49271814
QA results for PR 1262:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16768/consoleFull
[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt
Github user mengxr closed the pull request at: https://github.com/apache/spark/pull/1459
[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1459#issuecomment-49271772 Merged into branch-0.9.
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/1460 [SPARK-2538] [PySpark] Hash based disk spilling aggregation During aggregation in the Python worker, if memory usage rises above spark.executor.memory, the worker switches to disk-spilling aggregation. It splits the aggregation into multiple stages; in each stage, it partitions the aggregated data by hash and dumps the partitions to disk. After all the data has been aggregated, it merges the stages together, partition by partition. You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark spill Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1460.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1460 commit f933713ed628779309fab0da76045f8750d6b350 Author: Davies Liu Date: 2014-07-17T08:03:32Z Hash based disk spilling aggregation
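A minimal sketch of the spilling scheme described in the pull request above; it is not the code from the patch. The class name `SpillingAggregator`, its parameters, and the key-count threshold `max_keys` (standing in for a real memory check against spark.executor.memory) are all assumptions made for illustration.

```python
import os
import pickle
import shutil
import tempfile


class SpillingAggregator:
    """Toy hash-based aggregator that spills to disk (hypothetical, not PySpark's)."""

    def __init__(self, combine, num_partitions=4, max_keys=1000):
        self.combine = combine                # merges two values of the same key
        self.num_partitions = num_partitions
        self.max_keys = max_keys              # assumption: stand-in for a memory limit
        self.data = {}                        # in-memory partial aggregates
        self.spill_dir = tempfile.mkdtemp(prefix="spill-")
        self.num_spills = 0                   # number of stages written so far

    def _path(self, stage, partition):
        return os.path.join(self.spill_dir, "%d-%d.pkl" % (stage, partition))

    def add(self, key, value):
        if key in self.data:
            self.data[key] = self.combine(self.data[key], value)
        else:
            self.data[key] = value
        if len(self.data) > self.max_keys:
            self._spill()

    def _spill(self):
        # One stage: split the in-memory aggregates by hash(key) and
        # write one file per partition, then start over with an empty map.
        buckets = [{} for _ in range(self.num_partitions)]
        for key, value in self.data.items():
            buckets[hash(key) % self.num_partitions][key] = value
        for p, bucket in enumerate(buckets):
            with open(self._path(self.num_spills, p), "wb") as f:
                pickle.dump(bucket, f)
        self.num_spills += 1
        self.data = {}

    def items(self):
        """Yield the final (key, value) pairs, then delete the spill files."""
        try:
            if self.num_spills == 0:
                # Nothing was spilled: the in-memory map is the result.
                yield from self.data.items()
                return
            self._spill()  # flush the remainder so every key lives on disk
            # Merge partition by partition: a key always hashes to the same
            # partition, so only one partition is held in memory at a time.
            for p in range(self.num_partitions):
                merged = {}
                for stage in range(self.num_spills):
                    with open(self._path(stage, p), "rb") as f:
                        for key, value in pickle.load(f).items():
                            if key in merged:
                                merged[key] = self.combine(merged[key], value)
                            else:
                                merged[key] = value
                yield from merged.items()
        finally:
            shutil.rmtree(self.spill_dir, ignore_errors=True)
```

With a small threshold, for example `SpillingAggregator(lambda a, b: a + b, max_keys=5)`, feeding pairs through `add` and then reading `dict(agg.items())` exercises both the spill and the merge path. Partitioning by hash is what keeps the final merge bounded: each partition can be combined across stages on its own.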
[GitHub] spark pull request: [SPARK-2423] Clean up SparkSubmit for readabil...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1349
[GitHub] spark pull request: [SPARK-2423] Clean up SparkSubmit for readabil...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1349#issuecomment-49271133 Thanks Andrew, looks good!
[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1399#discussion_r15045195
--- Diff: sbin/start-thriftserver.sh ---
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Figure out where Spark is installed
+FWDIR="$(cd `dirname $0`/..; pwd)"
+
+CLASS="org.apache.spark.sql.hive.thriftserver.HiveThriftServer2"
+$FWDIR/bin/spark-class $CLASS $@
--- End diff --
Check out `spark-shell` for an example of a user-facing script that triages options to `spark-submit`.
[GitHub] spark pull request: SPARK-2526: Simplify options in make-distribut...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1445
[GitHub] spark pull request: [SPARK-2412] CoalescedRDD throws exception wit...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1337
[GitHub] spark pull request: [SPARK-2412] CoalescedRDD throws exception wit...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1337#issuecomment-49270261 Okay I merged this.
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49269144 I created a JIRA to deal with this and did some initial exploration, but I think I'll need to wait for Prashant to actually do it: https://issues.apache.org/jira/browse/SPARK-2549
[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1459#issuecomment-49267990
QA results for PR 1459:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16769/consoleFull
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1447#issuecomment-49267874 QA tests have started for PR 1447. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16771/consoleFull
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15044062
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     val index = p.getPartition(key)
     def process(it: Iterator[(K, V)]): Seq[V] = {
       val buf = new ArrayBuffer[V]
-      for ((k, v) <- it if k == key) {
--- End diff --
Actually yes my bad :)
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15044028
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     val index = p.getPartition(key)
     def process(it: Iterator[(K, V)]): Seq[V] = {
       val buf = new ArrayBuffer[V]
-      for ((k, v) <- it if k == key) {
--- End diff --
Wait, actually, my understanding of this loop is that it's iterating over every record within the partition. Am I missing something?
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15043860
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
       val map = new JHashMap[K, V]
-      iter.foreach { case (k, v) =>
-        val old = map.get(k)
-        map.put(k, if (old == null) v else func(old, v))
+      iter.foreach { pair =>
+        val old = map.get(pair._1)
--- End diff --
Also on removing the calls to _1, these values should be in cache, so accesses will be really fast. Will hold off on these for now, but happy to make the change if y'all want.
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1447#discussion_r15043849
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
       throw new SparkException("Default partitioner cannot partition array keys.")
     }
     val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
-    cg.mapValues { case Seq(vs, w1s, w2s, w3s) =>
-      (vs.asInstanceOf[Seq[V]],
-        w1s.asInstanceOf[Seq[W1]],
-        w2s.asInstanceOf[Seq[W2]],
-        w3s.asInstanceOf[Seq[W3]])
-    }
+    cg.mapValues { seq => seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], Seq[W3])] }
--- End diff --
Ah, right. I'll leave these as they are for now.
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49266828 QA tests have started for PR 1450. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16770/consoleFull
[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1459#issuecomment-49266193 QA tests have started for PR 1459. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16769/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1450#discussion_r15043414
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
       throw new SparkException("reduceByKeyLocally() does not support array keys")
     }
-    def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
+    val reducePartition = (iter: Iterator[(K, V)]) => {
--- End diff --
this is fixed
[GitHub] spark pull request: update CHANGES.txt
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/1459 update CHANGES.txt You can merge this pull request into a Git repository by running: $ git pull https://github.com/mengxr/spark v0.9.2-rc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1459.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1459 commit 6fa5a656281dd5df5cf7f72660efd89fe7b8ec8d Author: Xiangrui Meng Date: 2014-07-17T06:59:57Z update CHANGES.txt