[GitHub] spark pull request: SPARK-2553. CoGroupedRDD unnecessarily allocat...

2014-07-17 Thread sryza
GitHub user sryza opened a pull request:

https://github.com/apache/spark/pull/1461

SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency per key

In my humble opinion, avoiding allocations in this performance-critical
section is worth the extra code.
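
As a toy illustration of the kind of change (not the actual CoGroupedRDD
patch; the names below are made up), the idea is to write values straight
into a per-dependency buffer instead of building an intermediate Tuple2 per
element in the hot loop:

    import scala.collection.mutable.ArrayBuffer

    // Allocating version: wraps every value in a (depIndex, value) pair just
    // to carry it to the consumer.
    def tagged(values: Iterator[String], depIndex: Int): Iterator[(Int, String)] =
      values.map(v => (depIndex, v))

    // Allocation-free version: append the values directly into the buffer
    // that already corresponds to this dependency, so no pair is ever built.
    def appendTo(values: Iterator[String], buf: ArrayBuffer[String]): Unit =
      while (values.hasNext) {
        buf += values.next()
      }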

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sryza/spark sandy-spark-2553

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1461.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1461


commit 7eaf7f2a18ddea7ee47aacb5c5559c278a924899
Author: Sandy Ryza 
Date:   2014-07-17T08:19:48Z

SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency 
per key




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1361#issuecomment-49272423
  
QA tests have started for PR 1361. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16774/consoleFull




[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1447#issuecomment-49272425
  
QA tests have started for PR 1447. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16773/consoleFull




[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-17 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1361#issuecomment-49272169
  
Jenkins, test this please.




[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-17 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1361#issuecomment-49272156
  
Jenkins, add to whitelist.




[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1460#issuecomment-49271938
  
QA tests have started for PR 1460. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16772/consoleFull




[GitHub] spark pull request: [SPARK-2299] Consolidate various stageIdTo* ha...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1262#issuecomment-49271814
  
QA results for PR 1262:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16768/consoleFull




[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt

2014-07-17 Thread mengxr
Github user mengxr closed the pull request at:

https://github.com/apache/spark/pull/1459




[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt

2014-07-17 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1459#issuecomment-49271772
  
Merged into branch-0.9.




[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-17 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/1460

[SPARK-2538] [PySpark] Hash based disk spilling aggregation

During aggregation in the Python worker, if memory usage rises above 
spark.executor.memory, it will do disk-spilling aggregation. 

It will split the aggregation into multiple stages; in each stage, it will 
partition the aggregated data by hash and dump the partitions to disk. After 
all the data have been aggregated, it will merge all the stages together 
(partition by partition).
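
For intuition, here is a minimal Scala sketch of the general approach (the 
actual change lives in the PySpark worker, so everything below, including the 
object name, the fixed partition count, and the string/int value types, is 
illustrative only):

    import java.io._
    import scala.collection.mutable
    import scala.io.Source

    object HashSpillSketch {
      val numPartitions = 4  // illustrative; a real implementation would tune this

      // Spill the in-memory map into per-partition files, routing each key by hash.
      def spill(map: mutable.Map[String, Int], dir: File): Unit = {
        val writers = Array.tabulate(numPartitions) { i =>
          new PrintWriter(new FileWriter(new File(dir, s"part-$i"), /* append = */ true))
        }
        for ((k, v) <- map) {
          val p = (k.hashCode & Int.MaxValue) % numPartitions
          writers(p).println(s"$k\t$v")
        }
        writers.foreach(_.close())
        map.clear()
      }

      // Once all input has been consumed, merge the spilled data one partition
      // at a time, so only a single partition's keys are held in memory.
      def mergePartition(dir: File, partition: Int, combine: (Int, Int) => Int): Map[String, Int] = {
        val merged = mutable.Map.empty[String, Int]
        val file = new File(dir, s"part-$partition")
        if (file.exists()) {
          val src = Source.fromFile(file)
          for (line <- src.getLines()) {
            val Array(k, v) = line.split("\t")
            merged(k) = merged.get(k).map(combine(_, v.toInt)).getOrElse(v.toInt)
          }
          src.close()
        }
        merged.toMap
      }
    }

Calling spill whenever the map grows too large, then mergePartition for 
partitions 0 until numPartitions, reproduces the stage-by-stage, 
partition-by-partition merge described above.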

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark spill

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1460.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1460


commit f933713ed628779309fab0da76045f8750d6b350
Author: Davies Liu 
Date:   2014-07-17T08:03:32Z

Hash based disk spilling aggregation






[GitHub] spark pull request: [SPARK-2423] Clean up SparkSubmit for readabil...

2014-07-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1349




[GitHub] spark pull request: [SPARK-2423] Clean up SparkSubmit for readabil...

2014-07-17 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1349#issuecomment-49271133
  
Thanks Andrew, looks good!




[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...

2014-07-17 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1399#discussion_r15045195
  
--- Diff: sbin/start-thriftserver.sh ---
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Figure out where Spark is installed
+FWDIR="$(cd `dirname $0`/..; pwd)"
+
+CLASS="org.apache.spark.sql.hive.thriftserver.HiveThriftServer2"
+$FWDIR/bin/spark-class $CLASS $@
--- End diff --

Check out `spark-shell` for an example of a user-facing script that triages 
options to `spark-submit`.




[GitHub] spark pull request: SPARK-2526: Simplify options in make-distribut...

2014-07-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1445




[GitHub] spark pull request: [SPARK-2412] CoalescedRDD throws exception wit...

2014-07-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1337




[GitHub] spark pull request: [SPARK-2412] CoalescedRDD throws exception wit...

2014-07-17 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1337#issuecomment-49270261
  
Okay I merged this.




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49269144
  
I created a JIRA to deal with this and did some initial exploration, but I 
think I'll need to wait for Prashant to actually do it:

https://issues.apache.org/jira/browse/SPARK-2549




[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1459#issuecomment-49267990
  
QA results for PR 1459:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16769/consoleFull




[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1447#issuecomment-49267874
  
QA tests have started for PR 1447. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16771/consoleFull




[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15044062
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 val index = p.getPartition(key)
 def process(it: Iterator[(K, V)]): Seq[V] = {
   val buf = new ArrayBuffer[V]
-  for ((k, v) <- it if k == key) {
--- End diff --

Actually yes my bad :)




[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15044028
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 val index = p.getPartition(key)
 def process(it: Iterator[(K, V)]): Seq[V] = {
   val buf = new ArrayBuffer[V]
-  for ((k, v) <- it if k == key) {
--- End diff --

Wait, actually, my understanding of this loop is that it's iterating over 
every record within the partition.  Am I missing something? 




[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15043860
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 
 def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] 
= {
   val map = new JHashMap[K, V]
-  iter.foreach { case (k, v) =>
-val old = map.get(k)
-map.put(k, if (old == null) v else func(old, v))
+  iter.foreach { pair =>
+val old = map.get(pair._1)
--- End diff --

Also, on removing the calls to `_1`: these values should be in cache, so the 
accesses will be really fast. I'll hold off on these for now, but I'm happy to 
make the change if y'all want.
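
For reference, the two shapes under discussion look roughly like this 
(signatures simplified; `func` stands in for the user-supplied reduce 
function):

    import java.util.{HashMap => JHashMap}

    // Pattern-matching form, as in the pre-patch code.
    def reduceWithMatch[K, V](iter: Iterator[(K, V)], func: (V, V) => V): JHashMap[K, V] = {
      val map = new JHashMap[K, V]
      iter.foreach { case (k, v) =>
        val old = map.get(k)
        map.put(k, if (old == null) v else func(old, v))
      }
      map
    }

    // Positional-access form, as in the patch: the same logic, but reading the
    // key and value through pair._1 and pair._2 instead of destructuring.
    def reduceWithFields[K, V](iter: Iterator[(K, V)], func: (V, V) => V): JHashMap[K, V] = {
      val map = new JHashMap[K, V]
      iter.foreach { pair =>
        val old = map.get(pair._1)
        map.put(pair._1, if (old == null) pair._2 else func(old, pair._2))
      }
      map
    }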




[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15043849
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException("Default partitioner cannot partition array 
keys.")
 }
 val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), 
partitioner)
-cg.mapValues { case Seq(vs, w1s, w2s, w3s) =>
-  (vs.asInstanceOf[Seq[V]],
-w1s.asInstanceOf[Seq[W1]],
-w2s.asInstanceOf[Seq[W2]],
-w3s.asInstanceOf[Seq[W3]])
-}
+cg.mapValues { seq  => seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], 
Seq[W3])] }
--- End diff --

Ah, right.  I'll leave these as they are for now.




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1450#issuecomment-49266828
  
QA tests have started for PR 1450. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16770/consoleFull




[GitHub] spark pull request: [branch-0.9] Update CHANGES.txt

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1459#issuecomment-49266193
  
QA tests have started for PR 1459. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16769/consoleFull




[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

2014-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1450#discussion_r15043414
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException("reduceByKeyLocally() does not support 
array keys")
 }
 
-def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] 
= {
+val reducePartition = (iter: Iterator[(K, V)]) => {
--- End diff --

this is fixed




[GitHub] spark pull request: update CHANGES.txt

2014-07-17 Thread mengxr
GitHub user mengxr opened a pull request:

https://github.com/apache/spark/pull/1459

update CHANGES.txt



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mengxr/spark v0.9.2-rc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1459.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1459


commit 6fa5a656281dd5df5cf7f72660efd89fe7b8ec8d
Author: Xiangrui Meng 
Date:   2014-07-17T06:59:57Z

update CHANGES.txt





