[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-25 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-50119786 Awesome! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1460

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-50110575 Thanks Davies. I've merged this in.

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-50102382 QA results for PR 1460: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class AutoSerializer(Framed

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-50101034 QA results for PR 1460: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class AutoSerializer(Framed

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-50099568 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17153/consoleFull

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-50098449 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17152/consoleFull

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-50072351 BTW here's a patch that adds the GC calls I talked about above: https://gist.github.com/mateiz/297b8618ed033e7c8005

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-50072248 Hey Davies, I tried this out a bit and saw two issues / areas for improvement: 1) Since the ExternalMerger is used in both map tasks and reduce tasks, one problem

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49980302 QA results for PR 1460: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class AutoSerializer(Framed

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49976553 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17110/consoleFull

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15333099 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,436 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49969141 QA results for PR 1460: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class AutoSerializer(Framed

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49967247 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17094/consoleFull

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15325703 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,433 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15325695 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,433 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49930734 QA results for PR 1460: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class AutoSerializer(Framed

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49928744 Ah NM. Jenkins, test this please.

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49928696 Looks like the latest tested code has an error in the test suite: ``` Running PySpark tests. Output is in python/unit-tests.log. Running test: pyspark/rdd.py

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49928790 Ah never mind.

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49927437 The last commit has fixed the tests, should run it again.

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49917693 QA results for PR 1460: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class AutoSerializer(Framed

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49917537 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17053/consoleFull

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15306941 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15304302 --- Diff: python/pyspark/rdd.py --- @@ -1207,20 +1225,49 @@ def partitionBy(self, numPartitions, partitionFunc=portable_hash): if numPartitions is

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15303331 --- Diff: python/pyspark/rdd.py --- @@ -1265,26 +1312,28 @@ def combineByKey(self, createCombiner, mergeValue, mergeCombiners, if numPartitions is

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15303224 --- Diff: python/pyspark/rdd.py --- @@ -1207,20 +1225,49 @@ def partitionBy(self, numPartitions, partitionFunc=portable_hash): if numPartitions is

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15303079 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,432 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15302973 --- Diff: python/pyspark/rdd.py --- @@ -1207,20 +1225,49 @@ def partitionBy(self, numPartitions, partitionFunc=portable_hash): if numPartitions is

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49904887 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17043/consoleFull

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15300603 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15300459 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15300412 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15300037 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,416 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15299960 --- Diff: python/pyspark/rdd.py --- @@ -1209,18 +1227,44 @@ def partitionBy(self, numPartitions, partitionFunc=portable_hash): # Transferrin

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15299925 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,416 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mattf
Github user mattf commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15286886 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,416 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor lic

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mattf
Github user mattf commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15286571 --- Diff: python/pyspark/rdd.py --- @@ -1265,26 +1309,26 @@ def combineByKey(self, createCombiner, mergeValue, mergeCombiners, if numPartitions is

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mattf
Github user mattf commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15286548 --- Diff: python/pyspark/rdd.py --- @@ -1209,18 +1227,44 @@ def partitionBy(self, numPartitions, partitionFunc=portable_hash): # Transferring

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15274897 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,416 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15274877 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,416 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15274766 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15274753 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15274557 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15274517 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15271462 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15271374 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15271358 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15271189 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,416 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15268757 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,416 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15268748 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15268727 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15268733 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15268717 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15267225 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15267205 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15266964 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15266679 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15266539 --- Diff: python/pyspark/tests.py --- @@ -47,6 +48,64 @@ SPARK_HOME = os.environ["SPARK_HOME"] +class TestMerger(unittest.TestCase): +

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15266513 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15266361 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15266298 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15264650 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15264616 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15264608 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15263824 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15263760 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15264487 --- Diff: python/pyspark/tests.py --- @@ -47,6 +48,64 @@ SPARK_HOME = os.environ["SPARK_HOME"] +class TestMerger(unittest.TestCase): +

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15264473 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49816795 QA results for PR 1460: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class AutoSerializer(Framed

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15264236 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15264180 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15263661 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15263605 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15263569 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,378 @@ +# --- End diff -- Since this is a new internal file, also add it to the "exclude" se

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49814922 QA results for PR 1460: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class AutoSerializer(Framed

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49808272 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16993/consoleFull

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49805468 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16991/consoleFull

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49788328 QA results for PR 1460: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class AutoSerializer(Framed

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49774956 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16975/consoleFull

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49718092 QA results for PR 1460: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class AutoSerializer(Framed

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49709494 QA tests have started for PR 1460. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16957/consoleFull

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15211652 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15211390 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15211313 --- Diff: python/pyspark/serializers.py --- @@ -297,6 +297,33 @@ class MarshalSerializer(FramedSerializer): loads = marshal.loads +clas

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15211142 --- Diff: python/pyspark/rdd.py --- @@ -1247,15 +1262,16 @@ def combineLocally(iterator): return combiners.iteritems() locally_com

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15211059 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala --- @@ -57,7 +57,9 @@ private[spark] class PythonRDD[T: ClassTag]( override de

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15208097 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15208070 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15208037 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala --- @@ -57,7 +57,9 @@ private[spark] class PythonRDD[T: ClassTag]( override de

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15208014 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15207983 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15207936 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15207914 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15207837 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15207812 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15207775 --- Diff: python/pyspark/shuffle.py --- @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor li

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15207721 --- Diff: python/pyspark/rdd.py --- @@ -1247,15 +1262,16 @@ def combineLocally(iterator): return combiners.iteritems() locally_com

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15207731 --- Diff: python/pyspark/rdd.py --- @@ -1247,15 +1262,16 @@ def combineLocally(iterator): return combiners.iteritems() locally_com

[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1460#discussion_r15207683 --- Diff: python/pyspark/tests.py --- @@ -47,6 +48,40 @@ SPARK_HOME = os.environ["SPARK_HOME"] +class TestMerger(unittest.TestCase): +
