[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50225209
  
QA results for PR 1338:
- This patch PASSES unit tests.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17216/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50225130
  
QA tests have started for PR 1338. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17218/consoleFull


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-25 Thread kanzhang
Github user kanzhang commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50225107
  
Rebased just now.


---


[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1600#issuecomment-50224608
  
QA tests have started for PR 1600. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17217/consoleFull


---


[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-25 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1600#issuecomment-50224519
  
Jenkins, retest this please.


---


[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-25 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1600#issuecomment-50224484
  
@marmbrus The build failure was caused by PySpark; please help re-test this. Thanks!


---


[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

2014-07-25 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1551#discussion_r15432269
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
   throw new SparkException("Unexpected Tuple2 element type " + 
pair._1.getClass)
   }
 case other =>
-  throw new SparkException("Unexpected element type " + 
first.getClass)
+  if (other == null) {
+dataOut.writeInt(SpecialLengths.NULL)
--- End diff --

It's the header of a var-length field; it's better to keep the header at a fixed length, or you will need to deal with a special var-length encoding.
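For illustration, here is a minimal Python sketch of the framing convention being discussed (the marker value and helper names are assumptions for this sketch, not Spark's actual `SpecialLengths` constants): every record gets a fixed 4-byte big-endian header, and a special negative value in that header encodes null, so the reader never needs a variable-length escape scheme.

```python
import struct

NULL_MARKER = -5  # hypothetical marker; Spark defines its own SpecialLengths values

def write_record(buf, payload):
    """Append one record: a fixed 4-byte big-endian length header, then the bytes.

    None is encoded as the NULL marker with no payload, so the header
    always stays exactly 4 bytes.
    """
    if payload is None:
        buf += struct.pack(">i", NULL_MARKER)
    else:
        buf += struct.pack(">i", len(payload)) + payload
    return buf

def read_record(buf, pos):
    """Read one record starting at pos; return (value, new_pos)."""
    (header,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    if header == NULL_MARKER:
        return None, pos
    return bytes(buf[pos:pos + header]), pos + header
```

Because the header width never changes, a reader can always consume exactly four bytes before deciding how to interpret what follows.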


---


[GitHub] spark pull request: replace println to log4j

2014-07-25 Thread fireflyc
Github user fireflyc commented on the pull request:

https://github.com/apache/spark/pull/1372#issuecomment-50223879
  
My account is fireflyc, please assign the issue to me.


---


[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1600#issuecomment-50223630
  
QA results for PR 1600:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class SparkSQLOperationManager(hiveContext: HiveContext) extends OperationManager with Logging {
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17215/consoleFull


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-25 Thread kanzhang
Github user kanzhang commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50223574
  
Updated the patch to allow users to specify batch size when reading in a 
sequence file. When batch size is 1, the pickled data is unbatched (and custom 
Writables are not cloned). Default batch size is set to 10. Added a few tests 
for unbatched cases. Let me know if you have any further comments. Thanks, everyone, for the review and suggestions.
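The batching idea described above can be sketched in plain Python (using `pickle` directly; the function names here are illustrative, not PySpark's actual serializer classes): records are grouped into lists of `batch_size` before pickling, so a larger batch amortizes pickling overhead, while a batch size of 1 pickles each record on its own.

```python
import pickle

def batched_dumps(records, batch_size=10):
    """Pickle records in groups of batch_size; batch_size=1 is the
    'unbatched' case where each blob holds a single record."""
    out, batch = [], []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            out.append(pickle.dumps(batch))
            batch = []
    if batch:  # flush the final, possibly short, batch
        out.append(pickle.dumps(batch))
    return out

def batched_loads(blobs):
    """Flatten pickled batches back into the original record stream."""
    return [rec for blob in blobs for rec in pickle.loads(blob)]
```

Round-tripping through these two functions preserves the record order regardless of the batch size chosen.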


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50223494
  
QA tests have started for PR 1338. This patch DID NOT merge cleanly! 
View progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17216/consoleFull


---


[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

2014-07-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1551#discussion_r15432123
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
   throw new SparkException("Unexpected Tuple2 element type " + 
pair._1.getClass)
   }
 case other =>
-  throw new SparkException("Unexpected element type " + 
first.getClass)
+  if (other == null) {
+dataOut.writeInt(SpecialLengths.NULL)
--- End diff --

maybe it doesn't matter much here, but would it make sense to write a byte 
instead of an int?


---


[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-25 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1399#issuecomment-50222515
  
Opened #1600 to replace this PR. Hope Mr. Jenkins accepts it gently...


---


[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1600#issuecomment-50222427
  
QA tests have started for PR 1600. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17215/consoleFull


---


[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-25 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/1600

[SPARK-2410][SQL] Merging Hive Thrift/JDBC server

(This is a replacement of #1399, trying to fix potential 
`HiveThriftServer2` port collision between parallel builds. Please refer to 
[these 
comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for 
details.)

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Merging the Hive Thrift/JDBC server from 
[branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

Thanks @chenghao-intel for his initial contribution of the Spark SQL CLI.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark jdbc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1600.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1600


commit 2c4c5394d28c5724742edc9877465a1bd71d6f24
Author: Cheng Lian 
Date:   2014-07-14T05:22:25Z

Cherry picked the Hive Thrift server

commit 61f39f47b470a63e7528d1ef61faf02bac50b502
Author: Cheng Lian 
Date:   2014-07-19T06:55:20Z

Starts Hive Thrift server via spark-submit

commit a5310d1f6f7ed3c353a25bdba3d055cc5128fe45
Author: Cheng Lian 
Date:   2014-07-20T12:28:32Z

Make HiveThriftServer2 play well with spark-submit

commit 3ad4e75322b9f95818980f356fcd8cca2fb0a8eb
Author: Cheng Lian 
Date:   2014-07-20T14:10:55Z

Starts spark-sql shell with spark-submit

commit f975d2230cc4d231923ca6fc8a95f9b5a6062242
Author: Cheng Lian 
Date:   2014-07-20T14:11:21Z

Updated docs for Hive compatibility and Shark migration guide draft

commit b8905ba78605c41177af59766122c705a9a28868
Author: Cheng Lian 
Date:   2014-07-20T14:39:47Z

Fixed minor issues in spark-sql and start-thriftserver.sh

commit e214aabb78ca9adcf511388647c431f3a41972a4
Author: Cheng Lian 
Date:   2014-07-20T15:24:08Z

Added missing license headers

commit 40bafef734a95634063ab8a934a8b5e8e622f6f7
Author: Cheng Lian 
Date:   2014-07-20T15:49:05Z

Fixed more license header issues

commit 7755062242e03efdf40236ca6764df696cf0f0d9
Author: Cheng Lian 
Date:   2014-07-21T03:16:46Z

Adapts test suites to spark-submit settings

commit 061880f2bf2eb63c3c7513e2aa3afa0e4c35a4c2
Author: Cheng Lian 
Date:   2014-07-22T08:54:20Z

Addressed all comments by @pwendell

commit cfcf4611a896fd4be63d80eef977c8dcb3d6b89f
Author: Cheng Lian 
Date:   2014-07-22T09:08:07Z

Updated documents and build scripts for the newly added hive-thriftserver 
profile

commit 9cc0f0697ab12bbfb82f94bdec65a639e7bb4b37
Author: Cheng Lian 
Date:   2014-07-22T11:35:50Z

Starts beeline with spark-submit

commit 7db82a1373d014abc45416458d14a75dd722cf6f
Author: Cheng Lian 
Date:   2014-07-22T11:36:06Z

Fixed spark-submit application options handling logic

Any options in the application option list with the same option name
that SparkSubmitArguments recognizes (e.g., --help) are stolen by
SparkSubmit instead of passed to the application.

commit 1083e9d30bd0d73a36128682476cc2bb3ee83b2c
Author: Cheng Lian 
Date:   2014-07-22T14:45:49Z

Fixed failed test suites

commit 199e3fb78ae6e3ee064904b07009ad95e870871f
Author: Cheng Lian 
Date:   2014-07-22T23:19:49Z

Disabled MIMA for hive-thriftserver

commit fe0af31f28dab8b7dd64e687f28ea64d918217e3
Author: Cheng Lian 
Date:   2014-07-23T12:05:21Z

Reordered spark-submit options in spark-shell[.cmd]

All options behind primary resource are (and should be) recognized as
application options now.

commit 21c6cf48b1e3fafb9854d893c8f249e36c97fdc3
Author: Cheng Lian 
Date:   2014-07-23T12:11:52Z

Updated Spark SQL programming guide docs

commit 090beea8831f13ba67ea5ef02b2a21a2d5b81276
Author: Cheng Lian 
Date:   2014-07-25T01:34:07Z

Revert changes related to SPARK-2678, decided to move them to another PR

commit ac4618b74bb3289e910560fc9cdddcdbec807bb5
Author: Cheng Lian 
Date:   2014-07-26T03:31:52Z

Uses random port for HiveThriftServer2 to avoid collision with parallel 
builds




---


[GitHub] spark pull request: [SPARK-2700] [SQL] Hidden files (such as .impa...

2014-07-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1599#issuecomment-50221641
  
Can one of the admins verify this patch?


---


[GitHub] spark pull request: [SPARK-2700] [SQL] Hidden files (such as .impa...

2014-07-25 Thread chutium
Github user chutium closed the pull request at:

https://github.com/apache/spark/pull/1599


---


[GitHub] spark pull request: [SPARK-2700] [SQL] Hidden files (such as .impa...

2014-07-25 Thread chutium
GitHub user chutium reopened a pull request:

https://github.com/apache/spark/pull/1599

[SPARK-2700] [SQL] Hidden files (such as .impala_insert_staging) should be 
filtered out by sqlContext.parquetFile

check if the path name starts with '.'
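A minimal sketch of the filtering rule the description refers to (in plain Python path terms; the actual change lives in Spark's Scala Parquet code): a file is hidden when its final path component starts with a dot.

```python
import os

def is_hidden(path):
    """Return True for paths whose final component starts with '.',
    e.g. Impala's .impala_insert_staging directory."""
    return os.path.basename(path.rstrip("/")).startswith(".")

def visible_files(paths):
    """Keep only the paths a table scan should actually read."""
    return [p for p in paths if not is_hidden(p)]
```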

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chutium/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1599.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1599


commit 1de83a7560f85cd347bca6dde256d551da63a144
Author: chutium 
Date:   2014-07-16T11:44:09Z

SPARK-2407: Added Parse of SQL SUBSTR()

commit 88cb37d4b628b39c3a619d607bc7a1e756ed7ec0
Author: chutium 
Date:   2014-07-17T07:55:07Z

Merge https://github.com/apache/spark

commit 094f773221e42a7d53c52ce3637a0a833d09fa84
Author: chutium 
Date:   2014-07-17T08:30:47Z

Merge https://github.com/apache/spark

commit c8701724495078ae6dc67a1d1edb8b2157dc0733
Author: chutium 
Date:   2014-07-17T20:58:10Z

Merge https://github.com/apache/spark

commit 06e933b262ee64544d550fde3b9ba6d130de9a64
Author: chutium 
Date:   2014-07-17T22:24:58Z

Merge https://github.com/apache/spark

commit 9a60ccf4938dc921e143c27276c19bda59180e4b
Author: chutium 
Date:   2014-07-17T23:24:16Z

SPARK-2407: Added Parser of SQL SUBSTR() #1442

commit b49cc8a5bb73ff25c289a38cbaedfaf7edfefc5b
Author: chutium 
Date:   2014-07-18T08:33:25Z

SPARK-2407: Added Parser of SQL SUBSTRING() #1442

commit b32a1d0e6bed6a16dd3a936b232a880349b2ce95
Author: chutium 
Date:   2014-07-26T02:34:11Z

Merge https://github.com/apache/spark

commit 52905c68b9fe8f7a9b9175a80f18710cc6fb
Author: chutium 
Date:   2014-07-26T02:53:44Z

SPARK-2700 Hidden files (such as .impala_insert_staging) should be filtered 
out by sqlContext.parquetFile




---


[GitHub] spark pull request: [SPARK-2696] Reduce default value of spark.ser...

2014-07-25 Thread falaki
Github user falaki commented on the pull request:

https://github.com/apache/spark/pull/1595#issuecomment-50221558
  
It is already done :)


---


[GitHub] spark pull request: [SPARK-2700] [SQL] Hidden files (such as .impa...

2014-07-25 Thread chutium
GitHub user chutium opened a pull request:

https://github.com/apache/spark/pull/1599

[SPARK-2700] [SQL] Hidden files (such as .impala_insert_staging) should be 
filtered out by sqlContext.parquetFile

check if the path name starts with '.'

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chutium/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1599.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1599


commit 1de83a7560f85cd347bca6dde256d551da63a144
Author: chutium 
Date:   2014-07-16T11:44:09Z

SPARK-2407: Added Parse of SQL SUBSTR()

commit 88cb37d4b628b39c3a619d607bc7a1e756ed7ec0
Author: chutium 
Date:   2014-07-17T07:55:07Z

Merge https://github.com/apache/spark

commit 094f773221e42a7d53c52ce3637a0a833d09fa84
Author: chutium 
Date:   2014-07-17T08:30:47Z

Merge https://github.com/apache/spark

commit c8701724495078ae6dc67a1d1edb8b2157dc0733
Author: chutium 
Date:   2014-07-17T20:58:10Z

Merge https://github.com/apache/spark

commit 06e933b262ee64544d550fde3b9ba6d130de9a64
Author: chutium 
Date:   2014-07-17T22:24:58Z

Merge https://github.com/apache/spark

commit 9a60ccf4938dc921e143c27276c19bda59180e4b
Author: chutium 
Date:   2014-07-17T23:24:16Z

SPARK-2407: Added Parser of SQL SUBSTR() #1442

commit b49cc8a5bb73ff25c289a38cbaedfaf7edfefc5b
Author: chutium 
Date:   2014-07-18T08:33:25Z

SPARK-2407: Added Parser of SQL SUBSTRING() #1442

commit b32a1d0e6bed6a16dd3a936b232a880349b2ce95
Author: chutium 
Date:   2014-07-26T02:34:11Z

Merge https://github.com/apache/spark

commit 52905c68b9fe8f7a9b9175a80f18710cc6fb
Author: chutium 
Date:   2014-07-26T02:53:44Z

SPARK-2700 Hidden files (such as .impala_insert_staging) should be filtered 
out by sqlContext.parquetFile




---


[GitHub] spark pull request: [WIP] [SPARK-2010] [PySpark] support nested st...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1598#issuecomment-50221527
  
QA results for PR 1598:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class List(list):
  class Dict(dict):
  class Row(tuple):
  class Row(tuple):
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17211/consoleFull


---


[GitHub] spark pull request: [SPARK-2696] Reduce default value of spark.ser...

2014-07-25 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1595#issuecomment-50221355
  
Please update this in docs/configuration.md as well


---


[GitHub] spark pull request: [WIP] [SPARK-2010] [PySpark] support nested st...

2014-07-25 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1598#issuecomment-50221066
  
Can you add [SQL] to these PRs as well?


---


[GitHub] spark pull request: SPARK-2680: Lower spark.shuffle.memoryFraction...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1593#issuecomment-50220983
  
QA results for PR 1593:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17214/consoleFull


---


[GitHub] spark pull request: [SPARK-2659][SQL] Fix division semantics for h...

2014-07-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1557


---


[GitHub] spark pull request: [SPARK-2659][SQL] Fix division semantics for h...

2014-07-25 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1557#issuecomment-50220680
  
Thanks for reviewing! Merged into master.


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50220618
  
QA results for PR 1338:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17213/consoleFull


---


[GitHub] spark pull request: [SPARK-2652] [PySpark] Turning some default co...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1568#issuecomment-50220491
  
QA results for PR 1568:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17212/consoleFull


---


[GitHub] spark pull request: SPARK-2680: Lower spark.shuffle.memoryFraction...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1593#issuecomment-50219968
  
QA tests have started for PR 1593. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17214/consoleFull


---


[GitHub] spark pull request: Part of [SPARK-2456] Removed some HashMaps fro...

2014-07-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1561


---


[GitHub] spark pull request: Part of [SPARK-2456] Removed some HashMaps fro...

2014-07-25 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1561#issuecomment-50219861
  
Merging this in master. Thanks for reviewing. 


---


[GitHub] spark pull request: SPARK-2680: Lower spark.shuffle.memoryFraction...

2014-07-25 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1593#issuecomment-50219829
  
test this please!


---


[GitHub] spark pull request: Part of [SPARK-2456] Removed some HashMaps fro...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1561#issuecomment-50219833
  
QA results for PR 1561:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17210/consoleFull


---


[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-25 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1399#issuecomment-50219827
  
Ah, thanks, then I think this might be the problem.


---


[GitHub] spark pull request: [SPARK-1458] [PySpark] Expose sc.version in Ja...

2014-07-25 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1596#issuecomment-50219820
  
LGTM!


---


[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-25 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1399#issuecomment-50219743
  
Is there a fixed port or something that we use? On Jenkins we run multiple tests at the same time, so if you use a single fixed port, tests can fail.




---


[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-25 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1399#issuecomment-50219674
  
@marmbrus Sorry for the trouble. It's weird that the build suddenly breaks 
continuously after several healthy builds. I can't reproduce the failure locally 
right now, so I'll merge the PRs from those failed builds into this branch to see 
what happens. One suspicious cause that jumps to mind is that `HiveThriftServer2Suite` 
uses a hard coded port number (1) and may collide with other 
parallel builds. I'm not familiar with our Jenkins setup; is that possible?
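The collision the two comments above are worried about has a standard remedy: let the OS pick an ephemeral port by binding to port 0 instead of hard-coding one. A minimal sketch in Python (the suite itself is Scala, and the helper name here is hypothetical):

```python
import socket

def find_free_port():
    """Ask the OS for an ephemeral port instead of hard-coding one.

    Binding to port 0 lets the kernel choose any currently free port,
    so two test suites running in parallel on the same host cannot
    collide on a fixed number.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = find_free_port()
```

The suite would then pass `port` to the server under test rather than reading a fixed constant from its config.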


---


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-25 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-50219496
  
@mateiz There were a couple of correctness issues with the previous code 
that my latest commits have fixed (see the commit messages for more detail). After 
fixing these I traced through each step by hand, verified that the behavior is as 
expected, and wrote new tests to confirm it. If you have time, I would encourage 
you (and others) to step through the code as well (especially `unrollSafely` and 
`ensureFreeSpace`) to make sure the amount we're requesting / ensuring free space 
for makes sense to you.

Let me know if you find anything. Otherwise this is ready from my side.
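The core idea behind an `unrollSafely`-style routine can be sketched as: materialize an iterator incrementally, periodically checking a running size estimate against a memory budget and bailing out before an OOM. A hedged Python sketch, not Spark's actual implementation (the names, the per-item estimator, and the check interval are all illustrative):

```python
import sys

def unroll_safely(iterator, max_bytes, check_every=16):
    """Materialize `iterator` into a list, but stop early if the
    running size estimate exceeds `max_bytes`.

    Returns (values, fully_unrolled). When fully_unrolled is False,
    a caller would spill to disk rather than keep filling memory.
    """
    values, estimate = [], 0
    for i, item in enumerate(iterator):
        values.append(item)
        estimate += sys.getsizeof(item)      # crude per-item size estimate
        if (i + 1) % check_every == 0 and estimate > max_bytes:
            return values, False             # budget exceeded: bail out early
    return values, True

vals, fully_unrolled = unroll_safely(iter(range(1000)), max_bytes=256)
```

The periodic (rather than per-item) check mirrors the trade-off discussed in the PR: checking too often is slow, checking too rarely risks overshooting the budget.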


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50219459
  
QA tests have started for PR 1338. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17213/consoleFull


---


[GitHub] spark pull request: [SPARK-2279] Added emptyRDD method to Java API

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1597#issuecomment-50219354
  
QA results for PR 1597:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17209/consoleFull


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-25 Thread kanzhang
Github user kanzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/1338#discussion_r15431198
  
--- Diff: python/pyspark/rdd.py ---
@@ -964,6 +964,106 @@ def first(self):
 """
 return self.take(1)[0]
 
+def saveAsNewAPIHadoopDataset(self, conf, keyConverter=None, valueConverter=None):
+    """
+    Output a Python RDD of key-value pairs (of form C{RDD[(K, V)]}) to any Hadoop file
+    system, using the new Hadoop OutputFormat API (mapreduce package). Keys/values are
+    converted for output using either user specified converters or, by default,
+    L{org.apache.spark.api.python.JavaToWritableConverter}.
+
+    @param conf: Hadoop job configuration, passed in as a dict
+    @param keyConverter: (None by default)
+    @param valueConverter: (None by default)
+    """
+    jconf = self.ctx._dictToJavaMap(conf)
+    reserialized = self._reserialize(BatchedSerializer(PickleSerializer(), 10))
--- End diff --

@JoshRosen @MLnick the batch size here only affects transient data when 
writing, and the re-serialization shouldn't be done if the data is already in 
pickle format (batch serialized or not). I'm uploading a patch to that effect. 
Since the batch size has no effect on the data persisted in files, I'm not 
exposing it to users. See if you have any further comments, and if you have a 
better suggestion for the default size of 10, let me know.

@mateiz what you suggested is exposing the batch size for reading. I'll 
work on that next.
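Why a `BatchedSerializer` helps at all: pickling records in groups amortizes the per-call protocol overhead and lets one `dumps` cover many records. A small self-contained illustration (batch size 10 mirrors the default discussed above, but the `batched` helper itself is hypothetical, not PySpark's code):

```python
import pickle

def batched(iterable, batch_size=10):
    """Yield lists of up to batch_size items from iterable."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

records = [("key%d" % i, i) for i in range(25)]
# One pickle call per batch of 10 instead of one per record;
# 25 records become 3 serialized chunks (10 + 10 + 5).
chunks = [pickle.dumps(b) for b in batched(records, 10)]
```

Round-tripping `chunks` through `pickle.loads` recovers the original records in order, which is why (as the comment notes) the batch size only affects transient data, not what a reader ultimately sees.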


---


[GitHub] spark pull request: [SPARK-2652] [PySpark] Turning some default co...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1568#issuecomment-50219330
  
QA tests have started for PR 1568. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17212/consoleFull


---


[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-25 Thread dorx
Github user dorx closed the pull request at:

https://github.com/apache/spark/pull/1025


---


[GitHub] spark pull request: [WIP] [SPARK-2010] [PySpark] support nested st...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1598#issuecomment-50219202
  
QA tests have started for PR 1598. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17211/consoleFull


---


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-50219182
  
QA results for PR 1165:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  case class Sample(size: Long, numUpdates: Long)
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17208/consoleFull


---


[GitHub] spark pull request: [SPARK-2010] [PySpark] support nested structur...

2014-07-25 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1559#issuecomment-50219167
  
This PR will be split into several parts to make it easier to merge in.


---


[GitHub] spark pull request: [SPARK-2010] [PySpark] support nested structur...

2014-07-25 Thread davies
Github user davies closed the pull request at:

https://github.com/apache/spark/pull/1559


---


[GitHub] spark pull request: [WIP] [SPARK-2010] [PySpark] support nested st...

2014-07-25 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/1598

[WIP] [SPARK-2010] [PySpark] support nested structure in SchemaRDD

Convert each Row in a JavaSchemaRDD into an Array[Any], unpickle them as tuples in 
Python, then convert them into namedtuples when needed.

This lets nested structures be accessed as objects; it also reduces the size of the 
serialized data and improves performance.

PS: The code will be refactored later and more tests will be added.
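The tuple-plus-namedtuple scheme described above can be illustrated in a few lines: rows travel as plain tuples (cheap to pickle), and the schema is applied lazily when field access is wanted. A sketch under assumed names (`Row`, `as_rows`, and the schema are illustrative, not the PR's code):

```python
from collections import namedtuple

# Rows arrive from the JVM as plain tuples; the schema is applied lazily.
raw_rows = [("Alice", 1), ("Bob", 2)]

Row = namedtuple("Row", ["name", "age"])

def as_rows(tuples, row_cls):
    """Wrap plain tuples in a namedtuple so fields read as attributes."""
    return [row_cls(*t) for t in tuples]

rows = as_rows(raw_rows, Row)
# Attribute access works, yet each row is still a tuple underneath,
# which keeps the serialized representation small.
name = rows[0].name
```

Because a namedtuple is a tuple subclass, nothing downstream that expects tuples breaks, which is what makes the lazy conversion safe.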

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark nested

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1598.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1598


commit 644665a2ebae4bc4a49f28152e3af00681affcc8
Author: Davies Liu 
Date:   2014-07-25T23:11:57Z

use tuple and namedtuple for schemardd




---


[GitHub] spark pull request: [SPARK-2260] Fix standalone-cluster mode, whic...

2014-07-25 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1538#discussion_r15430975
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala
 ---
@@ -45,7 +45,7 @@ private[spark] class SparkDeploySchedulerBackend(
   conf.get("spark.driver.host"), conf.get("spark.driver.port"),
   CoarseGrainedSchedulerBackend.ACTOR_NAME)
 val args = Seq(driverUrl, "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")
-val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
+val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions").toSeq
--- End diff --

This actually handles quoted strings, spaces, backslashes, and a 
combination of all of the above (I have tested this). This is because we pass 
these options around as a sequence of strings before using them in commands.

I still need to verify the same for YARN.
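The reason a sequence of strings survives quoting while a flat string does not can be shown with `shlex`, Python's shell-aware tokenizer (a sketch of the principle, not Spark's Scala code):

```python
import shlex

# A flat option string naively split on whitespace breaks quoted values...
flat = '-Dfoo="hello world" -Dbar=1'
naive = flat.split()
# naive is ['-Dfoo="hello', 'world"', '-Dbar=1'] -- the quoted value is torn apart.

# ...while shell-aware splitting, done once at the boundary, yields an
# argument sequence that can be passed around safely with no re-parsing.
args = shlex.split(flat)
# args is ['-Dfoo=hello world', '-Dbar=1']
```

Once the options are a sequence, each element is handed to the child process as one argv entry, so spaces and backslashes inside a value never need re-quoting.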


---


[GitHub] spark pull request: Part of [SPARK-2456] Removed some HashMaps fro...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1561#issuecomment-50218632
  
QA tests have started for PR 1561. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17210/consoleFull


---


[GitHub] spark pull request: [SPARK-2279] Added emptyRDD method to Java API

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1597#issuecomment-50217986
  
QA tests have started for PR 1597. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17209/consoleFull


---


[GitHub] spark pull request: [SPARK-2279] Added emptyRDD method to Java API

2014-07-25 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1597#issuecomment-50217879
  
Jenkins, this is ok to test.


---


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-50217815
  
QA tests have started for PR 1165. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17208/consoleFull


---


[GitHub] spark pull request: [SPARK-2279] Added emptyRDD method to Java API

2014-07-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1597#issuecomment-50217772
  
Can one of the admins verify this patch?


---


[GitHub] spark pull request: [SPARK-2279] Added emptyRDD method to Java API

2014-07-25 Thread bobpaulin
GitHub user bobpaulin opened a pull request:

https://github.com/apache/spark/pull/1597

[SPARK-2279] Added emptyRDD method to Java API

Added emptyRDD method to Java API with tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bobpaulin/spark SPARK-2279

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1597


commit 5ad57c28175cfcf20d45bb25e6a8299d910169ca
Author: bpaulin 
Date:   2014-07-26T00:25:06Z

[SPARK-2279] Added emptyRDD method to Java API




---


[GitHub] spark pull request: [SPARK-2670] FetchFailedException should be th...

2014-07-25 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1578#issuecomment-50217313
  
@pwendell I found this issue when I simulated a disk fault. When shuffle_*_* 
cannot be opened successfully, a FileNotFoundException is thrown from the 
constructor of RandomAccessFile in DiskStore#getBytes.

Yes, I will add test cases later.


---


[GitHub] spark pull request: [SPARK-2670] FetchFailedException should be th...

2014-07-25 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/1578#discussion_r15430260
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala ---
@@ -200,14 +200,21 @@ object BlockFetcherIterator {
   // these all at once because they will just memory-map some files, 
so they won't consume
   // any memory that might exceed our maxBytesInFlight
   for (id <- localBlocksToFetch) {
-getLocalFromDisk(id, serializer) match {
-  case Some(iter) => {
-// Pass 0 as size since it's not in flight
-results.put(new FetchResult(id, 0, () => iter))
-logDebug("Got local block " + id)
+try{
+  getLocalFromDisk(id, serializer) match {
+case Some(iter) => {
+  // Pass 0 as size since it's not in flight
+  results.put(new FetchResult(id, 0, () => iter))
+  logDebug("Got local block " + id)
+}
+case None => {
+  throw new BlockException(id, "Could not get block " + id + " 
from local machine")
+}
   }
-  case None => {
-throw new BlockException(id, "Could not get block " + id + " 
from local machine")
+} catch {
+  case e: Exception => {
+logError(s"Error occurred while fetch local block $id", e)
+results.put(new FetchResult(id, -1, null))
   }
--- End diff --

Actually, getLocalFromDisk never returns None but can throw a BlockException, 
so I think the "case None" block above is useless and we should remove it 
rather than doing results.put.

> Is there any other kind of error that can happen beyond getLocalFromDisk 
returning None?

Yes, a BlockException is thrown from getLocalFromDisk, and a 
FileNotFoundException from DiskStore#getBytes when it fails to fetch 
shuffle_*_* from local disk. 

> Also, the current code seems to forget the exception: it just puts in a 
failed result. Is this intentional, i.e. will get a FetchFailedException later?

It's so we get a FetchFailedException later. If we returned from 
BasicBlockFetcherIterator#getLocalBlocks, we couldn't know whether the rest of 
the blocks can be read successfully or not.






---


[GitHub] spark pull request: [SPARK-1458] [PySpark] Expose sc.version in Ja...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1596#issuecomment-50216933
  
QA results for PR 1596:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17207/consoleFull


---


[GitHub] spark pull request: SPARK-2686 Add Length support to Spark SQL and...

2014-07-25 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50216434
  
It doesn't get to unit tests if the style check fails.
On Jul 25, 2014 3:48 PM, "StephenBoesch" wrote:

> Thanks for the review Michael! I agree with / will apply all of your
> comments and will re-run with sbt scalastyle. Question: from
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17181/consoleFull
> there is a message in the Jenkins output saying that unit tests failed. But
> I cannot find any information on which tests failed. (I had run and re-run
> the sql/core and sql/catalyst tests before submitting the PR and they were
> passing.)


---


[GitHub] spark pull request: [SPARK-2696] Reduce default value of spark.ser...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1595#issuecomment-50216324
  
QA results for PR 1595:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17206/consoleFull


---


[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1520#issuecomment-50215914
  
QA results for PR 1520:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  trait DistributionGenerator extends Pseudorandom with Serializable {
  class UniformGenerator extends DistributionGenerator {
  class StandardNormalGenerator extends DistributionGenerator {
  class PoissonGenerator(val mean: Double) extends DistributionGenerator {
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17205/consoleFull


---


[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1581#issuecomment-50215860
  
QA results for PR 1581:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17204/consoleFull


---


[GitHub] spark pull request: Revert "[SPARK-2410][SQL] Merging Hive Thrift/...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1594#issuecomment-50215029
  
QA results for PR 1594:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17203/consoleFull


---


[GitHub] spark pull request: [SPARK-2671] BlockObjectWriter should create p...

2014-07-25 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1580#issuecomment-50215042
  
Parent directories named spark-local-* were deleted before the shuffle; you can 
see a stack trace like this:

java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:221)
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:59)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:57)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
org.apache.spark.util.collection.AppendOnlyMap$$anon$1.foreach(AppendOnlyMap.scala:159)
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:57)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

As you say, getFile creates parent directories, but I think this path 
doesn't call getFile.
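A defensive fix along the lines this thread discusses is to (re)create the parent directory immediately before opening the file, so an externally deleted scratch directory does not surface as a failed open. A minimal Python sketch of the idea (the path names and helper are illustrative, not Spark's code):

```python
import os
import tempfile

def open_for_append(path):
    """Create any missing parent directories before opening the file,
    so an externally deleted scratch directory does not make the
    open call fail."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return open(path, "ab")

# Simulate a scratch layout whose parent directory does not exist yet.
base = tempfile.mkdtemp()
target = os.path.join(base, "spark-local-x", "shuffle_0_0_0")
with open_for_append(target) as f:
    f.write(b"data")
```

A plain `open(target, "ab")` here would raise FileNotFoundException's Python analogue (FileNotFoundError), which mirrors the stack trace shown above.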


---


[GitHub] spark pull request: [SPARK-1458] [PySpark] Expose sc.version in Ja...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1596#issuecomment-50214974
  
QA tests have started for PR 1596. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17207/consoleFull


---


[GitHub] spark pull request: [SPARK-1458] [PySpark] Expose sc.version in Ja...

2014-07-25 Thread JoshRosen
GitHub user JoshRosen opened a pull request:

https://github.com/apache/spark/pull/1596

[SPARK-1458] [PySpark] Expose sc.version in Java and PySpark



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JoshRosen/spark spark-1458

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1596.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1596


commit fdbb0bf9937551128b345459d9da91830e7be270
Author: Josh Rosen 
Date:   2014-07-25T23:19:10Z

Add SparkContext.version to Python & Java [SPARK-1458]




---


[GitHub] spark pull request: [SPARK-2696] Reduce default value of spark.ser...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1595#issuecomment-50214227
  
QA tests have started for PR 1595. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17206/consoleFull


---


[GitHub] spark pull request: [Core][SPARK-2696] Reduce default value of spa...

2014-07-25 Thread falaki
GitHub user falaki opened a pull request:

https://github.com/apache/spark/pull/1595

[Core][SPARK-2696] Reduce default value of spark.serializer.objectStreamReset

The current default value of spark.serializer.objectStreamReset is 10,000. 
When trying to re-partition (e.g., to 64 partitions) a large file (e.g., 
500MB) containing 1MB records, the serializer will cache 10,000 x 1MB x 64 ~= 
640 GB, which will cause out-of-memory errors.

This patch sets the default to a more reasonable value (100).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/falaki/spark objectStreamReset

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1595.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1595


commit 1aa0df87db69d3c814b827e27673b198acf49edb
Author: Hossein 
Date:   2014-07-25T22:56:06Z

Reduce default value of spark.serializer.objectStreamReset

commit 650a935cdd810fe7bbc43555ad126cb2bebaab92
Author: Hossein 
Date:   2014-07-25T23:05:05Z

Updated documentation




---


[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1520#issuecomment-50213682
  
QA tests have started for PR 1520. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17205/consoleFull


---


[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1581#issuecomment-50213667
  
QA tests have started for PR 1581. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17204/consoleFull




[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-25 Thread dorx
Github user dorx commented on the pull request:

https://github.com/apache/spark/pull/1581#issuecomment-50213453
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-25 Thread dorx
Github user dorx commented on the pull request:

https://github.com/apache/spark/pull/1520#issuecomment-50213475
  
Jenkins, retest this please.




[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1499#issuecomment-50213171
  
QA results for PR 1499:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class ShuffledRDD[K, V, C]
  case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, reduceId: Int)
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17202/consoleFull




[GitHub] spark pull request: SPARK-2686 Add Length support to Spark SQL and...

2014-07-25 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50213067
  
Thanks for the review, Michael! I agree with and will apply all of your 
comments, and will re-run with `sbt scalastyle`. Question: at 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17181/consoleFull
there is a message in the Jenkins output saying that unit tests failed, but I 
cannot find any information on which tests failed. (I had run and re-run the 
sql/core and sql/catalyst tests before submitting the PR and they were passing.)




[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-25 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1399#issuecomment-50212572
  
Hey this was making Jenkins fail so I reverted it.  We should investigate 
and try again.




[GitHub] spark pull request: Revert "[SPARK-2410][SQL] Merging Hive Thrift/...

2014-07-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1594




[GitHub] spark pull request: Revert "[SPARK-2410][SQL] Merging Hive Thrift/...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1594#issuecomment-50212441
  
QA tests have started for PR 1594. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17203/consoleFull




[GitHub] spark pull request: Revert "[SPARK-2410][SQL] Merging Hive Thrift/...

2014-07-25 Thread marmbrus
GitHub user marmbrus opened a pull request:

https://github.com/apache/spark/pull/1594

Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"

This reverts commit 06dc0d2c6b69c5d59b4d194ced2ac85bfe2e05e2.

#1399 is making Jenkins fail.  We should investigate and put this back 
after it passes tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/marmbrus/spark revertJDBC

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1594.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1594


commit 59748da137b8787ffb86a7d03cf54a96f1dbe005
Author: Michael Armbrust 
Date:   2014-07-25T22:35:11Z

Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"

This reverts commit 06dc0d2c6b69c5d59b4d194ced2ac85bfe2e05e2.






[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-25 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50211345
  
Whoops, I was thinking of `destroy` instead of `unpersist`.  Since the 
driver keeps a copy of the broadcast variable, it should always be safe to 
unpersist.
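The distinction above can be sketched with a toy model (not Spark's implementation; class and method names are ours): after unpersist(), executors drop their cached copies but the driver keeps the value, so a later task can re-fetch it; destroy() removes it everywhere, after which the variable is unusable.

```python
# Toy model of driver-retained broadcast state. fetch_on_executor()
# simulates a task reading the broadcast value on a worker.
class ToyBroadcast:
    def __init__(self, value):
        self._driver_value = value      # driver-side copy
        self._executor_cache = value    # simulated executor-side copy
        self._destroyed = False

    def unpersist(self):
        self._executor_cache = None     # safe: the driver copy survives

    def destroy(self):
        self.unpersist()
        self._driver_value = None       # nothing left to re-fetch from
        self._destroyed = True

    def fetch_on_executor(self):
        if self._destroyed:
            raise RuntimeError("broadcast destroyed")
        if self._executor_cache is None:
            # re-fetch from the driver, as after an unpersist
            self._executor_cache = self._driver_value
        return self._executor_cache

b = ToyBroadcast([1, 2, 3])
b.unpersist()
print(b.fetch_on_executor())  # [1, 2, 3] -- still usable after unpersist
```

After destroy(), the same call raises instead of re-fetching, which is why destroy (unlike unpersist) is unsafe while the variable may still be needed.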




[GitHub] spark pull request: [SPARK-1726] [SPARK-2567] Eliminate zombie sta...

2014-07-25 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1566#issuecomment-50211023
  
BTW I've merged this only into 1.1 because the patch didn't apply cleanly 
on 1.0. If you think it's important, we can also add it to 1.0.x, but it 
doesn't seem like that big of a showstopper.




[GitHub] spark pull request: [SPARK-1726] [SPARK-2567] Eliminate zombie sta...

2014-07-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1566




[GitHub] spark pull request: [SPARK-1726] [SPARK-2567] Eliminate zombie sta...

2014-07-25 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1566#issuecomment-50210960
  
Looks good to me too. I've merged this.




[GitHub] spark pull request: SPARK-2680: Lower spark.shuffle.memoryFraction...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1593#issuecomment-50210778
  
QA results for PR 1593:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17200/consoleFull




[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-25 Thread kanzhang
Github user kanzhang commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50210598
  
> Even if we cached the results of the read, I think we'd still need to 
keep a copy of the configuration in case any of those cached partitions are 
lost and we need to recompute from scratch.

This is where I have a gap. I imagined the error-recovery code would have 
kept a copy of the conf on the driver and re-broadcast it when re-computation 
happens, whereas the broadcast copies on the executors could have been cleaned 
up after the job has successfully run the first time.




[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

2014-07-25 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/1471#issuecomment-50210528
  
Yeah, I will close this PR. Should I just modify SPARK-2563 for the socket 
re-opening issue, or do you think a new JIRA is better?




[GitHub] spark pull request: Part of [SPARK-2456] Removed some HashMaps fro...

2014-07-25 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1561#issuecomment-50210497
  
I'd be okay deferring the subclass thing till later too, the benefit isn't 
huge right now.




[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1499#issuecomment-50210371
  
QA tests have started for PR 1499. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17202/consoleFull




[GitHub] spark pull request: [SPARK-2652] [PySpark] Turning some default co...

2014-07-25 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1568#discussion_r15426978
  
--- Diff: python/pyspark/context.py ---
@@ -112,6 +121,8 @@ def __init__(self, master=None, appName=None, 
sparkHome=None, pyFiles=None,
 if environment:
 for key, value in environment.iteritems():
 self._conf.setExecutorEnv(key, value)
+for key, value in DEFAULT_CONFIGS.items():
+self._conf.setIfMissing(key, value)
--- End diff --

@davies, you also need to remove the 
self._conf.setIfMissing("spark.rdd.compress", "true") line above. Otherwise it 
looks good.
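The setIfMissing semantics behind that comment can be modeled with a plain dict (the keys are real Spark config names; the dict-based sketch is ours): defaults are applied only where the user has not already set a value, so a separate hard-coded setIfMissing call for the same key is redundant once it is in DEFAULT_CONFIGS.

```python
# setIfMissing analogue: dict.setdefault writes only absent keys, so an
# explicit user setting always wins over an entry in DEFAULT_CONFIGS.
user_conf = {"spark.rdd.compress": "false"}          # user's explicit choice
DEFAULT_CONFIGS = {
    "spark.rdd.compress": "true",
    "spark.serializer.objectStreamReset": "100",
}

for key, value in DEFAULT_CONFIGS.items():
    user_conf.setdefault(key, value)                 # skip keys already set

print(user_conf["spark.rdd.compress"])               # false: user wins
print(user_conf["spark.serializer.objectStreamReset"])  # 100: default applied
```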




[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

2014-07-25 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1471#issuecomment-50209890
  
If this PR doesn't help by the way, make sure to close it too so it doesn't 
stay in the list.




[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

2014-07-25 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1471#issuecomment-50209854
  
I see, got it. It sounds like we should open a JIRA for creating a new 
socket then. It's pretty strange that you can't reuse the same one in Java, but 
I guess that's how it works.
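The Java limitation discussed here has an analogue in Python's socket API, which makes the "create a new socket per retry" point concrete: once a socket is closed it cannot be reconnected, so a retry must construct a fresh socket object.

```python
# A closed socket cannot be reconnected; retries need a new socket.
import socket

server = socket.socket()
server.bind(("127.0.0.1", 0))   # OS-assigned port on loopback
server.listen(5)
addr = server.getsockname()

client = socket.socket()
client.connect(addr)
client.close()

try:
    client.connect(addr)        # reusing the closed socket fails
except OSError as exc:
    print("reconnect on closed socket failed:", exc)

retry = socket.socket()         # a fresh socket is required for the retry
retry.connect(addr)
retry.close()
server.close()
```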




[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-07-25 Thread cmccabe
Github user cmccabe commented on the pull request:

https://github.com/apache/spark/pull/1486#issuecomment-50209723
  
I took a look at rewriting this to avoid the reflection and use 
conditional compilation instead.  Unfortunately, I think it just increased the 
complexity.  Since some of the APIs we need here are only in Hadoop 2.5 and 
higher, we'd have to have a Maven profile like "hadoop2.5_and_higher", which 
just seems awkward given that we already have -Phadoop-2.4, -Phadoop-2.3, etc.  So 
I think reflection might be a necessary evil for right now.
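The reflection approach described above can be sketched in miniature: probe for an API at runtime and degrade gracefully when it is absent, instead of maintaining per-version build profiles. Here getattr plays the role of Java reflection, and the stand-in classes and the getCachedHosts name are illustrative assumptions, not Spark code.

```python
# Runtime feature detection: call the newer API when present, fall back
# cleanly on older versions that lack it.
def cached_hosts(block_location):
    """Use getCachedHosts() only if this version of the class provides it."""
    probe = getattr(block_location, "getCachedHosts", None)
    if callable(probe):
        return probe()        # newer API available
    return []                 # older version: no cache locality info

class OldBlockLocation:       # stand-in for a pre-2.5-style class
    pass

class NewBlockLocation:       # stand-in for a 2.5+-style class
    def getCachedHosts(self):
        return ["host1"]

print(cached_hosts(OldBlockLocation()))  # []
print(cached_hosts(NewBlockLocation()))  # ['host1']
```

The cost, as in the Java version, is that the compiler cannot check the probed call; the benefit is a single build that works across versions.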




[GitHub] spark pull request: Part of [SPARK-2456] Removed some HashMaps fro...

2014-07-25 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1561#issuecomment-50209623
  
Jenkins, retest this please.




[GitHub] spark pull request: Part of [SPARK-2456] Removed some HashMaps fro...

2014-07-25 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1561#issuecomment-50209636
  
hive-thriftserver test failed.




[GitHub] spark pull request: [SPARK-2568] RangePartitioner should run only ...

2014-07-25 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1562#discussion_r15426653
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -105,24 +108,91 @@ class RangePartitioner[K : Ordering : ClassTag, V](
 
   private var ordering = implicitly[Ordering[K]]
 
+  @transient private[spark] var singlePass = true // for unit tests
+
   // An array of upper bounds for the first (partitions - 1) partitions
   private var rangeBounds: Array[K] = {
 if (partitions == 1) {
-  Array()
+  Array.empty
 } else {
-  val rddSize = rdd.count()
-  val maxSampleSize = partitions * 20.0
-  val frac = math.min(maxSampleSize / math.max(rddSize, 1), 1.0)
-  val rddSample = rdd.sample(false, frac, 1).map(_._1).collect().sorted
-  if (rddSample.length == 0) {
-Array()
+  // This is the sample size we need to have roughly balanced output 
partitions.
+  val sampleSize = 20.0 * partitions
+  // Assume the input partitions are roughly balanced and over-sample 
a little bit.
+  val sampleSizePerPartition = math.ceil(3.0 * sampleSize / 
rdd.partitions.size).toInt
+  val shift = rdd.id
+  val classTagK = classTag[K]
+  val sketch = rdd.mapPartitionsWithIndex { (idx, iter) =>
+val seed = byteswap32(idx + shift)
+val (sample, n) = SamplingUtils.reservoirSampleAndCount(
+  iter.map(_._1), sampleSizePerPartition, seed)(classTagK)
+Iterator((idx, n, sample))
+  }.collect()
+  var numItems = 0L
+  sketch.foreach { case (_, n, _) =>
+numItems += n
+  }
+  if (numItems == 0L) {
+Array.empty
   } else {
-val bounds = new Array[K](partitions - 1)
-for (i <- 0 until partitions - 1) {
-  val index = (rddSample.length - 1) * (i + 1) / partitions
-  bounds(i) = rddSample(index)
+// If a partition contains much more than the average number of 
items, we re-sample from it
+// to ensure that enough items are collected from that partition.
+val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
+val candidates = ArrayBuffer.empty[(K, Float)]
+val imbalancedPartitions = ArrayBuffer.empty[Int]
+sketch.foreach { case (idx, n, sample) =>
+  if (fraction * n > sampleSizePerPartition) {
+imbalancedPartitions += idx
+  } else {
+// The weight is 1 over the sampling probability.
+val weight = (n.toDouble / sample.size).toFloat
+sample.foreach { key =>
+  candidates += ((key, weight))
+}
--- End diff --

Can probably just write
```
for (key <- sample) {
  candidates += ((key, weight))
}
```

Same with the foreach above. It will be slightly more readable, but no big 
deal.




[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1581#issuecomment-50209389
  
QA results for PR 1581:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17198/consoleFull




[GitHub] spark pull request: Part of [SPARK-2456] Removed some HashMaps fro...

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1561#issuecomment-50209339
  
QA results for PR 1561:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17199/consoleFull




[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-25 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50208947
  
@kanzhang I think it's unsafe to unpersist the read job's Hadoop 
configuration.  Even if we cached the results of the read, I think we'd still 
need to keep a copy of the configuration in case any of those cached partitions 
are lost and we need to recompute from scratch.




[GitHub] spark pull request: [SPARK-2568] RangePartitioner should run only ...

2014-07-25 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1562#discussion_r15426437
  
--- Diff: core/src/test/scala/org/apache/spark/PartitioningSuite.scala ---
@@ -102,6 +100,34 @@ class PartitioningSuite extends FunSuite with 
SharedSparkContext with PrivateMet
 partitioner.getPartition(Row(100))
   }
 
+  test("RangePartitioner should run only one job if data is roughly 
balanced") {
+val rdd = sc.makeRDD(0 until 20, 20).flatMap { i =>
+  val random = new java.util.Random(i)
+  Iterator.fill(5000 * i)((random.nextDouble() + i, i))
+}.cache()
+for (numPartitions <- Seq(10, 20, 40)) {
+  val partitioner = new RangePartitioner(numPartitions, rdd)
+  assert(partitioner.numPartitions === numPartitions)
+  assert(partitioner.singlePass === true)
+  val counts = rdd.keys.map(key => 
partitioner.getPartition(key)).countByValue().values
+  assert(counts.max < 2.0 * counts.min)
+}
+  }
+
+  test("RangePartitioner should work well on unbalanced data") {
+val rdd = sc.makeRDD(0 until 20, 20).flatMap { i =>
+  val random = new java.util.Random(i)
+  Iterator.fill(20 * i * i * i)((random.nextDouble() + i, i))
+}.cache()
+for (numPartitions <- Seq(2, 4, 8)) {
+  val partitioner = new RangePartitioner(numPartitions, rdd)
+  assert(partitioner.numPartitions === numPartitions)
+  assert(partitioner.singlePass === false)
+  val counts = rdd.keys.map(key => 
partitioner.getPartition(key)).countByValue().values
+  assert(counts.max < 2.0 * counts.min)
+}
+  }
+
--- End diff --

Can you add some tests where the whole RDD has 0 elements, and some tests 
where individual partitions have 0 elements and others have more? That's where 
divide by zero errors can happen.
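The divide-by-zero hazard this comment points at is the per-key weight, which the diff computes as n / sample.size; that breaks when a partition contributed an empty sample. A guarded version (function name is ours, mirroring the weight formula in the diff):

```python
# Weight of each sampled key = 1 / sampling probability = n / |sample|.
# An empty sample from an empty partition must be handled explicitly.
def partition_weight(n, sample):
    """Per-key weight for a partition's sample; None when nothing was sampled."""
    if not sample:             # empty partition: no keys to weight
        return None
    return n / len(sample)

print(partition_weight(100, [1, 2, 3, 4]))  # 25.0
print(partition_weight(0, []))              # None
```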




[GitHub] spark pull request: [SPARK-2568] RangePartitioner should run only ...

2014-07-25 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1562#discussion_r15426348
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -105,24 +108,91 @@ class RangePartitioner[K : Ordering : ClassTag, V](
 
   private var ordering = implicitly[Ordering[K]]
 
+  @transient private[spark] var singlePass = true // for unit tests
+
   // An array of upper bounds for the first (partitions - 1) partitions
   private var rangeBounds: Array[K] = {
 if (partitions == 1) {
-  Array()
+  Array.empty
 } else {
-  val rddSize = rdd.count()
-  val maxSampleSize = partitions * 20.0
-  val frac = math.min(maxSampleSize / math.max(rddSize, 1), 1.0)
-  val rddSample = rdd.sample(false, frac, 1).map(_._1).collect().sorted
-  if (rddSample.length == 0) {
-Array()
+  // This is the sample size we need to have roughly balanced output 
partitions.
+  val sampleSize = 20.0 * partitions
+  // Assume the input partitions are roughly balanced and over-sample 
a little bit.
+  val sampleSizePerPartition = math.ceil(3.0 * sampleSize / 
rdd.partitions.size).toInt
+  val shift = rdd.id
+  val classTagK = classTag[K]
+  val sketch = rdd.mapPartitionsWithIndex { (idx, iter) =>
+val seed = byteswap32(idx + shift)
+val (sample, n) = SamplingUtils.reservoirSampleAndCount(
+  iter.map(_._1), sampleSizePerPartition, seed)(classTagK)
+Iterator((idx, n, sample))
+  }.collect()
+  var numItems = 0L
+  sketch.foreach { case (_, n, _) =>
+numItems += n
+  }
--- End diff --

(It would probably also be more efficient than doing a pattern match here)




[GitHub] spark pull request: [SPARK-2568] RangePartitioner should run only ...

2014-07-25 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1562#discussion_r15426305
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -105,24 +108,91 @@ class RangePartitioner[K : Ordering : ClassTag, V](
 
   private var ordering = implicitly[Ordering[K]]
 
+  @transient private[spark] var singlePass = true // for unit tests
+
   // An array of upper bounds for the first (partitions - 1) partitions
   private var rangeBounds: Array[K] = {
 if (partitions == 1) {
-  Array()
+  Array.empty
 } else {
-  val rddSize = rdd.count()
-  val maxSampleSize = partitions * 20.0
-  val frac = math.min(maxSampleSize / math.max(rddSize, 1), 1.0)
-  val rddSample = rdd.sample(false, frac, 1).map(_._1).collect().sorted
-  if (rddSample.length == 0) {
-Array()
+  // This is the sample size we need to have roughly balanced output 
partitions.
+  val sampleSize = 20.0 * partitions
+  // Assume the input partitions are roughly balanced and over-sample 
a little bit.
+  val sampleSizePerPartition = math.ceil(3.0 * sampleSize / 
rdd.partitions.size).toInt
+  val shift = rdd.id
+  val classTagK = classTag[K]
+  val sketch = rdd.mapPartitionsWithIndex { (idx, iter) =>
+val seed = byteswap32(idx + shift)
+val (sample, n) = SamplingUtils.reservoirSampleAndCount(
+  iter.map(_._1), sampleSizePerPartition, seed)(classTagK)
+Iterator((idx, n, sample))
+  }.collect()
+  var numItems = 0L
+  sketch.foreach { case (_, n, _) =>
+numItems += n
+  }
--- End diff --

You can replace this with `val numItems = sketch.map(_._2).sum`



