[GitHub] spark pull request: [WIP][SPARK-3098] In some cases, the result of ...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2083#issuecomment-53145524
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19095/consoleFull) for PR 2083 at commit [`60e8274`](https://github.com/apache/spark/commit/60e827480e31f7773278da2e83b81178edc8ebb7).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

2014-08-23 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2093#issuecomment-53145616
  
The doc tests should cover all the code paths; do we still need more tests?





[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2093#issuecomment-53145628
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19097/consoleFull) for PR 2093 at commit [`be0e8ba`](https://github.com/apache/spark/commit/be0e8bae494805faafe70456866cd9fa2bf5a3ef).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add zipWithIndex() and ...

2014-08-23 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2092#issuecomment-53145677
  
I think doc tests should be enough.
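
For context, a sketch of the doctest coverage in question, assuming the semantics described in the PR (zipWithIndex() assigns consecutive indices; zipWithUniqueId() gives items in partition k the ids k, k+n, k+2n, ... for n partitions):

```
>>> sc.parallelize(["a", "b", "c", "d"], 2).zipWithIndex().collect()
[('a', 0), ('b', 1), ('c', 2), ('d', 3)]
>>> # With 2 partitions, partition 0 holds a, b and partition 1 holds c, d,
>>> # so the unique ids are 0, 0 + 2 and 1, 1 + 2 respectively.
>>> sc.parallelize(["a", "b", "c", "d"], 2).zipWithUniqueId().collect()
[('a', 0), ('b', 2), ('c', 1), ('d', 3)]
```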





[GitHub] spark pull request: [SPARK-2871] [PySpark] add histogram() API

2014-08-23 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2091#issuecomment-53145718
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add histogram() API

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2091#issuecomment-53145884
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19098/consoleFull) for PR 2091 at commit [`d9a0722`](https://github.com/apache/spark/commit/d9a07225003148805505a703082f34ad397b974e).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-08-23 Thread tnachen
GitHub user tnachen opened a pull request:

https://github.com/apache/spark/pull/2103

[SPARK-2608] fix executor backend launch command in Mesos mode

Based on @scwf's patch, rebased on master, with a fix to actually get it to work.
Without the fix, it failed to run with a single Mesos master/slave.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tnachen/spark mesos_executor_fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2103.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2103


commit 5cf612188d9f40c8eb6340ef855349f3dceeda88
Author: Timothy Chen 
Date:   2014-08-21T02:07:15Z

[SPARK-2608] fix executor backend launch command in Mesos mode

commit 2be86ae573edcc0134602c51e622cf7cc2e4421b
Author: Timothy Chen 
Date:   2014-08-23T07:56:46Z

Use SPARK_HOME for finding the compute-classpath script







[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-08-23 Thread tnachen
Github user tnachen commented on the pull request:

https://github.com/apache/spark/pull/2103#issuecomment-53146097
  
@pwendell take a look at the new fix





[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2103#issuecomment-53146211
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19099/consoleFull) for PR 2103 at commit [`2be86ae`](https://github.com/apache/spark/commit/2be86ae573edcc0134602c51e622cf7cc2e4421b).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3011][SQL] _temporary directory should ...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1959#issuecomment-53146504
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19096/consoleFull) for PR 1959 at commit [`be30793`](https://github.com/apache/spark/commit/be30793970fb9ecdc4eece747fe0ee7ca62d3bf3).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2093#issuecomment-53146874
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19097/consoleFull) for PR 2093 at commit [`be0e8ba`](https://github.com/apache/spark/commit/be0e8bae494805faafe70456866cd9fa2bf5a3ef).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add histogram() API

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2091#issuecomment-53147036
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19098/consoleFull) for PR 2091 at commit [`d9a0722`](https://github.com/apache/spark/commit/d9a07225003148805505a703082f34ad397b974e).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2019#issuecomment-53147147
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19100/consoleFull) for PR 2019 at commit [`814692c`](https://github.com/apache/spark/commit/814692c5f467aa787838e85a4bfbbc8f1cd97bae).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2103#issuecomment-53147354
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19099/consoleFull) for PR 2103 at commit [`2be86ae`](https://github.com/apache/spark/commit/2be86ae573edcc0134602c51e622cf7cc2e4421b).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: Update building-with-maven.md

2014-08-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2102#discussion_r16629723
  
--- Diff: docs/building-with-maven.md ---
@@ -156,4 +156,12 @@ then ship it over to the cluster. We are investigating the exact cause for this.
 
 The assembly jar produced by `mvn package` will, by default, include all of Spark's dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with yarn.application.classpath. The `hadoop-provided` profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
 
+# Building under http proxy environment
 
+Sometimes,spark need be built in http proxy environment, We recommend the following settings:
+
+    mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true -DskipTests clean package
--- End diff --

I'm not sure this resolves proxy problems. It may resolve a particular 
issue wherein your network proxy is breaking SSL connections, but is not how 
you configure proxies. See 
http://maven.apache.org/guides/mini/guide-proxies.html 

Attacks are very rare, but if someone were trying to inject a bad binary 
into your build, this would invite users to explicitly ignore that warning 
sign. So I disagree that this is something all users should use when using a 
proxy.

The `yarn` profile and so on are not related, just the two `maven.wagon` 
settings. The error message you quote does not contain the type of failure you 
would see, which is a "could not resolve dependencies" error. Finally, there 
are punctuation and capitalization problems, like "http".

I don't think this should be added.





[GitHub] spark pull request: [SPARK-2964] [SQL] Fix the -S and --silent opt...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1886#issuecomment-53147680
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19101/consoleFull) for PR 1886 at commit [`ffb68fa`](https://github.com/apache/spark/commit/ffb68fa9a6aa51d08383503b96a33f6e44333fe0).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2964] [SQL] Remove duplicated code from...

2014-08-23 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1886#issuecomment-53147717
  
I rebased #1994 onto this PR for now, and renamed the title of this PR to a proper one.





[GitHub] spark pull request: SPARK-3069 [DOCS] Build instructions in README...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2014#issuecomment-53147889
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19102/consoleFull) for PR 2014 at commit [`7aa045e`](https://github.com/apache/spark/commit/7aa045e881cda1d51c90040e5917839aee085b05).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3192] Some scripts have 2 space indenta...

2014-08-23 Thread sarutak
GitHub user sarutak opened a pull request:

https://github.com/apache/spark/pull/2104

[SPARK-3192] Some scripts have 2 space indentation but other scripts have 4 
space indentation.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sarutak/spark SPARK-3192

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2104.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2104


commit be4736bcff503380f254f1e7f8482d00182e236a
Author: Cheng Lian 
Date:   2014-08-14T13:28:44Z

Report better error message when running JDBC/CLI without hive-thriftserver 
profile enabled

commit 9c894d34327e608461bbe9be5ae5fcbb4ac6dd44
Author: Cheng Lian 
Date:   2014-08-21T06:15:47Z

Fixed bin/spark-sql -S option typo

commit a89e66df0085ddac4455654567ee95bc2a4e879a
Author: Cheng Lian 
Date:   2014-08-21T06:40:20Z

Fixed command line options quotation in scripts

commit 81b43a897b241eca16e75668557ecd81cb25c41a
Author: Cheng Lian 
Date:   2014-08-21T07:18:01Z

Shorten timeout to more reasonable value

commit 8c6f6581e609ceffc130404b57e3535a869a88e4
Author: Kousuke Saruta 
Date:   2014-08-23T09:01:35Z

Merge branch 'spark-3026' of https://github.com/liancheng/spark into 
SPARK-2964

commit ffb68fa9a6aa51d08383503b96a33f6e44333fe0
Author: Kousuke Saruta 
Date:   2014-08-23T09:13:11Z

Modified spark-sql and start-thriftserver.sh to use bin/utils.sh

commit 9b63cfc4ce96e58755ee8a8f0bae15c4c849e2e6
Author: Kousuke Saruta 
Date:   2014-08-23T09:28:33Z

Modified indentation of spark-shell







[GitHub] spark pull request: SPARK-2798 [BUILD] Correct several small error...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1726#issuecomment-53148099
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19105/consoleFull) for PR 1726 at commit [`a46e2c6`](https://github.com/apache/spark/commit/a46e2c6abbb56553abab6d3e4da690c7d9e47772).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-3069 [DOCS] Build instructions in README...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2014#issuecomment-53148101
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19103/consoleFull) for PR 2014 at commit [`5c6b814`](https://github.com/apache/spark/commit/5c6b8144765e4810b5d8995e06498c14ceba844d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3192] Some scripts have 2 space indenta...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2104#issuecomment-53148100
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19104/consoleFull) for PR 2104 at commit [`db78419`](https://github.com/apache/spark/commit/db78419bd76a8be22467d40e38b68acb74b18ca7).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2019#issuecomment-53148302
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19100/consoleFull) for PR 2019 at commit [`814692c`](https://github.com/apache/spark/commit/814692c5f467aa787838e85a4bfbbc8f1cd97bae).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-3069 [DOCS] Build instructions in README...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2014#issuecomment-53148996
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19102/consoleFull) for PR 2014 at commit [`7aa045e`](https://github.com/apache/spark/commit/7aa045e881cda1d51c90040e5917839aee085b05).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

2014-08-23 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/1964#issuecomment-53149119
  
BTW, I checked the performance of Math.abs() and breeze.numerics.abs: it seems that Math.abs() (A) performs better than breeze.numerics.abs (B).
https://gist.github.com/yu-iskw/518a48a68ef368998058





[GitHub] spark pull request: [WIP][SPARK-3139] Akka timeouts from ContextCl...

2014-08-23 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/2056#issuecomment-53149146
  
In `removeShuffleBlocks`:
```
for (mapId <- state.completedMapTasks; reduceId <- 0 until state.numBuckets) {
  val blockId = new ShuffleBlockId(shuffleId, mapId, reduceId)
  blockManager.diskBlockManager.getFile(blockId).delete()
}
```
Deleting a lot of small files is very time-consuming.
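
A quick way to see that cost locally (a hypothetical micro-benchmark in plain Python, not Spark code):

```
import os
import tempfile
import time

# Create many small files, then time deleting them one by one,
# mimicking the per-block delete loop in removeShuffleBlocks.
d = tempfile.mkdtemp()
paths = []
for i in range(100000):
    p = os.path.join(d, "shuffle_%d" % i)
    with open(p, "w") as f:
        f.write("x")
    paths.append(p)

start = time.time()
for p in paths:
    os.remove(p)
print("deleted %d files in %.1fs" % (len(paths), time.time() - start))
```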





[GitHub] spark pull request: [SPARK-3192] Some scripts have 2 space indenta...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2104#issuecomment-53149178
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19104/consoleFull) for PR 2104 at commit [`db78419`](https://github.com/apache/spark/commit/db78419bd76a8be22467d40e38b68acb74b18ca7).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `$FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$`
  * `$FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$`






[GitHub] spark pull request: [SPARK-2964] [SQL] Remove duplicated code from...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1886#issuecomment-53149212
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19101/consoleFull) for PR 1886 at commit [`ffb68fa`](https://github.com/apache/spark/commit/ffb68fa9a6aa51d08383503b96a33f6e44333fe0).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `"$FWDIR"/bin/spark-submit --class $CLASS "$`
  * `"$FWDIR"/bin/spark-submit --class $CLASS "$`






[GitHub] spark pull request: SPARK-3069 [DOCS] Build instructions in README...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2014#issuecomment-53149241
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19103/consoleFull) for PR 2014 at commit [`5c6b814`](https://github.com/apache/spark/commit/5c6b8144765e4810b5d8995e06498c14ceba844d).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `In multiclass classification, all `$2^`
  * `public final class JavaDecisionTree `






[GitHub] spark pull request: SPARK-2798 [BUILD] Correct several small error...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1726#issuecomment-53149251
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19105/consoleFull) for PR 1726 at commit [`a46e2c6`](https://github.com/apache/spark/commit/a46e2c6abbb56553abab6d3e4da690c7d9e47772).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16630100
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
   }
 
   def close() {
-    closed = true
-    val k = key()
-    if (k != null) {
-      k.cancel()
+    synchronized {
+      /**
+       * We should avoid executing the closing sequence
+       * twice from the same thread.
+       * Otherwise we can fail to call connectionsById.get() in
+       * ConnectionManager#removeConnection() the second time.
+       */
+      if (!closed) {
+        disposeSasl()
+
+        /**
+         * callOnCloseCallback() should be invoked
+         * before k.cancel() and channel.close(),
+         * so that key() does not return null.
+         * If key() returns null before callOnCloseCallback(),
+         * we cannot remove the entry from connectionsByKey in ConnectionManager
+         * and end up with a CancelledKeyException being thrown.
+         */
+        callOnCloseCallback()
+        val k = key()
+        if (k != null) {
+          k.cancel()
+        }
+        channel.close()
+        closed = true
+      }
--- End diff --

This is an incorrect change.
Any of those methods can throw an exception, leaving Connection.closed as false.

What is the point of the synchronized, btw? None of the other methods are protected by this lock.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16630105
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
 
   val DEFAULT_INTEREST = SelectionKey.OP_READ
 
+  var alreadyReading = false
+
   override def registerInterest() {
     // Registering read too - does not really help in most cases, but for some
     // it does - so let us keep it for now.
-    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
+    changeConnectionKeyInterest(
+      SelectionKey.OP_WRITE | (if (!alreadyReading) {
+        alreadyReading = true
+        DEFAULT_INTEREST
+      } else { 0 }))
--- End diff --

What is the intent behind this change? Probably there is a misunderstanding of why DEFAULT_INTEREST is registered: it is not to actually read from this socket, but to register for read events so that close is detected.
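
(For illustration, the same idea expressed with Python's selectors module; the host and overall flow here are hypothetical. Read interest is what lets the event loop observe a peer close, which surfaces as a readable event whose recv() returns an empty byte string.)

```
import selectors
import socket

sel = selectors.DefaultSelector()
conn = socket.create_connection(("example.org", 80))
conn.setblocking(False)
# Register read interest too - not to consume payload, but so that a
# remote close is detected as a "readable" event.
sel.register(conn, selectors.EVENT_READ | selectors.EVENT_WRITE)

for key, events in sel.select(timeout=1.0):
    if events & selectors.EVENT_READ:
        data = key.fileobj.recv(4096)
        if not data:  # empty read == remote side closed the connection
            sel.unregister(key.fileobj)
            key.fileobj.close()
```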





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16630107
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
 
   val DEFAULT_INTEREST = SelectionKey.OP_READ
 
+  var alreadyReading = false
+
   override def registerInterest() {
     // Registering read too - does not really help in most cases, but for some
     // it does - so let us keep it for now.
-    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
+    changeConnectionKeyInterest(
+      SelectionKey.OP_WRITE | (if (!alreadyReading) {
+        alreadyReading = true
+        DEFAULT_INTEREST
+      } else { 0 }))
   }
 
   override def unregisterInterest() {
-    changeConnectionKeyInterest(DEFAULT_INTEREST)
+    changeConnectionKeyInterest(0)
--- End diff --

Incorrect change, please see above.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/2019#issuecomment-53149673
  
Handling TCP/IP events is by definition async, particularly when state changes can happen orthogonally to state within Java variables.
So there is only so much you can do to reduce the exceptions you see in the logs - the important point is not to prevent issues (which is not possible if you want to write performant, robust code), but to detect them and ensure they are handled properly.

Given that, the changes here look fragile: we can revisit this PR when they are addressed, since I think there is value in some of these.
(For example, make closed an atomic boolean, do a getAndSet, and do the expensive close only if the previous value was false; and so on.)
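
A minimal sketch of that get-and-set close pattern, rendered here in Python with a lock standing in for the JVM's AtomicBoolean (the class and method names are hypothetical, not Spark code):

```
import threading

class Connection(object):
    def __init__(self):
        self._closed = False
        self._lock = threading.Lock()

    def _get_and_set_closed(self):
        # Atomically read the old value and set the flag, like
        # AtomicBoolean.getAndSet(true) on the JVM.
        with self._lock:
            was_closed, self._closed = self._closed, True
        return was_closed

    def close(self):
        # Only the first caller pays for the expensive close; later,
        # possibly concurrent, callers see the flag already set and return.
        if self._get_and_set_closed():
            return
        self._do_expensive_close()

    def _do_expensive_close(self):
        pass  # cancel keys, close the channel, run callbacks, etc.
```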





[GitHub] spark pull request: [SPARK-3139] Akka timeouts from ContextCleaner...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2056#issuecomment-53149761
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19106/consoleFull) for PR 2056 at commit [`1bdcbbb`](https://github.com/apache/spark/commit/1bdcbbba1dcf103c967c8f6e3dcb564828aba87a).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3171] Don't print meaningless informati...

2014-08-23 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2078#issuecomment-53149962
  
> Regarding logging -
> host:port is insufficient to debug (and as I mentioned above, you will
> get NPE for the log messages in this PR depending on the key's state) - you
> actually need the key instance correlated across log messages to debug issues.
> To find host:port for a key, it is logged elsewhere in the code iirc (and
> we grep by instance id).

Is this true? I cannot find a log message which shows the key and the corresponding host:port.
In ConnectionManager, host:port is logged when accepting a connection and when connecting to another host, so logging host:port when an error occurs is helpful.
Even if we can find the host:port for a key, I think it's good to be able to find the host:port directly.
"Being able to debug" is not the same as "easy to debug".





[GitHub] spark pull request: [SPARK-3171] Don't print meaningless informati...

2014-08-23 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2078#issuecomment-53150028
  
I removed the try / catch because selector.select never throws CancelledKeyException.
If it can cause a regression, could you show me how CancelledKeyException is thrown from selector.select?





[GitHub] spark pull request: [SPARK-3171] Don't print meaningless informati...

2014-08-23 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/2078#issuecomment-53150053
  
If you do want to get to host:port in case the logs are noisy (happens!), ensure you retrieve it with robust code.
This PR can cause NPEs.
On 23-Aug-2014 4:42 pm, "Kousuke Saruta"  wrote:

> Regarding logging -
> host:port is insufficient to debug (and as I mentioned above, you will get
> NPE for the log messages in this PR depending on the key's state) - you
> actually need the key instance correlated across log messages to debug
> issues.
> To find host:port for a key, it is logged elsewhere in the code iirc (and
> we grep by instance id).
>
> Is this true? I cannot find a log message which shows the key and the
> corresponding host:port.
> In ConnectionManager, host:port is logged when accepting a connection and
> when connecting to another host, so logging host:port when an error occurs
> is helpful.
> Even if we can find the host:port for a key, I think it's good to be able
> to find the host:port directly.
> "Being able to debug" is not the same as "easy to debug".





[GitHub] spark pull request: [SPARK-3171] Don't print meaningless informati...

2014-08-23 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/2078#issuecomment-53150077
  
It can and it does, if a key was cancelled.
On 23-Aug-2014 4:46 pm, "Kousuke Saruta"  wrote:

> I removed the try / catch because selector.select never throws
> CancelledKeyException.
> If it can cause a regression, could you show me how CancelledKeyException
> is thrown from selector.select?





[GitHub] spark pull request: [SPARK-3139] Akka timeouts from ContextCleaner...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2056#issuecomment-53150684
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19107/consoleFull) for PR 2056 at commit [`72e7da1`](https://github.com/apache/spark/commit/72e7da16a1c83e8464c50d98bc20cc424fa029b1).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3139] Akka timeouts from ContextCleaner...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2056#issuecomment-53150822
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19106/consoleFull) for PR 2056 at commit [`1bdcbbb`](https://github.com/apache/spark/commit/1bdcbbba1dcf103c967c8f6e3dcb564828aba87a).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add zipWithIndex() and ...

2014-08-23 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2092#issuecomment-53151265
  
fair enough

+1 lgtm





[GitHub] spark pull request: configuration for spark.cleaner.referenceTrack...

2014-08-23 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/2097#issuecomment-53151295
  
En... that's exactly the motivation for which I proposed this PR... ContextCleaner sometimes mistakenly cleans my broadcast files in my test cases... About one month ago I asked TD about this, and we could not find a reasonable explanation; one day it disappeared, and recently it's there again, so I have to turn off ContextCleaner in our test cases. (I guess that's related to the fact that I'm using JDK 7, which is not fully tested? The weird thing is that, in our 24 * 7 service, it never appears.)





[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread mattf
Github user mattf commented on a diff in the pull request:

https://github.com/apache/spark/pull/2094#discussion_r16630356
  
--- Diff: python/pyspark/rdd.py ---
@@ -810,23 +810,45 @@ def func(iterator):
 
         return self.mapPartitions(func).fold(zeroValue, combOp)
 
-    def max(self):
+    def max(self, comp=None):
         """
         Find the maximum item in this RDD.
 
-        >>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).max()
+        @param comp: A function used to compare two elements, the builtin `cmp`
--- End diff --

cmp may be used in max, but for this func the default is on line 829. Either way, a minor nitpick.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread mattf
Github user mattf commented on a diff in the pull request:

https://github.com/apache/spark/pull/2094#discussion_r16630361
  
--- Diff: python/pyspark/rdd.py ---
@@ -810,23 +810,45 @@ def func(iterator):
 
         return self.mapPartitions(func).fold(zeroValue, combOp)
 
-    def max(self):
+    def max(self, comp=None):
         """
         Find the maximum item in this RDD.
 
-        >>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).max()
+        @param comp: A function used to compare two elements, the builtin `cmp`
+                     will be used by default.
+
+        >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
+        >>> rdd.max()
         43.0
+        >>> rdd.max(lambda a, b: cmp(str(a), str(b)))
+        5.0
         """
-        return self.reduce(max)
+        if comp is not None:
+            func = lambda a, b: a if comp(a, b) >= 0 else b
+        else:
+            func = max
 
-    def min(self):
+        return self.reduce(func)
+
+    def min(self, comp=None):
         """
         Find the minimum item in this RDD.
 
-        >>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).min()
-        1.0
+        @param comp: A function used to compare two elements, the builtin `cmp`
+                     will be used by default.
+
+        >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
+        >>> rdd.min()
+        2.0
+        >>> rdd.min(lambda a, b: cmp(str(a), str(b)))
+        10.0
         """
-        return self.reduce(min)
+        if comp is not None:
--- End diff --

Consider a default of comp=min in the arg list, and test for comp is not min.
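
A standalone sketch of that alternative, with functools.reduce standing in for RDD.reduce (hypothetical, not the PR's code):

```
from functools import reduce

def rdd_min(data, comp=min):
    # The builtin min doubles as the sentinel default and the fast path.
    if comp is not min:
        # Wrap a cmp-style comparator into a two-argument reducer.
        func = lambda a, b: a if comp(a, b) <= 0 else b
    else:
        func = min
    return reduce(func, data)

rdd_min([2.0, 5.0, 43.0, 10.0])  # 2.0
rdd_min([2.0, 5.0, 43.0, 10.0],
        lambda a, b: (str(a) > str(b)) - (str(a) < str(b)))  # 10.0
```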





[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2094#issuecomment-53151507
  
Agreed re doctest; I forgot it was in use.





[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-08-23 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2103#issuecomment-53151568
  
What's the diff between this and my PR https://github.com/apache/spark/pull/1986?





[GitHub] spark pull request: [SPARK-2871] [PySpark] add histogram() API

2014-08-23 Thread mattf
Github user mattf commented on a diff in the pull request:

https://github.com/apache/spark/pull/2091#discussion_r16630390
  
--- Diff: python/pyspark/rdd.py ---
@@ -856,6 +856,104 @@ def redFunc(left_counter, right_counter):
 
         return self.mapPartitions(lambda i: [StatCounter(i)]).reduce(redFunc)
 
+    def histogram(self, buckets, evenBuckets=False):
+        """
+        Compute a histogram using the provided buckets. The buckets
+        are all open to the right except for the last, which is closed.
+        e.g. [1, 10, 20, 50] means the buckets are [1, 10) [10, 20) [20, 50],
+        i.e. 1 <= x < 10, 10 <= x < 20, 20 <= x <= 50. On the input of 1
+        and 50 we would have a histogram of 1, 0, 1.
+
+        If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
+        insertion can be switched from O(log n) to O(1) per
+        element (where n = # buckets) by setting `evenBuckets` to True.
+
+        Buckets must be sorted, must not contain any duplicates, and
+        must have at least two elements.
+
+        If `buckets` is a number, it generates buckets that are evenly
+        spaced between the minimum and maximum of the RDD. For
+        example, if the min value is 0 and the max is 100, given buckets
+        as 2, the resulting buckets will be [0, 50) [50, 100]. buckets must
+        be at least 1. If the RDD contains infinity or NaN, an exception is
+        thrown. If the elements in the RDD do not vary (max == min),
+        a single bucket is always returned.
+
+        It will return a tuple of buckets and histogram.
+
+        >>> rdd = sc.parallelize(range(51))
+        >>> rdd.histogram(2)
+        ([0, 25, 50], [25, 26])
+        >>> rdd.histogram([0, 5, 25, 50])
+        ([0, 5, 25, 50], [5, 20, 26])
+        >>> rdd.histogram([0, 15, 30, 45, 60], True)
+        ([0, 15, 30, 45, 60], [15, 15, 15, 6])
+        """
+
+        if isinstance(buckets, (int, long)):
+            if buckets < 1:
+                raise ValueError("buckets should not be less than 1")
+
+            # filter out non-comparable elements
+            self = self.filter(lambda x: x is not None and not isnan(x))
+
+            # faster than stats()
+            def minmax(a, b):
+                return min(a[0], b[0]), max(a[1], b[1])
+            try:
+                minv, maxv = self.map(lambda x: (x, x)).reduce(minmax)
+            except TypeError as e:
+                if e.message == "reduce() of empty sequence with no initial value":
--- End diff --

The goal of propagating messages that do not expose implementation details is good, IMHO.

Are you confident that the "empty sequence" error is the only exception that could arise?

I was thinking about mixed types in the RDD, but maybe that's not a problem.
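
For reference, a quick local-Python sketch of both failure modes, with functools.reduce standing in for RDD.reduce (illustrative only):

```
from functools import reduce

def minmax(a, b):
    return min(a[0], b[0]), max(a[1], b[1])

# The case the except-clause targets: reducing an empty sequence.
try:
    reduce(minmax, [])
except TypeError as e:
    print(e)  # reduce() of empty sequence with no initial value

# Mixed types: fine in Python 2 (arbitrary but consistent cross-type
# ordering), while Python 3 raises a different TypeError that the
# message check would not match.
try:
    print(reduce(minmax, [(x, x) for x in [1.0, "a"]]))
except TypeError as e:
    print(e)
```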





[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

2014-08-23 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2093#issuecomment-53151737
  
> The doc tests should cover all the code paths; do we still need more
> tests?

it's worth including a lookup for 1000 or 1234, which won't be found
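
For example, a doctest along these lines (a sketch, assuming RDD.lookup returns an empty list when the key is absent):

```
>>> rdd = sc.parallelize([(i, i * 2) for i in range(100)])
>>> rdd.lookup(42)
[84]
>>> rdd.lookup(1234)  # absent key
[]
```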





[GitHub] spark pull request: [SPARK-3139] Akka timeouts from ContextCleaner...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2056#issuecomment-53151891
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19107/consoleFull) for PR 2056 at commit [`72e7da1`](https://github.com/apache/spark/commit/72e7da16a1c83e8464c50d98bc20cc424fa029b1).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3171] Don't print meaningless informati...

2014-08-23 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2078#issuecomment-53151994
  
Thanks @mridulm .
What I want to know is why selector.select throws CancelledKeyException, even though the JavaDoc doesn't say select() throws that exception.

There was once a bug where Selector#select threw CancelledKeyException, but it has already been resolved. You can check that here (http://www.oracle.com/technetwork/java/javase/releasenotes-138306.html, Bug ID 4729342).

If you are worried that the bug still exists, we should leave a comment explaining why selector.select is enclosed in a try block, or log that it may be a JDK bug when CancelledKeyException is thrown from selector.select.





[GitHub] spark pull request: [SPARK-3171] Don't print meaningless informati...

2014-08-23 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/2078#issuecomment-53153441
  
Why it happens is anyone's guess. I have seen it happen since fairly early on, when I started using nio in 1.4.x, and it continues to this day. If you search online, you will see a lot of people still hit it (and most of them are not on 1.4.x).
I added this codepath because the selector thread was dying due to this issue.

Btw, if you actually look at the evaluation of the bug, you will notice that some workarounds are suggested - which are not applicable to a moderately multi-threaded app like Spark (we don't know 'when' we will be done with a key!).





[GitHub] spark pull request: [SPARK-3171] Don't print meaningless informati...

2014-08-23 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2078#issuecomment-53154553
  
I don't mind handling NPE, but I wonder why / where / how an NPE is thrown.
At least, the JavaDoc doesn't say SelectionKey#channel or SocketChannel#socket return null. And after a key is cancelled, key.channel still returns a non-null object.

I also found that the bug of throwing CancelledKeyException may be specific to Mac OS X:

https://java.net/projects/macosx-port/lists/issues/archive/2011-10/message/186

O.K., I agree to leave the try / catch, but I think we should leave a comment explaining why select should be enclosed in a try block.
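
For concreteness, a minimal sketch of the kind of defensive wrapper and explanatory comment being discussed (the loop body and the logging are placeholders, not the actual ConnectionManager code):

```
import java.nio.channels.{CancelledKeyException, Selector}

def selectLoop(selector: Selector): Unit = {
  while (true) {
    val selected =
      try {
        selector.select()
      } catch {
        // select() is not documented to throw CancelledKeyException, but
        // several JVMs have been observed doing so (see JDK bug 4729342 and
        // the Mac OS X port issue linked above); catch it so the selector
        // thread does not die.
        case e: CancelledKeyException =>
          println(s"Ignoring CancelledKeyException from select(): $e")
          0
      }
    // ... process the `selected` ready keys here ...
  }
}
```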





[GitHub] spark pull request: [SPARK-3171] Don't print meaningless informati...

2014-08-23 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/2078#issuecomment-53154886
  
I am not on a Mac - and have seen this on Solaris and Linux (not in the context of Spark); so the issue is unfortunately not Mac-specific (though it might also occur on Mac!).

Btw, we cannot assume that channels are readable/writable when they are in the selector loop - it also includes channels which we have registered to connect, which might not be connectable (remote host/rack gone, dns issues, etc).
A simple example where it fails: connect a socket, and before the connect completes, try to get the remote address - it will return null. This specific example is reasonably benign, we don't get an NPE - just garbage in the logs (which is also the reason why we should always log the selection key directly).





[GitHub] spark pull request: Update building-with-maven.md

2014-08-23 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2102#issuecomment-53155103
  
Hey @loachli - thanks for looking into this. I don't think we can advise 
users to disable security settings for their maven build. Does your proxy 
support HTTPS?





[GitHub] spark pull request: [SPARK-3171] Don't print meaningless informati...

2014-08-23 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/2078#issuecomment-53155129
  
Looking at the code:
```
val remoteAddress = key.channel.asInstanceOf[SocketChannel].socket.getRemoteSocketAddress
```
I agree, key.channel is documented to always return a non-null value, and SocketChannel.socket is implemented to create a new socket and return it in case the socket happens to be null.
So it looks like an NPE is not possible.

The returned socket address can be null, making the log message useless; but at least we thankfully won't barf out with an exception!
Given this is for logging, make it an addressInfo String (in a method, possibly, since this is used in multiple places) and populate it with:
a) getRemoteSocketAddress (null should be fine I guess).
b) key.toString (which would include the key identifier, to help debug the lifecycle of a selection key).

Possibly other info which might be relevant ... and use this instead of remoteAddress in the code.
Given that Spark logs sometimes go up to 60+ GB for some of our jobs, I can see the value in having the socket addresses along with other state transition information while debugging a problem (instead of having to grep to find this info!)
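
A rough sketch of the suggested helper (the name and shape are hypothetical, not the actual Connection code); it folds both pieces of state into one string so log lines stay useful even when the remote address is null:

```
import java.nio.channels.{SelectionKey, SocketChannel}

def addressInfo(key: SelectionKey): String = {
  // getRemoteSocketAddress may be null (e.g. connect not yet complete);
  // String.valueOf turns that into "null" instead of throwing.
  val remote = key.channel() match {
    case sc: SocketChannel => String.valueOf(sc.socket().getRemoteSocketAddress)
    case other             => String.valueOf(other)
  }
  // key.toString carries the key's identity, which helps trace the
  // lifecycle of a selection key across log lines.
  s"remoteAddress = $remote, key = $key"
}
```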





[GitHub] spark pull request: [SPARK-3171] Don't print meaningless informati...

2014-08-23 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/2078#issuecomment-53155180
  
/CC @JoshRosen since you looked at ConnectionManager and Connection 
recently.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16631095
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
 
   val DEFAULT_INTEREST = SelectionKey.OP_READ
 
+  var alreadyReading = false
+
   override def registerInterest() {
     // Registering read too - does not really help in most cases, but for some
     // it does - so let us keep it for now.
-    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
+    changeConnectionKeyInterest(
+      SelectionKey.OP_WRITE | (if (!alreadyReading) {
+        alreadyReading = true
+        DEFAULT_INTEREST
+      } else { 0 }))
--- End diff --

I understand that registering DEFAULT_INTEREST (OP_READ) is to detect the connection being closed by the remote host.
But once blocked by channel.read() in SendingConnection#read, DEFAULT_INTEREST is not needed.

In addition, because SendingConnection never unregisters OP_READ, 2 threads can be active on the same SendingConnection, and while one of the threads cancels its key, another thread can evaluate key.isValid in ConnectionManager#run.





[GitHub] spark pull request: [SPARK-3068]remove MaxPermSize option for jvm ...

2014-08-23 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2011#issuecomment-53155439
  
This seems like a reasonable change. One issue is that it does run the `java` binary an extra time on task launch, but that seems fairly cheap when only asking for the version.





[GitHub] spark pull request: [SPARK-3068]remove MaxPermSize option for jvm ...

2014-08-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2011





[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-08-23 Thread tnachen
Github user tnachen commented on the pull request:

https://github.com/apache/spark/pull/2103#issuecomment-53156348
  
My last commit is the diff in this PR.





[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-08-23 Thread tnachen
Github user tnachen commented on the pull request:

https://github.com/apache/spark/pull/2103#issuecomment-53156411
  
Btw, this is still not ideal IMO, since it computes the classpath on the scheduler side and assumes all slave executors have the same setup after unzipping.





[GitHub] spark pull request: [Minor] fix typo

2014-08-23 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/2105

[Minor] fix typo

Fix a typo in comment.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 fix_typo

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2105


commit 6596a80e5a5a3023cbe7e1fc9796508304d35576
Author: Liang-Chi Hsieh 
Date:   2014-08-23T15:48:23Z

fix typo.







[GitHub] spark pull request: [Minor] fix typo

2014-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2105#issuecomment-53156760
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16631373
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
   }
 
   def close() {
-    closed = true
-    val k = key()
-    if (k != null) {
-      k.cancel()
+    synchronized {
+      /**
+       * We should avoid executing the closing sequence more than once.
+       * Otherwise we can fail in connectionsById.get() in
+       * ConnectionManager#removeConnection() the 2nd time.
+       */
+      if (!closed) {
+        disposeSasl()
+
+        /**
+         * callOnCloseCallback() should be invoked
+         * before k.cancel() and channel.close(),
+         * so that key() does not return null.
+         * If key() returns null before callOnCloseCallback(),
+         * we cannot remove the entry from connectionsByKey in ConnectionManager
+         * and end up with a CancelledKeyException.
+         */
+        callOnCloseCallback()
+        val k = key()
+        if (k != null) {
+          k.cancel()
+        }
+        channel.close()
+        closed = true
+      }
--- End diff --

SendingConnection#close is called from 3 threads on the same instance.
For example, the 1st thread of handle-read-write-executor calls ReceivingConnection#close -> SendingConnection#close, the 2nd thread of handle-read-write-executor calls SendingConnection#close, and the 3rd thread, connection-manager-thread, calls ConnectionManager#run -> SendingConnection#close.

I think that if an exception is thrown from any method in close(), the connection is not marked as closed, because one of those threads is expected to close resources even if another thread fails to close.

And the synchronized block is to protect SendingConnection#close from being called by 3 threads.
It can be one of the following situations:
(1) One thread of handle-read-write-executor evaluates key.cancel in SendingConnection#close.
(2) Then, connection-manager-thread calls removeConnection via callOnCloseCallback and evaluates "connectionsByKey -= connection.key". This fails because connection.key is null at this time.

After (2) above, connection-manager-thread expects connectionsByKey.size to become 0 in ConnectionManager#stop, but that size never reaches 0, and we get the log message "All connections not cleaned up".





[GitHub] spark pull request: SPARK-3069 [DOCS] Build instructions in README...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2014#issuecomment-53157117
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19108/consoleFull)
 for   PR 2014 at commit 
[`9b56494`](https://github.com/apache/spark/commit/9b564944fef59afb61f8dd4af2aaf5771bcd46e8).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-3069 [DOCS] Build instructions in README...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2014#issuecomment-53158753
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19108/consoleFull)
 for   PR 2014 at commit 
[`9b56494`](https://github.com/apache/spark/commit/9b564944fef59afb61f8dd4af2aaf5771bcd46e8).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Minor] fix typo

2014-08-23 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2105#issuecomment-53158982
  
I merged this; thanks!





[GitHub] spark pull request: [Minor] fix typo

2014-08-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2105





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16631699
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
   }
 
   def close() {
-    closed = true
-    val k = key()
-    if (k != null) {
-      k.cancel()
+    synchronized {
+      /**
+       * We should avoid executing the closing sequence more than once.
+       * Otherwise we can fail in connectionsById.get() in
+       * ConnectionManager#removeConnection() the 2nd time.
+       */
+      if (!closed) {
+        disposeSasl()
+
+        /**
+         * callOnCloseCallback() should be invoked
+         * before k.cancel() and channel.close(),
+         * so that key() does not return null.
+         * If key() returns null before callOnCloseCallback(),
+         * we cannot remove the entry from connectionsByKey in ConnectionManager
+         * and end up with a CancelledKeyException.
+         */
+        callOnCloseCallback()
+        val k = key()
+        if (k != null) {
+          k.cancel()
+        }
+        channel.close()
+        closed = true
+      }
--- End diff --

The way to handle this is to make closed an AtomicBoolean and do a getAndSet.
If the result of getAndSet is false, which means closed was false on invocation, only then do the actual logic of close from earlier: it is a bug that all invocations of close were trying to do the same thing.

Essentially:
a) Change
```var closed = false```
to
```val closed = new AtomicBoolean(false)```

b) Change close() to
```
def close() {
  val prev = closed.getAndSet(true)
  if (!prev) {
    closeImpl()
  }
}
```

Where closeImpl is a private method containing the logic from the earlier close (except for the closed variable update).

This will ensure that failures in closeImpl still result in the connection being marked as closed, and that repeated invocations do not cause the same code to be executed and other failures to surface (like a missing id from a map, etc).
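
Spelled out as a compilable sketch (an illustration of the pattern, not the actual Connection class; closeImpl() stands in for the old close() body):

```
import java.util.concurrent.atomic.AtomicBoolean

abstract class CloseOnce {
  private val closed = new AtomicBoolean(false)

  // Repeated invocations are legal and idempotent: only the first caller
  // runs closeImpl(), and the instance counts as closed even if
  // closeImpl() throws.
  final def close(): Unit = {
    if (!closed.getAndSet(true)) {
      closeImpl()
    }
  }

  def isClosed: Boolean = closed.get()

  // The original cleanup logic (disposeSasl, key cancel, channel close, ...)
  protected def closeImpl(): Unit
}
```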





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16631703
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
 
   val DEFAULT_INTEREST = SelectionKey.OP_READ
 
+  var alreadyReading = false
+
   override def registerInterest() {
     // Registering read too - does not really help in most cases, but for some
     // it does - so let us keep it for now.
-    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
+    changeConnectionKeyInterest(
+      SelectionKey.OP_WRITE | (if (!alreadyReading) {
+        alreadyReading = true
+        DEFAULT_INTEREST
+      } else { 0 }))
--- End diff --

There is no blocking read - read events are never fired for a SendingConnection unless the socket was closed from underneath us.
Which is why we always re-register for OP_READ, irrespective of whether we are registering for OP_WRITE or not.





[GitHub] spark pull request: [SPARK-3170][CORE]: Bug Fix in Storage UI

2014-08-23 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2076#issuecomment-53159750
  
@andrewor14 - can you take a look at this patch? IIRC you worked on this 
code most recently.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

2014-08-23 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2093#issuecomment-53159927
  
@mattf I have added a test case for it, thx.

I have done a lot of refactoring in this PR, please re-review it, thanks.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2093#issuecomment-53159943
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19109/consoleFull)
 for   PR 2093 at commit 
[`0f1bce8`](https://github.com/apache/spark/commit/0f1bce8bbf6ca8ccec04f7030707f5a01f3a15ae).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16631901
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
 
   val DEFAULT_INTEREST = SelectionKey.OP_READ
 
+  var alreadyReading = false
+
   override def registerInterest() {
     // Registering read too - does not really help in most cases, but for some
     // it does - so let us keep it for now.
-    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
+    changeConnectionKeyInterest(
+      SelectionKey.OP_WRITE | (if (!alreadyReading) {
+        alreadyReading = true
+        DEFAULT_INTEREST
+      } else { 0 }))
--- End diff --

SocketChannel#read blocks other threads that call SocketChannel#read on the same instance - at least, the JavaDoc says so.
I think waiting on channel.read() is needed by only one thread per channel to detect disconnection. So why is re-registering OP_READ needed?





[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2094#discussion_r16631909
  
--- Diff: python/pyspark/rdd.py ---
@@ -810,23 +810,45 @@ def func(iterator):
 
         return self.mapPartitions(func).fold(zeroValue, combOp)
 
-    def max(self):
+    def max(self, comp=None):
         """
         Find the maximum item in this RDD.
 
-        >>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).max()
+        @param comp: A function used to compare two elements, the builtin `cmp`
--- End diff --

Yes, using `comp` here is a bit confusing. The builtin `min` uses `key`, which would be better for Python programmers, but it would be different from the Scala API.

cc @mateiz @rxin @JoshRosen 





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16631910
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
   }
 
   def close() {
-    closed = true
-    val k = key()
-    if (k != null) {
-      k.cancel()
+    synchronized {
+      /**
+       * We should avoid executing the closing sequence more than once.
+       * Otherwise we can fail in connectionsById.get() in
+       * ConnectionManager#removeConnection() the 2nd time.
+       */
+      if (!closed) {
+        disposeSasl()
+
+        /**
+         * callOnCloseCallback() should be invoked
+         * before k.cancel() and channel.close(),
+         * so that key() does not return null.
+         * If key() returns null before callOnCloseCallback(),
+         * we cannot remove the entry from connectionsByKey in ConnectionManager
+         * and end up with a CancelledKeyException.
+         */
+        callOnCloseCallback()
+        val k = key()
+        if (k != null) {
+          k.cancel()
+        }
+        channel.close()
+        closed = true
+      }
--- End diff --

I think all of closing sequence should be executed atomically so I think 
only using atomic boolean is insufficient.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2094#issuecomment-53160372
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19110/consoleFull)
 for   PR 2094 at commit 
[`2f63512`](https://github.com/apache/spark/commit/2f63512e10a608722c1e8cd9ab5d22124d389a5d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2094#discussion_r16631953
  
--- Diff: python/pyspark/rdd.py ---
@@ -810,23 +810,45 @@ def func(iterator):
 
         return self.mapPartitions(func).fold(zeroValue, combOp)
 
-    def max(self):
+    def max(self, comp=None):
         """
         Find the maximum item in this RDD.
 
-        >>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).max()
+        @param comp: A function used to compare two elements, the builtin `cmp`
--- End diff --

We already use `key` in Python instead of `Ordering` in Scala, so I have changed it into `key`.

Also, I would like to add `key` to top(); it will be helpful, for example:

    rdd.map(lambda x: (x, 1)).reduceByKey(add).top(20, key=itemgetter(1))

We already have `ord` in Scala. Should I add this in this PR?
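
For comparison, a sketch of the Scala side, where the implicit Ordering already plays the role the proposed `key` argument would play in Python (`lines` is a hypothetical RDD[String]; not code from this PR):

```
import org.apache.spark.SparkContext._  // PairRDDFunctions implicits (Spark 1.x)
import org.apache.spark.rdd.RDD

def top20ByCount(lines: RDD[String]): Array[(String, Int)] =
  lines
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    // the explicit Ordering here is what key=itemgetter(1) would express in Python
    .top(20)(Ordering.by(_._2))
```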





[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2094#discussion_r16631962
  
--- Diff: python/pyspark/rdd.py ---
@@ -810,23 +810,45 @@ def func(iterator):
 
         return self.mapPartitions(func).fold(zeroValue, combOp)
 
-    def max(self):
+    def max(self, comp=None):
         """
         Find the maximum item in this RDD.
 
-        >>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).max()
+        @param comp: A function used to compare two elements, the builtin `cmp`
+                     will be used by default.
+
+        >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
+        >>> rdd.max()
         43.0
+        >>> rdd.max(lambda a, b: cmp(str(a), str(b)))
+        5.0
         """
-        return self.reduce(max)
+        if comp is not None:
+            func = lambda a, b: a if comp(a, b) >= 0 else b
+        else:
+            func = max
 
-    def min(self):
+        return self.reduce(func)
+
+    def min(self, comp=None):
         """
         Find the minimum item in this RDD.
 
-        >>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).min()
-        1.0
+        @param comp: A function used to compare two elements, the builtin `cmp`
+                     will be used by default.
+
+        >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
+        >>> rdd.min()
+        2.0
+        >>> rdd.min(lambda a, b: cmp(str(a), str(b)))
+        10.0
         """
-        return self.reduce(min)
+        if comp is not None:
--- End diff --

min and comp have different meanings:

   >>> min(1, 2)
   1
   >>> cmp(1, 2)
   -1





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16631998
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
 
   val DEFAULT_INTEREST = SelectionKey.OP_READ
 
+  var alreadyReading = false
+
   override def registerInterest() {
     // Registering read too - does not really help in most cases, but for some
     // it does - so let us keep it for now.
-    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
+    changeConnectionKeyInterest(
+      SelectionKey.OP_WRITE | (if (!alreadyReading) {
+        alreadyReading = true
+        DEFAULT_INTEREST
+      } else { 0 }))
--- End diff --

We use non-blocking IO.
Please take a look at the ConnectionManager and Connection classes in detail to get a better understanding of the codebase; there are a lot of resources online about how to use nio in non-blocking mode in a multithreaded application.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16632025
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
   }
 
   def close() {
-    closed = true
-    val k = key()
-    if (k != null) {
-      k.cancel()
+    synchronized {
+      /**
+       * We should avoid executing the closing sequence more than once.
+       * Otherwise we can fail in connectionsById.get() in
+       * ConnectionManager#removeConnection() the 2nd time.
+       */
+      if (!closed) {
+        disposeSasl()
+
+        /**
+         * callOnCloseCallback() should be invoked
+         * before k.cancel() and channel.close(),
+         * so that key() does not return null.
+         * If key() returns null before callOnCloseCallback(),
+         * we cannot remove the entry from connectionsByKey in ConnectionManager
+         * and end up with a CancelledKeyException.
+         */
+        callOnCloseCallback()
+        val k = key()
+        if (k != null) {
+          k.cancel()
+        }
+        channel.close()
+        closed = true
+      }
--- End diff --

I think you are misunderstanding the intent of what close is supposed to do for the Connection classes. It is supposed to mirror the normal expectation of close on streams - barring the bug I mentioned above.

In a nutshell, it is supposed to mark the connection as closed (so that repeated invocations of the method are idempotent), and clean up if required. Take a look at how close is implemented in general in various JDK IO classes.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add RDD.lookup(key)

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2093#issuecomment-53161474
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19109/consoleFull)
 for   PR 2093 at commit 
[`0f1bce8`](https://github.com/apache/spark/commit/0f1bce8bbf6ca8ccec04f7030707f5a01f3a15ae).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2094#issuecomment-53161691
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19110/consoleFull)
 for   PR 2094 at commit 
[`2f63512`](https://github.com/apache/spark/commit/2f63512e10a608722c1e8cd9ab5d22124d389a5d).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16632233
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -263,14 +282,20 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
 
   val DEFAULT_INTEREST = SelectionKey.OP_READ
 
+  var alreadyReading = false
+
   override def registerInterest() {
     // Registering read too - does not really help in most cases, but for some
     // it does - so let us keep it for now.
-    changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
+    changeConnectionKeyInterest(
+      SelectionKey.OP_WRITE | (if (!alreadyReading) {
+        alreadyReading = true
+        DEFAULT_INTEREST
+      } else { 0 }))
--- End diff --

Yes, we use non-blocking IO, and I had a misunderstanding. SocketChannel#read blocks other threads that try to call SocketChannel#read on the same instance, so unregistering OP_READ is wrong.
So we should resolve the race condition another way, because a thread registered for OP_READ can call SocketChannel#close in SendingConnection#read while a thread registered for OP_WRITE calls SocketChannel#write in SendingConnection#write.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16632326
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -118,14 +118,33 @@ abstract class Connection(val channel: SocketChannel, val selector: Selector,
   }
 
   def close() {
-    closed = true
-    val k = key()
-    if (k != null) {
-      k.cancel()
+    synchronized {
+      /**
+       * We should avoid executing the closing sequence more than once.
+       * Otherwise we can fail in connectionsById.get() in
+       * ConnectionManager#removeConnection() the 2nd time.
+       */
+      if (!closed) {
+        disposeSasl()
+
+        /**
+         * callOnCloseCallback() should be invoked
+         * before k.cancel() and channel.close(),
+         * so that key() does not return null.
+         * If key() returns null before callOnCloseCallback(),
+         * we cannot remove the entry from connectionsByKey in ConnectionManager
+         * and end up with a CancelledKeyException.
+         */
+        callOnCloseCallback()
+        val k = key()
+        if (k != null) {
+          k.cancel()
+        }
+        channel.close()
+        closed = true
+      }
--- End diff --

O.K., so Connection#close just marks the connection as closed, and a failure during closing is not recovered, right?
If so, using AtomicBoolean is reasonable.





[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...

2014-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/2019#discussion_r16632362
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -263,14 +282,20 @@ class SendingConnection(val address: 
InetSocketAddress, selector_ : Selector,
 
   val DEFAULT_INTEREST = SelectionKey.OP_READ
 
+  var alreadyReading = false
+
   override def registerInterest() {
 // Registering read too - does not really help in most cases, but for 
some
 // it does - so let us keep it for now.
-changeConnectionKeyInterest(SelectionKey.OP_WRITE | DEFAULT_INTEREST)
+changeConnectionKeyInterest(
+  SelectionKey.OP_WRITE | (if (!alreadyReading) {
+alreadyReading = true
+DEFAULT_INTEREST
+  } else { 0 }))
--- End diff --

Please keep the following in mind while trying to find a solution:

1) All invocations of register for a write connection will have OP_READ set (so there won't be a case where OP_READ is not set).
OP_WRITE may or may not be set, based on whether we have outstanding data to write or not.
This is to ensure the tcp stack alerts us in case a remote close is detected (via keep-alive, etc). (See the sketch after this comment.)

2) Only a single thread per socket will process it at a given point of time, we ensure this: and marking for re-registration happens within this (not the actual registration - that always happens in the selector thread).

So we won't have the case of conflicting re-registration requests: we ensure this.
At worst, we can have:
a) OP_READ (because we finished a write), wakeup selector
b) before the selector thread woke up, we want to re-register with OP_WRITE | OP_READ again (since some other thread wanted to write data).
We process registration requests in order - and so (b) will take precedence over (a).

We handle the reverse case, of some thread wanting to write while a write is going on and finishing fully (resulting in (a)), by use of resetForceReregister.
This code path is complicated since it handles a lot of edge cases.

3) No thread calls register on the selector - only the selector thread can (not ensuring this actually causes deadlocks): hence why we have registration request queues for new and existing sockets.

4) A close can happen because of an explicit close by Spark, a close due to socket errors on our own side, a close due to network issues, or a close by the remote side.
There is only so much we can do to distinguish these.
We detect a remote close via (1) (note, it is not guaranteed to be reported immediately - it can sometimes take a prolonged time), and a local close is handled gracefully anyway.

Given all this, I am not sure what the MT issues seen are or what causes them; it can be quite involved at times. The one main issue I do see is that repeated invocations of close (and there can be repeated invocations, as you rightly pointed out) attempt to clean up the state repeatedly.
This is incorrect - it should do so once and only once; repeated invocations are legal, but the actual close implementation code should be executed only once.
Of course, exceptions while executing it are fine and unrecoverable, and we have to live with them (as when socket/stream.close throws an exception).

To alleviate this, I proposed the AtomicBoolean change.
I might obviously be missing other things since it has been a while since I looked at these classes, so a fresh pair of eyes is definitely welcome!
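
Point (1) compresses to something like this hypothetical helper for the sending side (a sketch of the invariant, not the actual registerInterest code):

```
import java.nio.channels.SelectionKey

// OP_READ stays set so the tcp stack can tell us about a remote close;
// OP_WRITE is set only while there is queued outgoing data.
def sendingInterestOps(hasPendingWrites: Boolean): Int =
  SelectionKey.OP_READ | (if (hasPendingWrites) SelectionKey.OP_WRITE else 0)
```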





[GitHub] spark pull request: [WIP][SPARK-2554][SQL] CountDistinct and SumDi...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1935#issuecomment-53163399
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19111/consoleFull)
 for   PR 1935 at commit 
[`5c7848d`](https://github.com/apache/spark/commit/5c7848d52070b639e23088972eb2a8316cddc54f).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

2014-08-23 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/1360#issuecomment-53163415
  
I'm not sure I'm following, @mridulm.  The problem is not one of removing 
Executors, but rather of removing Applications that could and should still be 
left running even though some (but not all) Executors assigned to an 
Application are dying.





[GitHub] spark pull request: [SQL] Make functionRegistry in HiveContext tra...

2014-08-23 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/2074#issuecomment-53163432
  
Thanks! Merged to master and 1.1.





[GitHub] spark pull request: [SQL] Make functionRegistry in HiveContext tra...

2014-08-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2074





[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

2014-08-23 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1360#issuecomment-53163562
  
@markhamstra In our cluster, this usually happens due to one or more executors being in a bad state: either due to insufficient disk for finishing a task, or because the executor is in the process of cleaning up and exiting.
When a task fails, it usually gets re-assigned to the executor where it just failed, due to locality match. This repeated reschedule-and-fail loop on a failing executor causes the application to fail, since it hits the maximum number of failed tasks for the application, or the maximum number of task failures for a specific task (iirc there are two params).

We alleviate this by setting the blacklist timeout to a non-trivial, appropriate value: this prevents the rapid reschedule of a failing task on the same executor (and usually some other executor picks up the task - the timeout is chosen so that this is possible).
If the executor is healthy but can't execute this specific task, then blacklisting works fine.
If the executor is unhealthy and going to exit, then we will still have rapid task failures until the executor notifies the master when it exits - but the failure count per task is not hit (iirc the number of failed tasks for the app is much higher than the number of failed attempts per task).

Of course, not sure if this is completely applicable in this case.





[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

2014-08-23 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/1360#issuecomment-53164183
  
@mridulm Is this blacklisting behavior a customization that you have made 
to Spark?  If not, could you point me to where and how it is implemented?

What you are describing seems to be orthogonal and probably complementary 
to this PR: Yours, a means to prevent rescheduling of a task on an Executor 
where it cannot run successfully vs. this one, a means to prevent the killing 
of a running Application when some Executors die but others are still running 
the Application successfully.  Sounds to me like we want both of those means.





[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

2014-08-23 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1360#issuecomment-53164487
  
Take a look at 'spark.scheduler.executorTaskBlacklistTime' in TaskSetManager.
Since I run mostly in yarn-cluster mode, where there is only a single application, I was not sure how relevant blacklisting actually was in your case! (multiple apps via standalone, I guess?)

Note that we actually need a third case, which is not yet handled: slow executors/stragglers - particularly for low-latency stages, they really kill execution times for some of our ML jobs (a 50x speedup becomes much, much lower due to these).
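
For anyone wanting to try it, a minimal sketch of setting that property (the 10-second value is an arbitrary illustration, not a recommendation):

```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("blacklist-example")
  // After a task fails on an executor, do not reschedule that task on the
  // same executor for this many milliseconds.
  .set("spark.scheduler.executorTaskBlacklistTime", "10000")
val sc = new SparkContext(conf)
```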





[GitHub] spark pull request: [WIP][SPARK-2554][SQL] CountDistinct and SumDi...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1935#issuecomment-53165766
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19111/consoleFull)
 for   PR 1935 at commit 
[`5c7848d`](https://github.com/apache/spark/commit/5c7848d52070b639e23088972eb2a8316cddc54f).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class JoinedRow2 extends Row `
  * `class JoinedRow3 extends Row `
  * `class JoinedRow4 extends Row `
  * `class JoinedRow5 extends Row `
  * `class GenericRow(protected[sql] val values: Array[Any]) extends Row `
  * `abstract class MutableValue extends Serializable `
  * `final class MutableInt extends MutableValue `
  * `final class MutableFloat extends MutableValue `
  * `final class MutableBoolean extends MutableValue `
  * `final class MutableDouble extends MutableValue `
  * `final class MutableShort extends MutableValue `
  * `final class MutableLong extends MutableValue `
  * `final class MutableByte extends MutableValue `
  * `final class MutableAny extends MutableValue `
  * `final class SpecificMutableRow(val values: Array[MutableValue]) 
extends MutableRow `
  * `case class CountDistinct(expressions: Seq[Expression]) extends 
PartialAggregate `
  * `case class CollectHashSet(expressions: Seq[Expression]) extends 
AggregateExpression `
  * `case class CollectHashSetFunction(`
  * `case class CombineSetsAndCount(inputSet: Expression) extends 
AggregateExpression `
  * `case class CombineSetsAndCountFunction(`
  * `case class CountDistinctFunction(`
  * `case class MaxOf(left: Expression, right: Expression) extends 
Expression `
  * `case class NewSet(elementType: DataType) extends LeafExpression `
  * `case class AddItemToSet(item: Expression, set: Expression) extends 
Expression `
  * `case class CombineSets(left: Expression, right: Expression) extends 
BinaryExpression `
  * `case class CountSet(child: Expression) extends UnaryExpression `
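
For context, a minimal sketch of the specialized mutable-value pattern 
these class names suggest; the field names and methods here are 
assumptions for illustration, not Spark's actual implementation:

```scala
// Assumed null-tracking flag; each field gets a primitive holder,
// so updating a row in place avoids boxing.
abstract class MutableValue extends Serializable {
  var isNull: Boolean = true
}

final class MutableInt extends MutableValue {
  var value: Int = 0
}

// Hypothetical row wrapper over specialized holders; setters mutate in place.
class SpecificMutableRowSketch(val values: Array[MutableValue]) {
  def setInt(ordinal: Int, v: Int): Unit = {
    val holder = values(ordinal).asInstanceOf[MutableInt]
    holder.value = v
    holder.isNull = false
  }
  def getInt(ordinal: Int): Int =
    values(ordinal).asInstanceOf[MutableInt].value
}
```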






[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2094#issuecomment-53167601
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19112/consoleFull)
 for   PR 2094 at commit 
[`ad7e374`](https://github.com/apache/spark/commit/ad7e374bd834d1e789ff95bba09f0c87ba67c4fd).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2094#issuecomment-53167763
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19113/consoleFull)
 for   PR 2094 at commit 
[`ccbaf25`](https://github.com/apache/spark/commit/ccbaf25ce6d601bcbc7cb6081128c2b4236925ad).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2094#issuecomment-53169261
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19113/consoleFull)
 for   PR 2094 at commit 
[`ccbaf25`](https://github.com/apache/spark/commit/ccbaf25ce6d601bcbc7cb6081128c2b4236925ad).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2871] [PySpark] add `comp` argument for...

2014-08-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2094#issuecomment-53169370
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19112/consoleFull)
 for   PR 2094 at commit 
[`ad7e374`](https://github.com/apache/spark/commit/ad7e374bd834d1e789ff95bba09f0c87ba67c4fd).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: Update building-with-maven.md

2014-08-23 Thread loachli
Github user loachli commented on the pull request:

https://github.com/apache/spark/pull/2102#issuecomment-53170070
  
hey @pwendell, thanks for your comments - Yes, my proxy supports https.

I had used an open, no-proxy environment before. In order to let more 
people use Spark, I had to move my Spark environment into my company's 
internal network. For security reasons, I have to use the http proxy 
provided by my company to access the network.
When I used Spark in my company's internal environment, I could not 
compile Spark successfully.
Because Maven's error message was not obvious, I spent a lot of time 
solving this problem.

You can find the definition of these two parameters at 
http://maven.apache.org/wagon/wagon-providers/wagon-http/:
"maven.wagon.http.ssl.insecure = true/false (default false), 
enable/disable use of relaxed ssl check for user generated certificates.
maven.wagon.http.ssl.allowall = true/false (default false), 
enable/disable match of the server's X.509 certificate with hostname. If 
disabled, a browser like check will be used."

I also found that someone else had met this issue 
(https://issues.apache.org/jira/browse/SPARK-1125), so I believe others 
will hit it in the future.
I still think we could add this hint to the document. One option is that 
I could add a risk warning about using these parameters. Do you agree?
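
For illustration, a sketch of how these flags can be passed when building; 
the proxy host and port are placeholders, the proxy itself is normally 
configured in ~/.m2/settings.xml, and the JVM proxy properties may 
otherwise need to go through MAVEN_OPTS:

```
export MAVEN_OPTS="-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 \
  -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080"
mvn -Dmaven.wagon.http.ssl.insecure=true \
    -Dmaven.wagon.http.ssl.allowall=true \
    -DskipTests clean package
```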










